Saturday, December 23, 2017

We were discussing the Bootstrap method, confidence intervals, and the accuracy of model parameters, especially in the context of linearizing non-linear models.
The goodness of fit and the residuals plot are useful indicators along with the error function. Each gives helpful information.
A high correlation means the fit is good, but for the fit to be accurate it must also produce a specific distribution of residuals. The residuals should be spread along the independent axis, normally distributed around zero, and free of systematic trends. The latter condition makes the residual, the difference between a data point and its estimate, more acceptable.
I want to mention that curve fitting improves with higher degrees, such as a quadratic over a linear fit, but this does not mean we go to as high a degree as possible. The model we build tries to minimize the residuals. If we can do this with a lower degree, that is acceptable.
When we compare two models, we follow certain rules. For example, the number of data points must be larger than the number of parameters. Increasing the number of parameters will improve the goodness of fit and lower the chi-square, so the model with more parameters always seems to do better. To counter this, we use a metric called the mean squared error, which is the chi-square divided by the degrees of freedom. Since adding parameters lowers the chi-square in the numerator but also lowers the degrees of freedom in the denominator, the MSE represents a tradeoff between an over-parameterized model and a lower chi-square. A model that achieves a low chi-square with fewer parameters is preferred.
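To make the comparison concrete, here is a minimal C# sketch of the metric; the residuals, measurement errors, and parameter count are assumed inputs for illustration and are not from the original discussion:

// minimal sketch: reduced chi-square (chi-square per degree of freedom);
// when comparing two fits, prefer the model with the lower value
double MeanSquaredError(double[] residuals, double[] sigma, int numParameters)
{
    int degreesOfFreedom = residuals.Length - numParameters; // N - m
    double chiSquare = 0.0;
    for (int i = 0; i < residuals.Length; i++)
        chiSquare += (residuals[i] / sigma[i]) * (residuals[i] / sigma[i]);
    return chiSquare / degreesOfFreedom;
}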

Friday, December 22, 2017

We were discussing the Bootstrap method, confidence intervals, and the accuracy of model parameters, especially in the context of linearizing non-linear models.
The goodness of fit and the residuals plot are useful indicators along with the error function. Each gives helpful information.
A high correlation means the fit is good, but for the fit to be accurate it must also produce a specific distribution of residuals. The residuals should be spread along the independent axis, normally distributed around zero, and free of systematic trends. The latter condition makes the residual, the difference between a data point and its estimate, more acceptable.
I want to mention that curve fitting improves with higher degrees, such as a quadratic over a linear fit, but this does not mean we go to as high a degree as possible. The model we build tries to minimize the residuals. If we can do this with a lower degree, that is acceptable.
#codingexercise
Print Hosoya's triangle
Hosoya's triangle is a triangular arrangement of numbers based on the Fibonacci numbers, where each entry is the sum of the two numbers above it along either the left diagonal or the right diagonal.
The recurrence relation is
H(0,0) = H(1,0) = H(1,1) = H(2,1) = 1
and
H(n,j) = H(n-1,j) + H(n-2,j)
       = H(n-1,j-1) + H(n-2,j-2)
and each entry is a product of Fibonacci numbers: H(n,i) = F(i+1) x F(n-i+1)

int GetHosoya(int r, int c)
{
    // base cases: the topmost entries of the triangle are all 1
    if (r == 0 && c == 0) return 1;
    if (r == 1 && c == 0) return 1;
    if (r == 1 && c == 1) return 1;
    if (r == 2 && c == 1) return 1;
    // interior entries: sum along the left diagonal
    if (r > c)
        return GetHosoya(r - 1, c) + GetHosoya(r - 2, c);
    // entries on the right edge: sum along the right diagonal
    if (r == c)
        return GetHosoya(r - 1, c - 1) + GetHosoya(r - 2, c - 2);
    return 0; // outside the triangle
}
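
Since the exercise asks to print the triangle, a simple driver could walk the rows and columns; this wrapper is a sketch I am adding, not code from the original post:

void PrintHosoya(int rows)
{
    for (int r = 0; r < rows; r++)
    {
        // row r of the triangle has r + 1 entries
        for (int c = 0; c <= r; c++)
            Console.Write(GetHosoya(r, c) + " ");
        Console.WriteLine();
    }
}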

Thursday, December 21, 2017

We were discussing the Bootstrap method, confidence intervals, and the accuracy of model parameters, especially in the context of linearizing non-linear models.
A model may have parameters that correspond to M dimensions. Since each of these dimensions allows variation, the probability distribution is a function defined on the M-dimensional parameter space. With this probability distribution, we choose a region that contains a high percentage of the total distribution relative to the selected model parameters. This region is called the confidence interval.
We were saying that a model that is made linear looks easy to interpret, but it is not accurate, especially when it comes from transforming a non-linear model. We will talk more about this process, but first let us consider the M-dimensional space.
Usually non-linear models are solved with an objective function such as the chi-square error function. The gradients of the chi-square function with respect to the parameters must approach zero at the minimum of the objective function. In the steepest descent method, the minimization proceeds iteratively until the error function stops decreasing.
The error function depends on the model parameters. Over the M dimensions it can be considered a surface on which we seek a minimum. If the dimensions are large in number, the model may be complex and the surface may have more than one minimum. The optimal minimization will strive to find the true global minimum of the error surface and not just a local minimum. In order to converge to the global minimum, one technique is to vary the starting points, that is, the initial guesses, and compare the resulting model parameters.
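As a rough illustration of that multi-start idea, here is a minimal steepest-descent sketch in C#; the one-dimensional error function, step size, iteration limit, and starting points are all assumptions for the example, not details from the original text:

// sketch: steepest descent restarted from several initial guesses
double Minimize(Func<double, double> error, double start)
{
    double x = start;
    double step = 0.01;
    for (int i = 0; i < 10000; i++)
    {
        double h = 1e-6;
        double gradient = (error(x + h) - error(x - h)) / (2 * h); // numeric slope
        double next = x - step * gradient;                         // descend
        if (error(next) >= error(x)) break; // stop when no longer decreasing
        x = next;
    }
    return x;
}

double MultiStartMinimize(Func<double, double> error, double[] starts)
{
    // compare the minima reached from different starting points and
    // keep the best one as the candidate global minimum
    return starts.Select(s => Minimize(error, s)).OrderBy(m => error(m)).First();
}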
The goodness of fit and the residuals plot are useful indicators along with the error function. Each gives helpful information. A model may differ systematically from the data and yet the correlation between the model and the data may still be high. This correlation will mean that the fit is good, but for the fit to be accurate it must also result in a specific distribution of residuals. These residuals should be spread along the independent axis and normally distributed around zero with no systematic trends. The latter condition makes the residual, the difference between a data point and the estimate, more acceptable.
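To check these conditions numerically, one could compute the residuals and look at their mean and their correlation with the independent variable; this is a sketch of the idea, with the model supplied as an assumed delegate:

// sketch: the residuals should average near zero and show no trend in x
void CheckResiduals(double[] x, double[] y, Func<double, double> model)
{
    var residuals = x.Select((xi, i) => y[i] - model(xi)).ToArray();
    double mean = residuals.Average();  // should be close to zero
    double mx = x.Average();
    double cov = x.Zip(residuals, (a, r) => (a - mx) * (r - mean)).Sum();
    double sx = Math.Sqrt(x.Sum(a => (a - mx) * (a - mx)));
    double sr = Math.Sqrt(residuals.Sum(r => (r - mean) * (r - mean)));
    double trend = cov / (sx * sr);     // correlation with x; near zero is good
    Console.WriteLine($"mean = {mean}, trend = {trend}");
}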

Find the maximum product of an increasing subsequence
double GetProductIncreasingSubsequence(List<double> A, int n)
{
    Debug.Assert(A.All(x => x >= 0));
    // products[i] holds the maximum product of an increasing subsequence ending at A[i]
    var products = new List<double>();
    for (int i = 0; i < n; i++)
        products.Add(A[i]);
    Debug.Assert(products.Count == A.Count);

    for (int i = 1; i < n; i++)
        for (int j = 0; j < i; j++)
            if (A[i] > A[j] &&
               products[i] < (products[j] * A[i]))
                products[i] = products[j] * A[i];

    return products.Max();
}
For example:
A: 2 0 4
P: 2 0 8
The maximum product is 8.

Wednesday, December 20, 2017

We were discussing the Bootstrap method, confidence intervals, and the accuracy of model parameters. A model may have parameters that correspond to M dimensions. Since each of these dimensions allows variation, the probability distribution is a function defined on the M-dimensional parameter space. With this probability distribution, we choose a region that contains a high percentage of the total distribution relative to the selected model parameters. This region is called the confidence interval. Depending on the fraction of the distribution chosen, confidence intervals may be stated as levels, in percentages. The shape of the region may also be stated, for example as ellipsoids.
As we work with model parameters, some rules of thumb come into view. For example, if we define a progressive rate constant, it cannot turn out to be negative. Such facts should not be violated.
It should be noted that the confidence interval is tied to the model. If the model is transformed from non-linear to linear, the confidence interval changes. Such a transformation is often cited as an oversimplification because it violates the assumptions behind the interval, and this leads to more error in the model parameter values.


We were saying that a model based on linearization looks easy to interpret, but it is not accurate, especially when it comes from transforming a non-linear model.
Usually non-linear models are solved with an objective function such as the chi-square error function. The gradients of the chi-square function with respect to the parameters must approach zero at the minimum of the objective function. In the steepest descent method, the minimization proceeds iteratively until the error function stops decreasing.

#codingexercise 
Find the square root of a number
// Babylonian (Heron's) method: repeatedly average an over-estimate x
// and an under-estimate y = n / x; assumes n >= 1
double Sqrt(double n)
{
    double x = n;
    double y = 1;
    double threshold = 1E-5;
    while (x - y > threshold)
    {
        x = (x + y) / 2;
        y = n / x;
    }
    return x;
}

Tuesday, December 19, 2017

Today we are resuming the discussion on model fitting and error estimation by Kleinstein and Hershberg.
If we are not inclined towards analytical error estimation, we can attempt the Bootstrap method. This method uses the actual data set, with its N points, to generate synthetic data sets with just as many points. Each synthetic set differs from the original in that a fraction of the original points are replaced with duplicated originals. Since the order of data points does not matter, the estimation can take place with actual measurement noise. We can use the chi-square merit function for this purpose.
Next we review confidence intervals and the accuracy of model parameters. A model may have parameters that correspond to M dimensions. Since each of these dimensions allows variation, the probability distribution is a function defined on the M-dimensional parameter space. With this probability distribution, we choose a region that contains a high percentage of the total distribution relative to the selected model parameters. This region is called the confidence interval. Depending on the fraction of the distribution chosen, confidence intervals may be stated as levels, in percentages. The shape of the region may also be stated, for example as ellipsoids. Generally we pick a region that is reasonably compact and centered around the point of reference in the parameter space. This region or band of data can be described as y = prctile(x, [5,95]), the band between the 5th and 95th percentiles.
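As an illustration of that percentile band, here is a small C# sketch that bootstraps a statistic and reads off the 5th and 95th percentiles; the statistic delegate and the resampling count are assumptions for the example:

// sketch: bootstrap a statistic and take the 5th-95th percentile band
double[] BootstrapInterval(double[] data, Func<double[], double> statistic,
                           int resamples = 1000)
{
    var rand = new Random();
    var estimates = new List<double>();
    for (int i = 0; i < resamples; i++)
    {
        // sample N points with replacement; some originals get duplicated
        var synthetic = new double[data.Length];
        for (int j = 0; j < data.Length; j++)
            synthetic[j] = data[rand.Next(data.Length)];
        estimates.Add(statistic(synthetic));
    }
    estimates.Sort();
    return new[] { estimates[(int)(0.05 * resamples)],
                   estimates[(int)(0.95 * resamples)] };
}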
As we work with model parameters, some rules of thumb come into view. For example, if we define a progressive rate constant, it cannot turn out to be negative. Similarly, Poisson's ratio cannot exceed 0.5. This is the ratio of the proportional decrease in width to the proportional increase in length as a material is stretched. Other examples might put both an upper and a lower bound on the parameters. Such facts should not be violated.
#codingexercise 
We talked about finding the closest elements to a given value across three arrays. We introduced the notion of projections to find non-overlapping regions. We also mentioned that projections can be initiated from any sentinel values to split the ranges in the three arrays. The sentinel values we choose do not depend only on the start and the end of an array. Given any range of interest, it can be projected onto all three arrays, splitting each array into subarrays that either fall outside or overlap the given range. This is very useful for shrinking the indices when performing computations related to a given range.

Also, note that range division is not the only benefit. We can also approximate computations in all three arrays by performing a similar operation within a projection.
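One way to realize such a projection on a single sorted array is shown in the sketch below; the helper name and tuple return are mine, not from the original post, and a linear scan is used for clarity where a binary search would do the same on sorted input:

// sketch: project a range of interest [low, high] onto one sorted array,
// returning the index range of elements that overlap it
(int start, int end) Project(List<int> A, int low, int high)
{
    int start = A.FindIndex(x => x >= low);     // first element inside the range
    int end = A.FindLastIndex(x => x <= high);  // last element inside the range
    return (start, end); // start == -1 or start > end means no overlap
}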

Monday, December 18, 2017

Today we are resuming the discussion on model fitting and error estimation by Kleinstein and Hershberg. There are two ways by which we can select the appropriate model. The first is by observing trend lines that correspond to some well-known mathematical formula. The second is by observing the underlying physical processes that contribute to the system. These physical interpretations contribute to the model parameters.
In order to fit the data points, a model may use least squares of errors. The errors, called residuals, may be positive or negative, so summing them gives an inaccurate measure of the misfit. Instead, the squares of the errors are minimized to give a better fit.
We used least squares error minimization to fit the data points. Another way to do this is with maximum likelihood estimation. This method asks: "Given my set of model parameters, what is the probability that this data set occurred?" This translates into a likelihood for the parameters given the data.
The chi-square error measure and maximum likelihood estimation are related. For a Gaussian distribution, maximizing the probability of the data set given the model parameters amounts to minimizing the negative natural log of the probability, which is the chi-square function of weighted residuals. Furthermore, if the variance is uniform, the chi-square function reduces to the sum of squared residuals.
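In symbols, assuming independent Gaussian errors with standard deviations \sigma_i (a standard textbook form rather than a quote from the post), the relation is:

P(D \mid a) \propto \prod_{i=1}^{N} \exp\left( -\frac{(y_i - y(x_i; a))^2}{2\sigma_i^2} \right)

-\ln P = \frac{1}{2} \sum_{i=1}^{N} \left( \frac{y_i - y(x_i; a)}{\sigma_i} \right)^2 + \text{const} = \frac{\chi^2}{2} + \text{const}

so minimizing -\ln P is the same as minimizing \chi^2, and with uniform \sigma it reduces to minimizing the sum of squared residuals.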
Linear regression analysis strikes a relationship between a dependent variable and an independent variable. The best values of slope and intercept are found by taking partial derivatives of the error function with respect to each of them and setting these to zero. Regression parameters are different from the correlation coefficient. The latter describes the strength of association between two variables and is therefore symmetric in the pair of variables, unlike the regression parameters.
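Setting those partial derivatives to zero gives the familiar closed-form estimates; here is a minimal C# sketch of that computation (the helper name and tuple return are mine, not from the post):

// sketch: least-squares slope and intercept obtained by setting the partial
// derivatives of the summed squared error to zero
(double slope, double intercept) FitLine(double[] x, double[] y)
{
    double mx = x.Average();
    double my = y.Average();
    double sxy = x.Zip(y, (a, b) => (a - mx) * (b - my)).Sum();
    double sxx = x.Sum(a => (a - mx) * (a - mx));
    double slope = sxy / sxx;
    double intercept = my - slope * mx;
    return (slope, intercept);
}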
If we are not inclined towards analytical error estimation, we can attempt the Bootstrap method. This method uses the actual data set, with its N points, to generate synthetic data sets with just as many points. Each synthetic set differs from the original in that a fraction of the original points are replaced with duplicated originals. Since the order of data points does not matter, the estimation can take place with actual measurement noise.
#codingexercise
We talked about finding the closest elements to a given value across three arrays. We introduced the notion of projections to find non-overlapping regions. In the same spirit, we continue with the observation that projections can be initiated from any sentinel values.

Sunday, December 17, 2017

#codingexercise
We are given three sorted arrays and we want to find one element from each array such that they are closest to each other. One of the ways to do this was explained as follows: we traverse all three arrays while keeping track of the smallest range (maximum minus minimum) encountered among the current candidate set of three elements. We traverse by choosing one element at a time, incrementing the index of the array holding the minimum value.
By advancing only the minimum element, we make sure the sweep is progressive and exhaustive.
List<int> GetClosest(List<int> A, List<int> B, List<int> C)
{
    var ret = new List<int>();
    int i = 0;
    int j = 0;
    int k = 0;
    int diff = int.MaxValue;
    while (i < A.Count && j < B.Count && k < C.Count)
    {
        var candidates = new List<int>() { A[i], B[j], C[k] };
        int range = candidates.Max() - candidates.Min();
        if (range < diff)
        {
            diff = range;
            ret = candidates.ToList();
        }
        if (range == 0) return ret; // cannot do better than identical elements
        // advance the array holding the minimum to keep the sweep progressive
        if (candidates.Min() == A[i])
        {
            i++;
        } else if (candidates.Min() == B[j])
        {
            j++;
        } else {
            k++;
        }
    }
    return ret;
}
We don't have to sweep for the average because we can enumerate all the elements of the three arrays to find the global average. Then we find the element closest to it in each of the arrays.
List<int> GetAveragesFromThreeSortedArrays(List<int> A, List<int> B, List<int> C)
{
    // Concat keeps duplicates so the average is taken over all elements
    var combined = A.Concat(B).Concat(C).ToList();
    int average = (int)combined.Average();
    return GetClosestToGiven(A, B, C, average);
}
List<int> GetClosestToGiven(List<int> A, List<int> B, List<int> C, int value)
{
    Debug.Assert(A.Count > 0 && B.Count > 0 && C.Count > 0);
    var ret = new List<int>();
    ret.Add(GetClosest(A, value)); // using binary search
    ret.Add(GetClosest(B, value));
    ret.Add(GetClosest(C, value));
    return ret;
}
Here we simply do a binary search on each sorted list to find the element closest to the given value. The closest element could lie on either side of the given value in the sorted sequence.
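The single-array helper referenced above is not shown in the post; a sketch of it could look like this:

// sketch: binary search for the element of a sorted list closest to value
int GetClosest(List<int> A, int value)
{
    int lo = 0;
    int hi = A.Count - 1;
    while (lo < hi)
    {
        int mid = (lo + hi) / 2;
        if (A[mid] < value)
            lo = mid + 1;
        else
            hi = mid;
    }
    // A[lo] is the first element >= value; its predecessor may be closer
    if (lo > 0 && value - A[lo - 1] <= A[lo] - value)
        return A[lo - 1];
    return A[lo];
}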