Thursday, December 21, 2017

We were discussing the Bootstrap method, confidence intervals and the accuracy of model parameters, especially in the context of linearizing non-linear models.
A model may have parameters that correspond to M dimensions. Since each of these dimensions allows variation, the probability distribution is a function defined on the M-dimensional parameter space. From this probability distribution, we choose a region that contains a high percentage of the total distribution around the selected model parameters. This region is called the confidence interval.
We were saying that a model that is made linear looks easy to interpret, but it is not accurate, especially when it is obtained by transforming a non-linear model. We will talk more about this process, but first let us consider the M-dimensional space.
Usually non-linear models are solved with an objective function such as the chi-square error function. The gradients of the chi-square function with respect to the parameters must approach zero at the minimum of the objective function. In the steepest descent method, the minimization proceeds iteratively until the objective stops decreasing.
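To make the iteration concrete, here is a minimal sketch of steepest descent on a chi-square objective for a straight-line model y = a*x + b. The fixed step size, the numerical gradients and the stopping tolerance are illustrative assumptions, not anything prescribed by the text.
double[] FitLineBySteepestDescent(double[] x, double[] y, double[] sigma)
{
    // chi-square: sum of squared residuals weighted by the measurement errors
    Func<double, double, double> chiSquare = (a, b) =>
    {
        double sum = 0;
        for (int i = 0; i < x.Length; i++)
        {
            double r = (y[i] - (a * x[i] + b)) / sigma[i];
            sum += r * r;
        }
        return sum;
    };

    double aEst = 0, bEst = 0, step = 1e-3, h = 1e-6;
    double previous = chiSquare(aEst, bEst);
    for (int iter = 0; iter < 100000; iter++)
    {
        // gradients of chi-square with respect to the parameters (forward differences)
        double da = (chiSquare(aEst + h, bEst) - previous) / h;
        double db = (chiSquare(aEst, bEst + h) - previous) / h;
        aEst -= step * da;                    // step along the negative gradient
        bEst -= step * db;
        double current = chiSquare(aEst, bEst);
        if (previous - current < 1e-9) break; // stop once the objective no longer decreases
        previous = current;
    }
    return new double[] { aEst, bEst };
}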
The error function depends on the model parameters, so it can be considered a surface over the M-dimensional parameter space of which we seek a minimum. If the number of dimensions is large, the model may be complex and the surface may have more than one minimum. The optimization should strive to find the true global minimum of the error surface and not just a local minimum. One technique to improve the chances of converging to the global minimum is to vary the starting points, or initial guesses, and compare the resulting model parameters.
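One way to apply this, sketched below with hypothetical minimize and chiSquare delegates standing in for whatever minimizer and error function are actually in use, is to run the same minimization from several starting points and keep the parameters with the smallest error.
double[] MultiStartFit(List<double[]> startingPoints,
                       Func<double[], double[]> minimize,
                       Func<double[], double> chiSquare)
{
    double[] best = null;
    double bestError = double.MaxValue;
    foreach (var start in startingPoints)
    {
        var candidate = minimize(start);     // run the minimizer from this initial guess
        double error = chiSquare(candidate);
        if (error < bestError)               // keep the fit with the lowest error so far
        {
            bestError = error;
            best = candidate;
        }
    }
    return best;
}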
The goodness of fit and the residuals plot are useful indicators along with the error function, and each gives helpful information. A model may be inaccurate even while it tracks the data closely; in that case the correlation between the model and the data will be high. A high correlation makes the fit look good, but for the fit to be trusted it must also result in a specific distribution of residuals: they should be spread along the independent axis and normally distributed around zero with no systematic trends. The latter condition makes the difference between a data point and its estimate, also called the residual, acceptable.
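A quick way to inspect this is to compute the residuals against the fitted model and check that their mean is near zero and that they show no trend along the independent axis; a small sketch, with the model passed in as a delegate:
List<double> GetResiduals(double[] x, double[] y, Func<double, double> model)
{
    var residuals = new List<double>();
    for (int i = 0; i < x.Length; i++)
        residuals.Add(y[i] - model(x[i]));   // data point minus model estimate
    // residuals.Average() should be close to zero, and a plot of the residuals
    // against x should show no systematic trend
    return residuals;
}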

Find the maximum product of an increasing subsequence
double GetProductIncreasingSubsequence(List<double> A, int n)
{
    Debug.Assert(A.All(x => x >= 0));
    // products[i] holds the best product over increasing subsequences ending at A[i]
    var products = new List<double>();
    for (int i = 0; i < n; i++)
        products.Add(A[i]);
    Debug.Assert(products.Count == A.Count);

    for (int i = 1; i < n; i++)
        for (int j = 0; j < i; j++)
            if (A[i] > A[j] &&
                products[i] < (products[j] * A[i]))
                products[i] = products[j] * A[i];

    return products.Max();
}
A: 2 0 4

P: 2 0 8

Here the best product, 8, comes from the increasing subsequence 2, 4.

Wednesday, December 20, 2017

We were discussing the Bootstrap method, confidence intervals and the accuracy of model parameters. A model may have parameters that correspond to M dimensions. Since each of these dimensions allows variation, the probability distribution is a function defined on the M-dimensional parameter space. From this probability distribution, we choose a region that contains a high percentage of the total distribution around the selected model parameters. This region is called the confidence interval. Depending on the fraction of the distribution chosen, confidence intervals may be quoted as levels in percentages, and the shape of the region may also be stated, such as an ellipsoid.
As we work with model parameters, some rules of thumb come into view. For example, if we define a progressive rate constant, it cannot turn out to be negative. These facts should not be violated.
It should be noted that the confidence interval is tied to the model. If the model is transformed from non-linear to linear, the confidence interval changes with it. Such linearization is often cited as an oversimplification because it violates the assumptions behind the error estimates, which leads to more error in the model parameter values.


We were saying that a model based on linearization is easy to work with, but it is not accurate, especially when it comes from transforming a non-linear model.
Usually non-linear models are solved with an objective function such as the chi-square error function. The gradients of the chi-square function with respect to the parameters must approach zero at the minimum of the objective function. In the steepest descent method, the minimization proceeds iteratively until the objective stops decreasing.

#codingexercise 
Find the square root of a number 
double sqrt(double n)
{
    double x = n;
    double y = 1;
    double threshold = 1E-5;
    while (Math.Abs(x - y) > threshold)
    {
        x = (x + y) / 2;   // average the two estimates
        y = n / x;         // and pull them toward each other
    }
    return x;
}
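This is the familiar Babylonian (Newton) iteration: x and y bracket the root and are pulled toward each other on every pass, so for example sqrt(49) settles at 7.0 within a handful of iterations. The Math.Abs in the loop condition matters for inputs between 0 and 1, where the root is larger than n and the initial difference is negative.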

Tuesday, December 19, 2017

Today we are resuming the discussion on model fitting and error estimation by Kleinstein and Hershberg.
If we are not inclined toward analytic error estimation, we can attempt the Bootstrap method. This method uses the actual data set with its N points to generate synthetic data sets with just as many points. A synthetic set differs from the original in that a fraction of the original points is replaced by duplicates of other original points, since the resampling is done with replacement. Since the order of data points does not matter, the estimation can take place with the actual measurement noise. We can use the chi-square merit function for this purpose.
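A minimal sketch of the resampling step; the refitting is left out since any merit function, such as chi-square, can be plugged in afterwards:
List<double> BootstrapSample(List<double> original, Random rng)
{
    // draw N points with replacement from the original N points; a fraction of
    // the original points ends up replaced by duplicates of the others
    var synthetic = new List<double>();
    for (int i = 0; i < original.Count; i++)
        synthetic.Add(original[rng.Next(original.Count)]);
    return synthetic;
}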
Next we review confidence intervals and the accuracy of model parameters. A model may have parameters that correspond to M dimensions. Since each of these dimensions allows variation, the probability distribution is a function defined on the M-dimensional parameter space. From this probability distribution, we choose a region that contains a high percentage of the total distribution around the selected model parameters. This region is called the confidence interval. Depending on the fraction of the distribution chosen, confidence intervals may be quoted as levels in percentages, and the shape of the region may also be stated, such as an ellipsoid. Generally we pick a region that is reasonably compact and centered on the point of reference in the parameter space. This region or band of data can be described as y = prctile(x, [5, 95]).
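The prctile call above is MATLAB notation; a rough C# equivalent using a simple nearest-rank definition of the percentile (an assumption, since the text does not pin one down) could look like this:
double Percentile(List<double> samples, double p)
{
    // nearest-rank percentile: sort the samples and read off the value at the
    // p percent position; Percentile(x, 5) and Percentile(x, 95) bound a 90% band
    var sorted = samples.OrderBy(v => v).ToList();
    int rank = (int)Math.Ceiling(p / 100.0 * sorted.Count) - 1;
    if (rank < 0) rank = 0;
    if (rank >= sorted.Count) rank = sorted.Count - 1;
    return sorted[rank];
}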
As we work with model parameters, some rules of thumb come into view. For example, if we define a progressive rate constant, it cannot turn out to be negative. Similarly, Poisson's ratio cannot exceed 0.5; this is the ratio of the proportional decrease in width to the proportional increase in length as a material is stretched. Other examples might put both an upper and a lower bound on the parameters. These facts should not be violated.
#codingexercise 
We talked about finding the closest elements to a given value across three arrays. We introduced the notion of projections to find non-overlapping regions. We also mentioned that projections can be initiated from any sentinel values to split the range in the three arrays, and the sentinel values we choose are not limited to the start and the end of an array. Given any range of interest, it can be projected onto all three arrays, splitting each array into the subarray that overlaps the given range and the non-overlapping subarrays on either side. This is very useful for shrinking the indices over which computations related to a given range are performed.

Also, note that range division is not the only benefit. We can also approximate computations in all three arrays by performing a similar operation within a projection.
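One way to realize such a projection, sketched here under the assumption that the arrays are sorted in ascending order, is to binary-search the endpoints of the range in each array and take the index span between them:
Tuple<int, int> Project(List<int> A, int low, int high)
{
    // returns the index range [start, end) of the elements of A that fall
    // inside [low, high]; the parts before start and after end do not overlap
    // the range of interest (with duplicate values the bounds are approximate)
    int start = A.BinarySearch(low);
    if (start < 0) start = ~start;               // first element >= low
    int end = A.BinarySearch(high);
    if (end < 0) end = ~end; else end = end + 1; // one past the last element <= high
    return Tuple.Create(start, end);
}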

Monday, December 18, 2017

Today we are resuming the discussion on model fitting and error estimation by Kleinstein and Hershberg. There are two ways by which we can select the appropriate model. The first is by observing trend lines that correspond to some well known mathematical formula. The second is based on observation of the underlying physical processes that contribute to the system. These physical interpretations contribute to the model parameters.
In order to fit the data points, a model may use the least squares of the errors. The errors, called residuals, may be either positive or negative, so simply summing them gives a misleading measure. Instead, the squares of the errors are summed and minimized to give a better fit.
We used least squares error minimization to fit the data points. Another way to do this is maximum likelihood estimation. This method asks: "Given my set of model parameters, what is the probability that this data set occurred?" This is read as the likelihood of the parameters given the data.
The chi-square error measure and maximum likelihood estimation are related. For a Gaussian distribution, maximizing the probability of the data set given the model parameters amounts to minimizing the negative natural log of that probability, which is the chi-square function of the weighted residuals. Furthermore, if the variance is uniform, the chi-square function reduces to the sum of squared residuals.
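Put differently, with Gaussian errors the quantity being minimized is the sum of squared, weighted residuals; a small sketch of that sum, with the model passed in as a delegate:
double ChiSquare(double[] x, double[] y, double[] sigma, Func<double, double> model)
{
    double sum = 0;
    for (int i = 0; i < x.Length; i++)
    {
        double r = (y[i] - model(x[i])) / sigma[i];   // residual weighted by its error
        sum += r * r;
    }
    return sum;   // with uniform sigma this is proportional to the sum of squared residuals
}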
Linear regression analysis establishes a relationship between a dependent variable and an independent variable. The best values of the slope and the intercept are found by taking partial derivatives of the error function with respect to each of them and setting them to zero. Regression parameters are different from the correlation coefficient: the latter describes the strength of association between two variables and is therefore symmetric in the pair, unlike the regression parameters.
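Setting those partial derivatives to zero yields closed-form estimates for the slope and intercept; a sketch for the unweighted straight-line case, using my own arrangement of the standard formulas rather than code from the text:
double[] LinearFit(double[] x, double[] y)
{
    // least-squares slope and intercept for y = slope * x + intercept
    int n = x.Length;
    double sx = x.Sum(), sy = y.Sum();
    double sxx = x.Select(v => v * v).Sum();
    double sxy = x.Zip(y, (a, b) => a * b).Sum();
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double intercept = (sy - slope * sx) / n;
    return new double[] { slope, intercept };
}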
If we are not inclined toward analytic error estimation, we can attempt the Bootstrap method. This method uses the actual data set with its N points to generate synthetic data sets with just as many points. A synthetic set differs from the original in that a fraction of the original points is replaced by duplicates of other original points, since the resampling is done with replacement. Since the order of data points does not matter, the estimation can take place with the actual measurement noise.
#codingexercise
We talked about finding the closest elements to a given value across three arrays. We introduced the notion of projections to find non-overlapping regions. In the same spirit, we note that projections can be initiated from any sentinel values.

Sunday, December 17, 2017

#codingexercise
We are given three sorted arrays and we want to find one element from each array such that the three are closest to each other. One way to do this is the following: we traverse all three arrays simultaneously while keeping track of the smallest range (maximum minus minimum) encountered among the candidate set of three elements. At each step we advance by one element in exactly one array, namely the array whose current element is the minimum.
By advancing only the minimum element, we make sure the sweep is progressive and exhaustive.
List<int> GetClosest(List<int> A, List<int> B, List<int> C)
{
    var ret = new List<int>();
    int i = 0;
    int j = 0;
    int k = 0;
    int diff = int.MaxValue;
    while (i < A.Count && j < B.Count && k < C.Count)
    {
        var candidates = new List<int>() { A[i], B[j], C[k] };
        int range = candidates.Max() - candidates.Min();
        if (range < diff)
        {
            diff = range;
            ret = candidates.ToList();
        }
        if (range == 0) return ret;
        // advance the array whose current element is the minimum
        if (candidates.Min() == A[i])
        {
            i++;
        } else if (candidates.Min() == B[j])
        {
            j++;
        } else {
            k++;
        }
    }
    return ret;
}
We don't have to sweep for the average, because we can enumerate all the elements of the three arrays to find the global average and then find the element closest to it in each array.
List<int> GetAveragesFromThreeSortedArrays(List<int> A, List<int> B, List<int> C)
{
    var combined = A.Union(B).Union(C).ToList();
    int average = (int)combined.Average();
    return GetClosestToGiven(A, B, C, average);
}
List<int> GetClosestToGiven(List<int> A, List<int> B, List<int> C, int value)
{
    Debug.Assert(A.Count > 0 && B.Count > 0 && C.Count > 0);
    var ret = new List<int>();
    ret.Add(GetClosest(A, value)); // using binary search on each sorted array
    ret.Add(GetClosest(B, value));
    ret.Add(GetClosest(C, value));
    return ret;
}
Here we simply do a binary search on each sorted list to find the element closest to the given value. The closest element could lie on either side of the value's insertion point in the sorted sequence.
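The single-array GetClosest used above is not shown in the exercise; one plausible version, assuming its contract is simply to return the element value nearest to the target, is:
int GetClosest(List<int> A, int value)
{
    int index = A.BinarySearch(value);
    if (index >= 0) return A[index];           // exact match
    index = ~index;                            // insertion point of value in A
    if (index == 0) return A[0];
    if (index == A.Count) return A[A.Count - 1];
    int before = A[index - 1];                 // neighbor just below the value
    int after = A[index];                      // neighbor just above the value
    return (value - before <= after - value) ? before : after;
}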

Saturday, December 16, 2017



We were discussing the benefits of client-server separation of computing, so that client specific improvements can happen independently of the server. While server side user controls, grids and views have been tremendously popular and continue to evolve with Model-View-Controller and Model-View-ViewModel techniques along with templates and rendering engines, they mainly serve the convenience of organization and of the developers, because all the code and resources are available at once to mix and match in any page to be rendered. On the other hand, client side JavaScript can take care of most of these requests with its own libraries while working in a platform independent and browser independent way. If we hand the resources to a client side library and ask the client to display the page, the server no longer needs to do that work, and development on the client side is facilitated. While HTML5 and client side scripts prefer brevity in their source, the mix and match processing, when offloaded to the client, can remain brief as long as all the resources are available as a bundle for the duration of the page or session.

One of the advantages of client side scripting is the significant amount of development support that we get from the browser and its plugins. Some security is better enforced from the server side, but computations on public resources can easily be dedicated to the client side.

Let us look at a few of the security concerns in client side technologies. Most client side scripting is done with JavaScript. One of the most common concerns with JavaScript is that it can be injected by third parties. Another concern is that it is often used to append or prepare HTML with string concatenations; even server side HTML injection has been shown to be vulnerable, but the client side is notorious for it. Second, browsers have to enforce cross origin resource sharing, where previously cross domain Ajax requests were allowed freely. Third, the processing is utterly visible to all: the XHR API shows not only the data that is exchanged but also the calls made, and a view of the page source is enough to show how cluttered the client side can get.
One of the most common vulnerabilities seen with client side scripting is that input and injection attacks are not mitigated. Although HTTP headers are used to provide some amount of security, they rely on best practice at both the origin and the destination. Even SQL can be injected directly through the user interface. Locale and geographical customizations often make it harder to enforce a rule, to say nothing of the variety of user agents, browsers, apps and devices. While HTML5, JSON and CSS have each addressed this with their own separation of concerns and brevity, much of the script is still left to the programmer to organize. Technologies in this space tried to innovate on functionality with Flash and ActiveX, but users had to disable some of these for the sake of security. Instead we could come up with newer organization, more brevity and separation of concerns within resources and scripts so that we have far more security and a smaller attack surface.

Friday, December 15, 2017

If a web page is viewed as HTML, text usually appears as a span within a div. These divs only indicate the position; their content is decided by the rendering engine, such as Razor. The literals in the HTML are looked up based on an identifier that acts as an alias. The same alias may be used for the translations of the string into different languages. Since the (alias, locale) pair is unique, we know that there will be a unique translation for the string. When the view is prepared, these literals are pulled from the database, and usually this preparation happens on the server side. If they were assets such as CSS, JavaScript or static HTML, they could be served by the content delivery network, which ensures that they are as geographically close as possible to the user. Strings are resources too, but they don't appear as a resource on the CDN because a computation is involved. Technically, though, each string could be placed in its own static file, and the sections of the HTML that display the content could instead serve the literal by loading that file. Round trips to a CDN are generally not considered expensive, but given that there may be many literals on a page, this could become expensive. Each CDN node may also need to keep its own copy of the literals, which means we lose the single point of maintenance we had on the server side, and updates and deletes of a string involve making changes on every CDN. But it is not impossible to offload the literals as a resource so that the HTML can load them on the client side without the server incurring the work.
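As a toy illustration of the lookup described here, keyed on the (alias, locale) pair, with hypothetical names not tied to any particular framework:
string GetLiteral(Dictionary<Tuple<string, string>, string> literals,
                  string alias, string locale)
{
    // the (alias, locale) pair is unique, so it maps to exactly one translation
    string text;
    if (literals.TryGetValue(Tuple.Create(alias, locale), out text))
        return text;
    return alias;   // hypothetical fallback: show the alias itself if no translation exists
}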
The above discussion merely marks a benefit of client-server separation of computing, so that client specific improvements can happen independently of the server. While server side user controls, grids and views have been tremendously popular and continue to evolve with Model-View-Controller and Model-View-ViewModel techniques along with templates and rendering engines, they mainly serve the convenience of organization and of the developers, because all the code and resources are available at once to mix and match in any page to be rendered. On the other hand, client side JavaScript can take care of most of these requests with its own libraries while working in a platform independent and browser independent way. If we hand the resources to a client side library and ask the client to display the page, the server no longer needs to do that work, and development on the client side is facilitated. While HTML5 and client side scripts prefer brevity in their source, the mix and match processing, when offloaded to the client, can remain brief as long as all the resources are available as a bundle for the duration of the page or session.