Cluster computing

Tuesday, December 19, 2017

Today we resume the discussion of model fitting and error estimation by Kleinstein and Hershberg.
If we do not know the error distribution well enough to model it, we can try the bootstrap method. This method uses the actual data set with its N points to generate synthetic data sets with just as many points. Each synthetic set is drawn from the original with replacement, so it differs from the original in that a random fraction of the original points, typically about 1/e or 37%, is replaced by duplicates of other original points. Since the order of the data points does not matter, the estimation works with the actual measurement noise. We can use the chi-square merit function to fit each synthetic set.
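The resampling step can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' procedure: the function names, the seed parameter and the use of a generic `estimator` callback (standing in for whatever fit, such as a chi-square minimization, is applied to each synthetic set) are all assumptions made for the example.

```python
import random

def bootstrap_samples(data, n_sets, seed=0):
    """Draw n_sets synthetic data sets, each with len(data) points
    sampled from the original data with replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    n = len(data)
    return [[data[rng.randrange(n)] for _ in range(n)] for _ in range(n_sets)]

def bootstrap_std_error(data, estimator, n_sets=1000, seed=0):
    """Spread of an estimator across the synthetic sets,
    used as an estimate of the estimator's error."""
    estimates = [estimator(s) for s in bootstrap_samples(data, n_sets, seed)]
    mean = sum(estimates) / len(estimates)
    variance = sum((e - mean) ** 2 for e in estimates) / (len(estimates) - 1)
    return variance ** 0.5

def sample_mean(points):
    """Hypothetical estimator used for illustration: the plain average."""
    return sum(points) / len(points)
```

A call such as `bootstrap_std_error(measurements, sample_mean)` would then give a rough error bar on the mean without assuming anything about the noise distribution.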
Next we review confidence intervals and the accuracy of model parameters. A model may have M parameters, each corresponding to one dimension. Since each of these dimensions allows variation, the probability distribution of the parameters is a function defined on an M-dimensional space. Within this distribution, we choose a region around the fitted model parameters that contains a high percentage of the total probability. This region is called the confidence region, or the confidence interval when there is a single parameter. Depending on how much of the distribution is enclosed, confidence intervals are quoted as percentage levels. The shape of the region, such as an ellipsoid, may also be stated. Generally we pick a region that is reasonably compact and centered on the point of reference in the parameter space. For a single parameter x, the band holding the central 90% of the values can be described in MATLAB as y = prctile(x, [5,95]).
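As a rough Python counterpart to the percentile band, the sketch below computes the 5th and 95th percentiles of a list of parameter values. The function names are hypothetical, and the linear interpolation rule used here is one common convention that may differ slightly from the one prctile applies, so the numbers need not match exactly on small samples.

```python
def percentile(sorted_xs, p):
    """Linearly interpolated p-th percentile (0 to 100) of an already sorted list."""
    if not sorted_xs:
        raise ValueError("need at least one value")
    pos = (len(sorted_xs) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    frac = pos - lo
    return sorted_xs[lo] + (sorted_xs[hi] - sorted_xs[lo]) * frac

def central_band(values, lower=5, upper=95):
    """Return the (lower, upper) percentile pair enclosing the central band."""
    xs = sorted(values)
    return percentile(xs, lower), percentile(xs, upper)
```

Applied to the parameter estimates from many synthetic data sets, `central_band` gives a 90% interval for that parameter.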
As we work with model parameters, some rules of thumb come into view. For example, if we define a progressive rate constant, it cannot turn out to be negative. Similarly, Poisson's ratio cannot exceed 0.5. This is the ratio of the proportional decrease in width to the proportional increase in length as a material is stretched. Other parameters might have both an upper and a lower bound. A fit that violates these physical constraints should not be accepted.
#codingexercise
We talked about finding the elements closest to a given value across three arrays. We introduced the notion of projections to find non-overlapping regions. We also mentioned that projections can be initiated from any sentinel values to split the range in the three arrays. The sentinel values we choose do not have to be the start and the end of an array. Any range of interest can be projected onto all three arrays to split each array into the subarray that overlaps the given range and the array-specific subarrays that do not. This is very useful for shrinking the index ranges over which we perform computations related to a given range.

Also, note that range division is not the only benefit. We can also approximate computations in all three arrays by performing a similar operation within a projection.
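Projecting a range onto a sorted array can be sketched with two binary searches. The helper names below are assumptions made for illustration; the sketch relies only on Python's standard bisect module.

```python
from bisect import bisect_left, bisect_right

def project(sorted_arr, low, high):
    """Return (start, end) so that sorted_arr[start:end] holds
    exactly the elements lying in the closed range [low, high]."""
    start = bisect_left(sorted_arr, low)
    end = bisect_right(sorted_arr, high)
    return start, end

def split_by_range(sorted_arr, low, high):
    """Split a sorted array into the parts before, inside and after the range."""
    start, end = project(sorted_arr, low, high)
    return sorted_arr[:start], sorted_arr[start:end], sorted_arr[end:]
```

Calling `split_by_range` on each of the three arrays with the same sentinel pair yields the overlapping and non-overlapping subarrays in logarithmic time per array.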

Monday, December 18, 2017

Today we resume the discussion of model fitting and error estimation by Kleinstein and Hershberg. There are two ways to select the appropriate model. The first is to observe trend lines that correspond to some well-known mathematical formula. The second is to observe the underlying physical processes that contribute to the system. These physical interpretations contribute to the model parameters.
To fit the data points, a model may use least squares. The errors, called residuals, may be positive or negative, so simply summing them lets them cancel and gives an inaccurate measure of fit. Instead, the squares of the errors are summed and minimized to give a better fit.
We used least squares error minimization to fit the data points. Another way to do this is maximum likelihood estimation. This method asks: "Given my set of model parameters, what is the probability that this data set occurred?" This is interpreted as the likelihood of the parameters given the data.
The chi-square error measure and maximum likelihood estimation are related. For Gaussian-distributed errors, maximizing the probability of the data set given the model parameters is equivalent to minimizing the negative natural log of that probability, which, up to a constant and a factor of one half, is the chi-square function of weighted residuals. Furthermore, if the variance is uniform, minimizing the chi-square function reduces to minimizing the sum of squared residuals.
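The relation can be written out as a short derivation. The symbols are assumptions chosen for this note: y_i is the measurement at x_i, y(x_i; a) is the model prediction for parameters a, and sigma_i is the standard deviation of the i-th measurement.

```latex
\[
P(\text{data} \mid a) \;\propto\; \prod_{i=1}^{N}
  \exp\!\left[ -\frac{1}{2} \left( \frac{y_i - y(x_i; a)}{\sigma_i} \right)^{2} \right]
\]
\[
-\ln P \;=\; \frac{1}{2} \sum_{i=1}^{N}
  \left( \frac{y_i - y(x_i; a)}{\sigma_i} \right)^{2} + \text{const}
  \;=\; \frac{1}{2}\,\chi^{2} + \text{const}
\]
\[
\sigma_i = \sigma \;\;\Rightarrow\;\;
\chi^{2} \;=\; \frac{1}{\sigma^{2}} \sum_{i=1}^{N} \bigl( y_i - y(x_i; a) \bigr)^{2}
\]
```

Since the constant and the factor 1/sigma^2 do not depend on the parameters, the same parameter values minimize all three expressions.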
Linear regression analysis establishes a relationship between a dependent variable and an independent variable. The best values of slope and intercept are found by setting the partial derivatives of the error function with respect to them to zero. Regression parameters are different from the correlation coefficient. The latter describes the strength of association between two variables. It is therefore symmetric in the pair of variables, unlike the regression parameters.
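Setting those partial derivatives to zero gives closed-form expressions for the slope and intercept. The sketch below, with hypothetical function names, computes them alongside the correlation coefficient to show that swapping the variables changes the regression line but not the correlation.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept

def correlation(xs, ys):
    """Pearson correlation coefficient; symmetric in xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5
```

For points that lie exactly on y = 2x + 1, `fit_line(xs, ys)` and `fit_line(ys, xs)` return different slopes, while `correlation` returns the same value in both orders.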
If we do not know the error distribution well enough to model it, we can try the bootstrap method. This method uses the actual data set with its N points to generate synthetic data sets with just as many points. Each synthetic set is drawn from the original with replacement, so it differs from the original in that a random fraction of the original points, typically about 1/e or 37%, is replaced by duplicates of other original points. Since the order of the data points does not matter, the estimation works with the actual measurement noise.
#codingexercise
We talked about finding the elements closest to a given value across three arrays. We introduced the notion of projections to find non-overlapping regions. In the same spirit, we note that projections can be initiated from any sentinel values.

Sunday, December 17, 2017

#codingexercise
We are given three sorted arrays and we want to find one element from each array such that the three are closest to each other. One way to do this is to traverse all three arrays while keeping track of the smallest difference between the maximum and the minimum of the candidate set of three elements. We advance one element at a time by incrementing the index of the array that holds the minimum value.
By advancing only the minimum element, we make sure the sweep is progressive and exhaustive.
// Assumes: using System.Collections.Generic; using System.Linq;
List<int> GetClosest(List<int> A, List<int> B, List<int> C)
{
    var ret = new List<int>();
    int i = 0;
    int j = 0;
    int k = 0;
    int diff = int.MaxValue; // smallest range seen so far
    while (i < A.Count && j < B.Count && k < C.Count)
    {
        var candidates = new List<int>() { A[i], B[j], C[k] };
        int min = candidates.Min();
        int range = candidates.Max() - min;
        if (range < diff)
        {
            diff = range;
            ret = candidates;
        }
        if (range == 0) return ret; // identical values cannot be improved upon
        // advance the index of the array holding the minimum value
        if (min == A[i])
        {
            i++;
        }
        else if (min == B[j])
        {
            j++;
        }
        else
        {
            k++;
        }
    }
    return ret;
}
We don't have to sweep to find the elements closest to the average, because we can enumerate all the elements of the arrays to find the global average. Then we can find the element closest to it in each of the three arrays.
// Assumes: using System; using System.Diagnostics; using System.Linq;
List<int> GetAveragesFromThreeSortedArrays(List<int> A, List<int> B, List<int> C)
{
    // Concat keeps duplicates; Union would drop them and skew the average
    var combined = A.Concat(B).Concat(C).ToList();
    int average = (int)Math.Round(combined.Average());
    return GetClosestToGiven(A, B, C, average);
}
List<int> GetClosestToGiven(List<int> A, List<int> B, List<int> C, int value)
{
    Debug.Assert(A.Count > 0 && B.Count > 0 && C.Count > 0);
    var ret = new List<int>();
    // GetClosest(list, value) is a two-argument overload using binary search
    ret.Add(GetClosest(A, value));
    ret.Add(GetClosest(B, value));
    ret.Add(GetClosest(C, value));
    return ret;
}
Here we simply do a binary search on each sorted list to find the element closest to the given value. That element could lie either before or after the given value in the sorted sequence.
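The binary search helper itself is not shown above, so here is one way it might look, sketched in Python with the standard bisect module. The function names and the tie-breaking rule (prefer the smaller neighbour) are assumptions made for this example.

```python
from bisect import bisect_left

def closest_in_sorted(sorted_arr, value):
    """Return the element of a non-empty sorted list closest to value."""
    if not sorted_arr:
        raise ValueError("array must not be empty")
    pos = bisect_left(sorted_arr, value)
    if pos == 0:
        return sorted_arr[0]           # value is at or before the first element
    if pos == len(sorted_arr):
        return sorted_arr[-1]          # value is after the last element
    before, after = sorted_arr[pos - 1], sorted_arr[pos]
    # the closest element is one of the two neighbours of the insertion point
    return before if value - before <= after - value else after

def closest_to_given(a, b, c, value):
    """Closest element to value from each of three sorted lists."""
    return [closest_in_sorted(arr, value) for arr in (a, b, c)]
```

Each lookup costs O(log n), so the three lookups together are far cheaper than sweeping the arrays.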

Saturday, December 16, 2017

We were discussing the benefits of client-server separation of computing, so that client-specific improvements can happen independently of the server. Server-side user controls, grids and views have been tremendously popular and continue to evolve with Model-View-Controller and Model-View-ViewModel techniques along with templates and rendering engines. However, they exist only for the convenience of organization and of the developers, because all the code and the resources are available at once for mix and match in any page to be rendered. On the other hand, client-side JavaScript can take care of most of these requests and has its own libraries, while working in a platform-independent and browser-independent way. If we give the resources to the client-side library and ask the client to display the page, the server doesn't need to do this work, and development on the client side becomes easier. While HTML5 and client-side scripts have preferred brevity in their source, the mix and match processing offloaded to the client can still be brief as long as all the resources are available as a bundle for the duration of the page or session.

One of the advantages of client-side scripting is the significant amount of development support that we get from browsers and plugins. Some security is better enforced on the server side, but computations on public resources can easily be delegated to the client side.

Let us look at a few of the security concerns in client-side technologies. Most of the scripting on the client side is done in JavaScript. First, one of the most common concerns with JavaScript is that it can be injected by third parties. A related concern is that it is often used to append or prepare HTML with string concatenation. Even server-side HTML generation has been shown to be vulnerable to injection, but the client side is notorious for it. Second, browsers have to enforce the same-origin policy, and cross-origin resource sharing has to be configured carefully wherever cross-domain Ajax requests are allowed. Third, the processing is utterly visible to all. The XHR API shows not only the data that is exchanged but also the calls made. A "view page source" is enough to show how cluttered the client side can get.
One of the most common vulnerabilities seen with client-side scripting is that input and injection attacks are not thwarted by any mitigation. Although HTTP headers are used to provide some amount of security, they rely on best practices being followed at both the origin and the destination. Even SQL can be injected directly through the user interface. Locale and geographical customizations often make it harder to enforce a rule, not to mention the variety of user agents, browsers, apps and devices. While HTML5, JSON and CSS have each addressed this with their own separation of concerns and brevity, much of the script is still left to the programmer to organize. Technologies at this end tried to innovate on functionality with Flash and ActiveX, but users had to disable some of these for the sake of security. Instead, we could come up with newer organization, more brevity and more separation of concerns within resources and scripts, so that we have far more security and less attack surface.
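The risk of preparing HTML by string concatenation can be shown in a few lines. The sketch below uses Python's standard html.escape only to keep the example short and testable; the function names are hypothetical, and the same principle of escaping untrusted input before it reaches markup applies to client-side JavaScript.

```python
from html import escape

def render_comment_unsafe(comment):
    """Concatenates untrusted text straight into markup (vulnerable)."""
    return '<div class="comment">' + comment + "</div>"

def render_comment(comment):
    """Escapes untrusted text so that it is displayed rather than executed."""
    return '<div class="comment">' + escape(comment) + "</div>"
```

With an input such as `<script>alert('x')</script>`, the unsafe version emits a live script tag, while the escaped version emits harmless text.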

Friday, December 15, 2017

If a web page is viewed as HTML, text usually appears as a span within a div. These divs only indicate the position; their content is decided by a rendering engine such as Razor. The literals in the HTML are looked up based on an identifier that acts as an alias. The same alias may be used for the translations of the string in different languages. Since the (alias, locale) pair is unique, we know that there will be a unique translation for the string. When the view is prepared, these literals are pulled from the database. Usually this preparation happens on the server side. Assets such as CSS, JavaScript or static HTML can be served by a content delivery network, which ensures that they are as geographically close as possible to the user. Strings are resources too, but they don't appear as a resource on the CDN because there is a computation involved. Technically, though, each string may be placed in its own static file, and the sections of the HTML that display the content may instead serve the literal by loading that file. Round trips to a CDN are generally not considered expensive, but given that there may be many literals on a page, this could add up. Each CDN node may also need to keep its own copy of the literals. This means we no longer have the single point of maintenance on the server side that we had earlier. Moreover, updates and deletes of a string involve making changes on every CDN node. But it is not impossible to offload the literals as a resource so that the HTML can load them on the client side without the server having to incur the work.
The above discussion merely marks a benefit of client-server separation of computing, so that client-specific improvements can happen independently of the server. Server-side user controls, grids and views have been tremendously popular and continue to evolve with Model-View-Controller and Model-View-ViewModel techniques along with templates and rendering engines. However, they exist only for the convenience of organization and of the developers, because all the code and the resources are available at once for mix and match in any page to be rendered. On the other hand, client-side JavaScript can take care of most of these requests and has its own libraries, while working in a platform-independent and browser-independent way. If we give the resources to the client-side library and ask the client to display the page, the server doesn't need to do this work, and development on the client side becomes easier. While HTML5 and client-side scripts have preferred brevity in their source, the mix and match processing offloaded to the client can still be brief as long as all the resources are available as a bundle for the duration of the page or session.

Thursday, December 14, 2017

We were discussing SDK and command line interface (CLI) options in addition to the APIs, which broaden the customer base.
Let us now talk about locale. Every user belongs to a geography and hence has a preferred language for the login page. When a web page is displayed to the user, most of the strings on the page are translated to a language of his or her choice. These translations can be made available to the user interface as language packs so that the corresponding translations show up on the page. Care must be taken that all translations exist in every locale supported by the user interface. Also, when the language specified is invalid or a translation cannot be found for an identifier, the corresponding default strings in English can be displayed.
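The lookup with an English fallback, and the check that every locale has every translation, can be sketched as follows. The table contents, the alias names such as `login.title` and the function names are all made up for illustration; a real system would load the strings from a database or language pack.

```python
DEFAULT_LOCALE = "en"

# Hypothetical translation table keyed by (alias, locale)
TRANSLATIONS = {
    ("login.title", "en"): "Sign in",
    ("login.title", "fr"): "Connexion",
    ("login.help", "en"): "Need help?",
}

def translate(alias, locale, table=TRANSLATIONS):
    """Look up a string by (alias, locale), falling back to the default locale,
    and finally to the alias itself if nothing is found."""
    text = table.get((alias, locale))
    if text is None:
        text = table.get((alias, DEFAULT_LOCALE))
    return text if text is not None else alias

def missing_translations(table, locales):
    """List the (alias, locale) pairs that have no translation yet."""
    aliases = {alias for alias, _ in table}
    return sorted((a, loc) for a in aliases for loc in locales if (a, loc) not in table)
```

Running `missing_translations` as part of a build would flag gaps before a user ever sees a fallback string.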

Wednesday, December 13, 2017

We were discussing SDK and command line interface (CLI) options in addition to the APIs, which broaden the customer base. Tests and automation can make use of commands and scripts to drive the service. This is a very powerful technique for administrators as well as for day-to-day usage by end users. Command line usage has one additional benefit: it can be used for troubleshooting and diagnostics. While SDKs provide an additional layer and are hosted on the client side, a command line interface may be available on the server side and leave fewer variables to go wrong. With detailed logging and request-response capture, the CLI and SDK make it easier to trace the calls made to the services.

Let us now consider applying the SDK and CLI to something that is inherently UI specific, say the login page. The login page has a few restrictions that other user interfaces probably don't have. First, it is central to the company where the login happens. People log in because they trust the page they are on. This means the login page is inherently tied to the domain the web application is hosted in. Even if a third-party application wants to allow its users to log in, it will redirect them to log in on the company page. Since a user interface is involved in specifying the username and password, it might not seem feasible to delegate it to a command line or SDK, because that would appear to involve sending the credentials in the clear. However, the credentials can be transmitted over an encrypted channel as they are routed from one service to another. Besides, we never intended the user to provide the credentials anywhere other than the login page. The suggestion is merely to decouple the UI from being in the response body to being in a standalone web view that calls a service. Since both the service and the UI need to be available and hosted on the same domain, it doesn't matter whether the credentials get passed from the UI to the service via a form submit or via a relay from one service to another. The UI and the service for the sign-in merely shift the boundary between the app and the service.

Second, users are accustomed to not disclosing their credentials. Credentials work differently from key-value pairs. Many cloud providers use key-value pairs in their SDKs and CLIs. They know that their services, such as compute and storage, will be used in applications and services, and therefore they find it easier to ship SDKs and CLIs. A user interface is a frontend only to a user and not to an application or a digital asset. Therefore, providing an SDK or CLI does not hold as much appeal as a login page. At the same time, separating out the view for the user and focusing on the internal architecture with the help of an SDK and CLI helps bring best practices from the cloud to retail.

Lastly, users are comforted knowing that they are signing in to a look and feel that they are familiar with. Therefore a login page is a frontend that users expect not to change. By the very same argument, if we pull the user interface out of the API response and let the UI be part of an MVC application that makes internal service calls underneath, then we are less likely to need changes to the login page as the service gets revised. Moreover, the customizations that we need to make for third parties, clients and end users no longer need to run as deep as the service.

#codingexercise

We were discussing finding closest numbers between the elements of three sorted integer arrays.

Another use of finding the numbers in these arrays closest to a given input value is to discover the overlapping range of the sorted numbers. To do this on the number line, we take the (start, end) pair of every array and choose max(start1,start2,start3) and min(end1,end2,end3) as the bounds of the overlap. By finding the elements closest to these min and max values, we can determine the projection of the overlapping range on any one of the three arrays. Then we can choose the projection with the smallest number of elements.
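A minimal sketch of the overlap computation follows, assuming non-empty sorted lists and using hypothetical function names. It takes the largest start and the smallest end as the overlap bounds, projects that range onto each array with binary search, and returns the smallest projection.

```python
from bisect import bisect_left, bisect_right

def overlap_range(arrays):
    """Return (low, high) shared by all sorted arrays, or None if they do not overlap."""
    low = max(arr[0] for arr in arrays)    # max(start1, start2, start3)
    high = min(arr[-1] for arr in arrays)  # min(end1, end2, end3)
    return (low, high) if low <= high else None

def smallest_projection(arrays):
    """Project the overlap onto every array and return the shortest projection."""
    bounds = overlap_range(arrays)
    if bounds is None:
        return None
    low, high = bounds
    projections = [arr[bisect_left(arr, low):bisect_right(arr, high)] for arr in arrays]
    return min(projections, key=len)
```

For the arrays [1, 5, 9, 14], [4, 6, 20] and [3, 7, 8, 10, 12], the overlap is (4, 12) and the projections are [5, 9], [4, 6] and [7, 8, 10].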

Determining the projections of the overlapping region helps shrink the search space for operations involving any or all of the three arrays. It may not be useful for sum, average, mode or median, but it is helpful for partitioning the number line to determine the presence of a number in any array. If the number is not in the overlapping range, then we can exclude at least one of the arrays and binary search for it outside the projections in the remaining arrays.

One such use goes back to determining the numbers closest to a given value. If the value is not included in the projections, we know that we can search in a smaller space.