Sunday, December 31, 2017

We now resume our discussion of Monte Carlo simulation of synthetic data sets. This is a powerful technique. It assumes that if the fitted parameter set a0 is a reasonable estimate of the true parameters, obtained by minimizing the chi-square, then the distribution of differences of subsequently fitted parameters from a0 should be similar to the corresponding distribution around the true parameters.
The assumed initial parameter set a0, together with noise, can be used to generate new data sets at the same values of the independent axis coordinate as the actual data set. The synthetic data aligns with the actual data at these reference points. The interesting part is that the new data set has the same relationship to a0 as the actual data has to the true parameter set.
We can generate many such data sets, called synthetic data sets. For each one, we fit the model and obtain the corresponding parameter set. This yields one sample of the difference between the synthetic-fit parameters and the initial parameter set. The simulation can be run thousands of times to generate sufficient samples, leading to an M-dimensional probability distribution. We can even estimate the standard deviation of the fitted parameters from the accumulated differences.
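A minimal sketch of this procedure, assuming a straight-line model with Gaussian noise (the model, noise level, and parameter names are illustrative, not from the original discussion):

import numpy as np

rng = np.random.default_rng(0)

# actual data: y = 2x + 1 with measurement noise
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)

# a0: parameter set fitted to the actual data (slope, intercept)
a0 = np.polyfit(x, y, 1)

# generate synthetic data sets at the same x values and refit each one
deltas = []
for _ in range(1000):
    y_syn = np.polyval(a0, x) + rng.normal(0, 0.5, x.size)
    a_syn = np.polyfit(x, y_syn, 1)
    deltas.append(a_syn - a0)  # one sample of the difference from a0

# the spread of the differences estimates the standard deviation of the fit
print(np.std(np.array(deltas), axis=0))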
#codingexercise 
Given two natural numbers n and m, find the number of ways in which numbers greater than or equal to m can be added to get the sum n.
For example, with n = 3 and m = 1
we have three ways: 1 + 1 + 1, 1 + 2, 3
int GetCountWays(int n, int m)
{
    // dp[i, k] = number of ways to write i as a sum of parts >= k
    int [,] dp = new int[n+2, n+2];
    dp[0, n+1] = 1;

    for (int k = n; k >= m; k--) {
       for (int i = 0; i <= n; i++) {
            dp[i, k] = dp[i, k+1];                 // ways that skip part k
            if (i - k >= 0)
                dp[i, k] = dp[i, k] + dp[i-k, k];  // ways that use part k at least once
        }
    }
   return dp[n, m];
}

Saturday, December 30, 2017

In continuation of the hashing of stacktrace exceptions discussed in the previous post:
#! /usr/bin/python
import hashlib
hashes = {}
distinct_stacks = []
def hash(data):
    # like the unix command md5sum; encode so this also works on Python 3
    # print("StackTrace with %s bytes and md5=%s" % (len(data), hashlib.md5(data.encode('utf-8')).hexdigest()))
    return hashlib.md5(data.encode('utf-8')).hexdigest()
    # StackTrace with 1244 bytes and md5=1f369406ec1243e5bc3aead95e97f380

# strip a stack before comparison, canonicalize if necessary:
def prepare(stk):
    if stk:
        lines = stk.split('\n')
        # keep only the class and method part of each frame
        ret = [x.split('(')[0].strip().replace(" ", "") for x in lines]
        if ret:
            return ret

def compare(stk1, stk2):
    lines1 = stk1.split('\n')
    lines2 = stk2.split('\n')
    if len(lines1) == 0 or len(lines2) == 0:
        return False
    limit = min(len(lines1), len(lines2))
    count = 0
    for i in range(limit):
        # compare only the class and method part of each frame
        line1 = lines1[i].split('(')[0].strip().replace(" ", "")
        line2 = lines2[i].split('(')[0].strip().replace(" ", "")
        if line1.lower() != line2.lower():
            return False
        count = count + 1
        if count > 2:  # the first three frames match
            return True
    return False

def count(stk, hashes, distinct_stacks):
    prepared = prepare(stk)
    if prepared:
        prepared = '\n'.join(prepared)
        hashed = hash(prepared)
        if hashed in hashes:
            hashes[hashed] += 1
        else:
            distinct_stacks += [prepared]
            hashes[hashed] = 1

def parse(filename, hashes, distinct_stacks):
    with open(filename, 'r') as fin:
        lines = fin.readlines()
        for line in lines:
            if "Exception:" in line:
                stk = '\n'.join(line.split("at"))
                count(stk, hashes, distinct_stacks)


###
>>> hashes = {}
>>> distinct_stacks = []
>>> stackhasher.parse('stacktrace.txt', hashes, distinct_stacks)
>>>
>>> hashes
{'<md5 hexdigest of the prepared stack>': 1}
>>> distinct_stacks
['com.devdaily.sarah.tests.TrySuccessFailure$AlsException:Bummer!\n']
>>>
###

Friday, December 29, 2017

Display of Stack-Trace Statistics in daily runs and production environments: 
Introduction: Log analysis for error detection and reporting is an important activity for any commercial deployment. Together with powerful log scan and search techniques, log analysis, both manual and automated, is useful for the smooth running of any environment. Many in-house and off-the-shelf commercial log analysis tools come with rich interfaces in the form of query language operators as well as dashboards to help the end users. However, the use of these query tools and their frequency is left to the discretion of the user. Most investigations based on logs are done case by case and therefore do not cover the overall hashes and counts of exception stack-traces. This write-up introduces an automated stack-trace hasher from log scans as a useful addition to any statistics.
Description: The conventional mode of querying and displaying graphs based on exceptions introduces the notion of filters that select exceptions matching one or more attributes, usually in the form of keywords. Existing dashboards then serve up these selections in the same manner that they do all exceptions on a continuous timeline.
On the other hand, exceptions occurring in one code path often manifest themselves more than once. Depending on the frequency, these exceptions and their counts indicate the health of the component that is logging them. While the dashboard for archived exceptions may be available to view with pan and zoom, the statistics between service upgrades are generally missing. This gap can easily be filled by hashing exception stack-traces, taking snapshots of their running counts between upgrades, and including them in the summaries. Such health indicators will be very helpful even in test runs of the service in various environments. Moreover, previously unnoticed exceptions now get visibility, and defect tracking may open defects against the component based on these exceptions.
Design:  
 As the exceptions arrive, they are hashed and counted. This results in a frequency table that then gets displayed. We utilize both the exception frequency and inverse exception occurrence lookup for best usage.  
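A minimal sketch of the frequency table together with the inverse lookup from a hash back to a representative stack, reusing the md5 idea from the previous post (the names here are illustrative):

import hashlib

freq = {}     # hash -> running count of occurrences
sample = {}   # hash -> one representative stack-trace (inverse lookup)

def tally(stack):
    h = hashlib.md5(stack.encode('utf-8')).hexdigest()
    freq[h] = freq.get(h, 0) + 1
    sample.setdefault(h, stack)

tally("FrameA\nFrameB")
tally("FrameA\nFrameB")
tally("FrameC")
# display the frequency table, most frequent stack first
for h, n in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(n, sample[h].splitlines()[0])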
Languages: Sample stack-trace hashing utilities such as exhash are available in Python and Java. Taking the class, method, and optionally the offset as frame entries, and creating a hash from a stack-trace using only the class and method, will help with this tally.
Conclusion: Utilizing a stack trace hasher in addition to test run reports or service monitoring will be helpful as a health indicator for the different components in a service. 
#codingexercise
Find the Newman-Shanks-Williams prime
This is defined by the recurrence relation:
F(0) = 1
F(1) = 1
F(n) = 2 * F(n-1) + F(n-2)
int GetNSWPrime(int n)
{
    if (n < 0)
        return 0;

    if (n == 0 || n == 1)
        return 1;

    // plain recursion; exponential in n
    return 2 * GetNSWPrime(n - 1) + GetNSWPrime(n - 2);
}
>>> def p(n):
...    if (n==0) or (n==1):
...       return 1
...    return 2*p(n-1)+p(n-2)
...
>>> p(3)
7
>>> p(5)
41
>>> p(7)
239
>>>
Not every term of this sequence is prime (for example, p(6) = 99 = 9 x 11); the Newman-Shanks-Williams primes are the terms that happen to be prime. The recurrence itself depends only on n-1 and n-2.
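The plain recursion above recomputes subproblems and is exponential in n; an iterative version of the same recurrence, sketched here, runs in linear time:

def nsw(n):
    # S(0) = S(1) = 1, S(n) = 2*S(n-1) + S(n-2)
    if n <= 1:
        return 1
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, 2 * b + a
    return b

print(nsw(7))  # 239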

Thursday, December 28, 2017

We were reviewing Monte Carlo simulation of synthetic data sets. This is a powerful technique. It assumes that if the fitted parameter set a0 is a reasonable estimate of the true parameters, obtained by minimizing the chi-square, then the distribution of differences of subsequently fitted parameters from a0 should be similar to the corresponding distribution around the true parameters.

#codingexercise
Find the Delannoy number
This number describes the number of paths from the southwest corner (0,0) of a rectangular grid to the northeast corner (m,n) using only single steps north, northeast or east
In a 3 by 3 square there are 63 paths from bottom left to top right corner using the above method.
The recursion for this is calculated as
D(m,n) = 1, if m == 0 or n == 0
D(m,n) = D(m-1, n) + D(m-1, n-1) + D(m, n-1), one term for each of the three permitted steps

int GetDelannoy(int n, int m)
{
    if (m == 0 || n == 0)
        return 1;

    return GetDelannoy(m - 1, n) +
               GetDelannoy(m - 1, n - 1) +
               GetDelannoy(m, n - 1);
}
>>> def d(m,n):
...     if ((m==0) or (n==0)):
...         return 1
...     return d(m-1,n)+d(m-1,n-1)+d(m,n-1)
...
>>> print d(3,3)
63
>>> print d(4,4)
321
The recursion tree has a maximum depth of m + n
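Since the plain recursion recomputes the same (m, n) subproblems many times, a memoized sketch of the same recurrence avoids the exponential blow-up:

from functools import lru_cache

@lru_cache(maxsize=None)
def delannoy(m, n):
    # same base case and three-way recurrence as above
    if m == 0 or n == 0:
        return 1
    return delannoy(m-1, n) + delannoy(m-1, n-1) + delannoy(m, n-1)

print(delannoy(3, 3))  # 63
print(delannoy(4, 4))  # 321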

Wednesday, December 27, 2017

We were discussing how to compare two models. For example, the number of data points must be larger than the number of parameters. If we increase the number of parameters, goodness of fit improves and the chi-square gets better, so the model with more parameters appears to do better. This is addressed by a metric called the mean squared error, which is the chi-square divided by the degrees of freedom. With these as numerator and denominator, the MSE represents a trade-off between an over-parameterized model and a lower chi-square. A model with fewer parameters but a lower chi-square is preferred.
We will review Monte Carlo simulation of synthetic data sets shortly. This is a powerful technique. It assumes that if the fitted parameter set a0 is a reasonable estimate of the true parameters, obtained by minimizing the chi-square, then the distribution of differences of subsequently fitted parameters from a0 should be similar to the corresponding distribution around the true parameters.
#codingexercise
Find the Lobb Number. This counts the number of ways n+m open parentheses can be arranged to form the start of a valid sequence of balanced parentheses.
double GetLobb(double n, double m)
{
    return ((2 * m + 1) * GetNChooseK(2 * n, m + n)) / (m + n + 1);
}
double GetNChooseK(double n, double k)
{
    if (k < 0 || k > n) return 0;
    if (n < 0) return 0;
    return Factorial(n) / (Factorial(n-k) * Factorial(k));
}
double Factorial(double n)
{
    if (n <= 1) return 1;
    return n * Factorial(n-1);
}
double GetNChooseKDP(double n, double k)
{
    if (n < 0 || k < 0) return 0;
    if (k == 0 || k == n)
        return 1;
    return GetNChooseKDP(n-1, k-1) + GetNChooseKDP(n-1, k);
}

The NChooseK is also the binomial coefficient.

Tuesday, December 26, 2017

We were discussing how to compare two models. For example, the number of data points must be larger than the number of parameters. If we increase the number of parameters, goodness of fit improves and the chi-square gets better, so the model with more parameters appears to do better. This is addressed by a metric called the mean squared error, which is the chi-square divided by the degrees of freedom. With these as numerator and denominator, the MSE represents a trade-off between an over-parameterized model and a lower chi-square. A model with fewer parameters but a lower chi-square is preferred.

#codingexercise
Find the Eulerian number E(n,m). This is the number of permutations of the numbers from 1 to n in which exactly m elements are greater than the previous element.
For example, with n = 3 and m = 1 there are 4 permutations in which exactly 1 element is greater than the previous element:
1 2 3 => 1,2 and 2,3 resulting in count 2 (invalid)
1 3 2 => 1,3 resulting in count 1 (valid)
2 1 3 => 1,3 resulting in count 1 (valid)
2 3 1 => 2,3 resulting in count 1 (valid)
3 1 2 => 1,2 resulting in count 1 (valid)

int GetEulerian(int n, int m)
{
    if (m >= n || n == 0)
        return 0;

    if (m == 0)
        return 1;

    return (n - m) * GetEulerian(n - 1, m - 1) +
           (m + 1) * GetEulerian(n - 1, m);

}

We can easily try this for n = 3, m = 1:
E(3,1)
= 2 x E(2,0) + 2 x E(2,1)
= 2 + 2 x (1 x E(1,0) + 2 x E(1,1))
= 2 + 2
= 4
The recursive formula uses the notion that as n decreases, m either decreases or stays the same.

Monday, December 25, 2017

We were discussing how to compare two models. For example, the number of data points must be larger than the number of parameters. If we increase the number of parameters, goodness of fit improves and the chi-square gets better, so the model with more parameters appears to do better. This is addressed by a metric called the mean squared error, which is the chi-square divided by the degrees of freedom. With these as numerator and denominator, the MSE represents a trade-off between an over-parameterized model and a lower chi-square. A model with fewer parameters but a lower chi-square is preferred.
#codingexercise
Find the Entringer number for a given n,k
An Entringer number E(n,k) counts the permutations of the digits 1 to n+1 that start with k+1 and whose digits thereafter first fall and then alternately rise and fall. For example, the permutations counted by E(4,2) are
32415
32514
31425
31524
int GetCountEntringer(int n, int k)
{
    if (n == 0 && k == 0) return 1;
    if (k == 0) return 0;
    return GetCountEntringer(n, k-1) + GetCountEntringer(n-1, n-k);
}
From the Wolfram definition of the Entringer numbers, expanding E(4,2) with this recurrence:
E(4,2) = E(4,1) + E(3,2)
E(4,1) = E(4,0) + E(3,3) = 0 + E(3,3)
E(3,3) = E(3,2) + E(2,0) = E(3,2) + 0
E(3,2) = E(3,1) + E(2,1)
E(3,1) = E(3,0) + E(2,2) = 0 + E(2,2)
E(2,2) = E(2,1) + E(1,0) = E(2,1) + 0
E(2,1) = E(2,0) + E(1,1) = 0 + E(1,1)
E(1,1) = E(1,0) + E(0,0) = 0 + 1 = 1
Back-substituting: E(2,1) = 1, E(2,2) = 1, E(3,1) = 1, E(3,2) = 2, E(3,3) = 2, E(4,1) = 2, and E(4,2) = 2 + 2 = 4.

Sunday, December 24, 2017

We were discussing how to compare two models. For example, the number of data points must be larger than the number of parameters. If we increase the number of parameters, goodness of fit improves and the chi-square gets better, so the model with more parameters appears to do better. This is addressed by a metric called the mean squared error, which is the chi-square divided by the degrees of freedom. With these as numerator and denominator, the MSE represents a trade-off between an over-parameterized model and a lower chi-square. A model with fewer parameters but a lower chi-square is preferred.
#codingexercise
Find the number of increasing subsequences of size k in a given array.
For example, with 1 2 3 4, the increasing subsequences of size 2 are 1 2, 1 3, 1 4, 2 3, 2 4, 3 4
The recurrence is therefore
dp[i][j] = 1, where i = 1 and 1 <= j <= n
dp[i][j] = sum(dp[i-1][j]), where 1 < i <= k, i <= j <= n and A[m] < A[j] for (i-1) <= m < j

int GetCountIncreasingSubsequences(int A[], int n, int k)
{
    int [,] dp = new int[k, n];
    int  sum = 0;

    for (int i = 0; i < n; i++)
        dp[0, i] = 1;

    for (int size = 1; size < k; size++) {
       for (int i = size; i < n; i++) {
            dp[size, i] = 0;
            for (int j = size - 1; j < i; j++) {
                if (A[j] < A[i])
                    dp[size, i] += dp[size - 1, j];
            }
        }
    }

    for (int i = k - 1; i < n; i++)
        sum += dp[k - 1, i];

   return sum;
}

The above method also memoizes the counts computed for sizes smaller than k, so we can easily look up the results for anything less than k.

Saturday, December 23, 2017

We were discussing the Bootstrap method, confidence intervals, and the accuracy of model parameters, especially under linearization of non-linear models.
The goodness of fit and the residuals plot are useful indicators along with the error function. Each gives helpful information.
The correlation will mean that the fit is good, but to improve its accuracy, the fit must also result in a specific distribution of residuals. The residuals should be distributed along the independent axis and normally distributed around zero with no systematic trends. The latter condition makes the difference between the data point and the estimate, the residual, more acceptable.
I want to mention that curve fitting improves with higher degrees, such as a quadratic over a linear fit, but this does not mean we go to as high a degree as possible. The model we build tries to minimize the residuals. If we can do this with a lower degree, then that is acceptable.
When we compare two models, we follow certain rules. For example, the number of data points must be larger than the number of parameters. If we increase the number of parameters, goodness of fit improves and the chi-square gets better, so the model with more parameters appears to do better. This is addressed by a metric called the mean squared error, which is the chi-square divided by the degrees of freedom. With these as numerator and denominator, the MSE represents a trade-off between an over-parameterized model and a lower chi-square. A model with fewer parameters but a lower chi-square is preferred.
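A minimal sketch of this comparison metric, the chi-square per degree of freedom, with an illustrative data set and polynomial models:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)
sigma = 0.5

def mse(residuals, sigma, n_params):
    # chi-square divided by the degrees of freedom (N - M)
    chi2 = np.sum((residuals / sigma) ** 2)
    return chi2 / (residuals.size - n_params)

# more parameters always lower the chi-square,
# but the degrees-of-freedom denominator penalizes them
for degree in (1, 4):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    print(degree + 1, "parameters, MSE =", mse(residuals, sigma, degree + 1))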

Friday, December 22, 2017

We were discussing the Bootstrap method, confidence intervals, and the accuracy of model parameters, especially under linearization of non-linear models.
The goodness of fit and the residuals plot are useful indicators along with the error function. Each gives helpful information.
The correlation will mean that the fit is good, but to improve its accuracy, the fit must also result in a specific distribution of residuals. The residuals should be distributed along the independent axis and normally distributed around zero with no systematic trends. The latter condition makes the difference between the data point and the estimate, the residual, more acceptable.
I want to mention that curve fitting improves with higher degrees, such as a quadratic over a linear fit, but this does not mean we go to as high a degree as possible. The model we build tries to minimize the residuals. If we can do this with a lower degree, then that is acceptable.
#codingexercise
Print Hosoya's triangle
Hosoya's triangle is a triangular arrangement of numbers based on Fibonacci numbers where the sum of the two numbers in the left diagonal or the right diagonal results in the number at the current position.
The recurrence relation is
H(0,0) = H(1,0) = H(1,1) = H(2,1) = 1
and
H(n,j) = H(n-1,j) + H(n-2,j)
       = H(n-1,j-1) + H(n-2,j-2)
And H(n,i) = F(i+1) x F(n-i+1)

int GetHosoya(int r, int c)
{
    if (r == 0 && c == 0) return 1;
    if (r == 1 && c == 0) return 1;
    if (r == 1 && c == 1) return 1;
    if (r == 2 && c == 1) return 1;
    if (r > c)
        return GetHosoya(r - 1, c) + GetHosoya(r - 2, c);
    if (r == c)
        return GetHosoya(r - 1, c - 1) + GetHosoya(r - 2, c - 2);
    return 0;
}

Thursday, December 21, 2017

We were discussing the Bootstrap method, confidence intervals, and the accuracy of model parameters, especially under linearization of non-linear models.
A model may have parameters that correspond to M dimensions. Since each of these dimensions allows variation, the probability distribution is a function defined on the M-dimensional space. With this probability distribution, we choose a region that contains a high percentage of the total distribution relative to the selected model parameters. This region is called the confidence interval.
We were saying that a model that is made linear looks easy to interpret but it is not accurate especially when transforming non-linear models. We will talk more about this process but first let us consider the m-dimensional space.
Usually non-linear models are solved with an objective function such as the chi-square error function. The gradients of the chi-square function with respect to the parameters must approach zero at the minimum of the objective function. In the steepest descent method, the minimization proceeds iteratively until it stops decreasing.
The error function depends on the model parameters. The M-dimensions can be considered a surface of which we seek a minimum. If the dimensions are large in number, the model may be complex and the surface may have more than one minimum.  The optimal minimization will strive to find the true global minimum of the error surface and not just a local minimum. In order to converge to the global minimum, one technique is to vary the starting points or the initial guesses and compare the resulting model parameters.
The goodness of fit and the residuals plot are useful indicators along with the error function. Each gives helpful information. A model may be inaccurate and yet differ from the data only systematically. In this case, the correlation between the model and the data will be high. This correlation will mean that the fit is good, but to improve its accuracy, the fit must also result in a specific distribution of residuals. These residuals should be distributed along the independent axis and normally distributed around zero with no systematic trends. The latter condition makes the difference between the data point and the estimate, also called the residual, more acceptable.
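A sketch of the multiple-starting-point idea on a chi-square surface. Here scipy's general-purpose minimizer stands in for a hand-rolled steepest descent, and the exponential model, noise level, and starting guesses are illustrative:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = np.linspace(0, 2, 40)
y = 3.0 * np.exp(0.7 * x) + rng.normal(0, 0.1, x.size)
sigma = 0.1

def chi2(p):
    # chi-square error surface for the model y = a * exp(b * x)
    a, b = p
    return np.sum(((y - a * np.exp(b * x)) / sigma) ** 2)

# vary the starting guesses and compare the minima reached
for start in [(1.0, 0.1), (5.0, 1.5), (0.5, 2.0)]:
    res = minimize(chi2, start)
    print(start, '->', res.x.round(3), round(res.fun, 1))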

#codingexercise
Find the maximum product of an increasing subsequence
double GetProductIncreasingSubsequence(List<double> A, int n)
{
    Debug.Assert(A.All(x => x >= 0));
    var products = new List<double>();
    for (int i = 0; i < n; i++)
        products.Add(A[i]);
    Debug.Assert(products.Count == A.Count);

    for (int i = 1; i < n; i++)
        for (int j = 0; j < i; j++)
            if (A[i] > A[j] &&
               products[i] < (products[j] * A[i]))
                products[i] = products[j] * A[i];

    return products.Max();
}
A: 2 0 4

P: 2 0 8

Wednesday, December 20, 2017

We were discussing the Bootstrap method, confidence intervals, and the accuracy of model parameters. A model may have parameters that correspond to M dimensions. Since each of these dimensions allows variation, the probability distribution is a function defined on the M-dimensional space. With this probability distribution, we choose a region that contains a high percentage of the total distribution relative to the selected model parameters. This region is called the confidence interval. Depending on the degree of distribution chosen, confidence intervals may be quoted as levels in percentages. The region shape of the distribution may also be mentioned, such as ellipsoids.
As we work with model parameters, some rules of thumb come into view. For example, if we define a progressive rate constant, it cannot turn out to be negative. These facts should not be violated.
It should be noted that the confidence interval is tied to the model. If the model is transformed from non-linear to linear, the confidence interval changes. Such linearization is often cited as an oversimplification because it violates assumptions, leading to more error in the model parameter values.


We were saying that a model based on linearization is easy but it is not accurate especially when transforming non-linear models.
Usually non-linear models are solved with an objective function such as the chi-square error function. The gradients of the chi-square function with respect to the parameters must approach zero at the minimum of the objective function. In the steepest descent method, the minimization proceeds iteratively until it stops decreasing.

#codingexercise 
Find the square root of a number 
double sqrt(double n)
{
    // Babylonian method: repeatedly average the guess with n divided by the guess
    double x = n;
    double y = 1;
    double threshold = 1E-5;
    while (x - y > threshold)
    {
        x = (x + y) / 2;
        y = n / x;
    }
    return x;
}

Tuesday, December 19, 2017

Today we are resuming the discussion on model fitting and error estimation by Kleinstein and Hershberg
If we are not inclined towards error estimation, then we can attempt the Bootstrap method. This method uses the actual data set with its N points to generate synthetic data sets with just as many points. A synthetic set differs from the original in that a fraction of the original points is replaced with duplicates of other originals, as happens when drawing N points with replacement. Since the order of data points does not matter, the estimation can take place with the actual measurement noise. We can use the chi-square merit function for this purpose.
Next we review confidence intervals and the accuracy of model parameters. A model may have parameters that correspond to M dimensions. Since each of these dimensions allows variation, the probability distribution is a function defined on the M-dimensional space. With this probability distribution, we choose a region that contains a high percentage of the total distribution relative to the selected model parameters. This region is called the confidence interval. Depending on the degree of distribution chosen, confidence intervals may be quoted as levels in percentages. The region shape of the distribution may also be mentioned, such as ellipsoids. Generally we pick a region that is reasonably compact and centered around the point of reference of the parameter space. This region or band of data can be described as y = prctile(x, [5,95])
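A sketch of the Bootstrap resampling and the percentile band mentioned above (MATLAB's prctile corresponds to numpy's percentile); the straight-line model is illustrative:

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)

slopes = []
for _ in range(1000):
    # resample the (x, y) pairs with replacement; order does not matter
    idx = rng.integers(0, x.size, x.size)
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    slopes.append(slope)

# 90% confidence band for the slope, as in y = prctile(x, [5, 95])
print(np.percentile(slopes, [5, 95]))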
As we work with model parameters, some rules of thumb come into view. For example, if we define a progressive rate constant, it cannot turn out to be negative. Similarly, Poisson's ratio cannot turn out to exceed 0.5. This is the ratio of the proportional decrease in width to the ratio of the proportional increase in length as a material is stretched. Other examples might put both an upper and lower bound on the parameters. These facts should not be violated.
#codingexercise 
We talked about finding closest elements to a given value across three arrays. We introduced the notion of projections to find non-overlapping regions. We also mentioned that projections can be initiated from any sentinel values to split the ranges in the three arrays. The sentinel values we choose do not depend only on the start and the end of an array. Given any range of interest, it can be projected on all three arrays to split each array's range into array-specific, non-overlapping subarrays as well as the part overlapping the given range. This is very useful to shrink the indices for computations over a given range.

Also, note that range division is not the only benefit. We can also approximate computations in all three arrays by performing a similar operation within a projection.

Monday, December 18, 2017

Today we are resuming the discussion on model fitting and error estimation by Kleinstein and Hershberg. There are two ways by which we can select the appropriate model. The first is by observing trend lines which correspond to some well known mathematical formula. The second is by observing the underlying physical processes which contribute to the system. These physical interpretations contribute to model parameters.
In order to fit the data points, a model may use least squares of errors. The errors, called residuals, may be either positive or negative, so their plain sum is an inaccurate measure. Instead, the squares of the errors can be minimized to give a better fit.
We used least squares error minimization to fit the data points. Another way to do this is maximum likelihood estimation. This method asks: "Given my set of model parameters, what is the probability that this data set occurred?" This translates to a likelihood for the parameters given the data.
The chi-square error measure and maximum likelihood estimation are related. For a Gaussian distribution, maximizing the probability of the data set given the model parameters amounts to minimizing the negative natural log of the probability, which is the chi-square function of weighted residuals. Furthermore, if the variance is uniform, the chi-square function reduces to the sum of squared residuals.
Linear regression analysis strikes a relationship between a dependent variable and an independent variable. The best values of slope and intercept are found by taking partial derivatives of the error function with respect to them. Regression parameters are different from correlation coefficient. The latter describes the strength of association between two variables. It is therefore symmetric for the pair of variables unlike regression parameters.
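A sketch of the least-squares slope and intercept obtained from those partial derivatives, next to the symmetric correlation coefficient, with illustrative data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# closed form from setting the partial derivatives of the squared error to zero
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# the correlation coefficient is symmetric in x and y, unlike the regression parameters
r = np.corrcoef(x, y)[0, 1]
print(slope, intercept, r)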
If we are not inclined towards error estimation, then we can attempt the Bootstrap method. This method uses the actual data set with its N points to generate synthetic data sets with just as many points. A synthetic set differs from the original in that a fraction of the original points is replaced with duplicates of other originals, as happens when drawing N points with replacement. Since the order of data points does not matter, the estimation can take place with the actual measurement noise.
#codingexercise
We talked about finding closest elements to a given value across three arrays. We introduced the notion of projections to find non-overlapping regions. In the same spirit, we note that projections can be initiated from any sentinel values.

Sunday, December 17, 2017

#codingexercise
We are given three sorted arrays and we want to find one element from each array such that they are closest to each other. One of the ways to do this: we traverse all three arrays while keeping track of the smallest range (maximum minus minimum) encountered over candidate sets of three elements. We traverse by choosing one element at a time in any one array, incrementing the index of the array with the minimum value.
By advancing only the minimum element, we make sure the sweep is progressive and exhaustive.
List<int> GetClosest(List<int> A, List<int> B, List<int> C)
{
var ret = new List<int>();
int i = 0;
int j = 0;
int k = 0;
int diff = int.MaxValue;
while ( i < A.Count && j < B.Count && k < C.Count)
{
   var candidates = new List<int>() { A[i], B[j], C[k] };
   int range = Math.Abs(candidates.Min() - candidates.Max());
   if ( range < diff)
   {
       diff = range;
       ret = candidates.ToList();
   }
   if (range == 0) return ret;
   if (candidates.Min() == A[i])
   {
       i++;
   } else if (candidates.Min() == B[j])
   {
       j++;
   } else {
      k++;
   }
}
return ret;

}
We don't have to sweep for the average because we can enumerate all the elements of the arrays to find the global average. Then we can find the elements closest to it in each of the three arrays.
List<int> GetAveragesFromThreeSortedArrays(List<int> A, List<int> B, List<int> C)
{
    // Concat keeps duplicates, unlike Union, so the average weights every element
    var combined = A.Concat(B).Concat(C).ToList();
    int average = (int)combined.Average();
    return GetClosestToGiven(A, B, C, average);
}
List<int> GetClosestToGiven(List<int> A, List<int> B, List<int> C, int value)
{
    Debug.Assert(A.Count > 0 && B.Count > 0 && C.Count > 0);
    var ret = new List<int>();
    ret.Add(GetClosest(A, value)); // using binary search
    ret.Add(GetClosest(B, value));
    ret.Add(GetClosest(C, value));
    return ret;
}
Here we simply do a binary search on each sorted list to find the element closest to the given value. The closest element could lie on either side of the given value in the sorted sequence.
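The single-list GetClosest helper referenced above is not defined in the post; a sketch of it, here in Python, binary searches for the insertion point and then compares its two neighbors:

import bisect

def get_closest(a, value):
    # a is sorted; the closest element is adjacent to the insertion point
    i = bisect.bisect_left(a, value)
    if i == 0:
        return a[0]
    if i == len(a):
        return a[-1]
    before, after = a[i - 1], a[i]
    return before if value - before <= after - value else after

print(get_closest([1, 4, 6, 9], 5))  # 4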

Saturday, December 16, 2017



We were discussing the benefits of client-server separation of computing so that client-specific improvements can happen independently of the server. While server-side user controls, grids, and views have been tremendously popular and continue to evolve with Model-View-Controller and Model-View-ViewModel techniques along with templates and rendering engines, they are mainly for the convenience of organization and the developers, because all the code and resources are available at once for mix and match in any page to be rendered. On the other hand, client-side JavaScript can take care of most of these requests with its own libraries while working in a platform-independent and browser-independent way. If we give the resources to the client-side library and ask the client to display the page, the server doesn't need to do this work, which also facilitates development on the client side. While HTML5 and client-side scripts have preferred brevity in their source, the mix-and-match processing, when offloaded to the client, can still be brief as long as all the resources are available as a bundle for the duration of the page or session.

One of the advantages of client-side scripting is the significant amount of development support that we get from browsers and plugins. Some security is better enforced from the server side, but computations on public resources can easily be dedicated to the client side.

Let us look at a few of the security concerns in client-side technologies. For example, most scripting on the client side is done with JavaScript. One of the most common concerns with JavaScript is that it can be injected by third parties. Another concern is that it is often used to append or prepare HTML with string concatenations. Even server-side HTML injection has been shown to be vulnerable, but the client side is notorious. Second, browsers have to enforce cross-origin resource sharing, which previously allowed cross-domain Ajax requests. Third, the processing is utterly visible to all. The XHR API shows not only the data that is exchanged but also the calls made. A view of the page source is enough to show how cluttered the client side can get.
One of the most common vulnerabilities seen with client-side scripting is that input and injection attacks are not thwarted by way of mitigation. Although HTTP headers are used to provide some amount of security, they rely on best practice at both origination and destination. Even SQL can be injected directly through the user interface. Locale and geographical customizations often make it harder to enforce a rule, not to mention user agents, browsers, apps, and devices. While HTML5, JSON, and CSS have each addressed this with their own separation of concerns and brevity, much of the script is still left to the programmer to organize. Technologies on this end tried to innovate on functionality with Flash and ActiveX, but users had to disable some of these for the sake of security. Instead we could come up with newer organization, more brevity, and separation of concerns within resources and scripts so that we have far more security and less attack surface.

Friday, December 15, 2017

If a web page is viewed as HTML, text usually appears as a span within a div. These divs only indicate the position; their content is decided by the rendering engine, such as Razor. The literals in the HTML are looked up based on an identifier that acts as an alias. The same alias may be used for the translations of the string into different languages. Since the (alias, locale) pair is unique, we know that there will be a unique translation for the string. When the view is prepared, these literals are pulled from the database. Usually this preparation happens on the server side. If the literals were assets such as CSS, JavaScript, or static HTML, they could be served by the content delivery network, which ensures that they are as geographically close as possible to the user. Strings are resources too, but they don't appear as resources on the CDN because a computation is involved. Technically, each string could be placed in its own static file, and the sections of the HTML that display the content could serve the literal by loading that file. Round trips to a CDN are generally not considered expensive, but given that there may be many literals on a page, this could become expensive. Each CDN may also need to keep its own copy of the literals, which means we lose the single point of maintenance we had on the server side. Moreover, updates and deletes of a string involve making changes to all CDNs. Still, it is not impossible to offload the literals as a resource so that the HTML can load them on the client side without having the server incur the work.
The above discussion merely marks a benefit of client-server separation of computing so that client-specific improvements can happen independently of the server. While server-side user controls, grids, and views have been tremendously popular and continue to evolve with Model-View-Controller and Model-View-ViewModel techniques along with templates and rendering engines, they are mainly for the convenience of organization and the developers, because all the code and resources are available at once for mix and match in any page to be rendered. On the other hand, client-side JavaScript can take care of most of these requests with its own libraries while working in a platform-independent and browser-independent way. If we give the resources to the client-side library and ask the client to display the page, the server doesn't need to do this work, which also facilitates development on the client side. While HTML5 and client-side scripts have preferred brevity in their source, the mix-and-match processing, when offloaded to the client, can still be brief as long as all the resources are available as a bundle for the duration of the page or session.

Thursday, December 14, 2017

We were discussing SDK and command line interface (CLI) options in addition to the APIs, which improve the customer base.
Let us now talk about locale. Every user belongs to a geography and hence has a preferred language for the login page. When the user is displayed a web page, most of the strings on the page are translated to the language of his or her choice. These translations can be made available with the user interface as language packs so that the corresponding translations show up on the page. Care must be taken that translations exist for every locale supported by the user interface. Also, when the language specified is invalid or a translation cannot be found by its identifier, the corresponding default strings from the English language can be displayed.
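A minimal sketch of the (alias, locale) lookup with the English fallback described above; the table contents are illustrative:

# translations keyed by the unique (alias, locale) pair; English is the default
translations = {
    ("greeting", "en"): "Hello",
    ("greeting", "fr"): "Bonjour",
}

def localize(alias, locale):
    # fall back to the English string when the locale or the translation is missing
    return translations.get((alias, locale),
                            translations.get((alias, "en"), alias))

print(localize("greeting", "fr"))  # Bonjour
print(localize("greeting", "de"))  # Hello (fallback to English)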
If a web page is viewed as HTML, text usually appears as a span within a div. These divs only indicate the position; their content is decided by the rendering engine, such as Razor. The literals in the HTML are looked up based on an identifier that acts as an alias. The same alias may be used for the translations of the string into different languages. Since the (alias, locale) pair is unique, we know that there will be a unique translation for the string. When the view is prepared, these literals are pulled from the database. Usually this preparation happens on the server side. If the literals were assets such as CSS, JavaScript, or static HTML, they could be served by the content delivery network, which ensures that they are as geographically close as possible to the user. Strings are resources too, but they don't appear as resources on the CDN because a computation is involved. Technically, each string could be placed in its own static file, and the sections of the HTML that display the content could serve the literal by loading that file. Round trips to a CDN are generally not considered expensive, but given that there may be many literals on a page, this could become expensive. Each CDN may also need to keep its own copy of the literals, which means we lose the single point of maintenance we had on the server side. Moreover, updates and deletes of a string involve making changes to all CDNs. Still, it is not impossible to offload the literals as a resource so that the HTML can load them on the client side without having the server incur the work.

Wednesday, December 13, 2017

We were discussing SDK and command line interface (CLI) options in addition to the APIs, which improve the customer base. Tests and automations can make use of commands and scripts to drive the service. This is a very powerful technique for administrators as well as for day-to-day usage by end users. Command line usage has one additional benefit: it can be used for troubleshooting and diagnostics. While SDKs provide an additional layer and are hosted on the client side, a command line interface may be available on the server side and provides fewer variables to go wrong. With detailed logging and request-response capture, CLI and SDK help ease the calls made to the services.
Let us now consider applying the SDK and CLI to something that is inherently UI specific, say the login page. The login page has a few restrictions that other user interfaces probably don't have. First, it is central to the company where the login happens. People log in because they trust the page they are on. This means the login page is inherently tied to the domain the web application is hosted in. Even if a third-party application wants to allow its users to log in, it will redirect the users to log in on the company page. Since a user interface is involved in specifying username and password, it might not seem appropriate to delegate it to a command line or SDK, because that would involve sending the credentials in the clear. However, transmitting the credentials can be done over an encrypted channel as they are routed from one service to another. Besides, we have never intended the user to provide the credentials anywhere other than the login page. The suggestion is merely to decouple the UI from being in the response body to being in a standalone web view that calls a service. Since both the service and the UI need to be available and hosted on the same domain, it doesn't matter whether the credentials get passed from the UI to the service via a form submit or via a relay from one service to another. The UI and service for the sign-in are merely shifting boundaries between the app and the service.
Second, users are accustomed to not disclosing their credentials. These work differently from key-value pairs. Many cloud providers use key-value pairs in their SDKs and CLIs. They know that their services, such as compute and storage, will be used in applications and services, and therefore they find it easier to ship SDKs and CLIs. A user interface is a frontend only to a user, not to an application or a digital asset. Therefore, providing an SDK or CLI does not hold as much appeal as a login page. At the same time, separating out the view for the user and focusing on the internal architecture with the help of an SDK and CLI helps bring the best practice from the cloud to retail.
Lastly, users are comforted knowing that they are signing in to a look and feel they are familiar with. Therefore a login page is a frontend that users expect not to change. By the very same argument, if we pull the user interface out of the API response and let the UI be part of an MVC that makes internal service calls underneath, then we are less likely to make changes to the login page as the service gets revisioned. Moreover, the customizations that we need to make for third parties, clients, and end users no longer need to run as deep as the service.


#codingexercise


We were discussing finding closest numbers between the elements of three sorted integer arrays. 

Another usage of finding the closest numbers in these arrays to a given input value is to discover the overlapping range of sorted numbers. To do this on the number line, we could choose max(start1, start2, start3) and min(end1, end2, end3) from the (start, end) pairs of the three arrays. By finding the elements closest to these min and max values, we can determine the overlapping range's projection on any one of the three arrays. Then we can choose the projection with the smallest number of elements.

Determining the overlapping region projections helps shrink the search space for any operation involving any or all of the three arrays. It may not be useful for sum, average, mode, or median, but it is helpful for partitioning the number line to determine the presence of a number in any array. If the number is not in the overlapping range, then we can exclude one of the arrays and binary search for it outside the projections in the other two arrays.


  1. One such usage goes back to determining the closest numbers to a given value. If the number is not included in the projections, we know that we can search in a smaller space, as the sketch below shows.
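A sketch of these projections on the number line, with illustrative arrays; the overlap comes from the max of the starts and the min of the ends, and binary search gives each array's projected index range:

import bisect

def overlap(A, B, C):
    # overlapping range across the three sorted arrays
    return max(A[0], B[0], C[0]), min(A[-1], B[-1], C[-1])

def project(arr, lo, hi):
    # index range of the projection of [lo, hi] onto a sorted array
    return bisect.bisect_left(arr, lo), bisect.bisect_right(arr, hi)

A = [1, 5, 9, 14]
B = [4, 7, 12]
C = [6, 8, 10, 20]
lo, hi = overlap(A, B, C)    # (6, 12)
print(project(A, lo, hi))    # (2, 3): only A[2] = 9 falls inside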