Sunday, December 31, 2017

We resume our discussion on Monte Carlo Simulation of Synthetic Data Sets. This is a powerful technique. It assumes that if the fitted parameter set a0, obtained by minimizing the chi-square, is a reasonable estimate of the true parameters, then the distribution of differences of subsequently fitted parameters from a0 should be similar to the distribution from the corresponding calculation with the true parameters.
The assumed initial parameter set a0, together with noise, can be used to generate new data sets at the same values of the independent axis coordinate as the actual data set. The synthetic data aligns with the actual data at these reference points. The interesting part is that each new data set has the same relationship to a0 as the actual data has to the true parameter set.
We can generate many such data sets, which are called synthetic data sets. For each such data set we fit the model and obtain the corresponding parameter set. This yields one data point for the difference between the parameters fitted to the synthetic data and the initial set a0. The simulation can be run thousands of times to generate sufficient data points, leading to an M-dimensional probability distribution of the parameter differences, from which we can even estimate the standard deviation of each originally fitted parameter.
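A minimal sketch of this procedure in Python, assuming a straight-line model with Gaussian noise; the true parameters (1, 2), the noise level 0.5, and the 1000-trial count are illustrative assumptions:

```python
import random
import statistics

def fit_line(xs, ys):
    # closed-form least-squares fit of y = a + b*x
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

random.seed(42)
xs = [float(i) for i in range(20)]
# "actual" data from hypothetical true parameters a=1, b=2 plus noise
ys = [1.0 + 2.0 * x + random.gauss(0, 0.5) for x in xs]
a0, b0 = fit_line(xs, ys)   # a0, b0 play the role of the fitted parameter set

# generate synthetic data sets at the same x values, refit each one,
# and record the difference of the refitted parameters from (a0, b0)
diffs = []
for _ in range(1000):
    ys_syn = [a0 + b0 * x + random.gauss(0, 0.5) for x in xs]
    a1, b1 = fit_line(xs, ys_syn)
    diffs.append((a1 - a0, b1 - b0))

# the spread of the differences estimates the uncertainty of the fit
sigma_a = statistics.stdev(d[0] for d in diffs)
sigma_b = statistics.stdev(d[1] for d in diffs)
```

The distribution of the recorded differences stands in for the unavailable distribution around the true parameters.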
#codingexercise 
Given two natural numbers n and m, find the number of ways in which numbers greater than or equal to m can be added to get the sum n.
For example, with n = 3 and m = 1
we have three ways: 1 + 1 + 1, 1 + 2, 3
int GetCountWays(int n, int m)
{
    // dp[i, k] = number of ways to write i as a sum of parts >= k
    int [,] dp = new int[n+2, n+2];
    dp[0, n+1] = 1;

    for (int k = n; k >= m; k--) {
        for (int i = 0; i <= n; i++) {
            dp[i, k] = dp[i, k+1];
            if (i - k >= 0)
                dp[i, k] = dp[i, k] + dp[i-k, k];
        }
    }
    return dp[n, m];
}
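The same dynamic program can be sketched in Python; dp[i][k] counts the ways to write i as a sum of parts that are at least k:

```python
def count_ways(n, m):
    # dp[i][k] = number of ways to write i as a sum of parts >= k
    dp = [[0] * (n + 2) for _ in range(n + 2)]
    dp[0][n + 1] = 1
    for k in range(n, m - 1, -1):         # parts from n down to m
        for i in range(n + 1):
            dp[i][k] = dp[i][k + 1]       # skip part k entirely
            if i - k >= 0:
                dp[i][k] += dp[i - k][k]  # use part k at least once
    return dp[n][m]
```

For example, count_ways(5, 2) counts {5} and {3, 2}.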

Saturday, December 30, 2017

In continuation of the hashing of stacktrace exceptions discussed in the previous post:
#! /usr/bin/python
import hashlib
hashes = {}
distinct_stacks = []
def hash(data):
      # unix command md5sum
      # print("StackTrace with %s bytes and md5=%s" % (len(data),hashlib.md5(data).hexdigest()))
      return hashlib.md5(data).hexdigest()
      # StackTrace with 1244 bytes and md5=1f369406ec1243e5bc3aead95e97f380

# strip a stack before comparison, canonicalize if necessary:
def prepare(stk):
    if stk:
        lines = stk.split('\n')
        ret = [x.split('(')[0].strip().replace(" ","") for x in lines]
        if ret:
            return ret

def compare(stk1, stk2):
    lines1 = stk1.split('\n')
    lines2 = stk2.split('\n')
    if len(lines1) == 0 or len(lines2) == 0:
        return False
    limit = min(len(lines1), len(lines2))
    for i in range(limit):
        line1 = lines1[i].split('(')[0].strip().replace(" ", "")
        line2 = lines2[i].split('(')[0].strip().replace(" ", "")
        if line1.lower() != line2.lower():
            return False
        if i >= 2:
            # the first three frames match; treat the stacks as equal
            return True
    return True

def count(stk, hashes, distinct_stacks):
    prepared = prepare(stk)
    if prepared:
        prepared = '\n'.join(prepared)
        hashed = hash(prepared)
        if hashed in hashes:
            hashes[hashed] += 1
        else:
            distinct_stacks += [prepared]
            hashes[hashed] = 1

def parse(filename, hashes, distinct_stacks):
    with open(filename, 'r') as fin:
        lines = fin.readlines()
        for line in lines:
            if "Exception:" in line:
                stk = '\n'.join(line.split("at"))
                count(stk, hashes, distinct_stacks)


###
>>> hashes = {}
>>> distinct_stacks = []
>>> stackhasher.parse('stacktrace.txt', hashes, distinct_stacks)
>>>
>>> hashes
{'<md5 of the prepared stack>': 1}
>>> distinct_stacks
['com.devdaily.sarah.tests.TrySuccessFailure$AlsException:Bummer!\n']
>>>
###

Friday, December 29, 2017

Display of Stack-Trace Statistics in daily runs and production environments: 
Introduction: Log analysis for error detection and reporting is an important activity for any commercial deployment. Together with powerful log scan and search techniques, log analysis, both manual and automated, is useful for the smooth running of any environment. Many in-house and off-the-shelf commercial log analysis tools come with rich interfaces in the form of query-language operators as well as dashboards to help the end users. However, the use of these query tools and their frequency is left to the discretion of the user. Most log-based investigations are done on a case-by-case basis and therefore do not cover the overall hashes and counts of exception stack-traces. This write-up introduces an automated stack-trace hasher over log scans as a useful addition to these statistics.
Description: The conventional mode of querying displays graphs based on exceptions and introduces the notion of filters that select exceptions matching one or more attributes, usually in the form of keywords. Existing dashboards then serve up these selections in the same manner that they do all exceptions, on a continuous timeline.
On the other hand, exceptions occurring in one code path often manifest themselves more than once. Depending on the frequency, these exceptions and their counts indicate the health of the component in the service that is logging them. While a dashboard of archived exceptions may be available to view with pan and zoom, the statistics between service upgrades are generally missing. This gap can be closed by hashing exception stack-traces, taking snapshots of their running counts between upgrades, and including them in the summaries. Such health indicators will be very helpful even in test runs of the service in various environments. Moreover, previously unnoticed exceptions now get visibility, and defect tracking may open defects against the component based on these exceptions.
Design:  
As exceptions arrive, they are hashed and counted. This results in a frequency table that can then be displayed. We utilize both the exception-frequency table and an inverse lookup from hash to exception occurrence for best usage.
Languages: Sample stack-trace hashing utilities are available in Python and Java, such as exhash. Taking a set of class names, methods, and optionally offsets as frame entries, and creating a hash from a stack-trace using only classes and methods, will help with this tally.
Conclusion: Utilizing a stack trace hasher in addition to test run reports or service monitoring will be helpful as a health indicator for the different components in a service. 
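The frequency table and the inverse occurrence lookup described in the design can be sketched as follows; this is a minimal standalone illustration, and the frame strings are made-up examples:

```python
import hashlib

def tally(stacks):
    # map md5 -> running count, and md5 -> one representative stack (inverse lookup)
    counts = {}
    inverse = {}
    for stk in stacks:
        digest = hashlib.md5(stk.encode('utf-8')).hexdigest()
        counts[digest] = counts.get(digest, 0) + 1
        inverse.setdefault(digest, stk)
    return counts, inverse

counts, inverse = tally(["a.b.C.m\na.b.D.n", "a.b.C.m\na.b.D.n", "x.y.Z.q"])
```

With the two lookups together, a dashboard can show the top hashes by count and still recover a representative stack-trace for each hash.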
#codingexercise
Find the Newman-Shanks-Williams prime.
The NSW numbers are defined by the recurrence relation:
F0 = 1
F1 = 1
Fn =  2 * Fn-1 + Fn-2
double GetNSWPrime(double n)
{
    if (n < 0) 
        return 0;

    if (n == 0 || n == 1)
        return 1;

    return 2 * GetNSWPrime(n - 1) + GetNSWPrime(n - 2);

}
>>> def p(n):
...    if (n==0) or (n==1):
...       return 1
...    return 2*p(n-1)+p(n-2)
...
>>> p(3)
7
>>> p(5)
41
>>> p(7)
239
>>>
Not every term of this sequence is prime; the NSW primes are the primes that occur in it, such as 7, 41, and 239 above. The recurrence depends only on the two preceding terms, n-1 and n-2.
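An iterative version in Python avoids the exponential recursion above; this is a sketch of the same recurrence:

```python
def nsw(n):
    # n-th term of the NSW sequence: F0 = F1 = 1, Fn = 2*Fn-1 + Fn-2
    a, b = 1, 1
    for _ in range(n):
        a, b = b, 2 * b + a
    return a
```

This runs in O(n) time instead of the exponential time of the plain recursion.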

Thursday, December 28, 2017

We were reviewing Monte Carlo Simulation of Synthetic Data Sets. This is a powerful technique. It assumes that if the fitted parameter set a0, obtained by minimizing the chi-square, is a reasonable estimate of the true parameters, then the distribution of differences of subsequently fitted parameters from a0 should be similar to the distribution from the corresponding calculation with the true parameters.

#codingexercise
Find the Delannoy number.
This number counts the paths from the southwest corner (0,0) of a rectangular grid to the northeast corner (m,n), using only single steps north, northeast, or east.
In a 3 by 3 square there are 63 such paths from the bottom left to the top right corner.
The recursion for this is calculated as
D(m,n) = 1 if m == 0 or n == 0
D(m,n) = D(m-1, n) + D(m-1, n-1) + D(m, n-1) otherwise, one term for each of the three permitted steps

int GetDelannoy(int n, int m)
{
    if (m == 0 || n == 0)
        return 1;

    return GetDelannoy(m - 1, n) +
               GetDelannoy(m - 1, n - 1) +
               GetDelannoy(m, n - 1);
}
>>> def d(m,n):
...     if ((m==0) or (n==0)):
...         return 1
...     return d(m-1,n)+d(m-1,n-1)+d(m,n-1)
...
>>> print d(3,3)
63
>>> print d(4,4)
321
The recursion tree has a maximum depth of m + n
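Since the plain recursion revisits the same (m, n) pairs many times, a memoized sketch in Python keeps the same recurrence while running in O(m*n):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def delannoy(m, n):
    # D(m, n) with memoization; the plain recursion is exponential
    if m == 0 or n == 0:
        return 1
    return delannoy(m - 1, n) + delannoy(m - 1, n - 1) + delannoy(m, n - 1)
```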

Wednesday, December 27, 2017

We were discussing how to compare two models. For example, the number of data points must be larger than the number of parameters. If we increase the number of parameters, the goodness of fit improves and the chi-square decreases, so the model with more parameters appears to do better. This trade-off is articulated in a metric called the mean squared error, which is the chi-square divided by the degrees of freedom. Because MSE uses these as numerator and denominator, it represents a trade-off between an over-parameterized model and a lower chi-square. A model with fewer parameters but a lower chi-square is preferred.
We will review Monte Carlo Simulation of Synthetic Data Sets shortly. This is a powerful technique. It assumes that if the fitted parameter set a0, obtained by minimizing the chi-square, is a reasonable estimate of the true parameters, then the distribution of differences of subsequently fitted parameters from a0 should be similar to the distribution from the corresponding calculation with the true parameters.
#codingexercise
Find the Lobb Number. This counts the number of ways n+m open parentheses can be arranged to form the start of a valid sequence of balanced parentheses.
double GetLobb(double n, double m)
{
    return ((2 * m + 1) * GetNChooseK(2 * n, m + n)) / (m + n + 1);
}
double GetNChooseK(double n, double k)
{
    if (k < 0 || k > n) return 0;
    if (n < 0) return 0;
    return Factorial(n) / (Factorial(n-k) * Factorial(k));
}
double Factorial(double n)
{
    if (n <= 1) return 1;
    return n * Factorial(n-1);
}
double GetNChooseKDP(double n, double k)
{
    if (n < 0 || k < 0) return 0;
    if (k == 0 || k == n)
        return 1;
    return GetNChooseKDP(n-1, k-1) + GetNChooseKDP(n-1, k);
}

The NChooseK is also the binomial coefficient.
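A Python sketch of the same formula, using a multiplicative n-choose-k to stay in exact integer arithmetic instead of the factorial ratio above:

```python
def binomial(n, k):
    # multiplicative n-choose-k, avoiding large intermediate factorials
    if k < 0 or k > n:
        return 0
    k = min(k, n - k)
    result = 1
    for i in range(k):
        result = result * (n - i) // (i + 1)
    return result

def lobb(m, n):
    # L(m, n) = (2m + 1) / (m + n + 1) * C(2n, m + n)
    return (2 * m + 1) * binomial(2 * n, m + n) // (m + n + 1)
```

Note that L(0, n) reduces to the n-th Catalan number.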

Tuesday, December 26, 2017

We were discussing how to compare two models. For example, the number of data points must be larger than the number of parameters. If we increase the number of parameters, the goodness of fit improves and the chi-square decreases, so the model with more parameters appears to do better. This trade-off is articulated in a metric called the mean squared error, which is the chi-square divided by the degrees of freedom. Because MSE uses these as numerator and denominator, it represents a trade-off between an over-parameterized model and a lower chi-square. A model with fewer parameters but a lower chi-square is preferred.

#codingexercise
Find the Eulerian number E(n,m). This is the number of permutations of the numbers from 1 to n in which exactly m elements are greater than the previous element.
For example, with n = 3 and m = 1, there are 4 permutations in which exactly 1 element is greater than the previous element:
1 2 3  => 1,2 and 2,3 resulting in count 2 (invalid)
1 3 2 => 1,3 resulting in count 1 (valid)
2 1 3 => 1,3 resulting in count 1 (valid)
2 3 1 => 2,3 resulting in count 1 (valid)
3 1 2 => 1,2 resulting in count 1 (valid)
(The remaining permutation, 3 2 1, has count 0 and is also invalid.)

int GetEulerian(int n, int m)
{
    if (m >= n || n == 0)
        return 0;

    if (m == 0)
        return 1;

    return (n - m) * GetEulerian(n - 1, m - 1) +
           (m + 1) * GetEulerian(n - 1, m);

}

We can easily try this for n = 3, m = 1:
(3,1)
2 x (2,0) + 2 x (2,1)
2 + 2 x ( 1x (1,0) +2x (1,1))
2 + 2
4
The recursive formula uses the notion that, as n reduces, m either reduces or stays the same.
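The same recurrence can be sketched in Python and checked against the worked example above:

```python
def eulerian(n, m):
    # E(n, m): permutations of 1..n with exactly m elements
    # greater than the previous element
    if m >= n or n == 0:
        return 0
    if m == 0:
        return 1
    return (n - m) * eulerian(n - 1, m - 1) + (m + 1) * eulerian(n - 1, m)
```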

Monday, December 25, 2017

We were discussing how to compare two models. For example, the number of data points must be larger than the number of parameters. If we increase the number of parameters, the goodness of fit improves and the chi-square decreases, so the model with more parameters appears to do better. This trade-off is articulated in a metric called the mean squared error, which is the chi-square divided by the degrees of freedom. Because MSE uses these as numerator and denominator, it represents a trade-off between an over-parameterized model and a lower chi-square. A model with fewer parameters but a lower chi-square is preferred.
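A minimal sketch of this comparison; the residuals, sigmas, and parameter counts below are illustrative assumptions, not fitted values:

```python
def reduced_chi_square(residuals, sigmas, num_params):
    # chi-square divided by degrees of freedom (N data points - M parameters)
    chi2 = sum((r / s) ** 2 for r, s in zip(residuals, sigmas))
    dof = len(residuals) - num_params
    return chi2 / dof

# hypothetical fits of the same 10 data points
res_a = [0.5] * 10    # 2-parameter model, larger residuals
res_b = [0.45] * 10   # 5-parameter model, slightly smaller residuals
sig = [0.5] * 10
mse_a = reduced_chi_square(res_a, sig, 2)
mse_b = reduced_chi_square(res_b, sig, 5)
```

Here the extra parameters do not pay for the small drop in chi-square: mse_a comes out lower, so the 2-parameter model is preferred.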
#codingexercise
Find the Entringer number for a given n, k.
The Entringer number E(n, k) counts the permutations of the digits 1 to n+1 that start with k+1 and afterwards first fall, then alternately rise and fall. For example, the E(4, 2) = 4 permutations are:
32415
32514
31425
31524
int GetCountEntringer(int n, int k)
{
    if (n == 0 && k == 0) return 1;
    if (k == 0) return 0;
    return GetCountEntringer(n, k-1) + GetCountEntringer(n-1, n-k);
}
From the Wolfram definition of the Entringer numbers, expanding E(4,2) bottom-up:
E(1,1) = E(1,0) + E(0,0) = 0 + 1 = 1
E(2,1) = E(2,0) + E(1,1) = 0 + 1 = 1
E(2,2) = E(2,1) + E(1,0) = 1 + 0 = 1
E(3,1) = E(3,0) + E(2,2) = 0 + 1 = 1
E(3,2) = E(3,1) + E(2,1) = 1 + 1 = 2
E(3,3) = E(3,2) + E(2,0) = 2 + 0 = 2
E(4,1) = E(4,0) + E(3,3) = 0 + 2 = 2
E(4,2) = E(4,1) + E(3,2) = 2 + 2 = 4
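A memoized Python sketch of the same recurrence, checked against E(4, 2) = 4:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def entringer(n, k):
    # E(n, k) = E(n, k-1) + E(n-1, n-k), with E(0, 0) = 1
    # and E(n, 0) = 0 for n > 0
    if n == 0 and k == 0:
        return 1
    if k == 0:
        return 0
    return entringer(n, k - 1) + entringer(n - 1, n - k)
```

The diagonal entries E(n, n) give the Euler zigzag numbers.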