Tuesday, January 2, 2018

We resume our discussion of Monte Carlo simulation of synthetic data sets. This is a powerful technique: it generates samples that resemble the actual data set, and a large number of generated samples lets us get closer to the global minimum. For example, when estimating the value of pi, the computed value gets very close to the actual value once we have a large number of samples. Monte Carlo assumes that if the fitted parameter set a0 is a reasonable estimate of the true parameters, obtained by minimizing the chi-square, then the distribution of differences of subsequently fitted parameters from a0 should be similar to the corresponding distribution around the true parameters.
The assumed initial parameter set a0, together with noise, can be used to generate new data sets at the same values of the independent-axis coordinate as the actual data set. The synthetic data aligns with the actual data at these reference points. The interesting part is that each new data set has the same relationship to a0 that the actual data has to the true parameter set.
We can generate many such data sets; they are called synthetic data sets. For each one, we fit the model and obtain a corresponding parameter set. This yields one data point for the difference between the parameters fitted to the synthetic data and the initial parameter set a0. The simulation can be run thousands of times to generate sufficient data points, which leads to an M-dimensional probability distribution over the parameter differences. From the accumulated differences we can even calculate the standard deviation of the fitted parameters.
Historically, simulations were used to test a previously understood deterministic problem, with sampling used to introduce uncertainties into the simulation. Monte Carlo simulation inverts this approach: it tries to solve deterministic problems with a probabilistic interpretation, drawing samples from a probability distribution. Simulated annealing, which we saw earlier, is a special case of Monte Carlo.
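The pi example mentioned above can be sketched in a few lines of Python; the sample count and seed below are arbitrary choices for illustration:

```python
import random

def estimate_pi(samples, seed=42):
    # Draw points uniformly in the unit square; the fraction landing
    # inside the quarter circle of radius 1 approaches pi/4.
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples
```

With 100,000 samples the estimate typically lands within a few hundredths of the true value, and the error shrinks as the sample count grows.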
#codingexercise
Count the number of ways to reach the n'th stair using steps of 1, 2 or 3.
int GetCount(int n)
{
    switch(n)
    {
        case 0: return 0;
        case 1: return 1;
        case 2: return 2;
        case 3: return 4;
        default:
            return GetCount(n - 3) +
                   GetCount(n - 2) +
                   GetCount(n - 1);
    }
}
The above can also be written with if-then-else statements.
f(4) = f(1) + f(2) + f(3) = 1 + 2 + 4 = 7
enumerated as
1 1 1 1
1 2 1
1 1 2
2 1 1
2 2
1 3
3 1
We can similarly show f(5) = f(2) + f(3) + f(4) = 2 + 4 + 7 = 13.
Note that this dp could also be written with memoization or with a bottom-up computation.
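The bottom-up version mentioned above might be sketched as follows in Python (here stair 0 is given the base value 1 so the recurrence composes cleanly, which differs from the case 0 of the switch statement above):

```python
def count_stairs(n):
    # dp[i] = number of ways to reach stair i using steps of 1, 2 or 3;
    # dp[0] = 1 counts the single way of standing at the base.
    if n < 0:
        return 0
    dp = [0] * (n + 1)
    dp[0] = 1
    for i in range(1, n + 1):
        dp[i] = dp[i - 1]
        if i >= 2:
            dp[i] += dp[i - 2]
        if i >= 3:
            dp[i] += dp[i - 3]
    return dp[n]
```

This runs in O(n) time and reproduces f(4) = 7 and f(5) = 13 from the enumeration above.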

Monday, January 1, 2018

Happy New Year
#codingexercise
Generate the nth Newman-Conway sequence number. The sequence begins
1 1 2 2 3 4 4 4 5 6 7 7
It is defined by the recurrence:
P(n) = P(P(n - 1)) + P(n - P(n - 1))
with base cases
P(1) = 1
P(2) = 1

int GetNCS(int n)
{
    if (n == 1 || n == 2)
        return 1;
    return GetNCS(GetNCS(n - 1)) + GetNCS(n - GetNCS(n - 1));
}
n = 3: P(P(2)) + P(3 - P(2)) = P(1) + P(2) = 2
n = 4: P(P(3)) + P(4 - P(3)) = P(2) + P(2) = 2
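Because the plain recursion recomputes the same subproblems many times, a memoized sketch in Python keeps each P(k) to a single computation:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def newman_conway(n):
    # P(1) = P(2) = 1; P(n) = P(P(n-1)) + P(n - P(n-1))
    if n <= 2:
        return 1
    prev = newman_conway(n - 1)
    return newman_conway(prev) + newman_conway(n - prev)
```

Printing the first twelve terms reproduces the sequence 1 1 2 2 3 4 4 4 5 6 7 7 quoted above.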

Sunday, December 31, 2017

We resume our discussion of Monte Carlo simulation of synthetic data sets. This is a powerful technique. It assumes that if the fitted parameter set a0 is a reasonable estimate of the true parameters, obtained by minimizing the chi-square, then the distribution of differences of subsequently fitted parameters from a0 should be similar to the corresponding distribution around the true parameters.
The assumed initial parameter set a0, together with noise, can be used to generate new data sets at the same values of the independent-axis coordinate as the actual data set. The synthetic data aligns with the actual data at these reference points. The interesting part is that each new data set has the same relationship to a0 that the actual data has to the true parameter set.
We can generate many such data sets; they are called synthetic data sets. For each one, we fit the model and obtain a corresponding parameter set. This yields one data point for the difference between the parameters fitted to the synthetic data and the initial parameter set a0. The simulation can be run thousands of times to generate sufficient data points, which leads to an M-dimensional probability distribution over the parameter differences. From the accumulated differences we can even calculate the standard deviation of the fitted parameters.
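The procedure can be sketched for a straight-line fit as follows; the true parameters, noise level, and number of synthetic sets here are arbitrary choices for illustration:

```python
import random
import statistics

def fit_line(xs, ys):
    # Ordinary least-squares fit of y = a + b * x.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b  # intercept, slope

rng = random.Random(0)
xs = [i / 10.0 for i in range(1, 21)]                  # fixed independent-axis values
ys = [2.0 + 3.0 * x + rng.gauss(0, 0.2) for x in xs]   # the "actual" data

a0, b0 = fit_line(xs, ys)          # fitted parameter set a0
slope_deltas = []
for _ in range(1000):
    # synthetic data set at the same x values, generated from a0 plus noise
    ys_syn = [a0 + b0 * x + rng.gauss(0, 0.2) for x in xs]
    _, b_syn = fit_line(xs, ys_syn)
    slope_deltas.append(b_syn - b0)  # one parameter-difference data point
slope_sigma = statistics.stdev(slope_deltas)  # spread of the fitted slope
```

The spread of the collected differences, slope_sigma here, is exactly the standard deviation of the fitted parameter described above.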
#codingexercise 
Given two natural numbers n and m, find the number of ways in which numbers greater than or equal to m can be added to get the sum n.
For example, with n = 3 and m = 1 we have three ways: 1 + 1 + 1, 1 + 2, 3.
int GetCountWays(int n, int m)
{
    int[,] dp = new int[n + 2, n + 2];
    dp[0, n + 1] = 1;

    for (int k = n; k >= m; k--) {
        for (int i = 0; i <= n; i++) {
            dp[i, k] = dp[i, k + 1];
            if (i - k >= 0)
                dp[i, k] = dp[i, k] + dp[i - k, k];
        }
    }
    return dp[n, m];
}
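The same count can be sketched more directly in Python as a memoized recursion over unordered sums with parts of at least m (the helper name is mine):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_partition_ways(n, m):
    # Number of ways to write n as an unordered sum of parts >= m.
    if n == 0:
        return 1          # the empty sum
    if m > n:
        return 0
    # Either use at least one part equal to m, or use only parts >= m + 1.
    return count_partition_ways(n - m, m) + count_partition_ways(n, m + 1)
```

For n = 3 and m = 1 this returns 3, matching the enumeration 1+1+1, 1+2, 3 above.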

Saturday, December 30, 2017

In continuation of the hashing of stacktrace exceptions discussed in the previous post:
#! /usr/bin/python
import hashlib
hashes = {}
distinct_stacks = []
def hash(data):
      # equivalent of the unix command md5sum, e.g.:
      # StackTrace with 1244 bytes and md5=1f369406ec1243e5bc3aead95e97f380
      return hashlib.md5(data.encode('utf-8')).hexdigest()

# strip a stack before comparison; canonicalize if necessary:
def prepare(stk):
      if stk:
         lines = stk.split('\n')
         ret = [x.split('(')[0].strip().replace(" ","") for x in lines]
         if ret:
            return ret
     
     
def compare(stk1, stk2):
    lines1 = stk1.split('\n')
    lines2 = stk2.split('\n')
    if len(lines1) == 0 or len(lines2) == 0:
        return False
    count = min(len(lines1), len(lines2))
    for i in range(count):
        line1 = lines1[i].split('(')[0].strip().replace(" ", "")
        line2 = lines2[i].split('(')[0].strip().replace(" ", "")
        if line1.lower() != line2.lower():
            return False
        # three matching frames are treated as the same stack
        if i >= 2:
            return True
    return False

def count(stk, hashes, distinct_stacks):
    prepared = prepare(stk)
    if prepared:
        prepared = '\n'.join(prepared)
        hashed = hash(prepared)
        if hashed in hashes:
            hashes[hashed] += 1
        else:
            distinct_stacks += [prepared]
            hashes[hashed] = 1

def parse(filename, hashes, distinct_stacks):
    with open(filename, 'r') as fin:
        for line in fin.readlines():
            if "Exception:" in line:
                # split on the " at " frame separator rather than every "at"
                stk = '\n'.join(line.split(" at "))
                count(stk, hashes, distinct_stacks)


###
>>> hashes = {}
>>> distinct_stacks = []
>>> stackhasher.parse('stacktrace.txt', hashes, distinct_stacks)
>>>
>>> hashes
{'com.devdaily.sarah.tests.TrySuccessFailure$AlsException:Bummer!\n': 1}
>>> distinct_stacks
['com.devdaily.sarah.tests.TrySuccessFailure$AlsException:Bummer!\n']
>>>
###

Friday, December 29, 2017

Display of Stack-Trace Statistics in daily runs and production environments: 
Introduction: Log analysis for error detection and reporting is an important activity for any commercial deployment. Together with powerful log scan and search techniques, log analysis, both manual and automated, is useful for the smooth running of any environment. Many in-house and out-of-the-box commercial log-analysis tools come with rich interfaces in the form of query-language operators as well as dashboards to help end users. However, the use of these query tools and their frequency is left to the discretion of the user. Most log-based investigations are done case by case and therefore do not cover the overall hashes and counts of exception stack-traces. This write-up introduces an automated stack-trace hasher over log scans as a useful addition to any statistics.
Description: The conventional mode of querying displays graphs based on exceptions and introduces the notion of filters that select exceptions matching one or more attributes, usually in the form of keywords. Existing dashboards then serve up these selections the same way they do all exceptions, on a continuous timeline.
On the other hand, exceptions occurring in one code path often manifest more than once. Depending on the frequency, these exceptions and their counts indicate the health of the component in the service that is logging them. While a dashboard of archived exceptions may be available to view with pan and zoom, the statistics between service upgrades are generally missing. These can be easily captured with exception stack-trace hashing, taking snapshots of the running counts between upgrades and including them in the summaries. Such health indicators will be very helpful even in test runs of the service in various environments. Moreover, previously unnoticed exceptions gain visibility, and defect tracking may open defects against the component based on these exceptions.
Design:  
As the exceptions arrive, they are hashed and counted. This results in a frequency table that then gets displayed. We utilize both the exception-frequency table and an inverse lookup from hash to exception for best usage.
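A minimal sketch of such a tally in Python, keeping both the frequency table and the inverse lookup from hash back to a representative stack (the class name is hypothetical):

```python
import hashlib

class StackTraceTally:
    # Frequency table keyed by the md5 of a canonicalized stack trace,
    # plus an inverse lookup from each hash to a representative stack.
    def __init__(self):
        self.counts = {}
        self.samples = {}  # inverse lookup: hash -> sample stack

    def add(self, stack):
        digest = hashlib.md5(stack.encode('utf-8')).hexdigest()
        if digest not in self.counts:
            self.counts[digest] = 0
            self.samples[digest] = stack
        self.counts[digest] += 1
        return digest
```

A snapshot of counts taken between upgrades can then be included directly in the run summary, with samples available to resolve any hash back to a readable stack.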
Languages: Sample stack-trace hashing utilities are also available in Python and Java, such as exhash. Taking a set of class names, methods and optionally offsets as frame entries, and creating a hash from a stack-trace using only class and method names, will help with this tally.
Conclusion: Utilizing a stack trace hasher in addition to test run reports or service monitoring will be helpful as a health indicator for the different components in a service. 
#codingexercise
Find the Newman-Shanks-Williams prime.
The underlying sequence is defined by the recurrence relation:
F(0) = 1
F(1) = 1
F(n) = 2 * F(n - 1) + F(n - 2)
int GetNSWPrime(int n)
{
    if (n < 0)
        return 0;

    if (n == 0 || n == 1)
        return 1;

    return 2 * GetNSWPrime(n - 1) + GetNSWPrime(n - 2);
}
>>> def p(n):
...    if (n==0) or (n==1):
...       return 1
...    return 2*p(n-1)+p(n-2)
...
>>> p(3)
7
>>> p(5)
41
>>> p(7)
239
>>>
The first few primes are only 1 or 2 apart, and the recurrence happens to be defined only on the two preceding terms, n-1 and n-2.

Thursday, December 28, 2017

We were reviewing Monte Carlo simulation of synthetic data sets. This is a powerful technique. It assumes that if the fitted parameter set a0 is a reasonable estimate of the true parameters, obtained by minimizing the chi-square, then the distribution of differences of subsequently fitted parameters from a0 should be similar to the corresponding distribution around the true parameters.

#codingexercise
Find the Delannoy number.
This number counts the paths from the southwest corner (0,0) of a rectangular grid to the northeast corner (m,n) using only single steps north, northeast or east.
In a 3-by-3 square there are 63 such paths from the bottom-left to the top-right corner.
The recursion for this is:
D(m, n) = 1 if m == 0 or n == 0
D(m, n) = D(m-1, n) + D(m-1, n-1) + D(m, n-1) otherwise, one term for each of the three permitted steps

int GetDelannoy(int n, int m)
{
    if (m == 0 || n == 0)
        return 1;

    return GetDelannoy(m - 1, n) +
               GetDelannoy(m - 1, n - 1) +
               GetDelannoy(m, n - 1);
}
>>> def d(m,n):
...     if ((m==0) or (n==0)):
...         return 1
...     return d(m-1,n)+d(m-1,n-1)+d(m,n-1)
...
>>> print d(3,3)
63
>>> print d(4,4)
321
The recursion tree has a maximum depth of m + n
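Since the plain recursion recomputes overlapping subproblems, a memoized Python sketch keeps the work polynomial in m and n:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def delannoy(m, n):
    # Paths from (0,0) to (m,n) using single steps north, northeast or east.
    if m == 0 or n == 0:
        return 1
    return delannoy(m - 1, n) + delannoy(m - 1, n - 1) + delannoy(m, n - 1)
```

Each (m, n) pair is computed once, so the running time is O(m * n) instead of exponential in m + n.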

Wednesday, December 27, 2017

We were discussing how to compare two models. For example, the number of data points must be larger than the number of parameters. If we increase the number of parameters, goodness of fit improves and the chi-square gets better, so the model with more parameters tends to do better. This trade-off is articulated in a metric called the mean squared error, which is the chi-square divided by the degrees of freedom. MSE uses them as numerator and denominator, so it represents a trade-off between an over-parameterized model and a lower chi-square. A model with fewer parameters but a lower chi-square is preferred.
We will review Monte Carlo simulation of synthetic data sets shortly. This is a powerful technique. It assumes that if the fitted parameter set a0 is a reasonable estimate of the true parameters, obtained by minimizing the chi-square, then the distribution of differences of subsequently fitted parameters from a0 should be similar to the corresponding distribution around the true parameters.
#codingexercise
Find the Lobb number. This counts the number of ways n + m open parentheses can be arranged to form the start of a valid sequence of balanced parentheses.
double GetLobb(double n, double m)
{
    return ((2 * m + 1) * GetNChooseK(2 * n, m + n)) / (m + n + 1);
}

double GetNChooseK(double n, double k)
{
    if (k < 0 || k > n) return 0;
    if (n < 0) return 0;
    return Factorial(n) / (Factorial(n - k) * Factorial(k));
}

double Factorial(double n)
{
    if (n <= 1) return 1;
    return n * Factorial(n - 1);
}

double GetNChooseKDP(double n, double k)
{
    if (n < 0 || k < 0) return 0;
    if (k == 0 || k == n)
        return 1;
    return GetNChooseKDP(n - 1, k - 1) + GetNChooseKDP(n - 1, k);
}

The NChooseK function is the binomial coefficient.
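A Python sketch of the same Lobb computation using exact integer arithmetic (math.factorial avoids the floating-point recursion above; Lobb numbers are always integers, so the division is exact):

```python
from math import factorial

def n_choose_k(n, k):
    # Binomial coefficient with the same guards as GetNChooseK above.
    if n < 0 or k < 0 or k > n:
        return 0
    return factorial(n) // (factorial(n - k) * factorial(k))

def lobb(m, n):
    # L(m, n) = (2m + 1) / (m + n + 1) * C(2n, m + n)
    return (2 * m + 1) * n_choose_k(2 * n, m + n) // (m + n + 1)
```

Note that lobb(0, n) reduces to the nth Catalan number, e.g. lobb(0, 3) = 5.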