Thursday, December 28, 2017

We were reviewing Monte Carlo simulation of synthetic data sets. This is a powerful technique. It assumes that if the fitted parameter set a0 is a reasonable estimate of the true parameters, obtained by minimizing the chi-square, then the distribution of the differences between subsequently fitted parameters and a0 should be similar to the corresponding distribution around the true parameters.
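The idea can be sketched in a few lines of Python. This is only an illustration, not a fitting procedure from the text: the straight-line model, the noise level sigma, and the fit_line helper are made-up stand-ins.

```python
import random

def fit_line(xs, ys):
    # ordinary least squares for y = a*x + b
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

random.seed(0)
true_a, true_b, sigma = 2.0, 1.0, 0.5
xs = [i / 10 for i in range(50)]
ys = [true_a * x + true_b + random.gauss(0, sigma) for x in xs]

a0, b0 = fit_line(xs, ys)   # a0 plays the role of the fitted parameters

# Monte Carlo: draw synthetic data sets from the fitted model and refit;
# the spread of (a_i - a0) approximates the spread around the true parameters
diffs = []
for _ in range(200):
    ys_syn = [a0 * x + b0 + random.gauss(0, sigma) for x in xs]
    a_i, _ = fit_line(xs, ys_syn)
    diffs.append(a_i - a0)

spread = (sum(d * d for d in diffs) / len(diffs)) ** 0.5
```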

#codingexercise
Find the Delannoy number
This number describes the number of paths from the southwest corner (0,0) of a rectangular grid to the northeast corner (m,n) using only single steps north, northeast, or east.
In a 3 x 3 grid there are 63 paths from the bottom-left to the top-right corner using these steps.
The recurrence is calculated as
D(m, n) = 1 if m == 0 or n == 0
D(m, n) = D(m-1, n) + D(m-1, n-1) + D(m, n-1) otherwise, one term for each of the three permitted steps

// returns the Delannoy number D(m, n); the relation is symmetric in m and n
int GetDelannoy(int n, int m)
{
    if (m == 0 || n == 0)
        return 1;

    return GetDelannoy(m - 1, n) +     // step north
           GetDelannoy(m - 1, n - 1) + // step northeast
           GetDelannoy(m, n - 1);      // step east
}
>>> def d(m,n):
...     if ((m==0) or (n==0)):
...         return 1
...     return d(m-1,n)+d(m-1,n-1)+d(m,n-1)
...
>>> print(d(3,3))
63
>>> print(d(4,4))
321
The recursion tree has a maximum depth of m + n
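Since the naive recursion recomputes the same subproblems exponentially many times, a small dynamic-programming table computes D(m, n) in O(m x n) time. A sketch in Python:

```python
def delannoy(m, n):
    # d[i][j] holds D(i, j); the first row and column are all 1s
    d = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = d[i - 1][j] + d[i - 1][j - 1] + d[i][j - 1]
    return d[m][n]
```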

Wednesday, December 27, 2017

We were discussing how to compare two models. The number of data points must be larger than the number of parameters, and increasing the number of parameters improves the goodness of fit and lowers the chi-square, so raw chi-square favors the model with more parameters. This is corrected by a metric called the mean squared error, which is the chi-square divided by the degrees of freedom. Because extra parameters lower the numerator but also shrink the denominator, the MSE represents a trade-off between an over-parameterized model and a lower chi-square. A model with fewer parameters but a lower chi-square is preferred.
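As a toy illustration of the trade-off, the residuals and parameter counts below are invented numbers, not data from any model discussed here:

```python
def reduced_chi_square(residuals, num_params):
    # chi-square divided by degrees of freedom (N - p)
    dof = len(residuals) - num_params
    chi2 = sum(r * r for r in residuals)
    return chi2 / dof

# model A: 2 parameters, slightly larger residuals
res_a = [0.5, -0.4, 0.3, -0.5, 0.4, -0.3]
# model B: 5 parameters, slightly smaller residuals
res_b = [0.4, -0.3, 0.3, -0.4, 0.3, -0.2]

mse_a = reduced_chi_square(res_a, 2)   # 6 - 2 = 4 degrees of freedom
mse_b = reduced_chi_square(res_b, 5)   # 6 - 5 = 1 degree of freedom
# model B's lower chi-square is outweighed by its smaller denominator,
# so the simpler model A wins on MSE
```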
We will review Monte Carlo simulation of synthetic data sets shortly. This is a powerful technique. It assumes that if the fitted parameter set a0 is a reasonable estimate of the true parameters, obtained by minimizing the chi-square, then the distribution of the differences between subsequently fitted parameters and a0 should be similar to the corresponding distribution around the true parameters.
#codingexercise
Find the Lobb number. This counts the number of ways n+m open parentheses can be arranged to form the start of a valid sequence of balanced parentheses.
double GetLobb(double n, double m)
{
    // closed form: L(m, n) = (2m + 1) / (m + n + 1) * C(2n, m + n)
    return ((2 * m + 1) * GetNChooseK(2 * n, m + n)) / (m + n + 1);
}
double GetNChooseK(double n, double k)
{
    if (k < 0 || k > n) return 0;
    if (n < 0) return 0;
    return Factorial(n) / (Factorial(n - k) * Factorial(k));
}
double Factorial(double n)
{
    if (n <= 1) return 1;
    return n * Factorial(n - 1);
}
// Pascal's rule C(n, k) = C(n-1, k-1) + C(n-1, k); written recursively here,
// it would need memoization to earn the DP in its name
double GetNChooseKDP(double n, double k)
{
    if (n < 0 || k < 0) return 0;
    if (k == 0 || k == n)
        return 1;
    return GetNChooseKDP(n - 1, k - 1) + GetNChooseKDP(n - 1, k);
}

The NChooseK is also the binomial coefficient.
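As a quick sanity check of the closed form, when m = 0 the Lobb number reduces to the Catalan number. A Python sketch using math.comb:

```python
from math import comb

def lobb(m, n):
    # L(m, n) = (2m + 1) / (m + n + 1) * C(2n, m + n); always an integer
    return (2 * m + 1) * comb(2 * n, m + n) // (m + n + 1)

def catalan(n):
    return comb(2 * n, n) // (n + 1)
```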

Tuesday, December 26, 2017

We were discussing how to compare two models. The number of data points must be larger than the number of parameters, and increasing the number of parameters improves the goodness of fit and lowers the chi-square, so raw chi-square favors the model with more parameters. This is corrected by a metric called the mean squared error, which is the chi-square divided by the degrees of freedom. Because extra parameters lower the numerator but also shrink the denominator, the MSE represents a trade-off between an over-parameterized model and a lower chi-square. A model with fewer parameters but a lower chi-square is preferred.

#codingexercise
Find the Eulerian number E(n,m). This is the number of permutations of the numbers 1 to n in which exactly m elements are greater than the element that precedes them.
For example, for n = 3 and m = 1 there are 4 such permutations. Checking the ascents in each case:
1 2 3  => 1,2 and 2,3 resulting in count 2 ( invalid)
1 3 2 => 1,3 resulting in count 1 (valid)
2 1 3 => 1,3 resulting in count 1 (valid)
2 3 1 => 2,3 resulting in count 1 (valid)
3 1 2 => 1,2 resulting in count 1 (valid)

int GetEulerian(int n, int m)
{
    if (m >= n || n == 0)
        return 0;

    if (m == 0)
        return 1;

    return (n - m) * GetEulerian(n - 1, m - 1) +
           (m + 1) * GetEulerian(n - 1, m);
}

We can easily trace this for n = 3, m = 1:
E(3,1) = 2 x E(2,0) + 2 x E(2,1)
       = 2 + 2 x (1 x E(1,0) + 2 x E(1,1))
       = 2 + 2 x (1 + 0)
       = 4
The recursive formula reflects that, as n decreases by one, m either decreases with it or stays the same.
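A memoized Python version of the same recurrence; a convenient check is that row n of the Eulerian triangle sums to n!:

```python
from functools import lru_cache
from math import factorial

@lru_cache(maxsize=None)
def eulerian(n, m):
    # E(n, m): permutations of 1..n with exactly m ascents
    if m >= n or n == 0:
        return 0
    if m == 0:
        return 1
    return (n - m) * eulerian(n - 1, m - 1) + (m + 1) * eulerian(n - 1, m)
```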

Monday, December 25, 2017

We were discussing how to compare two models. The number of data points must be larger than the number of parameters, and increasing the number of parameters improves the goodness of fit and lowers the chi-square, so raw chi-square favors the model with more parameters. This is corrected by a metric called the mean squared error, which is the chi-square divided by the degrees of freedom. Because extra parameters lower the numerator but also shrink the denominator, the MSE represents a trade-off between an over-parameterized model and a lower chi-square. A model with fewer parameters but a lower chi-square is preferred.
#codingexercise
Find the Entringer number for a given n,k
The Entringer number E(n,k) counts the permutations of the digits 1 to n+1 that start with k+1 and then first fall and thereafter alternately rise and fall. For example, E(4,2) = 4, counted by:
32415
32514
31425
31524
int GetCountEntringer(int n, int k)
{
    if (n == 0 && k == 0) return 1;
    if (k == 0) return 0;
    return GetCountEntringer(n, k - 1) + GetCountEntringer(n - 1, n - k);
}
Tracing the Wolfram definition of the Entringer recurrence for E(4,2):
E(4,2) = E(4,1) + E(3,2)
E(4,1) = E(4,0) + E(3,3) = 0 + E(3,3)
E(3,3) = E(3,2) + E(2,0) = E(3,2)
E(3,2) = E(3,1) + E(2,1)
E(3,1) = E(3,0) + E(2,2) = E(2,2)
E(2,2) = E(2,1) + E(1,0) = E(2,1)
E(2,1) = E(2,0) + E(1,1) = E(1,1)
E(1,1) = E(1,0) + E(0,0) = 0 + 1 = 1
so E(2,1) = E(2,2) = E(3,1) = 1, E(3,2) = E(3,3) = E(4,1) = 2, and E(4,2) = 2 + 2 = 4.
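The same recurrence is easy to memoize in Python; E(4, 2) = 4 matches the four permutations listed above:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def entringer(n, k):
    # E(n, k) = E(n, k-1) + E(n-1, n-k), with E(0, 0) = 1 and E(n, 0) = 0
    if n == 0 and k == 0:
        return 1
    if k == 0:
        return 0
    return entringer(n, k - 1) + entringer(n - 1, n - k)
```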

Sunday, December 24, 2017

We were discussing how to compare two models. The number of data points must be larger than the number of parameters, and increasing the number of parameters improves the goodness of fit and lowers the chi-square, so raw chi-square favors the model with more parameters. This is corrected by a metric called the mean squared error, which is the chi-square divided by the degrees of freedom. Because extra parameters lower the numerator but also shrink the denominator, the MSE represents a trade-off between an over-parameterized model and a lower chi-square. A model with fewer parameters but a lower chi-square is preferred.
#codingexercise
Find the number of increasing subsequences of size k in a given array.
For example, for the array 1 2 3 4, the increasing subsequences of size 2 are 1 2, 1 3, 1 4, 2 3, 2 4, 3 4.
The recurrence, where dp[i][j] counts the increasing subsequences of size i ending at index j, is therefore
dp[i][j] = 1, where i = 1 and 1 <= j <= n
dp[i][j] = sum of dp[i-1][m] over (i-1) <= m < j with A[m] < A[j], where 1 < i <= k and i <= j <= n

int GetCountIncreasingSubsequences(int A[], int n, int k)
{
    int [,] dp = new int[k, n];
    int  sum = 0;

    for (int i = 0; i < n; i++)
        dp[0, i] = 1;

    for (int size = 1; size < k; size++) {
       for (int i = size; i < n; i++) {
            dp[size, i] = 0;
            for (int j = size - 1; j < i; j++) {
                if (A[j] < A[i])
                    dp[size, i] += dp[size - 1, j];
            }
        }
    }

    for (int i = k - 1; i < n; i++)
        sum += dp[k - 1, i];

   return sum;
}

The above method also memoizes the counts computed for sizes smaller than k, so we can easily look up the results for any size less than k.
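A direct Python transcription of the same dynamic program may make the indices easier to follow; dp[size][i] counts the increasing subsequences of length size + 1 ending at index i:

```python
def count_increasing_subsequences(a, k):
    n = len(a)
    dp = [[0] * n for _ in range(k)]
    for i in range(n):
        dp[0][i] = 1                 # every element is a subsequence of size 1
    for size in range(1, k):
        for i in range(size, n):
            for j in range(size - 1, i):
                if a[j] < a[i]:
                    # extend every smaller subsequence ending before i
                    dp[size][i] += dp[size - 1][j]
    return sum(dp[k - 1][i] for i in range(k - 1, n))
```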

Saturday, December 23, 2017

We were discussing the Bootstrap method, confidence intervals, and the accuracy of model parameters, especially under linearization of non-linear models.
The goodness of fit and the residuals plot are useful indicators along with the error function; each gives helpful information.
A good correlation means the fit is close, but for the fit to be accurate it must also produce a specific distribution of residuals: the residuals should be spread along the independent axis and normally distributed around zero with no systematic trends. The latter condition makes the residual - the difference between the data point and the estimate - more acceptable.
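To illustrate a systematic trend, here is a deliberately mismatched fit sketched in Python: a straight line fitted to quadratic data leaves residuals whose mean is near zero but whose signs follow a clear pattern. The data and the fit_line helper are made up for this illustration:

```python
def fit_line(xs, ys):
    # ordinary least squares for y = a*x + b
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

xs = [i / 10 for i in range(-20, 21)]
ys = [x * x for x in xs]              # truly quadratic data

a, b = fit_line(xs, ys)
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]

# the residuals average to ~0, yet they are positive at both ends and
# negative in the middle - a systematic trend, so the linear model is
# inadequate despite the small mean residual
mean_res = sum(residuals) / len(residuals)
```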
I want to mention that curve fitting improves with higher degrees, such as a quadratic over a linear fit, but this does not mean we go to as high a degree as possible. The model we build tries to minimize the residuals; if we can do this with a lower degree, that is acceptable.
When we compare two models, we follow certain rules. For example, the number of data points must be larger than the number of parameters, and increasing the number of parameters improves the goodness of fit and lowers the chi-square, so raw chi-square favors the model with more parameters. This is corrected by a metric called the mean squared error, which is the chi-square divided by the degrees of freedom. Because extra parameters lower the numerator but also shrink the denominator, the MSE represents a trade-off between an over-parameterized model and a lower chi-square. A model with fewer parameters but a lower chi-square is preferred.

Friday, December 22, 2017

We were discussing the Bootstrap method, confidence intervals, and the accuracy of model parameters, especially under linearization of non-linear models.
The goodness of fit and the residuals plot are useful indicators along with the error function; each gives helpful information.
A good correlation means the fit is close, but for the fit to be accurate it must also produce a specific distribution of residuals: the residuals should be spread along the independent axis and normally distributed around zero with no systematic trends. The latter condition makes the residual - the difference between the data point and the estimate - more acceptable.
I want to mention that curve fitting improves with higher degrees, such as a quadratic over a linear fit, but this does not mean we go to as high a degree as possible. The model we build tries to minimize the residuals; if we can do this with a lower degree, that is acceptable.
#codingexercise
Print Hosoya's triangle
Hosoya's triangle is a triangular arrangement of numbers based on the Fibonacci numbers, in which each entry is the sum of the two entries diagonally above it to the left, or equivalently the two diagonally above it to the right.
The recurrence relation is
H(0,0) = H(1,0) =  H(1,1) = H(2,1) = 1
and
H(n,j) = H(n-1,j) + H(n-2,j)
       = H(n-1,j-1) + H(n-2,j-2)
and H(n,i) = F(i+1) x F(n-i+1)

// returns H(r, c); entries outside the triangle (c > r) are treated as 0
int GetHosoya(int r, int c)
{
    if (r == 0 && c == 0) return 1;
    if (r == 1 && c == 0) return 1;
    if (r == 1 && c == 1) return 1;
    if (r == 2 && c == 1) return 1;
    if (r > c)
        return GetHosoya(r - 1, c) + GetHosoya(r - 2, c);
    if (r == c)
        return GetHosoya(r - 1, c - 1) + GetHosoya(r - 2, c - 2);
    return 0;
}
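A quick Python check of the product formula H(n,i) = F(i+1) x F(n-i+1) against the recurrence, with F(1) = F(2) = 1:

```python
def fib(k):
    # 1-indexed Fibonacci numbers: fib(1) = fib(2) = 1
    a, b = 1, 1
    for _ in range(k - 1):
        a, b = b, a + b
    return a

def hosoya(r, c):
    # same recurrence as above; entries outside the triangle are 0
    if (r, c) in {(0, 0), (1, 0), (1, 1), (2, 1)}:
        return 1
    if r > c:
        return hosoya(r - 1, c) + hosoya(r - 2, c)
    if r == c:
        return hosoya(r - 1, c - 1) + hosoya(r - 2, c - 2)
    return 0
```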