Monday, May 12, 2014

LogParser and Splunk
We will cover some topics common to LogParser and Splunk and see how we can integrate them.
LogParser is a versatile and powerful tool that enables searches over XML, CSV and other text formats. It provides universal SQL-style query access to text-based data.
LogParser can take a SQL expression on the command line and output the results that match the query.
Splunk has a rich, Unix-style set of search operators that go beyond simply retrieving results.
The advantage of a common LINQ expression for querying is that it can work with any data source. If we consider LogParser and Splunk as two different query-based data providers, then arguably there is a way to support LINQ-style querying over Splunk.
Let us, on the other hand, look at integrating LogParser and Splunk directly. While one can take the output of the other, it is preferable that Splunk take LogParser as a modular input. More on this shortly.
Architecturally, there is considerable difference between LogParser and Splunk.
Splunk-to-SQL connector apps already read relational data. There are plenty of apps that can connect to different SQL sources and issue SQL queries.
However, there are very few apps that read LogParser output.
LogParser can read many kinds of data, not just XML or CSV. It can read sources such as the Windows Event Log.
The ability to query over such data is a major benefit contributing to its popularity on Windows systems.
Splunk could translate user queries to SQL and in this way gain access not only to LogParser's capabilities but also to the data sources that are typically used with LogParser.
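As a rough sketch of the modular-input idea, the following scripted-input style wrapper shells out to LogParser and prints its CSV output, which Splunk can then ingest from stdout. It assumes LogParser 2.2 (LogParser.exe) is installed and on the PATH, and the event-log query shown is only an example.

import subprocess

LOGPARSER = "LogParser.exe"   # assumes LogParser 2.2 is installed and on the PATH
QUERY = "SELECT TimeGenerated, SourceName, EventID, Message FROM System"

def run_logparser(query):
    # -i:EVT reads the Windows event log, -o:CSV emits comma-separated rows
    out = subprocess.check_output([LOGPARSER, "-i:EVT", "-o:CSV", query])
    return out.decode("utf-8", errors="replace")

if __name__ == "__main__":
    # a Splunk scripted input picks up whatever this script writes to stdout
    print(run_logparser(QUERY))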
class Employee:
    def __init__(self, id, mgrId, name):
        self.id = id
        self.mgrId = mgrId
        self.name = name

# sample data: an employee whose mgrId equals his own id is a root (his own manager)
#E = [Employee(1, 0, 'ABC'), Employee(2, 1, 'DEF'), Employee(3, 1, 'GHI')]
E = [Employee(1, 1, 'ABC'), Employee(4, 4, 'IJK'), Employee(2, 1, 'DEF'), Employee(5, 4, 'MNO'), Employee(3, 1, 'GHI')]

def directReportsBFS(E, s):
    # breadth-first traversal from s; a single None marker in the queue separates levels
    Q = [s, None]
    level = 0
    while len(Q) > 0:
        c = Q.pop(0)
        if c is None:
            level = level + 1
            if len(Q) > 0:      # more nodes remain, so mark the start of the next level
                Q.append(None)
            continue
        Reports = [e for e in E if e.mgrId == c.id and e.mgrId != e.id]
        print('Name:' + c.name + ' Level:' + str(level) + ' Reports:' + "[{0}]".format(", ".join(r.name for r in Reports)))
        for report in Reports:
            Q.append(report)


[directReportsBFS(E, e) for e in E if e.mgrId == e.id]
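With the sample data above, this prints each root at level 0 followed by its direct reports at level 1, for example Name:ABC Level:0 Reports:[DEF, GHI] and then Name:DEF Level:1 Reports:[].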

Sunday, May 11, 2014

In this blog post, I talk about matrix operations from the book on Algorithms and Data Structures. I'm covering these as part of the series of earlier posts from that book. The book says operations on matrices are at the heart of scientific computing, so efficient algorithms for working with matrices are discussed. One important issue that arises in practice is numerical stability: rounding errors can accumulate and distort the results, and methods that are prone to this are called numerically unstable.
A matrix is a rectangular array of numbers. The element of a matrix A in row i and column j is aij. The set of all m x n matrices with real-valued entries is denoted R^(m x n). The transpose of a matrix A is the matrix A' obtained by exchanging the rows and columns of A.
A vector is a one-dimensional array of numbers. A column vector is an n x 1 matrix. A row vector is a 1 x n matrix. The transpose of a column vector is a row vector.
The unit vector ei is one whose ith element is 1 and all the other elements are 0. The size of a unit vector is clear from the context. A zero matrix is a matrix whose every entry is 0.
Square n x n matrices arise frequently, and several special cases of square matrices are of particular interest.
A diagonal matrix has aij = 0 whenever i != j. An upper triangular matrix U is one for which uij = 0 if i > j; all entries below the diagonal are zero. A lower triangular matrix L is one for which lij = 0 if i < j; all entries above the diagonal are zero.
Operations on matrices include matrix addition, which is defined entrywise: cij = aij + bij.
Matrix subtraction is the addition of the negated matrix.
In matrix multiplication, we start with two matrices A and B that are compatible, in the sense that the number of columns of A equals the number of rows of B. If A is m x n and B is n x p, then C = AB is the m x p matrix with cik = Sum for j = 1 to n of (aij * bjk).
Matrices have some, but not all, of the familiar arithmetic properties.
Identity matrices are identities for matrix multiplication.
Matrix multiplication is associative and distributes over addition:
A(BC) = (AB)C
A(B+C) = AB + AC
(B+C)D = BD + CD
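As a small illustration, here is a naive Python sketch of the product C = AB defined above (matmul is just an illustrative name; real code would use a numerics library):

def matmul(A, B):
    # naive implementation of C = AB: cik = sum over j of aij * bjk
    m, n, p = len(A), len(B), len(B[0])
    assert all(len(row) == n for row in A), "columns of A must equal rows of B"
    C = [[0] * p for _ in range(m)]
    for i in range(m):
        for k in range(p):
            C[i][k] = sum(A[i][j] * B[j][k] for j in range(n))
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]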
The inverse of an n x n matrix A, when it exists, is the n x n matrix A^(-1) (the -1 is a superscript) such that A A^(-1) = A^(-1) A = In.
If a matrix has an inverse, it is called invertible or non-singular.
The vectors x1, x2, ..., xn are linearly dependent if there exist coefficients c1, c2, ..., cn, not all of which are zero, such that c1x1 + c2x2 + ... + cnxn = 0.
The ijth minor of an n x n matrix A, for n > 1, is the (n-1) x (n-1) matrix A[ij] obtained by deleting the ith row and the jth column of A. The determinant of an n x n matrix A can be defined recursively in terms of its minors by:
det(A) = a11, if n = 1
det(A) = Sum for j = 1 to n of (-1)^(1+j) * a1j * det(A[1j]), if n > 1
The term (-1)^(i+j) * det(A[ij]) is known as the cofactor of the element aij.
The determinant of square matrix A has the following properties:
If any row or any column of A is zero, then det(A) is zero
The det(A) is multiplied by lambda if the entries of any one row or any one column are all multiplied by lambda.
The determinant of A is unchanged if the entries in one row are added to those in another row.
The det(A) equals the det(A transpose)
The det(A) is multiplied by -1 if any two rows or any two columns are exchanged.
For any two square matrices A and B, we have det(AB) = det(A) * det(B)
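The recursive definition above translates directly into a short Python sketch (the names minor and det are just illustrative, and cofactor expansion takes exponential time, so it is practical only for small matrices):

def minor(A, i, j):
    # delete row i and column j of A
    return [row[:j] + row[j+1:] for k, row in enumerate(A) if k != i]

def det(A):
    # cofactor expansion along the first row, following the recursive definition above
    n = len(A)
    if n == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j)) for j in range(n))

print(det([[1, 2], [3, 4]]))                      # -2
print(det([[2, 0, 0], [0, 3, 0], [0, 0, 4]]))     # 24, the product of the diagonal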

In this post we look at a pseudo-random number generator algorithm, but we will return to the Queue implementation shortly. Specifically, we discuss the Mersenne Twister algorithm. This method takes its name from the Mersenne prime it is based on, which has the value 2^19937 - 1. The method is commonly used as the default generator in many languages and tools: Python, Ruby, Pascal, PHP, Maple, MATLAB, etc.
The commonly used version of this algorithm is called MT19937, which produces a sequence of 32-bit integers.
This method has a very long period of 2^19937 - 1. A long period does not by itself guarantee randomness, but a short period is a definite weakness.
It is k-distributed to 32-bit accuracy for every 1 <= k <= 623.
A pseudo-random sequence xi of w-bit integers with period P is said to be k-distributed to v-bit accuracy when, taking the numbers formed by the leading v bits of k consecutive terms, each of the 2^kv possible bit combinations occurs the same number of times in a period (the all-zero combination can be discounted, since it occurs once less often). Essentially, the outputs still look uniformly spread even if we examine only a portion of their bits. Random numbers are generally not expected to repeat, and with a 32-bit representation all 32 bits are involved in making them distinct; but if we take only the leading bits of consecutive outputs and every combination still appears equally often, then the sequence is said to be k-distributed.
Also, this method passes numerous tests for statistical randomness.
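Python's built-in random module is a handy example here, since its core generator is MT19937. A short illustration of seeding the generator and drawing values:

import random

random.seed(2014)                                   # deterministic, reproducible stream
print([random.getrandbits(32) for _ in range(5)])   # five raw 32-bit outputs
print(random.random())                              # uniform float in [0, 1) from the same generator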

Saturday, May 10, 2014

In this post, we talk about the rules in the queue alerts module design (http://1drv.ms/1m5CYRQ).

When the rules are updated, the queue manager informs the workers. Since there is more than one worker, the queue manager can decide which rules to push to which worker. The workers all do the same task regardless of the rules: they evaluate the messages against the rules and defer the filtered messages to be forwarded or acted upon. The workers could maintain a list of messages, or they could simply put them on a queue for the manager to read; I've denoted this with SendMessages(). The manager may have a common queue across all workers, such as an IO completion port, and the messages may even be stamped with a worker id to denote the application context. How we scale out the work between the manager and the workers is left to performance considerations, but we want to avoid more than one data structure (such as an IO completion port) for communication between the manager and the workers. Ideally, the workers take on the beefier role while the manager manages the rules and retrieves the messages. If the manager can GetAllMessages() without the workers' involvement, that may be yet another improvement.
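Here is a minimal, single-threaded sketch of that interaction. The Worker and QueueManager classes, the rule-as-predicate idea and the method names are illustrative stand-ins for the SendMessages()/GetAllMessages() calls mentioned above; a real implementation would run the workers on their own threads or processes against a single shared completion queue.

from collections import deque

class Worker:
    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.rules = []          # rules pushed down by the manager
        self.outbox = deque()    # filtered messages waiting to be handed off

    def update_rules(self, rules):
        self.rules = list(rules)

    def evaluate(self, message):
        # defer the message if any rule matches it
        if any(rule(message) for rule in self.rules):
            self.outbox.append((self.worker_id, message))

    def send_messages(self, completion_queue):
        # hand everything off to the manager's common queue, stamped with the worker id
        while self.outbox:
            completion_queue.append(self.outbox.popleft())

class QueueManager:
    def __init__(self, workers):
        self.workers = workers
        self.completion_queue = deque()   # one shared structure, like an IO completion port

    def push_rules(self, rules):
        for w in self.workers:            # the manager decides which rules go to which worker
            w.update_rules(rules)

    def get_all_messages(self):
        for w in self.workers:
            w.send_messages(self.completion_queue)
        drained = list(self.completion_queue)
        self.completion_queue.clear()
        return drained

# usage: a single rule that flags messages containing the word 'error'
manager = QueueManager([Worker(1), Worker(2)])
manager.push_rules([lambda m: 'error' in m])
manager.workers[0].evaluate('disk error on node 7')
manager.workers[1].evaluate('heartbeat ok')
print(manager.get_all_messages())         # [(1, 'disk error on node 7')]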

In this post, I want to include a puzzle I came across online.
Write a function to generate random numbers.
The problem calls for pseudo-random number generation, since true random numbers are difficult to obtain. This can be done in many ways.
It is generally best to use a built-in pseudo-random number generator such as Math.Random.
However, I want to describe the problem this way: generate random numbers between 0 and 3 and print them. The only test I have is that calling the function three times doesn't result in the same sequence more than half the number of times the function is called.
By restricting the size of the array, I'm going to use both the indexes and the contents of the array.
// assumes h is an array of length N holding the values 0..N-1, Swap exchanges
// two elements, and Math.Random(0, N-1) returns a random index in [0, N-1]
for (int i = 0; i < N; i++)
{
    Swap(h[i], h[Math.Random(0, N - 1)]);
}
return h[N - 1];
We could probably optimize this to just iterate over half the array.
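One way to express the same idea in Python: random.shuffle performs a Fisher-Yates shuffle, which avoids the slight bias of swapping every position with an arbitrary index.

import random

def random_zero_to_three():
    h = [0, 1, 2, 3]
    random.shuffle(h)   # uniform permutation of the candidates
    return h[0]

print([random_zero_to_three() for _ in range(3)])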

Some languages implement their random number generator as a linear congruential generator; Java's java.util.Random, for example, uses a 48-bit LCG:
int next(int n)
{
    // 48-bit linear congruential step over the generator's seed state
    seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
    return (int)(seed >>> (48 - n));
}

or a linear congruential generator like the portable sample implementation of rand() shown in the C documentation:
static unsigned long int next = 1;

int rand(void)   /* RAND_MAX assumed to be 32767 */
{
    next = next * 1103515245 + 12345;
    return (unsigned int)(next / 65536) % 32768;
}

Friday, May 9, 2014

In this post, we briefly review some of the compiler flags available in the Microsoft Visual C++ IDE.
/Od disables optimization. If you see code that doesn't show locals or functions under a debugger, chances are the compiler has decided to optimize the code. This flag helps with debugging in that it disables optimizations so that you have access to all code artifacts. This flag has also been used to troubleshoot the compiler, which can give bad output, say, in one of the following circumstances:
1) a strange combination of compiler flags was used, or
2) the compiler was asked to build against the incorrect target.
In the cases where the compiler does something weird with the switches specified, /Od comes to the rescue by providing a single, expected version of the executable.
Another thing to mention is that the compiler also provides an /errorReport switch, which is useful when an internal compiler error report needs to be generated. This flag is described on MSDN as one that allows you to send internal compiler error information directly to the Visual C++ team.
The /favor switch helps you produce code that is optimized for a specific architecture, or for the micro-architectures of both AMD64 and EM64T.
The /GA switch optimizes code for a Windows application.
/GA speeds access to data declared with __declspec(thread) in a Windows-based program.
__declspec denotes Microsoft-specific storage-class attributes, for example align. The align attribute specifies that the alignment of user-defined data be precisely controlled; it can be used with a struct, union, class, or variable.
Other storage-class attributes include appdomain, dllimport, dllexport, noalias, noinline, noreturn, nothrow, novtable, safebuffers, thread, etc.
The /Ge switch activates stack probes, which are helpful for detecting stack overflow. The same functionality is now available via /Gs, which is preferred over the now-deprecated /Ge switch.
The /GF switch enables string pooling that can create smaller programs. /GF is in effect when /O1 or /O2 is used.
The /GL switch enables whole program optimization. Many of these optimizations are also described under the /LTCG option. LTCG stands for link-time code generation and is responsible for optimizations such as:
cross module inlining
inter-procedural register allocation (on 64-bit)
custom calling convention (on x86 only)
Small TLS displacement (on x86 only)
Stack double alignment (x86 only)
and disambiguation between global variables and input parameters.
The /Wp64 compiler option detects 64-bit portability problems.
It does this by testing the usages of int, long, and pointer types in the program. If the 64-bit compiler is used regularly, this flag can be disabled because that compiler will detect all such issues.
By default, this option is off in the 32-bit compiler and on in the 64-bit compiler.
Lastly, we discuss the /Zi option, which generates complete debugging information. This option implies /DEBUG (generate debug information). Note that this only affects the program database (PDB) where the compiler puts the debugging information.
/GT option supports fiber safety for data allocated using static thread-local storage.
The opposite of disabled optimization is full optimization, which is specified by /Ox, maximum optimization (/Ob2gity /Gs).