Monday, August 4, 2014

In this post, we describe path counting for the simple symmetric random walk in dimension 1. At each step this walk moves forward or backward with equal probability. When the walk starts from zero, its position after n steps is the sum of n i.i.d. random variables E1, ..., En, each equal to +1 or -1 with probability 1/2. A realization of the first n steps is thus a vector in {-1, +1}^n, and all such vectors are equally likely. Since there are 2^n elements in {-1, +1}^n, the probability distribution is P(E1 = e1, E2 = e2, ..., En = en) = 1/2 * 1/2 * ... * 1/2 = 1/(2^n).
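As a quick sketch (Python used purely as illustration), the sample space and its uniform distribution look like this:

```python
from itertools import product

n = 4  # any small horizon works; there are 2**n sign sequences in total

# Every sequence of n signs in {-1, +1} is equally likely.
sequences = list(product((-1, +1), repeat=n))
assert len(sequences) == 2 ** n

# P(E1 = e1, ..., En = en) = (1/2)**n for each fixed sequence.
p_single = 1 / 2 ** n

# The probabilities of all sequences sum to 1.
total = p_single * len(sequences)
print(total)  # 1.0
```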
We now say that the events A that depend only on the first n steps correspond to subsets of this space.
So if there are #A elements in A, we have P(A) = #A/2^n.
Therefore, if we can count the number of elements of A, then we can compute its probability.
First, we assign to each initial position and to each sequence of n signs a path of length n. We can construct a grid and trace this path in Cartesian coordinates, with time on one axis and position on the other.
Now let us take an example where an event on this grid is specified by constraints on the positions, and we want to compute its probability.
Here is such an example : A = {S2 >= 0, S4 = -1, S6 < 0, S7 < 0}.
The paths comprising this event are shown below. We can count that there are six paths in A.
So P(A) = 6/2^7 = 6/128 = 3/64, which is approximately 0.047. (The event depends on the first seven steps, so the relevant sample space has 2^7 = 128 paths.)
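We can automate such counts by enumerating all 2^n sign sequences. Note that the sum of four +/-1 steps is always even, so the sketch below pins S4 at -2 as its illustrative event; the count shown is for that event, not the one above:

```python
from itertools import product

def count_event(n, condition):
    """Count length-n sign paths whose partial sums satisfy `condition`."""
    count = 0
    for steps in product((-1, +1), repeat=n):
        # s[k] = S_k, the position after k steps (s[0] = 0).
        s = [0]
        for e in steps:
            s.append(s[-1] + e)
        if condition(s):
            count += 1
    return count

# Illustrative event: B = {S2 >= 0, S4 = -2, S6 < 0, S7 < 0}
n = 7
paths = count_event(n, lambda s: s[2] >= 0 and s[4] == -2 and s[6] < 0 and s[7] < 0)
print(paths, paths / 2 ** n)  # 12 paths, probability 12/128 = 3/32
```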
We have thus found probabilities by counting paths.
Next we look at paths that start and end at specific points. Writing points as (time, position), the number of paths that start at one specific point (x, y) and end at another specific point (m, n) is given by (m - x) Choose ((m - x + n - y)/2).
Additionally, if we take the start as (0, 0) and the finish as (n, i), the total number of paths is n Choose ((n + i)/2), which we read as zero when (n + i) is not an even number, since in that case no path that starts from (0, 0) can end at (n, i).
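A brute-force cross-check of this counting formula (Python used here just as a calculator):

```python
from itertools import product
from math import comb

def brute_force(n, i):
    """Count n-step sign paths from 0 to i by enumeration."""
    return sum(1 for steps in product((-1, +1), repeat=n) if sum(steps) == i)

def by_formula(n, i):
    """n Choose (n+i)/2, read as zero when n+i is odd or |i| > n."""
    if (n + i) % 2 != 0 or abs(i) > n:
        return 0
    return comb(n, (n + i) // 2)

# The two agree for every small n and every endpoint i.
for n in range(1, 8):
    for i in range(-n - 1, n + 2):
        assert brute_force(n, i) == by_formula(n, i)
```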

Sunday, August 3, 2014

In the previous post, we did not mention how we use the law of large numbers. We will complete that today.
We claimed that pi(y) = g(y) - g(y-1) satisfies the balance equations. The Markov chain describing the supply and demand at an electronics store was expressed as
Mn+1 = max(0, Z1, ..., Zn+1), where Zj = xi-0 + ... + xi-(j-1) is the accumulated difference between supply and demand,
and this has the same distribution as max(0, Mn + xi-0), because the xi's are i.i.d. and reordering them does not affect the distribution. Hence
P(Mn+1 = y) = Sigma-{x >= 0} P(Mn = x) pxy
Since the terms involved are bounded, we may take the limit of both sides as n -> infinity,
and lim n -> infinity P(Mn = y) is our definition of pi(y),
so the earlier equation becomes pi(y) = Sigma-{x >= 0} pi(x) pxy. This shows that pi satisfies the balance equations; however, we still need to prove that Sigma-y pi(y) = 1,
that is, that pi is a genuine probability distribution whose weights sum to unity.
Writing g(y) = P(max(0, Z1, Z2, ...) <= y), the sum Sigma-y pi(y) equals lim y -> infinity g(y), so pi sums to unity exactly when max(0, Z1, Z2, ...) is finite with probability one.
We prove this using the assumption E(xi-n) = -mu < 0:
P(lim n -> infinity Zn/n = -mu) = 1 by the strong law of large numbers,
and since -mu < 0, this implies P(lim n -> infinity Zn = -infinity) = 1.
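A quick simulation of this law of large numbers step; the distribution of xi is an illustrative assumption, not from the post:

```python
import random

random.seed(0)

# Illustrative choice (an assumption): each xi_n is a demand-minus-supply
# style difference with E[xi] = -0.5, i.e. mu = 0.5.
def xi():
    return random.choice((-2, -1, 0, 1))  # mean = -0.5

n = 200_000
z = 0
for _ in range(n):
    z += xi()

# Zn / n should be close to -mu = -0.5 for large n.
print(z / n)
```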
Since Zn -> -infinity with probability one, only finitely many of the Zn are positive, so max(0, Z1, Z2, ...) is finite with probability one.
This gives that g(y) = P(max(0, Z1, Z2, ...) <= y) tends to 1 as y -> infinity, and hence the pi(y) indeed sum to unity.
I'm going to take a short break today to discuss a storage, or queuing, application of a Markov chain. A storage facility has Xn electronic items at time n. A number An of new components are added to the stock, while a number of them are sold to the market. A demand for Dn components is met instantaneously from what is available; if all of Dn cannot be satisfied, the stock reaches level zero at time n+1. Thus at time n+1 the store has max(Xn + An - Dn, 0) components in stock.
We write this as Xn+1 = max(Xn + An - Dn, 0).
We make the following assumptions:
1) The pairs of random variables (An, Dn), n >= 0, are independent and identically distributed (i.i.d.): the supply and demand at any time n are independent of the previous states and come from the same distribution; and
2) E(An - Dn) = -mu < 0. Thus the demand is larger than the supply per unit time on average, which is indicated by the negative sign.
Now, Xn, n >= 0 is a Markov chain and we will show this Markov chain is positive recurrent.
Let us take xi-n (the Greek letter xi) = An - Dn.
Then the Markov chain is given by the simple recursion Xn+1 = max(Xn + xi-n, 0) for all n >= 0.
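A small simulation of this recursion; the distributions of An and Dn below are assumptions made up for illustration:

```python
import random

random.seed(1)

# Hypothetical supply/demand distributions (assumptions, not from the post):
# A_n uniform over {0,1,2}, D_n uniform over {0,1,2,3}, so
# E[A_n - D_n] = 1 - 1.5 = -0.5 < 0.
def step(x):
    a = random.choice((0, 1, 2))
    d = random.choice((0, 1, 2, 3))
    return max(x + a - d, 0)

x = 0
levels = []
for n in range(100_000):
    x = step(x)
    levels.append(x)

# With negative drift the stock level keeps returning to 0 rather than
# drifting off to infinity: a symptom of positive recurrence.
print(max(levels), levels.count(0) / len(levels))
```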
Let us try solving the recursion.
In step 1, X1 = max(X0 + xi-0, 0).
In step 2, X2 = max(X1 + xi-1, 0) = max(X0 + xi-0 + xi-1, xi-1, 0).
In step n, Xn = max(X0 + xi-0 + ... + xi-(n-1), xi-1 + ... + xi-(n-1), ..., xi-(n-1), 0).
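The unrolled expression can be checked against the recursion directly; this sketch compares the two on random inputs:

```python
import random

random.seed(2)

def unroll_check(x0, xis):
    n = len(xis)
    # Direct recursion: X_{k+1} = max(X_k + xi_k, 0).
    x = x0
    for xi in xis:
        x = max(x + xi, 0)

    # Unrolled form: max over 0 and all suffix sums, with X0 added to the
    # full sum xi_0 + ... + xi_{n-1}.
    candidates = [0]
    for j in range(n):
        candidates.append(sum(xis[j:]))  # xi_j + ... + xi_{n-1}
    candidates[1] += x0                  # X0 + xi_0 + ... + xi_{n-1}
    return x, max(candidates)

for trial in range(1000):
    x0 = random.randint(0, 5)
    xis = [random.randint(-3, 3) for _ in range(random.randint(1, 10))]
    a, b = unroll_check(x0, xis)
    assert a == b
print("recursion matches its unrolled form")
```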
The probability Px(Xn > y) can now be written in terms of the step-n expression. The event whose probability we want to compute is thus a function of (xi-0, ..., xi-(n-1)). Moreover, these xi-0, ..., xi-(n-1) are i.i.d., so we can permute them, and even reverse their order, without altering the distribution.
Then for abbreviations, we use
Zn = xi-0 + ... + xi-(n-1), and
Mn = max(Zn, Zn-1, ..., Z1, 0)
and g(y) = lim n -> infinity P(Mn <= y), the limiting distribution function.
With X0 = 0, P0(Xn > y) can then be written as P(Mn > y), and since Mn+1 >= Mn, we know that
P0(Xn > y) <= P0(Xn+1 > y) for all n. The limit of this bounded and increasing sequence exists as n -> infinity, and hence so does the limit of P(Mn <= y).
We set pi(y) = g(y) - g(y-1) and claim that it satisfies the balance equations.
But Mn+1 = max(0, Z1, ..., Zn+1),
which has the same distribution as max(0, Mn + xi-0), because the xi's are i.i.d. and reordering them does not change the joint distribution.
Hence for y >= 0,
P(Mn+1 = y) = Sigma-{x >= 0} P(Mn = x) pxy. Since these terms are bounded, we may take the limit of both sides as n -> infinity and find pi(y) = Sigma-{x >= 0} pi(x) pxy. Thus pi satisfies the balance equations; that the weights actually add up to 1, equivalently that g(y) tends to 1 at infinity, uses the assumption E(xi) = -mu < 0 together with the law of large numbers.
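To make the balance equations concrete, here is a numeric check for one illustrative case, chosen as an assumption for this sketch: xi = +1 with probability p and -1 with probability 1 - p, p < 1/2. For this reflected chain max(x + xi, 0), a standard result gives the geometric stationary law pi(x) = (1 - r) r^x with r = p/(1 - p):

```python
# Reflected +/-1 walk: M -> max(M + xi, 0), with P(xi = +1) = p < 1/2.
p = 0.3
q = 1 - p
r = p / q

def pi(x):
    # Claimed stationary distribution (geometric).
    return (1 - r) * r ** x

def p_trans(x, y):
    """Transition probability of x -> max(x + xi, 0)."""
    if x == 0:
        return q if y == 0 else (p if y == 1 else 0.0)
    if y == x + 1:
        return p
    if y == x - 1:
        return q
    return 0.0

# Balance: pi(y) = sum_x pi(x) p_xy (truncated at a large state N).
N = 200
for y in range(50):
    lhs = pi(y)
    rhs = sum(pi(x) * p_trans(x, y) for x in range(N))
    assert abs(lhs - rhs) < 1e-12

# Normalization: the weights sum to one.
assert abs(sum(pi(x) for x in range(N)) - 1) < 1e-12
print("balance equations and normalization hold")
```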
Courtesy: Konstantopoulos


Saturday, August 2, 2014

Tonight we will continue our writeup on SplunkLite.Net : http://1drv.ms/1k3LnJ0 
Code will be available on http://github.com/ravibeta/csharpexamples/SplunkLite.Net 
Splunk Search Processors are generally of the following types:

Streaming

This is completely parallelizable, with example commands such as eval, where, and rex.

Stateful

This requires a single stream of data and is therefore not parallelizable. Example commands: sort, eventstats, etc.

Event-processing

This is neither reporting nor streaming nor stateful. Example commands: sort, eventstats, etc.

Reporting

This is summarizing. Example commands: stats, chart, timechart, top, etc.
For now, we can exclude sub-searches.
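A sketch of how such a taxonomy might drive dispatch decisions; the mapping and function names are illustrative, not SplunkLite.Net's actual API:

```python
# Hypothetical routing table from search command to processor category.
CATEGORIES = {
    "eval": "streaming",
    "where": "streaming",
    "rex": "streaming",
    "sort": "stateful",
    "eventstats": "stateful",
    "stats": "reporting",
    "chart": "reporting",
    "timechart": "reporting",
    "top": "reporting",
}

def is_parallelizable(command):
    # Only streaming processors are completely parallelizable; stateful
    # ones need a single ordered stream, and reporting ones summarize.
    return CATEGORIES.get(command) == "streaming"

print(is_parallelizable("rex"), is_parallelizable("sort"))  # True False
```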
Searches can be saved, so we will need a SavedSearchAdminHandler that implements methods similar to those of an AdminHandler.

Batched search:


            Batch mode is not realtime and not summarizing. It is a reporting search such as the stats command. It operates one bucket at a time. Buckets are sorted into non-restartable and restartable buckets and read in that order – a few events at a time until the bucket gets exhausted.

Realtime search:


            Realtime searches yield results as the events are indexed, even in a distributed environment. SplunkLite.Net is a single-machine, single-instance mode, but nothing stops it from being distributable, especially since there's CCW and REST support in .Net. Realtime search has the ability to do search and stats on the incoming pipeline data.

Indexed search: 

This works on different time windows and adjusts the time range based on user preferences, for example the last week, month, or year. Buckets may overlap in time, although only a few buckets will be writeable (hot). This kind of search runs with map/reduce-like distribution.

For each of these searches, three different modes are supported:
-       Fast mode
o   Field discovery off for event searches. No event or field data for stats searches.
-       Smart mode
o   Field discovery on for event searches. No event or field data for stats searches.
-       Verbose mode
o   All event and field data
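These modes can be captured in a small settings table; the key names below are illustrative assumptions, not Splunk's actual configuration keys:

```python
# Hypothetical per-mode settings: whether field discovery runs for event
# searches, and whether event/field data is kept for stats searches.
SEARCH_MODES = {
    "fast":    {"field_discovery": False, "stats_event_data": False},
    "smart":   {"field_discovery": True,  "stats_event_data": False},
    "verbose": {"field_discovery": True,  "stats_event_data": True},
}

def mode_settings(mode):
    return SEARCH_MODES[mode]

print(mode_settings("smart"))
```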

Friday, August 1, 2014

Today I'm going to continue the discussion on SplunkLite.Net.
In today's post we continue our discussion on porting the Splunk forwarder functionality to SplunkLite.Net. Here we look at data structures such as the QueueReader and QueueWriter in ProducerConsumerQueue. We are making sure the ProducerConsumerQueue has an equivalent in .Net.
Queue is the abstract notion used to read CowPipelineData.
QueueReader is the interface for the client to read data from the queue
QueueWriter is the interface for the client app to add data to the queue. There are several queue implementations.
Each QueueWriter implementation has additional methods to get the size, check whether the queue is full or empty, reset the size, get stats, remove a single item, add metrics, and shut down.
The writers can be in-memory or on-disk writers.
A PersistentQueue is one which has one reader and many writers. The writers write to memory, but when they overflow they write to disk.
A ProducerConsumerQueue has in memory writers by name, persistent writers by name, list of all writers, the next provider to be polled for data, a mutex protecting condition, and a queue not empty condition.
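A minimal sketch of these queue semantics, written in Python for brevity; the class and method names are illustrative, and a real port would use the .Net collections:

```python
import threading
from collections import deque

class ProducerConsumerQueue:
    """Sketch of the reader/writer split described above."""

    def __init__(self, capacity=1024):
        self._items = deque()
        self._capacity = capacity
        self._lock = threading.Lock()  # the mutex protecting the condition
        self._not_empty = threading.Condition(self._lock)  # queue-not-empty condition

    # QueueWriter interface: add data to the queue.
    def put(self, item):
        with self._not_empty:
            if len(self._items) >= self._capacity:
                return False  # full: a persistent writer would spill to disk here
            self._items.append(item)
            self._not_empty.notify()
            return True

    # QueueReader interface: read data from the queue, blocking while empty.
    def get(self):
        with self._not_empty:
            while not self._items:
                self._not_empty.wait()
            return self._items.popleft()

q = ProducerConsumerQueue()
q.put("event-1")
print(q.get())  # event-1
```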
With the PipelineData, we mentioned earlier that each thread can have its exclusive CowPipelineData even though, underneath, they may share the same PipelineData.
Moreover, we mentioned that the PipelineData uses the notions of RawStoragePool and RawStorageReference, but these are not suitable for porting because we don't manage memory manually in .Net.
Anytime the underlying data is modified, we do something like this in the forwarder:
PipelineData *x = pData.getWriter();
x->put(...);
Together, this shows that we can use the .Net collections while retaining the semantics of a queue reader and writer in SplunkLite.Net.
Search is one of the biggest operations and we will deal with that soon.