Monday, July 21, 2014

Continuing from the previous post, where we discussed a logger for software components, today we look at how components register their logging channels. Initially a component may specify only a name (string) or an identifier (GUID) to differentiate its logging channel, and the requirement that each new component declare its own channel is not usually enforced. Furthermore, logging at all levels is left to the discretion of the component owners, which is generally inadequate. Besides, some components are considered too core to be of interest to users, so their logging is omitted altogether. With the new logger, we require that each component go through a supportability review and that it be able to log as machine data without restriction on size or frequency, while at the same time supporting many more features.
Hence one of the improvements we require from component registration is metadata for the component's logging channel. This metadata includes, among other things, the intended audience, expected frequency, error-message mappings for corrective actions, payload support, and grouping. In other words, it helps the logging consumer take appropriate actions on the logging payload. Today the consumer decides whether to flush to disk, send to logging subscribers, or redirect to a database; it adds headers to the data (for example, for a listener when sending over the network), takes different actions when converting the data to binary mode, supports operations such as compression and encryption, and maintains different modes of operation, such as a performance-oriented mode with fast flush to disk or a feature-oriented mode as above. Throttling and resource management of logging channels is possible via redirection to a null queue.
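As a minimal sketch of what such registration metadata might look like (the field names here are illustrative assumptions, not an actual schema):

from dataclasses import dataclass, field

@dataclass
class ChannelMetadata:
    # All fields are hypothetical examples of the metadata discussed above.
    name: str                            # channel name or GUID
    audience: str = "admin"              # intended audience, e.g. "admin" or "user"
    expected_frequency_hz: float = 1.0   # rough logging rate, useful for throttling decisions
    supports_payload: bool = False       # whether structured payloads are attached
    group: str = "default"               # grouping for downstream consumers
    error_actions: dict = field(default_factory=dict)  # error message -> corrective action

registry = {}

def register_channel(meta: ChannelMetadata):
    # The consumer consults this registry to decide flushing, routing,
    # compression, encryption, or redirection to a null queue.
    registry[meta.name] = meta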
In general, a sliding window protocol could be implemented for the logging channel, with support for sequence numbers. Many of its features can be compared to a TCP implementation.
TCP has several features, such as reordering and flow control; for our purposes we do not have reordering issues.
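As a rough illustration, a sequence-numbered sliding window for a logging channel might look like the sketch below; the window size and the in-order acknowledgement model are assumptions, not part of the design above:

from collections import deque

class SlidingWindowChannel:
    # Assigns sequence numbers to log entries and holds at most
    # window_size unacknowledged entries, giving simple flow control.
    def __init__(self, window_size=8):
        self.window_size = window_size
        self.next_seq = 0
        self.in_flight = deque()

    def send(self, entry):
        if len(self.in_flight) >= self.window_size:
            return False  # window full: caller must wait (flow control)
        self.in_flight.append((self.next_seq, entry))
        self.next_seq += 1
        return True

    def ack(self, seq):
        # Acknowledgements arrive in order, so no reordering is handled here.
        while self.in_flight and self.in_flight[0][0] <= seq:
            self.in_flight.popleft()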

Sunday, July 20, 2014

In today's post we continue to investigate applications of Splunk. One such application is supportability. Processes, memory, CPU utilization, file descriptor usage, and system call failures make up the bulk of the failures that require supportability measures. The most important supportability measure is logging, and although all components log, most of the fear around verbose logging centers on log pollution. In fact, the most heavily used components often lack helpful logging precisely because they are invoked so frequently that their messages rapidly grow the log to an overwhelming size. Such a log offends admins, who view the splunkd log as actionable and for their eyes only.
Now searches have their own logs, generated for the duration of each session. Search artifacts are a blessing for across-the-board troubleshooting: logging can be turned to debug mode, the generated log file is persisted only for the duration of the user session invoking the search, and it does not bother the admins.
What is required from components that do not log even to the search logs, because they are so heavily used or are used at times other than searches, is to combine the technique behind search logs with this kind of logging.
The call to action is not just for components to log more, or to support logging to a different destination, or to have grades of logging, but to fundamentally allow a component to log without any concern for resources or impact. The component can specify flags for concerns such as logging levels or actions. A mechanism may also be needed for loggers to specify a round-robin buffer.
The benefit of a round-robin in-memory log buffer is the decoupling of producers from consumers. We will talk about logging improvements a lot more and cover many aspects, but the goal for now is to cover just this.
The in-memory buffer is entirely owned by the application, and as such, components can be given the slot number to write to. The content of the log entries will follow some format, but we will discuss that later. There can be only one consumer for this in-memory buffer, and it services one or more out-of-process consumers that honor the user's or admin's choices for destination, longevity, and transformations.
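A minimal sketch of such a buffer follows; the fixed slot count, the locking, and the single drain loop are assumptions made for illustration:

import threading

class RingLogBuffer:
    # Fixed-size, in-memory, round-robin buffer. Producers write to slots
    # without blocking on downstream consumers; a single in-process consumer
    # drains the slots and forwards entries to out-of-process consumers.
    def __init__(self, slots=1024):
        self.slots = [None] * slots
        self.write_index = 0
        self.read_index = 0
        self.lock = threading.Lock()

    def write(self, entry):
        with self.lock:
            self.slots[self.write_index % len(self.slots)] = entry  # oldest entries are overwritten
            self.write_index += 1

    def drain(self):
        # Called only by the single consumer.
        with self.lock:
            start = max(self.read_index, self.write_index - len(self.slots))
            entries = [self.slots[i % len(self.slots)]
                       for i in range(start, self.write_index)]
            self.read_index = self.write_index
        return entries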

Saturday, July 19, 2014

Today we will look at the Markov chain a little bit more to answer the following questions:
Will the random walk ever return to where it started?
If yes, how long will it take?
If not, where will it go?
If we denote by u(n) the probability that the walk (started at 0) attains the value 0 in n steps, for even n, we now want to find the probability f(n) that the walk returns to 0 for the first time in n steps.
Let the steps Xn, n >= 1, be independent random variables taking the values +1 and -1 with equal probability, and let Sn = X1 + ... + Xn (with S0 = 0) denote the position of the walk after n steps.
The theorem for the calculation of f(n) is stated this way:
Let n be even and u(n) = P0(Sn = 0), where P0 denotes probability for the walk started at 0.
Then f(n) = P0(S1 != 0, ..., Sn-1 != 0, Sn = 0) = u(n)/(n-1).
And here's the proof:
Since the random walk cannot change sign before becoming zero,
f(n) = P0(S1 > 0, ..., Sn-1 > 0, Sn = 0) + P0(S1 < 0, ..., Sn-1 < 0, Sn = 0),
        which comprises two equal terms.
Now,
        P0(S1 > 0, ..., Sn-1 > 0, Sn = 0) = P0(Sn = 0) P0(S1 > 0, ..., Sn-1 > 0 | Sn = 0), by conditional probability.
The first factor is u(n). Given Sn = 0, the previous position Sn-1 equals 1 or -1 with equal probability, which contributes a factor of 1/2, and by the Markov property at time n-1 we can omit the event Sn = 0.
So f(n) can be rewritten as 2 u(n) (1/2) P0(S1 > 0, S2 > 0, ..., Sn-2 > 0 | Sn-1 = 1),
and the last term, by the ballot theorem, equals 1/(n-1), which proves that f(n) = u(n)/(n-1).
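As a quick numeric sanity check of this formula (not part of the proof), we can compute u(n) from the binomial coefficient and estimate f(n) by simulation:

from math import comb
import random

def u(n):
    # Probability the symmetric +-1 walk started at 0 is at 0 after n steps.
    return comb(n, n // 2) / 2**n

def estimate_f(n, trials=200000):
    # Monte Carlo estimate of the first-return probability f(n).
    hits = 0
    for _ in range(trials):
        s = 0
        for step in range(1, n + 1):
            s += random.choice((-1, 1))
            if s == 0:
                if step == n:
                    hits += 1
                break  # stop at the first return, whenever it happens
    return hits / trials

n = 10
print(u(n) / (n - 1), estimate_f(n))  # the two values should be close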


To complete the post on random walks for developers, as discussed in the previous two posts, we briefly summarize the goal and the process.
A random walk is a special case of a process called a Markov chain, in which future events depend conditionally on the present and not on past events. It is driven by random numbers that assign outcomes to events according to their probabilities. The probabilities of moving between a finite set of states (sometimes two, forward and backward, as in the case of a drunkard walking along a straight line) are called transition probabilities. Random walks lead to diffusion.
The iterations in a random walk are based on the equation px,y = px+z,y+z for any translation z, where px,y is the transition probability in the state space S.
Random walks possess certain other properties:
- time homogeneity
- space homogeneity
and sometimes skip-free transitions, which we have already discussed.
The transition probabilities are the main thing to be worked out in a random walk.
Usually this is expressed in terms of cases, such as
p(x) = pj if x = ej, where j = 1 to d in a d-dimensional space,
       = qj if x = -ej,
       = 0 otherwise.
We have seen how this simplifies to forward and backward motion in the case of a drunkard's linear walk.
The walk simply iterates, calculating a new position based on the outcome of the transition probabilities.
The walk itself may be performed k times to average out the findings (hitting time) from each walk.
Each step traverses the bipartite graph.
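Here is a small sketch of that averaging, using a toy transition matrix (the three-state chain below is made up for illustration):

import random

def walk_hitting_time(P, start, target, max_steps=1000):
    # One walk: move according to the transition probabilities until the target is hit.
    state, steps = start, 0
    while state != target and steps < max_steps:
        r, acc = random.random(), 0.0
        for nxt, p in enumerate(P[state]):
            acc += p
            if r < acc:
                state = nxt
                break
        steps += 1
    return steps if state == target else None

P = [[0.0, 0.5, 0.5],   # toy chain: rows are states, entries are transition probabilities
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
times = [t for t in (walk_hitting_time(P, 0, 2) for _ in range(1000)) if t is not None]
print(sum(times) / len(times))  # hitting time averaged over k = 1000 walks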
http://1drv.ms/1nf79KL

Friday, July 18, 2014

Continuing from the previous post ...

The number of steps taken between any two nodes is called the hitting time. The hitting time between an unclassified term and a domain term can be averaged over all the walks that connect those two nodes.
Since the walks are selected based on transition probabilities, the most probable paths are selected first. The same pair of nodes can be connected by many paths, and the same unclassified term can be connected to many other domain terms.
The contextual similarity of a classification pair (n, m) can then be described as the relative frequency of hitting those two nodes, normalized over the other nodes linked to the start node.
This is calculated as contextual similarity L(n, m) = H(n, m) / Sigma-i(H(n, i)).
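For illustration, given hit counts H collected from the walks started at one unclassified term, the normalization might look like this (the counts below are hypothetical):

def contextual_similarity(hits):
    # hits[m] = number of walks from the start term that ended by hitting domain term m.
    total = sum(hits.values())
    return {m: h / total for m, h in hits.items()}

H = {"cpu": 12, "memory": 5, "disk": 3}   # hypothetical hit counts
print(contextual_similarity(H))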
We can also look at Stability of a random walk or a Markov chain in general.
Stability refers to the convergence of probabilities as the series becomes infinite. When the chain is positive recurrent and irreducible, the average (called the Cesaro average) (1/n) Sum over k of P(Xk = x) converges to pi(x) as n -> infinity.
Stability is interesting because a Markov chain is a simple model of a stochastic dynamical system that remains within a small region. For example, when a pendulum swings, it finally comes to a stable position with dissipation of energy. Even if there were no friction, it would be deemed stable because it cannot swing too far away.
What stability tells us is that when a Markov chain has certain properties (irreducibility, positive recurrence, a unique stationary distribution pi), the n-step transition matrix converges to a matrix with all rows equal to pi. This is called the fundamental stability theorem.
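A quick numeric illustration of this convergence, with an arbitrary two-state chain (not taken from the post):

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

P = [[0.9, 0.1],
     [0.4, 0.6]]
Pn = P
for _ in range(50):   # raise P to a high power
    Pn = mat_mul(Pn, P)
print(Pn)  # both rows approach the stationary distribution pi = (0.8, 0.2)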
Stability works based on coupling.
Coupling refers to the various methods for constructing a joint version of two random variables. If the random variables are independent, they can be combined in a straightforward manner by taking their co-occurrence. Coupling helps us define a third Markov chain Z from a chain X with an arbitrary initial distribution and a chain Y with the stationary distribution, where Z follows X before meeting Y and follows Y after the meeting point. This then shows that the transition matrix converges to a matrix with all rows equal to pi.
By choosing the most probable paths, the random walk follows the preferred state transitions. Thus, while not all paths may end within the predetermined number of steps, we know that when a walk does finish, it will have favored the higher transition probabilities.

A simple random walk has equal probabilities to move from one state to another.
To implement a simple random walk in the (x, y) plane, we can write a naive version like this (in Python, choosing a dimension at random and taking a unit step in it):
import random

n = 10
x, y = [0], [0]
for i in range(1, 2**n):
    if random.random() < 0.5:          # move in the x direction
        x.append(x[-1] + random.choice((-1, 1)))
        y.append(y[-1])
    else:                              # move in the y direction
        x.append(x[-1])
        y.append(y[-1] + random.choice((-1, 1)))
print(list(zip(x, y)))

We can have the following metrics in a simple random walk:
first return time to zero
total number of visits to a state.
For all the random walks, we can have metrics like
total number of paths between two nodes
total number of paths in which a node appears
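A small sketch computing the first two metrics for the one-dimensional walk (illustrative only):

import random

def first_return_and_visits(steps):
    # Returns the first return time to zero and the visit counts per state.
    s, first_return, visits = 0, None, {0: 1}
    for t in range(1, steps + 1):
        s += random.choice((-1, 1))
        visits[s] = visits.get(s, 0) + 1
        if s == 0 and first_return is None:
            first_return = t
    return first_return, visits

fr, visits = first_return_and_visits(1000)
print(fr, visits.get(0, 0))   # first return time to zero, total visits to zero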


Thursday, July 17, 2014

We mentioned that a random walk is a special case of a Markov chain. To understand Markov chains, we take the example of a mouse in a cage with two cells: the first has fresh cheese and the second has stinky cheese. If at time n the mouse is in cell 1, then at time n+1 the mouse is in either cell 1 or cell 2. Let us say the probability of moving from cell 1 to cell 2 is alpha and the reverse is beta. In this case, alpha is much smaller than beta. Then the transition diagram shows transitions as follows:
1->1 with a  probability of 1- alpha
1 -> 2 with a probability of alpha
2 -> 1 with a probability of beta
and 2->2 with a probability of 1 - beta.
Since the mouse decides whether to stay or move independently at each step, the time it takes to make the first move from cell 1 to cell 2 is geometrically distributed, with mean 1/alpha.
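A quick simulation confirms the 1/alpha mean (the value of alpha below is arbitrary):

import random

alpha = 0.1   # probability of moving from cell 1 to cell 2 at each step

def time_to_leave_cell1():
    t = 1
    while random.random() >= alpha:   # stay in cell 1 with probability 1 - alpha
        t += 1
    return t

times = [time_to_leave_cell1() for _ in range(100000)]
print(sum(times) / len(times))   # should be close to 1/alpha = 10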
Thus the moves are described in terms of random variables Xi. Let T denote a subset of the integers indexing a sequence of random variables {Xn : n in T}. The Markov property is then defined as follows:
for any n in T, the future process Xm (m > n) is independent of the past process Xm (m < n), conditionally on Xn.
In the case of the mouse, the probability of the move is the same regardless of where the mouse was at earlier times, i.e., the future is independent of the past given the present.
This dependence of the future only on the present is easy to generalize with random variables instead of deterministic objects. The states are countable and can be denoted by the state space S, and since the moves are between states, (Xn) is called a Markov chain.
The Markov property applies to all before-and-after processes, so (Xn+1, ..., Xn+m) and (X0, ..., Xn-1) are conditionally independent given Xn.
The walk based on a Markov chain is thus dependent on transition probabilities. If the transition probabilities are defined, then the walk is just iterative.
The joint probability distribution can be expressed in terms of two functions -
one for the initial states, P(X0 = i), for each i in the state space S,
and the other for the transitions, pi,j(n, n+1) = P(Xn+1 = j | Xn = i), for i, j in S and n >= 0.
We can generalize the transition from one step to multiple steps with
pi,j(m, n) = P(Xn = j | Xm = i).
We call a Markov chain time-homogeneous when the single-step transition probability does not depend on n.
In the case of the mouse, the probability of a move is independent of where it was at earlier times. Each single-step transition matrix, denoted simply by P, is the same at every step, and the multi-step transition matrix is P(m, n) = P * P * ... * P, taken n - m times, i.e., P^(n-m).
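For example, the two-step transition probabilities for the mouse chain can be computed directly as P * P (the alpha and beta values are illustrative):

alpha, beta = 0.1, 0.6

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

P = [[1 - alpha, alpha],
     [beta, 1 - beta]]
P2 = mat_mul(P, P)   # two-step transition matrix P(m, m+2) = P^2
print(P2[0][1])      # probability of being in cell 2 two steps after being in cell 1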
In a random walk, we add a few more properties we discussed earlier.
Now let us consider a random walk on the bipartite graph starting at an unclassified term and arriving at another semantically equivalent domain term. The walker starts from any node and then moves to any other node with transition probability Pij.
Pij is the normalized probability of the co-occurrence counts of the terms and the corresponding context. The transition probability between any two nodes i and j is defined as Pi,j = Wi,j / Sum over k of Wi,k.
The terms are either unclassified terms or domain terms and cannot be connected to each other except through context.
Each walk uses the transition probabilities for the move.
Each walk starts at some unclassified term and ends at a domain term or exceeds the maximum number of moves without finishing.
Several walks, up to say K of them, can be initiated.
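A minimal sketch of such a walk over a bipartite term-context graph follows; the weights and node names are made-up examples, not real co-occurrence data:

import random

# W[i][j] = co-occurrence weight between node i and node j (hypothetical).
W = {
    "unclassified_term": {"ctx_error": 3, "ctx_timeout": 1},
    "ctx_error": {"unclassified_term": 3, "domain_term": 5},
    "ctx_timeout": {"unclassified_term": 1, "domain_term": 2},
    "domain_term": {"ctx_error": 5, "ctx_timeout": 2},
}

def step(node):
    # P(i -> j) = W[i][j] / Sum over k of W[i][k]
    neighbors, weights = zip(*W[node].items())
    return random.choices(neighbors, weights=weights)[0]

def walk(start, max_moves=20):
    node = start
    for _ in range(max_moves):
        node = step(node)
        if node == "domain_term":
            return node   # walk ends at a domain term
    return None           # exceeded the maximum number of moves

print(walk("unclassified_term"))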

Wednesday, July 16, 2014

Let us look at performing a random walk between terms and context.
We discussed in the previous post that terms and context are different for machine data than for human text. The context in the case of machine data is the metadata fields. A random walk connects the terms and the context in a bipartite graph.
A random walk is a special kind of Markov chain. The latter has time-homogeneous transition probabilities; a random walk has the additional property of space-homogeneous transition probabilities. This calls for a structure where px,y = px+z,y+z for any translation z. A random walk is given by px,y = p(y - x).
In an undirected connected graph, the degree of each vertex i is the number of neighbors of i.
For the same graph, a random walk is defined with the following transition probabilities:
 pij = 1 / degree(i), if j is a neighbor of i
     = 0, otherwise
The stationary distribution of this random walk on a graph (RWG) is given by pi(i) = C degree(i), where C is a constant.
This comes from the detailed balance relation pi(i) pij = pi(j) pji, since both sides are equal to C: on each side of the equation, the first factor is the stationary distribution and the second is the inverse of the degree.
This shows that the stationary distribution is proportional to degree(i), which is easy to work out with a sample connected graph of five vertices arranged inside a square. Moreover, the equations show that the RWG is a reversible chain. The constant C is found by normalization: given that the degrees add up to twice the number of edges, we have C = 1/(2|E|).
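A short empirical check on such a five-vertex graph (the exact shape below, a square with a center vertex, is an assumption matching the description):

import random

# Square 0-1-2-3 plus a center vertex 4 connected to all four corners.
adj = {0: [1, 3, 4], 1: [0, 2, 4], 2: [1, 3, 4], 3: [0, 2, 4], 4: [0, 1, 2, 3]}
E = sum(len(v) for v in adj.values()) // 2   # number of edges

visits = {v: 0 for v in adj}
node = 0
for _ in range(200000):
    node = random.choice(adj[node])   # pij = 1/degree(i) for each neighbor j
    visits[node] += 1

for v in adj:
    # The empirical frequency should approach pi(v) = degree(v) / (2|E|).
    print(v, visits[v] / 200000, len(adj[v]) / (2 * E))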

There are several ways to express the transition probabilities, resulting in different random walks:
p(x) = C 2^(-|x1| - ... - |xd|)

p(x) = pj if x = ej, j = 1, ..., d
        = qj if x = -ej, j = 1, ..., d
        = 0 otherwise
and all of the p(x) add up to 1.

If d = 1, then p(1) = p, p(-1) = q, and p(x) = 0 otherwise, where p + q = 1. This is the drunkard's walk, which is homogeneous in time and space; and given that it must pass through all intermediate nodes because each hop is 1, the transitions are also skip-free.
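For completeness, the d = 1 case in code (the value of p is arbitrary):

import random

p = 0.5   # probability of a +1 step; q = 1 - p
s = 0
for _ in range(100):
    s += 1 if random.random() < p else -1   # skip-free: every hop is +1 or -1
print(s)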