Wednesday, August 20, 2014

Today we will look at the Skorokhod embedding theorem. Consider a simple symmetric random walk, in which each step is +1 with probability 1/2 and -1 with probability 1/2. Started from S0 = x, the probability that the gambler's fortune reaches level b > x before reaching level a < x is (x-a)/(b-a). Started from S0 = 0, the probability that the walk reaches b > 0 before a < 0 is -a/(b-a), and the probability that it reaches a before b is b/(b-a).
These two probabilities form a two-point distribution on {a, b}, with the walk exiting the interval at a or at b with the corresponding probability.
We can view the random walk this way. Given a two-point probability distribution (p, 1-p) with p rational, we can always find integers a < 0 < b such that p*a + (1-p)*b = 0.
This means we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it until the first time it exits the interval [a, b]. The probability that it exits at the left endpoint a is p, and the probability that it exits at the right endpoint b is 1-p. Having introduced the idea, let's look at more complicated probability distributions.
If the distribution puts mass on many points, can we still find a stopping time T such that the stopped walk has the given distribution? This is where the Skorokhod embedding theorem helps.
The theorem states that for a simple symmetric random walk and a given probability distribution (pi, i in Z) on Z with zero mean, i.e. satisfying sum over i of i*pi = 0, there exists a stopping time T such that P(ST = i) = pi for every i.
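As a sanity check, the two-point special case is easy to simulate. A minimal sketch (the helper name exit_point and the choice p = 2/3, a = -1, b = 2 are mine, picked so that p*a + (1-p)*b = 0):

```python
import random

def exit_point(a, b):
    """Run a simple symmetric random walk from 0 until it exits (a, b)."""
    s = 0
    while a < s < b:
        s += random.choice((-1, 1))
    return s

# Two-point distribution (p, 1-p) with p = 2/3: take a = -1, b = 2,
# so that p*a + (1-p)*b = (2/3)(-1) + (1/3)(2) = 0.
# The walk should exit at a with probability b/(b-a) = 2/3.
random.seed(0)
trials = 20000
hits_a = sum(exit_point(-1, 2) == -1 for _ in range(trials))
print(hits_a / trials)  # close to 2/3
```

The empirical exit frequency matches the gambler's-ruin probability b/(b-a).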

Monday, August 18, 2014

Today we will continue to discuss random walks. We have discussed one-dimensional and two-dimensional random walks so far. We saw that the two-dimensional random walk can be expressed in terms of two independent one-dimensional random walks, which we proved by enumerating the transition probabilities relative to the origin. We now look at another lemma: the probability that the two-dimensional random walk returns to zero after 2n steps equals the square of the probability that a simple symmetric random walk (SSRW) returns to zero after 2n steps. This follows from the decomposition above: the two-dimensional walk returns to the origin exactly when both of the independent SSRWs return to zero, and each does so with the same probability, namely the number of paths from (0,0) to (2n,0) divided by the total number of paths, which works out to (2n choose n) * 2^(-2n). The return probability for the two-dimensional walk is the square of this.
We next look at recurrence.
Recall that a Markov chain with finitely many states must visit at least one state infinitely many times; some states may be visited again and again without end. Such a state is called recurrent: starting from a recurrent state i, the chain returns to i infinitely often with probability one, i.e. Pi(Xn = i for infinitely many n) = 1.
For the two-dimensional symmetric random walk, the lemma gives the return probability at time 2n as ((2n choose n) * 2^(-2n))^2, which by Stirling's approximation behaves like c/n for some constant c. The series sum of c/n over n diverges, so the expected number of returns to the origin is infinite, and the two-dimensional random walk is recurrent.
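The asymptotics are quick to check numerically. A small sketch (the function name p_return_1d and the sample values of n are mine); Stirling gives (2n choose n) * 2^(-2n) ~ 1/sqrt(pi*n), so the squared probability behaves like 1/(pi*n), i.e. c/n with c = 1/pi:

```python
from math import comb, pi

def p_return_1d(n):
    """P(a 1-D SSRW is back at 0 after 2n steps) = C(2n, n) * 2^(-2n)."""
    return comb(2 * n, n) / 4 ** n

# The 2-D return probability is the square of the 1-D one; by Stirling
# it behaves like 1/(pi*n).
for n in (10, 100, 1000):
    print(n, p_return_1d(n) ** 2, 1 / (pi * n))
```

The two columns converge as n grows, confirming the c/n behavior that makes the series diverge.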

Sunday, August 17, 2014


In the previous post, we considered a simple symmetric random walk in two dimensions. We said that when we change coordinates by rotating the axes and scaling, the walks along the new axes are independent. To prove this, we start with the random walk in two dimensions and express it as a sequence of incremental displacement vectors in the rotated and scaled coordinates. This sequence is again i.i.d., because each of its terms is a fixed function of the corresponding increment along the original axes. The rotation and scaling amount to taking the sum and the difference of the displacements along the original axes, and each coordinate of a rotated increment is then +1 or -1. So we can write the correspondence between the increments as follows:
An original increment (1,0) becomes the rotated increment (1,1), with probability 1/4.
An original increment (-1,0) becomes the rotated increment (-1,-1), with probability 1/4.
An original increment (0,1) becomes the rotated increment (1,-1), with probability 1/4.
An original increment (0,-1) becomes the rotated increment (-1,1), with probability 1/4.
This implies that each coordinate of a rotated increment is +1 or -1 with probability 1/2. Moreover, for every choice of e1, e2 in {+1, -1}, the joint probability that the first coordinate is e1 and the second is e2 is 1/4, which equals the product of the marginals (1/2) * (1/2). Thus the two coordinate walks are independent.
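The factorization can be verified by enumerating the four equally likely increments. A minimal sketch (variable names are mine) using exact rational arithmetic:

```python
from itertools import product
from fractions import Fraction

# Original 2-D increments, each with probability 1/4.
steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
quarter = Fraction(1, 4)

# Rotated/scaled increments: u = x + y, v = x - y.
joint = {}
for (x, y) in steps:
    u, v = x + y, x - y
    joint[(u, v)] = joint.get((u, v), 0) + quarter

# Marginal distributions of u and v.
pu, pv = {}, {}
for (u, v), p in joint.items():
    pu[u] = pu.get(u, 0) + p
    pv[v] = pv.get(v, 0) + p

# Independence: the joint probability factorizes into the marginals.
for (u, v) in product(pu, pv):
    assert joint.get((u, v), 0) == pu[u] * pv[v]
print(pu, pv)  # each coordinate is +1 or -1 with probability 1/2
```

Every pair (e1, e2) has joint probability 1/4 = (1/2) * (1/2), exactly the factorization claimed above.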
We will next look at a few more lemmas in this case.

Thursday, August 14, 2014

Today I will discuss simple symmetric random walk in two dimensions. We can consider the same drunkard problem in two dimensions. Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners have coordinates (x,y) and the origin is (0,0). He moves east, west, north or south with equal probability of 1/4. When he reaches a corner, he moves again in the same manner.
This random walk takes values in Z^2 = Z x Z. Let E1, E2, ... be i.i.d. random vectors, each equal to one of the four unit displacements east, west, north or south with probability 1/4.
Because the increments are identically distributed, we may take every such random walk to start from the origin.
We describe the horizontal and the vertical displacements with separate notations, say x1 and x2 for horizontal and vertical respectively, so that the state of the walk is given by the sums of the increments in the two directions.
One lemma is that the horizontal and the vertical displacement processes are both random walks, yet they are not independent.
Since E1, E2, ... are independent, the sequence of their x-projections is i.i.d. (and likewise for the y-projections), with each projection equal to 0 with probability 1/2 and to 1 or -1 with probability 1/4 each.
So the horizontal states are a sum of i.i.d. random variables and hence a random walk. But it is not a simple random walk, because a step of zero has positive probability.
The same argument applies to vertical states.
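The projection probabilities, and the dependence between the two coordinates of a single step, can be checked by direct enumeration. A short sketch (names are mine):

```python
from fractions import Fraction

# The four equally likely steps of the 2-D walk.
steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
q = Fraction(1, 4)

p_x0 = sum(q for (x, y) in steps if x == 0)               # P(dx = 0) = 1/2
p_y0 = sum(q for (x, y) in steps if y == 0)               # P(dy = 0) = 1/2
p_xy0 = sum(q for (x, y) in steps if x == 0 and y == 0)   # P(dx = 0, dy = 0)

# The joint probability is 0, not the product 1/4, so the horizontal
# and vertical projections of a single step are dependent.
print(p_x0, p_y0, p_xy0)
```

Since no step moves both coordinates, P(dx = 0, dy = 0) = 0 differs from P(dx = 0) * P(dy = 0) = 1/4.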
Thus each projection is a random walk, but the two are not independent: a single step cannot move both coordinates at once, so the probability that both projections are zero in the same step is 0 rather than the product 1/2 * 1/2. Further, if we change coordinates by rotating the axes by 45 degrees and scaling, the state is described by the sum and the difference of the horizontal and vertical states, and the walks along those axes turn out to be independent.

Wednesday, August 13, 2014

Having discussed Splunk DB in the previous posts, I want to outline and discuss one of its key concepts and its implementation in this post: the time-series partitioning of data into buckets and the inverted time search over those buckets. There are other very powerful concepts, such as key-value map and reduce, which enable powerful analytics. We suggested earlier, and will follow up on, whether these could be externalized to NoSQL databases. In the meantime, I want to focus on just the time-series data, its persistence and search: essentially, whether we can avoid buckets.
Events that flow into Splunk are marked with a timestamp. The timestamp gives an order to the events, and in the absence of any other search query parameters, it is the timestamp that partitions the data.
The events need to be sorted by timestamp, yet events can arrive out of order. It's helpful to think of these events as precious data: we should not drop events if possible, so a sufficient window could be maintained to let the out-of-order events arrive. When the events arrive continuously, they can be streamed to a file or a table, and there is no need to split the file or the table logically. If we are concerned about reliability and fault tolerance, we can handle it externally. If we are concerned about a single point of maintenance, we can distribute it. If we are concerned about ACID properties, we can use a traditional relational database. Outwardly, there is nothing preventing us from viewing it as a single data store.
The ability to put an order on late-arriving events, and then to search back in time for query matching, is the consideration for this post. One way to do this is with a sliding-window protocol. While we cannot solve the problem of unlimited sequence numbers, we can hope to index and maintain events that arrive within a reasonable delay. Events delayed beyond that go to an auxiliary store, from which we can retry sequencing asynchronously. We can then work with fixed sequence numbers. For every event sequenced, the sliding window moves forward by one event. Sequencing an event means placing it in the right position. We avoid a general lookup and insertion by finding the minimum timestamp within the window and exchanging it with the element at the start of the window, doing so only if the element at the start is not already the minimum, and checking that the element just before the window is earlier than the element at the start. Then we advance the window forward by one event. Since we only examine the elements within the window and the one element before it, the index can wrap around, giving a fixed-size sequence number.
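The sequencing scheme can be sketched in a few lines. This is a toy illustration with integer timestamps, assuming every event arrives at most window-1 positions out of order (the function name, the window size, and the sample data are mine):

```python
def sequence_events(events, window=4):
    """Emit events in timestamp order using a fixed-size sliding window:
    swap the window's minimum to the front, emit it, slide forward."""
    buf = list(events[:window])
    out = []
    i = window
    while buf:
        # Find the earliest timestamp in the window and swap it to the
        # start, then emit it and slide the window forward by one event.
        m = min(range(len(buf)), key=lambda j: buf[j])
        buf[0], buf[m] = buf[m], buf[0]
        out.append(buf.pop(0))
        if i < len(events):
            buf.append(events[i])
            i += 1
    return out

print(sequence_events([3, 1, 2, 5, 4, 7, 6, 8]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Events delayed by more than the window would have to be diverted to the auxiliary store described above; this sketch omits that path.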

Tuesday, August 12, 2014

We return to the discussion of the ballot theorem. We said that if an urn contains a azure balls and b = n - a black balls, and we sample without replacement, then the probability that the azure balls stay ahead throughout the drawing is (a-b)/n, as long as a >= b. We prove this by drawing a parallel to the random walk: the drawing of an azure ball counts as +1 and the drawing of a black ball as -1. The event that azure stays ahead is precisely the event that S1, S2, ..., Sn are each positive (where Sn is the state of the walk at time n), given that the walk ends at Sn = y = a - b. From our equations for the random walk, this probability is y/n.
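The ballot probability can be checked exhaustively for small a and b. A sketch (the helper name ballot_prob is mine):

```python
from itertools import combinations
from math import comb

def ballot_prob(a, b):
    """Exact probability that azure stays strictly ahead throughout,
    by enumerating which draw positions the b black balls occupy."""
    n = a + b
    good = 0
    for blacks in combinations(range(n), b):
        black_set = set(blacks)
        s, ok = 0, True
        for i in range(n):
            s += -1 if i in black_set else 1
            if s <= 0:
                ok = False
                break
        good += ok
    return good / comb(n, b)

# Ballot theorem: the probability is (a - b) / (a + b).
print(ballot_prob(5, 3), (5 - 3) / (5 + 3))
```

The enumeration agrees with the closed form (a-b)/n for every small case tried.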
Next we consider the simple random walk with values in Z that is not necessarily symmetric. The probability of a +1 step is p and the probability of a -1 step is q = 1 - p.
The probability that the walk, started from S0 = 0, is at position k after n steps is given by

P(Sn = k) = (n choose (n+k)/2) * p^((n+k)/2) * q^((n-k)/2)

valid when n + k is even. Here we are simply counting paths: (n choose (n+k)/2) is the number of paths from (0,0) to (n,k), and each such path takes (n+k)/2 positive steps and (n-k)/2 negative steps, contributing the factor p^((n+k)/2) * q^((n-k)/2).
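The counting argument can be cross-checked against brute-force enumeration of all sign paths. A small sketch (the helper name p_at and the sample n, p are mine):

```python
from itertools import product
from math import comb

def p_at(n, k, p):
    """P(S_n = k) for an asymmetric walk with up-probability p, S_0 = 0."""
    if (n + k) % 2 != 0 or abs(k) > n:
        return 0.0
    up = (n + k) // 2
    return comb(n, up) * p ** up * (1 - p) ** (n - up)

# Cross-check against brute force over all 2^n sign paths for n = 8.
n, p = 8, 0.3
for k in range(-n, n + 1):
    brute = 0.0
    for path in product((1, -1), repeat=n):
        if sum(path) == k:
            ups = path.count(1)
            brute += p ** ups * (1 - p) ** (n - ups)
    assert abs(brute - p_at(n, k, p)) < 1e-12
print("formula matches enumeration")
```

The formula and the enumeration agree for every reachable k, including the zero probability when n + k is odd.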
The initial state S0 is independent of the increments that lead to Sn.
We can now look at the hitting time.
We will consider probabilistic and analytical methods to compute the distribution of the hitting time. The hitting time of a point x in Z is the first time n >= 0 at which Sn = x; it takes values in {0, 1, 2, ...} together with infinity.
For given x, y in Z, the probability that the hitting time of y starting from x equals t is the same as the probability that the hitting time of y - x starting from 0 equals t. In other words, the space is homogeneous: transitions depend only on relative displacement, so similar relative transitions give identically distributed hitting times. We use this to our advantage by taking all such random walks to start from zero, without losing any generality.
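The homogeneity claim lends itself to a quick Monte Carlo check. A sketch (the helper name hitting_time, the drift p = 0.6, and the step cap are mine; the cap only keeps the simulation finite):

```python
import random

def hitting_time(start, target, p, cap=100000):
    """First time a walk with up-probability p, started at `start`,
    reaches `target` (capped to keep the simulation finite)."""
    s, t = start, 0
    while s != target and t < cap:
        s += 1 if random.random() < p else -1
        t += 1
    return t

# Homogeneity: the hitting time of y from x has the same law as the
# hitting time of y - x from 0.  Compare empirical means with p = 0.6
# (upward drift, so the times are finite); both should be near
# (y - x) / (2p - 1) = 4 / 0.2 = 20.
random.seed(1)
a = [hitting_time(3, 7, 0.6) for _ in range(5000)]
b = [hitting_time(0, 4, 0.6) for _ in range(5000)]
print(sum(a) / len(a), sum(b) / len(b))
```

Both empirical means sit near 20, as homogeneity predicts for walks displaced by the same amount.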