Cluster computing

Tuesday, August 12, 2014

We return to the discussion on Ballot theorem. We said that if the urn contains a azure balls and b = n - a black balls then if we are given that there is sampling without replacement, then the probability that azure balls come out ahead is (a-b)/n as long as a >= b. We prove this by drawing similarity to the random walk and translating it to a question about random numbers. The drawing of the azure ball can be considered positive and the drawing of the black ball can be considered negative.The sampling is precisely the event that the S0, S1, S2, ..., Sn are each positive (where Sn is the state of the walk at time n) And the random walk remains positive y. This we know from our equations for random walk to be y/n.
Next we consider asymmetric simple random walk with values in Z but which is not necessarily symmetric. The probability for a positive sign is taken as p and that for a negative sign is taken as 1-p, the displacements belong to Z.
Probability for the distribution with n steps and the initial state S0 to remain positive k is given by
n Choose (n+k) / 2 p ^ (n+k)/2 q ^ (n-k) /2
Here we are simply counting the paths. The numerator n Choose (n+k)/2 is the number of paths from (0,0) to (n,k) And the total number of paths is given by the inverse of the subsequent terms which are for positive and negative walk.
The initial state S0 is independent of the increments that lead to Sn.
We can now look at the hitting time -
We will now look at the probabilistic and analytical methods to compute the distribution of the hitting time The hitting time can be expressed as a infinite series of iterations n where n >= 0 for the final state Sn to arrive at x in Z and it takes values 0,1, 2 .. infinity
For a given x, y in displacement space Z, we know that the probability for the hitting time to arrive at y starting from x to be equal to t is the same probability for the hitting time to arrive at y-x from start to be equal to t. In other words, the space is homogenous and the relative transitions between any two points if similar, will result in similar hitting times. We use this to our advantage by considering all such random walks to start from zero without losing any generality.

Monday, August 11, 2014

Today we look at the search functionality of Splunk.
Splunk Search as we saw earlier comes in differnent modes :
Cursored or Historical search
Batchmode search
Realtime search and
Indexed realtime search
and Search processors or their equivalent commands generally are of these types :
Streaming/Remote
Stateful Streaming
Events
Stream Reporting
Reporting
With Distributed search, we involve a few other players in these typical configurations:
search head with multiple peers
Multiple SH-s with multiple peers
SHs and clusters
The benefits are scaling, access control and geo-dispersed data
When a search is initiated, the request is transferred by the browser to the app server which in turn sends it to the backend splunkd The backend checks the system/user quota, bundle replication and creates the dispatch directory. The search is parsed into map and reduce At this point it can be distributed to peers. Peers can be a logical local peer or a remote peer. When the search request is received by a peer, it checks which bundle to use and initializes the conf subsystem based on the bundle, check indexes db and finds the buckets that are involved in the time range. With the help of the tsidx files the lexicon and the postings of those lexicons identifying the events where they occur help enumerate the matching events aka search results. The raw data from the compressed journal is read and processed into the search results. The activities performed, the parameters for the search and the results are all written out to the dispatch dir.

Splunk Databases:
We will discuss Splunk Database. This is a proprietary database.
Data is written to databases organized under home, thawed and cold database paths in a folder under the SPLUNK_HOME environment setting. The home path is thedefault path for the data to be written and is called the hot path. The sense oftemperature with the database paths is to indicate the actively used versus archivable data. Data in the cold path for example can be on cheaper storage.

These paths can be configured per index (an organizational unit that shares nothingwith others ). Each database can have several buckets of data and this isconfigurable. Buckets are time-series values with a min and a max timestamp. They hold three kinds of data. The first is a file with data extension that stores metadata such as Source/SourceType/Host metadata. The second is a TSIDX file (short for TimeSeries Index) that builds a lexicon of different words/fields extracted from the raw data and a posting of each occurrence for that word/field. They hold pointers to rawdata. Tsidx is an inverted index whose postings are stored in reverse-time order. It is optimized to execute search and return millions of events in reverse time order and interfaces with the higher layers / Lispy language. Finally, the bucket contains the rawdata. A compressed rawdata journal stores raw events in an efficient format. It contains enough information to rebuild the entire bucket.

In the bucket lifecycle, the data rolls over from the hot->warm->cold->frozen buckets.

Search is on a bucket by bucket basis. Buckets may overlap in time when searchrequests events to be returned in reverse time order. Set the cursor for each bucketto be the latest time of that bucket, read a chunk of data from bucket with latestcursor and move cursor to the latest time of unread data from the bucket.

Occassionally, streaming data can create different small time-series index files which make search inefficient. So they are consolidated into larger files with splunk-optimize command.

In addition, each database has a BucketManifest file. This file just lists the metadata for each bucket namely the id, path, raw_size, host_count, source_count,sourcetype_count, size_on_disk, modtime, frozen_in_cluster etc. Buckets are internalto the database.

The Database uses the buckets to retrieve values to satisfy a query. The values arewrapped to include metadata namely dbid, event_address, host, type, source,sourcetype, and timestamp. When filtering a query results the data has to match thetimestamp range.

Database and Bucket Partition Policy:

Policies applied to database are maintained by this class. It has a set of properties for the database and starts out with initialized values for these properties.

The properties include counts of different buckets, their allowed sizes, theiroptimization time, etc.

While the DatabaseDirectoryManager is mostly read only operations on thedatabase, the DatabasePartitionPolicy enables writes to the database bydetermining the database from the specified index in the event and updating the time range of the database with the event time, if necessary.

To keep the events sorted in time order, we could have a buffer to hold delayed and out of order events which we can sequence with a sliding window:
        public static void Add(this int[] events, int seq )
        {
            events[events.Length + 1] = seq;
            int start = events.Length - Lookahead > 1 ? events.Length - Lookahead : 1
            int end = events.Length;
            int min = events.Where((x,i) => i >= start && i <= end).Min();
int index = events.Where((x,i) => i >= start && i <= end).IndexOf(min);
            events[start-1] = min;
            for (int i = index; i < end; i++) events[i] = events[i+1];
            events.RemoveAt(end);
        }

Sunday, August 10, 2014

Today we will continue our discussion with Ballot theorem. In the previous post we mentioned the theorem. In this post, we mention the conditional probabilities. Out of an urn with n balls the sampling followed a random walk. The probability that given S0 = 0 and Sn = y > 0, the random walk never becomes zero up to time n was given by
Po ( S1 > 0 , S2 > 0 , ... Sn > 0 | Sn = y ) = y/n
Now we say that if n,y are both positive or zero, and are both even or both odd then,
then the same probability is (y + 1) / 1/2 ( n + y ) + 1
In particular, if n is even,
P0 (S1 >= 0, ... Sn >= 0 ) = 1 / (n/2) + 1
We use the corollary we discussed earlier to prove this :
The number of paths from (0,0) to (n,y) that remain above the horizontal axis is given by
n Choose (n+y) / 2 - n Choose ((n+y)/2 + 1)
The number of paths from (0,0) to (n,y) = n Choose (n+y)/2
so the probability we are interested can be expressed as a conditional P0 (Sn >= 0, ..., Sn >= 0, Sn = y) / P0 (Sn = y) = The number of paths from (0,0) to (n,y) that remain above the horizontal axis / The total number of paths from (0,0) to (n,y)
and by substituting the expressions from above for the number of paths and eliminating the common terms we prove the probability above.

Saturday, August 9, 2014

Today I wrote a document with a draft for a book. That said we will continue our discussion on Ballot theorem and probability from simple random walk. Ballot theorem was originally used with urns. Suppose that there is an urn with a azure balls and b black balls totalling n balls. A sequence of a balls when drawn has a uniform distribution. The ballot theorem originally proposed in 1887 posed a question this way. If there is such an urn and there is sampling without replacement what is the pro ability that the azure balls will lead the black ones for some positive finite and restricted draw. The problem was solved the same year and the probability was found to be (a-b)/n as long as a > b

Today we discuss the Ballot theorem as an extension of the path counting and probability estimation in simple random walks that we have been discussing in previous posts. This theorem is stated as follows:
The probability that S0 = 0 and Sn = y > 0, the random walk never becomes zero upto time n is given by (y/n)

Its written as P0 ( S1 > 0, S2 > 0, ... Sn > 0 | Sn = y ) = y / n

We rephrase the left hand side of this equation as the ratio of the probability for (1,1) to (n,y) to be above the horizontal axis to the probability for (0,0 to (n,y) to be above the horizontal axis.

Substituting m = 1 x = 1 and m = 0 x = 0 for the two probabilities, we use the equation for
probability as the ratio of the count of paths between (m,x) to (n,y) as ( n - m) Choose ( n - m + y - x ) / 2 to the total possible paths as 2 ^ n.

and taking out the common terms in the expansion of the n Choose r as n!/ r!(n-r)! we get the result as (y/n)

Thursday, August 7, 2014

Today we will continue our discussion on counting paths in random walks. In the previous posts, we discussed the number of paths from (0,0) to (n,y) where y >= 0 and remain above the horizontal axis. We also counted the number of paths from (0,0) to (n,0) that remain above the horizontal axis. as well as the number of paths that start from (0, 0) and have length n and remain above the horizontal axis.
With the paths of length n that start at (0, 0) and remain above the horizontal axis, we showed that they represent the shaded region in the lower right half triangle above the horizontal axis. By reflection, the same holds true for below the horizontal axis. However, what we went ahead and in proving but did not conjecture apriori is that the paths of Ilength n starting at (0,0) that remain above the horizontal axis is the the sum of the paths of length n starting at (0, 0) and ending at (n, 0) both above and below the horizontal axis. When the shaded area is represented in the graph, we can see by symmetry and reflection how this is self-evident.
We will next look at probability calculations.
Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae:
The probability distribution after n steps is as follows:
P0(Sn = y) = (n Choose (n+y)/2) times (2 ^ -n)
Pn (Sn = y | Sm = x) = ( n-m Choose ((n - m + y - x) / 2) times 2 ^ (-n+m)
These equations are written based on the definitions of the probability as a ratio of paths satisfying an event to the total possible paths as in a grid. We have introduced this in earlier posts. In particular with the above two equations if we take n as even, then
u (n) = P0 (Sn = 0) = (n Choose n/2) times 2 ^ (-n).
The proof is described as below :
For the first equation we take the fraction of the number of paths from (0,0) to (n, y ) to 2^n
For the second equation we take the fraction of the number of paths from (m, x) to (n, y) to 2 ^ ( n -m).

The number of paths between two points (m, x) and (n, y) is given by (n-m) Choose (n-m+y-x )/ 2 ^ (n-m)
Substituting m = 0 and x = 0 we get equation 1
The two equations are thus proved.
Courtesy : Konstantopoulous mcrw.pdf