Cluster computing

Monday, August 11, 2014

Today we look at the search functionality of Splunk.
Splunk Search as we saw earlier comes in differnent modes :
Cursored or Historical search
Batchmode search
Realtime search and
Indexed realtime search
and Search processors or their equivalent commands generally are of these types :
Streaming/Remote
Stateful Streaming
Events
Stream Reporting
Reporting
With Distributed search, we involve a few other players in these typical configurations:
search head with multiple peers
Multiple SH-s with multiple peers
SHs and clusters
The benefits are scaling, access control and geo-dispersed data
When a search is initiated, the request is transferred by the browser to the app server which in turn sends it to the backend splunkd The backend checks the system/user quota, bundle replication and creates the dispatch directory. The search is parsed into map and reduce At this point it can be distributed to peers. Peers can be a logical local peer or a remote peer. When the search request is received by a peer, it checks which bundle to use and initializes the conf subsystem based on the bundle, check indexes db and finds the buckets that are involved in the time range. With the help of the tsidx files the lexicon and the postings of those lexicons identifying the events where they occur help enumerate the matching events aka search results. The raw data from the compressed journal is read and processed into the search results. The activities performed, the parameters for the search and the results are all written out to the dispatch dir.

Splunk Databases:
We will discuss Splunk Database. This is a proprietary database.
Data is written to databases organized under home, thawed and cold database paths in a folder under the SPLUNK_HOME environment setting. The home path is thedefault path for the data to be written and is called the hot path. The sense oftemperature with the database paths is to indicate the actively used versus archivable data. Data in the cold path for example can be on cheaper storage.

These paths can be configured per index (an organizational unit that shares nothingwith others ). Each database can have several buckets of data and this isconfigurable. Buckets are time-series values with a min and a max timestamp. They hold three kinds of data. The first is a file with data extension that stores metadata such as Source/SourceType/Host metadata. The second is a TSIDX file (short for TimeSeries Index) that builds a lexicon of different words/fields extracted from the raw data and a posting of each occurrence for that word/field. They hold pointers to rawdata. Tsidx is an inverted index whose postings are stored in reverse-time order. It is optimized to execute search and return millions of events in reverse time order and interfaces with the higher layers / Lispy language. Finally, the bucket contains the rawdata. A compressed rawdata journal stores raw events in an efficient format. It contains enough information to rebuild the entire bucket.

In the bucket lifecycle, the data rolls over from the hot->warm->cold->frozen buckets.

Search is on a bucket by bucket basis. Buckets may overlap in time when searchrequests events to be returned in reverse time order. Set the cursor for each bucketto be the latest time of that bucket, read a chunk of data from bucket with latestcursor and move cursor to the latest time of unread data from the bucket.

Occassionally, streaming data can create different small time-series index files which make search inefficient. So they are consolidated into larger files with splunk-optimize command.

In addition, each database has a BucketManifest file. This file just lists the metadata for each bucket namely the id, path, raw_size, host_count, source_count,sourcetype_count, size_on_disk, modtime, frozen_in_cluster etc. Buckets are internalto the database.

The Database uses the buckets to retrieve values to satisfy a query. The values arewrapped to include metadata namely dbid, event_address, host, type, source,sourcetype, and timestamp. When filtering a query results the data has to match thetimestamp range.

Database and Bucket Partition Policy:

Policies applied to database are maintained by this class. It has a set of properties for the database and starts out with initialized values for these properties.

The properties include counts of different buckets, their allowed sizes, theiroptimization time, etc.

While the DatabaseDirectoryManager is mostly read only operations on thedatabase, the DatabasePartitionPolicy enables writes to the database bydetermining the database from the specified index in the event and updating the time range of the database with the event time, if necessary.

To keep the events sorted in time order, we could have a buffer to hold delayed and out of order events which we can sequence with a sliding window:
        public static void Add(this int[] events, int seq )
        {
            events[events.Length + 1] = seq;
            int start = events.Length - Lookahead > 1 ? events.Length - Lookahead : 1
            int end = events.Length;
            int min = events.Where((x,i) => i >= start && i <= end).Min();
int index = events.Where((x,i) => i >= start && i <= end).IndexOf(min);
            events[start-1] = min;
            for (int i = index; i < end; i++) events[i] = events[i+1];
            events.RemoveAt(end);
        }

Sunday, August 10, 2014

Today we will continue our discussion with Ballot theorem. In the previous post we mentioned the theorem. In this post, we mention the conditional probabilities. Out of an urn with n balls the sampling followed a random walk. The probability that given S0 = 0 and Sn = y > 0, the random walk never becomes zero up to time n was given by
Po ( S1 > 0 , S2 > 0 , ... Sn > 0 | Sn = y ) = y/n
Now we say that if n,y are both positive or zero, and are both even or both odd then,
then the same probability is (y + 1) / 1/2 ( n + y ) + 1
In particular, if n is even,
P0 (S1 >= 0, ... Sn >= 0 ) = 1 / (n/2) + 1
We use the corollary we discussed earlier to prove this :
The number of paths from (0,0) to (n,y) that remain above the horizontal axis is given by
n Choose (n+y) / 2 - n Choose ((n+y)/2 + 1)
The number of paths from (0,0) to (n,y) = n Choose (n+y)/2
so the probability we are interested can be expressed as a conditional P0 (Sn >= 0, ..., Sn >= 0, Sn = y) / P0 (Sn = y) = The number of paths from (0,0) to (n,y) that remain above the horizontal axis / The total number of paths from (0,0) to (n,y)
and by substituting the expressions from above for the number of paths and eliminating the common terms we prove the probability above.

Saturday, August 9, 2014

Today I wrote a document with a draft for a book. That said we will continue our discussion on Ballot theorem and probability from simple random walk. Ballot theorem was originally used with urns. Suppose that there is an urn with a azure balls and b black balls totalling n balls. A sequence of a balls when drawn has a uniform distribution. The ballot theorem originally proposed in 1887 posed a question this way. If there is such an urn and there is sampling without replacement what is the pro ability that the azure balls will lead the black ones for some positive finite and restricted draw. The problem was solved the same year and the probability was found to be (a-b)/n as long as a > b

Today we discuss the Ballot theorem as an extension of the path counting and probability estimation in simple random walks that we have been discussing in previous posts. This theorem is stated as follows:
The probability that S0 = 0 and Sn = y > 0, the random walk never becomes zero upto time n is given by (y/n)

Its written as P0 ( S1 > 0, S2 > 0, ... Sn > 0 | Sn = y ) = y / n

We rephrase the left hand side of this equation as the ratio of the probability for (1,1) to (n,y) to be above the horizontal axis to the probability for (0,0 to (n,y) to be above the horizontal axis.

Substituting m = 1 x = 1 and m = 0 x = 0 for the two probabilities, we use the equation for
probability as the ratio of the count of paths between (m,x) to (n,y) as ( n - m) Choose ( n - m + y - x ) / 2 to the total possible paths as 2 ^ n.

and taking out the common terms in the expansion of the n Choose r as n!/ r!(n-r)! we get the result as (y/n)

Thursday, August 7, 2014

Today we will continue our discussion on counting paths in random walks. In the previous posts, we discussed the number of paths from (0,0) to (n,y) where y >= 0 and remain above the horizontal axis. We also counted the number of paths from (0,0) to (n,0) that remain above the horizontal axis. as well as the number of paths that start from (0, 0) and have length n and remain above the horizontal axis.
With the paths of length n that start at (0, 0) and remain above the horizontal axis, we showed that they represent the shaded region in the lower right half triangle above the horizontal axis. By reflection, the same holds true for below the horizontal axis. However, what we went ahead and in proving but did not conjecture apriori is that the paths of Ilength n starting at (0,0) that remain above the horizontal axis is the the sum of the paths of length n starting at (0, 0) and ending at (n, 0) both above and below the horizontal axis. When the shaded area is represented in the graph, we can see by symmetry and reflection how this is self-evident.
We will next look at probability calculations.
Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae:
The probability distribution after n steps is as follows:
P0(Sn = y) = (n Choose (n+y)/2) times (2 ^ -n)
Pn (Sn = y | Sm = x) = ( n-m Choose ((n - m + y - x) / 2) times 2 ^ (-n+m)
These equations are written based on the definitions of the probability as a ratio of paths satisfying an event to the total possible paths as in a grid. We have introduced this in earlier posts. In particular with the above two equations if we take n as even, then
u (n) = P0 (Sn = 0) = (n Choose n/2) times 2 ^ (-n).
The proof is described as below :
For the first equation we take the fraction of the number of paths from (0,0) to (n, y ) to 2^n
For the second equation we take the fraction of the number of paths from (m, x) to (n, y) to 2 ^ ( n -m).

The number of paths between two points (m, x) and (n, y) is given by (n-m) Choose (n-m+y-x )/ 2 ^ (n-m)
Substituting m = 0 and x = 0 we get equation 1
The two equations are thus proved.
Courtesy : Konstantopoulous mcrw.pdf

Wednesday, August 6, 2014

In this post, we continue our discussion on counting paths in single dimension simple random walks. Tonight we look at Catalan numbers. The number of paths from (0, 0) to (n, 0) that remain >= 0 is given by 1 / (n+1) times (n + 1) Choose (n/2)
How do we prove this? We take the formula we described in the previous post for finding the number of paths from (0,0) to (n, y), where y >= 0, that remain >= 0. And in this formula we substitute y=0
That number of paths was given by n Choose (n + Y) / 2 - n Choose (( n + y ) / 2 + 1)
And substituting y = 0 we get
N Choose n/2 - n Choose (n/2 - 1)
We can expand the definition of the combination n Choose r = n! / r! (N-r)!
And we will reduce the result to 1/(n+1) times (n+1) Choose n/2
We look at another corollary as well. The number of paths that start from (0,0), have length n, and remain non-negative is given by :

(n Choose ceil(n/2)) where ceil(x) denotes the smallest integer that is just greater than or equal to x.

This is different from the other corollaries we have mentioned so far. This one talks about paths that have length n. We show that their number is (n Choose ceil(n/2)) this way.

We take the number of paths we computed earlier in this post for paths from (0,0) to (n,y) that don’t touch the horizontal axis as = n Choose (n+y)/2 - n Choose ((n+y)/2 + 1)

and then we sum over all y >= 0.

If n is even (2m) then we need to take y as even (2z) otherwise the formula above is equal to zero, and we have the number of desired paths

Sum z >= 0 ( 2m Choose (m+z) - 2m Choose (m+z+1))

which we can expand the series by listing each z >= 0 and summing it.

We will see that adjacent terms cancel out and we will get the result as the first term = 2m Choose m which is n Choose n/2

If n is odd (2m +1) then we need to take y as odd (2z + 1) otherwise the formula is again equal to zero, and we have

the number of paths from (0,0) to (2m+1, any) where paths remain above horizontal) =

Sum ( z >= 0 [ (2m + 1) Choose (m+z+1) - (2m + 1) Choose (m+z+2)]

and again we expand this series for each term z >= 0 , we see that the adjacent terms cancel out leaving us with only the first term as 2m+1 Choose m+1 which is

(n Choose ceil(n/2))