Wednesday, August 13, 2014

Having discussed the Splunk database in previous posts, I want to outline one of its key concepts and its implementation today: the time-series partitioning of data into buckets, and the reverse-time search over these buckets. There are other very powerful concepts, such as the key-value map and reduce that enable powerful analytics, and we raised the question of whether these could be externalized to NoSQL databases; we will follow up on that. In the meantime, I want to focus on just the time-series data, its persistence and search - essentially, whether we can avoid buckets.
Events that flow into Splunk are marked with a timestamp. The timestamp gives an order to the events. In the absence of any other search query parameters, the timestamp is what partitions the data correctly.
The events need to be sorted by timestamp. However, events can arrive out of order. It's helpful to think of these events as precious data: we should not drop events if we can avoid it. So we could maintain a window wide enough to let the out-of-order events arrive. As events arrive continuously, they can be streamed to a file or a table, and there is no need to split the file or the table logically. If we are concerned about reliability and fault tolerance, we can handle it externally. If we are concerned about a single point of maintenance, we can distribute it. If we are concerned about ACID properties, we can use a traditional relational database. Outwardly, there is nothing preventing us from viewing it as a singular data store.
The ability to put late-arriving events in order, and then to search backwards in time for query matching, is the consideration for this post. One way to do this is with a sliding window protocol. While we cannot solve the problem of unlimited sequence numbers, we can hope to index and maintain events that arrive with a reasonable delay. If we cannot control that, we specify an auxiliary store for such overly delayed events, which we can later sequence asynchronously. We can do this with fixed sequence numbers. For every event sequenced, the sliding window moves forward by one event. Sequencing an event involves inserting it in the right place. Typically we avoid the lookup and the insertion by copying the minimum timestamp within the window to the start position of the sliding window, exchanging it with the element at the start. We do this only if the element at the start is not already the minimum, and we also check that the element before the start is earlier than the element at the start. Then we advance the sliding window forward by one event. Since we check only the elements within the window and the element just before its start, we can wrap the index around and use a fixed-size sequence number.
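As a minimal sketch of the wrap-around idea (the capacity, the Lookahead constant, and the names below are my own assumptions for illustration, not Splunk internals), the window can live in a fixed-size circular buffer:

        using System;

        // A toy sequencer: a fixed-size circular buffer whose trailing window
        // is kept min-at-start, so indexes can wrap around safely. Slots older
        // than the window are assumed to have been consumed already.
        class RingSequencer
        {
            private const int Capacity = 64;  // fixed sequence-number space (assumed)
            private const int Lookahead = 8;  // tolerated out-of-order delay (assumed)
            private readonly int[] buffer = new int[Capacity];
            private int count;                // total events accepted so far

            public void Add(int timestamp)
            {
                buffer[count % Capacity] = timestamp;       // index wraps around
                int start = Math.Max(count - Lookahead + 1, 0);
                // find the minimum timestamp within the trailing window
                int minPos = start;
                for (int i = start + 1; i <= count; i++)
                    if (buffer[i % Capacity] < buffer[minPos % Capacity]) minPos = i;
                // exchange it with the element at the window start, but only
                // if the start element is not already the minimum
                if (minPos != start)
                {
                    int tmp = buffer[start % Capacity];
                    buffer[start % Capacity] = buffer[minPos % Capacity];
                    buffer[minPos % Capacity] = tmp;
                }
                count++;  // the sliding window advances by one event
            }
        }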

Tuesday, August 12, 2014

We return to the discussion of the Ballot theorem. We said that if the urn contains a azure balls and b = n - a black balls, and sampling is done without replacement, then the probability that the azure balls stay ahead throughout the draw is (a-b)/n, as long as a > b. We prove this by drawing a correspondence to the random walk. Drawing an azure ball can be counted as a +1 step and drawing a black ball as a -1 step. The event that the azure balls stay ahead is then precisely the event that S1, S2, ..., Sn are each positive (where Sn is the state of the walk at time n), conditioned on the walk ending at Sn = a - b = y. This we know from our equations for the random walk to be y/n.
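Spelled out in LaTeX, with X_i denoting the i-th draw, the correspondence is:

    X_i = +1 \text{ if the } i\text{-th ball is azure}, \quad X_i = -1 \text{ if black}, \quad S_n = X_1 + \cdots + X_n = a - b = y

    P_0(S_1 > 0, S_2 > 0, \ldots, S_n > 0 \mid S_n = y) = \frac{y}{n} = \frac{a-b}{a+b}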
Next we consider the asymmetric simple random walk with values in Z. The probability of a positive step is taken as p and that of a negative step as q = 1 - p; the displacements belong to Z.
The probability that a walk of n steps from the initial state S0 = 0 ends at state k is given by
n Choose ((n+k)/2)  *  p ^ ((n+k)/2)  *  q ^ ((n-k)/2)
Here we are simply counting the paths. The factor n Choose ((n+k)/2) is the number of paths from (0,0) to (n,k), since such a path must make (n+k)/2 positive steps and (n-k)/2 negative steps; each such path then has probability p^((n+k)/2) q^((n-k)/2), the product of the probabilities of its positive and negative steps. For example, with n = 4 and k = 2 there are 4 Choose 3 = 4 paths, each of probability p^3 q.
The initial state S0 is independent of the increments that lead to Sn.
We can now look at the hitting time -
We will now look at probabilistic and analytical methods to compute the distribution of the hitting time. The hitting time to a state x in Z is the first iteration n >= 0 at which the walk Sn arrives at x, and it takes values in {0, 1, 2, ..., infinity} (infinity if the walk never reaches x).
For given x, y in the displacement space Z, we know that the probability that the hitting time of y starting from x equals t is the same as the probability that the hitting time of y - x starting from 0 equals t. In other words, the space is homogeneous: the transition mechanism between any two points is the same, so equal relative displacements yield the same hitting times. We use this to our advantage by considering all such random walks to start from zero, without losing any generality.
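In symbols, the definition and the homogeneity property read:

    T_x := \inf \{ n \ge 0 : S_n = x \} \in \{0, 1, 2, \ldots\} \cup \{\infty\}

    P_x(T_y = t) = P_0(T_{y-x} = t) \quad \text{for all } x, y \in \mathbb{Z}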

Monday, August 11, 2014

Today we look at the search functionality of Splunk.
Splunk Search, as we saw earlier, comes in different modes:
Cursored or Historical search
Batchmode search
Realtime search and
Indexed realtime search
and search processors, or their equivalent commands, generally are of these types:
Streaming/Remote
Stateful Streaming
Events
Stream Reporting
Reporting
With Distributed search, we involve a few other players in these typical configurations:
A search head with multiple peers
Multiple search heads (SHs) with multiple peers
SHs and clusters
The benefits are scaling, access control, and geo-dispersed data.
When a search is initiated, the request is transferred by the browser to the app server, which in turn sends it to the backend splunkd. The backend checks the system/user quota and bundle replication, and creates the dispatch directory. The search is parsed into map and reduce phases, at which point it can be distributed to peers. A peer can be a logical local peer or a remote peer. When the search request is received by a peer, it checks which bundle to use, initializes the conf subsystem based on that bundle, checks the indexes db, and finds the buckets involved in the time range. With the help of the tsidx files, the lexicon and the postings that identify the events where each term occurs enumerate the matching events, aka the search results. The raw data from the compressed journal is then read and processed into the search results. The activities performed, the parameters for the search, and the results are all written out to the dispatch directory.
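As a small sketch of the time-range step above (the Bucket shape and names are assumptions for illustration, not Splunk's actual code):

        using System;
        using System.Collections.Generic;
        using System.Linq;

        // A hypothetical bucket descriptor: each bucket spans [MinTime, MaxTime].
        record Bucket(string Path, DateTime MinTime, DateTime MaxTime);

        static class BucketSelector
        {
            // Return only the buckets whose time span overlaps the query's
            // range; these are the ones a peer needs to open and search.
            public static IEnumerable<Bucket> ForTimeRange(
                    IEnumerable<Bucket> buckets, DateTime earliest, DateTime latest) =>
                buckets.Where(b => b.MaxTime >= earliest && b.MinTime <= latest);
        }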

Splunk Databases: 
We will discuss the Splunk database. This is a proprietary database. 
Data is written to databases organized under home, thawed and cold database paths in a folder under the SPLUNK_HOME environment setting. The home path is the default path for the data to be written and is called the hot path. The sense of temperature with the database paths is to indicate actively used versus archivable data. Data in the cold path, for example, can be on cheaper storage. 

These paths can be configured per index (an organizational unit that shares nothing with others). Each database can have several buckets of data and this is configurable. Buckets are time-series values with a min and a max timestamp. They hold three kinds of data. The first is a file with a .data extension that stores metadata such as Source/SourceType/Host metadata. The second is a TSIDX file (short for Time Series Index) that builds a lexicon of the different words/fields extracted from the raw data and a posting of each occurrence of that word/field; the postings hold pointers into the rawdata. Tsidx is an inverted index whose postings are stored in reverse-time order. It is optimized to execute searches and return millions of events in reverse time order, and it interfaces with the higher layers / Lispy language. Finally, the bucket contains the rawdata itself. A compressed rawdata journal stores raw events in an efficient format and contains enough information to rebuild the entire bucket. 
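To make the lexicon/postings idea concrete, here is a toy inverted index in the same spirit (the types and names are my own illustration, not Splunk's on-disk format):

        using System.Collections.Generic;
        using System.Linq;

        // A toy tsidx: each term maps to a postings list of event addresses
        // (pointers into the rawdata), kept in reverse-time order.
        class TinyTsidx
        {
            private readonly Dictionary<string, List<long>> lexicon = new();

            // Events are indexed in time order, so inserting each new posting
            // at the front keeps every list newest-first (reverse-time order).
            public void Index(string term, long eventAddress)
            {
                if (!lexicon.TryGetValue(term, out var postings))
                    lexicon[term] = postings = new List<long>();
                postings.Insert(0, eventAddress);
            }

            // Look up a term; the result enumerates matching events newest-first.
            public IEnumerable<long> Lookup(string term) =>
                lexicon.TryGetValue(term, out var postings)
                    ? postings
                    : Enumerable.Empty<long>();
        }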

In the bucket lifecycle, the data rolls over from the hot->warm->cold->frozen buckets. 

Search proceeds on a bucket-by-bucket basis. Buckets may overlap in time when a search requests events to be returned in reverse time order. The approach is to set the cursor for each bucket to the latest time of that bucket, read a chunk of data from the bucket with the latest cursor, and move that cursor to the latest time of the unread data remaining in the bucket. 
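Here is a minimal sketch of that cursor merge (each bucket is modeled as a queue of timestamps already in reverse-time order; a real implementation would read chunks rather than single events, as described above):

        using System;
        using System.Collections.Generic;

        static class ReverseTimeMerge
        {
            // The cursor of a bucket is the timestamp of its next unread event.
            // Repeatedly emit from the bucket whose cursor is latest, which
            // yields a single stream in global reverse-time order.
            public static IEnumerable<DateTime> Merge(List<Queue<DateTime>> buckets)
            {
                while (true)
                {
                    Queue<DateTime> next = null;
                    foreach (var b in buckets)
                        if (b.Count > 0 && (next == null || b.Peek() > next.Peek()))
                            next = b;             // bucket with the latest cursor
                    if (next == null) yield break;
                    yield return next.Dequeue();  // moves that bucket's cursor back
                }
            }
        }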

Occasionally, streaming data can create many small time-series index files, which makes search inefficient, so they are consolidated into larger files with the splunk-optimize command. 

In addition, each database has a BucketManifest file. This file lists the metadata for each bucket, namely the id, path, raw_size, host_count, source_count, sourcetype_count, size_on_disk, modtime, frozen_in_cluster, etc. Buckets are internal to the database. 

The database uses the buckets to retrieve values that satisfy a query. The values are wrapped to include metadata, namely dbid, event_address, host, type, source, sourcetype, and timestamp. When filtering query results, the data has to match the timestamp range. 
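A minimal sketch of such a wrapped value and the timestamp-range filter (the field types are assumptions):

        using System;
        using System.Collections.Generic;
        using System.Linq;

        // A hypothetical wrapper carrying the metadata listed above.
        record WrappedEvent(string DbId, long EventAddress, string Host,
                            string Type, string Source, string SourceType,
                            DateTime Timestamp);

        static class ResultFilter
        {
            // Keep only the results whose timestamp falls within the query range.
            public static IEnumerable<WrappedEvent> ByTimeRange(
                    IEnumerable<WrappedEvent> results, DateTime earliest, DateTime latest) =>
                results.Where(e => e.Timestamp >= earliest && e.Timestamp <= latest);
        }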



Database and Bucket Partition Policy: 

Policies applied to a database are maintained by the DatabasePartitionPolicy class. It has a set of properties for the database and starts out with initialized values for these properties. 

The properties include the counts of different buckets, their allowed sizes, their optimization time, etc. 

While the DatabaseDirectoryManager handles mostly read-only operations on the database, the DatabasePartitionPolicy enables writes to the database by determining the database from the index specified in the event and updating the time range of the database with the event time, if necessary. 
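A hypothetical sketch of that write path (the class shapes below are my own and only mirror the description):

        using System;
        using System.Collections.Generic;

        class Database
        {
            public DateTime MinTime = DateTime.MaxValue;
            public DateTime MaxTime = DateTime.MinValue;
            public readonly List<(DateTime Time, string Raw)> Events = new();
        }

        class DatabasePartitionPolicy
        {
            private readonly Dictionary<string, Database> byIndex = new();

            // Determine the database from the event's index, then widen that
            // database's time range to cover the event time, if necessary.
            public void Write(string index, DateTime time, string raw)
            {
                if (!byIndex.TryGetValue(index, out var db))
                    byIndex[index] = db = new Database();
                if (time < db.MinTime) db.MinTime = time;
                if (time > db.MaxTime) db.MaxTime = time;
                db.Events.Add((time, raw));
            }
        }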

To keep the events sorted in time order, we could have a buffer to hold delayed and out-of-order events, which we can sequence with a sliding window:
        using System;
        using System.Collections.Generic;

        public static class EventBuffer
        {
            private const int Lookahead = 8;  // tolerated out-of-order delay

            // Append a new event timestamp and sift it backwards into its
            // sorted position, looking back no further than the window start.
            // (A List<int> replaces the original int[] so the buffer can grow.)
            public static void Add(this List<int> events, int seq)
            {
                events.Add(seq);
                int start = Math.Max(events.Count - Lookahead, 0);
                for (int i = events.Count - 1; i > start && events[i] < events[i - 1]; i--)
                {
                    int tmp = events[i];          // swap with the earlier element
                    events[i] = events[i - 1];
                    events[i - 1] = tmp;
                }
            }
        }

Sunday, August 10, 2014

Today we will continue our discussion of the Ballot theorem. In the previous post we stated the theorem; in this post, we work out the conditional probabilities. Out of an urn with n balls, the sampling followed a random walk. The probability that, given S0 = 0 and Sn = y > 0, the random walk never becomes zero up to time n was given by
P0 ( S1 > 0 , S2 > 0 , ..., Sn > 0 | Sn = y ) = y/n
Now we say that if n and y are both nonnegative and of the same parity (both even or both odd), then the corresponding probability with weak inequalities is
P0 ( S1 >= 0 , S2 >= 0 , ..., Sn >= 0 | Sn = y )  =  (y + 1) / ( (n + y)/2 + 1 )
In particular, if n is even,
            P0 (S1 >= 0, ..., Sn >= 0 )  = 1 / ( (n/2) + 1 )
We use the corollary we discussed earlier to prove this :
The number of paths from (0,0) to (n,y) that never go below the horizontal axis (i.e., remain >= 0) is given by
n Choose ((n+y)/2)  -  n Choose ((n+y)/2 + 1)
The total number of paths from (0,0) to (n,y) is n Choose ((n+y)/2)
so the probability we are interested in can be expressed as a conditional: P0 (S1 >= 0, ..., Sn >= 0, Sn = y) / P0 (Sn = y) = (the number of paths from (0,0) to (n,y) that never go below the horizontal axis) / (the total number of paths from (0,0) to (n,y)),
and by substituting the expressions above for the numbers of paths and eliminating the common terms, we prove the probability above.
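The elimination step, written out in LaTeX:

    \frac{\binom{n}{\frac{n+y}{2}} - \binom{n}{\frac{n+y}{2}+1}}{\binom{n}{\frac{n+y}{2}}}
      = 1 - \frac{\binom{n}{\frac{n+y}{2}+1}}{\binom{n}{\frac{n+y}{2}}}
      = 1 - \frac{(n-y)/2}{(n+y)/2 + 1}
      = \frac{y+1}{(n+y)/2 + 1}

using the ratio \binom{n}{r+1} / \binom{n}{r} = (n-r)/(r+1) with r = (n+y)/2.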


Saturday, August 9, 2014

Today I wrote a draft document for a book. That said, we will continue our discussion of the Ballot theorem and probability from the simple random walk. The Ballot theorem was originally stated with urns. Suppose that there is an urn with a azure balls and b black balls, totalling n balls. Any sequence of the n balls, when drawn, has a uniform distribution. The ballot problem, originally proposed in 1887, posed the question this way: if there is such an urn and there is sampling without replacement, what is the probability that the azure balls will lead the black ones throughout the draw? The problem was solved the same year and the probability was found to be (a-b)/n, as long as a > b.
Today we discuss the Ballot theorem as an extension of the path counting and probability estimation in simple random walks that we have been discussing in previous posts. This theorem is stated as follows:
The probability that, given S0 = 0 and Sn = y > 0, the random walk never becomes zero up to time n is given by y/n.

It's written as P0 ( S1 > 0, S2 > 0, ..., Sn > 0 | Sn = y ) = y / n

We rephrase the left-hand side of this equation as the ratio of the number of paths from (1,1) to (n,y) that stay strictly above the horizontal axis to the total number of paths from (0,0) to (n,y).

Substituting m = 1, x = 1 for the numerator and m = 0, x = 0 for the denominator, we use the equation for the
probability as the ratio of the count of paths from (m,x) to (n,y), namely (n - m) Choose ( (n - m + y - x)/2 ), to the total number of possible paths, 2^n; the factors of 2^n cancel in the conditional.

and taking out the common terms in the expansion of n Choose r as n!/(r!(n-r)!), we get the result y/n.
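Written out in LaTeX, with r = (n+y)/2 and the positive-path count taken from the reflection principle, the cancellation is:

    \frac{\binom{n-1}{r-1} - \binom{n-1}{r}}{\binom{n}{r}}
      = \frac{r}{n} - \frac{n-r}{n}
      = \frac{2r - n}{n}
      = \frac{y}{n}

using \binom{n-1}{r-1} = \frac{r}{n} \binom{n}{r} and \binom{n-1}{r} = \frac{n-r}{n} \binom{n}{r}.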