Sunday, August 17, 2014


In the previous post, we considered a simple symmetric random walk in two dimensions. We said that when we change the coordinates, say by rotating the axes by 45 degrees and scaling, the random walks along the individual dimensions become independent. To prove this, we start with the random walk in two dimensions and express it as a sequence of vectors, where each vector is an incremental displacement along the rotated and scaled axes written as a Cartesian coordinate pair. This sequence is also i.i.d. because each of its elements is the same function applied to the corresponding incremental vector along the original axes. Looking at this function, we see that the rotation and scaling amount to taking the sum and the difference of the incremental displacements along the original axes, and this sum and difference are computed element by element along the sequence of the two-dimensional walk. Moreover, the possible values for each incremental displacement along a rotated axis are +1 or -1. So we can write the probabilities for the state transitions as follows:
Probability that the rotated increment is (1,1) = probability that the original increment is (1,0) = 1/4
Probability that the rotated increment is (-1,-1) = probability that the original increment is (-1,0) = 1/4
Probability that the rotated increment is (1,-1) = probability that the original increment is (0,1) = 1/4
Probability that the rotated increment is (-1,1) = probability that the original increment is (0,-1) = 1/4
This implies that each incremental displacement along a rotated axis takes the value +1 or -1 with probability 1/2. And so the probability that an increment of the two-dimensional walk equals the pair (e1, e2) factors into the probability that the first rotated component equals e1 times the probability that the second rotated component equals e2, for every choice of e1 and e2 among the positive and negative unit distances. Thus the two component walks are independent.
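To see this concretely, here is a small simulation sketch (not part of the original argument; the sample size, the seed and the class name RotatedIncrements are arbitrary choices). It draws the four unit steps of the two-dimensional walk with probability 1/4 each, forms the rotated and scaled increments (x + y, x - y), and tabulates their joint frequencies, which should come out near 1/4 each, i.e. the product of the 1/2 marginals.

    // Simulation sketch: the joint distribution of the rotated increments factors
    // into the product of the +1/-1 marginals, each with probability 1/2.
    using System;
    using System.Collections.Generic;

    class RotatedIncrements
    {
        static void Main()
        {
            var rng = new Random(1);
            (int x, int y)[] steps = { (1, 0), (-1, 0), (0, 1), (0, -1) }; // each with probability 1/4
            int n = 1_000_000;
            var joint = new Dictionary<(int, int), int>();

            for (int i = 0; i < n; i++)
            {
                var (x, y) = steps[rng.Next(4)];
                var w = (x + y, x - y);                       // rotated and scaled increment
                joint[w] = joint.TryGetValue(w, out var c) ? c + 1 : 1;
            }

            // Each of (1,1), (1,-1), (-1,1), (-1,-1) should occur with frequency ~1/4,
            // i.e. P(W1 = e1, W2 = e2) = P(W1 = e1) * P(W2 = e2) = 1/2 * 1/2.
            foreach (var kv in joint)
                Console.WriteLine($"{kv.Key}: {(double)kv.Value / n:F4}");
        }
    }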
We will next look at a few more lemmas in this case.

Thursday, August 14, 2014

Today I will discuss simple symmetric random walk in two dimensions. We can consider the same drunkard problem in two dimensions. Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners have coordinates (x,y) and the origin is (0,0). He moves east, west, north or south with equal probability of 1/4. When he reaches a corner, he moves again in the same manner.
This random walk takes values in Z^2 = Z x Z. Let E1, E2, ... be i.i.d. random vectors, each taking one of the four unit steps with probability 1/4.
Because only the relative displacements matter, we can take all walks to start at the origin.
We denote the horizontal and vertical displacements separately, say x1 and x2 respectively, so that each lies on the grid and the state of the walk is the running sum of the increments in both directions taken together.
One of the lemmas now is that the horizontal and the vertical displacements are each random walks, yet the two are not independent.
Since the increments E1, E2, ... are i.i.d., their x projections form an i.i.d. sequence and so do their y projections; each projection takes the value 0 with probability 1/2 and the values 1 and -1 with probability 1/4 each.
So the horizontal states are a sum of i.i.d. random variables and hence form a random walk. But it is not a simple random walk, because the increment can also be zero, with probability 1/2.
The same argument applies to vertical states.
Thus we see that the two random walks are not independent: at each step exactly one of the coordinates moves, so if the horizontal increment is 1 or -1 the vertical increment must be 0, and the joint distribution of the increments does not factor into the product of the marginals. Further, if we change the coordinates by rotating the axes by 45 degrees and scaling them, then the final state is described by the sum and the difference of the horizontal and vertical final states. In that case the random walks along those axes turn out to be independent.
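A quick way to make these projection claims concrete is to enumerate the four equally likely unit steps directly. The small sketch below (illustrative only; the class name Projections is an assumption) tabulates the distribution of the x projection and exhibits the dependence between the projections: P(X = 1 and Y = 1) is 0, while P(X = 1) * P(Y = 1) is 1/16.

    // Exact enumeration over the four equally likely unit steps of the 2-D walk.
    using System;
    using System.Linq;

    class Projections
    {
        static void Main()
        {
            (int x, int y)[] steps = { (1, 0), (-1, 0), (0, 1), (0, -1) }; // each with probability 1/4

            // Distribution of the x projection of a single increment: 1/4, 1/2, 1/4.
            foreach (var v in new[] { -1, 0, 1 })
                Console.WriteLine($"P(X = {v,2}) = {steps.Count(s => s.x == v) / 4.0}");

            // The projections are not independent: the joint probability is zero
            // while the product of the marginals is 1/16.
            double pJoint = steps.Count(s => s.x == 1 && s.y == 1) / 4.0;
            double pProd = (steps.Count(s => s.x == 1) / 4.0) * (steps.Count(s => s.y == 1) / 4.0);
            Console.WriteLine($"P(X=1, Y=1) = {pJoint}, P(X=1)P(Y=1) = {pProd}");
        }
    }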

Wednesday, August 13, 2014

Having discussed the Splunk DB in the previous posts, I want to outline and discuss one of its key concepts and its implementation in this post today. This is the time-series partitioning of data into buckets and the inverted time search on those buckets. There are other very powerful concepts, such as key-value map and reduce, that enable powerful analytics, and we suggested following up on whether these could be externalized to NoSQL databases. In the meantime, I want to focus on just the time-series data, its persistence and search - essentially whether we can avoid buckets.
Events that flow into Splunk are marked with a timestamp. The timestamp gives an order to the events. In the absence of any other search query parameters, the timestamp helps partition the data correctly.
The events need to be sorted by timestamp. However, events can arrive out of order. It's helpful to think of these events as precious data: we should not drop events if possible, so a sufficient window to let the out-of-order events arrive could be maintained. When the events arrive continuously, they can be streamed to a file or a table. There is no need to split the file or the table logically. If we are concerned about reliability and fault tolerance, we can handle it externally. If we are concerned about a single point of maintenance, we can distribute it. If we are concerned about ACID properties, we can use a traditional relational database. Outwardly, there is nothing preventing us from viewing it as a singular data store.
The ability to put an order to late-arriving events and then to search back in time for query matching is the consideration for this post. One way to do this is with a sliding window protocol. While we cannot solve the problem of unlimited sequence numbers, we can hope to index and maintain events that arrive within a reasonable delay. If we cannot control that, we specify auxiliary storage for such overly delayed events, which we can later retry sequencing asynchronously. We could do this with fixed sequence numbers. For every event sequenced, the sliding window moves forward by one event. Sequencing an event involves inserting it in the right place. Typically we avoid the lookup and the insertion by copying the minimum timestamp to the start position of the sliding window and exchanging it with the element at the start. We do this only if the element at the start is not the minimum. We also check that the element before the start is earlier than the element at the start. Then we advance the sliding window forward by one event. Since we check only the elements within the window and the element prior to the start, we can wrap around the index for a fixed-size sequence number.
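As one possible sketch of such a buffer (this variant uses a priority queue keyed by timestamp rather than the wrap-around index described above; the class name ReorderBuffer and the window size are assumptions), events are held until more than a window's worth are pending and are then emitted in timestamp order:

    // Reorder buffer: an event may arrive up to `window` positions late and
    // still be sequenced correctly; anything later would need auxiliary storage.
    using System.Collections.Generic;

    class ReorderBuffer
    {
        private readonly SortedSet<(long ts, long seq)> pending = new();
        private readonly int window;
        private long arrivalSeq;                          // breaks ties between equal timestamps

        public ReorderBuffer(int window) => this.window = window;

        // Add an event and yield any events that can now be released in timestamp order.
        public IEnumerable<long> Add(long timestamp)
        {
            pending.Add((timestamp, arrivalSeq++));
            while (pending.Count > window)
            {
                var earliest = pending.Min;
                pending.Remove(earliest);
                yield return earliest.ts;
            }
        }
    }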

Tuesday, August 12, 2014

We return to the discussion of the Ballot theorem. We said that if the urn contains a azure balls and b = n - a black balls, and the balls are sampled without replacement, then the probability that the azure balls stay ahead throughout the drawing is (a-b)/n, as long as a >= b. We prove this by drawing a parallel to the random walk and translating it into a question about random walks. Drawing an azure ball can be counted as positive and drawing a black ball as negative. The event that azure stays ahead is precisely the event that S1, S2, ..., Sn are each positive (where Sn is the state of the walk at time n), and the walk ends at Sn = a - b = y. This probability we know from our equations for the random walk to be y/n.
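A quick Monte Carlo sanity check of this probability (the counts a = 6 and b = 4, the number of trials and the class name Ballot are arbitrary illustrative choices): shuffle a azure and b black balls, draw them one by one, and count the runs in which azure stays strictly ahead after every draw; the fraction approaches (a - b)/n.

    // Monte Carlo check of the ballot probability (a - b) / n.
    using System;
    using System.Linq;

    class Ballot
    {
        static void Main()
        {
            int a = 6, b = 4, n = a + b, trials = 200_000, ahead = 0;
            var rng = new Random(1);

            for (int t = 0; t < trials; t++)
            {
                // +1 for an azure ball, -1 for a black ball, drawn in a random order.
                var draws = Enumerable.Repeat(1, a).Concat(Enumerable.Repeat(-1, b))
                                      .OrderBy(_ => rng.Next()).ToArray();
                int s = 0;
                bool strictlyAhead = true;
                foreach (var d in draws)
                {
                    s += d;
                    if (s <= 0) { strictlyAhead = false; break; }
                }
                if (strictlyAhead) ahead++;
            }

            Console.WriteLine($"simulated {(double)ahead / trials:F4}  vs  (a-b)/n = {(double)(a - b) / n:F4}");
        }
    }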
Next we consider the asymmetric simple random walk with values in Z, i.e. one that is not necessarily symmetric. The probability of a positive step is taken as p and that of a negative step as q = 1 - p; the displacements belong to Z.
The probability that after n steps the walk is displaced by k from its initial state S0, i.e. Sn - S0 = k, is given by
n Choose (n+k)/2  *  p^((n+k)/2)  *  q^((n-k)/2)
(when n + k is even; otherwise the probability is zero). Here we are simply counting paths. The factor n Choose (n+k)/2 is the number of paths from (0,0) to (n,k), and each such path takes (n+k)/2 positive steps and (n-k)/2 negative steps, which gives the factors p^((n+k)/2) and q^((n-k)/2).
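This formula can be evaluated directly. The sketch below (the parameter values in Main are only examples, and the helper name Probability is an assumption) returns zero when n and k have different parity or when |k| exceeds n:

    // P(Sn - S0 = k) = C(n, (n+k)/2) * p^((n+k)/2) * q^((n-k)/2), with q = 1 - p.
    using System;

    class WalkDistribution
    {
        static double Probability(int n, int k, double p)
        {
            if ((n + k) % 2 != 0 || Math.Abs(k) > n) return 0.0;
            int up = (n + k) / 2, down = (n - k) / 2;

            // Binomial coefficient C(n, up), computed multiplicatively.
            double choose = 1.0;
            for (int i = 1; i <= up; i++) choose *= (double)(n - up + i) / i;

            return choose * Math.Pow(p, up) * Math.Pow(1 - p, down);
        }

        static void Main()
        {
            // Example: a 10-step walk with p = 0.6 ending at k = 2
            // takes 6 positive steps and 4 negative steps.
            Console.WriteLine(Probability(10, 2, 0.6));
        }
    }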
The initial state S0 is independent of the increments that lead to Sn.
We can now look at the hitting time -
We will now look at probabilistic and analytical methods to compute the distribution of the hitting time. The hitting time of a point x in Z is the first time n >= 0 at which the state Sn arrives at x; it takes values 0, 1, 2, ... and is taken to be infinity if the walk never reaches x.
For given x, y in the displacement space Z, the probability that the hitting time of y starting from x equals t is the same as the probability that the hitting time of y - x starting from the origin equals t. In other words, the space is homogeneous: transitions between any two points at the same relative displacement result in the same hitting-time distribution. We use this to our advantage by considering all such random walks to start from zero without losing any generality.
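As a small check of this translation property, the sketch below (the values x = 2, y = 5, p = 0.6, the 10-step horizon, the trial count and the class name HittingTimeCheck are all illustrative assumptions) estimates the chance of hitting y from x within a fixed number of steps and compares it with the chance of hitting y - x from 0:

    // The two estimates should agree because the walk is translation invariant.
    using System;

    class HittingTimeCheck
    {
        // True if a walk started at `start` reaches `target` within `maxSteps` steps.
        static bool Hits(int start, int target, int maxSteps, double p, Random rng)
        {
            int s = start;
            for (int t = 1; t <= maxSteps; t++)
            {
                s += rng.NextDouble() < p ? 1 : -1;
                if (s == target) return true;
            }
            return false;
        }

        static void Main()
        {
            var rng = new Random(7);
            int trials = 200_000, fromX = 0, fromZero = 0;
            for (int i = 0; i < trials; i++)
            {
                if (Hits(2, 5, 10, 0.6, rng)) fromX++;     // hitting y = 5 from x = 2
                if (Hits(0, 3, 10, 0.6, rng)) fromZero++;  // hitting y - x = 3 from 0
            }
            Console.WriteLine($"{(double)fromX / trials:F4} vs {(double)fromZero / trials:F4}");
        }
    }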

Monday, August 11, 2014

Today we look at the search functionality of Splunk.
Splunk Search, as we saw earlier, comes in different modes:
Cursored or Historical search
Batchmode search
Realtime search and
Indexed realtime search
and Search processors, or their equivalent commands, generally are of these types:
Streaming/Remote
Stateful Streaming
Events
Stream Reporting
Reporting
With Distributed search, we involve a few other players in these typical configurations:
search head with multiple peers
Multiple search heads (SHs) with multiple peers
SHs and clusters
The benefits are scaling, access control and geo-dispersed data
When a search is initiated, the request is transferred by the browser to the app server, which in turn sends it to the backend splunkd. The backend checks the system/user quota and bundle replication and creates the dispatch directory. The search is parsed into map and reduce phases, and at this point it can be distributed to peers. A peer can be a logical local peer or a remote peer. When the search request is received by a peer, it checks which bundle to use, initializes the conf subsystem based on that bundle, checks the indexes database and finds the buckets that fall in the time range. With the help of the tsidx files, the lexicon and the postings that identify the events where those lexicon terms occur help enumerate the matching events, aka the search results. The raw data from the compressed journal is read and processed into the search results. The activities performed, the parameters for the search and the results are all written out to the dispatch directory.

Splunk Databases: 
We will discuss Splunk Database. This is a proprietary database. 
Data is written to databases organized under home, thawed and cold database paths in a folder under the SPLUNK_HOME environment setting.  The home path is the default path for the data to be written and is called the hot path. The sense of temperature with the database paths is to indicate the actively used versus archivable data. Data in the cold path, for example, can be on cheaper storage.

These paths can be configured per index (an organizational unit that shares nothing with others).  Each database can have several buckets of data and this is configurable.  Buckets are time-series values with a min and a max timestamp. They hold three kinds of data.  The first is a file with a data extension that stores metadata such as Source/SourceType/Host. The second is a TSIDX file (short for Time Series Index) that builds a lexicon of the different words/fields extracted from the raw data and a posting of each occurrence of that word/field. The postings hold pointers to the rawdata. Tsidx is an inverted index whose postings are stored in reverse-time order. It is optimized to execute search and return millions of events in reverse time order and interfaces with the higher layers / Lispy language. Finally, the bucket contains the rawdata. A compressed rawdata journal stores raw events in an efficient format. It contains enough information to rebuild the entire bucket.
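To make the tsidx idea concrete, here is a toy inverted index (this is not Splunk's actual on-disk format or API; the class name TinyTsidx and its methods are illustrative assumptions). Each term in the lexicon maps to postings kept in reverse time order, and a search filters those postings by a time range and returns event addresses newest first.

    // Toy lexicon/postings structure: term -> (timestamp, offset into the raw data
    // journal), kept newest first so results come back in reverse time order.
    using System.Collections.Generic;
    using System.Linq;

    class TinyTsidx
    {
        private readonly Dictionary<string, List<(long ts, long offset)>> lexicon = new();

        public void Index(long timestamp, long offset, IEnumerable<string> terms)
        {
            foreach (var term in terms)
            {
                if (!lexicon.TryGetValue(term, out var postings))
                    lexicon[term] = postings = new List<(long ts, long offset)>();
                postings.Add((timestamp, offset));
                postings.Sort((a, b) => b.ts.CompareTo(a.ts));   // keep reverse time order
            }
        }

        // Event addresses containing `term` within [earliest, latest], newest first.
        public IEnumerable<long> Search(string term, long earliest, long latest) =>
            lexicon.TryGetValue(term, out var postings)
                ? postings.Where(p => p.ts >= earliest && p.ts <= latest).Select(p => p.offset)
                : Enumerable.Empty<long>();
    }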

In the bucket lifecycle, the data rolls over from the hot->warm->cold->frozen buckets. 

Search is on a bucket-by-bucket basis. Buckets may overlap in time. When a search requests events to be returned in reverse time order, set the cursor for each bucket to the latest time of that bucket, read a chunk of data from the bucket with the latest cursor, and move that cursor to the latest time of the unread data in the bucket.
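A sketch of that cursor-based merge (bucket contents are modeled here simply as ascending lists of timestamps, one event is read at a time rather than a chunk, and the name ReverseTimeMerge is an assumption): repeatedly pick the bucket whose cursor points at the latest unread event.

    // Merge overlapping buckets into a single reverse-time-ordered stream.
    using System.Collections.Generic;
    using System.Linq;

    static class ReverseTimeMerge
    {
        // Each bucket is assumed to hold its event timestamps sorted ascending.
        public static IEnumerable<long> Merge(IList<List<long>> buckets)
        {
            var cursors = buckets.Select(b => b.Count - 1).ToArray();   // start at the latest event of each bucket
            while (true)
            {
                int best = -1;
                for (int i = 0; i < buckets.Count; i++)
                    if (cursors[i] >= 0 && (best < 0 || buckets[i][cursors[i]] > buckets[best][cursors[best]]))
                        best = i;
                if (best < 0) yield break;                              // every bucket is exhausted
                yield return buckets[best][cursors[best]--];
            }
        }
    }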

Occasionally, streaming data can create several small time-series index files, which makes search inefficient. So they are consolidated into larger files with the splunk-optimize command.

In addition, each database has a BucketManifest file. This file lists the metadata for each bucket, namely the id, path, raw_size, host_count, source_count, sourcetype_count, size_on_disk, modtime, frozen_in_cluster etc. Buckets are internal to the database.

The database uses the buckets to retrieve values that satisfy a query. The values are wrapped to include metadata, namely dbid, event_address, host, type, source, sourcetype, and timestamp. When filtering query results, the data has to match the timestamp range.



Database and Bucket Partition Policy: 

Policies applied to a database are maintained by the DatabasePartitionPolicy class. It has a set of properties for the database and starts out with initialized values for these properties.

The properties include counts of the different buckets, their allowed sizes, their optimization time, etc.

While the DatabaseDirectoryManager mostly performs read-only operations on the database, the DatabasePartitionPolicy enables writes to the database by determining the database from the index specified in the event and updating the time range of the database with the event time, if necessary.

To keep the events sorted in time order, we could have a buffer that holds delayed and out-of-order events, which we can sequence with a sliding window:
        // Extension method (declare inside a static class); requires using System
        // and using System.Collections.Generic. The buffer is a List<int> of
        // timestamps, and `lookahead` (default chosen arbitrarily here) bounds
        // how far out of order an event may arrive.
        public static void AddEvent(this List<int> events, int timestamp, int lookahead = 8)
        {
            events.Add(timestamp);

            // The sliding window covers the last `lookahead` elements.
            int start = Math.Max(events.Count - lookahead, 0);

            // Find the minimum timestamp within the window.
            int minIndex = start;
            for (int i = start + 1; i < events.Count; i++)
                if (events[i] < events[minIndex]) minIndex = i;

            // Exchange it with the element at the start of the window,
            // but only if the start element is not already the minimum.
            if (minIndex != start)
                (events[start], events[minIndex]) = (events[minIndex], events[start]);
        }