Sunday, August 3, 2014

In the previous post, we did not explain how the law of large numbers is used. We will complete that today.
We claimed that pi(y) = g(y-1) - g(y) and that this satisfies the balance equations. The Markov chain describing the supply and demand at an electronics store was expressed as
Mn+1 = max(0, Z1, ..., Zn+1), where Zj = xi-0 + ... + xi-(j-1) is the sum of the first j differences between supply and demand.
This has the same distribution as max(Mn + xi-0, 0), because the i.i.d. random variables xi can be permuted without affecting the distribution. Hence, for y >= 0,
P(Mn+1 = y) = Sigma_{x>=0} P(Mn = x) pxy
Since the terms involved are bounded, we may take the limit of both sides as n -> infinity,
and lim_{n->infinity} P(Mn = y) is our definition of pi(y),
so the earlier equation becomes pi(y) = Sigma_{x>=0} pi(x) pxy. This shows that pi satisfies the balance equations; however, we still need to prove that Sigma_y pi(y) = 1.
That is, pi must be a genuine probability distribution, with the weights summing to unity. Summing pi(y) = g(y-1) - g(y) over y >= 0 telescopes to 1 - lim_{y->infinity} g(y), so the sum equals 1 exactly when g(y) tends to 0; in particular, g must not be identically equal to 1.
We prove this using the assumption E(xi-n) = -mu < 0.
By the strong law of large numbers, P(lim n->infinity Zn/n = -mu) = 1.
This implies that P(lim n->infinity Zn = -infinity) = 1.
Since the walk drifts to -infinity, its all-time maximum is finite: P(max{0, Z1, Z2, ...} < infinity) = 1.
This gives g(y) = P(max{0, Z1, Z2, ...} > y) -> 0 as y -> infinity; in particular, g is not identically equal to 1, and Sigma_y pi(y) = 1.
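The law of large numbers step can be checked by simulation. Here is a minimal Python sketch (an assumed example, not from the original notes), taking xi = +1 with probability p and -1 otherwise, so that E[xi] = 2p - 1 < 0 for p < 1/2:

```python
# Assumed example: check that the partial sums Z_n drift to -infinity while
# the running maximum M_n = max(0, Z_1, ..., Z_n) settles to a small finite
# value, so that g(y) = P(M > y) tends to 0 rather than being identically 1.
import random

random.seed(1)

def walk(n, p=0.3):
    # xi = +1 with probability p, -1 otherwise; E[xi] = 2p - 1 < 0 for p < 1/2
    z, m = 0, 0
    for _ in range(n):
        z += 1 if random.random() < p else -1
        m = max(m, z)
    return z, m

z, m = walk(100_000)
print(z < -30_000)  # True: Z_n is far below zero, heading to -infinity
print(m < 100)      # True: the all-time maximum is finite and small
```

The drift -mu = 2p - 1 pulls Zn down linearly in n, so the maximum is achieved early and never revisited.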
I'm going to take a short break today to discuss the storage, or queuing, application of a Markov chain. A storage facility has Xn electronic items at time n. A number An of new components are added to the stock, while a number of them are sold to the market. If the demand is for Dn components, it is met instantaneously from what is available. If all of Dn cannot be satisfied, the stock reaches level zero at time n+1. Thus at time n+1 the store has Xn + An - Dn components in stock, or zero, whichever is larger.
We write this as Xn+1 = max(Xn + An - Dn, 0).
We make the following assumptions:
1) The pairs of random variables (An, Dn), n >= 0, are independent and identically distributed (i.i.d.). The supply and demand at any time n are independent of the previous states and come from the same distribution; and
2) E(An - Dn) = -mu < 0. Thus the demand is larger than the supply per unit time on average, which is indicated by the negative sign.
Now, Xn, n >= 0 is a Markov chain and we will show this Markov chain is positive recurrent.
Let us take xi-n (the Greek letter xi) = An - Dn.
Then the Markov chain satisfies the simple recursion Xn+1 = max(Xn + xi-n, 0) for all n >= 0.
Let us try solving the recursion.
In step 1, X1 = max(X0 + xi-0, 0)
In step 2, X2 = max(X1 + xi-1, 0) = max(X0 + xi-0 + xi-1, xi-1, 0)
In step n, Xn = max(X0 + xi-0 + ... + xi-(n-1), xi-1 + ... + xi-(n-1), ..., xi-(n-1), 0)
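The step-n expansion can be verified numerically; a small sketch with randomly drawn xi (the distribution is chosen only for illustration):

```python
# Check that unrolling X_{n+1} = max(X_n + xi_n, 0) gives
# X_n = max(X_0 + xi_0 + ... + xi_{n-1}, xi_1 + ... + xi_{n-1}, ..., xi_{n-1}, 0)
import random

random.seed(7)
xi = [random.randint(-3, 2) for _ in range(50)]  # demand-heavy on average
x0 = 4

# The recursion
x = x0
for step in xi:
    x = max(x + step, 0)

# The direct expansion: 0, the suffix sums starting at j = 1, ..., n-1,
# and the full sum with X_0 added
n = len(xi)
candidates = [0] + [sum(xi[j:]) for j in range(1, n)] + [x0 + sum(xi)]
print(x == max(candidates))  # True: the two forms agree
```

This identity is exact (proved by induction on n), so the check passes for any draw of the xi.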
The probability Px(Xn > y) can now be written in terms of this step-n expansion. The event whose probability we want to compute is a function of (xi-0, ..., xi-(n-1)). Moreover, the xi-0, ..., xi-(n-1) are i.i.d., so we can permute them, and even reverse their order, without altering the distribution.
Then, for abbreviation, we use
Zn = xi-0 + ... + xi-(n-1) and
Mn = max(Zn, Zn-1, ..., Z1, 0)
After reversing the order of the xi, P0(Xn > y) can be written as P(Mn > y), and since Mn+1 >= Mn, we know that
P0(Xn > y) <= P0(Xn+1 > y) for all n. The limit of this bounded, increasing sequence therefore exists; call it g(y) = lim n->infinity P0(Xn > y).
We can then take the difference pi(y) = g(y-1) - g(y), and we claim that it satisfies the balance equations.
But Mn+1 = max(0, Z1, ..., Zn+1),
which has the same distribution as max(Mn + xi-0, 0). This is not an approximation; it is an equality in distribution, obtained by permuting the i.i.d. variables xi. The law of large numbers enters later, when we show the limit distribution is proper.
Hence, for y >= 0,
P(Mn+1 = y) = Sigma_{x>=0} P(Mn = x) pxy. Since these terms are bounded, we may take the limit of both sides as n -> infinity and find pi(y) = Sigma_{x>=0} pi(x) pxy. So pi satisfies the balance equations, and by the law of large numbers argument g(y) tends to 0 (in particular, g is not identically 1), so the weights pi(y) add up to 1 and the chain is positive recurrent.
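As a concrete check (an assumed example, not from the notes), take xi = +1 with probability p and -1 with probability 1 - p, p < 1/2. Then the long-run occupation frequencies of the chain should match the distribution of the all-time maximum M; the truncation rule used to sample M is a heuristic justified by Zn -> -infinity:

```python
# Compare the long-run occupation frequency of state 0 for the chain
# X_{n+1} = max(X_n + xi, 0) with the empirical P(M = 0), where
# M = max(0, Z_1, Z_2, ...).
import random

random.seed(3)
p = 0.3

def xi():
    return 1 if random.random() < p else -1

# Occupation frequency of state 0 for the chain itself
N = 200_000
x, occ0 = 0, 0
for _ in range(N):
    x = max(x + xi(), 0)
    occ0 += (x == 0)
pi0_chain = occ0 / N

# Empirical P(M = 0): since Z_n -> -infinity, stop sampling a walk once it
# is 30 below its current maximum (further records are then very unlikely)
S = 20_000
hits = 0
for _ in range(S):
    z, m = 0, 0
    while z > m - 30:
        z += xi()
        m = max(m, z)
    hits += (m == 0)
pi0_max = hits / S

print(abs(pi0_chain - pi0_max) < 0.02)  # True: the two estimates agree
```

Both quantities estimate pi(0) = g(-1) - g(0) = 1 - P(M > 0), illustrating that the stationary distribution of the chain is the distribution of the all-time maximum.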
Courtesy: Konstantopoulos


Saturday, August 2, 2014

Tonight we will continue our writeup on SplunkLite.Net : http://1drv.ms/1k3LnJ0 
Code will be available on http://github.com/ravibeta/csharpexamples/SplunkLite.Net 
Splunk Search Processors are generally of the following types:

Streaming

This is completely parallelizable, with example commands such as eval, where, and rex.

Stateful

This requires a single stream of data and is therefore not parallelizable. Example commands are sort, eventstats, etc.
Event-processing

This is neither reporting nor streaming nor stateful. Example commands are sort, eventstats, etc.

Reporting

This is summarizing. Example commands are stats, chart, timechart, top, etc.
For now, we can exclude sub-searches.
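A hypothetical sketch of how a search planner might use this classification; the command names come from the lists above, but the dispatch table and function are illustrative assumptions, not Splunk's actual implementation:

```python
# Hypothetical dispatch table from command name to processor type
# (an assumption for illustration, not Splunk's implementation).
PROCESSOR_TYPES = {
    "eval": "streaming", "where": "streaming", "rex": "streaming",
    "sort": "stateful", "eventstats": "stateful",
    "stats": "reporting", "chart": "reporting",
    "timechart": "reporting", "top": "reporting",
}

def is_parallelizable(command: str) -> bool:
    # Only streaming processors can run fully in parallel
    return PROCESSOR_TYPES.get(command) == "streaming"

print(is_parallelizable("eval"))   # True
print(is_parallelizable("sort"))   # False
```

A planner could use such a table to decide which prefix of a search pipeline can be pushed out to peers and where the single-stream portion must begin.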
Searches can be saved, so we will need a SavedSearchAdminHandler that implements methods similar to those of an AdminHandler.

Batched search:


            Batch mode is not realtime and not summarizing. It is a reporting search, such as the stats command, that operates one bucket at a time. Buckets are sorted into non-restartable and restartable buckets and read in that order, a few events at a time, until the bucket is exhausted.

Realtime search:


            Realtime searches yield results as the events are indexed, even in a distributed environment. SplunkLite.Net is a single-machine, single-instance application, but nothing stops it from being distributable, especially since there's CCW and REST support in .Net. This gives it the ability to do search and stats on the incoming pipeline data.

Indexed search: 

This works on different time windows and adjusts the time range based on user preferences, for example, the last week, month, or year. Buckets may overlap in time, although only a few buckets will be writeable (hot). This kind of search runs with a map/reduce-like distribution.
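The bucket selection by time window could be sketched as follows; the Bucket type and its field names are assumptions for illustration:

```python
# Select the index buckets whose time spans overlap the user's search
# window (assumed types and names, for illustration only).
from dataclasses import dataclass

@dataclass
class Bucket:
    earliest: float  # epoch seconds of the oldest event in the bucket
    latest: float    # epoch seconds of the newest event
    hot: bool        # writeable bucket

def buckets_in_window(buckets, start, end):
    # Buckets may overlap in time, so test span intersection, not containment
    return [b for b in buckets if b.latest >= start and b.earliest <= end]

DAY = 24 * 3600
now = 1_700_000_000
buckets = [
    Bucket(now - 30 * DAY, now - 10 * DAY, hot=False),
    Bucket(now - 9 * DAY, now - 2 * DAY, hot=False),
    Bucket(now - 3 * DAY, now, hot=True),
]
hits = buckets_in_window(buckets, now - 7 * DAY, now)
print(len(hits))  # 2: only the two most recent buckets overlap the last week
```

Because the selection is per bucket, each selected bucket can be searched independently, which is what makes the map/reduce-like distribution natural.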

For each of these searches, three different modes are supported:
- Fast mode: field discovery off for event searches; no event or field data for stats searches.
- Smart mode: field discovery on for event searches; no event or field data for stats searches.
- Verbose mode: all event and field data.
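The three modes can be captured in a small settings table; the flag names here are assumptions for illustration:

```python
# The three search modes as a settings table (flag names assumed).
MODES = {
    "fast":    {"field_discovery": False, "event_data_for_stats": False},
    "smart":   {"field_discovery": True,  "event_data_for_stats": False},
    "verbose": {"field_discovery": True,  "event_data_for_stats": True},
}

def settings(mode: str, is_stats_search: bool) -> dict:
    s = dict(MODES[mode])
    if is_stats_search and not s["event_data_for_stats"]:
        # fast and smart return no event or field data for stats searches
        s["field_discovery"] = False
    return s

print(settings("smart", is_stats_search=False))  # field discovery on
print(settings("smart", is_stats_search=True))   # no field data
```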

Friday, August 1, 2014

Today I'm going to continue the discussion on SplunkLite.Net.
We continue porting the Splunk forwarder functionality to SplunkLite.Net. Here we look at data structures such as the QueueReader and QueueWriter in ProducerConsumerQueue. We are making sure the ProducerConsumerQueue has an equivalent in .Net.
Queue is the abstract notion used to read CowPipelineData.
QueueReader is the interface for the client to read data from the queue
QueueWriter is the interface for the client app to add data to the queue. There are several queue implementations.
Each QueueWriter implementation has additional methods to get the size, check whether the queue is full or empty, reset the size, get stats, remove a single item, add metrics, and shut down.
The writers can be in-memory or on-disk writers.
A PersistentQueue is one that has one reader and many writers. The writers write to memory, but when they overflow, they write to disk.
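The spill-to-disk behavior can be sketched as follows; the class shape is an assumption, and the real PersistentQueue also coordinates one reader with many writers:

```python
# Sketch of the spill-to-disk idea (class shape assumed): items go to an
# in-memory deque first and overflow to a temporary file on disk.
import tempfile
from collections import deque

class PersistentQueue:
    def __init__(self, max_in_memory=3):
        self.mem = deque()
        self.max_in_memory = max_in_memory
        self.spill = tempfile.TemporaryFile(mode="w+")

    def put(self, item: str):
        if len(self.mem) < self.max_in_memory:
            self.mem.append(item)
        else:
            self.spill.write(item + "\n")  # overflow goes to disk

    def drain(self):
        # Read memory first, then the spilled items, preserving order
        items = list(self.mem)
        self.mem.clear()
        self.spill.seek(0)
        items += self.spill.read().splitlines()
        self.spill.seek(0)
        self.spill.truncate()
        return items

q = PersistentQueue(max_in_memory=2)
for event in ["e1", "e2", "e3", "e4"]:
    q.put(event)
print(q.drain())  # ['e1', 'e2', 'e3', 'e4']
```

Keeping the overflow on disk bounds memory use while the single reader drains at its own pace.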
A ProducerConsumerQueue has in-memory writers by name, persistent writers by name, a list of all writers, the next provider to be polled for data, a mutex protecting a condition, and a queue-not-empty condition.
With the PipelineData, we mentioned earlier that each thread can have its exclusive CowPipelineData even though, underneath, they may share the same PipelineData.
Moreover, we mentioned that the PipelineData uses the notions of RawStoragePool and RawStorageReference, but these are not suitable for porting because we don't manage memory in .Net.
Any time the underlying data is modified, the forwarder does something like this:
PipelineData *x = pData.getWriter();
x->put(...);
Together, this shows that we can use .Net collections while retaining the semantics of a queue reader and writer in SplunkLite.Net.
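The copy-on-write semantics behind CowPipelineData might look like this in outline; the method names mirror the post, but the implementation is assumed:

```python
# Copy-on-write in outline: the underlying PipelineData is shared across
# threads until a writer is requested (implementation assumed).
class PipelineData:
    def __init__(self, fields: dict):
        self.fields = fields

class CowPipelineData:
    def __init__(self, data: PipelineData):
        self._data = data
        self._shared = True  # conservatively assume shared until written

    def share(self) -> "CowPipelineData":
        self._shared = True
        return CowPipelineData(self._data)

    def get_reader(self) -> PipelineData:
        return self._data

    def get_writer(self) -> PipelineData:
        # A write to shared data would be visible to other threads,
        # so clone the underlying PipelineData first
        if self._shared:
            self._data = PipelineData(dict(self._data.fields))
            self._shared = False
        return self._data

raw = PipelineData({"source": "syslog"})
a = CowPipelineData(raw)
b = a.share()
a.get_writer().fields["host"] = "web01"   # a now writes to a private copy
print("host" in b.get_reader().fields)    # False: b still sees the original
```

This is the semantics the getWriter()/put() fragment above relies on: readers stay cheap, and a clone is paid for only on the first write.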
Search is one of the biggest operations and we will deal with that soon. 

Thursday, July 31, 2014

In today's post, we continue our discussion on porting the Splunk forwarder to SplunkLite.Net, a lightweight application that forwards, indexes, and searches Splunk data. In the previous posts, we discussed a few of the functionalities we require, such as the ability to create a data pipeline for input, and processors that can convert the input into events and save them for analytics later. There are still a few more data structures to look into, but as we have seen, the majority of the framework and utilities we use are conveniently available to us in .Net libraries. This reduces the code significantly.

Framework helpers such as HTTP request and response handling, HttpStaticDispatcher, Processor, QueryRunningSingleton, SearchResults, SearchEvaluator, ServerConfig, TagManager, etc. are still needed. The ability to secure the REST calls with AdminManager is also needed. The KeyManagers for localhost, search peers, and general settings can come in later too. The utilities for FileSystemWatcher, datetime handling, HTTPServer, and ProducerConsumerQueue already have support in .Net. Proprietary database helpers such as PersistentHashData, PersistentHashManager, PersistentMapManager, and PersistentStorage are still required.

Let us look at the persistent data structures more closely. PersistentMapManager provides a way to look up based on keys and tags. It has methods to get all the keys or the matching keys, to check whether a key exists, and to remove keys. The same holds for tags. The ability to look up the store based on keys and tags has been a key feature of Splunk analytics. PersistentHashManager maintains a hash table and gets all the data that matches a key. The values are maintained as PersistentHashData, and the data on disk is accessed via RecordFileManager, which loads the DB file into memory and has methods to read and write records to disk.
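A minimal in-memory sketch of the PersistentMapManager interface described above (the method names paraphrase the post, and the real implementation is disk-backed):

```python
# In-memory sketch of key/tag lookup (assumed shape; the real
# PersistentMapManager persists to disk).
import fnmatch

class MapManager:
    def __init__(self):
        self._by_key = {}
        self._tags = {}  # tag -> set of keys

    def put(self, key, value, tags=()):
        self._by_key[key] = value
        for t in tags:
            self._tags.setdefault(t, set()).add(key)

    def get_all_keys(self):
        return sorted(self._by_key)

    def matching_keys(self, pattern):
        return sorted(fnmatch.filter(self._by_key, pattern))

    def has_key(self, key):
        return key in self._by_key

    def remove_key(self, key):
        self._by_key.pop(key, None)
        for keys in self._tags.values():
            keys.discard(key)

    def keys_for_tag(self, tag):
        return sorted(self._tags.get(tag, ()))

m = MapManager()
m.put("sourcetype::syslog", {"count": 10}, tags=["network"])
m.put("sourcetype::iis", {"count": 4}, tags=["web"])
print(m.matching_keys("sourcetype::*"))  # both keys
print(m.keys_for_tag("web"))             # ['sourcetype::iis']
```

The tag index is just a reverse map from tag to keys, which is what makes the key-or-tag lookup cheap.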
Results from the search on the database are available via a data structure called SearchResults, which is a collection of SearchResult objects and maintains a key map. Each SearchResult returns a list of fields, which can be multivalued.
Note that the SearchResult is internal to Splunk. The export of results in different data formats via online and offline methods is also available. This lets Splunk integrate well in most ecosystems. I introduced a way for Splunk to provide searchable data to LogParser, which has a SQL interface. The ability to use SQL over Splunk makes it friendly to users who work primarily with databases.
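The SearchResults shape and a CSV export for a SQL consumer such as LogParser might be sketched as follows; the field layout and the multivalue join convention are assumptions:

```python
# Assumed shape: SearchResults as a list of SearchResult rows whose fields
# may be multivalued, flattened to CSV for a SQL consumer like LogParser.
import csv
import io

class SearchResult:
    def __init__(self, fields: dict):
        self.fields = fields  # field name -> list of values

class SearchResults(list):
    def field_names(self):
        names = []
        for r in self:
            for f in r.fields:
                if f not in names:
                    names.append(f)
        return names

    def to_csv(self) -> str:
        out = io.StringIO()
        names = self.field_names()
        w = csv.writer(out)
        w.writerow(names)
        for r in self:
            # join multivalued fields so each row stays flat for SQL
            w.writerow(";".join(r.fields.get(f, [])) for f in names)
        return out.getvalue()

rs = SearchResults([
    SearchResult({"host": ["web01"], "tag": ["prod", "frontend"]}),
    SearchResult({"host": ["db01"], "tag": ["prod"]}),
])
print(rs.to_csv().splitlines()[0])  # host,tag
```

Flattening multivalued fields into a delimited cell is one simple convention that keeps each row a fixed-width record for SQL queries.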