Sunday, August 3, 2014

I'm going to take a short break today to discuss a storage (or queuing) application of a Markov chain. A storage facility has Xn electronic items at time n. A number An of new components is added to the stock, while a number is sold to the market. If the demand is for Dn components, it is met instantaneously from whatever is available; if all of Dn cannot be satisfied, the stock is driven down to level zero at time n+1. Thus at time n+1 the store has max(Xn + An - Dn, 0) components in stock.
We write this as Xn+1 = max(Xn + An - Dn, 0).
We make the following assumptions:
1) The pairs of random variables (An, Dn), n >= 0, are independent and identically distributed (i.i.d.). That is, the supply and demand at any time n are independent of the previous states and are drawn from the same distribution, and
2) E(An - Dn) = -μ < 0. Thus the demand is larger than the supply per unit time on average, which is what the negative sign indicates.
Now Xn, n >= 0, is a Markov chain, and we will show that this chain is positive recurrent.
Let ξn = An - Dn (ξ is the Greek letter xi).
Then the chain is given by the simple recursion Xn+1 = max(Xn + ξn, 0) for all n >= 0.
Let us try solving the recursion.
In step 1, X1 = max(X0 + ξ0, 0).
In step 2, X2 = max(X1 + ξ1, 0) = max(X0 + ξ0 + ξ1, ξ1, 0).
In step n, Xn = max(X0 + ξ0 + ... + ξn-1, ξ1 + ... + ξn-1, ..., ξn-1, 0).
The probability P0(Xn > y), starting the chain from X0 = 0, can now be written in terms of step n. The event whose probability we want to compute is a function of (ξ0, ..., ξn-1). Moreover, since ξ0, ..., ξn-1 are i.i.d., we can permute them, and even reverse their order, without altering the distribution.
For abbreviation, we write
Zn = ξ0 + ... + ξn-1 and
Mn = max(Zn, Zn-1, ..., Z1, 0).
Reversing the order of the ξ's, P0(Xn > y) can then be written as P(Mn > y), and since Mn+1 >= Mn we know that P0(Xn > y) <= P0(Xn+1 > y) for all n. As n goes to infinity, this bounded and increasing sequence has a limit, which we call g(y).
We can take the difference π(y) = g(y-1) - g(y), which is the limit of P(Mn = y), and we claim that it satisfies the balance equations.
But Mn+1 = max(0, Z1, ..., Zn+1),
and since max(Z1, ..., Zn+1) = ξ0 + max(0, ξ1, ξ1 + ξ2, ..., ξ1 + ... + ξn), the i.i.d. property of the ξ's gives that Mn+1 has the same distribution as max(0, Mn + ξ), where ξ is an independent copy of ξ0. This is an exact equality in distribution, not an approximation.
Hence for y >= 0,
P(Mn+1 = y) = Σx>=0 P(Mn = x) pxy, where pxy are the transition probabilities of the chain. Since these terms are bounded, we may take the limit of both sides as n -> infinity and find π(y) = Σx>=0 π(x) pxy, so π satisfies the balance equations. Finally, since Eξ = -μ < 0, the law of large numbers gives Zn -> -infinity almost surely, so the maximum M = max(0, Z1, Z2, ...) is attained at some finite index and is finite. Hence g(y) -> 0 as y -> infinity, the weights π(y) add up to 1, and π is a genuine stationary distribution: the chain is positive recurrent.
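To see this concretely, here is a small simulation of the chain in C# (the language used for the SplunkLite.Net posts below). The uniform supply and demand distributions are illustrative assumptions chosen so that E(A - D) = -0.5 < 0; the empirical frequencies stabilize, reflecting positive recurrence.

using System;

class StorageChainSimulation
{
    static void Main()
    {
        var rng = new Random(42);
        const int steps = 1000000;
        var counts = new int[50];           // occupation counts for states 0..49
        int x = 0;                          // X0 = 0

        for (int n = 0; n < steps; n++)
        {
            int a = rng.Next(0, 3);         // supply An: uniform on {0,1,2}, mean 1.0
            int d = rng.Next(0, 4);         // demand Dn: uniform on {0,1,2,3}, mean 1.5
            x = Math.Max(x + a - d, 0);     // the recursion Xn+1 = max(Xn + ξn, 0)
            if (x < counts.Length) counts[x]++;
        }

        // Empirical frequencies approximate the stationary distribution π(y).
        for (int y = 0; y < 10; y++)
            Console.WriteLine("pi({0}) ~ {1:F4}", y, (double)counts[y] / steps);
    }
}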
Courtesy: Takis Konstantopoulos


Saturday, August 2, 2014

Tonight we will continue our writeup on SplunkLite.Net : http://1drv.ms/1k3LnJ0 
Code will be available on http://github.com/ravibeta/csharpexamples/SplunkLite.Net 
Splunk Search Processors are generally of the following types:

Streaming

This is completely parallelizable, with example commands such as eval, where and rex.

Stateful

This requires a single stream of data and therefore is not parallelizable. Example commands are sort, eventstats, etc.
Event-processing

This is not reporting, but also not streaming or stateful.

Reporting

This is summarizing. Example commands are stats, chart, timechart, top, etc.
For now, we can exclude sub-searches.
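To make the classification concrete, here is an illustrative C# sketch; the enum and command table only encode the examples above and are not Splunk's actual catalog.

using System;
using System.Collections.Generic;

enum ProcessorType { Streaming, Stateful, EventProcessing, Reporting }

static class CommandCatalog
{
    // Command-to-type table built from the examples above; anything unknown
    // defaults to EventProcessing in this sketch.
    static readonly Dictionary<string, ProcessorType> Types =
        new Dictionary<string, ProcessorType>(StringComparer.OrdinalIgnoreCase)
        {
            { "eval",       ProcessorType.Streaming },   // fully parallelizable
            { "where",      ProcessorType.Streaming },
            { "rex",        ProcessorType.Streaming },
            { "sort",       ProcessorType.Stateful },    // needs a single stream
            { "eventstats", ProcessorType.Stateful },
            { "stats",      ProcessorType.Reporting },   // summarizing
            { "chart",      ProcessorType.Reporting },
            { "timechart",  ProcessorType.Reporting },
            { "top",        ProcessorType.Reporting }
        };

    public static ProcessorType TypeOf(string command)
    {
        ProcessorType t;
        return Types.TryGetValue(command, out t) ? t : ProcessorType.EventProcessing;
    }
}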
Searches can be saved, so we will need a SavedSearchAdminHandler that implements methods similar to those of an AdminHandler.
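A sketch of what that handler might look like, with an assumed IAdminHandler shape for the port (Splunk's real admin handler interface differs):

using System.Collections.Generic;

interface IAdminHandler
{
    IEnumerable<string> List();
    void Create(string name, IDictionary<string, string> args);
    void Remove(string name);
}

class SavedSearchAdminHandler : IAdminHandler
{
    readonly Dictionary<string, string> _saved = new Dictionary<string, string>();

    public IEnumerable<string> List() { return _saved.Keys; }

    public void Create(string name, IDictionary<string, string> args)
    {
        _saved[name] = args["search"];   // persist the search string under its name
    }

    public void Remove(string name) { _saved.Remove(name); }
}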

Batched search:


Batch mode is not realtime and not summarizing. It is a reporting search, such as one with the stats command. It operates one bucket at a time. Buckets are sorted into non-restartable and restartable buckets and read in that order, a few events at a time, until each bucket is exhausted.
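As a sketch of that loop (the Bucket shape here is a hypothetical stand-in for Splunk's bucket structure):

using System.Collections.Generic;
using System.Linq;

class Bucket
{
    public bool Restartable;
    public Queue<string> Events = new Queue<string>();
}

static class BatchSearch
{
    public static IEnumerable<string> Run(IEnumerable<Bucket> buckets, int batchSize)
    {
        // Non-restartable buckets (false) sort ahead of restartable ones (true).
        foreach (var bucket in buckets.OrderBy(b => b.Restartable))
        {
            while (bucket.Events.Count > 0)                 // until the bucket is exhausted
            {
                for (int i = 0; i < batchSize && bucket.Events.Count > 0; i++)
                    yield return bucket.Events.Dequeue();   // a few events at a time
            }
        }
    }
}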

Realtime search:


Realtime searches yield results as the events are indexed, even in a distributed environment. SplunkLite.Net is a single-machine, single-instance application, but nothing stops it from becoming distributable, especially since there's CCW and REST support in .Net. It has the ability to run search and stats on the incoming pipeline data.

Indexed search: 

This works on different time windows and adjusts the time range based on user preferences, for example the last one week, one month, one year, etc. Buckets may overlap in time, although only a few buckets will be writeable (hot). This kind of search runs with map/reduce-like distribution.
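A sketch of the time-window bucket selection; the IndexBucket fields are assumptions based on the description above:

using System;
using System.Collections.Generic;
using System.Linq;

class IndexBucket
{
    public DateTime Earliest;   // earliest event time in the bucket
    public DateTime Latest;     // latest event time in the bucket
    public bool Hot;            // writeable bucket
}

static class IndexedSearch
{
    // Buckets may overlap in time, so every overlapping bucket must be scanned.
    public static IEnumerable<IndexBucket> InWindow(
        IEnumerable<IndexBucket> buckets, DateTime from, DateTime to)
    {
        return buckets.Where(b => b.Earliest <= to && b.Latest >= from);
    }
}

For example, InWindow(buckets, DateTime.UtcNow.AddDays(-7), DateTime.UtcNow) would serve a "last one week" search.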

For each of these searches, it supports three different modes:
- Fast mode: field discovery off for event searches; no event or field data for stats searches.
- Smart mode: field discovery on for event searches; no event or field data for stats searches.
- Verbose mode: all event and field data.
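These modes might be captured with a couple of flags, as in this illustrative sketch:

enum SearchMode { Fast, Smart, Verbose }

struct SearchModeFlags
{
    public bool FieldDiscovery;        // discover fields for event searches
    public bool EventDataForStats;     // return event/field data for stats searches

    public static SearchModeFlags For(SearchMode mode)
    {
        switch (mode)
        {
            case SearchMode.Fast:
                return new SearchModeFlags { FieldDiscovery = false, EventDataForStats = false };
            case SearchMode.Smart:
                return new SearchModeFlags { FieldDiscovery = true, EventDataForStats = false };
            default: // Verbose
                return new SearchModeFlags { FieldDiscovery = true, EventDataForStats = true };
        }
    }
}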

Friday, August 1, 2014

Today I'm going to continue the discussion on SplunkLite.Net.
In today's post we continue our discussion on porting the Splunk forwarder functionality to SplunkLite.Net. Here we look at data structures such as the QueueReader and QueueWriter in the ProducerConsumerQueue, and we make sure the ProducerConsumerQueue has an equivalent in .Net.
The queue is the abstract notion through which CowPipelineData is read.
QueueReader is the interface for the client to read data from the queue
QueueWriter is the interface for the client app to add data to the queue. There are several queue implementations.
Each QueueWriter implementation has additional methods to get the size, check whether the queue is full or empty, reset the size, get stats, remove a single item, add metrics, and shut down.
The writers can be in-memory or on-disk writers.
A PersistentQueue is one which has one reader and many writers. The writers write to memory, but when they overflow they write to disk.
A ProducerConsumerQueue has in-memory writers by name, persistent writers by name, a list of all writers, the next provider to be polled for data, a mutex protecting a condition, and a queue-not-empty condition. A sketch of a .Net equivalent follows.
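Here is a minimal sketch, assuming BlockingCollection as the backing store; the interface names mirror QueueReader and QueueWriter but the shapes are my own, not Splunk's. BlockingCollection already provides the mutex and the not-empty condition internally.

using System.Collections.Concurrent;
using System.Collections.Generic;

interface IQueueWriter<T>
{
    void Add(T item);
    bool IsFull { get; }
}

interface IQueueReader<T>
{
    T Take();   // blocks until an item is available
}

class ProducerConsumerQueue<T> : IQueueReader<T>
{
    readonly BlockingCollection<T> _items;
    readonly Dictionary<string, IQueueWriter<T>> _writersByName =
        new Dictionary<string, IQueueWriter<T>>();   // in-memory (and, in the real port, persistent) writers by name

    public ProducerConsumerQueue(int capacity)
    {
        _items = new BlockingCollection<T>(capacity);
    }

    public IQueueWriter<T> GetWriter(string name)
    {
        if (!_writersByName.ContainsKey(name))
            _writersByName[name] = new MemoryWriter(this);
        return _writersByName[name];
    }

    public T Take() { return _items.Take(); }

    class MemoryWriter : IQueueWriter<T>
    {
        readonly ProducerConsumerQueue<T> _owner;
        public MemoryWriter(ProducerConsumerQueue<T> owner) { _owner = owner; }
        public void Add(T item) { _owner._items.Add(item); }
        public bool IsFull
        {
            get { return _owner._items.Count >= _owner._items.BoundedCapacity; }
        }
    }
}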
With the PipelineData, we mentioned earlier that each thread can have its exclusive CowPipelineData, though underneath they may share the same PipelineData.
Moreover, we mentioned that PipelineData uses the notion of a RawStoragePool and RawStorageReference, but these are not suitable for porting because we don't manage memory in .Net.
Anytime the underlying data is modified, we do something like this in the forwarder:
PipelineData *x = pData.getWriter();
x->put(...);
Together this shows that we can use .Net collections while retaining the semantics of a queue reader and writer in SplunkLite.Net.
Search is one of the biggest operations and we will deal with that soon. 

Thursday, July 31, 2014

In today's post, we continue our discussion on porting the Splunk forwarder to SplunkLite.Net, a lightweight application that forwards, indexes and searches Splunk data. In the previous posts, we discussed a few of the functionalities we require, such as the ability to create a data pipeline for input and processors that can convert the input into events and save them for later analytics. There are still a few more data structures to look into, but as we have seen, the majority of the framework and utilities we use are conveniently available in .Net libraries, which reduces the code significantly. Framework helpers such as HTTP request and response handling, HttpStaticDispatcher, Processor, QueryRunningSingleton, SearchResults, SearchEvaluator, ServerConfig, TagManager, etc. are still needed. The ability to secure the REST calls with AdminManager is also needed. The KeyManagers for localhost, search peers and general settings can come in later too. The utilities for FileSystemWatcher, datetime handling, HTTPServer and ProducerConsumerQueue already have support in .Net.

Proprietary database helpers such as PersistentHashData, PersistentHashManager, PersistentMapManager and PersistentStorage are still required. Let us look at the persistent data structures more closely. PersistentMapManager provides a way to look up based on keys and tags. It has methods to get all the keys or the matching keys, to check whether a key exists, and to remove keys; the same holds for tags. The ability to look up the store by keys and tags has been a key feature of Splunk analytics. PersistentHashManager maintains a hash table and gets all the data that matches a key. The values are maintained as PersistentHashData, and the data on disk is accessed via RecordFileManager, which loads the DB file into memory and has methods to read and write records to disk.
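As a rough sketch, PersistentMapManager's lookup surface might take the following .Net shape; the method names are inferred from the description above, and the in-memory dictionaries stand in for the real persistent storage.

using System.Collections.Generic;
using System.Linq;

class PersistentMapManager
{
    readonly Dictionary<string, HashSet<string>> _byKey = new Dictionary<string, HashSet<string>>();
    readonly Dictionary<string, HashSet<string>> _byTag = new Dictionary<string, HashSet<string>>();

    public IEnumerable<string> GetAllKeys() { return _byKey.Keys; }

    public IEnumerable<string> GetMatchingKeys(string prefix)
    {
        return _byKey.Keys.Where(k => k.StartsWith(prefix));
    }

    public bool KeyExists(string key) { return _byKey.ContainsKey(key); }

    public void RemoveKey(string key) { _byKey.Remove(key); }

    // The same surface holds for tags.
    public bool TagExists(string tag) { return _byTag.ContainsKey(tag); }
}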
Results from a search on the database are available via a data structure called SearchResults, which is a collection of SearchResult objects and maintains a key map. Each SearchResult returns a list of fields, which can be multivalued.
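A minimal .Net shape for these result types might be as follows; the members are assumptions based on the description above.

using System.Collections.Generic;

class SearchResult
{
    // Each field can be multivalued, hence a list of values per field name.
    readonly Dictionary<string, List<string>> _fields = new Dictionary<string, List<string>>();

    public IEnumerable<string> FieldNames { get { return _fields.Keys; } }

    public void SetField(string name, params string[] values)
    {
        _fields[name] = new List<string>(values);
    }

    public IList<string> GetFieldValues(string name)
    {
        List<string> values;
        return _fields.TryGetValue(name, out values) ? values : new List<string>();
    }
}

class SearchResults : List<SearchResult>
{
    // Key map shared across the collection: field name -> column position.
    public Dictionary<string, int> KeyMap = new Dictionary<string, int>();
}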
Note that the SearchResult is internal to Splunk. The export of results in different data formats via online and offline methods is also available. This lets Splunk integrate well in most ecosystems. I introduced a way for Splunk to provide searchable data to LogParser, which has a SQL interface. The ability to use SQL over Splunk makes it friendly to users who work primarily with databases.

Wednesday, July 30, 2014

Today we are going to continue the discussion on the Splunk forwarder port to SplunkLite.Net. Here we cover the event and metadata. If we look at an input processor in the Splunk input pipeline, we see a common pattern such as:
public class StdinInputProcessor : InputProcessor
{
    public StdinInputProcessor() { }

    // Initialize from the plugin instance's XML configuration.
    public override void Init(XmlNode pluginInstanceConfig)
    {
        // configuration handling elided
    }

    // Copy stdin data into the pipeline and set metadata keys.
    public override EProcessorReturn Execute(CowPipelineData pData)
    {
        return default(EProcessorReturn);   // body elided
    }
}

Essentially, the only two operations it is responsible for are initialization and execution.
The Execute method merely copies the data from the external source to the event writer and sets metadata such as the source key.
Occasionally a producer-consumer queue and a shutdown thread are utilized, when more than one channel needs to be served and decoupling the producer and consumer helps.

Event conversion happens after the data has been copied over. A few fields are populated to describe the event, including source and destination metadata such as host, timestamp, sourcetype, index, etc. A sketch of this step follows.
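As a sketch, assuming a simple dictionary-backed event (the types here are hypothetical, not the ported ones); the key names come from the list above:

using System;
using System.Collections.Generic;

class PipelineEvent
{
    public Dictionary<string, string> Metadata = new Dictionary<string, string>();
    public string Raw;
}

static class EventConversion
{
    public static PipelineEvent ToEvent(string raw, string host, string sourcetype, string index)
    {
        var evt = new PipelineEvent { Raw = raw };
        evt.Metadata["host"] = host;
        evt.Metadata["sourcetype"] = sourcetype;                  // e.g. "stdin"
        evt.Metadata["index"] = index;                            // destination index
        evt.Metadata["_time"] = DateTime.UtcNow.ToString("o");    // event timestamp
        return evt;
    }
}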

CowPipelineData is a proxy object that wraps PipelineData, which manages the data passed around in the pipeline and maintains a map of predefined keys. PipelineData also processes headers and handles serialization. CowPipelineData provides thread-safe copy-on-write semantics: each thread has an exclusive CowPipelineData, although the threads may share the same PipelineData in a thread-safe manner.
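A minimal copy-on-write proxy in .Net might look like the following sketch of those semantics; the reference-counting scheme is an assumption, not Splunk's implementation.

using System.Collections.Generic;
using System.Threading;

class PipelineData
{
    public int RefCount;                      // number of proxies sharing this instance
    public Dictionary<string, string> Keys = new Dictionary<string, string>();

    public PipelineData Clone()
    {
        return new PipelineData { RefCount = 1, Keys = new Dictionary<string, string>(Keys) };
    }
}

class CowPipelineData
{
    PipelineData _data;

    public CowPipelineData(PipelineData data)
    {
        _data = data;
        Interlocked.Increment(ref _data.RefCount);
    }

    // Reads go straight to the shared instance.
    public string Get(string key) { return _data.Keys[key]; }

    // Writers get a private copy if the underlying data is shared.
    public PipelineData GetWriter()
    {
        if (Interlocked.CompareExchange(ref _data.RefCount, 0, 0) > 1)
        {
            Interlocked.Decrement(ref _data.RefCount);
            _data = _data.Clone();            // copy on first write
        }
        return _data;
    }
}

With this shape, the forwarder pattern from the earlier post becomes pData.GetWriter().Keys["source"] = "stdin" in .Net, with the private copy made only when the underlying data is shared.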