Thursday, July 31, 2014

In today's post, we continue our discussion on porting the Splunk forwarder to SplunkLite.Net, a lightweight application that forwards, indexes and searches Splunk data. In the previous posts, we discussed a few of the capabilities we require, such as the ability to create a data pipeline for input and processors that can convert the input into events and save them for analytics later. There are still a few more data structures to look into, but as we have seen, the majority of the framework and utilities we use are conveniently available in .Net libraries, which reduces the code significantly. Framework helpers such as HTTP request and response handling, HttpStaticDispatcher, Processor, QueryRunningSingleton, SearchResults, SearchEvaluator, ServerConfig, TagManager, etc. are still needed. The ability to secure the REST calls with AdminManager is also needed. The KeyManagers for localhost, search peers and general settings can come in later too. The utilities for FileSystemWatcher, datetime handling, HTTPServer and ProducerConsumerQueue already have support in .Net. Proprietary database helpers such as PersistentHashData, PersistentHashManager, PersistentMapManager and PersistentStorage are still required.
Let us look at the persistent data structures more closely. PersistentMapManager provides a way to look up data based on keys and tags. It has methods to get all the keys or the matching keys, to check if a key exists, and to remove keys; the same holds for tags. The ability to look up the store based on keys and tags has been a key feature of Splunk analytics. PersistentHashManager maintains a hash table and gets all the data that matches a key. The values are maintained as PersistentHashData, and the data on disk is accessed via RecordFileManager, which loads the DB file into memory and has methods to read and write records to disk.
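As a rough illustration of the shape these helpers could take in C#, here is a minimal sketch of a key-and-tag lookup interface; the member names and signatures are assumptions for illustration, not the actual Splunk interfaces.

using System.Collections.Generic;

// Hypothetical key-and-tag lookup store, loosely modeled on the
// PersistentMapManager description above.
public interface IPersistentMapManager
{
    IEnumerable<string> GetKeys();                        // all keys
    IEnumerable<string> GetMatchingKeys(string pattern);  // keys matching a pattern
    bool ContainsKey(string key);
    void RemoveKey(string key);

    IEnumerable<string> GetTags();
    IEnumerable<string> GetKeysByTag(string tag);         // lookup by tag
    bool ContainsTag(string tag);
    void RemoveTag(string tag);
}

The concrete implementation would sit on top of the hash table and RecordFileManager described above, but the lookup surface can stay this small.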
Results from the search on the database are available via a data structure called SearchResults, which is a collection of SearchResult objects and maintains a key map. Each SearchResult returns a list of fields, which can be multivalued.
Note that the SearchResult is internal to Splunk. The export of results in different data formats via online and offline methods is also available, which lets Splunk integrate well into most ecosystems. I introduced a way for Splunk to provide searchable data to LogParser, which has a SQL interface. The ability to use SQL over Splunk makes it friendly to users who work primarily with databases.
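To make the shape of these results concrete, here is a hypothetical C# sketch of a result whose fields may be multivalued; the class and member names are assumptions rather than Splunk's actual types.

using System.Collections.Generic;

// Hypothetical sketch: a search result is a bag of fields, each possibly multivalued.
public class SearchResult
{
    private readonly Dictionary<string, List<string>> fields =
        new Dictionary<string, List<string>>();

    public IEnumerable<string> FieldNames { get { return fields.Keys; } }

    public IList<string> GetFieldValues(string name)
    {
        List<string> values;
        return fields.TryGetValue(name, out values) ? values : new List<string>();
    }

    public void AddFieldValue(string name, string value)
    {
        List<string> values;
        if (!fields.TryGetValue(name, out values))
        {
            values = new List<string>();
            fields[name] = values;
        }
        values.Add(value);
    }
}

// Collection of results that also tracks the set of field keys seen.
public class SearchResults : List<SearchResult>
{
    public readonly HashSet<string> KeyMap = new HashSet<string>();
}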

Wednesday, July 30, 2014

Today we are going to continue the discussion on the Splunk forwarder port to SplunkLite.Net. Here we cover the event and its metadata. If we look at an input processor in the Splunk input pipeline, we see a common pattern such as:
using System;
using System.Xml;

// C# rendering of the common input-processor pattern; InputProcessor,
// CowPipelineData and EProcessorReturn come from the ported framework.
public class StdinInputProcessor : InputProcessor
{
    // Reads the plugin instance configuration for this modular input.
    public override void Init(XmlNode pluginInstanceConfig) { /* parse configuration */ }

    // Copies data from stdin into the pipeline and sets metadata such as the source key.
    public override EProcessorReturn Execute(CowPipelineData pData)
    {
        throw new NotImplementedException();  // copy data, set metadata, return status
    }
}

Essentially, the only two operations it is responsible for are initialization and execution.
The Execute method merely copies the data from the external source to the event writer and sets metadata such as the source key.
Occasionally a producer-consumer queue and a shutdown thread are used when more than one channel needs to be served and decoupling the producer from the consumer helps.
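As a sketch of that decoupling, .Net's BlockingCollection can serve as the producer-consumer queue, with CompleteAdding doubling as the shutdown signal; the string payload and method names here are placeholders.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class InputQueue
{
    // Bounded producer-consumer queue; CompleteAdding() acts as the shutdown signal.
    private readonly BlockingCollection<string> queue =
        new BlockingCollection<string>(boundedCapacity: 1000);

    public void Produce(string rawEvent)
    {
        queue.Add(rawEvent);          // blocks when the queue is full
    }

    public Task StartConsumer(Action<string> writeToPipeline)
    {
        return Task.Factory.StartNew(() =>
        {
            // Drains the queue and exits once CompleteAdding() has been called.
            foreach (var rawEvent in queue.GetConsumingEnumerable())
                writeToPipeline(rawEvent);
        }, TaskCreationOptions.LongRunning);
    }

    public void Shutdown()
    {
        queue.CompleteAdding();       // lets the consumer finish and stop
    }
}

Using GetConsumingEnumerable together with CompleteAdding saves us from writing the shutdown signaling by hand.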

Event conversion happens after the data has been copied over. A few fields are populated to describe the event, including source and destination metadata such as host, timestamp, sourcetype, index, etc.

CowPipelineData is a proxy object that wraps PipelineData, the class that manages the data passed around in the pipeline and maintains a map of predefined keys. PipelineData also processes headers and handles serialization. CowPipelineData provides thread-safe copy-on-write semantics: each thread has an exclusive CowPipelineData, although they may share the same PipelineData in a thread-safe manner.
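A minimal sketch of the copy-on-write idea, assuming a much-simplified PipelineData that only holds the key map (this illustrates the semantics, not the actual CowPipelineData implementation, and omits locking):

using System.Collections.Generic;

// Simplified stand-in for PipelineData: a map of predefined keys to values.
public class PipelineData
{
    public Dictionary<string, object> Keys = new Dictionary<string, object>();

    public PipelineData Clone()
    {
        return new PipelineData { Keys = new Dictionary<string, object>(Keys) };
    }
}

// Copy-on-write proxy: reads share the underlying PipelineData,
// the first write through a thread's wrapper makes a private copy.
public class CowPipelineData
{
    private PipelineData data;
    private bool owned;   // true once this wrapper holds its own copy

    public CowPipelineData(PipelineData shared) { data = shared; owned = false; }

    public object Get(string key)
    {
        object value;
        return data.Keys.TryGetValue(key, out value) ? value : null;
    }

    public void Set(string key, object value)
    {
        if (!owned) { data = data.Clone(); owned = true; }   // copy on first write
        data.Keys[key] = value;
    }
}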
As we describe the inputs and the forwarder of SplunkLite.Net, it is interesting to note that the events diffused in the input file follow a Markov chain. Consider a Markov chain and decompose it into communicating classes. For each positive recurrent communicating class C we have a unique stationary distribution $\pi_C$ which assigns positive probability to each state in C; if a is a state belonging to C, we may also write $\pi_C$ as $\pi_a$. An arbitrary stationary distribution $\pi$ must be a linear combination of these $\pi_C$.
To define this more formally (Konstantopoulous): to each positive recurrent communicating class C there corresponds a stationary distribution $\pi_C$. Any stationary distribution $\pi$ is necessarily of the form $\pi = \sum_{C} \alpha_C \, \pi_C$, where $\alpha_C \ge 0$ and $\sum_{C} \alpha_C = 1$.
We say that the distribution is normalized to unity because the coefficients $\alpha_C$ act as weights and their sum is one.
Here's how we prove that an arbitrary stationary distribution is a weighted combination of the unique per-class stationary distributions.
Let $\pi$ be an arbitrary stationary distribution. If $\pi(x) > 0$ then x belongs to some positive recurrent class C. The conditional distribution of $\pi$ over C is given by $\pi(x \mid C) = \pi(x) / \pi(C)$, where the denominator is the sum of $\pi(x)$ over all x in C. By the uniqueness of the stationary distribution on C, this conditional distribution is exactly $\pi_C$, so $\pi(x \mid C) = \pi_C(x)$. Setting $\alpha_C = \pi(C)$, we get $\pi(x) = \alpha_C \, \pi_C(x)$ for each x in C, and hence $\pi = \sum_{C} \alpha_C \, \pi_C$: an arbitrary stationary distribution $\pi$ is a linear combination of these unique stationary distributions.
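Putting the steps of the argument into one display (a restatement of the proof above, in the same notation):

$$
\pi(x) \;=\; \pi(C)\,\pi(x \mid C) \;=\; \pi(C)\,\pi_C(x) \;=\; \alpha_C\,\pi_C(x) \quad \text{for } x \in C,
\qquad\text{so}\qquad \pi \;=\; \sum_{C} \alpha_C\,\pi_C .
$$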
Now we return to our steps to port the Splunk forwarder.

Tuesday, July 29, 2014

Let us look at the porting of the Splunk Forwarder to SplunkLite.Net (the app). This is the app we want to have all three roles: forwarder, indexer and search head. Our next few posts may continue to be on the forwarder, because they all scope the Splunk code first. When we prepare a timeline for implementation, we will consider sprints and include time for TDD and feature priority, but first let us take a look at the different sections of the input pipeline and what we need to do. First we looked at PipelineData. We now look at the context data structures. Our goal is to enumerate the classes and data structures needed to enable writing an event from a modular input. We will look at metadata too.
Context data structures include configuration information. Configuration is read and merged at the startup of the program. For forwarding, we need some configuration that is expressed in files such as inputs.conf, props.conf, etc. These are key-value pairs, not the XML that .Net configuration files use. The format of the configuration does not matter; what matters is its validation and its population into a runtime data structure called a PropertyMap. Configuration keys also have their own metadata, and this is important because we actively use it.
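A sketch of reading such key-value stanzas into a runtime map that plays the role of the PropertyMap; the parsing rules shown here are assumptions for illustration.

using System.Collections.Generic;
using System.IO;

// Hypothetical loader: reads a .conf file of [stanza] headers and key = value lines
// into a nested dictionary standing in for the PropertyMap.
public static class ConfLoader
{
    public static Dictionary<string, Dictionary<string, string>> Load(string path)
    {
        var map = new Dictionary<string, Dictionary<string, string>>();
        var current = "default";
        map[current] = new Dictionary<string, string>();

        foreach (var raw in File.ReadAllLines(path))
        {
            var line = raw.Trim();
            if (line.Length == 0 || line.StartsWith("#")) continue;   // skip blanks and comments

            if (line.StartsWith("[") && line.EndsWith("]"))
            {
                current = line.Substring(1, line.Length - 2);         // new stanza
                if (!map.ContainsKey(current)) map[current] = new Dictionary<string, string>();
            }
            else
            {
                var idx = line.IndexOf('=');
                if (idx > 0)
                    map[current][line.Substring(0, idx).Trim()] = line.Substring(idx + 1).Trim();
            }
        }
        return map;
    }
}

Validation against the key metadata mentioned above would happen after this load-and-merge step.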
The PipelineData class manages data that is passed around in the pipeline. Note that we don't need raw storage for the data, since we don't need to manage the memory. The same goes for the memory pool: because we don't actively manage the memory, the runtime does it for us, and this scales well. What we need is the per-event data structure including metadata. Metadata can be kept for each field, which lets us understand the event better.
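One way to picture the per-event structure with per-field metadata, with illustrative names that are assumptions rather than the ported class:

using System;
using System.Collections.Generic;

// Hypothetical per-event structure: the raw text plus event-level and field-level metadata.
public class PipelineEvent
{
    public string Raw;                                    // the raw event text
    public DateTime Timestamp;

    // Predefined metadata keys such as host, source, sourcetype, index ...
    public Dictionary<string, string> Metadata = new Dictionary<string, string>();

    // Optional per-field metadata, e.g. how a field was extracted.
    public Dictionary<string, Dictionary<string, string>> FieldMetadata =
        new Dictionary<string, Dictionary<string, string>>();
}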
We will leave the rest of the discussion on metadata for when we discuss the indexer. For now, we have plenty more to cover for the forwarder.
Interestingly, one of the modular inputs is UDP, and this should facilitate traffic from sources such as games and other real-time applications. However, the UDP modular input may not work in Splunk, especially if the connection_host key is specified.
We will focus exclusively on a file-based modular input for the prototype we want to build. This keeps it simple and gives us the flexibility to work with events small and large, few and many.
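A minimal sketch of such a file-based input using .Net's FileSystemWatcher; the pipeline delegate and the whole-file re-read are simplifying assumptions.

using System;
using System.IO;

// Hypothetical file input: watches a directory and hands new lines to the pipeline.
public class FileInput
{
    private readonly FileSystemWatcher watcher;
    private readonly Action<string> writeToPipeline;

    public FileInput(string directory, Action<string> writeToPipeline)
    {
        this.writeToPipeline = writeToPipeline;
        watcher = new FileSystemWatcher(directory)
        {
            NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.FileName
        };
        watcher.Created += OnChanged;
        watcher.Changed += OnChanged;
    }

    public void Start() { watcher.EnableRaisingEvents = true; }
    public void Stop()  { watcher.EnableRaisingEvents = false; }

    private void OnChanged(object sender, FileSystemEventArgs e)
    {
        // Simplification: re-read the whole file; a real input would track per-file offsets.
        foreach (var line in File.ReadAllLines(e.FullPath))
            writeToPipeline(line);
    }
}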

Monday, July 28, 2014

Today onwards I'm going to start a series of posts on Splunk internals. As we know, there are three roles for Splunk Server: forwarding, indexing and searching.
We will cover the forwarding section today to see which components need to be ported to a SplunkLite framework.
First we look at PipelineData and its associated data structures. It is easy to port some of these to C#, and doing so gives us the same definitions for the input that we have in Splunk Server.
Pipeline components and actor threads can be used directly in .Net. Configurations can be maintained via configuration files just the same as in Splunk Server. Threads that are dedicated to shutdown or to running the event loop are still required. Note that framework items like the event loop can be substituted by the equivalent .Net scheduler classes; .Net has rich support for threads and scheduling via the .Net 4.0 Task library.
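For instance, a dedicated event-loop thread can be replaced with a long-running Task and a CancellationToken for shutdown; the processOneItem callback below is a placeholder for real pipeline work.

using System;
using System.Threading;
using System.Threading.Tasks;

public static class EventLoop
{
    // Runs a processing callback until cancellation is requested, in place of a
    // hand-rolled event-loop thread.
    public static Task Run(Action processOneItem, CancellationToken token)
    {
        return Task.Factory.StartNew(() =>
        {
            while (!token.IsCancellationRequested)
            {
                processOneItem();
            }
        }, token, TaskCreationOptions.LongRunning, TaskScheduler.Default);
    }
}

A CancellationTokenSource owned by the shutdown path then cancels the token to stop the loop cleanly.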
While on the topic of framework items, we might as well cover the logger. Logging is available via packages like log4net or the Enterprise Library logging application block. These are convenient to add to the application and come with support for multiple destinations. When we iterate over the utilities required for this application, we will see that much of the effort of writing something portable with Splunk Server goes away because .Net already provides those pieces.
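For example, with log4net the wiring amounts to a few lines (a sketch, assuming the log4net section lives in the application's configuration file):

using log4net;
using log4net.Config;

public static class Log
{
    private static readonly ILog Logger = LogManager.GetLogger(typeof(Log));

    public static void Init()
    {
        // Reads the log4net configuration from the application's configuration file.
        XmlConfigurator.Configure();
    }

    public static void Example()
    {
        Logger.Info("forwarder started");
        Logger.Error("failed to open input file");
    }
}

The appenders configured there decide whether the messages go to files, the event log, or other destinations.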
When writing the forwarder, we can start small with a small set of inputs and expand the choices later. Having one forwarder, one indexer and one search head will be sufficient for a proof of concept. The code can provide end-to-end functionality, and we can then augment each of the processors, whether they are input, search or index processors. Essentially the processors all conform in the same way for the same role, so how we expand is up to us.
PersistentStorage may need to be specified to work with the data, and this is proprietary. Data and metadata may require data structures similar to what we have in Splunk. We will look into the hash manager and the record file manager. We should budget for things that are specific to Splunk first, because they are artifacts that have a history and a purpose.
Items that we can and should deliberately avoid are those for which we have rich and robust .Net features, such as producer-consumer queues.
The porting of the application may require a lot of time. An initial estimate for the bare bones and testing is in order. Anything else we can keep prioritized.

Sunday, July 27, 2014

Splunk indexes both text and binary data. In this post, we will see how we can use Splunk with archival storage devices. Companies like DataDomain have a great commercial advantage in data backup and archival. Their ability to back up data on a continuous basis and use deduplication to reduce the size of the data makes the use of Splunk interesting. But first let us look at what it means to index binary data. We know that it's different from text, where indexing uses compact hash values and an efficient data structure such as a B+ tree for lookup and retrieval. Text data also lends itself to key-value pair extraction, which comes in handy with NoSQL databases. The trouble with binary data is that it cannot be meaningfully searched and analyzed; unless there is textual metadata associated with it, binary data is not helpful. For example, an image file's bytes are not as helpful as its size, creation tool, username, camera, GPS location, etc. Even textual representations such as XML are not always helpful, since they are difficult for humans to read and require parsing. As an example, serializing code objects in an application may be helpful, but logging their significant key-value pairs may be even better, since those will be in a textual format that lends itself to Splunk forwarding, indexing and searching.
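As a concrete illustration of logging significant key-value pairs rather than serialized binary objects, a helper might format events as shown below; the field names in the example output are hypothetical.

using System;
using System.Collections.Generic;
using System.Linq;

public static class KeyValueLogger
{
    // Emits a single text line of key="value" pairs, which Splunk can index
    // and from which it can extract fields.
    public static string Format(IDictionary<string, string> fields)
    {
        var pairs = fields.Select(kv => kv.Key + "=\"" + kv.Value.Replace("\"", "'") + "\"");
        return DateTime.UtcNow.ToString("o") + " " + string.Join(" ", pairs);
    }
}

// Example output (hypothetical fields):
// 2014-07-27T18:30:00.0000000Z action="backup" size_bytes="1048576" tool="archiver" user="svc_backup"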
This can be used with periodic and acyclic maintenance data archival as well. The applications that archive data sometimes move terabytes of data. Moreover, they interpret and organize this data in a way where nothing is lost during the move, yet the delta changes between, say, two backup runs are collected and saved with efficiency in size and computation. There is a lot of metadata gathered in the process by these applications, and the same can be logged to Splunk. Splunk in turn enables superior analytics on this data. One characteristic of data that is backed up or archived regularly is that it has a lot of recurrence and repetition; that is why companies like DataDomain are able to de-duplicate the events and reduce the footprint of the data to be archived. Those computations carry a lot of associated information that can be expressed in rich metadata suitable for analytics later. For example, the source of the data, the programs that use the data, metadata on the data that was de-duped, and the location and availability of the archived data are all relevant for analytics later. This way those applications need not fetch the data to answer analytical queries but can work directly off the Splunk indexes instead.
If we look at the custom search commands in a Splunk instance, we actually find a trove of utilities. Some of these scripts do such things as streaming search results to an XML file. There are commands to do each of the following:

 Produces a summary of each search result.
 Add fields that contain common information about the current search.
 Computes the sum of all numeric fields for each result.
 Computes an "unexpectedness" score for an event.
 Finds and summarizes irregular
 Appends subsearch results to current results.
 Appends the fields of the subsearch results to current results
 Find association rules between field values
 Identifies correlations between fields.
 Returns audit trail information that is stored in the local audit index.
 Sets up data for calculating the moving average.
 Analyzes numerical fields for their ability to predict another discrete field.
 Keeps a running total of a specified numeric field.
 Computes the difference in field value between nearby results.
 Puts continuous numerical values into discrete sets.
 Returns results in a tabular output for charting.
 Find how many times field1 and field2 values occurred together

 Builds a contingency table for two fields.
 Converts field values into numerical values.
 Crawls the filesystem for files of interest to Splunk
 Adds the RSS item into the specified RSS feed.
 Allows user to examine data models and run the search for a datamodel object.
 Removes the subsequent results that match specified criteria.
 Returns the difference between two search results.
 Automatically extracts field values similar to the example values.
 Calculates an expression and puts the resulting value into a field.
 Extracts values from search results
 Extracts field-value pairs from search results. 
 Keeps or removes fields from search results.
 Generates summary information for all or a subset of the fields.
 Replace null values with last non-null value
 Replaces null values with a specified value.
 Replaces "attr" with higher-level grouping
 Replaces PATHFIELD with higher-level grouping
 Run a templatized streaming subsearch for each field in a wildcarded field list
 Takes the results of a subsearch and formats them into a single result. 
 Transforms results into a format suitable for display by the Gauge chart types.  
 Generates time range results.
 Generate statistics which are clustered into geographical bins to be rendered on a world map.
 Returns the first n number of specified results.
 Returns the last n number of specified results.
 Returns information about the Splunk index.
 Adds or disables sources from being processed by Splunk.
 Loads search results from the specified CSV file.
 Loads search results from a specified static lookup table.
 SQL-like joining of results from the main results pipeline with the results from the subpipeline.
 Joins results with itself.
 Performs k-means clustering on selected fields.
 Returns a list of time ranges in which the search results were found.
 Prevents subsequent commands from being executed on remote peers.
 Loads events or results of a previously completed search job. 
 Explicitly invokes field value lookups.
 Looping operator
 Extracts field-values from table-formatted events.
 Do multiple searches at the same time
 Combines events in the search results that have a single differing field value into one result with a multi-value field of the differing field.
 Expands the values of a multi-value field into separate events for each value of the multi-value field. 
 Changes a specified field into a multi-value field during a search.
 Changes a specified multi-value field into a single-value field at search time. 
  Removes outlying numerical values.
 Executes a given search query and export events to a set of chunk files on local disk. 
 Outputs search results to the specified CSV file.
 Save search results to specified static lookup table.
 Outputs search results in a simple
 Outputs the raw text (_raw) of results into the _xml field.
 Finds events in a summary index that overlap in time or have missed events.
 Allows user to run pivot searches against a particular datamodel object.
 Predict future values of fields.
 See what events from a file will look like when indexed without actually indexing the file.
 Displays the least common values of a field.
 Removes results that do not match the specified regular expression.
 Calculates how well the event matches the query.
 Renames a specified field (wildcards can be used to specify multiple fields).
 Replaces values of specified fields with a specified new value.
 Specifies a Perl regular expression named groups to extract fields while you search.
 Buffers events from real-time search to emit them in ascending time order when possible
 The select command is deprecated. If you want to compute aggregate statistics
 Makes calls to external Perl or Python programs.
 Returns a random sampling of N search results.
 Returns the search results of a saved search. 
  Emails search results to specified email addresses.
 Sets the field values for all results to a common value.
 Extracts values from structured data (XML or JSON) and stores them in a field or fields.
 Turns rows into columns.
 Filters out repeated adjacent results
 Retrieves event metadata from indexes based on terms in the <logical-expression>
 Filters results using keywords
 Performs set operations on subsearches.
 Clusters similar events together.
 Produces a symbolic 'shape' attribute describing the shape of a numeric multivalued field
 Sorts search results by the specified fields.
 Puts search results into a summary index.
 Adds summary statistics to all search results in a streaming manner.
 Adds summary statistics to all search results.
 Provides statistics
 Concatenates string values.
 Summary indexing friendly versions of stats command.
 Summary indexing friendly versions of top command.
 Summary indexing friendly versions of rare command.
 Summary indexing friendly versions of chart command.
 Summary indexing friendly versions of timechart command.
 Annotates specified fields in your search results with tags. 
 Computes the moving averages of fields.
 Creates a time series chart with corresponding table of statistics.
 Displays the most common values of a field.
 Writes the result table into *.tsidx files using indexed fields format.
 Performs statistics on indexed fields in tsidx files
 Groups events into transactions.
 Returns typeahead on a specified prefix.
 Generates suggested eventtypes.  Deprecated: preferred command is 'findtypes'
 Calculates the eventtypes for the search results
 Runs an eval expression to filter the results. The result of the expression must be Boolean.
 Causes UI to highlight specified terms.
 Converts results into a format suitable for graphing. 
 Extracts XML key-value pairs.
 Un-escapes XML characters.
 Extracts the xpath value from FIELD and sets the OUTFIELD attribute.
 Extracts location information from IP addresses using 3rd-party databases.
 Processes the given file as if it were indexed.
 Sets RANGE field to the name of the ranges that match.
 Returns statistics about the raw field.
 Sets the 'reltime' field to a human readable value of the difference between 'now' and '_time'.
 Anonymizes the search results.
 Returns a list of source
 Performs a debug command.
 Performs a deletion from the index.
 Returns the number of events in an index.
 Generates suggested event types.
 convenient way to return values up from a subsearch
 Internal command used to execute scripted alerts 
 finds transaction events given search constraints
 Runs the search script
 Remove seasonal fluctuations in fields.