Wednesday, July 30, 2014

As we describe the inputs and the forwarder of SplunkLite.Net, it's interesting to note that the events arriving at the input file can be modeled as a Markov chain. Consider a Markov chain and decompose it into communicating classes. For each positive recurrent communicating class C, there is a unique stationary distribution pi_C which assigns positive probability to each state in C; pi_C can be constructed starting from any state a in C, and it does not depend on the choice of a. An arbitrary stationary distribution pi must then be a convex combination of these pi_C.
To define this more formally (Konstantopoulos): to each positive recurrent communicating class C there corresponds a stationary distribution pi_C. Any stationary distribution pi is necessarily of the form pi = sum over C of alpha_C * pi_C, where alpha_C >= 0 for every C and sum over C of alpha_C = 1.
We say that the combination is normalized to unity because the coefficients alpha_C act as weights and these weights sum to one.
Here's how we prove that an arbitrary stationary distribution is a weighted combination of the unique stationary distributions.
Let pi be an arbitrary stationary distribution. If pi(x) > 0, then x belongs to some positive recurrent class C (a stationary distribution puts no mass on transient or null recurrent states). The conditional distribution of pi on C is pi(x | C) = pi(x) / pi(C), where the denominator pi(C) is the sum of pi(y) over all y in C. This conditional distribution is itself stationary and supported on C, so by uniqueness it must equal pi_C(x). Setting alpha_C = pi(C), we get pi(x) = sum over C of alpha_C * pi_C(x); that is, an arbitrary stationary distribution pi is a convex combination of these unique per-class stationary distributions.
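To make the decomposition concrete, here is a minimal numeric sketch (assuming numpy is available; the transition matrix and the weight alpha are made up for illustration). It checks that a convex combination of the per-class stationary distributions is stationary for the whole chain.

# A chain with two closed classes {0,1} and {2,3}; any stationary
# distribution of the full chain is a convex combination of the
# per-class stationary distributions.
import numpy as np

P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.2, 0.8, 0.0, 0.0],
              [0.0, 0.0, 0.9, 0.1],
              [0.0, 0.0, 0.3, 0.7]])

def stationary(P):
    # left eigenvector of P for eigenvalue 1, normalized to sum to 1
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

pi_C1 = stationary(P[:2, :2])     # stationary distribution on class {0,1}
pi_C2 = stationary(P[2:, 2:])     # stationary distribution on class {2,3}

alpha = 0.25                      # any alpha in [0,1] works
pi = np.concatenate([alpha * pi_C1, (1 - alpha) * pi_C2])
print(np.allclose(pi @ P, pi))    # True: pi is stationary for the full chain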
Now we return to our steps to port the Splunk forwarder.

Tuesday, July 29, 2014

Let us look at porting the Splunk forwarder to SplunkLite.Net (the app). This is the app we want to carry all three roles: forwarder, indexer, and search head. The next few posts may continue to be about the forwarder because we are scoping that part of the Splunk code first. When we prepare a timeline for implementation, we will plan sprints and include time for TDD and feature priority, but first let us take a look at the different sections of the input pipeline and what we need to do. First we looked at PipelineData. We now look at the context data structures. Our goal is to enumerate the classes and data structures needed to write an event from a modular input. We will look at metadata too.
Context data structures include configuration information. Configuration is read and merged at the startup of the program. For forwarding, we need configuration that is expressed in files such as inputs.conf, props.conf, and so on. These files hold key-value pairs, not the XML that .NET configuration files use. The format of the configuration does not matter much; what matters is validating it and populating it into runtime data structures, here called a PropertyMap. Configuration keys also carry metadata of their own, which is important because we actively use it.
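As a rough illustration of the kind of parsing and merging involved (not Splunk's actual configuration loader), a stanza-based key-value file can be folded into a nested map; the load_conf name and the file paths below are assumptions made for the sketch.

# A minimal sketch of merging stanza-based key/value .conf files into a
# nested map, a stand-in for the PropertyMap mentioned above.
def load_conf(path, property_map=None):
    property_map = property_map if property_map is not None else {}
    stanza = "default"
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue                      # skip blanks and comments
            if line.startswith("[") and line.endswith("]"):
                stanza = line[1:-1]           # e.g. [monitor:///var/log]
            elif "=" in line:
                key, value = (s.strip() for s in line.split("=", 1))
                property_map.setdefault(stanza, {})[key] = value
    return property_map

# later-loaded files (e.g. local) override earlier ones (e.g. default):
# merged = load_conf("default/inputs.conf")
# merged = load_conf("local/inputs.conf", merged)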
The PipelineData class manages the data that is passed around the pipeline. Note that we don't need raw storage for the data because we don't manage the memory ourselves. The same goes for a memory pool: the runtime manages memory for us, and this scales well. What we do need is the per-event data structure, including metadata. Metadata can be attached to each field, which lets us understand the event better.
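A hypothetical per-event record might look like the following sketch; the class shape and the per-field metadata keys are assumptions, not Splunk's implementation.

# A hypothetical sketch of a per-event PipelineData record with
# per-field metadata.
import time

class PipelineData(object):
    def __init__(self, raw, host=None, source=None, sourcetype=None):
        self.fields = {
            "_raw": raw,
            "_time": time.time(),
            "host": host,
            "source": source,
            "sourcetype": sourcetype,
        }
        # per-field metadata, e.g. where the value came from
        self.metadata = {name: {} for name in self.fields}

    def set_field(self, name, value, **meta):
        self.fields[name] = value
        self.metadata.setdefault(name, {}).update(meta)

event = PipelineData("GET /index.html 200", host="web01", source="/var/log/access.log")
event.set_field("status", "200", extracted_by="regex")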
We will leave the rest of the discussion on metadata for when we get to the indexer. For now, we have plenty more to cover for the forwarder.
Interestingly, one of the modular inputs is UDP, which should serve traffic such as games, real-time feeds, and similar applications. However, the UDP modular input may not work in Splunk, especially if the connection_host key is specified.
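For reference, the core of a UDP input is just a datagram receive loop that stamps each event with the sender's host; a rough sketch follows, where the port number and the downstream handling are assumptions.

# A rough sketch of the receive loop a UDP input would run.
import socket

def udp_input(port=10514, handle_event=print):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    while True:
        data, (host, _) = sock.recvfrom(65535)   # one datagram == one event
        handle_event({"_raw": data.decode("utf-8", "replace"), "host": host})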
We will focus exclusively on a file-based modular input for the prototype we want to build. This keeps it simple and gives us the flexibility to work with events small and large, few and many.
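A file-based input, by contrast, reduces to tailing a file and emitting one event per line; a minimal sketch follows, where the polling interval and the event shape are assumptions.

# A minimal sketch of a file-based input that tails a file and emits one
# event per line.
import os
import time

def tail_file(path, handle_event, poll_interval=1.0):
    with open(path) as f:
        f.seek(0, os.SEEK_END)             # start at the end, like a tail
        while True:
            line = f.readline()
            if line:
                handle_event({"_raw": line.rstrip("\n"), "source": path})
            else:
                time.sleep(poll_interval)  # wait for new data to be appended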

Monday, July 28, 2014

Starting today I'm going to write a series of posts on Splunk internals. As we know, there are three roles for a Splunk server: forwarding, indexing, and searching.
We will cover the forwarding role today to see which components need to be ported to a SplunkLite framework.
First we look at PipelineData and its associated data structures. Some of these are easy to port to C#, and doing so gives the input the same definitions it has in Splunk Server.
Pipeline components and actor threads can be modeled directly in .NET. Configurations can be maintained via configuration files, just as in Splunk Server. Threads dedicated to shutdown or to running the event loop are still required. Note that framework items like the event loop can be substituted by the equivalent .NET scheduler classes; .NET has rich support for threads and scheduling via the .NET 4.0 task library.
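The shape of the pipeline is the familiar producer-consumer pattern; here is a rough Python sketch of that shape (the event contents and the shutdown sentinel are assumptions), which the port would map onto .NET tasks and concurrent collections.

# A rough sketch of the pipeline's producer-consumer shape.
import queue
import threading

events = queue.Queue(maxsize=1000)
SHUTDOWN = object()                     # sentinel used by the shutdown path

def input_processor(lines):
    for line in lines:
        events.put({"_raw": line})      # producer: one event per line
    events.put(SHUTDOWN)

def forwarder():
    while True:
        event = events.get()            # consumer: blocks until work arrives
        if event is SHUTDOWN:
            break
        print("forwarding", event)      # stand-in for sending to the indexer

t = threading.Thread(target=forwarder)
t.start()
input_processor(["event one", "event two"])
t.join()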
While on the topic of framework items, we might as well cover the logger. Logging is available via packages such as log4net or the Enterprise Library logging block. These are easy to add to the application and support multiple destinations. When we enumerate the utilities required for this application, we will see that much of the effort of writing something portable inside Splunk Server goes away, because .NET already provides those facilities.
When writing the forwarder, we can start small with a few inputs and expand the choices later. Having one forwarder, one indexer, and one search head will be sufficient for a proof of concept. The code can provide end-to-end functionality, and we can then augment each of the processors, whether they are input, search, or index processors. Essentially the processors for a given role all conform to the same contract, so how we expand them is up to us; a hypothetical sketch of such a contract follows.
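The names and method below are assumptions made for illustration, using a plain dict as a stand-in for the per-event record.

# A hypothetical common contract for pipeline processors.
class Processor(object):
    def execute(self, pipeline_data):
        """Consume a per-event record and return it (possibly modified)."""
        raise NotImplementedError

class HostAnnotator(Processor):
    def __init__(self, host):
        self.host = host
    def execute(self, pipeline_data):
        pipeline_data.setdefault("host", self.host)
        return pipeline_data

def run_pipeline(processors, event):
    for processor in processors:
        event = processor.execute(event)
    return event

print(run_pipeline([HostAnnotator("web01")], {"_raw": "GET / 200"}))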
PersistentStorage may need to be written to work with the data, and this part is proprietary. Data and metadata may require data structures similar to what we have in Splunk; we would look into the hash manager and the record file manager. We should budget for the Splunk-specific pieces first because they are artifacts with a history and a purpose.
Items that we can and should deliberately avoid are those for which .NET already has rich and robust features, such as producer-consumer queues.
Porting the application may require a lot of time. An initial estimate for the bare bones and for testing is in order; everything else we can keep prioritized.

Sunday, July 27, 2014

Splunk indexes both text and binary data. In this post, we will see how we can use Splunk with archival storage devices. Companies like Data Domain have great commercial advantage in data backup and archival. Their ability to back up data on a continuous basis and use deduplication to reduce the size of the data makes the use of Splunk interesting. But first let us look at what it means to index binary data. We know that it's different from text, where indexing uses compact hash values and an efficient data structure such as a B+ tree for lookup and retrieval. Text data also lends itself to key-value pair extraction, which comes in handy with NoSQL databases. The trouble with binary data is that it cannot be meaningfully searched and analyzed. Unless there is textual metadata associated with it, binary data is not helpful. For example, an image file's bytes are not as helpful as its size, creation tool, username, camera, GPS location, and so on. Even textual representations such as XML are not that helpful, since they are difficult for humans to read and require parsing. As an example, serializing code objects in an application may be helpful, but logging their significant key-value pairs may be even better, since those will be in a textual format that lends itself to Splunk forwarding, indexing, and searching.
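To make this concrete, instead of forwarding an image's bytes, an application could log just the image's textual metadata as key=value pairs; a minimal sketch follows, where the function name and the fields shown are assumptions about what is available.

# A sketch of logging an image's textual metadata as key=value pairs
# rather than its binary contents.
import os
import time

def image_metadata_event(path, creation_tool=None, username=None):
    stat = os.stat(path)
    fields = {
        "file": path,
        "size_bytes": stat.st_size,
        "modified": time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(stat.st_mtime)),
        "creation_tool": creation_tool,
        "username": username,
    }
    # key=value pairs index and search well in Splunk
    return " ".join("%s=%s" % (k, v) for k, v in fields.items() if v is not None)

# print(image_metadata_event("photo.jpg", creation_tool="camera-app", username="alice"))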
This can be used with periodic as well as acyclic maintenance data archival. The applications that archive data sometimes move terabytes of data. Moreover, they interpret and organize this data so that nothing is lost during the move, yet the delta changes between, say, two backup runs are collected and saved efficiently in both size and computation. There is a lot of metadata gathered in the process by these applications, and the same can be logged to Splunk. Splunk in turn enables superior analytics on this data. One characteristic of data that is backed up or archived regularly is that it has a lot of recurrence and repetition; that is why companies like Data Domain are able to de-duplicate the events and reduce the footprint of the data to be archived. Those computations carry a lot of associated information that can be expressed as rich metadata suitable for analytics later. For example, the source of the data, the programs that use the data, metadata about the data that was de-duped, and the location and availability of the archived data are all relevant for analytics later. This way those applications need not fetch the data to answer analytical queries but can work directly off the Splunk indexes instead.
If we look at the custom search commands in a Splunk instance, we actually find a trove of utilities. Some of these scripts do things such as streaming search results to an XML file. There are commands to do each of the following:

 Produces a summary of each search result.
 Add fields that contain common information about the current search.
 Computes the sum of all numeric fields for each result.
 Computes an "unexpectedness" score for an event.
 Finds and summarizes irregular
 Appends subsearch results to current results.
 Appends the fields of the subsearch results to current results
 Find association rules between field values
 Identifies correlations between fields.
 Returns audit trail information that is stored in the local audit index.
 Sets up data for calculating the moving average.
 Analyzes numerical fields for their ability to predict another discrete field.
 Keeps a running total of a specified numeric field.
 Computes the difference in field value between nearby results.
 Puts continuous numerical values into discrete sets.
 Returns results in a tabular output for charting.
 Find how many times field1 and field2 values occurred together
 Builds a contingency table for two fields.
 Converts field values into numerical values.
 Crawls the filesystem for files of interest to Splunk
 Adds the RSS item into the specified RSS feed.
 Allows user to examine data models and run the search for a datamodel object.
 Removes the subsequent results that match specified criteria.
 Returns the difference between two search results.
 Automatically extracts field values similar to the example values.
 Calculates an expression and puts the resulting value into a field.
 Extracts values from search results
 Extracts field-value pairs from search results. 
 Keeps or removes fields from search results.
 Generates summary information for all or a subset of the fields.
 Replace null values with last non-null value
 Replaces null values with a specified value.
 Replaces "attr" with higher-level grouping
 Replaces PATHFIELD with higher-level grouping
 Run a templatized streaming subsearch for each field in a wildcarded field list
 Takes the results of a subsearch and formats them into a single result. 
 Transforms results into a format suitable for display by the Gauge chart types.  
 Generates time range results.
 Generate statistics which are clustered into geographical bins to be rendered on a world map.
 Returns the first n number of specified results.
 Returns the last n number of specified results.
 Returns information about the Splunk index.
 Adds or disables sources from being processed by Splunk.
 Loads search results from the specified CSV file.
 Loads search results from a specified static lookup table.
 SQL-like joining of results from the main results pipeline with the results from the subpipeline.
 Joins results with itself.
 Performs k-means clustering on selected fields.
 Returns a list of time ranges in which the search results were found.
 Prevents subsequent commands from being executed on remote peers.
 Loads events or results of a previously completed search job. 
 Explicitly invokes field value lookups.
 Looping operator
 Extracts field-values from table-formatted events.
 Do multiple searches at the same time
 Combines events in the search results that have a single differing field value into one result with a multi-value field of the differing field.
 Expands the values of a multi-value field into separate events for each value of the multi-value field. 
 Changes a specified field into a multi-value field during a search.
 Changes a specified multi-value field into a single-value field at search time. 
  Removes outlying numerical values.
 Executes a given search query and export events to a set of chunk files on local disk. 
 Outputs search results to the specified CSV file.
 Save search results to specified static lookup table.
 Outputs search results in a simple
 Outputs the raw text (_raw) of results into the _xml field.
 Finds events in a summary index that overlap in time or have missed events.
 Allows user to run pivot searches against a particular datamodel object.
 Predict future values of fields.
 See what events from a file will look like when indexed without actually indexing the file.
 Displays the least common values of a field.
 Removes results that do not match the specified regular expression.
 Calculates how well the event matches the query.
 Renames a specified field (wildcards can be used to specify multiple fields).
 Replaces values of specified fields with a specified new value.
 Specifies a Perl regular expression named groups to extract fields while you search.
 Buffers events from real-time search to emit them in ascending time order when possible
 The select command is deprecated. If you want to compute aggregate statistics
 Makes calls to external Perl or Python programs.
 Returns a random sampling of N search results.
 Returns the search results of a saved search. 
  Emails search results to specified email addresses.
 Sets the field values for all results to a common value.
 Extracts values from structured data (XML or JSON) and stores them in a field or fields.
 Turns rows into columns.
 Filters out repeated adjacent results
 Retrieves event metadata from indexes based on terms in the <logical-expression>
 Filters results using keywords
 Performs set operations on subsearches.
 Clusters similar events together.
 Produces a symbolic 'shape' attribute describing the shape of a numeric multivalued field
 Sorts search results by the specified fields.
 Puts search results into a summary index.
 Adds summary statistics to all search results in a streaming manner.
 Adds summary statistics to all search results.
 Provides statistics
 Concatenates string values.
 Summary indexing friendly versions of stats command.
 Summary indexing friendly versions of top command.
 Summary indexing friendly versions of rare command.
 Summary indexing friendly versions of chart command.
 Summary indexing friendly versions of timechart command.
 Annotates specified fields in your search results with tags. 
 Computes the moving averages of fields.
 Creates a time series chart with corresponding table of statistics.
 Displays the most common values of a field.
 Writes the result table into *.tsidx files using indexed fields format.
 Performs statistics on indexed fields in tsidx files
 Groups events into transactions.
 Returns typeahead on a specified prefix.
 Generates suggested eventtypes.  Deprecated: preferred command is 'findtypes'
 Calculates the eventtypes for the search results
 Runs an eval expression to filter the results. The result of the expression must be Boolean.
 Causes UI to highlight specified terms.
 Converts results into a format suitable for graphing. 
 Extracts XML key-value pairs.
 Un-escapes XML characters.
 Extracts the xpath value from FIELD and sets the OUTFIELD attribute.
 Extracts location information from IP addresses using 3rd-party databases.
 Processes the given file as if it were indexed.
 Sets RANGE field to the name of the ranges that match.
 Returns statistics about the raw field.
 Sets the 'reltime' field to a human readable value of the difference between 'now' and '_time'.
 Anonymizes the search results.
 Returns a list of source
 Performs a debug command.
 Performs a deletion from the index.
 Returns the number of events in an index.
 Generates suggested event types.
 convenient way to return values up from a subsearch
 Internal command used to execute scripted alerts 
 finds transaction events given search constraints
 Runs the search script
 Remove seasonal fluctuations in fields.

Saturday, July 26, 2014

As I mentioned in the previous post, we are going to write a custom command that transforms search results into XML. Something like:
        // collect the set of field names once; it does not change per result
        std::set<Str> allFields;
        results.getAllKeys(allFields);

        SearchResults::iterator it;
        for (it = results.begin(); it != results.end(); ++it) {
            SearchResult r = **it;
            _output.append("<SearchResult>");
            for (std::set<Str>::const_iterator sit = allFields.begin();
                 sit != allFields.end(); ++sit) {
                sr_index_t index = results.getIndex(*sit);
                // trim so that the xml tags are constructed without whitespace
                if (r.exists(index)) {
                    _output.append("<" + (*sit).trim() + ">");
                    _output.append(r.getValue(index));
                    _output.append("</" + (*sit).trim() + ">");
                }
            }
            _output.append("</SearchResult>");
        }

But Splunk already has xpath.py, which makes the event value valid XML, i.e. it produces <data>%s</data> where the inner XML is the value corresponding to _raw in the event. This is different from the above.
There are data-structure-to-XML Python recipes on the web, such as Recipe #577268.
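A minimal sketch of such a conversion using the standard library (this is not the recipe referenced above, just an illustration of the idea) might be:

# A minimal dict-to-XML sketch using xml.etree instead of string concatenation.
import xml.etree.ElementTree as ET

def result_to_xml(result, root_tag="SearchResult"):
    root = ET.Element(root_tag)
    for key, value in result.items():
        child = ET.SubElement(root, key.strip())   # tags must not contain spaces
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")

# print(result_to_xml({"host": "web01", "_raw": "GET / 200"}))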

There's also another way, described in the Splunk SDK: to use the reader, instantiate ResultsReader on a search result stream as follows:

    reader = ResultsReader(result_stream)
    for item in reader:
        print(item)

We try to do it this way:
rrajamani-mbp15r:splunkb rrajamani$ cat etc/system/local/commands.conf
[smplcmd]
filename = smplcmd.py
streaming = true
local = true
retainsevents = true
overrides_timeorder = false
supports_rawargs = true

# untested
#!/usr/bin/python
import splunk.Intersplunk as si 
import time
if __name__ == '__main__':
    try:
        keywords,options = si.getKeywordsAndOptions()
        results,dummyresults,settings = si.getOrganizedResults()
        myxml = "<searchResults>"
        fields = ["host", "source", "sourcetype", "_raw", "_time"]
        outfield = options.get('outfield', 'xml')
        for result in results:
            element = "<searchResult>"
            for i in fields:
                field = options.get('field', str(i))
                val = result.get(field, None)
                if val != None:
                    element += "<" + str(field).strip() + ">" + str(val) + "</" + str(field).strip() + ">"
            element += "</searchResult>"
            myxml += element
        myxml += "</searchResults>"
        # attach the combined xml to the last result so it shows up as a field
        result[outfield] = myxml
        si.outputResults(results)
    except Exception, e:
        import traceback
        stack =  traceback.format_exc()
        si.generateErrorResults("Error '%s'. %s" % (e, stack))

Friday, July 25, 2014

Today I'm going to talk about writing custom search commands in Python. You can use them with search operators in Splunk this way:
index=_internal | head 1 | smplcmd

rrajamani-mbp15r:splunkb rrajamani$ cat etc/system/local/commands.conf
[smplcmd]
filename = smplcmd.py
streaming = true
local = true
retainsevents = true
overrides_timeorder = false
supports_rawargs = true


rrajamani-mbp15r:splunkb rrajamani$ cat ./etc/apps/search/bin/smplcmd.py
#!/usr/bin/python
import splunk.Intersplunk as si
import time
if __name__ == '__main__':
    try:
        keywords,options = si.getKeywordsAndOptions()
        defaultval = options.get('default', None)
        results,dummyresults,settings = si.getOrganizedResults()
        # pass through
        si.outputResults(results)
    except Exception, e:
        import traceback
        stack =  traceback.format_exc()
        si.generateErrorResults("Error '%s'. %s" % (e, stack))

We will write a custom command that transforms search results to XML.

This summer I'm going to devote a series of detailed posts to implementing Splunk entirely in .NET. Being a git-based developer, we will write some lightweight packages with NuGet and enforce test-driven development and continuous integration on a git repository to go with it. Effectively we will build SplunkLite in .NET.