Thursday, July 31, 2014

In today's post, we continue our discussion on porting the Splunk forwarder to SplunkLite.Net, a lightweight application that forwards, indexes and searches Splunk data. In the previous posts, we discussed a few of the functionalities we require, such as the ability to create a data pipeline for input and processors that can convert the input into events and save them for analytics later. There are still a few more data structures to look into, but as we have seen, the majority of the framework and utilities we use are conveniently available in .Net libraries, which reduces the code significantly. Framework helpers such as HTTP request and response handling, HttpStaticDispatcher, Processor, QueryRunningSingleton, SearchResults, SearchEvaluator, ServerConfig, TagManager, etc. are still needed. The ability to secure the REST calls with AdminManager is also needed. The KeyManagers for localhost, search peers and general settings can come in later too. The utilities for FileSystemWatcher, datetime handling, HTTPServer and ProducerConsumerQueue already have support in .Net.

Proprietary database helpers such as PersistentHashData, PersistentHashManager, PersistentMapManager and PersistentStorage are still required, so let us look at the persistent data structures more closely. PersistentMapManager provides a way to look up entries based on keys and tags. It has methods to get all the keys or the matching keys, to check whether a key exists, and to remove keys. The same holds for tags. The ability to look up the store by keys and tags has been a key feature of Splunk analytics. PersistentHashManager maintains a hash table and gets all the data that matches a key. The values are maintained as PersistentHashData, and the data on disk is accessed via RecordFileManager, which loads the DB file into memory and has methods to read and write records to disk.
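As a rough sketch of the shape these helpers might take in SplunkLite.Net, here is a minimal C# outline of a key/tag lookup store. The class and member names follow the description above and are assumptions, not Splunk's actual API; tag bookkeeping on key removal is omitted for brevity.

using System;
using System.Collections.Generic;
using System.Linq;

// hypothetical SplunkLite.Net sketch of a key/tag lookup store
public class PersistentMapManager
{
    private readonly Dictionary<string, string> _entries = new Dictionary<string, string>();
    private readonly Dictionary<string, HashSet<string>> _tags = new Dictionary<string, HashSet<string>>();

    public IEnumerable<string> GetAllKeys() { return _entries.Keys; }

    public IEnumerable<string> GetMatchingKeys(string prefix)
    {
        return _entries.Keys.Where(k => k.StartsWith(prefix, StringComparison.Ordinal));
    }

    public bool KeyExists(string key) { return _entries.ContainsKey(key); }

    public void RemoveKey(string key) { _entries.Remove(key); }

    // tag lookups mirror the key lookups
    public IEnumerable<string> GetKeysByTag(string tag)
    {
        HashSet<string> keys;
        return _tags.TryGetValue(tag, out keys) ? keys : Enumerable.Empty<string>();
    }
}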
Results from a search on the database are available via a data structure called SearchResults, which is a collection of SearchResult objects and maintains a key map. Each SearchResult returns a list of fields, which can be multivalued.
Note that SearchResult is internal to Splunk. The export of results in different data formats via online and offline methods is also available, which lets Splunk integrate well in most ecosystems. I introduced a way for Splunk to provide searchable data to LogParser, which has a SQL interface. The ability to use SQL over Splunk makes it friendly to users who work primarily with databases.
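A minimal C# sketch of what SearchResult and SearchResults might look like in the port, assuming a field simply maps to a list of values; the member names are assumptions based on the description above.

using System.Collections.Generic;

public class SearchResult
{
    // a field can be multivalued, so each field name maps to a list of values
    private readonly Dictionary<string, List<string>> _fields = new Dictionary<string, List<string>>();

    public bool Exists(string field) { return _fields.ContainsKey(field); }

    public void AddValue(string field, string value)
    {
        List<string> values;
        if (!_fields.TryGetValue(field, out values)) { values = new List<string>(); _fields[field] = values; }
        values.Add(value);
    }

    public IEnumerable<string> GetValues(string field)
    {
        List<string> values;
        return _fields.TryGetValue(field, out values) ? values : new List<string>();
    }
}

// the collection keeps a key map of every field name seen across the result set
public class SearchResults : List<SearchResult>
{
    public readonly HashSet<string> AllKeys = new HashSet<string>();
}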

Wednesday, July 30, 2014

Today we are going to continue the discussion on the Splunk forwarder port to SplunkLite.Net. Here we cover the event and metadata. If we look at an input processor in the Splunk input pipeline, we see a common pattern such as:
class StdinInputProcessor : public InputProcessor {
public:
    StdinInputProcessor();
    ~StdinInputProcessor();

    virtual void init(XmlNode pluginInstanceConfig);

    virtual EProcessorReturn execute(CowPipelineData pData);
};

Essentially, the only two operations it is responsible for are initialization and execution.
The execute method merely copies the data from the external source to the event writer and sets metadata such as the source key.
Occasionally a producer-consumer queue and a shutdown thread are used when more than one channel needs to be served and decoupling the producer from the consumer helps; a sketch of such decoupling with .Net primitives follows.
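A minimal sketch of that decoupling with .Net's built-in BlockingCollection; the string events, queue size and console stand-ins below are placeholders, not the actual forwarder types.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class StdinForwarder
{
    private readonly BlockingCollection<string> _queue = new BlockingCollection<string>(boundedCapacity: 1024);

    public void Run()
    {
        // consumer: drains the queue and hands lines to the event writer
        var consumer = Task.Factory.StartNew(() =>
        {
            foreach (var line in _queue.GetConsumingEnumerable())
                Console.WriteLine("forwarding: " + line);   // stand-in for the event writer
        });

        // producer: reads stdin and enqueues
        string input;
        while ((input = Console.ReadLine()) != null)
            _queue.Add(input);

        _queue.CompleteAdding();   // signals shutdown to the consumer
        consumer.Wait();
    }
}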

Event conversion happens after the data has been copied over. A few fields are populated to describe the event, including source and destination metadata such as host, timestamp, sourcetype, index, etc.

CowPipelineData is a proxy object that wraps PipelineData, the class that manages the data passed around in the pipeline and maintains a map of predefined keys. PipelineData also processes headers and does serialization. CowPipelineData provides thread-safe copy-on-write semantics: each thread has an exclusive CowPipelineData, although several may share the same underlying PipelineData in a thread-safe manner. A sketch follows.
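A rough sketch of those copy-on-write semantics, assuming a simplified PipelineData that is just a map of predefined keys; the member names are assumptions, not Splunk's actual classes, and locking is omitted.

using System.Collections.Generic;

public class PipelineData
{
    public readonly Dictionary<string, string> Keys = new Dictionary<string, string>();

    public PipelineData Clone()
    {
        var copy = new PipelineData();
        foreach (var pair in Keys) copy.Keys[pair.Key] = pair.Value;
        return copy;
    }
}

public class CowPipelineData
{
    private PipelineData _shared;
    private bool _owned;

    public CowPipelineData(PipelineData data, bool owned = false) { _shared = data; _owned = owned; }

    // cheap copy: both proxies point at the same underlying PipelineData
    public CowPipelineData ShallowCopy() { return new CowPipelineData(_shared); }

    public string GetKey(string key)
    {
        string value;
        return _shared.Keys.TryGetValue(key, out value) ? value : null;
    }

    // the first write by a non-owning proxy clones the underlying data
    public void SetKey(string key, string value)
    {
        if (!_owned) { _shared = _shared.Clone(); _owned = true; }
        _shared.Keys[key] = value;
    }
}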
As we describe the inputs and the forwarder of SplunkLite.Net, it's interesting to note that the events diffused in the input file follow a Markov chain. Consider a Markov chain and decompose it into communicating classes. For each positive recurrent communicating class C, we have a unique stationary distribution pi-C which assigns positive probability to each state in C. An arbitrary stationary distribution pi must be a linear combination of these pi-C.
To define this more formally (Konstantopoulos): to each positive recurrent communicating class C there corresponds a stationary distribution pi-C. Any stationary distribution pi is necessarily of the form pi = Sum over C of (alpha-C * pi-C), where alpha-C >= 0 and the alpha-C sum to 1.
We say that the distribution is normalized to unity because the alpha-C act as weights and the sum of these weights is one.
Here's how we prove that an arbitrary stationary distribution is a weighted combination of the unique stationary distributions.
Let pi be an arbitrary stationary distribution. If pi(x) > 0 then x belongs to some positive recurrent class C. The conditional distribution of pi over C is given by pi(x | C) = pi(x) / pi(C), where the denominator is the sum of pi(x) over all x in C. This conditional distribution is exactly pi-C(x), and the weight alpha-C is pi(C), the total mass that pi assigns to C. So we can write pi = Sum over C of (alpha-C * pi-C), i.e. an arbitrary stationary distribution is a linear combination of these unique stationary distributions.
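Written out in LaTeX notation, with pi(C) denoting the total mass that pi assigns to class C, the decomposition and its weights are:

\pi = \sum_{C} \alpha_C \, \pi_C, \qquad \alpha_C = \pi(C) = \sum_{x \in C} \pi(x), \qquad \pi_C(x) = \frac{\pi(x)}{\pi(C)} \;\; (x \in C), \qquad \sum_{C} \alpha_C = 1.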
Now we return to our steps to port the Splunk forwarder.

Tuesday, July 29, 2014

Let us look at the porting of the Splunk Forwarder to SplunkLite.Net (the app). This is the app we want to have all three roles: forwarder, indexer and search head. Our next few posts will likely stay on the forwarder because they all scope the Splunk code first. When we prepare a timeline for implementation, we will consider sprints and include time for TDD and feature priority, but first let us take a look at the different sections of the input pipeline and what we need to do. First we looked at the PipelineData. We now look at the context data structures. Our goal is to enumerate the classes and data structures needed to enable writing an event from a modular input. We will look at Metadata too.
Context data structures include configuration information. Configuration is read and merged at the startup of the program. For forwarding, we need some configuration that is expressed in files such as inputs.conf, props.conf, etc. These are key-value pairs, not the XML of .Net configuration files. The format of the configuration does not matter; what matters is its validation and population into runtime data structures called a PropertyMap. Configuration keys also have their own metadata, and this is important because we actively use it. A sketch of loading such a file into a PropertyMap follows.
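A minimal sketch of loading a Splunk-style .conf file (stanzas of key = value pairs) into a PropertyMap. The PropertyMap here is just a dictionary of stanzas to key-value dictionaries, which is an assumption about the port, not Splunk's implementation.

using System.Collections.Generic;
using System.IO;

public class PropertyMap : Dictionary<string, Dictionary<string, string>>
{
    public static PropertyMap Load(string path)
    {
        var map = new PropertyMap();
        var stanza = "default";
        map[stanza] = new Dictionary<string, string>();
        foreach (var raw in File.ReadAllLines(path))
        {
            var line = raw.Trim();
            if (line.Length == 0 || line.StartsWith("#")) continue;          // skip blanks and comments
            if (line.StartsWith("[") && line.EndsWith("]"))
            {
                stanza = line.Substring(1, line.Length - 2);                 // e.g. [monitor:///var/log]
                if (!map.ContainsKey(stanza)) map[stanza] = new Dictionary<string, string>();
                continue;
            }
            var eq = line.IndexOf('=');
            if (eq > 0)
                map[stanza][line.Substring(0, eq).Trim()] = line.Substring(eq + 1).Trim();
        }
        return map;
    }
}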
The PipelineData class manages data that is passed around in the pipeline. Note that we don't need raw storage for the data since we don't need to manage the memory. The same goes for the memory pool, because we don't actively manage memory; the runtime does it for us and this scales well. What we need is the per-event data structure including metadata. Metadata can exist for each field, which lets us understand the event better.
We will leave the rest of the discussion on Metadata to the indexer discussion. For now, we have plenty more to cover for the forwarder.
Interestingly, one of the modular inputs is UDP, and this should facilitate traffic such as games, realtime feeds and other such applications. However, the UDP modular input may not be working in Splunk, especially if the connection_host key is specified.
We will focus exclusively on file based modular input for the prototype we want to build. This will keep it simple and give us the flexibility to work with events small and large, few and several.

Monday, July 28, 2014

From today onwards I'm going to start a series of posts on Splunk internals. As we know, there are three roles for Splunk Server: forwarding, indexing and searching.
We will cover the forwarding section today to see which components need to be ported to a SplunkLite framework.
First we look at PipelineData and associated data structures. It is easy to port some of these to C#, and it gives the same definitions for the input that we have in Splunk Server.
Pipeline components and actor threads can be used directly in .Net. Configurations can be maintained via configuration files just the same as in Splunk Server. Threads that are dedicated to shutdown or to running the event loop are still required. Note that framework items like the event loop can be substituted by the equivalent .Net scheduler classes; .Net has rich support for threads and scheduling via the .Net 4.0 task library, as the sketch below shows.
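For example, a dedicated thread running a polling event loop could be replaced by a long-running task from the task library; the one-second interval and the pollOnce callback below are placeholders.

using System;
using System.Threading;
using System.Threading.Tasks;

static class InputScheduler
{
    public static Task StartPolling(Action pollOnce, CancellationToken token)
    {
        return Task.Factory.StartNew(() =>
        {
            while (!token.IsCancellationRequested)
            {
                pollOnce();                              // e.g. check a monitored file for new data
                Thread.Sleep(TimeSpan.FromSeconds(1));
            }
        }, token, TaskCreationOptions.LongRunning, TaskScheduler.Default);
    }
}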
While on the topic of framework items, we might as well cover the logger. Logging is available via packages like log4net or the Enterprise Library logging application block. These are convenient to add to the application and come with multiple-destination features. When we iterate through the utilities required for this application, we will see that a lot of the effort of writing something portable with Splunk Server goes away because .Net already comes with those.
When writing the forwarder, we can start small with a small set of inputs and expand the choices later. Having one forwarder, one indexer and one search head will be sufficient for a proof of concept. The code can provide end-to-end functionality and we can then augment each of the processors, whether they are input, search or index processors. Essentially the processors all conform the same way for the same role, so how we expand is up to us.
PersistentStorage may need to be specified to work with the data and this is proprietary. Data and metadata may require data structures similar to what we have in Splunk. We will look into the hash manager and record file manager. We should budget for things that are specific to Splunk first because they are artifacts that have a history and a purpose.
Items that we can and should deliberately avoid are those for which we have rich and robust .Net features, such as producer-consumer queues.
The porting of the application may require a lot of time. An initial estimate for the bare bones and testing is in order. Anything else we can keep prioritized.

Sunday, July 27, 2014

Splunk indexes both text and binary data. In this post, we will see how we can use Splunk for archival storage devices. Companies like Data Domain have a great commercial advantage in data backup and archival. Their ability to back up data on a continuous basis and use deduplication to reduce the size of the data makes the use of Splunk interesting. But first let us look at what it means to index binary data. We know that it's different from text, where indexing uses compact hash values and an efficient data structure such as a B+ tree for lookup and retrieval. Text data also lends itself to key-value pair extraction, which comes in handy with NoSQL databases. The trouble with binary data is that it cannot be meaningfully searched and analyzed. Unless there is textual metadata associated with it, binary data is not helpful. For example, an image file's data is not as helpful as its size, creation tool, username, camera, GPS location, etc. Even textual representations such as XML are not that helpful, since they are difficult for humans to read and require parsing. As an example, serializing code objects in an application may be helpful, but logging their significant key-value pairs may be even better, since those will be in a textual format that lends itself to Splunk forwarding, indexing and searching.
This can be used with periodic and acyclic maintenance data archival as well. The applications that archive data are sometimes moving terabytes of data. Moreover, they are interpreting and organizing this data in a way where nothing is lost during the move, yet the delta changes between, say, two backup runs are collected and saved with efficiency in size and computation. There is a lot of metadata gathered in the process by these applications, and the same can be logged to Splunk. Splunk in turn enables superior analytics on this data. One characteristic of data that is backed up or archived regularly is that it has a lot of recurrence and repetition. That is why companies like Data Domain are able to de-duplicate the events and reduce the footprint of the data to be archived. Those computations can have a lot of information associated with them that can be expressed in rich metadata suitable for analytics later. For example, the source of the data, the programs that use the data, metadata on the data that was de-duped, and the location and availability of the archived data are all relevant for analytics later. This way those applications need not fetch the data to answer analytical queries but can work directly off the Splunk indexes instead.
If we look at the custom search commands in a Splunk instance, we actually find a trove of utilities. Some of these scripts do such things as streaming search results to an XML file. There are commands to do each of the following:

 Produces a summary of each search result.
 Add fields that contain common information about the current search.
 Computes the sum of all numeric fields for each result.
 Computes an "unexpectedness" score for an event.
 Finds and summarizes irregular
 Appends subsearch results to current results.
 Appends the fields of the subsearch results to current results
 Find association rules between field values
 Identifies correlations between fields.
 Returns audit trail information that is stored in the local audit index.
 Sets up data for calculating the moving average.
 Analyzes numerical fields for their ability to predict another discrete field.
 Keeps a running total of a specified numeric field.
 Computes the difference in field value between nearby results.
 Puts continuous numerical values into discrete sets.
 Returns results in a tabular output for charting.
 Find how many times field1 and field2 values occurred together

 Builds a contingency table for two fields.
 Converts field values into numerical values.
 Crawls the filesystem for files of interest to Splunk
 Adds the RSS item into the specified RSS feed.
 Allows user to examine data models and run the search for a datamodel object.
 Removes the subsequent results that match specified criteria.
 Returns the difference between two search results.
 Automatically extracts field values similar to the example values.
 Calculates an expression and puts the resulting value into a field.
 Extracts values from search results
 Extracts field-value pairs from search results. 
 Keeps or removes fields from search results.
 Generates summary information for all or a subset of the fields.
 Replace null values with last non-null value
 Replaces null values with a specified value.
 Replaces "attr" with higher-level grouping
 Replaces PATHFIELD with higher-level grouping
 Run a templatized streaming subsearch for each field in a wildcarded field list
 Takes the results of a subsearch and formats them into a single result. 
 Transforms results into a format suitable for display by the Gauge chart types.  
 Generates time range results.
 Generate statistics which are clustered into geographical bins to be rendered on a world map.
 Returns the first n number of specified results.
 Returns the last n number of specified results.
 Returns information about the Splunk index.
 Adds or disables sources from being processed by Splunk.
 Loads search results from the specified CSV file.
 Loads search results from a specified static lookup table.
 SQL-like joining of results from the main results pipeline with the results from the subpipeline.
 Joins results with itself.
 Performs k-means clustering on selected fields.
 Returns a list of time ranges in which the search results were found.
 Prevents subsequent commands from being executed on remote peers.
 Loads events or results of a previously completed search job. 
 Explicitly invokes field value lookups.
 Looping operator
 Extracts field-values from table-formatted events.
 Do multiple searches at the same time
 Combines events in the search results that have a single differing field value into one result with a multi-value field of the differing field.
 Expands the values of a multi-value field into separate events for each value of the multi-value field. 
 Changes a specified field into a multi-value field during a search.
 Changes a specified multi-value field into a single-value field at search time. 
  Removes outlying numerical values.
 Executes a given search query and export events to a set of chunk files on local disk. 
 Outputs search results to the specified CSV file.
 Save search results to specified static lookup table.
 Outputs search results in a simple
 Outputs the raw text (_raw) of results into the _xml field.
 Finds events in a summary index that overlap in time or have missed events.
 Allows user to run pivot searches against a particular datamodel object.
 Predict future values of fields.
 See what events from a file will look like when indexed without actually indexing the file.
 Displays the least common values of a field.
 Removes results that do not match the specified regular expression.
 Calculates how well the event matches the query.
 Renames a specified field (wildcards can be used to specify multiple fields).
 Replaces values of specified fields with a specified new value.
 Specifies a Perl regular expression named groups to extract fields while you search.
 Buffers events from real-time search to emit them in ascending time order when possible
 The select command is deprecated. If you want to compute aggregate statistics
 Makes calls to external Perl or Python programs.
 Returns a random sampling of N search results.
 Returns the search results of a saved search. 
  Emails search results to specified email addresses.
 Sets the field values for all results to a common value.
 Extracts values from structured data (XML or JSON) and stores them in a field or fields.
 Turns rows into columns.
 Filters out repeated adjacent results
 Retrieves event metadata from indexes based on terms in the <logical-expression>
 Filters results using keywords
 Performs set operations on subsearches.
 Clusters similar events together.
 Produces a symbolic 'shape' attribute describing the shape of a numeric multivalued field
 Sorts search results by the specified fields.
 Puts search results into a summary index.
 Adds summary statistics to all search results in a streaming manner.
 Adds summary statistics to all search results.
 Provides statistics
 Concatenates string values.
 Summary indexing friendly versions of stats command.
 Summary indexing friendly versions of top command.
 Summary indexing friendly versions of rare command.
 Summary indexing friendly versions of chart command.
 Summary indexing friendly versions of timechart command.
 Annotates specified fields in your search results with tags. 
 Computes the moving averages of fields.
 Creates a time series chart with corresponding table of statistics.
 Displays the most common values of a field.
 Writes the result table into *.tsidx files using indexed fields format.
 Performs statistics on indexed fields in tsidx files
 Groups events into transactions.
 Returns typeahead on a specified prefix.
 Generates suggested eventtypes.  Deprecated: preferred command is 'findtypes'
 Calculates the eventtypes for the search results
 Runs an eval expression to filter the results. The result of the expression must be Boolean.
 Causes UI to highlight specified terms.
 Converts results into a format suitable for graphing. 
 Extracts XML key-value pairs.
 Un-escapes XML characters.
 Extracts the xpath value from FIELD and sets the OUTFIELD attribute.
 Extracts location information from IP addresses using 3rd-party databases.
 Processes the given file as if it were indexed.
 Sets RANGE field to the name of the ranges that match.
 Returns statistics about the raw field.
 Sets the 'reltime' field to a human readable value of the difference between 'now' and '_time'.
 Anonymizes the search results.
 Returns a list of source
 Performs a debug command.
 Performs a deletion from the index.
 Returns the number of events in an index.
 Generates suggested event types.
 convenient way to return values up from a subsearch
 Internal command used to execute scripted alerts 
 finds transaction events given search constraints
 Runs the search script
 Remove seasonal fluctuations in fields.

Saturday, July 26, 2014

As I mentioned in the previous post, we are going to write a custom command that transforms search results into xml. Something like:
        SearchResults::iterator it;
        for (it = results.begin(); it != results.end(); ++it) {
            SearchResult r = **it;
            _output.append("<SearchResult>");
            std::set<Str> allFields;
            results.getAllKeys(allFields);
            for (std::set<Str>::const_iterator sit = allFields.begin(); sit !=
                allFields.end(); ++sit) {
                sr_index_t index = results.getIndex(*sit);
                // check all xml tags are constructed without whitespaces
                if (r.exists(index)){
                    _output.append("<" + (*sit).trim() + ">");
                    _output.append(r.getValue(index));
                    _output.append("</" + (*sit).trim() + ">");                   
                }      
            }      
            _output.append("</SearchResult>");
        }

but Splunk already has xpath.py, which makes the event value valid xml, i.e. it emits <data>%s</data> where the inner xml is the value corresponding to _raw in the event. This is different from the above.
There are data-structure-to-xml python recipes on the web, such as Recipe #577268.

There's also another way, described in the Splunk SDK as follows:
To use the reader, instantiate ResultsReader on a search result stream
as follows:

    reader = ResultsReader(result_stream)
    for item in reader:
        print(item)

We try to do it this way:
rrajamani-mbp15r:splunkb rrajamani$ cat etc/system/local/commands.conf
[smplcmd]
filename = smplcmd.py
streaming = true
local = true
retainsevents = true
overrides_timeorder = false
supports_rawargs = true

# untested
#!/usr/bin/python
import splunk.Intersplunk as si 
import time
if __name__ == '__main__':
    try:
        keywords,options = si.getKeywordsAndOptions()
        results,dummyresults,settings = si.getOrganizedResults()
        myxml = "<searchResults>"
        fields = ["host", "source", "sourcetype", "_raw", "_time"]
        outfield = options.get('outfield', 'xml')
        for result in results:
            element = "<searchResult>"
            for i in fields:
                field = options.get('field', str(i))
                val = result.get(field, None)
                if val != None:
                    element += "<" + str(field).strip() + ">" + str(val) + "</" + str(field).strip() + ">"
            element += "</searchResult>"
            myxml += element
        myxml += "</searchResults>"
        # attach the aggregated xml to each result so it shows up as a field
        for result in results:
            result[outfield] = myxml
        si.outputResults(results)
    except Exception, e:
        import traceback
        stack =  traceback.format_exc()
        si.generateErrorResults("Error '%s'. %s" % (e, stack))

Friday, July 25, 2014

Today I'm going to talk about writing custom search commands in python. You can use them with search operators in Splunk this way:
index=_internal | head 1 | smplcmd

rrajamani-mbp15r:splunkb rrajamani$ cat etc/system/local/commands.conf
[smplcmd]
filename = smplcmd.py
streaming = true
local = true
retainsevents = true
overrides_timeorder = false
supports_rawargs = true


rrajamani-mbp15r:splunkb rrajamani$ cat ./etc/apps/search/bin/smplcmd.py
#!/usr/bin/python
import splunk.Intersplunk as si
import time
if __name__ == '__main__':
    try:
        keywords,options = si.getKeywordsAndOptions()
        defaultval = options.get('default', None)
        results,dummyresults,settings = si.getOrganizedResults()
        # pass through
        si.outputResults(results)
    except Exception, e:
        import traceback
        stack =  traceback.format_exc()
        si.generateErrorResults("Error '%s'. %s" % (e, stack))

we will write a custom command that transforms search results to xml

This summer I'm going to devote a series of detailed posts to implementing Splunk entirely in .Net. Being a git-based developer, we will write some lightweight packages with NuGet and enforce test-driven development and continuous integration on a git repository to go with it. Effectively we will build SplunkLite in .Net.

Wednesday, July 23, 2014

In tonight's post we continue the discussion on file security checks for path names. Some of these checks are internalized by the APIs of the operating system. The trouble with path names is that they come from untrusted users and, as with all strings, bring the risk of buffer overruns. In addition, a path might point to a device or pseudo-device location that may pass for a path but can amount to a security breach. Even if the application is running with low privilege or does not require administrator privileges, not validating path names adequately on Windows will cause vulnerabilities that can be exploited. These include gaining access to the application or redirection to invoke malicious software, compromising the application beyond what it was intended to do. Checks to safeguard against this include validating local and UNC paths as well as securing access with ACLs. Device driver, printer and registry paths should be avoided. It is preferable to treat the path as opaque and interpret it with OS APIs rather than parsing it. Some simple checks are not ruled out though, and the level of security should be matched with the rest of the application; it is not right to block the window if the door is open. Also, the choice of API matters: a single API call can perform most of the checks we want.
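A hedged sketch of a few such checks in C#; the reserved device names and the 260-character limit below are illustrative, not an exhaustive or authoritative validation.

using System;
using System.IO;
using System.Linq;

static class PathChecks
{
    static readonly string[] ReservedNames = { "CON", "PRN", "AUX", "NUL", "COM1", "LPT1" };

    public static bool LooksSafe(string path)
    {
        if (string.IsNullOrEmpty(path) || path.Length > 260) return false;
        if (path.StartsWith(@"\\?\") || path.StartsWith(@"\\.\")) return false;   // raw device / long-path prefixes
        if (path.IndexOfAny(Path.GetInvalidPathChars()) >= 0) return false;

        string file = Path.GetFileNameWithoutExtension(path);
        if (ReservedNames.Contains(file.ToUpperInvariant())) return false;        // pseudo-devices like NUL

        // let the OS canonicalize rather than parsing the path ourselves
        try { Path.GetFullPath(path); } catch (Exception) { return false; }
        return true;
    }
}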

Tuesday, July 22, 2014

We will discuss some configuration file entries for Splunk, particularly one related to path specifiers, say for the certificates used to launch Splunk in https mode: its syntax, semantics and migration issues. When Splunk is configured to run in https mode, the user sets a flag called enableSplunkWebSSL and two paths for the certificates - the private cert (privKeyPath) and the certification authority cert (caCertPath). The path specified with these keys is considered relative to the 'splunk_home' directory. However, users could choose to keep the certificates wherever they like, so the paths could include '..' specifiers. Paths could also start with '/', the specifier on unix-style machines, but these are generally not supported when the path is taken as relative; the '/' prefix is considered an absolute path specifier.
Since the user can store certificates anywhere on the machine, the path could be read as an absolute path. This way the user can specify the path directly, without the cumbersome '..' notation, and the paths will be treated the same as the other configuration keys for Splunk. Other than that there are no advantages.
Now let's look at the caveats for converting relative to absolute paths.
First, if the keys were specified, then Splunk was working in https mode, so the certificates exist on the target. If the certificates are found under splunk_home, then during migration we can normalize them and convert them to absolute paths. If the certificates are found under the root by way of '..' entries in the path, then this too can be made absolute with something like os.path.normpath(os.path.join(os.getcwd(), path)) in the migration script. If the certificates are not found by either means, then these keys should be removed so that Splunk can launch in the default http mode (although this constitutes a change in behavior).
Now that absolute paths have been specified in the configuration files, splunkd can assume that these point directly to the certificates and need not prepend them with splunk_home. So it first checks whether the certificates exist at the path as given. Next it checks whether the certificates are found under splunk_home with the path specified. This step cannot be avoided because we cannot rely on the migration script all the time; the user can change the settings anytime after first run. We could rely on the prefix '/', since the migration script makes paths absolute with a '/' prefix, and if it is missing we proceed to look for the certificates under splunk_home. However, the '/' prefix is only for linux; on windows we don't have that luxury. An equivalent of os.path.isabs(x) may need to be implemented and used by splunkd. Besides, paths on windows have several security issues: for example, we should not allow paths that begin with \\?\ or device and pseudo-device specifiers. Merely checking whether the path exists may not be enough. Besides, certificates should not be on remote machines.
Finally, with the new change to support absolute and relative paths, the splunkd process assumes that most paths encountered are absolute. These paths need to be checked for prefixes, length and validity before the certificates are looked for under them. If the certificates are not found, either because they don't exist or because they are not accessible, then if the path is relative we look for the certificates under splunk_home, and if that doesn't work we error out. The pseudocode below sketches this logic as a Python helper.
import os

def resolve_cert_path(path, splunk_home):
    if os.path.isabs(path) and os.path.exists(path):
        return path                                             # absolute: check and return
    candidate = os.path.normpath(os.path.join(splunk_home, path))
    if os.path.exists(candidate):
        return candidate                                        # relative: check and return
    raise ValueError("certificate not found: %s" % path)        # error and escape

Today we discuss another application for Splunk. I want to spend the next few days reviewing the implementation of some core components of Splunk, but for now I want to talk about API monitoring. Splunk exposes a REST API for its features that are called from the UI and by the SDKs. These API calls are logged in the web access log. The same APIs can be called from mobile applications on Android devices and iPhones/iPads. The purpose of this application is to get statistics from API calls, such as the percentage of calls that returned an error, the number of internal server errors, and the number and distribution of timeouts. With the statistics gathered, we can set up alerts on thresholds exceeded. Essentially, this is along the same lines as the Mashery API management solution. While APIs monitored by Mashery help study traffic from all devices to the API providers, in this case we are talking about the traffic to a Splunk instance from enterprise users. Mobile apps are not currently available for Splunk, but when they are, this kind of application would help troubleshoot them as well because it would show the differences between those devices and other callers.
The way Mashery works is with the use of an HTTP(S) proxy. In this case, however, we rely on the logs directly, assuming that all the data we need is available in them. The difference between searching the logs and running this application is that the application has continuous visualization and fires alerts.
This kind of application is different from a REST modular input because the latter indexes the responses from the APIs; here we are not keen on the responses but on the response code. At the same time we are also interested in the user-agent and other header information to enrich our stats, so long as they are logged.
Caching is a service available in Mashery or from applications such as AppFabric, but this is more likely a candidate feature for Splunk than for this application, due to the type of input to the application. Caching works well when requests and responses are intercepted, but this application is expected to use the log as its input.

Monday, July 21, 2014

Continuing from the previous post, we were discussing a logger for software components. In today's post we look at the component registration of logging channels. Initially a component may just specify a name (string) or an identifier (guid) to differentiate its logging channel, but requiring that each new component register a new channel is not usually enforced. Furthermore, logging at all levels is left to the discretion of the component owners, and this is generally inadequate. Besides, some components are considered too core to be of any interest to users, and consequently their logging is left out. With the new logger, we require that the components have a supportability review and that they are able to log as machine data without restriction on size or frequency, while supporting a lot more features.
Hence one of the improvements we require from component registration is metadata for the component's logging channel. This metadata includes, among other things, the intended audience, frequency, error message mapping for corrective actions, support for payload, grouping, etc. In other words, it helps the logging consumer take appropriate actions on the logging payload. Today the consumer decides whether to flush to disk, send to logging subscribers, or redirect to a database. It slaps headers on the data, for example for the listener when sending over the network, takes different actions when converting the data to binary mode, supports operations such as compression and encryption, and maintains different modes of operation, whether performance-oriented with a fast flush to disk or feature-oriented as above. Throttling and resource management of logging channels is possible via redirection to a null queue.
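A sketch of what the channel registration metadata might look like; every member below is an assumption drawn from the list above, not an existing Splunk API.

using System.Collections.Generic;

public enum LogAudience { Admin, Support, Developer }

public class LoggingChannelMetadata
{
    public string ChannelName;                            // or a guid identifier
    public LogAudience Audience;
    public int ExpectedEventsPerMinute;                   // helps the consumer plan throttling
    public bool SupportsPayload;
    public string GroupName;
    public Dictionary<string, string> CorrectiveActions;  // error message -> suggested corrective action
}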
In general, a sliding-window protocol could be implemented for the logging channel, with support for sequence numbers; many of its features can be compared with a TCP implementation.
TCP has several features - reordering, flow control, etc. For our purposes we don't have reordering issues.

Sunday, July 20, 2014

In today's post we continue to investigate applications of Splunk. One of the applications is supportability. Processes, memory, CPU utilization, file descriptor usage and system call failures are pretty much the bulk of the failures that require supportability measures. The most important of the supportability measures is logging, and although all components log, most of the fear around verbose logging has centered on pollution of the logs. In fact, the most frequently used components often lack helpful logging precisely because they are used so often that logging them rapidly grows the log to an overwhelming size. Such a log is offensive to admins, who view the splunkd log as actionable and for their eyes only.
Searches, on the other hand, have their own logs, generated for the duration of the session. Search artifacts are a blessing for across-the-board troubleshooting: they can be turned to debug mode, the generated log file is persisted only for the duration of the user session invoking the search, and it does not bother the admins.
What is required from the components that don't log even to the search logs, because they are so heavily used or are used at times other than searches, is to combine the technique used for search logs with this kind of logging.
The call to action is not just for components to log more, or to support logging to a different destination, or to have grades of logging, but fundamentally to allow a component to log without any concern for resources or impact. Flags can be specified by the component for concerns such as logging levels or actions. A mechanism may also be needed for loggers to specify round robin.
The benefit of a round-robin in-memory log buffer is the decoupling of the producers from the consumers. We will talk about logging improvements a lot more and cover many aspects, but the goal for now is to cover just this.
The in-memory buffer is entirely owned by the application, and as such the components can be given the slot number to write to. The entries for the log will follow some format, but we will discuss that later. There can be only one consumer for this in-memory buffer, and it services one or more out-of-process consumers that honor the user/admin's choices for destination, longevity and transformations. A sketch of such a buffer follows.
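A minimal sketch of such a round-robin in-memory buffer with a single consumer; the slot assignment and plain string entries are simplifying assumptions.

using System;
using System.Threading;

public class RoundRobinLogBuffer
{
    private readonly string[] _slots;
    private int _next = -1;

    public RoundRobinLogBuffer(int capacity) { _slots = new string[capacity]; }

    // producers: each call claims the next slot and overwrites whatever was there
    public void Write(string entry)
    {
        int slot = (int)((uint)Interlocked.Increment(ref _next) % _slots.Length);
        _slots[slot] = entry;
    }

    // the single consumer drains a snapshot and hands it to out-of-process subscribers
    public string[] Snapshot()
    {
        var copy = new string[_slots.Length];
        Array.Copy(_slots, copy, _slots.Length);
        return copy;
    }
}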

Saturday, July 19, 2014

Today we will look at the Markov chain a little bit more to answer the following questions:
Will the random walk ever return to where it started from?
If yes, how long will it take?
If not, where will it go?
If we take the probability that the walk (started at 0) attains the value 0 in n steps as u(n) for even n, we now want to find the probability f(n) that the walk returns to 0 for the first time in n steps.
Let Xn, n >= 0 be Markov with transition probabilities pij. Let St (t >= 0) be independent from Xn.  Sn can be considered a selection.
The theorem for the calculation of f(n) is stated this way:
Let n be even and u(n) = P0(Sn = 0), where P0 denotes probability for the walk started at 0.
Then f(n) = P0(S1 != 0, ..., Sn-1 != 0, Sn = 0) = u(n)/(n-1).
And here's the proof:
Since the random walk cannot change sign before becoming zero,
f(n) = P0(S1 > 0, ..., Sn-1 > 0, Sn = 0) + P0(S1 < 0, ..., Sn-1 < 0, Sn = 0)
        which comprises two equal terms.
Now,
        P0(S1 > 0, ..., Sn-1 > 0, Sn = 0) = P0(Sn = 0) P0(S1 > 0, ..., Sn-1 > 0 | Sn = 0), which is just the use of conditional probability.
The first factor is u(n). Given Sn = 0, Sn-1 equals 1 or -1 with equal probability, so conditioning on Sn-1 = 1 contributes a factor of 1/2, and by the Markov property at time n-1 we can omit the event Sn = 0.
So f(n) can be rewritten as 2 * u(n) * (1/2) * P0(S1 > 0, S2 > 0, ..., Sn-2 > 0 | Sn-1 = 1),
and the last term, by the ballot theorem, is 1/(n-1), which proves that f(n) = u(n)/(n-1).


To complete the posts on random walks for developers, as discussed in the previous two, we briefly summarize the goal and the process.
A random walk is a special case of a process called a Markov chain, where the future is conditionally independent of the past given the present. It is based on random numbers that assign a real number to the events based on probabilities. The probabilities of moving between a finite set of states (sometimes two - forward and backward, as in the case of a drunkard walking along a straight line) are called transition probabilities. Random walk leads to diffusion.
The iterations in a random walk are based on the equation px,y = px+z,y+z for any translation z, where px,y is the transition probability in the space S.
Random walks possess certain other properties
- time homogeneity
- space homogeneity
and sometimes skip-free transitions, which we have already discussed.
The transition probabilities are the main thing to be worked out in a random walk.
Usually this is expressed in terms of cases such as
p(x) = pj if x = ej, where j = 1 to d in a d-dimensional space,
     = qj if x = -ej,
     = 0 otherwise.
We have seen how this simplifies to a forward and backward motion in the case of a drunkard's linear walk.
The walk is just iterative, calculating a new position based on the outcome of the transition probabilities.
The walk itself may be performed k times to average out the findings (hitting time) from each walk.
Each step traverses the bipartite graph.
http://1drv.ms/1nf79KL

Friday, July 18, 2014

Continuing from the previous post ...

The number of steps taken between any two nodes is called the hitting time. The hitting time between an unclassified term and a domain term can be averaged over all the walks that connect these two nodes.
Since the walks are selected based on transition probabilities, the most probable paths are selected first. The same pair of nodes can be connected by many paths, and the same unclassified term can be connected to many other domain terms.
The contextual similarity of a classification pair n, m can then be described as the relative frequency of hitting m, normalized over all the nodes linked to the start node n.
This is calculated as Contextual Similarity L(n, m) = H(n, m) / Sigma-i H(n, i).
We can also look at the stability of a random walk, or of a Markov chain in general.
Stability refers to the convergence of probabilities as the series becomes infinite. When the chain is positive recurrent and irreducible, the average (called the Cesaro average) (1/n) Sum over k of P(Xk = x) converges to pi(x) as n -> infinity.
Stability is interesting because a Markov chain is a simple model of a stochastic dynamical system that remains within a small region. For example, when a pendulum swings, it finally comes to a stable position with dissipation of energy. Even if there were no friction, it would be deemed stable because it cannot swing too far away.
What stability tells us is that when a Markov chain has certain properties (irreducibility, positive recurrence, a unique stationary distribution pi), the n-step transition matrix converges to a matrix with all rows equal to pi. This is called the fundamental stability theorem.
Stability is proved via coupling.
Coupling refers to the various methods for constructing a combination of two random variables. If the random variables are independent, then they can be combined in a straightforward manner by taking their joint occurrence. Coupling helps us define a third Markov chain Z from a chain X with an arbitrary initial distribution and a chain Y with the stationary distribution, where Z is X prior to the meeting with Y and Z is Y after the meeting point. This then shows that the transition matrix converges to a matrix with all rows equal to pi.
By choosing the most probable paths, the random walk follows the preferred state transitions. Thus while not all paths may end within the predetermined number of steps, we know that when one does, it will have followed the higher transition probabilities.

A simple random walk has equal probabilities to move from one state to another.
To implement a simple random walk in the (x, y) plane, we can have a naive one like this (a runnable Python sketch):

import random

n = 10
step = 1
x = [0] * 2 ** n
y = [0] * 2 ** n
for i in range(1, 2 ** n):
    if random.random() < 0.5:
        # move in the x-direction
        x[i] = x[i - 1] + random.choice([-step, step])
        y[i] = y[i - 1]
    else:
        # move in the y-direction
        x[i] = x[i - 1]
        y[i] = y[i - 1] + random.choice([-step, step])
    print(x[i], y[i])

We can have the following metrics in a simple random walk:
first return time to zero
total number of visits to a state.
For all the random walks, we can have metrics like
total number of paths between two nodes
total number of paths in which a node appears


Thursday, July 17, 2014

We mentioned that a random walk is a special case of a Markov chain. To understand Markov chains, we take the example of a mouse in a cage with two cells - the first has fresh cheese and the second has stinky cheese. If at time n the mouse is in cell 1, then at time n+1 the mouse is in either cell 1 or cell 2. Let us say the probability of moving from cell 1 to cell 2 is alpha and the reverse is beta. In this case, alpha is much smaller than beta. Then the transition diagram shows transitions as follows:
1 -> 1 with a probability of 1 - alpha
1 -> 2 with a probability of alpha
2 -> 1 with a probability of beta
and 2 -> 2 with a probability of 1 - beta.
The mouse decides to stay or move at each step; the time it takes to make the first move from cell 1 to cell 2 is geometrically distributed, with mean 1/alpha.
Thus the moves are described in terms of random variables Xi. Let T denote a subset of the integers and consider a sequence of random variables {Xn : n belongs to T}. The Markov property is then defined as follows:
for any n belonging to T, the future process Xm (m > n) is independent of the past process Xm (m < n), conditionally on Xn.
In the case of the mouse, the probability of the move is the same regardless of where the mouse was at earlier times, i.e. the future is independent of the past given the present.
This dependence of the future on the present is easy to generalize with random numbers instead of deterministic objects. The states are countable and can be denoted by state space S and since the moves are between states, Xn is called a Markov chain.
The Markov property talks about all before and after processes so (Xn+1 .... Xn+m) and (X0, ... Xn-1) are independent conditionally on Xn.
The walk based on a Markov chain is thus dependent on transition probabilities. If the transition probabilities are defined, then the walk is just iterative.
The joint probability distribution can be expressed in terms of two functions -
one for the initial states for each i in the state space S
and the other for the transitions pi,j(n, n+1) = P(Xn+1 = j | Xn = i), i, j belongs to S, n >= 0
We can generalize the transition from one step to more than one steps with
 pi,j(m,n) = P(Xn = j | Xm = i).
We call a Markov chain time-homogeneous when the single-step transition probability does not depend on n.
In the case of the mouse, the probability of a move is independent of where it was at earlier times. Each single-step transition probability, denoted by just P, is the same no matter how many times it is repeated, and the transition matrix P(m, n) = P * P * ... * P, multiplied n-m times, i.e. P^(n-m).
In a random walk, we add a few more properties we discussed earlier.
Now let us consider a random walk on the bipartite graph starting at an unclassified term and arriving at another semantically equivalent domain term. The walker starts from any node and then moves to any other node with transition probability Pij.
Pij is the normalized probability of the co-occurrence counts of the terms and the corresponding context. The transition probability between any two nodes i, j is defined as Pi,j = Wi,j / Sum over k of Wi,k.
The terms are either unclassified or domain and cannot be connected with each other except through context.
Each walk uses the transition probabilities for the move.
Each walk starts at some unclassified term and ends at a domain term, or exceeds the maximum number of moves without finishing.
Several walks, up to say K, can be initiated. A sketch of such a walk follows.
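A sketch of one such walk in C#, using the normalized co-occurrence weights as transition probabilities; the weight matrix, node ids and the maxMoves cutoff are placeholders for whatever the term extraction produces.

using System;
using System.Collections.Generic;
using System.Linq;

class BipartiteWalk
{
    static readonly Random Rng = new Random();

    // weights[i] maps neighbor -> co-occurrence count for node i
    public static int Step(int current, Dictionary<int, Dictionary<int, double>> weights)
    {
        var neighbors = weights[current];
        double total = neighbors.Values.Sum();           // Sum over k of Wi,k
        double draw = Rng.NextDouble() * total;
        foreach (var edge in neighbors)
        {
            draw -= edge.Value;                          // picks j with probability Wi,j / Sum(Wi,k)
            if (draw <= 0) return edge.Key;
        }
        return neighbors.Keys.Last();
    }

    // walk from an unclassified term until we land on a domain term or give up
    public static int? Walk(int start, HashSet<int> domainTerms,
                            Dictionary<int, Dictionary<int, double>> weights, int maxMoves)
    {
        int node = start;
        for (int move = 0; move < maxMoves; move++)
        {
            node = Step(node, weights);
            if (domainTerms.Contains(node)) return node;
        }
        return null;                                     // the walk did not finish
    }
}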

Wednesday, July 16, 2014

Let us look at performing a random walk between terms and context.
 We discussed in the previous post that terms and context are different for machine data than for human text. The context in the case of machine data is the metadata fields. A random walk connects the terms and the context in a bipartite graph.
A random walk is a special kind of Markov chain. The latter has the property that it has time-homogeneous transition probabilities. A random walk has the additional property that it has space-homogeneous transition probabilities. This calls for a structure where px,y = px+z,y+z for any translation z. A random walk is given by px,y = p(y - x).
In an undirected connected graph, the degree of each vertex i is the number of neighbors of i.
For the same graph, a random walk is defined with the following transition probabilities:
pij = 1 / degree(i), if j is a neighbor of i
    = 0, otherwise
The stationary distribution of the random walk on a graph (RWG) is given by pi(i) = C degree(i), where C is a constant.
This comes from the fact that pi(i) pij = pi(j) pji, since both sides are equal to C: on each side of the equation, the first factor is the stationary distribution and the second is the inverse of the degree.
This shows that the stationary distribution is proportional to degree(i). This is easy to work out with a sample connected graph of five vertices arranged inside a square. Moreover, the equations show that the RWG is a reversible chain. The constant C is found by normalization. Given that the degrees add up to twice the number of edges, we have C = 1/(2|E|).
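A small C# check of pi(i) = degree(i) / (2|E|); the adjacency list below is one reading of a five-vertex graph (four vertices on a square plus one inside joined to all four) and is used only as an example.

using System;
using System.Collections.Generic;
using System.Linq;

class StationaryDistribution
{
    static void Main()
    {
        var adjacency = new Dictionary<int, int[]>
        {
            { 0, new[] { 1, 2, 3, 4 } },   // the inner vertex joined to the four corners
            { 1, new[] { 0, 2, 4 } },
            { 2, new[] { 0, 1, 3 } },
            { 3, new[] { 0, 2, 4 } },
            { 4, new[] { 0, 1, 3 } },
        };
        int twoE = adjacency.Values.Sum(n => n.Length);          // every edge is counted twice
        foreach (var vertex in adjacency.Keys)
            Console.WriteLine("pi({0}) = {1:F3}", vertex, (double)adjacency[vertex].Length / twoE);
    }
}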

There are several ways to express the transition probabilities, resulting in different random walks:
p(x) = C 2^(-|x1| - ... - |xd|)

p(x) = pj if x = ej, j = 1, ..., d
     = qj if x = -ej, j = 1, ..., d
     = 0 otherwise
and all of the p(x) add up to 1.

If d = 1 then p(1) = p, p(-1) = q and 0 otherwise, where p + q = 1. This is the drunkard's walk, which is homogeneous in time and space, and given that it must pass through all intermediate nodes because each hop is 1, the transitions are also skip-free.

Machine data has several traits that are different from texts and discourses. When we search machine data we see fields such as messages, timestamps, applications, source, sourcetype, host, sizes, counts, protocols and protocol data formats etc. In other words, there is a set of domain terms that is finite and covers a large portion of the logs seen.
There is a high number of repetitions of these domain terms due to the fact that machines and repeaters often spit out the same set of messages at different times. At the same time, there are messages that have a high degree of variation in content such as web traffic capture. But these two different types are well delineated in terms of flows.
Machine data therefore can be separated into three different types:
Data that doesn't vary a lot in terms of content and terms
Data that varies a lot in terms of content only, metadata is still same
Data that mixes the two above
For the first two cases, when they constitute the bulk of the messages, we can treat them differently. The mixture of the two amounts to noise and can be ignored for the moment; we will treat it as any other arbitrary text later.
Predictive patterns in the data help with analysis so the first two case are interesting and lend themselves to field extraction, summarization etc.
Terms extracted from the data are already falling into a frequency table that is skewed as opposed to the concordance from human text.  So the selection of these terms is different from that of regular text.
Moreover context for the data can be described in terms of the metadata fields. These fields are predetermined or even specified by the forwarders of the data. The indexers that store the raw and searchable data have both data and metadata from a wide variety of sources.
Association between the terms and the context let us look for those that can be considered the semantics of the machine data. These terms are special in that they tag the data in a special way. Finding these terms is helpful in knowing what the machine data is about.
We can associate the terms and the context with bipartite graphs.
When we perform a random walk to connect the terms and the context, we are performing a special case of a Markov chain, which is iterative. In a random walk, we follow a stochastic process with random variables X1, X2, ..., Xk such that X1 is the start vertex and Xi+1 is a vertex chosen uniformly at random from the neighbors of Xi. The number pv,w,k(G) is the probability that a random walk of length k starting at v ends at w. If each edge is treated as a one-ohm resistor, the resistance from a point to infinity is finite exactly when the graph is transient. We will review the Aldous and Fill book on random walks on graphs next.

Tuesday, July 15, 2014

I will resume my previous post, but I want to take a short break to discuss another software application. In some earlier posts I discussed a Fiddler-like application for mobile devices; in particular I was looking at one for those devices, and I pointed out proxy-switching code for iOS available on the net. Changing the proxy helps with the packet capture.
When we talked about the viability of a Fiddler-like application for mobile devices, we wanted to use it to test the APIs.
Before continuing any further, I have a work item to look at machine data semantics and I will return shortly.
In today's post, we look at two-mode networks as a social network method of study from the Hanneman lectures. Breiger (1974) first highlighted the dual focus of social network analysis: how individuals, by their agency, create social structure, while at the same time those structures impose constraints on and shape the behavior of the individuals embedded in them. Social network analysis measures relations at the micro level and uses them to infer the presence of structure at the macro level. For example, the ties of individuals (micro) allow us to infer the cliques (macro).
The Davis study showed that there can be different levels of analysis. This study finds ties between actors and events, and as such captures not membership in a clique but affiliations. By seeing which actors participate in which events, we can infer the meaning of the event from the affiliations of the actors, while also seeing the influence of the event on the choices of the actors.
Further, we can see examples of this macro-micro social structure at different levels. This is referred to as nesting, where individuals are part of a social structure and the structure can be part of a larger structure. At each level of the nesting, there is tension between structure and agency, i.e. the macro and the micro.
There are some tools to examine this two-mode data, which involve finding both qualitative and quantitative patterns. If we take an example where we look at the contributions of donors to campaigns supporting and opposing ballot initiatives over a period of time, our data set has two modes - donors and initiatives. Binary data, indicating whether there was a contribution or not, could describe what a donor did. Valued data could describe the relations between donors and initiatives using a simple ordinal scale.
A rectangular matrix of actors (rows) and events (columns) could describe this dual-mode data.
This could then be converted into two one-mode data sets: one where we measure the strength of ties between actors by the number of times they contributed to the same side of initiatives, and one with initiative-by-initiative ties where we measure the number of donors that each pair of initiatives had in common.
To create actor-by-actor relations, we could use a cross-product method that takes each entry of the row for actor A, multiplies it with the corresponding entry for actor B, and then sums the results. This gives an indication of co-occurrence and works well with binary data, where each product is 1 only when both actors are present.
Instead of the cross-product, we could also take the minimum of the two values, which amounts to saying that the tie is the weaker of the ties of the two actors to the event. A sketch of both methods follows.
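A small C# sketch of both methods on a made-up binary actor-by-event matrix; the matrix values are purely illustrative.

using System;

class TwoModeTies
{
    static void Main()
    {
        // rows = actors, columns = events (1 = contributed, 0 = did not)
        int[,] affiliation =
        {
            { 1, 0, 1, 1 },
            { 1, 1, 0, 1 },
            { 0, 1, 1, 0 },
        };
        int actors = affiliation.GetLength(0), events = affiliation.GetLength(1);

        for (int a = 0; a < actors; a++)
            for (int b = a + 1; b < actors; b++)
            {
                int crossProduct = 0, minimum = 0;
                for (int e = 0; e < events; e++)
                {
                    crossProduct += affiliation[a, e] * affiliation[b, e];      // co-occurrence count
                    minimum += Math.Min(affiliation[a, e], affiliation[b, e]);  // weaker of the two ties
                }
                Console.WriteLine("actors {0},{1}: cross-product={2} minimum={3}", a, b, crossProduct, minimum);
            }
    }
}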
Two-mode data are sometimes stored in a second way called the bipartite matrix. A bipartite matrix is one where the same rows as in the original matrix are added as additional columns and the same columns as in the original matrix are added as additional rows. Actors and events are then treated as social objects at a single level of analysis.
This is different from a bipartite graph, which is a set of graph vertices decomposed into two disjoint sets such that no two vertices within the same set are adjacent (by adjacent, we mean joined by an edge). In the context of word similarity extraction, we used terms and their N-gram contexts as the two partites and used random walks to connect them.


I will cover random walks in more detail.

Sunday, July 13, 2014

In this post, like in the previous one, we will continue to look at Splunk integration with SQL and NoSQL systems. Specifically we will look at Log Parser and Splunk interaction. Splunk users know how to translate SQL queries to Splunk search queries; we use search operators for this. For non-Splunk users we could provide Splunk as a data store with Log Parser as a SQL interface. Therefore, we will look into providing Splunk searchable data as a COM input to Log Parser. A COM input simply implements a few methods for Log Parser and abstracts the data store. These methods are:
OpenInput: Opens your data source and sets up any initial environment settings
GetFieldCount: returns the number of fields that the plugin provides
GetFieldName: returns the name of a specified field
GetFieldType : returns the datatype of a specified field
GetValue : returns the value of a specified field
ReadRecord : reads the next record from your data source
CloseInput: closes the data source and cleans up any environment settings
Together Splunk and Log Parser bring the power of Splunk to Log Parser users without requiring them to know Splunk search commands. At the same time, they still have the choice of searching the Splunk indexes directly. The ability to use SQL makes Splunk more familiar and inviting to windows users.

<SCRIPTLET>
  <registration
    Description="Splunk Input Log Parser Scriptlet"
    Progid="Splunk.Input.LogParser.Scriptlet"
    Classid="{fb947990-aa8c-4de5-8ff3-32a59fb66a6c}"
    Version="1.00"
    Remotable="False" />
  <comment>
  EXAMPLE: logparser "SELECT * FROM MAIN" -i:COM -iProgID:Splunk.Input.LogParser.Scriptlet
  </comment>
  <implements id="Automation" type="Automation">
    <method name="OpenInput">
      <parameter name="strValue"/>
    </method>
    <method name="GetFieldCount" />
    <method name="GetFieldName">
      <parameter name="intFieldIndex"/>
    </method>
    <method name="GetFieldType">
      <parameter name="intFieldIndex"/>
    </method>
    <method name="ReadRecord" />
    <method name="GetValue">
      <parameter name="intFieldIndex"/>
    </method>
    <method name="CloseInput">
      <parameter name="blnAbort"/>
    </method>
  </implements>
  <SCRIPT LANGUAGE="VBScript">

Option Explicit

Dim objResultDictionary
Dim objResultsSection, objResultsCollection
Dim objSearchResultsElement, objResultsElement, objResultElement
Dim intResultElementPos, intResult, intRecordIndex
Dim clsResult

' --------------------------------------------------------------------------------
' Open the input Result.
' --------------------------------------------------------------------------------

Public Function OpenInput(strValue)
  ' GetSearchResults and FindElement are assumed helpers: GetSearchResults wraps the
  ' Splunk COM component shown later in this post, and FindElement is the element
  ' lookup helper from the original scriptlet this example is adapted from.
  intRecordIndex = -1
  Set objResultDictionary = CreateObject("Scripting.Dictionary")
  Set objResultsSection = GetSearchResults("index=main")
  Set objResultsCollection = objResultsSection.Collection
  If IsNumeric(strValue) Then
    intResultElementPos = FindElement(objResultsCollection, "Result", Array("id", strValue))
  Else
    intResultElementPos = FindElement(objResultsCollection, "Result", Array("name", strValue))
  End If
  If intResultElementPos > -1 Then
    Set objResultElement = objResultsCollection.Item(intResultElementPos)
    Set objSearchResultsElement = objResultElement.ChildElements.Item("SearchResults")
    Set objResultsElement = objSearchResultsElement.ChildElements.Item("SearchResult").Collection
    For intResult = 0 To CLng(objResultsElement.Count) - 1
      Set objResultElement = objResultsElement.Item(intResult)
      Set clsResult = New Result
      clsResult.Timestamp = objResultElement.GetPropertyByName("timestamp").Value
      clsResult.Host = objResultElement.GetPropertyByName("host").Value
      clsResult.Source = objResultElement.GetPropertyByName("source").Value
      clsResult.SourceType = objResultElement.GetPropertyByName("sourcetype").Value
      clsResult.Raw = objResultElement.GetPropertyByName("raw").Value
      objResultDictionary.Add intResult, clsResult
    Next
  End If
End Function

' --------------------------------------------------------------------------------
' Close the input Result.
' --------------------------------------------------------------------------------

Public Function CloseInput(blnAbort)
  intRecordIndex = -1
  objResultDictionary.RemoveAll
End Function

' --------------------------------------------------------------------------------
' Return the count of fields.
' --------------------------------------------------------------------------------

Public Function GetFieldCount()
    GetFieldCount = 5
End Function

' --------------------------------------------------------------------------------
' Return the specified field's name.
' --------------------------------------------------------------------------------

Public Function GetFieldName(intFieldIndex)
    Select Case CInt(intFieldIndex)
        Case 0:
            GetFieldName = "Timestamp"
        Case 1:
            GetFieldName = "Host"
        Case 2:
            GetFieldName = "Source"
        Case 3:
            GetFieldName = "Sourcetype"
        Case 4:
            GetFieldName = "Raw"
        Case Else
            GetFieldName = Null
    End Select
End Function

' --------------------------------------------------------------------------------
' Return the specified field's type.
' --------------------------------------------------------------------------------

Public Function GetFieldType(intFieldIndex)
    ' Field type constants, using the values that Log Parser's COM input format expects.
    Const TYPE_INTEGER   = 1
    Const TYPE_REAL      = 2
    Const TYPE_STRING    = 3
    Const TYPE_TIMESTAMP = 4
    Const TYPE_NULL      = 5
    Select Case CInt(intFieldIndex)
        Case 0:
            GetFieldType = TYPE_TIMESTAMP
        Case 1:
            GetFieldType = TYPE_STRING
        Case 2:
            GetFieldType = TYPE_STRING
        Case 3:
            GetFieldType = TYPE_STRING
        Case 4:
            GetFieldType = TYPE_STRING
        Case Else
            GetFieldType = Null
    End Select
End Function

' --------------------------------------------------------------------------------
' Return the specified field's value.
' --------------------------------------------------------------------------------

Public Function GetValue(intFieldIndex)
  If objResultDictionary.Count > 0 Then
    Select Case CInt(intFieldIndex)
        Case 0:
            GetValue = objResultDictionary(intRecordIndex).Timestamp
        Case 1:
            GetValue = objResultDictionary(intRecordIndex).Host
        Case 2:
            GetValue = objResultDictionary(intRecordIndex).Source
        Case 3:
            GetValue = objResultDictionary(intRecordIndex).SourceType
        Case 4:
            GetValue = objResultDictionary(intRecordIndex).Raw
        Case Else
            GetValue = Null
    End Select
  End If
End Function
 
' --------------------------------------------------------------------------------
' Read the next record, and return true or false if there is more data.
' --------------------------------------------------------------------------------

Public Function ReadRecord()
  ReadRecord = False
  If objResultDictionary.Count > 0 Then
    If intRecordIndex < (objResultDictionary.Count - 1) Then
      intRecordIndex = intRecordIndex + 1
      ReadRecord = True
    End If
  End If
End Function

Class Result
  Public Timestamp
  Public Host
  Public Source
  Public SourceType
  Public Raw
End Class

  </SCRIPT>

</SCRIPTLET>

Scriptlet courtesy: Robert McMurray's blog


I will provide a class library in C# for the COM-callable wrapper to Splunk searchable data.

The COM library that returns the search results can implement its methods like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Splunk;
using SplunkSDKHelper;
using System.Xml;

namespace SplunkComponent
{

    // The wrapper must be visible to COM (and the assembly registered, for example
    // with regasm) before COM clients such as the Log Parser scriptlet above can
    // create and call it.
    [System.Runtime.InteropServices.ComVisible(true)]
    public class SplunkComponent
    {
        public SplunkComponent()
        {
            // Load connection info for Splunk server in .splunkrc file.
            var cli = Command.Splunk("search");
            cli.AddRule("search", typeof(string), "search string");
            cli.Parse(new string[] {"--search=\"index=main\""});
            if (!cli.Opts.ContainsKey("search"))
            {
                System.Console.WriteLine("Search query string required, use --search=\"query\"");
                Environment.Exit(1);
            }

            var service = Service.Connect(cli.Opts);
            var jobs = service.GetJobs();
            job = jobs.Create((string)cli.Opts["search"]);
            while (!job.IsDone)
            {
                System.Threading.Thread.Sleep(1000);
            }
        }

        [System.Runtime.InteropServices.ComVisible(true)]
        public string GetAllResults()
        {
            var outArgs = new JobResultsArgs
            {
                OutputMode = JobResultsArgs.OutputModeEnum.Xml,

                // Return all entries.
                Count = 0
            };

            using (var stream = job.Results(outArgs))
            {
                var setting = new XmlReaderSettings
                {
                    ConformanceLevel = ConformanceLevel.Fragment,
                };

                using (var rr = XmlReader.Create(stream, setting))
                {
                    // Position the reader on the root results element before returning its markup.
                    rr.MoveToContent();
                    return rr.ReadOuterXml();
                }
            }
        }

        private Job job { get; set; }
    }
}

https://github.com/ravibeta/csharpexamples/tree/master/SplunkComponent.