Monday, July 28, 2014

Starting today, I'm going to write a series of posts on Splunk internals. As we know, a Splunk server plays three roles: forwarding, indexing, and searching.
We will cover the forwarding role today to see which components would need to be ported to a SplunkLite framework.
First we look at pipeline data and its associated data structures. Some of these are easy to port to C#, and doing so gives us the same definitions for input that we have in Splunk Server.
Pipeline components and actor threads can be used directly in .Net. Configuration can be maintained via configuration files just as in Splunk Server. Threads dedicated to shutdown or to running the event loop are still required. Note that framework items like the event loop can be substituted by the equivalent .Net scheduler classes; .Net has rich support for threads and scheduling via the .Net 4.0 task library.
While on the topic of framework items, we might as well cover the logger. Logging is available via packages like log4net or the Enterprise Library logging application block. These are convenient to add to the application and come with multiple-destination features. When we enumerate the utilities required for this application, we will see that much of the effort of writing something portable from Splunk Server goes away, because .Net already provides those pieces.
When writing the forwarder, we can start small with a small set of inputs and expand the choices later. Having one forwarder, one indexer, and one search head will be sufficient for a proof of concept. The code can provide end-to-end functionality, and we can then augment each of the processors, whether they are input, search, or index processors. Essentially the processors all conform to the same interface for a given role, so how we expand is up to us.
Persistent storage may need to be specified to work with the data, and this is proprietary. Data and metadata may require data structures similar to what we have in Splunk; we would look into the hash manager and the record file manager. We should budget for the things that are specific to Splunk first, because they are artifacts with a history and a purpose.
Items that we can and should deliberately avoid reimplementing are those for which .Net already has rich and robust features, such as producer-consumer queues.
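As a rough illustration of such a pipeline stage (sketched here in Python rather than .Net, with a hypothetical forwarder_stage function), a producer-consumer queue between two actor threads might look like:

```python
import queue
import threading

def forwarder_stage(events, transform):
    """Minimal producer-consumer pipeline stage: a producer thread
    feeds raw events into a bounded queue, and a consumer thread
    applies a processor transform to each one in order."""
    q = queue.Queue(maxsize=100)
    out = []
    SENTINEL = object()  # marks the end of the stream

    def producer():
        for e in events:
            q.put(e)
        q.put(SENTINEL)

    def consumer():
        while True:
            e = q.get()
            if e is SENTINEL:
                break
            out.append(transform(e))

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return out
```

In .Net the bounded queue and the two threads would be replaced by BlockingCollection and Task instances, but the shape of the stage is the same.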
Porting the application may require a lot of time. An initial estimate for the bare bones and for testing is in order. Anything else we can keep prioritized.

Sunday, July 27, 2014

Splunk indexes both text and binary data. In this post, we will see how Splunk can be used alongside archival storage devices. Companies like DataDomain have a great commercial advantage in data backup and archival; their ability to back up data on a continuous basis and to use deduplication to reduce its size makes the use of Splunk interesting. But first let us look at what it means to index binary data. We know it is different from text, where indexing uses compact hash values and an efficient data structure such as a B+ tree for lookup and retrieval. Text data also lends itself to key-value pair extraction, which comes in handy with NoSQL databases. The trouble with binary data is that it cannot be meaningfully searched and analyzed: unless there is textual metadata associated with it, binary data is not helpful. For example, an image file's bytes are not as helpful as its size, creation tool, username, camera, GPS location, and so on. Even textual representations such as XML are not ideal, since they are difficult for humans to read and require parsing. As an example, serializing code objects in an application may be helpful, but logging their significant key-value pairs may be even better, since those will be in a textual format that lends itself to Splunk forwarding, indexing, and searching.
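To make this concrete, here is a small Python sketch (the helper name and field names are illustrative, not a Splunk schema) that logs a binary file's searchable metadata as key=value pairs instead of serializing its bytes:

```python
import os
import time

def kv_metadata_line(path):
    """Emit a binary file's metadata as key=value pairs, the
    textual form that Splunk can forward, index, and search."""
    st = os.stat(path)
    fields = {
        "file": path,
        "size_bytes": st.st_size,
        "modified": time.strftime("%Y-%m-%dT%H:%M:%S",
                                  time.localtime(st.st_mtime)),
    }
    return " ".join("%s=%s" % (k, v) for k, v in fields.items())
```

A line like this can be appended to any log file that a forwarder monitors, and the fields are extracted at search time for free.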
This can be used with periodic and acyclic maintenance data archival as well. The applications that archive data sometimes move terabytes of data. Moreover, they interpret and organize this data so that nothing is lost during the move, yet the delta changes between, say, two backup runs are collected and saved efficiently in both size and computation. A lot of metadata is gathered in the process by these applications, and the same can be logged to Splunk, which in turn enables superior analytics on it. One characteristic of data that is backed up or archived regularly is that it has a lot of recurrence and repetition; that is why companies like DataDomain are able to de-duplicate the events and reduce the footprint of the data to be archived. Those computations carry a lot of information that can be expressed as rich metadata suitable for analytics later. For example, the source of the data, the programs that use it, metadata on the data that was de-duped, and the location and availability of the archived data are all relevant for analytics later. This way those applications need not fetch the data to answer analytical queries but can work directly off the Splunk indexes instead.
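A toy Python sketch of the idea (fixed-size chunks and SHA-1 hashes; real products like DataDomain use more sophisticated content-defined chunking) shows how the second of two backup runs reduces to a handful of new chunks, and it is this hash metadata, rather than the data itself, that would be logged:

```python
import hashlib

def chunk_hashes(data, size=4):
    """Split data into fixed-size chunks and return their hashes."""
    return [hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

def dedup_stats(run1, run2):
    """Compare two backup runs; only chunks absent from the first
    run need to be stored (and logged) for the second."""
    seen = set(chunk_hashes(run1))
    new = [h for h in chunk_hashes(run2) if h not in seen]
    return {"total_chunks": len(chunk_hashes(run2)),
            "new_chunks": len(new)}
```

Logging the per-run counts and chunk identities as key=value events gives Splunk everything needed to answer questions like "how much did last night's backup actually change" without touching the archive.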
If we look at the custom search commands in a Splunk instance, we actually find a trove of utilities. Some of these scripts do such things as streaming search results to an xml file. There are commands to do each of the following:

 Produces a summary of each search result.
 Add fields that contain common information about the current search.
 Computes the sum of all numeric fields for each result.
 Computes an "unexpectedness" score for an event.
 Finds and summarizes irregular, or uncommon, search results.
 Appends subsearch results to current results.
 Appends the fields of the subsearch results to current results.
 Finds association rules between field values.
 Identifies correlations between fields.
 Returns audit trail information that is stored in the local audit index.
 Sets up data for calculating the moving average.
 Analyzes numerical fields for their ability to predict another discrete field.
 Keeps a running total of a specified numeric field.
 Computes the difference in field value between nearby results.
 Puts continuous numerical values into discrete sets.
 Returns results in a tabular output for charting.
 Finds how many times field1 and field2 values occurred together.

 Builds a contingency table for two fields.
 Converts field values into numerical values.
 Crawls the filesystem for files of interest to Splunk
 Adds the RSS item into the specified RSS feed.
 Allows user to examine data models and run the search for a datamodel object.
 Removes the subsequent results that match specified criteria.
 Returns the difference between two search results.
 Automatically extracts field values similar to the example values.
 Calculates an expression and puts the resulting value into a field.
 Extracts values from search results
 Extracts field-value pairs from search results. 
 Keeps or removes fields from search results.
 Generates summary information for all or a subset of the fields.
 Replaces null values with the last non-null value.
 Replaces null values with a specified value.
 Replaces "attr" with higher-level grouping.
 Replaces PATHFIELD with higher-level grouping.
 Runs a templatized streaming subsearch for each field in a wildcarded field list.
 Takes the results of a subsearch and formats them into a single result. 
 Transforms results into a format suitable for display by the Gauge chart types.  
 Generates time range results.
 Generates statistics which are clustered into geographical bins to be rendered on a world map.
 Returns the first n number of specified results.
 Returns the last n number of specified results.
 Returns information about the Splunk index.
 Adds or disables sources from being processed by Splunk.
 Loads search results from the specified CSV file.
 Loads search results from a specified static lookup table.
 SQL-like joining of results from the main results pipeline with the results from the subpipeline.
 Joins results with itself.
 Performs k-means clustering on selected fields.
 Returns a list of time ranges in which the search results were found.
 Prevents subsequent commands from being executed on remote peers.
 Loads events or results of a previously completed search job. 
 Explicitly invokes field value lookups.
 Looping operator; performs a search over each search result.
 Extracts field-values from table-formatted events.
 Runs multiple searches at the same time.
 Combines events in the search results that have a single differing field value into one result with a multi-value field of the differing field.
 Expands the values of a multi-value field into separate events for each value of the multi-value field. 
 Changes a specified field into a multi-value field during a search.
 Changes a specified multi-value field into a single-value field at search time. 
  Removes outlying numerical values.
 Executes a given search query and export events to a set of chunk files on local disk. 
 Outputs search results to the specified CSV file.
 Save search results to specified static lookup table.
 Outputs search results in a simple
 Outputs the raw text (_raw) of results into the _xml field.
 Finds events in a summary index that overlap in time or have missed events.
 Allows user to run pivot searches against a particular datamodel object.
 Predict future values of fields.
 See what events from a file will look like when indexed without actually indexing the file.
 Displays the least common values of a field.
 Removes results that do not match the specified regular expression.
 Calculates how well the event matches the query.
 Renames a specified field (wildcards can be used to specify multiple fields).
 Replaces values of specified fields with a specified new value.
 Specifies a Perl regular expression named groups to extract fields while you search.
 Buffers events from real-time search to emit them in ascending time order when possible.
 The select command is deprecated. If you want to compute aggregate statistics, use the stats command.
 Makes calls to external Perl or Python programs.
 Returns a random sampling of N search results.
 Returns the search results of a saved search. 
  Emails search results to specified email addresses.
 Sets the field values for all results to a common value.
 Extracts values from structured data (XML or JSON) and stores them in a field or fields.
 Turns rows into columns.
 Filters out repeated adjacent results.
 Retrieves event metadata from indexes based on terms in the <logical-expression>
 Filters results using keywords
 Performs set operations on subsearches.
 Clusters similar events together.
 Produces a symbolic 'shape' attribute describing the shape of a numeric multivalued field.
 Sorts search results by the specified fields.
 Puts search results into a summary index.
 Adds summary statistics to all search results in a streaming manner.
 Adds summary statistics to all search results.
 Provides statistics, grouped optionally by fields.
 Concatenates string values.
 Summary indexing friendly versions of stats command.
 Summary indexing friendly versions of top command.
 Summary indexing friendly versions of rare command.
 Summary indexing friendly versions of chart command.
 Summary indexing friendly versions of timechart command.
 Annotates specified fields in your search results with tags. 
 Computes the moving averages of fields.
 Creates a time series chart with corresponding table of statistics.
 Displays the most common values of a field.
 Writes the result table into *.tsidx files using indexed fields format.
 Performs statistics on indexed fields in tsidx files
 Groups events into transactions.
 Returns typeahead on a specified prefix.
 Generates suggested eventtypes.  Deprecated: preferred command is 'findtypes'
 Calculates the eventtypes for the search results
 Runs an eval expression to filter the results. The result of the expression must be Boolean.
 Causes UI to highlight specified terms.
 Converts results into a format suitable for graphing. 
 Extracts XML key-value pairs.
 Un-escapes XML characters.
 Extracts the xpath value from FIELD and sets the OUTFIELD attribute.
 Extracts location information from IP addresses using 3rd-party databases.
 Processes the given file as if it were indexed.
 Sets RANGE field to the name of the ranges that match.
 Returns statistics about the raw field.
 Sets the 'reltime' field to a human readable value of the difference between 'now' and '_time'.
 Anonymizes the search results.
 Returns a list of source, sourcetypes, or hosts from a specified index.
 Performs a debug command.
 Performs a deletion from the index.
 Returns the number of events in an index.
 Generates suggested event types.
 Convenient way to return values up from a subsearch.
 Internal command used to execute scripted alerts.
 Finds transaction events given search constraints.
 Runs the search script.
 Remove seasonal fluctuations in fields.

Saturday, July 26, 2014

As I mentioned in the previous post, we are going to write a custom command that transforms search results into xml. Something like:
        SearchResults::iterator it;
        for (it = results.begin(); it != results.end(); ++it) {
            SearchResult r = **it;
            _output.append("<SearchResult>");
            std::set<Str> allFields;
            results.getAllKeys(allFields);
            for (std::set<Str>::const_iterator sit = allFields.begin();
                 sit != allFields.end(); ++sit) {
                sr_index_t index = results.getIndex(*sit);
                // check that all xml tags are constructed without whitespace
                if (r.exists(index)) {
                    _output.append("<" + (*sit).trim() + ">");
                    _output.append(r.getValue(index));
                    _output.append("</" + (*sit).trim() + ">");
                }
            }
            _output.append("</SearchResult>");
        }

but Splunk already has xpath.py, which makes the event value valid xml, i.e. it produces <data>%s</data> where the inner xml is the value corresponding to _raw in the event. This is different from the above.
There are also data-structure-to-xml python recipes on the web, such as ActiveState Recipe #577268.
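A minimal sketch of that style of wrapping (this is not the actual xpath.py code, and the escaping here is my own addition):

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

def wrap_raw(raw):
    """Wrap an event's _raw value in a <data> element so that it can be
    parsed as valid XML, similar in spirit to what xpath.py does."""
    return "<data>%s</data>" % escape(raw)

# the wrapped value round-trips through an XML parser
node = ET.fromstring(wrap_raw("a < b"))
```

Without the escape call, any _raw value containing <, > or & would produce invalid XML, which is one reason wrapping alone is not the same as the per-field transformation above.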

There's also another way, described in the Splunk sdk as follows:
To use the reader, instantiate :class:`ResultsReader` on a search result stream
as follows:::

    reader = ResultsReader(result_stream)
    for item in reader:
        print(item)

We try to do it this way :
rrajamani-mbp15r:splunkb rrajamani$ cat etc/system/local/commands.conf
[smplcmd]
filename = smplcmd.py
streaming = true
local = true
retainsevents = true
overrides_timeorder = false
supports_rawargs = true

# untested
#!/usr/bin/python
import splunk.Intersplunk as si

if __name__ == '__main__':
    try:
        keywords, options = si.getKeywordsAndOptions()
        results, dummyresults, settings = si.getOrganizedResults()
        myxml = "<searchResults>"
        fields = ["host", "source", "sourcetype", "_raw", "_time"]
        outfield = options.get('outfield', 'xml')
        for result in results:
            element = "<searchResult>"
            for i in fields:
                field = options.get('field', str(i))
                val = result.get(field, None)
                if val != None:
                    element += "<" + str(field).strip() + ">" + str(val) + "</" + str(field).strip() + ">"
            element += "</searchResult>"
            myxml += element
        myxml += "</searchResults>"
        if results:
            # store the complete xml in the outfield of the last result
            results[-1][outfield] = myxml
        si.outputResults(results)
    except Exception, e:
        import traceback
        stack = traceback.format_exc()
        si.generateErrorResults("Error '%s'. %s" % (e, stack))

Friday, July 25, 2014

Today I'm going to talk about writing custom search commands in python. You can use them with search operators in Splunk this way:
index=_internal | head 1 | smplcmd

rrajamani-mbp15r:splunkb rrajamani$ cat etc/system/local/commands.conf
[smplcmd]
filename = smplcmd.py
streaming = true
local = true
retainsevents = true
overrides_timeorder = false
supports_rawargs = true


rrajamani-mbp15r:splunkb rrajamani$ cat ./etc/apps/search/bin/smplcmd.py
#!/usr/bin/python
import splunk.Intersplunk as si
import time
if __name__ == '__main__':
    try:
        keywords,options = si.getKeywordsAndOptions()
        defaultval = options.get('default', None)
        results,dummyresults,settings = si.getOrganizedResults()
        # pass through
        si.outputResults(results)
    except Exception, e:
        import traceback
        stack =  traceback.format_exc()
        si.generateErrorResults("Error '%s'. %s" % (e, stack))

Next, we will write a custom command that transforms search results to xml.

This summer I'm going to devote a series of detailed posts to implementing Splunk entirely in .Net. Being a git-based developer, we will write some lightweight packages with NuGet and enforce test-driven development and continuous integration on a git repository to go with it. Effectively, we will build SplunkLite in .Net.

Wednesday, July 23, 2014

In tonight's post we continue the discussion on file security checks for path names. Some of these checks are internalized by the APIs of the operating system. The trouble with path names is that they come from untrusted users and, as with all strings, carry the risk of buffer overruns. In addition, a path might point to a device or pseudo-device location that passes for a path but can amount to a security breach. Even if the application runs with low privilege or does not require administrator privileges, failing to validate path names adequately on Windows will cause vulnerabilities that can be exploited. These include gaining access to the application or redirection to invoke malicious software; the application can be made to do something other than what it was intended for. Checks to safeguard against this include validating local and UNC paths as well as securing access with ACLs. Device driver, printer, and registry paths should be avoided. It is preferable to treat the path as opaque and have it interpreted by the OS APIs rather than parsing it ourselves. Some simple checks are not ruled out, though, and the level of security should be commensurate with the rest of the application; it is not right to block the window if the door is open. The choice of API also matters: for example, a single API call can perform most of the checks we want.
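As a sketch of the simple checks mentioned above (in Python; the prefixes and device names listed are an illustrative subset, and real validation should still defer to the OS APIs):

```python
RESERVED = {"CON", "PRN", "AUX", "NUL",
            "COM1", "LPT1"}  # illustrative subset of Windows device names

def looks_unsafe(path):
    """Reject a few obviously dangerous Windows path shapes before
    handing the path to the OS APIs for real interpretation."""
    if path.startswith("\\\\?\\") or path.startswith("\\\\.\\"):
        return True                 # verbatim / device namespace prefixes
    # last component without extension, tolerating either separator
    base = path.replace("\\", "/").rsplit("/", 1)[-1].split(".")[0].upper()
    if base in RESERVED:
        return True                 # device or pseudo-device name
    if len(path) > 260:
        return True                 # classic MAX_PATH overrun risk
    return False
```

The function is deliberately a coarse first gate; passing it does not make a path safe, which is why the actual interpretation should still be left to the operating system.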

Tuesday, July 22, 2014

We will discuss some configuration file entries for Splunk, particularly ones related to path specifiers, say for the certificates used to launch Splunk in https mode: their syntax, semantics, and migration issues. When Splunk is configured to run in https mode, the user sets a flag called enableSplunkWebSSL and two paths for the certificates - the private cert (privKeyPath) and the certification authority cert (caCertPath). The path specified with these keys is considered relative to the 'splunk_home' directory. However, users could choose to keep the certificates wherever they like, so the paths could include '..' specifiers. Paths could also start with '/', the specifier on unix-style machines, but these are generally not supported when the path is taken as relative, since the '/' prefix marks an absolute path.
Since the user can store certificates anywhere on the machine, the path could instead be read as an absolute path. This way the user can specify the path directly without the cumbersome '..' notation, and the paths will be treated the same as the other configuration keys for Splunk. Other than that, there are no advantages.
Now let's look at the caveats of converting relative paths to absolute ones.
First, if the keys were specified, then Splunk was working in https mode, so the certificates exist on the target. If the certificates are found under splunk_home, then during migration we can normalize the paths and convert them to absolute paths. If the certificates are found outside splunk_home by way of '..' entries in the path, then this too can be made absolute with something like os.path.normpath(os.path.join(os.getcwd(), path)) in the migration script. If the certificates are not found by either means, then these keys should be removed so that Splunk can launch in the default http mode (although this will constitute a change in behavior).
Now that absolute paths have been specified in the configuration files, splunkd can assume that these point directly to the certificates and need not prepend them with splunk_home. So it first checks whether the certificates are found at the path as given. Next it checks whether the certificates are found under splunk_home with the path specified. This step cannot be avoided, because we cannot rely on the migration script all the time; the user can change the settings anytime after first run. We could rely on the '/' prefix, since the migration script makes paths absolute with a '/' prefix, and if it is missing we proceed to look for the certificates under splunk_home. However, the '/' prefix is only for linux; on windows we don't have that luxury. Something like os.path.isabs(x) may need to be implemented and used by splunkd. Besides, paths on windows have several security issues: for example, we should not allow paths to begin with \\?\ or to name devices and pseudo-devices. Merely checking whether the path exists may not be enough. Besides, certificates should not be on remote machines.
Finally, with the new change to support absolute and relative paths, the splunkd process assumes that most paths encountered are absolute. These paths need to be checked for prefixes, length, and validity before the certificates are looked up under them. If the certificates are not found, either because they don't exist or because they are not accessible, then if the path is relative we look for the certificates under splunk_home, and if that doesn't work we error out.
if (absolute)
    check_and_return
if (relative)
    check_and_return
error_and_escape
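The fallback order above can be sketched in Python (resolve_cert_path and its error handling are illustrative placeholders; splunkd's real implementation is in C++):

```python
import os

def resolve_cert_path(path, splunk_home):
    """Resolve a configured certificate path: try it as given first
    (absolute after migration), then fall back to treating it as
    relative to splunk_home, then error out."""
    if os.path.isabs(path) and os.path.exists(path):
        return path
    # relative (or stale absolute) path: look under splunk_home
    candidate = os.path.normpath(os.path.join(splunk_home, path))
    if os.path.exists(candidate):
        return candidate
    raise ValueError("certificate not found for configured path: %r" % path)
```

On windows the os.path.isabs test would be replaced by the stricter prefix and device-name checks discussed above, but the two-step lookup with a final error is the same.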