Sunday, July 13, 2014

In this post, as in the previous one, we continue to look at Splunk integration with SQL and NoSQL systems. Specifically, we look at the interaction between Log Parser and Splunk. Splunk users know how to translate SQL queries to Splunk search queries; we use search operators for this. For non-Splunk users, we could present Splunk as a data store with Log Parser as a SQL interface. Therefore, we will look into providing Splunk searchable data as a COM input to Log Parser. A COM input simply implements a few methods for Log Parser and abstracts the data store. These methods are:
OpenInput: opens your data source and sets up any initial environment settings
GetFieldCount: returns the number of fields that the plugin provides
GetFieldName: returns the name of a specified field
GetFieldType: returns the data type of a specified field
GetValue: returns the value of a specified field
ReadRecord: reads the next record from your data source
CloseInput: closes the data source and cleans up any environment settings
Together, Splunk and Log Parser bring the power of Splunk to Log Parser users without requiring them to know Splunk search commands. At the same time, users still have the choice to search the Splunk indexes directly. The ability to use SQL makes Splunk more familiar and inviting to Windows users.

<SCRIPTLET>
  <registration
    Description="Splunk Input Log Parser Scriptlet"
    Progid="Splunk.Input.LogParser.Scriptlet"
    Classid="{fb947990-aa8c-4de5-8ff3-32a59fb66a6c}"
    Version="1.00"
    Remotable="False" />
  <comment>
  EXAMPLE: logparser "SELECT * FROM MAIN" -i:COM -iProgID:Splunk.Input.LogParser.Scriptlet
  </comment>
  <implements id="Automation" type="Automation">
    <method name="OpenInput">
      <parameter name="strValue"/>
    </method>
    <method name="GetFieldCount" />
    <method name="GetFieldName">
      <parameter name="intFieldIndex"/>
    </method>
    <method name="GetFieldType">
      <parameter name="intFieldIndex"/>
    </method>
    <method name="ReadRecord" />
    <method name="GetValue">
      <parameter name="intFieldIndex"/>
    </method>
    <method name="CloseInput">
      <parameter name="blnAbort"/>
    </method>
  </implements>
  <SCRIPT LANGUAGE="VBScript">

Option Explicit

Const MAX_RECORDS = 5

Dim objResultDictionary
Dim objResultsSection, objResultsCollection
Dim objResultElement, objResultsElement
Dim objSearchResultsElement
Dim intResultElementPos, intResult, intRecordIndex
Dim clsResult
Dim intRecordCount

' --------------------------------------------------------------------------------
' Open the input Result.
' --------------------------------------------------------------------------------

Public Function OpenInput(strValue)
  intRecordIndex = -1
  Set objResultDictionary = CreateObject("Scripting.Dictionary")
  Set objResultsSection = GetSearchResults("index=main")
  Set objResultsCollection = objResultsSection.Collection
  If IsNumeric(strValue) Then
    intResultElementPos = FindElement(objResultsCollection, "Result", Array("id", strValue))
  Else
    intResultElementPos = FindElement(objResultsCollection, "Result", Array("name", strValue))
  End If
  If intResultElementPos > -1 Then
    Set objResultElement = objResultsCollection.Item(intResultElementPos)
    Set objSearchResultsElement = objResultElement.ChildElements.Item("SearchResults")
    Set objResultsElement = objSearchResultsElement.ChildElements.Item("SearchResult").Collection
    For intResult = 0 To CLng(objResultsElement.Count)-1
       Set objResultElement = objResultsElement.Item(intResult)
       Set clsResult = New Result
       clsResult.Timestamp = objResultElement.GetPropertyByName("timestamp").Value
       clsResult.Host = objResultElement.GetPropertyByName("host").Value
       clsResult.Source = objResultElement.GetPropertyByName("source").Value
       clsResult.SourceType = objResultElement.GetPropertyByName("sourcetype").Value
       clsResult.Raw = objResultElement.GetPropertyByName("raw").Value
       objResultDictionary.Add intResult, clsResult
    Next
  End If
End Function

' --------------------------------------------------------------------------------
' Close the input Result.
' --------------------------------------------------------------------------------

Public Function CloseInput(blnAbort)
  intRecordIndex = -1
  objResultDictionary.RemoveAll
End Function

' --------------------------------------------------------------------------------
' Return the count of fields.
' --------------------------------------------------------------------------------

Public Function GetFieldCount()
    GetFieldCount = 5
End Function

' --------------------------------------------------------------------------------
' Return the specified field's name.
' --------------------------------------------------------------------------------

Public Function GetFieldName(intFieldIndex)
    Select Case CInt(intFieldIndex)
        Case 0:
            GetFieldName = "Timestamp"
        Case 1:
            GetFieldName = "Host"
        Case 2:
            GetFieldName = "Source"
        Case 3:
            GetFieldName = "Sourcetype"
        Case 4:
            GetFieldName = "Raw"
        Case Else
            GetFieldName = Null
    End Select
End Function

' --------------------------------------------------------------------------------
' Return the specified field's type.
' --------------------------------------------------------------------------------

Public Function GetFieldType(intFieldIndex)
    ' Define the field type constants.
    Const TYPE_STRING    = 1
    Const TYPE_REAL      = 2
    Const TYPE_TIMESTAMP = 3
    Const TYPE_NULL      = 4
    Select Case CInt(intFieldIndex)
        Case 0:
            GetFieldType = TYPE_TIMESTAMP
        Case 1:
            GetFieldType = TYPE_STRING
        Case 2:
            GetFieldType = TYPE_STRING
        Case 3:
            GetFieldType = TYPE_STRING
        Case 4:
            GetFieldType = TYPE_STRING   
        Case Else
            GetFieldType = Null
    End Select
End Function

' --------------------------------------------------------------------------------
' Return the specified field's value.
' --------------------------------------------------------------------------------

Public Function GetValue(intFieldIndex)
  If objResultDictionary.Count > 0 Then
    Select Case CInt(intFieldIndex)
        Case 0:
            GetValue = objResultDictionary(intRecordIndex).Timestamp
        Case 1:
            GetValue = objResultDictionary(intRecordIndex).Host
        Case 2:
            GetValue = objResultDictionary(intRecordIndex).Source
        Case 3:
            GetValue = objResultDictionary(intRecordIndex).SourceType
        Case 4:
            GetValue = objResultDictionary(intRecordIndex).Raw
        Case Else
            GetValue = Null
    End Select
  End If
End Function
 
' --------------------------------------------------------------------------------
' Read the next record, and return true or false if there is more data.
' --------------------------------------------------------------------------------

Public Function ReadRecord()
  ReadRecord = False
  If objResultDictionary.Count > 0 Then
    If intRecordIndex < (objResultDictionary.Count - 1) Then
        intRecordIndex = intRecordIndex + 1
        ReadRecord = True
    End If
  End If
End Function

Class Result
  Public Timestamp
  Public Host
  Public Source
  Public SourceType
  Public Raw
End Class

  </SCRIPT>

</SCRIPTLET>

Scriptlet courtesy of Robert McMurray's blog.


I will provide a class library for the COM-callable wrapper to Splunk searchable data in C#.

The COM library that returns the search results can implement methods like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Splunk;
using SplunkSDKHelper;
using System.Xml;

namespace SplunkComponent
{

    // Expose this class to COM so that the Log Parser scriptlet can create it.
    [System.Runtime.InteropServices.ComVisible(true)]
    public class SplunkComponent
    {
        public SplunkComponent()
        {
            // Load connection info for Splunk server in .splunkrc file.
            var cli = Command.Splunk("search");
            cli.AddRule("search", typeof(string), "search string");
            cli.Parse(new string[] {"--search=\"index=main\""});
            if (!cli.Opts.ContainsKey("search"))
            {
                System.Console.WriteLine("Search query string required, use --search=\"query\"");
                Environment.Exit(1);
            }

            var service = Service.Connect(cli.Opts);
            var jobs = service.GetJobs();
            job = jobs.Create((string)cli.Opts["search"]);
            while (!job.IsDone)
            {
                System.Threading.Thread.Sleep(1000);
            }
        }

        // Expose this method to COM callers such as the scriptlet.
        [System.Runtime.InteropServices.ComVisible(true)]
        public string GetAllResults()
        {
            var outArgs = new JobResultsArgs
            {
                OutputMode = JobResultsArgs.OutputModeEnum.Xml,

                // Return all entries.
                Count = 0
            };

            using (var stream = job.Results(outArgs))
            {
                var setting = new XmlReaderSettings
                {
                    ConformanceLevel = ConformanceLevel.Fragment,
                };

                using (var rr = XmlReader.Create(stream, setting))
                {
                    // Position the reader on the first element before reading its XML.
                    rr.MoveToContent();
                    return rr.ReadOuterXml();
                }
            }
        }

        private Job job { get; set; }
    }
}

https://github.com/ravibeta/csharpexamples/tree/master/SplunkComponent
Today we look at a comparison between Splunk clustering and a Hadoop instance. In Hadoop, for instance, MapReduce is a high-performance parallel data processing technique. It does not guarantee ACID properties and supports forward-only parsing. Data is stored in Hadoop such that the column names, column count and column datatypes don't matter. The data is retrieved in two steps, with a Map function and a Reduce function. The Map function selects keys from each line and the values to hold, resulting in a big hashtable. The Reduce function aggregates the results. The database stores these key-values as columns in a column family, and each row can have more than one column family. Splunk uses key maps to index the data but has a lot to do in terms of Map-Reduce and database features. Splunk stores events. Its indexing is about events, together with their raw data, their index files and metadata. These are stored in directories organized by age called buckets. Splunk clustering is about keeping multiple copies of data to prevent data loss and to improve data availability for searching. Search heads coordinate searches across all the peer nodes.
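To make the two steps concrete, here is a minimal C# sketch of the map and reduce functions just described. This is not Hadoop code; the log lines and the key extraction are made up for illustration.

using System;
using System.Collections.Generic;
using System.Linq;

class MapReduceSketch
{
    static void Main()
    {
        // Hypothetical log lines; in Hadoop these would come from input splits on HDFS.
        var lines = new[]
        {
            "host=web01 status=200",
            "host=web02 status=500",
            "host=web01 status=200"
        };

        // Map: select a key from each line and a value to hold (here, a count of 1).
        var mapped = lines.Select(line =>
        {
            var host = line.Split(' ')[0].Split('=')[1];
            return new KeyValuePair<string, int>(host, 1);
        });

        // Reduce: aggregate the values per key, producing the big hashtable of results.
        var reduced = mapped
            .GroupBy(kv => kv.Key)
            .ToDictionary(g => g.Key, g => g.Sum(kv => kv.Value));

        foreach (var kv in reduced)
            Console.WriteLine("{0} -> {1}", kv.Key, kv.Value);
    }
}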

Saturday, July 12, 2014

In this post, we talk about support for clustering in Splunk. Clustering is about replicating buckets and searchable data to tolerate failures in a distributed environment. There are two configuration settings to aid with replication: one determines the replication of the raw data and the other determines the replication of searchable data. Both are in the configuration file on the master. The master talks to the peers over HTTP. Peers talk to each other over S2S. The design is such that the peers talk to the master and vice versa, but the peers don't need to talk to one another. In the basic configuration, forwarders send data to the peers, and search heads talk to both the master and the peers; the master does most of the management and the peers are the workhorses. The hot buckets are created by the indexes, but clustering extends their names so as to differentiate them across nodes. We have a cluster-wide bucket id that comprises index plus id plus guid. We replicate by slices of data in these hot buckets.
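For illustration, the two settings just mentioned are the replication factor (copies of the raw data) and the search factor (copies of the searchable data), set on the master. A minimal sketch of the master-side stanza in server.conf could look like the following; the attribute names follow the Splunk 6 clustering documentation, but treat the values as examples and verify them against your version.

[clustering]
mode = master
# number of copies of the raw data kept across the peers
replication_factor = 3
# number of copies of the searchable (indexed) data kept across the peers
search_factor = 2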
We don't enforce a clustering policy on standalone buckets. On each bucket roll, we inform the master. The master keeps track of the states and does 'fixup'. We schedule a 'fixup' on all failures.
Fixup is what happens when a node goes down and we lose the buckets it was working on.
Rebuilding was a big problem because it took a lot of time.
Fixup work is broken down into six different levels (streaming, data safety, generation, replication factor, search factor and checksum).
We schedule the highest priority work at all times.
When peers come up, they get the latest bundle from the master.
When a cluster node goes down, we can avoid a messy state by taking it offline.
There are two versions of offline:
one is to wait for the master to complete (permanent);
the second is to allow rebalancing of primaries by informing the master while continuing to participate in searches until the master gets back to you.
The states are offline -> inputs (closed) -> wait -> done.
Primary means there is an in-memory bit mask for that generation.
A generation means snapshotting the states of the primaries across the system.
The master tracks which peers are participating in its current generation.
Each peer knows which generation it is a primary for.

Friday, July 11, 2014

In today's post we will continue our discussion. We will explore and describe:
Define SQL integration
Define user defined type system
Define common type system
Define user defined search operator
Define programmable operator
Define user programming interface for type system
Before we look at SQL integration, we want to look at the ways Splunk uses SQLite. With that disclaimer and rain check, I will proceed to what I want: to create SQL queries for externalized search and to build types out of fields.
First we are looking at a handful of SQL queries.
Next, we use the same schema as we have key maps.
I want to describe the use of a user defined search processor. Almost all search processors implement a set of common methods. These methods already describe a set of expected behavior for any processor that handles input and output of search results. If these methods were exposed to the user via a programmable interface, then users could plug in any processor of their own. To expose these methods to the user, we need callbacks that we can invoke, and these can be registered as a REST API by the user. The internal implementation of this custom search processor can then make these REST calls and marshal the parameters and the results.
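As a sketch of the idea, suppose the common methods are reduced to a setup call and an execute call, and the user registers a callback URL; the custom processor could then marshal the results over REST as shown below. The interface and names here are hypothetical, not an existing Splunk API.

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text;

// Hypothetical common interface that search processors are described as sharing.
public interface ISearchProcessor
{
    void Setup(IDictionary<string, string> args);   // process the operator's arguments
    string Execute(string searchResultsXml);        // consume results, produce results
}

// A processor that forwards the piped-in results to a user-registered callback.
public class RestCallbackProcessor : ISearchProcessor
{
    private Uri callbackUri;

    public void Setup(IDictionary<string, string> args)
    {
        // The user registers the callback, e.g. callback=https://example.com/myprocessor
        callbackUri = new Uri(args["callback"]);
    }

    public string Execute(string searchResultsXml)
    {
        using (var client = new HttpClient())
        {
            // Marshal the search results to the user's endpoint and return its response.
            var content = new StringContent(searchResultsXml, Encoding.UTF8, "text/xml");
            var response = client.PostAsync(callbackUri, content).Result;
            response.EnsureSuccessStatusCode();
            return response.Content.ReadAsStringAsync().Result;
        }
    }
}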

Thursday, July 10, 2014

Another search processor for Splunk could be type conversion, that is, support for user defined types in the search bar. Today we have fields that we can extract from the data. Fields are like key-value pairs, so users define their queries in terms of keys and values. Splunk also indexes key-value pairs so that their look-ups are easier. Key-value pairs are very helpful in associations between different SearchResults and in working with different processors. However, support for user defined types can change the game and become a tremendous benefit to the user. This is because user defined types associate not just one field but several fields with the data, and in a way the user defines. This is different from tags. Tags can also be helpful to the user for labeling and defining the groups he cares about. However, support for types and user defined types goes beyond mere fields. It is quite involved in that it affects the parser, the indexer, the search result retrieval and the display.
But first let us look at a processor that can support extract, transform and load kinds of operations. We support these via search pipeline operators, where the search results are piped to different operators that can handle one or more of the said operations. For example, if we wanted to transform the raw data behind the search results into XML, we could have an 'xml' processor that transforms it into a single result with the corresponding XML as the raw data. This lends itself to other data transformations or XML-style querying by downstream systems. XML, as we know, is a different form of data than tabular or relational. Tabular or relational data can have compositions that describe entities and types. We don't have a way to capture the type information today, but that doesn't mean we cannot plug into a system that does. For example, database servers handle types and entities. If Splunk were to have a connector where it could send XML downstream to a SQLite database and shred the XML into relational data, then Splunk doesn't even have the onus to implement a type-based system. It can then choose to implement just the SQL queries that let the downstream databases handle it. These SQL queries can even be saved and reused later. Splunk uses SQLite today. However, the indexes that Splunk maintains are different from the indexes that a database maintains. Therefore, extract, transform and load of data into downstream systems could be very helpful. Today atom feeds may be one way to do that, but search results are even more intrinsic to Splunk.

In this post, I hope I can address some of the following objectives; otherwise I will try to elaborate on them in the next few posts.
Define why we need an xml operator
The idea behind converting tables or CSVs to XML is that it provides another avenue for integration with data systems that rely on that format. Why are there special systems using XML data? Because data in XML can be validated independently with XSD, provides hierarchical and well-defined tags, enables a very different and useful querying system, and so on. Until now, Splunk relied on offline and file-based dumping of XML. Such offline methods did not improve the workflow users have when integrating with systems such as a database. To facilitate the extract, transform and load of search results into databases, one has to have better control over the search results. XML is easy to import and shred in databases for further analysis or archival. The ability to integrate Splunk with a database does not diminish the value proposition of Splunk. If anything, it improves the usability and customer base of Splunk by adding more customers who rely on databases for analysis.
Define SQL integration
Define user defined type system
Define common type system
Define user defined search operator
Define programmable operator
Define user programming interface for type system

Tuesday, July 8, 2014

I wonder why we don't have a search operator that translates the search results to XML.

I'm thinking of something like this:
Conversion from:
Search Result 1 : key1=value1, key2=value2, key3=value3
Search Result 2 : key1=value1, key2=value2, key3=value3
Search Result 3 : key1=value1, key2=value2, key3=value3

To:
<SearchResults>
  <SearchResult1>
  <key1>value1</key1>
  <key2>value2</key2>
  <key3>value3</key3>
  </SearchResult1>
:
</SearchResults>

This could even operate on tables and convert them to XML.

And it seems straightforward to implement a Search processor that does this.
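A minimal sketch of that conversion in C#, assuming each search result is available as a dictionary of key-value pairs and using the element naming from the illustration above:

using System.Collections.Generic;
using System.Xml.Linq;

public static class XmlOperatorSketch
{
    public static string ToXml(IList<IDictionary<string, string>> results)
    {
        var root = new XElement("SearchResults");
        for (int i = 0; i < results.Count; i++)
        {
            // One element per search result, e.g. <SearchResult1>, <SearchResult2>, ...
            var resultElement = new XElement("SearchResult" + (i + 1));
            foreach (var kv in results[i])
            {
                resultElement.Add(new XElement(kv.Key, kv.Value));
            }
            root.Add(resultElement);
        }
        return root.ToString();
    }
}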


The main thing to watch out for is memory growth during the XML conversion. The number of search results can be arbitrary, potentially causing unbounded growth of the XML string, so we are better off writing it to a file. At the same time, the new result with the converted XML is useful only when the format and content of the XML are required in a particular manner and serve as an input to other search operators. Otherwise, the atom feed of Splunk already has an XML output mode.
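If we do write to a file, streaming the XML out incrementally keeps memory bounded regardless of the number of results. A small sketch using XmlWriter; the file path and the dictionary shape of each result are assumptions for illustration.

using System.Collections.Generic;
using System.Xml;

public static class XmlFileWriterSketch
{
    public static void WriteToFile(IEnumerable<IDictionary<string, string>> results, string path)
    {
        var settings = new XmlWriterSettings { Indent = true };
        using (var writer = XmlWriter.Create(path, settings))
        {
            writer.WriteStartElement("SearchResults");
            int i = 0;
            foreach (var result in results)
            {
                // Each result is flushed to disk as it is written, so memory use stays flat.
                writer.WriteStartElement("SearchResult" + (++i));
                foreach (var kv in result)
                {
                    writer.WriteElementString(kv.Key, kv.Value);
                }
                writer.WriteEndElement();
            }
            writer.WriteEndElement();
        }
    }
}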

Monday, July 7, 2014

Today we review search processors in Splunk. There are several processors that can be invoked to get the desired search results. These often translate to the search operators in the expression and follow a pipeline model. The pipeline is a way to redirect the output of one operator into the input of another. All of these processors implement a similar execute method that takes SearchResults and SearchResultsInfo as arguments. The processors also have a setup and initialization method where they process the arguments to the operators. The model is simple and portable to any language.
We will now look at some of these processors in detail.
We have the Rex processor that implements the rex operations on the search results. If needed, it generates a field that contains the start/end offset of each match, and it creates a mapping from group id to key index if and only if it is not in sed mode.
We have the query suggestor operator, which suggests useful keywords to add to your search. This works by ignoring some keywords and keeping a list of samples.
The Head processor iterates over the results to display only the truncated set.
The Tail processor shows the last few results.
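As a sketch of that setup/execute model applied to the Head processor just described, with SearchResult, SearchResults and SearchResultsInfo as stand-in types rather than the internal Splunk classes:

using System.Collections.Generic;
using System.Linq;

// Stand-in types; the real SearchResults and SearchResultsInfo are internal to Splunk.
public class SearchResult { public IDictionary<string, string> Fields = new Dictionary<string, string>(); }
public class SearchResults : List<SearchResult> { }
public class SearchResultsInfo { /* bookkeeping about the search, omitted */ }

public class HeadProcessor
{
    private int limit = 10;

    // Setup/initialization: process the arguments given to the operator, e.g. "head count=5".
    public void Setup(IDictionary<string, string> args)
    {
        if (args.ContainsKey("count"))
            limit = int.Parse(args["count"]);
    }

    // Execute: consume the piped-in results and keep only the truncated set.
    public void Execute(SearchResults results, SearchResultsInfo info)
    {
        var truncated = results.Take(limit).ToList();
        results.Clear();
        results.AddRange(truncated);
    }
}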