Tuesday, July 15, 2014

In today's post, we look at two-mode networks as a method of study in social network analysis, following the Hanneman lectures. Breiger (1974) first highlighted the dual focus of social network analysis: individuals, through their agency, create social structure, and at the same time those structures impose constraints on and shape the behavior of the individuals embedded in them. Social network analysis measures relations at the micro level and uses them to infer the presence of structure at the macro level. For example, the ties of individuals (micro) allow us to infer cliques (macro).
The Davis study showed that there can be different levels of analysis. It records ties between actors and events, so the relation is not membership in a clique but affiliation. By seeing which actors participate in which events, we can infer the meaning of an event from the affiliations of its actors, while also seeing the influence of the event on the choices of the actors.
Further, we can see examples of this macro-micro social structure at different levels. This is referred to as nesting, where individuals are part of a social structure and that structure can itself be part of a larger structure. At each level of the nesting, there is tension between structure and agency, i.e. between the macro and the micro.
There are tools to examine this two-mode data, involving both qualitative and quantitative patterns. Take an example where we look at the contributions of donors to campaigns supporting and opposing ballot initiatives over a period of time: the data set has two modes, donors and initiatives. Binary data, recording whether there was a contribution or not, could describe what a donor did. Valued data could describe the relations between donors and initiatives using a simple ordinal scale.
A rectangular matrix of actors (rows) and events (columns) could describe this two-mode data.
This could then be converted into two one-mode data sets: an actor-by-actor data set, where the strength of the tie between two actors is the number of times they contributed to the same side of an initiative, and an initiative-by-initiative data set, where the tie between two initiatives is the number of donors they had in common.
To create actor-by-actor relations, we could use a cross-product method that takes each entry of actor A's row, multiplies it by the corresponding entry of actor B's row, and sums the results. This gives an indication of co-occurrence and works well with binary data, where each product is 1 only when both actors are present.
Instead of the cross-product, we could also take the minimum of the two values, which says that the tie is the weaker of the two actors' ties to the event. A sketch of both methods follows.
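As a small sketch of both conversion methods, assuming a binary or valued actor-by-event matrix, the hypothetical routine below computes the actor-by-actor ties either way; the donor data is made up for illustration.

using System;

class TwoModeToOneMode
{
    // affiliation[i, j] holds actor i's tie to event j (binary or valued).
    static int[,] ActorByActor(int[,] affiliation, bool useMinimum)
    {
        int actors = affiliation.GetLength(0);
        int events = affiliation.GetLength(1);
        var ties = new int[actors, actors];
        for (int a = 0; a < actors; a++)
            for (int b = 0; b < actors; b++)
                for (int e = 0; e < events; e++)
                    ties[a, b] += useMinimum
                        ? Math.Min(affiliation[a, e], affiliation[b, e]) // weaker of the two ties to the event
                        : affiliation[a, e] * affiliation[b, e];         // cross-product (co-occurrence for binary data)
        return ties;
    }

    static void Main()
    {
        // Three hypothetical donors by two initiatives, binary contributions.
        int[,] donorsByInitiatives = { { 1, 0 }, { 1, 1 }, { 0, 1 } };
        var crossProduct = ActorByActor(donorsByInitiatives, useMinimum: false);
        Console.WriteLine(crossProduct[0, 1]); // donors 0 and 1 contributed to one common initiative
    }
}

The same routine works for the initiative-by-initiative data set by running it on the transposed matrix.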
Two-mode data are sometimes stored in a second way, as a bipartite matrix. A bipartite matrix is one where the same rows as in the original matrix are added as additional columns, and the same columns as in the original matrix are added as additional rows. Actors and events are then treated as social objects at a single level of analysis.
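Continuing the sketch above, a method along these lines could build that bipartite matrix from the rectangular actor-by-event matrix:

// Builds a bipartite matrix from an n x m actor-by-event matrix:
// rows/columns 0..n-1 are actors, n..n+m-1 are events.
static int[,] ToBipartite(int[,] affiliation)
{
    int n = affiliation.GetLength(0), m = affiliation.GetLength(1);
    var bipartite = new int[n + m, n + m];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
        {
            bipartite[i, n + j] = affiliation[i, j]; // actor-to-event block
            bipartite[n + j, i] = affiliation[i, j]; // event-to-actor block (transpose)
        }
    return bipartite;
}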
This is different from a bipartite graph, which is a graph whose vertices can be decomposed into two disjoint sets such that no two vertices within the same set are adjacent (by adjacent, we mean joined by an edge). In the context of word similarity extraction, we used terms and their N-gram contexts as the two partitions and used random walks to connect them.


I will cover random walks in more detail.

Sunday, July 13, 2014

In this post, as in the previous one, we will continue to look at Splunk integration with SQL and NoSQL systems. Specifically, we will look at Log Parser and Splunk interaction. Splunk users know how to translate SQL queries to Splunk search queries; we use search operators for this. For non-Splunk users, we could provide Splunk as a data store with Log Parser as a SQL interface. Therefore, we will look into providing Splunk searchable data as a COM input to Log Parser. A COM input simply implements a few methods for Log Parser and abstracts the data store. These methods are:
OpenInput: opens your data source and sets up any initial environment settings
GetFieldCount: returns the number of fields that the plugin provides
GetFieldName: returns the name of a specified field
GetFieldType: returns the datatype of a specified field
GetValue: returns the value of a specified field
ReadRecord: reads the next record from your data source
CloseInput: closes the data source and cleans up any environment settings
Together, Splunk and Log Parser bring the power of Splunk to Log Parser users without requiring them to know Splunk search commands. At the same time, those users still have the choice to search the Splunk indexes directly. The ability to use SQL makes Splunk more approachable and inviting to Windows users.

<SCRIPTLET>
  <registration
    Description="Splunk Input Log Parser Scriptlet"
    Progid="Splunk.Input.LogParser.Scriptlet"
    Classid="{fb947990-aa8c-4de5-8ff3-32a59fb66a6c}"
    Version="1.00"
    Remotable="False" />
  <comment>
  EXAMPLE: logparser "SELECT * FROM MAIN" -i:COM -iProgID:Splunk.Input.LogParser.Scriptlet
  </comment>
  <implements id="Automation" type="Automation">
    <method name="OpenInput">
      <parameter name="strValue"/>
    </method>
    <method name="GetFieldCount" />
    <method name="GetFieldName">
      <parameter name="intFieldIndex"/>
    </method>
    <method name="GetFieldType">
      <parameter name="intFieldIndex"/>
    </method>
    <method name="ReadRecord" />
    <method name="GetValue">
      <parameter name="intFieldIndex"/>
    </method>
    <method name="CloseInput">
      <parameter name="blnAbort"/>
    </method>
  </implements>
  <SCRIPT LANGUAGE="VBScript">

Option Explicit

Const MAX_RECORDS = 5

Dim objResultDictionary
Dim objResultsSection, objResultsCollection
Dim objSearchResultsElement
Dim objResultsElement, objResultElement
Dim intResultElementPos, intResult, intRecordIndex
Dim clsResult

' --------------------------------------------------------------------------------
' Open the input Result.
' --------------------------------------------------------------------------------

Public Function OpenInput(strValue)
  intRecordIndex = -1
  Set objResultDictionary = CreateObject("Scripting.Dictionary")
  ' GetSearchResults is assumed to wrap the Splunk COM component described later in this post.
  Set objResultsSection = GetSearchResults("index=main")
  Set objResultsCollection = objResultsSection.Collection
  If IsNumeric(strValue) Then
    intResultElementPos = FindElement(objResultsCollection, "Result", Array("id", strValue))
  Else
    intResultElementPos = FindElement(objResultsCollection, "Result", Array("name", strValue))
  End If
  If intResultElementPos > -1 Then
    Set objResultElement = objResultsCollection.Item(intResultElementPos)
    Set objSearchResultsElement = objResultElement.ChildElements.Item("SearchResults")
    Set objResultsElement = objSearchResultsElement.ChildElements.Item("SearchResult").Collection
    For intResult = 0 To CLng(objResultsElement.Count) - 1
      Set objResultElement = objResultsElement.Item(intResult)
      Set clsResult = New Result
      clsResult.Timestamp = objResultElement.GetPropertyByName("timestamp").Value
      clsResult.Host = objResultElement.GetPropertyByName("host").Value
      clsResult.Source = objResultElement.GetPropertyByName("source").Value
      clsResult.SourceType = objResultElement.GetPropertyByName("sourcetype").Value
      clsResult.Raw = objResultElement.GetPropertyByName("raw").Value
      objResultDictionary.Add intResult, clsResult
    Next
  End If
End Function

' --------------------------------------------------------------------------------
' Close the input Result.
' --------------------------------------------------------------------------------

Public Function CloseInput(blnAbort)
  intRecordIndex = -1
  objResultDictionary.RemoveAll
End Function

' --------------------------------------------------------------------------------
' Return the count of fields.
' --------------------------------------------------------------------------------

Public Function GetFieldCount()
    GetFieldCount = 5
End Function

' --------------------------------------------------------------------------------
' Return the specified field's name.
' --------------------------------------------------------------------------------

Public Function GetFieldName(intFieldIndex)
    Select Case CInt(intFieldIndex)
        Case 0:
            GetFieldName = "Timestamp"
        Case 1:
            GetFieldName = "Host"
        Case 2:
            GetFieldName = "Source"
        Case 3:
            GetFieldName = "Sourcetype"
        Case 4:
            GetFieldName = "Raw"
        Case Else
            GetFieldName = Null
    End Select
End Function

' --------------------------------------------------------------------------------
' Return the specified field's type.
' --------------------------------------------------------------------------------

Public Function GetFieldType(intFieldIndex)
    ' Define the field type constants.
    Const TYPE_STRING   = 1
    Const TYPE_REAL      = 2
    Const TYPE_TIMESTAMP    = 3
    Const TYPE_NULL = 4
    Select Case CInt(intFieldIndex)
        Case 0:
            GetFieldType = TYPE_TIMESTAMP
        Case 1:
            GetFieldType = TYPE_STRING
        Case 2:
            GetFieldType = TYPE_STRING
        Case 3:
            GetFieldType = TYPE_STRING
        Case 4:
            GetFieldType = TYPE_STRING   
        Case Else
            GetFieldType = Null
    End Select
End Function

' --------------------------------------------------------------------------------
' Return the specified field's value.
' --------------------------------------------------------------------------------

Public Function GetValue(intFieldIndex)
  If objResultDictionary.Count > 0 Then
    Select Case CInt(intFieldIndex)
        Case 0:
            GetValue = objResultDictionary(intRecordIndex).Timestamp
        Case 1:
            GetValue = objResultDictionary(intRecordIndex).Host
        Case 2:
            GetValue = objResultDictionary(intRecordIndex).Source
        Case 3:
            GetValue = objResultDictionary(intRecordIndex).SourceType
        Case 4:
            GetValue = objResultDictionary(intRecordIndex).Raw
        Case Else
            GetValue = Null
    End Select
  End If
End Function
 
' --------------------------------------------------------------------------------
' Read the next record, and return true or false if there is more data.
' --------------------------------------------------------------------------------

Public Function ReadRecord()
  If objResultDictionary.Count > 0 Then
    If intRecordIndex < (objResultDictionary.Count - 1) Then
      intRecordIndex = intRecordIndex + 1
      ReadRecord = True
    Else
      ReadRecord = False
    End If
  Else
    ReadRecord = False
  End If
End Function

Class Result
  Public Timestamp
  Public Host
  Public Source
  Public SourceType
  Public Raw
End Class

  </SCRIPT>

</SCRIPTLET>

Scriptlet courtesy: Robert McMurray's blog


I will provide a class library for the COM callable wrapper to Splunk searchable data in C#.

The COM library that returns the search results can implement methods like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Splunk;
using SplunkSDKHelper;
using System.Xml;

namespace SplunkComponent
{

    [System.Runtime.InteropServices.ComVisible(true)]
    public class SplunkComponent
    {
        public SplunkComponent()
        {
            // Load connection info for Splunk server in .splunkrc file.
            var cli = Command.Splunk("search");
            cli.AddRule("search", typeof(string), "search string");
            cli.Parse(new string[] {"--search=\"index=main\""});
            if (!cli.Opts.ContainsKey("search"))
            {
                System.Console.WriteLine("Search query string required, use --search=\"query\"");
                Environment.Exit(1);
            }

            var service = Service.Connect(cli.Opts);
            var jobs = service.GetJobs();
            job = jobs.Create((string)cli.Opts["search"]);
            while (!job.IsDone)
            {
                System.Threading.Thread.Sleep(1000);
            }
        }

        [System.Runtime.InteropServices.ComVisible(true)]
        public string GetAllResults()
        {
            var outArgs = new JobResultsArgs
            {
                OutputMode = JobResultsArgs.OutputModeEnum.Xml,

                // Return all entries.
                Count = 0
            };

            using (var stream = job.Results(outArgs))
            {
                var setting = new XmlReaderSettings
                {
                    ConformanceLevel = ConformanceLevel.Fragment,
                };

                using (var rr = XmlReader.Create(stream, setting))
                {
                    return rr.ReadOuterXml();
                }
            }
        }

        private Job job { get; set; }
    }
}

https://github.com/ravibeta/csharpexamples/tree/master/SplunkComponent

Today we look at a comparison between Splunk clustering and a Hadoop instance. In Hadoop, MapReduce is a high-performance parallel data processing technique. It does not guarantee ACID properties and supports forward-only parsing. Data is stored in Hadoop such that the column names, column count and column datatypes don't matter. The data is retrieved in two steps: a Map function and a Reduce function. The Map function selects keys from each line along with the values to hold, resulting in a big hashtable, and the Reduce function aggregates the results. The database stores these key-values as columns in a column family, and each row can have more than one column family. Splunk uses key maps to index its data but has a lot to do in terms of Map-Reduce and database features.

Splunk stores events. Its indexing is about events, together with their raw data, their index files and metadata. These are stored in directories organized by age, called buckets. Splunk clustering is about keeping multiple copies of data to prevent data loss and to improve data availability for searching. Search heads coordinate searches across all the peer nodes.
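As a toy illustration of the two steps, and not of how either Hadoop or Splunk implements them, the following counts events per host from a few made-up log lines:

using System;
using System.Collections.Generic;
using System.Linq;

class MapReduceSketch
{
    static void Main()
    {
        string[] lines = { "host=web1 status=200", "host=web2 status=500", "host=web1 status=404" };

        // Map: emit a (key, value) pair per line -- here (host, 1).
        var mapped = lines.Select(line =>
        {
            var host = line.Split(' ')[0].Split('=')[1];
            return new KeyValuePair<string, int>(host, 1);
        });

        // Reduce: aggregate the values for each key.
        var reduced = mapped.GroupBy(kv => kv.Key)
                            .Select(g => new { Host = g.Key, Count = g.Sum(kv => kv.Value) });

        foreach (var r in reduced)
            Console.WriteLine(r.Host + ": " + r.Count);
    }
}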

Saturday, July 12, 2014

In this post, we talk about support for clustering in Splunk. Clustering is about replicating buckets and searchable data to tolerate failures in a distributed environment. There are two configuration settings to aid with replication: one determines the replication of the raw data and the other determines the replication of searchable data. Both live in the configuration file on the master. The master talks to the peers over HTTP; peers talk to each other over S2S. The design is such that the peers talk to the master and vice versa, but the peers don't need to talk to one another. A basic configuration involves forwarders sending data to peers and search heads talking to both master and peers. The master does most of the management and the peers are the work horses. The hot buckets are created by the indexes, but clustering extends their names so as to differentiate them across nodes: a cluster-wide bucket id comprises index plus id plus guid. We replicate slices of data in these hot buckets.
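For reference, a minimal sketch of what that master-side configuration could look like, assuming the standard [clustering] stanza in server.conf; the values here are placeholders:

[clustering]
mode = master
replication_factor = 3
search_factor = 2
pass4SymmKey = yourSecretKey

Here replication_factor controls how many copies of the raw data the cluster keeps, and search_factor controls how many of those copies are searchable.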
We don't enforce clustering policy on standalone buckets. On each bucket roll, we inform the master. The master keeps track of the states and does 'fixup'. We schedule a 'fixup' on all failures.
Fixup is what happens when a node goes down and we lose the buckets it was working on.
Rebuilding was a big problem because it took a lot of time.
Fixup work is broken down into six different levels (streaming, data_safety, generation, replication factor, search factor and checksum).
We schedule the highest priority work at all times.
When peers come up, they get the latest bundle from the master.
When a cluster node goes down, we can avoid messy state by taking it offline.
There are two versions of offline:
the first waits for the master to complete (permanent);
the second allows rebalancing of primaries by informing the master while continuing to participate in searches until the master gets back to it.
The states are offline -> inputs (closed) -> wait -> done.
Primary means there is an in-memory bit mask for that generation.
Generation means snapshotting the states of the primaries across the system.
The master tracks which peers are participating in its current generation.
Each peer knows which generation it is a primary for.

Friday, July 11, 2014

In today's post we will continue our discussion. We will explore and describe:
Define SQL integration
Define user defined type system
Define common type system
Define user defined search operator
Define programmable operator
Define user programming interface for type system
Before we look at SQL integration, we want to look at the ways Splunk uses SQLite. With that disclaimer and rain check, I will proceed to what I want: to create SQL queries for externalized search and to build types out of fields.
First we are looking at a handful of SQL queries.
Next, we use the same schema as the key maps we already maintain. For example, a query such as SELECT host, count(*) FROM main GROUP BY host corresponds roughly to the Splunk search index=main | stats count by host.
I want to describe the use of a user defined search processor. Almost all search processors implement a set of common methods, and these methods already describe the expected behavior for any processor that handles input and output of search results. If these methods were exposed to the user via a programmable interface, then users could plug in a processor of their own. To expose the methods, we need callbacks that we can invoke, and these can be registered as a REST API by the user. The internal implementation of this custom search processor can then make the REST calls and marshal the parameters and the results.
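As a rough sketch of this idea, and not of any existing Splunk API, the common methods could be captured in an interface, with one implementation that simply forwards each result to a user-registered REST callback; every name below is hypothetical.

using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical contract mirroring the common methods a search processor implements.
public interface ISearchProcessor
{
    IEnumerable<IDictionary<string, string>> Execute(IEnumerable<IDictionary<string, string>> results);
}

// A processor that delegates its work to a user-registered REST callback.
public class RestCallbackProcessor : ISearchProcessor
{
    private readonly string callbackUrl; // registered by the user, e.g. http://localhost:8080/myprocessor

    public RestCallbackProcessor(string callbackUrl)
    {
        this.callbackUrl = callbackUrl;
    }

    public IEnumerable<IDictionary<string, string>> Execute(IEnumerable<IDictionary<string, string>> results)
    {
        foreach (var result in results)
        {
            // Marshal each result as a query string and let the callback transform it.
            var query = string.Join("&", Encode(result));
            using (var client = new WebClient())
            {
                var response = client.DownloadString(callbackUrl + "?" + query);
                yield return new Dictionary<string, string> { { "raw", response } };
            }
        }
    }

    private static IEnumerable<string> Encode(IDictionary<string, string> result)
    {
        foreach (var kv in result)
            yield return Uri.EscapeDataString(kv.Key) + "=" + Uri.EscapeDataString(kv.Value);
    }
}

The marshalling here is deliberately crude (query-string encoding); a real implementation would need to agree on a request and response format with the registered callback.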

Thursday, July 10, 2014

Another search processor for Splunk could be type conversion, that is, support for user defined types in the search bar. Today we have fields that we can extract from the data. Fields are like key-value pairs, so users define their queries in terms of keys and values. Splunk also indexes key-value pairs so that look-ups are easier. Key-value pairs are very helpful for associations between different SearchResults and for working with different processors. However, support for user defined types could change the game and become a tremendous benefit to the user, because a user defined type associates not just one field but several fields with the data, in a way the user defines. This is different from tags. Tags can also come in handy for labeling and defining the groups the user cares about, but support for types and user defined types goes beyond mere fields. It is quite involved, in that it affects the parser, the indexer, the search result retrieval and the display.
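One way to picture such a type, purely as an illustration and not a Splunk construct, is a named grouping of fields together with a check that a search result carries all of them:

using System.Collections.Generic;
using System.Linq;

// Hypothetical user defined type: a named set of fields the user cares about together.
public class UserDefinedType
{
    public string Name { get; set; }
    public List<string> Fields { get; set; }

    // A search result (key-value pairs) conforms to the type if it carries every field.
    public bool Matches(IDictionary<string, string> searchResult)
    {
        return Fields.All(f => searchResult.ContainsKey(f));
    }
}

// Example: a "WebRequest" type grouping three extracted fields.
// var webRequest = new UserDefinedType { Name = "WebRequest", Fields = new List<string> { "clientip", "uri", "status" } };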
But first let us look at a processor that can support extract, transform and load (ETL) kinds of operations. We support these via search pipeline operators, where the search results are piped to different operators that can handle one or more of the said operations. For example, if we wanted to transform the raw data behind the search results into XML, we could have an 'xml' processor that transforms them into a single result with the corresponding XML as the raw data. This lends itself to other data transformations or to XML-style querying by downstream systems.

XML, as we know, is a different form of data than tabular or relational. Tabular or relational data can have compositions that describe entities and types. We don't have a way to capture the type information today, but that doesn't mean we cannot plug into a system that does. For example, database servers handle types and entities. If Splunk had a connector where it could send XML downstream to a SQLite database and shred the XML into relational data, then Splunk wouldn't even have the onus of implementing a type-based system. It could then choose to implement just the SQL queries that let the downstream databases handle it, and those SQL queries could even be saved and reused later. Splunk uses SQLite today; however, the indexes that Splunk maintains are different from the indexes that a database maintains. Therefore, extract, transform and load of data to downstream systems could be very helpful. Today atom feeds may be one way to do that, but search results are even more intrinsic to Splunk.

In this post, I hope I can address some of the following objectives, otherwise I will try to elaborate over them in the next few.
Define why we need an xml operator
The idea behind converting tables or CSVs to XML is that it provides another avenue for integration with data systems that rely on that format. Why are there special systems using XML data? Because data in XML can be validated independently with XSD, provides hierarchical and well-defined tags, enables a very different and useful querying system, and so on. Until now, Splunk relied on offline, file-based dumping of XML. Such offline methods did not improve the workflow users have when integrating with systems such as a database. To facilitate the extract, transform and load of search results into databases, one has to have better control over the search results. XML is easy to import and shred in databases for further analysis or archival. The ability to integrate Splunk with a database does not diminish the value proposition of Splunk. If anything, it improves the usability and customer base of Splunk by adding customers who rely on a database for analysis.
Define SQL integration
Define user defined type system
Define common type system
Define user defined search operator
Define programmable operator
Define user programming interface for type system

Tuesday, July 8, 2014

I wonder why we don't have a search operator that translates the search results to XML?

I'm thinking something like this :
Conversion from:
Search Result 1 : key1=value1, key2=value2, key3=value3
Search Result 2 : key1=value1, key2=value2, key3=value3
Search Result 3 : key1=value1, key2=value2, key3=value3

To:
<SearchResults>
  <SearchResult1>
  <key1>value1 </key1>
  <key2> value2 </key2>
  <key3> value3 </key3>
  </SearchResult1>
:
</SearchResults>

This could even operate on tables and convert them to XML.

And it seems straightforward to implement a Search processor that does this.
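Here is a minimal sketch of such a conversion, assuming each search result is just a dictionary of key-value pairs and that the field names are valid XML element names; the shape follows the example above rather than any existing Splunk operator.

using System.Collections.Generic;
using System.Xml.Linq;

class XmlOperatorSketch
{
    // Wraps a list of key-value results in the <SearchResults> shape shown above.
    static XElement ToXml(IList<IDictionary<string, string>> results)
    {
        var root = new XElement("SearchResults");
        for (int i = 0; i < results.Count; i++)
        {
            var resultElement = new XElement("SearchResult" + (i + 1));
            foreach (var kv in results[i])
                resultElement.Add(new XElement(kv.Key, kv.Value)); // assumes keys are valid element names
            root.Add(resultElement);
        }
        return root;
    }
}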


The main thing to watch out for is memory growth during the XML conversion. The number of search results can be arbitrary, potentially causing unbounded growth if the XML is built as a string, so we are better off writing it to a file. At the same time, the new result with the converted XML is useful only when the format and content of the XML are required in a particular manner and serve as input to other search operators. Otherwise, the atom feed of Splunk already has an XML output mode.