Friday, July 18, 2014

Continuing from the previous post ...

The number of steps taken between any two nodes is called the hitting time. The hitting time between an unclassified term and a domain term can be averaged over all the walks that connect these two nodes.
Since the walks are selected based on transition probabilities, the most probable paths are selected first. The same pair of nodes can be connected by many paths, and the same unclassified term can be connected to many different domain terms.
The contextual similarity of a pair (n, m) can then be described as the relative frequency with which walks starting at n hit m, normalized over all domain terms linked to that start node.
This is calculated as Contextual Similarity L(n, m) = H(n, m) / Sum-over-m' H(n, m'), where m' ranges over the domain terms reachable from n.
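As a rough sketch, if the hitting counts gathered over the walks are kept in a nested map, the similarity is a single division. The term names and counts below are illustrative assumptions, not data from any real corpus:

# H[n][m]: how often walks from unclassified term n hit domain term m.
H = {"segfault": {"crash": 12, "error": 7, "login": 1}}

def contextual_similarity(H, n, m):
    # L(n, m) = H(n, m) / Sum-over-m' H(n, m')
    total = sum(H[n].values())
    return H[n][m] / total if total else 0.0

print(contextual_similarity(H, "segfault", "crash"))  # 12 / 20 = 0.6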
We can also look at the stability of a random walk, or of a Markov chain in general.
Stability refers to the convergence of probabilities as the sequence becomes infinite. When the chain is irreducible and positive recurrent, the average (called the Cesaro average) (1/n) Sum-over-k P(Xk = x) converges to pi(x) as n -> infinity.
Stability is interesting because a Markov chain is a simple model of a stochastic dynamical system that remains within a small region. For example, when a pendulum swings, it finally comes to a stable position with dissipation of energy. Even if there were no friction, it would be deemed stable because it cannot swing too far away.
What stability tells us is that when a Markov chain has certain properties (irreducibility, positive recurrence, and a unique stationary distribution pi), the n-step transition matrix converges to a matrix with all rows equal to pi. This is called the fundamental stability theorem.
The proof of stability is based on coupling.
Coupling refers to the various methods for constructing a joint realization of two random variables. If the random variables are independent, they can be combined in a straightforward manner by taking their joint occurrence. Coupling lets us define a third Markov chain Z from a chain X with an arbitrary initial distribution and a chain Y started in the stationary distribution, where Z follows X prior to the meeting of X and Y and follows Y after the meeting point. This shows that the n-step transition matrix converges to a matrix whose rows all equal pi.
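A minimal simulation of the coupling construction on a generic two-state chain makes this concrete; the values of alpha and beta below are assumed for illustration:

import random

alpha, beta = 0.1, 0.6
pi = (beta / (alpha + beta), alpha / (alpha + beta))  # stationary distribution

def step(state):
    # Two-state chain: leave state 0 with probability alpha, state 1 with beta.
    if state == 0:
        return 1 if random.random() < alpha else 0
    return 0 if random.random() < beta else 1

def coupled_run(n):
    x = 0                                    # arbitrary start
    y = 0 if random.random() < pi[0] else 1  # start from the stationary distribution
    met = False
    for _ in range(n):
        if met:
            x = y = step(y)   # after the meeting point, Z follows Y
        else:
            x, y = step(x), step(y)
            met = (x == y)
    return x

# Empirically, the law of the coupled chain matches pi for large n.
samples = [coupled_run(50) for _ in range(10000)]
print(samples.count(0) / len(samples), "vs", pi[0])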
By choosing the most probable paths, the random walk follows the preferred state transitions. Thus while not all paths may end within the predetermined steps, we know that when it does, it would have chosen the higher transition probabilities.

A simple random walk has equal probabilities to move from one state to another.
To implement a simple random walk in the (x, y) plane, a naive version (rendered here as runnable Python) looks like this:

import random

n = 10
step = [-1, 1]                 # unit steps in either direction
x = [0] * 2**n
y = [0] * 2**n
for i in range(1, 2**n):
    if random.random() < 0.5:  # move in the x-direction
        x[i] = x[i-1] + random.choice(step)
        y[i] = y[i-1]
    else:                      # move in the y-direction
        x[i] = x[i-1]
        y[i] = y[i-1] + random.choice(step)
print(list(zip(x, y)))

We can compute the following metrics for a simple random walk (the first two are sketched below):
first return time to zero
total number of visits to a state
For all random walks, we can also compute metrics like:
total number of paths between two nodes
total number of paths in which a node appears
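A rough sketch of the first two metrics, read off a simulated 1-D walk:

import random

def walk1d(n):
    pos, path = 0, [0]
    for _ in range(n):
        pos += random.choice([-1, 1])
        path.append(pos)
    return path

path = walk1d(1000)

# First return time to zero: the first step t > 0 at which the walk is back at 0.
first_return = next((t for t, p in enumerate(path[1:], 1) if p == 0), None)

# Total number of visits to a state, here state 0 (the start counts as one visit).
visits_to_zero = path.count(0)

print(first_return, visits_to_zero)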


Thursday, July 17, 2014

We mentioned that a random walk is a special case of a Markov chain. To understand Markov chains, we take the example of a mouse in a cage with two cells - the first has fresh cheese and the second has stinky cheese. If at time n the mouse is in cell 1, then at time n+1 the mouse is in either cell 1 or cell 2. Let us say the probability to move from cell 1 to cell 2 is alpha and the reverse is beta. In this case, alpha is much smaller than beta. Then the transition diagram shows transitions as follows:
1->1 with a  probability of 1- alpha
1 -> 2 with a probability of alpha
2 -> 1 with a probability of beta
and 2->2 with a probability of 1 - beta.
At each step the mouse stays or moves according to these probabilities, so the time it takes for the mouse to make the first move from 1 to 2 follows a geometric distribution, whose mean is 1/alpha.
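A quick simulation confirms the 1/alpha mean; the value of alpha below is an assumed small probability:

import random

alpha = 0.05  # assumed probability of leaving the fresh cheese

def first_move_time():
    t = 1
    while random.random() >= alpha:  # stay in cell 1 with probability 1 - alpha
        t += 1
    return t

trials = [first_move_time() for _ in range(100000)]
print(sum(trials) / len(trials), "vs", 1 / alpha)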
Thus the moves are described in terms of random variables Xi. Let T denote a subset of the integers indexing a sequence of random variables {Xn : n in T}. The Markov property is then defined as follows:
for any n in T, the future process (Xm, m > n) is independent of the past process (Xm, m < n), conditionally on Xn.
In the case of the mouse, the probability of a move does not depend on where the mouse was at earlier times, i.e. the future is independent of the past given the present.
This dependence of the future only on the present is easy to generalize from deterministic objects to random variables. The states are countable and are collected in the state space S; since the moves are between states, (Xn) is called a Markov chain.
The Markov property covers the entire before and after processes, so (Xn+1, ..., Xn+m) and (X0, ..., Xn-1) are independent conditionally on Xn.
The walk based on a Markov chain is thus dependent on transition probabilities. If the transition probabilities are defined, then the walk is just iterative.
The joint probability distribution can be expressed in terms of two functions -
one for the initial distribution, P(X0 = i), for each i in the state space S
and the other for the transitions pi,j(n, n+1) = P(Xn+1 = j | Xn = i), i, j belongs to S, n >= 0
We can generalize the transition from one step to multiple steps with
 pi,j(m,n) = P(Xn = j | Xm = i).
We call a Markov chain time-homogeneous when the single-step transition probability does not depend on n.
In the case of the mouse, the transition probabilities do not change over time. The single-step transition matrix, denoted by just P, is the same at every step, and the multi-step transition matrix is P(m, n) = P * P * ... * P, multiplied n - m times, i.e. P^(n-m).
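To see this numerically, here is a small sketch that raises the mouse's transition matrix to a power; alpha and beta are assumed values, with alpha much smaller than beta. Both rows of P^n converge to the stationary distribution, which is the fundamental stability theorem in action:

alpha, beta = 0.1, 0.6

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

P = [[1 - alpha, alpha], [beta, 1 - beta]]
Pn = P
for _ in range(50):
    Pn = matmul(Pn, P)

# Both rows approach pi = (beta/(alpha+beta), alpha/(alpha+beta)).
print(Pn)
print(beta / (alpha + beta), alpha / (alpha + beta))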
In a random walk, we add a few more properties we discussed earlier.
Now let us consider a random walk on the bipartite graph starting at an unclassified term and arriving at a semantically equivalent domain term. The walker starts from any node and then moves to any other node with transition probability P(i,j).
P(i,j) is the normalized probability derived from the co-occurrence counts of the terms and the corresponding context. The transition probability between any two nodes i, j is defined as P(i,j) = W(i,j) / Sum-over-k W(i,k).
The terms are either unclassified or domain and cannot be connected with each other except through context.
Each walk uses the transition probabilities for the move.
Each walk starts at some unclassified term and ends at a domain term or exceeds the maximum number of moves without finishing.
Several such walks, say up to K, can be initiated.
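Putting the pieces together, a rough sketch of the walk procedure might look like the following; the weight matrix W, the node names, and the domain-term set are illustrative assumptions:

import random

# Illustrative co-occurrence weights; terms connect only through context nodes.
W = {
    "segfault": {"host:db1": 3, "sourcetype:syslog": 1},
    "host:db1": {"segfault": 3, "error": 5},
    "sourcetype:syslog": {"segfault": 1, "error": 2},
    "error": {"host:db1": 5, "sourcetype:syslog": 2},
}
domain_terms = {"error"}

def next_node(node):
    # Move with probability P(i,j) = W(i,j) / Sum-over-k W(i,k).
    nbrs = W[node]
    r = random.uniform(0, sum(nbrs.values()))
    for j, w in nbrs.items():
        r -= w
        if r <= 0:
            return j
    return j

def walk(start, max_moves=10):
    node, steps = start, 0
    while steps < max_moves:
        node = next_node(node)
        steps += 1
        if node in domain_terms:
            return node, steps   # hit a domain term: record the hitting time
    return None, steps           # walk exceeded max moves without finishing

K = 1000
hits = [walk("segfault") for _ in range(K)]
print(sum(1 for t, _ in hits if t == "error") / K)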

Wednesday, July 16, 2014

Let us look at performing a random walk between terms and context.
 We discussed in the previous post that terms and context are different for machine data than for human text. The context in the case of machine data is the metadata fields. A random walk connects the terms and the context in a bipartite graph.
A random walk is a special kind of Markov chain. A Markov chain has time-homogeneous transition probabilities; a random walk has the additional property that its transition probabilities are also space-homogeneous. This calls for a structure where p(x, y) = p(x+z, y+z) for any translation z, so a random walk is given by p(x, y) = p(y - x).
In an undirected connected graph, the degree of a vertex i is the number of neighbors of i.
For the same graph, a random walk on the graph (RWG) is defined with the following transition probabilities:
p(i,j) = 1 / degree(i), if j is a neighbor of i
       = 0, otherwise
The stationary distribution of the RWG is given by pi(i) = C degree(i), where C is a constant.
This comes from the fact that pi(i) p(i,j) = pi(j) p(j,i), since both sides equal C. On each side of the equation, the first factor is the stationary distribution and the second is the inverse of the corresponding degree.
This shows that the stationary distribution is proportional to degree(i). It is easy to work out with a sample connected graph of five vertices arranged inside a square. Moreover, the equations show that the RWG is a reversible chain. The constant C is found by normalization. Given that the degrees add up to twice the number of edges, we have C = 1 / (2|E|).
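This is easy to verify numerically; a sketch on an assumed five-vertex connected graph:

# Assumed adjacency list of an undirected connected graph on five vertices.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3]}

E = sum(len(nbrs) for nbrs in adj.values()) // 2   # each edge is counted twice
pi = {i: len(adj[i]) / (2 * E) for i in adj}       # pi(i) = degree(i) / (2|E|)
print(pi, sum(pi.values()))                        # the probabilities sum to 1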

There are several ways to express the transition probabilities, resulting in different random walks:

p(x) = C 2^(-(|x1| + ... + |xd|))

p(x) = pj, if x = ej, j = 1, ..., d
     = qj, if x = -ej, j = 1, ..., d
     = 0, otherwise
where all of the p(x) add up to 1.

If d = 1, then p(1) = p, p(-1) = q, and p(x) = 0 otherwise, where p + q = 1. This is the drunkard's walk, which is homogeneous in time and space, and given that each hop has length 1, the walk must pass through all intermediate nodes; the transitions are also skip-free.
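For d = 1 the walk is a one-line simulation; the value of p below is an assumed parameter:

import random

p = 0.5  # assumed probability of stepping +1; q = 1 - p

def drunkard(n):
    pos = 0
    for _ in range(n):
        pos += 1 if random.random() < p else -1  # every hop has length 1
    return pos

print([drunkard(100) for _ in range(5)])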

Machine data has several traits that are different from texts and discourses. When we search machine data we see fields such as messages, timestamps, applications, source, sourcetype, host, sizes, counts, protocols and protocol data formats etc. In other words, there is a set of domain terms that is finite and covers a large portion of the logs seen.
There is a high number of repetitions of these domain terms due to the fact that machines and repeaters often spit out the same set of messages at different times. At the same time, there are messages that have a high degree of variation in content such as web traffic capture. But these two different types are well delineated in terms of flows.
Machine data therefore can be separated into three different types:
Data that doesn't vary a lot in terms of content and terms
Data that varies a lot in terms of content only, metadata is still same
Data that mixes the two above
For the first two cases, when they constitute the bulk of messages, we can treat them differently. The mixture of the two amounts to noise and can be ignored for the moment; we will treat it as any other arbitrary text later.
Predictive patterns in the data help with analysis, so the first two cases are interesting and lend themselves to field extraction, summarization, etc.
Terms extracted from the data fall into a frequency table that is skewed compared with the concordance of human text. So the selection of these terms is different from that for regular text.
Moreover, the context for the data can be described in terms of the metadata fields. These fields are predetermined or even specified by the forwarders of the data. The indexers that store the raw and searchable data have both data and metadata from a wide variety of sources.
Association between the terms and the context lets us look for the terms that can be considered the semantics of the machine data. These terms are special in that they tag the data in a distinctive way. Finding these terms is helpful in knowing what the machine data is about.
We can associate the terms and the context with bipartite graphs.
When we perform a random walk to connect the terms and the context, we are performing a special case of a Markov chain which is iterative. In a random walk, we follow a stochastic process with random variables X1, X2, ..., Xk such that X1 = 0 and Xi+1 is a vertex chosen uniformly at random from the neighbors of Xi. The number p(v,w,k)(G) is the probability that a random walk of length k starting at v ends at w. If each edge is considered as a one-ohm resistor, then the resistance between a point and infinity is finite exactly when the graph is transient. We will review the Aldous and Fill book on random walks on graphs next.
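The number p(v,w,k)(G) can be read off the k-th power of the transition matrix. A small sketch on an assumed example graph:

# Transition matrix of the uniform random walk on an assumed graph.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
n = len(adj)
P = [[1 / len(adj[i]) if j in adj[i] else 0.0 for j in range(n)] for i in range(n)]

def matpow(P, k):
    # Multiply the identity matrix by P, k times.
    R = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        R = [[sum(R[i][m] * P[m][j] for m in range(n)) for j in range(n)]
             for i in range(n)]
    return R

# Probability that a walk of length 4 starting at vertex 0 ends at vertex 3.
print(matpow(P, 4)[0][3])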

Tuesday, July 15, 2014

I will resume my previous post, but I want to take a short break to discuss another software application. In some earlier posts I discussed a Fiddler-like application for mobile devices. In particular, I was looking for a Fiddler-like application for iOS devices, and I pointed out proxy-switching code for iOS available on the net. Changing the proxy helps with packet capture.
When we talked about the viability of a Fiddler-like application for mobile devices, we wanted to use it to test the APIs.
Before continuing any further, I have a work item to look at machine data semantics and I will return shortly.
In today's post, we look at two-mode networks as a social network method of study from the Hanneman lectures. Brieger (1974) first highlighted the dual focus of social network analysis: how individuals, by their agency, create social structure, while at the same time those structures impose constraints on and shape the behavior of the individuals embedded in them. Social network analysis measures relations at the micro level and uses them to infer the presence of structure at the macro level. For example, the ties of individuals (micro) allow us to infer cliques (macro).
The Davis study showed that there can be different levels of analysis. This study finds ties between actors and events, and as such describes not membership in a clique but affiliations. By seeing which actors participate in which events, we can infer the meaning of an event from the affiliations of its actors while also seeing the influence of the event on the choices of the actors.
Further, we can see examples of this macro-micro social structure at different levels. This is referred to as nesting, where individuals are part of a social structure and that structure can in turn be part of a larger structure. At each level of nesting, there is tension between structure and agency, i.e. between the macro and the micro.
There are some tools to examine this two-mode data, involving both qualitative and quantitative patterns. Take an example where we look at the contributions of donors to campaigns supporting and opposing ballot initiatives over a period of time; our data set has two modes - donors and initiatives. Binary data, recording whether there was a contribution or not, could describe what a donor did. Valued data could describe the relations between donors and initiatives on a simple ordinal scale.
A rectangular matrix of actors (rows) and events (columns) could describe this dual-mode data.
This could then be converted into two one-mode data sets: one where we measure the strength of ties between actors by the number of times they contributed to the same side of initiatives, and one where we measure initiative-by-initiative ties by the number of donors each pair of initiatives had in common.
To create actor-by-actor relations, we could use a cross-product method that takes each entry of the row for actor A, multiplies it by the corresponding entry for actor B, and sums the results. This gives an indication of co-occurrence and works well with binary data, where each product is 1 only when both actors are present.
Instead of the cross-product, we could also take the minimum of the two values, which says the tie is the weaker of the two actors' ties to the event. Both methods are sketched below.
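Both conversions are only a few lines of code; the actor-by-event matrix below is assumed data for illustration:

# Rows are actors, columns are events; 1 means the actor took part in the event.
A = [
    [1, 0, 1, 1],  # actor 0
    [1, 1, 0, 1],  # actor 1
    [0, 1, 1, 0],  # actor 2
]
actors, events = len(A), len(A[0])

# Cross-product method: sum of products over events (co-occurrence counts).
cross = [[sum(A[a][e] * A[b][e] for e in range(events))
          for b in range(actors)] for a in range(actors)]

# Minimum method: each tie is the weaker of the two actors' ties to the event.
minimum = [[sum(min(A[a][e], A[b][e]) for e in range(events))
            for b in range(actors)] for a in range(actors)]

print(cross)
print(minimum)  # identical to cross for binary data; they differ for valued data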
Two mode data are sometimes stored in a second way called the bipartite matrix.  A bipartite matrix is one where the same rows as in the original matrix are added as additional columns and the same columns as in the original matrix are added as additional rows.  Actors and events are being treated as social objects at a single level of analysis.
This is different from a bipartite graph, also called a bigraph, which is a set of graph vertices decomposed into two disjoint sets such that no two vertices within the same set are adjacent. By adjacent, we mean joined by an edge. In the context of word similarity extraction, we used terms and their n-gram contexts as the two partites and used random walks to connect them.


I will cover random walks in more detail.

Sunday, July 13, 2014

In this post, like in the previous one, we will continue to look at Splunk integration with SQL and NoSQL systems. Specifically, we will look at the interaction between Log Parser and Splunk. Splunk users know how to translate SQL queries to Splunk search queries; we use search operators for this. For non-Splunk users, we could present Splunk as a data store with Log Parser as a SQL interface. Therefore, we will look into providing Splunk searchable data as a COM input to Log Parser. A COM input simply implements a few methods for Log Parser and abstracts the data store. These methods are:
OpenInput: Opens your data source and sets up any initial environment settings
GetFieldCount: returns the number of fields that the plugin provides
GetFieldName: returns the name of a specified field
GetFieldType : returns the datatype of a specified field
GetValue : returns the value of a specified field
ReadRecord : reads the next record from your data source
CloseInput: closes the data source and cleans up any environment settings
Together, Splunk and Log Parser bring the power of Splunk to Log Parser users without requiring them to know Splunk search commands. At the same time, users have the choice to search the Splunk indexes directly. The ability to use SQL makes Splunk more familiar and inviting to Windows users.

<SCRIPTLET>
  <registration
    Description="Splunk Input Log Parser Scriptlet"
    Progid="Splunk.Input.LogParser.Scriptlet"
    Classid="{fb947990-aa8c-4de5-8ff3-32a59fb66a6c}"
    Version="1.00"
    Remotable="False" />
  <comment>
  EXAMPLE: logparser "SELECT * FROM MAIN" -i:COM -iProgID:Splunk.Input.LogParser.Scriptlet
  </comment>
  <implements id="Automation" type="Automation">
    <method name="OpenInput">
      <parameter name="strValue"/>
    </method>
    <method name="GetFieldCount" />
    <method name="GetFieldName">
      <parameter name="intFieldIndex"/>
    </method>
    <method name="GetFieldType">
      <parameter name="intFieldIndex"/>
    </method>
    <method name="ReadRecord" />
    <method name="GetValue">
      <parameter name="intFieldIndex"/>
    </method>
    <method name="CloseInput">
      <parameter name="blnAbort"/>
    </method>
  </implements>
  <SCRIPT LANGUAGE="VBScript">

Option Explicit

Const MAX_RECORDS = 5

Dim objResultDictionary
Dim objResultsSection, objResultsCollection
Dim objSearchResultsElement
Dim objResultsElement, objResultElement
Dim intResultElementPos, intResult, intRecordIndex
Dim clsResult
Dim intRecordCount

' --------------------------------------------------------------------------------
' Open the input Result.
' --------------------------------------------------------------------------------

Public Function OpenInput(strValue)
  intRecordIndex = -1
  Set objResultDictionary = CreateObject("Scripting.Dictionary")
  ' GetSearchResults and FindElement are assumed helpers: the former wraps the
  ' Splunk COM component shown below, the latter locates a named element.
  Set objResultsSection = GetSearchResults("index=main")
  Set objResultsCollection = objResultsSection.Collection
  If IsNumeric(strValue) Then
    intResultElementPos = FindElement(objResultsCollection, "Result", Array("id", strValue))
  Else
    intResultElementPos = FindElement(objResultsCollection, "Result", Array("name", strValue))
  End If
  If intResultElementPos > -1 Then
    Set objResultElement = objResultsCollection.Item(intResultElementPos)
    Set objSearchResultsElement = objResultElement.ChildElements.Item("SearchResults")
    Set objResultsElement = objSearchResultsElement.ChildElements.Item("SearchResult").Collection
    For intResult = 0 To CLng(objResultsElement.Count)-1
       Set objResultElement = objResultsElement.Item(intResult)
       Set clsResult = New Result
       clsResult.Timestamp = objResultElement.GetPropertyByName("timestamp").Value
       clsResult.Host = objResultElement.GetPropertyByName("host").Value
       clsResult.Source = objResultElement.GetPropertyByName("source").Value
       clsResult.SourceType = objResultElement.GetPropertyByName("sourcetype").Value
       clsResult.Raw = objResultElement.GetPropertyByName("raw").Value
       objResultDictionary.Add intResult,clsResult
    Next
  End If
End Function

' --------------------------------------------------------------------------------
' Close the input Result.
' --------------------------------------------------------------------------------

Public Function CloseInput(blnAbort)
  intRecordIndex = -1
  objResultDictionary.RemoveAll
End Function

' --------------------------------------------------------------------------------
' Return the count of fields.
' --------------------------------------------------------------------------------

Public Function GetFieldCount()
    GetFieldCount = 5
End Function

' --------------------------------------------------------------------------------
' Return the specified field's name.
' --------------------------------------------------------------------------------

Public Function GetFieldName(intFieldIndex)
    Select Case CInt(intFieldIndex)
        Case 0:
            GetFieldName = "Timestamp"
        Case 1:
            GetFieldName = "Host"
        Case 2:
            GetFieldName = "Source"
        Case 3:
            GetFieldName = "Sourcetype"
        Case 4:
            GetFieldName = "Raw"
        Case Else
            GetFieldName = Null
    End Select
End Function

' --------------------------------------------------------------------------------
' Return the specified field's type.
' --------------------------------------------------------------------------------

Public Function GetFieldType(intFieldIndex)
    ' Define the field type constants.
    Const TYPE_STRING   = 1
    Const TYPE_REAL      = 2
    Const TYPE_TIMESTAMP    = 3
    Const TYPE_NULL = 4
    Select Case CInt(intFieldIndex)
        Case 0:
            GetFieldType = TYPE_TIMESTAMP
        Case 1:
            GetFieldType = TYPE_STRING
        Case 2:
            GetFieldType = TYPE_STRING
        Case 3:
            GetFieldType = TYPE_STRING
        Case 4:
            GetFieldType = TYPE_STRING   
        Case Else
            GetFieldType = TYPE_NULL
    End Select
End Function

' --------------------------------------------------------------------------------
' Return the specified field's value.
' --------------------------------------------------------------------------------

Public Function GetValue(intFieldIndex)
  If objResultDictionary.Count > 0 Then
    Select Case CInt(intFieldIndex)
        Case 0:
            GetValue = objResultDictionary(intRecordIndex).Timestamp
        Case 1:
            GetValue = objResultDictionary(intRecordIndex).Host
        Case 2:
            GetValue = objResultDictionary(intRecordIndex).Source
        Case 3:
            GetValue = objResultDictionary(intRecordIndex).SourceType
        Case 4:
            GetValue = objResultDictionary(intRecordIndex).Raw
        Case Else
            GetValue = Null
    End Select
  End If
End Function
 
' --------------------------------------------------------------------------------
' Read the next record, and return true or false if there is more data.
' --------------------------------------------------------------------------------

Public Function ReadRecord()
  If objResultDictionary.Count > 0 Then
    If intRecordIndex < (objResultDictionary.Count-1) Then
        intRecordIndex = intRecordIndex + 1
        ReadRecord = True
    Else
        ReadRecord = False
    End If
  End If
End Function

Class Result
  Public Timestamp
  Public Host
  Public Source
  Public SourceType
  Public Raw
End Class

  </SCRIPT>

</SCRIPTLET>

Scriptlet courtesy: Robert McMurray's blog


I will provide a class library for the COM callable wrapper to Splunk searchable data in C#.

The COM library that returns the search results can implement  methods like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Splunk;
using SplunkSDKHelper;
using System.Xml;

namespace SplunkComponent
{

    // The wrapper must be visible to COM for Log Parser to call it.
    [System.Runtime.InteropServices.ComVisible(true)]
    public class SplunkComponent
    {
        public SplunkComponent()
        {
            // Load connection info for Splunk server in .splunkrc file.
            var cli = Command.Splunk("search");
            cli.AddRule("search", typeof(string), "search string");
            cli.Parse(new string[] {"--search=\"index=main\""});
            if (!cli.Opts.ContainsKey("search"))
            {
                System.Console.WriteLine("Search query string required, use --search=\"query\"");
                Environment.Exit(1);
            }

            var service = Service.Connect(cli.Opts);
            var jobs = service.GetJobs();
            job = jobs.Create((string)cli.Opts["search"]);
            while (!job.IsDone)
            {
                System.Threading.Thread.Sleep(1000);
            }
        }

        [System.Runtime.InteropServices.ComVisible(true)]
        public string GetAllResults()
        {
            var outArgs = new JobResultsArgs
            {
                OutputMode = JobResultsArgs.OutputModeEnum.Xml,

                // Return all entries.
                Count = 0
            };

            using (var stream = job.Results(outArgs))
            {
                var setting = new XmlReaderSettings
                {
                    ConformanceLevel = ConformanceLevel.Fragment,
                };

                using (var rr = XmlReader.Create(stream, setting))
                {
                    return rr.ReadOuterXml();
                }
            }
        }

        private Job job { get; set; }
    }
}

https://github.com/ravibeta/csharpexamples/tree/master/SplunkComponent