Tuesday, February 11, 2014

Splunk monitors machine data. There are some concepts specific to Splunk, which we briefly review now. Index-time processing: Splunk reads data from a source, such as a file or a port, and classifies that source into a source type. The data is broken into events, which may consist of a single line or multiple lines, and each event is written to an index on disk for later retrieval with a search.
When a search starts, events are retrieved and classified based on event types, and the matching events are transformed to generate reports and displayed on dashboards.
By default, events go into the main index unless a different index is created and specified.
Fields are searchable name/value pairs in event data; the default fields are usually host, source, and sourcetype. Tags are aliases to field values. Event types are dynamic tags attached to events. Saved Splunk objects such as saved searches, event types, reports, and tags are not only persisted but also secured with permissions based on users and roles. Thus events are enriched beyond the raw data that was indexed. When a set of events is grouped into one, it is called a transaction.
Apps are collections of Splunk configurations, objects, and code. They help the user organize work around targeted activities.
Splunk instances can work in three different modes: forwarder, indexer, and search head. A forwarder is a version of Splunk that sends data to a central Splunk indexer or a group of indexers. An indexer provides indexing capability for local and remote data; an indexer is usually added for every 50-100 GB of data per day, depending on search load. A search head is typically added for every 10-20 active users, depending on the searches they run.
One of the sources of machine data is message queuing. This is particularly interesting because message queues are increasingly being used as the backbone of logging architectures for applications. Subscribing to these message queues is a good way to debug problems in complex applications. As the Exploring Splunk book mentions, you can see exactly what the next component down the chain received from the prior component. However, the depth of support for subscribing to message queues varies across queue technologies and operating systems. At the very least, transactional and non-transactional message queues on Windows could be supported directly out of the box.

Monday, February 10, 2014

We return to our discussion of Splunk commands. The eval command calculates an expression and puts the resulting value into a field. Eval recognizes functions such as:
abs(X); case(X,"Y",...); ceil(X); cidrmatch("X",Y), which identifies IP addresses that belong to a subnet; coalesce(X,...), which returns the first value that is not null; exact(X), which uses double precision; exp(X); floor(X); if(X,Y,Z); isbool(X); isint(X); isnotnull(X); isnull(X); isnum(X); isstr(X); len(X); like(X,"Y"); ln(X); log(X,Y); lower(X); ltrim(X,Y); match(X,Y), which matches a regex pattern; max(X,...); md5(X), which gives an MD5 hash; min(X,...), which returns the minimum; mvcount(X), which returns the number of values of X; mvfilter(X), which filters a multivalued field; mvjoin(X,Y), which joins the field's values with the specified delimiter; now(), which gives the current time; null(); nullif(); pi(); pow(X,Y); random(); relative_time(X,Y); replace(X,Y,Z); round(X,Y); rtrim(X,Y); searchmatch(X); split(X,"Y"); sqrt(X); strftime(X,Y), which returns the time formatted as specified; strptime(), which parses a time from a string; substr; time; tonumber; tostring; trim; typeof(X), which returns a string representation of its type; upper(X); urldecode(X); and validate(X,Y,...).
Common stats functions include avg, count, dc (which returns the count of distinct values), first, last, list, max, median, min, mode, perc<x>(Y) (which returns a percentile), range (the difference between the max and min values), stdev (the sample standard deviation), stdevp (the population standard deviation), sum, sumsq, values, and var (the variance of X).
Regular expressions can be specified with the following metacharacters:
\s for whitespace, \S for not whitespace, \d for a digit, \D for not a digit, \w for a word character, \W for not a word character, [...] for any included character, [^...] for any character not included, * for zero or more, + for one or more, ? for zero or one, | for or, (?P<var>...) for a named extraction such as an SSN, (?:...) for logical grouping, ^ for start of line, $ for end of line, {...} for a number of repetitions, \ for escape, (?=...) for lookahead, and (?!...) for negative lookahead. The same regular expression can be specified in more than one way, and the parser will attempt to simplify or expand it to cover all the cases the pattern implies. For example, a repetition of x two to five times, x{2,5}, can be expanded as xx(x(x(x)?)?)?.
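As a small illustration of that last equivalence, here is a toy C++ check (not part of Splunk; the strings tested are arbitrary) showing that x{2,5} and its expanded form accept the same inputs:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Two ways to say "between two and five x characters".
    std::regex compact("x{2,5}");
    std::regex expanded("xx(x(x(x)?)?)?");

    for (const std::string s : {"x", "xx", "xxxx", "xxxxx", "xxxxxx"}) {
        std::cout << s << ": compact=" << std::regex_match(s, compact)
                  << " expanded=" << std::regex_match(s, expanded) << "\n";
    }
    return 0;
}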

Sunday, February 9, 2014

In regular expression search, matches can be partial or full. To process all the groups in a full match, we may need to iterate multiple times.
In either case, there are a few approaches to pattern matching: the first is backtracking, the second is finite automata, and the third is the approach taken by RE2.
Production implementations often mix more than one of these approaches.
We discuss them briefly here.

Kernighan and Pike describe a very fast and simple backtracking approach. Matches are determined one position at a time; the positions need not be contiguous. If a match occurs at a position, the remainder of the pattern is checked against the rest of the string. A recursive function helps here, and the recursion can nest up to the length of the pattern. The pattern is interpreted from the literals and metacharacters provided, and the empty string can match an asterisk. A separate recursive method handles the asterisk wildcard, since it can be matched in one of three different ways depending on the parameters to the function.
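Here is a minimal sketch in the spirit of the matcher Kernighan and Pike present (it supports only literals, '.', '*', '^', and '$'); the three-way handling of '*' lives in its own recursive function:

#include <cstdio>

static bool matchhere(const char *regexp, const char *text);
static bool matchstar(int c, const char *regexp, const char *text);

// match: search for regexp anywhere in text
static bool match(const char *regexp, const char *text) {
    if (regexp[0] == '^')
        return matchhere(regexp + 1, text);
    do {                                    // must try even if text is empty
        if (matchhere(regexp, text))
            return true;
    } while (*text++ != '\0');
    return false;
}

// matchhere: search for regexp at the beginning of text
static bool matchhere(const char *regexp, const char *text) {
    if (regexp[0] == '\0')
        return true;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp + 2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
        return matchhere(regexp + 1, text + 1);
    return false;
}

// matchstar: search for c*regexp at the beginning of text;
// c* can match the empty string, one c, or many c's
static bool matchstar(int c, const char *regexp, const char *text) {
    do {
        if (matchhere(regexp, text))
            return true;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return false;
}

int main() {
    std::printf("%d\n", match("ab*c", "xxabbbcyy"));   // 1: found in the middle
    std::printf("%d\n", match("^ab*c$", "xxabbbcyy")); // 0: the anchors do not hold
    return 0;
}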

The RE2 approach contrasts with the backtracking used by PCRE, Perl, and Python. While backtracking can take exponential time in the worst case (even though it supports a rich variety of features), RE2 is efficient because it is grounded in automata theory, and it supports leftmost-first matching. This approach has the following steps:
Step 1: Parse (walking a regexp)
Step 2: Simplify
Step 3: Compile
Step 4: Match

These steps are explained below:

The parser could be simple, but there is a wide variety of regular expressions that can be specified. Some parsers maintain an explicit stack instead of using recursion; some allow a single linear scan by keeping the expression in reverse Polish notation; others build explicit trees. It is tempting to parse a regular expression into a decision tree over the input text, checking whether any substring at a position matches the pattern, but a graph of a few predefined operations serves better. These operations are concatenation, repetition, and alternation, as in the sketch below.
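A hypothetical node type for such a graph of operations might look like the following (the names are illustrative, not taken from any particular engine):

#include <memory>
#include <vector>

// The three operations the parser produces, plus a literal leaf.
enum class Op { Literal, Concat, Repeat, Alternate };

struct RegexNode {
    Op op;
    char literal = '\0';                               // set when op == Op::Literal
    std::vector<std::unique_ptr<RegexNode>> children;  // operands of Concat/Repeat/Alternate
};

// For example, "a(b|c)*" parses into
// Concat( Literal 'a', Repeat( Alternate( Literal 'b', Literal 'c' ) ) ).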
The next step is to simplify; the purpose of simplification is to reduce the memory footprint.
The result of the following step, compilation, is an instruction graph that makes pattern matching easy. This way the parser doesn't need to be invoked again and again.
Lastly, the matching step can be either partial or full.
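To make the compile and match steps concrete, here is a toy sketch assuming a tiny instruction set (Char, Split, Jmp, Match). The executor below backtracks recursively, which is simpler than, though not the same as, the automaton simulation RE2 actually performs:

#include <cstddef>
#include <cstdio>
#include <vector>

enum class OpCode { Char, Split, Jmp, Match };

struct Inst {
    OpCode op;
    char c;        // character to consume, for Char
    int x, y;      // jump targets, for Split and Jmp
};

// Runs the instruction graph against text starting at pc; returns true if a
// prefix of the text matches.
static bool run(const std::vector<Inst>& prog, int pc, const char* sp) {
    for (;;) {
        const Inst& i = prog[static_cast<std::size_t>(pc)];
        switch (i.op) {
        case OpCode::Char:
            if (*sp != i.c) return false;
            ++pc; ++sp; break;
        case OpCode::Jmp:
            pc = i.x; break;
        case OpCode::Split:                      // try one branch, then the other
            if (run(prog, i.x, sp)) return true;
            pc = i.y; break;
        case OpCode::Match:
            return true;
        }
    }
}

int main() {
    // Hand-compiled instruction graph for the pattern a(b|c)*d.
    std::vector<Inst> prog = {
        {OpCode::Char, 'a', 0, 0},   // 0
        {OpCode::Split, 0, 2, 7},    // 1: enter the (b|c)* body, or skip to 'd'
        {OpCode::Split, 0, 3, 5},    // 2: choose b or c
        {OpCode::Char, 'b', 0, 0},   // 3
        {OpCode::Jmp, 0, 6, 0},      // 4
        {OpCode::Char, 'c', 0, 0},   // 5
        {OpCode::Jmp, 0, 1, 0},      // 6: loop back
        {OpCode::Char, 'd', 0, 0},   // 7
        {OpCode::Match, 0, 0, 0}     // 8
    };
    std::printf("%d\n", run(prog, 0, "abcbd")); // 1
    std::printf("%d\n", run(prog, 0, "abx"));   // 0
    return 0;
}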

Saturday, February 8, 2014

In the previous post, we talked about regular expressions. Their implementation is quite interesting, and I would like to cover it in a subsequent post. Meanwhile, I wanted to mention the different forms a regular expression can take. A given pattern can contain wildcards and metacharacters, which means it can match a variety of forms. Typically we want one expanded form that covers everything the pattern is intended to match.
By expanding the regular expression, I mean using a form in which all the different possible groups are enumerated. By expanding the groups, we attempt to cover all the patterns that were intended to be matched. Recall that in our final desired output we order the results based on groups and captures; expanding the pattern helps with this enumeration.
There are many editors for regular expressions, including Regular Expression Buddy, which lets users try out their expressions against sample data interactively. The idea is that some experimentation may be needed to decide which regular expression best suits the need, so that there is little surprise between the desired and actual output. Although Regular Expression Buddy is downloadable software, there are web applications, such as regexpr, that we can use for the same purpose. The user interface lets you interactively and visually see the results with highlighted matches.

Friday, February 7, 2014

At the heart of every text search there is a pattern to be matched. The regex operator is very widely used; it is as important in Splunk as in any other search software.
Let us look at this operator more closely and see if we can find an implementation that works well. There is a need to optimize some code paths for fast pattern matching, especially for simple patterns; here, however, we focus on the semantics, organization, and implementation.
Patterns are best described in terms of groups and captures.
A group can be a literal or a pattern. Groups can be nested and can indicate one or more occurrences of their elements.
A capture is a match between a group and the text.
A capture records such things as the index and length of the match within the original string.
A group can have many captures, often referred to as a CaptureCollection.
A match may have many groups, each identified by a group number within that match.
Matches can follow one after another in a string, and it is necessary to find them all; the caller can call Match.NextMatch() to iterate over them.
The results of the output should look something like this:
Original text
Match found :
    Group 1=
                Capture 0 =   value      Index=      Length=
                Capture 1 =   value      Index=      Length=
    Group 2=
                Capture 0 =   value      Index=      Length=
                :
and so on.
Since wildcards and other metacharacters are supported, it is important to match the group against each possible candidate capture.
All captures are unique in the sense that they have a distinct index and length pair. Indexes and lengths won't be sequential, but larger captures precede smaller ones because the smaller captures are typically subsets of the bigger ones.
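The API names above are from the .NET Regex class. As a rough C++ analog that walks successive matches and prints each group's value, index, and length (with the caveat that std::regex keeps only the last capture of a repeated group rather than a full capture collection), a sketch might look like this; the sample text and pattern are arbitrary:

#include <cstddef>
#include <iostream>
#include <regex>
#include <string>

int main() {
    const std::string text = "ids: 12-345 and 67-890";
    const std::regex pattern("(\\d+)-(\\d+)");   // two groups around a dash

    std::cout << "Original text: " << text << "\n";
    for (auto it = std::sregex_iterator(text.begin(), text.end(), pattern);
         it != std::sregex_iterator(); ++it) {
        const std::smatch& m = *it;              // one match, like Match.NextMatch()
        std::cout << "Match found: " << m.str() << "\n";
        for (std::size_t g = 1; g < m.size(); ++g) {
            std::cout << "    Group " << g << " = " << m[g].str()
                      << "   Index = " << m.position(g)
                      << "   Length = " << m.length(g) << "\n";
        }
    }
    return 0;
}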
           


I've been reading a book on exploring Splunk to index and search machine data. I want to continue that discussion and include my take from a training video I've seen. Today, though, I want to take a break to paint a vision of a text processing system of my own that can do both keyword weighting and topic extraction from text. I've attempted different projects with different algorithms and implementations. Most of them have not been satisfactory, except perhaps the more recent ones, and even there some refinements remain to be done. But I've learned enough to associate an appropriate algorithm with the task at hand. For the most part, the system will follow conventional wisdom: documents are treated as term vectors, and the vectors are reduced from a high number of dimensions before they are clustered together. I've tried thinking about alternative approaches to avoid the curse of dimensionality, and I don't feel I have done enough on that front, but the benefit of following convention is that there is plenty of literature on what has worked before. In many cases there is a lot of satisfaction if it just works. Take, for instance, the different algorithms for weighing terms and clustering topics: we chose some common principles from most of the implementation discussions in papers and left out the fancy ones. We know that there are soft memberships to different topics, different ways in which the scope of the search changes, and different tools to rely on, but overall we have experimented with enough pieces of the puzzle that they can come together.
I now describe the overall layout and organization of this system. We will have layers for different levels of engagement and functionality, starting with the backend and going all the way to the front end. The distinguishing feature of this system is that it will allow different algorithms to be switched in and out during execution, with a scorecard maintained for each algorithm so that they can be evaluated against the text to choose what works best. Given the nature of the input and the emphasis of each algorithm, such a strategy design pattern becomes a salient feature of the core of the system (see the sketch after this description). The engine may have to run several mining techniques and may even work with big data, so it should have a distributed framework where execution can be forked out to different agents. Below the processing engine layer will be a variety of large data sources and a data access layer. There could also be node initiators and participants from a cluster; the processing engine can sit on top of this heterogeneous system.
Above the processing engine comes the management layer, which can handle remote commands and queries. These remote commands could be assumed to come over HTTP and may talk to one or more of the interfaces the customer uses. These could include command line interfaces, a user interface, and an administration panel.
The size of the data and the scalability of the processing, as well as the distributed tasks, call for modular components with well-defined communication, so that they can be independently tested and switched in and out. Also, the system may perform very differently for data that doesn't fit in main memory, be it on a participant or the initiator machine.
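As a minimal sketch of the switch-in/switch-out idea (all of the names below are hypothetical, not an existing system), the strategy pattern for term weighting might look like this:

#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical strategy interface: any term-weighting algorithm can be
// plugged into the engine and evaluated against the same text.
struct TermWeighter {
    virtual ~TermWeighter() = default;
    virtual std::map<std::string, double>
    weigh(const std::vector<std::string>& terms) const = 0;
};

// One concrete strategy: raw term frequency.
struct FrequencyWeighter : TermWeighter {
    std::map<std::string, double>
    weigh(const std::vector<std::string>& terms) const override {
        std::map<std::string, double> weights;
        for (const auto& t : terms) weights[t] += 1.0;
        return weights;
    }
};

// The engine depends only on the interface, so algorithms can be switched
// in and out while a scorecard is kept for each one elsewhere.
class ProcessingEngine {
public:
    void setWeighter(std::unique_ptr<TermWeighter> w) { weighter_ = std::move(w); }
    std::map<std::string, double>
    run(const std::vector<std::string>& terms) const { return weighter_->weigh(terms); }
private:
    std::unique_ptr<TermWeighter> weighter_;
};

Switching algorithms is then a one-line change at the call site, for example engine.setWeighter(std::make_unique<FrequencyWeighter>()).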

Thursday, February 6, 2014

Today I will discuss named return value optimization and copy elision. Copy elision is a compiler optimization technique that eliminates unnecessary copying of objects. For example, copying can be eliminated in the case of a temporary object of class type that has not been bound to a reference, which is exactly the case in return value optimization. Take the following code, as given on MSDN:
#include <cstdio>

class RVO
{
public:
    RVO() { printf("I am in constructor\n"); }                     // constructor prints call
    RVO(const RVO&) { printf("I am in copy constructor\n"); }      // copy constructor prints call
    ~RVO() { printf("I am in destructor\n"); }                     // destructor prints call
    int mem_var;                                                   // data variable
};

Now if there is code for a method that returns an object of this class after assigning the data variable, and a caller that constructs its own object and assigns the method's return value to it, the run calls the constructor twice, the copy constructor once, and the destructor three times. This is because one object is created within the method, another object is created in the caller, and a temporary object is created for the return value passed between the method and the caller.
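A sketch of the rest of the example, along the lines the MSDN article describes (the variable names here are illustrative):

// Returns an RVO object by value; without the optimization the compiler
// copies the local into a hidden temporary that carries the return value.
RVO MyMethod(int i)
{
    RVO rvo;
    rvo.mem_var = i;
    return rvo;
}

int main()
{
    RVO rvo;            // the caller's own object
    rvo = MyMethod(5);  // without RVO, a temporary is copy-constructed here
    return 0;
}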

This temporary object can be done away with by an optimization without affecting the program logic, because the method can construct its result directly where the caller will see it. This optimization is called return value optimization.
Hence, with return value optimization the program prints two constructor lines followed by two destructor lines; the copy constructor line and one destructor line disappear. This saves work and memory, particularly if the object can be arbitrarily large.

Compilers such as Visual C++ have optimization switches that enable this behavior. The effect of the optimization should be visible in the program's memory growth and can be observed with a resource monitor.

A side effect of the optimization is that the programmer should not depend on the temporary objects being created. For example, if the programmer increments an object's reference count at creation in both the constructor and the copy constructor, the two should not be treated differently. Furthermore, constructor and destructor calls remain paired, so the programmer can rely on the lifetime of the object without losing the semantics.