Friday, February 7, 2014

At the heart of every text search is a pattern to be matched. The regex operator is widely used and is as important to Splunk as it is to any search software.
Let us look at this operator more closely and see if we can find an implementation that works well. Some code paths need to be optimized for fast pattern matching, especially for simple patterns; here, however, we focus on the semantics, organization, and implementation.
Patterns are best described in terms of Groups and Captures.
A Group can be a literal or a pattern. Groups can be nested and can indicate one or more occurrences of their elements.
A Capture is a match between a group and the text.
A Capture records such things as the index and length of the match within the original string.
A group can have many captures, often referred to as a CaptureCollection.
A match may have many groups, each identified by a group number within that match.
Matches can follow one after the other in a string, and it is necessary to find them all. The caller can call Match.NextMatch() to iterate over them.
The results of the output should look something like this:
Original text
Match found :
    Group 1=
                Capture 0 =   value      Index=      Length=
                Capture 1 =   value      Index=      Length=
    Group 2=
                Capture 0 =   value      Index=      Length=
                :
and so on.
Since wildcards and other metacharacters are supported, it is important to match the group against each possible candidate capture.
All captures are unique in the sense that each has a distinct index and length pair. Indexes and lengths won't be sequential, but the larger captures precede the smaller ones because the smaller are typically subsets of the bigger.
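The semantics above can be illustrated with Python's re module, used here as a stand-in for the .NET-style Match/Group/Capture API the post describes. (One caveat: Python, unlike .NET's CaptureCollection, only keeps the last capture of a repeated group.) The text and pattern are made up for illustration:

```python
import re

# Sample text and a pattern with nested groups, so each match
# yields more than one group. Both are made up for illustration.
text = "cat dog cat"
pattern = re.compile(r"(c(a)t)")

# Iterate over successive matches, the way Match.NextMatch() would.
for match in pattern.finditer(text):
    print("Match found:")
    for i in range(1, pattern.groups + 1):
        value = match.group(i)
        # each group's capture has an index and a length in the original string
        print(f"    Group {i} = {value}  Index={match.start(i)}  Length={len(value)}")
```

Running this prints the two matches of "cat", each with its outer group (length 3) and inner group (length 1), in the same shape as the output sketch above.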
           


I've been reading a book on exploring Splunk to index and search machine data. I want to continue that discussion and include my take from a training video I've seen. Today I want to take a break to paint a vision of my text processing system, one that can do both keyword weighting and topic extraction from text. I've attempted different projects with different algorithms and implementations. Most of them have not been satisfactory, except perhaps the more recent ones, and even there some refinements remain to be done. But I've learned some, and can associate an appropriate algorithm with the task at hand. For the most part, it will follow conventional wisdom. By that I mean documents are treated as term vectors, and the vectors are reduced from a high number of dimensions before they are clustered together. I've tried thinking about alternative approaches to avoid the curse of dimensionality, and I don't feel I have done enough on that front, but the benefit of following convention is that there is plenty of literature on what has worked before. In many cases, there is a lot of satisfaction if it just works. Take, for instance, the different algorithms for weighing terms and clustering topics. We chose some common principles from most of the implementation discussions in papers and left out the fancy ones. We know that there are soft memberships to different topics, different ways in which the scope of the search changes, and different tools to rely on, but overall we have experimented with different pieces of the puzzle so that they can come together.
I now describe the overall layout and organization of this system. We will have layers for different levels of engagement and functionality, starting with the backend all the way to the front end. The distinguishing feature of this system is that it will allow different algorithms to be switched in and out, with a scorecard maintained for each algorithm that can then be evaluated against the text to choose what's best. Given the nature of the input and the emphasis of each algorithm, such a strategy design pattern becomes a salient feature of the core of our system. The engine may have to apply several mining techniques and may even work with big data, hence it should have a distributed framework where execution can be forked out to different agents. Below the processing engine layer will be a variety of large data sources and a data access layer. There could also be node initiators and participants from a cluster. The processing engine can sit on top of this heterogeneous system.
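As a minimal sketch of the strategy-plus-scorecard idea: each algorithm is a swappable strategy, and the engine records a score for each one so the best can be chosen. All names here (KeywordWeigher, FrequencyWeigher, Engine) are hypothetical illustrations, not part of any real system:

```python
class KeywordWeigher:
    """Base strategy interface: weigh terms of a document."""
    name = "base"

    def weigh(self, terms):
        raise NotImplementedError

class FrequencyWeigher(KeywordWeigher):
    """One concrete strategy: weight = raw term frequency."""
    name = "frequency"

    def weigh(self, terms):
        weights = {}
        for t in terms:
            weights[t] = weights.get(t, 0) + 1
        return weights

class Engine:
    """Runs every registered strategy and keeps a scorecard per strategy."""

    def __init__(self, strategies):
        self.strategies = strategies
        self.scorecard = {}          # strategy name -> score

    def run(self, terms, score_fn):
        for s in self.strategies:
            weights = s.weigh(terms)
            self.scorecard[s.name] = score_fn(weights)
        # pick the best-scoring strategy for this input
        return max(self.scorecard, key=self.scorecard.get)
```

The point of the design is that new weighing or clustering strategies can be added without touching the engine, and the scorecard makes the choice between them data-driven.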
Above the processing engine comes the management layer, which can handle remote commands and queries. These remote commands can be assumed to come over HTTP and may talk to one or more of the interfaces that the customer uses. These could include a command line interface, a user interface, and an administration panel.
The size of the data and the scalability of the processing, as well as the distributed tasks, may require modular components that communicate with each other, so that they can be independently tested and switched in and out. Also, the system may perform very differently for data that doesn't fit in main memory, whether on a participant or the initiator machine.

Thursday, February 6, 2014

Today I will discuss named return value optimization and copy elision. Copy elision is a compiler optimization technique that eliminates unnecessary copying of objects. For example, copying can be eliminated for a temporary object of class type that has not been bound to a reference. This is the case in return value optimization. Take the following code, as given on MSDN:
#include <stdio.h>

class RVO
{
public:
    RVO() { printf("constructor called\n"); }
    RVO(const RVO& other) { printf("copy constructor called\n"); }
    ~RVO() { printf("destructor called\n"); }
    int mem_var;   // data variable
};

Now if there were code for a method that returns an object of this class after assigning the data variable, it would call the constructor twice, the copy constructor once, and the destructor three times, in that order: one object is created within the method, another is created in the caller, and a temporary object is created for the return value between the method and the caller.

This temporary object can now be done away with in an optimization without affecting the program logic because both the caller and the method will have access to their objects. This optimization is called return value optimization.
Hence, with return value optimization, the program output will print two constructor calls followed by two destructor calls, without the lines for the copy constructor and the third destructor. This saves memory, particularly if the object can be of arbitrary size.

Compilers such as Visual Studio's have a switch to enable this optimization. The effect of the optimization should show up as reduced memory growth, which can be observed with a resource monitor.

A side effect of this optimization is that the programmer should not depend on the temporary objects being created. For example, if the programmer increments the reference count of an object at its creation via both the constructor and the copy constructor, he should not differentiate between the two. Further, constructor and destructor calls remain paired, so the programmer can rely on the lifetime of the object without losing the semantics.

Wednesday, February 5, 2014

We discussed alerts, actions, charts, graphs, visualizations, and dashboards in the previous post. We now review recipes for monitoring and alerting. These are meant to be brief solutions to common problems. Monitoring helps you see what is happening to your data. As an example, let us say we want to monitor how many concurrent users there are at any given time. This is a useful metric to see if a server is overloaded. To do this, we search for the relevant events. Then we use the concurrency command to find the number of users that overlap. Then we use the timechart reporting command to display a chart of the number of concurrent users.
We specify this as search sourcetype=login_data | concurrency duration=ReqTime | timechart max(concurrency)
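Conceptually, the concurrency command counts, for each event, how many events overlap it in time. A small Python sketch of that idea, using made-up (start, duration) pairs in place of real login events:

```python
# Conceptual sketch of what Splunk's concurrency command computes:
# given events with a start time and a duration, count how many
# events are in progress at each event's start. Data is illustrative.

def concurrency(events):
    """events: list of (start, duration); returns overlap count per event."""
    counts = []
    for start, _ in events:
        overlapping = sum(1 for s, d in events if s <= start < s + d)
        counts.append(overlapping)
    return counts

# three logins; the second and third overlap earlier ones
events = [(0, 10), (5, 10), (8, 2)]
print(concurrency(events))   # -> [1, 2, 3]
```

The timechart max(concurrency) step in the query then just reports the peak of these counts per time bucket.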
Let us say next that we want to monitor the inactive hosts.
We use the metadata command, which gives information on hosts, sources, and source types.
Here we specify:
| metadata type=hosts | sort recentTime | convert ctime(recentTime) as Latest_Time
We can use tags to categorize data and use it with our searches.
In the above example, we could specify:
... | top 10 tag::host to specify top ten host types.
Since we talked about tags, we might as well see an example with event types.
We could display a chart of how host types perform, using only event types that end in _host, with the following:
... | eval host_types=mvfilter(match(eventtype, "_host$"))
    | timechart avg(delay) by host_types
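Conceptually, mvfilter(match(eventtype, "_host$")) keeps only the multivalue entries ending in _host. A Python equivalent of that filtering step, with made-up event type values:

```python
import re

# Conceptual equivalent of eval host_types=mvfilter(match(eventtype, "_host$")):
# keep only the multivalue entries that match the pattern.
def mvfilter_match(values, pattern):
    return [v for v in values if re.search(pattern, v)]

eventtypes = ["web_host", "db_host", "client_error"]
print(mvfilter_match(eventtypes, r"_host$"))   # -> ['web_host', 'db_host']
```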
Another common question that monitoring could help answer is: how did today perform compared to the previous month?
For example, we might want to view the hosts that were more popular today than in the previous month.
This we do with the following steps:
1. get the monthly usage for each host
2. get the daily usage for each host and append
3. use the stats to join the monthly and daily usages by host.
4. use sort and eval to format the results.
Let's try these commands without seeing the book.
| metadata type=hosts earliest=-30d@d | stats sum(duration) as monthly_usage by host | sort 10 - monthly_usage | streamstats count as MonthRank
Cut and paste the above, with changes for daily, as:
append [ | metadata type=hosts earliest=-1d@d | stats sum(duration) as daily_usage by host | sort 10 - daily_usage | streamstats count as DayRank ]
Next join the monthly and the daily rankings with stats command:
stats first(MonthRank) as MonthRank first(DayRank) as DayRank by host
Then we format the output :
eval diff=MonthRank-DayRank | sort DayRank | table DayRank, host, diff, MonthRank
Each of the steps can now be piped to the other and the overall search query can be pipe-concatenated to form a single composite query.
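The four numbered steps above can be sketched in Python with made-up usage totals (the field names mirror the query; the data and hosts are illustrative):

```python
def rank(usage):
    """usage: dict host -> total usage; returns dict host -> rank (1 = highest)."""
    ordered = sorted(usage, key=usage.get, reverse=True)
    return {host: i + 1 for i, host in enumerate(ordered)}

# made-up usage totals per host
monthly = {"hostA": 300, "hostB": 900, "hostC": 500}
daily   = {"hostA": 50,  "hostB": 10,  "hostC": 30}

month_rank = rank(monthly)   # step 1: monthly ranking per host
day_rank = rank(daily)       # step 2: daily ranking per host
rows = [                     # steps 3 and 4: join by host, compute diff, sort
    {"host": h, "DayRank": day_rank[h],
     "MonthRank": month_rank[h], "diff": month_rank[h] - day_rank[h]}
    for h in sorted(day_rank, key=day_rank.get)
]
print(rows[0])   # hostA moved up the most since last month
```

A positive diff means the host ranks higher today than it did over the month, which is exactly what the final eval and sort in the query surface.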




Tuesday, February 4, 2014

In today's post we will cover another chapter of the Exploring Splunk book. This chapter is on enriching data. We can use commands like top and stats to explore the data. We can also add sparklines, which are small line graphs, to the data so that data patterns can be quickly and easily visualized.
With Splunk it is easy to exclude data that has already been seen. We do this with tagging. This helps distinguish interesting events from noise.
When we have identified the fields and explored the data, the next step is to categorize and report on it.
Different event types can be created to categorize the data. There are only two rules to keep in mind with event types: no pipes can be used in an event type declaration, and there can be no nested searches, a.k.a. subsearches, when creating event types. For example, status=2* could define success cases and status=4* client errors.
More specific event types can be built on top of more general event types. For example, web_error can include both client_errors and server_errors. The granularity of event types is left to user discretion, since the results matter to the user.
Event types can also have tags. A more descriptive tag about the errors enhances the event type.
As an example, user_impact event tag can be used to report on the events separately.
Together, event types and tags allow data categorization and reporting for voluminous machine data. Refining this model is usually an iterative effort. We could start with a few useful fields and then expand the search. All the while, this gives Splunk more input to organize and label the data.
We mentioned visualizing data with sparklines. We can also visualize data with charts and graphs. This is done from the create report tab of the search page.
For example, we can search with a query such as sourcetype=access* status=404 | stats count by category_id and then create a pie chart on the results. Hovering over the chart now gives details of the data.
Dashboards are yet another visualization tool. Here we present many different charts, graphs, and other visualizations in a reporting panel. As with most reporting, a dashboard caters to an audience and effectively answers a few questions that the audience would be most interested in. These can be gathered from user input and feedback iterations. As with charts and graphs, it's best to start with a few high-level fields before making it more sophisticated.
Alerts are another tool; they can run periodically, or on events, when search results meet a condition. There are three options for scheduling an alert. The first is to monitor whenever the condition happens. The second is to monitor on a scheduled basis, for less urgent information. The third is to monitor using a real-time rolling window, for when a certain number of things happen within a certain time period.
Alerts can have associated actions that make them all the more useful. The actions can be specified via the wizard. Some actions could be, say, sending an email, running a script, or showing triggered alerts.

To make the data more usable, Splunk allows enriching data with additional information so that Splunk can classify it better. Data can be saved in reports and dashboards that make it easier to understand. And alerts can be added so that potential issues can be addressed proactively, not after the fact.
The steps in organizing data usually involve identifying fields in the data and categorizing data as a preamble to aggregation and reporting. Preconfigured settings can be used to identify fields. These utilize hidden attributes embedded in machine data. When we search, Splunk automatically extracts fields by identifying common patterns in the data.
Configuring field extraction can be done in two ways: Splunk can automate the configuration by using the interactive field extractor, or we can specify the configuration manually.
Another way to extract fields is to use search commands. The rex command comes in very useful for this: it takes a regular expression and then extracts fields that match the expression. To extract fields from multiline tabular data, we use multikv, and to extract from XML and JSON data, we use spath or xmlkv.
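Conceptually, rex applies a regular expression with named groups to the raw event text and turns each group into a field. A Python sketch of that idea, with a made-up log line and field names:

```python
import re

# Sketch of what the rex command does: pull named fields out of raw
# event text with a regular expression. The log line is made up.
raw = "user=alice status=404"
fields = re.search(r"user=(?P<user>\w+) status=(?P<status>\d+)", raw).groupdict()
print(fields)   # -> {'user': 'alice', 'status': '404'}
```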
The search dashboard's field sidebar gives immediate information for each field, such as:
the basic data type of the field with abbreviations such as a for text and # for numeric
the number of occurrences of the field in the events list (following the field name)

Monday, February 3, 2014

We discuss the various kinds of processors within a Splunk pipeline. We have the monitor processor that looks for files and for new entries at the end of a file. Files are read one at a time, in 64KB chunks, until EOF. Large files and archives can be read in parallel. Next we have a UTF-8 processor that converts different character sets to UTF-8.
We have a LineBreaker processor that introduces line breaks in the data.
We also have a LineMerge processor that does the reverse.
We have a HeadProcessor that multiplexes different data streams into one channel such as TCP inputs.
We have a Regex replacement processor that searches and replaces the patterns.
We have an annotator processor that adds a punct field to raw events, so that similar events can be found.
The indexer pipeline has TCP output and syslog output, both of which send data to a remote server. The indexer processor sends data to disk. Data goes to a remote server or to disk, but usually not both.
An Exec processor is used to handle scripted input.
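A minimal sketch of how such a chain of processors could compose, with greatly simplified stand-ins for the processors named above (all names and behavior here are illustrative, not Splunk internals):

```python
# Each "processor" transforms the event stream and passes it along,
# mirroring the pipeline described above in a much-simplified form.

def utf8_processor(chunks):
    # stand-in for charset conversion: decode raw bytes to text
    return (c.decode("utf-8") for c in chunks)

def line_breaker(texts):
    # split the stream into individual line events
    for t in texts:
        yield from t.splitlines()

def annotator(events):
    # stand-in for punct annotation: attach a crude punctuation signature
    for e in events:
        punct = "".join(ch for ch in e if not ch.isalnum() and ch != " ")
        yield {"raw": e, "punct": punct}

def pipeline(chunks):
    return list(annotator(line_breaker(utf8_processor(chunks))))

events = pipeline([b"GET /a 200\nGET /b 404\n"])
```

Because each stage only consumes and produces a stream, stages can be inserted, removed, or reordered independently, which is the property that makes this kind of pipeline easy to extend.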