Today we look at a comparison between Splunk clustering and a Hadoop instance. In Hadoop, MapReduce is a high-performance parallel data-processing technique. It does not guarantee ACID properties and supports forward-only parsing. Data is stored in Hadoop such that the column names, column count and column datatypes don't matter. The data is retrieved in two steps, with a Map function and a Reduce function. The Map function selects keys from each line, along with the values to hold, resulting in a big hashtable; the Reduce function aggregates the results. The database stores these key-values as columns in a column family, and each row can have more than one column family. Splunk uses key maps to index the data, but has a lot to do in terms of Map-Reduce and the database. Splunk stores events. Its indexing is about events, together with their raw data, their index files and metadata. These are stored in directories organized by age, called buckets. Splunk clustering is about keeping multiple copies of data to prevent data loss and to improve data availability for searching. Search heads coordinate searches across all the peer nodes.
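To make the Map and Reduce steps concrete, here is a minimal sketch in Python rather than actual Hadoop or Splunk code; the log format and the position of the status field are assumptions for illustration. The map step emits key-value pairs from each line and the reduce step aggregates the values per key.

from collections import defaultdict

def map_phase(lines):
    # Emit (key, value) pairs; here the key is the HTTP status code
    # and the value is a count of 1 for each line.
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            yield (parts[1], 1)   # parts[1] assumed to be the status field

def reduce_phase(pairs):
    # Aggregate all values for the same key into one result.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = ["GET 200", "GET 404", "POST 200"]
print(reduce_phase(map_phase(lines)))   # {'200': 2, '404': 1}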
Sunday, July 13, 2014
Saturday, July 12, 2014
In this post, we talk about support for clustering in Splunk. Clustering is about replicating buckets and searchable data for tolerating failures in a distributed environment. There are two configuration settings to aid with replication: one that determines the replication of the raw data and another that determines the replication of the searchable data. Both are in the configuration file on the master. The master talks to the peers over HTTP. Peers talk to each other over S2S. The design is such that the peers talk to the master and vice versa, but the peers don't need to talk to one another. Basic configuration involves the forwarders sending data to the peers, and the search heads talking to both the master and the peers. The master does most of the management and the peers are the workhorses. Hot buckets are created by the indexes, but clustering changes their names so as to differentiate them across nodes. We have a cluster-wide bucket ID that comprises the index plus the ID plus the GUID. We replicate by slices of data in these hot buckets.
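As a rough sketch of those two settings, the master's configuration carries a replication factor (copies of raw data) and a search factor (copies of searchable data). The stanza below follows the shape of Splunk's server.conf; treat the exact attribute names as something to verify against the documentation for your version.

# On the cluster master (sketch; verify against your Splunk version)
[clustering]
mode = master
replication_factor = 3   # copies of raw data kept across the peers
search_factor = 2        # copies that are fully searchable (index files present)

# On each peer
[clustering]
mode = slave
master_uri = https://master.example.com:8089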
We don't enforce clustering policy on standalone buckets. On each bucket roll, we inform the master. The master keeps track of the states and does 'fixup'. We schedule a 'fixup' on all failures.
Fixup is what happens when a node goes down and we lose the buckets it was working on.
Rebuilding was a big problem because it took a lot of time.
Fixup work is broken down into six different levels (streaming, data_safety, generation, replication factor, search factor and checksum).
We schedule the highest-priority work at all times (a small scheduling sketch follows these notes).
When peers come up, they get the latest bundle from the master.
When a cluster node goes down, we can avoid messy state by taking it offline.
There are two versions of offline: the first is to wait for the master to complete (permanent); the second is to allow rebalancing of primaries by informing the master while still participating in searches until the master gets back to you.
The states are offline -> inputs (closed) -> wait -> done.
Primary means there is an in-memory bit mask for that generation.
Generation means snapshotting the states of the primaries across the system.
The master tracks which peers are participating in its current generation.
Each peer knows which generations it is a primary for.
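Since fixup work is prioritized across those six levels, one way to picture the master's scheduling is a priority queue keyed by level. This is only an illustrative sketch, not Splunk's implementation; the level ordering and bucket ID format are assumptions.

import heapq

# Lower number = higher priority; this ordering is assumed for illustration.
FIXUP_LEVELS = {"streaming": 0, "data_safety": 1, "generation": 2,
                "replication_factor": 3, "search_factor": 4, "checksum": 5}

class FixupScheduler:
    def __init__(self):
        self._queue = []

    def schedule(self, bucket_id, level):
        # Push the fixup task; the heap keeps the highest-priority task on top.
        heapq.heappush(self._queue, (FIXUP_LEVELS[level], bucket_id, level))

    def next_task(self):
        # Always hand out the highest-priority outstanding work.
        if not self._queue:
            return None
        _, bucket_id, level = heapq.heappop(self._queue)
        return bucket_id, level

scheduler = FixupScheduler()
scheduler.schedule("main~42~GUID", "search_factor")
scheduler.schedule("main~7~GUID", "streaming")
print(scheduler.next_task())   # ('main~7~GUID', 'streaming')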
Friday, July 11, 2014
In today's post we will continue our discussion. We will explore and describe the following:
Define SQL integration
Define user defined type system
Define common type system
Define user defined search operator
Define programmable operator
Define user programming interface for type system
Before we look at SQL integration, we want to look at the ways Splunk uses SQLite. With that disclaimer and rain check, I will proceed to what I want: to create SQL queries for externalized search and to build types out of fields.
First we are looking at a handful of SQL queries.
Next, we use the same schema as the key maps we already have.
I want to describe the use of a user-defined search processor. Almost all search processors implement a set of common methods. These methods already describe a set of expected behaviors for any processor that handles the input and output of search results. If these methods were exposed to the user via a programmable interface, then users could plug in any processor of their own. To expose these methods to the user, we need callbacks that we can invoke, and these can be registered as REST APIs by the user. The internal implementation of this custom search processor can then make these REST calls and marshal the parameters and the results.
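A sketch of that idea follows: a wrapper processor that keeps the common setup/execute shape but delegates the actual work to a REST endpoint the user has registered. The class name, method signatures, payload format and endpoint are all assumptions for illustration, not Splunk's actual processor API.

import json
import urllib.request

class CustomSearchProcessor:
    # Hypothetical wrapper; the real processor interface is internal to Splunk.
    def setup(self, args):
        # The user registers the URL of their REST callback as an argument.
        self.callback_url = args["callback_url"]

    def execute(self, search_results):
        # Marshal the results, POST them to the user's endpoint, and
        # unmarshal whatever the endpoint returns as the new results.
        payload = json.dumps({"results": search_results}).encode("utf-8")
        request = urllib.request.Request(
            self.callback_url, data=payload,
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read())["results"]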
Thursday, July 10, 2014
Another search processor for Splunk could be type conversion, that is, support for user-defined types in the search bar. Today we have fields that we can extract from the data. Fields are like key-value pairs, so users define their queries in terms of keys and values. Splunk also indexes key-value pairs so that their look-ups are easier. Key-value pairs are very helpful in associations between different SearchResults and in working with different processors. However, support for user-defined types could change the game and become a tremendous benefit to the user. This is because user-defined types associate not just one field but several fields with the data, and in a way the user defines. This is different from tags. Tags can also be helpful to the user for labeling and defining the groups he cares about. However, support for types and user-defined types goes beyond mere fields. This is quite involved in that it affects the parser, the indexer, the search result retrieval and the display.
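To illustrate what a user-defined type could look like on top of fields, here is a minimal sketch: a type is a named composition of fields plus a way to materialize an instance from an event's key-value pairs. The type name, its fields and the sample event are assumptions, not an existing Splunk feature.

from dataclasses import dataclass

@dataclass
class WebRequest:
    # A hypothetical user-defined type composed of several extracted fields.
    clientip: str
    method: str
    status: int

def from_event(fields):
    # Build a typed instance from the key-value pairs of one event.
    return WebRequest(clientip=fields["clientip"],
                      method=fields["method"],
                      status=int(fields["status"]))

event = {"clientip": "10.0.0.1", "method": "GET", "status": "404"}
print(from_event(event))
# WebRequest(clientip='10.0.0.1', method='GET', status=404)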
But first let us look at a processor that can support extract, transform and load kinds of operations. We support these via search pipeline operators, where the search results are piped to different operators that can handle one or more of the said operations. For example, if we wanted to transform the raw data behind the search results into XML, we could have an 'xml' processor that transforms it into a single result with the corresponding XML as the raw data. This lends itself to other data transformations or XML-style querying by downstream systems. XML, as we know, is a different form of data than tabular or relational. Tabular or relational data can have compositions that describe entities and types. We don't have a way to capture the type information today, but that doesn't mean we cannot plug into a system that does. For example, database servers handle types and entities. If Splunk were to have a connector where it could send XML downstream to a SQLite database and shred the XML into relational data, then Splunk doesn't even have the onus to implement a type-based system. It can then choose to implement just the SQL queries that let the downstream database handle it. These SQL queries can even be saved and reused later. Splunk uses SQLite today. However, the indexes that Splunk maintains are different from the indexes that a database maintains. Therefore, extract, transform and load of data to downstream systems could be very helpful. Today atom feeds may be one way to do that, but search results are even more intrinsic to Splunk.
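As a sketch of that connector idea, assume the search results have already been rendered as XML in the shape shown in the earlier post; the snippet below shreds that XML into a SQLite table. The table name and columns are assumptions for illustration.

import sqlite3
import xml.etree.ElementTree as ET

xml_data = """<SearchResults>
  <SearchResult><key1>value1</key1><key2>value2</key2></SearchResult>
  <SearchResult><key1>value3</key1><key2>value4</key2></SearchResult>
</SearchResults>"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (key1 TEXT, key2 TEXT)")  # assumed schema

# Shred each SearchResult element into one relational row.
for result in ET.fromstring(xml_data):
    conn.execute("INSERT INTO results VALUES (?, ?)",
                 (result.findtext("key1"), result.findtext("key2")))

for row in conn.execute("SELECT key1, key2 FROM results"):
    print(row)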
In this post, I hope I can address some of the following objectives; otherwise I will try to elaborate on them in the next few posts.
Define why we need an XML operator
The idea behind converting tables or CSVs to XML is that it provides another avenue for integration with data systems that rely on such a data format. Why are there special systems using XML data? Because data in XML can be validated independently with XSD, provides hierarchical and well-defined tags, enables a very different and useful querying system, etc. Up until now, Splunk has relied on offline and file-based dumping of XML. Such offline methods do not improve the workflow users have when integrating with systems such as a database. To facilitate the extract, transform and load of search results into databases, one has to have better control over the search results. XML is easy to import and shred in databases for further analysis or archival. The ability to integrate Splunk with a database does not diminish the value proposition of Splunk. If anything, it improves the usability and customer base of Splunk by adding more customers who rely on a database for analysis.
Define SQL integration
Define user defined type system
Define common type system
Define user defined search operator
Define programmable operator
Define user programming interface for type system
Tuesday, July 8, 2014
I wonder why we don't have a search operator that translates the search results to XML?
I'm thinking something like this:
Conversion from:
Search Result 1 : key1=value1, key2=value2, key3=value3
Search Result 2 : key1=value1, key2=value2, key3=value3
Search Result 3 : key1=value1, key2=value2, key3=value3
To:
<SearchResults>
  <SearchResult1>
    <key1>value1</key1>
    <key2>value2</key2>
    <key3>value3</key3>
  </SearchResult1>
  :
</SearchResults>
This could even operate on tables and convert them to XML.
And it seems straightforward to implement a Search processor that does this.
The main thing to watch out for is memory growth during the XML conversion. The number of search results can be arbitrarily large, potentially causing unbounded growth of the XML string, so we are better off writing it to a file. At the same time, the new result with the converted XML is useful only when the format and content of the XML are required in a particular manner and serve as input to other search operators. Otherwise, the atom feed of Splunk already has an XML output mode.
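To keep the memory bounded, the conversion can stream each result to a file instead of building one large string. A minimal sketch, with illustrative field names and output path:

from xml.sax.saxutils import escape

def write_results_as_xml(results, path):
    # Stream one <SearchResult> element per result so memory stays bounded
    # regardless of how many results the search returns.
    with open(path, "w", encoding="utf-8") as out:
        out.write("<SearchResults>\n")
        for result in results:
            out.write("  <SearchResult>\n")
            for key, value in result.items():
                out.write("    <%s>%s</%s>\n" % (key, escape(str(value)), key))
            out.write("  </SearchResult>\n")
        out.write("</SearchResults>\n")

results = [{"key1": "value1", "key2": "value2"},
           {"key1": "value3", "key2": "value4"}]
write_results_as_xml(results, "results.xml")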
Monday, July 7, 2014
Today we review search processors in Splunk. There are several processors that can be invoked to get the desired search results. These often translate to the search operators in the expression and follow a pipeline model. The pipeline is a way to redirect the output of one operator into the input of another. All of these processors implement a similar execute method that takes SearchResults and SearchResultsInfo as arguments. The processors also have setup and initialization methods where they process the arguments to the operators. The model is simple and portable to any language.
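That execute-and-pipe shape can be sketched as below; the class and method names are illustrative, not Splunk's internal API.

class SearchProcessor:
    # Common shape: parse operator arguments once, then process results.
    def setup(self, args):
        self.args = args

    def execute(self, results, info):
        raise NotImplementedError

class HeadProcessor(SearchProcessor):
    def execute(self, results, info):
        # Keep only the first N results, like the head operator.
        return results[: int(self.args.get("count", 10))]

class TailProcessor(SearchProcessor):
    def execute(self, results, info):
        # Keep only the last N results, like the tail operator.
        return results[-int(self.args.get("count", 10)):]

def run_pipeline(processors, results, info=None):
    # Redirect the output of one processor into the input of the next.
    for processor in processors:
        results = processor.execute(results, info)
    return results

head, tail = HeadProcessor(), TailProcessor()
head.setup({"count": 5})
tail.setup({"count": 2})
print(run_pipeline([head, tail], list(range(10))))   # [3, 4]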
We will now look at some of these processors in detail.
We have the Rex processor, which implements the rex operations on the search results. If needed, it generates a field that contains the start/end offset of each match, and it creates a mapping from groupId to key index if and only if it is not in sed mode. (A rex-style field extraction is sketched after this list.)
We have the query suggestor operator which suggests useful keywords to be added to your search. This works by ignoring some keywords and keeping a list of samples.
The Head processor iterates over the results to display only the truncated set.
The tail processor shows the last few results.
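For a feel of what rex-style extraction does, here is a small sketch that uses a regular expression with named groups to pull fields out of raw event text; the pattern and field names are assumptions, not Splunk's rex implementation.

import re

# Named groups become extracted fields, similar in spirit to the rex operator.
PATTERN = re.compile(r"(?P<clientip>\d+\.\d+\.\d+\.\d+) .* (?P<status>\d{3})$")

def extract_fields(raw_event):
    match = PATTERN.search(raw_event)
    return match.groupdict() if match else {}

print(extract_fields("10.0.0.1 GET /index.html 200"))
# {'clientip': '10.0.0.1', 'status': '200'}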
Sunday, July 6, 2014
In the previous post we mentioned the fields and table operators. We will round up the discussion with the removal of cell values, columns and rows in a result set. When we want to filter a result set, we can work at it by removing one cell after another as we traverse the columns and rows. This gives us the highest granularity to cut the result set into the shape we want. However, this, as you can see, is also an expensive operation. Is there a better way to optimize it? Yes, there is. One way would be to remove just the fields, so the cells remain but are not listed, with the user specifying the choices at the field level. Note that iterating over the rows to filter out the ones that don't match the criteria is still required, but it is about as inexpensive as not including a pointer. Thus we can project exclusively on fields to handle the functionality of both the fields operator and the table operator. In the case where we have a large number of results to run through, this method is fairly efficient and takes little time to execute. In order to match the fields with the (+/-) choices to include or exclude them, we can just test whether the minus sign or remove attribute has been specified and compare that with whether the field matches the search criteria. If the remove attribute is present and there is a match, we exclude the field. If the remove attribute is absent and there is no match, then too the field is excluded. This way we succinctly check whether the fields are to be removed. This is not the same for the table operator. In the case of table there is no syntax for a remove attribute, hence the check to include only the columns specified in a match is required.
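The include/exclude check described above boils down to a single comparison: a field is kept when exactly one of 'the remove attribute is present' and 'the field matches the specification' is true. A minimal sketch of that projection, with illustrative names:

from fnmatch import fnmatch

def keep_field(field, patterns, remove):
    # remove=True corresponds to the '-' form of the fields operator.
    matches = any(fnmatch(field, p) for p in patterns)
    # Exclude when (remove and match) or (not remove and no match);
    # equivalently, keep when exactly one of the two is true.
    return matches != remove

def project(results, patterns, remove=False):
    return [{k: v for k, v in row.items() if keep_field(k, patterns, remove)}
            for row in results]

rows = [{"host": "web01", "status": "200", "bytes": "512"}]
print(project(rows, ["status", "bytes"]))       # keep only status and bytes
print(project(rows, ["bytes"], remove=True))    # drop bytes, keep the rest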