Thursday, July 10, 2014

Another search processor for Splunk could be type conversion, that is, support for user-defined types in the search bar. Today we have fields that we can extract from the data. Fields are like key-value pairs, so users define their queries in terms of keys and values. Splunk also indexes key-value pairs so that look-ups are easier. Key-value pairs are very helpful for associations between different SearchResults and for working with different processors. However, support for user-defined types could change the game and become a tremendous benefit to the user, because a user-defined type associates not just one field but several fields with the data, and in a way the user defines. This is different from tags. Tags are also helpful to the user for labeling and defining the groups he cares about, but support for types, and user-defined types in particular, goes beyond mere fields. It is quite involved in that it affects the parser, the indexer, the search result retrieval and the display.
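To make the idea concrete, here is a purely hypothetical sketch in Python of what a user-defined type could look like: a named bundle of fields that a search result either satisfies or not. None of these names exist in Splunk; they only illustrate the notion of associating more than one field with the data in a user-defined way.

from dataclasses import dataclass

@dataclass
class UserType:
    name: str
    fields: tuple   # the fields that together make up the type

# A small registry that the parser/indexer could consult (hypothetical).
TYPE_REGISTRY = {
    "HttpRequest": UserType("HttpRequest", ("method", "uri", "status")),
}

def as_type(result, type_name):
    # Project a raw key-value search result onto a user-defined type;
    # return None when the result does not carry all of the type's fields.
    utype = TYPE_REGISTRY[type_name]
    if all(f in result for f in utype.fields):
        return {f: result[f] for f in utype.fields}
    return None

# Example: a result whose extracted fields satisfy the type
print(as_type({"method": "GET", "uri": "/index", "status": "200"}, "HttpRequest"))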
But first let us look at a processor that can support extract, transform and load (ETL) kinds of operations. We support these via search pipeline operators, where the search results are piped through different operators that each handle one or more of the said operations. For example, if we wanted to transform the raw data behind the search results into XML, we could have an 'xml' processor that transforms it into a single result with the corresponding XML as the raw data. This lends itself to other data transformations or XML-style querying by downstream systems. XML, as we know, is a different form of data than tabular or relational. Tabular or relational data can have compositions that describe entities and types. We don't have a way to capture the type information today, but that doesn't mean we cannot plug into a system that does. For example, database servers handle types and entities. If Splunk had a connector where it could send XML downstream to a SQLite database and shred the XML into relational data, then Splunk wouldn't even have the onus of implementing a type-based system. It could then choose to implement just the SQL queries and let the downstream database handle the rest. These SQL queries could even be saved and reused later. Splunk uses SQLite today; however, the indexes that Splunk maintains are different from the indexes that a database maintains. Therefore, extract, transform and load of data to downstream systems could be very helpful. Today atom feeds may be one way to do that, but search results are even more intrinsic to Splunk.
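As a rough illustration of the shredding step, here is a minimal Python sketch that loads an XML document of search results into SQLite. It assumes the XML shape shown in the July 8 post below; the table and column names are made up for illustration and are not part of any Splunk connector.

import sqlite3
import xml.etree.ElementTree as ET

xml_doc = """
<SearchResults>
  <SearchResult><key1>value1</key1><key2>value2</key2></SearchResult>
  <SearchResult><key1>value4</key1><key2>value5</key2></SearchResult>
</SearchResults>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (key1 TEXT, key2 TEXT)")

# Shred each result element into one relational row.
for result in ET.fromstring(xml_doc):
    row = {child.tag: child.text for child in result}
    conn.execute("INSERT INTO results VALUES (?, ?)",
                 (row.get("key1"), row.get("key2")))

# The saved SQL queries mentioned above would then run against this table.
print(conn.execute("SELECT * FROM results").fetchall())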

In this post, I hope to address some of the following objectives; otherwise I will try to elaborate on them in the next few.
Define why we need an xml operator
The idea behind converting tables or CSVs to XML is that it provides another avenue for integration with data systems that rely on that format. Why are there special systems using XML data? Because data in XML can be validated independently with XSD, provides hierarchical and well-defined tags, enables a very different and useful querying system, and so on. Until now, Splunk relied on offline, file-based dumping of XML. Such offline methods did not improve the workflow users have when integrating with systems such as a database. To facilitate the extract, transform and load of search results into databases, one has to have better control over the search results. XML is easy to import and shred in databases for further analysis or archival. The ability to integrate Splunk with a database does not diminish the value proposition of Splunk. If anything, it improves the usability and customer base of Splunk by adding customers who rely on a database for analysis.
Define SQL integration
Define user defined type system
Define common type system
Define user defined search operator
Define programmable operator
Define user programming interface for type system

Tuesday, July 8, 2014

I wonder why we don't have a search operator that translates the search results to XML?

I'm thinking something like this:
Conversion from:
Search Result 1: key1=value1, key2=value2, key3=value3
Search Result 2: key1=value1, key2=value2, key3=value3
Search Result 3: key1=value1, key2=value2, key3=value3

To:
<SearchResults>
  <SearchResult1>
    <key1>value1</key1>
    <key2>value2</key2>
    <key3>value3</key3>
  </SearchResult1>
  ...
</SearchResults>

This could even operate on tables and convert them to XML.

And it seems straightforward to implement a Search processor that does this.


The main thing to watch out for is the memory growth during the XML conversion. The number of search results can be arbitrary, potentially causing unbounded growth of the XML string, so we are better off writing it to a file. At the same time, the new result with the converted XML is useful only when the format and content of the XML are required in a particular manner and serve as an input to other search operators. Otherwise, the atom feed of Splunk already has an XML output mode.
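If we do stream to a file as suggested above, the conversion itself is only a few lines. The sketch below is not Splunk's processor interface; it assumes the results arrive as an iterable of key-value dictionaries and that the field names are valid XML tag names.

from xml.sax.saxutils import escape

def results_to_xml(results, path):
    # Stream each result to the file as it is converted so the XML string
    # never has to be held in memory in full.
    with open(path, "w", encoding="utf-8") as out:
        out.write("<SearchResults>\n")
        for i, result in enumerate(results, start=1):
            out.write("  <SearchResult%d>\n" % i)
            for key, value in result.items():
                # Keys are assumed to be valid XML element names.
                out.write("    <%s>%s</%s>\n" % (key, escape(str(value)), key))
            out.write("  </SearchResult%d>\n" % i)
        out.write("</SearchResults>\n")

results_to_xml(
    [{"key1": "value1", "key2": "value2", "key3": "value3"},
     {"key1": "value1", "key2": "value2", "key3": "value3"}],
    "searchresults.xml")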

Monday, July 7, 2014

Today we review search processors in Splunk. There are several processors that can be invoked to get the desired search results. These often translate to the search operators in the expression and follow a pipeline model. The pipeline is a way to redirect the output of one operator into the input of another. All of these processors implement a similar execute method that takes SearchResults and SearchResultsInfo as arguments. The processors also have setup and initialization methods where they process the arguments to the operators. The model is simple and portable to any language.
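To picture that model, here is a small sketch in Python; the class and method names mirror the description above, not Splunk's actual implementation.

class SearchProcessor:
    def setup(self, args):
        # Parse and validate the operator's arguments from the search bar.
        self.args = args

    def execute(self, search_results, search_results_info):
        # Transform the incoming results and hand them to the next
        # processor in the pipeline.
        raise NotImplementedError

class HeadProcessor(SearchProcessor):
    def setup(self, args):
        self.limit = int(args.get("count", 10))

    def execute(self, search_results, search_results_info):
        # Keep only the first 'limit' results, like '| head 10'.
        return search_results[:self.limit]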
We will now look at some of these processors in detail.
We have the Rex processor, which implements the rex operations on the search results. If needed, it generates a field that contains the start/end offset of each match, and it creates a mapping from group id to key index only when it is not in sed mode. A rough sketch of this kind of extraction appears after the list of processors below.
We have the query suggestor operator which suggests useful keywords to be added to your search. This works by ignoring some keywords and keeping a list of samples.
The Head processor iterates over the results to display only the truncated set.
The tail processor shows the last few results.
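As a loose illustration of the offset bookkeeping the Rex processor does (this is not its actual code), a regular expression with named groups can yield both the extracted fields and the start/end position of each one:

import re

def rex_extract(raw, pattern):
    fields, offsets = {}, {}
    match = re.search(pattern, raw)
    if match:
        for name, value in match.groupdict().items():
            fields[name] = value
            offsets[name] = match.span(name)   # (start, end) within raw
    return fields, offsets

raw_event = "127.0.0.1 - GET /index.html 200"
print(rex_extract(raw_event, r"(?P<method>GET|POST)\s+(?P<uri>\S+)\s+(?P<status>\d+)"))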

Sunday, July 6, 2014

In the previous post we mentioned the fields and table operators. We will round up the discussion with the removal of cell values, columns and rows in a result set. When we want to filter a result set, we can work at it by removing one cell after the other as we traverse the columns and rows. This gives us the highest granularity to cut the result set into the shape we want. However, as you can see, this is also an expensive operation. Is there a better way to optimize it? Yes, there is. One way would be to remove just the fields, so the cells remain but are not listed, with the user having specified the choices at the field level. Note that iterating over the rows to filter out the ones that don't match the criteria is still required, but it is about as inexpensive as not including a pointer. Thus we can project exclusively on fields to handle the functionality of both the fields and the table operator. In the case where we have a large number of results to run through, this method is fairly efficient and takes little time to execute. To match the fields against the (+/-) choices that include or exclude them, we can just test whether the minus sign, or remove attribute, has been specified and compare it with whether the field matches the search criteria. If the remove attribute is present and there is a match, we can exclude the field. If the remove attribute is absent and there is no match, then too the field can be excluded. This way we succinctly check whether a field is to be removed. This is not the same for the table operator: in the case of table there is no syntax for a remove attribute, hence a check for the match, to include only the columns specified, is required.
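The include/exclude check described above reduces to a single comparison. A minimal sketch, assuming a boolean remove attribute and an exact-name match for simplicity:

def keep_field(field, selected, remove):
    # 'selected' is the set of field names given to the operator;
    # 'remove' is True when the operator was invoked with the '-' sign.
    match = field in selected
    # Exclude when remove-and-match or when include-mode-and-no-match;
    # equivalently, keep the field when exactly one of the two holds.
    return match != remove

# A 'fields - key2' style call: key2 is dropped, everything else stays.
result = {"key1": "value1", "key2": "value2", "key3": "value3"}
print({k: v for k, v in result.items() if keep_field(k, {"key2"}, remove=True)})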
In this post, we will discuss some of the core features of Splunk. In particular we will discuss how the fields operator is different from the table operator. Both of these operators can be specified in the search bar, and they work to project different fields or columns of the data. In the case of the fields operator, raw results are returned that are similar to the original search results, only restricted by the presence of the specified fields. If we wish to exclude fields, we can specify the negative sign as the first argument before the fields. The positive sign is optional, as it is understood. The table operator works by selecting columns just as any projection operator would, based on enumerating all the available columns and selecting only a few of them for projection. As you can see, the fields and table operators are similar in selecting fields specified by the user from the available list of fields. These fields have to be those available from the header and/or defined by the user. In earlier versions the fields were not restricted to exclude indexed fields or reserved fields, but there is an argument favoring their exclusion, since the user sees the extracted fields anyway and there won't be any change in behavior otherwise. The presence of the indexed fields is different, though. The indexed fields are different because they are used and should not be excluded from the search results. Again, this means there won't be any change in behavior for the user, because these fields are automatically extracted and displayed. Behind the scenes, what happens in earlier versions is that the various reserved fields are added to the operators internally during search dispatch, but are not handled within the processor of the operator itself. So the user doesn't see a change when we remove the explicit addition of some reserved fields that could have become obsolete or replaced; they would have been better consolidated into the processor logic itself. The most important thing here is that the table and fields operators have different output formats, and a fields operator can be combined with the table operator to modify the results.
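To make the difference in output formats concrete, here is a small sketch (again not Splunk's code): the fields operator keeps the shape of the raw results and only trims the field list, while the table operator emits rows of just the chosen columns.

def fields_op(results, selected):
    # Each result stays an event-like dict; only the field list is trimmed.
    return [{k: v for k, v in r.items() if k in selected} for r in results]

def table_op(results, columns):
    # Output is tabular: one row per result, columns in the requested order,
    # with an empty cell wherever a result lacks the field.
    return [[r.get(c, "") for c in columns] for r in results]

events = [{"host": "web1", "status": "200", "uri": "/a"},
          {"host": "web2", "status": "404"}]
print(fields_op(events, {"host", "status"}))
print(table_op(events, ["host", "status", "uri"]))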

Saturday, July 5, 2014

Today I'm going to cover a new book as discussed in the previous post. This book talks about fostering and maintaining organizational excellence. It is not only about executing in the short term but also about maintaining health over the long term. To create a culture of continuous improvement, the authors recommend a process with five steps: aspire, assess, architect, act and advance. Health is defined in terms of nine elements:
Direction - where to head
Leadership
Culture and climate
Accountability
Coordination and control
Capabilities
Motivation
External orientation
Innovation and learning
To aspire, we can be leadership driven, have an execution edge, build from a market focus, or have a knowledge core.
To architect, we need to identify the right set of initiatives and define each initiative with a compelling story.
To act, we choose the right delivery model and define the change engine with a structure, ownership and evaluation. To advance, we seek a continuous improvement infrastructure that is built on meaning, framing, connecting and engaging.

Thursday, July 3, 2014

Today we will continue to discuss the book we started in yesterday's blog post. Tomorrow we will discuss the book Beyond Performance. In yesterday's post we listed the five principles: define the purpose, engage multiple perspectives, frame the issues, set the scene and make it an experience. We will review these in detail now.
When defining the purpose, it is important to design the session well so as to unlock an interdisciplinary solution. If the participants are unfamiliar with the issue, they may need an educational session first. If the participants are well aware of the issues but are spinning their wheels, they may need a making-choices session.
When engaging multiple perspectives, it is important to find the right mix of people. Sometimes this can mean a dream team; other times it could mean fresh blood.
A common platform should be created to improve creative collaboration. This can be done with a group identity, a common target,  interactions and sharing.
When framing the issues,  it is a key practice to establish boundaries and scope.  The mindsets of the participants should be stretched but not broken by carefully keeping the contents and perspectives in balance.
When setting the scene, it is important to pay attention to every detail that can help with the collaboration.
When making it an experience, it is important to consider the whole person and his/her thinking inside the box.