Tuesday, February 4, 2014

To make the data more usable, Splunk lets us enrich data with additional information so that Splunk can classify it better. Data can be saved in reports and dashboards that make it easier to understand, and alerts can be added so that potential issues can be addressed proactively rather than after the fact.
The steps in organizing data usually involve identifying fields in the data and categorizing data as a preamble to aggregation and reporting. Preconfigured settings can be used to identify fields; these utilize hidden attributes embedded in machine data. When we search, Splunk automatically extracts fields by identifying common patterns in the data.
Configuring field extraction can be done in two ways - Splunk can automate the configuration by using the interactive field extractor, or we can manually specify the configuration.
Another way to extract fields is to use search commands. The rex command comes in very useful for this: it takes a regular expression and then extracts fields that match the expression. To extract fields from multiline tabular data, we use multikv, and to extract from XML and JSON data, we use spath or xmlkv.
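For example (the field names here are illustrative), rex can pull sender and recipient addresses out of raw events with named capture groups:
    ... | rex "From: (?<from>.*) To: (?<to>.*)"
This creates the fields from and to on the events that match the pattern.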
The search dashboard's field sidebar gives immediate information for each field, such as:
the basic data type of the field, with abbreviations such as "a" for text and "#" for numeric
the number of occurrences of the field in the events list (shown following the field name)

Monday, February 3, 2014

We discuss the various kinds of processors within a Splunk pipeline. We have the monitor processor, which looks for files and picks up the entries appended at the end of each file. Files are read one at a time, in 64KB chunks, until EOF. Large files and archives can be read in parallel. Next we have a UTF-8 processor that converts data from other character sets to UTF-8.
We have a LineBreaker processor that introduces line breaks in the data.
We also have a LineMerge processor that does the reverse.
We have a HeadProcessor that multiplexes different data streams, such as TCP inputs, into one channel.
We have a Regex replacement processor that searches for patterns and replaces them.
We have an annotator processor that adds a punct field to raw events, so that similarly structured events can be found.
The Indexer pipeline has TCP output and Syslog output, both of which send data to a remote server, and the indexer processor, which sends data to disk. Data goes to a remote server or to disk, but usually not both.
An Exec processor is used to handle scripted input.
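As a rough illustration only (the stanza name below is hypothetical), some of these stages surface as per-sourcetype settings in props.conf - the character set conversion, where the line breaker splits events, and whether the line-merge stage recombines lines:
    [my_sourcetype]
    CHARSET = UTF-8
    LINE_BREAKER = ([\r\n]+)
    SHOULD_LINEMERGE = false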

Sunday, February 2, 2014

With this post, I will now return to my readings on Splunk from the book Exploring Splunk. Splunk has a server and clients. The Splunk engine is exposed via REST-based APIs to the CLI, the web interface and other interfaces.
The engine has multiple layers of software. At the bottom layer are components that read from different source types such as files, network ports or scripts. The layer above is used for routing, cloning and load balancing the data feeds, and this depends on the load, which is generally distributed for better performance. All the data is subject to indexing, and an index is built. Note that both the indexing layer and the layer below it, i.e. routing, cloning and load balancing, are deployed and set up with user access controls. This is essentially where what gets indexed, and by whom, is decided. The choice is left to users because we don't want sharing or privacy violations, and by leaving it configurable we are independent of how much or how little is sent our way for processing.
The layer on top of the index is search, which determines the processing involved in retrieving results from the index. The search query language is used to describe the processing, and the searching is distributed across workers so that the results can be processed in parallel. The layers on top of search are scheduling/alerting, reporting and knowledge, each of which is a dedicated component in itself. The results from these are sent through the REST-based API.
Pipeline refers to the data transformations the data goes through as it changes shape, form and meaning before being indexed. Multiple pipelines may be involved before indexing. A processor performs a small but logical unit of work, and processors are logically contained within a pipeline. Queues hold the data between pipelines, with producers and consumers operating on the two ends of a queue.
File input is monitored in two ways - a file watcher that scans directories and finds files, and a reader that tails the files where new data is being appended.
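As a conceptual sketch only - this is not Splunk's implementation, and the pipeline names are just illustrative - two pipelines joined by a queue, with a producer on one end and a consumer on the other, could look like this in Python:

    import queue, threading

    q = queue.Queue(maxsize=1000)      # holds events between the two pipelines

    def parsing_pipeline(lines):
        # producer: runs its processors (line breaking, charset conversion, ...) and enqueues
        for line in lines:
            q.put(line.strip())
        q.put(None)                    # sentinel: no more data

    def indexing_pipeline():
        # consumer: dequeues events and hands them to the index processor
        while True:
            event = q.get()
            if event is None:
                break
            print("indexing:", event)

    t = threading.Thread(target=indexing_pipeline)
    t.start()
    parsing_pipeline(["event one\n", "event two\n"])
    t.join()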
In today's post, I want to talk about a nuance of K-means clustering. Here the vectors are assigned to the nearest cluster, where nearness is measured against the cluster centroids. There are ways to assign clusters without centroids, based on single link, complete link etc., but centroid-based clustering is the easiest in that the computations are limited.
Note that the number of clusters is specified before the start of the program. This means that we don't change the expected outcome - we don't return fewer than the requested clusters. Even if all the data points belong to one cluster, this method aims to partition the n data points into k clusters, each with its own mean or centroid.
The mean is recomputed after each round of assigning data points to clusters. We start with k initial means (three, say). At each step, the data points are assigned to the nearest cluster. We also do not let clusters stay empty: if a cluster becomes empty because its members join other clusters, that cluster should take an outlier from an already populated cluster. This way the coherency of each of the clusters goes up.
If the number of clusters is large to begin with and the data set is small, this will lead to a highly partitioned data set that is not adequately represented by one or more of the clusters, and the resulting clusters may need to be taken together to see the overall distribution. However, this is not an undesirable outcome; it is the expected outcome for the number of partitions specified. The number of partitions was simply specified to be too high, and this will be reflected by the chi-square goodness of fit. The next step then should be to reduce the number of clusters.
If we specify only two clusters and all the data points are visually close to one predominant cluster, the other cluster still need not be kept empty; we can improve the predominant cluster by moving one of its outliers into the secondary cluster.
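Here is a minimal Python sketch of this scheme (one-dimensional points and illustrative values, not production code), in which an empty cluster takes the worst outlier from the most populated cluster:

    import random

    def kmeans(points, k, iterations=20):
        centroids = random.sample(points, k)
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            # assignment step: each point goes to the nearest centroid
            for p in points:
                nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
                clusters[nearest].append(p)
            # re-seed any empty cluster with the outlier of the largest cluster
            for i in range(k):
                if not clusters[i]:
                    donor = max(range(k), key=lambda j: len(clusters[j]))
                    outlier = max(clusters[donor], key=lambda p: (p - centroids[donor]) ** 2)
                    clusters[donor].remove(outlier)
                    clusters[i].append(outlier)
            # update step: recompute each centroid as the mean of its members
            centroids = [sum(c) / len(c) for c in clusters]
        return centroids, clusters

    print(kmeans([1.0, 1.1, 0.9, 5.0, 5.2, 9.9], k=3))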

Saturday, February 1, 2014

In this post, I give some examples of DTrace:
DTrace is a tracing tool that we can use dynamically and safely on production systems to diagnose issues across layers. The common DTrace providers are:
dtrace - BEGIN, END and ERROR probes
syscall - entry and return probes for all system calls
fbt - entry and return probes for all kernel functions
profile - timer-driven probes
proc - process creation and lifecycle probes
pid - entry and return probes for user-level functions in a given process
io - probes for all I/O-related events
sdt/usdt - statically defined probes added by developers
sched - for all scheduling-related events
lockstat - for all locking behavior within the operating system
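To see which probes a given provider publishes, we can list them; for example, for the proc provider:
dtrace -ln 'proc:::'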
The syntax for a clause is: probe-description /predicate/ { action }
Variables (e.g. self->varname = 123) and associative arrays (e.g. name[key] = expression) can be declared. They can be global, thread-local or clause-local. Associative arrays are looked up based on keys.
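For example, a predicate together with a thread-local variable can be used to time system calls for one process (the process name here is just illustrative):
dtrace -n 'syscall::read:entry /execname == "sshd"/ { self->ts = timestamp; } syscall::read:return /self->ts/ { @t["read latency (ns)"] = sum(timestamp - self->ts); self->ts = 0; }'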
Common builtin variables include:
args: the typed arguments to the current probe
curpsinfo: the process state for the process associated with the current thread
execname: the name of the current process's executable
pid: the process id of the current process
probefunc | probemod | probename | probeprov: the function name, module name, name and provider name of the current probe
timestamp | vtimestamp: a nanosecond timestamp, and the amount of time the current thread has been running on CPU
Aggregating functions include count, sum, avg, min, max, lquantize, quantize, clear, trunc etc.
Actions include trace, printf, printa, stack, ustack, stop, copyinstr, strjoin and strlen.
DTrace oneliners:
Trace new processes:
dtrace -n 'proc:::exec-success { trace(curpsinfo->pr_psargs); }'
Trace files opened
dtrace -n 'syscall::openat*:entry { printf("%s,%s", execname, copyinstr(arg1)); }'
Trace number of syscalls
dtrace -n 'syscall:::entry {@num[execname] = count(); trace(execname); }'
Trace lock times by process name
dtrace -n 'lockstat:::adaptive-block { @time[execname] = sum(arg1); }'
Trace file I/O by process name
dtrace -n 'io:::start { printf("%d %s %d", pid, execname, args[0]->b_bcount);}'
Trace the writes in bytes by process name
dtrace -n 'sysinfo:::writech { @bytes[execname] = sum(arg0); }'
 

Friday, January 31, 2014

In this post, we look at a few of the examples from the previous post:
    source = job_listings | where salary > industry_average
uses the predicate to filter the results.
    dedup source  sortby -delay
removes duplicate results that share the same source value, sorting the results by the delay field in descending order
    head (action="startup")
returns events from the beginning of the result set for as long as action="startup" is true, stopping at the first event that does not match
   transaction clientip maxpause=5s
groups events that share the same client IP address and have no gaps or pauses longer than five seconds
The resulting transactions include fields such as duration and eventcount
The stats command calculates statistical values on events grouped by the values of the specified fields.
    stats dc(host) returns distinct count of host values
    stats count(eval(method="GET")) as GET by host returns the number of GET requests for each web server. Percentile and range are other functions that can be used with it.
Charting values over time is done with timechart only; it is not applicable to chart or stats.
   chart max(delay) over host
returns max(delay) for each value of host
   timechart span=1m avg(CPU) by host
charts the average value of CPU usage each minute for each host.
Filtering, modifying and adding fields can be done with commands such as eval, rex, and lookup.
The eval command calculates the value of a new field based on an existing field.
The rex command is used to create new fields by using regular expressions
The lookup command adds fields based on matching values in a lookup table (see the examples below)
Fields can be specified as a list (col1, ..., colN) or with wildcard characters
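For example (the field and lookup table names are illustrative):
    ... | eval delay_secs = delay / 1000
computes a new delay_secs field from the existing delay field, and
    ... | lookup status_codes status OUTPUT description
adds a description field by matching each event's status value against the status_codes lookup table.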

Thursday, January 30, 2014

We will now describe the Search Processing Language. We mentioned earlier that Splunk shifts the focus from organizing data to useful queries. The end result may be only a few records from a mountain of original data, and the value lies in how easily the query language lets us retrieve that result.
Search commands are separated by the pipe operator, the well-known operator that redirects the output of one command as input to the next. For example, we could retrieve selected columns of the top few rows of an input data set by chaining search | top | fields commands with their qualifiers.
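For instance (the source type and search terms are illustrative):
    sourcetype=syslog ERROR | top user | fields - percent
retrieves syslog events containing ERROR, reports the users that appear most often, and removes the percent column from the output.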
If we are not sure what to filter on, we can list all events, group them and even cluster them. There are some mining tips available as well. This method of exploration has been termed 'spelunking', hence the name of the product.
Some tips for using the search commands: use quotation marks around phrases, rely on the case-insensitivity of search terms, remember that boolean AND is implied between terms unless OR is explicitly specified (OR has higher precedence), and use subsearches, where one search command is the argument to another and is specified in square brackets.
The common SPL commands include the following:
Sorting results - ordering the results and optionally limiting the number of results, with the sort command
Filtering results - selecting only a subset of the original set of events, with one or more of the following commands: search, where, dedup, head, tail etc.
Grouping results - grouping events based on some pattern, as with the transaction command
Reporting results - displaying a summary of the search results, such as with top/rare, stats, chart, timechart etc.
Filtering, modifying and adding fields - enhancing or transforming the results by removing, modifying or adding fields, such as with the fields, replace, eval, rex and lookup commands
Commands often work with about 10,000 events at a time by default, unless explicitly overridden to include all. There is no support for C-like statements as with DTrace, and it is not as UI-oriented as Instruments. However, a variety of arguments can be passed to each of the search commands - including operators such as startswith and endswith, and key=value operands - and it is platform agnostic. Perhaps it should support indexing and searching its own logs.
Statistical functions are available with the stats command, which supports a variety of builtin functions.
The chart and timechart commands are used with report builders.