Sunday, February 2, 2014

With this post, I will now return to my readings on Splunk from the book - Exploring Splunk. Splunk has a server and a client. The Splunk engine is exposed via REST-based APIs to the CLI, the web interface, and other interfaces.
The Engine has multiple layers of software. At the bottom layer are components that read from different source types such as files, network ports, or scripts. The layer above routes, clones, and load-balances the data feeds; the load is generally distributed for better performance. All of the data is then subject to indexing, and an index is built. Note that both the indexing layer and the layer below it (routing, cloning, and load balancing) are deployed and set up with user access controls. This is essentially where what gets indexed, and by whom, is decided. The choice is left to users because we don't want sharing or privacy violations, and by leaving it configurable Splunk stays independent of how much or how little data is sent its way for processing.
The layer on top of the index is Search, which determines the processing involved in retrieving results from the index. The search query language describes that processing, and searching is distributed across workers so that results can be processed in parallel. On top of Search sit Scheduling/Alerting, Reporting, and Knowledge, each a dedicated component in itself. The results from these are exposed through the REST-based API.
A pipeline refers to the data transformations applied as the data changes shape, form, and meaning before being indexed. Multiple pipelines may be involved before indexing. A processor performs a small but logical unit of work, and processors are logically contained within a pipeline.
Queues hold the data between pipelines, with producers and consumers operating on the two ends of a queue.
File input is monitored in two ways: a file watcher that scans directories and finds files, and a tail reader that reads files at the end, where new data is being appended.
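As a conceptual illustration only (this is not Splunk's actual implementation), the pipeline/processor/queue arrangement can be sketched in Python: a producer thread feeds a queue, and a consumer thread drains it through a chain of processors.

import queue
import threading

def tail_producer(lines, q):
    # Producer: pretend to tail a file and push each new line onto the queue.
    for line in lines:
        q.put(line)
    q.put(None)  # sentinel signalling end of input

def index_consumer(q, processors):
    # Consumer: pull events off the queue and run them through the pipeline's processors.
    while True:
        event = q.get()
        if event is None:
            break
        for process in processors:  # each processor does one small, logical unit of work
            event = process(event)
        print("indexed:", event)

# Two toy processors forming a pipeline.
strip_whitespace = lambda e: e.strip()
tag_source = lambda e: {"source": "example.log", "raw": e}

q = queue.Queue()
lines = ["error: disk full\n", "info: retrying\n"]
producer = threading.Thread(target=tail_producer, args=(lines, q))
consumer = threading.Thread(target=index_consumer, args=(q, [strip_whitespace, tag_source]))
producer.start(); consumer.start()
producer.join(); consumer.join()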
In today's post, I want to talk about a nuance of K-means clustering. Here vectors are assigned to the nearest cluster, as measured by distance to the centroids. There are ways to form clusters without centroids, based on single link, complete link, and so on, but centroid-based clustering is the easiest in that the computations are limited.
Note that the number of clusters is specified before the program starts. This means we don't change the expected outcome; that is, we don't return fewer clusters than expected. Even if all the data points effectively belong to one cluster, the method still aims to partition the n data points into k clusters, each with its own mean or centroid.
The means are recomputed after each round of assigning data points to clusters. We start with k initial means (say, three). At each step, the data points are assigned to the nearest cluster. Clusters are not left empty: if any cluster becomes empty because its members join other clusters, that cluster should take an outlier from an already populated cluster. This way the coherency of each of the clusters goes up.
If the number of clusters is large to begin with and the data set is small, the result is a highly partitioned data set in which one or more clusters do not adequately represent the data, and the resulting clusters may need to be aggregated to see the overall distribution. However, this is not an undesirable outcome; it is the expected outcome for the number of partitions specified. The number of partitions was simply specified too high, and this will be reflected in the chi-square goodness of fit. The next step then should be to reduce the number of clusters.
If we specify only two clusters and all the data points are visually close to one predominant cluster, then too the other cluster need not be kept empty; it can improve the predominant cluster by taking one of its outliers into the secondary cluster.
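To make this concrete, here is a minimal sketch of centroid-based k-means with this empty-cluster handling, written in Python. It uses one-dimensional points for brevity; the random initialization and the choice to re-seed an empty cluster with the farthest outlier of the most populated cluster are illustrative assumptions rather than a reference implementation.

import random

def kmeans(points, k, iterations=10):
    # Initialize k centroids by sampling from the data points.
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins the cluster with the nearest centroid.
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Empty-cluster handling: re-seed an empty cluster with the outlier
        # (farthest member from its centroid) of the most populated cluster.
        for i, cluster in enumerate(clusters):
            if not cluster:
                donor = max(range(k), key=lambda j: len(clusters[j]))
                outlier = max(clusters[donor], key=lambda p: (p - centroids[donor]) ** 2)
                clusters[donor].remove(outlier)
                cluster.append(outlier)
        # Update step: recompute each mean after the round of assignments.
        centroids = [sum(c) / len(c) for c in clusters]
    return centroids, clusters

centroids, clusters = kmeans([1.0, 1.1, 0.9, 5.0, 5.2, 9.7], k=3)
print(centroids, clusters)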

Saturday, February 1, 2014

In this post, I give some examples on DTrace:
DTrace is a tracing tool that we can use dynamically and safely on production systems to diagnose issues across layers. The common DTrace providers are:
dtrace - BEGIN, END, and ERROR probes
syscall - entry and return probes for all system calls
fbt - entry and return probes for kernel functions (function boundary tracing)
profile - timer-driven probes
proc - process creation and lifecycle probes
pid - entry and return probes for user-level functions in a specified process
io - probes for all I/O related events
sdt/usdt - developer-defined probes
sched - for all scheduling related events
lockstat - for all locking behavior within the operating system
The syntax to specify commands is: probe-description /predicate/ { action }
Variables (e.g., self->varname = 123) and associative arrays (e.g., name[key] = expression) can be declared. They can be global, thread-local, or clause-local. Associative arrays are looked up by key.
Common builtin variables include:
args: the typed arguments to the current probe,
curpsinfo: the process state info for the process associated with the current thread
execname: the name of the current process's executable
pid : the process id of the current process
probefunc | probemod | probename | probeprov: the function name, module name, probe name, and provider name of the current probe
timestamp | vtimestamp - the current time in nanoseconds, and the amount of time the current thread has been running on CPU
Aggregating functions include count, sum, avg, min, max, lquantize, and quantize; aggregations can be manipulated with clear, trunc, etc.
Actions and subroutines include trace, printf, printa, stack, ustack, stop, copyinstr, strjoin, and strlen.
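Putting the variables, predicates, and aggregations together, here is a small illustrative script (not from the book) that uses a thread-local variable to time read() system calls and reports the average latency per process:
dtrace -n 'syscall::read:entry { self->ts = timestamp; } syscall::read:return /self->ts/ { @lat[execname] = avg(timestamp - self->ts); self->ts = 0; }'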
DTrace oneliners:
Trace new processes:
dtrace -n 'proc:::exec-success { trace(curpsinfo->pr_psargs); }'
Trace files opened
dtrace -n 'syscall::openat*:entry { printf("%s,%s", execname, copyinstr(arg1)); }'
Trace number of syscalls
dtrace -n 'syscall:::entry {@num[execname] = count(); trace(execname); }'
Trace lock times by process name
dtrace -n 'lockstat:::adaptive-block { @time[execname] = sum(arg1); }'
Trace file I/O by process name
dtrace -n 'io:::start { printf("%d %s %d", pid, execname, args[0]->b_bcount);}'
Trace the writes in bytes by process name
dtrace -n 'sysinfo:::writech { @bytes[execname] = sum(arg0); }'
 

Friday, January 31, 2014

In this post, we look at a few of the examples from the previous post:
    source = job_listings | where salary > industry_average
uses the predicate to filter the results.
    dedup source sortby -delay
keeps only the first event for each unique source, sorted by the delay field in descending order
    head (action="startup")
returns events from the beginning of the results for as long as action="startup" holds, stopping at the first event where it does not
   transaction clientip maxpause=5s
groups events that share the same client IP address and have no gaps or pauses longer than five seconds
The resulting transactions include duration and event count fields
The stats command calculates statistical values on events grouped by the values of specified fields.
    stats dc(host) returns distinct count of host values
    stats count(eval(method="GET")) as GET by host returns the number of GET requests for each web server. perc (percentile) and range are other functions that can be used with stats.
Some functions apply to timechart only and are not applicable to chart or stats.
   chart max(delay) over host
returns max(delay) for each value of host
   timechart span=1m avg(CPU) by host
charts the average value of CPU usage each minute for each host.
Filtering, modifying and adding fields can be done with commands such as eval, rex, and lookup.
The eval command calculates the value of a new field based on an existing field.
The rex command is used to create new fields by using regular expressions
The lookup command adds fields by looking up values in an external lookup table
Fields can be specified as a list (col1, ..., colN) or with wildcard characters
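For example, with hypothetical field and lookup-table names:
    ... | eval ratio = salary / industry_average
    ... | rex "user=(?<user>\w+)"
    ... | lookup usertogroup user OUTPUT group
The eval expression computes a new ratio field, rex extracts a user field from the raw text with a named capture group, and lookup adds a group field from the usertogroup table.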

Thursday, January 30, 2014

We will now describe the Search Processing Language (SPL). We mentioned earlier that Splunk shifts the focus from organizing data to useful queries. The end result may be only a few records from a mountain of original data; it's the ease of use of the query language that makes retrieving that result possible.
Search commands are separated by the pipe operator, the well-known operator that redirects the output of one command as the input to another. For example, we could retrieve selected columns of the top few rows of an input data set by chaining search | top | fields commands with their qualifiers.
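For instance, with a hypothetical source and field names, such a pipeline might look like:
    source="access.log" error | top limit=5 uri | fields uri, count, percent
which searches the events for the term error, takes the five most frequent uri values, and keeps only the uri, count, and percent columns.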
If you are not sure what to filter on, you can list all events, group them, and even cluster them. There are some mining tips available as well. This method of exploration has been termed 'spelunking', hence the name of the product.
Some tips for using the search commands: use quotation marks around phrases; arguments are case-insensitive; boolean AND is implied between search terms unless OR is specified explicitly (OR has higher precedence); and subsearches, where one search command is the argument to another, are specified with square brackets.
The common SPL commands include the following:
Sorting results - ordering the results and optionally limiting the number of results with the sort command
Filtering results - selecting only a subset of the original set of events, executed with one or more of the following commands: search, where, dedup, head, tail, etc.
Grouping results - grouping the events based on some pattern, as with the transaction command
Reporting results - displaying the summary of the search results such as with top/rare, stats, chart, timechart etc.
Filtering, modifying, and adding fields - this enhances or transforms the results by removing, modifying, or adding fields, such as with the fields, replace, eval, rex, and lookup commands.
Commands often work with about 10,000 events at a time by default unless explicitly overridden to include all. There is no support for C-like statements as with DTrace, and it is not as UI-oriented as Instruments. However, a variety of arguments can be passed to each of the search commands, and it is platform-agnostic. Perhaps it should support indexing and searching its own logs. The arguments include operators such as startswith and endswith, and key-value operands.
Statistical functions are available with the stats command, which supports a variety of builtin functions.
The chart and timechart commands are used with report builders.

In this post we look at how to get data into Splunk. Splunk divides raw discrete data into events. When you do a search, it looks for matching events. Events can be visualized as structured data with attributes, or as a set of keyword/value pairs. Since events are timestamped, Splunk's indexes can efficiently retrieve events in time-series order. However, events need to be textual, not binary; image, sound, and other binary data files must be converted first (core dumps, for example, can be converted to stack traces). Users can specify custom transformations before indexing. Data sources can include files, network ports, and scripted inputs. Downloading, installing, and starting Splunk is easy, and when you reach the welcome screen there is an Add Data button to import the data.
Indexing is unique and efficient in that it associates the time with the words in the event without touching the raw data. With this map of time-based words, the index looks up the corresponding events. A stream of data is divided into individual events, and the timestamp field enables Splunk to retrieve events within a time range.
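Beyond the Add Data button, file inputs can also be configured directly. A minimal, illustrative inputs.conf stanza for monitoring a log file (the path, sourcetype, and index here are placeholders) might look like:
[monitor:///var/log/messages]
disabled = false
sourcetype = syslog
index = main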
Splunk has a user interface called the Summary Dashboard. It gives you a quick overview of the data. It has a search bar, a time range picker, a running total of the indexed data, and three panels - one each for sources, source types, and hosts. The sources panel shows which sources (files, network, or scripted inputs) the data comes from; the source types panel shows the types of those sources; and the hosts panel shows the hosts the data comes from. The contents of the search dashboard include the following:
Timeline - this indicates the matching events for the search over time.
Fields sidebar: the relevant fields extracted from the matching events
Fields discovery switch : This turns automatic field discovery on or off.
Results area: events ordered by timestamp, showing the raw text of each event along with the fields selected in the fields sidebar and their values.

Wednesday, January 29, 2014

Today I'm going to talk about Splunk.
And perhaps I will first delve into one of the features. As you probably know, Splunk allows great analytics on machine data. It treats data as key-value pairs that can be looked up just as niftily and as fast as with any Big Data system. This is the crux of Splunk: it allows search over machine data to find the relevant information when it is otherwise difficult to navigate the data due to its volume. Notice that it eases the transition from organizing data to better querying. The queries can be expressed in a select-like form and language.
While I will go into these in detail, including the technical architecture, shortly, I first want to cover regexes over the data. Regexes are powerful because they allow for matching and extracting data. The patterns can be specified separately, and they use the same metacharacters for describing patterns as regular expressions anywhere else.
The indexer can selectively filter out events based on these regexes. This is specified via two configuration files, props.conf and transforms.conf - one for configuring Splunk's processing properties and the other for configuring data transformations.
props.conf is used for line-breaking multi-line events, setting up character-set encoding, processing binary files, recognizing timestamps, setting up rules-based source-type recognition, anonymizing or obfuscating data, routing selected data, creating new index-time field extractions, creating new search-time field extractions, and setting up lookup tables for fields from external sources. transforms.conf is used for configuring similar attributes; these transforms require corresponding settings in props.conf.
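As an illustration of the selective filtering mentioned above, the commonly documented pattern for discarding events that match a regex is to route them to the null queue. The sourcetype stanza name and the pattern below are placeholders:
props.conf:
[syslog]
TRANSFORMS-null = setnull

transforms.conf:
[setnull]
REGEX = \[DEBUG\]
DEST_KEY = queue
FORMAT = nullQueue
Events whose raw text matches REGEX are sent to the nullQueue and are never indexed.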
This feature gives the user a powerful capability: transforming events, selectively filtering them, and adding enhanced information. Imagine working not only with the original data but with something transformed into more meaningful representations. Such a feature helps not only with search and results but also with better visualizing the data.