Wednesday, February 5, 2014

We discussed alerts, actions, charts, graphs, visualizations, and dashboards in the previous post. We now review recipes for monitoring and alerting. These are meant to be brief solutions to common problems. Monitoring helps you see what is happening to your data. As an example, let us say we want to monitor how many concurrent users there are at any given time. This is a useful metric to see if a server is overloaded. To do this, we search for the relevant events, use the concurrency command to count the users that overlap in time, and then use the timechart reporting command to chart the number of concurrent users over time.
We specify this as:
search sourcetype=login_data | concurrency duration=ReqTime | timechart max(concurrency)
Let us say next that we want to monitor inactive hosts.
We use the metadata command, which gives information on hosts, sources, and source types.
Here we specify:
| metadata type=hosts | sort recentTime | convert ctime(recentTime) as Latest_Time
We can use tags to categorize data and use them in our searches.
In the above example, we could specify:
... | top 10 tag::host
to report the top ten host types.
Since we talked about tags, we might as well see an example using event types.
We could display a chart of how host types perform, using only event types that end in _host, with the following:
... | eval host_types=mvfilter(match(eventtype, "_host$"))
    | timechart avg(delay) by host_types
Another common question that monitoring can help answer is: how did today perform compared to the previous month?
For example, we might want to view the hosts that were more popular today than in the previous month.
This we do with the following steps:
1. get the monthly usage for each host
2. get the daily usage for each host and append
3. use stats to join the monthly and daily usages by host.
4. use sort and eval to format the results.
Let's try these commands without seeing the book (as a sketch, reusing the login_data source type from the first recipe and assuming a duration field per event):
sourcetype=login_data earliest=-30d@d | stats sum(duration) as monthly_usage by host | sort 10 - monthly_usage | streamstats count as MonthRank
Cut and paste the above, with changes for the daily usage, into an append subsearch:
append [ search sourcetype=login_data earliest=-1d@d | stats sum(duration) as daily_usage by host | sort 10 - daily_usage | streamstats count as DayRank ]
Next, join the monthly and daily rankings with the stats command:
stats first(MonthRank) as MonthRank first(DayRank) as DayRank by host
Then we format the output:
eval diff=MonthRank-DayRank | sort DayRank | table DayRank, host, diff, MonthRank
Each of these steps can now be piped into the next, so the overall search can be written as a single composite query, as shown below.
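Putting the pieces together (a sketch only; the source type, the duration field, and the rank names follow the examples above rather than the book's exact recipe), the composite search might look like this:
sourcetype=login_data earliest=-30d@d
    | stats sum(duration) as monthly_usage by host
    | sort 10 - monthly_usage
    | streamstats count as MonthRank
    | append [ search sourcetype=login_data earliest=-1d@d
        | stats sum(duration) as daily_usage by host
        | sort 10 - daily_usage
        | streamstats count as DayRank ]
    | stats first(MonthRank) as MonthRank first(DayRank) as DayRank by host
    | eval diff=MonthRank-DayRank
    | sort DayRank
    | table DayRank, host, diff, MonthRank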

Tuesday, February 4, 2014

In today's post we will cover another chapter in the Exploring Splunk book. This chapter is on enriching data. We can use commands like top and stats to explore the data. We can also add sparklines, which are small inline line graphs, to the results so that patterns in the data can be quickly and easily visualized.
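For instance, a sparkline search (a sketch, assuming web access data indexed with sourcetype=access*) might be:
sourcetype=access* | stats sparkline(count), count by clientip
which shows, for each client IP, the event count alongside a small inline trend of that count over time.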
With Splunk it is easy to exclude data that has already been seen. We do this with tagging. This helps separate interesting events from the noise.
When we have identified the fields and explored the data, the next step is to categorize and report on the data.
Different event types can be created to categorize the data. There are only two rules to keep in mind with event types: an event type definition cannot contain pipes, and it cannot use nested searches (subsearches). For example, status=2* could define success cases and status=4* could define client_errors.
More specific event types can be built on top of more general event types. For example, web_error can include both client_errors and server_errors. The granularity of event types is left to the user's discretion, since it is the user who consumes the results.
Event types can also have tags. A more descriptive tag about the errors enhances the event type.
As an example, a user_impact tag can be used to report on those events separately.
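As a rough sketch (the stanza names, searches, and tag below are illustrative rather than from the book), such event types and tags could be declared in eventtypes.conf and tags.conf:
# eventtypes.conf
[client_error]
search = status=4*
[server_error]
search = status=5*
[web_error]
search = eventtype=client_error OR eventtype=server_error
# tags.conf
[eventtype=client_error]
user_impact = enabled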
Together, event types and tags allow categorization and reporting over voluminous machine data. Refining this model is usually an iterative effort. We could start with a few useful fields and then expand the search. All the while, this gives Splunk more input to organize and label the data.
We mentioned visualizing data with sparklines. We can also visualize data with charts and graphs. This is done from the Create Report option on the search page.
For example, we can search with a query such as sourcetype=access* status=404 | stats count by category_id and then create a pie chart from the results. Hovering over the chart then shows details of the underlying data.
Dashboards are yet another visualization tool. Here we present many different charts, graphs, and other visualizations in a reporting panel. As with most reporting, a dashboard caters to an audience and effectively answers the few questions that the audience is most interested in. These can be gathered from user input and feedback iterations. As with charts and graphs, it's best to start with a few high-level fields before making the dashboard more sophisticated.
Alerts are another tool; they run periodically or on events when search results match a condition. There are three options to schedule an alert. The first is to monitor whenever the condition happens. The second is to monitor on a scheduled basis, for less urgent information. The third is to monitor using a real-time rolling window, triggering if a certain number of things happen within a certain time period.
Alerts can have associated actions that make them all the more useful. The actions can be specified via the wizard. Some actions are, for example, sending an email, running a script, and listing in triggered alerts.
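Under the hood an alert is a saved search; a minimal sketch of what one could look like in savedsearches.conf (the stanza name, search, thresholds, and email address here are made up for illustration) is:
[Too many 404s]
search = sourcetype=access* status=404
enableSched = 1
cron_schedule = */15 * * * *
counttype = number of events
relation = greater than
quantity = 100
action.email = 1
action.email.to = ops@example.com
alert.track = 1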

To make the data more usable, Splunk allows enriching the data with additional information so that Splunk can classify it better. Data can be saved in reports and dashboards that make it easier to understand. And alerts can be added so that potential issues can be addressed proactively and not after the fact.
The steps in organizing data usually involve identifying fields in the data and categorizing data as a preamble to aggregation and reporting. Preconfigured settings can be used to identify fields. These utilize hidden attributes embedded in machine data. When we search, Splunk automatically extracts fields by identifying common patterns in the data.
Configuring field extraction can be done in two ways: Splunk can automate the configuration by using the interactive field extractor, or we can specify the configuration manually.
Another way to extract fields is to use search commands. The rex command comes in very useful for this. It takes a regular expression and extracts fields that match the expression. To extract fields from multiline tabular data, we use multikv, and to extract from XML and JSON data, we use spath or xmlkv.
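As a sketch (the source type and the extracted field names are illustrative), a rex extraction over email-like events might look like:
sourcetype=mail_logs | rex field=_raw "From: <(?<from>.*)> To: <(?<to>.*)>" | top from
which pulls from and to fields out of the raw text and reports the most frequent senders.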
The search dashboard's field sidebar gives immediate information for each field, such as:
the basic data type of the field with abbreviations such as a for text and # for numeric
the number of occurrences of the field in the events list (following the field name)

Monday, February 3, 2014

We discuss the various kinds of processors within a Splunk pipeline. We have the monitor processor that looks for files and picks up the entries added at the end of a file. Files are read one at a time, in 64KB chunks, until EOF. Large files and archives could be read in parallel. Next we have a UTF-8 processor that converts different character sets to UTF-8.
We have a LineBreaker processor that introduces line breaks in the data.
We also have a LineMerge processor that does the reverse.
We have a HeadProcessor that multiplexes different data streams, such as TCP inputs, into one channel.
We have a regex replacement processor that searches for patterns and replaces them.
We have an annotator processor that adds a punct field to raw events, so that similarly structured events can be found.
The indexer pipeline has TCP output and syslog output, both of which send data to a remote server, and an indexer processor that writes data to disk. Data goes to a remote server or to disk, but usually not both.
An Exec processor is used to handle scripted input. 
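Several of the stages above (character-set conversion, line breaking, line merging) are typically controlled per source type in props.conf; a minimal sketch, with an illustrative source type name, might be:
[my_sourcetype]
CHARSET = UTF-8
LINE_BREAKER = ([\r\n]+)
SHOULD_LINEMERGE = false
TRUNCATE = 10000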

Sunday, February 2, 2014

With this post, I will now return to my readings on Splunk from the book Exploring Splunk. Splunk has a server and a client. The Splunk engine is exposed via REST-based APIs to the CLI, the web interface, and other interfaces.
The engine has multiple layers of software. At the bottom layer are components that read from different source types such as files, network ports, or scripts. The layer above is used for routing, cloning, and load balancing the data feeds, and this is dependent on the load; the load is generally distributed for better performance. All the data is subject to indexing, and an index is built. Note that both the indexing layer and the layer below it, i.e. routing, cloning, and load balancing, are deployed and set up with user access controls. This is essentially where what gets indexed by whom is decided. The choice is left to users because we don't want sharing or privacy violations, and by leaving it configurable we are independent of how much or how little is sent our way for processing.
The layer on top of the index is search, which determines the processing involved in retrieving results from the index. The search query language is used to describe the processing, and the searching is distributed across workers so that the results can be processed in parallel. The layer on top of search holds scheduling/alerting, reporting, and knowledge, each of which is a dedicated component in itself. The results from these are exposed through the REST-based API.
A pipeline refers to the data transformations applied as the data changes shape, form, and meaning before being indexed. Multiple pipelines may be involved before indexing. A processor performs a small but logical unit of work and is logically contained within a pipeline.
Queues hold the data between pipelines. Producers and consumers operate on the two ends of a queue.
The file input is monitored in two ways: a file watcher that scans directories and finds files, and a reader that tails the files where data is being appended.
In today's post, I also want to talk about a nuance of K-means clustering. Here vectors are assigned to the nearest cluster, as measured by distance to the cluster centroids. There are ways to assign clusters without centroids, based on single-link, complete-link, and similar criteria; however, centroid-based clustering is the easiest in that the computations are limited.
Note that the number of clusters is pre-specified before the start of the program. This means that we don't change the expected outcome; that is, we don't return fewer than the requested number of clusters. Even if all the data points belong to one cluster, this method aims to partition the n data points into k clusters, each with its own mean, or centroid.
The mean is recomputed after each round of assigning data points to clusters. We start with, say, three different means initialized. At each step, the data points are assigned to the nearest cluster. The intent is that the clusters should not be left empty. If any cluster becomes empty because its members join other clusters, then that cluster should take an outlier from an already populated cluster. This way the cluster coherency goes up for each of the clusters.
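For reference, the standard assignment and update steps can be written as: assign each point x_i to the cluster c_i = \arg\min_j \lVert x_i - \mu_j \rVert^2, and then recompute each centroid as \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i, where C_j is the set of points currently assigned to cluster j.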
If the number of clusters is large to begin with and the data set is small, this will lead to a highly partitioned data set in which one or more of the clusters do not adequately represent the data, and the resulting clusters may need to be taken together to see the overall distribution. However, this is not an undesirable outcome; it is the expected outcome for the number of partitions specified. The number of partitions was simply specified too high, and this will be reflected in a chi-square goodness-of-fit test. The next step then should be to reduce the number of clusters.
If we specify only two clusters and all the data points are visually close to one predominant cluster, then the other cluster still need not be kept empty; the clustering can be improved by moving one of the outliers of the predominant cluster into the secondary cluster.

Saturday, February 1, 2014

In this post, I give some examples of DTrace usage.
DTrace is a tracing tool that we can use dynamically and safely on production systems to diagnose issues across layers. The common DTrace providers are:
dtrace - start, end and error probes
syscall - entry and return probes for all system calls
fbt - entry and return probes for kernel functions
profile - timer driven probes
proc - process creation and lifecycle probes
pid - entry and return probes for user-level functions in a given process
io - probes for all I/O related events.
sdt/usdt - developer defined probes
sched - for all scheduling related events
lockstat - for all locking behavior within the operating system
The syntax for a probe clause is: probe-description /predicate/ { action }
Variables (e.g. self->varname = 123) and associative arrays (e.g. name[key] = expression) can be declared. They can be global, thread-local, or clause-local. Associative arrays are looked up by key.
Common builtin variables include:
args: the typed arguments to the current probe
curpsinfo: the process state for the process associated with the current thread
execname: the name of the current process's executable
pid: the process id of the current process
probefunc, probemod, probename, probeprov: the function name, module name, probe name, and provider name of the current probe
timestamp, vtimestamp: a nanosecond timestamp, and the amount of time the current thread has been running
Aggregating functions include count, sum, avg, min, max, quantize, and lquantize; aggregations can be manipulated with clear and trunc.
Actions include trace, printf, printa, stack, ustack, stop, copyinstr, strjoin and strlen.
DTrace one-liners:
Trace new processes:
dtrace -n 'proc:::exec-success { trace(curpsinfo->pr_psargs); }'
Trace files opened
dtrace -n 'syscall::openat*:entry { printf("%s,%s", execname, copyinstr(arg1)); }'
Trace number of syscalls
dtrace -n 'syscall:::entry {@num[execname] = count(); trace(execname); }'
Trace lock times by process name
dtrace -n 'lockstat:::adaptive-block { @time[execname] = sum(arg1); }'
Trace file I/O by process name
dtrace -n 'io:::start { printf("%d %s %d", pid, execname, args[0]->b_bcount);}'
Trace the writes in bytes by process name
dtrace -n 'sysinfo:::writech { @bytes[execname] = sum(arg0); }'