Friday, February 28, 2014

I found an implementation of an event loop and would like to compare it with overlapped IO. Both approaches can work on Windows, but we will leave aside the nuances of using them on any one platform. The semantics of an event loop are that events are added and, at the end of a timeout, the due events are triggered. The timeout is determined by whether any event needs to be triggered next time and, if so, the timeslice to wait; otherwise it is zero. Several helper threads can wait on a set of pollable resources. The threads are kicked off by the loop, waited on, and then shut down. The wait is for the timeout determined. If the wait succeeds, we recalibrate the before time, the timeout, and the after time for the next round.
If the wait set was not successful, it was probably because events were added or removed, so we check all the states and start over.
Three doubly linked lists are maintained: one for waiters that are not full, one for waiters that are full, and one for waiters that are empty. A waiter thread moves between these lists.
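A minimal sketch of that timeout calculation, in C# with hypothetical ScheduledEvent and trigger types (not the implementation I found), could look like this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

class ScheduledEvent { public DateTime Due; public Action Trigger; }

static class EventLoop
{
    public static void Run(List<ScheduledEvent> events, CancellationToken cancel)
    {
        while (!cancel.IsCancellationRequested)
        {
            DateTime now = DateTime.UtcNow;
            var pending = events.Where(e => e.Due > now).ToList();
            // timeslice to wait until the next trigger is due, otherwise zero
            TimeSpan timeout = pending.Any() ? pending.Min(e => e.Due) - now : TimeSpan.Zero;
            if (timeout > TimeSpan.Zero)
                cancel.WaitHandle.WaitOne(timeout);          // a real loop would wait on the pollable resources here
            foreach (var e in events.Where(e => e.Due <= DateTime.UtcNow).ToList())
            {
                e.Trigger();                                 // fire the due events
                events.Remove(e);
            }
            // recalibrate before/timeout/after on the next pass
        }
    }
}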
Overlapped IO is very different: the events are serviced by different producers and consumers, and a single completion can even be associated with multiple pollable resources.
There is no predicting the order in which events get triggered in either case, but one model proceeds in terms of timer beats while the other proceeds in terms of the items in the IO.

In the post on using the SQL Server service broker as a modular input for Splunk, we introduced a technique but we now describe the complete solution. We mentioned that we could read messages from the Broker by opening a SqlConnection and executing a SQL statement. For every such message received we can create a Splunk modular input event wrapper and send off the data to Splunk.
The program implements the Splunk script object with the methods required to take the configuration from the user, apply it, and extract events. These events are extracted from the message broker as mentioned above.
The sampling rate is determined by the number of messages. We fork a thread to process the messages for the designated queue. The config determines the queue names, the sampling intervals, etc. The default sampling interval is zero.
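A hedged sketch of that reader loop is below; the connection string, queue name and event emission are placeholders, and the RECEIVE form mirrors the statement shown in a later post:
using System;
using System.Data.SqlClient;
using System.Threading;

static class BrokerReader
{
    // Poll a service broker queue and hand each message off as a modular input event.
    public static void Run(string connectionString, string queueName, TimeSpan samplingInterval)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            while (true)
            {
                var command = new SqlCommand(
                    "WAITFOR (RECEIVE TOP (10) conversation_handle, message_type_name, message_body " +
                    "FROM " + queueName + "), TIMEOUT 5000", connection);
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // A real modular input would wrap message_body in an event and stream it to Splunk.
                        Console.WriteLine(reader["message_type_name"]);
                    }
                }
                Thread.Sleep(samplingInterval);   // the sampling interval comes from the user config
            }
        }
    }
}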
We invoke the CLI command object from the SDK. Specifically, we say Command.Splunk("Search").
Then we add the parameter we want to search with. We can check cli.Opts to see if our search command and parameter were added. After defining the command object, we create a job object to invoke it. We do it with the following:
var service = Service.Connect(cli.Opts);
var jobs = service.GetJobs();
var job = jobs.Create((string)cli.Opts["search"]);
We wait till the job is done. This we can check with the done flag on the job.
We retrieve the results of the jobs with the job.getResults command which returns a stream.  We can then open a ResultsReaderXml on this stream to retrieve the events.
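Putting the sequence together, a rough sketch with the legacy Splunk C# SDK looks like the following; the exact member names (IsDone, Refresh, Results) are from memory and may differ between SDK versions:
var service = Service.Connect(cli.Opts);
var jobs = service.GetJobs();
var job = jobs.Create((string)cli.Opts["search"]);
while (!job.IsDone)                                 // poll the done flag on the job
{
    System.Threading.Thread.Sleep(500);
    job.Refresh();
}
using (var results = job.Results())                 // assumed name for the call that returns the stream
using (var reader = new ResultsReaderXml(results))
{
    foreach (var result in reader)                  // each result carries the field/value pairs of an event
        System.Console.WriteLine(result);
}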


Thursday, February 27, 2014

Today we discuss more on modular inputs. In this case, we talk about the properties we can set on the modular input data structure. One of the properties is the index, which determines which index the events should go to. There is a default index, but it's preferable to store our events under their own index. The second property is the stanza name, a unique source type for this kind of event. In fact the stanza is user specified, and the events are often retrieved based on these literals.

The idea behind modular inputs seems to be to envelope the events so they can be differentiated from others. This helps even before parsing and indexing have taken place.
When we use the script object with the input, we are able to govern the flow of these events. We can increase it from a trickle to a deluge if we want by varying the polling interval or the sampling rate.

In addition to differentiating and regulating the flow, the modular input and its scripts can filter on raw data before it is sent over. For example, inbound and outbound messages may not only be differentiated but one selected over the other.
 A Splunk  app that uses modular input uses the script object. It uses such things as methods to run and stream events. Examples are provided in the Splunk documentation but I would like to discuss how to integrate it with a service. A script conforms to an executable much like the service. Only in this case it follows the conventions of Splunk Apps. Otherwise both the script and the service perform data collection. While the service may have methods to process the incoming and outgoing messages, the script has a method for streaming events. In addition the script is setup for polling to monitor the changes and to validate the user input. In essence a modular input script is suited to interface with the user for the configuration of the monitor and to set up the necessary index, queue, stanza etc for the events to be written to. With this framework, no matter what the events are, they can join the pipeline to Splunk.
The StreamingEvents method invokes a while loop to poll or monitor the changes at regular intervals. This method sleeps for a few intervals thus relieving the CPU between polling. Sleeping helps particularly with otherwise high CPU usage on uniprocessors and on earlier systems such as what Sustaining Engineering sees occasionally with customers. Introducing the sleep at the correct place during polling alleviates this. Even a Sleep(0) will be sufficient.
In the case of the service that monitors the service broker queue (sample code on MSDN), it has the same run method that the Splunk Script object has; it also begins by loading the configuration and forks a thread to watch each queue. Each thread, for its lifetime, opens a SQL connection object and retrieves service broker messages in a polling loop, just like the extract-events method of the Splunk object.
Thus the easiest way to integrate the service broker method is to use the service broker queue reading logic for all specified queues in the extract events method of the modular input.
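A sketch of that integration, with illustrative names rather than the actual Splunk SDK signatures (readQueue stands in for the service broker reading logic, writeEvent for the modular input event writer):
using System;
using System.Collections.Generic;
using System.Threading;

static class StreamingEvents
{
    public static void Run(IEnumerable<string> queueNames, TimeSpan samplingInterval,
        Func<string, IEnumerable<string>> readQueue, Action<string, string> writeEvent)
    {
        foreach (var queueName in queueNames)
        {
            var name = queueName;
            new Thread(() =>
            {
                while (true)
                {
                    foreach (var message in readQueue(name))
                        writeEvent(name, message);          // tagged with the stanza/queue it came from
                    Thread.Sleep(samplingInterval);          // relieve the CPU between polls; even Sleep(0) yields
                }
            }) { IsBackground = true }.Start();
        }
    }
}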

Wednesday, February 26, 2014

One of the ways we can look at logging on Windows for any component is with WPP tracing. This is true for any logs, including those from system components, device drivers, applications, services, and any registered trace provider. The trace providers are usually found by the GUID they register with, or that information is extracted from the PDB. The Windows DDK ships a tool called TraceView that can collect and display these traces.
This tool may not be up to date on the trace log format, but we can easily convert a trace captured in an .etl log file by using eventvwr -> Open Saved Log and saving it to the newer format.
Here's an example of what the logs look like:
00111375 drv 7128 9388 2 111374 02\26\2014-15:59:34:202 Driver::Registry call back: filter out event due to machine or user path set in config. operation = QueryValueKey
The events are displayed because we have the formatting information for them. This is usually contained in the trace message format maintained by the providers or embedded in their PDBs. If we don't have the formatting information, the events look something like this:
00111386 Unknown 7128 9388 2 111385 00\00\   0-00:00:00:00 Unknown( 40): GUID=bbd47d81-a1f8-551f-b37f-8ce988bb02f2 (No Format Information found).
This does not mean, however, that the same fields we see in TraceView can be used with the filters in the event viewer; the latter maintains its own filter fields, attributes and levels.
The event viewer logs have several features.
First off it conforms to a template that's universally recognized. And it identifies events by their source, ids etc.
Second, it can collect a variety of logs, application, system and security. These provide a sink for all the event tracing information on the system. These can be saved and viewed offline.
Third, eventvwr can connect to remote computers and display the events from the logs there.
This is critical when it comes to viewing information across machines.
If our interest is only in certain events, the logman tool can be helpful in filtering events based on the provider GUID. There are some other options available as well, such as starting, stopping, updating, and deleting data collectors, querying a data collector's properties, and importing or exporting the XML configuration file.
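For example, a provider-scoped trace session can be managed along these lines (the session name and provider GUID are placeholders):
logman create trace MyWppTrace -p {provider-guid} -o mytrace.etl
logman start MyWppTrace
logman stop MyWppTrace
logman query MyWppTrace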

Tuesday, February 25, 2014

I'm going to blog about Splunk CLI commands. By the way, I'm going to check if fifo input is discontinued. Meanwhile let's talk about some basic CLI commands now.
 There are several basic commands and it may take a while to cover all of them. I'll try going case by case such as say for a given task at hand. This way we will know how to use it. Again, there's plenty of literature on docs.splunk.com but my goal here is to mention the ones I've used.
Here's a command to register perfmon. You can modify the inputs.conf file with the details of the perfmon config
splunk add exec scripts\splunk-perfmon.path -interval 60
and splunk enable perfmon
The CLI commands are based on verbs and objects.
You can start or stop splunk with : splunk start splunkd --debug
 but you can only do that with splunkd and splunkweb.  Also, since we are talking about perfmon events, we can use the CLI to see what perfmon will be collecting with our command:
splunk list perfmon
In this case, it will give you output such as :
Monitored Perfmon Collections:
        LogicalDisk
                _TCP_ROUTING:windowsIndex
                counters:*
                disabled:0
                host:RRAJAMANIPC
                index:windows_perfmon
                interval:10
                object:LogicalDisk
These are what we define in the inputs.conf file.
Note that individual perfmon items can also be enabled or disabled separately
splunk enable perfmon LogicalDisk
and similarly we can disable them individually as follows:
splunk disable perfmon LogicalDisk
The CLI lets you activate a configuration change with the reload command, as in
splunk reload perfmon
which makes it effective immediately.
Program execution logging seems an art. While it can be dismissed as a chore, for sustaining engineering, this seems an invaluable diagnostic. What would make it easier to troubleshoot problems is when there is a descriptive message when errors occur. Typically these messages are for at-the-moment errors without any indication of what customer could do to mitigate it. I don't mean that error messages need to be expanded to include corrective actions in all cases. That would help but perhaps an association between error messages and corrective actions could be maintained. Say if we keep all our error message strings in one place then it could be easy to correlate the errors to the actions by keeping them side by side.
The corrective action strings need not even be in the logging but the association could help support and sustaining to diagnose issues. Especially when the workarounds are something that's domain knowledge. These will avoid a lot of communication and even help the engineers on the field.
At the same time, this solution may not be appropriate in all cases, for example where we don't want to be too informative to our customers and where we don't want to confound them with too much detail. Even in such cases, being elaborate in the error conditions and the descriptive messages may help the appropriate audience target their actions.
Lastly, I want to add that many feature developers might already be aware of common symptoms and mitigations during their development phase. Capturing these artifacts will help with common troubleshooting of the feature at a later point in time. Building a history or a database of such knowledge via simple bug tracking would help immensely, since troubleshooters often search the bug database for similar reported problems.
Another consideration is that the application maintain data structures exclusively for supportability. For example, if there is an enumeration of all the workers for a given component, their tasks, their objects and states and if these can be queried in a pull operation independent of the method they are working on, it would be great.  These pull operations could be invoked by views specific to runtime diagnostics. So they can be exposed via methods specific to management. These are different from logging in the sense that they are actually calls to the product to retrieve enhanced runtime information.
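As an illustration only (not any particular product's API), such a pull-style view might be as simple as a snapshot method over a worker registry:
using System;
using System.Collections.Generic;
using System.Linq;

class WorkerStatus
{
    public int WorkerId;
    public string Task;
    public string State;          // e.g. "waiting", "reading", "posting"
    public DateTime LastActivity;
}

static class Diagnostics
{
    static readonly List<WorkerStatus> workers = new List<WorkerStatus>();

    public static void Register(WorkerStatus status) { lock (workers) workers.Add(status); }

    // A management view calls this to retrieve enhanced runtime information on demand.
    public static IList<WorkerStatus> Snapshot()
    {
        lock (workers)
            return workers.Select(w => new WorkerStatus
            {
                WorkerId = w.WorkerId, Task = w.Task, State = w.State, LastActivity = w.LastActivity
            }).ToList();
    }
}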

Monday, February 24, 2014

I finally wrote a program that demonstrates high performance reading of message queues from MSMQ on Windows. No matter the number of queues and the number of messages on them, this program illustrates reading via IO completion ports. In brief, the application has a monitor that collects and configures the queues to be read and creates a completion port for all of them. Then it forks off threads that get notified on this completion port whenever messages arrive. The threads exchange the data read from the messages with the monitor, which can file or dump the data away.
The completion port lets us specify multiple queues and tolerate any load.
The workers are spawned when the port is ready, and closing the port signals the threads to terminate.
This is convenient for initialization and cleanup. Memory usage is limited to the copying of messages in transit and is consequently very small compared with the overall number of messages.
Secondly the application allows for threads to service any messages from the completion port. The messages are tied back to the queue names based on the overlapped key parameter that the threads set when reading a message. The threads know which queue handle the data is coming from and when reading it they can flag the necessary queue so that proper association can take place.
Another thing to note is that the task for all the threads is the same: simply to get notified of messages, to read them, and to post them to the monitor. This way there is no restriction on concurrency from the application's perspective. That said, the concurrency value is typically determined by the number of processors. Since these are OS threads, we rely on what the OS suggests. We follow the recommendation to use a completion port, but the thread pool we use with the completion port is something we can tweak based on what works. Lastly, I wanted to mention that the properties we use for message queue receiving are determined by the application. While we can retrieve a large number of properties for each receive, we are typically interested in the message buffer and its size. So we need to determine these application-chosen properties before we make Receive calls. The threads assume the structure of this context when receiving.
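A trimmed sketch of the worker side of this design, calling the completion-port API via P/Invoke; the association of queue handles and completion keys with the port, and the actual MSMQ receive, are left out and assumed to be done by the monitor:
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

static class CompletionPortWorker
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool GetQueuedCompletionStatus(IntPtr completionPort, out uint bytesTransferred,
        out UIntPtr completionKey, out IntPtr overlapped, uint milliseconds);

    // Filled by the monitor when each queue handle is associated with the port;
    // the key is what lets a worker tie a completion packet back to its queue name.
    public static readonly Dictionary<ulong, string> KeyToQueue = new Dictionary<ulong, string>();

    public static void Run(IntPtr completionPort)
    {
        while (true)
        {
            uint bytes;
            UIntPtr key;
            IntPtr overlapped;
            bool ok = GetQueuedCompletionStatus(completionPort, out bytes, out key, out overlapped, uint.MaxValue);
            if (!ok && overlapped == IntPtr.Zero)
                break;                                   // the port was closed: time to terminate
            string queueName = KeyToQueue[(ulong)key];   // which queue this completion came from
            // read the completed message for queueName and post it to the monitor for filing/dumping
        }
    }
}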
 

Sunday, February 23, 2014

Splunk 6 has  a  web framework with documentation on their dev portal that seems super easy to use. Among other things, it can help to gain App Intelligence i.e by improving semantic logging where the meaning can be associated via simple queries, to integrate and extend Splunk, such as with business systems or customer facing applications and to build real time applications that add a variety of input to Splunk.
One such example could be SQL Server Message Broker Queue. The Message broker keeps track of messages based on a "conversation_handle" which is a Guid.
Using a SQL data reader and a SQL query, we can get these messages, which can then be added as input to Splunk. We issue RECEIVE commands like this:
RECEIVE top (@count) conversation_handle, service_name, message_type_name, message_body, message_sequence_number
                FROM <queue_name>
Unless a message has the message type "http://schemas.microsoft.com/SQL/ServiceBroker/EndDialog", the message body can be read.
The queue listener that drives this should allow methods to configure the listener, and to start and stop.
Using a simple call back routine, the thread can actively get the messages as described and send it to a processor for completion. A transaction scope could be used in this routine.
Inbound and outbound processor queues could be maintained independently and invoked separately. Both should have methods to process messages and to save failed messages.
The processed messages can then be written to a file for input to Splunk or to use the framework for directly indexing this input.
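A compact sketch of the listener shape described above; ReceiveBatch and the injected processor are placeholders for the RECEIVE logic and the inbound/outbound processors:
using System;
using System.Threading;
using System.Transactions;

class QueueListener
{
    readonly string queueName;
    readonly Action<string> processor;     // inbound or outbound processor, injected at configuration time
    Timer timer;

    public QueueListener(string queueName, Action<string> processor)
    {
        this.queueName = queueName;
        this.processor = processor;
    }

    public void Start(TimeSpan interval) { timer = new Timer(_ => Poll(), null, TimeSpan.Zero, interval); }
    public void Stop() { if (timer != null) timer.Dispose(); }

    void Poll()
    {
        using (var scope = new TransactionScope())
        {
            foreach (var message in ReceiveBatch(queueName))   // placeholder for the RECEIVE call shown above
                processor(message);
            scope.Complete();   // only commit the receive if processing succeeded
        }
    }

    static string[] ReceiveBatch(string queue) { return new string[0]; }   // placeholder
}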
There are several channels for sending data from SQL server and this is one that could potentially do with a Splunk app.
In general, writing such apps in languages such as C# or JavaScript is well documented, but it would not be advisable to push this any further into the Splunk stack, because the systems are different and Splunk is not hosted on SQL Server.
If Splunk is hosted on, say, one particular operating system, then certain forms of input that are specific to that operating system could be considered, but in general the Splunk foundation on which the apps are built focuses on generic source types and leaves it to user discretion to send data through one of the established channels.
 
I'm taking a look at Windows IO completion ports today and writing about them. When an IO completion port is created by a process, there is a queue associated with the port that services the multiple asynchronous IO requests. This works well with a thread pool. The IO completion port is associated with one or more file handles, and when an asynchronous IO operation on a file completes, an IO completion packet is queued in first-in-first-out order to the completion port.
Note that the file handle can be any overlapped IO endpoint: a file, socket, named pipe, mailslot, etc.
The thread pool is maintained in last-in-first-out order. This is done so that the running thread can continuously pick up the next queued completion packet and no time is lost in context switches. This is hardly guaranteed, though, since a thread may switch ports, be put to sleep, or terminate, and the other workers get to service the queue. Threads waiting on a GetQueuedCompletionStatus call can process a completion packet when another running thread enters a wait state. The system also prevents new threads from becoming active until the number of active threads falls below the concurrency value.
In general, the concurrency value is chosen as the number of processors, but this is subject to change, and it's best to use a profiling tool to verify the benefit and avoid thrashing. I have a case where I want to read from multiple mailslots, and these are best serviced by a thread pool. The threads from the pool can read the mailslots and place the data packets directly on the completion port queue. The consumer of the completion port will then dequeue them for processing. In this example the threads poll the mailslots directly for messages and place them on the completion port. This is fast and efficient, though polling wastes time on queues with the same or no current message. However, this is not the same model as a completion-port notification or a callback routine for each mailslot. In the latter model there is a notification/subscription mechanism, and it is better at utilizing system resources; those resources can be significant if the number of mailslots or the number of messages is high. We can make the polling model fast as well with a timeout value of zero for any calls to read the mailslots and by skipping those that don't have actionable messages. However, the notification model helps with little or no time spent on anything other than servicing the messages in the mailslots as and when they appear. The receive call seems to have a built-in wait that relieves high CPU usage.

Friday, February 21, 2014

Yesterday I saw a customer report of a failure in our application, and at first it seemed to be a disk space issue. However, file system problems are generally something that applications cannot work around.
Here the file system was an NFS mount even though it had the label of a GPFS mount. Furthermore, disk space was not an issue. Yet the application reported that it could not proceed because the open/read/write was failing. mount showed the file system mount point and the remote server it mapped to. Since the mount was for a remote file system, we needed to check both the network connectivity and the file system reads and writes.
A simple test that was suggested was to try writing a file outside the application with the dd utility to the remote server, something like
dd if=/dev/zero of=/remotefs/testfile bs=<blocksize> count=<count>
and, if that succeeds, read it back again as follows:
dd if=/remotefs/testfile of=/dev/null bs=<blocksize>
With a round trip like that, file system problems could be detected.
The same diagnostics can be made part of the application diagnostics.


Thursday, February 20, 2014

I'm not finding much time tonight, but I wanted to take a moment to discuss an application for data input to Splunk. We talked about user applications for Splunk, and sure, they can be written in any language, but when we are talking about performance reading on the order of an MSMQ cluster, we want it to be efficient in memory and CPU. What better way to do it than to push it down to the bottom of the Splunk stack? This is as close as it can get to the Splunk engine. Besides, MSMQ clusters are high volume queues and there can be a large number of such queues. While we could subscribe to notifications at different layers, there is probably nothing better than having something out of the box from the Splunk application.
I have a working prototype, but I just need to tighten it. What is missing from it is the ability to keep the user configuration small. The configuration currently takes one queue at a time, but there is the possibility of scaling that. One of the things I want to do, for example, is to enable a regular expression for specifying the queues. This way users can specify multiple queues, or all queues on a host or cluster, with .*-like patterns. The ability to enumerate queues on clusters comes via name resolution and adding the resolved name as the prefix for the queue names. With an iterator-like approach all queues can be enumerated.
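For illustration, here is the filtering idea with the managed System.Messaging API (the prototype described here uses the native enumeration, but the selection logic is the same):
using System;
using System.Linq;
using System.Messaging;
using System.Text.RegularExpressions;

static class QueueSelector
{
    // Expand a user-supplied pattern such as ".*" into the set of queues to monitor on a host.
    public static string[] Select(string machineName, string pattern)
    {
        var regex = new Regex(pattern, RegexOptions.IgnoreCase);
        return MessageQueue.GetPrivateQueuesByMachine(machineName)
            .Select(q => q.QueueName)                   // e.g. "private$\\orders"
            .Where(name => regex.IsMatch(name))
            .ToArray();
    }
}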
Another thing I want to do is to enable transactional as well as non-transactional message reading. This will cover all the queues in a variety of deployments. Other than the system reserved queues, most queues, including the special queues, can be processed by the mechanism above. By making message queue monitoring a first-class citizen of the input specifications for Splunk, we gain the ability to transform and process messages as part of the different T-shirt-size deployments and Splunk roles. This will come in very useful for scaling across different sizes, from small and medium to enterprise-level systems.
I also want to talk about system processing versus app processing of the same queues. There are several comparisons to be drawn here and consequently different merits and demerits. For example, we talked about different deployments. The other comparisons include such things as performance, being close to pipelines and processors, shared transformations and obfuscations, and indexing of data with no translation to other channels, etc.
Lastly I wanted to add that as opposed to any other channels where there is at least one level of redirection, this directly taps into a source that forms a significant part of enterprise level systems.
Furthermore, journaling and other forms of input lack the same real-time processing of machine data and are generally not turned on in production systems. Splunk forwarders, however, are commonly available to read machine data.

Wednesday, February 19, 2014

We will look at advanced Splunk server configuration, starting with modifying data input. This is important because once data is written by Splunk, it will not be changed. Data transformation is handled by the different configuration files as indicated earlier. These are props.conf, inputs.conf and transforms.conf. There is typically only one props.conf, even across different forwarders. At the input phase, we look only at the data in bulk and put tags around it, such as host, source and source type, but we don't process it as events. This is what we specify in inputs.conf. In props.conf, we add information to the tags, such as the character set, user-defined stanzas, etc. A stanza is a group of attribute-value pairs and can be a host, source or source type specified within square brackets, where we can differentiate between source types, for example to override automatic source typing. Note that props.conf affects all stages of processing globally, as opposed to the other configuration files. The stanzas in props.conf are similar to the others. Also, good user input here alleviates processing later down the line.
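For instance, an inputs.conf stanza tags the incoming data while a props.conf stanza adds parse-time details; the paths and names below are only examples:
# inputs.conf
[monitor:///var/log/myapp/app.log]
index = myapp
sourcetype = myapp_log
host = webserver01

# props.conf
[myapp_log]
CHARSET = UTF-8
SHOULD_LINEMERGE = false
TIME_FORMAT = %Y-%m-%d %H:%M:%S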
In the parsing phase, we take these tags off and process the data as individual events. We find the start and stop of events in this phase and perform other event-level processing. There is processing that could be performed in the input phase as well as the parsing phase; typically it is done once and not repeated elsewhere. That said, parsing is usually performed on the indexer or the heavy forwarder.
In the indexing phase, the events are indexed and written to disk.
Splunk indexing is read/write intensive and consequently requires better disks. The recommended RAID setup is RAID 10, which provides fast reads and writes with the greatest redundancy. RAID 5, with its extra parity writes, is not recommended. SAN and NAS storage are not recommended for recently indexed data; they are preferable for older data.
Search heads are far more cpu bound than indexers.
We will look at Splunk server administration today. Here we talk about the best practices and the configuration details for Splunk administration in a medium-to-large deployment environment. A common Splunk topology is a self-contained Splunk instance: it gathers inputs, indexes, and acts as a search interface. If the indexer is separate, then it gathers and/or receives data from forwarders and writes it to disk. It can operate alone or load-balanced with other indexers, and can also act as a search interface. A search head runs Splunk Web, generally does not index, and connects to indexers with distributed search. It is used in large implementations with high numbers of concurrent users/searches.
A light forwarder is a Splunk agent installed on a non-Splunk system to gather data locally but it can't parse or index. The purpose here is to keep the hardware footprint as small as possible on production systems.
If there are no restrictions and the hardware can support more, a heavy forwarder is installed that can also parse the Splunk data. It does not write data to disk and does not support indexing; that is left to the indexers and search head. It generally works as a remote collector, intermediate forwarder and possible data filter.
A deployment server acts as a configuration manager for a Splunk install. It can run on an indexer or search head or a dedicated machine depending on the size of the installation.
Key considerations when planning a topology include such things as how much data per day is being indexed, how many concurrent users are there and how many scheduled searches or alerts. We want to know about the data, its location, its persistence, its growth, its security, its connectivity and its redundancy to plan the deployment.
Generally, as the T-shirt size of the deployment increases, the number of indexers, forwarders and syslog devices increases. A dedicated search head is deployed for handling the search requests, but the indexers and search head are typically kept together and secured as Splunk-internal while everything else feeds into them. An intermediate forwarder may consolidate input from syslog devices, and together with the feed from the forwarders, everything is consolidated with a load-balanced feed to the Splunk indexers.

Tuesday, February 18, 2014

The scaling  of threads to process a concurrent queue was discussed. In this post we talk about integrating the data and metadata passed over the queue.
In terms of storage, we discussed that local storage is preferable for each worker. The resources are scoped for the lifetime of a worker. There is no co-ordination required between producers and consumers for access to resources. Storage can have a collection of data structures. With the partitioning of data structures, we improve fault tolerance.
In our case we have n queues with arbitrary number of messages each from a cluster. To best process these, we could enumerate and partition the queues to different worker threads from a pool. The pool itself can have different number of workers as configurable and the number of queues assigned to any worker could be determined based on dividing the total number of queues by the number of threads.
The queues are identified by their names, so we work from a global list of queue names that the workers are allotted to. This list is further qualified to select only those that are candidates for monitoring.
Whether a queue is candidate for monitoring is determined by a regular expression match between what the user provides and the name of the queue. The regular expression and pattern matching is evaluated against each name one by one to select the filter of candidate queues.
The queues are enumerated based on windows API and these are with the corresponding begin and get next methods. Each queue retrieved will have a name that can be matched with the regex provided.
The queues may have different numbers of messages, but each monitor thread works on only the current message of any queue. If that message is read or times out, the thread moves on to the current message of the next queue. All candidate queues are treated equally; queues with no messages are a fixed cost that we could try to reduce with, say, smaller timeouts.
If we consider this round-robin method of retrieving the current message from each of the queues, there is fair treatment of all queues and a guarantee of progress. What we will be missing is the ability to accelerate past queues where the same message, or no message, is current. If we could do that, we would process the queues with more messages faster. But if we didn't do round robin, we wouldn't be fair to all queues. Therefore we do not prioritize queues based on the number of distinct messages they carry. The method we have will still get to the queues with more messages and will scale without additional logic or complexity.
Each set of queues are partitioned for workers so there is no need to solve any contention and load is optimal per worker.
The number of threads could be taken as one more than the number of available processors.
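A sketch of that partitioning step; the queue names and the worker count are whatever the configuration and the processor count dictate:
using System;
using System.Collections.Generic;
using System.Linq;

static class QueuePartitioner
{
    // Divide the candidate queue names evenly among the worker threads (round-robin keeps the load even).
    public static List<string>[] Partition(IList<string> queueNames, int workerCount)
    {
        var partitions = Enumerable.Range(0, workerCount).Select(_ => new List<string>()).ToArray();
        for (int i = 0; i < queueNames.Count; i++)
            partitions[i % workerCount].Add(queueNames[i]);
        return partitions;
    }
}

// usage: var work = QueuePartitioner.Partition(candidateQueues, Environment.ProcessorCount + 1);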

Monday, February 17, 2014

We review command line tools used for support of Splunk here.
The cmd tool can invoke other tools with the required environment variables preset. These can be displayed with the splunk envvars command.
The btool command can be used to view or validate the Splunk configuration files, taking into account configuration file layering and user/app context, i.e. the configuration data visible to the given user and from the given app, or from an absolute path, or with extra debug information (see the examples at the end of this list).
btprobe queries the fish bucket for file records stored by tailing by specifying the directory or crc compute file. Using the given key or file, this tool queries the specified BTree
classify cmd is used for classifying files with types.
fsck diagnoses the health of the buckets and can rebuild search data as necessary.
hot, warm, thawed or cold buckets can be specified separately or together with all.
locktest command  tests the locks
locktool command can be used to set and unset locks
parsetest command can be used to parse log files
pcregextest command is a simple utility tool for testing modular regular expressions.
searchtest command is another tool to test search functionality of Splunk.
signtool is used for signing and verifying Splunk index buckets.
tsidxprobe will take a look at your time-series index (tsidx) files and verify the formatting or identify a problem file. It can look at each of the index files.
tsidx_scan.py is a utility script that searches for tsidx files at a specified starting location, runs tsidxprobe on each one, and outputs the results to a file.
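As examples of the btool usage mentioned above, the following list the effective configuration and check it for errors (app context can be narrowed with the --app flag):
splunk btool inputs list --debug
splunk btool props list --app=search
splunk btool check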
Perhaps one more tool that could be added to this belt is one that helps with monitoring and resource utilization, to see if the number of servers or their settings can be better adjusted.

Saturday, February 15, 2014

What's a good way to scale the concurrent processing of a queue on both the producer and consumer side? This seems a textbook question, but think about cross-platform support and high performance. Maybe if we narrow our question down to the Windows platform, that would help. Anyway, it comes down to the number of concurrent workers we can have for a queue. The general rule of thumb is that you can have one more thread than the number of processors to keep everyone busy. And if the worker is lightweight, without the overhead of TLS storage, we can scale to as many as we want. The virtual workers can use the same pool of physical threads. I'm not talking about fibers, which don't have TLS storage either; fibers are definitely welcome over OS threads. But I'm looking at a way to parallelize as much as we can in terms of the number of concurrent calls on the same system.
In addition we consider the inter worker communication both in a failsafe, reliable manner. OS provides mechanisms for thread completion based on handles returned from the CreateThread and then there's a completion port on windows that could be used with multiple threads. The threads can then close when the port is closed.
Maybe the best way to do this would be to keep a pool of threads and partitioned tasks instead of timeslicing.
Time-slicing or sharing does not really improve concurrent progress if the tasks are partitioned.
Again, it helps if the tasks are partitioned. The .NET Task Parallel Library enables both declarative parallel queries and imperative parallel algorithms. By declarative we mean we can use notations such as 'AsParallel' to indicate that we want routines to be executed in parallel. By imperative we mean we can work over explicit partitions, permutations and combinations of linear data.
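A small illustration of both styles (the work delegate is a stand-in for whatever the task does):
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

static class ParallelSamples
{
    public static void ProcessAll(string[] items, Func<string, int> work)
    {
        // Declarative: AsParallel marks the query for parallel execution.
        var results = items.AsParallel().Select(work).ToList();

        // Imperative: Parallel.ForEach over explicit range partitions of the data.
        Parallel.ForEach(Partitioner.Create(0, items.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                work(items[i]);
        });
    }
}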
In general a worker helps with parallelization when it has little or no communication and works on isolated data.
I want to mention diagnostics and boost. Diagnostics on a worker's activity, to detect hangs or to identify a worker among a set of workers, are enabled with such things as logging or tracing and identifiers for the workers. Call-level logging and tracing enable detection of activity by a worker. Within a set of workers, IDs can tell one worker apart from the rest. Logging can include this ID to find the worker with a problem activity.
There can also be a dedicated activity or worker to monitor others.
Boosting a worker's performance is in terms of CPU speed and memory. Both are variables that depend on hardware. Memory and caches are very helpful in improving the performance of the same activity by a worker.

Friday, February 14, 2014

Thread windows style

I found this Windows-style thread wrapper implementation on the web:

#ifndef _WIN_T_H_
#define _WIN_T_H_ 1
#include <iostream>
#include <cassert>
#include <memory>
#include <windows.h>
#include <process.h>

class Runnable {
public:
    virtual ~Runnable() = 0;
    virtual void* run() = 0;
};

class Thread {
public:
    Thread(std::auto_ptr<Runnable> run);
    Thread();
    virtual ~Thread();
    void start();
    void* join();
private:
    HANDLE hThread;
    unsigned wThreadID;
    std::auto_ptr<Runnable> runnable;   // runnable object will be deleted automatically when run() completes
    Thread(const Thread&);
    const Thread& operator=(const Thread&);
    void setCompleted();                // called when run() completes
    void* result;                       // stores return value from run()
    virtual void* run() { return 0; }
    static unsigned WINAPI startThreadRunnable(LPVOID pVoid);
    static unsigned WINAPI startThread(LPVOID pVoid);
    void printError(LPSTR lpszFunction, LPSTR fileName, int lineNumber);
};

class simpleRunnable : public Runnable {
public:
    simpleRunnable(int ID) : myID(ID) {}
    virtual void* run() { std::cout << "Thread " << myID << " is running" << std::endl; return reinterpret_cast<void*>(myID); }
private:
    int myID;
};

class simpleThread : public Thread {
public:
    simpleThread(int ID) : myID(ID) {}
    virtual void* run() { std::cout << "Thread " << myID << " is running" << std::endl; return reinterpret_cast<void*>(myID); }
private:
    int myID;
};
#endif

Thursday, February 13, 2014

We look at some more inputs to Splunk today. The SDK offers the following classes: TcpSplunkInput, UdpInput, WindowsActiveDirectoryInput, WindowsEventLogInput, WindowsPerfmonInput, WindowsRegistryInput and WindowsWmiInput. In addition there are the ScriptInput and MonitorInput classes.
The Input class is the base Splunk input class from which all specific inputs are derived. The InputCollection is a collection of inputs; it has heterogeneous members, with each member indicating its type of input. The type of input is identified by the InputKind class. The different InputKinds are monitor, script, tcp/raw, tcp/cooked, udp, ad, win-event-log-collections, registry, WinRegMon and win-wmi-collections.
The monitor input monitors files, directories, scripts or network ports for new data. It has a blacklist, whitelist and crcSalt. The crcSalt is the string that Splunk uses when matching the cyclic redundancy check.
As with all inputs, the corresponding Args class are used to specify the arguments to the inputs.
A ScriptInput represents a scripted data input. There is no limit to the format or content of this kind of data. A TcpInput is for raw TCP data as in the capture directly over the wire without any additional application layer processing. The latter is handled by TcpSplunkInput class.
The UdpInput represents the UDP data input. Note that there is no separate class for cooked udp input. Can you guess why ? It has to do with sessions and application logic.
A WindowsActiveDirectoryInput reads directly from the Active Directory. Since organizations secure their resources via the ActiveDirectory, this is the best input to know the hierarchy of the organization.
The Windows event log input reads directly from the event sink for windows. All windows and user applications can generate event logs and these are helpful in troubleshooting.
The Windows perfmon event input reads performance monitoring data and this is helpful for operations to see the load on the server in terms of utilization of critical resources such as memory and cpu.
The Registry input is used to gather information on windows registry keys and hive where applications and windows persist settings, state and configuration data between server reboots.
The WMI input is different from other inputs in that WMI is used for server management by operations and is a different kind of data provider.
What we could add is perhaps a MSMQ input since this has access to all the messages in the windows message queuing. These messages could be from active named private or public queues, dead letter queues, poison queues, and even journal queues. i.e everything except the non-readable internal reserved queues. When journaling is turned on we get access to all the messages historically versus getting notifications as and when the messages arrive.

Wednesday, February 12, 2014

Splunk has an SDK for programmability in different languages including C#. The object model is granular to enable the same kind of functionality as with the web interfaces. We briefly enumerate a few here:
There's an application class that represents the locally installed Splunk app.
There's an application archive that represents the archive of a Splunk app.
Both the application and the application archive classes derive from the Entity class. The Entity class is the base class for all Splunk entities. The EntityMetadata class provides access to the metadata properties of a corresponding entity and can be obtained with the GetMetadata() method.
The application args class extends Args for application creation properties. The Args class is a helper class so that the Splunk REST APIs can be called with key-value pair arguments.
ApplicationSetup class represents the setup information for a Splunk app.

The BaseService functionality is common to both Splunk Enterprise and Splunk storm. The ConfCollection represents the collection of configuration options

Alerts are represented by FiredAlert class and their groupings - FiredAlertGroup and FiredAlertGroupCollection.

The HttpService class is for the web access and uses both http and https protocols.
The Index class represents the Splunk DB/Index object. Index also comes with corresponding IndexArgs.
The Job class represents a search Job and comes with its own JobArgs, JobEventArgs, JobExportArgs, JobResultsArgs, and JobResultsPreviewArgs. The Message and MessageArgs and MessageCollection are used to represent Splunk messages.
The ResultsReaderJson and ResultsReaderXML are specific derivations of the abstract ResultsReader class used for reading search results.

The MonitorInput class represents a monitor input which is a file, directory, script or network port and soon to include windows message queuing. These are monitored for new data.
The WindowsActiveDirectoryInput, WindowsEventLogInput, WindowsPerfmonInput, WindowsRegistryInput and WindowsWmiInput classes are the corresponding data input classes.

The Receiver class exposes methods to send events to Splunk via the simple or streaming receiver endpoint.
 

Tuesday, February 11, 2014

Splunk monitors machine data. There are some concepts specific to Splunk, which we briefly review now. Index-time processing: Splunk reads data from a source such as a file or a port and classifies that source into a source type. The data is broken into events that consist of single or multiple lines, and each event is written to an index on disk for later retrieval with a search.
When search starts, events are retrieved and classified based on eventtypes and the matching events are transformed to generate reports and displayed on dashboards.
By default, the events go into a main index unless a specified index is created or stored.
Fields are searchable name/value pairings in event data - usually the default fields are host, source and sourcetype. Tags are aliases to field values. Event types are dynamic tags attached to an event. Saved Splunk objects such as savedsearches, eventtypes, reports and tags are not only persisted but also secured with permissions based on users and roles. Thus events are enriched before indexing. When sets of events are grouped into one, they are called transactions.
Apps are a collection of splunk configurations, object and code. They help the user in organization of targeted activities.
Splunk instances can work in three different modes - forwarder, indexer and search head. A forwarder is a version of Splunk that allows you to send data to a central Splunk indexer or a group of indexers. An indexer provides indexing capability for local and remote data. An indexer is usually added for every 50-100 GB per day depending on search load. A Splunk search head is typically added for every 10-20 active users depending on searches.
One of the sources for machine data is message queuing. This is particularly interesting because message queues are increasingly being used as the backbone of logging architectures for applications. Subscribing to these message queues is a good way to debug problems in complex applications. As the Exploring Splunk book mentions, you can see exactly what the next component down the chain received from the prior component. However, the depth of support to subscribe to message queues and on different operating system varies. At the very least, transactional and non-transactional message queues on Windows could be supported directly out of the box. 

Monday, February 10, 2014

We return to our discussion on Splunk commands. The eval command calculates an expression and puts the resulting value into a field. Eval recognizes functions such as :
abs(x), case(X, "Y",...), ceil(x), cidrmatch("x",Y) that identifies ip addresses that belong to a subnet, coalesce(X,...) that returns the first value that is not null, exact(X) that uses double precision, exp(x), floor(x), if (X,Y, Z), isbool(X), isint(X), isnotnull(X), isnull(X), isnum(x), isstr(), len(), like(X,"Y"), ln(X), log(X,Y), lower(X), ltrim(X,Y), match(X,Y) - which matches regex pattern, max(X,...), md5(x) which gives an md5 hash, min(X,...) which returns the min, mvcount(X) which returns the number of values of X, mvfilter(X) which filters the multivalued field, mvjoin(X,Y) which joins the field based on the specified delimiter, now() which gives the current time, null(), nullif(), pi(), pow(X,Y), random(), relative_time(X,Y), replace(X,Y,Z), round(X,Y), rtrim(X,Y), searchmatch(X) and split(X,"Y"), sqrt(X) and strftime(X, Y) which returns the time as specified by the format, strptime() that parses time from str, substr, time, tonumber, tostring, trim, typeof(X) that returns a string representation of its type, upper(X), urldecode(X) and validate(X,Y,...).
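For example, a couple of these functions in a search might look like:
... | eval delay_s=round(delay/1000,2) | eval status_class=if(like(status,"5%"),"server error","other")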
Common stats function include avg, count, dc that returns distinct values, first, last, list, max, median, min, mode, perc<x>(Y) that returns percentile, range that returns difference between max and min values, stdev that returns sample standard deviation, stdevp that returns population standard deviation, sum, sumsq, values and var that returns variance of X.
Regular expressions can be specified with the following meta characters:
\s for white space, \S for not white space, \d for Digit, \D for not digit, \w for word character,  \W for not a word character, [...] for any included character, [^...] for no included character, * for zero or more, + for one or more, ? for zero or one, | for Or, (?P<var>...) for named extraction such as for SSN, (?:...) for logical grouping, ^ start of line, $ for end of line, {...} for number of repetitions, \ for Escape, (?= ...) for Lookahead,  and (?!...) for negative lookahead. The  same regular expressions can be specified in more than one ways but the parser will attempt to simplify/expand it to cover all cases as applicable by the pattern. For example a repetition of x 2 to 5 times x{2,5} will be written as xx(x(x(x)?)?)?

Sunday, February 9, 2014

In regular expression search, matches can be partial or full. In order that we process all the groups in a full match, we iterate multiple times.
In either case, there are a few approaches to pattern matching.
The first is backtracking
The second is finite automata
The third approach is RE2.
Often production implementations mix and match more than one of the approaches above.
We discuss these briefly here.

Kernighan and Pike describe a very fast and simple backtracking approach. Matches are determined one position at a time. Positions need not be contiguous. If a match occurs, the remainder of the pattern is checked against the rest of the string. A recursive function helps here, and the recursion can be nested up to the length of the pattern. The pattern is maintained based on the literals and metacharacters provided. The empty string can match an asterisk. A separate recursive method may exist for the asterisk wildcard character since it is matched in one of three different ways, depending on the parameters to the function.
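Here is a C# transcription of that compact matcher, supporting literals, '.', a trailing '*', '^' and '$' (a sketch of the idea rather than their exact code):
static class TinyRegex
{
    // Match: search for the pattern anywhere in the text.
    public static bool Match(string regexp, string text)
    {
        if (regexp.StartsWith("^"))
            return MatchHere(regexp.Substring(1), text);
        for (int i = 0; ; i++)              // try every starting position, including the empty remainder
        {
            if (MatchHere(regexp, text.Substring(i)))
                return true;
            if (i == text.Length)
                return false;
        }
    }

    // MatchHere: does the pattern match at the beginning of the text?
    static bool MatchHere(string regexp, string text)
    {
        if (regexp.Length == 0)
            return true;
        if (regexp.Length > 1 && regexp[1] == '*')
            return MatchStar(regexp[0], regexp.Substring(2), text);
        if (regexp == "$")
            return text.Length == 0;
        if (text.Length > 0 && (regexp[0] == '.' || regexp[0] == text[0]))
            return MatchHere(regexp.Substring(1), text.Substring(1));
        return false;
    }

    // MatchStar: zero or more occurrences of c, so the empty string matches too.
    static bool MatchStar(char c, string regexp, string text)
    {
        for (int i = 0; ; i++)
        {
            if (MatchHere(regexp, text.Substring(i)))
                return true;
            if (i == text.Length || (text[i] != c && c != '.'))
                return false;
        }
    }
}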

Unlike the backtracking approach above, used by PCRE, Perl and Python, which can take exponential time even though it supports a variety of features, RE2 is efficient in that it uses automata theory and still supports leftmost-first match semantics. This approach has the following steps:
Step 1: Parse (Walking a Regexp)
Step 2: Simplify
Step 3: Compile
Step 4: Match

The above steps are explained as below:

The parser could be simple, but there's a variety of regular expressions that can be specified. Some parsers maintain an explicit stack instead of using recursion. Some parsers allow a single linear scan by keeping reverse Polish notation. Others use trees or an explicit stack. It's possible to parse a regular expression as a decision tree over the specified input text to see if any of the substrings at a position match the pattern; I take that back, we could do with a graph of predefined operations instead. These operations are concatenation, repetition and alternation.
The next step is to simplify. The purpose of simplifying is to reduce the memory footprint.
The result of the next step i.e. compilation is an instruction graph that makes it easy for pattern matching. This way parser doesn't need to be invoked again and again.
Lastly, the step for matching can be both partial or full.

Saturday, February 8, 2014

In the previous post, we talked about regular expressions. Their implementation is quite interesting, and I would like to cover it in a subsequent post. Meanwhile I wanted to mention the different forms a regular expression can take. The given pattern can have wildcard and meta characters. This means the pattern can match a variety of forms. Typically we need one expanded form that will match most patterns.
By expanding the regular expression I mean, we use a form which has all the different possible groups enumerated. By expanding the groups, we attempt to cover all the patterns that were intended to be matched. Recall how in our final desired output, we order the results based on groups and captures. Expanding the pattern helps us with this enumeration.
There are many editors for regular expressions, including Regular Expression Buddy, which lets users try out their expression against sample data in an interactive manner. The idea is that some experimentation may be needed to decide which regular expression best suits the need, so that there is little surprise between the desired and actual output. Although Regular Expression Buddy is downloadable software, there are web applications such as regexpr that we can try for the same purpose. The user interface allows you to interactively and visually see the results with highlighted matches.

Friday, February 7, 2014

At the heart of every text searching, there is a pattern match that's expected. The regex operator is very widely used and it is equally important as with any software for searching and particularly in Splunk.
Let us look at this operator more closely and see if we can find an implementation that works well. There is a need to optimize some codepaths for fast pattern matching especially for simple patterns. However, here we want to focus on the semantics, organization and the implementation.
Patterns are best described by Group and Captures.
A Group can be a literal or a pattern. Groups can be nested and indicate one or more occurrences of their elements.
A Capture is a match between a group and the text.
A Capture has such things as index and length of the match within the original string.
 A group can have many captures often referred to as CaptureCollection.
A match may have many groups each identified by a group number for that match
Matches can follow one after the other in a string. It's necessary to find all. The caller can call Match.NextMatch() to iterate over them.
The results of the output should look something like this:
Original text
Match found :
    Group 1=
                Capture 0 =   value      Index=      Length=
                Capture 1 =   value      Index=      Length=
    Group 2=
                Capture 0 =   value      Index=      Length=
                :
and so on.
Since wild cards and other meta characters are supported, it is important to match the group for each possible candidate capture.
All captures are unique in the sense that they have a distinct index and length pair. Indexes and Length won't be sequential but the larger captures precede the smaller captures because the smaller are typically the subset of the bigger.
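The same shape falls out of the .NET Regex classes; here is a short sketch that prints matches, groups and captures in the layout above:
using System;
using System.Text.RegularExpressions;

static class RegexDump
{
    // Walk every match, group and capture and print them in the layout sketched above.
    public static void DumpMatches(string pattern, string text)
    {
        Console.WriteLine(text);
        for (Match match = Regex.Match(text, pattern); match.Success; match = match.NextMatch())
        {
            Console.WriteLine("Match found : {0}", match.Value);
            for (int g = 1; g < match.Groups.Count; g++)
            {
                Console.WriteLine("    Group {0} = {1}", g, match.Groups[g].Value);
                CaptureCollection captures = match.Groups[g].Captures;
                for (int c = 0; c < captures.Count; c++)
                    Console.WriteLine("        Capture {0} = {1}  Index={2}  Length={3}",
                        c, captures[c].Value, captures[c].Index, captures[c].Length);
            }
        }
    }
}

// e.g. RegexDump.DumpMatches(@"((\d{3})[- ])?(\d{3}-\d{4})", "call 425-555-0123 or 555-9876");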
           


I've been reading a book on exploring Splunk to index and search machine data. I want to continue on that discussion and include my take from a training video I've seen. Today I want to take a break to paint a vision of my text processing system  that can do both keyword weighting and topic extraction from text. I've attempted different projects on different kinds of implementations - both in algorithms and implementations. Most of them have not been satisfactory except perhaps the more recent ones and even there there are some refinements still to be done. But I've learned some and can associate an appropriate algorithm for the task at hand. For the most part, it will follow conventional wisdom. By that I mean where documents are treated as term vectors and vectors are reduced from a high number of dimensions before they are clustered together. I've tried thinking about alternative approaches to avoid the curse of dimensions and I don't feel I have done enough on that front but the benefits of following the convention is that there is plenty of literature on what has worked before. In many cases, there is a lot of satisfaction if it just works. Take for instance the different algorithms to weigh the terms and the clustering of topics. We chose some common principles from most of the implementation discussion in papers and left out the fancy ones.We know that there are soft memberships to differ topics, different ways in which the scope of the search changes and there are different tools to be relied on but overall we have experimented with different pieces of the puzzle so that they can come together.
I now describe the overall layout and organization of this system. We will have layers for different levels of engagement and functionalities starting with the backend all the way to the front end. The distinguishing feature of this system will be that it will allow different algorithms to be switched in and out for the execution of the system. And a scorecard to be maintained for each of the algorithms that can then be evaluated against the text to choose what's best. Due to the nature of the input and the the emphasis of each algorithm, such a strategy design pattern becomes a salient feature of the core of our system. The engine may have to do several mining techniques and may even work with big data, hence it should have a distributed framework where the execution can be forked out to different agents. Below the processing engine layer will be  a variety of large data sources and a data access layer. There could also be node initiators and participants from a cluster. The processing engine can sit on top of this heterogenous system.
Above the processing engine comes the management layer that can handle remote commands and queries. These remote commands could be assumed to come over http and may talk to one or more of the interfaces that the customer uses.These could include command line interfaces, User Interface and an administration panel.
The size of the data and the scalability of the processing as well as the distributed tasks may require modular components with communication so that they can be independently tested and they can be switched in and out. Also, the system may perform very differently for data that doesn't fit in main memory be it at the participant or the initiator machine.

Thursday, February 6, 2014

Today I will discuss Named return value optimization and copy elision. Copy elision is a compiler optimization technique that eliminates unnecessary copying of objects. For example, copying can be eliminated in the case of temporary objects of class type that has not been bound to a reference. This is the case in return value optimization. Take the following code as given in msdn :
#include <iostream>
class RVO
{
public:
    RVO() { std::cout << "constructor called" << std::endl; }                       // constructor prints call
    RVO(const RVO& other) { std::cout << "copy constructor called" << std::endl; }  // copy constructor prints call
    ~RVO() { std::cout << "destructor called" << std::endl; }                       // destructor prints call
    int mem_var;                                                                    // declare data variable
};

Now if there were code for a method to return an object of this class after assigning the data variable, it would call the constructor twice, the copy constructor once and the destructor three times, in that order, because one object is created within the method, another object is created in the caller, and a temporary object is created for the return value between the method and the caller.

This temporary object can now be done away with in an optimization without affecting the program logic because both the caller and the method will have access to their objects. This optimization is called return value optimization.
Hence the program output with return value optimization will print two constructors followed by two destructors and without the line for copy constructor and a destructor. This saves on memory space particularly if the object can be of arbitrary size.

Compilers such as the Visual C++ compiler perform this optimization when compiler optimizations are enabled. The effect should be visible in memory usage, and the difference can be observed with a resource monitor.

A side effect of this optimization is that the programmer should not depend on the temporary objects being created. For example, if the programmer increments a reference count on an object in both the constructor and the copy constructor, he should not rely on the difference between the two, because the copy may never happen. Constructor and destructor calls remain paired, however, so the programmer can still rely on the lifetime of the object without losing the semantics.
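As an illustration, consider a hypothetical Counted class (the name and counter are made up for this sketch) that increments a live-object counter in both its constructor and copy constructor; the intermediate counts depend on whether the copy is elided, but the final count does not, because constructor and destructor calls stay paired:

#include <cstdio>

// Illustrative only: a class that counts live instances in both its
// constructor and copy constructor. The observed intermediate count differs
// depending on whether the compiler elides the copy, so program logic
// should not depend on it.
static int live = 0;

struct Counted
{
    int value = 0;
    Counted() { ++live; }
    Counted(const Counted& other) : value(other.value) { ++live; }
    ~Counted() { --live; }
};

Counted Make(int v)
{
    Counted c;
    c.value = v;
    return c;           // the copy (and its increment) may be elided
}

int main()
{
    Counted c = Make(7);
    printf("live objects: %d\n", live);  // 1 either way; only intermediate counts vary
    return 0;
}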

Wednesday, February 5, 2014

We discussed alerts, actions, charts, graphs, visualizations, and dashboards in the previous post. We now review recipes for monitoring and alerting. These are supposed to be brief solutions for common problems. Monitoring helps you see what is happening to your data. As an example, let us say we want to monitor how many concurrent users are there at any given time. This is a useful metric to see if a server is overloaded.  To do this, we search for the relevant events. Then we use the concurrency command to find the number of users that overlap. Then we use a time chart reporting command to display a chart of the number of concurrent users.
We specify this as search sourcetype=login_data | concurrency duration=ReqTime | timechart max(concurrency)
Let us say next that we want to monitor the inactive hosts.
We use the metadata command, which gives information on hosts, sources, and source types.
Here we specify
| metadata type=hosts | sort recentTime | convert ctime(recentTime) as Latest_Time
We can use tags to categorize data and use it with our searches.
In the above example, we could specify:
... | top 10 tag::host to specify top ten host types.
Since we talked about tags, we might as well see an example with event types.
We could display a chart of how host types perform, using only event types that end in _host, with the following:
... | eval host_types=mvfilter(match(eventtype, "_host$"))
    | timechart avg(delay) by host_types
Another common question we can help answer with monitoring is: how did today perform compared to the previous month?
For example, we might want to view the hosts that were more popular today than in the previous month.
This we do with the following steps:
1. get the monthly usage for each host
2. get the daily usage for each host and append
3. use stats to join the monthly and daily usages by host.
4. use sort and eval to format the results.
Let's try to reconstruct these commands from memory (the exact recipe in the book may differ):
| metadata type=hosts | sort duration | earliest=-30d@d | stats sum(duration) as monthly_usage by host | sort 10 -monthly_usage | streamstats count as MonthRank
Cut and paste the above, with changes for daily usage, as an appended subsearch:
append [ | metadata type=hosts | sort duration | earliest=-1d@d | stats sum(duration) as daily_usage by host | sort 10 -daily_usage | streamstats count as DayRank ]
Next, join the monthly and the daily rankings with the stats command:
stats first(MonthRank) as MonthRank first(DayRank) as DayRank by host
Then we format the output :
eval diff=MonthRank-DayRank | sort DayRank | table DayRank, host, diff, MonthRank
Each of these steps can now be piped into the next, so the whole thing can be pipe-concatenated into a single composite search query.




Tuesday, February 4, 2014

In today's post we will cover another chapter of the Exploring Splunk book. This chapter is on enriching data. We can use commands like top and stats to explore the data. We can also add sparklines, which are small line graphs, so that patterns in the data can be quickly and easily visualized.
With Splunk it is easy to exclude data that has already been seen. We do this with tagging. This helps separate interesting events from noise.
When we have identified the fields and explored the data, the next step is to categorize and report on it.
Different event types can be created to categorize the data. There are only two rules to keep in mind with event types: no pipes can be used in an event type definition, and nested searches (subsearches) cannot be used to create event types. For example, status=2* can define success cases and status=4* client errors.
More specific event types can be built on top of more general event types. For example, web_error can include both client_errors and server_errors. The granularity of event types is left to user discretion, since the results matter to the user.
Event types can also have tags. A more descriptive tag about the errors enhances the event type.
As an example, user_impact event tag can be used to report on the events separately.
Together, event types and tags allow data categorization and reporting for voluminous machine data. Refining this model is usually an iterative effort. We could start with a few useful fields and then expand the search; all the while, this gives Splunk more input to organize and label the data.
We mentioned visualizing data with sparklines. We can also visualize data with charts and graphs. This is done from the create report tab of the search page.
For example, we can search with a query such as sourcetype=access* status=404 | stats count by category_id and then create a pie chart on the results. Hovering over the chart now gives details of the data.
Dashboards are yet another visualization tool. Here we present many different charts, graphs, and other visualizations in a reporting panel. As with most reporting, a dashboard caters to an audience and effectively answers the few questions that audience is most interested in. These can be gathered from user input and feedback iterations. As with charts and graphs, it's best to start with a few high-level fields before making the dashboard more sophisticated.
Alerts are another tool; they can run periodically or on events, when search results evaluate against a condition. There are three options to schedule an alert. The first is to monitor whenever the condition happens. The second is to monitor on a scheduled basis, for less urgent information. The third is to monitor using a real-time rolling window, for when a certain number of things happen within a certain time period.
Alerts can have associated actions that make them all the more useful. The actions can be specified via the wizard; examples include sending an email, running a script, and showing triggered alerts.

To make the data more usable, Splunk allows enriching it with additional information so that Splunk can classify it better. Data can be saved in reports and dashboards that make it easier to understand. And alerts can be added so that potential issues can be addressed proactively rather than after the fact.
The steps in organizing data usually involve identifying fields in the data and categorizing data as a preamble to aggregation and reporting. Preconfigured settings can be used to identify fields; these utilize hidden attributes embedded in machine data. When we search, Splunk automatically extracts fields by identifying common patterns in the data.
Configuring field extraction can be done in two ways: Splunk can automate the configuration by using the interactive field extractor, or we can specify the configuration manually.
Another way to extract fields is to use search commands. The rex command comes in very useful here; it takes a regular expression and then extracts fields that match that expression. To extract fields from multiline tabular data we use multikv, and to extract from XML and JSON data we use spath or xmlkv.
The search dashboard's field sidebar gives immediate information for each field, such as:
the basic data type of the field with abbreviations such as a for text and # for numeric
the number of occurrences of the field in the events list (following the field name)

Monday, February 3, 2014

We discuss the various kinds of processors within a Splunk pipeline. We have the monitor processor that looks for files and for new entries at the end of a file. Files are read one at a time, in 64KB chunks, until EOF. Large files and archives could be read in parallel. Next we have a UTF-8 processor that converts different character sets to UTF-8.
We have a LineBreaker processor that introduces line breaks in the data.
We also have a LineMerge processor that does the reverse.
We have a HeadProcessor that multiplexes different data streams into one channel such as TCP inputs.
We have a Regex replacement processor that searches and replaces the patterns.
We have an annotator processor that adds puncts to raw events. Similar events can now be found.
The Indexer pipeline has TCP output and syslog output, both of which send data to a remote server. The indexer processor sends data to disk. Data goes either to a remote server or to disk, but usually not both.
An Exec processor is used to handle scripted input. 

Sunday, February 2, 2014

With this post, I will now return to my readings on Splunk from the book Exploring Splunk. Splunk has a server and a client. The engine of Splunk is exposed via REST-based APIs to the CLI, the web interface, and other interfaces.
The engine has multiple layers of software. At the bottom layer are components that read from different source types such as files, network ports, or scripts. The layer above is used for routing, cloning, and load balancing the data feeds; this depends on the load, which is generally distributed for better performance. All the data is subject to indexing, and an index is built. Note that both the indexing layer and the layer below it, i.e. routing, cloning, and load balancing, are deployed and set up with user access controls. This is essentially where what gets indexed, and by whom, is decided. The choice is left to users because we don't want sharing or privacy violations, and by leaving it configurable we remain independent of how much or how little is sent our way for processing.
The layer on top of the index is search, which determines the processing involved in retrieving results from the index. The search query language is used to describe the processing, and the searching is distributed across workers so that the results can be processed in parallel. The layers on top of search are scheduling/alerting, reporting, and knowledge, each of which is a dedicated component in itself. The results from these are sent through the REST-based API.
Pipeline is the term used for the data transformations as the data changes shape, form, and meaning before being indexed. Multiple pipelines may be involved before indexing. A processor performs a small but logical unit of work, and processors are logically contained within a pipeline. Queues hold the data between pipelines; producers and consumers operate on the two ends of a queue.
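To make the queue-between-pipelines idea concrete, here is a minimal, generic producer/consumer sketch; it is only an illustration of the shape of the idea, not Splunk's actual implementation:

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Illustrative only: a small thread-safe queue standing in for the queue
// that sits between two pipelines, with one producer and one consumer.
class EventQueue
{
    std::queue<std::string> items;
    std::mutex m;
    std::condition_variable cv;
public:
    void push(const std::string& event)
    {
        { std::lock_guard<std::mutex> lock(m); items.push(event); }
        cv.notify_one();
    }
    std::string pop()
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !items.empty(); });
        std::string event = items.front();
        items.pop();
        return event;
    }
};

int main()
{
    EventQueue q;
    std::thread producer([&q] {   // the upstream pipeline pushes events
        for (int i = 0; i < 3; ++i) q.push("event " + std::to_string(i));
    });
    std::thread consumer([&q] {   // the downstream pipeline pops and processes them
        for (int i = 0; i < 3; ++i) printf("consumed %s\n", q.pop().c_str());
    });
    producer.join();
    consumer.join();
    return 0;
}
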
The file input is monitored in two ways: a file watcher scans directories and finds files, while another reader tails the files where new data is being appended.
In today's post, I also want to talk about a nuance of K-means clustering. Here each vector is assigned to the nearest cluster, as measured by distance to the centroids. There are ways to assign clusters without centroids, based on single link, complete link, and so on, but centroid-based clustering is the easiest in that the computations are limited.
Note that the number of clusters is pre-specified before the start of the program. This means that we don't change the expected outcome; that is, we don't return fewer than the expected clusters. Even if all the data points belong to one cluster, this method aims to partition the n data points into k clusters, each with its own mean or centroid.
The mean is recomputed after each round of assigning data points to clusters. We might start with, say, three clusters whose means have been initialized. At each step, the data points are assigned to the nearest cluster. In this scheme the clusters should not stay empty: if a cluster becomes empty because its members join other clusters, then that cluster should take an outlier from an already populated cluster. This way the coherency of each of the clusters goes up.
If the number of clusters is large to begin with and the data set is small, this will lead to a highly partitioned data set where one or more of the clusters does not adequately represent the data, and the resulting clusters may need to be taken together to see the overall distribution. However, this is not an undesirable outcome; it is the expected outcome for the number of partitions specified. The number of partitions was simply specified too high, and this will be reflected in a chi-square goodness-of-fit test. The next step then should be to reduce the number of clusters.
If we specify only two clusters and all the data points are visually close to one predominant cluster, then the other cluster need not be kept empty either. We can improve the predominant cluster by moving one of its outliers into the secondary cluster.
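A minimal sketch of the empty-cluster handling described above might look like the following; the one-dimensional points, starting centroids, and fixed iteration count are made up for illustration, and a real implementation would use full vectors and a convergence test:

#include <cstdio>
#include <cmath>
#include <limits>
#include <vector>

// Return the index of the centroid nearest to x.
int nearestCentroid(double x, const std::vector<double>& centroids)
{
    int best = 0;
    double bestDist = std::numeric_limits<double>::max();
    for (int c = 0; c < (int)centroids.size(); ++c) {
        double d = std::fabs(x - centroids[c]);
        if (d < bestDist) { bestDist = d; best = c; }
    }
    return best;
}

int main()
{
    std::vector<double> data = { 1.0, 1.2, 0.8, 1.1, 0.9, 25.0 };
    std::vector<double> centroids = { 0.0, 10.0, 50.0 };   // k = 3, deliberately poor start
    std::vector<int> assignment(data.size(), 0);

    for (int iter = 0; iter < 10; ++iter) {
        // Assignment step: each point goes to its nearest centroid.
        for (size_t i = 0; i < data.size(); ++i)
            assignment[i] = nearestCentroid(data[i], centroids);

        // Empty-cluster handling: give an empty cluster the point that is
        // farthest from its current centroid (an outlier of a populated cluster).
        for (int c = 0; c < (int)centroids.size(); ++c) {
            bool empty = true;
            for (int a : assignment) if (a == c) { empty = false; break; }
            if (!empty) continue;
            size_t outlier = 0; double worst = -1.0;
            for (size_t i = 0; i < data.size(); ++i) {
                double d = std::fabs(data[i] - centroids[assignment[i]]);
                if (d > worst) { worst = d; outlier = i; }
            }
            assignment[outlier] = c;
        }

        // Update step: recompute each centroid as the mean of its members.
        for (int c = 0; c < (int)centroids.size(); ++c) {
            double sum = 0.0; int count = 0;
            for (size_t i = 0; i < data.size(); ++i)
                if (assignment[i] == c) { sum += data[i]; ++count; }
            if (count > 0) centroids[c] = sum / count;
        }
    }

    for (size_t i = 0; i < data.size(); ++i)
        printf("point %.1f -> cluster %d\n", data[i], assignment[i]);
    return 0;
}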

Saturday, February 1, 2014

In this post, I give some examples of DTrace usage:
DTrace is a tracing tool that we can use dynamically and safely on production systems to diagnose issues across layers. The common DTrace providers are :
dtrace - start, end and error probes
syscall - entry and return probes for all system calls
fbt - entry and return probes for all kernel functions
profile - timer driven probes
proc - process creation and lifecycle probes
pid - entry and return probes for user-level functions in a traced process
io - probes for all I/O related events.
sdt/usdt - developer defined probes
sched - for all scheduling related events
lockstat - for all locking behavior within the operating system
The syntax to specify a command is: probe-description /predicate/ { action }
Variables (e.g. self->varname = 123) and associative arrays (e.g. name[key] = expression) can be declared. They can be global, thread-local, or clause-local. Associative arrays are looked up by key.
Common builtin variables include:
args: the typed arguments to the current probe
curpsinfo: the process state for the current process
execname: the name of the current process's executable
pid: the process id of the current process
probefunc, probemod, probename, probeprov: the function name, module name, name, and provider name of the current probe
timestamp, vtimestamp: the current timestamp and the amount of time the current thread has been running
Aggregate functions include count, sum, avg, min, max, lquantize, quantize, clear, trunc etc.
Actions include trace, printf, printa, stack, ustack, stop, copyinstr, strjoin and strlen.
DTrace oneliners:
Trace new processes:
dtrace -n 'proc:::exec_success { trace(curpsinfo->pr_psargs); }'
Trace files opened
dtrace -n 'syscall::openat*:entry { printf("%s,%s", execname, copyinstr(arg1)); }'
Trace number of syscalls
dtrace -n 'syscall:::entry {@num[execname] = count(); trace(execname); }'
Trace lock times by process name
dtrace -n 'lockstat:::adaptive_block { @time[execname] = sum(arg1); }'
Trace file I/O by process name
dtrace -n 'io:::start { printf("%d %s %d", pid, execname, args[0]->b_bcount);}'
Trace the writes in bytes by process name
dtrace -n 'sysinfo:::writech { @bytes[execname] = sum(arg0); }'