Friday, February 28, 2014

In the post on using the SQL Server Service Broker as a modular input for Splunk, we introduced the technique; now we describe the complete solution. We mentioned that we can read messages from the broker by opening a SqlConnection and executing a SQL statement. For every message received, we create a Splunk modular input event wrapper and send the data off to Splunk.
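As a rough sketch, the receive can be as simple as a WAITFOR (RECEIVE ...) statement executed over a SqlConnection; the queue name and connection string below are placeholders, not taken from the actual solution:

using System.Data.SqlClient;

static void ReceiveOnce(string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        // WAITFOR(RECEIVE ...) blocks up to the timeout for the next message on the queue.
        var command = new SqlCommand(
            @"WAITFOR (RECEIVE TOP(1)
                  conversation_handle,
                  message_type_name,
                  CAST(message_body AS NVARCHAR(MAX)) AS body
              FROM TargetQueue), TIMEOUT 5000;", connection);

        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                string messageType = reader.GetString(1);
                string body = reader.IsDBNull(2) ? string.Empty : reader.GetString(2);
                // Wrap the body in a modular input event and hand it off to Splunk here.
            }
        }
    }
}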
The program implements the Splunk Script object and the methods required to take the configuration from the user, apply it, and extract events. These events are extracted from the message broker as mentioned above.
The sampling rate is determined by the number of messages. We fork a thread to process the messages for each designated queue. The config determines the queue names, the sampling intervals, etc.; the default sampling interval is zero.
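The skeleton of that run method might look like the following; the class name, method names, and configuration keys here are my own placeholders, not the actual SDK surface:

using System.Collections.Generic;
using System.Threading;

public class BrokerInput
{
    public void Run(IDictionary<string, string> config)
    {
        // e.g. config["queues"] = "OrdersQueue,AuditQueue", config["interval"] = "0"
        string[] queues = config["queues"].Split(',');
        int interval = config.ContainsKey("interval") ? int.Parse(config["interval"]) : 0;

        var workers = new List<Thread>();
        foreach (string queue in queues)
        {
            var thread = new Thread(() => ProcessQueue(queue, interval));
            thread.Start();
            workers.Add(thread);
        }
        workers.ForEach(t => t.Join());
    }

    private void ProcessQueue(string queue, int interval)
    {
        // Open a SqlConnection and RECEIVE from this queue in a loop, as in the
        // earlier sketch; sleep for the sampling interval between reads.
    }
}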
We invoke the CLI command object from the SDK. Specifically, we say Command.Splunk("Search").
Then we add the parameter we want to search with. We can check cli.Opts to see if our search command and parameter were added. After defining the command object, we create a job object to invoke it. We do it with the following:
var service = Service.Connect(cli.Opts);
var jobs = service.GetJobs();
var job = jobs.Create((string)cli.Opts["search"]);
We wait till the job is done, which we can check with the done flag on the job.
We retrieve the results of the job with the job.getResults call, which returns a stream. We can then open a ResultsReaderXml on this stream to retrieve the events.
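Putting those steps together, the wait-and-read portion might look roughly like this; the member names (IsDone, Refresh, Results, ResultsReaderXml) follow the older C# SDK this post uses, so treat them as assumptions and verify against your SDK version:

using System.Threading;

var job = jobs.Create((string)cli.Opts["search"]);

// Poll until splunkd marks the search job as done.
while (!job.IsDone)
{
    Thread.Sleep(500);
    job.Refresh();
}

// The results come back as a stream that ResultsReaderXml knows how to parse.
using (var stream = job.Results())
using (var reader = new ResultsReaderXml(stream))
{
    foreach (var result in reader)
    {
        // Each result exposes the event's fields; pull out whatever is needed here.
    }
}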


Thursday, February 27, 2014

Today we discuss more on modular inputs. In this case, we talk about the properties we can set on the modular input data structure. One of the properties is the index, which determines which index the events should go to. There is a default index, but it's preferable to store our events in their own index. The second property is the stanza name, a unique source type for this kind of event. In fact the stanza is user specified, and the events are often retrieved based on these literals.
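For example, those two properties typically end up in inputs.conf alongside the rest of the stanza; the scheme and names below are made up just to show the shape:

[servicebroker://orders_queue]
index = broker_events
sourcetype = broker:message
interval = 60
disabled = 0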

The idea behind modular inputs seems to be to envelop the events so they can be differentiated from others. This helps even before parsing and indexing have taken place.
When we use the script object with the input, we are able to govern the flow of these events. We can increase it slowly from a trickle to a deluge, if we want, by varying the polling interval or the sampling rate.

In addition to differentiating and regulating the flow, the modular input and scripts can filter on raw data before it is sent over. For example, the inbound and outbound messages may not only be differentiated but one selected over the other.
A Splunk app that uses modular input uses the script object, which has methods to run and to stream events. Examples are provided in the Splunk documentation, but I would like to discuss how to integrate it with a service. A script conforms to an executable much like the service; only in this case it follows the conventions of Splunk apps. Otherwise both the script and the service perform data collection. While the service may have methods to process the incoming and outgoing messages, the script has a method for streaming events. In addition, the script is set up for polling, to monitor the changes and to validate the user input. In essence, a modular input script is suited to interface with the user for the configuration of the monitor and to set up the necessary index, queue, stanza, etc. for the events to be written to. With this framework, no matter what the events are, they can join the pipeline to Splunk.
The StreamingEvents method runs a while loop to poll or monitor the changes at regular intervals. The method sleeps between polls, relieving the CPU. Sleeping helps particularly with the otherwise high CPU usage on uniprocessors and on earlier systems, such as what Sustaining Engineering occasionally sees with customers. Introducing the sleep at the correct place during polling alleviates this; even a Sleep(0) will be sufficient.
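A minimal sketch of that loop, assuming the stdout XML streaming format that modular inputs use; ReadNextMessage is a hypothetical stand-in for whatever actually fetches the data, such as the Service Broker RECEIVE above:

using System;
using System.Security;
using System.Threading;

class Poller
{
    static void StreamEvents(int pollIntervalMs)
    {
        Console.WriteLine("<stream>");       // the closing </stream> would be written on shutdown
        while (true)
        {
            string message = ReadNextMessage();          // e.g. a RECEIVE against the broker queue
            if (message != null)
            {
                // Escape the payload before embedding it in the event XML.
                Console.WriteLine("<event><data>{0}</data></event>",
                    SecurityElement.Escape(message));
            }
            Thread.Sleep(pollIntervalMs);                // even Sleep(0) yields the CPU between polls
        }
    }

    static string ReadNextMessage()
    {
        // Placeholder: fetch the next message from the queue; return null when there is none.
        return null;
    }
}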
In the case of the service that monitors the Service Broker queue (sample code on MSDN), it has the same run method that the Splunk Script object has. It also begins with loading the configuration, and it forks a thread to watch each queue. Each thread, for its lifetime, opens a SQL connection object and retrieves the Service Broker messages in a polling loop, just like the extract events method of the Splunk object.
Thus the easiest way to integrate the Service Broker method is to use the Service Broker queue-reading logic, for all specified queues, in the extract events method of the modular input.

Wednesday, February 26, 2014

One of the ways we can look at logging on Windows for any component is with WPP tracing. This is true for any logs, including system components, device drivers, applications, services, and any registered trace provider. The trace providers are usually found by the GUID they register with, or that information is extracted from the PDB. The Windows DDK ships a tool called TraceView that can collect and display these traces.
This tool may not be up to date on the trace log format, but we can easily convert the trace captured in a .etl log file by using eventvwr -> Open Saved Log and saving it to the newer format.
Here's an example of what the logs look like:
00111375 drv 7128 9388 2 111374 02\26\2014-15:59:34:202 Driver::Registry call back: filter out event due to machine or user path set in config. operation = QueryValueKey
The events are displayed because we have the formatting for them. This is usually contained in the trace message format files maintained by the providers, or it is part of their PDBs. If we don't have the formatting information, the events look something like this:
00111386 Unknown 7128 9388 2 111385 00\00\   0-00:00:00:00 Unknown( 40): GUID=bbd47d81-a1f8-551f-b37f-8ce988bb02f2 (No Format Information found).
That said, we may not be able to use the same fields we see in TraceView with the filters in the Event Viewer; the latter maintains its own filter fields, attributes, and levels.
The event viewer logs have several features.
First off, it conforms to a template that's universally recognized, and it identifies events by their source, IDs, etc.
Second, it can collect a variety of logs: application, system, and security. These provide a sink for all the event tracing information on the system, and they can be saved and viewed offline.
Third, eventvwr can connect to remote computers and display the events from the logs there. This is critical when it comes to viewing information across machines.
If our interest is only in filtering certain events, the logman tool can be helpful for filtering events based on provider GUID. There are some other options available as well, such as starting, stopping, updating, and deleting data collectors, querying a data collector's properties, and importing or exporting the XML file.
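For instance, reusing the provider GUID from the unformatted event above, a trace session filtered to that one provider can be created and controlled along these lines (the session name and paths are placeholders):

logman create trace mytrace -p {bbd47d81-a1f8-551f-b37f-8ce988bb02f2} -o c:\traces\mytrace.etl
logman start mytrace
logman stop mytrace
logman export mytrace -xml mytrace.xml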

Tuesday, February 25, 2014

I'm going to blog about Splunk CLI commands. By the way, I'm going to check if the fifo input is discontinued. Meanwhile, let's talk about some basic CLI commands now.
There are several basic commands, and it may take a while to cover all of them. I'll try going case by case, say for a given task at hand, so that we know how to use them. Again, there's plenty of literature on docs.splunk.com, but my goal here is to mention the ones I've used.
Here's a command to register perfmon. You can modify the inputs.conf file with the details of the perfmon config:
splunk add exec scripts\splunk-perfmon.path -interval 60
and splunk enable perfmon
The CLI commands are based on verbs and objects.
You can start or stop splunk with: splunk start splunkd --debug
but you can only do that with splunkd and splunkweb. Also, since we are talking about perfmon events, we can use the CLI to see what perfmon will be collecting with our command:
splunk list perfmon
In this case, it will give you output such as:
Monitored Perfmon Collections:
        LogicalDisk
                _TCP_ROUTING:windowsIndex
                counters:*
                disabled:0
                host:RRAJAMANIPC
                index:windows_perfmon
                interval:10
                object:LogicalDisk
These are what we define in the inputs.conf file.
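Reconstructed from the listing above, the corresponding inputs.conf stanza would look roughly like this:

[perfmon://LogicalDisk]
object = LogicalDisk
counters = *
interval = 10
index = windows_perfmon
disabled = 0
_TCP_ROUTING = windowsIndex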
Note that individual perfmon items can also be enabled or disabled separately
splunk enable perfmon LogicalDisk
and similarly we can disable them individually as follows:
splunk disable perfmon LogicalDisk
The CLI also lets us activate a configuration change with the reload command, as in
splunk reload perfmon
which makes it effective immediately.
Program execution logging seems to be an art. While it can be dismissed as a chore, for sustaining engineering it is an invaluable diagnostic. What would make it easier to troubleshoot problems is a descriptive message when errors occur. Typically these messages describe at-the-moment errors without any indication of what the customer could do to mitigate them. I don't mean that error messages need to be expanded to include corrective actions in all cases. That would help, but perhaps an association between error messages and corrective actions could be maintained. Say, if we keep all our error message strings in one place, then it could be easy to correlate the errors to the actions by keeping them side by side.
The corrective action strings need not even be in the logging, but the association could help support and sustaining engineers diagnose issues, especially when the workarounds are domain knowledge. This would avoid a lot of communication and even help the engineers in the field.
At the same time, this solution may not be appropriate in all cases, for example where we don't want to be too informative to our customers or where we don't want to confound them with too much detail. Even in such cases, being elaborate in the error conditions and the descriptive messages may help the appropriate audience target their actions.
Lastly, I want to add that many feature developers might already be aware of common symptoms and mitigations during the development phase. Capturing these artifacts will help with common troubleshooting of the feature at a later point in time. Building a history or a database of such knowledge via simple bug tracking would immensely help, since troubleshooters often search the bug database to look for similar problems reported.
Another consideration is that the application maintain data structures exclusively for supportability. For example, if there is an enumeration of all the workers for a given component, their tasks, their objects and states, and if these can be queried in a pull operation independent of the method they are working on, it would be great. These pull operations could be invoked by views specific to runtime diagnostics, so they can be exposed via methods specific to management. These are different from logging in the sense that they are actually calls to the product to retrieve enhanced runtime information.
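As a hypothetical sketch of what such a pull surface could look like (none of these names come from any particular product):

using System.Collections.Generic;

public enum WorkerState { Idle, Running, Blocked, Faulted }

public class WorkerStatus
{
    public int WorkerId { get; set; }
    public string CurrentTask { get; set; }
    public WorkerState State { get; set; }
}

public interface ISupportabilityView
{
    // A pull operation, independent of whatever the workers are doing right now,
    // that a management or diagnostics view can call to snapshot the component.
    IReadOnlyCollection<WorkerStatus> GetWorkerSnapshot();
}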

Monday, February 24, 2014

I finally wrote a program that demonstrates high performance reading of message queues from MSMQ on Windows. No matter the number of queues and the number of messages on them, this program illustrates reading via IO completion ports. In brief, the application has a monitor that collects and configures the queues to be read
and creates a completion port for all the queues. Then it forks off threads that get notified on this completion port whenever messages arrive. The threads then exchange the data read from the messages with the monitor, which can file or dump the data away.
The completion port lets us specify multiple queues and tolerate any load.
The workers are spawned when the port is ready, and closing the port signals the threads to terminate. This is convenient for initialization and cleanup. Memory usage is limited to the copying of messages in transit and is consequently very small compared with the overall number of messages.
Secondly, the application allows any thread to service any message from the completion port. The messages are tied back to the queue names based on the overlapped key parameter that the threads set when reading a message. The threads know which queue handle the data is coming from, and when reading it they can flag the necessary queue so that the proper association can take place.
Another thing to note is that the task for all the threads is the same: simply to get notified on messages, read them, and post them to the monitor. This way there is no restriction on concurrency from the application's perspective. That said, the concurrency value is typically determined by the number of processors; since these are OS threads, we rely on what the OS suggests. We follow the recommendation to use a completion port, but the thread pool we use with the completion port is something we can tweak based on what works. Lastly, I wanted to mention that the properties we use for message queue receiving are determined by the application. While we can retrieve a large number of properties for each receive, we are typically interested in the message buffer and size, so we need to determine these application-chosen properties before we make Receive calls. The threads assume the structure of this context when receiving.
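This is not the original native completion-port program, but a managed analogue of the same multi-queue pattern using System.Messaging's asynchronous receive (queue paths are placeholders); the point it mirrors is that every completed read can be tied back to the queue it came from, much like the overlapped key does natively:

using System;
using System.Messaging;

class MultiQueueReader
{
    static void Main()
    {
        string[] paths = { @".\private$\orders", @".\private$\audit" };

        foreach (string path in paths)
        {
            var queue = new MessageQueue(path);
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });
            // The sender argument in the callback identifies the queue,
            // analogous to the overlapped key in the native version.
            queue.ReceiveCompleted += OnReceiveCompleted;
            queue.BeginReceive();
        }

        Console.ReadLine();   // keep the process alive while receives complete
    }

    static void OnReceiveCompleted(object sender, ReceiveCompletedEventArgs e)
    {
        var queue = (MessageQueue)sender;
        Message message = queue.EndReceive(e.AsyncResult);
        Console.WriteLine("{0}: {1}", queue.QueueName, message.Body);

        queue.BeginReceive();   // post the next asynchronous receive for this queue
    }
}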