Cluster computing

Friday, February 14, 2014

Thread windows style

I found this WINgnam style implementation on the web:


#ifndef _WIN_T_H_
#define _WIN_T_H_ 1

#include <iostream>
#include <cassert>
#include <memory>
#include <windows.h>
#include <process.h>

class Runnable {
public:
 virtual ~Runnable() = 0;
 virtual void* run() = 0;
};

class Thread {
public:
 Thread(std::auto_ptr<Runnable> run);
 Thread();
 virtual ~Thread();
 void start();
 void* join();
private:
 HANDLE hThread;
 unsigned wThreadID;
 // runnable object will be deleted automatically
 std::auto_ptr<Runnable> runnable;
 Thread(const Thread&);
 const Thread& operator=(const Thread&);
 // called when run() completes
 void setCompleted();
 // stores return value from run()
 void* result;
 virtual void* run() {return 0;}
 static unsigned WINAPI startThreadRunnable(LPVOID pVoid);
 static unsigned WINAPI startThread(LPVOID pVoid);
 void printError(LPSTR lpszFunction, LPSTR fileName, int lineNumber);
};

class simpleRunnable: public Runnable {
public:
 simpleRunnable(int ID) : myID(ID) {}
 virtual void* run() {
  std::cout << "Thread " << myID << " is running" << std::endl;
  return reinterpret_cast<void*>(myID);
 }
private:
 int myID;
};

class simpleThread: public Thread {
public:
 simpleThread(int ID) : myID(ID) {}
 virtual void* run() {
  std::cout << "Thread " << myID << " is running" << std::endl;
  return reinterpret_cast<void*>(myID);
 }
private:
 int myID;
};

#endif

Thursday, February 13, 2014

We look at some more inputs to Splunk today. The SDK offers the following input classes: TcpSplunkInput, UdpInput, WindowsActiveDirectoryInput, WindowsEventLogInput, WindowsPerfmonInput, WindowsRegistryInput and WindowsWmiInput. In addition, there are the ScriptInput and MonitorInput classes.
The Input class is the base class from which all specific inputs are derived. The InputCollection is a collection of inputs; its members are heterogeneous, and each member records its type of input. The type of input is identified by the InputKind class. The different InputKinds are monitor, script, tcp/raw, tcp/cooked, udp, ad, win-event-log-collections, registry, WinRegMon, win-wmi-collections.
The monitor input watches files, directories, scripts or network ports for new data. It has a blacklist, a whitelist and a crcSalt. The crcSalt is a string that Splunk adds to the cyclic redundancy check it computes over the beginning of a file, so that files that start with identical content are still treated as distinct.
As with all inputs, the corresponding Args classes are used to specify the arguments to the inputs.
A ScriptInput represents a scripted data input. There is no limit to the format or content of this kind of data. A TcpInput is for raw TCP data, that is, data captured directly off the wire without any additional application layer processing. Cooked TCP data, which has had that processing applied, is handled by the TcpSplunkInput class.
The UdpInput represents the UDP data input. Note that there is no separate class for cooked UDP input. Can you guess why? It has to do with sessions and application logic.
A WindowsActiveDirectoryInput reads directly from Active Directory. Since organizations secure their resources via Active Directory, this is the best input for learning the hierarchy of the organization.
The Windows event log input reads directly from the event sink for Windows. Windows itself and all user applications can generate event logs, and these are helpful in troubleshooting.
The Windows perfmon input reads performance monitoring data. This helps operations see the load on the server in terms of utilization of critical resources such as memory and CPU.
The Registry input is used to gather information on Windows registry keys and hives, where applications and Windows persist settings, state and configuration data between server reboots.
The WMI input is different from the other inputs in that WMI is used for server management by operations and is a different kind of data provider.
What we could perhaps add is an MSMQ input, since this has access to all the messages in Windows message queuing. These messages could be from active named private or public queues, dead letter queues, poison queues, and even journal queues, i.e. everything except the non-readable internal reserved queues. When journaling is turned on, we get access to all the messages historically, rather than only getting notifications as and when the messages arrive.

Wednesday, February 12, 2014

Splunk has an SDK for programmability in different languages including C#. The object model is granular to enable the same kind of functionality as with the web interfaces. We briefly enumerate a few here:
There's an application class that represents the locally installed Splunk app.
There's an application archive that represents the archive of a Splunk app.
Both application and application archive derive from the Entity class. The Entity class is the base class for all Splunk entities. The EntityMetadata class provides access to the metadata properties of a corresponding entity and can be instantiated with the static GetMetadata() method.
The application args class extends Args for application creation properties. The Args class is a helper class so that the Splunk REST APIs can be called with key-value pair arguments.
ApplicationSetup class represents the setup information for a Splunk app.

The BaseService functionality is common to both Splunk Enterprise and Splunk Storm. The ConfCollection represents the collection of configuration options.

Alerts are represented by FiredAlert class and their groupings - FiredAlertGroup and FiredAlertGroupCollection.

The HttpService class is for the web access and uses both http and https protocols.
The Index class represents the Splunk DB/Index object. Index also comes with corresponding IndexArgs.
The Job class represents a search Job and comes with its own JobArgs, JobEventArgs, JobExportArgs, JobResultsArgs, and JobResultsPreviewArgs. The Message and MessageArgs and MessageCollection are used to represent Splunk messages.
The ResultsReaderJson and ResultsReaderXML are specific derivations of the abstract ResultsReader class used for reading search results.

The MonitorInput class represents a monitor input, which is a file, directory, script or network port, and soon to include Windows message queuing. These are monitored for new data.
The WindowsActiveDirectoryInput, WindowsEventLogInput, WindowsPerfmonInput, WindowsRegistryInput and WindowsWmiInput classes represent the corresponding Windows data inputs.

The Receiver class exposes methods to send events to Splunk via the simple or streaming receiver endpoint.

Tuesday, February 11, 2014

Splunk monitors machine data. There are some concepts specific to Splunk. We briefly review these now. Index-time processing: Splunk reads data from a source such as a file or a port and classifies that source into a source type. It breaks the data into events that consist of single or multiple lines and writes each event into an index on disk, for later retrieval with a search.
When a search starts, events are retrieved and classified based on event types, and the matching events are transformed to generate reports and displayed on dashboards.
By default, events go into the main index unless another index has been created and specified.
Fields are searchable name/value pairings in event data - usually the default fields are host, source and sourcetype. Tags are aliases to field values. Event types are dynamic tags attached to an event. Saved Splunk objects such as savedsearches, eventtypes, reports and tags are not only persisted but also secured with permissions based on users and roles. Thus events are enriched before indexing. When sets of events are grouped into one, they are called transactions.
Apps are collections of Splunk configurations, objects and code. They help the user organize targeted activities.
Splunk instances can work in three different modes - forwarder, indexer and search head. A forwarder is a version of Splunk that allows you to send data to a central Splunk indexer or a group of indexers. An indexer provides indexing capability for local and remote data. An indexer is usually added for every 50-100 GB per day depending on search load. A Splunk search head is typically added for every 10-20 active users depending on searches.
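As a rough illustration of the sizing arithmetic above, the sketch below turns a daily volume and a user count into instance counts. The names SizingEstimate and estimateSizing are hypothetical, and the per-instance capacities are inputs the caller picks from the ranges quoted above (50-100 GB per day per indexer, 10-20 active users per search head). It is only back-of-the-envelope arithmetic, not an official sizing formula.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical result type: how many instances of each role to plan for.
struct SizingEstimate {
    int indexers;
    int searchHeads;
};

// Round up so that a partial load still gets an instance, and never go below one.
SizingEstimate estimateSizing(double dailyGb, int activeUsers,
                              double gbPerIndexer, int usersPerSearchHead) {
    assert(gbPerIndexer > 0 && usersPerSearchHead > 0);
    int indexers = static_cast<int>(std::ceil(dailyGb / gbPerIndexer));
    int searchHeads = static_cast<int>(
        std::ceil(static_cast<double>(activeUsers) / usersPerSearchHead));
    return {std::max(indexers, 1), std::max(searchHeads, 1)};
}
```

For example, 250 GB per day with 45 active users gives 5 indexers and 5 search heads at the conservative end of the ranges (50 GB, 10 users), and 3 of each at the generous end (100 GB, 20 users).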
One of the sources for machine data is message queuing. This is particularly interesting because message queues are increasingly being used as the backbone of logging architectures for applications. Subscribing to these message queues is a good way to debug problems in complex applications. As the Exploring Splunk book mentions, you can see exactly what the next component down the chain received from the prior component. However, the depth of support for subscribing to message queues varies across operating systems. At the very least, transactional and non-transactional message queues on Windows could be supported directly out of the box.

Monday, February 10, 2014

We return to our discussion on Splunk commands. The eval command calculates an expression and puts the resulting value into a field. Eval recognizes functions such as:
abs(X), case(X,"Y",...), ceil(X), cidrmatch("X",Y) which identifies IP addresses that belong to a subnet, coalesce(X,...) which returns the first value that is not null, exact(X) which uses double precision, exp(X), floor(X), if(X,Y,Z), isbool(X), isint(X), isnotnull(X), isnull(X), isnum(X), isstr(), len(), like(X,"Y"), ln(X), log(X,Y), lower(X), ltrim(X,Y), match(X,Y) which matches a regex pattern, max(X,...), md5(X) which gives an MD5 hash, min(X,...) which returns the minimum, mvcount(X) which returns the number of values of X, mvfilter(X) which filters a multivalued field, mvjoin(X,Y) which joins the values of a field using the specified delimiter, now() which gives the current time, null(), nullif(), pi(), pow(X,Y), random(), relative_time(X,Y), replace(X,Y,Z), round(X,Y), rtrim(X,Y), searchmatch(X), split(X,"Y"), sqrt(X), strftime(X,Y) which returns the time in the specified format, strptime() which parses a time from a string, substr, time, tonumber, tostring, trim, typeof(X) which returns a string representation of its type, upper(X), urldecode(X) and validate(X,Y,...).
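To make the idea behind cidrmatch concrete, here is a minimal sketch of a subnet membership test for IPv4. It is not Splunk's implementation; the function names parseIPv4, prefixMask and cidrMatch are made up for this illustration, and only dotted-quad IPv4 addresses are assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Parse a dotted-quad IPv4 address such as "10.1.2.3" into a 32-bit value.
bool parseIPv4(const std::string& text, std::uint32_t& out) {
    std::istringstream in(text);
    std::uint32_t value = 0;
    for (int i = 0; i < 4; ++i) {
        int octet = -1;
        if (!(in >> octet) || octet < 0 || octet > 255) return false;
        value = (value << 8) | static_cast<std::uint32_t>(octet);
        if (i < 3) {
            char dot = 0;
            if (!(in >> dot) || dot != '.') return false;
        }
    }
    char extra = 0;
    if (in >> extra) return false;  // reject trailing characters
    out = value;
    return true;
}

// Build a mask with the top `prefix` bits set.
std::uint32_t prefixMask(int prefix) {
    assert(prefix >= 0 && prefix <= 32);
    // Shifting a 32-bit value by 32 is undefined, so /0 is handled separately.
    return prefix == 0 ? 0 : ~std::uint32_t(0) << (32 - prefix);
}

// True when `ip` lies inside the subnet written as "a.b.c.d/prefix".
bool cidrMatch(const std::string& cidr, const std::string& ip) {
    const std::size_t slash = cidr.find('/');
    if (slash == std::string::npos) return false;
    std::uint32_t network = 0, address = 0;
    if (!parseIPv4(cidr.substr(0, slash), network)) return false;
    if (!parseIPv4(ip, address)) return false;
    int prefix = -1;
    std::istringstream bits(cidr.substr(slash + 1));
    if (!(bits >> prefix) || prefix < 0 || prefix > 32) return false;
    const std::uint32_t mask = prefixMask(prefix);
    return (network & mask) == (address & mask);
}
```

With this sketch, cidrMatch("10.0.0.0/8", "10.1.2.3") is true because the first 8 bits agree, while cidrMatch("192.168.1.0/24", "192.168.2.7") is false.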
Common stats functions include avg, count, dc that returns the count of distinct values, first, last, list, max, median, min, mode, perc<X>(Y) that returns the Xth percentile of Y, range that returns the difference between the max and min values, stdev that returns the sample standard deviation, stdevp that returns the population standard deviation, sum, sumsq, values and var that returns the variance of X.
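The difference between stdev and stdevp is easy to lose in a list like this, so here is a small sketch of the arithmetic. The function names are my own, not Splunk's; the only difference between the two deviations is the divisor, n - 1 for a sample and n for a population.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

double mean(const std::vector<double>& xs) {
    assert(!xs.empty());
    double total = 0.0;
    for (double x : xs) total += x;
    return total / xs.size();
}

// Sum of (x - mean)^2, shared by both standard deviations.
double sumOfSquaredDeviations(const std::vector<double>& xs) {
    const double m = mean(xs);
    double total = 0.0;
    for (double x : xs) total += (x - m) * (x - m);
    return total;
}

// Corresponds to stdevp: divide by n.
double stdevPopulation(const std::vector<double>& xs) {
    return std::sqrt(sumOfSquaredDeviations(xs) / xs.size());
}

// Corresponds to stdev: divide by n - 1, so at least two values are needed.
double stdevSample(const std::vector<double>& xs) {
    assert(xs.size() >= 2);
    return std::sqrt(sumOfSquaredDeviations(xs) / (xs.size() - 1));
}

// Corresponds to range: difference between the largest and smallest value.
double valueRange(const std::vector<double>& xs) {
    assert(!xs.empty());
    auto extremes = std::minmax_element(xs.begin(), xs.end());
    return *extremes.second - *extremes.first;
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the squared deviations sum to 32, so the population deviation is sqrt(32/8) = 2, the sample deviation is sqrt(32/7), and the range is 7.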
Regular expressions can be specified with the following meta characters:
\s for white space, \S for not white space, \d for digit, \D for not digit, \w for word character, \W for not a word character, [...] for any included character, [^...] for no included character, * for zero or more, + for one or more, ? for zero or one, | for or, (?P<var>...) for named extraction such as for an SSN, (?:...) for logical grouping, ^ for start of line, $ for end of line, {...} for number of repetitions, \ for escape, (?=...) for lookahead, and (?!...) for negative lookahead. The same regular expression can be specified in more than one way, but the parser will attempt to simplify or expand it to cover all the cases the pattern applies to. For example, a repetition of x 2 to 5 times, x{2,5}, will be rewritten as xx(x(x(x)?)?)?
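A quick way to convince ourselves that the two forms of the repetition are equivalent is to run both against the same inputs. The sketch below uses std::regex from the C++ standard library with its default ECMAScript grammar, which is an assumption on my part and not what Splunk uses; that grammar supports \d, {m,n} and lookahead, but not the (?P<var>...) named extraction, so the example leaves that out. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// True when the whole of `text` matches `pattern`.
bool fullMatch(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// True when `pattern` matches somewhere inside `text`.
bool containsMatch(const std::string& pattern, const std::string& text) {
    return std::regex_search(text, std::regex(pattern));
}

// Compare two patterns on runs of 'x' of length 0..maxLen.
bool sameOnRuns(const std::string& a, const std::string& b, std::size_t maxLen) {
    const std::regex ra(a), rb(b);
    for (std::size_t n = 0; n <= maxLen; ++n) {
        const std::string run(n, 'x');
        if (std::regex_match(run, ra) != std::regex_match(run, rb)) return false;
    }
    return true;
}
```

Here sameOnRuns("x{2,5}", "xx(x(x(x)?)?)?", 8) holds: both patterns accept exactly the runs of two to five x characters. Likewise containsMatch("foo(?!bar)", "foobar") is false while the same pattern finds a match in "foobaz", which shows the negative lookahead at work.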

Sunday, February 9, 2014

In regular expression search, matches can be partial or full. In order that we process all the groups in a full match, we iterate multiple times.
In either case, there are a few approaches to pattern matching.
The first is backtracking.
The second is finite automata.
The third approach is RE2.
Often production implementations mix and match more than one of the approaches above.
We discuss these briefly here.
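To illustrate the difference between partial and full matches, and the iteration over groups mentioned above, here is a short sketch using std::regex. The choice of std::regex and the helper names are assumptions for the example; other engines expose the same ideas under different names.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Full match: the pattern must account for the entire text.
bool isFullMatch(const std::string& text, const std::regex& re) {
    return std::regex_match(text, re);
}

// Partial match: the pattern only needs to match somewhere in the text.
bool hasPartialMatch(const std::string& text, const std::regex& re) {
    return std::regex_search(text, re);
}

// Iterate over every non-overlapping match and collect its groups.
// Group 0 is the whole match; groups 1..n are the parenthesized parts.
std::vector<std::vector<std::string>> allMatches(const std::string& text,
                                                 const std::regex& re) {
    std::vector<std::vector<std::string>> found;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        std::vector<std::string> groups;
        for (std::size_t g = 0; g < it->size(); ++g) {
            groups.push_back((*it)[g].str());
        }
        found.push_back(groups);
    }
    return found;
}
```

With the pattern (\w+)=(\d+) and the text "a=1, bb=22", there is a partial match but no full match, and allMatches returns two entries: {"a=1", "a", "1"} and {"bb=22", "bb", "22"}.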

Kernighan and Pike mention a very fast and simple backtracking approach. Matches are determined one position at a time. Positions need not be contiguous. If a match occurs at a position, the remainder of the pattern is checked against the rest of the string. A recursive function helps here, and the recursion can be nested up to the length of the pattern. The pattern is interpreted based on the literals and metacharacters provided. The empty string can match with an asterisk. A separate recursive method may exist for the asterisk wild card character since it is matched in one of three different ways, depending on the parameters to the function.
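The matcher described above can be sketched as follows. This is my adaptation of the well-known matcher from Kernighan and Pike's The Practice of Programming, lightly renamed and changed to return bool; it only understands literal characters, '.', '*', '^' and '$', and it is meant as an illustration of the backtracking idea rather than a production engine.

```cpp
#include <cassert>

bool matchHere(const char* regexp, const char* text);

// matchStar: search for c*regexp at the beginning of text.
// A star matches zero or more instances, so try the rest of the
// pattern after consuming 0, 1, 2, ... copies of c.
bool matchStar(int c, const char* regexp, const char* text) {
    do {
        if (matchHere(regexp, text)) return true;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return false;
}

// matchHere: search for regexp at the beginning of text.
bool matchHere(const char* regexp, const char* text) {
    if (regexp[0] == '\0') return true;              // pattern exhausted
    if (regexp[1] == '*') return matchStar(regexp[0], regexp + 2, text);
    if (regexp[0] == '$' && regexp[1] == '\0') return *text == '\0';
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text)) {
        return matchHere(regexp + 1, text + 1);      // one character matched
    }
    return false;
}

// match: search for regexp anywhere in text.
bool match(const char* regexp, const char* text) {
    if (regexp[0] == '^') return matchHere(regexp + 1, text);
    do {  // must look even if the text is empty
        if (matchHere(regexp, text)) return true;
    } while (*text++ != '\0');
    return false;
}
```

For instance, match("x.z", "wxyz") succeeds after the first position fails and the search moves one character along, and match("a*", "") succeeds because the asterisk accepts zero occurrences.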

RE2 is an alternative to the backtracking engines used by PCRE, Perl and Python. Backtracking can support a wide variety of features, but it can take exponential time on some inputs. RE2 is efficient in that it uses automata theory, and it still supports the leftmost-first match that those engines give. This approach has the following steps:
Step 1: Parse (Walking a Regexp)
Step 2: Simplify
Step 3: Compile
Step 4: Match

These steps are explained below:

The parser could be simple, but there's a variety of regular expressions that can be specified. Some parsers maintain an explicit stack instead of using recursion. Some parsers allow a single linear scan by keeping reverse Polish notation. Others use trees. It's possible to parse a regular expression as a decision tree on the specified input text to see if any of the substrings at a position match the pattern. I take that back, we could do with a graph of predefined operations instead. These operations are concatenation, repetition and alternation.
The next step is to simplify. The purpose of simplifying is to reduce the memory footprint.
The result of the next step, i.e. compilation, is an instruction graph that makes pattern matching easy. This way the parser doesn't need to be invoked again and again.
Lastly, the matching step can be either partial or full.
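As a small illustration of the simplify step, the sketch below rewrites a counted repetition into plain concatenation and nested optional groups, the same shape as the x{2,5} example from the earlier post. The function name expandRepeat is hypothetical, it works on pattern strings rather than on a parsed tree, and it assumes the atom is a single character or an already parenthesized group.

```cpp
#include <cassert>
#include <string>

// Expand atom{min,max} into `min` mandatory copies followed by
// (max - min) nested optional copies.
std::string expandRepeat(const std::string& atom, unsigned min, unsigned max) {
    assert(min <= max);
    std::string out;
    for (unsigned i = 0; i < min; ++i) out += atom;        // mandatory copies
    const unsigned extra = max - min;
    for (unsigned i = 0; i < extra; ++i) out += "(" + atom; // open nested groups
    for (unsigned i = 0; i < extra; ++i) out += ")?";       // close each as optional
    return out;
}
```

Calling expandRepeat("x", 2, 5) yields xx(x(x(x)?)?)?. Nesting the optional copies means that once one optional x fails to match, the matcher does not need to try the remaining ones.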

Saturday, February 8, 2014

In the previous post, we talked about regular expressions. Their implementation is quite interesting. I would like to cover that in a subsequent post. Meanwhile I wanted to mention the different forms a regular expression can take. The given pattern can have wild card and meta characters. This means the pattern can match a variety of forms. Typically we need one expanded form that will match most patterns.
By expanding the regular expression, I mean that we use a form which has all the different possible groups enumerated. By expanding the groups, we attempt to cover all the patterns that were intended to be matched. Recall how in our final desired output, we order the results based on groups and captures. Expanding the pattern helps us with this enumeration.
There are many editors for regular expressions, including Regular Expression Buddy, which lets users try out their expressions with sample data in an interactive manner. The idea is that some experimentation may be needed to decide which regular expression best suits the need, so that there is little surprise between the desired and actual output. Although Regular Expression Buddy is downloadable software, there are web applications such as regexpr that we can try out for the same purpose. The user interface allows you to see the results interactively and visually, with highlighted matches.