Cluster computing

Saturday, February 15, 2014

What's a good way to scale the concurrent processing of a queue on both the producer and the consumer side? This seems like a textbook question, but think about cross-platform support and high performance. Maybe narrowing the question down to the Windows platform would help. Either way, it comes down to the number of concurrent workers we can have for a queue. The general rule of thumb is to have one more thread than the number of processors, to keep every processor busy. And if the worker is lightweight, without the overhead of thread-local storage (TLS), we could scale to as many as we want, because the virtual workers can share the same pool of physical threads. I'm not talking about fibers, which don't have TLS of their own either. Fibers are definitely welcome over OS threads, but I'm looking for a way to parallelize as much as we can in terms of the number of concurrent calls on the same system.
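To make the rule of thumb concrete, here is a minimal sketch of a queue drained by a fixed pool of workers. It uses portable std::thread rather than any Windows API, and the names WorkQueue, default_worker_count and sum_with_workers are made up for this illustration. It assumes the items are plain ints and that the producer closes the queue when it is done.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical helper: one more worker than the number of processors.
inline std::size_t default_worker_count() {
    unsigned n = std::thread::hardware_concurrency(); // may be 0 if unknown
    return (n == 0 ? 1u : n) + 1u;
}

// A minimal blocking queue shared by the producer and the consumers.
class WorkQueue {
public:
    void push(int item) {
        {
            std::lock_guard<std::mutex> lock(m_);
            assert(!closed_ && "push after close");
            q_.push(item);
        }
        cv_.notify_one();
    }
    // Signals that no more items will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(m_);
            closed_ = true;
        }
        cv_.notify_all();
    }
    // Blocks until an item is available; returns false once closed and drained.
    bool pop(int& out) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return closed_ || !q_.empty(); });
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
    bool closed_ = false;
};

// Produces 1..item_count and lets `workers` threads consume and sum the items.
inline long long sum_with_workers(int item_count, std::size_t workers) {
    WorkQueue queue;
    std::vector<long long> partial(workers, 0); // one result slot per worker
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&queue, &partial, w] {
            int item;
            while (queue.pop(item)) partial[w] += item;
        });
    }
    for (int i = 1; i <= item_count; ++i) queue.push(i);
    queue.close();
    long long total = 0;
    for (std::size_t w = 0; w < workers; ++w) {
        pool[w].join();
        total += partial[w];
    }
    return total;
}
```

Calling sum_with_workers(1000, default_worker_count()) should give the same total no matter how many workers drain the queue, since each worker only writes to its own slot.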
In addition, we want inter-worker communication to be both fail-safe and reliable. The OS provides mechanisms to wait for thread completion using the handle returned from CreateThread, and on Windows there is the completion port, which can be used with multiple threads. The threads can then exit when the port is closed.
Maybe the best way to do this would be to keep a pool of threads and partition the tasks among them instead of time-slicing. Time-slicing or sharing does not really improve concurrent progress when the tasks are already partitioned.
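As a rough sketch of what partitioning buys us, each worker below gets one contiguous, non-overlapping slice of the data and its own result slot, so the workers never wait on each other. The function name partitioned_sum is hypothetical and std::thread stands in for whatever pool we would really use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data` by giving each worker one contiguous, non-overlapping slice.
inline long long partitioned_sum(const std::vector<int>& data, std::size_t workers) {
    assert(workers > 0);
    std::vector<long long> partial(workers, 0);
    std::vector<std::thread> pool;
    const std::size_t chunk = (data.size() + workers - 1) / workers; // ceiling division
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(data.size(), w * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        pool.emplace_back([&data, &partial, w, begin, end] {
            // Each worker reads only its slice and writes only its own slot.
            partial[w] = std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
        });
    }
    for (auto& t : pool) t.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

With the values 1 through 10 and three workers, the slices hold 4, 4 and 2 elements, and the partial sums are combined only after every worker has joined.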
The .NET Task Parallel Library enables both declarative parallel queries and imperative parallel algorithms. By declarative we mean we can use notations such as 'AsParallel' to indicate that we want routines to be executed in parallel. By imperative we mean we can use partitions, permutations and combinations with linear data.
In general a worker helps with parallelization when it has little or no communication and works on isolated data.
I want to mention diagnostics and boost. Diagnostics on a worker's activity, whether to detect hangs or to identify one worker among a set of workers, are enabled by such things as logging or tracing and identifiers for workers. Call-level logging and tracing enable detection of activity by a worker. Within a set of workers, IDs can tell one worker apart from the rest. Logging can include this ID to detect the worker with a problem activity.
There can also be a dedicated activity or worker to monitor others.
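One way such a monitor could work is with heartbeats: every worker records when it last made progress, and the monitor flags the ones that have gone quiet. This is only a sketch under that assumption. WorkerStatus, log_line and find_stalled are hypothetical names, and the log format is made up.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical record that a worker updates whenever it makes progress.
struct WorkerStatus {
    int id;                       // identifies the worker in logs
    Clock::time_point last_beat;  // time of the most recent heartbeat
};

// Formats a log line that carries the worker ID (the format is an assumption).
inline std::string log_line(int worker_id, const std::string& message) {
    return "[worker " + std::to_string(worker_id) + "] " + message;
}

// Returns the IDs of workers whose last heartbeat is older than `timeout`.
inline std::vector<int> find_stalled(const std::vector<WorkerStatus>& workers,
                                     Clock::time_point now,
                                     Clock::duration timeout) {
    assert(timeout > Clock::duration::zero());
    std::vector<int> stalled;
    for (const auto& w : workers) {
        if (now - w.last_beat > timeout) stalled.push_back(w.id);
    }
    return stalled;
}
```

Passing `now` in explicitly keeps the check easy to test. A real monitor would call Clock::now() on a timer and would need to synchronize access to the heartbeat records.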
Boosting a worker's performance comes down to CPU speed and memory. Both are variables that depend on hardware. Memory and caches are very helpful in improving the performance of the same activity by a worker.

Friday, February 14, 2014

Thread windows style

I found this Windows-style thread implementation on the web:


#ifndef _WIN_T_H_
#define _WIN_T_H_ 1

#include <iostream>
#include <cassert>
#include <memory>
#include <windows.h>
#include <process.h>

class Runnable {
public:
 virtual void* run() = 0;
 virtual ~Runnable() = 0;
};

class Thread {
public:
 Thread(std::auto_ptr<Runnable> run);
 Thread();
 virtual ~Thread();
 void start();
 void* join();
private:
 HANDLE hThread;
 unsigned wThreadID;
 // runnable object will be deleted automatically
 std::auto_ptr<Runnable> runnable;
 Thread(const Thread&);
 const Thread& operator=(const Thread&);
 // called when run() completes
 void setCompleted();
 // stores return value from run()
 void* result;
 virtual void* run() {return 0;}
 static unsigned WINAPI startThreadRunnable(LPVOID pVoid);
 static unsigned WINAPI startThread(LPVOID pVoid);
 void printError(LPSTR lpszFunction, LPSTR fileName, int lineNumber);
};

class simpleRunnable: public Runnable {
public:
 simpleRunnable(int ID) : myID(ID) {}
 virtual void* run() {
  std::cout << "Thread " << myID << " is running" << std::endl;
  return reinterpret_cast<void*>(myID);
 }
private:
 int myID;
};

class simpleThread: public Thread {
public:
 simpleThread(int ID) : myID(ID) {}
 virtual void* run() {
  std::cout << "Thread " << myID << " is running" << std::endl;
  return reinterpret_cast<void*>(myID);
 }
private:
 int myID;
};

#endif
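The header above only declares the classes, and their definitions are not shown. To see the same Runnable/Thread shape actually run, here is a portable sketch that swaps the Win32 handle for std::thread and std::auto_ptr for std::unique_ptr. It is not the implementation that goes with the header. The portable namespace and the class bodies are my own assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <thread>
#include <utility>

namespace portable {

// Same role as the Runnable interface in the header above.
class Runnable {
public:
    virtual ~Runnable() = default;
    virtual void* run() = 0;
};

// Owns a Runnable and executes its run() on a separate thread.
class Thread {
public:
    explicit Thread(std::unique_ptr<Runnable> r) : runnable_(std::move(r)) {
        assert(runnable_ && "a runnable is required");
    }
    Thread(const Thread&) = delete;
    Thread& operator=(const Thread&) = delete;
    ~Thread() {
        if (worker_.joinable()) worker_.join();
    }

    void start() {
        assert(!worker_.joinable() && "start() called while already running");
        worker_ = std::thread([this] { result_ = runnable_->run(); });
    }

    // Waits for run() to finish and returns the value it produced.
    void* join() {
        if (worker_.joinable()) worker_.join();
        return result_;
    }

private:
    std::unique_ptr<Runnable> runnable_; // deleted automatically with the Thread
    std::thread worker_;
    void* result_ = nullptr;             // stores the return value from run()
};

// Returns its ID packed into the void* result, like simpleRunnable above.
class SimpleRunnable : public Runnable {
public:
    explicit SimpleRunnable(int id) : id_(id) {}
    void* run() override {
        return reinterpret_cast<void*>(static_cast<std::intptr_t>(id_));
    }
private:
    int id_;
};

} // namespace portable
```

A caller would construct a portable::Thread around a SimpleRunnable, call start(), and then read the ID back from join(). The cast through std::intptr_t avoids the int-to-pointer size mismatch on 64-bit builds.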

Thursday, February 13, 2014

We look at some more inputs to Splunk today. The SDK offers the following: the TcpSplunkInput class, UdpInput class, WindowsActiveDirectoryInput class, WindowsEventLogInput class, WindowsPerfmonInput class, WindowsRegistryInput class and WindowsWmiInput class. In addition, there are the ScriptInput and MonitorInput classes.
The Input class is the Splunk input class from which all specific inputs are derived. The InputCollection is a collection of inputs, and it has heterogeneous members, with each member mentioning its type of input. The type of input is identified by the InputKind class. The different InputKinds are monitor, script, tcp/raw, tcp/cooked, udp, ad, win-event-log-collections, registry, WinRegMon and win-wmi-collections.
The monitor input monitors files, directories, scripts or network ports for new data. It has a blacklist, a whitelist and a crcSalt. A crcSalt is a string that Splunk adds to the cyclic redundancy check it computes for a file, so that files whose checksums would otherwise match are treated as distinct.
As with all inputs, the corresponding Args class is used to specify the arguments to the input.
A ScriptInput represents a scripted data input. There is no limit to the format or content of this kind of data. A TcpInput is for raw TCP data, as captured directly over the wire without any additional application-layer processing. Data that has had such processing (cooked data) is handled by the TcpSplunkInput class.
The UdpInput represents the UDP data input. Note that there is no separate class for cooked UDP input. Can you guess why? It has to do with sessions and application logic.
A WindowsActiveDirectoryInput reads directly from the Active Directory. Since organizations secure their resources via the ActiveDirectory, this is the best input to know the hierarchy of the organization.
The Windows event log input reads directly from the event sink for Windows. All Windows and user applications can generate event logs, and these are helpful in troubleshooting.
The Windows perfmon input reads performance monitoring data, which helps operations see the load on the server in terms of utilization of critical resources such as memory and CPU.
The Registry input is used to gather information on Windows registry keys and hives, where applications and Windows persist settings, state and configuration data between server reboots.
The WMI input is different from other inputs in that WMI is used for server management by operations and is a different kind of data provider.
What we could add is perhaps an MSMQ input, since this has access to all the messages in Windows message queuing. These messages could be from active named private or public queues, dead-letter queues, poison queues and even journal queues, i.e. everything except the non-readable internal reserved queues. When journaling is turned on, we get access to all the messages historically, versus getting notifications as and when the messages arrive.

Wednesday, February 12, 2014

Splunk has an SDK for programmability in different languages including C#. The object model is granular to enable the same kind of functionality as with the web interfaces. We briefly enumerate a few here:
There's an application class that represents the locally installed Splunk app.
There's an application archive that represents the archive of a Splunk app.
Both application and application archive derive from the Entity class. The Entity class represents the base class for all Splunk entities. The EntityMetadata class provides access to the metadata properties of a corresponding entity and can be instantiated with the static GetMetadata() method.
The application args class extends Args for application creation properties. The Args class is a helper class so that the Splunk REST APIs can be called with key-value pair arguments.
ApplicationSetup class represents the setup information for a Splunk app.

The BaseService functionality is common to both Splunk Enterprise and Splunk Storm. The ConfCollection represents the collection of configuration options.

Alerts are represented by FiredAlert class and their groupings - FiredAlertGroup and FiredAlertGroupCollection.

The HttpService class is for the web access and uses both http and https protocols.
The Index class represents the Splunk DB/Index object. Index also comes with corresponding IndexArgs.
The Job class represents a search Job and comes with its own JobArgs, JobEventArgs, JobExportArgs, JobResultsArgs, and JobResultsPreviewArgs. The Message and MessageArgs and MessageCollection are used to represent Splunk messages.
The ResultsReaderJson and ResultsReaderXML are specific derivations of the abstract ResultsReader class used for reading search results.

The MonitorInput class represents a monitor input, which is a file, directory, script or network port, and is soon to include Windows message queuing. These are monitored for new data.
The WindowsActiveDirectoryInput, WindowsEventLogInput, WindowsPerfmonInput, WindowsRegistryInput and WindowsWmiInput classes represent the corresponding data inputs.

The Receiver class exposes methods to send events to Splunk via the simple or streaming receiver endpoint.

Tuesday, February 11, 2014

Splunk monitors machine data. There are some concepts specific to Splunk, and we briefly review these now. Index-time processing: Splunk reads data from a source such as a file or a port and classifies that source into a source type. It breaks the data into events that consist of single or multiple lines and writes each event into an index on disk for later retrieval with a search.
When a search starts, events are retrieved and classified based on event types, and the matching events are transformed to generate reports and displayed on dashboards.
By default, events go into the main index unless another index is created and specified.
Fields are searchable name/value pairings in event data - usually the default fields are host, source and sourcetype. Tags are aliases to field values. Event types are dynamic tags attached to an event. Saved Splunk objects such as savedsearches, eventtypes, reports and tags are not only persisted but also secured with permissions based on users and roles. Thus events are enriched before indexing. When sets of events are grouped into one, they are called transactions.
Apps are a collection of Splunk configurations, objects and code. They help the user organize targeted activities.
Splunk instances can work in three different modes - forwarder, indexer and search head. A forwarder is a version of Splunk that allows you to send data to a central Splunk indexer or a group of indexers. An indexer provides indexing capability for local and remote data. An indexer is usually added for every 50-100 GB per day depending on search load. A Splunk search head is typically added for every 10-20 active users depending on searches.
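The sizing rules above are just ceiling divisions, which a small sketch can make explicit. The function names are hypothetical, and the per-indexer and per-search-head capacities are passed in by the caller because the text only gives them as ranges.

```cpp
#include <cassert>
#include <cmath>

// Indexers needed for a daily volume, given an assumed capacity per indexer
// (the text suggests somewhere between 50 and 100 GB per day).
inline int indexers_needed(double daily_gb, double gb_per_indexer) {
    assert(daily_gb >= 0 && gb_per_indexer > 0);
    return static_cast<int>(std::ceil(daily_gb / gb_per_indexer));
}

// Search heads needed, given an assumed number of active users per head
// (the text suggests somewhere between 10 and 20).
inline int search_heads_needed(int active_users, int users_per_head) {
    assert(active_users >= 0 && users_per_head > 0);
    return (active_users + users_per_head - 1) / users_per_head; // ceiling division
}
```

For 400 GB per day this gives between 4 indexers (at 100 GB each) and 8 indexers (at 50 GB each), and 45 active users need between 3 and 5 search heads. The actual numbers depend on search load, as noted above.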
One of the sources for machine data is message queuing. This is particularly interesting because message queues are increasingly being used as the backbone of logging architectures for applications. Subscribing to these message queues is a good way to debug problems in complex applications. As the Exploring Splunk book mentions, you can see exactly what the next component down the chain received from the prior component. However, the depth of support for subscribing to message queues varies across operating systems. At the very least, transactional and non-transactional message queues on Windows could be supported directly out of the box.

Monday, February 10, 2014

We return to our discussion on Splunk commands. The eval command calculates an expression and puts the resulting value into a field. Eval recognizes functions such as:
abs(X), case(X,"Y",...), ceil(X), cidrmatch("X",Y), which identifies IP addresses that belong to a subnet, coalesce(X,...), which returns the first value that is not null, exact(X), which uses double precision, exp(X), floor(X), if(X,Y,Z), isbool(X), isint(X), isnotnull(X), isnull(X), isnum(X), isstr(X), len(X), like(X,"Y"), ln(X), log(X,Y), lower(X), ltrim(X,Y), match(X,Y), which matches a regex pattern, max(X,...), md5(X), which gives an MD5 hash, min(X,...), which returns the minimum, mvcount(X), which returns the number of values of X, mvfilter(X), which filters a multivalued field, mvjoin(X,Y), which joins the values of a field using the specified delimiter, now(), which gives the current time, null(), nullif(X,Y), pi(), pow(X,Y), random(), relative_time(X,Y), replace(X,Y,Z), round(X,Y), rtrim(X,Y), searchmatch(X), split(X,"Y"), sqrt(X), strftime(X,Y), which returns the time formatted as specified, strptime(X,Y), which parses a time from a string, substr(X,Y,Z), time(), tonumber(X,Y), tostring(X,Y), trim(X,Y), typeof(X), which returns a string representation of its type, upper(X), urldecode(X) and validate(X,Y,...).
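To show what a subnet test like cidrmatch has to do, here is a small IPv4-only sketch in C++. It is not Splunk's implementation. The names parse_ipv4 and cidr_match are hypothetical, and the sketch simply masks both addresses with the prefix length and compares the results.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Parses dotted-quad IPv4 text into a 32-bit value; returns false on bad input.
inline bool parse_ipv4(const std::string& text, std::uint32_t& out) {
    std::istringstream in(text);
    std::uint32_t value = 0;
    for (int i = 0; i < 4; ++i) {
        int octet = -1;
        char dot = 0;
        if (!(in >> octet) || octet < 0 || octet > 255) return false;
        if (i < 3 && (!(in >> dot) || dot != '.')) return false;
        value = (value << 8) | static_cast<std::uint32_t>(octet);
    }
    out = value;
    // Reject anything left over after the fourth octet.
    return in.peek() == std::istringstream::traits_type::eof();
}

// Returns true if `ip` lies inside `cidr`, for example "10.0.0.0/8".
inline bool cidr_match(const std::string& cidr, const std::string& ip) {
    const std::size_t slash = cidr.find('/');
    if (slash == std::string::npos) return false;
    std::uint32_t network = 0, address = 0;
    if (!parse_ipv4(cidr.substr(0, slash), network)) return false;
    if (!parse_ipv4(ip, address)) return false;
    int prefix = -1;
    std::istringstream in(cidr.substr(slash + 1));
    if (!(in >> prefix) || prefix < 0 || prefix > 32) return false;
    assert(prefix >= 0 && prefix <= 32); // validated above; guards the shift below
    // A /0 prefix matches everything; shifting a 32-bit value by 32 is undefined.
    const std::uint32_t mask =
        (prefix == 0) ? 0u : (~std::uint32_t{0} << (32 - prefix));
    return (network & mask) == (address & mask);
}
```

The argument order follows cidrmatch("X",Y), with the subnet first and the address second, so cidr_match("10.0.0.0/8", "10.1.2.3") is true and cidr_match("192.168.1.0/24", "192.168.2.5") is false.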
Common stats functions include avg, count, dc, which returns the count of distinct values, first, last, list, max, median, min, mode, perc<X>(Y), which returns the X-th percentile of Y, range, which returns the difference between the max and min values, stdev, which returns the sample standard deviation, stdevp, which returns the population standard deviation, sum, sumsq, values, and var, which returns the variance of X.
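The difference between stdev and stdevp is only the divisor: N-1 for a sample, N for a population. A short C++ sketch of these, plus range, under the usual textbook definitions. The function names are mine, not Splunk's.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

inline double mean(const std::vector<double>& xs) {
    assert(!xs.empty());
    double sum = 0;
    for (double x : xs) sum += x;
    return sum / static_cast<double>(xs.size());
}

// Sum of squared deviations from the mean, shared by both variants.
inline double sum_sq_dev(const std::vector<double>& xs) {
    const double m = mean(xs);
    double total = 0;
    for (double x : xs) total += (x - m) * (x - m);
    return total;
}

// Population standard deviation: divide by N.
inline double stdevp(const std::vector<double>& xs) {
    return std::sqrt(sum_sq_dev(xs) / static_cast<double>(xs.size()));
}

// Sample standard deviation: divide by N - 1.
inline double stdev(const std::vector<double>& xs) {
    assert(xs.size() > 1);
    return std::sqrt(sum_sq_dev(xs) / static_cast<double>(xs.size() - 1));
}

// Difference between the largest and smallest value.
inline double value_range(const std::vector<double>& xs) {
    assert(!xs.empty());
    const auto mm = std::minmax_element(xs.begin(), xs.end());
    return *mm.second - *mm.first;
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the squared deviations sum to 32, so stdevp is sqrt(32/8) = 2 while stdev is sqrt(32/7), which is a little larger. The range is 9 - 2 = 7.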
Regular expressions can be specified with the following meta characters:
\s for white space, \S for not white space, \d for digit, \D for not digit, \w for word character, \W for not a word character, [...] for any included character, [^...] for no included character, * for zero or more, + for one or more, ? for zero or one, | for or, (?P<var>...) for named extraction such as for SSN, (?:...) for logical grouping, ^ for start of line, $ for end of line, {...} for number of repetitions, \ for escape, (?=...) for lookahead, and (?!...) for negative lookahead. The same regular expression can be specified in more than one way, but the parser will attempt to simplify or expand it to cover all cases as applicable by the pattern. For example, a repetition of x 2 to 5 times, x{2,5}, will be written as xx(x(x(x)?)?)?
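The equivalence of x{2,5} and xx(x(x(x)?)?)? is easy to check with any regex engine. The sketch below uses std::regex with its default ECMAScript grammar, which is an assumption on my part and not the engine Splunk uses. That grammar does not support the (?P<var>...) named-group syntax, so only the repetition and \d are exercised. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Returns true if `pattern` matches the whole of `input`.
inline bool matches_whole(const std::string& pattern, const std::string& input) {
    return std::regex_match(input, std::regex(pattern));
}

// Returns true if both patterns give the same verdict on a run of n 'x' characters.
inline bool same_verdict(const std::string& a, const std::string& b, int n) {
    assert(n >= 0);
    const std::string input(static_cast<std::size_t>(n), 'x');
    return matches_whole(a, input) == matches_whole(b, input);
}
```

Running same_verdict("x{2,5}", "xx(x(x(x)?)?)?", n) for n from 0 to 7 should agree every time: both forms reject runs of 0, 1, 6 and 7 and accept runs of 2 through 5. A pattern like \d{3}-\d{2}-\d{4} shows the digit class and repetition together on an SSN-shaped string.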