Wednesday, April 2, 2014

Today we look at Prim's algorithm. This is also a greedy method. Here we grow the tree by adding the minimum-cost edge that connects to it. That is, we look at all the edges that connect to the existing tree and choose the one with the lowest cost, so we grow the tree one edge at a time. The algorithm is described below.
Algorithm Prim(E, cost, n, t)
{
Let (k, l) be an edge of minimum cost in E;
t[1,1] = k;
t[1,2] = l;
mincost = cost[k,l];
for i = 1 to n do
  if (cost[i,l] < cost[i,k]) then near[i] = l;
  else if (cost[i,k] is not large) then near[i] = k;
  else near[i] = -1;
near[k] = near[l] = 0;

for ( i = 2 to n-1 ) do
{
  Choose j such that near[j] > 0 and cost[j, near[j]] is minimum;
  t[i,1] = j; t[i,2] = near[j]; mincost = mincost + cost[j, near[j]];
  near[j] = 0;
  for k = 1 to n do {
    if ( ((near[k] > 0) and (cost[k, near[k]] > cost[k, j])) or
         ((near[k] < 0) and (cost[k, j] is not large)) )
    then
      near[k] = j;
  }
}
return mincost;
}
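To make the steps concrete, here is a minimal runnable C++ sketch of the same idea. It grows the tree from an arbitrary start vertex rather than from the minimum-cost edge (an equivalent formulation), the INF sentinel stands in for "large", and the small sample graph in main is made up for illustration.

// Minimal C++ sketch of Prim's algorithm with an adjacency cost matrix,
// mirroring the near[] bookkeeping of the pseudocode above.
#include <iostream>
#include <limits>
#include <utility>
#include <vector>

const int INF = std::numeric_limits<int>::max();  // "large" == no edge

// Returns the cost of a minimum spanning tree and fills 'tree' with its edges.
// cost[i][j] == INF means there is no edge between i and j.
int prim(const std::vector<std::vector<int>>& cost,
         std::vector<std::pair<int, int>>& tree)
{
    int n = (int)cost.size();
    std::vector<bool> inTree(n, false);
    std::vector<int> nearVertex(n, -1);  // nearest tree vertex for each non-tree vertex

    int start = 0;                       // grow the tree from vertex 0
    inTree[start] = true;
    for (int i = 0; i < n; ++i)
        if (!inTree[i] && cost[i][start] != INF) nearVertex[i] = start;

    int mincost = 0;
    for (int added = 1; added < n; ++added)
    {
        // Choose the non-tree vertex j with the cheapest edge to the tree.
        int j = -1;
        for (int v = 0; v < n; ++v)
            if (!inTree[v] && nearVertex[v] != -1 &&
                (j == -1 || cost[v][nearVertex[v]] < cost[j][nearVertex[j]]))
                j = v;
        if (j == -1) return -1;          // graph is disconnected

        tree.push_back({j, nearVertex[j]});
        mincost += cost[j][nearVertex[j]];
        inTree[j] = true;

        // Update nearVertex[] for the vertices not yet in the tree.
        for (int v = 0; v < n; ++v)
            if (!inTree[v] && cost[v][j] != INF &&
                (nearVertex[v] == -1 || cost[v][j] < cost[v][nearVertex[v]]))
                nearVertex[v] = j;
    }
    return mincost;
}

int main()
{
    // A small made-up example: 4 vertices, weighted undirected edges.
    std::vector<std::vector<int>> cost = {
        {INF, 1, 3, INF},
        {1, INF, 1, 4},
        {3, 1, INF, 2},
        {INF, 4, 2, INF}};
    std::vector<std::pair<int, int>> tree;
    std::cout << "MST cost: " << prim(cost, tree) << std::endl;  // prints 4
    return 0;
}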

Tuesday, April 1, 2014

Today we will look at Kruskal's algorithm. This algorithm is a greedy method. It considers the edges in ascending order of weight. The edges chosen so far need not be connected; at any given time they form a forest of disjoint trees. The algorithm proceeds as follows.
We have two helper methods
Find (u) that finds the component containing u and
Union (u, v) that combines the two components.
Algorithm Kruskal (E, cost, n, t)
{
Organize the edges of the graph in a min-heap H, based on cost;
for i = 1 to n do component[i] = i;
i = 0; mincost = 0;
while ( (i < n - 1) and (H is not empty) ) do
{
    (u,v) = delete-min(H);
    j = Find(u); k = Find(v);
    if (j != k) then {
        i = i + 1;
        t[i,1] = u; t[i,2] = v;
        mincost = mincost + cost[u, v];
        Union(j, k);
    }
}
return mincost;
}
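Here is a minimal runnable C++ sketch of the same procedure. It sorts the edges instead of using a min-heap (which plays the same role), the simple Find/Union below omit path compression and union-by-rank, and the sample graph is made up for illustration.

// Minimal C++ sketch of Kruskal's algorithm with a simple union-find,
// following the Find/Union helpers described in the post.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

struct Edge { int u, v, cost; };

std::vector<int> parent;

int Find(int x)                       // find the component containing x
{
    while (parent[x] != x) x = parent[x];
    return x;
}

void Union(int a, int b)              // combine two components
{
    parent[a] = b;
}

// Returns the MST cost; fills 'tree' with the chosen edges.
int kruskal(int n, std::vector<Edge> edges, std::vector<Edge>& tree)
{
    parent.resize(n);
    std::iota(parent.begin(), parent.end(), 0);   // component[i] = i

    // Sorting ascending by cost plays the role of the min-heap in the pseudocode.
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.cost < b.cost; });

    int mincost = 0;
    for (const Edge& e : edges)
    {
        if ((int)tree.size() == n - 1) break;     // tree is complete
        int j = Find(e.u), k = Find(e.v);
        if (j != k)                   // the edge joins two different trees
        {
            tree.push_back(e);
            mincost += e.cost;
            Union(j, k);
        }
    }
    return mincost;
}

int main()
{
    // Same small made-up example graph as in the Prim sketch above.
    std::vector<Edge> edges = {{0, 1, 1}, {1, 2, 1}, {0, 2, 3}, {2, 3, 2}, {1, 3, 4}};
    std::vector<Edge> tree;
    std::cout << "MST cost: " << kruskal(4, edges, tree) << std::endl;  // prints 4
    return 0;
}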



Monday, March 31, 2014

I want to talk about Splunk saved searches and their visibility, a.k.a. permissions. Splunk has the following permission levels:
A saved search can be private to its owner (local), visible to the current app, or visible to all applications (global).
The roles that Splunk defines are 'Everyone', 'admin', 'can_delete', 'power', 'splunk-system-role' and 'user'.
Each of these roles can have read or write capability.
Saved searches are searches that can be scheduled for runs. The permissions can also be specified via the configuration file with the following format:
[object_type/object_name]
access = read : [ comma-separated list of roles ], write: [comma-separated list of roles ]
We can also set permissions for all objects of a type
[eventtypes]
access = read : [ * ], write : [ admin, power ]
Objects can also be made globally available by adding an entry for the object in default.meta with:
export = system
As an example, to make all event types in the business analytics app viewable in every app in the Splunk installation, we use the following:
[eventtypes]
access = read : [ * ], write : [admin, power ]
export = system
Every app and its objects are governed by a set of permissions, and these can be specified for every role within Splunk.
Every user has their own user directory. Object visibility can be promoted from user level to app level, and then to global across all apps. This is called sharing, and Splunk implements it by moving the object from the user directory to the app directory.
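For example, following the [object_type/object_name] format above, a single saved search (the name errors_last_hour is made up here) could be opened for reading by every role, restricted for writing, and exported globally with a stanza like the following; this is a hedged sketch rather than a copy of a real configuration:
[savedsearches/errors_last_hour]
access = read : [ * ], write : [ admin, power ]
export = system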




In today's post, we continue our discussion on Splunk.  In addition to the discussion on search earlier, I wanted to bring up that search artifacts are maintained in the dispatch folder. We can view all the processing information in these folders including events, commands, arguments, preview and results.
We now look at alerts. Splunk Enterprise can be configured to send alert messages to anyone when real time or historical search results have met a condition.  The conditions can cover a variety of threshold and trend based scenarios.
There are three Splunk alert categories: per-result alerts, scheduled alerts and rolling-window alerts. Alerts are based on reports that run either on a regular interval over a historical time range or in real time (if the report is a real-time search). When alerts are triggered, different actions are executed. There are several alert actions, such as e-mail notifications, that we will cover later.
The per-result alerts are real-time searches that trigger every time the base search returns a result.
They are usually authored to be invoked when a matching result comes in and are generally used in workflow-oriented applications. These alerts can be throttled to ensure they don't fire too often.
Alerts based on historical searches usually run on a regular schedule. This alert type triggers whenever a scheduled run of a historical search returns results that meet a particular condition. These are generally lower priority alerts and more for monitoring over time. For example, trigger an alert whenever the number of 404 errors in any 1 hour interval exceeds 100.
Real-time searches can also have alerts that monitor events within a rolling time window. Such an alert triggers when its conditions are met by events as they pass through the window in real time. For example, trigger an alert when there are three consecutive failed logins.
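As a hedged sketch, a scheduled alert like the 404 example might be expressed in savedsearches.conf roughly as below; the stanza name, search string and e-mail address are made up, and the exact attribute names should be verified against the savedsearches.conf specification:
[Too many 404s]
search = sourcetype=access_combined status=404
dispatch.earliest_time = -1h
cron_schedule = 0 * * * *
enableSched = 1
counttype = number of events
relation = greater than
quantity = 100
action.email = 1
action.email.to = ops@example.com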


Sunday, March 30, 2014

We talk about distributed search in Splunk today. As we know from the discussion on distributed deployment, Splunk Enterprise performs three major functions as it moves data through the data pipeline. First, it consumes data from files, the network or elsewhere. Then it indexes the data. Finally, it runs interactive or scheduled searches on the indexed data. Each of these functions can be split out into dedicated instances of Splunk that can number from a few to thousands.
The dedicated instances that consume data are called forwarders, and they can be lightweight or universal. The indexers do the heavy indexing and are usually the instances with the higher performance configurations. Typically, there is one indexer for every 150-200 GB of daily indexing, and a 12 to 24 core server for each indexer instance. Indexers can also be clustered and configured to replicate each other's data. This process is known as index replication. Replication helps prevent data loss and improves data availability. Clusters also have a built-in distributed search capability, which we will get to shortly.

The other type of dedicated instance is the search head. Search heads co-ordinate searches across the set of indexers, consolidate their results and present them to the user. For the largest environments, several search heads sharing a single configuration set can be pooled together. With search head pooling, we can co-ordinate simultaneous searches across a large number of indexers.

With these logical definitions, we can mix and match the components on the same or different physical machines and configurations to suit our needs. If several instances share the same machine, it will often be a high-end machine with several cores and many GB of RAM. Deployment decisions are based on the amount of incoming data, the amount of indexed data, the number of concurrent users, the number of saved searches, the types of searches employed and the choice to run Splunk apps.
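As a hedged example of how a forwarder is pointed at a group of indexers, its outputs.conf might look like the following; the host names are made up:
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = indexer1.example.com:9997, indexer2.example.com:9997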
Note that distributed search is not the same as distributed deployment; it is built-in functionality of Splunk.
In distributed search we look at use cases such as the following:
As a Splunk administrator, I want to distribute indexing and searching loads across multiple Splunk instances so that it is possible to search and index large quantities of data in reasonable time.
As a Splunk administrator, I want to use distributed search to control access to indexed data, such that admins and security personnel have full access while others have access only to their own data.
As a Splunk administrator, I want to have different Splunk instances in different geographical offices, such that local offices have access to their own data while the centralized headquarters has access at the corporate level.
In distributed search, the Splunk Enterprise instance that does the searching is referred to as the search head, and the instances that participate in the search are called search peers. Typically, the search head runs a search across all of its search peers unless the search is limited to a subset of peers, for example with the splunk_server field in the query.
 When a distributed search is initiated, the search head replicates and distributes its knowledge objects to its search peers. Knowledge objects include saved searches, event types and other entities used in searching across indexes. This is packaged in a knowledge bundle and distributed to the peers.
Knowledge bundles typically reside under the $SPLUNK_HOME/var/run folder and have a .bundle extension. Bundles can contain almost all of the contents of the search head's apps. Instead of copying the bundles over to the different peers, it is sometimes advisable to mount the knowledge bundles. User authorization also flows to the peers along with the bundle: the user running the search, that user's role and the location of the distributed authorize.conf file are included with the bundle.
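As a hedged illustration, the search peers can be listed on the search head in distsearch.conf along these lines; the host names are made up, and the peers still need to exchange credentials with the search head separately:
[distributedSearch]
servers = indexer1.example.com:8089, indexer2.example.com:8089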



In today's post, we bring up a C# application with the SDK we talked about, but not the service broker example I gave earlier. We focus on the SDK itself, specifically on the inputs used to write an event. Did you know that you can color-code the event display depending on how you specify the events? I didn't until recently. I will go over some of these features and explore them with the SDK.
But first I want to mention that most of Splunk's features are also available directly from the management service REST APIs. For example, we can issue:
curl -k -u admin:changeme https://localhost:8089/services/apps/local/myapp -d description="My application"
Splunk SDKs provide a layer over the REST APIs so we can directly instantiate objects and prepare them. Knowing the REST APIs is still helpful in cases where a feature is not yet exposed in a particular SDK, or when the SDK is marshalling parameters to the REST APIs and you want to find out what is set and what isn't. By the way, there is an SDK for most of the common web development languages, but that also means that not all SDKs have the same capabilities. The REST APIs provide a uniform, consistent layer over all the features and are documented in a similar way to the SDKs. There's a reference PDF file available that describes most of the endpoints.
The endpoints are categorized as data, admin, etc. We will see a little more on these shortly, but I want to point out that the REST APIs are easy to use with tools like curl and Fiddler.
Splunk's REST APIs are available for access control, applications, clusters, configurations, deployment, Indexes, inputs, knowledge, license and output.  Indexes, input and output are something we will cover first.
We will also look at the search REST API. Searches can be ad hoc or saved, and saved searches can be scheduled for runs. There is also auto-complete functionality.
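For instance, an ad hoc search can be dispatched and its results fetched over the same management port as in the curl example above; the search string below is made up, and <sid> stands for the search id returned by the first call:
curl -k -u admin:changeme https://localhost:8089/services/search/jobs -d search="search index=_internal error | head 10"
curl -k -u admin:changeme "https://localhost:8089/services/search/jobs/<sid>/results?output_mode=json"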
I will return to this post shortly. I want to cover distributed search.

Friday, March 28, 2014

Today I will continue my discussion on thread-level software instrumentation.
On Windows there is thread-local storage available. Here we create a buffer and keep all the stack-based data we collect in it. To collect CPU ticks, we read the system clock in the constructor and again in the destructor of an object we put on the stack. The difference is the time spent executing that code, and it can be accumulated into the running total on the thread to see how much time each section takes. Note that it is ideal to keep this counter on the scheduler, since threads can attach to and detach from the currently executing run.
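As a rough illustration, here is a minimal C++ sketch of such a stack-based probe; the ScopedCpuProbe and g_threadTicks names are made up, and a real implementation would likely read a cheaper high-resolution or per-thread CPU counter and hand the total to the scheduler as described above.

// Hedged sketch of a stack-based timing probe with a per-thread running total.
// ScopedCpuProbe and g_threadTicks are illustrative names, not a real API.
#include <chrono>
#include <cstdint>
#include <iostream>

thread_local std::uint64_t g_threadTicks = 0;   // running total for this thread

class ScopedCpuProbe
{
public:
    ScopedCpuProbe() : start_(std::chrono::steady_clock::now()) {}
    ~ScopedCpuProbe()
    {
        auto end = std::chrono::steady_clock::now();
        // Accumulate the time spent in this scope into the thread's total.
        g_threadTicks += std::chrono::duration_cast<std::chrono::microseconds>(end - start_).count();
    }
private:
    std::chrono::steady_clock::time_point start_;
};

void doWork()
{
    ScopedCpuProbe probe;          // placed on the stack; measures this scope
    volatile long sum = 0;
    for (int i = 0; i < 1000000; ++i) sum += i;
}

int main()
{
    doWork();
    std::cout << "thread spent " << g_threadTicks << " microseconds in probed scopes\n";
    return 0;
}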
Similarly, it is easier to have a base method for allocations and deletions that takes a stack trace when the corresponding calls are made. This gives us a histogram of stack traces for allocations, and we can use it to find the biggest or the most frequent users of memory.
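A hedged sketch of the base-method idea for allocations follows; it only tallies per-thread byte and call counts, whereas a fuller version would also capture a stack trace at each allocation (for example with backtrace() on POSIX or CaptureStackBackTrace on Windows) and bucket the counts by call site into the histogram described above.

// Hedged sketch: route allocations through replaced global operators so the
// biggest users of memory can be tallied per thread. Illustrative only.
#include <cstdio>
#include <cstdlib>
#include <new>

thread_local std::size_t g_bytesAllocated = 0;   // per-thread running totals
thread_local std::size_t g_allocationCount = 0;

void* operator new(std::size_t size)
{
    g_bytesAllocated += size;
    ++g_allocationCount;
    // A fuller version would record a stack trace here and add it to a histogram.
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc();
}

void operator delete(void* p) noexcept
{
    std::free(p);
}

int main()
{
    int* numbers = new int[100];
    delete[] numbers;
    std::printf("%zu allocations, %zu bytes\n", g_allocationCount, g_bytesAllocated);
    return 0;
}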
Having thread-local storage is desirable on any platform, especially given language-level support such as the thread_local keyword introduced in C++11. Creating a pointer-sized thread-local variable enables allocation of an arbitrary buffer reserved for this thread, which allows thread-specific instrumentation that is not normally covered by routines that place a variable on the stack. Undeniably the latter is required too; for example, crash logs are written this way. But here we consider what else we can add to crash log notifications via thread-local storage.
The first addition we mentioned is historical stack traces, which are not captured by the as-of-this-point-in-time stack trace taken from stack-local variables.
When do we take this stack trace? For every exception encountered, even if not all of them turn into crashes. The suggestion is that by keeping track of the last few exceptions at any given time, we get another data point of interest whenever we look at a dump.
We add an exception handler to the chain of handlers that each thread maintains. The first of these we customize to take a stack trace, put it on our thread-local ring buffer, and return 'keep looking' so that other handlers, including the most recent frame's own handler, get a chance to handle the exception.
Why is this better than logging? It's complementary to logging. Dumps and logs are both relevant for diagnosis, and often every bit of information helps.
One consideration for the ring buffer is that it could be implemented as a skip list for fast traversal.
Creating a ring buffer on each thread is one part. It could be used for different purposes.
Using it for storing encountered exceptions is one such purpose, though the operating system usually covers that, so using it for application-specific tasks is preferable; one such purpose could be to record the last user queries or commands.
For now we will just focus on creating a ring buffer for every thread on all platforms.
We do this whenever we start a thread, or from a callback in the initialization routines. There should be few of those throughout the source code.
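To make that concrete, here is a minimal, hedged C++ sketch of a per-thread ring buffer; the RingBuffer name and the capacity of 16 entries are made up, and the thread_local object is constructed lazily the first time a thread touches it, which stands in for the initialization callback mentioned above.

// Hedged sketch: a fixed-size per-thread ring buffer of recent messages,
// such as the last few exceptions or user commands. Illustrative names only.
#include <array>
#include <iostream>
#include <string>
#include <thread>

class RingBuffer
{
public:
    void record(const std::string& entry)
    {
        entries_[next_ % entries_.size()] = entry;   // overwrite the oldest slot
        ++next_;
    }
    void dump() const
    {
        std::size_t count = next_ < entries_.size() ? next_ : entries_.size();
        for (std::size_t i = 0; i < count; ++i)
            std::cout << entries_[(next_ - count + i) % entries_.size()] << "\n";
    }
private:
    std::array<std::string, 16> entries_;   // last 16 entries for this thread
    std::size_t next_ = 0;
};

// One buffer per thread; constructed the first time the thread touches it.
thread_local RingBuffer g_recentEvents;

int main()
{
    std::thread worker([] {
        g_recentEvents.record("worker: last command was 'refresh'");
        g_recentEvents.record("worker: caught exception 'timeout'");
        g_recentEvents.dump();
    });
    worker.join();
    return 0;
}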