Monday, June 26, 2017

Today we continue our discussion on system design. This time we cover Splunk. 1) Splunk – Splunk is a scalable time-series database that stores machine data as events in buckets – hot/warm, cold, or thawed. These are stored as index files together with some metadata. There are also special buckets called introspection. The architecture consists of lightweight forwarders, indexers, and search heads, each with its own topology and built to scale. The forwarders are the collection agents that gather machine data from the customer. The indexers receive the events and handle the bulk of the operations. The search heads present analysis tools, charts, and management interfaces.

Splunk has recently added analysis features based on machine learning. Previously, most of the search features were based on Unix-like command operators, which became quite popular and boosted Splunk's adoption as the IT tool of choice, among other usages. There are a variety of charting tools; the frontend is based on JavaScript while the middle tier is based on Django. The indexers are written in C++ and come with robust capabilities. It is important to note that their database, unlike conventional relational or NoSQL stores, was designed primarily for specific usages. If they moved their database to commodity or platform options in the public cloud, they could evolve their frontend beyond a single enterprise-based or local-host-based instance and provide isolated, cloud-based storage per customer on a subscription basis – Splunk as a cloud- and browser-based service.
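To give a flavor of those pipe-style search operators, here is a typical search over Splunk's own internal logs (a hypothetical illustrative query, not one from any particular deployment) – commands chain left to right much like a Unix pipeline:

```
index=_internal sourcetype=splunkd log_level=ERROR
| stats count BY component
| sort -count
```

The first clause filters events by index and field values; `stats` aggregates them; `sort` orders the result – the same compose-small-operators style that made the search language popular.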
Next, we cover Consistent Hashing – a notion that is quietly finding its way into several distributed systems and services. Initially we had a cache distributed among n servers as hash(o) modulo n. This had the nasty side effect that when one or more servers went down or were added to the pool, nearly all the objects in the cache would map to different servers because the variable n changed. Consistent hashing instead accommodates new servers and takes old servers offline by arranging the hashes around a circle with cache points. When a cache is removed or added, only the objects whose hashes fall on the affected arc of the circle move clockwise to the next cache point. It also introduced "virtual nodes", which are replicas of cache points on the circle. Because a single point per cache can give a non-uniform distribution of objects across caches, each cache is mapped to several virtual nodes, spreading its share of the circle more evenly.
public class ConsistentHash<T> {
    private SortedDictionary<int, T> circle = new SortedDictionary<int, T>(); // in Java, a TreeMap serves the same purpose, keeping the keys sorted
}
Since a number of replicas are maintained, the replica number may be appended to the string representation of the cache point before it is hashed. A good example is memcached, whose client libraries commonly use consistent hashing to pick a server.
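Putting the pieces above together, here is a minimal sketch of such a ring in Java, assuming CRC32 as the hash function and a configurable number of virtual nodes per cache – both are illustrative choices, not prescribed by the text:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// A minimal consistent-hash ring (illustrative sketch).
// CRC32 and the "#i" replica suffix are assumptions for the example.
public class ConsistentHashRing {
    private final TreeMap<Long, String> circle = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    private long hash(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes());
        return crc.getValue();
    }

    public void addServer(String server) {
        // The replica number is appended to the string before hashing,
        // so each server lands on several points around the circle.
        for (int i = 0; i < virtualNodes; i++)
            circle.put(hash(server + "#" + i), server);
    }

    public void removeServer(String server) {
        for (int i = 0; i < virtualNodes; i++)
            circle.remove(hash(server + "#" + i));
    }

    // Walk clockwise from the object's hash to the next cache point,
    // wrapping around to the first entry if we fall off the end.
    public String getServer(String key) {
        if (circle.isEmpty()) return null;
        Map.Entry<Long, String> e = circle.ceilingEntry(hash(key));
        return e != null ? e.getValue() : circle.firstEntry().getValue();
    }
}
```

Note that removing a server only disturbs keys that mapped to its own points on the circle; every other key keeps its assignment, which is exactly the property that modulo-n hashing lacks.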
#codingexercise
Find whether there is a subset of numbers in a given integer array that, when ANDed with the given number, results in zero. Note that AND can only clear bits, never set them, so the AND of the given number with every element of the array is the smallest value any subset can produce – checking the full array therefore suffices.
    static bool IsValid(List<int> items, int Z)
    {
        // Start from Z expressed as a 32-bit vector.
        var res = new BitArray(new int[] { Z });
        foreach (var item in items)
        {
            var b = new BitArray(new int[] { item });
            // AND can only clear bits, so this accumulates the minimum
            // value reachable by any subset of items.
            res = res.And(b);
        }
        int[] result = new int[1];
        res.CopyTo(result, 0);
        return result[0] == 0;
    }
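The same monotonicity argument can be written more directly without a BitArray – a plain accumulator over the bitwise AND. A Java sketch of the same idea:

```java
// Returns true when some nonempty subset of items ANDs with z to give zero.
// Key fact: AND is monotone decreasing -- including more elements can only
// clear bits -- so the AND of z with the whole array is the minimum value
// that any subset can attain.
public class SubsetAnd {
    public static boolean hasZeroAndSubset(int[] items, int z) {
        int acc = z;
        for (int item : items)
            acc &= item;
        return acc == 0;
    }
}
```

For example, with z = 3 and items {1, 2}: 3 & 1 = 1, then 1 & 2 = 0, so a qualifying subset exists; with z = 2 and items {3}, the AND stays at 2 and no subset works.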
