Tuesday, June 13, 2017

Data architectures in Cloud Computing. 
We were discussing that traditional data processing architectures have changed a lot: from being part of the ubiquitous three-tier architecture involving databases, to being more distributed, scaled up and scaled out, sharded and hosted on private and public clouds, maintained on clusters and containers with shared volumes, hosted in memory, and even becoming hybrid to span SQL and NoSQL technologies. We continue reviewing some of the improvements in this field.
Today we look at data storage for social networking applications as an extreme example of storage at scale. We recap the considerations presented in a video by Facebook engineers on their data infrastructure needs. Data arrives to the tune of several hundred terabytes into a store of over 300 petabytes, only a fraction of which is processed. For example, their logs flow into HDFS, which is massively distributed storage. They run Hadoop over HDFS together with Hive for data warehouse operations and SQL querying, which makes it easy for data tools and pipelines to work with this stack.
They then introduced Presto over HDFS for interactive analysis and Apache Giraph over Hadoop for graph analytics. We will discuss Presto and Giraph shortly, but let us first look at the kinds of data stored in these systems. Images from the social network flow into Haystack, users are stored in MySQL, and chat is stored in HBase. They use Scribe for log storage, Scuba for real-time slice and dice, and Puma for streaming analytics. These also indicate the major types of data processing involved in a social networking application.
Apache Giraph is an iterative graph processing system built for high scalability. It is used to analyze the graph formed by users and their connections.

#codingexercise
Given an array, find the maximum j-i such that arr[j] > arr[i]
int GetMaxDiff(List<int> A)
{
    int diff = int.MinValue;
    for (int i = 0; i < A.Count; i++)
        for (int j = A.Count - 1; j > i; j--)
        {
            if (A[j] > A[i] && j - i > diff)
                diff = j - i;
        }
    return diff;
}
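The nested loops above are quadratic. A well-known linear-time variant precomputes prefix minima and suffix maxima and then walks two pointers; the sketch below is an addition of mine along those lines, not part of the original exercise, and returns -1 when no such pair exists.
int GetMaxDiffLinear(List<int> A)
{
    int n = A.Count;
    if (n == 0) return -1;
    // LMin[i] is the minimum of A[0..i]; RMax[j] is the maximum of A[j..n-1].
    var LMin = new int[n];
    var RMax = new int[n];
    LMin[0] = A[0];
    for (int i = 1; i < n; i++)
        LMin[i] = Math.Min(A[i], LMin[i - 1]);
    RMax[n - 1] = A[n - 1];
    for (int j = n - 2; j >= 0; j--)
        RMax[j] = Math.Max(A[j], RMax[j + 1]);
    // Walk both arrays left to right, stretching the right pointer while the condition holds.
    int diff = -1;
    int x = 0, y = 0;
    while (x < n && y < n)
    {
        if (LMin[x] < RMax[y])
        {
            diff = Math.Max(diff, y - x);
            y++;
        }
        else
        {
            x++;
        }
    }
    return diff;
}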


Monday, June 12, 2017

Data architectures in Cloud Computing. 

We were discussing that traditional data processing architectures have changed a lot: from being part of the ubiquitous three-tier architecture involving databases, to being more distributed, scaled up and scaled out, sharded and hosted on private and public clouds, maintained on clusters and containers with shared volumes, hosted in memory, and even becoming hybrid to span SQL and NoSQL technologies. We continue reviewing some of the improvements in this field.
One of the trends is the availability of augmented memory with the help of solid state drives. An SSD increases data access speed by an order of magnitude compared to spinning disk, so memory-intensive programs such as databases and data access technologies stand to make the most of this improvement. With no moving parts, an SSD is also less vulnerable to mechanical faults, which improves the reliability of persistence. Needless to say, database performance improves because disks don't have to be spun up. Perhaps the more significant contribution is when memory is backed by SSD so that in-memory programs can take more advantage of caching. Even a database in use 24x7 over long periods need not wear out the SSD: the TRIM command exists precisely for sustained long-term performance and wear leveling, and TRIM support can be enabled on a continuous as well as a periodic basis.

 #codingexercise
Find the next greater element for each element in an integer array
int[] GetNextGreaterElements(List<int> A)
{
    var result = new int[A.Count];
    for (int i = 0; i < A.Count; i++)
    {
        int next = -1;
        for (int j = i + 1; j < A.Count; j++)
            if (A[j] > A[i])
            {
                next = A[j];
                break;
            }
        result[i] = next;
    }
    return result;
}
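For larger inputs, a stack makes this linear. The sketch below is my variant, not part of the original exercise; it scans from the right and keeps only values that can still serve as a next greater element.
int[] GetNextGreaterElementsWithStack(List<int> A)
{
    var result = new int[A.Count];
    var stack = new Stack<int>(); // candidate next-greater values, nearest on top
    for (int i = A.Count - 1; i >= 0; i--)
    {
        // Discard candidates that are not greater than the current element.
        while (stack.Count > 0 && stack.Peek() <= A[i])
            stack.Pop();
        result[i] = (stack.Count == 0) ? -1 : stack.Peek();
        stack.Push(A[i]);
    }
    return result;
}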

Sunday, June 11, 2017

From database to data platform, there is an evolution in programmability, toolset and workflows around the database. Designating dedicated instances of the database for organization-wide usage makes it appealing as a sink for all data, enabling newer workflows and the migration toward being a platform. This means we make it easy for data to flow into the in-memory database by writing connectors, aggregators, input collection agents and forwarders. Newer data is classified not only by content but also by source such as machine data or user data, by technology such as document stores and key-value indexes, and by translations such as from one region or locale to another.
In-memory databases could also benefit from virtually unlimited memory via cloud resources or facilitators that stitch cloud resources together. This may require moving away from a cluster model that differentiates between masters and slaves, toward peer-based computing that scales to unlimited numbers. There are technical challenges in doing that, but allowing a virtual cluster over cloud resources may still be viable. Moreover, we can allow smaller datasets on numerous memory-rich compute resources with orders of magnitude more capacity. If we make memory seamless between instances, we no longer have a cluster model but a single instance on effectively infinite hardware. The memory stitching layer can be offloaded so that it is database independent. By seamless, we mean it is not a distributed model, and it can also include the option of using SSD devices alongside RAM. If the in-memory database cannot be a single limitless-memory instance, perhaps we can consider a single large-scale cloud-based generic cluster. Data writes that are flushed can be propagated to cloud storage.
Azure provides the cloud database as a service, a managed service offering databases to developers. The notion of a managed database implies that the database adaptively tunes performance and automatically improves reliability and data protection, which frees up application development. It also scales on the fly with no downtime. We cite Azure database as a service for the notion of managed services and for introducing elastic properties to the database. Whether a database is in-memory or hosted on a cluster, relational or NoSQL, the benefits of managed services make it all the more appealing to users.
#codingexercise
Find the minimum number of deletions that will make two strings anagrams of each other
int GetMinDelete(string A, string B)
{
    Debug.Assert(A.ToLower() == A && B.ToLower() == B); // expects lowercase inputs
    var ca = new int[26];
    var cb = new int[26];
    for (int i = 0; i < A.Length; i++)
        ca[A[i] - 'a']++;
    for (int i = 0; i < B.Length; i++)
        cb[B[i] - 'a']++;
    int result = 0;
    for (int i = 0; i < 26; i++)
        result += Math.Abs(ca[i] - cb[i]);
    return result;
}
Get the minimum sum of squares of character counts after removing k characters:
Step 1: make a frequency table.
Step 2: order the frequencies in descending order.
Step 3: decrement the largest remaining frequency, one removal at a time, until k characters have been removed.
Step 4: sum the squares of the remaining frequency counts.
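A small sketch of those steps, assuming lowercase input (the function name is mine): rather than sorting once, it re-selects the current largest frequency for each of the k removals, which is what the greedy above intends.
int GetMinSquareSum(string str, int k)
{
    // Step 1: frequency table for the 26 lowercase letters.
    var freq = new int[26];
    foreach (var c in str)
        freq[c - 'a']++;
    // Steps 2 and 3: decrement the currently largest frequency, one removal at a time, k times.
    for (int removed = 0; removed < k; removed++)
    {
        int maxIndex = 0;
        for (int i = 1; i < 26; i++)
            if (freq[i] > freq[maxIndex]) maxIndex = i;
        if (freq[maxIndex] == 0) break;
        freq[maxIndex]--;
    }
    // Step 4: sum the squares of the remaining counts.
    int sum = 0;
    foreach (var f in freq)
        sum += f * f;
    return sum;
}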
We have a series of daily stock prices as a bar chart and have to determine the span for all n days. The span for a given day is the number of consecutive days ending with that day for which the price was less than or equal to the price on that day.
int[] GetSpan(List<int> prices)
{
    var result = new int[prices.Count];
    for (int i = 0; i < prices.Count; i++)
    {
        int count = 1;
        for (int j = i - 1; j >= 0; j--)
        {
            if (prices[j] > prices[i])
                break;
            else
                count++;
        }
        result[i] = count;
    }
    return result;
}
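The nested loop above is quadratic in the number of days. A common linear-time alternative keeps a stack of indices of days whose prices have not yet been exceeded; the sketch below is my variant of that idea, not part of the original exercise.
int[] GetSpanWithStack(List<int> prices)
{
    var result = new int[prices.Count];
    var stack = new Stack<int>(); // indices of days with prices greater than the current day
    for (int i = 0; i < prices.Count; i++)
    {
        // Pop days priced at or below today's; the span extends past them.
        while (stack.Count > 0 && prices[stack.Peek()] <= prices[i])
            stack.Pop();
        result[i] = (stack.Count == 0) ? i + 1 : i - stack.Peek();
        stack.Push(i);
    }
    return result;
}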


Saturday, June 10, 2017

Database queries are just as important as data manipulation and archival, and have been an important factor in database adoption. The kinds of queries made have driven the adoption of the SQL standard that continues to this date. Even where MapReduce is the norm for Big Data, the ability to translate SQL queries over Big Data has become popular. Query execution and optimization depend heavily on a costing model. Once the query is sent to the database engine, it is translated into an abstract syntax tree. Then it is bound, resolving column and table names. Next it is optimized by a set of tree transformations that choose join orders or rewrite subselects; both the nodes and the operations might change. The tree may be inserted into a cache and hydrated with code before execution. During execution, another set of tree transformations makes it interpreted and more stateful. Resources acquired during execution are then released on cleanup. On the NoSQL side, say with the Hadoop ecosystem, Hive is preferred over Pig because it is very SQL-like. It abstracts the data retrieval but does not support all the operations. Hive is also much slower than SQL queries against a relational engine because there is no caching involved; it translates to MapReduce and re-runs each time. We realize its value only in the context of the scale-out efficiencies of Hadoop. A table scan seems appealing only when the corresponding index seeks would be orders of magnitude more.
Data access patterns that involve locking and logging are generally limiting. When these constraints are relaxed with, say, lock-free skip lists, range queries can actually scale better than with B-trees. These data structures, together with multiversion concurrency control, make it easier for the database to run in memory, which makes it all the faster. In traditional databases there is a fixed overhead per query in terms of setting up contexts, preparing the tree for interpretation and running the query; in-memory databases use code generation to avoid all this. Similarly, MVCC gives the same correctness guarantees as lock-based transactions while allowing more concurrency. Given that SQL queries are becoming a standard for working with relational, non-relational and in-memory databases, it is safe to bet that this will become the norm for any new additions to the group. In fact, in-memory databases scale out beyond a single server to cluster nodes with communications based on SQL querying, and tables are distributed with hash partitioning. What in-memory databases fail to do is become a data platform the way Splunk does for machine data and analysis.
#codingexercise
Reverse a singly linked list in groups of K and K+1 alternately
e.g. 1, 2, 3, 4, 5, 6, 7 with K = 3
becomes 3, 2, 1, 7, 6, 5, 4
    static Node Reverse(int k, int group, ref Node root)
    {
        Node current = root;
        Node next = null;
        Node prev = null;
        int count = 0;
        int max = k;
        if (group % 2 == 1) max = k + 1;
        while (current != null && count < max)
        {
            next = current.next;
            current.next = prev;
            prev = current;
            current = next;
            count++;
        }
        if (root != null)
            root.next = Reverse(k, group + 1, ref next);
        return prev;
    }
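The Reverse routines here and in the June 9 post below assume a minimal singly linked Node type along these lines; the definition and the sample call are mine, not spelled out in the posts.
    class Node
    {
        public int data;
        public Node next;
        public Node(int d) { data = d; next = null; }
    }
    // Example: with head pointing at 1->2->3->4->5->6->7 and K = 3,
    // var newHead = Reverse(3, 0, ref head); // yields 3,2,1,7,6,5,4 as in the example above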

Friday, June 9, 2017

Data architectures in Cloud Computing. 
Traditional data processing architectures have changed a lot: from being part of the ubiquitous three-tier architecture involving databases, to being more distributed, scaled up and scaled out, sharded and hosted on private and public clouds, maintained on clusters and containers with shared volumes, hosted in memory, and even becoming hybrid to span SQL and NoSQL technologies. We describe some of these evolutions in this article.
There are some trends in data that are undeniable and renowned as motivating this evolution. First, data sets are sticky: they are costly to acquire, transfer and use in a new location. This also means that innovation will increasingly be accomplished by end users rather than an expert provider, and consequently the ecosystem has been changing. Second, data is growing rapidly. The order of scale has increased from gigabytes to terabytes to petabytes and beyond, and the data is increasingly gathered from numerous sensors, logs and networks. More commonly, database administrators find that their MySQL database keeps getting slower even with master-slave replication, added RAM, sharding, denormalization and other SQL tuning techniques.
Therefore architects often choose to spread out their options while the data is still small enough to be manageable and portable. They get rid of joins and denormalize beforehand, switch to data stores that can scale such as MongoDB, and have applications and services take on more compute than storage. As data grows, scalability concerns grow, and this is where Big Data comes in. Initially much of the growth in data was exclusively for analytics purposes, so Big Data became synonymous with MapReduce-style computing. That begins to change when there are more usages of the data. For example, SQL statements are used to work with the data and SQL connectors are used to bridge relational and NoSQL stores. NoSQL is usually supported on a distributed file system with key-values as columns in a column family. The charm of such a system is that it can scale horizontally with the addition of commodity hardware, but it does not support the guarantees that a relational store comes with. This calls for a mixed model in many cases.
Usage has also driven other expressions of the database. For example, distributed databases in the form of a matrix were adopted to grow to large data sets and high-volume computations. Separating data into tables, blobs and queues enabled it to be hosted at much smaller granularity on public and private clouds. When the data could not be broken down, such as with the master data catalog of a retail store, it was served with its own stack of web services in a tiered architecture that decoupled the dependency on the original large-volume data store. Adoption of clusters in forms other than Big Data and file systems, such as the abstraction of operating system resources, enabled smaller databases to be migrated from dedicated servers to clusters.
 #codingexercise
        static List<String> GenerateEmailAliases(String firstname, String lastname)
        {
            var ret = new List<String>();
            ret.Add(firstname);
            ret.Add(lastname);
            for (int i = 0; i < firstname.Length; i++)
                for (int j = 0; j < lastname.Length; j++)
                {
                    var alias = firstname.Substring(0, i + 1) + lastname.Substring(0, j + 1);
                    ret.Add(alias);
                }
            return ret;
        }

bool IsPowerOfTwo(uint x)
{
    return (x != 0) && ((x & (x - 1)) == 0);
}
// reverse a linked list in groups of k
    static Node Reverse(int k, ref Node root)
    {
        Node current = root;
        Node next = null;
        Node prev = null;
        int count = 0;
        while (current != null && count < k)
        {
            next = current.next;
            current.next = prev;
            prev = current;
            current = next;
            count++;
        }
        if (root != null)
            root.next = Reverse(k, ref next);
        return prev;
    }


Thursday, June 8, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at very specialized but increasingly popular analytics frameworks and Big Data to use with data sources; for example, Spark and Hadoop can be offered as fully managed cloud offerings. We continued looking at some more specialized infrastructure including dedicated private cloud, and then we added serverless computing to the mix. Today we discuss Marathon in detail. This is another specialization that is changing the typical landscape of application-database deployments around the traditional OLTP store.
Marathon is a container orchestration platform for Mesosphere's Datacenter Operating System (DC/OS) and Apache Mesos. Mesosphere is a platform for building data-rich applications that are portable across hybrid clouds. Mesos is a distributed systems kernel that unshackles us from a single box while providing the same abstractions for CPU, memory, storage and other compute resources, enabling fault-tolerant and elastic distributed systems. Together, Marathon and Mesos provide a one-stop shop for an always-on, always-connected, highly available, load-balanced, resource-managed and cloud-portable deployment environment that is production ready. Contrast this with the traditional dedicated resources of deployment environments in many enterprises and we see how managed the deployment environment has become. Moreover, Marathon enables service discovery and load balancing, so as many applications and services as needed can be written and placed behind a load balancer. Anything hosted on Marathon automatically comes with health checks. Not only that, we can also subscribe to events and collect metrics. The entire Marathon framework is available via a REST API for programmability.
Perhaps the most interesting application is hosting a database service on Marathon containers. While load balancing for user-facing services is a commonly understood practice, doing the same for a database service is lesser known. Marathon, however, treats all services as code that can run on any container hosted on the same underlying distributed layer. The data persists on a shared volume, and intra-cluster connectivity to the shared volume adds only minor latency and redirection; that said, storage and networking efficiency still need to be carefully studied. Also, persistent volumes are used so applications can preserve their state, because otherwise they lose it when they are restarted. A local volume pinned to a node will be available again when that node relaunches, which bundles the disk and the compute for that node.
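For illustration, creating an application through that REST API might look like the following from C#. The /v2/apps path is Marathon's application endpoint, but the host, port and app definition below are made-up values for a hypothetical cluster, so treat this as a sketch rather than a recipe.
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class MarathonDeployExample
{
    static async Task Main()
    {
        // Hypothetical Marathon endpoint; adjust host and port for the actual cluster.
        var client = new HttpClient { BaseAddress = new Uri("http://localhost:8080") };

        // A minimal app definition: id, command to run, and resource requests.
        var appDefinition = @"{
            ""id"": ""/demo/hello"",
            ""cmd"": ""python3 -m http.server 8000"",
            ""cpus"": 0.25,
            ""mem"": 128,
            ""instances"": 2
        }";

        // POST /v2/apps asks Marathon to create and start scheduling the application.
        var response = await client.PostAsync("/v2/apps",
            new StringContent(appDefinition, Encoding.UTF8, "application/json"));
        Console.WriteLine(response.StatusCode);
    }
}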

#codingexercise
Find the first non-repeating character in a string
char GetNonRepeating(string str)
{
    var h = new Dictionary<char, int>();
    for (int i = 0; i < str.Length; i++)
        if (h.ContainsKey(str[i]))
            h[str[i]] += 1;
        else
            h.Add(str[i], 1);
    char c = '\0';
    for (int i = 0; i < str.Length; i++)
        if (h[str[i]] == 1)
        {
            c = str[i];
            break;
        }
    return c;
}