Sunday, June 11, 2017

From database to data platform, there is an evolution in programmability, toolset and workflows around the database. Designating dedicated instances of the database for organization-wide usage also makes it appealing as a sink for all data, enabling newer workflows and a migration toward being a platform. This means we make it easy for data to flow into the in-memory database by writing different connectors, aggregators, input collection agents and forwarders. Newer data is categorized not only by content but also by source, such as machine data or user data, by technology, such as document stores and key-value indexes, and by translations, such as from one region or locale to another. 
In-memory databases could also benefit from unlimited memory via cloud resources or facilitators that stitch cloud resources together. This may require moving away from the cluster model, where there is a differentiation between masters and slaves, and enabling peer-based computing that scales to unlimited numbers. There are some technical challenges in doing that, but allowing a virtual cluster over cloud resources may still be viable. Moreover, we can allow smaller datasets on numerous memory-rich compute resources with orders of magnitude more resources. If we make memory seamless between instances, we no longer have a cluster model but a single instance on infinite hardware. The memory stitching layer can be offloaded to be database independent. By seamless, we mean it is not a distributed model, and it includes the option of using SSD devices alongside RAM. If the in-memory database cannot be a single limitless-memory instance, perhaps we can consider a single large-scale cloud-based generic cluster. Data writes that are flushed can be propagated to cloud storage. 
Azure provides cloud database as a service. It is a managed service offering databases to developers. This notion of managed databases implies that the database adaptively tunes performance and automatically improves reliability and data protection, which frees up app development. It automatically scales on the fly with no downtime. We cite the Azure database as a service for the notion of managed services and for introducing elastic properties to the database. Whether a database is in-memory or hosted on a cluster, relational or NoSQL, there are benefits from managed services that make it all the more appealing to users.
#codingexercise
Find the minimum number of deletions that will make two strings anagrams of each other
int GetMinDelete(string A, string B)
{
Debug.Assert(A.ToLower() == A && B.ToLower() == B); // lowercase input expected
var ca = new int[26];
var cb = new int[26];
for (int i = 0; i < A.Length; i++)
    ca[A[i]-'a']++;
for (int i = 0; i < B.Length; i++)
    cb[B[i]-'a']++;
// every surplus occurrence of a character on either side must be deleted
int result = 0;
for (int i = 0; i < 26; i++)
    result += Math.Abs(ca[i]-cb[i]);
return result;
}
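For example, GetMinDelete("cde", "abc") returns 4: deleting d and e from the first string and a and b from the second leaves the anagrams "c" and "c".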
Get Minimum sum of squares of character counts after removing k characters
step 1 make a frequency table
step 2 order frequencies in descending manner
step 3 decrement frequencies starting with the most repeating character until k characters
step 4 sum the squares of the remaining frequency counts.
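A minimal sketch of these four steps, assuming lowercase input and greedily decrementing the most frequent remaining character k times:
int GetMinSquareSum(string s, int k)
{
var counts = new int[26]; // step 1: frequency table
foreach (var c in s)
    counts[c-'a']++;
// steps 2 and 3: always reduce the currently largest count
for (int removed = 0; removed < k; removed++)
{
    int max = 0;
    for (int i = 1; i < 26; i++)
        if (counts[i] > counts[max]) max = i;
    if (counts[max] == 0) break;
    counts[max]--;
}
int result = 0; // step 4: sum the squares of the remaining counts
foreach (var c in counts)
    result += c * c;
return result;
}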
We have a series of daily stock prices as a bar chart. We have to determine the span for all n days. The span for a given day is the number of consecutive days ending on that day for which the price of the stock was less than or equal to the price on that day. 
int[] GetSpan(List<int> prices)
{
var result = new int[prices.Count];
for (int i = 0; i < prices.Count; i++)
{
    int count = 1;
    // walk backwards until a strictly higher price ends the span
    for (int j = i-1; j >= 0; j--)
    {
        if (prices[j] > prices[i])
            break;
        else
            count++;
    }
    result[i] = count;
}
return result;
}
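For example, GetSpan(new List<int> {100, 80, 60, 70, 60, 75, 85}) returns the spans {1, 1, 1, 2, 1, 4, 6}.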


Saturday, June 10, 2017

Database queries are just as important as data manipulation and archival and have been an important factor in database adoption. The kinds of queries made have driven the adoption of the SQL standard that continues to date. Even where map-reduce is the norm for Big Data, the ability to translate SQL queries onto Big Data has become popular. Query execution and optimization is heavily dependent on the costing model. Once the query is sent to the database engine, it is translated into an abstract syntax tree. Then it is bound/resolved to column and table names. Next it is optimized by a set of tree transformations that choose join orders or rewrite subselects. Both the nodes and the operations might change. The tree may be inserted into the cache and hydrated with code before execution. During execution, another set of tree transformations occurs to make it interpreted and more stateful. Resources acquired during execution are then released on cleanup. On the NoSQL side, say with the Hadoop ecosystem, Hive is preferred over Pig because it is very SQL-like. It abstracts the data retrieval but does not support all the operations. Hive is also much slower than SQL queries because there is no caching involved; it translates to MapReduce and re-runs each time. We realize its value only in the context of the scale-out efficiencies of Hadoop. A table scan is appealing only when the equivalent index seeks would be orders of magnitude more numerous. 
Data access patterns that involve locking and logging are generally limiting. When these constraints are relaxed, say with lock-free skip lists, range queries can actually scale better than B-Trees. These data structures, together with multiversion concurrency control, make it easier for the database to run in-memory, which makes it all the faster. In traditional databases, there is a fixed overhead per query in terms of setting up contexts, preparing the tree for interpretation and running the query. In-memory databases use code generation to avoid all this. Similarly, MVCC gives the same efficiency and correctness as transactions. Given that SQL queries are becoming a standard for working with relational, non-relational and in-memory databases, it should be safe to bet that this will become the norm for any new additions to the group. In fact, in-memory databases scale out beyond a single server to cluster nodes with communications based on SQL querying. Tables are distributed with hash-partitioning. What in-memory databases fail to do is become a data platform the way Splunk does for machine data and analysis.
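As an illustration of the ordered range scan such structures serve, here is a minimal sketch using C#'s SortedSet; it is tree-based rather than a lock-free skip list, but GetViewBetween shows the same access pattern:
var index = new SortedSet<int> { 3, 9, 14, 27, 42, 58 };
// scan only the keys in [10, 50] without visiting the rest of the set
foreach (var key in index.GetViewBetween(10, 50))
    Console.WriteLine(key); // prints 14, 27, 42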
#codingexercise
Reverse a singly linked list in groups of K and K+1 alternately
e.g. 1, 2, 3, 4, 5, 6, 7 with K = 3 becomes
3, 2, 1, 7, 6, 5, 4
    static Node Reverse(int k, int group, ref Node root)
    {
        Node current = root;
        Node next = null;
        Node prev = null;
        int count = 0;
        // odd-numbered groups take k+1 nodes, even-numbered groups take k
        int max = k;
        if (group % 2 == 1) max = k + 1;
        while (current != null && count < max)
        {
            next = current.next;
            current.next = prev;
            prev = current;
            current = next;
            count++;
        }
        // root is now the tail of the reversed group; attach the next group
        if (root != null)
            root.next = Reverse(k, group + 1, ref next);
        return prev;
    }
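A quick usage sketch, assuming a minimal Node class with data and next fields:
    var head = new Node { data = 1 };
    var tail = head;
    for (int v = 2; v <= 7; v++) { tail.next = new Node { data = v }; tail = tail.next; }
    head = Reverse(3, 0, ref head); // yields 3, 2, 1, 7, 6, 5, 4 as in the example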

Friday, June 9, 2017

Data architectures in Cloud Computing. 
Traditional data processing architecture has changed a lot from where it used to be part of the ubiquitous three-tier architecture involving databases, to being more distributed, scaled up and scaled out, sharded, hosted on private and public clouds, maintained on clusters and containers with shared volumes, hosted in memory, and even becoming hybrid to involve both SQL and NoSQL technologies. We describe some of these evolutions in this article. 
There are some trends with data that are undeniable and widely recognized as motivating this evolution. First, data sets are sticky. They are costly to acquire, transfer and use in a new location. This also means that innovation will increasingly be accomplished by end users rather than an expert provider. Consequently the ecosystem has been changing. Second, data is growing rapidly. The order of scale has increased from gigabytes to terabytes to petabytes and so on. The data is increasingly being gathered from numerous sensors, logs and networks. More commonly, database administrators find that their MySQL database becomes slower and slower even with master-slave replication, adding more RAM, sharding, denormalization and other SQL tuning techniques. 
Therefore architects often choose to spread out their options while data growth is still manageable and the data is portable. They get rid of joins and denormalize beforehand, they switch to data stores that can scale, such as MongoDB, and they have applications and services take on more compute than storage. As data grows, scalability concerns grow. This is where Big Data comes in. Initially much of the growth in data was exclusively for analytics purposes, so Big Data became synonymous with MapReduce-style computing. However, that began to change as there were more usages of the data. For example, SQL statements are used to work with the data and SQL connectors are used to bridge relational and NoSQL stores. NoSQL is usually supported on a distributed file system with key-values in column families. The charm of such a system is that it can scale horizontally with the addition of commodity hardware, but it does not support the guarantees that a relational store comes with. This calls for a mixed model in many cases. 
Usages have also driven other expressions of the database. For example, distributed databases in the form of matrices were adopted to grow to large data sets and high-volume computations. Separation of data into tables, blobs and queues enabled it to be hosted at much smaller granularity on public and private clouds. When the data could not be broken down, such as with the Master Data catalog of a retail store, it was served with its own stack of web services in a tiered architecture that decoupled the dependency on the original large-volume data store. Adoption of clusters in forms other than for Big Data and file systems, such as the abstraction of Operating System resources, enabled smaller databases to be migrated to clusters from dedicated servers.  
 #codingexercise
        static List<String> GenerateEmailAliases(String firstname, String lastname) 
        { 
            var ret = new List<String>();
            ret.Add(firstname);
            ret.Add(lastname); 
            // pair every prefix of the first name with every prefix of the last name
            for (int i = 0; i < firstname.Length; i++) 
                for (int j = 0; j < lastname.Length; j++) 
                { 
                    var alias = firstname.Substring(0, i + 1) + lastname.Substring(0, j + 1); 
                    ret.Add(alias); 
                } 
            return ret; 
        }
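For example, GenerateEmailAliases("jo", "li") returns jo, li, jl, jli, jol and joli.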

bool IsPowerOfTwo(uint x)
{
// a power of two has exactly one bit set, so x & (x-1) clears it to zero
return (x != 0) && ((x & (x-1)) == 0);
}
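For example, IsPowerOfTwo(64) is true because 64 & 63 == 0, while IsPowerOfTwo(24) is false because 24 & 23 leaves the bit at 16 set.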
// reverse a linkedlist in groups of k
    static Node Reverse(int k, ref Node root)
    {
        Node current = root;
        Node next = null;
        Node prev = null;
        int count = 0;
        // reverse the next k nodes in place
        while (current != null && count < k)
        {
            next = current.next;
            current.next = prev;
            prev = current;
            current = next;
            count++;
        }
        // root is now the tail of this group; link it to the next reversed group
        if (root != null)
            root.next = Reverse(k, ref next);
        return prev;
    }


Thursday, June 8, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discussing data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework and Big Data to use with data sources. For example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued looking at some more specialized infrastructure, including dedicated private cloud. Then we added serverless computing to the mix. Today we discuss Marathon in detail. This is another specialization that is changing the typical landscape of application-database deployments in the traditional OLTP store.
Marathon is a container orchestration platform for Mesosphere's Datacenter Operating System (DC/OS) and Apache Mesos. Mesosphere is a platform for building data-rich applications that can be portable across hybrid clouds. Mesos is a distributed systems kernel that unshackles us from a single box while providing the same abstractions for CPU, memory, storage and other compute resources. It enables fault-tolerant and elastic distributed systems. Together Marathon and Mesos provide a one-stop shop for an always-on, always-connected, highly available, load-balanced, resource-managed and cloud-portable deployment environment that is production ready. Contrast this with the traditional dedicated resources in deployment environments in many enterprises and we see how managed the deployment environment has become. Moreover, it enables service discovery and load balancing, so as many instances of applications and services as needed can be run behind a load balancer. Anything that is hosted on Marathon automatically comes with health checks. Not only that, we can also subscribe to events and collect metrics. The entire Marathon framework is available via a REST API for programmability. 
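As a small illustration of that programmability, here is a hedged sketch that lists the applications Marathon is orchestrating; the master address is hypothetical and the call assumes the standard /v2/apps endpoint:
    // inside an async method; assumes a Marathon master at a hypothetical address
    using (var client = new HttpClient())
    {
        var response = await client.GetAsync("http://marathon.local:8080/v2/apps");
        response.EnsureSuccessStatusCode();
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }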
Perhaps the most interesting application is the hosting of a database service on Marathon containers. While load balancing for user services is a commonly understood practice, the same for a database service is less known. Marathon, however, treats all services as code that can run on any container hosted on the same underlying distributed layer. The data persists on a shared volume, and intra-cluster connectivity to the shared volume costs only trivial latency and redirection. That said, storage and networking efficiency still need to be carefully studied. Also, persistent volumes are used for applications to preserve their state, because otherwise they lose it when they are restarted. A local volume that is pinned to a node will be available when that node relaunches. This bundles up the disk and the compute for that node.

#codingexercise
Find the first non-repeating character in a string
char GetNonRepeating(string str)
{
var h = new Dictionary<char, int>();
for (int i = 0; i < str.Length; i++)
       if (h.ContainsKey(str[i]))
           h[str[i]] += 1;
       else
           h.Add(str[i], 1);
// a second pass preserves the original order of the characters
char c = '\0';
for (int i = 0; i < str.Length; i++)
       if (h[str[i]] == 1) {
             c = str[i];
             break;
       }
return c;
}
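For example, GetNonRepeating("geeksforgeeks") returns 'f', the first character that occurs exactly once.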

Wednesday, June 7, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discussing data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework and Big Data to use with data sources. For example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued looking at some more specialized infrastructure, including dedicated private cloud. Then we added serverless computing to the mix. Today we continue the discussion.
Applications have evolved with cloud computing. What used to be monolithic and deployed with a mere separation between application-dedicated virtual machines and database-dedicated storage was made more modular and separated into deep vertical partitions with their own operating systems. With twelve-factor applications, it was easier to take advantage of containers. This worked well with platform as a service and Docker containers. It is possible, however, to go further and decompose the application modules into compute- and data-access-intensive functions that can be offloaded into their own containers with both function as a service and backend as a service. The ease of modification is very appealing when we look at individual functions packaged in containers by themselves. Both major public clouds currently support this form of computing: AWS Lambda and Azure Functions can be executed in response to events at any scale.
There are a few tradeoffs in serverless computing that may be taken into perspective. First, we introduce latency into the system because the functions don't execute local to the application and require setup and teardown routines during invocations. Moreover, debugging of serverless computing functions is harder to perform because the functions respond to more than one application, and the call stack is not available or may have to be put together by looking at different compute resources. The same goes for monitoring, because we now rely on separate systems. We can contrast this with applications that are hosted behind load balancer services to improve availability. The service registered for load balancing is the same code on every partition. The call stack is coherent even if it spans different servers. Moreover, these share the same persistence even if the entire database server is also hosted on, say, Marathon with the storage on a shared volume. The ability of Marathon to bring up instances as appropriate, along with the health checks, improves the availability of the application. The choice between platform as a service, a Marathon cluster based deployment, or serverless computing depends on the application.
  #codingexercise
Given a preorder traversal of a BST, find the inorder traversal
List<int> GetInOrderFromPreOrder(List<int> A)
{
if (A == null) return A;
// the inorder traversal of a BST visits keys in sorted order
A.Sort();
return A;
}
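For example, the preorder traversal 10, 5, 1, 7, 40, 50 of a BST yields the inorder traversal 1, 5, 7, 10, 40, 50.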
// naive O(exp) repeated multiplication; base is a C# keyword, hence the parameter name b
int Power(int b, uint exp)
{
int result = 1;
for (uint i = 0; i < exp; i++)
      result = result * b;
return result;
}
// binary exponentiation: square the base and halve the exponent each step
int FastPower(int b, uint exp)
{
int result = 1;
while (exp > 0)
{
 if ((exp & 1) != 0)
      result = result * b;
 b = b * b;
 exp = exp >> 1;
}
return result;
}
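For example, FastPower(3, 10) computes 59049 with four squarings of the base instead of ten successive multiplications.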

Tuesday, June 6, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discussing data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework and Big Data to use with data sources. For example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued looking at some more specialized infrastructure, including dedicated private cloud. Then we added serverless computing to the mix. Today we continue the discussion. This time we focus on Docker support.
OpenWhisk supports Docker actions. This means we can execute binaries on demand without provisioning virtual machines. Docker actions are best suited where it is difficult to refactor an application into a smaller set of functions. This is a common use case for existing applications and services. 
When we request images from Docker for executing an action, the action takes longer to start because the latency is high; it depends on the size of the image and the network bandwidth. Contrast this with the pool of warm containers that don't require a cold start. Moreover, Docker images may not be postable to a public hub because the code to execute on them may be proprietary, and publishing it would violate security. These concerns were mitigated by OpenWhisk providing a base image for Docker actions. Also, a Docker action can now receive a zip file with an executable.
The suggestion here is that we don't need to create custom images, which saves on latency since a base image is already provided. Also, the executable can be switched. By not customizing images and not sharing them, we don't compromise security. In addition, since only the executables are switched, the time it takes to start executing the code is less.

#codingexercise
A bot is an id that visits the site m times in the last n seconds. Given a list of entries in the log sorted by time, return all the bot ids.
Yesterday we solved this with iteration over the relevant window of the log. This is a typical question on logs and events, both of which are stored in a time series database.
A time series database helps with specialized queries on the data. Unlike relational data that serves an OLTP system, a time series is a continuous stream of events, often arriving at a high rate.
In the logs, bots generally identify themselves with their user agent string and obey the rules in the robots.txt file of the site. Consequently, we can differentiate the bots in the logs into those that behave and those that don't. The ones that do leave an identification string, which we can match:
      // assuming h maps each visitor id to its user agent string
      int count = 0;
      string pat = @"(?<bot_name>Google?)bot\W";
      Regex r = new Regex(pat, RegexOptions.IgnoreCase);
      foreach (var kvp in h)
      {
           Match m = r.Match(kvp.Value);
           if (m.Success)
               count++;
      }
one more:
Count the number of ways elements add up to N using array elements with repetitions allowed:
int GetCount(List<int> A, int sum)
{
// counts[i] holds the number of ordered ways to compose the value i
var counts = new int[sum + 1];
counts[0] = 1;
for (int i = 1; i <= sum; i++)
    for (int j = 0; j < A.Count; j++)
        if (i >= A[j])
              counts[i] += counts[i-A[j]];
return counts[sum];
}
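For example, with A = {1, 2, 3} and sum = 4, GetCount returns 7, counting the ordered compositions 1+1+1+1, 1+1+2, 1+2+1, 2+1+1, 2+2, 1+3 and 3+1.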
Alternatively, this can be done with backtracking instead of dynamic programming as we showed with the help of the Combine method involving repetitions in the earlier posts.

Monday, June 5, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discussing data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework and Big Data to use with data sources. For example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued looking at some more specialized infrastructure, including dedicated private cloud. Then we added serverless computing to the mix. Today we continue the discussion.
Serverless computing is open by design. The engine and the event emitter/consumer are open. The interface is open. Its components, such as Docker, Kafka and Consul, are all open. The tools used with this are also open.  
Since the emphasis is on actions, triggers and rules, and the deployment and runtime are managed, it is easy to upload code and use it. Actions are the event handlers. They can run on any platform. Typically they are hosted in a container. They can be chained to create sequences, to increase flexibility and foster reuse. An association of a trigger and an action is called a rule. Rules can be specified at the time the actions are registered. A package is a collection of actions and triggers. It allows you to outsource load- and calculation-intensive tasks, and it allows sharing and reuse. The only drawback is that troubleshooting is more tedious now, as there is more correlation to be done. However, actions can be both synchronous and asynchronous and expressed in their own language and runtime. This means we can get responses in blocking and non-blocking manner. The runtimes are found in the hosted container. 
In the standalone mode, the containers are made available with VirtualBox. In a distributed environment, they can come from a PaaS. These actions do not require a predeclared association with containers, which means the infrastructure does not need to know what the container names are. The execution of the action is taken care of by this layer.

#codingexercise
A bot is an id that visits the site m times in the last n seconds. Given a list of entries in the log sorted by time, return all the bot ids.
Dictionary<int, int> GetBots(Log[] logs, int m, int n)
{
// assuming each Log entry carries an id and a timestamp in seconds
var h = new Dictionary<int, int>();
int cutoff = logs[logs.Length - 1].timestamp - n;
// walk backwards over the window covering the last n seconds
for (int i = logs.Length - 1; i >= 0 && logs[i].timestamp >= cutoff; i--)
     if (h.ContainsKey(logs[i].id))
        h[logs[i].id] += 1;
     else
        h.Add(logs[i].id, 1);
// keep only the ids with at least m visits in the window
foreach (var id in new List<int>(h.Keys))
      if (h[id] < m)
         h.Remove(id);
return h;
}