Sunday, June 18, 2017

We were reviewing DynamoDB. We will compare it with Cosmos DB, but first let us review Cosmos DB. Azure Cosmos DB was written to be the industry's first globally-distributed, multi-model database service. It is built to collect data from a plethora of IoT devices as well as to handle machine-learning processing. It is a globally distributed data service that can elastically scale throughput and storage with data consistency, low latency and high availability. There are five different consistency levels: strong, bounded-staleness, session, consistent-prefix and eventual. No schema or index is required because the database automatically indexes all data and the engine does not know or care about any schemas. Key-value, document and graph data models are easily supported.
Perhaps one of the most desirable features of Cosmos DB is that developers choose the data model and the API. Developers are no longer bound to one model. They can easily switch from a key-value to a graph model as they write one application after another. This makes the database second nature to developers and earns mindshare, which is just as important as acquiring data because the two are related. Similarly, they can manipulate the data with the same set of APIs that they are already familiar with. This database does not impose any new APIs; we can use the same dialects, such as those of DocumentDB, MongoDB and the Gremlin graph API. These are in addition to the native Azure Table Storage API. Not only this, but other data access APIs can be added to the offerings going forward. Azure Cosmos DB is the first and only globally distributed database service in the industry to offer financially backed comprehensive SLAs, as per its documentation. This gives us the assurance that the APIs will not only be easier to work with but also highly available with low latency, high consistency and throughput. In other words, customers who were using a NoSQL database can now forgo their own maintenance and adopt a managed database-as-a-service layer with little or no disruption to their existing applications. In our discussion of data architecture here, we saw important concepts around new trends, but nothing seems to bring it all together quite like this managed service platform from Azure Cosmos DB.
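As a hedged illustration of picking one of those consistency levels per client (the endpoint, key and connection settings below are placeholders, and this uses just one of the supported APIs, the DocumentDB .NET SDK), a minimal sketch could look like this:

using System;
using Microsoft.Azure.Documents;          // ConsistencyLevel
using Microsoft.Azure.Documents.Client;   // DocumentClient, ConnectionPolicy

class CosmosConsistencySketch
{
    static void Main()
    {
        // Placeholders: substitute the account endpoint and key from the Azure portal.
        var endpoint = new Uri("https://<your-account>.documents.azure.com:443/");
        var key = "<your-primary-key>";

        // Session is one of the five levels the service offers
        // (Strong, BoundedStaleness, Session, ConsistentPrefix, Eventual).
        var client = new DocumentClient(endpoint, key,
                                        new ConnectionPolicy { ConnectionMode = ConnectionMode.Direct },
                                        ConsistencyLevel.Session);

        // Documents can now be read and written through the familiar DocumentDB API.
    }
}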

A text summarization attempt:  http://scratchpad.io/scintillating-spot-7410

#codingexercise
Given a rectangular matrix with row-wise and column-wise sorted distinct integers, find a given integer
Tuple<int, int> GetPosition(int[,] A, int rows, int cols, int x)
{
    var ret = Tuple.Create(-1, -1);   // (row, column) of the match, or (-1, -1) when absent
    for (int i = 0; i < rows; i++)
    {
        int index = BinarySearchRow(A, i, cols, x);
        if (index != -1)
        {
            ret = Tuple.Create(i, index);
            break;
        }
    }
    return ret;
}

// Standard binary search confined to row i of the matrix.
int BinarySearchRow(int[,] A, int i, int cols, int x)
{
    int lo = 0, hi = cols - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (A[i, mid] == x) return mid;
        if (A[i, mid] < x) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}
We can also do binary search along columns as well as rows. 
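As a hedged aside not in the original post, the standard staircase search is another alternative for a matrix sorted along both rows and columns; a minimal sketch:

// Staircase search: start at the top-right corner; move left when the current value is
// too large and down when it is too small. Takes O(rows + cols) comparisons.
Tuple<int, int> GetPositionStaircase(int[,] A, int rows, int cols, int x)
{
    int r = 0, c = cols - 1;
    while (r < rows && c >= 0)
    {
        if (A[r, c] == x) return Tuple.Create(r, c);
        if (A[r, c] > x) c--;   // everything below in this column is even larger
        else r++;               // everything to the left in this row is even smaller
    }
    return Tuple.Create(-1, -1);
}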

Saturday, June 17, 2017

We were reviewing DynamoDB. We will compare it with Cosmos DB, but first let us review Cosmos DB. Azure Cosmos DB was written to be the industry's first globally-distributed, multi-model database service. It is built to collect data from a plethora of IoT devices as well as to handle machine-learning processing. It is a globally distributed data service that can elastically scale throughput and storage with data consistency, low latency and high availability. There are five different consistency levels: strong, bounded-staleness, session, consistent-prefix and eventual. No schema or index is required because the database automatically indexes all data and the engine does not know or care about any schemas. Key-value, document and graph data models are easily supported.
This database allows developers to add any number of regions while retaining the same fidelity as the region closest to them. The database also seamlessly replicates data across regions and elastically scales throughput and storage, so applications can be built to be more responsive. The database engine is said to be write-optimized, log-structured and latch-free, which enables it to guarantee writes at single-digit millisecond latencies anywhere in the world.

#codingexercise
Node BuildTree(char[] Preorder, char[] InOrder, int index = 0)
{
    var root = new Node();
    root.data = Preorder[index];
    root.left = null;
    root.right = null;

    int inIndex = Array.IndexOf(InOrder, Preorder[index]);

    if (index + 1 < Preorder.Length &&
        IsLeftSubtree(Preorder[index + 1], InOrder, inIndex))
        root.left = BuildTree(Preorder, InOrder, index + 1);

    if (inIndex + 1 < InOrder.Length &&
        IsPredecessor(InOrder[inIndex + 1], Preorder, index) == false)
        root.right = BuildTree(Preorder, InOrder, Array.IndexOf(Preorder, InOrder[inIndex + 1]));

    return root;
}

// True when c appears before the current root in the preorder sequence.
bool IsPredecessor(char c, char[] Preorder, int index)
{
    return Array.IndexOf(Preorder, c) < index;
}

// True when c appears before the current root in the inorder sequence,
// i.e. it belongs to the left subtree.
bool IsLeftSubtree(char c, char[] InOrder, int index)
{
    return Array.IndexOf(InOrder, c) < index;
}

// Preorder A B C D E F
// Inorder   C B D A E F
// Build tree
      a
   b   e
 c  d   f
// Revised approach: track the preorder range (start..end) of the current subtree and use
// the parent to bound how many inorder predecessors belong to the left subtree.
// Initial call: BuildTree(Preorder, InOrder, '\0', 0, 0, Preorder.Length - 1)
Node BuildTree(char[] Preorder, char[] InOrder, char parent, int index, int start, int end)
{
    var root = new Node();
    char current = Preorder[index];
    root.data = current;
    root.left = null;
    root.right = null;
    if (start == end) return root;
    int size_left = IndexOfFromStart(InOrder, current, parent, index, start, end);
    if (size_left > 0)
        root.left = BuildTree(Preorder, InOrder, current, index + 1, index + 1, index + size_left);
    int size_right = end - start - size_left;
    if (size_right > 0)
        root.right = BuildTree(Preorder, InOrder, current, index + size_left + 1, index + size_left + 1, end);
    return root;
}

// Size of the left subtree of 'current': the number of inorder entries between the
// parent (exclusive) and the current node. start and end are preorder positions; the
// left subtree of 'current' must be contained in that range and must precede 'current'
// in the inorder sequence.
int IndexOfFromStart(char[] InOrder, char current, char parent, int index, int start, int end)
{
    if (start >= end) return 0;
    int inIndex = Array.IndexOf(InOrder, current);
    if (inIndex < start) return 0;
    if (parent != '\0')
    {
        int begin = Array.IndexOf(InOrder, parent);
        if (begin < inIndex)
            return inIndex - begin - 1;
    }
    return inIndex;
}



Friday, June 16, 2017

In continuation of our discussion on data architecture covered in previous posts, let us take an example and study it. We will review Amazon Dynamo as an example of a cloud-based data storage and distributed system. Dynamo addresses the need for a reliable, scalable, highly available key-value storage system. It provides the ability to trade off cost, consistency, durability and performance while maintaining high availability. With a cloud-based business such as Amazon, the compute and storage are distributed across several datacenters around the world. At this scale there are many fault domains, and yet the persistent state has to be managed with high reliability and availability. To maintain this standard, Dynamo sacrifices consistency under certain conditions. It makes extensive use of object versioning and application-assisted conflict resolution. It is used to manage the state of services that have very high reliability requirements and need tight control over the tradeoffs between availability, consistency, cost effectiveness and performance. For a store like Amazon, a relational data store is limiting in terms of scale and availability. Dynamo provides a simple primary-key interface to meet this need.
We saw how conflict resolution is handled in Dynamo. When conflict resolution happens and who performs it are determined by the needs of the services using Dynamo.
Other design considerations include the ability to scale out the storage with no downtime, uniform nodes and no special roles for any set of nodes, the ability to decentralize control by way of simpler scalable peers and exploit heterogeneity in infrastructure.
The peer-to-peer network formed for this storage does not depend on flooding the network with queries to find the peer with the data. Instead it relies on a structure and a globally consistent protocol to ensure that any node can route the search query to some peer that has the related data.
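To make the partitioning and routing idea concrete, here is a minimal, hedged sketch of a consistent-hash ring of the kind Dynamo-style stores use to map a key to the node that owns it; the virtual-node count and the FNV hash are illustrative choices, not taken from the paper.

using System;
using System.Collections.Generic;
using System.Linq;

// Each node is placed at several virtual positions on a 32-bit ring; a key belongs to
// the first node encountered clockwise from the key's hash.
class ConsistentHashRing
{
    private readonly SortedDictionary<uint, string> ring = new SortedDictionary<uint, string>();

    public void AddNode(string node, int virtualNodes = 16)
    {
        for (int v = 0; v < virtualNodes; v++)
            ring[Hash(node + "#" + v)] = node;
    }

    public string GetOwner(string key)
    {
        uint h = Hash(key);
        foreach (var position in ring.Keys)   // keys enumerate in sorted order
            if (position >= h)
                return ring[position];
        return ring[ring.Keys.First()];       // wrap around the ring
    }

    // FNV-1a: a simple stand-in for the hash a real system would use.
    private static uint Hash(string s)
    {
        uint h = 2166136261;
        foreach (char c in s) { h ^= c; h *= 16777619; }
        return h;
    }
}

Adding or removing a node only remaps the keys adjacent to its positions on the ring, which is what lets storage nodes join and leave without manual repartitioning.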
#codingexercise
1) Given a nested list of integers, calculate the sum of all the integers, multiplying each number by its depth.

For example:

Here’s the depth sum of the nested list {1, {2, 3}}:

{1, {2, 3}} => 1*1 + 2*2 + 3*2 = 11

Here’s the depth sum of the nested list {1, {2, {3}}}:

{1, {2, {3}}} =>  1*1 + 2*2 + 3*3 = 14

static int CalcSum(Nested n, int depth)
{
    int total = n.value * depth;
    if (n.IsNested())
    {
        int newdepth = depth + 1;   // children sit one level deeper
        foreach (var nl in n.nested)
        {
            total += CalcSum(nl, newdepth);
        }
    }
    return total;
}
2) Find sentences with the highest density of words from a list : solution
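The original post links out for the solution; purely as a hedged sketch of one way to compute it (the tokenization and the keyword handling here are assumptions), the density of each sentence could be the fraction of its words that appear in the given list:

using System;
using System.Collections.Generic;
using System.Linq;

static class WordDensity
{
    // Density of a sentence = (words from the list that it contains) / (total words).
    public static IEnumerable<string> RankByDensity(IEnumerable<string> sentences, ISet<string> words)
    {
        return sentences.OrderByDescending(s =>
        {
            var tokens = s.Split(new[] { ' ', ',', '.', ';', '!', '?' },
                                 StringSplitOptions.RemoveEmptyEntries);
            if (tokens.Length == 0) return 0.0;
            int hits = tokens.Count(t => words.Contains(t.ToLowerInvariant()));
            return (double)hits / tokens.Length;
        });
    }
}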

Thursday, June 15, 2017

We were reviewing Amazon Dynamo as an example of a cloud based data storage and distributed system. Dynamo is a single highly available system for this entire Amazon store and it is designed to handle all loads including peak loads. Reading from their paper, we saw that Dynamo addresses the need for a reliable, scalable, highly available key-value storage system. It provides the ability to trade-off cost, consistency, durability and performance while maintaining high availability.
Let us look at the techniques involved to achieve this scale and availability. Data is partitioned and replicated using consistent hashing, and consistency is facilitated by object versioning. Replicas are maintained during updates based on a quorum-like technique. The replicas are kept in sync with a decentralized synchronization protocol. Dynamo employs a gossip-based protocol for distributed failure detection and membership. Storage nodes can be added and removed from Dynamo without requiring any manual partitioning or redistribution. Conflicts need to be handled by determining when to resolve them and who resolves them. Eventually a consistent state will be achieved. While some systems choose to resolve conflicts during writes, Dynamo is designed to be an always-writeable store.
The writes are important to services because they could involve customer updates which cannot be lost or rejected. Therefore conflict resolution is not handled during writes; instead, it is done during reads. To determine who performs the conflict resolution, we can choose between the application and the datastore. The datastore is not context-aware, so it may choose to use the last write or other such straightforward mechanisms. The application, however, is savvy about the user context and can determine the best resolution. Therefore, the conflict resolution is performed by the application.
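As a hedged illustration of application-assisted, read-time reconciliation (the types and the shopping-cart merge below are invented for the sketch, not Dynamo's actual API), a read could surface every version the store cannot order and let the caller merge them:

using System;
using System.Collections.Generic;
using System.Linq;

// A store that stays always writeable: writes are accepted unconditionally and
// divergent versions are only reconciled when they are read back.
class DivergentStore<T>
{
    private readonly List<T> versions = new List<T>();

    public void Put(T value) => versions.Add(value);

    // The application supplies the merge function, because only it knows the context.
    public T Get(Func<IEnumerable<T>, T> reconcile) => reconcile(versions);
}

class CartExample
{
    static void Main()
    {
        var store = new DivergentStore<List<string>>();
        store.Put(new List<string> { "book", "pen" });
        store.Put(new List<string> { "book", "lamp" });   // a conflicting write elsewhere

        // Merge conflicting carts by taking the union of their items.
        var cart = store.Get(vs => vs.SelectMany(v => v).Distinct().ToList());
        Console.WriteLine(string.Join(", ", cart));        // book, pen, lamp
    }
}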

#codingexercise
Find a number in a rotated sorted array
int GetNumber(List<int> sortedRotated, int num)
{
    int n = sortedRotated.Count;
    int pivot = FindPivot(sortedRotated, 0, n - 1);   // index of the largest element, -1 if not rotated
    if (pivot == -1)
        return BinarySearch(sortedRotated, 0, n - 1, num);
    if (sortedRotated[pivot] == num)
        return pivot;
    if (sortedRotated[0] <= num)
        return BinarySearch(sortedRotated, 0, pivot - 1, num);
    return BinarySearch(sortedRotated, pivot + 1, n - 1, num);
}
int FindPivot(List<int> A, int start, int end)
{
    if (start > end) return -1;
    if (start == end) return start;
    int mid = (start + end) / 2;
    if (mid < end && A[mid] > A[mid + 1]) return mid;
    if (mid > start && A[mid - 1] > A[mid]) return mid - 1;
    if (A[start] >= A[mid])
        return FindPivot(A, start, mid - 1);
    return FindPivot(A, mid + 1, end);
}
int BinarySearch(List<int> A, int start, int end, int num)
{
    if (start > end) return -1;
    int mid = (start + end) / 2;
    if (A[mid] == num) return mid;
    if (A[mid] < num) return BinarySearch(A, mid + 1, end, num);
    return BinarySearch(A, start, mid - 1, num);
}
#alternative : http://ideone.com/e.js/F7V2aj
Also, we can use another technique that conceptually concatenates the sortedRotated array with itself; the combined array then contains a window that holds all of the elements in sorted order.
eg: 8, 9, 9, 1, 3, 4, 4, 4, 6, 6, 7, 7
8, 9, 9, 1, 3, 4, 4, 4, 6, 6, 7, 7, 8, 9, 9, 1, 3, 4, 4, 4, 6, 6, 7, 7
           1, 3, 4, 4, 4, 6, 6, 7, 7, 8, 9, 9,
Since that window is sorted, locating the element within it is a plain binary search; the offset found there, added to the index of the pivot point and reduced modulo the original length, gives the position in the original array.
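A minimal sketch of that idea, using modular indexing into the "virtual" doubled array instead of materializing the copy; here pivot is assumed to be the index of the smallest element (the start of the sorted window), which is 0 when the array is not rotated.

// Binary search over the window [pivot, pivot + n - 1] of the virtual doubled array;
// indices are mapped back into the original array with % n.
static int SearchViaVirtualConcatenation(List<int> a, int pivot, int num)
{
    int n = a.Count;
    int lo = pivot, hi = pivot + n - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        int value = a[mid % n];
        if (value == num) return mid % n;
        if (value < num) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}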

DFS in a matrix with blockers
void DFS(int[,] A, int row, int col, bool[,] visited, ref int size)
{
    // offsets for the eight neighboring cells
    var rd = new int[] { -1, -1, -1, 0, 0, 1, 1, 1 };
    var cd = new int[] { -1, 0, 1, -1, 1, -1, 0, 1 };

    visited[row, col] = true;
    for (int k = 0; k < 8; k++)
        if (IsSafe(A, visited, row + rd[k], col + cd[k]))
        {
            size++;
            DFS(A, row + rd[k], col + cd[k], visited, ref size);
        }
}

// A cell is safe when it is inside the matrix, not a blocker (0) and not yet visited.
bool IsSafe(int[,] A, bool[,] visited, int row, int col)
{
    return row >= 0 && row < A.GetLength(0) &&
           col >= 0 && col < A.GetLength(1) &&
           A[row, col] != 0 && !visited[row, col];
}
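A possible driver for the DFS above (assumed here, not part of the original post), which finds the size of the largest unblocked region in the matrix:

static int LargestRegion(int[,] A)
{
    int rows = A.GetLength(0), cols = A.GetLength(1);
    var visited = new bool[rows, cols];
    int largest = 0;
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            if (A[r, c] != 0 && !visited[r, c])
            {
                int size = 1;                     // count the starting cell
                DFS(A, r, c, visited, ref size);
                largest = Math.Max(largest, size);
            }
    return largest;
}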

Wednesday, June 14, 2017

In continuation of our discussion on data architecture covered in previous posts, let us take an example and study it. We will review Amazon Dynamo as an example of a cloud-based data storage and distributed system. Dynamo addresses the need for a reliable, scalable, highly available key-value storage system. It provides the ability to trade off cost, consistency, durability and performance while maintaining high availability. With a cloud-based business such as Amazon, the compute and storage are distributed across several datacenters around the world. At this scale there are many fault domains, and yet the persistent state has to be managed with high reliability and availability. To maintain this standard, Dynamo sacrifices consistency under certain conditions. It makes extensive use of object versioning and application-assisted conflict resolution. It is used to manage the state of services that have very high reliability requirements and need tight control over the tradeoffs between availability, consistency, cost effectiveness and performance. For a store like Amazon, a relational data store is limiting in terms of scale and availability. Dynamo provides a simple primary-key interface to meet this need.
Let us look at the techniques involved to achieve this scale and availability. Data is partitioned and replicated using consistent hashing, and consistency is facilitated by object versioning. Replicas are maintained during updates based on a quorum-like technique. The replicas are kept in sync with a decentralized synchronization protocol. Dynamo employs a gossip-based protocol for distributed failure detection and membership. Storage nodes can be added and removed from Dynamo without requiring any manual partitioning or redistribution. This lets Dynamo handle the peak load traffic to the Amazon store. Usually replication is performed synchronously to provide a consistent data access interface, which trades off availability under certain failures. It is well known that consistency and availability are hard to achieve together. Availability can be increased by optimistic replication techniques where changes are allowed to propagate to the replicas in the background, and disconnected operation is tolerated. The conflicts that then arise need to be handled by determining when to resolve them and who resolves them. Eventually a consistent state will be achieved. While some systems choose to resolve conflicts during writes, Dynamo is designed to be an always-writeable store.
Dynamo has been the underlying storage technology for a number of the core services of the Amazon e-commerce platform. This store handles several million checkouts a day. Dynamo is a single highly available system for this entire store.
Courtesy: Amazon paper on Dynamo
#codingexercise
Count the number of bridges that can be built between cities on the northern and southern banks of a river such that no two bridges intersect. The southern cities are a random permutation of the northern cities, and a bridge is drawn only between the same city on the two banks.
  cities  A  B  C  D  E  F  G  H
  pairs   H  A  B  F  E  D  G  C
  best    1  1  2  3  3  3  4  3
   A-A
   B-B
   D-D
   G-G

Longest increasing subsequence
int GetBridges(List<char> northernCities, List<char> southernCities)
{
    return Math.Min(GetBridgesLIS(northernCities), GetBridgesLIS(southernCities));
}
int GetBridgesLIS(List<char> A)
{
    var best = new int[A.Count];
    for (int i = 0; i < best.Length; i++)
        best[i] = 1;
    for (int i = 1; i < A.Count; i++)
        for (int j = 0; j < i; j++)
            if (A[i] > A[j])
            {
                best[i] = Math.Max(best[i], best[j] + 1);
            }
    return best.Max();   // requires System.Linq
}

Tuesday, June 13, 2017

Data architectures in Cloud Computing. 
We were discussing that traditional data processing architectures have changed a lot from when they used to be part of the ubiquitous three-tier architecture involving databases, to being more distributed, scaled up and scaled out, sharded and hosted on private and public clouds, maintained on clusters and containers with shared volumes, hosted in memory and even becoming hybrid to involve SQL and NoSQL technologies. We continue reviewing some of the improvements in this field.
Today we look at data storage for social networking applications as an extreme of the storage needed for this purpose. We recap the considerations presented in a video by Facebook engineers about their data infrastructure needs. They have data arriving to the tune of several hundred terabytes and storage of over 300 petabytes, a fraction of which is processed. For example, their log storage flows into HDFS, which is massively distributed storage. They use Hadoop over HDFS together with Hive for data warehouse operations and SQL querying. This makes it easy for data tools and pipelines to work with this data stack.
Then they introduced Presto over HDFS for interactive analysis and Apache Giraph over Hadoop for graph analytics. We will discuss Presto and Giraph shortly, but let us take a look at the kind of data stored in these systems. All the images from the social network flow into Haystack, the users are stored in MySQL, and all the chat is stored in HBase. They use Scribe for log storage, Scuba for real-time slice and dice, and Puma for streaming analytics. These also give an indication of the major types of data processing involved with a social networking application.
Apache Giraph is an iterative graph processing system built for high scalability. It is used to analyze the graph formed by users and their connections.

#codingexercise
Given an array, find the maximum j-i such that arr[j] > arr[i]
int GetMaxDiff(List<int> A)
{
    int diff = int.MinValue;
    for (int i = 0; i < A.Count; i++)
        for (int j = A.Count - 1; j > i; j--)
        {
            if (A[j] > A[i] && j - i > diff)
            {
                diff = j - i;
                break;   // j is scanned from the right, so this is the best for this i
            }
        }
    return diff;
}
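A hedged sketch of the standard O(n) alternative (not in the original post), which precomputes a prefix minimum and a suffix maximum and then sweeps both with two pointers:

// leftMin[i] = min of A[0..i]; rightMax[j] = max of A[j..n-1].
// Advance j while leftMin[i] < rightMax[j] to widen j - i; otherwise advance i.
static int GetMaxDiffLinear(List<int> A)
{
    int n = A.Count;
    var leftMin = new int[n];
    var rightMax = new int[n];
    leftMin[0] = A[0];
    for (int k = 1; k < n; k++) leftMin[k] = Math.Min(leftMin[k - 1], A[k]);
    rightMax[n - 1] = A[n - 1];
    for (int k = n - 2; k >= 0; k--) rightMax[k] = Math.Max(rightMax[k + 1], A[k]);

    int i = 0, j = 0, best = -1;
    while (i < n && j < n)
    {
        if (leftMin[i] < rightMax[j])
        {
            best = Math.Max(best, j - i);
            j++;
        }
        else
        {
            i++;
        }
    }
    return best;
}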


Monday, June 12, 2017

Data architectures in Cloud Computing. 

We were discussing that traditional data processing architectures have changed a lot from when they used to be part of the ubiquitous three-tier architecture involving databases, to being more distributed, scaled up and scaled out, sharded and hosted on private and public clouds, maintained on clusters and containers with shared volumes, hosted in memory and even becoming hybrid to involve SQL and NoSQL technologies. We continue reviewing some of the improvements in this field.
One of the trends is the availability of augmented memory with the help of solid state drives. The SSD increases the speed of data access by an order of magnitude compared to disk. Therefore memory-intensive programs such as databases and data access technologies have the potential to make the most of this improvement. With no moving parts, SSDs are also less vulnerable to mechanical faults, which improves the reliability of persistence. Needless to say, it improves database performance because disks don't have to be spun up. Perhaps the more significant contribution is when memory is backed by SSD such that in-memory programs can take more advantage of cache. Even if the database is used 24x7 for long periods of time, this does not wear the SSD. The TRIM command is especially provided for sustained long-term performance and wear-leveling. TRIM support can be enabled on a continuous as well as a periodic basis.

 #codingexercise
Find the next greater element for every element in an integer array
int[] GetNextGreaterElements(List<int> A)
{
    var result = new int[A.Count];
    for (int i = 0; i < A.Count; i++)
    {
        int next = -1;
        for (int j = i + 1; j < A.Count; j++)
            if (A[j] > A[i])
            {
                next = A[j];
                break;
            }
        result[i] = next;
    }
    return result;
}
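A hedged sketch of the usual O(n) stack-based variant (not part of the original post), which scans from the right and keeps a stack of candidates:

using System.Collections.Generic;

// Scan right to left; pop candidates that are <= A[i]; the remaining top (if any) is the
// next greater element for A[i]; then push A[i] as a candidate for earlier elements.
static int[] GetNextGreaterElementsStack(List<int> A)
{
    var result = new int[A.Count];
    var candidates = new Stack<int>();
    for (int i = A.Count - 1; i >= 0; i--)
    {
        while (candidates.Count > 0 && candidates.Peek() <= A[i])
            candidates.Pop();
        result[i] = candidates.Count > 0 ? candidates.Peek() : -1;
        candidates.Push(A[i]);
    }
    return result;
}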