Tuesday, June 20, 2017

We were reviewing DynamoDB, and we also reviewed Cosmos DB. Azure Cosmos DB was written to be the industry's first globally-distributed, multi-model, fully managed database service. The latter is built to collect data from a plethora of IoT devices as well as to handle machine-learning workloads. Today we compare DynamoDB and Cosmos DB.
First, Cosmos DB offers global distribution. As more and more regions are added, the same entity can be spread out over any number of regions. DynamoDB does not offer region-by-region addition for the same entity; each entity is local to a region.
Second, the point above also implies that storage and throughput can now be served by more than one region, whereas DynamoDB cannot make that claim. Although neither the first nor the second point is a functional database feature an application might use directly, they improve the availability and performance of the database to the application.
Third, spanning multiple regions also provides multi-homing capabilities that are otherwise missing from DynamoDB.
Fourth, there is no compromise on read/write latency, which Cosmos guarantees to be low even at the 99th percentile. No comparable metric is mentioned for DynamoDB.
Fifth, Cosmos offers five consistency models where DynamoDB offers only two of those, namely strong and eventual. The availability of more consistency models allows an application to pick the behaviour of the database that best suits its persistence requirements.
Sixth, Cosmos offers support for multiple models and APIs, as does DynamoDB. This is the flexibility that applications want, especially since these APIs are fast becoming standards.
Seventh, Cosmos does not require maintenance: it does not maintain any schemas, and indexes are automatically maintained for all data. DynamoDB, however, requires indexes to be managed.
Eighth, the service level agreements for Cosmos are far more comprehensive than for DynamoDB. Although the SLAs for NoSQL databases do not seem to have the same rigor as those for transactional systems, the guarantees on performance and availability make the database more appealing.
#codingexercise
Convert multiset to distinct combinations where order does not matter:
Solution: http://ideone.com/ARkeZe
Some more  http://ideone.com/lKxg5s
Also, it is possible to use a tree structure for the solution involving nested dictionaries shared yesterday.
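Since the linked solutions are external, here is a quick Python sketch of one way to enumerate the distinct combinations of a multiset by recursing over the count of each distinct value (the function name and recursion layout are my own, not from the linked solutions):

```python
from collections import Counter

def multiset_combinations(items):
    """Return all distinct order-insensitive combinations of a multiset."""
    counts = sorted(Counter(items).items())  # (value, multiplicity) pairs
    results = []

    def recurse(i, current):
        if i == len(counts):
            results.append(tuple(current))
            return
        value, limit = counts[i]
        # choose how many copies of this value to include: 0..limit
        for k in range(limit + 1):
            recurse(i + 1, current + [value] * k)

    recurse(0, [])
    return results
```

For example, `multiset_combinations([1, 1, 2])` yields six combinations, including the empty one, rather than the eight a naive subset enumeration would produce.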

Monday, June 19, 2017

Fulltext Search versus NLP in databases
Many databases in the 1990s started providing full-text search. It was a text retrieval technique that relied on searching all of the words in every document stored in a specific full-text database. Database engines and web search engines started providing this capability so that search did not just cover the metadata and selected sections but also the entire content. Since it was expensive to scan the entire content of every file one by one, the work was offloaded to its own process, separate from the OLTP workload, to create a concordance. Stop words and stemming helped to keep the index relevant. Querying also improved significantly with full text, and new forms of queries became available. Still, full text did not really make use of clustering or page-ranking techniques.
The difference in techniques between full text and NLP lies largely in the data structures used. While one uses inverted document lists and indexes, the other makes use of word vectors. The latter can be used to create similarity graphs and combined with inference techniques. The word vectors can also be used with a variety of data mining techniques. With the availability of managed database services in the public clouds, we are now empowered to create NLP databases in the cloud.
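To make the inverted-list side concrete, here is a tiny Python sketch of an inverted index over a made-up two-document corpus (the documents and terms are purely illustrative):

```python
from collections import defaultdict

# A toy corpus: document id -> text.
docs = {
    0: "dynamo is a key value store",
    1: "cosmos is a multi model store",
}

# Inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(term):
    """Return the sorted document ids containing the term."""
    return sorted(index.get(term, set()))
```

A word-vector approach would instead map each term to a dense vector and compare documents by similarity, which is what enables the graph and data-mining techniques mentioned above.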
Discussion to be continued ...
A text summarization attempt:  http://scratchpad.io/scintillating-spot-7410
API
UI
#codingexercise
Given a rectangular matrix with rowwise and columnwise sorted distinct integers, find a given integer
Tuple<int,int> GetPosition(int[,] A, int Rows, int Cols, int X)
{
// search each column for X with a binary search down its rows
for (int j = 0; j < Cols; j++)
{
int index = binary_search(A, Rows, j, X); // row index of X in column j, or -1
if (index != -1)
{
return Tuple.Create(index, j); // x, y coordinates
}
}
return Tuple.Create(-1, -1);
}
Sorting nested dictionary :  http://ideone.com/SlHBib

Sunday, June 18, 2017

We were reviewing DynamoDB. We will compare it with Cosmos DB, but first let us review Cosmos DB. Azure Cosmos DB was written to be the industry's first globally-distributed, multi-model database service. It is built to collect data from a plethora of IoT devices as well as to handle machine-learning workloads. It is a globally distributed data service that can elastically scale throughput and storage with data consistency, low latency and high availability. There are five different consistency levels: strong, bounded-staleness, session, consistent-prefix and eventual. No schema or index management is required because the database automatically indexes all data and the engine does not know or care about any schemas. Key-value, document and graph data models are easily supported.
Perhaps one of the undeniably desirable features of Cosmos DB is that developers get to choose the data model and the API. Developers are no longer bound to one model. They can easily switch from a key-value to a graph model as they write one application after another. This makes the DB second nature to developers and earns mindshare, which is just as important as acquiring data because the two are related. Similarly, they can also manipulate the data with the same set of APIs that they are already familiar with. This database does not impose any new APIs; we can use the same dialects as those of DocumentDB and MongoDB, and the APIs of the Gremlin graph database. These are in addition to the native Azure Table Storage API. Not only that, but other data access APIs can be added to the offerings going forward. Azure Cosmos is, per its documentation, the first and only globally distributed database service in the industry to offer financially backed comprehensive SLAs. This gives us the assurance that the APIs will not only be easier to work with but also highly available with low latency, high consistency and throughput. In other words, customers who were using a NoSQL database can now forgo their own maintenance and adopt a managed Database-as-a-Service layer with little or no disruption to their existing applications. In our discussion of dataset architecture here, we saw important concepts around new trends, but nothing seems to bring it all together like this managed service platform from Azure Cosmos.

A text summarization attempt:  http://scratchpad.io/scintillating-spot-7410
API
UI

#codingexercise
Given a rectangular matrix with rowwise and columnwise sorted distinct integers, find a given integer
Tuple<int,int> GetPosition(int[,] A, int Rows, int Cols, int X)
{
// search each row for X with a binary search across its columns
for (int i = 0; i < Rows; i++)
{
int index = binary_search(A, Cols, i, X); // column index of X in row i, or -1
if (index != -1)
{
return Tuple.Create(i, index); // x, y coordinates
}
}
return Tuple.Create(-1, -1);
}
We can also do binary search along columns as well as rows. 
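Beyond per-row or per-column binary search, the sortedness in both dimensions also admits a different, well-known technique: an O(Rows + Cols) staircase walk starting from the top-right corner. A quick Python sketch (not taken from the snippets above):

```python
def staircase_search(matrix, x):
    """Find x in a row-wise and column-wise sorted matrix.

    Starting at the top-right corner, each comparison eliminates either
    a full row or a full column, so the walk is O(rows + cols).
    """
    if not matrix or not matrix[0]:
        return (-1, -1)
    row, col = 0, len(matrix[0]) - 1
    while row < len(matrix) and col >= 0:
        if matrix[row][col] == x:
            return (row, col)
        if matrix[row][col] > x:
            col -= 1   # everything below in this column is even larger
        else:
            row += 1   # everything to the left in this row is even smaller
    return (-1, -1)
```

For example, in `[[1, 4, 7], [2, 5, 8], [3, 6, 9]]` the value 5 is found at position (1, 1) after only two eliminations.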

Saturday, June 17, 2017

We were reviewing DynamoDB. We will compare it with Cosmos DB, but first let us review Cosmos DB. Azure Cosmos DB was written to be the industry's first globally-distributed, multi-model database service. It is built to collect data from a plethora of IoT devices as well as to handle machine-learning workloads. It is a globally distributed data service that can elastically scale throughput and storage with data consistency, low latency and high availability. There are five different consistency levels: strong, bounded-staleness, session, consistent-prefix and eventual. No schema or index management is required because the database automatically indexes all data and the engine does not know or care about any schemas. Key-value, document and graph data models are easily supported.
This database allows the developers to add any number of regions while retaining the same fidelity as the one closest. The database also seamlessly replicates data across regions. It elastically scales throughput and storage. The applications can also be built to be more responsive. The database engine is said to be write-optimized, log structured and latch-free which enables it to guarantee writes in single digit millisecond latencies anywhere in the world. 

#codingexercise
Node BuildTree(char[] Preorder, char[] InOrder, int index = 0)
{
 Node root = new Node();
 root.data = Preorder[index];
 root.left = null;
 root.right = null;

 int inIndex = Array.IndexOf(InOrder, Preorder[index]);

 if (index + 1 < Preorder.Length &&
       IsLeftSubtree(Preorder[index + 1], InOrder, inIndex))
       root.left = BuildTree(Preorder, InOrder, index + 1);

 if (inIndex + 1 < InOrder.Length &&
       IsPredecessor(InOrder[inIndex + 1], Preorder, index) == false)
       root.right = BuildTree(Preorder, InOrder, Array.IndexOf(Preorder, InOrder[inIndex + 1]));

 return root;
}

bool IsPredecessor(char c, char[] Preorder, int index)
{
return Array.IndexOf(Preorder, c) < index;
}

bool IsLeftSubtree(char c, char[] InOrder, int index)
{
return Array.IndexOf(InOrder, c) < index;
}

// Preorder: A B C D E F
// Inorder:  C B D A E F
// Built tree:
//         A
//       B   E
//      C D    F
Node BuildTree(char[] Preorder, char[] InOrder, char parent, int index, int start, int end)
{
 Node root = new Node();
 char current = Preorder[index];
 root.data = current;
 root.left = null;
 root.right = null;
 if (start == end) return root;
 int size_left = IndexOfFromStart(InOrder, current, parent, index, start, end);
 if (size_left > 0)
    root.left = BuildTree(Preorder, InOrder, current, index + 1, index + 1, index + size_left);
 int size_right = end - start - size_left;
 if (size_right > 0)
    root.right = BuildTree(Preorder, InOrder, current, index + size_left + 1, index + size_left + 1, end);
return root;
}
int IndexOfFromStart(char[] InOrder, char c, char parent, int index, int start, int end)
{
if (start >= end) return 0;
int inIndex = Array.IndexOf(InOrder, c);
// start and end are offsets in pre-order; inIndex comes from inorder,
// but the size of the left subtree must be contained in preorder between
// start and end and must lie between 0 and inIndex in inorder
if (inIndex < start) return 0;
Debug.Assert(inIndex <= end);
int begin = 0;
if (parent != '\0')
    begin = Array.IndexOf(InOrder, parent);
if (begin < inIndex)
    return inIndex - begin - 1;
return inIndex;
}
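The index arithmetic above is easy to get wrong. A compact Python sketch of the same preorder/inorder reconstruction, using slicing instead of explicit index bookkeeping and representing each node as a (value, left, right) tuple, is easier to check against the example:

```python
def build_tree(preorder, inorder):
    """Rebuild a binary tree from its preorder and inorder traversals.

    The first preorder element is the root; its position in inorder
    splits the remaining elements into left and right subtrees.
    Nodes are returned as (value, left, right) tuples.
    """
    if not preorder:
        return None
    root = preorder[0]
    i = inorder.index(root)
    left = build_tree(preorder[1:1 + i], inorder[:i])
    right = build_tree(preorder[1 + i:], inorder[i + 1:])
    return (root, left, right)
```

Running it on the worked example (preorder A B C D E F, inorder C B D A E F) produces the tree drawn above: A with left child B (children C and D) and right child E (right child F).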



Friday, June 16, 2017

In continuation of our discussion on data architecture covered in previous posts, let us take an example and study it. We will review Amazon Dynamo as an example of a cloud-based data storage and distributed system. Dynamo addresses the need for a reliable, scalable, highly available key-value storage system. It provides the ability to trade off cost, consistency, durability and performance while maintaining high availability. With a cloud-based business such as Amazon, compute and storage are distributed in several datacenters around the world. At this scale, there are many fault domains, and yet the persistent state has to be managed with high reliability and availability. To maintain this standard, Dynamo sacrifices consistency under certain conditions. It makes extensive use of object versioning and application-assisted conflict resolution. It is used to manage the state of services that have very high reliability requirements and need tight control over the tradeoffs between availability, consistency, cost-effectiveness and performance. For a store like Amazon, a relational data store is limiting in terms of scale and availability. Dynamo provides a simple primary-key interface to meet this need.
We saw how conflict resolution is handled in Dynamo. When the conflict resolution happens and who does the conflict resolution is determined by the needs of the services using Dynamo.
Other design considerations include the ability to scale out the storage with no downtime, uniform nodes and no special roles for any set of nodes, the ability to decentralize control by way of simpler scalable peers and exploit heterogeneity in infrastructure.
The peer-to-peer network formed for this storage does not depend on flooding the network with queries to find the peer with the data. Instead it relies on a structure and a globally consistent protocol to ensure that any node can route the search query to some peer that has the related data.
#codingexercise
1) Given a nested list of integers, calculate the sum of all the integers, multiplying each number by its depth.

For example:

Here’s the depth sum of the nested list {1, {2, 3}}:

{1, {2, 3}} => 1*1 + 2*2 + 3*2 = 11

Here’s the depth sum of the nested list {1, {2, {3}}}:

{1, {2, {3}}} =>  1*1 + 2*2 + 3*3 = 14

static int CalcSum(Nested n, int depth)
{
int total = n.value * depth;
if (n.IsNested())
{
int newdepth = depth + 1; // children sit one level deeper
foreach (var nl in n.nested)
{
   total += CalcSum(nl, newdepth);
}
}
return total;
}
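The same depth-weighted sum can be sketched in Python over plain nested lists, which also makes the two worked examples easy to verify:

```python
def depth_sum(nested, depth=1):
    """Sum the integers in a nested list, weighting each by its depth."""
    total = 0
    for item in nested:
        if isinstance(item, list):
            total += depth_sum(item, depth + 1)  # nested list: go one level deeper
        else:
            total += item * depth                # plain integer at this depth
    return total
```

For the examples above, `depth_sum([1, [2, 3]])` gives 1*1 + 2*2 + 3*2 = 11 and `depth_sum([1, [2, [3]]])` gives 1*1 + 2*2 + 3*3 = 14.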
2) Find sentences with the highest density of words from a list : solution

Thursday, June 15, 2017

We were reviewing Amazon Dynamo as an example of a cloud based data storage and distributed system. Dynamo is a single highly available system for this entire Amazon store and it is designed to handle all loads including peak loads. Reading from their paper, we saw that Dynamo addresses the need for a reliable, scalable, highly available key-value storage system. It provides the ability to trade-off cost, consistency, durability and performance while maintaining high availability.
Let us look at the techniques involved to achieve this scale and availability. Data is partitioned and replicated using consistent hashing, and consistency is facilitated by object versioning. Replicas are maintained during updates based on a quorum-like technique. The replicas are kept in sync with a decentralized synchronization protocol. Dynamo employs gossip-based distributed failure detection and membership. Storage nodes can be added and removed from Dynamo without requiring any manual partitioning or redistribution. Conflicts need to be handled by determining when to resolve them and who resolves them. Eventually a consistent state will be achieved. While some systems resolve conflicts during writes, Dynamo is designed to be an always-writeable store.
The writes are important to services because they could involve customer updates which cannot be lost or rejected. Therefore conflict resolution is not handled in writes; instead, it is done during reads. To determine who performs the conflict resolution, we can choose between the application and the datastore. The datastore is not context-aware, so it may choose to use the last write or other such straightforward mechanisms. The application, however, is savvy about the user context and can determine the best resolution. Therefore, the conflict resolution is performed by the application.
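Consistent hashing, mentioned above as Dynamo's partitioning technique, can be illustrated with a toy hash ring in Python. This is only a sketch of the general idea, not Dynamo's implementation; the node names, replica count and use of MD5 are illustrative choices:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring: a key maps to the first node clockwise.

    Each node is placed on the ring at several points (virtual nodes)
    so that adding or removing a node only remaps a fraction of keys.
    """
    def __init__(self, nodes, replicas=3):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for r in range(replicas):
                h = self._hash(f"{node}:{r}")
                bisect.insort(self.ring, (h, node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        """Walk clockwise from the key's hash to the next node, wrapping."""
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, ""))
        return self.ring[i % len(self.ring)][1]
```

Because placement depends only on hashes, any peer can compute which node owns a key without a central directory, which is what lets storage nodes join and leave without manual redistribution.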

#codingexercise
Find a number in a rotated sorted array
int GetNumber(List<int> sortedRotated, int num)
{
int n = sortedRotated.Count;
int pivot = find_pivot(sortedRotated, 0, n-1); // index of the largest element
if (pivot == -1)
      return binary_search(sortedRotated, 0, n-1, num); // not rotated at all
if (sortedRotated[pivot] == num)
      return pivot;
if (sortedRotated[0] <= num)
     return binary_search(sortedRotated, 0, pivot-1, num); // num lies in the first ascending run
else
     return binary_search(sortedRotated, pivot+1, n-1, num); // num lies in the second ascending run
}
int find_pivot(List<int> sortedRotated, int start, int end)
{
 if (start > end)
     return -1;
 if (start == end) return start;
int mid = (start + end) / 2;
if (mid < end && sortedRotated[mid] > sortedRotated[mid+1]) return mid;
if (mid > start && sortedRotated[mid-1] > sortedRotated[mid]) return mid-1;
if (sortedRotated[start] >= sortedRotated[mid])
    return find_pivot(sortedRotated, start, mid-1);
return find_pivot(sortedRotated, mid+1, end);
}
#alternative : http://ideone.com/e.js/F7V2aj
Also, we can use another technique that relies on concatenating the sortedRotated array with itself; the doubled array then contains a fully sorted subarray holding all of the elements:
eg: 8, 9, 9, 1, 3, 4, 4, 4, 6, 6, 7, 7
doubled: 8, 9, 9, 1, 3, 4, 4, 4, 6, 6, 7, 7, 8, 9, 9, 1, 3, 4, 4, 4, 6, 6, 7, 7
sorted window: 1, 3, 4, 4, 4, 6, 6, 7, 7, 8, 9, 9
Since all the elements in this window are sorted, finding the index of the element within it is a binary search. That index can then be added to the index of the pivot point (modulo the original length) to recover the position in the original array.
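The doubled-array idea can be sketched in Python. For brevity this sketch finds the pivot with a linear scan for the minimum; the binary-search pivot finder above would do it in O(log n):

```python
import bisect

def find_in_rotated(a, x):
    """Locate x in a sorted-then-rotated list via the sorted-window idea."""
    n = len(a)
    if n == 0:
        return -1
    pivot = min(range(n), key=lambda i: a[i])  # index of the smallest element
    window = a[pivot:] + a[:pivot]             # fully sorted view of all elements
    i = bisect.bisect_left(window, x)
    if i < n and window[i] == x:
        return (pivot + i) % n                 # map back into the original array
    return -1
```

For example, in [8, 9, 1, 3, 4, 6, 7] the pivot is at index 2, the sorted window is [1, 3, 4, 6, 7, 8, 9], and 3 maps back to index 3 of the original array.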

DFS in a matrix with blockers
void DFS(int[,] A, int row, int col, ref bool[,] visited, ref int size)
{
// the eight neighboring cell offsets (the cell itself is excluded)
var rd = new int[] { -1, -1, -1, 0, 0, 1, 1, 1 };
var cd = new int[] { -1, 0, 1, -1, 1, -1, 0, 1 };

visited[row, col] = true;
for (int k = 0; k < 8; k++)
    if (isSafe(A, visited, row + rd[k], col + cd[k])) // in bounds, not a blocker, not visited
    {
           size++;
           DFS(A, row + rd[k], col + cd[k], ref visited, ref size);
    }
}
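The same 8-way traversal is easy to check in Python as a recursive flood fill over a 0/1 grid, where 0 cells are the blockers (the grid in the test is made up for illustration):

```python
def region_size(grid, row, col, visited=None):
    """Count cells reachable from (row, col) over nonzero cells, 8-way."""
    if visited is None:
        visited = set()
    rows, cols = len(grid), len(grid[0])
    if not (0 <= row < rows and 0 <= col < cols):
        return 0                      # out of bounds
    if (row, col) in visited or grid[row][col] == 0:
        return 0                      # already counted, or a blocker
    visited.add((row, col))
    size = 1
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr or dc:              # skip the cell itself
                size += region_size(grid, row + dr, col + dc, visited)
    return size
```

Note that diagonal adjacency matters here: in [[1, 1, 0], [0, 1, 0], [0, 0, 1]] the cell at (2, 2) joins the region through its diagonal neighbor (1, 1), giving a region of size 4.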

Wednesday, June 14, 2017

In continuation of our discussion on data architecture covered in previous posts, let us take an example and study it. We will review Amazon Dynamo as an example of a cloud-based data storage and distributed system. Dynamo addresses the need for a reliable, scalable, highly available key-value storage system. It provides the ability to trade off cost, consistency, durability and performance while maintaining high availability. With a cloud-based business such as Amazon, compute and storage are distributed in several datacenters around the world. At this scale, there are many fault domains, and yet the persistent state has to be managed with high reliability and availability. To maintain this standard, Dynamo sacrifices consistency under certain conditions. It makes extensive use of object versioning and application-assisted conflict resolution. It is used to manage the state of services that have very high reliability requirements and need tight control over the tradeoffs between availability, consistency, cost-effectiveness and performance. For a store like Amazon, a relational data store is limiting in terms of scale and availability. Dynamo provides a simple primary-key interface to meet this need.
Let us look at the techniques involved to achieve this scale and availability. Data is partitioned and replicated using consistent hashing, and consistency is facilitated by object versioning. Replicas are maintained during updates based on a quorum-like technique. The replicas are kept in sync with a decentralized synchronization protocol. Dynamo employs gossip-based distributed failure detection and membership. Storage nodes can be added and removed from Dynamo without requiring any manual partitioning or redistribution. This lets Dynamo handle the peak load traffic to the Amazon store. Usually replication is performed synchronously to provide a consistent data access interface, which trades off availability under certain failures. It is well known that consistency and availability are hard to achieve together. Availability can be increased by optimistic replication techniques where changes are allowed to propagate to replicas in the background, and disconnection is tolerated. The conflicts arising from this now need to be handled by determining when to resolve them and who resolves them. Eventually a consistent state will be achieved. While some systems resolve conflicts during writes, Dynamo is designed to be an always-writeable store.
Dynamo has been the underlying storage technology for a number of the core services of Amazon e-commerce platform. This store handles several millions of checkouts a day.  Dynamo is a single highly available system for this entire store.
Courtesy: Amazon paper on Dynamo
#codingexercise
Count the number of bridges possible between cities on the northern and southern banks of a river such that the bridges don't intersect. The southern cities are a permutation of the northern cities, and a bridge may be drawn only between matching cities.
  cities A  B  C  D  E  F  G  H
  pairs  H  A  B  F  E  D  G  C
  best   1  1  2  3  3  3  4  3
   A-A
   B-B
   D-D
   G-G

Longest increasing subsequence
int GetBridges(List<char> northernCities, List<char> southernCities)
{
return Math.Min(GetBridgesLIS(northernCities), GetBridgesLIS(southernCities));
}
int GetBridgesLIS(List<char> A)
{
var best = new int[A.Count];
for (int i = 0; i < best.Length; i++)
       best[i] = 1;
for (int i = 1; i < A.Count; i++)
    for (int j = 0; j < i; j++)
         if (A[i] > A[j])
          {
               best[i] = Math.Max(best[i], best[j] + 1);
          }
return best.Max();
}
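The LIS computation can be checked quickly in Python against the table above, where the southern ordering H A B F E D G C yields the four non-crossing bridges A-A, B-B, D-D, G-G:

```python
def longest_increasing_subsequence(seq):
    """O(n^2) dynamic-programming length of the longest increasing subsequence."""
    best = [1] * len(seq)  # best[i] = LIS length ending at position i
    for i in range(1, len(seq)):
        for j in range(i):
            if seq[i] > seq[j]:
                best[i] = max(best[i], best[j] + 1)
    return max(best, default=0)
```

Each position records the longest chain that can end there; the answer for the bridges problem is the maximum over all positions.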