Monday, June 19, 2017

Fulltext Search versus NLP in databases
Many databases in the 1990s started providing full-text search. This text retrieval technique relied on searching all of the words in every document stored in a dedicated full-text index. Database engines and web search engines adopted this capability so that search covered not just the metadata and selected sections but the entire content. Since it was expensive to scan the entire content of every file one by one, the work was offloaded to its own process, separate from OLTP, to build a concordance. Stop words and stemming helped keep the index relevant. Querying also improved significantly with full-text search, and new forms of queries became available. Still, full-text search did not really make use of clustering or page-ranking techniques.
The difference between full-text search and NLP techniques lies largely in the data structures used. While one uses inverted document lists and indexes, the other makes use of word vectors. The latter can be used to build similarity graphs and combined with inference techniques, and word vectors also lend themselves to a variety of data mining techniques. With the availability of managed database services in the public clouds, we are now empowered to create NLP databases in the cloud.
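As a minimal sketch of that contrast (all names, documents and vector values here are illustrative): an inverted index answers "which documents contain this term", while word vectors answer "how similar are these terms" via cosine similarity.
// Illustrative only: term -> posting list, the inverted index behind full-text search.
var invertedIndex = new Dictionary<string, List<int>>
{
    ["database"] = new List<int> { 1, 3 },
    ["search"]   = new List<int> { 1, 2 }
};
var hits = invertedIndex["search"]; // documents 1 and 2

// Word vectors compare terms by cosine similarity instead of exact term matching.
double Cosine(double[] a, double[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
}
// Cosine(vector("database"), vector("datastore")) would be near 1 for related terms.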
Discussion to be continued ...
A text summarization attempt:  http://scratchpad.io/scintillating-spot-7410
API
UI
#codingexercise
Given a rectangular matrix with rowwise and columnwise sorted distinct integers, find a given integer
Tuple<int, int> GetPosition(int[,] A, int Rows, int Cols, int X)
{
    // Search each column with a binary search down its rows: O(Cols log Rows).
    for (int j = 0; j < Cols; j++)
    {
        int index = BinarySearchColumn(A, Rows, j, X);
        if (index != -1)
            return Tuple.Create(index, j); // (row, column) of the match
    }
    return Tuple.Create(-1, -1); // not found
}

// Plain binary search down column j of the sorted matrix.
int BinarySearchColumn(int[,] A, int Rows, int j, int X)
{
    int lo = 0, hi = Rows - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (A[mid, j] == X) return mid;
        if (A[mid, j] < X) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}
Sorting nested dictionary :  http://ideone.com/SlHBib

Sunday, June 18, 2017

We were reviewing DynamoDB. We will compare it with Cosmos DB, but first let us review Cosmos DB. Azure Cosmos DB was written to be the industry's first globally-distributed, multi-model database service. It is built to collect data from a plethora of IoT devices as well as to handle machine-learning workloads. It is a globally distributed data service that can elastically scale throughput and storage with data consistency, low latency and high availability. There are five consistency levels: strong, bounded-staleness, session, consistent-prefix and eventual. No schema or index management is required because the database automatically indexes all data and the engine does not know or care about any schemas. Key-value, document and graph data models are all supported.
Perhaps one of the most undeniably desirable features of Cosmos DB is that developers choose the data model and the API. Developers are no longer bound to one model: they can easily switch from key-value to graph as they write one application after another. This makes the database second nature to developers and earns mindshare, which is just as important as acquiring data because the two are related. Similarly, they can manipulate the data with the same APIs they are already familiar with. The database does not impose any new APIs; we can use the same dialects as DocumentDB, MongoDB and the Gremlin graph API, in addition to the native Azure Table Storage API. Moreover, other data access APIs can be added to the offerings going forward. Azure Cosmos DB is also, per its documentation, the first and only globally distributed database service in the industry to offer financially backed comprehensive SLAs. This gives us the assurance that the APIs will not only be easy to work with but also highly available, with low latency, consistency and throughput guarantees. In other words, customers who were running their own NoSQL database can now forgo their own maintenance and adopt a managed Database-as-a-Service layer with little or no disruption to their existing applications. In our discussion of dataset architectures here, we saw important concepts around new trends, but nothing seems to bring it all together like this managed service platform from Azure Cosmos DB.

A text summarization attempt:  http://scratchpad.io/scintillating-spot-7410
API
UI

#codingexercise
Given a rectangular matrix with rowwise and columnwise sorted distinct integers, find a given integer
Tuple<int, int> GetPosition(int[,] A, int Rows, int Cols, int X)
{
    // Search each row with a binary search along its columns: O(Rows log Cols).
    for (int i = 0; i < Rows; i++)
    {
        int index = BinarySearchRow(A, Cols, i, X);
        if (index != -1)
            return Tuple.Create(i, index); // (row, column) of the match
    }
    return Tuple.Create(-1, -1); // not found
}

// Plain binary search along row i of the sorted matrix.
int BinarySearchRow(int[,] A, int Cols, int i, int X)
{
    int lo = 0, hi = Cols - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (A[i, mid] == X) return mid;
        if (A[i, mid] < X) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}
We can binary search along columns as well as rows; a sketch that picks the cheaper direction follows.
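A small sketch of that choice, reusing the BinarySearchRow helper above and the BinarySearchColumn helper from the previous post: iterate over the smaller dimension and binary search along the larger one, for O(min(Rows, Cols) * log max(Rows, Cols)) overall.
Tuple<int, int> GetPositionEither(int[,] A, int Rows, int Cols, int X)
{
    if (Rows <= Cols)
    {
        // Fewer rows: binary search along the columns of each row.
        for (int i = 0; i < Rows; i++)
        {
            int j = BinarySearchRow(A, Cols, i, X);
            if (j != -1) return Tuple.Create(i, j);
        }
    }
    else
    {
        // Fewer columns: binary search down the rows of each column.
        for (int j = 0; j < Cols; j++)
        {
            int i = BinarySearchColumn(A, Rows, j, X);
            if (i != -1) return Tuple.Create(i, j);
        }
    }
    return Tuple.Create(-1, -1);
}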

Saturday, June 17, 2017

We were reviewing DynamoDB. We will compare it with Cosmos DB, but first let us review Cosmos DB. Azure Cosmos DB was written to be the industry's first globally-distributed, multi-model database service. It is built to collect data from a plethora of IoT devices as well as to handle machine-learning workloads. It is a globally distributed data service that can elastically scale throughput and storage with data consistency, low latency and high availability. There are five consistency levels: strong, bounded-staleness, session, consistent-prefix and eventual. No schema or index management is required because the database automatically indexes all data and the engine does not know or care about any schemas. Key-value, document and graph data models are all supported.
This database allows developers to add any number of regions while retaining the same fidelity as the closest one, and it seamlessly replicates data across regions. It elastically scales throughput and storage, which lets applications be built to be more responsive. The database engine is said to be write-optimized, log-structured and latch-free, which enables it to guarantee writes at single-digit-millisecond latencies anywhere in the world.
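As a sketch of how an application might pin one of those consistency levels and preferred regions with the DocumentDB .NET SDK of the time (the endpoint, key and region choices below are placeholders, and the SDK surface should be checked against current documentation):
using System;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

var policy = new ConnectionPolicy();
policy.PreferredLocations.Add(LocationNames.WestUS); // nearest region for reads
policy.PreferredLocations.Add(LocationNames.EastUS); // next preference on failover

var client = new DocumentClient(
    new Uri("https://<account>.documents.azure.com:443/"), // placeholder endpoint
    "<auth-key>",                                          // placeholder key
    policy,
    ConsistencyLevel.Session); // any of the five levels discussed above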

#codingexercise
// Builds a binary tree from its preorder and inorder traversals (distinct values).
// Call with preIndex = 0 and the full inorder range. preIndex walks the preorder
// array; inStart..inEnd bound the current subtree within the inorder array.
Node BuildTree(char[] Preorder, char[] InOrder, ref int preIndex, int inStart, int inEnd)
{
    if (inStart > inEnd) return null;

    Node root = new Node();
    root.data = Preorder[preIndex];
    root.left = null;
    root.right = null;
    preIndex++;

    // The root splits the inorder range into its left and right subtrees.
    int inIndex = Array.IndexOf(InOrder, root.data, inStart, inEnd - inStart + 1);

    root.left = BuildTree(Preorder, InOrder, ref preIndex, inStart, inIndex - 1);
    root.right = BuildTree(Preorder, InOrder, ref preIndex, inIndex + 1, inEnd);

    return root;
}

// Preorder: A B C D E F
// Inorder:  C B D A E F
// The tree that gets built:
//        A
//      /   \
//     B     E
//    / \     \
//   C   D     F
// An alternative that passes explicit ranges instead of a running index:
// index..end bound the current subtree in preorder, and inStart is the left
// boundary of the same subtree in inorder, so the size of the left subtree
// falls out of the root's position in the inorder array.
Node BuildTree(char[] Preorder, char[] InOrder, int index, int end, int inStart)
{
    Node root = new Node();
    root.data = Preorder[index];
    root.left = null;
    root.right = null;
    if (index == end) return root;

    int inIndex = Array.IndexOf(InOrder, root.data);
    int size_left = inIndex - inStart;          // nodes before the root in inorder
    int size_right = end - index - size_left;   // whatever remains in this range

    if (size_left > 0)
        root.left = BuildTree(Preorder, InOrder, index + 1, index + size_left, inStart);
    if (size_right > 0)
        root.right = BuildTree(Preorder, InOrder, index + size_left + 1, end, inIndex + 1);

    return root;
}
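For the traversals drawn above, a hypothetical driver for the ranged version would be:
var preorder = new[] { 'A', 'B', 'C', 'D', 'E', 'F' };
var inorder  = new[] { 'C', 'B', 'D', 'A', 'E', 'F' };
Node root = BuildTree(preorder, inorder, 0, preorder.Length - 1, 0);
// root.data == 'A', root.left.data == 'B', root.right.data == 'E'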



Friday, June 16, 2017

In continuation of our discussion on data architecture covered in previous posts, let us take an example and study it. We will review Amazon Dynamo as an example of a cloud-based data storage and distributed system. Dynamo addresses the need for a reliable, scalable, highly available key-value storage system. It provides the ability to trade off cost, consistency, durability and performance while maintaining high availability. With a cloud-based business such as Amazon, compute and storage are distributed across several datacenters around the world. At this scale there are many fault domains, and yet the persistent state has to be managed with high reliability and availability. To maintain this standard, Dynamo sacrifices consistency under certain conditions. It makes extensive use of object versioning and application-assisted conflict resolution. It is used to manage the state of services that have very high reliability requirements and need tight control over the tradeoffs between availability, consistency, cost-effectiveness and performance. For a store like Amazon, a relational data store is limiting in terms of scale and availability. Dynamo provides a simple primary-key interface to meet this need.
We saw how conflict resolution is handled in Dynamo: when it happens and who performs it are determined by the needs of the services using Dynamo.
Other design considerations include the ability to scale out the storage with no downtime; uniform nodes, with no special roles for any set of nodes; decentralized control by way of simpler, scalable peers; and the ability to exploit heterogeneity in the infrastructure.
The peer-to-peer network formed for this storage does not depend on flooding the network with queries to find the peer with the data. Instead, it relies on structure and a globally consistent protocol to ensure that any node can route a search query to some peer that has the related data.
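A minimal sketch of such structured routing, in the spirit of the consistent hashing Dynamo uses (an illustration, not Dynamo's implementation): each node owns a point on a hash ring, and every peer can compute a key's owner locally.
class HashRing
{
    private readonly SortedDictionary<uint, string> ring = new SortedDictionary<uint, string>();

    public void AddNode(string node)
    {
        ring[Hash(node)] = node; // each node owns a point on the ring
    }

    public string GetOwner(string key)
    {
        uint h = Hash(key);
        // Route to the first node clockwise from the key's position.
        foreach (var entry in ring)
            if (entry.Key >= h) return entry.Value;
        return ring.Count > 0 ? ring.First().Value : null; // wrap; needs using System.Linq
    }

    private static uint Hash(string s)
    {
        return (uint)s.GetHashCode(); // stand-in for a stable hash such as MD5
    }
}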
#codingexercise
1) Given a nested list of integers – calculate the sum of all the integers, multiplying each number by its depth…

For example:

Here’s the depth sum of the nested list {1, {2, 3}}:

{1, {2, 3}} => 1*1 + 2*2 + 3*2 = 11

Here’s the depth sum of the nested list {1, {2, {3}}}:

{1, {2, {3}}} =>  1*1 + 2*2 + 3*3 = 14

static int CalcSum(Nested n, int depth)
{
    int total = n.value * depth;
    if (n.IsNested())
    {
        foreach (var nl in n.nested)
        {
            total += CalcSum(nl, depth + 1); // children sit one level deeper
        }
    }
    return total;
}
2) Find sentences with the highest density of words from a list : solution

Thursday, June 15, 2017

We were reviewing Amazon Dynamo as an example of a cloud-based data storage and distributed system. Dynamo is a single, highly available system for the entire Amazon store, designed to handle all loads including peak loads. Reading from their paper, we saw that Dynamo addresses the need for a reliable, scalable, highly available key-value storage system. It provides the ability to trade off cost, consistency, durability and performance while maintaining high availability.
Let us look at the techniques involved to achieve this scale and availability. Data is partitioned and replicated using consistent hashing, and consistency is facilitated by object versioning. Replicas are maintained during updates based on a quorum-like technique, and kept in sync with a decentralized synchronization protocol. Dynamo employs gossip-based distributed failure detection and membership. Storage nodes can be added and removed from Dynamo without requiring any manual partitioning or redistribution. Conflicts need to be handled by determining when to resolve them and who resolves them; eventually a consistent state is reached. While some systems resolve conflicts during writes, Dynamo is designed to be an always-writeable store.
Writes are important to services because they could involve customer updates which cannot be lost or rejected. Therefore conflict resolution is not handled in writes; instead, it is done during reads. To determine who performs the conflict resolution, we can choose between the application and the datastore. The datastore is not context-aware, so it may choose the last write or some other straightforward mechanism. The application, however, is savvy about the user context and can determine the best resolution. Therefore, the conflict resolution is performed by the application.
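A hedged sketch of how that read-time decision can be made with version vectors, the mechanism behind Dynamo's object versioning (the types and names here are illustrative): if neither copy's clock dominates the other, the copies conflict and both are handed to the application to merge.
// clock: replica id -> count of updates seen from that replica
static bool Descends(Dictionary<string, int> a, Dictionary<string, int> b)
{
    // a descends from b when a has seen at least every update b has seen.
    foreach (var kv in b)
    {
        int seen;
        if (!a.TryGetValue(kv.Key, out seen) || seen < kv.Value)
            return false;
    }
    return true;
}

static bool Conflict(Dictionary<string, int> a, Dictionary<string, int> b)
{
    // Neither copy dominates: return both versions to the application.
    return !Descends(a, b) && !Descends(b, a);
}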

#codingexercise
Find a number in a rotated sorted array
int GetNumber(List<int> sortedRotated, int num)
{
    int n = sortedRotated.Count;
    int pivot = FindPivot(sortedRotated, 0, n - 1); // index of the largest element
    if (pivot == -1) // array is not rotated
        return BinarySearch(sortedRotated, 0, n - 1, num);
    if (sortedRotated[pivot] == num)
        return pivot;
    if (num >= sortedRotated[0]) // target lies in the left, larger-valued run
        return BinarySearch(sortedRotated, 0, pivot - 1, num);
    return BinarySearch(sortedRotated, pivot + 1, n - 1, num);
}

int FindPivot(List<int> sortedRotated, int start, int end)
{
    if (start > end) return -1;
    if (start == end) return start;
    int mid = (start + end) / 2;
    if (mid < end && sortedRotated[mid] > sortedRotated[mid + 1]) return mid;
    if (mid > start && sortedRotated[mid - 1] > sortedRotated[mid]) return mid - 1;
    if (sortedRotated[start] >= sortedRotated[mid])
        return FindPivot(sortedRotated, start, mid - 1);
    return FindPivot(sortedRotated, mid + 1, end);
}

int BinarySearch(List<int> A, int lo, int hi, int num)
{
    while (lo <= hi)
    {
        int mid = (lo + hi) / 2;
        if (A[mid] == num) return mid;
        if (A[mid] < num) lo = mid + 1; else hi = mid - 1;
    }
    return -1;
}
#alternative : http://ideone.com/e.js/F7V2aj
Also, we can use another technique that conceptually concatenates the sortedRotated array with itself; the concatenation contains a subarray of length n, starting just past the pivot, that holds all of the elements in sorted order.
eg: for 8, 9, 9, 1, 3, 4, 4, 4, 6, 6, 7, 7 the concatenation is
8, 9, 9, 1, 3, 4, 4, 4, 6, 6, 7, 7, 8, 9, 9, 1, 3, 4, 4, 4, 6, 6, 7, 7
and the fully sorted window within it is
1, 3, 4, 4, 4, 6, 6, 7, 7, 8, 9, 9
Since that window is sorted, finding the element is a plain binary search over window positions; the position found, taken modulo n, gives the index in the original array.
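A hedged sketch of that idea, reusing the FindPivot helper above: there is no need to materialize the doubled array, since any position in the sorted window can be read from the original array modulo n.
int SearchViaConcatenation(List<int> A, int pivot, int num)
{
    int n = A.Count;
    int lo = pivot + 1, hi = pivot + n; // the sorted window of the virtual A + A
    while (lo <= hi)
    {
        int mid = (lo + hi) / 2;
        int value = A[mid % n]; // index into the virtual concatenation
        if (value == num) return mid % n;
        if (value < num) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}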

DFS in a matrix with blockers
// Counts the size of the connected region reachable from (row, col), moving in
// all eight directions; blocked cells are never visited.
void DFS(int[,] A, int row, int col, bool[,] visited, ref int size)
{
    int[] rd = { -1, -1, -1, 0, 0, 1, 1, 1 };
    int[] cd = { -1, 0, 1, -1, 1, -1, 0, 1 };

    visited[row, col] = true;
    for (int k = 0; k < 8; k++)
    {
        int r = row + rd[k], c = col + cd[k];
        if (IsSafe(A, r, c, visited))
        {
            size++;
            DFS(A, r, c, visited, ref size);
        }
    }
}

bool IsSafe(int[,] A, int r, int c, bool[,] visited)
{
    return r >= 0 && r < A.GetLength(0) && c >= 0 && c < A.GetLength(1)
        && A[r, c] != 0   // assumption: a zero cell is a blocker
        && !visited[r, c];
}

Wednesday, June 14, 2017

In continuation of our discussion on data architecture covered in previous posts, let us take an example and study it. We will review Amazon Dynamo as an example of a cloud-based data storage and distributed system. Dynamo addresses the need for a reliable, scalable, highly available key-value storage system. It provides the ability to trade off cost, consistency, durability and performance while maintaining high availability. With a cloud-based business such as Amazon, compute and storage are distributed across several datacenters around the world. At this scale there are many fault domains, and yet the persistent state has to be managed with high reliability and availability. To maintain this standard, Dynamo sacrifices consistency under certain conditions. It makes extensive use of object versioning and application-assisted conflict resolution. It is used to manage the state of services that have very high reliability requirements and need tight control over the tradeoffs between availability, consistency, cost-effectiveness and performance. For a store like Amazon, a relational data store is limiting in terms of scale and availability. Dynamo provides a simple primary-key interface to meet this need.
Let us look at the techniques involved to achieve this scale and availability. Data is partitioned and replicated using consistent hashing, and consistency is facilitated by object versioning. Replicas are maintained during updates based on a quorum-like technique, and kept in sync with a decentralized synchronization protocol. Dynamo employs gossip-based distributed failure detection and membership. Storage nodes can be added and removed from Dynamo without requiring any manual partitioning or redistribution, which lets Dynamo handle the peak load traffic to the Amazon store. Usually replication is performed synchronously to provide a consistent data access interface, but this trades off availability under certain failures; it is well known that consistency and availability are hard to achieve together. Availability can be increased by optimistic replication techniques, where changes are allowed to propagate to replicas in the background and disconnection is tolerated. The conflicts arising from this now need to be handled by determining when to resolve them and who resolves them; eventually a consistent state is reached. While some systems resolve conflicts during writes, Dynamo is designed to be an always-writeable store.
Dynamo has been the underlying storage technology for a number of the core services of the Amazon e-commerce platform. This store handles several million checkouts a day. Dynamo is a single, highly available system for this entire store.
Courtesy: Amazon paper on Dynamo
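As a toy illustration of the quorum-like technique mentioned above (the N, R and W values here are typical examples, not a prescription from the paper): writes wait for W acknowledgements out of N replicas, reads consult R replicas, and choosing R + W > N forces every read set to overlap every write set.
const int N = 3; // replicas per key
const int W = 2; // acknowledgements required before a write succeeds
const int R = 2; // replicas consulted on a read
// With R + W > N, any read set shares at least one node with any write set,
// so a read observes the latest acknowledged write or a conflicting sibling of it.
bool overlap = R + W > N; // true: 2 + 2 > 3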
#codingexercise
Count the number of bridges that can be drawn between cities on the northern and southern banks of a river such that no two bridges intersect. The southern cities are a permutation of the northern cities, and a bridge may only join a city to its counterpart.
  cities:  A  B  C  D  E  F  G  H
  pairs:   H  A  B  F  E  D  G  C
  best:    1  1  2  3  3  3  4  3
  bridges: A-A, B-B, D-D, G-G

Longest increasing subsequence
int GetBridges(List<char> northernCities, List<char> southernCities)
{
    // Non-crossing bridges correspond to a longest increasing subsequence.
    return Math.Min(GetBridgesLIS(northernCities), GetBridgesLIS(southernCities));
}

int GetBridgesLIS(List<char> A)
{
    var best = new int[A.Count];
    for (int i = 0; i < best.Length; i++)
        best[i] = 1;
    for (int i = 1; i < A.Count; i++)
        for (int j = 0; j < i; j++)
            if (A[i] > A[j])
                best[i] = Math.Max(best[i], best[j] + 1);
    return best.Max(); // needs using System.Linq
}

Tuesday, June 13, 2017

Data architectures in Cloud Computing. 
We were discussing how traditional data processing architectures have changed a lot: from being part of the ubiquitous three-tier architecture involving databases, to being more distributed, scaled up and scaled out, sharded, hosted on private and public clouds, maintained on clusters and containers with shared volumes, hosted in memory, and even becoming hybrid to span SQL and NoSQL technologies. We continue reviewing some of the improvements in this field.
Today we look at data storage for social networking applications as an extreme of the storage needed for this purpose. We recap the considerations presented in a video by Facebook engineers on their data infrastructure needs. They have data arriving to the tune of several hundred terabytes, and storage of over 300 petabytes, only a fraction of which is processed. For example, their log storage flows into HDFS, which is massively distributed storage. They use Hadoop over HDFS together with Hive for data warehouse operations and SQL querying. This makes it easy for data tools and pipelines to work with this data stack.
Then they introduced Presto over HDFS for interactive analysis and Apache Giraph over Hadoop for graph analytics. We will discuss Presto and Giraph shortly, but let us first take a look at the kinds of data stored in these systems. All the images from the social network flow into Haystack, the users are stored in MySQL, and all the chat is stored in HBase. They use Scribe for log storage, Scuba for real-time slice-and-dice, and Puma for streaming analytics. These also give an indication of the major types of data processing involved in a social network application.
Apache Giraph is an iterative graph processing system used for high scalability. It is used to analyze the graph formed by users and their connections.

#codingexercise
Given an array, find the maximum j-i such that arr[j] > arr[i]
int GetMaxDiff(List<int> A)
{
    int diff = int.MinValue; // remains int.MinValue if no such pair exists
    for (int i = 0; i < A.Count; i++)
        for (int j = A.Count - 1; j > i; j--)
        {
            if (A[j] > A[i] && j - i > diff)
                diff = j - i;
        }
    return diff;
}
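A linear-time alternative, using the standard prefix-minimum/suffix-maximum technique (a well-known improvement, not part of the original exercise): precompute the smallest value to the left of each position and the largest value to the right, then sweep both arrays with two pointers.
int GetMaxDiffLinear(List<int> A)
{
    int n = A.Count;
    var leftMin = new int[n];   // smallest value in A[0..i]
    var rightMax = new int[n];  // largest value in A[j..n-1]
    leftMin[0] = A[0];
    for (int i = 1; i < n; i++)
        leftMin[i] = Math.Min(leftMin[i - 1], A[i]);
    rightMax[n - 1] = A[n - 1];
    for (int j = n - 2; j >= 0; j--)
        rightMax[j] = Math.Max(rightMax[j + 1], A[j]);

    int x = 0, y = 0, diff = int.MinValue;
    while (x < n && y < n)
    {
        if (leftMin[x] < rightMax[y])
        {
            diff = Math.Max(diff, y - x); // some i <= x, j >= y satisfies A[j] > A[i]
            y++;
        }
        else
        {
            x++;
        }
    }
    return diff;
}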