Sunday, July 23, 2017

We were discussing the replacement of MongoDB with the Snowflake data warehouse. The challenges faced in scaling MongoDB, and the value of a solution that does not impose restrictive limits, are best understood with a case study of DoubleDown's migration of data from MongoDB to Snowflake. Snowflake's primary advantage is that it brings the cloud's elasticity and isolation capabilities to the warehouse: compute is added in the form of what it calls virtual clusters, while the storage is shared. Each cluster acts as a virtual warehouse and serves a single user, although the user is never aware of the nodes in the cluster. Each query executes within a single cluster. Snowflake uses micro-partitions to store customer data securely and efficiently. It is appealing over traditional warehouses because it provides a software-as-a-service experience. Its query language is SQL, so the pace of development can be rapid. It loads JSON natively, so several lossy data transformations and ETL steps can be avoided. It can store highly granular staging data in S3, which makes it very effective to scale.
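As a hedged illustration of the SQL-over-JSON point, the following minimal C# sketch queries a VARIANT column holding raw JSON. It assumes the Snowflake ADO.NET connector (the Snowflake.Data package and its SnowflakeDbConnection class); the connection string, table, and field names are placeholders invented for illustration, not details from the case study.
using System;
using Snowflake.Data.Client; // assumed: Snowflake's ADO.NET connector

class JsonQuerySketch
{
    static void Main()
    {
        using (var conn = new SnowflakeDbConnection())
        {
            // placeholder credentials; not from the original post
            conn.ConnectionString = "account=myaccount;user=myuser;password=mypassword;db=GAMES;schema=PUBLIC";
            conn.Open();

            var cmd = conn.CreateCommand();
            // v is a VARIANT column of raw JSON; v:field::type is Snowflake's
            // path-and-cast syntax, so no separate flattening ETL step is needed
            cmd.CommandText = @"SELECT v:player_id::string AS player,
                                       COUNT(*)            AS plays
                                FROM   raw_events
                                GROUP  BY 1
                                ORDER  BY plays DESC";
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine($"{reader.GetString(0)}: {reader.GetValue(1)}");
        }
    }
}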
While Hadoop and Spark advance analytics in their own ways, this technology brings the warehouse to the cloud. Since most warehouse workloads are not transactional in nature, Snowflake can dispense with heavyweight transaction management during query execution. Moreover, warehouse queries are heavy on aggregation, so Snowflake lets these operators make the best use of cloud memory and storage. Each warehouse feature, such as access control and the query optimizer, is implemented as a cloud service. The failure of individual service nodes does not cause data loss or loss of availability. Concurrency is handled in these cloud services with the help of snapshot isolation and MVCC.
Snowflake provides the warehouse in a software-as-a-service manner. It is this model that is interesting to a text summarization service. There is end-to-end security in the offering, and Snowflake uses micro-partitions to store customer data securely and efficiently. Another similarity is that Snowflake is not a full-service model. Instead, it is meant as a component in workflows, often replacing the components associated with the use of, say, MongoDB in an enterprise. The interface for Snowflake is SQL, which is widely accepted. The summarization service does not have this benefit, but it is meant to participate in workflows with the help of REST APIs. If the SQL standard is enhanced at some point to include text analysis, then the summarization service can be enhanced to include those operations as well.
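As a minimal sketch of that REST-based participation, the following C# snippet posts a document to a hypothetical /summaries endpoint and reads back a JSON summary. The host name, endpoint, and payload shape are illustrative assumptions, not a published API.
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class SummaryClientSketch
{
    static async Task Main()
    {
        using (var client = new HttpClient { BaseAddress = new Uri("https://summarizer.example.com/") })
        {
            // hypothetical JSON payload; the real schema is an assumption
            var body = new StringContent(
                "{\"text\": \"The quick brown fox jumps over the lazy dog.\"}",
                Encoding.UTF8, "application/json");

            // POST the document to the hypothetical /summaries endpoint
            var response = await client.PostAsync("summaries", body);
            response.EnsureSuccessStatusCode();

            // the summary would come back as JSON in the response body
            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }
}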

Saturday, July 22, 2017

We were discussing the replacement of MongoDB with the Snowflake data warehouse. The challenges faced in scaling MongoDB, and the value of a solution that does not impose restrictive limits, are best understood with a case study of DoubleDown's migration of data from MongoDB to Snowflake. Towards the end of the discussion, we will reflect on whether Snowflake could be built on top of DynamoDB or CosmosDB.
DoubleDown is an internet casino games provider that was acquired by International Game Technology. Its games are available through different applications and platforms. While these games are free to play, it makes money from in-game purchases and advertising partners. With existing and new games, it performs analysis to gain insights that influence game design, marketing campaign evaluation and management. This analysis improves the understanding of player behaviour, assesses user experience, and uncovers bugs and defects. The data comes from MySQL databases, internal production databases, cloud-based game servers, internal Splunk servers, and several third-party vendors. These continuous data feeds amount to about 3.5 terabytes of data per day, arriving over separate data paths and ETL transformations and stored in large JSON files. MongoDB was used for processing this data and supported a collection of collectors and aggregators. The data was then pulled into a staging area where it was cleaned, transformed and conformed to a star schema before loading into a data warehouse. This warehouse then supported analysis and reporting tools including Tableau. Snowflake not only replaced MongoDB but also streamlined the data operation, expediting a process that had previously taken nearly a day. Snowflake brought the following advantages:
Its query language is SQL, so the pace of development was rapid.
It loads JSON natively, so several lossy data transformations were avoided.
It was able to stage and store highly granular data in S3, which made it very effective.
It was able to process JSON data using SQL, which did away with a lot of map-reduce (see the sketch after this list).
Snowflake uses micro-partitions to store customer data securely and efficiently.
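To make the JSON-loading and map-reduce points concrete, here is a hedged sketch of the kind of SQL such a pipeline might run: raw JSON files staged in S3 are loaded into a VARIANT column and then aggregated directly with SQL. The stage, table, and field names are invented for illustration; the statements could be executed through the same ADO.NET connector sketched in the July 23 entry above.
// Illustrative SQL for a JSON load-and-query pipeline; all names are hypothetical.
static readonly string[] PipelineSql =
{
    // point an external stage at the S3 bucket holding the raw JSON feeds
    "CREATE STAGE IF NOT EXISTS raw_feed URL = 's3://example-feeds/' FILE_FORMAT = (TYPE = 'JSON')",

    // land each JSON document in a single VARIANT column, with no lossy transform
    "CREATE TABLE IF NOT EXISTS raw_events (v VARIANT)",

    // bulk-load the staged files
    "COPY INTO raw_events FROM @raw_feed",

    // aggregate straight off the JSON with SQL instead of map-reduce
    @"SELECT v:game::string AS game, COUNT(*) AS plays
      FROM raw_events
      GROUP BY 1
      ORDER BY plays DESC"
};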
Snowflake is appealing over traditional warehouses because it provides a software-as-a-service experience. It is elastic: storage and compute resources can be scaled independently and seamlessly, without impact on data availability or the performance of concurrent queries. It is a multi-cluster technology that is highly available. Each cluster acts as a virtual warehouse and serves a single user, although the user is never aware of the nodes in the cluster. Cluster sizes come in T-shirt sizes. Each such warehouse is a pure compute resource, while storage is shared and the compute instances work independently of the data volume. The execution engine is columnar, which may be better suited for analytics than row-wise storage. Intermediate results are not materialized; data is processed as if in a pipeline. It is not clear why the virtual clusters are chosen to be homogeneous rather than allowed to be heterogeneous, where compute might scale up rather than scale out. It is also not clear why the clusters cannot be outsourced to commodity cluster managers such as the Mesos and Marathon stack. Arguably, performance on par with relational counterparts requires a homogeneous architecture.
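The T-shirt sizing and independent compute scaling described above map to plain DDL. The statements below follow Snowflake's warehouse DDL as I understand it; the warehouse name and settings are invented, and they could be issued through the same connector as before.
// Illustrative warehouse DDL; the name reporting_wh is hypothetical.
const string CreateWarehouse =
    "CREATE WAREHOUSE IF NOT EXISTS reporting_wh WITH WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 300";

// Resizing scales compute without touching the shared storage or moving data.
const string ResizeWarehouse =
    "ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE'";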

Friday, July 21, 2017

Having discussed content databases, document libraries and S3 storage, together with their benefits for a text summarization service, we now review NoSQL databases as a store for the summaries. We argued that the summaries can be represented as JSON documents, that the store can grow arbitrarily large, to the tune of 40 GB per user, and that the text of the original document may also be stored. Consequently, one of the approaches could be to use a MongoDB store, which can be a wonderful on-premises solution but one that requires significant processing for analysis and reporting, especially when the datasets are large. However, we described the text summarization service as a truly cloud-based offering. To understand the challenges faced in scaling MongoDB, and the value of a solution that does not impose restrictive limits, we could do well to review the case study of DoubleDown's migration of data from MongoDB to Snowflake. Towards the end of the discussion, we will reflect on whether Snowflake could be built on top of DynamoDB or CosmosDB.
DoubleDown is an internet casino games provider that was acquired by International Game Technology. Its games are available through different applications and platforms. While these games are free to play, it makes money from in-game purchases and advertising partners. With existing and new games, it performs analysis to gain insights that influence game design, marketing campaign evaluation and management. This analysis improves the understanding of player behaviour, assesses user experience, and uncovers bugs and defects. The data comes from MySQL databases, internal production databases, cloud-based game servers, internal Splunk servers, and several third-party vendors. These continuous data feeds amount to about 3.5 terabytes of data per day, arriving over separate data paths and ETL transformations and stored in large JSON files. MongoDB was used for processing this data and supported a collection of collectors and aggregators. The data was then pulled into a staging area where it was cleaned, transformed and conformed to a star schema before loading into a data warehouse. This warehouse then supported analysis and reporting tools including Tableau. Snowflake not only replaced MongoDB but also streamlined the data operation, expediting a process that had previously taken nearly a day. Snowflake brought the following advantages:
Its query language is SQL, so the pace of development was rapid.
It loads JSON natively, so several lossy data transformations were avoided.
It was able to stage and store highly granular data in S3, which made it very effective.
It was able to process JSON data using SQL, which did away with a lot of map-reduce.
Snowflake uses micro-partitions to store customer data securely and efficiently.
#codingexercise
// Returns the largest element of A; throws on a null or empty list.
int GetMaxValue(List<int> A)
{
    if (A == null || A.Count == 0) throw new ArgumentException("empty list");
    var max = int.MinValue; // C# equivalent of INT_MIN
    for (int i = 0; i < A.Count; i++)
        if (A[i] > max)
            max = A[i];
    return max;
}

Thursday, July 20, 2017

We were discussing file synchronization services. Such a service uses events and callbacks to indicate progress, preview the changes to be made, handle user-specified conflict resolution, and provide graceful error handling per file. The file synchronization provider is designed to handle concurrent operations by applications. Changes to a file will not be synchronized until the next synchronization session, so that concurrent changes to the source or destination are not lost. Each file synchronization is atomic, so that the user does not end up with a partially correct copy of the file. The service provides incremental synchronization between two file system locations based on change detection, a process that evaluates changes since the last synchronization. The service stores metadata about the synchronization which describes where and when each item was changed, giving a snapshot of every file and folder in the replica. Changes are detected by comparing the current file metadata with the version last saved in the snapshot. For files, the comparison is done on the file size, file times, file attributes, file name and optionally a hash of the file contents. For folders, the comparison is done on folder attributes and folder names.
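As a minimal sketch of that comparison, assuming only standard System.IO metadata (size, last-write time, attributes, name) and an in-memory snapshot keyed by relative path, change detection might look like this; the FileSnapshot type is invented for illustration.
using System;
using System.Collections.Generic;
using System.IO;

// Invented record of the metadata saved at the last synchronization.
record FileSnapshot(long Length, DateTime LastWriteUtc, FileAttributes Attributes);

static class ChangeDetector
{
    // Compares current file metadata against the stored snapshot and
    // returns the relative paths that were created or modified.
    // (Deletions would be found by a reverse pass over the snapshot.)
    public static List<string> DetectChanges(string replicaRoot,
                                             Dictionary<string, FileSnapshot> snapshot)
    {
        var changed = new List<string>();
        foreach (var path in Directory.EnumerateFiles(replicaRoot, "*", SearchOption.AllDirectories))
        {
            var info = new FileInfo(path);
            var key = Path.GetRelativePath(replicaRoot, path);
            if (!snapshot.TryGetValue(key, out var old) ||      // new file
                old.Length != info.Length ||                    // size changed
                old.LastWriteUtc != info.LastWriteTimeUtc ||    // timestamp changed
                old.Attributes != info.Attributes)              // attributes changed
            {
                changed.Add(key);
            }
        }
        return changed;
    }
}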
#codingexercise
Find the length of the longest subsequence of one string which is a substring of another string.
// Returns the maximum length of a subsequence of X that occurs as a
// (contiguous) substring of Y, using dynamic programming.
    static int MaxSubSeqInSubString(string X, string Y)
    {
        char[] subseq = X.ToCharArray();
        int n = subseq.Length;
        char[] substr = Y.ToCharArray();
        int m = substr.Length;

        // dp[i, j] = length of the longest subsequence of subseq[0..j-1]
        // that ends exactly at substr[i-1]; C# arrays are zero-initialized,
        // so row and column 0 (the empty prefixes) need no explicit setup.
        var dp = new int[m + 1, n + 1];

        for (int i = 1; i <= m; i++) // over substr
        {
            for (int j = 1; j <= n; j++) // over subseq
            {
                if (subseq[j - 1] == substr[i - 1])
                    dp[i, j] = dp[i - 1, j - 1] + 1; // extend the match
                else
                    dp[i, j] = dp[i, j - 1];         // skip this subsequence char
            }
        }

        // the answer may end at any position of substr
        int result = 0;
        for (int i = 1; i <= m; i++)
            result = Math.Max(result, dp[i, n]);
        return result;
    }
/*
A, ABC => 1
D, ABC => 0
D, D => 1
A, D => 0
,  => 0
A,  => 0
, A => 0
A, DAG => 1
AG, DAGH => 2
AH, DAGH => 1
DAGH, AH => 2
ABC, ABC => 3
BDIGH, HDGB => 2
*/
// Returns the N-th node (1-based) in the inorder traversal of the tree,
// or null if N is out of range. Relies on an inorder traversal that
// appends every visited node to the list.
Node GetNthElementInInorder(Node root, int N)
{
  var ret = new List<Node>();
  InorderTraversal(root, ref ret);
  if (N >= 1 && N <= ret.Count)
     return ret[N-1];
  else
     return null;
}

Wednesday, July 19, 2017

We were discussing document libraries and the file synchronization service. This service provides incremental synchronization between two file system locations. Changes are detected relative to the last synchronization.
It stores metadata about the synchronization which describes where and when each item was changed, giving a snapshot of every file and folder in the replica. For files, the comparison is done on the file size, file times, file attributes, file name and optionally a hash of the file contents. For folders, the comparison is done on folder attributes and folder names.
Since change detection evaluates all files, a large number of files in the replica may degrade performance. Users are notified of progress during the synchronization operation with the help of events raised from managed code or callbacks in unmanaged code. If progress is displayed during preview mode, the changes are not committed. If users modify different file system replicas and they get out of sync, a process of conflict resolution is performed. The conflict resolution is deterministic; it does not matter which replica initiates it. In all cases, it avoids data loss and either applies the most recent update or preserves the differing files.
The callbacks and events are not only helpful for progress or preview; they also support error handling and recovery. They enable graceful error handling per file during the synchronization of a set of files. Errors may come from locked files, changes made after change detection, access denied, insufficient disk space, and so on. If an error is encountered, the file is skipped so that the rest of the synchronization proceeds. The application gets the file details and error information, which it may use to re-synchronize after fixing the problem. If the entire synchronization operation fails, the application gets a proper error code. For example, a replica-in-use error code is given when there are concurrent synchronization operations on the same replica.
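A minimal sketch of that skip-and-continue behaviour, using an invented per-file callback and error record rather than any particular provider's API:
using System;
using System.Collections.Generic;
using System.IO;

// Invented record describing a per-file failure; not a provider API.
record SyncError(string Path, Exception Cause);

static class SkipAndContinue
{
    // Tries to synchronize each file; a failure skips just that file and is
    // reported back so the application can fix it up and re-synchronize.
    public static List<SyncError> SyncAll(IEnumerable<string> paths, Action<string> syncFile)
    {
        var errors = new List<SyncError>();
        foreach (var path in paths)
        {
            try
            {
                syncFile(path); // e.g. copy, rename, or delete on the target replica
            }
            catch (IOException ex)                 // locked file, insufficient disk space, ...
            {
                errors.Add(new SyncError(path, ex));
            }
            catch (UnauthorizedAccessException ex) // access denied
            {
                errors.Add(new SyncError(path, ex));
            }
        }
        return errors; // the rest of the set was synchronized despite these failures
    }
}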
The file synchronization provider is designed to handle concurrent operations by applications. Changes to a file will not be synchronized until the next synchronization session, so that concurrent changes to the source or destination are not lost. Each file synchronization is atomic, so that the user does not end up with a partially correct copy of the file.
#codingexercise
// Inorder traversal that maintains only the last two visited nodes,
// so the second-last element is available without storing the whole list.
void InOrderTraversal(Node root, ref List<Node> lasttwo)
{
    if (root == null) return;
    InOrderTraversal(root.left, ref lasttwo);
    ShiftAndAdd(root, ref lasttwo);
    InOrderTraversal(root.right, ref lasttwo);
}

// Keeps lasttwo as a sliding window over the two most recently visited nodes.
void ShiftAndAdd(Node root, ref List<Node> lasttwo)
{
    if (lasttwo.Count < 2) { lasttwo.Add(root); return; }
    lasttwo[0] = lasttwo[1];
    lasttwo[1] = root;
}
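A short usage sketch, assuming Node exposes a data field: after the traversal, the window holds the two largest keys of a BST in inorder order.
var lasttwo = new List<Node>();
InOrderTraversal(root, ref lasttwo);
if (lasttwo.Count == 2)
    Console.WriteLine(lasttwo[0].data); // second-last element in inorder (assumes a data field)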

Tuesday, July 18, 2017

We were discussing document libraries and the file synchronization service. This service provides incremental synchronization between two file system locations based on change detection, a process that evaluates changes since the last synchronization.
It stores metadata about the synchronization which describes where and when each item was changed, giving a snapshot of every file and folder in the replica. For files, the comparison is done on the file size, file times, file attributes, file name and optionally a hash of the file contents. For folders, the comparison is done on folder attributes and folder names.
Since change detection evaluates all files, a large number of files in the replica may degrade performance.
The file synchronization provider supports extensive progress reporting during the synchronization operation. This can be visualized through the User Interface as a progress bar. This information is reported to the application via events in the managed code or callbacks in the unmanaged code.
The preview mode displays what changes would happen during synchronization. The changes are not committed but the progress notifications are still sent out.
If the users modify different file system replicas, the replicas may get out of sync, and a process of conflict resolution is performed. The conflict resolution is deterministic: it does not matter which replica initiates it. In all cases, it avoids data loss. The policy for conflict resolution is usually the same. If an existing file is modified independently on two replicas, the file with the last write wins the selection. If an existing file or folder is modified on one replica and deleted on another, the deleted item is resurrected. Overriding the delete is also the option used when a file is created in a folder that has been deleted on another replica. If there are name collisions on files or folders, say when one of the folders is renamed, the user ends up with a single folder and the contents merged. Name collision conflicts may happen in other cases as well.
If two files are created independently on two replicas with the same name, then there will certainly be a file with that name but with the contents from the side that had the most recent update.
If an existing file is renamed and the new name collides with that of a newly created file, then both files are kept and one of them is renamed, because the user intended to keep both.
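A minimal sketch of the deterministic last-writer-wins rule described above, using an invented FileVersion record; the rename and merge policies would layer on top of this in the same style:
using System;

// Invented per-replica version record; not a provider API.
record FileVersion(string Path, DateTime LastWriteUtc, bool Deleted);

static class ConflictResolver
{
    // Deterministic resolution: the same winner is chosen no matter
    // which replica initiates it, and an edit always beats a delete.
    public static FileVersion Resolve(FileVersion a, FileVersion b)
    {
        // resurrect the deleted side rather than lose data
        if (a.Deleted && !b.Deleted) return b;
        if (b.Deleted && !a.Deleted) return a;

        // otherwise the most recent update wins; tie-break on an ordinal
        // path comparison so both replicas agree on the outcome
        if (a.LastWriteUtc != b.LastWriteUtc)
            return a.LastWriteUtc > b.LastWriteUtc ? a : b;
        return string.CompareOrdinal(a.Path, b.Path) <= 0 ? a : b;
    }
}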
#codingexercise
Given two strings, determine whether the first string is a subsequence of the second.
// Recursively checks whether one[0..len1-1] is a subsequence of two[0..len2-1].
bool IsSubsequence(string one, int len1, string two, int len2)
{
 Debug.Assert(len1 <= one.Length && len2 <= two.Length);
 if (len1 == 0) return true;   // pattern exhausted: it is a subsequence
 if (len2 == 0) return false;  // text exhausted first: it is not
 if (one[len1-1] == two[len2-1])
    return IsSubsequence(one, len1-1, two, len2-1); // match: consume both
 return IsSubsequence(one, len1, two, len2-1);      // skip a text character
}
/*
AH, DAGH => True
DAGH, AH => False
ABC, ABC => True
*/
// Returns the second-largest key in a BST by collecting the full inorder
// traversal, which yields keys in ascending order.
int GetSecondLastInBST(Node root)
{
    var inorder = new List<int>();
    InorderTraversal(root, ref inorder); // overload that collects node values
    if (inorder.Count - 2 < 0) throw new Exception("Invalid");
    return inorder[inorder.Count - 2];
}
// The above can be modified to store only last two elements

Monday, July 17, 2017

We were discussing document libraries and the file synchronization service. This service provides incremental synchronization between two file system locations based on change detection, a process that evaluates changes since the last synchronization.
It stores metadata about the synchronization which describes where and when each item was changed, giving a snapshot of every file and folder in the replica. Changes are detected by comparing the current file metadata with the version last saved in the snapshot. For files, the comparison is done on the file size, file times, file attributes, file name and optionally a hash of the file contents. For folders, the comparison is done on folder attributes and folder names.
When a file is renamed or moved, just that operation is performed on the replica, avoiding a copy. If a folder is renamed or moved, it is deleted and re-created on the other replicas, and the files within it are processed as renames.
Since change detection evaluates all files, a large number of files in the replica may degrade performance. Therefore, this expensive operation should be performed only as often as the applications using the sync framework require.
The file synchronization provider supports extensive progress reporting during the synchronization operation. This can be visualized through the user interface as a progress bar. The information is reported to the application via events in managed code or callbacks in unmanaged code. These event handlers and callbacks even enable an application to skip a change.
The preview mode displays what changes would happen during synchronization. The changes are not committed but the progress notifications are still sent out. It can be used by the application to present the verification UI to the user with all the changes that will be made if synchronization is executed.
The file synchronization provider does not provide an upfront estimate of the total number of files to be synchronized before the synchronization starts, because this can be expensive to compute. However, some statistics can be collected using a two-pass approach, where preview mode is run before the real synchronization session.
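A hedged sketch of that two-pass idea, with an invented Synchronize entry point that raises a per-file callback; pass one runs in preview mode purely to count the work, pass two applies it:
using System;

static class TwoPassSketch
{
    // Invented API shape for illustration: a sync pass that invokes the
    // callback per detected change and commits nothing when preview is true.
    static void Synchronize(string source, string target, bool preview, Action<string> onFile)
    {
        // ... enumerate detected changes and invoke onFile for each;
        // when preview is false, also apply the change ...
    }

    static void TwoPassSync(string source, string target)
    {
        // Pass 1: preview mode counts the files without committing anything.
        int total = 0;
        Synchronize(source, target, preview: true, _ => total++);

        // Pass 2: the real session can now report meaningful progress.
        int done = 0;
        Synchronize(source, target, preview: false, path =>
        {
            done++;
            Console.WriteLine($"{done}/{total}: {path}");
        });
    }
}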
Files can be filtered out of synchronization based on filename exclusion, filename inclusion, subdirectory exclusion and file attribute exclusion. Certain files may always be excluded if they are marked with both the SYSTEM and HIDDEN attributes.
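A minimal sketch of such a filter over standard System.IO types; the rule set here (extension excludes, subdirectory excludes, and the SYSTEM-plus-HIDDEN rule) is illustrative:
using System;
using System.IO;
using System.Linq;

static class SyncFilter
{
    static readonly string[] ExcludedExtensions = { ".tmp", ".bak" }; // illustrative
    static readonly string[] ExcludedSubdirs = { "obj", "bin" };      // illustrative

    public static bool ShouldSync(FileInfo file)
    {
        // files marked both SYSTEM and HIDDEN are always excluded
        var sysHidden = FileAttributes.System | FileAttributes.Hidden;
        if ((file.Attributes & sysHidden) == sysHidden) return false;

        // filename-based exclusion
        if (ExcludedExtensions.Contains(file.Extension, StringComparer.OrdinalIgnoreCase))
            return false;

        // subdirectory-based exclusion
        if (ExcludedSubdirs.Contains(file.Directory?.Name, StringComparer.OrdinalIgnoreCase))
            return false;

        return true;
    }
}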