Saturday, July 22, 2017

We were discussing the replacement of MongoDB with the Snowflake data warehouse. The challenges of scaling MongoDB, and the appeal of a solution that does not impose such limits, are best understood through a case study: DoubleDown's migration of data from MongoDB to Snowflake. Towards the end of the discussion, we will reflect on whether Snowflake could be built on top of DynamoDB or CosmosDB.
DoubleDown is an internet casino games provider that was acquired by International Game Technology. Its games are available through several applications and platforms. While the games are free to play, the company makes money from in-game purchases and advertising partners. For both existing and new games, it analyzes data to gain insights that influence game design, to evaluate and manage marketing campaigns, to better understand player behaviour, to assess user experience, and to uncover bugs and defects. This data comes from MySQL databases, internal production databases, cloud-based game servers, internal Splunk servers, and several third-party vendors. These continuous feeds amount to about 3.5 terabytes of data per day, arriving over separate data paths, passing through ETL transformations, and landing as large JSON files. MongoDB was used to process this data, backing a collection of collectors and aggregators. The data was then pulled into a staging area where it was cleaned, transformed, and conformed to a star schema before being loaded into a data warehouse. That warehouse in turn supported analysis and reporting tools, including Tableau. Snowflake not only replaced MongoDB but also streamlined the data operation, expediting a process that previously took nearly a day. Snowflake brought the following advantages:
Its query language is SQL, so the pace of development was rapid
It loads JSON natively, so several lossy data transformations were avoided
It was able to stage and store highly granular data in S3, which made it very effective
It was able to process JSON data using SQL, which did away with a lot of map-reduce (see the sketch after this list)
Snowflake uses micro-partitions to store customer data securely and efficiently
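
To make the JSON points above concrete, here is a minimal sketch of what such a flow could look like through Snowflake's Python connector (snowflake-connector-python). The connection parameters, stage name, table, and JSON field names are all hypothetical placeholders, not details from the DoubleDown case study:

import snowflake.connector

# Hypothetical connection parameters -- not from the case study.
conn = snowflake.connector.connect(
    user="analyst",
    password="...",
    account="myorg-myaccount",
    warehouse="ANALYTICS_WH",
    database="GAMES",
    schema="RAW",
)
cur = conn.cursor()

# Land raw JSON in a single VARIANT column -- no lossy pre-flattening.
cur.execute("CREATE TABLE IF NOT EXISTS game_events (payload VARIANT)")

# Bulk-load the JSON files already staged in S3
# (an external stage named s3_event_stage is assumed to exist).
cur.execute("""
    COPY INTO game_events
    FROM @s3_event_stage
    FILE_FORMAT = (TYPE = 'JSON')
""")

# Query nested JSON fields directly with SQL path syntax -- the kind of
# work that previously needed map-reduce over MongoDB collections.
cur.execute("""
    SELECT payload:player_id::string AS player_id,
           COUNT(*) AS purchases
    FROM game_events
    WHERE payload:event_type::string = 'in_game_purchase'
    GROUP BY 1
    ORDER BY purchases DESC
    LIMIT 10
""")
for player_id, purchases in cur.fetchall():
    print(player_id, purchases)

cur.close()
conn.close()

The COPY INTO picks up the staged S3 files as-is, and the path syntax (payload:event_type) reaches into nested fields without any flattening ETL step in between.
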
Snowflake is appealing over traditional warehouses because it provides a software-as-a-service experience. It is elastic: storage and compute resources can be scaled independently and seamlessly, without impact on data availability or the performance of concurrent queries. It is a multi-cluster technology that is highly available. Each cluster acts as a virtual warehouse and serves a single user, who is never aware of the individual nodes in the cluster. Cluster sizes come in T-shirt sizes. Each such warehouse is a pure compute resource; storage is shared, and the compute instances work independently of the data volume. The execution engine is columnar, which may be better suited to analytics than row-wise storage. Intermediate results are not materialized; data is processed as if in a pipeline. It is not clear why the virtual clusters are required to be homogeneous rather than heterogeneous, where compute might scale up instead of scaling out. It is also not clear why the clusters cannot be outsourced to commodity cluster managers such as the Mesos and Marathon stack. Arguably, performance on par with relational counterparts requires a homogeneous architecture.
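
As an illustration of that elasticity, resizing a virtual warehouse is a single DDL statement against the compute layer only. Here is a hedged sketch, again with hypothetical names and sizes:

import snowflake.connector

# Hypothetical admin connection -- names are illustrative only.
conn = snowflake.connector.connect(
    user="admin", password="...", account="myorg-myaccount"
)
cur = conn.cursor()

# Create an extra-small compute cluster; storage is managed separately.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS REPORTING_WH
    WITH WAREHOUSE_SIZE = 'XSMALL'
         AUTO_SUSPEND = 60
         AUTO_RESUME = TRUE
""")

# Scale up for a heavy batch window; queries on other warehouses and the
# underlying data are unaffected because storage is shared.
cur.execute("ALTER WAREHOUSE REPORTING_WH SET WAREHOUSE_SIZE = 'LARGE'")

# Scale back down when the window closes.
cur.execute("ALTER WAREHOUSE REPORTING_WH SET WAREHOUSE_SIZE = 'XSMALL'")

cur.close()
conn.close()

Because the statement touches only the compute layer, the resize does not move or copy any of the shared data.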
