Cluster computing

Sunday, July 23, 2017

We were discussing the replacement of MongoDB with the Snowflake data warehouse. The challenges of scaling MongoDB, and the appeal of a solution that does not impose restrictions around limits, are best understood through the case study of DoubleDown and its migration of data from MongoDB to Snowflake. Snowflake's primary advantage is that it brings the cloud's elasticity and isolation capabilities to a warehouse: compute is added in the form of clusters that it calls virtual warehouses, while the storage is shared. Each virtual warehouse serves a single user, who is never aware of the individual nodes in the cluster. Each query is executed in a single cluster. Snowflake uses micro-partitions to store customer data securely and efficiently. It is appealing over traditional warehouses because it provides a software-as-a-service experience. Its query language is SQL, so the pace of development can be rapid. It loads JSON natively, so several lossy data transformations and ETL steps can be avoided. It is able to store highly granular staging data in S3, which makes it very effective to scale.
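To illustrate why loading JSON natively avoids lossy transformations, here is a minimal sketch in plain Python. It does not use Snowflake at all. The event record, the `FIXED_COLUMNS` schema, and both function names are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical event document, as a document store might hold it.
raw_event = json.dumps({
    "user_id": 42,
    "event": "spin",
    "payload": {"bet": 100, "win": 250,
                "bonus": {"type": "free_spin", "count": 3}},
})

# Hypothetical fixed schema that a traditional ETL step might target.
FIXED_COLUMNS = ("user_id", "event", "bet", "win")


def flatten_to_fixed_schema(doc: str) -> dict:
    """ETL-style transform: keep only the columns the schema knows about."""
    record = json.loads(doc)
    flat = {"user_id": record["user_id"], "event": record["event"]}
    # Payload fields outside the fixed schema are silently dropped here.
    flat.update({k: v for k, v in record["payload"].items()
                 if k in FIXED_COLUMNS})
    return flat


def load_natively(doc: str) -> dict:
    """Keep the whole document and extract fields only at query time."""
    return json.loads(doc)


flat_row = flatten_to_fixed_schema(raw_event)
native_row = load_natively(raw_event)

print(flat_row)                                 # nested "bonus" is gone
print(native_row["payload"]["bonus"]["count"])  # still reachable
```

The flattened row loses the nested `bonus` object because the fixed schema has no column for it, while the natively loaded document keeps every field available for later queries.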
While Hadoop and Spark are useful for scaling analytics in their own way, this technology brings the warehouse to the cloud. Since some of the warehouse capabilities are not transactional in nature, it eliminates transaction management during their execution. Moreover, queries are heavy on aggregation, so Snowflake allows these operators to make the best use of cloud memory and storage. Each warehouse feature, such as access control or the query optimizer, is implemented as a cloud service. The failure of individual service nodes does not cause data loss or loss of availability. Concurrency is handled in these cloud services with the help of snapshot isolation and multi-version concurrency control (MVCC).
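The idea behind snapshot isolation with MVCC can be shown with a toy in-memory store. This is only a sketch of the general technique, not Snowflake's implementation. The class names `VersionedStore` and `Snapshot` and the logical clock are assumptions made for illustration.

```python
class VersionedStore:
    """Toy multi-version store: a write appends a new version, never overwrites."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), in commit order
        self._clock = 0      # logical commit timestamp

    def write(self, key, value):
        """Commit a new version of key and return its commit timestamp."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        """Return a read-only view frozen at the current commit timestamp."""
        return Snapshot(self, self._clock)

    def read_as_of(self, key, ts):
        """Return the latest version of key committed at or before ts."""
        visible = [value for commit_ts, value in self._versions.get(key, [])
                   if commit_ts <= ts]
        return visible[-1] if visible else None


class Snapshot:
    """A reader that sees only versions committed before it was taken."""

    def __init__(self, store, ts):
        self._store = store
        self._ts = ts

    def read(self, key):
        return self._store.read_as_of(key, self._ts)


store = VersionedStore()
store.write("balance", 100)
reader = store.snapshot()      # a long-running query starts here
store.write("balance", 250)    # a concurrent write commits afterwards

old_value = reader.read("balance")
new_value = store.snapshot().read("balance")
print(old_value, new_value)
```

The earlier reader keeps seeing the value that was current when its snapshot was taken, while a new snapshot sees the later write, so readers and writers do not block each other.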
Snowflake provides the warehouse in a software-as-a-service manner. It is this model that is interesting for a text summarization service. The offering has end-to-end security, and, as noted, Snowflake uses micro-partitions to store customer data securely and efficiently. Another similarity is that Snowflake is not a full-service model. Instead, it is meant as a component in workflows, often replacing the components associated with the use of, say, MongoDB in an enterprise. The interface for Snowflake is SQL, which is widely accepted. The summarization service does not have this benefit, but it is meant to participate in workflows with the help of REST APIs. If the SQL standard is enhanced at some point to include text analysis, then the summarization service can be enhanced to support it as well.
