Monday, July 24, 2017

Yesterday we were discussing the design of the Snowflake data warehouse. We continue with the discussion of some of its features. In particular, we review the storage versus compute considerations for a warehouse offered as a cloud service. The shared-nothing architecture is not specific to warehouses and is widely accepted as a standard for scalability and cost-effectiveness. Built from commodity hardware and scaled out by adding nodes, this design gives every node the same duties on the same hardware. With contention minimized and the processing homogeneous, it becomes easy to scale out to larger and larger workloads. Every query processing node has its own local disks. Tables are horizontally partitioned across nodes, and each node is responsible only for the rows on its local disks. This works particularly well for a star schema because the fact table is large and partitioned while the dimension tables are small.

The main drawback of this design is that compute and storage are now tightly coupled within the cluster. First, workloads are rarely homogeneous, so forcing the hardware to be homogeneous results in low average utilization; some workloads are highly compute intensive while others are not. Second, if membership changes because nodes fail or the system is resized, all their associated data gets reassigned. This transfer is usually done by the same nodes that are also performing query processing, which limits elasticity and availability. Third, every node is a liability when it comes to upgrades or system resizes. Upgrades done with little or no downtime would not affect query processing, but this design makes such online upgrades difficult. Replication mitigates the reliance on a single copy of data and lets failed nodes be rebuilt, but it adds to the processing and still assumes the nodes are homogeneous. This matters even more in cloud deployments than on-premise ones, since cloud compute may be heterogeneous and failures are far more frequent.

Snowflake works around this by keeping separate compute and storage layers, where the storage could be any cloud blob store; in this case it relies on Amazon S3. By letting the data be remote instead of local to the compute nodes, the local disk space is no longer used for replicating the base data, and a compute node can use it to cache some table data instead.
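To make the shared-nothing partitioning concrete before moving on, here is a minimal sketch in Python. The node count, hash scheme, and sample rows are all hypothetical; real systems partition at a much coarser granularity than individual rows.

NUM_NODES = 3

def node_for_row(row_key):
    # Hash-partition on the row key so every node gets a similar share of rows.
    return hash(row_key) % NUM_NODES

# Hypothetical fact rows: (order_id, date, amount)
fact_rows = [(1, "2017-07-01", 100.0), (2, "2017-07-02", 250.0), (3, "2017-07-03", 75.0)]

# Each node "owns" only the rows assigned to it, mimicking local-disk partitions.
partitions = {n: [] for n in range(NUM_NODES)}
for row in fact_rows:
    partitions[node_for_row(row[0])].append(row)

# A query over the fact table is answered by every node scanning just its own
# partition; small dimension tables would be replicated to every node instead.
for node, rows in sorted(partitions.items()):
    print("node", node, "scans", rows)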
With the compute separated from storage, it now becomes possible to run more than one compute cluster against the same data, or even to dedicate a cluster to a single microservice.
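As a toy sketch of that separation, and not Snowflake's actual implementation, the snippet below stands an in-memory dict in for the blob store and makes up the class and file names. Independent compute clusters read the same table files from the shared store and each keeps its own local cache.

# Shared blob storage (standing in for S3): table files keyed by path.
SHARED_BLOB_STORE = {"orders/part-0": b"...row data...", "orders/part-1": b"...row data..."}

class ComputeCluster:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # local disk cache of remote table files

    def read(self, key):
        if key not in self.cache:              # cache miss: fetch from remote storage
            self.cache[key] = SHARED_BLOB_STORE[key]
        return self.cache[key]

# Two clusters scale independently yet see the same base data.
etl = ComputeCluster("etl")
reporting = ComputeCluster("reporting")
etl.read("orders/part-0")
reporting.read("orders/part-0")  # fetched into its own cache, not shared

Because each cluster only caches what it reads, clusters can be added, resized, or retired without reshuffling the base data.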

#codingexercise
http://ideone.com/TqZ8jQ
Another technique is to enumerate all the increasing subsequences and then apply a filter.
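The linked solution is not reproduced here, so the following is only a sketch of that enumerate-then-filter idea: it generates every strictly increasing subsequence of an array, then filters by a minimum length of 2 as a stand-in for whatever predicate the exercise actually requires.

def increasing_subsequences(arr):
    # Depth-first enumeration: extend each partial subsequence with any
    # later element that keeps it strictly increasing.
    results = []
    def extend(start, current):
        if current:
            results.append(list(current))
        for i in range(start, len(arr)):
            if not current or arr[i] > current[-1]:
                current.append(arr[i])
                extend(i + 1, current)
                current.pop()
    extend(0, [])
    return results

# Enumerate first, then filter: keep only subsequences of length >= 2.
subs = increasing_subsequences([4, 6, 7, 7])
print([s for s in subs if len(s) >= 2])

This runs in exponential time since there can be exponentially many increasing subsequences, which is why the enumerate-then-filter approach is only practical for small inputs.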
