Cluster computing

Wednesday, July 26, 2017

We were discussing cloud services and their compute and storage requirements, and we mentioned that services are granular. Today we continue the discussion of Snowflake's cloud services from their whitepaper. The execution engine for Snowflake is columnar, vectorized and push-based. Columnar storage suits analytical workloads because it makes more effective use of CPU caches and SIMD instructions. Vectorized execution means data is processed in a pipelined fashion, in batches of a few thousand rows in columnar format at a time. It differs from MapReduce in that it does not materialize intermediate results. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data. This removes control flow from tight loops. Query plans are not just trees; they can also be DAG-shaped, and push operators handle such plans efficiently.

Much of the overhead of traditional query processing is absent in Snowflake. There is no need for transaction management during execution because queries are executed against a fixed set of immutable files. There is also no buffer pool. A buffer pool is traditionally used for table buffering, but since queries can scan large amounts of data, the memory is put to better use by the operators. All major operators are allowed to spill to disk and recurse when memory is exhausted. Many analytical workloads require large joins or aggregations, and spilling lets these run even when they do not fit in main memory.

The Cloud Services layer is heavily multi-tenant. Each Snowflake service in this layer is shared across many users. This improves utilization of the nodes and reduces administrative overhead. Running a query efficiently on a few nodes is preferable to running it inefficiently on hundreds of nodes. Scale-out is important, but per-node efficiency matters too.
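To make the push-based, batch-at-a-time idea concrete, here is a minimal sketch in C#. It is not Snowflake's implementation; the names IPushOperator, FilterGreaterThan, SumSink and PushDemo are hypothetical and chosen only for illustration. The scan loop drives the pipeline by pushing small batches of a column to a filter, which pushes the surviving values to a sink. No operator ever asks its upstream for the next row.

```csharp
using System;

// Hypothetical interface: an operator receives batches pushed from upstream.
interface IPushOperator
{
    void Push(int[] batch, int count); // only the first 'count' values are valid
    void Finish();                     // signals that no more batches will arrive
}

// Keeps values greater than a threshold and pushes the survivors downstream.
sealed class FilterGreaterThan : IPushOperator
{
    private readonly int threshold;
    private readonly IPushOperator downstream;

    public FilterGreaterThan(int threshold, IPushOperator downstream)
    {
        this.threshold = threshold;
        this.downstream = downstream;
    }

    public void Push(int[] batch, int count)
    {
        var output = new int[count];
        int kept = 0;
        // Tight loop over one column batch; no per-row call into another operator.
        for (int i = 0; i < count; i++)
        {
            if (batch[i] > threshold)
            {
                output[kept++] = batch[i];
            }
        }
        if (kept > 0)
        {
            downstream.Push(output, kept);
        }
    }

    public void Finish()
    {
        downstream.Finish();
    }
}

// Terminal operator: accumulates a running sum instead of storing the batches.
sealed class SumSink : IPushOperator
{
    public long Total { get; private set; }
    public bool Done { get; private set; }

    public void Push(int[] batch, int count)
    {
        for (int i = 0; i < count; i++)
        {
            Total += batch[i];
        }
    }

    public void Finish()
    {
        Done = true;
    }
}

static class PushDemo
{
    // The scan drives execution: it slices the column into batches and pushes them.
    public static long Run(int[] column, int batchSize, int threshold)
    {
        var sink = new SumSink();
        IPushOperator pipeline = new FilterGreaterThan(threshold, sink);
        var batch = new int[batchSize];
        for (int offset = 0; offset < column.Length; offset += batchSize)
        {
            int count = Math.Min(batchSize, column.Length - offset);
            Array.Copy(column, offset, batch, 0, count);
            pipeline.Push(batch, count);
        }
        pipeline.Finish();
        return sink.Total;
    }
}
```

Calling PushDemo.Run(new[] { 1, 5, 10, 3, 8 }, 2, 4) pushes three batches through the filter and sums the values greater than 4. A real engine would use batches of thousands of rows and reuse output buffers; the tiny batch size here only keeps the trace easy to follow.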
#codingexercise
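using System.Collections.Generic;
using System.Linq;

// Counts the strictly increasing subsequences of A.
static int GetCountIncreasingSequences(List<int> A)
{
    // dp[i] = number of increasing subsequences that end at A[i]
    int[] dp = new int[A.Count];
    for (int i = 0; i < A.Count; i++)
    {
        dp[i] = 1; // the single element A[i] by itself
        for (int j = 0; j < i; j++)
        {
            if (A[j] < A[i])
            {
                dp[i] += dp[j]; // extend every subsequence ending at A[j]
            }
        }
    }
    return dp.Sum();
}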
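One way to gain confidence in the recurrence above is to compare it with a brute-force count on small inputs. The sketch below is an illustration only: the class name IncreasingSubsequenceCheck and the method names CountByDp and CountByEnumeration are hypothetical. CountByDp repeats the same recurrence as the exercise, and CountByEnumeration walks every non-empty subset of positions with a bitmask, so it is only practical for short lists.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class IncreasingSubsequenceCheck
{
    // Same recurrence as the exercise: dp[i] counts subsequences ending at a[i].
    public static int CountByDp(List<int> a)
    {
        var dp = new int[a.Count];
        for (int i = 0; i < a.Count; i++)
        {
            dp[i] = 1;
            for (int j = 0; j < i; j++)
            {
                if (a[j] < a[i])
                {
                    dp[i] += dp[j];
                }
            }
        }
        return dp.Sum();
    }

    // Brute force: each bit of 'mask' selects one position of the list.
    public static int CountByEnumeration(List<int> a)
    {
        int count = 0;
        for (int mask = 1; mask < (1 << a.Count); mask++)
        {
            bool increasing = true;
            bool hasPrevious = false;
            int previous = 0;
            for (int i = 0; i < a.Count && increasing; i++)
            {
                if ((mask & (1 << i)) == 0)
                {
                    continue; // position i is not part of this subset
                }
                if (hasPrevious && a[i] <= previous)
                {
                    increasing = false;
                }
                previous = a[i];
                hasPrevious = true;
            }
            if (increasing)
            {
                count++;
            }
        }
        return count;
    }
}
```

For the list 1, 2, 3 both methods count the subsequences 1, 2, 3, 12, 13, 23 and 123. The DP runs in quadratic time while the enumeration is exponential, so the brute force is useful only as a test oracle.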
