Cluster computing

Friday, July 28, 2017

We were discussing Snowflake cloud services from their whitepaper. The Snowflake execution engine is columnar, vectorized, and push-based. Columnar storage and execution suit analytical workloads because they make more effective use of CPU caches and SIMD instructions. Vectorized execution means that data is processed in a pipelined fashion, in batches of a few thousand rows, without materializing intermediate results the way MapReduce does. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data. This removes control-flow logic from tight loops, which improves cache efficiency.
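To make the push-based, batch-at-a-time idea concrete, here is a minimal sketch in C#. It is not Snowflake's engine: the names IOperator, FilterGreaterThan, SumSink and PushPipelineDemo are hypothetical and exist only for this illustration. A scan cuts one column into fixed-size batches and pushes each batch into a filter, which pushes the surviving values into an aggregate. Nothing is materialized between operators, and the inner loops run over a whole batch without a per-row call back into the parent operator.

```csharp
using System;

// Hypothetical operator contract for this sketch: the upstream side calls Push.
interface IOperator
{
    void Push(int[] batch, int count); // only the first 'count' entries are valid
    void Finish();                     // signals that no more batches will arrive
}

// Keeps values greater than a threshold and pushes them downstream.
sealed class FilterGreaterThan : IOperator
{
    private readonly int threshold;
    private readonly IOperator downstream;

    public FilterGreaterThan(int threshold, IOperator downstream)
    {
        this.threshold = threshold;
        this.downstream = downstream;
    }

    public void Push(int[] batch, int count)
    {
        var selected = new int[count];
        int n = 0;
        // Tight loop over the batch: no operator calls inside it.
        for (int i = 0; i < count; i++)
        {
            if (batch[i] > threshold) selected[n++] = batch[i];
        }
        if (n > 0) downstream.Push(selected, n);
    }

    public void Finish()
    {
        downstream.Finish();
    }
}

// Terminal operator that accumulates a running sum.
sealed class SumSink : IOperator
{
    public long Total { get; private set; }
    public bool Done { get; private set; }

    public void Push(int[] batch, int count)
    {
        for (int i = 0; i < count; i++) Total += batch[i];
    }

    public void Finish()
    {
        Done = true;
    }
}

static class PushPipelineDemo
{
    // Scans one column in batches and drives the pipeline from the source.
    public static long SumWhereGreaterThan(int[] column, int threshold, int batchSize)
    {
        if (column == null) throw new ArgumentNullException(nameof(column));
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var sink = new SumSink();
        IOperator root = new FilterGreaterThan(threshold, sink);
        var buffer = new int[batchSize];

        for (int start = 0; start < column.Length; start += batchSize)
        {
            int count = Math.Min(batchSize, column.Length - start);
            Array.Copy(column, start, buffer, 0, count);
            root.Push(buffer, count);
        }
        root.Finish();
        return sink.Total;
    }
}
```

In a pull-based design the sink would instead ask the filter for its next row, and the filter would ask the scan, so every row would cross two call boundaries. Here the source drives the pipeline and each operator sees a whole batch at a time.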
Can Snowflake replace Splunk? This is probably unlikely because a warehouse and a time series database serve different purposes. Moreover, Splunk is lightweight enough to run on desktops and appliances. That said, Snowflake can perform time travel. Let us take a closer look at this. Snowflake implements snapshot isolation on top of multi-version concurrency control. Table files are immutable, so a write does not change a file in place. Instead, a copy-on-write occurs: each write produces a new version of the table by adding and removing whole files relative to the prior version. When files are removed by a new version, they are retained for a configurable duration. Time travel in this case means walking through the different versions of the data. This is done in SQL with the AT or BEFORE keywords. Timestamps can be absolute, relative to the current time, or relative to previous statements. This is similar to change data capture in SQL Server in that both give us a historical record of all the changes, except that we get there differently: Snowflake retains the older file versions, whereas change data capture records the changes in separate change tables.
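The versioning scheme can be pictured with a toy model in C#. This is a sketch under stated assumptions, not how Snowflake stores its metadata: TableVersion and VersionedTable are hypothetical names, a version is just a set of file names with a commit time, and commits are assumed to arrive in chronological order. At returns the newest version committed at or before a timestamp, and Before returns the newest version committed strictly before it, which mirrors the inclusive and exclusive meaning of the two keywords.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical toy model: one table version is an immutable list of file names.
sealed class TableVersion
{
    public DateTime CommittedAt { get; }
    public IReadOnlyList<string> Files { get; }

    public TableVersion(DateTime committedAt, IEnumerable<string> files)
    {
        CommittedAt = committedAt;
        // Sorted copy so the version cannot change after it is created.
        Files = files.Distinct().OrderBy(f => f, StringComparer.Ordinal).ToList();
    }
}

sealed class VersionedTable
{
    // Assumption: versions are appended in commit-time order.
    private readonly List<TableVersion> versions = new List<TableVersion>();

    // Copy-on-write commit: previous file set, minus removed files, plus added files.
    // Older versions stay in the list, which is what makes time travel possible.
    public void Commit(DateTime at, IEnumerable<string> added, IEnumerable<string> removed)
    {
        var files = new HashSet<string>();
        if (versions.Count > 0) files.UnionWith(versions[versions.Count - 1].Files);
        files.ExceptWith(removed);
        files.UnionWith(added);
        versions.Add(new TableVersion(at, files));
    }

    // Inclusive lookup: newest version committed at or before the timestamp (null if none).
    public TableVersion At(DateTime timestamp)
    {
        return versions.LastOrDefault(v => v.CommittedAt <= timestamp);
    }

    // Exclusive lookup: newest version committed strictly before the timestamp (null if none).
    public TableVersion Before(DateTime timestamp)
    {
        return versions.LastOrDefault(v => v.CommittedAt < timestamp);
    }
}
```

A real system would also drop versions older than the retention period. That step is left out here to keep the sketch short.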
#codingexercise
Find the length of the longest run of consecutive integers that can be formed from the elements of a given array. For example, [100, 4, 200, 1, 3, 2] gives 4 because of 1, 2, 3, 4.
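int GetLongest(List<int> A)
{
    if (A == null || A.Count == 0) return 0;
    if (A.Count == 1) return 1;
    A.Sort(); // sorts in place, so the caller's list is reordered
    int max = 1; // longest run seen so far
    int cur = 1; // length of the run ending at the current element
    for (int i = 1; i < A.Count; i++)
    {
        if (A[i] == A[i-1])
        {
            continue; // a duplicate neither extends nor breaks the run
        }
        if (A[i-1] + 1 == A[i])
        {
            cur = cur + 1;
        }
        else
        {
            max = Math.Max(max, cur);
            cur = 1;
        }
    }
    max = Math.Max(max, cur); // account for a run that ends at the last element
    return max;
}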
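
The exercise above sorts the list, which costs O(n log n) and reorders the caller's data. As an alternative, and not the approach used above, here is a sketch that uses a hash set for expected O(n) time and leaves the input untouched. The names ConsecutiveRun and GetLongestWithSet are made up for this example.

```csharp
using System;
using System.Collections.Generic;

static class ConsecutiveRun
{
    // Returns the length of the longest run of consecutive integers among the values.
    public static int GetLongestWithSet(IEnumerable<int> values)
    {
        if (values == null) return 0;

        var seen = new HashSet<int>(values); // duplicates collapse here
        int best = 0;

        foreach (int v in seen)
        {
            // Skip values that are not the start of a run.
            if (v != int.MinValue && seen.Contains(v - 1)) continue;

            // Walk forward from the start of the run.
            int length = 1;
            int next = v;
            while (next != int.MaxValue && seen.Contains(next + 1))
            {
                next++;
                length++;
            }
            best = Math.Max(best, length);
        }
        return best;
    }
}
```

Each value is visited by the inner loop only as part of the run it belongs to, so the total work stays linear in the number of distinct values.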
