Cluster computing

Friday, August 4, 2017

In today's post we complete the conclusion of Snowflake data warehouse architecture. The engine for Snowflake is columnar, vectorized and push-based. The columnar storage is suitable for analytical workloads because it makes more effective use of CPU caches and SIMD instructions. Vectorized execution means data is processed in a pipelined fashion without intermediary results as in map-reduce. The Push-based execution means that the relational operators push their results to their downstream operators, rather than waiting for these operators to pull data. It removes control flow from tight loops.Data is encrypted in transit and before being written to storage. Key management is supported with key hierarchy so that the keys can be rotated and re-encrypted. Encryption and key management together complete the security. By using a hierarchy , we reduce the scope of the keys and the data to be secured.
Snowflake introduced three data types variant, array and objects. These enabled it to be a document store as well. with the gelp of these data types, it introduced msssive efficiencies in data ingestion. This "schema later" approach also allowed it to be parse and transform later.
Snowflake performs concurrency control using snapshot isolation implemented over multi-version concurrency control which means a copy of every changed database object is preserved for some duration. The table files are stored in S3 and are therefore immutable. Therefore write operations on the table produce newer versions of the table and file operations are tracked in the metadata. This makes MVCC a natural choice for concurrency control.
To limit the data access that is relevant to a given query, Snowflake performs min-max based pruning, which is also known as small materialized aggregates, zone maps and data skipping. The system maintains data distribution for a given chunk of data as min and max values of the chunk so that the system can discard certain chunks that are not needed for a query. This is analogous to key ranges in B+ trees.
With these features, Snowflake supports pure software as a service experience and continuous availability. It differs from other major vendors such as Google Cloud platform which offers BigQuery service but where such service requires append-only data as well as schema, Snowflake offers full ACID transactions and not require schemas.
Microsoft SQL data warehouse also separates compute and storage but it requires administrative tasks and limits the number of queries executed concurrently. Moreover it supports non-relational data with PolyBase unlike Snowflakes' built in data types.
#codingexercise
Given a sorted array of words, find the order of characters
List<Char> GetAlphabeticalOrder(List<string> words, int alphabetSize)
{
Graph g(alphabetSize);
for (int i = 0; i < words.Count-1; i++)
{
var word1 = words[i];
var word2 = words[i+1];
Tuple<char, char> t = GetFirstMismatch(word1, word2);
if (t != null)
g.addEdge(t.first, t.second);
}
var ret = new List<Char>();
g.topologicalSort(ref ret);
return ret;
}

Cluster computing

Friday, August 4, 2017

No comments:

Post a Comment