Cluster computing

Friday, August 4, 2017

In today's post we conclude our discussion of the Snowflake data warehouse architecture. The engine for Snowflake is columnar, vectorized and push-based. Columnar storage suits analytical workloads because it makes more effective use of CPU caches and SIMD instructions. Vectorized execution means data is processed in a pipelined fashion, in batches, without materializing intermediate results as MapReduce does. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data, which removes control flow from tight loops. Data is encrypted in transit and before being written to storage. Key management uses a key hierarchy so that keys can be rotated and data re-encrypted. Encryption and key management together complete the security. By using a hierarchy, we reduce the scope of each key and the amount of data it secures.
Snowflake introduced three data types: variant, array and object. These enabled it to serve as a document store as well. With the help of these data types, it introduced massive efficiencies in data ingestion. This "schema later" approach also allowed data to be parsed and transformed later.
Snowflake performs concurrency control using snapshot isolation implemented over multi-version concurrency control (MVCC), which means a copy of every changed database object is preserved for some duration. The table files are stored in S3 and are therefore immutable. Write operations on a table therefore produce a newer version of the table, and the file additions and removals are tracked in the metadata. This makes MVCC a natural choice for concurrency control.
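The idea of table versions built from immutable files can be sketched roughly as follows. This is only an illustration: the `Table` class, its method names and the file names are hypothetical and say nothing about Snowflake's actual metadata format.

```python
class Table:
    """Hypothetical table whose versions are immutable sets of file names."""

    def __init__(self):
        # Version 0 is the empty table.
        self.versions = [frozenset()]

    def commit(self, added=(), removed=()):
        """Record a new version; older versions are kept untouched."""
        new_files = (self.versions[-1] - frozenset(removed)) | frozenset(added)
        self.versions.append(new_files)
        return len(self.versions) - 1

    def snapshot(self):
        """Return the version number a new reader would see."""
        return len(self.versions) - 1

    def files_at(self, version):
        """Return the files visible to a reader holding the given snapshot."""
        return self.versions[version]


table = Table()
table.commit(added={"f1", "f2"})
reader_snapshot = table.snapshot()
# A later write replaces f1 with f3 but does not disturb the reader's view.
table.commit(added={"f3"}, removed={"f1"})
```

A reader that took its snapshot before the second commit still sees `f1` and `f2`, while new readers see `f2` and `f3`.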
To limit data access to what is relevant to a given query, Snowflake performs min-max based pruning, which is also known as small materialized aggregates, zone maps and data skipping. The system records the data distribution of each chunk of data as the min and max values of that chunk, so that it can discard chunks that cannot contain rows needed by a query. This is analogous to key ranges in B+ trees.
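A minimal sketch of min-max pruning, assuming a single integer column and a range predicate, might look like this. The `Chunk` class and the `prune` and `scan` functions are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical non-empty chunk of one column with its min/max summary."""
    values: List[int]

    def __post_init__(self):
        # The summary is computed once, when the chunk is written.
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def prune(chunks, low, high):
    """Keep only chunks whose [min, max] range overlaps [low, high]."""
    return [c for c in chunks if c.max_value >= low and c.min_value <= high]


def scan(chunks, low, high):
    """Answer low <= value <= high by reading only the surviving chunks."""
    return [v for c in prune(chunks, low, high)
            for v in c.values if low <= v <= high]


chunks = [Chunk([1, 5, 9]), Chunk([10, 14, 19]), Chunk([20, 27, 31])]
matches = scan(chunks, 12, 15)
```

For the predicate 12 to 15, only the middle chunk overlaps the range, so the other two chunks are never read.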
With these features, Snowflake supports a pure software-as-a-service experience and continuous availability. It differs from other major vendors such as Google Cloud Platform, whose BigQuery service requires append-only data as well as schemas, whereas Snowflake offers full ACID transactions and does not require schemas.
Microsoft SQL Data Warehouse also separates compute and storage, but it requires administrative tasks and limits the number of queries executed concurrently. Moreover, it supports non-relational data through PolyBase, unlike Snowflake's built-in data types.
#codingexercise
Given a sorted array of words, find the order of characters
List<char> GetAlphabeticalOrder(List<string> words, int alphabetSize)
{
    // Graph and GetFirstMismatch are helpers assumed to be defined elsewhere.
    var g = new Graph(alphabetSize);
    for (int i = 0; i < words.Count - 1; i++)
    {
        var word1 = words[i];
        var word2 = words[i + 1];
        // The first differing characters of adjacent words give one ordering edge.
        Tuple<char, char> t = GetFirstMismatch(word1, word2);
        if (t != null)
            g.addEdge(t.Item1, t.Item2);
    }
    var ret = new List<char>();
    g.topologicalSort(ref ret);
    return ret;
}
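Since the exercise above leaves the graph and the mismatch helper undefined, here is a self-contained sketch of the same idea. It is an assumption-laden illustration: the function names are hypothetical, and the topological sort uses Kahn's algorithm, which may differ from whatever the original `Graph` class does.

```python
from collections import defaultdict, deque


def first_mismatch(word1, word2):
    """Return the first pair of differing characters, or None if there is none."""
    for a, b in zip(word1, word2):
        if a != b:
            return a, b
    return None


def alphabetical_order(words):
    """Derive a character order consistent with the sorted word list."""
    chars = {ch for word in words for ch in word}
    successors = defaultdict(set)
    indegree = {ch: 0 for ch in chars}
    for w1, w2 in zip(words, words[1:]):
        pair = first_mismatch(w1, w2)
        if pair and pair[1] not in successors[pair[0]]:
            successors[pair[0]].add(pair[1])
            indegree[pair[1]] += 1

    # Kahn's algorithm: repeatedly emit characters with no remaining predecessors.
    queue = deque(sorted(ch for ch in chars if indegree[ch] == 0))
    order = []
    while queue:
        ch = queue.popleft()
        order.append(ch)
        for nxt in sorted(successors[ch]):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(chars):
        raise ValueError("the word list implies a cyclic character order")
    return order


order = alphabetical_order(["baa", "abcd", "abca", "cab", "cad"])
```

When several orders are consistent with the input, the sorting of ties here simply makes the result deterministic.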

Thursday, August 3, 2017

In previous blog posts we have been discussing Snowflake computing. Today we conclude our discussion. The engine for Snowflake is columnar, vectorized and push-based. Columnar storage suits analytical workloads because it makes more effective use of CPU caches and SIMD instructions. Vectorized execution means data is processed in a pipelined fashion, in batches, without materializing intermediate results as MapReduce does. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data, which removes control flow from tight loops. Data is encrypted in transit and before being written to storage. Key management uses a key hierarchy so that keys can be rotated and data re-encrypted. Encryption and key management together complete the security. By using a hierarchy, we reduce the scope of each key and the amount of data it secures.
Snowflake introduced three data types: variant, array and object. These enabled it to serve as a document store as well. With the help of these data types, it introduced massive efficiencies in data ingestion. This "schema later" approach also allowed data to be parsed and transformed later.
#codingexercise
Find the length of the longest subsequence of consecutive integers in a given array
int GetLongest(List<int> A)
{
    if (A == null || A.Count == 0) return 0;
    if (A.Count == 1) return 1;
    A.Sort(); // note: sorts the caller's list in place
    int max = 1;
    int cur = 1;
    for (int i = 1; i < A.Count; i++)
    {
        if (A[i-1] + 1 == A[i])
        {
            cur = cur + 1;
        }
        else if (A[i-1] != A[i]) // a duplicate neither extends nor breaks the run
        {
            max = Math.Max(max, cur);
            cur = 1;
        }
    }
    max = Math.Max(max, cur);
    return max;
}

Wednesday, August 2, 2017

Today we continue the discussion on Snowflake computing. The engine for Snowflake is columnar, vectorized and push-based. Columnar storage suits analytical workloads because it makes more effective use of CPU caches and SIMD instructions. Vectorized execution means data is processed in a pipelined fashion, in batches, without materializing intermediate results as MapReduce does. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data, which removes control flow from tight loops. Data is encrypted in transit and before being written to storage. Key management uses a key hierarchy so that keys can be rotated and data re-encrypted. Encryption and key management together complete the security. By using a hierarchy, we reduce the scope of each key and the amount of data it secures.
Snowflake stores both semi-structured and schemaless data. Three data types were added for this: variant, array and object. The variant type can hold any native data type as well as arrays and objects, so it can store documents. Arrays and objects are specializations of variants.
The variant data type is a self-describing binary serialization which supports fast key-value lookup as well as efficient type tests, comparison and hashing. The variant type also helps Snowflake perform Extract-Load-Transform (ELT) instead of Extract-Transform-Load (ETL) operations. We saw that this significantly reduced the data ingestion time for a customer of Snowflake as compared to the existing processes using MongoDB. The ability to load JSON directly while deferring parsing or type inference is called the "schema later" approach. This approach decouples the producers and consumers. In a traditional warehouse, changes to the schema required coordination between departments and time-consuming operations.
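The "schema later" idea can be illustrated with plain JSON handling. This sketch is not Snowflake's variant type; the sample documents and the `get_path` helper are hypothetical, and the point is only that loading keeps each document whole while paths are resolved at query time.

```python
import json

# Raw documents as a producer might emit them; note the second one has an extra field.
raw_rows = [
    '{"id": 1, "user": {"name": "ann"}, "tags": ["a", "b"]}',
    '{"id": 2, "user": {"name": "bob", "age": 41}}',
]

# Load step: keep every document as-is, with no schema declared up front.
table = [json.loads(row) for row in raw_rows]


def get_path(doc, path, default=None):
    """Follow a dotted path such as 'user.age' through nested dictionaries."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Transform step: performed later, by the consumer, at query time.
names = [get_path(doc, "user.name") for doc in table]
ages = [get_path(doc, "user.age") for doc in table]
```

The producer could add the `age` field without any change on the loading side; only consumers that care about it need to ask for that path.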
#codingexercise
Find the sum of first n magic numbers
Magic numbers can be expressed as a power of 5 or a sum of unique powers of 5. The nth magic number follows the binary representation of n: 001, 010, 011 and so on correspond to 5, 25, 30 and so on.
Therefore,

int GetMagicN(int n)
{
    int power = 1;
    int result = 0;
    // Bit i of n (counting from 0) selects the power 5^(i+1).
    while (n > 0)
    {
        power = power * 5;
        if ((n & 1) != 0)
            result += power;
        n >>= 1;
    }
    return result;
}
long GetMagicSum(int k)
{
    long result = 0;
    for (int i = 1; i <= k; i++)
        result += GetMagicN(i); // add the i-th magic number, not the k-th each time
    return result;
}

Tuesday, August 1, 2017

Monday, July 31, 2017

Today we continue the discussion on Snowflake architecture. The engine for Snowflake is columnar, vectorized and push-based. Columnar storage suits analytical workloads because it makes more effective use of CPU caches and SIMD instructions. Vectorized execution means data is processed in a pipelined fashion, in batches, without materializing intermediate results as MapReduce does. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data, which removes control flow from tight loops. Data is encrypted in transit and before being written to storage. Key management uses a key hierarchy so that keys can be rotated and data re-encrypted. Encryption and key management together complete the security. By using a hierarchy, we reduce the scope of each key and the amount of data it secures. Encryption keys go through four stages in their life cycle. First they are created, then they are used to encrypt or decrypt, then they are marked as no longer in use, and finally they are decommissioned. Keys are rotated at periodic intervals. Retired keys can still be used to decrypt data, but only the new ones are used to encrypt. Before a retired key is destroyed, data is re-encrypted with the latest key. This is called rekeying.
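The key life cycle described above can be sketched as a small state machine. Everything here is illustrative: the `KeyRing` class and its methods are hypothetical, and the XOR routine is a toy stand-in for a real cipher that must not be used to protect data.

```python
import os
from enum import Enum


class KeyState(Enum):
    ACTIVE = "active"        # created and used to encrypt and decrypt
    RETIRED = "retired"      # no longer encrypts, may still decrypt
    DESTROYED = "destroyed"  # decommissioned, unusable


def xor_bytes(data, key):
    """Toy stand-in for a real cipher; NOT secure."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


class KeyRing:
    def __init__(self):
        self.keys = {}    # key id -> key bytes
        self.states = {}  # key id -> KeyState
        self.active_id = None
        self.rotate()

    def rotate(self):
        """Retire the active key, if any, and create a new active key."""
        if self.active_id is not None:
            self.states[self.active_id] = KeyState.RETIRED
        new_id = len(self.keys)
        self.keys[new_id] = os.urandom(16)
        self.states[new_id] = KeyState.ACTIVE
        self.active_id = new_id

    def encrypt(self, plaintext):
        """Encrypt with the active key only; return (key id, ciphertext)."""
        return self.active_id, xor_bytes(plaintext, self.keys[self.active_id])

    def decrypt(self, key_id, ciphertext):
        if self.states[key_id] is KeyState.DESTROYED:
            raise KeyError("key %d has been destroyed" % key_id)
        return xor_bytes(ciphertext, self.keys[key_id])

    def rekey_and_destroy(self, key_id, records):
        """Re-encrypt records held under a retired key, then destroy that key."""
        if self.states[key_id] is not KeyState.RETIRED:
            raise ValueError("only retired keys can be rekeyed")
        rekeyed = [self.encrypt(self.decrypt(k, c)) if k == key_id else (k, c)
                   for k, c in records]
        self.states[key_id] = KeyState.DESTROYED
        self.keys[key_id] = b""
        return rekeyed
```

A caller would encrypt under the active key, call `rotate()` on a schedule, and call `rekey_and_destroy()` once a retired key is due for decommissioning.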
Generally, key rotation and changes to compute resources require data redistribution. However, Snowflake allows users to scale up or down and even pause resources without any data movement.
Snowflake draws inspiration from BigQuery, Google's approach to fast SQL processing at very large scale. However, BigQuery does not adhere strictly to SQL, and its tables are append-only and require schemas. Snowflake provides ACID guarantees and full DML and does not require schemas for semi-structured data.
#codingexercise
Find the length of the longest subsequence of consecutive integers in a given array
int GetLongest(List<int> A)
{
    if (A == null || A.Count == 0) return 0;
    if (A.Count == 1) return 1;
    A.Sort(); // note: sorts the caller's list in place
    int max = 1;
    int cur = 1;
    for (int i = 1; i < A.Count; i++)
    {
        if (A[i-1] + 1 == A[i])
        {
            cur = cur + 1;
        }
        else if (A[i-1] != A[i]) // a duplicate neither extends nor breaks the run
        {
            max = Math.Max(max, cur);
            cur = 1;
        }
    }
    max = Math.Max(max, cur);
    return max;
}

Sunday, July 30, 2017

Today we continue the discussion on Snowflake architecture. The engine for Snowflake is columnar, vectorized and push-based. Columnar storage suits analytical workloads because it makes more effective use of CPU caches and SIMD instructions. Vectorized execution means data is processed in a pipelined fashion, in batches, without materializing intermediate results as MapReduce does. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data, which removes control flow from tight loops.
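A rough sketch of push-based, batch-at-a-time execution follows. The operator classes and the `scan` function are hypothetical and far simpler than a real engine; they only show that each operator hands a whole batch to its downstream operator instead of being asked for one row at a time.

```python
class Sink:
    """Final operator: collects whatever is pushed into it."""

    def __init__(self):
        self.rows = []

    def push(self, batch):
        self.rows.extend(batch)


class Project:
    """Applies a function to every value in a batch, then pushes downstream."""

    def __init__(self, fn, downstream):
        self.fn = fn
        self.downstream = downstream

    def push(self, batch):
        self.downstream.push([self.fn(value) for value in batch])


class Filter:
    """Keeps values that satisfy a predicate, then pushes downstream."""

    def __init__(self, predicate, downstream):
        self.predicate = predicate
        self.downstream = downstream

    def push(self, batch):
        kept = [value for value in batch if self.predicate(value)]
        if kept:
            self.downstream.push(kept)


def scan(batches, downstream):
    """Source operator: drives the pipeline by pushing each batch in turn."""
    for batch in batches:
        downstream.push(batch)


sink = Sink()
pipeline = Filter(lambda x: x % 2 == 0, Project(lambda x: x * 10, sink))
scan([[1, 2, 3], [4, 5, 6]], pipeline)
```

No intermediate result is stored between the filter and the projection; each batch flows through the whole pipeline before the next one is read.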
The Cloud Services layer is always on and comprises services that manage virtual warehouses, queries, transactions and all the metadata. The virtual warehouses consist of elastic clusters of virtual machines. These are instantiated on demand to scale the query processing. The data storage spans availability zones and is therefore set up with replication to handle failures in these zones.
We now review the security features of Snowflake, which is designed to protect user data with two-factor authentication, encrypted data import and export, secure data transfer and storage, and role-based access control for database objects. Data is encrypted in transit and before being written to storage. Key management uses a key hierarchy so that keys can be rotated and data re-encrypted. Encryption and key management together complete the security. This key hierarchy has four levels: root keys, account keys, table keys and file keys. Each layer encrypts the layer below it. Each account key corresponds to one user account, each table key corresponds to one database table and each file key corresponds to one table file. By using a hierarchy, we reduce the scope of each key and the amount of data it secures.
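The four-level hierarchy can be sketched as a chain of wrapped keys. This is only an illustration under loud assumptions: `toy_wrap` is an XOR stand-in for a real key-wrapping cipher and is NOT secure, and the variable and function names are hypothetical.

```python
import os


def toy_wrap(key, wrapping_key):
    """Toy stand-in for key wrapping (XOR); applying it twice unwraps. NOT secure."""
    return bytes(a ^ b for a, b in zip(key, wrapping_key))


# One key per level; in the hierarchy each level encrypts the level below it.
root_key = os.urandom(16)
account_key = os.urandom(16)
table_key = os.urandom(16)
file_key = os.urandom(16)

# Only wrapped forms of the lower-level keys are stored.
stored = {
    "account": toy_wrap(account_key, root_key),
    "table": toy_wrap(table_key, account_key),
    "file": toy_wrap(file_key, table_key),
}


def unwrap_file_key(root, wrapped):
    """Walk down the hierarchy from the root key to recover a file key."""
    account = toy_wrap(wrapped["account"], root)
    table = toy_wrap(wrapped["table"], account)
    return toy_wrap(wrapped["file"], table)


recovered = unwrap_file_key(root_key, stored)
```

Because a file key protects only one file, and a table key only the file keys of one table, compromising a low-level key exposes a correspondingly small amount of data.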

Saturday, July 29, 2017

Covert Redirect Vulnerability
-------------------------------------

Almost every business on the internet requires you to log in. This is the way they secure and isolate your data from that of anybody else. It used to be that every site required its own username and password.
But this multiplied the number of passwords you needed to remember. On the other hand, web protocols found it easy to delegate the login to a referral website as long as that website could authoritatively perform the user authentication.
This delegation now spanned different companies and, as with everything that is shared between businesses, an accepted version, also called a standard, was soon adopted. The standards are OpenID and OAuth. The former performs authentication, which is a way to say who you are, and the latter performs authorization, which is a way to say what access policy is associated with you. If a store wants to know who you are when you use its services, it would use the OpenID protocol to identify the same user across different services. If the store wanted to have access to your photos for publishing or printing, it would require OAuth.
When this process of redirecting the user to a third-party site to log in can be compromised, it is a security vulnerability. One such issue was the serious Covert Redirect vulnerability related to OAuth 2.0 and OpenID. Such attacks might lead the user to divulge information to a potential hacker. A covert redirect happens when a site relies on its partners and does not validate the redirect URLs. It can generally be avoided with a whitelist of redirect URLs, but many companies decline to maintain one because every partner must opt in or the whitelist doesn't mean anything, and that is hard to enforce. The exploit does not need the user to complete the login because the identity itself is information.
The currency in this delegated login is usually a token. A token is an amalgamation of representations for the user, the client and a stamp of authority. The client is usually the party that requires access to be granted for certain resources. When these three pieces of information are brought together in a token, we can guarantee that it is valid. It is very similar to carrying an entitlement paper on a motorbike. That paper has the driver information, the vehicle information and a stamp of authority. Together, these assure law enforcement that the bike is not stolen.
A web service requiring a token needs to know which partner to redirect the customer to. However, if it does not validate the redirect URI, it cannot ensure that the redirection goes to a partner. That is why approving a list of partners and making sure that referrals go only to those partners is good practice. Often the risk can be mitigated by requiring the client to specify the redirect URI when it registers with the identity provider. Without enough trust in the redirect URI, this vulnerability may result.
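A minimal sketch of the whitelist check might look like the following. The client id, the registered URI and the function name are hypothetical; the only assumption made about policy is that a redirect URI must match a registered value exactly.

```python
from urllib.parse import urlsplit

# Hypothetical registry: redirect URIs each client supplied at registration time.
REGISTERED_REDIRECTS = {
    "partner-app": {"https://partner.example.com/oauth/callback"},
}


def is_allowed_redirect(client_id, redirect_uri):
    """Accept a redirect only if it exactly matches a URI registered for the client."""
    allowed = REGISTERED_REDIRECTS.get(client_id, set())
    parts = urlsplit(redirect_uri)
    # Reject anything that is not HTTPS or that carries a fragment.
    if parts.scheme != "https" or parts.fragment:
        return False
    # Exact string match: extra query parameters or paths are not tolerated.
    return redirect_uri in allowed


ok = is_allowed_redirect("partner-app", "https://partner.example.com/oauth/callback")
```

Matching the whole URI, rather than only the host name, is what stops an attacker from appending a parameter that bounces the user onward to another site.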