Sunday, July 30, 2017

Today we continue the discussion on Snowflake architecture. The engine for Snowflake is columnar, vectorized and push-based. Columnar storage suits analytical workloads because it makes more effective use of CPU caches and SIMD instructions. Vectorized execution means data is processed in a pipelined fashion, without materializing intermediate results the way map-reduce does. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data. This removes control-flow logic from tight loops.
The cloud services layer is always on and comprises the services that manage virtual warehouses, queries, transactions and all of the metadata. Virtual warehouses consist of elastic clusters of virtual machines, instantiated on demand to scale query processing. The data storage layer spans availability zones and is therefore set up with replication to handle failures in any one zone.
We now review the security features of Snowflake, which are designed to protect user data with two-factor authentication, encrypted data import and export, secure data transfer and storage, and role-based access control for database objects. Data is encrypted in transit and before being written to storage. Key management is supported with a key hierarchy so that keys can be rotated and data re-encrypted. Encryption and key management together complete the security story. The key hierarchy has four levels - root keys, account keys, table keys and file keys - and each level encrypts, or wraps, the keys in the level below it. Each account key corresponds to one user account, each table key corresponds to one database table, and each file key corresponds to one table file. By using a hierarchy, we reduce the scope of each key and the amount of data it secures.
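To make the wrapping idea concrete, here is a minimal sketch in C# of a parent key encrypting a child key, assuming .NET's AesGcm class is available (.NET 6 or later); the class shape and the choice of AES-GCM are illustrative assumptions, not Snowflake's actual implementation.

using System;
using System.Security.Cryptography;

static class KeyHierarchy
{
    // Wrap a child key under its parent; returns nonce || tag || ciphertext.
    // AES-GCM here is an assumption for illustration, not Snowflake's scheme.
    public static byte[] Wrap(byte[] parentKey, byte[] childKey)
    {
        var nonce = RandomNumberGenerator.GetBytes(AesGcm.NonceByteSizes.MaxSize);
        var tag = new byte[AesGcm.TagByteSizes.MaxSize];
        var cipher = new byte[childKey.Length];
        using var aes = new AesGcm(parentKey);
        aes.Encrypt(nonce, childKey, cipher, tag);
        var blob = new byte[nonce.Length + tag.Length + cipher.Length];
        nonce.CopyTo(blob, 0);
        tag.CopyTo(blob, nonce.Length);
        cipher.CopyTo(blob, nonce.Length + tag.Length);
        return blob;
    }
}

The payoff of the hierarchy is visible here: rotating a parent key only requires re-wrapping the keys one level below it, not re-encrypting all the underlying table files.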

Saturday, July 29, 2017

Covert Redirect Vulnerability
-------------------------------------

Almost every business on the internet requires you to log in. This is how they secure and isolate your data from everybody else's. It used to be that every site required its own username and password.
But this proliferated the number of credentials you needed to remember. On the other hand, web protocols found it easy to delegate the login to a referring website, as long as that website could authoritatively perform the user authentication.
This delegation soon spanned different companies, and as with everything shared between businesses, an accepted version, also called a standard, was adopted. Two such standards are OpenID and OAuth. The former performs authentication, which is a way to say who you are; the latter performs authorization, which is a way to say what access policy is associated with you. If a store wants to recognize the same user across different services, it would use the OpenID protocol. If the store wanted access to your photos for publishing or printing, it would require OAuth.
When this process of redirecting the user to a third-party site to log in can be compromised, it is referred to as a security vulnerability. One such issue was the serious Covert Redirect vulnerability related to OAuth 2.0 and OpenID. These attacks might trick the user into divulging information to a potential hacker. A covert redirect happens when a site relies on its partners but does not validate the redirect URLs. It can generally be avoided with a whitelist of redirect URLs, but many companies decline to maintain one because everyone must opt in or the whitelist doesn't mean anything, and that is hard to enforce. The exploit does not even need the user to complete the login, because the identity itself is information.
The currency in this delegated login is usually a token. A token is an amalgamation of representations of the user, the client and a stamp of authority. The client is usually the application requesting access to certain resources on the user's behalf. When these three pieces of information are brought together in a token, we can guarantee that it is valid. It is very similar to carrying registration papers for a motorbike: the paper has the driver information, the vehicle information and a stamp of authority, and together these assure law enforcement that the bike is not stolen.
A web service issuing a token needs to know which partner to redirect the customer to. However, if it does not validate the redirect URI, then it cannot enforce that the redirection actually goes to a partner. That is why blessing a list of partners, and making sure referrals go only to those partners, is sufficient practice. Often this is achieved by requiring the client to specify its redirect URIs at the time of registration with the identity provider. In the absence of enough trust in the redirect URI, this vulnerability may result.
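As a minimal sketch of that mitigation, the check below compares an incoming redirect_uri against the exact URIs the client registered; the RegisteredRedirectUris store and the client id are hypothetical, and exact string matching is the assumption here (real providers also normalize URIs and reject wildcards).

using System.Collections.Generic;

static class RedirectValidator
{
    // Hypothetical store of redirect URIs each client registered at sign-up.
    static readonly Dictionary<string, HashSet<string>> RegisteredRedirectUris = new()
    {
        ["store-client"] = new HashSet<string> { "https://store.example.com/oauth/callback" }
    };

    // Allow a redirect only if it exactly matches a registered URI.
    public static bool IsAllowedRedirect(string clientId, string redirectUri)
        => RegisteredRedirectUris.TryGetValue(clientId, out var allowed)
           && allowed.Contains(redirectUri);
}

Exact matching closes the loophole where an attacker appends a path or query string that forwards the token to a site the provider never blessed.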

Friday, July 28, 2017

We were discussing Snowflake cloud services from their whitepaper. The engine for Snowflake is columnar, vectorized and push-based. Columnar storage suits analytical workloads because it makes more effective use of CPU caches and SIMD instructions. Vectorized execution means data is processed in a pipelined fashion, without materializing intermediate results the way map-reduce does. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data. This removes control-flow logic from tight loops.
Can Snowflake replace Splunk? This is probably unlikely, because a warehouse and a time-series database serve different purposes. Moreover, Splunk is lightweight enough to run on desktops and appliances. That said, Snowflake can perform time travel. Let us take a closer look at this. Snowflake implements snapshot isolation on top of multi-version concurrency control. This means that writes are copy-on-write: a new file is added or an old one removed. When files are removed by a new version, they are retained for a configurable duration. Time travel in this case means walking through different versions of the data, which is done in SQL with the AT or BEFORE syntax. Timestamps can be absolute, relative with respect to the current time, or relative with respect to previous statements. This is similar to change data capture in SQL Server, which keeps a historical record of all changes, except that we get there differently.
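As a small illustration, the query text below uses the AT and BEFORE forms described above; issuing it through a generic ADO.NET-style connection is this sketch's assumption, and the table name and query-id placeholder are illustrative.

using System.Data;

static class TimeTravelDemo
{
    public static void Run(IDbConnection conn) // assume an open Snowflake connection
    {
        using var cmd = conn.CreateCommand();
        // Absolute point in time.
        cmd.CommandText = "SELECT * FROM orders AT(TIMESTAMP => '2017-07-28 12:00:00'::timestamp)";
        cmd.ExecuteReader().Close();
        // Relative to the current time: one hour ago.
        cmd.CommandText = "SELECT * FROM orders AT(OFFSET => -3600)";
        cmd.ExecuteReader().Close();
        // Relative to a previous statement, identified by its query id.
        cmd.CommandText = "SELECT * FROM orders BEFORE(STATEMENT => '<query-id>')";
        cmd.ExecuteReader().Close();
    }
}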
#codingexercise
Find the length of the longest subsequence of consecutive integers in a given array
int GetLongest(List<int> A)
{
    if (A == null || A.Count == 0) return 0;
    A.Sort();
    int max = 1;
    int cur = 1;
    for (int i = 1; i < A.Count; i++)
    {
        if (A[i - 1] + 1 == A[i])
        {
            cur = cur + 1; // the run of consecutive values continues
        }
        else if (A[i - 1] != A[i]) // skip duplicates so they don't break the run
        {
            max = Math.Max(max, cur);
            cur = 1; // run broken; start counting afresh
        }
    }
    max = Math.Max(max, cur);
    return max;
}
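For example, with A = [1, 9, 3, 10, 4, 20, 2] the sorted order contains the run 1, 2, 3, 4, so the method returns 4. Note the duplicate check above: without it, an input like [1, 2, 2, 3] would reset the count at the repeated 2 and report 2 instead of 3.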

Thursday, July 27, 2017

We were discussing Snowflake cloud services from their whitepaper. The engine for Snowflake is columnar, vectorized and push-based. Columnar storage suits analytical workloads because it makes more effective use of CPU caches and SIMD instructions. Vectorized execution means data is processed in a pipelined fashion, without materializing intermediate results the way map-reduce does. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data. This removes control-flow logic from tight loops.
We now revisit the multi-data-center software-as-a-service design of Snowflake. A web user interface is provided that not only supports SQL operations but also gives access to the database catalog, user and system management, monitoring and usage information. The web user interface is only one of the interfaces to the system, but it is convenient both for using Snowflake and for performing administrative tasks. Behind the web user interface, Snowflake is designed as a cloud service that operates several virtual warehouse compute instances, all of which share a data storage layer where the data is replicated across multiple availability zones.
The cloud services layer is always on and comprises the services that manage virtual warehouses, queries, transactions and all of the metadata. Virtual warehouses consist of elastic clusters of virtual machines, instantiated on demand to scale query processing. The data storage layer spans availability zones and is therefore set up with replication to handle failures in any one zone. If a node fails, other nodes can pick up its activities without much impact on end users. This differs from virtual warehouses, which do not span availability zones.
#codingexercise
static int GetCountIncreasingSequences(List<int> A, uint subarraysize)
{
    // dp[i] counts the increasing subsequences that end at index i.
    int[] dp = new int[A.Count];
    for (int i = 0; i < A.Count; i++)
    {
        dp[i] = 1; // the single-element subsequence A[i]
        for (int j = 0; j <= i - 1; j++)
        {
            if (A[j] < A[i])
            {
                dp[i] = dp[i] + dp[j]; // extend every subsequence ending at j
            }
        }
    }
    // Sum over the first subarraysize positions only; requires System.Linq.
    return dp.ToList().GetRange(0, (int)subarraysize).Sum();
}


Find and print the longest consecutive number sequence in a given sequence
int GetLongestContiguousSubsequence(List<uint> A)
{
    var h = new HashSet<uint>(A); // membership lookup for each value
    int max = 0;
    for (int i = 0; i < A.Count; i++)
    {
        int cur = 0;
        // Walk downward from A[i] for as long as the values remain present.
        for (long j = A[i]; j >= 0 && h.Contains((uint)j); j--)
            cur++;
        max = Math.Max(max, cur);
    }
    return max;
}

The nested loops have overlapping subproblems, so we could at least memoize the results. Alternatively, we can sort the array and find the longest span of consecutive integers for the whole array in one pass, or use a hash set for the same effect, as shown below.
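Here is a minimal sketch of that linear-time hash-based variant: we only start counting at a value whose predecessor is absent, so each element is visited a constant number of times.

static int GetLongestConsecutiveRun(List<int> A)
{
    var present = new HashSet<int>(A);
    int max = 0;
    foreach (var v in present)
    {
        if (present.Contains(v - 1)) continue; // v is not the start of a run
        int cur = 1;
        while (present.Contains(v + cur)) cur++; // extend the run upward
        max = Math.Max(max, cur);
    }
    return max;
}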

Wednesday, July 26, 2017

We were discussing cloud services and compute or storage requirements, and we mentioned services being granular. Today we continue with the discussion on Snowflake cloud services from their whitepaper. The engine for Snowflake is columnar, vectorized and push-based. Columnar storage suits analytical workloads because it makes more effective use of CPU caches and SIMD instructions. Vectorized execution means data is processed in a pipelined fashion: batches of a few thousand rows in columnar format are processed at a time. It differs from map-reduce, however, because it does not materialize intermediate results. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data. This removes control-flow logic from tight loops, and since query plans are not just trees but can also be DAG-shaped, push operators make sharing intermediate results across consumers efficient.
The overhead of traditional query processing is also absent in Snowflake. There is no need for transaction management during execution, because queries run against a fixed set of immutable files. There is no buffer pool; that memory was used for table buffering but is no longer required, so it is given to the operators instead. Queries can scan large amounts of data, so there is more value in using the memory for the operators. All major operators are allowed to spill to disk and recurse when memory is exhausted: many analytical workloads require large joins or aggregations, and instead of requiring them to operate in pure memory, they can spill. Finally, the cloud services layer is heavily multi-tenant. Each Snowflake service in this layer is shared across many users, which improves utilization of the nodes and reduces administrative overhead. Running a query over a few such nodes is more beneficial than running it over hundreds of them; scale-out is important, but this per-node efficiency helps.
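To make the push model concrete, here is a minimal sketch of operators that receive rows pushed from upstream and push their output onward; the interface and names are invented for illustration and are not Snowflake's internals.

using System;

// Each operator consumes rows pushed to it and pushes its results downstream.
interface IPushOperator
{
    void Push(int[] row); // receive one row from upstream
    void Finish();        // upstream is done; flush any buffered state
}

// A filter forwards only matching rows to its downstream operator.
class FilterOperator : IPushOperator
{
    private readonly Func<int[], bool> _predicate;
    private readonly IPushOperator _downstream;
    public FilterOperator(Func<int[], bool> predicate, IPushOperator downstream)
    {
        _predicate = predicate;
        _downstream = downstream;
    }
    public void Push(int[] row) { if (_predicate(row)) _downstream.Push(row); }
    public void Finish() => _downstream.Finish();
}

The control flow lives in the producer's scan loop; each operator's per-row work is just a method call, which keeps the tight loops free of control logic, and a DAG simply means one operator holds more than one downstream reference.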
#codingexercise
static int GetCountIncreasingSequences(List<int> A)
{
    // dp[i] counts the increasing subsequences that end at index i.
    int[] dp = new int[A.Count];
    for (int i = 0; i < A.Count; i++)
    {
        dp[i] = 1; // the single-element subsequence A[i]
        for (int j = 0; j <= i - 1; j++)
        {
            if (A[j] < A[i])
            {
                dp[i] = dp[i] + dp[j]; // extend every subsequence ending at j
            }
        }
    }
    return dp.Sum(); // requires System.Linq
}


Tuesday, July 25, 2017

Yesterday we were discussing cloud services and compute or storage requirements. We briefly mentioned services being granular. Services were hosted on compute, and even when there were multiple service instances, each instance was whole. One way to make this more granular is to break down the processing with serverless computing. The notion here is that computations within a service can be packaged and executed elsewhere, with little or no coupling to particular compute resources. This is a major change in the design of services: from service-oriented architecture we moved to microservices, and from microservices we are moving to serverless computing.
There are a few tradeoffs in serverless computing to keep in perspective. First, we introduce latency into the system, because the functions do not execute local to the application and require setup and teardown routines during invocations. Moreover, debugging serverless functions is harder, because the functions respond to more than one application and the call stack is not available, or may have to be pieced together by looking at different compute resources. The same goes for monitoring, because we now rely on separate systems. We can contrast this with applications that are hosted behind load-balancer services to improve availability. The service registered for load balancing is the same code on every partition, so the call stack is coherent even if it runs on different servers. Moreover, these instances share the same persistence, even if the entire database server is itself hosted on, say, Marathon with storage on a shared volume. The ability of Marathon to bring up instances as appropriate, along with its health checks, improves the availability of the application. The choice between platform-as-a-service, a Marathon cluster-based deployment, and serverless computing depends on the application.
That said, all the advantages that come with deploying code in containers on PaaS apply to serverless computing as well, only at a smaller granularity.
The serverless architecture may be standalone or distributed. In both cases, it remains an event-action platform that executes code in response to events. We can execute code written as functions in many different languages, and each function is executed in its own container. Because this execution is asynchronous to the frontend and backend, they need not perform continuous polling, which helps them be more scalable and resilient. OpenWhisk introduces an event programming model where charges accrue only for what is used. Moreover, it scales on a per-request basis.
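As a tiny sketch of the event-action shape, a function receives an event payload and returns a result. The dictionary-in, dictionary-out signature below is an illustrative stand-in for an OpenWhisk-style action, not the exact runtime API.

using System.Collections.Generic;

static class HelloAction
{
    // An action: stateless, invoked once per event, billed only while running.
    public static Dictionary<string, object> Main(Dictionary<string, object> args)
    {
        var name = args.TryGetValue("name", out var v) ? v : "world";
        return new Dictionary<string, object> { ["greeting"] = $"hello {name}" };
    }
}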
#codingexercise
Implement a virtual pottery wheel method that conserves mass but reshapes the clay according to external touches
List<int> ShapeOnPotteryWheel(List<int> diameters, List<int> touches)
{
    // Uses System.Diagnostics and System.Linq.
    Debug.Assert(diameters.Count == touches.Count); // one touch per layer of clay
    Debug.Assert(touches.All(x => x >= 0));
    double volume = 0; // clay displaced from the sides, per unit layer height
    for (int i = 0; i < touches.Count; i++)
    {
        var before = diameters[i];
        diameters[i] -= 2 * touches[i]; // a touch removes material from both sides
        var after = diameters[i];
        Debug.Assert(after > 0);
        volume += (Math.PI / 4) * (before * before - after * after);
    }
    Debug.Assert(volume >= 0);
    // Conservation of mass: the displaced clay adds layers at the top.
    var last = diameters.Last();
    int count = (int)(volume / ((Math.PI / 4) * last * last));
    if (count > 0)
        diameters.AddRange(Enumerable.Repeat(last, count));
    return diameters;
}
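For example, with diameters [10, 10, 10] and touches [2, 2, 0], each touched layer thins from 10 to 6; the displaced volume amounts to 1.28 top layers of diameter 10, which truncates to one extra layer, so the result is [6, 6, 10, 10].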

Monday, July 24, 2017

Yesterday we were discussing the design of the Snowflake data warehouse. We continue with the discussion of some of its features. In particular, we review the storage-versus-compute considerations for a warehouse as a cloud service. The shared-nothing architecture is not specific to warehouses and is widely accepted as a standard for scalability and cost-effectiveness. With the use of commodity hardware and scale-out based on additional nodes, this design lets every node have the same duties and run on the same hardware. With contention minimized and the processing homogeneous, it becomes easy to scale out to larger and larger workloads. Every query-processing node has its own local disks. Tables are horizontally partitioned across nodes, and each node is only responsible for the rows on its local disks. This works particularly well for star schemas, because the fact table is large and partitioned while the dimension tables are small. The main drawback of this design is that compute and storage now come in the form of clusters and are tightly coupled. First, workloads are not homogeneous even though the hardware is, so the hardware is forced to have low average utilization; some workloads are highly compute-intensive while others are not. Second, if membership changes because nodes fail, all their associated data gets reassigned. This transfer is usually done by the same nodes that are also performing query processing, which limits elasticity and availability. Third, every node is a liability and requires upgrades or system resizes. When upgrades can be done with little or no downtime, they do not affect query processing, but this design makes such online upgrades difficult. Replication is used to mitigate the reliance on a single copy of data, and we may recollect how nodes are rebuilt, but this adds to the processing and still requires the nodes to be homogeneous. This varies from on-premise to cloud-based deployments, where the compute may be heterogeneous and failures far more frequent. Snowflake works around all of this by having separate compute and storage layers, where the storage can be any cloud blob store; in this case it relies on Amazon S3. With the data remote instead of local to the compute nodes, local disk space is not used for replicating the base data, and the compute node can use it to cache some table data.
With compute separated from storage, it is now possible to run more than one cluster against the same cloud service, or to dedicate a cluster to a single microservice.

#codingexercise
http://ideone.com/TqZ8jQ
Another technique is to enumerate all the increasing subsequences and then apply a filter, as sketched below.
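Here is a brute-force sketch of that idea: recursively enumerate every increasing subsequence and count it. This is exponential, so it is only useful as a cross-check for the dp version above.

static int CountIncreasing(List<int> A, int start = 0, int lastValue = int.MinValue)
{
    int count = 0;
    for (int i = start; i < A.Count; i++)
    {
        if (A[i] > lastValue) // A[i] can extend the current subsequence
        {
            // Count the subsequence ending at i, plus all of its extensions.
            count += 1 + CountIncreasing(A, i + 1, A[i]);
        }
    }
    return count;
}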