Monday, July 31, 2017

Today we continue the discussion on Snowflake architecture. The engine for Snowflake is columnar, vectorized and push-based. The columnar storage is suitable for analytical workloads because it makes more effective use of CPU caches and SIMD instructions. Vectorized execution means data is processed in a pipelined fashion without materializing intermediate results the way map-reduce does. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data, which removes control flow from tight loops.
Data is encrypted in transit and before being written to storage. Key management is supported with a key hierarchy so that keys can be rotated and data re-encrypted. Encryption and key management together complete the security picture. By using a hierarchy, we reduce the scope of each key and the amount of data it secures. Encryption keys go through four stages in their life cycle: first they are created, then they are used to encrypt or decrypt, then they are marked as no longer in use, and finally they are decommissioned. Keys are rotated at periodic intervals. Retired keys can still be used to decrypt data, but only the newest key is used to encrypt. Before a retired key is destroyed, the data it protects is re-encrypted with the latest key. This is called rekeying.
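As a rough illustration of the lifecycle and rekeying just described, here is a minimal sketch of a key manager that restricts retired keys to decryption and re-encrypts data before destroying them. The class and method names are hypothetical, not Snowflake's implementation.

using System;
using System.Collections.Generic;
using System.Linq;

enum KeyState { Created, Active, Retired, Destroyed }

class ManagedKey
{
    public int Id;
    public KeyState State = KeyState.Created;
}

class DataFile
{
    public int KeyId;          // which key currently encrypts this file
}

class KeyManager
{
    private readonly List<ManagedKey> keys = new List<ManagedKey>();

    public ManagedKey Current { get { return keys.Last(k => k.State == KeyState.Active); } }

    // Rotation: the new key becomes the only one used to encrypt;
    // older keys are retired but can still decrypt.
    public void Rotate(ManagedKey newKey)
    {
        foreach (var k in keys.Where(k => k.State == KeyState.Active))
            k.State = KeyState.Retired;
        newKey.State = KeyState.Active;
        keys.Add(newKey);
    }

    // Rekeying: before a retired key is destroyed, every file it protects
    // is re-encrypted under the latest active key (actual crypto omitted here).
    public void Rekey(ManagedKey retired, IEnumerable<DataFile> files)
    {
        foreach (var f in files.Where(f => f.KeyId == retired.Id))
            f.KeyId = Current.Id;
        retired.State = KeyState.Destroyed;
    }
}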
Generally key rotation and compute resources require data redistribution. However, Snowflake allows users to scale up or down and even pause resources without any data movement. 
Snowflake draws inspiration from BigQuery, Google's approach to fast SQL processing at effectively unlimited scale. However, BigQuery does not adhere strictly to SQL, its tables are append-only, and they require schemas. Snowflake provides ACID guarantees and full DML and does not require schemas for semi-structured data.
#codingexercise
Find the length of the longest subsequence of consecutive integers in a given array
// Sort the array, then count runs where each element is exactly one more than the previous.
int GetLongest(List<int> A)
{
    if (A == null || A.Count == 0) return 0;
    if (A.Count == 1) return 1;
    A.Sort();
    int max = 1;
    int cur = 1;
    for (int i = 1; i < A.Count; i++)
    {
        if (A[i-1] + 1 == A[i])
        {
            cur = cur + 1;
        }
        else if (A[i-1] != A[i]) // duplicates neither extend nor break the run
        {
            max = Math.Max(max, cur);
            cur = 1;
        }
    }
    max = Math.Max(max, cur);
    return max;
}

Sunday, July 30, 2017

Today we continue the discussion on Snowflake architecture. The engine for Snowflake is columnar, vectorized and push-based. The columnar storage is suitable for analytical workloads because it makes more effective use of CPU caches and SIMD instructions. Vectorized execution means data is processed in a pipelined fashion without materializing intermediate results the way map-reduce does. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data, which removes control flow from tight loops.
The cloud services layer is always on and comprises the services that manage virtual warehouses, queries, transactions and all of the metadata. The virtual warehouses consist of elastic clusters of virtual machines that are instantiated on demand to scale query processing. The data storage spans availability zones and is therefore set up with replication to handle failures in those zones.
We now review the security features of Snowflake, which is designed to protect user data with two-factor authentication, encrypted data import and export, secure data transfer and storage, and role-based access control for database objects. Data is encrypted in transit and before being written to storage. Key management is supported with a key hierarchy so that keys can be rotated and data re-encrypted. Encryption and key management together complete the security picture. The key hierarchy has four levels: root keys, account keys, table keys and file keys. Each layer encrypts the layer below it. Each account key corresponds to one user account, each table key corresponds to one database table, and each file key corresponds to one table file. By using a hierarchy, we reduce the scope of each key and the amount of data it secures.
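To make the four-level hierarchy concrete, here is a minimal sketch of the envelope idea in which each level wraps the keys of the level below it. The types and names are hypothetical illustrations, not Snowflake's implementation.

using System;
using System.Collections.Generic;

// Root key -> account keys -> table keys -> file keys.
// Each level encrypts ("wraps") the key material of the level below it,
// so compromising one file key exposes only one file.
class KeyNode
{
    public string Name;                 // e.g. "root", "account:42", "table:orders", "file:part-0001"
    public byte[] WrappedKey;           // this key's material, encrypted by the parent key (not populated in this sketch)
    public List<KeyNode> Children = new List<KeyNode>();
}

class Hierarchy
{
    public static KeyNode Build()
    {
        var file = new KeyNode { Name = "file:part-0001" };
        var table = new KeyNode { Name = "table:orders", Children = { file } };
        var account = new KeyNode { Name = "account:42", Children = { table } };
        var root = new KeyNode { Name = "root", Children = { account } };
        return root;
    }
}

Rotating an account key then only requires rewrapping the table keys beneath it, not touching the file data itself.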

Saturday, July 29, 2017

Covert Redirect Vulnerability
-------------------------------------

Almost every business on the internet requires you to log in. This is the way they secure and isolate your data from that of anybody else. It used to be that every site required its own username and password.
But this proliferated the number of credentials you needed to remember. On the other hand, web protocols found it easy to delegate the login to a referring website, as long as that website could authoritatively perform the user authentication.
This delegation soon spanned different companies, and as with everything that is shared between businesses, an accepted version, also called a standard, was adopted. Two such standards are OpenID and OAuth. The former performs authentication, which is a way to say who you are, and the latter performs authorization, which is a way to say what access policy is associated with you. If a store wants to know who you are as you use its services, it would use the OpenID protocol to recognize the user across different services. If the store wanted access to your photos for publishing or printing, it would require OAuth.
When this process of redirecting a user to a third party site to log in can be compromised, it is referred to as a security vulnerability. One such issue was the serious Covert Redirect vulnerability related to OAuth 2.0 and OpenID. These attacks might lead the user to divulge information to a potential hacker. A covert redirect happens when a site relies on its partners but does not validate the redirect URLs. It can generally be avoided with a whitelist of redirect URLs, but many companies decline to do so because everyone must opt in or the whitelist doesn't mean anything, and that is harder to enforce. The exploit does not need the user to complete the login, because the identity itself is information.
The currency in this delegated login is usually a token. A token is an amalgamation of representations for the user, the client and a stamp of authority. The client is usually the party requesting that access be granted to certain resources. When these three pieces of information are brought together in a token, we can guarantee that it is valid. It is very similar to carrying an entitlement paper on a motorbike: that paper has the driver information, the vehicle information and a stamp of authority. Together, this gives assurance to law enforcement that the bike is not stolen.
A web service requiring a token needs to know which partner to redirect the customer to. However, if it does not validate the redirect URI, then it is hard to enforce that the redirection is to a partner. That is why blessing the list of partners and making sure the referrals go only to those partners is sufficient practice. Often this can be mitigated by requiring the client to specify the redirect URI at the time of registration with the identity provider. In the absence of enough trust in the redirect URI, this vulnerability may result.
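As a small illustration of the mitigation described above, here is a minimal sketch of validating a requested redirect URI against the URIs a client registered with the identity provider. The types and method names are hypothetical.

using System;
using System.Collections.Generic;

class ClientRegistry
{
    // redirect URIs each client declared at registration time
    private readonly Dictionary<string, HashSet<string>> registered =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    public void Register(string clientId, params string[] redirectUris)
    {
        registered[clientId] = new HashSet<string>(redirectUris, StringComparer.OrdinalIgnoreCase);
    }

    // The authorization endpoint should refuse any redirect_uri that is not an
    // exact match for one the client registered; prefix or substring matching
    // reopens the covert redirect hole.
    public bool IsAllowedRedirect(string clientId, string requestedUri)
    {
        HashSet<string> uris;
        return registered.TryGetValue(clientId, out uris) && uris.Contains(requestedUri);
    }
}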

Friday, July 28, 2017

We were discussing Snowflake cloud services from their whitepaper. The engine for Snowflake is columnar, vectorized and push-based. The columnar storage is suitable for analytical workloads because it makes more effective use of CPU caches and SIMD instructions. Vectorized execution means data is processed in a pipelined fashion without materializing intermediate results the way map-reduce does. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data, which removes control flow from tight loops.
Can Snowflake replace Splunk? This is probably unlikely because a warehouse and a time series database serve different purposes. Moreover, Splunk is lightweight enough to run on desktops and appliances. That said, Snowflake can perform time travel, so let us take a closer look at this. Snowflake implements snapshot isolation on top of multi-version concurrency control. This means that a copy-on-write occurs and a new file is added or removed. When files are removed by a new version, they are retained for a configurable duration. Time travel in this case means walking through different versions of the data. This is done with the SQL keywords AT or BEFORE. Timestamps can be absolute, relative with respect to the current time, or relative with respect to previous statements. This is similar to change data capture in SQL Server, in that we have a historical record of all the changes, except that we get there differently.
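As a concrete illustration of the AT and BEFORE keywords mentioned above, Time Travel queries take roughly the following shape (written from memory of Snowflake's documented syntax, so treat the exact parameter names as approximate; <query_id> is a placeholder for the identifier of an earlier statement):

SELECT * FROM my_table AT(OFFSET => -60*5);                                 -- relative: five minutes before now
SELECT * FROM my_table AT(TIMESTAMP => '2017-07-28 10:00:00'::timestamp);   -- absolute timestamp
SELECT * FROM my_table BEFORE(STATEMENT => '<query_id>');                   -- relative to a previous statement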
#codingexercise
Find the length of the longest subsequence of consecutive integers in a given array
// Sort the array, then count runs where each element is exactly one more than the previous.
int GetLongest(List<int> A)
{
    if (A == null || A.Count == 0) return 0;
    if (A.Count == 1) return 1;
    A.Sort();
    int max = 1;
    int cur = 1;
    for (int i = 1; i < A.Count; i++)
    {
        if (A[i-1] + 1 == A[i])
        {
            cur = cur + 1;
        }
        else if (A[i-1] != A[i]) // duplicates neither extend nor break the run
        {
            max = Math.Max(max, cur);
            cur = 1;
        }
    }
    max = Math.Max(max, cur);
    return max;
}

Thursday, July 27, 2017

We were discussing Snowflake cloud services from their whitepaper. The engine for Snowflake is columnar, vectorized and push-based. The columnar storage is suitable for analytical workloads because it makes more effective use of CPU caches and SIMD instructions. Vectorized execution means data is processed in a pipelined fashion without materializing intermediate results the way map-reduce does. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data, which removes control flow from tight loops.
We now revisit the multi data center software-as-a-service design of Snowflake. A web user interface is provided that supports not only SQL operations but also gives access to the database catalog, user and system management, monitoring and usage information. The web user interface is only one of the interfaces to the system, but it is convenient not only for using Snowflake but also for performing administrative tasks. Behind the web user interface, Snowflake is designed as a cloud service that operates on several virtual warehouse compute instances, all of which share a data storage layer where the data is replicated across multiple availability zones.
The cloud services layer is always on and comprises the services that manage virtual warehouses, queries, transactions and all of the metadata. The virtual warehouses consist of elastic clusters of virtual machines that are instantiated on demand to scale query processing. The data storage spans availability zones and is therefore set up with replication to handle failures in those zones. If a node fails, other nodes can pick up its activities without much impact on the end users. This differs from virtual warehouses, which do not span availability zones.
#codingexercise
// Counts increasing subsequences ending at each index (classic DP), then sums the
// counts for the first 'subarraysize' positions.
static int GetCountIncreasingSequences(List<int> A, uint subarraysize)
{
    int[] dp = new int[A.Count];
    for (int i = 0; i < A.Count; i++)
    {
        dp[i] = 1; // the element by itself
        for (int j = 0; j <= i - 1; j++)
        {
            if (A[j] < A[i])
            {
                dp[i] = dp[i] + dp[j];
            }
        }
    }
    return dp.Take((int)subarraysize).Sum();
}


Find and print longest consecutive number sequence in a given sequence 
int GetLongestContiguousSubsequence(List<uint> A)
{
    var h = new HashSet<uint>();
    for (int i = 0; i < A.Count; i++)
        h.Add(A[i]);
    int max = 0;
    for (int i = 0; i < A.Count; i++)
    {
        int cur = 0;
        // walk downward while each consecutive predecessor is present
        for (long j = A[i]; j >= 0 && h.Contains((uint)j); j--)
            cur++;
        max = Math.Max(max, cur);
    }
    return max;
}

The nested for loops have overlapping subproblems, so we could at least memoize the results. Alternatively we can sort the array to find the longest span of consecutive integers for the whole array.

Wednesday, July 26, 2017

We were discussing cloud services and compute or storage requirements. We mentioned services being granular. Today we continue with the discussion on Snowflake cloud services from their whitepaper. The engine for Snowflake is columnar, vectorized and push-based. The columnar storage is suitable for analytical workloads because it makes more effective use of CPU caches and SIMD instructions. Vectorized execution means data is processed in a pipelined fashion: batches of a few thousand rows in columnar format are processed at a time. It differs from map-reduce in that it does not materialize intermediate results. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data, which removes control flow from tight loops. Query plans are not just trees, they can also be DAG-shaped, and push operators handle such plans efficiently.
Much of the overhead of traditional query processing is absent in Snowflake. There is no need for transaction management during execution because queries are executed against a fixed set of immutable files. There is no buffer pool, which would otherwise be used for table buffering; instead the memory is given to the operators. Queries can scan large amounts of data, so there is more efficiency in using the memory for the operators. All major operators are allowed to spill to disk and recurse when memory is exhausted. Many analytical workloads require large joins or aggregations, and instead of requiring them to operate purely in memory, they can spill to disk.
The cloud services layer is heavily multi-tenant. Each Snowflake service in this layer is shared across many users, which improves utilization of the nodes and reduces administrative overhead. Running a query over fewer nodes is more beneficial than running it over hundreds of nodes; scale-out is important, but this per-node efficiency helps.
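To make the push-based style concrete, here is a conceptual sketch, not Snowflake's engine, in which each operator hands columnar batches to its downstream consumer instead of being pulled from:

using System;
using System.Collections.Generic;
using System.Linq;

// Pull model: the consumer calls Next() on its child in a tight loop.
// Push model (sketched here): a producer hands each batch of rows to the
// operator downstream of it, so control flow stays out of the inner loop.
interface IPushOperator
{
    void Push(List<int[]> batch);   // receive a columnar batch of rows
    void Close();                   // flush any buffered state downstream
}

class FilterOperator : IPushOperator
{
    private readonly Func<int[], bool> predicate;
    private readonly IPushOperator downstream;
    public FilterOperator(Func<int[], bool> predicate, IPushOperator downstream)
    {
        this.predicate = predicate;
        this.downstream = downstream;
    }
    public void Push(List<int[]> batch)
    {
        var kept = batch.Where(predicate).ToList();
        if (kept.Count > 0) downstream.Push(kept);
    }
    public void Close() { downstream.Close(); }
}

class PrintSink : IPushOperator
{
    public void Push(List<int[]> batch) { Console.WriteLine(batch.Count + " rows"); }
    public void Close() { }
}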
#codingexercise
static int GetCountIncreasingSequences(List<int> A)
{
    int[] dp = new int[A.Count];
    for (int i = 0; i < A.Count; i++)
    {
        dp[i] = 1;
        for (int j = 0; j <= i - 1; j++)
        {
            if (A[j] < A[i])
            {
                dp[i] = dp[i] + dp[j];
            }
        }
    }
    return dp.Sum();
}


Tuesday, July 25, 2017

Yesterday we were discussing cloud services and compute or storage requirements. We  briefly mentioned services being granular. Services were hosted on compute and even when there were multiple service instances, each instance was whole. One of the ways to make this more granular, was to break down the processing with serverless computing. The notion here is that computations within a service can be packaged and executed elsewhere with little or no coupling to compute resources. This is a major change in the design of services. From service oriented architecture, we are going to microservices and from microservices we are going to serverless computing.
There are a few tradeoffs in serverless computing that may be taken into perspective. First, we introduce latency into the system because the functions don't execute local to the application and require setup and teardown routines during invocations. Moreover, debugging serverless functions is harder because the functions respond to more than one application and the callstack is not available, or may have to be put together by looking at different compute resources. The same goes for monitoring, because we now rely on separate systems. We can contrast this with applications that are hosted behind load balancer services to improve availability. The service registered for load balancing runs the same code on every partition, and the callstack is coherent even if it is on different servers. Moreover, these share the same persistence even if the entire database server is also hosted on, say, Marathon with the storage on a shared volume. The ability of Marathon to bring up instances as appropriate, along with the health checks, improves the availability of the application. The choice between platform as a service, a Marathon cluster based deployment, and serverless computing depends on the application.
That said, all the advantages that come with deploying code on containers in PaaS are the same for serverless computing, only at a smaller granularity.
The serverless architecture may be standalone or distributed. In both cases, it remains an event-action platform that executes code in response to events. We can execute code written as functions in many different languages, and a function is executed in its own container. Because this execution is asynchronous to the frontend and backend, they need not perform continuous polling, which helps them to be more scalable and resilient. OpenWhisk introduces an event programming model where the charges are only for what is used. Moreover, it scales on a per-request basis.
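As a rough sketch of the event-action shape described above, a function receives an event payload and returns a result with no server state of its own. This is not the actual OpenWhisk API; the class, method signature and parameter names below are hypothetical.

using System;
using System.Collections.Generic;

// A stand-alone, stateless handler: the platform instantiates it per request,
// passes the event as a dictionary, and bills only for the execution.
static class ThumbnailAction
{
    public static Dictionary<string, object> Main(Dictionary<string, object> args)
    {
        var imageUrl = args.ContainsKey("url") ? (string)args["url"] : "";
        // ... fetch the image and produce a thumbnail here ...
        return new Dictionary<string, object>
        {
            { "status", "ok" },
            { "source", imageUrl }
        };
    }
}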
#codingexercise
Implement a virtual pottery wheel method that conserves mass but shapes it according to an external factor
List<int> ShapeOnPotteryWheel(List<int> diameters, List<int> touches)
{
    Debug.Assert(diameters.Count == touches.Count); // one touch depth per unit height of clay
    Debug.Assert(touches.All(x => x >= 0));
    double volume = 0; // clay displaced by the touches, conserved and added back at the top
    for (int i = 0; i < touches.Count; i++)
    {
        var before = diameters[i];
        diameters[i] -= 2 * touches[i];
        var after = diameters[i];
        Debug.Assert(after > 0);
        volume += (Math.PI / 4) * (before * before - after * after);
    }
    Debug.Assert(volume >= 0);
    // the displaced volume stacks up at the top with the last diameter
    var last = diameters.Last();
    int count = (int)(volume / ((Math.PI / 4) * last * last));
    if (count > 0)
        diameters.AddRange(Enumerable.Repeat(last, count));
    return diameters;
}

Monday, July 24, 2017

Yesterday we were discussing the design of the Snowflake data warehouse. We continue with the discussion on some of its features. In particular, we review the storage versus compute considerations for a warehouse as a cloud service. The shared-nothing architecture is not specific to warehouses and is widely accepted as a standard for scalability and cost-effectiveness. With the use of commodity hardware and scale-out based on additional nodes, this design lets every node have the same duties and run on the same hardware. With contention minimized and the processing homogeneous, it becomes easy to scale out to larger and larger workloads. Every query processor node has its own local disks. Tables are horizontally partitioned across nodes and each node is only responsible for the rows on its local disks. This works particularly well for a star schema because the fact table is large and partitioned while the dimension tables are small. The main drawback of this design is that compute and storage now come in the form of clusters and are tightly coupled. First, real workloads are not homogeneous, so hardware sized for one kind of work ends up with low average utilization; some workloads can be highly compute intensive while others are not. Second, if membership changes because nodes fail, all their associated data gets reassigned. This transfer is usually done by the same nodes that are also performing the query processing, which limits their elasticity and availability. Third, every node is a liability and requires upgrades or system resizes. When upgrades are done with little or no downtime, they do not affect query processing; however, this design makes such online upgrades difficult. Replication is used to mitigate the reliance on a single copy of data, and we may recollect how nodes are rebuilt, but these add to the processing under the condition that nodes have to be homogeneous. This varies from on-premise to cloud based deployments, where the compute may be heterogeneous with far more frequent failures. Snowflake works around this by having separate compute and storage layers, where the storage can be any cloud blob store; in this case it relies on Amazon S3. By letting the data be remote instead of local to the compute nodes, the local disk space is not used for replicating the base data and the compute node can use it to cache some table data.
With compute separated from storage, it is now possible to have more than one cluster for the same cloud service, or to dedicate a cluster to a single microservice.

#codingexercise
http://ideone.com/TqZ8jQ
Another technique can be to enumerate all the increasing subsequences and then apply a filter.

Sunday, July 23, 2017

We were discussing the replacement of MongoDB with the Snowflake data warehouse. The challenges faced in scaling MongoDB, and the search for a solution that does not pose any restrictions around limits, are best understood with a case study of DoubleDown in their migration of data from MongoDB to Snowflake. Snowflake's primary advantage is that it brings the cloud's elasticity and isolation capabilities to a warehouse, where compute is added in the form of what it calls virtual clusters and the storage is shared. Each cluster is like a virtual warehouse and serves a single user, although the user is never aware of the nodes in the cluster. Each query is executed in a single cluster. Snowflake utilizes micro-partitions to securely and efficiently store customer data. It is appealing over traditional warehouses because it provides a software-as-a-service experience. Its query language is SQL, so development pace can be rapid. It loads JSON natively, so several lossy data transformations and ETL steps can be avoided. It is able to store highly granular staging data in S3, which makes it very effective to scale.
While Hadoop and Spark are useful for increasing analytics in their own way, this technology brings the warehouse to the cloud. Because queries execute against a fixed set of immutable files, transaction management during execution is eliminated. Moreover, queries are heavy on aggregation, so Snowflake allows these operators to make the best use of cloud memory and storage. Warehouse features such as access control and the query optimizer are each implemented as a cloud service. The failure of individual service nodes does not cause data loss or loss of availability. Concurrency is handled in these cloud services with the help of snapshot isolation and MVCC.
Snowflake provides the warehouse in a software-as-a-service manner. It is this model that is interesting to the text summarization service. There is end-to-end security in the offering, and Snowflake utilizes micro-partitions to securely and efficiently store customer data. Another similarity is that Snowflake is not a full-service model; instead it is meant as a component in workflows, often replacing those associated with the use of, say, MongoDB in an enterprise. The interface for Snowflake is SQL, which is a widely accepted interface. The summarization service does not have this benefit, but it is meant for participation in workflows with the help of REST APIs. If the SQL standard is enhanced at some point to include text analysis, then the summarization service can be enhanced to include these as well.

Saturday, July 22, 2017

We were discussing the replacement of MongoDB with the Snowflake data warehouse. The challenges faced in scaling MongoDB, and the search for a solution that does not pose any restrictions around limits, are best understood with a case study of DoubleDown in their migration of data from MongoDB to Snowflake. Towards the end of the discussion, we will reflect on whether Snowflake could be built on top of DynamoDB or CosmosDB.
DoubleDown is an internet casino games provider that was acquired by International Game Technology. Its games are available through different applications and platforms. While these games are free to play, it makes money from in-game purchases and advertising partners. With existing and new games, it does analysis to gain insights that influence game design, marketing campaign evaluation and management. It improves the understanding of player behaviour, assesses user experience, and uncovers bugs and defects. This data comes from MySQL databases, internal production databases, cloud based game servers, internal Splunk servers, and several third party vendors. All these continuous data feeds amount to about 3.5 terabytes of data per day, which come through separate data paths and ETL transformations and are stored in large JSON files. MongoDB was used for processing this data, supporting a collection of collectors and aggregators. The data was then pulled into a staging area where it was cleaned, transformed and conformed to a star schema before loading into a data warehouse. This warehouse then supported analysis and reporting tools including Tableau. Snowflake not only replaced MongoDB but also streamlined the data operation while expediting a process that earlier took nearly a day. Snowflake brought the following advantages:
Its query language is SQL so development pace was rapid
It loads JSON natively so several lossy data transformations were avoided
It was able to stage and store highly granular data in S3, which made it very effective.
It was able to process JSON data using SQL which did away with a lot of map-reduce
Snowflake utilizes micro partitions to securely and efficiently store customer data
Snowflake is appealing over traditional warehouses because it provides a software-as-a-service experience. It is elastic: storage and compute resources can be scaled independently and seamlessly without impact on data availability or the performance of concurrent queries. It is a multi-cluster technology that is highly available. Each cluster is like a virtual warehouse and serves a single user, although the user is never aware of the nodes in the cluster. The cluster sizes vary in T-shirt sizes. Each such warehouse is a pure compute resource while storage is shared, and the compute instances work independently of the data volume. The execution engine is columnar, which may be better suited for analytics than row-wise storage. Data is not materialized in the form of intermediate results but processed as if in a pipeline. It is not clear why the virtual clusters are chosen to be homogeneous and not allowed to be heterogeneous, where the compute may scale up rather than scale out. It is also not clear why the clusters do not support outsourcing to commodity cluster stacks such as Mesos and Marathon. Arguably, performance on par with relational counterparts requires a homogeneous architecture.

Friday, July 21, 2017

As we discussed Content Databases, document libraries and S3 storage together with their benefits for the text summarization service, we now review NoSQL databases as a store for the summaries. We argued that the summaries can be represented as JSON documents, that the store can grow arbitrarily large to the tune of 40GB per user, and that the text of the original document may also be stored. Consequently, one of the approaches could be to use a MongoDB store, which can be a wonderful on-premise solution but one that requires significant processing for the purposes of analysis and reporting, especially when the datasets are large. However, we talked about the text summarization service as a truly cloud based offering. To understand the challenges faced in scaling MongoDB and to find a solution that does not pose any restrictions around limits, we could do well to review the case study of DoubleDown in their migration of data from MongoDB to Snowflake. Towards the end of the discussion, we will reflect on whether Snowflake could be built on top of DynamoDB or CosmosDB.
DoubleDown is an internet casino games provider that was acquired by International Game Technology. Its games are available through different applications and platforms. While these games are free to play, it makes money from in-game purchases and advertising partners. With existing and new games, it does analysis to gain insights that influence game design, marketing campaign evaluation and management. It improves the understanding of player behaviour, assesses user experience, and uncovers bugs and defects. This data comes from MySQL databases, internal production databases, cloud based game servers, internal Splunk servers, and several third party vendors. All these continuous data feeds amount to about 3.5 terabytes of data per day, which come through separate data paths and ETL transformations and are stored in large JSON files. MongoDB was used for processing this data, supporting a collection of collectors and aggregators. The data was then pulled into a staging area where it was cleaned, transformed and conformed to a star schema before loading into a data warehouse. This warehouse then supported analysis and reporting tools including Tableau. Snowflake not only replaced MongoDB but also streamlined the data operation while expediting a process that earlier took nearly a day. Snowflake brought the following advantages:
Its query language is SQL so development pace was rapid
It loads JSON natively so several lossy data transformations were avoided
It was able to stage and store highly granular data in S3, which made it very effective.
It was able to process JSON data using SQL which did away with a lot of map-reduce
Snowflake utilizes micro partitions to securely and efficiently store customer data
#codingexercise
int GetMaxValue(List<int> A)
{
    var max = int.MinValue;
    for (int i = 0; i < A.Count; i++)
        if (A[i] > max)
            max = A[i];
    return max;
}

Thursday, July 20, 2017

We were discussing file synchronization services. Such a service uses events and callbacks to indicate progress, preview the changes to be made, handle user specified conflict resolution, and provide graceful error handling per file. The file synchronization provider is designed to handle concurrent operations by applications. Changes to the file will not be synchronized until the next synchronization session, so that concurrent changes to the source or destination are not lost. Each file synchronization is atomic so that the user does not end up with a partially correct copy of the file. This service provides incremental synchronization between two file system locations based on change detection, which is a process that evaluates changes since the last synchronization. The service stores metadata about the synchronization which describes where and when the item was changed, giving a snapshot of every file and folder in the replica. Changes are detected by comparing the current file metadata with the version last saved in the snapshot. For files, the comparison is done on the file size, file times, file attributes, file name and optionally a hash of the file contents. For folders, the comparison is done on folder attributes and folder names.
#codingexercise
Find the length of the longest subsequence of one string which is a substring of another string.
// Returns the maximum length of subseq in substr
    static int MaxSubSeqInSubString(string X, string Y)
    {
        char[] subseq = X.ToCharArray();
        int n = subseq.Length;
        char[] substr = Y.ToCharArray();
        int m = substr.Length;

         Debug.Assert (m < 1000 && n < 1000);
         var dp = new int[1000, 1000];
         for (int i = 0; i <= m; i++) // for substr
               for (int j = 0; j <= n; j++) // for subseq
                     dp[i, j] = 0;

         for (int i = 1; i <= m; i++) // for substr
         {
               for (int j = 1; j <= n; j++) // for subseq
               {
                       if (subseq[j-1] == substr[i-1])
                            dp[i,j] = dp[i-1,j-1] + 1;
                       else
                            dp[i,j] = dp[i,j-1];
                }
          }

        int result = 0;
        for (int i = 1; i <=m; i++)
              result = Math.Max(result, dp[i,n]);
        return result;
    }
/*
A, ABC => 1
D, ABC => 0
D, D => 1
A, D => 0
,  => 0
A,  => 0
, A => 0
A, DAG => 1
AG, DAGH => 2
AH, DAGH => 1
DAGH, AH => 2
ABC, ABC => 3
BDIGH, HDGB => 2
*/
Node GetNthElementInInorder(Node root, int N)
{
  var ret = new List<Node>();
  InorderTraversal(root, ref ret);
  if (N <= ret.Count)
     return ret[N-1];
  else
     return null;
}

Wednesday, July 19, 2017

We were discussing document libraries and the file synchronization service. This service provides incremental synchronization between two file system locations. Changes are detected from the last synchronization.
It stores metadata about the synchronization which describes where and when the item was changed, giving a snapshot of every file and folder in the replica. For files, the comparison is done on the file size, file times, file attributes, file name and optionally a hash of the file contents. For folders, the comparison is done on folder attributes and folder names.
Since change detection evaluates all files, a large number of files in the replica may degrade performance. Users are notified of progress during the synchronization operation with the help of events raised from managed code execution or from callbacks in the unmanaged code. If the progress is displayed during preview mode, the changes are not committed. If the users modify different file system replicas and they get out of sync, a process of conflict resolution is performed. The conflict resolution is deterministic; it does not matter which replica initiates the conflict resolution. In all cases, it avoids data loss and applies the most recent update or preserves different files.
The callbacks and events are not just helpful for progress or preview; they are also helpful for error handling and recovery. These enable graceful error handling per file during the synchronization of a set of files. Errors may come from locked files, changes after change detection, access denied, insufficient disk space, etc. If an error is encountered, the file is skipped so that the rest of the synchronization proceeds. The application gets the file details and error information which it may use to re-synchronize after fixing up the problem. If the entire synchronization operation fails, the application gets a proper error code. For example, a replica-in-use error code is given when there are concurrent synchronization operations on the same replica.
The file synchronization provider is designed to handle concurrent operations by applications. Changes to the file will not be synchronized until the next synchronization session, so that concurrent changes to the source or destination are not lost. Each file synchronization is atomic so that the user does not end up with a partially correct copy of the file.
#codingexercise
void InOrderTraversal(Node root, ref List<Node> lasttwo)
{
if (root == null) return;
InOrderTraversal(root.left, ref lasttwo);
ShiftAndAdd(root, ref lasttwo);
InOrderTraversal(root.right, ref lasttwo);
}
void ShiftAndAdd(Node root, ref List<Node> lasttwo)
{
if (lasttwo.Count < 2) {lasttwo.Add(root); return;}
lasttwo[0] = lasttwo[1];
lasttwo[1] = root;
return;
}

Tuesday, July 18, 2017

We were discussing document libraries and the file synchronization service. This service provides incremental synchronization between two file system locations based on change detection, which is a process that evaluates changes since the last synchronization.
It stores metadata about the synchronization which describes where and when the item was changed, giving a snapshot of every file and folder in the replica. For files, the comparison is done on the file size, file times, file attributes, file name and optionally a hash of the file contents. For folders, the comparison is done on folder attributes and folder names.
Since change detection evaluates all files, a large number of files in the replica may degrade performance.
The file synchronization provider supports extensive progress reporting during the synchronization operation. This can be visualized through the User Interface as a progress bar. This information is reported to the application via events in the managed code or callbacks in the unmanaged code.
The preview mode displays what changes would happen during synchronization. The changes are not committed but the progress notifications are still sent out.
If the users modify different file system replicas, then they may get out of sync and a process of conflict resolution is performed. The conflict resolution is deterministic; it does not matter which replica initiates the conflict resolution. In all cases, it avoids data loss. The policy for conflict resolution is usually the same. If an existing file is modified independently on two replicas, the file with the last write wins the selection. If an existing file is modified and the same file or folder is deleted in another replica, the deleted file will be resurrected. The override for the delete is also the option used when a file is created in a folder which is deleted in another replica. If there are name collisions on files or folders, when say one of the folders is renamed, the user will end up with a single folder and the contents merged. Name collision conflicts may happen in other cases also.
If two files are created independently on two replicas with the same name, then there will certainly be a file with that name, but with the contents from the side that had the most recent update.
If an existing file is renamed with a name colliding with that of a newly created file, then both files are kept and one of them is renamed, because the user intended to keep both.
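Here is a minimal sketch of the deterministic, no-data-loss resolution policy described above. It is only an illustration; the real sync framework tracks richer metadata than a timestamp and a delete flag.

using System;

enum Resolution { KeepLocal, KeepRemote, KeepBoth }

class FileVersion
{
    public string Name;
    public DateTime LastWriteUtc;
    public bool Deleted;
}

static class ConflictResolver
{
    // Update vs. update (or two independent creates with the same name): the most recent write wins.
    // Update vs. delete: the update wins, resurrecting the deleted file.
    // A rename colliding with a newly created file keeps both (not modeled here).
    public static Resolution Resolve(FileVersion local, FileVersion remote)
    {
        if (local.Deleted && !remote.Deleted) return Resolution.KeepRemote;
        if (remote.Deleted && !local.Deleted) return Resolution.KeepLocal;
        return local.LastWriteUtc >= remote.LastWriteUtc ? Resolution.KeepLocal : Resolution.KeepRemote;
    }
}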
#codingexercise
Given two strings, find if the first string is a subsequence of the second
bool IsSubsequence(string one, int len1, string two, int len2)
{
 assert (len1 <= one.Length && len2 <= two.Length);
 if (len1 == 0) return true;
 if (len2 == 0) return false;
 if (one[len1-1] == two[len2-1])
    return IsSubsequence(one, len1-1, two, len2-1);
 return IsSubsequence(one, len1, two, len2-1);
}
/*
AH, DAGH => True
DAGH, AH => False
ABC, ABC => True
*/
int GetSecondLastInBST(Node root)
{
var inorder = new List<int>();
InorderTraversal(root, ref inorder);
if (inorder.Count-2 < 0) throw new Exception("Invalid");
return inorder[inorder.Count-2];
}
// The above can be modified to store only last two elements

Monday, July 17, 2017

We were discussing document libraries and the file synchronization service. This service provides incremental synchronization between two file system locations based on change detection, which is a process that evaluates changes since the last synchronization.
It stores metadata about the synchronization which describes where and when the item was changed, giving a snapshot of every file and folder in the replica. Changes are detected by comparing the current file metadata with the version last saved in the snapshot. For files, the comparison is done on the file size, file times, file attributes, file name and optionally a hash of the file contents. For folders, the comparison is done on folder attributes and folder names.
When a file is renamed or moved, just that operation will be performed on the replica avoiding a copy operation.  If a folder is renamed or moved, it is deleted and created on other replicas. Files within the folders are processed as renames.
Since change detection evaluates all files, a large number of files in the replica may degrade performance. Therefore, this expensive operation should be done as often as required by the applications using the sync framework.
The file synchronization provider supports extensive progress reporting during the synchronization operation. This can be visualized through the User Interface as a progress bar. This information is reported to the application via events in the managed code or callbacks in the unmanaged code. These event handlers and callbacks enable an application to even skip a change.
The preview mode displays what changes would happen during synchronization. The changes are not committed but the progress notifications are still sent out. It can be used by the application to present the verification UI to the user with all the changes that will be made if synchronization is executed.
The file synchronization provider does not provide an upfront estimate of the total number of files to be synchronized before the synchronization starts because this can be expensive to perform. However, some statistics can be collected using a two pass approach where a preview mode is run before the real synchronization session.
Files can be filtered out of synchronization based on filename-based exclusion, filename based inclusion, subdirectory exclude and file attribute based exclusion. Certain files may always be excluded if they are marked with both SYSTEM and HIDDEN attributes.
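Putting the change detection described in these posts into a small sketch (illustrative only; the real provider also honors the filters and rename handling discussed above):

using System;
using System.Collections.Generic;

class FileMeta
{
    public string Path;
    public long Size;
    public DateTime LastWriteUtc;
    public string ContentHash;   // optional, used when timestamps are unreliable
}

static class ChangeDetector
{
    // Compare the current scan against the metadata snapshot saved at the last sync.
    public static List<string> DetectChanged(Dictionary<string, FileMeta> snapshot,
                                             Dictionary<string, FileMeta> current)
    {
        var changed = new List<string>();
        foreach (var kv in current)
        {
            FileMeta old;
            if (!snapshot.TryGetValue(kv.Key, out old))
            {
                changed.Add(kv.Key);     // newly created
            }
            else if (old.Size != kv.Value.Size
                  || old.LastWriteUtc != kv.Value.LastWriteUtc
                  || old.ContentHash != kv.Value.ContentHash)
            {
                changed.Add(kv.Key);     // modified
            }
        }
        foreach (var key in snapshot.Keys)
            if (!current.ContainsKey(key))
                changed.Add(key);        // deleted
        return changed;
    }
}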

Sunday, July 16, 2017

We were discussing document libraries including OneDrive and OneDrive for Business. OneDrive for Business is different from OneDrive for users. The former is an integral part of Office 365 or Sharepoint Server and provides a place in the cloud where users can store, share and sync their work files. It is managed by the organization with the help of Sharepoint services and is virtually isolated from any or all personal storage of users, such as OneDrive personal accounts. That said, the files are easy to move from one to the other if the users have set up access using corresponding accounts.
Sharepoint services is however different from OneDrive for Business. While both are offered through Office365 business plans, OneDrive for Business evolved from Sharepoint workspace and before that Groove whereas Sharepoint online is a Cloud-based version of the Sharepoint Service that dates back to Office XP. Both are powered by Sharepoint. While one is referred to as location, another is referred to as team site. All files in the former default as private while those in the latter inherit the permissions of the folder they are uploaded in. The interface is also different between the two where the former is exclusive to the user and the latter has a theme shared by the organization.
File Synchronization services allow files to be sync’ed between local desktop and the cloud. The Microsoft Sync framework is actually well known in this space. It is a comprehensive synchronization platform that can synchronize any type of data, using any protocol over any network. It uses a powerful metadata model that enables peer to peer synchronization of file data with support for arbitrary topologies. One of the main advantages for the developers is that they can use it to build file synchronization and roaming scenarios without having to worry about directly interacting with the file system.
Some of the features of this system include incremental synchronization of changes between two file system locations specified via local or UNC paths, and synchronization of file contents, file and folder names, file timestamps, and attributes. It provides support for optional filtering of files based on filenames/extensions, sub-directories or file attributes. It provides optional use of file hashes to detect changes to file contents if file timestamps are not reliable. It provides reliable detection of conflicting changes to the same file and automatic resolution of conflicts with a no-data-loss policy. It allows for a limited user undo operation by optionally allowing file deletes and overwrites to be moved to the recycle bin. It supports a preview mode which provides a preview of the incremental synchronization operation without committing changes to the file system. It lets users start synchronization with equal or partially equal file hierarchies on more than one replica. It supports graceful cancellation of an ongoing synchronization operation such that the remaining changes can be synchronized later without having to re-synchronize changes that were already synchronized.

Saturday, July 15, 2017

We were discussing document libraries. Sharepoint is an implementation of Content Databases. OneDrive is also a document library. In fact, it is one of the earliest file hosting services and is operated by Microsoft. Every user gets a quota which can be enhanced with subscriptions. The service was initially named SkyDrive and was made available in many countries. Later, photos and videos were allowed to be stored on SkyDrive via Windows Live Photos, which allowed users to access their photos and videos stored on SkyDrive. It was thereafter expanded to include Office Live Workspace. Files and folders became accessible to Windows Live Users and Groups, which made sharing and file management easier. Subsequently SkyDrive began to be used with the App Store and Windows Phone Store via the applications released. APIs are also available for OneDrive.
OneDrive for Business is different from OneDrive for users. The former is an integral part of Office 365 or Sharepoint Server and provides a place in the cloud where users can store, share and sync their work files. It is managed by the organization with the help of Sharepoint services and is virtually isolated from any or all personal storage of users, such as OneDrive personal accounts. That said, the files are easy to move from one to the other if the users have set up access using corresponding accounts.
Sharepoint services is however different from OneDrive for Business. While both are offered through Office365 business plans, OneDrive for Business evolved from Sharepoint workspace and before that Groove whereas Sharepoint online is a Cloud-based version of the Sharepoint Service that dates back to Office XP. Both are powered by Sharepoint. While one is referred to as location, another is referred to as team site. All files in the former default as private while those in the latter inherit the permissions of the folder they are uploaded in. The interface is also different between the two where the former is exclusive to the user and the latter has a theme shared by the organization.

Friday, July 14, 2017

We discussed the similarity measure between a skills vector from a resume and a role to be matched. We could also consider using an ontology of skills for measuring similarity. For example, we can list all the skills a software engineer must have and connect the skills that have some degree of similarity, using domain knowledge and human interpretation or a weighted skills collocation matrix as resolved from a variety of resumes in a training set. With the help of this skills graph, we can now determine similarity as a measure of distance between vertices. This enables translation of skills into semantics-based similarity.
The collocation-based weights matrix we have come up with so far can also be represented as a graph, on which we can run PageRank to determine the most important features.
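Here is a minimal power-iteration sketch of PageRank over such a weighted skills graph. It is only an illustration; the damping factor and iteration count are conventional defaults, not values taken from the discussion.

using System;

static class SkillRank
{
    // weights[i, j] is the collocation weight of an edge from skill i to skill j.
    public static double[] PageRank(double[,] weights, double damping = 0.85, int iterations = 50)
    {
        int n = weights.GetLength(0);
        var rank = new double[n];
        for (int i = 0; i < n; i++) rank[i] = 1.0 / n;

        for (int iter = 0; iter < iterations; iter++)
        {
            var next = new double[n];
            for (int i = 0; i < n; i++)
            {
                double outSum = 0;
                for (int j = 0; j < n; j++) outSum += weights[i, j];
                if (outSum == 0) continue;               // dangling node: its rank is simply not propagated in this sketch
                for (int j = 0; j < n; j++)
                    next[j] += damping * rank[i] * weights[i, j] / outSum;
            }
            for (int j = 0; j < n; j++)
                next[j] += (1 - damping) / n;            // teleportation term
            rank = next;
        }
        return rank;
    }
}

The highest-ranked entries would be the skills most central to the collocation graph built from the training resumes.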
This concludes the text analysis as a service discussion, and we now look into the store discussions for text content. In this regard, we briefly mentioned content libraries such as Sharepoint, but we are going to discuss their cloud based versions. Systems design for cloud based text analysis as a service can make use of such document libraries as an alternative to using S3. We discussed cloud native technologies. Let us now take a look at cloud versions of document libraries.
Sharepoint is an implementation of Content Databases. OneDrive is also a document library. In fact, it is one of the earliest file hosting services and is operated by Microsoft. Every user gets a quota which can be enhanced with subscriptions. The service was initially named SkyDrive and was made available in many countries. Later, photos and videos were allowed to be stored on SkyDrive via Windows Live Photos, which allowed users to access their photos and videos stored on SkyDrive. It was thereafter expanded to include Office Live Workspace. Files and folders became accessible to Windows Live Users and Groups, which made sharing and file management easier. Subsequently SkyDrive began to be used with the App Store and Windows Phone Store via the applications released. APIs are also available for OneDrive.
#codingexercise
We discussed methods of finding the length of the longest subsequence of one string that is a substring of another.
Let us compare the performance:
1) The iterative brute-force approach is O(N^2) and works well when the substring is small and the subsequence is large.
2) The dynamic programming approach based on increasing finds is also O(N^2), but it is more efficient because it reuses overlapping subproblems. However, the dp solution is built over substrings from index 0 up to the current index.

Thursday, July 13, 2017

Domain Specific Text Classification:
The writeup here introduced domain specific text summarization which unlike the general approach to text summarization utilizes the desired outcome as a vector in itself. Specifically, we took the example of a resume matching to a given skillset required for a job. We said that the role can be described in the form of a vector of features based on skillsets and for a given candidate, we can determine the match score between the candidate’s skill sets and that of the role.
We could also extend this reasoning to cluster resumes of more than one candidate as potential match for the role. Since we compute the similarity score between vectors, we can treat all the resumes as vectors in a given skills matrix.  Then we can use a range based separation to draw out resumes that have similarity score in k ranges of the scores between 0 and 1. This helps us determine the set of resumes that are the closest match.
We could also extend this technique between many resumes and many positions. For example, we can match candidates to roles based on k-means clustering. Here, we have the roles as centroids of the clusters we want to form. Each resume matches against all the centroids to determine the cluster closest to it. All resumes are then separated into clusters surrounding the available roles.
By representing the skills in the form of a graph based on similarity, we can even do page rank on those skills. Typically we need training data and test data for this. The training data helps build the skills weight matrix.
Conclusion: With the help of a skillset match and a similarity score, it is possible to perform matches between jobs and candidates as a narrow domain text classification. The similarity measure here is a cosine similarity between the vectors.
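Here is a minimal sketch of the cosine similarity used as the match score between a resume's skill vector and a role's skill vector. The vectors are assumed to have been produced upstream by the feature extraction described earlier.

using System;

static class SkillMatch
{
    // Cosine similarity: dot(a, b) / (|a| * |b|), which stays in [0, 1] for non-negative skill weights.
    public static double Cosine(double[] a, double[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("vectors must have the same dimension");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}

A score above a chosen threshold marks the resume as a candidate match for the role, and the same measure drives the range-based separation and clustering described above.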
#exercise
Yesterday we were discussing the length of the longest substring of one string that is a subsequence of another. We could do this with dynamic programming in a bottom-up approach, where the required length is one more than that computed for the previous match if the characters match, or else it is the same as what was computed for the previous length of the substring.

Wednesday, July 12, 2017

Domain Specific Text Summarization:
The writeup here introduced text summarization, which is a general approach to reduce content so we can get the gist of a text with fewer sentences to read, and it is available here. This would have been great if it could translate to narrow domains where the jargon and the format also matter and the text is not necessarily arranged in the form of a discourse. For example, software engineering recruiters find it hard to read through resumes because they do not necessarily appear as sentences to read. Further, even the savviest technical recruiter may find that the candidate meant one thing when the resume says another. This is especially true for missing technical or buzz words in the resume. On the other hand, recruiters want to sort the candidates into roles. Given a resume, the goal is to label the resume with one of the labels that the recruiter has come up with for the jobs available with her. If such were possible, it would avoid the task of reading a resume and translating it to see if it is a fit for a role. Such translations are not obvious even with a conversation with the candidate.
How do we propose to solve this? When we have pre-determined labels for the open and available positions, we can automate the steps taken by a recruiter to decide if the label is a fit for the candidate. The steps are quite comprehensive and rely on a semantic network to correlate the resume text with an available vocabulary to determine a score for the match between the candidate's resume and the role requirements for the label. If the score exceeds a threshold, we determine that the candidate can be assigned the label and given the green signal to take the screening test. Both the general text summarization and the narrow domain resume matching rely on treating the document as a bag of words. The key differences however are the features used with the documents. For example, we use features that include the skill sets in terms of technologies, languages and domain specific terms. By translating the words in the resume into vectors of features, we are able to better classify a resume to a role, where the role is also described in terms of the features required to do the job. This tremendously improves the reliability of a match and works behind the scenes.

Conclusion: With the help of the text summarization but a predetermined word-vector space for narrow domains, it is possible to avoid a lot of work for the recruiters while relying on latent knowledge about what is being mentioned in the resume.
#word vector match http://ideone.com/2NUwvu
#codingexercise
Find the maximum length of the subsequence of string X which is a substring of another string Y
// Assumes an IsSubsequence helper, such as the one from the July 18 post, exposed as an extension method on string.
int GetMax(String X, String Y)
{
    int max = 0;
    for (int i = 0; i < Y.Length; i++)
    {
        var single = Y.Substring(i, 1);
        if (single.IsSubsequence(X) && single.Length > max)
            max = single.Length;
        for (int j = i + 1; j < Y.Length; j++)
        {
            var sub = Y.Substring(i, j - i + 1);
            if (sub.IsSubsequence(X) && sub.Length > max)
                max = sub.Length;
        }
    }
    return max;
}
http://ideone.com/RNERDx

Tuesday, July 11, 2017

We reviewed the writeup on the storage requirements for the summarization service, indicating that it can be hosted on top of a document library such as Sharepoint, OneDrive, Google Drive or S3, as long as we can keep track of the summaries generated and the documents from which each summary was generated. Today we discuss the compute requirements for the summarization service. The tasks for summarization involve using a latent word vector space and a Markov random walk from word vector to word vector. This translates to about two times a two dimensional matrix of floating point values with about three hundred features. The selection of keywords and the ranked representation of sentences having those keywords take only linear order storage. Therefore, we can make a back-of-the-envelope calculation for the allocation of these two dimensional matrices as roughly half an MB of space each. If we introduce a microbenchmark of processing ten thousand documents per second, we will require a total of about 10GB of space for the service in addition to all operational requirements for memory.
Therefore, we consider the service to operate on a scale-out basis on any cloud compute capabilities or a Marathon stack. With the storage being shared and the isolation per document per user, we have no restrictions in keeping our service elastic to growing needs. At the same time, we also hold the web service to the same production level standards as any public domain web service. These include troubleshooting, logging, monitoring, health checks and notifications for performance degradations and such other services. Most platform-as-a-service models for deployments automatically come with these benefits or can be extended easily for our needs. Similarly, the public cloud also has features for monitoring and billing that come in very useful to plan and scale out the service as appropriate. The service may also be hosted in one region or another.
The web application hosting best practices recommend creating one or more availability zones, each with the traditional model of hosting a web service, to improve the reliability and availability of the service to the users. Furthermore, they recommend the use of available cloud based features such as routing and caching to improve performance and security for the web application. For example, a global network of edge locations can be used to deliver dynamic, static and streaming content, with requests for the content routed to the nearest edge location. Network traffic can also be filtered and security improved not just at the edge routers but also at the host level. Security groups can be used to manage access to the ports on the hosts. Similarly, identity and access management services can be utilized to authenticate and authorize users. The data access layer that interacts with the database or document library can also be secured with internal service accounts and privileges. Data entering the data tier can be secured at rest and in transit. Routine backups and maintenance activities can be planned for safeguarding and protecting the data. Scaling and failover can be automatically provisioned for these services.
#word vector match http://ideone.com/2NUwvu

Monday, July 10, 2017

Content Databases

In the writeup, we describe the storage requirements of the text summarization service. We said that this is equivalent to using a cloud based NoSQL document store, because our summaries are not files but JSON documents, which we generate and maintain for all users of our service and intend to use for analysis. And we referred to the original documents from which the summaries were created being made available via document libraries such as Sharepoint or OneDrive or Google Drive. When users upload a document to the summarization service for processing, it could be stored the same way as we do with, say, Sharepoint, which is backed by Microsoft SQL Server. Sharepoint uses an HTTP routing mechanism and integrated Windows authentication.
Sharepoint services maintain multiple databases: system databases which include configuration, administration and content related data, search service databases, user-profile databases and many other miscellaneous data stores. The Sharepoint system databases include the configuration database, which contains data about all Sharepoint databases, web services, sites, applications, solutions, packages and templates, application and farm settings specific to Sharepoint Server, default quota and blocked file types. Content databases are separate from configuration. One specific content database is earmarked for the central administration web site. The content databases otherwise store all the site content including documents, libraries, web part properties, audit logs, applications, user names, rights and project server data. Usually the size of a content database is kept under 200GB, but a size up to 1TB is also feasible.
The search service databases include the search service application configuration and the access control list for the crawl. The crawl database stores the state of the crawled data and the crawl history. The link database stores the information that is extracted by the content processing component and the click-through information. The crawl databases are typically scaled out for every twenty million items crawled. It might be relevant to note that the crawl database is read heavy whereas the link database is write heavy.
The user profile service databases can scale up and out because they store and manage users and their social information. These databases also include social tagging information, which is the notes created by the users along with their respective URLs; its size is determined by the number of ratings created and used. The synchronization database is also a user profile database and is used when profile data is being synchronized with directory services such as Active Directory. Its size is determined by the number of users and groups.
Miscellaneous services include those that store app licenses and permissions, Sharepoint and Access apps, external content-types and related objects, managed metadata and syndicated content-types, temporary objects and persisted user comments and settings, account names and passwords, pending and completed translations, data refresh schedules, state information from InfoPath forms, Web Parts and charts, features and settings information for hosted customers, usage and health data collection, and document conversions and updates. The tasks and the databases associated with content management indicate the planning required for the summarization service.
It might therefore help if the content-management service can be used as a layer below the summarization service so that the storage is unified. At cloud scale, we plan for such stores in cloud databases or use a Big Table and file storage based solution.
Courtesy: msdn
#codingexercise
Check if the nth bit from last is set in the binary representation of a given number
bool IsSet(int number, int pos)
{
    var result = Convert.ToString(number, 2); // binary representation, most significant bit first
    if (pos < 1 || pos > result.Length)
        return false;
    return result[result.Length - pos] == '1';
}
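The same check can also be done without converting to a string. A small alternative sketch, keeping the same convention that pos is 1-based from the least significant bit:

bool IsSetBitwise(int number, int pos)
{
    if (pos < 1 || pos > 32) return false;
    // shift the bit of interest down to position 0 and mask it
    return ((number >> (pos - 1)) & 1) == 1;
}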