Cluster computing

Sunday, September 17, 2017

Today we continue reviewing U-SQL.It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kind of data stores – file, object and relational. U-SQL works on the Azure ecosystem which involves the Azure data lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
The presentation for U-SQL explains three scenarios which include - a Cognitive example , a text analysis example and a JSON processing.
The cognitive example identifies objects in images. This kind of example show how the entire image processing on image files can be considered custom logic and used with the query language. As long as we define the objects, the input and the logic to analyze the objects, it can be made part of the query to extract the desired output dataset.
The text analysis example is also similar where we can extract the text prior to performing the analysis. Its interesting to note that the classifier used to tag the text can be written in R language and is not dependent on the query. The outputters also result in different output.
JSON processing is another example cited by the presentation probably because it has become important to extract transform load in analytical processing whether it is a cloud data warehouse or big data operations. This "schema later" approach is popular because it decouples producers and consumers which saves co-ordination and time-consuming operations between say departments. While some applications with query languages such as SnowSQL import the Json into a columnar table and then execute a query or declare their own syntax to flatten the Json, the approach taken by U-SQL is more general purpose with its Extract, select and outputters that are either built-in or the customizations that the user can make.
Courtesy U-SQL slide shares
My updates on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8
#codingexercise
Recursive function for a palindrome:
If string Is empty or one character return true
If string.first() == string.last() return recursively by stripping first and last
If string.first() != string.last() return false

Also an update on classifiers : https://1drv.ms/w/s!Ashlm-Nw-wnWsFzHx5Hrcl633js_

Saturday, September 16, 2017

Today we continue reviewing U-SQL.It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kind of data stores – file, object and relational. U-SQL works on the Azure ecosystem which involves the Azure data lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
One of the improvements in this language design is the consideration for single-node versus parallel versus distributed computing. Queries often have to manage parallelism, synchronizations and transactions. But the language not only has to allow implicit considerations by the system but also enable explicit constructs for the users. Moreover, execution is no longer just scale-up but also scale-out and therefore libraries as well as language needs to handle parallelism.
The data processing language is independent of the scale of data but the data is a part of the language model. Programming languages treat data as something in a store and tie the data and the logic together. This data processing language allows data to chnage and evolve independent of the application.
U-SQL provides all this for the user with custom operator extensions called UDO's which are scaled out. It includes User-defined extractors, outputters, processors, appliers, combiners and reducers. The scale-out can also be explicitly requested with hint keywords.
UDO's can be written in any .Net language and they can be deployed in the service as an assembly after registering them with U-SQL script. Therefore UDOs like SQLCLR can invoke managed code, other runtimes like Python, R and all with the option to scale out. UDOs cannot interact with one another and are isolated in the scope that they are registered with. The U-SQL script allows these UDOs residing in assemblies to be invoked with the different data processing options such as extract, reduce etc.
One simple example to use UDOs for text summarization that we talked about earlier with trimpy python extension can be shown to be similar to the following simpler but only for illustration query as follows:
@text = EXTRACT text string
FROM @"filename"
USING new Trimpy.Extractor();
@summary = SELECT Trimpy.Summarize(text)
FROM @text
OUTPUT @summary
TO "/summary.txt"
USING Outputters.text();
This is simple but tasks like text classification or prediction or data mining can also be called via U-SQL.

Courtesy U-SQL slide shares

My take on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8

#codingexercise

Count all palindromic subsequences of a string
we can use a recursive solution to count this as we shrink the string.
if the boundary characters match, we can count the following two subsequences
first from start to end - 1
second from start +1 to end - 1
plus 1 for the match with the current boundary
otherwise we count the same two subsequences again and reduce the count from subsequence starting at start + 1 and ending at end -1 because it would have been included twice in each subsequence.

This same logic holds true for substrings if the subsequences can be confirmed to exist in the string.

Friday, September 15, 2017

We maintain a hashtable for the N letters of the search string with a linked list of positions of occurrences for each letter. As we scan the containing string, we insert the index into the linked list against that letter in the hash table.

Next for each position in the min value of the linked lists as a candidate and an initialized max value, and for every other letter in the hash table, we find the positions of the letters in the N that is closest and after the candidate's position and update the max we have found so far. With this max value minus the candidate's position as offset, we can form a tuple of start, offset for every candidate. The tuple that gives the smallest offset is the answer to return.

Thursday, September 14, 2017

Today we continue reviewing U-SQL.It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kind of data stores – file, object and relational. U-SQL works on the Azure ecosystem which involves the Azure data lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
U-SQL is a data processing language.Like SQL, it has declarative language and procedural control flow. The difference between imperative and declarative language is that Imperative means we tell the system how to get the data. Declarative means we tell the system what we want and the system finds a way to get it. The optimizer decides how to do it best. The difference between procedural and functional is that procedural means it operates by changing persistent state with each expression Functional means it transforms input into output with no persistent state. Single objects require control flow. Single objects requires control flows and may require parallelism to be explained. With the data model, we can use higher level data abstractions. There is scale up versus scale out. Programming languages required data in a store and does not make it part of the language model. Data and logic go together. It is imperative and procedural. Data processing languages on the other hand make data part of the language model. It can be both declarative and functional. The data can evolve independently of the application and the optimizer decides the parallelism and
U-SQL like T-SQL provides important benefits with query language. First and foremost, there is consistency and familiarity with its usage. The learning curve and the onboarding from T-SQL to U-SQL is not very steep. Moreover, there is a lot of thought behind the syntax. It is context independent and defined in data processing language. It is also composable. It is important for the query language to be precise and accurate while at the same time be convenient for the user. There is writeability versus readability separation of concerns. All of these are important considerations in U-SQL. In addition to the syntax, the semantics also matter. U-SQL tries to avoid surprises and complexities. Moreoever as a language it is composable for the user and optimizable for the system.
Courtesy U-SQL slide shares

My take on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8

#codingexercise
Given n friends, each one can remain single or can be paired up with some other friend. Each friend can be paired only once so ordering is irrelevant. Find the total number of ways in which the friends can be paired.
Let us define a recursive method to do this which takes a parameter n.
When n is less than two, the number of friends returned is n
If the number of friends is greater than two, then they can be formed excluding one which is the equivalent of calling the recursive method n-1 times. In addition, they can also be formed by pairing all the others excluding this one which is n-1 with the number of friends formed by removing two members
F(n) = F(n-1) + (n-1) F(n-2)

Wednesday, September 13, 2017

Today we continue reviewing U-SQL.It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kind of data stores – file, object and relational. U-SQL works on the Azure ecosystem which involves the Azure data lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
U-SQL is a data processing language.Like SQL, it has declarative language and procedural control flow. The difference between imperative and declarative language is that Imperative means we tell the system how to get the data. Declarative means we tell the system what we want and the system finds a way to get it. The optimizer decides how to do it best. The difference between procedural and functional is that procedural means it operates by changing persistent state with each expression Functional means it transforms input into output with no persistent state. Single objects require control flow. Single objects requires control flows and may require parallelism to be explained. With the data model, we can use higher level data abstractions. There is scale up versus scale out. Programming languages required data in a store and does not make it part of the language model. Data and logic go together. It is imperative and procedural. Data processing languages on the other hand make data part of the language model. It can be both declarative and functional. The data can evolve independently of the application and the optimizer decides the parallelism and synchronization.
Courtesy U-SQL slide shares

My take on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8

#codingexercise

If we have a matrix where each cell has a cost to travel and we can only move right or down, what is the minimum cost to travel from top left of matrix to bottom right corner ?

We keep track of the row and column index. The cell values are unit displacement costs.
Therefore starting from the ending position, we can work our way backwards to start using the path that gives minimum cost
new cost = cell value + minimum of recursive cost at position to the left or recursive cost at position to the right
if we go out of the board, the recursive cost is infinite
if we reach the start position, the recursive cost is the value of the cell at the start position.

The same holds for path taken in any direction.

Tuesday, September 12, 2017

Monday, September 11, 2017

Today we continue reviewing U-SQL.It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kind of data stores – file, object and relational. U-SQL works on the Azure ecosystem which involves the Azure data lake storage as the foundation and the analytics layer over it. The Azure analytics layer consists of both HD Insight and Azure data Lake analytics (HDLA) which target data differently. The HDInsight works on managed Hadoop clusters and allows developers to write map-reduce with open source. The ADLA is native to Azure and enables C#, SQL over job services. We will also recall that Hadoop was inherently batch processing while Microsoft stack allowed streaming as well. The benefit of the Azure storage is that it spans several kinds of data formats and stores. The ADLA has several other advantages over the managed Hadoop clusters in addition to working with a store for the universe. It enables limitless scale and enterprise grade with easy data preparation. The ADLA is built on Apache yarn, scales dynamically and supports a pay by query model. It supports Azure AD for access control and the U-SQL allows programmability like C#.
U-SQL supports big data analytics which generally have the characteristics that they require processing of any kind of data, allow use of custom algorithms, and scale to any size and be efficient.
This lets queries to be written for a variety of big data analytics. In addition, it supports SQL for Big Data which allows querying over structured data Also it enables scaling and parallelization. While Hive supported HiveSQL and Microsoft Scoop connector enabled SQL over big data and Apache Calcite became a SQL Adapter, U-SQL seems to improve the query language itself. It can unify querying over structured and unstructured data. It has declarative SQL and can execute local and remote queries. It increases productivity and agility It brings in features from T-SQL, Hive SQL, and SCOPE which has been Microsoft's internal Big Data language.U-SQL is extensible and it can be extended with C# and .Net
If we look at the pattern of separating query from data source, we quickly see it's no longer just a consolidate of data sources. It is also pushing down the query to the data sources and thus can act as a translator. Projections, filters and joins can now take place where the data resides. This was a design decision that came from the need to support heterogeneous data sources. Moreover, it gives a consistent unified view of the data to the user.

Courtesy : U-SQL slideshare
#codingexercise
given two strings find if one string is a subsequence of another
solution:
check the lengths of the two string.
If the subsequence candidate string is of length zero return true
if the input to be matched with is of length zero, return false
If the last character of both strings match, check recursively with decremented length in both
else check recursively with decremented length in input.