Cluster computing

Saturday, September 16, 2017

Today we continue reviewing U-SQL. It unifies the benefits of SQL with the expressive power of your own code. It is said to work very well with all kinds of data stores: file, object and relational. U-SQL runs on the Azure ecosystem, which has the Azure Data Lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
One of the improvements in this language design is the consideration for single-node versus parallel versus distributed computing. Queries often have to manage parallelism, synchronization and transactions. The language not only has to let the system handle these implicitly but also has to provide explicit constructs for users. Moreover, execution is no longer just scale-up but also scale-out, and therefore the libraries as well as the language need to handle parallelism.
The data processing language is independent of the scale of data, but the data is a part of the language model. Programming languages treat data as something in a store and tie the data and the logic together. This data processing language allows data to change and evolve independently of the application.
U-SQL provides all this for the user with custom operator extensions called UDOs (user-defined operators), which are scaled out. These include user-defined extractors, outputters, processors, appliers, combiners and reducers. The scale-out can also be explicitly requested with hint keywords.
UDOs can be written in any .NET language and deployed to the service as an assembly after registering them in the U-SQL script. Therefore UDOs, like SQLCLR, can invoke managed code as well as other runtimes such as Python and R, all with the option to scale out. UDOs cannot interact with one another and are isolated in the scope in which they are registered. The U-SQL script allows these UDOs residing in assemblies to be invoked with the different data processing options such as extract, reduce etc.
A simple example of using UDOs is the text summarization that we talked about earlier with the trimpy Python extension. The following simplified query is only for illustration:
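// Read the input file with a custom extractor (Trimpy is an illustrative assembly)
@text = EXTRACT text string
FROM @"filename"
USING new Trimpy.Extractor();
// Computed columns need an alias, and each statement ends with a semicolon
@summary = SELECT Trimpy.Summarize(text) AS summary
FROM @text;
// Write the result with the built-in text outputter
OUTPUT @summary
TO "/summary.txt"
USING Outputters.Text();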
This is simple, but tasks like text classification, prediction or data mining can also be invoked via U-SQL.

Courtesy U-SQL slide shares

My take on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8

#codingexercise

Count all palindromic subsequences of a string
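We can use a recursive solution that counts them as we shrink the string from either end.
If the boundary characters match, we add the counts for the following two ranges:
first from start to end - 1
second from start + 1 to end
plus 1 for the palindrome formed by the two boundary characters themselves.
No subtraction is needed in this case: the range from start + 1 to end - 1 is counted twice, but each of its palindromes also yields a new one when wrapped in the matching boundary characters, so the two cancel.
Otherwise we add the counts for the same two ranges and subtract the count for the range from start + 1 to end - 1, because it is included in both of them and would otherwise be counted twice.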
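A minimal Python sketch of this recurrence follows, taking the two ranges as start to end - 1 and start + 1 to end. The function name, the use of memoization, and the choice to count subsequences by index position (so equal-looking subsequences at different positions count separately, and the empty subsequence is excluded) are assumptions made for illustration.

```python
from functools import lru_cache


def count_palindromic_subsequences(s: str) -> int:
    """Count non-empty palindromic subsequences of s, distinguished by position."""

    @lru_cache(maxsize=None)
    def count(start: int, end: int) -> int:
        if start > end:
            return 0  # empty range contributes nothing
        if start == end:
            return 1  # a single character is a palindrome
        if s[start] == s[end]:
            # Boundary characters match: sum of the two ranges plus 1
            # for the pair formed by the boundary characters themselves.
            return count(start, end - 1) + count(start + 1, end) + 1
        # No match: the inner range is included in both terms, so subtract it once.
        return (count(start, end - 1) + count(start + 1, end)
                - count(start + 1, end - 1))

    return count(0, len(s) - 1)


if __name__ == "__main__":
    for text in ["aaa", "abcd", "abcb"]:
        print(text, count_palindromic_subsequences(text))
```

Memoization keeps the work proportional to the number of (start, end) pairs; for very long strings an iterative table would avoid Python's recursion limit.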

The same recursive structure applies to palindromic substrings, provided each candidate is additionally confirmed to occur contiguously in the string.
