Cluster computing

Thursday, September 14, 2017

Today we continue reviewing U-SQL.It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kind of data stores – file, object and relational. U-SQL works on the Azure ecosystem which involves the Azure data lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
U-SQL is a data processing language.Like SQL, it has declarative language and procedural control flow. The difference between imperative and declarative language is that Imperative means we tell the system how to get the data. Declarative means we tell the system what we want and the system finds a way to get it. The optimizer decides how to do it best. The difference between procedural and functional is that procedural means it operates by changing persistent state with each expression Functional means it transforms input into output with no persistent state. Single objects require control flow. Single objects requires control flows and may require parallelism to be explained. With the data model, we can use higher level data abstractions. There is scale up versus scale out. Programming languages required data in a store and does not make it part of the language model. Data and logic go together. It is imperative and procedural. Data processing languages on the other hand make data part of the language model. It can be both declarative and functional. The data can evolve independently of the application and the optimizer decides the parallelism and
U-SQL like T-SQL provides important benefits with query language. First and foremost, there is consistency and familiarity with its usage. The learning curve and the onboarding from T-SQL to U-SQL is not very steep. Moreover, there is a lot of thought behind the syntax. It is context independent and defined in data processing language. It is also composable. It is important for the query language to be precise and accurate while at the same time be convenient for the user. There is writeability versus readability separation of concerns. All of these are important considerations in U-SQL. In addition to the syntax, the semantics also matter. U-SQL tries to avoid surprises and complexities. Moreoever as a language it is composable for the user and optimizable for the system.
Courtesy U-SQL slide shares

My take on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8

#codingexercise
Given n friends, each one can remain single or can be paired up with some other friend. Each friend can be paired only once so ordering is irrelevant. Find the total number of ways in which the friends can be paired.
Let us define a recursive method to do this which takes a parameter n.
When n is less than two, the number of friends returned is n
If the number of friends is greater than two, then they can be formed excluding one which is the equivalent of calling the recursive method n-1 times. In addition, they can also be formed by pairing all the others excluding this one which is n-1 with the number of friends formed by removing two members
F(n) = F(n-1) + (n-1) F(n-2)

Wednesday, September 13, 2017

Today we continue reviewing U-SQL.It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kind of data stores – file, object and relational. U-SQL works on the Azure ecosystem which involves the Azure data lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
U-SQL is a data processing language.Like SQL, it has declarative language and procedural control flow. The difference between imperative and declarative language is that Imperative means we tell the system how to get the data. Declarative means we tell the system what we want and the system finds a way to get it. The optimizer decides how to do it best. The difference between procedural and functional is that procedural means it operates by changing persistent state with each expression Functional means it transforms input into output with no persistent state. Single objects require control flow. Single objects requires control flows and may require parallelism to be explained. With the data model, we can use higher level data abstractions. There is scale up versus scale out. Programming languages required data in a store and does not make it part of the language model. Data and logic go together. It is imperative and procedural. Data processing languages on the other hand make data part of the language model. It can be both declarative and functional. The data can evolve independently of the application and the optimizer decides the parallelism and synchronization.
Courtesy U-SQL slide shares

My take on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8

#codingexercise

If we have a matrix where each cell has a cost to travel and we can only move right or down, what is the minimum cost to travel from top left of matrix to bottom right corner ?

We keep track of the row and column index. The cell values are unit displacement costs.
Therefore starting from the ending position, we can work our way backwards to start using the path that gives minimum cost
new cost = cell value + minimum of recursive cost at position to the left or recursive cost at position to the right
if we go out of the board, the recursive cost is infinite
if we reach the start position, the recursive cost is the value of the cell at the start position.

The same holds for path taken in any direction.

Tuesday, September 12, 2017

Monday, September 11, 2017

Today we continue reviewing U-SQL.It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kind of data stores – file, object and relational. U-SQL works on the Azure ecosystem which involves the Azure data lake storage as the foundation and the analytics layer over it. The Azure analytics layer consists of both HD Insight and Azure data Lake analytics (HDLA) which target data differently. The HDInsight works on managed Hadoop clusters and allows developers to write map-reduce with open source. The ADLA is native to Azure and enables C#, SQL over job services. We will also recall that Hadoop was inherently batch processing while Microsoft stack allowed streaming as well. The benefit of the Azure storage is that it spans several kinds of data formats and stores. The ADLA has several other advantages over the managed Hadoop clusters in addition to working with a store for the universe. It enables limitless scale and enterprise grade with easy data preparation. The ADLA is built on Apache yarn, scales dynamically and supports a pay by query model. It supports Azure AD for access control and the U-SQL allows programmability like C#.
U-SQL supports big data analytics which generally have the characteristics that they require processing of any kind of data, allow use of custom algorithms, and scale to any size and be efficient.
This lets queries to be written for a variety of big data analytics. In addition, it supports SQL for Big Data which allows querying over structured data Also it enables scaling and parallelization. While Hive supported HiveSQL and Microsoft Scoop connector enabled SQL over big data and Apache Calcite became a SQL Adapter, U-SQL seems to improve the query language itself. It can unify querying over structured and unstructured data. It has declarative SQL and can execute local and remote queries. It increases productivity and agility It brings in features from T-SQL, Hive SQL, and SCOPE which has been Microsoft's internal Big Data language.U-SQL is extensible and it can be extended with C# and .Net
If we look at the pattern of separating query from data source, we quickly see it's no longer just a consolidate of data sources. It is also pushing down the query to the data sources and thus can act as a translator. Projections, filters and joins can now take place where the data resides. This was a design decision that came from the need to support heterogeneous data sources. Moreover, it gives a consistent unified view of the data to the user.

Courtesy : U-SQL slideshare
#codingexercise
given two strings find if one string is a subsequence of another
solution:
check the lengths of the two string.
If the subsequence candidate string is of length zero return true
if the input to be matched with is of length zero, return false
If the last character of both strings match, check recursively with decremented length in both
else check recursively with decremented length in input.

Sunday, September 10, 2017

Today we start reviewing U-SQL.It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kind of data stores – file, object and relational. U-SQL works on the Azure ecosystem which involves the Azure data lake storage as the foundation and the analytics layer over it. The Azure analytics layer consists of both HD Insight and Azure data Lake analytics (HDLA) which target data differently. The HDInsight works on managed Hadoop clusters and allows developers to write map-reduce with open source. The ADLA is native to Azure and enables C#, SQL over job services. We will also recall that Hadoop was inherently batch processing while Microsoft stack allowed streaming as well. The benefit of the Azure storage is that it spans several kinds of data formats and stores. The ADLA has several other advantages over the managed Hadoop clusters in addition to working with a store for the universe. It enables limitless scale and enterprise grade with easy data preparation. The ADLA is built on Apache yarn, scales dynamically and supports a pay by query model. It supports Azure AD for access control and the U-SQL allows programmability like C#.
U-SQL supports big data analytics which generally have the characteristics that they require processing of any kind of data, allow use of custom algorithms, and scale to any size and be efficient.
This lets queries to be written for a variety of big data analytics. In addition, it supports SQL for Big Data which allows querying over structured data Also it enables scaling and parallelization. While Hive supported HiveSQL and Microsoft Scoop connector enabled SQL over big data and Apache Calcite became a SQL Adapter, U-SQL seems to improve the query language itself. It can unify querying over structured and unstructured data. It has declarative SQL and can execute local and remote queries. It increases productivity and agility It brings in features from T-SQL, Hive SQL, and SCOPE which has been Microsoft's internal Big Data language.U-SQL is extensible and it can be extended with C# and .Net
Courtesy : U-SQL slideshare
#codingexercise
Count binary strings with k times appearing adjacent set bits:
Given a string of bits with length n and to find the the number of times k adjacent set bits appear, we solve it recursively:
1) if n == 1 then there is no count
2) if k == 0 or k >n return no count
3) if string ends with 0 add the recursive count for string ending at n-1 and for k adjacent bits
4) else
if the substring upto n-1 endswith 0
add the recursive count for string ending at n-1 and for k adjacent bits
if the substring ends with 1
add the recursive count for string ending at n-1 and k-1 adjacent bits plus one
5) return the count

Saturday, September 9, 2017

We were reviewing Bing Maps API.

Bing Maps helps you visualize spatial data. Users love to know where a store is or how to get there. And the data does not always need to come from Bing. It can be overlaid over the map. As long as the data has geographical coordinates, it can be used with the maps. Previously this data was sent over as xml, now there is support for more succint format as GeoJson

Recently Bing Maps announced support for drag and drop of GeoJson format data over maps. When an application is written to use Bing Maps, it can load the maps and the geojson module. It will listen for drag and drop events. with file reader APIs, each geojson file dropped on the map is read and overlaid.

The steps involved for the application include :

Using HTML5 to load the local files

Select the files from a form input

Each file contains a collection of features where each feature is a location.

The location list is then overlaid on the map.

Load the GeoJson module:

Microsoft.Maps.loadModule('Microsoft.Maps.GeoJson', function () {
//Setup the drag & drop listeners on the map.
var dropZone = document.getElementById('myMap');
dropZone.addEventListener('dragover', handleDragOver, false);
dropZone.addEventListener('drop', handleFileSelect, false);

});
var shapes = Microsoft.Maps.GeoJson.read(geoJsonText);
myMap.entities.push(shapes);
#codingexercise

Prune a binary tree with root to leaf paths whose sum is less than k

we use a recursive post order traversal and eliminate the leaves whose sum is greater than k. Nodes higher up become leaves.

Traverse the left of the root and get left

Traverse the right of the root and get right

if the root is a leaf, and the max of left and right together with the current sum is lesser than k, delete the root and return null

return root.

Friday, September 8, 2017

Today we continue reviewing Bing Maps API

These APIs include map control and services that can be used to make maps part of your application. These APIs are an authoritative source for geospatial features. They offer static and interactive maps, geocoding, route and traffic data.Any spatial data such as store locations can be queried and stored with these APIs. Typically an account and a key is needed to use the APIs. The key is used as a license to utilize their services. They are classified as used for public website, private website and enterprise assets.

The Bing Maps dev center provides account management functionality. An account is needed to cut a key and to manage data sources.

Recently Bing Maps announced support for drag and drop of GeoJson format data over maps.

#codingexercise

Prune a binary tree with root to leaf paths whose sum is greater than k

we use a recursive post order traversal and eliminate the leaves whose sum is greater than k. Nodes higher up become leaves.

Traverse the left of the root and get left

Traverse the right of the root and get right

if the root is a leaf, and the sum is greater than k, delete the root and return null

reduce the sum and return root.