Thursday, December 30, 2021

This is a continuation of a series of articles on the operational engineering aspects of Azure public cloud computing, the most recent of which discussed Azure Data Lake, a full-fledged, generally available service that provides Service Level Agreements comparable to others in its category. This article continues with Azure Data Lake, which is suited to storing and handling Big Data. Because it is built over Azure Blob Storage, it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. In this section, we continue our focus on the programmability aspects of Azure Data Lake.

The power of Azure Data Lake is best demonstrated by U-SQL queries, which can be written without regard to the fact that they are being applied at Big Data scale. U-SQL unifies the benefits of SQL with the expressive power of your own code, and its C# integration (in the spirit of SQLCLR) extends that programmability further. Conventional SQL expressions such as SELECT, EXTRACT, WHERE, HAVING, GROUP BY, and DECLARE can be used as usual, while C# expressions add user-defined types (UDTs), user-defined functions (UDFs), and user-defined aggregates (UDAs). These types, functions, and aggregates can be used directly in a U-SQL script. For example, SELECT Convert.ToDateTime(Convert.ToDateTime(@dt).ToString("yyyy-MM-dd")) AS dt, dt AS olddt FROM @rs0; where @dt is a datetime variable, makes the best of both C# and SQL. The power of SQL expressions can hardly be overstated, and for many business use cases they suffice by themselves, but this programmability means we can push the processing all the way into C# and have the U-SQL script be just an invocation.
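To see what the C# expression in that query actually computes, here is a minimal stand-in written in Python (used only for illustration, since the original expression is C# embedded in U-SQL): formatting a timestamp as a date-only string and parsing it back truncates the time-of-day component.

```python
from datetime import datetime

def truncate_to_date(dt: datetime) -> datetime:
    """Python analogue of the U-SQL/C# expression
    Convert.ToDateTime(Convert.ToDateTime(@dt).ToString("yyyy-MM-dd")):
    render the timestamp as a date-only string, then parse it back,
    which zeroes out hours, minutes, and seconds."""
    return datetime.strptime(dt.strftime("%Y-%m-%d"), "%Y-%m-%d")

dt = datetime(2021, 12, 30, 13, 45, 59)
print(truncate_to_date(dt))  # 2021-12-30 00:00:00
```

In the U-SQL query, the truncated value is projected as the new dt column while the original is kept as olddt, so both the C#-computed and the original values flow through the rowset.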

The trouble with analytics pipelines is that developers prefer open-source solutions for building them. Once we start accruing digital assets in the form of U-SQL scripts, the transition to something like Apache Spark may be neither straightforward nor easy. The Azure analytics layer consists of both HDInsight and Azure Data Lake Analytics (ADLA), which target data differently. HDInsight works on managed Hadoop clusters and lets developers write map-reduce jobs with open-source tooling, while ADLA is native to Azure and runs U-SQL (C# and SQL) as a job service. We will also recall that Hadoop was inherently batch-oriented, while the Microsoft stack allowed streaming as well. The steps to transform U-SQL scripts to Apache Spark include the following:

- Transform the job orchestration pipeline to include the new Spark programs.

- Find the differences between how U-SQL and Spark manage your data.

- Transform the U-SQL scripts to Spark. Choose from Azure Data Factory Data Flow, Azure HDInsight Hive, Azure HDInsight Spark, or Azure Databricks.

With these steps, it is possible to have the best of both worlds while leveraging the benefits of each.
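To make the script-transformation step concrete, the sketch below shows how the core U-SQL verbs line up with the Spark DataFrame API. Plain-Python stand-ins over an in-memory list of rows keep the example self-contained; the comments give the corresponding PySpark calls. The data and column names (user, query, latency) are purely illustrative.

```python
from collections import defaultdict

# Rows standing in for a file that a U-SQL EXTRACT would read.
# PySpark: df = spark.read.csv("searchlog.tsv", sep="\t", header=True)
rows = [
    {"user": "alice", "query": "azure", "latency": 100},
    {"user": "alice", "query": "spark", "latency": 300},
    {"user": "bob",   "query": "usql",  "latency": 50},
]

# U-SQL:  SELECT ... WHERE latency > 60
# PySpark: df = df.filter(df.latency > 60)
filtered = [r for r in rows if r["latency"] > 60]

# U-SQL:  SELECT user, SUM(latency) ... GROUP BY user
# PySpark: df.groupBy("user").agg(F.sum("latency"))
totals = defaultdict(int)
for r in filtered:
    totals[r["user"]] += r["latency"]

# U-SQL:  OUTPUT @result TO "result.tsv" USING Outputters.Tsv();
# PySpark: df.write.csv("result", sep="\t")
print(dict(totals))  # {'alice': 400}
```

The mapping is mostly mechanical for rowset operations like these; the harder part of a migration is porting any C# UDTs, UDFs, and UDAs the scripts depend on, since those have no direct open-source equivalent and must be rewritten for the target engine.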
