This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing, including the most recent discussion on Azure Data Lake, a generally available service that provides Service Level Agreements comparable to others in its category. This article continues the focus on Azure Data Lake, which is suited to storing and handling Big Data. Because it is built over Azure Blob Storage, it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. In this section, we continue our focus on its programmability aspects.
The power of Azure Data Lake is best demonstrated by U-SQL queries, which can be written without regard to the fact that they run at Big Data scale. U-SQL unifies the benefits of SQL with the expressive power of your own code, much as SQLCLR does for SQL Server. Conventional SQL constructs such as SELECT, EXTRACT, WHERE, HAVING, GROUP BY, and DECLARE can be used as usual, while C# expressions enhance user-defined types (UDTs), user-defined functions (UDFs), and user-defined aggregates (UDAs). These types, functions, and aggregates can be used directly in a U-SQL script. For example, SELECT Convert.ToDateTime(Convert.ToDateTime(@dt).ToString("yyyy-MM-dd")) AS dt, dt AS olddt FROM @rs0; where @dt is a datetime variable, makes the best of both C# and SQL. The power of SQL expressions can hardly be overstated, and for many business use-cases they suffice by themselves, but the C# programmability means we can take the processing all the way into C# and have the U-SQL script be just an invocation.
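As a sketch of that pattern, the following U-SQL script invokes a C# helper from a code-behind file. The class name MyHelpers.DateUtils, the input path, and the schema are assumptions for illustration, not a fixed API:

```
// Code-behind (MyScript.usql.cs) -- a hypothetical C# helper:
// namespace MyHelpers {
//     public static class DateUtils {
//         public static string ToDateOnly(DateTime dt) {
//             return dt.ToString("yyyy-MM-dd");
//         }
//     }
// }

// U-SQL script (MyScript.usql):
@rs0 =
    EXTRACT id int,
            eventTime DateTime
    FROM "/input/events.csv"
    USING Extractors.Csv();

// The C# UDF is called directly inside the SELECT clause.
@rs1 =
    SELECT id,
           MyHelpers.DateUtils.ToDateOnly(eventTime) AS eventDate
    FROM @rs0;

OUTPUT @rs1
TO "/output/events_by_date.csv"
USING Outputters.Csv();
```

Pushing the transformation into the code-behind keeps the script itself a thin invocation, which is the pattern described above.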
The trouble with analytics pipelines is that developers prefer open-source solutions to build them. When we start accruing digital assets in the form of U-SQL scripts, the transition to working with something like Apache Spark might not be straightforward or easy. The Azure analytics layer consists of both HDInsight and Azure Data Lake Analytics (ADLA), which target data differently. HDInsight works on managed Hadoop clusters and allows developers to write map-reduce jobs with open-source tooling. ADLA is native to Azure and enables C# and SQL over its job service. We will also recall that Hadoop was inherently batch-oriented, while the Microsoft stack allowed streaming as well. The steps to transform U-SQL scripts to Apache Spark include the following:
- Transform the job orchestration pipeline to include the new Spark programs.
- Find the differences between how U-SQL and Spark manage your data.
- Transform the U-SQL scripts to Spark. Choose from one of Azure Data Factory Data Flow, Azure HDInsight Hive, Azure HDInsight Spark, or Azure Databricks services.
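As a hedged sketch of the last step, a typical U-SQL EXTRACT/SELECT/OUTPUT script maps roughly onto the Spark DataFrame API. The paths, column names, and date-formatting logic below are assumptions for illustration, and a live Spark environment is required to run it:

```
# Hypothetical PySpark translation of a U-SQL EXTRACT/SELECT/OUTPUT script.
# Paths and schema are illustrative; requires an existing Spark installation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("usql-migration").getOrCreate()

# EXTRACT ... USING Extractors.Csv()  ->  spark.read.csv with named columns
df = (spark.read.csv("/input/events.csv", inferSchema=True)
           .toDF("id", "eventTime"))

# SELECT with a C# date expression  ->  built-in Spark SQL functions
result = df.select("id",
                   F.date_format("eventTime", "yyyy-MM-dd").alias("eventDate"))

# OUTPUT ... USING Outputters.Csv()  ->  DataFrame.write.csv
result.write.mode("overwrite").csv("/output/events_by_date.csv")
```

Note that a C# code-behind UDF has no direct equivalent here: it must be reimplemented either with Spark's built-in functions, as above, or as a user-defined function registered with Spark, which is one of the data-management differences the second step asks you to catalog.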
With these steps, it is possible to have the best of both
worlds while leveraging the benefits of each.