This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing, following the most recent discussion on Azure Data Lake, which is a full-fledged general-availability service that provides Service Level Agreements similar to others in its category. Azure Data Lake is suited to storing and handling Big Data. It is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. In this section, we continue our focus on Data Lake monitoring and usage.
Monitoring for Azure Data Lake leverages the
monitoring of the underlying storage account. Azure Storage Analytics performs logging
and provides metric data for a storage account. This data can be used to trace
requests, analyze usage trends, and diagnose issues with the storage account.
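As a hedged sketch of how this log data could be consumed, the U-SQL script below summarizes request counts per operation from Storage Analytics logs, assuming the logs have been exported into the data lake as CSV; the path and the column selection are illustrative, not the full log schema:

```sql
// Illustrative sketch: summarize request activity from exported
// Storage Analytics logs (path and columns are assumptions).
@log =
    EXTRACT requestTime DateTime,
            operation string,
            status int
    FROM "/logs/storageanalytics.csv"
    USING Extractors.Csv();

@summary =
    SELECT operation,
           COUNT(*) AS requestCount
    FROM @log
    GROUP BY operation;

OUTPUT @summary
TO "/output/requestSummary.csv"
USING Outputters.Csv();
```

A rollup like this is one way to analyze usage trends without pulling the raw logs out of the account.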
The power of Azure Data Lake is best demonstrated
by the U-SQL queries that can be written without regard to the fact that they are
being applied at Big Data scale. U-SQL unifies the benefits of SQL with
the expressive power of your own code, and it works well with all
kinds of data stores: file, object, and relational. U-SQL works on the Azure
ecosystem, which involves Azure Data Lake storage as the foundation and the
analytics layer over it. The Azure analytics layer consists of both HDInsight
and Azure Data Lake Analytics (ADLA), which target data differently.
HDInsight works on managed Hadoop clusters and allows developers to write
map-reduce jobs with open-source frameworks. ADLA is native to Azure and enables C# and SQL
over a job service. We will also recall that Hadoop was inherently batch-oriented,
while the Microsoft stack allowed streaming as well. The benefit of the
Azure storage is that it spans several kinds of data formats and stores. The
ADLA has several other advantages over managed Hadoop clusters, in addition
to working with a single store for all data. It enables virtually limitless scale and
enterprise-grade security with easy data preparation. ADLA is built on Apache YARN,
scales dynamically, and supports a pay-per-query model. It supports Azure AD for
access control, and U-SQL allows C#-style programmability.
U-SQL supports big data analytics workloads, which generally share these
characteristics: they require processing of any kind of data, allow the use
of custom algorithms, and must scale to any size efficiently.
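As a minimal sketch of what such a query looks like (the path, schema, and date are hypothetical), a U-SQL script pairs SQL-style rowset operations with C# expressions:

```sql
DECLARE @in string = "/data/searchlog.tsv";    // hypothetical input path
DECLARE @out string = "/output/result.csv";

// Extract a rowset from a tab-separated file with a declared schema.
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string
    FROM @in
    USING Extractors.Tsv();

// SQL-style filtering and aggregation; the filter uses a C# expression.
@result =
    SELECT Region,
           COUNT(*) AS Queries
    FROM @searchlog
    WHERE Start >= DateTime.Parse("2016-01-01")
    GROUP BY Region;

OUTPUT @result
TO @out
USING Outputters.Csv();
```

The same script shape applies whether the input is a few rows or Big Data scale, which is the point being made here.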
This lets queries be written for a variety of big data
analytics. In addition, it supports SQL over Big Data, which allows querying over
structured data. It also enables scaling and parallelization. While Hive
supported HiveQL, the Microsoft Sqoop connector enabled SQL over big data, and
Apache Calcite became a SQL Adapter, U-SQL seems to improve the query language
itself. It can unify querying over structured and unstructured data. It has
declarative SQL and can execute local and remote queries. It increases
productivity and agility. It brings in features from T-SQL, HiveQL, and SCOPE,
which has been Microsoft's internal Big Data language. U-SQL is extensible, and
it can be extended with C# and .NET.
If we look at the pattern of separating query from data
source, we quickly see it's no longer just a consolidation of data sources. It
is also pushing down the query to the data sources and thus can act as a
translator. Projections, filters and joins can now take place where the data
resides. This was a design decision that came from the need to support
heterogeneous data sources. Moreover, it gives a consistent unified view of the
data to the user.
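To ground the pushdown idea, U-SQL can federate queries to a remote store such as Azure SQL Database through a registered data source. The sketch below follows that pattern; the database, credential, and table names are illustrative assumptions:

```sql
// Assumes a credential has already been registered for the target
// database; all names here are hypothetical.
CREATE DATA SOURCE IF NOT EXISTS MySqlDb
FROM AZURESQLDB
WITH (PROVIDER_STRING = "Database=Orders;",
      CREDENTIAL = MyDb.MyCredential,
      REMOTABLE_TYPES = (bool, byte, short, int, long, decimal,
                         float, double, string, DateTime));

// Projections and filters on this query can be pushed down to the
// remote database rather than evaluated after the data is moved.
@orders =
    SELECT *
    FROM EXTERNAL MySqlDb
    LOCATION "dbo.Orders"
    WHERE OrderDate >= DateTime.Parse("2018-01-01");
```

Because the types are declared remotable, the translation layer can forward the predicate to where the data resides, which is the design decision described above.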
SQLCLR improves the programmability of U-SQL. Conventional SQL expressions like SELECT,
EXTRACT, WHERE, HAVING, GROUP BY, and DECLARE can be used as usual, while C#
expressions enable user-defined types (UDTs), user-defined functions (UDFs),
and user-defined aggregates (UDAs).
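As a hedged sketch of the UDF case, a function defined in C# code-behind (the namespace, class, and rowset names are hypothetical) can be invoked directly from the script:

```sql
// C# code-behind (e.g., Script.usql.cs) might define:
//   namespace MyFunctions {
//       public static class Formatters {
//           public static string ToDateOnly(DateTime dt)
//           { return dt.ToString("yyyy-MM-dd"); }
//       }
//   }

// Assuming @searchlog was extracted earlier with a Start DateTime column,
// the UDF is called like any other C# expression.
@result =
    SELECT MyFunctions.Formatters.ToDateOnly(Start) AS StartDate,
           Region
    FROM @searchlog;
```

The UDF takes a simple parameter and returns a simple datatype, which is the common shape mentioned below for business cases.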
These types, functions and aggregates can be used directly in a U-SQL
script. For example, SELECT
Convert.ToDateTime(Convert.ToDateTime(@dt).ToString("yyyy-MM-dd")) AS
dt, dt AS olddt FROM @rs0;, where @dt is a datetime variable, makes the best of
both C# and SQL. The power of SQL expressions can hardly be overstated; for many
business use-cases they suffice by themselves. But having C# programmability implies that we can
take all the processing into C# and have the U-SQL script just be an
invocation. This requires the assembly to be
registered and versioned. U-SQL runs code in x64 format. An uploaded assembly
DLL or resource file, such as a different runtime, a native assembly, or a
configuration file, can be at most 400 MB.
The total size of all registered resources cannot be greater than 3 GB.
There can only be one version of any given assembly. This is sufficient for many business cases,
which can often be written in the form of a UDF that takes simple parameters
and outputs a simple datatype. These functions can even keep state between
invocations. U-SQL comes with a test SDK, and together with the local-run SDK,
script-level tests can be authored. Azure Data Lake Tools for Visual Studio
enables us to create U-SQL script test cases. A test data source can also be
specified for these tests.
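Registering and referencing a custom assembly, as described above, follows this pattern; the database name and store path are illustrative:

```sql
// One-time registration in the U-SQL catalog, after the DLL has been
// uploaded to the store (database and path are assumptions).
USE DATABASE MyDb;
CREATE ASSEMBLY IF NOT EXISTS MyFunctions FROM "/Assemblies/MyFunctions.dll";

// In a later script, reference the registered assembly before using
// the types and functions it contains.
REFERENCE ASSEMBLY MyDb.MyFunctions;
```

Since only one version of a given assembly can be registered, updating the code means dropping and re-creating the assembly under the same name.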