Wednesday, December 29, 2021

This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing, the most recent of which discussed Azure Data Lake, a full-fledged, generally available service that provides Service Level Agreements comparable to others in its category. This article continues the focus on Azure Data Lake, which is suited to storing and handling Big Data. Because it is built over Azure Blob Storage, it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a great deal of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. In this section, we continue our focus on Data Lake monitoring and usage.

Monitoring for Azure Data Lake leverages the monitoring of the underlying storage account. Azure Storage Analytics performs logging and provides metric data for a storage account. This data can be used to trace requests, analyze usage trends, and diagnose issues with the storage account.
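Since Storage Analytics writes its log entries as delimited text blobs, those logs can themselves be processed with U-SQL. The following is a minimal sketch only: the input path and the three columns are illustrative assumptions, and the real log format contains many more fields that a complete EXTRACT schema would have to declare.

```sql
// Sketch: summarize Storage Analytics log entries by operation type.
// The path and column layout are assumptions; consult the Storage Analytics
// log format for the authoritative semicolon-delimited field order.
@log =
    EXTRACT RequestStartTime string,
            OperationType string,
            RequestStatus string
    FROM "/logs/blob/{*}.log"
    USING Extractors.Text(delimiter: ';', silent: true); // silent skips rows
                                                         // that do not match
                                                         // this partial schema

@summary =
    SELECT OperationType,
           COUNT(*) AS RequestCount
    FROM @log
    GROUP BY OperationType;

OUTPUT @summary
TO "/output/request-summary.csv"
USING Outputters.Csv(outputHeader: true);
```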

The power of Azure Data Lake is best demonstrated by U-SQL queries, which can be written without regard to the fact that they run at Big Data scale. U-SQL unifies the benefits of SQL with the expressive power of your own code, and it works well with all kinds of data stores: file, object, and relational. U-SQL runs on the Azure ecosystem, which has Azure Data Lake storage as the foundation and an analytics layer over it. The Azure analytics layer consists of both HDInsight and Azure Data Lake Analytics (ADLA), which target data differently. HDInsight works on managed Hadoop clusters and allows developers to write map-reduce jobs with open-source frameworks, while ADLA is native to Azure and enables C# and SQL over a job service. We will also recall that Hadoop was inherently batch-oriented, while the Microsoft stack allowed streaming as well. The benefit of Azure storage is that it spans several kinds of data formats and stores. ADLA has several other advantages over managed Hadoop clusters in addition to working over a single store for all the data: it enables limitless scale and enterprise-grade features with easy data preparation. ADLA is built on Apache YARN, scales dynamically, and supports a pay-per-job pricing model. It supports Azure AD for access control, and U-SQL allows programmability with C#.
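As a small illustration of that programmability, here is a minimal U-SQL job sketch, with all names and paths invented for the example, that mixes a declarative rowset with an ordinary C# method call:

```sql
// A constant rowset declared inline, queried with SQL plus C# expressions.
@departments =
    SELECT * FROM (VALUES
        (1, "Engineering"),
        (2, "Sales")
    ) AS D(DepId, DepName);

@result =
    SELECT DepId,
           DepName.ToUpper() AS DepNameUpper   // C# string method on a column
    FROM @departments;

OUTPUT @result
TO "/output/departments.csv"
USING Outputters.Csv();
```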

U-SQL supports big data analytics, which generally share these characteristics: they require processing of any kind of data, allow the use of custom algorithms, and must scale to any size while remaining efficient.
This lets queries be written for a variety of big data analytics. In addition, U-SQL supports SQL over Big Data, which allows querying over structured data, and it enables scaling and parallelization. While Hive supported HiveQL, the Microsoft Sqoop connector enabled SQL over big data, and Apache Calcite became a SQL adapter, U-SQL improves the query language itself. It can unify querying over structured and unstructured data. It has declarative SQL and can execute local and remote queries. It increases productivity and agility. It brings in features from T-SQL, HiveQL, and SCOPE, which has been Microsoft's internal Big Data language. U-SQL is extensible, and it can be extended with C# and .NET.
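The unification over structured and unstructured data rests on schema-on-read: the EXTRACT clause assigns a schema to raw file data inside the script, after which ordinary SQL applies. A sketch, with the file path and columns assumed for illustration:

```sql
// Schema-on-read: the schema lives in the script, not in the store.
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string
    FROM "/input/SearchLog.tsv"
    USING Extractors.Tsv();

// Once extracted, unstructured file data is queried like any table.
@result =
    SELECT Region,
           COUNT(*) AS QueryCount
    FROM @searchlog
    GROUP BY Region;

OUTPUT @result
TO "/output/queries-by-region.csv"
USING Outputters.Csv();
```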
If we look at the pattern of separating the query from the data source, we quickly see that it is no longer just a consolidation of data sources. It also pushes the query down to the data sources and can thus act as a translator. Projections, filters, and joins can now take place where the data resides. This design decision came from the need to support heterogeneous data sources. Moreover, it gives the user a consistent, unified view of the data.
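U-SQL's federated queries follow this pattern. The sketch below is based on the documented CREATE DATA SOURCE / EXTERNAL shape, but all names, columns, and the connection string are invented for illustration, so treat it as an outline rather than a copy-paste recipe:

```sql
// Register a remote Azure SQL Database as a data source (names illustrative).
// REMOTABLE_TYPES declares which types may be shipped to the remote side,
// which is what allows predicates on those columns to be pushed down.
CREATE DATA SOURCE IF NOT EXISTS MyOrdersDb
FROM AZURESQLDB
WITH
(
    PROVIDER_STRING = "Database=OrdersDb;Encrypt=True",
    CREDENTIAL = MyDb.MyCredential,
    REMOTABLE_TYPES = (bool, short, int, long, decimal, double, string, DateTime)
);

// The filter below can execute where the data resides rather than in ADLA.
@bigOrders =
    SELECT OrderId, Amount
    FROM EXTERNAL MyOrdersDb LOCATION "dbo.Orders"
    WHERE Amount > 100;

OUTPUT @bigOrders
TO "/output/big-orders.csv"
USING Outputters.Csv();
```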

C# integration, in the style of SQLCLR, improves the programmability of U-SQL. Conventional SQL expressions such as SELECT, EXTRACT, WHERE, HAVING, GROUP BY, and DECLARE can be used as usual, while C# expressions improve user-defined types (UDTs), user-defined functions (UDFs), and user-defined aggregates (UDAs). These types, functions, and aggregates can be used directly in a U-SQL script. For example, SELECT Convert.ToDateTime(Convert.ToDateTime(@dt).ToString("yyyy-MM-dd")) AS dt, dt AS olddt FROM @rs0; where @dt is a datetime variable, makes the best of both C# and SQL. SQL expressions are powerful and suffice by themselves for many business use-cases, but SQL programmability means we can even take all the processing into C# and have the SQL script be just an invocation. This requires the assembly to be registered and versioned. U-SQL runs code in x64 format. An uploaded assembly DLL or resource file, such as a different runtime, a native assembly, or a configuration file, can be at most 400 MB. The total size of all registered resources cannot be greater than 3 GB. There can be only one version of any given assembly. This is sufficient for many business cases, which can often be written in the form of a UDF that takes simple parameters and outputs a simple datatype. These functions can even keep state between invocations. U-SQL comes with a test SDK, and together with the local-run SDK, script-level tests can be authored. Azure Data Lake Tools for Visual Studio enables us to create U-SQL script test cases. A test data source can also be specified for these tests.
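As a sketch of the UDF pattern described above, with the namespace, class, and column names invented for the example, a C# method in the script's code-behind file can be called directly from U-SQL:

```sql
// Code-behind file (Script.usql.cs), shown here as a comment:
//
// namespace MyUdfs
// {
//     public static class DateHelpers
//     {
//         // A simple UDF: simple parameters in, a simple datatype out.
//         public static string ToDateOnly(DateTime dt)
//         {
//             return dt.ToString("yyyy-MM-dd");
//         }
//     }
// }

// In the U-SQL script, the method is invoked like any other C# expression.
@rs1 =
    SELECT MyUdfs.DateHelpers.ToDateOnly(EventTime) AS EventDate,
           EventTime AS OriginalTime
    FROM @rs0;
```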
