This is a continuation of a series of articles on the operational engineering aspects of Azure public cloud computing, most recently a discussion of Azure Data Lake, a generally available service that provides Service Level Agreements comparable to others in its category. This article focuses on Azure Data Lake, which is suited to storing and handling Big Data. It is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. In this section, we continue our focus on the programmability aspects of Azure Data Lake.
We discussed that there are two forms of programming with
Azure Data Lake: one that leverages U-SQL and another that leverages
open-source frameworks such as Apache Spark.
U-SQL unifies the benefits of SQL with the expressive power of your own
code. SQLCLR improves the programmability of U-SQL by allowing the user to write
user-defined operators such as functions, aggregations, and data types that can
be used in conventional SQL expressions or invoked directly from a
SQL statement.
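For illustration, a U-SQL script that invokes such user code might look like the sketch below; the assembly name MyAssembly, the function MyNamespace.MyFunctions.TagCount, and the file paths are hypothetical placeholders:

REFERENCE ASSEMBLY MyAssembly;

@tweets =
    EXTRACT author string,
            text string
    FROM "/input/tweets.csv"
    USING Extractors.Csv();

// The user-defined function is invoked like any built-in expression.
@counts =
    SELECT author,
           MyNamespace.MyFunctions.TagCount(text) AS tagCount
    FROM @tweets;

OUTPUT @counts
TO "/output/tagcounts.csv"
USING Outputters.Csv();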
The other form of programming largely applies to HDInsight,
as opposed to U-SQL for Azure Data Lake Analytics (ADLA); it targets
data in batch processing, often involving map-reduce algorithms on Hadoop clusters.
Also, Hadoop is inherently batch-oriented, while the Microsoft stack allows
streaming as well. Some open-source systems such as
Flink, Kafka, Pulsar, and StreamNative support stream operators. While Kafka
treats stream processing as a special case of batch processing, Flink does
just the reverse. Apache Flink also provides a SQL abstraction over its Table
API; a sketch of that follows the query below. A sample Flink query might look like this:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

// Assumes ExtractHashTags, FilterHashTags, and GetTopHashTag are
// user-defined functions implemented elsewhere in the job.
final StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

env.fromCollection(tuples)            // tuples: List<Tuple2<String, Integer>>
    .flatMap(new ExtractHashTags())   // emit one record per hashtag
    .keyBy(0)                         // partition the stream by hashtag
    .timeWindow(Time.seconds(30))     // 30-second tumbling window per key
    .sum(1)                           // count occurrences within each window
    .filter(new FilterHashTags())     // keep only tags of interest
    .timeWindowAll(Time.seconds(30))  // single global window across all keys
    .apply(new GetTopHashTag())       // pick the most frequent hashtag
    .print();                         // write results to stdout
env.execute();
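To illustrate the SQL abstraction mentioned above, here is a minimal sketch of the same kind of counting expressed through Flink's Table API. It assumes a recent Flink 1.x release; the class name HashTagSql, the table name tweets, and the column names are hypothetical:

import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class HashTagSql {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // A toy stream of (hashtag, count) pairs standing in for real input.
        DataStream<Tuple2<String, Integer>> tags = env.fromElements(
            Tuple2.of("#flink", 1), Tuple2.of("#sql", 1), Tuple2.of("#flink", 1));

        // Expose the stream to the SQL layer as a table with named columns.
        tableEnv.createTemporaryView("tweets", tags, $("tag"), $("cnt"));

        // The counting logic expressed declaratively in SQL.
        Table totals = tableEnv.sqlQuery(
            "SELECT tag, SUM(cnt) AS total FROM tweets GROUP BY tag");

        // Convert the continuously updating result back to a stream and print it.
        tableEnv.toChangelogStream(totals).print();
        env.execute();
    }
}

The SQL form trades the explicit windowing of the DataStream program for a declarative statement that the planner optimizes.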
Notice the use of pipelined execution and the writing of functions that
process the input one record at a time. A sample function, this one from Apache Pulsar, looks like this:
package org.apache.pulsar.functions.api.examples;

import java.util.function.Function;

// A Pulsar function that appends an exclamation point to each input string.
public class ExclamationFunction implements Function<String, String> {
    @Override
    public String apply(String input) {
        return String.format("%s!", input);
    }
}
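Because the class implements the plain java.util.function.Function interface, it can be exercised directly before deployment; a minimal sketch, with a hypothetical test class name:

public class ExclamationFunctionTest {
    public static void main(String[] args) {
        ExclamationFunction fn = new ExclamationFunction();
        // Prints "hello!" - the function simply appends an exclamation point.
        System.out.println(fn.apply("hello"));
    }
}

Once packaged as a jar, the function can be submitted to a cluster with the pulsar-admin functions create command, specifying the input and output topics at deployment time.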
Both forms have their purpose, and the choice depends on the stack used
for the analytics.