This is a continuation of a series of articles on the operational engineering aspects of Azure public cloud computing, most recently a discussion of Azure Data Lake, a generally available service that provides Service Level Agreements comparable to others in its category. This article focuses on Azure Data Lake, which is suited to storing and handling Big Data. It is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. In this section, we continue our focus on the programmability aspects of Azure Data Lake.
We discussed that there are two forms of programming with
Azure Data Lake: one that leverages U-SQL and another that leverages
open-source frameworks such as Apache Spark.
U-SQL unifies the benefits of SQL with the expressive power of your own
code. SQLCLR improves the programmability of U-SQL by allowing the user to write
user-defined operators such as functions, aggregations, and data types that can
be used in conventional SQL expressions or invoked directly from a
SQL statement.
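For illustration, a U-SQL script that invokes such user code might look like the sketch below; the assembly name MyAssembly, the function MyNamespace.MyFunctions.TagCount, and the file paths are hypothetical placeholders:

REFERENCE ASSEMBLY MyAssembly;

@tweets =
    EXTRACT author string,
            text string
    FROM "/input/tweets.csv"
    USING Extractors.Csv();

// The user-defined function is invoked like any built-in expression.
@counts =
    SELECT author,
           MyNamespace.MyFunctions.TagCount(text) AS tagCount
    FROM @tweets;

OUTPUT @counts
TO "/output/tagcounts.csv"
USING Outputters.Csv();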
The other form of programming largely applies to HDInsight,
as opposed to U-SQL for Azure Data Lake Analytics (ADLA); it targets
data in batch processing, often involving map-reduce algorithms on Hadoop clusters.
Also, Hadoop is inherently batch-oriented, while the Microsoft stack allows
streaming as well. Some open-source systems such as
Flink, Kafka, Pulsar, and StreamNative support stream operators. While Kafka
treats stream processing as a special case of batch processing, Flink does
just the reverse. Apache Flink also provides a SQL abstraction over its Table
API; a sketch of that follows the query below. A sample Flink query might look like this:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

// Assumes ExtractHashTags, FilterHashTags, and GetTopHashTag are
// user-defined functions implemented elsewhere in the job.
final StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

env.fromCollection(tuples)            // tuples: List<Tuple2<String, Integer>>
    .flatMap(new ExtractHashTags())   // emit one record per hashtag
    .keyBy(0)                         // partition the stream by hashtag
    .timeWindow(Time.seconds(30))     // 30-second tumbling window per key
    .sum(1)                           // count occurrences within each window
    .filter(new FilterHashTags())     // keep only tags of interest
    .timeWindowAll(Time.seconds(30))  // single global window across all keys
    .apply(new GetTopHashTag())       // pick the most frequent hashtag
    .print();                         // write results to stdout
env.execute();
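To illustrate the SQL abstraction mentioned above, here is a minimal sketch of the same kind of counting expressed through Flink's Table API. It assumes a recent Flink 1.x release; the class name HashTagSql, the table name tweets, and the column names are hypothetical:

import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class HashTagSql {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // A toy stream of (hashtag, count) pairs standing in for real input.
        DataStream<Tuple2<String, Integer>> tags = env.fromElements(
            Tuple2.of("#flink", 1), Tuple2.of("#sql", 1), Tuple2.of("#flink", 1));

        // Expose the stream to the SQL layer as a table with named columns.
        tableEnv.createTemporaryView("tweets", tags, $("tag"), $("cnt"));

        // The counting logic expressed declaratively in SQL.
        Table totals = tableEnv.sqlQuery(
            "SELECT tag, SUM(cnt) AS total FROM tweets GROUP BY tag");

        // Convert the continuously updating result back to a stream and print it.
        tableEnv.toChangelogStream(totals).print();
        env.execute();
    }
}

The SQL form trades the explicit windowing of the DataStream program for a declarative statement that the planner optimizes.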
Notice the use of pipelined execution and the writing of functions that
process the input one record at a time. A sample function, this one from Apache Pulsar, looks like this:
package org.apache.pulsar.functions.api.examples;

import java.util.function.Function;

// A Pulsar function that appends an exclamation point to each input string.
public class ExclamationFunction implements Function<String, String> {
    @Override
    public String apply(String input) {
        return String.format("%s!", input);
    }
}
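Because the class implements the plain java.util.function.Function interface, it can be exercised directly before deployment; a minimal sketch, with a hypothetical test class name:

public class ExclamationFunctionTest {
    public static void main(String[] args) {
        ExclamationFunction fn = new ExclamationFunction();
        // Prints "hello!" - the function simply appends an exclamation point.
        System.out.println(fn.apply("hello"));
    }
}

Once packaged as a jar, the function can be submitted to a cluster with the pulsar-admin functions create command, specifying the input and output topics at deployment time.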
Both forms have their purpose, and the choice depends on the stack used
for the analytics.