Saturday, May 30, 2020

Application troubleshooting continued...

  • Debugging and monitoring: 

  • Flink applications support extensible user-defined metrics in addition to system metrics. Although not all of these may be exposed via the stream store and analytics user interface, applications can register metric types such as Counters, Gauges, Histograms, and Meters (a registration sketch follows this list). 

  • Log4j and Logback can be configured with the appropriate log levels so that the relevant operators emit log entries. These logs can also be collected continuously as the system makes progress. 

  • The status and statistics of completed jobs that the JobManager has archived can be viewed via the HistoryServer once it is configured. The Flink user interface may also have support for this. 

  • Since the job graph can involve multiple jobs, each can be queried independently using its job id. 

  • Checkpoints can also be monitored, although the stream store and analytics user interface might not expose them. 

  • Checkpoints can be triggered and restorations can be performed. 

  • Backpressure can be detected. If a task produces data faster than the downstream operators can consume it, the web interface shows a backpressure rating for that task. 

  • REST APIs for monitoring are available from the ‘flink-runtime’ project and are hosted by the Dispatcher. The web dashboard for monitoring shows the same information. It is also possible to extend these APIs (see the polling sketch after this list). 

  • Event time and watermarks are powerful features that enable applications to handle late and out-of-order events so that the events remain properly sequenced. The Flink runtime allows sources to issue timestamps, or lets Flink assign timestamps based on event origination time, ingestion time, or processing time (see the assigner sketch after this list). Applications don’t have to implement time windows themselves, although they are not restricted from doing so. 
    Monitoring event time is tricky for two reasons. First, when no event arrives, it is not clear whether time is advancing or whether there is simply no data. Second, when an event is received, there is no way to know whether another event with the same timestamp is still coming. 
    The Stream is usually not a single sequence of bytes but a coordination of multiple parallel segments. A segment is a sequence of bytes and is not mixed with anything that is not data. Metadata exists in its own stream and is usually internal. 
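
To make the user-metrics bullet concrete, here is a minimal sketch of registering and incrementing a Counter from a rich function; the metric name "eventsSeen" and the String element type are illustrative assumptions, not taken from any particular application.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

public class CountingMapper extends RichMapFunction<String, String> {

    private transient Counter eventsSeen;  // "eventsSeen" is a hypothetical metric name

    @Override
    public void open(Configuration parameters) {
        // Register the counter with this operator's metric group.
        this.eventsSeen = getRuntimeContext().getMetricGroup().counter("eventsSeen");
    }

    @Override
    public String map(String value) {
        eventsSeen.inc();  // user metric updated on every element
        return value;
    }
}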
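
The REST endpoints can be polled with any HTTP client. A minimal sketch with the JDK 11 HttpClient follows; the host and port (localhost:8081, the dashboard default) are assumptions about the deployment.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class JobOverviewPoller {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // /jobs/overview returns JSON summarizing all known jobs and their states.
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:8081/jobs/overview")).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}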
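
As for timestamp assignment, the following sketch uses the bounded out-of-orderness extractor that ships with Flink; the input stream and the MyEvent type with its getCreationTime() accessor are hypothetical placeholders.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

// Tolerate events that arrive up to ten seconds out of order.
DataStream<MyEvent> withTimestamps = input.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(10)) {
            @Override
            public long extractTimestamp(MyEvent event) {
                return event.getCreationTime();  // event-origination time
            }
        });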


Thursday, May 28, 2020

Application troubleshooting guide

Application debugging, job submission results, and logging are not interactive during stream processing. Consequently, application developers rely on runtime querying and state persistence. A few other options are listed here:
Use of side outputs
When events are processed from an infinite sequence in the main stream, the results can be stored in one or more additional output streams. Such a resulting stream is called a side output. The type of events stored in a side output does not have to match the event type of the main stream. 
The side output stream is denoted with the OutputTag as shown here:
OutputTag<String> outputTag = new OutputTag<String>("side-output") {};
The side-output stream is retrieved using the getSideOutput(OutputTag) method on the result of the DataStream operation, as the sketch below shows.
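Here is a minimal sketch of emitting to and retrieving a side output, assuming a String-typed main stream named input; only the OutputTag name comes from the snippet above.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

final OutputTag<String> outputTag = new OutputTag<String>("side-output") {};

SingleOutputStreamOperator<String> mainStream = input
        .process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String value, Context ctx, Collector<String> out) {
                out.collect(value);                        // regular output
                ctx.output(outputTag, "side: " + value);   // routed to the side output
            }
        });

DataStream<String> sideStream = mainStream.getSideOutput(outputTag);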
o Testing:
Testing of user-defined functions is made possible with mocks. The Flink collection comes with so-called test harnesses: OneInputStreamOperatorTestHarness for operators on DataStreams, KeyedOneInputStreamOperatorTestHarness for operators on KeyedStreams, TwoInputStreamOperatorTestHarness for operators on ConnectedStreams of two DataStreams, and KeyedTwoInputStreamOperatorTestHarness for operators on ConnectedStreams of two KeyedStreams.
The ProcessFunction is the most common way to process the events in a stream. It can be unit-tested with the ProcessFunctionTestHarnesses. The PassThroughProcessFunction, along with the previously mentioned types of DataStreams, comes in useful for invoking the processElement method on the ProcessFunctionTestHarness. The extractOutputValues method on the harness is useful for retrieving the list of emitted records for assertions, as the sketch below shows.
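A minimal sketch of such a unit test, assuming JUnit and the flink-test-utils dependency are on the classpath; the inline pass-through function is an illustrative stand-in.

import java.util.Arrays;

import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.util.OneInputStreamOperatorTestHarness;
import org.apache.flink.streaming.util.ProcessFunctionTestHarnesses;
import org.apache.flink.util.Collector;

import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class PassThroughTest {

    @Test
    public void testPassThrough() throws Exception {
        // Wrap a trivial pass-through ProcessFunction in a test harness.
        OneInputStreamOperatorTestHarness<Integer, Integer> harness =
            ProcessFunctionTestHarnesses.forProcessFunction(
                new ProcessFunction<Integer, Integer>() {
                    @Override
                    public void processElement(Integer value, Context ctx,
                                               Collector<Integer> out) {
                        out.collect(value);
                    }
                });

        harness.processElement(1, 10);  // (value, timestamp)

        // Assert on the emitted records.
        assertEquals(Arrays.asList(1), harness.extractOutputValues());
    }
}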

Deployments:
Stream Store and analytics automates deployments of applications written using Flink.  This includes options for 
o Authentication and authorization
o Stream store metrics
o High availability
o State and fault tolerance via state backends
o Configuration, including memory configuration, and 
o All of the production readiness checklist, which includes (see the sketch below):
Setting an explicit max parallelism
Setting UUIDs for all operators
Choosing the right state backend
Configuring high availability for job managers
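Two of these checklist items are set directly in application code. A minimal sketch follows; the fromElements source, the parallelism value, and the UID strings are illustrative stand-ins.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Explicit max parallelism bounds how far the job can ever be rescaled.
env.setMaxParallelism(128);

DataStream<String> events = env
        .fromElements("a", "b", "c")   // stand-in source for illustration
        .uid("demo-source");           // stable UID so state maps across job-graph changes

events
        .map(String::toUpperCase)
        .uid("uppercase-mapper");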


Wednesday, May 27, 2020

Some more Flink CEP + ML examples

Continuing the Flink CEP and ML theme from the previous post, the following example explores the event sequences with ensemble feature selection, feature scoring, and feature ranking.
// Sketch assuming the Java-ML library (net.sf.javaml); StreamHandler and
// StreamHelper are the author's helpers for materializing a stream to a file.
import java.io.File;

import net.sf.javaml.core.Dataset;
import net.sf.javaml.featureselection.ensemble.LinearRankingEnsemble;
import net.sf.javaml.featureselection.ranking.RecursiveFeatureEliminationSVM;
import net.sf.javaml.featureselection.scoring.GainRatio;
import net.sf.javaml.tools.data.FileHandler;

public class FeatureSelectionSample {

    public static void main(String[] args) throws Exception {

        // Materialize the "sequences" stream to a file and load it as a dataset
        // with the class label in column 4 and comma-separated attributes.
        File file = StreamHandler.writeToFile(StreamHelper.resolve("sequences"));
        Dataset data = FileHandler.loadDataset(file, 4, ",");

        // Ensemble feature selection: ten SVM-RFE rankers, each eliminating
        // 20% of the attributes per iteration, combined by linear ranking.
        RecursiveFeatureEliminationSVM[] svmrfes = new RecursiveFeatureEliminationSVM[10];
        for (int i = 0; i < svmrfes.length; i++)
            svmrfes[i] = new RecursiveFeatureEliminationSVM(0.2);
        LinearRankingEnsemble ensemble = new LinearRankingEnsemble(svmrfes);
        ensemble.build(data);
        for (int i = 0; i < ensemble.noAttributes(); i++)
            System.out.println(ensemble.rank(i));

        // Feature scoring with gain ratio.
        GainRatio ga = new GainRatio();
        ga.build(data);
        for (int i = 0; i < ga.noAttributes(); i++)
            System.out.println(ga.score(i));

        // Feature ranking with a single SVM-RFE ranker.
        RecursiveFeatureEliminationSVM svmrfe = new RecursiveFeatureEliminationSVM(0.2);
        svmrfe.build(data);
        for (int i = 0; i < svmrfe.noAttributes(); i++)
            System.out.println(svmrfe.rank(i));
    }
}




Tuesday, May 26, 2020

Flink CEP and ML:
FlinkCEP is the Complex Event Processing (CEP) library implemented on top of Flink. It allows you to detect event patterns in an endless stream of events. Machine Learning helps with utilizing patterns discovered in events to make predictions. Together ML and FlinkCEP can improve analytics and predictions.
This article describes some of those possibilities. We begin with the CEP Pattern API, which allows us to specify the patterns that we want to detect in the stream. The matching event sequences can then be used to make predictions.
A pattern is a sequence described by invariants for its beginning, middle, and end. For example,
DataStream<Event> input = ...;
Pattern<Event, ?> pattern = Pattern.<Event>begin("start").where(predicate)
    .next("middle").subType(SubEvent.class).where(predicate)
    .followedBy("end").where(predicate);
PatternStream<Event> patternStream = CEP.pattern(input, pattern);
DataStream<Alert> result = patternStream.process(patternProcessFunction);
// A stage such as "start" can also be made to match repetitions greedily:
start.oneOrMore().greedy();
The patternProcessFunction is where we plug in ML-based processing such as classifying, grouping, ranking, or sorting; a sketch appears below. The syntax involving one-or-more occurrences, matched greedily, is how repetition is expressed in pattern detection.
Predicates are specified with conditions under which an event is either accepted or rejected. This is inlined in the predicate for each filter. If we want an evaluation over all events seen so far for a pattern, those events are retrieved with the call ctx.getEventsForPattern("patternName"), which may be quite expensive given the volume of data in stream processing. 
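A minimal sketch of such a patternProcessFunction, assuming the Event and Alert types from the snippet above; the classify helper and the Alert(String) constructor are hypothetical stand-ins for the ML step.

import java.util.List;
import java.util.Map;

import org.apache.flink.cep.functions.PatternProcessFunction;
import org.apache.flink.util.Collector;

public class AlertingProcessFunction extends PatternProcessFunction<Event, Alert> {

    @Override
    public void processMatch(Map<String, List<Event>> match, Context ctx,
                             Collector<Alert> out) throws Exception {
        // Each pattern name ("start", "middle", "end") maps to the events it matched.
        Event first = match.get("start").get(0);
        Event last = match.get("end").get(0);
        out.collect(new Alert(classify(first, last)));  // Alert(String) is assumed
    }

    // Hypothetical ML hook; a real implementation might invoke a trained model.
    private String classify(Event first, Event last) {
        return "anomaly";
    }
}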
Complicated or conjunction-based pattern predicates are covered in the CEP Pattern API documentation. Instead, we look at some of the ML routines.
import recommendations
recommendations.sim_distance(recommendations.patterns, stream1, stream2)
This applies to StreamCuts just as much as it applies to Streams.  
Also, the ability to find similarities between streams or StreamCuts allows us to use clustering to form groups.
// Sketch assuming the Java-ML library (net.sf.javaml); StreamHandler and
// StreamHelper are the author's helpers for materializing a stream to a file.
import java.io.File;

import net.sf.javaml.clustering.Clusterer;
import net.sf.javaml.clustering.KMeans;
import net.sf.javaml.core.Dataset;
import net.sf.javaml.tools.data.FileHandler;

public class KMeansSample {

    public static void main(String[] args) throws Exception {

        // Load the "sequences" stream as a dataset: class label in column 4,
        // comma-separated attributes.
        File file = StreamHandler.writeToFile(StreamHelper.resolve("sequences"));
        Dataset data = FileHandler.loadDataset(file, 4, ",");

        // Cluster with default k-means settings and report the group count.
        Clusterer km = new KMeans();
        Dataset[] clusters = km.cluster(data);
        System.out.println("Cluster count: " + clusters.length);
    }
}