Monday, December 2, 2019

Patenting Introspection:
A software product may have many innovations. Patents protect innovations from infringement and give the inventor or software maker a basis to assert ownership of the novelty so that someone else cannot claim it as their own. Patents can also help protect the maker from losing its competitive edge to someone else copying the idea. Patents are therefore desirable and can be used to secure mechanisms, processes and any novelty that was introduced into the product.
Introspection is the way in which the software maker uses the features that were developed for the consumers of the product for itself, so that it can expand the capabilities to provide even more assistance and usability to the user. In some sense, this is automation of workflows combined with using the product as a pseudo end-user. This automation is also called ‘dogfooding’ because it refers specifically to the maker utilizing its own product. The idea of putting oneself in the customer's shoes to improve automation is not new in itself. When the product has many layers internally, a component in one layer may reach a higher layer that is visible to another standalone component in the same layer, so that interaction may occur between otherwise isolated components. This is typical for layered communication. However, the term ‘dogfooding’ is generally applied to the use of features available at the boundary of the product shared with external customers.
Consider a storage product which customers use to store their data. If the software maker of the storage product decided to use the same product to store data in isolated containers reserved for internal use, this becomes a good example of trying out the product just like the customers would. This dogfooding is specifically for the software maker to store internal operational data from the product, which may come in useful for troubleshooting and production support later. Since the data is stored locally in the instance deployed by the customer and remains internal, it is convenient for the maker to gather history that holds meaning only for that deployment.
The introspection automation is certainly not restricted to the maker; a customer may choose to do so as well. However, an out-of-box automation for this purpose can be done once by the maker for the benefit of each and every customer. This removes some burden from the customer while allowing them to focus on storing data that is relevant to their business rather than on the operations of the product. A competitor to the product or a third-party business may choose to redo the same introspection with similar automation in order to gain a slice of the revenue. If a patent had been issued for the introspection, it would certainly benefit the product maker in this case.
A patent can only be granted when certain conditions are met. We will go over these conditions for their applicability to the purpose of introspection.
First, the invention must show an element of novelty which is not already available in the field. This body of existing knowledge is called “prior art”. If the product has not yet been released, introspection becomes part of v1 and closes the gap in which a competitor could grab a patent between releases of the product by the maker.
Second, the invention must involve an “inventive” or “non-obvious” step that a person having ordinary skill in the relevant technical field could not readily arrive at. This is sufficiently met by introspection because the way internal operational data is stored and read is best known to the maker, since all the components of the product are visible to the maker and best known to its staff.
Third, the invention must be capable of industrial application such that it is not merely a theoretical phenomenon but one that is useful in practice. Fortunately, all product support personnel will vouch for the utility of records that can assist with troubleshooting and support of the product in mission critical deployments.
Finally, there is an ambiguous criterion that the innovation must be “patentable” under the law of a specific country. In many countries, certain theories, creative works, models or discoveries are generally not patentable. Fortunately, when it comes to software development, history and tradition serve well just as in any other industry.
Thus, all the conditions of the application for patent protection of innovation can be met. Further, when the invention is disclosed in a manner that is sufficiently clear and complete to enable it to be replicated by a person with an ordinary level of skill in the relevant technical field, it improves the acceptance and popularity of the product.

Sunday, December 1, 2019

We were discussing Flink APIs.
The source for events may generate and maintain state for the events generated. This generator can even restore state from these snapshots. Restoring a state simply means using the last known timestamp from the snapshot; all events subsequent to that timestamp will then be processed. Each event has a timestamp that can be extracted with Flink’s AscendingTimestampExtractor, and the snapshot is merely a timestamp. This allows all events to be processed from the last snapshot.
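A minimal sketch of such a generator, assuming a hypothetical Event type with a getTimestamp() method and a nextEventAfter() helper that is left unimplemented; the snapshot is just the last emitted timestamp and a restore resumes from it:
import java.util.Collections;
import java.util.List;
import org.apache.flink.streaming.api.checkpoint.ListCheckpointed;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext;

public class EventGenerator extends RichSourceFunction<Event>
        implements ListCheckpointed<Long> {

    private volatile boolean running = true;
    private long lastEmittedTimestamp = Long.MIN_VALUE;   // the only state we snapshot

    @Override
    public void run(SourceContext<Event> ctx) throws Exception {
        while (running) {
            Event event = nextEventAfter(lastEmittedTimestamp);   // hypothetical helper
            synchronized (ctx.getCheckpointLock()) {
                ctx.collectWithTimestamp(event, event.getTimestamp());
                lastEmittedTimestamp = event.getTimestamp();
            }
        }
    }

    @Override
    public List<Long> snapshotState(long checkpointId, long checkpointTimestamp) {
        // the snapshot is merely the last known timestamp
        return Collections.singletonList(lastEmittedTimestamp);
    }

    @Override
    public void restoreState(List<Long> state) {
        // restoring means resuming from the last known timestamp
        lastEmittedTimestamp = state.isEmpty() ? Long.MIN_VALUE : state.get(0);
    }

    @Override
    public void cancel() {
        running = false;
    }

    private Event nextEventAfter(long timestamp) {
        // hypothetical generation logic, left to the concrete generator
        throw new UnsupportedOperationException();
    }
}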
A source may implement additional behavior such as restricting the maximum number of events per second. The rate is sampled periodically so that when the number of events allowed in a period has been exceeded, the producer sleeps for the time remaining between the current time and the expiry of the wait period.
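This throttling can be sketched as the body of such a source’s run() method, assuming ctx is the SourceContext, nextEvent() is a hypothetical producer and maxEventsPerSecond is a configured limit:
long periodStart = System.currentTimeMillis();
int emittedInPeriod = 0;

while (running) {
    if (emittedInPeriod >= maxEventsPerSecond) {
        long elapsed = System.currentTimeMillis() - periodStart;
        if (elapsed < 1000L) {
            Thread.sleep(1000L - elapsed);   // wait out the remainder of the period
        }
        periodStart = System.currentTimeMillis();
        emittedInPeriod = 0;
    }
    ctx.collect(nextEvent());                // hypothetical event producer
    emittedInPeriod++;
}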
It should be noted that writing streams via connectors is facilitated by the store. However, this is not the only convention for sending data to a store. For example, we have well-known protocols like S3 which are widely recognized and equally applicable to stream stores just as much as they are applied to object stores.
By the same argument, data transfer can also occur over any proprietary REST-based APIs and not just the industry-standard S3 APIs. Simple HTTP requests to post data to the store are another way to allow applications to send data. This is also the method used by popular technology stacks such as InfluxDB-Telegraf-Chronograf to collect and transmit metrics data. Whether there are dedicated agents involved in relaying the data or the store itself accumulates the data directly over the wire, these are options to widen the audience for the store.
Making it easy for an audience that doesn’t have to write code to send data is beneficial not only to the store but also to the folks who support and maintain production-level data stores, because it gives them an easy way to do a dry run rather than go through development cycles. The popularity of the store is also increased by the customer base.
Technically, it is not appropriate to encapsulate a Flink connector within the HTTP request handler for data ingestion at the store. This API is far more generic than the upstream software used to send the data because the consumer of this REST API could be the user interface, a language-specific SDK, or shell scripts that want to make curl requests. It is better for the REST API implementation to directly accept the raw message along with the destination and authorization.
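As an illustration of that shape of API, here is a sketch written with JAX-RS style annotations; the path, the authorize() check and the store.append() call are assumptions for the sketch, not the actual Pravega interface:
@POST
@Path("/scopes/{scope}/streams/{stream}/events")
public Response ingest(@PathParam("scope") String scope,
                       @PathParam("stream") String stream,
                       @HeaderParam("Authorization") String authorization,
                       byte[] rawEvent) {
    // reject the request if the caller may not write to this destination
    if (!authorize(authorization, scope, stream)) {        // hypothetical check
        return Response.status(Response.Status.UNAUTHORIZED).build();
    }
    // append the raw bytes to the destination stream without any Flink involvement
    store.append(scope, stream, rawEvent);                  // hypothetical store client
    return Response.accepted().build();
}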
Implementation upcoming in the Pravega fork at https://github.com/ravibeta/pravega

Saturday, November 30, 2019

We were discussing Flink APIs and connectors:

A typical query looks like this:

env.readTextFile(ratingsCsvPath)
            .flatMap(new ExtractRating())
            .groupBy(0)
            .sum(1)
            .print();



There are connectors for different data sources which can be passed in via addSource



In order to control parallelism, which governs the tasks, replicas and runtime, we can set it globally with env.setParallelism(dop);



Otherwise we can set it at the DataSet API level.



There is even a similar method to set the maximum parallelism. This would prevent the parallelism from going beyond that limit.
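A minimal sketch of these settings, reusing the ExtractRating function from the example above (the parallelism values are arbitrary):
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);                    // global default for the job

env.readTextFile(ratingsCsvPath)
            .flatMap(new ExtractRating())
            .setParallelism(8)            // override at the operator level
            .groupBy(0)
            .sum(1)
            .print();
// in the DataStream API, env.setMaxParallelism(128) additionally caps how far parallelism may be raised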



The newer DataStream API allows methods such as assignTimestampsAndWatermarks because it supports watermarks and allows ordering of the events in the stream. A sink can be added to write the events to the store.
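A minimal sketch of that flow, assuming an Event type with a monotonically increasing timestamp and a pre-built sink function:
DataStream<Event> events = env.addSource(new EventGenerator());

events.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Event>() {
            @Override
            public long extractAscendingTimestamp(Event event) {
                return event.getTimestamp();
            }
        })
        .addSink(sink);   // e.g. a writer configured for the target stream store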



The Flink connector library for Pravega adds a data source and sink for use with the Flink DataStream API.



The Flink Pravega Writer and Flink Pravega Reader work with generic events as long as the events can be serialized and deserialized respectively.



The FlinkPravegaWriter extends RichSinkFunction and implements ListCheckpointed and CheckpointListener interfaces.



The FlinkPravegaReader extends RichParallelSourceFunction and implements the ResultTypeQueryable, StoppableFunction and ExternallyInducedSource interfaces.



Notice that the extension of RichParallelSourceFunction from Flink is meant for high scalability with reader groups. This is required in use cases such as reading from a socket where the data may arrive at a fast rate.

Flink has a construct for WindowJoin where it can relate two entities within a window. The two sides to a join may have different rates of arrival, so the output may not always be the same, but the output of the join will follow a format.
The word count is inherently in summation form. This means it can be parallelized by splitting the data. If the data arrives on a socket, it just needs a map and a reduce. For example, the SocketWindowWordCount example in the Flink source code repository makes use of a flatMap function that translates a string to a dictionary of word and frequency. The reduce function in this case merely combines the counts from the flatMap for the same word, as in the sketch below.
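A minimal sketch in the spirit of that example, assuming text arrives on a local socket:
DataStream<Tuple2<String, Integer>> windowCounts = env
        .socketTextStream("localhost", 9000, "\n")
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                // translate a line into (word, 1) pairs
                for (String word : line.split("\\W+")) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        })
        .keyBy(0)
        .timeWindow(Time.seconds(5))
        .reduce((a, b) -> new Tuple2<>(a.f0, a.f1 + b.f1));   // combine counts for the same word

windowCounts.print();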
We can also create a trigger since the examples are data driven. This trigger calculates a delta, or difference, between the datapoint that fired the trigger last and the current datapoint. If the delta is higher than a threshold, the trigger fires. A delta function and a threshold are required for the trigger to work.
The delta function comes in useful for cumulative values because the path of change does not matter, only the initial and final values. For example, a cumulative value is always monotonic for positive increments. It can climb at different rates but the final value will be at the same or a higher level than the initial value. We can use the delta function in this case to suitably determine a threshold. The word counts are all positive values. Therefore a trigger is suitable for a parallelized Flink word count example when we are interested in queries where thresholds are exceeded first, as sketched below.
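A minimal sketch of such a trigger, assuming counts is a DataStream of (word, count) tuples like the one produced above and env is the stream environment:
double threshold = 100.0;   // fire once the count has grown by more than this amount

counts.keyBy(0)
      .window(GlobalWindows.create())
      .trigger(DeltaTrigger.of(
              threshold,
              new DeltaFunction<Tuple2<String, Integer>>() {
                  @Override
                  public double getDelta(Tuple2<String, Integer> oldPoint,
                                         Tuple2<String, Integer> newPoint) {
                      // counts only grow, so this delta is never negative
                      return newPoint.f1 - oldPoint.f1;
                  }
              },
              counts.getType().createSerializer(env.getConfig())))
      .max(1)
      .print();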

Friday, November 29, 2019

A methodology to use Flink APIs for text summarization:
Text summarization has been treated as a multi-stage processing problem, starting with word vectorization, followed by SoftMax classification and keyword extraction, and ending with salient sentences projected as the summary. These stages may be numerous and each stage may involve multiple operations at once, such as when text classification and bounding box regression are run simultaneously. Most of the stages are also batch oriented, whereas the Flink APIs are known for their suitability to streaming operations. Yet the data that appears in a text, regardless of its length, is inherently a stream. If the Flink APIs can count words in a stream of text, then they can be applied with more sophisticated operators to the same stream of text for text summarization.
With this motivation, this article looks at the application of Flink APIs towards text summarization, starting with a light-weight peripheral application and becoming more intrinsic to the text summarization solution.
Flink APIs give the comfort of choosing from a wide variety of transforming operators, including map- and reduce-like operators, along with first-class standard query operators over the Table API. In addition, they support data representation in tabular as well as graph forms by providing abstractions for both. Words and their neighbors can easily be represented as vertices and weighted edges of a graph so long as the weights are found and applied to the edges. The finding of weights, such as with SoftMax classification from a neural net, is well-known and outside the scope of this article. With the graph established for a given text, Flink APIs can be used to work with the graph to determine the vertices based on centrality and then extract sentences according to those keywords and their positional relevance. We focus on this latter part.
For example, we can use the following technique to pick out the top-weighted candidates, between which a shortest path can then be computed:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.fromCollection(tuples) // tuples is a List<Tuple2<Edge, Integer>> of weighted edges
            .flatMap(new ExtractWeights())
            .keyBy(0)
            .timeWindow(Time.seconds(30))
            .sum(1)
            .filter(new FilterWeights())
            .timeWindowAll(Time.seconds(30))
            .apply(new GetTopWeights())
            .print();
We can even use single-source shortest path for a classification. It might involve multiple iterations with at least one pass for each vertex as source:
SingleSourceShortestPaths<Integer, NullValue> singleSourceShortestPaths = new SingleSourceShortestPaths<>(sourceVertex, maxIterations);
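A minimal sketch of running it with Gelly, assuming a weighted word graph with Double edge weights has already been assembled in the batch ExecutionEnvironment:
Graph<Integer, Double, Double> wordGraph = Graph.fromDataSet(vertices, weightedEdges, env);

// one pass from a chosen source vertex; repeat with other sources as needed
DataSet<Vertex<Integer, Double>> distances =
        wordGraph.run(new SingleSourceShortestPaths<>(sourceVertexId, maxIterations));

distances.print();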
Advanced techniques may utilize streaming graph operators that are not restricted by the size of the graph and are first-class operators over graph.

Thursday, November 28, 2019

We were discussing Flink APIs and connectors:
A typical query looks like this:

env.readTextFile(ratingsCsvPath)
            .flatMap(new ExtractRating())
            .groupBy(0)
            .sum(1)
            .print();

There are connectors for different data sources which can be passed in via addSource

In order to control parallelism, which governs the tasks, replicas and runtime, we can set it globally with env.setParallelism(dop);

Otherwise we can set it at the DataSet API level.

There is even a similar method to set the maximum parallelism. This would prevent the parallelism from going beyond that limit.

The newer DataStream API allows methods such as assignTimestampsAndWatermarks because it supports watermarks and allows ordering of the events in the stream. A sink can be added to write the events to the store.

The Flink connector library for Pravega adds a data source and sink for use with the Flink DataStream API.

The Flink Pravega Writer and Flink Pravega Reader work with generic events as long as the events can be serialized and deserialized respectively.

The FlinkPravegaWriter extends RichSinkFunction and implements ListCheckpointed and CheckpointListener interfaces.

The FlinkPravegaReader extends RichParallelSourceFunction and implements the ResultTypeQueryable, StoppableFunction and ExternallyInducedSource interfaces.

Notice that the extension of RichParallelSourceFunction from Flink is meant for high scalability with reader groups. This is required in use cases such as reading from a socket where the data may arrive at a fast rate.

Wednesday, November 27, 2019

The solutions that are built on top of SQL queries are not necessarily looking to write SQL. They want programmability and are just as content to write code using the Table API as they are with SQL queries. Consequently, business intelligence and CRM solutions are writing their code with standard query operators over their data. These well-defined operators are applicable to most collections of data and the notion is shared with other programming languages such as .Net. Consistency also places hard requirements on the choice of the processing. Strict consistency has traditionally been associated with relational databases. Eventual consistency has been made possible in distributed storage with the help of messaging algorithms such as Paxos. Big Data is usually associated with eventual consistency.
Partitioning is probably the reason why tools like Spark and Flink become popular for Big Data:
Apache Flink provides a SQL abstraction over its Table API. It also provides syntax to make use of partitioning and increase parallelism.
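A minimal sketch of the SQL abstraction, assuming a ratings table has already been registered with the table environment and env is the stream environment:
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

Table countsPerMovie = tableEnv.sqlQuery(
        "SELECT movieId, COUNT(*) AS cnt FROM ratings GROUP BY movieId");

// convert the continuously updated result back to a stream for printing
tableEnv.toRetractStream(countsPerMovie, Row.class).print();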
Apache Flink provides the ability to read from various data sources, such as env.readTextFile, env.readCsvFile and env.fromCollection, while allowing reads and writes from streams via connectors.
A typical query looks like this:
env.readTextFile(ratingsCsvPath)
.flatMap(new ExtractRating())
            .groupBy(0)
            .sum(1)
            .print();
There are connectors for different data sources which can be passed in via addSource
In order to control parallelism, which governs the tasks, replicas and runtime, we can set it globally with env.setParallelism(dop);
Otherwise we can set it at the DataSet API level.
There is even a similar method to set the maximum parallelism. This would prevent the parallelism from going beyond that limit.
The newer DataStream API allows methods such as assignTimestampsAndWatermarks because it supports watermarks and allows ordering of the events in the stream. A sink can be added to write the events to the store.
The Flink connector library for Pravega adds a data source and sink for use with the Flink DataStream API.
The Flink Pravega Writer and Flink Pravega Reader work with generic events as long as the events can be serialized and deserialized respectively.
The FlinkPravegaWriter extends RichSinkFunction and implements ListCheckpointed and CheckpointListener interfaces.
The FlinkPravegaReader extends RichParallelSourceFunction and implements the ResultTypeQueryable, StoppableFunction and ExternallyInducedSource interfaces.
Notice that the extension of RichParallelSourceFunction from Flink is meant for high scalability with reader groups. This is required in use cases such as reading from a socket where the data may arrive at a fast rate.

Tuesday, November 26, 2019

Apache Flink discussion continued:
Apache Spark query code and Apache Flink query code look very similar; the former treats stream processing as a special case of batch processing while the latter does just the reverse.
Also Apache Flink provides a SQL abstraction over its Table API
Apache Flink provides the ability to read from various data sources, such as env.readTextFile, env.readCsvFile and env.fromCollection, while allowing reads and writes from streams via connectors.
Some more Flink Queries with examples below:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.fromCollection(tuples) // tuples is a List<Tuple2<String, Integer>>
            .flatMap(new ExtractHashTags())
            .keyBy(0)
            .timeWindow(Time.seconds(30))
            .sum(1)
            .filter(new FilterHashTags())
            .timeWindowAll(Time.seconds(30))
            .apply(new GetTopHashTag())
            .print();
The GetTopHashTag function merely picks out the hash tag with the highest count.
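The original example does not show its implementation; a sketch of what such a function might look like is an AllWindowFunction that scans the window and keeps the pair with the highest count:
public class GetTopHashTag
        implements AllWindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, TimeWindow> {

    @Override
    public void apply(TimeWindow window,
                      Iterable<Tuple2<String, Integer>> hashTagCounts,
                      Collector<Tuple2<String, Integer>> out) {
        Tuple2<String, Integer> top = new Tuple2<>("", 0);
        for (Tuple2<String, Integer> hashTagCount : hashTagCounts) {
            if (hashTagCount.f1 > top.f1) {
                top = hashTagCount;
            }
        }
        out.collect(top);   // the hash tag with the highest count in this window
    }
}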
Notice that the methods applied to the DataStream follow one after the other. This is a preferred convention for Flink.
Since we use fromCollection, it does not have parallelism.
We can improve parallelism with fromParallelCollection
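For example, a minimal sketch using the SplittableIterator utilities that ship with Flink:
// fromParallelCollection takes a SplittableIterator, so generation of the elements
// can be split across the parallel source subtasks
DataStream<Long> numbers = env.fromParallelCollection(
        new NumberSequenceIterator(1L, 1_000_000L), Long.class);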
If we want to find a distribution, we can do something like :
env.readTextFile(ratingsCsvPath)
.flatMap(new ExtractRating())
            .groupBy(0)
            .sum(1)
            .print();
Here is the quintessential wordcount:
lines.flatMap((line, out) -> {
            String[] words = line.split("\\W+");
            for (String word : words) {
                out.collect(new Tuple2<>(word, 1));
            }
        })
        .returns(new TupleTypeInfo(TypeInformation.of(String.class), TypeInformation.of(Integer.class)))
        .groupBy(0)
        .sum(1)
        .print();

Flink also supports graphs, vertices and edges with:
import org.apache.flink.graph.Edge;
import org.apache.flink.graph.Graph;
import org.apache.flink.graph.Vertex;
Graph<Integer, NullValue, NullValue> followersGraph = Graph.fromDataSet(socialMediaEdgeSet, env);
We may need weighted graphs to run single source shortest path:
SingleSourceShortestPaths<Integer, NullValue> singleSourceShortestPaths = new SingleSourceShortestPaths<>(sourceVertex, maxIterations);
     
There are connectors for different data sources which can be passed in via addSource