Friday, November 29, 2019

A methodology to use Flink APIs for text summarization:
Text summarization has been treated as a multi-stage problem: it starts with word vectorization, continues with SoftMax classification and keyword extraction, and ends with the salient sentences that are projected as the summary. The stages may be numerous, and a stage may run several operations at once, such as when text classification and bounding box regression are run simultaneously. Most of the stages are also batch oriented, whereas the Flink APIs are known for their suitability to streaming operations. Yet the data that appears in a text, regardless of its length, is inherently a stream. If the Flink APIs can count words in a stream of text, then they can be applied with more sophisticated operators to the same stream for text summarization.
With this motivation, this article explores how Flink APIs can be applied to text summarization, starting with lightweight peripheral uses and moving toward a more intrinsic role in the summarization solution.
Flink APIs offer a wide variety of transforming operators, including map-like and reduce-like operators, along with first-class standard query operators over the Table API. In addition, they support data representation in tabular as well as graph forms by providing abstractions for both. Words and their neighbors can easily be represented as the vertices and weighted edges of a graph, so long as the weights are found and applied to the edges. Finding the weights, for example with SoftMax classification from a neural net, is well known and outside the scope of this article. Once the graph is established for a given text, Flink APIs can be used to rank the vertices by centrality and then to extract sentences according to those keywords and their positional relevance. We focus on this latter part.
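The idea of ranking vertices by centrality and then pulling out sentences can be sketched in plain Java, without Flink, to make the steps concrete. This is only an illustration under stated assumptions: the class KeywordGraphSketch and its methods are hypothetical names, adjacent-word co-occurrence counts stand in for the learned edge weights, and weighted degree stands in for whatever centrality measure is actually chosen. A real pipeline would also drop stop words before ranking.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Flink: ranks words by weighted degree
// in a co-occurrence graph and picks the sentence covering the most keywords.
class KeywordGraphSketch {

    // Lowercases a sentence and splits it into alphanumeric tokens.
    static String[] tokenize(String sentence) {
        String cleaned = sentence.toLowerCase().replaceAll("[^a-z0-9 ]", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    // Adjacent words share an undirected edge; the weight is the number of
    // times the pair appears side by side (an assumed stand-in for learned weights).
    static Map<String, Map<String, Integer>> buildGraph(List<String> sentences) {
        Map<String, Map<String, Integer>> graph = new HashMap<>();
        for (String sentence : sentences) {
            String[] words = tokenize(sentence);
            for (int i = 0; i + 1 < words.length; i++) {
                addEdge(graph, words[i], words[i + 1]);
                addEdge(graph, words[i + 1], words[i]);
            }
        }
        return graph;
    }

    private static void addEdge(Map<String, Map<String, Integer>> graph, String from, String to) {
        graph.computeIfAbsent(from, k -> new HashMap<>()).merge(to, 1, Integer::sum);
    }

    // Weighted degree centrality: the sum of the weights of a vertex's edges.
    static Map<String, Integer> weightedDegree(Map<String, Map<String, Integer>> graph) {
        Map<String, Integer> centrality = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> entry : graph.entrySet()) {
            int sum = 0;
            for (int weight : entry.getValue().values()) {
                sum += weight;
            }
            centrality.put(entry.getKey(), sum);
        }
        return centrality;
    }

    // Returns up to k words, highest centrality first; ties break alphabetically.
    static List<String> topKeywords(Map<String, Integer> centrality, int k) {
        List<String> words = new ArrayList<>(centrality.keySet());
        words.sort((a, b) -> {
            int byScore = Integer.compare(centrality.get(b), centrality.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(words.subList(0, Math.min(k, words.size())));
    }

    // Returns the first sentence that contains the largest number of keywords.
    static String bestSentence(List<String> sentences, List<String> keywords) {
        String best = null;
        int bestHits = -1;
        for (String sentence : sentences) {
            Set<String> present = new HashSet<>(Arrays.asList(tokenize(sentence)));
            int hits = 0;
            for (String keyword : keywords) {
                if (present.contains(keyword)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = sentence;
            }
        }
        return best;
    }
}
```

Calling buildGraph on a list of sentences, then weightedDegree, topKeywords and bestSentence in that order, yields one candidate summary sentence. The same three steps map naturally onto a graph abstraction plus windowed aggregation when moved to Flink.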
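For example, the following pipeline sums the edge weights per key over a 30-second window, filters them, and keeps the top weights, which is a building block for finding the shortest path between two candidates: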
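import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

// tuples is a List<Tuple2<Edges, Integer>> of edges paired with their weights.
// ExtractWeights, FilterWeights and GetTopWeights are user-defined functions not shown here.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.fromCollection(tuples)
            .flatMap(new ExtractWeights())
            .keyBy(0)                        // group by tuple field 0
            .timeWindow(Time.seconds(30))
            .sum(1)                          // sum tuple field 1 per key and window
            .filter(new FilterWeights())
            .timeWindowAll(Time.seconds(30))
            .apply(new GetTopWeights())
            .print();
env.execute();                               // a streaming job does not run until execute() is called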
We can even use the single-source shortest paths algorithm from Flink's Gelly graph library for classification. It might involve multiple iterations, with at least one pass for each vertex as the source:
SingleSourceShortestPaths<Integer, NullValue> singleSourceShortestPaths = new SingleSourceShortestPaths<>(sourceVertex, maxIterations);
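To show what the sourceVertex and maxIterations arguments control, here is a minimal plain-Java sketch of single-source shortest paths by repeated edge relaxation. It is not the library's implementation: the class ShortestPathSketch and its method are hypothetical, edges are assumed to be directed with non-negative weights, and distances are updated in place, so a single pass can propagate further than one hop.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Flink: bounded iterative relaxation
// from one source vertex over a directed, weighted edge list.
class ShortestPathSketch {

    // edges[i] = {from, to} and weights[i] is the weight of that edge.
    // Returns the best known distance from source to every vertex after at
    // most maxIterations passes; unreachable vertices stay at infinity.
    static double[] shortestPaths(int vertexCount, int[][] edges, double[] weights,
                                  int source, int maxIterations) {
        double[] dist = new double[vertexCount];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        dist[source] = 0.0;

        for (int iteration = 0; iteration < maxIterations; iteration++) {
            boolean changed = false;
            for (int i = 0; i < edges.length; i++) {
                int from = edges[i][0];
                int to = edges[i][1];
                double candidate = dist[from] + weights[i];
                if (candidate < dist[to]) {
                    dist[to] = candidate;   // found a shorter route to 'to'
                    changed = true;
                }
            }
            if (!changed) {
                break;                      // converged before the iteration limit
            }
        }
        return dist;
    }
}
```

Running this once per vertex as the source gives the all-pairs view that the classification use mentioned above would need, at the cost of one full run per source.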
Advanced techniques may use streaming graph operators, which are not restricted by the size of the graph and act as first-class operators over graphs.
