Cluster computing

Thursday, March 12, 2020

1) The Flink programming model helps a lot with writing queries for streams. Several examples of this are available on their documentation page and as a sample here. The ability to combine a global index for the stream store with their programming model boosts the analytics that can be performed on the stream. An example to create this index and use it with storage is shown here.
2) The utility of the index is in its ability to lookup based on keywords. The query engine for using the index exposes additional semantics and syntax to the analytics user over the Flink queries or to be used with Flink queries. Then the logic to use the queries and the Flink can be packaged in a maven published jar. Credentials to access the streams can be injected into the jar with the help of a resolver which utilizes the application context.
3) Some of the streams may be generated as part of a running map or flatMap operations on an existing stream and they might come useful later. Unlike the stream for index, there could be a stream for transformation of events. Such transformation will happen once and persist in a new stream. Indexes are rebuilt. Transformation is one time. Indexes are used for a while. Transformations are temporary and persisted only when used with several queries. Indexes might even be stored better as files since they are rewritten. Transformed streams will be append only. The Extract-Transform-Load operation to generate this stream could be packaged in a maven artifact that is easy to write in Flink. If the indexing automatically includes all streams in a project, then this transformed stream would become automatically available to the user. If there is a way for user to blacklist or whitelist the streams for inclusion in the index, it will give more power to the user and prevent unnecessary indexing. All project members can have the privilege to add a stream to the indexing. If the stream is earmarked to be indexed, indexing may even be kicked off by the project member or require the administrator to do so.
4) Overall, there is a comparision made between indexing across stream and transforming one or more streams into another stream. The Flink programmability model works well with transformations. Utilization of index-based querying adds more power to this analytics. Finally, the data for the transformed streams and indexes can be stored with a tier 2 that brings storage engineering best practice and allowing the businesses to focus more on the querying, transformations and indexing.

Cluster computing

Thursday, March 12, 2020

No comments:

Post a Comment