A stream is usually not a single sequence of bytes but a coordination of multiple parallel segments. Each segment is a sequence of bytes and contains nothing but data. Metadata lives in its own stream and is usually internal.
Flink watermarks can be viewed in its web interface or queried via the metrics system. The value is computed in the standard way: by taking the minimum of all watermarks received from the upstream operators.
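The minimum rule can be sketched in a few lines. This is a plain-Python illustration of the idea, not Flink's actual implementation, and the per-channel values are made up:

```python
# Sketch of the standard watermark rule: an operator's current watermark
# is the minimum of the latest watermark received on each input channel.
# Illustrative only; not Flink code.

def operator_watermark(upstream_watermarks):
    """Combine per-channel watermarks into the operator's watermark."""
    # Until every channel has reported a watermark, the result stays undefined.
    if not upstream_watermarks or any(w is None for w in upstream_watermarks):
        return None
    return min(upstream_watermarks)

# Two upstream channels at event-times 1000 and 950: the slower one wins.
print(operator_watermark([1000, 950]))  # -> 950
```

The minimum is what makes the watermark a safe lower bound: no upstream operator can still emit an event older than it.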
Every event read from a stream also carries the position from which it was read. This enables a reader to retry reading that event or to use it as the start offset when resuming after a pause. This is useful for readers that export data from a stream store to heterogeneous systems that may not support watermarks.
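The retry-and-resume pattern can be sketched as follows. The in-memory stream and integer offsets here are hypothetical stand-ins for a real reader API:

```python
# Sketch of position-aware reads: each read returns (event, position), and a
# reader can resume from the last recorded position after a pause or failure.
# The in-memory "stream" and its integer offsets are hypothetical.

class StreamReader:
    def __init__(self, events, start=0):
        self._events = events
        self._pos = start

    def read(self):
        """Return (event, position it was read from), or None at end of stream."""
        if self._pos >= len(self._events):
            return None
        event, pos = self._events[self._pos], self._pos
        self._pos += 1
        return event, pos

events = ["a", "b", "c", "d"]
reader = StreamReader(events)
_, last_pos = reader.read()   # read "a" at position 0
_, last_pos = reader.read()   # read "b" at position 1

# Resume with a new reader just past the last successfully exported event.
resumed = StreamReader(events, start=last_pos + 1)
print(resumed.read())  # -> ('c', 2)
```

An exporter only needs to persist `last_pos` durably after each successful transfer; no watermark support is required on the destination side.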
Since an S3 bucket can store any number of objects and those objects can be listed, it is convenient to store each event as an object in the same bucket. An object's name can map back to where the event originated in the stream store. For example, an event has an event pointer that refers to its location in the stream and allows us to retrieve just that event. Similarly, the object can be given the event pointer as its key prefix, so that there is always a one-to-one mapping for the same event between the source and the destination. Since event pointers preserve sequence in their string representation, the prefixes of these objects are also sequential. It then becomes easy to select a sublist of these objects with the selection criteria of an S3 list request. With the one-to-one mapping, readers are free to complete the data transfer in any order of events, because the objects will still be listed sequentially in the S3 listing.
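The naming scheme can be sketched without touching S3 at all, since S3 lists keys lexicographically. The key format below is hypothetical; a real exporter would use the store's actual event-pointer string and an S3 `ListObjectsV2` request with a prefix:

```python
# Sketch: render the event pointer as a string that sorts in stream order and
# use it as the S3 object key, so a lexicographic listing of the bucket
# reproduces stream order regardless of upload order. Key format is made up.

def pointer_key(segment, offset):
    """Zero-padded so lexicographic order equals numeric stream order."""
    return f"segment-{segment:04d}/offset-{offset:012d}"

# Events may be uploaded out of order by parallel readers...
uploaded = [pointer_key(1, 300), pointer_key(1, 100), pointer_key(1, 200)]

# ...but the listing comes back in stream order (S3 lists keys lexicographically).
listing = sorted(uploaded)
print(listing[0])  # -> 'segment-0001/offset-000000000100'

# A key prefix selects a sublist, e.g. everything from segment 1.
segment_1 = [k for k in listing if k.startswith("segment-0001/")]
```

The zero-padding is the essential detail: without it, `offset-100` would sort after `offset-1000` and the listing would no longer match stream order.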
The stream store accepts data via gRPC and exposes a client for this purpose. The client sits at the same level as the stream store. Using the client against a standalone instance of the stream store is probably the preferred way to send it data, because it involves no traversal of layers. Connectors, on the other hand, serve to reach a higher layer, which is typical for applications and runtimes hosted on a Kubernetes cluster. This layer sits on top of the stream store and is exemplified by Flink: any application hosted on the Flink runtime works with the streams in the store only via the connector. There is also a layer above this, where an upstream or downstream heterogeneous storage system has neither a connector nor a client and the only data transfer allowed is over the web. Very few data pipelines are connected this way, but it is probably the only user-mode way to convert data from a file to a stream and in turn to a blob; otherwise that conversion is internal to the store and its tier 2.
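The three access paths described above can be summarized as a small decision rule. The names and classification below are illustrative only, not an API of any stream store:

```python
# Sketch of the layering: which path a producer/consumer takes to the stream
# store depends on where it runs. The labels here are hypothetical.

def access_path(workload):
    """Map a workload type to its integration path with the stream store."""
    if workload == "standalone-app":
        return "client"      # direct gRPC client; no layers traversed
    if workload == "flink-job":
        return "connector"   # runtime hosted on the Kubernetes cluster
    return "web"             # heterogeneous system; HTTP is the only option

print(access_path("flink-job"))  # -> 'connector'
```

Each step up the stack trades directness for reach: the client is fastest but couples the caller to the store, while the web path works with systems that know nothing about streams.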