Tuesday, August 13, 2019

Summary of the book "Stream processing with Apache Flink"
The use of windows helps process events in a sequential manner. Order is maintained with the help of virtual (event) time, which also helps with distributed processing. This is a significant win over traditional batch processing because events are ingested continuously and the latency for results is low.
Stream processing is done with the DataStream API rather than the DataSet API, which is more suitable for batch processing. Both APIs operate on the same runtime, which handles the distributed streaming dataflow. A streaming dataflow consists of one or more transformations between a source and a sink.
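As a sketch, a minimal source-transform-sink dataflow with the DataStream API might look like the following; the socket source on port 9999, the uppercase transformation, and the print sink are illustrative assumptions rather than anything from the book:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceTransformSink {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)   // source: lines of text from a socket (assumed)
           .map(String::toUpperCase)              // a single transformation
           .print();                              // sink: write results to stdout

        env.execute("source-transform-sink");     // submit the streaming dataflow to the runtime
    }
}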
The difference between Kafka and Flink is that Kafka is better suited for data pipelines while Flink is better suited for streaming applications. Kafka tends to reduce the batch size for processing, whereas Flink runs an operator continuously, processing one record at a time. Both are capable of running in standalone mode. Kafka may be deployed in a high-availability cluster mode, whereas Flink can scale up in standalone mode because it allows iterative processing to take place on the same node. Flink manages its memory requirements better because it can perform processing on a single node without requiring cluster involvement, and it has a dedicated master node for coordination.
Both systems are able to execute with stateful processing. They can persist their state so that processing can pick up where it left off, and this state also helps with fault tolerance; persisting state protects against failures, including data loss. The consistency of the state can be independently validated with the checkpointing mechanism available in Flink, which can persist the local state to a remote store. Stream processing applications often take their incoming events from an event log. The event log stores and distributes event streams that are written to a durable, append-only log on tier-2 storage, where they remain sequential by time. Flink can recover a stateful streaming application by restoring its state from a previous checkpoint and adjusting the read position on the event log to match that checkpoint. Stateful stream processing is therefore suited not only for fault tolerance but also for reentrant processing and improved robustness with the ability to make corrections. It has become the norm for event-driven applications, data pipeline applications, and data analytics applications.
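As a sketch, these pieces come together in a Flink 1.x job that reads its events from a Kafka-backed event log and checkpoints its state to a remote (here, filesystem) store; the topic name, broker address, checkpoint interval, and state backend path are all illustrative assumptions:

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class CheckpointedEventLogReader {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(60_000);                                          // checkpoint every minute
        env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints")); // persist local state to a durable store (assumed path)

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");                 // assumed broker
        props.setProperty("group.id", "flink-event-log-reader");

        // The Kafka topic plays the role of the durable, append-only event log.
        env.addSource(new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props))
           .print();                                                              // placeholder sink

        env.execute("checkpointed-event-log-reader");
    }
}

On recovery, the runtime restores operator state from the latest completed checkpoint and rewinds the read position on the event log that was recorded with it.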

Monday, August 12, 2019

Today we continue reading the book "Stream processing with Apache Flink". We discussed that Apache Flink is an open-source stream processing framework featuring 1) low latency, 2) high throughput, 3) stateful processing, and 4) distributed processing. Flink is a stream processor that can scale out. It can perform arbitrary processing on windows of events with storage and access of intermediate data. This persistence of state protects against failures, including data loss. The consistency of the state can be independently validated with a checkpointing mechanism, which can persist the local state to a remote store. Stream processing applications often take their incoming events from an event log. The event log stores and distributes event streams that are written to a durable, append-only log on tier-2 storage, where they remain sequential by time. Flink can recover a stateful streaming application by restoring its state from a previous checkpoint and adjusting the read position on the event log to match that checkpoint. Stateful stream processing is therefore suited not only for fault tolerance but also for reentrant processing and improved robustness with the ability to make corrections.
The use of windows helps process events in a sequential manner. Order is maintained with the help of virtual (event) time, which also helps with distributed processing. This is a significant win over traditional batch processing because events are ingested continuously and the latency for results is low.
Stream processing is done with the DataStream API rather than the DataSet API, which is more suitable for batch processing. Both APIs operate on the same runtime, which handles the distributed streaming dataflow. A streaming dataflow consists of one or more transformations between a source and a sink.


Sunday, August 11, 2019

Today we start reading the book "Stream processing with Apache Flink". Apache Flink is an open-source stream processing framework. The salient features of this framework include:
1) low latency
2) high throughput
3) stateful and
4) distributed processing
The first two were independently featured by Apache Storm and Apache Spark respectively; Apache Flink brings both together with its ability to process a large number of parallel streams. Although high throughput has traditionally been achieved by multiplexing different streams over a publish/subscribe bus, Flink stands out as a stream processor where events are processed in a near-real-time manner, unlike batch processing. Latency can be on the order of hours on a batch processing system, whereas it can be under a second on a stream processor such as Flink. In fact, Flink can special-case batch processing as one of its capabilities.
Similarly, data at rest has traditionally involved a database-centric architecture. On the other hand, cloud computing trends have boosted the requirement to analyze data in transit. A stream processor is well suited for this purpose, even for archivable data such as logs.
Apache Flink is also a distributed stream processor with the capability to scale out. As it does so, it can perform arbitrary processing on windows of events with storage and access of intermediate data. This persistence of state protects against failures, including data loss. The consistency of the state can be independently validated with the checkpointing mechanism available in Flink, which can persist the local state to a remote store. Stream processing applications often take their incoming events from an event log. The event log stores and distributes event streams that are written to a durable, append-only log on tier-2 storage, where they remain sequential by time. Flink can recover a stateful streaming application by restoring its state from a previous checkpoint and adjusting the read position on the event log to match that checkpoint. Stateful stream processing is therefore suited not only for fault tolerance but also for reentrant processing and improved robustness with the ability to make corrections. It has become the norm for event-driven applications, data pipeline applications, and data analytics applications.

Saturday, August 10, 2019

Queries on streams also differ depending on the stack involved. For example, when the Flink application is standalone and a query has been provided, the user may receive no output for a long while. When the data set is large, it can be unclear to the user whether the delay comes from the processing time over the data set or from incorrectly written logic. The native web interface for Apache Flink provides some support in this regard: it gives the ability to watch for watermarks, which indicate whether any progress is being made. If there are no watermarks, then it is likely that the event-time windows never elapsed. Users also have savepoints from Flink to interpret progress. Users can create, own, or delete savepoints, which represent the execution state of a streaming job; these savepoints point to actual files on storage. Flink also provides checkpointing, which it creates and deletes without user intervention.
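For context, watermarks originate from the application itself. Here is a sketch of assigning event-time timestamps and periodic watermarks in the Flink 1.x DataStream API, where the events stream, its Tuple2<String, Long> element type, and the ten-second out-of-orderness bound are assumptions for illustration:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

// "events" is an assumed DataStream<Tuple2<String, Long>> whose f1 field holds the event time in milliseconds.
DataStream<Tuple2<String, Long>> withWatermarks = events
    .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(10)) {
            @Override
            public long extractTimestamp(Tuple2<String, Long> event) {
                return event.f1;   // the event's own timestamp drives the watermark
            }
        });

If those timestamps never advance, the watermarks stall and the event-time windows never fire, which matches the symptom described above.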
On the other hand, streaming queries generally follow a five-step procedure: 
1)     Define events in terms of a payload (the data values of the event) and a shape (the lifetime of the event along the time axis). 
2)     Define the input streams of events as a function of the event payload and shape. For example, this could be a simple enumerable over some time interval. 
3)     Based on the event definitions and the input stream, determine the output stream and express it as a query. In a way, this describes a flow chart for the query. 
4)     Bind the query to a consumer. This could be a console, for example: 
        var query = from win in inputStream.TumblingWindow(TimeSpan.FromMinutes(3)) select win.Count(); 
5)     Run the query and evaluate it based on time.
It is probably most succinct in SQL, where a windowing function can be written as: 
SELECT COUNT(*) 
OVER ( PARTITION BY hash(u.timestamp DIV (60*60*24)) partitions 3 ) u1 
FROM graphupdate u;
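A rough sketch of a similar tumbling-window count with Flink's DataStream API follows; the input stream name, its String element type, and the three-minute window (mirroring the LINQ example above) are assumptions:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

// "input" is an assumed DataStream<String> of raw events.
DataStream<Long> countsPerWindow = input
    .map(e -> 1L).returns(Types.LONG)     // emit a one per incoming event
    .timeWindowAll(Time.minutes(3))       // non-keyed, three-minute tumbling windows
    .reduce((a, b) -> a + b);             // sum of ones = count per window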

Friday, August 9, 2019

The logs generated in the cluster are usually saved on persistent volumes, which are available between restarts. If the application prefers external-only logging, then any internal storage for this purpose can be ephemeral. 
Dependencies on external log services come with downstream benefits such as API monitoring. With the help of logs, specific queries can be run on log indexes for time-sliced events, which can help determine trends and raise alerts when thresholds are exceeded.  
The queries for the logs change depending on whether they operate over a stream or over entities. This does not have to change the user interface that retrieves time-sliced data, displays the query parameters, and displays the results against the query. The nature of the queries only impacts the query execution over the data. The query and the data are independent of each other so long as one form of query or another can be used to return the results of the analysis on the data. 
This leaves the front end open to display query filters, charts, and graphs without any regard for the underlying subsystem. Since logs are a form of data, the user interface for querying logs can be the same as the user interface for any other data. Furthermore, the parameters for logs will likely be similar across many queries, so the user interface controls and the queries can be appropriately matched. The page that displays the controls will then likely change very little when used against one or more queries.  
Whether the logs are read from an index or from sequential storage formats does not change the notion that execution follows the timeline in the logs for the activities performed by the product, which helps reduce the overall cost of maintenance. 

I got my company-issued IntelliJ license today.

Thursday, August 8, 2019

The advantage of keeping cluster-specific rules different from cluster-external rules, over having no differentiation, is both significant and a win-win for application deployers as well as their customers. Consider the use case where the analysis over logs needs proprietary logic, or query hints or annotations, that can help product support but need not be part of the queries from customers. Taking this to another extreme, the cluster-deployed application may want a competitive advantage over the marketplace log-store capabilities outside the cluster. These use cases broaden the horizon for the storage of logs, especially when the application is itself a storage product.
Let us take another use case where the cluster-specific solution provides an interim remediation for disaster recovery, especially when the points of failure are external services. In such a failure case, the user would otherwise remain blind to the operations of the cluster, since the cluster-external log services are not giving visibility into the latest log entries. Similarly, an external network connection may have been taken for granted, while the administrator may find it easier to retrieve the logs from the cluster and send them offline for analysis by remote teams. The dual possibility of internal and external logging provides benefits from many other product perspectives.
The logs generated in the cluster are usually saved on persistent volumes, which are available between restarts. If the application prefers external-only logging, then any internal storage for this purpose can be ephemeral.
Dependencies on external log services come with downstream benefits such as API monitoring. 

Wednesday, August 7, 2019

Reliance on Kubernetes cluster-only log storage does not compete with log services external to the cluster. The cluster can be self-sufficient for limited logs, while the cluster-external services can provide durable storage along with all the best practices of storage engineering. The cluster-only log storage can leverage the high availability and load balancing indigenous to the cluster, while the cluster-external services can divert log-reading loads away from the cluster itself. The cluster can reduce points of failure and facilitate capture of console sessions and other diagnostic actions taken on the cluster, while the services external to the cluster cannot change the source of truth. The cluster-specific log collection allows the ability to specify fluentd rules, while the services outside the cluster have to rely on classification.
The advantage of keeping cluster-specific rules different from cluster-external rules, over having no differentiation, is both significant and a win-win for application deployers as well as their customers. Consider the use case where the analysis over logs needs proprietary logic, or query hints or annotations, that can help product support but need not be part of the queries from customers. Taking this to another extreme, the cluster-deployed application may want a competitive advantage over the marketplace log-store capabilities outside the cluster. These use cases broaden the horizon for the storage of logs, especially when the application is itself a storage product.
Let us take another use case where the cluster-specific solution provides an interim remediation for disaster recovery, especially when the points of failure are external services. In such a failure case, the user would otherwise remain blind to the operations of the cluster, since the cluster-external log services are not giving visibility into the latest log entries. Similarly, an external network connection may have been taken for granted, while the administrator may find it easier to retrieve the logs from the cluster and send them offline for analysis by remote teams. The dual possibility of internal and external logging provides benefits from many other product perspectives.