Saturday, June 13, 2020

Gradle plugins and the convenience of making Docker images – some trivia and gotchas


Gradle is a build automation tool used primarily by Java developers to compile and package their code. It provides a framework where different automation tasks can be delegated to plugins, and these can be configured for the developer's code repository with minimal parameters in a script. When a Java project is compiled, it brings in a ton of dependencies in the form of published software components, each with multiple versions. The tasks for compiling and packaging the code vary from repository to repository based on the project source code structure and the parameters for different tasks. These parameters are specified in a gradle.properties file, while the script is maintained in each subproject and in the project root folder.

Options to package the code often include tasks such as building a regular or a fat jar, where the latter is one in which the dependent components are pulled in and archived into a single jar file. Jar files can be extracted with the 'jar' tool and are made up of compiled classes from the Java code and a manifest. When more than one jar is packaged into an uber jar, it is called a fat jar. Jars can be published in a standard way following the Maven specification, where the metadata and versioning are produced in a browsable manner called the build information. Destinations where this information and the build can be uploaded include remote binary repositories such as Artifactory as well as public open-source repositories. A lot of attention is paid to the manner in which the code is organized and the build is produced when they are as public as open-source.

The organization of projects and sub-projects has to be such that the enumeration of plugins, repositories and dependencies is written meticulously; otherwise getting the build script right can be frustrating without transparency into and knowledge about the plugins. Fortunately, the documentation and the public forums serve well to overcome some or most of these hiccups. Commonly encountered errors include invoking a task from a plugin that was never declared, stale binaries when the whole project structure is not cleaned before the build, and the order and positioning of steps within the script when more than one language is involved. A common practice is to turn on the debug output mode and the stacktrace from build exceptions so that the offending task and its resolution can be found. Sometimes this is not enough because the script has no visibility into the operations of the plugins, since many plugins are authored outside of those provided by the framework. One way to overcome this is to search the issues and forums tabs in the plugin repositories for the opinions and resolutions cited there.

The script itself is easy to read in the more recent versions of the Gradle framework and is written in its own Groovy-based syntax, which works both for manual builds and in an automated pipeline. Packaging plugins come with an additional onus for their users to know the layout that will be generated from the use of the plugin. The location, version and order need to be repeated in each and every stage of a multi-stage build automation, and mistakes in one can cascade into others. Narrowing down the failure by starting from a clean state each time, before spending time debugging it, takes a little less effort than otherwise. The nine rules of debugging proposed by David Agans can be helpful here for the most elusive problems: “Understand the system”, “Make it fail”, “Quit thinking and look”, “Divide and conquer”, “Change one thing at a time”, “Keep an audit trail”, “Check the plug”, “Get a fresh view” and “If you didn’t fix it, it ain’t fixed”.

 

Friday, June 12, 2020

Providing REST API:
Stream processing is usually bound to a stream in the stream store, which it accesses over gRPC-based connectors. But the usefulness and convenience of making web requests from the application cannot be written off. REST APIs are independently provisioned per resource, and the services providing those resources can be hosted anywhere. This flat enumeration of APIs helps with the microservice model and is available for mashups within the application. SOAP requires tools to inspect the message; REST can be intercepted by a web proxy and displayed with a browser and add-ons.
SOAP methods require a declarative address, binding and contract. REST is URI-based and has qualifiers for resources. This is what makes REST more popular than gRPC, although the latter is better suited for IoT traffic.
There is a technique called HATEOAS where a client can get more information from the web API servers with the help of hypermedia. This technique helps the client discover information beyond the documentation, by machines instead of humans. Since storage products are considered commodity, this technique is suitable for plugging in one storage product versus another. Clients that use the web API programmability can then switch services and their resource qualifiers with ease.
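As a minimal sketch, assuming JAX-RS and purely illustrative resource paths and link relations, a hypermedia-aware response might look like this:

```java
// A hypothetical HATEOAS-style response using JAX-RS Link headers, so a client
// can discover related resources without consulting documentation.
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.core.Response;

@Path("/scopes/{scope}/streams/{stream}")
public class StreamInfoResource {

    @GET
    public Response get(@PathParam("scope") String scope, @PathParam("stream") String stream) {
        String body = "{\"scope\":\"" + scope + "\",\"stream\":\"" + stream + "\"}";
        return Response.ok(body)
                // Hypermedia links let the client navigate to related resources.
                .link("/scopes/" + scope + "/streams/" + stream + "/segments", "segments")
                .link("/scopes/" + scope + "/streams", "collection")
                .build();
    }
}
```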
Some of the implementors of REST APIs follow a convention. This convention makes it easy to construct the APIs for the resources and brings consistency among those for different resources.
REST APIs also make authentication and authorization simpler to set up and check. There are several solutions to choose from, some involving offloaded functionality or the use of Java frameworks that support annotation-based checks for adequate privilege and access control.
Supported actions are filtered. Whether a user can read or write is determined by the capability for the corresponding read or write permission, and the user is checked for the permission involved. This could be an RBAC access control such that it is just a check against the user's role; the system user, for instance, has all the privileges. The capabilities a user has are determined from the license manager, for instance, or from explicitly added capabilities. This is also necessary for audit purposes. If auditing is not necessary, then the check against capabilities could always return true; on the other hand, if all denies were to be audited, it would leave a trail in the audit log.
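A minimal sketch of such an annotation-based check, assuming JAX-RS with the standard javax.annotation.security annotations and illustrative role names and resource paths:

```java
// Annotation-based access control on a REST resource: read and write actions
// are filtered by role before the method body runs.
import javax.annotation.security.RolesAllowed;
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/streams")
public class StreamResource {

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    @RolesAllowed({"reader", "admin"})   // read permission maps to these roles
    public String listStreams() {
        return "[]"; // placeholder: return the streams visible to the caller
    }

    @POST
    @RolesAllowed("admin")               // write permission restricted to admin
    public void createStream(String definition) {
        // placeholder: create the stream; denied requests can be written to the audit log
    }
}
```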

Performance Tuning considerations in Stream queries:

 
Query languages and query operators have made writing business logic extremely easy and independent of the data source. This suffices for the most part, but there are a few cases where the status quo is just not enough. When real-time processing needs and high-priority queries enter the picture, the size of the data, the complexity of the computation and the latency of the response begin to become a concern.
Databases have had a long and cherished history of encountering and mitigating query execution costs. However, relational databases pose a significantly different domain of considerations as opposed to NoSQL storage, primarily because their layered and interconnected data requires scale-up rather than scale-out technologies. Both have their independent performance tuning considerations.
Stream storage is no different in suffering performance issues with disparate queries ranging from small to big. The compounded effect of append-only data and streams that must be evaluated in windows makes iteration difficult. The processing of the streams is also exported out of the storage, and this causes significant round trips back and forth.
The Apache stack has significantly improved the offerings for stream processing. Apache Kafka and Flink are both able to execute with stateful processing. They can persist the states to allow the processing to pick up where it left off. The states also help with fault tolerance, and this persistence of state protects against failures including data loss. The consistency of the states can also be independently validated with a checkpointing mechanism, also available from Flink, which can persist the local state to a remote store. Stream processing applications often take in the incoming events from an event log. This event log stores and distributes event streams which are written to a durable append-only log on tier 2 storage, where they remain sequential by time. Flink can recover a stateful streaming application by restoring its state from a previous checkpoint and adjusting the read position on the event log to match the state from the checkpoint. Stateful stream processing is therefore suited not only for fault tolerance but also for reentrant processing and improved robustness with the ability to make corrections. It has become the norm for event-driven applications, data pipeline applications and data analytics applications.
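As a minimal sketch of stateful processing with checkpointing in Flink's DataStream API, where the socket source, the key extraction and the per-key counter are assumptions for illustration:

```java
// A stateful Flink job: keyed ValueState holds a per-key counter, and
// periodic checkpoints let the job recover the state after a failure.
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StatefulCountJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000); // persist operator state every 10 seconds

        env.socketTextStream("localhost", 9999)      // stand-in for an event-log source
           .keyBy(line -> line.split(",")[0])        // key by, say, a device id
           .flatMap(new CountPerKey())
           .print();

        env.execute("stateful-count");
    }

    // Keeps a per-key counter in keyed state; it is restored from the latest
    // checkpoint if the job fails and recovers.
    static class CountPerKey extends RichFlatMapFunction<String, String> {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void flatMap(String value, Collector<String> out) throws Exception {
            Long current = count.value();
            long next = (current == null ? 0L : current) + 1;
            count.update(next);
            out.collect(value.split(",")[0] + " -> " + next);
        }
    }
}
```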
Persistence of streams for intermediate executions helps with reusability and improves the pipelining of operations, so the query operators stay small and can be executed by independent actors. With an equivalent of lambda processing on persisted streams, pipelining can significantly improve performance over the earlier approach where the logic was monolithic and proved slow as it progressed from window to window. There is no distinct rule of thumb, but fine-grained operators have proven effective since they can be studied and tuned in isolation.
Streams that articulate the intermediary results also help determine what goes into each stage of the pipeline; watermarks and savepoints are similarly helpful. This kind of persistence proves to be a win-win for parallelizing as well as subsequent processing, whereas disk access used to be costly in dedicated systems. There is no limit to the number of operators and their scale that can be applied to streams, so proper planning mitigates the effort needed to choreograph a bulky search operation.
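A minimal sketch of this kind of pipelining, assuming Kafka topics as the durable intermediary between two fine-grained stages (the topic names and the trivial transformations are purely illustrative):

```java
// Stage 1 persists an intermediate, normalized stream; stage 2 (which could be
// a separate job) reads it back and applies the next small operator.
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import java.util.Properties;

public class PipelinedStages {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");

        // Stage 1: normalize raw events and persist the intermediate stream.
        DataStream<String> raw =
            env.addSource(new FlinkKafkaConsumer<>("raw-events", new SimpleStringSchema(), props));
        raw.map(value -> value.trim())
           .returns(String.class)
           .addSink(new FlinkKafkaProducer<>("normalized-events", new SimpleStringSchema(), props));

        // Stage 2: read the persisted intermediate stream and apply the next operator.
        env.addSource(new FlinkKafkaConsumer<>("normalized-events", new SimpleStringSchema(), props))
           .filter(line -> !line.isEmpty())
           .print();

        env.execute("pipelined-stages");
    }
}
```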
These are some of the considerations for the performance improvement of stream processing. 

Thursday, June 11, 2020

Patenting customizations

A software product may have many innovations. Patents protect innovations from infringement and provide an assertion of rights for the inventor or software maker so that someone else cannot claim ownership of the novelty. Patents can also help protect the maker from losing their competitive edge to someone else copying the idea. Patents are therefore desirable and can be used to secure mechanisms, processes and any novelty that was introduced into the product.
Customization is the way in which the software maker makes the out-of-the-box features more usable for a customer. It includes automation of workflows that a customer needs over and on top of what the product offers. In the context of a storage product, the automations needed to extract-transform-load data to and from the product are a classic example of customizations. Many customers have to write their own code to wrap the business logic for data import and export, which sometimes calls for vendors to provide solution integration. The insight is that different solution integrations across customer segments can be consolidated and offered as a layer over the storage product. Together with the study of customer use cases and internal storage best practices, the storage product is better able to provide customizations that are not only efficient but also convenient for the customer. This mutual tension between the software maker and the customer to do away with excesses on either side is best resolved with a patent.
Let us take the customization example for a storage product a little further. The use of connectors and appenders enhances the use of the storage product for data transfers. Connectors funnel data from different data sources and are typically one per type of source. Appenders allow data to be written once but propagated to different destinations. A customer may choose to build these as well. However, an out-of-the-box automation for this purpose can be done once by the maker for the benefit of each and every customer. This removes some burden from the customer, who would not need to know the internals of the product and would not need to invest in setting up the operations; a simple configuration can facilitate the data transfer. The value additions of the connectors and appenders are immense for the customers while increasing the appeal of the storage product in its ecosystem. Even though some of these automations are routine, they can be enhanced with intelligence for the purposes of inference. All of these benefits make the case for a patent.
A patent can only be given when certain other conditions are also met. We will go over these conditions for their applicability to the purpose of customization. 
First, the invention must show an element of novelty that is not already available in the field. This body of existing knowledge is called “prior art”. If the customization has not yet been released, it becomes part of the next version, and patenting it prevents the gap where a competitor could grab a patent between the maker's releases as the product begins to gain popularity.
Second, the invention must involve an “inventive step” or be “non-obvious”, meaning a person having ordinary skill in the relevant technical field could not arrive at the same solution. This is sufficiently met by customization because the way internal operational data is stored and read is best known to the maker, since all the components of the product are visible to the maker and best known to their staff.
Third, the invention must be capable of industrial application such that it is not merely a theoretical phenomenon but one that is useful in practice. Fortunately, all product support personnel will vouch for the utility of libraries that can assist with increasing convenience for end-users in mission critical deployments.
Finally, there is an ambiguous criterion that the innovation must be “patentable” under the law of a specific country. In many countries, certain theories, creative works, models or discoveries are generally not patentable. Fortunately, when it comes to software development, history and tradition serve well just like in any other industry.
Thus, all the conditions of the application for patent protection of innovation can be met. Further, when the invention is disclosed in a manner that is sufficiently clear and complete to enable it to be replicated by a person with an ordinary level of skill in the relevant technical field, it improves the acceptance and popularity of the product.

Wednesday, June 10, 2020

Application troubleshooting continued

Finally, a recap of the benefits of writing an Apache Flink application.

The use of windows helps process the events in a sequential manner. The order is maintained with the help of virtual time, and this helps with distributed processing as well. It is a significant win over traditional batch processing because the events are continuously ingested and the latency for the results is low.

The stream processing is facilitated with the DataStream API instead of the DataSet API; the latter is more suitable for batch processing. Both APIs operate on the same runtime that handles the distributed streaming data flow. A streaming data flow consists of one or more transformations between a source and a sink.
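For illustration, a minimal streaming data flow with a source, a windowed transformation and a sink could be sketched as follows; the socket source and the five-second tumbling window are assumptions:

```java
// Count events per five-second window: source -> transformation -> sink.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowedLineCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)        // source: one event per line
           .map(line -> 1)                             // transformation: count each event
           .returns(Integer.class)
           .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
           .reduce((a, b) -> a + b)                    // events per five-second window
           .print();                                   // sink

        env.execute("windowed-line-count");
    }
}
```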

The difference between Kafka and Flink is that Kafka is better suited for data pipelines while Flink is better suited for streaming applications. Kafka tends to reduce the batch size for the processing. Flink tends to run an operator on a continuous basis, processing one record at a time. Both are capable of running in standalone mode.  Kafka may be deployed in a high availability cluster mode whereas Flink can scale up in a standalone mode because it allows iterative processing to take place on the same node. Flink manages its memory requirement better because it can perform processing on a single node without requiring cluster involvement. It has a dedicated master node for co-ordination. 

Both systems are able to execute with stateful processing. They can persist the states to allow the processing to pick up where it left off. The states also help with fault tolerance, and this persistence of state protects against failures including data loss. The consistency of the states can also be independently validated with a checkpointing mechanism, also available from Flink, which can persist the local state to a remote store. Stream processing applications often take in the incoming events from an event log. This event log stores and distributes event streams which are written to a durable append-only log on tier 2 storage, where they remain sequential by time. Flink can recover a stateful streaming application by restoring its state from a previous checkpoint and adjusting the read position on the event log to match the state from the checkpoint. Stateful stream processing is not only suited for fault tolerance but also for reentrant processing and improved robustness with the ability to make corrections. It has become the norm for event-driven applications, data pipeline applications and data analytics applications.

Tuesday, June 9, 2020

Application troubleshooting continued

Apache does not publish a sizing specification for Flink runtime workloads, but some common deployment shapes serve as T-shirt-size guidance.

They are differentiated by IO-intensive and CPU-intensive allocations and are listed below:

The minimal size is usually for non-production workloads. Its compute-intensive configuration has 1 Zookeeper server, 1 job manager and 4 task managers; the IO-intensive configuration has fewer task managers.

The small, medium and large sizes scale up proportionately from these configurations. For example, the medium compute-intensive configuration involves 2 job managers and 8 task managers, and the Zookeeper cluster can be three nodes for each of the medium and large deployments.

The large compute-intensive configuration involves 4 job managers and 16 task managers.

The IO-intensive configurations can be reduced to smaller numbers from the above.

 

Kubernetes cluster sizing:

SDP applications are hosted on a cluster that comprises a Flink runtime cluster and a stream store cluster. The stream store is globally shared across all project-level isolations of the applications. Therefore, applications wanting more compute increase the resources available to the Flink cluster, and those requiring more IO try to increase the resources on the Pravega cluster. However, these are not the only clusters on the stream store and analytics platform. In addition to the controller and segment stores for the global Pravega cluster, there are additional clusters required for components such as Zookeeper, Bookkeeper and Keycloak. In fact, performance tuning of a Flink application spans several layers, stacks and components. Tuning tips such as reduction of traffic, omission of unnecessary routines, in-memory operations, removal of bottlenecks and hot spots, load-balancing and increasing the size of the clusters are all valid candidates for improving application performance.

Cluster sizing is accommodated based on the resources initially carved out for the entire stream store and analytics platform cluster. The resources associated with this cluster are all virtual, and the ratio of physical to virtual is adjusted external to the platform.

All of the component clusters within the Kubernetes cluster can have their resources described and configured at the time of installation or upgrade via the values file. These specifications for memory, cpu and storage affect how those components are installed and run. The number of replicas or container counts can be scaled dynamically, as is typical for Kubernetes deployments, and the number of desired and ready instances will reflect whether the resource scaling was accommodated.

 

Leveraging K8s platform

Application execution is not directly visible to the K8s platform because it occurs within the Flink runtime.

The logs, metrics and events from Flink can be made to flow to the Kubernetes runtime.

 

Leveraging stream store

Applications can continue to use the stream store for persisting custom state and intermediary execution results. This does not impact other applications and gives flexibility for the application to introduce logic such as pause and resume.


Monday, June 8, 2020

Application troubleshooting continued

Application Monitoring:

Flink runtime monitoring has a dedicated solution in the Prometheus stack. This stack comprises Metrics, which provide time-series data; Labels, which provide key-value pairs; Scrape, which fetches metrics; TSDB, which is the Prometheus storage layer; and PromQL, which is the query language used for charts, graphs and alerts. The dashboard is available via Grafana. All it takes to set up this stack is to drop the reporter jar into the lib directory and to configure conf/flink-conf.yaml. The Prometheus service can be configured via the prometheus.yml configuration or by service discovery. The stack is helpful even if we want to define custom metrics; Flink, as such, supports gathering and exposing metrics to external systems.
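As a minimal sketch, a custom metric can be registered from user code so that the configured reporter picks it up; the metric name and the mapping logic here are illustrative:

```java
// Expose a custom Flink counter that the configured Prometheus reporter can scrape.
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

public class EventCountingMapper extends RichMapFunction<String, String> {
    private transient Counter eventsSeen;

    @Override
    public void open(Configuration parameters) {
        // Registered metrics are picked up by whichever reporter is configured
        // in conf/flink-conf.yaml (here, the Prometheus reporter).
        eventsSeen = getRuntimeContext().getMetricGroup().counter("eventsSeen");
    }

    @Override
    public String map(String value) {
        eventsSeen.inc();
        return value;
    }
}
```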

 

Kubernetes provides events, logs, metrics and audit information for all actors and their activities on the assets it maintains. All of this data may be collected centrally in the form of JSON text and destined for services that are dedicated to improving analysis and insights into the machine data.

Content can be wrapped with an augmented key-value set that can be used in queries for filtering, transforming, mapping and reducing the operational machine data into more meaningful reports. Kubernetes is well-positioned to wrap individual data entries with metadata, not only enhancing the content but doing so authoritatively and irrespective of downstream destination systems.

This envelope of metadata surrounding each entry may consist of predefined labels and annotations, timestamps or attributes of points of origin, and any additional extractable key-value pairs from the data itself. Since Kubernetes is the source of truth for the runtime operations associated with hosting the applications, these enhancements are done once per data entry.

However, if system administrators are allowed to write rules with which to inject custom key-value pairs into these labels and annotations surrounding each entry, then it may improve the querying associated with the data by providing input not just from the system but also from the rules defined by the system administrator. This set of rules is evaluated by a classifier that executes on all data entries exactly once. The rules may have intrinsics and operators that evaluate against, say, the day of week, peak versus non-peak hour periods, and traffic characteristics such as the five-tuple attributes of the flow, and so on.
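A hypothetical sketch of such a classifier, where the Rule interface, the entry shape and the example rule are all assumptions introduced for illustration:

```java
// Each administrator-defined rule contributes extra key-value labels to a
// log/metric entry, and the classifier runs every rule exactly once per entry.
import java.time.DayOfWeek;
import java.time.ZonedDateTime;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface Rule {
    // Evaluates the entry and returns any labels to inject, or an empty map.
    Map<String, String> evaluate(ZonedDateTime timestamp, Map<String, String> entry);
}

public class EntryClassifier {
    private final List<Rule> rules;

    public EntryClassifier(List<Rule> rules) {
        this.rules = rules;
    }

    // Wraps the entry with the merged label set produced by all rules.
    public Map<String, String> wrap(ZonedDateTime timestamp, Map<String, String> entry) {
        Map<String, String> enriched = new HashMap<>(entry);
        for (Rule rule : rules) {
            enriched.putAll(rule.evaluate(timestamp, entry));
        }
        return enriched;
    }

    // Example rule: tag weekday peak-hour traffic, as mentioned in the text.
    public static Rule peakHourRule() {
        return (timestamp, entry) -> {
            boolean weekday = timestamp.getDayOfWeek() != DayOfWeek.SATURDAY
                    && timestamp.getDayOfWeek() != DayOfWeek.SUNDAY;
            boolean peak = timestamp.getHour() >= 9 && timestamp.getHour() < 17;
            return (weekday && peak) ? Map.of("traffic-period", "peak") : Map.of();
        };
    }
}
```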

By enhancing the envelope as well as the evaluator that wraps each entry, the downstream systems are guaranteed multiple perspectives on individual entries that were simply not possible earlier with the native Kubernetes framework.

A classifier that adds labels and annotations within the Kubernetes framework to augment the native events will significantly improve the capabilities of downstream listeners and their alerts and reports.