Friday, July 31, 2020

Support for small to large footprint introspection database and query

The query-processing ability of the API will also help automations that rely on scripts run from the command line. Administrators generally prefer invocation from the command line whenever possible. A health check that queries the introspection datastore, as described earlier, is not only authoritative but also richer, since all of the components contribute to the data.
The REST API is a front for other tools and packages, such as a command-line interface (CLI) and an SDK, which enable programmability for packaged queries of the introspection store. An SDK does away with the requirement on the client to make the REST calls directly. This is a best practice favored by many cloud providers who know their cloud computing capability will be called from a variety of clients and devices. A variety of programming languages can be used to provide the SDK, which widens the client base. The SDKs are provided as a mere convenience, so this could be something that the developer community might initiate and maintain. Another benefit to using the SDK is that the API versioning and service maintenance onus is somewhat reduced, because more customers use their language-friendly SDK while the internal wire-call compatibility may change. SDKs also support platform-independent and interoperable programming, which helps testing and automation.
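As a sketch of what such an SDK might look like, the hypothetical wrapper below hides the REST call behind a single method using the standard Java HTTP client; the class name, endpoint path and response handling are illustrative assumptions rather than an actual product API.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical thin SDK wrapper over the introspection REST API.
public class IntrospectionClient {
    private final HttpClient http = HttpClient.newHttpClient();
    private final URI baseUri;

    public IntrospectionClient(URI baseUri) {
        this.baseUri = baseUri;
    }

    // Runs a named packaged query and returns the raw JSON payload.
    public String runPackagedQuery(String queryName) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(baseUri.resolve("/introspection/queries/" + queryName))
                .GET()
                .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}

A caller then invokes a packaged query in one line instead of composing the REST request, headers and response handling itself.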
Similarly, a command-line interface (CLI) also broadens the customer base. Tests and automations can make use of commands and scripts to drive the service. This is a very powerful technique for administrators as well as for day-to-day usage by end users. Command-line usage has one additional benefit: it can be used for troubleshooting and diagnostics. While SDKs provide an additional layer and are hosted on the client side, a command-line interface may be available on the server side and provide fewer variables to go wrong. With detailed logging and request-response capture, the CLI and SDK help ease the calls made to the services.
One of the restrictions that comes with packaged queries exported via REST APIs is their ability to scale, since they consume significant resources on the backend and may run for a long time. These restrictions cannot be relaxed without some reduction in their resource usage. The API must provide a way for consumers to launch several queries with trackers, and they should be completed reliably even if they are done one by one. This is facilitated with the help of a reference to the query and a progress indicator. The reference is merely an opaque identifier that only the system issues and uses to look up the status. The indicator could be another API that takes the reference and returns the status. It is relatively easy for the system to separate read-only status information from read-write operations, so the number of times the status indicator is called does not degrade the rest of the system. There is a clean separation of the status-information part of the system, which is usually periodically collected or pushed, from the rest of the system. The separation of read-write from read-only also allows the two to be treated differently. For example, it is possible to replace the technology for the read-only path separately from the technology for read-write. Even the technology for read-only can be swapped from one to another for improvements on that side.
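A minimal sketch of such a surface is shown below, assuming a hypothetical PackagedQueryService; the names, enum values and fields are placeholders meant only to illustrate the opaque reference and the read-only status lookup.

// Hypothetical surface for long-running packaged queries: submission returns an
// opaque reference, and a separate read-only call reports progress.
public interface PackagedQueryService {

    enum QueryState { PENDING, RUNNING, SUCCEEDED, FAILED }

    // Immutable progress snapshot returned by the read-only status call.
    final class QueryStatus {
        public final QueryState state;
        public final int percentComplete;
        public QueryStatus(QueryState state, int percentComplete) {
            this.state = state;
            this.percentComplete = percentComplete;
        }
    }

    // Launches the query and returns an opaque identifier issued by the system.
    String submit(String packagedQueryName);

    // Read-only lookup that can be polled without degrading the read-write path.
    QueryStatus status(String queryReference);
}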

Thursday, July 30, 2020

Support for small to large footprint introspection database and query

The REST API is a front for other tools and packages, such as a command-line interface (CLI) and an SDK, which enable programmability for packaged queries of the introspection store. An SDK does away with the requirement on the client to make the REST calls directly. This is a best practice favored by many cloud providers who know their cloud computing capability will be called from a variety of clients and devices. A variety of programming languages can be used to provide the SDK, which widens the client base. The SDKs are provided as a mere convenience, so this could be something that the developer community might initiate and maintain. Another benefit to using the SDK is that the API versioning and service maintenance onus is somewhat reduced, because more customers use their language-friendly SDK while the internal wire-call compatibility may change. SDKs also support platform-independent and interoperable programming, which helps testing and automation.
Similarly, a command-line interface (CLI) also broadens the customer base. Tests and automations can make use of commands and scripts to drive the service. This is a very powerful technique for administrators as well as for day-to-day usage by end users. Command-line usage has one additional benefit: it can be used for troubleshooting and diagnostics. While SDKs provide an additional layer and are hosted on the client side, a command-line interface may be available on the server side and provide fewer variables to go wrong. With detailed logging and request-response capture, the CLI and SDK help ease the calls made to the services.

Wednesday, July 29, 2020

Support for small to large footprint introspection database and query

The application that uses the introspection store may access it over APIs. In that case, the execution of a packaged query could imply launching a Flink job on a cluster and processing the results of the job. This is typical for any API implementation that involves asynchronous processing. 
The execution of a query can take an arbitrary amount of time. The REST-based API implementation does not have to be blocking in nature, since the call sequences in the underlying analytical and storage systems are also asynchronous. The REST API can provide a handle and a status checker, or a webhook for completion reporting. This kind of mechanism is common at any layer and is also available from REST.
The ability to return the results of execution inline in the HTTP response is specific to the API implementation. Even if the results are large, the API can return a file stream so that the caller can process the response at their end. The ability to upload and download files is easily implementable for the API with the use of a suitable parameter and return value for the corresponding calls.
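As an illustration, a client could stream a large result directly to a local file with the standard Java HTTP client; the endpoint is a placeholder and the snippet assumes the API exposes the result as a downloadable resource.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class ResultDownloader {
    // Streams a large query result to a local file instead of buffering it in memory.
    public static Path download(URI resultUri, Path target) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(resultUri).GET().build();
        HttpResponse<Path> response = http.send(request, HttpResponse.BodyHandlers.ofFile(target));
        return response.body();
    }
}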
The use of upload APIs is also useful for uploading analytical queries in the form of Flink artifacts that the API layer can forward and execute on a cluster. The results of the query can then similarly be passed back to the user. The result of the processing is typically a summary and will generally be smaller in size than the overall data on which the query is run. This makes it convenient for the API to return the results in one call.
The query-processing ability of the API will also help automations that rely on scripts run from the command line. Administrators generally prefer invocation from the command line whenever possible. A health check that queries the introspection datastore, as described earlier, is not only authoritative but also richer, since all of the components contribute to the data.

Tuesday, July 28, 2020

Support for small to large footprint introspection database and query

The propagation of information from the introspection store along channels other than the send-home channel will help augment the value for their subscribers. 
Finally, there is a benefit to including packaged queries with the introspection store in high-end deployments. Flink applications are known for improved programmability with both stream and batch processing. Packaged queries can present summary information from introspection stores. 
The use of windows helps process the events in a sequential manner. The order is maintained with the help of virtual time, and this helps with distributed processing as well. It is a significant win over traditional batch processing because the events are continuously ingested and the latency for the results is low.
Stream processing is facilitated with the help of the DataStream API instead of the DataSet API; the latter is more suitable for batch processing. Both APIs operate on the same runtime that handles the distributed streaming data flow. A streaming data flow consists of one or more transformations between a source and a sink.
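A minimal Flink DataStream sketch of a source, a transformation and a sink might look like the following; the in-memory source and console sink are stand-ins for whatever connectors a packaged query would actually use.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IntrospectionSummaryJob {
    public static void main(String[] args) throws Exception {
        // One runtime handles the distributed streaming data flow: source -> transformation -> sink.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("segment-created", "segment-sealed", "scale-up")  // source
           .map(String::toUpperCase)                                       // transformation
           .print();                                                       // sink

        env.execute("introspection-summary");
    }
}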
The difference between Kafka and Flink is that Kafka is better suited for data pipelines while Flink is better suited for streaming applications. Kafka tends to reduce the batch size for processing. Flink tends to run an operator on a continuous basis, processing one record at a time. Both are capable of running in standalone mode. Kafka may be deployed in a high-availability cluster mode, whereas Flink can scale up in standalone mode because it allows iterative processing to take place on the same node. Flink manages its memory requirements better because it can perform processing on a single node without requiring cluster involvement. It has a dedicated master node for coordination.
Both systems are able to execute with stateful processing. They can persist their state to allow processing to pick up where it left off. The state also helps with fault tolerance. This persistence of state protects against failures, including data loss. The consistency of the state can also be independently validated with a checkpointing mechanism available from Flink. Checkpointing can persist the local state to a remote store. Stream processing applications often take in the incoming events from an event log. This event log stores and distributes event streams which are written to a durable append-only log on tier 2 storage, where they remain sequential by time. Flink can recover a stateful streaming application by restoring its state from a previous checkpoint. It will adjust the read position on the event log to match the state from the checkpoint. Stateful stream processing is suited not only for fault tolerance but also for reentrant processing and improved robustness with the ability to make corrections. It has become the norm for event-driven applications, data pipeline applications and data analytics applications.
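Enabling checkpointing on such a job is a small configuration step; the interval and mode below are arbitrary choices, and where the snapshots are stored is configured separately and varies by Flink version.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJobConfig {
    public static StreamExecutionEnvironment configure() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Take a consistent snapshot of operator state every 10 seconds.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);
        // The checkpoint storage location (local or remote durable storage) is configured
        // separately, e.g. via a state backend or checkpoint storage setting.
        return env;
    }
}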
Whether Kafka applications or Flink applications are bundled with the introspection store can depend on the usage and preference of the administrator. The benefit to the administrator in both cases is that no code has to be written for the analytics. 

Monday, July 27, 2020

Support for small to large footprint introspection database and query

Another innovation for the introspection store is dynamic instantiation on administrator request. When the deployment size of the overall system is small, the size of the introspection store might be highly restricted. It might be beneficial to instantiate the store only as and when needed. If the store were considered to be a file, this is the equivalent of writing the diagnostics operation information out to the file. Higher-end deployments can repurpose a stream to store the periodically or continuously collected operation information. Smaller deployments can materialize the store for the administrator on demand.
Similarly, transformation and preparation of system collected information can happen within the introspection store for the high-end systems and offloaded to external stacks at the lower end. The exposure of introspection datastore in external data pipelines is made possible with direct access or copying. 
These results are also useful for statistical processing over historical trends, which may require special-purpose logic that would not necessarily need to be rewritten outside the store by consumers. If the logic involves external analytical packages, it can be made available via the analysis platform that the regular data is used with. 
The introspection store could also benefit from its own importers and exporters for high-end systems so that it remains the source of truth for a variety of audiences. Administrators are not the only ones to benefit from the introspection store. Users and applications of the system can access the information in the store via programmatic or tool-based means, so long as this shared container cannot be polluted by their access. Existing health reporting, monitoring and metrics may derive secondary, information-providing methods from this store, so users continue to get the benefit when they do not need to know that such a store exists but are interested in querying specific runtime information of the system.
Appenders and relayers are patterns that allow simultaneous and piped conversion of data to different formats. A log appender writes data to console, file and other destinations simultaneously. Text data can be formatted through a JSON relayer to form JSON, which can then be converted to XML in a staged progression. These mechanisms also apply to data in the introspection store and can be performed by services operating on the store.
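The sketch below illustrates the appender and relayer shapes in plain Java; it is not tied to any particular logging library, and the trivial text-to-JSON conversion stands in for a real staged progression.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Illustrative appender/relayer shapes, not an actual logging framework API.
interface Appender {
    void append(String record);
}

class ConsoleAppender implements Appender {
    public void append(String record) { System.out.println(record); }
}

class FileAppender implements Appender {
    private final Path file;
    FileAppender(Path file) { this.file = file; }
    public void append(String record) {
        try {
            Files.writeString(file, record + System.lineSeparator(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

// Fans a record out to several destinations simultaneously, relaying it through a
// staged conversion (here, a trivial text-to-JSON wrapper).
class RelayingAppender implements Appender {
    private final List<Appender> destinations;
    RelayingAppender(List<Appender> destinations) { this.destinations = destinations; }
    public void append(String record) {
        String asJson = "{\"message\":\"" + record.replace("\"", "\\\"") + "\"}";
        destinations.forEach(d -> d.append(asJson));
    }
}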
Full-service stores come with the benefit that the data is not only prepared, saved and sent from the store but can also be extracted, transformed and loaded in ways suitable to subscribers. The preparation of data in different data stores, including the introspection data store, via common publishers from the system itself is a feature that enables consumers to get more from the same data, so they have to do little, if any, work. The ability of the product or solution to take on services on behalf of its audience is what is referred to as full-service. The introspection store can certainly participate in full-service solutions.

Sunday, July 26, 2020

Keyword detection

There are test-driven, non-academic approaches that deserve a mention. In the section below, I present two of these approaches not covered elsewhere in the literature. They come from the recognition by NLP testers that the prevalent techniques use a corpus or a training set and that neither is truly universal. Just like a compression algorithm does not work universally for high-entropy data, a keyword-detection algorithm will likely not be universal. These approaches help draw the solution closer to the "good enough" bar for themselves.
First approach: any given text can be divided into three different sections: 1) sections with a low number of salient keywords, 2) sections with a high number of salient keywords, and 3) sections with a mixed number of non-salient keywords.
Out of these, if an algorithm can work well for 1) and 2), then section 3) can be omitted altogether to meet the acceptance criteria mentioned. If the algorithm needs to be different for 1) than for 2), then the common subset of keywords between the two will likely be a better outcome than either of them independently.
Second approach: treat the data set as a unit to be run through merely different clusterers. Each clusterer can take any approach to vector representation, such as:
involving different metrics such as mutual information, or themes such as syntax, semantics, location, statistical or latent semantics, or word embeddings,
requiring multiple passes over the same text,
performing multiple levels of analysis,
treating newer approaches, including the dynamic grouping approach of treating a different selection of keywords as a clusterer by itself, where the groups representing salient topics serve as pseudo-keywords, and
defining clusters as having a set of terms as centroids.
Then the common keywords detected by these clusterers allow the outcome of this approach to be a better representation of the sample.
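The consensus step of the second approach reduces to a set intersection, as in the sketch below, which assumes each clusterer's output has already been reduced to a set of keywords.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ClustererConsensus {
    // Returns the keywords that every clusterer agrees on; each input set is the
    // output of one clusterer run over the same text.
    public static Set<String> commonKeywords(List<Set<String>> clustererOutputs) {
        if (clustererOutputs.isEmpty()) {
            return Set.of();
        }
        Set<String> common = new HashSet<>(clustererOutputs.get(0));
        for (Set<String> output : clustererOutputs.subList(1, clustererOutputs.size())) {
            common.retainAll(output);
        }
        return common;
    }
}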

Saturday, July 25, 2020

Support for small to large footprint introspection database and query.



Introspection comprises the runtime operational states, metrics, notifications and smartness saved from the ongoing activities of the system. Since the system can be deployed in different modes and with different configurations, this dedicated container and query can take different shapes and forms. The smallest footprint may merely be similar to log files or JSON data, while the largest can even open up the querying to different programming languages and tools. Existing hosting infrastructure such as a container orchestration framework and the etcd database may provide excellent existing publication channels that could do away with this store and query altogether, but they rely heavily on network connectivity, and it is worth noting that the introspection datastore does not compete with them but only enhances offline production support. 
The depth of introspection is also unparalleled, with dynamic management views that are simply not possible with third-party infrastructure. The way to get the information is probably best known only to the storage system, and it improves the offering from the storage product. 
There are ways to implement the introspection suitable to the size of deployment of the overall system. Mostly these are incremental add-ons with the growth of the system, except at the extremes of tiny and uber deployments. For example, while the introspection store may save runtime information only periodically for small deployments, it can do so continuously for large deployments. The runtime information gathered could also be more expansive as the size of the deployment of the system grows. The gathering of runtime information could also expand to more data sources as available from a given mode of deployment of the same size. The Kubernetes mode of deployment usually has many deployments and statefulsets, and the information on those may be available from custom resources as queried from the kube-apiserver and the etcd database. The introspection store is a container within the storage product, so it is elastic and can accommodate the data from various deployment modes and sizes. Only for the tiniest deployments, where repurposing a container for introspection cannot be accommodated, does a change of format for the introspection store become necessary. On the other end of the extreme, the introspection store can include not just the dedicated store container but also snapshots of other runtime data stores as available. A cloud-scale service to virtualize these hybrid data stores for introspection queries could then be provided. These illustrate a sliding scale of options available for different deployment modes and sizes. 
Another innovation for the introspection store is dynamic instantiation on administrator request. When the deployment size of the overall system is small, the size of the introspection store might be highly restricted. It might be beneficial to instantiate the store only as and when needed. If the store were considered to be a file, this is the equivalent of writing the diagnostics operation information out to the file. Higher-end deployments can repurpose a stream to store the periodically or continuously collected operation information. Smaller deployments can materialize the store for the administrator on demand.

Friday, July 24, 2020

Support for small to large footprint introspection database and query


Introspection comprises the runtime operational states, metrics, notifications and smartness saved from the ongoing activities of the system. Since the system can be deployed in different modes and with different configurations, this dedicated container and query can take different shapes and forms. 

The smallest footprint may merely be similar to log files or JSON data, while the largest can even open up the querying to different programming languages and tools. Existing hosting infrastructure such as a container orchestration framework and the etcd database may provide excellent existing publication channels that could do away with this store and query altogether, but they rely heavily on network connectivity, and it is worth noting that the introspection datastore does not compete with them but only enhances offline production support. 

The depth of introspection is also unparalleled, with dynamic management views that are simply not possible with third-party infrastructure. The way to get the information is probably best known only to the storage system, and it improves the offering from the storage product. 

Introspection is the way in which the software maker uses the features that were developed for the consumers of the product for themselves, so that they can expand the capabilities to provide even more assistance and usability to the user. In some sense, this is automation of workflows combined with the specific use of the product as a pseudo end-user. This automation is also called 'dogfooding' because it relates specifically to the maker utilizing the product itself. The idea of putting oneself in the customer's shoes to improve automation is not new in itself. When the product has many layers internally, a component in one layer may reach a higher layer that is visible to another standalone component in the same layer so that interaction may occur between otherwise isolated components. This is typical for layered communication. However, the term 'dogfooding' is generally applied to the use of features available from the boundary of the product shared with external customers.


Thursday, July 23, 2020

Deployment

Deploying Pravega manually on Kubernetes:
 
1) Install Bookkeeper
 
helm install bookkeeper-operator charts/bookkeeper-operator
Verify that the Bookkeeper Operator is running.
 
$ kubectl get deploy
NAME                          DESIRED   CURRENT   UP-TO-DATE   AVAILABLE     AGE
bookkeeper-operator              1         1         1            1          17s
Install the Operator in Test Mode
The Operator can be run in test mode if we want to deploy the Bookkeeper cluster on minikube or on a cluster with very limited resources by setting testmode: true in the values.yaml file. An Operator running in test mode skips the minimum replica requirement checks. Test mode provides a bare minimum setup and is not recommended for use in production environments.
 
2) Install Zookeeper
 
helm install zookeeper-operator charts/zookeeper-operator
Verify that the Zookeeper Operator is running.
 
$ kubectl get deploy
NAME                 DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
zookeeper-operator   1         1         1            1           12s
 
3) Install Pravega
 
Install the Operator
 
Deploying in Test Mode
The Operator can be run in a "test" mode if we want to create a Pravega cluster on minikube or on a cluster with very limited resources by enabling testmode: true in the values.yaml file. An Operator running in test mode skips the minimum replica requirement checks on Pravega components. "Test" mode ensures a bare minimum setup of Pravega and is not recommended for use in production environments.
Using test mode, we can even specify a custom Pravega version that is skipped from the webhook validations.
 
Install a sample Pravega cluster
Set up Tier 2 Storage
Pravega requires a long term storage provider known as longtermStorage.
 
Check out the available options for long term storage and how to configure it.
 
For demo purposes, an NFS server can be installed. 
 
$ helm install stable/nfs-server-provisioner --generate-name
 
helm install pravega-operator charts/pravega-operator
Verify that the pravega Operator is running.
 
$ kubectl get deploy
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
pravega-operator   1         1         1            1           13s

Please note that the Kubernetes mode differs from the manual steps for VM-mode installation in the provisioning of K8s services. 

Wednesday, July 22, 2020

TLS

In this section, we talk about securing the Pravega controller service with TLS. Pravega is a stream store that enables data to be appended continuously to a stream, and TLS stands for Transport Layer Security, a protocol that encrypts data in transit between the controller and the client. The documentation for Pravega talks about using keys, certificates, keystores and truststores to secure the controller. It also provides samples out of the box that can be tried with the in-memory standalone mode of deployment for Pravega.
The settings controller.auth.tlsEnabled=true and controller.auth.tlsCertFile=/etc/secrets/cert.pem are required to secure the controller service. They can also be passed in via environment variables or JVM system properties.
The certificate that the client uses to connect with the server needs to be imported into the truststore on the server side. The keytool command can be used for this purpose as follows.
 
Care must be taken to ensure that the certificates used meet the following criteria: 1) the client sends a certificate and the server accepts it based on a known certificate authority; the client and server certificates do not have to be signed by the same certificate authority, but signed certificates are universally accepted, 2) the certificates are properly deployed on the hosts and runtime for the client and the server, and 3) the certificates are not expired.

When the certificates and deployment are proper, the controller URI is available as tls://ip:port instead of tcp://ip:port, which can then be used programmatically to make connections such as with:
ClientConfig clientConfig = ClientConfig.builder()
              .controllerURI(URI.create("tls://A.B.C.D:9090"))
              .trustStore("/etc/ssl")
              .credentials(new DefaultCredentials("password", "username"))
              .build();

This mode of deployment will be different if the controller is behind a service such as when deployed on a container orchestration framework. For example, in the case of Kubernetes, it is not necessary to configure the controller with tls if the corresponding K8s service is also provisioned with load balancing and TLS. In such a case, the TLS encryption stops at the K8s service and the communication between the service and the controller will remain internal.
 
With the suitable controllerUri established for the clients to connect, the reading and writing of events to the stream should work the same. Commonly encountered exceptions include the following:
Caused by: java.security.cert.CertificateException: No subject alternative names matching IP address 10.10.10.10 found
        at sun.security.util.HostnameChecker.matchIP(HostnameChecker.java:168)
        at sun.security.util.HostnameChecker.match(HostnameChecker.java:94)
        at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455)
        at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:436)
        at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:252)
        at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136)
        at io.netty.handler.ssl.ReferenceCountedOpenSslClientContext$ExtendedTrustManagerVerifyCallback.verify(ReferenceCountedOpenSslClientContext.java:237)
        at io.netty.handler.ssl.ReferenceCountedOpenSslContext$AbstractCertificateVerifier.verify(ReferenceCountedOpenSslContext.java:625)
        ... 26 more
which indicates that the certificate does not specify a subject alternative name matching the address that the client uses to connect. The usual mitigation for this is to disable hostname verification or to have the appropriate alternative name specified in the certificate and imported into the truststore.

Tuesday, July 21, 2020

Repackaging and Refactoring continued

As the repackaging is drawn out, there will be interfaces and dependencies that will be exposed which were previously taken for granted. These provide an opportunity to be declared via tests that can ensure compatibility as the system changes. 
With the popularity of docker images, the application is more likely required to pull in the libraries for the configurable storage providers even before the container starts. This provides an opportunity for the application to try out various deployment modes with the new packaging and code.
One of the programmability appeals of deployment is the use of containers either directly on a container orchestration framework or on a platform that works with containers. These can test the usages of application against dependencies. The rest of the application can be tested best by running in standalone mode.
There are two factors that determine the ease and reliability of the rewrite of a component.
First is the enumeration of dependencies and their interactions. For example, if the component is used as part of a collection or invoked as part of a stack, then the dependencies and interactions are not called out, and the component will be required to be present in each of the usages. On the other hand, loading it and calling it only when required separates out the dependencies and interactions, enabling more efficiency.
Second, the component may be calling other components and requiring them for part or all of its operations. If they can be scoped to their usages, then the component has the opportunity to reorganize some of the usages and streamline its dependencies. These outbound references, when avoided, improve the overall reliability of the component without any loss of compatibility between revisions.
The repackaging is inclusive of the movement of callers, including tests, so certain classes and methods may need to be made visible for testing, which is preferable over duplication of code. There are some usages that will still need to be retained when they are used in a shared manner, and those can be favored to be in their own containers.

Monday, July 20, 2020

Repackaging and Refactoring


Software applications take on a lot of dependencies that go through revisions independent from the applications. These dependencies are fortunately external so the application can defer the upgrade of a dependency by holding on to a version but hopefully not for long otherwise it incurs technical debt. 
The application itself could be composed of many components that could be anywhere in the application architecture and hierarchy. When these components undergo rewrite or movement in the layers, then the changes could impact the overall functionality even when there is no difference in logic and purpose.
One way to eliminate the concerns from unwanted regressions in the application behavior, is to position the components where they can change without impacting the rest of the organization. 
For example, an application may depend on one or more storage providers. These providers are usually loaded by configuration at the time the application is launched. Since any of the storage provider could be used, the application ships all of them in a Java jar file. 
In this case, the positioning of the components could be such that the application separates the storage provider with a contract that can be fulfilled by in-house dependencies or those written externally and available to be imported via standard Java library dependency.
The contract itself could be in its own jar file so that the application is only aware of the contract and not the storage providers, which gives the application more flexibility to focus on its business logic rather than runtime, host and operational considerations.
The storage providers could now be segregated and provided in their own library, which can then be established as a compile-time or, preferably, runtime dependency for the application. This gives the opportunity to test those storage providers more thoroughly and independently of changes in the application.
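A sketch of such a contract is shown below; the interface, method names and the ServiceLoader-based lookup are assumptions meant to illustrate how the application can select a provider from configuration without linking to any implementation directly.

import java.io.InputStream;
import java.util.ServiceLoader;

// The application depends only on this interface; concrete storage providers ship in
// separate libraries and are discovered at runtime.
public interface StorageProvider {

    // A short identifier such as "filesystem", "hdfs" or "s3" used to select the provider.
    String name();

    InputStream read(String key) throws Exception;

    void write(String key, InputStream data) throws Exception;

    // Resolves the provider named in configuration from whatever implementations are
    // present on the classpath, without the application referencing any of them directly.
    static StorageProvider fromConfiguration(String configuredName) {
        for (StorageProvider provider : ServiceLoader.load(StorageProvider.class)) {
            if (provider.name().equals(configuredName)) {
                return provider;
            }
        }
        throw new IllegalArgumentException("No storage provider named " + configuredName);
    }
}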
The cost of this transformation is typically the repackaging, refactoring and reworking of the tests for the original libraries. The last is the hidden cost, which needs to be expressed not only as the changes to the existing tests but also as the introduction of new, more specific tests for the individual storage providers that were not available earlier.
The tests also vary quite a bit because they are not restricted to targeting just the components. They could be testing integration which spans many layers and consequently even more dependencies than the code with which the application runs. 
As the repackaging is drawn out, there will be interfaces and dependencies that will be exposed which were previously taken for granted. These provide an opportunity to be declared via tests that can ensure compatibility as the system changes. 
With the popularity of docker images, the application is more likely required to pull in the libraries for the configurable storage providers even before the container starts. This provides an opportunity for the application to try out various deployment modes with the new packaging and code.

Sunday, July 19, 2020

Stream store discussion continued

The notification system is complementary to the health reporting stack but not necessarily a substitute. This document positions the notification system component of the product as a must-have component which should work well with existing subscriber plugins. It also exists side by side with metrics publishers and subscribers.
The publisher-subscriber pattern based on observable events in a stream suits the notification system very well. Persistence of notifications also enables web-based access to notifications that can be programmed to be delivered in more than one form to a customer as per the compatibility of the device owned by the customer. A stream store proves to be better suited for the messaging service, so as a storage solution it can provide this service out of the box.
Classes:
Sensor Manager:
Refreshes the list of active sensors by getting the entire set from SensorScanner.
Creates a sensor task for each of the active sensors and cancels those for the inactive ones
SensorTask
Retrieves sensors from persistence layer
Persists sensor to persistence layer
Determines the sensor alert association
SensorScanner
Maintains the sensors as per the storage container hierarchy in the storage system
Periodically updates the sensor manager
SensorClient
Creates/gets/updates/deletes a sensor
AlertClient
Creates/gets/updates/deletes an alert
AlertAction
Changes state of an alert by triggering the operation associated
NotificationList
Creates a list of notification objects and implements the set interface
Applies a function parameter on the data from the set of notification objects
Applies predefined standard query operators on the data from the set of notification objects
Notification
Data structure for holding the value corresponding to a point of time.
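A minimal sketch of the Notification and NotificationList classes listed above is shown below; the field names and method shapes are assumptions for illustration.

import java.time.Instant;
import java.util.AbstractSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

// Data structure for holding the value corresponding to a point of time.
class Notification {
    final Instant time;
    final double value;
    Notification(Instant time, double value) {
        this.time = time;
        this.value = value;
    }
}

// Implements the Set interface over notification objects and applies a function
// parameter over their data, mirroring the responsibilities listed above.
class NotificationList extends AbstractSet<Notification> {
    private final Set<Notification> notifications = new LinkedHashSet<>();

    @Override public boolean add(Notification n) { return notifications.add(n); }
    @Override public Iterator<Notification> iterator() { return notifications.iterator(); }
    @Override public int size() { return notifications.size(); }

    // Applies a caller-supplied function to each notification's value.
    <R> Set<R> apply(Function<Double, R> f) {
        return notifications.stream().map(n -> f.apply(n.value)).collect(Collectors.toSet());
    }
}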

Unit-Test:
Evaluator evaluator = sensor.getEvaluatorRetriever().getEvaluator("alertpolicy1");
SensorStateInformation stateInformation = evaluator.evaluate(sensor, threshold);







Saturday, July 18, 2020

Stream store discussion continued

Design of alerts and notifications system for any storage engineering product. 

Storage engineering products publish alerts and notifications that interested parties can subscribe to and take actions on under specific conditions. This lets them remain hands-off as the product continues to serve the entire organization. 

The notifications are purely a client-side concept because a periodic polling agent that watches for a number of events and conditions can raise them. The store does not have to persist any information. Yet the state associated with a storage container can also be synchronized by persistence instead of being retained purely as an in-memory data structure. This alleviates the need for the users to keep track of the notifications themselves. Even though the last notification is typically the most actionable one, the historical trend from persisted notifications gives an indication of the frequency of adjustments. 

The notifications can also be improved to come from various control plane activities as they are most authoritative on when certain conditions occur within the storage system. These notifications can then be run against the rules specified by the user or administrator so that only a filtered subset is brought to the attention of the user. 

Notifications could also be improved to come from the entire hierarchy of containers within a storage product for all operations associated with them. They need not just be made available on the user interactions. They can be made available via a subscriber interface that can be accessed globally. 

There may be questions on why information on the storage engineering product needs to come from notifications as opposed to metrics, which are suited for charts and graphs via an existing time-series database and reporting stack. The answer is quite simple. The storage engineering product is a veritable storage and time-series database in its own right and should be capable of storing both metrics and notifications. All the notifications are events, and those events are also as continuous as the data that generates them. They can be persisted in the store itself. Data does not become redundant when it is stored in both formats. Instead, one system caters to the in-store evaluation of rules that trigger only the alerts necessary for humans, and another handles the more continuous machine data that can be offloaded for persistence and analysis to external dedicated metrics stacks.  

When the events are persisted in the store itself, the publisher-subscriber interface then becomes similar to the writer-reader interface that the store already supports. The stack that analyzes and reports the data can read directly from the store. Should this store container for internal events be hidden from the public, a publisher-subscriber interface would be helpful. The ability to keep the notification container internal enables the store to clean up as necessary. Persistence of events also helps with offline validation and introspection for assistance with product support. 
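A sketch of that writer-reader symmetry using the public Pravega client API is shown below; the scope, stream and reader group names are placeholders, and the scope and stream are assumed to already exist.

import java.net.URI;
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.ReaderConfig;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.impl.UTF8StringSerializer;

public class NotificationStreamExample {
    public static void main(String[] args) throws Exception {
        // Assumes a scope "diagnostics" and a stream "notifications" already exist.
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090"))
                .build();

        try (EventStreamClientFactory factory = EventStreamClientFactory.withScope("diagnostics", config);
             ReaderGroupManager groupManager = ReaderGroupManager.withScope("diagnostics", config)) {

            // The publisher side is just a writer appending events to the stream.
            EventStreamWriter<String> writer = factory.createEventWriter(
                    "notifications", new UTF8StringSerializer(), EventWriterConfig.builder().build());
            writer.writeEvent("capacity", "{\"condition\":\"nearing-full\",\"value\":0.85}");
            writer.close();

            // The subscriber side is a reader group reading the same stream.
            groupManager.createReaderGroup("notificationSubscribers",
                    ReaderGroupConfig.builder().stream(Stream.of("diagnostics", "notifications")).build());
            EventStreamReader<String> reader = factory.createReader(
                    "subscriber-1", "notificationSubscribers",
                    new UTF8StringSerializer(), ReaderConfig.builder().build());
            EventRead<String> event = reader.readNextEvent(2000);
            System.out.println("received: " + event.getEvent());
            reader.close();
        }
    }
}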

The notification system is complementary to the health reporting stack but not necessarily a substitute. This document positions the notification system component of the product as a must-have component which should work well with existing subscriber plugins. It also exists side by side with metrics publishers and subscribers. 

Friday, July 17, 2020

Stream store discussion continued

The Pravega stream store already has an implementation of ByteArraySegment that allows segmenting a byte array and operating only on that segment. This utility can help view collections of events in memory and write them out to another stream. It extends AbstractBufferView and implements the ArrayView interface. These interfaces already support copyTo methods that can take a set of events represented as a ByteBuffer and copy it to a destination. If copying needs to be optimized, the BufferView interface provides a reader that can copy into another instance of BufferView. The ArrayView interface provides an index-addressable collection of ByteBuffers with methods for slice and copy. The ByteArraySegment provides a writer that can be used to write contents to this instance and a reader that can copy into another instance. It has methods for creating a smaller ByteArraySegment and copying the instance.
The segment store takes wire commands, such as appending an event to a stream, as part of its wire service protocol. These commands operate on an event or at best a segment at a time, but the operations they provide cover the full set of create, update, read and delete. Some of the entries are based on table segments and require special casing because they are essentially metadata in the form of a key-value store dedicated to stream, transaction and segment metadata. The existing wire commands serve as an example for extensions to segment ranges or even streams. Even the existing wire commands may be sufficient to handle segment ranges merely by referring to them by their metadata. For example, we can create metadata from existing metadata and make an altogether new stream. This operation will simply edit the metadata for the new stream to be different in its identity but retain the data references as much as possible. Streams are kept as isolated from one another as possible, which requires data to be copied. Copying data is not necessarily bad or time-consuming. In fact, storage engineering has lowered the cost of storage as well as that of the activities involved, which speaks in favor of users and applications that want to combine lots of small streams into a big one or to make a copy of one stream for isolation of activities. Data pipeline activities such as these have a pretty good tolerance for stream processing operations in general. The stream store favors efficiency and data deduplication, but it participates in data pipeline activities. It advocates combining stream processing and batch processing using the same stream and enables writing each event exactly once regardless of failures in the sender, receiver and network. This core tenet remains the same as applications continue to demand their own streams, and in the case where they cannot share an existing stream, they will look for copyStream logic. 
Now coming back to other examples of commands and activities involving streams and segments, we include an example from integration tests as well. The StreamProducerOperationDataSource also operates on a variety of producer operation types. Some of these include create, merge and abort transactions. A transaction is its own writer of data because it commits all or none. These operations are performed independently, and the merge transaction comes close to combining the writes, which serves as a good example for the copy stream operation.

Thursday, July 16, 2020

Stream manager discussion continued


The following section describes the stream store implementation of copyStream.

Approach 1. The events are read by a batch client from the stream cuts pointing to the head and tail of the stream. The head and tail are determined at the start of the copy operation and do not accommodate any new events that may be written to the tail of the stream after the copy operation is initiated. The segment iterator initiated for the ranges copies the events one by one to the destination stream. When the last event has been written, the last segment is sealed and the destination stream is made available to the caller. The entire operation is re-entrant and idempotent if a portion of the stream has already been written.
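A sketch of Approach 1 using the public batch client API might look like the following; it assumes the destination stream already exists in the same scope, and it leaves out sealing the destination and making the copy re-entrant.

import io.pravega.client.BatchClientFactory;
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.batch.SegmentIterator;
import io.pravega.client.batch.SegmentRange;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.StreamCut;
import io.pravega.client.stream.impl.UTF8StringSerializer;
import java.util.Iterator;

public class CopyStreamSketch {
    // Copies events from a source stream to an existing destination stream using the
    // batch client. Head and tail are fixed at the start of the copy, as in Approach 1.
    public static void copy(String scope, String source, String destination, ClientConfig config) {
        UTF8StringSerializer serializer = new UTF8StringSerializer();
        try (BatchClientFactory batchClient = BatchClientFactory.withScope(scope, config);
             EventStreamClientFactory factory = EventStreamClientFactory.withScope(scope, config);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     destination, serializer, EventWriterConfig.builder().build())) {

            Iterator<SegmentRange> segments = batchClient.getSegments(
                    Stream.of(scope, source), StreamCut.UNBOUNDED, StreamCut.UNBOUNDED).getIterator();

            while (segments.hasNext()) {
                try (SegmentIterator<String> events = batchClient.readSegment(segments.next(), serializer)) {
                    while (events.hasNext()) {
                        writer.writeEvent(events.next());   // re-append each event to the destination
                    }
                }
            }
            writer.flush();
        }
    }
}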

Approach 2. The stream store does not copy events per se but segments since the segments point to ids, epoch and offset which are invariants between the copies. The segments are stored on tier 2 as files or blobs and they are copied with a change of name. The resulting set of segment ranges are then placed in the container for a new stream which is the equivalent of the folder or bucket on tier 2 and given a proper name for the stream. The entire operation can be scoped to a transaction so that all or none are copied.

Between the two approaches, the latter is closer to the segment store and probably more efficient. The metadata for the streams such as the watermarks and reader group states are removed. Only the data part of the streams is copied. The copied streams are then registered with the stream store.

The Pravega stream store already has an implementation of ByteArraySegment that allows segmenting a byte array and operating only on that segment. This utility can help view collections of events in memory and write them out to another stream. It extends AbstractBufferView and implements the ArrayView interface. These interfaces already support copyTo methods that can take a set of events represented as a ByteBuffer and copy it to a destination. If copying needs to be optimized, the BufferView interface provides a reader that can copy into another instance of BufferView. The ArrayView interface provides an index-addressable collection of ByteBuffers with methods for slice and copy. The ByteArraySegment provides a writer that can be used to write contents to this instance and a reader that can copy into another instance. It has methods for creating a smaller ByteArraySegment and copying the instance.

Wednesday, July 15, 2020

Reader Group Notifications continued

The following section describes the set of leads that can be drawn for investigations into notifications. First, the proper methods for determining the data with which the notifications are made need to be called correctly. The number of segments is a valid datapoint. It can come from many sources. The streamManager is authoritative, but the code that works with the segments might be at too low a level to make use of the streamManager. The appropriate methods to open the stream for finding the number of segments may be sufficient. If that is a costly operation, then the segments can be retrieved from, say, the reader group, but those secondary data structures must also be kept in sync with the source of truth. 

Second, the methods need to be called as frequently as the notifications are generated; otherwise the changes will not be detected. The code that generates the notifications is usually offline as compared to the online usage of the stream. This requires the data to be fetched as required for generating these notifications. 

Lastly, the notification system should not introduce unnecessary delay in sending the notifications. If there is a delay or a timeout, it becomes the same as not receiving the notification.

The ReaderGroupState is initialized at the time of the creation of reader group. It could additionally be initialized/updated at the time of reader creation. Both the reader Group and reader require the state synchronizer to be initialized so the synchronizer can be initialized with the new group state. This is not implemented in the Pravega source code yet. 
During the initialization of the ReaderGroupState, the segments are fetched from the controller and correspond to the segments at the current time. The controller also provides the ability to get all segments between start and end stream cuts. However, those segments are a range, not a map as desired.
The number of readers comes from the reader group. These data will remain the same on cross-thread calls and invocations from different levels of the stack. The readerGroup state is expected to be accurate for the purpose of sending notifications, but it can still be inaccurate if the state is initialized once and only updated subsequently without refreshing the state via initialization each time.
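For reference, the reader group API itself exposes a segment notifier; a registration along the lines of the documented pattern is sketched below with placeholder scope and group names, simply printing the reported segment and reader counts.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import io.pravega.client.ClientConfig;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.ReaderGroup;

public class SegmentNotificationListener {
    public static void listen(String scope, String readerGroupName, ClientConfig config) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        // Resources are kept open for as long as the application wants to keep listening.
        ReaderGroupManager groupManager = ReaderGroupManager.withScope(scope, config);
        ReaderGroup readerGroup = groupManager.getReaderGroup(readerGroupName);
        // The listener fires when the reader group's segment count changes, reporting
        // both the number of segments and the number of readers.
        readerGroup.getSegmentNotifier(executor).registerListener(notification ->
                System.out.println("segments=" + notification.getNumOfSegments()
                        + " readers=" + notification.getNumOfReaders()));
    }
}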

Tuesday, July 14, 2020

Reader Group notifications continued.

One of the best ways to detect all reader group segment changes is to check the number of segments at the time of admission of readers into the reader group.  

This would involve: 

Map<SegmentWithRange, Long> segments = ReaderGroupImpl.getSegmentsForStreams(controller, config); 

synchronizer.initialize(new ReaderGroupState.ReaderGroupStateInit(config, segments, getEndSegmentsForStreams(config))); 

Where the segments are read from the ReaderGroup which knows the segment map and the distribution of readers rather than getting the segments from the state. 

The following section describes the set of leads that can be drawn for investigations into notifications. First, the proper methods for determining the data with which the notifications are made need to be called correctly. The number of segments is a valid datapoint. It can come from many sources. The streamManager is authoritative, but the code that works with the segments might be at too low a level to make use of the streamManager. The appropriate methods to open the stream for finding the number of segments may be sufficient. If that is a costly operation, then the segments can be retrieved from, say, the reader group, but those secondary data structures must also be kept in sync with the source of truth. 

Second, the methods need to be called as frequently as the notifications are generated; otherwise the changes will not be detected. The code that generates the notifications is usually offline as compared to the online usage of the stream. This requires the data to be fetched as required for generating these notifications. 

Lastly, the notification system should not introduce unnecessary delay in sending the notifications. If there is a delay or a timeout, it becomes the same as not receiving the notification.