Monday, July 27, 2020

Support for small to large footprint introspection database and query

Another innovation for the introspection store is dynamic instantiation on administrator request. When the deployment size of the overall system is small, the size of the introspection store might be highly restricted, so it may be beneficial to instantiate the store only as and when needed. If the store were considered to be a file, this is the equivalent of writing the diagnostics operation information out to that file. Higher-end deployments can repurpose a stream to hold the periodically or continuously collected operation information, while smaller deployments can materialize the store for the administrator on demand.
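As a rough illustration of this idea (the class names and file-based storage below are hypothetical, not from any particular product), a small deployment could buffer diagnostics in memory and materialize a file only when asked, while a larger one appends continuously:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical on-demand store: diagnostics are buffered in memory and only
// written out to a file when the administrator asks for a materialized view.
class OnDemandIntrospectionStore {
    private final List<String> buffered = new ArrayList<>();

    void record(String operationInfo) {
        buffered.add(operationInfo);
    }

    Path materialize(Path target) throws IOException {
        // equivalent of writing the diagnostics operation information out to a file
        return Files.write(target, buffered, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
    }
}

// Hypothetical continuously written store for higher-end deployments: every
// record is appended immediately, so the file behaves like a stream.
class StreamingIntrospectionStore {
    private final Path stream;

    StreamingIntrospectionStore(Path stream) {
        this.stream = stream;
    }

    void record(String operationInfo) throws IOException {
        Files.write(stream, List.of(operationInfo), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}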
Similarly, transformation and preparation of system-collected information can happen within the introspection store for high-end systems and be offloaded to external stacks at the lower end. Exposure of the introspection datastore to external data pipelines is made possible with direct access or copying.
These results are also useful for statistical processing over historical trends, which may require special-purpose logic that consumers would not then need to rewrite outside the store. If the logic involves external analytical packages, it can be made available via the same analysis platform that the regular data is used with.
The introspection store could also benefit from its own importers and exporters for high-end systems so that it remains the source of truth for a variety of audiences. Administrators are not the only ones to benefit from the introspection store. Users and applications of the system can access the information in the store via programmatic or tool-based means, so long as this shared container cannot be polluted by their access. Existing health reporting, monitoring and metrics can derive secondary, information-providing methods from this store, so users continue to get the benefit even when they don’t need to know that such a store exists but are interested in querying specific runtime information of the system.
Appenders and relayers are patterns that allow simultaneous and piped conversion of data to different formats. A log appender writes data to console, file and other destinations simultaneously. Text data can be formatted through a JSON relayer to form JSON, which can then be converted to XML in a staged progression. These mechanisms also apply to data in the introspection store and can be performed by services operating on the store.
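A minimal sketch of these two patterns, with illustrative interfaces rather than the API of any specific logging library:

import java.util.List;
import java.util.function.Function;

// Illustrative appender: one record fans out to several destinations at once.
interface Appender {
    void append(String record);
}

class MultiAppender implements Appender {
    private final List<Appender> destinations; // console, file, introspection store, ...

    MultiAppender(List<Appender> destinations) {
        this.destinations = destinations;
    }

    @Override
    public void append(String record) {
        destinations.forEach(d -> d.append(record));
    }
}

// Illustrative relayer: staged, piped conversion such as text -> JSON -> XML.
// The conversions are deliberately simplified placeholders.
class Relayer {
    static final Function<String, String> TO_JSON = text -> "{\"message\":\"" + text + "\"}";
    static final Function<String, String> TO_XML = json -> "<record>" + json + "</record>";

    static String relay(String text) {
        return TO_JSON.andThen(TO_XML).apply(text);
    }
}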
Full-service stores come with the benefit that the data is not only prepared, saved and sent from the store but can also be extracted, transformed and loaded in ways suitable to subscribers. The ability of the product/solution to take on such services on behalf of its audience is what is referred to as full-service. The preparation of data in different data stores, including the introspection data store, via common publishers from the system itself is a feature that lets consumers get more from the same data while having to do little, if any, work. The introspection store can certainly participate in full-service solutions.

Sunday, July 26, 2020

Keyword detection

There are test-driven, non-academic approaches that deserve a mention. In the section below, I present two such approaches that are not covered elsewhere in the literature. They come from the recognition that NLP testers find a corpus or a training set in prevalent use, and neither is truly universal. Just as a compression algorithm does not work universally for high-entropy data, a keyword-detection algorithm will likely not be universal. These approaches help draw the solution closer to its own “good enough” bar.
First approach: any given text can be divided into three kinds of sections – 1) sections with a low number of salient keywords, 2) sections with a high number of salient keywords, and 3) sections with a mixed population of salient and non-salient keywords.
Out of these, if one algorithm can work well for both 1) and 2), then the sections in 3) can be omitted altogether to meet the acceptance criterion mentioned above. If the algorithm needs to be different for 1) than for 2), then the common subset of keywords detected by the two will likely be a better outcome than either of them independently, as sketched below.
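A minimal sketch of this intersection, assuming the per-section keyword sets have already been produced by the two specialized algorithms (the class and parameter names are hypothetical):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of the first approach: the mixed sections from 3) are omitted,
// and only the keywords agreed on by the two specialized passes are kept.
class SectionedKeywordDetector {
    static Set<String> detect(List<Set<String>> keywordsFromLowDensitySections,
                              List<Set<String>> keywordsFromHighDensitySections) {
        Set<String> fromLow = new HashSet<>();
        keywordsFromLowDensitySections.forEach(fromLow::addAll);   // algorithm tuned for 1)
        Set<String> fromHigh = new HashSet<>();
        keywordsFromHighDensitySections.forEach(fromHigh::addAll); // algorithm tuned for 2)
        fromLow.retainAll(fromHigh);                               // common subset of keywords
        return fromLow;
    }
}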
Second approach: treat the data set as a unit to be run through several different clusterers. Each clusterer can take any approach to vector representation, such as:
involving different metrics such as mutual information, or themes such as syntax, semantics, location, statistics or latent semantics, or word embeddings,
requiring multiple passes over the same text,
working at multiple levels of analysis,
adopting newer approaches, including the dynamic grouping approach of treating a different selection of keywords as a clusterer in itself, where the groups representing salient topics serve as pseudo-keywords, and
defining clusters as having a set of terms as centroids.
Then the keywords detected in common by these clusterers allow the outcome of this approach to be a better representation of the sample; a minimal sketch of this consensus follows.
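In the sketch below, each clusterer is abstracted as a function from text to candidate keywords; the class name and the strict all-clusterers agreement rule are illustrative choices, not a prescribed algorithm:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative consensus: each clusterer proposes a set of candidate keywords;
// only the terms proposed by every clusterer survive.
class ClustererConsensus {
    static Set<String> commonKeywords(String text,
                                      List<Function<String, Set<String>>> clusterers) {
        Map<String, Integer> votes = new HashMap<>();
        for (Function<String, Set<String>> clusterer : clusterers) {
            for (String keyword : clusterer.apply(text)) {
                votes.merge(keyword, 1, Integer::sum);
            }
        }
        return votes.entrySet().stream()
                .filter(e -> e.getValue() == clusterers.size()) // detected by all clusterers
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }
}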

Saturday, July 25, 2020

Support for small to large footprint introspection database and query.



Introspection is the runtime operational state, metrics, notifications and smartness saved from the ongoing activities of the system. Since the system can be deployed in different modes and with different configurations, this dedicated container and its query can take different shapes and forms. The smallest footprint may merely be similar to log files or JSON data, while the largest can even open up the querying to different programming languages and tools. Existing hosting infrastructure such as a container orchestration framework and the etcd database may provide excellent existing channels of publication that could do away with this store and query altogether, but they rely heavily on network connectivity, while we took the opportunity to discuss that the introspection datastore does not compete with but only enhances offline production support.
The depth of introspection is also unparalleled, with dynamic management views that are simply not possible with third-party infrastructure. The way to get this information is probably best known only to the storage system itself, and it improves the offering from the storage product.
There are ways to implement introspection suitable to the size of the deployment of the overall system. Mostly these are incremental add-ons with the growth of the system, except at the extremes of tiny and uber deployments. For example, while the introspection store may save runtime information only periodically for small deployments, it can do so continuously for large deployments. The runtime information gathered could also become more expansive as the size of the deployment grows. The gathering of runtime information could also expand to more data sources as they become available from a given mode of deployment of the same size. The Kubernetes mode of deployment usually has many deployments and statefulsets, and the information on those may be available from custom resources as queried from the kube-apiserver and the etcd database. The introspection store is a container within the storage product, so it is elastic and can accommodate the data from various deployment modes and sizes. Only for the tiniest deployments, where repurposing a container for introspection cannot be accommodated, does a change of format for the introspection store become necessary. At the other extreme, the introspection store can include not just the dedicated store container but also snapshots of other runtime data stores as available. A cloud-scale service to virtualize these hybrid data stores for introspection query could then be provided. These illustrate a sliding scale of options available for different deployment modes and sizes.
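A rough sketch of such a sliding scale, with hypothetical names, thresholds and sources chosen purely for illustration:

import java.time.Duration;

// Illustrative sliding scale: the deployment size picks how often runtime
// information is collected and whether Kubernetes resources are included.
enum DeploymentSize { TINY, SMALL, MEDIUM, LARGE }

class CollectionPolicy {
    final Duration interval;                    // Duration.ZERO stands for continuous collection
    final boolean includeKubernetesResources;   // deployments, statefulsets, custom resources

    private CollectionPolicy(Duration interval, boolean includeKubernetesResources) {
        this.interval = interval;
        this.includeKubernetesResources = includeKubernetesResources;
    }

    static CollectionPolicy forSize(DeploymentSize size) {
        switch (size) {
            case TINY:   return new CollectionPolicy(Duration.ofHours(1), false);
            case SMALL:  return new CollectionPolicy(Duration.ofMinutes(15), false);
            case MEDIUM: return new CollectionPolicy(Duration.ofMinutes(1), true);
            default:     return new CollectionPolicy(Duration.ZERO, true);   // LARGE
        }
    }
}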
Another innovation for the introspection store is dynamic instantiation on administrator request. When the deployment size of the overall system is small, the size of the introspection store might be highly restricted, so it may be beneficial to instantiate the store only as and when needed. If the store were considered to be a file, this is the equivalent of writing the diagnostics operation information out to that file. Higher-end deployments can repurpose a stream to hold the periodically or continuously collected operation information, while smaller deployments can materialize the store for the administrator on demand.

Friday, July 24, 2020

Support for small to large footprint introspection database and query


Introspection is the runtime operational state, metrics, notifications and smartness saved from the ongoing activities of the system. Since the system can be deployed in different modes and with different configurations, this dedicated container and its query can take different shapes and forms.

The smallest footprint may merely be similar to log files or JSON data, while the largest can even open up the querying to different programming languages and tools. Existing hosting infrastructure such as a container orchestration framework and the etcd database may provide excellent existing channels of publication that could do away with this store and query altogether, but they rely heavily on network connectivity, while we took the opportunity to discuss that the introspection datastore does not compete with but only enhances offline production support.

The depth of introspection is also unparalleled, with dynamic management views that are simply not possible with third-party infrastructure. The way to get this information is probably best known only to the storage system itself, and it improves the offering from the storage product.

Introspection is the way in which the software maker uses, for itself, the features that were developed for the consumers of the product, so that it can expand the capabilities to provide even more assistance and usability to the user. In some sense, this is automation of workflows combined with specific use of the product as a pseudo end-user. This automation is also called ‘dogfooding’ because it relates specifically to the maker utilizing its own product. The idea of putting oneself in the customer's shoes to improve automation is not new in itself. When the product has many layers internally, a component in one layer may reach a higher layer that is visible to another standalone component in the same layer, so that interaction may occur between otherwise isolated components. This is typical of layered communication. However, the term ‘dogfooding’ is generally applied to the use of features available at the boundary of the product shared with external customers.


Thursday, July 23, 2020

Deployment

Deploying Pravega manually on Kubernetes:
 
1) Install Bookkeeper
 
helm install bookkeeper-operator charts/bookkeeper-operator
Verify that the Bookkeeper Operator is running.
 
$ kubectl get deploy
NAME                          DESIRED   CURRENT   UP-TO-DATE   AVAILABLE     AGE
bookkeeper-operator              1         1         1            1          17s
Install the Operator in Test Mode
The Operator can be run in test mode if we want to deploy the Bookkeeper Cluster on minikube or on a cluster with very limited resources, by setting testmode: true in the values.yaml file. An Operator running in test mode skips the minimum replica requirement checks. Test mode provides a bare-minimum setup and is not recommended for production environments.
 
2) Install Zookeeper
 
helm install zookeeper-operator charts/zookeeper-operator
Verify that the zookeeper Operator is running.
 
$ kubectl get deploy
NAME                 DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
zookeeper-operator   1         1         1            1           12s
 
3) Install Pravega
 
Install the Operator
 
Deploying in Test Mode
The Operator can be run in "test" mode if we want to create Pravega on minikube or on a cluster with very limited resources, by enabling testmode: true in the values.yaml file. An Operator running in test mode skips the minimum replica requirement checks on Pravega components. "Test" mode ensures a bare-minimum setup of Pravega and is not recommended for production environments.
Using test mode, we can even specify a custom Pravega version that is skipped from the webhook's validations.
 
Install a sample Pravega cluster
Set up Tier 2 Storage
Pravega requires a long term storage provider known as longtermStorage.
 
Check out the available options for long term storage and how to configure it.
 
For demo purposes, an NFS server can be installed. 
 
$ helm install stable/nfs-server-provisioner --generate-name
 
helm install pravega-operator charts/pravega-operator
Verify that the pravega Operator is running.
 
$ kubectl get deploy
NAME                 DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
pravega-operator   1         1         1            1           13s
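 
With the operator verified, the sample cluster itself can then be installed. Assuming the repository also ships a charts/pravega chart following the same pattern as the operator charts above (an assumption here, with the Zookeeper URI and long term storage normally supplied through that chart's values.yaml), the command would look like:
 
helm install pravega charts/pravega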

Please note that the Kubernetes mode differs from the manual steps for a VM-mode installation in the provisioning of K8s services.

Wednesday, July 22, 2020

TLS

In this section, we talk about securing the Pravega controller service with TLS. Pravega is a stream store that enables data to be appended continuously to a stream, and TLS (Transport Layer Security) is a protocol that encrypts data in transit between the controller and the client. The documentation for Pravega talks about using keys, certificates, keystores and truststores to secure the controller. It also provides these as out-of-the-box samples that can be tried with the in-memory standalone mode of deployment for Pravega.
The settings controller.auth.tlsEnabled=true and controller.auth.tlsCertFile=/etc/secrets/cert.pem are required to secure the controller service. They can also be passed in via environment variables or JVM system properties.
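For example, when the controller runs as a standalone JVM process, the same settings can be supplied as system properties; the jar name below is only a placeholder:
 
java -Dcontroller.auth.tlsEnabled=true -Dcontroller.auth.tlsCertFile=/etc/secrets/cert.pem -jar pravega-controller.jar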
The certificate that the client uses to connect with the server needs to be imported into the truststore on the server side. The keytool command can be used for this purpose as follows.
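A typical invocation looks like the following, where the alias, certificate file and truststore names are placeholders to be replaced with the actual ones:
 
keytool -importcert -alias clientCert -file client-cert.pem -keystore server.truststore.jks -storepass changeit -noprompt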
 
Care must be taken to ensure that the certificates used meet the following criteria: 1) the client sends a certificate and the server accepts it based on a known certificate authority; the certificates do not have to be signed by the same certificate authority, but signed certificates are universally accepted; 2) the certificates are properly deployed on the hosts and runtimes for the client and the server; and 3) the certificates are not expired.

When the certificates and deployment are proper, the controller URI is available as tls://ip:port instead of tcp://ip:port, which can then be used programmatically to make connections, such as with:
ClientConfig clientConfig = ClientConfig.builder()
              .controllerURI(URI.create("tls://A.B.C.D:9090"))   // java.net.URI
              .trustStore("/etc/ssl")
              .credentials(new DefaultCredentials("password", "username"))
              .build();

This mode of deployment will be different if the controller is behind a service such as when deployed on a container orchestration framework. For example, in the case of Kubernetes, it is not necessary to configure the controller with tls if the corresponding K8s service is also provisioned with load balancing and TLS. In such a case, the TLS encryption stops at the K8s service and the communication between the service and the controller will remain internal.
 
With a suitable controllerURI established for the clients to connect to, reading and writing events to the stream should work the same. Commonly encountered exceptions include the following:
Caused by: java.security.cert.CertificateException: No subject alternative names matching IP address 10.10.10.10 found
        at sun.security.util.HostnameChecker.matchIP(HostnameChecker.java:168)
        at sun.security.util.HostnameChecker.match(HostnameChecker.java:94)
        at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455)
        at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:436)
        at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:252)
        at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136)
        at io.netty.handler.ssl.ReferenceCountedOpenSslClientContext$ExtendedTrustManagerVerifyCallback.verify(ReferenceCountedOpenSslClientContext.java:237)
        at io.netty.handler.ssl.ReferenceCountedOpenSslContext$AbstractCertificateVerifier.verify(ReferenceCountedOpenSslContext.java:625)
        ... 26 more
which indicates that the certificate does not carry a subject alternative name matching the address of the host that the client connects to. The usual mitigation for this is to disable hostname verification or to reissue the certificate with the required alternative name and import it into the appropriate truststore.
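One way to bake the required alternative names into a certificate at generation time is keytool's SAN extension; the alias, keystore and DNS name below are placeholders, and the IP matches the example exception above:
 
keytool -genkeypair -alias controller -keyalg RSA -keystore server.keystore.jks -ext SAN=IP:10.10.10.10,DNS:controller.example.com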

Tuesday, July 21, 2020

Repackaging and Refactoring continued

As the repackaging is drawn out, interfaces and dependencies that were previously taken for granted will be exposed. These provide an opportunity to be declared via tests that can ensure compatibility as the system changes.
With the popularity of Docker images, the application is more likely to be required to pull in the libraries for the configurable storage providers even before the container starts. This provides an opportunity for the application to try out various deployment modes with the new packaging and code.
One of the programmability appeals of deployment is the use of containers, either directly on a container orchestration framework or on a platform that works with containers. These can test the usages of the application against its dependencies. The rest of the application is best tested by running in standalone mode.
There are two factors that determine the ease and reliability of the rewrite of a component.
First is the enumeration of dependencies and their interactions. For example, if the component is used as part of a collection or invoked as part of a stack, then the dependencies and interactions are not called out and the component will be required to be present in each of the usages. On the other hand, loading it and calling it only when required separates out the dependencies and interactions, enabling more efficiency.
Second, the component may be calling other components and requiring them for a part or all of its operations. If they can be scoped to their usages, then the component has the opportunity to reorganize some of the usages and streamline its dependencies. Avoiding these outbound references improves the overall reliability of the component without any loss of compatibility between revisions.
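A small sketch of scoping an outbound dependency to its usages with lazy construction; the StorageProvider interface and the component here are hypothetical:

import java.util.function.Supplier;

// Hypothetical outbound dependency of the component being repackaged.
interface StorageProvider {
    void write(String key, byte[] value);
}

// The component declares the dependency but defers constructing (and therefore
// loading) it until the first call that actually needs it.
class RepackagedComponent {
    private final Supplier<StorageProvider> providerFactory;
    private StorageProvider provider;            // created lazily, only when required

    RepackagedComponent(Supplier<StorageProvider> providerFactory) {
        this.providerFactory = providerFactory;
    }

    void persist(String key, byte[] value) {
        if (provider == null) {
            provider = providerFactory.get();    // dependency pulled in on first use
        }
        provider.write(key, value);
    }

    void computeOnly() {
        // code paths that never touch storage do not load the provider at all
    }
}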
The repackaging includes the movement of callers, including tests, so certain classes and methods may need to be made visible for testing, which is preferable to duplication of code. There are some usages that will still need to be retained when they are used in a shared manner, and those can be favored to be in their own containers.