Tuesday, March 31, 2020

There are a few advantages to using manifests and charts with custom resources that are self-contained as opposed to those that are provisioned external to the Kubernetes infrastructure via service brokers. When the charts are self-contained, they are completely within the Kubernetes system and accessible for create, update and delete via kubectl. The kube-apiserver and the operators can take care of the state reconciliation and the system becomes the source of truth. This is not the case with external resources, which can have varying degrees of deviation from the truth depending on the external service provider.
Resources that are provisioned by a service broker can also be kustomized. The charts for these resources are oblivious to the provisioners. It’s the kube-apiserver that determines which provisioners are to be tapped for a resource. This is looked up in the service catalog. The binding between the catalog and the provisioner is a loose one. They can get out of sync, but the catalog guarantees a global registry. The manifests and charts work well to describe and define these externally provisioned custom resources.
The resources that are provisioned externally have their own data management systems since they are external services and are responsible for the data they store in the system. This data can also be exported if the external service provides an option. The Kubernetes resource can then merely include a resource identifier or a URI for the data associated with the external resource. In such a case, the resource becomes exclusively a control-plane object.
There is a size limit to a self-contained Kubernetes resource. We can embed a Kubernetes secret in it, but we cannot embed arbitrary byte ranges. That would not be appropriate since it’s a resource handled in the control plane. Any tar ball of data should be downloadable via a URI from its respective service. This keeps the concerns of the control plane and the data plane separate. It is also easy on the kube-apiserver, while the sync between the custom resource and the external service provider is delegated elsewhere.
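As an illustration, such a control-plane-only resource might look like the following sketch, where the kind, instanceId and dataUri fields are all hypothetical names chosen for this example:

apiVersion: example.com/v1
kind: ExternalDataset
metadata:
  name: billing-archive
spec:
  # identifier of the instance provisioned at the external service (illustrative)
  instanceId: "7d9e1a2b"
  # the data itself stays with the external service; only its location is recorded here
  dataUri: "https://external-service.example.com/exports/billing-archive.tar.gz"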
The separation of concerns between the Kubernetes resource and its associated data does not stop with the external service and the kube-apiserver. The binding between the resource and the external service is managed with the help of a service catalog, which lets services be hosted outside the cluster. It adheres to the Open Service Broker API. It allows services to be independent and scalable, and it allows services to define their own resources.

Monday, March 30, 2020

Both persistent volumes and network-accessible storage refer to disk storage for Kubernetes, which is a portable, extensible, open-source platform for managing containerized workloads and services. Kubernetes is often a strategic decision for any company because it improves the business value of their offering while relieving their business logic from the chores of the hosts, so that the same logic can work elsewhere with minimal disruption to its use.
Kubernetes provides a familiar notion of a shared storage system with the help of VolumeMounts accessible from each container. A volume mount is a shared file system which may be considered local to the container and reused across containers. File system protocols have always facilitated local and remote file storage with their support for distributed file systems. This allows databases, configurations and secrets to be available on disk across containers and provides a single point of maintenance. Most storage, regardless of the storage access protocol – file system, http(s), block or stream – essentially moves data to storage, so there is a transfer and latency involved.
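A minimal sketch of this sharing, assuming an emptyDir volume and busybox images purely for illustration; the same pattern applies to volumes backed by a persistentVolumeClaim:

apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-demo
spec:
  volumes:
  - name: shared-data
    emptyDir: {}
  containers:
  - name: writer
    image: busybox
    command: ["sh", "-c", "echo config > /data/app.conf && sleep 3600"]
    volumeMounts:
    - name: shared-data
      mountPath: /data
  - name: reader
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: shared-data
      mountPath: /data
      # mounted read-only, as discussed below for access restrictions
      readOnly: true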
The only question has been what latency and I/O throughput are acceptable to the application, and this has guided the decisions for storage systems, appliances and their integrations. When the storage is tightly coupled with the compute, such as between a database server and a database file, all the reads and writes incurred from performance benchmarks require careful arrangement of bytes, their packing, organization, indexes, checksums and error codes. But most applications hosted on Kubernetes don’t have the same requirements as a database server.
This design and the relaxation of performance requirements from applications hosted on Kubernetes facilitate different connectors, not just volume mounts. The notion that the same data can be sent to a destination regardless of the destination has been successfully demonstrated by log appenders, which publish logs to a variety of destinations. Connectors, too, can help persist data written from the application to a variety of storage providers using consolidators, queues, caches and mechanisms that know how and when to write the data.
The native Kubernetes API does not support any form of storage connector other than the VolumeMount, but it does allow services to be written in the form of Kubernetes applications that can accept data published over http(s), just like a time series database server accepts all kinds of events over the web protocol. The configuration of the endpoint, the binding of the service and the contract associated with the service for the connector definition may vary from destination to destination in the same data publishing application. This may call for the application to become a consolidator that can provide different storage classes and support different data workload profiles. Appenders and connectors are popular design patterns that get re-used often and justify their business value.
The shared data volume can be made read-only and accessible only to the pods. This facilitates access restrictions. While authentication, authorization and audit can be enabled for storage connectors, they will still require RBAC access. Therefore, service accounts become necessary with storage connectors. A side benefit of this security is that the accesses can now be monitored and alerted on.

Sunday, March 29, 2020

This article introduces its readers to long-duration testing tools for storage engineering products. Almost all products made for digital storage have to deal with bytes of data that usually have a start offset, an end offset and content for the byte range. It’s the pattern of reading and writing various byte ranges that generates load on the storage products. Whether the destination for data is a file in a folder, a blob in a bucket, a content-addressable store in a remote store or a stream in a scope, the requirements for reading and writing large amounts of data remain more or less the same. This would make most load or stress testing tools appear similar in nature, with a standard battery of tests. This article calls out some of those re-appearing tests, but it might also surprise us to know that these tools, much like the products they test, differ even within the same category, playing to one or more of their strengths.
These tests include: 
1) Precise storage I/O for short and long durations 
2) TCP or UDP generation with full flexibility for their traffic generation and control 
3) Generating real-world data access patterns in terms of storage containers, their count and size 
4) Supporting byte range validations with hashing, journaling of reads and writes, replay of load patterns and configurable compression 
5) Able to identify, locate and resolve system malfunction through logging and support for test reports 
6) Configurability to determine target systems, the inclusion of loads and the parameters necessary for test automation 
7) Testing across hundreds of remote clients and servers from a single controller. 
8) Determining server-side performance exclusively from client-side reporting. 
9) Able to pause, resume and show progress 
10) Complete and accurate visibility into the server or component of storage products that show anomalies 
11) Overloading the system until it breaks and being able to determine the threshold before the surge as well as after thrashing 
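As an illustration of the first test in the list above, a long-duration, precise storage I/O load can be expressed with a general-purpose tool such as fio; the parameters here are indicative rather than prescriptive:

fio --name=longhaul-randwrite \
    --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k \
    --size=10G --numjobs=8 \
    --time_based --runtime=86400 \
    --group_reporting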

Saturday, March 28, 2020

One thing manifests and charts should not do is rely exclusively on kustomizations to chain ownerships. Those belong in the chart or the operator code, which exclusively define the Kubernetes resources. There is an inherent attribute on every Kubernetes resource called ownerReferences that allows us to chain construction and deletion of Kubernetes objects. This is very valuable to the resource because the kube-apiserver will perform these actions automatically. Specifying the chaining together with the resource makes it clear to the Kubernetes control plane that these objects are first class citizens that it needs to manage. Specifying it elsewhere leaves the onus on the application, not the kube-apiserver. If there is some logic that cannot be handled via static yaml declarations, then it belongs in the operator code, not the charts and the manifests.
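For illustration, a dependent resource declares its owner as in the sketch below; the names are hypothetical, and the uid must match the actual uid of the owner object. When the owner is deleted, the kube-apiserver garbage-collects the dependent automatically:

apiVersion: v1
kind: ConfigMap
metadata:
  name: child-config
  ownerReferences:
  - apiVersion: apps/v1
    kind: Deployment
    name: parent-deployment
    # placeholder; must be the owner's actual uid
    uid: d9607e19-f88f-11e6-a518-42010a800195
    controller: true
    blockOwnerDeletion: true
data:
  key: value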
There are a few advantages to using manifests and charts with custom resources that are self-contained as opposed to those that are provisioned external to the Kubernetes infrastructure via service brokers. When the charts are self-contained, they are completely within the Kubernetes system and accessible for create, update and delete via the kubectl. The kube-apiserver and the operators can take care of the state reconciliation and the system becomes the source of truth. This is not the case with external resources which can have varying degrees of deviation from truth depending on the external service provider.
Resources that are provisioned by service broker can also be kustomized. The charts for these resources are oblivious to the provisioners. It’s the kube-apiserver that determines high provisioners are to be tapped for a resource. This is looked up in the service catalog. The biding between the catalog and the provisioner is a loose one. They can get out of sync but the catalog guarantees a global registry. The manifest and charts work well to describe and define these externally provisioned custom resources

Friday, March 27, 2020

Kustomization allows us to create object types that are not defined by the Kubernetes cluster. These object types allow us to group objects so that they may all have overlaid or overridden configuration. One possible use case scenario is the overriding of all images to be loaded from a specific internal registry. These images are treated as an object type to distinguish them from the Kubernetes image object. Regardless of the origin of these images, these objects can now uniformly be loaded from the internal registry.
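A sketch of such an override in a kustomization.yaml, using kustomize's images transformer with a hypothetical internal registry host:

images:
- name: nginx
  # rewrite every reference to this image to pull from the internal registry
  newName: registry.internal.example.com/nginx
  newTag: "1.17"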
The use of manifests for object types is a way of grouping resources for these changes. They can be modified to enhance annotations and labels. Those annotations are not limited to post-configuration processing or interpretation. They can be used upfront with a selector to determine if they are candidates for creating a resource.
The use of labels is similar to the use of annotations in that they are helpful for immediate use in templates from the charts. The grouping of resources by labels can in fact be specified as a static yaml configuration suitable for use within a chart, with the label as the parameter and a selector specified to match the label.
These mechanisms show a way to have the manifest generate what is needed by the charts in addition to their already well-established role in applying overlays and overrides via kustomization.
One thing manifests and charts should not do is rely exclusively on kustomizations to chain ownerships. Those belong in the chart or the operator code, which exclusively define the Kubernetes resources. There is an inherent attribute on every Kubernetes resource called ownerReferences that allows us to chain construction and deletion of Kubernetes objects. This is very valuable to the resource because the kube-apiserver will perform these actions automatically. Specifying the chaining together with the resource makes it clear to the Kubernetes control plane that these objects are first class citizens that it needs to manage. Specifying it elsewhere leaves the onus on the application, not the kube-apiserver. If there is some logic that cannot be handled via static yaml declarations, then it belongs in the operator code, not the charts and the manifests.



Thursday, March 26, 2020

The use of charts and manifests explained:
Kustomization, with all its benefits as described in earlier posts, also serves as a template-free configuration.
When the charts are authored, they have to be general purpose, with the user having the ability to specify the parameters for deployment. Most of these options are saved in the values file. Even if there is more than one chart, the user can provide these values to all the charts via a wrapper values file. These values are then used to override the default values that may come with the associated chart. The template allows taking these options from the values files and using them to create the Kubernetes resources as necessary. This makes the template work the same regardless of the deployment. The syntax used in the templates is quite flexible for reading the options from the values file. Other than these parameters, the templates are mere static definitions for Kubernetes resources and encourage consistency throughout the definitions, such as with annotations and labels.
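As a sketch, a wrapper values file might set the image coordinates, and a chart template reads them; all names here are illustrative:

# values.yaml
image:
  repository: registry.internal.example.com/myapp
  tag: "1.0.0"

# templates/deployment.yaml (excerpt)
containers:
- name: {{ .Chart.Name }}
  image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"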
When the charts are used with a values file, kustomization is not necessary. It is just a value add-on, based on our discussion above. The user can choose between specifying manifests and charts to suit their deployment. They serve different purposes and it would behoove the user to use either or both appropriately.
The user also has the ability to compose multiple values files and use them together when specifying the chart to apply. This allows the user to organize the options in the values files to be granular and then include them on a case-by-case basis.
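For example, with helm the granular values files can be layered by repeating the -f flag, with later files taking precedence over earlier ones (file names illustrative):

helm install myapp ./mychart -f common-values.yaml -f storage-values.yaml -f prod-overrides.yaml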



Wednesday, March 25, 2020

Billing requests are expected to be continuous for a business. Are they indeed a stream so that they can be stored as such? Consider that billing requests have existing established technologies that compete with the business justification of storing them as streams. Will there be cost savings as well as expanded benefits? If the stream storage is a market-disruptive innovation, can it be overlaid over object storage, which has established itself as a “standard storage” in the enterprise and cloud? As it brings many of the storage best practices to provide durability, scalability, availability and low cost to its users, it can go beyond tier 2 storage to become nearline storage for vectorized execution. Web-accessible storage has been important for vectorized execution. We suggest that some of the NoSQL stores can be overlaid on top of object storage and discuss an example with event storage. We focus on the use case of billing requests because they are not relational and find many applications that are similar to the use cases of object storage. Specifically, events conform to append-only stream storage due to the sequential nature of the events. Billing requests are also processed in windows, making a stream processor such as Flink extremely suitable for events. Stream processors benefit from stream storage, and such a storage can be overlaid on any Tier-2 storage. In particular, object storage, unlike file storage, can come in very useful for this purpose since the data also becomes web-accessible for other analysis stacks. Object storage then transforms from being a storage layer participating in vectorized executions to one that actively builds metadata, maintains organization, rebuilds indexes, and supports web access for those who don’t want to maintain local storage or want to leverage easy data transfers from a stash. Object storage utilizes a queue layer and a cache layer to handle processing of data for pipelines. We presented the notion of fragmented data transfer in an earlier document. Here we suggest that billing requests are similar to fragmented data transfer and that object storage can serve both as source and destination of billing requests.
Event storage gained popularity because a lot of IoT devices started producing events. Reads and writes are very different from conventional data because they are time-based, sequential and progressive. Although stream storage is best for events, any time-series database could also work. However, they are not web-accessible unless they are in an object store. Their need for storage is not very different from applications requiring object storage that facilitates store and access. However, as object storage makes inroads into vectorized execution, the data transfers become increasingly fragmented and continuous. At this junction it is important to facilitate data transfer between objects and events, and it is in this space that billing requests and object stores find suitability. Search, browse and query operations are facilitated in a web service using a web-accessible store.

Tuesday, March 24, 2020

We were discussing Kubernetes Kustomization. There are two advantages to using it. First, it allows us to configure the individual components of the application without requiring changes in them. Second, it allows us to combine components from different sources and overlay them or even override certain configurations. The kustomize tool provides this feature. Kustomize can add configmaps and secrets to the deployments using their respective generators.
Kustomize is a static declaration. We can add labels across components. We can choose the groups of Kubernetes resources dynamically using selectors, but they have to be declared as yaml. This kustomization yaml is usually stored as manifests and applied on existing components, so they refer to other yamls. The manifests are a way of specifying the location of the kustomization files, passed as a command-line parameter to kubectl commands with the -k option.
For example, we can say:
commonLabels:
  app: potpourri-app
resources:
- deployment.yaml
- service.yaml
We can even add new resources such as a K8s secret.
This comes in useful to inject usernames and passwords for, say, a database application at the time of install and uninstall with the help of a resource called secret.yaml. It just won't detect a virus to force an uninstall of the product; those actions remain with the user.
Kustomize also helps us to do overlays and overrides. Overlay means we change parameters for one or more existing components. Override means we take an existing yaml and change portions of it, such as changing the service to be of type LoadBalancer instead of NodePort, or vice versa for developer builds. In this case, we provide just enough information to look up the declaration we want to modify and specify the modification. For example:
apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  type: NodePort
If the above service type modification were persisted side by side as prod and dev environments, it would be called an overlay.
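For instance, if the service patch above were saved as service-patch.yaml in a prod overlay, the overlay's kustomization.yaml could reference the base and the patch like this (paths illustrative):

# overlays/prod/kustomization.yaml
bases:
- ../../base
patchesStrategicMerge:
- service-patch.yaml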
Finally the persistence of kustomization files is not strictly required and we can run:
kustomize build manifests_folder | kubectl apply -f -
or
kubectl apply -k manifests_folder
One of the interesting applications of Kustomization is the use of internal docker registries.
We use the secretGenerator to create the secret for the registry, which typically has the
docker-server, docker-username, docker-password and docker-email values and the secret type type: docker-registry.
This secret can take environment variables and the kustomization file can even be stored in source control.
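A sketch of such a generator, assuming the docker credentials have been written to a local config.json in the usual dockerconfigjson format:

secretGenerator:
- name: internal-registry-creds
  type: kubernetes.io/dockerconfigjson
  files:
  # the secret key must be .dockerconfigjson; the path is illustrative
  - .dockerconfigjson=config.json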

Monday, March 23, 2020

Kubernetes Kustomization techniques:
Kustomize is a standalone tool for the Kubernetes platform that supports the management of objects using a kustomization file.
The “kubectl kustomize <kustomization_directory>” command allows us to view the resources that would be kustomized. The apply verb, instead of the kustomize verb, can then be used to apply them.
It can help with generating resources, setting cross-cutting fields such as labels and annotations or metadata and composing or customizing groups of resources.
The resources can be generated and infused with specific configuration and secrets using a configMap generator and a secret generator respectively. For example, it can take an existing application.properties file and generate a configMap that can be applied to new resources.
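A minimal sketch of that configMap generator in a kustomization.yaml:

configMapGenerator:
- name: app-config
  files:
  # the file's contents become the configMap data
  - application.properties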
Kustomization allows us to override the registry for all images used in the containers for an application.
Any access to images on registries such as docker, gcr, or AWS is outbound from the kube cluster and will likely require credentials. Outbound connectivity from the pods is provided using well-known nameservers and a gateway added through the master, which goes through the host-visible network to the external world. There are two modes for downloading images:
First, we can provide an insecure registry that is internal and private to the cluster. This registry has no images to begin with and all images can be subsequently uploaded to the registry. It helps us create a manifest of images required to host the application on the cluster and it provides us the ability to use a non-TLS registry.
Second, we can use credentials with one or more external registries as an add-on, because the outbound request for pulling an image can be configured using credentials by referring to them as “registry-creds”. Each external registry can accept the path for a credentials file, usually in json format, which will help configure the registry.
Together these options allow all images to be made available inside the cluster so that containers can be spun up on the cluster to host the application.
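With minikube, for instance, the second mode looks like the following; the configure step interactively prompts for the per-registry credential files:

minikube addons configure registry-creds
minikube addons enable registry-creds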
Kubernetes has a controller-manager, a kubelet, an apiserver, a proxy, etcd and a scheduler. All of these can be configured using a configurator. The --feature-gates flag can be used to govern what can be allowed to run. The options supported by the feature gates are few, but the components can utilize them to provide selectivity in what is included for running the app.


Sunday, March 22, 2020

A billing application over stream storage:
Billing requests are expected to be continuous for a business. Are they indeed a stream so that they can be stored as such ? Consider that billing requests have existing established technologies that compete with the business justification of storing them as streams? Will there be cost savings as well as expanded benefits?  If the stream storage is a market disruptive innovation, can it be overlaid over object storage which has established itself as a “standard storage” in the enterprise and cloud. As it brings many of the storage best practice to provide durability, scalability, availability and low cost to its users, it can go beyond tier 2 storage to become nearline storage for vectorized execution. Web accessible storage has been important for vectorized execution. We suggest that some of the NoSQL stores can be overlaid on top of object storage and discuss an example with Event storage.  We focus on the use case of billing requests because they are not relational and find many applications that are similar to the use cases of object storage. Specifically, events conform to append only stream storage due to the sequential nature of the events. billing requests are also processed in windows making a stream processor such as Flink extremely suitable for events. Stream processors benefit from stream storage and such a storage can be overlaid on any Tier-2 storage. In particular, object storage unlike file storage can come very useful for this purpose since the data also becomes web accessible for other analysis stacks. Object storage then transforms from being a storage layer participating in vectorized executions to one that actively builds metadata, maintains organizations, rebuilds indexes, and supporting web access for those don’t want to maintain local storage or want to leverage easy data transfers from a stash. Object storage utilize a queue layer and a cache layer to handle processing of data for pipelines. We presented the notion of fragmented data transfer with an earlier document. Here we suggest that billing requests are similar to fragmented data transfer and how object storage can serve both as source and destination of billing requests. 
Event storage gained popularity because a lot of IoT devices started producing them. Read and writes were very different from conventional data because they were time-based sequential and progressive. Although stream storage is best for events, any time-series database could also work. However, they are not web-accessible unless they are in an object store. Their need for storage is not very different from applications requiring object storage that facilitate store and access. However as object storage makes inwards into vectorized execution, the data transfers become increasingly fragmented and continuous. At this junction it is important to facilitate data transfer between objects and Event and it is in this space that billing requests and object store find suitability. Search, browse and query operations are facilitated in a web service using a web-accessible store. 

Saturday, March 21, 2020

High Availability for applications.
This article is a comparison between cluster mode and serverless computing. Both are techniques to scale software deployments to meet the challenges of increasing traffic from clients. Back-of-the-envelope calculations for capacity and an inclination to treat software deployments as traditional have been common practice. It is time to wake up to lower costs with small changes in modes of deployment and application modularity.
High availability clusters spin up additional nodes to handle the load. They scale out with fixed long-running costs for nodes and continue until they are resized. The intervention to increase or decrease the number of nodes is independent of the demand and capacity fluctuations. All aspects of the node, from compute to operating system and application stack, involve maintenance chores for the application. The code for the application also takes on such routines as health checks, alerts and notifications.
Compare this with serverless computing, where the resources required to execute the code are not only independent of the application but also dynamically provisioned and torn down, so we pay as we go. Applications are already familiar with the Platform as a Service model and have shifted to deep divisions in application modularity, with separate hardware and software for each module that is managed independently of the applications. This shift towards serverless computing, also called Function as a Service, is only minimally more than the PaaS model because applications have to organize their logic into smaller functions that can be executed without concern for resources. The applications also get to focus more on their actual business value rather than mundane operational concerns.
The cluster model is hugely popular since it has shown a proven track record and tested software in cluster management. Cluster management also provides customized capabilities by way of dedicated nodes. Serverless computing is only provided by big cloud providers who can take away all the costs from our application by providing economies of scale. The serverless architecture may be standalone or distributed. In both cases, it remains an event-action platform to execute code in response to events. We can execute code written as functions in many different languages, and a function is executed in its own container. Because these functions are asynchronous to the frontend and backend, they need not perform continuous polling, which helps them be more scalable and resilient. OpenWhisk introduces an event programming model where the charges are only for what is used.
The choice between cluster and serverless computing also depends on the ability of organizations to adapt. Clusters are easy to provision on premises and on orchestration frameworks, whereas the use of public cloud technologies has still not penetrated sufficiently within organizations to become the mainstream mode of deployment. Some organizations also have a genuine need for special-purpose hardware racks and cannot truly be software-defined.

Friday, March 20, 2020

Kubernetes has a controller-manager, a kubelet, an apiserver, a proxy, etcd and a scheduler. All of these can be configured using a configurator. The --feature-gates flag can be used to govern what can be allowed to run. The options supported by the feature gates are few, but the components can utilize them to provide selectivity in what is included for running the app.
The images do not have feature gates. They are atomic and whole in that they are composed of layers but are treated as a single image with the <registry>/<name>:<tag> specifier. The images referenced by the "repository" key value in the values file can simply specify the name and tag and rely on registry resolution to locate the image. The kubernetes framework must be guided in this process.
Image loading is first local, then remote. Even for remote, the preference is the configured registry rather than the default. This allows us to use manifests with charts in an effective way. The helm charts contain at least two elements: a Chart.yaml that has a description of the package, and one or more templates which contain Kubernetes manifest files.
A manifest can specify chart override values such as either:
    organization/chart-values: |-
      {
        "image": {
          "repository": "image1"
        }
      }
Or

    organization/chart-values: |-
      {
        "global": {
          "registry" : ""
        },
        "createdecksappResource": false
      }
In the first case, the image is used with its name and tag as specified. In the second case, the registry is prefixed to the image and tag making it specific to where the image should be located.


Thursday, March 19, 2020

Image registry:
Kubernetes, as a container orchestration framework, requires images to launch containers. The image registry can be private or it can be one of the well-known public registries.
Any access to images on registries such as docker, gcr, or AWS is outbound from the kube cluster and will likely require credentials. Outbound connectivity from the pods is provided using well-known nameservers and a gateway added through the master, which goes through the host-visible network to the external world. There are two modes for downloading images:
First, we can provide an insecure registry that is internal and private to the cluster. This registry has no images to begin with and all images can be subsequently uploaded to the registry. It helps us create a manifest of images required to host the application on the cluster and it provides us the ability to use a non-TLS registry.
Second, we can use credentials with one or more external registries as an add-on, because the outbound request for pulling an image can be configured using credentials by referring to them as “registry-creds”. Each external registry can accept the path for a credentials file, usually in json format, which will help configure the registry.
Together these options allow all images to be made available inside the cluster so that containers can be spun up on the cluster to host the application.
Kubernetes has a controller-manager, a kubelet, an apiserver, a proxy, etcd and a scheduler. All of these can be configured using a configurator. The --feature-gates flag can be used to govern what can be allowed to run. The options supported by the feature gates are few, but the components can utilize them to provide selectivity in what is included for running the app.
The images do not have feature gates. They are atomic and whole in that they are composed of layers but are treated as a single image with the <registry>/<name>:<tag> specifier. The images referenced by the "repository" key value in the values file can simply specify the name and tag and rely on registry resolution to locate the image. The kubernetes framework must be guided in this process.

Wednesday, March 18, 2020

The Kubernetes ingress is a resource that can help with load balancing. It provides an external endpoint to the cluster and usually has a backend to which it forwards traffic. This makes it work as an SSL termination endpoint and for name-based virtual hosting.
An ingress controller controls the traffic into the Kubernetes cluster. Typically, this is done with the help of nginx. An Ingress-nginx controller can be configured to use the certificate.
Nginx provides the option to specify a --default-ssl-certificate. The default certificate is used as a catch-all for all traffic into the server. Nginx also provides an --enable-ssl-passthrough feature. This bypasses nginx, and the controller instead pipes the traffic forward and backward between the client and the backend. If the virtual domain cannot be resolved, it passes to the default backend.
This calls for two steps to secure the ingress controller (a sketch follows these steps):
1) Use a library such as cert-manager to generate keys and certificates.
2) Use the generated key and certificate as Kubernetes secrets and reference their location in the SSL configuration of the application.
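A sketch of both steps, assuming cert-manager (or another tool) has produced tls.crt and tls.key, with a hypothetical host name; the ingress apiVersion reflects the networking.k8s.io/v1beta1 API current at the time of writing:

kubectl create secret tls myapp-tls --cert=tls.crt --key=tls.key

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: myapp-ingress
spec:
  tls:
  - hosts:
    - myapp.example.com
    secretName: myapp-tls
  rules:
  - host: myapp.example.com
    http:
      paths:
      - backend:
          serviceName: myapp
          servicePort: 443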
The TLS port, usually 443 or 8443, can also be port-forwarded to the host and thereby to the rest of the clients on the host's external network. The ingress is an SSL termination point, and the host can use an SSL proxy or any technique suitable to relay the traffic to the ingress.
This does not affect a registry for image pulling. Any access to images on registries such as docker, gcr, or AWS is outbound from the kube cluster. We have already established outbound connectivity from the pods using well-known nameservers and a gateway added through the master, which goes through the host-visible network to the external world. There are two techniques possible here.
First, we can provide an insecure registry that is internal and private to the cluster. This registry has no images to begin with and all images can be subsequently uploaded to the registry. It helps us create a manifest of images required to host the application on the cluster and it provides us the ability to use a non-TLS registry.
Second, we can use credentials with one or more external registries as an add-on, because the outbound request for pulling an image can be configured with credentials. Minikube provides addons to configure these credentials by referring to them as “registry-creds”. Each external registry can accept the path for a credentials file, usually in json format, which will help configure the registry.
Together these options allow all images to be made available inside the cluster so that containers can be spun up on the cluster to host the application.

Tuesday, March 17, 2020

Minikube tunnel creates a route to services deployed with type LoadBalancer and sets their ingress to the cluster IP.
We can also create an ingress resource with the nginx-ingress-controller. The resource has an IP address mentioned that can be reached from the host. The /etc/hosts file on the host must map this IP address to the host name specified in the ingress resource.
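A sketch with a hypothetical host name; the address shown by kubectl get ingress is then mapped in /etc/hosts on the host:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: myapp-ingress
spec:
  rules:
  - host: myapp.local
    http:
      paths:
      - backend:
          serviceName: myapp
          servicePort: 80

# /etc/hosts entry on the host (address as reported by kubectl get ingress)
192.168.99.100 myapp.local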
There is no support mentioned in the docs for replacing the network adapter with a bridged network even though that might solve external connectivity for the minikube cluster.
The Kubernetes ingress is a resource that can help with load balancing. It provides an external endpoint to the cluster and usually has a backend to which it forwards traffic. This makes it work as an SSL termination endpoint and for name-based virtual hosting.
An ingress controller controls the traffic into the Kubernetes cluster. Typically, this is done with the help of nginx. An Ingress-nginx controller can be configured to use the certificate.
Nginx provides the option to specify a --default-ssl-certificate. The default certificate is used as a catch-all for all traffic into the server. Nginx also provides an --enable-ssl-passthrough feature. This bypasses nginx, and the controller instead pipes the traffic forward and backward between the client and the backend. If the virtual domain cannot be resolved, it passes to the default backend.
This calls for two steps to secure the ingress controller:
1) Use a library such as cert-manager to generate keys and certificates.
2) Use the generated key and certificate as Kubernetes secrets and reference their location in the SSL configuration of the application.

Monday, March 16, 2020

External key management on stream store analytics:


Stream stores are overlaid on Tier 2 storage where the assumption is that the latter takes care of securing data at rest. Tier 2 storage such as object storage has always supported Data at Rest Encryption (D@RE) by maintaining a set of encryption keys in the system. These include Data Encryption Keys (DeKs) and Key Encryption Keys (KeKs). Certain object storage products even support external key management (EKM) by providing integration with Gemalto KeySecure servers, an industry best practice. With the help of external keys, there is reduced risk when there is a compromise against a single instance of an application. Keys are rotated periodically, and this integration helps with performing the re-encryption of storage artifacts. Products that combine analytics over stream stores have at least two levels of data transfers – one involving the analytical application and the stream store, and another involving the stream store and tier 2, which may either be an nfs file system or a blob store. They can also occur side by side if the product allows storage independent of streams, or with a virtualizer that involves a storage class provisioner, or finally with an abstraction that syncs between hybrid stores. In these cases, there is replicated data, often without protection. When the product supports the ability to use the same key to secure all parts of the data and their copies, along with the ability to rotate the keys, an external key manager comes in useful to safeguard the keys, both old and new.

Data is organized in containers and hierarchies specific to the store, and encryption can be applied at each hierarchical level. All the data is at the lowest level, and each container has its own DeK, while the higher-level containers have their own KeKs. A master KeK is available for the overall store. When the stores are multiple and hybrid, the masters become different, but a master can be treated as just another intermediary level as long as the stores are registered at an abstraction layer.


Sunday, March 15, 2020

We were discussing Minikube applications and the steps taken for allowing connectivity to the applications.
The technique to allow external access to the application hosted on Minikube is port-forwarding. If the application is hosted on both http and https, then a set of ports can be opened on the host to send traffic to and from the application.
On Windows we take extra precautions in handling network traffic. The default firewall settings may prevent access to these ports, so a new set of inbound and outbound rules must be specified for the new ports on the host, one pair for each port.
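As a sketch, a pair of Windows firewall rules for a forwarded port 9880 might look like this, run from an elevated prompt (rule names are illustrative):

netsh advfirewall firewall add rule name="minikube-9880-in" dir=in action=allow protocol=TCP localport=9880
netsh advfirewall firewall add rule name="minikube-9880-out" dir=out action=allow protocol=TCP localport=9880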
Redirects continue to operate as before because all available endpoints will have port-forwarding. The web address itself does not need to be translated to include the localhost and host port as long as the application remains the same point of origin.
The other option, aside from port-forwarding, is to ask minikube to expose the service. This option provides another Kubernetes service with an IP address and port as its own URL. This URL can then be accessed from the host. There is no direct external network IP connectivity over the NAT without using static IP addressing. That said, Minikube does provide an option for tunneling.
Tunnel creates a route to services deployed with type LoadBalancer and sets their ingress to the cluster IP.
We can also create an ingress resource with the nginx-ingress-controller. The resource has an IP address mentioned that can be reached from the host. The /etc/hosts file on the host must map this IP address to the host name specified in the ingress resource.

Saturday, March 14, 2020

Minikube applications can be accessed outside the host via port-forwarding. The applications hosted on Minikube have an external cluster-IP address, but the IP address is NAT'ed, which means it is on a private network where the address is translated from the external IP address.
The external and cluster IP addresses are two different layers of abstraction. The external in this case refers to the address that is visible only to the host, since minikube is hosted with a host-only network adapter. It has outbound external connectivity but no inbound access except what the host permits. The IP address does not route automatically to the pods within minikube.
The cluster IP address refers to one that has been marked cluster-wide and is accessible outside the Kubernetes cluster. It does not mean it is accessible over the NAT. It is different from the internal IP addresses used for the pods.
The layering therefore looks like the following:
 - Outside world
     -   Host (IP connectivity)
          - Minikube (Network Address Translation)
              - Cluster IP address ( Kubernetes )
                  - Pod IP address  ( Kubernetes )

Minikube provides a feature that enables transmission of data between the pod and the outside world.
This is called port-forwarding.
 To transmit the data to a web application serving at port 80, we can run the following commands on the host:
> kubectl port-forward pod/<podName> -n <namespace> 9880:80 for the inbound traffic
Forwarding from 127.0.0.1:9880 -> 80
and
> kubectl port-forward --address 0.0.0.0 pod/<podName> -n <namespace> 9880:9000 for the traffic arriving from outside the host
Forwarding from 0.0.0.0:9880 -> 9000

It is important to recognize that the inbound and outbound rules must be specified separately for the same application. If the traffic involves both http and https, then this results in a set of two rules for each kind of traffic – plain and encrypted.

Friday, March 13, 2020

Kubernetes application install on Windows
Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services. It is often a strategic decision for any company because it decouples the application from the hosts so that the same application can work elsewhere with minimal disruption to its use.
Windows is the most common operating system software on personal workstations. Most Kubernetes deployments are on Linux flavor virtual machines for large scale deployments. The developer workstation is considered a small deployment.
The most convenient way to install Kubernetes on Windows for hosting any application is with the help of a software product called Minikube. This software provisions a dedicated Kubernetes cluster ideal for use in a small resource environment.
It simplifies storage with the help of a storage class that refers to an abstraction of how data is persisted. It uses a storage provisioner called k8s.io/minikube-hostpath which, unlike other storage provisioners, does not require static configuration beforehand for hosted applications to be able to persist files. All requests for persisting files are honored dynamically as and when they are required. It stores the data from the applications on the host itself, unlike nfs-client-provisioner which provisions on remote storage.
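A claim against this provisioner is as simple as the sketch below; minikube's default storage class, named standard, satisfies it dynamically:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: standard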
It simplifies networking with the help of dual network adapters that let the cluster provide connectivity with the host and with the outside world, which lets the application appear as if it is on a traditional deployment that is reachable on the internet. The network adapter with the host provides the ability to seamlessly port-forward for services deployed to pods with an external cluster-IP address.
Together, the storage and networking conveniences make application portability easy. Minikube also comes with its own docker runtime and kubectl toolset that make it easy to provide the software to run on the cluster.
These help with the convenience of hosting any application on Kubernetes over Windows in a resource-constrained environment.

Thursday, March 12, 2020

1) The Flink programming model helps a lot with writing queries for streams. Several examples of this are available on their documentation page and as a sample here. The ability to combine a global index for the stream store with their programming model boosts the analytics that can be performed on the stream. An example to create this index and use it with storage is shown here.
2) The utility of the index is in its ability to look up based on keywords. The query engine for the index exposes additional semantics and syntax to the analytics user over the Flink queries, or to be used with Flink queries. Then the logic to use the queries and Flink can be packaged in a maven-published jar. Credentials to access the streams can be injected into the jar with the help of a resolver which utilizes the application context.
3) Some of the streams may be generated as part of running map or flatMap operations on an existing stream and they might come in useful later. Unlike the stream for the index, there could be a stream for the transformation of events. Such a transformation will happen once and persist in a new stream. Indexes are rebuilt; transformation is one-time. Indexes are used for a while; transformations are temporary and persisted only when used with several queries. Indexes might even be stored better as files since they are rewritten. Transformed streams will be append-only. The Extract-Transform-Load operation to generate this stream could be packaged in a maven artifact that is easy to write in Flink. If the indexing automatically includes all streams in a project, then this transformed stream would become automatically available to the user. If there is a way for the user to blacklist or whitelist streams for inclusion in the index, it will give more power to the user and prevent unnecessary indexing. All project members can have the privilege to add a stream to the indexing. If the stream is earmarked to be indexed, indexing may even be kicked off by the project member or require the administrator to do so.
4) Overall, there is a comparison made between indexing across streams and transforming one or more streams into another stream. The Flink programmability model works well with transformations. Utilization of index-based querying adds more power to these analytics. Finally, the data for the transformed streams and indexes can be stored with a tier 2 that brings storage engineering best practices, allowing the businesses to focus more on the querying, transformations and indexing.


Wednesday, March 11, 2020

Streams and Tier2 Storage (continued):
Object storage is a limitless storage in terms of capacity. Stream storage is limitless in terms of continuous data. The two can send data to each other in a mutually beneficial manner. Stream processing can help with analytics while object storage can help with storage engineering best practices. They have their own advantages and can transmit data between themselves for their respective benefits.
A cache can help improve the performance of data in transit. The techniques for providing stream segments do not matter to the client, and the cache can use any algorithm. The cache also provides the benefit of alleviating load from the stream store without any additional constraints. In fact, the cache will also use the same stream reader as the client, with the only difference that there will be fewer stream readers on the stream store than before.
We have not compared this cache layer with a message queue server, but there are interesting problems common to both. For example, we have a multiple-consumer, single-producer pattern in the periodic reads from the stream storage. The message queue server or broker enables this kind of publisher-subscriber pattern with retries and a dead letter queue. In addition, it journals the messages for review later. This leaves the interaction between the caches and the storage to be handled elegantly with a well-known messaging framework. The message broker inherently comes with a scheduler to perform repeated tasks across publishers. Hence it is easy for the message queue server to perform as an orchestrator between the cache and the storage, leaving the cache to focus exclusively on the cache strategy suitable to the workloads. Journaling of messages also helps with diagnosis and replay, and probably there is no better store for these messages than the object storage itself. Since the broker operates in a cluster mode, it can scale to as many caches as available. Moreover, journaling is not necessarily available with all the messaging protocols, which counts as one of the advantages of using a message broker. Aside from the queues dedicated to handling the backup of objects from cache to storage, the message broker is also uniquely positioned to provide differentiated treatment to the queues. This introduction of quality-of-service levels expands the ability of the solution to meet varying and extreme workloads. The message queue server is not only a nice-to-have feature but also a necessity and a convenience when we have a distributed cache to work with the stream storage.
The distance between stream stores and object stores that send data to each other does not matter since the relay is usually asynchronous. This enables cloud storage to work with on-premise storage and tier 2 overlays to send and receive data from remote storage. It facilitates global distribution of data while allowing applications to work with each and every store. The web accessibility of data in object stores also lets it be used as intermediary storage. 
The stream stores are helpful to analytics with powerful programmability features. Stream stores can be used on virtual machines and workstations for programmability while using object storage in the cloud. This makes it easy for development of applications.

Tuesday, March 10, 2020

Streams and tier2 storage. 
Streams are overlaid on tier 2 storage, which includes S3. Each stream does not have to map one-to-one with a file or a blob for user data isolation. This is entirely handled by the stream storage. Let us take a look at the following instead.
A stream may have an analytics jar. These jars can be published to a maven repository on a blob store.  
For example: 
publishing { 
    publications { 
        mavenJava(MavenPublication) { 
            from components.java 
        } 
    } 
    repositories { 
        maven { 
            url "s3://${repoBucketName}/releases" 
            credentials(AwsCredentials) { 
                accessKey awsCredentials.AWSAccessKeyId 
                secretKey awsCredentials.AWSSecretKey 
            } 
        } 
    } 
} 
Artifactory is already popular with maven publishers, as is jcenter(). This approach just generalizes that to S3 storage, whether it is on-premises or in the cloud.
Taking this one step forward, publishers that can package user data into a tar ball or extract a stream to a blob are all similarly useful.
When a publisher packages a stream, it can ask the stream store to send all the segments of the stream over http. A user script that makes this curl call can redirect the output to a file. The file then becomes included in another S3 call to upload it as a blob.
Such a publisher can combine more than one stream in an archive or package metadata with it. This publisher knows what is relevant to the user to pack her data into an archive. Another publisher could make this conversion of user data to blob without the user needing to know about any intermediary file.
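A sketch of such a user script, assuming a hypothetical stream-store read endpoint and bucket name:

# read all segments of the stream over http and package them (endpoint illustrative)
curl -s https://stream-store.example.com/scopes/billing/streams/requests/segments > requests.dat
tar -czf requests.tar.gz requests.dat
# upload the archive as a blob
aws s3 cp requests.tar.gz s3://user-archive/billing/requests.tar.gz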