Sunday, March 8, 2020

We were discussing HTTP proxies. A gateway and HTTP proxy service can be used with stream storage to provide a site-specific stream store. The gateway also acts as an HTTP proxy. Any gateway implementation has to maintain a registry of destination addresses. As streams proliferate with their geo-replications, this registry becomes global while supporting rules that determine the site from which each stream should be accessed. Finally, the gateway gathers access statistics and metrics, which are very useful for understanding the HTTP accesses of specific sites for stream storage.
Both of the above functionalities can be elaborate, allowing the gateway service to provide immense benefit per deployment.
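A minimal sketch of such a registry with a geo-aware access rule might look like this; the stream names, sites, and the nearest-region preference are illustrative assumptions, not a particular product's API:

# Sketch of a gateway registry that maps a stream to its replicated sites and
# applies a rule to pick the site to access. Stream names, sites and the
# preference rule are illustrative assumptions.
class GatewayRegistry:
    def __init__(self):
        self.sites = {}            # stream name -> list of sites holding a replica

    def register(self, stream, site):
        self.sites.setdefault(stream, []).append(site)

    def resolve(self, stream, caller_region):
        # Rule: prefer a replica in the caller's region, otherwise the first replica.
        replicas = self.sites.get(stream, [])
        for site in replicas:
            if site["region"] == caller_region:
                return site["address"]
        return replicas[0]["address"] if replicas else None

registry = GatewayRegistry()
registry.register("orders", {"region": "us-west", "address": "https://us-west.example.com"})
registry.register("orders", {"region": "eu-central", "address": "https://eu.example.com"})
print(registry.resolve("orders", "eu-central"))   # picks the eu-central replica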
The advantages of an HTTP proxy include usage aggregation. There can be detailed counts of calls broken down by success and failure. Moreover, the proxy could include all the features of a conventional HTTP gateway such as Mashery: client-based caller information, destination-based statistics, per-object statistics, categorization by cause, and many other features, along with a RESTful API for the statistics gathered.
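A sketch of the kind of in-memory aggregation the proxy could keep; the key shape and the status-code threshold are illustrative assumptions:

# Sketch: per-caller and per-destination call counters split by success/failure.
# The key shape (caller, destination, outcome) is an illustrative assumption.
from collections import Counter

class ProxyStats:
    def __init__(self):
        self.calls = Counter()

    def record(self, caller, destination, status_code):
        outcome = "success" if status_code < 400 else "failure"
        self.calls[(caller, destination, outcome)] += 1

    def by_destination(self, destination):
        # Aggregate counts for one destination, e.g. to serve from a RESTful stats endpoint.
        return {outcome: count
                for (caller, dest, outcome), count in self.calls.items()
                if dest == destination}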
Since streams can be overlaid on object storage, and objects have their own web addresses, the gateway can also resolve site-specific web addresses at object-level granularity. This enhances the purpose of the gateway and relieves the administrator from having to look up the site for a specific object.

Saturday, March 7, 2020

A simple http proxy:
Software applications are increasingly deployed on Linux containers, some of which have no external connectivity except to their host. These applications require traffic to come and go via the host. Often this involves tweaking the IP routing table or setting up port forwarding between the container and the host. This article points out that HTTP traffic can cross this networking divide with nothing more than a proxy in the middle.
A sample HTTP proxy program would look something like this:

#!/usr/bin/env python3
import socketserver
import urllib.request
from http.server import SimpleHTTPRequestHandler

PORT = 9000
# Placeholder backend address; replace with the actual destination host and port.
DESTINATION_IP = '127.0.0.1'
DESTINATION_PORT = '8080'


class Proxy(SimpleHTTPRequestHandler):
    def do_GET(self):
        # Relay the GET request to the backend and copy the response body back.
        remote_url = 'http://' + DESTINATION_IP + ':' + DESTINATION_PORT + self.path
        with urllib.request.urlopen(remote_url) as response:
            self.send_response(response.status)
            self.end_headers()
            self.copyfile(response, self.wfile)

    def do_POST(self):
        # Read the request body and relay it to the backend as a POST.
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length)
        remote_url = 'http://' + DESTINATION_IP + ':' + DESTINATION_PORT + self.path
        with urllib.request.urlopen(remote_url, data=body) as response:
            self.send_response(response.status)
            self.end_headers()
            self.copyfile(response, self.wfile)


httpd = socketserver.ForkingTCPServer(('', PORT), Proxy)
print("serving at port", PORT)
httpd.serve_forever()

The above program is a sample for reading purposes only. It could be wrapped in a Django server such as this one. It is also easy to modify the program above to rewrite all destination addresses so that the proxy does not disclose the backend. A truly transparent proxy is also possible with the help of HTTP headers. This HTTP proxy does the same relaying as a one-armed router, except that it operates at the HTTP layer instead of the IP layer.
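A sketch of those header adjustments, assuming the handler class from the sample above; the helper function and the choice of headers are illustrative:

# Sketch: forward a request without disclosing the backend, adding common
# proxy headers. The backend address and header choices are illustrative.
import urllib.request

def forward(handler, destination_ip, destination_port):
    remote_url = 'http://' + destination_ip + ':' + destination_port + handler.path
    request = urllib.request.Request(remote_url, headers={
        'Host': handler.headers.get('Host', ''),        # keep the public host name
        'X-Forwarded-For': handler.client_address[0],   # record the original caller
    })
    with urllib.request.urlopen(request) as response:
        handler.send_response(response.status)
        handler.end_headers()
        handler.copyfile(response, handler.wfile)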
Other benefits of an HTTP proxy include troubleshooting, client and destination statistics, and various gateway rules and classifiers.

Friday, March 6, 2020

Benefits of blob storage for Events:
1) Events tend to be small, numerous, of varying size, and continuous. Blobs can be large and binary and can support Events with a varying number of key-values.
2) Event processing can become compute-oriented.
3) Storage of Events can scale while remaining globally accessible.
4) The majority of Events don't change much, which leads to efficiencies in storage.
5) Even skip-level Events can be stored in object storage to make only the relevant ones web-accessible.
6) Events can be partitioned by source, sink or flow, which makes it easy to store them in object storage.
7) Events can be queried with no regard to their storage, and object storage makes the nodes available over HTTP.
8) Events are suitable for a library which can be hosted on object storage.
9) Events can have metadata, and object storage facilitates tagging and adding metadata.
Blobs and Events are mutually beneficial when data transfers between them are facilitated.
Additionally, Event stores can be software-defined and allow object storage to serve as tier 2 storage for the overlay. In this case, we can differentiate the products as different instances that are used and maintained independently. We don't present the Event store and the object store as different storage to customers, although they could be deployed side by side to support data transfers for different purposes.
Object storage facilitates storage classes. Dynamic assignment helps map Events to blobs, which can then be placed in different storage classes, allowing secondary offline workflows that are independent of the primary create-update-delete workflows associated with Events.
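A minimal sketch of this mapping, assuming an S3-compatible object store reachable with boto3; the bucket name, key scheme, and the choice of storage classes are illustrative assumptions:

# Sketch: write each Event as a blob and pick a storage class dynamically.
# Assumes an S3-compatible endpoint and credentials configured for boto3;
# the bucket name and partitioning by source are illustrative.
import json
import boto3

s3 = boto3.client("s3")

def store_event(event, bucket="events"):
    # Partition the key by the event source so related Events stay together.
    key = "{}/{}.json".format(event["source"], event["id"])
    # Hot Events go to the default class; colder ones to an infrequent-access class.
    storage_class = "STANDARD" if event.get("hot", True) else "STANDARD_IA"
    s3.put_object(Bucket=bucket, Key=key,
                  Body=json.dumps(event).encode("utf-8"),
                  StorageClass=storage_class)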

Thursday, March 5, 2020

Storage products are becoming popular both as a sink and as a staging area in the data pipeline. The platform for a storage product facilitates all aspects of data manageability, both at rest and in transit. Yet there is no tagging or labeling of data that allows the storage product to hand out the data corresponding to a particular user of the overall system involving the product.
The only way to handle or associate data from a particular user is with the upstream system that recognizes the user. Since Kubernetes provides a way to recognize the user and the actions for the create, update or delete of resources, it is well-positioned to handle this segregation of data for packing and unpacking purposes.
Data does not always come from users. It can be exchanged between upstream and downstream systems. 
Object storage is limitless in terms of capacity. Stream storage is limitless in terms of continuous storage. The two can send data to each other in a mutually beneficial manner. Stream processing can help with analytics, while object storage can help with storage engineering best practices. Each has its own advantages, and they can transmit data between themselves for their respective benefits.
A cache can help improve the performance of data in transit. The technique for providing stream segments does not matter to the client, so the cache can use any algorithm. The cache also provides the benefit of alleviating load from the stream store without any additional constraints. In fact, the cache uses the same stream reader as the client would, with the only difference being that there are fewer stream readers hitting the stream store than before.
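A minimal sketch of such a read-through cache; the StreamReader interface and the (stream, segment) addressing are assumptions, not a particular stream store's API:

# Sketch of a read-through cache keyed by (stream, segment).
# "reader" is a hypothetical stream store client; any reader exposing a
# read_segment(stream, segment) call could be substituted.
from collections import OrderedDict

class SegmentCache:
    def __init__(self, reader, capacity=1024):
        self.reader = reader              # the same kind of reader a client would use
        self.capacity = capacity
        self.entries = OrderedDict()      # simple LRU; any eviction policy would do

    def get(self, stream, segment):
        key = (stream, segment)
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as recently used
            return self.entries[key]
        data = self.reader.read_segment(stream, segment)  # only the cache hits the store
        self.entries[key] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)               # evict least recently used
        return data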
We have not compared this cache layer with a message queue server, but there are interesting problems common to both. For example, we have a multiple-consumer, single-producer pattern in the periodic reads from the stream storage. A message queue server, or broker, enables this kind of publisher-subscriber pattern with retries and a dead letter queue. In addition, it journals the messages for later review. This leaves the interaction between the caches and the storage to be handled elegantly with a well-known messaging framework. The message broker inherently comes with a scheduler to perform repeated tasks across publishers. Hence it is easy for the message queue server to act as an orchestrator between the cache and the storage, leaving the cache to focus exclusively on the caching strategy suitable to the workloads. Journaling of messages also helps with diagnosis and replay, and there is probably no better store for these messages than the object storage itself. Since the broker operates in a cluster mode, it can scale to as many caches as available. Moreover, journaling is not necessarily available with all messaging protocols, which counts as one of the advantages of using a message broker. Aside from the queues dedicated to handling the backup of objects from cache to storage, the message broker is also uniquely positioned to provide differentiated treatment to the queues. This introduction of quality-of-service levels expands the ability of the solution to meet varying and extreme workloads. The message queue server is not only a nice-to-have feature but also a necessity and a convenience when we have a distributed cache working with the stream storage.
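A sketch of how the broker might sit between the cache and the storage, using RabbitMQ via pika; the queue names, dead-letter wiring, and task payload are assumptions:

# Sketch: publish cache-to-storage backup tasks to a durable queue with a
# dead-letter exchange for failed deliveries. Queue and exchange names are
# illustrative assumptions; requires a reachable RabbitMQ broker.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="cache-dlx", exchange_type="fanout")
channel.queue_declare(queue="cache-backup", durable=True,
                      arguments={"x-dead-letter-exchange": "cache-dlx"})
channel.queue_declare(queue="cache-backup-dead")
channel.queue_bind(queue="cache-backup-dead", exchange="cache-dlx")

task = {"segment": "stream-1/segment-42", "action": "persist-to-object-storage"}
channel.basic_publish(exchange="", routing_key="cache-backup",
                      body=json.dumps(task),
                      properties=pika.BasicProperties(delivery_mode=2))  # persistent message
connection.close()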

Wednesday, March 4, 2020

The programmatic way of writing a custom resource definition allows us to do a little more than the declarations. It lets us introduce dynamic behavior. We do this by calling the methods from the client-go package for the create, update and delete of the resource to use with the Kubernetes API server. These methods are referred to as the 'clientset' because they use the corresponding methods from the Kubernetes apiextensions library. We connect to the API server from within the cluster using the InClusterConfig method from the package. So far, all of these methods are just calls with appropriate parameters and checks of return values.

With the example involving the backup of user data, we have shown that the packing and unpacking of an archive can be described with a custom resource called Backup. This resource can then be scoped to the corresponding resource for which the backup makes sense. The availability of the custom resource implies that we can use the K8s API and CLI to request it.

Why is the scope and purpose of this custom resource for an archive important? It helps the users' data to be packaged from project to project regardless of what it comprises and where it is stored – file, blob, or stream. This portability of data, independent of the logic and the platform, improves mobility and productivity for the end-user.
Storage products are becoming popular both as a sink and as a staging area in the data pipeline. The platform for a storage product facilitates all aspects of data manageability, both at rest and in transit. Yet there is no tagging or labeling of data that allows the storage product to hand out the data corresponding to a particular user of the overall system involving the product.
The only way to handle or associate data from a particular user is with the upstream system that recognizes the user. Since Kubernetes provides a way to recognize the user and the actions for the create, update or delete of resources, it is well-positioned to handle this segregation of data for packing and unpacking purposes.

Tuesday, March 3, 2020


The programmatic way of writing a custom resource definition allows us to do a little more than the declarations. It lets us introduce dynamic behavior. We do this by calling the methods from the client-go package for the create, update and delete of the resource to use with the Kubernetes API server. These methods are referred to as the 'clientset' because they use the corresponding methods from the Kubernetes apiextensions library. We connect to the API server from within the cluster using the InClusterConfig method from the package. So far, all of these methods are just calls with appropriate parameters and checks of return values.

With the example involving the backup of user data, we have shown that the packing and unpacking of an archive can be described with a custom resource called Backup. This resource can then be scoped to the corresponding resource for which the backup makes sense. The availability of the custom resource implies that we can use the K8s API and CLI to request it.
We have also shown that the bytes of the backup can be inlined into the outgoing K8s representation from the api-server in both yaml and json formats, which allows the user to use such things as jsonpath to go directly to the archive data and download it as a file with standard command-line arguments.
It is also possible to make the K8s API directly transfer the archive bytes with its own resource. This resource would then be decoded into a file. There is no limit on the bytes sent over the API response, so the size of the archive can be arbitrary. It would be better for the preparation of the archive to be decoupled from its transfer, because the transfer might be interrupted. It is also important that the archive be properly cleaned up if it is persisted locally, since it can be of any size.
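As a sketch of reading such an inlined archive, assuming a Backup custom resource in a hypothetical example.com/v1alpha1 group with the archive bytes base64-encoded in an assumed field:

# Sketch: fetch a hypothetical Backup custom object and decode its inlined
# archive bytes to a local file. The group, version, plural and the
# status.archive field name are illustrative assumptions.
import base64
from kubernetes import client, config

config.load_incluster_config()
api = client.CustomObjectsApi()

backup = api.get_namespaced_custom_object(
    group="example.com", version="v1alpha1",
    namespace="default", plural="backups", name="my-backup")

with open("my-backup.tar.gz", "wb") as f:
    f.write(base64.b64decode(backup["status"]["archive"]))  # assumed field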
Why is the scope and purpose of this custom resource for an archive important? It helps the users' data to be packaged from project to project regardless of what it comprises and where it is stored – file, blob, or stream. This portability of data, independent of the logic and the platform, improves mobility and productivity for the end-user.


Monday, March 2, 2020

Writing a custom resource definition in Kubernetes:

Purpose: Applications hosted on Kubernetes may want to introduce their own Kubernetes objects so that they can be used like any other and leverage supported features such as usage from the command line and the Kubernetes API, and be secured with role-based access control. The objects are also stored in etcd. This article explains the steps to create a custom resource in Kubernetes using the operators written for it.

A stock custom resource definition is a yaml configuration file that has the attributes

Kind: “CustomResourceDefinition”

and a specification with group, version and scope fields, where the group defines the API collection that relates the objects, the version is usually "v1alpha1" or one of the supported strings, and the scope states whether the object is available within a namespace or cluster-wide. The object is also given names by which it can be called in singular, plural and by kind. The metadata of the object is usually constructed from its plural name and group. The definition then includes the properties specific to the resource.


The programmatic way of doing this allows us to do a little more than the declarations. It lets us introduce dynamic behavior. We do this by calling the methods from the client-go package for the create, update and delete of the resource to use with the Kubernetes API server. These methods are referred to as the 'clientset' because they use the corresponding methods from the Kubernetes apiextensions library. We connect to the API server from within the cluster using the InClusterConfig method from the package. So far, all of these methods are just calls with appropriate parameters and checks of return values.
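For illustration only, here is a rough analogue of those clientset calls using the official Python kubernetes client instead of client-go; the group, version, names and metadata below are assumptions that mirror the yaml fields described above:

# Illustrative only: an in-cluster, programmatic creation of a custom resource
# definition with the Python kubernetes client. The group, version and names
# are assumptions mirroring the fields described above.
from kubernetes import client, config

config.load_incluster_config()   # connect to the API server from inside the cluster

crd_body = {
    "apiVersion": "apiextensions.k8s.io/v1beta1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "backups.example.com"},   # plural name + group
    "spec": {
        "group": "example.com",
        "version": "v1alpha1",
        "scope": "Namespaced",
        "names": {"plural": "backups", "singular": "backup", "kind": "Backup"},
    },
}

api = client.ApiextensionsV1beta1Api()
api.create_custom_resource_definition(body=crd_body)   # update and delete are analogous calls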

The code-generator package allows us to create these packages:
# vendor/k8s.io/code-generator/generate-groups.sh
Usage: generate-groups.sh <generators> <output-package> <apis-package> <groups-versions> ...

  <generators>        the generators comma separated to run (deepcopy,defaulter,client,lister,informer) or "all".
  <output-package>    the output package name (e.g. github.com/example/project/pkg/generated).
  <apis-package>      the external types dir (e.g. github.com/example/api or github.com/example/project/pkg/apis).
  <groups-versions>   the groups and their versions in the format "groupA:v1,v2 groupB:v1 groupC:v2", relative
                      to <api-package>.
  ...                 arbitrary flags passed to all generator binaries.


Examples:
  generate-groups.sh all             github.com/example/project/pkg/client github.com/example/project/pkg/apis "foo:v1 bar:v1alpha1,v1beta1"
  generate-groups.sh deepcopy,client github.com/example/project/pkg/client github.com/example/project/pkg/apis "foo:v1 bar:v1alpha1,v1beta1"

For example,
vendor/k8s.io/code-generator/generate-groups.sh deepcopy,client path/to/project/pkg/client path/to/project/pkg/apis "project:v1alpha1"


The most interesting part of creating these objects is the specification of other objects as properties and the chaining of their 'ownerReference', which allows us to introduce hierarchy, composition, and scoped actions.
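As a small illustration of that chaining with the Python client, here is how an ownerReference might be set on a child custom object; the Backup and Archive kinds, group, and names are illustrative assumptions:

# Sketch: chain a child custom object to its parent via metadata.ownerReferences.
# The Backup/Archive kinds, group and object names are illustrative assumptions.
from kubernetes import client, config

config.load_incluster_config()
api = client.CustomObjectsApi()

parent = api.get_namespaced_custom_object(
    group="example.com", version="v1alpha1",
    namespace="default", plural="backups", name="my-backup")

child = {
    "apiVersion": "example.com/v1alpha1",
    "kind": "Archive",
    "metadata": {
        "name": "my-backup-archive",
        "ownerReferences": [{
            "apiVersion": "example.com/v1alpha1",
            "kind": "Backup",
            "name": parent["metadata"]["name"],
            "uid": parent["metadata"]["uid"],   # ties the child to this parent instance
        }],
    },
}

api.create_namespaced_custom_object(
    group="example.com", version="v1alpha1",
    namespace="default", plural="archives", body=child)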