Tuesday, March 3, 2020


The programmatic way of writing a custom resource definition allows us to do a little bit more than the declarations. It lets us introduce dynamic behavior. We do this by calling the methods from the client-go package to create, update and delete the resource with the Kubernetes API server. These methods are referred to as the ‘clientset’ because they wrap the corresponding methods from the Kubernetes apiextensions library. We connect to the API server from within the cluster using the InClusterConfig method from client-go’s rest package. So far all of these methods are just calls with the appropriate parameters and checks of the return values.
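As a minimal sketch of that flow, assuming the apiextensions v1beta1 types and the pre-1.18 client-go call signatures (newer releases add a context and options argument), the in-cluster connection and the clientset call look roughly like this:

package controller

import (
    apiextensionsv1beta1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1"
    apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
    "k8s.io/client-go/rest"
)

// registerCRD connects to the API server from inside the cluster and creates
// the given custom resource definition through the apiextensions clientset.
func registerCRD(crd *apiextensionsv1beta1.CustomResourceDefinition) error {
    // InClusterConfig reads the service account token and API server address
    // that Kubernetes mounts into every pod.
    config, err := rest.InClusterConfig()
    if err != nil {
        return err
    }

    // The clientset wraps the apiextensions API group; Update and Delete
    // are available on the same interface.
    clientset, err := apiextensionsclient.NewForConfig(config)
    if err != nil {
        return err
    }

    _, err = clientset.ApiextensionsV1beta1().CustomResourceDefinitions().Create(crd)
    return err
}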

With the example involving backup of user data, we have shown that the packing and unpacking of an archive can be described with a custom resource called Backup. This resource can then be scoped to the resource for which the backup makes sense. The availability of the custom resource implies that we can use the K8s API and CLI to request it.
We have also shown that the bytes of the backup can be inlined into the outgoing K8s representation from the api-server in both yaml and json format, which allows the user to use such things as jsonpath to go directly to the archive data and download it as a file with standard command-line arguments.
It is also possible to make the K8s API transfer the archive bytes directly as its own resource, which would then be decoded into a file. Since the API response does not cap the bytes sent, the size of the archive can be arbitrary. It is better for the preparation of the archive to be decoupled from its transfer, because the transfer might be interrupted. It is also important that the archive be properly cleaned up if it is persisted locally, since it can be of any size.
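As a hypothetical sketch of the Go types behind such a Backup resource (the group, field names and the inline Data field are assumptions, not an existing API), note that a []byte field is rendered as base64 in the json and yaml output, which is what makes the jsonpath extraction described above possible:

package v1alpha1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// BackupSpec names the resource whose data should be archived.
type BackupSpec struct {
    // Target is the name of the resource this backup is scoped to.
    Target string `json:"target"`
}

// BackupStatus reports progress and carries the archive itself.
type BackupStatus struct {
    Phase string `json:"phase,omitempty"`
    // Data holds the archive bytes inline; encoding/json marshals []byte
    // as a base64 string in the api-server's output.
    Data []byte `json:"data,omitempty"`
}

// Backup is the hypothetical custom resource for packing and unpacking an archive.
type Backup struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   BackupSpec   `json:"spec"`
    Status BackupStatus `json:"status,omitempty"`
}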
Why are the scope and purpose of this custom resource for an archive important? They allow the user’s data to be packaged from project to project regardless of what it comprises and where it is stored – file, blob, or stream. This portability of data, independent of the logic and the platform, improves mobility and productivity for the end-user.


Monday, March 2, 2020

Writing a custom resource definition in Kubernetes:

Purpose: Applications hosted on Kubernetes want to introduce their own Kubernetes object so that it can be used like any other, leveraging supported features such as access from the command line and the Kubernetes API, security with role-based access control, and storage in etcd. This article explains the steps to create a custom resource in Kubernetes using the operators written for it.

A stock custom resource definition is a yaml configuration file that has the attribute

kind: “CustomResourceDefinition”

and a specification with group, version and scope fields, where the group defines the api collection that relates the objects, the version is usually “v1alpha1” or one of the supported strings, and the scope states whether the object is available within a namespace or cluster-wide. The object is also given names by which it can be called in singular, plural and by kind. The metadata name of the object is usually constructed from its plural name and group. The definition then includes the properties specific to the resource.


The programmatic way of doing this allows us to do a little bit more than the declarations. It lets us introduce dynamic behavior. We do this by calling the methods from the client-go package to create, update and delete the resource with the Kubernetes API server. These methods are referred to as the ‘clientset’ because they wrap the corresponding methods from the Kubernetes apiextensions library. We connect to the API server from within the cluster using the InClusterConfig method from client-go’s rest package. So far all of these methods are just calls with the appropriate parameters and checks of the return values.
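As a sketch of what the definition itself looks like in code (the example.com group, the Example kind and the v1alpha1 version are placeholders), the attributes described above map onto the CustomResourceDefinition object from the apiextensions library, which is then handed to the clientset for Create:

package controller

import (
    apiextensionsv1beta1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newCRD builds the definition with the group, version, scope and names
// described above. The metadata name is the plural name joined with the group.
func newCRD() *apiextensionsv1beta1.CustomResourceDefinition {
    return &apiextensionsv1beta1.CustomResourceDefinition{
        ObjectMeta: metav1.ObjectMeta{
            Name: "examples.example.com", // <plural>.<group>
        },
        Spec: apiextensionsv1beta1.CustomResourceDefinitionSpec{
            Group: "example.com",
            Versions: []apiextensionsv1beta1.CustomResourceDefinitionVersion{
                {Name: "v1alpha1", Served: true, Storage: true},
            },
            Scope: apiextensionsv1beta1.NamespaceScoped, // or ClusterScoped
            Names: apiextensionsv1beta1.CustomResourceDefinitionNames{
                Singular: "example",
                Plural:   "examples",
                Kind:     "Example",
            },
            // The properties specific to the resource would follow here as an
            // OpenAPI v3 schema under the Validation field.
        },
    }
}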

The code-generator package allows us to generate the client package along with deepcopy functions, listers and informers:
# vendor/k8s.io/code-generator/generate-groups.sh
Usage: generate-groups.sh <generators> <output-package> <apis-package> <groups-versions> ...

  <generators>        the generators comma separated to run (deepcopy,defaulter,client,lister,informer) or "all".
  <output-package>    the output package name (e.g. github.com/example/project/pkg/generated).
  <apis-package>      the external types dir (e.g. github.com/example/api or github.com/example/project/pkg/apis).
  <groups-versions>   the groups and their versions in the format "groupA:v1,v2 groupB:v1 groupC:v2", relative
                      to <api-package>.
  ...                 arbitrary flags passed to all generator binaries.


Examples:
  generate-groups.sh all             github.com/example/project/pkg/client github.com/example/project/pkg/apis "foo:v1 bar:v1alpha1,v1beta1"
  generate-groups.sh deepcopy,client github.com/example/project/pkg/client github.com/example/project/pkg/apis "foo:v1 bar:v1alpha1,v1beta1"

For example,
vendor/k8s.io/code-generator/generate-groups.sh deepcopy,client path/to/project/pkg/client path/to/project/pkg/apis "project:v1alpha1"
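The generators consume a types file annotated with the usual tags; a minimal sketch, assuming a placeholder group project, version v1alpha1 and kind Example, might look like this:

// pkg/apis/project/v1alpha1/types.go
package v1alpha1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +genclient
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// Example is a placeholder custom resource handled by the generated clientset.
type Example struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   ExampleSpec   `json:"spec"`
    Status ExampleStatus `json:"status,omitempty"`
}

// ExampleSpec holds the desired state.
type ExampleSpec struct {
    Message string `json:"message,omitempty"`
}

// ExampleStatus holds the observed state.
type ExampleStatus struct {
    Ready bool `json:"ready,omitempty"`
}

// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// ExampleList is a list of Example objects.
type ExampleList struct {
    metav1.TypeMeta `json:",inline"`
    metav1.ListMeta `json:"metadata,omitempty"`

    Items []Example `json:"items"`
}

The generators also expect a doc.go that carries the // +groupName tag and a register.go that adds these types to a runtime scheme.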


The most interesting part of creating these objects is the specification of other objects as properties and the chaining of their ‘ownerReference’, which allows us to introduce hierarchy, composition, and scoped actions.
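A small sketch of that chaining, assuming a hypothetical parent custom resource of kind Example in group example.com and a ConfigMap as the owned child; once the ownerReference is set, deleting the parent cascades to the child through garbage collection:

package controller

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
)

// childFor returns a ConfigMap owned by the given parent object by chaining
// an ownerReference from the child back to the parent.
func childFor(parent metav1.Object) *corev1.ConfigMap {
    gvk := schema.GroupVersionKind{Group: "example.com", Version: "v1alpha1", Kind: "Example"}
    return &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      parent.GetName() + "-child",
            Namespace: parent.GetNamespace(),
            OwnerReferences: []metav1.OwnerReference{
                *metav1.NewControllerRef(parent, gvk),
            },
        },
        Data: map[string]string{"owned": "true"},
    }
}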



Sunday, March 1, 2020

Writing a custom resource definition in Kubernetes:
Purpose: Applications hosted on Kubernetes want to introduce their own Kubernetes object so that it can be used like any other, leveraging supported features such as access from the command line and the Kubernetes API, security with role-based access control, and storage in etcd. This article explains the steps to create a custom resource in Kubernetes using the operators written for it.
A stock custom resource definition is a yaml configuration file that has the attribute
kind: “CustomResourceDefinition”
and a specification with group, version and scope fields, where the group defines the api collection that relates the objects, the version is usually “v1alpha1” or one of the supported strings, and the scope states whether the object is available within a namespace or cluster-wide. The object is also given names by which it can be called in singular, plural and by kind. The metadata name of the object is usually constructed from its plural name and group. The definition then includes the properties specific to the resource.
The resource itself can be created from this definition by specifying the “kind: <singular>” value and values for all the properties required for the resource. Again, the resource can also be declared in yaml just like the custom resource definition.
The programmatic way of doing this allows us to do a little bit more than the declarations. It lets us introduce dynamic behavior. We do this by calling the methods from the client-go package to create, update and delete the resource with the Kubernetes API server. These methods are referred to as the ‘clientset’ because they wrap the corresponding methods from the Kubernetes apiextensions library. We connect to the API server from within the cluster using the InClusterConfig method from client-go’s rest package. So far all of these methods are just calls with the appropriate parameters and checks of the return values.
The objects for these custom resources can now be created with a ‘spec’ for the fields and a ‘status’. Their metadata can be added and a collection can be defined for these objects.
A custom client can now be registered to take the calls over the wire from the command-line interface or the REST API.
Then the invocation for the creation of the custom resource is just
resp, err := crdclient.CustomObject("default").Create(object)
The most interesting part of creating these objects is the specification of other objects as properties and the chaining of their ‘ownerReference’, which allows us to introduce hierarchy, composition, and scoped actions.





Saturday, February 29, 2020

Archival using streams discussion is continued:
Traditionally, blob storage has offered benefits for archival:
1) Backups tend to be binary. Blobs can be large and binary
2) Backups and archival platforms can take data over protocols and blobs can be transferred natively or via protocols
3) Cold storage class of object storage is well suited for archival platforms
Stream stores allow one stream to be converted to another and do away with storage classes.

The cold, warm and hot regions of the stream perform the equivalent of storage classes.

Data transformation routines can be offloaded to compute outside the storage, if necessary, to transform the data prior to archival. The idea here is to package a common technique across data sources that would otherwise handle their own archival preparations for their data streams. All in all, it becomes an archival management system rather than remaining a store.

Let us compare a stream archival policy evaluator with Artifactory. Both of them emphasize the following:  

1) Proper naming with a format like “<team>-<technology>-<maturity>-<locator>”, where

the team is the primary identifier for the project,

the technology, tool or package type indicates what is being used,

the maturity level indicates the stage of processing of the binary objects, such as debug or release, and

the locator indicates the topology of the artifact.

With such proper naming, both make use of rules that leverage this organization for effective processing.

2) Both make use of three main categories: security, performance and operability.

Security permissions determine who has access to the streams.

Performance considerations determine the cleanup policies so that the repositories perform efficiently.

Operability considerations determine whether objects need to be in different repositories to improve read, write and delete access and to reduce interference.

3) Both of them make heavy use of local, remote and virtual repositories to their advantage in getting and putting objects.

While Artifactory relies on organizations to determine their own policies, the stream policy evaluator is a manifestation of those policies and spans repositories, organizations and administration responsibilities.

Friday, February 28, 2020

Archival using streams discussion is continued:
Traditionally, blob storage has offered benefits for archival:
1) Backups tend to be binary. Blobs can be large and binary
2) Backups and archival platforms can take data over protocols and blobs can be transferred natively or via protocols
3) Cold storage class of object storage is well suited for archival platforms
Stream stores allow one stream to be converted to another and do away with storage classes.
The cold, warm and hot regions of the stream perform the equivalent of storage classes.


Their treatment can also be based on policies, just like the use of storage classes.
The rules for a storage class need not be a mere regex translation of an outbound destination to another site-specific address. We are dealing with stream transformations and conversions to another stream. The cold, warm and hot regions need not exist in the same stream all the time. They can be written to their own independent streams before being processed. Also, the processing policy can be different for each and written in the form of a program. We are not just putting the streams on steroids, we are also making them smarter by allowing the administrator to customize the policies. These policies can be authored in the form of expressions and statements, much like a program with lots of if-then conditions ordered by their execution sequence. The source of truth remains unchanged while data is copied to streams where it can be better suited for all extract, transform and load operations. The data transformation routines can be offloaded to compute outside the storage, if necessary, to transform the data prior to archival. The idea here is to package a common technique across data sources that would otherwise handle their own archival preparations for their data streams. All in all, it becomes an archival management system rather than remaining a store.
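As a purely hypothetical sketch of such a policy program (the rule shape, the region names and the age thresholds are assumptions, not an existing stream store API), the ordered if-then conditions could be modeled like this:

package policy

import "time"

// Segment is a simplified view of a stream segment for policy evaluation.
type Segment struct {
    Stream string
    Age    time.Duration
    Reads  int
}

// Rule pairs a condition with the action to take; rules run in the order given.
type Rule struct {
    Name   string
    When   func(Segment) bool
    Action func(Segment)
}

// Evaluate applies the first matching rule to the segment, mirroring a program
// of if-then conditions ordered by their execution sequence.
func Evaluate(s Segment, rules []Rule) {
    for _, r := range rules {
        if r.When(s) {
            r.Action(s)
            return
        }
    }
}

// DefaultRules routes segments to cold, warm and hot streams by age and usage.
func DefaultRules(route func(seg Segment, target string)) []Rule {
    return []Rule{
        {
            Name:   "archive-cold",
            When:   func(s Segment) bool { return s.Age > 90*24*time.Hour && s.Reads == 0 },
            Action: func(s Segment) { route(s, s.Stream+"-cold") },
        },
        {
            Name:   "demote-warm",
            When:   func(s Segment) bool { return s.Age > 7*24*time.Hour },
            Action: func(s Segment) { route(s, s.Stream+"-warm") },
        },
        {
            Name:   "keep-hot",
            When:   func(s Segment) bool { return true },
            Action: func(s Segment) { route(s, s.Stream+"-hot") },
        },
    }
}

The ordering matters: the first matching rule wins, which is what makes the execution sequence part of the policy.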

Thursday, February 27, 2020

Archival using data stream store:

As new data is added to existing records and the old ones are retired, the data storage grows to a very large size. The number of active records within the store may be only a fraction of the total and is more often segregated by a time window. Consequently, software engineers perform a technique called archiving, which moves older and unused records to tertiary storage. This technique is robust and involves some interesting considerations, as discussed in the earlier post.

With programmability for streams, it is relatively easy to translate the operations described in the earlier post to a stream store. The streams have bands of cold, warm and hot data with progressive frontiers that make it easy to adjust the width of each region. The stream store is already considered durable and fit for archival, so the adjustment of width alone can overcome the necessity to move data. Some number of segments from the cold store can become candidates for off-site archival.

Tiering enables policies to be specified for the generational mark-down of data and its movement between tiers. This enables differentiation of the hardware backing the space to suit various kinds of storage traffic. By providing tiers, the storage space is prioritized based on media cost and usage. Archival systems are considered low-cost storage because the data is usually cold.
Data warehouses used to be the graveyard for online transactional data. As data is passed to this environment, it changes from current value to historical data. As such, a system of record for historical data is created, and this is then used for all kinds of DSS processing. The Corporate Information Factory that the data warehouse evolved into had two prominent features - the virtual operational data store and the addition of unstructured data. The VODS was a feature that allowed organizations to access data on the fly without building an infrastructure. This meant that corporate communications could now be combined with corporate transactions to paint a more complete picture. CIF had an archival feature whereby data would be transferred from the data warehouse to nearline storage using a cross-media storage manager (CMSM) and then retired to archival.
Stream stores don’t have native storage. They are hosted on tier 2, so they look like files and blobs and are subsequently sent to their own tertiary storage. If the stream stores were native on disk, their archival would target the cold end of the streams.
Between files and blobs, we suggest object storage is better suited for archival. Object storage is best suited for using blobs as inputs for backup and archival and fits very well into tier 2 of the tiering described earlier:
Here we suggest that the storage class make use of dedicated long-term media on the storage cluster and a corresponding service to automatically promote objects as they age.

Wednesday, February 26, 2020

Archival using data stream store:
As new data is added to existing records and the old ones are retired, the data storage grows to a very large size. The number of active records within the store may be only a fraction of the total and is more often segregated by a time window. Consequently, software engineers perform a technique called archiving, which moves older and unused records to tertiary storage. This technique is robust and involves some interesting considerations. We compare this technique as applied to relational tables with a similar strategy for streams, taking the example of an inventory table with assets as records.
The assets are continuously added and retired. Therefore there is no fixed set to work with, and the records may have to be scanned again and again. Fortunately, the assets on which the archiving action needs to be performed do not accumulate forever, as the archival catches up with the rate of retirement.
The retirement policy may depend not just on the age but on several other attributes of the assets. Therefore the archival may have the policy stored as separate logic to evaluate each asset against. Since the archival is expected to run over and over again, it is convenient to revisit each asset with these criteria to see if the asset can now be retired.
The archival action may fail, and the source and destination must remain clean without any duplication or partial entries. Consider the case when the asset is removed from the source but not added to the destination: it may be lost forever if the archival fails before the retired asset makes it to the destination table. Similarly, if the asset has already been moved to the destination table, there must not be another entry for the same asset if the archival runs again and finds the original entry lingering in the source table.
This leads to a policy where the selection of the asset, the insertion into the destination and the removal from the original are done in a transaction that guarantees all parts of the operation happen successfully or are rolled back to just before these operations started. But this can be relaxed by adding checks in each of the statements to make sure each operation can be taken again and again on the same asset with a forward-only movement of the asset from the source to the destination. This is often referred to as reentrant logic and helps take action on the assets in a failsafe manner without requiring the use of locks and logs for the overall set of select, insert and delete.
The set of three actions mentioned above works on only one asset at a time. This is a prudent and mature consideration because the storage requirement and the possible changes to the asset are minimized when we work on one asset at a time instead of several. Consider the case when an asset is wrongly marked as ready for retirement and then reverted. If it were part of a list of, say, ten assets being archived, it might force the other nine to be rolled back and the actions repeated excluding the one. On the other hand, if we work with only one asset at a time, the rest of the inventory is untouched.
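A minimal sketch of that reentrant, one-asset-at-a-time logic, assuming hypothetical source and destination stores with exists/insert/delete operations (not any particular database API):

package archival

// Asset is a single inventory record considered for archival.
type Asset struct {
    ID      string
    Retired bool
}

// Store is a hypothetical interface over the source and destination tables.
type Store interface {
    Exists(id string) (bool, error)
    Insert(a Asset) error
    Delete(id string) error
}

// ArchiveOne moves a single asset from src to dst. Every step checks the
// current state first, so the whole function can be re-run safely after a
// failure: the asset only ever moves forward from source to destination.
func ArchiveOne(a Asset, retire func(Asset) bool, src, dst Store) error {
    // Evaluate the retirement policy for this asset; skip if it does not apply.
    if !retire(a) {
        return nil
    }

    // Insert into the destination only if it is not already there
    // (a previous run may have failed right after this step).
    present, err := dst.Exists(a.ID)
    if err != nil {
        return err
    }
    if !present {
        if err := dst.Insert(a); err != nil {
            return err
        }
    }

    // Remove from the source only after the destination copy is confirmed.
    stillThere, err := src.Exists(a.ID)
    if err != nil {
        return err
    }
    if stillThere {
        return src.Delete(a.ID)
    }
    return nil
}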
With programmability for streams, it is relatively easy to translate the operations described above to a stream store. The streams have bands of cold, warm and hot data with progressive frontiers that make it easy to adjust the width of each region. The stream store is already considered durable and fit for archival, so the adjustment of width alone can overcome the necessity to move data. Some number of segments from the cold store can become candidates for off-site archival.