Sunday, March 1, 2020

Writing a custom resource definition in Kubernetes:
Purpose: Applications hosted on Kubernetes often want to introduce their own Kubernetes object so that it can be used like any other and leverage supported features such as access from the command line and the Kubernetes API, and protection with role-based access control. The object is also stored in etcd. This article explains the steps to create a custom resource in Kubernetes using the operators written for it.
A stock custom resource definition is a YAML configuration file with the attribute
kind: "CustomResourceDefinition"
and a specification with group, version and scope fields, where the group defines the API collection that relates the objects, the version is usually "v1alpha1" or one of the other supported strings, and the scope states whether the object is available within a namespace or cluster-wide. The object is also given names to be called by in singular, plural and kind forms. The metadata name of the object is usually constructed from its plural name and group. The definition then includes the properties specific to the resource.
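As a minimal sketch, the same fields can be expressed with the Go types from the apiextensions library that the programmatic path below relies on; the example.com group, the Widget names and the size property are illustrative assumptions, not part of any real product.

// A minimal sketch of the definition above using the apiextensions.k8s.io/v1 Go types;
// the group, names and properties here are assumptions for illustration.
package main

import (
    apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func newCRD() *apiextensionsv1.CustomResourceDefinition {
    return &apiextensionsv1.CustomResourceDefinition{
        // metadata.name is constructed from the plural name and the group
        ObjectMeta: metav1.ObjectMeta{Name: "widgets.example.com"},
        Spec: apiextensionsv1.CustomResourceDefinitionSpec{
            Group: "example.com",
            Scope: apiextensionsv1.NamespaceScoped, // or ClusterScoped for cluster-wide objects
            Names: apiextensionsv1.CustomResourceDefinitionNames{
                Singular: "widget",
                Plural:   "widgets",
                Kind:     "Widget",
            },
            Versions: []apiextensionsv1.CustomResourceDefinitionVersion{{
                Name:    "v1alpha1",
                Served:  true,
                Storage: true,
                // the properties specific to the resource
                Schema: &apiextensionsv1.CustomResourceValidation{
                    OpenAPIV3Schema: &apiextensionsv1.JSONSchemaProps{
                        Type: "object",
                        Properties: map[string]apiextensionsv1.JSONSchemaProps{
                            "spec": {
                                Type: "object",
                                Properties: map[string]apiextensionsv1.JSONSchemaProps{
                                    "size": {Type: "integer"},
                                },
                            },
                        },
                    },
                },
            }},
        },
    }
}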
The resource itself can be created from this definition by specifying the kind registered in the definition and supplying values for all the properties required by the resource. Again, the resource can also be declared in YAML just like the custom resource definition itself.
The programmatic way of doing this allows us to do a little bit more than the declarations: it lets us introduce dynamic behavior. We do this by calling the methods from the client-go package to create, update and delete the resource against the Kubernetes API server. These methods are grouped as a ‘clientset’ because they use the corresponding methods from the Kubernetes apiextensions library. We connect to the API server from within the cluster using the InClusterConfig method from the package. So far, all of these methods are just calls with appropriate parameters and checks of return values.
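A hedged sketch of that connection and registration, assuming a recent client-go and apiextensions clientset where the Create call takes a context, might look like the following; the crd argument could be the definition built in the sketch above.

// Connect from inside the cluster and register a CustomResourceDefinition.
// This is a sketch: error handling is minimal and a recent client-go is assumed.
package main

import (
    "context"

    apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
    apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/rest"
)

func registerCRD(crd *apiextensionsv1.CustomResourceDefinition) error {
    config, err := rest.InClusterConfig() // we are running inside the cluster
    if err != nil {
        return err
    }
    clientset, err := apiextensionsclient.NewForConfig(config)
    if err != nil {
        return err
    }
    _, err = clientset.ApiextensionsV1().
        CustomResourceDefinitions().
        Create(context.TODO(), crd, metav1.CreateOptions{})
    return err
}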
The objects for these custom resources can now be created with a ‘spec’ for the fields and a ‘status’. Their metadata can be added, and a collection type can be defined for these objects.
A custom client can now be registered to take the calls over the wire from the command-line interface or the REST API.
Then the invocation for creating the custom resource is just:
resp, err := crdclient.CustomObject("default").Create(object)
The most interesting part of this object creation is the specification of other objects as properties and the chaining of their ‘ownerReference’, which allows us to introduce hierarchy, composition, and scoped actions.
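To illustrate that chaining, here is a hedged sketch that uses the dynamic client rather than the generated crdclient above; the Widget names, the demo-widget object and its size field are assumptions, and a recent client-go signature is assumed.

// A sketch of creating a custom object with an ownerReference chained to a
// parent object, using the dynamic client; all names here are illustrative.
package main

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/rest"
)

func createWidget(owner *unstructured.Unstructured) error {
    config, err := rest.InClusterConfig()
    if err != nil {
        return err
    }
    client, err := dynamic.NewForConfig(config)
    if err != nil {
        return err
    }
    gvr := schema.GroupVersionResource{Group: "example.com", Version: "v1alpha1", Resource: "widgets"}
    widget := &unstructured.Unstructured{Object: map[string]interface{}{
        "apiVersion": "example.com/v1alpha1",
        "kind":       "Widget",
        "metadata": map[string]interface{}{
            "name": "demo-widget",
            // ownerReferences chain this object to its parent so that
            // hierarchy, composition and scoped actions follow from it.
            "ownerReferences": []interface{}{
                map[string]interface{}{
                    "apiVersion": owner.GetAPIVersion(),
                    "kind":       owner.GetKind(),
                    "name":       owner.GetName(),
                    "uid":        string(owner.GetUID()),
                },
            },
        },
        "spec": map[string]interface{}{"size": 3},
    }}
    _, err = client.Resource(gvr).Namespace("default").Create(context.TODO(), widget, metav1.CreateOptions{})
    return err
}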

Saturday, February 29, 2020

Archival using streams discussion is continued:
Traditionally, blob storage has developed benefits for archival:
1) Backups tend to be binary. Blobs can be large and binary
2) Backups and archival platforms can take data over protocols and blobs can be transferred natively or via protocols
3) Cold storage class of object storage is well suited for archival platforms
Stream stores allow one stream to be converted to another and do away with storage classes.

The cold, warm and hot regions of the stream perform the equivalent of a storage class.

The data transformation routines can be offloaded to compute outside the storage, if necessary, to transform the data prior to archival. The idea here is to package a common technique across data sources that handle their own archival preparations across data streams. All in all, it becomes an archival management system rather than remaining a store.

Let us compare a stream archival policy evaluator with Artifactory. Both of them emphasize the following:  

1) Proper naming, with a format like “<team>-<technology>-<maturity>-<locator>”, where

the team is the primary identifier for the project,

the technology indicates the tool or package type being used,

the maturity level indicates the stage of processing of the binary objects, such as debug or release,

and the locator indicates the topology of the artifact.

With such proper naming, both make use of rules that leverage this organization for effective processing; a small sketch of such a naming rule follows this comparison.

2) Both make use of three main categories: security, performance and operability.   

Security permissions determine who has access to the streams.  

Performance considerations determine the cleanup policies such that the repositories are performing efficiently.  

Operability considerations determine whether objects need to be in different repositories to improve read, write and delete access to reduce interference  

3) Both of them make heavy use of local, remote and virtual repositories to their advantage in getting and putting objects  

While Artifactory relies on organizations to determine their own policies, the stream policy evaluator is a manifestation of those policies and spans across repositories, organizations and administration responsibilities.
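As promised above, here is a minimal sketch, under assumed names and retention values, of evaluating a stream name of the form <team>-<technology>-<maturity>-<locator> against a cleanup rule.

// A sketch of a naming-based policy rule; the Policy type, the maturity
// values and the retention periods are assumptions for illustration.
package main

import (
    "fmt"
    "strings"
)

// Policy is a hypothetical rule: streams of a given maturity are retained
// for a bounded number of days before archival.
type Policy struct {
    Maturity      string
    RetentionDays int
}

func evaluate(streamName string, policies []Policy) (int, error) {
    parts := strings.SplitN(streamName, "-", 4)
    if len(parts) != 4 {
        return 0, fmt.Errorf("stream %q does not follow <team>-<technology>-<maturity>-<locator>", streamName)
    }
    maturity := parts[2]
    for _, p := range policies {
        if p.Maturity == maturity {
            return p.RetentionDays, nil
        }
    }
    return 0, fmt.Errorf("no policy for maturity %q", maturity)
}

func main() {
    policies := []Policy{{Maturity: "debug", RetentionDays: 7}, {Maturity: "release", RetentionDays: 365}}
    days, err := evaluate("analytics-gradle-debug-us-west", policies)
    fmt.Println(days, err) // 7 <nil>
}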

Friday, February 28, 2020

Archival using streams discussion is continued:
Traditionally, blob storage has developed benefits for archival:
1) Backups tend to be binary. Blobs can be large and binary
2) Backups and archival platforms can take data over protocols and blobs can be transferred natively or via protocols
3) Cold storage class of object storage is well suited for archival platforms
Stream stores allow one stream to be converted to another and do away with storage classes.
The cold, warm and hot regions of the stream perform the equivalent of a storage class.


Their treatment can also be based on policies, just like the use of storage classes.
The rules for a storage class need not be mere regex translations of an outbound destination to another site-specific address. We are dealing with stream transformations and conversions of one stream to another. The cold, warm and hot regions need not exist in the same stream all the time. They can be written to their own independent streams before being processed. Also, the processing policy can be different for each region and written in the form of a program. We are not just putting the streams on steroids, we are also making them smarter by allowing the administrator to customize the policies. These policies can be authored in the form of expressions and statements, much like a program with lots of if-then conditions ordered by their execution sequence. The source of truth remains unchanged while data is copied to streams where it can be better suited for all extract, transform and load operations. The data transformation routines can be offloaded to compute outside the storage, if necessary, to transform the data prior to archival. The idea here is to package a common technique across data sources that handle their own archival preparations across data streams. All in all, it becomes an archival management system rather than remaining a store.
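To make the offloading concrete, here is a hypothetical sketch of such a routine; the Reader, Writer and Transform types are assumptions standing in for whatever client the stream store actually provides.

// A sketch of offloading a transformation outside the stream store.
package main

type Event []byte

// Reader and Writer stand in for the stream store's client interfaces.
type Reader interface {
    Next() (Event, bool)
}
type Writer interface {
    Append(Event) error
}

// Transform prepares an event for archival, e.g. compression or redaction.
type Transform func(Event) Event

// archiveColdRegion copies the cold region of a source stream into a
// dedicated archival stream, applying the transformation en route, so the
// source of truth remains unchanged.
func archiveColdRegion(cold Reader, archive Writer, t Transform) error {
    for {
        e, ok := cold.Next()
        if !ok {
            return nil
        }
        if err := archive.Append(t(e)); err != nil {
            return err
        }
    }
}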

Thursday, February 27, 2020

Archival using data stream store:

As new data is added to existing data and old records are retired, the data store grows to a very large size. The number of active records within the store may be only a fraction and is more often segregated by a time window. Consequently, software engineers perform a technique called archiving, which moves older and unused records to tertiary storage. This technique is robust and involves some interesting considerations, as discussed in the earlier post.

With programmability for streams, it is relatively easy to translate the operations described in the earlier post to a stream store. The streams have bands of cold, warm and hot data with progressive frontiers that make it easy to adjust the width of each region. The stream store is already considered durable and fit for archival, so the adjustment of width alone can remove the need to move data. Some number of segments from the cold region can become candidates for off-site archival.
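A minimal sketch of those progressive frontiers, with assumed types and offsets, might look like this; archival only moves the frontier rather than moving data.

// Hot/warm/cold regions of a stream represented as frontiers over offsets.
package main

// Frontiers mark the boundaries between regions of an append-only stream.
// Offsets below ColdEnd are candidates for off-site archival.
type Frontiers struct {
    ColdEnd int64 // everything before this offset is cold
    WarmEnd int64 // [ColdEnd, WarmEnd) is warm; [WarmEnd, head) is hot
}

// Advance widens the cold region by moving the cold frontier forward,
// never past the warm frontier, so no data is copied or rewritten.
func (f *Frontiers) Advance(newColdEnd int64) {
    if newColdEnd > f.ColdEnd && newColdEnd <= f.WarmEnd {
        f.ColdEnd = newColdEnd
    }
}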

Tiering enables policies to be specified for the generational mark-down of data and its movement between tiers. This enables hardware to be differentiated to suit various kinds of storage traffic. By providing tiers, the storage space is prioritized based on media cost and usage. Archival systems are considered low-cost storage because the data is usually cold.
Data warehouses used to be the graveyard for online transactional data. As data is passed to this environment, it changes from current values to historical data. As such, a system of record for historical data is created, and this is then used for all kinds of DSS processing. The Corporate Information Factory that the data warehouse evolved into had two prominent features - the virtual operational data store and the addition of unstructured data. The VODS was a feature that allowed organizations to access data on the fly without building an infrastructure. This meant that corporate communications could now be combined with corporate transactions to paint a more complete picture. CIF had an archival feature whereby data would be transferred from the data warehouse to nearline storage using a cross-media storage manager (CMSM) and then retired to archival.
Stream stores don’t have native storage. They are hosted on Tier 2, so they look like files and blobs and are subsequently sent to their own tertiary storage. If the stream store were native on disk, its archival would target the cold end of the streams.
Between files and blobs, we suggest object storage as better suited for archival. Object storage is best suited for using blobs as inputs for backup and archival and fits very well into Tier 2 of the tiering described earlier:
Here we suggest that the storage class make use of dedicated long-term media on the storage cluster and a corresponding service to automatically promote objects as they age.
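Such a service could be sketched as follows; the ObjectStore interface, the long-term class name and the aging threshold are assumptions, not a real API.

// A hypothetical background service that moves aging objects to a
// dedicated long-term storage class.
package main

import "time"

type ObjectInfo struct {
    Key      string
    Class    string
    Modified time.Time
}

// ObjectStore stands in for whatever client the object store provides.
type ObjectStore interface {
    List() ([]ObjectInfo, error)
    SetClass(key, class string) error
}

// promoteAging walks the store and promotes cold objects to the dedicated
// long-term media class, leaving recently used objects where they are.
func promoteAging(store ObjectStore, olderThan time.Duration) error {
    objects, err := store.List()
    if err != nil {
        return err
    }
    cutoff := time.Now().Add(-olderThan)
    for _, o := range objects {
        if o.Class != "long-term" && o.Modified.Before(cutoff) {
            if err := store.SetClass(o.Key, "long-term"); err != nil {
                return err
            }
        }
    }
    return nil
}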

Wednesday, February 26, 2020

Archival using data stream store:
As new data is added to existing data and old records are retired, the data store grows to a very large size. The number of active records within the store may be only a fraction and is more often segregated by a time window. Consequently, software engineers perform a technique called archiving, which moves older and unused records to tertiary storage. This technique is robust and involves some interesting considerations. We compare this technique as applied to relational tables with a similar strategy for streams, taking the example of an inventory table with assets as records.
The assets are continuously added and retired. Therefore, there is no fixed set to work with and the records may have to be scanned again and again. Fortunately, the assets on which the archiving action needs to be performed do not accumulate forever, as the archival catches up with the rate of retirement.
The retirement policy may depend not just on age but on several other attributes of the assets. Therefore, the archival may have the policy stored as separate logic to evaluate each asset against. Since the archival is expected to run over and over again, it is convenient to revisit each asset with this criterion to see whether the asset can now be retired.
The archival action may fail and the source and destination must remain clean without any duplication or partial entries. Consider the case when the asset is removed from the source but it is not added to the destination. It may be missed forever if the archival fails before the retired asset makes it to the destination table. Similarly, if the asset has been moved to the destination table, there need not be another entry for the same asset if the archival runs again and finds the original entry lingering in the source table. 
This leads to a policy where the selection of the asset, the insertion into the destination and the removal from the original are done in a transaction that guarantees all parts of the operation happen successfully or are rolled back to just before these operations started. But this can be relaxed by adding checks to each of the statements to make sure each operation can be taken again and again on the same asset with a forward-only movement of the asset from the source to the destination. This is often referred to as reentrant logic and helps take action on the assets in a failsafe manner without requiring the use of locks and logs for the overall set of select, insert and delete.
The set of three actions mentioned above works on only one asset at a time. This is a prudent and mature consideration because the storage requirement and possible changes to the asset are minimized when we work on one asset at a time instead of several. Consider the case when an asset is faultily marked as ready for retirement and then reverted. If it were part of a list of, say, ten assets being archived, it might require the other nine to be rolled back and the actions repeated while excluding the one. On the other hand, if we work with only one asset at a time, the rest of the inventory is untouched. A sketch of this reentrant, single-asset step follows.
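As mentioned above, a hedged sketch of this reentrant, single-asset step using database/sql follows; the inventory and inventory_archive table names, the retired flag and the Postgres-flavored SQL are assumptions.

// Move at most one retired asset from inventory to inventory_archive inside
// a transaction; every statement is written so that re-running it on the
// same asset is harmless.
package main

import (
    "context"
    "database/sql"
)

func archiveOne(ctx context.Context, db *sql.DB) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback() // no-op if the transaction commits

    var id int64
    err = tx.QueryRowContext(ctx,
        `SELECT id FROM inventory WHERE retired = true LIMIT 1`).Scan(&id)
    if err == sql.ErrNoRows {
        return nil // nothing to archive right now
    }
    if err != nil {
        return err
    }
    // Insert is idempotent: a lingering copy in the destination is not duplicated.
    if _, err := tx.ExecContext(ctx,
        `INSERT INTO inventory_archive SELECT * FROM inventory WHERE id = $1
         ON CONFLICT (id) DO NOTHING`, id); err != nil {
        return err
    }
    if _, err := tx.ExecContext(ctx,
        `DELETE FROM inventory WHERE id = $1`, id); err != nil {
        return err
    }
    return tx.Commit()
}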
With programmability for streams, it is relatively easy to translate the operations described above to a stream store. The streams have bands of cold, warm and hot data with progressive frontiers that make it easy to adjust the width of each region. The stream store is already considered durable and fit for archival, so the adjustment of width alone can remove the need to move data. Some number of segments from the cold region can become candidates for off-site archival.

Tuesday, February 25, 2020

Serviceability of multi-repository product pipeline: (continued...)
The cost of transformation is justified by the benefits of consistency and centralization on a unified, holistic source. It also provides staging for alternative and more traditional source code control systems, which have immense toolset support. Cloning of branches is cheap, but it hides the total cost of ownership and the possibilities to create build jobs independent of source organization and control.
The microservices model is very helpful for independent development of capabilities. The enforcement of consistency in source code, its builds and deployments is usually done by requiring changes in the source associated with each service.
On the other hand, a new component may be written that does not require changes in the services and has the dedicated purpose of enforcing consistency. This approach is the least disruptive, but it requires transformation into what the services expect and may end up accumulating special cases per service. An nginx gateway or HTTP proxy allows rules to be written that describe this forwarding, as in the sketch below.
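As referenced above, a minimal sketch of such a gateway in Go, with a hypothetical inventory-service backend and path rewrite standing in for the per-service special cases, might look like this.

// A tiny gateway that rewrites requests before forwarding them to a service,
// in the spirit of the nginx/HTTP-proxy rules mentioned above.
package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
)

func main() {
    backend, err := url.Parse("http://inventory-service:8080") // assumed backend
    if err != nil {
        log.Fatal(err)
    }
    proxy := httputil.NewSingleHostReverseProxy(backend)

    mux := http.NewServeMux()
    mux.HandleFunc("/api/v1/", func(w http.ResponseWriter, r *http.Request) {
        // Per-service special case: translate the unified path to what the
        // service expects before forwarding.
        r.URL.Path = "/legacy" + r.URL.Path
        proxy.ServeHTTP(w, r)
    })
    log.Fatal(http.ListenAndServe(":8000", mux))
}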
Similarly, pipeline jobs may be written where one feeds into another, and a dedicated pipeline may bring consistency by interpreting the artifacts from each repository. This makes it easier for different jobs to pair with their source because an umbrella source/build job takes care of the packaging and routines common to all repositories.
Instead of these several one-to-one pairings, all the source can be under one root and different pipelines can be authored with the same source of truth but with different views and purposes. This makes it easier for developers to see the whole code and make changes as necessary in any tree or across trees.

Monday, February 24, 2020

We were discussing the use case of stream storage with message broker. We continue the discussion today.

A message broker can roll over all data eventually to persistence, so the choice of storage does not hamper the core functionality of the message broker.

A stream store can be hosted on any Tier 2 storage, whether files or blobs. The choice of Tier 2 storage does not hamper the functionality of the stream store. In fact, the append-only, unbounded nature of the messages in a queue is exactly what makes the stream store appealing to these message brokers.

Together, the message broker and the storage engine provide addressing and persistence of queues to facilitate access via a geographically close location. However, we are suggesting adding native message broker functionality to stream storage in a way that promotes this distributed message broker enhanced by storage engineering best practices. Since we have streams for queues, we don’t need to give queues any additional journaling. A sketch of a queue layered directly over a stream follows.
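As a sketch of that idea, a queue can be layered directly over a stream so that the stream itself is the journal; the Stream interface below is an assumption, not a real stream-store API.

// Broker queue semantics layered over an append-only stream: enqueue is an
// append, dequeue is a read from a per-consumer position.
package main

// Stream stands in for an append-only stream in the stream store.
type Stream interface {
    Append(event []byte) (offset int64, err error)
    Read(offset int64) (event []byte, next int64, err error)
}

// Queue layers broker semantics over a stream without extra journaling.
type Queue struct {
    stream Stream
    cursor int64 // position of this consumer group
}

func (q *Queue) Enqueue(msg []byte) error {
    _, err := q.stream.Append(msg)
    return err
}

func (q *Queue) Dequeue() ([]byte, error) {
    msg, next, err := q.stream.Read(q.cursor)
    if err != nil {
        return nil, err
    }
    q.cursor = next // acknowledging advances the cursor, not a delete
    return msg, nil
}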

When gateways solve problems where data does not have to move, they are very appealing to many usages across the companies that use cloud providers. There have been several vendors racing to find this niche. In our case, the AMQP references to use streams are a way to do just that. With queue storage requiring no maintenance or administration and providing the ability to store as much content as necessary, this gateway service becomes useful for chaining, linking or networking message brokers.