Friday, February 28, 2020

The discussion on archival using streams is continued:
Traditionally, blob storage has offered benefits for archival:
1) Backups tend to be binary. Blobs can be large and binary.
2) Backup and archival platforms can take data over protocols, and blobs can be transferred natively or via those protocols.
3) The cold storage class of object storage is well suited for archival platforms.
Stream stores allow one stream to be converted to another and do away with storage classes.
The cold, warm and hot regions of the stream perform the equivalent of storage classes.


Their treatment can also be based on policies, just like the use of storage classes.
The rules for a storage class need not be mere regex translation of an outbound destination to another site-specific address. We are dealing with stream transformations and conversions from one stream to another. The cold, warm and hot regions need not exist in the same stream all the time. They can be written to their own independent streams before being processed. Also, the processing policy can be different for each and written in the form of a program. We are not just putting the streams on steroids, we are also making them smarter by allowing the administrator to customize the policies. These policies can be authored in the form of expressions and statements, much like a program with lots of if-then conditions ordered by their execution sequence. The source of truth remains unchanged while data is copied to streams where it is better suited for extract, transform and load operations. The data transformation routines can be offloaded to compute outside the storage, if necessary, to transform the data prior to archival. The idea here is to package a common technique across data sources that otherwise handle their own archival preparations. All in all, it becomes an archival management system rather than just a store.
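A minimal sketch of what such an administrator-authored policy might look like, expressed as ordered if-then rules evaluated against each stream region. The names here (StreamRegion, archive_offsite, split_into_cold_stream) are hypothetical and not tied to any particular stream store.

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Tuple

@dataclass
class StreamRegion:
    stream: str          # source stream the region belongs to
    temperature: str     # "hot", "warm" or "cold"
    last_access: datetime
    size_bytes: int

# A policy is an ordered list of (condition, action) rules; the first match wins.
Rule = Tuple[Callable[[StreamRegion], bool], Callable[[StreamRegion], None]]

def archive_offsite(region: StreamRegion) -> None:
    # Placeholder action: copy the region to an off-site archival stream.
    print(f"archiving {region.stream} off-site")

def split_into_cold_stream(region: StreamRegion) -> None:
    # Placeholder action: rewrite the region into its own independent cold stream.
    print(f"writing {region.stream} to a dedicated cold stream")

policies: List[Rule] = [
    (lambda r: r.temperature == "cold"
               and datetime.utcnow() - r.last_access > timedelta(days=365),
     archive_offsite),
    (lambda r: r.temperature == "cold", split_into_cold_stream),
]

def apply_policies(region: StreamRegion) -> None:
    # Rules are evaluated in their authored order, like a program of if-then conditions.
    for condition, action in policies:
        if condition(region):
            action(region)
            break

The administrator edits only the policies list; the stream store merely runs the rules against each region.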

Thursday, February 27, 2020

Archival using data stream store:

As new data is added and old data is retired, the data store grows to a very large size. The number of active records within the store may be only a fraction and is more often segregated by a time window. Consequently, software engineers perform a technique called archiving, which moves older and unused records to a tertiary storage. This technique is robust and involves some interesting considerations, as discussed in the earlier post.

With programmability for streams, it is relatively easy to translate the operations described in the earlier post to a stream store. The streams have bands of cold, warm and hot data with progressive frontiers that make it easy to adjust the width of each region. The stream store is already considered durable and fit for archival, so the adjustment of width alone can overcome the necessity to move data. Some number of segments from the cold region can become candidates for off-site archival.
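A rough sketch of the frontier idea, assuming hypothetical Segment and Frontiers types rather than any specific stream store's API. Moving the frontier offsets alone reclassifies data without copying it, and only segments behind the cold frontier become off-site candidates.

from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    name: str
    end_offset: int   # highest offset contained in the segment

@dataclass
class Frontiers:
    cold_end: int     # offsets at or below this are cold
    warm_end: int     # offsets between cold_end and warm_end are warm; the rest is hot

def widen_cold_region(frontiers: Frontiers, new_cold_end: int) -> Frontiers:
    # Moving the frontier forward grows the cold band; no data is rewritten.
    return Frontiers(cold_end=max(frontiers.cold_end, new_cold_end),
                     warm_end=max(frontiers.warm_end, new_cold_end))

def offsite_candidates(segments: List[Segment], frontiers: Frontiers) -> List[Segment]:
    # Only segments wholly behind the cold frontier are candidates for off-site archival.
    return [s for s in segments if s.end_offset <= frontiers.cold_end]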

Tiering enables policies to be specified for generational mark-down of data and its movement between tiers. This allows the hardware backing the space to be differentiated to suit various kinds of storage traffic. By providing tiers, the storage space is prioritized based on media cost and usage. Archival systems are considered low-cost storage because the data is usually cold.
Data warehouses used to be the graveyard for online transactional data. As data is passed to this environment, it changes from current value to historical data. As such, a system of record for historical data is created, and this is then used for all kinds of DSS processing. The Corporate Information Factory that the data warehouse evolved into had two prominent features - the virtual operational data store and the addition of unstructured data. The VODS was a feature that allowed organizations to access data on the fly without building an infrastructure. This meant that corporate communications could now be combined with corporate transactions to paint a more complete picture. CIF had an archival feature whereby data would be transferred from the data warehouse to nearline storage using the Cross Media Storage Manager (CMSM) and then retired to archival.
Stream stores don't have native storage. They are hosted on Tier 2, so they look like files and blobs and are subsequently sent to their own tertiary storage. If the stream stores were native on disk, their archival would target the cold end of the streams.
Between files and blobs, we suggest object storage as better suited for archival. Object storage is best suited for using blobs as inputs for backup and archival and fits very well into Tier 2 of the tiering described earlier:
Here we suggest that the storage class make use of dedicated long-term media on the storage cluster and a corresponding service to auto-promote objects as they age.
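A toy sketch of the auto-promotion service suggested here, assuming a hypothetical object catalog with a per-object storage class and age; real object stores expose this as lifecycle rules rather than this exact API, and the 90-day threshold is only an assumed example.

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class StoredObject:
    key: str
    storage_class: str        # e.g. "standard" or "archive"
    last_modified: datetime

ARCHIVE_AFTER = timedelta(days=90)   # assumed aging threshold

def promote_aging_objects(objects: List[StoredObject]) -> List[StoredObject]:
    # Periodically promote objects past the aging threshold onto the
    # dedicated long-term media backing the archive storage class.
    now = datetime.utcnow()
    promoted = []
    for obj in objects:
        if obj.storage_class != "archive" and now - obj.last_modified > ARCHIVE_AFTER:
            obj.storage_class = "archive"
            promoted.append(obj)
    return promoted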

Wednesday, February 26, 2020

Archival using data stream store:
As new data is added and old data is retired, the data store grows to a very large size. The number of active records within the store may be only a fraction and is more often segregated by a time window. Consequently, software engineers perform a technique called archiving, which moves older and unused records to a tertiary storage. This technique is robust and involves some interesting considerations. We compare this technique as applied to relational tables with a similar strategy for streams, taking the example of an inventory table with assets as records.
The assets are continuously added and retired. Therefore there is no fixed set to work on, and the records may have to be scanned again and again. Fortunately, the assets on which the archiving action needs to be performed do not accumulate forever, as the archival catches up with the rate of retirement.
The retirement policy may depend not just on the age but on several other attributes of the assets. Therefore the archival may keep the policy as separate logic to evaluate each asset against. Since the archival is expected to run over and over again, it is convenient to revisit each asset with these criteria to see if it can now be retired.
The archival action may fail, and the source and destination must remain clean without any duplication or partial entries. Consider the case when the asset is removed from the source but not added to the destination. It may be missed forever if the archival fails before the retired asset makes it to the destination table. Similarly, if the asset has already been moved to the destination table, there must not be another entry created for the same asset when the archival runs again and finds the original entry lingering in the source table.
This leads to a policy where the selection of the asset, the insertion into the destination and the removal from the source are done in a transaction that guarantees all parts of the operation happen successfully or are rolled back to just before they started. But this can be relaxed by adding checks to each of the statements to make sure each operation can be taken again and again on the same asset with a forward-only movement of the asset from the source to the destination. This is often referred to as reentrant logic and helps take action on the assets in a failsafe manner without requiring locks and logs for the overall set of select, insert and delete.
The set of three actions mentioned above works on only one asset at a time. This is a prudent and mature consideration because the storage requirement and the possible changes to the asset are minimized when we work on one asset at a time instead of several. Consider the case when an asset is faultily marked as ready for retirement and then reverted. If it were part of a list of, say, ten assets being archived, the other nine might have to be rolled back and the actions repeated excluding the one. On the other hand, if we work with only one asset at a time, the rest of the inventory is untouched.
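A small sketch of this reentrant, one-asset-at-a-time movement using SQLite; the table names and the retired flag are illustrative, not a prescribed schema. The INSERT OR IGNORE and the guarded DELETE let the same asset be processed repeatedly without duplicates or lost rows, and the transaction keeps the pair of statements atomic.

import sqlite3

def archive_one_asset(conn: sqlite3.Connection, asset_id: int) -> None:
    # Insert into the destination and delete from the source in one transaction;
    # each statement is safe to repeat on the same asset.
    # Assumes archived_assets has the same columns as assets, with id as primary key.
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT OR IGNORE INTO archived_assets "
            "SELECT * FROM assets WHERE id = ? AND retired = 1",
            (asset_id,),
        )
        conn.execute(
            "DELETE FROM assets WHERE id = ? AND retired = 1 "
            "AND EXISTS (SELECT 1 FROM archived_assets WHERE id = ?)",
            (asset_id, asset_id),
        )

def archive_retired_assets(conn: sqlite3.Connection) -> None:
    # Work one asset at a time so a faulty retirement only ever affects one row.
    ids = [row[0] for row in conn.execute("SELECT id FROM assets WHERE retired = 1")]
    for asset_id in ids:
        archive_one_asset(conn, asset_id)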
With programmability for streams, it is relatively easy to translate the operations described above to a stream store. The streams have bands of cold, warm and hot data with progressive frontiers that make it easy to adjust the width of each region. The stream store is already considered durable and fit for archival, so the adjustment of width alone can overcome the necessity to move data. Some number of segments from the cold region can become candidates for off-site archival.

Tuesday, February 25, 2020

Serviceability of multi-repository product pipeline: (continued...)
The cost for transformation is justified by the benefits of consistency and centralization on a unified, holistic source. It also provides staging for alternative and more traditional source code control, which has immense toolset support. Cloning of branches is cheap, but it hides the total cost of ownership and the possibility of creating build jobs independent of source organization and control.
The microservices model is very helpful for independent development of capabilities. The enforcement of consistency in source code, its builds and deployments, is usually done by requiring changes in the source associated with each service.
On the other hand, a new component may be written that does not require changes in the services and has the dedicated purpose of enforcing consistency. This approach is least disruptive, but it requires transformation to what the services expect and may end up accumulating special cases per service. An nginx gateway or HTTP proxy allows rules to be written that describe this forwarding.
Similarly, pipeline jobs may be written where one feeds into another, and a dedicated pipeline may bring consistency by interpreting the artifacts from each repository. This makes it easier for different jobs to pair with their source because an umbrella source/build job takes care of the packaging and routines common to all repositories.
Instead of these several one-to-one pairings, all the source can be under one root and different pipelines authored with the same source of truth but with different views and purposes. This makes it easier for developers to see the whole code and make changes as necessary in any tree or across trees.

Monday, February 24, 2020

We were discussing the use case of stream storage with a message broker. We continue the discussion today.
A message broker can roll over all data eventually to persistence, so the choice of storage does not hamper the core functionality of the message broker.
A stream store can be hosted on any Tier 2 storage, whether files or blobs. The choice of Tier 2 storage does not hamper the functionality of the stream store. In fact, the append-only, unbounded nature of the messages in the queue is exactly what makes the stream store more appealing to these message brokers.
Together, the message broker and the storage engine provide addressing and persistence of queues to facilitate access via a geographically close location. However, we are suggesting adding native message broker functionality to stream storage in a way that promotes a distributed message broker enhanced by storage engineering best practice. Since we have streams for queues, we don't need to give queues any additional journaling.
When gateways solve problems where data does not have to move, they are very appealing to many usages across the companies that use cloud providers. Several vendors have raced to find this niche. In our case, the AMQP references that use streams are a way to do just that. With queue storage not requiring any maintenance or administration and providing the ability to store as much content as necessary, this gateway service becomes useful for chaining, linking or networking message brokers.
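As a rough sketch of how a broker queue can sit directly on an append-only stream, with the stream itself serving as the journal (hence no extra journaling). The StreamQueue class and its in-memory list standing in for a durable, unbounded stream are purely illustrative and not any broker's or stream store's actual API.

from typing import List, Optional

class StreamQueue:
    """A queue realized over an append-only stream: publishes append events,
    consumers track their own read offsets, and the stream itself is the journal."""

    def __init__(self) -> None:
        self._stream: List[bytes] = []   # stand-in for an unbounded, durable stream

    def publish(self, message: bytes) -> int:
        # Appending to the stream persists the message; no separate journal is needed.
        self._stream.append(message)
        return len(self._stream) - 1     # the offset acts as the acknowledgement token

    def consume(self, offset: int) -> Optional[bytes]:
        # Consumers advance their own offsets; the broker never mutates the stream.
        return self._stream[offset] if offset < len(self._stream) else None

A consumer that restarts simply resumes from its last acknowledged offset, which is what a durable stream gives the broker for free.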

Sunday, February 23, 2020

Serviceability of multi-repository product pipeline: 
Software products are a snapshot of the source code that continuously accrues in a repository usually called ‘master’. Most features are baked completely before they are included for release in the master. This practice of constantly vetting the master for any updates and keeping it incremental for release is called continuous integration and continuous deployment.  
This used to work well for source code that was all compiled together in the master and released with an executable. However, contemporary software engineering practice is making use of running code independently on lightweight hosts called containers with a popular form of deployment called Docker images. These images are built from the source code in their own repositories. The ‘master’ branch now becomes a repository that merely packages the build from each repository that contributes to the overall product. 
This works well for v1.0 release of a product because each repository has its own master. When a version is released, it usually forks from the master as a release branch. Since code in each repository has contributed to the release, they all fork their own branches at the time when the packaging repository is forked. Creating a branch at the time of release on any repository allows that branch to accrue hot fixes independent of the updates made in the ‘master’ for the next release. 
There is a way to avoid proliferation of release branches for every master of a repository. This has to do with leveraging source control to go back in time. Since the source code version control continuously revisions each update to a repository, it can bring back the snapshot of the code up to a given version. Therefore, if there are no fixes required in a repository for a released version, a release branch for that repository need not be created.
It is hard to predict which repositories will or will not require their release branches to be forked, so organizations either create them all at once or play it by ear on a case-by-case basis. When the number of repositories is a handful, either option makes no difference.
The number of repositories is also hard to predict between successive versions of a software product. With the use of open source and modular organization of code, the packaging and storing of code in repositories has multiplied. Some can even number in the hundreds or have several levels of branches representing organizational hierarchy before the code reaches master. The minimum requirement for any released product is that all of the source code pertaining to the release must be snapshotted.
Complicating this practice, a fix may now require updates to multiple repositories. This fix will require more effort to port between master and release branches for their respective repositories. Instead, it might be easier to flatten out all repositories into folders under the same root in a new repository. This technique allows individual components to be built and released from sub-folders rather than their own repositories and gives a holistic approach to the entire released product.
This gets done only once for every release and centralizes the sustained engineering management by transformation of the code rather than proliferation of branches.  

Saturday, February 22, 2020

We were discussing the use case of stream storage with a message broker. We continue the discussion today.
A message broker can roll over all data eventually to persistence, so the choice of storage does not hamper the core functionality of the message broker.
A stream store can be hosted on any Tier 2 storage, whether files or blobs. The choice of Tier 2 storage does not hamper the functionality of the stream store. In fact, the append-only, unbounded nature of the messages in the queue is exactly what makes the stream store more appealing to these message brokers.
Together, the message broker and the storage engine provide addressing and persistence of queues to facilitate access via a geographically close location. However, we are suggesting adding native message broker functionality to stream storage in a way that promotes a distributed message broker enhanced by storage engineering best practice. Since we have streams for queues, we don't need to give queues any additional journaling.
When gateways solve problems where data does not have to move, they are very appealing to many usages across the companies that use cloud providers. Several vendors have raced to find this niche. In our case, the AMQP references that use streams are a way to do just that. With queue storage not requiring any maintenance or administration and providing the ability to store as much content as necessary, this gateway service becomes useful for chaining, linking or networking message brokers.