Saturday, February 29, 2020

Archival using streams discussion is continued:
Traditionally, blob storage has developed features that benefit archival:
1) Backups tend to be binary. Blobs can be large and binary
2) Backups and archival platforms can take data over protocols and blobs can be transferred natively or via protocols
3) Cold storage class of object storage is well suited for archival platforms
Stream stores allow one stream to be converted to another and do away with storage classes.

There, the cold, warm and hot regions of the stream perform the equivalent of a storage class.

The data transformation routines can be offloaded to compute outside the storage, if necessary, to transform the data prior to archival. The idea here is to package a common technique across data sources that handle their own archival preparations across data streams. All in all, it becomes an archival management system rather than remaining a store.

Let us compare a stream archival policy evaluator with Artifactory. Both of them emphasize the following:  

1) Proper naming with a format like “<team>-<technology>-<maturity>-<locator>” where  

A team is the primary identifier for the project  

A technology, tool or package type that is being used  

A maturity level which indicates the stage of processing of the binary objects, such as debug or release  

A locator which indicates the topology of the artifact.  

With such proper naming in place, both make use of rules over this organization for effective processing (a small sketch of such a naming check appears at the end of this comparison).  

2) Both make use of three main categories: security, performance and operability.   

Security permissions determine who has access to the streams.  

Performance considerations determine the cleanup policies such that the repositories are performing efficiently.  

Operability considerations determine whether objects need to be in different repositories to improve read, write and delete access and to reduce interference.  

3) Both of them make heavy use of local, remote and virtual repositories to their advantage in getting and putting objects.  

While Artifactory relies on organizations to determine their own policies, the stream policy evaluator is a manifestation of those policies and spans repositories, organizations and administration responsibilities.
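
As a rough illustration of the naming rule in point 1, a validity check might look like the following Java sketch. The exact pattern and the maturity values (debug, release) are assumptions for illustration rather than a prescribed format.

import java.util.regex.Pattern;

// Validates names of the assumed form "<team>-<technology>-<maturity>-<locator>",
// for example "payments-docker-release-us-west". The allowed maturity levels
// here are illustrative assumptions.
public class NamingConvention {
    private static final Pattern NAME_PATTERN =
            Pattern.compile("^[a-z0-9]+-[a-z0-9]+-(debug|release)-[a-z0-9-]+$");

    public static boolean isValid(String name) {
        return NAME_PATTERN.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("payments-docker-release-us-west")); // true
        System.out.println(isValid("payments_docker"));                 // false
    }
}

Such a check can back the rules both systems apply to their organization for effective processing.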

Friday, February 28, 2020

Archival using streams discussion is continued:
Traditionally, blob storage has developed features that benefit archival:
1) Backups tend to be binary. Blobs can be large and binary
2) Backups and archival platforms can take data over protocols and blobs can be transferred natively or via protocols
3) Cold storage class of object storage is well suited for archival platforms
Stream stores allow one stream to be converted to another and do away with storage classes.
There, the cold, warm and hot regions of the stream perform the equivalent of a storage class.


Their treatment can also be based on policies just like the use of storage class.
The rules for a storage class need not be a mere regex translation of an outbound destination to another site-specific address. We are dealing with stream transformations and conversions to another stream. The cold, warm and hot regions need not exist in the same stream all the time. They can be written to their own independent streams before being processed. The processing policy can also differ for each region and be written in the form of a program. We are not just putting the streams on steroids, we are also making them smarter by allowing the administrator to customize the policies. These policies can be authored in the form of expressions and statements, much like a program with lots of if-then conditions ordered by their execution sequence. The source of truth remains unchanged while data is copied to streams where it is better suited for extract, transform and load operations.
The data transformation routines can be offloaded to compute outside the storage, if necessary, to transform the data prior to archival. The idea here is to package a common technique across data sources that handle their own archival preparations across data streams. All in all, it becomes an archival management system rather than remaining a store.
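
The policies just described - expressions and if-then conditions ordered by their execution sequence - could be evaluated by something like the following Java sketch. StreamRegion, Rule and the actions are hypothetical names chosen for illustration, not part of any particular stream store.

import java.time.Duration;
import java.util.List;
import java.util.function.Predicate;

// A region of a stream with just enough attributes for a policy to inspect.
class StreamRegion {
    String name;           // e.g. "cold", "warm", "hot"
    Duration age;          // age of the oldest data in the region
    long sizeInBytes;
}

enum ArchivalAction { MOVE_OFFSITE, COMPRESS, KEEP }

// One if-then rule: a condition paired with the action to take when it matches.
record Rule(Predicate<StreamRegion> condition, ArchivalAction action) {}

class ArchivalPolicy {
    private final List<Rule> rules;   // evaluated in declaration order

    ArchivalPolicy(List<Rule> rules) { this.rules = rules; }

    ArchivalAction evaluate(StreamRegion region) {
        for (Rule rule : rules) {
            if (rule.condition().test(region)) {
                return rule.action();
            }
        }
        return ArchivalAction.KEEP;   // default when no rule matches
    }
}

An administrator-authored policy might then read: if a cold region is older than ninety days, move it off site; otherwise, if it exceeds a size threshold, compress it; otherwise keep it.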

Thursday, February 27, 2020

Archival using data stream store:

As new data is added to the existing data and the old ones are retired, the data storage grows to a very large size. The number of active records within the store may only be a fraction and is more often segregated by a time window. Consequently, software engineers perform a technique called archiving which moves older and unused records to a tertiary storage. This technique is robust and involves some interesting considerations, as discussed in the earlier post.

With programmability for streams, it is relatively easy to translate the operations described in the earlier post to a stream store. The streams have bands of cold, warm and hot data with progressive frontiers that make it easy to adjust the width of each region. The stream store is already considered durable and fit for archival, so the adjustment of width alone can overcome the necessity to move data. Some number of segments from the cold store can become candidates for off site archival.
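
A minimal illustration of these bands as progressive frontiers, assuming they are delimited by offsets into a single stream (the field and method names are illustrative), might be:

// Cold/warm/hot bands expressed as frontiers (offsets) into one stream.
// Widening or narrowing a band only moves a frontier; no data is moved.
class StreamFrontiers {
    long coldEnd;   // offsets [0, coldEnd) are cold
    long warmEnd;   // offsets [coldEnd, warmEnd) are warm
    long head;      // offsets [warmEnd, head) are hot

    void advanceColdFrontier(long newColdEnd) {
        if (newColdEnd < coldEnd || newColdEnd > warmEnd) {
            throw new IllegalArgumentException("frontiers may only move forward and must stay ordered");
        }
        coldEnd = newColdEnd;   // more of the stream becomes eligible for off-site archival
    }
}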

Tiering enables policies to be specified for generational mark-down of data and its movement between tiers. This enables differentiation of hardware for space to suit various storage traffic. By providing tiers, the storage space is now prioritized based on media cost and usage. Archival systems are considered low cost storage because the data is usually cold.
Data warehouses used to be the graveyard for online transactional data. As data is passed to this environment, it changes from current value to historical data. As such, a system of record for historical data is created and this is then used for all kinds of DSS processing. The Corporate Information Factory that the data warehouse evolved into had two prominent features - the virtual operational data store and the addition of unstructured data. The VODS was a feature that allowed organizations to access data on the fly without building an infrastructure. This meant that corporate communications could now be combined with corporate transactions to paint a more complete picture. CIF had an archival feature whereby data would be transferred from the data warehouse to nearline storage using a cross-media storage manager (CMSM) and then retired to archival.
Stream stores do not have native storage. They are hosted on Tier 2, so they look like files and blobs and are subsequently sent to their own tertiary storage. If the stream stores were native on disk, their archival would target the cold end of the streams.
Between files and blobs, we suggest object storage is better suited for archival: it is best suited for using blobs as inputs for backup and archival, and it fits very well into Tier 2 in the tiering described earlier.
Here we suggest that the storage class make use of dedicated long-term media on the storage cluster and a corresponding service to auto-promote objects as they age.

Wednesday, February 26, 2020

Archival using data stream store:
As new data is added to the existing data and the old ones are retired, the data storage grows to a very large size. The number of active records within the store may only be a fraction and is more often segregated by a time window. Consequently, software engineers perform a technique called archiving which moves older and unused records to a tertiary storage. This technique is robust and involves some interesting considerations. We compare this technique as applied to relational tables with a similar strategy for streams, taking the example of an inventory table with assets as records.
The assets are continuously added and retired. Therefore there is no fixed set to work on and the records may have to be scanned again and again. Fortunately, the assets on which the archiving action needs to be performed do not accumulate forever, as the archival catches up with the rate of retirement.
The retirement policy may be dependent not just on the age but several other attributes of the assets. Therefore the archival may have policy stored as a separate logic to evaluate each asset against. Since the archival is expected to run over and over again, it is convenient to revisit each asset again with this criteria to see if the asset can now be retired. 
The archival action may fail and the source and destination must remain clean without any duplication or partial entries. Consider the case when the asset is removed from the source but it is not added to the destination. It may be missed forever if the archival fails before the retired asset makes it to the destination table. Similarly, if the asset has been moved to the destination table, there need not be another entry for the same asset if the archival runs again and finds the original entry lingering in the source table. 
This leads to a policy where the selection of the asset, the insertion into the destination and the removal from the original are done in a transaction that guarantees all parts of the operation happen successfully or are rolled back to just before these operations started. But this can be relaxed by adding checks in each of the statements to make sure each operation can be taken again and again on the same asset with a forward-only movement of the asset from the source to the destination. This is often referred to as reentrant logic and helps take action on the assets in a failsafe manner without requiring the use of locks and logs for the overall set of select, insert and delete.
The set of three actions mentioned above works on only one asset at a time. This is a prudent and mature consideration because the storage requirement and possible changes to the asset are minimized when we work on one asset at a time instead of several. Consider the case when an asset is faultily marked as ready for retirement and then reverted back again. If it were part of a list of, say, ten assets that were being archived, it may force the other nine to be rolled back and the actions repeated without that one. On the other hand, if we were to work with only one asset at a time, the rest of the inventory is untouched.
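
A sketch of this reentrant, one-asset-at-a-time move, assuming JDBC and hypothetical inventory and inventory_archive tables, might read as follows; the guarded INSERT lets the same asset be processed again after a failure without duplication.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Moves a single retired asset from inventory to inventory_archive.
// Table and column names are assumptions; all three steps share one transaction
// and each statement is safe to repeat.
void archiveAsset(Connection conn, long assetId) throws SQLException {
    conn.setAutoCommit(false);
    try {
        // Copy only if the asset is not already archived (reentrant on retry).
        try (PreparedStatement insert = conn.prepareStatement(
                "INSERT INTO inventory_archive SELECT * FROM inventory WHERE id = ? " +
                "AND NOT EXISTS (SELECT 1 FROM inventory_archive WHERE id = ?)")) {
            insert.setLong(1, assetId);
            insert.setLong(2, assetId);
            insert.executeUpdate();
        }
        // Remove from the source; harmless if a previous attempt already removed it.
        try (PreparedStatement delete = conn.prepareStatement(
                "DELETE FROM inventory WHERE id = ?")) {
            delete.setLong(1, assetId);
            delete.executeUpdate();
        }
        conn.commit();
    } catch (SQLException e) {
        conn.rollback();
        throw e;
    }
}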
With programmability for streams, it is relatively easy to translate the operations described above to a stream store. The streams have bands of cold, warm and hot data with progressive frontiers that make it easy to adjust the width of each region. The stream store is already considered durable and fit for archival, so the adjustment of width alone can overcome the necessity to move data. Some number of segments from the cold store can become candidates for off site archival.

Tuesday, February 25, 2020

Serviceability of multi-repository product pipeline: (continued...)
The cost of transformation is justified by the benefits of consistency and centralization on a unified, holistic source. It also provides staging for alternative and more traditional source code control, which has immense toolset support. Cloning of branches is cheap, but it hides the total cost of ownership and the possibilities to create build jobs independent of source organization and control.
The microservices model is very helpful for independent development of capabilities. The enforcement of consistency in source code, its builds and deployments, is usually done by requiring changes in the source associated with each service.
On the other hand, a new component may be written to not require changes in services and have a dedicated purpose to enforce consistency. This approach is least disruptive but it requires transformation to what the services expect and may end up accumulating special cases per service. An nginx gateway or http proxy allow rules to be written that can describe this forwarding.
Similarly, pipeline jobs may be written where one feeds into another, and a dedicated pipeline may bring consistency by interpreting the artifacts from each repository. This makes it easier for different jobs to pair with their source because an umbrella source/build job will take care of the packaging and routines common to all repositories.
Instead of these several one-to-one pairings, all the source can be placed under one root and different pipelines authored against the same source of truth but with different views and purposes. This makes it easier for developers to see the whole code base and make changes as necessary in any tree or across trees.

Monday, February 24, 2020

We were discussing the use case of stream storage with message broker. We continue the discussion today.







A message broker can roll over all data eventually to persistence, so the choice of storage does not hamper the core functionality of the message broker.







A stream storage can be hosted on any Tier 2 storage, whether files or blobs. The choice of Tier 2 storage does not hamper the functionality of the stream store. In fact, the append-only, unbounded nature of messages in the queue is exactly what makes the stream store more appealing to these message brokers.







Together, the message broker and the storage engine provide addressing and persistence of queues to facilitate access via a geographically close location. However, we are suggesting adding native message broker functionality to stream storage in a way that promotes a distributed message broker enhanced by storage engineering best practice. Since we have streams for queues, we do not need to give queues any additional journaling.







When gateways solve problems where data does not have to move, they are very appealing to many usages across the companies that use cloud providers. Several vendors are racing to find this niche. In our case, the AMQP references to use streams are a way to do just that. With queue storage not requiring any maintenance or administration and providing the ability to store as much content as necessary, this gateway service becomes useful for chaining, linking or networking message brokers.  

Sunday, February 23, 2020

Serviceability of multi-repository product pipeline: 
Software products are a snapshot of the source code that continuously accrues in a repository usually called ‘master’. Most features are baked completely before they are included for release in the master. This practice of constantly vetting the master for any updates and keeping it incremental for release is called continuous integration and continuous deployment.  
This used to work well for source code that was all compiled together in the master and released with an executable. However, contemporary software engineering practice is making use of running code independently on lightweight hosts called containers with a popular form of deployment called Docker images. These images are built from the source code in their own repositories. The ‘master’ branch now becomes a repository that merely packages the build from each repository that contributes to the overall product. 
This works well for v1.0 release of a product because each repository has its own master. When a version is released, it usually forks from the master as a release branch. Since code in each repository has contributed to the release, they all fork their own branches at the time when the packaging repository is forked. Creating a branch at the time of release on any repository allows that branch to accrue hot fixes independent of the updates made in the ‘master’ for the next release. 
There is a way to avoid proliferation of release branches for every master of a repository. This has to do with the leverage of source control to go back in time. Since the source code version control continuously records a revision for each update to a repository, it can bring back the snapshot of the code at a given version. Therefore, if there are no fixes required in a repository for a released version, a release branch for that repository need not be created. 
It is hard to predict which repositories will or will not require their release branches to be forked, so organizations either create them all at once or play it by ear on a case-by-case basis. When the number of repositories is a handful, either option makes no difference. 
The number of repositories is also hard to predict between successive versions of a software product. With the use of open source and modular organization of code, the packaging and storing of code in repositories has multiplied. Some can even number in the hundreds or have several levels of branches representing organizational hierarchy before the code reaches master. The minimum requirement for any released product is that all of the source code pertaining to the release must be snapshotted. 
Complicating this practice, a fix may now require updates to multiple repositories. This fix will require more effort to port between master and release branches for their respective repositories. Instead, it might be easier to flatten out all repositories into folders under the same root in a new repository. This technique allows individual components to be built and released from sub-folders rather than their own repositories and gives a holistic approach to the entire released product.  
This gets done only once for every release and centralizes the sustained engineering management by transformation of the code rather than proliferation of branches.  

Saturday, February 22, 2020

We were discussing the use case of stream storage with message broker. We continue the discussion today.



A message broker can roll over all data eventually to persistence, so the choice of storage does not hamper the core functionality of the message broker.



A stream storage can be hosted on any Tier 2 storage, whether files or blobs. The choice of Tier 2 storage does not hamper the functionality of the stream store. In fact, the append-only, unbounded nature of messages in the queue is exactly what makes the stream store more appealing to these message brokers.



Together, the message broker and the storage engine provide addressing and persistence of queues to facilitate access via a geographically close location. However, we are suggesting adding native message broker functionality to stream storage in a way that promotes a distributed message broker enhanced by storage engineering best practice. Since we have streams for queues, we do not need to give queues any additional journaling.



When gateways solve problems where data does not have to move, they are very appealing to many usages across the companies that use cloud providers. Several vendors are racing to find this niche. In our case, the AMQP references to use streams are a way to do just that. With queue storage not requiring any maintenance or administration and providing the ability to store as much content as necessary, this gateway service becomes useful for chaining, linking or networking message brokers.  

Friday, February 21, 2020

We were discussing the use case of stream storage with message broker. We continue the discussion today.

A message broker can roll over all data eventually to persistence, so the choice of storage does not hamper the core functionality of the message broker.

A stream storage can be hosted on any Tier 2 storage, whether files or blobs. The choice of Tier 2 storage does not hamper the functionality of the stream store. In fact, the append-only, unbounded nature of messages in the queue is exactly what makes the stream store more appealing to these message brokers.

Together, the message broker and the storage engine provide addressing and persistence of queues to facilitate access via a geographically close location. However, we are suggesting adding native message broker functionality to stream storage in a way that promotes a distributed message broker enhanced by storage engineering best practice. Since we have streams for queues, we do not need to give queues any additional journaling.

The stream store has an existing concept of events with their readers and writers. The queues continuously read and write, but there does not need to be a dual definition of stream and queue. Their purpose was to define a logical boundary between service and store. These queues and streams can be local or global, since they were hosted on independent clusters and can now be hosted from the same cluster. A local queue can be store- or stream-specific. The global queues can be reached from anywhere by any client, and the store takes care of regional availability. The integration of clusters is central to making this work. The stream and the queue can have improved addressing that enables local or regional lookup. The persistence strategy for queue processing can take advantage of scopes and streams to group them together. However, in this case too, we do not have to leverage external technologies such as a gateway for address translation. It can be part of the address that the store recognizes.
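
A minimal sketch of such store-recognized addressing, assuming a hypothetical QueueLocation made up of scope, stream and site, might look like:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Maps a universal queue address to a site-specific scope/stream pair.
// The record fields and method names are assumptions for illustration.
record QueueLocation(String scope, String stream, String site) {}

class QueueAddressResolver {
    private final Map<String, QueueLocation> locations = new ConcurrentHashMap<>();
    private final String localSite;

    QueueAddressResolver(String localSite) { this.localSite = localSite; }

    void register(String universalAddress, QueueLocation location) {
        locations.put(universalAddress, location);
    }

    // Returns the registered location; the caller serves it locally when the
    // site matches, or forwards to the owning site otherwise.
    QueueLocation resolve(String universalAddress) {
        QueueLocation loc = locations.get(universalAddress);
        if (loc == null) {
            throw new IllegalArgumentException("unknown queue: " + universalAddress);
        }
        return loc;
    }

    boolean isLocal(QueueLocation loc) {
        return localSite.equals(loc.site());
    }
}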

Thursday, February 20, 2020

We were discussing the use case of stream storage with message broker. We continue the discussion today.



A message broker can roll over all data eventually to persistence, so the choice of storage does not hamper the core functionality of the message broker.



A stream storage can be hosted on any Tier 2 storage, whether files or blobs. The choice of Tier 2 storage does not hamper the functionality of the stream store. In fact, the append-only, unbounded nature of messages in the queue is exactly what makes the stream store more appealing to these message brokers.



While message brokers look up queues by address, a gateway service can help with translation of addresses. In this way, a gateway can significantly expand the distribution of queues.

First, the address mapping is not at site level. It is at queue level.  



Second, the address of the queue – both universal as well as site-specific – is maintained along with the queue as part of its location information. 



Third, instead of internalizing a table of rules from the external gateway, a lookup service can translate a universal queue address to the address of the nearest queue. This service is part of the message broker as a read-only query. Since queue names and addresses are already existing functionality, we only add the ability to translate a universal address to a site-specific address at the queue level.  



Fourth, the gateway functionality exists as a microservice. It can do more than a static lookup of the physical location of a queue given a universal address instead of the site-specific address. It has the ability to generate tiny URLs for the queues based on hashing. This adds aliases to the address as opposed to the conventional domain-based address. The hashing is at the queue level, and since we can store billions of queues in the queue storage, a URL shortening feature is a significant offering from the gateway service within the queue storage. It has the potential to morph into other services beyond a mere translator of queue addresses. Design of a URL hashing service was covered earlier; a small sketch appears after this list. 



Fifth, the conventional gateway functionality of load balancing can also be handled with an elastic scale-out of just the gateway service within the message broker.  



Sixth, this gateway can also improve access to the queue by making more copies of the queue elsewhere and adding the extra mapping for the duration of the traffic. It need not even interpret the originating IP addresses to determine the volume as long as it can keep track of the number of read requests against the existing address of the same queue.  



These advantages can improve the usability of the message brokers and their queues by providing as many as needed along with a scalable service that can translate incoming universal address of queues to site specific location information.  
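
Referring back to the fourth point above, a minimal sketch of the hashing-based alias, with the alias length and alphabet as assumptions and without collision handling or a reverse lookup table, might be:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Produces a short alias for a universal queue address by hashing it and
// encoding a prefix of the digest with a base-62 alphabet.
class QueueUrlShortener {
    private static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    static String shorten(String universalAddress) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(universalAddress.getBytes(StandardCharsets.UTF_8));
            StringBuilder alias = new StringBuilder();
            for (int i = 0; i < 7; i++) {
                alias.append(ALPHABET.charAt(Byte.toUnsignedInt(digest[i]) % ALPHABET.length()));
            }
            return alias.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}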

Wednesday, February 19, 2020

We were discussing the use case of stream storage with message broker. We continue the discussion today.

A message broker can roll over all data eventually to persistence, so the choice of storage does not hamper the core functionality of the message broker.

A stream storage can be hosted on any Tier 2 storage, whether files or blobs. The choice of Tier 2 storage does not hamper the functionality of the stream store. In fact, the append-only, unbounded nature of messages in the queue is exactly what makes the stream store more appealing to these message brokers.

A message broker is supposed to assign the messages to different processors. If it sends them all to the same single point of contention, it is not very useful. When message brokers are used with a cache, the performance generally improves over what might have been incurred in going to the backend to store and retrieve. That is why different processors behind a message broker could maintain their own cache. A dedicated cache service like AppFabric may also be sufficient to distribute incoming messages. In this case, we are consolidating processor caches with a dedicated cache. This does not necessarily mean a single point of contention. A shared cache may offer a service level agreement at par with an individual cache for messages. Since a message broker will not see a performance degradation when sending to a processor or a shared dedicated cache, it works in both these cases. Replacing a shared dedicated cache with shared dedicated storage such as a stream store is therefore also a practical option.
Some experts argue that the message broker is practical only for small and medium businesses with small-scale requirements. This means they are stretched at large scale, whereas stream storage deployments are not necessarily restricted in size, even though message brokers are also offered as true cloud services. These experts point out that stream storage is useful for continuous workloads. Tools like duplicity use the S3 API to persist in object storage, and in this case message brokers can also perform backup and archiving for their journaling. These workflows do not require modification of data objects, which makes stream storage perfect for them. These experts argue that the problem with the message broker is that it adds more complexity and limits performance. It is not used with online transaction processing workloads, which are more read- and write-intensive and do not tolerate latency.


Tuesday, February 18, 2020

We were discussing the use case of stream storage with message broker. We continue the discussion today.



A message broker can roll over all data eventually to persistence, so the choice of storage does not hamper the core functionality of the message broker.



A stream storage can be hosted on any Tier 2 storage, whether files or blobs. The choice of Tier 2 storage does not hamper the functionality of the stream store. In fact, the append-only, unbounded nature of messages in the queue is exactly what makes the stream store more appealing to these message brokers.

A message broker is supposed to assign the messages to different processors. If it sends them all to the same single point of contention, it is not very useful. When message brokers are used with a cache, the performance generally improves over what might have been incurred in going to the backend to store and retrieve. That is why different processors behind a message broker could maintain their own cache. A dedicated cache service like AppFabric may also be sufficient to distribute incoming messages. In this case, we are consolidating processor caches with a dedicated cache. This does not necessarily mean a single point of contention. A shared cache may offer a service level agreement at par with an individual cache for messages. Since a message broker will not see a performance degradation when sending to a processor or a shared dedicated cache, it works in both these cases. Replacing a shared dedicated cache with shared dedicated storage such as a stream store is therefore also a practical option.

Monday, February 17, 2020

We were discussing the use case of stream storage with message broker. We continue the discussion today.

A message broker can roll over all data eventually to persistence, so the choice of storage does not hamper the core functionality of the message broker.

A stream storage can be hosted on any Tier 2 storage, whether files or blobs. The choice of Tier 2 storage does not hamper the functionality of the stream store. In fact, the append-only, unbounded nature of messages in the queue is exactly what makes the stream store more appealing to these message brokers.

Chained message brokers
This answers the question of whether, if the message broker is a separate instance, message brokers can be chained across object storage. Along the lines of the previous question, if the current message broker does not resolve the queue for a message located in its queues, is it possible to distribute the query to another message broker? These kinds of questions imply that the resolver merely needs to forward the queries that it cannot answer to a default pre-registered outbound destination. In a chained message broker, the queries can make sense simply out of the naming convention and say whether a request belongs to it or not. If it does not, it simply forwards it to another message broker. This is somewhat different from the original notion that the address is something opaque to the user and does not have any interpretable part that can determine the site to which the object belongs. The linked message broker does not even need to take time to verify that the local instance is indeed the correct recipient. It can merely translate the address to know if it belongs to it with the help of a registry. This shallow lookup means a request can be forwarded faster to another linked message broker and ultimately to where it may be guaranteed to be found. The linked message broker has no criteria for the message brokers to be similar, and as long as the forwarding logic is enabled, any implementation can exist in each of the message brokers for translation, lookup and return. This could be completely mitigated if the opaque addresses were hashes and the destination message broker were determined based on a hash table. Whether we use routing tables or a static hash table, the networking over the chained message brokers can be its own layer facilitating routing of messages to the correct message broker. 
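
A minimal sketch of the hash table approach to picking the destination broker, with the endpoint strings as assumptions, might be:

import java.util.List;

// Chooses the owning broker for an opaque queue address by hashing it against a
// static table of broker endpoints, so no broker has to interpret the address.
class BrokerChainRouter {
    private final List<String> brokerEndpoints;   // e.g. "broker-a:5672", "broker-b:5672"
    private final String localEndpoint;

    BrokerChainRouter(List<String> brokerEndpoints, String localEndpoint) {
        this.brokerEndpoints = brokerEndpoints;
        this.localEndpoint = localEndpoint;
    }

    String ownerOf(String queueAddress) {
        int index = Math.floorMod(queueAddress.hashCode(), brokerEndpoints.size());
        return brokerEndpoints.get(index);
    }

    // Shallow lookup: answer locally when this broker owns the queue, otherwise
    // forward the request to ownerOf(queueAddress).
    boolean ownsLocally(String queueAddress) {
        return localEndpoint.equals(ownerOf(queueAddress));
    }
}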

Sunday, February 16, 2020

We were discussing the use case of stream storage with message broker. We continue the discussion today.


A message broker can roll over all data eventually to persistence, so the choice of storage does not hamper the core functionality of the message broker.


A stream storage can be hosted on any Tier 2 storage, whether files or blobs. The choice of Tier 2 storage does not hamper the functionality of the stream store. In fact, the append-only, unbounded nature of messages in the queue is exactly what makes the stream store more appealing to these message brokers.



The message broker is more than a publisher-subscriber. It has the potential to be a platform for meaningful insight into events.

The rules of a message broker need not be AMQP endpoint-based address lookup. We are dealing with messages that are intended for multiple recipients. For that matter, instead of addresses, regex and rich expressions may be allowed, and internally the same message may be relayed to different queues at lightning speed from queue-specific initiation. We are not just putting the message broker on steroids, we are also making it smarter by allowing the user to customize the rules. These rules can be authored in the form of expressions and statements much like a program with lots of if-then conditions ordered by their execution sequence. The message broker works as more than a gateway server. It is a lookup of objects without sacrificing performance and without restrictions on the organization of objects within a store or across distributed stores. It works much like routers, and although we have referred to the message broker as a networking layer over storage, it provides a query execution service as well.
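
A minimal sketch of such ordered, regex-based rules, with the rule patterns and queue names as illustrative assumptions, could be:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Ordered regex rules that map a message's routing key to destination queues;
// a single message may fan out to every queue whose rule matches.
class RoutingRules {
    private final Map<Pattern, String> rules = new LinkedHashMap<>();  // insertion order = execution order

    void addRule(String regex, String destinationQueue) {
        rules.put(Pattern.compile(regex), destinationQueue);
    }

    List<String> destinationsFor(String routingKey) {
        return rules.entrySet().stream()
                .filter(e -> e.getKey().matcher(routingKey).matches())
                .map(Map.Entry::getValue)
                .collect(Collectors.toList());
    }
}

For example, one rule might send keys matching "orders\\..*" to an audit queue while a catch-all rule sends every message to a firehose queue.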

Why the gateway service cannot be implemented like this:
for (Zone zone : ZoneScanner.getZones()) {
    if (zone.equals(currentZone)) {
        // act only on the matching zone
    }
}
Imagine if this scan were performed on every data path; it would be repeated and costly.

Saturday, February 15, 2020

We were discussing the use case of stream storage with message broker. We continue the discussion today.

A message broker can roll over all data eventually to persistence, so the choice of storage does not hamper the core functionality of the message broker.

A stream storage can be hosted on any Tier 2 storage, whether files or blobs. The choice of Tier 2 storage does not hamper the functionality of the stream store. In fact, the append-only, unbounded nature of messages in the queue is exactly what makes the stream store more appealing to these message brokers.

The message broker is more than a publisher-subscriber. It has the potential to be a platform for meaningful insight into events.  

The message broker is also a networking layer that enables a P2P network of different local storage which could be on-premise or in the cloud. A distributed hash table in this case determines the store to go to. The location information for the data objects is deterministic as the peers are chosen with identifiers corresponding to the data object's unique key. Content therefore goes to specified locations, which makes subsequent requests easier. Unstructured P2P is composed of peers joining based on some rules and usually without any knowledge of the topology. In this case the query is broadcast and peers that have matching content return the data to the originating peer. This is useful for highly replicated items. P2P provides a good base for large-scale data sharing. Some of the desirable features of P2P networks include selection of peers, redundant storage, efficient location, hierarchical namespaces, authentication as well as anonymity of users. In terms of performance, P2P has desirable properties such as efficient routing, self-organization, massive scalability and robustness in deployments, fault tolerance, load balancing and explicit notions of locality. Perhaps the biggest takeaway is that P2P is an overlay network with no restriction on size and that there are two classes, structured and unstructured. Structured P2P means that the network topology is tightly controlled and the content is placed not at random peers but at specified locations, which makes subsequent requests more efficient. 

Friday, February 14, 2020

Public cloud technologies can be used to improve our products and their operations even if they are deployed on-premise. Public clouds have become popular for their programmability which endear them to developers while providing unparalleled benefits such as no maintenance, lower total cost of ownership, global reachability and availability for all on-premise instances, unified and seamless operations monitoring via system center tools and improved performance via almost limitless scalability. 
The public cloud acts like a global sponge to collect data across on-premise deployments that serves to give insights into the operations of those deployments. Take the use case of collecting metrics such as total size and object count from any customer deployment. Metrics like these are lightweight and have traditionally been passed into special-purpose on-premise storage stacks, which again pose the challenge that most on-premise deployments are dark for monitoring remotely. On the other hand, the proposal here is that if these choice metrics were to fly up to the public clouds like pixie dust from each of the customer deployments, then they can improve visibility into the operations of all customer deployments while expanding the horizon of monitoring and call-home services. A service hosted on the public cloud becomes better suited to assuming the role of a network operations center to receive call-home alerts, messages, events and metrics. This is made possible with the rich development framework available in the public cloud which allows rapid development of capabilities in the area of operations monitoring, product support and remote assistance to customers for all deployments of software products and appliances. It also lowers costs for these activities. Metrics are just one example of this initiative to move such developmental work to the public cloud. All services pertaining to the operations of customer deployments of our products are candidates for being built and operated as tenants of the public cloud.  
The proposals of this document include the following: 1) improving visibility of on-premise appliance deployments to services in the public cloud, 2) Using the public cloud features to improve manageability of all deployed instances, 3) Using a consolidated reporting framework via services built in the cloud that map all the deployed instances for their drill down analytics and 4) using the public cloud as a traffic aggregator for all capabilities added to the on-premise storage now and in the future. 
Specifically, these points were explained in the context of knowing the size and count of objects from each deployed instance. Previously, there was no way of pulling this information from customer deployments by default. Instead the suggestion was that we leverage the on-premise instance to publish these metrics to a service hosted in the public cloud. This service seemed to be available already in both the major public clouds. The presentation demonstrated Application Programming Interface aka API calls made against each public cloud to put and get the total size and count of objects. 
These API calls were then shown as introduction to several powerful features available from the public cloud for value additions to the manageability of deployed instances. The latter part of the presentation focused on the involvement of public cloud for the monitoring and operations of geographically distributed on-premise customer deployments.  
As a mention of the new capabilities made possible with the integration of on-premise deployment with manageability using the public cloud, it was shown that when this information is collected via the API calls made, it becomes easy to report with drill-downs on groups of deployments or aggregation of individual deployments. 
Finally, the public cloud was suggested to be used as a traffic aggregator aka sponge for subscribing to any light-weight data not just the size and count described in this presentation and thus become useful in alleviating publishing-subscribing semantics from on-premise deployments. Together, these proposals are a few of the integration possibilities of on-premise deployments with capabilities hosted in the public cloud.
#codingexercise
Determine the max size of a resource from all resources


import java.util.Comparator;
import java.util.List;

// Returns the maximum resource size; assumes a helper getSize(String) that
// returns the size of a single resource.
int getMaxResourceLength(List<String> input) {
    return input.stream()
                .map(x -> getSize(x))
                .max(Comparator.naturalOrder())
                .get();
}

Thursday, February 13, 2020

Both message broker and stream store are increasingly becoming cloud technologies rather than on-premise. This makes their integration much more natural, as the limits for storage, networking and compute are somewhat relaxed.

This also helps with increasingly maintenance-free deployments of their components. The cloud technologies are suited for load balancing, scaling out, ingress control, pod health and proxies that not only adjust to demand but also come with all the benefits of infrastructure best practices.

The networking stacks in the host computers have maintained a host-centric send and receive functionality for over three decades. Along with the services for file transfers, remote invocations, peer-to-peer networking and packet capture, the networking in the host has supported sweeping changes such as web programming, enterprise computing, IoT, social media applications and cloud computing. The standard established for this networking across vendors was the Open Systems Interconnection model. The seven layers of this model strived to encompass all but the application logic. However, the networking needs from these emerging trends never fully made their way back into the host networking stack. This article presents the case that a message broker inherently belongs to a layer in the networking stack. Message brokers have also been deployed as standalone applications on the host supporting the Advanced Message Queuing Protocol. Some message brokers have demonstrated extreme performance that has facilitated the traffic on social media.

The message broker supports an observer pattern that allows interested observers to listen from outside the host. When the data unit is no longer packets but messages, it facilitates data transfers in units that are more readable. The packet capture used proxies as a man in the middle which provides little or no disruption to ongoing traffic while facilitating the storage and analysis of packets for the future. Message subscription makes it easier for collection and verification across hosts.

The notion of data capture now changes to the following model:

Network Data subscriber

Message Broker

Stream storage

Tier 2 storage



This facilitates immense analytics directly over the time series database. Separation of analysis stacks implies that the charts and graphs can now be written with any product and on any host.

This way a message broker becomes a networking citizen in the cloud.

Wednesday, February 12, 2020

Both message broker and stream store are increasingly becoming cloud technologies rather than on-premise. This makes their integration much more natural, as the limits for storage, networking and compute are somewhat relaxed.
This also helps with increasingly maintenance-free deployments of their components. The cloud technologies are suited for load balancing, scaling out, ingress control, pod health and proxies that not only adjust to demand but also come with all the benefits of infrastructure best practices.
The networking stacks in the host computers have maintained a host-centric send and receive functionality for over three decades. Along with the services for file transfers, remote invocations, peer-to-peer networking and packet capture, the networking in the host has supported sweeping changes such as web programming, enterprise computing, IoT, social media applications and cloud computing. The standard established for this networking across vendors was the Open Systems Interconnection model. The seven layers of this model strived to encompass all but the application logic. However, the networking needs from these emerging trends never fully made their way back into the host networking stack. This article presents the case that a message broker inherently belongs to a layer in the networking stack. Message brokers have also been deployed as standalone applications on the host supporting the Advanced Message Queuing Protocol. Some message brokers have demonstrated extreme performance that has facilitated the traffic on social media.
The message broker supports an observer pattern that allows interested observers to listen from outside the host. When the data unit is no longer packets but messages, it facilitates data transfers in units that are more readable. The packet capture used proxies as a man in the middle which provides little or no disruption to ongoing traffic while facilitating the storage and analysis of packets for the future. Message subscription makes it easier for collection and verification across hosts.
The notion of data capture now changes to the following model:
Network Data subscriber
Message Broker
Stream storage
Tier 2 storage

This facilitates immense analytics directly over the time series database. Separation of analysis stacks implies that the charts and graphs can now be written with any product and on any host.
This way a message broker becomes a networking citizen in the cloud.

Tuesday, February 11, 2020

Yesterday we were discussing the use case of stream storage with message broker. We continue the discussion today.

A message broker can roll over all data eventually to persistence, so the choice of storage does not hamper the core functionality of the message broker.

A stream storage can be hosted on any Tier 2 storage, whether files or blobs. The choice of Tier 2 storage does not hamper the functionality of the stream store. In fact, the append-only, unbounded nature of messages in the queue is exactly what makes the stream store more appealing to these message brokers.

Though the designs of message brokers are different, the stream store provides durability, performance and consistency guarantees that work with both.

There are several examples available of the stream store being helpful in persisting logs, metrics and events. The message broker finds a lot of use cases for these types of data, and its integration with a stream store helps a lot here.

Both message broker and stream store are increasingly becoming cloud technologies rather than on-premise. This makes their integration much more natural, as the limits for storage, networking and compute are somewhat relaxed.

This also helps with increasingly maintenance-free deployments of their components. The cloud technologies are suited for load balancing, scaling out, ingress control, pod health and proxies that not only adjust to demand but also come with all the benefits of infrastructure best practices.

Monday, February 10, 2020

Yesterday we were discussing the use case of stream storage with message broker. We continue the discussion today.

As compute, network and storage are overlapping to expand the possibilities in each frontier at cloud scale, message passing has become a ubiquitous functionality. While libraries like protocol buffers and solutions like RabbitMQ are becoming popular, Flows and their queues are finding universal recognition in storage systems. Messages are also time-stamped and can be treated as events. A stream store is best suited for sequential event storage. 
Since event storage overlays on Tier 2 storage on top of blocks, files, streams and blobs, it is already transferring data to those dedicated stores. The storage tier for the message broker with a stream storage system only brings in the storage engineering best practice. 
The programmability of streams has a special appeal for the message processors. Runtimes like Apache Flink already supports user jobs which have rich APIs to work with unbounded and bounded data sets. 
The message broker now has the ability to host a runtime that is even more performant and expressive than any of the others it supported.  
The design for message brokers has certain features that stream storage can handle very well. For example, MSMQ has a design which includes support for dead letter and poison letter queues, support for retries by the queue processor, a non-invasive approach which lets the clients send and forget, the ability to create an audit log, support for transactional as well as non-transactional messages, support for distributed transactions, support for clusters instead of standalone single-machine deployments, and the ability to improve concurrency control. Stream storage provides excellent support for queues, logs, audit and events, along with transactional semantics.  
This design of MSMQ is somewhat different from that of System.Messaging. The System.Messaging library transparently exposes the underlying Windows Message Queuing APIs. For example, it provides a GetPublicQueues method that enumerates the public message queues. It takes message queue criteria as a parameter. These criteria can be specified with parameters such as category and label. It can also take machine name or cluster name, and created and modified times, as filter parameters. GetPublicQueuesEnumerator is available to provide an enumerator to iterate over the results.  
Though these designs are different, the stream store provides durability, performance and consistency guarantees that work with both.