Sunday, October 25, 2020

Network engineering continued ...

  1. This is a continuation of the earlier posts starting with this one: http://ravinote.blogspot.com/2020/09/best-practice-from-networking.html


  2. Hardware techniques for replication are helpful when we control the inventory along with the deployment of the storage product. Even so, there has been a shift to software-defined stacks, and replication no longer needs to be implemented in hardware. Offloading it to hardware increases the total cost of ownership, so that cost must be offset by corresponding gains.


  3. Physical replication, when implemented in the software stack, is perhaps the simplest technique of all: the entire container is copied to the destination. If the data is large, the time to replicate grows with the data size and is limited by the available bandwidth. There are also costs to reinstall the storage container at the destination and to verify that it is consistent. This is an option for end-users and is typically a client-side workaround.
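A minimal sketch of this full-copy approach, assuming the storage container is a local directory; the paths and the choice of SHA-256 for the consistency check are illustrative, not part of any particular product.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def replicate(source: Path, destination: Path) -> None:
    """Copy the whole container, then verify every file matches the source."""
    shutil.copytree(source, destination, dirs_exist_ok=True)
    for src_file in source.rglob("*"):
        if src_file.is_file():
            dst_file = destination / src_file.relative_to(source)
            if sha256_of(src_file) != sha256_of(dst_file):
                raise RuntimeError(f"replica diverged: {src_file}")

# Hypothetical paths, for illustration only.
# replicate(Path("/data/container"), Path("/mnt/remote/container"))
```

The verification pass is what makes the copy a usable replica rather than just a transfer; it is also where much of the client-side time goes when the data is large.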


  4. Trigger-based replication uses incremental changes as and when they happen, so that only those changes are propagated to the destination. The incremental changes are captured, shipped to the remote site, and replayed there.
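A rough sketch of the capture-and-replay idea in Python; in a real database the trigger would write to a change table, whereas here a hypothetical in-memory queue and dictionary stores stand in for that machinery.

```python
from collections import deque
from typing import Any, Deque, Dict, Tuple

# Hypothetical stores and change queue; a real trigger would populate a
# change table rather than a Python deque.
primary_store: Dict[str, Any] = {}
replica_store: Dict[str, Any] = {}
change_queue: Deque[Tuple[str, str, Any]] = deque()

def on_write(key: str, value: Any) -> None:
    """Acts like a trigger: capture the incremental change as it happens."""
    primary_store[key] = value
    change_queue.append(("upsert", key, value))

def ship_and_replay(replica: Dict[str, Any]) -> None:
    """Ship only the captured changes to the remote site and replay them."""
    while change_queue:
        op, key, value = change_queue.popleft()
        if op == "upsert":
            replica[key] = value
        elif op == "delete":
            replica.pop(key, None)

on_write("user:1", {"name": "alice"})
ship_and_replay(replica_store)   # replica_store now mirrors primary_store
```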


  5. Log-based replication is probably the most performant scheme: the log is actively watched for data changes, which are intercepted and sent to the remote system. Either the log records themselves are read and shipped to the destination, or the changes captured from the log are extracted and shipped. This technique performs well because it imposes very little overhead on the primary.
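A sketch of tailing an append-only log and forwarding each parsed change; the log path, the one-JSON-record-per-line format, and the send_to_remote placeholder are assumptions for illustration, not any product's actual log format.

```python
import json
import time
from typing import Any, Callable, Dict

def tail_log(log_path: str, apply_change: Callable[[Dict[str, Any]], None]) -> None:
    """Watch the log for new records and hand each parsed change to the shipper."""
    with open(log_path, "r") as log:
        log.seek(0, 2)                      # start at the current end of the log
        while True:
            line = log.readline()
            if not line:
                time.sleep(0.5)             # no new records yet
                continue
            apply_change(json.loads(line))  # forward the captured change

def send_to_remote(change: Dict[str, Any]) -> None:
    # Placeholder for the network call that replays the change at the remote site.
    print("replaying", change)

# tail_log("/var/lib/store/wal.jsonl", send_to_remote)   # hypothetical path
```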


  6. Most log-based replication methods are proprietary. A standard is hard to enforce, and supporting every proprietary format is difficult to maintain.


  7. Statistics gathering: every accounting operation within the storage product uses some form of statistics, from summations to building histograms, and these inevitably take up memory, especially if they cannot be done in one pass. Some of these operations used to run as synchronous aggregations, but when the data is very large they are translated into batch, micro-batch, or stream operations. With SQL-style queries using PARTITION BY and OVER clauses, smaller chunks can be processed in an online manner. However, most such operations can be delegated to the background.
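A small sketch of the one-pass, micro-batch style of aggregation described above: a running sum, count, and coarse histogram are maintained while chunks stream by, so the whole dataset never has to sit in memory. The bucket width and the sample batches are illustrative.

```python
from typing import Dict, Iterable, List

def streaming_stats(chunks: Iterable[List[float]], bucket_width: float = 10.0):
    """One pass over micro-batches: running sum, count, and a coarse histogram."""
    total = 0.0
    count = 0
    histogram: Dict[int, int] = {}
    for chunk in chunks:                      # each chunk is a micro-batch
        for value in chunk:
            total += value
            count += 1
            bucket = int(value // bucket_width)
            histogram[bucket] = histogram.get(bucket, 0) + 1
    mean = total / count if count else 0.0
    return {"sum": total, "count": count, "mean": mean, "histogram": histogram}

# Example: three micro-batches processed without holding all the data at once.
batches = [[3.0, 12.5], [27.0], [8.0, 41.0]]
print(streaming_stats(batches))
```

Because the state carried between batches is just a few counters and buckets, the same loop can run synchronously, as a background job, or over a stream without change.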
