Friday, January 4, 2019

Today we continue discussing best practices from storage engineering:

281) Shared-memory systems have been popular for storage products. They include SMPs, multi-core systems, and combinations of the two. The simplest way to use them is to create threads in the same process. Shared-memory parallelism is widely used with big data.
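As a minimal sketch of shared-memory parallelism within a single process (the work function and the item lists here are made up for illustration), a few Python threads can update one shared structure under a lock:

import threading

shared_counts = {}                 # memory shared by every thread in the process
lock = threading.Lock()            # guards the shared structure

def ingest(worker_id, items):
    # each thread handles its own slice of work but writes into shared memory
    for item in items:
        with lock:
            shared_counts[item] = shared_counts.get(item, 0) + 1

threads = [threading.Thread(target=ingest, args=(i, ["a", "b", "c"])) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(shared_counts)               # each key counted once per thread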

282) The shared-nothing model: each node is independent and self-sufficient, so there is no single point of contention; none of the nodes share memory or disk storage. Such systems generally compete well against any model that has a single point of contention in the form of shared memory or disk.

283) Shared-disk: this model is supported where a large storage space is needed. Some products implement shared-disk and some implement shared-nothing; the two models do not go together in the same code base.

284) The implementation of a content-distribution network, such as one for images or videos, generally translates to random disk reads, which means caching may not always help. Therefore, the disks that are RAIDed are tuned. It used to be a monolithic RAID 10 served from a single master with multiple slaves. Nowadays a sharded approach is taken instead, preferably served from object storage.
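As a sketch of the sharded approach (the pool names are hypothetical), a content key can be hashed to pick an object-storage pool so that reads spread across shards instead of landing on a single RAID-10 master:

import hashlib

SHARDS = ["objstore-pool-0", "objstore-pool-1", "objstore-pool-2", "objstore-pool-3"]

def shard_for(key: str) -> str:
    # hash the content key so requests fan out over the pools
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("images/2019/01/cat.jpg"))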

285) Image and video libraries will constantly run into cache misses, especially when replication is slow. It is better to separate traffic into different cluster pools; replication and caching then come into the picture to handle the load within each pool. With a distribution across cluster pools, we spread the load and avoid the hot spots and repeated cache misses.

Thursday, January 3, 2019

Today we continue discussing best practices from storage engineering:

275) Workloads that are not well-behaved may be throttled until they are well-behaved. A workload with a high request rate is more likely to be throttled; the opposite is also true.
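A token bucket is one common way to express such throttling; the rates below are placeholders, and a real product would keep a bucket per account or per partition:

import time

class TokenBucket:
    # a workload that exceeds its allowed request rate gets rejected or queued
    def __init__(self, rate_per_sec, burst):
        self.rate, self.capacity = rate_per_sec, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False               # caller should throttle this request

bucket = TokenBucket(rate_per_sec=100, burst=20)
accepted = sum(bucket.allow() for _ in range(1000))   # only a well-behaved share gets through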

276) Serialization of objects enables their reconstruction on the remote destination. It is more than a protocol for data packing and unpacking on the wire. It includes constraints that enable data validation and helps prevent failures down the line. If the serialization includes encryption, it becomes tamper-proof.
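A minimal sketch of serialization with validation and an integrity check follows; the record layout and the shared key are made up, and an HMAC stands in here for the encryption step that makes the payload tamper-evident:

import hashlib, hmac, json

SECRET = b"shared-key"             # placeholder key for illustration only

def pack(record: dict) -> bytes:
    # validate before packing so bad data fails here, not at the destination
    if "id" not in record or not isinstance(record["id"], int):
        raise ValueError("record must carry an integer id")
    payload = json.dumps(record, sort_keys=True).encode()
    tag = hmac.new(SECRET, payload, hashlib.sha256).hexdigest().encode()
    return tag + b"." + payload

def unpack(blob: bytes) -> dict:
    tag, payload = blob.split(b".", 1)
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("payload was tampered with in transit")
    return json.loads(payload)

assert unpack(pack({"id": 7, "name": "chunk-07"}))["name"] == "chunk-07"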

277) Serializability is also the notion of correctness when simultaneous updates happen to a resource. When multiple transactions commit their actions, their result corresponds to the one from some serial execution of those transactions. This is very helpful in eliminating inconsistencies across transactions. Serializability differs from isolation only in that the latter views the same property from the point of view of a single transaction.

278) Databases were veritable storage systems that guaranteed transactions. Two-phase locking was introduced with transactions, where a shared lock is acquired before a read and an exclusive lock before a write. The two phases refer to a growing phase, in which locks are only acquired, and a shrinking phase, in which they are only released. With transactions blocking on a wait queue, this was a way to enforce serializability.
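A toy lock table makes the idea concrete; real lock managers also handle waiting, deadlock detection, and lock upgrades, all of which are elided here:

class TwoPhaseLocking:
    # shared (S) locks before reads, exclusive (X) locks before writes;
    # locks accumulate in the growing phase and are dropped only at commit
    def __init__(self):
        self.table = {}                        # resource -> (mode, holders)

    def lock(self, txn, resource, mode):
        held_mode, holders = self.table.get(resource, ("S", set()))
        if holders and (mode == "X" or held_mode == "X") and holders != {txn}:
            return False                       # conflict: the caller must wait or abort
        self.table[resource] = ("X" if mode == "X" else held_mode, holders | {txn})
        return True

    def commit(self, txn):
        # shrinking phase: release every lock the transaction holds
        for res, (mode, holders) in list(self.table.items()):
            holders.discard(txn)
            if not holders:
                del self.table[res]

mgr = TwoPhaseLocking()
assert mgr.lock("t1", "row:42", "S")           # t1 reads row 42
assert not mgr.lock("t2", "row:42", "X")       # t2 cannot write it yet
mgr.commit("t1")
assert mgr.lock("t2", "row:42", "X")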

279) Transaction locking and logging proved onerous and complicated. Multi-version concurrency control was brought in so that locks need not be acquired. With a consistent view of the data at some point of time in the past, we no longer need to keep track of every change made since that point of time.
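A toy multi-version store shows the shape of this: writers append new versions, and a reader sees a consistent snapshot as of the timestamp at which it started, without taking any locks:

import itertools

class MVCCStore:
    def __init__(self):
        self.clock = itertools.count(1)
        self.versions = {}                      # key -> list of (commit_ts, value)

    def write(self, key, value):
        ts = next(self.clock)
        self.versions.setdefault(key, []).append((ts, value))
        return ts

    def snapshot(self):
        return next(self.clock)                 # a reader only needs a timestamp

    def read(self, key, snap_ts):
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snap_ts]
        return visible[-1] if visible else None

store = MVCCStore()
store.write("k", "v1")
snap = store.snapshot()
store.write("k", "v2")                          # later writers do not block the reader
assert store.read("k", snap) == "v1"            # the snapshot still sees the older version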

280) Optimistic concurrency control was introduced to allow each transaction to maintain histories of reads and writes so that those causing isolation conflicts can be rolled back.
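A sketch of that read-set/write-set bookkeeping and the validation at commit time (the versioned in-memory table here is purely illustrative):

class OptimisticTxn:
    # track what was read and written; validate at commit and roll back on conflict
    def __init__(self, db):
        self.db, self.reads, self.writes = db, {}, {}

    def read(self, key):
        value, version = self.db.get(key, (None, 0))
        self.reads[key] = version
        return value

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # validation: every key we read must still be at the version we saw
        for key, seen_version in self.reads.items():
            if self.db.get(key, (None, 0))[1] != seen_version:
                return False                    # isolation conflict: roll back
        for key, value in self.writes.items():
            _, version = self.db.get(key, (None, 0))
            self.db[key] = (value, version + 1)
        return True

db = {"balance": (100, 1)}                      # value, version
t1, t2 = OptimisticTxn(db), OptimisticTxn(db)
t1.read("balance"); t2.read("balance")
t1.write("balance", 90); t2.write("balance", 80)
assert t1.commit() is True
assert t2.commit() is False                     # t2 read a now-stale version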

Wednesday, January 2, 2019

Today we continue discussing best practices from storage engineering:

266) Statistics gathering: every accounting operation within the storage product uses some form of statistics, from summations to building histograms, and these inevitably take up memory, especially if they cannot be done in one pass. Some of these operations were done as synchronous aggregations, but when the size of the data is very large they are translated to batch or stream operations. With SQL-like queries using PARTITION BY and OVER, smaller chunks can be processed in an online manner. However, most such operations can be delegated to the background.
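A one-pass accumulator (Welford's method for mean and variance, plus a coarse histogram) is one way to keep such statistics in bounded memory; the bucket width and sample values below are arbitrary:

class RunningStats:
    def __init__(self, bucket_width=10):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.histogram, self.bucket_width = {}, bucket_width

    def add(self, x):
        # single pass: update mean/variance and the histogram as values stream by
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        bucket = int(x // self.bucket_width) * self.bucket_width
        self.histogram[bucket] = self.histogram.get(bucket, 0) + 1

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for latency_ms in (12, 7, 33, 18, 25):          # e.g. streamed request latencies
    stats.add(latency_ms)
print(stats.n, round(stats.mean, 1), round(stats.variance, 1), stats.histogram)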

267) Index reconstruction: users may request that data be re-organized in the background, such as sorting it on different attributes or repartitioning it across multiple disks. Online re-organization of files is very inefficient and costly to the user. Therefore, some form of separation is called for.

268) Physical re-organization: with data accesses over time that involve many insertions and deletions, the storage on disk may become fragmented. To overcome this, routine reorganization becomes necessary. The same holds for index reconstruction, which is fairly expensive and generally done at the restart of a service. Out-of-rotation methods, where one index is fully reconstructed prior to switching over, are also tolerated in some cases.
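A sketch of the out-of-rotation method: the replacement index is built off to the side from the same records, and only a brief exclusive section is needed for the swap (the store and key functions here are made up):

import threading

class IndexedStore:
    def __init__(self, records):
        self.records = records                  # list of (key, row) pairs
        self.index = dict(records)
        self._swap_lock = threading.Lock()

    def rebuild_index(self, key_fn):
        # build the new index in the background; readers keep using the old one
        fresh = {key_fn(row): row for _, row in self.records}
        with self._swap_lock:                   # atomic switch-over
            self.index = fresh

store = IndexedStore([(1, {"id": 1, "name": "a"}), (2, {"id": 2, "name": "b"})])
store.rebuild_index(key_fn=lambda row: row["name"])    # reorganize on a different attribute
print(sorted(store.index.keys()))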

269) Backup/export: all storage products enable data to be imported and exported. Backup and replication are some of the common techniques to export the data. Since they are long-running processes, they cannot take locks. Instead, a fuzzy dump is taken and then the logs are processed to bring it to some form of consistency.
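A sketch of a fuzzy dump, assuming the change log collected while the copy was in flight is available for replay afterwards:

def fuzzy_backup(table, changes_during_dump):
    # copy without locks, then replay the logged changes to reach a consistent point
    snapshot = dict(table)
    for op, key, value in changes_during_dump:
        if op == "put":
            snapshot[key] = value
        elif op == "delete":
            snapshot.pop(key, None)
    return snapshot

# hypothetical scenario: 'b' was updated and 'a' deleted while the dump was running
print(fuzzy_backup({"a": 1, "b": 2}, [("put", "b", 3), ("delete", "a", None)]))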

270) Queries consume a lot of resources and conflict with the read-write operations on the data path. When the two cannot be separated, the system must allow prioritization of queries and tolerate the elapsed time of long-running queries.

Tuesday, January 1, 2019

Today we continue discussing best practices from storage engineering:

261) Hardware techniques for replication are helpful when the inventory is something we can control along with the deployment of the storage product. Even so, there has been a shift to software-defined stacks, and replication per se is not required to be hardware-implemented any more. If it is offloaded to hardware, the total cost of ownership increases, so it must be offset with some gains.

262) However, the notion of physical replication, when implemented in the software stack, is perhaps the simplest of all. If the data is large, the time to replicate is proportional to the data size and inversely proportional to the available bandwidth. Then there are costs to reinstall the storage container and make sure it is consistent. This is therefore an option for end users and typically a client-side workaround.

263) Trigger-based replication is the idea of capturing incremental changes as and when they happen so that only those changes are propagated to the destination. The incremental changes are captured, shipped to the remote site, and the modifications are replayed there.
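A sketch with a row-level trigger modeled as a callback; the table and change-queue shapes are invented for illustration:

class SourceTable:
    def __init__(self, on_change):
        self.rows, self.on_change = {}, on_change

    def put(self, key, value):
        self.rows[key] = value
        self.on_change(("put", key, value))     # the trigger fires on every modification

captured = []                                   # change queue bound for the remote site
source = SourceTable(on_change=captured.append)
source.put("user:1", {"name": "alice"})

remote = {}
for op, key, value in captured:                 # replay the shipped changes remotely
    if op == "put":
        remote[key] = value
print(remote)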

264) Log-based replication is probably the most performant scheme: the log is actively watched for data changes, which are intercepted and sent to the remote system. Either the log is read and the data changes are passed to the destination, or the log is read and the captured log records themselves are passed to the destination. This technique is performant because it has low overhead.
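A sketch of shipping from a write-ahead log; the log record format and the replica are simplified stand-ins:

def ship_log(log, destination, last_shipped=0):
    # watch the log and forward only the entries past our cursor
    for entry in log[last_shipped:]:
        destination.apply(entry)
    return len(log)                             # remember how far we have shipped

class Replica:
    def __init__(self):
        self.data = {}
    def apply(self, entry):
        op, key, value = entry
        if op == "put":
            self.data[key] = value
        else:
            self.data.pop(key, None)

wal = [("put", "k1", "v1"), ("put", "k2", "v2"), ("del", "k1", None)]
replica = Replica()
cursor = ship_log(wal, replica)                 # low overhead: only the log is read
print(replica.data, cursor)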

265) Most of the log-based replication methods are proprietary.  A standard for this is hard to enforce and accepting all proprietary formats is difficult to maintain.

Conclusion: Storage engineering has demonstrated consistent success in the industry with these and more salient considerations. We can find its manifestations in products from the tiniest to the largest.

Monday, December 31, 2018

Today we continue discussing best practices from storage engineering:

256) Catalogs maintained in storage products tend to get large. A sufficient allocation may need to be estimated. This is not always easy to do with a back-of-the-envelope calculation, but allowing the catalog to grow elastically is a good design practice.

257) Memory allocation is a very common operation on the data path. Correct management of memory is required both for programming and for performance. A context-based memory allocator is often used. It involves the following steps: a context is created with a given name or type; a chunk is allocated within the context; the chunk of memory is freed within the context after use; the context is then deleted or reset. Alternatively, some systems are implemented in languages with a universal runtime and garbage collection to utilize the built-in collection and finalization.
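A conceptual sketch of that lifecycle follows; real context allocators live in C or C++ runtimes, so Python objects merely stand in for raw chunks here:

class MemoryContext:
    def __init__(self, name):
        self.name, self.chunks = name, []

    def alloc(self, size):
        chunk = bytearray(size)                 # a chunk allocated within this context
        self.chunks.append(chunk)
        return chunk

    def free(self, chunk):
        self.chunks.remove(chunk)               # delete an individual chunk after use

    def reset(self):
        self.chunks.clear()                     # resetting the context releases everything at once

ctx = MemoryContext("query-executor")
buf = ctx.alloc(4096)
ctx.free(buf)
ctx.reset()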

258) Memory contexts provide more control than garbage collectors. Developers can provide both spatial and temporal locality of deallocation. Garbage collectors work on all of a program’s memory but some languages offer ways to tune up the garbage collection for specific servers.

259) Overheads may be reduced by not requiring as many calls to the kernel as there are user requests. It is often better to consolidate them in bulk so that the majority of the calls return in a shallow manner.

260) Administrators are often tasked with provisioning sufficient memory on all compute resources associated with the storage product. With t-shirt-sized commodity virtual machines, this is only partially addressed because the size only specifies the physical memory. Virtual memory and its usage must be made easier to query so that corrective measures may be taken.

Sunday, December 30, 2018

Today we continue discussing best practices from storage engineering:

251) Bitmap indexes are useful for columns with a small number of distinct values because they take up less space than a B+ tree, which requires a value and record-pointer tuple for each record. Bitmaps are also helpful for conjunctive filters.
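A toy bitmap index on a low-cardinality column, with a second invented column to show the conjunctive filter as a single bitwise AND:

rows = ["gold", "silver", "gold", "bronze", "gold", "silver"]   # low-cardinality column

bitmaps = {}
for i, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)           # one bit per row id

regions = [0, 1, 1, 0, 1, 0]                    # hypothetical second column, 1 = "west"
west = sum(bit << i for i, bit in enumerate(regions))

matches = bitmaps["gold"] & west                # tier = 'gold' AND region = 'west'
row_ids = [i for i in range(len(rows)) if matches >> i & 1]
print(row_ids)                                  # rows 2 and 4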

252) B+ trees are helpful for fast insertion, deletion, and update of records. They are generally not as helpful to warehouses as bitmaps.

253) Bulk load is a very common case in many storage products, including data warehouses. Bulk loads have to be an order of magnitude faster than individual insertions. Typically they do not incur the same overhead for every record; they take up the overhead up front, before the batch or stream enters storage.
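A minimal illustration with SQLite (chosen only for convenience): the per-record overhead is paid once by loading the whole batch inside one transaction with one prepared statement:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

batch = [(i, "event-%d" % i) for i in range(10000)]
with conn:                                      # a single transaction for the whole load
    conn.executemany("INSERT INTO events VALUES (?, ?)", batch)

print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])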

254) Bulk loads may not be as prevalent when the storage product is already real-time. The trouble with real-time products is that read-write traffic is not separated from read-only traffic, and the two may contend for mutual exclusion. Moreover, sets of queries may not see compatible answers.

255) Update-in-place and historical queries pose real-time challenges. If the values of the updates are maintained in their chronological order, then the queries may simply respond with the values from the recent past. Such a collection of queries, with answers from the same point of time, is compatible.

A use case for visibility of storage products: https://1drv.ms/w/s!Ashlm-Nw-wnWuDSAzBSGbG3Wy6aG 

Saturday, December 29, 2018


Today we continue discussing best practices from storage engineering:

245) The process-pool-per-disk-worker model has alleviated the need to fork and tear down processes, and every process in the pool is capable of executing any of the read-writes from any of the clients. The process pool size is generally finite if not fixed. This has all of the advantages of the process-per-disk-worker model above, with the possibility of differentiated processes in the pool and their quotas.

246) When compute and storage are consolidated, they have to be treated as commodity and scalability is achieved only with the help of scale-out. On the other hand, they are inherently different. Therefore, nodes dedicated to computation may be separated from nodes dedicated to storage. This lets them both scale and load-balance independently.

247) Range-based partitioning/indexing is much more beneficial for sequential access, such as with streams, because the locality of a set of ranges makes enumeration easier and faster. This helps with performance. Hash-based partitioning/indexing is better when we have to fan out the processing across partitions, with keys that hash to the same bucket landing in the same partition. This helps with load balancing.

248) Throttling or isolation is very useful when accounts are not well-behaved. Statistics are collected by the partition server, which keeps track of request rates for accounts and partitions. The same request rates may also be used for load balancing.

249) Automatic load balancing can now be built on the range-based partitioning approach and the account-based throttling. This improves multi-tenancy in the environment as well as the handling of peaks in traffic patterns.

250) The algorithm for load-balancing can even be adaptive based on choosing appropriate metrics to determine traffic patterns that are well-known.  We start with a single number to quantify load on each partition and each server and then use the product of request latency and request rate to represent loads.
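A sketch of that single-number load score and a candidate rebalancing move; the server names, latencies, and rates are made up:

def load_score(request_latency_ms, request_rate_per_sec):
    # one number per partition or server: the product of latency and request rate
    return request_latency_ms * request_rate_per_sec

servers = {
    "srv-a": [("p1", 12, 300), ("p2", 8, 120)],  # (partition, latency ms, requests/s)
    "srv-b": [("p3", 5, 40)],
}

totals = {s: sum(load_score(lat, rate) for _, lat, rate in parts)
          for s, parts in servers.items()}
hottest = max(totals, key=totals.get)
coolest = min(totals, key=totals.get)
partition_to_move = max(servers[hottest], key=lambda p: load_score(p[1], p[2]))
print(totals, "move", partition_to_move[0], "from", hottest, "to", coolest)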