Cluster computing

Wednesday, January 2, 2019

Today we continue discussing best practices from storage engineering:

265) Most log-based replication methods are proprietary. A standard is hard to enforce, and supporting every proprietary format is difficult to maintain.

266) Statistics gathering: Every accounting operation within the storage product uses some form of statistics, from simple summations to histograms, and these inevitably take up memory, especially when they cannot be computed in one pass. Some of these operations started out as synchronous aggregations, but when the data grows very large they are converted to batch or stream operations. With SQL-like queries that use PARTITION BY and OVER, smaller chunks can be processed in an online manner. Even so, most such operations can be delegated to the background.
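
As a rough illustration of the one-pass idea, the sketch below keeps a count, sum, mean, and variance in constant memory using Welford's update and feeds the data through in small chunks. The names RunningStats, in_chunks, and summarize are hypothetical and do not come from any particular storage product.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, Iterator, List


@dataclass
class RunningStats:
    """One-pass statistics in O(1) memory (Welford's update)."""
    count: int = 0
    total: float = 0.0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def add(self, x: float) -> None:
        self.count += 1
        self.total += x
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        # Population variance; defined as 0.0 for an empty stream.
        return self.m2 / self.count if self.count else 0.0


def in_chunks(values: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most `size` items from `values`."""
    it = iter(values)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def summarize(values: Iterable[float], chunk_size: int = 1000) -> RunningStats:
    """Consume the input chunk by chunk so that only one chunk is held at a time."""
    stats = RunningStats()
    for chunk in in_chunks(values, chunk_size):
        for x in chunk:
            stats.add(x)
    return stats
```

Because the state is a handful of numbers, the same accumulator could be updated by a background task as new data arrives rather than by a synchronous scan.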

267) Index reconstruction: Users may request that data be reorganized in the background, such as sorting it on different attributes or repartitioning it across multiple disks. Online reorganization of files is very inefficient and costly to the user. Therefore, some form of separation between this background work and the online path is called for.

268) Physical re-organization: As data is accessed over time with many insertions and deletions, the storage on disk may become fragmented. To overcome this, routine reorganization becomes necessary. The same holds for index reconstruction, which is fairly expensive and generally done at the restart of a service. Out-of-rotation methods, where one index is fully reconstructed before the switch to it is made, are also tolerated in some cases.
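
The out-of-rotation approach can be pictured roughly as follows: a replacement index is built off to the side while the current one keeps serving lookups, and only the final switch is done under a lock. This is a minimal in-memory sketch; SwappableIndex and its methods are hypothetical names, and a real product would rebuild from on-disk data.

```python
import threading
from typing import Dict, List, Sequence, Tuple

Record = Tuple[str, str]  # (key, value)


class SwappableIndex:
    """Maps each key to the positions of its records; rebuilt out of rotation."""

    def __init__(self, records: Sequence[Record]) -> None:
        self._lock = threading.Lock()
        self._active = self._build(records)

    @staticmethod
    def _build(records: Sequence[Record]) -> Dict[str, List[int]]:
        index: Dict[str, List[int]] = {}
        for position, (key, _value) in enumerate(records):
            index.setdefault(key, []).append(position)
        return index

    def lookup(self, key: str) -> List[int]:
        # Readers always see a complete index, either the old or the new one.
        return self._active.get(key, [])

    def rebuild(self, records: Sequence[Record]) -> None:
        fresh = self._build(records)  # expensive part, done outside the lock
        with self._lock:              # serializes concurrent rebuilds
            self._active = fresh      # the switch itself is a reference swap
```

The cost of this design is that two copies of the index exist for the duration of the rebuild, which is presumably why the method is tolerated only in some cases.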

269) Backup/Export: All storage products enable data to be imported and exported. Backup and replication are among the common techniques for exporting data. Since they are long-running processes, they cannot hold locks for their duration. Instead, a fuzzy dump is taken and the logs are then processed to bring it to some form of consistency.
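
To make the fuzzy dump idea concrete, here is a toy sketch in which a backup copies keys one at a time without locking while writes are interleaved, and then replays the log records written since the backup began. Store, fuzzy_backup, and the concurrent_writes parameter (used only to simulate a writer) are hypothetical; the replay is safe here because each log record is an absolute set or delete, so applying it twice gives the same result.

```python
from typing import Dict, List, Optional, Tuple

# (lsn, key, value); a value of None records a delete.
LogRecord = Tuple[int, str, Optional[str]]


class Store:
    """Toy key-value store that logs every write with a sequence number."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.log: List[LogRecord] = []

    def write(self, key: str, value: Optional[str]) -> None:
        lsn = len(self.log) + 1
        self.log.append((lsn, key, value))
        if value is None:
            self.data.pop(key, None)
        else:
            self.data[key] = value


def fuzzy_backup(store: Store,
                 concurrent_writes: List[Tuple[str, Optional[str]]]) -> Dict[str, str]:
    start_lsn = len(store.log)         # remember where the log stood
    pending = list(concurrent_writes)
    dump: Dict[str, str] = {}
    for key in sorted(store.data):     # key list captured once, no locks taken
        if key in store.data:          # the key may have been deleted meanwhile
            dump[key] = store.data[key]
        if pending:                    # simulate a writer running alongside
            store.write(*pending.pop(0))
    # The dump may now mix old and new values; replay the log to repair it.
    for lsn, key, value in store.log:
        if lsn > start_lsn:
            if value is None:
                dump.pop(key, None)
            else:
                dump[key] = value
    return dump
```

After the replay, the dump matches the store as of the last log record processed, even though no single moment during the copy was consistent.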

270) Queries consume a lot of resources, which conflicts with the read-write operations on the data path. When the two cannot be separated, the product must allow queries to be prioritized and must account for how much elapsed time is tolerated for long-running queries.
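
One possible reading of this, sketched below, is a scheduler that runs queries in priority order and abandons any query that exceeds a budget. QueryScheduler is a hypothetical name, the budget is counted in steps rather than wall-clock time to keep the example deterministic, and each query is modelled as an iterator that yields once per unit of work.

```python
import heapq
import itertools
from typing import Iterator, List, Tuple


class QueryScheduler:
    """Runs queued queries lowest priority number first, with a step budget."""

    def __init__(self, max_steps: int) -> None:
        self.max_steps = max_steps
        self._heap: List[Tuple[int, int, str, Iterator[object]]] = []
        self._seq = itertools.count()  # tie-breaker keeps submission order

    def submit(self, name: str, priority: int, work: Iterator[object]) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), name, work))

    def run(self) -> List[Tuple[str, str]]:
        outcomes: List[Tuple[str, str]] = []
        while self._heap:
            _priority, _seq, name, work = heapq.heappop(self._heap)
            status = "done"
            steps = 0
            for _ in work:
                steps += 1
                if steps > self.max_steps:  # budget exceeded: stop this query
                    status = "cancelled"
                    break
            outcomes.append((name, status))
        return outcomes
```

For example, with a budget of 5 steps, a 3-step lookup submitted at priority 1 finishes before a 100-step report submitted earlier at priority 2, and the report is cancelled once it runs past the budget.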
