Sunday, December 16, 2018

Today we continue discussing the best practices from storage engineering:

181) Vectorized execution processes data in a pipelined fashion, in batches, without materializing intermediary results the way map-reduce does. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull data. A storage tier that can serve both these models of execution works well with either.
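A minimal sketch of the push-based model might look like the following; the operator names (Scan, Filter, Sink) and the batch size are illustrative, not taken from any particular product:

```python
# Sketch of push-based execution: each operator pushes result batches to its
# downstream consumer instead of the consumer pulling rows one at a time.

class Sink:
    """Terminal operator that collects pushed batches."""
    def __init__(self):
        self.rows = []
    def push(self, batch):
        self.rows.extend(batch)

class Filter:
    """Pushes only the rows that satisfy a predicate."""
    def __init__(self, predicate, downstream):
        self.predicate = predicate
        self.downstream = downstream
    def push(self, batch):
        self.downstream.push([row for row in batch if self.predicate(row)])

class Scan:
    """Source operator: reads data in batches and pushes them downstream."""
    def __init__(self, data, downstream, batch_size=2):
        self.data = data
        self.downstream = downstream
        self.batch_size = batch_size
    def run(self):
        for i in range(0, len(self.data), self.batch_size):
            self.downstream.push(self.data[i:i + self.batch_size])

sink = Sink()
plan = Scan([1, 2, 3, 4, 5, 6], Filter(lambda x: x % 2 == 0, sink))
plan.run()
print(sink.rows)  # [2, 4, 6] -- even values flow through without an intermediate result set
```

Note that control flows from the source downward; no operator ever asks its upstream for the next row, which is what keeps the pipeline free of intermediary materialization.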

182) Data types – their formats and semantics have evolved widely. The pipeline must at least support primitives such as variants, objects and arrays. Support for data stores has become more important than services. Therefore datastores do well to support at least some of these primitives in order to cater to a wide variety of analytical workloads.

183) Time-Travel – time travel walks through different versions of the data. In SQL queries, this is done with the help of the AT and BEFORE keywords. Timestamps can be absolute, relative to the current time, or relative to previous statements. This is similar to change data capture in SQL Server, in that we get a historical record of all the changes, except that we get there differently.
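As a sketch of the idea, a versioned key-value store can answer AT and BEFORE reads over absolute timestamps; the class and method names below are made up for illustration, and writes are assumed to arrive in timestamp order:

```python
import bisect

# Sketch of time travel: every write is kept as a (timestamp, value) version.
# A read "AT" a timestamp returns the latest version at or before that point;
# a read "BEFORE" returns the latest version strictly before it.

class TimeTravelStore:
    def __init__(self):
        self.versions = {}  # key -> list of (ts, value), appended in ts order

    def put(self, key, value, ts):
        self.versions.setdefault(key, []).append((ts, value))

    def get_at(self, key, ts):
        """Latest value whose timestamp is <= ts (analogous to SQL AT)."""
        hist = self.versions.get(key, [])
        i = bisect.bisect_right([t for t, _ in hist], ts)
        return hist[i - 1][1] if i else None

    def get_before(self, key, ts):
        """Latest value strictly before ts (analogous to SQL BEFORE)."""
        hist = self.versions.get(key, [])
        i = bisect.bisect_left([t for t, _ in hist], ts)
        return hist[i - 1][1] if i else None

store = TimeTravelStore()
store.put("balance", 100, ts=1)
store.put("balance", 250, ts=5)
print(store.get_at("balance", 5))      # 250
print(store.get_before("balance", 5))  # 100
```

Relative timestamps reduce to the same lookup once they are resolved against the current time or a previous statement's time.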

184) Multi-version concurrency control – this provides concurrent access to the data in what may be referred to as transactional memory. In order to prevent reads and writes from seeing inconsistent views, we use locking or multiple copies of each data item. Since a version is a snapshot in time and any change results in a new version, a writer can roll back its change while ensuring the change is never seen by others.
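A minimal sketch of the multiple-copies approach on a single data item follows; it is illustrative only, with one pending write at a time and made-up method names:

```python
# Sketch of MVCC: writers create new versions instead of overwriting, readers
# pin the version that was current when they began, and an uncommitted version
# can be rolled back without ever being visible to readers.

class MVCCItem:
    def __init__(self, value):
        self.committed = [(0, value)]   # (version, value), ascending
        self.pending = None             # at most one uncommitted write

    def begin_read(self):
        """Snapshot: remember the latest committed version number."""
        return self.committed[-1][0]

    def read(self, snapshot):
        """Return the newest committed value visible to this snapshot."""
        for version, value in reversed(self.committed):
            if version <= snapshot:
                return value

    def write(self, value):
        self.pending = (self.committed[-1][0] + 1, value)

    def commit(self):
        self.committed.append(self.pending)
        self.pending = None

    def rollback(self):
        self.pending = None  # the change was never visible to any reader

item = MVCCItem("v0")
snap = item.begin_read()
item.write("v1")
item.rollback()                       # writer backs out; readers never saw "v1"
print(item.read(snap))                # v0
item.write("v1")
item.commit()
print(item.read(item.begin_read()))   # v1
print(item.read(snap))                # v0 -- the old snapshot stays consistent
```

The key property is the last line: a reader holding an old snapshot keeps seeing a consistent view even after later commits.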

185) Reliability of data: a storage platform can provide redundancy and availability, but it has no control over the content. Data from pipelines may sink into the storage, but if the pipeline is not the source of truth, the storage tier cannot guarantee that the data is reliable. Garbage in, garbage out applies to the storage tier as well.

Saturday, December 15, 2018

Today we continue discussing the best practices from storage engineering:

175) Storage can be used with any analysis package. As a storage tier, a product only serves the basic Create-update-delete-list operations on the resources.  Anything above and beyond that is in the compute layer. Consequently, storage products do well when they integrate nicely with popular analysis packages.
176) When it comes to integration, programmability is important. How well a machine learning SDK can work with the data in the storage is important not only for the analysis side but also for the storage side. The workloads from such analysis are significantly different from others because they require GPUs for heavy iterations in their algorithms.
177) There are many data sources that can feed the analysis side, but the storage tier is uniquely positioned as a staging area for most in and out data transfers. Consequently, the easier it is to trace the data through the storage tier, the more popular it becomes for the analysis side.
178) Although we mentioned documentation, we have not elaborated on the plug-and-play kind of architecture. If the storage tier can act as a man-in-the-middle with very little disruption to ongoing data I/O, it can become useful in many indirect ways that differ from the usual direct storage of data.
179) A key aspect of the storage tier is the recognition of a variety of data formats, or in a specific case, file types. Whether a file is an image or a text file changes how certain storage chores such as deduplication and compression apply to it. Therefore, if the storage product provides the ability to differentiate data natively, it will help its acceptance and popularity.
180) File types are equally important for applications. Workloads become segregated by file types. If a storage container has all kinds of files, the ability to work with them as if the file types were independent would immensely benefit separating workloads and providing differential treatment.
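The native recognition described in 179) can be sketched with magic-byte detection; the signatures below are standard, but the policy of skipping compression for certain types is only illustrative:

```python
# Sketch of file-type recognition guiding storage chores: formats that are
# already compressed internally (e.g. JPEG, PNG) gain little from
# recompression, while plain text compresses well.

MAGIC = {
    b"\xff\xd8\xff": "jpeg",   # JPEG SOI marker
    b"\x89PNG": "png",         # PNG signature
    b"%PDF": "pdf",            # PDF header
}

def detect_type(data: bytes) -> str:
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    # crude printable-bytes heuristic for text vs opaque binary
    if all(32 <= b < 127 or b in (9, 10, 13) for b in data[:64]):
        return "text"
    return "binary"

def should_compress(data: bytes) -> bool:
    # skip formats that are already compressed internally
    return detect_type(data) not in ("jpeg", "png")

print(detect_type(b"\x89PNG\r\n\x1a\n"))           # png
print(should_compress(b"hello world log line\n"))  # True
```

The same differentiation can drive the workload segregation in 180): route by detected type rather than by container.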


Friday, December 14, 2018

Today we continue discussing the best practices from storage engineering:

165) Event subscription versus appenders: just like log appenders, it is possible to transfer the same event collection result to a large number of destinations simultaneously. These destinations can include files, databases, email recipients and so on.

166) Optional versus mandatory: every feature in the storage server that is not on the data critical path is a candidate for being turned off to save on resources and improve the data path. This applies equally to components that are not essential. Reducing the number of publisher-subscribers is another example of this improvement.
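One way to sketch this is with feature flags around the write path; the feature names and the write_block function are made up for illustration:

```python
# Sketch of shedding optional features: each non-essential component sits
# behind a flag that the critical path consults cheaply, so it can be turned
# off under load without touching the essential work.

FEATURES = {
    "checksums": True,          # on the data critical path: stays on
    "audit_log": True,          # optional bookkeeping
    "metrics_publisher": True,  # optional publisher-subscriber fan-out
}

audit_trail = []

def disable_optional():
    """Shed everything that is not essential to the data path."""
    for name in ("audit_log", "metrics_publisher"):
        FEATURES[name] = False

def write_block(block, subscribers):
    if FEATURES["audit_log"]:
        audit_trail.append(block)       # optional, off the hot path
    if FEATURES["metrics_publisher"]:   # fewer publisher-subscribers when off
        for publish in subscribers:
            publish(block)
    return len(block)                   # the essential data-path work

disable_optional()
calls = []
print(write_block(b"data", [calls.append]), calls)  # 4 []
```

After disable_optional(), the write still completes but no subscriber or audit work runs, which is exactly the improvement to the data path described above.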

167) The number of layers encountered in some operations may become large. In such cases, layering can accommodate components that directly talk to lower layers; layering is not always strict. However, the justifications to bypass layers must be well made. This counts towards performance by design.

168) There are periods of peak workload for any storage product. These products serve annual holiday sales, specific anniversaries and anticipated high demand, when utilization is unusually high. While capacity may be planned to meet the demand, there are ways to tune the existing system to extract more performance. Part of these efforts includes switching from being disk-intensive to performing more in-memory computation, and utilizing other software products alongside the storage server, such as memcached.
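The shift from disk-intensive reads to in-memory computation can be sketched as below, with a plain in-process cache standing in for memcached and the disk read simulated by a counter:

```python
import functools

# Sketch of fronting the disk with a memory cache during peak periods.
# DISK_READS counts trips down the expensive path we want to avoid.

DISK_READS = {"count": 0}

def read_from_disk(key):
    DISK_READS["count"] += 1      # the expensive path
    return f"value-for-{key}"

@functools.lru_cache(maxsize=1024)
def read(key):
    # first miss goes to disk; every subsequent read is served from memory
    return read_from_disk(key)

for _ in range(1000):
    read("hot-object")
print(DISK_READS["count"])        # 1
```

Under a real peak workload the cache would be external (memcached or similar) so that it is shared across server processes, but the access pattern is the same.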

169) When the load is high, it is difficult to turn on a profiler to study bottlenecks. This can, however, be done safely in advance in performance labs. An even easier strategy is to selectively turn off components that are not required and to scale the component that is under duress. A priority list of mitigating steps may be determined before periods of heavy load.

170) Monitoring the Service Level Agreements on the storage server allows us to determine the steps needed to maintain the optimum performance of the system. Maintaining standbys for servers, and replacements for cluster nodes or spare hardware either on the chassis or as a plug-in, helps with the speedy resolution of outages.

Usage example:
https://1drv.ms/w/s!Ashlm-Nw-wnWuCiBH-iYeyyYCrFG


Wednesday, December 12, 2018

Today we continue discussing the best practices from storage engineering:

160) Nativity of registries – user registries, on the other hand, are welcome and can be arbitrary. In such cases, the registries are about the users' own artifacts. However, such registries can be stored just the same way as user data. Consequently, the system does not need to participate in the user registries; it can earmark designated storage artifacts for this purpose.
161) Sequences – sequences hold a special significance in the storage server. If there are several actions taken by the storage server, the actions don't belong to the same group, and there is no way to assign a sequence number, then we rely on the names of the actions as they appear on the actual timeline, such as in the logs. When the names can be collected as sequences, we can perform standard query operations on the collections to determine patterns. This kind of pattern recognition is very useful when there are heterogeneous entries and the order in which the user initiates them is dynamic.
162) Event-driven framework: not all user-defined actions are fire and forget. Some of them may be interactive, and since there can be any amount of delay between interactions, some form of event-driven framework usually finds its way into the storage server. Storage drivers, background processors and even certain advanced UI controls use an event-driven framework.
163) Tracing: the most important and useful application of sequences is the tracing of actions for any activity. Just like logs, an event-driven framework may provide the ability to trace user actions as different system components participate in their completion. Tracing is very similar to profiling, but there needs to be a publisher-subscriber model. Most user-mode task completions are done with the help of a message queue and a sink.
164) Event querying: event-driven frameworks have the ability to operate not just on whole data but also on streams. This makes it very popular to write stream-based queries involving partitions and going over a partition.
165) Event subscription versus appenders: just like log appenders, it is possible to transfer the same event collection result to a large number of destinations simultaneously. These destinations can include files, databases, email recipients and so on.
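The appender-style fan-out in 165) can be sketched as one event delivered to several destinations; here a StringIO stands in for a file, a list for a database table, and a formatted string for an email, all of which are illustrative:

```python
import io
import json

# Sketch of appender fan-out: one published event reaches every subscribed
# destination simultaneously.

class EventBus:
    def __init__(self):
        self.appenders = []
    def subscribe(self, appender):
        self.appenders.append(appender)
    def publish(self, event):
        for append in self.appenders:
            append(event)

file_sink = io.StringIO()   # stands in for a log file
db_rows = []                # stands in for a database table
emails = []                 # stands in for an email recipient

bus = EventBus()
bus.subscribe(lambda e: file_sink.write(json.dumps(e) + "\n"))
bus.subscribe(db_rows.append)
bus.subscribe(lambda e: emails.append(f"ALERT: {e['msg']}"))

bus.publish({"msg": "disk 3 nearing capacity"})
print(len(db_rows), emails[0])  # 1 ALERT: disk 3 nearing capacity
```

Each appender is independent, so destinations can be added or removed without touching the event source, which is what makes the pattern attractive for storage servers.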

Tuesday, December 11, 2018

Today we continue discussing the best practices from storage engineering:

155) Pooled external resources: It is not just the host and its operating system resources that the storage product requires, it may also require resources that are remote from the local stack. Since such resources can be expensive, it is helpful for the storage product to be thrifty by pooling the resources and servicing as many workers in the storage product as possible.
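A minimal sketch of such pooling follows, with a Connection class standing in for an expensive remote handle; the names and pool size are illustrative:

```python
import queue

# Sketch of pooling an expensive external resource so that many workers share
# a small fixed set instead of each opening its own.

class Connection:
    opened = 0                    # counts how many were ever created
    def __init__(self):
        Connection.opened += 1
    def query(self, q):
        return f"result of {q}"

class Pool:
    def __init__(self, size):
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(Connection())
    def acquire(self):
        return self._free.get()   # blocks until a connection is free
    def release(self, conn):
        self._free.put(conn)

pool = Pool(size=2)
for i in range(10):               # ten workers served by only two connections
    conn = pool.acquire()
    conn.query(f"op-{i}")
    pool.release(conn)
print(Connection.opened)          # 2
```

The blocking acquire is the thrift mechanism: when the pool is exhausted, workers wait rather than opening another costly remote resource.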

156) Leveraging monitoring of the host: when an application is deployed to Platform-as-a-Service, it no longer has the burden of maintaining its own monitoring. The same applies to the storage server as a product, depending on where it is deployed. The deeper we go in the stack, including the fabric below the storage server, the more amenable the layers are to monitoring the layer above.

157) Immutables: throughout the storage server we use constants for everything from identifiers and names to even temporary data, as immutables. While we can differentiate them with number sequences, it is more beneficial to use strings. Strings not only allow names to be given but also help prefix and suffix matching. Even enums have names, and we can store them as single instances throughout the system.

158) System artifacts must have names that are not user-friendly, because they are reserved and should not come in the way of the names that the user wants to use. Moreover, these names have to be somewhat hidden from the users.

159) Registries – when there are collections of artifacts that are reserved for a purpose, they need to be maintained somewhere as a registry. It facilitates lookups. However, registries, like lists, cannot keep piling up. As long as we encapsulate the logic that determines the list, the list itself is redundant because we can execute the logic over and over again. However, this is often hard to enforce as a sound architecture principle.

160) Nativity of registries – user registries, on the other hand, are welcome and can be arbitrary. In such cases, the registries are about the users' own artifacts. However, such registries can be stored just the same way as user data. Consequently, the system does not need to participate in the user registries; it can earmark designated storage artifacts for this purpose.

int getIndexOf(node* root, node* moved) {
    if (root == NULL || moved == NULL) return -1;
    int count = 0;
    for (node* tail = root; tail != NULL; tail = tail->next) {
        if (tail == moved) return count;   /* found: return zero-based index */
        count++;
    }
    return -1;                             /* moved is not on the list */
}

Monday, December 10, 2018

Today we continue discussing the best practices from storage engineering:

151) Querying: storage products are not required to provide a query layer. Still, they may participate in the query layer on an enumeration basis. Since the query layer can come separately over the storage entities, it is generally written in the form of a web service. However, SQL-like queries may also be supported on the enumerations, as shown here: https://1drv.ms/w/s!Ashlm-Nw-wnWt1QqRW-GkZlfg0yz
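A sketch of SQL-like querying composed purely on enumeration follows; the entity fields and the select helper are illustrative, with the storage tier contributing only the list operation:

```python
# Sketch of a query layer built on enumeration: the storage tier only lists
# resources, and SELECT/WHERE/ORDER BY-style filtering is composed on top.

def enumerate_objects():
    """Stands in for the storage tier's list operation."""
    yield {"name": "a.log", "size": 120, "bucket": "logs"}
    yield {"name": "b.log", "size": 4096, "bucket": "logs"}
    yield {"name": "c.img", "size": 9000, "bucket": "media"}

def select(source, where=None, order_by=None):
    rows = [r for r in source if where is None or where(r)]
    if order_by:
        rows.sort(key=order_by)
    return rows

# roughly: SELECT name FROM objects WHERE bucket = 'logs' ORDER BY size DESC
big_logs = select(enumerate_objects(),
                  where=lambda r: r["bucket"] == "logs",
                  order_by=lambda r: -r["size"])
print([r["name"] for r in big_logs])  # ['b.log', 'a.log']
```

Because the query logic never reaches below the enumeration, it can live in a separate web service without any change to the storage tier.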

152) Storage product code is usually a large code base. Component interaction diagrams can be generated, and reorganizations may be done to improve the code. The practices associated with large code bases apply to storage server code as well.

153) Side-by-side: most storage products have to be elastic to growing data. However, customers may choose to use more than one instance. In such a case, the storage product must work well side by side. This is generally the case when we don't bind to pre-determined operating system resources and instead acquire them on initialization, using the Resource Acquisition Is Initialization principle.
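Acquire-on-initialization can be sketched as follows; the StorageInstance class, port strategy and scratch directory are illustrative stand-ins for a real product's resources:

```python
import socket
import tempfile

# Sketch of acquiring resources at initialization rather than binding to
# fixed, pre-determined ones, so that several instances run side by side.

class StorageInstance:
    def __init__(self):
        # ask the OS for any free port instead of hard-coding one
        self._sock = socket.socket()
        self._sock.bind(("127.0.0.1", 0))
        self.port = self._sock.getsockname()[1]
        # a private scratch directory instead of a shared fixed path
        self.scratch = tempfile.mkdtemp(prefix="store-")

    def close(self):
        self._sock.close()

a, b = StorageInstance(), StorageInstance()
print(a.port != b.port, a.scratch != b.scratch)  # True True
a.close()
b.close()
```

Because neither instance assumes a fixed port or path, nothing prevents a second (or tenth) copy from starting on the same host.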

154) Multiplexing: by the same argument as above, it is not necessary for the operating system to see multiple instances of the storage product as separate processes. For example, if the storage product is accessible via an HTTP endpoint, multiple instances can register different endpoints while using host virtualization for operating system resources. This is the case with Linux containers.

155) Pooled external resources: It is not just the host and its operating system resources that the storage product requires, it may also require resources that are remote from the local stack. Since such resources can be expensive, it is helpful for the storage product to be thrifty by pooling the resources and servicing as many workers in the storage product as possible.