Monday, December 17, 2018

185) Reliability of data: A storage platform can provide redundancy and availability, but it has no control over the content. Data from pipelines may sink into the storage, but if the pipeline is not the source of truth, the storage tier cannot guarantee that the data is reliable. Garbage in, garbage out applies to the storage tier as well.

186) Cost-based optimization: When we can determine the cost function for a state of the storage system or for the processing of a query, we naturally work towards the optimum by progressively decreasing the cost. Methods like simulated annealing serve this purpose. The tradeoff is that the cost function is an oversimplification: the trend of consistently lowering the cost as a linear function does not represent all the parameters of the system. Data mining algorithms may help here if we can form a decision tree or a classifier that encapsulates all the logic associated with the parameters, from both supervised and unsupervised learning.

187) AI/ML pipeline: One of the emerging trends of vectorized execution is its use with new AI/ML packages that are easy to run on GPU-based machines while pointing at the data from the pipeline. While trees, graphs and forests are a way to represent the decision-making models of the system, the storage tier can enable the analysis stack with better concurrency, partitions and summation forms.

188) Declarative querying: SQL is a declarative querying language. It works well for database systems and relational data. Its bridging to document stores and Big Data is merely a convenience. A storage tier does not necessarily participate in data management systems, yet the storage tier has to enable querying.

189) Support for transformations from batch to stream processing using the same storage tier: Products like Apache Flume are able to support dual-mode processing by allowing transformations to different stores. Unless we have a data management system in place, a storage tier does not support SQL query keywords like Partition, Over, On, Before, or TumblingWindow. Support for SQL directly from the storage tier, by iterating over storage resources, is rather limited. However, if support for products like Flume is present, then it makes no difference to the analysis whether the product is a time-series database or an object storage.

190) A pipeline may use the storage tier as a sink for its data. Since pipelines have their own reliability issues, a storage product must not degrade the pipeline, no matter how many pipelines share the storage.

Sunday, December 16, 2018

Today we continue discussing the best practice from storage engineering:

181)  Vectorized execution means data is processed in a pipelined fashion, without materializing intermediary results the way map-reduce does. Push-based execution means that relational operators push their results to their downstream operators rather than waiting for those operators to pull the data. If the storage can serve both these models of execution, they work really well.
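
The push-based model can be sketched as operators wired together with callbacks (all names here are illustrative, not from any real engine): a scan drives rows through a filter into a sum sink, and no intermediate result set is ever built.

```c
#include <stddef.h>

/* A downstream operator is just a callback that consumes one value. */
typedef void (*push_fn)(int value, void *state);

/* Filter operator: pushes only even values to its downstream. */
typedef struct { push_fn next; void *next_state; } filter_state;
static void filter_push(int value, void *state) {
    filter_state *f = state;
    if (value % 2 == 0) f->next(value, f->next_state);
}

/* Sink operator: sums whatever reaches it. */
static void sum_push(int value, void *state) { *(int *)state += value; }

/* Scan operator: drives the pipeline by pushing each row downstream,
 * so no intermediate result set is materialized between operators. */
static void scan(const int *rows, size_t n, push_fn next, void *state) {
    for (size_t i = 0; i < n; i++) next(rows[i], state);
}
```

The pull-based alternative would instead have the sink call a next() on the filter, which calls next() on the scan; the operator bodies stay the same, only the direction of the calls flips.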

182)  Data types – their formats and semantics have evolved to a wide degree. The pipeline must at least support primitives such as variants, objects and arrays. Support for data stores has become more important than services. Therefore, data stores do well to support at least some of these primitives in order to cater to a wide variety of analytical workloads.

183) Time-Travel – Time travel means walking through different versions of the data. In SQL queries, this is done with the help of the AT and BEFORE keywords. Timestamps can be absolute, relative with respect to the current time, or relative with respect to previous statements. This is similar to change data capture in SQL Server, in that we have a historical record of all the changes, except that we get there differently.
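
The core of an "AT timestamp" lookup can be sketched as follows (a minimal model, assuming versions are kept sorted by timestamp; the names are invented for illustration): the visible value is the latest version stamped at or before the requested time.

```c
#include <stddef.h>

typedef struct { long ts; int value; } version;

/* Returns the value visible "AT" timestamp t: the latest version whose
 * timestamp is <= t, or 'fallback' if no version existed yet.
 * 'versions' must be sorted by ascending timestamp. */
static int value_at(const version *versions, size_t n, long t, int fallback) {
    int result = fallback;
    for (size_t i = 0; i < n && versions[i].ts <= t; i++)
        result = versions[i].value;
    return result;
}
```

A "BEFORE" lookup is the same scan with a strict < comparison; relative timestamps reduce to this once they are resolved against the current time.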

184) Multi-version concurrency control – This provides concurrent access to the data in what may be referred to as transactional memory. In order to prevent reads and writes from seeing inconsistent views, we use locking or multiple copies of each data item. Since a version is a snapshot in time and any change results in a new version, it is possible for writers to roll back their changes while making sure those changes are not seen by others as we proceed.
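
A toy single-threaded model of the version-chain idea (all names and the fixed-size chain are illustrative simplifications; real MVCC adds commit visibility and garbage collection) shows why readers stay consistent and rollback is cheap:

```c
#include <stddef.h>

#define MAX_VERSIONS 8

/* Each write appends a new version stamped with the writing
 * transaction's id; existing versions are never modified in place. */
typedef struct {
    long   txid[MAX_VERSIONS];
    int    value[MAX_VERSIONS];
    size_t count;
} mvcc_item;

static void mvcc_write(mvcc_item *it, long txid, int value) {
    if (it->count < MAX_VERSIONS) {
        it->txid[it->count]  = txid;
        it->value[it->count] = value;
        it->count++;
    }
}

/* A reader with snapshot id 'snapshot' sees the newest version written
 * at or before its snapshot, so later writers never disturb its view. */
static int mvcc_read(const mvcc_item *it, long snapshot, int fallback) {
    int result = fallback;
    for (size_t i = 0; i < it->count; i++)
        if (it->txid[i] <= snapshot) result = it->value[i];
    return result;
}

/* Rollback simply discards the newest (uncommitted) version. */
static void mvcc_rollback(mvcc_item *it) { if (it->count > 0) it->count--; }
```

Because readers only ever follow the chain up to their own snapshot, a writer's in-flight version is invisible until it commits, and dropping it undoes the change without touching any reader.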


Saturday, December 15, 2018

Today we continue discussing the best practice from storage engineering:

175) Storage can be used with any analysis package. As a storage tier, a product only serves the basic Create-update-delete-list operations on the resources.  Anything above and beyond that is in the compute layer. Consequently, storage products do well when they integrate nicely with popular analysis packages.
176) When it comes to integration, programmability is important. How well a machine learning SDK can work with the data in the storage is important not only for the analysis side but also for the storage side. The workloads from such analysis are significantly different from others because they require GPUs for the heavy iterations in their algorithms.
177) There are many data sources that can feed the analysis side, but the storage tier is uniquely positioned as a staging area for most inbound and outbound data transfers. Consequently, the easier it is to trace the data through the storage tier, the more popular it becomes for the analysis side.
178) Although we mentioned documentation, we have not elaborated on the plug-and-play kind of architecture. If the storage tier can act as a man-in-the-middle with very little disruption to ongoing data I/O, it can become useful in many indirect ways that differ from the usual direct storage of data.
179) A key aspect of the storage tier is the recognition of a variety of data formats or, in a specific case, file types. If the file is an image rather than a text file, it does not help with certain storage chores such as deduplication and compression. Therefore, if the storage product provides the ability to differentiate data natively, it will help its acceptance and popularity.
180) File types are equally important for applications. Workloads become segregated by file types. If a storage container has all kinds of files, the ability to work with them as if the file types were independent would immensely benefit separating workloads and providing differential treatment.
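
The native differentiation described in item 179 usually starts with sniffing well-known magic bytes rather than trusting file extensions. A minimal sketch (the type names and the crude text heuristic are illustrative only):

```c
#include <stddef.h>

typedef enum { FT_UNKNOWN, FT_JPEG, FT_PNG, FT_TEXT } filetype;

/* Classify content by its leading magic bytes, so the tier can, for
 * example, skip recompressing already-compressed image formats. */
static filetype sniff(const unsigned char *buf, size_t n) {
    if (n >= 3 && buf[0] == 0xFF && buf[1] == 0xD8 && buf[2] == 0xFF)
        return FT_JPEG;                      /* JPEG SOI marker */
    if (n >= 4 && buf[0] == 0x89 && buf[1] == 'P' &&
        buf[2] == 'N' && buf[3] == 'G')
        return FT_PNG;                       /* PNG signature */
    for (size_t i = 0; i < n; i++)           /* crude text heuristic: */
        if (buf[i] == 0) return FT_UNKNOWN;  /* no NUL bytes allowed  */
    return FT_TEXT;
}
```

Real detectors consult much longer signature tables, but even this much lets deduplication and compression policies branch on content rather than on names.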


Friday, December 14, 2018

Today we continue discussing the best practice from storage engineering:

165) Event Subscription versus appenders: Just like log appenders, there is the possibility of transferring the same event collection result to a large number of destinations simultaneously. These destinations can include files, databases, email recipients and so on.
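
The fan-out can be sketched as a tiny subscription bus (all names here are hypothetical; real appenders would write to files, databases or mail gateways instead of buffers): every subscribed appender receives the same published event.

```c
#include <stddef.h>
#include <string.h>

#define MAX_APPENDERS 4

/* An appender is any callback that delivers one event somewhere:
 * a file, a database, an email gateway, and so on. */
typedef void (*appender_fn)(const char *event, void *dest);

typedef struct {
    appender_fn fn[MAX_APPENDERS];
    void       *dest[MAX_APPENDERS];
    size_t      count;
} event_bus;

static void subscribe(event_bus *bus, appender_fn fn, void *dest) {
    if (bus->count < MAX_APPENDERS) {
        bus->fn[bus->count]   = fn;
        bus->dest[bus->count] = dest;
        bus->count++;
    }
}

/* One event collection result fans out to every destination at once. */
static void publish(event_bus *bus, const char *event) {
    for (size_t i = 0; i < bus->count; i++)
        bus->fn[i](event, bus->dest[i]);
}

/* Example destination: copy the event into a caller-supplied buffer. */
static void buffer_appender(const char *event, void *dest) {
    strcpy((char *)dest, event);
}
```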

166) Optional versus mandatory: Every feature in the storage server that is not on the critical data path is a candidate for being turned off to save resources and improve the data path. This applies equally to components that are not essential. Reducing the number of publishers and subscribers is another example of this improvement.

167) The number of layers encountered in some operations may become large. In such cases, layering can accommodate components that directly talk to lower layers. Layering is not always strict. However, the justification to bypass layers must be well made. This counts towards performance by design.

168) There are time periods of peak workload for any storage product. These products serve annual holiday sales, specific anniversaries and other anticipated high demand. Utilization of the product under such activity is unusually high. While capacity may be planned to meet the demand, there are ways to tune the existing system to extract more performance. Part of these efforts includes switching from being disk-intensive to performing more in-memory computation and utilizing other software products alongside the storage server, such as memcached.

169) When the load is high, it is difficult to turn on the profiler to study bottlenecks. This can, however, be safely done in advance in performance labs. An even easier strategy is to selectively turn off components that are not required and scale out the component that is under duress. A priority list of mitigating steps may be determined prior to periods of heavy load.

170) The monitoring of the Service Level Agreements on the storage server allows us to determine the steps needed to maintain optimum performance of the system. Maintaining standbys for servers and replacements for cluster nodes, or spare hardware either on the chassis or as a plugin, helps with the speedy resolution of outages.

# usage example
https://1drv.ms/w/s!Ashlm-Nw-wnWuCiBH-iYeyyYCrFG

Thursday, December 13, 2018

Today we continue discussing the best practice from storage engineering:


Wednesday, December 12, 2018

Today we continue discussing the best practice from storage engineering:

160) Nativity of registries – User registries, on the other hand, are welcome and can be arbitrary. In such cases, the registries are about the users' own artifacts. However, such registries can be stored just the same way as user data. Consequently, the system does not need to participate in the user registries, and it can earmark designated storage artifacts for this purpose.
161) Sequences – Sequences hold a special significance in the storage server. If there are several actions taken by the storage server, and the actions don't belong to the same group and there is no way to assign a sequence number, then we rely on the names of the actions as they appear on the actual timeline, such as in the logs. When the names can be collected as sequences, we can perform standard query operations on the collections to determine patterns. This kind of pattern recognition is very useful when there are heterogeneous entries and the order in which the user initiates them is dynamic.
162) Event driven framework: Not all user-defined actions are fire-and-forget. Some of them may be interactive, and since there can be any amount of delay between interactions, some form of event-driven framework usually finds its way into the storage server. Storage drivers, background processors and even certain advanced UI controls use an event-driven framework.
163) Tracing: The most important and useful application of sequences is the tracing of actions for any activity. Just like logs, an event-driven framework may provide the ability to trace user actions as different system components participate in their completion. Tracing is very similar to profiling, but there needs to be a publisher-subscriber model. Most user-mode completions of tasks are done with the help of a message queue and a sink.
164) Event Querying: Event-driven frameworks have the ability to operate not just on whole data but also on streams. This makes it very popular to write stream-based queries involving partitions and going over a partition.
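
The pattern queries over collected action names from item 161 can be sketched as an ordered-subsequence test (the action names and function name here are invented for illustration): does a given sequence of actions occur, in order, somewhere along the recorded timeline?

```c
#include <string.h>
#include <stddef.h>

/* Returns 1 if the action names in 'pattern' occur in order (not
 * necessarily adjacently) within the recorded 'timeline', else 0.
 * This is the standard subsequence test applied to action names. */
static int matches_in_order(const char **timeline, size_t n,
                            const char **pattern, size_t m) {
    size_t j = 0;
    for (size_t i = 0; i < n && j < m; i++)
        if (strcmp(timeline[i], pattern[j]) == 0)
            j++;               /* advance only when the next expected
                                  action is seen */
    return j == m;
}
```

Richer pattern recognition (grouping, counting, windowing) builds on the same idea of querying the collected name sequence rather than the raw log lines.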

Tuesday, December 11, 2018

Today we continue discussing the best practice from storage engineering:

155) Pooled external resources: It is not just the host and its operating system resources that the storage product requires; it may also require resources that are remote from the local stack. Since such resources can be expensive, it helps for the storage product to be thrifty by pooling the resources and servicing as many workers in the storage product as possible.
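
The pooling discipline can be sketched with a fixed-size pool of borrowed handles (the struct and function names are hypothetical; real pools would wrap network connections and add locking for concurrent workers):

```c
#include <stddef.h>

#define POOL_SIZE 3

/* A fixed pool of expensive remote handles; workers borrow a slot and
 * return it instead of opening their own connection each time. */
typedef struct {
    int in_use[POOL_SIZE];
} pool;

/* Borrow a free slot, or return -1 when the pool is exhausted, which
 * is the caller's signal to wait or back off. */
static int pool_acquire(pool *p) {
    for (int i = 0; i < POOL_SIZE; i++)
        if (!p->in_use[i]) { p->in_use[i] = 1; return i; }
    return -1;
}

static void pool_release(pool *p, int slot) {
    if (slot >= 0 && slot < POOL_SIZE) p->in_use[slot] = 0;
}
```

The -1 path is the interesting one: capping the pool is what makes the product thrifty, because demand beyond the cap queues instead of multiplying expensive remote resources.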

156) Leveraging monitoring of the host: When an application is deployed to Platform-as-a-Service, it no longer has the burden of maintaining its own monitoring. The same applies to the storage server as a product, depending on where it is deployed. The deeper we go in the stack, including the fabric below the storage server, the more amenable each layer is to monitoring the layer above.

157) Immutables: Throughout the storage server we have to use constants as immutables for everything from identifiers and names to even temporary data. While we can differentiate them with number sequences, it is more beneficial to use strings. Strings not only allow names to be given but also help with prefix and suffix matching. Even enums have names, and we can store them as single instances throughout the system.
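
Storing each name as a single instance is the classic string-interning technique; a minimal sketch (the table size and names are illustrative, and a real server would use a hash table and locking):

```c
#include <string.h>
#include <stdlib.h>

#define INTERN_MAX 64

/* A tiny intern table: every distinct name is stored exactly once, and
 * equal names always come back as the same pointer, so cheap pointer
 * identity replaces repeated strcmp calls across the system. */
static char  *interned[INTERN_MAX];
static size_t intern_count = 0;

static const char *intern(const char *name) {
    for (size_t i = 0; i < intern_count; i++)
        if (strcmp(interned[i], name) == 0)
            return interned[i];           /* already have the single copy */
    if (intern_count < INTERN_MAX) {
        char *copy = malloc(strlen(name) + 1);
        if (!copy) return name;
        strcpy(copy, name);
        interned[intern_count] = copy;
        return interned[intern_count++];
    }
    return name;  /* table full: fall back to the caller's copy */
}
```

Prefix and suffix matching still work on the interned strings, while equality checks anywhere in the server become a single pointer comparison.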

158) System artifacts must have names that are not user-friendly, because they are reserved and should not come in the way of the names that the user wants to use. Moreover, these names have to be somewhat hidden from the users.

159) Registries – When there are collections of artifacts that are reserved for a purpose, they need to be maintained somewhere as a registry. It facilitates lookups. However, registries, like lists, cannot keep piling up. As long as we encapsulate the logic that determines the list, the list itself is redundant because we can execute the logic over and over again. However, this is often hard to enforce as a sound architecture principle.


// Returns the zero-based position of the node 'moved' in the singly
// linked list starting at 'root', or -1 if it is not in the list.
int getIndexOf(node *root, node *moved) {
    int count = 0;
    for (node *tail = root; tail != NULL; tail = tail->next) {
        if (tail == moved) return count;
        count++;
    }
    return -1;
}