Cluster computing

Monday, December 17, 2018

185) Reliability of data: A storage platform can provide redundancy and availability but it has no control on the content. The data from pipelines may sink into the storage but if the pipeline is not the source of truth, the storage tier cannot guarantee that the data is reliable. Garbage in Garbage out applies to storage tier also.

186) Cost based optimization: When we are able to determine the cost function for a state of the storage system or the processing of a query, we naturally try to work towards the optimum by progressively decreasing the cost. Some methods like simulated annealing serves this purpose. But the tradeoff is that the cost function is an oversimplification the trend to consistently lower the costs as a linear function does not represent all the parameters of the system. Data mining algorithms may help here better if we can form a decision tree or a classifier that can encapsulate all the logic associated with the parameters from both supervised and unsupervised learning

187) AI/ML pipeline: One of the emerging trends of vectorized execution is its use with new AI/ML packages that are easy to run on GPU based machines and pointing to the data from the pipeline. While trees, graphs and forests are a way to represent the decision-making models of the system, the storage tier can enable the analysis stack with better concurrency, partitions and summation forms.

188) Declarative querying: SQL is a declarative querying language. It works well for database systems and relational data. It’s bridging to document stores and Big Data is merely a convenience. A storage tier does not necessarily participate in the data management systems. Yet the storage tier has to enable querying.

189) Support for transformations from batch to Stream processing using the same Storage Tier: Products like Apache Flume are able to support dual mode processing by allowing transformations to different stores. Unless we have a data management system in place a storage tier does not support SQL query keywords like Partition, Over, On, Before, TumblingWindow. The support for SQL directly from the storage tier using an iteration of storage resources, is rather limited. However if the support for products like Flume is there, then there is no difference to analysis whether the product is a time series database or an Object Storage.

190) A pipeline may use the storage tier as a sink for the data. Since pipelines have their own reliability issues, a storage product cannot degrade the pipeline no matter how many pipelines share the storage.

Cluster computing

Monday, December 17, 2018

No comments:

Post a Comment