Today we continue discussing the best practice from storage engineering:
181) Vectorized execution means data is processed in a pipelined fashion without intermediary results as in map-reduce. The push-based execution means that the relational operators push their results to their downstream operators, rather than waiting for these operators to pull data. If the storage can serve both these models of execution, then they work really well.
182) Data types – their format and semantics have evolved to a wide degree. The pipeline must at least support primitives such as variants, objects and arrays. Support for data stores has become more important than services Therefore datastores do well to support at least some of these primitives in order to cater to a wide variety of analytical workloads.
183) Time-Travel - Time Travel walking through different versions of the data In SQL queries, this is done with the help of AT and BEFORE keywords. Timestamps can be absolute, relative with respect to current time, or relative with respect to previous statements. This is similar to change data capture in SQL Server so that we have historical record of all the changes except that we get there differently
184) Multi-version concurrency control - This provides concurrent access to the data in what may be referred to as transactional memory. In order to prevent reads and writes from seeing inconsistent views, we use locking or multiple copies of each data item.Since a version is a snapshot in time and any changes result in a new version, it is possible for writers to rollback their change while making sure the changes are not seen by others as we proceed.
185) Reliability of data: A storage platform can provide redundancy and availability but it has no control on the content. The data from pipelines may sink into the storage but if the pipeline is not the source of truth, the storage tier cannot guarantee that the data is reliable. Garbage in Garbage out applies to storage tier also.
181) Vectorized execution means data is processed in a pipelined fashion without intermediary results as in map-reduce. The push-based execution means that the relational operators push their results to their downstream operators, rather than waiting for these operators to pull data. If the storage can serve both these models of execution, then they work really well.
182) Data types – their format and semantics have evolved to a wide degree. The pipeline must at least support primitives such as variants, objects and arrays. Support for data stores has become more important than services Therefore datastores do well to support at least some of these primitives in order to cater to a wide variety of analytical workloads.
183) Time-Travel - Time Travel walking through different versions of the data In SQL queries, this is done with the help of AT and BEFORE keywords. Timestamps can be absolute, relative with respect to current time, or relative with respect to previous statements. This is similar to change data capture in SQL Server so that we have historical record of all the changes except that we get there differently
184) Multi-version concurrency control - This provides concurrent access to the data in what may be referred to as transactional memory. In order to prevent reads and writes from seeing inconsistent views, we use locking or multiple copies of each data item.Since a version is a snapshot in time and any changes result in a new version, it is possible for writers to rollback their change while making sure the changes are not seen by others as we proceed.
185) Reliability of data: A storage platform can provide redundancy and availability but it has no control on the content. The data from pipelines may sink into the storage but if the pipeline is not the source of truth, the storage tier cannot guarantee that the data is reliable. Garbage in Garbage out applies to storage tier also.
No comments:
Post a Comment