Saturday, July 7, 2018

When we iterate over namespaces, buckets and objects, we often have to visit each one of them sequentially. There is no centralized data structure that speeds up these traversals, and the artifacts are not kept in sorted order the way indexes are. These S3 artifacts may, however, be stored over a layer that provides data structures to speed up lookup. One such example is a B-plus tree, which keeps keys in sorted order in its internal nodes so that a search descends to the right leaf range in a few node visits. Another example is a skip list, which links records not only to their immediate neighbors but also across spans that typically grow as powers of two. Such techniques improve lookup because they avoid having to visit each element one after the other, as the sketch below illustrates.
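To make the contrast concrete, here is a minimal sketch of the skip-list idea over a sorted listing of object keys: each lane advances by a power-of-two span, so a lookup touches a logarithmic number of keys instead of visiting every one. The key names and the lookup helper are illustrative assumptions, not part of any S3 API.

```python
def lookup(keys, target, max_level=16):
    """Return the index of target in the sorted list keys, or -1 if absent."""
    if not keys:
        return -1
    pos = 0
    for k in range(max_level, -1, -1):
        step = 1 << k                      # the lane at level k skips 2^k entries
        while pos + step < len(keys) and keys[pos + step] <= target:
            pos += step                    # ride this lane as far as it stays below the target
    return pos if keys[pos] == target else -1

# illustrative object keys, already sorted as a bucket listing would return them
object_keys = sorted("logs/2018/07/{:05d}.json".format(i) for i in range(1000))
print(lookup(object_keys, "logs/2018/07/00042.json"))   # 42
print(lookup(object_keys, "logs/2018/07/99999.json"))   # -1
```

A sequential scan would touch on the order of a thousand keys for the same answer; the power-of-two lanes bring that down to a few dozen comparisons.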
Perhaps we need to put the improvements from these data structures in perspective. Many commercial products are content to iterate over collections, simply by scoping the work and delegating the bulk of data processing to downstream layers. This brings up the salient point that no single layer suffices to deliver the operational gains the user expects; data infrastructure management spans the whole stack. For example, local storage and processing capability are required to reduce expensive communication. If the lower layer performs redundant and repeated requests and responses, that costs much more than what speedier lookup saves. The ability to delegate querying to lower layers is referred to as push-down. Pushing down filters, for instance, saves on communication costs and removes the need to maintain complex data structures at the upper layers, as sketched below.
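As one concrete instance of filter push-down over object storage, S3 Select evaluates a SQL predicate inside the storage layer and streams back only the matching records, so the upper layer never downloads the rest of the object. This is a sketch, assuming a bucket of newline-delimited JSON logs; the bucket, key and field names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="example-logs",                       # assumed bucket name
    Key="logs/2018/07/00042.json",               # assumed object of JSON lines
    ExpressionType="SQL",
    Expression="SELECT s.request_id FROM S3Object s WHERE s.status = 'error'",
    InputSerialization={"JSON": {"Type": "LINES"}},
    OutputSerialization={"JSON": {}},
)

# Only the filtered rows cross the network; the client keeps no index of its own.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```

The alternative, fetching the whole object with get_object and filtering client-side, pays for every byte of the object on the wire regardless of how selective the predicate is.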
Another important class of query processing is distributing queries to the relevant data sources. This Acquisition Query Processing is about determining which node holds the data to be acquired, and about being intelligent in the choice of node so as to reduce cost. Moreover, the data and the attributes needed to answer a query may themselves be chosen to minimize that cost. Query processing can therefore be organized hierarchically, so long as the queries and the data they operate on stay well matched and local to a node. A small routing sketch follows.
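The routing step might look like the following sketch: given a catalog of which node holds which partition and a per-node cost estimate, each query fragment is sent to the cheapest node that already has the data. The catalog, the cost table and the partition names are hypothetical.

```python
CATALOG = {                       # partition -> nodes holding a replica
    "orders/2018-07": ["node-a", "node-c"],
    "orders/2018-06": ["node-b"],
}
COST = {"node-a": 1.0, "node-b": 2.5, "node-c": 0.8}   # e.g. latency or current load

def route(partition):
    """Pick the cheapest node that already stores the requested partition."""
    replicas = CATALOG.get(partition, [])
    if not replicas:
        raise LookupError("no node holds partition %s" % partition)
    return min(replicas, key=lambda node: COST[node])

def plan(query, partition):
    node = route(partition)
    # In a real system this would be an RPC to the chosen node; here we only report the plan.
    return "send %r to %s (data stays local, no partition transfer)" % (query, node)

print(plan("SELECT count(*) WHERE status = 'error'", "orders/2018-07"))
```

Keeping the query next to the data in this way is what makes the hierarchical arrangement pay off: only the small result, not the partition, moves between nodes.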
The distribution of queries brings us to a relevant discussion of network-first versus storage-first design principles and the products suited to each. The insatiable hunger for data storage has spurred the growth of virtualized, elastic, software-defined storage that hides the networking to give users the impression of seemingly unbounded storage, specifically volume, file and object storage. In such cases, the compute layer over this kind of storage is rather general purpose. An alternative form of query processing takes the view that the network itself is the database, with queries pushed all the way to the storage at each of the peers. A peer-to-peer network can be treated as a distributed hash table, which simplifies the delegation to locally oriented, dedicated storage and compute, as sketched below.
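A minimal sketch of that distributed-hash-table view, assuming a consistent-hash ring: every object key maps to the peer that stores it, so a query can be pushed directly to that peer's local storage and compute. The peer names and replica count are illustrative.

```python
import bisect
import hashlib

def _hash(value):
    return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, peers, replicas=64):
        # place several virtual points per peer on the ring for an even spread
        self._points = sorted(
            (_hash("%s#%d" % (peer, i)), peer)
            for peer in peers for i in range(replicas)
        )
        self._hashes = [h for h, _ in self._points]

    def owner(self, key):
        """Return the peer responsible for key: the first point clockwise on the ring."""
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._points)
        return self._points[idx][1]

ring = Ring(["peer-1", "peer-2", "peer-3"])
for key in ("logs/2018/07/00042.json", "logs/2018/07/00043.json"):
    print(key, "->", ring.owner(key))   # the peer the query gets pushed to
```

The ring answers the "which peer has it" question in logarithmic time, which is what lets the network as a whole behave like one addressable store.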
A hybrid solution could be to place the object storage and its associated compute library at each of the participating peers in a massive peer-to-peer network. This arrangement allows the storage to be redundant no matter its size while keeping the compute relatively local, along the lines of the scatter-gather sketch below.
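A sketch of that hybrid, under the assumption that each peer runs the compute library against only its own objects and ships back a small partial result for the coordinator to merge. The peers, their objects and the count_errors function are hypothetical stand-ins for the local compute library.

```python
PEER_OBJECTS = {
    "peer-1": [{"status": "error"}, {"status": "ok"}],
    "peer-2": [{"status": "ok"}],
    "peer-3": [{"status": "error"}, {"status": "error"}],
}

def count_errors(records):
    """The 'compute library' running next to a peer's local object storage."""
    return sum(1 for r in records if r.get("status") == "error")

def scatter_gather():
    # each peer returns a single integer; raw objects never leave the peer
    partials = {peer: count_errors(objs) for peer, objs in PEER_OBJECTS.items()}
    return sum(partials.values()), partials

total, per_peer = scatter_gather()
print(per_peer)   # {'peer-1': 1, 'peer-2': 0, 'peer-3': 2}
print(total)      # 3
```

The objects can be replicated as widely as durability demands, yet each query still runs against whichever copy is local to the peer that receives it.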
