Friday, November 23, 2018

Today we continue enumerating the best practices from storage engineering:
Data Types: There are some universally popular data types such as integer, float, date and string that are easy to recognize even on the wire, and most storage engineering products treat them simply as ways of packing bytes into a byte sequence. However, storage engineering products, including log indexes, are better positioned than any other product to expose richer data types for the data they store, because the data ultimately persists in these products. Conventional relational databases could make sense of the data in their tables only because the data types were registered with them. Not all storage engineering products have that luxury; log indexes, for example, ingest data without user interaction. The ability to infer data types and auto-register them, so as to facilitate richer forms of search and analytics, is not widespread although it holds a lot of promise. Most products work largely by inferring fields rather than types within the data, because fields give users a way to search and analyze with tags that are familiar to them. Yet looking up related fields together as types adds only a little more aggregation on the server side while improving convenience for the users, as the sketch below suggests.
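
A minimal sketch of the kind of type inference and auto-registration described above, assuming a hypothetical in-memory registry and made-up field names; a real log index would persist the registry alongside its other metadata:

```python
from datetime import datetime

def infer_type(value: str) -> str:
    """Guess a data type for a raw field value pulled off the wire."""
    for parse, name in ((int, "integer"), (float, "float")):
        try:
            parse(value)
            return name
        except ValueError:
            pass
    for fmt in ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S"):
        try:
            datetime.strptime(value, fmt)
            return "date"
        except ValueError:
            pass
    return "string"

def auto_register(event: dict, registry: dict) -> None:
    """Record an inferred type for each field so later searches and
    aggregations can treat related fields as typed values."""
    for field, value in event.items():
        registry.setdefault(field, infer_type(str(value)))

registry: dict = {}
auto_register({"status": "200", "latency": "12.5", "ts": "2018-11-23", "msg": "ok"}, registry)
print(registry)  # {'status': 'integer', 'latency': 'float', 'ts': 'date', 'msg': 'string'}
```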

Words: For the past fifty years that we have persisted our data, we have relied on the physical storage being the same for our photos and our documents, and on the logical organization over this storage to separate our content so that we may run or edit it respectively. From file-systems to object storage, this physical storage has always been binary, with both photos and documents appearing as 0s and 1s. However, text content has syntax and semantics that facilitate query and analytics, and these are coming of age. Recently, natural language processing and text mining have made significant strides in helping us classify, summarize, annotate, predict, index and look up content in ways that were previously not possible, and not at the scale at which we save data today, such as in the cloud. Even as we expand our capabilities on text, we still rely on our fifty-year-old tradition of mapping letters to binary sequences instead of the units of organization in natural language, such as words. Our data structures that store words spell out the letters instead of efficiently encoding the words; a sketch of such a word-level encoding appears below. Even when we do read words and set up text processing on that content, we limit ourselves to what others tell us about their content. Words may appear not just in documents; they may appear even in such unreadable things as executables. Neither our current storage nor our logical organization is enough to fully locate all items of interest, and we need ways to expand our definitions of both.
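
A minimal sketch of encoding text at the granularity of words rather than letters, assuming a simple shared dictionary of my own invention; the integer ids could then be persisted far more compactly than the spelled-out letters:

```python
class WordDictionary:
    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = []

    def encode(self, text: str) -> list:
        """Map each word to a small integer id, growing the dictionary as needed."""
        ids = []
        for word in text.split():
            if word not in self.word_to_id:
                self.word_to_id[word] = len(self.id_to_word)
                self.id_to_word.append(word)
            ids.append(self.word_to_id[word])
        return ids

    def decode(self, ids: list) -> str:
        """Recover the original words from their ids."""
        return " ".join(self.id_to_word[i] for i in ids)

d = WordDictionary()
encoded = d.encode("the quick brown fox jumps over the lazy dog")
print(encoded)            # [0, 1, 2, 3, 4, 5, 0, 6, 7]
print(d.decode(encoded))  # the quick brown fox jumps over the lazy dog
```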
