Saturday, November 24, 2018

Today we continue discussing best practices from storage engineering:

92) Words: For the fifty years or so that we have been persisting data, we have relied on the physical storage being the same for our photos and our documents, and on the logical organization over this storage to separate our content so that we may run or edit it as appropriate. From file systems to object storage, this physical storage has always been binary, with both photos and documents appearing as sequences of 0s and 1s. However, text content has syntax and semantics that facilitate query and analytics, which are now coming of age. Recently, natural language processing and text mining have made significant strides in helping us classify, summarize, annotate, predict, index and look up content in ways that were not previously possible, and not at the scale at which we save content today, such as in the cloud. Even as we expand our capabilities on text, we still rely on our fifty-year-old tradition of mapping letters to binary sequences instead of mapping the units of organization in natural language, such as words. Our data structures that store words spell out the letters instead of efficiently encoding the words. Even when we do read words and set up text processing on that content, we limit ourselves to what others tell us about their content. Words may appear not just in documents; they may appear even in such unreadable things as executables. Neither our current storage nor our logical organization is enough to fully locate all items of interest, so we need ways to expand our definitions of both.

93) Inverted Lists: We have referred to collections both in the organization of data and in the queries over data. Another way to facilitate search over the data is to maintain inverted lists that map each term to the storage organizational units in which it appears. This enables faster lookup of the locations where a search term is present. The inverted list may be updated continuously so that it remains consistent with the data. The lists also help to rank terms overall by their number of occurrences.
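As a rough illustration of the inverted list described above, here is a minimal in-memory sketch. The class name `InvertedIndex`, its methods and the simple lowercase tokenizer are assumptions made for this example and do not come from any particular storage product.

```python
from collections import defaultdict
import re


class InvertedIndex:
    """Hypothetical in-memory inverted list: term -> ids of storage units."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> unit ids containing it
        self._counts = defaultdict(int)    # term -> total occurrences
        self._tokens_by_unit = {}          # unit id -> tokens, kept for updates

    @staticmethod
    def _tokenize(text):
        # Assumption: lowercase alphanumeric runs are good enough as terms.
        return re.findall(r"[a-z0-9]+", text.lower())

    def put(self, unit_id, text):
        """Index a unit, replacing stale entries so the list stays consistent."""
        self.remove(unit_id)
        tokens = self._tokenize(text)
        self._tokens_by_unit[unit_id] = tokens
        for term in tokens:
            self._postings[term].add(unit_id)
            self._counts[term] += 1

    def remove(self, unit_id):
        """Drop every entry that was contributed by this unit."""
        for term in self._tokens_by_unit.pop(unit_id, []):
            self._postings[term].discard(unit_id)
            self._counts[term] -= 1
            if self._counts[term] == 0:
                del self._counts[term]
                del self._postings[term]

    def lookup(self, term):
        """Return the sorted ids of units that contain the term."""
        return sorted(self._postings.get(term.lower(), ()))

    def top_terms(self, n):
        """Order terms by occurrence count, most frequent first."""
        ranked = sorted(self._counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:n]
```

Calling `put` again for the same unit replaces its old entries, which is one simple way to keep the list consistent with the data when a unit is overwritten. A real system would more likely persist the postings and update them incrementally.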

94) Deletion policies / Retention period: This must be a configurable setting that helps ensure information is not erased before a policy expires, which in this case is the retention period. At the same time, the retention period could be set as "undetermined" when content is archived and then be given a specific value at the time of an event.
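The retention behavior above can be sketched as follows. This is only an illustration under stated assumptions: the `RetentionPolicy` class, its fields and the use of `None` to stand for an "undetermined" period are hypothetical choices for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RetentionPolicy:
    """Hypothetical retention setting attached to archived content."""

    period: Optional[timedelta] = None     # None stands for "undetermined"
    started_at: Optional[datetime] = None  # when the period starts counting

    def trigger(self, event_time: datetime, period: timedelta) -> None:
        """Give the policy a specific period once the event occurs."""
        self.started_at = event_time
        self.period = period

    def can_delete(self, now: datetime) -> bool:
        """Allow deletion only after a known retention period has expired."""
        if self.period is None or self.started_at is None:
            return False  # undetermined retention: never erase
        return now >= self.started_at + self.period
```

With this shape, content archived with `RetentionPolicy()` cannot be erased at all until `trigger` assigns a period, while content with a fixed period becomes deletable only once that period has elapsed.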

95) Reconfiguration: Most storage products draw on pools of available resources managed by policies that can change from time to time. Whenever the server resources are changed, the change must be made in one operation so that the system presents a consistent view to all usages going forward. Such a system-wide change is a reconfiguration and is often implemented across storage products.
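One way to apply a change "in one operation" is to build the new configuration on the side and then publish it with a single reference swap. The sketch below assumes a single-process server; the `ConfigStore` name, its methods and the example keys are hypothetical.

```python
import threading
from types import MappingProxyType


class ConfigStore:
    """Hypothetical holder that publishes each reconfiguration as one swap."""

    def __init__(self, initial):
        self._lock = threading.Lock()  # serializes writers only
        # A (version, read-only settings) pair that is replaced as a whole.
        self._current = (1, MappingProxyType(dict(initial)))

    def snapshot(self):
        """Return the (version, settings) pair visible at this moment."""
        return self._current

    def reconfigure(self, changes):
        """Apply all changes together; readers see the old or new set only."""
        with self._lock:
            version, old = self._current
            merged = dict(old)
            merged.update(changes)
            self._current = (version + 1, MappingProxyType(merged))
            return version + 1
```

Readers that took a snapshot before the swap keep a consistent older view, and every snapshot taken afterwards sees all of the new settings together, never a mix of the two.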
