Wednesday, October 31, 2018

We were discussing the use of object storage to stash state across workers from applications and cluster nodes on a lease basis in the previous post.
This section is not exclusive to the Ticketing layer and is mentioned here only for convenience. We suggested earlier that tickets may be issued by any existing planning and tracking software in the ticket layer over object storage. There is no dearth of choices for such software in the market; almost any tracking software that can issue tickets will suffice. This allows for easy integration with existing software so that applications may leverage their favorite API to create, update and resolve tickets.
In addition, each ticket may have rich metadata that can be annotated by parties other than the workers who create, update and resolve the tickets. Such metadata may carry any number of fields to improve the tracking information associated with the objects in the object storage. It could also assist with updating the metadata on objects in the object storage. Such metadata brings the users many perspectives other than that of the initiator.
As long as the tickets are open, they can also maintain links to IDs of resources in systems other than the ticket-issuing software. This allows for integration of the ticket layer with several other production software systems and helps navigation from a ticket representing the lease to information in other systems that may also have additional state for the workers that opened the tickets. This gives a holistic view of all things relevant to the workers merely from their read-write data path.
The schema for arranging integrations this way with existing tickets is similar to a snowflake pattern. Each link represents a dimension, and the ticket is therefore a representation of all that information and not just itself. This pattern also keeps the ticket independent of any other system that can go down. The links may only be used if they are reachable. Since the ticket itself holds enough information locally about the resource tracking, any external links are nice to have and not mandatory.
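As a concrete illustration, a ticket in this layer could be modeled as below. This is a minimal sketch with hypothetical field names, not a schema mandated by any particular tracking software; the point is that the local fields are authoritative while the links are optional dimensions in the snowflake sense:

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Ticket:
    # Local fields carry enough information for resource tracking on their own.
    ticket_id: str
    worker_id: str
    resource_path: str                                        # namespace/bucket/prefix leased to the worker
    state: str = "open"                                       # open -> updated -> resolved
    metadata: Dict[str, str] = field(default_factory=dict)    # annotations from any party
    links: Dict[str, str] = field(default_factory=dict)       # external system name -> resource ID

    def resolve_link(self, system: str) -> Optional[str]:
        # Links are best-effort; the ticket remains usable when a linked system is down.
        return self.links.get(system)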

This kind of extensibility allows the ticket layer to grow without disrupting the path above or below. Since the workers remain unaffected, there is no limit to the extensibility in this layer.

Tuesday, October 30, 2018

We were discussing the use of object storage to stash state across workers from applications and cluster nodes on a lease basis in the previous post. The object storage acts as an aggregator across workers in this design. It brings elasticity from the virtualized storage customized to individual workers. There is no need for capacity planning for any organization, or for paying separately for file-system, block or other forms of storage, since the object storage not only unifies such storage but also supports billing with the availability of detailed studies of worker usage. The performance trade-off for workers is small: they are only required to replace their use of conventional storage with S3 access. They become facilitators for moving the compute layer beyond virtual machines and disks to separate compute and storage platforms, the likes of which can generate a new wave of commodity hardware suitable for both on-premise and datacenter deployments.
The ticketing and leasing system proposed in this post need not serve just applications and clusters. It can also serve other local platform providers that want to help ease the migration of workloads from local and unmanaged storage.
The use of a leasing service facilitates object lifecycle management as well as workers could do it individually. The work can be offloaded to workers, but the notion is that the maintenance activities move from organization-owned to self-managed with this solution. Tickets may be implemented with any software as long as they map workers to resources in the object storage.
Moreover, not all requests need to reach the object storage. In some cases, a web ticket may use temporary storage from hybrid choices. The benefits of using a web ticket include saving bandwidth, reducing server load, and improving request-response time. If a dedicated content store is required, the ticketing and the serving are typically encapsulated into a content server. This is quite the opposite paradigm from using object storage and replicated objects to serve the content directly from the store. The distinction here is that there are two layers of functions: the first is the ticket layer, which solves the lifecycle management of storage resources; the second comprises the concerns of the actual storage, which we mitigate with the help of object storage. We will call this the storage engine and will get to it shortly.

The ticket would be the only ID that the worker needs to track all of its storage resources.

Monday, October 29, 2018

We were discussing the use of object storage to stash state across workers from applications and cluster nodes on a lease basis in the previous post. The object storage acts as an aggregator across workers in this design. It brings elasticity from the virtualized storage customized to individual workers. There is no need for capacity planning for any organization, or for paying separately for file-system, block or other forms of storage, since the object storage not only unifies such storage but also supports billing with the availability of detailed studies of worker usage. The performance trade-off for workers is small: they are only required to replace their use of conventional storage with S3 access. They become facilitators for moving the compute layer beyond virtual machines and disks to separate compute and storage platforms, the likes of which can generate a new wave of commodity hardware suitable for both on-premise and datacenter deployments.
The ticket service works closely with issue management software, and traditionally both have been long-standing products in the marketplace. Atlassian Jira is planning and tracking software that generates tickets for managing and tracking work items. The ticket layer is well positioned to use such existing software directly. This document does not repeat the internals of tracking management software and instead focuses on its use with object storage so that newer workloads can use a storage platform rather than their disks and local storage.
The ticketing and leasing system proposed in this document need not serve just applications and clusters. It can also serve other local platform providers that want to help ease the migration of workloads from local and unmanaged storage.

Sunday, October 28, 2018

Object storage is perceived as backup and tertiary storage. However, in the next few posts we argue that object storage can be a persistent thread-local storage for all workers in any system so that there is never any loss of state when a worker disappears. While files have traditionally been the storage of choice for workers, we argue that this is just a matter of access and that file protocols or http access serve just the same against a file-system-enabled object storage. Moreover, not all data needs to be written deep into the object storage at once. With the help of a cache layer discussed earlier, we can even allow workers higher performance than working with remote storage. The requirements for object storage need not even change while the reads and writes from the workers are handled.
Many applications maintain concurrent activities. Large-scale query processing on big data requires intermediate storage for its computations. Similarly, cluster computing requires storage for the processing of its nodes. Traditionally, these nodes have been sharing volumes or maintaining local file storage. However, most disk operations are already written off as expensive as computing moves more into memory. A large number of workers are commodity workers. They don’t need performance so high that object storage cannot step in and fill the role. Moreover, each worker or node gets its own set of objects in the object storage, which can be considered as shared-nothing as its own disks. All they need is for their storage to be in the cloud and managed so that they are never limited by their disks.
That is how this Ticket Layer positions itself. It offers tracking and leasing of universal storage that does away with disks for nodes in a cluster and workers in an application. Several forms of data access can be supported in addition to file-system protocols and S3 access in an object storage. The ticketing system is a planning and tracking software management system that generates tickets and leases for remote storage on a workload-by-workload basis so that the cloud’s elasticity can be brought to individual workers even within the same node or application.

Saturday, October 27, 2018

We were discussing query over object storage. We now discuss query rewrites. A query describing the selection of entries with the help of predicates does not necessarily have to be bound to structured or unstructured query languages. Yet the convenience and universal appeal of one language may dominate another. Therefore, whether the query language is agnostic or predominantly biased, it can be modified or rewritten to suit the needs of the storage stacks described earlier. This implies that not only the data types but also the query commands may be translated to suit the delegations to the storage stacks. If the commands can be honored, then they will return results, and since the resolver in the virtualization layer may check all of the registered storage stacks, we will eventually get the result if the data matches.
Delegation doesn’t have to be the only criterion for the virtualization layer. Both the administrator and the system may maintain rules and configurations with which to locate the store for the data. More importantly, the rules can be both static and dynamic. The former refers to rules that are declared ahead of the launch of the service, which the service merely loads in. The latter refers to evaluations that dynamically assign queries to stores based on classifiers and connection attributes. There is no limit to the attributes with which the queries are assigned to stores, and the evaluation is done with logic in a module that can be loaded and executed.
Query assignment logic may be directed towards the store that has the relevant data. This is not the only purpose of the logic. Since there can be many stores with the same data for availability, the dynamic assignment can also function as a load balancer. In some cases, we can go beyond load balancing to have resource pools that have been earmarked with different levels of resources, and queries may be assigned to these pools. Usually resource pools refer to compute, and certain queries may require more resources for computation than others even when the data is made available. Moreover, it is not just about resource-intensive queries; it is also about the priority of queries, as some may have higher priority than others. Therefore, the logic for dynamic assignment of queries can serve more than one purpose.
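A minimal sketch of such assignment logic follows. The Store and Query shapes and the rule format are hypothetical; the point is only that static rules load first, and the dynamic part filters by data type, balances across replicas, and honors priority with resource pools:

import random
from dataclasses import dataclass

@dataclass
class Store:
    name: str
    data_types: set
    pool: str = "default"       # resource pool; "large" is earmarked for heavy queries

@dataclass
class Query:
    text: str
    data_type: str
    priority: str = "normal"

def assign(query, static_rules, stores):
    # Static rules are declared ahead of the launch of the service and merely loaded in.
    for keyword, names in static_rules.items():
        if keyword in query.text:
            stores = [s for s in stores if s.name in names]
            break
    # Dynamic assignment: classify by data type, then balance across replicas.
    candidates = [s for s in stores if query.data_type in s.data_types]
    if not candidates:
        return None
    wanted_pool = "large" if query.priority == "high" else "default"
    preferred = [s for s in candidates if s.pool == wanted_pool]
    return random.choice(preferred or candidates)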

Friday, October 26, 2018

The notion that there can be hybrid storage for data virtualization does not square with the positioning of object storage as a limitless single storage. However, customers choose on-premise or cloud object storage for different reasons. There is no question that some data is more suited to be in the cloud while other data can remain on-premise. Even if the data were within the same object storage but in different zones, the copies would have different addresses and the resolver would need to resolve to one of them. Therefore, whether the resolution is between zones, between storages or between storage types, the resolver has to deliberate over the choices.
The nature of the query language determines the kind of resolving that the data virtualization needs to do. In addition, the type of storage that the virtualization layer spans also depends on the query language. LogParser was able to enumerate unstructured data, but it is definitely not mainstream for unstructured data. Commands and scripts have been used for unstructured data in most places and they have worked well. Solutions like SaltStack enable the same commands and scripts to run over files in different storages even when they are on different operating systems. SaltStack used ZeroMQ for messaging, and the commands would be executed by agents on those computers. The SaltStack master resolved the minions on which the commands and scripts would be run. This is no different from a data virtualization layer that needs to resolve the types to different storages.
In order to explain the difference between data virtualization over structured and unstructured storage types, we look at metadata in structured storage. All data types used are registered. Whether they are system built-in types or user-defined types, the catalog helps with the resolution. The notion that JSON documents don’t have a rigid type and that the document model is extensible is not lost on the resolver. It merely asks for an inverted list of documents corresponding to a template that can be looked up. Furthermore, the logic to resolve JSON fields to documents does not necessarily have to be within the resolver. The inverted list and the lookup can be provided by the index layer while the resolver merely delegates to it. Therefore, the resolver can span different storage types as long as the resolution is delegated to the storage types. The query languages for structured and unstructured data may or may not be unified into a universal query language, but that is beside the consideration that we can treat different storage types as different stacks and yet have data virtualization over them.
Query rewrites have not been covered in the topic above. A query describing the selection of entries with the help of predicates does not necessarily have to be bound to structured or unstructured query languages. Yet the convenience and universal appeal of one language may dominate another. Therefore, whether the query language is agnostic or predominantly biased, it can be modified or rewritten to suit the needs of the storage stacks described earlier. This implies that not only the data types but also the query commands may be translated to suit the delegations to the storage stacks. If the commands can be honored, then they will return results, and since the resolver in the virtualization layer may check all of the registered storage stacks, we will eventually get the result if the data matches.

Thursday, October 25, 2018

Many systems use the technique of reflection to find out whether a module has a particular data type. An object storage does not have any such mechanism. However, there is nothing preventing the data virtualization layer from maintaining a registry in the object storage, or from loading the data-type-defining modules themselves, so that those modules can be introspected to determine whether there is such a data type. This technique is common in many runtimes. Therefore, the intelligence to add to the data virtualization layer can draw inspiration from these well-known examples.
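A sketch of such a registry-driven lookup is below, assuming a registry mapping type names to module names is kept in (or loaded from) the object storage; the registry contents and module names here are hypothetical:

import importlib

REGISTRY = {"Point": "geo.types", "LogEntry": "logs.types"}   # hypothetical registry contents

def resolve_type(name):
    # Load the module named in the registry and introspect it for the type,
    # mimicking reflection where the object storage itself offers none.
    module_name = REGISTRY.get(name)
    if module_name is None:
        return None
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, name, None)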
Most implementations of such a layer take the form of a micro-service. This helps modular design as well as testability. Moreover, the micro-service provides http access for other services, which has proven helpful from the point of view of production support. The services that depend on the data virtualization can also go directly to the object storage via http. Therefore, the data virtualization layer merely acts like a proxy. The proxy can be transparent, thin or fat. It can even play the man in the middle, and it requires very little change from the caller.
The implementation of a service for data virtualization is not uncommon, both on-premise and in the cloud. In general, most applications benefit from being modular. Service-oriented architecture and micro-services play important roles in making applications modular. This is in fact an embrace of cloud technologies, as the modules are hosted on their own virtual machines and containers. With the move towards the cloud, we get the added benefits of elasticity and billing on a pay-per-use model that we otherwise would not have. The benefits of managed instances, hosts and containers are never lost on application modularity. Therefore, data virtualization can also be implemented as a service that spans both on-premise and cloud storage.
The data virtualization service is also opt-in, because the consumers can go directly to the storage. The callers either use the proxy or they don’t, so long as they take care of resolving their queries with fully qualified namespaces and data types. It is merely a convenience and very much suited to encapsulation in a module. The testability of the data virtualization also improves significantly.
Unlike other services, data virtualization depends exclusively on the nature of data retrieval from the query layer. While data transformation services such as Extract-Transform-Load prepare data in forms that are more suitable for consumer services, there is no read-write operation here. The purpose of fully resolving data types so that queries can be executed depends exclusively on the queries. This is why the language of the queries is important. If it were something standard like the SQL query language, then it becomes helpful to bridge that language over unstructured data. LogParser came very close to that by viewing the data as an enumerable, but the reason its language became highly restrictive was that everything needed to be LogParser-friendly. If the types mentioned in LogParser queries did not match the data from the Component Object Model (COM) interfaces, LogParser would not be able to understand the query. If there were a data virtualization layer within LogParser that mashed up the data type to the data set by using one or more interfaces, it would have expanded the types of queries that could be written with LogParser. Here, we utilize the same concept for data virtualization over object storage.

Wednesday, October 24, 2018

We continue discussing data virtualization over object storage. Data virtualization does not need to be a thin layer that is a mashup of different object storages. Even if it were just a gateway, it could allow customizations from users or administrators for workloads. However, it doesn’t stop there, as described in this write-up: https://1drv.ms/w/s!Ashlm-Nw-wnWt0XqQrQQTQBtuKYh. This layer can be intelligent enough to map data types to storage. Typically, queries specify the data location either as part of the query with fully resolved data types or as part of the connection in which the queries are made. In a stateless request-response mode, this can be part of the request. The layer resolving all of the request can determine the storage location. Therefore, the data virtualization can be an intelligent router.
Notice that both the user as well as the system can determine the policies for the data virtualization. If there is only one location for a data type, then the location is known. If the same data type is available in different locations, then the determination can be made by the user with the help of rules and configurations. If the user has not specified, then the system can choose to determine the location for the data type. The rules are not necessarily maintained by the system in the latter case; it can simply choose the first available store that has the data type.
Many systems use the technique of reflection to find out whether a module has a particular data type. An object storage does not have any such mechanism. However, there is nothing preventing the data virtualization layer from maintaining a registry in the object storage, or from loading the data-type-defining modules themselves, so that those modules can be introspected to determine whether there is such a data type. This technique is common in many runtimes. Therefore, the intelligence to add to the data virtualization layer is pretty much known.
Most implementations of such a layer take the form of a micro-service. This helps modular design as well as testability. Moreover, the micro-service provides http access for other services, which has proven helpful from the point of view of production support. The services that depend on the data virtualization can also go directly to the object storage via http. Therefore, the data virtualization layer merely acts like a proxy. The proxy can be transparent, thin or fat. It can even play the man in the middle, and it requires very little change from the caller.

Tuesday, October 23, 2018

Let us now discuss data virtualization for query over object storage. The destination of queries is usually a single data source, but query execution, like any other application, retrieves data without requiring technical details about the data such as where it is located. Location usually depends on the organization that has a business justification for growing and maintaining the data. Not everybody likes to dip right into the data lake at the beginning of a venture, as most organizations do. They have to grapple with their technical needs and changing business priorities before the data and its management solution can be called stable.
Object storage, unlike databases, allows incredible, almost limitless storage, with sufficient replication groups to cater to organizations that need their own copy. In addition, the namespace-bucket-object hierarchy allows the level of separation that organizations need.
The role of object storage in data virtualization is only at the physical storage level, where we determine which site or zone to get the data from. A logical data virtualization layer that knows which namespace or bucket to go to within an object storage does not come straight out of the object storage. A gateway would serve that purpose. The queries can then choose to run against a virtualized view of the data by querying the gateway, which in turn would fetch the data from the corresponding location.
There are many levels of abstraction. First, the destination data source may be within one object storage. Second, the destination data source may be in a different object storage. Third, the destination may be across different storages, such as an object storage and a cluster file system. In all these cases, the virtualization logic resides external to the storage and can be written in different forms. Commercial products of this kind, such as Denodo, are proof that this is a requirement for businesses. Finally, the logic within the virtualization can be customized so that queries can make the most of it. We refer to the write-up above for the usages of data virtualization, while here we discuss the role of object storage.

Monday, October 22, 2018

We were discussing search with object storage. The language for the query has traditionally been SQL. Tools like LogParser allow SQL queries to be executed over enumerables. SQL has been supporting user-defined operators for a while now. These user-defined operators help with additional computations that are not present as builtins. In the case of relational data, these have generally been user-defined functions or user-defined aggregates. With the enumerable data set, the SQL is somewhat limited for LogParser. Any implementation of a query execution layer over the object storage could choose to allow or disallow user-defined operators. These enable computation on, say, user-defined data types that are not restricted by the system-defined types. Such types have been useful with, say, spatial co-ordinates or geographical data for easier abstraction and simpler expression of computational logic. For example, vector addition can be done with user-defined data types and user-defined operators.
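The vector example can be sketched with operator overloading, which plays the same role as a user-defined operator in a query layer; the Vector type here is purely illustrative:

from functools import reduce

class Vector:
    # A user-defined data type; __add__ stands in for a user-defined operator.
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __add__(self, other):
        return Vector(self.x + other.x, self.y + other.y)

# An aggregate such as SUM over a column of vectors reduces with the operator:
points = [Vector(1, 2), Vector(3, 4), Vector(5, 6)]
total = reduce(lambda a, b: a + b, points)    # Vector(9, 12)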
Aggregation operations have the benefit that they can support both batch- and streaming-mode operations. These operations can therefore operate on large datasets because they view only a portion at a time. Furthermore, the batch operation can be parallelized over a number of processors. This has generally been the case with big data and cluster-mode operations. Until recently, streaming-mode operations were not so common with big data. However, streaming conversions of summation-form batch processing are now facilitated directly out of the box by streaming algorithm packages from some public cloud providers. This means that both batch and stream processing can operate on unstructured data. Some other forms of processing are included here.
The language may support search expressions as well as user-defined operators. Structured query language works for structured data. Unstructured documents are best served by search operators and expressions. Examples include the search expressions and piped operations used with logs. With the broader inclusion of statistical and machine learning packages under its umbrella, a universal query language now tries to broaden the breadth of existing query languages and standardize them.

Sunday, October 21, 2018



There is also another benefit to the full-text search. We are not restricted to its import into any one form of storage. Object storage can serve as the source for all databases, including graph databases. There is generally a lot of preparation when data is exported from relational tables and imported into graph databases, even when, theoretically, all the relations in the relational tables are merely edges to the nodes representing the entities. Graph databases are called natural databases because the relationships can be enumerated and persisted as edges, but it is this enumeration that takes some iterations. Data extract-transform-load operations have rigorous packages in the relational world, relying largely on consistency checks, but the same is not true for graph databases. Therefore, each operation requires validation, more so when an organization is importing data into a graph database without precedent. The indexer documents overcome the import because the data does not need to be collected. Inverted lists of documents are easy to compare for intersection and for left and right differences, and they contribute to edge weights directly when the terms are treated as nodes. The ease with which data can be viewed as nodes and edges makes the import easier. In this way, the object storage for the indexer provides convenience to destinations such as graph databases, where the inverted list of documents may be used in graph algorithms.
Full-text search is not the only stack. Another search stack can be added to this object storage. For example, an iterator-based, .NET-style standard query operator may also be provided over this object storage. Even query tools like LogParser that open up a COM interface to the objects in the storage can be used. Finally, a comprehensive and dedicated query engine that studies, caches and replays queries is possible, since the object storage does not restrict the search layer.
There are a few techniques that can improve query execution. The degree of parallelism helps the query execute faster by partitioning the data and invoking multiple threads. While increasing these parallel activities, we should be careful not to have too many; otherwise the system can get into thread-thrashing mode. The rule of thumb for increasing the DoP is that the number of threads is one more than the number of processors, and this refers to operating-system threads. There is no such limit for lightweight workers that do not have contention.
Caching is another benefit to query execution. If a query repeats itself over and over again, we need not perform the same calculations to determine the least cost of serving it. We can cache the plan and the costs and reuse them.
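Both ideas can be sketched briefly; the plan candidates and their costs below are stand-ins, not a real optimizer:

import os
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Rule of thumb from above: number of OS threads = processors + 1.
executor = ThreadPoolExecutor(max_workers=(os.cpu_count() or 4) + 1)

def run_partitioned(work_fn, partitions):
    # One task per data partition; results are gathered in order.
    return list(executor.map(work_fn, partitions))

@lru_cache(maxsize=1024)
def least_cost_plan(query_text):
    # Planning is deterministic for identical query text, so the chosen
    # plan and its cost can be cached and reused when the query repeats.
    candidates = {"full-scan": len(query_text) * 10, "index-lookup": len(query_text)}
    return min(candidates, key=candidates.get)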


Saturday, October 20, 2018

We were discussing full-text search with object storage. Lucene indexes are inverted indexes. An inverted index lists the documents that contain a term and stores statistics about terms in order to make term-based search more efficient. While Lucene itself is available in various programming languages, there is no restriction on taking the inverted index from Lucene and using it in any way as appropriate.
The inverted indexes over object storage may not be as performant as query execution over relational tables in a SQL database, but they fill the space of enabling search over the storage. Spotlight on macOS and Google's PageRank algorithm over internet documents also use tokens and lookups. Moreover, by recognizing the generic organization of inverted indexes, we can apply any grouping, ranking and sorting algorithm we like. Those algorithms are then independent of the organization of the index in the object storage, and each one can use the same index.
For example, the PageRank algorithm can be selectively used to filter the results. The nodes are the terms and the edges are the inverted lists of documents that contain two nodes. Since we already calculate the marginals, both for the nodes and for the edges, we already have a graph on which to calculate PageRank. PageRank can be found as a sum of two components: the first is a damping factor; the second is the sum of the PageRanks of the adjacent vertices, each weighted by the inverse of the out-degree of that vertex. This is said to correspond to the principal eigenvector of the normalized inverted-document-list matrix.
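A minimal sketch of that computation over the term graph follows; adjacency here is a plain dict from a term to the set of terms it co-occurs with, ignoring the edge weights that the inverted lists could supply:

def pagerank(adj, d=0.85, iterations=50):
    # adj: term -> set of neighboring terms
    n = len(adj)
    pr = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        nxt = {}
        for v in adj:
            # Damping term plus the sum over in-neighbors, each weighted
            # by the inverse of that neighbor's out-degree.
            incoming = sum(pr[u] / len(adj[u]) for u in adj if v in adj[u])
            nxt[v] = (1 - d) / n + d * incoming
        pr = nxt
    return pr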
Full-text search facilitates text mining just the same way a corpus does. While corpus documents are viewed as bags of words, the indexer represents a collection of already-selected keywords for each indexed document. Both are input to text mining algorithms. Neural nets can calculate the mutual information between terms regardless of the source and classify them with a softmax classifier. This implies that the indexer document can allow user input to be added or collected as fields in the index document, which can then be treated the same way as the corpus documents.
There is also another benefit to the full-text search. We are not restricted to its import into any one form of storage. Object storage can serve as the source for all databases, including graph databases. There is generally a lot of preparation when data is exported from relational tables and imported into graph databases, even when, theoretically, all the relations in the relational tables are merely edges to the nodes representing the entities. Graph databases are called natural databases because the relationships can be enumerated and persisted as edges, but it is this enumeration that takes some iterations. Data extract-transform-load operations have rigorous packages in the relational world, relying largely on consistency checks, but the same is not true for graph databases. Therefore, each operation requires validation, more so when an organization is importing data into a graph database without precedent. The indexer documents overcome the import because the data does not need to be collected. Inverted lists of documents are easy to compare for intersection and for left and right differences, and they contribute to edge weights directly when the terms are treated as nodes. The ease with which data can be viewed as nodes and edges makes the import easier. In this way, the object storage for the indexer provides convenience to destinations such as graph databases, where the inverted list of documents may be used in graph algorithms.

Friday, October 19, 2018

We were discussing full-text search with object storage. When users want to search, security goes out the window. The Lucene index documents are not secured via access control lists. The user is really looking to cover the entire haystack and not get bogged down by disparate collections and the need to repeat the query on different indexes.
Although S3 supports adding access control descriptions to the objects, those are for securing the objects from other users, not from the system. Search is a system-wide operation. Blacklisting any of the hierarchy artifacts is possible, but it leaves the onus on the user.
This pervasive full-text search has an unintended consequence: users with sensitive information in their objects may divulge it to search users, because those documents will be indexed and will match search queries. This has been noticed in many document libraries outside object storage. There the solution did not involve blacklisting; instead it involved informing the users that the library is not the place to save sensitive information. We merely follow the same practice here.
The use of object storage with Lucene as a full-text solution for unstructured data also comes with many benefits other than search. For example, the fields extracted from the raw data together form the input for other analysis.
The tags generated on the metadata via supervised or unsupervised learning also form useful information for subsequent queries. When index documents are manually classified, there is no limit to the number of tags that can be added. Since the index documents and the tags are utilized by the queries, the user gets more and more predicates to use.
The contents of the object storage do not always represent text. They can come in different formats and file types. Even when they do represent text, they may not always be clean. Consequently, a text pre-processing stage is needed prior to indexing. Libraries that help extract text from different file types may be used, as may stemmers and term canonicalizers.
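A crude sketch of such a pre-processing pass is below; a library stemmer (for example, a Porter stemmer) would normally replace the toy suffix stripping here:

import re

STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to"})

def preprocess(text):
    # Canonicalize raw text before indexing: lowercase, strip punctuation,
    # drop stopwords, and apply a crude suffix stemmer.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    def stem(t):
        for suffix in ("ing", "ed", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                return t[: -len(suffix)]
        return t
    return [stem(t) for t in tokens if t not in STOPWORDS]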
The index documents are also unstructured storage. They can be saved in and exported from the object storage. The contents of an index document and the fields it retains are readable using the packages with which they were created. They are not proprietary per se if we can read the fields in the index documents and store them directly in object storage as JSON documents. Most of the fields in an index document are enumerated by the doc.toString() method. It is easy to take the string collection and save it as text files if we want to make the terms available to other applications. The information in the various file extensions of the Lucene index, such as term infos, term infos index, term vector index, term vector documents and term vector fields, can thus be converted to any form we like. Consequently, we are not limited to one form of search over the metadata.
Lucene indexes are inverted indexes. An inverted index lists the documents that contain a term and stores statistics about terms in order to make term-based search more efficient. While Lucene itself is available in various programming languages, there is no restriction on taking the inverted index from Lucene and using it in any way as appropriate.
A Lucene index contains a sequence of documents, each of which is a sequence of fields. The fields are named sequences of terms, and each term is a string from the original text that was indexed. The same term may appear in different fields but under different names. The index may also have partitions called segments. Each segment is a fully independent index which can be searched separately. New segments may be created, or existing segments may be merged. This organization is reusable in all contexts of using inverted indexes. Any external format for exporting these indexes may use a similar organization.
The inverted indexes over object storage may not be as performant as query execution over relational tables in a SQL database, but they fill the space of enabling search over the storage. Spotlight on macOS and Google's PageRank algorithm over internet documents also use tokens and lookups. Moreover, by recognizing the generic organization of inverted indexes, we can apply any grouping, ranking and sorting algorithm we like. Those algorithms are then independent of the organization of the index in the object storage, and each one can use the same index.

Thursday, October 18, 2018

We were discussing full-text search on object storage. When users want to search, security goes out the window. The Lucene index documents are not secured via access control lists. They don’t even record any information regarding the user in the fields of the document. The user is really looking to cover the entire haystack and not get bogged down by disparate collections and the need to repeat the query on different indexes.
Even the documents in the object storage are not secured so as to hide them from the indexer. Although S3 supports adding access control descriptions to the objects, those are for securing the objects from other users, not from the system. Similarly, the buckets and namespaces can be secured from unwanted access, and this is part of the hierarchy in object storage. However, the indexer cannot choose to ignore any part of the hierarchy, because that would put the burden on the user to decide what to index. This is technically possible by blacklisting any of the hierarchy artifacts, but it is not a good business policy unless the customer needs advanced controls.
This has an unintended consequence: users with sensitive information in their objects may divulge it to search users, because those documents will be indexed and will match search queries. This has been noticed in many document libraries outside object storage. There the solution did not involve blacklisting; instead it involved informing the users that the library is not the place to save sensitive information. We merely follow the same practice here.
Finally, we mention that users are not required to use different search queries against different document collections. All queries can target the same index, generated for all document collections, at different locations.

Wednesday, October 17, 2018

We were discussing full-text search over object storage. As we enumerate some of the advantages of separating the object index from the object data, we realize that the metadata is what we choose to keep from the Lucene-generated fields. By enhancing the fields of the documents added to the index, we improve not only the queries but also the collection on which we can perform correlations. If we start out with only a few fields, the model for statistical analysis has only a few parameters. The more fields we add, the better the correlations. This is true not just for queries on the data and analysis over the historical accumulation of data, but also for data mining and machine learning methods.
We will elaborate on each of these. Consider the time- and space-dimension queries that are generally required in dashboards, charts and graphs. These queries need to search over data that has accumulated and might be quite large, often exceeding terabytes. As the data grows, the metadata becomes all the more important, and its organization can now be tailored to the queries instead of relying on the organization of the data. If there is a need to separate online ad hoc queries on current metadata from more analytical and intensive background queries, then we can choose different organizations of the information in each category so that they serve their queries better.
Let us also look at data mining techniques. These include clustering techniques and rely largely on adding additional tags to existing information. Even though we are searching Lucene index documents that may already have fields, there is nothing preventing these techniques from classifying and coming up with newer labels which can be persisted as fields. Therefore, unlike the read-only nature of the queries mentioned earlier, these are techniques where one stage of processing may benefit from the read-write of another. Data mining algorithms are computationally heavy compared to regular queries for grouping, sorting and ranking. Even computing the similarity between an index document and a cluster might not be cheap. That is why it helps to have the results of one pass of processing benefit another, especially if the documents have not changed.
Now let us take a look at machine learning, which is by far more involved with computations than all of the above. In these cases, we again benefit from having more and more data to process. Since machine learning methods are implemented in packages from different sources, there is more emphasis on long-running tasks written in environments different from the data source. Hence, making all of the data available in different compute environments becomes more important. If it helps performance in these cases, the object storage can keep replicas of the data in different zones.

Tuesday, October 16, 2018

In addition to serving as a regular time-series database, object storage can participate in text-based search, both as full-text search and with text mining techniques. One example of full-text-based search is demonstrated with Lucene here: https://github.com/ravibeta/csharpexamples/blob/master/SourceSearch/SourceSearch/SourceSearch/Program.cs. It is better written in Java against a file-system-enabled object storage; such an example is available at https://github.com/ravibeta/JavaSamples/blob/master/FullTextSearchOnS3.java
Text mining techniques, including machine learning techniques, are demonstrated with scikit Python packages; sample code is at https://github.com/ravibeta/pythonexamples/blob/master/
The use of indexable key-values in full-text search deserves special mention. On one hand, Lucene has ways to populate meta keys and meta values, which it calls fields, in the indexer documents. On the other hand, each of the objects in the bucket can store not only the raw document but also the meta keys and meta values. This suggests keeping the raw data and the indexer fields together in the same object. When we search over the objects enumerated from the bucket, we no longer use the actual object and thus avoid searching through large objects. Instead we search the metadata, and we list only those objects where the metadata has the relevant terms. However, we make an improvement to this model by separating the index objects from the raw objects. The raw objects are then never touched when the metadata changes. Similarly, the indexer objects can be deleted and recreated independently of the raw objects so that we can re-index at different sites. Also, keeping the indexer documents as key-value entries reduces space and keeps them together so that a greater range of objects can be searched. This technique has been quite popular with many indexes.
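The separation of index objects from raw objects can be sketched with the S3 API; the bucket name and the index/ prefix are hypothetical conventions:

import json
import boto3   # assumes S3-style credentials are configured

s3 = boto3.client("s3")
BUCKET = "corpus"    # hypothetical bucket holding both raw and index objects

def write_index(key, fields):
    # The indexer fields live in a sibling object; the raw object is never
    # touched when metadata changes, and indexes can be rebuilt freely.
    s3.put_object(Bucket=BUCKET, Key="index/" + key + ".json",
                  Body=json.dumps(fields).encode("utf-8"))

def search(term):
    # Search touches only the small index objects, not the large raw data.
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix="index/")
    hits = []
    for entry in listing.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=entry["Key"])["Body"].read()
        if term in json.loads(body).values():
            hits.append(entry["Key"])
    return hits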

Monday, October 15, 2018

The object storage as a time-series data-store
Introduction: Niche products like log indexes and time-series databases often rely on unstructured storage. While they are primarily focused on their respective purposes, they have very little need to focus on storage. Object storage serves as a limitless, no-maintenance store that not only replaces the file-system as the traditional store for these products but also brings the best of storage best practices. As a storage tier, object storage is suited not only to take the place of other storage used by these products but also to bring those product features into the storage tier. This article examines the premise of using object storage to participate directly as the base tier of log index products.
Description: We begin by examining the requirements for a log index store.
First, the log indexes are maintained as a time-series database. The data is kept in the form of cold, warm and hot buckets, which denote the progression of the entries made to the store; a sketch of this tiered naming appears after this list. Each index is maintained from a continuous rolling input of data that fills one bucket after another. Each entry is given a timestamp if one cannot be interpreted from the data, either by parsing or from tags specified by the user. This is called the raw entry and is parsed for fields that can be extracted and used with the index. An object storage also enables key-values to be written in its objects. Indexes on raw entries can be maintained together with the data or in separately named objects. The organization of time-series artifacts is agreeable to the namespace-bucket-object hierarchy in object storage.
Second, most log stores accept input directly over http via their own proprietary application calls. The object storage already has the well-known S3 APIs that facilitate data reads and writes. This makes the convention of calling the API uniform, not just for the log indexing service but also for the callers. Http is not the only way to input data, however. A low-level connector that accepts file-system protocols for data input and output may also improve access for these time-series databases. A file-system-enabled object storage that supports file-system protocols may serve this lower-level data path. A log store’s existing file system may be replicated directly into an object storage, and this becomes a tighter integration of the log index service as native to the object storage if connectors are available to accept data from sockets, files, queues and other sources.
Third, the data input does not need to be collected from data sources. In fact, log appenders are known to send data to multiple destinations, and object storage is merely another destination for them. For example, there is an S3-API-based log appender at http://s3appender.codeplex.com/ that can be used directly in many applications and services because they use log4j. The only refinement we mention here is that the log4j appenders need not go over http and can use low-level protocols such as file-system open and close.
Fourth, the object storage brings the best in storage practice in terms of durability, redundancy, availability and organization, and these can be leveraged for simultaneous log analytical activities that were generally limited on a single log index server.
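As a sketch of the first requirement, the hot-warm-cold progression maps onto an object naming convention like the hypothetical one below; the tier thresholds are illustrative:

from datetime import datetime, timezone

def prefix_for(timestamp, hot_hours=24, warm_days=30):
    # Map an entry to a hot/warm/cold prefix, mirroring the rolling
    # buckets of a time-series log store.
    age = datetime.now(timezone.utc) - timestamp
    if age.total_seconds() < hot_hours * 3600:
        tier = "hot"
    elif age.days < warm_days:
        tier = "warm"
    else:
        tier = "cold"
    return "logs/" + tier + "/" + timestamp.strftime("%Y-%m-%d") + "/"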
Conclusion: The object storage is well suited for saving, parsing, searching and reporting from logs.
Reference: https://1drv.ms/w/s!Ashlm-Nw-wnWt2h4_zHbC-u_MKIn

Sunday, October 14, 2018

Object storage, unlike file-systems, is an ideal destination for logs and improves production support drastically with access over S3 APIs. The cache service can directly re-use the object storage as its log store. The store is limitless and has no maintenance. Time-series databases make progressive buckets as they fill events into each bucket, and this can be done easily with object storage too. The namespace-bucket-object hierarchy is well suited for time-series data. There is no limit to the number of objects within a bucket, and we can roll over buckets in the same hot-warm-cold manner that time-series databases do. Moreover, with the data available in the object storage, it is easily accessible to all users for read over http. The only caveat is that some production support requests may ask that the object storage for the persistence of objects in the cache be kept separate from the object storage for the persistence of logs. This is quite reasonable and may be accommodated on-premise or in the cloud depending on the worth of the data and the cost incurred. The log stores can be trimmed periodically as well. In addition, the entire querying stack for reading these entries can be built on copies or selections of buckets and objects. More about saving logs and their indexes in object storage is available at: https://1drv.ms/w/s!Ashlm-Nw-wnWt3eeNpsNk1f3BZVM

Saturday, October 13, 2018

We were discussing object cache in production systems. The production system, as opposed to other environments, needs to be closely monitored. This means that we have alerting infrastructure for exceeding thresholds on various metrics that are relevant to the operation of the cache. In addition, whenever the system needs to be diagnosed for troubleshooting, some supportability features will be required from the system. These include running counters so that the logs need not be the only source of information. The level of alerts and monitors on a production system will far exceed those in any other environment because it represents the operations of a company or organization. In addition to these safeguards, a production system is also the longest-standing environment and needs to be running smoothly. Consequently, all activities on the production system must be well understood, rehearsed and orchestrated.
The list of features for production support includes many more items than logging. While logs are authoritative on all aspects of the operation of the services, they grow quickly and roll over into several files even when archived. Most organizations are not able to retain more than a few months’ history of logs in their file-system. Even if the logs can be persisted in a file system remote from the production systems, they will require a lot of storage. Moreover, the trouble with searching files directly in the file-system is that we are limited to command-line tools. This is not the case with log indexes, which can grow arbitrarily large and support impressive analytical capabilities by way of search and query operators.
The logs of the cache can go directly into object storage, in addition to other destinations, via the log4j utility. log4j is an asynchronous logging utility that supports writing and publishing logs to multiple destinations at the same time. Much like the Enterprise Library logging application block in the web application world, this utility allows any number of loggers to log to files. It is fast, flexible, reliable, extensible and supports filters and levels. It has three components: the loggers, the appenders and the layouts. The appenders emit the logs to files, console, socket, syslog, smtp, memory-mapped files, NoSQL stores and ZeroMQ. Although object storage is not directly listed as a destination, and object storage works very well as a time-series store, there doesn’t seem to be a direct S3 appender. However, since queues and webhooks are supported, it should be straightforward to set up object storage as a sink for the logs.
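The same idea can be sketched in Python with a logging handler that batches records toward an object-storage sink; the sink callable and key naming here are assumptions, not log4j itself:

import logging

class ObjectStorageHandler(logging.Handler):
    # Buffer records and flush batches as objects; batching decouples the
    # application from the storage write path, as asynchronous appenders do.
    def __init__(self, sink, batch_size=100):
        super().__init__()
        self.sink = sink              # callable(key, payload) writing one object
        self.batch = []
        self.batch_size = batch_size
        self.seq = 0

    def emit(self, record):
        self.batch.append(self.format(record))
        if len(self.batch) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.batch:
            self.seq += 1
            self.sink("logs/batch-%08d" % self.seq, "\n".join(self.batch))
            self.batch = []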

Friday, October 12, 2018

If the cache is distributed, the performance analysis may need to find out whether the distributed hashing leads to nodes that are overloaded. Such hot spots can be relieved by the addition of new nodes. Performance analysis in a distributed framework is slightly more involved than on a single server because there is a level of indirection. Such a study must not only make sure that the objects are cached satisfactorily at the local level but also that they participate in the global statistics.
Statistics cannot be kept at the object level alone. We need running counters of hits and misses across objects. These may be aggregated from all the nodes in a distributed hash table. Some like to view this independently of the network between the nodes: they take a global view regardless of the distribution of the actual objects. As long as we can accumulate the hits and misses per object and across objects in a global view, the networking does not matter at this level. Although nodes are expected to have a uniform distribution of objects, they may become improperly balanced, at which point the network-level statistics become helpful.
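Aggregating the per-node counters into the global view is straightforward; the per-node stats shape here is an assumption:

from collections import Counter

def global_stats(node_stats):
    # node_stats: an iterable of per-node Counters with "hits" and "misses" keys.
    total = Counter()
    for stats in node_stats:
        total.update(stats)
    hit_ratio = total["hits"] / max(1, total["hits"] + total["misses"])
    return total, hit_ratio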
The performance measurements of a cache can be done in a test lab by simulating the workload from a production system. This requires just the signature of the workload in terms of the object accesses, and everything else can be isolated from the production system. The capture of the workload is not at all hard because we only want the distribution, duration, type and size of accesses. The content itself does not matter as much as it does to applications and users. If we know the kind of object accesses done by the workloads, we know what the cache is subjected to in production. Then we can artificially generate as many objects and accesses as necessary, and it would not matter to the test because it would not change the duress on the cache. We can also maintain artificial workloads in different t-shirt sizes to study the impact on the cache. Some people raise the concern that a system in theory may not work as well in practice, but in this case, when we keep all the parameters the same, there is very little that can make the results deviate from the theory. The difference between the lab and production can be tightened so much that we can assert that production will meet the needs of the workloads after they have been studied in the lab.
Many organizations take this approach even for off-the-shelf software because they don’t want to exercise anything in production. Moreover, access to the production system is very restricted because it involves many perspectives and not just performance. Compliance, regulations, auditing and other such concerns require that the production system is hardened beyond development and test access. Imagine if you were to gain access to all the documents of the users using the production cache as a development or test representative of the organization. Even the pipelines feeding into production are maintained with multiple stages of vetting so that we minimize the pollution of production where it becomes necessary to revert to a previous version. Production systems are also the assets of a company that represent the cumulative efforts of the entire organization, if not the company itself. It becomes necessary to guard them more than others. Moreover, there is only one production system as opposed to multiple test and development environments.
#codingexercise
We were discussing subset sum yesterday. If we treat every element of the array as a candidate for subset evaluation as above, we can find all the elements that satisfy the subset property. Since each iteration overlaps with the previous iteration, we can maintain a dp table of whether an element has the subset-sum or subset-product property. If the current element has subsets, each of which has a value in the integer dp table corresponding to the number of ways of forming the subset, we can aggregate it for the current element.
The memoization merely helps us avoid redoing the same calculations.
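A sketch of the table-based counting, which doubles as the memoization mentioned above:

def count_subsets(a, target):
    # dp[s] = number of ways to form sum s from the elements seen so far.
    dp = [0] * (target + 1)
    dp[0] = 1
    for x in a:
        for s in range(target, x - 1, -1):   # descending so each element is used once
            dp[s] += dp[s - x]
    return dp[target]

# count_subsets([2, 3, 5, 7], 10) == 2, for {2, 3, 5} and {3, 7}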

Thursday, October 11, 2018

We were discussing the design of the cache yesterday. At this point, we can also separate the concerns between the cache and the storage for consistency. We can leverage the object storage for the consistency model of the objects. As long as there is a cache miss, we can translate the calls to the storage. In some cases, we can designate the cache as read-through or write-through. Therefore, as long as the architecture allows, our cache can be repurposed in more than one manner according to the workload and the provisioning of the policies. If the policies are determined by the layer above the cache, then the cache can become more robust. In the absence of policies, the cache can leverage the consistency model of the storage. It is for this reason that the caches that work with relational databases have been read-through or write-through.
There can definitely be a feedback cycle that helps tune the cache for a given setup. For example, the statistics that we collect from the cache in terms of hits and misses over time can help determine the minor adjustments to be made so that the applications see consistent performance. Most caches need to be warmed up before they can participate in the feedback cycle. This refers to the initial bringing of objects into the cache so that subsequent accesses may be made directly from the cache. This is true for both the application workloads and the cache. After the warm-up period, a workload may attain a regular rate of access. It is such patterns of access that give us hope of making improvements in the cache. Random accesses that do not have any pattern are generally ignored in tuning.
#codingexercise
The dp technique of including or excluding an element in the subset-sum problem also applies to subset-product determination.
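A sketch of the include/exclude recursion for the product variant, assuming positive elements greater than one:

def count_product_subsets(a, target, i=0):
    if target == 1:
        return 1               # the product of the chosen elements is complete
    if i == len(a):
        return 0
    skip = count_product_subsets(a, target, i + 1)
    take = 0
    if a[i] > 1 and target % a[i] == 0:
        take = count_product_subsets(a, target // a[i], i + 1)
    return skip + take

# count_product_subsets([2, 3, 6], 6) == 2, for {6} and {2, 3}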

Wednesday, October 10, 2018


Now that we have looked at marking the object collection in the cache for eviction, let us look at a few techniques to improve the information passed from the garbage collection to the cache. We maintained that the cache need not implement any strategy such as least-recently-used, time-to-live and others. The garbage collection already maintains a distinct set of generations as it grades down the objects, and the information passed to the cache need not be a mark-to-delete. It can be transparent to the cache with all the statistics it gathers during the collection run. Therefore, the cache may be able to determine the next steps even if the garbage collector suddenly disappeared. This means it includes everything from counts of accesses to the object, to the initialization time, the last modified time and so on. Although we use the term garbage collector, it is really a metadata collector on the objects, because a traditional garbage collector relies extensively on the root object hierarchy and the scopes introduced by a predetermined set of instructions. Here, we are utilizing the metadata for the policy, which the cache need not implement. Therefore, all information from the layer above may be saved so that the cache can use it just the same in the absence of any direct information about which objects to evict. Finally, the service for the cache may be able to bring the policy into the cache layer itself.
Let us now look at the topology of the cache. Initially we suggested that a set of servers can participate in the cache if the objects are distributed among the servers. In such a case, the cache was distributed among n servers as hash(o) modulo n. This had the nasty side-effect that when one or more servers went down or were added to the pool, all the objects in the cache would lose their hash because the variable n changed. Instead, consistent hashing came up with the scheme of accommodating new servers and taking old servers offline by arranging the hashes around a circle with cache points. When a cache is removed or added, the objects with hashes along the circle are moved clockwise to the next cache point. It also introduced “virtual nodes”, which are replicas of cache points in the circle. Since the caches may have a non-uniform distribution of objects, the virtual nodes host replicas of objects from a number of cache points.
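A minimal sketch of such a ring with virtual nodes follows; MD5 is just a convenient stand-in for any uniform hash:

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        self.ring = []                       # sorted list of (hash, server) cache points
        for server in servers:
            self.add(server, vnodes)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server, vnodes=100):
        # Each server contributes several virtual cache points on the circle.
        for i in range(vnodes):
            bisect.insort(self.ring, (self._hash(server + "#" + str(i)), server))

    def remove(self, server):
        self.ring = [(h, s) for h, s in self.ring if s != server]

    def lookup(self, obj_key):
        # Move clockwise to the next cache point at or after the object's hash.
        idx = bisect.bisect_left(self.ring, (self._hash(obj_key), ""))
        return self.ring[idx % len(self.ring)][1]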
#codingexercise
Find the minimum number of elements of an integer array that form a subset summing to a given number.
This follows the dynamic programming recurrence:
return min(1+ recursion_with_the_candidate_and_sum_minus_candidate, recursion_without_candidate_and_same_sum)
We add validations for the terminal conditions.
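A memoized sketch of that recurrence:

from functools import lru_cache

def min_subset_size(a, total):
    @lru_cache(maxsize=None)
    def go(i, remaining):
        if remaining == 0:
            return 0                         # terminal: the sum is met
        if i == len(a) or remaining < 0:
            return float("inf")              # terminal: no valid subset this way
        take = 1 + go(i + 1, remaining - a[i])
        skip = go(i + 1, remaining)
        return min(take, skip)
    best = go(0, total)
    return -1 if best == float("inf") else best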

Tuesday, October 9, 2018

We were discussing cache policy for aging. While there can be other mechanisms that translate directly to a table and query for the cache, grading and shifting objects is sufficient to achieve aging and compaction. This then translates to an effective cache policy.
Another strategy for the interaction between garbage collection and the cache is for the cache to merely hold a table of objects and their status. The status is always progressive: initialized -> active -> marked-for-eviction -> deleted. The state of the objects is determined by the garbage collector. Therefore, the net result of garbage collection is a set of new entries in the cache’s list of objects.
Also, items marked for eviction or deleted from the cache may increase over time. These may then be archived on a periodic rolling basis into the object storage so that the cache merely focuses on the un-evicted items. The cache therefore sees only a window of objects over time, and this is quite manageable for the cache because objects are guaranteed to expire. The garbage collector publishes the result to the cache as a list without requiring any object movement, as in the following sketch of a periodic reaper:
def reaper(objects, graveyard):
    # expired() is assumed to yield the objects whose lease has lapsed.
    for dead in expired(objects):
        if dead not in graveyard:
            graveyard.append(dead)
        objects.remove(dead)

def setup_periodic_reaping(sender, **kwargs):
    # sender is assumed to support add_periodic_task (a Celery-style hook).
    sender.add_periodic_task(10.0, reaper(kwargs['objects'], kwargs['graveyard']), name='reaping')

#codingexercise

Find the length of the largest dividing subsequence of a number array. A dividing subsequence is one where the elements appearing prior to an element in the subsequence are proper divisors of that element. 2, 4, 8 has a largest dividing subsequence of length 3, and that of 2, 4, 6 is 2.
public static void printLDS(List<Integer> a) {
    if (a == null || a.size() == 0) { System.out.println("0"); return; }
    if (a.size() == 1) { System.out.println("1"); return; }
    int[] lds = new int[a.size()];
    int best = 1;
    for (int i = 0; i < a.size(); i++) {
        lds[i] = 1; // every element by itself is a dividing subsequence
        for (int j = 0; j < i; j++) {
            if (a.get(j) != 0 && a.get(i) > a.get(j) && a.get(i) % a.get(j) == 0) {
                lds[i] = Math.max(lds[i], lds[j] + 1);
            }
        }
        best = Math.max(best, lds[i]);
    }
    System.out.println(best);
}
The above method can be modified for any longest increasing subsequence.


The subsequence of divisors also forms an increasing subsequence.

Monday, October 8, 2018

We were discussing garbage collection and compaction. With objects, we don’t have to create the graph. Instead we have to find ways to classify them by generation. Any classification of the objects is considered temporary until the next usage. At that point, the same objects may need reclassification. Since the classification is temporal, we run the risk of mislabeling generations and consequently reclaiming an object when it wasn’t supposed to be reclaimed.
We can also consider a markdown approach where, after labeling the objects, we progressively mark them down so that we take actions only on the labels that are older than the oldest classification. This will help keep the classification and the reaping separate. The approach enables no action to be taken if just the classification is needed, and it also helps with taking different actions on the different tiers. For example, evict and delete may be considered separate actions.
The unit of movement is an object here only because applications use objects. However, the cache may choose to stash swaths of objects at a time for shifting or ignoring. For example, we could use a heap to keep track of the biggest available swath to roll for objects of arbitrary size.
We may also keep a compact running track of all the swaths. For example, if they are consecutive, we could coalesce them. If there is a set of active objects that can be merged into a swath, it can be marked as busy and ignored. If there are too many alternating swaths of free and busy objects, they can be retained as such and excluded from coalescing.
Busy swaths may need to be updated as candidates for conversion to free whenever objects are moved out of them.
Busy swaths can also be coalesced if conditions permit. This is usually not necessary because it is too aggressive and affects only a tiny fraction of the overall collection.
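The coalescing described above can be sketched as interval merging; a swath here is just a (start, end) range:

def coalesce(swaths):
    # Merge consecutive or overlapping free ranges so the cache tracks
    # fewer, larger candidates for reuse.
    merged = []
    for start, end in sorted(swaths):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged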
Movements of free and busy swaths are always opposite to each other. The free swaths move towards the younger generation where they can be utilized for gen 0.
The movement operation itself can be optimized with efficient bookkeeping so that it translates only to updates in the bookkeeping.