Wednesday, October 17, 2018

We were discussing full-text search over object storage. As we enumerate the advantages of separating the object index from the object data, we realize that the metadata is whatever we choose to keep from the Lucene-generated fields. By enriching the fields of the documents added to the index, we improve not only the queries but also the collection over which we can perform correlations. If we start out with only a few fields, the model for statistical analysis has only a few parameters; the more fields we add, the better the correlation. This holds not just for queries on the data and analysis over its historical accumulation, but also for data mining and machine learning methods.
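As a concrete illustration, here is a minimal sketch with the standard Lucene Java API of how additional fields might be attached to an index document. The field names (objectKey, contentType) and the index path are illustrative assumptions, not prescriptions:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class EnrichedIndexer {
    public static void indexObject(String objectKey, String rawText) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/objindex")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // the object's location in the bucket, kept as an exact-match key
            doc.add(new StringField("objectKey", objectKey, Field.Store.YES));
            // the raw text is analyzed for full-text queries but not stored
            doc.add(new TextField("content", rawText, Field.Store.NO));
            // every extra field widens the set of correlations we can run later
            doc.add(new StringField("contentType", "text/plain", Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}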
We will elaborate on each of these. Consider the time- and space-dimension queries generally required by dashboards, charts, and graphs. These queries need to search over accumulated data that can be quite large, often exceeding terabytes. As the data grows, the metadata becomes all the more important, and its organization can now be tailored to the queries instead of relying on the organization of the data. If there is a need to separate online ad-hoc queries on current metadata from more intensive analytical background queries, we can choose a different organization of the information for each category so that each serves its queries better.
Let us also look at data mining techniques. These include clustering techniques and rely largely on adding tags to existing information. Even though we are searching Lucene index documents that may already have fields, nothing prevents these techniques from classifying the documents and coming up with new labels that can be persisted as fields. Therefore, unlike the read-only queries mentioned earlier, these are techniques where one stage of processing may benefit from the read-write output of another. Data mining algorithms are computationally heavy compared to regular queries for grouping, sorting, and ranking; even computing the similarity between an index document and a cluster may not be cheap. That is why it helps to have the results of one pass benefit another, especially when the documents have not changed.
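For instance, once a clustering pass assigns a label to a document, the label can be written back as just another field. A minimal sketch, assuming the objectKey field from the earlier snippet uniquely identifies the document (a full implementation would also carry over the document's other stored fields):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class ClusterLabeler {
    // persist the label produced by a clustering stage so that later
    // read-only stages can query it like any other field
    public static void persistLabel(IndexWriter writer, String objectKey,
                                    String clusterLabel) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("objectKey", objectKey, Field.Store.YES));
        doc.add(new StringField("clusterLabel", clusterLabel, Field.Store.YES));
        // updateDocument deletes the old document matching the term and adds the new one
        writer.updateDocument(new Term("objectKey", objectKey), doc);
    }
}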
Now let us take a look at machine learning, which is by far the most computationally involved of all the above. Here again we benefit from more and more data to process. Since machine learning methods are implemented in packages from different sources, the emphasis shifts to long-running tasks written in environments different from the data source. Making all of the data available in different compute environments therefore becomes more important. If it helps performance in these cases, the object storage can keep replicas of the data in different zones.

Tuesday, October 16, 2018

In addition to serving as a regular time-series database, object storage can participate in text-based search, both as full-text search and via text mining techniques. One example of full-text search is demonstrated with Lucene here: https://github.com/ravibeta/csharpexamples/blob/master/SourceSearch/SourceSearch/SourceSearch/Program.cs though it is better written in Java against a file-system-enabled object storage. An example is available at https://github.com/ravibeta/JavaSamples/blob/master/FullTextSearchOnS3.java
Text mining techniques, including machine learning techniques, are demonstrated with scikit-learn Python packages; sample code is at https://github.com/ravibeta/pythonexamples/blob/master/
The use of indexable key-values in full-text search deserves special mention. On one hand, Lucene has ways to populate meta keys and meta values, which it calls fields, in the indexer documents. On the other hand, each of the objects in a bucket can store not only the raw document but also the meta keys and meta values. This suggests keeping the raw data and the indexer fields together in the same object. When we search over the objects enumerated from the bucket, we no longer need the actual object and thus avoid scanning large objects. Instead we search the metadata and list only those objects where the metadata has the relevant terms. However, we can improve this model by separating the index objects from the raw objects. The raw objects then no longer need to be touched when the metadata changes. Similarly, the indexer objects can be deleted and recreated independently of the raw objects, so that we can re-index at different sites. Also, keeping the indexer documents as key-value entries reduces space and keeps them together, so that a greater range of objects can be searched. This technique has been quite popular with many indexes.
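A sketch with the AWS SDK for Java (v1) of how meta keys and values ride along with an object, so that a search can consult getObjectMetadata without ever reading the raw body. The bucket, key, and metadata names here are made up for illustration:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import java.io.File;
import java.util.Map;

public class MetadataSearch {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // attach indexer fields as user metadata when writing the raw object
        ObjectMetadata meta = new ObjectMetadata();
        meta.addUserMetadata("title", "quarterly-report");
        meta.addUserMetadata("terms", "storage,index,lucene");
        s3.putObject(new PutObjectRequest("mybucket", "reports/q3.txt",
                new File("q3.txt")).withMetadata(meta));

        // later, search the metadata alone -- the large raw body is never fetched
        Map<String, String> found =
                s3.getObjectMetadata("mybucket", "reports/q3.txt").getUserMetadata();
        boolean relevant = found.getOrDefault("terms", "").contains("lucene");
        System.out.println("match: " + relevant);
    }
}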

Monday, October 15, 2018

The object storage as a time-series data-store
Introduction: Niche products like log indexes and time-series databases often rely on unstructured storage. While they are primarily focused on their respective purposes, they would rather not have to focus on storage. Object storage serves as a limitless, no-maintenance store that not only replaces the file system as the traditional store for these products but also brings in storage best practices. As a storage tier, object storage is suited not only to take the place of other storage used by these products but also to bring those product features into the storage tier. This article examines the premise of using object storage to participate directly as the base tier of log index products.
Description: We begin by examining the requirements for a log index store.
First, the log indexes are maintained as a time-series database. The data takes the form of hot, warm, and cold buckets that denote the progression of the entries made to the store. Each index is maintained from a continuous rolling input of data that fills one bucket after another. Each entry is given a timestamp if one cannot be interpreted from the data, either by parsing or from tags specified by the user. This is called the raw entry, and it is parsed for fields that can be extracted and used with the index. An object storage also enables key-values to be written on its objects. Indexes on raw entries can be maintained together with the data or in separately named objects. The organization of time-series artifacts maps naturally to the namespace-bucket-object hierarchy in object storage.
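A small sketch of how the rolling time-series buckets might map onto object keys. The hot/warm/cold prefixes and the hourly granularity are assumptions for illustration:

import com.amazonaws.services.s3.AmazonS3;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TimeSeriesWriter {
    private static final DateTimeFormatter HOUR_BUCKET =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    // each raw entry lands in an hourly "hot" prefix; an aging job can later
    // copy prefixes from hot to warm to cold, mimicking bucket rollover
    public static void writeRawEntry(AmazonS3 s3, String bucket,
                                     Instant timestamp, long seq, String rawEntry) {
        String key = String.format("hot/%s/%d.log", HOUR_BUCKET.format(timestamp), seq);
        s3.putObject(bucket, key, rawEntry);
    }
}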
Second, most log stores accept input directly over HTTP via their own proprietary application calls. Object storage already has well-known S3 APIs that facilitate data reads and writes. This makes the calling convention uniform not just for the log indexing service but also for the callers. HTTP is not the only way to input data, however. A low-level connector that speaks file-system protocols for data input and output may also improve access for these time-series databases. A file-system-backed object storage that supports such protocols can serve this lower-level data path. A log store's existing file system may be replicated directly into object storage, and if connectors are available to accept data from sockets, files, queues, and other sources, this becomes a tighter integration of the log index service as native to the object storage.
Third, the data does not need to be collected from the data sources by the store itself. In fact, log appenders are known to send data to multiple destinations, and object storage is merely another destination for them. For example, there is an S3-API-based log appender at http://s3appender.codeplex.com/ that can be used directly in many applications and services because they use log4j. The only refinement we mention here is that the log4j appenders need not go over HTTP and can use low-level protocols such as file-system open and close.
Fourth, the object storage brings the best in storage practice in terms of durability, redundancy, availability, and organization, and these can be leveraged for simultaneous log-analytical activities that are generally limited on a single log index server.
Conclusion: The object storage is well suited for saving, parsing, searching and reporting from logs.
Reference: https://1drv.ms/w/s!Ashlm-Nw-wnWt2h4_zHbC-u_MKIn

Sunday, October 14, 2018

Object storage, unlike file systems, is an ideal destination for logs and improves production support drastically with access over S3 APIs. The cache service can directly reuse the object storage as its log store. The store is limitless and requires no maintenance. Time-series databases roll over progressive buckets as they fill events into each bucket, and this can be done easily with object storage too. The namespace-bucket-object hierarchy is well suited for time-series data. There is no limit to the number of objects within a bucket, and we can roll over buckets in the same hot-warm-cold manner that time-series databases do. Moreover, with the data in object storage, it is easily accessible to all users for reads over HTTP. The only caveat is that a production support request may be made to keep the object storage for the persistence of objects in the cache separate from the object storage for the persistence of logs. This is quite reasonable and may be accommodated on-premise or in the cloud depending on the worth of the data and the cost incurred. The log stores can be periodically trimmed as well. In addition, the entire querying stack for reading these entries can be built on copies or selections of buckets and objects. More about saving logs and their indexes in object storage is available at: https://1drv.ms/w/s!Ashlm-Nw-wnWt3eeNpsNk1f3BZVM

Saturday, October 13, 2018

We were discussing the object cache in production systems. A production system, as opposed to other environments, needs to be closely monitored. This means we have alerting infrastructure for exceeding thresholds on the various metrics relevant to the operation of the cache. In addition, whenever the system needs to be diagnosed for troubleshooting, some supportability features will be required from the system. These include running counters, so that the logs need not be the only source of information. The level of alerts and monitors on a production system will far exceed those in any other environment because it represents the operations of a company or organization. In addition to these safeguards, a production system is also the longest-standing environment and needs to run smoothly. Consequently, all activities on the production system must be well understood, rehearsed, and orchestrated.
The list of features for production support includes many more items than logging. While logs are authoritative on all aspects of the operation of the services, they grow quickly and roll over into several files even when archived. Most organizations are not able to retain more than a few months' history of logs in their file system. Even if the logs can be persisted in a file system remote from the production systems, they will require a lot of storage. Moreover, the trouble with searching files directly in the file system is that we are limited to command-line tools. This is not the case with log indexes, which can grow arbitrarily large and support impressive analytical capabilities by way of search and query operators.
The logs of the cache can flow directly into object storage in addition to other destinations via the log4j utility. log4j is an asynchronous logging utility that supports writing and publishing logs to multiple destinations at the same time. Much like the Enterprise Application Block for logging in the web-application world, this utility allows any number of loggers to log to files. It is fast, flexible, reliable, extensible, and supports filters and levels. It has three components: the loggers, the appenders, and the layouts. The appenders emit the logs to files, console, socket, syslog, SMTP, memory-mapped files, NoSQL, and ZeroMQ. Although object storage is not directly listed as a destination, and in fact object storage works very well as a time-series store, there does not seem to be a direct S3 appender. However, since queues and webhooks are supported, it should be straightforward to set up object storage as a sink for the logs.
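As a sketch of what such a sink could look like, here is a minimal custom appender against the log4j2 AbstractAppender API that batches events and flushes them as objects. The plugin registration, layout handling, and error handling that a real appender needs are omitted, and the key scheme is illustrative:

import com.amazonaws.services.s3.AmazonS3;
import org.apache.logging.log4j.core.LogEvent;
import org.apache.logging.log4j.core.appender.AbstractAppender;
import java.time.Instant;

public class S3Appender extends AbstractAppender {
    private final AmazonS3 s3;
    private final String bucket;
    private final StringBuilder buffer = new StringBuilder();

    protected S3Appender(String name, AmazonS3 s3, String bucket) {
        super(name, null, null, true);
        this.s3 = s3;
        this.bucket = bucket;
    }

    @Override
    public void append(LogEvent event) {
        synchronized (buffer) {
            buffer.append(event.getMessage().getFormattedMessage()).append('\n');
            // flush roughly every megabyte so objects stay a reasonable size
            if (buffer.length() > 1 << 20) {
                flush();
            }
        }
    }

    private void flush() {
        String key = "logs/" + Instant.now().toString();
        s3.putObject(bucket, key, buffer.toString());
        buffer.setLength(0);
    }
}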

Friday, October 12, 2018

If the cache is distributed, the performance analysis may need to determine whether the distributed hashing leads to nodes that are overloaded. Such hotspots can be relieved by the addition of new nodes. Performance analysis in a distributed framework is slightly more involved than on a single server because there is a level of indirection. Such a study must make sure not only that the objects are cached satisfactorily at the local level but also that they participate in the global statistics.
Statistics cannot be at the object level alone. We need running counters of hits and misses across objects. These may be aggregated from all the nodes in a distributed hash table. Some like to view this independently of the network between the nodes. For example, they take a global view regardless of the distribution of the actual objects. As long as we can accumulate the hits and misses at the per-object level and across objects in a global view, the networking does not matter at this level. Although nodes are expected to have a uniform distribution of objects, they may become improperly balanced, at which point the network-level statistics become helpful.
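A minimal sketch of such counters: each node keeps cheap local tallies and a collector sums them into the global view. Node discovery and transport are assumed away here:

import java.util.List;
import java.util.concurrent.atomic.LongAdder;

class CacheStats {
    final LongAdder hits = new LongAdder();
    final LongAdder misses = new LongAdder();

    void recordHit()  { hits.increment(); }
    void recordMiss() { misses.increment(); }

    // aggregate per-node statistics into the global view; the networking
    // between nodes does not matter at this level, only the sums do
    static double globalHitRatio(List<CacheStats> perNode) {
        long h = perNode.stream().mapToLong(s -> s.hits.sum()).sum();
        long m = perNode.stream().mapToLong(s -> s.misses.sum()).sum();
        return (h + m) == 0 ? 0.0 : (double) h / (h + m);
    }
}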
The performance measurements of a cache can be done in a test lab by simulating the workload from a production system. This requires just the signature of the workload in terms of the object accesses, and everything else can be isolated from the production system. Capturing the workload is not hard at all, because we only want the distribution, duration, type, and size of accesses. The content itself does not matter as much as it does to applications and users. If we know the kind of object accesses made by the workloads, we know what the cache is subjected to in production. We can then artificially generate as many objects and accesses as necessary, and it would not matter to the test because it would not change the duress on the cache. We can also maintain artificial workloads in different t-shirt sizes to study the impact on the cache. Some people raise the concern that a system that works in theory may not work as well in practice, but in this case, when we keep all the parameters the same, there is very little that can make the results deviate from the theory. The difference between the lab and production can be tightened so much that we can assert the production will meet the needs of the workloads after they have been studied in the lab.

Many organizations take this approach even for off-the-shelf software because they don't want to exercise anything in production. Moreover, access to the production system is very restricted because it involves many perspectives, not just performance. Compliance, regulations, auditing, and other such concerns require that the production system be hardened beyond development and test access. Imagine if you were to gain access to all the documents of the users of the production cache as a development or test representative of the organization. Even the pipelines feeding into production are maintained with multiple stages of vetting, so that we minimize pollution of production where it would otherwise become necessary to revert to a previous version. Production systems are also assets of a company that represent the cumulative efforts of the entire organization, if not the company itself. It becomes necessary to guard them more than others. Moreover, there is only one production system as opposed to multiple test and development environments.
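A toy sketch of replaying such a captured signature: only the distribution, type, and size of accesses are reproduced, with skewed popularity standing in for the production pattern. The skew exponent, seed, and object count are invented parameters:

import java.util.Random;

class WorkloadReplayer {
    private final Random rnd = new Random(42);  // fixed seed for repeatable runs
    private final int objectCount;
    private final double readFraction;

    WorkloadReplayer(int objectCount, double readFraction) {
        this.objectCount = objectCount;
        this.readFraction = readFraction;
    }

    // skewed popularity: low ids are drawn far more often, approximating
    // the hot-object behavior captured from the production signature
    int nextObjectId() {
        return (int) (objectCount * Math.pow(rnd.nextDouble(), 3.0));
    }

    boolean nextIsRead() {
        return rnd.nextDouble() < readFraction;
    }
}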
#codingexercise
We were discussing subset sum yesterday. If we treat every element of the array as a candidate for subset evaluation as above, we can find all the elements that satisfy the subset property. Since each iteration overlaps with the previous one, we can maintain a dp table recording whether an element has the subset-sum (or subset-product) property. If the current element extends subsets that each have an entry in the integer dp table counting the number of ways of forming that subset, we can aggregate those counts for the current element.
The memoization merely saves us from redoing the same calculations.
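A sketch of the counting variant in Java: dp[s] holds the number of subsets with sum s, built by including or excluding each element once (non-negative elements assumed):

public class SubsetSum {
    // returns the number of subsets of a that sum to target
    static int countSubsets(int[] a, int target) {
        int[] dp = new int[target + 1];
        dp[0] = 1;  // the empty subset forms sum 0 in exactly one way
        for (int x : a) {
            // iterate downwards so each element is used at most once
            for (int s = target; s >= x; s--) {
                dp[s] += dp[s - x];
            }
        }
        return dp[target];
    }

    public static void main(String[] args) {
        System.out.println(countSubsets(new int[]{2, 3, 5, 7}, 10));  // {3,7} and {2,3,5} -> 2
    }
}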

Thursday, October 11, 2018

We were discussing the design of the cache yesterday. At this point, we can also separate the concerns between the cache and the storage with respect to consistency. We can leverage the object storage for the consistency model of the objects. As long as there is a cache miss, we can translate the calls to the storage. In some cases, we can designate the cache as read-through or write-through. Therefore, as long as the architecture allows, our cache can be repurposed in more than one manner according to the workload and the provisioning of policies. If the policies are determined by the layer above the cache, the cache can become more robust. In the absence of policies, the cache can leverage the consistency model of the storage. It is for this reason that the caches that work with relational databases have been read-through or write-through.
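A bare-bones sketch of the read-through and write-through shape, with the object storage reads and writes standing behind injected functions; eviction, TTLs, and error handling are left out:

import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;
import java.util.function.Function;

class ReadThroughCache<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader;        // reads from the backing object storage
    private final BiConsumer<K, V> writer;      // writes to the backing object storage

    ReadThroughCache(Function<K, V> loader, BiConsumer<K, V> writer) {
        this.loader = loader;
        this.writer = writer;
    }

    // read-through: a cache miss is transparently translated to a storage read
    V get(K key) {
        return map.computeIfAbsent(key, loader);
    }

    // write-through: the storage is updated first, keeping it authoritative
    void put(K key, V value) {
        writer.accept(key, value);
        map.put(key, value);
    }
}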
There can definitely be a feedback cycle that helps tune the cache for a given setup. For example, the statistics that we collect from the cache in terms of hits and misses over time can help determine minor adjustments so that applications see consistent performance. Most caches need to be warmed up before they can participate in the feedback cycle. This refers to the initial bringing of objects into the cache so that subsequent accesses can be served directly from the cache. This is true for both the application workloads and the cache. After the warm-up period, a workload may attain a regular rate of access, and it is such patterns of access that we can hope to exploit for improvements in the cache. Random accesses that follow no pattern are generally ignored for tuning.
#codingexercise
The dp technique of including or excluding an element in the subset-sum problem also applies to subset-product determination, as sketched below.
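A sketch of the same include/exclude recursion for the product case, assuming positive integers and treating the empty product as 1:

public class SubsetProduct {
    // returns true if some subset of a[i..] multiplies to target
    static boolean hasSubsetProduct(int[] a, int i, long target) {
        if (target == 1) return true;          // remaining target fully factored
        if (i == a.length) return false;
        // exclude a[i]
        if (hasSubsetProduct(a, i + 1, target)) return true;
        // include a[i] only when it divides what remains
        return target % a[i] == 0 && hasSubsetProduct(a, i + 1, target / a[i]);
    }

    public static void main(String[] args) {
        System.out.println(hasSubsetProduct(new int[]{2, 3, 5}, 0, 15));  // true: {3,5}
    }
}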