Saturday, October 6, 2018

This post continues the design for an object cache layer as described here. A specific use case now explains the aging of objects in the cache. While the object storage saves the objects for durability, the cache makes them available to workloads to save on the cost of going deep into object storage. The cache also translates the user workloads into an append-only workload for the object storage. This simplifies versioning and replication, which can now be done in object storage at little or no cost. A garbage collection system demonstrates this aging. While the cache can implement any one of the caching policies for retention of objects, retention can also be delegated to the user workload, where the workload specifies which objects have aged. The cache then merely schedules the storage of these objects into the object storage. Such a policy is best demonstrated in a user workload that implements a garbage collection system. Once it works well in the user workload, the logic can be moved into the cache layer.
In this post and the next few, we bring up the .NET garbage collection as an example of such an aging policy. The .NET garbage collector is a generational mark-and-compact collector. It uses the notion of a generation to group objects: the youngest generation holds the most recently created objects, and objects that survive collections are promoted to older generations. The generation therefore gives us an indication of age. The age of an object, in the cache's terms, is the time since its last use. We will describe the sweep and the collection shortly, but we proceed with the notion that there is a metric that lets us identify the age of an object, which the cache then rolls to object storage.
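To make this concrete, here is a rough sketch of such an aging policy in Python. The class names, the promotion threshold, and the object storage interface (a put call) are all hypothetical; the sketch only illustrates the idea of promoting survivors through generations and rolling the oldest generation into object storage.

import time

class CacheEntry:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.generation = 0            # 0 = youngest generation
        self.last_used = time.time()

class GenerationalCache:
    MAX_GENERATION = 2                 # entries beyond this age are rolled to object storage

    def __init__(self, object_storage):
        self.object_storage = object_storage   # assumed to expose put(key, value)
        self.entries = {}

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        entry.last_used = time.time()
        entry.generation = 0           # recently used objects stay young
        return entry.value

    def put(self, key, value):
        self.entries[key] = CacheEntry(key, value)

    def age(self):
        # called periodically: promote survivors, roll the oldest generation out
        for key, entry in list(self.entries.items()):
            if entry.generation >= self.MAX_GENERATION:
                # append-only write; versioning and replication happen in object storage
                self.object_storage.put(key, entry.value)
                del self.entries[key]
            else:
                entry.generation += 1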

#codingexercise
Check if a number has fewer set bits than unset bits.
boolean hasLessSetThanUnsetBits(int n)
{
    // counts the unset bits only up to the most significant set bit
    int set = 0;
    int unset = 0;
    while (n != 0)
    {
        if ((n & 0x1) == 1) {
            set++;
        } else {
            unset++;
        }
        n = n >>> 1; // unsigned shift so that negative inputs also terminate
    }
    return (set < unset);
}

Friday, October 5, 2018

We were discussing the queue layer over Object Storage.

The reserving of certain queues in the queue layer and of certain object storage namespaces alleviates the need for special-purpose stacks. In addition, it reuses the existing infrastructure for concerns such as monitoring, reporting, and background processes. There are no additional artifacts here other than the reservations, which keep system data from mixing with user data.

User queues and object storage could be made programmatically available collectively just as they are made available individually. Many message queuing stacks provide APIs to manage the queues, and they follow similar patterns. Unlike other SDKs, the queue layer does not merely provide the APIs associated with the queue. It consolidates, through the queue layer, functionality on the object storage that would otherwise have been directly accessible. Such automation of common tasks associated with the object storage for the purpose of queue management is a useful addition to the SDK.

While queues serve a variety of purposes, they come in especially useful for capturing traffic. From network packets and API messages to just about anything that has a continuous flow, a queue helps record the traffic regardless of the flow. There is no upper limit to the number of messages in a queue and no limit to what can be stored in the object storage, so together they can serve almost any traffic.

Queues do not need to actively participate in standard query operations on the captured traffic. They merely transfer the data to the object storage, which can then serve the queries with the help of a query execution engine, such as a log parser, that treats the available data as an enumerable. The query operations could include selection, aggregation, grouping and ordering.
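Once the captured traffic lands in objects, those query operations can be pictured as iterating over the messages as an enumerable. The sketch below assumes each object yields message records as dictionaries with hypothetical fields such as "service" and "status"; it only shows the shape of selection, grouping and aggregation, not any particular query engine's API.

from itertools import groupby

def enumerate_messages(objects):
    # each object is assumed to be an iterable of message records (dicts)
    for obj in objects:
        for record in obj:
            yield record

def failed_requests_by_service(objects):
    # selection: keep only failures (hypothetical "status" field)
    failures = (m for m in enumerate_messages(objects) if m["status"] >= 400)
    # grouping and ordering by a hypothetical "service" field
    ordered = sorted(failures, key=lambda m: m["service"])
    # aggregation: count failures per service
    return {service: sum(1 for _ in group)
            for service, group in groupby(ordered, key=lambda m: m["service"])}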

Thursday, October 4, 2018

We were discussing the queue layer over Object Storage.

The reserving of certain queues in the queue layer and of certain object storage namespaces alleviates the need for special-purpose stacks. In addition, it reuses the existing infrastructure for concerns such as monitoring, reporting, and background processes. There are no additional artifacts here other than the reservations, which keep system data from mixing with user data.

Securing system artifacts is just as necessary as preventing data corruption. This can be achieved by isolation and encryption. Audit logs, for example, should not be tampered with; they may be encrypted so that the data cannot be modified even if it is leaked. System artifacts can use specific prefixes so that they are differentiated from user namespaces. In addition, users may not be allowed to create these namespaces. Finally, data in transit must be secured just as much as data at rest.
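A minimal sketch of the prefix-based isolation mentioned above might look like the following. The reserved prefix, the in-memory namespace registry, and the audit namespace are assumptions for illustration rather than an existing API.

RESERVED_PREFIX = "system-"     # hypothetical prefix reserved for system artifacts

_namespaces = set()

def create_namespace(name, requested_by_user=True):
    # user requests may not claim reserved namespaces
    if requested_by_user and name.startswith(RESERVED_PREFIX):
        raise PermissionError("namespace prefix is reserved for system artifacts")
    _namespaces.add(name)
    return name

# system components bypass the check, for example for the encrypted audit log namespace
create_namespace(RESERVED_PREFIX + "audit", requested_by_user=False)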

On the other hand, user queues and object storage must be made programmatically available collectively just as they are made available individually. Many message queuing stacks provide APIs to manage the queues, and they follow similar patterns. These APIs are made available over the web so they can be invoked remotely via direct calls. In addition, if SDKs are made available, they can improve programmability. SDKs make working with queues and objects easier in the programming language of the user. They don't need any extra setup and facilitate the calls made to the APIs.

Unlike other SDKs, the queue layer does not merely provide the APIs associated with the queue. It consolidates, through the queue layer, functionality on the object storage that would otherwise have been directly accessible. Such automation of common tasks associated with the object storage for the purpose of queue management is a useful addition to the SDK.
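As a sketch of what such consolidation might look like, the wrapper below pairs a queue client with an object storage client so that draining a queue into a bucket becomes a single SDK call. The client interfaces (receive, delete, put) and the message attributes are assumptions standing in for whichever message queue and object storage APIs are actually in use.

class QueueOverObjectStorage:
    # hypothetical SDK helper that consolidates queue and object storage tasks

    def __init__(self, queue_client, object_client, bucket):
        self.queue = queue_client      # assumed to expose receive() and delete(message)
        self.objects = object_client   # assumed to expose put(bucket, key, body)
        self.bucket = bucket

    def drain_to_storage(self, batch_size=100):
        # move up to batch_size messages from the queue into the bucket
        moved = 0
        for _ in range(batch_size):
            message = self.queue.receive()
            if message is None:
                break
            # append-only: each message becomes its own immutable object
            self.objects.put(self.bucket, key=message.id, body=message.body)
            self.queue.delete(message)
            moved += 1
        return moved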

Wednesday, October 3, 2018

We resume our discussion on the queue layer over object storage from the previous post. It is important for the queue layer to understand resource usage. These usage metrics may apply across queues, per queue, or even be as granular as per service. For example, we may have a queue that services a single consumer whose processing returns different status codes. A simple metric is the success rate over the overall processing, where success is determined by the number of success status codes generated. Since this kind of metric comes from individual messages, the queue may want to aggregate it across all messages. This is best shown in a dashboard with drill-downs to different services.
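A success-rate metric of this kind could be computed and aggregated per queue roughly as follows. The 2xx convention for success and the sample per-queue status codes are assumptions for illustration.

def success_rate(status_codes):
    # fraction of processed messages that reported a success status (2xx assumed)
    total = len(status_codes)
    if total == 0:
        return 0.0
    successes = sum(1 for code in status_codes if 200 <= code < 300)
    return successes / total

# aggregate across queues for a dashboard with drill-downs
per_queue = {"billing": [200, 200, 500], "ingest": [200, 202, 200, 404]}
dashboard = {queue: success_rate(codes) for queue, codes in per_queue.items()}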
  
It is important to note that metrics data is inherently time based. Whether metrics are spot measurements, cumulative over time, or deltas between time slices, they are associated with the time at which they were measured. Consequently, most metric data is stored in a time-series database. Previous measurements can be summarized with statistics while the running measurements apply only to the current window of time. Therefore, some of these time-series databases can even be fixed-size.
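The fixed-size store can be pictured as a ring buffer keyed by time: older measurements fall off as new ones arrive, and the evicted window survives only as summary statistics. The sketch below is just that picture, not any particular time-series database.

from collections import deque
import time

class FixedSizeSeries:
    def __init__(self, capacity=1024):
        self.points = deque(maxlen=capacity)   # oldest points are dropped automatically

    def record(self, value, timestamp=None):
        self.points.append((timestamp or time.time(), value))

    def summarize(self):
        # statistics over the current window of time
        values = [v for _, v in self.points]
        if not values:
            return None
        return {"count": len(values), "min": min(values),
                "max": max(values), "mean": sum(values) / len(values)}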
  
There are many tools available to present such charts from data sources. Almost all such reporting mechanisms pull data for their charts. However, they cannot directly call the services to query their metrics because this may affect the functionality and performance of those services. Instead, they work passively, gathering information from log trails and database states. Most services have logging configured, and their logs are usually indexed in a time-series database of their own. In our case we can utilize the objects as time-series buckets that fill over time. Consequently, the pulled information can come from dedicated namespaces, buckets and objects in the object storage.
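Using objects as time-series buckets might look like the sketch below, where the namespace, bucket and object names are derived from the metric name and the measurement time so that each hour of a metric fills one object. All of the naming conventions here are assumptions for illustration.

from datetime import datetime, timezone

METRICS_NAMESPACE = "system-metrics"   # hypothetical reserved namespace for metrics

def bucket_for(metric_name, when=None):
    # derive a bucket/object pair so that each hour of a metric fills one object
    when = when or datetime.now(timezone.utc)
    bucket = METRICS_NAMESPACE + "/" + metric_name
    obj = when.strftime("%Y/%m/%d/%H")   # hourly buckets that fill over time
    return bucket, obj

# a reporting tool pulls from these names instead of querying the services directly
print(bucket_for("queue.success_rate"))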


Tuesday, October 2, 2018

We were discussing topic detection in text documents yesterday.
The following is another method to do it:
def simultaneous_regressor_and_classifier(text, corpus):
    bounding_box = initialize_bounding_box()
    clusters = []
    regressors = []
    # finite iterations or till satisfaction
    for _ in range(10):
        selected_text = select_text(bounding_box, text)
        cluster = classify(selected_text)
        clusters += [(bounding_box, cluster)]
        regressor = generate_match(cluster, corpus)
        regressors += [(bounding_box, regressor)]
        bounding_box = next_bounding_box(regressors, text)
    selections = select_top_clusters(clusters)
    return summary_from(selections)


The motivation behind this method is that the whole document need not be a single bounding box. The steps taken to determine topics in the whole document by clustering word vectors are the same technique we apply to smaller sections. This guides us towards locating topics within the text. What we did not specify was the selection of the bounding boxes. This holds no particular deviation from the general practice in object detection. Each proposal has a value for the intersection over union with the ground truth, which can be used in a linear regression. This penalizes false positives and noisy proposals in favor of those that lie along the regression line. As an alternative to choosing enough random proposals for the plot, we can also be selective in the choice of bounding boxes by choosing those that have a higher concentration of keywords.
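Since the text flows linearly, a bounding box can be treated as a span of token positions, and the intersection over union can be computed directly on spans. The helper below is a sketch under that assumption; the linear regression over the resulting scores then follows the usual least-squares recipe.

def span_iou(proposal, ground_truth):
    # IoU of two text spans given as (start, end) token offsets
    start_a, end_a = proposal
    start_b, end_b = ground_truth
    intersection = max(0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0

# proposals with low IoU against the ground truth act as the noisy points
# that the linear regression penalizes
print(span_iou((10, 40), (20, 50)))   # 0.5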

Monday, October 1, 2018

We were discussing topic detection in text documents yesterday.

The ability to discern a domain is similar to discerning latent semantics using collocation. The latter was based on pointwise mutual information: it reduced topics to keywords for data points and used collocation data to train the softmax classifier that relied on the PMI. Here we use feature maps from associated cluster centroids.
Keywords form different topics when the same keywords are used with different emphasis. We can measure the emphasis only by the cluster. Clusters are dynamic, but we can record the cohesiveness of a cluster and its size with the goodness-of-fit measure. As we record different goodness-of-fit values, we have a way of representing not just keywords but also their association with topics. We use the vector for the keyword and the cluster for the topic. An analogy is a complex number: it has a real part and an imaginary part. The real-world domain is untouched by the imaginary part, but the two together enable translations that are easy to visualize.
The metric to save cluster information discovered in the text leads us to topic vectors where the feature maps are different from collocation-based data. This is an operation over metadata, but it has to be layered with the collocation data.
def determining_neurons_for_topic(document, corpus):
    return cluster_centroids_of_top_clusters(corpus)

def batch_summarize_repeated_pass(text, corpus):
    doc_cluster = classify(text, corpus)
    bounding_boxes = gen_proposals(doc_cluster)
    clusters = [(FULL, doc_cluster)]   # FULL denotes the bounding box covering the whole document
    for bounding_box in bounding_boxes:
        cluster = get_cluster_bounding_box(bounding_box, text, corpus)
        clusters += [(bounding_box, cluster)]
    selections = select_top_clusters(clusters)
    return summary_from(selections)

def select_top_clusters(clusters, threshold=0.5):   # the default threshold is an assumed value
    return clusters_greater_than_goodness_of_fit_weighted_size(threshold, clusters)
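The helper referenced above could be sketched as follows, assuming each cluster exposes a goodness_of_fit score and a size; both attributes and the weighting are assumptions for illustration.

def clusters_greater_than_goodness_of_fit_weighted_size(threshold, clusters):
    # each entry is (bounding_box, cluster); score a cluster by goodness of fit weighted by size
    scored = [(box, cluster, cluster.goodness_of_fit * cluster.size)
              for box, cluster in clusters]
    selected = [(box, cluster) for box, cluster, score in scored if score > threshold]
    # keep the best-scoring clusters first
    return sorted(selected,
                  key=lambda pair: pair[1].goodness_of_fit * pair[1].size,
                  reverse=True)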

Sunday, September 30, 2018

Today we continue discussing text summarization techniques. We came up with the following steps:
def gen_proposals(proposals, least_squares_estimates):
    # a proposal is an origin, a length and a breadth, written as, say, the top-left and bottom-right corners of a bounding box
    # given many known topic vectors, the classifier helps detect the best match
    # the bounding box is adjusted to maximize the intersection over union for this topic
    # text is flowing, so we can assume bounding boxes of sentences
    # fix the origin and choose fixed step sizes to determine the adherence to the regression
    # repeat for different selections of origins
    pass

def get_iou_topic(keywords, topic):
    return sum_of_square_distances(keywords, topic)

def gen_proposals_alternative_without_classes(proposals, threshold):
    # cluster all keywords in a bounding box
    # use the threshold to determine high goodness of fit to one or two clusters
    # use the goodness of fit to scatter-plot bounding boxes and their linear regression
    # use the linear regression to score and select the best bounding boxes
    # select bounding boxes with diversity relative to the overall document cluster
    # use the selected bounding boxes to generate a summary
    pass
       
The keyword selection is based on a softmax classifier and operates the same regardless of the size of the input from bounding boxes. Simultaneously, the linear regressor proposes different bounding boxes.
We have stemming and keyword selection as common helpers for the above methods. In addition to classification, we measure the goodness of fit. We also keep a scatter plot of the bounding boxes and their goodness of fit. We separate out the strategy for the selection of bounding boxes into a separate method. Finally, we determine the summary from the top discrete bounding boxes in a separate method.
Topics, unlike objects, have an uncanny ability to be represented by one or more keywords. Just as we cluster similar topics in a thesaurus, the bounding boxes need to be compared only against the thesaurus for matches.
We are not looking to reduce the topics to words or to classify the whole bag of words. What we are trying to do is find coherent clusters by determining the size of the bounding box and the general association of that cluster to domain topics. Therefore, we use well-known topic vectors from a domain instead of collocation-based feature maps; we train these as topic vectors and use them in the bounding boxes for their affiliations.
The ability to discern a domain is similar to discerning latent semantics using collocation. The latter was based on pointwise mutual information: it reduced topics to keywords for data points and used collocation data to train the softmax classifier that relied on the PMI. Here we use feature maps that are based on the associated cluster.
Keywords form different topics when the same keywords are used with different emphasis. We can measure the emphasis only by the cluster. Clusters are dynamic, but we can record the cohesiveness of a cluster and its size with the goodness-of-fit measure. As we record different goodness-of-fit values, we have a way of representing not just keywords but also their association with topics. We use the vector for the keyword and the cluster for the topic. An analogy is a complex number: it has a real part and an imaginary part. The real-world domain is untouched by the imaginary part, but the two together enable translations that are easy to visualize.
The metric to save cluster information discovered in the text leads us to topic vectors where the feature maps are different from collocation-based data. This is an operation over metadata, but it has to be layered with the collocation data.