Wednesday, October 3, 2018

We resume our discussion of the queue layer over object storage from the previous post. It is important for the queue layer to understand resource usage. These resource usage metrics may apply across all queues, per queue, or even be as granular as per service. For example, we may have a queue that services a single consumer, and its processing may return different status codes. A simple metric is the success rate of the overall processing, where success is determined by the number of success status codes generated over all processing attempts. Since this kind of metric comes from individual messages, the queue may want to aggregate it across all messages. This is best shown in a dashboard with drill-downs to different services.
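As a minimal sketch of such an aggregation, the following assumes each processed message carries a hypothetical service name and a status_code field; the queue tallies successes per service so a dashboard can drill down by service:

from collections import defaultdict

def success_rate_by_service(messages):
    """Aggregate per-message status codes into a success rate per service.
    A status code in the 2xx range is counted as a success (an assumption for illustration)."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for msg in messages:
        service = msg["service"]
        totals[service] += 1
        if 200 <= msg["status_code"] < 300:
            successes[service] += 1
    return {svc: successes[svc] / totals[svc] for svc in totals}

# Example: two services reporting status codes on their processed messages.
sample = [
    {"service": "billing", "status_code": 200},
    {"service": "billing", "status_code": 500},
    {"service": "search", "status_code": 204},
]
print(success_rate_by_service(sample))  # {'billing': 0.5, 'search': 1.0}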
  
It's important to note that metrics data is inherently time-based. Although metrics can be spot measurements, cumulative over time, or deltas between time slices, they are all associated with the time at which they were measured. Consequently, most metric data is stored in a time-series database. Also, previous measurements can be summarized with statistics while running measurements apply only to the current window of time. Therefore, some of these time-series databases can even be fixed-size.
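To illustrate the fixed-size idea, here is a minimal sketch (not any particular time-series product) that keeps raw samples only for the current window and folds older measurements into running summary statistics:

import time
from collections import deque

class FixedSizeSeries:
    """A sketch of a fixed-size time series: only the most recent N samples
    are kept raw; older samples are folded into summary statistics."""
    def __init__(self, window=60):
        self.window = deque(maxlen=window)  # raw samples for the current window
        self.count = 0                      # summary of everything ever seen
        self.total = 0.0

    def record(self, value, ts=None):
        self.count += 1
        self.total += value
        self.window.append((ts or time.time(), value))

    def current_window(self):
        return list(self.window)

    def lifetime_mean(self):
        return self.total / self.count if self.count else 0.0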
  
There are many different tools available to present such charts from data sources. Almost all such reporting mechanisms want to pull data for their charts. However, they cannot directly call the services to query their metrics because doing so may affect the functionality and performance of those services. Instead, they work passively, gathering information from log trails and database state. Most services have logging configured, and their logs are usually indexed in a time-series database of their own. In our case, we can use objects as time-series buckets that fill over time. Consequently, the pulled information can come from dedicated namespaces, buckets, and objects in the object storage.
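A minimal sketch of that pull path, assuming a boto3-style S3 client and a hypothetical key layout of metrics/<queue-name>/<YYYY>/<MM>/<DD>/<HH>.json inside a dedicated bucket:

import json
import boto3

s3 = boto3.client("s3")

def pull_hourly_metrics(bucket, queue_name, day_prefix):
    """List and read all metric objects for one queue and one day.
    The bucket and key convention are assumptions for illustration."""
    prefix = f"metrics/{queue_name}/{day_prefix}/"
    paginator = s3.get_paginator("list_objects_v2")
    samples = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            samples.append(json.loads(body))
    return samples

# e.g. pull_hourly_metrics("queue-metrics", "orders", "2018/10/03")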


#include <stdbool.h>

/* Returns true when n has fewer set bits than unset bits, counting unset bits
   only up to the most significant set bit. Taking an unsigned argument avoids
   an infinite loop on negative inputs with an arithmetic right shift. */
bool hasLessSetThanUnsetBits(unsigned int n)
{
    int set = 0;
    int unset = 0;
    while (n)
    {
        if (n & 0x1) {
            set++;
        } else {
            unset++;
        }
        n = n >> 1;
    }
    return (set < unset);
}

Tuesday, October 2, 2018

We were discussing topic detection in text documents yesterday.
The following is another method to do it:
def simultaneous_regressor_and_classifier(text, corpus):
    bounding_box = initialize_bounding_box()
    clusters = []
    regressors = []
    # finite iterations or until satisfied
    for i in range(10):
        selected_text = select_text(bounding_box, text)
        cluster = classify(selected_text)
        clusters += [(bounding_box, cluster)]
        regressor = generate_match(cluster, corpus)
        regressors += [(bounding_box, regressor)]
        bounding_box = next_bounding_box(regressors, text)
    selections = select_top_clusters(clusters)
    return summary_from(selections)


The motivation behind this method is that the whole document need not be a single bounding box. The same technique used to determine topics in the whole document by clustering word vectors also applies to smaller sections, which guides us toward locating topics within the text. What we did not specify was the selection of the bounding boxes. This does not deviate from the general practice in object detection. Each proposal has an intersection-over-union value against the ground truth, which can be used in a linear regression. This penalizes false positives and noisy proposals in favor of those that lie along the regression line. As an alternative to choosing enough random proposals for the plot, we can also be selective in the choice of bounding boxes by preferring those with a higher concentration of keywords.
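Since the text is flowing, a bounding box reduces to a span of offsets, so the intersection over union mentioned above can be computed one-dimensionally. A minimal sketch, assuming proposals are given as (start, end) sentence or character offsets:

def span_iou(box_a, box_b):
    """Intersection over union for 1-D text spans given as (start, end) offsets.
    Since text flows linearly, a bounding box over sentences reduces to a span."""
    start_a, end_a = box_a
    start_b, end_b = box_b
    intersection = max(0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union else 0.0

# e.g. a proposal covering sentences 2..8 against a ground-truth topic span 4..10
print(span_iou((2, 8), (4, 10)))  # 0.5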

Monday, October 1, 2018

We were discussing topic detection in text documents yesterday.

The ability to discern domain is similar to discerning latent semantics using collocation. The latter was based on pointwise mutual information (PMI): it reduced topics to keywords for data points and used collocation data to train a softmax classifier that relied on the PMI. Here we use feature maps from the associated cluster centroids.
Keywords form different topics when the same keywords are used with different emphasis. We can measure this emphasis only by the cluster. Clusters are dynamic, but we can record the cohesiveness of a cluster and its size with a goodness-of-fit measure. As we record different goodness-of-fit values, we have a way of representing not just the keywords but also their association with topics. We use the vector for the keyword and the cluster for the topic. An analogy is a complex number: there is a real part and an imaginary part, and while the real-world domain is untouched by the imaginary part, the two together enable translations that are easy to visualize.
The metric that saves cluster information discovered in the text leads us to topic vectors whose feature maps are different from collocation-based data. This is an operation over metadata, but it has to be layered with the collocation data.
def determining_neurons_for_topic(document, corpus):
    return cluster_centroids_of_top_clusters(corpus)

def batch_summarize_repeated_pass(text, corpus):
    doc_cluster = classify(text, corpus)
    bounding_boxes = gen_proposals(doc_cluster)
    clusters = [(FULL, doc_cluster)]  # FULL denotes the bounding box covering the whole document
    for bounding_box in bounding_boxes:
        cluster = get_cluster_bounding_box(bounding_box, text, corpus)
        clusters += [(bounding_box, cluster)]
    selections = select_top_clusters(clusters)
    return summary_from(selections)

def select_top_clusters(clusters, threshold=0.5):  # default added so the call above remains valid
    return clusters_greater_than_goodness_of_fit_weighted_size(threshold, clusters)
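The weighting helper above is referenced but not defined; the following is a minimal sketch of one possible interpretation, assuming each cluster entry carries goodness_of_fit and size attributes (both are assumptions for illustration):

def clusters_greater_than_goodness_of_fit_weighted_size(threshold, clusters):
    # One possible reading: each entry is (bounding_box, cluster) where the
    # cluster exposes goodness_of_fit and size; keep entries whose
    # size-weighted fit clears the threshold.
    selected = []
    for bounding_box, cluster in clusters:
        score = cluster.goodness_of_fit * cluster.size
        if score > threshold:
            selected.append((bounding_box, cluster))
    return selected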

Sunday, September 30, 2018

Today we continue discussing text summarization techniques. We came up with the following steps:
def gen_proposals(proposals, least_squares_estimates):
    # a proposal is an origin, length, and breadth written as, say, the top-left and bottom-right corners of a bounding box
    # given many known topic vectors, the classifier helps detect the best match
    # the bounding box is adjusted to maximize the intersection over union for this topic
    # text is flowing, so we can assume bounding boxes of sentences
    # fix the origin and choose fixed step sizes to determine adherence to the regression
    # repeat for different selections of origins
    pass

def get_iou_topic(keywords, topic):
    return sum_of_square_distances(keywords, topic)

def gen_proposals_alternative_without_classes(proposals, threshold):
    # cluster all keywords in a bounding box
    # use the threshold to determine high goodness of fit to one or two clusters
    # use the goodness of fit to scatter-plot bounding boxes and fit their linear regression
    # use the linear regression to score and select the best bounding boxes
    # select bounding boxes with diversity relative to the overall document cluster
    # use the selected bounding boxes to generate a summary
    pass
       
The keyword selection is based on a softmax classifier and operates the same way regardless of the size of the input from bounding boxes. Simultaneously, the linear regressor proposes different bounding boxes.
We have stemming and keyword selection as common helpers for the above method. In addition to classification, we measure goodness of fit. We also keep a scatter plot of the bounding boxes and their goodness of fit. The strategy for the selection of bounding boxes is separated into its own method. Finally, we determine the summary from the top discrete bounding boxes in another method.
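As a hedged sketch of that scatter-plot-and-regression step (the proposal spans, their goodness-of-fit values, and the use of proposal size as the regression feature are all assumptions for illustration):

import numpy as np

def score_proposals(proposals, goodness_of_fit):
    """Fit a least-squares line of goodness of fit against proposal size, then
    score each proposal by how little it deviates from the fitted line, so
    noisy proposals score poorly. Inputs are parallel lists of (start, end)
    spans and their goodness-of-fit values."""
    sizes = np.array([end - start for (start, end) in proposals], dtype=float)
    fit = np.array(goodness_of_fit, dtype=float)
    slope, intercept = np.polyfit(sizes, fit, 1)        # least-squares line
    residuals = np.abs(fit - (slope * sizes + intercept))
    return [(p, r) for p, r in zip(proposals, residuals)]

# e.g. pick the proposals with the smallest residuals as the best bounding boxes
# best = sorted(score_proposals(boxes, gof), key=lambda pr: pr[1])[:3]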
Topics, unlike objects, have an uncanny ability to be represented by one or more keywords. Just as we cluster similar topics in a thesaurus, the bounding boxes need only be compared against the thesaurus for matches.
We are not looking to reduce the topics to words or to classify the whole bag of words. What we are trying to do is find coherent clusters by determining the size of the bounding box and the general association of that cluster to domain topics. Therefore, we use well-known topic vectors from a domain rather than collocation-based feature maps; we train these as topic vectors and use them within the bounding boxes to determine their affiliations.
The ability to discern domain is similar to discerning latent semantics using collocation. The latter was based on pointwise mutual information (PMI): it reduced topics to keywords for data points and used collocation data to train a softmax classifier that relied on the PMI. Here we use feature maps that are based on the associated cluster.
Keywords form different topics when the same keywords are used with different emphasis. We can measure this emphasis only by the cluster. Clusters are dynamic, but we can record the cohesiveness of a cluster and its size with a goodness-of-fit measure. As we record different goodness-of-fit values, we have a way of representing not just the keywords but also their association with topics. We use the vector for the keyword and the cluster for the topic. An analogy is a complex number: there is a real part and an imaginary part, and while the real-world domain is untouched by the imaginary part, the two together enable translations that are easy to visualize.
The metric that saves cluster information discovered in the text leads us to topic vectors whose feature maps are different from collocation-based data. This is an operation over metadata, but it has to be layered with the collocation data.

Saturday, September 29, 2018


This article is a continuation of a previous post. We were referring to the design of message queues using object storage. Most message queues scale by virtue of the number of nodes in a cluster-based deployment. Object storage is accessible over S3 APIs to each of these nodes. The namespaces and buckets are organized according to the queues so that the messages may be looked up directly based on the object storage conventions. Since the storage takes care of all ingestion-related concerns, the nodes merely have to use the S3 APIs to get and put the messages. In addition, we brought up the availability of indigenous queues to be used as a background processor when the data does need to be sent deep into the object storage. This has at least two advantages. First, each queue has the flexibility to determine what it needs to do with the object. Second, the scheduled saving of all messages into the object storage works well for the latter because it is a continuous feed with very little read access.
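A minimal sketch of how a queue node might use the S3 APIs directly, assuming a hypothetical convention of one bucket per namespace and one key prefix per queue (the boto3 client is used here for illustration):

import json
import uuid
import boto3

s3 = boto3.client("s3")

def put_message(bucket, queue_name, payload):
    # Each message becomes one object under the queue's key prefix.
    key = f"{queue_name}/{uuid.uuid4()}"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(payload).encode())
    return key

def get_messages(bucket, queue_name, limit=10):
    # Look up messages directly by the object storage convention.
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=f"{queue_name}/", MaxKeys=limit)
    messages = []
    for obj in resp.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        messages.append((obj["Key"], json.loads(body)))
    return messages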
This prompted us to separate this particular solution into its own layer, which we called the cache layer, so that the queues may work with the cache or with the object storage as required. The propagation of objects from cache to storage may proceed in the background. There is no mandate for the cache-related queues to serve user workloads; they are entirely internal and specific to the system. Therefore, their schedule and operation can be set per the system configuration.
The queues, on the other hand, have to implement one of the protocols such as AMQP, STOMP, and so on. Also, customers are likely to use the queues in one of the following ways, each of which implies a different layout for the same instance and cluster size.
  1. The queues may be mirrored across multiple nodes – This means we can use a cluster
  2. The queues may be chained where one feeds into the other – This means we can use federation
  3. The queues may be arbitrary depending on application needs – This means we build our own, aka the shovel work
Consequently, the queue layer can be designed independently of the cache and the object storage. While queue services are available in the cloud, and so are the one-stop-shop cloud databases, this kind of stack holds a lot of promise in the on-premise market.
While the implementation of the queue layer is open, we can call out what it should not be. The queues should not be implemented as microservices. That defeats the purpose of the message broker as a shared platform for alleviating the dependencies that the microservices have in the first place. Also, the queues should not be collapsed into the database or the object storage unless there is a runtime to process the messages and the programmability to store and execute logic. Between these two extremes, the queue layer can be fashioned as an API gateway, a switching fabric, or anything that can handle retries, poison queues, dead letters, and journaling. Transactional semantics are not the concern here since we are relying on versioning. Finally, the queues can use existing products such as ZeroMQ or RabbitMQ if they allow customizations for on-premise deployment of this stack.
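As a sketch of the retry and dead-letter handling mentioned above (the process callable, retry budget, and dead-letter key prefix are assumptions for illustration):

import json
import boto3

s3 = boto3.client("s3")
MAX_ATTEMPTS = 3  # an illustrative retry budget

def handle_message(bucket, key, process):
    """Retry a consumer-supplied 'process' callable a few times; messages that
    keep failing are journaled under a dead-letter prefix instead of dropped."""
    payload = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            process(payload)
            s3.delete_object(Bucket=bucket, Key=key)   # acknowledged
            return True
        except Exception as err:
            last_error = str(err)
    # retries exhausted: copy to the dead-letter prefix for later inspection
    s3.put_object(
        Bucket=bucket,
        Key=f"dead-letter/{key}",
        Body=json.dumps({"message": payload, "error": last_error}).encode(),
    )
    s3.delete_object(Bucket=bucket, Key=key)
    return False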



Friday, September 28, 2018

Object storage is not inherent to a cluster, nor does it participate in another cluster. Many applications utilize a cluster to scale, and they don't interact with each other by any means other than application endpoints or a shared volume. A database, file system, or unstructured storage may be hosted elsewhere and then used with a cluster within an application. Consequently, the storage and the application form separate layers. An application that utilizes a cluster specifically for messaging is a message queue server, also called a message broker. Message queues provide an alternative way for applications and services to send and receive data. Instead of the sender directly calling the receiver, a message broker allows the sender to leave a message and proceed. This makes a microservice architecture even leaner and more focused while the messaging framework is put in its own layer. Message queues also enable a number of processors to operate on a variety of messages while being resilient to errors in general. Since messages can grow to arbitrary size and their numbers can be mind-boggling, the messages need to be saved where they can be retrieved and updated.
Object storage can store messages very well. The queues are not nested, and the hierarchy within object storage allows these messages to be grouped as easily as in a queue. Generally, a database is used to store these messages, but not for its transactional semantics; even a file system could be sufficient. Object storage, on the other hand, is otherwise perceived as backup and tertiary storage. This may come from the interpretation that it is not suitable for the read- and write-intensive data transfers generally handled by a file system or database. However, not all data needs to be written deep into the object storage at once. The requirements for object storage need not change at all if the reads and writes from the applications are handled by a background processor. There can be a middle layer acting as a proxy for a file system to the application while utilizing the object storage for persistence. This alleviates the performance cost of reading and writing deep into the private cloud each time. Therefore, a queue layer may use the object storage with slight modifications and still offer the same performance it provided before. The queues not only work as staging for application data but also as something that asynchronously dispatches it into object storage.
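A minimal sketch of that middle-layer idea, assuming a local staging directory and a boto3-style S3 client for the background flush (the directory and bucket names are placeholders):

import os
import queue
import threading
import boto3

s3 = boto3.client("s3")
pending = queue.Queue()

def stage_write(staging_dir, key, data):
    # Writes land in local staging immediately; the caller does not wait on S3.
    path = os.path.join(staging_dir, key.replace("/", "_"))
    with open(path, "wb") as f:
        f.write(data)
    pending.put((path, key))

def background_flusher(bucket):
    # A background thread propagates staged writes into object storage.
    while True:
        path, key = pending.get()
        with open(path, "rb") as f:
            s3.put_object(Bucket=bucket, Key=key, Body=f.read())
        os.remove(path)
        pending.task_done()

# threading.Thread(target=background_flusher, args=("queue-store",), daemon=True).start()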
Queue services have been commercially viable offerings and utilize a variety of protocols. The message queue is an example of a queue service that has shown substantial improvements to APIs. Since objects are also accessed via S3 web APIs, the use of such a queue service works very well if each message is stored and retrieved individually. Traditional queue services have usually maintained ordered delivery of messages, retries, and dead-letter handling, along with journaled messages and queue writes, and their writes have mostly been write-throughs that reach all the way to the disk. This service may be looked at as a cloud service that not only maintains its persistence in the object storage but also uses the hierarchy to isolate its queues.

Thursday, September 27, 2018

The centerpiece of the solution to yesterday's problem statement is this:
The queue can be used independently of the object storage.

The use of a queue facilitates distributed communications, request routing, and batched writes. It can be offloaded to hardware. Queues may utilize message queuing software such as RabbitMQ or ZeroMQ and their solution stacks. They need not be real web servers; they can route traffic like sockets on steroids. They may be deployed globally or in a partitioned server.
Moreover, not all requests need to reach the object storage. In some cases, the queue may use temporary storage from hybrid choices. The benefits of using such a queue include saving bandwidth, reducing server load, and improving request-response time. If a dedicated content store is required, the queuing and the server are typically encapsulated into a content server. This is quite the opposite of the paradigm of using object storage and replicated objects to serve the content directly from the store. The distinction here is that there are two layers of functions: the first is the queue layer, which solves distribution using techniques such as queuing, message handling, and message processor organization; the second is the compute and storage bundling in the form of a server or a store with shifting emphasis between code and storage. We will call this the storage engine and will get to it shortly.
The queue would accomplish the same as an asynchronous write, without any change in the application logic, and it can fan out to multiple recipients.
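As a minimal, in-process sketch of that asynchronous, fan-out write (the recipient callbacks are placeholders standing in for object storage and any other consumers):

import queue
import threading

outbox = queue.Queue()

def write_async(message):
    # The application "writes" by enqueuing and moves on without waiting.
    outbox.put(message)

def dispatcher(recipients):
    # A background dispatcher delivers each message to every recipient.
    while True:
        message = outbox.get()
        for deliver in recipients:
            deliver(message)
        outbox.task_done()

recipients = [
    lambda m: print("object storage gets:", m),
    lambda m: print("audit log gets:", m),
]
threading.Thread(target=dispatcher, args=(recipients,), daemon=True).start()
write_async({"event": "order-created"})
outbox.join()  # wait for delivery in this demo before the program exits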