Friday, September 7, 2018

We were discussing the stochastic gradient method for text mining.
We explore regions of interest detection using a neural net. Here we begin with a feature map. A residual learning network can take the initial sample as input and output feature maps, where each feature map is specialized in the detection of a topic. If the model allows backpropagation for training and inference, we can simultaneously detect topics, which are location independent, as well as their positions of occurrence.

With the feature map, two fully connected layers are formed: one for box regression and another for box classification.
The bounding boxes are the proposals; each box from the feature map is evaluated once for regression and once for classification.
The classifier detects the topic and the regressor adjusts the coordinates of the bounding box.
We mentioned that text is a bag of words and that we don't have the raster layout that is typical of image data. The notion here is merely the replacement of a box with a bag so that we can propose different clusters. If a particular bag emphasizes one and only one cluster, it is said to have detected a topic. Noise is avoided and may even form its own cluster.
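As a minimal sketch of that idea, assuming word-to-cluster assignments are already available from a prior clustering pass (the names and the purity threshold below are hypothetical):
using System.Collections.Generic;
using System.Linq;

// Returns the dominant cluster id if it accounts for at least `purity`
// of the words in the bag, or -1 when no single topic stands out.
// `clusterOf` maps a word to its precomputed cluster id (assumption).
static int DetectTopic(List<string> bag, Dictionary<string, int> clusterOf, double purity = 0.6)
{
    var counts = new Dictionary<int, int>();
    foreach (var word in bag)
        if (clusterOf.TryGetValue(word, out int c))
            counts[c] = counts.TryGetValue(c, out int n) ? n + 1 : 1;
    if (counts.Count == 0) return -1;
    var best = counts.OrderByDescending(kv => kv.Value).First();
    return best.Value >= purity * bag.Count ? best.Key : -1;
}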

Thursday, September 6, 2018

The kind of algorithms that work on a stream of data points is not restricted to stream-based clustering; it could also involve a cost function. For example, stochastic gradient descent likewise hoped to avoid batch processing and the repeated iterations over the batch that refine the gradient. The overall cost function is written with a term that measures how well the hypothesis holds on the current data point. To fit the next data point, the parameters are modified. As the method scans the data points, it may wander off but eventually reaches a neighborhood of the optimum. The scan over the data points is repeated so that the dependence on their sequence is eliminated. For the same reason, the data points are initially shuffled, which goes along the lines of a bag of word vectors rather than the sequence of words in a narrative. Generally, the number of passes over all the data points is restricted to a small number, say 10, and the algorithm improves the gradient from one data point to the next without having to wait for the whole batch.
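A minimal sketch of the per-example update for a linear model with squared loss; the learning rate, the shuffling scheme, and the fixed number of passes are illustrative assumptions:
using System;
using System.Linq;

// Stochastic gradient descent: one small gradient step per data point,
// with the points shuffled and only a few passes over the whole set.
static double[] SgdFit(double[][] x, double[] y, int epochs = 10, double lr = 0.01)
{
    int d = x[0].Length;
    var w = new double[d];                        // model parameters
    var rnd = new Random(0);
    for (int epoch = 0; epoch < epochs; epoch++)  // a small, fixed number of passes
    {
        var order = Enumerable.Range(0, x.Length).OrderBy(_ => rnd.Next()).ToArray();
        foreach (int i in order)                  // shuffled scan of the data points
        {
            double prediction = 0;
            for (int j = 0; j < d; j++) prediction += w[j] * x[i][j];
            double error = prediction - y[i];     // how the hypothesis holds on this point
            for (int j = 0; j < d; j++) w[j] -= lr * error * x[i][j];
        }
    }
    return w;
}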

To utilize this method for text mining, we first perform feature extraction on the text. Then we use the stochastic gradient descent method to train a classifier on the distinct features, and finally we measure the performance using the F1 score. Again, vectorizer, classifier, and evaluator become the three-stage processing in this case.
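For the evaluator stage, the F1 score is the harmonic mean of precision and recall; a small sketch for binary labels (the encoding of labels as 0/1 is an assumption):
// F1 = 2 * precision * recall / (precision + recall)
static double F1Score(int[] actual, int[] predicted)
{
    int tp = 0, fp = 0, fn = 0;
    for (int i = 0; i < actual.Length; i++)
    {
        if (predicted[i] == 1 && actual[i] == 1) tp++;       // true positives
        else if (predicted[i] == 1 && actual[i] == 0) fp++;  // false positives
        else if (predicted[i] == 0 && actual[i] == 1) fn++;  // false negatives
    }
    double precision = tp + fp == 0 ? 0 : (double)tp / (tp + fp);
    double recall = tp + fn == 0 ? 0 : (double)tp / (tp + fn);
    return precision + recall == 0 ? 0 : 2 * precision * recall / (precision + recall);
}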


Wednesday, September 5, 2018

Data stream clustering is especially suited to clustering word vectors in text that arrives continuously. It is a streaming method that takes a sequence of word vectors and forms good clusters over the stream. The goodness of fit is determined by minimizing the distances between the cluster centers and the data points.
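As an illustration of the single-pass flavor, an online k-means style update assigns each arriving vector to its nearest center and nudges that center toward it; this is a sketch of the general idea, not any particular published algorithm:
// Assign the incoming word vector to the nearest center and move that
// center toward it; counts[k] tracks how many points center k has absorbed.
static void StreamUpdate(double[] point, double[][] centers, int[] counts)
{
    int best = 0;
    double bestDist = double.MaxValue;
    for (int k = 0; k < centers.Length; k++)
    {
        double dist = 0;
        for (int j = 0; j < point.Length; j++)
        {
            double diff = centers[k][j] - point[j];
            dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = k; }
    }
    counts[best]++;
    double step = 1.0 / counts[best];  // shrinking step keeps centers stable
    for (int j = 0; j < point.Length; j++)
        centers[best][j] += step * (point[j] - centers[best][j]);
}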

Stream-based clustering comes in many varieties, such as 1) Growing Neural Gas based algorithms, 2) hierarchical stream-based algorithms, 3) density-based stream algorithms, 4) grid-based stream algorithms, and 5) partitioning stream algorithms. Of these, hierarchical stream-based algorithms such as BIRCH are very popular. BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies and is known for incrementally clustering incoming data points. BIRCH has received an award for standing the “test of time”.
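BIRCH summarizes a subcluster with a clustering feature: the count of its points, their per-dimension linear sum, and the sum of their squared values. Clustering features are additive, which is what makes incremental maintenance cheap; a sketch of the bookkeeping (the class shape is illustrative):
// Clustering feature for one subcluster: count N, per-dimension linear
// sum LS, and scalar sum of squares SS. Adding a point or merging two
// features is just component-wise addition.
class ClusteringFeature
{
    public int N;
    public double[] LS;
    public double SS;

    public ClusteringFeature(int dimensions) { LS = new double[dimensions]; }

    public void Add(double[] point)
    {
        N++;
        for (int j = 0; j < point.Length; j++)
        {
            LS[j] += point[j];
            SS += point[j] * point[j];
        }
    }

    public double[] Centroid()  // derived from the summary, never stored
    {
        var c = new double[LS.Length];
        for (int j = 0; j < LS.Length; j++) c[j] = LS[j] / N;
        return c;
    }
}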

The streaming algorithms are helpful so long as the incoming stream is viewed as a sequence of word vectors. However, word vectorization itself must happen prior to clustering. Some view this as a drawback of this family of algorithms, because word vectorization is done with the help of neural nets and a softmax classifier over the entire document, and there are ways to use different layers of the neural net to form regions of interest. To date there has been no application of a hybrid that detects regions of interest in a neural net layer with the help of stream-based clustering. There is, however, a way to separate the stage of word vector formation from the stream-based clustering if all the words have previously well-known word vectors that may be looked up from something similar to a dictionary.
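A sketch of that separation, assuming a table of previously computed vectors keyed by word (the table itself and the choice to skip unknown words are assumptions):
using System.Collections.Generic;

// Turn a token stream into a stream of word vectors by dictionary lookup,
// so the clustering stage never has to train embeddings itself.
static IEnumerable<double[]> Vectorize(IEnumerable<string> tokens,
                                       IReadOnlyDictionary<string, double[]> lookup)
{
    foreach (var token in tokens)
        if (lookup.TryGetValue(token, out var vector))  // skip out-of-vocabulary words
            yield return vector;
}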

#codingexercise
// 851 = 23 x 37
static bool isDivisibleBy851(uint n)
{
    return n % 23 == 0 && n % 37 == 0;
}
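This works because 23 and 37 are coprime: a number is divisible by their product 851 exactly when it is divisible by each of them.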


Tuesday, September 4, 2018

Applying regions of interest to latent topic detection in text document mining:
Regions of interest are a useful technique to detect objects in raster data, which is data laid out in the form of a matrix. The positional information is useful in aggregating data within bounding boxes, with the hope that one or more boxes will stand out from the rest and will likely represent the object. When the data is representative of the semantic content and the aggregation can be performed, the bounding boxes become very useful differentiators from the background and thus help detect objects.
Text is usually considered to flow with no limit to the size, number, and arrangement of sentences – the logical units of semantic content. Keywords, however, represent significant information, and their relative positions also give some notion of the topic segmentation within the text. Several keywords may represent a local topic. These topics are similar to objects, and therefore topic detection may be considered similar to object detection.
Furthermore, topics are represented by a group of keywords. These groupings are not by themselves meaningful, because their number and content can vary widely while still denoting the same topic. It is also very hard to treat these groupings as any standardized form of representing topics. Fortunately, words are represented by vectors that carry mutual information with other keywords. When we treat a group of words as a bag of word vectors, we get plenty of information on what stands out in the document as keywords.
The leap from words to groups is not as straightforward as vector addition. With vector addition, we lose the notion that some keywords represent a topic better than any combination of others. On the other hand, we know classifiers are able to represent clusters that are more meaningful. When we partition the keywords into a discrete set of clusters, we are able to represent some notion of topics in the form of clusters.
Therefore clusters, and not groups, become much more representative of topics and subtopics. Positional information merely allows us to cluster only those keywords that appear in a bounding bag of words. By choosing different bags of words based on their positional information, we can aspire to detect topics just as we were determining regions of interest. The candidates within a bounding bag may change as we window the bag over the text, knowing that the semantic content is progressive in a text.
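A sketch of producing candidate bags by windowing the text: each bag is just an offset and a size over the token sequence (the window length and stride below are illustrative choices):
using System.Collections.Generic;
using System.Linq;

// Slide a fixed-size window over the tokens; each (offset, bag) pair is a
// candidate bounding bag whose word vectors can then be clustered.
static IEnumerable<(int Offset, List<string> Bag)> BoundingBags(
    string[] tokens, int size = 50, int stride = 25)
{
    for (int offset = 0; offset + size <= tokens.Length; offset += stride)
        yield return (offset, tokens.Skip(offset).Take(size).ToList());
}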
However, clustering is not cheap, and the sizing and rolling over of bags of words across the text from start to finish produces a lot of combinations. This implies that clustering is beneficial only when it is done with sufficient samples, such as the whole document, and done once over all the word vectors taken in the bag of words representing the document. How then do we view several bags of words as local topics within the overall document?
While vector and cluster formation may be expensive, if their combinations could be made much simpler, then we would have the ability to try different combinations enhanced with positional information to form regions of interest. The positional information is no more than an offset and a size, but the combinations are harder to represent and compute without re-clustering selective subsets of word vectors. It is this challenge that could significantly boost the simultaneous detection of topics as well as keywords in a streaming model of the text.

Monday, September 3, 2018

Public cloud providers make it easy to write application gateways in the cloud without requiring the gateway to be part of the object storage. Such an application gateway has path-based rules with path mappings that are essentially url rewrites, such as /video or /image. These application gateways are minimal and amount to static gateway routing, which requires attention from administrators to where copies of objects are maintained and to the lifetime of these rules. By pushing the gateway into the object storage, we ensure that this maintenance no longer applies.

The object storage is a no-maintenance storage. Although it doesn't claim to be a content-distribution network, our emphasis was that it is very well suited to be one, given how it stores objects and their copies. All we needed was a gateway service. And we said we don't need to maintain it outside the storage, because the url rewrites can be done automatically by pushing the gateway into the object storage tier. As the load on an object increases, copies may temporarily be made, and url mappings may be added to distribute the load. These rules may subsequently be removed as the copies of the objects are decommissioned. This is triggered only when an object has sufficiently aged without modification and the traffic on it has risen above a threshold. Alternatively, the top ten percent of the most heavily read objects may be subject to this load distribution. The load distribution may even be site specific so that the content distribution network continues to work as before.
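A sketch of the trigger just described; the thresholds and the metadata available per object are assumptions for illustration:
using System;

// An object qualifies for temporary copies when it has aged without
// modification and its read traffic has crossed a threshold.
static bool ShouldDistribute(DateTime lastModified, long readsPerHour,
                             TimeSpan minAge, long readThreshold)
{
    bool aged = DateTime.UtcNow - lastModified >= minAge;
    bool hot = readsPerHour >= readThreshold;
    return aged && hot;
}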
#codingexercise
A semi-perfect number is the sum of some of its proper divisors. We can determine whether a number is the sum of a subset of its divisors with the classic subset-sum recursion:
static bool IsSubsetSum(List<int> factors, int sum)
{
    if (sum == 0) return true;
    if (sum < 0 || factors.Count == 0) return false;
    int last = factors[factors.Count - 1];
    factors.RemoveAt(factors.Count - 1);
    // either exclude the last divisor, or include it when it still fits
    bool result = IsSubsetSum(factors, sum) ||
                  (last <= sum && IsSubsetSum(factors, sum - last));
    factors.Add(last);  // restore the list for the caller
    return result;
}
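For example, 12 is semi-perfect: its proper divisors are 1, 2, 3, 4, and 6, and IsSubsetSum(new List<int> { 1, 2, 3, 4, 6 }, 12) returns true because 2 + 4 + 6 = 12.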

Sunday, September 2, 2018

URL rewrites in the gateway within object storage.
One of the most transparent things a gateway can do is rewrite urls so that the incoming and outbound addresses are fully logged. The url rewrites in our case can be interpreted as translation of a universal address into a site-specific address for an object. The server address in the url changes from the gateway to the address of an object storage head service, or specifically a virtual data center where the object is guaranteed to be found. Since the gateway service in our case is part of the object storage, we want the url rewritten to the site where the object will be found. Since object storage with a gateway is mostly used for deployments with three or more sites, the url is rewritten to the site-specific address where the site is the closest to the origin of the incoming request.
In this kind of url rewrite, we are leveraging the translation to the site. Since the gateway is part of the object storage, as discussed in our previous posts, even the namespace-bucket-object part of the address is a candidate for address translation. In this case, an object located within a bucket and part of a namespace ns1 may be copied to another namespace, bucket, and object, which is then made available to the request. This destination object storage may have been set up temporarily for the duration of a peak load. The gateway therefore enables object storage to utilize new and dynamic sites within a replication group that has been created to increase geographical content distribution.
Finally, the gateway may change the url rewrites dynamically instead of relying on static rules. These rules were previously written in the form of regex in a conf file. Instead, a method, call it the classifier, can evaluate these configurations at runtime, allowing one or more parameters to be formatted into the declaration and evaluation of the rules.
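A sketch of such a runtime classifier, assuming the rules are kept as regex patterns paired with templated replacements (the rule shape and the {site} placeholder are illustrative):
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Evaluate rewrite rules at runtime instead of from a static conf file.
// A replacement may carry parameters, such as the site closest to the caller.
static string Rewrite(string url, List<(string Pattern, string Replacement)> rules,
                      string nearestSite)
{
    foreach (var (pattern, replacement) in rules)
        if (Regex.IsMatch(url, pattern))
            return Regex.Replace(url, pattern, replacement.Replace("{site}", nearestSite));
    return url;  // no rule applies; pass the address through unchanged
}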

Saturday, September 1, 2018

Distributed Gateways
This answers the question of whether, if the gateway is a service within the object storage, gateways can be chained across object storage instances. Along the same lines, if the current object storage does not resolve the address for an object located in its storage pools, is it possible to forward the query to another object storage? These questions imply that the resolver merely needs to forward the queries it cannot answer to a default, pre-registered outbound destination. In a distributed gateway, a query can be interpreted simply from the namespace-bucket-object hierarchy to say whether a request belongs to this store or not. If it does not, the store simply forwards it to another object storage. This is somewhat different from the original notion that the address is opaque to the user and has no interpretable part that can determine the site to which the object belongs. The linked object storage does not even need to take the time to search for the object within its store to see if it exists. It can merely translate the address, with the help of a registry, to know whether the object belongs to it. This shallow lookup means a request can be forwarded faster to another linked object storage and ultimately to where the object is guaranteed to be found. Linked storage imposes no criteria for the object stores to be similar; as long as the forwarding logic is enabled, any implementation can exist in each of the stores for translation, lookup, and return. This could have been avoided entirely if the opaque addresses were hashes and the destination object storage were determined from a hash table. Whether we use routing tables or a static hash table, the networking over the object storage can be its own layer, facilitating request resolution at the different object storage instances.
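A sketch of the shallow lookup and forwarding, assuming each store keeps a registry of the namespaces it owns and a single pre-registered next hop (both are assumptions):
using System.Collections.Generic;

// Shallow lookup: decide ownership from the namespace alone, without
// searching the store, and forward everything else along the chain.
class LinkedGateway
{
    private readonly HashSet<string> ownedNamespaces;  // local registry
    private readonly LinkedGateway nextHop;            // default outbound destination

    public LinkedGateway(HashSet<string> owned, LinkedGateway next)
    {
        ownedNamespaces = owned;
        nextHop = next;
    }

    public string Resolve(string ns, string bucket, string obj)
    {
        if (ownedNamespaces.Contains(ns))
            return $"local://{ns}/{bucket}/{obj}";     // guaranteed to be found here
        return nextHop != null ? nextHop.Resolve(ns, bucket, obj) : null;
    }
}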