Wednesday, September 5, 2018

While forming vectors and clusters may be expensive, if combining them could be made much simpler, we would be able to try different combinations, enhanced with positional information, to form regions of interest. The positional information is no more than an offset and a size, but the combinations are hard to represent and compute without re-clustering selected subsets of word vectors. Meeting this challenge could significantly boost the simultaneous detection of topics and keywords in a streaming model of the text.

Data stream clustering is especially suited to clustering word vectors in text that arrives continuously. It is a streaming method that takes a sequence of word vectors and groups them into good clusters as they arrive. The goodness of fit is determined by minimizing the distances between each cluster center and its data points.
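As a minimal sketch of the idea above, the following uses sequential (online) k-means, which is only one simple way to cluster a stream and is not the specific algorithm the post has in mind. The class name StreamingKMeans and its seeding of centers from the first k vectors are assumptions made for illustration.

```python
import math


class StreamingKMeans:
    """Sequential k-means: a single pass that keeps one running mean per cluster."""

    def __init__(self, k):
        self.k = k
        self.centers = []  # cluster centers, as lists of floats
        self.counts = []   # number of vectors absorbed by each center

    def add(self, vector):
        """Assign a vector to its nearest center and return that cluster's index."""
        if len(self.centers) < self.k:
            # Assumption: seed the first k centers with the first k vectors seen.
            self.centers.append(list(vector))
            self.counts.append(1)
            return len(self.centers) - 1
        nearest = min(range(self.k),
                      key=lambda i: math.dist(self.centers[i], vector))
        self.counts[nearest] += 1
        rate = 1.0 / self.counts[nearest]
        # Move the center toward the new vector so it stays the running mean,
        # which keeps the distances to its members small.
        self.centers[nearest] = [c + rate * (x - c)
                                 for c, x in zip(self.centers[nearest], vector)]
        return nearest
```

Each call to add touches only k centers, so the cost per incoming word vector does not grow with the length of the stream.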

There are many varieties of stream-based algorithms, such as 1) Growing Neural Gas based algorithms, 2) hierarchical stream-based algorithms, 3) density-based stream algorithms, 4) grid-based stream algorithms and 5) partitioning stream algorithms. Of these, hierarchical stream-based algorithms such as BIRCH are very popular. BIRCH stands for Balanced Iterative Reducing and Clustering Using Hierarchies and is known for incrementally clustering incoming data points. BIRCH has received an award for standing the “test of time”.

The streaming algorithms are helpful so long as the incoming stream is viewed as a sequence of word vectors. However, word vectorization itself must happen prior to clustering. Some view this as a drawback of this family of algorithms, because word vectorization is done with the help of a neural net and a softmax classifier over the entire document, and there are ways to use different layers of the neural net to form regions of interest. To date there has been no application of a hybrid that detects regions of interest in a neural net layer with the help of stream-based clustering. There is, however, a way to separate the stage of word vector formation from the stream-based clustering: if all the words have previously well-known word vectors, these may be looked up from something similar to a dictionary.
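The separation of the two stages can be sketched as a lookup that turns tokens into a stream of vectors. This is only an illustration: the function name vector_stream and the embeddings dictionary are hypothetical, and the vectors are assumed to have been computed beforehand by some other means.

```python
def vector_stream(tokens, embeddings):
    """Yield (position, word, vector) for every token that has a known vector.

    embeddings is assumed to be a dict of precomputed word vectors;
    tokens without an entry are skipped.
    """
    for position, token in enumerate(tokens):
        word = token.lower()
        if word in embeddings:
            yield position, word, embeddings[word]
```

Because the position is carried along with each vector, a downstream clusterer can still use offsets when forming regions of interest.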

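// 851 = 23 * 37, and 23 and 37 are distinct primes, so n is divisible by 851
// exactly when it is divisible by both. The two helpers are assumed to exist.
bool isDivisibleBy851(uint n)
{
    return isDivisibleBy23(n) && isDivisibleBy37(n);
}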


Tuesday, September 4, 2018

Applying regions of interest to latent topic detection in text document mining:
Regions of interest is a useful technique for detecting objects in raster data, which is data laid out in the form of a matrix. The positional information is useful in aggregating data within bounding boxes, with the hope that one or more boxes will stand out from the rest and will likely represent the object. When the data is representative of the semantic content and the aggregation can be performed, the bounding boxes become very useful differentiators from the background and thus help detect objects.
Text is usually considered free-flowing, with no limit on the size, number and arrangement of sentences, which are the logical units of semantic content. Keywords, however, represent significant information, and their relative positions give some notion of how topics are segmented within the text. Several keywords may represent a local topic. These topics are similar to objects, and therefore topic detection may be considered similar to object detection.
Furthermore, topics are represented by a group of keywords. These groupings are not meaningful on their own because their number and content can vary widely and yet denote the same topic. It is also very hard to treat these groupings as any standardized form of representing topics. Fortunately, words are represented by vectors that capture mutual information with other keywords. When we treat a group of words as a bag of word vectors, we get plenty of information on what stands out in the document as keywords.
The leap from words to groups is not as straightforward as vector addition. With vector addition, we lose the notion that some keywords represent a topic better than any other combination. On the other hand, we know classifiers are able to represent clusters that are more meaningful. When we partition the keywords into a discrete set of clusters, we are able to represent some notion of topics in the form of clusters.
Therefore clusters, not groups, become much more representative of topics and subtopics. Positional information merely allows us to cluster only those keywords that appear in a bounding bag of words. By choosing different bags of words based on their positional information, we can aspire to detect topics just as we would determine regions of interest. The candidates within these bounding bags may change as we slide the window over the text, knowing that the semantic content progresses through the text.
However, clustering is not cheap, and sizing and rolling bags of words across the text from start to finish produces a lot of combinations. This implies that clustering is beneficial only when it is done with sufficient samples, such as the whole document, and done once over all the word vectors in the bag of words representing the document. How then do we view several bags of words as local topics within the overall document?
While forming vectors and clusters may be expensive, if combining them could be made much simpler, we would be able to try different combinations, enhanced with positional information, to form regions of interest. The positional information is no more than an offset and a size, but the combinations are hard to represent and compute without re-clustering selected subsets of word vectors. Meeting this challenge could significantly boost the simultaneous detection of topics and keywords in a streaming model of the text.
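The windowing of bounding bags can be sketched as follows. This is a simplified illustration in which counting known keywords stands in for the clustering step; the function names bounding_bags and densest_bag, and the idea of scoring a bag by keyword hits, are assumptions and not part of the method described above.

```python
def bounding_bags(tokens, size, step=1):
    """Yield (offset, bag) pairs, where bag is tokens[offset:offset + size]."""
    last_start = max(len(tokens) - size, 0)
    for offset in range(0, last_start + 1, step):
        yield offset, tokens[offset:offset + size]


def densest_bag(tokens, keywords, size, step=1):
    """Return (offset, hits) for the first window holding the most keywords."""
    best = (0, -1)
    for offset, bag in bounding_bags(tokens, size, step):
        hits = sum(1 for token in bag if token in keywords)
        if hits > best[1]:
            best = (offset, hits)
    return best
```

Each bag is fully described by its offset and size, which is all the positional information the text calls for; a real implementation would replace the keyword count with clustering of the word vectors inside the bag.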

Monday, September 3, 2018

Public cloud providers make it easy to write application gateways in the cloud without requiring the gateway to be part of the object storage. Such an application gateway has path-based rules with path mappings that are essentially URL rewrites, such as /video or /image. These application gateways are minimal and equivalent to static gateway routing: administrators must keep track of where copies of objects are maintained and of the lifetime of these rules. By pushing the gateway into the object storage, we remove the need for this maintenance.
Since the gateway service in our case is part of the object storage, we wanted the URL rewritten to the site where the object would be found. Since object storage with a gateway is mostly used for deployments with three or more sites, the URL is rewritten to the site-specific address of the site closest to the origin of the incoming request.

In this kind of URL rewrite, we are leveraging the translation to the site. Since the gateway is part of the object storage, as discussed in our previous posts, even the namespace-bucket-object part of the address is a candidate for address translation. In this case, an object located within a bucket in a namespace ns1 may be copied to another namespace, bucket and object, which is then made available to the request. This destination object storage may have been set up temporarily for the duration of a peak load. The gateway therefore enables object storage to utilize new and dynamic sites within a replication group that has been created to increase geographical content distribution.

The object storage is a no-maintenance storage. Although it doesn't claim to be a content-distribution network, our emphasis was that it is very well suited to be one, given how it stores objects and their copies. All we needed was a gateway service. And we said we don't need to maintain it outside, because the URL rewrites can be done automatically by pushing the gateway into the object storage tier. As the load on an object increases, copies may temporarily be made and URL mappings may be added to distribute the load. These rules may subsequently be removed as the copies of the objects are decommissioned. This load distribution is triggered only when an object has aged sufficiently without modification and its traffic has risen above a threshold. Alternatively, the top ten percent of the most heavily read objects may be subject to this load distribution. The load distribution may even be site-specific, so that the content distribution network continues to work as before.
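The lifecycle of these temporary mappings can be sketched as below. This is a toy model under stated assumptions: the class LoadDistributor, the "/copies" prefix for temporary copies, and the per-interval read counters are all hypothetical, and the actual copying of objects is left out.

```python
class LoadDistributor:
    """Adds rewrite rules for hot, aged objects and drops them when load falls."""

    def __init__(self, read_threshold, min_age_seconds):
        self.read_threshold = read_threshold
        self.min_age_seconds = min_age_seconds
        self.reads = {}     # object path -> reads seen in the current interval
        self.rewrites = {}  # object path -> path of its temporary copy

    def record_read(self, path):
        self.reads[path] = self.reads.get(path, 0) + 1

    def rebalance(self, last_modified, now):
        """Re-evaluate every object that was read or already has a rule."""
        for path in set(self.reads) | set(self.rewrites):
            count = self.reads.get(path, 0)
            aged = now - last_modified.get(path, now) >= self.min_age_seconds
            if count > self.read_threshold and aged:
                # Hypothetical location of the temporary copy.
                self.rewrites.setdefault(path, "/copies" + path)
            else:
                # Load has dropped or the object changed: decommission the rule.
                self.rewrites.pop(path, None)
        self.reads = {}  # start a new measurement interval

    def resolve(self, path):
        return self.rewrites.get(path, path)
```

Calling rebalance on a schedule means rules appear and disappear without an administrator editing a configuration file, which is the point of keeping the gateway inside the storage tier.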
#codingexercise
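A semi-perfect number is the sum of some of its proper divisors. We can determine whether a subset of the factors adds up to a given sum with the following recursion, which can be memoized to make it dynamic programming:
bool IsSubsetSum(List<int> factors, int count, int sum)
{
    // Only the first `count` factors are considered; the list is never modified,
    // so both recursive branches below see the same factors.
    if (sum == 0) return true;
    if (sum < 0 || count == 0) return false;
    int last = factors[count - 1];
    // A factor larger than the remaining sum cannot be part of the subset.
    if (last > sum)
        return IsSubsetSum(factors, count - 1, sum);
    // Either exclude the last factor or include it.
    return IsSubsetSum(factors, count - 1, sum) || IsSubsetSum(factors, count - 1, sum - last);
}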
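For comparison, here is a bottom-up version of the same subset-sum idea in Python, offered as a sketch rather than a translation of the exercise above. The helper names proper_divisors and is_semiperfect are chosen for illustration.

```python
def proper_divisors(n):
    """Return the divisors of n other than n itself (unordered)."""
    divisors = [1] if n > 1 else []
    i = 2
    while i * i <= n:
        if n % i == 0:
            divisors.append(i)
            if n // i != i:
                divisors.append(n // i)
        i += 1
    return divisors


def is_semiperfect(n):
    """True when some subset of the proper divisors of n sums to n."""
    # reachable[s] is True when a subset of the divisors seen so far sums to s.
    reachable = [True] + [False] * n
    for d in proper_divisors(n):
        # Walk downwards so each divisor is used at most once.
        for s in range(n, d - 1, -1):
            if reachable[s - d]:
                reachable[s] = True
    return n > 0 and reachable[n]
```

For example, 12 is semi-perfect because 2 + 4 + 6 = 12, while 70 is not, even though its proper divisors add up to more than 70.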

Sunday, September 2, 2018

URL rewrites in gateway within object storage. 
One of the most transparent things a gateway can do is to rewrite URLs so that the incoming and outbound addresses are fully logged. These URL rewrites in our case could be interpreted as translation of a universal address to a site-specific address for an object. The server address in the URL changes from that of the gateway to the address of an object storage head service, or specifically of a virtual data center where the object is guaranteed to be found. Since the gateway service in our case is part of the object storage, we wanted the URL rewritten to the site where the object would be found. Since object storage with a gateway is mostly used for deployments with three or more sites, the URL is rewritten to the site-specific address of the site closest to the origin of the incoming request.
In this kind of URL rewrite, we are leveraging the translation to the site. Since the gateway is part of the object storage, as discussed in our previous posts, even the namespace-bucket-object part of the address is a candidate for address translation. In this case, an object located within a bucket in a namespace ns1 may be copied to another namespace, bucket and object, which is then made available to the request. This destination object storage may have been set up temporarily for the duration of a peak load. The gateway therefore enables object storage to utilize new and dynamic sites within a replication group that has been created to increase geographical content distribution.
Finally, the gateway may change the URL rewrites dynamically instead of relying on static rules. These rules were previously written in the form of regex in a conf file. However, they can instead be handled by a method, called the classifier, that evaluates these configurations at runtime, allowing one or more parameters to be formatted in the declaration and evaluation of these rules.
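Both translations described above, the host to the closest site and the namespace-bucket-object path to a copy, can be sketched with the standard library. The function rewrite, the sites mapping, the distance_to callable and the path_map are all assumptions made for illustration, as are the example.com hosts in the usage below.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite(url, sites, distance_to, path_map=None):
    """Rewrite the gateway address in url to a site-specific address.

    sites maps a site name to its host; distance_to(site) returns how far
    that site is from the origin of the request; path_map optionally maps
    a namespace/bucket/object path to the path of a copy.
    """
    parts = urlsplit(url)
    closest = min(sites, key=distance_to)
    path = (path_map or {}).get(parts.path, parts.path)
    return urlunsplit(parts._replace(netloc=sites[closest], path=path))


sites = {"east": "east.storage.example.com", "west": "west.storage.example.com"}
distances = {"east": 120, "west": 15}
print(rewrite("https://gateway.example.com/ns1/bucket1/video.mp4",
              sites, distances.get))
```

Only the host and, when a mapping exists, the path change; the scheme and query string pass through untouched so the rewritten request stays equivalent to the original.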
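A runtime-evaluated rule set with formatted parameters might look like the sketch below. The rule patterns, the templates, the {site} parameter and the function name classify are hypothetical; the point is only that the same declarations can be evaluated per request instead of being fixed in a conf file.

```python
import re

# Hypothetical rules: a regex with named groups and a template to format.
RULES = [
    (r"^/video/(?P<name>.+)$", "https://{site}/media/videos/{name}"),
    (r"^/image/(?P<name>.+)$", "https://{site}/media/images/{name}"),
]


def classify(path, site, rules=RULES):
    """Return the rewritten address for path, or None when no rule matches."""
    for pattern, template in rules:
        match = re.match(pattern, path)
        if match:
            # Parameters come from the request (named groups) and from
            # runtime context (the site chosen for this request).
            return template.format(site=site, **match.groupdict())
    return None
```

Because the rules are plain data, they can be added or removed while the gateway is running.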

Saturday, September 1, 2018

Distributed Gateways
This answers the question: if the gateway is a service within the object storage, can gateways be chained across object storage? Along the lines of the previous question, if the current object storage does not resolve the address for an object located in its storage pools, is it possible to distribute the query to another object storage? These kinds of questions imply that the resolver merely needs to forward the queries that it cannot answer to a default, pre-registered outbound destination. In a distributed gateway, the resolver can make sense of a query simply from the namespace-bucket-object hierarchy and say whether a request belongs to it or not. If it does not, it simply forwards the request to another object storage.
This is somewhat different from the original notion that the address is something opaque to the user and does not have any interpretable part that can determine the site to which the object belongs. The linked object storage does not even need to take time to search for an object within its store to see if it exists. It can merely translate the address, with the help of a registry, to know whether the object belongs to it. This shallow lookup means a request can be forwarded faster to another linked object storage and ultimately to where it may be guaranteed to be found. The linked storage places no requirement on the object stores to be similar; as long as the forwarding logic is enabled, any implementation can exist in each of the storages for translation, lookup and return. The forwarding could be avoided altogether if the opaque addresses were hashes and the destination object storage was determined from a hash table. Whether we use routing tables or a static hash table, the networking over the object storage can be its own layer, facilitating request resolution at different object storages.
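The shallow lookup and forwarding can be sketched as a chain of stores. The class ObjectStore, its namespace registry and the forward_to link are assumptions for illustration, and the sketch assumes the chain of forwarding destinations has no cycles.

```python
class ObjectStore:
    """An object storage whose gateway forwards addresses it does not own."""

    def __init__(self, name, namespaces, forward_to=None):
        self.name = name
        self.namespaces = set(namespaces)  # registry of namespaces owned here
        self.forward_to = forward_to       # pre-registered outbound destination

    def resolve(self, address):
        """Return the name of the store that owns address, or None."""
        namespace = address.strip("/").split("/", 1)[0]
        if namespace in self.namespaces:
            # Shallow lookup: the registry answers without searching the store.
            return self.name
        if self.forward_to is not None:
            return self.forward_to.resolve(address)
        return None


store_c = ObjectStore("c", ["ns3"])
store_b = ObjectStore("b", ["ns2"], forward_to=store_c)
store_a = ObjectStore("a", ["ns1"], forward_to=store_b)
print(store_a.resolve("/ns3/bucket1/object1"))
```

Each store only inspects the leading namespace, so the stores in the chain need not share an implementation as long as they agree on the address hierarchy.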

Friday, August 31, 2018

The gateway as a classifier.
The rules of a gateway need not be mere regex translation of an incoming address to another site-specific address. We are dealing with objects, and every part of the object endpoint address, such as the hierarchical namespace, bucket and object, may be translated to an altogether different address that still points to the same copy of the object. For that matter, hashes of web addresses may be translated, so that the caller may need only a tiny URL to access an object while internally the same copy of the object is provided at lightning speed from site-specific buckets. We are not just putting the gateway on steroids; we are also making it smarter by allowing the user to customize the rules. These rules can be authored in the form of expressions and statements, much like a program with lots of if-then conditions ordered by their execution sequence. The gateway does more than an HTTP proxy or a message queue server. It is a lookup of objects without sacrificing performance and without restrictions on the organization of objects within a store or across distributed stores. It works much like a router, and although we have referred to the gateway as a networking layer over storage, it provides a query execution service as well. All the queries are similar in nature. They are mostly web addresses of objects. The storage server only knows about three internal copies of an object for durability. These copies share the same address, and different objects have different web addresses. What a storage server may think of as different objects may even be the same object for the user. How the user organizes the objects in namespaces and buckets may be based on her own rules that go beyond site replication. If the gateway can route requests for the same object to different sites, there is nothing preventing the gateway from letting the user add custom rules that use this address translation for purposes other than geography-based content distribution.
Fundamentally, a separate address for each object does not benefit the customer when she wants to hand out the same address for content that is served by two or more copies of the same object. Where those objects are located and how the address translation works may be based on static site-based routing via regex or on dynamic routing based on rules and programs. Moreover, the gateway has the ability to interpret aliases of addresses that the object storage cannot.
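User-authored rules as ordered if-then conditions, including alias interpretation, can be sketched as follows. The helper make_gateway, the ALIASES table and the two example rules are hypothetical and only illustrate the shape such customization might take.

```python
def make_gateway(rules, default):
    """Build a router from (condition, action) pairs, tried in order."""
    def route(address):
        for condition, action in rules:
            if condition(address):
                return action(address)
        return default(address)
    return route


# Hypothetical user data: short aliases for full object addresses.
ALIASES = {"/t/abc123": "/ns1/bucket1/report.pdf"}

gateway = make_gateway(
    rules=[
        # Rule 1: expand a tiny-URL alias into the full object address.
        (lambda a: a in ALIASES, lambda a: ALIASES[a]),
        # Rule 2: serve anything under /archive from a site-specific bucket.
        (lambda a: a.startswith("/archive/"), lambda a: "/ns1/cold-site" + a),
    ],
    default=lambda a: a,  # no rule matched: pass the address through
)
```

The first matching rule wins, so the order in which the user lists the rules is the execution sequence.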

Thursday, August 30, 2018

The case of the Cloud Gateways for storage. 
Some view the cloud gateway as a device that can be placed at the customer’s premises to translate low-level file commands into high-level HTTP requests that use cloud storage. Public cloud providers distort it further by saying the gateway is provided from the cloud. Gateways offer easy integration into existing infrastructure because they route requests between options. Sometimes direct integration can be very expensive, requiring manipulation of APIs for create, update and delete. Gateways, on the other hand, act as adapters and do away with the cost of integration by leveraging existing commands.
Others use gateways for segregating their workloads. Not every store in an organization is used uniformly, and gateways help to consolidate the infrastructure behind a common entrypoint. This allows users to keep the same construct that they already have, while allowing the planners to separate the storage into high and low usage cases.
Cloud gateways can also be used for heterogeneous stores, where the data existing on one storage need not be replicated to another storage as long as both are accessible from the same common entrypoint.
Regardless of what a gateway means to someone, gateways find universal appeal in their utility. Gateways distribute traffic. This works exceptionally well when a gateway routes requests to on-premise or cloud object stores. The on-premise store helps with closer access to data. The same concept may apply to geographical distribution of similar content, where each object storage serves a specific region. In this case, replication may need to be set up between the different object storages. We could leverage an object storage replication group to do automatic replication. It might be considered a bottleneck if the same object storage is used. This is different from redirecting requests to separate servers or caches. However, shared services may offer a service level agreement on par with that of an individual service for requests. Since a gateway will not see a performance degradation when sending to a proxy server or a shared dedicated store, it works in both these cases. Replacing a shared dedicated store with a shared dedicated storage such as an Object Storage is therefore also a practical option. Moreover, a cache generally improves performance over what might have been incurred in going to the backend. That is why different proxy servers behind a gateway could maintain their own cache. A dedicated cache service like AppFabric may also be sufficient to handle requests. In this case, we are consolidating proxy server caches with a dedicated cache.
#codingexercise
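Determine if a number is perfect. A perfect number is the sum of all of its proper divisors, that is, all of its divisors other than the number itself.
// Requires System.Linq for Sum() and System.Collections.Generic for List<T>.
bool IsPerfect(uint n)
{
    if (n < 2) return false; // 1 has no proper divisors that sum to it
    var factors = GetFactors(n);
    return n == factors.Sum();
}
// Returns the proper divisors of n (every divisor except n itself).
List<int> GetFactors(uint n)
{
    var ret = new List<int>();
    ret.Add(1);
    // i <= n / i is the same bound as i <= sqrt(n), without overflow or rounding.
    for (uint i = 2; i <= n / i; i++) {
        if (n % i == 0) {
            ret.Add((int)i); // add low factor
            if (n / i != i) ret.Add((int)(n / i)); // add high factor
        }
    }
    return ret;
}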
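The same check can be written compactly in Python by summing divisor pairs up to the square root. This is a sketch alongside the exercise above, with the function name is_perfect chosen for illustration.

```python
def is_perfect(n):
    """True when n equals the sum of its proper divisors."""
    if n < 2:
        return False
    total = 1  # 1 divides every n > 1
    i = 2
    while i * i <= n:
        if n % i == 0:
            total += i            # low factor
            if n // i != i:
                total += n // i   # high factor, skipped for perfect squares
        i += 1
    return total == n
```

For instance, 28 is perfect because 1 + 2 + 4 + 7 + 14 = 28, whereas 16 is not because its proper divisors 1, 2, 4 and 8 sum to 15.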