Saturday, September 8, 2018

The bridge between relational and object storage   

We review the data transfer between relational and object storage. We argue that the natural migration path for relational data into object storage is via a data warehouse. We further state that this bridge is better suited to a star schema design as input than to other forms of structured data. Finally, all three data stores may exist in the cloud as virtual stores, and each can be massively scalable.

Role of Object Storage: 

Object storage is very good for durability of data. Virtual data warehouses often use the S3 API directly to stash interim analytical byproducts. The nature of these warehouses is that they accumulate data over time for analytics. Since the S3 APIs are a popular programmatic way to store files in object storage, much of the data in the warehouse may be translated into smaller files or blobs and then stored in object storage. A connector works in a similar way, and when the tables are in a star schema it is easier to migrate them. The purpose of the connector is to move data quickly and steadily between a structured store as the source and object storage as the destination. The connector therefore needs to fragment the data and upload it in multiple parts, and the multipart-upload feature of the S3 API is a convenience here. The connector merely automates transfers that would otherwise have to be pushed to, rather than pulled by, the object storage. This difference is enormous both for the convenience of stashing and for the reusability of the objects for reading. The automation may also be flexible enough to store the fragments in the form most likely to be used during repeated analytics. A star schema is a way of saying that there are many tables joined to a central table for easier growth and dimensional analysis. This makes the data stand out as independent dimensions and facts, which can be migrated in parallel.

In addition to the programmatic access for stashing, object storage is far more widely used than conventional file storage, making it all the more appealing for durable stashes or full copies of data. While transactional data may live in highly normalized tables, there is nothing preventing it from being translated into an unraveled star schema; such tables are already part of the chain by which transactional data ends up in a warehouse. The connector does not have to concern itself with either store's internals and can exist and operate without any impact to the warehouse or to the object storage, which may well be a lot bigger than the database.
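As an illustration, the fragment-and-upload step might look like the sketch below. It assumes a hypothetical IObjectStore interface that mirrors the multipart-upload pattern of the S3 API (initiate, upload parts, complete); the interface, its method names and the part size are assumptions made for this sketch, not an actual client library.

using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical interface mirroring the S3 multipart-upload pattern; the names
// here are assumptions for this sketch, not an actual client library.
public interface IObjectStore
{
    string InitiateMultipartUpload(string bucket, string key);
    string UploadPart(string bucket, string key, string uploadId, int partNumber, byte[] data);
    void CompleteMultipartUpload(string bucket, string key, string uploadId, IList<string> partETags);
}

public static class Connector
{
    // Fragments the source stream into fixed-size parts and uploads each part.
    public static void UploadInParts(IObjectStore store, string bucket, string key, Stream source, int partSize)
    {
        var uploadId = store.InitiateMultipartUpload(bucket, key);
        var partETags = new List<string>();
        var buffer = new byte[partSize];
        int read, partNumber = 1;
        while ((read = source.Read(buffer, 0, partSize)) > 0)
        {
            var part = new byte[read];
            Array.Copy(buffer, part, read);
            partETags.Add(store.UploadPart(bucket, key, uploadId, partNumber++, part));
        }
        store.CompleteMultipartUpload(bucket, key, uploadId, partETags);
    }
}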

With this context, let us consider how to write such a connector. In this regard, we are drawn to translating the dimensions of the star or snowflake design into objects. There are only two aspects to be concerned about. The first is the horizontal and vertical partitioning of a dimension, and the second is the order and coordination of transfers across all dimensions. The latter may be done in parallel and with the expertise of database migration wizards. Here the emphasis is on a checklist for the migrations so that the transfer is merely a form of archival. Specifically, in transferring a table we repeat the following steps: while there are records to be migrated between the source and the destination, select a record and check whether it already exists in the destination. If it does not, we write it as a key-value and repeat the process. We make this failproof by checking at every step. Additionally, where the source is a staging area, we may choose to delete records from the staging area so that we can keep it trimmed. This technique is sufficient to write the data in the form of key-values, which are then better suited for object storage.
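A minimal sketch of this per-table transfer loop, assuming hypothetical source and destination abstractions (the interface names and methods here are illustrative, not an actual API), might look like this:

using System.Collections.Generic;

// Hypothetical abstractions for the source table and the destination store.
public interface ISourceTable
{
    IEnumerable<KeyValuePair<string, string>> PendingRecords();
    bool IsStaging { get; }
    void Delete(string key);   // used to trim a staging source
}

public interface IDestinationStore
{
    bool Exists(string key);
    void Put(string key, string value);   // write the record as a key-value object
}

public static class TableMigrator
{
    public static void MigrateTable(ISourceTable source, IDestinationStore destination)
    {
        foreach (var record in source.PendingRecords())
        {
            // check before writing so the transfer is idempotent and can be retried
            if (!destination.Exists(record.Key))
            {
                destination.Put(record.Key, record.Value);
            }
            // optionally trim the staging area once the record is safely stored
            if (source.IsStaging && destination.Exists(record.Key))
            {
                source.Delete(record.Key);
            }
        }
    }
}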

Conclusion:

Data is precious, and almost all platforms compete for a slice of the data. Data is sticky, so the platform and tools built around the data increase in size. Object storage is a universally accepted store for its purpose, and the connector improves the migration of this data to that store. Although the word migration is used to indicate transfer, it does not necessarily mean taking anything away from the origin.

Friday, September 7, 2018

We were discussing the stochastic gradient method for text mining.
Stochastic gradient descent also aims to avoid the batch processing and the repeated iterations over the batch that are otherwise needed to refine the gradient. The overall cost function is written with a term that measures how well the hypothesis fits the current data point. To fit the next data point, the parameters are modified. As the method scans the data points it may wander, but it eventually reaches the neighborhood of the optimum. The scan is repeated so that the dependence on the sequence of the data points is reduced. For this reason the data points are initially shuffled, which goes along the lines of a bag of word vectors rather than the sequence of words in a narrative. Generally, the overall iterations over all the data points are restricted to a small number, say ten, and the algorithm improves the gradient from one data point to the next without having to wait for the whole batch.
To utilize this method for text mining, we first perform feature extraction on the text. Then we use the stochastic gradient descent method as a classifier to learn the distinct features, and finally we can measure the performance using the F1 score. Again, vectorizer, classifier and evaluator become the three-stage processing in this case.
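A minimal sketch of the classifier and evaluator stages, assuming bag-of-words feature vectors have already been extracted and using a simple logistic loss (the learning rate and epoch count are illustrative), might look like this:

using System;
using System.Linq;

public static class TextSgd
{
    // Trains a linear classifier with per-data-point (stochastic) updates.
    public static double[] Train(double[][] features, int[] labels, int epochs = 10, double learningRate = 0.01)
    {
        int dim = features[0].Length;
        var weights = new double[dim];
        var rng = new Random(42);
        for (int epoch = 0; epoch < epochs; epoch++)
        {
            // shuffle so the updates do not depend on the order of the data points
            var order = Enumerable.Range(0, features.Length).OrderBy(_ => rng.Next()).ToArray();
            foreach (var i in order)
            {
                double score = 0;
                for (int j = 0; j < dim; j++) score += weights[j] * features[i][j];
                double predicted = 1.0 / (1.0 + Math.Exp(-score));        // sigmoid
                double error = labels[i] - predicted;                     // gradient term for logistic loss
                for (int j = 0; j < dim; j++)
                    weights[j] += learningRate * error * features[i][j];  // update from this single point
            }
        }
        return weights;
    }

    // F1 = 2 * precision * recall / (precision + recall), the evaluator stage.
    public static double F1Score(int[] actual, int[] predicted)
    {
        double tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < actual.Length; i++)
        {
            if (predicted[i] == 1 && actual[i] == 1) tp++;
            else if (predicted[i] == 1 && actual[i] == 0) fp++;
            else if (predicted[i] == 0 && actual[i] == 1) fn++;
        }
        double precision = tp + fp == 0 ? 0 : tp / (tp + fp);
        double recall = tp + fn == 0 ? 0 : tp / (tp + fn);
        return precision + recall == 0 ? 0 : 2 * precision * recall / (precision + recall);
    }
}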
We explore regions-of-interest detection using a neural net. Here we begin with a feature map. A residual learning network can take the initial sample as input and output feature maps, where each feature map specializes in the detection of a topic. If the model allows backpropagation for training and inference, we could simultaneously detect topics, which are location independent, as well as their positions of occurrence.

With the feature map, two fully connected layers are formed: one for box regression and another for box classification.
The bounding boxes are the proposals. Each box from the feature map is evaluated once for regression and once for classification.
The classifier detects the topic and the regressor adjusts the coordinates of the bounding box.
We mentioned that the text is a bag of words and that we do not have the raster data that is typical of image data. The notion here is merely the replacement of a box with a bag so that we can propose different clusters. If a particular bag emphasizes one and only one cluster, then it is said to have detected a topic. The noise is avoided, or it may even form its own cluster.

Thursday, September 6, 2018

The streaming algorithms are helpful so long as the incoming stream is viewed as a sequence of word vectors. However, word vectorization itself must happen prior to clustering. Some view this as a drawback of these algorithms, because word vectorization is done with the help of neural nets and a softmax classifier over the entire document, and there are ways to use different layers of the neural net to form regions of interest. To date there has been no application of a hybrid approach that detects regions of interest in a neural net layer with the help of stream-based clustering. There is, however, a way to separate the stage of word vector formation from the stream-based clustering if all the words already have well-known word vectors that can be looked up from something similar to a dictionary.
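A minimal sketch of that separation, assuming the pre-trained vectors are available in an in-memory dictionary (the lookup source is an assumption), might look like this:

using System.Collections.Generic;

public static class WordVectorLookup
{
    // Converts an incoming word stream into a stream of word vectors by looking
    // up previously trained vectors; unknown words are simply skipped here.
    public static IEnumerable<double[]> ToVectorStream(IEnumerable<string> words, IDictionary<string, double[]> vectors)
    {
        foreach (var word in words)
        {
            if (vectors.TryGetValue(word, out var vector))
                yield return vector;
        }
    }
}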

The kind of algorithms that work on a stream of data points is not restricted to the ones above. They could involve a cost function. For example, stochastic gradient descent also aims to avoid the batch processing and the repeated iterations over the batch that are otherwise needed to refine the gradient. The overall cost function is written with a term that measures how well the hypothesis fits the current data point. To fit the next data point, the parameters are modified. As the method scans the data points it may wander, but it eventually reaches the neighborhood of the optimum. The scan is repeated so that the dependence on the sequence of the data points is reduced. For this reason the data points are initially shuffled, which goes along the lines of a bag of word vectors rather than the sequence of words in a narrative. Generally, the overall iterations over all the data points are restricted to a small number, say ten, and the algorithm improves the gradient from one data point to the next without having to wait for the whole batch.

To utilize this method for text mining, we first perform feature extraction on the text. Then we use the stochastic gradient descent method as a classifier to learn the distinct features, and finally we can measure the performance using the F1 score. Again, vectorizer, classifier and evaluator become the three-stage processing in this case.


Wednesday, September 5, 2018

While vector and cluster formation may be expensive, if their combinations could be made much simpler, then we would have the ability to try different combinations, enhanced with positional information, to form regions of interest. The positional information is no more than offset and size, but the combinations are harder to represent and compute without re-clustering selective subsets of word vectors. It is this challenge that could significantly boost the simultaneous detection of topics as well as keywords in a streaming model of the text.

Data stream clustering is especially suited to clustering word vectors in text that arrives continuously. This is a streaming method that takes a sequence of word vectors and forms good clusters over the stream. The goodness of fit is determined by minimizing the distances between the cluster centers and the data points.
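A minimal sketch of one streaming step, assuming a fixed number of centers and squared Euclidean distance (both illustrative choices), might look like this:

public static class StreamClustering
{
    // Assigns one incoming word vector to its nearest center and nudges that
    // center toward the point (an incremental mean).
    public static int AssignAndUpdate(double[][] centers, int[] counts, double[] vector)
    {
        int best = 0;
        double bestDistance = double.MaxValue;
        for (int c = 0; c < centers.Length; c++)
        {
            double distance = 0;
            for (int j = 0; j < vector.Length; j++)
            {
                double d = centers[c][j] - vector[j];
                distance += d * d;
            }
            if (distance < bestDistance) { bestDistance = distance; best = c; }
        }
        counts[best]++;
        for (int j = 0; j < vector.Length; j++)
            centers[best][j] += (vector[j] - centers[best][j]) / counts[best];
        return best;
    }
}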

This method has many varieties of stream-based algorithms, such as 1) Growing Neural Gas based algorithms, 2) Hierarchical stream-based algorithms, 3) Density-based stream algorithms, 4) Grid-based stream algorithms and 5) Partitioning stream algorithms. Of these, hierarchical stream-based algorithms such as BIRCH are very popular. BIRCH stands for Balanced Iterative Reducing and Clustering Using Hierarchies and is known for incrementally clustering incoming data points. BIRCH has received an award for standing the “test of time”.
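For instance, BIRCH summarizes points with a clustering feature, the additive triple of count, linear sum and squared sum, so a sketch of that bookkeeping (field names are illustrative) might look like this:

using System;
using System.Linq;

// A clustering feature in the spirit of BIRCH: points are absorbed incrementally
// without revisiting earlier data, since the triple is additive.
public class ClusteringFeature
{
    public long N;               // number of points absorbed so far
    public double[] LinearSum;   // componentwise sum of the points
    public double SquaredSum;    // sum of squared norms of the points

    public ClusteringFeature(int dimensions) { LinearSum = new double[dimensions]; }

    public void Add(double[] point)
    {
        N++;
        for (int j = 0; j < point.Length; j++)
        {
            LinearSum[j] += point[j];
            SquaredSum += point[j] * point[j];
        }
    }

    public double[] Centroid()
    {
        var centroid = new double[LinearSum.Length];
        for (int j = 0; j < LinearSum.Length; j++) centroid[j] = LinearSum[j] / N;
        return centroid;
    }

    public double Radius()
    {
        // average spread of the points around the centroid follows from N, LS and SS
        double centroidNormSquared = Centroid().Sum(x => x * x);
        return Math.Sqrt(Math.Max(0, SquaredSum / N - centroidNormSquared));
    }
}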

The streaming algorithms are helpful so long as the incoming stream is viewed as a sequence of word vectors. However, word vectorization itself must happen prior to clustering. Some view this as a drawback of these algorithms, because word vectorization is done with the help of neural nets and a softmax classifier over the entire document, and there are ways to use different layers of the neural net to form regions of interest. To date there has been no application of a hybrid approach that detects regions of interest in a neural net layer with the help of stream-based clustering. There is, however, a way to separate the stage of word vector formation from the stream-based clustering if all the words already have well-known word vectors that can be looked up from something similar to a dictionary.

bool isDivisibleBy851(uint n)
{
    // 851 = 23 * 37, and 23 and 37 are coprime,
    // so n is divisible by 851 exactly when it is divisible by both factors.
    return n % 23 == 0 && n % 37 == 0;
}


Tuesday, September 4, 2018

Applying regions of interest to latent topic detection in text document mining:
Regions-of-interest is a useful technique for detecting objects in raster data, which is data laid out in the form of a matrix. The positional information is useful in aggregating data within bounding boxes, with the hope that one or more boxes will stand out from the rest and likely represent the object. When the data is representative of the semantic content and the aggregation can be performed, the bounding boxes become very useful differentiators from the background and thus help detect objects.
Text is usually considered free-flowing, with no limit to the size, number and arrangement of sentences – the logical units of semantic content. Keywords, however, represent significant information, and their relative positions also give some notion of the topic segmentation within the text. Several keywords may represent a local topic. These topics are similar to objects, and therefore topic detection may be considered similar to object detection.
Furthermore, topics are represented by a group of keywords. These groupings are not meaningful in themselves, because their number and content can vary widely while still describing the same topic. It is also very hard to treat these groupings as any standardized form of representing topics. Fortunately, words are represented by vectors that carry mutual information with other keywords. When we treat a group of words as a bag of word vectors, we get plenty of information on what stands out in the document as keywords.
The leap from words to groups is not as straightforward as vector addition. With vector addition, we lose the notion that some keywords represent a topic better than any other combination. On the other hand, we know classifiers are able to form clusters that are more meaningful. When we partition the keywords into a discrete set of clusters, we are able to represent some notion of topics in the form of clusters.
Therefore clusters, and not groups, become much more representative of topics and subtopics. Positional information merely allows us to cluster only those keywords that appear in a bounding bag of words. By choosing different bags of words based on their positional information, we can aspire to detect topics just as we would determine regions of interest. The candidates within these bounding bags may change as we window the bag over the text, knowing that the semantic content is progressive in a text.
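A minimal sketch of such a windowed bounding bag, assuming pre-computed word vectors and fixed cluster centers, with the window size, stride and dominance threshold as illustrative parameters, might look like this:

using System;
using System.Collections.Generic;
using System.Linq;

public static class BoundingBags
{
    // Windows a bounding bag over the sequence of word vectors; each bag votes
    // its members into fixed cluster centers, and a bag dominated by one cluster
    // is reported as a local topic.
    public static IEnumerable<(int Offset, int Topic)> DetectTopics(
        double[][] wordVectors, double[][] centers, int windowSize, int stride, double dominance = 0.6)
    {
        for (int offset = 0; offset + windowSize <= wordVectors.Length; offset += stride)
        {
            var votes = new int[centers.Length];
            for (int i = offset; i < offset + windowSize; i++)
                votes[NearestCenter(centers, wordVectors[i])]++;
            int topic = Array.IndexOf(votes, votes.Max());
            if ((double)votes[topic] / windowSize >= dominance)
                yield return (offset, topic);   // this bag stands out as a local topic
        }
    }

    static int NearestCenter(double[][] centers, double[] vector)
    {
        int best = 0;
        double bestDistance = double.MaxValue;
        for (int c = 0; c < centers.Length; c++)
        {
            double distance = 0;
            for (int j = 0; j < vector.Length; j++) { var d = centers[c][j] - vector[j]; distance += d * d; }
            if (distance < bestDistance) { bestDistance = distance; best = c; }
        }
        return best;
    }
}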
However, clustering is not cheap, and the sizing and rolling of bags of words across the text from start to finish produces a lot of combinations. This implies that clustering is beneficial only when it is done with sufficient samples, such as the whole document, and done once over all the word vectors taken in the bag of words representing the document. How then do we view several bags of words as local topics within the overall document?
While vector and cluster formation may be expensive, if their combinations could be made much simpler, then we would have the ability to try different combinations, enhanced with positional information, to form regions of interest. The positional information is no more than offset and size, but the combinations are harder to represent and compute without re-clustering selective subsets of word vectors. It is this challenge that could significantly boost the simultaneous detection of topics as well as keywords in a streaming model of the text.

Monday, September 3, 2018

Public cloud providers make it easy to write application gateways in the cloud without requiring the gateway to be part of the object storage. Such an application gateway has path-based rules with path mappings that are essentially url rewrites, such as /video or /image. These application gateways are minimal and are equivalent to static gateway routing, requiring attention from administrators to where copies of objects are maintained and to the lifetime of these rules. By pushing the gateway into the object storage, we make this maintenance no longer necessary.
Since the gateway service in our case is part of the object storage, we wanted the url rewritten to the site where the object would be found. Since object storage with a gateway is mostly used in deployments with three or more sites, the url is rewritten to the site-specific address of the site closest to the origin of the incoming request.
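A minimal sketch of such a rewrite, assuming a hypothetical prefix rule table and a site-selection function (both are stand-ins for whatever the gateway actually consults), might look like this:

using System;
using System.Collections.Generic;

public static class GatewayRewrite
{
    // Rewrites an incoming url to the site-specific address chosen by the gateway.
    public static Uri Rewrite(Uri incoming, IDictionary<string, string> pathRules, Func<Uri, string> closestSite)
    {
        foreach (var rule in pathRules)              // e.g. "/video" -> "/video-store"
        {
            if (incoming.AbsolutePath.StartsWith(rule.Key))
            {
                var site = closestSite(incoming);    // address of the nearest object storage head service
                var path = rule.Value + incoming.AbsolutePath.Substring(rule.Key.Length);
                return new Uri("https://" + site + path);
            }
        }
        return incoming;                             // no rule matched; pass the request through
    }
}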

In this kind of url rewrite, we are leveraging the translation to the site. Since the gateway is part of the object storage, per the discussions in our previous posts, even the namespace-bucket-object part of the address is a candidate for address translation. In this case, an object located within a bucket and part of a namespace ns1 may be copied to another namespace, bucket and object, which is then made available to the request. This destination object storage may have been set up temporarily for the duration of a peak load. The gateway therefore enables object storage to utilize new and dynamic sites within a replication group that has been created to increase geographical content distribution.

The object storage is a no-maintenance storage. Although it doesn't claim to be a content-distribution network, our emphasis was that it is very well suited to be one given how it stores objects and their copies. All we needed was a gateway service. And we said we don't need to maintain it outside, because the url rewrites can be done automatically by pushing the gateway into the object storage tier. As the load on an object increases, copies may temporarily be made and url mappings may be added to distribute the load. These rules may subsequently be removed as the copies of the objects are decommissioned. This is triggered only when an object has aged sufficiently without modification and its traffic has increased above a threshold. Alternatively, the top ten percent of the most heavily read objects may be subject to this load distribution. The load distribution may even be site specific so that the content distribution network continues to work as before.
#codingexercise
A semi-perfect number is the sum of some of its proper divisors. We can determine whether a number equals the sum of a subset of its divisors with a subset-sum check, written here as plain recursion that could be memoized into dynamic programming:
bool IsSubsetSum(List<int> factors, int sum)
{
    if (sum == 0) return true;
    if (sum < 0) return false;
    if (factors.Count == 0) return false;
    var last = factors[factors.Count - 1];
    var rest = factors.GetRange(0, factors.Count - 1); // candidates without the last divisor
    if (last > sum)
        return IsSubsetSum(rest, sum);
    // either exclude the last divisor or include it and reduce the target
    return IsSubsetSum(rest, sum) || IsSubsetSum(rest, sum - last);
}
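For example, 12 is semi-perfect because a subset of its proper divisors {1, 2, 3, 4, 6} sums to it:

var divisors = new List<int> { 1, 2, 3, 4, 6 };   // proper divisors of 12
bool isSemiPerfect = IsSubsetSum(divisors, 12);   // true, since 2 + 4 + 6 = 12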

Sunday, September 2, 2018

URL rewrites in gateway within object storage. 
One of the most transparent things a gateway can do is to rewrite urls so that the incoming and outbound addresses are fully logged. These url rewrites in our case could be interpreted as translating a universal address into a site-specific address for an object. The server address in the url changes from the gateway to the address of an object storage head service, or specifically a virtual data center where the object is guaranteed to be found. Since the gateway service in our case is part of the object storage, we wanted the url rewritten to the site where the object would be found. Since object storage with a gateway is mostly used in deployments with three or more sites, the url is rewritten to the site-specific address of the site closest to the origin of the incoming request.
In this kind of url rewrite, we are leveraging the translation to the site. Since the gateway is part of the object storage, per the discussions in our previous posts, even the namespace-bucket-object part of the address is a candidate for address translation. In this case, an object located within a bucket and part of a namespace ns1 may be copied to another namespace, bucket and object, which is then made available to the request. This destination object storage may have been set up temporarily for the duration of a peak load. The gateway therefore enables object storage to utilize new and dynamic sites within a replication group that has been created to increase geographical content distribution.
Finally, the gateway may change the url rewrites dynamically instead of relying on static rules. These rules were previously written in the form of regexes in a conf file. However, they can instead be a method, called the classifier, that evaluates these configurations at runtime, allowing one or more parameters to be formatted into the declaration and evaluation of these rules.
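A minimal sketch of such a runtime-evaluated rule, with the delegate shapes assumed purely for illustration, might look like this:

using System;
using System.Collections.Generic;

// A rewrite rule evaluated at runtime: a predicate takes the place of a static
// regex from a conf file, and the rewrite produces the site-specific address.
public sealed class RewriteRule
{
    public Func<Uri, bool> Matches;   // the "classifier" that evaluates the rule at runtime
    public Func<Uri, Uri> Rewrite;    // substitutes parameters and emits the rewritten url
}

public static class DynamicRules
{
    public static Uri Apply(IEnumerable<RewriteRule> rules, Uri incoming)
    {
        foreach (var rule in rules)
            if (rule.Matches(incoming))
                return rule.Rewrite(incoming);
        return incoming;   // no rule matched; leave the url unchanged
    }
}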