Wednesday, September 5, 2018

While forming word vectors and clusters may be expensive, if combining them could be made much simpler, we could try different combinations, enhanced with positional information, to form regions of interest. The positional information is no more than an offset and a size, but the combinations are hard to represent and compute without re-clustering selected subsets of word vectors. Meeting this challenge could significantly boost the simultaneous detection of topics and keywords in a streaming model of the text.

Data stream clustering is especially suited to clustering word vectors in text that arrives continuously. It is a streaming method that takes a sequence of word vectors and maintains a good clustering of the stream as it arrives. The goodness of fit is determined by minimizing the distances between each cluster center and the data points assigned to it.
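To make the idea concrete, here is a minimal sketch of one simple stream clustering technique, sequential (online) k-means. It is not tied to any particular library. The class name OnlineKMeans, the seeding of centers from the first k vectors, and the squared Euclidean distance are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class OnlineKMeans:
    """Sequential k-means: each vector is seen once and then discarded."""

    def __init__(self, k):
        self.k = k
        self.centers = []  # one running mean per cluster
        self.counts = []   # number of vectors absorbed by each cluster

    def update(self, vector):
        """Assign one incoming vector to a cluster and return its index."""
        # Assumption: the first k vectors seed the cluster centers.
        if len(self.centers) < self.k:
            self.centers.append(list(vector))
            self.counts.append(1)
            return len(self.centers) - 1

        # Pick the nearest center, then move it toward the new vector
        # so that it stays the mean of everything assigned to it.
        nearest = min(
            range(self.k),
            key=lambda i: squared_distance(self.centers[i], vector),
        )
        self.counts[nearest] += 1
        rate = 1.0 / self.counts[nearest]
        center = self.centers[nearest]
        for d, x in enumerate(vector):
            center[d] += rate * (x - center[d])
        return nearest
```

Each call to update costs time proportional to k times the vector dimension, and only the centers and counts are kept in memory, which is what makes this style of clustering workable on text that never stops arriving.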

This method has many varieties of stream-based algorithms, such as 1) Growing Neural Gas based algorithms, 2) hierarchical stream-based algorithms, 3) density-based stream algorithms, 4) grid-based stream algorithms and 5) partitioning stream algorithms. Of these, hierarchical stream-based algorithms such as BIRCH are very popular. BIRCH stands for Balanced Iterative Reducing and Clustering Using Hierarchies and is known for incrementally clustering incoming data points. BIRCH has received an award for standing the “test of time”.
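BIRCH can be incremental because it summarizes each subcluster with a clustering feature: the count of points, their linear sum and their sum of squares. The sketch below shows only that summary and how it absorbs points and merges. It leaves out the CF tree, the threshold test and the rebalancing that the full algorithm needs, and the class and method names are illustrative rather than taken from any library.

```python
import math


class ClusteringFeature:
    """BIRCH-style summary of a subcluster: (N, linear sum, square sum)."""

    def __init__(self, dim):
        self.n = 0
        self.linear_sum = [0.0] * dim
        self.square_sum = 0.0

    def add(self, vector):
        """Absorb one point without storing it."""
        self.n += 1
        for d, x in enumerate(vector):
            self.linear_sum[d] += x
        self.square_sum += sum(x * x for x in vector)

    def merge(self, other):
        """Clustering features are additive, so merging is component-wise."""
        self.n += other.n
        for d, x in enumerate(other.linear_sum):
            self.linear_sum[d] += x
        self.square_sum += other.square_sum

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        """Root mean squared distance of the points from the centroid."""
        c = self.centroid()
        variance = self.square_sum / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))  # guard against rounding
```

Because the centroid and radius can be recovered from these three quantities alone, the original points never have to be revisited when a new one arrives.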

The streaming algorithms are helpful so long as the incoming stream is viewed as a sequence of word vectors. However, word vectorization itself must happen before clustering. Some view this as a drawback of this family of algorithms, because word vectorization is done with a neural net and a softmax classifier over the entire document, and there are ways to use different layers of the neural net to form regions of interest. To date there has been no application of a hybrid that detects regions of interest in a neural net layer with the help of stream-based clustering. There is, however, a way to separate the stage of word vector formation from the stream-based clustering, provided all the words already have well-known word vectors that can be looked up from something similar to a dictionary.
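The separation of the two stages can be sketched as a plain lookup that turns text into a stream of positioned word vectors. The function name vector_stream and the toy two-dimensional table in the usage example are hypothetical; in practice the table would be loaded from pre-trained embeddings. Splitting on whitespace and skipping unknown words are simplifying assumptions.

```python
def vector_stream(text, embeddings):
    """Yield (offset, size, vector) for every token with a known vector.

    `embeddings` is assumed to be a dict from lowercase word to vector.
    Tokens are split on whitespace only, so punctuation stays attached.
    """
    position = 0
    for token in text.split():
        start = text.index(token, position)  # character offset of the token
        position = start + len(token)
        vector = embeddings.get(token.lower())
        if vector is not None:  # words without a known vector are skipped
            yield start, len(token), vector


# Usage with a hypothetical toy table.
toy_embeddings = {"streams": [1.0, 0.0], "words": [0.0, 1.0]}
for offset, size, vector in vector_stream("Streams cluster words", toy_embeddings):
    print(offset, size, vector)
```

Carrying the offset and size along with each vector keeps the positional information available to whatever clustering step consumes the stream, without that step needing to know how the vectors were produced.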

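#include <stdbool.h>

/* 851 = 23 * 37, and 23 and 37 are coprime, so n is divisible by 851
   exactly when it is divisible by both. */
bool isDivisibleBy851(unsigned int n)
{
    return n % 23 == 0 && n % 37 == 0;
}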

