Sunday, September 30, 2018

Today we continue discussing text summarization techniques. We came up with the following steps:
def gen_proposals(proposals, least_squares_estimates):
       # a proposal is an origin, length and breadth, written as, say, the top-left and bottom-right corners of a bounding box
       # given many known topic vectors, the classifier helps detect the best match
       # the bounding box is adjusted to maximize the intersection over union with this topic
       # text is flowing, so we can assume bounding boxes of sentences
       # fix the origin and choose fixed step sizes to determine the adherence to the regression
       # repeat for different selections of origins
       pass

def get_iou_topic(keywords, topic):
       return sum_of_square_distances(keywords, topic)

def gen_proposals_alternative_without_classes(proposals, threshold):
          # cluster all keywords in a bounding box
          # use the threshold to determine high goodness of fit to one or two clusters
          # use the goodness of fit to scatter-plot bounding boxes and their linear regression
          # use the linear regression to score and select the best bounding boxes
          # select bounding boxes with diversity relative to the overall document cluster
          # use the selected bounding boxes to generate a summary
          pass
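The helper referenced above can be sketched minimally. This is a hedged sketch that treats keywords and topics as fixed-length numeric vectors and supplies a plausible `sum_of_square_distances`; the vector representation is an assumption for illustration, not part of the original.

```python
# Minimal sketch, assuming keywords and topics are plain numeric vectors
# of the same dimension (an assumption for illustration).

def sum_of_square_distances(keywords, topic):
    """Sum of squared Euclidean distances from each keyword vector to the topic vector."""
    return sum(
        sum((k - t) ** 2 for k, t in zip(keyword, topic))
        for keyword in keywords
    )

def get_iou_topic(keywords, topic):
    # lower is a better match; the name follows the pseudocode above
    return sum_of_square_distances(keywords, topic)

keywords = [[1.0, 0.0], [0.0, 1.0]]
topic = [0.5, 0.5]
print(get_iou_topic(keywords, topic))  # → 1.0
```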
       
The keyword selection is based on a softmax classifier and operates the same regardless of the size of the input from the bounding boxes. Simultaneously, the linear regressor proposes different bounding boxes.
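A softmax over keyword relevance scores works the same way however many keywords a bounding box contributes. A minimal sketch, assuming raw relevance scores are already available (the scores and the `top_k` cutoff are hypothetical):

```python
import math

def softmax(scores):
    """Convert raw keyword relevance scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_keywords(keywords, scores, top_k=2):
    """Pick the top_k keywords ranked by softmax probability."""
    probs = softmax(scores)
    ranked = sorted(zip(keywords, probs), key=lambda kp: kp[1], reverse=True)
    return [k for k, _ in ranked[:top_k]]

print(select_keywords(["topic", "the", "cluster"], [2.0, 0.1, 1.5]))
# → ['topic', 'cluster']
```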
We have stemming and keyword selection as common helpers for the above methods. In addition to classification, we measure goodness of fit. We also keep a scatter plot of the bounding boxes and their goodness of fit. We separate out the strategy for the selection of bounding boxes into its own method. Finally, we determine the summary as the top discrete bounding boxes in a separate method.
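Choosing the summary as the top discrete (non-overlapping) bounding boxes could look like the sketch below; the `(start, end)` box representation and the scores are assumptions for illustration.

```python
def overlaps(a, b):
    """Boxes are (start, end) sentence offsets; discrete boxes must not overlap."""
    return a[0] < b[1] and b[0] < a[1]

def top_discrete_boxes(scored_boxes, k=2):
    """Greedily keep the k highest-scoring boxes that do not overlap each other."""
    selected = []
    for box, score in sorted(scored_boxes, key=lambda bs: bs[1], reverse=True):
        if all(not overlaps(box, s) for s in selected):
            selected.append(box)
        if len(selected) == k:
            break
    return selected

boxes = [((0, 3), 0.9), ((2, 5), 0.8), ((6, 9), 0.7)]
print(top_discrete_boxes(boxes))  # → [(0, 3), (6, 9)]
```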
Topics, unlike objects, have an uncanny ability to be represented by one or more keywords. Just as we cluster similar topics in a thesaurus, the bounding boxes need to be compared only against the thesaurus for matches.
We are not looking to reduce the topics to words or to classify the whole bag of words. What we are trying to do is find coherent clusters by determining the size of the bounding box and the general association of that cluster with domain topics. Therefore, we have well-known topic vectors from a domain instead of collocation-based feature maps; we train these as topic vectors and use them within the bounding boxes to determine their affiliations.
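Matching a bounding-box cluster against the well-known domain topic vectors can be sketched with cosine similarity; the topic names and vectors below are hypothetical placeholders, and cosine similarity is one plausible affinity measure, not stated in the original.

```python
import math

def cosine(a, b):
    """Cosine similarity between two numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def best_affiliation(cluster_centroid, domain_topics):
    """Return the name of the domain topic vector closest to the cluster centroid."""
    return max(domain_topics, key=lambda name: cosine(cluster_centroid, domain_topics[name]))

domain_topics = {"storage": [1.0, 0.0], "compute": [0.0, 1.0]}  # hypothetical
print(best_affiliation([0.9, 0.1], domain_topics))  # → storage
```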
The ability to discern a domain is similar to discerning latent semantics using collocation. The latter was based on pointwise mutual information: it reduced topics to keywords for data points and used collocation data to train a softmax classifier that relied on the PMI. Here we use feature maps that are based on the associated cluster.
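For reference, the pointwise mutual information between two words can be computed directly from co-occurrence counts; the counts below are made up for illustration.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ), estimated from counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))

# hypothetical counts: the pair co-occurs 10 times out of 1000 windows
print(round(pmi(10, 50, 40, 1000), 3))  # → 1.609
```

A positive PMI means the pair co-occurs more often than independence would predict.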
Keywords form different topics when the same keywords are used with different emphasis. We can measure the emphasis only by the cluster. Clusters are dynamic, but we can record the cohesiveness of a cluster and its size with a goodness-of-fit measure. As we record different goodness-of-fit values, we have a way of representing not just keywords but also their association with topics. We use the vector for the keyword, and we use the cluster for the topic. An analogy is a complex number: it has a real part and an imaginary part. The real-world domain is untouched by the imaginary part, but the two together enable translations that are easy to visualize.
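Recording a cluster's cohesiveness together with its size can be done with a simple goodness-of-fit measure such as the within-cluster sum of squared distances to the centroid; this particular measure is one plausible choice for illustration, not prescribed by the text.

```python
def cluster_fit(points):
    """Return (size, within-cluster sum of squared distances to the centroid).

    A smaller second value means a more cohesive cluster.
    """
    n = len(points)
    dim = len(points[0])
    centroid = [sum(p[i] for p in points) / n for i in range(dim)]
    wss = sum(sum((p[i] - centroid[i]) ** 2 for i in range(dim)) for p in points)
    return n, wss

points = [[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]]  # hypothetical keyword vectors
size, fit = cluster_fit(points)
print(size, fit)
```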
The metric used to save cluster information discovered in the text leads us to topic vectors whose feature maps are different from collocation-based data. This is an operation over metadata, but it has to be layered on top of the collocation data.
