We were discussing topic detection in text documents yesterday.
The ability to discern a domain is similar to the ability to discern latent semantics using collocations. The latter was based on pointwise mutual information (PMI): it reduced topics to keywords as data points and used collocation data to train a softmax classifier that relied on the PMI. Here we instead use feature maps from the associated cluster centroids.
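As a refresher on the collocation-based side, here is a minimal sketch of PMI computed from document-level co-occurrence counts. The function name and the count model (document frequencies rather than window counts) are illustrative choices, not part of the method above.

import math
from collections import Counter

def collocation_pmi(docs):
    # Document-frequency counts for single words and co-occurring word pairs.
    word_counts = Counter()
    pair_counts = Counter()
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        word_counts.update(words)
        pair_counts.update((a, b) for i, a in enumerate(words) for b in words[i + 1:])
    n = len(docs)
    # PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ), estimated from document frequencies.
    return {(a, b): math.log((c / n) / ((word_counts[a] / n) * (word_counts[b] / n)))
            for (a, b), c in pair_counts.items()}

pmi = collocation_pmi(["stock market trading", "stock market crash", "farmers market stall"])
# pmi[("stock", "trading")] > 0: the pair co-occurs more often than independence predicts.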
Keywords form different topics when the same keywords are used with different emphasis. We can measure that emphasis only through the cluster. Clusters are dynamic, but we can record a cluster's cohesiveness and its size with a goodness-of-fit measure. As we record different goodness-of-fit values, we gain a way of representing not just keywords but also their association with topics. We use the vector for the keyword and the cluster for the topic. An analogy is a complex number: it has a real part and an imaginary part. The real part is untouched by the imaginary part, but the two together enable translations that are easy to visualize.
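To make the pairing concrete, here is a minimal sketch of such a keyword-plus-cluster data point. All names here are hypothetical illustrations of the analogy, not an implementation of the method.

from dataclasses import dataclass
import numpy as np

@dataclass
class TopicPoint:
    # "Real part": the keyword's own embedding vector.
    keyword_vector: np.ndarray
    # "Imaginary part": the cluster the keyword belongs to, summarized
    # by its centroid, its size, and its goodness-of-fit score.
    centroid: np.ndarray
    size: int
    goodness_of_fit: float

    def weighted_fit(self):
        # Cohesiveness weighted by cluster size, as used below to rank clusters.
        return self.goodness_of_fit * self.size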
The metric that saves the cluster information discovered in the text leads us to topic vectors whose feature maps are different from the collocation-based data. This is an operation over metadata, but it has to be layered with the collocation data:
def determining_neurons_for_topic(document, corpus):
    # The topic neurons are the centroids of the top clusters in the corpus.
    return cluster_centroids_of_top_clusters(corpus)

def batch_summarize_repeated_pass(text, corpus, threshold):
    # First pass: assign the whole document to a cluster.
    doc_cluster = classify(text, corpus)
    # Propose candidate regions of the document from the document-level cluster.
    bounding_boxes = gen_proposals(doc_cluster)
    # Seed the candidate list with the full-document cluster.
    clusters = [(FULL, doc_cluster)]
    # Repeated passes: cluster each proposed region independently.
    for bounding_box in bounding_boxes:
        cluster = get_cluster_bounding_box(bounding_box, text, corpus)
        clusters += [(bounding_box, cluster)]
    # Keep only clusters whose goodness of fit, weighted by size, clears the threshold.
    selections = select_top_clusters(threshold, clusters)
    return summary_from(selections)

def select_top_clusters(threshold, clusters):
    return clusters_greater_than_goodness_of_fit_weighted_size(threshold, clusters)
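A hypothetical invocation of the pipeline above, assuming the corpus has been clustered ahead of time; the threshold value is illustrative.

# `threshold` is the minimum goodness-of-fit-weighted size for a cluster to be kept.
summary = batch_summarize_repeated_pass(text, corpus, threshold=0.5)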