We were discussing topic detection in text documents yesterday.
The following is another method to do it:
def simultaneous_regressor_and_classifier(text, corpus):
    # The helpers called below are placeholders; they are not defined in this post.
    bounding_box = initialize_bounding_box(text)
    clusters = []
    regressors = []
    # finite iterations or till satisfaction
    for _ in range(10):
        selected_text = select_text(bounding_box, text)
        # classify the text inside the current bounding box
        cluster = classify(selected_text)
        clusters.append((bounding_box, cluster))
        # match the cluster against the corpus
        regressor = generate_match(cluster, corpus)
        regressors.append((bounding_box, regressor))
        # use the matches so far to pick the next bounding box
        bounding_box = next_bounding_box(regressors, text)
    selections = select_top_clusters(clusters)
    return summary_from(selections)
The motivation behind this method is that the whole document need not be treated as a single bounding box. The technique used to determine topics for the whole document, clustering word vectors, is the same one we apply to smaller sections, and this helps locate topics within the text. What we did not specify was how the bounding boxes are selected. This does not deviate from the general practice in object detection. Each proposal has an intersection-over-union (IoU) value against the ground truth, which can be used in a linear regression. This penalizes false positives and noisy proposals in favor of those that lie along the regression line. As an alternative to choosing enough random proposals for the plot, we can also be selective and choose bounding boxes that have a higher concentration of keywords.
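To make the last two ideas concrete, here is a minimal sketch, assuming a bounding box over text is simply a half-open token span (start, end). The names span_iou, keyword_density, propose_spans and tokenize are hypothetical helpers introduced for illustration, as are the window width and the keyword set; none of them come from the pseudocode above. The regression step is left out.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def span_iou(a, b):
    """Intersection over union of two half-open token spans (start, end)."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union else 0.0


def keyword_density(tokens, span, keywords):
    """Fraction of the tokens inside the span that are keywords."""
    window = tokens[span[0]:span[1]]
    if not window:
        return 0.0
    return sum(1 for token in window if token in keywords) / len(window)


def propose_spans(tokens, keywords, width=5, top_k=3):
    """Slide a fixed-width window over the tokens and keep the densest spans."""
    last_start = max(1, len(tokens) - width + 1)
    spans = [(s, min(s + width, len(tokens))) for s in range(last_start)]
    ranked = sorted(
        spans,
        key=lambda span: keyword_density(tokens, span, keywords),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    tokens = tokenize("The cat sat on the mat while stocks and bonds and stocks fell")
    best = propose_spans(tokens, {"stocks", "bonds"}, width=5, top_k=1)[0]
    print(best, tokens[best[0]:best[1]])
    # Score the proposal against a hand-labeled span (assumed ground truth).
    print(span_iou(best, (7, 13)))
```

Ranking fixed-width windows by keyword density is only one way to be selective. Spans of varying width, or scores based on word-vector clusters, would fit the same interface.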