Monday, April 22, 2019

The techniques for reducing noise from sequences:
Any algorithm can create clusters from data. However, clusters are only good when each one is cohesive, meaningful and separate from the other clusters. There is a goodness-of-fit measure for clusters, the sum of squared errors, and a good clustering keeps this measure small. This gives a quantitative assessment of clusters.
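As a minimal sketch of this measure, the sum of squared errors can be computed as the squared distance from each point to the centroid of its cluster. The function name `sse` and the toy points below are assumptions made for illustration only.

```python
def sse(clusters):
    """Sum of squared errors: squared distance of each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Centroid is the per-dimension mean of the points in the cluster.
        centroid = [sum(p[i] for p in points) / len(points) for i in range(dim)]
        total += sum((p[i] - centroid[i]) ** 2 for p in points for i in range(dim))
    return total


# Hypothetical data: the same four points grouped in two different ways.
tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
mixed = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 2.0), (10.0, 12.0)]]

print(sse(tight), sse(mixed))
```

The grouping that keeps nearby points together has the smaller value, which is the sense in which the measure assesses the clusters.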
Sequences behave in much the same way. Sequences can be formed from any combination of elements. This explodes the number of possible sequences, and without a quantitative measure of their usefulness the sequences cannot be filtered. Once each sequence is given a value of such a measure, the sequences can be checked against a threshold that separates the noise from the meaningful ones.
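The thresholding step can be sketched as follows. The post does not name a specific measure, so this example assumes one: support, the fraction of records that contain a sequence as a contiguous run. The names `score_sequences` and `filter_noise`, the records and the threshold are all hypothetical.

```python
from collections import Counter


def score_sequences(records, length=2):
    """Assumed measure: fraction of records containing each contiguous sequence."""
    counts = Counter()
    for record in records:
        # Use a set so a sequence is counted at most once per record.
        seen = {tuple(record[i:i + length]) for i in range(len(record) - length + 1)}
        counts.update(seen)
    return {seq: n / len(records) for seq, n in counts.items()}


def filter_noise(scores, threshold):
    """Keep only the sequences whose score reaches the threshold."""
    return {seq: s for seq, s in scores.items() if s >= threshold}


# Hypothetical event records.
records = [
    ["login", "search", "buy"],
    ["login", "search", "logout"],
    ["login", "search", "buy"],
    ["ping", "login", "search"],
]
scores = score_sequences(records)
kept = filter_noise(scores, threshold=0.5)
print(sorted(kept))
```

Here the two sequences that appear in only one record fall below the threshold and are treated as noise.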
Sequences can also be clustered just like any other entities. Clustering helps in determining the sequences that share a cohesive property, while outliers are insignificant and can be ignored. With good clustering that includes the latent semantics of the sequences, the size and density of the clusters indicate the most significant collections. If the clustering technique were to simultaneously perform the representation of sequences as vectors and the clustering of these vectors, it may even result in a noise cluster that draws the outliers into itself. This enables cleaner formation of clusters, with most of the outliers in the noise cluster. The noise cluster can then be ignored.
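One way to see a noise cluster in practice is a density-based method. The sketch below is a minimal DBSCAN-style clustering, which is a technique swapped in for illustration and not necessarily the one intended here. It also simplifies by vectorizing first and clustering second, and the bag-of-counts `vectorize` function is a crude assumption that ignores element order and latent meaning. The sequences, `eps` and `min_pts` values are hypothetical.

```python
import math

NOISE = -1


def vectorize(seq, vocab):
    """Assumed toy representation: count of each vocabulary element in the sequence."""
    return [seq.count(tok) for tok in vocab]


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # core point: keep growing the cluster
        cluster += 1
    return labels


vocab = ["a", "b", "c", "x"]
sequences = ["aab", "aba", "aabb", "ccb", "cbc", "bccc", "xxxx"]
vectors = [vectorize(s, vocab) for s in sequences]
labels = dbscan(vectors, eps=1.5, min_pts=2)
meaningful = [s for s, label in zip(sequences, labels) if label != NOISE]
print(labels, meaningful)
```

The sequence that has no close neighbours ends up with the noise label and is dropped, while the two dense groups are kept as clusters.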
Therefore, a usefulness metric for each sequence, a good choice of distance metric, a proper vectorization of sequences that brings out their latent meaning and a good clustering algorithm can together efficiently remove noise from the large set of sequences that can be generated.
Another useful metric for this purpose is the F-score, which combines precision and recall. Precision is the ratio of correct classifications to all classifications made, so it reflects how selective the labelling is. Recall is the ratio of correct classifications to the actual number of sequences that should have been selected. The F-score ranks the classifier by twice the product of precision and recall divided by their sum. This further improves the use of clusters to select the sequences.
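The arithmetic can be sketched in a few lines. The function name `precision_recall_f1` and the sequence identifiers are hypothetical: `selected` stands for the sequences a classifier labelled as meaningful and `relevant` for the ones that actually are.

```python
def precision_recall_f1(selected, relevant):
    """Precision, recall and their harmonic mean (the F1 form of the F-score)."""
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels: four sequences selected, three actually meaningful.
selected = {"s1", "s2", "s3", "s4"}
relevant = {"s1", "s2", "s5"}
precision, recall, f1 = precision_recall_f1(selected, relevant)
print(precision, recall, f1)
```

With two correct selections out of four made and three possible, precision is 1/2, recall is 2/3 and the F-score is 4/7.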
