Wednesday, February 13, 2013

libraries for data mining

Some of the data mining packages available on Google Code are nifty utilities for mining 2D data. They come with implementations of various algorithms and techniques, and these can be used directly with different kinds of data. One such technique involves computing cluster centroids and then finding the Euclidean distance between each point and each centroid. The Euclidean distance is the square root of the sum of the squares of the x and y offsets from the centroid. In the case of finding keywords in a text, the WordNet similarity distance between the selected candidates could be used instead, populating a 2D array (points by clusters) of similarity values, each ranging between 0 and 1. We could iterate through all the points and, for each point, compute the sum of its distances to all of the clusters. For the same point, we could then compute the inverse of the fraction that the distance to each cluster makes up of that sum, so that nearer clusters get larger values. Normalizing each row of the resulting matrix gives the degree to which each data point belongs to the different clusters. Once we know the membership of each word in the different clusters, we could populate the index with the words that maximize cluster membership.
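
The membership computation described above can be sketched roughly as follows. This is only an illustration in plain Python, not the code of any particular package: the function names `euclidean_distance` and `cluster_memberships` are made up for this example, the centroids are assumed to be already known, and the handling of a point that sits exactly on a centroid is my own assumption.

```python
import math


def euclidean_distance(point, centroid):
    """Square root of the sum of squared offsets from the centroid."""
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))


def cluster_memberships(points, centroids):
    """Return one row per point; each row holds memberships that sum to 1.

    Hypothetical helper: weights each cluster by the inverse of its share
    of the point's total distance, then normalizes the row.
    """
    memberships = []
    for point in points:
        distances = [euclidean_distance(point, c) for c in centroids]

        # Assumption: a point lying exactly on one or more centroids
        # belongs entirely (and equally) to those clusters.
        zero_hits = [d == 0.0 for d in distances]
        if any(zero_hits):
            share = 1.0 / sum(zero_hits)
            memberships.append([share if hit else 0.0 for hit in zero_hits])
            continue

        # Sum of distances from this point to all clusters.
        total = sum(distances)
        # Inverse of the fraction (distance / total) for each cluster.
        inverse_fractions = [total / d for d in distances]
        # Normalize so the memberships of the point sum to 1.
        norm = sum(inverse_fractions)
        memberships.append([w / norm for w in inverse_fractions])
    return memberships
```

For example, with centroids at (0, 0) and (4, 0), the point (1, 0) is at distance 1 and 3 from them, so it ends up with memberships of about 0.75 and 0.25. For keywords, the same loop could presumably run over a precomputed similarity-based distance matrix instead of calling `euclidean_distance`.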
