Cluster computing

Sunday, October 26, 2014

We review another chapter from the same book as mentioned in the previous post.
We will briefly mention multidimensional scaling.
Like clustering, this is an unsupervised technique that does not make predictions but just makes it easier to understand how different items are related. It creates a lower dimensional representation of a dataset where the distances are maintained as it were in the original dataset as much as possible. If the data is in N dimensions, we then reduce it to two dimensions.
Suppose we have a data set in the three dimensional space (x, y, z)
Then the Euclidean distance between any two pair of points is sq.rt ((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2)
Then if we calculated a N x N matrix for distance between every pair of points, we will use this formula over and over again for every pair and come up with something like this:

         A     B    C   D
A      0.1   0.2  0.3 0.4
B      0.1   0.3 0.15 0.4
C       ..
D      ..

This computation was expensive. The goal here is to draw all the items in a two dimensional chart so that the distances are as close as possible, to their distances in four dimensions.
To do this we first place all the items randomly on a chart and compute the distances between items. For every pair of items, the target distance is compared to the current distance and an error term is calculated. Every item is moved a small amount closer or farther in proportion to the error between the two items. Every item is moved according to the combination of all the other items pushing and pulling on it. Each time this is done, the difference between the current distance and the target distance should get smaller and we repeat the iterations until it doesn't.

assuming
                // real distances in higher dimension and
                // fake distances in lower dimension

totalerror = 0
for k in range(n):
   for j in range(n):
       if (j ==k) continue
       errorterm = fake -real / real
       pos[k][0] += errorterm_proportion_for_k_0
       pos[0][k] += errorterm_proportion_for_0_k
       totalerror += abs(errorterm)
print totalerror

lasterror = totalerror

# move each point by learning the rate times the gradient
for k in range(n):
    loc[k][0] -= rate*grad[k][0]
    loc[k][1] -= rate*grad[k][1]

return loc

Cluster computing

Sunday, October 26, 2014

No comments:

Post a Comment