Sunday, October 19, 2014

Non-negative matrix factorization: This is a technique for extracting the important features of the data. It builds on two basic matrix operations. The first is matrix multiplication, where a matrix with r columns is multiplied by another matrix with r rows, and each cell of the result is the scalar (dot) product of the corresponding row of the first matrix with the corresponding column of the second. The second is transposition, where the rows of a matrix are rewritten as its columns.
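To make these operations concrete, here is a small illustrative sketch using numpy (the same library used in the factorization code below); the matrices a and b are made-up examples:

from numpy import matrix, transpose

# a has 2 rows and 3 columns; b has 3 rows and 2 columns,
# so the product a*b is defined and has 2 rows and 2 columns.
a = matrix([[1, 2, 3],
            [4, 5, 6]])
b = matrix([[1, 0],
            [0, 1],
            [1, 1]])

# Each cell (i, j) of the product is the scalar product of
# row i of a with column j of b.
print(a * b)

# Transposing swaps rows and columns: the transpose of a has 3 rows and 2 columns.
print(transpose(a))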
Given a set of articles containing words, we can write the usual article matrix where the articles are the rows and the words appearing in them are the columns. The words are the properties of the articles, and we take words that are common but not too common:
# Keep words that appear more than 3 times overall but in fewer than 60% of the articles
wordvec = []
for w, c in allw.items():
    if c > 3 and c < len(articlew) * 0.6:
        wordvec.append(w)
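With wordvec in hand, the article matrix is simply the count of each selected word in each article. A minimal sketch, assuming articlew is a list of per-article word-count dictionaries (consistent with the len(articlew) used above):

from numpy import matrix

# One row per article, one column per word in wordvec; each cell holds how
# many times that word appears in that article. articlew is assumed to be a
# list of dictionaries mapping words to counts for each article.
articlematrix = [[article.get(w, 0) for w in wordvec]
                 for article in articlew]

# Wrap the nested list as a numpy matrix before passing it to factorize below.
v = matrix(articlematrix)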
This technique factorizes the article matrix into two matrices: a features matrix and a weights matrix. The features matrix has a row for each feature and a column for each word. The weights matrix tells us how much each feature applies to each article; each of its rows is an article and each of its columns is a feature.
The article matrix may be large. The purpose of this technique is to reduce it to a smaller set of common features that, combined with different weights, could reproduce the article matrix. It is unlikely to reconstruct the original exactly, but the technique attempts to get as close as possible.
It's called non-negative because it returns features and weights with no negative values. Since the data counts the occurrences of words, a feature can only add words to an article, never subtract them; there is no notion of a word's exclusion. This approach doesn't look for the best possible factorization, but its results are often easier to interpret.
The algorithm needs a utility function to determine how close the reconstruction is. This is done with a function that sums the squares of the differences between two equal-sized matrices. We then gradually update the matrices to reduce this cost.
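The factorize function below calls this cost function under the name difcost; a minimal sketch, assuming both arguments are numpy matrices of the same shape:

from numpy import shape

def difcost(a, b):
    dif = 0
    # Loop over every row and column of the two equal-sized matrices
    for i in range(shape(a)[0]):
        for j in range(shape(a)[1]):
            # Sum the squares of the differences
            dif += pow(a[i, j] - b[i, j], 2)
    return dif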
We could use any of a number of optimization techniques here, such as simulated annealing, but we discuss multiplicative update rules.
Given the original matrix, which this optimization technique calls the data matrix, we compute four new update matrices:
hn = the transposed weights matrix multiplied by the data matrix.
hd = the transposed weights matrix multiplied by the weights matrix multiplied by the features matrix.
wn = the data matrix multiplied by the transposed features matrix.
wd = the weights matrix multiplied by the features matrix multiplied by the transposed features matrix.
The weights and features matrices are initialized with random values.
To update the features and weights matrices, all of these matrices are converted to arrays so that the operations are element-wise.
Every value in the features matrix is multiplied by the corresponding value in hn and divided by the corresponding value in hd. Likewise, every value in the weights matrix is multiplied by the corresponding value in wn and divided by the corresponding value in wd.
This is repeated until the data matrix is reconstructed exactly (the cost reaches zero) or the maximum number of iterations is reached.
from numpy import matrix, array, shape, transpose
import random

def factorize(v, pc=10, iter=50):
    ic = shape(v)[0]
    fc = shape(v)[1]
    # Initialize the weights and features matrices with random values
    w = matrix([[random.random() for j in range(pc)] for i in range(ic)])
    h = matrix([[random.random() for j in range(fc)] for i in range(pc)])
    # Perform the operation a maximum of iter times
    for i in range(iter):
        wh = w * h
        # Calculate the current difference between the data and the reconstruction
        cost = difcost(v, wh)
        if i % 10 == 0: print(cost)
        # Terminate if the matrix has been fully factorized
        if cost == 0: break
        # Update the features matrix
        hn = (transpose(w) * v)
        hd = (transpose(w) * w * h)
        h = matrix(array(h) * array(hn) / array(hd))
        # Update the weights matrix
        wn = (v * transpose(h))
        wd = (w * h * transpose(h))
        w = matrix(array(w) * array(wn) / array(wd))
    return w, h
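As a hedged usage sketch (v and wordvec are the article matrix and word list built earlier; the six-word cutoff is arbitrary): factorize returns the weights and features matrices, and each row of the features matrix can be summarized by its highest-scoring words.

from numpy import shape

w, h = factorize(v, pc=10, iter=50)

# Summarize each feature by its highest-scoring words; wordvec maps
# the column indices of h back to the actual words.
for featurerow in range(shape(h)[0]):
    scores = [(h[featurerow, wordcol], wordvec[wordcol])
              for wordcol in range(shape(h)[1])]
    scores.sort(reverse=True)
    print([word for score, word in scores[:6]])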

Recall that eigenvectors and their corresponding eigenvalues give another way to decompose a matrix, and since det(A - lambda I) = 0, we can solve for the eigenvalues and eigenvectors. This may be useful for finding a single feature.
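For instance, numpy can compute these directly; a small illustrative sketch with an arbitrary symmetric matrix:

from numpy import array
from numpy.linalg import eig

A = array([[2.0, 1.0],
           [1.0, 2.0]])

# eig returns the eigenvalues and a matrix whose columns are the
# corresponding eigenvectors, so that A v = lambda v for each pair.
# For this particular matrix the eigenvalues are 3 and 1.
eigenvalues, eigenvectors = eig(A)
print(eigenvalues)
print(eigenvectors)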
Another way to do this kind of optimization is to update the matrix using the Laplace transform of a matrix-valued function, as described here (http://see.stanford.edu/materials/lsoeldsee263/10-expm.pdf). This is useful when we don't want to find the eigenvalues and can work with the resolvent instead. Moreover, the state transitions are linear.
For example, the harmonic oscillator system can be written as

x-dot = [ 0  1; -1  0 ] x

and the state transition is x(t) = [ cos t  sin t; -sin t  cos t ] x(0)

Similarly, the matrix exponential solution is x(t) = e^(tA) x(0),
where the matrix exponential is e^M = I + M + M^2/2! + ...

For t = 1 and A = [ 0  1; 0  0 ], the power series terminates because A^2 = A^3 = ... = 0,
so e^A = I + A = [ 1  1; 0  1 ]
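This can be checked numerically; a small sketch using scipy.linalg.expm (assuming scipy is available):

from numpy import array
from scipy.linalg import expm

A = array([[0.0, 1.0],
           [0.0, 0.0]])

# A is nilpotent (A*A is the zero matrix), so the power series terminates
# and expm(A) should equal I + A = [ 1 1; 0 1 ].
print(expm(A))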

We can also explore the weighted Laplacian matrix for partitioning.
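As a hedged sketch of that idea (spectral partitioning by the sign of the Fiedler vector; the small weighted adjacency matrix here is made up for illustration):

from numpy import array, diag
from numpy.linalg import eigh

# A made-up symmetric weighted adjacency matrix for four nodes:
# nodes 0-1 and nodes 2-3 are strongly linked, with weak links across.
W = array([[0.0, 2.0, 0.1, 0.0],
           [2.0, 0.0, 0.0, 0.1],
           [0.1, 0.0, 0.0, 2.0],
           [0.0, 0.1, 2.0, 0.0]])

# Weighted Laplacian L = D - W, where D is the diagonal matrix of weighted degrees.
D = diag(W.sum(axis=1))
L = D - W

# eigh returns the eigenvalues of a symmetric matrix in ascending order;
# the eigenvector for the second-smallest eigenvalue (the Fiedler vector)
# splits the nodes into two groups by its sign.
values, vectors = eigh(L)
fiedler = vectors[:, 1]
print([0 if x < 0 else 1 for x in fiedler])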
• Larry Hardesty, "Short algorithm, long-range consequences," MIT News, 1 March 2013, is an interesting read on the subject.
