Wednesday, October 29, 2014

In today's post we discuss conjugate gradient methods. Conjugate gradient (CG) is a very popular method for solving systems of the form Ax = b, where x is an unknown vector, b is a known vector, and A is a known square matrix. Iterative methods like CG are suited for use with sparse matrices. If A is dense, it is usually better to factor A and solve the equation by back substitution, because factoring a dense A costs roughly the same as solving the system iteratively, and once A is factored the system can be back-solved quickly for several different values of b. Factoring a sparse A, on the other hand, is much more expensive, and so is the subsequent back-solving step.
Equations of the form Ax = b are linear, so we can view the solution as the point of intersection of n hyperplanes, each of dimension n-1. The system can also be recast as a quadratic form f(x) = (1/2) x^T A x - b^T x + c, which we can view as a surface whose minimum point is the solution to Ax = b (when A is symmetric and positive-definite). Drawn as contours, the quadratic form gives concentric ellipsoidal curves, along each of which f(x) is constant. The gradient of the quadratic form points in the direction of steepest increase of f(x) and is orthogonal to the contour lines; for symmetric A it equals Ax - b, so setting the gradient to zero minimizes the function and recovers the original system.
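As a small illustration (the matrix A and vector b below are made-up values, assumed symmetric and positive-definite), the gradient of the quadratic form vanishes exactly at the solution of Ax = b:

import numpy as np

# illustrative symmetric positive-definite A and right-hand side b
A = np.array([[3.0, 2.0],
              [2.0, 6.0]])
b = np.array([2.0, -8.0])
c = 0.0

def f(x):
    # the quadratic form f(x) = 1/2 x^T A x - b^T x + c
    return 0.5 * x @ A @ x - b @ x + c

def grad_f(x):
    # for symmetric A the gradient reduces to Ax - b
    return A @ x - b

x_star = np.linalg.solve(A, b)   # solution of Ax = b
print(grad_f(x_star))            # ~ [0, 0]: the minimum of f sits at the solution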
If we wonder why we transform the simpler looking equation Ax = b into the tougher looking quadratic form, it is because the methods we are looking at were developed and are intuitively understood in terms of minimization problems, not in terms of intersecting hyperplanes.
It also helps to think about these problems in terms of eigenvectors and eigenvalues, because they serve as analysis tools. The steepest descent and CG methods don't calculate eigenvectors as part of their algorithms. We might recall from previous posts that an eigenvector v of a matrix B is a nonzero vector that does not rotate when B is applied to it, except perhaps to point in precisely the opposite direction. v may change length or reverse its direction, but it won't turn sideways. In other words, there is some scalar constant lambda such that Bv = lambda v, and this lambda is the eigenvalue.
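A quick numerical check of this definition (B is an illustrative symmetric matrix):

import numpy as np

B = np.array([[2.0, 1.0],
              [1.0, 2.0]])              # illustrative symmetric matrix

eigenvalues, eigenvectors = np.linalg.eig(B)
v = eigenvectors[:, 0]                  # first eigenvector (a column)
lam = eigenvalues[0]
print(np.allclose(B @ v, lam * v))      # True: Bv = lambda v, so v is only scaled, never turned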

Courtesy: Shewchuk, 1994, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain.

Tuesday, October 28, 2014

In today's post I will return to matrix factorization and optimization techniques. We were discussing Newton's method. This is an iterative method that uses a Taylor series expansion and converges rapidly near the roots. Keeping the Taylor series to first order, each step finds the tangent line to the curve (x, f(x)) at the current point x0 and takes the point where that tangent line intersects the x-axis as the next iterate.
Assuming that the iterates xk+1 converge towards the solution x, we can define the error at the k-th step; expanding it shows that the method converges quadratically. We will next discuss line search and trust region, the two strategies used to compute the incremental step. In a line search, we pick a direction and then decide how far to move along that direction.
Each iteration of the line search is computed this way:
xk+1 = xk + alpha_k * pk
where the incremental step depends on both the direction pk and the step length alpha_k. Generally, alpha_k is chosen so that the step reduces the function. The search is done in two phases: a bracketing phase, where a range is found that bounds a suitable step length, and a selection phase, where a good step length is picked within that range. The iteration terminates when there is a sufficient reduction of the function.
The trust region approach is more robust. It recognizes that the approximation model chosen to compute the next iterate is only applicable locally, so it maintains a trust region within which the model is believed to apply. The model is typically a simple local approximation, much like the one a line search uses. As the model is revised at each iteration, the region shrinks when the model predicts the function poorly and can grow when it predicts well; the iteration terminates when the steps become sufficiently small.
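As a rough sketch of how the region is managed (the 1/4 and 3/4 thresholds follow a common textbook choice; f, model, and solve_subproblem are assumed helpers, and a real implementation also guards against a zero predicted reduction):

def trust_region_step(f, model, solve_subproblem, x, radius, max_radius):
    p = solve_subproblem(x, radius)              # minimize the local model inside the region
    actual = f(x) - f(x + p)                     # actual reduction in f
    predicted = model(x, 0 * p) - model(x, p)    # reduction the model promised
    rho = actual / predicted
    if rho < 0.25:
        radius = 0.25 * radius                   # model was poor: shrink the trusted region
    elif rho > 0.75:
        radius = min(2.0 * radius, max_radius)   # model was good: allow a larger region
    x_next = x + p if rho > 0 else x             # accept the step only if f actually decreased
    return x_next, radius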
Trust region methods are considered newer than line search methods. A difficulty with a naive line search is that merely requiring some reduction in f at each step is not enough. The condition we maintain instead is that the reduction in f should be proportional to both the step length alpha_k and the directional derivative. Writing phi(alpha) = f(xk + alpha * pk), the right-hand side of this sufficient decrease condition is a straight line with a shallow negative slope through the graph of phi; the portions of the graph that lie below this line are the acceptable step lengths, and the portions above it are not. Even so, the sufficient decrease condition is not enough by itself to ensure that the algorithm makes reasonable progress, because it is satisfied for all sufficiently small step lengths.
To rule out unacceptably short steps, a second requirement called the curvature condition is introduced: the slope phi'(alpha_k) at the chosen step length must be at least a constant times the initial slope phi'(0). In other words, if the slope at alpha_k is still strongly negative, that is an indication that we can reduce f significantly by moving further along the chosen direction, so such a step is rejected; steps where the slope is only slightly negative, or even positive, are accepted. This prevents the iteration from crawling along in very small skips. Together, the sufficient decrease and curvature conditions are known collectively as the Wolfe conditions.
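A minimal backtracking sketch that tests both Wolfe conditions might look like the following; the constants c1 and c2, the helper names, and the use of numpy vectors are assumptions, and a production line search would also carry out the bracketing and selection phases described above.

import numpy as np

def wolfe_line_search(f, grad, x, p, alpha=1.0, c1=1e-4, c2=0.9, shrink=0.5, max_iter=50):
    # back off along direction p until both Wolfe conditions hold (a sketch, not robust)
    f0 = f(x)
    slope0 = np.dot(grad(x), p)          # initial directional derivative, should be negative
    for _ in range(max_iter):
        x_new = x + alpha * p
        sufficient_decrease = f(x_new) <= f0 + c1 * alpha * slope0
        curvature = np.dot(grad(x_new), p) >= c2 * slope0
        if sufficient_decrease and curvature:
            return alpha
        alpha *= shrink                   # step failed a condition: try a shorter step
    return alpha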

For sparse matrices, conjugate gradient methods are better suited. We will discuss this next.

Monday, October 27, 2014

In today's post I want to discuss a topic different from the previous few posts and perhaps one time only.
This is about supporting Windows file system style ACLs using row-level security in a database.
Let us say resources are saved in a table and we want to specify an access control list for each resource so that access can be validated against policies. We will leave out the case where the resources are hierarchical.
In this case, we want to add a column to the resource that describes this control. It could be a user name or it could be a label or a foreign key to a set of markings that represent the sensitivity or the permissions of the data.
Keeping track of usernames, for example, is helpful when all accesses are isolated per user; in other words, users don't share ownership of folders, and their accesses essentially manage the lifetime of their own folders.
We can also create a label for each resource if we know that the labels are distinct and symbolic.  In most cases where the operations are predefined but their actors are not, we can use different labels. For example, we can say whether the resource is read-only or read-write.
However, users and their accesses are both dynamic, and they can be hierarchical. In this case, hierarchical means that entries can inherit top down or gain incremental privileges bottom up. In addition, a user may grant or revoke permission, take ownership, etc.
In such a case, it is better to view the labels as an output resulting from a schema of categories and markings.
The categories and markings are predefined in the system. A category represents a domain of markings: it can be a hierarchical classification or a set of distinct compartments, it can be one-to-one or zero-to-many in how many values can be applied, and it may have a comparison rule such as any or all. The markings are the compartment or classification values into which we would like to divide the resources; a resource can carry one or more of them, and each marking has a name.
A particular combination of markings could translate to a unique label for the resource. Usually this table is populated on demand by, say, a stored procedure. If there is no row corresponding to the combination of markings, a new row is added and a unique ID is returned as the label.
We join the label and marking in a LabelMarking table to facilitate the translation of the labels.
Having described row level security, let us now look at populating it for file system ACLs.
Here we begin by listing markings such as full control, modify, read-write, read, and read and execute. A combination of these markings (we select those that are granted) makes up a permission.
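As a rough sketch of the on-demand label creation described above (the tables are simulated with an in-memory dictionary, and all names here are hypothetical rather than an actual schema):

# granted combinations of markings map to one label id each, created on first use
MARKINGS = {"full control", "modify", "read write", "read", "read and execute"}

labels = {}    # frozenset of markings -> label id (stand-in for the Label/LabelMarking tables)

def get_or_create_label(granted):
    granted = frozenset(granted)
    if not granted <= MARKINGS:
        raise ValueError("unknown marking")
    if granted not in labels:
        labels[granted] = len(labels) + 1    # no matching row: add one and return its id
    return labels[granted]

resource_label = get_or_create_label({"read", "read and execute"})
print(resource_label)    # the same combination always resolves to the same label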

Sunday, October 26, 2014

We review another chapter from the same book as mentioned in the previous post.
We will briefly mention multidimensional scaling.
Like clustering, this is an unsupervised technique that does not make predictions but makes it easier to understand how different items are related. It creates a lower dimensional representation of a dataset in which the distances between items are preserved, as far as possible, as they were in the original dataset. If the data is in N dimensions, we typically reduce it to two dimensions.
Suppose we have a data set in the three dimensional space (x, y, z)
Then the Euclidean distance between any pair of points is sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2).
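A small helper implementing the same formula (it works for points of any equal dimension and is used again further below):

from math import sqrt

def euclidean(p, q):
    # straight-line distance between two points of equal dimension
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))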
Then, to build an N x N matrix of distances between every pair of points, we apply this formula over and over for every pair and come up with something like this:

         A     B    C    D
A      0.1   0.2  0.3  0.4
B      0.1   0.3  0.15 0.4
C       ..
D      ..

This computation is expensive. The goal now is to draw all the items on a two-dimensional chart so that their distances are as close as possible to their distances in the original higher-dimensional space.
To do this we first place all the items randomly on a chart and compute the distances between them. For every pair of items, the target distance is compared to the current distance and an error term is calculated. Each item is then moved a small amount closer or farther in proportion to the error between the two items, so every item moves according to the combination of all the other items pushing and pulling on it. Each time this is done, the difference between the current distances and the target distances should get smaller, and we repeat the iterations until the total error no longer decreases.

Assuming a list of items (data) and a distance function that gives the real distances in the higher dimension, with the fake distances being the ones measured in the two-dimensional layout, a runnable version of this loop looks like:

from math import sqrt
import random

def scaledown(data, distance, rate=0.01, maxiter=1000):
    n = len(data)
    # real distances between every pair of items in the original space
    realdist = [[distance(data[i], data[j]) for j in range(n)] for i in range(n)]
    # start every item at a random position on the 2-D chart
    loc = [[random.random(), random.random()] for _ in range(n)]
    lasterror = None
    for _ in range(maxiter):
        # fake distances: how far apart the items currently are on the chart
        fakedist = [[sqrt(sum((loc[i][d] - loc[j][d]) ** 2 for d in range(2)))
                     for j in range(n)] for i in range(n)]
        grad = [[0.0, 0.0] for _ in range(n)]
        totalerror = 0.0
        for k in range(n):
            for j in range(n):
                if j == k or realdist[j][k] == 0:
                    continue
                # error term is the fractional difference between fake and real distance
                errorterm = (fakedist[j][k] - realdist[j][k]) / realdist[j][k]
                # push or pull point k along the line joining it to point j
                grad[k][0] += ((loc[k][0] - loc[j][0]) / fakedist[j][k]) * errorterm
                grad[k][1] += ((loc[k][1] - loc[j][1]) / fakedist[j][k]) * errorterm
                totalerror += abs(errorterm)
        print(totalerror)
        # stop when moving the points no longer reduces the total error
        if lasterror is not None and lasterror < totalerror:
            break
        lasterror = totalerror
        # move each point by the learning rate times the gradient
        for k in range(n):
            loc[k][0] -= rate * grad[k][0]
            loc[k][1] -= rate * grad[k][1]
    return loc
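For example, with a handful of made-up three-dimensional points and the euclidean helper defined above:

points = [(1.0, 2.0, 3.0), (2.0, 1.0, 0.5), (0.0, 3.0, 2.0), (4.0, 4.0, 4.0)]
layout = scaledown(points, euclidean)
print(layout)    # four (x, y) positions whose pairwise distances approximate the 3-D ones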

Saturday, October 25, 2014

Today I want to review a chapter from the Programming Collective Intelligence book. In particular, let us look at a few numerical methods:
There are a number of numerical functions available:
1) Trigonometric functions like sine, cosine, and tangent.
2) Other mathematical functions like power, square root, and absolute value.
3) Statistical distributions, such as a Gaussian
4) Distance metrics, like Euclidean and Tanimoto distances
5) A three parameter function that returns -1 if the first parameter is between the second and third
function (a, b, c){
if (b <= a && a <= c) return -1;
return 0;
}
6) A three parameter function that returns 1 if the difference between the first two parameters  is less than the third
function (a, b, c){
if ( Math.abs(a-b) < Math.abs(c)) return 1;
return 0;
}

These numerical functions are purely reactive: their output depends only on the current input. That is the right behavior for mathematical functions; however, for efficiency, programs often have to store information from one round for use in the next.
One way to do this is to create tree nodes that can store and retrieve values from predefined slots. A store node has a single child and an index of a memory slot; it gets the result from its child, stores it in the memory slot, and then passes it along to its parent. A recall node has no children and simply returns the value in the appropriate slot. If a store node is at the top of a tree, the final result is available to any part of the tree that has the appropriate recall node.
In addition to the individual memory, there can be shared memory, where programs share a set of slots that all have access to read and write. This might remind us of parallel computing, with higher levels of cooperation and competition.
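A minimal sketch of such store and recall nodes (the class names and the evaluate/memory interface are assumptions, not the book's exact API):

class StoreNode:
    # stores its child's result in a memory slot and passes the value up the tree
    def __init__(self, child, index):
        self.child, self.index = child, index
    def evaluate(self, inputs, memory):
        result = self.child.evaluate(inputs, memory)
        memory[self.index] = result
        return result

class RecallNode:
    # has no children; simply returns whatever is currently in its slot
    def __init__(self, index):
        self.index = index
    def evaluate(self, inputs, memory):
        return memory[self.index]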

Some other functions include :
Gini Impurity : This is a measure of how impure a set is. If you have a set of items such as [A,A, B,B,B,C,C] then Gini impurity tells you the probability that you would be wrong if you picked one item and randomly guessed its label. If the set were all A instead and your guess was A, then you would never be wrong and you would have a pure set.
I_G(i) = 1 - sum_{j=1..m} f(i,j)^2 = sum_{j != k} f(i,j) f(i,k)
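A small sketch of this measure for a plain Python list of labels:

def gini_impurity(items):
    # probability of mislabeling a randomly drawn item with a randomly drawn label
    total = len(items)
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return 1 - sum((c / total) ** 2 for c in counts.values())

print(gini_impurity(['A', 'A', 'B', 'B', 'B', 'C', 'C']))   # about 0.65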

Entropy is another way to see how mixed  a set is:
It's a measure of how surprising a randomly selected item from the set is. For a pure set, the entropy would be zero.
H(X) = sum_{i=1..n} p(x_i) log2(1 / p(x_i)) = - sum_{i=1..n} p(x_i) log2(p(x_i))
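And a corresponding sketch for entropy:

from math import log2

def entropy(items):
    # expected surprise of a randomly selected item; zero for a pure set
    total = len(items)
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(entropy(['A', 'A', 'B', 'B', 'B', 'C', 'C']))   # about 1.56 bits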

Variance is how much a list of numbers varies from the average value. It is found by averaging the squared difference of every number from the mean:

sigma^2 = (1/N) sum_{i=1..N} (x_i - x_mean)^2

def variance(vals):
    mean = float(sum(vals)) / len(vals)
    s = sum([(v - mean) ** 2 for v in vals])
    return s / len(vals)

The Gaussian function is the probability density function of the normal curve. With standard deviation sigma it is:
G(x) = (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
import math
def Gaussian(dist, sigma=10.0):
    # note the parentheses: the exponent is -(dist^2) / (2 * sigma^2)
    exp = math.e ** (-dist ** 2 / (2 * sigma ** 2))
    return (1 / (sigma * (2 * math.pi) ** .5)) * exp

We discussed the non-negative matrix factorization method earlier: it simply requires calling a factorize function with a matrix of observations and the number of underlying features, and it gives back the weights and the features. It performs the factorization based on the following four update matrices:
hn the transposed weights matrix multiplied by the data matrix
hd the transposed weights matrix multiplied by the weights matrix multiplied by the features matrix.
wn the data matrix multiplied by the transposed features matrix.
wd the weights matrix multiplied by the features matrix multiplied by the transposed features matrix.
The weights and the features computed may not be the same all the time for small data but for larger data they are usually consistent.
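A compact numpy sketch of those four updates (the shapes, the iteration count, and the small constant added to avoid division by zero are assumptions):

import numpy as np

def factorize(v, features=2, iterations=50):
    rows, cols = v.shape
    w = np.random.rand(rows, features)     # weights
    h = np.random.rand(features, cols)     # features
    for _ in range(iterations):
        hn = w.T @ v                        # transposed weights times data
        hd = w.T @ w @ h                    # transposed weights times weights times features
        h = h * hn / (hd + 1e-9)
        wn = v @ h.T                        # data times transposed features
        wd = w @ h @ h.T                    # weights times features times transposed features
        w = w * wn / (wd + 1e-9)
    return w, h

v = np.random.rand(10, 6)
w, h = factorize(v, features=3)
print(np.linalg.norm(v - w @ h))            # reconstruction error shrinks with more iterations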

Friday, October 24, 2014

Today we continue our discussion of matrix methods for optimization. We mentioned linear (first order) and second order methods; Newton's method is a good example of the latter. In its simplest form it finds a solution to an equation in one unknown. As an optimization method it iteratively seeks a critical point of the cost function f (a point where the gradient of f vanishes) by selecting at xk an update vector along which the directional derivative of grad f is equal to -grad f(xk). The second order information about the cost function is incorporated through this directional derivative of the gradient.
The iteration of this method now follows:
From an initial point x0 for a scalar function F on R, it constructs a sequence of iterates according to xk+1 = xk - F(xk) / F'(xk), where F' denotes the derivative of F.
xk+1 corresponds to the intersection of the tangent to the graph of F at xk with the horizontal axis.
xk+1 is the solution to the first order Taylor expansion of F around xk, which is written in linear form as F(xk) + F'(xk) (xk+1 - xk) = 0.
Thus the method repeatedly solves a linear approximation to a non-linear equation and converges rapidly near the roots of the equation.
The method is then applied to finding a critical point of a cost function f on R^n simply by taking F(x) to be grad f(x), the vector of partial derivatives of f with respect to x1 through xn. In Euclidean space this is the Euclidean gradient of f. More generally, instead of forming the difference xk+1 - xk, we obtain xk+1 from a tangent vector in the tangent space at xk. The step control is either a line search or a trust region.
Newton's method is used, for example, as the default method in Mathematica's FindRoot.
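A minimal sketch of the scalar iteration (the example function and tolerance are illustrative):

def newton(F, Fprime, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = F(x) / Fprime(x)    # intersection of the tangent at x with the horizontal axis
        x = x - step
        if abs(step) < tol:        # quadratic convergence near the root
            break
    return x

# illustrative use: root of x^2 - 2
print(newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0))   # ~1.41421356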

#coding exercise
Write an RBAC open source program
along the lines of  Permission.Demand()
and declarative [PermissionAttribute(SecurityAction.Demand, Role="User")]

function (role, permission)
{
    switch (role)
    {
        case User:
            if (has_permission(permission))    // lookup in the role-permission table
                return Operation.Allowed;
            return Operation.Disallowed;
        case Admin:
            if (for_trustedbase(permission))   // admins may not touch the trusted base
                return Operation.Disallowed;
            return Operation.Allowed;
        default:
            throw;                             // unknown role
    }
}

Thursday, October 23, 2014

#codingexercise

Putting optimization technique to applications

Let us write a program to lay out users and their resources. There may be many users sharing many resources. Given the nature of the problem, we could draw users and resources as two parallel rows and connect them with crossing lines. But let us also add a constraint: each user is matched with one share, and we label the pair with the same name, although the order of the two rows is not necessarily the same. First let us connect the users to their resources with little or no intersecting lines. Then we overlay the many-to-many problem.

When connecting the users to the resources on a one-to-one basis while keeping the lines from crossing each other, one technique is to ensure that the pairings between users and resources progress incrementally, skipping over those that do not. To do so, we keep one of the two rows ordered and then compare the other. Let us do this by sorting the set by users first.
Let us say there are n users. We initialize the longest increasing subsequence count (aka lis) to one for each of these users.
Then, starting from the second user through the last, for each one we iterate over all the previous users to see which one extends an increasing subsequence, and choose the maximum.
for (int i = 1; i < n; i++) {
    for (int j = 0; j < i; j++) {
        if (users[j].resource < users[i].resource   // we have an increasing pair
            && lis[i] < lis[j] + 1)
            lis[i] = lis[j] + 1;
    }
}
return lis.max();

When connecting many users and resources, we take the users and resources to be the nodes of the graph with existing edges that we have drawn and then draw the remaining edges based on the simulated annealing approach discussed earlier.
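A generic annealing sketch in that spirit (the domain ranges and the cost function, which here would count line crossings in the drawn layout, are hypothetical parameters):

import math
import random

def annealing_optimize(domain, costf, T=10000.0, cool=0.95, step=1):
    # start with a random layout; domain[i] is the (low, high) range for coordinate i
    vec = [random.randint(lo, hi) for lo, hi in domain]
    while T > 0.1:
        i = random.randint(0, len(domain) - 1)
        direction = random.choice([-step, step])
        candidate = vec[:]
        candidate[i] = max(domain[i][0], min(domain[i][1], candidate[i] + direction))
        current, proposed = costf(vec), costf(candidate)
        # always accept improvements; accept worse layouts with probability e^(-delta/T)
        if proposed < current or random.random() < math.exp(-(proposed - current) / T):
            vec = candidate
        T *= cool
    return vec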