Friday, October 31, 2014

In my previous post, I mentioned an example of LDAP login. There are quite a few open source OpenID and OAuth packages as well. I was looking at one in particular called passport-oauth, which is written in Node.js. As we know, OAuth is used for authentication and authorization, with scopes providing resource access granularity. This package eases authorization by making it provider specific instead of using the OAuth framework directly. Its interface is cleaner to use in that it takes a provider and an OAuth strategy and returns a token.
var passport = require('passport'),
    OAuthStrategy = require('passport-oauth').OAuthStrategy;

passport.use('provider', new OAuthStrategy({
    requestTokenURL: 'https://www.provider.com/oauth/request_token',
    accessTokenURL: 'https://www.provider.com/oauth/access_token',
    userAuthorizationURL: 'https://www.provider.com/oauth/authorize',
    consumerKey: '123-456-789',
    consumerSecret: 'shhh-its-a-secret',
    callbackURL: 'https://www.example.com/auth/provider/callback'
  },
  function(token, tokenSecret, profile, done) {
    User.findOrCreate(..., function(err, user) {
      done(err, user);
    });
  }
));

Note that the provider need not be a third-party referral site; it can be a corporate OTP provider. However, the service implementation could unify both LDAP and OTP so that authentication is in one place.
This will simply be:
def login(user, password):  # LDAP
def login(user, token):  # OTP
Now let's look at the OTP provider, which grants a token as well. It takes a request and generates a secret. The secret, along with a PIN, is used as the credential to be validated. The site's request and response are important here for the service implementation to decide what to look up.
The request initiated by the service provider to the identity provider is a SAML message that contains both pieces of information: the type of request and the callback,
as in https://openam.example.com:8443/opensso/SSORedirect/metaAlias/internal/externalidpv2?SAMLRequest=&RelayState=<callbackURL>
The SAMLRequest is a base64-encoded SAML message, usually including both a <saml:AuthnStatement>, which describes the act of authentication at the identity provider, and a <saml:AttributeStatement>, which asserts a multivalued attribute associated with this principal (aka subject).
The same could perhaps be done via OAuth using the password grant:
curl -i -k -X POST -u user:password \
 -d "grant_type=password&username=whoisthis&password=OTPCode&scope=what" \
 https://openam.example.com:8443/openam/oauth2/access_token
{
    "expires_in": 599,
    "token_type": "Bearer",
    "refresh_token": "f6dcf133-f00b-4943-a8d4-ee939fc1bf29",
    "access_token": "f9063e26-3a29-41ec-86de-1d0d68aa85e9"
}

$ curl https://openam.example.com:8443/openam/oauth2/tokeninfo\
?access_token=f9063e26-3a29-41ec-86de-1d0d68aa85e9
{
    "mail": "demo@example.com",
    "scope": [
        "mail",
        "cn"
    ],
    "cn": "demo",
    "realm": "/",
    "token_type": "Bearer",
    "expires_in": 577,
    "access_token": "f9063e26-3a29-41ec-86de-1d0d68aa85e9”

}


Thursday, October 30, 2014

We continue our discussion on conjugate gradient methods. First we will discuss the method of steepest descent, because in our earlier description we hinted at excluding certain descent directions but were not clear about the choice of direction or the step length. When the matrix is positive definite, the function f(x) is a paraboloid. We want to reduce the function. In the method of steepest descent, we start at an arbitrary point x0 and slide down to the bottom of the paraboloid. We take a series of steps by choosing a direction and a step length. We choose the direction in which f decreases most quickly, which is the direction opposite to the gradient f'(x). When we take a step, there is an error, which is the distance from the solution. Transforming this error by the matrix gives a vector called the residual (r = b - Ax), and it points in the direction of steepest descent.
Next we choose the step length. This is where we see an interesting observation. We want to choose the step length that minimizes f along a line. We are restricted to choosing a point on the intersection of the vertical plane and the paraboloid. We choose the new point as x_new = x_old + alpha * r, where alpha is the step length and r is the residual. We can pick a residual, but we don't know what step length to take along it. At the bottom of the paraboloid the gradient is zero and the function cannot be minimized any more, so the step length is zero and the iteration terminates. The step length minimizes the function when the directional derivative along the search line is equal to zero. The directional derivative can be expressed as the dot product of the gradient and the search direction (the residual). Setting it to zero, which says their dot product is zero, means that the two vectors are orthogonal. As we progress down the search line, the rate of increase of f is the projection of the gradient onto the search line, and this projection is zero at the minimum. We can conveniently pick the next direction as the one orthogonal to the search line at that minimum. Thus we trace a zig-zag path during the iteration, which appears because each gradient is orthogonal to the previous gradient.
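A rough numpy sketch of this iteration, under the assumption that A is symmetric positive definite and that we stop on a small residual norm (the matrix and vector below are made-up values for illustration):

import numpy as np

def steepest_descent(A, b, x0, tol=1e-8, max_iter=1000):
    # minimize f(x) = 1/2 x'Ax - b'x for symmetric positive definite A
    x = x0.astype(float)
    for _ in range(max_iter):
        r = b - A @ x                    # residual = -gradient = steepest descent direction
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ (A @ r))  # step length that makes the new gradient orthogonal to r
        x = x + alpha * r
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(steepest_descent(A, b, np.zeros(2)))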

#codingexercise:

LDAP is the triangle in which we want to put all devices and applications. How do we authenticate a user using LDAP?

import io
import ldap
import ldif
from contextlib import redirect_stdout

def load_user(id):
   ld = ldap.initialize('ldap://localhost:1389')
   name = id
   ld.simple_bind_s()
   basedn = "ou=people,dc=example,dc=com"
   filter = "(|(cn=*" + name + "*)(sn=*" + name + "*))"
   results = ld.search_s(basedn, ldap.SCOPE_SUBTREE, filter)
   with io.StringIO() as buf, redirect_stdout(buf):
       ldif_writer = ldif.LDIFWriter(buf)
       for dn, entry in results:
           ldif_writer.unparse(dn, entry)
           output = buf.getvalue()
           uname = ""
           uid = ""
           if id in output:
               for line in output.splitlines(True):
                   if "dn:" in line:
                       kvp = line.replace("dn: ", "").strip().split(',')
                       uid = str([i.replace("uid=", "") for i in kvp if 'uid=' in i])
                   if "cn:" in line:
                       uname = line.replace("cn: ", "").strip()
               if uname and uid:
                   ld.unbind_s()
                   return Users(uname, uid, True)   # Users is the application's user model
   ld.unbind_s()
   return Users('Anonymous', 'unknown', False)

Wednesday, October 29, 2014

In today's post we discuss conjugate gradient methods. This is a very popular method for solving systems of the form Ax = b where x is an unknown vector, b is a known vector and A is a known square matrix. Iterative methods like CG are suited for use with sparse matrices. If A is dense, then it is better factored and the equation then solved by back substitution. This is because factorization and iteration have roughly the same cost for dense matrices. On the other hand, factorization of sparse matrices is much more expensive, and so is the back-solving step, which involves trying with different values of b.
Equations of the form Ax = b are linear, so we can solve for the point of intersection of n hyperplanes, each of dimension n-1. The equation can also be written in a quadratic form, f(x) = 1/2 x'Ax - b'x + c, which we can view as a surface whose minimum point is the solution to Ax = b. When the quadratic form is expressed in contours, they are concentric ellipsoidal curves, each of which has a constant f(x). The gradient of the quadratic form points in the direction of the steepest increase of f(x), and is orthogonal to the contour lines. Gradients come in useful when they are set to zero, in which case the function is minimized.
If we wonder why we transform a simpler looking equation Ax = b into the tougher looking quadratic form, it's because the methods we are looking at were developed and intuitively understood in terms of minimization problems and not in terms of intersecting hyperplanes.
It also helps to think of these problems in terms of eigenvectors and eigenvalues because they serve as analysis tools. The steepest descent and CG methods don't calculate eigenvectors as part of their algorithms. We might recall an eigenvector v from previous posts as a nonzero vector that does not rotate when B is applied to it, except perhaps to point it in precisely the opposite direction. v may change length or reverse its direction but it won't turn sideways. In other words there is some scalar constant lambda such that Bv = lambda v, and this lambda is the eigenvalue.
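As a small numerical check of the quadratic form mentioned above (assuming numpy; the A, b and c values are arbitrary), the gradient of f(x) = 1/2 x'Ax - b'x + c for symmetric A is Ax - b, which vanishes exactly at the solution of Ax = b:

import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
c = 0.0

def f(x):
    # quadratic form f(x) = 1/2 x'Ax - b'x + c
    return 0.5 * x @ A @ x - b @ x + c

def grad_f(x):
    # for symmetric A the gradient is Ax - b; it is zero at the solution of Ax = b
    return A @ x - b

x_star = np.linalg.solve(A, b)
print(f(x_star), grad_f(x_star))   # the gradient is ~0 at the minimizer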

Courtesy: Shewchuk 1994 An introduction to conjugate gradient method without the agonizing pain.

Tuesday, October 28, 2014

In today's post I will return to the matrix factorization and optimization techniques. We were discussing Newton's method. This is a method that finds a root by iteration, using the Taylor series and converging quickly near the roots. Keeping the Taylor series to the first order, this method finds the tangent line to the curve (x, f(x)) at x0 and the point where the tangent line intersects the x-axis.
Assuming that x_k+1 converges towards x, we can define the error at the k-th step; when we expand it, we see that the method converges quadratically. We will next discuss line search and trust region methods, which we use to compute the increments. In line search, we pick a direction and then decide how far to move along that direction.
Each iteration of the line search is computed this way:
x_k+1 = x_k + alpha_k * p_k
where the incremental step depends both on the direction p_k and the step length alpha_k. Generally, the direction is chosen so that the step reduces the function. The search is done in two phases: the bracketing phase, where a range is found to bound the step length, and an extrapolation phase, where a good step length is picked within the range. The iteration terminates when there is said to be a sufficient reduction of the function.
The trust region is more robust. It recognizes that the approximation model chosen to compute the next iteration is only applicable locally. Therefore there is a trust region in which the model applies. The line search could be an approximation model too. As the model is revised for each iteration, the trusted region should reduce as the function converges. This is the termination condition for the iteration.
The trust region methods are considered newer than the line search methods. The difficulty with line search methods was that there may not be a sufficient reduction in f at each step. The condition we maintain is that the reduction in f should be proportional to both the step length alpha_k and the directional derivative. If we plot the function value along the search direction, denoted by phi, against the step length, the line defined by that proportionality cuts through this curve and separates the graph into three regions: the first and last portions, where the curve lies below the line and the step is therefore acceptable, and the middle portion, which isn't. Thus the sufficient decrease condition is not enough by itself to ensure that the algorithm makes reasonable progress, because we see that it is satisfied for all sufficiently small values of the step length.
To rule out unacceptably short steps, a second requirement was introduced, called the curvature condition, where we check that the slope at the candidate step length is at least a constant times the initial slope. In other words, if the slope is still strongly negative, we have an indication that we can reduce f significantly by moving further along the chosen direction. So we choose a desired slope and we reject step lengths whose slope is still too strongly negative, or where the reduction is small or even positive. This helps us avoid making very small, unproductive steps. Together the sufficient decrease and curvature conditions are known collectively as the Wolfe conditions.
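A minimal sketch of checking the two Wolfe conditions for a candidate step length, assuming the function and its gradient are available as Python callables and using the customary constants 0 < c1 < c2 < 1:

import numpy as np

def satisfies_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    # sufficient decrease (Armijo): f(x + alpha p) <= f(x) + c1 * alpha * grad(x)'p
    phi0 = f(x)
    dphi0 = grad(x) @ p                # initial slope along the search direction
    sufficient_decrease = f(x + alpha * p) <= phi0 + c1 * alpha * dphi0
    # curvature: slope at the new point is no more negative than c2 times the initial slope
    curvature = grad(x + alpha * p) @ p >= c2 * dphi0
    return sufficient_decrease and curvature

# usage on a simple quadratic
f = lambda x: x @ x
grad = lambda x: 2 * x
x = np.array([1.0, 1.0])
p = -grad(x)                            # steepest descent direction
print(satisfies_wolfe(f, grad, x, p, alpha=0.25))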

For sparse matrices, conjugate gradient methods are better suited. We will discuss this next.

Monday, October 27, 2014

In today's post I want to discuss a topic different from the previous few posts and perhaps one time only.
This is about supporting Windows file-system-like ACLs using row-level security in a database.
Let us say resources are saved in a table where we want to specify an access control list for this resource so that it can be validated against policies for different access. We will leave out the case where the resources can be hierarchical.
In this case, we want to add a column to the resource that describes this control. It could be a user name or it could be a label or a foreign key to a set of markings that represent the sensitivity or the permissions of the data.
Keeping track of usernames, for example, is helpful in the case where all accesses are directly isolated per user. In other words, users don't have to share ownership of the folders, and their accesses are essentially to manage the lifetime of their own folders.
We can also create a label for each resource if we know that the labels are distinct and symbolic.  In most cases where the operations are predefined but their actors are not, we can use different labels. For example, we can say whether the resource is read-only or read-write.
However, users and their access are both dynamic and they can be hierarchical. In this case, hierarchical means that entries can inherit top down or gain incremental privileges bottom up. In addition, a user may grant or revoke a permission, take ownership, etc.
In such a case, its better to view the labels as an output resulting from a schema of categories and markings.
The categories and markings are predefined in a system. The categories represent a domain of markings. This can be a classification that is hierarchical or compartments that are distinct; they can be one-to-one or zero-to-many in how many values can be applied, and may have a comparison rule, such as any or all. The markings are the compartment or classification values we would like to divide the resources into. Resources can belong to one or more of these. They can have a name.
A particular combination of markings could translate to a unique label for the resource. Usually this table is populated on demand by say a stored procedure. If there is no corresponding row to the combination of markings, then a new row is added and a unique ID is returned as the label.
We join the label and marking in a LabelMarking table to facilitate the translation of the labels.
Having described row level security, let us now look at populating it for file system ACLs.
Here we begin by listing markings such as full control, modify, read-write, read, and read-and-execute. A combination of these markings can make up a permission (we select those that are granted).
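A minimal sketch of the on-demand label lookup described above, using an in-memory Python mapping instead of an actual stored procedure (the names here are made up for illustration):

# labels maps a frozenset of markings (e.g. {"read", "execute"}) to a unique label id
labels = {}
next_label_id = 1

def get_or_create_label(markings):
    # a particular combination of markings translates to a unique label;
    # if no row exists for this combination, add one and return the new id
    global next_label_id
    key = frozenset(markings)
    if key not in labels:
        labels[key] = next_label_id
        next_label_id += 1
    return labels[key]

print(get_or_create_label(["read", "execute"]))   # 1
print(get_or_create_label(["execute", "read"]))   # 1 again - same combination
print(get_or_create_label(["full control"]))      # 2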

Sunday, October 26, 2014

We review another chapter from the same book as mentioned in the previous post.
We will briefly mention multidimensional scaling.
Like clustering, this is an unsupervised technique that does not make predictions but just makes it easier to understand how different items are related. It creates a lower-dimensional representation of a dataset in which the distances are maintained, as far as possible, as they were in the original dataset. If the data is in N dimensions, we reduce it to two dimensions.
Suppose we have a data set in the three dimensional space (x, y, z)
Then the Euclidean distance between any pair of points is sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2)
Then, if we calculate an N x N matrix for the distance between every pair of points, we use this formula over and over for every pair and come up with something like this:

         A     B    C    D
A      0.1   0.2  0.3  0.4
B      0.1   0.3  0.15 0.4
C       ..
D      ..

This computation is expensive. The goal here is to draw all the items in a two-dimensional chart so that the distances are as close as possible to their distances in the original higher-dimensional space.
To do this we first place all the items randomly on a chart and compute the distances between items. For every pair of items, the target distance is compared to the current distance and an error term is calculated. Every item is moved a small amount closer or farther in proportion to the error between the two items. Every item is moved according to the combination of all the other items pushing and pulling on it. Each time this is done, the difference between the current distances and the target distances should get smaller, and we repeat the iterations until it no longer does.

# assuming:
#   realdist[j][k] - distances between items j and k in the higher dimension
#   fakedist[j][k] - distances between the current 2-D positions in loc
#   grad[k]        - accumulated movement for item k, reset to [0, 0] each pass

totalerror = 0
for k in range(n):
    for j in range(n):
        if j == k:
            continue
        # the error is proportional to the relative difference between the distances
        errorterm = (fakedist[j][k] - realdist[j][k]) / realdist[j][k]
        # each item is pushed or pulled along the line to the other item in proportion to the error
        grad[k][0] += ((loc[k][0] - loc[j][0]) / fakedist[j][k]) * errorterm
        grad[k][1] += ((loc[k][1] - loc[j][1]) / fakedist[j][k]) * errorterm
        totalerror += abs(errorterm)
print(totalerror)

# stop when moving the points no longer reduces the error
if lasterror and lasterror < totalerror:
    return loc
lasterror = totalerror

# move each point by the learning rate times the gradient
for k in range(n):
    loc[k][0] -= rate * grad[k][0]
    loc[k][1] -= rate * grad[k][1]

return loc

Saturday, October 25, 2014

Today  I want to review a chapter first from the programming collective intelligence book. In particular, let us look at a few numerical methods:
There are a number of numerical functions available:
1) Trigonometric functions like sine, cosine, and tangent.
2) Other mathematical functions like power, square root, and absolute value.
3) Statistical distributions, such as a Gaussian
4) Distance metrics, like Euclidean and Tanimoto distances
5) A three parameter function that returns -1 if the first parameter is between second and third
function (a, b, c){
  if (b <= a && a <= c) return -1;
  return 0;
}
6) A three parameter function that returns 1 if the difference between the first two parameters  is less than the third
function (a, b, c){
if ( Math.abs(a-b) < Math.abs(c)) return 1;
return 0;
}

Numerical methods are reactive. They are dependent on the input. This is the right approach for mathematical functions; however, for efficiency, programs have to store information for the next round.
One way to do this is to create tree nodes that can store and retrieve values from predefined slots. A store node has a single child  and an index of a memory slot, it gets the result from its child and stores it in the memory slot and then passes this along to its parent. A recall node has no children and simply returns the value in the appropriate slot. If a store node is at the top of a tree, the final result is available to any part of the tree that has the appropriate recall node.
In addition to the individual memory, there can be shared memory set up where programs can share a set of slots that all have access to read and write. This might remind of parallel computing with higher levels of cooperation and competition.

Some other functions include :
Gini Impurity : This is a measure of how impure a set is. If you have a set of items such as [A,A, B,B,B,C,C] then Gini impurity tells you the probability that you would be wrong if you picked one item and randomly guessed its label. If the set were all A instead and your guess was A, then you would never be wrong and you would have a pure set.
I_G(i) = 1 - Σ_{j=1..m} f(i,j)^2 = Σ_{j≠k} f(i,j) f(i,k)

Entropy is another way to see how mixed  a set is:
It's a measure of how surprising a randomly selected item from the set is. For a pure set, the entropy would be zero.
H(X) = Σ_{i=1..n} p(x_i) log2(1 / p(x_i)) = -Σ_{i=1..n} p(x_i) log2 p(x_i)
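A quick Python sketch of both measures, assuming the set is given as a plain list of labels; this just mirrors the formulas above:

from collections import Counter
import math

def gini_impurity(items):
    # probability of guessing the wrong label when picking an item and a label at random
    total = len(items)
    counts = Counter(items)
    return 1 - sum((c / total) ** 2 for c in counts.values())

def entropy(items):
    # expected surprise of a randomly selected item; zero for a pure set
    total = len(items)
    counts = Counter(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(gini_impurity(['A', 'A', 'B', 'B', 'B', 'C', 'C']))  # ~0.653
print(entropy(['A', 'A', 'A']))                            # 0.0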

Variance is how much a list of numbers varies from the average value. It is found by averaging the squared difference of every number from the mean:

sigma^2 = (1/N) Σ_{i=1..N} (x_i - x̄)^2

def variance(vals):
  mean = float(sum(vals))/len(vals)
  s = sum([(v-mean)**2 for v in vals])
  return s/len(vals)

The Gaussian function is the probability density function of the normal curve, with standard deviation sigma:
G(x) = (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
import math
def gaussian(dist, sigma=10.0):
    exp = math.e ** (-dist ** 2 / (2 * sigma ** 2))
    return (1 / (sigma * (2 * math.pi) ** .5)) * exp

We discussed the non-negative matrix factorization method earlier that simply requires you to call the factorize function with a list of observations and the number of underlying features and gives you the weights and the features. It performed the factorization based on the following four update matrices:
hn the transposed weights matrix multiplied by the data matrix
hd the transposed weights matrix multiplied by the weights matrix multiplied by the features matrix.
wn the data matrix multiplied by the transposed features matrix.
wd the weights matrix multiplied by the features matrix multiplied by the transposed features matrix.
The weights and the features computed may not be the same all the time for small data but for larger data they are usually consistent.

Friday, October 24, 2014

Today we continue our discussion on matrix methods of optimization. We mentioned linear and second order methods. Newton's method is a good example of the latter. It finds a solution to an equation in one unknown. This method is an iterative method that seeks a critical point of the cost function f ( the gradient to f tends to zero ) by selecting an update vector at xk along which the directional derivative of grad f is equal to -grad f(xk). The second order information on the cost function is incorporated through the directional derivative of the gradient.
The iteration of this method now follows:
From an initial point x0 for the function F on R^N, it constructs a sequence of iterates according to x_k+1 = x_k - F(x_k) / F'(x_k), where F' denotes the derivative of F.
xk+1 corresponds to the intersection of the tangent to the graph of F at xk with the horizontal axis.
x_k+1 is the solution to the first-order Taylor expansion of F around x_k, which is written in linear form as F(x_k) + F'(x_k)[x_k+1 - x_k] = 0.
Thus this method is a linear approximation to a non-linear equation and has a convergence near the roots of the equations.
This method is then applied to finding the critical point of a cost function f on R^n simply by taking F(x) to be grad f, which is the transpose of the partial derivatives of f with respect to x_1 through x_n. On a plane this is the Euclidean gradient of f. Instead of finding the difference x_k+1 - x_k, we find x_k+1 from the tangent vector in the tangent space at x_k. The step control is either line search or trust region.
Newton's method is used as the default method in computing FindRoot.
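A minimal sketch of the one-dimensional iteration described above, assuming the caller supplies F and its derivative as Python callables:

def newton(F, Fprime, x0, tol=1e-10, max_iter=50):
    # x_k+1 = x_k - F(x_k) / F'(x_k): intersect the tangent at x_k with the horizontal axis
    x = x0
    for _ in range(max_iter):
        step = F(x) / Fprime(x)
        x = x - step
        if abs(step) < tol:       # converges quadratically near a root
            break
    return x

# find the root of F(x) = x^2 - 2, i.e. sqrt(2)
print(newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0))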

#coding exercise
Write an RBAC open source program
along the lines of  Permission.Demand()
and declarative [PermissionAttribute(SecurityAction.Demand Role="User")]

function (role, permission)
{
   switch (role)
   {
          case user:
                  if (has_permission(permission)) // lookup table
                     return Operation.Allowed;
                  return Operation.Disallowed;
        case Admin:
                if (for_trustedbase(permission))
                    return Operation.Disallowed;
                return Operation.Allowed;
       default:
                 throw;
   }
}

Thursday, October 23, 2014

#codingexercise

Putting optimization technique to applications

Let us write a program to lay out users and their resources. There may be many users sharing many resources. Due to the nature of the problem we could have users and resources as two parallel rows and connect them with crosses. But let us also add a constraint: each user is matched with one share. Let us label them both with the same name. However their order is not necessarily the same. First let us connect the users to their resources with little or no intersecting lines. Then we overlay the many-to-many problem.

When connecting the users to the resources on a one-to-one basis and keeping the lines from crossing each other, one technique would be to see that the pairings between users and resources progress incrementally and to skip over those that don't. To do so, we keep one of them ordered and then compare the other. Let us do this by sorting the set based on users first.
Let us say there are n users. We initialize the longest incremental subsequence (aka lis) to one for each of these users.
Then  starting from the second to the last user, for each one, we start iterating over all the previous users to see which one will increase our count of incremental subsequences and choose the maximum.
for (int i = 1; i < n; i++) {
   for (int j = 0; j < i; j++) {
     if ( users[j].resource < users[i].resource  // we have an incremental pair
                  && lis[i] < lis[j] + 1)
                         lis[i] = lis[j] + 1;
     }
 }
return lis.max();

When connecting many users and resources, we take the users and resources to be the nodes of the graph with existing edges that we have drawn and then draw the remaining edges based on the simulated annealing approach discussed earlier.

In our previous post on use of optimization for flight searches from the Programming Collective Intelligence book, we discussed optimizing for costs and then by preferences. We now explore other optimization techniques such as matrix methods from a paper by Gill, Murray, Saunders, Wright et al. Here the optimization can be written in the form of matrices.
In particular we will look at the sparse matrix methods in optimization. As with most optimization techniques we solve a set of linear equations : Bkyk = bk.  Both direct and iterative equation solvers are used.
While a single system of equations was predominantly being solved, soon there was a need for different systems of equations to be solved together. The single-system solution was then applied repeatedly to a sequence of modified systems. The speed of factorizing a matrix then becomes less important than solving the subsequent many systems. Furthermore, the solution Bk is related to Bk-1.
The simplest example of the use of matrices for such problems is the example of linear programming which tries to find the optimal solution as a region bounded by a set of linear equations in a convex polytope. It has applications in routing, scheduling, assignment and design.
Linear programming has been applied to problems of all sizes; however, the relative costs of the steps of the optimization methods change when the problems become large. For example, when there is no constraint, the cost function is measured based on the number of evaluations, but subsequent recomputations are more expensive because there is more data structure and computation involved. Instead we express the evaluation in a way where the (k+1)th iterate is defined as x_k+1 = x_k + alpha_k * p_k and depends on the previous one.
 We will discuss Newton and Conjugate gradient methods in this regard.

Wednesday, October 22, 2014

#codingexercise
int MaxR(ref List<int> items, int start, int end)
{
  if (items == null || items.Count == 0 || start < 0 || end >= items.Count || start > end) throw new Exception();
  if (start == end) return items[start];
  int mid = checked((start + end) / 2);
  int left = MaxR(ref items, start, mid);
  int right = MaxR(ref items, mid + 1, end);
  return left > right ? left : right;
}

In continuation of our posts on the applications of machine learning from the book Programming Collective Intelligence, we will now see an interesting application of optimization in real flight searches. The Kayak API is a popular vertical search engine for travel. After registering for developer access to the Kayak package, which gives us a developer key, we get a new Kayak session from the Kayak API server. With this session, we can do different flight searches. Note that the need for a session, the use of XML-based responses, and the extra long URLs with query strings comprising several different search parameters are all reminiscent of the APIs of those times. With just the query parameters for session id, destination and depart date, it returns a search id with no results that can be used for polling. With the search id and the session id, we can write the polling function as one that requests the results until there are no more. There is a tag called morepending which has a value of true until it's all done. The prices, departures and arrivals lists retrieved from the query are zipped into a tuple list. The flights are returned primarily in order of price and secondarily in order of time. When we try to create the schedule for the arrivals and departures of the Glass family, where the family members visit from different places and leave after the holiday, this flight search is simply performed over and over again for the outbound and return flights of each individual. Then we can optimize the flights using the actual flight data in the costing function. The Kayak searches can take a while, so we merely consider this search for the first two passengers. The costing function now takes the domain as a range multiplied by the count of the outbound and return flights of these two passengers, and we run the same optimization as discussed earlier over this domain. The actual flight data gives us more interesting ways to expand our optimization.
This far we have discussed optimization based on costs. However, there is such a thing as preferences, which may or may not be directly tied to costs. Preferences are user-attributed, and this is a ranking that makes the people happy, or as little annoyed as possible. The information is taken from the users and combined to produce the optimal result. For example, everyone may want a window seat on the same flight, but that may not be possible. If we make the cost function return very high values for invalid outcomes, we can resolve this conflict.

Tuesday, October 21, 2014

#codingexercise
Convert numbers to Roman
public string GetRoman(int t)
{
  var numbers = new List<int> { 1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1 };
  var numerals = new List<string> { "M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I" };
  string ret = string.Empty;
  int i = 0;
  while (t > 0)
  {
    int digit = t / numbers[i];
    for (int j = 0; j < digit; j++)
    {
      ret += numerals[i];
    }
    t -= digit * numbers[i];
    i = i + 1;
  }
  return ret;
}

#codingexercise
public int GetInteger(string numeral, Dictionary<string, int> dict)
{
  // dict maps single numerals ("I", "V", ...) and subtractive pairs ("IV", "IX", "CM", ...) to values
  int i = 0;
  int number = 0;
  while (i < numeral.Length)
  {
    if (i + 1 < numeral.Length && dict[numeral[i + 1].ToString()] > dict[numeral[i].ToString()])
    {
      number += dict[numeral.Substring(i, 2)];
      i = i + 2;
    }
    else
    {
      number += dict[numeral[i].ToString()];
      i = i + 1;
    }
  }
  return number;
}
public string GetRoman(int t) // companion conversion, defined above

Monday, October 20, 2014

#codingexercise
int RomanToInt(string s)
{
  if (String.IsNullOrEmpty(s)) return 0;
  // validate: s should match ^[IVXLCDM]+$
  var dict = new Dictionary<char, int>();
  dict['I'] = 1;
  dict['V'] = 5;
  dict['X'] = 10;
  dict['L'] = 50;
  dict['C'] = 100;
  dict['D'] = 500;
  dict['M'] = 1000;
  int ret = 0;
  char temp;

  var st = new Stack<char>();
  st.Push(s[0]);

  for (int I = 1; I < s.Length; I++)
  {
    if (st.Count > 0 && dict[s[I]] > dict[st.Peek()])
    {
      // subtractive pair such as IV or CM: undo the smaller value already counted
      temp = st.Pop();
      ret = ret - dict[temp];
      st.Push(s[I]);
    }
    else
      st.Push(s[I]);
  }

  while (st.Count > 0)
  {
    ret += dict[st.Peek()];
    st.Pop();
  }
  return ret;
}
// there are other ways of doing this but I'm just listing the earliest one.

int RomanToInt(string s, int index, Dictionary<char, int> dict)
{
  // validate s as not empty, not null, and roman numerals only
  if (index == s.Length) return 0;
  if (index - 1 >= 0 && dict[s[index]] > dict[s[index - 1]])
    return (0 - 2 * dict[s[index - 1]]) + dict[s[index]] + RomanToInt(s, index + 1, dict);
  else
    return dict[s[index]] + RomanToInt(s, index + 1, dict);
}

Sunday, October 19, 2014

Non-negative matrix factorization: This is a technique used for extracting the important features of the data. This technique involves matrix multiplication where a matrix with  r number of columns  is multiplied with another that has the same number of rows r and a cell in the resulting matrix is  scalar product of the values in the corresponding same row in the first matrix with same column in the second matrix. It also involves transposing where the matrix elements are written in column order first.
Given a set of articles containing words, we can write the usual article matrix where the articles will be the rows and the words appearing in them will be the columns. The words are the properties for the articles and we take words that are common but not too common :
wordvec = []
for w,c in allw.items():
   if c > 3 and c < len(articlew) * 0.6:
    wordvec.append(w)
This technique is about factorizing the article matrix into two matrices: a features matrix and a weights matrix. The features matrix has a row for each feature and a column for each word. The weights matrix tells us how much each feature applies to each article: each row is an article and each column is a feature.
The articles matrix may be large. The purpose of this technique is to reduce the matrix to a smaller set of common features. This smaller set of features could be combined with different weights to get back the article matrix. Although it is unlikely to get back the original, this technique attempts to get back as closely as possible.
It's called non-negative because it returns features and weights with no negative values. Since we count the inclusions or occurrences of a word, this technique cannot model the exclusion of words. This approach doesn't look for the best possible factorization, but its results are often easier to interpret.
The algorithm we use in this approach would have a utility function to determine how close the reconstruction is. This is done with a function that sums the squares of the differences between two equal sized matrices. Then we gradually update the matrix to reduce the cost function.
Here we could use a number of the optimization techniques such as annealing but we discuss multiplicative update rules.
Given the original matrix, which this optimization technique calls the data matrix, we make four new update matrices:
hn = the transposed weight matrix multiplied by the data matrix.
hd = the transposed weights matrix multiplied by the weights matrix multiplied by the features matrix.
wn = the data matrix multiplied by the transposed features matrix.
wd = the weights matrix multiplied by the features matrix multiplied by the transposed features matrix.
The weights and features matrix are initialized with random values.
To update the features and weights matrices, all these matrices are converted to arrays.
Every value in the features matrix is multiplied by the corresponding value in hn and divided by the corresponding value in hd. Likewise, every value in the weights matrix is multiplied by the corresponding value in wn and divided by the value in wd.
This is repeated until the number of features are found or the number of iterations is exceeded.
from numpy import matrix, array, shape, transpose, random

def difcost(a, b):
    # sum of the squares of the differences between two equal-sized matrices
    dif = 0
    for i in range(shape(a)[0]):
        for j in range(shape(a)[1]):
            dif += pow(a[i, j] - b[i, j], 2)
    return dif

def factorize(v, pc=10, iter=50):
    ic = shape(v)[0]
    fc = shape(v)[1]
    # Initialize the weight and feature matrices with random values
    w = matrix([[random.random() for j in range(pc)] for i in range(ic)])
    h = matrix([[random.random() for i in range(fc)] for i in range(pc)])
    # Perform operation a maximum of iter times
    for i in range(iter):
        wh = w * h
        # Calculate the current difference
        cost = difcost(v, wh)
        if i % 10 == 0: print(cost)
        # Terminate if the matrix has been fully factorized
        if cost == 0: break
        # Update feature matrix
        hn = (transpose(w) * v)
        hd = (transpose(w) * w * h)
        h = matrix(array(h) * array(hn) / array(hd))
        # Update weights matrix
        wn = (v * transpose(h))
        wd = (w * h * transpose(h))
        w = matrix(array(w) * array(wn) / array(wd))
    return w, h

Recall that an eigenvector and its corresponding eigenvalue are another way to factor the matrix into vectors, and since we know that det(A - lambda I) = 0, we can solve for the eigenvalues and eigenvectors. This may be useful for finding a single feature.
 Another way to do the optimization could be to use the updates of the matrix using the Laplace transform of a matrix valued function as mentioned here (http://see.stanford.edu/materials/lsoeldsee263/10-expm.pdf). This is useful when we don't want to find the Eigen values and we can use the resolvents. Moreover the statewise transitions are linear.
For example, the harmonic oscillator function can be written as
x-dot = [ 0  1 ; -1  0 ] x
and the state transitions are
x(t) = [ cos t  sin t ; -sin t  cos t ] x(0)

Similarly the matrix exponential solution is x(t) = e^(tA) x(0)
where the matrix exponential is e^M = I + M + M^2/2! + ...

For t = 1 and A = [ 0  1 ; 0  0 ], the power series has M^2 = M^3 = ... = 0,
and e^A = I + A = [ 1  1 ; 0  1 ]

We can also explore the weighted Laplacian matrix for the partitioning.
• Larry Hardesty, Short algorithm, long-range consequences, MIT News, 1 March 2013 is an interesting read on the subject.
#codingexercise
Get the last minimum value in a set of positive integers
int getLastLocalMin(int[] v)
{
  if (v == null || v.length == 0) return -1;
  for (int u = v.length - 2; u >= 0; u--)
    if (v[u] > v[u + 1]) return v[u + 1];
  return v[0];
}

Saturday, October 18, 2014

#codingexercise
int LastIndexOf(int[] v, int i)
{
  int ret = -1;
  if (v != null)
    for (int u = v.length - 1; u >= 0; u--)
      if (v[u] == i)
        return u;
  return ret;
}

Friday, October 17, 2014

Today I want to discuss building REST based APIs for enabling CRUD operations. Specifically, we will be discussing the need for a store behind the API. As one example, the data model itself can be exposed as REST based, in which case it becomes OData. It helps if the data is in a database, in which case we get the guarantees that come with it, such as operations that are atomic, consistent, isolated and durable. But let us assume that the data is not in a database and that the API layer has to do operations on the wire to fetch the data. In such cases I've used some of the following. First, we keep the resources well identified. By that I mean we use as many columns as necessary to index the data. Indexing has a lot of benefits, but the suggestion here is to decide between keeping the data sorted or building a separate index. Secondly, we could keep a status column that cycles between created, modified and deleted, and some other states in between, depending on what would help. The status column serves to eliminate such concerns as duplicates, retries, consistency, and concurrency on systems that don't support them already. Let us take a look at a few examples now.
First, the user wants to insert a record. Where do we store this. Is it a separate partition, table, database, database server, etc. If there are many of those, then they can probably be arranged in a row or a grid where they are accessible by their position. In fact if we have redundancies we could use the arrangement for the clones so they are side by side with the master in the array.
Next, how do we represent the record is in this location. This can probably be done by keeping the location information as part of the key for the record. A Multipart name could be helpful. Something like database server. database.schema.table etc. Given this record we know how to reach it and that there are no conflicts.
Next, we look at the operations themselves. Do we need transactions or can we do retries? Which one is less intensive and cheaper?
Next we look at operations on the same item. Do we need to make sure the item will be consistent between concurrent operations? The same goes for a sequential train of operations.
Lastly, we look at the state of the records. They should be progressive and cyclical if not terminating.
If the states are forward only, it supports retries. This helps it be robust but the operations may require lookups for each retry.
Retries work something like this:

status retry(Item item)
{
   var state = lookup(item);
   if (state is null)
    {
    // create item
      item.state = created;
     return item.state;
    }
  if (state is created)  
    {
       // modify or delete
       update(item);
      item.state = modified;
      return item.state;
    }
  if (state is notReady)
   {
     ready(item);
     item.state = ready;
     return item.state;
    }
  item.state = null;
  return item.state;
}
#codingexercise

int getcountmatch(List<int> v, int i)
{
  if (v == null) return -1;
  int count = 0;
  for (int k = 0; k < v.Count; k++)
    if (v[k] == i) count++;
  return count;
}
In continuation of our discussion on  classifiers such as Kernel methods, we will now cover Support Vector Machines (SVM). These are probably the most sophisticated of all classifiers. An SVM finds the line  that separates the data more cleanly, which means that its the greatest possible distance from points near the dividing line. The only points necessary to determine where the line should be are the points closest to it and hence are called support vectors. There's no need to go through the training data to classify new points once the line has been found.
As opposed to the linear classification which may mis-classify two points closest to the line while fitting the line for the rest of the data points whose averages are computed, the dividing line is chosen such that it is as far away from the points in the class and the only points needed to make that determination in these cases are the two points. The trick is to find just these support vectors and then draw the dividing line.  Also, kernel methods used dot product to perform a non-linear classification by using the kernel trick where a dot product was used for comparisons. SVMs also use dot products and with kernels to perform non-linear classification. SVM datasets are typically those that are very complex, high dimensional and data intensive. For example, facial expressions can be classified with SVMs. Another important feature for applications is that they can naturally accept input data that are not in the form of vectors such as strings, trees and images.
Mathematically, the underlying idea is that a hyperplane that is far from any observed data points should minimize the risk of making wrong decisions when classifying new data. The hyperplane that solves this constraint is called the optimal separating hyperplane. This hyperplane (w·x + b = 0) can be found as the solution to an equivalent optimization problem:
min over w, b of 1/2 ||w||^2, subject to y_i (w·x_i + b) >= 1 for every training point. Notice that only a small set of the data determines the solution; these are the support vectors, which carry all the information necessary to derive the decision rule. However, when w has very high dimension, this optimization becomes intractable. Instead we solve it with the transformation into the dual problem:
min over alpha of 1/2 alpha'Q alpha - 1'alpha, subject to the constraints alpha_i >= 0 and y'alpha = 0, where Q_ij = y_i y_j x_i·x_j; this has a unique solution when Q is positive semidefinite.
When we solve for alpha, we can get
w = Σ_i alpha_i y_i x_i
b = -1/2 (w·x(+) + w·x(-)), where x(+) and x(-) are two support vectors belonging to opposite classes.
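As a quick illustration of these ideas (this assumes scikit-learn is available and is separate from the derivation above), a library SVM exposes the support vectors directly:

from sklearn import svm

# toy 2-D data: two classes that a maximum-margin line can separate
X = [[0, 0], [1, 1], [0, 1], [3, 3], [4, 3], [3, 4]]
y = [0, 0, 0, 1, 1, 1]

clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# only the points closest to the dividing line are kept as support vectors
print(clf.support_vectors_)
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))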
Courtesy : Wolfram media

Thursday, October 16, 2014

#codingexercise

int getmatch(List<int> v, int i)
{
  if (v == null) return -1;
  for (int k = 0; k < v.Count; k++)
    if (v[k] == i) return k;
  return -1;
}
In this article, we investigate Kernel methods in continuation of our readings on the use of different machine learning techniques in web applications. We can think of these as advanced classifiers in the set of classifiers that included decision trees, Bayesian classifiers and neural networks. A typical example of a dataset for which this kind of classifier is used is matchmaking people, as on a dating site. There are many variables, both numerical and nominal, and many non-linear relationships. One of the things we should realize is that a dataset and a classifier cannot be arbitrarily paired. We have to preprocess the data and select the best algorithm to get good results.

In the matchmaking example we want to build a classifier on, we recognize several different variables such as age, habits, interests, wants children, location etc. Instantly we see how complex this dataset is and how simplifying our assumption is for what we model the dataset to be. Let us say we write a function that can take the population and generate a dataset with values assigned against the variables. This could simply be reading a preprocessed data file. The next step is to take these variables in pairs and make a scatter plot to see how correlated they are. For example, age difference may be more tolerated as we grow older. If we build a decision tree classifier on this scatter plot, we will have split the data on numerical boundaries. This boundary that splits the data into different categories is called the decision boundary and it helps to visualize how divided the data is. Next, we can build a linear classifier that finds the average of all the data in each class, finds a center point and then classifies the new points to the nearest center point. By drawing a line through the averages in the scatter plot, we can take new data for matching based on this variable and predict the match based on how close they are to the nearest average. While closeness of a point to the average was measured with Euclidean distance earlier, we could now improve it with vectors and dot-product. The dot-product is a scalar quantity and can be positive or negative based on the cosine of the angle between the vectors. And with this way to find the closeness, we now have a linear classifier.

However, data may not always indicate a dividing line for a linear classifier to work effectively. In such a case, we use Kernel methods to improve the classifier to do non-linear classification.

But before we jump into the kernel methods let us quickly see how to work with variables that are not necessarily as simple as age to use a scatter plot. For example, interests could be diverse and we may end up creating a new variable for every interest. To overcome this, we represent the interests in a hierarchy and then use the structure to compute the closeness of interests. Together with a higher number of close interests, we can have a much smaller variable for this dataset. Similarly location could be represented by many different things such as street, city, state, country etc. To normalize location, we could use a geocoding API that translates locations into latitudes and longitudes and then computes the distances as miles. Lastly we can scale the dataset for higher or lower resolution to enable applying a linear classifier.

Nevertheless, we may have a complex dataset that is not subject to a linear classifier. For example, if the data were divided by a circle forming two different regions with a concentric average, it is not easy to apply a linear classifier because there is no line. Instead we use kernel methods. For the same example of concentric regions in a scatter plot of X and Y, we could take the squares of the corresponding variables, and a scatter plot of the squares shows the inner region data points moving to a corner and all the others moving away from the corner. In such a case, it’s easy to draw a dividing line separating the corner from the rest. Transforming the data for a couple of variables is easy. But transforming the data for vectors with a large number of variables or dimensions is not easy. Instead we apply what is called the kernel trick, which involves replacing the dot-product function with a new function that returns what the dot-product would have been if the data had first been transformed to a higher dimensional space using a mapping function. One such function is the radial-basis function, which is like a dot product in that it takes the two vectors and returns a value, but it also takes in an additional parameter which can be adjusted to get the best linear separation. And instead of calculating the dot-product of the new point to classify and the average point, we now find the radial-basis between the point and every other point in the class and average them. The weight adjustment to help with categorization, called the offset, is also different in the transformed space. It is usually pre-calculated for the dataset and passed in when classifying new points. This has now tremendously improved the linear classifier for, say, the impact of age difference in match making.
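A small Python sketch of the radial-basis idea, assuming each training row is a (vector, class) pair, that gamma is the adjustable parameter mentioned above, and that the offset has been precomputed for the dataset:

import math

def rbf(v1, v2, gamma=10):
    # radial-basis function: like a dot product, but computed as if the data
    # had been mapped to a higher-dimensional space
    dist2 = sum((a - b) ** 2 for a, b in zip(v1, v2))
    return math.e ** (-gamma * dist2)

def nonlinear_classify(point, data, offset, gamma=10):
    # average the kernel value of the point against every member of each class
    class0 = [v for v, cls in data if cls == 0]
    class1 = [v for v, cls in data if cls == 1]
    sum0 = sum(rbf(point, v, gamma) for v in class0) / len(class0)
    sum1 = sum(rbf(point, v, gamma) for v in class1) / len(class1)
    y = sum0 - sum1 + offset
    return 0 if y > 0 else 1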

#codingexercise

Wednesday, October 15, 2014

#codingexercise:
size_t strlen(char* pc)
{
  if (!pc) return 0;
  size_t count = 0;
  while (*pc) {
    pc++;
    count++;
  }
  return count;
}


@codingexercise
int strcmp(char* a, char* b)
{
  if (a == NULL || b == NULL) return -1;
  while (*a && *b && *a == *b)
  {
    a++;
    b++;
  }
  return (*a - *b);
}

Tuesday, October 14, 2014

#codingexercise
A linked list is given such that each node  additionally  points to a random other node in the list. Perform a deep copy.
Node* deepCopy(Node* source)
{
  if (source == NULL) return NULL;
  Node* clone = shallowCopy(source);   // copy the next-chain first, then fix up the random pointers
  Node* curSrc = source;
  Node* curDest = clone;
  while (curSrc)
  {
    if (curSrc->other == NULL)
    {
      curDest->other = NULL;
      curSrc = curSrc->next;
      curDest = curDest->next;
      continue;
    }
    int offset = relativePos(curSrc, curSrc->other, source);
    if (offset < 0)
    {
      // target precedes curSrc: walk from the head of the clone
      Node* dest = clone;
      for (int i = offset + 1; i < 0 && dest; i = i + 1)
        dest = dest->next;
      curDest->other = dest;
    }
    else
    {
      // target follows curSrc: walk forward from the current clone node
      Node* dest = curDest;
      for (int i = 0; i < offset && dest; i++)
        dest = dest->next;
      curDest->other = dest;
    }
    curSrc = curSrc->next;
    curDest = curDest->next;
  }
  return clone;
}

Node* shallowCopy(Node* source)
{
  if (!source) return NULL;
  Node* ret = new Node();
  if (!ret) throw out_of_memory;
  ret->next = shallowCopy(source->next);
  return ret;
}

int relativePos(Node* source, Node* target, Node* head)
{
  if (source == target) return 0;
  int i = 1;
  while (source && source->next && source->next != target)
  {
    source = source->next;
    i++;
  }
  if (source && source->next == target) return i;
  // not found ahead of source; count backwards from the head
  i = -1;
  if (head == target) return i;
  while (head && head->next && head->next != target)
  {
    head = head->next;
    i = i - 1;
  }
  return i;
}
Today we look at inverted lists. In an Apache Lucene index, a document consists of multiple fields and a field consists of multiple terms. The indexing of the document is controlled by labeling the fields as to whether they need to be analyzed (preprocessed), indexed, or stored (ready to be retrieved). The inverted index has a list of fields, their labels and offsets within the postings list. The entry in the postings list corresponding to a field has a term, a frequency and a position in the frequency list. The position in the frequency list gives the documents, sorted by their id, that contain this term, each with offsets to the positions, in terms of say word count, where the term appears. The whole index, also called the vocabulary, is sorted first by document, then by fields and then by terms. Thus it’s a hierarchy of lists.

We now look at document retrieval step. This is the second step of the two listed earlier.  Here we are interested in finding the top ranking matches of documents.  Each document is checked for a match and retrieved one at a time.   The documents are found from the postings list of the terms in the query.  The postings list can be traversed concurrently.  The TF-IDF score of the document is calculated to determine a match for the document.  Apache Lucene calculates a static and a dynamic score which are based on the document and the dot product of the query and document vectors respectively. It also has a boost factor for the term frequency. The reason we bring up the static score is that the documents could additionally be indexed based on static score. However, we will just use the notion of a score to rank the documents by inserting into a priority queue data structure.   
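A tiny Python sketch of the ranking step, assuming the per-document scores have already been computed; a bounded min-heap keeps only the top k:

import heapq

def top_k(scored_docs, k=10):
    # scored_docs yields (doc_id, score); a min-heap of size k keeps the best k seen so far
    heap = []
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

print(top_k([(1, 0.3), (2, 0.9), (3, 0.1), (4, 0.7)], k=2))  # [(0.9, 2), (0.7, 4)]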

For large collections, the index can be distributed across multiple machines. The indexes are partitioned based on term partitioning and document partitioning.  Usually these are determined outside the system because the caller will know how to partition.  In ElasticSearch for example, the machines are organized by rows and columns. Each column represents a partition of documents while each row represents a replica of the corpus.

Note that for the score we use, the IDF is calculated based on the whole corpus, not just the partitions. In this case, either we treat the partitions as uniform fractions of the corpus or we poll and collect the scores from each.


#codingexercise
Implement strtok
char* strtok(char* s, const char* delim)
{
  static char* last;
  char* spanp;
  char* tok;
  int c, sc;

  if (s == NULL && (s = last) == NULL)
  {
    return NULL;
  }

  /* skip leading delimiters */
cont:
  c = *s++;
  for (spanp = (char*) delim; (sc = *spanp++) != 0;)
  {
    if (c == sc) goto cont;
  }

  if (c == 0)   /* no non-delimiter characters left */
  {
    last = NULL;
    return NULL;
  }
  tok = s - 1;

  /* scan the token; terminate it at the next delimiter or at the end of the string */
  for (;;)
  {
    c = *s++;
    spanp = (char*) delim;
    do
    {
      if ((sc = *spanp++) == c)
      {
        if (c == 0)
          s = NULL;
        else
          s[-1] = 0;
        last = s;
        return (tok);
      }
    } while (sc != 0);
  }

  /* unreached */
}

Courtesy: Apple


My back is hurting for some time now

Monday, October 13, 2014

#codingexercise
// note: returns 1 when the buffers match, unlike the standard memcmp
int memcmp(char* src, char* dest, size_t n)
{
  if (!src || !dest) return 0;
  if (src == dest) return 1;
  while (n)
  {
    if (*src != *dest) return 0;
    src++;
    dest++;
    n--;
  }
  return 1;
}
hi-level implementation of Inverted Lists in Key-Value Stores:
(Python and MongoDB)
import re

def getwords(doc):
    splitter = re.compile('\\W*')
    # split the words by non-alpha characters
    words = [s.lower() for s in splitter.split(doc) if s != '']
    return words

class Vocabulary:
    def makeInvertedIndexSortedByKey(self, doc):
        i = 0
        for w in getwords(doc):
            position = i
            i += 1
            # conceptual hierarchy of lists, innermost first:
            # Frequency    = (doc.id, freq(w) incremented, offset(position))
            # PostingsList = (Field, w, freq(doc.id), offset(Frequency))
            # FieldDef     = (Field, indexed=True, stored=True, offset(PostingsList))




#codingexercise
Median of two sorted arrays

int GetMedian(List<int> list1, List<int> list2)
{
  if (list1 == null || list1.Count <= 0) throw new Exception();
  if (list2 == null || list2.Count <= 0) throw new Exception();
  int first = 0;  // index for first
  int second = 0; // index for second
  var merged = new List<int>();

  while (first < list1.Count && second < list2.Count)
    if (list1[first] < list2[second])
    {
      merged.Add(list1[first]);
      first++;
    }
    else
    {
      merged.Add(list2[second]);
      second++;
    }

  while (first < list1.Count)
  {
    merged.Add(list1[first]);
    first++;
  }

  while (second < list2.Count)
  {
    merged.Add(list2[second]);
    second++;
  }

  int mid = merged.Count / 2;
  if (merged.Count % 2 == 0)
    return (merged[mid - 1] + merged[mid]) / 2;
  return merged[mid];
}

Sunday, October 12, 2014

#codingexercise
TwoSum is a method that finds the indexes, in ascending order, of two numbers in an array that add up to a target. The numbers are positive integers.

List<int> TwoSum(List<int> numbers, int target)
{
  if (numbers == null || numbers.Count <= 0 || target == 0) return null;
  var ret = new List<int>();
  for (int i = 0; i < numbers.Count; i++)
  {
    // look for the complement after position i
    int index = numbers.IndexOf(target - numbers[i], i + 1);
    if (index != -1)
    {
      ret.Add(i);
      ret.Add(index);
      break;
    }
  }
  return ret;
}

DBExplorer investigates the use of keyword search in relational databases.  It mentions several interesting problems that are encountered in keyword search as opposed to structured queries. For example, SQL applications require knowledge of schema and the keyword search doesn’t. Secondly, given a set of keywords, a match is found by joining several tables on the fly. Thirdly, Indexes need to be leveraged for efficient keyword search. Fourthly, common data structures used for document collections such as inverted lists are replaced by symbol table. For each keyword, a list of rows is kept in the symbol table that contains the keywords. Alternatively, for each keyword, a list of columns can be kept that contains the keywords.  Search is performed across multiple tables using a graph where the nodes are the tables and the edges are the foreign-key relationships. When looking for a co-occurrence, the tables are joined on the fly by exploiting the schema as well as the content.
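A toy Python sketch of the symbol-table idea, assuming rows are identified as (table, row_id) pairs; this only illustrates the per-keyword lookup, not the on-the-fly joins:

from collections import defaultdict

# symbol table: keyword -> list of (table, row_id) pairs containing that keyword
symbol_table = defaultdict(list)

def index_row(table, row_id, text):
    for word in text.lower().split():
        symbol_table[word].append((table, row_id))

def lookup(keywords):
    # rows that contain every keyword; co-occurrence across tables still needs joins
    hits = [set(symbol_table[k.lower()]) for k in keywords]
    return set.intersection(*hits) if hits else set()

index_row("Customers", 1, "Alice Seattle")
index_row("Orders", 7, "Seattle espresso machine")
print(lookup(["seattle"]))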

Saturday, October 11, 2014


#codingexercise
Text justification
This problem requires the text to be broken and laid out so that each line has L characters, except possibly the last line. A word that would be broken by the line boundary moves on to the next line, and padding is introduced evenly between the remaining words.
Let X and Y denote sentences with a character count less than or equal to L, and greater than L, respectively.
The test cases are
X
Y
XX
XY
YX
YY

List<string> Justify(string text, int L)
{
 if (string.IsNullOrEmpty(text) || L < 1) return null;

 var ret = new List<string>();
 string candidate = string.Empty;

 var words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
 for (int i = 0; i < words.Length; i++)
 {
   // add the word while it still fits on the current line
   if (candidate.Length + words[i].Length + 1 <= L)
   {
     candidate += words[i] + " ";
     continue;
   }

   // the word does not fit: pad the current line evenly and flush it
   if (candidate.Length > 0)
   {
     int padLen = L - candidate.Length;
     var w = candidate.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
     candidate = string.Empty;
     for (int k = 0; k < w.Length; k++)
     {
       candidate += w[k] + " ";
       // base share of the padding after each word
       for (int l = 0; l < padLen / w.Length && candidate.Length < L; l++)
         candidate += " ";
     }
     // remainder of the padding
     for (int l = 0; l < padLen % w.Length && candidate.Length < L; l++)
       candidate += " ";
     ret.Add(candidate);
     candidate = string.Empty;
   }

   // place the word on a fresh line, hyphenating when it is longer than L
   if (words[i].Length + 1 <= L)
   {
     candidate = words[i] + " ";
   }
   else
   {
     ret.Add(words[i].Substring(0, L - 1) + "-");
     candidate = words[i].Substring(L - 1) + " ";
   }
 }
 if (candidate.Length > 0) ret.Add(candidate);
 return ret;
}
 

Friday, October 10, 2014

I mentioned an SSH library called pexpect which lets you pipe stdin and stdout. Similarly, sshpass can be invoked from a bash script as well:



#!/bin/bash

# CRUD.sh - create/update/delete/list a resource over SSH

E_NOARGS=85  # exit code used when no arguments are given

if [ -z "$1" ]
then
    echo "Usage: `basename $0` [create|update|delete|list] [resource] [rule]"
    exit $E_NOARGS
fi

share=$2
rule=$3

create_Command="create resource"
update_Command="update resource"
delete_Command="delete resource"
list_Command="list resource"

case $1 in
"create" ) Command="$create_Command" ;;
"update" ) Command="$update_Command" ;;
"delete" ) Command="$delete_Command" ;;
"list"   ) Command="$list_Command" ;;
* ) echo "Usage: `basename $0` [create|update|delete|list] [resource] [rule]" ; exit 1 ;;
esac

export SSHPASS='password' # sshpass -e reads the password from $SSHPASS; use keys instead where possible

sshpass -e ssh -o BatchMode=no root@xyz.com "$Command"

exit $?
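For comparison, the pexpect route mentioned above can drive the same kind of remote command interactively. A minimal pxssh sketch follows; the host, user, password, and command are placeholders:

from pexpect import pxssh

s = pxssh.pxssh()
s.login('xyz.com', 'root', 'password')   # placeholders; prefer keys in practice
s.sendline('list resource')              # goes to the remote shell's stdin
s.prompt()                               # wait for the shell prompt to return
print(s.before)                          # everything the command wrote to stdout
s.logout()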

OK now where did I leave the change from the bus ride today?

# coding exercise
We implement yet another way to print the area of the largest rectangle in a histogram of unit width heights. In addition to the simple approaches described in the previous posts, we look at recursive approach because there are overlapping solutions and iterative approaches using data structures
One approach is to use divide and conquer.
We can use the overlapping subproblems in terms of the areas computed between ranges of indices. Further, if we keep track of these indices and computations, we don't have to redo them. This technique is called memoization. Here we don't recompute the max area for a range whose value we have already computed and stored. So we keep track of ranges and the max area computed, in increasing order of start and end. Each time we update a range we update the start or end and the corresponding max value, or both. How we choose the ranges of indices is our choice; the goal is to cover every index in the histogram so that we don't leave out any portion. One idea, as it has appeared on the GeeksforGeeks website, is to find the minimum of the bar heights in a range by dividing and conquering the entire range.
Given this minimum we find the max area as the maximum  of the following :
1) Maximum area to the left
2) maximum area to the right
3) number of bars multiplied by minimum value
Note that the minimum over a range of bars does not by itself guarantee the maximum area; the same computation has to be applied to all sub-ranges, down to a single bar that may tower over the rest of the chart.
Another way of choosing the indexes is to work progressively from the start of the range to the end. In this method, every bar in the range already covered has been exhausted as a contributor to the final answer.
Yet another approach is to divide and conquer the indexes and combine the results, computing the max area of (a, b, c) as Max(a, b, c).
Specifically, given two adjacent ranges of bars, the combined area is the maximum of
1) the minimum common height times the number of bars in the combined range
2) the maximum area of one range
3) the maximum area of the other range.
Note, however, that splitting at the midpoint this way can miss a rectangle whose bars straddle the boundary (for heights [1, 3, 3, 1] it reports 4 instead of 6); splitting at the minimum bar, as described earlier, avoids this.
If we choose this approach, we keep track of the areas computed in the data structure discussed under memoization; a memoized sketch follows the code below.
First without memoization, the solution is
int MaxArea(ref int[] h, int start, int end, ref int min)
{
if (start == end)
{
min = h[start];
return min * 1;
}
if (start < end)
{
int mid = (start + end) / 2;
int minleft = 0;
int minright = 0;
int left = MaxArea(ref h, start, mid, ref minleft);
int right = MaxArea(ref h, mid + 1, end, ref minright);
min = Math.Min(minleft, minright);
int minArea = min * (end - start + 1);
return Math.Max(Math.Max(left, right), minArea);
}
return 0;
}
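The memoized variant referred to above can be sketched as follows, here in Python rather than the C# used in this post, and splitting at the minimum bar as in points 1) through 3) earlier. The dictionary keyed by (start, end) is one possible way of keeping track of the ranges already computed, not the only one.

def max_rect_area(h, start=0, end=None, memo=None):
    # largest rectangle in histogram h[start..end], splitting at the minimum bar
    if end is None:
        end = len(h) - 1
    if memo is None:
        memo = {}
    if start > end:
        return 0
    if (start, end) in memo:
        return memo[(start, end)]
    # index of the minimum bar in the range (a linear scan, kept simple here)
    m = min(range(start, end + 1), key=lambda i: h[i])
    area = max(h[m] * (end - start + 1),              # minimum height times number of bars
               max_rect_area(h, start, m - 1, memo),  # maximum area to the left
               max_rect_area(h, m + 1, end, memo))    # maximum area to the right
    memo[(start, end)] = area
    return area

print(max_rect_area([2, 1, 5, 6, 2, 3]))   # 10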

Thursday, October 9, 2014

I came across a method implemented in a library that gives a programmatic way to establish SSH sessions (Python pexpect library - pxssh.py). This library had a method to compute the Levenshtein distance as follows:


def levenshtein_distance(a, b):
        # in pxssh this is a method on the pxssh class; it is shown here as a standalone function
        '''This calculates the Levenshtein distance between a and b.
        '''
        n, m = len(a), len(b)
        if n > m:
            a, b = b, a
            n, m = m, n
        current = list(range(n+1))  # list() so that individual cells can be assigned
        for i in range(1, m+1):
            previous, current = current, [i]+[0]*n
            for j in range(1, n+1):
                add, delete = previous[j]+1, current[j-1]+1
                change = previous[j-1]
                if a[j-1] != b[i-1]:
                    change = change + 1
                current[j] = min(add, delete, change)
        return current[n]

As we may know already, Levenshtein distance is a metric for measuring the difference between two string sequences. The distance is computed in terms of single-character edits, specifically insertions, deletions and substitutions. The method swaps the arguments if necessary so that the shorter string is transformed into the longer one. In the table it builds row by row, 'add' comes from the cell directly above (previous row, same column), 'delete' from the cell to the left in the current row, and 'change' from the diagonal cell in the previous row; a match costs zero and a mismatch adds one. The positions are iterated for both sequences, and the last cell of the final row is the distance.
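A quick sanity check of the function above (using the standalone name given to it here); both values are easy to verify by hand:

print(levenshtein_distance('kitten', 'sitting'))   # 3: k->s, e->i, insert g
print(levenshtein_distance('flaw', 'lawn'))        # 2: drop f, append n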

Wednesday, October 8, 2014

#coding exercise
print the area of the largest rectangle in a histogram of unit width heights.
public static int MaxArea(ref List<int> h)
{
            if (h == null || h.Count <= 0) return 0;
            var areas = new List<int>();
            for (int i = 0; i < h.Count; i++)
            {
                // treat h[i] as the limiting height and extend in both directions
                // while the neighboring bars are at least as tall
                areas.Add(h[i]);
                for (int k = i + 1; k < h.Count && h[k] >= h[i]; k++)
                    areas[i] += h[i];
                for (int k = i - 1; k >= 0 && h[k] >= h[i]; k--)
                    areas[i] += h[i];
            }
            return areas.Max();
}
public class hc
{
// height
public int h { get; set; }
// count: the width (number of bars) this height spans so far
public int c { get; set; }
}
public static int MaxAreaByStack(List<int> h)
{
            if (h == null || h.Count <= 0) return 0;
            int maxArea = 0;
            // stack of bars as (height, count); counts absorb taller bars popped to their left
            var bars = new Stack<hc>();
            for (int i = 0; i < h.Count; i++)
            {
                int width = 0;
                // a bar taller than the current one cannot extend any further to the right,
                // so close it out and fold its width into the current bar
                while (bars.Count > 0 && bars.Peek().h > h[i])
                {
                    var top = bars.Pop();
                    width += top.c;
                    if (top.h * width > maxArea)
                        maxArea = top.h * width;
                }
                bars.Push(new hc { h = h[i], c = width + 1 });
            }
            // drain the remaining bars; each extends to the right end of the histogram
            int w = 0;
            while (bars.Count > 0)
            {
                var top = bars.Pop();
                w += top.c;
                if (top.h * w > maxArea)
                    maxArea = top.h * w;
            }
            return maxArea;
}

Tuesday, October 7, 2014

A review of configuration requirements, tips and tricks by Carlos Galeona for scale-out NAS.
The target configuration is a cluster running OneFS, with a single file system ranging from 18 TB to 15.5 PB, that is easy to manage (no RAIDs, LUNs, etc.) and easy to grow. The minimum configuration is a cluster with three nodes, two InfiniBand switches, the same version of OneFS on every node, and licenses for the different modules. Common tasks involve upgrade, deployment, verification, etc. The practice is a cfengine deployment per host, with cfagent run from cron on each node. A management API exists for OneFS 7.1.1. A traditional backup is an example use case: an NFS file system is mounted on a backup server or client and a native backup is taken. A large-scale file system can mean multiple backups and parallel tree traversals, and the latter can take a long time; the 7.1.1 API allows the creation of a changelist that addresses this.

Sample program to create a web API based administration session, create an NFS export and create a snapshot.

# import relevant libraries
import json
import urllib
#  for data = urllib.urlopen(url, params).read()
import httplib2
http = httplib2.Http()
# for http.request(url) instead of authenticated urlopen

# setup
api_user = 'Your api username'
api_pass = 'Your api password'

# create session
url = 'https://<cluster-ip-or-host-name>:8080/session/1/session'
params = urllib.urlencode({
'username': api_user,
'password': api_pass,
'services': ['NFS','Snapshot']
})
http.add_credentials(api_user, api_pass)
response, content = http.request(url, 'POST', params,
headers={'Content-type': 'application/x-www-form-urlencoded'}
)

# validate
url = url + '?isisessid'
http.add_credentials(api_user, api_pass)
response, content = http.request(url)
data = json.loads(content)
if data["username"] != api_user:
   raise Exception('bad session')

#create nfs export
url = 'https://<cluster-ip-or-host-name>:8080/platform/1/protocols/nfs/exports'
params = urllib.urlencode({
'description': 'sample mount',
'paths': ['/path','/path1'] # under /ifs
})
http.add_credentials(api_user, api_pass)
response, content = http.request(url, 'POST', params,
headers={'Content-type': 'application/x-www-form-urlencoded'}
)
data = json.loads(content)
if not data["id"]:
   raise Exception('bad export')

# validate
id = data['id']
url = 'https://<cluster-ip-or-host-name>:8080/platform/1/protocols/nfs/exports/' + id + '?describe'
http.add_credentials(api_user, api_pass)
response,content = http.request(url)
data = json.loads(content)
if not data["id"]:
   raise Exception('bad export')

#create a snapshot
url = 'https://<cluster-ip-or-host-name>:8080/platform/1/snapshot/snapshots'
params = urllib.urlencode({
'name': 'sample snapshot',
'path': '/ifs'
})
http.add_credentials(api_user, api_pass)
response, content = http.request(url, 'POST', params,
headers={'Content-type': 'application/x-www-form-urlencoded'}
)
data = json.loads(content)
if not data["id"]:
   raise Exception('bad snapshot')

# validate
id = data['id']
url = 'https://<cluster-ip-or-host-name>:8080/platform/1/snapshot/snapshots/' + id + '?describe'
http.add_credentials(api_user, api_pass)
response,content = http.request(url)
data = json.loads(content)
if not data["created"]:
   raise Exception('bad snapshot')

from datetime import datetime, timedelta
# 'created' is assumed here to be an epoch timestamp; adjust the parsing to the API's actual format
createddate = datetime.fromtimestamp(data['created'])
if datetime.now() - createddate > timedelta(hours=2):
    raise Exception('old snapshot')

#coding exercise
Given a non-negative number represented as an array of digits, add one to the number. For example, [1, 2, 9] becomes [1, 3, 0] and [9, 9] becomes [1, 0, 0].

void plusOne(ref List<int> digits, int position)
{
if (position >= digits.Count) return;
if (position < 0) return;
if (digits[position] + 1 > 9)
{
   digits[position] = 0;
   if (position - 1 < 0)
       digits.Insert(0, 1); // carry past the most significant digit grows the number
   else
       plusOne(ref digits, position - 1);
}
else
    digits[position] += 1;
}



# coding exercise
Validate BST

bool isValidBST(Node root)
{
   return isValidBST(root, int.MinValue, int.MaxValue);
}
// every node must fall within the (min, max) range implied by its ancestors;
// checking only the immediate children is not sufficient
bool isValidBST(Node root, int min, int max)
{
   if (root == null) return true;
   if (root.data < min || root.data > max) return false;
   return isValidBST(root.left, min, root.data) && isValidBST(root.right, root.data, max);
}

Given a singly linked list L0->L1->L2->...->Ln-1->Ln
Return the interleaved list L0->Ln->L1->Ln-1->...
List<Node> Interleave(List<Node> items)
{
  if (items == null || items.Count <= 2) return items;
  int n = items.Count;
  int mid = (n % 2 == 0) ? n / 2 : n / 2 + 1;
  // detach the second half and weave it back in from its end
  var rem = items.GetRange(mid, n - mid);
  items.RemoveRange(mid, n - mid);
  int i = 1;
  int count = rem.Count;
  for (int k = 0; k < count; k++)
  {
    Node last = rem[rem.Count - 1];
    rem.RemoveAt(rem.Count - 1);
    items.Insert(i, last);
    i = i + 2;
  }
  return items;
}
void interleave(node* root, int n)
{
  if (root == NULL || n <= 2) return;
  node* start = root;
  node* last = start->next;

  while (start && last)
  {
    // walk to the last node, remembering the one before it
    node* prev = start;
    while (last && last->next)
    {
      prev = last;
      last = last->next;
    }
    prev->next = NULL;          // detach the last node
    node* next = start->next;
    last->next = next;          // splice it in right after start
    start->next = last;
    start = next;               // the next insertion point is the node after the spliced one
    last = (start != NULL) ? start->next : NULL;
  }
}

Monday, October 6, 2014



#coding exercise
Given a binary tree, check if it's symmetric around the center
bool isSymmetric(Node root)
{
 if (root == null) return true; // an empty tree is symmetric
 // level-wise items (already implemented earlier); the levels must include null placeholders
 // for missing children so that structure, and not just values, is compared
 List<List<Node>> items = GetZigZag(root);
 for (int i = 0; i < items.Count; i++)
    if (isPalindrome(items[i]) == false)
         return false;
 return true;
}

bool isPalindrome(List<Node> items)
{
  if (items == null) return false;
  int start = 0;
  int end = items.Count - 1;
  while (start < end)
  {
     if (items[start] == null && items[end] == null) { start++; end--; continue; }
     if (items[start] == null || items[end] == null) return false;
     if (items[start].data != items[end].data) return false;
     start++;
     end--;
  }
  return true;
}

Given a pair  of rectangles aligned along the axis in the positive quadrant and given by
class Rectangle
{
  Point tl; // TopLeft;
  Point br; // BottomRight;
}
class Point
{
 int x;
 int y;
}
Implement the following  methods
static bool IsIntersect( Rectangle r1, Rectangle r2);
static Rectangle Intersecting(Rectangle r1, Rectangle r2);

static bool IsIntersect( Rectangle r1, Rectangle r2)
{
bool isdisjoint = false;
// are  they disjoint along x axis ?
if (r1.tl.x < r2.tl.x)
   isdisjoint = r1.br.x < r2.tl.x;
else
  isdisjoint = r2.br.x < r1.tl.x;
if (isdisjoint == true) return false;

// are they disjoint along y-axis ?
if (r1.br.y < r2.br.y)
   isdisjoint = r1.tl.y < r2.br.y;
else
  isdisjoint = r2.tl.y < r1.br.y;
return isdisjoint == false;
}

static Rectangle Intersecting(Rectangle r1, Rectangle r2)
{
  if (!IsIntersect(r1, r2)) return null;
  Rectangle r = new Rectangle();
  r.tl = new Point();
  r.br = new Point();
  Rectangle left = (r1.tl.x <= r2.tl.x) ? r1 : r2;
  Rectangle right = (r1.tl.x <= r2.tl.x) ? r2 : r1;
  Rectangle bottom = (r1.tl.y <= r2.tl.y) ? r1 : r2;
  Rectangle top = (r1.tl.y <= r2.tl.y) ? r2 : r1;
  r.tl.x = right.tl.x;
  r.br.x = right.br.x <= left.br.x ? right.br.x : left.br.x;
  r.tl.y = bottom.tl.y;
  r.br.y = bottom.br.y <= top.br.y ? top.br.y : bottom.br.y;
  return r;
}