Cluster computing

Thursday, November 20, 2014

This is a writeup on how to collect and visualize data that indicates who talks to whom and represent it as a neat little graph alongside your favorite application with which you browse this communication. Let us take an example of an e-mail application that has mails sent to you. Inevitably you will be in all of the e-mails. That’s what pigeonholing might do to you. But even with this data, you may be seeing mails where you were included just to be in the loop. These from and to labeled people may give indications not only to who is chatty but also whose mail is more important than others especially when it comes from outside your organization. Although this is a complex domain and it has many subjective variables, we can begin with this use case.

Let us say we iterate through our list of mails and populate the following table:

- - - - - - - - - - - -

- From | To |

- - - - - - - - - - - -

- A B

- B C

- D E

This may start to look like a graph where we can quickly form nodes with participants such as A, B, C, D and E. We also form edges between them for each entry that we make in this table. Notice that the edges have weights and they are directed.

We can write this graph as an adjacency list or matrix. We could prefer Matrix but for our purposes we simply want to illustrate a visualization of the information we have learned. Notice though that although we have decided to layout a graph, it can be drawn in several different ways. Some can be messier than others. Perhaps a better way to draw them would be to have them spaced apart and try not to have the edges cross over too much. This will make it easier to see the graph.

Now let us try the same with this graph. We have a technique by which we can optimize this graph layout.

# The following program lays out a graph with little or no crossing lines.

# This is adapted from a sample in "Programming Collective Intelligence" by OReilly Media

from PIL import Image, ImageDraw

import math

import random

vertex = ['A','B','C','D','E']

links=[('A', 'B'),

('B', 'C'),

('C', 'D'),

('D', 'E'),

('E', 'A'),

('C', 'E'),

('A', 'D'),

('E', 'B')]

domain=[(10,370)]*(len(vertex)*2)

def randomoptimize(domain,costf):

best=999999999

bestr=None

for i in range(1000):

# Create a random solution

r=[random.randint(domain[i][0],domain[i][1]) for i in range(len(domain))]

# Get the cost

cost=costf(r)

# Compare it to the best one so far

if cost<best:

best=cost

bestr=r

return r

def annealingoptimize(domain,costf,T=10000.0,cool=0.95,step=1):

# Initialize the values randomly

vec=[float(random.randint(domain[i][0],domain[i][1]))

for i in range(len(domain))]

while T>0.1:

# Choose one of the indices

i=random.randint(0,len(domain)-1)

# Choose a direction to change it

dir=random.randint(-step,step)

# Create a new list with one of the values changed

vecb=vec[:]

vecb[i]+=dir

if vecb[i]<domain[i][0]: vecb[i]=domain[i][0]

elif vecb[i]>domain[i][1]: vecb[i]=domain[i][1]

# Calculate the current cost and the new cost

ea=costf(vec)

eb=costf(vecb)

p=pow(math.e,(-eb-ea)/T)

# Is it better, or does it make the probability

# cutoff?

if (eb<ea or random.random( )<p):

vec=vecb

# Decrease the temperature

T=T*cool

return vec

def crosscount(v):

# Convert the number list into a dictionary of person:(x,y)

loc=dict([(vertex[i],(v[i*2],v[i*2+1])) for i in range(0,len(vertex))])

total=0

# Loop through every pair of links

for i in range(len(links)):

for j in range(i+1,len(links)):

# Get the locations

(x1,y1),(x2,y2)=loc[links[i][0]],loc[links[i][1]]

(x3,y3),(x4,y4)=loc[links[j][0]],loc[links[j][1]]

den=(y4-y3)*(x2-x1)-(x4-x3)*(y2-y1)

# den==0 if the lines are parallel

if den==0: continue

# Otherwise ua and ub are the fraction of the

# line where they cross

ua=((x4-x3)*(y1-y3)-(y4-y3)*(x1-x3))/den

ub=((x2-x1)*(y1-y3)-(y2-y1)*(x1-x3))/den

# If the fraction is between 0 and 1 for both lines

# then they cross each other

if ua>0 and ua<1 and ub>0 and ub<1:

total+=1

for i in range(len(vertex)):

for j in range(i+1,len(vertex)):

# Get the locations of the two nodes

(x1,y1),(x2,y2)=loc[vertex[i]],loc[vertex[j]]

# Find the distance between them

dist=math.sqrt(math.pow(x1-x2,2)+math.pow(y1-y2,2))

# Penalize any nodes closer than 50 pixels

if dist<50:

total+=(1.0-(dist/50.0))

return total

def drawnetwork(loc):

#create the image

img = Image.new('RGB', (400,400),(255,255,255))

draw=ImageDraw.Draw(img)

#create the position dict

pos=dict([(vertex[i],(loc[i*2],loc[i*2+1])) for i in range(0, len(vertex))])

#Draw Links

for (a,b) in links:

draw.line((pos[a],pos[b]), fill=(255,0,0))

#Draw vertex

for (n,p) in pos.items():

draw.text(p,n,(0,0,0))

img.save('graph.jpg', 'JPEG')

img.show()

sol=randomoptimize(domain,crosscount)

crosscount(sol)

sol=annealingoptimize(domain,crosscount,step=50,cool=0.99)

crosscount(sol)

drawnetwork(sol)

Now this draws the graph as :

This is a neat little graph and shows that the participant A is more popular.

Wednesday, November 19, 2014

#codingexercise
Int GetCountDuplicatesMode (int [] A)
{
if (A == null) return 0;
return A.CountDuplicatesMode();
}

In today's post we continue to look at the PHP Implementation of a class based views which are not tied to model. A PHP controller that takes a single input for say a string to do CRUD operation may look something like this:


<?php

require  __DIR__ . '/crud.php';

#require_once 'crud.php'

class ExportsController extends AppController {

        public $helpers = array('Html', 'Form');

        public $exports = NULL;

        public function index() {

            $orm = new ORM();

            $result = json_decode($orm->list(), true); 

            $exports = $result['exports'];

            $ret = array();

            foreach($exports as $export) {

                   array_push($ret, $export['input']); 

            }

            $this->set('exports', $ret);

        }

        public function add($export = null) {        

             if(isset($this->request->data['input'])) {            

                $orm = new ORM();            

                $result = json_decode($orm->create($this->request->data['input']));            

                if (strpos($result, 'success') !== false){                

                           $this->Session->setFlash(__('Your export has been saved.'));

                           return $this->redirect(array('action' => 'index'));            

                }


               $this->Session->setFlash(__('Unable to add your export.'));

             }        





}

: similar for modify or delete

}

Notice we are not using ids of the items in the list and this is usually not recommended. The RESTAPI url comes out to be /exports/modify/9 for example. This is preferable way for REST based API. It doesn't mean that we cannot get the exports working the way we have it. It merely suggests that conforming to the ID based helps with keeping track of the collection when it changes.


Int GetCountDuplicatesMedian (int [] A)

{

if (A == null) return 0;

return A.CountDuplicatesMedian();

}

Tuesday, November 18, 2014

#codingexercise

Int GetCountDuplicatesAvg (int [] A)

{

if (A == null) return 0;

return A.CountDuplicatesAvg();

}

In this post, we continue to read about PHP MVC pattern. A controller handles the logic usually for a single model. In Cake, the controller names are always plural. Generally each controller extends the AppController.
Methods include the following:
<?php
class SharesController extends AppController
{
function view($id)
{
}
function create($path)
{
}
function search($query)
{
}
}
postConditions is used to format this data to use in the model.
ControllerVariables are a few special variables inside of the controller such as $name, $uses, $helpers, $layout, $autoRender, $beforeFilter , $components etc.
ControllerParameters are available via $this->params in the current controller.

Cake follows a naming convention for models, views and controllers.

The Models are usually named singular and scaffolding is available for database. The standard way of getting to the data using a model include
findAll
string $conditions
array $fields
string $order
int $limit
int $page
int $recursive

Monday, November 17, 2014

#codingexercise
Int GetCountDuplicatesSum (int [] A)
{
if (A == null) return 0;
return A.CountDuplicatesSum();
}

Today I want to discuss PHP programming language as quickly as possible. Its useful in the Model View Controller or MVC which is a widely used pattern for software applications.
The basic syntax involves
<?php
echo 'Hello World'
...
The types involved are boolean, integer, float, string, array and object. The last two are referred to as compound types.
eg,
$a_bool = TRUE;
$a_str = "foo";
$an_int = 12;
$array = [
"foo" => "bar",
"bar" => "foo"
];
$b = array_values($array)
?>
There are two special types resource and NULL.
The resource type is a special variable, holding a reference to an external resource and its type is got by get_resource_type(). A resource variable holds special handlers to opened files, database connections and images. Resources are automatically collected.

mixed, number and callback are also used. These are called pseudo -types
mixed indicates that a parameter may accept multiple types,
number indicates that a parameter can be either integer or float.
callback and callable are used to denote functions as parameters and usually passed by its name as string.

classes are defined simply as :
<? php
class MyClass{
public $property = "Hello World";
public function Call()
{
call_user_func(array($this, 'Callback'));
}

public function Callback()
{
echo $this->property;
}
}
$m = new MyClass;
$m->call();
>
keywords such as interface, implements and extends are available. Scope resolution is with :: and objects can be cloned with the _clone method.

PHP does not require explicit type definition in variable declaration. The addition operator is a good example.
$ foo = "8"
$foo += 2
$foo += 1.3
$foo = 5 + "10 some_string_ignored"
Typecasting and strict === are checked.
Variables can have local or static scope
Data from get or post requests are found as follows:
echo $_POST['username'];
echo $_REQUEST['username'];
we also get $_COOKIE['count']

To make an MVC application we use say CakePHP framework.

We clone it from git at git://github.com/cakephp/cakephp.git

and the MVC application has a folder structure like this:
/path_to_document_root
/app
/lib
/plugins
/vendors
.htaccess
index.php
README

we may have to give folder permissions with
<?php echo exec('whoami'); ?>
chown -R www-data app/tmp
A database connection is simply a configuration
public $default = array(
'datasource' => 'Database/Mysql',
'persistent' => false,
'host' => 'localhost',
'port' => '',
'login' => 'mylogin',
'password' => 'mypassword',
'database' => 'location-volume-share',
'schema' => '',
'prefix' => '',
'encoding' => 'utf8'
);
Controllers can be defined by
// File: /app/Controller/ShareController.php
class shareController extends AppController {
    public $helpers = array('Html', 'Form');
    public function index() {
         $this->set('posts', $this->Post->find('all'));
    }
    public function edit($id = null) {
        if (!$id) {
            throw new NotFoundException(__('Invalid post'));
        }
        $params = explode(':', $this->request->data, 3)
        $location = $params[0]
        $volume = $params[1]
        $share = $params[2]
        $this->set($id, $location, $volume, $share);
    }
}

Now we will continue our discussion on preconditioning in conjugate gradients. We saw that we could use transformation to make the iterations quicker. The effectiveness of a preconditioner is determined by the condition number of M inverse applied to A as discussed earlier and occassionally by clustering of its eigenvalues. The simplest preconditioner is a diagonal matrix whose diagonal entries are identical to those of A.

Sunday, November 16, 2014

#codingexercise

Int GetCountDuplicatesCount (int [] A)

{

if (A == null) return 0;

return A.CountDuplicatesCount();

}

#codingexercise

Int GetCountDuplicatesAvg (int [] A)

{

if (A == null) return 0;

return A.CountDuplicatesAvg();

}

We continue to discuss CG method. We were talking about starting and stopping the iterations. We next discuss preconditioning. Preconditioning is a technique for improving the condition number of a matrix. Let us say M is a symmetric, positive definite matrix that approximates A, but is easier to invert. Then applying M inverse to both sides of the linear equation we solve with Steepest Descent or CG method, we can solve it more quickly because the eigenvalues of Minverse applied to matrix A on the left hand side of the linear equation are better clustered than those of A. However, the problem is M-inverse applied to A doesn't always come out to be positive symmetric matrix. We could avoid this difficulty by taking a pair of matrices E and E-Transpose whose product is M because for every positive symmetric M there exists such. Now the M-inverse applied to A and Einverse applied to A applied to E-Transponse-inverse have the same eigenvalues. By transforming, the latter is now symmetric and positive definite This process of using CG to solve this system is called Transformed Preconditioned Conjugate Gradient Method.

Today we look at the complexity in the convergence of the CG method. The dominating operations during an iteration are the matrix-vector multiplication which requires O(m) operations, where m is the number of non-zero entries in the matrix. In many cases however, A is sparse and m belongs to O(n). If we want to do just enough iterations to reduce the norm of the error in each iteration by a factor of epsilon applied to the initial error term, then we find this maximum number of iterations required to achieve it. In the case of Steepest Descent method, this maximum number of iterations is bounded by a constant k times the natural logarithm of the inverse of epsilon. And in the case of Conjugate Gradient method, it is bounded by the square root of the constant times a similar natural logarithm of the inverse of epsilon. The steepest descent has a time complexity of O(mk) where as the Conjugate Gradient has a time complexity of O(m square-root(k)). Both algorithms have a space complexity of O(m).
By going to polynomials of higher degree, we can generalize this to second order elliptic boundary value problems. There are finite difference and finite element approximation methods used for solving these problems on d - dimensional domains. Recall that the finite difference method applies a local Taylor expansion to approximate the differential equations. The method is used to solve Partial differential equations in a numerical way where a topological square network of lines is used to construct the discretization of PDE. Partial Differential Equations have the property that they are of the form : elliptic, parabolic and hyperbolic. This gives us the convenience of applying Taylor expansions. When there are multiple dimensions however, this topology does not approximate well. Instead, finite element technique is used. The Finite element method discretizes the region of interest into N-1 subdomains and assumes that the approximation is represented by an expansion basis which are continuous polynomials which we piecewise break and describe with one of a few possible shape functions. As an aside, a Finite volume method is another such improvement where we take the region of integration to be a control volume associated with a point of co-ordinate xi and taken around its immediate vicinity. Here the integral is taken over this small volume stretched on the unit length around xi which can then be translated in a semi discrete form. Peiro and Sherwin mention that this approach produces a conservative scheme if the flux on the boundary of one cell equals flux on the boundary of the adjacent cell. In our case, we look at elliptic boundary value problems and both finite difference and finite element methods have a time complexity where the constant k is of the order of O(n^2 raised to the inverse of d). Thus for two dimensional problems, Steepest Descent has a time complexity of O(n ^2 ) and Conjugate gradient method has a time complexity of O(n ^ ( 3/2)) for Conjugate Gradient method. For three dimensional problems, Steepest Descent has a time complexity of O(n ^ 5/3) while Conjugate Gradient method has a time complexity of O(n ^ 4/3) .
When discussing iterations using these methods, it might be helpful to discuss starting and stopping of such iterations. Starting is usually a good guess. If we know the approximate value of our variable, then we can use it as our starting value. Otherwise, we begin at zero and both Steepest Descent method and Conjugate Gradient method can find the solution to the linear system. The same cannot be said for non-linear systems because they may or may not converge and they may have several local minima. For stopping, both Steepest Descent method and Conjugate gradient method stops when it reaches the minimum point. We took this as when the residual becomes zero. However, there's a possibility that accumulated round off errors in the recursive formulation of the residual (using step-lengths and matrix-vector product) may lead to a false zero. This may be corrected by restarting with the residual.

Friday, November 14, 2014

#codingexercise

Int GetCountDuplicatesMin (int [] A)

{

if (A == null) return 0;

return A.CountDuplicatesMin();

}

We continue to look at picking perfect polynomials for the convergence of conjugate gradients. We have seen that at each step of CG, we can express the error term as a polynomial in a matrix or scalar applied to the initial error term and with Steepest Descent method, the initial error term is a linear combination of orthogonal unit eigenvectors. We also noticed that with each iteration, the convergence is only as good as the worst eigenvector. For the first iteration, we have a horizontal line for the the polynomial to evaluate to a constant 1. At this time we have two eigenvalues. In the second iteration, we have a slope for the line equal to the ratio of the two eigenvalues. Let us say that in the third iteration, the eigenvalue is 1 and the iterations terminate. We now have a parabola for the polynomial of degree two to fit these three points. The computed error terms may be plotted somewhere close to these eigenvalues on these polynomial representations.
We see that the polynomial converges after n iterations and quicker if there are duplicated eigenvalues. If we had a floating point precision that was infinite, then it would take at most n iterations to converge to an exact solution. In fact, it might be quicker if the initial position vector is already A-orthogonal to some of the eigenvectors of A. But floating point round off errors may re-introduce these eigenvectors. We also find that when the eigenvalues are clustered together, it is quicker to converge. If we know something about these eigenvalues, then we could construct the polynomial faster. The more general case is when the eigenvalues are evenly distributed between Lambda min and Lambda max, the number of distinct eigenvalues is large, and the floating point roundoff occurs.
This range leads to another interesting idea called the Chebyshev polynomial. This is a polynomial that minimizes the sum of squares of error terms over a range Lambda min and Lambda max rather than at a finite number of points. The polynomial is expressed in radial directions of at most unit length and have the property that they result in something smaller than a unit. It is also the maximum on all such polynomials in the domain of the chosen variable. When we use a Chebyshev polynomial to minimize the sum of square of error terms from the CG, we get an equation in terms of a condition variable that can be plotted as an increasing parabolic curve. This further shows that the first step of CG is identical to a step of Steepest Descent. After that the convergence is usually faster than that of Steepest Descent because of a good eigenvalue distribution or good starting points. This does not imply that every step of the iteration enjoys faster convergence. That said, the rate of convergence is much higher initially and gradual much later