Cluster computing

Wednesday, May 7, 2014

We look at a few more coding questions today. First we look at breadth first traversal and depth First traversal. Depth first traversal we can write recursively as follows with no additional data structure.
DFS-Visit(Root)
{
mark root as partially visited
for each vertex v adjacent to Root
if color of v = unvisited
set parent of v
DFS-visit(v)
mark root as visited
}
Next we look at breadth first search
In BFS, we maintain an additional data structure to store the siblings at any level.
BFS(G, s)
{
// initialize
ENQUEUE(Q,s)
While Q is not null
u = DEQUEUE(Q)
for each v belonging to Adj[u]
do if color of v = unvisited
then color it partially visited and
ENQUEUE(Q)
mark u as visited
}

I'm going to cover a few more questions but first I wanted to take a minute. In particular I want to cover graph based coding questions and revisit to distributed computing.

I will take a small example for edit distance first :
Find the minimum edit distance between two input strings. Edit distance is the minimal number of edit operations to modify a string from one state to the other.
For eg : Saturday can be changed to Sunday using the following transformations:
Saturday -> Sturday
Sturday -> Surday
Surday -> Sunday
if we denote the edit distance by a function f(i,j) between the substring of the first string ending with the j'th character and the substring of the second string ending with the i'th character. Then we know that f(i,0) = i since the because we need to delete that many characters to get an empty string. Similarly f(0,j) = j because the empty string is at the beginning of that same substring ending with the 0th character.
If the jth character of the first string is different from the ith character of the second string, then
there are three different techniques involved in the transformations and finding the edit distances. These are insertion, deletion and substitution.
We note that we can make the lengths equal by deleting the characters from the longer string or inserting the characters into the shorter string. We will shortly see how.
If the jth character of the first string is the same as the ith character of the second string, then the edit distance f(i,j) is the same as f(i-1,j-1)
If there is substitution involved then the edit distance f(i,j) = f(i-1,j-1) + 1
if we insert the ith character of the second string into the first, then edit distance f(i,j) = f(i - 1, j) + 1
If we delete the jth character of the first string, then the edit distance f(i,j) = f(i, j-1) + 1
Thus if the i != 0 and j != 0 and string [i] != string [j], then the edit distance is the minimum of
f (i-1,j) + 1 , f(i, j - 1) + 1 and f(i - 1, j - 1) + 1
With the above formula comprising of case i = 0 case j = 0 case string[i] == string[j] and string[i] != string[i,j] we can create a table of edit distances by enumerating the characters of the first string as columns and those of the second string as rows and using the substrings to compute edit distances for any pair of substrings. Note that we introduce a column at the beginning of all rows and we introduce a row at the beginning of all columns to hold the position for the empty string and we initialize the value of this first cell as zero. we can computer the edit distance for other pairs of substrings starting from here. At each additional row or column, we use the edit distance from the previous row or column and we find the minumum value from the three operations listed above .At the diagonal end of the first cell which is the last cell we will compute the edit distance of the entire first string and the entire second string.

Algorithm: The function to implement the above has the following loops
first loop to set the value of the first row ( corresponding to empty string )
second loop to set the value of the first column ( corresponding to the empty string)
then for each pair of characters from the first and the second string,
if they are equal we set it to the previous diagonal value
else (they are different and) we find the costs for insertion, deletion and substitution and choose the minimum with which we update the corresponding position in the table

We quickly review some graph algorithms next:

Minimum spanning trees:
Kruskal's algorithm:

// Make-set of all vertices
sort the edges of E into non-decreasing order by weight w
for each edge (u,v) belonging to E taken in that order
connect the edge if they belong to different sets

Prim's algorithm:
Put a vertex in the min-priority Q sorted by the weights
While Q is not empty,
extract the minimum
for each vertex v belonging to adjacents of u
if v belongs to Q and w(u,v) < key(v)
connect v to u and set its key

Shortest paths are found by relaxing the edges

Djikstra's shortest path:
Initialize-single-source(G,s)
Initialize Q with V[G]
While Q is not empty
u = extract-min(Q)
add u to shortest path vertices
for each vertex adjacent to u
relax (u,v,w)

In this post we look at a few more coding questions,. In particular we look at a technique called backtracking.
Backtracking helps over brute force method.
Take the alphabet matrix as follows:
a b c e
s f c s
a d e e
There is a path for the string 'bcced' but not for path abcb. Re-entering is not possible
This technique is called backtracking because it returns back to the previous state and uses pop operations from a stack. That is when it finds a character it pushes it on to the stack and for all the rejection where the character is not the same as the corresponding character, it backtracks and pops from the stack.
Since all entries can be entered only once, we keep a boolean flag pertaining to each entry.
bool hasPath(char* matrix, int rows, int cols, char* str)
{
// parameter validation
// bool* visited for the matrix of used flags
int pathLength = 0;
for (int row = 0; row < rows; row++){
for (int col = 0; col < cols; col++){
if (hasPathStack(matrix, rows, cols, row, col, str, pathLength, visited))
return true;
}
}
//cleanup
return false;
}

bool hasPathStack(char*matrix, int rows, int cols, char* str, int& pathLength, bool* visited)
{
// if the boundary conditions are satisfied &&
// if the path has not been found yet &&
// if the element has not been visited already
++ pathLength;
// visited set to true
// visit one of the four neighboring cells
bool hasPath = hasPathStack(matrix, rows, cols, row, col - 1, str, pathLength, visited) ||
hasPathStack(matrix, rows, cols, row - 1, col, str, pathLength, visited) ||
hasPathStack(matrix, rows, cols row, col + 1, str, pathLength, visited) ||
hasPathStack(matrix, rows, cols row + 1, col, str, pathLength, visited) ;
if (!hasPathStack) {
--pathLength; // backtracking
// visited set to false
}
}
[from the coding interviews book]

Tuesday, May 6, 2014

In this post we cover some coding questions based on string operations, bitwise manipulation, recursion, permutations and combinations.
Suppose we were to write down all the combinations of the indices of numbers in a given list of numbers that add up to a particular sum:
The algorithm would look something like this assuming we have a sorted list of numbers and we declare a type IndexedNumber that stores the value and the index corresponding to each entry in the input list
public static void combine(ref List<IndexedNumber> numbers, ref List<IndexedNumber> candidate>, ref List<List<IndexedNumber>> sequences, int level, int start, int n)
{
for (int i = start; i < numbers.Count; i++)
{
if(candidate.Contains(numbers[i]) == false)
{
candidate[level] = numbers[i];
if (candidate.Sum() == n)
sequences.Add(new List<IndexedNumber>(candidate));
if (i < numbers.Count - 1)
combine (ref numbers, ref candidate, ref sequences, level + 1, start + 1, n);
candidate[level] = new IndexedNumber() { Number = 0, Index = -1};
}
}
}
Let's now look at a bit manipulation problem.
We define the problem this way: Write a function to convert an integer x into an integer y and find the number of bits that will need to be changed. For example, given 12 and 16, it should return 3
because 12 is 0xC or 1100 and 16 is 0x10 or 10000 and we have to change 3 bits
int ConvertXtoY(int x, int y)
{
int a = x;
int b = y;
int count = 0;
while ( a || b)
{
if ( (a & 0x1) ^ (b& 0x1)) count ++;
a >>= 1;
b >>= 1;
}
return count;
}
String operation question:
find the longest repeating substring:
char* getMatch(const char* str)
{
char* match = NULL;
int len = 0;
char* first = str;
while (first)
{
char* second = str++;

// case 2
while (second)
{
    char* lMatch = NULL; //local match
    int llen = 0; // local match len
    char* orig = first;
    char* local = second;

    //case 0
    while ( local && *local == *orig)
    {
      if (llen == 0) lMatch = orig;
      llen++;
      local++;
      orig++;
    }

    if (llen > len ) { match = lMatch; len = llen;}

    //case 1
    second++;
}
first++;
}

return match;
}

Monday, May 5, 2014

Today we discuss some more of the .Net APIs again from the Framework class library. Here the we first take a brief look at System.Timers The System.Timer namespace provides a component that allows us to raise an event on a specified interval. This is a server based timer that allows us to specify a recurring interval at which the Elapsed event is raised. Then this event can be used to provide regular processing. Where these server timers differ from windows timers is that these timers can move among threads to handle the Elapsed event. The arguments to the Elapsed event provide data for handling it.
When we look at the implementation of this, it uses the underlying System.Threading.Timer class and GetSystemTimeAsFileTime method for the time arithmetic. On each update of the enabled, it updates the existing underlying timer. If a zero value was provided, it disposes the existing timer and sets the enabled to value. By updating the timer, it changes the value of the interval.
Let us now look at the System.Threading.Timer.
The System.Threading.Timer takes as initialization parameters : callback, state, dueTime and period. For initialization it makes use of an object called the TimerHolder. On each update of the timer it changes the dueTime and period. The TimerHolder in turn takes a TimerQueueTimer with the same initialization parameters for its initialization. The TimerQueue maintains a list of active timers in the AppDomain. It uses a single native timer supplied by the VM to actively manage all the timers.
The data structure used to maintain all the active timers is a doubly linked list because it requires fast insert and delete while traversal can be linear. The FireNextTimers method on this queue traverses each timer to see whether the elapsed is greater than the dueTime and deletes it if it's not periodic. If its the first time, it will be fired on this thread otherwise it will be fired on the thread pool. If the timer hasn't fired yet, it just updates the next time the native timer fires. By updating a timer, not only is its dueTime, period and startTicks updated but also its prev and next in the list.
Given the ease of programmability in C# with the Framework APIs, writing or even rewriting an entire application in C# would be interesting.
We will continue to review the rest of the Framework APIs but I want to take a brief break to solve a puzzle.
How do we decode Kanjii characters ? Let's treat Kanji as Unicode. We know that Unicode gives a codepoint to each of the characters represented in the standard. UTF-16 encoding uses the two 8bit bytes for the regular ISO characters and upto two 16 bit units to handle each additional characters. Essentially a text represented in UTF-16 encoding could have variable number of bytes for each character. We therefore walk the text a character at a time by identifying the byte boundaries much the same way as we carefully walk t the headers of a packet. In code this could translate as an iterator over the bytes that appends to a text a character deciphered by reading one or more bytes at a time. Each character could be represented as a codepoint in the usual notation of U+hex or U+four_digit_hex or U+five_digit_hex or more depending on the blocks and planes used as defined in the Unicode standard.
We will also briefly review another problem in which given an inorder and a preorder traversal of a tree, we will construct the tree and return the root.
So given an inorder traversal of 1,2,3,4,5,6,7
and a preorder traversal of 4,2,1,3,6,5,7.
This has already been treated earlier but the idea is reinforced with the following principle.
Given a node in the preorder list we find the root and the size of the left subtree by simply counting the number of nodes that appear before it in the inorder. So take 4 and the number of nodes before it is three in inorder. Therefore the fourth node from 4 in the pre-order is the right child. To determine the left child, take the immediately adjacent node in the pre-order.
In the above example, if we take a tree with the nodes 3 and 5 missing,
we have the inorder as 1,2,4,6,7
and preorder as 4,2,1,6,7
Here the left child of 6 is null and the right child of 2 is null.
We determine the missing left child by noticing that the item before in the inroder is the parent or null
We determine the missing right child by noticing that the item after in the inorder is the parent or null.
Always review and test all code.
We now look at some more .NET APIs
System.Threading also has other primitives typically available from the operating system.
These include locking and thread c intro primitive
While we are on the topic of semaphores and threads, let us look at a producer consumer problem. A producer consumer can use a semaphore for buffering a count of jobs whereas they can use a mutex if there is only one job to mutually exclude.

Today we discuss some more of the .Net APIs from the framework class library. Specifically we discuss the task parallel library. We first look at data and task parallelism and then we discuss some of the potential pitfalls we need to watch out for.
Data parallelism refers to scenarios in which the same operation is performed concurrently on say elements in a source collection. The collection is partitioned so that multiple threads can work on it. The for loop can be partitioned with a lambda expression.
Parallel.ForEach( sourceCollection, item => Process(item));
There are options available to stop or break loop execution, monitor the state of the loop, maintain thread local state, finalize thread-local objects, control the degree of concurrency and so on.
Task Parallelism is based on the concept of a task which represents an asynchronous operation and resembles a thread or thread pool without the corresponding OS primitives but as an abstraction
This enables more efficient and more scalable use of system resources.
It also enables more programmatic control than is possible in a thread or work item.
Parallel.Invoke() method enables creating and running tasks implicitly. As an example:
Parallel.Invoke(() => DoSomeWork(), () => DoSomeOtherWork());
The number of task instances for the above statement does not have to match the number of parameters to Invoke method.
For greater control over the tasks, we can use Task.Wait and Task.Run method.
The potential pitfalls in Data and Task Parallelism are:
we cannot assume that parallel is always faster. If the parallel loops have few iterations and fast user delegates, then it is unlikely to speed up much.
We should not write to shared memory locations because they increase the race conditions and contention.
We should avoid doing over-parallelization. Each partitioning incurs a cost and there is cost to synchronize the worker threads.
We should avoid calls to non-thread safe methods because it can lead to data corruption which may go undetected.
We should also limit the calls to thread safe methods because there can be significant synchronization involved.
Delegates used with Parallel.Invoke should not have waiting because it affects the execution of others and can result in deadlocks when the tasks are inlined to the currently executing thread.
We should also not assume that iterations in the loop specified always execute in parallel.
We should avoid executing parallel tasks on UI threads.

Sunday, May 4, 2014

Revisit of some TSQL
Create Table example:
CREATE TABLE Demographics(
id INT IDENTITY (1,1),
name VARCHAR(15) NOT NULL,
country VARCHAR(2) DEFAULT 'US'
CONSTRAINT country_not_null NOT NULL
CONSTRAINT country_check
CHECK (country IN ('CA', 'US')),
indexed_name VARCHAR(15),
CONSTRAINT demographics_pk
PRIMARY KEY (id)
CONSTRAINT demographics_fk01
FOREIGN KEY (name, country)
REFERENCES parent_example (name, country) );

ALTER TABLE Demographics ADD
age INT,
CONSTRAINT age CHECK (age > 0 AND age < 150);

Regular expressions can be specified with one or more of
%, _, [...] and [^...]
Common formatting syntax use CASE, COALESCE and ISNULL()

Hierarchical queries can be written with common table expressions:
To select direct reports for a manager, we write
WITH directreports
( manager_id, employee_id, employee_name, level)
AS
(SELECT e.manager_id, e.employee_id, e.employee_name, 0 as level FROM dbo.Employee e
WHERE e.manager_id is NULL
UNION ALL
SELECT d.manager_id, d.employee_id, d.employee_name, level + 1 FROM directreports d
ON e.manager_id = d.employee_id)
SELECT manager_id, employee_id, name from directreports

SELECT d.name as department_name
COUNT(*) as employee_count
From employee e inner join department d
ON e.department_id = d.id
GROUP by d.name WITH ROLLUP
-- generates summary row

WITH CUBE to generate summaries for all possible combinations and a grand total
GROUPING(group_by_col_name) returns 1 or 0 when the associated columns in the result set contain a null.

For string functions use CHARINDEX or SUBSTRING or LENGTH or +
DATETIME functions involve GETUTCDATE(), DATEADD(datepart, interval, date) and DATEDIFF(datepart, startdate, enddate) eg. DATEADD(day, 1, GETDATE());

use CONVERT for datetime

Next BCL coverage of .Net

FileSystem Interface:
Classes:
File : methods, Create, Delete, Open, Read, Write, GetTime
Directory: methods: Create, Delete, Exists, EnumerateFiles, GetDirectories, GetFiles, GetTime
FileInfo
DirectoryInfo
FileStream
Path - static methods.
FileWatcher

System.Messaging has the following classes and methods:
MessageQueue class has different methods as follows: 1) Send 2) Receive 3) Peek 4) , GetPublicQueuesByMachine, GetPublicQueuesByLabel and GetPrivateQueuesByMachine.

MessageQueueCriteria, MessageQueueEnumerator, MessageQueuePermission and MessageQueueTransaction are other classes

We write a few more methods.
Node * Tree_Successor(Node* root)
{
Assert (root != null);
if (root->right)
{return Tree_Minimum(root->right);}
Node* x = root;
// find the lowest ancestor of x whose left child is also an ancestor of x
Node* y = root->parent;
while ( y && x == y->right)
{
x = y;
    y = y->parent;
   }
return y;
}
Node* Tree_Minimum(Node* root)
{
Node *x = root;
while (x && x->left)
   x = x->left;
return x;
}

Node* Tree_Predecessor(Node* root)
{
Assert (root != null);
if (root->left)
{return Tree_Maximum(root->left);}
Node* x= root;
Node* y =   root->parent;
while (y && x == y->left)
{
x = y;
y = y -> parent;
}
return y;
}

File - System organization
There are primarily two design considerations:
1) exposure of file system to user via definitions of file, attributes operations
2) overlay of logical filesystem on top of physical file system via data structures and algorithms
The file system stack consists of
application programs
logical file system
file-organization module
basic file system
I/O control
devices
a typical open file table will have
index, file_name, permissions, access dates, and pointer to disk block.
A file is represented by an inode (short for index node) in Unix. The in ode contains the user and group identifiers of the file, last file modification and access, count of number of hard links, and the type of the file.In addition, the inode contains 15 pointers to the disk blocks containing the data contents of the file. Inodes also contain pointers that point to direct blocks. There can be pointers to single indirect and double indirect as well as triple indirect blocks. File structures are inherited by the child process after a fork so several processes may share the same offset location for a file.