Cluster computing

Tuesday, May 10, 2016

Today we continue to read up on security zones. A security zone is nothing more than a network segment with protected ingress. We said that the external facing security zones are typically prevented from talking to one another. Each is a separate VLAN and no routing is allowed between them. These edge switches typically trunk to an L2 aggregation switch. Aggregating external traffic allows implementation of single point packet, session and network behaviour monitoring. The L2 switch trunks to a L3 Q Switch. An Intrusion Prevention system may be placed on this trunk. Q Switches provide dynamic port configuration. This allows a switch to configure port as an access port or a trunk port.They dynamic trunking protocol automatically creates trunk port between Q switches. A DTP can be compromised if the there are two VLAN tags in the packet. This concept is similar to VPN. When an attacker sends a packet with two VLAN tags, the first Q-switch strips the packet one tag and send it out to the other switch which treats it as a regular packet and forwards to all appropriate ports. The double tagging must be disabled to mitigate this. This is also another reason why don't have the edge switches talk to one another. A private VLAN extends this restriction further by configuring the ports in a way where none of the ports in the VLAN set can communicate with each other. For example, database servers are put in a private VLAN because they don't need to communicate with each other. Finally access is only one of the instruments for control. ACLs, separation of duties, least privilege and need to know are all applied to improve security.
Some examples of securing packet traffic with advanced techniques can be seen as in the case of IP Packet security with IPSEC protocols that authenticate and encrypt IP Packets. The IPSEC protocol uses the concept of security association in order to secure the IP. A security association is a bundle of algorithms and parameters (such as keys) that is used to authenticate and encrypt a particular flow in one direction. A flow is usually denoted as a five tuple <src address, src port, destination address, destination port, protocol> While TLS and SSH operate in the upper layers of the Application which involve data and transport, only IPSec works on Internet layer. The link layer is secured by 802.1x, Point to Point protocols PPP and Point to Point Tunneling protocols.

Is Partitioning a good idea ?
In computing industry, there is a widespread understanding that partitioning is best done only when it is absolutely necessary. In other words we don't involve any partitioning of resources because such planning doesn't always scale to future needs and often is more binding with lot more work to undo if business needs change.
take ulimits for example and almost everyone will understand it is more onerous than it does good. Some operating systems do not provide this option.
Take disk partitioning and we only find ourselves shooting in the foot when we partition up front and discover that we have improperly managed resources. A large unfragmented disk does not cause such symptoms.
Even Solaris introduced partitions for different flavors and soon applications would start failing in one partition because they didn't have enough resources on this as they earlier had on the undivided host.
Quotas often cause resource starvation and non-deterministic behavior of otherwise sound applications. Soft limits complicate the matter even further.
Containers are useful when they are elastic.

#coding exercise
Write a function that takes two parameters n and k and returns the binomial coefficient C(n, k)
int C(int n, int k)
{
if ( k == 0 || k == n) return 1;
return C(n-1,k-1) + C(n-1,k);
}

Given a boolean expression with the following symbols
'T' - true
'F' - false and logical operators
& - AND
| - OR
^ - XOR
Count the number of ways we can paranthesize the expression so that the value remains true.
We have to do this operator by operator basis:
Let T(i,j) represent the number of ways to paranthesize the symbols between i and j ( such that the subexpression between i and j evaluates to true
Let F(i,j) represent the number of ways to paranthesize the symbols between i and j ( both inclusive) such that the subexpression between i and j evaluates to false.

Total(i,j) = T(i,j) + F(i,j)
T(i,j) = sum k from I to j-1 [ T (i,k) × T (k+1,j) for the AND operation
Total(i,k) x Total (k+1,j) - F (i,k) × F (k+1, j) for the OR operation
T (i,k) × F (k+1, j) + F (i,k) x T (k+1, j) for the XOR operation

The reverse holds for F(i,j)

Monday, May 9, 2016

Today we discuss network segmentation from Tom Olzak's book : Enterprise security : a practitioner's guide.
The book argues that segmentation is important.Segmentation breaks a network into multi layer attack surface that hinders threat agents / actions that are both internal as well as external. Segmentation is not just about basic switch security but also how to control packets at the network and data link layer.
External threats can be mitigated by perimeter defenses which establish a trust boundary either on or outside the data center perimeter such as with a DMZ or SSL VPN appliance. Perimeter controls prevent unauthorized access to system attack surfaces. An authenticated user has unlimited access internally. But the perimeter defence does not always prevent unauthorized access. The flat data center network is one large broadcast domain. Any device sending an ARP broadcast looking for an IP address in the data center will receive a reply if the address is assigned to an active server. This lower layer penetration also poses a risk.
Another advantage of segmentation is protocol separation. IPX or AppleTalk can get their own VLANs. This limits traffic in each VLAN. Finally the use of VLANs enable secure, flexible user mobility. A user may be assigned to a specific VLAN and will connect to it regardless of whether he is coming in via the wireless network so long as the intermediary 802.1x is used for port authentication. The authentication server has the advantage that it can let the user groups be in active directory and assign the VLAN dynamically. If it were static, an attacker could simply plugin to the network.
VLANs are configured using Layer 2 switches.The medium access control (MAC) address has two parts - the first part being organizationally unique identifier (OUI) and the second being Network Interface Controller identifier (NIC). ARP broadcasts help resolve connectivity between devices but these increases traffic on the same network. Switches use content addressable memory (CAM) table to track MAC address / port pairs. From the update at the time of the first packet forwarding through the entry's aging period, the switch forwards all packets the same way. A VLAN is a set of switch ports. Devices connected to the same VLAN can talk to one another but are logically isolated from devices that are not connected to the same VLAN. When there are more than one switch in an organization, a trunk is configured between them using a port on each switch.
A datacenter may have one or more internal security zones. A security zone is merely a network segment with protected ingress.For example, one zone can provide a secure bridge between the internet and the data center. Usually external facing data zones do not communicate with one another. Essentially this aggregates all traffic to come through the same trunk. An intrusion prevention system (IPS) may be setup on this trunk.
#codingexercise
Convert a binary tree into doubly linked list in spiral order.
// same as level order but with odd levels having reversed listing
void spiralLevelOrder(Node* root)
{
if (root == null) return;
deque<Node*> q;
q.push_front(root);
stack<Node*> stk;
int level = 0;
while (!q.empty())
{
int nodecount = q.size();
if (level %2 != 0)
{
while (nodecount > 0)
{
Node* node = q.front();
q.pop_front();
stk.push(node);
if (node->left != null)
q.push_back(node->left);
if (node->right!= null)
q.push_back(node->right);
nodecount --;
}
}
else{
while (nodecount > 0)
{
Node* node = q.back();
q.pop_back();
stk.push(node);
if (node->left != null)
q.push_front(node_left);
if (node->right != null)
q.push_front(node_right);
nodecount--;
}
}
level++;
}
Node* head = null;
while (!stk.empty())
{
push (&head, stk.top());
stk.pop();
}
for(Node* cur = head; cur; cur = cur->next)
printf("%s", cur->data);
}

# print root to leaf paths in a binary tree
void PrintRootToLeaves(Node root)
{
if (root == null) return;
var stk = new Stack<Node>();
var parent = new Dictionary<Node, Node> ();
parent[root] = null;
stk.push(root);
while (stk.empty() == false)
{
var cur = stk.top()
stk.pop();
if (cur.left == null && cur.right == null) printStack( current, parent);
if (cur.right) {parent[current.right] = current; stk.push(cur.right)}
if (cur.right) {parent[current.left] = current; stk.push(cur.left)}
}
}

Sunday, May 8, 2016

#coding exercises
1) Find minimum number of merge operations to make a palindrome.
Ex: {15, 4, 15} => 0
{1, 4, 5, 1} => 1 because 4 and 5 are merged.
{11, 14, 15, 19} => 2 because all elements need to be merged.

int GetMergeCount(List<int> numbers)
{
int start = 0;
int end = numbers.Count;
while (start < end)
{
if (numbers[start] != numbers[end])
{
count += 1;
}
start++;
end--;
}
return count;
}

2) Print diagonal traversal of binary tree

void PrintDiagonal(Node root, int level. ref Dict<int,List<int>> levelValues)
{
if (root == null) return;
levelValues[level].Add(root.data);
PrintDiagonal(root.left, level+1, ref levelValues);
PrintDiagonal(root.right, level, ref levelValues);
}

void PrintDiagonal(ref Dict<int, List<int>> levelValues)
{
foreach (kvp in levelValues)
{
var level = kvp.Key;
console.WriteLine('Level='+level);
var items = kvp.Value;
for (int i = 0; i < items.Count; i++)
{
Console.Write(items[i].toString());
}
console.writeLine();
}
}

3) bool isPower(intN, int X)
{
return IsInteger(Math.Log(N)/Math.Log(x));
}

4) void Boggle(List<char> input, List<string> knownwords, ref List<char>candidate, int start, int level)
{
for (int I =start; I< input.Count; I++)
{
candidate[level] = input[I];
if (knownwords.Contains(candidate.toString()) console.writeline(candidate.toString());
if (i < input.Count)
{
Boggle(input, knownwords, ref candidate, start+1, level+1);
}
candidate[level] = '/0';
}
}

5) To determine if there is a backedge from any of the vertex to its ancestors in a graph other than the tree it is part of, we can maintain two arrays : a discovery time array which measures the time it takes to discover this vertex from root aka levels and a low value array that for a vertex u is the minimum of the discovery time for the vertex u and that of its ancestor w
low[u] = min(disc[u], disc[w])
if ( low[v] >= disc[u]) then there is a backedge.
Also note that by definition of a tree, every node has a parent p

6) void ConnectNodesAtSameLevel(Node root)
{
List<Node> levels = LevelWiseTraversal(root); // using a queue described earlier, append NULL for delimiters
for (int i = 0;i < levels.Count -1; i++)
if (levels[i])
levels[i].side = levels[i+1];
}

7) Print top view of a binary tree ( nodes that are on the periphery)

void PrintTopView(Node root)
{
// similar to level wise traversal
var ret = List<Node> ();
int left = 0;
int right = 0;
var q = new Queue();
q.enqueue(root);
root.dist = 0; // or use a hashtable
q.enqueue(null);
while (q.empty() == false)
{
var item = q.dequeue();
if (item == null){
q.enqueue(null);
}else
{
if (root.dist == 0 { ret.Add(root);}
if (root.dist < left ) { left = root.dist; ret.Add(root);}
if (root.dist > right) { right = root.dist;. ret.Add(root);}
root.left.dist = root.dist-1;
q.enqueue(root.left);
root.right.dist = root.dist+1;
q.enqueue(root.right);
}
}
ret.ForEach( x => console.write(x.toString()));
}

Friday, May 6, 2016

#coding questions
1) Given a cost matrix cost[,] with positive integers find the minimum cost between a target cell and origin. The permitted moves are right, down and down right.
void GetMinCost(int[, ] cost, int rows, int cols, int x, int y, ref sum )
{
if ( x < 0 || y < 0 || x >= rows || y>= cols) {sum += INT_MAX; return;}
sum += cost[x,y];
if (x == 0 && y == 0) return;
sum += min( GetMinCost(cost, rows, cols, x-1,y-1, ref sum),
GetMinCost(cost, rows, cols, x,y-1, ref sum),
GetMinCost(cost, rows, cols, x-1,y, ref sum);
}

2) Buy or sell 1 stock many times on different days to maximize profit from a daily price list

int GetMaxProfit(List<int> prices)
{
int start = prices[0];
profit = 0;
for (int i = 1; i < prices.Count; i++)
{
if ( prices[i] < prices[i-1])
{
profit += prices[i-1]-start;
start = prices[i];
}
}
if (prices.Count > 0)
{int last = prices[price.Count-1] ;
if (last > start)
profit += last -start;
}
return profit;
}

Reverse a linked list
Node* reverse(node* root)
{
if (root == null) return null;
Node* prev = null;
Node* cur = root;
while(cur)
{
Node* next = cur->next;
cur->next = prev;
prev = cur
cur = next;
}
return prev;
}

1->2->3
3->2->1

Today we are going hiking down the Snoqualmie falls.

Design of an Email Campaign system:

Email Campaigns are often required in the lines of work that involves customers for the resources they request. Such campaigns often involve a subset of the people who are chosen based on some criteria. The mails to them also have different format and content depending on the nature of the campaign. Furthermore, email templates may need to be filled before they can be sent out and there may be retries involved.

A system that enables campaigns to be generated is a good candidate for automation as a background job. Let us look at this job in a bit more detail:

1) First we need a way to specify the criteria for selecting the email recipients. This can be done with a set of logical conditions using ‘or’ and ‘and’ operators. Each condition is a rule and the rules may have some order to it. Consequently the rules do not necessarily fit in a table. They are best expressed as a stored procedure with versioning. If the data to filter resides in a table such as say the customers table, then the stored procedure resides as close to the data as possible.

2) The message may need to be formatted and prepared for each customer and consequently these can be put in a queue where they are appropriately filled and made ready. A message broker comes very useful in this regard. Preparation of emails is followed with sending them out consequently there may be separate queues for preparation and mailing out and they may be chained.

3) The number of recipients may be quite large and the mails for each of them may need to be prepared and sent out. This calls for parallelization. One way to handle this parallelization would be to have workers spawned to handle the load on the queue. Celery is a good example of such a capability and it works well with a message broker.

4) A web interface to generate campaigns can be useful for administrators to interact with the system.

The data flow begins with the administrator defining the campaign. This consists of at the very least the following: a) the email recipients b) the mail template c) the data sources from which to populate the databases and d) the schedule in which to send out the mails. 

The email recipients need not always be specified explicitly especially if they number in millions. On the other hand, the recipients may already be listed in a database somewhere. There may only be selection criteria for filtering the entire list for choosing the recipients. Such a criteria is best expressed in the form of a stored procedure. The translation of the user defined criteria into a stored procedure is not very hard. The user is given a set of constraints, logical operators and value inputs and these can be joined to form predicates which are then entered as is into the body of a stored procedure. Each time the criteria are executed through the stored procedure, the result set forms the recipients list. When the criteria change, the stored procedure is changed and this results in a new version.  Since the criteria and stored procedure are independent of the message preparation and mailing, they can be offline to the mailing process.  

The mailing process commences with the email list determined as above.  The next step is the data that needs to be acquired for each template. For example, the template may correspond to the resources that the recipients may have but the list of resources may need to be pulled from another database. It would be ideal if this could also be treated as SQL queries which provide the data that a task then uses to populate the contents of the email. Since this is per email basis, it can be parallelized to a worker pool where each worker grabs an email to prepare.  An email receives a recipient and content.  Initially the template is dropped on the queue with just the email recipient mentioned. The task then manages the conversion of the template to the actual email message before putting it on the queue for dispatch. The dispatcher simply mails out the prepared email with smtp.  

The task parallel library may hide the message broker from the parallelization. Celery comes with its own message broker that also allows the status of the enqueued items to be logged. However, a fully fledged message broker with a worker pool is preferred because it gives much more control over the queue and the messages permitted on the queue. Moreover, journaling and logging can with the automation. Messages may be purged from the queue so that the automation stops on user demand. 

Therefore data flows from data sources into the emails that are then mailed out.  The task that prepares the emails needs to have access to the database tables and stored procedures that determine who the recipients are and what the message is. Since they act on individual email basis, they are scalable.   

Conclusion: An Email campaign system can be written using a database, a message broker and celery.

1->2->3->4->5->6
1->6->2->5->3->4

void Interleave(Node root){
Node start = root
Node end = find_last(start);
int count = 0;
for(Node cur = start; cur; cur=cur.next) count++;
while ( count > 0)
{
Remove(start, end);
Insert(start, end);
if (start.next)
{
start = start.next.next;
}
end = find_last(start);
count -= 2;
}
}
Node find_last(Node start)
{
Node end = start;
for(Node cur = start; cur; cur=cur.next){
end = cur;
}
return end;
}

void Remove(Node start, node end)
{
Node prev = start;
Node target = start.next;
for(Node cur = target; cur && cur.next; cur=cur.next){
prev = cur;
}
assert(cur == end);
if (prev && prev.next == end)
prev.next = end ? end.next : null;
}

void Insert(Node position, node end)
{
if (end)
{
end.next = position.next;
}
position.next=end;
}

Thursday, May 5, 2016

Today we revisit a few graph algorithms.
The breadth first search
BFS(G,s)
for each vertex, initialize color, parent and distance
set the start vertex color to gray
enqueue the start vertex
while the queue is not empty
       for each of the adjacent vertices of the dequeued vertex
        if its not been visited
               color the vertex as gray and set its parent and distance
               enqueue this vertex
       color the vertex as visited

DFS(G)
for each vertex u belonging to the graph, initialize it
for each vertex u belonging to the graph,
      if color of vertex is white,
          DFS-Visit(the vertex)

DFS-Visit(G,u)
for each adjacent vertex
     DFS-Visit(G,u)

MST-Kruskal
// This joins two distinct trees
A = null
for each vertex v in G
     Make-Set(v)
sort the edges of G, E into non-decreasing order by weight
for each such edge (u,v)
     if Find-set(u) <> Find-set(v)
        A = A union (u,v)
         Union(u,v)
return A

MST-Prim
// this grows a tree
A = null
for each vertex v in G, initialize the key and the parent
Initialize a min-priority queue Q with vertices
while the Queue is not empty
       extract the vertex with the minimum edge distance connecting it to the tree
       for each adjacencies v of this vertex u
              set the key to the weight(u,v) and parent

Dijkstra's shortest path algorithm
Other than source vertex, mark all as max distance no parent
Initialize a min-priority queue Q with vertices
while the Queue is not empty
     extract the minimum vertex u of Q
             add it to the shortest path
             for all vertices adjacent to u,
                   relax the vertices

All-Pairs-shortest-path (predecessor matrix, src, dest)
if src==dest
    print src
else if parent(dest) == null
    print "no path"
else all-pairs-shortest-path(predecessor matrix, I, parent-srcdest)
    print dest

Floyd-Warshall gives shortest path weighs matrix
dij with no intermediary vertices = edge weight
        = min(dij k-1 , dik k-1 + djk k-1 ) when # intermediary vertices k >=1

for k = 1 to n
   D for k be a new nxn matrix
   for I = 1 to n
      for j = 1 to n
         dij(k) = min(dij(k-1), dik(k-1) + dkj(k-1))

return Dn

int BuyOrSellOnce(List<int> prices, ref int start, ref int end)
{
int min = 0;
int min_index = start;
int max_index = start;
int max_profit = -1;
for (int i = start; i < end; i++)
{
if (prices[i] < min)
     min = prices[i]
if (prices[i] - min > max_profit){
    max_profit = prices[i] - min;
    max_index = i;
   }
}
start = max_index+1;
end = prices.Count -1;
return max_profit;
}
int BuyOrSellManyTimes(List<int>prices)
{
int start = 0;
int end = prices.Count-1;
int max_profit = -1;
int profit = BuyOrSellOnce(prices, start, end);
while (profit != -1) {
max_profit += profit;
profit == BuyOrSellOnce(prices, start, end);
}
return max_profit;
}

int main() {
//code
int* p = new int[5];
p[0] = 3;
p[1] = 1;
p[2] = 2;
p[3] = 4;
p[4] = 5;
printf(getminindex(p,5))
return 0;
}

int getminindex(int[] p, int n)
{
int min = 0;
for (int i = 0; i < n; i++ )
for (int j = i+1; j < n; j++)
{
if (p[j] >= p[i] && j - i > min)
min = j-i;
}
return min;

}

Wednesday, May 4, 2016

We review a few more coding problems.
Node TreeSuccesor (Node root)
{
If root.right return TreeMinimum (root.right);
Node x = root;
Node y = root.parent;
while ( y && x == y.right)
{
x = y;
    y = y.parent;
   }
return y;
}
Another one is about merging a few time intervals
Collapse(List<Tuple<int,int>> intervals)
{
var cmp = new IntervalComparator();
intervals.sort(cmp); //
var s = new Stack<Tuple<int,int>>();
s.Push(interval[0]);
for (int I = 0; I< intervals.count; I++)
{
var top = s.top();
if (top.end < interval[I].start)
    s.push(interval[I];
else if (top.end < interval[I].end){
        top.end = interval[I].end;
        s.pop();
        s.push(top);
}
}
while (!s.empty())
{
Interval t = s.top();
console.writeline("{0}-{1}",t.first, t.second);
s.pop();
}
}

Another one is about finding max subarray:
int GetMaxSubarray(List<int> items)
{
int x = 0; // sum
int y = 0; // minimum sum
int z = 0; // maximum sum
for( int I =0 ; I < items.Length; I++)
{
x = x+ items[I];
if (x < y)
     y = x;
if (x-y > z)
    z = x-y;
}
return z;
}

Another one is about beating the stock market. We are given an array for which the ith element is the price of a given stock on day i. If we were only permitted to buy one share of the stock and sell one share of the stock, we have to find the best times to buy and sell to get the best deal.
int GetBestDeal(List<int> prices)
{
int min = 0;
int deal = -1;
for (int I = 1; I < prices.Count; i++)
{
if (prices[i] < prices[min])
    min = i;
if (prices[i] - prices[min] > deal)
    deal = prices[i] - prices[min];
}
return deal;
}

Find all pair paths in a graph
We use backtracking for finding all the paths.
void allpaths(int[,] graph, int n, int src, int dest, int threshold, ref List<int> candidatePath ...
for (int i = 0; i < adjacencies(src).Count; i++)
{
   // avoid cycles
   // avoid those exceeding threshold
   candidatePath.Add(adjacencies(src)[i]);
   candidateDist.Add(distances[i]);
   allpaths(graph, n, adjacencies(src)[i], dest, threshold, candidatePath ...
   candidatePath.Remove(adjacencies(src)[i]);
   candidateDist.RemoveAt(candidateDist.Count - 1);
}