Sunday, May 28, 2017

#codingexercise
Given a room and a robot that can move in one of four directions, find the size of the room. 
int GetSize(int x, int y) // (x, y) is the robot's current position
{
    // If one of the directions faces the wall, use it as the start;
    // otherwise walk left until the wall is reached.
    // Then walk along the boundary of the room, noting the four corners,
    // and use them to calculate the width and the height.
    int i = x;
    int j = y;
    while (Move(i, j, 0)) i--;   // 0 = left
    int wallStartX = i;
    int wallStartY = j;

    // We have a position on the wall; now we run along the boundary.
    while (Move(i, j, 1)) j++;   // 1 = up
    int topLeftX = i;
    int topLeftY = j;
    while (Move(i, j, 2)) i++;   // 2 = right
    int topRightX = i;
    int topRightY = j;
    while (Move(i, j, 3)) j--;   // 3 = down
    int bottomRightX = i;
    int bottomRightY = j;
    while (Move(i, j, 0)) i--;   // back to the left wall
    int bottomLeftX = i;
    int bottomLeftY = j;
    Assert(wallStartX == bottomLeftX);

    int width = Math.Abs(bottomRightX - bottomLeftX);
    int height = Math.Abs(topLeftY - bottomLeftY);
    return width * height;
}
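The same boundary walk can be simulated in Python. This is a minimal sketch, assuming a hypothetical move(x, y, d) probe that reports whether the robot can step in direction d (0 = left, 1 = up, 2 = right, 3 = down); the 4x3 room is an arbitrary example, and cells are counted inclusively between opposite corners:

```python
# A sketch of the boundary walk against a hypothetical rectangular room.
WIDTH, HEIGHT = 4, 3
STEPS = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}

def move(x, y, d):
    # True if the robot can step from (x, y) in direction d without hitting a wall.
    dx, dy = STEPS[d]
    return 0 <= x + dx < WIDTH and 0 <= y + dy < HEIGHT

def room_size(x, y):
    # Walk left to the wall, then trace the boundary corner by corner.
    while move(x, y, 0): x -= 1
    while move(x, y, 1): y += 1
    top_left = (x, y)
    while move(x, y, 2): x += 1
    while move(x, y, 3): y -= 1
    bottom_right = (x, y)
    # Count cells inclusively between opposite corners.
    width = bottom_right[0] - top_left[0] + 1
    height = top_left[1] - bottom_right[1] + 1
    return width * height
```

Starting anywhere inside the room gives the same area, since the walk always normalizes to the boundary first.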


If the room were irregular and not a rectangle, we would cover each coordinate horizontally and count each position occupied as a unit square; when we cover all the squares in one row, we proceed to the next. In this case, we could always try to begin from the top left.
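For the irregular case, one way to count the unit squares is a flood fill from the robot's current cell. The sketch below assumes a hypothetical grid representation where 0 is free floor and 1 is wall:

```python
from collections import deque

def room_area(grid, start):
    # Breadth-first flood fill: count every floor cell reachable from start.
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    q = deque([start])
    area = 0
    while q:
        r, c = q.popleft()
        area += 1
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                q.append((nr, nc))
    return area
```

Each cell is visited once, so the count is linear in the number of floor squares.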

Saturday, May 27, 2017

Today we continue the system design discussion on the online retail store 
We talk about API services from the store. We already saw that we organized each of the store activities in its own service. Consequently, the store API is a collection of APIs from each of the services listed earlier. The APIs may be offered through public cloud endpoints and monitored with CloudWatch-like services. Previously there was an appeal to private datacenter deployments, with the API hosted behind the firewall along with caching. This was relevant for Master Data Management (MDM) services, where it was important for the company to safeguard its master data and organize it behind its own services. It also facilitated proxy services that can collect statistics and troubleshooting information. The public cloud offerings are equally capable, if not better. Besides, public APIs are easily accessible by both end-user desktops and mobile devices. Native applications for the mobile devices can be developed independently from the site core. Caching services from the public cloud can not only facilitate easier content retrieval but also store full business objects. There is a perception that cloud-based services suffer from networking delays, but the performance is good, with abilities to re-route and tunnel. Similarly, privacy and security standards for data at rest and in transit are improved with standards. From allowing the web application back access to the company's VPN all the way to hosting each and every service on the public cloud, an organization may choose any degree of presence in the cloud.
For a business owner considering an online store, there are many options. First, there are online shopping stores that allow joining as a partner on seller forums. Second, there are vendors that provide seller services as a package with customizations for the company. Third, the company may choose to implement its own services.
The following are standard inference algorithms for probabilistic graph models: 
  1. Sum-product message passing, also called belief propagation 
  2. The variable elimination method 
  3. The junction-tree algorithm, which is belief propagation applied on a modified graph guaranteed to be a tree 
The first method is used as the forward-backward algorithm for computing marginal distributions and as the Viterbi algorithm for the most probable assignments.  
The sum-product message passing algorithm calculates marginal distributions for each unobserved node, conditional on the observed nodes. The marginal distribution of a single random variable is the summation of the mass function over all other variables. The mass function is merely the enumeration of probabilities of different states. For example, the mass function is one half for the toss of a coin arriving at either of the two outcomes and zero otherwise. 
This algorithm is best illustrated with a factor graph, where the variables are on one side and the factors are on the other. A function associating a variable node 'v' to a factor node 'a', and vice versa, is called a message. A message is a way of saying that a sender variable thinks that another variable belongs in these states with various likelihoods. A message from v to a is the product of the messages from all other neighboring factor nodes except the recipient. A message from a to v is the product of the factor with the messages from all other nodes, marginalized over all variables except the one associated with v. The complete marginalization is thereby reduced to a sum of products of simpler terms than the ones appearing in the full joint distribution. 
Upon convergence, the estimated marginal distribution of each node is proportional to the product of all messages from adjoining factors. The estimated joint probability distribution is proportional to the product of the factor and the messages from the variables. Since the marginal distribution of each variable node is called its belief, this method determines the beliefs of each variable. The Viterbi algorithm is a special case of the max-product (or min-sum) algorithm, used for the most probable explanation.  
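As a small illustration of the marginalization these messages compute, here is a hypothetical two-variable factor graph, with a unary factor f1 on a and a pairwise factor f2 on (a, b); the marginal of b is the normalized sum over a of the product of factors (the factor values below are arbitrary examples):

```python
# Factors of a hypothetical joint p(a, b) proportional to f1(a) * f2(a, b).
f1 = {0: 0.6, 1: 0.4}                                    # unary factor on a
f2 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}  # pairwise factor

# Message from factor f2 to variable b: sum over a of f1(a) * f2(a, b).
msg = {b: sum(f1[a] * f2[(a, b)] for a in (0, 1)) for b in (0, 1)}

# Normalize by the partition function to get the marginal of b.
z = sum(msg.values())
marginal_b = {b: msg[b] / z for b in (0, 1)}
```

On a tree-structured graph this sum of products of local terms is exactly what the full marginalization over the joint distribution would produce.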
Variable Elimination is used to estimate the marginal distribution over a subset of variables.  
Courtesy: wikipedia 
#codingexercise
Given a tree of boolean values where the nodes result from logical and operation of the children, fix the tree when a given leaf is inverted.
void Invert(Node root, Node leaf)
{
    if (leaf == null || leaf.left != null || leaf.right != null) return; // only leaves may be inverted
    leaf.data = !leaf.data;
    Fix(root);
}
void Fix(Node root)
{
    // Recompute internal nodes bottom-up as the AND of their children.
    if (root.left != null && root.right != null)
    {
        Fix(root.left);
        Fix(root.right);
        root.data = root.left.data && root.right.data;
    }
}
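The same fix-up can be sketched in Python, assuming a hypothetical Node class where each internal node's value is the AND of its two children:

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right

def fix(root):
    # Recompute internal nodes bottom-up as the AND of their children.
    if root.left is not None and root.right is not None:
        fix(root.left)
        fix(root.right)
        root.data = root.left.data and root.right.data

def invert(root, leaf):
    if leaf.left is not None or leaf.right is not None:
        return  # only leaves may be inverted
    leaf.data = not leaf.data
    fix(root)
```

A full re-traversal is O(n); walking only the root-to-leaf path would be faster but requires parent links or the path.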


Friday, May 26, 2017

Today we discuss another design question - this time for an online retail store 
An online retail store allows customers to have a great shopping experience. They are able to browse the inventory, select items and quantities for purchase, put them in a shopping cart and check out with one of their payment methods. The orders they place are then fulfilled and shipped to their address. In the process they collect rewards and participate in various campaigns, loyalty programs and surveys. This shopping is enabled on a variety of devices such as desktops and handheld devices. The primary use case is selecting an item and purchasing it on a desktop. We can estimate the traffic to the website as over a million transactions a day under peak load, such as on anniversaries, primarily in one geographic region. The number of web requests is orders of magnitude more than the transactions, as customers browse the site for items. The store must have high availability, with performance criteria established in Service Level Agreements.
Clearly we have a combination of many independent services to power the shopping experience. As the store grows its customer base, the demands on the system will increase over time. Consequently, each component and service must scale, and the more decoupled they are, the easier it is to make changes or upgrades with little or no disruption. A service-oriented architecture, or better yet a microservices architecture, helps with such a modular organization. The user interface, the services and the storage layer can now be focused and specific to each service, while the overall website provides a unified experience over these services. The website issues requests to each of the services and is organized as a consolidation over them. The site itself is designed as a Model-View-Controller application with an object-relational mapping, providing standard enterprise application blocks such as logging, caching and exception handling. The configuration and resources are maintained in separate repositories. The API services also allow mobile-optimized websites, and the services themselves can be externally monitored and managed via an application gateway.
 Some of these services include:
1) Authentication services : provides access tokens based on one or more methods to acquire them, relying on
a) password b) client id and user id c) authentication code, d) client credentials. Refresh tokens are also provided to keep the customers signed in
2) Loyalty service : The loyalty service provides licensees with a way to integrate their loyalty systems
3) Account service: such as to create US Store account and non-US Store account, to validate and recover credentials, to create account profiles and to update social profile data.
4) Card service: Provides customer's card information to keep track of their purchases. This gives information such as Card URL, Card details, Card balances, and the ability to reload a store card.
5) Payment methods : Enables collection of sums from the customer using a payment method such as a credit card or PayPal
6) Rewards service: provides rewards summary and history information based on application user, points needed for next level, points needed for next free reward, points needed for re-evaluation etc.
7) Barcode service: The barcode service generates Barcodes to communicate with in-store scanners and applications.
8) Content service: These are used to retrieve localized contents for any web client. Content is stored and retrieved in site core.
9) Transaction History service: These are used to retrieve the history data for a specified user, for all cards.
10) Device Relationship service: These are used to insert or update a device, to get reporting actions such as touch-to-pay to display the barcode for scanning and touch-when-done to report when completing the barcode for scanning.
11) Product service : to browse the offerings of the store. This may become a full fledged MDM with terabytes of data.
#codingexercise
Enter a two-digit number; find the year closest to this year. 
Example: input 99, return 1999; 
input 15, return 2015. 
public int GetClosestYear(int y, int pivot = 50)
{
    // Two-digit years below the pivot fall in the 2000s; the rest in the 1900s.
    if (y < pivot)
        return 2000 + y;
    else
        return 1900 + y;
}
We can also work this out by taking the remainder from 100 to see whether the offset needs to be 1900 or 2000.
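The pivot rule is a one-liner in Python; the pivot of 50 is an assumption that splits the century evenly:

```python
def closest_year(y, pivot=50):
    # Two-digit years below the pivot fall in the 2000s; the rest in the 1900s.
    return 2000 + y if y < pivot else 1900 + y
```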

Thursday, May 25, 2017

For the next few days, I will write about System Design problems which are increasingly becoming popular in interviews. These problems delve into planning, design, architecture and deployment considerations and are not as focused as the coding exercises. Let us take the example of URL shortening service.
The use cases for shortening service include the following:
shortening - take a url and return a much smaller url generally with an endpoint and a hash
redirection - take a short url and redirect to the original url.
With the primary use cases, we can start the design. Although it may not be necessary to include more use cases in the design, we list some others: custom urls, analytics, automatic link expiration, manual link removal, and interface options such as UI, API and command line.
The use cases help define the problem we are trying to solve. Spending some time up front on the use cases, helps us plan for the next steps.
Any system can solve a given use case but the design gets more specific as we introduce constraints in the system.
These constraints include high availability with strict Service Level Agreements. There may be 100 million urls per month, with 1 billion requests per month. The traffic may be over 400 requests per second. 99% of these requests need to be responded to. 10% of the requests may be for shortening and the other 90% may be for redirection. Adding them up, we get over 6 billion urls in 5 years.
With this estimation, we proceed to the storage and processing requirements for the design.  Each URL may take say 1000 bytes and its hash may take twelve bytes for unicode.  This may result in over 6 TBs for all urls.
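The storage estimate above is simple arithmetic, sketched here with the stated assumptions of 100 million urls per month, roughly 1000 bytes per url and twelve bytes per hash:

```python
# Back-of-the-envelope capacity check for the url shortener.
urls = 100_000_000 * 12 * 5           # 100M urls/month over 5 years: 6 billion
bytes_total = urls * (1000 + 12)      # each record: ~1000-byte url + 12-byte hash
tb = bytes_total / 1e12               # total storage in terabytes
```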
The storage acts like a huge hash table which stores new mappings and retrieves a url for a given hash.
The hash itself is computed as the last six characters of the base64 of the md5 of the original url taken with random salt.
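That hash construction can be sketched with the standard library; the salt value here is an arbitrary example, and the base64 padding is stripped before taking the last six characters (an assumption, since the trailing `==` of a base64-encoded md5 digest carries no information):

```python
import base64
import hashlib

def short_hash(url, salt="example-salt"):
    # md5 of the salted url, base64-encoded, keeping the last six characters.
    digest = hashlib.md5((salt + url).encode()).digest()
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")[-6:]
```

The same url and salt always yield the same six characters, so the mapping only needs to be stored once per url.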
In order to scale the shortening and the redirection, we can create separate services for each so that they scale independently. Each service may be backed by several identical web servers behind a load balancer. The data may live in a common server that stores all the state. Databases can easily scale to the data we have estimated. Caches can improve response times, and the most used redirections can easily be served from cache. Other bottlenecks may similarly be improved.
courtesy: HiredInTech
#codingexercise
Shuffle :
        static void Shuffle(ref List<int> A)
        {
            // Fisher-Yates: swap each element with a randomly chosen one at or below it.
            var randObj = new Random();
            for (int i = A.Count - 1; i > 0; i--)
            {
                var j = randObj.Next(i + 1);
                int temp = A[i];
                A[i] = A[j];
                A[j] = temp;
            }
        }
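The same Fisher-Yates shuffle in Python, as a sketch; any shuffle must leave the list a permutation of its input:

```python
import random

def shuffle(a, rng=None):
    # Fisher-Yates: swap each element with a randomly chosen one at or below it.
    rng = rng or random.Random()
    for i in range(len(a) - 1, 0, -1):
        j = rng.randrange(i + 1)
        a[i], a[j] = a[j], a[i]
    return a
```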
Detect a negative-weight cycle in a graph:
Bellman-Ford(G, w, s)
Initialize-single-source(G,s)
for i = 1 to number of vertices -1
    for each edge (u,v) belongs to edges E in G
          relax(u,v,w)
for each edge(u,v) belongs to edges E in G
     if (v.d  > u.d  + w(u,v))
         return False
return True
A topological sort can also detect cycles, but this approach works with negative edge weights too.
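The pseudocode above translates directly to Python over an edge list; after the standard n-1 relaxation passes, any edge that can still be relaxed lies on or reaches a negative cycle:

```python
def has_negative_cycle(n, edges, s):
    # Bellman-Ford: n vertices, edges as (u, v, weight) tuples, source s.
    INF = float("inf")
    d = [INF] * n
    d[s] = 0
    for _ in range(n - 1):
        for u, v, w in edges:
            if d[u] + w < d[v]:
                d[v] = d[u] + w
    # One more pass: any further improvement means a negative cycle.
    return any(d[u] + w < d[v] for u, v, w in edges)
```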

Wednesday, May 24, 2017

Markov networks and probabilistic graphical models continued
Inference is a step performed both during training and when predicting the labels on new inputs. There are two inference problems: 1) during training, we compute the marginal distributions over subsets of labels, such as node marginals and edge marginals; 2) for prediction, we compute the most probable assignment of labels. Both have the same common computations, except that we switch to maximization in one instead of the summation in the other. These inference tasks are performed by well-known algorithms such as the forward-backward algorithm for computing distributions and the Viterbi algorithm for computing the most probable assignment. These algorithms are special cases of general belief-propagation algorithms for tree-structured graphical models. For more complex models, approximate inference is necessary. Any inference algorithm for a graphical model is applicable to a CRF, with the consideration that it will be called repeatedly during parameter estimation.
Pgmpy provides a way to convert Markov Networks to Bayesian graph models involving both random variables and factors. The reverse conversion is also facilitated. Pgmpy can be extended to add more post-inference methods based on conventional graph operations. These include connected components, degree counting, graph coloring, k-core, label-propagation, pagerank, shortest path, and triangle counting.  
Out of these methods, the PageRank is perhaps used most often to selectively filter the results. Since we already calculate the marginals, both for the nodes and for the edges, we already have a graph to calculate the PageRank on. PageRank can be found as a sum of two components. The first component is represented in the form of a damping factor. The second component is the summation of the page ranks of the adjacent vertices, each weighted by the inverse of the out-degree of that vertex. This is said to correspond to the principal eigenvector of the normalized link matrix. 
Centrality can also be measured in terms of PageRank once we have a connectivity matrix based on intra-vector cosine similarity. 
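The two-component form described above can be sketched as a power iteration; the three-page cyclic graph and the damping factor of 0.85 are example assumptions:

```python
def pagerank(links, damping=0.85, iters=50):
    # links maps each page to the set of pages it links out to.
    n = len(links)
    rank = {p: 1.0 / n for p in links}
    for _ in range(iters):
        new = {}
        for p in links:
            # Each in-neighbor q contributes rank[q] / out-degree(q).
            incoming = sum(rank[q] / len(links[q]) for q in links if p in links[q])
            new[p] = (1 - damping) / n + damping * incoming
        rank = new
    return rank

links = {"a": {"b"}, "b": {"c"}, "c": {"a"}}
```

On this symmetric cycle the iteration converges to the uniform distribution, which is also the principal eigenvector of the normalized link matrix.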
#codingexercise
Get K Top movies by ratings when their similarity graph is given
List<Movie> GetMovies(Movie movie, int k)
{
    var related = BFS(movie);
    var ret = related.OrderByDescending(x => x.GetRating()).ToList();
    if (ret.Count < k)
        return ret;
    return ret.GetRange(0, k);
}
If we were to get the movies by similarity distance alone, we could say : 


List<Movie> GetMoviesByDistance(Movie movie, int k)
{
    var connected = BFS(movie).ToList();
    if (connected.Count < k)
        return connected;
    return connected.GetRange(0, k);
}

Tuesday, May 23, 2017

Markov networks and probabilistic graphical models 
Probabilistic graphical models use graphs to represent relationships between random variables or between random variables and factors. Factors are compatibility functions, fewer in number than the variable set, that give more meaningful information about what a probability distribution represents. A factor graph is constructed as a bipartite graph between the random variables and the factors, with a partition function that normalizes the compatibility functions. 
A Markov network focuses only on the first part. It is constructed from all pairs of variables that share a local function, with the nodes as the random variables. Although the factors, and not the distributions, determine the connectivity in a Markov network, the connections represent the conditional independence relationships in a multivariate distribution. It is therefore simpler than a factor graph. 
When data is available and the random variables are defined, a Markov network can be used with the fit and predict methods.  The fit method takes the incoming data as training and learns the parameters from it. The predict method accepts the data as test and predicts all the missing variables using the model. 
Markov models make use of conditional random fields. These do not represent dependencies among the variables in a joint distribution. Instead they allow the factors to depend on the entire observation vector without breaking the linear graphical structure. The feature functions can then examine all the input variables together. The extension from linear chains to more general graphs follows from an expansion of the same relations. 
There are several graphical model libraries available to fit and predict probabilistic graphical models, but one package by the name of pgmpy is gaining popularity because it is implemented in Python. Pgmpy is introduced in reference 1 below. 
MicrosoftML packages in python do not seem to cover probabilistic graphical models. Instead they make trees, forests and ensembles available.  
References: 
  1. Pgmpy: Probabilistic Graphical Models using Python – Ankur Ankan and Abhinash Panda 
http://conference.scipy.org/proceedings/scipy2015/pdfs/ankur_ankan.pdf 
  2. An Introduction to Conditional Random Fields – Sutton and McCallum
  3. Turi documentation
#codingexercise
There are N gas stations along a circular route, where the amount of gas at station i is gas[i].
We have a car with an unlimited gas tank and it costs cost[i] of gas to travel from station i to its next station i+1. We can begin the journey with an empty tank at one of the gas stations. Return all the gas stations where we can travel around the circuit once.
        public List<int> CanCompleteCircuit(List<int> gas, List<int> cost)
        {
            var ret = new List<int>();
            for (int i = 0; i < gas.Count; i++)
            {
                // Start at station i with gas[i] in the tank, paying cost[i] to reach i+1.
                if (CanCompleteTour(gas, cost, i, gas[i] - cost[i], (i + 1) % gas.Count))
                {
                    ret.Add(i);
                    Console.WriteLine("can complete tour starting at {0}", i);
                }
            }
            return ret;
        }

        // reserve is the fuel remaining on arrival at station i.
        static bool CanCompleteTour(List<int> gas, List<int> cost, int start, int reserve, int i)
        {
            if (reserve < 0) return false;
            if (i == start) return true;
            return CanCompleteTour(gas, cost, start, reserve + gas[i] - cost[i], (i + 1) % gas.Count);
        }
The calculation can also be done in linear time: the running total of gas[i] - cost[i] does not depend on the starting point, so a tour is possible whenever the overall total is non-negative, and a valid start is the station just past the point where the running reserve is at its minimum.
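That linear-time observation can be sketched as a single pass; whenever the running tank goes negative, no station up to that point can be the start, so the candidate start moves just past it:

```python
def can_complete_circuit(gas, cost):
    # Single pass: track the overall total and a running tank from the
    # current candidate start; returns a valid start index or -1.
    total, tank, start = 0, 0, 0
    for i in range(len(gas)):
        diff = gas[i] - cost[i]
        total += diff
        tank += diff
        if tank < 0:
            # No station in [start, i] can begin the tour; try i + 1.
            start, tank = i + 1, 0
    return start if total >= 0 else -1
```

This finds one valid start in O(n), whereas the recursive check above tests every station in O(n^2).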