Wednesday, May 31, 2017

We continue our discussion of system design for the online store as mentioned in the previous post. We talk about securing data at rest and in transit. This is generally done in layers and with the principle of least privilege. For example, the lower levels of the software stack, such as the operating system, operate with elevated, internal account privileges, independent of the applications and services that run on them. The higher levels of the software stack may be external to the operating system and not require the same privileges as the lower levels. Their privileges may be secured with specific service accounts used to run on the lower levels. The same considerations for lower and higher levels on the system side apply to user mode. Code executing on behalf of a user must pass demarcated authentication and authorization lines of control before transitioning into an internal context for execution. Every user access and interface to the system must be secured. Role-based access control can be used to differentiate user access. This enables separation of concerns, such as between user and administrator work. It also facilitates specifying security policies, and changes to them, without affecting the data. Access and privilege to user data should be as granular as possible so that changes to one may not affect another. As long as the policies and the system can be separated, we can change the policies without affecting the system. This comes in handy when throttling and controlling access to a server that may be under duress from excessive and unwanted connections. Authentication and encryption protect data in transit, so we use them for all network connections. In the case of web traffic, we prefer https over http. But it is not just the higher-privileged user that we want to secure. Connections coming in with the lowest classification, such as the anonymous or guest user, must have the least privilege to run code anywhere in the system. This is particularly important in cloud computing where we rely on authentication tokens as a form of user credentials. If a token issuer malfunctions, it could allow an anonymous user the same privileges as a regular user. Therefore security reviews with Strengths-Weaknesses-Opportunities-Threats (SWOT) analysis need to be conducted on all flows - both control and data - through the system. A fail-fast mechanism is used to prevent unwanted access or execution. Consequently, most routines at every level and entry point check their parameters and validate their assumptions, as sketched after the list below. The Cloud Security Alliance announced ten major challenges in big data security and privacy and suggested best practices for their mitigation. These include:
1. Secure code  - check authentication, authorization, audit and security properties, mask personally identifiable information
2. Secure non-relational data stores - use fuzzy techniques for penetration testing.
3. Secure data storage and logs using Secure Untrusted Data Repository (SUNDR) methods
4. Secure devices - use conventional endpoint protection discussed earlier along with anomaly detection
5. Secure based on monitoring - This applies to cloud level, cluster level, application levels etc. Compliance standard violations should also be monitored.
6. Protecting Data privacy - Code does not differentiate which data to operate on if they are both valid. Policies are used instead. Similarly if data exists in two or more sources, they could be related by a malicious user.
7. Secure data with encryption - perform encryption without sharing keys
8. Secure with restrictions implemented by granular access control and federation.
9. Secure with granular audits - use a Secure Information and Event Management SIEM solution
10. Secure provenance data that describes how the data was derived even if it bloats size.
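As a minimal illustration of the fail-fast and role-based checks mentioned above - the Order type, OwnsOrder helper and ordersRepository are hypothetical placeholders - every entry point validates its inputs and its caller's privilege before doing any work:

public Order GetOrder(Guid orderId, IPrincipal caller)
{
    // fail fast on bad parameters before any work is done
    if (orderId == Guid.Empty)
        throw new ArgumentException("orderId must be a valid identifier", nameof(orderId));
    // demarcated authentication check before transitioning to internal context
    if (caller == null || caller.Identity == null || !caller.Identity.IsAuthenticated)
        throw new UnauthorizedAccessException("caller must be authenticated");
    // role-based access control: only administrators or the owning customer may read
    if (!caller.IsInRole("Administrator") && !OwnsOrder(caller, orderId))
        throw new UnauthorizedAccessException("caller may not read this order");
    return ordersRepository.Get(orderId); // hypothetical data access layer
}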


Tuesday, May 30, 2017

We continue our discussion of system design for the online store as mentioned here and here. We now discuss the data storage aspects across all services from the point of view of scalability. We assume the store will have a very large number of users at some point and plan accordingly.  The services will need to store large volumes of data. This data will be both user data and logs.  The user portal may be composed of data from many different data sources. Images and large static content will likely be served from storage that is optimized for blobs. Most of the per-user information is stored in sharded relational databases.  High-volume short text such as community feedback forums, social engagement transcripts, chats and messages, and ticket and case troubleshooting conversations will likely be stored in a large distributed key-value store. A conventional relational database may be used as a queuing system on top of this store. Almost all of this data still corresponds to individual users; it is the data generated by users.  Unlike user data, log data is generated by the system from the various operations of the services in the form of log events. Log events help with analytics. For example, the log events may be used for correlation and as feedback, which then leads to improvements in the operations of the services. This feedback-improvement virtuous cycle can go on and on regardless of which user is using the system. Log events translate to feedback only with analytics. For example, users may be shown the trending bestsellers or newcomers to the store. This may require correlation and collaborative filtering to provide a ranked list. Analytics come with beautiful charts. User and log data may also be used in many other ways. Data may appear in the form of feeds to users to improve the shopping experience around a product. Data may come in the form of recommendations, such as people who liked this also liked that. Data may be represented as a graph and used with search. Data may also be used for the integrity of the site.  Ads, reviews and insights may also appear as additional data while being separate and distinct in their purpose or usage. Data expands possibilities for the business and hence it eventually becomes the center of gravity. For example, logging may channel all logs to a central repository which may grow with time into a time series database on a dedicated cluster or a data warehouse. As data expands, scalability concerns grow. Systems may become mature, but when size grows, even architectures change. The embrace of Big Data over relational stores is a trend that comes directly from scalability. Developers may find it enticing to use SQL statements instead of map-reduce to get to the same result. Consequently they may require an additional stack over the data. Visualization will pull data and will also come with its own stack.  Data tools may evolve over the data stack. And the tools and the stack will both evolve to better suit scalability and functionality. The design discussion here is borrowed from what has already been shown to work in companies like Facebook that have grown significantly.
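As a rough sketch of the per-user sharding mentioned above (the shard count and connection strings are placeholders supplied by configuration), requests can be routed to a shard by hashing the user id:

// hypothetical shard router: map a user id onto one of N relational shards
public class ShardRouter
{
    private readonly string[] connectionStrings; // one connection string per shard

    public ShardRouter(string[] connectionStrings)
    {
        this.connectionStrings = connectionStrings;
    }

    public string GetShardFor(long userId)
    {
        // a stable modulus keeps a given user's data on the same shard
        int shard = (int)(userId % connectionStrings.Length);
        return connectionStrings[shard];
    }
}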
#codingexercise
input [2,3,1,4]
output [12,8,24,6]

Multiply all fields except the one at its own position.
List<int> GetSumProduct(List<int> A)
{
    // returns, for each position, the product of all the other elements (no zeroes allowed)
    Debug.Assert(A.Any(x => x == 0) == false);
    var product = 1;
    A.ForEach(x => { product *= x; });
    var ret = new List<int>();
    A.ForEach(x => { ret.Add(product / x); });
    return ret;
}
If we were to avoid division, we could multiply every entry other than the current one in each iteration, which is quadratic. If we were to make it linear and still avoid division, we would keep track of front and rear running products in separate passes and combine them for the results, starting each running product at 1, as sketched below.
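A sketch of that linear, division-free variant over the same input list:

List<int> GetProductExceptSelf(List<int> A)
{
    var ret = new List<int>(new int[A.Count]);
    int front = 1;
    for (int i = 0; i < A.Count; i++)      // product of everything before position i
    {
        ret[i] = front;
        front *= A[i];
    }
    int rear = 1;
    for (int i = A.Count - 1; i >= 0; i--) // multiply in the product of everything after i
    {
        ret[i] *= rear;
        rear *= A[i];
    }
    return ret;                            // [2,3,1,4] yields [12,8,24,6]
}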

Monday, May 29, 2017

Today we continue the system design discussion on the online retail store and look at some of the services in detail. For example, the card payment service is a popular service since all the transactions are billed to the card. Card numbers are not stored as is because they are sensitive information. Instead, they are encrypted and decrypted. The Encrypt function encrypts str using pass_str as the password and gives a result that is a binary string of the same length as str.  The Decrypt function decrypts the encrypted string crypt_str using pass_str as the password.  The card information is associated with the customer, which is usually stored in the accounts table. Charges to the card are facilitated through programmatic access to a third party bank service that enables posting a charge to the card.  Cards can also be specific to the store, and these may also have monetary values, in which case the numbers for the card are generated randomly. These special cards are stored with their balance. Order processors are usually implemented as message queue processors which are written specifically to handle different kinds of transactions based on the incoming messages. The message queue may be its own cluster with journaling and retries specified.  The performance of the message processor must be near real time and support millions of transactions a day.  It provides an async mechanism to deduct from the card or to reload the card. Point of sale registers send transaction requests to the order processors through API services. The orders flowing through the processors on fulfilment automatically update the account and the associated card, which is read by the application and the point of sale register. There are some interesting edge cases to be considered for the spend and reload activities on the card so that the updated balance remains accurate. Usually the payment activity on the card is the slowest transaction, as it is routed through the third party vendor and its errors are handled with user interaction, but this is generally overcome by keeping it as the last step of the user interaction. Many orders are executed within a transaction scope, but with states being maintained that move progressively forward as initiated, processed, and completed or failed, the operations can be structured for retries and re-entrancy. The purchases from the store are added to the account holder's history with the help of a transactions table. Much of the online shopping experience continues to be driven by a simple relational database as the choice of data platform. The adoption of cloud services has increased the use of such data storage options in the cloud. Accounts data enables the authentication module to keep the user signed in across different services. This is usually achieved with some form of API authentication and a portal membership provider that facilitates a session for the account holder.  Popular mechanisms include OAuth, JWT and others.  Tokens are issued on successful authentication and are refreshed on expiry.  The transactions with the store are important for customer satisfaction. A variety of machine learning techniques such as grouping and collaborative filtering are used to make recommendations to the users in the form of customers who purchased this also purchased those. 
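A minimal sketch of those progressive order states (the Order type and its persistence are hypothetical; the state names mirror the ones above) shows why a retried message stays safe - each step only moves the state forward and terminal states never move:

public enum OrderState { Initiated, Processed, Completed, Failed }

public bool TryAdvance(Order order, OrderState next)
{
    if (order.State == OrderState.Completed || order.State == OrderState.Failed)
        return order.State == next;   // terminal states never move; a retry is a no-op
    if (next <= order.State)
        return true;                  // duplicate or stale message: already advanced
    order.State = next;               // persist within the enclosing transaction scope
    return true;
}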
#codingexercise
Calculate the H-index of a sorted array. 
This is the largest value h such that at least h of the entries are greater than or equal to h.
int GetHIndex(int[] A)
{
    var t = new int[A.Length + 1];
    for (int i = 0; i < A.Length; i++)
        t[Math.Min(A.Length, A[i])]++;   // bucket counts, clamped at A.Length
    int sum = 0;
    for (int i = t.Length - 1; i >= 0; i--)
    {
        sum += t[i];                     // sum is the number of values >= i
        if (sum >= i)
            return i;
    }
    return -1;
}

Sunday, May 28, 2017

#codingexercise
Given a room and a robot that can move in one of four directions, find the size of the room. 
int GetSize(int x, int y) // (x, y) is the robot's current position

// if one of the directions faces the wall, use it as the start
// else walk in one of the four directions that is available
// or start from the top-left and walk along the boundary of the room
// noting the four corners
// use them to calculate width and length
// Move(i, j, dir) is assumed to return true when the robot can step from (i, j)
// in the given direction: 0 = left, 1 = up, 2 = right, 3 = down
{
    // walk left until we face a wall; that point lies on the boundary
    int i = x;
    int j = y;
    while (Move(i, j, 0))
        i--;
    int wallstartx = i;
    int wallstarty = j;

    // from the wall position, run along the boundary noting the corners
    while (Move(i, j, 1)) j++;                // up to the top-left corner
    int topleftx = i, toplefty = j;

    while (Move(i, j, 2)) i++;                // right to the top-right corner
    int toprightx = i, toprighty = j;

    while (Move(i, j, 3)) j--;                // down to the bottom-right corner
    int bottomrightx = i, bottomrighty = j;

    while (Move(i, j, 0)) i--;                // left back to the bottom-left corner
    int bottomleftx = i, bottomlefty = j;

    Debug.Assert(bottomleftx == wallstartx);  // we are back on the starting wall

    int width = Math.Abs(bottomrightx - bottomleftx);
    int height = Math.Abs(toplefty - bottomlefty);
    return width * height;
}


If the room were irregular and not a rectangle, we would cover each coordinate horizontally and count each position occupied as a unit square; when we cover all the squares in one row, we proceed to the next. In this case, we could always try to begin from the top left. 
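A rough sketch of that counting approach, assuming a hypothetical CanOccupy(i, j) test that reports whether a coordinate lies inside the room, counts each reachable unit square exactly once:

int GetArea(int x, int y)
{
    var visited = new HashSet<(int, int)>();
    var stack = new Stack<(int, int)>();
    stack.Push((x, y));
    while (stack.Count > 0)
    {
        var (i, j) = stack.Pop();
        if (visited.Contains((i, j)) || !CanOccupy(i, j)) continue;
        visited.Add((i, j));
        stack.Push((i - 1, j));  // left
        stack.Push((i + 1, j));  // right
        stack.Push((i, j - 1));  // down
        stack.Push((i, j + 1));  // up
    }
    return visited.Count;        // each visited coordinate is one unit square
}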

Saturday, May 27, 2017

Today we continue the system design discussion on the online retail store 
We talk about API services from the store. We already saw that we organized each of the store activities in its own service. Consequently the store API is a collection of APIs from each of the services listed earlier. The APIs may be offered through public cloud endpoints and monitored with CloudWatch-like services. Previously there was an appeal to use private datacenter services with the API hosted behind the firewall along with caching. This was relevant for Master Data Management (MDM) services, where it was important for the company to safeguard its master data and organize it behind its own services. This also facilitated proxy services that can collect statistics and troubleshooting information. The public cloud offerings are equally capable if not better. Besides, public APIs are easily accessible by both end user desktops and their mobile devices. Native applications for the mobile devices can be developed independently from the site core. Caching services from the public cloud can not only facilitate easier content retrieval but also store full business objects. There is a perception that cloud based services suffer from networking delays, but the performance is good, with abilities to re-route and tunnel. Similarly, privacy and security for data at rest and in transit are improved with established standards. From allowing the web application access back into the company's VPN all the way to hosting each and every service on the public cloud, an organization may choose to have any degree of presence in the cloud. 
For a business owner considering an online store, there are many options. First, there are online shopping stores that allow joining as a partner on seller forums. Second, there are vendors that provide seller services as a package with customizations for the company. Third, the company may choose to implement its own services.
The following are standard inference algorithms for probabilistic graphical models: 
  1. Sum-product message passing algorithm, also called belief propagation 
  2. Variable elimination method. 
  3. Junction-tree algorithm, which is belief propagation applied on a modified graph guaranteed to be a tree. 
The first method is used as the forward-backward algorithm for computing marginal distributions and as the Viterbi algorithm for the most probable assignments.  
The sum-product message passing calculates marginal distributions for each unobserved node, conditional on the observed nodes.  The marginal distribution of a single random variable is the summation of the mass function over all other variables. The mass function is merely the enumeration of probabilities of different states. For example, the mass function for a fair coin toss assigns one half to each of the two outcomes and zero to everything else. 
This algorithm is best illustrated with a factor graph where the variables are on one side and the factors are on the other side. A function associating a variable node 'v' to a factor node 'a', and vice versa, is called a message. A message is a way of saying that a sender variable thinks that another variable belongs in these states with various likelihoods. A message from v to a is the product of messages from all other neighboring factor nodes except the recipient. A message from a to v is the product of the factor with messages from all other nodes, marginalized over all variables except the one associated with v. The complete marginalization is reduced to a sum of the products of simpler terms than the ones appearing in the full joint distribution. 
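In standard sum-product notation, writing N(·) for graph neighbors and f_a for the factor at node a, the two message updates and the resulting belief are:

\mu_{v \to a}(x_v) = \prod_{a' \in N(v) \setminus \{a\}} \mu_{a' \to v}(x_v)

\mu_{a \to v}(x_v) = \sum_{\mathbf{x}_a : x_v \text{ fixed}} f_a(\mathbf{x}_a) \prod_{v' \in N(a) \setminus \{v\}} \mu_{v' \to a}(x_{v'})

b_v(x_v) \propto \prod_{a \in N(v)} \mu_{a \to v}(x_v)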
Upon convergence, the estimated marginal distribution of each node is proportional to the product of all messages from adjoining factors. The estimated joint probability distribution is proportional to the product of the factor and the messages from the variables. Since the marginal distribution of each variable node is called belief, this method determines the beliefs of each variable. Viterbi algorithm is a special case of max product or min sum algorithm that is used for most probable explanation.  
Variable Elimination is used to estimate the marginal distribution over a subset of variables.  
Courtesy: wikipedia 
#codingexercise
Given a tree of boolean values where the nodes result from logical and operation of the children, fix the tree when a given leaf is inverted.
// flip the given leaf and recompute the AND values up the tree
void Invert(Node root, Node leaf)
{
    if (leaf.left != null || leaf.right != null) return; // only leaves may be inverted
    leaf.data = !leaf.data;
    Fix(root);
}
void Fix(Node root)
{
    if (root.left != null && root.right != null)
    {
        Fix(root.left);
        Fix(root.right);
        root.data = root.left.data && root.right.data;
    }
}


Friday, May 26, 2017

Today we discuss another design question - this time for an online retail store 
An online retail store allows customers to have a great shopping experience.  They are able to browse the inventory, select items and their quantities for purchase, put them in a shopping cart and check out with one of their payment methods. The orders they place are then fulfilled and shipped to their address. In the process they collect rewards and participate in various campaigns, loyalty programs and surveys. This shopping is enabled on a variety of devices such as desktop and handheld devices. The primary use case is selecting an item and purchasing it on the desktop. We can estimate the traffic to the website as over a million transactions a day under peak load or on anniversaries, and primarily in one geographic region. The number of web requests is orders of magnitude more than the transactions as customers browse the site for items. The store must have high availability and performance criteria established with Service Level Agreements.
Clearly we have a combination of many independent services to power the shopping experience. As the store grows its customer base, the demands on the system will be expected to increase over time. Consequently each component and service must scale, and the more decoupled they are, the easier it is to make changes or upgrade with little or no disruption. Consequently a service oriented architecture, or better yet a microservices architecture, helps with such a modular organization. Some of the services involved are listed below. The user interface, the services and the storage layer can now be focused and specific to each service while the overall website provides a unified experience over these services. The website issues requests to each of the services and is also organized as a consolidation over these services. The site itself is designed as a Model View Controller application with an object relational mapping, providing standard enterprise application blocks such as logging, caching, exception handling etc.  The configuration and resources are also maintained in separate repositories. The API services also allow mobile optimized websites, and the services themselves can be externally monitored and metered via an application gateway.
 Some of these services include:
1) Authentication services : provides access tokens based on one or more methods to acquire them, relying on
a) password b) client id and user id c) authentication code, d) client credentials. Refresh tokens are also provided to keep the customers signed in
2) Loyalty service : The loyalty service provides licensees with a way to integrate their loyalty systems
3) Account service: such as to create US Store account and non-US Store account, to validate and recover credentials, to create account profiles and to update social profile data.
4) Card service: Provides customer's card information to keep track of their purchases. This gives information such as Card URL, Card details, Card balances, and the ability to reload a store card.
5) Payment methods : Enables collection of sums from the customer using a payment method such as a credit card or PayPal
6) Rewards service: provides rewards summary and history information based on application user, points needed for next level, points needed for next free reward, points needed for re-evaluation etc.
7) Barcode service: The barcode service generates Barcodes to communicate with in-store scanners and applications.
8) Content service: These are used to retrieve localized contents for any web client. Content is stored and retrieved in site core.
9) Transaction History service: These are used to retrieve the history data for a specified user, for all cards.
10) Device Relationship service: These are used to insert or update a device, to get reporting actions such as touch-to-pay to display the barcode for scanning and touch-when-done to report when completing the barcode for scanning.
11) Product service : to browse the offerings of the store. This may become a full fledged MDM with terabytes of data.
#codingexercise
Enter a two digit number, find the year closest to this year. 
Example input 99, return 1999; 
Input 15, return 2015. 
public int GetClosestYear(int y, int current2DigitYear, int pivot = 50)
{
    if (y < pivot + current2DigitYear % 100)
        return 2000 + y;
    else
        return 1900 + y;
}
We can also work this out by taking the remainder from 100 to see whether the offset needs to be 1900 or 2000.
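An equivalent way to read the problem (a sketch, assuming the full current year is available) is to pick whichever of 19xx or 20xx lies closer to the current year:

public int GetClosestYearAlt(int y, int currentYear)
{
    int old = 1900 + y;      // e.g. 1999 for input 99
    int recent = 2000 + y;   // e.g. 2015 for input 15
    return Math.Abs(recent - currentYear) <= Math.Abs(old - currentYear) ? recent : old;
}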

Thursday, May 25, 2017

For the next few days, I will write about System Design problems which are increasingly becoming popular in interviews. These problems delve into planning, design, architecture and deployment considerations and are not as focused as the coding exercises. Let us take the example of URL shortening service.
The use cases for shortening service include the following:
shortening - take a url and return a much smaller url generally with an endpoint and a hash
redirection - take a short url and redirect to the original url.
With the primary use cases, we can start the design. Although it may not be necessary to include every use case in the design, we list some others: custom urls, analytics, automatic link expiration, manual link removal, and interface options such as UI, API and command line.
The use cases help define the problem we are trying to solve. Spending some time up front on the use cases, helps us plan for the next steps.
Any system can solve a given use case but the design gets more specific as we introduce constraints in the system.
These constraints include high availability with tight Service Level Agreements. There may be 100 million new urls per month with 1 billion requests per month. The traffic works out to around 400 requests per second. 99% of these requests need to be responded to. 10% of the requests may be for shortening and the other 90% may be for redirection. Adding them up, we get over 6 billion urls in 5 years.
With this estimation, we proceed to the storage and processing requirements for the design.  Each URL may take say 1000 bytes and its hash may take twelve bytes for unicode.  This may result in over 6 TBs for all urls.
The storage acts like a huge hash table which stores new mappings and retrieves a url for a given hash.
The hash itself is computed as the last six characters of the base64 of the md5 of the original url taken with random salt.
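A small sketch of that hashing step (the salt handling and collision checks are simplified, and the helper name is made up):

using System;
using System.Security.Cryptography;
using System.Text;

static string ShortKey(string url)
{
    // salt the url, hash it, and keep the last six base64 characters as the short key;
    // a real service would also detect and resolve collisions against the store
    var salt = Guid.NewGuid().ToString("N");
    using (var md5 = MD5.Create())
    {
        var digest = md5.ComputeHash(Encoding.UTF8.GetBytes(url + salt));
        var encoded = Convert.ToBase64String(digest).TrimEnd('=');
        return encoded.Substring(encoded.Length - 6);
    }
}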
In order to scale the shortening and the redirection, we can create separate services for each so that both can scale independently. Each service may be backed by several identical web servers behind a load balancer. The data may live in a common data store that keeps all the state. Databases can easily scale to the data we have estimated. Caches can improve response times, and the most used redirections can easily be served from cache. Other bottlenecks may similarly be improved.
courtesy: HiredInTech
#codingexercise
Shuffle:
        // swaps each element in the back half with a random element from the front half
        // (a partial shuffle; a full Fisher-Yates pass would visit every index)
        static void shuffle(ref List<int> A)
        {
            Random randObj = new Random();
            for (int i = A.Count / 2; i < A.Count; i++)
            {
                var j = randObj.Next(A.Count / 2);
                int temp = A[i];
                A[i] = A[j];
                A[j] = temp;
            }
        }
Determine a negative-weight cycle in a graph:
Bellman-Ford(G, w, s)
Initialize-single-source(G,s)
for i = 1 to number of vertices -1
    for each edge (u,v) belongs to edges E in G
          relax(u,v,w)
for each edge(u,v) belongs to edges E in G
     if (v.d  > u.d  + w(u,v))
         return False
return True
A topological sort can also detect cycles, but Bellman-Ford additionally handles weighted graphs with negative edges.

Wednesday, May 24, 2017

Markov networks and probabilistic graphical models continued
Inference is the step performed both during training and when predicting the labels on new inputs. There are two inference problems: 1) during training, computing the marginal distributions over subsets of labels, such as node marginals and edge marginals, and 2) during prediction, computing the most probable assignment of labels. Both share the same underlying computations, except that maximization in the second replaces the summation in the first. These inference tasks are performed by well known algorithms such as the forward-backward algorithm for computing marginal distributions and the Viterbi algorithm for computing the most probable assignment. These algorithms are special cases of general belief propagation algorithms for tree-structured graphical models. For more complex models, approximate inference is necessary. Any inference algorithm for a graphical model is applicable to a CRF, with the consideration that it will be called repeatedly during parameter estimation.
Pgmpy provides a way to convert Markov Networks to Bayesian graph models involving both random variables and factors. The reverse conversion is also facilitated. Pgmpy can be extended to add more post-inference methods based on conventional graph operations. These include connected components, degree counting, graph coloring, k-core, label-propagation, pagerank, shortest path, and triangle counting.  
Out of these methods, pagerank is perhaps used most often to selectively filter the results. Since we already calculate the marginals, both for the nodes and for the edges, we already have a graph on which to calculate the page rank.  PageRank can be expressed as a sum of two components. The first component is a damping term. The second component is a summation over the page ranks of the adjacent vertices, each weighted by the inverse of the out-degree of that vertex. This is said to correspond to the principal eigenvector of the normalized link matrix. 
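In the usual notation, with damping factor d, N vertices, In(v) for the vertices linking to v, and outdeg(u) for the out-degree of u, this reads:

PR(v) = \frac{1 - d}{N} + d \sum_{u \in In(v)} \frac{PR(u)}{outdeg(u)}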
Centrality can also be measured in terms of pagerank once we have a connectivity matrix based on the intra vector cosine similarity. 
#codingexercise
Get the top k movies by rating when their similarity graph is given
List<Movie> GetMovies(Movie movie, int k)
{
    var related = BFS(movie);
    var ret = related.OrderByDescending(x => x.getRating()).ToList();
    if (ret.Count < k)
        return ret;
    else
        return ret.GetRange(0, k);
}
If we were to get the movies by similarity distance alone, we could say : 


List<Movie> GetMoviesByDistance(Movie movie, int k)
{
    var connected = BFS(movie);
    var ret = connected.ToList();
    if (ret.Count < k)
    {
        return ret;
    }
    else
    {
        return ret.GetRange(0, k);
    }
}

Tuesday, May 23, 2017

Markov networks and probabilistic graphical models 
Probabilistic graphical models use graphs to represent relationships between random variables, or relationships between random variables and factors. Factors are compatibility functions which are smaller in number than the variable set and give more meaningful information about what a probability distribution represents. A factor graph is constructed as a bipartite graph between the random variables and the factors, together with a partition function that normalizes the compatibility functions. 
A Markov network focuses only on the first part. It is constructed with the random variables as nodes and an edge between every pair of variables that share a local function. Although factors, and not the distributions, determine the connectivity in the Markov network, the connections in a Markov network represent the conditional independence relationships in a multivariate distribution. It is therefore simpler than a factor graph. 
When data is available and the random variables are defined, a Markov network can be used with the fit and predict methods.  The fit method takes the incoming data as training and learns the parameters from it. The predict method accepts the data as test and predicts all the missing variables using the model. 
Markov models make use of conditional random fields. They do not represent dependencies between the variables in a joint distribution. Instead they allow the factors to depend on the entire observation vector without breaking the linear graphical structure. The feature functions can then be examined with all the input variables together. The extension from linear chains to more general graphs follows an expansion of the same relations. 
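For a linear-chain CRF this takes the familiar form, with feature functions f_k weighted by parameters \theta_k and a per-input normalizer Z(x):

p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \exp\left( \sum_{k} \theta_k f_k(y_t, y_{t-1}, x_t) \right)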
There are several graphical model libraries available to fit and predict probabilistic graphical models, but one package by the name of pgmpy is gaining popularity because it is implemented in Python. Pgmpy is cited with an introduction in reference 1 below. 
MicrosoftML packages in python do not seem to cover probabilistic graphical models. Instead they make trees, forests and ensembles available.  
Pgmpy provides a way to convert Markov Networks to Bayesian graph models involving both random variables and factors. The reverse conversion is also facilitated. Pgmpy can be extended to add more post-inference methods based on conventional graph operations. These include connected components, degree counting, graph coloring, k-core, label-propagation, pagerank, shortest path, and triangle counting.  
References: 
  1. Pgmpy: Probabilistic Graphical Models using Python – Ankur Ankan, Abhinash Panda 
http://conference.scipy.org/proceedings/scipy2015/pdfs/ankur_ankan.pdf 
  2. An introduction to Conditional Random Fields - Sutton and McCallum
  3. Turi documentation
#codingexercise
There are N gas stations along a circular route, where the amount of gas at station i is gas[i].
We have a car with an unlimited gas tank and it costs cost[i] of gas to travel from station i to its next station i+1. We can begin the journey with an empty tank at one of the gas stations. Return all the gas stations where we can travel around the circuit once.
        public List<int> canCompleteCircuit(List<int>gas, List<int> cost)
        {
            var ret = new List<int>();

            for (int i = 0; i < gas.Count; i++)
            {
                int reserve = gas[i] - cost[i];   // pay for the first leg up front
                if (reserve >= 0 && CanCompleteTour(gas, cost, i, reserve, (i+1)%gas.Count))
                 {
                    ret.Add(i);
                    Console.WriteLine("can complete Tour starting at {0}", i);
                }
                Console.WriteLine();
            }
           return ret;
        }

        static bool CanCompleteTour(List<int> gas, List<int> cost, int start, int reserve, int i)
        {
            if (i == start) return true;
            if (reserve + gas[i] < cost[i]) return false;
            Console.WriteLine("Reserve={0} at {1}", reserve + gas[i] - cost[i], i);
            return CanCompleteTour(gas, cost, start, reserve+gas[i]-cost[i], (i+1)%gas.Count);
        }
The calculation can also be made linear: because the per-station surplus gas[i]-cost[i] does not change with the starting point, one pass can find where the running reserve reaches its minimum; if the total surplus is non-negative, the station just after that minimum is a valid start.
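A sketch of that linear variant over the same gas and cost lists:

        public int CanCompleteCircuitLinear(List<int> gas, List<int> cost)
        {
            int total = 0, running = 0, start = 0;
            for (int i = 0; i < gas.Count; i++)
            {
                int surplus = gas[i] - cost[i];
                total += surplus;
                running += surplus;
                if (running < 0)
                {
                    // cannot reach station i+1 from the current start; try starting after i
                    start = i + 1;
                    running = 0;
                }
            }
            return total >= 0 ? start : -1;   // -1 when no start completes the circuit
        }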