Sunday, June 4, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources; for example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued by looking at some more specialized infrastructure, including dedicated private cloud. Then we added serverless computing to the mix. Today we continue that discussion.
The serverless architecture may be standalone or distributed. In both cases, it remains an event-action platform that executes code in response to events. In the latter case, it can be offered as a managed service on IBM Bluemix. The console for this service gives a preview of all the features of OpenWhisk. We can execute code written as functions in many different languages; Bluemix takes care of launching the functions in its own containers. Because this execution is asynchronous to the frontend and backend, they need not perform continuous polling, which helps them be more scalable and resilient. OpenWhisk introduces an event programming model where charges accrue only for what is used. Moreover, it scales on a per-request basis. Together, these three features of serverless deployment, granular pricing and per-request scaling make OpenWhisk an appealing event-driven framework. Even the programming model is improved: developers only need to focus on triggers, rules and actions. Invocations can be blocking, non-blocking or periodic; different languages are supported; and it allows parameter binding, chaining and debugging. Both the engine and the interface are open source and implemented in Scala. In fact, it goes further than PaaS because not only the runtime but also the deployment is managed.
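As a quick sketch of the triggers-rules-actions model using the OpenWhisk wsk CLI (the file and entity names here, hello.js, locationUpdate and myRule, are hypothetical placeholders):

# create an action from a JavaScript function file (hypothetical hello.js)
wsk action create hello hello.js
# invoke it synchronously (blocking) and print just the result
wsk action invoke hello --blocking --result
# create a trigger, then a rule binding the trigger to the action
wsk trigger create locationUpdate
wsk rule create myRule locationUpdate hello
# firing the trigger now causes the rule to run the action
wsk trigger fire locationUpdate --param name Dorothy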
All requests pass through an API gateway because it facilitates security, control, mediation, parameter mapping and schema validation, and it supports different verbs. Both routes and actions can be defined. A CLI, UI and API are also available.

Internally, OpenWhisk uses a database to store actions, parameters and targets. This database can be standalone or distributed, usually CouchDB or Cloudant respectively.
It uses a message bus such as Kafka to enable interactions between load balancers, activators and invokers. Activators process events produced by triggers; an activator can call all the actions bound by a rule to a particular trigger. Invokers perform actions against a pool of containers that are kept warm.
#codingexercise
Reverse a linked list in O(1) storage
Test cases: null, 1, 1-2, 1-2-3
class Node { public int data; public Node next; }

void Reverse(ref Node head)
{
    Node current = head;
    Node prev = null;
    while (current != null)
    {
        // detach current and point it at the already-reversed prefix
        var next = current.next;
        current.next = prev;
        prev = current;
        current = next;
    }
    head = prev;
}
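A quick usage sketch for the 1-2-3 test case above, assuming the minimal Node class shown:

var head = new Node { data = 1, next = new Node { data = 2, next = new Node { data = 3 } } };
Reverse(ref head);
// head now walks 3 -> 2 -> 1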

Saturday, June 3, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources; for example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued by looking at some more specialized infrastructure, including dedicated private cloud. Today we add serverless computing to the mix.
Serverless computing covers both Backend as a Service and Function as a Service. A Backend as a Service is one that allows a portion of the backend activities to be expressed as functions that execute elsewhere on containers. These can be both synchronous and asynchronous. This allows the backend services to be lighter. It can involve any number of operations, at whatever granularity is appropriate. A Function as a Service allows fat clients and single-page applications to be lighter, since they don't necessarily have to do all the computations in one page; the work can be broken down into functions that evaluate elsewhere. In a model-view-controller architecture we had multi-page applications, but these functions enable highly efficient and rich single-page applications.
In the case of the store, the functions may be listed as follows:
1) Authentication function - Most of the authentication mechanism is universal and consolidated for the application, allowing users to sign in via membership providers. It can be offloaded to its own function instead of being performed by the same server that responds to the shopping experience.
2) The database access - Much of the data is relational and requires the same kinds of data translations, but these need not be done in the server and can be made fine-grained.
3) MVC becomes a single page again. As discussed, the front end allows single-page applications to be composed of hundreds of smaller functions when appropriate.
4) Search function - Some of the methods, such as search, are orthogonal to the shopping experience but equally important. Therefore they can be offloaded.
5) Purchase function - This is probably the most used function, but it is almost independent of the user or the products since it involves card activity. Consequently it is a separate function in itself (a sketch follows below).
Not just functions but messages can also be translated. Functions can be queued with Kafka.
Thus functions instead of modules become an appealing design.
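As a small illustration of item 5 above, here is a minimal C# sketch of a purchase operation factored out as a standalone function. All of the names here (PurchaseRequest, PurchaseResult, ChargeCard) are hypothetical placeholders rather than any particular FaaS API.

enum PurchaseResult { Invalid, Completed, Failed }

class PurchaseRequest
{
    public string CardToken;
    public decimal Amount;
}

static class PurchaseFunction
{
    // independent of user and product state; it only performs card activity
    public static PurchaseResult Run(PurchaseRequest req)
    {
        if (req == null || string.IsNullOrEmpty(req.CardToken) || req.Amount <= 0)
            return PurchaseResult.Invalid; // fail fast on bad input
        bool charged = ChargeCard(req.CardToken, req.Amount); // delegate to the payment provider
        return charged ? PurchaseResult.Completed : PurchaseResult.Failed;
    }

    // stub standing in for the third-party charge call
    static bool ChargeCard(string token, decimal amount) { return true; }
}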
#codingexercise
int IndexOfUnusedInUnordered(List<int> unordered, int item, ref Dictionary<int, int> h, ref bool[] used)
{
    Debug.Assert(h.ContainsKey(item));
    int index = unordered.IndexOf(item, 0);
    while (index != -1)
    {
        if (used[index] != true)
        {
            used[index] = true; // claim this occurrence
            h[item] -= 1;       // one fewer occurrence left to place
            break;
        }
        // this occurrence was already claimed; look for the next one
        index = unordered.IndexOf(item, index + 1);
    }
    return index;
}
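For example, with unordered = [1, 2, 1], h = {1: 2, 2: 1} and no indices used yet, a first call with item = 1 returns index 0, and a second call returns index 2 because index 0 is now marked used.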

Friday, June 2, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources; for example, Spark and Hadoop can be offered as a fully managed cloud offering. Today we continue to look at some more specialized infrastructure.
The EMC dedicated cloud is an on-demand dedicated cloud, managed completely by Dell EMC and run on dedicated servers in Virtustream datacenters. Virtustream is a storage cloud that can secure, manage and store data at scale. The EMC dedicated cloud is a service offering in which Dell EMC runs one or more instances of Elastic Cloud Storage on a single-tenant basis from Virtustream datacenters. This can be used in one of two ways: in a hybrid fashion, where some sites run on premise and others from Virtustream, or with all of the sites running from Virtustream. This finds appeal among users because on-premise data is considered by some to be management overhead. At the same time, others like the agility of public cloud computing but want complete control of their data, as if it were running on premise.
This is a significant win for customers who have their own patching and management routines, because they now don't have to suffer any downtime or other costs as their applications and services work seamlessly from Virtustream. At the same time, all aspects of the dedicated cloud, such as networking, storage and compute, are offered with no sharing, so users can feel confident that they are not at risk of letting others see their data and apps. Moreover, the dedicated cloud brings some unparalleled capabilities. For example, customers can host one site in one datacenter and others in another datacenter, both from Virtustream.
Where does this fit in?
Customers recognize the success of public clouds, where someone else manages their services. Customers also recognize the success of private clouds, because they often have a whole lot of non-12-factor apps and because the private cloud keeps the lights on. The missing piece was the managed services capability at scale. In other words, users want the Federation Enterprise Hybrid Cloud and they want it to support enterprise workloads, but they just don't want to run it themselves. This is where this offering fits in. In fact, Virtustream is considered the leader in hosted private cloud solutions.
Scale and hosting are recognized as important challenges not only for the customer but also for the cloud provider. Take the case of Midfin Systems, which offers an intelligent software solution that powers the limitless datacenter. It stitches together storage, compute and network from different datacenters into a single unified fabric. This enables a whole range of use cases, including centralized and unified management of remote locations.
#codingexercise
Divide a list of numbers into groups of consecutive numbers but preserve their original order
Input: 8,2,4,7,1,0,3,6
Output: 2,4,1,0,3 and 8,7,6
List<List<int>> GetContiguousGroups(List<int> unordered)
{
    var ret = new List<List<int>>();
    var ordered = unordered.ToList(); // copy, so the original order is preserved
    ordered.Sort();
    var used = new bool[unordered.Count]; // all false initially
    var groups = GetGroups(ordered);
    foreach (var group in groups)
    {
        var h = new Dictionary<int, int>(); // value -> remaining repetitions
        foreach (int i in group)
        {
            if (h.ContainsKey(i)) h[i] += 1; else h[i] = 1;
        }
        var picks = new List<int>(); // original indices claimed for this group
        foreach (var item in h.Keys.ToList())
        {
            int count = h[item];
            for (int i = 0; i < count; i++)
            {
                // the helper marks the index used and decrements h[item]
                picks.Add(IndexOfUnusedInUnordered(unordered, item, ref h, ref used));
            }
        }
        picks.Sort(); // restore the original input order within the group
        ret.Add(picks.Select(index => unordered[index]).ToList());
    }
    return ret;
}
List<List<int>> GetGroups(List<int> ordered)
{
    var groups = new List<List<int>>();
    var ret = new List<int>();
    if (ordered == null || ordered.Count == 0) return groups;
    ret.Add(ordered[0]);
    for (int i = 1; i < ordered.Count; i++)
    {
        // a gap ends the current run; duplicates stay in the same run
        if (ordered[i] != ordered[i-1] + 1 && ordered[i] != ordered[i-1])
        {
            groups.Add(ret);
            ret = new List<int>();
        }
        ret.Add(ordered[i]);
    }
    groups.Add(ret);
    return groups;
}
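To trace the example: sorting 8,2,4,7,1,0,3,6 gives 0,1,2,3,4,6,7,8, which splits into the consecutive runs 0-4 and 6-8; picking each run's members in their original input order yields 2,4,1,0,3 and 8,7,6.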

Thursday, June 1, 2017

We talked about data infrastructure and data security in a system design case in previous posts. Today we look at a very specialized but increasingly popular analytics framework to use with data sources. I'm bringing up Apache Spark in this regard. Apache Spark is a fast and general engine for large-scale data processing. Public clouds offer fully managed Apache Hadoop with an analytics cluster such as Spark or MapReduce and R-Server. However, this document brings out the benefits of provisioning managed Spark services together with different kinds of data storage.
Apache Spark is similar to MapReduce, but major public clouds offer Spark with Hadoop. Apache Spark utilizes in-memory caching and optimized execution for fast performance, and it supports general batch processing, streaming analytics, machine learning, graph databases and ad hoc queries. This shows that, as an analytics cluster, Spark can handle diverse workflows including stream processing, graph processing, SQL querying and machine learning operations. Spark does not have a file management system, so it is integrated with Hadoop or another cloud-based data platform such as S3 or shared volumes. Since it can run on Mesos, we can perform analysis against all the data that applications and services host on Mesos.
This variety of data sources, including the Mesos stack, and their availability in existing deployments make Spark appealing to use for analytics. Combined with in-memory processing and a variety of analytics, this makes the technique very popular.
The in-memory computations are made possible with Resilient Distributed Datasets (RDDs), a memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. Zaharia et al. described their RDDs as a way to provide a restricted form of shared memory that is based on coarse-grained transformations rather than fine-grained updates to shared datasets. This enables iterative algorithms and interactive data mining tools. By allowing Spark to be used with different data sources and with computations all in memory, the speedup is much greater than with traditional MapReduce clusters.
Data transformations and automatic indexing also enable better use of the managed Spark instance. Since applications and services don't always use a shared repository or file share and often maintain their own local data, it would help to have that data accessible where users can reach it with Spark, or to have the data flow into the store used with Spark.
A managed instance of Spark also enables federation of data sources for analytics. This lets analysis be performed independent of who owns the data sources.
#codingexercise
Given an unordered array of positive integers, create an algorithm that rearranges it so that no run of identical integers is longer than M.
Input: 2,1,1,1,3,4,4,4,5 
Output: 2,1,1,3,1,4,4,5,4 
List<int> GetSelectedGroups(List<int> A, int M)
{
    var t = new SortedDictionary<int, int>(); // value -> remaining count
    for (int i = 0; i < A.Count; i++)
    {
        if (t.ContainsKey(A[i]))
            t[A[i]]++;
        else
            t.Add(A[i], 1);
    }
    var ret = new List<int>();
    for (int i = 0; i < A.Count; i++)
    {
        if (t[A[i]] > 0 && (ret.Count == 0 || ret.Last() != A[i]))
        {
            int min = Math.Min(M, t[A[i]]); // emit at most M in a row
            ret.AddRange(Enumerable.Repeat(A[i], min));
            t[A[i]] -= min;
        }
    }
    // repeat the above for residues in the SortedDictionary, if desired,
    // or print the keys and values as per the sorted order in the dictionary.
    return ret;
}

Wednesday, May 31, 2017

We continue our discussion of system design for the online store as mentioned in the previous post. We talk about securing data at rest and in transit. This is generally performed with layers and the principle of least privilege. For example, the lower levels of the software stack, such as the operating system level, operate with elevated, internal account privilege and independently from the applications and services that run on them. The higher levels of the software stack may be external to the operating system and not require the same privilege as the lower levels; their privilege may be secured with specific service accounts to run on the lower levels. The same considerations for lower and higher levels on the system side apply to user mode. Code executing on behalf of a user must have demarcated authentication and authorization lines of control before transitioning into internal context for execution. Every user access and interface to the system must be secured. Role-based access control could be used to differentiate user access. This enables separation of concerns, such as between user and administrator work. It also facilitates specifying security policies, and changes to them, without affecting the data. Access and privilege to user data should be as granular as possible so that changes to one may not affect another. As long as policies and system can be separated, we can change the policies without affecting the system. This comes in useful in throttling and controlling access to a server which may be under duress from excessive and unwanted connections. Authentication and encryption protect the data in transit, so we use them for all network connections. In the case of web traffic, we prefer https over http. But it is not just the higher privileged user that we want to secure. Connections coming in with the lowest classification, such as the anonymous or guest user, must have the least privilege to run code anywhere in the system. This is particularly important in cloud computing, where we rely on authentication tokens as a form of user credentials. If a token issuer malfunctions, it could allow an anonymous user the same privilege as a regular user. Therefore security reviews with Strengths-Weaknesses-Opportunities-Threats (SWOT) analysis need to be conducted on all flows, both control and data, through the system. A fail-fast mechanism is used to prevent unwanted access or execution; consequently, most routines at every level and entry point check their parameters and validate their assumptions (a small sketch follows the list below). The Cloud Security Alliance announced ten major challenges in big data security and privacy and suggests best practices for their mitigation. These include:
1. Secure code  - check authentication, authorization, audit and security properties, mask personally identifiable information
2. Secure non-relational data stores - use fuzzing techniques for penetration testing.
3. Secure data storage and logs using Secure Untrusted Data Repository (SUNDR) methods
4. Secure devices - use conventional endpoint protection discussed earlier along with anomaly detection
5. Secure based on monitoring - This applies at the cloud level, cluster level, application level and so on. Compliance standard violations should also be monitored.
6. Protect data privacy - Code does not differentiate which data to operate on if both are valid; policies are used instead. Similarly, if data exists in two or more sources, it could be correlated by a malicious user.
7. Secure data with encryption - perform encryption without sharing keys
8. Secure with restrictions implemented by granular access control and federation.
9. Secure with granular audits - use a Security Information and Event Management (SIEM) solution
10. Secure provenance data that describes how the data was derived, even if it bloats the size.
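As a minimal sketch of the fail-fast parameter checking mentioned above (all names here are hypothetical):

// validate inputs at the entry point and fail fast before doing any work
public static Order GetOrder(string userId, int orderId)
{
    if (string.IsNullOrEmpty(userId))
        throw new ArgumentException("userId is required", nameof(userId));
    if (orderId <= 0)
        throw new ArgumentOutOfRangeException(nameof(orderId));
    // only now proceed to authorization checks and the actual lookup
    return LookupOrder(userId, orderId); // hypothetical helper
}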


Tuesday, May 30, 2017

We continue our discussion of system design for the online store as mentioned here and here. We now discuss the data storage aspects across all services from the point of view of scalability. We assume the store will have an ever-growing number of users and plan accordingly. The services will need to store large volumes of data. This data will be both user data and logs. The user portal may be composed of data from many different data sources. Images and large static content will likely be served from storage that is optimized for blobs. Most of the per-user information is stored in sharded relational databases. High-volume short text, such as from community feedback forums, social engineering transcripts, chats and messages, and ticket and case troubleshooting conversations, will likely be stored in a large distributed key-value store. A conventional relational database may be used as a queuing system on top of this store. Almost all of this data still corresponds to individual users; it is data generated by users. Unlike user data, log data is generated by the system from the various operations of the services in the form of log events. Log events help with analytics. For example, the log events may be used for correlation and as feedback, which then leads to improvements in the operations of the services. This feedback-improvement virtuous cycle can go on and on regardless of which user is using the system. Log events translate to feedback only with analytics. For example, users may be shown the trending bestsellers or newcomers to the store; this may require correlation and collaborative filtering to provide a ranked list. Analytics also come with rich charts. User and log data may be used in many other ways. Data may appear in the form of feeds to users to improve the shopping experience around a product. Data may come in the form of recommendations, such as people who liked this also liked that. Data may be represented as a graph and used with search. Data may also be used for the integrity of the site. Ads, reviews and insights may also appear as additional data while being separate and distinct in their purpose or usage. Data expands possibilities for the business and hence it eventually becomes the center of gravity. For example, logging may channel all logs to a central repository, which may grow over time into a time-series database on a dedicated cluster or a data warehouse. As data expands, scalability concerns grow. Systems may mature, but when size grows, even architectures change. The embrace of Big Data over relational is a trend that comes directly from scalability. Developers may find it enticing to use SQL statements instead of MapReduce to get to the same result; consequently they may require an additional stack over the data. Visualization will pull data and will also come with its own stack. Data tools may evolve over the data stack, and the tools and stack will both evolve to better suit scalability and functionality. The design discussion here borrows from what has already been shown to work in companies like Facebook that have grown significantly.
#codingexercise
input [2,3,1,4]
output [12,8,24,6]

Multiply all the fields except the one at its own position.
List<int> GetSumProduct(List<int> A)
{
    Debug.Assert(A.Any(x => x == 0) == false); // division below requires no zeroes
    var product = 1;
    A.ForEach(x => { product *= x; });
    var ret = new List<int>();
    A.ForEach(x => { ret.Add(product / x); });
    return ret;
}
If we were to avoid division, we could multiply together every entry other than the current one in each iteration. If we were to make it linear and still avoid division, we would keep track of front and rear running products in separate passes, starting each at 1, and combine them for the results, as sketched below.
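A minimal sketch of that linear, division-free variant; the helper name GetProductExceptSelf is ours:

List<int> GetProductExceptSelf(List<int> A)
{
    var ret = new List<int>(new int[A.Count]);
    int front = 1;
    for (int i = 0; i < A.Count; i++)
    {
        ret[i] = front; // product of all elements before i
        front *= A[i];
    }
    int rear = 1;
    for (int i = A.Count - 1; i >= 0; i--)
    {
        ret[i] *= rear; // multiply in the product of all elements after i
        rear *= A[i];
    }
    return ret;
}

With the input [2,3,1,4] this also yields [12,8,24,6].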

Monday, May 29, 2017

Today we continue the system design discussion on the online retail store and look at some of the services in detail. For example, the card payment service is a popular service since all the transactions are billed to a card. Card numbers are not stored as is because they are sensitive information; instead they are encrypted and decrypted. The encrypt function, which encrypts str using pass_str as the password, gives a result that is a binary string of the same length as str. The decrypt function decrypts the encrypted string crypt_str using pass_str as the password. The card information is associated with the customer, which is usually stored in the accounts table. Charges to the card are facilitated through programmatic access to a third-party bank service that enables posting a charge to the card. Cards can also be specific to the store, and these may also carry monetary value, in which case the numbers for the card are generated randomly. These special cards are stored with their balance. Order processors are usually implemented as message queue processors which are written specifically to handle different kinds of transactions based on the incoming messages. The message queue may be its own cluster with journaling and retries specified. The performance of the message processor must be near real time and support millions of transactions a day. It provides an async mechanism to deduct from the card or to reload it. Point-of-sale registers send transaction requests to the order processors through API services. The orders flowing through the processors, on fulfilment, automatically update the account and the associated card, which is read by the application and the point-of-sale register. There are some interesting edge cases to be considered for the spend and reload activities on the card so that the updated balance remains accurate. Usually the payment activity on the card is the slowest transaction, as it is routed through the third-party vendor with errors handled through user interaction, but this is generally overcome by keeping it as the last step of the user interaction. Many orders are executed within a transaction scope, but with states being maintained that move progressively forward as initiated, processed, and completed or failed, the operations can be structured for retries and re-entrancy (a small sketch follows below). The purchases from the store are added to the account holder's history with the help of a transactions table. Much of the online shopping experience continues to be driven by a simple relational database as the data platform of choice. The adoption of cloud services has increased the use of such data storage options in the cloud. Accounts data enables the authentication module to keep the user signed in across different services. This is usually achieved with some form of API authentication and a portal membership provider that facilitates a session for the account holder. Popular mechanisms include OAuth, JWT and such others. Tokens are issued on successful authentication and refreshed on expiry. The transactions with the store are important for customer satisfaction. A variety of machine learning techniques such as grouping and collaborative filtering are used to make recommendations to users in the form of "customers who purchased this also purchased those".
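A minimal sketch of the progressive order states mentioned above, making a retried message a no-op once the order has moved forward (all names here are hypothetical):

enum OrderState { Initiated, Processed, Completed, Failed }

class Order
{
    public int Id;
    public OrderState State = OrderState.Initiated;
}

static class OrderProcessor
{
    // re-entrant: processing an already-advanced order does nothing
    public static void Process(Order order)
    {
        if (order.State != OrderState.Initiated)
            return; // duplicate or retried message; already past this step
        order.State = OrderState.Processed;
        // ... perform the card deduction here; on success:
        order.State = OrderState.Completed;
    }
}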
#codingexercise
Calculate the H-index of an array of non-negative integers.
This is the largest value h such that at least h of the entries are greater than or equal to h.
int GetHIndex(int[] A)
{
    // t[k] counts values equal to k, with values above A.Length clamped to A.Length
    var t = new int[A.Length + 1];
    for (int i = 0; i < A.Length; i++)
        t[Math.Min(A.Length, A[i])]++;
    int sum = 0; // running count of values >= i
    for (int i = t.Length - 1; i >= 0; i--)
    {
        sum += t[i];
        if (sum >= i)
            return i;
    }
    return -1;
}
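For example, GetHIndex(new int[] {0, 1, 3, 5, 6}) returns 3: three of the values are at least 3, but only two are at least 4.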