Wednesday, June 7, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources. For example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued looking at some more specialized infrastructure, including a dedicated private cloud. Then we added serverless computing to the mix. Today we continue the discussion.
Applications have evolved with cloud computing. What used to be monolithic, deployed with a mere separation between application-dedicated virtual machines and database-dedicated storage, was made more modular and separated into deep vertical partitions with their own operating systems. With twelve-factor applications, it was easier to take advantage of containers. This worked well with platform as a service and Docker containers. It is possible, however, to go further and decompose the application modules into compute- and data-access-intensive functions that can be offloaded into their own containers with both function as a service and backend as a service. The ease of modification is very appealing when we look at individual functions, each packaged in a container by itself. Both major public clouds currently support this form of computing: AWS Lambda and Azure Functions can execute code in response to events at any scale.
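As a quick illustration, a function-as-a-service unit reduces to a single entry point that receives an event and returns a result. The sketch below follows the AWS Lambda Python handler convention (event and context arguments); the cart-total logic and field names are invented for illustration, not a real store API:

```python
import json

def lambda_handler(event, context):
    """Entry point the platform invokes once per event.

    `event` carries the trigger payload; `context` carries runtime metadata.
    The cart-total calculation is a hypothetical example.
    """
    items = event.get("items", [])
    total = sum(item["price"] * item.get("quantity", 1) for item in items)
    return {
        "statusCode": 200,
        "body": json.dumps({"total": total}),
    }
```

The platform launches and scales such handlers per request; the application author provisions no server.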
There are a few tradeoffs in serverless computing that should be taken into perspective. First, we introduce latency into the system because the functions do not execute local to the application and require setup and teardown routines during invocations. Moreover, debugging serverless functions is harder because the functions respond to more than one application and the call stack is not available, or may have to be pieced together by looking at different compute resources. The same goes for monitoring, because we now rely on separate systems. We can contrast this with applications that are hosted behind load balancer services to improve availability. The services registered for load balancing run the same code on every partition, so the call stack is coherent even if it spans different servers. Moreover, these share the same persistence even if the entire database server is also hosted on, say, Marathon with the storage on a shared volume. The ability of Marathon to bring up instances as appropriate, along with its health checks, improves the availability of the application. The choice between platform as a service, a Marathon cluster based deployment, and serverless computing depends on the application.
#codingexercise
Given a preorder traversal of a BST, find the inorder traversal
List<int> GetInOrderFromPreOrder(List<int> A)
{
    // the inorder traversal of a BST visits the keys in sorted order,
    // so sorting the preorder sequence yields the inorder traversal
    if (A == null) return A;
    var result = new List<int>(A);
    result.Sort();   // List<T>.Sort returns void, so sort a copy and return it
    return result;
}
// O(exp) exponentiation by repeated multiplication
// ("base" is a reserved word in C#, so the parameter is named b)
int Power(int b, uint exp)
{
    int result = 1;
    for (uint i = 0; i < exp; i++)
        result = result * b;
    return result;
}
// O(log exp) exponentiation by squaring
int FastPower(int b, uint exp)
{
    int result = 1;
    while (exp > 0)
    {
        if ((exp & 1) != 0)
            result = result * b;
        b = b * b;
        exp = exp >> 1;   // the original had ==, a comparison, instead of assignment
    }
    return result;
}
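The same exponentiation by squaring can be sketched in Python for a quick sanity check against the built-in operator:

```python
def fast_power(base, exp):
    """Compute base**exp in O(log exp) multiplications."""
    result = 1
    while exp > 0:
        if exp & 1:          # odd exponent: fold the current base into the result
            result *= base
        base *= base         # square the base for the next bit of the exponent
        exp >>= 1
    return result
```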

Tuesday, June 6, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources. For example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued looking at some more specialized infrastructure, including a dedicated private cloud. Then we added serverless computing to the mix. Today we continue the discussion. This time we focus on Docker support.
OpenWhisk supports Docker actions. This means we can execute binaries on demand without provisioning virtual machines. Docker actions are best suited where it is difficult to refactor an application into a smaller set of functions. This is a common use case for existing applications and services.
When we pull images from a Docker registry to execute the action, these invocations take longer because the latency is high; it depends on the size of the image and the network bandwidth. Contrast this with the pool of warm containers that don't require a cold start. Moreover, Docker images may not be postable on a public hub because the code they execute may be proprietary, and publishing it would violate security. These concerns were mitigated by OpenWhisk providing a base image for Docker actions. Also, a Docker action can now receive a zip file with an executable.
The suggestion here is that we don't need to create custom images, which saves on latency: a base image is already provided and the executable can be switched. By not customizing images and not sharing them, we don't compromise on security. In addition, since only the executables are swapped, the time it takes to execute the code is less.
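As a hedged sketch of what such an executable looks like: a Docker action's binary receives its parameters as a single JSON string argument and must print a JSON result as the last line of stdout. The sketch below assumes that contract (it mirrors the OpenWhisk dockerskeleton convention); the "name" parameter is invented for illustration:

```python
#!/usr/bin/env python3
import json
import sys

def main(params):
    # params is a dict decoded from the JSON argument;
    # "name" is a hypothetical input, not a required OpenWhisk field
    name = params.get("name", "stranger")
    return {"greeting": "Hello, %s" % name}

if __name__ == "__main__":
    # the platform passes the merged parameters as one JSON argument
    args = json.loads(sys.argv[1]) if len(sys.argv) > 1 else {}
    # the last line written to stdout must be the JSON result
    print(json.dumps(main(args)))
```

Dropping this executable into the provided base image, or shipping it as a zip, avoids building and publishing a custom image.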

#codingexercise
A bot is an id that visits the site m times in the last n seconds. Given a list of entries in the log sorted by time, return all the bot ids.
Yesterday we solved this with iteration over the relevant window of the log. This is a typical question on logs and events, both of which are stored in a time series database.
A time series database helps with specialized queries on the data. Unlike relational data that serves an OLTP system, a time series is a continuous stream of events, often at a high rate.
In the logs, bots generally identify themselves with their user agent string, and they obey the rules in the robots.txt file of the site. Consequently, we can differentiate the bots in the logs into those that behave and those that don't. The ones that do leave an identification string.
      int count = 0;
      string pat = @"(?<bot_name>\w+?)bot\W";   // e.g. matches "Googlebot "
      var r = new Regex(pat, RegexOptions.IgnoreCase);
      foreach (var kvp in h)   // h maps an id to its user agent string
      {
           Match m = r.Match(kvp.Value);
           if (m.Success)
               count++;
      }
one more:
count the number of ways elements add up to N using array elements with repetitions allowed:
int GetCount(List<int> A, int sum)
{
    // counts[i] holds the number of ordered ways to compose i
    // from elements of A; the array must span 0..sum
    var counts = new int[sum + 1];
    counts[0] = 1;
    for (int i = 1; i <= sum; i++)
        for (int j = 0; j < A.Count; j++)
            if (i >= A[j])
                counts[i] += counts[i - A[j]];
    return counts[sum];
}
Alternatively, this can be done with backtracking instead of dynamic programming as we showed with the help of the Combine method involving repetitions in the earlier posts.
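A quick Python rendering of the same dynamic program makes the recurrence easy to check by hand; for A = [1, 2, 3] and sum 4 there are seven ordered compositions (1+1+1+1, 1+1+2 in three orders, 2+2, and 1+3 in two orders):

```python
def count_compositions(values, total):
    """Number of ordered ways to write `total` as a sum of elements
    of `values`, with repetition allowed."""
    counts = [0] * (total + 1)
    counts[0] = 1                      # one way to make zero: pick nothing
    for i in range(1, total + 1):
        for v in values:
            if i >= v:
                counts[i] += counts[i - v]
    return counts[total]
```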

Monday, June 5, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources. For example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued looking at some more specialized infrastructure, including a dedicated private cloud. Then we added serverless computing to the mix. Today we continue the discussion.
Serverless computing is open by design. The engine and the event emitter/consumer are open. The interface is open. Its components are Docker, Kafka and Consul, which are all open. The tools used with it are also open.
Since the emphasis is on actions, triggers and rules, and the deployment and runtime are managed, it is easy to upload code and use it. Actions are the event handlers. They can run on any platform. Typically they are hosted in a container. They can be chained to create sequences, which increases flexibility and fosters reuse. An association of a trigger and an action is called a rule. Rules can be specified at the time the actions are registered. A package is a collection of actions and triggers. It allows you to outsource load- and calculation-intensive tasks, and it enables sharing and reuse. The only drawback is that troubleshooting is more tedious because there is more correlation to be done. However, actions can be both synchronous and asynchronous and expressed in their own language and runtime, which means we can get responses in both blocking and non-blocking manner. The runtimes are found in the hosted container.
In the standalone mode, the containers are made available with VirtualBox. In a distributed environment, they can come from a PaaS. These actions do not require a predeclared association with containers, which means the infrastructure does not need to know what the container names are. The execution of the action is taken care of by this layer.
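For instance, a minimal OpenWhisk action in JavaScript is just a main function that takes a parameter object and returns a result object; the greeting logic and the "name" parameter are illustrative placeholders:

```javascript
// OpenWhisk invokes main with the bound and invocation-time parameters merged.
function main(params) {
    // "name" is a hypothetical parameter used only for this sketch
    const name = params.name || 'stranger';
    return { payload: 'Hello, ' + name };
}
```

Registering the action and binding a trigger to it via a rule is done out of band with the CLI or API; the action itself stays this small.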

#codingexercise
A bot is an id that visits the site m times in the last n seconds. Given a list of entries in the log sorted by time, return all the bot ids.
Dictionary<int, int> GetBots(Log[] logs, int m, int n)
{
    var h = new Dictionary<int, int>();
    // walk backwards over the entries that fall within the last n seconds;
    // the Log type is assumed to expose id and time fields
    int cutoff = logs[logs.Length - 1].time - n;
    for (int i = logs.Length - 1; i >= 0 && logs[i].time >= cutoff; i--)
        if (h.ContainsKey(logs[i].id))
            h[logs[i].id] += 1;
        else
            h.Add(logs[i].id, 1);
    // keep only the ids with at least m visits in the window;
    // copy the keys first because we cannot remove while enumerating
    foreach (var id in h.Keys.ToList())
        if (h[id] < m)
            h.Remove(id);
    return h;
}
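The same window-and-count idea can be sketched in Python, assuming each log entry is an (id, timestamp) pair sorted by timestamp; the tuple layout is an assumption for the sketch:

```python
from collections import Counter

def get_bots(logs, m, n):
    """Return ids that appear at least m times in the last n seconds.

    `logs` is a list of (id, timestamp) tuples sorted by timestamp.
    """
    if not logs:
        return []
    cutoff = logs[-1][1] - n          # start of the window
    visits = Counter(
        entry_id for entry_id, ts in logs if ts >= cutoff
    )
    return [entry_id for entry_id, count in visits.items() if count >= m]
```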

Sunday, June 4, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources. For example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued looking at some more specialized infrastructure, including a dedicated private cloud. Then we added serverless computing to the mix. Today we continue the discussion.
The serverless architecture may be standalone or distributed. In both cases, it remains an event-action platform that executes code in response to events. In the latter case, it can be offered as a managed service on IBM Bluemix. The console to this service gives a preview of all the features of OpenWhisk. We can execute code written as functions in many different languages. Bluemix takes care of launching the functions in their own containers. Because this execution is asynchronous, the frontend and backend need not perform continuous polling, which helps them be more scalable and resilient. OpenWhisk introduces an event programming model where the charges are only for what is used. Moreover, it scales on a per-request basis. Together, these three features of serverless deployment, granular pricing and per-request scaling make OpenWhisk an appealing event driven framework. Even the programming model is improved: developers only need to focus on triggers, rules and actions. Invocations can be blocking, non-blocking and periodic; different languages are supported; and it allows parameter binding, chaining and debugging. Both the engine and the interface are open and implemented in Scala. In fact, it goes beyond PaaS because not only the runtime but also the deployment is managed.
All requests pass through an API gateway because it facilitates security, control, mediation, parameter mapping and schema validation, and it supports different verbs. Routes and actions can both be defined. A CLI, UI and API are also available.

Internally, OpenWhisk uses a database to store actions, parameters and targets. This database can be standalone or distributed, and is usually CouchDB or Cloudant respectively.
It uses a message bus such as Kafka to enable interactions between load balancers, activators and invokers. Activators process events produced by triggers; an activator can call all actions bound by a rule to a particular trigger. Invokers perform actions against a pool of containers that are kept warm.
#codingexercise
Reverse a linked list with O(1) extra storage.
Test cases:
null
1
1-2
1-2-3
void Reverse(ref Node head)
{
    Node current = head;
    Node prev = null;
    while (current != null)
    {
        var next = current.next;   // save the rest of the list
        current.next = prev;       // flip the pointer
        prev = current;
        current = next;
    }
    head = prev;
}
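The same in-place pointer reversal in Python, exercised against the four test cases listed above; the Node class is a minimal stand-in:

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def reverse(head):
    """Reverse a singly linked list using O(1) extra storage."""
    prev = None
    current = head
    while current is not None:
        nxt = current.next       # save the rest of the list
        current.next = prev      # flip the pointer
        prev = current
        current = nxt
    return prev                  # prev is the new head

def to_list(head):
    """Collect the values for easy inspection."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out
```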

Saturday, June 3, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources. For example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued looking at some more specialized infrastructure, including a dedicated private cloud. Today we add serverless computing to the mix.
Serverless computing is about backend as a service as well as function as a service. A backend as a service is one which allows a portion of the backend activities to be expressed as functions that execute elsewhere on containers. These can be both synchronous and asynchronous. This allows the backend services to be lighter. It can involve any number of operations, as granular as appropriate. A function as a service allows fat clients and single page applications to be lighter, because they don't necessarily have to do all the computations in one page. They can be broken down into functions that evaluate elsewhere. In a model-view-controller architecture, we had multi-page applications, but these allow super efficient and rich single page applications.
In the case of the store, the functions may be listed as follows:
1) Authentication function - Most authentication mechanisms are universal and consolidated for the application, to allow users to sign in with membership providers. These can be offloaded to their own functions instead of being performed by the same server that responds to the shopping experience.
2) Database access - Much of the data is relational and requires the same data translations, but these need not be done in the server and can be made fine-grained.
3) MVC becomes a single page again - As discussed, the front-end allows single page applications to be composed of hundreds of smaller functions where appropriate.
4) Search function - Some methods, such as search, are orthogonal to the shopping experience but equally important. Therefore they can be offloaded.
5) Purchase function - This is probably the most used function, but it is almost independent of the user or the products since it involves card activity. Consequently it is a separate function in itself.
Not just functions but messages can also be translated, and functions can be queued with Kafka.
Thus functions instead of modules become an appealing design.
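A hedged sketch of this decomposition: each concern above becomes an independent handler behind a dispatcher keyed by event type. The handler names, payload fields and catalog are invented for illustration, not an actual store API:

```python
def authenticate(payload):
    # placeholder: a real handler would call a membership provider
    return {"user": payload.get("user"), "authenticated": bool(payload.get("token"))}

def search(payload):
    # placeholder: a real handler would query a search index
    catalog = ["book", "lamp", "mug"]
    return {"hits": [p for p in catalog if payload.get("q", "") in p]}

def purchase(payload):
    # placeholder: card activity is independent of user and product concerns
    return {"charged": payload.get("amount", 0) > 0}

# functions instead of modules: the platform routes each event to one handler
HANDLERS = {"auth": authenticate, "search": search, "purchase": purchase}

def dispatch(event):
    return HANDLERS[event["type"]](event.get("payload", {}))
```

Each handler can now scale, deploy and fail independently of the others, which is the appeal of the design.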
#codingexercise
int IndexOfUnusedInUnordered(List<int> unordered, int item, ref Dictionary<int, int> h, ref bool[] used)
{
    // find the first occurrence of item that has not been consumed yet,
    // mark it used and decrement its remaining count in h
    Debug.Assert(h.ContainsKey(item));
    int index = unordered.IndexOf(item, 0);
    while (index != -1 && index < unordered.Count)
    {
        if (used[index] != true) {
            used[index] = true;
            h[item] -= 1;
            break;
        }
        index = unordered.IndexOf(item, index + 1);
    }
    return index;
}

Friday, June 2, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources. For example, Spark and Hadoop can be offered as a fully managed cloud offering. Today we continue to look at some more specialized infrastructure.
EMC dedicated cloud is an on-demand dedicated cloud, managed completely by Dell EMC and run on dedicated servers from the Virtustream datacenter. Virtustream is a storage cloud that can secure, manage and store at exponential scale. EMC dedicated cloud is a service offering where Dell EMC runs one or more instances of Elastic Cloud Storage on a single tenant basis from Virtustream datacenters. This can be used in one of two ways: in a hybrid fashion, where some sites run on premise and others from Virtustream, or with all of the sites running from Virtustream. This finds appeal among users because on-premise data is considered by some to be a management overhead. At the same time, others like the agility of public cloud computing but want complete control of their data, as if it were running on-premise.
This is a significant win for customers who have their own patching and management routines, because they no longer suffer downtime or other costs as their applications and services work seamlessly from Virtustream. At the same time, all aspects of the dedicated cloud, such as networking, storage and compute, are offered with no sharing, so users can feel confident that they are not at risk of letting others see their data and apps. Moreover, the dedicated cloud brings some unparalleled capabilities. For example, customers can host more than one site in one datacenter and the others in another datacenter, both from Virtustream.
Where does this fit in?
Customers recognize the success of public clouds, where someone else manages their services. Customers also recognize the success of private clouds, because they often have a whole lot of non-twelve-factor apps and because the private cloud keeps the lights on. The missing piece was a managed services capability at scale. In other words, users want the Federation Enterprise Hybrid Cloud and they also want to support enterprise workloads; they just don't want to run it themselves. This is where this offering fits in. In fact, Virtustream is considered the leader in hosted private cloud solutions.
In fact, scale and hosting are recognized as important challenges not only for the customer but also for the cloud provider. Take the case of Midfin Systems, which offers an intelligent software solution that powers the limitless datacenter. It stitches together storage, compute and network from different datacenters into a single unified fabric. This enables a whole lot of use cases, including centralized and unified management of remote locations.
#codingexercise
Divide a list of numbers into groups of consecutive numbers but preserve their original order
Input: 8,2,4,7,1,0,3,6
Output: 2,4,1,0,3 and 8,7,6
List<List<int>> GetContiguousGroups(List<int> unordered)
{
    var ret = new List<List<int>>();
    var ordered = new List<int>(unordered);
    ordered.Sort();
    var used = new bool[unordered.Count]; // all false by default
    var groups = GetGroups(ordered);
    foreach (var group in groups)
    {
        // count how many of each value this group may consume
        var h = new Dictionary<int, int>();
        foreach (int i in group)
            if (h.ContainsKey(i)) h[i]++; else h[i] = 1;
        // walk the original order and pick out the members of this group,
        // which preserves the relative order of the input
        var seq = new List<int>();
        for (int i = 0; i < unordered.Count; i++)
        {
            int item = unordered[i];
            if (!used[i] && h.ContainsKey(item) && h[item] > 0)
            {
                used[i] = true;
                h[item] -= 1;
                seq.Add(item);
            }
        }
        ret.Add(seq);
    }
    return ret;
}
List<List<int>> GetGroups(List<int> ordered)
{
    // split a sorted list into runs of consecutive numbers
    var groups = new List<List<int>>();
    var run = new List<int>();
    if (ordered == null || ordered.Count == 0) return groups;
    run.Add(ordered[0]);
    for (int i = 1; i < ordered.Count; i++)
    {
        if (ordered[i] != ordered[i - 1] + 1)
        {
            groups.Add(run);
            run = new List<int>();
        }
        run.Add(ordered[i]);
    }
    groups.Add(run);
    return groups;
}

Thursday, June 1, 2017

We talked about data infrastructure and data security in a system design case in previous posts. Today we look at a very specialized but increasingly popular analytics framework to use with data sources. I'm bringing up Apache Spark in this regard. Apache Spark is a fast and general engine for large scale data processing. Public clouds offer fully managed Apache Hadoop in the cloud with an analytics cluster such as Spark or MapReduce and R-Server. However, this document brings out the benefits of provisioning managed Spark services together with different kinds of data storage.
Apache Spark is similar to MapReduce, but major public clouds offer Spark with Hadoop. Apache Spark utilizes in-memory caching and optimized execution for fast performance, and it supports general batch processing, streaming analytics, machine learning, graph databases and ad hoc queries. This shows that, as an analytics cluster, Spark can handle diverse workflows including stream processing, graph processing, SQL querying and machine learning operations. Spark does not have a file management system, so it is integrated with Hadoop or another cloud based data platform such as S3 or shared volumes. Since it runs on Mesos, we can perform analysis against all the data that is hosted on Mesos from applications and services.
This variety of data sources, including the Mesos stack, and their availability in existing deployments make Spark appealing to use for analytics. Combined with in-memory processing and a variety of analytics on offer, this technique becomes very popular.
The in-memory computations are made possible with Resilient Distributed Datasets, a memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. Zaharia et al. described their RDDs as a way to provide a restricted form of shared memory that is based on coarse-grained transformations rather than fine-grained updates to shared datasets. This enables iterative algorithms and interactive data mining tools. By allowing Spark to be used with different data sources and with computations all in-memory, the speedup is much greater than with traditional MapReduce clusters.
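The fault-tolerance idea behind RDDs can be sketched without Spark at all: record the coarse-grained transformations (the lineage) and recompute lost data from its source instead of replicating it. The toy class below only illustrates that concept and is not the Spark API:

```python
class ToyRDD:
    """A toy resilient dataset: keeps its source and a lineage of
    coarse-grained transformations so data can be recomputed on loss."""

    def __init__(self, source, lineage=None):
        self.source = list(source)
        self.lineage = lineage or []     # functions applied in order
        self._cache = None               # in-memory materialization

    def map(self, fn):
        # a coarse-grained transformation extends the lineage; nothing runs yet
        return ToyRDD(self.source, self.lineage + [lambda data: [fn(x) for x in data]])

    def filter(self, pred):
        return ToyRDD(self.source, self.lineage + [lambda data: [x for x in data if pred(x)]])

    def collect(self):
        if self._cache is None:          # compute once, keep in memory
            data = self.source
            for step in self.lineage:
                data = step(data)
            self._cache = data
        return self._cache

    def lose_partition(self):
        # simulate a failure: the cached result is gone, the lineage survives
        self._cache = None
```

After a simulated loss, collect() rebuilds the same result by replaying the lineage, which is the restricted form of shared memory the paper describes.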
Data transformations and automatic indexing also enable better use of the managed Spark instance. Since applications and services don't always use a shared repository or file share and often maintain their own local data, it would help to have that data accessible from where users can use it with Spark, or to have the data flow into the store used with Spark.
A managed instance of Spark also enables federation of data sources for analytics. This lets analysis be performed independent of who owns the data sources.
#codingexercise
Given an unordered array of positive integers, create an algorithm that rearranges it so that no run of more than M identical integers appears consecutively, where possible.
Input: 2,1,1,1,3,4,4,4,5 
Output: 2,1,1,3,1,4,4,5,4 
List<int> GetSelectedGroups(List<int> A, int M)
{
    // count the occurrences of each value
    var t = new SortedDictionary<int, int>();
    for (int i = 0; i < A.Count; i++)
        if (t.ContainsKey(A[i]))
            t[A[i]]++;
        else
            t.Add(A[i], 1);

    var ret = new List<int>();
    for (int i = 0; i < A.Count; i++)
    {
        if (t[A[i]] > 0 && (ret.Count == 0 || ret.Last() != A[i]))
        {
            int take = Math.Min(M, t[A[i]]);
            ret.AddRange(Enumerable.Repeat(A[i], take));
            t[A[i]] -= take;
        }
    }
    // repeat the above for residues in the SortedDictionary, if desired,
    // or print the keys and values as per the sorted order in the dictionary.
    return ret;
}
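The rearrangement can be checked with a small Python sketch that mirrors the pass above and then repeats it for the residues; the test asserts the property (no run of equal values longer than M) rather than one particular output, since several valid orderings exist:

```python
from collections import Counter

def limit_runs(values, m):
    """Reorder `values` so that at most m equal elements are adjacent,
    following the original order and repeating passes for residues."""
    remaining = Counter(values)
    out = []
    while sum(remaining.values()) > 0:
        progressed = False
        for v in values:
            if remaining[v] > 0 and (not out or out[-1] != v):
                take = min(m, remaining[v])
                out.extend([v] * take)
                remaining[v] -= take
                progressed = True
        if not progressed:
            # only one distinct value is left and it would extend the last
            # run; emit it even though the run limit cannot be honored
            out.extend(remaining.elements())
            break
    return out

def longest_equal_run(seq):
    """Length of the longest run of equal adjacent elements."""
    best = run = 1 if seq else 0
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        best = max(best, run)
    return best
```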