Wednesday, June 7, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources. For example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued looking at some more specialized infrastructure, including a dedicated private cloud. Then we added serverless computing to the mix. Today we continue the discussion.
Applications have evolved with cloud computing. What used to be monolithic, deployed with a mere separation between application-dedicated virtual machines and database-dedicated storage, was made more modular and separated into deep vertical partitions with their own operating systems. With twelve-factor applications, it was easier to take advantage of containers. This worked well with platform as a service and Docker containers. It is possible, however, to go further and decompose the application modules into compute- and data-access-intensive functions that can be offloaded into their own containers with both function as a service and backend as a service. The ease of modification is very appealing when we look at individual functions, each packaged in a container by itself. Both major public clouds currently support this form of computing: AWS Lambda and Azure Functions can execute code in response to events at any scale.
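As a quick illustration, a function-as-a-service unit reduces to a single entry point that receives an event and returns a result. The sketch below follows the AWS Lambda Python handler convention (event and context arguments); the cart-total logic and field names are invented for illustration, not a real store API:

```python
import json

def lambda_handler(event, context):
    """Entry point the platform invokes once per event.

    `event` carries the trigger payload; `context` carries runtime metadata.
    The cart-total calculation is a hypothetical example.
    """
    items = event.get("items", [])
    total = sum(item["price"] * item.get("quantity", 1) for item in items)
    return {
        "statusCode": 200,
        "body": json.dumps({"total": total}),
    }
```

The platform launches and scales such handlers per request; the application author provisions no server.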
There are a few tradeoffs in serverless computing that should be taken into perspective. First, we introduce latency into the system because the functions do not execute local to the application and require setup and teardown routines during invocations. Moreover, debugging serverless functions is harder because the functions respond to more than one application and the call stack is not available, or may have to be pieced together by looking at different compute resources. The same goes for monitoring, because we now rely on separate systems. We can contrast this with applications that are hosted behind load balancer services to improve availability. The services registered for load balancing run the same code on every partition, so the call stack is coherent even if it spans different servers. Moreover, these share the same persistence even if the entire database server is also hosted on, say, Marathon with the storage on a shared volume. The ability of Marathon to bring up instances as appropriate, along with its health checks, improves the availability of the application. The choice between platform as a service, a Marathon cluster based deployment, and serverless computing depends on the application.
#codingexercise
Given a preorder traversal of a BST, find the inorder traversal
List<int> GetInOrderFromPreOrder(List<int> A)
{
    // the inorder traversal of a BST visits the keys in sorted order,
    // so sorting the preorder sequence yields the inorder traversal
    if (A == null) return A;
    var result = new List<int>(A);
    result.Sort();   // List<T>.Sort returns void, so sort a copy and return it
    return result;
}
// O(exp) exponentiation by repeated multiplication
// ("base" is a reserved word in C#, so the parameter is named b)
int Power(int b, uint exp)
{
    int result = 1;
    for (uint i = 0; i < exp; i++)
        result = result * b;
    return result;
}
// O(log exp) exponentiation by squaring
int FastPower(int b, uint exp)
{
    int result = 1;
    while (exp > 0)
    {
        if ((exp & 1) != 0)
            result = result * b;
        b = b * b;
        exp = exp >> 1;   // the original had ==, a comparison, instead of assignment
    }
    return result;
}
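The same exponentiation by squaring can be sketched in Python for a quick sanity check against the built-in operator:

```python
def fast_power(base, exp):
    """Compute base**exp in O(log exp) multiplications."""
    result = 1
    while exp > 0:
        if exp & 1:          # odd exponent: fold the current base into the result
            result *= base
        base *= base         # square the base for the next bit of the exponent
        exp >>= 1
    return result
```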

Tuesday, June 6, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources. For example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued looking at some more specialized infrastructure, including a dedicated private cloud. Then we added serverless computing to the mix. Today we continue the discussion. This time we focus on Docker support.
OpenWhisk supports Docker actions. This means we can execute binaries on demand without provisioning virtual machines. Docker actions are best suited where it is difficult to refactor an application into a smaller set of functions. This is a common use case for existing applications and services.
When we pull images from a Docker registry to execute the action, these invocations take longer because the latency is high; it depends on the size of the image and the network bandwidth. Contrast this with the pool of warm containers that don't require a cold start. Moreover, Docker images may not be postable on a public hub because the code they execute may be proprietary, and publishing it would violate security. These concerns were mitigated by OpenWhisk providing a base image for Docker actions. Also, a Docker action can now receive a zip file with an executable.
The suggestion here is that we don't need to create custom images, which saves on latency: a base image is already provided and the executable can be switched. By not customizing images and not sharing them, we don't compromise on security. In addition, since only the executables are swapped, the time it takes to execute the code is less.
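As a hedged sketch of what such an executable looks like: a Docker action's binary receives its parameters as a single JSON string argument and must print a JSON result as the last line of stdout. The sketch below assumes that contract (it mirrors the OpenWhisk dockerskeleton convention); the "name" parameter is invented for illustration:

```python
#!/usr/bin/env python3
import json
import sys

def main(params):
    # params is a dict decoded from the JSON argument;
    # "name" is a hypothetical input, not a required OpenWhisk field
    name = params.get("name", "stranger")
    return {"greeting": "Hello, %s" % name}

if __name__ == "__main__":
    # the platform passes the merged parameters as one JSON argument
    args = json.loads(sys.argv[1]) if len(sys.argv) > 1 else {}
    # the last line written to stdout must be the JSON result
    print(json.dumps(main(args)))
```

Dropping this executable into the provided base image, or shipping it as a zip, avoids building and publishing a custom image.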

#codingexercise
A bot is an id that visits the site m times in the last n seconds. Given a list of entries in the log sorted by time, return all the bot ids.
Yesterday we solved this with iteration over the relevant window of the log. This is a typical question on logs and events, both of which are stored in a time series database.
A time series database helps with specialized queries on the data. Unlike relational data that serves an OLTP system, a time series is a continuous stream of events, often at a high rate.
In the logs, bots generally identify themselves with their user agent string, and they obey the rules in the robots.txt file of the site. Consequently, we can differentiate the bots in the logs into those that behave and those that don't. The ones that do leave an identification string.
      int count = 0;
      string pat = @"(?<bot_name>\w+?)bot\W";   // e.g. matches "Googlebot "
      var r = new Regex(pat, RegexOptions.IgnoreCase);
      foreach (var kvp in h)   // h maps an id to its user agent string
      {
           Match m = r.Match(kvp.Value);
           if (m.Success)
               count++;
      }
one more:
count the number of ways elements add up to N using array elements with repetitions allowed:
int GetCount(List<int> A, int sum)
{
    // counts[i] holds the number of ordered ways to compose i
    // from elements of A; the array must span 0..sum
    var counts = new int[sum + 1];
    counts[0] = 1;
    for (int i = 1; i <= sum; i++)
        for (int j = 0; j < A.Count; j++)
            if (i >= A[j])
                counts[i] += counts[i - A[j]];
    return counts[sum];
}
Alternatively, this can be done with backtracking instead of dynamic programming as we showed with the help of the Combine method involving repetitions in the earlier posts.
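A quick Python rendering of the same dynamic program makes the recurrence easy to check by hand; for A = [1, 2, 3] and sum 4 there are seven ordered compositions (1+1+1+1, 1+1+2 in three orders, 2+2, and 1+3 in two orders):

```python
def count_compositions(values, total):
    """Number of ordered ways to write `total` as a sum of elements
    of `values`, with repetition allowed."""
    counts = [0] * (total + 1)
    counts[0] = 1                      # one way to make zero: pick nothing
    for i in range(1, total + 1):
        for v in values:
            if i >= v:
                counts[i] += counts[i - v]
    return counts[total]
```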

Monday, June 5, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources. For example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued looking at some more specialized infrastructure, including a dedicated private cloud. Then we added serverless computing to the mix. Today we continue the discussion.
Serverless computing is open by design. The engine and the event emitter/consumer are open. The interface is open. Its components are Docker, Kafka and Consul, which are all open. The tools used with it are also open.
Since the emphasis is on actions, triggers and rules, and the deployment and runtime are managed, it is easy to upload code and use it. Actions are the event handlers. They can run on any platform. Typically they are hosted in a container. They can be chained to create sequences, which increases flexibility and fosters reuse. An association of a trigger and an action is called a rule. Rules can be specified at the time the actions are registered. A package is a collection of actions and triggers. It allows you to outsource load- and calculation-intensive tasks, and it enables sharing and reuse. The only drawback is that troubleshooting is more tedious because there is more correlation to be done. However, actions can be both synchronous and asynchronous and expressed in their own language and runtime, which means we can get responses in both blocking and non-blocking manner. The runtimes are found in the hosted container.
In the standalone mode, the containers are made available with VirtualBox. In a distributed environment, they can come from a PaaS. These actions do not require a predeclared association with containers, which means the infrastructure does not need to know what the container names are. The execution of the action is taken care of by this layer.
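For instance, a minimal OpenWhisk action in JavaScript is just a main function that takes a parameter object and returns a result object; the greeting logic and the "name" parameter are illustrative placeholders:

```javascript
// OpenWhisk invokes main with the bound and invocation-time parameters merged.
function main(params) {
    // "name" is a hypothetical parameter used only for this sketch
    const name = params.name || 'stranger';
    return { payload: 'Hello, ' + name };
}
```

Registering the action and binding a trigger to it via a rule is done out of band with the CLI or API; the action itself stays this small.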

#codingexercise
A bot is an id that visits the site m times in the last n seconds. Given a list of entries in the log sorted by time, return all the bot ids.
Dictionary<int, int> GetBots(Log[] logs, int m, int n)
{
    var h = new Dictionary<int, int>();
    // walk backwards over the entries that fall within the last n seconds;
    // the Log type is assumed to expose id and time fields
    int cutoff = logs[logs.Length - 1].time - n;
    for (int i = logs.Length - 1; i >= 0 && logs[i].time >= cutoff; i--)
        if (h.ContainsKey(logs[i].id))
            h[logs[i].id] += 1;
        else
            h.Add(logs[i].id, 1);
    // keep only the ids with at least m visits in the window;
    // copy the keys first because we cannot remove while enumerating
    foreach (var id in h.Keys.ToList())
        if (h[id] < m)
            h.Remove(id);
    return h;
}
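The same window-and-count idea can be sketched in Python, assuming each log entry is an (id, timestamp) pair sorted by timestamp; the tuple layout is an assumption for the sketch:

```python
from collections import Counter

def get_bots(logs, m, n):
    """Return ids that appear at least m times in the last n seconds.

    `logs` is a list of (id, timestamp) tuples sorted by timestamp.
    """
    if not logs:
        return []
    cutoff = logs[-1][1] - n          # start of the window
    visits = Counter(
        entry_id for entry_id, ts in logs if ts >= cutoff
    )
    return [entry_id for entry_id, count in visits.items() if count >= m]
```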

Sunday, June 4, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources. For example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued looking at some more specialized infrastructure, including a dedicated private cloud. Then we added serverless computing to the mix. Today we continue the discussion.
The serverless architecture may be standalone or distributed. In both cases, it remains an event-action platform that executes code in response to events. In the latter case, it can be offered as a managed service on IBM Bluemix. The console to this service gives a preview of all the features of OpenWhisk. We can execute code written as functions in many different languages. Bluemix takes care of launching the functions in their own containers. Because this execution is asynchronous, the frontend and backend need not perform continuous polling, which helps them be more scalable and resilient. OpenWhisk introduces an event programming model where the charges are only for what is used. Moreover, it scales on a per-request basis. Together, these three features of serverless deployment, granular pricing and per-request scaling make OpenWhisk an appealing event driven framework. Even the programming model is improved: developers only need to focus on triggers, rules and actions. Invocations can be blocking, non-blocking and periodic; different languages are supported; and it allows parameter binding, chaining and debugging. Both the engine and the interface are open and implemented in Scala. In fact, it goes beyond PaaS because not only the runtime but also the deployment is managed.
All requests pass through an API gateway because it facilitates security, control, mediation, parameter mapping and schema validation, and it supports different verbs. Routes and actions can both be defined. A CLI, UI and API are also available.

Internally, OpenWhisk uses a database to store actions, parameters and targets. This database can be standalone or distributed, and is usually CouchDB or Cloudant respectively.
It uses a message bus such as Kafka to enable interactions between load balancers, activators and invokers. Activators process events produced by triggers; an activator can call all actions bound by a rule to a particular trigger. Invokers perform actions against a pool of containers that are kept warm.
#codingexercise
Reverse a linked list with O(1) extra storage.
Test cases:
null
1
1-2
1-2-3
void Reverse(ref Node head)
{
    Node current = head;
    Node prev = null;
    while (current != null)
    {
        var next = current.next;   // save the rest of the list
        current.next = prev;       // flip the pointer
        prev = current;
        current = next;
    }
    head = prev;
}
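The same in-place pointer reversal in Python, exercised against the four test cases listed above; the Node class is a minimal stand-in:

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def reverse(head):
    """Reverse a singly linked list using O(1) extra storage."""
    prev = None
    current = head
    while current is not None:
        nxt = current.next       # save the rest of the list
        current.next = prev      # flip the pointer
        prev = current
        current = nxt
    return prev                  # prev is the new head

def to_list(head):
    """Collect the values for easy inspection."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out
```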

Saturday, June 3, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources. For example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued looking at some more specialized infrastructure, including a dedicated private cloud. Today we add serverless computing to the mix.
Serverless computing is about backend as a service as well as function as a service. A backend as a service is one which allows a portion of the backend activities to be expressed as functions that execute elsewhere on containers. These can be both synchronous and asynchronous. This allows the backend services to be lighter. It can involve any number of operations, as granular as appropriate. A function as a service allows fat clients and single page applications to be lighter, because they don't necessarily have to do all the computations in one page. They can be broken down into functions that evaluate elsewhere. In a model-view-controller architecture, we had multi-page applications, but these allow super efficient and rich single page applications.
In the case of the store, the functions may be listed as follows:
1) Authentication function - Most authentication mechanisms are universal and consolidated for the application, to allow users to sign in with membership providers. These can be offloaded to their own functions instead of being performed by the same server that responds to the shopping experience.
2) Database access - Much of the data is relational and requires the same data translations, but these need not be done in the server and can be made fine-grained.
3) MVC becomes a single page again - As discussed, the front-end allows single page applications to be composed of hundreds of smaller functions where appropriate.
4) Search function - Some methods, such as search, are orthogonal to the shopping experience but equally important. Therefore they can be offloaded.
5) Purchase function - This is probably the most used function, but it is almost independent of the user or the products since it involves card activity. Consequently it is a separate function in itself.
Not just functions but messages can also be translated, and functions can be queued with Kafka.
Thus functions instead of modules become an appealing design.
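A hedged sketch of this decomposition: each concern above becomes an independent handler behind a dispatcher keyed by event type. The handler names, payload fields and catalog are invented for illustration, not an actual store API:

```python
def authenticate(payload):
    # placeholder: a real handler would call a membership provider
    return {"user": payload.get("user"), "authenticated": bool(payload.get("token"))}

def search(payload):
    # placeholder: a real handler would query a search index
    catalog = ["book", "lamp", "mug"]
    return {"hits": [p for p in catalog if payload.get("q", "") in p]}

def purchase(payload):
    # placeholder: card activity is independent of user and product concerns
    return {"charged": payload.get("amount", 0) > 0}

# functions instead of modules: the platform routes each event to one handler
HANDLERS = {"auth": authenticate, "search": search, "purchase": purchase}

def dispatch(event):
    return HANDLERS[event["type"]](event.get("payload", {}))
```

Each handler can now scale, deploy and fail independently of the others, which is the appeal of the design.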
#codingexercise
int IndexOfUnusedInUnordered(List<int> unordered, int item, ref Dictionary<int, int> h, ref bool[] used)
{
    // find the first occurrence of item that has not been consumed yet,
    // mark it used and decrement its remaining count in h
    Debug.Assert(h.ContainsKey(item));
    int index = unordered.IndexOf(item, 0);
    while (index != -1 && index < unordered.Count)
    {
        if (used[index] != true) {
            used[index] = true;
            h[item] -= 1;
            break;
        }
        index = unordered.IndexOf(item, index + 1);
    }
    return index;
}

Friday, June 2, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources. For example, Spark and Hadoop can be offered as a fully managed cloud offering. Today we continue to look at some more specialized infrastructure.
EMC dedicated cloud is an on-demand dedicated cloud, managed completely by Dell EMC and run on dedicated servers from the Virtustream datacenter. Virtustream is a storage cloud that can secure, manage and store at exponential scale. EMC dedicated cloud is a service offering where Dell EMC runs one or more instances of Elastic Cloud Storage on a single tenant basis from Virtustream datacenters. This can be used in one of two ways: in a hybrid fashion, where some sites run on premise and others from Virtustream, or with all of the sites running from Virtustream. This finds appeal among users because on-premise data is considered by some to be a management overhead. At the same time, others like the agility of public cloud computing but want complete control of their data, as if it were running on-premise.
This is a significant win for customers who have their own patching and management routines, because they no longer suffer downtime or other costs as their applications and services work seamlessly from Virtustream. At the same time, all aspects of the dedicated cloud, such as networking, storage and compute, are offered with no sharing, so users can feel confident that they are not at risk of letting others see their data and apps. Moreover, the dedicated cloud brings some unparalleled capabilities. For example, customers can host more than one site in one datacenter and the others in another datacenter, both from Virtustream.
Where does this fit in?
Customers recognize the success of public clouds, where someone else manages their services. Customers also recognize the success of private clouds, because they often have a whole lot of non-twelve-factor apps and because the private cloud keeps the lights on. The missing piece was a managed services capability at scale. In other words, users want the Federation Enterprise Hybrid Cloud and they also want to support enterprise workloads; they just don't want to run it themselves. This is where this offering fits in. In fact, Virtustream is considered the leader in hosted private cloud solutions.
In fact, scale and hosting are recognized as important challenges not only for the customer but also for the cloud provider. Take the case of Midfin Systems, which offers an intelligent software solution that powers the limitless datacenter. It stitches together storage, compute and network from different datacenters into a single unified fabric. This enables a whole lot of use cases, including centralized and unified management of remote locations.
#codingexercise
Divide a list of numbers into groups of consecutive numbers but preserve their original order
Input: 8,2,4,7,1,0,3,6
Output: 2,4,1,0,3 and 8,7,6
List<List<int>> GetContiguousGroups(List<int> unordered)
{
    var ret = new List<List<int>>();
    var ordered = new List<int>(unordered);
    ordered.Sort();
    var used = new bool[unordered.Count]; // all false by default
    var groups = GetGroups(ordered);
    foreach (var group in groups)
    {
        // count how many of each value this group may consume
        var h = new Dictionary<int, int>();
        foreach (int i in group)
            if (h.ContainsKey(i)) h[i]++; else h[i] = 1;
        // walk the original order and pick out the members of this group,
        // which preserves the relative order of the input
        var seq = new List<int>();
        for (int i = 0; i < unordered.Count; i++)
        {
            int item = unordered[i];
            if (!used[i] && h.ContainsKey(item) && h[item] > 0)
            {
                used[i] = true;
                h[item] -= 1;
                seq.Add(item);
            }
        }
        ret.Add(seq);
    }
    return ret;
}
List<List<int>> GetGroups(List<int> ordered)
{
    // split a sorted list into runs of consecutive numbers
    var groups = new List<List<int>>();
    var run = new List<int>();
    if (ordered == null || ordered.Count == 0) return groups;
    run.Add(ordered[0]);
    for (int i = 1; i < ordered.Count; i++)
    {
        if (ordered[i] != ordered[i - 1] + 1)
        {
            groups.Add(run);
            run = new List<int>();
        }
        run.Add(ordered[i]);
    }
    groups.Add(run);
    return groups;
}

Thursday, June 1, 2017

We talked about data infrastructure and data security in a system design case in previous posts. Today we look at a very specialized but increasingly popular analytics framework to use with data sources. I'm bringing up Apache Spark in this regard. Apache Spark is a fast and general engine for large scale data processing. Public clouds offer fully managed Apache Hadoop in the cloud with an analytics cluster such as Spark or MapReduce and R-Server. However, this document brings out the benefits of provisioning managed Spark services together with different kinds of data storage.
Apache Spark is similar to MapReduce, but major public clouds offer Spark with Hadoop. Apache Spark utilizes in-memory caching and optimized execution for fast performance, and it supports general batch processing, streaming analytics, machine learning, graph databases and ad hoc queries. This shows that, as an analytics cluster, Spark can handle diverse workflows including stream processing, graph processing, SQL querying and machine learning operations. Spark does not have a file management system, so it is integrated with Hadoop or another cloud based data platform such as S3 or shared volumes. Since it runs on Mesos, we can perform analysis against all the data that is hosted on Mesos from applications and services.
This variety of data sources, including the Mesos stack, and their availability in existing deployments make Spark appealing to use for analytics. Combined with in-memory processing and a variety of analytics on offer, this technique becomes very popular.
The in-memory computations are made possible with Resilient Distributed Datasets, a memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. Zaharia et al. described their RDDs as a way to provide a restricted form of shared memory that is based on coarse-grained transformations rather than fine-grained updates to shared datasets. This enables iterative algorithms and interactive data mining tools. By allowing Spark to be used with different data sources and with computations all in-memory, the speedup is much greater than with traditional MapReduce clusters.
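The fault-tolerance idea behind RDDs can be sketched without Spark at all: record the coarse-grained transformations (the lineage) and recompute lost data from its source instead of replicating it. The toy class below only illustrates that concept and is not the Spark API:

```python
class ToyRDD:
    """A toy resilient dataset: keeps its source and a lineage of
    coarse-grained transformations so data can be recomputed on loss."""

    def __init__(self, source, lineage=None):
        self.source = list(source)
        self.lineage = lineage or []     # functions applied in order
        self._cache = None               # in-memory materialization

    def map(self, fn):
        # a coarse-grained transformation extends the lineage; nothing runs yet
        return ToyRDD(self.source, self.lineage + [lambda data: [fn(x) for x in data]])

    def filter(self, pred):
        return ToyRDD(self.source, self.lineage + [lambda data: [x for x in data if pred(x)]])

    def collect(self):
        if self._cache is None:          # compute once, keep in memory
            data = self.source
            for step in self.lineage:
                data = step(data)
            self._cache = data
        return self._cache

    def lose_partition(self):
        # simulate a failure: the cached result is gone, the lineage survives
        self._cache = None
```

After a simulated loss, collect() rebuilds the same result by replaying the lineage, which is the restricted form of shared memory the paper describes.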
Data transformations and automatic indexing also enable better use of the managed Spark instance. Since applications and services don't always use a shared repository or file share and often maintain their own local data, it would help to have that data accessible from where users can use it with Spark, or to have the data flow into the store used with Spark.
A managed instance of Spark also enables federation of data sources for analytics. This lets analysis be performed independent of who owns the data sources.
#codingexercise
Given an unordered array of positive integers, create an algorithm that rearranges it so that no run of more than M identical integers appears consecutively, where possible.
Input: 2,1,1,1,3,4,4,4,5 
Output: 2,1,1,3,1,4,4,5,4 
List<int> GetSelectedGroups(List<int> A, int M)
{
    // count the occurrences of each value
    var t = new SortedDictionary<int, int>();
    for (int i = 0; i < A.Count; i++)
        if (t.ContainsKey(A[i]))
            t[A[i]]++;
        else
            t.Add(A[i], 1);

    var ret = new List<int>();
    for (int i = 0; i < A.Count; i++)
    {
        if (t[A[i]] > 0 && (ret.Count == 0 || ret.Last() != A[i]))
        {
            int take = Math.Min(M, t[A[i]]);
            ret.AddRange(Enumerable.Repeat(A[i], take));
            t[A[i]] -= take;
        }
    }
    // repeat the above for residues in the SortedDictionary, if desired,
    // or print the keys and values as per the sorted order in the dictionary.
    return ret;
}
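The rearrangement can be checked with a small Python sketch that mirrors the pass above and then repeats it for the residues; the test asserts the property (no run of equal values longer than M) rather than one particular output, since several valid orderings exist:

```python
from collections import Counter

def limit_runs(values, m):
    """Reorder `values` so that at most m equal elements are adjacent,
    following the original order and repeating passes for residues."""
    remaining = Counter(values)
    out = []
    while sum(remaining.values()) > 0:
        progressed = False
        for v in values:
            if remaining[v] > 0 and (not out or out[-1] != v):
                take = min(m, remaining[v])
                out.extend([v] * take)
                remaining[v] -= take
                progressed = True
        if not progressed:
            # only one distinct value is left and it would extend the last
            # run; emit it even though the run limit cannot be honored
            out.extend(remaining.elements())
            break
    return out

def longest_equal_run(seq):
    """Length of the longest run of equal adjacent elements."""
    best = run = 1 if seq else 0
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        best = max(best, run)
    return best
```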