Thursday, June 1, 2017

We talked about data infrastructure and data security in a system design case in previous posts. Today we look at a specialized but increasingly popular analytics framework to use with data sources: Apache Spark. Apache Spark is a fast and general engine for large-scale data processing. Public clouds offer fully managed Apache Hadoop with an analytics cluster type such as Spark, MapReduce or R Server. This post, however, brings out the benefits of provisioning managed Spark services together with different kinds of data storage.
Apache Spark is similar to MapReduce, and the major public clouds offer it alongside Hadoop. Spark uses in-memory caching and optimized execution for fast performance, and it supports general batch processing, streaming analytics, machine learning, graph processing and ad hoc SQL queries, so a single analytics cluster can handle diverse workloads. Spark does not have its own file management system, so it is integrated with Hadoop or another cloud-based data platform such as S3 or shared volumes. Since it can also run on Mesos, we can analyze the data that applications and services host on Mesos.
This variety of data sources, including the Mesos stack, and their availability in existing deployments make Spark appealing for analytics. Combined with in-memory processing and a broad range of analytics, this explains much of its popularity.
The in-memory computations are made possible by Resilient Distributed Datasets (RDDs), a memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. Zaharia et al. described RDDs as a restricted form of shared memory based on coarse-grained transformations rather than fine-grained updates to shared state. This enables iterative algorithms and interactive data mining tools. Because Spark can be used with different data sources and keeps its computations in memory, it can run such workloads much faster than traditional MapReduce clusters.
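To make the idea of coarse-grained transformations concrete, here is a minimal single-process sketch in C#. It is not Spark's API: the `ToyRdd` class and its `FromSource`, `Map`, `Filter`, `Collect` and `DropCache` members are hypothetical names chosen for illustration. Each dataset remembers only its parent and the function applied to it (its lineage), so a lost in-memory result can be recomputed instead of being restored from a replica.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical toy class, not part of Spark: a dataset defined by its lineage.
public class ToyRdd<T>
{
    private readonly Func<IEnumerable<T>> compute; // how to rebuild this dataset
    private List<T> cache;                         // in-memory copy, may be lost

    public ToyRdd(Func<IEnumerable<T>> compute) { this.compute = compute; }

    // Build a dataset from a stable source (stands in for a file in HDFS or S3).
    public static ToyRdd<T> FromSource(IEnumerable<T> source)
    {
        var snapshot = source.ToList();
        return new ToyRdd<T>(() => snapshot);
    }

    // Coarse-grained transformations: one function applied to every element.
    public ToyRdd<U> Map<U>(Func<T, U> f)
    {
        return new ToyRdd<U>(() => Collect().Select(f));
    }

    public ToyRdd<T> Filter(Func<T, bool> keep)
    {
        return new ToyRdd<T>(() => Collect().Where(keep));
    }

    // Materialize the data, reusing the in-memory copy when it is present.
    public List<T> Collect()
    {
        if (cache == null) cache = compute().ToList();
        return cache;
    }

    // Simulate losing the in-memory copy, for example after a node failure.
    public void DropCache() { cache = null; }
}

public static class ToyRddDemo
{
    public static void Main()
    {
        var numbers = ToyRdd<int>.FromSource(Enumerable.Range(1, 10));
        var evenSquares = numbers.Map(x => x * x).Filter(x => x % 2 == 0);

        Console.WriteLine(string.Join(",", evenSquares.Collect())); // 4,16,36,64,100
        evenSquares.DropCache();                                    // lose the cached result
        Console.WriteLine(string.Join(",", evenSquares.Collect())); // same values, rebuilt from lineage
    }
}
```

The real system partitions the data across machines and recomputes only the lost partitions; this sketch keeps everything in one list to show just the lineage idea.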
Data transformations and automatic indexing also enable better use of the managed Spark instance. Since applications and services often maintain their own local data instead of using a shared repository or file share, it helps to make that data accessible where users can reach it from Spark, or to have it flow into the store used with Spark.
A managed Spark instance also enables federation of data sources for analytics. This lets analysis be performed independently of who owns the data sources.
#codingexercise
Given an unordered array of positive integers, create an algorithm that rearranges it so that no group of adjacent identical integers is larger than M.
For example, with M = 2:
Input: 2,1,1,1,3,4,4,4,5
Output: 2,1,1,3,1,4,4,5,4
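// Requires: using System; using System.Collections.Generic; using System.Linq;
static List<int> GetSelectedGroups(List<int> A, int M)
{
    // Count how many times each value occurs.
    var t = new SortedDictionary<int, int>();
    for (int i = 0; i < A.Count; i++)
    {
        if (t.ContainsKey(A[i]))
            t[A[i]]++;
        else
            t.Add(A[i], 1);
    }

    var ret = new List<int>();
    bool progress = true;
    // Sweep the input repeatedly so that residues left by one pass are placed by the next.
    while (ret.Count < A.Count && progress)
    {
        progress = false;
        for (int i = 0; i < A.Count; i++)
        {
            // Emit a value only if copies remain and it would not extend the previous run.
            if (t[A[i]] > 0 && (ret.Count == 0 || ret.Last() != A[i]))
            {
                int min = Math.Min(M, t[A[i]]);
                ret.AddRange(Enumerable.Repeat(A[i], min));
                t[A[i]] -= min;
                progress = true;
            }
        }
    }
    // For the sample input with M = 2 this yields 2,1,1,3,4,4,5,1,4, which differs from
    // the output listed above but also satisfies the constraint.
    // The sweep is greedy: if a pass makes no progress, ret is returned shorter than A,
    // and the residues remain in the SortedDictionary to be printed or handled separately.
    return ret;
}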
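A quick way to check any candidate output is to measure its longest run of equal adjacent values and compare it with M. The sketch below does that for the sample input and output from this exercise. `RunCheck` and `MaxRunLength` are hypothetical helper names, and M = 2 is an assumption inferred from the sample.

```csharp
using System;
using System.Collections.Generic;

public static class RunCheck
{
    // Length of the longest run of equal adjacent values (0 for an empty list).
    public static int MaxRunLength(IList<int> values)
    {
        int best = 0, current = 0;
        for (int i = 0; i < values.Count; i++)
        {
            // Extend the current run if this value repeats the previous one, else start a new run.
            current = (i > 0 && values[i] == values[i - 1]) ? current + 1 : 1;
            best = Math.Max(best, current);
        }
        return best;
    }

    public static void Main()
    {
        var input = new List<int> { 2, 1, 1, 1, 3, 4, 4, 4, 5 };
        var output = new List<int> { 2, 1, 1, 3, 1, 4, 4, 5, 4 };
        Console.WriteLine(MaxRunLength(input));  // 3, so the input violates M = 2
        Console.WriteLine(MaxRunLength(output)); // 2, so the output satisfies M = 2
    }
}
```

A full check would also confirm that the output is a permutation of the input, for example by comparing sorted copies of the two lists.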
