Sunday, June 4, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources; for example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued by looking at some more specialized infrastructure, including dedicated private cloud. Then we added serverless computing to the mix. Today we continue that discussion.
The serverless architecture may be standalone or distributed. In both cases, it remains an event-action platform that executes code in response to events. In the latter case, it can be offered as a managed service on IBM Bluemix. The console for this service gives a preview of all the features of OpenWhisk. We can execute code written as functions in many different languages; Bluemix takes care of launching the functions in its own containers. Because this execution is asynchronous to the frontend and backend, they need not perform continuous polling, which helps them be more scalable and resilient. OpenWhisk introduces an event programming model where charges accrue only for what is used. Moreover, it scales on a per-request basis. Together, these three features of serverless deployment, granular pricing and per-request scaling make OpenWhisk an appealing event-driven framework. Even the programming model is improved: developers only need to focus on triggers, rules and actions. Invocations can be blocking, non-blocking or periodic; different languages are supported; and it allows parameter binding, chaining and debugging. Both the engine and the interface are open source and implemented in Scala. In fact, it goes further than PaaS because not only the runtime but also the deployment is managed.
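As a quick sketch of the triggers-rules-actions model using the OpenWhisk wsk CLI (the file and entity names here, hello.js, locationUpdate and myRule, are hypothetical placeholders):

# create an action from a JavaScript function file (hypothetical hello.js)
wsk action create hello hello.js
# invoke it synchronously (blocking) and print just the result
wsk action invoke hello --blocking --result
# create a trigger, then a rule binding the trigger to the action
wsk trigger create locationUpdate
wsk rule create myRule locationUpdate hello
# firing the trigger now causes the rule to run the action
wsk trigger fire locationUpdate --param name Dorothy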
All requests pass through an API gateway because it facilitates security, control, mediation, parameter mapping and schema validation, and it supports different verbs. Both routes and actions can be defined. A CLI, UI and API are also available.

Internally, OpenWhisk uses a database to store actions, parameters and targets. This database can be standalone or distributed, usually CouchDB or Cloudant respectively.
It uses a message bus such as Kafka to enable interactions between load balancers, activators and invokers. Activators process events produced by triggers; an activator can call all the actions bound by a rule to a particular trigger. Invokers perform actions against a pool of containers that are kept warm.
#codingexercise
Reverse a linked list in O(1) storage
Test cases: null, 1, 1-2, 1-2-3
class Node { public int data; public Node next; }

void Reverse(ref Node head)
{
    Node current = head;
    Node prev = null;
    while (current != null)
    {
        // detach current and point it at the already-reversed prefix
        var next = current.next;
        current.next = prev;
        prev = current;
        current = next;
    }
    head = prev;
}
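A quick usage sketch for the 1-2-3 test case above, assuming the minimal Node class shown:

var head = new Node { data = 1, next = new Node { data = 2, next = new Node { data = 3 } } };
Reverse(ref head);
// head now walks 3 -> 2 -> 1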

Saturday, June 3, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources; for example, Spark and Hadoop can be offered as a fully managed cloud offering. We continued by looking at some more specialized infrastructure, including dedicated private cloud. Today we add serverless computing to the mix.
Serverless computing covers both Backend as a Service and Function as a Service. A Backend as a Service is one that allows a portion of the backend activities to be expressed as functions that execute elsewhere on containers. These can be both synchronous and asynchronous. This allows the backend services to be lighter. It can involve any number of operations, at whatever granularity is appropriate. A Function as a Service allows fat clients and single-page applications to be lighter, since they don't necessarily have to do all the computations in one page; the work can be broken down into functions that evaluate elsewhere. In a model-view-controller architecture we had multi-page applications, but these functions enable highly efficient and rich single-page applications.
In the case of the store, the functions may be listed as follows:
1) Authentication function - Most of the authentication mechanism is universal and consolidated for the application, allowing users to sign in via membership providers. It can be offloaded to its own function instead of being performed by the same server that responds to the shopping experience.
2) The database access - Much of the data is relational and requires the same kinds of data translations, but these need not be done in the server and can be made fine-grained.
3) MVC becomes a single page again. As discussed, the front end allows single-page applications to be composed of hundreds of smaller functions when appropriate.
4) Search function - Some of the methods, such as search, are orthogonal to the shopping experience but equally important. Therefore they can be offloaded.
5) Purchase function - This is probably the most used function, but it is almost independent of the user or the products since it involves card activity. Consequently it is a separate function in itself (a sketch follows below).
Not just functions but messages can also be translated. Functions can be queued with Kafka.
Thus functions instead of modules become an appealing design.
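As a small illustration of item 5 above, here is a minimal C# sketch of a purchase operation factored out as a standalone function. All of the names here (PurchaseRequest, PurchaseResult, ChargeCard) are hypothetical placeholders rather than any particular FaaS API.

enum PurchaseResult { Invalid, Completed, Failed }

class PurchaseRequest
{
    public string CardToken;
    public decimal Amount;
}

static class PurchaseFunction
{
    // independent of user and product state; it only performs card activity
    public static PurchaseResult Run(PurchaseRequest req)
    {
        if (req == null || string.IsNullOrEmpty(req.CardToken) || req.Amount <= 0)
            return PurchaseResult.Invalid; // fail fast on bad input
        bool charged = ChargeCard(req.CardToken, req.Amount); // delegate to the payment provider
        return charged ? PurchaseResult.Completed : PurchaseResult.Failed;
    }

    // stub standing in for the third-party charge call
    static bool ChargeCard(string token, decimal amount) { return true; }
}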
#codingexercise
int IndexOfUnusedInUnordered(List<int> unordered, int item, ref Dictionary<int, int> h, ref bool[] used)
{
    Debug.Assert(h.ContainsKey(item));
    int index = unordered.IndexOf(item, 0);
    while (index != -1)
    {
        if (used[index] != true)
        {
            used[index] = true; // claim this occurrence
            h[item] -= 1;       // one fewer occurrence left to place
            break;
        }
        // this occurrence was already claimed; look for the next one
        index = unordered.IndexOf(item, index + 1);
    }
    return index;
}
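For example, with unordered = [1, 2, 1], h = {1: 2, 2: 1} and no indices used yet, a first call with item = 1 returns index 0, and a second call returns index 2 because index 0 is now marked used.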

Friday, June 2, 2017

We talked about the overall design of an online shopping store that can scale, starting with our post here. Then we proceeded to discuss data infrastructure and data security in a system design case in previous posts. We started looking at a very specialized but increasingly popular analytics framework for Big Data to use with data sources; for example, Spark and Hadoop can be offered as a fully managed cloud offering. Today we continue to look at some more specialized infrastructure.
The EMC dedicated cloud is an on-demand dedicated cloud, managed completely by Dell EMC and run on dedicated servers in Virtustream datacenters. Virtustream is a storage cloud that can secure, manage and store data at scale. The EMC dedicated cloud is a service offering in which Dell EMC runs one or more instances of Elastic Cloud Storage on a single-tenant basis from Virtustream datacenters. This can be used in one of two ways: in a hybrid fashion, where some sites run on premise and others from Virtustream, or with all of the sites running from Virtustream. This finds appeal among users because on-premise data is considered by some to be management overhead. At the same time, others like the agility of public cloud computing but want complete control of their data, as if it were running on premise.
This is a significant win for customers who have their own patching and management routines, because they now don't have to suffer any downtime or other costs as their applications and services work seamlessly from Virtustream. At the same time, all aspects of the dedicated cloud, such as networking, storage and compute, are offered with no sharing, so users can feel confident that they are not at risk of letting others see their data and apps. Moreover, the dedicated cloud brings some unparalleled capabilities. For example, customers can host one site in one datacenter and others in another datacenter, both from Virtustream.
Where does this fit in?
Customers recognize the success of public clouds, where someone else manages their services. Customers also recognize the success of private clouds, because they often have a whole lot of non-12-factor apps and because the private cloud keeps the lights on. The missing piece was the managed services capability at scale. In other words, users want the Federation Enterprise Hybrid Cloud and they want it to support enterprise workloads, but they just don't want to run it themselves. This is where this offering fits in. In fact, Virtustream is considered the leader in hosted private cloud solutions.
Scale and hosting are recognized as important challenges not only for the customer but also for the cloud provider. Take the case of Midfin Systems, which offers an intelligent software solution that powers the limitless datacenter. It stitches together storage, compute and network from different datacenters into a single unified fabric. This enables a whole range of use cases, including centralized and unified management of remote locations.
#codingexercise
Divide a list of numbers into groups of consecutive numbers but preserve their original order
Input: 8,2,4,7,1,0,3,6
Output: 2,4,1,0,3 and 8,7,6
List<List<int>> GetContiguousGroups(List<int> unordered)
{
    var ret = new List<List<int>>();
    var ordered = unordered.ToList(); // copy, so the original order is preserved
    ordered.Sort();
    var used = new bool[unordered.Count]; // all false initially
    var groups = GetGroups(ordered);
    foreach (var group in groups)
    {
        var h = new Dictionary<int, int>(); // value -> remaining repetitions
        foreach (int i in group)
        {
            if (h.ContainsKey(i)) h[i] += 1; else h[i] = 1;
        }
        var picks = new List<int>(); // original indices claimed for this group
        foreach (var item in h.Keys.ToList())
        {
            int count = h[item];
            for (int i = 0; i < count; i++)
            {
                // the helper marks the index used and decrements h[item]
                picks.Add(IndexOfUnusedInUnordered(unordered, item, ref h, ref used));
            }
        }
        picks.Sort(); // restore the original input order within the group
        ret.Add(picks.Select(index => unordered[index]).ToList());
    }
    return ret;
}
List<List<int>> GetGroups(List<int> ordered)
{
    var groups = new List<List<int>>();
    var ret = new List<int>();
    if (ordered == null || ordered.Count == 0) return groups;
    ret.Add(ordered[0]);
    for (int i = 1; i < ordered.Count; i++)
    {
        // a gap ends the current run; duplicates stay in the same run
        if (ordered[i] != ordered[i-1] + 1 && ordered[i] != ordered[i-1])
        {
            groups.Add(ret);
            ret = new List<int>();
        }
        ret.Add(ordered[i]);
    }
    groups.Add(ret);
    return groups;
}
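To trace the example: sorting 8,2,4,7,1,0,3,6 gives 0,1,2,3,4,6,7,8, which splits into the consecutive runs 0-4 and 6-8; picking each run's members in their original input order yields 2,4,1,0,3 and 8,7,6.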

Thursday, June 1, 2017

We talked about data infrastructure and data security in a system design case in previous posts. Today we look at a very specialized but increasingly popular analytics framework to use with data sources. I'm bringing up Apache Spark in this regard. Apache Spark is a fast and general engine for large-scale data processing. Public clouds offer fully managed Apache Hadoop with an analytics cluster such as Spark or MapReduce and R-Server. However, this document brings out the benefits of provisioning managed Spark services together with different kinds of data storage.
Apache Spark is similar to MapReduce, but major public clouds offer Spark with Hadoop. Apache Spark utilizes in-memory caching and optimized execution for fast performance, and it supports general batch processing, streaming analytics, machine learning, graph databases and ad hoc queries. This shows that, as an analytics cluster, Spark can handle diverse workflows including stream processing, graph processing, SQL querying and machine learning operations. Spark does not have a file management system, so it is integrated with Hadoop or another cloud-based data platform such as S3 or shared volumes. Since it can run on Mesos, we can perform analysis against all the data that applications and services host on Mesos.
This variety of data sources, including the Mesos stack, and their availability in existing deployments make Spark appealing to use for analytics. Combined with in-memory processing and a variety of analytics, this makes the technique very popular.
The in-memory computations are made possible with Resilient Distributed Datasets (RDDs), a memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. Zaharia et al. described their RDDs as a way to provide a restricted form of shared memory that is based on coarse-grained transformations rather than fine-grained updates to shared datasets. This enables iterative algorithms and interactive data mining tools. By allowing Spark to be used with different data sources and with computations all in memory, the speedup is much greater than with traditional MapReduce clusters.
Data transformations and automatic indexing also enable better use of the managed Spark instance. Since applications and services don't always use a shared repository or file share and often maintain their own local data, it would help to have that data accessible where users can reach it with Spark, or to have the data flow into the store used with Spark.
A managed instance of Spark also enables federation of data sources for analytics. This lets analysis be performed independent of who owns the data sources.
#codingexercise
Given an unordered array of positive integers, create an algorithm that rearranges it so that no run of identical integers is longer than M.
Input: 2,1,1,1,3,4,4,4,5 
Output: 2,1,1,3,1,4,4,5,4 
List<int> GetSelectedGroups(List<int> A, int M)
{
    var t = new SortedDictionary<int, int>(); // value -> remaining count
    for (int i = 0; i < A.Count; i++)
    {
        if (t.ContainsKey(A[i]))
            t[A[i]]++;
        else
            t.Add(A[i], 1);
    }
    var ret = new List<int>();
    for (int i = 0; i < A.Count; i++)
    {
        if (t[A[i]] > 0 && (ret.Count == 0 || ret.Last() != A[i]))
        {
            int min = Math.Min(M, t[A[i]]); // emit at most M in a row
            ret.AddRange(Enumerable.Repeat(A[i], min));
            t[A[i]] -= min;
        }
    }
    // repeat the above for residues in the SortedDictionary, if desired,
    // or print the keys and values as per the sorted order in the dictionary.
    return ret;
}

Wednesday, May 31, 2017

We continue our discussion of system design for the online store as mentioned in the previous post. We talk about securing data at rest and in transit. This is generally performed with layers and the principle of least privilege. For example, the lower levels of the software stack, such as the operating system level, operate with elevated, internal account privilege and independently from the applications and services that run on them. The higher levels of the software stack may be external to the operating system and not require the same privilege as the lower levels; their privilege may be secured with specific service accounts to run on the lower levels. The same considerations for lower and higher levels on the system side apply to user mode. Code executing on behalf of a user must have demarcated authentication and authorization lines of control before transitioning into internal context for execution. Every user access and interface to the system must be secured. Role-based access control could be used to differentiate user access. This enables separation of concerns, such as between user and administrator work. It also facilitates specifying security policies, and changes to them, without affecting the data. Access and privilege to user data should be as granular as possible so that changes to one may not affect another. As long as policies and system can be separated, we can change the policies without affecting the system. This comes in useful in throttling and controlling access to a server which may be under duress from excessive and unwanted connections. Authentication and encryption protect the data in transit, so we use them for all network connections. In the case of web traffic, we prefer https over http. But it is not just the higher privileged user that we want to secure. Connections coming in with the lowest classification, such as the anonymous or guest user, must have the least privilege to run code anywhere in the system. This is particularly important in cloud computing, where we rely on authentication tokens as a form of user credentials. If a token issuer malfunctions, it could allow an anonymous user the same privilege as a regular user. Therefore security reviews with Strengths-Weaknesses-Opportunities-Threats (SWOT) analysis need to be conducted on all flows, both control and data, through the system. A fail-fast mechanism is used to prevent unwanted access or execution; consequently, most routines at every level and entry point check their parameters and validate their assumptions (a small sketch follows the list below). The Cloud Security Alliance announced ten major challenges in big data security and privacy and suggests best practices for their mitigation. These include:
1. Secure code  - check authentication, authorization, audit and security properties, mask personally identifiable information
2. Secure non-relational data stores - use fuzzing techniques for penetration testing.
3. Secure data storage and logs using Secure Untrusted Data Repository (SUNDR) methods
4. Secure devices - use conventional endpoint protection discussed earlier along with anomaly detection
5. Secure based on monitoring - This applies at the cloud level, cluster level, application level and so on. Compliance standard violations should also be monitored.
6. Protect data privacy - Code does not differentiate which data to operate on if both are valid; policies are used instead. Similarly, if data exists in two or more sources, it could be correlated by a malicious user.
7. Secure data with encryption - perform encryption without sharing keys
8. Secure with restrictions implemented by granular access control and federation.
9. Secure with granular audits - use a Security Information and Event Management (SIEM) solution
10. Secure provenance data that describes how the data was derived, even if it bloats the size.
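As a minimal sketch of the fail-fast parameter checking mentioned above (all names here are hypothetical):

// validate inputs at the entry point and fail fast before doing any work
public static Order GetOrder(string userId, int orderId)
{
    if (string.IsNullOrEmpty(userId))
        throw new ArgumentException("userId is required", nameof(userId));
    if (orderId <= 0)
        throw new ArgumentOutOfRangeException(nameof(orderId));
    // only now proceed to authorization checks and the actual lookup
    return LookupOrder(userId, orderId); // hypothetical helper
}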


Tuesday, May 30, 2017

We continue our discussion of system design for the online store as mentioned here and here. We now discuss the data storage aspects across all services from the point of view of scalability. We assume the store will have an ever-growing number of users and plan accordingly. The services will need to store large volumes of data. This data will be both user data and logs. The user portal may be composed of data from many different data sources. Images and large static content will likely be served from storage that is optimized for blobs. Most of the per-user information is stored in sharded relational databases. High-volume short text, such as from community feedback forums, social engineering transcripts, chats and messages, and ticket and case troubleshooting conversations, will likely be stored in a large distributed key-value store. A conventional relational database may be used as a queuing system on top of this store. Almost all of this data still corresponds to individual users; it is data generated by users. Unlike user data, log data is generated by the system from the various operations of the services in the form of log events. Log events help with analytics. For example, the log events may be used for correlation and as feedback, which then leads to improvements in the operations of the services. This feedback-improvement virtuous cycle can go on and on regardless of which user is using the system. Log events translate to feedback only with analytics. For example, users may be shown the trending bestsellers or newcomers to the store; this may require correlation and collaborative filtering to provide a ranked list. Analytics also come with rich charts. User and log data may be used in many other ways. Data may appear in the form of feeds to users to improve the shopping experience around a product. Data may come in the form of recommendations, such as people who liked this also liked that. Data may be represented as a graph and used with search. Data may also be used for the integrity of the site. Ads, reviews and insights may also appear as additional data while being separate and distinct in their purpose or usage. Data expands possibilities for the business and hence it eventually becomes the center of gravity. For example, logging may channel all logs to a central repository, which may grow over time into a time-series database on a dedicated cluster or a data warehouse. As data expands, scalability concerns grow. Systems may mature, but when size grows, even architectures change. The embrace of Big Data over relational is a trend that comes directly from scalability. Developers may find it enticing to use SQL statements instead of MapReduce to get to the same result; consequently they may require an additional stack over the data. Visualization will pull data and will also come with its own stack. Data tools may evolve over the data stack, and the tools and stack will both evolve to better suit scalability and functionality. The design discussion here borrows from what has already been shown to work in companies like Facebook that have grown significantly.
#codingexercise
input [2,3,1,4]
output [12,8,24,6]

Multiply all the fields except the one at its own position.
List<int> GetSumProduct(List<int> A)
{
    Debug.Assert(A.Any(x => x == 0) == false); // division below requires no zeroes
    var product = 1;
    A.ForEach(x => { product *= x; });
    var ret = new List<int>();
    A.ForEach(x => { ret.Add(product / x); });
    return ret;
}
If we were to avoid division, we could multiply together every entry other than the current one in each iteration. If we were to make it linear and still avoid division, we would keep track of front and rear running products in separate passes, starting each at 1, and combine them for the results, as sketched below.
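A minimal sketch of that linear, division-free variant; the helper name GetProductExceptSelf is ours:

List<int> GetProductExceptSelf(List<int> A)
{
    var ret = new List<int>(new int[A.Count]);
    int front = 1;
    for (int i = 0; i < A.Count; i++)
    {
        ret[i] = front; // product of all elements before i
        front *= A[i];
    }
    int rear = 1;
    for (int i = A.Count - 1; i >= 0; i--)
    {
        ret[i] *= rear; // multiply in the product of all elements after i
        rear *= A[i];
    }
    return ret;
}

With the input [2,3,1,4] this also yields [12,8,24,6].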

Monday, May 29, 2017

Today we continue the system design discussion on the online retail store and look at some of the services in detail. For example, the card payment service is a popular service since all the transactions are billed to a card. Card numbers are not stored as is because they are sensitive information; instead they are encrypted and decrypted. The encrypt function, which encrypts str using pass_str as the password, gives a result that is a binary string of the same length as str. The decrypt function decrypts the encrypted string crypt_str using pass_str as the password. The card information is associated with the customer, which is usually stored in the accounts table. Charges to the card are facilitated through programmatic access to a third-party bank service that enables posting a charge to the card. Cards can also be specific to the store, and these may also carry monetary value, in which case the numbers for the card are generated randomly. These special cards are stored with their balance. Order processors are usually implemented as message queue processors which are written specifically to handle different kinds of transactions based on the incoming messages. The message queue may be its own cluster with journaling and retries specified. The performance of the message processor must be near real time and support millions of transactions a day. It provides an async mechanism to deduct from the card or to reload it. Point-of-sale registers send transaction requests to the order processors through API services. The orders flowing through the processors, on fulfilment, automatically update the account and the associated card, which is read by the application and the point-of-sale register. There are some interesting edge cases to be considered for the spend and reload activities on the card so that the updated balance remains accurate. Usually the payment activity on the card is the slowest transaction, as it is routed through the third-party vendor with errors handled through user interaction, but this is generally overcome by keeping it as the last step of the user interaction. Many orders are executed within a transaction scope, but with states being maintained that move progressively forward as initiated, processed, and completed or failed, the operations can be structured for retries and re-entrancy (a small sketch follows below). The purchases from the store are added to the account holder's history with the help of a transactions table. Much of the online shopping experience continues to be driven by a simple relational database as the data platform of choice. The adoption of cloud services has increased the use of such data storage options in the cloud. Accounts data enables the authentication module to keep the user signed in across different services. This is usually achieved with some form of API authentication and a portal membership provider that facilitates a session for the account holder. Popular mechanisms include OAuth, JWT and such others. Tokens are issued on successful authentication and refreshed on expiry. The transactions with the store are important for customer satisfaction. A variety of machine learning techniques such as grouping and collaborative filtering are used to make recommendations to users in the form of "customers who purchased this also purchased those".
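A minimal sketch of the progressive order states mentioned above, making a retried message a no-op once the order has moved forward (all names here are hypothetical):

enum OrderState { Initiated, Processed, Completed, Failed }

class Order
{
    public int Id;
    public OrderState State = OrderState.Initiated;
}

static class OrderProcessor
{
    // re-entrant: processing an already-advanced order does nothing
    public static void Process(Order order)
    {
        if (order.State != OrderState.Initiated)
            return; // duplicate or retried message; already past this step
        order.State = OrderState.Processed;
        // ... perform the card deduction here; on success:
        order.State = OrderState.Completed;
    }
}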
#codingexercise
Calculate the H-index of an array of non-negative integers.
This is the largest value h such that at least h of the entries are greater than or equal to h.
int GetHIndex(int[] A)
{
    // t[k] counts values equal to k, with values above A.Length clamped to A.Length
    var t = new int[A.Length + 1];
    for (int i = 0; i < A.Length; i++)
        t[Math.Min(A.Length, A[i])]++;
    int sum = 0; // running count of values >= i
    for (int i = t.Length - 1; i >= 0; i--)
    {
        sum += t[i];
        if (sum >= i)
            return i;
    }
    return -1;
}
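For example, GetHIndex(new int[] {0, 1, 3, 5, 6}) returns 3: three of the values are at least 3, but only two are at least 4.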