Monday, July 3, 2017

Today we continue to review the NoSQL design. These systems are key-value stores that run on a large number of commodity machines, with data partitioned and replicated between machines. We discussed the infrastructure, which consists of physical and virtual nodes. Data is partitioned based on consistent hashing: cachepoints are arranged along a circle depicting the key range, and each node caches the objects corresponding to its range. With multiple copies of data, we need consistency models. These include strict consistency, read-your-write consistency, session consistency, monotonic read consistency and eventual consistency. The choice of consistency model determines how the client request is dispatched to a replica and how the replicas propagate and apply the updates. The topology that helps with this can be classified as a single-master model, a multi-master or no-master model, or a quorum-based 2-PC model that provides strict consistency.
The single-master model has each data partition managed by a single master and multiple slaves. All update requests go to the master, where the update is applied and then propagated to the slaves asynchronously. This scales out the reads to the slaves, unless the most recent updates are required, in which case the reads go to the master. The master is still a virtual node, so when a physical node crashes some of the masters may be lost, in which case the most up-to-date slave may be chosen as the new master. This model works very well for a high read/write ratio, which is the case for most data replication. Replication happens by way of state or operation transfer, which is similar to the well-known log shipping and mirroring options: state gets overwritten on the destination, whereas operations are re-executed on the destination.
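Here is a minimal sketch of the single-master flow in C#; the Operation and Replica types and the fire-and-forget propagation are illustrative assumptions, not a production design:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// hypothetical operation-transfer replication: the master applies an update
// and ships the operation to its slaves asynchronously, like log shipping
class Operation { public string Key; public string Value; }

class Replica
{
    private readonly Dictionary<string, string> store = new Dictionary<string, string>();
    public void Apply(Operation op) { store[op.Key] = op.Value; } // re-execute the operation
    public string Read(string key) { store.TryGetValue(key, out var v); return v; }
}

class Master
{
    private readonly Replica local = new Replica();
    private readonly List<Replica> slaves;
    public Master(List<Replica> slaves) { this.slaves = slaves; }

    // all update requests go to the master; propagation to the slaves is
    // asynchronous (not thread-safe; illustration only)
    public void Update(Operation op)
    {
        local.Apply(op);
        foreach (var slave in slaves)
            Task.Run(() => slave.Apply(op));
    }

    // reads that need the most recent update go to the master;
    // other reads can be served by any slave
    public string ReadLatest(string key) { return local.Read(key); }
}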
The multi-master model is useful for high-write scenarios or when a key range has hot spots. It allows the updates to be spread more evenly because an update can happen at any replica, which is why it is sometimes called the no-master model. With updates happening at any replica, state synchronization becomes important. One such method is quorum-based 2-PC, which uses the traditional two-phase commit protocol to bring all replicas to the same state at every update: there is a prepare phase, when all the replicas flush the pending update to disk, and a commit phase, when each replica writes a log entry to commit the update.
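A sketch of that quorum-based 2-PC update path, assuming a hypothetical IReplica interface for the per-replica prepare and commit steps:

using System.Collections.Generic;
using System.Linq;

interface IReplica
{
    bool Prepare(string key, string value); // flush the pending update to disk
    void Commit(string key);                // write a log entry to commit the update
    void Abort(string key);                 // discard the pending update
}

static class TwoPhaseCommit
{
    public static bool Update(IList<IReplica> replicas, string key, string value)
    {
        // prepare phase: every replica must flush and acknowledge
        var prepared = replicas.Where(r => r.Prepare(key, value)).ToList();
        if (prepared.Count < replicas.Count)
        {
            foreach (var r in prepared) r.Abort(key);
            return false;
        }
        // commit phase: each replica writes its commit log entry
        foreach (var r in replicas) r.Commit(key);
        return true;
    }
}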
Courtesy: NoSQL patterns from horicky
#codingexercise 

1. Given a number as a string of digits, check whether any permutation of its digits is divisible by 8.

    static string CheckDivisibilityBy8(string number)
    {
        // A number is divisible by 8 iff its last three digits are, so it is
        // enough to check whether some multiple of 8 with up to three digits
        // can be formed from the available digits.
        var divs = Enumerable.Range(1, 999).Where(x => x % 8 == 0).ToList();
        // count the occurrences of each digit in the input
        var d = new Dictionary<char, int>();
        for (char c = '0'; c <= '9'; c++)
            d.Add(c, 0);
        for (int i = 0; i < number.Length; i++)
            d[number[i]]++;
        var reqLen = number.Length < 3 ? number.Length : 3;
        for (int i = 0; i < divs.Count; i++)
        {
            // the candidate must be long enough and use only available digits
            if (divs[i].ToString().Length >= reqLen &&
                divs[i].ToString().All(x => d.ContainsKey(x) && d[x] >= divs[i].ToString().Count(t => t == x)))
                return "YES";
        }
        return "NO";
    }
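For example, CheckDivisibilityBy8("61") returns "YES" because the permutation 16 is divisible by 8, while CheckDivisibilityBy8("75") returns "NO" since no arrangement of 7 and 5 is.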
2. Define the similarity of two strings as the length of their longest common prefix. Find the sum of the similarities of a string with each of its suffixes.
    static int GetSimilarity(string A, string B)
    {
        // length of the longest common prefix of A and B
        int count = 0;
        for (int i = 0; i < Math.Min(A.Length, B.Length); i++)
        {
            if (A[i] != B[i])
                return count;
            count++;
        }
        return count;
    }
    static int GetSimilarities(string A)
    {
        // sum up the similarity of A with each of its suffixes, beginning with A itself
        int sum = 0;
        for (int k = 0; k < A.Length; k++)
        {
            var suffix = A.Substring(k);
            var count = GetSimilarity(A, suffix);
            Console.WriteLine("suffix={0},count={1}", suffix, count);
            sum += count;
        }
        return sum;
    }
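For example, GetSimilarities("ababaa") returns 11: the suffixes ababaa, abaa, aa and a share prefixes of length 6, 3, 1 and 1 with the original string, and the remaining suffixes contribute nothing.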

Sunday, July 2, 2017

Today we quickly review the NoSQL design. These systems are key-value stores that run on a large number of commodity machines, with data partitioned and replicated between machines. This kind of system relaxes the data consistency requirement because of the CAP theorem. The infrastructure comprises a large number of cheap commodity nodes in a network. Each node may have the same software configuration but a different hardware configuration, and within each node there may be a variable number of virtual nodes. Data is partitioned based on consistent hashing. We recall that this is a way to distribute the hashtable across many virtual nodes, each of which picks its set of cache objects based on replicas of different cachepoints. Cachepoints are arranged along the circle depicting the key range, with cache objects corresponding to the range. Virtual nodes can join and leave the network without impacting the operation of the ring. When a new node joins the network, it announces its presence and its id with a broadcast. All the neighbors adjust to the change of key ownership, and the joining node starts to bulk-copy data from its neighbors. Other nodes receive the membership change asynchronously; if a node does not have its membership view updated yet, it will still find the right destination through the neighbors of the joining node. However, the new node may still be in the process of downloading the data and not ready to serve yet, so a vector clock is used to determine whether the new node is ready to serve a request. When an existing node leaves the network, it stops responding to the gossip protocol, so its neighbors become aware; they update the membership change and copy the data asynchronously. Generally, crashes happen on the physical nodes, so a number of virtual nodes are affected at once.
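A minimal consistent-hashing sketch in C# follows; the MD5-based hash, the number of cachepoints per node, and the node naming are all illustrative choices:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

class ConsistentHashRing
{
    // cachepoints arranged along the circle of hash values;
    // each maps back to the virtual node that owns that range
    private readonly SortedDictionary<uint, string> ring = new SortedDictionary<uint, string>();
    private readonly int pointsPerNode;

    public ConsistentHashRing(int pointsPerNode = 100) { this.pointsPerNode = pointsPerNode; }

    private static uint Hash(string s)
    {
        using (var md5 = MD5.Create())
            return BitConverter.ToUInt32(md5.ComputeHash(Encoding.UTF8.GetBytes(s)), 0);
    }

    // a joining node announces itself; here that just places its cachepoints
    public void AddNode(string node)
    {
        for (int i = 0; i < pointsPerNode; i++)
            ring[Hash(node + "#" + i)] = node;
    }

    // a leaving or crashed node surrenders its key ranges to the neighbors
    public void RemoveNode(string node)
    {
        foreach (var point in ring.Where(kv => kv.Value == node).Select(kv => kv.Key).ToList())
            ring.Remove(point);
    }

    // the first cachepoint at or after the key's hash owns the key
    public string GetNode(string key)
    {
        uint h = Hash(key);
        foreach (var kv in ring)
            if (kv.Key >= h) return kv.Value;
        return ring.First().Value; // wrap around the circle
    }
}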
With multiple copies of data, we need consistency models. These include strict consistency, read-your-write consistency, session consistency, monotonic read consistency and eventual consistency. The strict consistency model provides semantics as if there were only one copy of the data. The read-your-write consistency model lets clients see their own updates immediately, even on switching servers, but not necessarily the updates made by other clients. Session consistency provides read-your-write consistency within the same session scope. Monotonic read consistency provides a time-monotonicity guarantee: the client will only see more updated versions of the data in subsequent requests. The eventual consistency model allows the client to see an inconsistent view while an update is in progress and is useful for concurrent access; the client will need to wait for some time if it needs to see its previous update. The choice of consistency model determines how the client request is dispatched to a replica and how the replicas propagate and apply the updates.
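As a small illustration, a client can enforce monotonic read consistency on its own by remembering the highest version it has seen and rejecting any older replica response; the versioned value here is an assumption of the sketch:

using System;

class VersionedValue { public long Version; public string Value; }

class MonotonicReadClient
{
    private long lastSeenVersion;

    // readFromAnyReplica stands in for a read dispatched to some replica
    public string Read(Func<VersionedValue> readFromAnyReplica)
    {
        var result = readFromAnyReplica();
        if (result.Version < lastSeenVersion)
            throw new InvalidOperationException("stale replica; retry another");
        lastSeenVersion = result.Version;
        return result.Value;
    }
}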

#codingexercise
Node getLCA(Node root, int n1, int n2)
{
    if (root == null) return null;
    if (root.data == n1 || root.data == n2) return root;
    // recurse into both subtrees; if the values split across them,
    // the current node is the lowest common ancestor
    Node left = getLCA(root.left, n1, n2);
    Node right = getLCA(root.right, n1, n2);
    if (left != null && right != null) return root;
    if (left != null) return left;
    return right;
}
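Note that this assumes both n1 and n2 are present in the tree; if only one is present, that node itself is returned. For a root 1 with children 2 and 3, getLCA(root, 2, 3) returns the root.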

Find the maximum product of the elements of a subset of an array with a given minimum size:
we sort the elements from left to right in increasing order.
Starting with a product of 1, we take multipliers from the left and right ends in a greedy manner.
If the multipliers are negative, we take two at a time so that their product is positive.

Saturday, July 1, 2017

We continue with the discussion on Airbnb data architectures and store services. We were reviewing Payment Systems from Airbnb. Airbnb is a global online marketplace whose payments system is one of the largest in-house payment systems. They had three standalone systems: a backend billing API, a payment gateway to external processors, and a Spark-based financial pipeline for reporting. The billing interface allows the addition of items other than the usual reservation and payment, such as reservation nights, fees, taxes, photography costs and coupons, by representing them with abstract data models and attributes. The checkout flow thereby becomes general to all items and streamlined, while publishing events to all interested subscribers.
The billing interface generates a series of platform events in a consistent schema, which are distributed to processors via the Kafka message bus. The payment gateway provides a unified API that supports forward and backward transactions to and from the processors. With the unified API, a new payment model for a local or country-specific payment method can be easily added and integrated.
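The shape of such a unified API might look like the following sketch; the interface and processor names are hypothetical:

// every external processor implements the same forward (pay-in) and
// backward (refund) calls, so a new local or country-specific payment
// method is just another implementation behind the gateway
interface IPaymentProcessor
{
    string Charge(string account, decimal amount, string currency); // forward
    void Refund(string transactionId, decimal amount);              // backward
}

class LocalBankTransferProcessor : IPaymentProcessor // e.g., a country-specific method
{
    public string Charge(string account, decimal amount, string currency)
    {
        // call out to the external processor and return its transaction id
        return "txn-id";
    }

    public void Refund(string transactionId, decimal amount)
    {
        // reverse the earlier transaction with the external processor
    }
}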
The financial data pipeline is built on Spark and Scala. It tracks receivables, payables, revenues and taxes, and how funds are transferred between these accounts. By the very nature of the funds involved, the bookkeeping is all double-entry: the amounts entering or exiting the accounts are always tallied. The events are converted into accounting information and then filtered into different accounts such as receivables, liabilities and revenue.
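A toy version of the double-entry constraint, with hypothetical account names and event shape:

using System;
using System.Collections.Generic;
using System.Linq;

class Entry { public string Account; public decimal Debit; public decimal Credit; }

class Ledger
{
    private readonly List<Entry> entries = new List<Entry>();

    // a posting is accepted only if its debits and credits tally
    public void Post(params Entry[] posting)
    {
        if (posting.Sum(e => e.Debit) != posting.Sum(e => e.Credit))
            throw new InvalidOperationException("unbalanced posting");
        entries.AddRange(posting);
    }

    public decimal Balance(string account)
    {
        return entries.Where(e => e.Account == account).Sum(e => e.Debit - e.Credit);
    }
}

// a guest payment event filtered into receivables and revenue:
// ledger.Post(new Entry { Account = "Receivable", Debit = 100m },
//             new Entry { Account = "Revenue", Credit = 100m });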
It should stand out that the rearchitecture emphasized a governance-based model for the data architecture. The financial data pipeline was intended for an audience comprising regulators, local governments and Airbnb investors. It is neither customer-centric nor a data warehouse: events are not processed to update customer and billing records in the billing backend, and the data does not flow into a warehouse where analysis of operational data could improve business intelligence with its own stack. This means that additional services can now be written as their own microservices. For example, a rewards service or a stored-value service can be written behind the payment gateway, and a payment-routing service that routes payments through one or more external processors can also be written as a microservice. Some microservices can be pure compute, while those pertaining to storage, such as customer and balances, may be handled with storage APIs. The second thing to observe is the separation between the operational and reporting architectures: events are generated from the operational side only and consumed by the reporting side only. An alternate way would be to use the data warehouse's ingestion capabilities.
#codingexercise
public int maxProduct(List<int> input)
{
    // track the largest (hi) and smallest (lo) products of a subarray ending
    // at the current element; a negative element swaps their roles
    int lo = 1;
    int hi = 1;
    int result = 1; // at least 1, the empty product
    for (int i = 0; i < input.Count; i++)
    {
        if (input[i] == 0)
        {
            // a zero resets any running product
            lo = 1;
            hi = 1;
        }
        else if (input[i] > 0)
        {
            hi = hi * input[i];
            lo = Math.Min(lo * input[i], 1);
        }
        else
        {
            int oldLo = lo;
            lo = hi * input[i];                 // the largest so far becomes the smallest
            hi = Math.Max(oldLo * input[i], 1); // the smallest so far becomes the largest
        }
        if (result < hi)
            result = hi;
    }
    return result;
}
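For example, maxProduct(new List<int> { 2, 3, -2, 4 }) returns 6 (from the subarray {2, 3}), and maxProduct(new List<int> { -2, 3, -4 }) returns 24 (the whole array).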

The method below can also take a parameter for the minimum size of the desired subset.

public int maxProductSubsetSum(List<int> input)
{
    input.Sort();
    int i = 0;
    int j = input.Count - 1;
    int product = 1;
    while (i < j)
    {
        // greedily take the next element or pair of elements that increases the product
        if (input[j] > 1)
        {
            product *= input[j]; // a positive multiplier greater than one always helps
            j--;
        }
        else if (input[i] < 0 && input[i + 1] < 0 && input[i] * input[i + 1] > 1)
        {
            product *= input[i] * input[i + 1]; // two negatives multiply to a positive
            i += 2;
        }
        else
        {
            break; // the rest (zeros, ones, a lone negative) cannot increase the product
        }
    }
    return product;
}
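For example, with input { 3, -5, 1, 2, -4 } the sorted order is { -5, -4, 1, 2, 3 }; the greedy takes 3 and 2 from the right and the pair (-5)(-4) from the left, for a product of 120.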