Friday, June 30, 2017

Today we continue discussing another System Design question. We have been covering data architectures and store services. We were reviewing the Payments system from Airbnb. Airbnb is a global online marketplace, and its payments platform is one of the largest in-house payment systems. They built three standalone systems - a backend billing API, a payment gateway to external processors, and a financial pipeline for Spark-based financial reporting. The platform events from billing and the payment events from the gateway flow to the pipeline via Kafka. The billing interface allows items beyond the usual reservation and payment - reservation nights, fees, taxes, photography costs, coupons - by representing them with abstract data models and attributes. The checkout flow thereby becomes general to all items and streamlined, while publishing events to all interested subscribers.
We were comparing their Kafka and data-pipeline based solution with SQL database clusters. The SQL database cluster is usually used as a failover cluster, not a performance or parallel cluster. Although it uses a SAN, that is shared storage which stays online while the instance fails over between nodes. The clustering does not save space or effort for backup or maintenance, it does not scale out reads for the database, and it does not give 100% uptime. Similar considerations apply when the database server is hosted in a container and the database sits on a shared volume. Still, a database server is capable of scaling up for large online transaction processing. The reporting stack is usually a pull and transformation operation on any database and is generally independent of the data manipulation from online transactions.
The analysis platform can now be built on top of the data pipeline. With Apache Spark, for instance, we can run SQL queries as well as process streaming events.
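As a rough illustration of the abstract item model and event publication described above, here is a minimal C# sketch. The BillingItem class, the IEventPublisher interface and the topic name are hypothetical, not Airbnb's actual schema; serialization via Newtonsoft.Json is just one possible encoding for the events that flow to Kafka.
using System;
using Newtonsoft.Json;

// Hypothetical abstract billing item: reservation nights, fees, taxes,
// photography costs and coupons all share the same shape.
public class BillingItem
{
    public Guid Id { get; set; } = Guid.NewGuid();
    public string ItemType { get; set; }     // e.g. "reservation_night", "fee", "tax", "coupon"
    public string Currency { get; set; }
    public decimal Amount { get; set; }
    public DateTime CreatedAt { get; set; } = DateTime.UtcNow;
}

// Hypothetical publisher boundary; in this design the events would land on a
// Kafka topic that the financial pipeline consumes.
public interface IEventPublisher
{
    void Publish(string topic, string payload);
}

public static class Checkout
{
    // The checkout flow treats every item uniformly and emits one event per item.
    public static void EmitBillingEvents(IEventPublisher publisher, params BillingItem[] items)
    {
        foreach (var item in items)
            publisher.Publish("billing-events", JsonConvert.SerializeObject(item));
    }
}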
#codingexercise
Check if a subset of an array has a given sum
bool HasSubsetSum(List<int> A, int n, int sum)
{
   // Classic subset-sum recursion over the first n elements of A (assumes non-negative values).
   if (A == null) return false;
   if (sum == 0)
     return true;
   if (n == 0 && sum != 0)
     return false;
   // The last element cannot belong to the subset if it already exceeds the target.
   if (A[n-1] > sum)
     return HasSubsetSum(A, n-1, sum);
   // Either exclude the last element, or include it and reduce the target.
   return HasSubsetSum(A, n-1, sum) ||
          HasSubsetSum(A, n-1, sum - A[n-1]);
}

Thursday, June 29, 2017

Today we continue discussing another System Design question. We have been covering data architectures and store services. We were reviewing the Payments system from Airbnb. Airbnb is a global online marketplace, and its payments platform is one of the largest in-house payment systems. They built three standalone systems - a backend billing API, a payment gateway to external processors, and a financial pipeline for Spark-based financial reporting. The platform events from billing and the payment events from the gateway flow to the pipeline via Kafka. The billing interface allows items beyond the usual reservation and payment - reservation nights, fees, taxes, photography costs, coupons - by representing them with abstract data models and attributes. The checkout flow thereby becomes general to all items and streamlined, while publishing events to all interested subscribers.
At Airbnb, cash payment is a common practice, so they developed reservation statuses as well as a timer shown to both parties. Their payment platform implements asynchronous payments, which enables new payment methods to be added for different countries. The payment events have a well-defined schema and flow through Kafka.
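A minimal sketch of what such an asynchronous flow might look like is given below; the status names and the expiry-window mechanics are assumptions for illustration, not the actual Airbnb implementation.
using System;

// Hypothetical reservation payment statuses for an asynchronous method such as a
// cash voucher: the reservation is held while the guest pays offline, and a timer
// expires the hold if no payment event arrives in time.
public enum PaymentStatus { AwaitingPayment, Paid, Expired }

public class ReservationPayment
{
    public PaymentStatus Status { get; private set; } = PaymentStatus.AwaitingPayment;
    public DateTime Deadline { get; }   // the timer shown to both guest and host

    public ReservationPayment(TimeSpan window) { Deadline = DateTime.UtcNow + window; }

    // Invoked when the payment event arrives from the gateway via Kafka.
    public void OnPaymentReceived()
    {
        if (Status == PaymentStatus.AwaitingPayment && DateTime.UtcNow <= Deadline)
            Status = PaymentStatus.Paid;
    }

    // Invoked by a background timer; frees the reservation if the guest never paid.
    public void OnTimerElapsed()
    {
        if (Status == PaymentStatus.AwaitingPayment && DateTime.UtcNow > Deadline)
            Status = PaymentStatus.Expired;
    }
}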
They developed a Spark- and Scala-based financial reporting application. This financial data pipeline replaces the SQL-based reporting system, with the added benefits that it is easier to debug, maintain and extend than the legacy system and can scale horizontally. If we contrast this with SQL Server on a Windows failover cluster as a high-availability solution, we see the added cost of shared storage such as a SAN. That shared storage allows the SQL Server instance to move to any other node, since there is only one copy of the data. This was helpful if a host bus adapter failed or if the distributed transaction coordinator had to be failed over as well.
#codingexercise
Check if a subset of an array has a sum divisible by K
bool HasSubsetDivByK(List<int> A, int K)
{
  // Pigeonhole: with more than K elements, two prefix sums share a remainder,
  // so a subset with sum divisible by K must exist.
  if (A.Count > K) return true;
  var dp = new bool[K];   // dp[j] == true when some non-empty subset has sum % K == j
  for (int i = 0; i < A.Count; i++)
  {
    if (dp[0]) return true;
    int r = ((A[i] % K) + K) % K;   // normalize so negative values also work
    var next = new bool[K];
    for (int j = 0; j < K; j++)
    {
      if (dp[j] && !dp[(j + r) % K])
        next[(j + r) % K] = true;
    }
    for (int j = 0; j < K; j++)
    {
      if (next[j])
        dp[j] = true;
    }
    dp[r] = true;   // the element by itself is also a subset
  }
  return dp[0];
}
Because the remainders are normalized before indexing, the same code works for both positive and negative integers.

Wednesday, June 28, 2017

Today we discuss another System Design question. We have been covering data architectures and store services. We now look at the Payments system, which is central to most stores. We consider one such payment platform that has become truly global as it helps travelers throughout the world. Airbnb is a global online marketplace, and its payments platform is one of the largest in-house payment systems.

Airbnb initially managed payments offline. Payments were made as cash exchanges, and travelers and their hosts would often find themselves in an awkward position if an ATM was not around. This is how a payments infrastructure came to be required as part of the Airbnb system. Bidirectional transfers were not immediately available in off-the-shelf payment products, so a solution was implemented through PayPal, where the host would receive a paper check.

Writing checks by hand did not scale as PayPal usage grew, so support for credit cards and automated bank transfers was added. This still made it difficult to connect global travelers. For example, European guests in the United States had to pay an additional conversion fee. Credit cards were not common in some countries. And hosts outside the United States did not want US dollars in their PayPal accounts. Brazil was particularly notorious for its numerous competing payment networks, which did not work internationally. Support for both local credit cards and cash transfers was required during the Rio Olympics. So a new payment form called the Boleto was introduced - an invoice for the amount owed to Airbnb. Guests could take the Boleto to their bank and pay the invoice, and the reservation would show as confirmed on Airbnb once the payment was received. By making Airbnb the facilitator of payments, both host and guest were freed from the hassle of direct exchange.

Boletos introduced asynchronous payments, which in turn expanded support for new payment methods. In the beginning, payment integrations were each done from scratch and added to the system. This meant more software maintenance as integrations multiplied. Initially, ActiveRecord models were used with Rails to access the database. The payment objects were mutable and there was no change tracking enabled in the application. Consequently, triggers were written in the database to capture the changes in a separate audit trail table. This custom change tracking and reverse-engineering of audit trails required elaborate SQL queries. When these queries grew more and more complex, together with a dramatic increase in the volume of records, the reporting database began to strain. All three limitations - new integrations, the billing and payment interface, and reverse-engineering from audit trails - motivated a new platform with three standalone systems: a backend system that processed bills, a gateway that allowed payments to be processed by external processors, and Kafka, which streamed both kinds of events to a financial pipeline that enabled reporting using Spark.
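To make the contrast with mutable records and trigger-built audit tables concrete, here is a minimal C# sketch of the append-only event style; the type and field names are illustrative, not Airbnb's actual schema.
using System;
using System.Collections.Generic;
using System.Linq;

// With mutable records, history has to be reverse-engineered from an audit table
// maintained by triggers. With an append-only event log, every state change is a
// record of its own, and the current state is just a fold over the events.
public class PaymentEvent
{
    public Guid PaymentId { get; set; }
    public string EventType { get; set; }   // e.g. "authorized", "captured", "refunded"
    public decimal Amount { get; set; }
    public DateTime OccurredAt { get; set; }
}

public class PaymentLedger
{
    private readonly List<PaymentEvent> events = new List<PaymentEvent>();

    public void Append(PaymentEvent e) { events.Add(e); }   // events are never updated in place

    // Net amount captured for a payment, derived from its immutable history.
    public decimal NetCaptured(Guid paymentId)
    {
        return events.Where(e => e.PaymentId == paymentId)
                     .Sum(e => e.EventType == "captured" ? e.Amount
                             : e.EventType == "refunded" ? -e.Amount
                             : 0m);
    }
}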

This case study makes for an interesting comparison of SQL + MSMQ with Kafka and Spark. In the payments mechanism study at a retail company discussed earlier, we saw how events flowed into message queues that were handled by dozens of processors, sometimes in series, to update relevant records in transactional databases, while reports flowed from a warehouse that drew data from the operational side using traditional and perhaps classic methods. Capacity was increased by clustering the queues and the databases separately. Since the methods were traditional, their migration to cloud-based services was well understood. The Apache stack is also portable and can run in a variety of modes from standalone to cloud and with a variety of data sources. Public clouds offer support for both open source and commercial stacks in addition to offering their own cloud-native data pipelines and analysis services. The Airbnb approach highlights standalone systems which can be deployed anywhere.
Courtesy: Airbnb blog
#codingexercise
In yesterday's coding exercise, we counted subarrays by grouping prefix sums with the same remainder modulo K.
We can also count all possible subarrays that do not have a sum divisible by K.
Get count of subarrays not divisible by K
int GetCountNotDivByK(List<int> A, int k)
{
    var mod = new int[k];
    int sum = 0;
    for (int i = 0; i < A.Count; i++)
    {
        sum += A[i];
        mod[((sum % k) + k) % k] += 1;   // remainder of each prefix sum, normalized
    }
    // Subarrays divisible by k: pairs of prefix sums with the same remainder,
    // plus the prefixes that are themselves divisible.
    int divisible = 0;
    for (int i = 0; i < k; i++)
        if (mod[i] > 1)
            divisible += (mod[i] * (mod[i] - 1)) / 2;
    divisible += mod[0];
    // Subtract from the total number of subarrays.
    int all = A.Count * (A.Count + 1) / 2;
    return all - divisible;
}

Tuesday, June 27, 2017

Today we continue with the design discussion for the transcription service. It involves a frontend user interface hosted from the cloud so that users can upload files directly to the service over a browser. The service itself is implemented using web APIs that allow ingestion, extraction, merging of transcripts as captions, and indexing of the transcript. Speech recognition software may be used to convert some or all of these audio files or files generated from other formats. Text generated from such software can be cleaned using an edit-distance algorithm to match the nearest dictionary word, or corrected via the context. Text may also be substituted or enhanced with grammar corrections and suitable word alterations. This may be optional for certain files, as the enhancement may alter the original text, which may not be acceptable for poems, ballads or other such forms of speech. Each enhancement could be a separate instance of the file and can be persisted as yet another document. This service should be capable of scaling out to serve hundreds of thousands of customers who use personal assistant devices such as Alexa or Siri, or their favorite telecom provider, to record their conversations or meetings. A transcription service for workplace conference rooms will eliminate the need for keeping minutes if it can convert the segmented audio streams into transcripts. Cleaned text can then be stored and indexed along with the location of the original file so that it can participate in retrievals. Data may be secured per user and consequently organized by user information. This means we don't necessarily have to keep the documents in a single database or a NoSQL store, nor do we have to keep them in a public cloud database. The suggestion here is that files can be uploaded as S3 objects while their resource locators are stored in a database. The transcripts, on the other hand, along with the resource locations, are handy inside a public cloud database. These databases allow querying, document classification, ranking, sorting and collaborative filtering much better than if the data were in the form of files. Since the databases can be in the public cloud, they can grow arbitrarily large and still scale. We don't have a need for NoSQL data, but if the transcripts and the audio files can be translated to a bag of words and we want to use MapReduce, then we can extract-transform-load from the database to a NoSQL store on a periodic basis. However, even machine learning algorithms are now executable inside a Python module within the database server. This means that depending on the analysis we want to perform, and whether the processing is offline and batch, we can choose the data stores. Finally, the user accounts and metadata for transcripts are operational data and belong in the database server.
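A rough sketch of the storage layout suggested above - audio in object storage with only its locator in the database, next to the searchable transcript. The type names, fields and the example URI are hypothetical, used for illustration only.
using System;
using System.Collections.Generic;

// Hypothetical metadata record: the audio stays in S3 and only its locator is kept
// in the database, alongside the cleaned transcript text that gets indexed for search.
public class TranscriptRecord
{
    public Guid Id { get; set; }
    public string UserId { get; set; }                   // data is organized per user
    public string AudioObjectUri { get; set; }           // e.g. "s3://bucket/recordings/meeting-42.wav" (illustrative)
    public string TranscriptText { get; set; }           // cleaned text used for retrieval
    public string EnhancedTranscriptText { get; set; }   // optional grammar-corrected variant
    public DateTime UploadedAt { get; set; }
}

// Hypothetical repository boundary; any relational store with full-text indexing
// over TranscriptText could implement it.
public interface ITranscriptStore
{
    void Save(TranscriptRecord record);
    IEnumerable<TranscriptRecord> Search(string userId, string phrase);
}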
#codingexercise
Get count of subarray divisible by K
int GetCountDivByK(List<int> A, int k)
{
    var mod = new int[k];
    int sum = 0;
    for (int i = 0; i < A.Count; i++)
    {
        sum += A[i];
        mod[((sum % k) + k) % k] += 1;   // remainder of each prefix sum, normalized
    }
    // Two prefix sums with the same remainder bound a subarray divisible by k.
    int result = 0;
    for (int i = 0; i < k; i++)
        if (mod[i] > 1)
            result += (mod[i] * (mod[i] - 1)) / 2;
    result += mod[0];   // prefixes that are themselves divisible by k
    return result;
}
Testing for divisibility is simple
bool HasDivByK(List<int> A, int k)
{
    var mod = new int[k];
    int sum = 0;
    for (int i = 0; i < A.Count; i++)
    {
        sum += A[i];
        mod[((sum % k) + k) % k] += 1;
        // Either some prefix sum is divisible by k, or two prefix sums share a remainder.
        if (mod[0] > 0 || mod[((sum % k) + k) % k] > 1) return true;
    }
    return false;
}
#cc_document https://1drv.ms/w/s!Ashlm-Nw-wnWsB5CqwSCCxBs9bjc

Monday, June 26, 2017

Today we continue our discussion on system design. This time we cover Splunk. Splunk is a scalable time-series database arranged as events in buckets - hot/warm, cold or thawed. These are stored as index files together with some metadata. There are special buckets called introspection. The architecture consists of lightweight forwarders, indexers and search heads, each with its own topology and built to scale. The forwarders are the collection agents that gather machine data from the customer. The indexers receive the events and handle the bulk of the operations. The search heads present analysis tools, charts and management interfaces. Splunk has recently added analysis features based on machine learning. Previously most of the search features were based on Unix-like command operators that became quite popular and boosted Splunk's adoption as the IT tool of choice, among other usages. There are a variety of charting tools; the frontend is based on JavaScript while the middle tier is based on Django. The indexers are written in C++ and come with robust capabilities. It is important to note that their database, unlike conventional relational or NoSQL stores, was designed primarily for specific usages. If they moved their database to commodity or platform options in the public cloud, they could evolve their frontend beyond a single enterprise or localhost instance and provide isolated, subscription-based cloud storage per customer - Splunk as a cloud and browser based service.
Next, we cover Consistent Hashing - a notion that is quietly finding its way into several distributed systems and services. Initially we had a cache distributed among n servers as hash(o) modulo n. This had the nasty side effect that when one or more servers went down or were added to the pool, every object in the cache would map to a different server because n changed. Consistent hashing instead accommodates new servers and retires old ones by arranging the hashes around a circle with cache points. When a cache is removed or added, only the objects whose hashes fall along the affected arc of the circle move clockwise to the next cache point. It also introduced "virtual nodes", which are replicas of cache points on the circle. Since objects may be distributed non-uniformly across caches, the virtual nodes spread each cache over many points on the circle and even out the distribution.
public class ConsistentHash<T> {
    private SortedDictionary<int, T> circle = new SortedDictionary<int, T>(); // Usually a TreeMap is used, which keeps the keys sorted; SortedDictionary plays that role in C#.
Since a number of replicas are maintained, the replica number may be added to the string representation of the node before it is hashed. A good example is memcached clients, which use consistent hashing. A fuller sketch of the ring follows below.
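Here is a minimal, self-contained sketch that completes the fragment above. MD5 as the hash and 100 virtual nodes per server are illustrative choices, not a specific library's implementation.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public class ConsistentHash<T>
{
    private readonly SortedDictionary<int, T> circle = new SortedDictionary<int, T>();
    private readonly int replicas;

    public ConsistentHash(int replicas = 100) { this.replicas = replicas; }

    private static int Hash(string key)
    {
        using (var md5 = MD5.Create())
        {
            var bytes = md5.ComputeHash(Encoding.UTF8.GetBytes(key));
            return BitConverter.ToInt32(bytes, 0) & int.MaxValue;   // keep it non-negative
        }
    }

    public void AddNode(T node)
    {
        for (int i = 0; i < replicas; i++)
            circle[Hash(node.ToString() + ":" + i)] = node;         // virtual nodes
    }

    public void RemoveNode(T node)
    {
        for (int i = 0; i < replicas; i++)
            circle.Remove(Hash(node.ToString() + ":" + i));
    }

    public T GetNode(string key)
    {
        if (circle.Count == 0) throw new InvalidOperationException("no nodes in the ring");
        int h = Hash(key);
        // Walk clockwise to the first cache point at or after the key's hash.
        foreach (var entry in circle)
            if (entry.Key >= h) return entry.Value;
        return circle.First().Value;   // wrap around the circle
    }
}
Adding or removing a node only remaps the keys that fall on the arcs owned by that node's virtual points, which is the property that makes the scheme attractive for memcached-style caches.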
#codingexercise
Find whether there is a subset of numbers in a given integer array that, when ANDed with the given number, results in zero.
    // ANDing can only clear bits, so the smallest achievable result comes from ANDing
    // Z with every element; if even that is non-zero, no subset can reach zero.
    static bool IsValid(List<int> items, int Z)
    {
        var res = new BitArray(new int[] { Z });
        foreach (var item in items)
        {
            var b = new BitArray(new int[] { item });
            res = res.And(b);
        }
        int[] result = new int[1];
        res.CopyTo(result, 0);
        return result[0] == 0;
    }

Sunday, June 25, 2017

Closed Captioning – a software driven approach
Introduction: Web pages, images and videos are created, uploaded and shared everywhere using online sites and services from well-known companies. However, voice as a medium of expression, content and publishing has not found such popularity. This write-up briefly describes a software approach to making speech content relevant to web searches via transcription and closed captioning.
Description:
Data captured as speech or voice is usually an audio file that is more difficult to search than text, for much the same reasons that images are harder to search than text. A text representation of the content, whether audio or video, in the form of metadata is far more helpful because it conforms to the traditional format required for software-based search. Native interpretation of image and voice files requires pattern recognition and speech recognition software respectively, which are time-consuming, costly and prone to failure. This makes it difficult to perform at the same time as the data becomes available, so some out-of-band processing is involved to add relevant text data to such difficult representations. When voice is streamed over the internet as part of a video, closed captioning packets can be overlaid on the data and rendered at the user's viewing device. This textual content can also be archived and searched in the same way as we search web pages or documents on a user's computer. The same holds true for transcription and captioning services. Captioning tools such as Camtasia, Captionate, Express Scribe, MAGpie2 and Overstream offer a variety of features to help with the textual representation of voice. They can be used to create transcripts or caption streams. There are also fully managed services that handle all stages of processing: taking your video, making a transcript, adding closed captions and returning the file to you. Transcripts can be created manually, such as with ExpressScribe or WriteNaturally, or with voice recognition software from the consumer-oriented digital companies. These can then be merged as captions using tools such as Camtasia. YouTube also has a feature to add closed captions to videos. The data from captions or transcripts can be mined just as usefully as the metadata of the video or audio files. A background speech-to-text conversion service can automatically add such desirable content to existing collections and archives - some of which, in the form of speech, songs, narrations and commentaries, are considered classics to this day. Moreover, it is the ability to generate this data automatically for all new content as it becomes available that makes it appealing to combine with searches to produce relevant results.
Conclusion: While speech recognition software is widely understood to be error-prone, an offline service that automates the derivation of transcripts and captions from existing and new content can prove valuable. Scenarios such as finding a song by searching for part of its lyrics, whether or not it is available in video format, are appealing enhancements to any web search. Similarly, governments may find it useful to transcribe conversations over mobile devices for text analysis and data mining. Also, consumer electronics such as Alexa or Siri that interact with users may keep track of these conversations for analysis related to personalization-based marketing. These seem like the tip of an iceberg of data.
#codingexercise
Given a geometric progression series starting with 1 as the first term and values of r, S and p where:
r = common ratio of the progression
S = sum of first N terms modulo p
p = a prime number
Find N
    // Returns the smallest N such that the sum of the first N terms of the progression
    // 1, r, r^2, ... equals S modulo p, or -1 if no such N is found within p terms.
    // The running power is reduced modulo p at each step to avoid the precision loss
    // of Math.Pow for large exponents.
    static int GetN(int r, int S, int p)
    {
        int res = -1;
        long term = 1;   // r^k mod p
        long val = 0;    // sum of the first k+1 terms, mod p
        for (int k = 0; k < p; k++)
        {
            val = (val + term) % p;
            if (val == S)
            {
                res = k + 1;
                break;
            }
            term = (term * r) % p;
        }
        return res;
    }

Saturday, June 24, 2017

Today we discuss the Cloudera architecture. Cloudera is known for its data management software that can be deployed and run on any cloud. It offers an enterprise data hub, an analytics DB, an operational DB, data science and engineering, and essentials. It is elastic and flexible, it has high-performance analytics, it can easily provision over multiple clouds, and it can be used for automated metering and billing. Essentially, its big data platform allows different data models, real-time data pipelines and streaming applications. It lets data models break free from vendor lock-in and be community defined. Moreover, it lets the database scale by hosting it in the cloud. The data science workbench offered by Cloudera involves a console on a web browser; users authenticate themselves with Kerberos against the cluster KDC. Engines are spun up and can seamlessly connect with Spark, Hive, and Impala. The engines are spun up based on engine kernels and profiles. A command prompt and an editor are both available for interactive command execution. Code is executed in a session, and we can quit the session. The workbench uses Docker and Kubernetes to manage containers. Cloudera is supported on dedicated Hadoop hosts. Cloudera also adds a data engineering service called Altus. It is a platform that works against a cloud by allowing clusters to be set up and torn down and jobs to be submitted to those clusters. Clusters may be Apache Spark, MR2 or Hive. Behind the data engineering service, these clusters are provisioned using EC2 and S3: EC2 provides the compute layer and S3 provides the storage layer. It may be interesting to note that the clusters are not provisioned with an Apache Mesos and Marathon stack, and the storage is not provided on other file- or database-based technologies, but this could change in the future. Containerization technologies and backend-as-a-service, aka lambda functions, could also be supported. The main notion here is that Cloudera works with existing public clouds while it offers an enterprise manager for on-premise solutions. They provide strong management capabilities, they are open-sourced, and they provide broad platform support with the right mix of open source tools for analytics and operations. They even draw parallels with Splunk, which happens to be a machine data platform. Cloudera may well benefit from the same practice around Sustaining Engineering and Support that endeared Splunk to its customers. It may be interesting to find out whether Cloudera can slip in as a data platform under Splunk so that Splunk becomes less proprietary and embraces Cloudera's community model. The time-series database for Splunk can be left alone as only the cluster operations are migrated.
#codingexercise
Given a rod and a price chart for different lengths, determine the cut lengths that result in the maximum total price.
int GetMaxPrices(List<int> p, int n)
{
    // dp[i] holds the best total price obtainable from a rod of length i.
    // p[k-1] is taken to be the price of a piece of length k.
    var dp = new int[n+1];
    dp[0] = 0;
    for (int i = 1; i <= n; i++)
    {
        var price = int.MinValue;
        for (int k = 1; k <= i; k++)
        {
            price = Math.Max(price, p[k-1] + dp[i-k]);
        }
        dp[i] = price;
    }
    return dp[n];
}
In order to determine the cut lengths themselves, we can also record, for each position, the cut length that produced the best price. Then, starting from position n, we repeatedly follow those recorded cuts back to zero to enumerate the pieces, as sketched below.
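A minimal sketch of that backtracking, reusing the same conventions as GetMaxPrices above (p[k-1] is the price of a piece of length k):
List<int> GetMaxPriceCuts(List<int> p, int n)
{
    var dp = new int[n+1];
    var cut = new int[n+1];   // cut[i] = length of the first piece in the best cut of a rod of length i
    for (int i = 1; i <= n; i++)
    {
        var price = int.MinValue;
        for (int k = 1; k <= i; k++)
        {
            if (p[k-1] + dp[i-k] > price)
            {
                price = p[k-1] + dp[i-k];
                cut[i] = k;
            }
        }
        dp[i] = price;
    }
    var pieces = new List<int>();
    for (int rem = n; rem > 0; rem -= cut[rem])
        pieces.Add(cut[rem]);   // walk back through the recorded cuts
    return pieces;
}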