Tuesday, June 27, 2017

Today we continue the design discussion for a transcription service. It involves a frontend user interface hosted from the cloud so that users can upload files directly to the service over a browser. The service itself is implemented with web APIs that allow ingestion, extraction, merging of transcripts as captions, and indexing of the transcripts. Speech recognition software may be used to convert some or all of the audio files, or files generated from other formats. Text generated by such software can be cleaned with an edit distance algorithm to match the nearest dictionary word, or corrected from context. Text may also be substituted or enhanced with grammar corrections and suitable word alterations. This may be optional for certain files since the enhancement alters the original text, which may not be acceptable for poems, ballads, or other such forms of speech. Each enhancement could be a separate instance of the file and can be persisted as yet another document. The service should be able to scale out to hundreds of thousands of customers who use personal assistant devices such as Alexa or Siri, or their favorite telecom provider, to record their conversations or meetings. A transcription service for workplace conference rooms could eliminate the need for keeping minutes if it can convert the segmented audio streams into transcripts. Cleaned text can then be stored and indexed along with the location of the original file so that it can participate in retrievals. Data may be secured per user, and consequently it may be organized by user information. This means we don't necessarily have to keep the documents in a single database or a NoSQL store, and we don't have to keep them in a public cloud database either. The suggestion here is that files can be uploaded as S3 objects while their resource locators are kept in a database. The transcripts, on the other hand, along with the resource locations, are handy inside a public cloud database. These databases allow querying, document classification, ranking, sorting, and collaborative filtering much better than if the data were kept as files. Since the databases can be in the public cloud, they can grow arbitrarily large and still scale. We don't have a need for NoSQL data, but if the transcripts and audio files can be translated to a bag of words and we want to use MapReduce, then we can extract-transform-load from the database to a NoSQL store on a periodic basis. However, even machine learning algorithms are now executable inside a Python module within the database server. This means that, depending on the analysis we want to perform and whether the processing is offline and batch, we can choose the data stores. Finally, the user accounts and metadata for transcripts are operational data and belong in the database server.
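As a rough illustration of the cleanup step, here is a minimal sketch of edit-distance matching against a dictionary; the helper names and the idea of picking the closest dictionary word are assumptions for this example rather than a prescribed part of the service, and the usual System, System.Collections.Generic and System.Linq namespaces are assumed.
static int GetEditDistance(string a, string b)
{
    // classic dynamic-programming Levenshtein distance
    var dp = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) dp[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) dp[0, j] = j;
    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            dp[i, j] = Math.Min(Math.Min(dp[i - 1, j] + 1, dp[i, j - 1] + 1), dp[i - 1, j - 1] + cost);
        }
    return dp[a.Length, b.Length];
}
static string GetNearestWord(string token, List<string> dictionary)
{
    // pick the dictionary word with the smallest edit distance to the recognized token
    return dictionary.OrderBy(w => GetEditDistance(token, w)).First();
}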
#codingexercise
Get the count of subarrays whose sum is divisible by K
int GetCountDivByK(List<int> A, int k)
{
var mod = new int[k];
int sum = 0;
for (int i = 0; i < A.Count; i++)
{
    sum += A[i];
    mod[((sum % k) + k) % k] += 1; // normalize negative remainders before counting
}
int result = 0;
for (int i =0; i < k; i++)
    if (mod[i] > 1)
          result += (mod[i] * (mod[i]-1)) / 2;
result += mod[0];
return result;
}
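As a quick sanity check of the prefix-sum counting above (the sample values are only an illustration), the following call should return 7, one for each subarray of {4, 5, 0, -2, -3, 1} whose sum is divisible by 5:
var count = GetCountDivByK(new List<int> { 4, 5, 0, -2, -3, 1 }, 5); // 7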
Testing for divisibility is simple
bool HasDivByK(List<int> A, int k)
{
var mod = new int[k];
int sum = 0;
for (int i = 0; i < A.Count; i++)
{
   sum += A[i];
   mod[((sum%k)+k)%k] += 1;
   if (mod[((sum%k)+k)%k]  > 1 || mod[0] > 0)  return true; 
}
return false;
}
#cc_document https://1drv.ms/w/s!Ashlm-Nw-wnWsB5CqwSCCxBs9bjc

Monday, June 26, 2017

Today we continue our discussion on system design. This time we cover Splunk. Splunk is a scalable time series database arranged as events in buckets – hot/warm, cold, or thawed. These are stored as index files together with some metadata. There are special buckets called introspection. The architecture consists of lightweight forwarders, indexers, and search heads, each with its own topology and built to scale. The forwarders are the collection agents that gather machine data from the customer. The indexers receive the events and handle the bulk of the operations. The search heads present analysis tools, charts, and management interfaces. Splunk has recently added analysis features based on machine learning. Previously most of the search features were based on Unix-like command operators, which became quite popular and boosted Splunk's adoption as the IT tool of choice, among other usages. There are a variety of charting tools; the frontend is based on JavaScript while the middle tier is based on Django. The indexers are written in C++ and come with robust capabilities. It is important to note that their database, unlike conventional relational or NoSQL stores, was designed primarily for specific usages. If they moved their database to commodity or platform options in the public cloud, they could evolve their frontend beyond a single enterprise or localhost based instance and provide isolated cloud based storage per customer on a subscription basis: Splunk as a cloud and browser based service.
Next, we cover consistent hashing. This is a notion that is quietly finding its way into several distributed systems and services. Initially we had a cache that was distributed among n servers as hash(o) modulo n. This had the nasty side effect that when one or more servers went down or were added to the pool, all the objects in the cache would map to different servers because the variable n changed. Consistent hashing instead accommodates new servers and takes old servers offline by arranging the hashes around a circle with cache points. When a cache is removed or added, the objects whose hashes fall along the circle are assigned to the next cache point clockwise. It also introduces "virtual nodes", which are replicas of cache points on the circle. Since the caches may otherwise see a non-uniform distribution of objects, the virtual nodes help spread objects more evenly across the cache points.
public class ConsistentHash<T> {
private SortedDictionary<int, T> circle = new SortedDictionary<int, T>(); // in Java a TreeMap is typically used, which keeps the keys sorted
Since a number of replicas are maintained, the replica number may be appended to the string representation of the cache point before it is hashed. A good example is memcached, whose clients commonly use consistent hashing.
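A fuller sketch of the circle lookup is shown below, assuming an MD5-based hash, a fixed replica count, and illustrative method names (AddNode, RemoveNode, GetNode) that are not taken from any particular library:
public class ConsistentHash<T>
{
    private SortedDictionary<int, T> circle = new SortedDictionary<int, T>();
    private int replicas;

    public ConsistentHash(int replicas) { this.replicas = replicas; }

    private int GetHash(string key)
    {
        // any stable hash will do; MD5 is used here only for illustration
        using (var md5 = System.Security.Cryptography.MD5.Create())
        {
            var bytes = md5.ComputeHash(System.Text.Encoding.UTF8.GetBytes(key));
            return BitConverter.ToInt32(bytes, 0);
        }
    }

    public void AddNode(T node)
    {
        // each node occupies several points ("virtual nodes") on the circle
        for (int i = 0; i < replicas; i++)
            circle[GetHash(node.ToString() + i)] = node;
    }

    public void RemoveNode(T node)
    {
        for (int i = 0; i < replicas; i++)
            circle.Remove(GetHash(node.ToString() + i));
    }

    public T GetNode(string key)
    {
        int hash = GetHash(key);
        foreach (var entry in circle)
            if (entry.Key >= hash) return entry.Value;   // first cache point clockwise from the key
        foreach (var entry in circle)
            return entry.Value;                          // wrap around to the start of the circle
        throw new InvalidOperationException("no cache points");
    }
}
Removing a node this way only remaps the keys that fell on its replicas, which is the property that motivates the scheme.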
#codingexercise
Find if there is a subset of numbers in a given integer array that, when ANDed with the given number, results in zero.
    static bool IsValid(List<int> items, int Z)
    {
        // AND is monotone: adding items can only clear more bits, so if any subset
        // can zero out Z, ANDing Z with every item will also yield zero.
        var res = new BitArray(new int[] { Z });
        foreach (var item in items)
        {
            var b = new BitArray(new int[] { item });
            res = res.And(b);
        }
        int[] result = new int[1];
        res.CopyTo(result, 0);
        return result[0] == 0;
    }

Sunday, June 25, 2017

Closed Captioning – a software driven approach
Introduction: Web pages, images, and videos are all created, uploaded, and shared everywhere using online sites and services from well known companies. However, voice as a medium of expression, content, and publishing has not found such popularity. This writeup briefly describes a software approach to making speech content relevant to web searches via transcription and closed captioning.
Description:
Data that is captured as speech or voice is usually an audio file that is more difficult to search than text, for similar reasons that images are harder to search than text. A text representation of the content, whether audio or video, in the form of metadata is far more helpful because it conforms to the traditional formats required for software based search. Native interpretation of image and voice files requires pattern recognition and speech recognition software respectively, which are time consuming, costly, and error prone. This makes it difficult to process the data at the same time as it becomes available. Consequently some out of band processing is involved to add relevant text to such representations of data. When voice is streamed over the internet as part of video, closed captioning packets can be overlaid on the data and rendered on the user's viewing device. This textual content can also be archived and searched just the same way we search web pages or documents on a user's computer. The same holds true for transcription and captioning services. Captioning tools such as Camtasia, Captionate, Express Scribe, MAGpie2, and Overstream offer a variety of features to help with textual representation of voice. They can be used to create transcripts or caption streams. There are also fully managed services that handle all stages of processing: taking your video, making a transcript, adding closed captions, and returning the file to you. Transcripts can be created manually with tools such as ExpressScribe or WriteNaturally, or with voice recognition software from the consumer oriented digital companies. These can then be merged as captions using tools such as Camtasia. YouTube also has a feature to add closed captions to videos. The data from captions or transcripts can be just as usefully mined as the metadata of the video or audio files. A background speech to text conversion service can automatically add such desirable content to existing collections and archives, some of which, in the form of speech, songs, narrations, and commentaries, are considered golden and classic to date. Moreover, it is the ability to generate this data automatically for all new content as it becomes available that makes it appealing to combine with searches to produce relevant results.
Conclusion: While speech recognition software is widely understood to be error prone, an offline service to automate the derivation of transcripts and captions from existing and new content can prove valuable. Scenarios such as listening to a song by searching for part of the lyrics, whether or not it is available in video format, would enhance the results from any web search. Similarly, governments may find it useful to convert conversations over mobile devices to text for analysis and data mining. Also, consumer electronics such as Alexa or Siri that interact with users may keep track of these conversations for analysis related to personalization based marketing. These seem like the tip of an iceberg of data.
#codingexercise
Given a geometric progression starting with 1 as the first term, and values of r, S, and p where:
r = the common ratio of the progression
S = the sum of the first N terms modulo p
p = a prime number
find N.
    static int GetN(int r, int S, int p)
    {
        int res = -1;
        long val = 0;
        long term = 1; // r^k mod p, computed incrementally to avoid overflow and floating point error
        for (int k = 0; k < p; k++)
        {
            val = (val + term) % p;
            if (val == S)
            {
               res = k + 1;
               break;
            }
            term = (term * r) % p;
        }
        return res;
    }

Saturday, June 24, 2017

Today we discuss the Cloudera architecture. Cloudera is known for its data management software that can be deployed and run on any cloud. It offers an enterprise data hub, an analytics DB, an operational DB, data science and engineering, and essentials. It is elastic and flexible, it has high performance analytics, it can easily provision over multiple clouds, and it can be used for automated metering and billing. Essentially they allow different data models, real-time data pipelines, and streaming applications with their big data platform. They enable data models to break free from vendor lock-ins, with the flexibility to let them be community defined. Moreover, they let the database scale by hosting it in the cloud. The data science workbench offered by Cloudera involves a console on a web browser; users authenticate themselves using Kerberos against the cluster KDC. Engines are spun up and we can seamlessly connect with Spark, Hive, and Impala. The engines are spun up based on engine kernels and profiles. A command prompt and an editor are both available for interactive command execution. Code is executed in a session, and we can quit the session. The workbench uses Docker and Kubernetes to manage containers. Cloudera is supported on dedicated Hadoop hosts. Cloudera also adds a data engineering service called Altus. It is a platform that works against a cloud by allowing clusters to be set up and torn down and jobs to be submitted to those clusters. Clusters may be Apache Spark, MR2, or Hive. Behind the data engineering service these clusters are provisioned on EC2 and S3: EC2 provides the compute layer and S3 provides the storage layer. Note that the clusters are not provisioned with an Apache Mesos and Marathon stack, and the storage is not provided on other file or database based technologies, but this can be expanded in the future. Containerization technologies and backend-as-a-service, aka lambda functions, could also be supported. The main notion here is that Cloudera works with existing public clouds while it offers an enterprise manager for on-premise solutions. They provide great management capabilities, they are open sourced, and they provide great platform support with the right mix of open source tools for analytics and operations. They even work well and draw parallels with Splunk, which happens to be a machine data platform. Cloudera may well benefit from the same practice around sustaining engineering and support that endeared Splunk to its customers. It may be interesting to find out whether Cloudera can slip in as a data platform under Splunk so that Splunk becomes less proprietary and embraces Cloudera's community model. The time series database for Splunk can be left alone while only the cluster operations are migrated.
#codingexercise
Given a rod with a price chart for different lengths, determine the cut lengths that yield the maximum total price.
int GetMaxPrices(List<int> p, int n)
{
// p[k] is the price of a piece of length k; dp[i] is the best total price for a rod of length i
var dp = new int[n+1];
dp[0] = 0;
for (int i = 1; i <= n; i++)
{
    var price = int.MinValue;
    for (int k = 1; k <= i; k++)
    {
         price = Math.Max(price, p[k] + dp[i-k]);
     }
    dp[i] = price;
}
return dp[n];
}
In order to determine the cut lengths, we can record, for each position i, the piece length k that produced the best price at i. Then, for the position n, we walk back through these recorded choices to enumerate the cuts that contributed to n, as sketched below.
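A sketch of that reconstruction, building on the function above and assuming the same price list convention where p[k] is the price of a piece of length k:
static List<int> GetCutLengths(List<int> p, int n)
{
    var dp = new int[n + 1];
    var choice = new int[n + 1];   // choice[i] = length of the last piece cut from a rod of length i
    for (int i = 1; i <= n; i++)
    {
        int price = int.MinValue;
        for (int k = 1; k <= i; k++)
        {
            if (p[k] + dp[i - k] > price)
            {
                price = p[k] + dp[i - k];
                choice[i] = k;
            }
        }
        dp[i] = price;
    }
    // walk back from n, emitting the piece chosen at each step
    var cuts = new List<int>();
    for (int rem = n; rem > 0; rem -= choice[rem])
        cuts.Add(choice[rem]);
    return cuts;
}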


Friday, June 23, 2017

Today we discuss the design of the Whatsapp messenger. This is a messaging architecture. It offers free texting in an SMS world. There are no ads, no gimmicks, no games, and no feeds. They have hundreds of nodes, thousands of cores, hundreds of terabytes of RAM, and serve billions of smartphones. Their server infrastructure is based on Erlang/FreeBSD. Erlang is a programming language used to build massively scalable soft real-time systems with requirements for high availability. With this platform they send out about 70 million messages per second.
This architecture is famous for "it just works" because only a phone number is used as the address to send messages to, and messages can be voice, video, images, or text. It also features voice messages, group chats, and read receipts.
The backend consists of Erlang, Jabber, FreeBSD, the Yaws web server, PHP, BEAM, and custom XMPP. The frontend consists of seven client platforms for the different mobile devices and SQLite. Every user is an actor. Messages flow between actors when they are online, or they are stored offline. When an actor comes online, it pulls the messages sent to it. User and actor differ in that the actor notion, or proxy, exists only in the Whatsapp backend framework. A message sent from a user goes to the actor's inbox, which then sends the message to the other actor, who in turn sends it to the user behind that actor. When that user comes online, the messages appear. The Mnesia backend DB stores the actors and their inboxes. The message is encrypted with a key as a tag. They use tools for system activity monitoring on all their server operating systems, as well as BEAM processor hardware perf counters and dtrace. BEAM is the Erlang VM that is used to process the messages.
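A toy sketch of the store-and-forward inbox idea follows, purely for illustration; the ActorInbox class and its members are invented for this example and are not Whatsapp's actual code:
public class ActorInbox
{
    private Queue<string> pending = new Queue<string>();  // messages stored while the user is offline
    public bool IsOnline { get; set; }
    public event Action<string> Delivered;                // fires when a message reaches the device

    public void Send(string message)
    {
        if (IsOnline) Delivered?.Invoke(message);
        else pending.Enqueue(message);                    // store offline until the user returns
    }

    public void ComeOnline()
    {
        IsOnline = true;
        while (pending.Count > 0)                         // drain the inbox when the user reconnects
            Delivered?.Invoke(pending.Dequeue());
    }
}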
The highlight of this architecture is its performance oriented design. Even a similar feature from Facebook, also written in Erlang, could not match the widespread adoption of this messenger. They have proven that Erlang and FreeBSD have unmatched SMP scalability.
#codingexercise
Let the cost C of a string S be defined as the length of the longest substring of S in which all characters are the same.
int GetCost(string S)
{
int max = 0;
for (int i =0; i < S.Length; i++)
{
  int cost = 0;
  for (int j =i; j < S.Length; j++)
  {
     if (S[j] != S[i]) break;
     cost += 1;
   }
if (cost > max)
     max = cost;
}
return max;
}
Given a language L with c distinct characters and strings all of length n, find the expected value of the cost as Sum(cost) / number_of_strings.
void FormStringInL(List<char> c, int n, ref List<char> b, ref List<string> combinations)
{
if (b.Count == n) { combinations.Add(new string(b.ToArray())); return; }
foreach (var a in c)
{
b.Add(a);
FormStringInL(c, n, ref b, ref combinations);
b.RemoveAt(b.Count - 1); // backtrack
}
}
var value = combinations.Sum(x => GetCost(x)) / (double)combinations.Count;
Console.WriteLine(value);
# other System Design : https://1drv.ms/w/s!Ashlm-Nw-wnWsBPfKqVUIHGkJ4Rf

#tree delete operation for node z
if z has at most one child, let y be z; otherwise
   find the tree successor of z and call it y
if y has a child, let it take y's place and fix that child's parent pointer
if the parent of y was null then y was the root, so fix the root; otherwise fix the parent's left or right pointer
if y is not the same as z
    then copy y's data into z
return y
Node TreeSuccessor(Node z)
{
if (z == null) return null;
if (z.right != null)
    return TreeMinimum(z.right); // leftmost node of the right subtree
Node parent = z.parent;
while (parent != null && parent.right == z)
{
 z = parent;
 parent = parent.parent;
 }
return parent;
}
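Following the outline above, a sketch of the delete routine itself might look like the code below; the Node fields (left, right, parent, data) and the TreeMinimum helper follow the conventions of the successor code and are assumed rather than taken from a specific implementation:
Node TreeDelete(ref Node root, Node z)
{
    // choose the node y to splice out: z itself when z has at most one child, otherwise z's successor
    Node y = (z.left == null || z.right == null) ? z : TreeSuccessor(z);
    // y has at most one child x; let x take y's place
    Node x = (y.left != null) ? y.left : y.right;
    if (x != null) x.parent = y.parent;
    if (y.parent == null)
        root = x;                         // y was the root
    else if (y == y.parent.left)
        y.parent.left = x;
    else
        y.parent.right = x;
    if (y != z)
        z.data = y.data;                  // copy the spliced node's payload into z
    return y;
}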

Thursday, June 22, 2017

We were reviewing DynamoDB and Cosmos DB from our discussion on the design of an online store. Then we started discussing the design of YouTube. We now continue with this discussion.
We listed the design considerations as follows:
We keep track of users and videos in our inventory. Videos may also have images associated with them, such as thumbnails, and it would not be surprising for the images to far outnumber the videos. The videos themselves could number under a billion. Images and videos are best served from a file storage cluster such as an Isilon OneFS cluster, although its performance may need to be compared with that of partitioned datacenter storage.
The user model has information about user credentials as well as profile and statistics. These can be maintained in a relational database and served from the cloud with a managed database service. User authorization may also flow via OAuth, so API services for user and video management will help. A typical arrangement of user interface, API services, and data stores will help make the service available on a variety of devices.
The videos are large in number. They will be organized into a pool and a repository. The pool videos will be the most watched or trending and will be available from a web server that implements an LRU-like policy, as sketched below. A pool can also be maintained per user and aged. A user's watched videos can be eagerly loaded into that user's pool on page load. Dynamic queries will be honored with immediate fetches from disk while statistics are updated so that the same videos can be served from cache next time. The storage of videos presents significant challenges. As an alternative to letting them be managed by OneFS NAS storage, we can place them in datacenter storage separated by some partitioning mechanism. Streaming from the datacenter storage will continue to happen with one or more web servers dedicated to that purpose. Streaming services can maintain their own cache and performance improvements.
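For the LRU-like pool policy mentioned above, a minimal in-memory sketch is shown here; the LruPool name, its generic keys, and the capacity handling are all illustrative assumptions rather than a prescribed component:
public class LruPool<K, V>
{
    private readonly int capacity;
    private readonly Dictionary<K, LinkedListNode<KeyValuePair<K, V>>> map = new Dictionary<K, LinkedListNode<KeyValuePair<K, V>>>();
    private readonly LinkedList<KeyValuePair<K, V>> order = new LinkedList<KeyValuePair<K, V>>(); // most recently used at the front

    public LruPool(int capacity) { this.capacity = capacity; }

    public bool TryGet(K key, out V value)
    {
        if (map.TryGetValue(key, out var node))
        {
            order.Remove(node);          // move the entry to the front on every hit
            order.AddFirst(node);
            value = node.Value.Value;
            return true;
        }
        value = default(V);
        return false;
    }

    public void Put(K key, V value)
    {
        if (map.TryGetValue(key, out var existing))
        {
            order.Remove(existing);
            map.Remove(key);
        }
        if (map.Count >= capacity)
        {
            var last = order.Last;       // evict the least recently used entry
            order.RemoveLast();
            map.Remove(last.Value.Key);
        }
        map[key] = order.AddFirst(new KeyValuePair<K, V>(key, value));
    }
}
A per-user pool or the trending pool could sit behind such a structure on the web tier, with misses falling through to the video store.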
A CDN will be used wherever possible to serve content from geographically close locations. It will serve static resources and alleviate the load on the web servers. Caching will be offered at every level and as much as possible so that the backend is not hit for most requests.
Video replication and availability from multiple regions may be offered to improve availability and reliability. Indexes may be maintained for all video metadata, including authoring and uploading by owners. The index need not be in a relational store. Such information or mapping, along with statistics, can also be kept in key-value stores and queried with scatter-gather operations.
Grouping and collaborative filtering may be used to provide recommendations to the user. The top few videos may also be maintained for the most common search terms, which also determines the candidacy of videos available in the pool. Social features on videos, such as likes and comments, can also be enabled with their own microservice and non-relational databases.
#codingexercise
void GeneratePrimesUpto(int N)
{
// sieve: multiples[i] is 1 when i is composite (1 and 0 are also marked as non-prime)
var multiples = new List<int>(Enumerable.Repeat(0, N+1));
multiples[0] = 1;
multiples[1] = 1;
for (int i = 2; i <= N; i++)
   for (int k = 2; k <= N && i*k <= N; k++)
        if (multiples[i*k] == 0)
             multiples[i*k] = 1;
for (int k = 1; k <= N; k++)
       if (multiples[k] == 0)
           Console.WriteLine("{0}", k);
}

We were discussing another problem yesterday:
Find the minimum length for repeated brackets of a number N.
If x is the number of opening brackets and y is the number of closing brackets in the minimum-length repeated sequence,
then x*y = N or x*y + () = N,
and we can increment x by 1 or y by 1.
http://ideone.com/s0dxCX
This strategy does not take into consideration the minimization based on consecutive x and y multiples with x == y,
which means
        static int GetCountADP(int N, int m, int n)
        {
            if (N <= 0) return 0;
            if (N == 1) return 2;
            if (m * n == N) return m + n;
            if (m * n > N) return N + N + N * 2;
            int count1 = GetCountADP(N - n, m + 1, n) + 1;        // add one more opening bracket
            int count2 = GetCountADP(N - m, m, n + 1) + 1;        // add one more closing bracket
            int count3 = GetCountADP(N - 1, m, n) + 2;            // add a "()" pair
            int count4 = GetCountADP(N - m * n, m + 1, n + 1) + m + n;  // complete an x*y block
            var ret = new List<int>() { count1, count2, count3, count4 };
            return ret.Min();
        }

Wednesday, June 21, 2017

We were reviewing DynamoDB and Cosmos DB from our discussion on the design of an online store. Today we look at another system design question: how to design YouTube? There are quite a few blog posts that answer this question. This one is not expected to be very different; however, we will benefit from our discussion of cloud databases in earlier posts.
YouTube serves videos over the web. They grew incredibly fast, with over a hundred million video views per day. Their platform consists of the Apache web server, a Python web application on a Linux stack, a MySQL database server, Psyco (a Python-to-C compiler), and lighttpd for video. They run Apache with mod_fastcgi. The Python web application mashes up content from different data sources. This application is usually not the bottleneck; it's the RPCs to the data sources that take time. They use the Psyco compiler for CPU intensive activities and a lot of caching. Fully formed Python objects are cached. Row level data in databases is cached. Static content is cached. They use NetScaler for load balancing and caching static content.
Data is constituted and sent to each application. The values are cached in local memory. An agent can watch for changes and send the data to each application.
Each video is hosted by a mini cluster. The cluster comes in useful to scale and to serve a video from more than one machine, and there is no downtime for backups. The servers use the lighttpd web server for video because it uses epoll to wait on multiple file descriptors. It can also scale with connections because there is little thrashing of content. CDNs are used to serve content wherever possible. Many of the videos are requested only a few times, while a small number are requested very often; the former translate into random disk reads, which means caching may not always help. Therefore the disks, which are RAIDed, are tuned. Initially the database was just MySQL over a monolithic RAID 10. They leased hardware and went from a single master with multiple read slaves to partitioning the database and using a sharded approach. No matter how the data was organized, cache misses were frequent, especially with slow replication. One of their solutions was to prioritize traffic by splitting the data into two clusters: a video watch pool and a general cluster. Subsequently they reduced cache misses and replication lag. Now they can scale the database arbitrarily. This is probably a good case study candidate for migrating data to the public cloud because they were already storing data in MySQL and most of the replication and caching was to distribute the load. They use data centers for file storage and a CDN. They have a large collection of videos and images. Images suffer from latency; videos are bandwidth dependent. They use Bigtable to look up images across data centers.
#codingexercise
Given a list of numbers, insert +, -, * operators between them so that the result is divisible by a given number.
http://ideone.com/7UlJn3
With these combinations we can now evaluate each expression to filter out the ones we want.
http://ideone.com/ZeaznW

The solution for the operators can also include a filter to select only those combinations that evaluate to a value divisible by the divisor.
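Since the linked code is not reproduced here, a small sketch of the enumeration with the divisibility filter is given below; the FindExpressions name is illustrative, and evaluation is strictly left to right, ignoring operator precedence, to keep the sketch short:
static void FindExpressions(List<int> numbers, int index, long value, string expr, int divisor, List<string> results)
{
    if (index == numbers.Count)
    {
        if (value % divisor == 0) results.Add(expr);   // keep only expressions whose value is divisible by the divisor
        return;
    }
    long next = numbers[index];
    FindExpressions(numbers, index + 1, value + next, expr + "+" + numbers[index], divisor, results);
    FindExpressions(numbers, index + 1, value - next, expr + "-" + numbers[index], divisor, results);
    FindExpressions(numbers, index + 1, value * next, expr + "*" + numbers[index], divisor, results);
}
// usage: var results = new List<string>(); FindExpressions(numbers, 1, numbers[0], numbers[0].ToString(), divisor, results);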
Brackets are considered repeated when they have matching opening and closing parentheses consecutively.
For N=2, we have ()() or (()) and the length is 4.
For N=3, we have ())) and the length is 4.
Find the minimum length for repeated brackets of a number N.
If x is the number of opening brackets and y is the number of closing brackets in the minimum-length repeated sequence,
then x*y = N or x*y + () = N,
and we can increment x by 1 or y by 1.