Cluster computing

Wednesday, June 21, 2017

We were reviewing DynamoDB and Cosmos DB as part of our discussion on the design of an online store. Today we look at another system design question: how would we design YouTube? There are quite a few blog posts that answer this question. This one is not expected to be very different; however, we will benefit from our discussion of cloud databases in earlier posts.
YouTube serves videos over the web. It grew incredibly fast, to over a hundred million video views per day. Its platform consists of the Apache web server, a Python web application on a Linux stack, a MySQL database server, psyco (a Python to C compiler) and lighttpd for video. Apache runs with mod_fastcgi. The Python web application mashes up all the content from different data sources. This application is usually not the bottleneck; it's the RPCs to the data sources that take time. They use the psyco compiler for CPU-intensive activities, along with a lot of caching. Fully formed Python objects are cached. Row-level data in databases is cached. Static content is cached. They use NetScaler for load balancing and for caching static content.
Data is assembled and sent to each application, where the values are cached in local memory. An agent can watch for changes and send the updated data to each application.
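The push-to-local-memory idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the class names `LocalCache` and `ChangeAgent`, the `publish` method and the `featured_video` key are all hypothetical, and the agent here runs in the same process rather than over the network.

```python
import threading


class LocalCache:
    """In-process cache that an agent refreshes by pushing updates (hypothetical)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key, default=None):
        # Reads never leave the process, so they involve no RPC.
        with self._lock:
            return self._data.get(key, default)

    def apply_update(self, key, value):
        with self._lock:
            self._data[key] = value


class ChangeAgent:
    """Hypothetical agent that sends each change to every registered application."""

    def __init__(self):
        self._subscribers = []

    def register(self, cache):
        self._subscribers.append(cache)

    def publish(self, key, value):
        # Push the new value to every application's local cache.
        for cache in self._subscribers:
            cache.apply_update(key, value)


agent = ChangeAgent()
app_a, app_b = LocalCache(), LocalCache()
agent.register(app_a)
agent.register(app_b)
agent.publish("featured_video", "abc123")
```

After the `publish` call, both `app_a` and `app_b` answer `get("featured_video")` from their own memory. A real deployment would need a transport between the agent and the applications, which this sketch leaves out.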
Each video is hosted by a mini cluster. The cluster is useful for scaling and for serving a video from more than one machine. There is also no downtime for backups. Servers use the lighttpd web server for video because it uses epoll to wait on multiple file descriptors. It can also scale with connections because there is little thrashing of content. CDNs are used to serve content wherever possible. Many videos are requested only a few times each, but a large number of such videos are being requested. These requests translate to random disk reads, which means caching may not always help. Therefore the RAIDed disks are tuned for this access pattern.
Initially the database was just MySQL over a monolithic RAID 10. They leased hardware and went from a single master with multiple read slaves to partitioning the database and using a sharded approach. No matter how the data was organized, cache misses were frequent, especially with slow replication. One of their solutions was to prioritize traffic by splitting the data into two clusters: a video watch pool and a general cluster. With this they reduced cache misses and replication lag, and now they can scale the database arbitrarily. This is probably a good case study candidate for migrating data to the public cloud because they were already storing data in MySQL and most of the replication and caching was there to distribute the load.
They use data centers for file storage and CDNs. They have a large collection of videos and images. Images suffer from latency. Videos are bandwidth dependent. They use Bigtable to look up images in different data centers.
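The split into a watch pool and a general cluster, combined with sharding, can be illustrated with a small routing function. This is a sketch under assumptions, not YouTube's actual scheme: the shard names, the `route` and `pick_shard` helpers and the use of CRC32 modulo the pool size are all invented for illustration.

```python
import zlib

# Hypothetical shard names; the real topology is not described in the post.
WATCH_POOL = ["watch-db-0", "watch-db-1"]
GENERAL_POOL = ["general-db-0"]


def pick_shard(pool, key):
    """Map a key to one shard of the pool.

    zlib.crc32 is stable across processes, unlike Python's built-in
    hash() for strings, so every application server agrees on the shard.
    """
    return pool[zlib.crc32(key.encode("utf-8")) % len(pool)]


def route(request_kind, key):
    """Send watch traffic to the watch pool and everything else to the general cluster."""
    pool = WATCH_POOL if request_kind == "watch" else GENERAL_POOL
    return pick_shard(pool, key)
```

For example, `route("watch", "video42")` always returns the same member of `WATCH_POOL`, while `route("comment", "video42")` goes to the general cluster. Simple modulo placement moves many keys when a shard is added; consistent hashing is a common alternative when that matters.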
#codingexercise
Given a list of numbers, insert the operators +, - and * between them so that the result is divisible by a given number.
http://ideone.com/7UlJn3
With these combinations we can now compute each of them to filter out the one we want.
http://ideone.com/ZeaznW

The solution for the operators can also include filters to select only those combinations that return a value divisible by the divisor.
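The enumerate-then-filter approach described above can be sketched as follows. This is an independent illustration rather than the code behind the links. It assumes a non-empty list of integers, a non-zero divisor, and the usual precedence of * over + and -; the function names `evaluate` and `divisible_expressions` are made up here.

```python
from itertools import product


def evaluate(numbers, ops):
    """Evaluate numbers interleaved with ops, applying * before + and -."""
    # First pass: fold every multiplication into the current term.
    terms = [numbers[0]]
    signs = []
    for op, n in zip(ops, numbers[1:]):
        if op == "*":
            terms[-1] *= n
        else:
            signs.append(op)
            terms.append(n)
    # Second pass: combine the terms left to right with + and -.
    total = terms[0]
    for sign, term in zip(signs, terms[1:]):
        total = total + term if sign == "+" else total - term
    return total


def divisible_expressions(numbers, divisor):
    """Return (expression, value) pairs whose value is divisible by divisor."""
    results = []
    # Try every assignment of +, -, * to the gaps between the numbers.
    for ops in product("+-*", repeat=len(numbers) - 1):
        value = evaluate(numbers, ops)
        if value % divisor == 0:
            expr = str(numbers[0]) + "".join(
                f"{op}{n}" for op, n in zip(ops, numbers[1:])
            )
            results.append((expr, value))
    return results
```

Calling `divisible_expressions([1, 2, 3], 7)` includes the pair `("1+2*3", 7)`. It also includes `("1+2-3", 0)`, since zero is divisible by every non-zero number. The search visits 3^(n-1) combinations for n numbers, so it is only practical for short lists.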
Brackets are considered repeated when they have matching opening and closing parentheses consecutively.
For N=2, we have ()() or (()) and the length is 4
For N=3 we have ())) and the length is 4
Find the minimum length of a repeated bracket sequence for a given number N.
If x is the number of opening brackets and y is the number of closing brackets in the minimum-length repeated sequence,
then x*y = N or x*y + () = N,
and we can increment x by 1 or y by 1.
