Friday, July 20, 2018

Let us discuss the design to serve videos from a video repository to customers:
We keep track of Users and Videos in our inventory. Videos may also have associated images such as thumbnails, and it would not be surprising to find the images far exceeding the number of videos, which could themselves number under a billion. Images and videos are best served from a file storage cluster such as an Isilon OneFS cluster, although its performance may need to be compared with storing them in the public cloud.
The User model has information about user credentials as well as profile and statistics. These can be maintained in a relational database and served from the cloud with a managed database service. User authorization may also flow via OAuth, so API services for user and video management will help. A typical stack of user interface, API services and data stores makes the service available on a variety of devices.
The videos are large in number. They will be organized into a pool and a repository. The pool holds the most watched or trending videos and is served from a web server that implements an LRU-like eviction policy. A pool can also be maintained per user and aged, and a user's watched videos can be eagerly loaded into that pool when the user's page loads. Dynamic queries are honored with immediate fetches from disk while statistics are updated so that subsequent requests can be served from cache. The storage of the videos presents significant challenges. As an alternative to letting them be managed by OneFS NAS storage, we can place them in data center storage separated by some partitioning scheme over the videos. Streaming from the data center storage continues through one or more web servers dedicated to that purpose, and the streaming services can maintain their own caches and performance improvements.
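For illustration, here is a minimal sketch of what such an LRU-like pool could look like. The VideoPool class, its capacity parameter and the use of video ids as keys are all assumptions made for this sketch, not part of any production design.

using System.Collections.Generic;

// Sketch only: an LRU-like pool of video ids backed by a dictionary and a linked list.
// Assumes capacity > 0; eviction removes the least recently watched id.
public class VideoPool
{
    private readonly int capacity;
    private readonly Dictionary<string, LinkedListNode<string>> index = new Dictionary<string, LinkedListNode<string>>();
    private readonly LinkedList<string> recency = new LinkedList<string>(); // most recently watched at the front

    public VideoPool(int capacity) { this.capacity = capacity; }

    // Returns true if the video is already pooled and refreshes its recency.
    public bool TryTouch(string videoId)
    {
        LinkedListNode<string> node;
        if (!index.TryGetValue(videoId, out node)) return false;
        recency.Remove(node);
        recency.AddFirst(node);
        return true;
    }

    // Adds a video to the pool, evicting the least recently watched one when full.
    public void Add(string videoId)
    {
        if (TryTouch(videoId)) return;
        if (index.Count >= capacity)
        {
            LinkedListNode<string> oldest = recency.Last;
            recency.RemoveLast();
            index.Remove(oldest.Value);
        }
        index[videoId] = recency.AddFirst(videoId);
    }
}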
A CDN will be used wherever possible to bring content geographically closer to users. It will serve static resources and alleviate the load on the web servers. Caching will be offered at every level and as aggressively as possible so that the backend is not hit for every request.
Videos may be replicated and served from multiple regions to improve availability and reliability. Indexes may be maintained for all video metadata, including authoring and uploads by owners. The index need not live in a relational store; such mappings, along with statistics, can also reside in key-value stores and be queried with scatter-gather operations.
Grouping and collaborative filtering may be used to provide recommendations to the user. The top few videos for the most common search terms may also be maintained, which in turn determines the candidacy of videos for the pool. Social features on videos such as likes and comments can be enabled with their own microservice and non-relational databases.


There are a few forms of storage that help with storing videos. I mentioned OneFS here and left the comparison to the public cloud open. We could also evaluate object storage: the files need to be available over the internet anyway, and object storage provides a convenient location while decoupling programmability from storage in either a public or private cloud.



#codingexercise
Find the number of decreasing paths in a matrix:
For example, we have increasing paths for the matrix
1 2 
1 3 
as 1, 1, 2, 3, {1,2}, {1,3}, {2,3}, {1,2,3}, and the decreasing paths are just the same paths traversed in the opposite direction.
We can do this with recursion:
// Counts the decreasing paths that start at (x, y); dp memoizes the count per cell.
int getCount(int[,] matrix, int[,] dp, int x, int y)
{
  if (dp[x,y] != -1)
      return dp[x,y];
  int[] dx = new int[] { 0, 1, -1, 0 };
  int[] dy = new int[] { 1, 0, 0, -1 };
  int result = 1; // the element by itself
  for (int i = 0; i < dx.Length; i++)
  {
     int m = x + dx[i];
     int n = y + dy[i];
     if (isValid(m, n, matrix) && matrix[m,n] < matrix[x,y]) {
          result += getCount(matrix, dp, m, n);
     }
  }
  dp[x, y] = result;
  return result;
}
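The isValid helper referenced above is not defined in this snippet. A possible bounds-check definition and a caller that sums the per-cell counts might look like the following; names other than getCount are assumptions made for illustration.

// Assumed helper: checks that (m, n) falls within the matrix bounds.
bool isValid(int m, int n, int[,] matrix)
{
    return m >= 0 && m < matrix.GetLength(0) &&
           n >= 0 && n < matrix.GetLength(1);
}

// Assumed caller: initializes the memo to -1 and sums the counts over all cells.
int countAllDecreasingPaths(int[,] matrix)
{
    int rows = matrix.GetLength(0);
    int cols = matrix.GetLength(1);
    var dp = new int[rows, cols];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            dp[i, j] = -1;
    int total = 0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            total += getCount(matrix, dp, i, j);
    return total; // 8 for the 2x2 example above
}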

Thursday, July 19, 2018

Today we discuss url shortening service. This involves:
1) shortening - take a url and return a much smaller url generally with an endpoint and a hash 
2) redirection - take a short url and redirect to the original url. 
With the primary use cases, we can start the design. Although it is not necessary to cover every use case in the design, we list some others: custom urls, analytics, automatic link expiration, manual link removal, and interface options such as a UI, an API and a command line.
The use cases help define the problem we are trying to solve. Spending some time up front on them helps us plan the next steps.
Any system can solve a given use case but the design gets more specific as we introduce constraints in the system. 
These constraints include high availability with tight service level agreements. There may be 100 million new urls per month and 1 billion requests per month, which works out to roughly 400 requests per second. 99% of these requests need to be answered successfully. About 10% of the requests may be for shortening and the other 90% for redirection. Adding up the new urls, we get over 6 billion urls in 5 years.
With this estimation, we proceed to the storage and processing requirements for the design. Each url may take, say, 1000 bytes and its hash may take twelve bytes in unicode. This results in over 6 TB for all urls.
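As a quick back-of-envelope check, the figures below simply re-derive the estimates stated above; the method name is an arbitrary choice for this sketch.

// Back-of-envelope check of the estimates above; all inputs are the stated assumptions.
void PrintEstimates()
{
    long urlsPerMonth = 100000000L;                    // 100 million new urls per month
    long requestsPerMonth = 1000000000L;               // 1 billion requests per month
    long secondsPerMonth = 30L * 24 * 3600;            // about 2.6 million seconds
    long requestsPerSecond = requestsPerMonth / secondsPerMonth;   // ~385, close to the roughly 400/s figure above
    long urlsInFiveYears = urlsPerMonth * 12 * 5;      // 6 billion urls
    long bytesPerUrl = 1000L;                          // roughly 1 KB per stored url
    long totalBytes = urlsInFiveYears * bytesPerUrl;   // about 6 TB of url storage
    Console.WriteLine("{0} req/s, {1} urls, {2} bytes", requestsPerSecond, urlsInFiveYears, totalBytes);
}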
The storage acts like a huge hash table which stores new mappings and retrieves a url for a given hash. 
The hash itself is computed as the last six characters of the base64 encoding of the md5 of the original url combined with a random salt.
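A sketch of that hash computation in C# follows; the method name, the salt parameter, and the choice to strip base64 padding are assumptions made here for illustration.

// Requires the System, System.Security.Cryptography and System.Text namespaces.
// Sketch only: md5 of the salted url, base64-encoded, last six characters kept.
string GetShortHash(string url, string salt)
{
    using (var md5 = MD5.Create())
    {
        byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(url + salt));
        // A 16-byte digest becomes 24 base64 characters; drop the '=' padding
        // (a url-safe alphabet may be preferable in practice).
        string encoded = Convert.ToBase64String(digest).TrimEnd('=');
        return encoded.Substring(encoded.Length - 6);
    }
}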
In order to scale the shortening and the redirection, we can create separate services for each so that they scale independently. The services may be hosted by several identical web servers behind a load balancer. The data may live in a common store that holds all the state. Databases can easily scale to the volume we have estimated. Caches can improve response times, and the most used redirections can easily be served from cache. Other bottlenecks can be addressed similarly.
#codingexercise
Find the number of increasing paths in a matrix:
For example, for the matrix
1 2 
1 3 
we have the increasing paths 1, 1, 2, 3, {1,2}, {1,3}, {2,3}, {1,2,3}.
We can do this with recursion:
// Counts the increasing paths that start at (x, y); dp memoizes the count per cell.
// isValid is the same bounds check shown with the decreasing-paths version above.
int getCount(int[,] matrix, int[,] dp, int x, int y)
{
  if (dp[x,y] != -1)
      return dp[x,y];
  int[] dx = new int[] { 0, 1, -1, 0 };
  int[] dy = new int[] { 1, 0, 0, -1 };
  int result = 1; // the element by itself
  for (int i = 0; i < dx.Length; i++)
  {
     int m = x + dx[i];
     int n = y + dy[i];
     if (isValid(m, n, matrix) && matrix[m,n] > matrix[x,y]) {
          result += getCount(matrix, dp, m, n);
     }
  }
  dp[x, y] = result;
  return result;
}

Wednesday, July 18, 2018

We were discussing software versions and the ideal conditions for maintaining earlier releases. The reality is usually far from this ideal. Although continuous deployment may be available, software releases are generally at least months apart, if not years apart. Such long development cycles leave a significant number of fixes on earlier versions. Customers may also stick to earlier versions while they wait for early adopters to try out the new release. The long development cycles imply quite a divergence between earlier versions and the current release. This divergence is the source of several backward compatibility issues and takes a significant toll on testing and on making suitable fixes in the wake of new development, which further prolongs the development cycle. Even as features are developed in their own branches, a significant amount of time may go into pulling and pushing the parent branches to build the feature on the latest code. More time is spent on conflicts, merges and bug fixes that demand resources and development time. These eat into the resources otherwise reserved for features that help customers, such as the command line interface (CLI), user interface (UI), software development kit (SDK) and API documentation. As corners are cut, features may not have the impact that they otherwise would. One way to relieve development on large code bases with lengthy cycles is to create multiple phases of feature development so that forking and merging happen more frequently.

#codingexercise
Let us say we are given three values N, M and X.
N = number of elements in an array
M = the maximum value any element can take; if A is the array, then 1 <= A[i] <= M for 0 <= i <= N-1
X = the value of the first element
The last element has the value 1
How many such arrays are there in which adjacent elements are distinct?
Let us think about this as a recurrence, with f(i) as the number of ways to form the array up to index i ending in 1, and g(i) as the number of ways ending in a value other than 1. The index i could be occupied by 1 or by a non-1 value.
For the non-1 case, there are M-1 choices if position i-1 ends with 1, and M-2 choices if position i-1 ends with a non-1 (excluding the 1 and the value at i-1), so g(i) = (M-1) * f(i-1) + (M-2) * g(i-1).
Since adjacent elements must be distinct, a 1 at position i can only follow a non-1, so f(i) = g(i-1).
The solution is f(N-1), the count of arrays ending in 1.
Finally, we do not have to enumerate choices for the first element since its value is predetermined to be X; in this problem the first element is not 1, so the base case is f(0) = 0 and g(0) = 1.
For N = 4, M  = 3, X = 2
we have possible arrays by enumeration as [2, 1, 2, 1],  [2, 1, 3, 1] and [2, 3, 2, 1]

In this case, there are three arrays possible.
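A sketch of this recurrence in code follows; the CountArrays name and the two running counts are choices made here for illustration, and M is assumed to be at least 2.

// f = ways for the prefix built so far to end in 1; g = ways to end in a non-1 value.
// Assumes M >= 2 and that the first element is fixed at X.
long CountArrays(int N, int M, int X)
{
    long f = (X == 1) ? 1 : 0;
    long g = (X == 1) ? 0 : 1;
    for (int i = 1; i < N; i++)
    {
        long nextF = g;                              // a 1 can only follow a non-1
        long nextG = f * (M - 1) + g * (M - 2);      // a non-1 follows a 1 (M-1 choices) or a different non-1 (M-2 choices)
        f = nextF;
        g = nextG;
    }
    return f;   // only arrays ending in 1 qualify
}
// CountArrays(4, 3, 2) returns 3, matching the enumeration above.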

Tuesday, July 17, 2018

We were discussing software versions.

Ideally, we have one or two earlier versions to maintain while the next release is being developed. These versions usually differ in the minor part of the dotted-quad notation M.m.b.r, where M stands for major, m for minor, b for branch and r for revision. This is very helpful for the development of the next release because the earlier versions do not differ significantly from the main branch, and the patches made on the earlier branches may all be applicable to the main branch. Since there are not many versions, the software release cycles are short and customers are able to upgrade without delay. This cadence of releases and upgrades means the current main branch has the luxury to break free from earlier designs and take on revolutionary ideas and architecture.

With the introduction of the agile development model and continuous integration and delivery, the software release, branching and versioning strategies were greatly relieved, as there was no waiting period before the main branch could be forked for a software release. The agile model helped small teams remain focused on features, and the features were built incrementally with continuous integration and deployment. Only when all the backlog for the feature was satisfactorily burned down would the feature move into the main branch and the feature branch be deleted. Since the feature was continuously deployable and the feature branch was continuously syncing from the main branch, merging the feature branch into the main branch carries less risk. Moreover, developers spend time on feature development rather than on bug fixes for existing and earlier versions. As the resources and the time for development increase, the business grows because innovation and incorporation are better articulated. Helpful features for customers such as the command line interface (CLI), user interface (UI), software development kit (SDK) and API documentation are greatly improved. Connectors and interfaces that automate some customer chores are developed, and there is more customer mindshare than ever before, potentially increasing the number of customers. This is the virtuous side of sustaining engineering, as long as the maintenance is not onerous.

Sustaining engineering managers like to draw software releases as branches forked from the main branch. Usually these branches are merged back into the main branch or deleted when the earlier releases are no longer supported. However, the migration of fixes from a release branch to the main branch does not need to happen all at once; it can even be done in parallel, as and when each fix is applied to the release branch. The release branch can then be marked for deletion without any further action to fold its fixes into the main branch.

However, the world is far from this ideal situation, and engineers and managers alike grapple to close the gap.
#codingexercise
Find the most frequently occurring gap in an array:
// Returns the most frequent gap between consecutive elements after sorting, or -1 if there are fewer than two elements.
int GetFrequentGap(List<int> A)
{
    A.Sort();
    var gapCount = new Dictionary<int, int>();
    for (int i = 1; i < A.Count; i++)
    {
        int gap = A[i] - A[i-1];
        if (gapCount.ContainsKey(gap)) {
            gapCount[gap] += 1;
        } else {
            gapCount.Add(gap, 1);
        }
    }
    if (gapCount.Count == 0) return -1;
    int max = gapCount.Values.Max();   // requires System.Linq
    foreach (var kvp in gapCount) {
        if (kvp.Value == max)
            return kvp.Key;
    }
    return -1;
}

Monday, July 16, 2018

Versions

Perhaps one of the necessary evils of software releases is that it introduces versions of shipped software. Customers generally take their own time to upgrade their copy of the software, and at any point in time different customers may have different versions of the software from the maker. Therefore, a sustaining engineer is sure to find that a defect appearing in an old version may also manifest in other versions. Consequently, a fix applicable to a defect in one version may also need to be ported to other versions. When the fix is made to the current software and applied retroactively as a patch to former releases, it is called backporting a fix. Likewise, a fix made as a patch to shipped software may need to be made to the current development pipeline so that the defect does not reappear in future releases.
Determining the software versions where the defect occurs, finding a suitable fix that is free of side-effects and safe for those releases, and propagating the fix forward and backward through the versions constitute one of the main tasks of the sustaining engineering team. A large part of this effort involves side-by-side or sequential installation, testing and validation of fixes, and it takes up a lot of time. Yet there are no tools to facilitate these chores for the sustaining engineer. One tool that comes in very handy for tracking these fixes is a source code version control system such as git, where the branching and versioning strategy helps engineers label fixes so that they can be tracked across releases and repeatedly applied to more than one version.
Let us say there are two previous versions of software releases in addition to the current source code version that is in active development for future releases. A fix made to the current version may also need to be made to the earlier versions and vice versa. Yet there is no automation for bringing the fix into these branches unless a branch itself is closed out and carried forward into the others. This creates a significant delay in the appearance of the same fix in the other branches. Moreover, the delay depends on manual actions taken to issue the commands that bring the same fix into these other versions. If the branches were chained and a fix could be tagged as applicable to specified branches, the commands issued to make the fixes in those branches could be automated. Such automation would not only relieve the tasks associated with making fixes in more than one place, it would also help with consistency in performing these steps and in the result at each and every place where they are performed. Since fixes are made to components and components have owners, such fixes could trigger notifications to the component developers so that they are made aware of the flaws that have appeared in the component and affected external customers.
#codingexercise
Find the minimum gap in an array. The largest gap is between the maximum and the minimum:
int GetMinGap(List<int> A)
{
    A.Sort();
    int min = A.Max() - A.Min();   // the largest gap, between the maximum and the minimum; requires System.Linq
    for (int i = 1; i < A.Count; i++)
        if (A[i] - A[i-1] < min)
            min = A[i] - A[i-1];
    return min;
}

Sunday, July 15, 2018

We were discussing different types of query processing. We talked about this in the context of P2P networks, both structured and unstructured, and we saw how Acquisitional Query Processing worked in this case. A distributed structured P2P overlay application uses a distributed hash table over all the peers. Data objects are assigned unique identifiers called keys, which are chosen from the same identifier space. Keys are mapped to a live peer on the overlay network, and create, update, delete and list operations are supported on a {key, value} pair. Each peer maintains a small routing table with the addresses and IDs of its neighbors. Recently there have been some attempts at standardization, with the key-based routing (KBR) API abstractions and OpenHash, an openly available public DHT service, serving as a unification platform. Unstructured P2P overlay systems are generally ad hoc in nature and do not present the same opportunity to be unified under a common platform for application development. There are other ways to describe the differences between these networks, such as:
Decentralization, which determines whether the overlay system is distributed.
Architecture, which describes the overlay system in terms of its operations.
Lookup protocol, which the overlay system uses for queries.
System parameters, which facilitate tuning of the system for various usages.
Routing performance, where the lookup routing protocol performance can be measured.
Routing state, where the state and the scalability of the overlay system can be compared.
Peer join and leave, which describes the behavior of the overlay system under rotation and churn.
Security, which looks into the vulnerabilities of different implementations.
Reliability and fault resiliency, which examine how robust the overlay system is to faults.
With these measures, the networks may demonstrate different scales of performance, which helps with their comparison. For example, the addressing may even be provided at Internet-like scale. The Content-Addressable Network is a distributed P2P infrastructure in which a hash table over the Internet makes it scalable, fault-tolerant and self-organizing.
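To make the key-to-peer mapping concrete, here is a minimal sketch using consistent hashing over a shared identifier space. The Ring class and its members are assumptions made for illustration, not the protocol of any particular overlay described above.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Illustration only: peers and keys hash into one identifier space, and a key
// is owned by the first peer at or after its position, wrapping around the ring.
class Ring
{
    private readonly SortedDictionary<uint, string> peers = new SortedDictionary<uint, string>();

    private static uint Position(string id)
    {
        using (var sha1 = SHA1.Create())
        {
            byte[] digest = sha1.ComputeHash(Encoding.UTF8.GetBytes(id));
            return BitConverter.ToUInt32(digest, 0); // 32 bits of the digest as the ring position
        }
    }

    public void Join(string peerAddress) { peers[Position(peerAddress)] = peerAddress; }

    public void Leave(string peerAddress) { peers.Remove(Position(peerAddress)); }

    // Returns the address of the peer responsible for the given key; assumes at least one peer has joined.
    public string Lookup(string key)
    {
        uint p = Position(key);
        foreach (var kvp in peers)
            if (kvp.Key >= p) return kvp.Value;
        return peers.First().Value; // wrap around to the first peer on the ring
    }
}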
#codingexercise
Find the minimum gap in an array. The largest gap is between the maximum and the minimum:
int GetMinGap(List<int> A)
{
    A.Sort();
    int min = A.Max() - A.Min();   // the largest gap, between the maximum and the minimum; requires System.Linq
    for (int i = 1; i < A.Count; i++)
        if (A[i] - A[i-1] < min)
            min = A[i] - A[i-1];
    return min;
}

Saturday, July 14, 2018

We were discussing the different kinds of query processing:
First, the time-series analysis queries are primarily focused on detecting trends or anomalies in archived streams. They can specify sort order, anomalies and non-contiguous change patterns. Incident detection falls under this category.
Second, the similarity search queries. In this class of queries, a user is interested in determining whether the data is similar to a given pattern. Again using archived data, the pattern matching helps determine events of interest. Surveillance data querying falls under this category.
Third, the classification queries are related to similarity search queries but these run classification algorithms that group and tag the event data. Determining grades of data is one example of this query processing.
Fourth, the signal processing queries. These are heavy computations performed directly on the data, such as the Fast Fourier Transform and filtering. They enable interpretations that are not possible via the grouping, sorting, searching and ranking techniques mentioned earlier.
We talked about these queries in the context of data from the internet of things. There are different data sources. Logs, for instance, fall naturally into a time-series database, and querying logs has given rise to a rich set of search operators and query options.
#codingexercise
We were discussing the method to find pairs that make up the same sum in a sorted list.
// Prints every pair of elements in the sorted list that adds up to the given sum.
void PrintPairsForGivenSum(List<int> sorted, int sum)
{
    validate(sorted);
    for (int i = 0; i < sorted.Count; i++) {
        int index = binarySearch(sorted, sum - sorted[i], i);
        if (index != -1 && index != i) {
            Console.WriteLine("{0} {1}", sorted[i], sorted[index]);
        }
    }
}
Notice that it might be sufficient to iterate the list only up to the element that crosses the value of sum/2, which is an optimization of the method above. The binary search itself already takes O(log n) time.
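The binarySearch helper used above is not shown. One possible definition, interpreting the third argument as the position after which to search (which also ensures each pair prints only once), is sketched below.

// Assumed helper: returns the index of target within sorted, searching only
// positions after lo, or -1 if the target is not present there.
int binarySearch(List<int> sorted, int target, int lo)
{
    int left = lo + 1;
    int right = sorted.Count - 1;
    while (left <= right)
    {
        int mid = left + (right - left) / 2;
        if (sorted[mid] == target) return mid;
        if (sorted[mid] < target) left = mid + 1;
        else right = mid - 1;
    }
    return -1;
}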