Sunday, May 7, 2017

#codingexercise
Count pairs of numbers whose sum is a prime number less than n. Each number in the pair is greater than or equal to 1 and less than n, and the two numbers are not equal to each other.

Get all prime numbers less than n.


Prime numbers can be generated with the sieve of Eratosthenes

This is a very simple way to filter out the prime numbers up to n:
for i ranging from 2 to the square root of n:
     for j ranging over the multiples of i up to n:
          mark j as not prime
those that remain unmarked are the primes.
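A minimal C# sketch of the sieve, assuming we want the list of primes below n (the Sieve name and the List<int> return type are choices made here for illustration):

// Sieve of Eratosthenes: returns all primes less than n.
// Requires using System.Collections.Generic for List<int>.
List<int> Sieve(int n)
{
    var isComposite = new bool[n];                 // false means "assumed prime" to start
    for (int i = 2; i * i < n; i++)
        if (!isComposite[i])
            for (int j = i * i; j < n; j += i)     // mark every multiple of i as not prime
                isComposite[j] = true;

    var primes = new List<int>();
    for (int k = 2; k < n; k++)
        if (!isComposite[k])
            primes.Add(k);
    return primes;
}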
int CountPrimeSumPairs(List<int> primes, int n)
{
    int count = 0;
    for (int i = 0; i < primes.Count; i++)
        for (int j = 0; j < i; j++)
            // the two primes are distinct and their sum must itself be a prime less than n
            if (primes[i] + primes[j] < n && primes.Contains(primes[i] + primes[j]))
            {
                Console.WriteLine("{0},{1}", primes[j], primes[i]);
                count++;
            }
    Console.Write("Count={0}", count);
    return count;
}


Saturday, May 6, 2017

#codingexercise
Find count of all sets in which adjacent elements are such that one of them divides the other. We are given n as the size of the set required and elements can range in value from 1 to m. 
We know adjacency is possible only when one of the two numbers divides the other, so valid sets are counted by extending shorter sets with each factor or multiple of the last value. Let dp[i, j] be the number of valid sets of length i that end with the value j.
For 1 <= j <= m, dp[1, j] = 1.
For 2 <= i <= n and 1 <= j <= m:
   dp[i, j] = 0 +
                  the dp-values of the previous row for every column that is a factor of j +
                  the dp-values of the previous row for every column that is a proper multiple of j (up to m)
For m = 3 and n = 3 we get the dp-matrix as:
1                  1               1
0+1+2 = 3    0+2+0 = 2    0+2+0 = 2
0+3+4 = 7    0+5+0 = 5    0+5+0 = 5
Here the factors of 1, 2, 3 are {1}, {1, 2} and {1, 3} respectively,
and the proper multiples (up to m = 3) of 1, 2, 3 are {2, 3}, none and none respectively.
This gives a total of 17, the sum of the last row, as the number of sets whose elements satisfy the adjacency requirement.
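A minimal C# sketch of this dp (the CountSets name is a choice made here; dp[i, j] counts the valid sets of length i that end with the value j):

int CountSets(int n, int m)
{
    var dp = new int[n + 1, m + 1];
    for (int j = 1; j <= m; j++) dp[1, j] = 1;             // every single value is a valid set of length 1

    for (int i = 2; i <= n; i++)
        for (int j = 1; j <= m; j++)
        {
            for (int f = 1; f <= j; f++)                   // factors of j, including j itself
                if (j % f == 0) dp[i, j] += dp[i - 1, f];
            for (int q = 2 * j; q <= m; q += j)            // proper multiples of j up to m
                dp[i, j] += dp[i - 1, q];
        }

    int total = 0;
    for (int j = 1; j <= m; j++) total += dp[n, j];        // sum of the last row; 17 for n = 3, m = 3
    return total;
}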

Friday, May 5, 2017

We were reviewing the containerization differences in the public clouds. Containerization is an emerging trend in application development and cloud computing usage.
As we review more about containerization, we should take note of some of the classics of application development:

1) Physics laws cannot be violated. All aspects of applications from design to performance are determined by storage and processing requirements.

2) Applications need to be modular so that they can have separation of concerns

3) Applications need not assume anything about storage and processing as they are provisioned on a variety of platform components

4) Layering still makes sense because functionality is tiered, so each layer can be built assuming the lower layer works.

5) Virtualization can be deep but applications don’t have to be chatty.

6) Network communications are still secured by encryption and authentication-related activities

7) Dependencies and backing services continue to be declared and isolated to provide fault zones

8) Logging and other application monitoring patterns continue to assist in the field as their forms have changed.

9) Application best practices have become the norm and are made available in a variety of options.

10) Portability remains important as applications are now being serviced by containers.


Each decision taken by the developer for the application development is weighed against these and other guiding principles.
#codingexercise

If we have a matrix where each cell has a cost to travel and we can only move right or down, what is the minimum cost to travel from the top left of the matrix to the bottom right corner?

int GetCost(int[,] A, int m, int n)
{
    if (n < 0 || m < 0)
        return int.MaxValue;                 // out of bounds
    if (m == 0 && n == 0)
        return A[m, n];                      // reached the top left corner
    return A[m, n] + Math.Min(GetCost(A, m - 1, n), GetCost(A, m, n - 1));
}
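A minimal bottom-up sketch of the same computation, which avoids recomputing overlapping subproblems (the GetCostDP name is hypothetical; it assumes the same cost matrix A with valid row indices 0..m and column indices 0..n):

int GetCostDP(int[,] A, int m, int n)
{
    var dp = new int[m + 1, n + 1];
    dp[0, 0] = A[0, 0];
    for (int j = 1; j <= n; j++) dp[0, j] = dp[0, j - 1] + A[0, j];   // first row: only right moves
    for (int i = 1; i <= m; i++) dp[i, 0] = dp[i - 1, 0] + A[i, 0];   // first column: only down moves
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
            dp[i, j] = A[i, j] + Math.Min(dp[i - 1, j], dp[i, j - 1]);
    return dp[m, n];
}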

Thursday, May 4, 2017

A comparison of Container Services from the public clouds:
      Both AWS and Azure now provide Container Services where applications can be modified to run in containers. AWS ECS is built upon Amazon's own orchestration, while Azure Container Service (ACS) is based on Mesos, an open-source container orchestration system. ECS does not monitor containers at a fine-grained level, but it has integrated containers with CloudWatch; with CloudWatch, we can monitor all activities of resources. AWS supports only Linux containers and not Microsoft containers. The Container Service is bound to native data services, so the containers are not yet portable. Google and Microsoft use Kubernetes but in different ways. Their container service comes with automated upgrade and repair. Kubernetes lays more emphasis on container orchestration and management and little emphasis on automated upgrades and repairs, so this is complementary to the Container Service offerings.
Container images are pulled in from container registries, which may exist within or outside the AWS infrastructure. Images are typically built from a Dockerfile. To prepare applications to run on Amazon ECS, ECS provides task definitions, which map applications to containers along with their container definitions. Each task is instantiated on a container instance within the cluster, and an ECS agent runs on each container instance.
Azure provides an integrated container experience from an application development standpoint. Applications can be published to Docker and actions can be taken on the containers without ever having to leave the IDE. Directly from the Windows marketplace, users can now deploy a Windows Server 2016 image with the Docker engine installed and the container feature enabled. Even a VM can be created on the local Hyper-V host to act as the container host. Both Docker and PowerShell can be used to interact with the containers.
Public cloud adoption of container services is emerging, so it is picking up speed, but not as much as the virtual machine services. This is a trend that is also seen with private clouds. Some of the trends seen with private clouds are actually quite relevant to the study of user habits. These trends include the following:
Compute users and application developers are increasingly divergent. The former require single instances of compute with a specific environment (OS, flavor and version) while the latter generally favor clusters and containerization.
Individual instances of compute no longer matter where they are carved from. Users do not distinguish between Linux host containers, virtual machines hosted on OpenStack or virtual machines hosted on VMware.
Application developers prefer clustering with load balancing to host their application backing services such as databases, which therefore go to volumes shared between nodes rather than dedicated storage or clusters in the organization.
OS containerization enables the possibility of seamlessly moving users and their compute resources over regions and networks



#codingexercise
1) If we are given a list of stations and distances between them, can we find the all pairs shortest distance ? 
Yes, the Floyd-Warshall method gives the all-pairs shortest path weight matrix. This algorithm computes the shortest path weights in a bottom-up manner. It exploits the relationship between a pair of vertices and the intermediary vertices that their shortest paths pass through. If there is no intermediary vertex, then such a path has at most one edge and the weight of that edge is the minimum. Otherwise, the minimum weight is the smaller of the path from i to j that avoids vertex k and the path that goes from i to k and then from k to j. Thus the algorithm iterates over each intermediary vertex for every entry of the given N*N matrix to compute the shortest path weights.
d_ij(0) = w_ij, the edge weight, when no intermediary vertices are allowed
d_ij(k) = min(d_ij(k-1), d_ik(k-1) + d_kj(k-1)) when intermediary vertices 1..k are allowed, k >= 1
  
Let D(0) be a new n x n matrix initialized with the edge weights (0 for a node to itself)
for k = 1 to n
   for i = 1 to n
      for j = 1 to n
         d_ij(k) = min(d_ij(k-1), d_ik(k-1) + d_kj(k-1))

return D(n)
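A minimal C# sketch of the same triple loop (the FloydWarshall name and the use of int.MaxValue / 2 to stand for "no edge" are assumptions made here for illustration):

// w is the n x n weight matrix; w[i, j] = int.MaxValue / 2 means there is no edge i -> j.
int[,] FloydWarshall(int[,] w, int n)
{
    var d = new int[n, n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            d[i, j] = (i == j) ? 0 : w[i, j];              // distance of a node to itself is 0

    for (int k = 0; k < n; k++)                            // allow vertex k as an intermediary
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (d[i, k] + d[k, j] < d[i, j])
                    d[i, j] = d[i, k] + d[k, j];
    return d;
}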

Wednesday, May 3, 2017

We talked about the twelve factors that enable applications to be hosted on a distributed environment with containerization so that they can scale. Let us now look at cluster services and abstractions that serve these applications. Such a use case touches on a majority of the considerations for scheduling and workflows. Deis Workflow is a good example of these services.
Let us look at the components of the Deis Workflow:
The workflow manager – checks your cluster for the latest stable components and detects whether any components are missing. It is essentially a Workflow Doctor, providing first aid to a Kubernetes cluster that requires servicing.
The monitoring subsystem which consists of three components – Telegraf, InfluxDB, and Grafana. The first is a metrics collection agent that runs using the daemon set API. The second is a database that stores the metrics collected by the first. The third is a graphing application, which natively supports the second as a data source and provides a robust engine for creating dashboards on top of time-series data.
The logging subsystem which consists of two components – the first handles log shipping and the second maintains a ring buffer of application logs.
The router component which is based on Nginx and routes inbound https traffic to applications. This automatically includes a cloud-based load balancer.
The registry component which holds the application images generated from the builder component. 
The object storage component where the data that needs to be stored is persisted. This is generally an off-cluster object storage.
Slugrunner is the component responsible for executing build-pack based applications. The slug is sent from the controller, which helps the Slugrunner download the application slug and launch the application.
The builder component is the workhorse that builds your code after it is pushed from source control.
The database component which holds the majority of the platform state. It is typically a relational database. The backup files are pushed to object storage. Data is not lost between backup and database restarts.
The controller which serves as the http endpoint for the overall services so that CLI and SDK plugins can be utilized.
Deis Workflow is more than just an application deployment workflow, unlike Cloud Foundry. It performs application rollbacks, supports zero-downtime app migrations at the router level, and provides scheduler tag support that determines which nodes the workloads are scheduled on. Moreover, it runs on Kubernetes, so other workloads can be run on Kubernetes alongside these workflows. Workflow components have a "deis-" namespace that tells them apart from other Kubernetes workloads and provide building, logging, release and rollback, authentication and routing functionalities, all exposed via a REST API. In other words, it is a layer distinct from Kubernetes. While Deis provides workflows, Kubernetes provides orchestration and scheduling.

The separation of workflows from resources and a built-to-scale design are patterns that will serve any automation well.
#codingexercise
Recursively determine whether a string is a palindrome:
bool isPalin(string A, int start, int end)
{
    Debug.Assert(start <= end);                  // requires using System.Diagnostics
    if (start == end || start == end - 1) return true;
    if (A[start] != A[end]) return false;
    return isPalin(A, start + 1, end - 1);
}
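For example, a quick check of the corrected function above (the sample strings are arbitrary):

Console.WriteLine(isPalin("racecar", 0, 6));   // true
Console.WriteLine(isPalin("coding", 0, 5));    // false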

Today I found profiling quite useful to troubleshoot an issue:
python -m cProfile -o ~/profile.log alloc.py

import pstats
p = pstats.Stats('profile.log')
p.strip_dirs().sort_stats(-1).print_stats()

Wed May  3 15:21:45 2017    profile.log

         105732 function calls (102730 primitive calls) in 1.271 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.000    0.000 SSL.py:1438(send)
        2    0.000    0.000    0.953    0.476 SSL.py:1502(recv)


Tuesday, May 2, 2017

We were discussing an API for a scheduler that executes jobs. Services like that have the same notion at any level or scale, so instead of being just a microservice, the scheduler can also be considered as a set of workflows. Take the example of Deis Workflow and we immediately see the benefit of expanding into a distributed service of any scale.
Before we review the Workflows, let us review the application needs. Applications are being built with a methodology that demands containerization. This methodology involves twelve factors that allow applications to scale across distributed systems. The Twelve-Factor App methodology is the distilled essence of hosting software as services on the Heroku platform:
These are briefly enumerated below but more can be read on https://12factor.net
1.       Codebase – version control the software and deploy many times.
2.       Dependencies – Declare the dependencies and isolate them.
3.       Config – Configuration is part of the environment not the source.
4.       Backing services – All such services are satellite resources.
5.       Build, Release and Run – Build time and run time are separated
6.       Processes – The application is executed as stateless as possible. Think REST APIs
7.       Port Binding – Services are exported with the ports that they bind to
8.       Concurrency – The application is scaled out by forking more workers to the reentrant code
9.       Disposability – Applications are maximized for robustness with fast startup and graceful shutdown
10.   Dev/Prod parity – Each environment is similar to the other as much as possible. This improves maintenance
11.   Logging – All logging is treated as an event stream
12.   Admin processes – Admin/management tasks are run as one-off processes.
Once the application is in this format, it can be hosted on a distributed environment with Kubernetes containerization. Kubernetes is an open-source cluster manager that was developed by Google and donated to the Cloud Native Computing Foundation. Kubernetes provides automation of cluster activities for services such as state convergence, service addresses, health monitoring, service discovery and DNS resolution.
While Kubernetes provides abstractions like Services, Deployments and Pods, Workflow builds upon them. Workflow adds the features of building containers from application code, aggregating logs, and managing deployment configurations and app releases. In fact, Deis Workflow is all made up of Kubernetes components.
In terms of layers, Workflow therefore sits at the very top of the stack, comprising Workflow, Orchestration, Scheduling, Container Engine, Compute Operating System, Virtual Infrastructure and Physical Infrastructure as the seven layers.
#codingexercise
Here's an alternative way to count the occurrences of a given digit in sequential numbers up to a given number.
We can think of this as combinations with repetitions of r items among n, which equals ((n+r-1) choose r) by the stars and bars theorem, where r ranges from 1 to n (r <= n).
So we sum the combinations with repetitions for each value of r for a given n.
We repeat the above for N ranging from a single digit to as many digits as in the given number.

Note that some of these generated numbers may be larger than the given number. Therefore it is better to repeat the procedure above by fixing the first digit when the combination length is the same as the number length.

Monday, May 1, 2017

Jobs API:
Many applications require tasks that can be executed without user involvement. These tasks all follow the same pattern of repeated executions over time or based on some state that is already recorded with the task. Any worker can pick up the task and execute it to completion or failure and finalize the task. This is the nature of these tasks.
Therefore a service that can schedule these jobs or enable their execution would be helpful because jobs can then be submitted over HTTP. This service has an endpoint such as:
jobs.paas.corp.xyz.com
The service is primarily a scheduler but not a message broker. The latter serves a similar purpose, executing jobs by putting them in a message queue, however each queue has its own producer and consumer. Here we are talking about a service that has a single queue and manages hybrid tasks in it. It also puts these jobs in the queue again and again as the task needs to be executed repeatedly. The number of times a task is repeatedly executed, or the interval after which it is executed, depends on the scheduler.
Jobs are handled in one of two ways: they are either successfully executed or they are marked as failed. When the jobs are marked as failed, any helpful error message or exception may also be captured with the task. The states of the task move forward progressively from initialized, started, processing, completed or failed, and closed. The granularity of the states and the progression depend exclusively on the scheduler and the kind of tasks processed. The data structures for storing the state of the job, the results and the associated action description depend on the tasks. The scheduler follows the command design pattern to invoke the jobs. Each job is granular and flat with no hierarchy even though the associated tasks may be heterogeneous.
Jobs metadata can keep track of logic to execute in the form of location of scripts or assemblies. These are stored in a database along with the job description. The parameters are passed to the logic at the time of scheduler invocation. To make it easier for the scheduler to identify the entry point of the function and the call parameters, some decorators may be allowed to be used with the logic defined. All associated files or auxiliary data may be packaged as a tar ball if necessary.
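As a purely illustrative sketch (the type names and fields below are hypothetical, not part of any existing service), the job record and the state progression described above might be modeled as:

// Hypothetical job record for the scheduler; the enum mirrors the state progression
// mentioned above: initialized, started, processing, completed or failed, and closed.
enum JobState { Initialized, Started, Processing, Completed, Failed, Closed }

class Job
{
    public Guid Id { get; set; }
    public string Description { get; set; }           // what the task does
    public string ScriptLocation { get; set; }        // location of the scripts or assemblies to execute
    public string Parameters { get; set; }            // parameters passed to the logic at invocation time
    public JobState State { get; set; }
    public string Error { get; set; }                  // captured when the job is marked failed
    public DateTime? NextRunAt { get; set; }           // when the scheduler should run the task again
    public int RepeatCount { get; set; }               // how many times the task is to be repeated
}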
As with most services that have REST-based API endpoints, a command line plug-in or a language library with constructs that make it easier to call the REST API may also come in handy. It should be noted however that the service is helpful only within a narrow domain and not as a general purpose library such as the likes of the Celery package. The idea is that this service will fill the gap of a global scheduler for the APIs so that they can be relieved of tasks that can be executed in the background, making the APIs more responsive. Most applications and portals manage their own cron jobs, but APIs have generally relied on a message broker. This service brings both functionalities exclusively to API developers as an internal automation. It does not need to interact with users directly.
Micro service models are thin and streamlined to avoid any dependencies on other applications and this remains true for the users of those micro services. Inside a micro service, a database, a message broker or other APIs may generally be called to complete the overall goal of the micro service. Since the scheduler is another API to be called by a micro service, there is no conflict with the notion of a micro service.

It can be argued that this scheduler service may be replaced by a data structure in a shared database for use by all services.  A Jobs table for example could allow logic to be local to the micro-service while allowing consistency and accounting of the jobs in a shared repository. Whether the jobs remain in a table or in a message broker queue, it will require servicing that can all be consolidated at one point. This is where a service API comes helpful because other services may frequently ignore or miss the uniform bookkeeping that is required for the servicing of these jobs. In conclusion, the purpose of such a scheduling service is determined by the existing use cases already in practice.
#codingexercise
Find the count of digits that are odd in sequential numbers up to n:
int GetCount(int n)
{
    int count = 0;
    for (int i = 0; i <= n; i++)
    {
        var digits = i.ToDigits();                   // split i into its decimal digits
        count += digits.Count(t => t % 2 == 1);      // requires using System.Linq
    }
    return count;
}
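ToDigits above is assumed to be a small helper; a possible implementation as an extension method is sketched below (requires using System.Collections.Generic):

static class IntExtensions
{
    // Splits a non-negative integer into its decimal digits (least significant first).
    public static List<int> ToDigits(this int n)
    {
        var digits = new List<int>();
        if (n == 0) { digits.Add(0); return digits; }
        while (n > 0)
        {
            digits.Add(n % 10);
            n /= 10;
        }
        return digits;
    }
}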