Tuesday, January 16, 2018


Techniques for HTTP Request Tracing at a web server
Introduction: When we point a browser at a website, we hardly realize how many HTTP request-response exchanges go into loading the page. Still, it is easy to ask the browser to capture and dump these exchanges into an archive file. The web server, on the other hand, handles many such sessions from different browsers, corresponding to the millions of users it serves. Consequently, its logs show a mix of requests from different origins at any given time, and little of the history can be reconstructed directly. The following are some of the techniques used to tag requests so that their sequence can be determined.
Description: Every request to a web server is a discrete item of work for it. Since these items may be referred to later in the logs, each is issued an ID, often called a RequestID, or RID for short. No two requests share an ID, and it is generally not possible to exceed, say, a trillion IDs in a given duration no matter how high the load, after which the sequence can roll over. RIDs are therefore a convenient way to look up requests in the logs. Logs also happen to be a convenient destination for publishing this data from any production-category server, and they are well suited to subsequent translation and interpretation by analysis systems that can pull the data without affecting the business-critical systems.
One way to trace request-response sequences is to establish a previous-next link between the requests. Across a series of redirects to itself, a server may stamp the RID into an additional field designated as the previous RID. If it is a new request from a user and there is no current session, there is no previous RID to start with. As the conversation grows, the RID is piggybacked on each response so that new requests formed at the server can carry the previous RID forward. This lets us establish a trace.
Another way is for the server to reuse the session ID that the requests are part of. Unfortunately, sessions are generally restricted to the domain the web server is hosted in, which does not help with cross-domain scenarios unless we include additional IDs. Since it is difficult to maintain a unique integer ID across disparate servers, session IDs are generally Universally Unique Identifiers (UUIDs), which have a predefined format that almost all systems know how to generate.
Another technique is for the server to make a stash of a request, which can be looked up in a stash store with a key. The stashes are generally encrypted and can include values that may not be logged for privacy reasons. They can be purged as easily as logs are discarded. Stashes are made for high-value requests, and they may be maintained in a list that is piggybacked and carried forward through each conversation.
The above techniques emphasize two key concepts. First, the web server is the only party that can determine the sequence, either by chaining the requests or by propagating a common identifier, be it scoped to the session or across domains. Second, the command-line tools that analyze the logs can be smart enough to search by pattern or form rich search queries that elicit the same information. This separates the concerns and keeps the business-critical systems from having to deal with the more onerous tasks. That said, there are few command-line tools that can discover and render chaining, as most are still aimed at regular-expression-based search.
Conclusion – Request chaining and chain-discovery tools are helpful for tracing requests and responses from logs and stores.
#codingexercise
Get the Fibonacci number by tail recursion. A tail recursion is one where the recursive call is the last statement executed inside the function.
// Accumulators a and b carry the previous two Fibonacci values, so the
// recursive call is the final operation and needs no work after it returns.
uint GetTailRecursiveFibonacci(uint n, uint a = 0, uint b = 1)
{
    if (n == 0)
        return a;
    if (n == 1)
        return b;
    return GetTailRecursiveFibonacci(n-1, b, a+b);
}
0 1 1 2 3 5 8 13 21 34
Tail recursion does not sum the results of recursive calls after they return; the accumulation is carried forward in the arguments instead.

Monday, January 15, 2018

File Descriptors on steroids continued.
A library or SDK that manages cloud file operations, and their logical operations, within the process itself gives a retail business more ability to own and manage its logic without having to rely on a shared architecture. 
Proposed File Descriptors: 
Benefits: 
  Free from public cloud policies, restrictions and the management of storage-tier tasks 
  Flexible design with plugin extensions for every aspect of the storage framework, where plugins can be appliances so long as they comply with a new wire-based file system protocol to be designed 
  A unified new file system protocol that spans NFS for Unix, CIFS/Samba for Windows, rsync for replication, and aging and deduplication protocols, promoting interoperability 
  An in-process library avoids the overhead of wire-level protocols, making it more performant and efficient 
  Works with ZeroMQ for fan-out, pub-sub, task-distribution and request-reply models 
  
Drawbacks: 
  Clients have to build additional features themselves 
  Greater control of cloud resources comes at increased TCO 
  Vendor-agnostic use of public cloud resources means forgoing native, vendor-specific services 
  
Differentiation: 
Public cloud provider services such as Microsoft Azure StorSimple also provide a framework for storage services. 
Indeed, StorSimple is documented to meet the needs of performance- and capacity-centric applications and to give a complete hybrid cloud storage solution for enterprises, with both physical arrays for deployment and virtual arrays for satellite offices that rely on network-accessible storage. While StorSimple expands on the use of network-attached storage for point-to-point connectivity, we discuss a stripped-down version as an embedded library. This does not require a server running on every computer as part of its operating system implementing all the features of a distributed file system. Instead, it focuses on giving the capability to write to files in a way that works with cloud resources without utilizing their native cloud storage services. For that matter, this library does not restrict itself to providing DFS and can also include protocols such as those for deduplication and rsync. 
#codingexercise
Get the sum of the binomial coefficients of degree n
static int GetBinomialCoefficientsSum(int n)
{
            // C(n,0) + C(n,1) + ... + C(n,n) = 2^n, which covers n == 0 as well
            return 1 << n;
}
This equals the row-wise sum of Pascal's triangle.

Sunday, January 14, 2018

File Descriptors on steroids
Introduction: The data structures for persisting and organizing data on disk have remained the same as desktops evolved from mainframes. However, as more and more data migrates from the desktop to cloud storage, the operating system artifacts of the desktop seem woefully inadequate for the features of the cloud, such as elasticity, redundancy, availability, aging, deduplication, conversion, translation, parsing, sealing extents, IO scheduling and, finally, replication. While some of these may be mitigated by higher-level applications and services, that brings in players and vendors with different areas of emphasis. A retail company wanting to own a business-specific service at cloud scale starts relying on more technological notions and disparate systems while its minimal in-process compute and storage requirements go unaddressed. Instead, if the service were given primitives that looked and behaved the same as they do at the individual compute resource level, they would also work with resources at cloud scale, so long as the primitives are made smarter.
Implementation: We have solved the problem of communications via peer-to-peer networking and message queuing, not just to multicast but to implement publisher-subscriber models. Moreover, we have made networking and concurrency into libraries that expand on the definitions of their corresponding primitives and can be hosted in-process. We offer notions of background processing with libraries like Celery, which application developers and enterprise architects use to offload intensive processing away from end customers. They, however, end up creating fatter servers and dumb clients that can work anywhere on any device. If we contrast this model, we could have more uniform processing in peer nodes, on demand, if there were operating system primitives that supported the existing and new cloud functionalities from the operating system's notion of a process itself.
Conclusion: Enhanced File descriptors transcend clouds and bring the cloud capabilities in process.
#codingexercise
Find the Jacobsthal-Lucas number
uint GetJacobsthalLucas(uint n)
{
    if (n == 0) return 2;
    if (n == 1) return 1;
    return GetJacobsthalLucas(n-1) + 2 * GetJacobsthalLucas(n-2);
}
2 1 5 7 17 31 ...
While Pascal's triangle yields the Fibonacci numbers along its shallow diagonals, the Jacobsthal-Lucas numbers can similarly be computed from adjacent numbers along the diagonals.
The Jacobsthal and Jacobsthal-Lucas numbers are both Lucas sequences.

Saturday, January 13, 2018

Today we resume our discussion of the AWS papers on software architecture, which suggest five pillars:
- Operational Excellence for running and monitoring business-critical systems.
- Security to protect information, systems, and assets with risk assessments and mitigation strategies.
- Reliability to recover from infrastructure or service disruptions.
- Performance Efficiency to ensure efficiency in the usage of resources.
- Cost Optimization to help eliminate unneeded cost and keep the system trimmed and lean.
The guidelines to achieve the above pillars include:
1. Infrastructure capacity should be estimated not guessed
2. Systems should be tested on production scale to eliminate surprises
3. Architectural experimentation should be made easier with automation
4. There should be flexibility to evolve architectures
5. Changes to the architecture should be driven by data
6. Plan for peak days and test at these loads to observe areas of improvement
In AWS, the architecture is set by individual teams that demonstrate best practice. These guidelines, driven by data for building systems at internet scale, are shared with a virtual team of principal engineers who peer-review each other's designs and showcase them. This is reinforced in the following ways:
First, the practices focus on enabling each team to have this capability.
Second, mechanisms that carry out automated checks ensure that the intentions are met.
Third, the culture works backward from the value to the customer across all roles.
The internal review processes and the mechanisms to enforce compliance are widely adopted.
We will start reviewing the guidelines in greater detail, but let us take a moment to note the pushback such initiatives encounter:
Teams often have to get ready for a big launch, so they don't find the time.
Even if they did get all the results in from the mechanisms, they might not be able to act on them.
Sometimes the teams don't want to disclose their internal mechanisms.
Each of these objections, however, is fallacious.
#codingexercise
Find the Jacobsthal number
uint GetJacobsthal(uint n)
{
    if (n == 0) return 0;
    if (n == 1) return 1;
    return GetJacobsthal(n-1) + 2 * GetJacobsthal(n-2);
}
0 1 1 3 5 11 ...
While Pascal's triangle yields the Fibonacci numbers along its shallow diagonals, the Jacobsthal numbers can similarly be computed from adjacent numbers along the diagonals.

Friday, January 12, 2018

Today we will take a break from AWS architecture papers to discuss wire framing tools.
UI Customizations: 

Introduction 
The web page of an application is proprietary to it. When the web page is displayed to the user as part of a workflow for another application using HTTP redirects, the pages have to be customized for a seamless user experience. This involves propagating the brand or logo through the cross-domain pages and modifying the text, links and user controls on the shared page. A classic example is the login page. Many clients want to customize the login page to suit their needs. Consequently, the login provider needs to provide some sort of wireframing tool to help these clients. We discuss the design of one such wireframing tool. 
Design: 
A wireframing tool that enables an in-browser experience works universally on all devices. Therefore, hosting the wireframed pages and templates as a web application will be helpful to most clients. Additionally, if this web application took the form of REST-based APIs, clients could build tools for themselves that don't necessarily carry the restrictions a provider may have. A sample API could look something like 
GET displays/v1/prototypes/{id}/pages/{id}.json?api_key=client_dev_key 
Here we use JSON notation to describe the wireframing artifacts. These include entities for prototypes, pages and layers. A prototype is the container for all pages and layers associated with the mockup the client would like to achieve. It has attributes such as name, version, timestamps, etc. A page has similar attributes but contains elements and might also show layers. A page is the unit of display in a workflow when the user navigates from one user interface to another. A layer is a group of elements that can be shared between pages. It is similar in nature to a template and is used in conjunction with a page. 
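An illustrative payload for such a page resource might look like the following (all field names and values here are invented for the sketch):

```json
{
  "prototype": { "id": "p-100", "name": "login-rebrand", "version": 3 },
  "page": {
    "id": "pg-7",
    "name": "login",
    "elements": [
      { "type": "text", "id": "title", "value": "Sign in" },
      { "type": "control", "id": "submit", "value": "Continue" }
    ],
    "layers": [ "header-brand" ]
  }
}
```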
As with all REST APIs, it is possible to create, update and delete these pages so that clients can manage their lifetimes. The purpose of organizing this way is to keep the wireframes simpler to compose. 
The styling and scripting are elements that can be added separately or as strings embedded in placeholders. The placeholder itself works with identifiers and can therefore have any text displayed with it, as a literal or as an HTML element. Since the wireframing tool renders them, clients have full flexibility for custom injection. 
Workflows are enabled by associating pages with controls. Since this is a wireframe only, data from one page is passed to the next via the specified association. This provides a seamless replay of pages for a workflow. 
Conclusion 
A wireframing tool is useful for user interface customization and can work with different clients for the same shared resource. 


#codingexercise
We were discussing finding the largest sum submatrix in a matrix of integers.
As we evaluated the growing range from column to column, we iterated inwards to outwards but the reverse is more efficient.

The advantage here is that the outer bounds of the entire matrix have the largest possible array to which to apply Kadane's algorithm, which we discussed earlier as the one used for finding the max subarray.

Since we find the maximum subarray (i1, i2) along the rows and (j1, j2) along the columns at the first and last row and column of the matrix, the submatrix with a large sum will lie within (i1, 0), (i1, col-max), (i2, 0) and (i2, col-max), or within (0, j1), (0, j2), (row-max, j1) and (row-max, j2). We merely refine it with every iteration that shrinks the matrix.

Note that we can take the common minimum (i-topleft, j-topleft) and (i-bottom-right, j-bottom-right) from the first and last row and column subarray determinations.

In this case we could terminate if the entire bounded subarray contributes to the max sum. The idea is to use row-major analysis and column-major analysis one after the other and pick the better answer of the two. The notion that a linear-dimension subarray sum will contribute towards the solution lets us exclude the arrays that won't contribute to the result. However, since the values can be skewed, we have to iterate in steps along one dimension only, updating the max sum as we go.

Thursday, January 11, 2018

Today we continue discussing the AWS papers on software architecture, which suggest five pillars:
- Operational Excellence for running and monitoring business-critical systems.
- Security to protect information, systems, and assets with risk assessments and mitigation strategies.
- Reliability to recover from infrastructure or service disruptions.
- Performance Efficiency to ensure efficiency in the usage of resources.
- Cost Optimization to help eliminate unneeded cost and keep the system trimmed and lean.
The guidelines to achieve the above pillars include:
1. Infrastructure capacity should be estimated not guessed
2. Systems should be tested on production scale to eliminate surprises
3. Architectural experimentation should be made easier with automation
4. There should be flexibility to evolve architectures
5. Changes to the architecture should be driven by data
6. Plan for peak days and test at these loads to observe areas of improvement
It should be noted that Amazon's take on architecture is that there is no need for the centralized decisions described by TOGAF or the Zachman Framework. TOGAF is a framework that pools requirements from technology vision, business demands, deployment and operations, implementations, migrations and upgrades to form central recommendations. The Zachman Framework provides a table with columns such as why, how, what, who and when, and rows as layers of the organization, so that nothing is left out of the architectural considerations. There are definite risks associated with distributed recommendations, where internal teams may not adhere to strict standards. This is mitigated in two ways: first, there are mechanisms that carry out automated checks to ensure standards are being met, and second, a culture that works backward from customer-focused innovation.
#codingexercise
We were discussing finding the largest sum submatrix in a matrix of integers.
As we evaluated the growing range from column to column, we iterated inwards to outwards but the reverse is more efficient.

The advantage here is that the outer bounds of the entire matrix have the largest possible array to which to apply Kadane's algorithm, which we discussed earlier as the one used for finding the max subarray.

Since we find the maximum subarray (i1,i2) along the rows and (j1,j2) along the columns at the first and last row and column of the matrix, the submatrix with the largest sum will lie within (i1,j1) and (i2,j2). We merely refine it with every iteration that shrinks the matrix.

Note that we can take the common minimum (i-topleft, j-topleft) and (i-bottom-right, j-bottom-right) from the first and last row and column subarray determinations.

Wednesday, January 10, 2018

We start reading AWS whitepapers. We begin with the AWS well-architected framework.
This is based on five pillars:
Operational Excellence, which is the ability to run and monitor systems that are business-critical and deliver functionality to customers. This pillar also includes all the processes and procedures that come with software support.
Security, which is not only a non-functional requirement for cloud-hosted software but also a mandate from users and governance. It includes the ability to protect information, systems, and assets with risk assessments and mitigation strategies.
Reliability, which is the pillar that safeguards against failures, both sporadic and frequent. This is the ability of a system to recover from infrastructure or service disruptions, to handle failures by acquiring and seamlessly moving to computing resources to meet demand, and to mitigate disruptions such as misconfigurations or transient network issues.
Performance Efficiency, which is the pillar that ensures efficiency in the usage of computing resources to meet system requirements and maintains it in the face of business and technology changes.
Cost Optimization, which is the pillar that helps eliminate unneeded cost and keeps the system trimmed and lean.
This paper bases the guidelines for a sound architecture on the following:
1. Infrastructure capacity should be estimated not guessed
2. Systems should be tested on production scale to eliminate surprises
3. Architectural experimentation should be made easier with automation
4. There should be flexibility to evolve architectures
5. Changes to the architecture should be driven by data
6. Plan for peak days and test at these loads to observe areas of improvement

#codingexercise
We were discussing finding the largest sum submatrix in a matrix of integers.
As we evaluated the growing range from column to column, we iterated inwards to outwards but the reverse is more efficient.

The advantage here is that the outer bounds of the entire matrix has the largest possible array to apply Kadane's algorithm which we discussed earlier as the one used for finding the max subarray.

Since we find the maximum subarray (i1,i2) along the rows and (j1,j2) along the columns at the first and last row and column of the matrix, the submatrix with the largest sum will lie within (i1,j1) and (i2,j2). We merely refine it with every iteration that shrinks the matrix.