Thursday, January 18, 2018

Today we resume our discussion of the AWS papers on software architecture, which suggest five pillars:
- Operational Excellence for running and monitoring business-critical systems.
- Security to protect information, systems, and assets with risk assessments and mitigation strategies.
- Reliability to recover from infrastructure or service disruptions.
- Performance Efficiency to ensure efficient use of resources.
- Cost Optimization to help eliminate unneeded cost and keep the system trimmed and lean.
The guidelines to achieve the above pillars include:
1. Infrastructure capacity should be estimated, not guessed.
2. Systems should be tested at production scale to eliminate surprises.
3. Architectural experimentation should be made easier with automation.
4. There should be flexibility to evolve architectures.
5. Changes to the architecture should be driven by data.
6. Plan for peak days and test at these loads to observe areas of improvement.
We look at the security pillar today:
The security pillar emphasizes protection of information, systems and assets. There are six design principles for security in the cloud:
It implements a strong identity foundation where privileges are granted on an as-needed basis and there is separation of duties. It centralizes privilege management and reduces the use of long-term credentials.
It also monitors, alerts, and audits actions so that teams can respond and take corrective action.
It applies security at all layers, not just at the edge, so that the full impact radius is covered.
It automates the mechanisms necessary for security controls and restrictions.
It protects data in transit and at rest with the use of access tokens and encryption.
It prepares for security events so that incident response is ready when many users are affected.

#codingexercise
Check if a number is a Fibonacci number:
bool IsFibonacci(uint n)
{
    return IsSquare(5*n*n + 4) || IsSquare(5*n*n - 4);
}
A way to test for squares is to binary chop the candidate roots until we find one whose square equals the required value.
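Below is a minimal sketch of that binary chop; IsSquare is the helper assumed by IsFibonacci above, and the 64-bit arithmetic is an assumption to keep 5*n*n + 4 from overflowing for moderate n.

#include <cstdint>

bool IsSquare(uint64_t x)
{
    // binary chop over candidate roots; any 64-bit square root fits below 2^32
    uint64_t lo = 0, hi = 4294967295ULL;
    while (lo <= hi) {
        uint64_t mid = lo + (hi - lo) / 2;
        uint64_t sq = mid * mid;
        if (sq == x) return true;
        if (sq < x) lo = mid + 1;
        else if (mid == 0) break;
        else hi = mid - 1;
    }
    return false;
}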

Wednesday, January 17, 2018

File descriptors on steroids continued
Design:

While the file descriptor was inherently local to the process, DFS allowed it to point to a file on a remote computer. Likewise, file system protocols such as CIFS allowed remote servers to be connected through Active Directory, with access granted to users registered there. Deduplication worked on segments that were identical so that space could be conserved. The rsync protocol helped replicate between source and destination regardless of whether the destination was a file system or an S3 endpoint. In all these tasks, much of the operations were asynchronous and involved a source and a destination. This library utilizes the ZeroMQ messaging library for file system operations.
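As a rough illustration of driving a file operation through ZeroMQ, the sketch below sends a hypothetical "READ <path>" request over a REQ socket using the standard libzmq C API and waits for the reply. The endpoint, the message format, and the behavior of the service on the other end are assumptions for illustration, not the library's actual protocol.

#include <zmq.h>
#include <cstdio>
#include <cstring>

int main()
{
    void *ctx = zmq_ctx_new();
    void *sock = zmq_socket(ctx, ZMQ_REQ);
    zmq_connect(sock, "tcp://localhost:5555");      // hypothetical file service endpoint

    const char *request = "READ /tmp/example.txt";  // hypothetical request format
    zmq_send(sock, request, strlen(request), 0);

    char reply[65536];
    int n = zmq_recv(sock, reply, sizeof(reply) - 1, 0);  // blocks until the reply arrives
    if (n >= 0)
        printf("received %d bytes\n", n);

    zmq_close(sock);
    zmq_ctx_destroy(ctx);
    return 0;
}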
Performance:
ZeroMQ has demonstrated performance for communications. Stacking file operations from a storage solution perspective over this library can also meet the stringent requirements of cloud-level operations. The implementation might vary in purpose, scope, scale, and management as we add plugins for a client, but the assumption that asynchronous operations on a remote file will not be hampered by ZeroMQ remains sound.

Security:
Enhanced file descriptors are inherently as secure as sockets. File system utilities that secure files continue to work because these descriptors behave the same as regular ones to the layers above.

Testing:
The implementation for this storage framework must be able to process a hundred thousand requests per second, with message sizes ranging from 0 to 64 KB, for a duration of one hour with little or no degradation in write latency for writes to a million files. Checksums may be used to verify that the files are correct. Testing might require supportability features in addition to random file writes. The statistics, audit log, history, and other management aspects of the queue should be made available for pull via web APIs.
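Since checksums are called out as the verification step, here is a minimal sketch of one way a test harness might checksum a written file; the choice of FNV-1a and the function name are assumptions for illustration, not part of the framework above.

#include <cstdint>
#include <fstream>
#include <string>

// compute a 64-bit FNV-1a hash over the file contents
uint64_t ChecksumFile(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    uint64_t hash = 0xcbf29ce484222325ULL;   // FNV offset basis
    char c;
    while (in.get(c)) {
        hash ^= static_cast<unsigned char>(c);
        hash *= 0x100000001b3ULL;            // FNV prime
    }
    return hash;
}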

Conclusion:
With smart operating system primitives we can enhance each process to give more power to individual businesses.
#codingexercise
Get Fibonacci number
We compared the following:
uint GetTailRecursiveFibonacci(uint n, uint a = 0, uint b = 1)
{
    if (n == 0)
        return a;
    if (n == 1)
        return b;
    return GetTailRecursiveFibonacci(n-1, b, a+b);
}
with the conventional:
uint GetFibonacci (uint n)
{
    if (n == 0)
        return 0;
    if (n == 1)
        return 1;
    return GetFibonacci(n-1) + GetFibonacci(n-2);
}
0 1 1 2 3 5 8 13 21 34



Tuesday, January 16, 2018


Techniques for HTTP Request Tracing at a web server
Introduction: When we point our browser to a website, we hardly realize the number of exchanges of HTTP requests and responses that load the page. Still, it is easy to ask the browser to capture and dump the request-responses in an archive file. The web server handles many such sessions from different browsers corresponding to the millions of users using it. Consequently, its logs show a mix of requests from different origins at any given time, and there is little with which to reconstruct the history. The following are some of the techniques used to tag the requests so that the sequence determination can be made.
Description: Every request to a web server is a discrete item of work for it. Since these items may be referred to later in the logs, they are issued IDs, often called a RequestID or RID for short. No two requests have the same ID, and it is generally not possible to exceed a trillion IDs in a given duration no matter how high the load, after which the sequence can roll over. Therefore, RIDs are convenient for looking up requests in the logs. Logs also happen to be a convenient destination to publish the data for any production-category server and are well suited for subsequent translation and interpretation by analysis systems that can pull the data without affecting the business-critical systems.
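A minimal sketch of such RID issuance, assuming a simple in-process atomic counter; the class name and the one-trillion rollover ceiling are illustrative assumptions, not any particular web server's scheme.

#include <atomic>
#include <cstdint>

class RequestIdGenerator
{
public:
    uint64_t Next()
    {
        // fetch_add returns the previous value; the modulo lets the sequence roll over
        return counter_.fetch_add(1, std::memory_order_relaxed) % kRollover;
    }
private:
    static constexpr uint64_t kRollover = 1000000000000ULL;  // one trillion IDs before rollover
    std::atomic<uint64_t> counter_{0};
};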
One such way to trace request-responses is to establish a previous-next link between the requests. Between a bunch of redirects to itself, a server may stamp the RID in an additional field designated as the previous RID. If it is a new request from a user and there is no current session, there is no previous RID to start with. As the conversation grows, the RID is piggybacked on the responses so that new requests formed at the server can have the previous RID propagated to the current one. This lets us establish a trace.
Another way is for the server to reuse the session ID that the requests are part of. Unfortunately, sessions are generally restricted to the domain that the web server is hosted in and do not help with the cross-domain scenario unless we include additional IDs. Since it is difficult to maintain a unique integer ID across disparate servers, session IDs are generally universally unique identifiers (UUIDs), which have a predefined format that almost all systems know how to generate.
Another technique is for the server to make a stash of a request, which can be looked up in a stash store with a key. The stashes are generally encrypted and can include values that may not be logged for privacy reasons. These stashes may easily be purged the same way as logs are discarded. The stashes are made for high-value requests and may be maintained in a list that is piggybacked and carried forward in each conversation.
The above techniques emphasize two key concepts. The web server is the only one that can determine the sequence, either by chaining the requests or by propagating a common identifier, be it scoped to the session or across domains. The command-line tools that serve to analyze the logs can be smart enough to search by pattern or form rich search queries that elicit the same information. This separates the concerns and keeps the business-critical systems from having to deal with the more onerous tasks. That said, there are few command-line tools that can discover and render chaining, as most are still aimed at regular-expression-based search.
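As an illustration of pattern-based search over such logs, the sketch below pulls RID fields out of a log line with std::regex; the log format and field names are hypothetical, not an actual server's output.

#include <iostream>
#include <regex>
#include <string>

int main()
{
    // hypothetical log line carrying the current RID and the piggybacked previous RID
    std::string line = "2018-01-16T10:15:30Z GET /index.html rid=42713 prev_rid=42712 status=200";
    std::regex ridPattern("rid=(\\d+)");

    // matches both rid= and the rid= suffix of prev_rid=, which is enough to recover the chain
    for (std::sregex_iterator it(line.begin(), line.end(), ridPattern), end; it != end; ++it)
        std::cout << (*it)[1].str() << std::endl;
    return 0;
}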
Conclusion – Request chaining and chain discovery tools will be helpful to trace requests and responses from logs and stores.
#codingexercise
Get Fibonacci number by tail recursion. A tail recursion is one where the recursive call is the last statement executed inside the function.
uint GetTailRecursiveFibonacci(uint n, uint a = 0, uint b = 1)
{
    if (n == 0)
        return a;
    if (n == 1)
        return b;
    return GetTailRecursiveFibonacci(n-1, b, a+b);
}
0 1 1 2 3 5 8 13 21 34
Tail recursion does not leave a pending sum of recursive parts; the running values are carried forward in the arguments.
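For example, the accumulators a and b carry the running pair forward:
GetTailRecursiveFibonacci(5, 0, 1)
= GetTailRecursiveFibonacci(4, 1, 1)
= GetTailRecursiveFibonacci(3, 1, 2)
= GetTailRecursiveFibonacci(2, 2, 3)
= GetTailRecursiveFibonacci(1, 3, 5)
= 5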

Monday, January 15, 2018

File Descriptors on steroids continued.
A library or SDK that manages cloud file operations and their logical operations within the process itself gives the retail business more ability to own and manage its logic without having to rely on a shared architecture.
Proposed File Descriptors: 
Benefits: 
  Free from public cloud policies, restrictions and management of storage tier tasks 
  Flexible design with plugin extensions for every aspect of the storage framework, where plugins can be appliances so long as they comply with a new wire-based file system protocol to be designed
  Unified new file system protocol that spans NFS for Unix, CIFS/Samba for Windows, rsync for replication, and aging and deduplication protocols, which promotes interoperability
  The library aims to be more performant and efficient than comparable wire-level protocols
  Works with ZeroMQ for fan-out, pub-sub, task-distribution and request-reply models
  
Drawbacks: 
  Clients have to build additional features themselves 
  Greater control of cloud resources comes at increased TCO 
  Vendor agnostic public cloud resource usages 
  
Differentiation: 
Public cloud provider services such as Microsoft Azure StorSimple also provide a framework for storage services.
Indeed, StorSimple is documented to meet the needs of performance- and capacity-centric applications and to give a complete hybrid cloud storage solution for enterprises, with both physical arrays for deployment and virtual arrays for satellite offices that rely on network-accessible storage. While StorSimple expands on the usage of network-accessible storage for point-to-point connectivity, we discuss a stripped-down version as an embedded library. This does not require a server that implements all the features of a distributed file system to run on every computer as part of its operating system. Instead, it focuses on giving the capability to write to files in a way that works with cloud resources without utilizing their native cloud storage services. For that matter, this library does not restrict itself to providing DFS and can also include protocols such as those for deduplication and rsync.
#codingexercise
Get the sum of Binomial Coefficients of degree n
static int GetBinomialCoefficientsSum(int n)
{
            if (n == 0)
                return 1;
            return (1 << n);
}
This equals the row-wise sum of Pascal's triangle.
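For example, row 4 of Pascal's triangle is 1 4 6 4 1, whose sum is 1 + 4 + 6 + 4 + 1 = 16 = 2^4, matching 1 << 4.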

Sunday, January 14, 2018

File Descriptors on steroids
Introduction: Data structures for persisting and organizing data on disk have remained the same as desktops evolved from mainframes. However, as more and more data migrates from the desktop to cloud storage, the operating system artifacts for the desktop seem woefully inadequate to meet the features of the cloud such as elasticity, redundancy, availability, aging, deduplication, conversion, translation, parsing, sealing extents, IO scheduling and, finally, replication. While some of these may be mitigated by higher-level applications and services, that brings in players and vendors with different areas of emphasis. A retail company wanting to own a business-specific service at cloud scale starts relying on more technological notions and disparate systems while not addressing the minimal requirement of in-process compute and storage. Instead, if the service were given primitives that looked and behaved the same as they do at an individual compute resource level, then they could also work with resources at cloud scale so long as the primitives are made smarter.
Implementation: We have solved the problem of communications via peer-to-peer networking and message queuing, not just to multicast but also to implement publisher-subscriber models. Moreover, we have made networking and concurrency into libraries that expand on the definition of their corresponding primitives and can be hosted in process. We offer notions of background processing with libraries like Celery, which application developers and enterprise architects use to offload intensive processing away from end customers. They, however, end up creating fatter servers and dumb clients that can work anywhere on any device. If we contrast this model, we can have more uniform processing in peer nodes on an on-demand basis if there were operating system primitives that supported the existing and new functionalities for the cloud from the operating system notion of a process itself.
Conclusion: Enhanced File descriptors transcend clouds and bring the cloud capabilities in process.
#codingexercise
Find the Jacobsthal-Lucas number
uint GetJacobsthalLucas(uint n)
{
if (n == 0) return 2;
if (n == 1) return 1;
return GetJacobsthalLucas(n-1) + 2 * GetJacobsthalLucas(n-2);
}
2 1 5 7 17 31 ...
While the Fibonacci numbers arise from sums along the diagonals of Pascal's triangle, the Jacobsthal-Lucas numbers can similarly be computed from adjacent numbers along the diagonals.
Jacobsthal and Jacobsthal-Lucas numbers are part of the Lucas sequences.

Saturday, January 13, 2018

Today we resume our discussion of the AWS papers on software architecture, which suggest five pillars:
- Operational Excellence for running and monitoring business-critical systems.
- Security to protect information, systems, and assets with risk assessments and mitigation strategies.
- Reliability to recover from infrastructure or service disruptions.
- Performance Efficiency to ensure efficient use of resources.
- Cost Optimization to help eliminate unneeded cost and keep the system trimmed and lean.
The guidelines to achieve the above pillars include:
1. Infrastructure capacity should be estimated, not guessed.
2. Systems should be tested at production scale to eliminate surprises.
3. Architectural experimentation should be made easier with automation.
4. There should be flexibility to evolve architectures.
5. Changes to the architecture should be driven by data.
6. Plan for peak days and test at these loads to observe areas of improvement.
In AWS, the architecture is set by individual teams that demonstrate best practice. These guidelines are driven by data to build systems at internet scale and are shared with a virtual team of principal engineers who peer review each other's designs and showcase them. This is reinforced with the following:
First, the practices focus on enabling each team to have this capability
Second, the mechanisms that carry out the automated checks ensure that the intentions are met
Third, the culture works backward from the value to the customer across all roles.
The internal review processes and the mechanisms to enforce compliance are widely adopted.
We will start reviewing the guidelines in greater detail, but let us take a moment to note the pushback encountered for such initiatives:
Teams often have to get ready for a big launch, so they don't find the time.
Even if they did get all the results in from the mechanisms, they might not be able to act on them.
Sometimes the teams don't want to disclose the internal mechanisms.
In all of the above cases, the objections do not hold up.
#codingexercise
Find the Jacobsthal number
uint GetJacobsthal(uint n)
{
if (n == 0) return 0;
if (n == 1) return 1;
return GetJacobsthal(n-1) + 2 * GetJacobsthal(n-2);
}
0 1 1 3 5 11 ...
While the Fibonacci numbers arise from sums along the diagonals of Pascal's triangle, the Jacobsthal numbers can similarly be computed from adjacent numbers along the diagonals.

Friday, January 12, 2018

Today we will take a break from AWS architecture papers to discuss wire framing tools.
UI Customizations: 

Introduction 
The web page of an application is proprietary to it. When the web page is displayed to the user as part of a workflow for another application using HTTP redirects, the pages have to be customized for a seamless user experience. This involves propagating the brand or the logo through the cross-domain pages and modifying text, links, and user controls on the shared page. A classic example of this is the login page. Many clients want to customize the login page to suit their needs. Consequently, the login provider needs to provide some sort of wireframing tool to help these clients. We discuss the design of one such wireframing tool.
Design: 
A wireframing tool that enables an in-browser experience works universally on all devices. Therefore, hosting the wireframed pages and templates as a web application will be helpful to most clients. Additionally, if this web application were in the form of REST-based APIs, it would let clients build a tool for themselves that doesn't necessarily pose the restrictions that a provider may have. A sample API could involve something like
GET displays/v1/prototypes/{id}/pages/{id}.json?api_key=client_dev_key 
Here we use JSON notation to describe the wireframing artifacts. These include entities for prototypes, pages, and layers. A prototype becomes the container for all pages and layers associated with the mockup that the client would like to achieve. It has attributes such as name, version, timestamps, etc. A page has similar attributes but contains elements and might also show layers. A page is the unit of display in a workflow when the user navigates from one user interface to another. A layer is a group of elements that can be shared between pages. It is similar in nature to a template and is used in conjunction with a page.
As with all REST APIs, there is the possibility to create, update, and delete these pages so that clients can manage their lifetimes. The purpose of organizing this way is to keep the wireframes simpler to compose.
The styling and scripting are elements that can be added separately or as strings embedded in placeholders. The placeholder itself works with identifiers and can therefore display any text with it, as a literal or as an HTML element. Since the wireframing tool renders them, there is full flexibility for custom injection by the clients.
Workflows are enabled by associating pages with controls. Since this is a wireframe only, data from a page is passed to the next page via the specified association. This provides a seamless replay of pages for a workflow.
Conclusion 
A wireframing tool is useful for user interface customization and can work with different clients for the same shared resource. 


#codingexercise
We were discussing finding the largest sum submatrix in a matrix of integers.
As we evaluated the growing range from column to column, we iterated from the inside out, but the reverse is more efficient.

The advantage here is that the outer bounds of the entire matrix give the largest possible array on which to apply Kadane's algorithm, which we discussed earlier as the method for finding the maximum subarray.

Since we find the maximum subarray (i1, i2) along the rows and (j1, j2) along the columns at the first and last row and column of the matrix, the submatrix with a large sum will lie within (i1, 0), (i1, col-max), (i2, 0) and (i2, col-max), or within (0, j1), (0, j2), (row-max, j1) and (row-max, j2). We merely refine it with every iteration that shrinks the matrix.

Note that we can use the minimum common (i-topleft, j-topleft) and (i-bottomright, j-bottomright) from the first and last row and column subarray determinations.

In this case we could terminate early if the entire bounded subarray contributes to the max sum. The idea is that we use row-major analysis and column-major analysis one after the other and pick the better answer of the two. The notion that a subarray sum along a single dimension will contribute towards the solution lets us exclude the arrays that won't contribute towards the result. However, since the values can be skewed, we have to iterate in steps along one dimension only, updating the max sum as we go.
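For reference, here is a minimal sketch of the standard column-collapse technique that applies Kadane's algorithm to the maximum-sum submatrix problem. It fixes a pair of boundary columns, collapses the rows between them into a single array, and runs Kadane's algorithm on that array; this is the textbook O(cols^2 * rows) method rather than the refinement described above, and the function names are illustrative.

#include <algorithm>
#include <climits>
#include <vector>

// Kadane's algorithm: maximum sum of a contiguous subarray
int MaxSubarraySum(const std::vector<int>& a)
{
    int best = INT_MIN, running = 0;
    for (int v : a) {
        running = std::max(v, running + v);
        best = std::max(best, running);
    }
    return best;
}

// fix left and right column bounds, collapse the columns in between into
// per-row sums, and apply Kadane on the collapsed array
int MaxSubmatrixSum(const std::vector<std::vector<int>>& m)
{
    if (m.empty() || m[0].empty()) return 0;
    int rows = (int)m.size(), cols = (int)m[0].size();
    int best = INT_MIN;
    for (int left = 0; left < cols; ++left) {
        std::vector<int> collapsed(rows, 0);
        for (int right = left; right < cols; ++right) {
            for (int r = 0; r < rows; ++r)
                collapsed[r] += m[r][right];
            best = std::max(best, MaxSubarraySum(collapsed));
        }
    }
    return best;
}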