Saturday, January 20, 2018

Today we resume our discussion of the AWS papers on software architecture, which suggest five pillars:
- Operational Excellence for running and monitoring business-critical systems.
- Security to protect information, systems, and assets with risk assessments and mitigation strategies.
- Reliability to recover from infrastructure or service disruptions.
- Performance Efficiency to ensure efficient usage of resources.
- Cost Optimization to help eliminate unneeded cost and keep the system trimmed and lean.
The guidelines to achieve the above pillars include:
1. Infrastructure capacity should be estimated, not guessed.
2. Systems should be tested at production scale to eliminate surprises.
3. Architectural experimentation should be made easier with automation.
4. There should be flexibility to evolve architectures.
5. Changes to the architecture should be driven by data.
6. Plan for peak days and test at those loads to find areas of improvement.
We looked at the security pillar earlier; now we review its best practices.
They include identity and access management, monitoring controls, infrastructure protection, data protection, and incident response.
Identity and access management ensures that only authenticated and authorized users can access the resources.
Monitoring controls are used to identify a potential security incident.
Infrastructure protection includes control methodologies such as defense in depth.
Data protection involves techniques for securing data, such as encrypting it and putting access controls around it.
Incident response means putting controls and prevention in place to mitigate security incidents.
IAM is the AWS service that is essential to security and enables this pillar of software architecture.
#codingexercise
We were discussing how to check if a number is Fibonacci:
bool IsFibonacci(uint n)
{
    // n is a Fibonacci number iff 5n^2 + 4 or 5n^2 - 4 is a perfect square;
    // computed in ulong, which holds 5n^2 for n up to about 1.9 billion
    return IsSquare(5UL * n * n + 4) || IsSquare(5UL * n * n - 4);
}
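The snippet assumes an IsSquare helper. A minimal sketch of one, rounding System.Math.Sqrt and checking the neighboring root to guard against floating-point truncation:
static bool IsSquare(ulong x)
{
    ulong root = (ulong)Math.Sqrt(x);
    // Math.Sqrt can land just below the true root, so test root and root + 1
    return root * root == x || (root + 1) * (root + 1) == x;
}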
Another way to test for Fibonacci is to binary chop the Fibonacci numbers until we get close to the given number.
We discussed this binary chop method here: http://ravinote.blogspot.com/2017/11/we-resume-our-discussion-about.html
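A sketch of that approach, though not necessarily the linked post's exact method (uses System.Collections.Generic): grow the Fibonacci sequence up to n, then binary chop the list:
static bool IsFibonacciByChop(ulong n)
{
    var fibs = new List<ulong> { 0, 1 };
    while (fibs[fibs.Count - 1] < n)
        fibs.Add(fibs[fibs.Count - 1] + fibs[fibs.Count - 2]);
    int lo = 0, hi = fibs.Count - 1;
    while (lo <= hi)
    {
        int mid = (lo + hi) / 2;
        if (fibs[mid] == n) return true;
        if (fibs[mid] < n) lo = mid + 1;
        else hi = mid - 1;
    }
    return false;
}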



Friday, January 19, 2018

Reducing the cyclomatic complexity in software
Introduction: Mature software applications often end up as spaghetti code – a term used to denote tangled flow of control and the use of nested conditionals. This not only makes the software hard to read but also results in unexpected behavior. Code written in programming languages like C# and Java can be checked with tools such as NDepend and Checkstyle respectively to measure this complexity. The following are some suggestions to mitigate it.
Description:
1: Refactor cumbersome logic into smaller methods or abstractions.
2: Increase the use of boolean variables to store intermediate state or the results of evaluating multiple conditions. These boolean variables can also be reused in subsequent statements, as the sketch after this item shows.
Test code representing a test matrix usually has very little cyclomatic complexity because each test case can be expressed as its own if-condition with multiple conditionals. While such repetitive if-conditions are avoided in development code in favor of set-once, check-once conditions, the latter contribute to cyclomatic complexity. Unpacking the conditions into repetitive but separate lines avoids unnecessary branching and missed test cases.
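A minimal sketch of tip 2; the shipping scenario and all names here are hypothetical:
static bool CanShipOrder(bool paymentSettled, int stock, int ordered, string address)
{
    // each intermediate condition gets a name instead of a nested if
    bool isPaid = paymentSettled;
    bool hasStock = stock >= ordered;
    bool hasAddress = !string.IsNullOrEmpty(address);
    // one flat expression replaces three nested branches, and the named
    // booleans can be logged or reused in subsequent statements
    return isPaid && hasStock && hasAddress;
}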
3: Use inheritance, encapsulation, or design patterns such as Factory or Strategy so that the logic can be reorganized and not just refactored (a sketch follows below). For example, multiple throw and catch statements increase cyclomatic complexity, but if the catch statements are all in one place and have smaller blocks of code, it helps the same way that switch statements do. In the C language, the use of preprocessor macros was prevalent because it made these repetitions easier to write. A catch statement, for example, may have to add an entry to the log, update counters, and do other chores that add code bloat; this can probably be tolerated, but putting logic into the catch handler increases the complexity.
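A sketch of the Strategy idea (uses System.Collections.Generic); the exporter domain and every name here are illustrative:
interface IExporter { void Export(string data); }
class CsvExporter : IExporter { public void Export(string data) { /* write CSV */ } }
class JsonExporter : IExporter { public void Export(string data) { /* write JSON */ } }

class ReportService
{
    // the branching lives in one table lookup instead of a growing switch
    private readonly Dictionary<string, IExporter> exporters = new Dictionary<string, IExporter>
    {
        ["csv"] = new CsvExporter(),
        ["json"] = new JsonExporter()
    };

    public void Run(string format, string data) => exporters[format].Export(data);
}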
4: Designate private methods for validations and separate methods for business operations, and treat dependency calls as merely a method with parameters.
5: Do not overdo the practice of reducing this complexity. Many organizations take pride in the way they write code, such as when it reads like a textbook. This emphasizes the fact that code is of the people, by the people, and for the people.
Conclusion: Cyclomatic complexity can be a desirable metric in the static code analysis practices of an organization and is worth addressing at the time of check-in.



Thursday, January 18, 2018

Today we resume our discussion of the AWS papers on software architecture, which suggest five pillars:
- Operational Excellence for running and monitoring business-critical systems.
- Security to protect information, systems, and assets with risk assessments and mitigation strategies.
- Reliability to recover from infrastructure or service disruptions.
- Performance Efficiency to ensure efficient usage of resources.
- Cost Optimization to help eliminate unneeded cost and keep the system trimmed and lean.
The guidelines to achieve the above pillars include:
1. Infrastructure capacity should be estimated, not guessed.
2. Systems should be tested at production scale to eliminate surprises.
3. Architectural experimentation should be made easier with automation.
4. There should be flexibility to evolve architectures.
5. Changes to the architecture should be driven by data.
6. Plan for peak days and test at those loads to find areas of improvement.
We look at the security pillar today:
The security pillar emphasizes protection of information, systems, and assets. There are six design principles for security in the cloud:
- It implements a strong identity foundation where privileges are assigned on a need-by-need basis and there is separation of duties. It centralizes privilege management and reduces the usage of long-term credentials.
- It monitors, alerts, and audits actions so that we can respond and take action.
- It applies security at all layers, not just the one at the edge, so that the impact radius is covered in full.
- It automates the mechanisms that are necessary for controls and restrictions.
- It protects data in transit and at rest with the use of access tokens and encryption.
- It prepares for security events so that incidents can be handled even when many systems are affected.

#codingexercise
Check if a number is Fibonacci:
bool IsFibonacci(uint n)
{
    return IsSquare(5UL * n * n + 4) || IsSquare(5UL * n * n - 4);
}
A way to test for squares is to binary chop candidate roots until we find the one closest to the required value, as sketched below.
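A minimal sketch of that binary chop, using only integer arithmetic; the root cap is an assumption that keeps mid * mid within a ulong for inputs derived from uint:
static bool IsSquare(ulong x)
{
    ulong lo = 0, hi = Math.Min(x, 4294967295UL);  // cap so mid * mid cannot overflow
    while (lo <= hi)
    {
        ulong mid = lo + (hi - lo) / 2;
        ulong square = mid * mid;
        if (square == x) return true;
        if (square < x) lo = mid + 1;
        else if (mid == 0) return false;
        else hi = mid - 1;
    }
    return false;
}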

Wednesday, January 17, 2018

File descriptors on steroids continued
Design:

While the file descriptor was inherently local to the process, the DFS allowed it to point to a file on a remote computer. Likewise, file system protocols such as CIFS allowed remote servers to be connected through Active Directory, with access granted to users registered there. Deduplication worked on segments that were identical so that space could be conserved. The RSync protocol helped replicate between source and destination regardless of whether the destination was a file system or an S3 endpoint. In all these tasks, much of the operations are asynchronous and involve a source and a destination. This library utilizes the ZeroMQ messaging library for file system operations, as sketched below.
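A minimal sketch of such an operation over ZeroMQ, assuming the NetMQ binding for C#; the endpoint, the single-frame path protocol, and the FileServer name are illustrative, not part of the library described here:
using System.IO;
using NetMQ;
using NetMQ.Sockets;

class FileServer
{
    static void Main()
    {
        using (var server = new ResponseSocket())
        {
            server.Bind("tcp://*:5555");  // clients connect with a RequestSocket
            while (true)
            {
                // each request carries a file path; the reply is the file contents
                string path = server.ReceiveFrameString();
                byte[] reply = File.Exists(path) ? File.ReadAllBytes(path) : new byte[0];
                server.SendFrame(reply);
            }
        }
    }
}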
Performance:
ZeroMQ has demonstrated performance for communications. Stacking the file operations of a storage solution over this library can also meet the stringent requirements of cloud-scale operations. The implementation might vary in purpose, scope, scale, and management as we add plugins for a client, but the assumption that asynchronous operations on a remote file will not be hampered by ZeroMQ remains sound.

Security:
Enhanced file descriptors are inherently as secure as sockets. Moreover, file system utilities that secure files continue to work, because these descriptors behave the same as regular ones to the layers above.

Testing:
The implementation of this storage framework must be able to process a hundred thousand requests per second, with message sizes mixed from 0 to 64 KB, for a duration of one hour, with little or no degradation in write latency for writes to a million files. A checksum may be used to verify that the files are correct, as sketched below. Testing might require supportability features in addition to random file writes. The statistics, audit log, history, and other management aspects of the queue should be made available for pull via web APIs.
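A small sketch of that checksum verification, assuming SHA-256 and a placeholder file name:
using System;
using System.IO;
using System.Security.Cryptography;

class ChecksumCheck
{
    // hash the file as written, then again after the test run; equal digests
    // mean the contents survived intact
    static string HashOf(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
            return BitConverter.ToString(sha.ComputeHash(stream));
    }

    static void Main()
    {
        Console.WriteLine(HashOf("testfile.bin"));  // placeholder path
    }
}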

Conclusion:
With smart operating system primitives, we can enhance each process to give more power to individual businesses.
#codingexercise
Get a Fibonacci number.
We compared the following:
uint GetTailRecursiveFibonacci(uint n, uint a = 0, uint b = 1)
{
    if (n == 0)
        return a;
    if (n == 1)
        return b;
    return GetTailRecursiveFibonacci(n-1, b, a+b);
}
with the conventional version, which recomputes overlapping subproblems and therefore takes exponential time, whereas the tail-recursive version is linear:
uint GetFibonacci (uint n)
{
    if (n == 0)
        return 0;
    if (n == 1)
        return 1;
    return GetFibonacci(n-1) + GetFibonacci(n-2);
}
0 1 1 2 3 5 8 13 21 34



Tuesday, January 16, 2018


Techniques for HTTP Request Tracing at a web server
Introduction: When we point our browser to a website, we hardly realize the number of exchanges of HTTP requests and responses that load the page. Still, it is easy to ask the browser to capture and dump the request-responses in an archive file. The web server handles many such sessions from different browsers corresponding to the millions of users using it. Consequently, its logs show a mix of requests from different origins at any given time, and there is little reconstruction of history. The following are some of the techniques used to tag requests so that their sequence can be determined.
Description: Every request to a web server is a discrete item of work for it. Since these items may be referred to later in the logs, they are issued IDs, often called a RequestID, or RID for short. No two requests have the same ID, and the ID space is generally large enough, say a trillion, that it is not exhausted in a given duration no matter how high the load, after which the sequence can roll over. Therefore, RIDs happen to be convenient for looking up requests in the logs. Logs also happen to be a convenient destination to publish the data for any production-category server, and they are well suited for subsequent translation and interpretation by analysis systems that can pull the data without affecting the business-critical systems.
One way to trace request-responses is to establish a previous-next link between the requests. Between a bunch of redirects to itself, a server may stamp the RID in an additional field designated as the previous RID. If it is a new request from a user and we don't have a current session, we have none to start with. As the conversation grows, the RID is piggybacked on the responses, so new requests formed at the server can have the previous RID propagated to the current one. This lets us establish a trace, as the sketch below illustrates.
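A minimal sketch of this chaining, assuming a header named "X-Previous-RID"; the header name and the integer RID format are illustrative, not a documented standard:
using System;
using System.Collections.Generic;
using System.Threading;

class RequestTracer
{
    private static long counter;

    // stamp the current request with a fresh RID, log it against its
    // predecessor, and piggyback the new RID for the next hop in the conversation
    public static string Stamp(IDictionary<string, string> headers)
    {
        string previous = headers.TryGetValue("X-Previous-RID", out var p) ? p : "none";
        string rid = Interlocked.Increment(ref counter).ToString();
        Console.WriteLine($"RID={rid} prev={previous}");  // the log line that analyzers join on
        headers["X-Previous-RID"] = rid;  // carried on the response to the client
        return rid;
    }
}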
Another way is for the server to reuse the session ID that the requests are part of. Unfortunately, sessions are generally restricted to the domain that the web server is hosted in, which does not help with cross-domain scenarios unless we include additional IDs. Since it is difficult to maintain a unique integer ID across disparate servers, session IDs are generally universally unique identifiers (UUIDs), which have a predefined format that almost all systems know how to generate.
Another technique is for the server to make a stash of a request, which can be looked up in a stash store with a key. The stashes are generally encrypted and can include values that may not be logged for privacy. They may be purged the same way as logs are discarded. Stashes are made for high-value requests and may be maintained in a list that is piggybacked and carried forward in each conversation.
The above techniques emphasize two key concepts. The web server is the only party that can determine the sequence, either by chaining the requests or by propagating a common identifier, whether scoped to the session or across domains. The command-line tools that analyze the logs can be smart enough to search by pattern or to form rich search queries that elicit the same information. This separates the concerns and keeps the business-critical systems from having to deal with the more onerous tasks. That said, there are few command-line tools that can discover and render chaining, as most are still aimed at regular-expression-based search.
Conclusion: Request chaining and chain-discovery tools will be helpful to trace requests and responses from logs and stores.
#codingexercise
Get a Fibonacci number by tail recursion. A tail recursion is one where the recursive call is the last statement executed inside the function.
uint GetTailRecursiveFibonacci(uint n, uint a = 0, uint b = 1)
{
    if (n == 0)
        return a;
    if (n == 1)
        return b;
    return GetTailRecursiveFibonacci(n-1, b, a+b);
}
0 1 1 2 3 5 8 13 21 34
Tail recursion does not combine the results of recursive calls after they return; the running state is carried forward in the accumulator arguments.
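A worked trace of those accumulators for GetTailRecursiveFibonacci(5):
(n=5, a=0, b=1) -> (n=4, a=1, b=1) -> (n=3, a=1, b=2) -> (n=2, a=2, b=3) -> (n=1, a=3, b=5) -> returns 5
Each step passes the running pair forward, so nothing remains to be done after the recursive call and the stack frame can be reused.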

Monday, January 15, 2018

File Descriptors on steroids continued.
A library or SDK that manages cloud file operations and their logical operations within the process itself gives the retail business more ability to own and manage its logic without having to rely on a shared architecture.
Proposed File Descriptors:
Benefits:
- Free from public cloud policies, restrictions, and the management of storage-tier tasks
- Flexible design with plugin extensions for every aspect of the storage framework, where plugins can be appliances so long as they comply with a new wire-based file system protocol to be designed
- A unified new file system protocol that spans NFS for Unix, CIFS/Samba for Windows, rsync for replication, and aging and deduplication protocols, which promotes interoperability
- The library is more performant and efficient than comparable wire-level protocols
- Works with ZeroMQ for fan-out, pub-sub, task-distribution, and request-reply models

Drawbacks:
- Clients have to build additional features themselves
- Greater control of cloud resources comes at increased TCO
- Vendor-agnostic public cloud resource usage
Differentiation: 
Public cloud provider services such as Microsoft Azure StorSimple also provide a framework for storage services.
Indeed, StorSimple is documented to meet the needs of performance- and capacity-centric applications and to give a complete hybrid cloud storage solution for enterprises, with both physical arrays for deployment and virtual arrays for satellite offices that rely on network-accessible storage. While StorSimple expands on the usage of network-accessible storage for point-to-point connectivity, we discuss a stripped-down version as an embedded library. It does not require a server running on every computer as part of the operating system implementing all the features of a distributed file system. Instead, it focuses on giving the capability to write to files in a way that works with cloud resources without utilizing their native cloud storage services. For that matter, this library does not restrict itself to providing DFS and can also include protocols such as deduplication and rsync.
#codingexercise
Get the sum of the binomial coefficients of degree n.
static int GetBinomialCoefficientsSum(int n)
{
    // the sum of C(n, k) over k = 0..n is 2^n; 1 << 0 covers the n == 0 case
    return 1 << n;
}
This equals the row-wise sum of row n of Pascal's triangle.
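The identity behind the one-liner comes from expanding (1 + 1)^n with the binomial theorem:
\sum_{k=0}^{n} \binom{n}{k} = (1 + 1)^n = 2^n
For example, row 4 of Pascal's triangle is 1 4 6 4 1, which sums to 16 = 2^4.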