Saturday, December 20, 2014

Continuing our discussion from the previous post on sequential consistency in the WRL research report on shared memory consistency models, let us now look at the implementation considerations for the same.
We first look at architectures without caches. The key design principle here is to enforce and maintain the program order of memory operations issued by each processor, and to keep track of those operations so that they complete in a single sequential order.
Let us take a look at write buffers with bypassing capability. We have a bus-based shared memory system with no caches. A processor issues memory operations in program order, and the write buffer merely lets the processor continue without waiting for a write to complete. Reads are allowed to bypass any previous writes still sitting in the buffer, as long as the read address differs from every address in the write buffer. This optimization hides the write latency. To see how it can violate sequential consistency, consider the Dekker-style example mentioned earlier, where we maintained that both reads of the flags cannot return 0. With read bypassing, however, both reads can now return 0. This problem does not occur on uniprocessors, since a single processor's read is only affected by its own buffered writes, which the bypass check already handles.
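Here is a minimal sketch of that scenario, assuming a Dekker-style entry protocol; the class, flag names, and threading scaffolding are my own illustration and are not taken from the report:

// Sketch: with write buffer bypassing (or a hardware store buffer and no fence),
// each processor's read of the other's flag can be serviced before its own
// buffered write of its flag reaches memory, so both reads may return 0.
using System.Threading;

class DekkerSketch
{
    static int flag1 = 0, flag2 = 0;

    static void Processor1()
    {
        flag1 = 1;            // goes into the write buffer
        if (flag2 == 0)       // read bypasses the buffered write (different address)
        {
            // critical section
        }
    }

    static void Processor2()
    {
        flag2 = 1;
        if (flag1 == 0)
        {
            // critical section
        }
    }

    static void Main()
    {
        var t1 = new Thread(Processor1);
        var t2 = new Thread(Processor2);
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();
        // Under sequential consistency, at most one of the two reads can return 0;
        // with bypassing, both can, and both threads may enter the critical section.
    }
}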
Now let us take a look at overlapping write operations. Here we use a general interconnection network that alleviates the bottleneck of a bus-based design. We still assume processors issue memory operations in program order. As opposed to the previous example, multiple write operations issued by the same processor may now be serviced simultaneously by different memory modules. For sequential consistency, we maintained that a read of the data by one processor should return the value written by the other processor. That assumption can now be violated, because a later write from the same processor may be injected into the network before an earlier write has made its way to its memory module. This highlights the importance of maintaining program order between memory operations. Coalescing other write operations to the same cache line in a write buffer can cause the same kind of reordering.
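A rough sketch of the hazard, with names of my own choosing: Data and Head are assumed to live in different memory modules, so the two writes can complete out of order.

// Sketch: the writer's two writes travel through the network to different modules;
// the write of Head may arrive before the write of Data.
class OverlappedWritesSketch
{
    static int Data = 0;
    static int Head = 0;

    static void Writer()
    {
        Data = 2000;   // may be delayed on its way to Data's memory module
        Head = 1;      // may reach Head's (different) memory module first
    }

    static void Reader()
    {
        while (Head == 0) { }   // observes the new Head ...
        int value = Data;       // ... yet may still read the old Data (0),
                                // violating sequential consistency
    }
}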
One way to remedy this is to wait for a write to reach its memory module before allowing the next write from that processor into the network. Enforcing this requires an acknowledgement to be sent back to the processor that issued the write. These acknowledgements help with maintaining program order.
Next we consider non-blocking read operations in the same example as above. Here the writing processor ensures that its writes arrive at memory in program order. But if the reading processor is allowed to issue its read of the data without waiting for its earlier read to complete, the reads may effectively be reordered, and the read of the data might return the old value even though the later-issued read saw the new flag, again leading to an inconsistency.
#codingexercise
Decimal GetOddNumberRangeMode(Decimal [] A)
{
if (A == null) return 0;
return A.OddNumberRangeMode();
}

Thursday, December 18, 2014

Today I'm going to continue discussing the WRL research report on shared memory consistency models. Most programmers assume sequential semantics for memory operations: a read will return the value of the last write. On a uniprocessor, as long as data and control dependences are respected, the illusion of sequentiality can be maintained, and the compiler and hardware can freely reorder operations to different locations. Compiler optimizations that reorder memory operations include register allocation, code motion, and loop transformations; hardware optimizations include pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches.
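To make the compiler side concrete, here is a small C# illustration of my own (not from the report) of how register allocation alone can defeat a flag-based wait; marking the field volatile, or inserting a fence, restores the intended behaviour:

// Sketch: without volatile, the compiler/JIT is free to keep 'done' in a register,
// so the spinning thread may never observe the producer's write.
class RegisterAllocationSketch
{
    static bool done = false;

    static void Consumer()
    {
        while (!done) { }   // may be compiled as a loop on a stale register copy
        // ... safe to consume the data only if 'done' is actually re-read from memory
    }

    static void Producer()
    {
        // ... produce the data ...
        done = true;        // the consumer may never see this store
    }
}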
The following are some of the myths about the memory consistency models:
1) It only applies to systems that allow multiple copies of shared data. This is not true, as seen with the overlapped writes and non-blocking reads above, which arise even without caches.
2) Most current systems are sequentially consistent. On the contrary, Cray computers and Alpha computers relax this consistency.
3) The memory consistency model only affects the design of the hardware. But it actually affects compiler and other system design.
4)  A cache coherence protocol inherently supports sequential consistency. In reality, this is only part of it. Another myth is that the memory consistency model depends on whether the system supports an invalidate or update based coherence protocol. But it can allow both.
5) The memory model may be defined solely by specifying the behaviour of the processor. But it is affected by both the processor and memory system.
6) Relaxed memory models may not be used to hide read latency. In reality they can hide both read and write latency.
7) Relaxed consistency models require the use of extra synchronization. There is no synchronization required for some of the relaxed models.
8) Relaxed consistency models do not allow asynchronous algorithms. This can actually be supported.

#codingexercise
Decimal GetOddNumberRangeMean(Decimal [] A)
{
if (A == null) return 0;
return A.OddNumberRangeMean();
}


Tonight we will continue to discuss this WRL research report on shared memory consistency models. In particular, we will be discussing sequential consistency, the most commonly assumed memory consistency model. The definition is that the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in program order. Maintaining program order per processor, together with maintaining a single sequential order among all the processors, is the core tenet of this model. The latter part effectively requires that memory be updated atomically with respect to other operations. Let us take two examples of using this model. First, we use a code segment from Dekker's algorithm for critical sections, involving two processors and two flags that are initialized to zero. When a processor wants to enter the critical section, it updates its own flag and then checks the other's; the assumption is that the other flag is read only after the processor's own flag has been written. Maintaining program order ensures that at most one processor can find the other's flag still zero, so both cannot enter the critical section at the same time. To illustrate the importance of atomic execution, consider instead an example with three processors sharing variables A and B, both initialized to zero. The first processor writes A; the second reads A and, upon seeing the new value, writes B; the third reads B and then reads A. Atomicity requires that the update to A be seen by all processors at once, before the update to B becomes visible, so the third processor must observe the new value of A.
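A sketch of that second example, in my own rendering, with P1, P2, P3 standing in for the three processors:

// A and B are shared and start at 0.
class AtomicityExample
{
    static int A = 0, B = 0;

    static void P1() { A = 1; }

    static void P2()
    {
        while (A == 0) { }   // waits until it observes P1's write
        B = 1;
    }

    static void P3()
    {
        while (B == 0) { }   // waits until it observes P2's write
        int a = A;           // with atomic (sequentially consistent) writes, a must be 1;
                             // if the write of A reaches P2 before it reaches P3, a could be 0
    }
}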

Wednesday, December 17, 2014

In a post previous to the last one, I talked about securing AWS S3 artifacts for a corporate subscriber.

The following are the mechanisms available to implement a corporate governance strategy:
IAM Policy:
These are global policies attached to IAM users, groups, or roles, which are then subject to the permissions we have defined. They apply to principals, and hence in their definition the application to a principal is implicit and usually omitted.
For programmatic access to the resources created by Adobe users, we will require an account that has unrestricted access to their resources. Creating such a user lets us request an AWS Access Key and an AWS Secret, which are necessary for programmatic access.

Bucket Policy:
These are policies that are attached to the S3 buckets. They are S3 specific as opposed to IAM policies that are applicable across all AWS usages by the principals. These bucket policies are applied one per bucket.  A sample bucket policy looks like this:
{
  "Id": "Policy1418767581422",
  "Statement": [
    {
      "Sid": "Stmt1418761739258",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::*",
      "Principal": {
        "AWS": [
          "rravishankar"
        ]
      }
    }
  ]
}
Notice the mention of a principal and a resource in this bucket policy. The ARN is specified such that it covers all the resources we want to govern. Note that these resource names have to be specified with wildcards. This is done via:
arn:aws:s3:::my_corporate_bucket/*
arn:aws:s3:::my_corporate_bucket/Development/*
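Assuming the policy document above has been saved locally (the file and bucket names here are illustrative), one way to attach and then verify it is with the AWS CLI:

aws s3api put-bucket-policy --bucket my_corporate_bucket --policy file://policy.json
aws s3api get-bucket-policy --bucket my_corporate_bucket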
 

ACLs:
These are applied by the user or admin to specific folders or objects within a bucket, or to the bucket itself, and provide fine-grained control over individual resources. The access control can be expressed in terms of pre-canned ACLs such as:
'private', 'public-read', 'project-private', 'public-read-write', 'authenticated-read', 'bucket-owner-read', 'bucket-owner-full-control'
Or they can be granted explicitly to an individual, group or e-mail based principal with permissions such as :
FULL_CONTROL | WRITE | WRITE_ACP | READ | READ_ACP
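For example, with the AWS CLI (the bucket, key, and e-mail address are illustrative), a canned ACL or an explicit grant could be applied like this:

aws s3api put-object-acl --bucket my_corporate_bucket --key Development/design.doc --acl bucket-owner-full-control
aws s3api put-object-acl --bucket my_corporate_bucket --key Development/design.doc --grant-read emailaddress=user@example.com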

In this regard, the set of actions that the governance strategy needs to take includes the following:
1)   Create a corporate governance service principal or account.
2)   Request access key and secret for this account.
3)   Set up a distribution list with storage team members for this account.
4)   Prefer to have only one bucket deployed for all corporate usage.
5)   User access will be based on folders as described by the prefix in the keys of their objects.
6)   Define a path as the base for all users, just as we do on any other file storage (a sample per-user folder policy is sketched after this list).
7)   Cross-user access will be permitted based on ACLs. No additional action is necessary except for provisioning a web page on the portal that lists the bucket and objects with their ACLs and allows applying those ACLs.
8)   If step 1) is not possible or there are other complications, then have a bucket policy generated for the account we created to have full access (initially and later reduced to changing permissions) to all the objects in that bucket.
9)   Policies are to be generated from :
10)   Apply the generated policies to their targets - IAM or buckets.
11)   Provision the webpage.
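As a sketch of item 6), the per-user folder layout can be enforced with an IAM policy along the following lines; the bucket name and the home/ prefix are illustrative choices, not part of the plan above:

{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my_corporate_bucket",
      "Condition": {"StringLike": {"s3:prefix": ["home/${aws:username}/*"]}}
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my_corporate_bucket/home/${aws:username}/*"
    }
  ]
}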


#codingexercise
Decimal GetOddNumberRangeStdDev(Decimal [] A)
{
if (A == null) return 0;
return A.OddNumberRangeStdDev();
}
Today I will be reading from the WRL research report on Shared Memory Consistency Models. The topic refers to a specification of memory semantics that helps us write correct and efficient programs on multiprocessors. Sequential consistency is one such model, but it is severely restricting for multiprocessors. That is why system designers introduce relaxed consistency models, but these are sometimes so different from one another that a general specification becomes altogether more complicated. In this paper, the authors describe issues related to such models. Here the models are for hardware-based shared memory systems. Each system makes its own optimizations, but the paper describes the models with a common terminology. Furthermore, the paper also describes the models from the point of view of a programmer.
Shared memory models provide a single address space abstraction over parallel systems. This simplifies difficult programming tasks such as data partitioning and dynamic load distribution. Having a precise understanding of how memory behaves is important for programming. Consider a producer-consumer queue where memory is written by one processor and read by another. If the reader gets the old value from before the write, then there is an inconsistency between the programmer's expectation and the actual behaviour.
On a single processor, a read returns the value of the last write because the instructions are executed in program order, which isn't automatically the case for multiprocessors. We can extend the notion to multiprocessors, in which case it is called sequential consistency, and reads will again be consistent because each processor appears to execute in program order. However, this simple and intuitive model severely restricts hardware and compiler optimizations.


A consistency model like the one described is an interface between programmability and system design. The model affects programmability because programmers use it to reason about the correctness of their programs. The model affects performance because system designers use it to determine which optimizations are allowed in hardware and software.

Tuesday, December 16, 2014

In today's post we talk about security and policies as applicable to AWS S3 artifacts. Policies are mappings that dictate which users have access to which resources. Typically policies are declared and applied to containers. If you look at the .NET security model, there is a notion of different tiers of policies; these include an enterprise-level policy, which comes in handy for corporate accounts. Policies don't necessarily have to be expressed in the same way on different systems, but there is a reason for the consistency we see across many of them. For example, permissions set at a folder level flow by inheritance to the folder's contents. Another place we see these principles is cross-account authorization; for example, the owner of a folder may grant another user permission to read and write it. Policies and their applications are implemented differently, but there is a lot in common in their design, because the requirements are common. Coming back to S3 artifacts, there are two ways enterprise policies can be applied.
One is to create a corporate-level bucket with user-level folders. The other is to apply a policy granting a corporate governance account full access to all the buckets.

Monday, December 15, 2014

Decimal GetEvenNumberRangeStdDev(Decimal [] A)
{
if (A == null) return 0;
return A.EvenNumberRangeStdDev();
}

Today we continue to discuss the WRL long address trace generation system. We were reviewing the effects of line size in the first and second level cache in this study and the contribution of the write buffer. We summarize the study now. This is a method to generate and analyze very long traces of program execution on RISC machines. The system is designed to allow tracing of multiple user processes and the operating system. Traces can be stored or analyzed on the fly. The latter is preferred because the storage is very large even for a very short duration. Besides it allows the trace to be compacted and written to tape. The system requires that the programs be linked in a special way.
From the trace analysis using this system, it was found that while second level cache was necessary, large second level caches provide little or no benefit. A program which has a large working set can benefit from a large cache. Block size plays a big role in first level instruction cache, but sizes above 128 bytes had little effect on overall performance. If the proportion of writes in the memory references is large, then the contribution from the write buffer is significant. There is little or no benefit from the associativity in large second level cache.

#codingexercise
Decimal GetEvenNumberRangeVariance(Decimal [] A)
{
if (A == null) return 0;
return A.EvenNumberRangeVariance();
}


Decimal GetEvenNumberRangeMode(Decimal [] A)
{
if (A == null) return 0;
return A.EvenNumberRangeMode();
}

Today we will continue to discuss the WRL system for trace generation. We discussed the effects of direct-mapped and associative second-level caches. We now review the line size in the first and second level caches. When the first-level line size was doubled or quadrupled with a 512 K second-level cache, there were reductions in the cumulative total CPI, and this was independent of the second-level size. The improvement was due to the decreased contribution of the instruction cache; the data cache contribution remained constant. The improvement from doubling the line size was the same as the improvement from increasing the second-level cache size eight times. Doubling the length of the lines in the second-level cache from 128 bytes to 256 bytes made no difference; this may be because there is too much conflict for long line sizes to be beneficial.
We next review the write buffer contributions. The performance of the write buffer varied from program to program, mainly because the write buffer is sensitive to both the proportion and the distribution of writes in the instructions executed. A write buffer entry is retired every six cycles; if the writes are any more frequent than that, or bunched together, performance degrades. The relationship between the proportion of writes in a program and the write buffer performance is clear: when the proportion of writes jumps, the CPI contributed by the write buffer shows a corresponding jump.
There was a case where the percentage of writes was frequently in the 15-20% range but the write buffer CPI was usually zero. If the writes are uniformly distributed and below a threshold, the write buffer will never fill and a subsequent write will never be delayed; with one entry retired every six cycles, that threshold is roughly one write per six instructions, or about 17% of the instruction stream, assuming close to one instruction per cycle. Above the threshold, there may be some delays. Since the second-level cache is pipelined into three stages, the processing of a single miss gives the write buffer a chance to retire two entries. If enough misses are interspersed between the writes, the write buffer may work smoothly. Thus the first-level data cache miss ratio is the third factor that comes into play.
Although this report doesn't study it, the effect of entropy on the distribution of references across the cache hierarchy could be relevant. The Shannon entropy is defined as H(X) = - sum over x of P(x) log2 P(x).
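As a rough sketch of how that could be measured (my own illustration, not from the report), the entropy of the distribution of references across cache lines can be computed from per-line access counts:

using System;
using System.Linq;

class EntropySketch
{
    // Shannon entropy H = -sum p * log2(p) over the reference distribution,
    // e.g. the number of accesses that landed on each cache line.
    static double Entropy(long[] counts)
    {
        double total = counts.Sum();
        return -counts.Where(c => c > 0)
                      .Sum(c => (c / total) * Math.Log(c / total, 2));
    }

    static void Main()
    {
        Console.WriteLine(Entropy(new long[] { 25, 25, 25, 25 }));  // uniform over 4 lines: 2 bits
        Console.WriteLine(Entropy(new long[] { 97, 1, 1, 1 }));     // skewed: about 0.24 bits
    }
}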