Friday, December 26, 2014

Today we continue our discussion on WRL shared memory consistency models. We were reviewing relaxing all program orders  We were discussing the weak ordering model and two flavors of the release consistency model - the serialization consistency and processor consistency flavors. All of the models in this group allow a processor to read its own write early. However, the two flavors are the only ones whose straightforward implementations allow a read to return the value of another processor's write early. These models distinguish memory operations based on their type  and provide stricter ordering constraints for some type of operations.
The Weak ordering model classifies memory operations into two categories : data operations and synchronization operations. Since the programmer is required to identify at least one of the operations as  a synchronization operation, the model can reorder memory operations between these synchronization operations without affecting the program correctness. Compared to the weak ordering model, the release consistency flavors provides further distinctions among memory operations. Operations are first distinguished as special or ordinary. Special operations are further distinguished as sync or nsync operations. Sync operation is further distinguished as acquire or release operations. The two flavors of release consistency differ based on the program orders they maintain among special operations.  The first flavor maintains sequential consistency while the second flavor maintains processor consistency. The first flavor enforces acquire to precede all operations and all operations to precede release operation in addition to requiring special operations to precede special operations.  The second flavor enforces almost the same with the exception for a special write followed by a special read. Imposing program order from a write to a read operation requires using read-modify-write operations much the same way as we had seen earlier.
The other category of models for relaxing all program orders such as Alpha, RMO and PowerPC - all provide explicit fence instructions as their safety nets.
The alpha model provides two different fence instructions: the memory barrier and the write memory barrier. The memory barrier (MB) instruction can be used to maintain program order from any memory operation before the MB to any memory instruction after the MB. The write memory barrier instruction provides this guarantee only among write operations.
The RMO model has more flavors for fence instructions. It can be customized to order a combination of read and write operations with respect to future read and write operations using a four bit encoding. Since a combination can be used to order a write with respect to a following read, there is no need for read-modify-write semantics.
The PowerPC model provides a single fence instruction : the SYNC instruction. This is similar to the memory barrier instruction with the exception that when there are two reads to the same location, one may return the value of an older write than the first read. This model therefore requires read-modify-write semantics to enforce program order.
#codingexercise
Decimal GetAlternateNumberRangeSum ()(Decimal [] A)
{
if (A == null) return 0;
Return A.AlternateNumberRangeSum ();
}

#codingexercise
Double GetAlternateNumberRangeMean ()(Double[] A)
{
if (A == null) return 0;
Return A.AlternateNumberRangeMean();
}
We now look at an alternate abstraction for Relaxed Memory Models. The models mentioned so far have put to use on a wide variety of systems however they have required a higher level of complexity for the programmers. This comes from the system-centric commands that the programmer has to use. Further, they directly expose the programmer to the reordering and atomicity optimizations that are allowed by a model. Instead the programmer could use safety nets (eg. fence instructions, more conservative operation types, or read-modify-write operations) to impose the ordering and atomicity requirements on memory operations. However its not easy to identify the ordering constraints.  For example, the weak ordering requires that programmers should identify all synchronization operations. In order that we define a programmer centric specification, we must first define the notion of correctness for the programs. Here we could use sequential consistency. And second, the information required from the programmer must be defined precisely. The information used in the weak ordering model or the release consistency model could be candidates for this information required by the programmers although it may be described in program level information.

Thursday, December 25, 2014

#codingexercise
Decimal GetAlternateNumberRangeStdDev ()(Decimal [] A)
{
if (A == null) return 0;
Return A.AlternateNumberRangeStdDev ();
}

We will continue our discussion on Shared Memory Consistency Model today. We discussed relaxing sequential consistency followed by relaxing write to read program order and relaxing write to read and write to write program orders. We now review relaxing all program orders.
By that we mean relaxing program order between all operations to different locations. Thus, a read or write operation may be re-ordered with respect to a following read or write to a different location. We discuss the weak ordering model and two flavors of the release consistency model (RCsc/RCpc) The weak ordering model classifies memory operations into two categories : data operations and synchronization operations This model reasons that since the programmer has called out one of operations as the synchronization operation, the reordering of memory operations between synchronization operations does not typically affect the correctness of the program. In other words the operations called out as synchronization operations provide the safety net to enforce program order. To implement this, each processor can provide a counter to keep track of its outstanding operations. This counter is incremented when the processor issues an operation and is decremented when a previously issued operation completes.  Each processor must ensure that a synchronization operation is not issued until all previous operations are complete, which is signaled by a zero value for that counter. Furthermore, no operations are issued until the previous synchronization operation completes. This way the memory operations may be reordered while enforcing the program order. Moreover, the writes always appear atomic to the programmer, therefore no safety net is required for write atomicity.  We now discuss the two flavors of release consistency that differ based on the program orders they maintain among special operations. The first maintains sequential consistency among special operations and the second flavor maintains processor consistency among such operations.  In the first flavor, the constraints for program order are enforced with the requirements that the acquire precedes all operations and all operations precede release in addition to special operations preceding other special operations in program order. For processor consistency in the second flavor, the write to read program order is eliminated. The acquire precedes all operations and all operations precede release operations and special operations precede special operations except for a special write followed by a special read. In the processor consistency case, imposing a program order from a write to a read operation requires using read modify write operations. Further if the write being ordered is ordinary, then the write in the corresponding read modify write needs to be a release, otherwise the write in the read-modify-write can be any special write. The use of read-modify-write also helps enforce atomicity of writes in this case.
#codingexercise

Decimal GetAlternateNumberRangeVariance ()(Decimal [] A)

{

if (A == null) return 0;

Return A.AlternateNumberRangeVariance ();

}


Today we continue our discussion

Monday, December 22, 2014

We quickly summarize the discussion on sequential consistency from the WRL report on Shared Memory consistency models. The requirements are : First, a processor must ensure that its previous memory operation is complete before proceeding with its next memory operation in program order.  This is the "program order" requirement. To determine that an operation is complete, acknowledgement messages are exchanged. Further, in a cache based system, a write must generate invalidate or update messages.  The second requirement pertains only to cache based systems and concerns write atomicity. It requires that the writes to the same memory location be serialized and made visible to all the processors. The value of the write is not returned until all messages are exchanged. This is the "write atomicity" requirement. Compiler optimizations such as register allocations violate this sequential consistency and are unsafe. Hardware and Compiler optimizations that do not violate sequential consistency include: First technique involves prefetching the ownership for any write operations that are delayed and thus partially overlapping with the operations preceding them in the program order. This technique is only applicable to cache based systems that use an invalidation based protocol.  The second technique speculatively services read operations that are delayed due to the program order requirement.  The sequential consistency is ensured in this case by rolling back and reissuing the read and subsequent operations in the case that the read line gets invalidated. The second technique is generally applicable to dynamically scheduled processors because they already have an implementation for rollbacks. Other latency hiding techniques such as non-binding software prefetching or hardware support for multiple contexts also improve performance but the above techniques are widely applicable.  Next we review more relaxed memory models.
#codingexercise
Decimal GetAlternateNumberRangeMin(Decimal [] A)
{
if (A == null) return 0;
Return A.AlternateNumberRangeMin();
}
Relaxed memory consistency models are based on two key characteristics:  how they relax the program order requirement and how they relax the write atomicity requirement. With the former, we can further distinguish models based on whether they relax the order from a write to a following read between two writes and from  a read to a following read In all cases, the relaxation only applies to operation pairs with different addresses. The architecture here resembles the one without caches that we discussed earlier.
With respect to the write atomicity requirement, we distinguish models based on whether they allow a read to return the value of another processor's write before all copies of the accessed location receive an invalidation or update messages generated by the write. This relaxation only applies to cache based systems.
Finally we consider a relaxation related to both program order and write atomicity. when a processor is allowed to read the value of its own previous write before the write is made visible to other processors.In a cache based system, the relaxation allows the read to return value of the write before the write is serialized with respect to other writes to the same location and before the invalidations / updates of write reach any other processors.

Relaxed models typically provide programmers with mechanisms for overriding such relaxations. For example, fence instructions may be provided to override program order relaxations. The ability to override relaxations is often referred to as safety nets.

The different relaxation models can now be categorized as :
Relax write to read program order
Relax write to write program order
Relax read to read and read to write program order
Read others write early
Read own write early

The corresponding relaxations are usually straightforward implementations of the corresponding model. We will review the above relaxation models in detail.  We implicitly assume the following constraints are satisfied. First all models that require a write to be eventually made visible to all processors and for writes to the same location to be serialized. This is true if shared data is not cached or enforced by cache coherence protocol when cached. Second, all models enforce uniprocessor data and control dependencies. Finally, the models that relax program order from reads to the following writes must also maintain a subtle form of microprocessor data and control dependencies.

We now look at the first relaxation model : relaxing the write to read program order.
Here the optimization enabled is that a read is allowed to be reordered with respect to previous writes from the same processor. As a consequence of this reordering, the shared bus write buffer model where two processors check and set their flags have their sequential consistency violated. The safety net feature here is to provide specialization instructions that may be placed between two operations.  The serialization functions are either special memory instructions that are  used for synchronization (e.g compare and Swap) or they are non-memory instructions such as branch. Placing the serialization instruction after the write on each processor provides sequentially consistent results for the program even in this case with the optimization. However, this is not the only technique used by these models.  A read-modify-write operation can also provide the illusion that program order is maintained between a write and a following read. Usually making the read part of or replaced by the read-modify-write instruction helps with sequential consistency. In certain models, doing the same for a write operation is not sufficient because they require that no other writes to any location appear to occur between the read and the write of the read-modify-write.
Next to enforce the write atomicity in models which relax this atomicity, making the read or replacing it by the read-modify-write suffices to enforce this atomicity as long as it is not indiscriminately used.There is however some cost in performing the write when replacing a read with a read-modify write unless the read is already part of it. Moreover the compilers require the full flexibility of both reordering a write followed by a read as well as a read followed by a write.

We now look at relaxing the write to read and write to write program orders.
These set of models further relax the program order requirement by eliminating ordering constraints between writes to different locations.The writes to different locations from the same processor can be pipelined or overlapped and are allowed to reach memory or other cached copies out of program order. With respect to atomicity requirements, this model allows the processor to read the value of its own write early and prohibits a processor from reading the value of another processor's write before the write is visible to all the processors.
The safety net provided by this model is the same as earlier where serializing instructions are placed between writes.  In this case, an STBAR instruction is placed for imposing program order between two writes. With FIFO write buffers, this instruction is inserted into the write buffer and the retiring of writes that were buffered after a STBAR are delayed until writes that were buffered before the STBAR have retired and completed.

Sunday, December 21, 2014

We continue our discussion on WRL report on Shared memory consistency models. We looked at implementing sequential consistency in architectures without caches. We specifically reviewed write buffers with bypassing capability, overlapping write operations and non-blocking read operations. We now study architectures with caches. Caching can also introduce reordering behavior. Therefore an implementation with caches must also hide it. For example, if a processor issues a read to the cache, it cannot read until its previous operations are complete. The replication of shared data introduces three additional issues: First, the presence of multiple copies requires a cache coherence protocol to propagate newly written values to all cached copies of the modified location. Second, detecting when a write is complete requires more transactions with replication. Thirdly, propagating changes to multiple copies is inherently non-atomic. We review these in this order.
Cache coherency has often implied both that a write is eventually made visible to all the processors and writes to the same location appear to be seen in the same order by all the processors. But these are not sufficient to enforce sequential consistency because that requires writes to all locations not just the same location to be made visible and also operations of a single processor must be in program order.To do this we propagate the value by either invalidating the copy or updating the copy to the newly written value. We translate the sequential consistency requirements to be an early and a late bound on when a new value can be propagated to any given processor. The early bounds should be such that the sequential order is maintained between early and late bounds of successive operations. The late bounds should be such that the program order is maintained.
To detect the completion of write operations, we were using acknowledgements in the design without caches.There the acknowledgement could be generated as soon as the value reached the memory module. Here we have to wait until the cache is invalidated or updated. For example, with an overlapped write in a general interconnect system, consider a processor P2 to have Data in its cache. P1 proceeds with the write to Head after the previous write to Data reaches the target memory but before it's propagated to P2. Now P2 can read the value of Head but return an old value of Data. This can be avoided by having P2 cache updated before write to Head.
Therefore invalidation and update messages need to be exchanged in addition to acknowledgement messages.
To maintain the illusion of atomicity for writes, we first look at the problems due to non-atomicity.  Let us take the update based protocol, Let us say there are four processors, two of whom write values to variables and two others read those variables. Initially all variables are set to zero. Then its possible to violate the sequential consistency if the writes reach the two processors arbitrarily consequently making the write of A appear non-atomic. This can be avoided by either requiring them to be written to the same location to be serialized or having the network preserve the ordering of messages.
#codingexercise
Decimal GetAlternateNumberRangeMax(Decimal [] A)
{
if (A == null) return 0;
Return A.AlternateNumberRangeMax();
}
We continue our discussion with considerations for Compilers.  A key requirement for the compiler is to preserve the program order in its optimizations and shared memory operations. This requirement directly restricts any uniprocessor compiler optimizations which can result in reordering memory operations. In addition to a reordering effect, optimizations such as register allocation also leads to the elimination of certain shared memory operations because they can potentially violate sequential consistency. Take the example we described above and the register allocation of Head of P2 which may never get updated. On the other hand, sequentially consistent code clearly avoids this.  To summarize some compiler optimizations can violate sequential consistency. Fortunately, compilers for parallel programs avoid such unsafe optimization. 

Saturday, December 20, 2014

codingexercise

Decimal GetOddDecimalRangeSum (Decimal  [] A)

{

if (A == null) return 0;

Return A.OddNumberRangeSum ();

}

Continuing on our discussion from the previous post on sequential consistency in the WRL research report on shared memory consistency model, let us know look at the implementation considerations for the same.
We first look at architectures without caches. The key design principle here is to enforce and maintain program order of execution. We also keep track of the memory operations and make sure that they are sequential.
Let us take a look at write buffers with bypassing capability. We have a bus based shared memory system with no cache. A processor issues memory operations in sequential order. The write buffer merely helps the processor to not wait on the finishing of the write. Reads are allowed to bypass any previous writes to the buffer. This is allowed as long as the read address is different from any of the addresses in the write buffer. This optimization allows us to hide the write latency.  To see how this can violate sequential consistency, let us consider the example of the Dekker algorithm mentioned earlier where we maintained that both reads of the flag cannot return the same value. However, in our bypassing capability, this is now permitted. This problem doesn't occur with uniprocessors.
Now let us take a look at overlapping write operations. Here we use a general interconnection network that alleviates the bottleneck of a bus based design. We still assume processors issue memory operations in program order. As opposed to the previous example, multiple write operations issued by the same processor may be simultaneously serviced by the different memory modules. We maintained that the reads of data by a processor should return the value written by the other processor for sequential consistency. In our case now, we violate this assumption because a write may be injected before a write makes its way to the memory module. This highlights the importance of maintaining program order between memory operations. If we coalesce other write operations to the same cache line, that can also cause the inconsistency.
One way to remedy this could be to wait for the write to reach the memory before allowing another into the network. Enforcing the above requires an acknowledgement to the processor issuing a write. These acknowledgements help with maintaining program order.
 Next we consider non-blocking read operations on the same example as above. Here a processor ensures that its write arrives at the memory location in program order. If another processor is allowed to issue its read operation, then the read might arrive before the write leading to an inconsistency.
#codingexercise
Decimal GetOddNumberRangeMode(Decimal [] A)
{
if (A == null) return 0;
Return A.OddNumberRangeMode();
}

Thursday, December 18, 2014

Today I'm going to continue discussing the WRL research report on Shared memory consistency models. Most programmers assume sequential semantics for memory operations. A read will return the value of the last write. As long as there is a uniprocessor data and dependences can be controlled, the illusion of sequentiality can be maintained. The compiler and hardware can freely reorder operations to different locations.  Compiler optimizations that reorder memory operations can include register allocation, code motion, and loop transformations, and hardware optimizations, such as pipelining, multiple issue, write buffer bypassing and forwarding and lockup-free caches.
The following are some of the myths about the memory consistency models:
1) It only applies to systems that allow multiple copies of shared data This is not true as in the case with overlapped writes and non-blocking reads.
2) Most current systems are sequentially consistent. On the contrary, Cray computers and Alpha computers relax this consistency.
3) The memory consistency model only affects the design of the hardware. But it actually affects compiler and other system design.
4)  A cache coherence protocol inherently supports sequential consistency. In reality, this is only part of it. Another myth is that the memory consistency model depends on whether the system supports an invalidate or update based coherence protocol. But it can allow both.
5) The memory model may be defined solely by specifying the behaviour of the processor. But it is affected by both the processor and memory system.
6) Relaxed memory models may not be used to hide read latency. The reality is it can hide both read and write latency.
7) Relaxed consistency models require the use of extra synchronization. There is no synchronization required for some of the relaxed models.
8) Relaxed consistency models do not allow asynchronous algorithms. This can actually be supported.

#codingexercise
Decimal GetOddNumberRangeMean(Decimal [] A)
{

if (A == null) return 0;


Return A.OddNumberRangeMean();

}


Tonight we will continue to discuss this WRL Research report on Shared memory consistency models. In particular, we will be discussing Sequential consistency. This is the most commonly assumed memory consistency model. The definition is that the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appears in program order. Maintaining this program order together with maintaining a single sequential order among all the processors is the core tenet of this model.  The latter part actually requires that memory is updated atomically with respect to other operations. Let us take two examples for using this model. We use a code segment from Dekker's algorithm for critical sections, involving two processors and two flags that are initialized to zero. When a processor enters the critical section, it updates its flag and checks the other. Here the assumption is that the value is read only after it is written. Here the expectation that the program order is maintained ensures sequential consistency and a race condition.  To illustrate the importance of the atomic execution, let us instead consider an example where there are three processors sharing variables A and B both set to zero. Here if one of the processor reads a variable and writes to the other which another processor reads, then the atomicity requires that the updates made to the variable are seen by all the processors at once.