Tuesday, January 6, 2015

We discuss next the performance results for the Shasta implementation. The cluster comprised four AlphaServer 4100s connected by a Memory Channel network, a memory-mapped network that allows a process to transmit data to a remote process, without any operating system overhead, via a simple store to a mapped page. The Shasta implementation included a message passing layer that ran efficiently on top of the Memory Channel. By using separate message buffers between each pair of processors, locking is eliminated when adding or removing messages from the buffers. When the communicating processors are on the same node, message passing goes through shared memory segments instead.
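To make the per-pair buffering idea concrete, here is a minimal sketch (not the actual Shasta messaging layer) of a buffer dedicated to one sender/receiver pair. Because each buffer has exactly one producer and one consumer, enqueue and dequeue need no lock; the class and member names are illustrative.

public sealed class PairBuffer
{
    // One instance is allocated per (sender, receiver) pair, so no locking is required.
    private readonly byte[][] slots;
    private int head;   // next slot to read; written only by the consumer
    private int tail;   // next slot to write; written only by the producer

    public PairBuffer(int capacity) { slots = new byte[capacity][]; }

    // Called only by the sending processor.
    public bool TryEnqueue(byte[] message)
    {
        int t = tail;
        int next = (t + 1) % slots.Length;
        if (next == System.Threading.Volatile.Read(ref head)) return false; // buffer full
        slots[t] = message;
        System.Threading.Volatile.Write(ref tail, next); // publish only after the slot is written
        return true;
    }

    // Called only by the receiving processor, e.g. from its polling loop.
    public bool TryDequeue(out byte[] message)
    {
        int h = head;
        if (h == System.Threading.Volatile.Read(ref tail)) { message = null; return false; } // empty
        message = slots[h];
        System.Threading.Volatile.Write(ref head, (h + 1) % slots.Length);
        return true;
    }
}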
Nine applications were run for at least a few seconds on the cluster. When the checking overheads were compared to the sequential running times, they ranged from 10 to 25%. In the parallel runs this overhead mattered relatively less, since communication and synchronization overheads dominate.
The baseline for parallel execution was measured with a fixed 64-byte line size, the home placement optimization, all optimizations related to exploiting release consistency, and the option to avoid sharing write-backs. All applications achieve higher speedups with more processors.
To study the effects of variable coherence granularity, five of the applications were run with granularities larger than 64 bytes. Variable granularity improved performance by transferring data in larger units and reducing the number of misses on the main data structures.
When some of the applications were run with slightly larger problem sizes, the miss check overheads stayed the same, but the speedups improved significantly.
The base set of runs exploited all of the optimizations related to batching and release consistency. When compared with sequential consistency, the gain in performance was as much as 10%, with the gains more noticeable with fewer processors. Adding the non-blocking release optimization does not visibly improve performance beyond the base set of runs and in some cases leads to slightly lower performance.
The execution time was also broken down and studied. Task time represented the time spent executing the application, including hardware cache misses; it also included the time for executing the inline miss checks and the code necessary to enter the protocol. Read time and write time represented the stall times for read and write misses. Synchronization time represented the stall time for application locks and barriers. Message time represented the time spent handling messages.
The runs indicated a significant reduction, and sometimes the elimination, of write stall time. The read, synchronization, and message handling overheads increased, but this had little or no effect on overall performance.
For the subset of applications that exploit variable block sizes, the breakdowns indicated higher efficiency than the runs with a 64-byte block size, and the trends were similar to those described above.
To further isolate the effects of the various optimizations used in the base set of runs, the experiments were repeated while allowing only the overlap of multiple misses within a batch, relative to sequential consistency. This showed virtually no improvement in performance. A second set of experiments added the eager exclusive reply optimization, where the reply data for a read-exclusive request is used by the requesting processor before all invalidations are acknowledged. In this case too, the performance did not improve. Therefore much of the performance difference between the sequential consistency runs and the base runs can be attributed to the non-blocking store optimization.
#codingexercise
Double GetAlternateEvenNumberRangeMin(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeMin();
}

Monday, January 5, 2015

Today we continue the discussion on the performance of the Shasta Distributed Shared Memory Protocol. We were discussing protocol optimizations with prefetch and home placement directives. Shared data in the system undergoes transitions among only three states - invalid, shared and exclusive. Shasta maintains cache coherence by requiring a shared miss check on the data referenced by each load and store. Shared data is divided, based on address space ranges, into blocks and lines. The operations supported are read, read-exclusive and exclusive/upgrade, and a state table entry is maintained for each line. Each processor maintains directory information and an owner for each line. With these data structures and the tight transitions, coherence is guaranteed via communication.
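As a rough illustration of these transitions (not the actual Shasta code; the sketch below ignores pending states, directory updates and message traffic), the state changes on the three request types could be pictured as:

public enum LineState { Invalid, Shared, Exclusive }
public enum RequestType { Read, ReadExclusive, Exclusive }

public static class LineTransitions
{
    // New state of the requesting processor's copy once its request is satisfied.
    public static LineState OnReplyReceived(RequestType request)
    {
        switch (request)
        {
            case RequestType.Read:          return LineState.Shared;
            case RequestType.ReadExclusive: return LineState.Exclusive;
            default:                        return LineState.Exclusive; // exclusive/upgrade
        }
    }

    // New state of a local copy when another processor's request is serviced here.
    public static LineState OnRemoteRequest(LineState current, RequestType request)
    {
        // A remote read downgrades an exclusive copy to shared; any request for
        // ownership invalidates the local copy.
        if (request == RequestType.Read)
            return current == LineState.Exclusive ? LineState.Shared : current;
        return LineState.Invalid;
    }
}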
We now look at detecting migratory sharing patterns. Migratory sharing occurs when data is read and modified by different processors, leading to the migration of the data from one processor to another. By keeping extra information at each directory entry, the protocol detects whether the data in each line exhibits a migratory pattern. When migratory sharing is observed a threshold number of times, the line is designated for migratory conversion, and read requests are automatically converted to read-exclusive requests at the directory. This conversion avoids the load miss followed by a store miss to the same line that is typical for migratory shared data. The protocol provides a mechanism to revert a line from migratory conversion. The reply data for a converted read request is cached in a special state called exclusive-migratory. The owner processor treats the line as exclusive and, on a subsequent store, changes the line to the ordinary exclusive state. Breaks in the migratory behavior are easy to detect: if an incoming request from another processor arrives before the owner processor writes to the line, i.e. while the line is still exclusive-migratory, then a break is observed and a message is sent to the home directory to nullify or revert the migratory conversion for that line.
#codingexercise

Double GetAlternateEvenNumberRangeCount(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeCount();
}
#codingexercise
Double GetAlternateEvenNumberRangeMin(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeMin();
}

The switching to and from the migratory conversion state can be stopped for a given line if the line reverts a threshold number of times. This removes needless conversions and avoids continuous switching.
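A minimal sketch of this counting scheme, with hypothetical thresholds and field names:

public sealed class MigratoryInfo
{
    private const int ConversionThreshold = 4; // migratory accesses observed before converting
    private const int RevertThreshold = 2;     // reverts tolerated before disabling conversion

    private int migratoryCount;
    private int revertCount;
    public bool ConversionEnabled { get; private set; } = true;
    public bool Converted { get; private set; }

    // Called at the directory when the migratory pattern (a read followed by a
    // store from the same processor) is observed for this line.
    public void ObservedMigratoryPattern()
    {
        if (!ConversionEnabled || Converted) return;
        if (++migratoryCount >= ConversionThreshold)
            Converted = true;  // subsequent reads are converted to read-exclusive requests
    }

    // Called when a break in migratory behavior is detected, i.e. another processor's
    // request arrives while the line is still exclusive-migratory.
    public void ObservedBreak()
    {
        Converted = false;
        migratoryCount = 0;
        if (++revertCount >= RevertThreshold)
            ConversionEnabled = false; // stop switching to and from migratory conversion
    }
}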

Sunday, January 4, 2015

Today we continue the discussion on the performance of the Shasta Distributed Shared Memory Protocol. We were discussing multiple coherence granularities. The block size is chosen automatically: it is the same as the allocated size up to 64 bytes, and the line size otherwise. The programmer can override this to fine-tune the application. Different granularities are associated with different virtual pages, and newly allocated data is placed on the appropriate page. The block size of each page is communicated to all the nodes at the time the pool of shared pages is allocated. A caveat with configurable granularities is that too much control may adversely affect system performance. The home processor for individual pages can be explicitly specified. Shasta also supports non-binding prefetch and prefetch-exclusive directives, enabling the programmer to request prefetching at points in the code where a large number of misses occur.
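One way to picture the association of granularities with virtual pages is a table that maps each shared page to its block size. The rule below follows the default described above; the page size, helper names and the broadcast step are assumptions.

public sealed class PageBlockSizes
{
    private const int PageSize = 8192; // assumed page size
    private const int LineSize = 64;   // fixed line size
    private readonly System.Collections.Generic.Dictionary<long, int> blockSizeOfPage =
        new System.Collections.Generic.Dictionary<long, int>();

    // Called when a page from the shared pool is assigned to newly allocated data.
    public void AssignPage(long pageNumber, int allocationSize, int? overrideSize = null)
    {
        // Default: the allocated size up to 64 bytes, the line size otherwise,
        // unless the programmer overrides the choice.
        int size = overrideSize ?? (allocationSize <= 64 ? allocationSize : LineSize);
        blockSizeOfPage[pageNumber] = size;
        // In the real system this choice is communicated to all nodes at this point.
    }

    public int BlockSizeFor(long address) => blockSizeOfPage[address / PageSize];
}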
This protocol exploits relaxed memory consistency models. It emulates the behavior of a processor with non-blocking loads and stores and a lockup-free cache. A non-blocking store is supported by issuing a read-exclusive or exclusive request and recording where the store occurred, which lets the operation continue; Shasta later merges the reply data with the newly written data that is already in memory. Non-blocking loads are implemented through the batching optimization. Further, non-blocking releases are supported by delaying a release operation on the side until all previous operations have completed, which allows the processor to continue with the operations following the release. Since load and store operations are non-blocking, a line may be in one of two pending states - pending invalid and pending shared. The pending invalid state corresponds to an outstanding read or read-exclusive request on that line; the pending shared state corresponds to an outstanding exclusive request.
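The bookkeeping implied by a non-blocking store can be sketched as follows: the line enters a pending state, the location of the store is recorded, and when the reply arrives its data is merged with what was written locally in the meantime. The names, and the byte-granularity merge, are simplifying assumptions.

public enum PendingState { None, PendingInvalid, PendingShared }

public sealed class PendingLine
{
    public PendingState State;
    private readonly System.Collections.Generic.List<int> pendingStoreOffsets =
        new System.Collections.Generic.List<int>();

    // A store to a line that is not exclusive: record where it happened and let
    // the processor continue instead of stalling.
    public void OnNonBlockingStore(int offsetInLine, bool lineWasInvalid)
    {
        State = lineWasInvalid ? PendingState.PendingInvalid  // read-exclusive request outstanding
                               : PendingState.PendingShared;  // exclusive/upgrade request outstanding
        pendingStoreOffsets.Add(offsetInLine);
    }

    // When the reply arrives, merge it with the newly written bytes so the local
    // stores performed while the request was outstanding are preserved.
    public void OnReply(byte[] replyData, byte[] lineInMemory)
    {
        foreach (int offset in pendingStoreOffsets)
            replyData[offset] = lineInMemory[offset]; // keep the locally written value
        System.Array.Copy(replyData, lineInMemory, lineInMemory.Length);
        State = PendingState.None;
        pendingStoreOffsets.Clear();
    }
}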
Batching reduces communication by merging load and store misses to the same line and by issuing requests for multiple lines at the same time. Shasta handles a miss corresponding to a batched load or store by jumping to the inline batch miss code if there is a miss on any line within the batch. The inline code calls a batch miss handler that issues all the necessary miss requests. Non-stalling stores are implemented by not requiring the handler to wait for invalidation acknowledgements. One caveat with batching is that although the batch miss handler may bring in all the necessary lines, it cannot guarantee that they will all be in the appropriate state once the replies have come back. The reason is that while the handler is waiting for replies, requests from other processors are serviced, and they can change the state of the lines in a batch. This does not affect load operations, which still get the correct value as long as the original contents of the line remain in main memory; hence the flag for invalidated lines is not set until the batch has ended. After the batch code has executed, the invalidation is completed at the next entry into the protocol. At that time, stores to lines that are no longer in the exclusive or pending shared state are reissued.
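A hypothetical outline of this batch handling, with the message-sending details stubbed out:

public sealed class BatchHandler
{
    private readonly System.Collections.Generic.Queue<long> deferredInvalidations =
        new System.Collections.Generic.Queue<long>();

    // Called from the inline batch miss code when a line in the batch is not in
    // the required state. All requests are issued before waiting (non-stalling).
    public void BatchMiss(long firstLine, long lastLine)
    {
        for (long line = firstLine; line <= lastLine; line++)
            IssueMissRequest(line);
        WaitForReplies();
        // While waiting, requests from other processors may have been serviced and
        // invalidated some of these lines; their invalid flags are set only after
        // the batch has ended (see EndBatch).
    }

    // Records an invalidation that arrives while a batch is in progress.
    public void InvalidateDuringBatch(long line) => deferredInvalidations.Enqueue(line);

    // At the next entry into the protocol after the batch, complete the deferred
    // invalidations; stores to lines no longer exclusive or pending shared are reissued.
    public void EndBatch()
    {
        while (deferredInvalidations.Count > 0)
            CompleteInvalidation(deferredInvalidations.Dequeue());
    }

    private void IssueMissRequest(long line) { /* send a read or read-exclusive message */ }
    private void WaitForReplies() { /* poll until replies for the issued requests arrive */ }
    private void CompleteInvalidation(long line) { /* mark the line invalid, reissue stores */ }
}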
#codingexercise
Double GetAlternateEvenNumberRangeMid(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeMid();
}

Saturday, January 3, 2015

Today we continue the discussion on the performance of the Shasta Distributed Shared Memory Protocol. We study the performance optimizations in Shasta that reduce the effect of the long latencies and large message overheads typical of software DSM systems. The optimizations include minimizing extraneous protocol messages, supporting prefetch and home placement directives, supporting coherence and communication at multiple granularities, exploiting a relaxed memory model, batching together requests for multiple misses, and optimizing access to migratory data. We review these one by one. The number of messages plays a significant role in the overhead of a DSM protocol, so extraneous messages are trimmed. First, the Shasta protocol has the key property that the current owner node specified by the directory guarantees to service a request that is forwarded to it. Hence there are no retries, nor messages such as negative acknowledgements for retried requests. Transient states are handled by allocating queue space at the target processor to delay servicing an incoming request; messages for sharing write-backs, ownership changes and replacements are avoided as well, and the current owner always has a valid copy of the data. There are other ways to trim extraneous messages too. Second, Shasta supports dirty sharing, which eliminates the need to send an up-to-date copy of the line back to the home when the home node is remote and the data is dirty in a third node. Third, supporting exclusive or upgrade requests is an important optimization since it avoids fetching data on a store when the requesting processor already has the line in the shared state. Lastly, the number of invalidation acknowledgements expected for an exclusive request is piggybacked on one of the invalidation acknowledgements to the requestor rather than being sent as a separate message.
Shasta supports multiple granularities for communication and coherence, even within a single application. This is enabled by making the granularity configurable in the inline state check. Data that is accessed together can be communicated at a coarser granularity within the same application, while data that leads to false sharing can be communicated at a finer granularity. Shared data in the system undergoes transitions among only three states - invalid, shared and exclusive. Shasta maintains cache coherence by requiring a shared miss check on the data referenced by each load and store. Shared data is divided, based on address space ranges, into blocks and lines. The operations supported are read, read-exclusive and exclusive/upgrade, and a state table entry is maintained for each line. Each processor maintains directory information and an owner for each line. With these data structures and the tight transitions, coherence is guaranteed via communication.
#codingexercise
Double GetAlternateNumberRangeVariance(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeVariance();
}

#codingexercise

Double GetAlternateNumberRangemid(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangemid();
}

Friday, January 2, 2015

Today we will continue to discuss the WRL Research report on the Shasta shared memory protocol. We introduced the system in our last post, and in this post we will continue to look at some of the checks that Shasta instruments. First let us look at a basic shared miss check. This code first checks whether the target address is in the shared memory range and, if not, skips the remainder of the check. Otherwise the code calculates the address of the state table entry corresponding to the target address and checks that the line containing the target address is in the exclusive state. While Shasta optimizes these checks, the cost of the miss checks is still high. To reduce the overhead further, some more advanced optimizations are used. These are described next:
Invalid flag technique: whenever a line becomes invalid on a processor, the Shasta protocol stores a particular flag value in each longword of the line. The miss check code for a load can then just compare the value loaded with the flag value. If the loaded value is not equal to the flag value, the data must be valid and the application code can continue immediately. If the loaded value is equal to the flag value, a miss routine is called that first does the normal range check and state table lookup. The state check distinguishes an actual miss from a "false miss" - the case where the application data actually contains the flag value - and simply returns to the application code on a false miss. Another advantage of this technique is that the load of the state table entry is eliminated from the common case. The second optimization is batching miss checks. When the checks for multiple loads and stores are batched together, the overhead is reduced significantly. If the base register is the same and the offsets are all less than or equal to the Shasta line size, then a sequence of these loads and stores collectively touches at most two consecutive lines in memory. Therefore, if inline checks verify that these two lines are in the correct state, all the loads and stores can proceed without further checks. It is convenient to check both lines by just checking the beginning and ending addresses of the range. This batching technique also applies to loads and stores via multiple base registers, as long as the range of addresses touched can be determined.
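In Shasta these checks are a handful of inlined Alpha instructions; purely as a sketch of the control flow, a load-side check using the invalid flag technique might look like the following in C#. The flag value, the word-based addressing and the handler names are assumptions, and the shared-address-range check is folded into the miss routine as described above.

public static class MissCheck
{
    public enum LineState { Invalid, Shared, Exclusive }

    private const long FlagValue = unchecked((long)0xDEADBEEFDEADBEEF); // assumed sentinel

    // Fast path: compare the loaded value against the flag; only on a match do the
    // state table lookup to tell a real miss from a "false miss".
    public static long CheckedLoad(long[] sharedMemory, int wordIndex,
                                   LineState[] stateTable, int wordsPerLine)
    {
        long value = sharedMemory[wordIndex];
        if (value != FlagValue)
            return value;                           // data must be valid; continue immediately

        int line = wordIndex / wordsPerLine;
        if (stateTable[line] == LineState.Invalid)
            return LoadMissHandler(line, sharedMemory, wordIndex); // real miss: fetch the line
        return value;                               // false miss: the data really is the flag value
    }

    private static long LoadMissHandler(int line, long[] sharedMemory, int wordIndex)
    {
        // Fetch the line from its home or owner, install it, and return the word.
        return sharedMemory[wordIndex];
    }
}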

Thursday, January 1, 2015

#codingexercise
Double GetAlternateNumberRangeMode(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeMode();
}
Today we discuss the WRL Research report on the performance of the Shasta Distributed Shared Memory Protocol. Shasta is a software system that supports a shared address space across a cluster of computers with physically distributed memory. Shasta can keep data coherent at a fine granularity. It implements this coherence by inlining code that checks the cache state of shared data before each load or store. In addition, Shasta allows the coherence granularity to be varied across different shared data structures in the same application. This is helpful when compared to similar systems that suffer inefficiencies due to a fixed, large, page-size granularity. This paper discusses the cache coherence protocol in Shasta. The protocol borrows a number of optimizations from hardware systems, and since it is written in software, it provides tremendous flexibility in protocol design. The protocol runs on Alpha systems connected via Digital's Memory Channel network. The performance of Shasta was studied with different optimizations enabled.
Since Shasta supports the shared address space entirely in software and at a fine granularity of coherence, it reduces false sharing and the transmission of unneeded data, both of which are potential problems in systems with large coherence granularities. Code is inserted into the application executable before loads and stores to check whether the data being accessed is available locally in the appropriate state. These checks can be further minimized with appropriate optimizations, making Shasta a viable alternative to existing systems.
The Shasta protocol provides a number of mechanisms for dealing with the long communication latencies in a workstation cluster. Since it can support variable coherence granularities, it can exploit any potential gains from larger communication granularities for specific shared data. As for concerns over the overheads of software-based messaging, Shasta minimizes extraneous coherence messages and uses fewer messages to satisfy shared memory operations than the protocols commonly used in hardware systems.
In fact, the optimizations that attempt to hide memory latencies by exploiting a relaxed memory consistency model lead to much more limited gains. This is because the time a processor would otherwise spend waiting for data or synchronization is largely overlapped with the handling of incoming coherence messages from other processors, making it difficult to improve performance by reducing that wait. Finally, optimizations related to migratory data are not useful in Shasta because the migratory sharing patterns are unstable or even absent at block sizes of 64 bytes or higher.
Shasta divides the virtual address space of each processor into private and shared regions. Data in the shared region may be cached by multiple processors at the same time, with copies residing at the same virtual address on each processor.
#codingexercise
Double GetAlternateNumberRangeCount(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeCount();
}

Let us discuss this WRL Research report some more. Hardware cache-coherent multiprocessors and Shasta both define the same basic states:
invalid - the data is not valid on this processor.
shared - the data is valid on this processor, and other processors have copies of the data as well.
exclusive - the data is valid on this processor, and no other processors have copies of this data.
Shasta inserts checks in the code prior to each load and store to detect accesses to data in the invalid state. When a processor attempts to write data that is in the invalid or shared state, we say there is a shared miss; the checks inserted by Shasta also detect this shared miss.
Also, as in hardware shared memory systems, Shasta divides the shared address space into ranges of memory called blocks. All data within a block is in the same state and is fetched and kept coherent as a unit. Shasta allows this block size to differ across different ranges of the shared address space. Further, to simplify the instrumentation, Shasta subdivides the address space into fixed-size ranges called lines and maintains state information for each line in a state table. The line size is configurable at compile time, and the block size is taken to be a multiple of the fixed line size.
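Concretely, with assumed sizes and names, the relationship between lines and blocks can be pictured as follows: state is kept per line, but a state change applies to every line of the enclosing block so that the block stays coherent as a unit.

public static class BlocksAndLines
{
    public enum LineState : byte { Invalid, Shared, Exclusive }

    private const int LineSize = 64;          // fixed at compile time (assumed value)
    public static int[] BlockSizeOfLine;      // a multiple of LineSize, varying by address range
    public static LineState[] StateTable;     // one entry per line

    // Apply a new state to the whole block containing the given shared address,
    // so that all data within a block stays in the same state.
    public static void ApplyStateToBlock(long address, long sharedBase, LineState newState)
    {
        int line = (int)((address - sharedBase) / LineSize);
        int linesPerBlock = BlockSizeOfLine[line] / LineSize;
        int firstLine = (line / linesPerBlock) * linesPerBlock;
        for (int i = 0; i < linesPerBlock; i++)
            StateTable[firstLine + i] = newState;
    }
}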
Coherence is maintained using a directory-based invalidation protocol. It supports three types of requests - read, read-exclusive, and exclusive (or upgrade). Supporting exclusive requests is an important optimization since it reduces message latency and overhead when the requesting processor already has the line in the shared state. Shasta also supports three types of synchronization primitives in the protocol: locks, barriers, and event flags. These primitives are sufficient for the applications run on the system.
Each virtual page of shared data is associated with a home processor, and each processor maintains the directory information for the shared data homed on it. Each line is assigned an owner processor, which is the last processor that held an exclusive copy of the line. The directory information comprises a pointer to the current owner processor and a full bit vector of the processors that are sharing the data. Data can be shared without requiring the home node to have an up-to-date copy; this is called dirty sharing. A request that arrives at the home is forwarded to the owner unless the home processor has an up-to-date copy.
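A sketch of the directory information kept at the home, with the field names assumed:

public sealed class DirectoryEntry
{
    public int Owner;                            // last processor to hold an exclusive copy
    public System.Collections.BitArray Sharers;  // full bit vector of sharing processors
    public bool HomeHasUpToDateCopy;             // false while the line is in dirty sharing

    public DirectoryEntry(int numProcessors)
    {
        Sharers = new System.Collections.BitArray(numProcessors);
        HomeHasUpToDateCopy = true;
    }

    // A request arriving at the home is forwarded to the owner when the home's
    // copy is not up to date (dirty sharing).
    public bool MustForwardToOwner => !HomeHasUpToDateCopy;
}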
Message passing is costly when done via interrupts, so messages are serviced through a polling mechanism. Polling is cheaper because there is a single cacheable location that can be tested to determine whether a message has arrived. Polls are inserted at every loop back-edge to ensure reasonable response times, and the protocol also polls whenever it waits for a reply. Further, no protocol messages are handled between a shared miss check and the load or store being checked, which simplifies the inline miss checks.
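The polling idea amounts to a single cheap test inserted at loop back-edges; the flag and handler names below are hypothetical.

public static class MessagePolling
{
    // Single cacheable location set by the messaging layer when a message arrives.
    private static volatile bool messagePending;

    // Inserted at every loop back-edge and wherever the protocol waits for a reply.
    public static void Poll()
    {
        if (messagePending)          // the common case is just this one test
            DrainIncomingMessages();
    }

    private static void DrainIncomingMessages()
    {
        // Service all queued protocol messages, then clear the flag.
        messagePending = false;
    }
}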
#codingexercise
Double GetAlternateNumberRangeStdDev(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeStdDev();
}