Thursday, January 1, 2015

#codingexercise
Double GetAlternateNumberRangeMode(Double[] A)
{
    if (A == null) return 0;
    // Assumes an AlternateNumberRangeMode extension method on Double[].
    return A.AlternateNumberRangeMode();
}
Today we discuss the WRL Research report on the performance of the Shasta Distributed Shared Memory Protocol. Shasta is a software system that supports a shared address space across a cluster of computers with physically distributed memory. Shasta can keep data coherent at a fine granularity. It implements this coherence by inlining code that checks the cache state of shared data before each load or store. In addition, Shasta allows the coherence granularity to be varied across different shared data structures in the same application. This is helpful when compared to similar-purpose systems that suffer inefficiencies due to a fixed, large page-size granularity. This paper talks about the cache coherence protocol in Shasta. The protocol borrows a number of optimizations from hardware systems, and since it is written in software, it provides tremendous flexibility in the design of the protocol. The protocol runs on Alpha systems connected via Digital's Memory Channel network. The performance of Shasta was studied with the different optimizations enabled.
Since Shasta supports the shared address space entirely in software and at a fine granularity of coherence, it reduces false sharing and the transmission of unneeded data, both of which are potential problems in systems with large coherence granularities. Code is inserted into the application executable before loads and stores to check whether the data being accessed is available locally in the appropriate state. These checks can further be minimized with appropriate optimizations, making Shasta a viable alternative to existing systems.
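To make the inline checks concrete, here is a minimal C# sketch of what a check before a load might look like. The names LineState, stateTable, LineIndexOf, and HandleLoadMiss are hypothetical and used only for illustration; in Shasta the checks are machine code inserted directly into the application executable.

// Hypothetical sketch of the inline check before a load of shared data.
enum LineState { Invalid, Shared, Exclusive }

static class InlineCheckSketch
{
    const int LineSize = 64;                                 // line size fixed at compile time
    static LineState[] stateTable = new LineState[1 << 20];  // one state entry per line

    static int LineIndexOf(long address) => (int)(address / LineSize);

    // The check inserted before a load: if the line is invalid,
    // invoke the protocol miss handler to fetch the block first.
    static double CheckedLoad(double[] sharedData, long address, int index)
    {
        if (stateTable[LineIndexOf(address)] == LineState.Invalid)
            HandleLoadMiss(address);   // placeholder: request the block from its home/owner
        return sharedData[index];
    }

    static void HandleLoadMiss(long address)
    {
        // Placeholder: send a read request to the home processor and wait for the reply.
    }
}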
The Shasta protocol provides a number of mechanisms for dealing with the long communication latencies in a workstation cluster. Since it can support variable coherence granularities, it can exploit any potential gains from larger communication granularities for specific shared data. As for concerns over the overhead of software-based messaging, Shasta minimizes extraneous coherence messages and uses fewer messages to satisfy shared memory operations than the protocols commonly used in hardware systems.
In fact, the optimizations that attempt to hide memory latencies by exploiting a relaxed memory consistency model lead to much more limited gains. This is because the time a processor spends waiting for data or synchronization is largely overlapped with the handling of incoming coherence messages from other processors, which makes it difficult to improve performance by reducing the wait. Finally, optimizations related to migratory data are not useful in Shasta because the migratory sharing patterns are unstable or even absent at block sizes of 64 bytes or higher.
Shasta divides the virtual address space of each processor into private and shared regions. Data in the shared region may be cached by multiple processors at the same time, with copies residing at the same virtual address on each processor.
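As a rough illustration, and assuming a single contiguous shared region at a made-up base address, a processor can classify an address as private or shared with a simple range test; the same shared virtual address refers to the same data on every processor.

// Hypothetical sketch: classifying an address as private or shared.
static class AddressSpaceSketch
{
    const long SharedBase = 0x40000000;   // assumed start of the shared region
    const long SharedSize = 0x40000000;   // assumed size of the shared region

    static bool IsShared(long address) =>
        address >= SharedBase && address < SharedBase + SharedSize;
}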
#codingexercise
Double GetAlternateNumberRangeCount(Double[] A)
{
    if (A == null) return 0;
    // Assumes an AlternateNumberRangeCount extension method on Double[].
    return A.AlternateNumberRangeCount();
}

Let us discuss some more on this WRL Research report. Hardware cache-coherent multiprocessors and Shasta both define the same basic states:
invalid - the data is not valid on this processor.
shared - the data is valid on this processor, and other processors have copies of the data as well.
exclusive - the data is valid on this processor, and no other processors have copies of this data.
Shasta inserts checks in the code prior to each load and store to detect data that is not in a sufficient state for the access. When a processor attempts to read data that is invalid, or to write data that is in the invalid or shared state, there is a shared miss; the inserted checks detect these shared misses.
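A small sketch of the shared-miss condition, using the same hypothetical LineState naming as the earlier sketch: a load needs the line in at least the shared state, while a store needs the exclusive state.

// Hypothetical sketch of the shared-miss tests for loads and stores.
enum LineState { Invalid, Shared, Exclusive }

static class SharedMissSketch
{
    // A load misses only when the line is invalid.
    static bool LoadMisses(LineState s) => s == LineState.Invalid;

    // A store misses when the line is invalid or merely shared,
    // because writing requires the exclusive state.
    static bool StoreMisses(LineState s) => s != LineState.Exclusive;
}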
Also, as in hardware shared-memory systems, Shasta divides up the shared address space into ranges of memory called blocks. All data within a block is in the same state and is fetched and kept coherent as a unit. Shasta allows this block size to be different for different ranges of the shared address space. To simplify the instrumentation, Shasta additionally divides the address space into fixed-size ranges called lines and maintains state information for each line in a state table. The line size is configurable at compile time, and the block size is taken to be a multiple of the fixed line size.
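Here is a hypothetical sketch of this line and block bookkeeping: state is kept per line, the line size is fixed at compile time, and each shared range can use a block size that is a multiple of the line size. The names and the policy in IsLargeGrainRange are made up for illustration.

// Hypothetical sketch of per-line state with per-range block sizes.
static class LineTableSketch
{
    enum LineState { Invalid, Shared, Exclusive }

    const int LineSize = 64;                                  // fixed at compile time
    static LineState[] stateTable = new LineState[1 << 20];   // one state entry per line

    static int LineIndexOf(long address) => (int)(address / LineSize);

    // The block size may differ per shared range, but is always a multiple of the line size.
    static int BlockSizeFor(long address) => IsLargeGrainRange(address) ? 16 * LineSize : LineSize;

    // All lines of a block change state together, so the block is kept coherent as a unit.
    static void SetBlockState(long blockStart, LineState s)
    {
        int lines = BlockSizeFor(blockStart) / LineSize;
        for (int i = 0; i < lines; i++)
            stateTable[LineIndexOf(blockStart) + i] = s;
    }

    static bool IsLargeGrainRange(long address) => false;     // made-up placeholder policy
}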
Coherence is maintained using a directory-based invalidation protocol. It supports three types of requests: read, read-exclusive, and exclusive (or upgrade). Supporting exclusive requests is an important optimization since it reduces message latency and overhead when the processor already has the line in the shared state. Shasta also supports three types of synchronization primitives in the protocol: locks, barriers, and event flags. These primitives are sufficient for supporting the applications run on the system.
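A sketch of how the request type for a write might be chosen, with hypothetical names: if the line is already in the shared state, an exclusive (upgrade) request suffices because the data itself need not be transferred.

// Hypothetical sketch of choosing the request type for a write.
enum RequestType { Read, ReadExclusive, Exclusive /* upgrade: no data transfer needed */ }

static class RequestSketch
{
    enum LineState { Invalid, Shared, Exclusive }

    static RequestType RequestForStore(LineState current) =>
        current == LineState.Shared
            ? RequestType.Exclusive        // data already present; only ownership is needed
            : RequestType.ReadExclusive;   // need both the data and ownership
}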
Each virtual page of data is associated with a home processor, and each processor maintains the directory information for its shared data. Each line is assigned an owner processor, which is the last processor that held an exclusive copy of the line. The directory information comprises a pointer to the current owner processor and a full bit vector of the processors that are sharing the data. Data can be shared without requiring the home node to have an up-to-date copy; this is called dirty sharing. A request that arrives at the home is forwarded to the owner unless the home processor itself has a copy.
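The directory information could be sketched as below; the field and method names are hypothetical. With dirty sharing, the home may not hold an up-to-date copy, so a request arriving at the home is forwarded to the current owner unless the home itself has a copy.

// Hypothetical sketch of the per-line directory information kept at the home.
struct DirectoryEntry
{
    public int OwnerProcessor;      // last processor to hold an exclusive copy
    public ulong SharerBitVector;   // one bit per processor holding a shared copy
}

static class DirectorySketch
{
    // Decide which processor should service a request arriving at the home.
    static int ServicingProcessor(DirectoryEntry e, int homeProcessor)
    {
        bool homeHasCopy = e.OwnerProcessor == homeProcessor
                           || (e.SharerBitVector & (1UL << homeProcessor)) != 0;
        // With dirty sharing the home may not have an up-to-date copy,
        // so the request is forwarded to the current owner.
        return homeHasCopy ? homeProcessor : e.OwnerProcessor;
    }
}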
Message passing is costly when done via interrupts, hence messages are serviced through a polling mechanism. Polling is cheaper because there is a single cacheable location that can be tested to determine whether a message has arrived. Polls are inserted at every loop backedge to ensure reasonable response times, and the protocol also polls whenever it waits for a reply. Further, because messages are handled only at polls, no messages arrive between a shared miss check and the load or store being checked, and this simplifies the inline miss checks.
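A sketch of the polling idea, with hypothetical names: a single flag location is tested at loop backedges and in protocol wait loops, and the more expensive message dispatch runs only when the flag is set.

// Hypothetical sketch of message polling at loop backedges.
static class PollingSketch
{
    // A single cacheable location set by the messaging layer when a message arrives.
    static volatile bool messagePending;

    // Inserted at every loop backedge and in protocol wait loops.
    static void Poll()
    {
        if (messagePending)
            DispatchIncomingMessages();   // pay the dispatch cost only when work is pending
    }

    static void DispatchIncomingMessages()
    {
        messagePending = false;
        // Placeholder: drain and handle queued protocol messages.
    }
}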
#codingexercise
Double GetAlternateNumberRangeStdDev(Double[] A)
{
    if (A == null) return 0;
    // Assumes an AlternateNumberRangeStdDev extension method on Double[].
    return A.AlternateNumberRangeStdDev();
}
