Sunday, January 4, 2015

Today we continue the discussion on the performance of the Shasta Distributed Shared Memory Protocol. We were discussing multiple coherence granularities. The block size is chosen automatically: it equals the allocation size for allocations up to 64 bytes and the line size otherwise. The programmer can override this choice to fine-tune an application. Different granularities are associated with different virtual pages, and newly allocated data is placed on a page with the appropriate block size. The block size of each page is communicated to all the nodes at the time the pool of shared pages is allocated. A caveat with configurable granularity is that too much control may adversely affect system performance. The home processor for individual pages can also be explicitly specified. Shasta further supports non-binding prefetch and prefetch-exclusive directives, enabling the programmer to issue prefetches at points in the code where a large number of misses occurs.
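
To make these controls concrete, here is a minimal C# sketch of the kind of interface this implies; Dsm.SharedMalloc and Dsm.Prefetch are hypothetical names for exposition, not Shasta's actual API.

using System;

public enum Access { Read, ReadExclusive }

public static class Dsm
{
    // Hypothetical allocator: the caller overrides the automatic block-size
    // choice and names the home processor for the backing page. The runtime
    // would place the allocation on a virtual page registered with this
    // block size, so every node can infer the granularity from the address.
    public static IntPtr SharedMalloc(int bytes, int blockSize, int homeNode)
    {
        Console.WriteLine($"alloc {bytes} B, {blockSize} B blocks, home {homeNode}");
        return IntPtr.Zero; // placeholder for the real shared allocation
    }

    // Non-binding prefetch: purely a hint; the protocol may invalidate the
    // line again before the real access, so correctness is unaffected.
    public static void Prefetch(IntPtr addr, Access mode)
    {
        Console.WriteLine($"prefetch {mode} at {addr}");
    }
}

class GranularityExample
{
    static void Main()
    {
        // Coarsely shared 1 MB array: 1 KB blocks, homed on node 2.
        IntPtr a = Dsm.SharedMalloc(1 << 20, blockSize: 1024, homeNode: 2);

        // A point in the code known to incur many misses: prefetch ahead.
        Dsm.Prefetch(a, Access.ReadExclusive);
    }
}
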
This protocol exploits relaxed memory consistency models. It emulates the behavior of a processor with non-blocking loads and stores and a lockup-free cache. A non-blocking store is supported by issuing a read-exclusive or exclusive request and recording where the store occurred, which lets the operation continue; Shasta later merges the reply data with the newly written data that is already in memory. Non-blocking loads are implemented through the batching optimization. Further, non-blocking releases are supported by delaying a release operation on the side until all previous operations have completed, which allows the processor to continue with the operations following the release. Since loads and stores are non-blocking, a line may be in one of two pending states: pending-invalid and pending-shared. The pending-invalid state corresponds to an outstanding read or read-exclusive request on that line; the pending-shared state corresponds to an outstanding exclusive request.
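
The sketch below models this with the two pending states and the merge-on-reply step for a non-blocking store; the Line type and its methods are assumptions for illustration, not Shasta's implementation.

// Per-line protocol states, including the two pending states.
public enum LineState
{
    Invalid,
    Shared,
    Exclusive,
    PendingInvalid,  // outstanding read or read-exclusive on this line
    PendingShared    // outstanding exclusive (upgrade) request
}

public class Line
{
    public LineState State = LineState.Invalid;
    private readonly bool[] written = new bool[64]; // bytes stored while pending

    // Non-blocking store miss: record where the store occurred, issue a
    // read-exclusive request, and return so the processor keeps running.
    public void StoreMiss(int offset)
    {
        written[offset] = true;
        State = LineState.PendingInvalid;
        // ... send the read-exclusive request to the line's home node ...
    }

    // When the reply arrives, merge its data with the bytes the processor
    // has already written into memory, then mark the line exclusive.
    public void OnReadExclusiveReply(byte[] replyData, byte[] memory)
    {
        for (int i = 0; i < replyData.Length; i++)
            if (!written[i]) memory[i] = replyData[i]; // keep local writes
        State = LineState.Exclusive;
    }
}
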
Batching reduces communication by merging load and store misses to the same line and by issuing requests for multiple lines at the same time. Shasta handles misses for batched loads and stores by jumping to inline batch-miss code whenever any single line within a batch misses. The inline code calls a batch miss handler that issues all the necessary miss requests. Non-stalling stores are implemented by not requiring the handler to wait for invalidation acknowledgements. One caveat with batching is that although the batch miss handler brings in all the necessary lines, it cannot guarantee that they will still be in the appropriate state once all the replies have come back: while the handler is waiting for replies, requests from other processors are serviced and can change the state of the lines in the batch. This does not affect loads, which still get the correct value as long as the original contents of the line remain in main memory; hence the flag for invalidated lines is not set until the batch has ended. After the batch code has executed, the invalidation is completed at the next entry into the protocol, and at that time stores to lines that are no longer in the exclusive or pending-shared state are reissued.
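
Building on the Line and LineState sketch above, the following assumed-name sketch condenses the batch path: the handler issues all misses without stalling, and after the batch any store whose line is no longer exclusive or pending-shared is reissued.

using System.Collections.Generic;

public class BatchMissHandler
{
    // Issue every miss request for the batch without waiting for
    // invalidation acknowledgements (non-stalling stores).
    public void HandleBatch(List<Line> batchLines)
    {
        foreach (var line in batchLines)
            if (line.State == LineState.Invalid)
                line.StoreMiss(0); // issue the request; do not stall
        // ... wait for data replies only; requests from other processors
        // may be serviced meanwhile and downgrade lines in this batch ...
    }

    // At the next protocol entry after the batch ends: deferred
    // invalidations complete here, and stores to lines that are no longer
    // exclusive or pending-shared are reissued.
    public void AfterBatch(List<Line> storedLines)
    {
        foreach (var line in storedLines)
        {
            bool stillWritable = line.State == LineState.Exclusive
                              || line.State == LineState.PendingShared;
            if (!stillWritable)
                line.StoreMiss(0); // reissue the store's miss request
        }
    }
}
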
#codingexercise
double GetAlternateEvenNumberRangeMid(double[] A)
{
    // Guard against a null input, then defer to the extension method.
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeMid();
}
