We discuss next the performance results for the Shasta implementation. The cluster consisted of four AlphaServer 4100s connected by a Memory Channel network, a memory-mapped network that allows a process to transmit data to a remote process without any operating system overhead, via a simple store to a mapped page. The Shasta implementation included a message-passing layer that ran efficiently on top of the Memory Channel. By using separate message buffers between each pair of processors, locking is eliminated when adding or removing messages from the buffers. When the communicating processors are on the same node, messages are instead passed through shared memory segments.
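As a rough illustration of why per-pair buffers remove the need for locking (this is only a sketch, not Shasta's actual data structures; all names below are made up), each buffer can be treated as a single-producer, single-consumer ring:

#include <string.h>

/* Hypothetical per-(sender, receiver) ring buffer. With exactly one
 * producer and one consumer, head is advanced only by the sender and
 * tail only by the receiver, so no lock is needed. A real version would
 * also need memory barriers between the copy and the index update. */
#define RING_SLOTS 256
#define MSG_BYTES  128

typedef struct {
    volatile unsigned head;                 /* advanced only by the sender   */
    volatile unsigned tail;                 /* advanced only by the receiver */
    char slots[RING_SLOTS][MSG_BYTES];      /* region mapped over Memory Channel */
} ring_t;

int ring_send(ring_t *r, const char msg[MSG_BYTES]) {
    unsigned h = r->head;
    if (h - r->tail == RING_SLOTS) return 0;          /* ring full */
    memcpy(r->slots[h % RING_SLOTS], msg, MSG_BYTES);
    r->head = h + 1;                                  /* publish the slot */
    return 1;
}

int ring_recv(ring_t *r, char out[MSG_BYTES]) {
    unsigned t = r->tail;
    if (t == r->head) return 0;                       /* ring empty */
    memcpy(out, r->slots[t % RING_SLOTS], MSG_BYTES);
    r->tail = t + 1;                                  /* free the slot */
    return 1;
}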
Nine applications were run for at least a few seconds on the cluster. The checking overheads, measured relative to the sequential running times, ranged from 10% to 25%. In parallel execution this overhead is relatively less significant, since communication and synchronization overheads also come into play.
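The checking overhead comes from the state check inserted inline before loads and stores of potentially shared data. A minimal sketch of the idea follows (the real checks are hand-scheduled Alpha instruction sequences; the table layout and the shasta_read_miss name are assumptions here):

/* Sketch of an inline miss check on a shared load. Each 64-byte line
 * has an entry in a state table; if the line is not locally valid,
 * control transfers to the protocol before the access proceeds. */
enum line_state { STATE_INVALID, STATE_SHARED, STATE_EXCLUSIVE };

extern unsigned char state_table[];          /* one entry per coherence line      */
extern unsigned long shared_base;            /* start of the shared address range */
extern void shasta_read_miss(void *addr);    /* assumed protocol entry point      */

#define LINE_SHIFT 6                         /* 64-byte lines */

double checked_load(double *addr) {
    unsigned long line = ((unsigned long)addr - shared_base) >> LINE_SHIFT;
    if (state_table[line] == STATE_INVALID)
        shasta_read_miss(addr);              /* fetch the line, update state_table */
    return *addr;                            /* the original load */
}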
The baseline for parallel execution was measured with a fixed 64-byte line size, the home placement optimization, all optimizations related to exploiting release consistency, and the optimization that avoids sharing writebacks. All applications achieve higher speedups with more processors.
To study the effects of variable coherence granularity, five of the applications were run with a granularity larger than 64 bytes. Variable granularity improved performance by transferring data in larger units and reducing the number of misses on the main data structures.
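As a sketch of what variable granularity means in practice (the mechanism and names below are assumptions for illustration, not Shasta's actual interface), the block size can be chosen per allocated data structure and consulted on a miss:

#include <stddef.h>

/* Hypothetical table associating a coherence block size with each
 * allocated range, so large data structures move in large units. */
typedef struct {
    char   *start;
    size_t  length;
    size_t  block_size;     /* e.g. 256 or 1024 bytes instead of 64 */
} range_info;

#define MAX_RANGES 64
static range_info ranges[MAX_RANGES];
static int nranges;

/* Record the block size chosen for a data structure at allocation time. */
void register_range(void *start, size_t length, size_t block_size) {
    if (nranges == MAX_RANGES) return;   /* table full; ignored in this sketch */
    ranges[nranges].start = (char *)start;
    ranges[nranges].length = length;
    ranges[nranges].block_size = block_size;
    nranges++;
}

/* On a miss, fetch the whole block covering addr rather than 64 bytes. */
size_t block_size_for(void *addr) {
    for (int i = 0; i < nranges; i++)
        if ((char *)addr >= ranges[i].start &&
            (char *)addr <  ranges[i].start + ranges[i].length)
            return ranges[i].block_size;
    return 64;                           /* default fixed line size */
}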
When some of the applications were run with slightly larger problem sizes, the miss check overheads were unchanged but the speedups improved significantly.
The base set of runs exploited all of the optimizations related to batching and release consistency. When compared with sequential consistency, the gain in performance was as much as 10%, with the gains more noticeable with fewer processors. Adding the non-blocking release optimization does not visibly improve performance beyond the base set of runs, and in some cases leads to slightly lower performance.
The execution time was also broken down and studied. Task time represented the time spent executing the application, including hardware cache misses; it also included the time for executing the inline miss checks and the code necessary to enter the protocol. Read time and write time represented the stall time for read and write misses. Synchronization time represented the stall time for application locks and barriers. Message time represented the time spent handling messages.
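The categories can be pictured as a set of accumulators; the sketch below simply restates the definitions above (the instrumentation itself, including the now() call, is assumed for illustration):

/* Buckets for the execution-time breakdown described above. */
typedef struct {
    double task;      /* application code, incl. hardware cache misses,
                         inline miss checks, and protocol-entry code     */
    double read;      /* stall time on read misses                       */
    double write;     /* stall time on write misses                      */
    double sync;      /* stall time on application locks and barriers    */
    double message;   /* time spent handling protocol messages           */
} time_breakdown;

/* Example of attributing a stall, e.g. around a read miss:
 *     double t0 = now();
 *     ... wait for the line to arrive ...
 *     bd->read += now() - t0;
 */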
The runs indicated a significant reduction, and sometimes the elimination, of write stall time. Read, synchronization, and message-handling overheads increased, but this had little or no effect on performance.
For the subset of applications that exploit variable block sizes, the breakdowns indicated higher efficiency than the runs with a 64-byte block size, and the trends were similar to those described above.
To further isolate the effects of the various optimizations used in the base set of runs, the experiments were repeated while allowing the overlap of multiple misses within batches, relative to sequential consistency. This showed virtually no improvement in performance. A second set of experiments added the eager exclusive reply optimization, where the reply data for a read-exclusive request is used by the requesting processor before all invalidations are acknowledged. In this case too, performance did not improve. Therefore, much of the performance difference between the sequential consistency and base runs can be attributed to the non-blocking store optimization.
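The non-blocking store optimization referenced here lets the processor continue past a write miss and defers any waiting to a later release or batch point. A rough sketch of the control flow, with assumed names and without the detail of merging the stored data once the line arrives:

extern void send_read_exclusive(void *addr);   /* assumed: request ownership of the line */
extern void poll_protocol_messages(void);      /* assumed: drains replies and acks,
                                                  decrementing pending_requests */
static int pending_requests;                   /* ownership requests still outstanding */

/* On a write miss, issue the request and keep executing. */
void write_miss(void *addr) {
    send_read_exclusive(addr);
    pending_requests++;
    /* execution continues; the store completes when the reply arrives */
}

/* Under release consistency, outstanding stores only need to complete
 * before a release (or where a batch requires it). */
void release_point(void) {
    while (pending_requests > 0)
        poll_protocol_messages();
}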
#codingexercise
Double GetAlternateEvenNumberRangeMin(Double[] A)
{
    if (A == null) return 0;
    // AlternateEvenNumberRangeMin is an extension method defined elsewhere.
    return A.AlternateEvenNumberRangeMin();
}