Wednesday, January 7, 2015

Today we review the effects of exploiting release consistency in the Shasta protocol we have been discussing. The change in execution time for eight- and sixteen-processor runs with a 64-byte block size was studied at different levels of optimization. For each application, the problem size was fixed beforehand, and the corresponding execution time was used to normalize the other times.

The base set of runs exploited all of the optimizations related to batching and release consistency, except that release operations were blocking. The batching optimizations included multiple outstanding misses and the merging of loads and stores to the same lines within a batch. The release consistency optimizations included non-blocking stores, eager exclusive replies, and the lockup-free optimization. These execution times were compared with a conservative implementation of sequential consistency and with a run that added non-blocking releases to the optimizations in the base set. The base set of runs improved on conservative sequential consistency by nearly 10%, but the addition of non-blocking releases yielded little or no further improvement.

The execution time was also broken down into task time, read time, write time, synchronization time, and message time. Task time includes the time for executing inline miss checks and the code necessary to enter the protocol. Read time and write time represent the stall time for read and write misses that are satisfied by other processors through the software protocol; a store can also stall if there are non-contiguous stores to a pending line. Synchronization time represents the stall time for application locks and barriers, including both acquire and release times. Message time represents the time spent handling messages when the processor is not already stalled; message handling that occurs while the processor is stalled on data or synchronization is hidden within the read, write, and synchronization times. Everything else was counted as miscellaneous.

The breakdowns indicate that the optimizations used in the base set of runs are effective in significantly reducing, and sometimes eliminating, the write stall time. Compared with the sequential consistency runs, however, the reduction in write stall time is accompanied by an increase in the other overhead categories. Even though the processors do not directly stall on stores, the pending store requests still require servicing, which increases the read, synchronization, message, and other overheads, so the net effect is little or no overall improvement. The time spent by the processors handling incoming protocol messages was also collected. The contribution of message handling to the total execution time is less than 10%, yet a large portion of messages are handled while a processor is waiting for its own data and synchronization requests, and the applications spent an average of 20-35% of their time in such waiting. The processors are therefore heavily utilized while they wait for their own requests to complete, which also implies that hiding the stall time for some operations tends to increase the stall time for others.
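To make the task-time category concrete, here is a minimal C# sketch of the kind of inline miss check a Shasta-like software DSM inserts before shared loads and stores. The class, the per-line state table layout, and the FetchLine/UpgradeLine placeholders are illustrative assumptions rather than the actual Shasta code; the point is that the hit path is a short table lookup charged to task time, while the miss path enters the software protocol where the read and write stall times accrue.

enum LineState { Invalid, Shared, Exclusive }

class SoftwareDsmSketch
{
    const int BlockSize = 64;                  // 64-byte coherence lines, as in the runs above
    private readonly LineState[] lineState;    // one state entry per line (hypothetical layout)
    private readonly double[] data;            // backing storage for shared doubles

    public SoftwareDsmSketch(int words)
    {
        data = new double[words];
        lineState = new LineState[(words * sizeof(double) + BlockSize - 1) / BlockSize];
    }

    private int LineOf(int index) { return (index * sizeof(double)) / BlockSize; }

    public double Load(int index)
    {
        // Inline miss check: a few extra instructions on every shared access (task time).
        if (lineState[LineOf(index)] == LineState.Invalid)
            FetchLine(LineOf(index));          // enter the protocol; read stall time accrues here
        return data[index];
    }

    public void Store(int index, double value)
    {
        // With non-blocking stores the processor would continue past an outstanding
        // upgrade; this sketch simply blocks to keep the control flow obvious.
        if (lineState[LineOf(index)] != LineState.Exclusive)
            UpgradeLine(LineOf(index));        // write stall time accrues here
        data[index] = value;
    }

    // Placeholders for the message exchange the real protocol performs.
    private void FetchLine(int line) { lineState[line] = LineState.Shared; }
    private void UpgradeLine(int line) { lineState[line] = LineState.Exclusive; }
}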
#codingexercise
Double GetAlternateEvenNumberRangeMax(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeMax();
}
#codingexercise
Double GetAlternateEvenNumberRangeSum(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeSum();
}
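The two exercises above rely on AlternateEvenNumberRangeMax and AlternateEvenNumberRangeSum extension methods that are not defined here. One possible reading, purely as an assumption, is that they walk the even-valued elements of the array in order, take every alternate one, and return the max or the sum respectively; a sketch under that assumption:

using System;
using System.Collections.Generic;
using System.Linq;

static class AlternateEvenExtensions
{
    // Assumed semantics: among the even-valued elements in order, keep every other one.
    private static IEnumerable<Double> AlternateEvens(this Double[] A)
    {
        bool take = true;
        foreach (var x in A)
        {
            if (x % 2 != 0) continue;   // skip odd values
            if (take) yield return x;   // keep every alternate even value
            take = !take;
        }
    }

    public static Double AlternateEvenNumberRangeMax(this Double[] A)
    {
        var evens = A.AlternateEvens().ToList();
        return evens.Count == 0 ? 0 : evens.Max();
    }

    public static Double AlternateEvenNumberRangeSum(this Double[] A)
    {
        return A.AlternateEvens().Sum();
    }
}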
