Friday, January 2, 2015

Today we will continue to discuss the WRL research report on the Shasta shared memory protocol. We introduced the system in the last post, and in this post we continue to look at some of the checks that Shasta instruments. First, let us look at a basic shared miss check. This code first checks whether the target address falls in the shared memory range and, if not, skips the remainder of the check. Otherwise, the code computes the address of the state table entry corresponding to the target address and checks that the line containing the target address is in the exclusive state; a rough sketch of this basic check appears below. While Shasta optimizes these checks, the cost of the miss checks is still high, so some more advanced optimizations are applied. These are described after the sketch.
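As a rough illustration of the structure of such a check, here is a minimal C# sketch; the names (SharedHeapStart, LineShift, stateTable, HandleSharedMiss) and the state-table layout are assumptions made for illustration only, not the actual Shasta instrumentation, which is inserted as Alpha machine code into the executable.

enum LineState : byte { Invalid, Shared, Exclusive }

class MissCheckSketch
{
    const int  LineShift = 7;                 // 128-byte lines; the line size is configurable at compile time
    const int  LineCount = 1 << 20;           // number of lines covered (sized arbitrarily for illustration)
    const long SharedHeapStart = 0x200000000; // hypothetical base of the shared region
    const long SharedHeapEnd = SharedHeapStart + ((long)LineCount << LineShift);

    // Per-line state table.
    static LineState[] stateTable = new LineState[LineCount];

    // Inline check executed before a store to 'address'.
    static void StoreMissCheck(long address)
    {
        // 1. Range check: private data skips the remainder of the check.
        if (address < SharedHeapStart || address >= SharedHeapEnd)
            return;

        // 2. Compute the state table entry for the line containing the address.
        long lineIndex = (address - SharedHeapStart) >> LineShift;

        // 3. A store requires the line to be in the exclusive state; anything else
        //    is a shared miss that invokes the protocol.
        if (stateTable[lineIndex] != LineState.Exclusive)
            HandleSharedMiss(address);
    }

    static void HandleSharedMiss(long address)
    {
        // Protocol code fetches the line or upgrades its state here.
    }
}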
Invalid flag technique: Whenever a line on a processor becomes invalid, the Shasta protocol stores a particular flag value in each longword of the line. The miss check code for a load can then simply compare the value just loaded with the flag value. If the loaded value does not equal the flag value, the data must be valid and the application code can continue immediately. If the loaded value equals the flag value, a miss routine is called that first does the normal range check and state table lookup. The state check distinguishes an actual miss from a "false miss" and simply returns to the application code in the case of a false miss. A false miss occurs when the application data happens to contain the flag value. Another advantage of this technique is that the load of the state table entry is eliminated in the common case.
Batching miss checks: The second optimization is to batch the miss checks for multiple loads and stores, which reduces the overhead significantly. If the accesses use the same base register and the offsets are all less than or equal to the Shasta line size, then the sequence of loads and stores collectively touches at most two consecutive lines in memory. Therefore, if inline checks verify that these two lines are in the correct state, all of the loads and stores can proceed without further checks. It is convenient to check both lines by just checking the beginning and ending addresses of the range. This batching technique also applies to loads and stores off multiple base registers, as long as the range of addresses touched can be determined.
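The following C# sketch illustrates both of these optimizations under the same assumptions as the sketch above; the flag value and the helper names (LoadMissRoutine, CheckLineState) are hypothetical.

class FlagAndBatchSketch
{
    // Hypothetical flag value stored in every longword of an invalid line.
    const long InvalidFlag = unchecked((long)0xDEADBEEFDEADBEEF);

    // Load miss check with the invalid-flag technique: the common case is a single
    // compare against the value that was just loaded, with no state table lookup.
    static long CheckedLoad(long[] memory, int index)
    {
        long value = memory[index];
        if (value != InvalidFlag)
            return value;                      // data must be valid; continue immediately

        // Possible miss: fall back to the normal range check and state table lookup,
        // which distinguishes a real miss from a "false miss" where the application
        // data happens to equal the flag value.
        return LoadMissRoutine(memory, index);
    }

    static long LoadMissRoutine(long[] memory, int index)
    {
        // Full check and, if needed, protocol processing would go here.
        return memory[index];
    }

    // Batching: for several accesses off the same base with offsets no larger than the
    // line size, checking the first and last address covers at most two consecutive
    // lines, so the per-access checks can be dropped.
    static void BatchedCheck(long baseAddress, int minOffset, int maxOffset)
    {
        CheckLineState(baseAddress + minOffset);
        CheckLineState(baseAddress + maxOffset);
        // ... the batched loads and stores proceed without further checks ...
    }

    static void CheckLineState(long address)
    {
        // Range check and state table lookup as in the basic miss check sketch.
    }
}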
Today we continue our discussion on the WRL research report on the performance of the Shasta shared memory protocol.

Thursday, January 1, 2015

#codingexercise
Double GetAlternateNumberRangeMode(Double[] A)
{
    if (A == null) return 0;
    // AlternateNumberRangeMode is an extension method assumed to be defined in an earlier post.
    return A.AlternateNumberRangeMode();
}
Today we discuss the WRL research report on the performance of the Shasta distributed shared memory protocol. Shasta is a software system that supports a shared address space across a cluster of computers with physically distributed memory. Shasta can keep data coherent at a fine granularity. It implements this coherence by inlining code that checks the cache state of shared data before each load or store. In addition, Shasta allows the coherence granularity to be varied across different shared data structures in the same application. This is helpful when compared to similar-purpose systems that suffer inefficiencies due to a fixed, large page-size granularity. This paper talks about the cache coherence protocol in Shasta. The protocol borrows a number of optimizations from hardware systems, and since it is written in software, it provides tremendous flexibility in the design of the protocol. The protocol runs on Alpha systems connected via Digital's Memory Channel network, and the performance of Shasta was studied with the different optimizations enabled.
Since Shasta supports the shared address space entirely in software and at a fine granularity of coherence, it reduces false sharing and the transmission of unneeded data, both of which are potential problems in systems with large coherence granularities. Code is inserted into the application executable before loads and stores to check whether the data being accessed is available locally in the appropriate state. These checks can be further minimized with appropriate optimizations, making Shasta a viable replacement for existing systems.
The Shasta protocol provides a number of mechanisms for dealing with the long communication latencies in a workstation cluster. Since it supports variable coherence granularities, it can exploit any potential gains from larger communication granularities for specific shared data. As for concerns over the overhead of software-based messages, Shasta minimizes extraneous coherence messages and uses fewer messages to satisfy shared memory operations than the protocols commonly used in hardware systems.
In fact, the optimizations that attempt to hide memory latency by exploiting a relaxed memory consistency model lead to much more limited gains. This is because the time a processor spends waiting for data or synchronization is already overlapped with the handling of incoming coherence messages from other processors, so it is difficult to improve performance further by reducing that wait. Finally, optimizations related to migratory data are not useful in Shasta because migratory sharing patterns are unstable or even absent at block sizes of 64 bytes or higher.
Shasta divides the virtual address space of each processor into private and shared regions. Data in the shared region may be cached by multiple processors at the same time, with copies residing at the same virtual address on each processor.
#codingexercise
Double GetAlternateNumberRangeCount(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeCount();
}

Let us discuss some more about this WRL research report. Hardware cache-coherent multiprocessors and Shasta both define the same basic states:
invalid - the data is not valid on this processor.
shared - the data is valid on this processor, and other processors have copies of the data as well.
exclusive - the data is valid on this processor, and no other processors have copies of this data.
Shasta inserts checks in the code prior to each load and store to guard against accesses to data in the invalid state. When a processor attempts to write data that is in the invalid or shared state, we say there is a shared miss; the checks inserted by Shasta also detect this shared miss.
As in hardware shared-memory systems, Shasta divides the shared address space into ranges of memory called blocks. All data within a block is in the same state and is fetched and kept coherent as a unit. Shasta allows the block size to be different for different ranges of the shared address space. Further, to simplify the instrumentation, Shasta divides the address space into fixed-size ranges called lines and maintains state information for each line in a state table. The line size is configurable at compile time, and the block size is taken to be a multiple of the fixed line size.
Coherence is maintained using a directory-based invalidation protocol. It supports three types of requests: read, read-exclusive, and exclusive (or upgrade). Supporting exclusive requests is an important optimization since it reduces message latency and overhead when the requesting processor already has the line in the shared state. Shasta also supports three types of synchronization primitives in the protocol: locks, barriers, and event flags. These primitives are sufficient for the applications run on the system.
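As a small illustration of the upgrade optimization, here is a hedged C# sketch of how the request type might be chosen on a store miss; the names are hypothetical and the real protocol engine is considerably more involved.

enum LineState { Invalid, Shared, Exclusive }
enum RequestType { Read, ReadExclusive, Exclusive /* upgrade: no data needs to be returned */ }

class RequestSketch
{
    // On a store miss, a line already held in the shared state only needs the other
    // copies invalidated, so a lighter-weight exclusive (upgrade) request is sent
    // instead of a read-exclusive request that would also carry the data.
    static RequestType RequestForStoreMiss(LineState current)
    {
        return current == LineState.Shared ? RequestType.Exclusive : RequestType.ReadExclusive;
    }
}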
Each virtual page of data is associated with a home processor, and each processor maintains the directory information for the shared data whose home it is. Each line is assigned an owner processor, which is the last processor that held an exclusive copy of the line. The directory information comprises a pointer to the current owner processor and a full bit vector of the processors that are sharing the data. Data can be shared without requiring the home node to have an up-to-date copy; this is called dirty sharing. A request arriving at the home is forwarded to the owner unless the home processor itself has a copy.
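A minimal sketch of such a directory entry and the forwarding decision at the home is shown below; the structure and field names are assumptions for illustration rather than Shasta's actual layout.

using System.Collections;

class DirectorySketch
{
    // Hypothetical directory entry kept at the home processor for each line.
    class DirectoryEntry
    {
        public int Owner;               // last processor to hold an exclusive copy
        public BitArray Sharers;        // full bit vector of processors sharing the line
        public bool HomeHasCopy;        // false under dirty sharing: the home's copy may be stale

        public DirectoryEntry(int processorCount) { Sharers = new BitArray(processorCount); }
    }

    // A request arriving at the home is forwarded to the owner unless the home
    // processor itself has a copy of the line.
    static void HandleRequest(DirectoryEntry entry, int requester)
    {
        if (entry.HomeHasCopy)
            ReplyFromHome(requester);
        else
            ForwardToOwner(entry.Owner, requester);
    }

    static void ReplyFromHome(int requester) { /* send the line from the home's copy */ }
    static void ForwardToOwner(int owner, int requester) { /* the owner replies with the line */ }
}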
Message passing is costly when done via interrupts, hence messages are serviced through a polling mechanism. Polling is cheaper because there is a single cacheable location that can be tested to determine whether a message has arrived. Polls are inserted at every loop back-edge to ensure reasonable response times, and the protocol also polls whenever it waits for a reply. Further, because messages are serviced only at polls, no message is handled between a shared miss check and the load or store being checked, which simplifies the inline miss checks.
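A rough C# sketch of the polling idea follows; the single messagePending flag stands in for the cacheable location provided by the network interface, and the handler names are hypothetical.

class PollingSketch
{
    // Single cacheable flag set when a message arrives; testing it is cheap.
    static volatile bool messagePending;

    static void Poll()
    {
        if (messagePending)
            ServiceIncomingMessages();   // dispatch to protocol handlers only when needed
    }

    static void ComputeLoop(int n)
    {
        for (int i = 0; i < n; i++)
        {
            // ... application work ...
            Poll();   // inserted at every loop back-edge to keep response times reasonable
        }
    }

    static void ServiceIncomingMessages()
    {
        messagePending = false;
        // ... run the protocol handlers for the queued messages ...
    }
}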
#codingexercise
Double GetAlternateNumberRangeStdDev(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeStdDev();
}

Wednesday, December 31, 2014


Schedulers:
Scheduling refers to the methods by which threads, processes, or data flows are given access to system resources. The need for scheduling usually arises due to multitasking or multiplexing; the former refers to the execution of more than one process, while the latter refers to transmitting multiple data streams across a single physical channel. The goal of the scheduler can be one of the following:
1)      To increase throughput – such as the total number of processes that complete execution per unit time
2)      To decrease latency – such that the total time between the submission of a process and its completion (turnaround time) is reduced, or the time between submitting a job and its first response (response time) is reduced
3)      To increase fairness – where the CPU time allotted to each process is fair or weighted based on priority (the process priority and workload are taken into account)
4)      To decrease waiting time – the time the process spends waiting in the ready queue
As we can see, the goals can conflict, such as throughput versus latency. Thus schedulers can be instantiated for different goals, and the preference given to each goal depends upon the requirements.
Schedulers may have additional requirements such as to ensure that scheduled processes can meet deadlines so that the system is stable. This is true for embedded systems or real-time environments.
The notion of a scheduler brings up a data structure of one or more queues. In fact, the simplest scheduler is a single queue. A ready queue is one from which jobs are retrieved for execution. Queues can also be prioritized.
Schedulers may perform their actions once, occasionally, or at regular intervals, and they are classified accordingly as long-term, mid-term, or short-term schedulers. The long-term or admission scheduler is generally used in batch processing systems; it admits or delays processes and thereby determines the degree of concurrency. Usually processes are distinguished as CPU-bound or I/O-bound. The mid-term scheduler swaps jobs in and out of memory so that memory is freed up and system resources are available for the processes that need to execute. The short-term scheduler decides which of the ready jobs executes next, at regular intervals called quanta.
The scheduling algorithm can be one of the following:
FIFO – First Come First Served is a simple queue.
Shortest remaining time – also known as Shortest Job First, the scheduler tries to arrange the jobs in the order of the least estimated processing time remaining.
Fixed priority pre-emptive scheduling – A rank is assigned to each job and the scheduler arranges the job in the ready queue based on this rank. Lower priority processes can get interrupted by incoming higher priority processes.
Round-Robin scheduling – The scheduler assigns a fixed time unit per process and cycles through them. This way starvation is avoided (a minimal sketch follows this list).
Multilevel queue scheduling – is used when processes can be divided into groups.
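Here is a minimal, self-contained C# sketch of the round-robin policy from the list above; the Job type and the unit-of-work accounting are simplifications for illustration.

using System;
using System.Collections.Generic;

class RoundRobinSketch
{
    class Job
    {
        public string Name;
        public int RemainingTime;   // units of work still needed
    }

    // Each job runs for at most one quantum, then goes to the back of the ready
    // queue until its remaining work is exhausted, so no job can starve.
    static void Run(Queue<Job> readyQueue, int quantum)
    {
        while (readyQueue.Count > 0)
        {
            Job job = readyQueue.Dequeue();
            int slice = Math.Min(quantum, job.RemainingTime);
            job.RemainingTime -= slice;                     // "execute" for one slice
            Console.WriteLine($"{job.Name} ran for {slice} units");

            if (job.RemainingTime > 0)
                readyQueue.Enqueue(job);                    // re-queue unfinished jobs
        }
    }

    static void Main()
    {
        var queue = new Queue<Job>();
        queue.Enqueue(new Job { Name = "A", RemainingTime = 5 });
        queue.Enqueue(new Job { Name = "B", RemainingTime = 3 });
        Run(queue, quantum: 2);
    }
}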

#codingexercise
Double GetAlternateNumberRangeMean(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeMean();
}

Tuesday, December 30, 2014

Today we conclude our discussion on shared memory consistency models. We were discussing the programmer-centric approach, using an example where memory operations are distinguished as synchronization operations or not. At the language level, if there is support for parallelism, we can write doall loops or explicit synchronization constructs. Correct use of a doall loop implies that no two parallel iterations of the loop access the same location if at least one of the accesses is a write. A library of common routines could also be provided so that the programmer does not have to specify the hardware-level directives. Finally, a programmer may directly specify the synchronization operations. One way to do this is to associate the information with static instructions at the program level. Associating the information with a specific memory instruction can be done in one of two ways: first, by providing multiple flavors of memory instructions through extra opcodes, and second, by using the high-order bits of the virtual memory address. Commercial systems may instead choose to transform this information into explicit fence instructions supported at the hardware level, such as a memory barrier. Thus we see that the relaxed memory models provide better performance than is possible with sequential consistency and in fact enable many of the compiler optimizations.
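As an analogy in C# rather than at the hardware level, Thread.MemoryBarrier() acts as a full fence; the sketch below shows how explicit fences can impose the orderings needed in a simple producer-consumer fragment (the field names are illustrative).

using System.Threading;

class FenceSketch
{
    static int data;
    static int flag;

    static void Producer()
    {
        data = 42;
        Thread.MemoryBarrier();     // fence: the write to data cannot move after the write to flag
        flag = 1;
    }

    static int Consumer()
    {
        while (flag == 0)
            Thread.MemoryBarrier(); // fence inside the loop keeps flag from being cached in a register
        Thread.MemoryBarrier();     // fence: the read of data cannot move before the read of flag
        return data;
    }
}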

#codingexercise
Double GetAlternateNumberRangeMin(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeMin();
}

#codingexercise
Double GetAlternateNumberRangeMax(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeMax();
}

Monday, December 29, 2014

Today we continue our discussion on shared memory consistency models. We review the programmability of relaxed memory models. A key goal of the programmer-centric approach is to define the operations that should be distinguished as synchronization. In other words, a user's program consists of operations that are labeled as synchronization operations, with the rest categorized as data operations, in an otherwise sequentially consistent program. Operations should be labeled as synchronization operations when they can potentially be involved in a race: two operations access the same location, at least one of them is a write, and there are no intervening operations between them. With this separation of synchronization versus data operations, the system can be conservative with one and aggressive with optimizations on the other. There are two caveats with this approach. One is that the programmer should be allowed to mark an operation as synchronization when in doubt; this helps ensure correctness by being conservative and also enables incremental tuning for performance. The other caveat is that the programmer-supplied information could be incorrect, but we do not have to do anything special to handle that here.
Many languages specify high-level paradigms for parallel tasks and synchronization, and restrict programmers to using these paradigms. The language may provide a library of common synchronization routines for the programmer, which in turn can be conveyed to the compiler and hardware. In addition, the programmer may also be allowed to directly label memory operations. For example, this information can be associated with static instructions at the program level, which implicitly marks all dynamic operations generated from that region as synchronization. Another option is to assign a synchronization attribute to a shared variable or memory address, such as volatile. While the language constructs play a role in how easy they are for the programmer to use, making synchronization the default label can potentially decrease errors by requiring programmers to explicitly declare the more aggressive data operations.
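For example, in C# the volatile keyword plays the role of such a synchronization attribute on a shared variable; in the sketch below the ready flag is labeled as a synchronization variable while payload remains an ordinary data operation (the names are illustrative).

using System.Threading;

class LabelSketch
{
    static int payload;              // ordinary data operation: free to be optimized aggressively
    static volatile bool ready;      // labeled as a synchronization variable

    static void Writer()
    {
        payload = 123;               // data write
        ready = true;                // synchronization write: ordering is enforced here
    }

    static int Reader()
    {
        while (!ready)               // synchronization read
            Thread.Yield();
        return payload;              // the data write above is guaranteed to be visible
    }
}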
#codingexercise
Double GetAlternateNumberRangeSum(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeSum();
}

Friday, December 26, 2014

Today we continue our discussion on the WRL report on shared memory consistency models. We were reviewing models that relax all program orders. We were discussing the weak ordering model and two flavors of the release consistency model - the sequential consistency and processor consistency flavors. All of the models in this group allow a processor to read its own write early. However, the two release consistency flavors are the only ones whose straightforward implementations allow a read to return the value of another processor's write early. These models distinguish memory operations based on their type and provide stricter ordering constraints for some types of operations.
The weak ordering model classifies memory operations into two categories: data operations and synchronization operations. Since the programmer is required to identify the synchronization operations, the model can reorder memory operations between consecutive synchronization operations without affecting program correctness. Compared to weak ordering, the release consistency flavors provide further distinctions among memory operations. Operations are first distinguished as special or ordinary. Special operations are further distinguished as sync or nsync operations, and sync operations are further distinguished as acquire or release operations. The two flavors of release consistency differ in the program orders they maintain among special operations: the first flavor maintains sequential consistency among special operations, while the second maintains processor consistency among them. The first flavor requires that an acquire precede all subsequent operations, that all operations precede a release, and that special operations execute in program order with respect to one another. The second flavor enforces almost the same constraints, with an exception for a special write followed by a special read. Imposing program order from a write to a read operation requires using read-modify-write operations, much the same way as we have seen earlier.
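The acquire/release distinction can be approximated in C# with Interlocked and Volatile calls; the sketch below is an analogy to the release consistency idea rather than a definition of the model, and the names are illustrative.

using System.Threading;

class AcquireReleaseSketch
{
    static int sharedData;
    static int lockFlag;   // 0 = free, 1 = held

    // Acquire: the special read that gains access must complete before any
    // operation in the critical section.
    static void Acquire()
    {
        while (Interlocked.CompareExchange(ref lockFlag, 1, 0) != 0)
        {
            // spin until the lock is free
        }
    }

    // Release: every operation in the critical section must complete before the
    // special write that passes access to other processors.
    static void Release()
    {
        Volatile.Write(ref lockFlag, 0);
    }

    static void Increment()
    {
        Acquire();
        sharedData++;      // ordinary operations bracketed by acquire and release
        Release();
    }
}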
The other models that relax all program orders - Alpha, RMO, and PowerPC - all provide explicit fence instructions as their safety nets.
The Alpha model provides two different fence instructions: the memory barrier (MB) and the write memory barrier (WMB). The memory barrier instruction can be used to maintain program order from any memory operation before the MB to any memory operation after the MB. The write memory barrier provides this guarantee only among write operations.
The RMO model has more flavors of fence instructions. A fence can be customized to order any combination of previous read and write operations with respect to future read and write operations, using a four-bit encoding. Since a fence can be used to order a write with respect to a following read, there is no need for read-modify-write semantics.
The PowerPC model provides a single fence instruction: the SYNC instruction. It is similar to the memory barrier instruction, with one exception: even with a SYNC between two reads to the same location, the second read may return the value of an older write than the first read. The model therefore requires read-modify-write semantics to enforce program order between such reads.
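The read-modify-write safety net mentioned above can be illustrated with a C# analogy: .NET's Interlocked operations act as full fences, so substituting one for a plain store or load restores the required order (here the write-to-read case discussed earlier); the flag names are hypothetical.

using System.Threading;

class RmwSketch
{
    static int flagA;
    static int flagB;

    // Dekker-style fragment: a relaxed model may not keep the store to flagA ordered
    // before the load of flagB. Replacing the plain store with an atomic
    // read-modify-write (Interlocked.Exchange, a full fence in .NET) restores the order.
    static bool TryEnter()
    {
        Interlocked.Exchange(ref flagA, 1);     // read-modify-write instead of a plain store
        return Volatile.Read(ref flagB) == 0;   // this load cannot move before the exchange
    }
}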
#codingexercise
Decimal GetAlternateNumberRangeSum(Decimal[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeSum();
}

#codingexercise
Double GetAlternateNumberRangeMean(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeMean();
}
We now look at an alternate abstraction for relaxed memory models. The models mentioned so far have been put to use on a wide variety of systems; however, they demand a higher level of complexity from programmers. This comes from the system-centric commands that the programmer has to use. Further, they directly expose the programmer to the reordering and atomicity optimizations that are allowed by a model. The programmer is expected to use safety nets (e.g., fence instructions, more conservative operation types, or read-modify-write operations) to impose the desired ordering and atomicity requirements on memory operations, but it is not easy to identify the necessary ordering constraints. For example, weak ordering requires that programmers identify all synchronization operations. To define a programmer-centric specification, we must first define the notion of correctness for programs; here we can use sequential consistency. Second, the information required from the programmer must be defined precisely. The information used in the weak ordering model or the release consistency model could be candidates for this information, although it would be described as program-level information rather than in terms of system optimizations.