Cluster computing

Monday, December 1, 2014

Today we discuss another WRL report, titled Long Address Traces from RISC Machines: Generation and Analysis, by Borg, Kessler, Lazana and Wall. It looks at cache design at a time when processor speeds were climbing rapidly. The authors observe that if processor speed increases by a factor of ten while the memory system stays the same, overall execution speed improves by only a factor of two. The difference is attributed mainly to the time spent servicing cache misses.
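To see how a tenfold faster processor can yield only a twofold gain, here is a rough sketch in Python. It assumes a simple additive model in which run time is compute time plus memory stall time and only the compute part speeds up. The 5:4 split between the two is a made-up number chosen to reproduce the factor of two, not a figure from the report.

```python
def overall_speedup(cpu_time, memory_time, cpu_factor):
    """Speedup when only the CPU portion gets faster by cpu_factor."""
    before = cpu_time + memory_time
    after = cpu_time / cpu_factor + memory_time
    return before / after

# Hypothetical split: 5 units of compute for every 4 units of memory stall.
speedup = overall_speedup(cpu_time=5.0, memory_time=4.0, cpu_factor=10.0)
print(speedup)  # 2.0
```

Under this model, memory stalls only need to account for a little under half of the original run time for the tenfold processor improvement to shrink to a factor of two.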
To analyze caches, they used trace-driven simulation, which replays the memory addresses referenced by a running system. The accuracy of the results depends on both the simulation model and the trace data. Earlier tracing techniques suffered from one or more limitations: complexity, inaccuracy, lack of system references, short trace length, inflexibility, or applicability only to CISC machines. The authors built a system that overcomes these limitations by generating traces through link-time code modification. This makes new traces easier to generate, and the system is flexible enough to control what is traced and when. Large, fast systems require very long traces, and this system places no limit on trace length.
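As an illustration of what trace-driven cache simulation means in practice, here is a minimal sketch of a direct-mapped cache fed by a list of addresses. The cache geometry (4 lines of 16 bytes) and the sample trace are hypothetical values picked to keep the example small; the report's own simulator and cache configurations are not reproduced here.

```python
def simulate_direct_mapped(trace, num_lines=4, line_size=16):
    """Count hits and misses for a direct-mapped cache over an address trace."""
    tags = [None] * num_lines          # one stored tag per cache line
    hits = misses = 0
    for address in trace:
        block = address // line_size   # which memory block the address is in
        index = block % num_lines      # cache line the block maps to
        tag = block // num_lines       # distinguishes blocks sharing that line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag          # fill the line on a miss
    return hits, misses

# Hypothetical trace: 0x00 and 0x40 map to the same line and evict each other.
trace = [0x00, 0x04, 0x40, 0x00, 0x44, 0x10]
hits, misses = simulate_direct_mapped(trace)
print(hits, misses)  # 1 5
```

A trace this short is dominated by cold and conflict misses, which is one reason why short traces say little about large caches.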
The traces generated by this system were used to analyze the behavior of multi-level cache hierarchies. Link-time code modification was possible because programs were compiled to an intermediate language called Mahler. A linker option causes branches to trace code to be inserted into the program wherever a referenced address is to be recorded. Instruction addresses are computed and written into a trace buffer. These addresses are the ones at which the instructions would have been located had the trace branches not been inserted. The trace buffer is shared among the kernel and all running processes, but only programs that have been linked for tracing actually use it. The buffer is mapped to the high end of every user's virtual address space, so user trace code can write directly to it through its virtual address. The trace code is written in an assembly language proprietary to the WRL system. Because kernel execution is uninterruptible whereas user execution is not, additional code is required to synchronize the users of the trace buffer correctly. A user trace entry can therefore be split by an arbitrary amount of trace data generated by the kernel.
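The point about recording addresses as they would have been without the trace branches can be sketched with a toy model. This is not Mahler or the WRL linker: the instruction strings, the 4-byte instruction size, the base address and the helper names instrument and run_and_trace are all assumptions made for illustration.

```python
WORD = 4  # assumed fixed instruction size in bytes

def instrument(program, base=0x1000):
    """Insert a trace stub before each load/store, tagged with the original address."""
    modified = []
    for i, instr in enumerate(program):
        original_address = base + i * WORD   # where instr sat before modification
        if instr.split()[0] in ("load", "store"):
            modified.append(("trace", original_address))
        modified.append((instr, original_address))
    return modified

def run_and_trace(modified):
    """'Execute' straight-line code, writing each stub's address to a trace buffer."""
    trace_buffer = []
    for op, original_address in modified:
        if op == "trace":
            trace_buffer.append(original_address)
    return trace_buffer

program = ["add r1, r2", "load r3, 0(r1)", "sub r4, r3", "store r4, 8(r1)"]
modified = instrument(program)
recorded = run_and_trace(modified)
print(len(modified), [hex(a) for a in recorded])  # 6 ['0x1004', '0x100c']
```

Although the instrumented program is two entries longer, the recorded addresses are those of the unmodified layout, so the trace describes the program as it would have run without tracing.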
Trace extraction and analysis are time-consuming and cannot run concurrently with tracing, so tracing is interrupted while they take place. The challenge is to keep the trace seamless across these interruptions. A partial trace is implemented, in which the trace data is analyzed immediately by a high-priority analysis process and the trace does not need to be saved.
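The pause-and-analyze cycle can be sketched as a bounded buffer that is handed to an analysis routine whenever it fills. The buffer size, the page-counting analysis and the function names below are hypothetical; the sketch only shows why the analysis has to carry its state across pauses for the result to look seamless.

```python
from collections import Counter

def trace_with_on_the_fly_analysis(addresses, buffer_size, analyze):
    """Fill a bounded buffer; when full, pause tracing, analyze, and discard."""
    buffer = []
    for address in addresses:
        buffer.append(address)
        if len(buffer) == buffer_size:   # buffer full: tracing pauses here
            analyze(buffer)
            buffer.clear()               # trace data is consumed, not saved
    if buffer:                           # drain whatever is left at the end
        analyze(buffer)

page_counts = Counter()

def count_pages(chunk, page_size=4096):
    # The running summary persists across chunks, so the totals match what
    # a single pass over the whole trace would give.
    page_counts.update(address // page_size for address in chunk)

trace_with_on_the_fly_analysis(range(0, 20000, 1000), buffer_size=8, analyze=count_pages)
print(sorted(page_counts.items()))  # [(0, 5), (1, 4), (2, 4), (3, 4), (4, 3)]
```

Only the summary survives each pause, which is what allows the trace to be arbitrarily long without being stored.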
We will also review another paper, on combining branch predictors by McFarling, from Rajeev's website after this one.
#codingexercise
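// DistinctRangeMin and DistinctRangeMax are assumed to be extension
// methods on int[] defined elsewhere; they are not shown in this post.
int GetDistinctRangeMin(int[] A)
{
    if (A == null) return 0;
    return A.DistinctRangeMin();
}

int GetDistinctRangeMax(int[] A)
{
    if (A == null) return 0;
    return A.DistinctRangeMax();
}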
