Wednesday, December 10, 2014

#codingexercise
Decimal GetOddNumberRangeMean(Decimal[] A)
{
    if (A == null) return 0;
    return A.OddNumberRangeMean(); // assumes an extension method on Decimal[]
}

Today I will continue to discuss the WRL system.
The results of the study showed that, over a long run of a multi-set of programs, the miss ratio of the data cache far exceeded those of the instruction cache and second-level cache, yet the data cache's total contribution to CPI was far less than that of the other two caches. Further, the second-level cache contributes twice as much to the overall degradation as either the instruction or the data cache.

Long traces provided better insight than short ones. Some of the variation in CPI appeared only after a long spell of execution, and not merely during the cache 'warmup' phase. When the TV program discussed earlier was executed with a long trace, the high points of CPI_data and CPI_level2 matched the lows in the graph of CPI_inst. These corresponded to loops that execute small sections of code while sweeping through large volumes of data. The usefulness of long traces is not limited to large multi-level caches; they can also be used for single-level caches and for varying cache size and degree of associativity. The study indicated that long traces are most essential for large caches and higher degrees of associativity.

Address mapping to the second-level cache was also studied. Trace addresses generated for user programs were virtual addresses, while those for the kernel were physical addresses. Address translation is not problematic: if the cache size divided by its associativity is less than or equal to the page size, then the high-order bits of the page offset can be used to select the appropriate cache line. These bits are the same for virtual and physical addresses, so misses will be identical to those occurring in a physical cache where the pid field is replaced by a physical page number field. If the cache is much larger than the page size, as with the second-level cache, the page offset maps to only a tiny proportion of the cache.
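As a quick sketch of the mapping rule above (cache geometry and names are illustrative, not from the paper): when cache size divided by associativity is no larger than the page size, the set index falls entirely within the page offset, so a virtual and a physical address with the same offset select the same line.

```python
# Illustrative: a 4 KB direct-mapped cache with 32-byte lines and 4 KB pages.
# cache_size / associativity (4096) <= page_size (4096), so the set-index
# bits lie inside the page offset and are identical for virtual and
# physical addresses.

CACHE_SIZE, LINE_SIZE, ASSOC, PAGE_SIZE = 4096, 32, 1, 4096
NUM_SETS = CACHE_SIZE // (LINE_SIZE * ASSOC)  # 128 sets

def set_index(addr):
    """Cache set selected by a byte address."""
    return (addr // LINE_SIZE) % NUM_SETS

# Same page offset, different page numbers (as for a virtual address and
# its physical translation): both map to the same set.
va = 0x7f3 + 5 * PAGE_SIZE
pa = 0x7f3 + 9 * PAGE_SIZE
```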

Since the cache is large, references from different processes are spread out, reducing interprocess conflicts. Using a simple pid-hashing algorithm, in which the pid is exclusive-ORed with the part of the address that indexes into the sets of the cache, a study was performed with three sizes of second-level cache. This technique was compared to simple address-modulo-cache-size mapping, and it was found that pid hashing reduces CPI_level2 only when the cache is large.
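A minimal sketch of the pid-hashing idea just described (function names and cache geometry are illustrative): the pid is XORed into the set-index bits, so two processes touching the same addresses land in different sets.

```python
NUM_SETS, LINE_SIZE = 512, 32   # illustrative second-level cache geometry

def index_modulo(addr):
    """Plain address-modulo-cache-size set selection."""
    return (addr // LINE_SIZE) % NUM_SETS

def index_pid_hashed(addr, pid):
    """Set selection with the pid exclusive-ORed into the index bits."""
    return ((addr // LINE_SIZE) ^ pid) % NUM_SETS
```

With plain modulo mapping, two processes referencing the same virtual address always collide in the same set; with the hash, their references are spread across the cache.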

#codingexercise
Invert a Fibonacci series given a number in the series.
This problem is unsolvable unless you are given the next number in the series, or you are allowed to generate the Fibonacci series up to and including the given number.

void printDecreasingFib(int x, int y)
{
    assert(y >= x);
    print y;
    print x;
    while (y - x > 0)
    {
        t = y - x;
        y = x;
        x = t;
        print x;
    }
}
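For reference, here is a quick runnable sketch of the same inversion in Python (assuming x and y are consecutive Fibonacci numbers with x <= y):

```python
def decreasing_fib(x, y):
    """Given consecutive Fibonacci numbers x <= y, return the series
    from y back down to its start, by repeated subtraction."""
    assert y >= x
    out = [y, x]
    while y - x > 0:
        x, y = y - x, x   # step backwards: the previous term is y - x
        out.append(x)
    return out
```

For example, starting from the consecutive pair (5, 8) this walks the series back down: 8, 5, 3, 2, 1, 1.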

Tuesday, December 9, 2014

Layered content formats have the advantage that they store items in separate planes that can be composed. Compare the png format with bmp: the former supports alpha transparency, so images can be stacked and individual layers saved or removed, whereas the latter is flat, scalar content per pixel. If this layering approach were available to most web pages and views, we could have CSS that defines a style for each layer.
For example, say we want to build a web page for a clothing and accessories store. The page has to let users choose clothing and dress up a mannequin, where different items of clothing can go one on top of another. We can do this today by overlaying images in .png format.
HTML:
<img id="img1" src="img1.png"/>
<img id="img2" src="img2.png"/>
<img id="img3" src="img3.png"/>
and we define the css for these as follows:
img
{
    position:absolute;
    top: 0px;
    left: 0px;
}
Now to cause the images to overlay, we set the z-index in css as follows:
#img1
{
z-index: 10;
}
#img2
{
z-index: 20;
}
#img3
{
z-index: 30;
}
This will overlay 3 on top of 2 on top of 1. Here 3 could be the accessory, 2 could be the mannequin and 1 could be the background.
To change the accessory, we could do:
<img src="accessory1.png" onClick="setAccessory('accessory1.png');"/>
and use JavaScript like so:
function setAccessory(path){
     document.getElementById('img3').src = path;
}
To add more accessories, we can just define more layers.
Adobe's Dreamweaver (formerly Macromedia) has the notion of dynamic layers: any object can be added to the web page by placing it directly on a layer, and the layer can then be moved around.
#codingexercise

Decimal GetDistinctRangeMean(Decimal[] A)
{
    if (A == null) return 0;
    return A.DistinctRangeMean(); // assumes an extension method on Decimal[]
}


Monday, December 8, 2014

#codingexercise
Decimal GetDistinctRangeVariance(Decimal[] A)
{
    if (A == null) return 0;
    return A.DistinctRangeVariance(); // assumes an extension method on Decimal[]
}


We continue our discussion of the WRL system. We saw that the first-level cache is write-through; its write buffer behaves like a four-entry FIFO, queuing up writes to the second-level cache. The second-level cache is write-back and may be either virtually or physically addressed.

We next review the choice of user programs traced in this paper. The programs chosen were real programs in actual use, ones that require large amounts of memory not because they are poorly written but because the problems they solve are inherently large.
One such program was a timing verifier (TV) for VLSI circuits that is in regular use at WRL; it generates a linked data structure and traverses it. SOR is another chosen program; it implements a successive over-relaxation algorithm using sparse matrices. Tree is a compiled Scheme program that builds a tree and searches for its largest element; this implementation uses a garbage-collection algorithm. Mult was chosen because it is a multiprogramming workload.
It was found that TV made the most memory references, numbering in the billions. Across all the chosen programs, instructions as a percentage of memory references were generally higher than in CISC programs.

Measures for cache behavior: the most common measure of cache performance is the miss ratio, the fraction of requests not satisfied by a cache. Though the miss ratio is useful for comparing individual caches, it does not give a clear picture of the effect of a cache's performance on the whole, or of its relative importance in a multi-cache system. A better measure in such cases is CPI, the average cycles per instruction. CPI accounts for variations in the number of references to the different caches and for differences in the cost of misses to each cache, and it can be broken down into the components contributed by each part of the memory hierarchy. Since we are concerned only with memory delays and assume one instruction is issued per cycle, the total CPI over some execution interval is

    CPI_total = 1 + CPI_inst + CPI_data + CPI_level2 + CPI_WB

where CPI_WB is the contribution of the write buffer. CPI_total thus gives a better indicator of the performance of the base architecture.

If, for each cache,
    m = number of misses during the interval
    c = cost in cycles of a miss
    i = number of instructions in the interval
then that cache's CPI contribution is (m * c) / i.
We want to reduce the total CPI. 
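The accounting above can be sketched numerically (the counts below are made-up examples, not figures from the paper; only the 12-cycle second-level miss cost comes from the text):

```python
def cache_cpi(misses, miss_cost_cycles, instructions):
    """CPI contribution of one cache: (m * c) / i."""
    return misses * miss_cost_cycles / instructions

instructions = 1_000_000
cpi_inst = cache_cpi(10_000,  5, instructions)   # hypothetical I-cache misses
cpi_data = cache_cpi(50_000,  5, instructions)   # hypothetical D-cache misses
cpi_l2   = cache_cpi(20_000, 12, instructions)   # hypothetical L2 misses
cpi_total = 1 + cpi_inst + cpi_data + cpi_l2     # base cost: 1 cycle/instruction
```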

We resume our discussion on the WRL system's cache analysis program.  It was found that simulating split instruction and data first level caches and a direct mapped second level cache, analysis plus trace generation takes about 100 times as long as untraced execution.  For example, tracing and analyzing a run of TV, a WRL timing verification program, extends the runtime from 25 minutes to 45 hours.

Saving traces for later analysis:
On-the-fly analysis does not require storage. When storage was required, the data was put on tapes for capacity and portability. Data was written onto the tape using the Mache scheme, with a differencing technique: the original trace data was first converted into a cache-difference trace consisting of reference points and differences, in which addresses are represented as differences from the previous address rather than as absolute addresses. This creates patterns in the trace reference string. The cache-difference trace is then compressed: common sequences of data are translated to single code words. Reference locality and the regularity of trace-entry formats allow trace data to be compressed very effectively.
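The differencing step can be sketched as follows (a simplified illustration; the actual on-tape format also includes reference points and code-word compression):

```python
def delta_encode(addresses):
    """Replace each absolute address with its difference from the previous one."""
    prev, out = 0, []
    for a in addresses:
        out.append(a - prev)
        prev = a
    return out

def delta_decode(deltas):
    """Recover the absolute addresses from a difference trace."""
    prev, out = 0, []
    for d in deltas:
        prev += d
        out.append(prev)
    return out

# A strided reference pattern becomes a run of identical small deltas,
# which a compressor can collapse into very few code words.
trace = [4096, 4100, 4104, 4108, 8192, 8196]
```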

We next review the base-case cache organizations. A number of assumptions were made about the machine used for the analysis.
The only assumption related to the generation of traces is that of a RISC architecture, in which instruction fetches and explicit data loads and stores are the only forms of memory reference. This makes the simple form of the traces useful even though they contain no information about instruction types other than loads and stores.
Among the assumptions made about the machine, one is that it has a pipelined architecture: if there were no cache misses, a new instruction would be started on every cycle. This is possible when the page offset is used simultaneously to retrieve the real page from the TLB and to index the correct line in the cache. If there are no delays other than those from cache misses, the base cost is one cycle; the goal of the memory-hierarchy design is therefore to keep the CPI close to 1.
Another assumption is that the machine is fast enough to require two levels of cache. The cost of going to the second-level cache was taken to be about 12 cycles.

These assumptions are in line with the discussions on this topic with the hardware engineers who built the WRL system.

The first-level data cache is a write-through cache; its write buffer is a four-entry FIFO that queues up writes to the second level.

#codingexercise
Decimal getStdDev(decimal[] a)
{
    if (a == null) return 0;
    return a.StdDev(); // assumes an extension method on decimal[]
}


Sunday, December 7, 2014

In today's post we continue discussing the WRL trace-generation system. We discussed a verification technique for the correctness of the trace data. First, the simulation results were compared with those of an existing instruction-level simulator. Second, to check the correctness of the information created by the trace code, synchronization had to be verified. Two techniques were used for this purpose: one, a program with a simple tight infinite loop generated a recognizable pattern, and a variant of the analysis program checked whether anything was missing or out of place; two, the trace entry generated by the operating system on entry to the kernel was extended to include a sequence number that was incremented on each kernel entry.
We next look at the cache analysis program. The requirements were that it be flexible, so it could be changed from run to run; that it not take up so much memory as to cause paging overhead; and that it be as fast as possible, given that analysis takes much more time than trace generation. Constants whose values are specified at compile time define the number of cache levels, split or integrated data and instruction caches, degree of associativity, number of cache lines, line size, write policy, and cost of a hit and a miss (in units of processor cycles). A version of this program is compiled for each cache configuration, and there can even be a fully associative version of the second-level cache.
The program writes output at intervals that can be based on the number of instructions, trace entries, or memory references encountered. The data written includes both interval and cumulative summaries: for each cache, the number of references, miss ratio, contribution to cycles per instruction, and the percentage of misses that resulted from user/kernel conflicts; the contribution to the CPI of second-level compulsory misses (first-time references); the second-level cache's contribution to the CPI if fully associative; and, for user mode and kernel mode, the number of cycles executed and the CPI, broken down into instruction, load, and store contributions.
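In the same spirit as the analysis program, here is a toy direct-mapped simulator with its configuration fixed up front, as the compile-time constants were (geometry and reference strings are illustrative only):

```python
LINE_SIZE = 32    # bytes per line
NUM_LINES = 256   # direct-mapped: one line per set

def miss_ratio(addresses):
    """Run a direct-mapped cache over a reference string; return the miss ratio."""
    tags = [None] * NUM_LINES
    misses = 0
    for addr in addresses:
        block = addr // LINE_SIZE
        idx = block % NUM_LINES
        if tags[idx] != block:   # miss: fill the line
            misses += 1
            tags[idx] = block
    return misses / len(addresses)
```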
#codingexercise
Decimal getMedian(decimal[] a)
{
    if (a == null) return 0;
    return a.Median(); // assumes an extension method on decimal[]
}
#codingexercise
Decimal getMode(decimal[] a)
{
    if (a == null) return 0;
    return a.Mode(); // assumes an extension method on decimal[]
}


Friday, December 5, 2014

#codingexercise
Decimal getAvg(decimal[] a)
{
    if (a == null) return 0;
    return a.Avg(); // assumes an extension method on decimal[]
}

In this post, we continue our study of the WRL trace system and review interrupt timing. Traces may not accurately represent normal execution, because code segments are executed at different times and therefore possibly in a different order. Slowdown in execution means that various interrupts happen sooner or more frequently than they would without tracing. If user code does asynchronous I/O or relies on signals, more trace code is executed before and after the completion of an I/O operation.
Another concern is that the additional timer interrupts are not only traced but also affect the rate at which process switching is mandated. For example, if there are more context switches, the time slice for actual user-code execution is reduced. To overcome this, we could increase the quantum associated with traced processes. Similarly, if the additional timer interrupts are a problem, we could account for them in the analysis program rather than modify the trace data.

Since less user code is executed between process switches, the traced processes run more slowly. This affects the accuracy of multiple-process traces. To compensate, the process-switch interval was increased. This does not affect interrupts, since they still need to be recognized immediately.

The process-switch interval was increased by a factor of 8 to reflect actual switching, corresponding to approximately 200,000 user instructions between switches. To simulate faster machines, the process-switch interval was doubled. Because the quantum is measured in cycles rather than instructions, the average number of instructions executed between switches for the multi-process benchmark was about 175,000.

In the comparison with simulation results, we were fortunate to have an original simulator for the WRL system that could print out traces for short, real user programs. The WRL system was modified to write out traces in a form similar to that produced by the simulator. The resulting trace files were compared, as were the numbers of cache hits and misses.

In order to check the correctness of the information created by the trace code, it was necessary to ensure that kernel and user access to the trace buffer was synchronized and that there were no gaps in the traces produced while the analysis program runs. To verify synchronization, there were two tests. First, a program was run infinitely in a simple tight loop, and the pattern in the trace was checked for consistency. Second, a sequence number was added to the trace entry on each entry into the kernel; again the pattern was found to be consistent.