Sunday, December 14, 2014

Data forwarders rely on one of the following means:
file monitoring
pipe monitoring
network port monitoring
settings/registry monitoring
etc.
However, there are also protocols that exchange data, such as ftp, ssh, and http, and although these run over network ports, their traffic has to be filtered separately.
Take ssh, for example: if there were a daemon that could forward the relevant information from a remote computer via ssh, then we would not need a data forwarder on the remote machine, and the footprint of the forwarder would be even smaller. A forwarder's reach over data can be considered expanded by such ssh data forwarders.
Now let us consider the ssh forwarder code. This may be set up as a listener script so that, on every incoming connection using a particular login, we echo the relevant data back. In this case, the script is as small as a bash script that reads the relevant data, writes it out as the response to the incoming connection, and closes the connection.
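A minimal sketch of such a listener (the script name, log path, and login are hypothetical; a real deployment would emit whatever data is relevant):

#!/bin/bash
# forward.sh - minimal ssh data-forwarder sketch (hypothetical name and path).
# Deploy by forcing this command for a dedicated login in its
# ~/.ssh/authorized_keys entry:
#   command="/usr/local/bin/forward.sh",no-pty,no-port-forwarding ssh-ed25519 AAAA... forwarder
# Each connection as that login then receives the data, and the connection
# closes when the script exits.
tail -n 100 /var/log/app.log

The monitoring host can then poll with something as simple as: ssh forwarder@remotehost >> collected.log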
Since the data is forwarded only on an ssh connection, this model works well with monitoring mechanisms that can periodically poll. Most monitoring mechanisms are set up on either a publisher-subscriber model or a polling model.
The ssh connections have the additional benefit that they are generally considered more secure than plain TCP connections, because the data is not in the clear. Furthermore, the script deployed to the remote computer follows the usual mechanisms for securing this access, so it is likely to meet with less resistance in deployment. On the other hand, consider the installation of a binary as part of a product package on the remote machine: that may raise many questions in some cases, such as licensing and fees.
Lastly, the commands to determine the data to be forwarded and to send it are typical shell commands, so there is more transparency in what is deployed.
#codingexercise
Decimal GetEvenNumberRangeMax(Decimal[] A)
{
    // assumes an EvenNumberRangeMax extension method on Decimal[]
    if (A == null) return 0;
    return A.EvenNumberRangeMax();
}
#codingexercise
Decimal GetEvenNumberRangeMean(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeMean();
}

Saturday, December 13, 2014

Today we continue to review the WRL system for trace generation. We read up on the effect of a direct-mapped versus an associative second-level cache. When associativity is increased, the miss ratio can decrease, but overall performance may not improve and may even degrade. If the cache is very large, it is likely to have a low miss ratio already, so increasing the associativity might provide little or no benefit. Associativity helps the miss ratio, but it adds to the cost of every reference to the cache.
If we ignore this cost, we can bound the benefits of associativity by comparing direct-mapped and fully associative caches. The results showed that, for a fully associative second-level cache at four different sizes, the improvement in cumulative CPI was minimal for large caches but larger for small caches.
A more useful measure of the benefit of associativity is its effect on total CPI as a function of the per-reference cost of the associativity.
We know that the CPI for the associative case will be a delta more than the CPI for the direct-mapped case. This delta is the product of the first-level miss ratio and the difference in the average cost per reference to the second-level cache. This difference can be elaborated in terms of the cost of a hit in the associative cache (h_a), the cost of a hit in the direct-mapped cache (h_d), the cost of a second-level cache miss, which is the same whether direct-mapped or associative (m), the miss ratio of the associative second-level cache (m_a), and the miss ratio of the direct-mapped second-level cache (m_d).
In this case, the delta in the average cost per reference is the difference between the average cost per reference in the associative case and in the direct-mapped case. Each of these averages is the cost of an access to that second-level cache plus the expected cost of a miss from it: the cost of a hit plus the cache's miss ratio times the cost of a second-level cache miss, the miss cost being the same for the direct-mapped and associative cases. Expanding both terms, the delta in the average cost per reference reduces to the difference in hit costs plus the difference in the miss ratios of the two second-level caches weighted by the miss cost.
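In symbols (a reconstruction from the description above, with miss1 denoting the first-level miss ratio):

delta CPI = miss1 × [(h_a + m_a × m) − (h_d + m_d × m)]
          = miss1 × [(h_a − h_d) + (m_a − m_d) × m]

so the extra hit cost h_a − h_d of associativity is weighed against its miss-ratio improvement (m_d − m_a) × m.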
We can now plot the graph of the cumulative CPI in both cases and see that there is a crossover. The distance between the points at which the crossover occurs is the maximum gain to be expected from associativity.

#codingexercise
Decimal GetEvenNumberRangeMin(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeMin();
}

Thursday, December 11, 2014

#codingexercise
Decimal GetEvenNumberRangeMean(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeMean();
}

Today we continue to discuss the WRL system for trace generation. We were discussing long traces and their usefulness, which is not limited to multi-level caches. Even for single-level caches, the effect of trace length was studied while varying cache size and degree of associativity, and long traces were found to be beneficial. Short traces suffered from issues such as high miss ratios, cold starts, and erroneous results or predictions depending on where the short run was chosen. The miss ratios are not always lower for long traces, since not all programs behave the same and much depends on where the short run is taken: some programs terminate before the miss ratio stabilizes, others appear to stabilize early and then take a jump after a while, and yet others stabilize reasonably well but require a long run. Moreover, the cache warm-up itself needs millions of references.
The WRL system does not need to save traces. To analyze a program with a different cache organization, the program only needs to be rerun with a differently parameterized analysis program. This works well for single-user programs that are deterministic, because they generate identical traces on different runs. For multiple processes this is no longer deterministic, because the processes get interleaved differently on different runs. That is not necessarily bad: if we repeat the runs a few times, we can find the dominant trends, though the variations may make it difficult to judge the significance of individual peculiarities. To overcome this limitation, the cache analysis program could be modified to simulate multiple caches at the same time; however, the variations have been within tolerable limits, so this has not been required.
The contribution of the size of the second-level cache was also studied. The CPI was measured for three sizes of second-level cache: 512K, 4M, and 16M. Except for the early part of a run of the tv program, there is a dramatic reduction in CPI when the cache size is increased to 4M, but a considerably smaller improvement from the jump to 16M. When a program actually references a large part of its address space, a bigger second-level cache helps tremendously; however, most programs do not have such a large working set.
We can get a bit more insight from the existing data by dividing level-2 cache misses into three categories: compulsory, capacity, and conflict. Compulsory misses happen the first time an address is referenced and occur regardless of the size of the cache. The capacity miss ratio is the difference between the fully associative miss ratio and the compulsory miss ratio. The conflict miss ratio is the difference between the direct-mapped miss ratio and the fully associative miss ratio.
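In symbols, the decomposition just described is:

capacity miss ratio = (fully associative miss ratio) − (compulsory miss ratio)
conflict miss ratio = (direct-mapped miss ratio) − (fully associative miss ratio)

so the direct-mapped miss ratio is the sum of the compulsory, capacity, and conflict components.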
Capacity misses are due to the fixed cache size; they decrease when the cache size increases, but only up to a certain size. Conflict misses result from too many addresses mapping to the same cache lines. These increase between the 512K and 4M cache sizes but drop tremendously from 4M to 16M, where the interfering blocks are spread out more evenly. This is where pid hashing reduces the miss ratio for the 16M cache.

#codingexercise
Decimal GetEvenNumberRangeMode(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeMode();
}
#codingexercise

Decimal GetEvenNumberRangeSum(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeSum();
}


Wednesday, December 10, 2014

#codingexercise
Decimal GetOddNumberRangeMean(Decimal[] A)
{
    if (A == null) return 0;
    return A.OddNumberRangeMean();
}

Today I will continue to discuss the WRL system.
The results of the study showed that, in a long run of a multi-set of programs, the miss ratio of the data cache far exceeded those of the instruction cache and the second-level cache, yet the total contribution to CPI from the data cache was far less than that of the other two caches. Further, the second-level cache contributes twice as much to the overall degradation as either the instruction or the data cache.

Long traces provided better insight than short ones. Some of the variations in CPI came only after a spell of execution, and this is not merely the cache 'warm-up' phase. When the tv program discussed earlier was executed with a long trace, the high points of CPI data and CPI level2 matched the lows in the graph of CPI inst. These corresponded to loops that execute small sections of code while going through large volumes of data. The usefulness of long traces is not limited to large multi-level caches; they can also be used for single-level caches while varying cache size and degree of associativity. The study indicated that long traces are most essential for large caches and higher degrees of associativity.

Address mapping to the second-level cache was also studied. Trace addresses generated for user programs were virtual addresses, while those for the kernel were physical addresses. Address translation is not problematic: if the cache size divided by its associativity is less than or equal to the page size, then the high-order bits of the page offset can be used to identify the appropriate cache line, and these bits are the same for virtual and physical addresses. Misses will then be identical to those occurring in a physical cache where the pid field is replaced by a physical page number field. If the cache is much larger than the page size, as in the case of the second-level cache, the page offset maps only to a tiny proportion of the cache.
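For example (illustrative numbers, not the paper's): with 4K pages, an 8K two-way first-level cache satisfies 8K / 2 = 4K ≤ 4K, so the page offset alone indexes it and a virtual cache misses exactly like a physical one; a 4M second-level cache fails the condition by a wide margin, since the page offset selects only a 4K slice of its index space.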

Since the cache is large, references from different processes are spread out, eliminating interprocess conflicts. Using a simple pid hashing algorithm, where the pid is exclusive-ORed with part of the address that indexes into the sets of the cache, a study was performed with three sizes of second-level cache. This technique was compared to addressing modulo cache size, and it was found that pid hashing reduces CPI level2 only when the cache is large.
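One plausible form of such a hash (a sketch; the paper says only that the pid is exclusive-ORed with part of the index bits), for a cache with S sets and line size B:

set index = (floor(address / B) mod S) XOR (pid mod S)

so two processes that touch the same virtual addresses land in different sets of the large cache.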

#codingexercise
Invert a Fibonacci series given a number in the series.
This problem is unsolvable unless you are given the next number in the series or you are allowed to generate the Fibonacci series up to and including the given number.

void PrintDecreasingFib(int x, int y)
{
    // x and y must be consecutive Fibonacci numbers with x <= y
    Debug.Assert(y >= x);
    Console.WriteLine(y);
    Console.WriteLine(x);
    while (y - x > 0)
    {
        int t = y - x; // the term preceding x in the series
        y = x;
        x = t;
        Console.WriteLine(x); // print only the newly derived term to avoid duplicates
    }
}

Tuesday, December 9, 2014

Vector-form content has the advantage that it stores items in layers. Compare the png format with bmp: in the former we can save and remove layers, whereas the latter is scalar content per pixel. If this layering approach were available to most web pages and views, then we could have a CSS that defines a style for each layer.
For example, let us say we want to build a web page for a clothing and accessories store. The web page has to let users choose clothing and dress up a mannequin, where different items of clothing go one on top of another. We can do this today by overlaying images in .png format.
HTML:
<img id="img1" src="img1.png"/>
<img id="img2" src="img2.png"/>
<img id="img3" src="img3.png"/>
and we define the css for these as follows:
img
{
    position:absolute;
    top: 0px;
    left: 0px;
}
Now to cause the images to overlay, we set the z-index in css as follows:
#img1
{
z-index: 10;
}
#img2
{
z-index: 20;
}
#img3
{
z-index: 30;
}
This will overlay 3 on top of 2 on top of 1. Here 3 could be the accessory, 2 could be the mannequin and 1 could be the background.
To change the accessory, we could do:
<img src="accessory1.png" onClick="setAccessory('accessory1.png');"/>
and use JavaScript like so:
function setAccessory(path){
     document.getElementById('img3').src = path; // img3 is the topmost (accessory) layer
}
To add more accessories, we can just define more layers.
Adobe's Macromedia Dreamweaver has the notion of dynamic layers. Any object can be added to the webpage by adding it directly to a layer. The layer can then be moved around.
#codingexercise

Decimal GetDistinctRangeMean(Decimal[] A)
{
    if (A == null) return 0;
    return A.DistinctRangeMean();
}


Monday, December 8, 2014

#codingexercise
Decimal GetDistinctRangeVariance(Decimal[] A)
{
    if (A == null) return 0;
    return A.DistinctRangeVariance();
}


We continue our discussion of WRL systems. We saw that the first-level caches are write-through and behave like a four-entry FIFO, queuing up writes to the second-level cache. The second-level cache is write-back. It may be either virtually or physically addressed.

We next review the choice of user programs to trace made in this paper. The programs chosen for the study were real programs in actual use, ones that require large amounts of memory not because they are poorly written but because the problems they solve are inherently large.
One such program was a timing verifier (tv) for VLSI circuits, which is in regular use at WRL; it generates a linked data structure and traverses it. SOR is another chosen program; it implements a successive overrelaxation algorithm using sparse matrices. Tree is a compiled Scheme program that builds a tree and searches it for the largest element; this implementation uses a garbage collection algorithm. Mult was chosen because it is a multiprogramming workload.
It was found that the tv program made the most memory references, numbering in the billions. Across all the chosen programs, the number of instructions as a percentage of memory references was generally higher than in CISC programs.

Measures for cache behavior: the most common measure of cache performance is the miss ratio, the fraction of requests that are not satisfied by a cache. Though the miss ratio is useful for comparing individual caches, it does not give a clear picture of a cache's effect on overall performance or of its relative importance in a multi-cache system. A better measure in that case is the CPI, the average number of cycles per instruction. CPI accounts for variations in the number of references to the different caches and for differences in the costs of misses to those caches, and it can be broken down into the components contributed by each part of the memory hierarchy. Since we are concerned only with memory delays and assume one instruction is issued per cycle, the total CPI over some execution interval is one plus the contributions from the data cache, the instruction cache, the second-level cache, and the write buffer. The total CPI thus gives a better indicator of the performance of the base architecture.
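Written out using this breakdown:

CPI total = 1 + CPI data + CPI inst + CPI level2 + CPI wb

where each term on the right is the memory-delay contribution of the corresponding cache or the write buffer.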

If, for each cache,
    m = number of misses during the interval
    c = cost in cycles of a miss
    i = number of instructions in the interval
then the CPI contribution of that cache is CPI = (m × c) / i.
We want to reduce the total CPI.
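As a quick illustrative calculation (numbers invented for the example): a cache that suffers m = 50,000 misses at c = 20 cycles each over an interval of i = 1,000,000 instructions contributes CPI = (50,000 × 20) / 1,000,000 = 1.0 cycles per instruction.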