Wednesday, December 17, 2014

Today I will be reading from the WRL research report on Shared Memory Consistency Models. A consistency model is a specification of memory semantics that helps programmers write correct and efficient programs on multiprocessors. Sequential consistency is one such model, but it is severely restricting for multiprocessors. That is why system designers introduce relaxed consistency models, though these sometimes differ so much from one another that a general specification becomes altogether more complicated. In this paper the authors describe the issues related to such models, focusing on models for hardware-based shared memory systems. Each system makes its own optimizations, but the paper describes the models using a common terminology. Furthermore, the paper also describes the models from the programmer's point of view.
Shared memory models provide a single address space abstraction over parallel systems. This abstraction simplifies difficult programming tasks such as data partitioning and dynamic load distribution. A precise understanding of how memory behaves is therefore important for programming. Consider a producer-consumer queue where memory is written by one process and read by another. If a read returns the old value from before the producer's write, there is a mismatch between the programmer's expectation and the actual behaviour.
On a single processor, a read returns the value of the last write because instructions execute in program order, which isn't the case for multiprocessors. Extending this notion to multiprocessors gives what is called sequential consistency: each processor appears to execute in program order and all processors see a single overall order of memory operations, so reads behave as the programmer expects. However, this simple and intuitive model severely restricts hardware and compiler optimizations.
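To make the producer-consumer expectation concrete, here is a minimal sketch in C# (an illustration only, not code from the report). Under sequential consistency the reader can never observe ready == true while data still holds the stale value; on a relaxed model that outcome is allowed unless the programmer asks for ordering, which the volatile qualifier provides here.

class ProducerConsumerCell
{
    private int data;            // payload written by the producer
    private volatile bool ready; // publication flag; volatile restores the ordering the programmer expects

    public void Produce(int value)
    {
        data = value;            // write the payload first
        ready = true;            // then publish it
    }

    public bool TryConsume(out int value)
    {
        if (ready)
        {
            value = data;        // sees the payload written before 'ready' was set
            return true;
        }
        value = 0;
        return false;
    }
}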

#codingexercise
Decimal GetOddNumberRangeStdDev(Decimal[] A)
{
    if (A == null) return 0;
    return A.OddNumberRangeStdDev();
}
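The extension methods these exercises call are defined elsewhere in this blog series. Purely as an illustration, here is one plausible shape for OddNumberRangeStdDev, assuming it means the standard deviation of the odd values in A; that interpretation is a guess, not taken from the earlier posts.

using System;
using System.Linq;

public static class DecimalArrayExtensions
{
    public static Decimal OddNumberRangeStdDev(this Decimal[] A)
    {
        // keep only the odd values (assumed interpretation of "odd number range")
        var odds = A.Where(x => Math.Abs(x % 2) == 1).ToArray();
        if (odds.Length == 0) return 0;
        Decimal mean = odds.Average();
        Decimal variance = odds.Select(x => (x - mean) * (x - mean)).Sum() / odds.Length;
        return (Decimal)Math.Sqrt((double)variance);
    }
}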

A consistency model like the one described acts as an interface between the programmer and the system. It affects programmability because programmers use it to reason about the correctness of their programs, and it affects performance because system designers use it to decide which optimizations are allowed in hardware and software.

Tuesday, December 16, 2014

In today's post we talk about security and policies as applicable to AWS S3 artifacts. Policies are mappings that dictate which users have access to which resources. Typically, policies are declared and applied to containers. In the .NET security model, for instance, there are different tiers of policies, including an enterprise-level policy that comes in handy for corporate accounts. Policies don't have to be expressed the same way on different systems, yet there is a reason for the consistency we see across many of them. For example, permissions set at a folder level are inherited by the folder's contents. Another common principle is cross-account authorization: the owner of a folder may grant another user permission to read and write it. Policies and their applications are implemented differently, but there is a lot in common in their design, because the requirements they serve are common. Coming back to S3 artifacts, there are two ways enterprise policies can be applied.
One is to create a corporate-level bucket with user-level folders. Another is to apply a policy to every bucket that grants full access to a corporate governance account.
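As an illustration of the second approach, a bucket policy along the following lines can be attached to each bucket; the bucket name and the governance account id below are placeholders, not values from this post.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CorporateGovernanceFullAccess",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::corporate-bucket",
        "arn:aws:s3:::corporate-bucket/*"
      ]
    }
  ]
}

The first approach can be expressed similarly by scoping each user's statement to a resource such as arn:aws:s3:::corporate-bucket/username/*.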

Monday, December 15, 2014

Decimal GetEvenNumberRangeStdDev(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeStdDev();
}

Today we continue to discuss the WRL long address trace generation system. We were reviewing the effects of line size in the first- and second-level caches in this study and the contribution of the write buffer. We summarize the study now. This is a method to generate and analyze very long traces of program execution on RISC machines. The system is designed to allow tracing of multiple user processes and the operating system. Traces can be stored or analyzed on the fly. The latter is preferred because the storage required is very large even for a very short run; alternatively, the trace can be compacted and written to tape. The system requires that the programs be linked in a special way.
From the trace analysis using this system, it was found that while a second-level cache was necessary, very large second-level caches provide little or no benefit for most programs. A program with a large working set can still benefit from a large cache. Block size plays a big role in the first-level instruction cache, but sizes above 128 bytes had little effect on overall performance. If the proportion of writes in the memory references is large, then the contribution of the write buffer is significant. There is little or no benefit from associativity in a large second-level cache.

#codingexercise
Decimal GetEvenNumberRangeVariance(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeVariance();
}


Decimal GetEvenNumberRangeMode(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeMode();
}

Today we will continue to discuss the WRL system for trace generation. We discussed the effects of direct-mapped versus associative second-level caches. We now review the line size in the first- and second-level caches. When the first-level line size was doubled or quadrupled with a 512K second-level cache, the cumulative total CPI dropped, and the reduction was independent of the second-level size. The improvement came from a decreased contribution of the instruction cache; the data cache contribution remained constant. The improvement from doubling the line size was about the same as the improvement from increasing the second-level cache size eight-fold. Doubling the length of the lines in the second-level cache from 128 bytes to 256 bytes made no difference, perhaps because there is too much conflict for long lines to be beneficial.
We next review write buffer contributions. The performance of the write buffer varied from program to program, mainly because the write buffer is sensitive to both the proportion and the distribution of writes in the instructions executed. A write buffer entry is retired every six cycles, so writes that are more frequent or bunched degrade performance. The relationship between the proportion of writes in a program and the write buffer performance is clear: when the proportion of writes jumps, the CPI contributed by the write buffer shows a corresponding jump.
There was also a case where the percentage of writes was frequently in the 15-20% range and yet the write buffer CPI was usually zero. If writes are uniformly distributed at a rate below a threshold, the write buffer never fills and a subsequent write is never delayed; above that threshold there may be some delays. Since the second-level cache is pipelined into three stages, the processing of a single miss gives the write buffer a chance to retire two entries. If enough misses are interspersed between writes, the write buffer may work smoothly. Thus the first-level data cache miss ratio is a third factor that comes into play.
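As a toy illustration of this behaviour (my sketch, not code from the report), the following routine takes the cycle numbers at which a program issues writes, retires one buffer entry every six cycles while the buffer is non-empty, and counts the cycles lost when a write finds the buffer full.

using System.Collections.Generic;

static int CountStallCycles(IReadOnlyList<int> writeCycles, int capacity, int retireInterval = 6)
{
    int occupancy = 0, retireAt = 0, stalls = 0;
    foreach (int issueCycle in writeCycles)      // cycle numbers, sorted ascending
    {
        int t = issueCycle + stalls;             // earlier stalls push later writes back
        while (occupancy > 0 && retireAt <= t)   // drain entries that have already retired
        {
            occupancy--;
            retireAt += retireInterval;
        }
        if (occupancy == capacity)               // buffer full: wait for the next retirement
        {
            stalls += retireAt - t;
            t = retireAt;
            occupancy--;
            retireAt += retireInterval;
        }
        if (occupancy == 0) retireAt = t + retireInterval;
        occupancy++;                             // the new write enters the buffer
    }
    return stalls;
}

A uniform stream with six or more cycles between writes never stalls for any buffer size, while the same number of writes bunched together quickly overflows a small buffer, which is the threshold behaviour described above.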
Although this report doesn't study it, the effects of entropy on the distribution of references across the cache hierarchy could be relevant. The Shannon entropy is defined as H(X) = -sum over x of P(x) log2 P(x).
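The definition translates directly into code; a small helper (mine, not from the report), using the convention that 0 * log2(0) is 0:

using System;
using System.Collections.Generic;

static double ShannonEntropy(IEnumerable<double> probabilities)
{
    double h = 0.0;
    foreach (double p in probabilities)
        if (p > 0.0)
            h -= p * Math.Log(p, 2);   // accumulate -p * log2(p)
    return h;
}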

Sunday, December 14, 2014

Data forwarders rely on one of the following means:
file monitoring
pipe monitoring
network port monitoring
settings/registry monitoring
etc.
However, there are also protocols that exchange data, such as ftp, ssh, and http; although these run over network ports, they have to be filtered.
Take ssh, for example: if a daemon could forward the relevant information from a remote computer via ssh, we would not need a data forwarder on the remote machine, and the footprint of the forwarder would be even smaller. A forwarder's reach can be considered to be expanded by such ssh data forwarders.
Now consider the ssh forwarder code. It may be set up as a listener script so that, on every incoming connection using a particular login, we echo the relevant data back. In this case the script can be as small as a bash script that reads the relevant data, writes it out as the response to the incoming connection, and closes the connection.
Since the data is forwarded only on an ssh connection, this model works well with monitoring mechanisms that poll periodically. Most monitoring mechanisms are set up either as a publisher-subscriber model or as a polling model.
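A minimal sketch of the polling side (an illustration; the host, login and remote command are placeholders I chose, not part of this post): the monitor simply runs a command over ssh and captures whatever the remote script writes back.

using System.Diagnostics;

static string PollOverSsh(string user, string host, string remoteCommand)
{
    var psi = new ProcessStartInfo
    {
        FileName = "ssh",
        Arguments = $"{user}@{host} {remoteCommand}",   // e.g. "monitor@remote-box ./forward.sh"
        RedirectStandardOutput = true,
        UseShellExecute = false
    };
    using (var ssh = Process.Start(psi))
    {
        string data = ssh.StandardOutput.ReadToEnd();   // the forwarded data
        ssh.WaitForExit();
        return data;
    }
}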
SSH connections have the additional benefit that they are generally considered more secure than plain TCP connections because the data is not sent in the clear. Furthermore, the script deployed to the remote computer follows the usual mechanisms for securing such access, so it is likely to meet less resistance in deployment. Installing a binary as part of a product package on the remote machine, on the other hand, may raise many questions around licensing, fees, and so on.
Lastly, the commands that determine the data to be forwarded and send it are typical shell commands, so there is more transparency in what script is deployed.
#codingexercise
Decimal GetEvenNumberRangeMax(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeMax();
}
#codingexercise
Decimal GetEvenNumberRangeMean(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeMean();
}

Saturday, December 13, 2014

Today we continue to review the WRL system for trace generation. We read up on the effect of direct-mapped versus associative second-level caches. When associativity is increased, the miss ratio can decrease, but the overall performance may not improve and may even degrade. If the cache is very large, it is likely to have a low miss ratio already, so increasing the associativity might provide little or no benefit. Associativity helps the miss ratio, but it adds to the cost of every reference to the cache.
If we ignore this cost, we can bound the benefits of associativity by comparing direct-mapped and fully associative caches. The results for a fully associative second-level cache at four different sizes showed that associativity made little difference to the cumulative CPI for large caches but was more effective for small caches.
A more useful measure of the benefit of associativity is its effect on the total CPI as a function of the per-reference cost that associativity adds.
The CPI for the associative case is a delta more than the CPI for the direct-mapped case. This delta is the product of the first-level miss ratio and the difference in the average cost per reference to the second-level cache. That difference can be written in terms of the cost of a hit in the associative cache (ha), the cost of a hit in the direct-mapped cache (hd), the cost of a second-level cache miss (m, the same for the direct-mapped and associative cases), the miss ratio of the associative second-level cache (ma) and the miss ratio of the direct-mapped second-level cache (md).
The average cost per reference to the associative second level is the cost of a hit plus the miss ratio times the cost of a miss, and likewise for the direct-mapped second level. Subtracting the two, the delta in the average cost per reference is the difference in hit costs plus the difference in miss ratios weighted by the miss cost; writing the hit costs as a ratio of one another makes it clear that the gain from the lower associative miss ratio is traded against the extra cost of every hit.
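Using the symbols above, one way to write the relationship (a sketch of the derivation just described, not copied verbatim from the report) is:

cost_assoc = ha + ma * m    (average cost per second-level reference, associative)
cost_direct = hd + md * m   (average cost per second-level reference, direct-mapped)
CPI_assoc - CPI_direct = m1 * ((ha - hd) + (ma - md) * m)

where m1 is the first-level miss ratio. The lower associative miss ratio (ma < md) pulls the delta down while the higher hit cost (ha > hd) pushes it up, which is what produces the crossover described next.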
We can now plot the cumulative CPI for both cases and see that there is a crossover. The distance between the points at which the crossover occurs is the maximum gain to be expected from associativity.

#codingexercise
Decimal GetEvenNumberRangeMin(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeMin();
}

Thursday, December 11, 2014

#codingexercise
Decimal GetEvenNumberRangeMean(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeMean();
}

Today we will continue to discuss the WRL system for trace generation. We were discussing long traces and their usefulness, which is not limited to multi-level caches. Even for single-level caches, varying the cache size and degree of associativity, the effect of trace length was studied and long traces were found to be beneficial. With short traces there were issues such as high miss ratios, cold start, and erroneous results or predictions depending on where the short run was taken. The miss ratios are not always lower for long traces, since not all programs behave the same and it depends on where the short run is taken: some programs terminate before the miss ratio stabilizes, others appear to stabilize early and then take a jump after a spell, and yet others stabilize reasonably well but require a long run to do so. Moreover, the cache warm-up itself needs millions of references.
The WRL system does not need to save traces. To analyze a program with a different cache organization, the program only needs to be rerun with a differently parameterized analysis program. This works well for single-user programs that are deterministic, because they generate identical traces on different runs. For multiple processes this is no longer deterministic, because the processes get interleaved differently on different runs. This is not necessarily bad: if we repeat the runs a few times, we can find the dominant trends. However, the variations may make it difficult to judge the significance of individual peculiarities. To overcome this limitation, the cache analysis program could be modified to simulate multiple caches at the same time, though in practice the variations have been within tolerable limits and this has not been required.
The contribution due to the size of the second-level cache was also studied. The CPI was measured for three sizes of second-level cache: 512K, 4M and 16M. Except for the early part of a run of the TV program, there is a dramatic reduction in CPI when the cache size is increased to 4M but a considerably smaller improvement from the jump to 16M. When a program actually references a large part of its address space, a bigger second-level cache helps tremendously; however, most of the programs don't have such a large working set.
We can get a bit more insight from the existing data by dividing second-level cache misses into three categories: compulsory, capacity and conflict. Compulsory misses happen the first time an address is referenced and occur regardless of the size of the cache. The capacity miss ratio is the difference between the fully associative miss ratio and the compulsory miss ratio. The conflict miss ratio is the difference between the direct-mapped miss ratio and the fully associative miss ratio.
Capacity misses are due to the finite cache size; they decrease as the cache size increases, but only up to a point. Conflict misses result from too many addresses mapping to the same cache lines; in this study they increase between the 512K and 4M cache sizes but drop tremendously from 4M to 16M, where the interfering blocks are spread out more evenly. This is also where pid hashing reduces the miss ratio for the 16M cache.
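Restating the categorisation above as code (the three measured miss ratios are assumed to be for caches of the same size):

static (double Compulsory, double Capacity, double Conflict) SplitMissRatio(
    double compulsoryRatio, double fullyAssociativeRatio, double directMappedRatio)
{
    double capacity = fullyAssociativeRatio - compulsoryRatio;   // misses due to finite size
    double conflict = directMappedRatio - fullyAssociativeRatio; // misses due to address mapping
    return (compulsoryRatio, capacity, conflict);
}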

#codingexercise
Decimal GetEvenNumberRangeMode(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeMode();
}
#codingexercise

Decimal GetEvenNumberRangeSum(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeSum();
}