Saturday, April 4, 2015

Today we start to wrap up reading the WRL research report on the Swift Java compiler. We were discussing the results of the performance studies, specifically from global CSE, class hierarchy analysis, method inlining, method splitting, field analysis, etc., and we were looking at those that dominated across most applications as well as those that helped in specific cases.
We now look at related work, specifically comparing Swift to Marmot, a research compiler from Microsoft; BulletTrain, a commercial compiler; HotSpot, another commercial compiler, this one from Sun; TurboJ, which translates Java to another language for compilation; and the work of Diwan, Cytron and others.
Marmot does a form of class hierarchy analysis but has little intraprocedural analysis or code motion and does not do instruction scheduling. Moreover, its IR is not SSA-based, whereas Swift's is. This is a significant difference, and the same distinction applies to other such compilers, such as Jalapeño.
BulletTrain uses SSA for its IR and even does check elimination, loop unrolling, type propagation and method inlining. HotSpot dynamically compiles code that is frequently executed and can use runtime profiling information; it also does method inlining based on CHA. TurboJ translates to C for compilation by a C compiler and can do method resolution, inlining, CSE and code motion during the translation.
Marmot keeps memory operations in order, except for promoting loads out of loops. Jalapeño builds an instruction-level dependence graph, but it is not available until late in the compilation. Diwan uses type-based alias analysis but does not incorporate the results into the SSA graph. Cytron represents alias information in an SSA graph by explicitly inserting calls that may modify a value wherever the associated operation may modify it. The difference between that strategy and Swift's is that Cytron's can greatly increase the size of the SSA graph, whereas Swift enforces strict memory ordering via the global store inputs and relaxes dependences only where it can prove there are no aliases.
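To make that last point concrete, here is a hypothetical Java fragment (the Buffer class and names are mine, not from the paper). Under strict memory ordering, the load of buf.length is chained behind the array store on every iteration; once the compiler proves the store cannot alias the field, the dependence is relaxed and the load becomes loop-invariant.

class Buffer { int length; }

class AliasDemo {
    // With strict memory ordering, the load of buf.length would be ordered
    // after the store to out[i] in each iteration. A store to an int[]
    // cannot modify an object field, so the dependence can be relaxed and
    // the load hoisted out of the loop.
    static void fill(Buffer buf, int[] out) {
        for (int i = 0; i < buf.length; i++) {
            out[i] = i;
        }
    }
}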
Diwan uses a form of aggregate analysis to detect when a polymorphic data structure is used in only one way. For example, it can show that a linked list of general objects may in fact contain only objects of a certain class or its subclasses. Swift's field analysis is more comprehensive and determines exact types. Dolby and Chien describe an object inlining optimization for C++ programs that does context-sensitive interprocedural analysis, but it takes minutes where Swift takes seconds. Moreover, Swift allows objects to be inlined even when there is no local reference; this is usually referred to as unboxing and exists in functional languages. Lastly, Swift has exploited field properties to do more escape analysis than others. In this sense, Swift can claim to be a complete compiler.
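As a rough sketch of what object inlining amounts to at the source level (the class names here are hypothetical, not from the paper):

class Point { int x, y; }

// Before: each Rect allocates two separate Point objects on the heap.
class Rect {
    Point topLeft = new Point();
    Point bottomRight = new Point();
}

// After (conceptually): the fields of the inlined objects are flattened
// into the container, removing the indirection -- akin to unboxing in
// functional languages.
class RectInlined {
    int topLeftX, topLeftY;
    int bottomRightX, bottomRightY;
}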

We will close this study of the WRL research report on the Swift Java compiler with the conclusion section of the paper next.

#codingexercise
double GetAllNumberRangeProductCubeRootPowerTwelve(Double[] A)
{
    if (A == null) return 0;
    return A.AllNumberRangeProductCubeRootPowerTwelve();
}

As we have seen, the Swift IR simplified many aspects of the compiler, and the use of SSA form made it easy to express optimizations. The Swift IR includes machine-dependent operations, and this allows all passes to operate directly on the SSA form. The manipulation operations on the SSA graph and CFG are common to all these passes.
Swift makes extensive use of interprocedural analysis. The most effective optimizations in Swift are method inlining, class hierarchy analysis, and global CSE. Swift also introduced field analysis and store resolution. Much of the remaining overhead in Java appears to result from the object-oriented style, which results in greater memory latencies. There is room to improve optimizations and increase performance with such techniques as prefetching, co-locating objects, or more aggressive object inlining.

Friday, April 3, 2015

Today we continue reading the WRL research report on the Swift Java compiler. We were discussing the results of the performance studies, specifically from global CSE, class hierarchy analysis, method inlining, method splitting, field analysis, etc. We discussed how stack allocation and synchronization removal could matter on a case-by-case basis. Today we continue with store resolution. Programs such as compress and mtrt have important loops where memory operations and runtime checks are optimized only when memory dependences are relaxed. In the compress program, sign-extension elimination is effective because the program does many memory accesses to byte arrays. Branch removal is especially effective in one program because it made heavy use of a method that involved the computation of a boolean.
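For illustration, here is a hypothetical Java fragment of the kind that benefits (mine, not from the paper). A Java byte load is sign-extended to an int, and when the extended bits are immediately masked away, as in compress-style byte-array code, the extension is redundant; similarly, a boolean computed from a comparison can be produced without a branch.

class SignExtDemo {
    // The byte load sign-extends to int; the mask makes the extension dead,
    // so sign-extension elimination can remove it.
    static int unsignedAt(byte[] data, int i) {
        return data[i] & 0xFF;
    }

    // Branch removal: this comparison can compile to a conditional move or
    // set instruction instead of a branch.
    static boolean less(int a, int b) {
        return a < b;
    }
}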
A study was made to count the number of null checks and bound checks executed in each application. The count was made with all optimizations except CSE, check elimination, field analysis and loop peeling, and then counts were made with each of them successively added. It was found that CSE is highly effective in eliminating null checks but does not eliminate any bound checks. Similarly, loop peeling eliminates a large number of null checks in several applications but does not remove bound checks. This is because bound checks are typically not loop-invariant.
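A small hypothetical loop makes the asymmetry clear: the null check on the array is the same in every iteration, so CSE or loop peeling can remove all but the first one, while the bounds check depends on the loop index and is not loop-invariant.

class CheckDemo {
    static int sum(int[] a) {
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            // Null check on a: loop-invariant, removable by CSE or peeling.
            // Bounds check on a[i]: varies with i, not loop-invariant.
            s += a[i];
        }
        return s;
    }
}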
In order to find the maximum possible benefit, that is, the cost of the runtime checks that Swift could not eliminate, all runtime checks were removed, with a possible loss of correctness, and a comparison was made between full optimization and the replacement of the remaining runtime checks with pin operations. The pin operations are moved upwards as much as possible without going past a branch or memory store operation. It was found that the remaining runtime checks that Swift could not eliminate cost about 10-15%.
The effects of several optimizations on the number of virtual method and interface calls were also studied. A plot was drawn of the total number of unresolved calls when only virtual calls to private or final methods were resolved. This was repeated with resolutions from type propagation, CHA, field analysis and method return values successively added. Again it was clear that CHA had the most impact. In one case, method return value analysis resolved additional calls and improved that program's performance.
Type propagation, field analysis and method return values have a small impact compared to CHA.
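A hypothetical example of the kind of resolution involved (the class names are mine): s.area() is nominally a virtual call, but if CHA finds no overriding subclass, or the receiver's class is final, the call resolves to a direct call and becomes a candidate for inlining.

class Shape {
    int area() { return 0; }
}

final class Square extends Shape {   // final: no subclasses can exist
    int side;
    @Override int area() { return side * side; }
}

class ChaDemo {
    // The receiver is the final class Square, so the virtual call resolves
    // to Square.area() directly; CHA reaches the same conclusion whenever
    // no loaded subclass overrides the method.
    static int compute(Square s) {
        return s.area();
    }
}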

Thursday, April 2, 2015

Today we continue reading the WRL research report on the Swift Java compiler. We started discussing the results of the performance studies. Among the general results observed, method inlining, class hierarchy analysis and global CSE improved performance across all applications. This is largely because most programs contain many small methods, often missing a final specifier, and because CHA facilitates method resolution, inlining and escape analysis.
Specific results were also studied. Field analysis had a large impact on some programs, such as mpeg and compress, because it eliminates many null checks and bound checks on some constant-sized buffers and computation arrays. It plays an even more important role in programs such as mtrt, where there are many references to neighbouring elements. In db, Swift successfully inlines a vector object contained within a database entry. Object inlining also inlines some of the smaller arrays used by the mtrt program, which helps eliminate checks. The performance of db improves because a significant comparison routine repeatedly generates and uses an enumeration object that can be stack allocated.
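A hypothetical sketch of the field-analysis case: if a private field can be shown to always hold a non-null array of a fixed size, accesses with provably in-range indices need no runtime checks.

class Table {
    // Field analysis can prove this field is assigned exactly once, is never
    // null afterwards, and always has length 256.
    private final int[] data = new int[256];

    int lookup(int b) {
        // The index is provably in [0, 255], so both the null check and the
        // bounds check can be eliminated.
        return data[b & 0xFF];
    }
}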
Method splitting has been found very useful in programs like db, which makes heavy use of the elementAt operation, and in file read operations, where there is an ensureOpen check on the file handle. It was also noted that stack allocation does not always improve performance because of the size and effectiveness of the JVM heap; it would show better results if the heap size were small.
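A sketch of the method-splitting pattern (hypothetical code, not from the paper): the rarely taken slow path, such as an ensureOpen-style check that fails only on the first call, is split into its own method so that the remaining fast path is small enough to inline at every call site.

class Reader {
    private boolean open;

    int read() {
        if (!open) {
            return openAndRead();   // rare slow path, split out of line
        }
        return readFast();          // hot path stays small and inlinable
    }

    private int readFast() { return 1; }
    private int openAndRead() { open = true; return readFast(); }
}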
Synchronization removal also showed little or no performance improvement, and was significant only in the case of one program, which involved synchronization on array and hash data structures.
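For a concrete flavor of synchronization removal, here is a minimal sketch (mine, not from the paper): StringBuffer's methods are synchronized, but when escape analysis proves the buffer never leaves the method, the locking can be removed.

class SyncDemo {
    static String join(String a, String b) {
        // sb never escapes this method, so its monitor is provably
        // thread-local and the synchronization in append/toString can go.
        StringBuffer sb = new StringBuffer();
        sb.append(a);
        sb.append(b);
        return sb.toString();
    }
}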

Wednesday, April 1, 2015

Today we continue reading the WRL research report on the Swift Java compiler. We started discussing the results of the study on the Swift Java compiler, comparing general results across a variety of programs. We noted that Swift's output was installed into a fast JVM for the study, and that the compiler could compile about 2000 lines of code per second. However, it became slower by 20 to 40% when escape analysis was turned on. This is expected because escape analysis requires several recursive passes and has a cascading effect.

We now look at the performance of the applications when one or more of several optimizations are disabled. These optimizations include method inlining, class hierarchy analysis, field analysis, object inlining, method splitting, stack allocation, synchronization removal, store resolution, global CSE, global code motion, loop peeling, runtime check elimination, sign-extension elimination and branch removal. If any of these terms sound unfamiliar at this point, it is probably better to revisit the previous posts. These features are not really independent, yet we are merely interested in the gain from each of them. All programs listed for comparison earlier were run again with one of the features turned off, and the slowdown introduced was recorded as a positive value; if there was no slowdown, the value was left blank. Since the features are not mutually independent, the numbers cannot simply be accumulated across these metrics. They are also merely rough indicators, because performance can change when disabling one or more code optimizations. It was observed that method inlining, class hierarchy analysis and global CSE improved almost all the programs without fail, and more so when the program code involved many small methods and virtual methods.

Tuesday, March 31, 2015

Today we continue reading the WRL research report on the Swift Java compiler. We were discussing register allocation and solving it by means of graph coloring. We will discuss the results of the study on the Swift Java compiler next. The Swift Java compiler was measured on an Alpha workstation with one 667MHz processor, a 64KB on-chip data cache and a 4MB board cache. The generated code was installed into a high-performance JVM. This was necessary so that the results could be properly evaluated under controlled conditions: only when the baseline is performant can we consider the results representative of the variables we control, and a poor choice of baseline may hide gains from some of the variables or skew the results because of variations in running time. In this case, the JVM chosen was already performing some form of CHA, which helps us evaluate the gains from the passes more appropriately. The heap size used was 100 MB. Although the hardware seems less powerful compared to recent processors, the configuration was decent at the time, and with the JVM baseline established, the trends could be expected to be the same on a different choice of system. The tests were performed on a number of applications from a variety of domains, varying in program size. The initial set of results was taken with all optimizations on. Then results were taken without class hierarchy analysis (CHA). This showed that the use of CHA greatly improves overall performance. The overall speedup of the Swift-generated code without CHA over the fast JVM is marginal, because the JVM already uses some form of CHA to resolve method calls. The results were also compared for simple CHA versus full CHA, and it turned out that the former was only somewhat less performant than the latter, indicating that it is a useful strategy when dynamic loading is present.
Swift compilation could proceed at a rate of about 2000 lines of code per second with all optimizations on, except escape analysis: turning escape analysis on slowed the compilation by about 20-40%.

Monday, March 30, 2015

Today we continue reading the WRL research report on the Swift Java compiler. We were discussing register allocation and solving it by means of graph coloring. To summarize, the steps involved were:
Insert copies
Precolor
Construct the bias graph
Construct the interference graph
Compute coloring order
Color values
If some values failed to be colored:
 - spill uncolored values to the stack
 - repeat from constructing the interference graph
Cleanup

We saw how each of these steps mattered in solving the register allocation problem, specifically how the copies help when a value can be in more than one register. We saw how precoloring helps with the register allocation of method parameters and return values. The bias graph establishes edges between values that we would like to color the same, while the interference graph establishes edges between nodes that cannot be colored the same; in doing so, it encapsulates all the legal coloring assignments for the values. We saw how to apply a coloring heuristic where the hard nodes are colored first and the easy nodes last, with difficulty measured by the degree of a node in the interference graph. The nodes are then colored in the order computed. The bias graph is used to make an intelligent choice of a color from the set of legal colorings allowed by the interference graph. If the coloring does not succeed, we spill the uncolored values by inserting a spill value just after each definition and a restore value before each use, which makes those nodes easier to color on the next pass. Finally, when the coloring has succeeded, data flow is used to eliminate unnecessary copies.
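Here is a minimal sketch of that ordering and the bias-directed color choice, with made-up data structures (values as integers, registers as colors 0..numRegs-1). It illustrates the heuristic only; it is not Swift's implementation.

import java.util.*;

class ColoringSketch {
    // interference.get(v): values that must not share v's register.
    // bias.get(v): values we would prefer to give the same register.
    static Map<Integer, Integer> color(int numRegs,
                                       Map<Integer, Set<Integer>> interference,
                                       Map<Integer, Set<Integer>> bias) {
        // Hard nodes first: decreasing degree in the interference graph.
        List<Integer> order = new ArrayList<>(interference.keySet());
        order.sort((a, b) -> interference.get(b).size() - interference.get(a).size());

        Map<Integer, Integer> colors = new HashMap<>();
        for (int v : order) {
            boolean[] used = new boolean[numRegs];
            for (int n : interference.get(v)) {
                Integer c = colors.get(n);
                if (c != null) used[c] = true;
            }
            // Prefer a legal color already given to a bias neighbor.
            int chosen = -1;
            for (int n : bias.getOrDefault(v, Set.of())) {
                Integer c = colors.get(n);
                if (c != null && !used[c]) { chosen = c; break; }
            }
            for (int c = 0; chosen < 0 && c < numRegs; c++) {
                if (!used[c]) chosen = c;
            }
            if (chosen >= 0) colors.put(v, chosen);
            // else: v stays uncolored; it would be spilled and the pass repeated.
        }
        return colors;
    }
}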
We next look at code generation. Swift's code generation pass translates SSA operations into machine code. The operations remaining in the SSA graph at this point each correspond to zero or one Alpha instructions. Code generation involves computing the stack frame size, emitting the prolog code, emitting the code for each block in the order determined by the scheduling pass, emitting a branch when a block's successor is not the immediately following block, emitting the epilog code, and emitting auxiliary information, including a list of relocation entries, associated constants, an exception table, and a bytecode map. The branches that are actually necessary are identified, and the final code location for each branch target is determined.
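A rough sketch of that emission loop follows (the Block type is hypothetical, and real code generation emits machine code rather than text): blocks are emitted in schedule order, and a branch is emitted only when a block's successor is not the fall-through block.

import java.util.*;

class Block {
    String name;
    Block successor;   // simplified: a single successor or null
    Block(String name) { this.name = name; }
}

class CodeGenSketch {
    static void emit(List<Block> schedule) {
        System.out.println("prolog");
        for (int i = 0; i < schedule.size(); i++) {
            Block b = schedule.get(i);
            System.out.println(b.name + ":");
            Block next = i + 1 < schedule.size() ? schedule.get(i + 1) : null;
            if (b.successor != null && b.successor != next) {
                // Successor is not the fall-through block, so branch to it.
                System.out.println("  br " + b.successor.name);
            }
        }
        System.out.println("epilog");
    }
}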

Sunday, March 29, 2015

Today we continue reading the WRL research report on the Swift Java compiler. We were discussing register allocation and solving it by means of graph coloring. Today we continue with the order of coloring. The bias graph is used to make intelligent choices of a color from the set of legal colorings allowed by the interference graph. An uncolored node is given the same color as another node only if the intervening nodes can be colored the same. If the coloring does not succeed, then we spill values to the stack. The value corresponding to each node that was not colored is spilled onto the stack by inserting a spill value just after its definition and a restore value before each use. This lets the original value and the newly added restore values each be in a register over a shorter range, which will hopefully make them easier to color on the next pass.
A final cleanup pass is necessary after the coloring succeeds, to remove copies that have the same source and destination and to remove unnecessary restore operations. This pass does a data-flow computation to determine which value each register holds after each instruction. This enables optimizations such as replacing an input value of an instruction with the oldest copy that is still in a register.
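A toy version of that bookkeeping (my own simplification, modeling only register-to-register copies): track the abstract value each register holds and drop any copy whose destination already holds the source's value.

import java.util.*;

class CopyCleanupSketch {
    // Each element of copies is {srcReg, dstReg}. A register not seen before
    // is assumed to hold its own fresh value (a simplification).
    static List<int[]> clean(List<int[]> copies) {
        Map<Integer, Integer> valueOf = new HashMap<>(); // register -> value id
        List<int[]> kept = new ArrayList<>();
        for (int[] c : copies) {
            int src = c[0], dst = c[1];
            int v = valueOf.computeIfAbsent(src, r -> r + 1000);
            Integer dv = valueOf.get(dst);
            if (dv != null && dv == v) continue; // destination already holds the value
            valueOf.put(dst, v);
            kept.add(c);
        }
        return kept;
    }
}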
#codingexercise
double GetAllNumberRangeProductCubeRootPowerSeven(Double[] A)
{
    if (A == null) return 0;
    return A.AllNumberRangeProductCubeRootPowerSeven();
}
#codingexercise
double GetAllNumberRangeProductCubeRootPowerNine(Double[] A)
{
    if (A == null) return 0;
    return A.AllNumberRangeProductCubeRootPowerNine();
}