Wednesday, April 8, 2015

We continue reading the paper on the design of high quality streaming proxy systems. We were reviewing active prefetching. For a media object with uniform segmentation, we calculated the minimum buffer length to be the same in all three cases: the first segment, the segments up to the threshold Bs/Bt, and the segments thereafter. We also found that proxy jitter is unavoidable for segments up to that threshold.
We now do active prefetching for an exponentially segmented object. Here we assume Bs < 2 Bt, i.e. the average encoding rate of a segment is less than twice the average network bandwidth. When Bs >= 2 Bt, no prefetching of the uncached segments can be in time for exponentially segmented objects.
When no segment is cached, proxy jitter is inevitable.
When the number of cached segments n is between 0 and the threshold 1 + log2(1 / (2 - Bs/Bt)), the proxy starts to prefetch the next segment as soon as the client starts to access the object. When the client accesses segments between n+1 and the threshold, proxy jitter becomes inevitable, and the minimum buffer size is the length of the segment at the threshold multiplied by Bt/Bs.
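As a quick sanity check on the threshold formula above, here is a small sketch (ours, not from the paper; the function name and signature are invented for illustration) that computes 1 + log2(1 / (2 - Bs/Bt)) and reports the hopeless case Bs >= 2 Bt:

```python
import math

def prefetch_threshold(Bs, Bt):
    """Cached-segment threshold for an exponentially segmented object,
    per the formula in the text: 1 + log2(1 / (2 - Bs/Bt)).
    Bs = average encoding (playback) rate, Bt = average transfer rate."""
    if Bs >= 2 * Bt:
        # no prefetching of uncached segments can be in time
        return None
    return 1 + math.log2(1.0 / (2.0 - Bs / Bt))

# Example: encoding rate 1.5x the network bandwidth
print(prefetch_threshold(1.5, 1.0))  # 2.0
```

With Bs/Bt = 1.5 the threshold works out to exactly 2 cached segments; as Bs/Bt approaches 2 the threshold grows without bound, matching the observation that prefetching can never catch up once Bs >= 2 Bt.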

#codingexercise
double GetAllNumberRangeSumCubeRootPowerFourteen(double[] A)
{
    if (A == null) return 0;
    // sum of x^(14/3), i.e. the cube root of each element raised to the 14th power
    return A.Sum(x => Math.Pow(x, 14.0 / 3.0));
}

#codingexercise
double GetAllNumberRangeProductCubeRootPowerSixteen(double[] A)
{
    if (A == null) return 0;
    // product of x^(16/3) over all elements
    return A.Aggregate(1.0, (p, x) => p * Math.Pow(x, 16.0 / 3.0));
}

Tuesday, April 7, 2015

We discuss active prefetching from the paper "Designs of High Quality Streaming Proxy Systems" by Chen, Wee and Zhang. The objective of active prefetching is to determine when to fetch which uncached segment so that proxy jitter is minimized. The paper assumes that the media objects are segmented, that the bandwidth is sufficient to stream the object smoothly, and that each segment can be fetched over a unicast channel. Each media object has its inherent encoding rate - this is the playback rate and is denoted by an average value. Data transmission rates from prior sessions are also recorded.
For a requested media object with n segments cached in the proxy, the objective is to schedule the prefetching of the (n+1)th segment so that proxy jitter is avoided. Let Bs be the average playback (encoding) rate and Bt the average data transfer rate. When the client is at position x, the length of the data still to be delivered from a segment of length L is L - x. Proxy jitter is avoided when the time to play back the remaining lengths of the first n segments plus the (n+1)th segment, at rate Bs, is at least the time to transfer the (n+1)th segment at rate Bt.
This is a way of saying that the prefetch time must not exceed the delivery time. From this inequality, the position x can be varied: the latest prefetch scheduling point is the one where the arriving data is just sufficient to meet playback demand, and the buffer size then reaches its minimum.
Determining the prefetching scheduling point should then be followed by a prefetching scheme and resource requirements.
If the media object is uniformly segmented, we can determine the minimum buffer size required to avoid proxy jitter. There are three ranges of interest: the first segment, the segments up to the ratio Bs/Bt, and the segments thereafter. All three have the same minimum buffer size, but they may or may not be able to avoid proxy jitter. The threshold for the number of segments we need prefetched is Bs/Bt.
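The latest prefetch scheduling point for the uniform case can be sketched directly from the inequality above. This is our own illustrative code, not the paper's; the variable names (n cached segments of equal length L, playback rate Bs, transfer rate Bt) are assumptions made for the sketch:

```python
def latest_prefetch_point(n, L, Bs, Bt):
    """Latest playback position x at which the proxy can start
    prefetching segment n+1 so that ((n+1)*L - x)/Bs >= L/Bt,
    i.e. the playback time of the data still to be delivered
    covers the transfer time of segment n+1."""
    x = (n + 1) * L - L * Bs / Bt
    # if x <= 0, the prefetch must begin at startup (n+1 <= Bs/Bt,
    # i.e. below the threshold), so clamp to the start of playback
    return max(x, 0.0)

# 3 segments of length 10 cached, playback at twice the transfer rate:
print(latest_prefetch_point(3, 10.0, 2.0, 1.0))  # 20.0
```

Note how the clamp at zero recovers the threshold from the text: with fewer than Bs/Bt segments cached, there is no position late enough, and prefetching has to start the moment playback does.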

Sunday, April 5, 2015

Http(s) Video Proxy and VPN services 
  
Video content from providers such as Netflix, Hulu or even YouTube is not available everywhere. Furthermore, we may want anonymity when viewing the video. Together these pose a challenge: getting high quality video streaming with the same level of privacy as, say, the Tor project. 
  
In this document, we will review some options and talk about the designs of a streaming proxy.  
  
First of all, we should be specific about the video content. Most providers only do an IP level check to determine whether they should restrict viewing. This can easily be circumvented by one of the following methods: 
1)   VPN – We can easily create an IP tunnel over the internet to the domain where we can access the video content. This poses little or no risk over the internet and the quality may be as good as local given the ubiquity of such a technique in workplace. 
2)   Proxying – We can hide our IP address from the machine that we want to view the video content and check the same on sites that offer to look up our ip address.  By doing so, we trick the providers into thinking we are local to the country where the service may be unrestricted. 
However both of these are not necessarily guaranteed to be a working option in most cases for reasons such as: 
1)   we may not be at liberty to use the workplace VPN service to watch internet content that is not related to workplace 
2)   even if we do hide our IP address, most Internet service providers may have issues with such a strategy, or there might already be address translations that affect our viewing.  
3)   They may require buffering or caching and this does not work well for live video. 
4)   Even proxy caching strategies such as segment based are actually partially caching video content. 
5)   We may still see startup latency or be required to start/stop the content again and again. 
6)   And then the delay by the proxy aka proxy jitter affects continuous streaming 
  
Let us now look at some strategies to overcome this. 
There are really two problems to tackle: 
First, media content broken into segments requires segment-based proxy caching strategies. Some of these strategies reduce the startup latency seen by the client by giving higher priority to caching the beginning segments of media objects. The other type of strategy aims to improve the operational efficiency of the proxy by improving the byte hit ratio. The highest byte hit ratio can be assumed to be achieved when segmentation is delayed as long as possible, until some realtime access information can be collected. 
None of the segmentation strategies can automatically ensure continuous streaming delivery to the client. The proxy has to fetch and relay the uncached segments whenever necessary, and any delay results in proxy jitter, something that affects the client right away and is very annoying. 
Reducing this proxy jitter is foremost priority. This is where different pre-fetching schemes are involved.  One way is to keep the pre-fetching window and fill in the missing data.  
The trouble is that improving the byte hit ratio and reducing proxy jitter conflict with each other. Proxy jitter occurs if the prefetching of uncached segments is delayed. Aggressive prefetching, on the other hand, reduces proxy efficiency: prefetched segments may even be thrown away. That is why there is a tendency to prefetch uncached segments as late as possible. Secondly, improving the byte hit ratio also conflicts with reducing the delayed startup ratio. 
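A toy model (our own construction, not from any paper) makes the late-prefetch risk concrete: delaying the fetch saves proxy work if the segment turns out not to be needed, but if the transfer finishes after the cached data runs out, the difference is exactly the jitter the client experiences.

```python
def proxy_jitter(cached_seconds, seg_seconds, Bs, Bt, start_delay):
    """Toy model of the late-prefetch tradeoff.  The proxy waits
    start_delay seconds after playback begins before fetching the
    next uncached segment (seg_seconds of playback time).  The
    transfer takes seg_seconds * Bs / Bt; the client stalls for
    however long the transfer finishes after cached data runs out."""
    transfer_time = seg_seconds * Bs / Bt
    arrival = start_delay + transfer_time
    return max(0.0, arrival - cached_seconds)

print(proxy_jitter(30.0, 10.0, 1.0, 1.0, 10.0))  # 0.0 (fetched in time)
print(proxy_jitter(30.0, 10.0, 1.0, 1.0, 25.0))  # 5.0 seconds of jitter
```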
Chen, Wee and Zhang, in their paper "Designs of High Quality Streaming Proxy Systems", discuss an active prefetching technique to address proxy jitter. They also propose an improved lazy segmentation scheme, which addresses the conflict between startup latency and byte hit ratio.  
#codingexercise
double GetAllNumberRangeProductCubeRootPowerFourteen(double[] A)
{
    if (A == null) return 0;
    // product of x^(14/3) over all elements
    return A.Aggregate(1.0, (p, x) => p * Math.Pow(x, 14.0 / 3.0));
}

Saturday, April 4, 2015

Today we start to wrap up reading the WRL research report on the Swift Java compiler. We were discussing the results of the performance studies, specifically from global CSE, class hierarchy analysis, method inlining, method splitting, field analysis, etc., and we were looking at those that dominated across most applications as well as those that helped in specific cases.
We now look at related work, specifically comparing Swift to Marmot, a research compiler from Microsoft; BulletTrain, a commercial compiler; HotSpot, another commercial compiler, this one from Sun; TurboJ, which compiles via translation to C; and the work of Diwan, Cytron and others.
Marmot does a form of class hierarchy analysis but has little intraprocedural analysis or code motion, and does not do instruction scheduling. Moreover, its IR is not SSA-based, unlike Swift's. This is a significant difference, and similar considerations apply to other such compilers, such as Jalapeño.
BulletTrain uses SSA for its IR and even does check elimination, loop unrolling, type propagation and method inlining. HotSpot dynamically compiles code that is frequently executed and can use runtime profiling information. It also does method inlining based on CHA. TurboJ translates to C for compilation by a C compiler and can do method resolution, inlining, CSE and code motion during the translation.
Marmot keeps memory operations in order, except for promoting loads out of loops. Jalapeño builds an instruction-level dependence graph, but it is not available until later. Diwan uses type-based alias analysis but does not incorporate the results into the SSA graph. Cytron represents alias information in an SSA graph by explicitly inserting calls that may modify values if the associated operation may modify the value. The difference between this strategy and Swift's is that Cytron's can greatly increase the size of the SSA graph, whereas Swift enforces strict memory ordering via the global store inputs and relaxes dependences only where it can prove that there are no aliases.
Diwan uses a form of aggregate analysis to detect when a polymorphic data structure is used in only one way. For example, it can show that a linked list of general objects may in fact hold only objects of a certain class or its subclasses. Swift's field analysis is more comprehensive and determines the exact types. Dolby and Chien describe an object inlining optimization for C++ programs that does context-sensitive interprocedural analysis, but it takes minutes, compared to the seconds that Swift takes. Moreover, Swift allows objects to be inlined even when there is no local reference. This is usually referred to as unboxing and exists in functional languages. Lastly, Swift has exploited field properties to do more escape analysis than the others. In this respect, Swift can be considered a complete compiler.

We will close this study of the WRL research report on Swift Java compiler with the conclusion section from the paper next.

#codingexercise
double GetAllNumberRangeProductCubeRootPowerTwelve(double[] A)
{
    if (A == null) return 0;
    // product of x^(12/3) = x^4 over all elements
    return A.Aggregate(1.0, (p, x) => p * Math.Pow(x, 12.0 / 3.0));
}

As we have seen, the Swift IR simplified many aspects of the compiler, and the use of SSA form made it easy to express optimizations. The Swift IR includes machine-dependent operations, and this allows all passes to operate directly on the SSA form. The manipulation operations on the SSA graph and CFG are common to all these passes.
Swift makes extensive use of interprocedural analysis. The most effective optimizations in Swift are method inlining, class hierarchy analysis, and global CSE. Swift also introduced field analysis and store resolution. Much of the overhead in Java appears to result from the object-oriented style, which results in greater memory latencies. There is room to improve optimizations and increase performance with such techniques as prefetching, co-locating objects, or more aggressive object inlining.

Friday, April 3, 2015

Today we continue reading the WRL research report on the Swift Java compiler. We were discussing the results of the performance studies, specifically from global CSE, class hierarchy analysis, method inlining, method splitting, field analysis, etc. We discussed how stack allocation and synchronization removal could matter on a case by case basis. Today we continue with store resolution. Programs such as compress and mtrt have important loops where memory operations and runtime checks are optimized only when memory dependences are relaxed. In the compress program, sign extension elimination is effective because the program does many memory accesses to byte arrays. Branch removal is especially effective in one program because it makes heavy use of a method that involves the computation of a boolean.
A study was made to count the number of null checks and bounds checks executed in each application. The count was first made with all optimizations except CSE, check elimination, field analysis and loop peeling, and then counts were made with each of them successively added. It was found that CSE is highly effective in eliminating null checks but does not eliminate any bounds checks. Similarly, loop peeling eliminates a large number of null checks in several applications but does not remove bounds checks. This is because bounds checks are typically not loop invariant.
To find the maximum possible benefit, which is the cost of the runtime checks that Swift could not eliminate, all remaining runtime checks were removed, with the possible loss of correctness, and a comparison was made between full optimization and replacement of the remaining runtime checks with pin operations. The pin operations are moved upwards as much as possible without going past a branch or a memory store operation. It was found that the remaining runtime checks that Swift could not eliminate cost about 10-15%.
The effects of several optimizations on the number of virtual method and interface calls were also studied. A plot was drawn of the total number of unresolved calls when only virtual calls to private or final methods were resolved. This was repeated by successively adding resolutions from type propagation, CHA, field analysis and method return values. Again it was clear that CHA had the most impact. In one case, resolution based on method return values resolved additional calls and improved the performance of that program.
Type propagation, field analysis and method return values have small impact compared to CHA.
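The loop-invariance point above can be illustrated with a small counting model (ours, not Swift's actual implementation) for a loop that sums an array: the null check on the array does not depend on the loop index and can be hoisted to a single check, while the bounds check changes every iteration and cannot.

```python
def sum_with_checks(arr, n):
    """Counts the runtime checks a naive compiler would emit for a
    loop like `for i in range(n): total += arr[i]`.  The null check
    is loop-invariant, so it is hoisted and performed once; the
    bounds check depends on i, so it runs every iteration."""
    null_checks = bounds_checks = 0
    total = 0
    null_checks += 1            # hoisted: one check instead of n
    if arr is None:
        raise ValueError("null array")
    for i in range(n):
        bounds_checks += 1      # not loop-invariant: depends on i
        if not (0 <= i < len(arr)):
            raise IndexError(i)
        total += arr[i]
    return total, null_checks, bounds_checks

print(sum_with_checks([1, 2, 3, 4], 4))  # (10, 1, 4)
```

This mirrors the measurement in the post: optimizations like CSE and loop peeling remove many null checks but leave the bounds checks behind.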

Thursday, April 2, 2015

Today we continue reading the WRL research report on the Swift Java compiler. We started discussing the results of the performance studies. Among the general results observed, method inlining, class hierarchy analysis and global CSE improved performance across all applications. This is largely because most programs implement several small functions, often with a missing final specifier, and because CHA facilitates method resolution, inlining and escape analysis.
Specific results were also studied. Field analysis had a large impact on some programs, such as mpeg and compress, because it eliminates many null checks and bounds checks on constant-sized buffers and computation arrays. It plays an even more important role in programs such as mtrt, where there are many references to neighbouring elements. In db, Swift successfully inlines a vector object contained within a database entry. Object inlining also inlines some of the smaller arrays used by the mtrt program, helping to eliminate checks. The performance of db improves because a significant comparison routine repeatedly generates and uses an enumeration object that can be stack allocated.
Method splitting has been found very useful in programs like db, because db makes heavy use of the elementAt operation and of file read operations, where there is an ensureOpen check on the file handle. It was also noted that stack allocation does not always improve performance, because of the size and effectiveness of the JVM heap; it would show better results if the heap size were small.
Synchronization removal also showed little or no performance improvement, and was significant only in one program that involved synchronized array and hash data structures.

Wednesday, April 1, 2015

Today we continue reading the WRL research report on the Swift Java compiler. We were comparing general results across a variety of programs. We noted that Swift was plugged into a fast JVM for the study. The compiler could compile 2000 lines of code per second; however, it became slower by 20 to 40% when escape analysis was turned on. This is expected because escape analysis requires several recursive passes and has a cascading effect. We now look at the performance of the applications when one or more of several optimizations are disabled. These optimizations include method inlining, class hierarchy analysis, field analysis, object inlining, method splitting, stack allocation, synchronization removal, store resolution, global CSE, global code motion, loop peeling, runtime check elimination, sign extension elimination and branch removal. If any of these terms sound unfamiliar at this point, it is probably better to revisit the previous posts. These features are not really independent, and yet we are merely interested in the gain from each of them. All the programs listed earlier for comparison were rerun with one feature turned off at a time, and each was assigned a positive value for the slowdown introduced; if there was no slowdown, the value was left blank. Since the features are not mutually independent, there is no cumulative total across these metrics. The numbers are also merely rough indicators, because performance can change when disabling one or more code optimizations. It was observed that method inlining, class hierarchy analysis and global CSE improved almost all the programs without fail, and more so when the program code involved many small methods and virtual methods.