Friday, April 10, 2015

Today we continue reading the paper on the design of streaming proxy systems.
We discussed the uniformly and exponentially segmented media objects. We talked about prefetching and the minimum buffer size for such media. The minimum buffer size ensures low resource usage, and prefetching gives the scheduling point; it does not mean that jitter can be avoided in all cases. The uniformly segmented media object has an advantage over the exponentially segmented object: it enables in-time prefetching that can begin at a later stage. Even so, continuous media streaming is still not guaranteed. One suggestion is to keep enough segments cached. This leads us to define the prefetching length as the minimum length of data that must be cached in the proxy in order to guarantee continuous delivery when Bs > Bt, where Bs is the average encoding (playback) rate and Bt is the average network bandwidth. Prefetching is not necessary when Bs < Bt. The prefetching length aggregates the cached segment lengths without breaks, so we can calculate the number of segments m required for continuous delivery. For uniformly segmented media objects, every segment has the same length; for exponentially segmented media objects, each cached segment is twice the length of the previous one. We then reviewed the tradeoff between low proxy jitter and high byte-hit ratio, and the tradeoff between byte-hit ratio and delayed startup ratio.
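To make this concrete, here is a minimal C# sketch for computing m, assuming the continuity condition reduces to requiring the cached prefix to cover the bandwidth deficit, i.e. prefetching length >= (1 - Bt/Bs) * total length. The helper name and this closed form are my own illustration and may differ from the paper's exact formulation.

using System.Linq;

static int SegmentsForContinuousDelivery(double[] segmentLengths, double Bs, double Bt)
{
    // Prefetching length: the minimum cached data needed when Bs > Bt.
    double total = segmentLengths.Sum();
    double required = (1 - Bt / Bs) * total;
    double cached = 0;
    for (int m = 0; m < segmentLengths.Length; m++)
    {
        cached += segmentLengths[m];
        if (cached >= required) return m + 1;   // m+1 segments suffice
    }
    return segmentLengths.Length;
}

For a uniform object every entry of segmentLengths is L1; for an exponential object each entry doubles the previous one.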
#codingexercise

// requires: using System; using System.Linq;
static double GetEvenNumberRangeSumCubeRootPowerFourteen(double[] A)
{
    if (A == null) return 0;
    // Sum x^(14/3) (the cube root raised to the fourteenth power) over the
    // even values; values are assumed non-negative for the fractional power.
    return A.Where(x => x % 2 == 0).Sum(x => Math.Pow(x, 14.0 / 3.0));
}
We will take a short break now.

Thursday, April 9, 2015

Today we will continue our discussion on the design of streaming proxy systems. We were discussing active prefetching. Prefetching schemes can reduce proxy jitter by fetching uncached segments before they are accessed. We discussed the cases of both uniformly and exponentially segmented media objects. For the uniformly segmented scheme, the segments take an equal amount of time to deliver. Consequently, the segments up to the ratio Bs/Bt cause proxy jitter. This threshold is determined from the latest point at which a segment must be fetched. Recall that this position is chosen such that the time it takes to prefetch the segment does not exceed the time it takes to deliver the rest of the cached data plus the fetched data. The minimum buffer size is calculated accordingly as (1 - Bt/Bs) * L1. This holds for all three ranges, namely the first cached segment, the cached segments up to the threshold, and the cached segments beyond the threshold.
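In code, a minimal sketch of these two quantities for the uniform case, with hypothetical helper names (L1 is the common segment length):

// requires: using System;
static double MinBufferUniform(double L1, double Bs, double Bt)
{
    // Minimum proxy buffer to avoid jitter, per (1 - Bt/Bs) * L1 above.
    return (1 - Bt / Bs) * L1;
}

static int JitterThresholdUniform(double Bs, double Bt)
{
    // Segments up to roughly Bs/Bt cannot be prefetched in time.
    return (int)Math.Ceiling(Bs / Bt);
}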
In the case of the exponentially segmented object, a similar analysis can be done. Here, we assume Bs < 2 * Bt; when that does not hold, no prefetching of uncached segments can be in time for the exponentially segmented objects. If n is the number of cached segments, then for n = 0 we have to prefetch up to the segment at threshold 1 + log2(1/(2 - Bs/Bt)) to avoid proxy jitter thereafter. The minimum buffer size is calculated by using this threshold in the same kind of calculation as above. For n > 0 and less than the threshold, the proxy starts to prefetch the threshold segment once the client starts to access the object. Jitter is unavoidable between the (n+1)th segment and the threshold segment, and the minimum buffer size is Li * Bt/Bs, where Li is the length of the threshold segment. For segments beyond the threshold, the prefetching of the (n+1)th segment starts when the client accesses the first 1 - (2^n/(2^n - 1)) * (Bt/Bs - 1) portion of the first n cached segments. The minimum buffer size is L(n+1) * Bt/Bs and increases exponentially for later segments.
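A small sketch of that threshold computation, guarding the Bs >= 2 * Bt case; the method name is illustrative:

// requires: using System;
static double ExponentialThreshold(double Bs, double Bt)
{
    if (Bs >= 2 * Bt)
        throw new ArgumentException("No in-time prefetching is possible when Bs >= 2 * Bt.");
    // Threshold segment index: 1 + log2(1 / (2 - Bs/Bt)).
    return 1 + Math.Log(1.0 / (2 - Bs / Bt), 2);
}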

Wednesday, April 8, 2015

We continue reading the paper on the design of high quality streaming proxy systems. We were reviewing active prefetching. For a media object with uniform segmentation, we calculated the minimum buffer length to be the same in all three cases: the first segment, the segments up to the Bs/Bt threshold, and the segments thereafter. We also found that proxy jitter is unavoidable up to the threshold.
We now do active prefetching for the exponentially segmented object. Here we assume Bs < 2 * Bt, that is, the average encoding rate of a segment is less than twice the average network bandwidth. When Bs >= 2 * Bt, no prefetching of the uncached segments can be in time for the exponentially segmented objects.
For the case with no segment cached, proxy jitter is inevitable.
For the case where the number of cached segments n is between 0 and the threshold 1 + log2(1/(2 - Bs/Bt)), the proxy starts to prefetch the threshold segment once the client starts to access the object. When the client accesses the segments between n+1 and the threshold, proxy jitter becomes inevitable, and the minimum buffer size is the length of the threshold segment times Bt/Bs.
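As a quick check of the threshold formula, using the ExponentialThreshold sketch from the entry above (a hypothetical helper):

// With Bs/Bt = 1.5: 1 + log2(1 / (2 - 1.5)) = 1 + log2(2) = 2,
// so prefetching must reach the second segment to avoid further jitter.
Console.WriteLine(ExponentialThreshold(1.5, 1.0));   // prints 2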

#codingexercise
// requires: using System; using System.Linq;
static double GetAllNumberRangeSumCubeRootPowerFourteen(double[] A)
{
    if (A == null) return 0;
    // Sum x^(14/3) (the cube root raised to the fourteenth power) over all values.
    return A.Sum(x => Math.Pow(x, 14.0 / 3.0));
}

#codingexercise
// requires: using System; using System.Linq;
static double GetAllNumberRangeProductCubeRootPowerSixteen(double[] A)
{
    if (A == null) return 0;
    // Multiply x^(16/3) (the cube root raised to the sixteenth power) over all values.
    return A.Aggregate(1.0, (p, x) => p * Math.Pow(x, 16.0 / 3.0));
}

Tuesday, April 7, 2015

We discuss active prefetching from the paper "Designs of High Quality Streaming Proxy Systems" by Chen, Wee and Zhang. The objective of active prefetching is to determine when to fetch which uncached segment so that proxy jitter is minimized. The paper assumes that the media objects are segmented, that the bandwidth is sufficient to stream the object smoothly, and that each segment can be fetched over a unicast channel. Each media object has its inherent encoding rate - this is the playback rate, denoted by its average value. Data transmission rates from prior sessions are also recorded.
For a requested media object with n segments cached in the proxy, the objective is to schedule the prefetching of the (n+1)th segment so that proxy jitter is avoided. When the client is at position x, the length of the data yet to be delivered is L - x. Taking Bs as the average playback rate and Bt as the average data transfer rate, we avoid proxy jitter when the time to deliver the remaining cached data of the n segments plus the (n+1)th segment at rate Bs is at least the time to fetch the (n+1)th segment at rate Bt.
This is a way of saying that the prefetch time must not exceed the delivery time. From this inequality, the position can be varied; the latest prefetch scheduling point is the one where the data arrives just in time to meet demand, and the buffer size then reaches its minimum.
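A minimal C# sketch of that scheduling point, solving the equality case of the inequality above for x; the names are hypothetical and all lengths use the same unit as the rates:

// requires: using System; using System.Linq;
static double LatestPrefetchPoint(double[] cachedSegmentLengths, double nextSegmentLength, double Bs, double Bt)
{
    double cachedLen = cachedSegmentLengths.Sum();
    // Solve (cachedLen - x + Lnext) / Bs = Lnext / Bt for x.
    double x = cachedLen + nextSegmentLength * (1 - Bs / Bt);
    // x < 0 would mean prefetching must begin before playback starts.
    return Math.Max(0, x);
}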
Determining the prefetch scheduling point should then be followed by a prefetching scheme and its resource requirements.
If the media object is uniformly segmented, we can determine the minimum buffer size required to avoid proxy jitter. There are three ranges of interest: the first segment, the segments up to the ratio Bs/Bt, and the segments thereafter. All three have the same minimum buffer size, but they may or may not be able to avoid proxy jitter. The threshold for the number of segments we need to prefetch is Bs/Bt.

Sunday, April 5, 2015

Http(s) Video Proxy and VPN services 
  
Video content from providers such as Netflix, Hulu or even YouTube is not available everywhere. Furthermore, we may want anonymity when viewing the video. Together these present a challenge: getting high quality video streaming with the same level of privacy as, say, the Tor project. 
  
In this document, we will review some options and talk about the designs of a streaming proxy.  
  
First of all, we should be specific about the video content. Most providers only do an IP-level check to determine whether they should restrict viewing. This can easily be tricked by one of the following methods: 
1)   VPN – We can easily create an IP tunnel over the internet to a domain where we can access the video content. This poses little or no risk over the internet, and the quality may be as good as local, given the ubiquity of this technique in the workplace. 
2)   Proxying – We can hide our IP address from the machine serving the video content and verify the change on sites that look up our IP address. By doing so, we trick the providers into thinking we are local to a country where the service is unrestricted. 
However, neither of these is guaranteed to work in all cases, for reasons such as: 
1)   we may not be at liberty to use the workplace VPN service to watch internet content that is not related to the workplace 
2)   even if we do hide our IP address, internet service providers may take issue with the strategy, or existing address translations may affect our viewing 
3)   these approaches may require buffering or caching, which does not work well for live video 
4)   even proxy caching strategies such as segment-based caching only partially cache the video content 
5)   we may still see startup latency or be required to start and stop the content again and again 
6)   and the delay introduced by the proxy, aka proxy jitter, interrupts continuous streaming 
  
Let us now look at some strategies to overcome this. 
There are really two problems to tackle: 
First, media content broken into segments requires segment-based proxy caching strategies. Some of these strategies reduce the startup latency seen by the client; they do so by giving higher priority to caching the beginning segments of media objects. The other type of strategy aims to improve the operational efficiency of the proxy by improving the byte-hit ratio. The highest byte-hit ratio can be assumed to be achieved when segmentation is delayed as late as possible, until some realtime access information can be collected. 
None of the segmentation strategies can by themselves ensure continuous streaming delivery to the client. The proxy has to fetch and relay the uncached segments whenever necessary, and any delay results in proxy jitter, something that affects the client right away and is very annoying. 
Reducing this proxy jitter is the foremost priority. This is where the different prefetching schemes come in. One way is to keep a prefetching window and fill in the missing data. 
The trouble is that improving the byte-hit ratio and reducing proxy jitter conflict with each other. Proxy jitter occurs if the prefetching of uncached segments is delayed; aggressive prefetching, on the other hand, reduces proxy efficiency, and prefetched segments may even be thrown away. That is why there is a tendency to prefetch uncached segments as late as possible. Secondly, improving the byte-hit ratio also conflicts with reducing the delayed startup ratio. 
Chen, Wee and Zhang in their paper “Designs of High Quality Streaming Proxy Systems” discuss an active prefetching technique that they use to solve proxy jitter. They also improve the lazy segmentation scheme, which addresses the conflict between startup latency and byte-hit ratio. 
#codingexercise
// requires: using System; using System.Linq;
static double GetAllNumberRangeProductCubeRootPowerFourteen(double[] A)
{
    if (A == null) return 0;
    // Multiply x^(14/3) (the cube root raised to the fourteenth power) over all values.
    return A.Aggregate(1.0, (p, x) => p * Math.Pow(x, 14.0 / 3.0));
}

Saturday, April 4, 2015

Today we start to wrap up reading the WRL research report on the Swift Java compiler. We were discussing the results of the performance studies, specifically from global CSE, class hierarchy analysis, method inlining, method splitting, field analysis, etc., and we were looking at those that dominated across most applications as well as those that helped in specific cases.
We now look at related work, comparing Swift to Marmot, a research compiler from Microsoft; BulletTrain, a commercial compiler; HotSpot, another commercial compiler, this one from Sun; TurboJ, which compiles by translating to another language; and work by Diwan, Cytron and others.
Marmot does a form of class hierarchy analysis but has little intraprocedural analysis or code motion, and it does not do instruction scheduling. Moreover, its IR is not SSA-based as Swift's is. This is a significant difference, and we consequently set aside other such compilers, such as Jalapeño.
BulletTrain uses SSA for its IR and does check elimination, loop unrolling, type propagation and method inlining. HotSpot dynamically compiles code that is frequently executed and can use runtime profiling information; it also does method inlining based on CHA. TurboJ translates to C for compilation by a C compiler and can do method resolution, inlining, CSE and code motion during the translation.
Marmot keeps memory operations in order except for promoting loads out of loops. Jalapeño builds an instruction-level dependence graph, but it is not available until late in the compilation. Diwan uses type-based alias analysis but does not incorporate the results into the SSA graph. Cytron represents alias information in an SSA graph by explicitly inserting calls that may modify values if the associated operation may modify the value. The difference between this strategy and Swift's is that Cytron's can greatly increase the size of the SSA graph, whereas Swift enforces strict memory ordering via the global store inputs and relaxes dependences where it can prove that there are no aliases.
Diwan uses a form of aggregate analysis to detect when a polymorphic data structure is used in only one way. For example, it can show that a linked list of general objects may in fact contain only objects of a certain class or its subclasses. Swift's field analysis is more comprehensive and determines the exact types. Dolby and Chien describe an object inlining optimization for C++ programs that does context-sensitive interprocedural analysis, but it takes minutes compared to the seconds that Swift takes. Moreover, Swift allows objects to be inlined even when there is no local reference; this is usually referred to as unboxing and exists in functional languages. Lastly, Swift has exploited field properties to do more escape analysis than the others. In this sense, Swift can claim to be a fairly complete compiler.
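As a quick illustration of object inlining (my own sketch in C#-like code, not from the paper): the optimization replaces a field that refers to a child object with the child's fields laid out directly in the parent, removing an allocation and an indirection on every access.

// Before: each Rectangle allocates two Point objects and pays a
// pointer indirection on every coordinate access.
class Point { public int X, Y; }
class Rectangle { public Point TopLeft = new Point(); public Point BottomRight = new Point(); }

// After object inlining: the coordinates are stored directly in the
// parent, so accesses become plain field loads.
class RectangleInlined { public int TopLeftX, TopLeftY, BottomRightX, BottomRightY; }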

We will close this study of the WRL research report on Swift Java compiler with the conclusion section from the paper next.

#codingexercise
// requires: using System; using System.Linq;
static double GetAllNumberRangeProductCubeRootPowerTwelve(double[] A)
{
    if (A == null) return 0;
    // Multiply x^(12/3) = x^4 over all values.
    return A.Aggregate(1.0, (p, x) => p * Math.Pow(x, 12.0 / 3.0));
}

As we have seen, the Swift IR simplified many aspects of the compiler, and the use of SSA form made it easy to express optimizations. The Swift IR includes machine-dependent operations, and this allows all passes to operate directly on the SSA form. The manipulation operations on the SSA graph and CFG are common to all these passes.
Swift makes extensive use of interprocedural analysis. The most effective optimizations in Swift are method inlining, class hierarchy analysis, and global CSE. Swift also introduced field analysis and store resolution. Much of the overhead in Java appears to result from the object-oriented style, which results in greater memory latencies. There is room to improve optimizations and increase performance with such techniques as prefetching, co-locating objects, or more aggressive object inlining.

Friday, April 3, 2015

Today we continue reading the WRL research report on the Swift Java compiler. We were discussing the results of the performance studies, specifically from global CSE, class hierarchy analysis, method inlining, method splitting, field analysis, etc. We discussed how stack allocation and synchronization removal could matter on a case by case basis. Today we continue with store resolution. Programs such as compress and mtrt have important loops where memory operations and runtime checks are optimized only when memory dependences are relaxed. In the case of the compress program, sign-extension elimination is effective because the program does many memory accesses to byte arrays. Branch removal is especially effective in one program because that program made heavy use of a method that computes a boolean.
A study was made to count the number of null checks and bounds checks executed in each application. The count was made with all optimizations except CSE, check elimination, field analysis and loop peeling, and then counts were made with each of them successively added. It was found that CSE is highly effective in eliminating null checks but does not eliminate any bounds checks. Similarly, loop peeling eliminates a large number of null checks in several applications but does not remove bounds checks. This is because bounds checks are typically not loop invariant.
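To see why loop peeling helps null checks but not bounds checks, here is an illustrative source-level sketch (the compiler performs this on its IR, not on source):

// Before peeling: conceptually, every iteration re-checks that 'a' is non-null.
static double Sum(double[] a)
{
    double sum = 0;
    for (int i = 0; i < a.Length; i++) sum += a[i];   // null check each iteration
    return sum;
}

// After peeling the first iteration, the null check happens once up front;
// later iterations can omit it because 'a' does not change. The bounds
// check i < a.Length is not loop invariant, so it must remain.
static double SumPeeled(double[] a)
{
    double sum = 0;
    if (a.Length == 0) return sum;   // the implicit null check happens here, once
    sum += a[0];                     // peeled first iteration
    for (int i = 1; i < a.Length; i++) sum += a[i];   // null check eliminated
    return sum;
}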
In order to find the maximum benefit, that is, the cost of the runtime checks that Swift could not eliminate, all runtime checks were removed, with a possible loss of correctness, and a comparison was made between full optimization and replacement of the remaining runtime checks with pin operations. The pin operations are moved upwards as much as possible without going past a branch or a memory store operation. It was found that the remaining runtime checks that Swift could not eliminate cost about 10-15%.
The effects of several optimizations on the number of virtual method and interface calls were also studied. A plot was drawn of the total number of unresolved calls when only virtual calls to private or final methods were resolved. This was repeated by successively adding resolutions from type propagation, CHA, field analysis and method return values. Again it was clear that CHA had the most impact. In one case, method return values resolved calls and improved the performance of that program.
Type propagation, field analysis and method return values have a small impact compared to CHA.
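To illustrate what CHA-based call resolution does (an illustrative C# sketch, not code from the paper): when the loaded class hierarchy contains only one implementation of a method, a virtual call site can be converted into a direct call and then inlined.

// requires: using System;
abstract class Shape { public abstract double Area(); }

sealed class Circle : Shape
{
    public double R;
    public override double Area() => Math.PI * R * R;
}

// Virtual dispatch:        double a = shape.Area();
// After CHA resolution:    double a = ((Circle)shape).Area();   // direct call, inlinable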