Monday, February 13, 2017

Title: Improvements in stream processing of big data 
Introduction: In my discussion of online versus batch processing of big data, shared here (http://1drv.ms/1OM29ee), I noted that the online mode is made possible by the summation form; however, I took the liberty of assuming that the merge operation of the summation form is linear after each node computes its summation. This could be improved further if the summaries are maintained as Fibonacci heaps, because this data structure offers the following advantages: 
  1. The merge operation takes constant amortized time rather than linear time, and the potential does not change. 
  2. The insert operation takes constant amortized time rather than logarithmic time. 
  3. The decrease-key operation also takes constant amortized time rather than logarithmic time. 
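To make the first two properties concrete, here is a minimal sketch of a Fibonacci heap that supports only insert and merge (no extract-min or consolidation). The class and field names are my own for illustration; a real implementation keeps the root list as a circular doubly linked list, but a plain Python list shows why both operations are constant amortized time: neither one walks the heap.

```python
class FibNode:
    def __init__(self, key):
        self.key = key

class FibHeap:
    """Minimal sketch: insert and merge only (no extract-min/consolidate)."""
    def __init__(self):
        self.roots = []   # root list; a real heap uses a circular doubly linked list
        self.min = None   # pointer to the minimum root
        self.n = 0

    def insert(self, key):
        # O(1) amortized: just append a new singleton tree to the root list
        node = FibNode(key)
        self.roots.append(node)
        if self.min is None or key < self.min.key:
            self.min = node
        self.n += 1
        return node

    def merge(self, other):
        # O(1): concatenate the root lists and keep the smaller minimum
        self.roots.extend(other.roots)
        if other.min is not None and (self.min is None or other.min.key < self.min.key):
            self.min = other.min
        self.n += other.n
        return self
```

This is exactly the shape that matters for the streaming setting above: each node's partial summary is a heap, and merging two summaries touches only the two root lists and two min pointers, not the elements themselves.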
Moreover, Fibonacci heaps have the property that the size of a subtree rooted at a node of degree k is at least the (k+2)th Fibonacci number. This lets us make approximations much faster by approximating at each node. The distribution of the Fibonacci heap also lets us propagate these approximations between heap insertions and deletions, especially when the potential does not change. 
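The subtree-size property can be checked numerically. Since the (k+2)th Fibonacci number is at least the golden ratio raised to the power k, a node of degree k roots a subtree of at least phi^k elements, which bounds every node's degree by log base phi of n. A small sketch:

```python
import math

def fib(n):
    """n-th Fibonacci number: fib(0) = 0, fib(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# A subtree rooted at a node of degree k has size >= fib(k + 2) >= phi**k,
# so in a heap with n nodes every node degree is at most log_phi(n).
phi = (1 + math.sqrt(5)) / 2
for k in range(20):
    assert fib(k + 2) >= phi ** k
```

So for a summary of a million elements, no node has degree much above 29, which is what keeps the per-node approximations cheap.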
In addition to the data structure, the Fibonacci series also plays an interesting role in the algorithm for online processing, since each next computation depends only on the last two steps. In other words, we don't repeat the operations of the online processing in every iteration. We skip some and re-evaluate or adjust the online predictions at the end of the iterations corresponding to Fibonacci numbers. We use Fibonacci numbers rather than another series such as binomial, exponential, or powers of two because, intuitively, we are readjusting our approximations in our online processing, and Fibonacci gives us a way to adjust them based on the previous and current approximations.  
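The skipping scheme described above can be sketched as follows. This is my reading of the post, not a definitive implementation: the cheap summation-form update runs on every item, while the full re-evaluation of the prediction (here, simply a running mean, as a stand-in) happens only at Fibonacci-numbered iterations.

```python
def fib_checkpoints(limit):
    """Yield the Fibonacci-numbered iterations up to limit: 1, 2, 3, 5, 8, ..."""
    a, b = 1, 2
    while a <= limit:
        yield a
        a, b = b, a + b

def online_process(stream):
    """Sketch: cheap per-item updates, full re-evaluation only at
    Fibonacci-numbered iterations (the running mean is a stand-in
    for whatever online prediction the summation form supports)."""
    checkpoints = set(fib_checkpoints(len(stream)))
    total = 0
    prediction = 0.0
    for i, x in enumerate(stream, start=1):
        total += x                   # summation-form update, every step
        if i in checkpoints:         # re-evaluate only at Fibonacci steps
            prediction = total / i
    return prediction
```

Note that the checkpoints thin out as the stream grows, which is precisely why the choice of series matters: it sets how stale the prediction is allowed to become between adjustments.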
Straightforward aggregations using summation forms have the property that the online prediction improves on the prediction from the past in a straight line, using only the previous approximation in the current prediction. However, if we use the predictions from the iterations corresponding to the Fibonacci series, then we refine the online iterations from a simple linear extrapolation into Fibonacci-number-based smoothing. 
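One way to realize this smoothing, under my own assumption about what "Fibonacci number based" means here, is to blend the previous smoothed value with the current approximation using the ratio of consecutive Fibonacci numbers, which converges to 1/phi, roughly 0.618:

```python
def fib_smooth(approximations):
    """Sketch of Fibonacci-weighted smoothing (one interpretation):
    blend the running smoothed value with each new approximation using
    the ratio of consecutive Fibonacci numbers, F_k / F_{k+1} -> 1/phi."""
    a, b = 1, 2
    smoothed = approximations[0]
    for current in approximations[1:]:
        w = a / b                              # 1/2, 2/3, 3/5, 5/8, ... -> 0.618
        smoothed = w * smoothed + (1 - w) * current
        a, b = b, a + b
    return smoothed
```

Unlike a plain linear update, the weight on the history here settles toward the golden ratio rather than staying fixed, which is the sense in which previous and current approximations are both retained.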
The Fibonacci series is also better behaved than an exponential series such as powers of two: its growth ratio approaches the golden ratio (about 1.618) rather than 2, so the re-evaluation points stay denser as the stream grows.
