Wednesday, December 2, 2015

Ofer, Ran, Ohad and Lin's paper focuses on 'large-scale' and 'high-rate' online prediction problems, where parallel and distributed computing is critical to providing a real-time service. They define precisely the problem to which they want to apply online stochastic prediction methods. There is a stream of inputs z1, z2, ..., where each zi is sampled independently from a fixed unknown distribution over a sample space Z. Before observing each zi, a point wi is predicted from a set W. After making the prediction wi, zi is observed and a loss f(wi, zi) is suffered, where f is a predefined loss function. Then zi is used to improve the prediction mechanism for the future, e.g. using a stochastic gradient method. The goal is to accumulate the smallest possible loss as we process the sequence of inputs.

The quality of the predictions is measured using the notion of regret, which is simply the sum, over the inputs, of the difference between the loss of the prediction wi and the loss of the fixed predictor w* that is optimal with respect to the underlying distribution. This regret is a random variable, since it depends on the stochastic inputs zi, each sampled independently from the same fixed unknown distribution, so standard concentration arguments can be applied to it. For simplicity, the expected regret is bounded first, and these results are then used to obtain high-probability bounds on the actual regret. The discussion is also restricted to convex prediction problems, which have a global minimum, such as the ones we have discussed earlier. For example, the quadratic surface defined by a positive definite matrix A is convex, and its minimizer solves the linear equation Ax = b.
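Below is a minimal sketch of this serial prediction loop, assuming a one-dimensional squared loss f(w, z) = (w - z)^2, Gaussian inputs, and a 1/sqrt(i) step size; the distribution, the loss, and the step sizes are illustrative choices, not the paper's setup.

# A minimal sketch of the serial prediction loop described above, assuming a
# one-dimensional squared loss f(w, z) = (w - z)^2, Gaussian inputs, and a
# 1/sqrt(i) step size; these are illustrative choices, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
m = 10_000                 # length of the input stream
mu = 0.3                   # unknown mean of the input distribution
w = 0.0                    # current prediction w_i
w_star = mu                # fixed predictor that is optimal for this loss
regret = 0.0

for i in range(1, m + 1):
    z = rng.normal(loc=mu, scale=1.0)            # z_i is observed after predicting w_i
    regret += (w - z) ** 2 - (w_star - z) ** 2   # loss suffered minus loss of w*
    grad = 2.0 * (w - z)                         # gradient of f(., z_i) at w_i
    w -= 0.5 * grad / np.sqrt(i)                 # stochastic gradient step

print(f"cumulative regret over {m} inputs: {regret:.1f}")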
The distributed computing system in this case is modeled as a set of k nodes, each an independent processor, together with a network that enables the nodes to communicate with each other. Each node receives its own incoming stream of examples from an outside source, so the workload is split across the nodes, and each node performs the same operations under the algorithm; a toy sketch of this model follows below.
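Here the k workers are simulated as plain Python objects and the stream is split round-robin; both the split and the reuse of the squared-loss update from the previous sketch are illustrative assumptions.

# A toy sketch of the distributed model: k nodes, each an independent worker
# that receives its own share of the stream and runs the same serial update.
# The round-robin split and the Node class are illustrative assumptions.
import numpy as np

class Node:
    """One of the k processors; applies the same online update to its own stream."""
    def __init__(self):
        self.w = 0.0     # this node's current prediction
        self.t = 0       # number of inputs this node has processed

    def process(self, z):
        self.t += 1
        grad = 2.0 * (self.w - z)               # same squared-loss gradient as above
        self.w -= 0.5 * grad / np.sqrt(self.t)  # same step-size schedule as above

k = 4
nodes = [Node() for _ in range(k)]
rng = np.random.default_rng(1)

for i in range(1_000):
    z = rng.normal(loc=0.3, scale=1.0)
    nodes[i % k].process(z)   # each input is handled by exactly one node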
The performance of the algorithm in the distributed environment can be bounded by comparing it against two baselines. On one hand, an ideal solution is to run a serial algorithm on a single processor that is k times faster than a single node. This solution is optimal because there is no communication and no partitioning or aggregation of data; besides, most distributed algorithms can also be run on one processor. For gradient-based serial algorithms, another paper already bounds the optimal regret achievable on an arbitrary convex loss: it is of the order of sqrt(m), where m is the number of inputs. On the other hand, a trivial solution is to run the serial algorithm independently on each of the k nodes. This requires no communication and can be called the no-communication algorithm. The only trouble is that it scales poorly with the network size k. In fact, it can be shown to be a factor of sqrt(k) worse than the ideal solution: assuming each node processes m/k inputs, the expected regret per node is of order sqrt(m/k), and summing over the k nodes gives a total regret of order sqrt(km), as the arithmetic below makes explicit.
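To spell out the gap, under the assumptions stated above (a serial regret of order sqrt(m), and the m inputs split evenly across the k nodes):

\[
  k \cdot \sqrt{\frac{m}{k}} \;=\; \sqrt{\frac{k^{2} m}{k}} \;=\; \sqrt{k m} \;=\; \sqrt{k}\,\sqrt{m},
\]

so the total regret of the no-communication algorithm is a factor of sqrt(k) larger than the sqrt(m) regret of the ideal serial solution.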
