Cluster computing

Wednesday, December 23, 2015

Today we continue with the paper optimal distributed online prediction using mini-batches by Dekel, Gilad-BachRach, Shamir and Xiao. We now review the Distributed Mini-Batch for Stochastic online prediction where the authors show that the mini-batch idea can be exploited to obtain nearly optimal regret bounds. They begin with the assumption that any communication over the network incurs a latency. The network is an undirected graph G over the set of nodes, where each edge represents a bi-directional network link. The latency incurred between u and v is therefore proportional to the graph distance between them.and is maximum at the diameter The network is also assumed to have limited bandwidth but instead of quantifying it, they require that the communication load at each node remains constant, and does not grow with the number of nodes k or with the rate at which the incoming functions arrive.
The communication model used with this network is one that allows a distributed vector sum operation. This operation begins with each node holding a vector vj, and ends with each node holding the sum of the vectors vj from 1 to k. The messages are transmitted along a rooted minimum depth spanning tree where the leaves send their vectors to their parents who aggregate and in turn send the result to their parents until it reaches root. The root computes the sum from all and broadcasts it back to all the nodes.
There are two advantages with this model : one that each up-link and each down-link in the minimum spanning tree is used only once The vector sum operations can be started back to back. These vector sum operations will run concurrently without creating network congestion on any edge of the tree and these network operations are non-blocking where each node can continue processing inputs while the vector sum operations take place in the background. This lets us overcome the restriction imposed by network latency.
The DMB algorithm is a distributed mini-batch algorithm. where it runs on parallel on each node in the network again utilizing the summation form we discussed with the communication model above.
#codingquestion
Given an integer array, find the first element with an even number of occurrences.
First of all this problem is not the same as finding the first duplicate because the total number of times that duplicate can occur can still be odd. Hence we must process all the numbers in the array into a count table with elements as keys and their count of occurrences as values. Therefore a hashtable with constant time lookup will help. Given that the count of occurrences can be arbitrary, we must safeguard against no even number count of occurrences or other edge conditions. the first pass over the integer array populates the count table and the second pass over the integer array checks for the first element with a parity. To conserve space, we can use a bitmap where the bit at the index given by the number or element of the integer array is tested for parity but we would need another data structure to say whether an element was ever encountered.

Cluster computing

Wednesday, December 23, 2015

No comments:

Post a Comment