We were discussing reduction in vectors and graphs. We now proceed to the hybrid graph processing of parallel Breadth First Search. The hybrid version uses intra node multi-threading.Its called hybrid because it uses both distributed memory graph partitioning and shared memory traversal parallelism which can enable scalability to several thousands of cores.
The partitioning may simply be 1D before the level synchronization happens. The distance array is distributed among the processes. Every process maintains the status of the vertices it owns and it can utilize multi-threading to enumerate the adjacencies but only the owner process of a vertex determines if it is visited already or not. Therefore all the adjacencies of the vertices in the current frontier need to be sent to their corresponding owner. This happens in the all-to-all communication step. Most of the activity is data parallel. The only synchronization that happens is at the barriers. The only improvement over the serial level synchronous algorithm is the computation involved from distributed graph which is the preparing of messages and the all to all communication.
Let FS be the stack to store the vertices at the current level which represent the frontier
and NS be the stack to store the newly visited vertices one hop away from the current level.
the serial algorithm is:
After initialization and pushing the root on the stack,
while FS not empty:
for each u in FS {
for each neighbor v of u:
if v is not visited:
push v onto NS
set level of v
}
Set FS to NS, empty NS, increment level
The hybrid algorithm is:
After initialization, push the root on the stack after determining owner,
and For each of the the p processors, initialize send and receive buffers and thread local stack tbuf-ij for thread i at processor j
while FS not empty:
for each u in FS in parallel{
for each neighbor v of u do
pv = find_owner(v)
push v to tbuf-ipv
Thread barrier
for each of the processors
merge thread-local tbuf-ij in parallel and form send-buf-j
Thread barrier
Now perform all to all collective step with master:
send data in send-buf and aggregate newly visited vertices in recv-buf
Thread barrier
for each u in the recv-buf in parallel do
if d is infinity then
set level of u
push u to NS-i
Thread barrier
Perform parallel merge of NS-i to FS
Thread barrier
#detailed discussion in https://1drv.ms/w/s!Ashlm-Nw-wnWtkxOVeU-mbfydKxs
The partitioning may simply be 1D before the level synchronization happens. The distance array is distributed among the processes. Every process maintains the status of the vertices it owns and it can utilize multi-threading to enumerate the adjacencies but only the owner process of a vertex determines if it is visited already or not. Therefore all the adjacencies of the vertices in the current frontier need to be sent to their corresponding owner. This happens in the all-to-all communication step. Most of the activity is data parallel. The only synchronization that happens is at the barriers. The only improvement over the serial level synchronous algorithm is the computation involved from distributed graph which is the preparing of messages and the all to all communication.
Let FS be the stack to store the vertices at the current level which represent the frontier
and NS be the stack to store the newly visited vertices one hop away from the current level.
the serial algorithm is:
After initialization and pushing the root on the stack,
while FS not empty:
for each u in FS {
for each neighbor v of u:
if v is not visited:
push v onto NS
set level of v
}
Set FS to NS, empty NS, increment level
The hybrid algorithm is:
After initialization, push the root on the stack after determining owner,
and For each of the the p processors, initialize send and receive buffers and thread local stack tbuf-ij for thread i at processor j
while FS not empty:
for each u in FS in parallel{
for each neighbor v of u do
pv = find_owner(v)
push v to tbuf-ipv
Thread barrier
for each of the processors
merge thread-local tbuf-ij in parallel and form send-buf-j
Thread barrier
Now perform all to all collective step with master:
send data in send-buf and aggregate newly visited vertices in recv-buf
Thread barrier
for each u in the recv-buf in parallel do
if d is infinity then
set level of u
push u to NS-i
Thread barrier
Perform parallel merge of NS-i to FS
Thread barrier
#detailed discussion in https://1drv.ms/w/s!Ashlm-Nw-wnWtkxOVeU-mbfydKxs
No comments:
Post a Comment