BFS preconditioning for high locality, data parallel BFS algorithm

N. Vasilache, B. Meister, M. Baskaran, R. Lethin

Problem
  • Streaming Graph Challenge Characteristics:
    • Large scale
    • Highly dynamic
    • Scale-free
    • Massive parallelism; data movement and synchronization are key
    • Completely unpredictable
    • This talk focuses on breadth-first search
Breadth-First Search
  • Dynamic exploration algorithm:
    • Computes single-source shortest paths
    • O(V + E) complexity → important
    • First graph500 problem
    • Comes in a variety of base implementations:
      • sequential list
      • sequential csr, ompcsr
      • mpi
      • MTGL
    • The best 2010 implementation is IBM’s MPI version
  • We propose a solution to optimize the run of a single BFS (a graph500 requirement), relying on a test run, BFS_0, to precondition placement, locality and parallelism.
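Since everything that follows builds on the graph500 BFS output, here is a minimal sequential sketch of that baseline kernel. The xoff/xadj names follow the slides' CSR description (collapsed here to the usual V+1 offsets); the rest is illustrative, not the graph500 reference code, and conventions for the root's father vary.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Minimal sequential BFS over a CSR graph, producing the graph500-style
// result: parent[v] = father of v in the BFS tree, -1 if v was never
// reached. Here the root is made its own father (conventions vary).
std::vector<int64_t> bfs(const std::vector<int64_t>& xoff,  // size V+1
                         const std::vector<int64_t>& xadj,  // size E
                         int64_t root) {
  std::vector<int64_t> parent(xoff.size() - 1, -1);
  std::queue<int64_t> frontier;
  parent[root] = root;
  frontier.push(root);
  while (!frontier.empty()) {
    int64_t u = frontier.front();
    frontier.pop();
    for (int64_t e = xoff[u]; e < xoff[u + 1]; ++e) {
      int64_t v = xadj[e];
      if (parent[v] == -1) {  // first touch fixes the father of v
        parent[v] = u;
        frontier.push(v);
      }
    }
  }
  return parent;
}
```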
High-Level Ideas
  • Assume the result of a first BFS run is available (BFS_0)
    • In the form provided by graph500 (list of fathers or -1 if root)
    • BFS_0 can be viewed as an ordering (a traversal) of connected vertices consistent with an ordering of the edges.
  • Construct a representation that exploits BFS_0
    • Parallel distributed construction
    • Data parallel programming idioms
  • Reuse the representation for subsequent BFS runs
    • Improve parallelism and locality
    • Must be profitable including the overhead of the representation
    • Must bring improvement on a single BFS run (graph500 requirement)
    • Very preliminary work
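To make the "BFS_0 can be viewed as an ordering" reading concrete, here is a sequential sketch that recovers one BFS_0-consistent discovery order from the father list; the result is the bfs_order array of the later "Overhead Representation" slide. This is a hypothetical helper, not the parallel distributed construction the talk proposes.

```cpp
#include <cstdint>
#include <vector>

// Given the graph500-style father list of BFS_0, recover a discovery order
// consistent with it: bfs_order[v] = rank at which v is discovered. Walking
// the father tree level by level keeps siblings contiguous, which is the
// property the preconditioning exploits.
std::vector<int64_t> ordering_from_parents(const std::vector<int64_t>& parent,
                                           int64_t root) {
  const int64_t V = static_cast<int64_t>(parent.size());
  std::vector<std::vector<int64_t>> kids(V);  // children lists of the tree
  for (int64_t v = 0; v < V; ++v)
    if (parent[v] >= 0 && v != root) kids[parent[v]].push_back(v);
  std::vector<int64_t> bfs_order(V, -1), level = {root};
  int64_t rank = 0;
  while (!level.empty()) {
    std::vector<int64_t> next;
    for (int64_t u : level) {
      bfs_order[u] = rank++;                  // all of depth D before D+1
      for (int64_t c : kids[u]) next.push_back(c);
    }
    level = std::move(next);
  }
  return bfs_order;  // unreachable vertices keep rank -1
}
```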
BFS_0 interesting properties
  • In BFS_0 order: siblings are contiguous, children are localized (recursively), parent is not too far, potential neighbors not too far
  • Given a graph and a potential BFS:
    • Solid edges are actually used by the BFS, dashed edges are potential edges, red dashed edges are illegal edges
BFS_0 interesting properties (continued)
  • Additional “structural information” on the graph carried by BFS_0
    • Potential neighbors of “f” are in the grey region
    • The distance between “f” and “d” is at most 2
    • Clear separation of the potential vertices into 3 classes, depending on their depth in BFS_0 relative to the depth of “f”
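The "illegal" red dashed edges follow from a standard BFS invariant, which a tiny predicate captures; the helper is illustrative.

```cpp
#include <cstdint>

// BFS invariant behind the "illegal" (red dashed) edges: if an edge (u, v)
// existed with depth(v) > depth(u) + 1, then u would already have discovered
// v at depth(u) + 1, a contradiction. So a candidate edge is only legal when
// the endpoint depths in BFS_0 differ by at most one.
inline bool legal_bfs_edge(int64_t depth_u, int64_t depth_v) {
  const int64_t diff = depth_u - depth_v;
  return diff >= -1 && diff <= 1;
}
```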
Sketch of proposed algorithm
  • We want to reuse as much information from BFS_0 as possible
    • Given a visited node “N”, the 3 classes are really data-independent regions (depth -1, same depth, depth +1)
    • Additionally, we distinguish Children of N from other Nephew nodes
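A sketch of this classification, using the C/P/S/M region names the talk introduces a few slides later; depth[] and father0[] are assumed precomputed from BFS_0, and the helper itself is illustrative rather than the authors' code.

```cpp
#include <cstdint>
#include <vector>

// Regions of a neighbor w of a visited node u, relative to BFS_0:
//   C: one depth deeper and a BFS_0 child of u (contiguous in BFS_0 order)
//   P: one depth deeper but not u's child (a "Nephew")
//   S: same BFS_0 depth
//   M: one depth shallower
enum class Region { C, P, S, M };

inline Region classify(int64_t u, int64_t w,
                       const std::vector<int64_t>& depth,
                       const std::vector<int64_t>& father0) {
  if (depth[w] == depth[u] + 1)
    return father0[w] == u ? Region::C : Region::P;
  if (depth[w] == depth[u]) return Region::S;
  return Region::M;  // depth[w] == depth[u] - 1 for any legal edge
}
```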
Sketch of proposed algorithm
  • We give highest importance to the N→Children relation:
      • position of first child > position of any node in {PlusOne – Children} → simple criterion for parallel processing
      • Children are contiguous, Nephews are not
      • visiting Children in the BFS_0 order must be stable under recursion
        • suppose we want a shortest path from i to g:
          • i→b→g is impossible
          • i→d→g is OK but lacks structure
          • i→f→g is much better (recursively contiguous and data parallel)
      • Children are important, and we hope there are many
Sketch of proposed algorithm
  • “Discover-and-Merge-and-Mark” algorithm
  • Given a single starting node we can explore 4 regions in parallel (C,P,S,M)
    • Order of commit is M,S,P,C
      • for a node at distance 2 from N and at the same depth as N in BFS_0, a transition sequence M·C must be favored over S·S
      • This order guarantees recursive consistency of children relation
      • In general, nodes should be marked in the BFS_0 order
  • Order of commit is relevant for nodes discovered at the same distance and same depth in BFS_0 wrt the starting node
  • 3 parameters to order the traversal: distance, depth, list of transitions → a lattice of transitions and synchronizations
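One possible reading of the commit rule, sketched below: within a merge-and-mark phase the four regions commit in the fixed order M, S, P, C, with later commits prevailing among nodes discovered at the same distance and depth, so an M·C path beats an S·S path. This is our interpretation of the slides, not the authors' code; commit_region is a hypothetical callback.

```cpp
#include <array>
#include <cstdint>
#include <functional>

// Sketch of one merge-and-mark phase for lattice cell (D, d): commit the
// regions in the fixed order M, S, P, C; among nodes at equal (distance,
// depth), a later commit overrides an earlier provisional mark, so the
// C-side mark prevails, keeping marks consistent with the BFS_0 order.
void merge_and_mark(
    int64_t D, int64_t d,
    const std::function<void(int64_t, int64_t, char)>& commit_region) {
  constexpr std::array<char, 4> kCommitOrder = {'M', 'S', 'P', 'C'};
  for (char region : kCommitOrder)
    commit_region(D, d, region);  // last writer wins at equal (D, d)
}
```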
Lattice of Transitions

[Figure: lattice of transitions — rows D = 0, 1, 2, 3 give the distance from the Start Node; row D contains one cell per depth difference d = -D … D.]
  • Let height be the height of BFS_0; the maximal distance is 2*height
    • The maximal depth difference is in [-height, height] (can be refined)
  • Arrows represent producer/consumer dependences:
    • 2-D and uniform dependences → pipelined parallelism (for free!)
    • Transitions and edge direction are related:
      • Bottom-left edge is an M transition ([D,d] → [D+1, d-1])
      • Vertical edge is an S transition ([D,d] → [D+1, d])
      • Bottom-right edge is a (C | P) transition ([D,d] → [D+1, d+1])
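Because the dependences are uniform and point only from row D-1 to row D, the schedule is a textbook wavefront over rows of the lattice. A sketch, with OpenMP standing in for CnC's dependence-driven scheduling and process_cell a hypothetical per-cell callback:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>

// Cell (D, d) only consumes cells (D-1, d-1), (D-1, d) and (D-1, d+1), so a
// whole row at distance D can run in parallel once row D-1 is committed.
void run_lattice(int64_t height,
                 const std::function<void(int64_t, int64_t)>& process_cell) {
  for (int64_t D = 0; D <= 2 * height; ++D) {  // maximal distance: 2*height
    const int64_t lo = std::max(-height, -D);  // depth difference bounded
    const int64_t hi = std::min(height, D);    // by D and the tree height
#pragma omp parallel for
    for (int64_t d = lo; d <= hi; ++d)         // one row: independent cells
      process_cell(D, d);
  }
}
```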
Available Parallelism

[Figure: the same (D, d) lattice as on the previous slide, here used to discuss the available parallelism.]
  • A little less constrained than true pipelined parallelism
    • Some tasks have only 1 or 2 predecessors, hence a relaxed ordering
  • C/P/S/M transitions allow inter-region parallelism and give a third dimension of parallelism/synchronization:
    • Unable to exploit it yet (we need dynamic dependences, otherwise too many empty tasks are created)
Overhead Representation From BFS_0
  • Graph500 output:
    • bfs_0_list (for each vertex, its father)
    • xadj (list of edges in compact array)
    • xoff (for each node, offset of first and last edge in xadj)
  • Overhead representation we propose:
    • bfs_order (for each vertex id, the order in which it was discovered) → a slight extension of the original seq-csr
    • bfs_0_list (for each position in BFS_0, get the vertex id) → a sort (used for finalization, maybe not needed)
    • bfs_0_list_of_positions (for each vertex id, get the list of positions)
    • num_children (tmp, doall), ordered_num_children (tmp, doall), pps_num_children (PPS, implemented sequentially), pps_depths (PPS)
    • xoff + xadj w.r.t. BFS_0 (doall + PPS), categorized in v2 (doall)
    • Building this currently takes about as much time as 1 sequential run
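The heart of the representation is a permutation of the CSR into BFS_0 order. Below is a sequential sketch of that core step, under the assumptions that PPS denotes a prefix-sum pass and that every vertex is reached by BFS_0; names are illustrative, not the authors' code. The degree scan is where the PPS comes in, and the degree and copy loops are the doall parts.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Permute the CSR arrays so vertices and edge lists are laid out in BFS_0
// order; bfs_order maps vertex id -> BFS_0 rank (every vertex assumed ranked).
void reorder_csr(const std::vector<int64_t>& xoff,       // size V+1
                 const std::vector<int64_t>& xadj,       // size E
                 const std::vector<int64_t>& bfs_order,  // vertex id -> rank
                 std::vector<int64_t>& new_xoff,
                 std::vector<int64_t>& new_xadj) {
  const int64_t V = static_cast<int64_t>(xoff.size()) - 1;
  std::vector<int64_t> by_rank(V);                 // rank -> vertex id
  for (int64_t v = 0; v < V; ++v) by_rank[bfs_order[v]] = v;
  new_xoff.assign(V + 1, 0);
  for (int64_t r = 0; r < V; ++r)                  // degrees, in BFS_0 order
    new_xoff[r + 1] = xoff[by_rank[r] + 1] - xoff[by_rank[r]];
  std::partial_sum(new_xoff.begin(), new_xoff.end(), new_xoff.begin());
  new_xadj.resize(xadj.size());
  for (int64_t r = 0; r < V; ++r) {                // copy + renumber endpoints
    int64_t dst = new_xoff[r];
    for (int64_t e = xoff[by_rank[r]]; e < xoff[by_rank[r] + 1]; ++e)
      new_xadj[dst++] = bfs_order[xadj[e]];
  }
}
```

After this pass, a node's children occupy one contiguous run of ranks, which is exactly the locality property the C region relies on.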
Implementation
  • CnC and C++ implementation:
    • Pointers to helper data structures
    • All discovered nodes are copied using data collections
  • CnC task granularity is a pair (D,d):
    • Generates exactly height * (2*height + 1) / 2 tasks
    • Synchronization is easy to write:
      • Each task gets input from its predecessors at (D-1, d-1) and/or (D-1, d) and/or (D-1, d+1)
      • Each task puts data at (D, d)
  • Within a CnC task:
    • Get input from (D-1, d-1), discover/mark C transition, discover/mark P transition. Get input from (D-1, d), discover/mark S transition. Get input from (D-1, d+1), discover/mark M transition.
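A sketch of one such step in plain C++, mirroring the get/put pattern above; a map stands in for CnC item collections and discover_and_mark is a hypothetical region walker supplied by the caller.

```cpp
#include <cstdint>
#include <functional>
#include <initializer_list>
#include <map>
#include <utility>
#include <vector>

using Key = std::pair<int64_t, int64_t>;  // CnC tag: (D, d)
using Frontier = std::vector<int64_t>;    // vertex ids discovered at a cell
using Items = std::map<Key, Frontier>;    // stand-in for a CnC item collection
using RegionWalk = std::function<Frontier(const Frontier&, char /*region*/)>;

// One step at tag (D, d): get the three producer cells of row D-1,
// discover/mark the matching regions, and put one cell at (D, d).
void step(int64_t D, int64_t d, Items& items,
          const RegionWalk& discover_and_mark) {
  Frontier out;
  auto consume = [&](int64_t prev_d, std::initializer_list<char> regions) {
    auto it = items.find({D - 1, prev_d});
    if (it == items.end()) return;        // boundary cell: fewer predecessors
    for (char region : regions) {
      Frontier found = discover_and_mark(it->second, region);
      out.insert(out.end(), found.begin(), found.end());
    }
  };
  consume(d - 1, {'C', 'P'});  // C and P raise the BFS_0 depth by one
  consume(d, {'S'});           // S keeps the depth
  consume(d + 1, {'M'});       // M lowers the depth by one
  items[{D, d}] = std::move(out);         // the "put" at tag (D, d)
}
```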
Implementation (continued)
  • Within a CnC task, everything is sequentialized:
    • Ability to spawn asynchronous tasks would be useful
    • Very coarse-grained parallelism (1 task is 4 regions, each region may touch many elements in parallel)
  • 2 implementations:
    • “intvec” uses the list of edges in the original graph
    • “intvec.cat” categorizes the edges by region for faster region traversal
    • A lot of untuned overhead
    • Slowdown … but still valuable information
Statistical Analysis
  • Biggest example I ran ("size 22" in graph500 terminology):
    • Scale-free graph, 4M vertices and 88M edges
    • The height of the BFS tree is only 7 → small-world property
    • The total number of CnC tasks created is only 14*7 / 2 = 49
    • Of these 49 tasks, only a fraction actually perform work, maybe 10
    • The work performed is extremely unbalanced:
      • one task can discover up to 1M new nodes
      • others discover only 1.
    • ~70% of discoveries and marks happen by visiting C transitions
      • children are all contiguous in BFS_0 which gives great locality
      • children have good synchronization properties: they can all be processed in parallel
  • Need to spawn subtasks
Another Implementation
  • Almost no parallelism is exploited yet, but a lot is available:
    • Spawn async tasks
    • Reduce + prefix to deal efficiently with large C regions
  • Tried another implementation:
    • CnC task is now (D, d, last_transition)
    • For each (D,d), fan out 4 “discover” tasks: C, P // S // M
    • For each (D,d), reduce 1 “merge-and-mark” task dependent on these 4 tasks
    • Additionally, each discover task can be broken down into a static number of pieces to try and process in parallel
    • VERY crude way of representing async and prefix-reduction
    • Huge overhead (between 2 and 4x over the previous CnC version)
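A crude sketch of this fan-out/reduce shape, using std::async in place of CnC tags of the form (D, d, last_transition); discover is a hypothetical region walker for one cell, and the marking step itself is omitted.

```cpp
#include <cstdint>
#include <functional>
#include <future>
#include <vector>

using Frontier = std::vector<int64_t>;

// For one (D, d) cell: fan out one asynchronous "discover" task per region
// (C, P, S, M), then reduce them in a single "merge-and-mark" task.
Frontier fan_out_and_merge(const std::function<Frontier(char)>& discover) {
  std::future<Frontier> parts[] = {
      std::async(std::launch::async, discover, 'C'),
      std::async(std::launch::async, discover, 'P'),
      std::async(std::launch::async, discover, 'S'),
      std::async(std::launch::async, discover, 'M')};
  Frontier merged;             // the reduction; the actual marking step
  for (auto& part : parts) {   // (commit order M, S, P, C) is not shown
    Frontier f = part.get();
    merged.insert(merged.end(), f.begin(), f.end());
  }
  return merged;
}
```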
Future Work

  • Examine overhead (memory leaks, spurious copies, inefficient hashing, too many tasks created, no dependences specified, etc.)
  • Hierarchical parallelism
  • Distributed implementation
  • Complement with DFS preconditioning (all recursive children become contiguous)