- 161 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about ' Parallel Graph Algorithms' - irving

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Parallel Graph Algorithms

Lecture OutlineLecture Outline

KameshMadduri

Lawrence Berkeley National Laboratory

CS267, Spring 2011

April 7, 2011

Lecture Outline

- Applications
- Designing parallel graph algorithms, performance on current systems
- Case studies: Graph traversal-based problems, parallel algorithms
- Breadth-First Search
- Single-source Shortest paths
- Betweenness Centrality

Routing in transportation networks

Road networks, Point-to-point shortest paths: 15 seconds (naïve) 10 microseconds

H. Bast et al., “Fast Routing in Road Networks with Transit Nodes”,Science 27, 2007.

Internet and the WWW

- The world-wide web can be represented as a directed graph
- Web search and crawl: traversal
- Link analysis, ranking: Page rank and HITS
- Document classification and clustering
- Internet topologies (router networks) are naturally modeled as graphs

Scientific Computing

- Reorderings for sparse solvers
- Fill reducing orderings
- Partitioning, eigenvectors
- Heavy diagonal to reduce pivoting (matching)
- Data structures for efficient exploitation

of sparsity

- Derivative computations for optimization
- graph colorings, spanning trees
- Preconditioning
- Incomplete Factorizations
- Partitioning for domain decomposition
- Graph techniques in algebraic multigrid
- Independent sets, matchings, etc.
- Support Theory
- Spanning trees & graph embedding techniques

Image source: YifanHu, “A gallery of large graphs”

Image source: Tim Davis, UF Sparse Matrix Collection.

B. Hendrickson, “Graphs and HPC: Lessons for Future Architectures”,http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf

Large-scale data analysis

- Graph abstractions are very useful to analyze complex data sets.
- Sources of data: petascale simulations, experimental devices, the Internet, sensor networks
- Challenges: data size, heterogeneity, uncertainty, data quality

Social Informatics:new analytics

challenges, data uncertainty

Astrophysics: massive datasets,

temporal variations

Bioinformatics: data quality,

heterogeneity

Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com

Data Analysis and Graph Algorithms in Systems Biology

- Study of the interactions between various components in a biological system
- Graph-theoretic formulations are pervasive:
- Predicting new interactions: modeling
- Functional annotation of novel proteins: matching, clustering
- Identifying metabolic pathways: paths, clustering
- Identifying new protein complexes: clustering, centrality

Image Source: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”,

Science 302, 1722-1736, 2003.

Graph –theoretic problems in social networks

- Targeted advertising: clustering and centrality
- Studying the spread of information

Image Source: Nexus (Facebookapplication)

Network Analysis for Intelligence and Survelliance

- [Krebs ’04] Post 9/11 Terrorist Network Analysis from public domain information
- Plot masterminds correctly identified from interaction patterns: centrality
- A global view of entities is often more insightful
- Detect anomalous activities by exact/approximate subgraph isomorphism.

Image Source: http://www.orgnet.com/hijackers.html

Image Source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies

for intelligence analysis, CACM, 47 (3, March 2004): pp 45-47

Research in Parallel Graph Algorithms

Graph

Algorithms

Application

Areas

Methods/

Problems

Architectures

GPUs

FPGAs

x86 multicore

servers

Massively

multithreaded

architectures

Multicore

Clusters

Clouds

Traversal

Shortest Paths

Connectivity

Max Flow

…

…

…

Social Network

Analysis

WWW

Computational

Biology

Scientific

Computing

Engineering

Find central entities

Community detection

Network dynamics

Data size

Marketing

Social Search

Gene regulation

Metabolic pathways

Genomics

Problem

Complexity

Graph partitioning

Matching

Coloring

VLSI CAD

Route planning

Characterizing Graph-theoretic computations

Factors that influence

choice of algorithm

Input data

Graph kernel

- graph sparsity (m/n ratio)
- static/dynamic nature
- weighted/unweighted,weight distribution
- vertex degree distribution
- directed/undirected
- simple/multi/hyper graph
- problem size
- granularity of computation at nodes/edges
- domain-specific characteristics

- traversal
- shortest path algorithms
- flow algorithms
- spanning tree algorithms
- topological sort
- …..

Problem: Find ***

- paths
- clusters
- partitions
- matchings
- patterns
- orderings

Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations

Lecture Outline

- Applications
- Designing parallel graph algorithms, performance on current systems
- Case studies: Graph traversal-based problems, parallel algorithms
- Breadth-First Search
- Single-source Shortest paths
- Betweenness Centrality

History

- 1735: “Seven Bridges of Königsberg” problem, resolved by Euler, one of the first graph theory results.
- …
- 1966: Flynn’s Taxonomy.
- 1968: Batcher’s “sorting networks”
- 1969: Harary’s “Graph Theory”
- …
- 1972: Tarjan’s “Depth-first search and linear graph algorithms”
- 1975: Reghbati and Corneil, Parallel Connected Components
- 1982: Misra and Chandy, distributed graph algorithms.
- 1984: Quinn and Deo’s survey paper on “parallel graph algorithms”
- …

The PRAM model

- Idealized parallel shared memory system model
- Unbounded number of synchronous processors; no synchronization, communication cost; no parallel overhead
- EREW (Exclusive Read Exclusive Write), CREW (Concurrent Read Exclusive Write)
- Measuring performance: space and time complexity; total number of operations (work)

PRAM Pros and Cons

- Pros
- Simple and clean semantics.
- The majority of theoretical parallel algorithms are designed using the PRAM model.
- Independent of the communication network topology.
- Cons
- Not realistic, too powerful communication model.
- Algorithm designer is misled to use IPC without hesitation.
- Synchronized processors.
- No local memory.
- Big-O notation is often misleading.

PRAM Algorithms for Connected Components

- Reghbati and Corneil [1975], O(log2n) time and O(n3) processors
- Wyllie [1979], O(log2n) time and O(m) processors
- Shiloach and Vishkin [1982], O(log n) time and O(m) processors
- Reif and Spirakis [1982], O(log log n) time and O(n) processors (expected)

Building blocks of classical PRAM graph algorithms

- Prefix sums
- Symmetry breaking
- Pointer jumping
- List ranking
- Euler tours
- Vertex collapse
- Tree contraction

Data structures: graph representation

Static case

- Dense graphs (m = O(n2)): adjacency matrix commonly used.
- Sparse graphs: adjacency lists

Dynamic

- representation depends on common-case query
- Edge insertions or deletions? Vertex insertions or deletions? Edge weight updates?
- Graph update rate
- Queries: connectivity, paths, flow, etc.
- Optimizing for locality a key design consideration.

Graph representation

Vertex

Degree

Adjacencies

- Compressed Sparse Row-like

5

8

1

0

7

3

4

6

9

Flatten

adjacency

arrays

2

Index into

adjacency

array

Size: n+1

Adjacencies

Size: 2*m

Distributed Graph representation

- Each processor stores the entire graph (“full replication”)
- Each processor stores n/p vertices and all adjacencies out of these vertices (“1D partitioning”)
- How to create these “p” vertex partitions?
- Graph partitioning algorithms: recursively optimize for conductance (edge cut/size of smaller partition)
- Randomly shuffling the vertex identifiers ensures that edge count/processor are roughly the same

2D graph partitioning

- Consider a logical 2D processor grid (pr * pc = p) and the matrix representation of the graph
- Assign each processor a sub-matrix (i.e, the edges within the sub-matrix)

9 vertices, 9 processors, 3x3 processor grid

5

8

1

0

7

3

4

6

Flatten

Sparse matrices

2

Per-processor local graph

representation

Data structures in (parallel) graph algorithms

- A wide range seen in graph algorithms: array, list, queue, stack, set, multiset, tree
- Implementations are typically array-based for performance considerations.
- Key data structure considerations in parallel graph algorithm design
- Practical parallel priority queues
- Space-efficiency
- Parallel set/multisetoperations, e.g., union, intersection, etc.

Graph Algorithms on current systems

- Concurrency
- Simulating PRAM algorithms: hardware limits of memory bandwidth, number of outstanding memory references; synchronization
- Locality
- Try to improve cache locality, but avoid “too much” superfluous computation
- Work-efficiency
- Is (Parallel time) * (# of processors) = (Serial work)?

The locality challenge“Large memory footprint, low spatial and temporal locality impede performance”

Serial Performance of “approximate betweenness centrality” on a 2.67 GHz Intel Xeon 5560 (12 GB RAM, 8MB L3 cache)

Input: Synthetic R-MAT graphs (# of edges m = 8n)

No Last-level Cache (LLC) misses

~ 5X drop in

performance

O(m) LLC misses

The parallel scaling challenge “Classical parallel graph algorithms perform poorly on current parallel systems”

- Graph topology assumptions in classical algorithms do not match real-world datasets
- Parallelization strategies at loggerheads with techniques for enhancing memory locality
- Classical “work-efficient” graph algorithms may not fully exploit new architectural features
- Increasing complexity of memory hierarchy, processor heterogeneity, wide SIMD.
- Tuning implementation to minimize parallel overhead is non-trivial
- Shared memory: minimizing overhead of locks, barriers.
- Distributed memory: bounding message buffer sizes, bundling messages, overlapping communication w/ computation.

Lecture Outline

- Applications
- Designing parallel graph algorithms, performance on current systems
- Case studies: Graph traversal-based problems, parallel algorithms
- Breadth-First Search
- Single-source Shortest paths
- Betweenness Centrality

Graph traversal (BFS) problem definition

1

2

distance from

source vertex

Input:

Output:

5

8

4

1

3

1

3

4

2

0

7

3

4

6

9

source

vertex

1

2

- Memory requirements (# of machine words):
- Sparse graph representation: m+n
- Stack of visited vertices: n
- Distance array: n

Optimizing BFS on cache-based multicore platforms, for networks with “power-law” degree distributions

Test data

(Data: Mislove et al.,

IMC 2007.)

Synthetic R-MAT

networks

Parallel BFS Strategies

5

8

1. Expand current frontier (level-synchronous approach, suited for low diameter graphs)

1

- O(D) parallel steps
- Adjacencies of all vertices
- in current frontier are
- visited in parallel

0

7

3

4

6

9

source

vertex

2

2. Stitch multiple concurrent traversals (Ullman-Yannakakis approach, suited for high-diameter graphs)

5

8

- path-limited searches from “super vertices”
- APSP between “super vertices”

1

source

vertex

0

7

3

4

6

9

2

A deeper dive into the “level synchronous” strategy

Locality (where are the random accesses originating from?)

1. Ordering of vertices in the “current frontier” array, i.e., accesses to adjacency indexing array, cumulative accesses O(n).

2. Ordering of adjacency list of each vertex, cumulative O(m).

3. Sifting through adjacencies to check whether visited or not, cumulative accesses O(m).

84

53

93

31

0

44

74

63

11

26

1. Access Pattern: idx array -- 53, 31, 74, 26

2,3. Access Pattern: d array -- 0, 84, 0, 84, 93, 44, 63, 0, 0, 11

Improving locality: Vertex relabeling

x

x

x

x

x

x

x

x

x

x

x

x

x

- Well-studied problem, slight differences in problem formulations
- Linear algebra: sparse matrix column reordering to reduce bandwidth, reveal dense blocks.
- Databases/data mining: reordering bitmap indices for better compression; permuting vertices of WWW snapshots, online social networks for compression
- NP-hard problem, several known heuristics
- We require fast, linear-work approaches
- Existing ones: BFS or DFS-based, Cuthill-McKee, Reverse Cuthill-McKee, exploit overlap in adjacency lists, dimensionality reduction

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

Improving locality: Optimizations

- Recall: Potential O(m) non-contiguous memory references in edge traversal (to check if vertex is visited).
- e.g., access order: 53, 31, 31, 26, 74, 84, 0, …
- Objective: Reduce TLB misses, private

cache misses, exploit shared cache.

- Optimizations:
- Sort the adjacency lists of each vertex – helps

order memory accesses, reduce TLB misses.

- Permute vertex labels – enhance spatial locality.
- Cache-blocked edge visits – exploit temporal locality.

84

53

93

31

0

44

74

63

11

26

Improving locality: Cache blocking

Metadata denoting

blocking pattern

Process

high-degree

vertices

separately

Adjacencies (d)

Adjacencies (d)

- Instead of processing adjacencies of each vertex serially, exploit sorted adjacency list structure w/ blocked accesses
- Requires multiple passes through the frontier array, tuning for optimal block size.
- Note: frontier array size may be O(n)

3

2

1

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

frontier

frontier

x

x

Tune to L2

cache size

x

x

x

x

x

x

x

x

x

x

x

x

New: cache-blocked approach

linear processing

Vertex relabeling heuristic

Similar to older heuristics, but tuned for small-world networks:

- High percentage of vertices with (out) degrees 0, 1, and 2 in social and information networks => store adjacencies explicitly (in indexing data structure).
- Augment the adjacency indexing data structure (with two additional words) and frontier array (with one bit)
- Process “high-degree vertices” adjacencies in linear order, but other vertices with d-array cache blocking.
- Form dense blocks around high-degree vertices
- Reverse Cuthill-McKee, removing degree 1 and degree 2 vertices

Architecture-specific Optimizations

1. Software prefetchingon the Intel Core i7 (supports 32 loads and 20 stores in flight)

- Speculative loads of index array and adjacencies of frontier vertices will reduce compulsory cache misses.

2. Aligning adjacency lists to optimize memory accesses

- 16-byte aligned loads and stores are faster.
- Alignment helps reduce cache misses due to fragmentation
- 16-byte aligned non-temporal stores (during creation of new frontier) are fast.

3. SIMD SSE integer intrinsicsto process “high-degree vertex” adjacencies.

4. Fast atomics (BFS is lock-free w/ low contention, and CAS-based intrinsicshave very low overhead)

5. Hugepagesupport (significant TLB miss reduction)

6. NUMA-aware memory allocation exploiting first-touch policy

Experimental Setup

- Intel Xeon 5560 (Core i7, “Nehalem”)
- 2 sockets x 4 cores x 2-way SMT
- 12 GB DRAM, 8 MB shared L3
- 51.2 GBytes/sec peak bandwidth
- 2.66 GHz proc.

Performance averaged over

10 different source vertices, 3 runs each.

Impact of optimization strategies

* Optimization speedup (performance on 4 cores) w.r.t baseline parallel approach, on a synthetic R-MAT graph (n=223,m=226)

Cache locality improvement

Performance count: # of non-contiguous memory accesses (assuming cache line size of 16 words)

Theoretical count of the number of non-contiguous memory accesses: m+3n

Parallel performance (Orkut graph)

Execution time:

0.28 seconds (8 threads)

Parallel speedup: 4.9

Speedup over

baseline: 2.9

Graph: 3.07 million vertices, 220 million edges

Single socket of Intel Xeon 5560 (Core i7)

Performance Analysis

- How well does my implementation match theoretical bounds?
- When can I stop optimizing my code?
- Begin with asymptotic analysis
- Express work performed in terms of machine-independent performance counts
- Add input graph characteristics in analysis
- Use simple kernels or micro-benchmarks to provide an estimate of achievable peak performance
- Relate observed performance to machine characteristics

Graph 500 “Search” Benchmark (graph500.org)

- BFS (from a single vertex) on a static, undirected R-MAT network with average vertex degree 16.
- Evaluation criteria: largest problem size that can be solved on a system, minimum execution time.
- Reference MPI, shared memory implementations provided.
- NERSC Franklin system is ranked #2 on current list (Nov 2010).
- BFS using 500 nodes of Franklin

- Applications
- Designing parallel graph algorithms, performance on current systems
- Case studies: Graph traversal-based problems, parallel algorithms
- Breadth-First Search
- Single-source Shortest paths
- Betweenness Centrality

Parallel Single-source Shortest Paths (SSSP) algorithms

- Edge weights: concurrency primary challenge!
- No known PRAM algorithm that runs in sub-linear time and O(m+nlog n) work
- Parallel priority queues: relaxed heaps [DGST88], [BTZ98]
- Ullman-Yannakakis randomized approach [UY90]
- Meyer and Sanders, ∆ - stepping algorithm [MS03]
- Distributed memory implementations based on graph partitioning
- Heuristics for load balancing and termination detection

K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “An Experimental Study of A Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances,” Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.

∆ - stepping algorithm [MS03]

- Label-correcting algorithm: Can relax edges from unsettled vertices also
- ∆ - stepping: “approximate bucket implementation of Dijkstra’s algorithm”
- ∆: bucket width
- Vertices are ordered using buckets representing priority range of size ∆
- Each bucket may be processed in parallel

∆ - stepping algorithm: illustration

∆ = 0.1 (say)

0.05

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R
- Relax heavy request pairs (from S)
- Go on to the next bucket

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

∞

∞

∞

∞

∞

∞

∞

Buckets

∆ - stepping algorithm: illustration

0.05

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R
- Relax heavy request pairs (from S)
- Go on to the next bucket

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

∞

∞

∞

∞

∞

∞

0

Buckets

Initialization:

Insert s into bucket, d(s) = 0

0

0

∆ - stepping algorithm: illustration

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R
- Relax heavy request pairs (from S)
- Go on to the next bucket

0.05

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

∞

∞

∞

∞

∞

∞

0

R

2

Buckets

0

0

.01

S

∆ - stepping algorithm: illustration

0.05

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R
- Relax heavy request pairs (from S)
- Go on to the next bucket

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

∞

∞

∞

∞

∞

∞

0

2

Buckets

R

0

.01

0

S

∆ - stepping algorithm: illustration

0.05

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R
- Relax heavy request pairs (from S)
- Go on to the next bucket

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

∞

∞

∞

∞

∞

0

.01

R

Buckets

0

2

0

S

∆ - stepping algorithm: illustration

0.05

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R
- Relax heavy request pairs (from S)
- Go on to the next bucket

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

∞

∞

∞

∞

∞

0

.01

Buckets

R

1

3

0

2

.03

.06

0

S

∆ - stepping algorithm: illustration

0.05

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R
- Relax heavy request pairs (from S)
- Go on to the next bucket

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

∞

∞

∞

∞

∞

0

.01

R

Buckets

1

3

0

.03

.06

S

0

2

∆ - stepping algorithm: illustration

0.05

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R
- Relax heavy request pairs (from S)
- Go on to the next bucket

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

∞

∞

∞

0

.03

.01

.06

Buckets

R

3

0

1

S

0

2

∆ - stepping algorithm: illustration

0.05

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R
- Relax heavy request pairs (from S)
- Go on to the next bucket

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

0

.29

.03

.01

.06

.16

.62

Buckets

R

1

4

2

5

S

0

1

3

2

6

6

Average shortest path weight for various graph families~ 220 vertices, 222 edges, directed graph, edge weights normalized to [0,1]

- Applications
- Designing parallel graph algorithms, performance on current systems
- Case studies: Graph traversal-based problems, parallel algorithms
- Breadth-First Search
- Single-source Shortest paths
- Betweenness Centrality

Betweenness Centrality

- Centrality: Quantitative measure to capture the importance of a vertex/edge in a graph
- degree, closeness, eigenvalue, betweenness
- Betweenness Centrality

( : No. of shortest paths between sand t)

- Applied to several real-world networks
- Social interactions
- WWW
- Epidemiology
- Systems biology

Algorithms for Computing Betweenness

- All-pairs shortest path approach: compute the length and number of shortest paths between all s-t pairs (O(n3)time), sum up the fractional dependency values (O(n2) space).
- Brandes’ algorithm (2003): Augment a single-source shortest path computation to count paths; uses the Bellman criterion; O(mn)work and O(m+n) space.

Our New Parallel Algorithms

- Madduri, Bader(2006): parallel algorithms for computing exact and approximate betweenness centrality
- low-diameter sparse graphs (diameter D = O(log n), m = O(nlog n))
- Exact algorithm: O(mn) work, O(m+n)space, O(nD+nm/p) time.
- Madduri et al.(2009): New parallel algorithm with lower synchronization overheadand fewer non-contiguous memory references
- In practice, 2-3X faster than previous algorithm
- Lock-free => better scalability on large parallel systems

Parallel BC Algorithm

- Consider an undirected, unweighted graph
- High-level idea: Level-synchronous parallel Breadth-First Search augmented to compute centrality scores
- Two steps
- traversal and path counting
- dependency accumulation

Parallel BC Algorithm Illustration

1. Traversal step: visit adjacent vertices, update distance

and path counts.

5

8

1

0

7

3

4

6

9

source

vertex

2

Parallel BC Algorithm Illustration

1. Traversal step: visit adjacent vertices, update distance

and path counts.

P

S

D

5

8

0

1

1

0

0

7

3

4

6

9

1

0

source

vertex

2

7

1

0

0

2

5

0

Parallel BC Algorithm Illustration

1. Traversal step: visit adjacent vertices, update distance

and path counts.

P

S

D

5

8

0

1

1

0

2

2

7

0

7

3

4

6

9

8

3

1

0

source

vertex

2

7

1

0

0

2

5

2

5

7

0

Level-synchronous approach: The adjacencies of all vertices

in the current frontier can be visited in parallel

Parallel BC Algorithm Illustration

Traversal step: at the end, we have all reachable vertices,

their corresponding predecessor multisets, and D values.

P

S

D

5

8

2

0

1

6

1

6

1

0

4

2

2

7

0

7

3

4

6

9

8

3

8

3

1

0

source

vertex

2

8

7

1

0

0

2

5

2

5

7

0

6

Level-synchronous approach: The adjacencies of all vertices

in the current frontier can be visited in parallel

Graph traversal step analysis

- Exploit concurrency in visiting adjacencies, as we assume that the graph diameter is small: O(log n)
- Upper bound on size of each predecessor multiset: In-degree
- Potential performance bottlenecks: atomic updates to predecessor multisets, atomic increments of path counts
- New algorithm: Based on observation that we don’t need to store “predecessor” vertices. Instead, we store successor edges along shortest paths.
- simplifies the accumulation step
- reduces an atomic operation in traversal step
- cache-friendly!

Graph Traversal Step locality analysis

All the vertices are in a

contiguous block (stack)

for all vertices u at level d in parallel do

for all adjacencies v of u in parallel do

dv = D[v];

if (dv < 0)

vis = fetch_and_add(&Visited[v], 1);

if (vis == 0)

D[v] = d+1;

pS[count++] = v;

fetch_and_add(&sigma[v], sigma[u]);

fetch_and_add(&Scount[u], 1);

if (dv == d + 1)

fetch_and_add(&sigma[v], sigma[u]);

fetch_and_add(&Scount[u], 1);

All the adjacencies of a vertex are

stored compactly (graph rep.)

Non-contiguous

memory access

Non-contiguous

memory access

Non-contiguous

memory access

Store to S[u]

Better cache utilization likely if D[v], Visited[v], sigma[v] are stored contiguously

Parallel BC Algorithm Illustration

2. Accumulation step: Pop vertices from stack,

update dependence scores.

Delta

P

S

5

8

2

1

6

1

6

0

4

2

7

0

7

3

4

6

9

8

3

8

3

0

source

vertex

2

8

7

0

0

2

5

5

7

0

6

Parallel BC Algorithm Illustration

2. Accumulation step: Can also be done in a

level-synchronous manner.

Delta

P

S

5

8

2

1

6

1

6

0

4

2

7

0

7

3

4

6

9

8

3

8

3

0

source

vertex

2

8

7

0

0

2

5

5

7

0

6

Accumulation step locality analysis

All the vertices are in a

contiguous block (stack)

for level d = GraphDiameter-2 to 1 do

for all vertices v at level d in parallel do

for all w in S[v] in parallel do reduction(delta)

delta_sum_v = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];

BC[v] = delta[v] = delta_sum_v;

Each S[v] is a

contiguous

block

Only floating point operation in code

Centrality Analysis applied to Protein Interaction Networks

43 interactions

Protein Ensembl ID

ENSG00000145332.2

Kelch-like protein 8

Designing parallel algorithms for large sparse graph analysis

System requirements: High (on-chip memory, DRAM, network, IO) bandwidth.

Solution: Efficiently utilize available memory bandwidth.

Algorithmic innovation

to avoid corner cases.

Improve

locality

“RandomAccess”-like

Locality

Problem

Complexity

Data reduction/

Compression

“Stream”-like

Faster

methods

Peta+

O(n)

104

1012

106

108

O(nlog n)

O(n2)

Work

performed

Problem size (n: # of vertices/edges)

Review of lecture

- Applications: Internet and WWW, Scientific computing, Data analysis, Surveillance
- Overview of parallel graph algorithms
- PRAM algorithms
- graph representation
- Parallel algorithm case studies
- BFS: locality, level-synchronous approach, multicore tuning
- Shortest paths: exploiting concurrency, parallel priority queue
- Betweenness centrality: importance of locality, atomics

Thank you!

- Questions?

Download Presentation

Connecting to Server..