- 131 Views
- Uploaded on
- Presentation posted in: General

Parallel Graph Algorithms

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Parallel Graph Algorithms

KameshMadduri

KMadduri@lbl.gov

Lawrence Berkeley National Laboratory

CS267, Spring 2011

April 7, 2011

- Applications
- Designing parallel graph algorithms, performance on current systems
- Case studies: Graph traversal-based problems, parallel algorithms
- Breadth-First Search
- Single-source Shortest paths
- Betweenness Centrality

Road networks, Point-to-point shortest paths: 15 seconds (naïve) 10 microseconds

H. Bast et al., “Fast Routing in Road Networks with Transit Nodes”,Science 27, 2007.

- The world-wide web can be represented as a directed graph
- Web search and crawl: traversal
- Link analysis, ranking: Page rank and HITS
- Document classification and clustering

- Internet topologies (router networks) are naturally modeled as graphs

- Reorderings for sparse solvers
- Fill reducing orderings
- Partitioning, eigenvectors

- Heavy diagonal to reduce pivoting (matching)

- Fill reducing orderings
- Data structures for efficient exploitation
of sparsity

- Derivative computations for optimization
- graph colorings, spanning trees

- Preconditioning
- Incomplete Factorizations
- Partitioning for domain decomposition
- Graph techniques in algebraic multigrid
- Independent sets, matchings, etc.

- Support Theory
- Spanning trees & graph embedding techniques

Image source: YifanHu, “A gallery of large graphs”

Image source: Tim Davis, UF Sparse Matrix Collection.

B. Hendrickson, “Graphs and HPC: Lessons for Future Architectures”,http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf

- Graph abstractions are very useful to analyze complex data sets.
- Sources of data: petascale simulations, experimental devices, the Internet, sensor networks
- Challenges: data size, heterogeneity, uncertainty, data quality

Social Informatics:new analytics

challenges, data uncertainty

Astrophysics: massive datasets,

temporal variations

Bioinformatics: data quality,

heterogeneity

Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com

- Study of the interactions between various components in a biological system
- Graph-theoretic formulations are pervasive:
- Predicting new interactions: modeling
- Functional annotation of novel proteins: matching, clustering
- Identifying metabolic pathways: paths, clustering
- Identifying new protein complexes: clustering, centrality

Image Source: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”,

Science 302, 1722-1736, 2003.

- Targeted advertising: clustering and centrality
- Studying the spread of information

Image Source: Nexus (Facebookapplication)

- [Krebs ’04] Post 9/11 Terrorist Network Analysis from public domain information
- Plot masterminds correctly identified from interaction patterns: centrality
- A global view of entities is often more insightful
- Detect anomalous activities by exact/approximate subgraph isomorphism.

Image Source: http://www.orgnet.com/hijackers.html

Image Source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies

for intelligence analysis, CACM, 47 (3, March 2004): pp 45-47

Graph

Algorithms

Application

Areas

Methods/

Problems

Architectures

GPUs

FPGAs

x86 multicore

servers

Massively

multithreaded

architectures

Multicore

Clusters

Clouds

Traversal

Shortest Paths

Connectivity

Max Flow

…

…

…

Social Network

Analysis

WWW

Computational

Biology

Scientific

Computing

Engineering

Find central entities

Community detection

Network dynamics

Data size

Marketing

Social Search

Gene regulation

Metabolic pathways

Genomics

Problem

Complexity

Graph partitioning

Matching

Coloring

VLSI CAD

Route planning

Factors that influence

choice of algorithm

Input data

Graph kernel

- graph sparsity (m/n ratio)
- static/dynamic nature
- weighted/unweighted,weight distribution
- vertex degree distribution
- directed/undirected
- simple/multi/hyper graph
- problem size
- granularity of computation at nodes/edges
- domain-specific characteristics

- traversal
- shortest path algorithms
- flow algorithms
- spanning tree algorithms
- topological sort
- …..

Problem: Find ***

- paths
- clusters
- partitions
- matchings
- patterns
- orderings

Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations

- Applications
- Designing parallel graph algorithms, performance on current systems
- Case studies: Graph traversal-based problems, parallel algorithms
- Breadth-First Search
- Single-source Shortest paths
- Betweenness Centrality

- 1735: “Seven Bridges of Königsberg” problem, resolved by Euler, one of the first graph theory results.
- …
- 1966: Flynn’s Taxonomy.
- 1968: Batcher’s “sorting networks”
- 1969: Harary’s “Graph Theory”
- …
- 1972: Tarjan’s “Depth-first search and linear graph algorithms”
- 1975: Reghbati and Corneil, Parallel Connected Components
- 1982: Misra and Chandy, distributed graph algorithms.
- 1984: Quinn and Deo’s survey paper on “parallel graph algorithms”
- …

- Idealized parallel shared memory system model
- Unbounded number of synchronous processors; no synchronization, communication cost; no parallel overhead
- EREW (Exclusive Read Exclusive Write), CREW (Concurrent Read Exclusive Write)
- Measuring performance: space and time complexity; total number of operations (work)

- Pros
- Simple and clean semantics.
- The majority of theoretical parallel algorithms are designed using the PRAM model.
- Independent of the communication network topology.

- Cons
- Not realistic, too powerful communication model.
- Algorithm designer is misled to use IPC without hesitation.
- Synchronized processors.
- No local memory.
- Big-O notation is often misleading.

- Reghbati and Corneil [1975], O(log2n) time and O(n3) processors
- Wyllie [1979], O(log2n) time and O(m) processors
- Shiloach and Vishkin [1982], O(log n) time and O(m) processors
- Reif and Spirakis [1982], O(log log n) time and O(n) processors (expected)

- Prefix sums
- Symmetry breaking
- Pointer jumping
- List ranking
- Euler tours
- Vertex collapse
- Tree contraction

Static case

- Dense graphs (m = O(n2)): adjacency matrix commonly used.
- Sparse graphs: adjacency lists
Dynamic

- representation depends on common-case query
- Edge insertions or deletions? Vertex insertions or deletions? Edge weight updates?
- Graph update rate
- Queries: connectivity, paths, flow, etc.
- Optimizing for locality a key design consideration.

Vertex

Degree

Adjacencies

- Compressed Sparse Row-like

5

8

1

0

7

3

4

6

9

Flatten

adjacency

arrays

2

Index into

adjacency

array

Size: n+1

Adjacencies

Size: 2*m

- Each processor stores the entire graph (“full replication”)
- Each processor stores n/p vertices and all adjacencies out of these vertices (“1D partitioning”)
- How to create these “p” vertex partitions?
- Graph partitioning algorithms: recursively optimize for conductance (edge cut/size of smaller partition)
- Randomly shuffling the vertex identifiers ensures that edge count/processor are roughly the same

- Consider a logical 2D processor grid (pr * pc = p) and the matrix representation of the graph
- Assign each processor a sub-matrix (i.e, the edges within the sub-matrix)

9 vertices, 9 processors, 3x3 processor grid

5

8

1

0

7

3

4

6

Flatten

Sparse matrices

2

Per-processor local graph

representation

- A wide range seen in graph algorithms: array, list, queue, stack, set, multiset, tree
- Implementations are typically array-based for performance considerations.
- Key data structure considerations in parallel graph algorithm design
- Practical parallel priority queues
- Space-efficiency
- Parallel set/multisetoperations, e.g., union, intersection, etc.

- Concurrency
- Simulating PRAM algorithms: hardware limits of memory bandwidth, number of outstanding memory references; synchronization

- Locality
- Try to improve cache locality, but avoid “too much” superfluous computation

- Work-efficiency
- Is (Parallel time) * (# of processors) = (Serial work)?

Serial Performance of “approximate betweenness centrality” on a 2.67 GHz Intel Xeon 5560 (12 GB RAM, 8MB L3 cache)

Input: Synthetic R-MAT graphs (# of edges m = 8n)

No Last-level Cache (LLC) misses

~ 5X drop in

performance

O(m) LLC misses

- Graph topology assumptions in classical algorithms do not match real-world datasets
- Parallelization strategies at loggerheads with techniques for enhancing memory locality
- Classical “work-efficient” graph algorithms may not fully exploit new architectural features
- Increasing complexity of memory hierarchy, processor heterogeneity, wide SIMD.

- Tuning implementation to minimize parallel overhead is non-trivial
- Shared memory: minimizing overhead of locks, barriers.
- Distributed memory: bounding message buffer sizes, bundling messages, overlapping communication w/ computation.

- Applications
- Designing parallel graph algorithms, performance on current systems
- Case studies: Graph traversal-based problems, parallel algorithms
- Breadth-First Search
- Single-source Shortest paths
- Betweenness Centrality

1

2

distance from

source vertex

Input:

Output:

5

8

4

1

3

1

3

4

2

0

7

3

4

6

9

source

vertex

1

2

- Memory requirements (# of machine words):
- Sparse graph representation: m+n
- Stack of visited vertices: n
- Distance array: n

Test data

(Data: Mislove et al.,

IMC 2007.)

Synthetic R-MAT

networks

5

8

1. Expand current frontier (level-synchronous approach, suited for low diameter graphs)

1

- O(D) parallel steps
- Adjacencies of all vertices
- in current frontier are
- visited in parallel

0

7

3

4

6

9

source

vertex

2

2. Stitch multiple concurrent traversals (Ullman-Yannakakis approach, suited for high-diameter graphs)

5

8

- path-limited searches from “super vertices”
- APSP between “super vertices”

1

source

vertex

0

7

3

4

6

9

2

Locality (where are the random accesses originating from?)

1. Ordering of vertices in the “current frontier” array, i.e., accesses to adjacency indexing array, cumulative accesses O(n).

2. Ordering of adjacency list of each vertex, cumulative O(m).

3. Sifting through adjacencies to check whether visited or not, cumulative accesses O(m).

84

53

93

31

0

44

74

63

11

26

1. Access Pattern: idx array -- 53, 31, 74, 26

2,3. Access Pattern: d array -- 0, 84, 0, 84, 93, 44, 63, 0, 0, 11

Flickr social network

Youtube social network

Edge filtering

Graph expansion

x

x

x

x

x

x

x

x

x

x

x

x

x

- Well-studied problem, slight differences in problem formulations
- Linear algebra: sparse matrix column reordering to reduce bandwidth, reveal dense blocks.
- Databases/data mining: reordering bitmap indices for better compression; permuting vertices of WWW snapshots, online social networks for compression

- NP-hard problem, several known heuristics
- We require fast, linear-work approaches
- Existing ones: BFS or DFS-based, Cuthill-McKee, Reverse Cuthill-McKee, exploit overlap in adjacency lists, dimensionality reduction

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

- Recall: Potential O(m) non-contiguous memory references in edge traversal (to check if vertex is visited).
- e.g., access order: 53, 31, 31, 26, 74, 84, 0, …

- Objective: Reduce TLB misses, private
cache misses, exploit shared cache.

- Optimizations:
- Sort the adjacency lists of each vertex – helps
order memory accesses, reduce TLB misses.

- Permute vertex labels – enhance spatial locality.
- Cache-blocked edge visits – exploit temporal locality.

- Sort the adjacency lists of each vertex – helps

84

53

93

31

0

44

74

63

11

26

Metadata denoting

blocking pattern

Process

high-degree

vertices

separately

Adjacencies (d)

Adjacencies (d)

- Instead of processing adjacencies of each vertex serially, exploit sorted adjacency list structure w/ blocked accesses
- Requires multiple passes through the frontier array, tuning for optimal block size.
- Note: frontier array size may be O(n)

- Requires multiple passes through the frontier array, tuning for optimal block size.

3

2

1

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

x

frontier

frontier

x

x

Tune to L2

cache size

x

x

x

x

x

x

x

x

x

x

x

x

New: cache-blocked approach

linear processing

Similar to older heuristics, but tuned for small-world networks:

- High percentage of vertices with (out) degrees 0, 1, and 2 in social and information networks => store adjacencies explicitly (in indexing data structure).
- Augment the adjacency indexing data structure (with two additional words) and frontier array (with one bit)

- Process “high-degree vertices” adjacencies in linear order, but other vertices with d-array cache blocking.
- Form dense blocks around high-degree vertices
- Reverse Cuthill-McKee, removing degree 1 and degree 2 vertices

1. Software prefetchingon the Intel Core i7 (supports 32 loads and 20 stores in flight)

- Speculative loads of index array and adjacencies of frontier vertices will reduce compulsory cache misses.
2. Aligning adjacency lists to optimize memory accesses

- 16-byte aligned loads and stores are faster.
- Alignment helps reduce cache misses due to fragmentation
- 16-byte aligned non-temporal stores (during creation of new frontier) are fast.
3. SIMD SSE integer intrinsicsto process “high-degree vertex” adjacencies.

4. Fast atomics (BFS is lock-free w/ low contention, and CAS-based intrinsicshave very low overhead)

5. Hugepagesupport (significant TLB miss reduction)

6. NUMA-aware memory allocation exploiting first-touch policy

- Intel Xeon 5560 (Core i7, “Nehalem”)
- 2 sockets x 4 cores x 2-way SMT
- 12 GB DRAM, 8 MB shared L3
- 51.2 GBytes/sec peak bandwidth
- 2.66 GHz proc.

Performance averaged over

10 different source vertices, 3 runs each.

* Optimization speedup (performance on 4 cores) w.r.t baseline parallel approach, on a synthetic R-MAT graph (n=223,m=226)

Performance count: # of non-contiguous memory accesses (assuming cache line size of 16 words)

Theoretical count of the number of non-contiguous memory accesses: m+3n

Execution time:

0.28 seconds (8 threads)

Parallel speedup: 4.9

Speedup over

baseline: 2.9

Graph: 3.07 million vertices, 220 million edges

Single socket of Intel Xeon 5560 (Core i7)

- How well does my implementation match theoretical bounds?
- When can I stop optimizing my code?
- Begin with asymptotic analysis
- Express work performed in terms of machine-independent performance counts
- Add input graph characteristics in analysis
- Use simple kernels or micro-benchmarks to provide an estimate of achievable peak performance
- Relate observed performance to machine characteristics

- BFS (from a single vertex) on a static, undirected R-MAT network with average vertex degree 16.
- Evaluation criteria: largest problem size that can be solved on a system, minimum execution time.
- Reference MPI, shared memory implementations provided.
- NERSC Franklin system is ranked #2 on current list (Nov 2010).
- BFS using 500 nodes of Franklin

- Applications
- Designing parallel graph algorithms, performance on current systems
- Case studies: Graph traversal-based problems, parallel algorithms
- Breadth-First Search
- Single-source Shortest paths
- Betweenness Centrality

- Edge weights: concurrency primary challenge!
- No known PRAM algorithm that runs in sub-linear time and O(m+nlog n) work
- Parallel priority queues: relaxed heaps [DGST88], [BTZ98]
- Ullman-Yannakakis randomized approach [UY90]
- Meyer and Sanders, ∆ - stepping algorithm [MS03]
- Distributed memory implementations based on graph partitioning
- Heuristics for load balancing and termination detection

K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “An Experimental Study of A Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances,” Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.

- Label-correcting algorithm: Can relax edges from unsettled vertices also
- ∆ - stepping: “approximate bucket implementation of Dijkstra’s algorithm”
- ∆: bucket width
- Vertices are ordered using buckets representing priority range of size ∆
- Each bucket may be processed in parallel

∆ = 0.1 (say)

0.05

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R

- Relax heavy request pairs (from S)
- Go on to the next bucket

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

∞

∞

∞

∞

∞

∞

∞

Buckets

0.05

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R

- Relax heavy request pairs (from S)
- Go on to the next bucket

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

∞

∞

∞

∞

∞

∞

0

Buckets

Initialization:

Insert s into bucket, d(s) = 0

0

0

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R

- Relax heavy request pairs (from S)
- Go on to the next bucket

0.05

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

∞

∞

∞

∞

∞

∞

0

R

2

Buckets

0

0

.01

S

0.05

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R

- Relax heavy request pairs (from S)
- Go on to the next bucket

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

∞

∞

∞

∞

∞

∞

0

2

Buckets

R

0

.01

0

S

0.05

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R

- Relax heavy request pairs (from S)
- Go on to the next bucket

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

∞

∞

∞

∞

∞

0

.01

R

Buckets

0

2

0

S

0.05

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R

- Relax heavy request pairs (from S)
- Go on to the next bucket

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

∞

∞

∞

∞

∞

0

.01

Buckets

R

1

3

0

2

.03

.06

0

S

0.05

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R

- Relax heavy request pairs (from S)
- Go on to the next bucket

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

∞

∞

∞

∞

∞

0

.01

R

Buckets

1

3

0

.03

.06

S

0

2

0.05

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R

- Relax heavy request pairs (from S)
- Go on to the next bucket

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

∞

∞

∞

0

.03

.01

.06

Buckets

R

3

0

1

S

0

2

0.05

- One parallel phase
- while (bucket is non-empty)
- Inspect light edges
- Construct a set of “requests” (R)
- Clear the current bucket
- Remember deleted vertices (S)
- Relax request pairs in R

- Relax heavy request pairs (from S)
- Go on to the next bucket

0.56

3

6

0.07

0.01

0.15

0.23

4

2

0

0.02

0.13

0.18

5

1

d array

0 1 2 3 4 5 6

0

.29

.03

.01

.06

.16

.62

Buckets

R

1

4

2

5

S

0

1

3

2

6

6

Classify edges as “heavy” and “light”

Relax light edges (phase) Repeat until B[i] Is empty

Relax heavy edges. No reinsertions in this step.

high

diameter

low

diameter

Fewer buckets,

more parallelism

- Applications
- Designing parallel graph algorithms, performance on current systems
- Case studies: Graph traversal-based problems, parallel algorithms
- Breadth-First Search
- Single-source Shortest paths
- Betweenness Centrality

- Centrality: Quantitative measure to capture the importance of a vertex/edge in a graph
- degree, closeness, eigenvalue, betweenness

- Betweenness Centrality
( : No. of shortest paths between sand t)

- Applied to several real-world networks
- Social interactions
- WWW
- Epidemiology
- Systems biology

- All-pairs shortest path approach: compute the length and number of shortest paths between all s-t pairs (O(n3)time), sum up the fractional dependency values (O(n2) space).
- Brandes’ algorithm (2003): Augment a single-source shortest path computation to count paths; uses the Bellman criterion; O(mn)work and O(m+n) space.

- Madduri, Bader(2006): parallel algorithms for computing exact and approximate betweenness centrality
- low-diameter sparse graphs (diameter D = O(log n), m = O(nlog n))
- Exact algorithm: O(mn) work, O(m+n)space, O(nD+nm/p) time.

- Madduri et al.(2009): New parallel algorithm with lower synchronization overheadand fewer non-contiguous memory references
- In practice, 2-3X faster than previous algorithm
- Lock-free => better scalability on large parallel systems

- Consider an undirected, unweighted graph
- High-level idea: Level-synchronous parallel Breadth-First Search augmented to compute centrality scores
- Two steps
- traversal and path counting
- dependency accumulation

5

8

1

0

7

3

4

6

9

2

1. Traversal step: visit adjacent vertices, update distance

and path counts.

5

8

1

0

7

3

4

6

9

source

vertex

2

1. Traversal step: visit adjacent vertices, update distance

and path counts.

P

S

D

5

8

0

1

1

0

0

7

3

4

6

9

1

0

source

vertex

2

7

1

0

0

2

5

0

1. Traversal step: visit adjacent vertices, update distance

and path counts.

P

S

D

5

8

0

1

1

0

2

2

7

0

7

3

4

6

9

8

3

1

0

source

vertex

2

7

1

0

0

2

5

2

5

7

0

Level-synchronous approach: The adjacencies of all vertices

in the current frontier can be visited in parallel

Traversal step: at the end, we have all reachable vertices,

their corresponding predecessor multisets, and D values.

P

S

D

5

8

2

0

1

6

1

6

1

0

4

2

2

7

0

7

3

4

6

9

8

3

8

3

1

0

source

vertex

2

8

7

1

0

0

2

5

2

5

7

0

6

Level-synchronous approach: The adjacencies of all vertices

in the current frontier can be visited in parallel

- Exploit concurrency in visiting adjacencies, as we assume that the graph diameter is small: O(log n)
- Upper bound on size of each predecessor multiset: In-degree
- Potential performance bottlenecks: atomic updates to predecessor multisets, atomic increments of path counts
- New algorithm: Based on observation that we don’t need to store “predecessor” vertices. Instead, we store successor edges along shortest paths.
- simplifies the accumulation step
- reduces an atomic operation in traversal step
- cache-friendly!

All the vertices are in a

contiguous block (stack)

for all vertices u at level d in parallel do

for all adjacencies v of u in parallel do

dv = D[v];

if (dv < 0)

vis = fetch_and_add(&Visited[v], 1);

if (vis == 0)

D[v] = d+1;

pS[count++] = v;

fetch_and_add(&sigma[v], sigma[u]);

fetch_and_add(&Scount[u], 1);

if (dv == d + 1)

fetch_and_add(&sigma[v], sigma[u]);

fetch_and_add(&Scount[u], 1);

All the adjacencies of a vertex are

stored compactly (graph rep.)

Non-contiguous

memory access

Non-contiguous

memory access

Non-contiguous

memory access

Store to S[u]

Better cache utilization likely if D[v], Visited[v], sigma[v] are stored contiguously

2. Accumulation step: Pop vertices from stack,

update dependence scores.

Delta

P

S

5

8

2

1

6

1

6

0

4

2

7

0

7

3

4

6

9

8

3

8

3

0

source

vertex

2

8

7

0

0

2

5

5

7

0

6

2. Accumulation step: Can also be done in a

level-synchronous manner.

Delta

P

S

5

8

2

1

6

1

6

0

4

2

7

0

7

3

4

6

9

8

3

8

3

0

source

vertex

2

8

7

0

0

2

5

5

7

0

6

All the vertices are in a

contiguous block (stack)

for level d = GraphDiameter-2 to 1 do

for all vertices v at level d in parallel do

for all w in S[v] in parallel do reduction(delta)

delta_sum_v = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];

BC[v] = delta[v] = delta_sum_v;

Each S[v] is a

contiguous

block

Only floating point operation in code

43 interactions

Protein Ensembl ID

ENSG00000145332.2

Kelch-like protein 8

System requirements: High (on-chip memory, DRAM, network, IO) bandwidth.

Solution: Efficiently utilize available memory bandwidth.

Algorithmic innovation

to avoid corner cases.

Improve

locality

“RandomAccess”-like

Locality

Problem

Complexity

Data reduction/

Compression

“Stream”-like

Faster

methods

Peta+

O(n)

104

1012

106

108

O(nlog n)

O(n2)

Work

performed

Problem size (n: # of vertices/edges)

- Applications: Internet and WWW, Scientific computing, Data analysis, Surveillance
- Overview of parallel graph algorithms
- PRAM algorithms
- graph representation

- Parallel algorithm case studies
- BFS: locality, level-synchronous approach, multicore tuning
- Shortest paths: exploiting concurrency, parallel priority queue
- Betweenness centrality: importance of locality, atomics

- Questions?