Oak Ridge National Laboratory Computing and Computational Sciences


Analyzing the R-MAT graph generator using occupancy theory

Blair D. Sullivan

Joint work with: Christopher Groer, Steve Poole

Rice University

CAAM Colloquium

April 12, 2010

- Introduced by Chakrabarti, Faloutsos, Zhan (2004) as a “scale-free” digraph generator (power law degree distribution).
- Recursively partitions the adjacency matrix of a graph G according to four probabilities to select the position of an edge. The number of vertices must be a power of two, say n = 2^k.
- Repeats the process M times, and may choose an edge multiple times. Duplicates are discarded at the end to form G’ with M’ distinct edges.
- Used in many applications, including the SSCA#2 HPC benchmark.

- Let α, β, γ, δ be the probabilities of the four quadrants of the adjacency matrix, with α + β + γ + δ = 1.
- Edges are generated by recursively using these parameters to choose a location in the adjacency matrix.
- Alternatively, you can think of each choice as specifying a pair of digits in the binary representations of the edge endpoints.
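The recursive choice described above can be sketched in Python as one (row bit, column bit) pair per level. This is a minimal illustration, not the authors' or the SSCA#2 implementation; the function names `rmat_edge` and `rmat_graph`, the seed handling, and the mapping of α, β, γ, δ to the bit pairs (0,0), (0,1), (1,0), (1,1) are assumptions made for the example.

```python
import random

def rmat_edge(k, alpha, beta, gamma, delta, rng=random):
    """Draw one directed edge (u, v) on n = 2**k vertices.

    Each of the k levels picks a quadrant, which fixes one bit of u
    (the row) and one bit of v (the column). delta is implied by the
    other three parameters and is kept only for readability.
    """
    u = v = 0
    for _ in range(k):
        x = rng.random()
        if x < alpha:                    # upper-left quadrant: bits (0, 0)
            ubit, vbit = 0, 0
        elif x < alpha + beta:           # upper-right quadrant: bits (0, 1)
            ubit, vbit = 0, 1
        elif x < alpha + beta + gamma:   # lower-left quadrant: bits (1, 0)
            ubit, vbit = 1, 0
        else:                            # lower-right quadrant: bits (1, 1)
            ubit, vbit = 1, 1
        u = (u << 1) | ubit
        v = (v << 1) | vbit
    return u, v

def rmat_graph(k, M, alpha, beta, gamma, delta, seed=0):
    """Generate M edges (the multigraph G) and the distinct edge set (G')."""
    rng = random.Random(seed)
    edges = [rmat_edge(k, alpha, beta, gamma, delta, rng) for _ in range(M)]
    return edges, set(edges)

if __name__ == "__main__":
    edges, distinct = rmat_graph(6, 2 ** 9, 0.55, 0.1, 0.1, 0.25, seed=1)
    print(len(edges), "edges generated,", len(distinct), "distinct")
```

The length of the returned set plays the role of M’, the number of distinct edges left after duplicates are discarded.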

- For the remainder of this talk, we will think of the vertices of G as length-k binary strings.
- Let e_α, e_β, e_γ, and e_δ be the number of positions in the paired binary representations of an edge’s endpoints corresponding to (0,0), (0,1), (1,0), and (1,1), respectively.
- Example: e = (u,v) in a graph with 2^6 vertices.

u = 0 0 0 1 1 1

v = 0 1 1 0 1 1

e_α = 1, e_β = 2, e_γ = 1, e_δ = 2

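- The probability of generating e is then:

$$P(e) = \alpha^{e_\alpha} \, \beta^{e_\beta} \, \gamma^{e_\gamma} \, \delta^{e_\delta}$$

For the example above, P(e) = α β^2 γ δ^2.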
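As a sketch, the quadrant counts for the example edge can be computed by walking the paired bits of u and v. The probability function below assumes the k recursive choices are independent, so that an edge's probability is the product of one quadrant probability per bit position. The helper names `quadrant_counts` and `edge_probability` are hypothetical.

```python
def quadrant_counts(u, v, k):
    """Count the bit pairs (u_i, v_i) of each type over the k positions."""
    names = {(0, 0): "alpha", (0, 1): "beta", (1, 0): "gamma", (1, 1): "delta"}
    counts = {"alpha": 0, "beta": 0, "gamma": 0, "delta": 0}
    for i in range(k):
        pair = ((u >> i) & 1, (v >> i) & 1)
        counts[names[pair]] += 1
    return counts

def edge_probability(u, v, k, alpha, beta, gamma, delta):
    """Probability that a single R-MAT draw lands on the edge (u, v)."""
    c = quadrant_counts(u, v, k)
    return (alpha ** c["alpha"] * beta ** c["beta"]
            * gamma ** c["gamma"] * delta ** c["delta"])

if __name__ == "__main__":
    u, v = 0b000111, 0b011011      # the example edge from the slide
    print(quadrant_counts(u, v, 6))
    # {'alpha': 1, 'beta': 2, 'gamma': 1, 'delta': 2}
    print(edge_probability(u, v, 6, 0.55, 0.1, 0.1, 0.25))
```

Summing `edge_probability` over all pairs (u, v) gives 1, which is a quick sanity check on any choice of parameters.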

- We proved the probability of generating any edge that starts at a vertex u depends solely on the number of zeros in u’s binary string, say u_z.
- Let λ = α + β and μ = α + γ be the probabilities of choosing “up” and “left” in the matrix, respectively.
- Given a vertex u, one can show the probability of an edge of the form (u,v) for some v is: $\lambda^{u_z} (1-\lambda)^{k-u_z}$
- Similarly, the probability of an edge (v,u) is: $\mu^{u_z} (1-\mu)^{k-u_z}$
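The dependence on the number of zeros alone can be checked numerically. The sketch below assumes the row and column probabilities take the closed forms λ^z (1-λ)^(k-z) and μ^z (1-μ)^(k-z), where z is the number of zeros in u, and compares them with brute-force sums of the per-edge probabilities. All function names are hypothetical.

```python
def edge_probability(u, v, k, alpha, beta, gamma, delta):
    """Product of one quadrant probability per paired bit of (u, v)."""
    quad = {(0, 0): alpha, (0, 1): beta, (1, 0): gamma, (1, 1): delta}
    p = 1.0
    for i in range(k):
        p *= quad[((u >> i) & 1, (v >> i) & 1)]
    return p

def zeros(u, k):
    """Number of zeros in the length-k binary string of u."""
    return k - bin(u).count("1")

def out_probability(u, k, lam):
    """Assumed closed form for P(a generated edge starts at u)."""
    z = zeros(u, k)
    return lam ** z * (1.0 - lam) ** (k - z)

def in_probability(u, k, mu):
    """Assumed closed form for P(a generated edge ends at u)."""
    z = zeros(u, k)
    return mu ** z * (1.0 - mu) ** (k - z)

if __name__ == "__main__":
    k = 4
    a, b, c, d = 0.55, 0.1, 0.1, 0.25
    n = 2 ** k
    worst = 0.0
    for u in range(n):
        row = sum(edge_probability(u, v, k, a, b, c, d) for v in range(n))
        col = sum(edge_probability(v, u, k, a, b, c, d) for v in range(n))
        worst = max(worst,
                    abs(row - out_probability(u, k, a + b)),
                    abs(col - in_probability(u, k, a + c)))
    print("largest discrepancy:", worst)
```

The agreement follows because the sum over v factors bit by bit: each zero bit of u contributes α + β and each one bit contributes γ + δ.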

- R-MAT naturally generates a multi-graph before duplicate edges are removed.
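- The probability of out-degree d is binomial: $P(\deg^+(u) = d) = \binom{M}{d} \left[\lambda^{u_z}(1-\lambda)^{k-u_z}\right]^d \left[1-\lambda^{u_z}(1-\lambda)^{k-u_z}\right]^{M-d}$
- The expected number of vertices with out-degree d is: $\sum_{z=0}^{k} \binom{k}{z} \binom{M}{d} \left[\lambda^{z}(1-\lambda)^{k-z}\right]^d \left[1-\lambda^{z}(1-\lambda)^{k-z}\right]^{M-d}$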
- The probability distribution for the total degree is given by:
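The binomial out-degree model for the multigraph can be evaluated directly. The sketch below assumes that each of the M generated edges starts at a given vertex with probability λ^z (1-λ)^(k-z), where z is the vertex's number of zeros, and that there are C(k, z) vertices with z zeros. It covers out-degree only, not the total-degree distribution, and the function names are hypothetical.

```python
from math import comb

def multigraph_outdegree_pmf(d, z, k, M, lam):
    """P(out-degree = d) in the multigraph for a vertex with z zeros."""
    p = lam ** z * (1.0 - lam) ** (k - z)   # chance one edge starts here
    return comb(M, d) * p ** d * (1.0 - p) ** (M - d)

def expected_vertices_with_outdegree(d, k, M, lam):
    """Expected number of vertices with out-degree d, summed over z."""
    return sum(comb(k, z) * multigraph_outdegree_pmf(d, z, k, M, lam)
               for z in range(k + 1))

if __name__ == "__main__":
    k, M, lam = 6, 2 ** 9, 0.55 + 0.1
    for d in range(0, 41, 5):
        print(d, expected_vertices_with_outdegree(d, k, M, lam))
```

Two consistency checks are available: summing the expected counts over d gives n = 2^k, and summing d times the expected count gives M, since every generated edge contributes one unit of out-degree.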

[Figure: total degree distributions for three choices of quadrant probabilities: α = β = γ = δ = .25; α = .55, β = γ = .1, δ = .25; α = δ = .15, β = .5, γ = .2]

Note that the total degree distribution varies with the choice of your quadrant probabilities.

[Figure: n = 2^6 vertices, M = 2^9 edges, M’ ~ 2^8.4 edges; α = .55, β = γ = .1, δ = .25]

- A classical occupancy problem is often described in terms of tossing r indistinguishable balls into m distinguishable urns and finding the probability that exactly n of these urns are non-empty.
- The R-MAT generator can be modeled as such a problem by envisioning the 4^k positions in the adjacency matrix as the set of urns, and the M randomly generated edges as the set of balls tossed into these urns. The number of edges M’ in the graph G’ then corresponds to the number of non-empty urns.
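Under this urn model, the expected number of non-empty urns follows from linearity of expectation: an urn with probability p is hit at least once in M tosses with probability 1 - (1 - p)^M. The sketch below groups the 4^k matrix positions by their quadrant counts, so only O(k^3) terms are needed. It assumes each position's probability is the product of quadrant probabilities over its k bit pairs, and the function name is hypothetical.

```python
from math import factorial

def expected_distinct_edges(k, M, alpha, beta, gamma, delta):
    """E[M'] = sum over matrix positions of 1 - (1 - p)^M.

    Positions with the same counts (a, b, c, d) of the four bit-pair
    types share the same probability; there are k!/(a! b! c! d!) of them.
    """
    total = 0.0
    for a in range(k + 1):
        for b in range(k + 1 - a):
            for c in range(k + 1 - a - b):
                d = k - a - b - c
                count = factorial(k) // (factorial(a) * factorial(b)
                                         * factorial(c) * factorial(d))
                p = alpha ** a * beta ** b * gamma ** c * delta ** d
                total += count * (1.0 - (1.0 - p) ** M)
    return total

if __name__ == "__main__":
    print(expected_distinct_edges(6, 2 ** 9, 0.55, 0.1, 0.1, 0.25))
```

This gives only the mean of M’; the full distribution of the number of non-empty urns requires the occupancy formulas.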

- Traditionally, when throwing balls into urns, the probability of “hitting” every urn is the same. R-MAT matrix positions have unequal probabilities, so let q = {q_1, q_2, …, q_m} be the urn probabilities.
- Let U(r, l, m, q, t) be the probability that exactly t of the first l urns are empty after tossing r balls into the set of m urns with probability vector q.
- Johnson & Kotz proved:
- Note this quantity is independent of the ordering of elements in q.
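For small cases, a quantity like U can be computed with the standard inclusion-exclusion identity for "exactly t of these events occur", applied to the events "urn i is empty". This is a sketch under that assumption and is not necessarily the form in which Johnson & Kotz state the result. The signature drops m because it equals `len(q)`, and a brute-force enumeration is included only as a cross-check.

```python
from itertools import combinations, product
from math import comb

def prob_exactly_t_empty(r, l, q, t):
    """P(exactly t of the first l urns are empty after r balls)."""
    total = 0.0
    for j in range(t, l + 1):
        # Sum over all size-j subsets A of P(every urn in A is empty).
        s_j = sum((1.0 - sum(q[i] for i in A)) ** r
                  for A in combinations(range(l), j))
        total += (-1) ** (j - t) * comb(j, t) * s_j
    return total

def brute_force(r, l, q, t):
    """Same probability by enumerating all len(q)**r toss sequences."""
    total = 0.0
    for outcome in product(range(len(q)), repeat=r):
        prob = 1.0
        for urn in outcome:
            prob *= q[urn]
        empty = sum(1 for i in range(l) if i not in outcome)
        if empty == t:
            total += prob
    return total

if __name__ == "__main__":
    q = [0.5, 0.2, 0.2, 0.1]
    for t in range(4):
        print(t, prob_exactly_t_empty(4, 3, q, t), brute_force(4, 3, q, t))
```

The inclusion-exclusion sum visits every subset of the first l urns, so its cost grows like 2^l, which is why it is practical only for small l.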

- One can now derive an expression for the probability of out-degree d by letting q_u = {p(uv)}, v = 0, 1, …, n-1, be the vector of edge probabilities in row u:
- Note that since the function U is independent of the ordering of q_u, this quantity is the same for all vertices u with a given value of u_z.
- Unfortunately, this is not a computationally convenient formula.

- A straightforward corollary to Johnson & Kotz allows us to calculate U when l is not equal to m:
- We can now think of throwing balls into the 2^k urns in a row plus a “big urn” encompassing all other possible edges, with probability 1 - Σ_v p(uv). Let q_u’ be the vector obtained by appending 1 - Σ_v p(uv) to q_u. Then,

KEY FACT: The out-degree distribution of a vertex is completely determined by the parameters k, M, α, β, γ, δ, and the number of zeros in its binary representation (u_z).

The exact out-degree distribution for the 7 values of u_z & the overall out-degree distribution for a 64-node graph with M = 8*64 and α = .55, β = γ = .1, δ = .25.

- Problem: calculating the out-degree distribution using these formulas requires massive amounts of computation, e.g. a naïve approach requires O(2^67) operations for a 64-node graph!
- Solution: we analyzed the limiting distributions.
- There has been a lot of work on the necessary conditions on a set of probability vectors to get certain distributions. For example, when the probabilities are all equal, the limiting distribution is Poisson.

Theorem (Chistyakov, 1964)

Given a set of m urns with probabilities q = {q_1, q_2, …, q_m} which sum to 1, let X be the r.v. corresponding to the number of empty urns after tossing r balls. If r, m tend to ∞ with r/m → C_1 (non-negative and finite), and m ∙ q_i ≤ C_2 < ∞ for each i, then asymptotically

X ~ N(E[X], Var[X])
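The mean and variance that parametrize the normal limit can be written down from indicator variables, one per urn. The sketch below uses the standard identities P(urn i empty) = (1 - q_i)^r and P(urns i and j both empty) = (1 - q_i - q_j)^r. These are textbook moment calculations rather than formulas taken from the slides, and the function names are hypothetical. An exhaustive enumeration is included as a cross-check for tiny cases.

```python
from itertools import product

def empty_urn_mean_var(q, r):
    """Exact E[X] and Var[X] for X = number of empty urns after r balls."""
    m = len(q)
    e = [(1.0 - qi) ** r for qi in q]          # P(urn i is empty)
    mean = sum(e)
    var = sum(ei * (1.0 - ei) for ei in e)     # variances of the indicators
    for i in range(m):
        for j in range(i + 1, m):
            both = (1.0 - q[i] - q[j]) ** r    # P(urns i and j both empty)
            var += 2.0 * (both - e[i] * e[j])  # covariance terms
    return mean, var

def moments_by_enumeration(q, r):
    """Same moments from all len(q)**r toss sequences (tiny cases only)."""
    m = len(q)
    ex = ex2 = 0.0
    for outcome in product(range(m), repeat=r):
        prob = 1.0
        for urn in outcome:
            prob *= q[urn]
        x = m - len(set(outcome))
        ex += prob * x
        ex2 += prob * x * x
    return ex, ex2 - ex * ex

if __name__ == "__main__":
    q = [0.4, 0.3, 0.2, 0.1]
    print(empty_urn_mean_var(q, 5))
    print(moments_by_enumeration(q, 5))
```

The pairwise loop costs O(m^2), which is far cheaper than evaluating the exact occupancy distribution.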

- We first proved a corollary showing that the number of empty urns among the first m-1 of the m urns is also asymptotically normally distributed with the expected mean and variance.
- Assuming that the ratio of r to m is bounded (M = O(n)), it remains to prove that m ∙ q_i ≤ C_2 < ∞ for each i. In the case of R-MAT, for every vertex u, we need to show n ∙ p(uv) → c_v for all vertices v.

- Case 1: 0 < α, β, γ, δ ≤ 0.5
- This is straightforward, as the quantity n ∙ p(uv) is uniformly bounded above by the constant 1.
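The bound in Case 1 can be written out in one line. The derivation below assumes that p(uv) is the product of quadrant probabilities over the k paired bit positions, with e_α + e_β + e_γ + e_δ = k, and uses n = 2^k together with the hypothesis that every parameter is at most 1/2.

```latex
n \cdot p(uv)
  = 2^k \, \alpha^{e_\alpha} \beta^{e_\beta} \gamma^{e_\gamma} \delta^{e_\delta}
  \le 2^k \left(\tfrac{1}{2}\right)^{e_\alpha + e_\beta + e_\gamma + e_\delta}
  = 2^k \cdot 2^{-k}
  = 1
```

The same argument fails as soon as one parameter exceeds 1/2, since a position whose bit pairs all fall in that quadrant has n ∙ p(uv) growing with k.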

- Case 2: 0 < min(α, β, γ, δ) & max(α, β, γ, δ) > 0.5
- We were able to prove that all but a vanishing proportion of the vertices satisfy the necessary criterion:
- This requires the use of Chebyshev’s inequality to prove the limit of a weighted sum of binomial coefficients.

- These results allow us to prove that the limiting distributions for in-, out-, and total-degree are asymptotically normal when all parameters are strictly positive and M = O(n):
- The overall degree distribution for G’ is thus a mixture of normal distributions (one for each value of u_z).
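The mixture can be written schematically as follows. The symbols m_z and s_z^2 are placeholders for the limiting mean and variance of the degree of a vertex with z zeros, whose formulas are not reproduced here, and φ denotes the normal density. The weights assume a uniformly chosen vertex, of which C(k, z) out of 2^k have exactly z zeros.

```latex
P(\deg(u) = d) \;\approx\; \sum_{z=0}^{k} \frac{\binom{k}{z}}{2^k} \, \phi\!\left(d;\; m_z,\; s_z^2\right)
```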

[Figure: comparison of observed versus limiting distribution, averaged over 2048 graphs with n = 2^12 and M = 2^17.]

- We can also approximate the variance of M’. We believe M’ is normally distributed, but this is still an open problem.

This is a histogram of the observed values of M’ for 216 graphs generated with n = 2^20, M = 2^23 and R-MAT parameters α = .55, β = γ = .1, δ = .25. The red line shows a normal distribution with mean and variance calculated according to our formulas.

Joint work with Chris Groer, Steve Poole

Funded by Department of Defense

- Development of consistent representations and metrics for compression.
- Computational study comparing variants of MDL (minimum description length) and binary matrix reordering (TSP)-based algorithms.
- Proved finding optimal MDL representation is NP-hard & formulated as a mixed integer program.

Joint work with Chris Groer

Funded by DOE Office of Advanced Scientific Computing Research

- Objective: Leverage theoretical work on graph decompositions to create efficient computational framework for graph-based data.

- Approach:
  - Low width decompositions of sparse application graphs
  - Algorithm complexity becomes exponential in width, but polynomial in number of nodes
  - Integrate parallel computing with decompositions for massive graph analysis

- Challenges:
  - Low width decompositions are insufficient
  - Need to control structure of the decomposition (balanced bag sizes & tree far from being a path)
  - Modify dynamic programming to run in parallel

This work was supported by the United States Department of Defense and the United States Department of Energy’s Office of Advanced Scientific Computing Research (OASCR). Resources of the Extreme Scale Systems Center at Oak Ridge National Laboratory were used for computational results.