1 / 39

Weiren Yu 1 , Jiajin Le 2 , Xuemin Lin 1 , Wenjie Zhang 1

SSDBM 2012. On the Efficiency of Estimating Penetrating Rank on Large Graphs. Weiren Yu 1 , Jiajin Le 2 , Xuemin Lin 1 , Wenjie Zhang 1. 1 University of New South Wales & NICTA, Australia 2 Donghua University, China. Contents. 4. Experimental Results. 1. Introduction.

derora
Download Presentation

Weiren Yu 1 , Jiajin Le 2 , Xuemin Lin 1 , Wenjie Zhang 1

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. SSDBM 2012 On the Efficiency of Estimating Penetrating Rank on Large Graphs Weiren Yu1, Jiajin Le2, Xuemin Lin1, Wenjie Zhang1 1 University of New South Wales & NICTA, Australia 2Donghua University, China

  2. Contents 4. Experimental Results 1. Introduction 2. Problem Definition 3. Optimization Techniques

  3. P-Rank : A New Link-based Similarity Measure • Structural Similarity Measure • PageRank [Page et. al, 1999] • SimRank [Jeh and Widom, KDD 02] • P(enetrating)-Rank similarity • A new promising structural measure [Zhao et. al. , CIKM 09] • An extension of SimRank metrics • Basic Philosophy • Two entities are similar, if(1) they are referenced by similar entities(2) they reference similar entities

  4. P-Rank Overview • Features • Avoiding “limited information problem” of SimRank • --- By taking account of both in- and out-links • Defined recursively and is computed iteratively • Applicable to any domain with object-to-object relationships • Challenges • Costly to compute P-Rank on large graphs • Naïve Iteration O(Kn4) [Zhao et. al. , CIKM 09] • Partial Sums Amortization O(Kn3)[Lizorkin et. al. , PVLDB 08] • Hard to estimate the error for P-Rank approximation • Radius- and category-based Pruning Rule O(Kd2n2) • [Zhao et. al. , CIKM 09]

  5. P-Rank Formulation • Mathematical Formula • Iterative Paradigm

  6. Contributions • Characterizing P-Rank as two forms • matrix inversion --- deterministic optimization • power series --- probabilistic computation • Deterministic optimization (off-line) • eliminating neighborhood structure redundancy • quadratic-time with an error bound • Probabilistic computation (on-line) • a sampling approach • linear-time with controlled accuracy

  7. What is P-Rank? • The similarity in a domain can be modeled as graphs. [ vertices objects , edges  relationships ] • SimRank is an important similarity measure which exploits the relationships between vertices on web graphs. (Glen Jeh & Jennifer Widom , ’02) • Basic intuition: • Two objects are similar if their neighbors are similar. (the recursive definition) • Objects are maximally similar to themselves. (the base case )

  8. Existing Similarity measures • Textual-Content Similarity (text-based) • Vector-cosine similarity, Pearson correlation in IR • Structural-Context Similarity (link-based) • PageRank • One page’s authority is decided by its neighbors’ authorities. • SimRank • Two objects are similar if they are referenced by similar objects.

  9. Contents 4. Experimental Results 1. Introduction 2. Problem Definition 3. Optimization Techniques

  10. G vs. G2 Model • Basic Graph Model: G = (V, E) For each vertex v∈V , we define: • N(v): all the neighbors of vertex v • Ni(v): individual member of N(v) • Node-pair Graph: G2 = (V2, E2) • ∀ (a, b) ∈ V2 represents a pair (a, b) of nodes in G. • ∀ ⟨(a1, b1) , (a2, b2)⟩ ∈ E2 denotes the edges ⟨a1, a2⟩ and ⟨ b1, b2 ⟩ exist in G. N(u) SimRank propagating similarity from node to node in G is associated with the propagation from pair to pair in G2. N(v) v u

  11. SimRank Equation • Definition 1 (SimRank similarity) • Let s: V2 → [0, 1] be a similarity function on G2 • if a = b,  s (a, b) = 1, • if N(a) or N(b) = ∅,  s (a, b) = 0, • otherwise: • where c is a decay factor btw. 0 & 1 Similarity btw. a & b is the average similarity btw. neighbors of a and neighbors of b.

  12. Existing Techniques for SimRank Optimization • Deterministic Method [VLDB J. ’10, EDBT ’10, APWEB ’10, etc] • ( computing s(∙, ∙) iteratively for finding a fixed point ) • Advantage: accurate • Disadvantage: high time complexity (O(Kn3) in the worse case) • Probabilistic Method [WWW ’05, SIGIR ’06] • ( estimating s(∙, ∙) stochastically by using Monte-Carlo ) s(a,b) = E (cT(a,b)) , where T (a,b) : the first meeting time btw. a & b • Advantage: scalable (linear time) • Disadvantage: low similarity quality

  13. Existing Techniques for SimRank Deterministic Computation • Jeh and Widom first proposed a SimRank model, [WWW ’02]taking O(Kn4) worst-case time. • Li et al. proposed a non-iterative approximate algorithm, [EDBT ’10]yielding O(r4n2) time for dynamic information networks. • Lizorkin et al. used a partial sum function, [VLDB ’10]reducing the time to O(Kn3) in the worst case. • Yu et al. showed a fast matrix multiplication for digraph, [APWEB ’10] requiring O(K·min (m·n, nr)), where 2<r<log27.

  14. Motivations • The time required for SimRank deterministic algorithms is still about cubic in the number of vertices for each iteration, which is costly over large graphs. • As for SimRank deterministic computation, parallel implementation has not been addressed in scientific literature yet.

  15. Our Contributions • We present an efficient spectral decomposition based algorithm for SimRank computation over undirected graphs, which reduces the time complexity from O(Kn3) to O(n3 + Kn2). • We develop a block partition technique in combination with the Parallel Linear Algebra Package (PLAPACK) to parallelize SimRank algorithm on distributed memory multi-processors. • We perform extensive evaluations of our proposed methods demonstrating the efficiency and effectiveness of our algorithms.

  16. Contents 4. Experimental Results 1. Introduction 2. Problem Definition 3. Optimization Techniques

  17. 3.1 AUG-SimRank

  18. Graph Spectrum • Definition 1. (Graph Spectrum) Given a web graph G, let QG denote its transition probability matrix. The spectrum of G is defined to be the set of the eigenvalues of QG. In symbols, directed graph undirected graph

  19. Graph Spectrum • For a digraph G, • some elements in σ(G) might be complex numbers. • For an undirected graph G, • all elements in σ(G) must be real numbers. • Theorem 1. Given an undirected graph G, all the eigenvalues of its transition probability matrix QG are real numbers associated with a complete set of orthonormal eigenvectors.

  20. Graph Spectrum • Theorem 2. For an undirected graph G, let Q = U·Λ·U−1 be a complete spectral decomposition of Q, where • U is an orthogonal matrix with real entities whose columns are eigenvectors of Q, • Λ is a real diagonal matrix whose diagonal entities give the corresponding eigenvalues. • Then we can construct the following iteration: • And SimRank similarity can be thereby obtained as follows:

  21. Key Observations • 0. SimRank Matrix Representation • 1. Spectral Predecomposition [Theorem 1] • 2. Iterative Element-wise Matrix Multiplication • 3. SimRank Matrix Computation O(n3) O(Kn2) O(n3)

  22. Key Observations (cont.) • Notice that Λ· S ·Λ =[diag (Λ) · diag(Λ)T ] ⊙ S is our trick to reduce the time complexity from O(n3) to O(n2) per iteration. = = =

  23. AUG-SimRank Algorithm • Theorem 3. For undirected graphs, SimRank can be performed for K iterations in O(n3 + K·n2) time in the worst case, where n is the number of vertices, and n ≫ K. • Preconditioning techniques may be adopted when we calculate diag (Λ) · diag(Λ)T . Once computed, this rank-1 matrix is memorized and is therefore not recomputed when subsequently required.

  24. 3.2 PAUG-SimRank

  25. Parallel AUG-SimRank • To parallelize the AUG-SimRank algorithm, we utilize PLAPACK in combination with matrix partition techniques on distributed memory architectures. • PLAPACK is a parallel ARPACK version based on MPI (Message Passing Interface) for constructing parallel linear algebra libraries. • It provides a high-level object-oriented programming interface. • The coding of parallel linear algebra routines becomes a straightforward translation of algorithms.

  26. Parallel AUG-SimRank • In the spectral predecomposition phase, • Use a PLAPACK eigen-solver to decompose Q → U · Λ · U−1. • Partition the row vector diag(Λ)T→ (Λ(1) Λ(2) · · · Λ(N) ). • In the iterative element-wise matrix multiplication phase, • Initialize the upper triangular part of M(i) ← c · diag (Λ) · Λ(i) . • Partition the similarity matrix as • Calculate each partition in parallel as

  27. Parallel AUG-SimRank (cont.) • In the SimRank matrix computation phase, • parallel computation of Sk can be performed by the following substeps: • A symmetric matrix-matrix multiplicationcan be parallelized in PLAPACK. • The upper (or lower) triangular part of Sk can be updated as • hence, Sk can be computed as

  28. Contents 4. Experimental Results 1. Introduction 2. Problem Definition 3. Optimization Techniques

  29. Experimental Studies • Hardware • 2.0GHz Pentium(R) Dual-Core / 2GB RAM • Windows Vista OS / Visual C++ 6.0 • Data Sets • Synthetic • graph with an average of 8 links per page. • 10 sample adjacency matrices from 1K to 10K with ξ ∼uniform[0; 16] out-links on each row. • Real-life • Wikipedia (3.2M articles with 110M intra-wiki links / Oct. ’07) • We choose the relationship : “a category contains an article to be a link btw. the category and the article”.

  30. Experimental Studies • Algorithms for Comparison • SimRank with partial sums. [VLDB ’10] • SOR SimRank. [APWEB ’10] • AUG SimRank. & Parallel AUG SimRank. • Evaluation Measures • CPU time : computational complexity • absolute speedup : parallel efficiency • Parameter Settings • c = 0.8, ω = 1.3, ϵ = 0.05 • p number of processors • T1 execution time of the sequential algorithm on one processor • Tp time taken on p processors

  31. Time Efficiency Evaluation

  32. Time Efficiency Evaluation

  33. Time Efficiency Evaluation

  34. Time Efficiency Evaluation

  35. Time Efficiency Evaluation

  36. Parallel Efficiency Evaluation

  37. Parallel Efficiency Evaluation

  38. Thank You !

More Related