
Discovering Frequent Subgraphs over Uncertain Graph Databases under Probabilistic Semantics

This research paper discusses the challenges and algorithms involved in mining frequent subgraph patterns in uncertain graph databases. It provides an overview of uncertain graph data and its inherent uncertainties, as well as the computational complexities of mining such data. The paper also presents recent work on mining uncertain graph data and introduces a mining algorithm using probabilistic semantics.





Presentation Transcript


  1. Discovering Frequent Subgraphs over Uncertain Graph Databases under Probabilistic Semantics • Zhaonian Zou, Hong Gao, Jianzhong Li • Data and Knowledge Engineering Research Center (DKERC), Harbin Institute of Technology (HIT), China • July 26, 2010

  2. Outline • Overview • Model of uncertain graph data • Problem statement • Algorithm • Experiments • Summary and future work

  3. Graph mining is very important • Social networks • Biological networks • Chemical compounds • VLSI • Traffic networks • Internet topology

  4. Uncertainties are inherent in graph data • Uncertainties are caused by data errors, incompleteness, imprecision, noise, etc. • There are a large number of uncertain graphs in practice. • Protein-Protein Interaction (PPI) networks: each edge carries the probability of the PPI existing in practice. • Topologies of wireless sensor networks (WSNs): each edge carries the probability of the wireless link working normally. [Figure: a PPI network over the proteins TIF34, FET3, SMT3, NTG1, RAD59 and RPC40 with a probability on each edge]

  5. Challenges in mining uncertain graph data • Different data models • Graphs + uncertainties • Semantics • Existing graph mining problems were defined on certain graph data and do not make sense on uncertain graph data. • The computational complexity of uncertain graph mining problems is generally even higher than that of their counterparts on certain graph data.

  6. Recent work on mining uncertain graph data • Mining frequent subgraph patterns under expected semantics [CIKM’09, TKDE’10] • Mining top-k maximal cliques [ICDE’10]

  7. Uncertain graphs • The probability of v1 existing in practice is 0.9. • The conditional probability of edge (v1, v2) existing in practice while v1 and v2 exist is 0.9. • By tossing a biased coin for each vertex, we obtain a subset V’ of vertices. Then, by tossing a biased coin for each edge between the vertices just selected, we get a subset E’ of edges. Thus, a certain graph (V’, E’) is obtained. [Figure: an uncertain graph on vertices v1, v2 and v3 with existence probabilities on its vertices and edges]
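
The coin-tossing process on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `sample_implicated_graph` and the probability values other than those quoted on the slide (0.9 for v1 and 0.9 for edge (v1, v2)) are assumptions made for the example, since the figure did not survive in the transcript.

```python
import random

def sample_implicated_graph(vertex_prob, edge_prob, rng):
    """Draw one certain graph (V', E') from an uncertain graph."""
    # Toss a biased coin for each vertex.
    kept_vertices = {v for v, p in vertex_prob.items() if rng.random() < p}
    # Toss a biased coin only for edges whose endpoints both survived.
    kept_edges = {
        (u, v)
        for (u, v), p in edge_prob.items()
        if u in kept_vertices and v in kept_vertices and rng.random() < p
    }
    return kept_vertices, kept_edges

# Hypothetical probabilities for a three-vertex uncertain graph.
vertex_prob = {"v1": 0.9, "v2": 0.7, "v3": 0.8}
edge_prob = {("v1", "v2"): 0.9, ("v1", "v3"): 0.4, ("v2", "v3"): 0.7}

vertices, edges = sample_implicated_graph(vertex_prob, edge_prob, random.Random(0))
print(vertices, edges)
```

Every sampled edge has both endpoints in the sampled vertex set, which is what makes the edge probabilities conditional on the vertices existing.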

  8. Uncertain graphs [Figure: an uncertain graph on vertices v1, v2 and v3, shown next to one of its implicated graphs]

  9. Uncertain graphs • An uncertain graph represents a probability distribution over all its implicated graphs. [Figure: the uncertain graph on vertices v1, v2 and v3, shown with its implicated graphs]
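
To make the "probability distribution over implicated graphs" concrete, the following sketch enumerates every implicated graph of a small uncertain graph and computes its probability. It assumes vertices exist independently and edges exist independently given their endpoints; the probability values and the helper names are hypothetical.

```python
from itertools import combinations

def powerset(items):
    """Yield every subset of items as a tuple."""
    items = list(items)
    for r in range(len(items) + 1):
        yield from combinations(items, r)

def implicated_graph_distribution(vertex_prob, edge_prob):
    """Map each implicated graph (V', E') to its probability."""
    dist = {}
    for subset in powerset(vertex_prob):
        kept_v = frozenset(subset)
        p_v = 1.0
        for v, p in vertex_prob.items():
            p_v *= p if v in kept_v else 1.0 - p
        # Only edges between surviving vertices can exist.
        candidates = [e for e in edge_prob if e[0] in kept_v and e[1] in kept_v]
        for edge_subset in powerset(candidates):
            kept_e = frozenset(edge_subset)
            p_e = 1.0
            for e in candidates:
                p_e *= edge_prob[e] if e in kept_e else 1.0 - edge_prob[e]
            dist[(kept_v, kept_e)] = p_v * p_e
    return dist

# Hypothetical probabilities for a three-vertex uncertain graph.
vertex_prob = {"v1": 0.9, "v2": 0.7, "v3": 0.8}
edge_prob = {("v1", "v2"): 0.9, ("v1", "v3"): 0.4, ("v2", "v3"): 0.7}

dist = implicated_graph_distribution(vertex_prob, edge_prob)
print(len(dist), sum(dist.values()))
```

The probabilities sum to 1 up to floating-point rounding, which is the sense in which the uncertain graph defines a distribution.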

  10. Uncertain graph databases • Uncertain graph database: D = {G1, G2, …, Gn} • Implicated graph database: D’ = {G’1, G’2, …, G’m} • There is an injection from D’ into D such that each G’i is an implicated graph of the uncertain graph it is mapped to. • An uncertain graph database represents a probability distribution over all its implicated graph databases.

  11. Frequent subgraph pattern (FSP) mining problem • The support of a subgraph pattern S in a certain graph database D is the proportion of certain graphs in D that contain S, denoted by sup_D(S). • A subgraph pattern S is frequent if the support of S is no less than a user-specified threshold 0 < minsup < 1. • Input: a certain graph database D and a threshold 0 < minsup < 1 • Output: all subgraph patterns in D with support no less than minsup • The concept of support does not make sense on uncertain graph data, since an uncertain graph contains a subgraph pattern only with some probability, not with certainty.
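
As a small illustration of support on a certain graph database, the sketch below uses a brute-force subgraph isomorphism test on vertex-labeled undirected graphs. The graph encoding (a label dictionary plus an edge list), the toy database and the function names are assumptions for this example; practical miners use far more efficient matching.

```python
from itertools import permutations

def contains(graph, pattern):
    """Brute-force test of whether pattern is subgraph isomorphic to graph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    g_edge_set = {frozenset(e) for e in g_edges}
    p_nodes = list(p_labels)
    # Try every injective mapping of pattern vertices to graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(p_labels[u] != g_labels[mapping[u]] for u in p_nodes):
            continue
        if all(frozenset((mapping[u], mapping[v])) in g_edge_set for u, v in p_edges):
            return True
    return False

def support(database, pattern):
    """Proportion of certain graphs in the database that contain the pattern."""
    return sum(contains(g, pattern) for g in database) / len(database)

# Hypothetical certain graph database with vertex labels A and B.
database = [
    ({1: "A", 2: "B", 3: "B"}, [(1, 2), (2, 3)]),
    ({1: "A", 2: "B"}, [(1, 2)]),
    ({1: "B", 2: "B"}, [(1, 2)]),
]
pattern = ({"a": "A", "b": "B"}, [("a", "b")])
minsup = 0.5

sup = support(database, pattern)
print(sup, sup >= minsup)
```

Here the single-edge pattern A–B occurs in two of the three graphs, so it is frequent for minsup = 0.5.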

  12. FSP mining problem on uncertain graph databases under probabilistic semantics • Let Imp(D) denote the set of all implicated graph databases of an uncertain graph database D. • The φ-frequent probability of a subgraph pattern S in an uncertain graph database D is the probability of S having support no less than φ across all implicated graph databases of D, denoted Pr_{D,φ}(S). • Input: an uncertain graph database D, a support threshold 0 < φ < 1 and a confidence threshold 0 < τ < 1 • Output: all subgraph patterns in D with φ-frequent probability no less than τ
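
The definition can be checked by brute force on tiny inputs. The sketch below is a simplification, not the authors' procedure: it assumes each uncertain graph independently ends up in one of three states (its implicated graph contains S, is present without S, or is absent from the implicated database), and it treats an empty implicated database as not supporting S. The per-graph probabilities are hypothetical inputs.

```python
from itertools import product

def frequent_probability_bruteforce(outcome_probs, phi):
    """Sum the probability of every outcome in which S has support >= phi.

    outcome_probs[k] = (p_contains_S, p_present_without_S, p_absent)
    for the k-th uncertain graph; the three values sum to 1.
    """
    total = 0.0
    for outcome in product(range(3), repeat=len(outcome_probs)):
        prob = 1.0
        for k, state in enumerate(outcome):
            prob *= outcome_probs[k][state]
        i = outcome.count(0)  # implicated graphs that contain S
        j = outcome.count(1)  # implicated graphs that do not contain S
        if i + j > 0 and i / (i + j) >= phi:
            total += prob
    return total

# Hypothetical database of three uncertain graphs, each always present
# and containing S with probability 0.5.
probs = [(0.5, 0.5, 0.0)] * 3
phi, tau = 0.5, 0.4
pr = frequent_probability_bruteforce(probs, phi)
print(pr, pr >= tau)
```

The enumeration visits 3^n outcomes, so it is only usable for very small n.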

  13. How hard is the FSP mining problem? • #P is a complexity class of counting problems such as DNF Counting, Hamiltonian Circuit Counting, Perfect Matching Counting, etc. • It is #P-hard to count the number of frequent subgraph patterns in an uncertain graph database. • Shown by a polynomial-time reduction from the problem of counting the number of frequent subgraphs in a certain graph database [Yang 04] • It is #P-hard to compute the φ-frequent probability of a subgraph pattern in an uncertain graph database. • Shown by a polynomial-time reduction from the Monotone k-DNF Counting problem [Valiant 79] • None of the existing algorithms for mining FSPs on certain graph databases can solve this problem. • Approximate mining is an important approach when small errors are tolerable.

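  14. Goal of approximate mining • It is intractable to exactly compute all frequent subgraph patterns. • The φ-frequent probabilities of the output patterns must be no less than τ – ε, where ε is the error tolerance. [Figure: subgraph patterns plotted by their φ-frequent probability on an axis from 0 to 1.0]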

  15. Overview of mining algorithm • Organize all subgraph patterns into a search tree according to their DFS codes [Yan & Han ICDM’01]. • If S is subgraph isomorphic to S’, then Pr_{D,φ}(S) ≥ Pr_{D,φ}(S’). • The key to the algorithm is quickly determining whether the φ-frequent probability of a subgraph pattern must be no less than τ – ε and is probably no less than τ. [Figure: uncertain graphs G1 and G2 with vertex labels A and B, edge labels x, y and z, and existence probabilities]

  16. Method for verifying subgraph patterns • Step 1: Approximate the φ-frequent probability of S by an interval [l, u] having width at most ε. • Step 2: Test the following conditions to determine whether to output S or discard it. [Figure: the interval [l, u] on the φ-frequent probability axis from 0 to 1, with the cases in which S is output or discarded]

  17. Dynamic programming for exactly computing φ-frequent probabilities • Let T[0..n, 0..n, 0..n] be a three-dimensional table. • T[i, j, k] stores the probability that an implicated graph database of {G1, G2, …, Gk} contains i + j certain graphs and that S is subgraph isomorphic to i certain graphs in it. • Recursive equation (general case), which depends on the probability of S occurring in G. • We obtain Pr_{D,φ}(S) by summing up all T[i, j, n] such that i/(i + j) ≥ φ.
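
The recursive equation itself is not preserved in the transcript, so the following is a sketch of one recurrence that fits the table definition above, not necessarily the authors' exact formula. It assumes each G_k independently contributes a graph containing S (probability `p_in`), a graph not containing S (`p_out`), or no graph at all (`p_absent`); these three inputs and the function name are hypothetical.

```python
def frequent_probability_dp(outcome_probs, phi):
    """Fill T[i][j][k] and sum the entries of layer n with i / (i + j) >= phi.

    outcome_probs[k - 1] = (p_in, p_out, p_absent) for uncertain graph G_k.
    """
    n = len(outcome_probs)
    T = [[[0.0] * (n + 1) for _ in range(n + 1)] for _ in range(n + 1)]
    T[0][0][0] = 1.0  # with no graphs processed, the database is empty
    for k in range(1, n + 1):
        p_in, p_out, p_absent = outcome_probs[k - 1]
        for i in range(k + 1):
            for j in range(k + 1 - i):
                value = T[i][j][k - 1] * p_absent        # G_k contributes nothing
                if i > 0:
                    value += T[i - 1][j][k - 1] * p_in   # G_k's graph contains S
                if j > 0:
                    value += T[i][j - 1][k - 1] * p_out  # G_k's graph lacks S
                T[i][j][k] = value
    # Assumption: an empty implicated database (i + j = 0) does not support S.
    return sum(
        T[i][j][n]
        for i in range(n + 1)
        for j in range(n + 1 - i)
        if i + j > 0 and i / (i + j) >= phi
    )

# Hypothetical input: n = 3, phi = 0.5, every graph present, S occurs w.p. 0.5.
print(frequent_probability_dp([(0.5, 0.5, 0.0)] * 3, 0.5))
```

The table has O(n^3) entries and each is filled in constant time, in contrast to enumerating all implicated databases.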

  18. Dynamic programming for exactly computing φ-frequent probabilities • Let n = 3 and φ = 0.5. Give n, φ, and the probabilities of S occurring in G1, …, Gn as input. • Substitute each of these probabilities with an estimated value, because it is #P-hard to compute it exactly [Zou et al. TKDE’10]. [Figure: the table T for k = 0, 1, 2 and 3, indexed by i = 0..3 and j = 0..3]

  19. Making dynamic programming practical • A randomized algorithm has been proposed in [Zou et al. TKDE’10] that, for any 0 < ε, δ < 1, computes in polynomial time an estimated value of the probability of S occurring in G that is within error ε with probability at least 1 – δ. • To guarantee that the output of the dynamic programming is within error ε with probability at least 1 – δ, how accurate should each estimated value be? • Within error ε/(2n). • Succeed with probability at least (1 – δ)^(1/n).
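
The per-graph requirements stated on this slide are easy to compute. The helper below is a hypothetical illustration; the only reasoning it adds is that, if the n estimates succeed independently (an assumption), n successes each with probability (1 – δ)^(1/n) occur together with probability 1 – δ.

```python
def per_graph_requirements(n, eps, delta):
    """Accuracy and success probability required of each of the n estimates."""
    per_graph_error = eps / (2 * n)
    per_graph_success = (1 - delta) ** (1 / n)
    return per_graph_error, per_graph_success

# Hypothetical setting: 10 uncertain graphs, eps = delta = 0.05.
error, success = per_graph_requirements(10, 0.05, 0.05)
print(error, success)
# Joint success probability of 10 independent estimates.
print(success ** 10)
```

Note how quickly the per-graph success probability approaches 1 as n grows, which is what drives the ln(2n/δ) factor in the sampling cost.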

  20. Algorithm for computing approximate intervals of φ-frequent probabilities • Preprocessing: For i = 1 to n, compute at the beginning of the algorithm. • Step 1: For i = 1 to n, compute an estimated value of the probability of S occurring in Gi that is within error ε/(2n) with probability at least (1 – δ)^(1/n). • Step 2: Compute an estimated value X of Pr_{D,φ}(S) using the dynamic programming method with input n, φ, and the estimated values. • Step 3: Return [l, u] = [X – ε/2, X + ε/2]. • Time complexity: O(n^3 m^2 s ln(2n/δ)/ε^2) • Guarantee: |u – l| ≤ ε and Pr(l ≤ Pr_{D,φ}(S) ≤ u) ≥ 1 – δ
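
Steps 1 to 3 can be sketched as follows. This is a simplified illustration under stated assumptions: the `estimate` callable stands in for the randomized estimator of [Zou et al. TKDE’10] and its signature is invented here, and the dynamic programme assumes every uncertain graph always contributes a graph to the implicated database, so the support is simply i/n.

```python
def frequent_probability(p, phi):
    """Exact DP, assuming every graph is present and contains S w.p. p[k]."""
    dist = [1.0]  # dist[i] = Pr(S occurs in exactly i of the graphs seen so far)
    for pk in p:
        nxt = [0.0] * (len(dist) + 1)
        for i, mass in enumerate(dist):
            nxt[i] += mass * (1 - pk)
            nxt[i + 1] += mass * pk
        dist = nxt
    n = len(p)
    return sum(mass for i, mass in enumerate(dist) if i / n >= phi)

def approximate_interval(estimate, n, phi, eps, delta):
    """Return [l, u] = [X - eps/2, X + eps/2] following Steps 1 to 3."""
    per_graph_error = eps / (2 * n)
    per_graph_success = (1 - delta) ** (1 / n)
    # Step 1: one estimate per uncertain graph (hypothetical estimator API).
    estimates = [estimate(i, per_graph_error, per_graph_success) for i in range(n)]
    # Step 2: run the dynamic programme on the estimated values.
    x = frequent_probability(estimates, phi)
    # Step 3: widen X into an interval of width eps.
    return x - eps / 2, x + eps / 2

# Stand-in estimator that returns hypothetical true probabilities exactly.
true_probs = [0.5, 0.5, 0.5]
oracle = lambda i, error, success: true_probs[i]

l, u = approximate_interval(oracle, n=3, phi=0.5, eps=0.05, delta=0.05)
print(l, u)
```

In a real run the estimator would be a sampling procedure, and the interval would contain the true φ-frequent probability only with probability at least 1 – δ.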

  21. Theoretical guarantees of the mining algorithm • Any infrequent subgraph pattern S with φ-frequent probability less than τ – ε is output as a result with probability at most δ. • Any frequent subgraph pattern S is output as a result with probability at least ((1 – δ)/2)^s, where s is the number of edges of S.

  22. How to set parameter δ? • To guarantee that any frequent subgraph pattern is output as a result with probability at least 1 – Δ, parameter δ should be at most 1 – 2·(1 – Δ)^(1/ℓ), where ℓ is the maximum number of edges of frequent subgraph patterns.
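
The bound on δ can be evaluated directly. The function name below is hypothetical, and the example values are chosen only to illustrate the arithmetic; note that the expression as stated is positive only when (1 – Δ)^(1/ℓ) < 1/2.

```python
def max_delta(big_delta, max_edges):
    """Largest delta allowed by the bound 1 - 2 * (1 - Delta)^(1 / l)."""
    return 1 - 2 * (1 - big_delta) ** (1 / max_edges)

# Hypothetical example: Delta = 0.9 and patterns with at most 3 edges.
delta = max_delta(0.9, 3)
print(delta)
# Plugging delta back into ((1 - delta) / 2)^l recovers 1 - Delta.
print(((1 - delta) / 2) ** 3)
```

The second print shows the consistency between this bound and the per-pattern guarantee ((1 – δ)/2)^s for a pattern with ℓ edges.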

  23. Experiments • Test execution time and approximation quality. • Dataset • Source: the BioGRID database and the STRING database • PPI networks of six organisms

  24. Execution time vs. φ and τ (ε = δ = 0.05) • The execution time of the algorithm decreases rapidly as φ and τ get larger, because the number of subgraph patterns that need to be examined by the algorithm decreases significantly as φ and τ increase.

  25. Execution time vs. ε and δ (φ = 0.2, τ = 0.9) • The execution time of the algorithm decreases rapidly as ε and δ get larger. The number of subgraph patterns that need to be examined by the algorithm does not vary significantly, but the running time of the procedure for computing the approximate interval of Pr_{D,φ}(S) is O(n^3 m^2 s ln(2n/δ)/ε^2), which shrinks as ε and δ grow.

  26. Approximation quality vs. ε (δ = 0.05) • Precision: the proportion of frequent ones in output subgraph patterns • Recall: the proportion of output ones in frequent subgraph patterns • #OFS: the number of output frequent subgraph patterns • #OIS: the number of output infrequent subgraph patterns • #FS: the number of frequent subgraph patterns (ε = 0.02, δ = 0.001) • Since precision = #OFS/(#OFS + #OIS), it decreases as ε gets larger. • Since recall = #OFS/#FS, it is stable and almost independent of ε.
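
The two quality measures follow directly from the counts defined above. In this sketch the pattern identifiers are placeholder strings and both sets are assumed to be non-empty.

```python
def precision_recall(output_patterns, frequent_patterns):
    """Compute precision = #OFS / (#OFS + #OIS) and recall = #OFS / #FS."""
    output_patterns = set(output_patterns)
    frequent_patterns = set(frequent_patterns)
    ofs = len(output_patterns & frequent_patterns)  # output and frequent
    ois = len(output_patterns - frequent_patterns)  # output but infrequent
    fs = len(frequent_patterns)
    return ofs / (ofs + ois), ofs / fs

# Hypothetical mining result: four output patterns, five truly frequent ones.
precision, recall = precision_recall({"a", "b", "c", "x"}, {"a", "b", "c", "d", "e"})
print(precision, recall)
```

A larger ε admits more infrequent patterns into the output, which raises #OIS and lowers precision while leaving #OFS and therefore recall largely unchanged.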

  27. Approximation quality vs. δ (ε = 0.05) • The precision of the algorithm is almost independent of δ. • In theory, the recall of the algorithm should decrease as δ increases. However, the experimental results are counterintuitive. This is because the practical failure probability of the algorithm for computing the approximate interval of Pr_{D,φ}(S) is much lower than its theoretical bound.

  28. Summary • Model of uncertain graph data • An uncertain graph represents a probability distribution over all its implicated graphs. • An uncertain graph database represents a probability distribution over all its implicated graph databases. • FSP mining problem on uncertain graph databases under probabilistic semantics • Hardness results • This FSP mining problem is NP-hard. • It is #P-hard to compute the φ-frequent probability of a subgraph pattern. • Algorithm • A dynamic programming-based randomized algorithm for computing approximate intervals of φ-frequent probabilities • Thorough analysis on global and/or local theoretical guarantees

  29. Future work • De-randomize the proposed algorithm • Qualitative evaluation of mining results • Succinct models of uncertain graph data • …

  30. References • Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang. Mining Frequent Subgraph Patterns from Uncertain Graph Data. TKDE, 2010. • Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang. Finding Top-k Maximal Cliques in an Uncertain Graph. ICDE, 2010. • Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang. Frequent Subgraph Pattern Mining on Uncertain Graph Data. CIKM, 2009. Thank you! See you in today’s poster session. For more information, please visit our group at http://db.cs.hit.edu.cn.
