Discovering Frequent Subgraphs over Uncertain Graph Databases under Probabilistic Semantics

Zhaonian Zou, Hong Gao, Jianzhong Li • Data and Knowledge Engineering Research Center (DKERC), Harbin Institute of Technology (HIT), China • July 26, 2010

Outline • Overview • Model of uncertain graph data • Problem statement • Algorithm • Experiments • Summary and future work

Graph mining is very important • Social networks • Biological networks • Chemical compounds • VLSI • Traffic networks • Internet topology

Uncertainties are inherent in graph data • Uncertainties are caused by data errors, incompleteness, imprecision, noise, etc. • There are a large number of uncertain graphs in practice. • Protein-Protein Interaction (PPI) networks: edge labels give the probability of the PPI existing in practice. • Topologies of wireless sensor networks (WSNs): edge labels give the probability of the wireless link working normally. [Figure: a PPI network over proteins including TIF34, FET3, SMT3, NTG1, RAD59 and RPC40, with edge probabilities between 0.147 and 0.95.]

Challenges in mining uncertain graph data • Different data models • Graphs + uncertainties • Semantics • Existing graph mining problems were defined on certain graph data and do not make sense on uncertain graph data. • The computational complexity of uncertain graph mining problems is generally even higher than that of their counterparts on certain graph data.

Recent work on mining uncertain graph data • Mining frequent subgraph patterns under expected semantics [CIKM’09, TKDE’10] • Mining top-k maximal cliques [ICDE’10]

Uncertain graphs • The probability of v1 existing in practice is 0.9. • The conditional probability of edge (v1, v2) existing in practice while v1 and v2 exist is 0.9. • By tossing a biased coin for each vertex, we obtain a subset V’ of vertices. Then, by tossing a biased coin for each edge between the vertices just selected, we get a subset E’ of edges. Thus, a certain graph (V’, E’) is obtained. [Figure: an uncertain graph on vertices v1, v2, v3 with probabilities 0.9, 0.4, 0.9, 0.7, 0.7, 0.8 labeling its vertices and edges.]
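The coin-tossing process above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_implicated_graph`, the dictionary layout, and the assignment of the example probabilities to particular vertices and edges are assumptions, not taken from the slides.

```python
import random

def sample_implicated_graph(vertex_prob, edge_prob, rng):
    """Draw one certain graph (V', E') from an uncertain graph.

    vertex_prob: {vertex: existence probability}
    edge_prob:   {(u, v): conditional existence probability given u and v exist}
    rng:         a random.Random instance (passed in for reproducibility)
    """
    # Toss a biased coin for each vertex.
    kept_vertices = {v for v, p in vertex_prob.items() if rng.random() < p}
    # Toss a biased coin only for edges whose endpoints were both kept.
    kept_edges = {
        (u, v)
        for (u, v), p in edge_prob.items()
        if u in kept_vertices and v in kept_vertices and rng.random() < p
    }
    return kept_vertices, kept_edges

# Hypothetical probabilities for a three-vertex uncertain graph.
vertex_prob = {"v1": 0.9, "v2": 0.7, "v3": 0.8}
edge_prob = {("v1", "v2"): 0.9, ("v1", "v3"): 0.4, ("v2", "v3"): 0.7}
vertices, edges = sample_implicated_graph(vertex_prob, edge_prob, random.Random(42))
```

Edges are tossed only after the vertex set is fixed, which matches the conditional reading of the edge probabilities.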

Uncertain graphs [Figure: the uncertain graph on v1, v2, v3 next to one of its implicated graphs.]

Uncertain graphs • An uncertain graph represents a probability distribution over all its implicated graphs. [Figure: the uncertain graph on v1, v2, v3 together with all of its implicated graphs.]
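Under the coin-tossing model, the probability of one particular implicated graph is a product over independent vertex and edge outcomes. The sketch below assumes that independence; the function name and the data layout are hypothetical.

```python
def implicated_graph_probability(vertex_prob, edge_prob, kept_vertices, kept_edges):
    """Probability that the uncertain graph implies exactly (kept_vertices, kept_edges)."""
    # An edge can only survive if it is a known edge and both endpoints survive.
    for (u, v) in kept_edges:
        if (u, v) not in edge_prob or u not in kept_vertices or v not in kept_vertices:
            return 0.0
    prob = 1.0
    for v, p in vertex_prob.items():
        prob *= p if v in kept_vertices else 1.0 - p
    for (u, v), p in edge_prob.items():
        # Edge coins are tossed only when both endpoints exist.
        if u in kept_vertices and v in kept_vertices:
            prob *= p if (u, v) in kept_edges else 1.0 - p
    return prob

# Hypothetical example: only v1 survives, so no edge coin is tossed.
vertex_prob = {"v1": 0.9, "v2": 0.7, "v3": 0.8}
edge_prob = {("v1", "v2"): 0.9, ("v1", "v3"): 0.4, ("v2", "v3"): 0.7}
p_only_v1 = implicated_graph_probability(vertex_prob, edge_prob, {"v1"}, set())
```

Summing this value over every implicated graph gives 1, which is what it means for the uncertain graph to represent a probability distribution.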

Uncertain graph databases • Uncertain graph database: D = {G1, G2, …, Gn} • Implicated graph database: D’ = {G’1, G’2, …, G’m} • There is an injection f from D’ to D such that each G’i is an implicated graph of f(G’i). • An uncertain graph database represents a probability distribution over all its implicated graph databases.

Frequent subgraph pattern (FSP) mining problem • The support of a subgraph pattern S in a certain graph database D is the proportion of certain graphs in D that contain S, denoted by sup_D(S). • A subgraph pattern S is frequent if the support of S is no less than a user-specified threshold 0 < minsup < 1. • Input: a certain graph database D and a threshold 0 < minsup < 1 • Output: all subgraph patterns in D with support no less than minsup • The concept of support does not make sense on uncertain graph data, since an uncertain graph contains a subgraph pattern only with some probability, not with certainty.

FSP mining problem on uncertain graph databases under probabilistic semantics • Let Imp(D) denote the set of all implicated graph databases of an uncertain graph database D. • The φ-frequent probability of a subgraph pattern S in an uncertain graph database D is the probability of S having support no less than φ across all implicated graph databases of D, denoted Pr_{D,φ}(S). • Input: an uncertain graph database D, a support threshold 0 < φ < 1 and a confidence threshold 0 < τ < 1 • Output: all subgraph patterns in D with φ-frequent probability no less than τ
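Written as a formula, the definition above reads as follows. The notation Pr(D') for the probability of an implicated graph database D' is an assumption introduced here; the remaining symbols come from the slide.

```latex
\[
  \Pr\nolimits_{D,\varphi}(S)
  \;=\;
  \sum_{\substack{D' \in \mathrm{Imp}(D) \\ \mathrm{sup}_{D'}(S) \,\ge\, \varphi}} \Pr(D'),
  \qquad
  S \text{ is output} \iff \Pr\nolimits_{D,\varphi}(S) \ge \tau .
\]
```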

How hard is the FSP mining problem? • #P is a complexity class for enumeration problems such as DNF Counting, Hamiltonian Circuit Counting, Perfect Matching Counting, etc. • It is #P-hard to count the number of frequent subgraph patterns in an uncertain graph database. • Polynomial-time reducible from the problem of counting the number of frequent subgraphs in a certain graph database [Yang 04] • It is #P-hard to compute the φ-frequent probability of a subgraph pattern in an uncertain graph database. • Polynomial-time reducible from the Monotone k-DNF Counting problem [Valiant 79] • None of the existing algorithms for mining FSPs on certain graph databases can solve this problem. • Approximate mining is an important approach when small errors are irrelevant.

Goal of approximate mining • It is intractable to exactly compute all frequent subgraph patterns. • The φ-frequent probabilities of the output subgraph patterns must be no less than τ – ε. [Figure: subgraph patterns plotted against their φ-frequent probability on an axis from 0 to 1.0.]

Overview of mining algorithm • Organize all subgraph patterns into a search tree according to their DFS codes [Yan & Han ICDM’02]. • If S is subgraph isomorphic to S’, then Pr_{D,φ}(S) ≥ Pr_{D,φ}(S’). • The key of the algorithm is quickly determining whether the φ-frequent probability of a subgraph pattern must be no less than τ – ε and is probably no less than τ. [Figure: two example uncertain graphs G1 and G2 with vertex labels A and B, edge labels x, y, z, and edge probabilities.]

Method for verifying subgraph patterns • Step 1: Approximate the φ-frequent probability of S by an interval [l, u] having width at most ε. • Step 2: Test conditions on [l, u] to determine whether to output S or discard it. [Figure: the interval [l, u] on the φ-frequent probability axis from 0 to 1, with the Output and Discard cases.]

Dynamic programming for exactly computing φ-frequent probabilities • Let T[0..n, 0..n, 0..n] be a three-dimensional table. • T[i, j, k] stores the probability that an implicated graph database of {G1, G2, …, Gk} contains i + j certain graphs and that S is subgraph isomorphic to i certain graphs in it. • Recursive equation (general case): […] • The recursion uses the probability of S occurring in G. • We obtain Pr_{D,φ}(S) by summing up all T[i, j, n] such that i/(i + j) ≥ φ.
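A minimal sketch of a table-filling scheme of this kind follows. The recursion on the slide is not reproduced here; the three-case recursion below is an assumed form. It supposes that the uncertain graphs are independent and that each G_k contributes to the implicated database either a certain graph containing S (probability `p[k]`), a certain graph not containing S (probability `r[k]`), or no graph at all (probability `1 - p[k] - r[k]`). The names `frequent_probability`, `p` and `r` are hypothetical.

```python
def frequent_probability(p, r, phi):
    """Sum of T[i][j] over all i, j with i + j > 0 and i / (i + j) >= phi.

    p[k]: assumed probability that G_k contributes a graph containing S
    r[k]: assumed probability that G_k contributes a graph not containing S
    """
    n = len(p)
    # T[i][j] after processing k graphs: probability that i contributed graphs
    # contain S and j contributed graphs do not. Only one k-slice is kept.
    T = [[0.0] * (n + 1) for _ in range(n + 1)]
    T[0][0] = 1.0
    for k in range(n):
        none_k = 1.0 - p[k] - r[k]  # G_k contributes no graph
        nxt = [[0.0] * (n + 1) for _ in range(n + 1)]
        for i in range(k + 1):
            for j in range(k + 1 - i):
                t = T[i][j]
                if t == 0.0:
                    continue
                nxt[i + 1][j] += t * p[k]
                nxt[i][j + 1] += t * r[k]
                nxt[i][j] += t * none_k
        T = nxt
    return sum(
        T[i][j]
        for i in range(n + 1)
        for j in range(n + 1 - i)
        if i + j > 0 and i >= phi * (i + j)
    )

# n = 3, phi = 0.5, every graph contains S with probability 0.5 and always
# contributes a graph: S must occur in at least 2 of the 3 graphs.
example = frequent_probability([0.5, 0.5, 0.5], [0.5, 0.5, 0.5], 0.5)
```

Keeping a single k-slice reduces memory from O(n^3) to O(n^2) without changing the result, because each slice depends only on the previous one.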

Dynamic programming for exactly computing φ-frequent probabilities • Let n = 3 and φ = 0.5. Give n, φ, and the probabilities of S occurring in G1, G2, G3 as input. • Substitute each of these probabilities with an estimated value, since it is #P-hard to compute it [Zou et al. TKDE’10]. [Figure: the slices k = 0, 1, 2, 3 of table T, each indexed by i = 0..3 and j = 0..3.]

Making dynamic programming practical • A randomized algorithm has been proposed in [Zou et al. TKDE’10] to compute, in polynomial time and for any 0 < ε, δ < 1, an estimate of the probability of S occurring in an uncertain graph that is within error ε with probability at least 1 – δ. • To guarantee the output of the dynamic programming is within error ε with probability at least 1 – δ, how accurate should each estimate be? • Within error ε/(2n). • Succeed with probability at least (1 – δ)^(1/n).
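The per-graph requirements above can be checked numerically. The sketch below assumes the n estimates succeed independently, so that their success probabilities multiply, and that their errors add up in the worst case; the function name `per_graph_parameters` is hypothetical.

```python
def per_graph_parameters(n, eps, delta):
    """Error bound and success probability required of each of the n estimates."""
    eps_each = eps / (2 * n)                # per-estimate error bound
    success_each = (1 - delta) ** (1 / n)   # per-estimate success probability
    return eps_each, success_each

n, eps, delta = 10, 0.05, 0.05
eps_each, success_each = per_graph_parameters(n, eps, delta)
overall_success = success_each ** n   # all n estimates succeed together
accumulated_error = n * eps_each      # worst-case sum of the n errors
```

With these values the joint success probability is 1 – δ and the accumulated error is ε/2.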

Algorithm for computing approximate intervals of φ-frequent probabilities • Preprocessing: For i = 1 to n, compute […] at the beginning of the algorithm. • Step 1: For i = 1 to n, compute an estimated value of the probability of S occurring in Gi that is within error ε/(2n) with probability at least (1 – δ)^(1/n). • Step 2: Compute an estimated value X of Pr_{D,φ}(S) using the dynamic programming method with input n, φ, and the estimated values. • Step 3: Return [l, u] = [X – ε/2, X + ε/2]. • Time complexity: O(n^3 m^2 s ln(2n/δ)/ε^2) • Guarantee: |u – l| ≤ ε and Pr(l ≤ Pr_{D,φ}(S) ≤ u) ≥ 1 – δ

Theoretical guarantees of the mining algorithm • Any infrequent subgraph pattern S with φ-frequent probability less than τ – ε is output as a result with probability at most […]. • Any frequent subgraph pattern S is output as a result with probability at least ((1 – δ)/2)^s, where s is the number of edges of S.

How to set parameter δ? • To guarantee any frequent subgraph pattern to be output as a result with probability at least 1 – Δ, parameter δ should be at most 1 – 2·(1 – Δ)^(1/ℓ), where ℓ is the maximum number of edges of frequent subgraph patterns.

Experiments • Test execution time and approximation quality. • Dataset • Source: the BioGRID database and the STRING database • PPI networks of six organisms

Execution time vs. φ and τ (ε = δ = 0.05) • The execution time of the algorithm decreases rapidly as φ and τ get larger, because the number of subgraph patterns that need to be examined by the algorithm decreases significantly as φ and τ increase.

Execution time vs. ε and δ (φ = 0.2, τ = 0.9) • The execution time of the algorithm decreases rapidly as ε and δ get larger. The number of subgraph patterns that need to be examined by the algorithm does not vary significantly, but the running time of the procedure for computing the approximate interval of Pr_{D,φ}(S) is O(n^3 m^2 s ln(2n/δ)/ε^2), which shrinks as ε and δ grow.

Approximation quality vs. ε (δ = 0.05) • Precision: the proportion of frequent ones in output subgraph patterns • Recall: the proportion of output ones in frequent subgraph patterns • #OFS: the number of output frequent subgraph patterns • #OIS: the number of output infrequent subgraph patterns • #FS: the number of frequent subgraph patterns (ε = 0.02, δ = 0.001) • Since precision = #OFS/(#OFS + #OIS), it decreases as ε gets larger. • Since recall = #OFS/#FS, it is stable and almost independent of ε.
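The two quality measures can be computed directly from the three counts. In the sketch below the function name and the counts are made up for illustration and are not experimental results.

```python
def precision_recall(n_ofs, n_ois, n_fs):
    """precision = #OFS / (#OFS + #OIS), recall = #OFS / #FS."""
    precision = n_ofs / (n_ofs + n_ois)
    recall = n_ofs / n_fs
    return precision, recall

# Hypothetical counts: 90 frequent and 10 infrequent patterns output,
# 100 frequent patterns in total.
precision, recall = precision_recall(n_ofs=90, n_ois=10, n_fs=100)
```

Only #OIS appears in precision but not in recall, which is why a looser ε (more infrequent patterns slipping through) lowers precision while leaving recall largely unchanged.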

Approximation quality vs. δ (ε = 0.05) • The precision of the algorithm is almost independent of δ. • In theory, the recall of the algorithm should decrease as δ increases. However, the experimental results are counterintuitive. This is because the practical failure probability of the algorithm for computing the approximate interval of Pr_{D,φ}(S) is much lower than its theoretical bound.

Summary • Model of uncertain graph data • An uncertain graph represents a probability distribution over all its implicated graphs. • An uncertain graph database represents a probability distribution over all its implicated graph databases. • FSP mining problem on uncertain graph databases under probabilistic semantics • Hardness results • This FSP mining problem is NP-hard. • It is #P-hard to compute the φ-frequent probability of a subgraph pattern. • Algorithm • A dynamic programming-based randomized algorithm for computing approximate intervals of φ-frequent probabilities • Thorough analysis on global and/or local theoretical guarantees

Future work • De-randomize the proposed algorithm • Qualitative evaluation of mining results • Succinct models of uncertain graph data • …

References • Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang. Mining Frequent Subgraph Patterns from Uncertain Graph Data. TKDE, 2010. • Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang. Finding Top-k Maximal Cliques in an Uncertain Graph. ICDE, 2010. • Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang. Frequent Subgraph Pattern Mining on Uncertain Graph Data. CIKM, 2009.

Thank you! See you in today’s poster session. For more information, please visit our group at http://db.cs.hit.edu.cn.
