
Output Space Sampling for (sub) Graph Patterns



  1. Output Space Sampling for (sub) Graph Patterns Mohammad Hasan, Mohammed Zaki RPI, Troy, NY

  2. Graph Mining: Given a database of graphs and a support threshold, find all frequent subgraphs. Output Space: The set of frequent subgraphs. Sampling: Returning a random subgraph from the output space without complete enumeration.

  3. Motivation

  4. Consider the following problem from Medical Informatics: tissue images are converted to cell graphs, discriminatory subgraphs are mined from the cell graphs, and a classifier uses them to label each tissue sample as damaged, healthy, or diseased.

  5. Mining Task • Dataset: 30 graphs, average vertex count 2154, average edge count 36945 • Support: 40% • Result: no result (using gSpan and Gaston) after a week of running on a 2 GHz dual-core PC with 4 GB of RAM running Linux

  6. Limitations of Existing Subgraph Mining Algorithms • Work only for small graphs • The most popular datasets in graph mining are chemical graphs • Chemical graphs are mostly tree-like • In the DTP dataset (the most popular dataset), the average vertex count is 43 and the average edge count is 45 • Perform a complete enumeration • For a large input graph, the output set is neither enumerable nor usable • They follow a fixed enumeration order • A partial run does not efficiently generate the interesting subgraphs ⇒ Goal: avoid complete enumeration and instead sample a set of interesting subgraphs from the output set

  7. Why is sampling a solution? • Observation 1: • Mining is only an exploratory step; mined patterns are generally used in a subsequent KD task • Not all frequent patterns are equally important for the desired task at hand • A large output set leads to an information-overload problem ⇒ complete enumeration is generally unnecessary • Observation 2: • Traditional mining algorithms explore the output space in a fixed enumeration order • Good for generating non-duplicate candidate patterns • But subsequent patterns in that order are very similar ⇒ Sampling can change the enumeration order so that interesting and non-redundant subgraphs are found with higher probability

  8. Problem Definition

  9. Output Space • Traditionally, the frequent subgraphs for a given support threshold • Can also be augmented with other constraints • To find good patterns for the desired KD task (Figure: an input space of database graphs and the corresponding output space for FPM with support = 2)

  10. Sampling from Output Space • Return a random pattern from the output set • The random pattern is obtained by sampling from a desired distribution • Define an interestingness function f : F → R+; f(p) returns the score of pattern p • The desired sampling distribution is proportional to the interestingness score • If the output space has only 3 patterns with scores 2, 3, 4, sampling should be performed from the distribution {2/9, 3/9, 4/9} (a toy sketch follows) • Efficiency consideration • Enumerate as few auxiliary patterns as possible
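Since the output space is never materialized in practice, the following is only a toy illustration of what the target distribution means; it assumes the three patterns and scores from the example above are given explicitly.

```python
import random

# Toy output space: three patterns with interestingness scores 2, 3, 4.
patterns = ["p1", "p2", "p3"]
scores = [2, 3, 4]

# random.choices normalizes the weights, so this draws from the
# target distribution {2/9, 3/9, 4/9}.
sample = random.choices(patterns, weights=scores, k=1)[0]
print(sample)
```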

  11. How to choose f? • Depends on application needs • For exploratory data analysis (EDA), every frequent pattern can have a uniform score • For top-k pattern mining, support values can be used as scores (support-biased sampling) • For the subgraph summarization task, only maximal graph patterns have a uniform non-zero score • For graph classification, discriminatory subgraphs should have high scores

  12. Algorithm

  13. Challenges • The output space cannot be instantiated • Complete statistics about the output space are not known • The target distribution is not known entirely (Figure: the output space of graph mining, patterns g1 … g5; we want to sample graphs g1, …, gn in proportion to scores s1, …, sn)

  14. MCMC Sampling • Solution approach (MCMC sampling) • Perform a random walk in the output space • Represent the output space as a transition graph (for instance, the partial order graph, POG) to allow local transitions • Edges of the transition graph are chosen based on structural similarity • Make sure the random walk is ergodic In the POG, every pattern is connected to its sub-patterns (with one less edge) and all its super-patterns (with one more edge); a sketch of the down-neighborhood computation follows.
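A minimal sketch of one half of the POG neighborhood, assuming patterns are connected networkx graphs; the helper name `sub_patterns` is illustrative, not from the paper. Up-neighbors (one-edge extensions) additionally require the database to enumerate candidate extensions, so they are omitted here.

```python
import networkx as nx

def sub_patterns(g: nx.Graph):
    """Yield the POG down-neighbors of pattern g: the connected
    patterns obtained by deleting exactly one edge from g."""
    for u, v in list(g.edges()):
        h = g.copy()
        h.remove_edge(u, v)
        # Drop endpoints left isolated by the deletion.
        h.remove_nodes_from([n for n in (u, v) if h.degree(n) == 0])
        if h.number_of_edges() > 0 and nx.is_connected(h):
            yield h
```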

  15. Algorithm • Define the transition graph (for instance, the POG) • Define an interestingness function that selects the desired sampling distribution • Perform a random walk on the transition graph • Compute the neighborhood locally • Compute the transition probability • Utilizes the interestingness score • This makes the method generic • Return the currently visited pattern after k iterations (see the sketch below)
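The whole procedure fits into a short generic loop. This is a sketch under two assumptions: the caller supplies `neighbors(p)`, the locally computed POG neighborhood restricted to frequent patterns (assumed non-empty), and a strictly positive score function `f`; both names, and `oss_sample` itself, are illustrative.

```python
import random

def oss_sample(start, neighbors, f, k=1000):
    """Generic Metropolis-Hastings random walk over the pattern space;
    the stationary distribution is proportional to the score f(p)."""
    cur = start
    cur_nbrs = list(neighbors(cur))
    for _ in range(k):
        nxt = random.choice(cur_nbrs)      # uniform proposal
        nxt_nbrs = list(neighbors(nxt))
        # q(cur, nxt) = 1/|N(cur)| and q(nxt, cur) = 1/|N(nxt)|, so the
        # MH ratio is (f(nxt) * |N(cur)|) / (f(cur) * |N(nxt)|).
        accept = min(1.0, f(nxt) * len(cur_nbrs) / (f(cur) * len(nxt_nbrs)))
        if random.random() < accept:
            cur, cur_nbrs = nxt, nxt_nbrs
    return cur
```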

  16. Local Computation of Output Space • Patterns that are not part of the output space are discarded during local neighborhood computation (Figure: the super-patterns and sub-patterns g1 … g5 of a pattern u; the transition probabilities out of u sum to 1)

  17. Compute P to Achieve the Target Distribution • If π is the stationary distribution and P is the transition matrix, then in equilibrium we have πP = π • The main task is to choose P so that the desired stationary distribution is achieved • In fact, we compute only one row of P (local computation) We want to sample graphs g1, …, gn in proportion to scores s1, …, sn.
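The equilibrium condition πP = π can be checked numerically on a toy instance. A sketch, assuming a fully connected 3-pattern transition graph with scores 2, 3, 4, and using the Metropolis-Hastings rule (introduced on the next slide) as one standard way to build P:

```python
import numpy as np

scores = np.array([2.0, 3.0, 4.0])                 # interestingness values
pi = scores / scores.sum()                         # target distribution
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])  # toy transition graph

n = len(scores)
deg = adj.sum(axis=1)
P = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if adj[i, j]:
            q_ij, q_ji = 1.0 / deg[i], 1.0 / deg[j]
            P[i, j] = q_ij * min(1.0, (pi[j] * q_ji) / (pi[i] * q_ij))
    P[i, i] = 1.0 - P[i].sum()                     # rejected moves stay put

print(np.allclose(pi @ P, pi))                     # True: pi is stationary
```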

  18. Use the Metropolis-Hastings (MH) Algorithm 1. Fix an arbitrary proposal distribution q beforehand 2. Find a neighbor j (to move to) using q 3. Compute the acceptance probability and accept the move with that probability 4. If accepted, move to j; otherwise, go back to step 2 (Figure: an example walk on a small transition graph with vertices 0-5)

  19. Uniform Sampling of Frequent Patterns • Target distribution: 1/n, 1/n, …, 1/n • How to achieve it? • Use a uniform proposal distribution • The acceptance probability is then min(1, d_i / d_j), where d_x is the degree of vertex x in the POG (illustrated below)
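In code, the uniform-sampling acceptance rule is a one-liner; a tiny sketch with a worked check:

```python
def uniform_acceptance(d_i: int, d_j: int) -> float:
    """MH acceptance for a uniform target with a uniform proposal
    q(i, j) = 1/d_i: min(1, d_i / d_j)."""
    return min(1.0, d_i / d_j)

# A move from a degree-3 pattern to a degree-5 neighbor is accepted
# with probability 3/5; the reverse move is always accepted.
assert uniform_acceptance(3, 5) == 0.6
assert uniform_acceptance(5, 3) == 1.0
```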

  20. Uniform Sampling: Transition Probability Matrix (Figure: a small pattern lattice over items A, B, D and its transition probability matrix P)

  21. Discriminatory Subgraph Sampling • Database graphs are labeled • Subgraphs may be used as • Features for supervised classification • A graph kernel (Figure: subgraph mining yields a feature matrix of embedding counts, or binary occurrence indicators)

  22. Sampling in Proportion to Discriminatory Score (f) • Interestingness score (feature quality) • Entropy • Delta score = abs(positive support - negative support) • Direct mining is difficult • Score values (entropy, delta score) are neither monotone nor anti-monotone: for a pattern P and its child C, there is no fixed order between Score(P) and Score(C)

  23. Discriminatory Subgraph Sampling • Use the Metropolis-Hastings algorithm • Choose a neighbor uniformly as the proposal distribution • Compute the acceptance probability from the delta score: the ratio of the delta scores of j and i, times the ratio of the degrees of i and j (see the sketch below)
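Plugging a target proportional to the delta score and the uniform-neighbor proposal into the MH formula gives the acceptance probability below; a minimal sketch, with hypothetical helper names:

```python
def delta_score(pos_support: int, neg_support: int) -> int:
    """Delta score: abs(positive support - negative support)."""
    return abs(pos_support - neg_support)

def discriminatory_acceptance(f_i, f_j, d_i, d_j):
    """MH acceptance for a target proportional to the delta score f,
    with a neighbor chosen uniformly: min(1, (f_j * d_i) / (f_i * d_j))."""
    return min(1.0, (f_j * d_i) / (f_i * d_j))
```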

  24. Results

  25. Datasets

  26. Result Evaluation Metrics • Sampling quality • Our sampling distribution vs. the target sampling distribution • Median and standard deviation of visit counts • How the sampling converges (convergence rate) • Variation distance: VD(P, Q) = (1/2) Σ_x |P(x) - Q(x)| (computable as sketched below) • Scalability test • Experiments on large datasets • Quality of sampled patterns
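The variation distance is directly computable whenever both distributions are known on the (evaluation-sized) output space; a minimal sketch with made-up visit frequencies:

```python
def variation_distance(p, q):
    """Variation distance: (1/2) * sum over x of |p(x) - q(x)|.
    p and q map patterns to probabilities."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)

# Empirical visit frequencies vs. the uniform target over 3 patterns:
empirical = {"g1": 0.30, "g2": 0.36, "g3": 0.34}
target = {"g1": 1 / 3, "g2": 1 / 3, "g3": 1 / 3}
print(variation_distance(empirical, target))  # ~0.033
```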

  27. Uniform Sampling Results • Experiment setup • Run the sampling algorithm for a sufficient number of iterations and observe the visit-count distribution • For a dataset with n frequent patterns, we perform 200*n iterations (Figure: results on the DTP chemical dataset)

  28. Sampling Quality • Depends on the choice of proposal distribution • If the vertices of the POG have similar degree values, sampling is good • The earlier dataset has patterns with widely varying degree values • For the clique dataset, sampling quality is almost perfect (Figure: results on the Chess itemset dataset, 100*n iterations)

  29. Discriminatory Sampling Results (Mutagenicity dataset) (Figures: the distribution of the delta score among all frequent patterns, and the relation between sampling rate and delta score)

  30. Discriminatory sampling results (cont)

  31. Discriminatory Sampling Results (Cell Graphs) • 30 graphs in total, min-sup = 6 • No graph mining algorithm could complete on this dataset in a week of running (on a 2 GHz machine with 4 GB of RAM)

  32. Summary • Existing algorithms: depth-first or breadth-first walk on the subgraph space; rightmost extension; complete algorithm • Output space sampling: random walk on the subgraph space; arbitrary extension; sampling algorithm • Quality: sampling-quality guarantee • Scalability: visits only a small part of the search space • Non-redundancy: finds very dissimilar patterns by virtue of randomness • Genericity: in terms of pattern type and sampling objective

  33. Future Work and Discussion • It is important to choose the proposal distribution wisely to get better sampling • For large graphs, support counting is still a bottleneck • How to avoid isomorphism checking entirely • How to effectively parallelize support counting • How to make the random walk converge faster • The POG generally has a small spectral gap, so convergence is slow • This makes the algorithm costly (more steps to find good samples)

  34. Thank you!

  35. Additional Slides

  36. Acceptance Probability Computation The MH acceptance probability is built from the desired distribution π (proportional to the interestingness value f) and the proposal distribution q: α(i, j) = min(1, (π(j) × q(j, i)) / (π(i) × q(i, j)))

  37. Support-Biased Sampling • What proposal distribution should we choose? • We want to sample graphs in proportion to their supports s1, …, sn • The proposal from pattern u mixes up-moves (to super-patterns, N_up(u)) and down-moves (to sub-patterns, N_down(u)) via a parameter α; at the boundary, α = 1 if N_up(u) = ø and α = 0 if N_down(u) = ø, so the walk never proposes a move into an empty neighborhood

  38. Example of Support-Biased Sampling With α = 1/3: q(u, v) = 1/2 and q(v, u) = 1/(3×3) = 1/9; the supports are s(u) = 2 and s(v) = 3. The acceptance probability for the move u → v is min(1, (s(v) × q(v, u)) / (s(u) × q(u, v))) = min(1, (3 × 1/9) / (2 × 1/2)) = 1/3 (checked below).
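The slide's numbers can be checked mechanically against the general MH acceptance formula with a target proportional to support:

```python
# Worked example from the slide: s(u) = 2, s(v) = 3,
# q(u, v) = 1/2, q(v, u) = 1/9.
s_u, s_v = 2, 3
q_uv, q_vu = 1 / 2, 1 / 9

accept = min(1.0, (s_v * q_vu) / (s_u * q_uv))
print(accept)  # (3 * 1/9) / (2 * 1/2) = 1/3
```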

  39. Sampling Convergence

  40. Support-Biased Sampling • The scatter plot of visit count vs. support shows a positive correlation (correlation: 0.76)

  41. Specific Sampling Examples and Their Utilization • Uniform sampling of frequent patterns • To explore the frequent patterns • To set a proper value of minimum support • To do approximate counting • Support-biased sampling • To find the top-k patterns in terms of support value • Discriminatory subgraph sampling • To find subgraphs that are good features for classification
