
Output Space Sampling for (sub) Graph Patterns



  1. Output Space Sampling for (sub) Graph Patterns Mohammad Hasan, Mohammed Zaki RPI, Troy, NY

  2. Graph Mining: Given a database of graphs and a support threshold, find all frequent subgraphs. Output Space: The set of frequent subgraphs. Sampling: Returning a random subgraph from the output space without complete enumeration.

  3. Motivation

  4. Consider the following problem from Medical Informatics: tissue images are converted to cell graphs, discriminatory subgraphs are mined from the cell graphs, and a classifier uses them to label each tissue sample as damaged, healthy, or diseased.

  5. Mining Task • Dataset: 30 graphs, average vertex count 2154, average edge count 36945 • Support: 40% • Result: no result (using gSpan and Gaston) after a week of running on a 2 GHz dual-core PC with 4 GB of RAM running Linux

  6. Limitations of Existing Subgraph Mining Algorithms • Work only for small graphs • The most popular datasets in graph mining are chemical graphs • Chemical graphs are mostly tree-like • In the DTP dataset (the most popular dataset), the average vertex count is 43 and the average edge count is 45 • Perform a complete enumeration • For a large input graph, the output set is neither enumerable nor usable • They follow a fixed enumeration order • A partial run does not efficiently generate the interesting subgraphs ⇒ Goal: avoid complete enumeration and instead sample a set of interesting subgraphs from the output set

  7. Why is sampling a solution? • Observation 1: • Mining is only an exploratory step; mined patterns are generally used in a subsequent KD task • Not all frequent patterns are equally important for the desired task at hand • A large output set leads to an information-overload problem ⇒ complete enumeration is generally unnecessary • Observation 2: • Traditional mining algorithms explore the output space in a fixed enumeration order • Good for generating non-duplicate candidate patterns • But subsequent patterns in that order are very similar ⇒ Sampling can change the enumeration order so that interesting and non-redundant subgraphs are found with higher probability

  8. Problem Definition

  9. Output Space • Traditionally, the frequent subgraphs for a given support threshold • Can also be augmented with other constraints • To find good patterns for the desired KD task (Figure: an input space of database graphs and the corresponding output space for FPM with support = 2)

  10. Sampling from Output Space • Return a random pattern from the output set • The random pattern is obtained by sampling from a desired distribution • Define an interestingness function f : F → R+; f(p) returns the score of pattern p • The desired sampling distribution is proportional to the interestingness score • If the output space has only 3 patterns with scores 2, 3, 4, sampling should be performed from the distribution {2/9, 3/9, 4/9} (a toy sketch follows) • Efficiency consideration • Enumerate as few auxiliary patterns as possible
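Since the output space is never materialized in practice, the following is only a toy illustration of what the target distribution means; it assumes the three patterns and scores from the example above are given explicitly.

```python
import random

# Toy output space: three patterns with interestingness scores 2, 3, 4.
patterns = ["p1", "p2", "p3"]
scores = [2, 3, 4]

# random.choices normalizes the weights, so this draws from the
# target distribution {2/9, 3/9, 4/9}.
sample = random.choices(patterns, weights=scores, k=1)[0]
print(sample)
```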

  11. How to choose f? • Depends on application needs • For exploratory data analysis (EDA), every frequent pattern can have a uniform score • For top-k pattern mining, support values can be used as scores (support-biased sampling) • For the subgraph summarization task, only maximal graph patterns have a uniform non-zero score • For graph classification, discriminatory subgraphs should have high scores

  12. Algorithm

  13. Challenges • The output space cannot be instantiated • Complete statistics about the output space are not known • The target distribution is not known entirely (Figure: the output space of graph mining, patterns g1 … g5; we want to sample graphs g1, …, gn in proportion to scores s1, …, sn)

  14. MCMC Sampling • Solution approach (MCMC sampling) • Perform a random walk in the output space • Represent the output space as a transition graph (for instance, the partial order graph, POG) to allow local transitions • Edges of the transition graph are chosen based on structural similarity • Make sure the random walk is ergodic In the POG, every pattern is connected to its sub-patterns (with one less edge) and all its super-patterns (with one more edge); a sketch of the down-neighborhood computation follows.
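A minimal sketch of one half of the POG neighborhood, assuming patterns are connected networkx graphs; the helper name `sub_patterns` is illustrative, not from the paper. Up-neighbors (one-edge extensions) additionally require the database to enumerate candidate extensions, so they are omitted here.

```python
import networkx as nx

def sub_patterns(g: nx.Graph):
    """Yield the POG down-neighbors of pattern g: the connected
    patterns obtained by deleting exactly one edge from g."""
    for u, v in list(g.edges()):
        h = g.copy()
        h.remove_edge(u, v)
        # Drop endpoints left isolated by the deletion.
        h.remove_nodes_from([n for n in (u, v) if h.degree(n) == 0])
        if h.number_of_edges() > 0 and nx.is_connected(h):
            yield h
```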

  15. Algorithm • Define the transition graph (for instance, the POG) • Define an interestingness function that selects the desired sampling distribution • Perform a random walk on the transition graph • Compute the neighborhood locally • Compute the transition probability • Utilizes the interestingness score • This makes the method generic • Return the currently visited pattern after k iterations (see the sketch below)
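The whole procedure fits into a short generic loop. This is a sketch under two assumptions: the caller supplies `neighbors(p)`, the locally computed POG neighborhood restricted to frequent patterns (assumed non-empty), and a strictly positive score function `f`; both names, and `oss_sample` itself, are illustrative.

```python
import random

def oss_sample(start, neighbors, f, k=1000):
    """Generic Metropolis-Hastings random walk over the pattern space;
    the stationary distribution is proportional to the score f(p)."""
    cur = start
    cur_nbrs = list(neighbors(cur))
    for _ in range(k):
        nxt = random.choice(cur_nbrs)      # uniform proposal
        nxt_nbrs = list(neighbors(nxt))
        # q(cur, nxt) = 1/|N(cur)| and q(nxt, cur) = 1/|N(nxt)|, so the
        # MH ratio is (f(nxt) * |N(cur)|) / (f(cur) * |N(nxt)|).
        accept = min(1.0, f(nxt) * len(cur_nbrs) / (f(cur) * len(nxt_nbrs)))
        if random.random() < accept:
            cur, cur_nbrs = nxt, nxt_nbrs
    return cur
```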

  16. Local Computation of Output Space • Patterns that are not part of the output space are discarded during local neighborhood computation (Figure: the super-patterns and sub-patterns g1 … g5 of a pattern u; the transition probabilities out of u sum to 1)

  17. Compute P to Achieve the Target Distribution • If π is the stationary distribution and P is the transition matrix, then in equilibrium we have πP = π • The main task is to choose P so that the desired stationary distribution is achieved • In fact, we compute only one row of P (local computation) We want to sample graphs g1, …, gn in proportion to scores s1, …, sn.
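The equilibrium condition πP = π can be checked numerically on a toy instance. A sketch, assuming a fully connected 3-pattern transition graph with scores 2, 3, 4, and using the Metropolis-Hastings rule (introduced on the next slide) as one standard way to build P:

```python
import numpy as np

scores = np.array([2.0, 3.0, 4.0])                 # interestingness values
pi = scores / scores.sum()                         # target distribution
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])  # toy transition graph

n = len(scores)
deg = adj.sum(axis=1)
P = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if adj[i, j]:
            q_ij, q_ji = 1.0 / deg[i], 1.0 / deg[j]
            P[i, j] = q_ij * min(1.0, (pi[j] * q_ji) / (pi[i] * q_ij))
    P[i, i] = 1.0 - P[i].sum()                     # rejected moves stay put

print(np.allclose(pi @ P, pi))                     # True: pi is stationary
```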

  18. Use the Metropolis-Hastings (MH) Algorithm 1. Fix an arbitrary proposal distribution q beforehand 2. Find a neighbor j (to move to) using q 3. Compute the acceptance probability and accept the move with that probability 4. If accepted, move to j; otherwise, go back to step 2 (Figure: an example walk on a small transition graph with vertices 0-5)

  19. Uniform Sampling of Frequent Patterns • Target distribution: 1/n, 1/n, …, 1/n • How to achieve it? • Use a uniform proposal distribution • The acceptance probability is then min(1, d_i / d_j), where d_x is the degree of vertex x in the POG (illustrated below)
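In code, the uniform-sampling acceptance rule is a one-liner; a tiny sketch with a worked check:

```python
def uniform_acceptance(d_i: int, d_j: int) -> float:
    """MH acceptance for a uniform target with a uniform proposal
    q(i, j) = 1/d_i: min(1, d_i / d_j)."""
    return min(1.0, d_i / d_j)

# A move from a degree-3 pattern to a degree-5 neighbor is accepted
# with probability 3/5; the reverse move is always accepted.
assert uniform_acceptance(3, 5) == 0.6
assert uniform_acceptance(5, 3) == 1.0
```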

  20. Uniform Sampling: Transition Probability Matrix (Figure: a small pattern lattice over items A, B, D and its transition probability matrix P)

  21. Discriminatory Subgraph Sampling • Database graphs are labeled • Subgraphs may be used as • Features for supervised classification • A graph kernel (Figure: subgraph mining yields a feature matrix of embedding counts, or binary occurrence indicators)

  22. Sampling in Proportion to Discriminatory Score (f) • Interestingness score (feature quality) • Entropy • Delta score = abs(positive support - negative support) • Direct mining is difficult • Score values (entropy, delta score) are neither monotone nor anti-monotone: for a pattern P and its child C, there is no fixed order between Score(P) and Score(C)

  23. Discriminatory Subgraph Sampling • Use the Metropolis-Hastings algorithm • Choose a neighbor uniformly as the proposal distribution • Compute the acceptance probability from the delta score: the ratio of the delta scores of j and i, times the ratio of the degrees of i and j (see the sketch below)
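Plugging a target proportional to the delta score and the uniform-neighbor proposal into the MH formula gives the acceptance probability below; a minimal sketch, with hypothetical helper names:

```python
def delta_score(pos_support: int, neg_support: int) -> int:
    """Delta score: abs(positive support - negative support)."""
    return abs(pos_support - neg_support)

def discriminatory_acceptance(f_i, f_j, d_i, d_j):
    """MH acceptance for a target proportional to the delta score f,
    with a neighbor chosen uniformly: min(1, (f_j * d_i) / (f_i * d_j))."""
    return min(1.0, (f_j * d_i) / (f_i * d_j))
```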

  24. Results

  25. Datasets

  26. Result Evaluation Metrics • Sampling quality • Our sampling distribution vs. the target sampling distribution • Median and standard deviation of visit counts • How the sampling converges (convergence rate) • Variation distance: VD(P, Q) = (1/2) Σ_x |P(x) - Q(x)| (computable as sketched below) • Scalability test • Experiments on large datasets • Quality of sampled patterns
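The variation distance is directly computable whenever both distributions are known on the (evaluation-sized) output space; a minimal sketch with made-up visit frequencies:

```python
def variation_distance(p, q):
    """Variation distance: (1/2) * sum over x of |p(x) - q(x)|.
    p and q map patterns to probabilities."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)

# Empirical visit frequencies vs. the uniform target over 3 patterns:
empirical = {"g1": 0.30, "g2": 0.36, "g3": 0.34}
target = {"g1": 1 / 3, "g2": 1 / 3, "g3": 1 / 3}
print(variation_distance(empirical, target))  # ~0.033
```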

  27. Uniform Sampling Results • Experiment setup • Run the sampling algorithm for a sufficient number of iterations and observe the visit-count distribution • For a dataset with n frequent patterns, we perform 200*n iterations (Figure: results on the DTP chemical dataset)

  28. Sampling Quality • Depends on the choice of proposal distribution • If the vertices of the POG have similar degree values, sampling is good • The earlier dataset has patterns with widely varying degree values • For the clique dataset, sampling quality is almost perfect (Figure: results on the Chess itemset dataset, 100*n iterations)

  29. Discriminatory Sampling Results (Mutagenicity dataset) (Figures: the distribution of the delta score among all frequent patterns, and the relation between sampling rate and delta score)

  30. Discriminatory sampling results (cont)

  31. Discriminatory Sampling Results (Cell Graphs) • 30 graphs in total, min-sup = 6 • No graph mining algorithm could complete on this dataset in a week of running (on a 2 GHz machine with 4 GB of RAM)

  32. Summary • Existing algorithms: depth-first or breadth-first walk on the subgraph space; rightmost extension; complete algorithm • Output space sampling: random walk on the subgraph space; arbitrary extension; sampling algorithm • Quality: sampling-quality guarantee • Scalability: visits only a small part of the search space • Non-redundancy: finds very dissimilar patterns by virtue of randomness • Genericity: in terms of pattern type and sampling objective

  33. Future Work and Discussion • It is important to choose the proposal distribution wisely to get better sampling • For large graphs, support counting is still a bottleneck • How to avoid isomorphism checking entirely • How to effectively parallelize support counting • How to make the random walk converge faster • The POG generally has a small spectral gap, so convergence is slow • This makes the algorithm costly (more steps to find good samples)

  34. Thank you!

  35. Additional Slides

  36. Acceptance Probability Computation The MH acceptance probability is built from the desired distribution π (proportional to the interestingness value f) and the proposal distribution q: α(i, j) = min(1, (π(j) × q(j, i)) / (π(i) × q(i, j)))

  37. Support-Biased Sampling • What proposal distribution should we choose? • We want to sample graphs in proportion to their supports s1, …, sn • The proposal from pattern u mixes up-moves (to super-patterns, N_up(u)) and down-moves (to sub-patterns, N_down(u)) via a parameter α; at the boundary, α = 1 if N_up(u) = ø and α = 0 if N_down(u) = ø, so the walk never proposes a move into an empty neighborhood

  38. Example of Support-Biased Sampling With α = 1/3: q(u, v) = 1/2 and q(v, u) = 1/(3×3) = 1/9; the supports are s(u) = 2 and s(v) = 3. The acceptance probability for the move u → v is min(1, (s(v) × q(v, u)) / (s(u) × q(u, v))) = min(1, (3 × 1/9) / (2 × 1/2)) = 1/3 (checked below).
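The slide's numbers can be checked mechanically against the general MH acceptance formula with a target proportional to support:

```python
# Worked example from the slide: s(u) = 2, s(v) = 3,
# q(u, v) = 1/2, q(v, u) = 1/9.
s_u, s_v = 2, 3
q_uv, q_vu = 1 / 2, 1 / 9

accept = min(1.0, (s_v * q_vu) / (s_u * q_uv))
print(accept)  # (3 * 1/9) / (2 * 1/2) = 1/3
```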

  39. Sampling Convergence

  40. Support-Biased Sampling • The scatter plot of visit count vs. support shows a positive correlation (correlation: 0.76)

  41. Specific Sampling Examples and Their Utilization • Uniform sampling of frequent patterns • To explore the frequent patterns • To set a proper value of minimum support • To do approximate counting • Support-biased sampling • To find the top-k patterns in terms of support value • Discriminatory subgraph sampling • To find subgraphs that are good features for classification
