
Lei Shi Department of Computer Science and Engineering State University of New York at Buffalo



Presentation Transcript


  1. Seminar 2009 Frequent Subgraph/ Substructure Mining Lei Shi Department of Computer Science and Engineering State University of New York at Buffalo

  2. Outline • Introduction • Apriori-based Subgraph Mining • Pattern Growth Subgraph Mining • Summary

  3. Graphs are everywhere

  4. Graph Mining Problems • Graph Pattern Mining • Frequent subgraph pattern mining • Pattern summarization • Optimal graph patterns • Graph patterns with constraints • Approximate graph patterns …. • Graph Classification • Graph clustering • Important node identification • Bridge and hub identification • Other Important Topics • Graph compression • Graph model • Social network analysis.

  5. Subgraph Pattern Mining • Frequent subgraph • A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold • Applications of subgraph pattern mining • Mining biochemical structures • Program control flow analysis • Mining XML structures or Web communities • Building blocks for graph classification, clustering, compression, comparison and correlation analysis.
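The support definition on this slide can be illustrated with a small sketch. It is only an illustration under stated assumptions: the graph representation, the helper names (`make_graph`, `contains`, `support`) and the toy dataset are hypothetical, and the containment test is a brute-force search over vertex mappings that is practical only for very small graphs.

```python
from itertools import permutations

def make_graph(vertex_labels, labeled_edges):
    """Store a labeled undirected graph as (vertex -> label, {u, v} -> edge label)."""
    edges = {frozenset((u, v)): lab for u, v, lab in labeled_edges}
    return dict(vertex_labels), edges

def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a labeled (non-induced) subgraph.

    Brute force: try every injective mapping of pattern vertices to graph vertices.
    """
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        # Vertex labels must agree under the mapping.
        if any(p_labels[v] != g_labels[mapping[v]] for v in p_nodes):
            continue
        # Every pattern edge must map to a graph edge with the same label.
        if all(
            g_edges.get(frozenset(mapping[v] for v in edge)) == lab
            for edge, lab in p_edges.items()
        ):
            return True
    return False

def support(dataset, pattern):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(1 for g in dataset if contains(g, pattern))

# Hypothetical toy dataset of three small graphs.
dataset = [
    make_graph({0: "A", 1: "B", 2: "C"}, [(0, 1, "-"), (1, 2, "-")]),
    make_graph({0: "A", 1: "B"}, [(0, 1, "-")]),
    make_graph({0: "A", 1: "C"}, [(0, 1, "-")]),
]
ab = make_graph({"p": "A", "q": "B"}, [("p", "q", "-")])
abc = make_graph({"p": "A", "q": "B", "r": "C"}, [("p", "q", "-"), ("q", "r", "-")])

min_sup = 2  # assumed minimum support threshold
frequent_ab = support(dataset, ab) >= min_sup    # A-B occurs in two graphs
frequent_abc = support(dataset, abc) >= min_sup  # A-B-C occurs in one graph
```

With a threshold of 2, the single edge A-B counts as frequent while the path A-B-C does not. Real miners avoid this exhaustive containment test wherever they can, which is the cost the later slides refer to.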

  6. Frequent Subgraph Example [Figure: a graph dataset with vertices labeled A, B and C, and three subgraphs (1), (2) and (3) with supports 1, 3 and 3]

  7. Key Challenges in Subgraph Mining • Graph isomorphism • to detect if two graphs are identical in structure • Graph representation (Canonical Labeling) • A canonical label is a unique code of a given graph. • Canonical label should be the same no matter how graphs are represented, as long as graphs have the same topological structure and the same labeling of edges and vertices. • Subgraph candidate generation • generate candidate frequent subgraphs from datasets

  8. Subgraph Mining Approaches • Apriori-based • AGM/AcGM: Inokuchi, et al. (PKDD’00) • FSG: Kuramochi and Karypis (ICDM’01) M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, pages 313-320, Nov. 2001 • PATH#: Vanetik and Gudes (ICDM’02, ICDM’04) • FFSM: Huan, et al. (ICDM’03) and SPIN: Huan et al. (KDD’04) • FTOSM: Horvath et al. (KDD’06) • Pattern growth based • Subdue: Holder et al. (KDD’94) • MoFa: Borgelt and Berthold (ICDM’02) • gSpan: Yan and Han (ICDM’02) Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02) (December 09-12, 2002). IEEE Computer Society, Washington, DC, 721 • Gaston: Nijssen and Kok (KDD’04) • CMTreeMiner: Chi et al. (TKDE’05) • LEAP: Yan et al. (SIGMOD’08)

  9. Outline • Introduction and Background • Apriori-based Subgraph Mining • Pattern Growth Subgraph Mining • Summary

  10. Apriori-based Approach • FSG: M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, Nov. 2001. • Flattened representation as canonical labeling • Apriori-based method to generate subgraph candidates

  11. Graph Representation in FSG • Flattened Representation

  12. Graph Representation in FSG • Flattened Representation • Lexicographic order (dictionary order)
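One way to picture a flattened representation used as a canonical label is sketched below. This is an assumption-laden illustration, not FSG's actual procedure: it flattens the vertex labels and the upper triangle of the adjacency matrix into a string for every vertex ordering and keeps the lexicographically smallest string. The function names, the `"0"` marker for a missing edge, the single-character labels and the choice of "smallest" are all hypothetical, and trying every ordering is feasible only for tiny graphs.

```python
from itertools import permutations

def flatten(labels, edges, order):
    """Flatten a graph into a string for one particular vertex ordering.

    The string is the vertex labels in order, then the upper triangle of the
    adjacency matrix read row by row ("0" marks a missing edge).
    """
    n = len(order)
    vertex_part = "".join(labels[v] for v in order)
    edge_part = "".join(
        edges.get(frozenset((order[i], order[j])), "0")
        for i in range(n)
        for j in range(i + 1, n)
    )
    return vertex_part + "|" + edge_part

def canonical_label(labels, edges):
    """Lexicographically smallest flattened string over all vertex orderings."""
    return min(flatten(labels, edges, order) for order in permutations(labels))

# Two drawings of the same labeled path A -x- B -y- A with different vertex ids.
g1_labels = {0: "A", 1: "B", 2: "A"}
g1_edges = {frozenset((0, 1)): "x", frozenset((1, 2)): "y"}
g2_labels = {10: "B", 11: "A", 12: "A"}
g2_edges = {frozenset((10, 11)): "y", frozenset((10, 12)): "x"}

# A structurally different graph: both edges carry label x.
g3_labels = {0: "A", 1: "B", 2: "A"}
g3_edges = {frozenset((0, 1)): "x", frozenset((1, 2)): "x"}

same = canonical_label(g1_labels, g1_edges) == canonical_label(g2_labels, g2_edges)
different = canonical_label(g1_labels, g1_edges) != canonical_label(g3_labels, g3_edges)
```

The two isomorphic drawings receive the same label regardless of how their vertices are numbered, so comparing labels replaces a pairwise isomorphism test.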

  13. Apriori-based method • Apriori Property • If a graph is frequent, all of its subgraphs are frequent. • Candidate Generation • Create a set of candidates of size k+1 - from two given frequent k-subgraphs - that contain the same (k-1)-subgraph - the join can result in several candidates of size k+1
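The Apriori property above can be written as an anti-monotonicity statement on support. The notation here is an assumption introduced for illustration: $\mathrm{sup}(g)$ is the support of $g$ in the dataset and $\sigma$ is the minimum support threshold.

```latex
\[
  g' \subseteq g \;\Longrightarrow\; \mathrm{sup}(g') \ge \mathrm{sup}(g)
\]
\[
  \mathrm{sup}(g') < \sigma \;\Longrightarrow\; \mathrm{sup}(g) < \sigma
  \quad \text{for every } g \supseteq g'
\]
```

The second line is the pruning rule: a size-(k+1) candidate needs to be counted only if every one of its size-k subgraphs is already known to be frequent.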

  14. Apriori-based method • Example of generated graph candidates

  15. Apriori-based method • Flowchart

  16. Apriori-based method • Experimental Result - Chemical compound dataset, which contains 340 compounds, 24 different atoms (vertices)

  17. Outline • Introduction • Apriori-based Subgraph Mining • Pattern Growth Subgraph Mining • Summary

  18. Motivation of gSpan • Weaknesses of the Apriori-based approach • Generating size-(k+1) subgraph candidates from size-k frequent subgraphs is complicated and costly. • Pruning false positives is expensive: it requires subgraph isomorphism testing, which is an NP-complete problem. • gSpan: Graph-Based Substructure Pattern Mining • Changes the way a graph is represented (DFS code, from Depth First Search) • Uses pattern growth to generate new subgraph candidates.

  19. gSpan: Graph-Based Substructure Pattern Mining • DFS (Depth First Search) Code • First Step: Traverse the graph depth-first and use the edges, in the order they are visited, to represent the graph. • Second Step: DFS Lexicographic Order • Pattern Growth subgraph generation

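  20. DFS code • An edge is represented by a 5-tuple (i, j, l_i, l_(i,j), l_j): the DFS discovery indices of its two endpoints, the label of the first endpoint, the edge label, and the label of the second endpoint.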
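A DFS code can be sketched as follows. This is a minimal sketch under assumptions: each edge is written as a 5-tuple (i, j, l_i, l_(i,j), l_j) of discovery indices and labels, backward edges of a vertex are written right after the forward edge that discovered it, and the function name `dfs_code`, the input format and the small example graph are hypothetical. It builds the code for one particular traversal only and does not search for the minimum code.

```python
def dfs_code(labels, edges, start):
    """Build one DFS code (a list of 5-tuples) by a depth-first walk from `start`.

    labels: vertex -> label; edges: list of (u, v, edge_label), undirected.
    """
    adj = {v: [] for v in labels}
    edge_label = {}
    for u, v, lab in edges:
        adj[u].append(v)
        adj[v].append(u)
        edge_label[frozenset((u, v))] = lab

    index = {start: 0}  # discovery index of each visited vertex
    used = set()        # edges already written to the code
    code = []

    def emit(u, w):
        e = frozenset((u, w))
        used.add(e)
        code.append((index[u], index[w], labels[u], edge_label[e], labels[w]))

    def visit(u):
        # Backward edges: unused edges to already discovered vertices,
        # ordered by the discovery index of the target.
        back = [w for w in adj[u] if w in index and frozenset((u, w)) not in used]
        for w in sorted(back, key=index.get):
            emit(u, w)
        # Forward edges: discover a new vertex, write the edge, recurse.
        for w in adj[u]:
            if w not in index:
                index[w] = len(index)
                emit(u, w)
                visit(w)

    visit(start)
    return code

# Hypothetical example graph with five labeled vertices and six labeled edges.
labels = {"v0": "X", "v1": "Y", "v2": "X", "v3": "Z", "v4": "Z"}
edges = [
    ("v0", "v1", "a"), ("v1", "v2", "b"), ("v2", "v0", "a"),
    ("v2", "v3", "c"), ("v3", "v1", "b"), ("v1", "v4", "d"),
]
code_from_v0 = dfs_code(labels, edges, "v0")
code_from_v3 = dfs_code(labels, edges, "v3")
```

Starting the walk at a different vertex, or visiting neighbours in a different order, gives a different code for the same graph. That is why an ordering over codes is needed to single out one of them as the canonical representation.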

  21. DFS code • Second Step: DFS Lexicographic Order

  22. Pattern Growth Approach • Pattern Growth (free extension)

  23. Pattern Growth Approach • Duplicate Graphs

  24. Pattern Growth Approach • Free extension

  25. Pattern Growth Approach • Rightmost extension
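Rightmost extension restricts where a pattern may grow. The sketch below assumes the usual gSpan formulation: a DFS code is a list of 5-tuples whose first two entries are discovery indices, backward edges may start only at the rightmost vertex and end on the rightmost path, and forward edges may start at any vertex on the rightmost path. The function names and the example code are hypothetical, and only the positions of possible new edges are listed, not their labels.

```python
def rightmost_path(code):
    """Vertices on the rightmost path, from the root (index 0) to the rightmost vertex."""
    # Forward edges (i < j) define the DFS tree: j was discovered from i.
    parent = {j: i for (i, j, *_labels) in code if i < j}
    v = max(parent, default=0)  # the rightmost vertex is the last one discovered
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path[::-1]

def extension_slots(code):
    """Index pairs (i, j) where rightmost extension may add a new edge."""
    path = rightmost_path(code)
    rightmost = path[-1]
    existing = {frozenset((i, j)) for (i, j, *_labels) in code}
    # Backward: from the rightmost vertex to an earlier vertex on the path,
    # skipping pairs that are already connected.
    backward = [(rightmost, v) for v in path[:-1]
                if frozenset((rightmost, v)) not in existing]
    # Forward: from any vertex on the rightmost path to a brand-new vertex.
    forward = [(v, rightmost + 1) for v in reversed(path)]
    return backward, forward

# Hypothetical DFS code of a graph with five vertices and six edges.
code = [
    (0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"), (2, 0, "X", "a", "X"),
    (2, 3, "X", "c", "Z"), (3, 1, "Z", "b", "Y"), (1, 4, "Y", "d", "Z"),
]
path = rightmost_path(code)
backward, forward = extension_slots(code)
```

For this code the rightmost path is 0, 1, 4, so vertices 2 and 3 are never extended. Compared with free extension, this removes many of the ways in which the same graph could be generated more than once.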

  26. Pattern Growth Approach • Examples (cont.)

  27. gSpan

  28. gSpan

  29. Pattern Growth Approach • Experimental result using chemical data • 340 molecules, with 66 atom types and 4 bond types as labels • On average only 27 vertices with 28 edges

  30. Summary • Graph representation: flattened representation vs. DFS code • Generation of candidate patterns: Apriori vs. pattern growth

  31. Pattern-Growth Approach

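  32. Frequent Graph Pattern • Given a graph dataset D, find every subgraph g such that $\mathrm{freq}(g) \ge \theta$, where $\mathrm{freq}(g)$ is the percentage of graphs in D that contain g and $\theta$ is the minimum support threshold. • Problem 1: Exponential Pattern Set • Problem 2: Threshold Setting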

  33. Difference between frequent itemset and frequent subgraph discovery

  34. Frequent itemset discovery

  35. subgraph Mining Algorithms • Apriori-based approach – AGM/AcGM: Inokuchi, et al. (PKDD’00) – FSG: Kuramochi and Karypis (ICDM’01) – PATH#: Vanetik and Gudes (ICDM’02, ICDM’04) – FFSM: Huan, et al. (ICDM’03) and SPIN: Huan et al. (KDD’04) – FTOSM: Horvath et al. (KDD’06) • Pattern growth approach – Subdue: Holder et al. (KDD’94) – MoFa: Borgelt and Berthold (ICDM’02) – gSpan: Yan and Han (ICDM’02) – Gaston: Nijssen and Kok (KDD’04) – CMTreeMiner: Chi et al. (TKDE’05) – LEAP: Yan et al. (SIGMOD’08)

  36. Framework of Subgraph Mining Algorithms • Search order: breadth vs. depth, complete vs. incomplete • Generation of candidate patterns: Apriori vs. pattern growth • Discovery order of patterns: DFS order; path, tree, graph • Elimination of duplicate subgraphs: passive vs. active • Support calculation: embedding store or not

  37. Frequent Subgraph Examples:

  38. Example (cont.)

  39. Subgraph Mining Approaches • Apriori-based approach • AGM/AcGM: Inokuchi, et al. (PKDD’00) • FSG: Kuramochi and Karypis (ICDM’01) M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, pages 313-320, Nov. 2001 • PATH#: Vanetik and Gudes (ICDM’02, ICDM’04) • FFSM: Huan, et al. (ICDM’03) and SPIN: Huan et al. (KDD’04) • FTOSM: Horvath et al. (KDD’06) • Pattern growth approach • Subdue: Holder et al. (KDD’94) • MoFa: Borgelt and Berthold (ICDM’02) • gSpan: Yan and Han (ICDM’02) Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02) (December 09-12, 2002). IEEE Computer Society, Washington, DC, 721 • Gaston: Nijssen and Kok (KDD’04) • CMTreeMiner: Chi et al. (TKDE’05) • LEAP: Yan et al. (SIGMOD’08)

  40. Outline • Introduction and Background • Apriori-based Subgraph Mining • Pattern Growth Subgraph Mining • Summary • DFS code: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02) (December 09-12, 2002). IEEE Computer Society, Washington, DC, 721

  41. Pattern Growth Approach
