
CSCE822 Data Mining and Warehousing

CSCE822 Data Mining and Warehousing. Lecture 16 Graph Mining MW 4:00PM-5:15PM Dr. Jianjun Hu http://mleg.cse.sc.edu/edu/csce822. University of South Carolina Department of Computer Science and Engineering. Roadmap. What, Why Graph Mining? Methods for Mining Frequent Subgraphs



  1. CSCE822 Data Mining and Warehousing • Lecture 16: Graph Mining • MW 4:00PM-5:15PM • Dr. Jianjun Hu • http://mleg.cse.sc.edu/edu/csce822 • University of South Carolina, Department of Computer Science and Engineering

  2. Roadmap • What, Why Graph Mining? • Methods for Mining Frequent Subgraphs • Mining Variant and Constrained Substructure Patterns • Applications: • Classification and Clustering • Graph Indexing • Similarity Search • Summary Mining and Searching Graphs in Graph Databases

  3. Graph, Graph, Everywhere • Relationships, interactions, connections • [Figures: aspirin molecule; yeast protein interaction network (from H. Jeong et al., Nature 411, 41 (2001)); co-author network; Internet]

  4. Why Graph Mining? • Graphs are ubiquitous • Chemical compounds (Cheminformatics) • Protein structures, biological pathways/networks (Bioinformatics) • Program control flow, traffic flow, and workflow analysis • XML databases, Web, and social network analysis • Graph is a general model • Trees, lattices, sequences, and items are degenerate graphs • Diversity of graphs • Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D) • Complexity of algorithms: many problems are of high complexity

  5. Graph Pattern Mining • Frequent subgraphs • A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold • Applications of graph pattern mining • Mining biochemical structures • Program control flow analysis • Mining XML structures or Web communities • Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
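
The support definition on this slide can be illustrated with a small sketch. It assumes a toy representation that is not part of the lecture: a graph is a pair `(labels, edges)` of a vertex-to-label dict and a list of undirected edges, and containment is checked by brute force over all vertex assignments, which is only practical for very small graphs.

```python
from itertools import permutations

# Assumed toy layout: graph = (labels, edges), where labels maps
# vertex id -> label and edges is a list of undirected (u, v) pairs.

def contains(graph, pattern):
    """Return True if `pattern` occurs in `graph` as a subgraph (brute force)."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    g_edge_set = {frozenset(e) for e in g_edges}
    p_nodes = list(p_labels)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        labels_ok = all(p_labels[v] == g_labels[mapping[v]] for v in p_nodes)
        edges_ok = all(frozenset((mapping[u], mapping[v])) in g_edge_set
                       for u, v in p_edges)
        if labels_ok and edges_ok:
            return True
    return False

def support(dataset, pattern):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(contains(g, pattern) for g in dataset)

def is_frequent(dataset, pattern, min_support):
    """A pattern is frequent if its support is no less than min_support."""
    return support(dataset, pattern) >= min_support

dataset = [
    ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),
    ({0: "C", 1: "O", 2: "N"}, [(0, 1), (1, 2)]),
    ({0: "C", 1: "C", 2: "N"}, [(0, 1), (1, 2)]),
]
c_o = ({0: "C", 1: "O"}, [(0, 1)])
c_c_o = ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])

print(support(dataset, c_o), is_frequent(dataset, c_o, 2))      # 2 True
print(support(dataset, c_c_o), is_frequent(dataset, c_c_o, 2))  # 1 False
```

Here support counts graphs rather than individual embeddings, which matches the per-dataset definition above; counting embeddings would be a different design choice.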

  6. Example: Frequent Subgraphs • [Figure: graph dataset with three graphs (A), (B), (C), and the frequent patterns (1) and (2) found at minimum support 2]

  7. Graph Mining Algorithms • Incomplete beam search – Greedy (Subdue 1994) • Inductive logic programming (WARMR 1998) • Graph theory-based approaches • Apriori-based approach • Pattern-growth approach

  8. Graph Definitions

  9. SUBDUE (Holder et al. KDD’94) • Start with single vertices • Expand best substructures with a new edge • Limit the number of best substructures • Substructures are evaluated based on their ability to compress input graphs • Using minimum description length (MDL) • Best substructure S in graph G minimizes: DL(S) + DL(G\S), where DL(G\S) is the description length of G after its instances of S are compressed • Terminate when no new substructure is discovered

  10. Frequent Subgraph Mining Approaches • Apriori-based approach • AGM/AcGM: Inokuchi, et al. (PKDD’00) • FSG: Kuramochi and Karypis (ICDM’01) • PATH#: Vanetik and Gudes (ICDM’02, ICDM’04) • FFSM: Huan, et al. (ICDM’03) • Pattern growth approach • MoFa, Borgelt and Berthold (ICDM’02) • gSpan: Yan and Han (ICDM’02) • Gaston: Nijssen and Kok (KDD’04)

  11. Apriori-like Algorithm • Find frequent 1-subgraphs • Repeat • Candidate generation • Use frequent (k-1)-subgraphs to generate candidate k-subgraphs • Candidate pruning • Prune candidate subgraphs that contain infrequent (k-1)-subgraphs • Support counting • Count the support of each remaining candidate • Eliminate candidate k-subgraphs that are infrequent • In practice, it is not as easy; there are many other issues
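
The level-wise loop above can be sketched under one strong simplifying assumption that the lecture does not make: every vertex label is unique within a graph, so a subgraph is fully identified by its set of labeled edges and the isomorphism tests reduce to set containment. The names `is_connected` and `apriori_subgraphs` are illustrative, not taken from FSG or AGM.

```python
from itertools import combinations

def is_connected(edges):
    """True if the non-empty edge set forms one connected component."""
    edges = list(edges)
    seen = set(edges[0])
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            if (u in seen) != (v in seen):
                seen.update((u, v))
                changed = True
    return all(u in seen and v in seen for u, v in edges)

def apriori_subgraphs(dataset, min_support):
    """dataset: list of edge sets. Returns {frozenset(edges): support}."""
    def support(pattern):
        return sum(pattern <= g for g in dataset)

    # Frequent 1-subgraphs: single edges.
    singles = {frozenset([e]) for e in set().union(*dataset)}
    level = {p for p in singles if support(p) >= min_support}
    frequent = {}
    while level:
        for p in level:
            frequent[p] = support(p)
        # Candidate generation: join two k-edge patterns sharing k-1 edges.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1 and is_connected(a | b)}
        # Candidate pruning: every connected k-edge sub-pattern must be frequent.
        candidates = {c for c in candidates
                      if all(c - {e} in level
                             for e in c if is_connected(c - {e}))}
        # Support counting and elimination of infrequent candidates.
        level = {c for c in candidates if support(c) >= min_support}
    return frequent

dataset = [
    {("a", "b"), ("b", "c"), ("c", "d")},
    {("a", "b"), ("b", "c"), ("b", "d")},
    {("a", "b"), ("c", "d")},
]
for pattern, sup in apriori_subgraphs(dataset, 2).items():
    print(sorted(pattern), sup)
```

With repeated labels, each of the three steps needs a real (sub)graph isomorphism test, which is where most of the practical difficulty comes from.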

  12. Apriori-Based Approach • [Figure: frequent k-edge graphs G1, G2, …, Gn are combined by JOIN into (k+1)-edge candidates G, G’, G’’, which then go through pruning, support counting, and elimination]

  13. Candidate Generation • In Apriori: • Merging two frequent k-itemsets will produce a candidate (k+1)-itemset • In frequent subgraph mining (vertex/edge growing) • Merging two frequent k-subgraphs may produce more than one candidate (k+1)-subgraph

  14. Vertex Growing

  15. Edge Growing

  16. FFSM (Huan, et al. ICDM’03) • Represent graphs using canonical adjacency matrix (CAM) • Join two CAMs or extend a CAM to generate a new graph • Store the embeddings of CAMs • All of the embeddings of a pattern in the database • Can derive the embeddings of newly generated CAMs

  17. Example: Dataset

  18. Example

  19. Graph Isomorphism • Two graphs are isomorphic if they are topologically equivalent, i.e., there is a one-to-one mapping between their vertices that preserves adjacency

  20. Graph Isomorphism

  21. Graph Isomorphism • Test for graph isomorphism is needed: • During candidate generation step, to determine whether a candidate has already been generated • During candidate pruning step, to check whether its (k-1)-subgraphs are frequent • During candidate counting, to check whether a candidate is contained within another graph

  22. Graph Isomorphism • Use canonical labeling to handle isomorphism • Map each graph into an ordered string representation (known as its code) such that two isomorphic graphs will be mapped to the same canonical encoding • Example: • Lexicographically largest adjacency matrix • Find the permutations of the vertices so that the adjacency matrix is lexicographically maximized when read off from left to right, one row at a time.

  23. Graph Isomorphism • Example: lexicographically largest adjacency matrix • String of the original adjacency matrix: 0010001111010110 • Canonical code: 0111101011001000

  24. Lexicographically largest adjacency matrix • Find the permutations of the vertices so that the adjacency matrix is lexicographically maximized when read off from left to right, one row at a time.
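
As a minimal sketch of canonical labeling by the lexicographically largest adjacency matrix, the function below tries every vertex permutation and keeps the largest row-by-row string. This brute force is an illustration only (it is factorial in the number of vertices) and is not how FSG or FFSM compute their codes; the name `canonical_code` is hypothetical. The input matrix is the 4-vertex example whose string is 0010001111010110.

```python
from itertools import permutations

def canonical_code(adj):
    """Largest row-by-row string over all vertex orderings of `adj`."""
    n = len(adj)
    best = ""
    for perm in permutations(range(n)):
        # Read the permuted matrix left to right, one row at a time.
        code = "".join(str(adj[perm[i]][perm[j]])
                       for i in range(n) for j in range(n))
        if code > best:
            best = code
    return best

# Adjacency matrix whose row-by-row string is 0010001111010110.
adj = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]
print(canonical_code(adj))  # 0111101011001000

# The same graph with vertices renumbered maps to the same code.
relabeled = [
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]
print(canonical_code(relabeled) == canonical_code(adj))  # True
```

Because isomorphic graphs share one code, comparing codes replaces a pairwise isomorphism test during candidate generation.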

  25. Graph Pattern Explosion Problem • If a graph is frequent, all of its subgraphs are frequent (the Apriori property) • An n-edge frequent graph may have 2^n subgraphs • Among 422 chemical compounds which are confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5%

  26. Closed Frequent Graphs • Motivation: Handling graph pattern explosion problem • Closed frequent graph • A frequent graph G is closed if there exists no supergraph of G that carries the same support as G • If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (nonclosed graphs) • Lossless compression: still ensures that the mining result is complete
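
The closedness test can be sketched as a filter over an already mined result. As an assumption made only for this illustration, each pattern is a frozenset of labeled edges with unique vertex labels, so "G is a subgraph of G’" becomes a proper-subset check; the supports below are made-up toy values.

```python
def closed_patterns(frequent):
    """Keep patterns with no proper super-pattern of equal support."""
    closed = {}
    for pattern, sup in frequent.items():
        absorbed = any(pattern < other and sup == other_sup
                       for other, other_sup in frequent.items())
        if not absorbed:
            closed[pattern] = sup
    return closed

# Toy mining result: pattern -> support (hypothetical values).
frequent = {
    frozenset({("a", "b")}): 3,
    frozenset({("b", "c")}): 2,
    frozenset({("c", "d")}): 2,
    frozenset({("a", "b"), ("b", "c")}): 2,
}
for pattern, sup in closed_patterns(frequent).items():
    print(sorted(pattern), sup)
```

The pattern {b-c} is dropped because its supergraph {a-b, b-c} has the same support, so its support can still be recovered from the closed set. Only frequent patterns need to be checked as supergraphs, since a supergraph with equal support is itself frequent.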

  27. CLOSEGRAPH (Yan & Han, KDD’03) • A pattern-growth approach • Under what condition can we stop searching a pattern's children, i.e., terminate early? • Suppose G and G’ are frequent and G is a subgraph of G’. If G’ occurs wherever G occurs in the graphs of the dataset, then we need not grow G, since none of G’s children will be closed except those of G’ • [Figure: a k-edge graph G grown into (k+1)-edge graphs G1, G2, …, Gn]

  28. Experimental Result • The AIDS antiviral screen compound dataset from NCI/NIH • The dataset contains 43,905 chemical compounds • Among these 43,905 compounds, 423 belong to class CA (confirmed active), 1,081 to class CM (confirmed moderately active), and the remainder to class CI (confirmed inactive)

  29. Discovered Patterns • [Figure: discovered patterns at 20%, 10%, and 5%]

  30. Graph Mining • Methods for Mining Frequent Subgraphs • Mining Variant and Constrained Substructure Patterns • Applications: • Graph Indexing • Similarity Search • Classification and Clustering • Summary

  31. Constrained Patterns • Degree • Density • Diameter • Connectivity • Min, Max, Avg

  32. Pattern-Growth Approach • Find a small frequent candidate graph • Remove vertices (shadow graph) whose degree is less than the connectivity • Decompose it to extract the subgraphs satisfying the connectivity constraint • Stop decomposing when the subgraph has been checked before • Extend this candidate graph by adding new vertices and edges • Repeat
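
The vertex-removal step can be sketched as repeated degree pruning. The sketch assumes an adjacency-set dict as the graph representation and covers only this one step, not the decomposition or extension that follow; `prune_low_degree` is a hypothetical name. It relies on the fact that a vertex of degree below k cannot belong to a subgraph with edge connectivity k.

```python
def prune_low_degree(adj, k):
    """Repeatedly drop vertices with degree < k; return the remaining graph."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    queue = [v for v in adj if len(adj[v]) < k]
    while queue:
        v = queue.pop()
        if v not in adj:
            continue  # already removed via an earlier queue entry
        for u in adj.pop(v):
            adj[u].discard(v)
            # Removing v may push a neighbor below the threshold.
            if len(adj[u]) < k:
                queue.append(u)
    return adj

# Triangle a-b-c with a pendant vertex d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(sorted(prune_low_degree(graph, 2)))  # ['a', 'b', 'c']
print(sorted(prune_low_degree(graph, 3)))  # []
```

Degree is only a necessary condition, so the surviving vertices still have to be checked against the actual connectivity constraint.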

  33. Graph Mining • Methods for Mining Frequent Subgraphs • Mining Variant and Constrained Substructure Patterns • Applications: • Classification and Clustering • Graph Indexing • Similarity Search • Summary

  34. Graph Clustering • Graph similarity measure • Feature-based similarity measure • Each graph is represented as a feature vector • The similarity is defined by the distance of their corresponding vectors • Frequent subgraphs can be used as features • Structure-based similarity measure • Maximal common subgraph • Graph edit distance: insertion, deletion, and relabeling • Graph alignment distance
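
A feature-based similarity measure can be sketched in a few lines. The feature names and the choice of Euclidean distance over binary vectors are assumptions made for illustration; the slide does not fix a particular distance.

```python
import math

def feature_vector(present, patterns):
    """1 if the graph contains the pattern, else 0, in a fixed pattern order."""
    return [1 if p in present else 0 for p in patterns]

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical frequent-subgraph features and the graphs containing them.
patterns = ["C-C", "C-O", "C-N", "ring"]
v1 = feature_vector({"C-C", "C-O"}, patterns)
v2 = feature_vector({"C-C", "C-O", "ring"}, patterns)
v3 = feature_vector({"C-N"}, patterns)

print(euclidean(v1, v2))  # 1.0
print(euclidean(v1, v3) > euclidean(v1, v2))  # True
```

Any vector-space clustering method can then be applied to these vectors, at the cost of ignoring structure that the chosen features do not capture.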

  35. Graph Classification • Local structure based approach • Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length • Graph pattern-based approach • Subgraph patterns from domain knowledge • Subgraph patterns from data mining • Kernel-based approach • Random walk (Gärtner ’02, Kashima et al. ’02, ICML’03, Mahé et al. ICML’04) • Optimal local assignment (Fröhlich et al. ICML’05)

  36. Graph Pattern-Based Classification • Subgraph patterns from domain knowledge • Molecular descriptors • Subgraph patterns from data mining • General idea • Each graph is represented as a feature vector x = {x1, x2, …, xn}, where xi is the frequency of the i-th pattern in that graph • Each vector is associated with a class label • Classify these vectors in a vector space
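
The general idea can be sketched with pattern-frequency vectors and a 1-nearest-neighbor rule. The classifier choice, the training vectors, and the class names are assumptions for illustration; the slide only says the vectors are classified in a vector space, and any standard classifier could take this place.

```python
def nearest_neighbor_label(train, x):
    """Label of the training vector closest to x (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    _, label = min(train, key=lambda item: sq_dist(item[0], x))
    return label

# Each vector holds x_i = frequency of the i-th pattern in one graph
# (toy numbers), paired with that graph's class label.
train = [
    ([2, 0, 1], "active"),
    ([3, 1, 1], "active"),
    ([0, 2, 0], "inactive"),
    ([0, 3, 1], "inactive"),
]

print(nearest_neighbor_label(train, [2, 1, 1]))  # active
print(nearest_neighbor_label(train, [0, 2, 1]))  # inactive
```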

  37. Graph Mining • Methods for Mining Frequent Subgraphs • Mining Variant and Constrained Substructure Patterns • Applications: • Classification and Clustering • Graph Indexing • Similarity Search • Summary

  38. Graph Search • Querying graph databases: • Given a graph database and a query graph, find all the graphs containing this query graph • [Figure: a query graph and a graph database]

  39. Scalability Issue • Sequential scan • Disk I/Os • Subgraph isomorphism testing • An indexing mechanism is needed • DayLight: Daylight.com (commercial) • GraphGrep: Dennis Shasha, et al. PODS'02 • Grace: Srinath Srinivasa, et al. ICDE'03

  40. Indexing Strategy • If graph G contains query graph Q, G must contain every substructure of Q • Remarks • Index substructures of a query graph to prune graphs that do not contain these substructures • [Figure: query graph (Q), graph (G), and a substructure of Q]

  41. Indexing Framework • Two steps in processing graph queries • Step 1. Index Construction • Enumerate structures in the graph database, build an inverted index between structures and graphs • Step 2. Query Processing • Enumerate structures in the query graph • Calculate the candidate graphs containing these structures • Prune the false positive answers by performing subgraph isomorphism test
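
The two steps can be sketched with a deliberately simple choice of indexed structure: single labeled edges. That choice, the `(labels, edges)` graph layout, and all function names are assumptions for illustration; systems such as GraphGrep index richer structures such as paths.

```python
from itertools import permutations

def contains(graph, pattern):
    """Brute-force subgraph isomorphism test for small labeled graphs."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    g_edge_set = {frozenset(e) for e in g_edges}
    p_nodes = list(p_labels)
    for image in permutations(g_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        if (all(p_labels[v] == g_labels[m[v]] for v in p_nodes)
                and all(frozenset((m[u], m[v])) in g_edge_set
                        for u, v in p_edges)):
            return True
    return False

def features(graph):
    """Indexed structures: the label pairs of the graph's edges."""
    labels, edges = graph
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}

def build_index(database):
    """Step 1: inverted index from structure to ids of graphs containing it."""
    index = {}
    for gid, graph in database.items():
        for f in features(graph):
            index.setdefault(f, set()).add(gid)
    return index

def query(database, index, q):
    """Step 2: intersect posting lists, then verify the candidates."""
    candidates = set(database)
    for f in features(q):
        candidates &= index.get(f, set())
    answers = {gid for gid in candidates if contains(database[gid], q)}
    return candidates, answers

database = {
    "g1": ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),
    "g2": ({0: "C", 1: "O", 2: "N"}, [(0, 1), (1, 2)]),
    "g3": ({0: "C", 1: "C", 2: "O", 3: "C"}, [(0, 1), (2, 3)]),
}
q = ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])

candidates, answers = query(database, build_index(database), q)
print(sorted(candidates))  # ['g1', 'g3']
print(sorted(answers))     # ['g1']
```

Graph g3 has both edge types but not on a shared carbon, so it survives the index filter and is removed only by the isomorphism test, which is exactly the false-positive case the last step handles.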

  42. Graph Mining • Methods for Mining Frequent Subgraphs • Mining Variant and Constrained Substructure Patterns • Applications: • Classification and Clustering • Graph Indexing • Similarity Search • Summary

  43. Structure Similarity Search • CHEMICAL COMPOUNDS (a) caffeine (b) diurobromine (c) viagra • QUERY GRAPH

  44. Feature-Graph Matrix • [Figure: matrix of features versus graphs in the database] • Assume a query graph has 5 features and may miss at most 2 of them under the relaxation threshold
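
The filtering rule in this example can be sketched with a binary feature-graph matrix. The matrix values, the graph ids, and the simplification to 0/1 entries are assumptions for illustration; a real matrix may store occurrence counts instead.

```python
def relaxed_candidates(matrix, graph_ids, max_missing):
    """Keep graphs that lack at most `max_missing` of the query's features."""
    result = []
    for col, gid in enumerate(graph_ids):
        missing = sum(1 for row in matrix if row[col] == 0)
        if missing <= max_missing:
            result.append(gid)
    return result

# Rows: the 5 features of the query graph. Columns: graphs G1..G4.
# An entry is 1 if the graph contains the feature (toy values).
graph_ids = ["G1", "G2", "G3", "G4"]
matrix = [
    [1, 1, 0, 1],  # f1
    [1, 0, 0, 1],  # f2
    [1, 1, 0, 0],  # f3
    [1, 0, 1, 0],  # f4
    [1, 1, 0, 1],  # f5
]

print(relaxed_candidates(matrix, graph_ids, 2))  # ['G1', 'G2', 'G4']
print(relaxed_candidates(matrix, graph_ids, 0))  # ['G1']
```

G3 misses four of the five features and is discarded without any structural comparison; the remaining graphs are only candidates and still need a similarity check against the query.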

  45. Summary: Graph Mining • Graph mining has wide applications • Frequent and closed subgraph mining methods • gSpan and CloseGraph: pattern-growth depth-first search approach • Graph indexing techniques • Frequent and discriminative subgraphs are high-quality indexing features • Similarity search in graph databases • Indexing and feature-based matching • Further development and application exploration
