This course, led by Instructor Lei Zou at the Institute of Computer Science and Technology, Peking University, explores advanced algorithms for subgraph search in extensive graph databases. Topics include subgraph isomorphism algorithms (Ullmann, VF2, QuickSI), large-scale graph search techniques (GraphGrep, gIndex), and indexing mechanisms to address scalability challenges. Applications span areas like chemical informatics and bioinformatics. Gain insights into the intricacies of efficient subgraph matching and the theoretical underpinnings of subgraph searching in large datasets.
北京大学计算机科学技术研究所 Institute of Computer Science and Technology of Peking University Subgraph Search Over Large Graph Database Instructor: Lei Zou
Outline • Subgraph Isomorphism Algorithms: Ullmann Algorithm; VF2 Algorithm; QuickSI • Subgraph Search Over a Large Collection of Graphs: GraphGrep, gIndex, Closure-Tree, Gcode • Subgraph Search Over a Single Large Graph
Problem Definition Given a graph database and a query graph, find all graphs in the database that contain the query graph. [Figure: sample database graphs (a), (b), (c) and a query graph]
Applications • Chemical Informatics (chemical compound) • Bioinformatics (protein structure, pathway) • Workflow • XML • … Graph Database Management
Scalability Issue • Sequential scan is not scalable • Disk I/O • Subgraph isomorphism testing • An indexing mechanism is needed • DayLight: Daylight.com (commercial) • GraphGrep: Dennis Shasha et al., PODS'02 • Grace: Srinath Srinivasa et al., ICDE'03
GraphGrep (Shasha et al., PODS'02) • Fingerprinting: to filter the database • A subgraph matching algorithm
Concept Use small components of the query graph and of the database graphs to filter the database and to do the matching
Graph == Sets of "Paths" [Figure: example graph with nodes 0 (B), 1 (A), 2 (B), 3 (C); panels show its paths for lp = 2, 3, 4] For lp = 4: A = {(1)}; AB = {(1,0), (1,2)}; AC = {(1,3)}; ABC = {(1,0,3), (1,2,3)}; ACB = {(1,3,0), (1,3,2)}; ABCA = {(1,0,3,1), (1,2,3,1)}; ABCB = {(1,2,3,0), (1,0,3,2)}; B = {(0), (2)}; BA = {(0,1), (2,1)}; BC = {(0,3), (2,3)}; …
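The path sets listed on this slide can be reproduced with a short enumeration. This is a minimal sketch, not GraphGrep's own code: the edge list is inferred from the path sets above, `lp` is read as the maximum number of nodes in a path, and the rule that a path may not reuse an edge (which allows ABCA to return to node 1) is an assumption that happens to match the slide. The function name `label_paths` is hypothetical.

```python
from collections import defaultdict

def label_paths(labels, edges, lp):
    """Map each label string to the set of node-id paths with at most lp nodes."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    result = defaultdict(set)

    def extend(path, used_edges):
        # Record the current path under its concatenated node labels.
        key = "".join(labels[n] for n in path)
        result[key].add(tuple(path))
        if len(path) == lp:
            return
        for nxt in adj[path[-1]]:
            edge = frozenset((path[-1], nxt))
            # Assumption: an edge is used at most once within a path.
            if edge not in used_edges:
                extend(path + [nxt], used_edges | {edge})

    for start in labels:
        extend([start], frozenset())
    return result

# Example graph from the slide; the edges are inferred from its path sets.
labels = {0: "B", 1: "A", 2: "B", 3: "C"}
edges = [(1, 0), (1, 2), (1, 3), (0, 3), (2, 3)]
paths = label_paths(labels, edges, lp=4)
print(sorted(paths["ABCA"]))
```

Single-character labels keep the concatenated key unambiguous; longer labels would need a separator.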
Fingerprint [Figure: three database graphs g1, g2, g3 with labeled nodes, used to illustrate the fingerprint of each graph]
Patterns in a Query [Figure: query graph with four nodes labeled A, B, B, C] lp = 4: A*BCA*CB • lp = 3: A*BC, CB, CA*
Filter the Database [Figure: the query compared with graphs g1, g2, g3; g2 and g3 are marked Discarded]
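The filtering step can be sketched as a comparison of label-path counts. This assumes the fingerprint is a table counting how often each label-path occurs (the slide's own fingerprint table did not survive extraction), and the toy graphs below are hypothetical, not the slide's g1 to g3. A graph is kept only if it has at least as many occurrences of every label-path as the query.

```python
from collections import Counter, defaultdict

def fingerprint(labels, edges, lp):
    """Count label-paths with up to lp nodes (no edge reused within a path)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    counts = Counter()

    def extend(path, used):
        counts["".join(labels[n] for n in path)] += 1
        if len(path) < lp:
            for nxt in adj[path[-1]]:
                edge = frozenset((path[-1], nxt))
                if edge not in used:
                    extend(path + [nxt], used | {edge})

    for start in labels:
        extend([start], frozenset())
    return counts

def may_contain(graph_fp, query_fp):
    """False means the graph can be discarded without an isomorphism test."""
    return all(graph_fp[key] >= count for key, count in query_fp.items())

# Hypothetical data: a query path A-B-C and two candidate graphs.
query = fingerprint({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2)], lp=3)
kept = fingerprint({0: "B", 1: "A", 2: "B", 3: "C"},
                   [(1, 0), (1, 2), (1, 3), (0, 3), (2, 3)], lp=3)
dropped = fingerprint({0: "A", 1: "B", 2: "D"}, [(0, 1), (1, 2)], lp=3)
print(may_contain(kept, query), may_contain(dropped, query))
```

The filter never discards a true answer, because every path of the query maps to a distinct path with the same labels in any graph that contains it.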
Subgraph Matching [Figure: the query and graph g1; query pattern A*BCA*CB] • Select the sets of paths in g1 that match the patterns of the query: ABCA = {(1,0,3,1), (1,2,3,1)}, CB = {(3,0), (3,2)} • Combine any list from ABCA with any list from CB according to '*' and '_': ABCACB = {((1,0,3,1),(3,0)), ((1,0,3,1),(3,2)), ((1,2,3,1),(3,0)), ((1,2,3,1),(3,2))} • Remove lists that contain equal nodes in positions not involved above: ABCACB = {removed, ((1,0,3,1),(3,2)), ((1,2,3,1),(3,0)), removed}
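The combine-then-remove step on this slide can be traced with the slide's own path lists. In this sketch the choice of positions is a reading of the example rather than something the slide states: the C node is taken as the shared position between ABCA and CB, and the B node of CB is required to differ from the B node of ABCA.

```python
from itertools import product

# Path lists for graph g1, copied from the slide.
abca = [(1, 0, 3, 1), (1, 2, 3, 1)]
cb = [(3, 0), (3, 2)]

# Combine: the C node (index 2 of ABCA, index 0 of CB) must be the same node.
combined = [(p, q) for p, q in product(abca, cb) if p[2] == q[0]]

# Remove: the two B positions are distinct query nodes, so they must map
# to distinct graph nodes.
kept = [(p, q) for p, q in combined if p[1] != q[1]]

print(kept)
```

Two of the four combinations survive, matching the two entries the slide keeps.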
gIndex (Yan et al., SIGMOD'04) [Figure: query graph Q, graph G, and a substructure of Q] If graph G contains query graph Q, G must contain every substructure of Q. Remarks • Index substructures of a query graph to prune graphs that do not contain these substructures
Framework • Two steps in processing graph queries Step 1. Index Construction • Enumerate structures in the graph database and build an inverted index between structures and graphs Step 2. Query Processing • Enumerate structures in the query graph • Calculate the candidate graphs containing these structures • Prune the false positives by performing subgraph isomorphism tests
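The two steps can be sketched with plain sets. This is an illustration under assumptions: structures are represented as opaque strings that have already been enumerated, the function names are hypothetical, and the subgraph isomorphism test is passed in as a callable rather than implemented here.

```python
from collections import defaultdict

def build_index(db_features):
    """Step 1: inverted index from structure to the ids of graphs containing it."""
    index = defaultdict(set)
    for graph_id, features in db_features.items():
        for feature in features:
            index[feature].add(graph_id)
    return index

def candidates(index, query_features, all_ids):
    """Step 2a: graphs that contain every structure found in the query."""
    result = set(all_ids)
    for feature in query_features:
        result &= index.get(feature, set())
    return result

def answer(index, query_features, all_ids, contains_query):
    """Step 2b: drop false positives with an exact containment test."""
    return {g for g in candidates(index, query_features, all_ids)
            if contains_query(g)}

# Hypothetical database: three graphs described by the paths they contain.
db = {
    "a": {"C", "C-C", "C-N", "C-N-C"},
    "b": {"C", "C-C", "C-N", "C-N-C"},
    "c": {"C", "C-C", "C-N"},
}
index = build_index(db)
cands = candidates(index, {"C-N", "C-N-C"}, db.keys())
print(sorted(cands))
```

The candidate set is a superset of the true answers, which is why the verification step is still required.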
Cost Analysis • Query response time depends on the query indexing time, the disk I/O time, the isomorphism testing time, and the size of the candidate answer set |Cq| • Remark: make |Cq| as small as possible
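The four quantities on this slide are usually combined as below. The slide's own formula was lost in extraction, so this is a reconstruction, and the symbol names are chosen here for illustration: the index lookup is paid once, while disk I/O and isomorphism testing are paid once per candidate graph.

```latex
T_{\text{response}} \approx T_{\text{index}} + |C_q| \cdot \left( T_{\text{io}} + T_{\text{iso\_test}} \right)
```

Under this reading, shrinking the candidate set |Cq| reduces the dominant term, which is the point of the remark.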
Path-Based Approach [Figure: sample database graphs (a), (b), (c)] Paths • 0-length: C, O, N, S • 1-length: C-C, C-O, C-N, C-S, N-N, S-O • 2-length: C-C-C, C-O-C, C-N-C, ... • 3-length: ... Build an inverted index between paths and graphs
Path-Based Approach (cont.) [Figure: query graph] 0-length: SC = {a, b, c}, SN = {a, b, c} • 1-length: SC-C = {a, b, c}, SC-N = {a, b, c} • 2-length: SC-N-C = {a, b}, … … Intersecting these sets, we obtain the candidate answers, graph (a) and graph (b), which may contain the query graph.
Problems of Path-Based Approach [Figure: sample database graphs (a), (b), (c) and a query graph] Only graph (c) contains this query graph. However, if we index only the paths C, C-C, C-C-C, C-C-C-C, we cannot prune graphs (a) and (b).
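The intersection on this slide can be checked directly. The sketch uses only the posting sets the slide lists; the entries elided by "…" are left out, and the variable names are hypothetical.

```python
# Posting sets copied from the slide: path -> graphs containing it.
postings = {
    "C": {"a", "b", "c"},
    "N": {"a", "b", "c"},
    "C-C": {"a", "b", "c"},
    "C-N": {"a", "b", "c"},
    "C-N-C": {"a", "b"},
}

# A graph is a candidate only if it appears in every posting set.
candidate_graphs = set.intersection(*postings.values())
print(sorted(candidate_graphs))
```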
Disadvantages of Path-Based Approach • Paths are simple, structural information is lost • There are too many paths We propose • Use structures instead of paths • Use discriminative structures
gIndex: Indexing Graphs by Data Mining • Identify frequent structures in the database, that is, subgraphs that appear often in the graph database • Prune redundant frequent structures to maintain a small set of discriminative structures • Create an inverted index between discriminative frequent structures and graphs in the database
Frequent Structures [Figure: sample database graphs (a), (b), (c), and two frequent structures (a), (b) with support 2]
Frequent Structures (cont.) • Efficient frequent graph mining algorithms are available Apriori: • AGM/AcGM: Inokuchi et al (PKDD’00) • FSG, Kuramochi et al (ICDM’01) • Vanetik et al (ICDM’02) Pattern-growth: • MoFa, Borgelt et al (ICDM’02) • gSpan: Yan and Han (ICDM’02) • …
Frequent Structures: Threshold Issue • How should the minimum support threshold be set? • If it is too low, it may generate too many frequent graphs • If it is too high, it may miss important structures • Should we enforce a uniform threshold for structures of different sizes? • Answer: a size-increasing support threshold
Frequent Structures: Threshold Issue • Intuition: large structures with low support will likely be indexed well by their substructures that have similar support • Size-increasing support threshold • The support threshold increases as the indexed structures become larger
Frequent Structures: Volume Issue • The number of frequent structures may exceed the number of graphs in the database when the support is low • 1,000 graphs may generate 1,000,000 frequent structures • Computing and indexing all frequent structures is expensive in both time and memory • Answer: index only discriminative structures
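A size-increasing threshold can be sketched as a function of structure size. The linear shape used here is an assumption chosen for simplicity, and the names `size_increasing_support`, `max_size`, and `max_support` are hypothetical; the slide only requires that the threshold grows with size.

```python
def size_increasing_support(size, max_size, max_support):
    """Minimum support required for a structure with `size` edges.

    Assumption: the threshold grows linearly and reaches `max_support`
    at `max_size`; it is rounded up and never drops below 1.
    """
    return max(1, (max_support * size + max_size - 1) // max_size)

def passes_threshold(size, support, max_size, max_support):
    """True if a structure of this size is frequent enough to be considered."""
    return support >= size_increasing_support(size, max_size, max_support)

# Small structures are kept even with low support; large ones need more.
print(passes_threshold(1, 12, max_size=10, max_support=100))
print(passes_threshold(8, 12, max_size=10, max_support=100))
```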
Redundant Structures [Figure: sample database graphs (a), (b), (c)] • All graphs contain the structures C, C-C, C-C-C • Why bother indexing these redundant frequent structures? • Remove these redundant structures • Only index structures that provide more information than existing structures
Discriminative Structures • Pinpoint the most useful frequent structures • Given a set of structures f1, f2, …, fn and a new structure x, we measure the extra indexing power provided by x, P(x | f1, f2, …, fn), where each fi is a substructure of x. When P is small enough, x is a discriminative structure and should be included in the index • Index discriminative frequent structures only • Reduce the index size by an order of magnitude • Achieve good performance
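One way to make the "extra indexing power" concrete is to estimate it from posting sets. In this sketch, x denotes the new structure and f1, …, fn its already indexed substructures (names assumed here); the measure is estimated as the fraction of graphs containing all fi that also contain x. Treating that ratio as the measure, and the function name itself, are assumptions for illustration.

```python
def extra_indexing_power(graphs_with_x, graphs_with_subs):
    """Estimate P(x | f1..fn) as |D_x| / |intersection of D_fi|.

    graphs_with_x: ids of graphs containing x.
    graphs_with_subs: one id set per indexed substructure fi of x.
    A value near 1 means x prunes nothing new (redundant);
    a small value means x is discriminative.
    """
    common = set.intersection(*graphs_with_subs)
    if not common:
        return 1.0  # nothing left to prune, so x adds no power
    return len(graphs_with_x & common) / len(common)

all_graphs = {"a", "b", "c"}
# C-C-C occurs wherever C and C-C occur: redundant.
redundant = extra_indexing_power(all_graphs, [all_graphs, all_graphs])
# A hypothetical ring structure occurs in only one of those graphs.
selective = extra_indexing_power({"c"}, [all_graphs, all_graphs])
print(redundant, selective)
```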
gIndex - Construction • First generate all frequent fragments while removing redundant ones • Translate fragments into sequences and store them in a prefix tree • Each fragment has an id list: the ids of the graphs containing the fragment • Graph Sequentialization (DFS Code) • A labeled edge is a 5-tuple (i, j, l_i, l_(i,j), l_j) • Described in another paper
gIndex - Construction • gIndex Tree • Each fragment can be mapped to an edge sequence (DFS code). The edge sequences of discriminative fragments are inserted into a prefix tree called the gIndex Tree
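A prefix tree over edge sequences can be sketched as a dictionary-based trie. This is an illustrative data structure, not the gIndex implementation: the class and function names are hypothetical, and the example DFS codes are made up. Each edge is a 5-tuple (i, j, l_i, l_(i,j), l_j), and only nodes that end an indexed fragment carry an id list.

```python
class GIndexNode:
    """One node of the prefix tree; children are keyed by edge 5-tuples."""
    def __init__(self):
        self.children = {}
        self.graph_ids = None  # set only where an indexed fragment ends

def insert(root, edge_sequence, graph_ids):
    """Store a fragment's edge sequence together with its graph id list."""
    node = root
    for edge in edge_sequence:
        node = node.children.setdefault(edge, GIndexNode())
    node.graph_ids = set(graph_ids)

def lookup(root, edge_sequence):
    """Return the id list of a fragment, or None if it is not indexed."""
    node = root
    for edge in edge_sequence:
        node = node.children.get(edge)
        if node is None:
            return None
    return node.graph_ids

root = GIndexNode()
# Hypothetical DFS code for the fragment C-C-N.
ccn = ((0, 1, "C", "-", "C"), (1, 2, "C", "-", "N"))
insert(root, ccn, {"a", "b"})
print(lookup(root, ccn))
```

Fragments that share a prefix of their edge sequence share the corresponding path in the tree, which keeps the index compact.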
gIndex - Search • Optimization • Apriori Pruning • If a fragment is not in the gIndex tree, we need not check its supergraphs
gIndex - Search • Verification • After obtaining the candidate answer set, we must verify that the graphs in the set really contain the query graph • Perform a subgraph isomorphism test on each candidate graph, one by one
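The verification step needs some subgraph isomorphism test. The sketch below is a plain backtracking search, not Ullmann, VF2, or QuickSI; it assumes vertex labels only, undirected unlabeled edges, and non-induced containment (extra edges in the data graph are allowed). The function name is hypothetical.

```python
def contains_subgraph(g_labels, g_edges, q_labels, q_edges):
    """True if the query graph can be embedded in the data graph."""
    g_adj = {n: set() for n in g_labels}
    for u, v in g_edges:
        g_adj[u].add(v)
        g_adj[v].add(u)
    q_adj = {n: set() for n in q_labels}
    for u, v in q_edges:
        q_adj[u].add(v)
        q_adj[v].add(u)
    order = list(q_labels)
    mapping = {}  # query node -> data node

    def backtrack(k):
        if k == len(order):
            return True
        q = order[k]
        for g in g_labels:
            if g in mapping.values() or g_labels[g] != q_labels[q]:
                continue
            # Every already-mapped neighbour of q must be adjacent to g.
            if all(mapping[n] in g_adj[g] for n in q_adj[q] if n in mapping):
                mapping[q] = g
                if backtrack(k + 1):
                    return True
                del mapping[q]
        return False

    return backtrack(0)

# Data graph with nodes 0 (B), 1 (A), 2 (B), 3 (C).
g_labels = {0: "B", 1: "A", 2: "B", 3: "C"}
g_edges = [(1, 0), (1, 2), (1, 3), (0, 3), (2, 3)]
triangle = ({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2), (2, 0)])
print(contains_subgraph(g_labels, g_edges, *triangle))
```

The worst case is exponential, which is why the index is used first to keep the number of graphs passed to this test small.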
Graph Query Processing • Chemical Compounds (a) caffeine (b) diurobromine (c) viagra • Query Graph
Precise vs. Approximate Search in Graphs • Given a graph database and a query graph Q, • Find graphs containing Q exactly • (Precise Matching, gIndex, SIGMOD’04) • Find graphs containing Q approximately (Approximate Matching, Grafil)
Evaluating Graph Similarity 1. Maximal Common Subgraph (MCS): Given two graphs Q and G, assume that S is subgraph-isomorphic to both Q and G. S is then called a common subgraph of Q and G. The MCS of Q and G is the common subgraph with the largest number of edges (|E(S)|).
Evaluating Graph Similarity [Figure: graphs Q and G with their maximal common subgraph (MCS)]
Evaluating Graph Similarity 2. Minimal Graph Edit Distance: The minimal edit distance between Q and G is the minimum number of edit operations (insertion, deletion, or relabeling) needed to transform Q into G under an optimal alignment.
Evaluating Graph Similarity 2. Minimal Graph Edit Distance [Figure: graphs Q and G illustrating the edit operations between them]
Solution (I) • Compute the similarity between the graphs in the database and the query graph directly (costly) • sequential scan • subgraph similarity computation
Solution (II) • Form a set of subgraph queries from the original query graph and use the exact subgraph search (costly) • If we allow 3 edges to be missed in a 20-edge query graph, it may generate 1,140 subgraphs.
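The figure of 1,140 is the number of ways to pick which 3 of the 20 edges to drop. The short check below also counts the variants with fewer missing edges, on the assumption that "allow 3 edges to be missed" means up to 3; neither count excludes edge subsets that leave the query disconnected.

```python
from math import comb

# Choose exactly 3 of the 20 query edges to delete.
exactly_three_missing = comb(20, 3)

# Assumption: also count queries with 0, 1, or 2 edges deleted.
up_to_three_missing = sum(comb(20, k) for k in range(4))

print(exactly_three_missing, up_to_three_missing)
```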
Scalability Issue • Sequential scan is not scalable • Disk I/O • Approximate subgraph isomorphism testing • It takes minutes to finish a graph query • A strategy of indexing and searching is needed • Prune as many candidates as possible
Index Needed ! • Precise Search • Use frequent patterns as indexing features • Select features in the database space based on their selectivity • Build the index • Approximate Search • Hard to build indices covering similar subgraphs – explosive number of subgraphs in databases • Idea: (1) keep the index structure (2) select features in the query space
Substructure Similarity Measure • Structure-based similarity measure • The largest overlapping part of two graphs • Relaxation: the number of edges that can be relabeled or deleted (relaxation of the query graph) [Figure: graphs G and Q]
Structural Features Graph Database (a) (b) (c) Structural Features (small fragments) • atom • path • bond • subgraph
Substructure Similarity Measure • Feature-based similarity measure • Each graph is represented as a feature vector X = {x1, x2, …, xn} • The similarity is defined by the distance of their corresponding vectors • Easy to index • Very fast • Rough measure
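A feature-based comparison can be sketched in a few lines. The slide does not say which distance is used, so the L1 distance below is an assumption, as are the function names and the feature counts.

```python
def feature_vector(feature_counts, feature_order):
    """Turn a feature -> count mapping into a vector X = (x1, ..., xn)."""
    return [feature_counts.get(f, 0) for f in feature_order]

def l1_distance(x, y):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical features and counts for a query and a database graph.
feature_order = ["C-C", "C-N", "C-O"]
query_vec = feature_vector({"C-C": 2, "C-N": 1}, feature_order)
graph_vec = feature_vector({"C-C": 3, "C-O": 1}, feature_order)
print(l1_distance(query_vec, graph_vec))
```

The comparison is cheap because it never inspects graph structure, which is also why it is only a rough measure.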
Substructure Similarity Search • Structure-based similarity • Accurate measure • Slow • Feature-based similarity • Rough measure • Fast • Can we transform structure-based similarity into feature-based similarity?
Intuition [Figure: query Q and graphs G1, G2, with the substructures of Q highlighted] • If graph G contains the major part of a query graph Q, G should share a number of common features with Q • Given a relaxation ratio, calculate the maximal number of features that can be missed • At least one of the highlighted substructures should be contained
Feature-Graph Matrix • An occurrence table between features and graphs • Assume a query graph has 4 features and at most 1 feature may be missed under the relaxation threshold
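The filtering rule implied by this slide can be sketched with a small occurrence table. The table contents, the feature names f1 to f4, and the function name are hypothetical; the only part taken from the slide is the setting of 4 query features with at most 1 allowed to be missing.

```python
def filter_by_misses(occurrence, query_features, graph_ids, max_missed):
    """Keep graphs that miss at most `max_missed` of the query's features."""
    kept = []
    for graph_id in graph_ids:
        missed = sum(1 for f in query_features
                     if graph_id not in occurrence.get(f, set()))
        if missed <= max_missed:
            kept.append(graph_id)
    return kept

# Hypothetical feature-graph occurrence table: feature -> graphs containing it.
occurrence = {
    "f1": {"G1", "G2", "G3"},
    "f2": {"G1", "G2"},
    "f3": {"G1", "G2", "G3"},
    "f4": {"G1"},
}
graph_ids = ["G1", "G2", "G3", "G4"]
survivors = filter_by_misses(occurrence, ["f1", "f2", "f3", "f4"],
                             graph_ids, max_missed=1)
print(survivors)
```

Graphs that miss more features than the relaxation allows are pruned without any structural comparison; the survivors still need an approximate matching test.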