iGraph: A Framework for Comparisons of Disk-Based Graph Indexing Techniques
VLDB '10, Jeffrey Xu Yu et al.
Presented by Tao Yu

Why I chose this paper
• Disk-based
• Implementation technique
• Graph database
• Application
• Dataset

Why they wrote this paper
• Provide a uniform test framework.
• Comparing wall-clock times of binary executables is not fair.
• Some algorithms are implemented in memory while others are implemented on disk.
• Obtain real disk I/Os by bypassing the OS disk cache.
• Perform a large number of tests.

Background
• Application
• Graph isomorphism
• Stream
• A large number of small graphs
• Undirected labeled graph
• G = (V, E, Lv, Le)
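A minimal sketch of the undirected labeled graph from the Background slide may help fix the notation. The class and method names are illustrative assumptions, not part of iGraph; the size of a graph is taken to be its number of edges, as the query-set slide later does.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected labeled graph G = (V, E, Lv, Le); all names are illustrative."""
    vertex_labels: dict = field(default_factory=dict)  # vertex id -> label (V, Lv)
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> label (E, Le)

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        if u == v:
            raise ValueError("self-loops are not modeled in this sketch")
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("add both endpoints before adding the edge")
        # A frozenset key makes (u, v) and (v, u) the same undirected edge.
        self.edge_labels[frozenset((u, v))] = label

    def size(self):
        """Graph size measured in edges."""
        return len(self.edge_labels)


g = LabeledGraph()
for v, label in [(0, "C"), (1, "C"), (2, "O")]:
    g.add_vertex(v, label)
g.add_edge(0, 1, "single")
g.add_edge(1, 0, "single")  # same undirected edge, stored once
g.add_edge(1, 2, "double")
```

Keying edges by an unordered pair is one simple way to keep the graph undirected; an adjacency list would serve equally well.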

Related work
• Mining-based approaches
• Non-mining-based approaches
• Size: number of edges

gIndex and FG-Index
• gIndex
  • Indexing
    • All frequent subgraphs (maxL)
    • A subset of infrequent subgraphs (maxL)
    • Discriminative features
  • Query
    • Enumerate all subgraphs (maxL)
• FG-Index
  • Indexing
    • All frequent subgraphs
    • All infrequent edges
  • Query
    • Enumerate a subset of subgraphs
    • Verification-free strategy

Tree+Δ and SwiftIndex
• Tree+Δ
  • Indexing
    • All frequent trees of size up to maxL - 1
    • All infrequent edges
    • Generates graph features on the fly
  • Query
    • Enumerate all subtrees (maxL)
• SwiftIndex
  • Indexing
    • All frequent trees of size up to maxL
    • All discriminative trees of size up to maxL
    • All infrequent edges
  • Query
    • PrefixQuickSI

GraphGrep and C-Tree
• GraphGrep
  • Indexing
    • All paths (maxL)
  • Query
    • Enumerate all paths (maxL)
• C-Tree
  • Indexing
    • A hierarchical tree of graph closures
  • Query
    • Pseudo subgraph isomorphism test
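The path-based idea behind GraphGrep can be sketched as follows. This is a simplified illustration under stated assumptions, not GraphGrep's actual implementation: paths are identified by their vertex-label sequences only, edge labels are ignored, and `label_paths` and `may_contain` are hypothetical helper names.

```python
from collections import Counter


def label_paths(adj, labels, max_len):
    """Count label sequences of all simple paths with at most max_len edges.

    adj: vertex -> list of neighbours; labels: vertex -> label.
    Each undirected path with at least one edge is counted once per direction.
    """
    counts = Counter()

    def extend(path):
        counts[tuple(labels[v] for v in path)] += 1
        if len(path) - 1 == max_len:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:  # keep the path simple
                extend(path + [nxt])

    for v in adj:
        extend([v])
    return counts


def may_contain(graph_counts, query_counts):
    """Necessary (not sufficient) condition for the query to be a subgraph."""
    return all(graph_counts[p] >= c for p, c in query_counts.items())


# Data graph: path C - N - O; query: single edge C - N.
graph = label_paths({0: [1], 1: [0, 2], 2: [1]}, {0: "C", 1: "N", 2: "O"}, 2)
query = label_paths({0: [1], 1: [0]}, {0: "C", 1: "N"}, 2)
```

Graphs that pass `may_contain` are only candidates; they would still need a subgraph isomorphism test.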

gCode
• Indexing
  • Vertex signature from neighbors
  • Graph signature from vertex signatures
  • GCode-Tree
    • <signature, count>
• Query
  • Index level (graph signature)
  • Object level (vertex signature)

Isomorphism Algorithms
• VF2
• QuickSI
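For orientation, the verification step that VF2 and QuickSI perform can be pictured with a plain backtracking matcher. The sketch below is neither VF2 nor QuickSI: it has no search-order or pruning heuristics, and the function name and graph encoding (dicts of vertex labels, edges keyed by `frozenset`) are assumptions made for illustration.

```python
def is_subgraph(q_vlabels, q_edges, g_vlabels, g_edges):
    """Return True if the query graph is subgraph-isomorphic to the data graph.

    *_vlabels: vertex -> label; *_edges: frozenset({u, v}) -> edge label.
    Every query edge must map to a data edge with the same label
    (extra data edges between matched vertices are allowed).
    """
    q_vertices = list(q_vlabels)

    def consistent(qv, gv, mapping):
        if q_vlabels[qv] != g_vlabels[gv]:
            return False
        for qu, gu in mapping.items():
            q_edge = frozenset((qv, qu))
            if q_edge in q_edges:
                g_edge = frozenset((gv, gu))
                if g_edge not in g_edges or g_edges[g_edge] != q_edges[q_edge]:
                    return False
        return True

    def search(i, mapping, used):
        if i == len(q_vertices):
            return True  # every query vertex has been matched
        qv = q_vertices[i]
        for gv in g_vlabels:
            if gv not in used and consistent(qv, gv, mapping):
                mapping[qv] = gv
                used.add(gv)
                if search(i + 1, mapping, used):
                    return True
                del mapping[qv]  # backtrack
                used.discard(gv)
        return False

    return search(0, {}, set())


# Data graph: triangle with labels C, C, O.
g_v = {0: "C", 1: "C", 2: "O"}
g_e = {frozenset((0, 1)): "s", frozenset((1, 2)): "d", frozenset((0, 2)): "s"}
```

The worst-case cost is exponential in the query size, which is why the indexes above try to shrink the candidate set before this step runs.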

Implementation
• Graph
  • A list of vertices and a list of edges
  • If a graph is smaller than the page size, store it as a tuple in a heap page
  • Otherwise, store it as a BLOB
  • B+-tree over all graphs, keyed by graph ID
• Other techniques
  • CAM code to encode features
  • djb2 hash function
  • Mini-page
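Two of the implementation points above are easy to sketch. The djb2 function below follows the well-known definition (start at 5381, multiply by 33, add each byte); truncating to 32 bits is an assumption, since the slides do not give the hash width. The `storage_kind` helper is hypothetical and ignores page headers, which a real heap page would also have to fit.

```python
PAGE_SIZE = 8 * 1024  # 8 KB pages, as listed on the disk schedule slide


def djb2(data: bytes) -> int:
    """Classic djb2 hash; the 32-bit mask is an assumption of this sketch."""
    h = 5381
    for byte in data:
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def storage_kind(serialized_size: int, page_size: int = PAGE_SIZE) -> str:
    """Pick a storage form for a serialized graph (simplified rule)."""
    return "heap_tuple" if serialized_size < page_size else "blob"
```

A feature's CAM code, serialized to bytes, could be passed to `djb2` to obtain a fixed-width key; how iGraph combines the two is not stated in the slides.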

Dataset
• Small sparse
  • AIDS: 10,000 graphs
  • 25.42 vertices and 27.40 edges on average
  • 51 vertex labels and 4 edge labels
• Small dense
  • GraphGen: 10,000 graphs
  • 7 vertices and 30 edges
  • 20 vertex labels and 20 edge labels
• Large
  • PubChem: 1,000,000 graphs
  • 23.98 vertices and 25.76 edges on average
  • 81 vertex labels and 3 edge labels

Query sets
• For AIDS: the existing query sets Q4, Q8, ..., Q24 can be downloaded from [3].
• Each query set Qn contains 1000 graphs, each of size n.
• For the other datasets: first, randomly select 1000 graphs of size 24 or larger from each dataset. Then, for each graph g, remove edges, keeping g connected, until it contains 24 edges. This query set is called Q24.
• To generate Q20, remove edges from each graph in Q24 until the remaining graph contains 20 edges. Repeat this process to generate the remaining query sets.
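The edge-removal procedure for building Q24, Q20, and so on can be sketched like this. It is an illustration under assumptions: the slides do not say how the edge to remove is chosen, so the sketch picks uniformly at random among edges whose removal keeps the graph connected, and vertices left without edges are simply dropped. `is_connected` and `shrink_to` are hypothetical names.

```python
import random


def is_connected(edges):
    """True if the graph induced by the given edges is connected."""
    if not edges:
        return True
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(adj)


def shrink_to(edges, target, rng):
    """Remove edges one at a time, keeping the graph connected, until target edges remain."""
    edges = list(edges)
    if not is_connected(edges):
        raise ValueError("input graph must be connected")
    while len(edges) > target:
        # A connected graph always has such an edge: a cycle edge or a leaf edge.
        removable = [e for e in edges
                     if is_connected([f for f in edges if f != e])]
        edges.remove(rng.choice(removable))
    return edges


rng = random.Random(0)
source = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2), (4, 5)]
q4 = shrink_to(source, 4, rng)  # plays the role of Q24
q2 = shrink_to(q4, 2, rng)      # plays the role of Q20, derived from the larger query
```

Because each smaller query is derived from the next larger one, every query in the smaller set is a subgraph of its counterpart in the larger set.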

Disk schedule
• LRU as the buffer replacement algorithm
• Page size: 8 KB
• FILE_FLAG_NO_BUFFERING to bypass the OS disk cache
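A small LRU buffer pool that counts page reads shows how the number of disk I/Os can be measured independently of wall-clock time. This is a sketch, not iGraph's buffer manager: the class name and the `read_page` callback are assumptions, and the unbuffered file access that FILE_FLAG_NO_BUFFERING provides on Windows is not modeled.

```python
from collections import OrderedDict


class LRUBufferPool:
    """Fixed-capacity page cache with least-recently-used eviction."""

    def __init__(self, capacity, read_page):
        self.capacity = capacity    # number of pages that fit in the buffer
        self.read_page = read_page  # callback: page_id -> page contents
        self.pages = OrderedDict()  # ordered from least to most recently used
        self.disk_reads = 0

    def get(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # hit: mark as most recently used
            return self.pages[page_id]
        self.disk_reads += 1                 # miss: one disk I/O
        data = self.read_page(page_id)
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        return data


pool = LRUBufferPool(2, lambda page_id: bytes(8 * 1024))
for page_id in [1, 2, 1, 3, 2]:
    pool.get(page_id)

scan = LRUBufferPool(2, lambda page_id: bytes(8 * 1024))
for page_id in [1, 2, 3, 1, 2, 3]:
    scan.get(page_id)
```

The second pool repeatedly scans three pages with room for only two, so every access misses. That is the sequential flooding effect the experiments later attribute to C-Tree.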

Experiment
• The database construction cost of gIndex is comparable to that of other feature selection methods such as Tree+Δ, FG-Index, and SwiftIndex.
• "We have communicated with Xifeng Yan who first ignored the edge label. He did that simply in order to 'make the problem more difficult.' Subsequent work imitated his setting without clear reason."

Experiment
• More features, fewer candidates.
• gIndex performs the best.

Experiment (gCode)
• For Q4, FG-Index performs the best since it exploits the verification-free strategy.
• gCode performs the worst: (1) it produces more candidates; (2) lookups over the vertex signature dictionary need more buffering.

Experiment (C-Tree)
• For C-Tree, the number of disk I/Os is only slightly reduced compared with a small buffer size, since the database size of C-Tree is still larger than the buffer size and tree traversal incurs the sequential flooding effect.

Experiment (gIndex)
• gIndex is slightly slower than FG-Index and SwiftIndex due to slow subgraph enumeration from a query.
• This fact indicates that the I/O cost must be carefully optimized to obtain good performance.

Experiment (gCode, gIndex)
• Only 37 frequent features. Almost all features in FG-Index, Tree+Δ, and SwiftIndex are infrequent features.
• gCode uses signatures.
• gIndex mines all infrequent and discriminative features of size up to 3.

Experiment (Tree+Δ)
• Drastic changes to gCode (I), C-Tree (I), and Tree+Δ.
• Frequent feature space is small.
• Graph features reclaimed at small sizes are used for larger query sizes.

Experiment (Tree+Δ)
• FG-Index does not outperform gIndex even for Q4 since there exist no frequent features of size 4.
• Queries in this dense synthetic dataset contain many cycles, and thus the cost of mining graph features on the fly is very high.

Experiment
• The number of index features used by FG-Index or SwiftIndex is much smaller than that of gIndex.
• This result indicates that more features in the index do not by themselves guarantee better performance.

Experiment
• The trends of all curves are consistent with those for the number of I/Os.
• gIndex shows the best performance in both cold and hot runs for a moderately dense dataset.

Experiment
• gCode performs the best for large query sizes with high density.
• gIndex performs comparatively better for a larger number of labels since its pruning is relatively more effective.

Results for Large Graph Database
• Since both SeqScan and C-Tree require prohibitive times to finish the experiments even with large buffer sizes, we exclude them from the large graph database experiments.
• As for gCode, we can run experiments with a 1 GB buffer and hot run; with buffer sizes smaller than 1 GB and cold run, we are unable to finish the experiments within a week.

Results for Large Graph Database
• FG-Index's pruning power is up to 13.09 times lower than that of gIndex, since FG-Index uses a strategy to select a subset of features in its index to minimize the filtering cost.

Results for Large Graph Database
• For Q4, FG-Index performs the best due to its verification-free strategy.
• For Q8 to Q12, gIndex performs the best since its pruning power is the best.
• For Q16 to Q24, either SwiftIndex or FG-Index performs the best since their posting list intersection costs are the least.
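The posting list intersection mentioned above can be pictured with a short sketch. It assumes a hypothetical index layout, a dict from a feature key to the set of IDs of graphs containing that feature, and treats query features missing from the index as giving no pruning information. None of the indexes compared here is claimed to work exactly this way.

```python
def candidate_graphs(index, query_features, all_graph_ids):
    """Intersect the posting lists of the indexed features found in the query."""
    lists = [index[f] for f in query_features if f in index]
    if not lists:
        return set(all_graph_ids)  # nothing to prune with
    lists.sort(key=len)            # start from the shortest posting list
    candidates = set(lists[0])
    for posting in lists[1:]:
        candidates &= posting
        if not candidates:
            break                  # already empty, no need to continue
    return candidates


# Hypothetical feature keys and graph IDs, for illustration only.
index = {"C-C": {1, 2, 3}, "C-N": {2, 3}, "C-O": {3, 4}}
result = candidate_graphs(index, ["C-C", "C-N"], range(1, 5))
```

More features mean more lists to fetch and intersect, which is one way to read the trade-off between pruning power and filtering cost reported in these results.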

Results for Large Graph Database
• Although gIndex performs worse than SwiftIndex and FG-Index in the number of I/Os for large query sizes, it performs the best for all query sizes except Q4 due to a good combination of the lowest number of candidates and low disk I/O costs.

Conclusion
• Overall winner: gIndex.
• For large queries on dense graphs, we recommend gCode.
• Source code: http://www.igraph.or.kr/
