Developing 3-in-1 Index Structures on Complex Structure Similarity Search

Presented by Xiaoli Wang
Supervisor: Anthony K. H. Tung
2014/11/2, SOC@NUS

Outline
• Motivation
  • Complex structures
  • Limitations on existing systems
• Our solutions
  • Inverted index for graph similarity search
  • Inverted index for sequence similarity search
  • A 3-in-1 inverted indexing tool and its applications
• Conclusion and future work

Why complex structures?
• Relational database
  • Data model: table
• Real objects
  • Sequence, tree, and graph
[Figure: examples of complex structures: chemical compound, protein structure, program flow, XML document, coil, image, shape, fingerprint, letter]
XML example: <user><person><name>John</name></person></user> <user><person><name>Mary</name><id>u200</id></person></user>

How to manage complex structures?
• Conventional indexing and searching systems
  • Restricted to a given data model
  • Hard to adapt to support various complex structures
  • Waste resources re-designing a storage system for each structure type
[Figure: graph, sequence, and tree databases, each with its own access method and storage, serving documents, program flow, biology data, and images]

Real application: social reading systems
• Book duplicate detection problem

Real application: social reading systems
• Book duplicate detection problem
  • Graph solution [1]
[1] Adam Schenker. Graph-theoretic techniques for web content mining. PhD thesis, University of South Florida, 2003.

Real application: social reading systems
• Book duplicate detection problem
• Book annotation migration problem
  • Sequence solution [1]
[1] Xiaoli Wang et al. Efficient and Effective KNN Sequence Search with Approximate n-grams. In PVLDB, 2013.

Real application: social reading systems
• Book duplicate detection problem
• Book annotation migration problem
• Annotation search by snapping problem
  • Sequence solution [1]
[1] Xiaoli Wang et al. Efficient and Effective KNN Sequence Search with Approximate n-grams. In PVLDB, 2013.

Real application: social reading systems
• Book duplicate detection problem
  • Graph matching problem
• Book annotation migration problem
  • Sequence matching problem
• Annotation search by snapping problem
  • Sequence similarity search problem
• Build an information management tool to support the storage and retrieval of sequences and graphs

Objectives
• Building a 3-in-1 unified indexing system
  • Design a unified and effective indexing mechanism
  • Support similarity search on various complex structures
• Scope of this thesis
  • Focus on the similarity search problems
  • Select edit distance as the similarity measure
  • A tree is considered as a specific graph

Objectives
• Building a 3-in-1 unified indexing system
  • Design a unified and effective indexing mechanism
  • Support storage and retrieval of all complex structures
• Step 1: Propose an effective index to support efficient graph similarity search
• Step 2: Extend the above index to further support efficient sequence similarity search
• Step 3: Design and implement the unified 3-in-1 index to support various complex structure search

Graph similarity search problem
• Given a graph database and a query graph, find the database graphs most similar to the query under graph edit distance
• Graph edit distance: λ(g1, g2) = minimum number of operations (insertion, deletion, substitution) needed to transform graph g1 into graph g2
[Figure: a query graph, a graph database, and the graphs similar to the query]

Existing works
• A naive solution: no pruning and poor scalability
  • Impractical exact graph edit distance (GED) computations
• C-Star (VLDB'09): too many GED bound computations
  • Tighter bound, but expensive GED bound computation
• K-AT (TKDE'10): loose bound, too many false positives
  • Loose GED bound
[Figure: illustrative pipelines over a database of 100,000 graphs with 50 answers. Naive: full scan, GED computation on 100,000 candidates. C-Star: full scan with GED bound computation, leaving 100 candidates for GED computation. K-AT: index with 70% scan and GED bound computation, leaving 90,000 candidates for GED computation.]

Our solution
• Build an index to reduce scan ratio
• Use tighter GED bounds to reduce candidate size
• Inverted index based on the star decomposition
  • The graph is decomposed into smaller stars
  • The inverted index stores the references between graphs and stars
  • An efficient CA-based search framework is proposed
[Figure: illustrative pipeline over a database of 100,000 graphs: index with 5% scan and GED bound computation, leaving 100 candidates for GED computation and 50 answers]

Star decomposition
• Star structure
  • A star structure is a labelled, single-level, rooted tree that can be represented by a 3-tuple s = (r, L, l), where r is the root vertex, L is the set of leaves, and l is a labelling function
• Star representation for a graph
  • A graph can be broken into a multiset of star structures
[Figure: a graph g1 and its star representation S(g1)]
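The decomposition can be illustrated with a minimal Python sketch. It assumes an undirected graph given as a vertex-label dictionary plus an edge list, and writes each star as a (root label, sorted leaf labels) pair; the function name and the toy graph are hypothetical, not taken from the slides.

```python
from typing import Dict, List, Tuple

# A star is written as (root label, sorted tuple of leaf labels).
Star = Tuple[str, Tuple[str, ...]]


def star_decomposition(labels: Dict[int, str],
                       edges: List[Tuple[int, int]]) -> List[Star]:
    """Return the multiset S(g): one star per vertex, rooted at that vertex."""
    neighbours: Dict[int, List[int]] = {v: [] for v in labels}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    stars = []
    for v in labels:
        # The leaves of the star rooted at v are the labels of v's neighbours.
        leaves = tuple(sorted(labels[w] for w in neighbours[v]))
        stars.append((labels[v], leaves))
    return sorted(stars)


# Hypothetical toy graph: vertex b is adjacent to a and to two c vertices.
labels = {0: "a", 1: "b", 2: "c", 3: "c"}
edges = [(0, 1), (1, 2), (1, 3)]
stars = star_decomposition(labels, edges)
```

A graph with |V| vertices always yields |V| stars, so the representation is a multiset: the two c vertices above produce the same star twice.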

Star decomposition
• Star edit distance
Given two star structures s1 and s2,
  λ(s1, s2) = T(r1, r2) + d(L1, L2)
where
  T(r1, r2) = 0 if l(r1) = l(r2), and 1 otherwise
  d(L1, L2) = ||L1| − |L2|| + M(L1, L2)
  M(L1, L2) = max{|ΨL1|, |ΨL2|} − |ΨL1 ∩ ΨL2|
and ΨL denotes the multiset of vertex labels in L.
Example: given s1 = abcc and s2 = dcc (root label first),
  T(r1, r2) = 1, as l(a) ≠ l(d)
  d(L1, L2) = |3 − 2| + (3 − 2) = 2
  λ(s1, s2) = 1 + 2 = 3
[Figure: the stars s1 and s2]
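The formula can be checked with a short Python sketch. It assumes a star is given as a (root label, leaf labels) pair and uses `collections.Counter` for the label multisets; the function name is illustrative only.

```python
from collections import Counter
from typing import Sequence, Tuple

Star = Tuple[str, Sequence[str]]  # (root label, leaf labels)


def star_edit_distance(s1: Star, s2: Star) -> int:
    """lambda(s1, s2) = T(r1, r2) + d(L1, L2), following the slide's definition."""
    r1, leaves1 = s1
    r2, leaves2 = s2
    t = 0 if r1 == r2 else 1
    # Multiset intersection of the leaf labels.
    common = sum((Counter(leaves1) & Counter(leaves2)).values())
    m = max(len(leaves1), len(leaves2)) - common
    d = abs(len(leaves1) - len(leaves2)) + m
    return t + d


# The example from the slide: s1 = abcc, s2 = dcc (root label first).
s1 = ("a", "bcc")
s2 = ("d", "cc")
distance = star_edit_distance(s1, s2)
```

For the slide's example the function returns 3, matching T = 1 and d = 2.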

Mapping distance: distance between two star representations
[Figure: bipartite matching between the star representations S(g1) and S(g2), padded with empty stars ε; the matched pairs have star edit distances 0, 1, 1, and 5]
The mapping distance: μ(g1, g2) = 0 + 1 + 1 + 5 = 7
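A minimal Python sketch of the mapping distance as a minimum-cost matching between two star multisets follows. It makes several assumptions: the shorter multiset is padded with empty stars ε (written `None`), ε is treated as a star with no leaves whose root matches nothing, and the matching is found by brute force over permutations, which is only workable for tiny inputs (a cubic-time assignment solver would replace it in practice). The toy stars are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import List, Optional, Sequence, Tuple

Star = Tuple[str, Sequence[str]]  # (root label, leaf labels)


def star_edit_distance(s1: Optional[Star], s2: Optional[Star]) -> int:
    """Star edit distance; None stands for the empty star (assumption)."""
    if s1 is None and s2 is None:
        return 0
    r1, l1 = s1 if s1 is not None else (None, "")
    r2, l2 = s2 if s2 is not None else (None, "")
    t = 0 if r1 == r2 else 1
    common = sum((Counter(l1) & Counter(l2)).values())
    return t + abs(len(l1) - len(l2)) + max(len(l1), len(l2)) - common


def mapping_distance(stars1: List[Star], stars2: List[Star]) -> int:
    """Minimum total star edit distance over all one-to-one matchings."""
    n = max(len(stars1), len(stars2))
    a = list(stars1) + [None] * (n - len(stars1))
    b = list(stars2) + [None] * (n - len(stars2))
    # Brute force for illustration only: n! matchings.
    return min(
        sum(star_edit_distance(x, y) for x, y in zip(a, perm))
        for perm in permutations(b)
    )


# Hypothetical toy star multisets.
S1 = [("a", "bc"), ("b", "a"), ("c", "a")]
S2 = [("a", "b"), ("b", "a")]
mu = mapping_distance(S1, S2)
```

Under the ε convention assumed here, a star with two leaves matched to ε costs 1 + 2·2 = 5, which agrees with the largest edge weight shown in the slide's figure.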

Observations on star decomposition
• The lower bound based on mapping distance is shown to be tight, and can be computed in cubic time
  • If we use the star structure as the filtering feature in a feature-based inverted index, then the derived tighter bounds can be applied for graph pruning
• Graphs that share more similar stars tend to be more similar to each other
  • If we have a method to access database graphs in increasing order of edit distance to a query graph, then a CA-based filtering framework can work well to reduce the scan ratio

CA-based filtering strategy
• Sorted lists for a query graph
  • Lists are sorted in increasing order of star edit distance
  • Stars of a database graph are accessed by increasing dissimilarity to stars of the query graph
[Figure: a query graph q with stars s1 and s2 and one sorted list per query star. Each entry is a graph identity with a star edit distance, for example "g4: λ(s, s2)", where s is a star in S(g4). One list starts g1: 0, g2: 1, g3: 2, ...; the other starts g4: 0, g1: 1, g2: 3, ...]

CA-based filtering strategy
• Use mapping distance as the aggregation value
• Threshold value: the sum of the distances at the current positions of the sorted lists, e.g. ω = sum(2, 3) = 5 > d (= 4)
• If λ(g6, q) ≥ ω, g6 can be safely filtered out
• When CA halts, all unseen graphs are sure to have mapping distances no smaller than the current threshold value. They can be safely filtered out, as mapping distance is a lower bound of graph edit distance.
[Figure: the sorted lists for query stars s1 and s2, split into seen entries (g1: 0, g2: 1, g3: 2 and g4: 0, g1: 1, g2: 3) and possibly unseen entries (g6: 5, g4: 6, g3: 9, ...)]
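The halting rule can be sketched in Python as follows. This is a simplified illustration rather than the full CA procedure: it only performs round-robin sorted access, assumes every list ranks every graph (so all lists have equal length), and uses list contents loosely modelled on the slide's figure. The function name and the exact tail of each list are assumptions.

```python
from typing import List, Set, Tuple

SortedList = List[Tuple[str, int]]  # (graph id, star edit distance), ascending


def ca_filter(sorted_lists: List[SortedList], tau: int) -> Tuple[Set[str], int]:
    """Scan the lists in parallel and stop once the threshold exceeds tau.

    Returns the graphs seen so far (the candidates) and the final threshold.
    """
    seen: Set[str] = set()
    omega = 0
    for row in zip(*sorted_lists):
        for gid, _ in row:
            seen.add(gid)
        # Threshold: sum of the distances at the current scan depth.
        omega = sum(dist for _, dist in row)
        if omega > tau:
            break  # every unseen graph has aggregate distance >= omega > tau
    return seen, omega


# Illustrative lists for two query stars (values modelled on the slide).
list_a = [("g1", 0), ("g2", 1), ("g3", 2), ("g6", 5), ("g4", 6)]
list_b = [("g4", 0), ("g1", 1), ("g2", 3), ("g6", 5), ("g3", 9)]
candidates, omega = ca_filter([list_a, list_b], tau=4)
```

With a threshold of 4 the scan stops at depth three with ω = 2 + 3 = 5, so g6 is never touched and only the seen graphs need further verification.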

Overview of the search framework
• Two-level index
  • Upper-level inverted index between graphs and stars
  • Lower-level inverted index between vertex labels and stars
[Figure: a graph database]

Overview of the search framework
• Two-level index
  • Upper-level inverted index between graphs and stars
  • Lower-level inverted index between vertex labels and stars
[Figure: star decomposition of graphs g1 and g2 from the graph database, and two upper-level posting lists for the stars s0 (abc) and s3 (cab); each entry holds a graph id (gid) and a frequency (freq)]

Overview of the search framework
• Two-level index
  • Upper-level inverted index between graphs and stars
  • Lower-level inverted index between vertex labels and stars
[Figure: two lower-level posting lists for the vertex labels a and b; each entry holds a star id (sid) and a frequency (freq)]

Overview of the search framework
• The graph similarity search
  • Construct graph score-sorted lists using the top-k stars returned from the lower level
  • Run the CA algorithm for graph pruning
• The top-k sub-unit query
  • Construct a score-sorted list
  • Return top-k stars using the TA method
• Two-level index
  • Upper-level inverted index between graphs and stars
  • Lower-level inverted index between vertex labels and stars
[Figure: a query graph q with stars s1 and s2, the graph database, and the graphs similar to the query]
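The two-level layout can be sketched with plain Python dictionaries. This is an assumption-laden illustration: stars are (root label, leaf labels) pairs with single-character labels, posting entries map an id to a frequency as the "gid, freq" and "sid, freq" columns in the figures suggest, and the lower level counts the root label together with the leaves, which the slides do not specify.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Star = Tuple[str, str]  # (root label, leaf labels); single-character labels assumed


def build_two_level_index(db: Dict[str, List[Star]]):
    """Build the upper (star -> graphs) and lower (label -> stars) indexes."""
    upper: Dict[Star, Dict[str, int]] = defaultdict(dict)
    lower: Dict[str, Dict[Star, int]] = defaultdict(dict)
    for gid, stars in db.items():
        for star, freq in Counter(stars).items():
            # Upper level: how often the star occurs in graph gid.
            upper[star][gid] = freq
            root, leaves = star
            # Lower level: how often each vertex label occurs in the star.
            for label, count in Counter(root + leaves).items():
                lower[label][star] = count
    return upper, lower


# Hypothetical database of two already-decomposed graphs.
db = {
    "g1": [("a", "bc"), ("b", "a"), ("c", "a")],
    "g2": [("a", "b"), ("b", "a")],
}
upper, lower = build_two_level_index(db)
```

A query would first use the lower level to find stars similar to each query star, then follow the upper level from those stars to the graphs that contain them.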

Experimental results: query performance
[Figures: response time (sec) and candidate size (K), plotted against edit distance threshold (0 to 20) and data size (5K to 40K)]

Sequence similarity search problem
• KNN query in a sequence database: given a sequence database and a query sequence, find the k database sequences most similar to the query under sequence edit distance
• Sequence edit distance: λ(s1, s2) = minimum number of operations (insertion, deletion, substitution) needed to transform sequence s1 into sequence s2
[Figure: a query sequence, a sequence database, and the similar sequences]
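For reference, the sequence edit distance and the naive full-scan KNN baseline can be sketched in Python as below. This is the standard dynamic-programming formulation, not the indexed method proposed in the talk; the function names and sample strings are illustrative.

```python
from typing import List


def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance with unit-cost insertion, deletion, substitution."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,                # delete c1
                            curr[j - 1] + 1,            # insert c2
                            prev[j - 1] + (c1 != c2)))  # substitute or match
        prev = curr
    return prev[-1]


def naive_knn(query: str, database: List[str], k: int) -> List[str]:
    """Full scan: compute the distance to every sequence, keep the k smallest."""
    return sorted(database, key=lambda s: edit_distance(query, s))[:k]


nearest = naive_knn("EmmaW", ["Emm Wo", "introduction", "Emma"], k=2)
```

Each distance computation costs O(|s1|·|s2|) time, which is why a full scan scales poorly on long strings and large databases.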

Existing works
• A naive solution: no pruning and poor scalability
  • Poor scalability on long strings and large databases
• Trie-based methods: high time and space cost
  • Poor scalability on large databases
• Gram-based methods: too many false positives
  • Loose SED bound
[Figure: illustrative pipelines over a database of 100,000 sequences with 50 answers. Naive: full scan, sequence edit distance (SED) computation on 100,000 candidates. Trie-based: index, then SED computation on the candidates. Gram-based: index with 70% scan and SED bound computation, leaving 90,000 candidates for SED computation.]

Our solution
• Extend the proposed indexing mechanism on graph search to support efficient string search
• Use tighter SED bounds to reduce candidate size
• Inverted index based on the n-gram decomposition
  • The sequence is decomposed into smaller n-grams
  • Similar n-grams are used to tightly bound SED
  • An efficient CA-based filtering strategy is proposed
[Figure: illustrative pipeline over a database of 100,000 sequences: index and SED bound computation, leaving 100 candidates for SED computation and 50 answers]

The n-gram based decomposition
• Approximate n-gram: an n-gram matched within a certain edit distance
[Figure: a sliding window extracting the 5-grams of the string "introduction"]
• Intuition: similar strings must share a sufficient number of common n-grams
• Intuition: similar strings must share a sufficient number of approximate n-grams
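The sliding-window decomposition is easy to sketch in Python; the function name is illustrative, and the input string is the one shown in the figure.

```python
from typing import List


def ngrams(s: str, n: int) -> List[str]:
    """Slide a window of width n over s; a string of length m yields m - n + 1 grams."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


grams = ngrams("introduction", 5)
```

For "introduction" (12 characters) this gives 12 − 5 + 1 = 8 overlapping 5-grams, from "intro" to "ction".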

Count filtering with approximate n-grams
Lemma. Consider two strings s1 and s2. If s1 and s2 are within an edit distance of τ, then s1 and s2 must share at least
  ɸ(s1, s2) = max{|s1|, |s2|} − n + 1 − η(τ, t, n)
n-grams with gram edit distance ≤ t, where
  η(τ, t, n) = max{1, n − 2×t} + (n − t)×(τ − 1)
• η(τ, t, n) is the maximum number of n-grams that τ edit operations can push to a gram edit distance > t.
• For example, η(τ, 0, n) = τ×n. This means that τ edit operations affect at most τ×n n-grams so that their gram edit distance is > 0.
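The two quantities in the lemma translate directly into Python. The sketch below assumes τ ≥ 1 and takes string lengths rather than strings; the function names and the sample parameters are illustrative.

```python
def eta(tau: int, t: int, n: int) -> int:
    """Maximum number of n-grams pushed beyond gram edit distance t by tau edits."""
    return max(1, n - 2 * t) + (n - t) * (tau - 1)


def phi(len1: int, len2: int, tau: int, t: int, n: int) -> int:
    """Minimum number of shared n-grams with gram edit distance <= t."""
    return max(len1, len2) - n + 1 - eta(tau, t, n)


# Illustrative parameters: two strings of length 20, 5-grams, tau = 2.
exact_bound = phi(20, 20, tau=2, t=0, n=5)    # exact n-grams (t = 0)
approx_bound = phi(20, 20, tau=2, t=1, n=5)   # approximate n-grams (t = 1)
```

With these parameters the required count rises from 6 at t = 0 to 9 at t = 1, which illustrates why approximate n-grams can give a stronger count filter.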

Mapping distance: distance between two n-gram representations
• s = "Emm Wo" and q = "EmmaW", with 3-gram representations G(s) = {Emm, mm_, m_W, _Wo} and G(q) = {Emm, mma, maW} ("_" marks a space)
• Matching: Emm ↔ Emm (0), mm_ ↔ mma (1), m_W ↔ maW (1), _Wo ↔ ε (3)
• The mapping distance: μ(s, q) = 0 + 1 + 1 + 3 = 5
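The slide's example can be reproduced with a small Python sketch that treats the gram mapping distance as a minimum-cost matching between the two n-gram lists. It assumes the shorter list is padded with empty grams ε (the empty string, so matching a gram to ε costs its length) and uses a brute-force search over permutations, which only suits tiny inputs. All function names are illustrative.

```python
from itertools import permutations
from typing import List


def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance with unit costs."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def ngrams(s: str, n: int) -> List[str]:
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def gram_mapping_distance(s: str, q: str, n: int) -> int:
    """Minimum total gram edit distance over all one-to-one gram matchings."""
    a, b = ngrams(s, n), ngrams(q, n)
    size = max(len(a), len(b))
    a += [""] * (size - len(a))  # pad with empty grams (epsilon)
    b += [""] * (size - len(b))
    return min(
        sum(edit_distance(x, y) for x, y in zip(a, perm))
        for perm in permutations(b)
    )


mu = gram_mapping_distance("Emm Wo", "EmmaW", 3)
```

For the two strings on the slide the sketch yields 5, the same value as the matching drawn in the figure.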

The lower bound on mapping distance
• Given two strings s1 and s2, the gram mapping distance μ(s1, s2) between s1 and s2 satisfies
  μ(s1, s2) ≤ (3n − 2) × λ(s1, s2)
• Equivalently, λ(s1, s2) ≥ μ(s1, s2) / (3n − 2), so the mapping distance yields a lower bound on the edit distance
• Given an edit distance threshold value of τ, if we use the summation of gram edit distances as the aggregation function in the CA method, then the threshold value can be computed as τ × (3n − 2)

CA-based filtering strategy
• Algorithm halts:
  • The aggregation value: 2+2+2+2+2+1+1+1+1 = 14
  • The maximum value in the heap: max{λs | s ∈ top-1} = 1
  • The CA halts: 14 > 1×(3n−2) = 13 (with n = 5)
• For any unseen sequence, its mapping distance is at least the aggregation value. This means its edit distance is larger than the maximum edit distance value in the top-k heap, so it can be safely filtered out.
[Figure: the sorted lists and the top-1 heap]
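The halting test on this slide reduces to one comparison, sketched below in Python. The function name is hypothetical, and n = 5 is inferred from the slide's arithmetic (3n − 2 = 13).

```python
def ca_should_halt(aggregation_value: int, kth_distance: int, n: int) -> bool:
    """Halt when the aggregated gram distance exceeds kth_distance * (3n - 2).

    kth_distance is the largest edit distance currently in the top-k heap.
    """
    return aggregation_value > kth_distance * (3 * n - 2)


# Values from the slide: nine lists whose current distances sum to 14.
aggregation = sum([2, 2, 2, 2, 2, 1, 1, 1, 1])
halt = ca_should_halt(aggregation, kth_distance=1, n=5)
```

Because μ ≤ (3n − 2)·λ, an unseen sequence with mapping distance of at least 14 must have an edit distance above 1, so it cannot enter the top-1 heap.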

Overview of the search framework
• Two-level index
  • Upper-level inverted index between sequences and n1-grams
  • Lower-level inverted index between n1-grams and n2-grams
[Figure: a sequence database]

Overview of the search framework
• Two-level index
  • Upper-level inverted index between sequences and n1-grams
  • Lower-level inverted index between n1-grams and n2-grams
[Figure: a sequence database decomposed into 5-grams]

Overview of the search framework
• Two-level index
  • Upper-level inverted index between sequences and n1-grams
  • Lower-level inverted index between n1-grams and n2-grams
[Figure: a sequence database decomposed into 5-grams, which are in turn decomposed into 2-grams]

Overview of the search framework
• The sequence KNN search
  • Construct score-sorted lists using the n1-grams from the lower level
  • Accumulate the count of similar n1-grams, and run the CA algorithm to check the halting condition
• The n1-gram similarity search
  • Update the value of t
  • Return n1-grams with distance ≤ t
• Two-level index
  • Upper-level inverted index between sequences and n1-grams
  • Lower-level inverted index between n1-grams and n2-grams
[Figure: a query sequence q with n-grams ng1 and ng2, the sequence database, and the sequences similar to the query]

Experimental results
• Dataset: 50K, average length = 350, max length = 5000
• 100 query sequences
[Figures: quality of count filtering; KNN query performance]

The 3-in-1 storage system
• Our proposed inverted index can support efficient graph and sequence similarity search
• Challenges on inverted index
  • Various complex structures => unified key and value data structure
  • Various query processing algorithms => general list processing method

Challenge 1: unified index structures
• Key: smaller substructures
  • graph (star), sequence (n-gram), tree (binary branch [1])
  • A key is a string
• Posting list: a list of values
  • value (the complex structure linked to the key)
  • A value is an integer document number
[Figure: graph data, sequence data, and tree data are decomposed into stars, n-grams, and binary branches, which feed a unified inverted index of terms with posting lists of document ids]
[1] Yang, R. et al. Similarity Evaluation on Tree-structured Data. In SIGMOD, 2005.

Challenge 1: unified index structures
• Key: smaller substructures
  • graph (star), sequence (n-gram), tree (binary branch [1])
  • A key is a string
• Posting list: a list of values
  • value (the complex structure linked to the key)
  • A value is an integer document number
[Figure: example index entries for 5-grams]
[1] Yang, R. et al. Similarity Evaluation on Tree-structured Data. In SIGMOD, 2005.
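The "string key, integer posting" idea can be sketched with an in-memory Python dictionary. This is not the Lucene-based implementation mentioned in the talk; the class, the key-serialisation helpers, and the "star:" and "gram:" prefixes are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


class UnifiedInvertedIndex:
    """Maps a string key (a serialised substructure) to integer document ids."""

    def __init__(self) -> None:
        self._postings: Dict[str, List[int]] = defaultdict(list)

    def add(self, doc_id: int, keys: Iterable[str]) -> None:
        # Each distinct key of the document contributes one posting.
        for key in set(keys):
            self._postings[key].append(doc_id)

    def postings(self, key: str) -> List[int]:
        return sorted(self._postings.get(key, []))


def star_key(root: str, leaves: str) -> str:
    """Serialise a star; sorting the leaves gives a canonical string."""
    return "star:" + root + "|" + "".join(sorted(leaves))


def gram_key(gram: str) -> str:
    return "gram:" + gram


index = UnifiedInvertedIndex()
index.add(1, [star_key("a", "cb"), star_key("b", "a")])   # a graph document
index.add(2, [star_key("a", "bc")])                       # another graph
index.add(3, [gram_key("intro"), gram_key("ntrod")])      # a sequence document
```

Because every substructure becomes a plain string term, graphs, sequences, and trees can share one index and one list-processing routine.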

Challenge 2: common list processing
• List scanning: CA-based method
  • Mapping distance as the aggregation value
  • Count filtering as the pruning technique
• Implementation on top of Lucene [1]
[Figure: layered architecture. Application layer: graph, sequence, and tree similarity search. List processing. Index layer: Lucene index.]
[1] http://lucene.apache.org/

Real application: social reading systems
• Book duplicate detection problem
  • Graph matching problem
• Book annotation migration problem
  • Sequence matching problem
• Annotation search by snapping problem
  • Sequence similarity search problem
• Build an information management tool to support the storage and retrieval of sequences and graphs

Real application: social reading systems
• An information management tool for Readpeer.com [1]
[1] http://readpeer.com/

Real application: social reading systems
• An information management tool for Readpeer.com [1]
  • Efficient annotation retrieval: about 650 ms
  • Book duplicate detection: about 1 sec
[1] http://readpeer.com/

Outline
• Motivation
  • Complex structures
  • Limitations on existing systems
• Our solutions
  • Inverted index for graph similarity search
  • Inverted index for sequence similarity search
  • A 3-in-1 inverted indexing tool and its applications
• Conclusions and future work

Conclusions
• Build a unified 3-in-1 index for various complex structures
  • Propose an inverted index to support efficient graph similarity search
  • Further extend the inverted index to support efficient sequence similarity search
  • Implement a unified inverted index to support various similarity search problems on complex structures
• The unified inverted index helps to build an effective information management tool in our real social reading system
