A Combination of Trie-trees and Inverted files for the Indexing of Set-valued Attributes

Manolis Terrovitis (NTUA), Spyros Passas (NTUA), Panos Vassiliadis (UoI), Timos Sellis (NTUA)

Problem • We are interested in low-cardinality set values • Retail store transaction logs • Web logs • Biomedical databases, etc. • We address the efficient evaluation of containment queries • In which transactions were products ‘a’ and ‘b’ sold together? • Which users visited only the main page or the download page of our site? • We propose the Hybrid Trie-Inverted file (HTI) index Terrovitis et al., CIKM '06

Outline • Problem definition • The HTI index • Query evaluation • Experiments • Conclusions

Data and queries • Find all transactions that contain ‘a’, ‘b’ and ‘d’ (subset) • Find all transactions that contain exactly ‘a’, ‘b’ and ‘d’ (equality) • Find all transactions that contain only items from ‘a’, ‘b’ and ‘d’ (superset)
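
The three query types can be stated as plain set predicates. The following is a minimal brute-force sketch in Python, assuming a toy database `transactions` (illustrative data, not taken from the talk); it scans every record and only pins down the semantics that an index has to reproduce.

```python
# Toy database: record id -> set of items (illustrative data, not from the talk).
transactions = {
    1: {"a", "b", "d"},
    2: {"a", "b", "c", "d"},
    3: {"a", "d"},
    4: {"b", "e"},
}

def subset_query(db, qs):
    """Records that contain every item of the query set qs."""
    return sorted(rid for rid, items in db.items() if qs <= items)

def equality_query(db, qs):
    """Records whose item set is exactly qs."""
    return sorted(rid for rid, items in db.items() if items == qs)

def superset_query(db, qs):
    """Records that contain only items drawn from qs."""
    return sorted(rid for rid, items in db.items() if items <= qs)

qs = {"a", "b", "d"}
print(subset_query(transactions, qs))    # [1, 2]
print(equality_query(transactions, qs))  # [1]
print(superset_query(transactions, qs))  # [1, 3]
```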

Data and queries • Traditional methods • Signature files • Inverted files • Differences from text databases: • Low cardinality • Large number of records in comparison with vocabulary size • New types of queries (equality, superset)

The HTI index: Background – The inverted file

HTI index: Inverted files - problems • The evaluation of containment queries relies on merge-joining the inverted lists • The inverted lists become very long • when the database is very large compared to the vocabulary • when the distribution of the items is skewed • This is often the case in the real world!
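
To make the cost concrete, here is a minimal sketch of a plain inverted file and a subset query evaluated by merge-joining sorted lists. The function names and the toy data are assumptions for illustration, not the authors' implementation. Every join walks the lists involved from the start, so long lists translate directly into more work.

```python
from collections import defaultdict

def build_inverted_file(db):
    """Map each item to the sorted list of record ids that contain it."""
    inverted = defaultdict(list)
    for rid in sorted(db):          # ascending ids keep every list sorted
        for item in db[rid]:
            inverted[item].append(rid)
    return dict(inverted)

def merge_join(list_a, list_b):
    """Intersect two sorted id lists in one linear pass."""
    i = j = 0
    out = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            out.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return out

def subset_query(inverted, qs):
    """Records containing all items of qs: intersect their inverted lists."""
    lists = sorted((inverted.get(item, []) for item in qs), key=len)
    if not lists:
        return []
    result = lists[0]               # start from the shortest list
    for other in lists[1:]:
        result = merge_join(result, other)
    return result

db = {1: {"a", "b", "d"}, 2: {"a", "b", "c", "d"}, 3: {"a", "d"}, 4: {"b", "e"}}
inverted = build_inverted_file(db)
print(inverted["a"])                          # [1, 2, 3]
print(subset_query(inverted, {"a", "b", "d"}))  # [1, 2]
```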

HTI index: Solution? • We need to break up the lists! • But how? • Let's make a list for every combination of items!

HTI index: Solution? • We assume a total order on the items of the database, based on their frequency of appearance • We order the items in each set-value and transform it into a sequence • We create a path in the access tree for each sequence
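
The three steps above can be sketched as follows. This is an illustration under stated assumptions: the slide only says the order is based on frequency, so placing the most frequent item first and breaking ties alphabetically are choices made here, and the `Node` class and helper names are hypothetical.

```python
from collections import Counter

def frequency_order(db):
    """Rank items by frequency (assumed: most frequent first, ties alphabetical)."""
    counts = Counter(item for items in db.values() for item in items)
    ranked = sorted(counts, key=lambda it: (-counts[it], it))
    return {item: rank for rank, item in enumerate(ranked)}

def to_sequence(items, order):
    """Turn a set-value into a sequence by sorting it on the total order."""
    return sorted(items, key=order.__getitem__)

class Node:
    """Hypothetical access-tree node: children by item, ids of records ending here."""
    def __init__(self):
        self.children = {}
        self.rids = []

def build_access_tree(db, order):
    """Insert one root-to-node path per record sequence (a plain trie)."""
    root = Node()
    for rid in sorted(db):
        node = root
        for item in to_sequence(db[rid], order):
            node = node.children.setdefault(item, Node())
        node.rids.append(rid)
    return root

db = {1: {"a", "b", "d"}, 2: {"a", "b", "c", "d"}, 3: {"a", "d"}, 4: {"b", "e"}}
order = frequency_order(db)
root = build_access_tree(db, order)
print(to_sequence(db[2], order))   # ['a', 'b', 'd', 'c']
print(sorted(root.children))       # ['a', 'b']
```

Records 1 and 2 share the path a, b, d, which is the point of ordering the items first: set-values with common frequent items share a prefix in the tree.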

HTI index: All combinations? Maybe not…

HTI index: An access tree for the frequent items

The HTI index

HTI index: The basic points • The access tree is used only for the most frequent items • The inverted lists are restructured so that each node of the access tree points to a different inverted sublist • We keep the access tree in main memory
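
A simplified sketch of this layout follows. It is not the authors' exact structure: the access tree is represented here as a dictionary keyed by the path of frequent items, each path holds the ids of the records whose ordered frequent items start with that path, and `num_frequent` (how many items count as frequent) is a hypothetical parameter. Non-frequent items keep ordinary inverted lists.

```python
from collections import Counter, defaultdict

def build_hti(db, num_frequent):
    """Build a toy hybrid index: access tree over frequent items plus inverted lists."""
    counts = Counter(item for items in db.values() for item in items)
    ranked = sorted(counts, key=lambda it: (-counts[it], it))
    rank = {item: r for r, item in enumerate(ranked)}
    frequent = set(ranked[:num_frequent])

    tree = defaultdict(list)      # path of frequent items -> inverted sublist
    inverted = defaultdict(list)  # non-frequent item -> ordinary inverted list
    for rid in sorted(db):
        seq = sorted(db[rid] & frequent, key=rank.__getitem__)
        # The record joins the sublist of every node on its path.
        for depth in range(1, len(seq) + 1):
            tree[tuple(seq[:depth])].append(rid)
        for item in db[rid] - frequent:
            inverted[item].append(rid)
    return dict(tree), dict(inverted), rank, frequent

db = {1: {"a", "b", "d"}, 2: {"a", "b", "c", "d"}, 3: {"a", "d"},
      4: {"b", "e"}, 5: {"a", "c"}}
tree, inverted, rank, frequent = build_hti(db, num_frequent=2)
print(tree[("a", "b")])  # [1, 2]
print(tree[("b",)])      # [4]
print(inverted["d"])     # [1, 2, 3]
```

In this toy run the single inverted list of ‘b’, which would be [1, 2, 4], is split into one sublist per tree node labelled ‘b’: [1, 2] under the path a, b and [4] under the path b.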

Query evaluation: Basic steps • Find the frequent items of the query set • Use the access tree to detect the sublists which might participate in the answer • Merge-join these sublists with the inverted lists of the non-frequent items
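
The steps above can be sketched for a subset query as follows. This is a simplified illustration, not the authors' algorithm: it rebuilds a small toy index so that it runs on its own, stores the access tree as a dictionary keyed by item path, and finds candidate nodes with a linear scan over the paths where a real access tree would be traversed. All names and the `num_frequent` parameter are hypothetical.

```python
import heapq
from collections import Counter, defaultdict

def build_hti(db, num_frequent):
    """Toy index: path of frequent items -> sublist, plus lists for the rest."""
    counts = Counter(item for items in db.values() for item in items)
    ranked = sorted(counts, key=lambda it: (-counts[it], it))
    rank = {item: r for r, item in enumerate(ranked)}
    frequent = set(ranked[:num_frequent])
    tree, inverted = defaultdict(list), defaultdict(list)
    for rid in sorted(db):
        seq = sorted(db[rid] & frequent, key=rank.__getitem__)
        for depth in range(1, len(seq) + 1):
            tree[tuple(seq[:depth])].append(rid)
        for item in db[rid] - frequent:
            inverted[item].append(rid)
    return dict(tree), dict(inverted), rank, frequent

def intersect(a, b):
    """Merge-join two sorted id lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def subset_query(index, qs):
    tree, inverted, rank, frequent = index
    # Step 1: frequent items of the query, in the index's total order.
    qf = sorted(qs & frequent, key=rank.__getitem__)
    lists = [inverted.get(item, []) for item in qs - frequent]
    if qf:
        # Step 2: nodes labelled with the last frequent query item whose
        # path covers all frequent query items; their sublists are disjoint.
        last = qf[-1]
        candidates = [sub for path, sub in tree.items()
                      if path[-1] == last and set(qf) <= set(path)]
        lists.append(list(heapq.merge(*candidates)))
    if not lists:
        return []
    # Step 3: merge-join with the lists of the non-frequent items.
    lists.sort(key=len)
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result

db = {1: {"a", "b", "d"}, 2: {"a", "b", "c", "d"}, 3: {"a", "d"},
      4: {"b", "e"}, 5: {"a", "c"}}
index = build_hti(db, num_frequent=2)
print(subset_query(index, {"b", "d"}))  # [1, 2]
print(subset_query(index, {"a", "b"}))  # [1, 2]
```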

Subset - (‘b’, ‘c’, ‘d’)

Equality - (‘b’, ‘c’, ‘d’)

Superset - (‘b’, ‘c’, ‘d’)

Experiments: Setup • Real data from UCI • Web log from microsoft.com [320k records, 294 items] • Web log from msnbc.com [1M records, 17 items] • Synthetic data • Zipfian distribution of order 1 • 100k-1M records • 1k-10k items • Queries with 2-22 items
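
For readers who want to reproduce a comparable synthetic workload, the following is a rough sketch of a Zipfian generator. The slide does not state how record lengths were chosen, so the uniform length between 1 and `max_len`, the seed, and all function names are assumptions made here.

```python
import random

def zipf_weights(vocab_size, order=1.0):
    """Weight of the item with rank r is proportional to 1 / r**order."""
    return [1.0 / (rank ** order) for rank in range(1, vocab_size + 1)]

def generate_records(num_records, vocab_size, max_len=10, order=1.0, seed=0):
    """Draw each record's items from a Zipfian distribution over the vocabulary.

    Record length is assumed uniform in [1, max_len]; duplicate draws collapse,
    so a record may end up shorter than the drawn length.
    """
    rng = random.Random(seed)
    items = list(range(vocab_size))
    weights = zipf_weights(vocab_size, order)
    db = {}
    for rid in range(num_records):
        length = rng.randint(1, max_len)
        db[rid] = set(rng.choices(items, weights=weights, k=length))
    return db

# Small run for illustration; the talk reports 100k-1M records and 1k-10k items.
db = generate_records(num_records=1000, vocab_size=100)
print(len(db))
```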

Experiments: Query performance – DB size

Experiments: Query performance – query length
