A Combination of Trie-trees and Inverted files for the Indexing of Set-valued Attributes

Presentation Transcript


  1. A Combination of Trie-trees and Inverted files for the Indexing of Set-valued Attributes Manolis Terrovitis (NTUA) Spyros Passas (NTUA) Panos Vassiliadis (UoI) Timos Sellis (NTUA)

  2. Problem • We are interested in low-cardinality set values • Retail store transaction logs • Web logs • Biomedical databases, etc. • We address the efficient evaluation of containment queries • In which transactions were products ‘a’ and ‘b’ sold together? • Which users visited only the main page or the download page of our site? • We propose the Hybrid Trie-Inverted file (HTI) index Terrovitis et al., CIKM '06

  3. Outline • Problem definition • The HTI index • Query evaluation • Experiments • Conclusions

  8. Data and queries • Find all transactions that contain ‘a’, ‘b’ and ‘d’ (subset) • Find all transactions that contain exactly ‘a’, ‘b’ and ‘d’ (equality) • Find all transactions that contain only items from ‘a’, ‘b’ and ‘d’ (superset)
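
The three query types above can be stated as plain set predicates. The following is a minimal sketch on a toy transaction table; the data and the function names are illustrative assumptions, not taken from the slides, and the linear scan is only a reference definition, not an index.

```python
# Toy transaction database: record id -> set of items (illustrative data only).
transactions = {
    1: {"a", "b", "d"},
    2: {"a", "b", "c", "d"},
    3: {"a", "d"},
    4: {"b"},
    5: {"a", "c"},
}

def subset_query(db, qs):
    """Records that contain every item of qs (qs is a subset of the record)."""
    return sorted(rid for rid, items in db.items() if qs <= items)

def equality_query(db, qs):
    """Records whose item set is exactly qs."""
    return sorted(rid for rid, items in db.items() if items == qs)

def superset_query(db, qs):
    """Records that contain only items from qs (qs is a superset of the record)."""
    return sorted(rid for rid, items in db.items() if items <= qs)

query = {"a", "b", "d"}
print(subset_query(transactions, query))    # [1, 2]
print(equality_query(transactions, query))  # [1]
print(superset_query(transactions, query))  # [1, 3, 4]
```

Every equality answer is also a subset answer and a superset answer, which is why the three queries can share one index.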

  9. Data and queries • Traditional methods • Signature files • Inverted files • Differences from text databases: • Low cardinality • Large number of records in comparison with vocabulary size • New types of queries (equality, superset)

  10. Outline • Problem definition • The HTI index • Query evaluation • Experiments • Conclusions

  11. The HTI index: Background – The inverted file

  12. HTI index: Inverted files - problems • The evaluation of containment queries relies on merge-joining the inverted lists • The inverted lists become very long • when the database is very large compared to the vocabulary • when the items’ distribution is skewed • This is often the case in the real world!
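
To make the cost concrete, here is a minimal in-memory sketch of a plain inverted file and of a subset query answered by merge-joining the lists. The function names and the toy data are assumptions made for illustration; a real inverted file would keep the lists on disk.

```python
from collections import defaultdict

def build_inverted_file(db):
    """Map each item to the sorted list of record ids that contain it."""
    inverted = defaultdict(list)
    for rid in sorted(db):          # visiting ids in order keeps every list sorted
        for item in db[rid]:
            inverted[item].append(rid)
    return inverted

def merge_join(list_a, list_b):
    """Intersect two sorted id lists with one linear pass over both."""
    i = j = 0
    out = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            out.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return out

def subset_query(inverted, qs):
    """Records containing all items of qs: intersect the items' inverted lists."""
    lists = sorted((inverted.get(item, []) for item in qs), key=len)
    if not lists:
        return []
    result = lists[0]               # start from the shortest list
    for other in lists[1:]:
        result = merge_join(result, other)
    return result

db = {1: {"a", "b", "d"}, 2: {"a", "b", "c", "d"}, 3: {"a", "d"}, 4: {"b"}, 5: {"a", "c"}}
inverted = build_inverted_file(db)
print(inverted["a"])                       # [1, 2, 3, 5]
print(subset_query(inverted, {"a", "b"}))  # [1, 2]
```

Each merge-join reads both lists in full, so the cost grows with list length; with few distinct items and many records, or with a skewed distribution, the lists of popular items cover a large share of the database.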

  13. HTI index: Solution? • We need to break up the lists! • But how? • Let's make a list for every combination of items!

  14. HTI index: Solution? • We assume a total order based on the frequency of appearance for the items of the database • We order the items in each set-value and we transform it to a sequence • We create a path in the access tree for each sequence
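
The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the tree is a nested dictionary, and ordering items from most to least frequent (ties broken alphabetically) is an assumption, since the slide only states that the order is based on frequency.

```python
from collections import Counter

def frequency_order(db):
    """Total order on items: position 0 is the most frequent item (assumed direction)."""
    counts = Counter(item for items in db.values() for item in items)
    ranked = sorted(counts, key=lambda it: (-counts[it], it))  # ties: alphabetical
    return {item: pos for pos, item in enumerate(ranked)}

def to_sequence(items, rank):
    """Turn a set-value into a sequence by sorting its items on the total order."""
    return sorted(items, key=rank.__getitem__)

def build_access_tree(db, rank):
    """Insert one root-to-node path per sequence; the tree is a nested dict."""
    root = {}
    for items in db.values():
        node = root
        for item in to_sequence(items, rank):
            node = node.setdefault(item, {})
    return root

db = {1: {"a", "b", "d"}, 2: {"a", "b", "c", "d"}, 3: {"a", "d"}, 4: {"b"}, 5: {"a", "c"}}
rank = frequency_order(db)
tree = build_access_tree(db, rank)
print(to_sequence(db[2], rank))  # ['a', 'b', 'd', 'c']
```

Because every set-value is sorted on the same order, set-values that share their most frequent items also share a path prefix, which keeps the tree compact.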

  15. HTI index: All combinations?

  19. HTI index: All combinations? Maybe not…

  20. HTI index: An access tree for the frequent items

  22. The HTI index

  26. HTI index: The basic points • The access tree is used only for the most frequent items • The inverted lists are restructured so that each node of the access tree points to a different inverted sublist • We keep the access tree in main memory

  27. Outline • Problem definition • The HTI index • Query evaluation • Experiments • Conclusions

  28. Query Evaluation: Basic Steps • Find the frequent items of the query set • Use the access tree to detect the sublists which might participate in the answer • Merge-join these sublists with the inverted lists of the non-frequent items
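
The three steps above can be illustrated for a subset query with a simplified, in-memory sketch. Everything specific here is an assumption made for illustration rather than the structure from the paper: the class and function names, the `num_frequent` cutoff, the most-frequent-first order, and the choice that a node's sublist holds the ids of all records whose ordered frequent items start with the node's path.

```python
from collections import Counter, defaultdict
import heapq

class Node:
    """Access-tree node: children keyed by item, plus the sublist it points to."""
    def __init__(self):
        self.children = {}
        self.sublist = []   # sorted ids of records whose frequent items start with this path

def build_index(db, num_frequent):
    counts = Counter(item for items in db.values() for item in items)
    ranked = sorted(counts, key=lambda it: (-counts[it], it))
    rank = {item: pos for pos, item in enumerate(ranked)}
    frequent = set(ranked[:num_frequent])      # only these items enter the tree
    root, inverted = Node(), defaultdict(list)
    for rid in sorted(db):
        node = root
        for item in sorted(db[rid] & frequent, key=rank.__getitem__):
            node = node.children.setdefault(item, Node())
            node.sublist.append(rid)
        for item in db[rid] - frequent:        # plain inverted lists for the rest
            inverted[item].append(rid)
    return root, inverted, rank, frequent

def find_sublists(node, wanted, rank, found):
    """Step 2: collect sublists of nodes whose root path contains all of `wanted`."""
    for item, child in node.children.items():
        if item == wanted[0]:
            if len(wanted) == 1:
                found.append(child.sublist)
            else:
                find_sublists(child, wanted[1:], rank, found)
        elif rank[item] < rank[wanted[0]]:
            find_sublists(child, wanted, rank, found)   # extra, more frequent item
        # otherwise wanted[0] cannot appear below this child, so the branch is pruned
    return found

def merge_join(a, b):
    """Intersect two sorted id lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def subset_query(index, qs):
    root, inverted, rank, frequent = index
    # Step 1: split the query into its frequent and non-frequent items.
    q_freq = sorted((i for i in qs if i in frequent), key=rank.__getitem__)
    lists = [inverted.get(i, []) for i in qs if i not in frequent]
    if q_freq:
        # Sublists come from distinct paths, so they are disjoint and can be merged.
        lists.append(list(heapq.merge(*find_sublists(root, q_freq, rank, []))))
    if not lists:
        return []
    # Step 3: merge-join the candidates with the non-frequent items' lists.
    lists.sort(key=len)
    result = lists[0]
    for other in lists[1:]:
        result = merge_join(result, other)
    return result

db = {1: {"a", "b", "d"}, 2: {"a", "b", "c", "d"}, 3: {"a", "d"}, 4: {"b"}, 5: {"a", "c"}}
index = build_index(db, num_frequent=2)   # 'a' and 'b' go into the access tree
print(subset_query(index, {"b", "d"}))    # [1, 2]
```

In this sketch the long lists of the frequent items are never read in full: the tree narrows them down to the sublists on matching paths before any merge-join takes place.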

  29. Subset - (‘b’, ‘c’, ‘d’)

  34. Equality - (‘b’, ‘c’, ‘d’)

  38. Superset - (‘b’, ‘c’, ‘d’)

  45. Outline • Problem definition • The HTI index • Query evaluation • Experiments • Conclusions

  46. Experiments: Setup • Real data from UCI • web log from microsoft.com [320k records, 294 items] • web log from msnbc.com [1M records, 17 items] • Synthetic data • Zipfian distribution of order 1 • 100k-1M records • 1k-10k items • Queries with 2-22 items
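
For readers who want to reproduce a comparable synthetic workload, the following is a rough sketch of drawing set-values from a Zipfian distribution of order 1, where the item at rank r is chosen with probability proportional to 1/r. The record-length distribution, the seed, and the function names are assumptions; the slides do not say how record lengths were chosen.

```python
import random

def zipf_weights(num_items, order=1.0):
    """Unnormalised Zipf weights: the item at rank r gets weight 1 / r**order."""
    return [1.0 / (r ** order) for r in range(1, num_items + 1)]

def generate_records(num_records, num_items, max_len=10, seed=0):
    """Draw set-valued records; item 0 is the most frequent item.

    max_len and the uniform choice of length are assumptions for this sketch.
    """
    rng = random.Random(seed)
    items = list(range(num_items))
    weights = zipf_weights(num_items)
    records = []
    for _ in range(num_records):
        length = rng.randint(1, max_len)
        # Sampling is with replacement, so the resulting set may be shorter than `length`.
        records.append(set(rng.choices(items, weights=weights, k=length)))
    return records

records = generate_records(num_records=1000, num_items=1000)
print(len(records))
```

The experiments on the slide use 100k-1M records over 1k-10k items; the small sizes here only keep the example quick to run.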

  47. Experiments: Query performance – DB size

  48. Experiments: Query performance – query length
