IO-Efficient Faceted Search Techniques for Optimizing Data Structures

IO-Efficient Faceted Search
Talk at Dagstuhl Seminar "Data Structures", February 20th, 2008
Holger Bast, Max-Planck-Institute for Informatics, Saarbrücken
Joint work with: Omid Amini, Hubert Chan, Andreas Karrenbauer

Faceted Search
• Data
  • Set of n objects (for example, scientific papers)
  • Each object has a number of labels; labels are organized into categories (the facets), for example year:1990, author:Kurt Mehlhorn, author:Robert Tarjan, venue:JACM
• Query
  • Given: set I ⊆ {1,…,n} of object ids (matching docs)
  • Compute: multi-set of labels of these objects (all their labels)
• Objective: space-efficient and IO-efficient
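As a minimal sketch of the query defined above (ignoring IO-efficiency entirely), the multi-set of labels can be computed with a counter. The toy collection and the name `facet_counts` are hypothetical and only serve as illustration.

```python
from collections import Counter

# Hypothetical toy collection: object id -> list of "facet:value" labels.
labels = {
    1: ["year:1990", "author:Kurt Mehlhorn", "venue:JACM"],
    2: ["year:1990", "author:Robert Tarjan", "venue:JACM"],
    3: ["year:1991", "author:Kurt Mehlhorn", "author:Robert Tarjan"],
}

def facet_counts(matching_ids, labels):
    """Return the multi-set of labels of the matching objects as a Counter."""
    counts = Counter()
    for obj_id in matching_ids:
        counts.update(labels[obj_id])
    return counts

# Query I = {1, 3}: collect all labels of objects 1 and 3.
counts = facet_counts({1, 3}, labels)
```

Each lookup `labels[obj_id]` is a potential non-local access, which is why the layout of the labels matters for the rest of the talk.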

IO-Efficiency
• RAM Model
  • count the number of operations (operation = arithmetic or access to a single memory cell)
  • important ingredient of time complexity analysis, but …
  • by itself completely inadequate for running time prediction on modern computers, no matter whether the data is in cache or in main memory (modern = since about 20 years)
• Orders of magnitude
  • disk: 100 disk seeks take about half a second; in that time one can read 200 MB of contiguous data (if stored compressed)
  • main memory: 100 non-local accesses ≈ reading a 10 KB data block

IO-Efficiency (continued)
• IO / External Memory Model
  • count the number of block accesses to the data (one block access = read / write B consecutive bytes)
  • ignore everything else
  • good predictor if computation is negligible
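To make the cost measure concrete, here is a small sketch that counts block accesses for a set of positions in one contiguous array. It assumes a block may start at any position (unaligned reads of B consecutive cells); the function name `count_block_reads` is hypothetical.

```python
def count_block_reads(positions, B):
    """Fewest reads of B consecutive cells that cover all given positions.

    Greedy left-to-right: start a new read at the first position that the
    previous read does not cover. Assumes reads may start anywhere.
    """
    reads = 0
    covered_up_to = None  # last position covered by the most recent read
    for p in sorted(set(positions)):
        if covered_up_to is None or p > covered_up_to:
            reads += 1
            covered_up_to = p + B - 1
    return reads

near = count_block_reads([0, 1, 2, 3], B=4)    # neighbours share one read
far = count_block_reads([0, 100, 200], B=4)    # scattered positions do not
```

In this model only the number of reads counts, so scattered positions cost one IO each no matter how little computation is done on them.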

Abstract Problem Formulation
• Precomputation:
  • given n elements a_1,…,a_n
  • organize in array of size N ≥ n
• Query:
  • given I = {i_1,…,i_m} ⊆ {1,…,n}
  • return elements a_{i_1},…,a_{i_m} using as few IOs as possible
• Extreme solutions:
  • space: n, #IOs: min{n / B, |I|} (optimal space)
  • space: B ∙ (n choose B), #IOs: |I| / B (optimal #IOs)
• Example (figure): n = 8, N = 24, B = 4, I = {1, 6, 8}: get a_1, a_6, a_8 with 1 IO
• How much space is needed for which IO-efficiency?
• Called an indexability problem in: Hellerstein et al., PODS'97 / JACM'02
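The trade-off can be tried out on tiny instances with a brute-force sketch. It assumes that one IO reads any B consecutive cells of the array; the 24-cell layout below is made up to match the example's parameters (n = 8, N = 24, B = 4) and is not the layout from the original figure.

```python
from itertools import combinations

def min_ios(layout, query, B):
    """Smallest number of length-B windows of `layout` that together contain
    every id in `query` (brute force, only for tiny inputs)."""
    query = set(query)
    windows = [set(layout[s:s + B]) for s in range(max(1, len(layout) - B + 1))]
    for k in range(1, len(query) + 1):
        for chosen in combinations(windows, k):
            if query <= set().union(*chosen):
                return k
    return None  # some queried element is not stored at all

# Optimal-space extreme: every element stored exactly once (N = n = 8).
plain = [1, 2, 3, 4, 5, 6, 7, 8]
# Hypothetical layout with replication (N = 24).
replicated = [1, 2, 3, 4, 5, 6, 7, 8,
              1, 6, 8, 2, 3, 5, 7, 4,
              1, 3, 5, 7, 2, 4, 6, 8]

ios_plain = min_ios(plain, {1, 6, 8}, B=4)
ios_replicated = min_ios(replicated, {1, 6, 8}, B=4)
```

With the plain layout the query needs two IOs; the replicated layout answers it with one, at three times the space.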

A first simple result
• Theorem:
  • if we want < |I| IOs for every query I
  • we need ≥ n² / (4∙B) space
• Proof:
  • construct graph G with n vertices and an edge {i, j} iff a_i and a_j can be read in one IO ⇒ m ≤ 2B ∙ N (more edges ⇒ more space)
  • every I = {i, j} can be read with < |I| = 2 IOs, that is, with one IO, hence edge {i, j} exists ⇒ m ≥ (n choose 2) ≈ n² / 2 (better IO-efficiency ⇒ more edges)
  • combining both bounds: 2B ∙ N ≥ n² / 2, hence N ≥ n² / (4∙B)
• Example (figure): n = 4, N = 8, B = 2, graph on a_1, a_2, a_3, a_4
• The short queries alone make the problem hard
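The graph from the proof can be built explicitly for the small example parameters n = 4, N = 8, B = 2. The sketch below assumes that one IO reads any B consecutive cells; the concrete layout is a hypothetical one chosen so that every pair is covered, not necessarily the one drawn on the slide.

```python
from itertools import combinations

def one_io_edges(layout, B):
    """Edges {i, j}, i != j, such that some window of B consecutive cells
    of `layout` holds both element i and element j."""
    edges = set()
    for start in range(max(1, len(layout) - B + 1)):
        window = set(layout[start:start + B])
        edges.update(frozenset(pair) for pair in combinations(sorted(window), 2))
    return edges

layout = [1, 2, 3, 4, 1, 3, 2, 4]  # hypothetical layout, n = 4, N = 8
B = 2
edges = one_io_edges(layout, B)
all_pairs = {frozenset(p) for p in combinations(range(1, 5), 2)}
```

Here `edges` equals `all_pairs`, so every two-element query is answered with one IO, and the edge count of 6 stays below the bound 2B ∙ N = 32.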

Restrict to large queries
• Theorem:
  • if we want < |I| IOs for all queries with |I| ≥ M
  • we need ≥ n² / (4∙B∙M) space
• Proof sketch:
  1. construct graph G as before ⇒ m ≤ 2B ∙ N (more edges ⇒ more space)
  2. consider arbitrary I with |I| ≥ M ⇒ I is not independent in G (otherwise |I| IOs are necessary) ⇒ the maximum independent set has size MIS < M (IO-inefficient query ⇔ independent set)
  3. Turán's theorem implies m ≥ (n choose 2) / MIS (no large independent sets ⇒ more edges)
• Example (figure): n = 4, N = 8, B = 2, graph on a_1, a_2, a_3, a_4
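Written out, the three steps of the proof sketch combine as follows. This is only a restatement of the bounds given on the slide, with (n choose 2) approximated by n² / 2 as in the previous proof.

```latex
\begin{align*}
m &\le 2B \cdot N
  && \text{(step 1)} \\
m &\ge \binom{n}{2} \Big/ \mathrm{MIS} \;>\; \binom{n}{2} \Big/ M \;\approx\; \frac{n^2}{2M}
  && \text{(steps 2 and 3, using } \mathrm{MIS} < M\text{)} \\
\Longrightarrow\quad N &\ge \frac{m}{2B} \;\gtrsim\; \frac{n^2}{4 \cdot B \cdot M}
\end{align*}
```

For M = 2 this gives n² / (8∙B), the same order as the bound of the first simple result.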

Turán numbers (extremal set theory)
• Definition: for n ≥ k ≥ r, T(n, k, r) = the minimal number of r-subsets of {1,…,n} such that every k-subset of {1,…,n} contains one of the r-subsets
  • for r = 2: minimal number of edges in an n-vertex graph in which all independent sets have size < k
• Turán's theorem:
  • lim_{n→∞} T(n, k, r) / (n choose r) exists
  • exact value of the limit unknown for r > 2
• Lower bound:
  • T(n, k, r) ≥ (r / k)^(r-1) ∙ (n choose r)
• Paul (Pál) Turán, *1910 in Budapest, †1976 in Budapest, Erdős number 1
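For r = 2 and very small n, the definition can be checked by exhaustive search. The sketch below is a plain brute force over edge sets (exponential time, illustration only); the function name `turan_number_r2` is hypothetical.

```python
from itertools import combinations

def turan_number_r2(n, k):
    """Brute-force T(n, k, 2): fewest edges on n vertices such that every
    k-subset of the vertices contains at least one edge."""
    vertices = range(n)
    all_edges = list(combinations(vertices, 2))
    k_subsets = [set(s) for s in combinations(vertices, k)]
    for size in range(len(all_edges) + 1):
        for edges in combinations(all_edges, size):
            # Every k-subset must contain both endpoints of some edge.
            if all(any(u in s and v in s for u, v in edges) for s in k_subsets):
                return size
    return None

t_4_3 = turan_number_r2(4, 3)  # two disjoint edges suffice
t_5_3 = turan_number_r2(5, 3)  # a triangle plus a disjoint edge
```

The optimal graphs found here are disjoint unions of k - 1 cliques of nearly equal size, which is the extremal structure given by Turán's theorem for graphs.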

Near-Optimal IO
• Theorem:
  • if we want ≤ c ∙ |I| / B IOs for all queries with |I| ≥ M
  • we need ≥ n^r / (4∙B∙M)^(r-1) space, where r = B/c
• Proof sketch:
  • construct hyper-graph G with n vertices and an edge {i_1,…,i_r} if the corresponding r elements can be read in one IO; as before: more edges ⇒ more space
  • as before: large queries IO-efficient ⇒ no large independent sets
  • as before: no large independent sets ⇒ many edges; this needs a version of Turán's theorem for hyper-graphs

Near-Optimal IO (continued)
• Consequence of the bound n^r / (4∙B∙M)^(r-1), where r = B/c: there is hope for M linear in n

Fixed set of linear-size queries
• Fixed set of queries ℐ = {I_1,…,I_ℓ}, with |ℐ| = Σ_{I ∈ ℐ} |I| (index size); assume each I_i is a random M-subset of {1,…,n}
• Goal: ≤ c ∙ |I| / B IOs and ≤ ε ∙ |ℐ| space
• Algorithm, special case:
  • pick a random pair i, j
  • if I_i and I_j have B elements in common, remove them from each of the sets and add them to the precomputed array
  • repeat until the total volume left is ≤ ε' ∙ |ℐ|
  • cover the remainder of each query separately
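The special-case procedure can be sketched as follows. This is a loose illustration under several assumptions that are not in the talk: the function and parameter names are hypothetical, the cap `max_tries` is added so that the loop always terminates, the first B common elements (in sorted order) are taken as the shared block, and the remainder of each query is simply appended to the array.

```python
import random

def greedy_pair_blocks(queries, B, eps_rest, max_tries=10_000, seed=0):
    """Share blocks of B common elements between random pairs of queries.

    Returns (array, remaining): the precomputed array and, per query, the
    elements that were not placed in a shared block.
    """
    rng = random.Random(seed)
    remaining = [set(q) for q in queries]
    total = sum(len(q) for q in remaining)   # index size
    array = []                               # one shared block of B ids per hit
    for _ in range(max_tries):               # assumption: cap on random picks
        if sum(len(q) for q in remaining) <= eps_rest * total:
            break
        i, j = rng.sample(range(len(remaining)), 2)
        common = sorted(remaining[i] & remaining[j])
        if len(common) >= B:
            block = common[:B]
            array.extend(block)
            remaining[i] -= set(block)
            remaining[j] -= set(block)
    # Cover the remainder of each query separately.
    for rest in remaining:
        array.extend(sorted(rest))
    return array, remaining

# Hypothetical overlapping queries (not random M-subsets, for reproducibility).
queries = [set(range(0, 60)), set(range(20, 80)), set(range(40, 100))]
total = sum(len(q) for q in queries)
array, remaining = greedy_pair_blocks(queries, B=10, eps_rest=0.5)
```

Every shared block stores B elements once instead of twice, so the array ends up shorter than the index size `total` while still containing every element of every query.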

Fixed set of linear-size queries (general case)
• Setting and goal as in the special case: queries ℐ = {I_1,…,I_ℓ}, each a random M-subset of {1,…,n}, index size |ℐ| = Σ_{I ∈ ℐ} |I|; goal ≤ c ∙ |I| / B IOs and ≤ ε ∙ |ℐ| space
• Algorithm, general case:
  • pick a random k-tuple i_1,…,i_k
  • if each of I_{i_1},…,I_{i_k} has B / c elements from a common block of size B, remove them from each of the sets and add the block to the precomputed array (non-trivial to check this!)
  • repeat until the total volume left is ≤ ε' ∙ |ℐ|
  • cover the remainder of each query separately

Main Theorem
• For queries I from ℐ = {I_1,…,I_ℓ}, each of size |I_i| = M and random:
  • #IOs: ≤ c ∙ |I| / B
  • space: ≈ log(n/M) / log(c∙n/B) ∙ |ℐ|
• For example:
  • c = 1, M = B ⇒ space: 100% ∙ |ℐ|
  • c = 1, M = n ⇒ space: 0% ∙ |ℐ|
  • c = 1, M = n/10, n/B = 1000 ⇒ space: 33% ∙ |ℐ|
  • c = 2, M = n/20, n/B = 1000 ⇒ space: 40% ∙ |ℐ|
• Proof:
  • well …
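The example percentages follow directly from the space formula. The short sketch below evaluates log(n/M) / log(c∙n/B) for the four parameter settings; the function name `space_fraction` is hypothetical, and only the ratios n/M and n/B are needed as inputs.

```python
import math

def space_fraction(c, n_over_M, n_over_B):
    """Space as a fraction of the index size: log(n/M) / log(c * n/B)."""
    return math.log(n_over_M) / math.log(c * n_over_B)

# The four examples, with n/B = 1000 throughout (an assumption for the
# first two, where the slide does not fix n/B).
f_small = space_fraction(1, 1000, 1000)   # M = B, so n/M = n/B
f_full = space_fraction(1, 1, 1000)       # M = n
f_tenth = space_fraction(1, 10, 1000)     # M = n/10
f_twentieth = space_fraction(2, 20, 1000) # c = 2, M = n/20
```

The last value evaluates to roughly 0.394, which matches the 40% quoted above after rounding.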
