Lecture 6: Indexing: GiST

Sept. 12, 2007

ChengXiang Zhai

Most slides are adapted from Kevin Chang’s lecture slides

Search Trees: Previous Approaches
  • Specialized search trees (yet another tree!):
    • redundant code: most trees are very similar
    • concurrency control, logging/recovery: tricky
  • Trees for extensible data types:
    • B-tree for any data with linear ordering
      • e.g.: index titles (alph. ordering) with B-tree
    • problem: does not support natural queries
      • e.g.: WHERE book.title has “database”?
GiST: Generalized Search Tree
  • General: covers B+-tree, R-tree, etc.
  • Extensible:
    • domain-specific data types & queries definable
  • Easy to extend: six “key methods” for a new tree
  • Efficient: can match specialized trees
  • Reusable: concurrency, recovery for indexes
Example: Indexing Book Titles
  • Titles of 4 books:
    • T1 = “database optimization”
    • T2 = “web database”
    • T3 = “complexity of optimization algorithms”
    • T4 = “algorithms and complexity”
  • Indexable with (extensible) B+-tree?
    • linear ordering: T4, T3, T1, T2
  • Note: Just an example for demonstrating GiST!
    • What we do here to index “titles” is neither the best nor the typical way to index “textual data”: there is no notion of fuzzy “relevance”.
    • stay tuned for text and web search
Queries on Titles
  • Indexing is to accommodate query processing
  • What “predicates” to ask about titles?
  • Equality predicates:
    • WHERE book.title = “web databases”
  • Containment predicates:
    • WHERE book.title has “web”
  • Prefix predicates:
    • WHERE book.title start-with “web”
  • RegEx predicates: (generalize all the others)
    • WHERE book.title like “# web # database”
Extensible B+-Tree for Titles
  • Observations:
    • indexed values have linear ordering: T4, T3, T1, T2
    • keys are simply separators: T4, c, T3, d, T1, w, T2

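[Figure: B+-tree over the four titles. Root separator: d. Second-level separators: c (left) and w (right). Leaves, left to right: T4: alg. …, T3: complexity …, T1: database …, T2: web …]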

Using B+-Tree: What’s Wrong?
  • What predicates can B+tree support well?
    • equality, containing, prefix, regex?

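[Figure: B+-tree over the four titles. Root separator: d. Second-level separators: c (left) and w (right). Leaves, left to right: T4: alg. …, T3: complexity …, T1: database …, T2: web …]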
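One way to probe the question on this slide is to treat the B+-tree's leaf level as a sorted list. The sketch below is only an illustration: a sorted Python list stands in for the tree, and the helper names `equals`, `starts_with`, and `has` are made up here. Equality and prefix predicates can exploit the linear ordering through binary search, while a containment predicate gets no help from the ordering and has to scan every title.

```python
from bisect import bisect_left

# Sorted order reproduces the linear ordering T4, T3, T1, T2.
titles = sorted([
    "database optimization",                  # T1
    "web database",                           # T2
    "complexity of optimization algorithms",  # T3
    "algorithms and complexity",              # T4
])


def equals(q):
    """Equality: one binary search on the ordering."""
    i = bisect_left(titles, q)
    return titles[i:i + 1] == [q]


def starts_with(prefix):
    """Prefix: binary search to the first candidate, then a short range scan."""
    i = bisect_left(titles, prefix)
    out = []
    while i < len(titles) and titles[i].startswith(prefix):
        out.append(titles[i])
        i += 1
    return out


def has(word):
    """Containment: the ordering does not help, so every title is inspected."""
    return [t for t in titles if word in t.split()]
```

With these four titles, `starts_with("web")` touches only the tail of the list, whereas `has("database")` must look at all four entries to find T1 and T2.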

GiST: Generalizing Balanced Search Trees
  • GiST is not universal (just a reasonable generalization)
    • balanced tree of <key, ptr> pairs, keys can overlap
  • GRE test: B-Tree : R-Tree = R-Tree : ________
    • What is the key generalization?

[Figure: GiST structure. Internal nodes (directory) hold entries key1, key2, …; leaf nodes form a linked list.]

The Key Generalization: The Key
  • Key evolution: 1-D separator --> 2-D MBR --> predicates
  • R-Tree : B-Tree
    • generalizing key from 1-D line to 2-D area
      • bounding range to (minimal) bounding region
  • GiST : R-Tree
    • generalizing key from 2-D MBR to “predicates”
      • a predicate that all values v in subtree will satisfy
    • B-tree keys:
      • [k1:k2) --> contains([k1:k2), v)
    • R-tree keys:
      • (x1,y1,x2,y2) --> contains((x1,y1, x2,y2), v)
    • RD-tree keys:
      • {x1,…,xk} --> subset({x1,…,xk}, v)
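The three key forms above can be read as ordinary Boolean functions of a stored value v. The following is a minimal sketch under stated assumptions: the function names are invented for illustration, R-tree values are simplified to points (x, y), and the RD-tree predicate is read as "v is a subset of the key set".

```python
def btree_key(k1, k2):
    """B-tree key [k1:k2): v falls in the half-open range."""
    return lambda v: k1 <= v < k2


def rtree_key(x1, y1, x2, y2):
    """R-tree key (x1,y1,x2,y2): point v = (x, y) lies in the bounding rectangle."""
    return lambda v: x1 <= v[0] <= x2 and y1 <= v[1] <= y2


def rdtree_key(xs):
    """RD-tree key {x1,...,xk}: the set v is covered by the bounding set."""
    bound = frozenset(xs)
    return lambda v: frozenset(v) <= bound


# Every value stored below a key must satisfy that key's predicate.
in_c_to_d = btree_key("c", "d")
in_box = rtree_key(0, 0, 2, 2)
in_set = rdtree_key({"alg", "comp", "opt"})
```

Seen this way, the tree code never needs to know what a key "is"; it only needs to evaluate predicates, which is the step GiST takes beyond the R-tree.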
GiST for Title Indexing: Predicates

Must first determine predicates:

  • What query predicates to support?
    • equality: equal(v, “web db”)
    • containing: has(v, “web”)
  • What key predicate to use?
    • Criteria for choosing key predicates?
    • What do you suggest?
  • Key predicates: Contains(S, v)

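[Figure: title GiST. Root entries: SL = {alg, comp, opt} and SR = {db, opt, web}. Under SL: SLL = {alg, comp} pointing to T4: alg. …, and SLR = {comp, opt} pointing to T3: complexity …. Under SR: SRL = {db, opt} pointing to T1: database …, and SRR = {db, web} pointing to T2: web …]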

GiST: Built-in Tree Operations
  • Search(root R, predicate q)
  • Insert(root R, entry E, level l)
  • Delete(root R, entry E)
GiST: Application-Specific Methods

Search:

  • Consistent(E, q): search subtree E for predicate q?

Labeling:

  • Union(E1, …, En): how to label the union of E1, …, En?

Categorization:

  • Penalty(E1, E2): penalty for inserting E2 in subtree E1
  • PickSplit(E1, …, En): how to split into two groups of entries

Compression: (storage/time tradeoff)

  • Compress(E): E --> Ec
  • Decompress(Ec): Ec --> E’ such that E.p implies E’.p
Search Operation: Consistent Method
  • Search(root R, predicate q):
    • traverse subtrees where Consistent true
    • return leaf entries that are consistent
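The traversal described above can be sketched in a few lines. This is an illustrative sketch, not the GiST reference code: the `Entry` and `Node` classes and the `has_word` predicate are assumptions made for the example, and the tree is the four-title GiST from the earlier slide with abbreviated words as keys.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Entry:
    key: Any                        # predicate labelling this entry
    child: Optional["Node"] = None  # subtree (internal entries only)
    value: Any = None               # data pointer (leaf entries only)


@dataclass
class Node:
    entries: List[Entry] = field(default_factory=list)
    is_leaf: bool = True


def search(node: Node, q: Any, consistent: Callable[[Entry, Any], bool]) -> list:
    """Follow every entry that Consistent accepts; collect matching leaf entries."""
    results = []
    for e in node.entries:
        if not consistent(e, q):
            continue  # the whole subtree is pruned
        if node.is_leaf:
            results.append(e.value)
        else:
            results.extend(search(e.child, q, consistent))
    return results


left = Node([Entry({"alg", "comp"}, value="T4"), Entry({"comp", "opt"}, value="T3")])
right = Node([Entry({"db", "opt"}, value="T1"), Entry({"db", "web"}, value="T2")])
root = Node([Entry({"alg", "comp", "opt"}, child=left),
             Entry({"db", "opt", "web"}, child=right)], is_leaf=False)


def has_word(entry, word):
    """Hypothetical Consistent for has(v, word) on word-set keys."""
    return word in entry.key


print(search(root, "web", has_word))  # ['T2']
print(search(root, "opt", has_word))  # ['T3', 'T1']
```

Because keys may overlap, a query such as `"opt"` has to descend into both subtrees, unlike a B+-tree lookup that follows a single path.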
Consistent Method
  • Consistent(E, q):
    • Can E.p and q both hold? If so, the subtree must be searched
    • Does E.p imply (not q)? Only then can the subtree be pruned
  • Title GiST:
    • key predicate: p = Contains(S, v) or simply S
      • e.g., SL = {alg, comp, opt}
      • e.g., SR = {db, opt, web}
    • Consistent(SL, has(v, “web”))?
      • how to implement?
    • Consistent(SR, equals(v, “web database”))?
      • how to implement?
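One possible answer to the two "how to implement?" questions, as a hedged sketch: it assumes titles are tokenized into the same abbreviated words used in the keys (for example "web database" becomes {web, db}), and the function names are invented for illustration.

```python
def consistent_has(S, word):
    """Consistent(S, has(v, word)): a title under S can contain `word` only if S does."""
    return word in S


def consistent_equals(S, title_words):
    """Consistent(S, equals(v, t)): every word of t must appear in S."""
    return set(title_words) <= set(S)


SL = {"alg", "comp", "opt"}
SR = {"db", "opt", "web"}

print(consistent_has(SL, "web"))              # False: prune the left subtree
print(consistent_equals(SR, {"web", "db"}))   # True: the right subtree must be searched
```

Both functions may answer true for a subtree that turns out to hold no match; that costs search time but never loses results, which is all Consistent has to guarantee.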
Insert Operation
  • Insert(root R, entry E, level l)
    • descend tree minimizing potential increase in Penalty
      • stop at level specified
    • if there is room at node, insert there
    • else split according to PickSplit
    • propagate changes using Union to adjust keys
    • Why do we need a “level parameter”?
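The descent step of Insert can be sketched as follows. This covers only the "descend, stop at the given level" part, not splitting or key adjustment. Nodes are plain dictionaries with a `level` field (leaves at level 0), and the `added_words` penalty is a stand-in chosen for the title example; all of these names are assumptions for illustration.

```python
def choose_subtree(root, new_key, level, penalty):
    """Walk down from the root, always following the entry with the lowest penalty,
    and stop at the node whose level equals `level`."""
    node = root
    while node["level"] > level:
        best = min(node["entries"], key=lambda e: penalty(e["key"], new_key))
        node = best["child"]
    return node


leaf_l = {"level": 0, "entries": [{"key": {"alg", "comp"}}, {"key": {"comp", "opt"}}]}
leaf_r = {"level": 0, "entries": [{"key": {"db", "opt"}}, {"key": {"db", "web"}}]}
root = {"level": 1, "entries": [
    {"key": {"alg", "comp", "opt"}, "child": leaf_l},
    {"key": {"db", "opt", "web"}, "child": leaf_r},
]}


def added_words(subtree_key, new_key):
    """Stand-in penalty: how many words the subtree key would have to gain."""
    return len(new_key - subtree_key)


t5 = {"comp", "web", "alg"}
target = choose_subtree(root, t5, 0, added_words)  # ends at leaf_l
```

Passing a level above 0 makes the same routine return an internal node, so an entry that carries a whole subtree can be placed at the height where it belongs.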
Title GiST: Insert
  • Where to insert T5:“complexity of web algorithms” ?

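[Figure: title GiST. Root entries: SL = {alg, comp, opt} and SR = {db, opt, web}. Under SL: SLL = {alg, comp} pointing to T4: alg. …, and SLR = {comp, opt} pointing to T3: complexity …. Under SR: SRL = {db, opt} pointing to T1: database …, and SRR = {db, web} pointing to T2: web …]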

Penalty Method
  • Penalty(E1, E2):
    • penalty for inserting E2 in subtree E1
  • Title GiST:
    • E2 with S ={comp,web, alg} (i.e., T5:“complexity of web algorithms”)
    • Where to insert?
      • root: SL = {alg, comp, opt} vs. SR = {db, opt, web}?
    • Penalty:
      • how to implement?
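One plausible Penalty for word-set keys, offered as a sketch rather than the lecture's answer: count how many new words the subtree's key would have to absorb. A smaller count means the key stays tighter and keeps more pruning power.

```python
def penalty(S1, S2):
    """Penalty(E1, E2) for word-set keys: number of words in S2 missing from S1."""
    return len(set(S2) - set(S1))


SL = {"alg", "comp", "opt"}
SR = {"db", "opt", "web"}
T5 = {"comp", "web", "alg"}   # "complexity of web algorithms"

print(penalty(SL, T5))  # 1  (only "web" is new)
print(penalty(SR, T5))  # 2  ("comp" and "alg" are new)
```

Under this measure T5 goes into the SL subtree, whose key would grow to {alg, comp, opt, web}.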
PickSplit Method
  • PickSplit(E1, …, En):
    • how to split into two groups of entries
  • Title GiST:
    • suppose we have 3 entries (after an Insert)
      • S1 = {alg, comp}
      • S2 = {comp, opt}
      • S3 = {comp, web, alg} (new)
    • how to split {S1, S2, S3} into two?
    • something similar to what R-tree algorithm will do
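A small R-tree-flavoured split for word-set keys might look like the sketch below. It is a simplification under stated assumptions: seeds are the two entries that differ in the most words, each remaining entry joins the group whose union it enlarges least, and no minimum fill is enforced as a production split would require. The function name and tie-breaking rule are choices made for this example.

```python
from itertools import combinations


def pick_split(keys):
    """Split word-set keys into two groups, R-tree style (seed pair, then greedy)."""
    keys = [frozenset(k) for k in keys]
    # Seeds: the pair of entries whose word sets differ the most.
    a, b = max(combinations(range(len(keys)), 2),
               key=lambda p: len(keys[p[0]] ^ keys[p[1]]))
    left, right = [keys[a]], [keys[b]]
    union_l, union_r = set(keys[a]), set(keys[b])
    for i, k in enumerate(keys):
        if i in (a, b):
            continue
        # Assign to the group whose bounding set grows by fewer words.
        if len(k - union_l) <= len(k - union_r):
            left.append(k)
            union_l |= k
        else:
            right.append(k)
            union_r |= k
    return left, right


S1 = {"alg", "comp"}
S2 = {"comp", "opt"}
S3 = {"comp", "web", "alg"}
left, right = pick_split([S1, S2, S3])
```

On the three entries above, S2 and S3 are chosen as seeds and S1 joins S3, since {alg, comp} adds nothing to {comp, web, alg}.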
Union Method
  • Union(E1, …, En):
    • Generates a label for the subtree with E1, …, En
  • Title GiST:
    • key predicate: p = Contains(S, v) or simply S
      • S1 = {alg, comp}, S2 = {comp, opt}
      • Combined key = ?
    • Union(E1=(SL, ptr1), E2=(SR, ptr2)) = ?
      • how to implement?
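For keys of the form Contains(S, v), a natural Union is plain set union, since the combined key must hold for every title under any of the entries. A minimal sketch, with the function name `union` chosen here for illustration:

```python
def union(*keys):
    """Union(E1, ..., En) for word-set keys: the smallest set covering all of them."""
    combined = set()
    for k in keys:
        combined |= set(k)
    return combined


S1 = {"alg", "comp"}
S2 = {"comp", "opt"}
SL = {"alg", "comp", "opt"}
SR = {"db", "opt", "web"}

print(union(S1, S2) == SL)     # True: S1 and S2 combine into the SL key
print(sorted(union(SL, SR)))   # ['alg', 'comp', 'db', 'opt', 'web']
```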
Compress/Decompress Method?

Key storage vs. search time tradeoff

  • Compress(E): E --> Ec
  • Decompress(Ec): Ec --> E’, where E’.p can be “looser” than E.p (less pruning power)
  • Lossy compression: may need more time for search
  • Title GiST:
    • Any suggestions?
Title GiST: Compress/Decompress
  • Example 1: no compression
    • Compress(E) --> Ec = E
    • Decompress(Ec) --> E’ = Ec
  • Example 2: compress by keeping a short prefix of each word
    • Compress: {algorithm, complexity, optimization} --> {al, co, op}
    • Decompress: {al, co, op} --> {al*, co*, op*}
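Example 2 can be sketched directly. The helper names below are assumptions for illustration, the prefix length of 2 follows the slide's example, and `fnmatchcase` from the standard library is used to test a word against the `al*`-style patterns.

```python
from fnmatch import fnmatchcase


def compress(words, n=2):
    """Compress: keep only the first n letters of each word in the key."""
    return {w[:n] for w in words}


def decompress(prefixes):
    """Decompress: each stored prefix p stands for the pattern p*."""
    return {p + "*" for p in prefixes}


def may_contain(patterns, word):
    """Looser key predicate: could a title under this key contain `word`?"""
    return any(fnmatchcase(word, pat) for pat in patterns)


key = {"algorithm", "complexity", "optimization"}
stored = compress(key)          # {'al', 'co', 'op'}
loose = decompress(stored)      # {'al*', 'co*', 'op*'}

print(may_contain(loose, "algorithm"))  # True, as with the original key
print(may_contain(loose, "alpha"))      # True, a false positive of the lossy key
print(may_contain(loose, "web"))        # False, still prunable
```

The false positive for "alpha" is the price of the smaller key: the original predicate still implies the decompressed one, so no results are lost, but some subtrees are searched needlessly.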

GiST: No Magic
  • It offers (only) what its model is based on
  • It does not represent all possible index structures:
    • e.g.: duplicate objects by multiple inserts (R+-tree)
    • e.g.: support notion of distance and similarity
      • rather than Boolean based predicates
    • any more?
Outlook: Indexability
  • Observation:
    • the simplest version of the Consistent method is a routine that always returns MAYBE, which gives you a search tree with no efficiency gain
  • Big questions:
    • what is an index machinery? (analog: Turing machine)
    • how do we characterize “workload”? (analog: languages)
    • can index always help in search? (analog: decidability, complexity)
      • what are the performance parameters? (analog: size of input)
      • what are the performance measures? (analog: time, space complexity)
  • Initial result: Hellerstein, Koutsoupias and Papadimitriou, “On the Analysis of Indexing Schemes”, PODS 97

  • An empirical question:
    • Can we learn an indexing strategy from the characteristics of the workload?
What You Should Know
  • What is GiST?
  • What are the six key methods?
  • How does GiST generalize other more specialized trees?
  • What are some limitations of GiST?
Carry Away Messages
  • Once again, generalize whenever it’s possible
    • 1-dimension indexing (B+-tree, interval-based) -> Multi-dimension indexing (R-tree, region-based) -> Arbitrary objects (GiST, predicate-based)
  • Avoid over-generalization
    • While “predicate” is quite general, it doesn’t guarantee pruning power
    • Where’s the notion of “bounding” in GiST?
  • Whenever you see “yet another X”, think about possibilities for a more general formulation of X