420 likes | 568 Views
Text Indexing. S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]. Why string data are interesting ?. They are ubiquitous: Digital libraries and product catalogues Electronic white and yellow pages Specialized information sources (e.g. Genomic or Patent dbs)
E N D
Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]
Why string data are interesting ? They are ubiquitous: • Digital libraries and product catalogues • Electronic white and yellow pages • Specialized information sources (e.g. Genomic or Patent dbs) • Web pages repositories • Private information dbs • ... String collections are growing at a staggering rate: • ...more than 10Tb of textual data in the web • ...more than 15Gb of base pairs in the genomic dbs
The need for an “index” Brute-force scanning is not a viable approach: • Fast single searches • Multiple simple searches for complex queries The index is a basic block of any IR system. An IR system also encompasses: • IR models • Ranking algorithms • Query languages and operations • ... We will concentrate only on “index design” !!
Types of data DNA sequences Audio-video files Executables Linguistic or tokenizable text Raw sequence of characters or bytes Exact word Word prefix or suffix Phrase Arbitrary substring Complex matches Word-based query Character-based query Types of query Two families of indexes Two indexing approaches : • Word-based indexes, here a concept of “word” must be devised ! • Inverted files, Signature files or Bitmaps. • Full-text indexes, no constraint on text and queries ! • Suffix Array, Suffix tree, Hybrid indexes, or String B-tree.
Vocabulary Postings Doc #1 Now is the time for all good men to come to the aid of their country Doc #2 It was a dark and stormy night in the country manor. The time was past midnight 2 Word-based: Inverted files (or lists) Query answering is a two-phase process: midnight AND time
Full-text indexes Their need is pervasive: • Raw data: DNA sequences, Audio-Video files, ... • Linguistic texts: data mining, statistics, ... • Vocabulary for Inverted Lists • Xpathqueries on XML documents • Intrusion detection, Anti-viruses, ... Classes of indexes: • Suffix array, Suffix tree (variants) • Multi-level indexes: Short Pat array • B-tree based data structures: Prefix B-tree, String B-tree
Terminology • An alphabet, denoted ∑ is a set of (ordered) characters. • A string S is an array of characters, S[1,n] = S[1] S[2] … S[n]. • S[i,j] = S[i] … S[j] is a substring of S. • S[1,j] is a prefix of S; S[i,n] is a suffix of S. • ∑* denotes all strings over alphabet ∑. • Lexicographic order: • Example: For ∑ = { a, b, c, …, z }, where a < b < c < … < z, the lexicographic order is the same as in a dictionary. SUF(T) = Sorted set of suffixes of T
Indexed string matching problem • Let T be a set of K strings in ∑*, where N is the total length of all strings in T. • String matching query on T: Given a pattern P find all occurrences of P in the strings in T. • Static problem: Store T in a data structure that supports string matching queries. Such a data structure is called a full-text index. • Dynamic version: Supports also insertions and deletions of strings in the full-text index.
P i T T[i,n] • T =This is a visual example • This is a visual example • This is a visual example 3,6,12 A simple but crucial observation Pattern P[1,p] occurs at position i of T[1,n] iff P[1,p] is a prefix of the suffix T[i,n] Occurrences of P in T = All suffixes of T having P as a prefix Can transform the string matching problem to a “prefix matching problem” over all the suffixes.
5 Q(N2) space T = mississippi# SA SUF(T) 12 11 8 5 2 1 10 9 7 4 6 3 # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# suffix pointer • Suffix Array • SA: array of ints, 4N bytes • Text T: N bytes • 5N bytes of space occupancy Suffix Array [Manber-Myers, 90] Suffix array: an array of pointers to all the suffixes in the text in their lexicographic order. T = mississippi#
T = mississippi# SA SUF(T) 12 11 8 5 2 1 10 9 7 4 6 3 # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# Two key properties [Manber-Myers, 90] Prop 1. All suffixes in SUF(T) having prefix P are contiguous. Prop 2. Starting position is the lexicographic one of P. T = mississippi# P=si
SA 12 11 8 5 2 1 10 9 7 4 6 3 P is larger 2 accesses for binary step si Searching in Suffix Array [Manber-Myers, 90] Indirected binary search on SA: O(p log2 N) time T = mississippi#
SA 12 11 8 5 2 1 10 9 7 4 6 3 P is smaller si Searching in Suffix Array [Manber-Myers, 90] Indirected binary search on SA: O(p log2 N) time T = mississippi#
SA 12 11 8 5 2 1 10 9 7 4 6 3 si • Suffix Array search • O (p (log2 N + occ)) time • O (log2 N + occ) in practice P is a prefix sippi occ=2 sissippi P is a prefix • External memory • Simple disk paging for SA • O ((p/B) (log2 N + occ)) I/Os issippi P is not a prefix + occ/B logB N Listing the occurrences [Manber-Myers, 90] Brute-force comparison: O(p x occ) time T = mississippi# 4 6 7 12 11 8 5 2 1 10 9 7 4 6 3 12 11 8 5 2 1 10 9 7 4 6 3
Lcp[1,n-1] stores the longest-common-prefix between suffixes adjacent in SA SA SUF(T) Lcp 12 11 8 5 2 1 10 9 7 4 6 3 12 11 8 5 2 1 10 9 7 4 6 3 12 11 8 5 2 1 10 9 7 4 6 3 # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# 0 0 1 4 0 0 1 0 2 1 3 • Suffix Array search • O ((p/B) log2 N + (occ/B)) I/Os • 9 N bytes of space P=si occ=2 Scan Lcp until Lcp[i] < p Output-sensitive retrieval T = mississippi# 4 6 7 base B : tricky !! 0 0 1 4 0 0 1 0 2 1 3 0 0 1 4 0 0 1 0 2 1 3 + : incremental search Compare against P
Min Lcp[i,q-1] < P’s > P’s P q Range Minima > P’s known induct. The cost: O (1) memory accesses Incremental search (case 1) Incremental search using the LCP array: no rescanning of pattern chars SA i j
Min Lcp[i,q-1] < P’s > P’s P q Range Minima known induct. The cost: O (1) memory accesses Incremental search (case 2) Incremental search using the LCP array: no rescanning of pattern chars SA i j
Min Lcp[i,q-1] < P’s > P’s • Suffix Array search • O(log2 N) binary steps • O(p) total char-cmp for routing • O((p/B) + log2 N + (occ/B)) I/Os L Suffix char>Pattern char P Range Minima Suffix char<Pattern char The cost: O(L) char cmp Incremental search (case 3) Incremental search using the LCP array: no rescanning of pattern chars SA i q j base B : more tricky (we will not consider this)
Summary: suffix array [Manber-Myers, 90] • Space: O(N) • String matching queries: O(p + log N + occ) • Can be constructed O(N log N) time. • Static.
P M SA Copy a prefix of marked suffixes • SA + Supra-index • O((p/B) + log2 (N/s) + (occ/B)) I/Os binary-search inside s Hybrid Index: Short Pat array Exploit internal memory: sample the suffix array and copy something in memory Disk • Parameter s depends on M and influences both performance and space !! (only a huristic)
s b e u u i a n d r l d $ $ l a $ k y $ $ $ Tries • Trie (name from the word “retrieval”): a data structure for storing a set of strings • Let’s assume that all strings end with “$” (not in S) Set of strings: {bear, bid, bulk, bull, sun, sunday}
Tries • Properties of a trie: • A multi-way tree. • Each node has from 1 to S+1 children. • Each edge of the tree is labeled with a character. • Each leaf node corresponds to the stored string, which is a concatenation of characters on a path from the root to this node.
Analysis of the Trie Given k strings of total length N: • Size: • O(N) in the worst-case • Search, insertion, and deletion (string of length m): • O(m) (assuming ∑ is constant) • Observation: • Having chains of one-child nodes is wasteful
s b e u u i a n d r l d $ $ l b a $ sun k y ear$ $ $ day$ ul id$ $ $ l$ k$ Compact Tries • Compact Trie: • Replace a chain of one-child nodes with an edge labeled with a string • Each non-leaf node (except root) has at least two children
1,1 20,22 2,5 b 27,30 11,12 sun 7,9 5,5 ear$ day$ ul 18,19 id$ $ 13,14 l$ k$ 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 T: bear$bid$bulk$bull$sun$sunday$ Compact Tries • Implementation: • Strings are external to the structure in one array, edges are labeled with indices in the array (from, to) • Improves the space to O(k)
b,1 1,1 s,3 20,22 e,4 2,5 d,4 u,2 27,30 11,12 i,3 7,9 _,1 5,5 l,2 18,19 k,2 13,14 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 T: bear bid bulk bull sun sunday Patricia Tries • Patricia trie: • a compact trie where each edge’s label (from, to) is replaced by (T[from], to – from + 1)
r a (a,1) (r,1) carrara$ (c,8) (r,1) r rara$ (r,5) (a,1) a $ ($,1) a$ 1 1 rara$ (a,2) (r,5) $ (r,3) 3 3 ra$ 7 7 ($,1) 5 5 4 2 2 4 6 6 carrara$ Suffix Trees [McCreight, 76] • Suffix tree – a compact trie (or similar structure) of all suffixes of the text • Patricia trie of suffixes is sometimes called a Pat tree 1 2 3 4 5 6 7 8
3 1 0 4 2 a b c b b O(p) time a a b b O(N) space a c (5,8) a b b b c c 2 4 1 3 5 7 6 8 4 2 W(p) I/Os W(occ) I/Os?? Packing ?! CPAT tree ~ 5N on average No (p/B), possibly no (occ/B), mainly static and space costly Search in suffix trees P = ba Search is a path traversal and O(occ) time a b c c b b b c c b What about ST in external memory ? • Unbalanced tree topology • Updates T = abababbc# 1 3 5 79 - Large space ~ 15N
Summary: compact trie K strings of total length N: • Space: O(K) • String matching queries: O(p + occ) time • Can be constructed O(N) time. • Update(S): O(|S|) time
Summary: suffix tree [McCreight, 76] For a string of length n: • Space: O(n) • String matching queries: O(p + occ) • Can be constructed O(n) time. • Static.
The String B-tree(An I/O-efficient full-text index !!)[Ferragina-Grossi, 95]
String B-tree[Ferragina-Grossi, 95] • Index unbounded length keys • Good worst-case I/O-bounds in search and update The prologue We are left with many open issues: • Suffix Array: updates • Hybrid: Heuristic tuning of the performance • Suffix tree: difficult packing and W(p) I/Os B-tree is ubiquitous in large-scale applications: • Atomic keys: integers, reals, ... • Prefix B-tree: bounded length keys ( 255 chars) Suffix trees + B-trees ?
Some considerations Strings have arbitrary length: • Disk page cannot ensure the storage of Q(B) strings • M may be unable to store even one single string String storage: • Pointers allow to fit Q(B) strings per disk page • String comparison needs disk access and may be expensive String pointers organization seen so far: • Suffix array: simple but static and not optimal • Patricia trie: sophisticated and much efficient (optimal ?) Recall the problem: T is a string collection • Search( P[1,p] ): retrieve all occurrences of P in T’s strings • Update( S[1,t] ): insert or delete a text S from T
+ B • Search(P) • O ((p/B) log2 N) I/Os • O (occ/B) I/Os It is dynamic !! O(s (s/B) log2 N) I/Os O((p/B) log2 B) I/Os 29 13 20 18 3 23 O(logB N) levels 29 2 26 13 20 25 6 18 3 14 21 23 29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23 Disk AATCAGCGAATGCTGCTT CTGTTGATGA 1 3 5 7 9 11 13 15 17 19 20 22 24 26 28 30 1º step: B-tree on string pointers P = AT
G A 0 4 6 3 5 6 G A 2 3 1 7 4 5 6 A G A A G A A G A A G G A G A C G C G C A G A G C G C A G G G C G C G G A G C G C G G G A 2º step: The Patricia trie (1; 1,3) (4; 1,4) (2; 1,2) (5; 5,6) (3; 4,4) (6; 5,6) (2; 6,6) (1; 6,6) (5; 7,7) (4; 7,7) (7; 7,8) (6; 7,7) Disk
G • Space PT • O(k) , not O(N) C G 6 4 6 5 3 0 • Search(P): • First phase: no string access 1 5 7 G G G A A max LCP with P 6 2 5 3 1 4 7 A G A A G A A G A A G G A G A C G C G C A G A G C G C A G G G C G C G G A G C G C G G G A mismatch P’s position 2º step: The Patricia trie A Two-phase search: P = GCACGCAC • Second phase: O(p/B) I/Os A C A Just one string is checked !! A G A G G Disk
+ • Search(P) • O((p/B) logB N) I/Os • O(occ/B) I/Os Insert(S) O( s (s/B) logB N) I/Os O(p/B) I/Os O(logB N) levels PT PT PT PT PT PT PT PT PT PT 29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 29 2 26 13 20 25 6 18 3 14 21 23 21 17 23 Disk AATCAGCGAATGCTGCTT CTGTTGATGA 1 3 5 7 9 11 13 15 17 19 20 22 24 26 28 30 3º step: B-tree + Patricia tree P = AT 29 13 20 18 3 23
Search(P) • O(logB N) I/Os just • to go to the leaf level PT Level i Max_lcp(i) P Level i+1 PT PT Leaf Level adjacent 4º step: Incremental Search First case
Search(P) • O(p/B + logB N) I/Os • O(occ/B) I/Os PT Level i Max_lcp(i) Inductive Step P PT Level i+1 i-th step: O(( lcp i+1 – lcp i)/B + 1) I/Os skip Max_lcp(i) Max_lcp(i+1) 4º step: Incremental Search Second case No rescanning
Summary: String B-tree String B-tree performance: [Ferragina-Grossi, 95] • Search(P) takes O(p/B + logB N + occ/B) I/Os • Update(S) takes O( s logB N ) I/Os • Space is Q(N/B) disk pages Using the String B-tree in internal memory: • Search(P) takes O(p + log2 N + occ) time • Update(S) takes O( s log2 N ) time • Space is Q(N) bytes • It is a sort of dynamic suffix array
Summary • Indexed string matching problem • Word-based • Full-text • Internal memory data structures • Suffix array • Suffix tree • External memory data structures • Patricia trie and Pat tree • Short Pat array • String B-tree