Text Indexing

Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Why are string data interesting?
They are ubiquitous:
• Digital libraries and product catalogues
• Electronic white and yellow pages
• Specialized information sources (e.g. genomic or patent databases)
• Web page repositories
• Private information databases
• ...
String collections are growing at a staggering rate:
• ... more than 10Tb of textual data on the web
• ... more than 15Gb of base pairs in the genomic databases

The need for an “index”
Brute-force scanning is not a viable approach when we need:
• Fast single searches
• Multiple simple searches for complex queries
The index is a basic building block of any IR system. An IR system also encompasses:
• IR models
• Ranking algorithms
• Query languages and operations
• ...
We will concentrate only on “index design”.

Types of data and types of query
Types of data:
• Linguistic or tokenizable text
• Raw sequences of characters or bytes (DNA sequences, audio-video files, executables)
Types of query:
• Word-based queries: exact word, word prefix or suffix, phrase
• Character-based queries: arbitrary substring, complex matches
Two families of indexes
• Word-based indexes, where a concept of “word” must be devised: inverted files, signature files, or bitmaps.
• Full-text indexes, with no constraint on text and queries: suffix array, suffix tree, hybrid indexes, or String B-tree.

Word-based: Inverted files (or lists)
• Doc #1: Now is the time for all good men to come to the aid of their country
• Doc #2: It was a dark and stormy night in the country manor. The time was past midnight
[Figure: the vocabulary of the two documents, each word with its postings list.]
Query answering is a two-phase process. Example query: midnight AND time
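A minimal sketch of the inverted file on the two example documents. The tokenizer (lowercase runs of letters) and the function names `build_inverted_index` and `query_and` are assumptions made for illustration. Reading the "two phases" as a vocabulary lookup followed by a postings intersection is one plausible interpretation of the slide, not something it spells out.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase word to the sorted ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

def query_and(index, *terms):
    """Phase 1: look up every term in the vocabulary.
    Phase 2: intersect the retrieved postings lists."""
    lists = [set(index.get(term.lower(), ())) for term in terms]
    return sorted(set.intersection(*lists)) if lists else []

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}
index = build_inverted_index(docs)
print(query_and(index, "midnight", "time"))  # [2]
print(index["country"])                      # [1, 2]
```

Only document 2 contains both query words, so the conjunctive query returns it alone.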

Full-text indexes
Their need is pervasive:
• Raw data: DNA sequences, audio-video files, ...
• Linguistic texts: data mining, statistics, ...
• Vocabulary for inverted lists
• XPath queries on XML documents
• Intrusion detection, anti-virus software, ...
Classes of indexes:
• Suffix array, suffix tree (and variants)
• Multi-level indexes: Short Pat array
• B-tree based data structures: Prefix B-tree, String B-tree

Terminology
• An alphabet, denoted ∑, is a set of (ordered) characters.
• A string S is an array of characters, S[1,n] = S[1] S[2] … S[n].
• S[i,j] = S[i] … S[j] is a substring of S.
• S[1,j] is a prefix of S; S[i,n] is a suffix of S.
• ∑* denotes the set of all strings over the alphabet ∑.
• Lexicographic order: for ∑ = { a, b, c, …, z }, where a < b < c < … < z, the lexicographic order is the same as in a dictionary.
• SUF(T) = the sorted set of suffixes of T.

Indexed string matching problem
• Let T be a set of K strings in ∑*, where N is the total length of all strings in T.
• String matching query on T: given a pattern P, find all occurrences of P in the strings in T.
• Static problem: store T in a data structure that supports string matching queries. Such a data structure is called a full-text index.
• Dynamic version: the full-text index also supports insertions and deletions of strings.

A simple but crucial observation
• Pattern P[1,p] occurs at position i of T[1,n] iff P[1,p] is a prefix of the suffix T[i,n].
• Occurrences of P in T = all suffixes of T having P as a prefix.
• Example: in T = "This is a visual example", the highlighted occurrences start at positions 3, 6, and 12.
• The string matching problem can therefore be transformed into a “prefix matching problem” over all the suffixes.
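The observation can be checked with a small sketch that finds occurrences in two ways: by testing every suffix for the pattern as a prefix, and by plain scanning. The pattern "is" is an assumption, chosen because its occurrences in the example text fall at positions 3, 6, and 12. The function names are hypothetical, and the pattern is assumed to be non-empty.

```python
def occurrences_by_suffix_prefix(text, pattern):
    """1-based positions i such that pattern is a prefix of the suffix text[i..n]."""
    return [i + 1 for i in range(len(text)) if text[i:].startswith(pattern)]

def occurrences_by_scanning(text, pattern):
    """1-based positions found by repeated str.find, overlaps allowed."""
    found = []
    start = text.find(pattern)
    while start != -1:
        found.append(start + 1)
        start = text.find(pattern, start + 1)
    return found

T = "This is a visual example"
print(occurrences_by_suffix_prefix(T, "is"))  # [3, 6, 12]
print(occurrences_by_scanning(T, "is"))       # [3, 6, 12]
```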

Suffix Array [Manber-Myers, 90]
Suffix array: an array of pointers to all the suffixes of the text, in their lexicographic order.
T = mississippi#

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

• Writing out SUF(T) explicitly takes Θ(N²) space; SA stores one suffix pointer per suffix instead.
• SA: array of ints, 4N bytes
• Text T: N bytes
• ⇒ 5N bytes of space occupancy
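A minimal sketch that reproduces the SA column for T = mississippi# by sorting the suffixes directly. This is not the Manber-Myers construction: direct sorting can take quadratic time or worse on repetitive texts and is shown only to make the definition concrete. The function name `suffix_array` is an assumption, and the sketch relies on `#` comparing smaller than every letter, which holds for ASCII.

```python
def suffix_array(text):
    """1-based starting positions of all suffixes, in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

T = "mississippi#"
SA = suffix_array(T)
print(SA)  # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
for pos in SA:
    print(pos, T[pos - 1:])
```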

Two key properties [Manber-Myers, 90]
T = mississippi#, P = si, SA = 12 11 8 5 2 1 10 9 7 4 6 3
Prop 1. All suffixes in SUF(T) having prefix P are contiguous.
Prop 2. The starting position of this contiguous range is the lexicographic position of P in SUF(T).
For P = si the range consists of sippi# and sissippi#.

Searching in Suffix Array [Manber-Myers, 90]
T = mississippi#, P = si, SA = 12 11 8 5 2 1 10 9 7 4 6 3
• Indirect binary search on SA: O(p log2 N) time.
• Each binary step costs 2 accesses: one to SA and one to the text T.
• At each step P is compared with the suffix pointed to by the middle entry. If P is larger, the search continues in the right half; if P is smaller, it continues in the left half.

Listing the occurrences [Manber-Myers, 90]
T = mississippi#, P = si, occ = 2, SA = 12 11 8 5 2 1 10 9 7 4 6 3
• Starting from the position found by the binary search, compare P against consecutive suffixes: sippi# at position 7 (P is a prefix), sissippi# at position 4 (P is a prefix), then the next suffix in SA order at position 6 (P is not a prefix, so stop).
• Brute-force comparison: O(p × occ) time.
• Suffix array search: O(p (log2 N + occ)) time; O(log2 N + occ) in practice.
• External memory: simple disk paging for SA gives O((p/B) (log2 N + occ)) I/Os.
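The search and the brute-force listing can be sketched as follows: a binary search locates the first suffix whose first p characters are not smaller than P, then a forward scan collects the suffixes that have P as a prefix. The names `suffix_array` and `search` are assumptions, and the suffix array is built by direct sorting only to keep the block self-contained.

```python
def suffix_array(text):
    """1-based suffix array by direct sorting (simple, not efficient)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def search(text, sa, pattern):
    """Occurrences of pattern (1-based, in SA order): O(p (log N + occ)) char comparisons."""
    p = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        start = sa[mid] - 1                    # access 1: the SA entry
        if text[start:start + p] < pattern:    # access 2: the text it points to
            lo = mid + 1                       # P is larger: go right
        else:
            hi = mid                           # P is smaller or a prefix: go left
    occ = []
    # Brute-force listing: one O(p) prefix check per reported suffix, plus one to stop.
    while lo < len(sa) and text.startswith(pattern, sa[lo] - 1):
        occ.append(sa[lo])
        lo += 1
    return occ

T = "mississippi#"
SA = suffix_array(T)
print(search(T, SA, "si"))   # [7, 4]
print(search(T, SA, "ssi"))  # [6, 3]
```

The occurrences come out in suffix-array order rather than text order; sort the result if text order is needed.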

Output-sensitive retrieval
T = mississippi#, P = si, occ = 2
Lcp[1,n-1] stores the length of the longest common prefix between suffixes adjacent in SA.

SA    SUF(T)          Lcp (with the next suffix)
12    #               0
11    i#              1
 8    ippi#           1
 5    issippi#        4
 2    ississippi#     0
 1    mississippi#    0
10    pi#             1
 9    ppi#            0
 7    sippi#          2
 4    sissippi#       1
 6    ssippi#         3
 3    ssissippi#      -

• Compare P against the first suffix of the range, then scan Lcp until Lcp[i] < p: every suffix passed on the way has P as a prefix.
• Suffix array search: O((p/B) log2 N + (occ/B)) I/Os
• 9N bytes of space (text, SA, and Lcp)
• Replacing log2 N with logB N (base B) is tricky.
• Turning the product (p/B) log2 N into the sum (p/B) + log2 N requires incremental search.
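A minimal sketch of output-sensitive retrieval, assuming a naively computed Lcp array (linear-time constructions exist but are not shown). The names `lcp_array`, `lower_bound`, and `occurrences` are hypothetical. After one comparison of P against the first candidate suffix, the remaining occurrences are reported by reading Lcp entries only. For T = mississippi# the suffixes i# and ippi# share the prefix i, so the second Lcp entry is 1.

```python
def suffix_array(text):
    """1-based suffix array by direct sorting (simple, not efficient)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def lcp_array(text, sa):
    """lcp[i] = longest common prefix length of the suffixes at sa[i] and sa[i + 1]."""
    def lcp(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n
    return [lcp(sa[i] - 1, sa[i + 1] - 1) for i in range(len(sa) - 1)]

def lower_bound(text, sa, pattern):
    """Index in sa of the first suffix whose first p chars are >= pattern."""
    p, lo, hi = len(pattern), 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid] - 1: sa[mid] - 1 + p] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo

def occurrences(text, sa, lcp, pattern):
    """One string comparison, then scan Lcp until Lcp[i] < p."""
    p = len(pattern)
    i = lower_bound(text, sa, pattern)
    if i == len(sa) or not text.startswith(pattern, sa[i] - 1):
        return []
    occ = [sa[i]]
    while i < len(lcp) and lcp[i] >= p:   # next suffix shares at least p chars
        occ.append(sa[i + 1])
        i += 1
    return occ

T = "mississippi#"
SA = suffix_array(T)
LCP = lcp_array(T, SA)
print(LCP)                             # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
print(occurrences(T, SA, LCP, "si"))   # [7, 4]
```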

Incremental search using the LCP array: no rescanning of pattern chars
At each binary step on the SA interval [i, j] with middle position q, the number of characters that P shares with an end of the interval is known inductively, and a range-minimum query Min Lcp[i,q-1] gives the longest common prefix between the suffixes at positions i and q.
• Case 1 and case 2: the range minimum differs from the known match length, so the direction of the search is decided without comparing characters. The cost: O(1) memory accesses.
• Case 3: the comparison between P and the suffix at q has to continue for L further characters until a mismatch is found (suffix char > pattern char, or suffix char < pattern char). The cost: O(L) char comparisons.
Suffix array search:
• O(log2 N) binary steps
• O(p) total char comparisons for routing
• ⇒ O((p/B) + log2 N + (occ/B)) I/Os
• Base B is more tricky (we will not consider this).
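The sketch below shows a deliberately simpler relative of this idea, not the range-minima scheme of the slides. It remembers how many characters of P match the suffixes at the two ends of the current interval (`l` and `r`, names assumed here) and starts each comparison after the first min(l, r) characters, which every suffix in the interval is known to share with P. This avoids rescanning part of the pattern but does not by itself guarantee the O(p + log2 N) bound.

```python
def suffix_array(text):
    """1-based suffix array by direct sorting (simple, not efficient)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def extend_match(text, start, pattern, k):
    """pattern[:k] is known to equal text[start:start + k]; extend the match."""
    while k < len(pattern) and start + k < len(text) and text[start + k] == pattern[k]:
        k += 1
    return k

def lower_bound_skip(text, sa, pattern):
    """Index in sa of the first suffix >= pattern, skipping chars known to match."""
    p = len(pattern)
    lo, hi = -1, len(sa)   # sentinels: suffix[lo] < pattern <= suffix[hi]
    l = r = 0              # chars of pattern matched at the two ends
    while hi - lo > 1:
        mid = (lo + hi) // 2
        start = sa[mid] - 1
        k = extend_match(text, start, pattern, min(l, r))
        if k == p:                                   # pattern is a prefix of the suffix
            hi, r = mid, k
        elif start + k == len(text) or text[start + k] < pattern[k]:
            lo, l = mid, k                           # suffix is smaller than pattern
        else:
            hi, r = mid, k                           # suffix is larger than pattern
    return hi

T = "mississippi#"
SA = suffix_array(T)
print(lower_bound_skip(T, SA, "si"))   # 8, the 0-based SA index of sippi#
```

Skipping min(l, r) characters is safe because the suffixes at both ends share at least that many characters with P, and therefore so does every suffix sorted between them.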

Summary: suffix array [Manber-Myers, 90]
• Space: O(N)
• String matching queries: O(p + log N + occ) time
• Can be constructed in O(N log N) time.
• Static.

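Hybrid Index: Short Pat array
Exploit internal memory: sample the suffix array and copy something into memory.
• SA on disk + supra-index in internal memory (of size M): one suffix out of every s is marked, and a prefix of each marked suffix is copied into memory.
• Search P in the supra-index first, then binary-search inside a block of s suffixes on disk.
• O((p/B) + log2 (N/s) + (occ/B)) I/Os
• The parameter s depends on M and influences both performance and space (only a heuristic).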
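A minimal in-memory sketch of the sampling idea. The supra-index keeps the first `ell` characters of every s-th suffix in SA order; both `s` and `ell` are illustrative parameters, and the function names are assumptions. A search first uses only the samples to bracket the part of SA that can contain the answer, then binary-searches that part (the step that would touch the disk), and finally lists occurrences by scanning.

```python
from bisect import bisect_left, bisect_right

def suffix_array(text):
    """1-based suffix array by direct sorting (simple, not efficient)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def build_supra_index(text, sa, s, ell):
    """First ell chars of every s-th suffix in SA order (kept in internal memory)."""
    return [text[sa[k] - 1: sa[k] - 1 + ell] for k in range(0, len(sa), s)]

def search(text, sa, samples, s, ell, pattern):
    """Occurrences of pattern (1-based, SA order) using the samples first."""
    q = pattern[:ell]
    a = bisect_left(samples, q)      # samples before a are strictly smaller than P
    b = bisect_right(samples, q)     # samples from b on are not smaller than P
    lo = (a - 1) * s + 1 if a > 0 else 0
    hi = min(b * s, len(sa))
    p = len(pattern)
    while lo < hi:                   # binary search restricted to the bracket
        mid = (lo + hi) // 2
        start = sa[mid] - 1
        if text[start:start + p] < pattern:
            lo = mid + 1
        else:
            hi = mid
    occ = []
    while lo < len(sa) and text.startswith(pattern, sa[lo] - 1):
        occ.append(sa[lo])
        lo += 1
    return occ

T = "mississippi#"
SA = suffix_array(T)
samples = build_supra_index(T, SA, s=3, ell=2)
print(samples)                              # ['#', 'is', 'pi', 'si']
print(search(T, SA, samples, 3, 2, "si"))   # [7, 4]
```

When several consecutive samples share the same `ell`-character prefix, the bracket spans more than one block, which is one reason the choice of parameters is only a heuristic.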

Tries
• Trie (name from the word “retrieval”): a data structure for storing a set of strings.
• Let’s assume that all strings end with “$” (a character not in ∑).
Set of strings: {bear, bid, bulk, bull, sun, sunday}
[Figure: trie for the set above; each root-to-leaf path spells one string followed by “$”.]

Tries
Properties of a trie:
• A multi-way tree.
• Each internal node has from 1 to |∑|+1 children.
• Each edge of the tree is labeled with a character.
• Each leaf node corresponds to a stored string, which is the concatenation of the characters on the path from the root to this node.

Analysis of the Trie
Given k strings of total length N:
• Size: O(N) in the worst case.
• Search, insertion, and deletion (string of length m): O(m) time (assuming ∑ is constant).
• Observation: having chains of one-child nodes is wasteful.
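A minimal trie sketch for the example set, assuming dictionary-based child maps and "$" as the terminator. The class and method names are hypothetical, and deletion is left out to keep the block short. Insertion and lookup both walk one edge per character, which is the O(m) bound above.

```python
class Trie:
    """Trie over strings terminated by END; each node is a dict char -> child."""
    END = "$"

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word + self.END:          # one edge per character: O(m)
            node = node.setdefault(ch, {})

    def contains(self, word):
        node = self.root
        for ch in word + self.END:
            if ch not in node:
                return False
            node = node[ch]
        return True

    def node_count(self):
        stack, count = [self.root], 0
        while stack:
            node = stack.pop()
            count += 1
            stack.extend(node.values())
        return count

trie = Trie()
for w in ["bear", "bid", "bulk", "bull", "sun", "sunday"]:
    trie.insert(w)
print(trie.contains("sun"), trie.contains("su"))  # True False
print(trie.node_count())                          # 23 (root plus 22 edges)
```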

Compact Tries
• Compact trie: replace each chain of one-child nodes with a single edge labeled with a string.
• Each non-leaf node (except the root) has at least two children.
[Figure: compact trie for {bear, bid, bulk, bull, sun, sunday}, with edge labels b, ear$, id$, ul, k$, l$, sun, $, day$.]
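The compaction step can be sketched by building an ordinary trie as nested dictionaries and then merging every chain of one-child nodes into one string-labeled edge. The helper names `build_trie` and `compact` are assumptions for illustration.

```python
def build_trie(words, end="$"):
    """Plain trie as nested dicts: char -> child."""
    root = {}
    for word in words:
        node = root
        for ch in word + end:
            node = node.setdefault(ch, {})
    return root

def compact(node):
    """Merge each chain of one-child nodes into a single string-labeled edge."""
    out = {}
    for label, child in node.items():
        while len(child) == 1:               # child is the start of a chain
            (ch, grandchild), = child.items()
            label += ch
            child = grandchild
        out[label] = compact(child)
    return out

words = ["bear", "bid", "bulk", "bull", "sun", "sunday"]
ctrie = compact(build_trie(words))
print(ctrie)
# {'b': {'ear$': {}, 'id$': {}, 'ul': {'k$': {}, 'l$': {}}}, 'sun': {'$': {}, 'day$': {}}}
```

The root is never merged into its children, which matches the exception for the root in the property above.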

Compact Tries
T: bear$bid$bulk$bull$sun$sunday$ (positions 1 to 30)
Implementation:
• Strings are external to the structure, stored in one array; edges are labeled with index pairs (from, to) into the array.
• Improves the space to O(k).
[Figure: the compact trie with edge labels as index pairs: b = (1,1), ear$ = (2,5), id$ = (7,9), ul = (11,12), k$ = (13,14), l$ = (18,19), sun = (20,22), $ = (5,5), day$ = (27,30).]

Patricia Tries
T: bear$bid$bulk$bull$sun$sunday$ (positions 1 to 30)
• Patricia trie: a compact trie where each edge label (from, to) is replaced by (T[from], to - from + 1), i.e. the first character of the label and its length.
[Figure: the compact trie with Patricia labels: (1,1) → (b,1), (20,22) → (s,3), (2,5) → (e,4), (27,30) → (d,4), (11,12) → (u,2), (7,9) → (i,3), (5,5) → ($,1), (18,19) → (l,2), (13,14) → (k,2).]
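A small sketch of the two edge encodings, assuming the "$"-terminated text T and 1-based inclusive index pairs as in the figure. The helper name `patricia_label` is hypothetical.

```python
T = "bear$bid$bulk$bull$sun$sunday$"

def patricia_label(text, frm, to):
    """Replace the pair (from, to) by (first character, label length)."""
    return text[frm - 1], to - frm + 1

edges = [(1, 1), (2, 5), (7, 9), (11, 12), (13, 14), (18, 19), (20, 22), (5, 5), (27, 30)]
for frm, to in edges:
    # index pair, the substring it denotes, and its Patricia label
    print((frm, to), T[frm - 1:to], patricia_label(T, frm, to))
```

The Patricia label drops everything but the first character, so a search that follows these labels has to verify its result against the text afterwards.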

Suffix Trees [McCreight, 76]
• Suffix tree: a compact trie (or similar structure) of all suffixes of the text.
• A Patricia trie of suffixes is sometimes called a Pat tree.
[Figure: suffix tree and Pat tree for the text carrara$ (positions 1 to 8). Leaves store the starting positions of the suffixes; edge labels such as carrara$, rara$, ra$, a, a$, r, $ become (c,8), (r,5), (r,3), (a,1), (a,2), (r,1), ($,1) in the Pat tree.]

Search in suffix trees
T = abababbc#, P = ba
• Search is a path traversal: O(p) time to locate P, then O(occ) time to report the occurrences.
• O(N) space, but large in practice: ~15N bytes.
What about the suffix tree in external memory?
• Unbalanced tree topology
• Updates
• Ω(p) I/Os for the traversal and possibly Ω(occ) I/Os for reporting
• Packing the tree into disk pages is difficult
• CPAT tree: ~5N on average
⇒ No (p/B), possibly no (occ/B), mainly static and space costly.
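The path-traversal search can be illustrated on an uncompacted suffix trie. This is a simplification, not the suffix tree of the slide: it takes Θ(N²) space in the worst case and is built naively rather than with McCreight's algorithm, and its reporting step walks the whole subtree rather than O(occ) nodes. The function names are assumptions, and the key `None` is used here to store a suffix's starting position at its leaf.

```python
def build_suffix_trie(text):
    """Nested dicts char -> child; each leaf stores its suffix start under key None."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node[None] = i + 1                 # 1-based starting position
    return root

def search(root, pattern):
    """Follow pattern from the root, then collect every leaf below that node."""
    node = root
    for ch in pattern:                     # O(p) path traversal
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:                           # report the leaves of the subtree
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

root = build_suffix_trie("abababbc#")
print(search(root, "ba"))   # [2, 4]
print(search(root, "ab"))   # [1, 3, 5]
```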

Summary: compact trie
For K strings of total length N:
• Space: O(K) for the trie (the strings are stored externally)
• String matching queries: O(p + occ) time
• Can be constructed in O(N) time.
• Update(S): O(|S|) time

Summary: suffix tree [McCreight, 76]
For a string of length n:
• Space: O(n)
• String matching queries: O(p + occ) time
• Can be constructed in O(n) time.
• Static.

The String B-tree (an I/O-efficient full-text index) [Ferragina-Grossi, 95]

The prologue
We are left with many open issues:
• Suffix array: updates
• Hybrid: heuristic tuning of the performance
• Suffix tree: difficult packing and Ω(p) I/Os
The B-tree is ubiquitous in large-scale applications:
• Atomic keys: integers, reals, ...
• Prefix B-tree: bounded length keys (≤ 255 chars)
Suffix trees + B-trees?
String B-tree [Ferragina-Grossi, 95]
• Indexes unbounded length keys
• Good worst-case I/O bounds in search and update

Some considerations
Strings have arbitrary length:
• A disk page cannot be guaranteed to store Θ(B) strings
• M may be unable to store even a single string
String storage:
• Pointers allow fitting Θ(B) strings per disk page
• String comparison needs disk access and may be expensive
String pointer organizations seen so far:
• Suffix array: simple but static and not optimal
• Patricia trie: sophisticated and much more efficient (optimal?)
Recall the problem: T is a string collection
• Search( P[1,p] ): retrieve all occurrences of P in T’s strings
• Update( S[1,t] ): insert or delete a text S from T

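1st step: B-tree on string pointers
P = AT
Disk: AATCAGCGAATGCTGCTT CTGTTGATGA (positions 1 to 30)
[Figure: B-tree with O(logB N) levels whose keys are pointers to the suffixes in lexicographic order. Leaf level: 29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23; middle level: 29 2 26 13 20 25 6 18 3 14 21 23; root: 29 13 20 18 3 23.]
• Each node is searched by binary search in O((p/B) log2 B) I/Os.
• Search(P): O((p/B) log2 N) I/Os to locate P, plus O(occ/B) I/Os to report the occurrences.
• It is dynamic: Insert(S) takes O(s (s/B) log2 N) I/Os.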
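The search bound follows from multiplying the number of levels by the cost of a binary search inside one node. A short derivation, assuming each node holds Θ(B) string pointers so that a node-level binary search takes O(log2 B) string comparisons of O(p/B) I/Os each:

```latex
\[
\underbrace{O(\log_B N)}_{\text{levels}}
\cdot
\underbrace{O\!\left(\tfrac{p}{B}\,\log_2 B\right)}_{\text{I/Os per node}}
= O\!\left(\tfrac{p}{B}\cdot\frac{\log_2 N}{\log_2 B}\cdot\log_2 B\right)
= O\!\left(\tfrac{p}{B}\,\log_2 N\right)
\]
```

The base-B logarithm gained from the tree height is cancelled by the binary search inside each node, which is why the per-node search is the part that needs improving.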

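2nd step: The Patricia trie
[Figure: Patricia trie built on the 7 strings pointed to by one node and stored on disk: AGAAGA, AGAAGG, AGAC, GCGCAGA, GCGCAGG, GCGCGGA, GCGCGGGA. Internal nodes store their string depth (0 at the root, then 3 and 5 in the A subtree, 4, 6 and 6 in the G subtree), edges keep only their branching character, and the leaves point to strings 1 to 7. The full edge labels are given as (string; from, to) triples: (1; 1,3) (4; 1,4) (2; 1,2) (5; 5,6) (3; 4,4) (6; 5,6) (2; 6,6) (1; 6,6) (5; 7,7) (4; 7,7) (7; 7,8) (6; 7,7).]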

2nd step: The Patricia trie
Two-phase search: P = GCACGCAC
• First phase: no string access. Traverse the trie comparing only the branching characters with P, reaching a leaf whose string has the maximum LCP with P.
• Second phase: O(p/B) I/Os. Compare P with that string to find the mismatch position, which determines P’s position among the strings.
• Just one string is checked.
• Space of the Patricia trie: O(k), not O(N).
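The two-phase search can be sketched as follows, under stated assumptions: the strings are sorted, distinct, and none is a prefix of another; each node keeps its string depth, the range of sorted strings below it, and its children by branching character. All class and function names are hypothetical. Only one string, the one at the leaf reached in the first phase, is compared with P.

```python
def lcp(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

class Node:
    def __init__(self, lo, hi, depth, children):
        self.lo, self.hi = lo, hi    # strings[lo:hi] lie below this node
        self.depth = depth           # chars shared by all strings below it
        self.children = children     # branching char -> child ({} for a leaf)

def build(strings, lo, hi):
    if hi - lo == 1:
        return Node(lo, hi, len(strings[lo]), {})
    depth = lcp(strings[lo], strings[hi - 1])
    children, start = {}, lo
    for i in range(lo + 1, hi + 1):
        if i == hi or strings[i][depth] != strings[start][depth]:
            children[strings[start][depth]] = build(strings, start, i)
            start = i
    return Node(lo, hi, depth, children)

def blind_search(strings, root, pattern):
    """Return (lo, hi): strings[lo:hi] have pattern as a prefix;
    if lo == hi, pattern would be inserted at index lo."""
    node = root
    while node.children:             # phase 1: branching chars only, no string access
        ch = pattern[node.depth] if node.depth < len(pattern) else None
        node = node.children.get(ch) or next(iter(node.children.values()))
    candidate = strings[node.lo]     # the single string that is read
    match = lcp(pattern, candidate)
    node = root
    while node.children and node.depth < match:   # phase 2: walk to the mismatch
        node = node.children[pattern[node.depth]]
    if match == len(pattern):
        return node.lo, node.hi
    if node.children and node.depth == match:     # mismatch at a branching node
        for ch, child in node.children.items():
            if ch > pattern[match]:
                return child.lo, child.lo
        return node.hi, node.hi
    if match == len(candidate) or candidate[match] < pattern[match]:
        return node.hi, node.hi                   # mismatch inside an edge
    return node.lo, node.lo

strings = ["AGAAGA", "AGAAGG", "AGAC", "GCGCAGA", "GCGCAGG", "GCGCGGA", "GCGCGGGA"]
root = build(strings, 0, len(strings))
print(blind_search(strings, root, "GCACGCAC"))   # (3, 3): just before GCGCAGA
print(blind_search(strings, root, "GCGC"))       # (3, 7): four strings start with GCGC
```

For P = GCACGCAC the first phase ends at GCGCGGA, the comparison finds a mismatch after 2 characters, and the second phase places P between AGAC and GCGCAGA.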

3rd step: B-tree + Patricia trie
P = AT
Disk: AATCAGCGAATGCTGCTT CTGTTGATGA (positions 1 to 30)
[Figure: the B-tree on string pointers of the 1st step, with O(logB N) levels, where every node also stores a Patricia trie (PT) built on its string pointers.]
• Each node is searched in O(p/B) I/Os.
• Search(P): O((p/B) logB N) I/Os to locate P, plus O(occ/B) I/Os to report the occurrences.
• Insert(S): O(s (s/B) logB N) I/Os.

4th step: Incremental search
• First case: O(logB N) I/Os just to go down to the leaf level.
• Second case (inductive step): Max_lcp(i) characters of P are already known to match at level i, so the search in the Patricia trie at level i+1 skips them (no rescanning). The i-th step costs O((lcp_{i+1} - lcp_i)/B + 1) I/Os, where lcp_i = Max_lcp(i).
• Search(P): O(p/B + logB N) I/Os to locate P, plus O(occ/B) I/Os to report the occurrences.
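The search bound is a telescoping sum of the per-level costs. A short derivation, writing h = O(log_B N) for the number of levels and assuming lcp_0 = 0 and lcp_h ≤ p (the symbols h and lcp_0 are introduced here for the derivation):

```latex
\[
\sum_{i=0}^{h-1}\left(\frac{\mathrm{lcp}_{i+1}-\mathrm{lcp}_i}{B}+1\right)
= \frac{\mathrm{lcp}_h-\mathrm{lcp}_0}{B}+h
\le \frac{p}{B}+O(\log_B N)
\]
```

Each character of P is therefore charged to one level only, which is what removes the multiplicative (p/B) logB N term of the previous step.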

Summary: String B-tree
String B-tree performance [Ferragina-Grossi, 95]:
• Search(P) takes O(p/B + logB N + occ/B) I/Os
• Update(S) takes O(s logB N) I/Os
• Space is Θ(N/B) disk pages
Using the String B-tree in internal memory:
• Search(P) takes O(p + log2 N + occ) time
• Update(S) takes O(s log2 N) time
• Space is Θ(N) bytes
• It is a sort of dynamic suffix array

Summary
• Indexed string matching problem: word-based and full-text indexes
• Internal memory data structures: suffix array, suffix tree
• External memory data structures: Patricia trie and Pat tree, Short Pat array, String B-tree
