Space-Efficient Algorithms for Document Retrieval

Veli Mäkinen, University of Helsinki
Joint work with Niko Välimäki

Introduction
• Field: Information Retrieval. Problem: Document Retrieval. Solution: Inverted Index.
• Field: Combinatorial Pattern Matching. Problem: Text Indexing. Solution: Suffix tree.
[Diagram annotations: "practice: space limits"; "theory: time limits"; citations [PST06], [Mut02], [Sad07 & this paper]]

Space-Efficient Document Retrieval

Text Indexing
• Let T = t_1 t_2 ... t_n be a text string over an ordered alphabet Σ.
• The Text Indexing problem is to build an index structure for T that supports the following operations on a given pattern P = p_1 p_2 ... p_m:
  • Count(P): How many times does P occur in T?
  • List(P): List the occurrence positions of P in T.
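As a rough illustration of the Count(P) and List(P) operations, here is a minimal sketch that uses a plain, uncompressed suffix array. The function names (`build_suffix_array`, `sa_interval`, `count`, `list_occurrences`) are made up for this example, positions are 0-based (the slides use 1-based positions), and the naive construction is only suitable for toy inputs.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # This is quadratic or worse; real indexes use dedicated algorithms.
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text, sa, pattern):
    # The length-m prefixes of the sorted suffixes are themselves sorted,
    # so the suffixes starting with the pattern form one contiguous interval.
    m = len(pattern)
    prefixes = [text[i:i + m] for i in sa]  # materialized for clarity only
    return bisect_left(prefixes, pattern), bisect_right(prefixes, pattern)


def count(text, sa, pattern):
    lo, hi = sa_interval(text, sa, pattern)
    return hi - lo


def list_occurrences(text, sa, pattern):
    lo, hi = sa_interval(text, sa, pattern)
    return sorted(sa[lo:hi])


text = "to be or not to be"
sa = build_suffix_array(text)
print(count(text, sa, "to"))             # 2
print(list_occurrences(text, sa, "be"))  # [3, 16]
```

Count only needs the interval boundaries, while List has to read one suffix array entry per occurrence.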

Document Retrieval
• Let D = {T_1, T_2, ..., T_k} be a set of text documents of total length n.
• The Document Retrieval problem is to build an index for D that supports the following operation on a given pattern P = p_1 p_2 ... p_m:
  • Find(P): List the documents that contain P (in the order of relevance, ...).

Inverted Index & Document Retrieval

d1 (Hamlet):
To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep; ...

d2 (Merchant of Venice):
PORTIA: If to do were as easy as to know what were good to do, chapels had been churches and poor men's cottages princes' palaces. It is a good divine that follows his own instructions: I can easier teach twenty what were good to be done, than be one of the twenty to follow mine own teaching.

Creating inverted file over Shakespeare's plays:
...
be: (d1,4) (d1,18) ... (d2,74) (d2,139) ...
...
to: (d1,1) (d1,15) ... (d2,136) ...
...

Find("to be") = Remove duplicates((Find("to")+3) ∩ Find("be")) = d1 (Hamlet), d2 (Merchant of Venice), ...
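The phrase query on this slide can be sketched as follows. This is only an illustrative toy: the tokenizer (a lowercase regular expression), the function names `build_inverted_index` and `find_phrase`, and the short sample documents are assumptions made for the example. Postings store 1-based character positions, so the offset 3 is the length of "to" plus one space, as in the formula above.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    # Maps each word to a list of (document number, 1-based start position).
    index = defaultdict(list)
    for d, text in enumerate(docs, start=1):
        for match in re.finditer(r"[a-z']+", text.lower()):
            index[match.group()].append((d, match.start() + 1))
    return index


def find_phrase(index, first, second, offset):
    # (Find(first) + offset) ∩ Find(second), then keep one entry per document.
    shifted = {(d, pos + offset) for d, pos in index.get(first, [])}
    hits = shifted & set(index.get(second, []))
    return sorted({d for d, _ in hits})


docs = [
    "To be, or not to be: that is the question",
    "what were good to be done, than be one",
    "The slings and arrows of outrageous fortune",
]
index = build_inverted_index(docs)
print(index["to"][:2])                     # [(1, 1), (1, 15)]
print(find_phrase(index, "to", "be", 3))   # [1, 2]
```

The work is proportional to the lengths of the two posting lists, not to the number of matching documents.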

Suffix Array & Document Retrieval (1/2)
• Build generalized suffix array of D:

[Figure: suffix array positions 1 2 .... 6853491 6853492 6853493 6853494 ... shown with the two example texts]

To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep;

PORTIA: If to do were as easy as to know what were good to do, chapels had been churches and poor men's cottages princes' palaces. It is a good divine that follows his own instructions: I can easier teach twenty what were good to be done, than be one of the twenty to follow mine own teaching.

Suffix Array & Document Retrieval (2/2)
• Build generalized suffix array of D.
• Locate the interval containing all occurrences of pattern P.
• Remove duplicates.

[Figure: suffix array positions 1 2 .... 6853491 6853492 6853493 6853494 ..., shown once on their own and once with the pattern "to be"; result after removing duplicates: d1 (Hamlet), d2 (Merchant of Venice), ...]
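The three steps above can be sketched with a naive generalized suffix array. The names `build_generalized_sa` and `find_documents` are hypothetical, the construction is far too slow for a real collection, and each entry is stored as an explicit (document number, offset) pair for readability.

```python
from bisect import bisect_left, bisect_right


def build_generalized_sa(docs):
    # One entry per suffix of every document, sorted by the suffix text.
    entries = [(d, i)
               for d, text in enumerate(docs, start=1)
               for i in range(len(text))]
    entries.sort(key=lambda e: docs[e[0] - 1][e[1]:])
    return entries


def find_documents(docs, gsa, pattern):
    # Step 2: locate the interval of suffixes that start with the pattern.
    m = len(pattern)
    prefixes = [docs[d - 1][i:i + m] for d, i in gsa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    # Step 3: remove duplicates. This touches every occurrence,
    # not just one entry per matching document.
    return sorted({d for d, _ in gsa[lo:hi]})


docs = [
    "to be, or not to be: that is the question",
    "what were good to be done, than be one",
    "the slings and arrows of outrageous fortune",
]
gsa = build_generalized_sa(docs)
print(find_documents(docs, gsa, "to be"))  # [1, 2]
```

The duplicate-removal step costs time proportional to the number of occurrences, which can be much larger than the number of matching documents.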

Muthukrishnan's improvement

[Figure: doc and prev arrays aligned with the suffix array; the interval for "to be" is marked]
position: 1 2 .... 6853491 6853492 6853493 6853494 ...
doc: 6 4 .... 2 1 1 3
prev: -1 -1 ... 6853434 6853372 6853492 6853420 ...
[Annotations: repeated "min" queries over prev; "min>6853490"]
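The idea behind the figure can be sketched as follows, under some assumptions: prev[i] holds the previous position with the same document number (or -1 if there is none), and a document is reported exactly at the position in the query interval whose prev value falls before the interval start. The range-minimum query here is a plain linear scan standing in for the constant-time RMQ structure, so this sketch does not achieve the O(ndoc) bound. Function names and the toy doc array are illustrative, and positions are 0-based.

```python
def build_prev(doc):
    # prev[i] = largest j < i with doc[j] == doc[i], or -1 if none exists.
    last_seen = {}
    prev = []
    for i, d in enumerate(doc):
        prev.append(last_seen.get(d, -1))
        last_seen[d] = i
    return prev


def list_distinct(doc, prev, lo, hi):
    """Report each distinct value in doc[lo..hi] (inclusive) exactly once."""
    result = []
    stack = [(lo, hi)]
    while stack:
        left, right = stack.pop()
        if left > right:
            continue
        # Naive range-minimum query; a real index answers this in O(1).
        i = min(range(left, right + 1), key=prev.__getitem__)
        if prev[i] >= lo:
            # Every document in this subrange already occurs earlier
            # inside the query interval, so nothing new can be found here.
            continue
        result.append(doc[i])
        stack.append((left, i - 1))
        stack.append((i + 1, right))
    return result


doc = [2, 3, 4, 2, 1, 2, 3, 1, 4]
prev = build_prev(doc)
print(prev)                                   # [-1, -1, -1, 0, -1, 3, 1, 4, 2]
print(sorted(list_distinct(doc, prev, 3, 7)))  # [1, 2, 3]
```

Each successful RMQ reports a new document and each unsuccessful one ends a branch, so the number of queries is proportional to the number of distinct documents in the interval.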

Time-Optimal Document Retrieval
• Theorem [Mut02]: Document retrieval problem can be solved in the optimal O(m+ndoc) time using an index structure of size O(n log n) bits, where ndoc is the number of documents matching the query.
• Observation: The solution is not space-optimal, as the document collection can be represented in n log |Σ| bits.

Space-Optimal Document Retrieval
• Theorem [Sad02]: Document retrieval problem can be solved in O(f(m,n) + ndoc·g(n)) time using an index structure of size |CSA| + 4n + o(n) + O(k log(n/k)) bits, where
  • |CSA| ≤ n log |Σ| (1+o(1)) is the size of the compressed suffix array used;
  • f(m,n) = O(m log n) is the pattern search time; and
  • g(n) = Ω(log^ε n) is the time to decode a suffix array value.

Our Result: Space- and Time-Efficient Document Retrieval
• Theorem: Document retrieval problem can be solved in the optimal O(m+ndoc) time using an index structure of size |CSA| + 2n + o(n) + n log k (1+o(1)) bits, when |Σ|, k = O(polylog(n));
• for unbounded |Σ|, k the time bound components become O(m log |Σ|) and O(ndoc log k), respectively.

Details of Our Result (1/3)
• We use the alphabet-friendly FM-index [FMMN07] to find the suffix array interval containing the pattern occurrences.
• We use the generalized wavelet tree [GGV03, FMMN07] to store document numbers according to the suffix array order.

Details of Our Result (2/3)
• Observation: prev[i] = select_{doc[i]}(doc, rank_{doc[i]}(doc, i) - 1), where
  • rank_{k'}(A, i) gives the number of times value k' appears in A[1,i]; and
  • select_{k'}(A, j) gives the position of the j-th occurrence of value k' in A.
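A small sketch of this observation, using linear-time stand-ins for rank and select (the wavelet tree provides them much faster). Positions are 1-based as on the slide. Returning -1 when the requested occurrence does not exist is an assumption made here so that first occurrences get a sentinel value; the function names are illustrative.

```python
def rank(A, value, i):
    # Number of times value appears in A[1..i] (1-based, inclusive).
    return A[:i].count(value)


def select(A, value, j):
    # 1-based position of the j-th occurrence of value in A.
    # Assumption: return -1 when there is no such occurrence (including j < 1).
    if j < 1:
        return -1
    seen = 0
    for pos, x in enumerate(A, start=1):
        if x == value:
            seen += 1
            if seen == j:
                return pos
    return -1


def prev_at(doc, i):
    # prev[i] = select_{doc[i]}(doc, rank_{doc[i]}(doc, i) - 1)
    d = doc[i - 1]
    return select(doc, d, rank(doc, d, i) - 1)


doc = [2, 3, 4, 2, 1, 2, 3, 1, 4]
print([prev_at(doc, i) for i in range(1, len(doc) + 1)])
# [-1, -1, -1, 1, -1, 4, 2, 5, 3]
```

Because prev can be computed on demand from doc, the prev array itself does not need to be stored.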

Details of Our Result (3/3)
• The generalized wavelet tree representation of the doc array provides constant-time rank and select when k = O(polylog(n)).
• Constant-time Range Minimum Queries (RMQ) on the implicit prev array can be supported using 2n + o(n) bits [FH07].

A simpler way to obtain the O(ndoc log k) result...

Index size: |CSA| + 2n + o(n) + n log k (1+o(1)) bits

[Figure: wavelet tree over the doc array]
position: 1 2 3 4 5 6 7 8 9
doc (root): 2 3 4 2 1 2 3 1 4
second level: 2 2 1 2 1 | 3 4 3 4
leaves: 1 1 | 2 2 2 | 3 3 | 4 4
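One way to read the figure: descend the wavelet tree with the query interval, mapping the interval into each child by counting how many elements go left, and follow only children whose mapped interval is non-empty. The sketch below assumes this interpretation. It stores plain prefix-count lists instead of succinct bitvectors, uses 0-based half-open intervals, and all class and function names are hypothetical.

```python
class WaveletNode:
    def __init__(self, seq, lo, hi):
        # This node covers the value range lo..hi (document numbers).
        self.lo, self.hi = lo, hi
        if lo == hi or not seq:
            return  # leaf or empty subtree: nothing more to store
        mid = (lo + hi) // 2
        # zeros[i] = how many of the first i elements go to the left child.
        self.zeros = [0]
        for x in seq:
            self.zeros.append(self.zeros[-1] + (x <= mid))
        self.left = WaveletNode([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletNode([x for x in seq if x > mid], mid + 1, hi)


def distinct_in_range(node, l, r, out):
    # Collect (value, frequency) for each distinct value at positions [l, r).
    if l >= r:
        return  # empty interval: this subtree contributes nothing
    if node.lo == node.hi:
        out.append((node.lo, r - l))
        return
    zl, zr = node.zeros[l], node.zeros[r]
    distinct_in_range(node.left, zl, zr, out)
    distinct_in_range(node.right, l - zl, r - zr, out)


doc = [2, 3, 4, 2, 1, 2, 3, 1, 4]
root = WaveletNode(doc, 1, 4)
found = []
distinct_in_range(root, 3, 8, found)  # 1-based positions 4..8: 2 1 2 3 1
print(found)  # [(1, 2), (2, 2), (3, 1)]
```

Each reported document costs one root-to-leaf path of length about log k, and the interval width at the leaf gives the number of occurrences in that document.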

Extensions
• The approach can easily be extended to
  • report the documents in relevance order under standard scoring schemes like TF*IDF; and
  • show context around the first/several/all occurrences in selected documents.

Small experiment
• 50MB English text
• k=200
[Figure panels: query time, m=3; query time, m=4; size]
