
Space-Efficient Algorithms for Document Retrieval

Veli Mäkinen, University of Helsinki. Joint work with Niko Välimäki.

Presentation Transcript


  1. Space-Efficient Algorithms for Document Retrieval. Veli Mäkinen, University of Helsinki. Joint work with Niko Välimäki.

  2. Introduction
     Field                          | Problem            | Solution
     Information Retrieval          | Document Retrieval | Inverted Index
     Combinatorial Pattern Matching | Text Indexing      | Suffix tree
     Diagram annotations: practice: space limits; theory: time limits [Mut02]; [PST06]; [Sad07 & this paper].
     Space-Efficient Document Retrieval

  3. Text Indexing
     • Let T = t_1 t_2 ... t_n be a text string over an ordered alphabet Σ.
     • The Text Indexing problem is to build an index structure for T that supports the following operations on a given pattern P = p_1 p_2 ... p_m:
       • Count(P): how many times does P occur in T?
       • List(P): list the occurrence positions of P in T.
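
A minimal Python sketch of Count(P) and List(P), assuming a naively built suffix array as the index. The slide names no particular structure, so the function names and the quadratic construction below are illustrative assumptions, not the authors' implementation.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start offsets by the suffix text.
    # Fine for a demo; real indexes use linear-time or compressed variants.
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text, sa, pattern):
    """Return the half-open suffix array interval [lo, hi) of suffixes
    that start with `pattern`, found by two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def count(text, sa, pattern):
    lo, hi = sa_interval(text, sa, pattern)
    return hi - lo


def list_occurrences(text, sa, pattern):
    lo, hi = sa_interval(text, sa, pattern)
    # Report 1-based positions, matching T = t_1 t_2 ... t_n on the slide.
    return sorted(p + 1 for p in sa[lo:hi])
```

For example, with `text = "to be or not to be"` and `sa = build_suffix_array(text)`, `count(text, sa, "to be")` counts the two occurrences and `list_occurrences(text, sa, "to be")` reports their start positions.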

  4. Document Retrieval
     • Let D = {T_1, T_2, ..., T_k} be a set of text documents of total length n.
     • The Document Retrieval problem is to build an index for D that supports the following operation on a given pattern P = p_1 p_2 ... p_m:
       • Find(P): list the documents that contain P (in order of relevance, ...).

  5. Inverted Index & Document Retrieval
     Creating an inverted file over Shakespeare's plays:
     d1 (Hamlet):
       To be, or not to be: that is the question:
       Whether 'tis nobler in the mind to suffer
       The slings and arrows of outrageous fortune,
       Or to take arms against a sea of troubles,
       And by opposing end them? To die: to sleep;
       ...
     d2 (Merchant of Venice):
       PORTIA: If to do were as easy as to know what were good to
       do, chapels had been churches and poor men's
       cottages princes' palaces. It is a good divine that
       follows his own instructions: I can easier teach
       twenty what were good to be done, than be one of the
       twenty to follow mine own teaching.
     Inverted file:
       be: (d1,4) (d1,18) ... (d2,74) (d2,139) ...
       ...
       to: (d1,1) (d1,15) ... (d2,136) ...
       ...
     Find("to be") = Remove duplicates((Find("to")+3) ∩ Find("be")) = d1 (Hamlet), d2 (Merchant of Venice), ...
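
The phrase query on this slide can be sketched in Python as follows. This is an illustration under assumptions: postings are (document, 1-based character position) pairs as in the slide's example, words are lowercased and matched with a simple regular expression, and `build_inverted_index` and `find_phrase` are hypothetical names.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z']+")  # assumed tokenizer, not taken from the slides


def build_inverted_index(docs):
    """Map each word to a list of (doc_id, 1-based character position)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for match in WORD.finditer(text.lower()):
            index[match.group()].append((doc_id, match.start() + 1))
    return index


def find_phrase(index, phrase):
    """Documents containing the phrase: shift the postings of the first word
    by each later word's offset, intersect, then remove duplicates."""
    words = [(m.group(), m.start()) for m in WORD.finditer(phrase.lower())]
    if not words:
        return []
    first_word, base = words[0]
    candidates = set(index.get(first_word, []))
    for word, offset in words[1:]:
        shift = offset - base  # e.g. 3 for "be" in "to be"
        postings = set(index.get(word, []))
        candidates = {(d, p) for (d, p) in candidates if (d, p + shift) in postings}
    return sorted({d for d, _ in candidates})  # duplicates removed
```

With `docs = {"d1": "To be, or not to be: that is the question", "d2": "I can easier teach twenty what were good to be done"}`, the postings for "to" and "be" in d1 start at positions 1, 15 and 4, 18, as on the slide, and `find_phrase(index, "to be")` keeps both documents.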

  6. Suffix Array & Document Retrieval (1/2)
     • Build the generalized suffix array of D:
       1 2 .... 6853491 6853492 6853493 6853494 ...
     d1 (Hamlet):
       To be, or not to be: that is the question:
       Whether 'tis nobler in the mind to suffer
       The slings and arrows of outrageous fortune,
       Or to take arms against a sea of troubles,
       And by opposing end them? To die: to sleep;
     d2 (Merchant of Venice):
       PORTIA: If to do were as easy as to know what were good to
       do, chapels had been churches and poor men's
       cottages princes' palaces. It is a good divine that
       follows his own instructions: I can easier teach
       twenty what were good to be done, than be one of the
       twenty to follow mine own teaching.

  7. Suffix Array & Document Retrieval
     • Build the generalized suffix array of D:
       1 2 .... 6853491 6853492 6853493 6853494 ...
     • Locate the interval containing all occurrences of pattern P ("to be"):
       1 2 .... 6853491 6853492 6853493 6853494 ...
     • Remove duplicates:
       d1 (Hamlet), d2 (Merchant of Venice), ...
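
The three steps on this slide can be sketched in Python. This is a toy version under assumptions: the generalized suffix array is built naively as (document number, offset) pairs, and the function names are illustrative rather than taken from the talk.

```python
def build_generalized_sa(docs):
    """All suffixes of all documents as (doc number, offset) pairs,
    sorted by suffix text. Naive construction, for illustration only."""
    entries = [(d, i)
               for d, text in enumerate(docs, start=1)
               for i in range(len(text))]
    entries.sort(key=lambda e: docs[e[0] - 1][e[1]:])
    return entries


def pattern_interval(docs, gsa, pattern):
    """Half-open interval [lo, hi) of suffixes that start with `pattern`."""
    m = len(pattern)

    def prefix(rank):
        d, i = gsa[rank]
        return docs[d - 1][i:i + m]

    lo, hi = 0, len(gsa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(gsa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def find_documents(docs, gsa, pattern):
    lo, hi = pattern_interval(docs, gsa, pattern)
    # Duplicate removal scans every occurrence in the interval, so the cost
    # grows with the number of occurrences, not the number of documents.
    return sorted({d for d, _ in gsa[lo:hi]})
```

The scan over the whole interval is the cost that the next slide's improvement avoids.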

  8. Muthukrishnan's improvement
     position   1    2   ....   6853491   6853492   6853493   6853494   ...
     doc        6    4   ....   2         1         1         3
     prev      -1   -1   ...    6853434   6853372   6853492   6853420   ...
     Figure labels: "to be" (marking the interval); min, min, min, ...; min>6853490.
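
A hedged Python sketch of the idea behind this figure: prev[i] points to the previous position holding the same document number, and a document is reported only at the position whose prev falls before the interval start. The naive `min` scan below stands in for a constant-time range minimum structure, indices are 0-based, and the function names are assumptions.

```python
def build_prev(doc):
    """prev[i] = previous index j < i with doc[j] == doc[i], or -1 if none."""
    last_seen = {}
    prev = []
    for i, d in enumerate(doc):
        prev.append(last_seen.get(d, -1))
        last_seen[d] = i
    return prev


def list_documents(doc, prev, lo, hi):
    """Distinct document numbers in doc[lo..hi] (inclusive, 0-based)."""
    result = []
    stack = [(lo, hi)]
    while stack:
        a, b = stack.pop()
        if a > b:
            continue
        # Naive range minimum query; a real index answers this in O(1).
        i = min(range(a, b + 1), key=prev.__getitem__)
        if prev[i] >= lo:
            # Every document in this sub-range already occurs earlier
            # inside [lo, hi], so nothing new can be reported here.
            continue
        result.append(doc[i])
        stack.append((a, i - 1))
        stack.append((i + 1, b))
    return result
```

Each range minimum query either reports a new document or ends a branch, so the number of queries is proportional to the number of distinct documents in the interval.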

  9. Time-Optimal Document Retrieval
     • Theorem [Mut02]: The document retrieval problem can be solved in the optimal O(m + ndoc) time using an index structure of size O(n log n) bits, where ndoc is the number of documents matching the query.
     • Observation: The solution is not space-optimal, as the document collection can be represented in n log |Σ| bits.

  10. Space-Optimal Document Retrieval
      • Theorem [Sad02]: The document retrieval problem can be solved in O(f(m,n) + ndoc·g(n)) time using an index structure of size |CSA| + 4n + o(n) + O(k log (n/k)) bits, where
        • |CSA| ≤ n log |Σ| (1+o(1)) is the size of the compressed suffix array used;
        • f(m,n) = O(m log n) is the pattern search time; and
        • g(n) = Ω(log^ε n) is the time to decode a suffix array value.

  11. Our Result: Space- and Time-Efficient Document Retrieval
      • Theorem: The document retrieval problem can be solved in the optimal O(m + ndoc) time using an index structure of size |CSA| + 2n + o(n) + n log k (1+o(1)) bits, when |Σ|, k = O(polylog(n));
      • for unbounded |Σ|, k the time bound components become O(m log |Σ|) and O(ndoc log k), respectively.
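
To get a feel for the space terms in these theorems, the following Python sketch plugs in concrete numbers. It is a back-of-the-envelope illustration only: every O(·) term is taken with constant 1, the o(·) terms are dropped, |CSA| is replaced by its upper bound n log |Σ|, and the byte alphabet (|Σ| = 256) is an assumption.

```python
import math


def bits_n_log_n(n):
    # The n log n term, e.g. one suffix array entry per text position.
    return n * math.ceil(math.log2(n))


def bits_text(n, sigma):
    # The collection itself: n log |Sigma| bits.
    return n * math.ceil(math.log2(sigma))


def bits_compact_index(n, sigma, k):
    # |CSA| + 2n + n log k, with |CSA| replaced by its upper bound.
    return bits_text(n, sigma) + 2 * n + n * math.ceil(math.log2(k))


n = 50 * 2**20   # a 50 MB collection
sigma = 256      # assumed byte alphabet
k = 200          # number of documents
for label, bits in [
    ("n log n", bits_n_log_n(n)),
    ("n log |Sigma|", bits_text(n, sigma)),
    ("|CSA| + 2n + n log k", bits_compact_index(n, sigma, k)),
]:
    print(f"{label:>22}: {bits / 8 / 2**20:.0f} MiB")
```

Under these assumptions the n log k term for the document array is comparable in size to the text itself, which is why it appears explicitly in the bound.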

  12. Details of Our Result (1/3)
      • We use the alphabet-friendly FM-index [FMMN07] to find the suffix array interval containing the pattern occurrences.
      • We use the generalized wavelet tree [GGV03, FMMN07] to store document numbers according to the suffix array order.

  13. Details of Our Result (2/3)
      • Observation: prev[i] = select_{doc[i]}(doc, rank_{doc[i]}(doc, i) - 1), where
        • rank_{k'}(A, i) gives the number of times value k' appears in A[1,i]; and
        • select_{k'}(A, j) gives the position of the j-th occurrence of value k' in A.
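
The observation can be checked with a small Python sketch. The linear-time `rank` and `select` below are stand-ins for the constant-time operations of the wavelet tree, positions are 1-based as on the slide, and returning -1 when no such occurrence exists is an assumed convention chosen to match the prev values shown earlier.

```python
def rank(A, c, i):
    """Number of times value c appears in A[1..i] (1-based, inclusive)."""
    return sum(1 for x in A[:i] if x == c)


def select(A, c, j):
    """1-based position of the j-th occurrence of c in A, or -1 if none."""
    if j <= 0:
        return -1
    seen = 0
    for pos, x in enumerate(A, start=1):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    return -1


def prev_at(doc, i):
    """prev[i] computed on the fly, without storing the prev array."""
    c = doc[i - 1]
    return select(doc, c, rank(doc, c, i) - 1)
```

Because prev[i] is derived from doc alone, the prev array never needs to be stored explicitly.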

  14. Details of Our Result (3/3)
      • The generalized wavelet tree representation of the doc array provides constant-time rank and select when k = O(polylog(n)).
      • Constant-time Range Minimum Queries (RMQ) on the implicit prev array can be supported using 2n + o(n) bits [FH07].

  15. A simpler way to obtain the O(ndoc log k) result...
      |CSA| + 2n + o(n) + n log k (1+o(1)) bits
      position   1 2 3 4 5 6 7 8 9
      doc        2 3 4 2 1 2 3 1 4
      Levels drawn below doc in the figure:
        2 2 1 2 1 | 3 4 3 4
        1 1 | 2 2 2 | 3 3 | 4 4
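
One plausible reading of this figure is a binary wavelet tree over the doc array, in which the distinct values of a range are found by descending only into children whose mapped range is non-empty. The Python sketch below illustrates that reading; the pointer-based layout, the `WaveletNode` class, and `distinct_in_range` are assumptions for illustration, not the succinct structure from the talk.

```python
class WaveletNode:
    """Binary wavelet tree node over the value range [low, high]."""

    def __init__(self, values, low, high):
        self.low, self.high = low, high
        self.left = self.right = None
        self.zeros = [0]
        if low == high or not values:
            return
        mid = (low + high) // 2
        # zeros[i] = how many of the first i values go to the left child.
        for v in values:
            self.zeros.append(self.zeros[-1] + (v <= mid))
        self.left = WaveletNode([v for v in values if v <= mid], low, mid)
        self.right = WaveletNode([v for v in values if v > mid], mid + 1, high)


def distinct_in_range(node, lo, hi, out):
    """Append the distinct values in positions [lo, hi) (0-based, half-open)
    of the sequence stored at `node`, in increasing order."""
    if lo >= hi:
        return  # empty range: prune this subtree
    if node.low == node.high:
        out.append(node.low)  # a leaf with a non-empty range
        return
    distinct_in_range(node.left, node.zeros[lo], node.zeros[hi], out)
    distinct_in_range(node.right, lo - node.zeros[lo], hi - node.zeros[hi], out)
```

Each reported value costs one root-to-leaf path of about log k nodes, which is consistent with the O(ndoc log k) bound in the slide title.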

  16. Extensions
      • The approach can easily be extended to
        • report the documents in relevance order under standard scoring schemes like TF*IDF; and
        • show context around the first/several/all occurrences in selected documents.
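
The slides do not say how the relevance ordering is computed. As a rough Python sketch under assumptions, the term frequency of a document can be taken as the number of times its number appears in the pattern's interval of the doc array, and the inverse document frequency as log(k / ndoc). `rank_by_tfidf` is a hypothetical helper, and a real index would obtain the counts with rank queries instead of scanning the interval.

```python
import math
from collections import Counter


def rank_by_tfidf(doc, lo, hi, k):
    """Order the documents in doc[lo..hi] (inclusive, 0-based) by TF*IDF,
    breaking ties by document number. k is the collection size."""
    tf = Counter(doc[lo:hi + 1])  # occurrences of the pattern per document
    if not tf:
        return []
    idf = math.log(k / len(tf))   # len(tf) = number of matching documents
    return sorted(tf, key=lambda d: (-tf[d] * idf, d))
```

For a single pattern the IDF factor is the same for every document, so the order is decided by term frequency alone; IDF only matters once scores from several patterns are combined.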

  17. Small experiment
      • 50MB English text
      • k=200
      [Plot: size; query time for m=3; query time for m=4]
