
Information retrieval

This article discusses the Boolean Information Retrieval (IR) model, which is based on the set-theoretic principles established by George Boole in the 1850s. It explores how the model deals with set membership and operations on sets, using Boolean expressions as queries and returning the documents that satisfy them. The model is conceptually simple and easy to understand, but a naive implementation is slow on large collections and does not support operations such as proximity search or ranked retrieval.

Presentation Transcript


  1. Information retrieval 2019/2020

  2. IR process

  3. IR Models • Set Theoretic Models • Boolean • Fuzzy • Extended Boolean • Vector Models (algebraic) • Probabilistic Models • Others (e.g., neural networks)

  4. Boolean IR Model • Based on Boolean Logic (Algebra of Sets) • Fundamental principles established by George Boole in the 1850’s • Deals with set membership and operations on sets • Set membership in IR systems is usually based on whether (or not) a document contains a keyword (term)

  5. Boolean IR Model • Queries are Boolean expressions, e.g., ‘Caesar and Brutus’ (predicates for term occurrence) • Search engine returns all documents that satisfy the Boolean expression • Conceptually simple and easy to understand • Basic operation: • identify set of documents containing a certain term

  6. Boolean IR Model • Which plays of Shakespeare contain the words BRUTUS and CAESAR, but NOT CALPURNIA? • One could grep all of Shakespeare’s plays for BRUTUS and CAESAR, then strip out lines containing CALPURNIA. • Why is grep not the solution? • Slow (for large collections) • “NOT CALPURNIA” is non-trivial • Other operations (e.g., find the word Romans near countryman) not feasible • No ranked retrieval (which are the best documents to return?)

  7. Term-document matrix (figure: incidence matrix with term and document axes)

  8. We have a 0/1 vector for each term (characteristic functions of the term occurrence predicates) • To answer the query BRUTUS AND CAESAR AND NOT CALPURNIA: • Take the vectors for BRUTUS, CAESAR, and CALPURNIA • Complement the vector of CALPURNIA • Do a (bitwise) AND on the three vectors: 110100 AND 110111 AND 101111 = 100100
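
A minimal Python sketch of this bitwise-AND approach, using the vectors from the slide; CALPURNIA's vector (010000) is inferred as the complement of the 101111 shown above, and the six-bit mask is an assumption about the size of the toy collection:

    # Incidence vectors: bits read left to right as documents 1..6.
    BRUTUS    = 0b110100
    CAESAR    = 0b110111
    CALPURNIA = 0b010000          # its complement over 6 bits is 101111, as on the slide
    MASK      = 0b111111          # six documents in the toy collection

    # BRUTUS AND CAESAR AND NOT CALPURNIA
    result = BRUTUS & CAESAR & (~CALPURNIA & MASK)
    print(format(result, "06b"))  # -> 100100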

  9. Sometimes we need to • process large collections fast • support more specific operations (not only AND, OR, NOT), e.g., NEAR • sort (i.e., rank) the retrieved documents

  10. Inverted index
  Dictionary → posting lists:
    Brutus → 1 2 4 11 31 45 173 174
    Caesar → 1 2 4 5 6 16 57 132
    Calpurnia → 2 31 54 101
  • We need variable-size postings lists • On disk, a continuous run of postings is normal and best • In memory, can use linked lists or variable-length arrays

  11. Concepts (same dictionary and posting lists as on the previous slide) • Dictionary of terms – list of all terms, kept in memory • Posting list – list of a term’s occurrences in documents • Posting – one item in the occurrence list

  12. How to create a posting list • Acquire the collection of documents. • “Mier všetkým národom sveta.” (Slovak: “Peace to all nations of the world.”) • Tokenize the content of the documents. • Mier, všetkým, národom, sveta • Perform text operations (which ones? – here lowercasing and lemmatization). • mier, všetko, národ, svet • Alphabetically sort the terms and, for each term, find all documents containing it.
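
A minimal Python sketch of this pipeline. Whitespace tokenization, the second document, and the tiny normalize map (standing in for a real lemmatizer) are illustrative assumptions:

    from collections import defaultdict

    docs = {
        1: "Mier všetkým národom sveta.",
        2: "Národom sveta mier.",          # hypothetical second document
    }

    # Hypothetical normalization map standing in for lowercasing + lemmatization.
    normalize = {"mier": "mier", "všetkým": "všetko", "národom": "národ", "sveta": "svet"}

    index = defaultdict(list)               # term -> sorted list of docIDs
    for doc_id in sorted(docs):             # docIDs processed in increasing order
        tokens = docs[doc_id].lower().strip(".").split()
        for term in {normalize.get(t, t) for t in tokens}:
            index[term].append(doc_id)

    for term in sorted(index):              # dictionary sorted alphabetically
        print(term, index[term])            # mier [1, 2] ... všetko [1]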

  13. (figure: the sorted dictionary – mier, národ, svet, všetko – with its posting lists)

  14. Query • “mier všetkým” • → mier AND všetko • (evaluated against the dictionary: mier, národ, svet, všetko)

  15. Merging lists
  INTERSECT(p1, p2)
      answer = []
      while p1 <> null and p2 <> null do
          if docID(p1) = docID(p2) then
              ADD(answer, docID(p1))
              p1 = next(p1)
              p2 = next(p2)
          else if docID(p1) < docID(p2) then
              p1 = next(p1)
          else
              p2 = next(p2)
      return answer
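
A runnable Python sketch of this merge, assuming posting lists are plain sorted Python lists of docIDs (the example uses the Brutus and Calpurnia lists from the inverted-index slide):

    def intersect(p1, p2):
        """Merge two sorted posting lists, returning docIDs present in both."""
        answer = []
        i = j = 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    brutus    = [1, 2, 4, 11, 31, 45, 173, 174]
    calpurnia = [2, 31, 54, 101]
    print(intersect(brutus, calpurnia))   # [2, 31]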

  16.–23. Intersection example • Brutus AND Caesar AND Calpurnia • The INTERSECT algorithm above is stepped through on the posting lists shown in the slide figure: the answer list for the first pair of lists is built up step by step (1; 1, 3; 1, 3, 5; 1, 3, 5, 15), and after intersecting with the third list the final answer is 1.

  24. Intersection example – frequencies • Brutus AND Caesar AND Calpurnia → reordered by increasing document frequency: Calpurnia AND Brutus AND Caesar

  28. Results • Sorted vs. not sorted evaluation order: 3 vs. 7 (in the example above) • Process posting lists in order of increasing size (rarest term first) – so it is useful to store frequencies in the dictionary
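
A small sketch of this optimization, reusing the intersect function above and assuming the dictionary is a plain dict from term to posting list (so document frequency is just the list length):

    def intersect_many(terms, index):
        """AND several terms together, starting with the shortest posting list."""
        ordered = sorted(terms, key=lambda t: len(index[t]))   # rarest term first
        result = index[ordered[0]]
        for term in ordered[1:]:
            if not result:            # early exit: the conjunction is already empty
                break
            result = intersect(result, index[term])
        return result

    index = {
        "brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
        "caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
        "calpurnia": [2, 31, 54, 101],
    }
    print(intersect_many(["brutus", "caesar", "calpurnia"], index))   # [2]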

  29. Faster posting list merging • If the list lengths are m and n, the merge takes O(m+n) operations. • Can we do better? • Yes (if the index isn’t changing too fast).

  30. Augment postings with skip pointers (at indexing time) • To skip over postings that will not figure in the search results.

  31. Process • Suppose we have stepped through the lists until we process 8 on each list; we match it and advance. • We then have 41 on the upper list and 11 on the lower list; 11 is smaller. • But instead of advancing to 17, we note that the skip successor of 11 on the lower list is 31, which is still smaller than 41, so we can skip ahead past the intervening postings.

  32. Intersect with skips
  INTERSECTWITHSKIPS(p1, p2)
      answer = []
      while p1 <> null and p2 <> null do
          if docID(p1) = docID(p2) then
              ADD(answer, docID(p1))
              p1 = next(p1)
              p2 = next(p2)
          else if docID(p1) < docID(p2) then
              while hasSkip(p1) and docID(skip(p1)) < docID(p2) do
                  p1 = skip(p1)
              p1 = next(p1)
          else
              while hasSkip(p2) and docID(skip(p2)) < docID(p1) do
                  p2 = skip(p2)
              p2 = next(p2)
      return answer
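
A runnable Python sketch of the same idea, assuming each posting list is a sorted array and skip pointers are stored separately as index jumps of length roughly √L (both the representation and the example lists are illustrative):

    import math

    def add_skips(postings):
        """Attach evenly spaced skip pointers: position -> target position."""
        step = int(math.sqrt(len(postings))) or 1
        return {i: i + step for i in range(0, len(postings) - step, step)}

    def intersect_with_skips(p1, p2):
        s1, s2 = add_skips(p1), add_skips(p2)
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i, j = i + 1, j + 1
            elif p1[i] < p2[j]:
                # Follow skip pointers while they do not overshoot the other list.
                while i in s1 and p1[s1[i]] < p2[j]:
                    i = s1[i]
                i += 1
            else:
                while j in s2 and p2[s2[j]] < p1[i]:
                    j = s2[j]
                j += 1
        return answer

    print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                               [1, 2, 3, 8, 11, 17, 21, 31]))   # [2, 8]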

  33. Where to place skips? • Tradeoff: • More skips → shorter skip spans ⇒ more likely to skip, but lots of comparisons against skip pointers. • Fewer skips → fewer pointer comparisons, but then long skip spans ⇒ few successful skips.

  34. Placing skips • Simple heuristic: for a posting list of length L, use √L evenly spaced skip pointers • This takes into account the distribution of query terms in a simple way: • the larger the document frequency of a term, the larger the number of skip pointers • Easy if the index is relatively static; harder if the postings keep changing because of updates • This definitely used to help; with modern hardware it may not (Bahle et al. 2002) unless the index is memory-resident: • the I/O cost of loading a bigger index structure can outweigh the gains from quicker in-memory merging!
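
As a usage note, the add_skips helper sketched above follows this heuristic; for an illustrative list of 16 postings it places pointers every ⌊√16⌋ = 4 positions:

    # Illustrative: 16 postings get skip pointers every 4 positions.
    print(add_skips(list(range(1, 17))))   # {0: 4, 4: 8, 8: 12}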

  35. Phrase retrieval • About 10% of queries are phrase queries • A document-level occurrence posting list is not enough • Biword index • Positional index

  36. Biword index • “Mier ľuďom celého sveta” (Slovak: “Peace to the people of the whole world”) • biwords after normalization: mier ľudia • ľudia celý • celý svet • Longer phrases are decomposed into biwords • May lead to wrong results (false positives – see the next slide)
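
A minimal Python sketch of biword generation, with a hypothetical normalization map in place of a real lemmatizer:

    def biwords(text, normalize):
        """Normalize tokens and pair consecutive terms into biwords."""
        terms = [normalize.get(t, t) for t in text.lower().split()]
        return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

    # Illustrative normalization for the Slovak example phrase.
    norm = {"ľuďom": "ľudia", "celého": "celý", "sveta": "svet"}
    print(biwords("Mier ľuďom celého sveta", norm))
    # ['mier ľudia', 'ľudia celý', 'celý svet']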

  37. Biword – false positive
  • Query: “faculty of informatics and information technologies”
  • Doc1: The Faculty of Informatics is one of eight faculties at the Vienna University of Technology, which carries out research in information, technology and business informatics.
    • faculty informatics, informatics one, …, information technology
  • Doc2: Information technology is the application of computers and telecom equipment to store, retrieve, transmit and manipulate data.
    • information technology, technology application, application computers
  • Doc3: Faculty of Informatics and Information Technologies (FIIT) is one of the seven faculties of the Slovak University of Technology in Bratislava (SUT).
    • faculty informatics, informatics information, information technology

  38. Positional index (with frequencies) • <term, number of docs containing the term; Doc1, term freq in Doc1: position1, position2, …; Doc2, term freq in Doc2: position1, position2, …; etc.>

  39. Positional index example (without freq) <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367,…> • For phrase queries, we use a merge algorithm recursively at the document level • But we need to deal with more than just equality

  40. Positional index: processing a query • “to be or not to be” • Extract the inverted index entries for to, be, or, not • Merge their doc:position lists to enumerate all positions of “to be or not to be” • to: • 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191, … • be: • 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101, … • The same general method works for proximity queries (distance between words)

  41. Intersect with positional indexes
  POSITIONALINTERSECT(p1, p2, k)
      answer = []
      while p1 <> null and p2 <> null do
          if docID(p1) = docID(p2) then
              l = []
              pp1 = positions(p1)
              pp2 = positions(p2)
              while pp1 <> null do
                  while pp2 <> null do
                      if abs(pos(pp1) - pos(pp2)) ≤ k then
                          ADD(l, pos(pp2))
                      else if pos(pp2) > pos(pp1) then
                          break
                      pp2 = next(pp2)
                  while l <> [] and abs(l[0] - pos(pp1)) > k do
                      DELETE(l[0])
                  for each ps ∈ l do
                      ADD(answer, (docID(p1), pos(pp1), ps))
                  pp1 = next(pp1)
              p1 = next(p1)
              p2 = next(p2)
          else if docID(p1) < docID(p2) then
              p1 = next(p1)
          else
              p2 = next(p2)
      return answer
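
A runnable Python sketch of the document-level part of this algorithm, assuming positional postings are dicts mapping docID to a sorted list of positions; for clarity the inner proximity check is written naively rather than with the sliding window used above, and the data is the illustrative to/be postings from the previous slide:

    def positional_intersect(pl1, pl2, k):
        """Return (docID, pos1, pos2) triples where the two terms occur within k positions."""
        answer = []
        for doc_id in sorted(pl1.keys() & pl2.keys()):    # documents containing both terms
            for pos1 in pl1[doc_id]:
                for pos2 in pl2[doc_id]:
                    if abs(pos1 - pos2) <= k:
                        answer.append((doc_id, pos1, pos2))
        return answer

    to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433]}
    be = {1: [17, 19], 4: [17, 191, 291, 430, 434]}
    print(positional_intersect(to, be, 1))
    # [(4, 16, 17), (4, 190, 191), (4, 429, 430), (4, 433, 434)]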

  42. Rules of thumb • A positional index is 2–4 times as large as a non-positional index • Positional index size is 35–50% of the volume of the original text

  43. Dictionary data structures for inverted indexes • An array of structs, one per term (e.g., term, document frequency, pointer to its postings list)
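
A minimal sketch of such a dictionary entry in Python; the field names are illustrative assumptions:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class DictEntry:
        term: str             # the vocabulary term
        doc_freq: int         # number of documents containing the term
        postings: List[int]   # sorted docIDs (in practice, a pointer to an on-disk list)

    # The dictionary itself: an array of entries, sorted by term.
    dictionary = [
        DictEntry("brutus", 8, [1, 2, 4, 11, 31, 45, 173, 174]),
        DictEntry("caesar", 8, [1, 2, 4, 5, 6, 16, 57, 132]),
        DictEntry("calpurnia", 4, [2, 31, 54, 101]),
    ]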

  44. Dictionary data structures • Two main choices: • Hash table • Tree • Some IR systems use hashes, some use trees

  45. Hashes • Each vocabulary term is hashed to an integer • (We assume you’ve seen hash tables before) • Pros: • Lookup is faster than for a tree • Cons: • No easy way to find minor variants: • judgment/judgement • No prefix search (“bar*”) [tolerant retrieval] • If the vocabulary keeps growing, we need to occasionally do the expensive operation of rehashing everything

  46. Tree: binary tree

  47. Tree: B-tree • Definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate natural numbers, e.g., [2, 4].

  48. Trees • Simplest: binary tree • More usual: B-trees • Trees require a standard ordering of characters and hence of strings … but we have one – lexicographic • Unless we are dealing with Chinese (no unique ordering) • Pros: • Solves the prefix problem (terms starting with 'hyp') • Cons: • Slower lookup: O(log M), where M is the vocabulary size [and this requires a balanced tree] • Rebalancing binary trees is expensive • But B-trees mitigate the rebalancing problem.
