
Information Retrieval Implementation issues

Djoerd Hiemstra & Vojkan Mihajlovic, University of Twente, {d.hiemstra,v.mihajlovic}@utwente.nl



  1. Information Retrieval: Implementation issues Djoerd Hiemstra & Vojkan Mihajlovic University of Twente {d.hiemstra,v.mihajlovic}@utwente.nl

  2. The lecture • Ian H. Witten, Alistair Moffat, Timothy C. Bell, “Managing Gigabytes”, Morgan Kaufmann, pages 72-115 (Section 3), 1999. (For the exam, the compression methods in Section 3.3, i.e., the part with the grey bar to the left of the text, do not have to be studied in detail) • Sergey Brin and Lawrence Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Computer Networks and ISDN Systems, 1997.

  3. Overview • Brute force implementation • Text analysis • Indexing • Index coding and query processing • Web search engines • Wrap-up

  4. Overview • Brute force implementation • Text analysis • Indexing • Index coding and query processing • Web search engines • Wrap-up

  5. Architecture 2000: FAST Search Engine (Knut Risvik)

  6. Architecture today 1. The web server sends the query to the index servers. The content inside the index servers is similar to the index in the back of a book: it tells which pages contain the words that match the query. 2. The query travels to the doc servers, which actually retrieve the stored documents. Snippets are generated to describe each search result. 3. The search results are returned to the user in a fraction of a second.

  7. Storing the web • More than 10 billion pages • Assume each page contains 1000 terms • Each term consists of 5 chars on average • Each term is stored as UTF characters, >= 2 bytes each • To store the web you need to search: • 10^10 x 10^3 x 5 x 2B ~= 100TB • What about: term statistics, hypertext info, pointers, search indexes, etc.? ~= PB • Do we really need all this data?

  8. Counting the web • Text statistics: • Term frequency • Collection frequency • Inverse document frequency … • Hypertext statistics: • Ingoing and outgoing links • Anchor text • Term positions, proximities, sizes, and characteristics …

  9. Searching the web • 100TB of data to be searched • We need to find such a large hard disk (currently the biggest are 250GB) • Hard disk transfer rate: 100MB/s • Time needed to sequentially scan the data: 1 million seconds • We have to wait for more than 10 days to get the answer to a query • That is not all …

  10. Problems in web search • Web crawling • Deal with limits, freshness, duplicates, missing links, loops, server problems, virtual hosts, etc. • Maintain large cluster of servers • Page servers: store and deliver the results of the queries • Index servers: resolve the queries • Answer 250 million user queries per day • Caching, replicating, parallel processing, etc. • Indexing, compression, coding, fast access, etc.

  11. Implementation issues • Analyze the collection • Avoid non-informative data for indexing • Decision on relevant statistics and info • Index the collection • Which index type to use? • How to organize the index? • Compress the data • Data compression • Index compression

  12. Overview • Brute force implementation • Text analysis • Indexing • Index coding and query processing • Web search engines • Wrap-up

  13. Term frequency • Count how many times a term occurs in the collection (size N terms) => frequency (f) • Order the terms in descending order of frequency => rank (r) • The product of the frequency of a word and its rank is approximately constant: f x r = C, C ~= N/10
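The rank-frequency relation above is easy to check empirically. A minimal Python sketch (the toy corpus is invented for illustration; on real text the product f x r is only roughly constant):

```python
from collections import Counter

def zipf_check(tokens):
    """Rank terms by descending frequency and report f * r for each rank.
    Under Zipf's law, f * r stays roughly constant (about N/10 for English)."""
    counts = Counter(tokens)
    ranked = sorted(counts.values(), reverse=True)
    return [(r, f, f * r) for r, f in enumerate(ranked, start=1)]

# Toy corpus: 'the' is the most frequent term, so it gets rank 1.
tokens = "the cat sat on the mat the dog sat".split()
for rank, freq, product in zipf_check(tokens):
    print(rank, freq, product)
```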

  14. Zipf distribution [figure: term count vs. terms by rank order, shown on a linear and on a logarithmic scale]

  15. Consequences • Few terms occur very frequently: a, an, the, … => non-informative (stop) words • Many terms occur very infrequently: spelling mistakes, foreign names, … => noise • Medium number of terms occur with medium frequency => useful

  16. Word resolving power (van Rijsbergen 79)

  17. Heaps’ law for dictionary size [figure: number of unique terms vs. collection size]

  18. Let’s store the web • Let’s remove: • Stop words: N/10 + N/20 + … • Noise words ~ N/1000 • UTF => ASCII • To store the web you need: • to use ~ 4/5 of the terms • 4/5 x 10^10 x 10^3 x 5 x 1B ~= 40TB • How to search this vast amount of data?

  19. Overview • Brute force implementation • Text analysis • Indexing • Index coding and query processing • Web search engines • Wrap-up

  20. Indexing • How would you index the web? • Document index • Inverted index • Postings • Statistical information • Evaluating a query • Can we really search the web index? • Bitmaps and signature files

  21. Example Stop words: in, the, it.

  22. Document index #docs x [log2#docs] + #u_terms x #docs x 8b + #u_terms x (5 x 8b + [log2#u_terms]) 10^10 x 5B + 10^6 x 10^10 x 1B + 10^6 x (5 x 1B + 4B) ~= 10PB

  23. Inverted index (1) 10^13 x (4B + 5B) + 10^6 x (5 x 1B + 4B) = 90TB

  24. Inverted index (2) 500 x 10^10 x (4B + 1B + 5B) + 10^6 x (5 x 1B + 4B) = 50TB

  25. Inverted index - Postings 500 x 10^10 x (5B + 1B) + 10^6 x (5 x 1B + 5B + 5B) = 30TB + 15MB < 40TB

  26. Inverted index - Statistics 500 x 10^10 x (5B + 1B) + 10^6 x (5 x 1B + 5B + 5B + 5B) = 30TB + 20MB

  27. Inverted index querying Cold and hot => doc1,doc4; score = 1/6 x 1/2 x 1/6 x 1/2 = 1/144

  28. Break: can we search the web? • Number of postings (term-document pairs): • Number of documents: ~10^10 • Average number of unique terms per document (document size ~1000): ~500 • Number of unique terms: ~10^6 • Formula: #docs x avg_tpd x ([log2#docs] + [log2max(tf)]) + #u_terms x (5 x [log2#char_size] + [log2N/10] + [log2#docs/10] + [log2(#docs x avg_tpd)]) • 500 x 10^10 x (5B + 1B) + 10^6 x (5 x 1B + 5B + 5B + 5B) = 3 x 10^13 + 2 x 10^7 ~= 30TB • Can we still make the search more efficient? • Yes, but let’s take a look at other indexing techniques
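The back-of-the-envelope estimate on this slide can be reproduced directly; a minimal sketch using the slide's own numbers (5B per doc-id, 1B per term frequency, a 10^6-entry lexicon):

```python
# Inverted-index size estimate, using the slide's figures:
docs = 10**10            # number of documents
avg_tpd = 500            # average unique terms per document
u_terms = 10**6          # dictionary (unique-term) size

postings = docs * avg_tpd * (5 + 1)      # 5B doc-id + 1B term frequency per posting
lexicon = u_terms * (5 * 1 + 5 + 5 + 5)  # 5-char term + pointer/statistics bytes
print(postings, lexicon)                 # 3 x 10^13 B (~30 TB) and 2 x 10^7 B (~20 MB)
```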

  29. Bitmaps • For every term in the dictionary a bitvector is stored • Each bit represents the presence or absence of the term in a document • Cold and pease => 100100 & 110000 = 100000 10^6 x (1GB + 5 x 1B) = 1PB
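The bitmap AND on this slide can be sketched with Python integers as bitvectors. The convention below (bit for document 0 leftmost, matching the slide's 100100 notation) and the six-document collection are assumptions for illustration:

```python
def make_bitmap(term_docs, n_docs):
    """Build a bitmap (as a Python int) with one bit per document;
    the bit for document 0 is the leftmost, as in the slide's notation."""
    bits = 0
    for d in term_docs:
        bits |= 1 << (n_docs - 1 - d)
    return bits

# Slide example over 6 documents: cold = 100100, pease = 110000.
cold = make_bitmap([0, 3], 6)
pease = make_bitmap([0, 1], 6)
# Boolean AND is a single bitwise operation over the two vectors:
print(format(cold & pease, "06b"))  # 100000: only doc 0 contains both terms
```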

  30. Signature files • A text index based on storing a signature for each text block, to be able to filter out some blocks quickly • A probabilistic method for indexing text • k hash functions generate n-bit values • Signatures of two different words can be identical

  31. Signature file example

  32. Signature file searching • If the corresponding word signature bits are set in the document, there is a high probability that the document contains the word. • Cold (1 & 4) => OK • Old (2, 3, 5 & 6) => not OK (2 & 5) => fetch the document at query time and check if the word occurs • Cold & hot: 1000 1010 0010 0100 (1 & 4) => OK • Reduce the false hits by increasing the number of bits per term signature 10^10 x (5B + 1KB) = 10PB
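The filter-then-check behaviour can be sketched with superimposed coding. Everything here is an illustrative assumption rather than the slide's exact scheme: md5-derived hash functions, n = 16 bits, k = 2 bits per word:

```python
import hashlib

def signature(word, n_bits=16, k=2):
    """Superimposed coding: k hash functions each set one bit in an n-bit
    signature. n and k are tuned in practice to bound the false-hit rate."""
    sig = 0
    for i in range(k):
        h = int(hashlib.md5(f"{i}:{word}".encode()).hexdigest(), 16)
        sig |= 1 << (h % n_bits)
    return sig

def block_signature(words, n_bits=16, k=2):
    # A text block's signature is the bitwise OR of its words' signatures.
    sig = 0
    for w in words:
        sig |= signature(w, n_bits, k)
    return sig

def maybe_contains(block_sig, word, n_bits=16, k=2):
    # If any of the word's bits is missing, the block definitely lacks the
    # word; if all are present, it *probably* does (false hits are possible,
    # so a real system fetches the block and verifies).
    s = signature(word, n_bits, k)
    return block_sig & s == s

block = block_signature(["pease", "porridge", "cold"])
print(maybe_contains(block, "cold"))  # True: all of cold's bits are set
```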

  33. Indexing - Recap • Inverted files • require less storage than the other two • more robust for ranked retrieval • can be extended for phrase/proximity search • numerous techniques exist for speed & storage space reduction • Bitmaps • an order of magnitude more storage than inverted files • efficient for Boolean queries

  34. Indexing – Recap 2 • Signature files • an order (or two) of magnitude more storage than inverted files • require unnecessary access to the main text because of false matches • no in-memory lexicon • insertions can be handled easily • Coded (compressed) inverted files are the state-of-the-art index structure used by most search engines

  35. Overview • Brute force implementation • Text analysis • Indexing • Index coding and query processing • Web search engines • Wrap-up

  36. Inverted file coding • The inverted file entries are usually stored in order of increasing document number • <retrieval; 7; [2, 23, 81, 98, 121, 126, 180]> (the term “retrieval” occurs in 7 documents, with document identifiers 2, 23, 81, 98, etc.)

  37. Query processing (1) • Each inverted file entry is an ascending sequence of integers • allows merging (joining) of two lists in a time linear in the size of the lists • Advanced Database Applications (211090): a merge join

  38. Query processing (2) • Usually queries are assumed to be conjunctive queries • query: information retrieval • is processed as information AND retrieval <retrieval; 7; [2, 23, 81, 98, 121, 126, 139]> <information; 9; [1, 14, 23, 45, 46, 84, 98, 111, 120]> • intersection of posting lists gives: [23, 98]
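The intersection above is the linear merge join from the previous slide; a minimal sketch over the slide's posting lists:

```python
def intersect(p1, p2):
    """Merge join of two ascending posting lists (AND query).
    Runs in time linear in the combined list lengths."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

retrieval = [2, 23, 81, 98, 121, 126, 139]
information = [1, 14, 23, 45, 46, 84, 98, 111, 120]
print(intersect(retrieval, information))  # [23, 98]
```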

  39. Query processing (3) • Remember the Boolean model? • intersection, union and complement are done on posting lists • so, information OR retrieval <retrieval; 7; [2, 23, 81, 98, 121, 126, 139]> <information; 9; [1, 14, 23, 45, 46, 84, 98, 111, 120]> • union of posting lists gives: [1, 2, 14, 23, 45, 46, 81, 84, 98, 111, 120, 121, 126, 139]

  40. Query processing (4) • Estimate of selectivity of terms: • Suppose information occurs on 1 billion pages • Suppose retrieval occurs on 10 million pages • size of postings (5 bytes per docid): • 1 billion * 5B = 5 GB for information • 10 million * 5B = 50 MB for retrieval • Hard disk transfer time: • 50 sec. for information + 0.5 sec. for retrieval • (ignore CPU time and disk latency)

  41. Query processing (6) • We just brought query processing down from 10 days to just 50.5 seconds (!) :-) • Still... way too slow... :-(

  42. Inverted file compression (1) • Trick 1, store the sequence of doc-ids: • <retrieval; 7; [2, 23, 81, 98, 121, 126, 180]> as a sequence of gaps • <retrieval; 7; [2, 21, 58, 17, 23, 5, 54]> • No information is lost. • Posting lists are always processed from the beginning, so the gaps are easily decoded into the original sequence
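Gap encoding and decoding are a one-line transform each; a minimal sketch over the slide's list:

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Encode an ascending doc-id list as gaps: first id, then differences."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Decode with a running sum; lists are always read from the start."""
    return list(accumulate(gaps))

ids = [2, 23, 81, 98, 121, 126, 180]
print(to_gaps(ids))             # [2, 21, 58, 17, 23, 5, 54]
print(from_gaps(to_gaps(ids)))  # round-trips to the original list
```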

  43. Inverted file compression (2) • Does it help? • maximum gap determined by the number of indexed web pages... • infrequent terms coded as a few large gaps • frequent terms coded by many small gaps • Trick 2: use variable byte length encoding.

  44. Variable byte encoding (1) •  code: represent number x as: • first bits as the unary code for • remainder bits as binary code for • unary part specifies how many bits are required to code the remainder part • For example x = 5: • first bits: 110 • remainder: 01
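A minimal sketch of the γ code, encoding into a bit string for readability (a real index would pack the bits):

```python
def gamma_encode(x):
    """Elias gamma code: unary code for the binary length of x, followed by
    the binary representation of x without its leading 1 bit. For x >= 1."""
    b = x.bit_length()                   # 1 + floor(log2 x)
    unary = "1" * (b - 1) + "0"
    remainder = format(x, "b")[1:]       # drop the leading 1 bit
    return unary + remainder

def gamma_decode(bits):
    b = bits.index("0") + 1              # unary part gives the length
    return int("1" + bits[b:], 2) if b > 1 else 1

print(gamma_encode(5))                   # 11001: unary 110, remainder 01
print(gamma_decode(gamma_encode(5)))     # 5
```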

  45. Variable byte encoding (2)

  46. Index sizes

  47. Index size of “our Google” • Number of postings (term-document pairs): • 10 billion documents • 500 unique terms on average • Assume on average 6 bits per doc-id • 500 x 10^10 x 6 bits ~= 4TB • about 15% of the uncompressed inverted file.

  48. Query processing on compressed index • size of postings (6 bits per doc-id): • 1 billion * 6 bits = 750 MB for information • 10 million * 6 bits = 7.5 MB for retrieval • Hard disk transfer time: • 7.5 sec. for information + 0.08 sec. for retrieval • (ignore CPU time, disk latency and decompression time)

  49. Query processing – Continued (1) • We just brought query processing down from 10 days to just 50.5 seconds... • and brought that down to 7.58 seconds :-) • but that is still too slow... :-(

  50. Early termination (1) • Suppose we re-sort the document ids for each posting such that the best documents come first • e.g., sort the document identifiers for retrieval by their tf.idf values. • <retrieval; 7; [98, 23, 180, 81, 121, 2, 126]> • then: the top 10 documents for retrieval can be retrieved very quickly: stop after processing the first 10 document ids from the posting list! • but compression and merging (multi-word queries) of postings are no longer possible...
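With an impact-ordered list, answering a top-k single-term query reduces to reading a prefix. A minimal sketch; the ordering below is assumed to be the slide's tf.idf re-sort:

```python
def top_k(postings, k=10):
    """Early termination on an impact-ordered posting list: doc-ids are
    pre-sorted by descending score, so the top-k answer is just the first k
    entries; we never read the rest of the list."""
    return postings[:k]

# Impact-ordered list for "retrieval" (best documents first):
retrieval = [98, 23, 180, 81, 121, 2, 126]
print(top_k(retrieval, k=3))  # [98, 23, 180]
```

The trade-off the slide notes falls out of this ordering: the doc-ids are no longer ascending, so gap encoding and the linear merge join both stop working.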
