
Document Preprocessing and Indexing SI650: Information Retrieval


Presentation Transcript


  1. Document Preprocessing and Indexing. SI650: Information Retrieval, Winter 2010. School of Information, University of Michigan

  2. Typical IR system architecture (diagram): documents go through INDEXING to produce a Doc Rep; the user's query becomes a Query Rep; in SEARCHING, Ranking matches the Query Rep against the Doc Rep to produce results, shown to the User through the INTERFACE; the user's judgments drive Feedback and QUERY MODIFICATION. - From ChengXiang Zhai's slides

  3. Overload of text content - Ramakrishnan and Tomkins 2007

  4. Data volume behind online information systems (figure: approximate volumes for various systems, ranging from ~150k/day and ~750k/day, through 1M, ~3M/day, 6M, and 10B, up to ~100B items)

  5. IR Winter 2010 … Automated indexing/labeling Storing, indexing and searching text. Inverted indexes. …

  6. (Sec. 1.1) Handling large collections • Life is good when every document is mapped into a vector of words, but … • Consider N = 1 million documents, each with about 1,000 words • Avg 6 bytes/word including spaces/punctuation => 6 GB of data in the documents • Say there are M = 500K distinct terms among these

  7. (Sec. 1.1) Storage issue • A 500K × 1M term-document matrix has half a trillion elements • 4 bytes per integer => 500K × 1M × 4 = 2 TB (your laptop would fail) • At web scale, 500K × 100G × 4 = 2×10^5 TB (challenging even for Google) • But the matrix has no more than one billion positive entries • The matrix is extremely sparse • Storing only the nonzero entries: 1000 × 1M × 4 = 4 GB • What's a better representation?

  8. Indexing • Indexing = Convert documents to data structures that enable fast search • Inverted index is the dominating indexing method (used by all search engines) • Other indices (e.g., document index) may be needed for feedback

  9. Inverted index • Instead of an incidence vector, use a posting table • CLEVELAND: D1, D2, D6 • OHIO: D1, D5, D6, D7 • Use linked lists to be able to insert new document postings in order and to remove existing postings. • More efficient than scanning docs (why?)

  10. Inverted index • Fast access to all docs containing a given term (along with frequency and position information) • For each term, we get a list of tuples • (docID, freq, pos). • Given a query, we can fetch the lists for all query terms and work on the involved documents. • Boolean query: set operation • Natural language query: term weight summing • Keep everything sorted! This gives you a logarithmic improvement in access.

  11. (Sec. 1.2) Inverted index - example. For each term t, we must store a list of all documents that contain t; identify each by a docID, a document serial number. Dictionary → Postings:
  Brutus → 1, 2, 4, 11, 31, 45, 173, 174
  Caesar → 1, 2, 4, 5, 6, 16, 57, 132
  Calpurnia → 2, 31, 54, 101
  - From Chris Manning's slides

  12. Inverted index - example Doc 1 Dictionary Postings This is a sample document with one sample sentence Doc 2 This is another sample document - From ChengXiang Zhai’s slides
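The two sample documents above can be inverted into the (docID, freq, positions) postings described earlier with a short sketch (lowercased whitespace tokenization and the function name `build_index` are my assumptions, not part of the slides):

```python
from collections import defaultdict

def build_index(docs):
    """Build a positional inverted index: term -> list of (docID, freq, [positions])."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)          # record every occurrence position
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return dict(index)

docs = ["This is a sample document with one sample sentence",
        "This is another sample document"]
index = build_index(docs)
print(index["sample"])   # [(1, 2, [3, 7]), (2, 1, [3])]
```

The posting for "sample" shows exactly the tuple shape from slide 10: it occurs twice in Doc 1 (positions 3 and 7) and once in Doc 2.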

  13. Basic operations on inverted indexes • Conjunction (AND) – iterative merge of the two postings: O(x+y) • Disjunction (OR) – very similar • Negation (NOT) – can we still do it in O(x+y)? • Example: MICHIGAN AND NOT OHIO • Example: MICHIGAN OR NOT OHIO • Recursive operations • Optimization: start with the smallest sets
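The iterative O(x+y) merges above can be sketched directly on sorted docID lists (the MICHIGAN and OHIO postings here are made-up values for illustration):

```python
def intersect(p1, p2):
    """MICHIGAN AND OHIO: iterative merge of two sorted docID lists, O(x+y)."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])   # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def and_not(p1, p2):
    """MICHIGAN AND NOT OHIO: docIDs in p1 but not in p2, still O(x+y)."""
    i = j = 0
    result = []
    while i < len(p1):
        if j < len(p2) and p2[j] < p1[i]:
            j += 1
        elif j < len(p2) and p2[j] == p1[i]:
            i += 1                 # skip: docID appears in the negated list
            j += 1
        else:
            result.append(p1[i])
            i += 1
    return result

michigan, ohio = [1, 2, 4, 6, 8], [2, 3, 6, 7]
print(intersect(michigan, ohio))   # [2, 6]
print(and_not(michigan, ohio))     # [1, 4, 8]
```

MICHIGAN OR NOT OHIO is the hard case: the complement of OHIO is nearly the whole collection, so it cannot be computed by merging just these two lists in O(x+y).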

  14. Data structures for inverted index • Dictionary: modest size • Needs fast random access • Preferred to be in memory • Hash table, B-tree, trie, … • Postings: huge • Sequential access is expected • Can stay on disk • May contain docID, term freq., term pos, etc • Compression is desirable

  15. Constructing inverted index • The main difficulty is to build a huge index with limited memory • Memory-based methods: not usable for large collections • Sort-based methods: • Step 1: collect local (termID, docID, freq) tuples • Step 2: sort local tuples (to make “runs”) • Step 3: pair-wise merge runs • Step 4: Output inverted file

  16. Sort-based inversion (figure): documents are parsed into (termID, docID, freq) tuples using a term lexicon (the→1, cold→2, days→3, a→4, …) and a docID lexicon (doc1→1, doc2→2, doc3→3, …, doc300); Parse & Count produces tuples sorted by doc-id (<1,1,3> <2,1,2> <3,1,1> … <1,2,2> <3,2,3> <4,2,2> … <1,300,3> <3,300,1> …); a "local" sort orders each run by term-id (<1,1,3> <1,2,2> <2,1,2> <2,4,3> … <1,299,3> <1,300,1> …); a final merge sort makes all info about term 1 contiguous (<1,1,3> <1,2,2> <1,5,2> <1,6,3> … <1,300,3> <2,1,2> … <5000,299,1> <5000,300,1> …)
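Steps 2-4 of the sort-based method can be shown in miniature, assuming the runs are already sorted by (termID, docID); here the in-memory `heapq.merge` stands in for the pair-wise merge of on-disk runs:

```python
import heapq
from itertools import groupby

# Two sorted "runs" of (termID, docID, freq) tuples, as produced by step 2
run1 = [(1, 1, 3), (1, 2, 2), (2, 1, 2), (3, 1, 1)]
run2 = [(1, 5, 3), (2, 4, 3), (3, 2, 3)]

# Step 3: merge the runs so that all tuples for a term become contiguous
merged = list(heapq.merge(run1, run2))

# Step 4: group by termID to output the inverted file
inverted = {term: [(doc, freq) for _, doc, freq in group]
            for term, group in groupby(merged, key=lambda t: t[0])}
print(inverted[1])   # all postings for term 1: [(1, 3), (2, 2), (5, 3)]
```

In a real indexer the runs live on disk and are streamed, but the merge logic is the same.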

  17. IR Winter 2010 … Document preprocessing. Tokenization. Stemming. The Porter algorithm. …

  18. Can we make it even better? • Index term selection/normalization • Reduce the size of the vocabulary • Index compression • Reduce the space of storage

  19. Should we index every term? • How big is English? • Dictionary marketing • Education (testing of vocabulary size) • Psychology • Statistics • Linguistics • Two very different answers: • Chomsky: language is infinite • Shannon: 1.25 bits per character • Should we care about a term if nobody uses it as a query?

  20. What is a good indexing term? • Specific (phrases) or general (single word)? • Luhn found that words with middle frequency are the most useful • Not too specific (low utility, but still useful!) • Not too general (lack of discrimination; stop words) • Stop-word removal is common, but rare words are kept • All words or a (controlled) subset? When term weighting is used, it becomes a matter of weighting rather than selecting indexing terms (more later)

  21. Term selection for indexing • Manual: e.g., Library of Congress subject headings, MeSH • Automatic: e.g., TF*IDF based

  22. LOC subject headings
  A -- GENERAL WORKS
  B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
  C -- AUXILIARY SCIENCES OF HISTORY
  D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
  E -- HISTORY: AMERICA
  F -- HISTORY: AMERICA
  G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
  H -- SOCIAL SCIENCES
  J -- POLITICAL SCIENCE
  K -- LAW
  L -- EDUCATION
  M -- MUSIC AND BOOKS ON MUSIC
  N -- FINE ARTS
  P -- LANGUAGE AND LITERATURE
  Q -- SCIENCE
  R -- MEDICINE
  S -- AGRICULTURE
  T -- TECHNOLOGY
  U -- MILITARY SCIENCE
  V -- NAVAL SCIENCE
  Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
  http://www.loc.gov/catdir/cpso/lcco/lcco.html

  23. Medicine
  CLASS R - MEDICINE, Subclass R
  R5-920 Medicine (General)
  R5-130.5 General works
  R131-687 History of medicine. Medical expeditions
  R690-697 Medicine as a profession. Physicians
  R702-703 Medicine and the humanities. Medicine and disease in relation to history, literature, etc.
  R711-713.97 Directories
  R722-722.32 Missionary medicine. Medical missionaries
  R723-726 Medical philosophy. Medical ethics
  R726.5-726.8 Medicine and disease in relation to psychology. Terminal care. Dying
  R727-727.5 Medical personnel and the public. Physician and the public
  R728-733 Practice of medicine. Medical practice economics
  R735-854 Medical education. Medical schools. Research
  R855-855.5 Medical technology
  R856-857 Biomedical engineering. Electronics. Instrumentation
  R858-859.7 Computer applications to medicine. Medical informatics
  R864 Medical records
  R895-920 Medical physics. Medical radiology. Nuclear medicine

  24. Automatic term selection methods • TF*IDF: pick terms with the highest TF*IDF scores • Centroid-based: pick terms that appear in the centroid with high scores • The maximal marginal relevance principle (MMR) • Related to summarization, snippet generation
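The TF*IDF bullet can be sketched with one simple variant (raw term frequency and idf = log(N/df); the function name `tfidf_terms` and these exact weighting choices are my assumptions, not prescribed by the slides):

```python
import math
from collections import Counter

def tfidf_terms(docs, doc_index, k=3):
    """Rank the terms of one document by tf * log(N/df); return the top k."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # df counts each term once per document it appears in
    df = Counter(term for toks in tokenized for term in set(toks))
    tf = Counter(tokenized[doc_index])
    scores = {t: tf[t] * math.log(N / df[t]) for t in tf}
    return sorted(scores, key=scores.get, reverse=True)[:k]

docs = ["the cat sat on the mat", "the dog sat", "the cat ate"]
print(tfidf_terms(docs, 0))   # "the" scores 0: it appears in every document
```

Note how a term appearing in all N documents gets idf = log(1) = 0 and is never selected, which is the intuition behind dropping overly general terms.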

  25. Non-English languages • Arabic: كتاب ("book") • Japanese: この本は重い。 ("This book is heavy.") • Chinese: 信息檢索 ("information retrieval") • German: Lebensversicherungsgesellschaftsangestellter ("life-insurance-company employee")

  26. Document preprocessing • What should we use to index? • Dealing with formatting and encoding issues • Hyphenation, accents, stemming, capitalization • Tokenization: • Paul’s, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc, can’t • Example: “The New York-Los Angeles flight”

  27. Document preprocessing • Normalization: • Casing (cat vs. CAT) • Stemming (computer, computation) • String matching • Labeled/labelled, extraterrestrial/extra-terrestrial/extra terrestrial, Qaddafi/Kadhafi/Ghadaffi • Index reduction • Dropping stop words (“and”, “of”, “to”) • Problematic for “to be or not to be”

  28. Tokenization • Normalize lexical units: Words with similar meanings should be mapped to the same indexing term • Stemming: Mapping all inflectional forms of words to the same root form, e.g. • computer -> compute • computation -> compute • computing -> compute (but king->k?) • Porter’s Stemmer is popular for English

  29. Porter's algorithm • Example: the word "duplicatable" • duplicat (rule from step 4) • duplicate (rule from step 1b1) • duplic (rule from step 3) • Another rule in step 4, which would remove "ic," cannot be applied, since only one rule from each step is allowed to be applied.

  30. Porter’s algorithm

  31. Links • http://maya.cs.depaul.edu/~classes/ds575/porter.html • http://www.tartarus.org/~martin/PorterStemmer/def.txt

  32. IR Winter 2010 … Approximate string matching …

  33. Approximate string matching • The Soundex algorithm (Odell and Russell) • Uses: • spelling correction • hash function • non-recoverable

  34. The Soundex algorithm
  1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions
  2. Assign the following numbers to the remaining letters after the first:
  • b, f, p, v : 1
  • c, g, j, k, q, s, x, z : 2
  • d, t : 3
  • l : 4
  • m, n : 5
  • r : 6

  35. The Soundex algorithm 3. if two or more letters with the same code were adjacent in the original name, omit all but the first 4. Convert to the form “LDDD” by adding terminal zeros or by dropping rightmost digits Examples: Euler: E460, Gauss: G200, Hilbert: H416, Knuth :K530, Lloyd: L300 same as Ellery, Ghosh, Heilbronn, Kant, and Ladd Some problems: Rogers and Rodgers, Sinclair and StClair
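Steps 1-4 above translate almost directly into code; a minimal sketch (the function name and the choice to let the first letter's code participate in the adjacency check are implementation decisions):

```python
def soundex(name):
    """Odell-Russell Soundex, following steps 1-4 above."""
    code_of = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            code_of[ch] = digit          # a, e, h, i, o, u, w, y get no code
    name = name.upper()
    result = name[0]                     # step 1: retain the first letter
    prev = code_of.get(name[0], "")
    for ch in name[1:]:
        digit = code_of.get(ch, "")
        if digit and digit != prev:      # step 3: collapse adjacent equal codes
            result += digit              # step 2: code the remaining letters
        prev = digit
    return (result + "000")[:4]          # step 4: pad/truncate to "LDDD"

for name in ["Euler", "Gauss", "Hilbert", "Knuth", "Lloyd"]:
    print(name, soundex(name))   # E460 G200 H416 K530 L300
```

It also reproduces the slide's failure case: Rogers (R262) and Rodgers (R326) hash differently despite sounding alike.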

  36. Levenshtein edit distance • Examples: • Theatre-> theater • Ghaddafi->Qadafi • Computer->counter • Edit distance (inserts, deletes, substitutions) • Edit transcript • Done through dynamic programming

  37. Recurrence relation • Three dependencies • D(i, 0) = i • D(0, j) = j • D(i, j) = min[D(i-1, j)+1, D(i, j-1)+1, D(i-1, j-1)+t(i, j)] • Simple edit distance: • t(i, j) = 0 if S1(i) = S2(j), else 1 • Target: D(l1, l2)
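The recurrence fills an (l1+1) × (l2+1) table by dynamic programming; a minimal sketch (the function name `edit_distance` is mine):

```python
def edit_distance(s1, s2):
    """Levenshtein distance via the recurrence above."""
    l1, l2 = len(s1), len(s2)
    D = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    for i in range(l1 + 1):
        D[i][0] = i                           # D(i, 0) = i
    for j in range(l2 + 1):
        D[0][j] = j                           # D(0, j) = j
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # delete
                          D[i][j - 1] + 1,        # insert
                          D[i - 1][j - 1] + t)    # substitute (or match)
    return D[l1][l2]

print(edit_distance("theatre", "theater"))   # 2
```

Recording which of the three options was chosen at each cell yields the edit transcript (the tracebacks of the next slides).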

  38. Example (figure from Gusfield 1997)

  39. Example, cont'd (figure from Gusfield 1997)

  40. Tracebacks (figure from Gusfield 1997)

  41. Weighted edit distance • Used to emphasize the relative cost of different edit operations • Useful in bioinformatics • Homology information • BLAST • BLOSUM • http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html

  42. Links • Web sites: • http://www.merriampark.com/ld.htm • http://odur.let.rug.nl/~kleiweg/lev/ • Demo: • http://nayana.ece.ucsb.edu/imsearch/imsearch.html

  43. IR Winter 2010 … Index Compression IR Toolkits …

  44. Inverted index compression • Compress the postings • Observations • Inverted list is sorted (e.g., by docID or term freq) • Small numbers tend to occur more frequently • Implications • "d-gap" (store differences): d1, d2-d1, d3-d2, … • Exploit the skewed frequency distribution: fewer bits for small (high-frequency) integers • Binary code, unary code, γ-code, δ-code

  45. Integer compression • In general, to exploit a skewed distribution • Binary: equal-length coding • Unary: x ≥ 1 is coded as x-1 one bits followed by a 0, e.g., 3 => 110; 5 => 11110 • γ-code: x => unary code for 1+⌊log₂ x⌋ followed by a uniform code for x − 2^⌊log₂ x⌋ in ⌊log₂ x⌋ bits, e.g., 3 => 101, 5 => 11001 • δ-code: same as γ-code, but replace the unary prefix with its γ-code, e.g., 3 => 1001, 5 => 10101
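The d-gap and γ-code bullets can be checked with a few lines, using the Brutus postings from slide 11 and the slide's own examples (3 => 101, 5 => 11001); function names are mine:

```python
def d_gaps(doc_ids):
    """Replace sorted docIDs with differences: d1, d2-d1, d3-d2, ..."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def unary(x):
    """x >= 1 coded as x-1 one bits followed by a 0."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """Elias gamma-code: unary(1 + floor(log2 x)), then the b offset bits of x."""
    b = x.bit_length() - 1             # floor(log2 x)
    return unary(b + 1) + bin(x)[3:]   # bin(x)[3:] drops "0b1", leaving the offset

print(d_gaps([1, 2, 4, 11, 31, 45, 173, 174]))  # mostly small, compressible gaps
print(unary(3), gamma(3), gamma(5))             # 110 101 11001
```

Most gaps are far smaller than the raw docIDs, which is exactly why variable-length codes that favor small integers pay off.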

  46. Text compression • Compress the dictionaries • Methods • Fixed length codes • Huffman coding • Ziv-Lempel codes

  47. Fixed length codes • Binary representations • ASCII • Representational power (2^k symbols, where k is the number of bits)

  48. Variable length codes • Alphabet (Morse code):
  A .-      N -.     0 -----
  B -...    O ---    1 .----
  C -.-.    P .--.   2 ..---
  D -..     Q --.-   3 ...--
  E .       R .-.    4 ....-
  F ..-.    S ...    5 .....
  G --.     T -      6 -....
  H ....    U ..-    7 --...
  I ..      V ...-   8 ---..
  J .---    W .--    9 ----.
  K -.-     X -..-
  L .-..    Y -.--
  M --      Z --..
  • Demo: http://www.scphillips.com/morse/

  49. Most frequent letters in English • Some are more frequently used than others… • Most frequent letters: • E T A O I N S H R D L U • Demo: • http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm • Also: bigrams: • TH HE IN ER AN RE ND AT ON NT

  50. Huffman coding • Developed by David Huffman (1952) • Average of 5 bits per character (37.5% compression) • Based on frequency distributions of symbols • Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols
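The iterative tree-building step can be sketched with a heap; here codes are accumulated bottom-up in dicts instead of an explicit tree, and at least two distinct symbols are assumed (both are my implementation choices):

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Repeatedly merge the two least-frequent entries; return {symbol: bitstring}."""
    # Each heap entry: (total frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}         # left subtree: prefix 0
        merged.update({s: "1" + c for s, c in right.items()})  # right subtree: prefix 1
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes(Counter("this is a sample sentence"))
print(codes)   # frequent symbols (space, "s", "e") receive the shortest codes
```

The resulting code is prefix-free, so a compressed bit stream can be decoded unambiguously left to right.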
