
Introduction to Information Retrieval


Presentation Transcript


  1. Introduction to Information Retrieval. Slides by me.

  2. The Inverted Index

  3. Indexing • Indexing is a technique borrowed from databases • An index is a data structure that supports efficient lookups in a large data set • E.g., hash indexes, R-trees, B-trees, etc.

  4. Document Retrieval • In search engines, the lookups have to find all documents that contain query terms. • What’s the problem with using a tree-based index? • A hash index?

  5. Inverted Index An inverted index stores an entry for every word, and a pointer to every document where that word is seen.
     Vocabulary → Postings List
     Word1 → Document17, Document45123
     . . .
     WordN → Document991, Document123001

  6. Example Document D1: “yes we got no bananas” Document D2: “what you got” Document D3: “yes I like what you got” Query “you got”: “you” → {D2, D3} “got” → {D1, D2, D3} The whole query gives the intersection: “you got” → {D2, D3} ∩ {D1, D2, D3} = {D2, D3}
     Vocabulary → Postings List
     yes → D1, D3
     we → D1
     got → D1, D2, D3
     no → D1
     bananas → D1
     what → D2, D3
     you → D2, D3
     I → D3
     like → D3
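
A minimal Python sketch of this record-level index (my illustration, not code from the slides): build_index collects a postings set per term, and search intersects the postings of the query terms.

    from collections import defaultdict

    def build_index(docs):
        """Map each term to the set of document IDs that contain it."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def search(index, query):
        """Return the documents containing ALL query terms (postings intersection)."""
        postings = [index.get(term, set()) for term in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    docs = {"D1": "yes we got no bananas",
            "D2": "what you got",
            "D3": "yes I like what you got"}
    print(search(build_index(docs), "you got"))  # {'D2', 'D3'}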

  7. Variations • Record-level index stores just document identifiers in the postings list • Word-level index stores document IDs and offsets for the positions of the words in each document • Supports phrase-based searches (why?) • Real search engines add all kinds of other information to their postings lists (see below).
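
To illustrate why offsets enable phrase search, here is a hedged sketch of a word-level index (the function names are mine): a phrase matches a document when each successive term occurs at the next offset.

    from collections import defaultdict

    def build_positional_index(docs):
        """Map term -> doc_id -> list of word offsets."""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, term in enumerate(text.lower().split()):
                index[term][doc_id].append(pos)
        return index

    def phrase_search(index, phrase):
        """Return documents where the phrase terms appear at consecutive offsets."""
        terms = phrase.lower().split()
        hits = set()
        for doc_id, positions in index.get(terms[0], {}).items():
            for p in positions:
                if all(p + i in index.get(t, {}).get(doc_id, [])
                       for i, t in enumerate(terms)):
                    hits.add(doc_id)
        return hits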

  8. Index Construction Algorithm: 1. Scan through each document, word by word, writing a (term, docID) pair for each word to a TempIndex file 2. Sort TempIndex by terms 3. Iterate through the sorted TempIndex: merge all entries for the same term into one postings list.
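
An in-memory sketch of these three steps (a real system keeps TempIndex on disk, which is what the next slide is about):

    from itertools import groupby

    def construct_index(docs):
        # Step 1: scan each document and emit a (term, docID) pair per word
        pairs = [(term, doc_id)
                 for doc_id, text in docs.items()
                 for term in text.lower().split()]
        # Step 2: sort the TempIndex by term (and docID, so postings stay ordered)
        pairs.sort()
        # Step 3: merge all entries for the same term into one postings list
        return {term: sorted({doc_id for _, doc_id in group})
                for term, group in groupby(pairs, key=lambda pair: pair[0])}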

  9. Efficient Index Construction Problem: indexes can be huge. How can we efficiently build them? • Blocked Sort-Based Indexing (BSBI) • Single-Pass In-Memory Indexing (SPIMI) What’s the difference?

  10. Ranking Results

  11. Problem: Too many matching results for every query Using an inverted index is all well and good, but if your document collection has 10^12 documents and someone searches for “banana”, they’ll get 90 million results. • We need to be able to return the “most relevant” results. • We need to rank the results.

  12. Documents as Vectors Example: Document D1: “yes we got no bananas” Document D2: “what you got” Document D3: “yes I like what you got” With the vocabulary ordered (yes, we, got, no, bananas, what, you, I, like), the term-count vectors are: Vector V1: (1, 1, 1, 1, 1, 0, 0, 0, 0) Vector V2: (0, 0, 1, 0, 0, 1, 1, 0, 0) Vector V3: (1, 0, 1, 0, 0, 1, 1, 1, 1)

  13. What about queries? In the vector space model, queries are treated as (very short) documents. Example query: “bananas” Query Q1: (0, 0, 0, 0, 1, 0, 0, 0, 0) in the same vocabulary order

  14. Measuring Similarity Similarity metric: the size of the angle between document vectors. “Cosine Similarity”: cos(d, q) = (d · q) / (‖d‖ ‖q‖)

  15. Ranking documents Query Q1: (0, 0, 0, 0, 1, 0, 0, 0, 0) Vector V1: cos(Q1, V1) = 1/√5 ≈ 0.45 Vector V2: cos(Q1, V2) = 0 Vector V3: cos(Q1, V3) = 0 Only D1 contains “bananas”, so it ranks first.
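
Slides 12-15 fit in a few lines of Python; this sketch (the helper names are mine) builds raw term-count vectors over the slide 6 vocabulary and ranks the example documents by cosine similarity for the query “bananas”.

    import math

    VOCAB = ["yes", "we", "got", "no", "bananas", "what", "you", "i", "like"]

    def to_vector(text):
        """Raw term-count vector over the fixed vocabulary."""
        words = text.lower().split()
        return [words.count(term) for term in VOCAB]

    def cosine(u, v):
        """Cosine of the angle between two vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norms if norms else 0.0

    docs = {"D1": "yes we got no bananas",
            "D2": "what you got",
            "D3": "yes I like what you got"}
    q = to_vector("bananas")
    print(sorted(docs, key=lambda d: cosine(to_vector(docs[d]), q), reverse=True))
    # ['D1', 'D2', 'D3']: only D1 contains "bananas"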

  16. All words are equal? The TF-IDF measure weights different words by more or less, depending on how informative they are: tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) counts occurrences of term t in document d, df(t) is the number of documents containing t, and N is the total number of documents.
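
A sketch of that weighting under the definition above (the slide does not fix a variant, so this assumes raw counts and a plain log idf): note that “got”, which appears in every example document, gets weight 0.

    import math

    def tf_idf_vectors(docs, vocab):
        """Weight each term count by log(N / df): rare terms score higher."""
        N = len(docs)
        df = {t: sum(1 for text in docs.values() if t in text.lower().split())
              for t in vocab}
        return {doc_id: [text.lower().split().count(t) * math.log(N / df[t])
                         if df[t] else 0.0
                         for t in vocab]
                for doc_id, text in docs.items()}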

  17. Compare Document Classification and Document Retrieval/Ranking • Similarities: • Differences:

  18. Synonymy

  19. Handling Synonymy in Retrieval Problem: Straightforward search for a term may miss the most relevant results, because those documents use a synonym of the term. Examples: Search for “Burma” will miss documents containing only “Myanmar” Search for “document classification” will miss results for “text classification” Search for “scientists” will miss results for “physicists”, “chemists”, etc.

  20. Two approaches • Convert retrieval into a classification or clustering problem • Relevance Feedback (classification) • Pseudo-relevance Feedback (clustering) • Expand the query to include synonyms or other relevant terms • Thesaurus-based • Automatic query expansion

  21. Relevance Feedback Algorithm: 1. User issues a query q 2. System returns initial results D1 3. User labels some results (relevant or not) 4. System learns a classifier/ranker for relevance 5. System returns new result set D2

  22. Relevance Feedback as Text Classification • The system gets a set of labeled documents (+ = relevant, - = not relevant) • This is exactly the input to a standard text classification problem • Solution: convert labeled documents into vectors, then apply standard learning • Rocchio, Naïve Bayes, k-NN, SVM, …
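
As one concrete instance, here is a hedged sketch of the Rocchio update named above: move the query vector toward the centroid of the labeled-relevant documents and away from the non-relevant centroid, then re-rank with the updated query. The alpha/beta/gamma weights are conventional defaults, not values given in the slides.

    def rocchio(query_vec, relevant, non_relevant,
                alpha=1.0, beta=0.75, gamma=0.15):
        """Return the feedback-adjusted query vector (negative weights clipped to 0)."""
        def centroid(vectors):
            if not vectors:
                return [0.0] * len(query_vec)
            return [sum(column) / len(vectors) for column in zip(*vectors)]
        rel, nonrel = centroid(relevant), centroid(non_relevant)
        return [max(0.0, alpha * q + beta * r - gamma * n)
                for q, r, n in zip(query_vec, rel, nonrel)]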

  23. Details • In relevance feedback, there are few labeled examples • Efficiency is a concern • The user is waiting online during training and testing • Output is a ranking, not a binary classification • But most classifiers can be converted into rankers, e.g., Naïve Bayes can rank according to the probability score, an SVM can rank according to wᵀx + b

  24. Pseudo Relevance Feedback IDEA: instead of waiting for the user to provide relevance judgements, just use the top-K documents to represent the + (relevant) class • It’s a somewhat mind-bending thought, but this actually works in practice. • Essentially, this is like one iteration of K-means clustering!
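
Continuing the sketches above (this reuses to_vector, cosine, and rocchio from earlier), pseudo-relevance feedback just treats the top-k of the initial ranking as the relevant class and searches again:

    def pseudo_relevance_feedback(docs, query, k=2):
        """Rank once, assume the top-k results are relevant, update, rank again."""
        q = to_vector(query)
        ranked = sorted(docs, key=lambda d: cosine(to_vector(docs[d]), q),
                        reverse=True)
        top_k = [to_vector(docs[d]) for d in ranked[:k]]
        q2 = rocchio(q, relevant=top_k, non_relevant=[])
        return sorted(docs, key=lambda d: cosine(to_vector(docs[d]), q2),
                      reverse=True)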

  25. Clickstream Mining (aka “indirect relevance feedback”) IDEA: use the clicks that users make as proxies for relevance judgments For example, if the search engine returns 10 documents for “bananas”, and users consistently click on the third link first, then increase the rank of that document and similar ones.

  26. Query Expansion IDEA: help users formulate “better” queries “better” can mean • More precise, to exclude more unrelated stuff • More inclusive, to increase recall of documents that wouldn’t match a basic query

  27. Query Term Suggestion Problem: Given a base query q, suggest a list of terms T = {t1, …, tK} that could help the user refine the query. One common technique is to suggest terms that frequently “co-occur” with terms already in the base query.

  28. Co-occurrence Terms t1 and t2 “co-occur” if they occur near each other in the same document. There are many measures of co-occurrence, including pointwise mutual information (PMI), mutual information (MI), LSI-based scores, and others.
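
As a hedged sketch of the first measure, document-level PMI compares how often two terms actually appear together against how often they would co-occur by chance: PMI(t1, t2) = log(P(t1, t2) / (P(t1) · P(t2))), with probabilities estimated from document frequencies.

    import math

    def pmi(t1, t2, docs):
        """Document-level PMI; -inf if the terms never co-occur."""
        term_sets = [set(text.lower().split()) for text in docs.values()]
        N = len(term_sets)
        p1 = sum(t1 in s for s in term_sets) / N
        p2 = sum(t2 in s for s in term_sets) / N
        p12 = sum(t1 in s and t2 in s for s in term_sets) / N
        return math.log(p12 / (p1 * p2)) if p12 else float("-inf")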

  29. Computing Co-occurrence Example Build the term-document matrix A, where A[t, d] = 1 if term t occurs in document d and 0 otherwise (rows are terms, columns are documents).

  30. Computing Co-occurrence Example With A indexed as A[t, d], the term-term co-occurrence matrix is C = A Aᵀ (equivalently AᵀA if A is stored document-by-term): the entry C[t, t′] counts the documents in which terms t and t′ both occur.
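
A sketch of slides 29-30 on the running example, using numpy (the binary convention for A is an assumption; raw counts would also work):

    import numpy as np

    VOCAB = ["yes", "we", "got", "no", "bananas", "what", "you", "i", "like"]
    docs = ["yes we got no bananas", "what you got", "yes i like what you got"]

    # A[t, d] = 1 if term t occurs in document d (rows = terms, columns = documents)
    A = np.array([[1 if t in d.split() else 0 for d in docs] for t in VOCAB])

    # C[t, t'] = number of documents where terms t and t' co-occur
    C = A @ A.T
    print(C[VOCAB.index("you"), VOCAB.index("got")])  # 2 ("you" and "got" share D2, D3)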

  31. Query Log Mining IDEA: use other people’s queries as suggestions for refinements of this query. Example: If I type “google” into the search bar, the search engine can suggest follow-up words that other people used, like: “maps”, “earth”, “translate”, “wave”, …
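
A toy sketch of this idea (the log contents here are invented for illustration): count which words follow the base query in logged queries and suggest the most frequent.

    from collections import Counter

    def suggest(base, query_log, k=3):
        """Suggest the k words that most often follow `base` in past queries."""
        followers = Counter()
        for q in query_log:
            words = q.lower().split()
            if words and words[0] == base:
                followers.update(words[1:])
        return [word for word, _ in followers.most_common(k)]

    log = ["google maps", "google earth", "google maps directions", "google translate"]
    print(suggest("google", log))  # ['maps', 'earth', 'directions']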
