1 / 24

Basic IR Concepts & Techniques

Basic IR Concepts & Techniques. ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign. Text Information Systems Applications. Mining. Access. Select information. Create Knowledge. Add Structure/Annotations. Organization.

langer
Download Presentation

Basic IR Concepts & Techniques

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Basic IR Concepts & Techniques ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

  2. Text Information Systems Applications Mining Access Select information Create Knowledge Add Structure/Annotations Organization

  3. Two Modes of Information Access: Pull vs. Push • Pull Mode • Users take initiative and “pull” relevant information out from a text information system (TIS) • Works well when a user has an ad hoc information need • Push Mode • Systems take initiative and “push” relevant information to users • Works well when a user has a stable information need or the system has good knowledge about a user’s need

  4. Pull Mode: Querying vs. Browsing • Querying • A user enters a (keyword) query, and the system returns relevant documents • Works well when the user knows exactly what keywords to use • Browsing • The system organizes information with structures, and a user navigates into relevant information by following a path enabled by the structures • Works well when the user wants to explore information or doesn’t know what keywords to use, or can’t conveniently enter a query (e.g., with a smartphone)

  5. Information Seeking as Sightseeing • Sightseeing: Know address of an attraction? • Yes: take a taxi and go directly to the site • No: walk around or take a taxi to a nearby place then walk around • Information seeking: Know exactly what you want to find? • Yes: use the right keywords as a query and find the information directly • No: browse the information space or start with a rough query and then browse Querying is faster, but browsing is useful when querying fails or a user wants to explore

  6. Text Mining: Two Different Views • Data Mining View: Explore patterns in textual data • Find latent topics • Find topical trends • Find outliers and other hidden patterns • Natural Language Processing View: Make inferences based on partial understanding of natural language text • Information extraction • Knowledge representation + inferences • Often mixed in practice

  7. Applications of Text Mining • Direct applications • Discovery-driven (Bioinformatics, Business Intelligence, etc): We have specific questions; how can we exploit data mining to answer the questions? • Data-driven (WWW, literature, email, customer reviews, etc): We have a lot of data; what can we do with it? • Indirect applications • Assist information access (e.g., discover latent topics to better summarize search results) • Assist information organization (e.g., discover hidden structures)

  8. IR Topics (Broader View): Text Information Systems (TIS) Retrieval Applications Summarization Visualization Mining Applications Filtering Clustering Information Organization Information Access Knowledge Acquisition Search Extraction Topic Analysis Categorization Natural Language Content Analysis Text

  9. Elements of TIS: Natural Language Content Analysis • Natural Language Processing (NLP) is the foundation of TIS • Enable understanding of meaning of text • Provide semantic representation of text for TIS • Current NLP techniques mostly rely on statistical machine learning enhanced with limited linguistic knowledge • Shallow techniques are robust, but deeper semantic analysis is only feasible for very limited domain • Some TIS capabilities require deeper NLP than others • Most text information systems use very shallow NLP (“bag of words” representation)

  10. Elements of TIS: Text Access • Search: take a user’s query and return relevant documents • Filtering/Recommendation: monitor an incoming stream and recommend to users relevant items (or discard non-relevant ones) • Categorization: classify a text object into one of the predefined categories • Summarization: take one or multiple text documents, and generate a concise summary of the essential content

  11. Elements of TIS: Text Mining • Topic Analysis: take a set of documents, extract and analyze topics in them • Information Extraction: extract entities, relations of entities or other “knowledge nuggets” from text • Clustering: discover groups of similar text objects (terms, sentences, documents, …) • Visualization: visually display patterns in text data

  12. IR Topics (narrow view) docs 4. Efficiency & scalability INDEXING Query Rep query 3. Document representation/structure Doc Rep 6. User interface (browsing) User Ranking SEARCHING 1. Evaluation 2. Retrieval (Ranking) Models results 5. Search result summarization/presentation INTERFACE Feedback judgments 7. Feedback/Learning QUERY MODIFICATION LEARNING Our focus: 1, 2, 7

  13. Typical TR System Architecture judgments Feedback docs query Tokenizer Doc Rep (Index) Query Rep User Scorer Indexer results Index

  14. Tokenization • Normalize lexical units: Words with similar meanings should be mapped to the same indexing term • Stemming: Mapping all inflectional forms of words to the same root form, e.g. • computer -> compute • computation -> compute • computing -> compute • Some languages (e.g., Chinese) pose challenges in word segmentation

  15. Indexing • Indexing = Convert documents to data structures that enable fast search • Inverted index is the dominating indexing method (used by all search engines): basic idea is to enable quick look up of all the documents containing a particular term • Other indices (e.g., document index) may be needed for feedback

  16. How to Design a Ranking Function? • Query q = q1,…,qm, where qi  V • Document d = d1,…,dn, where di  V • Ranking function: f(q, d)  • A good ranking function should rank relevant documents on top of non-relevant ones • Key challenge: how to measure the likelihood that document d is relevant to query q? • Retrieval Model = formalization of relevance (give a computational definition of relevance)

  17. Many Different Retrieval Models • Similarity-based models: • a document that is more similar to a query is assumed to be more likely relevant to the query • relevance (d,q) = similarity (d,q) • e.g., Vector Space Model • Probabilistic models (language models): • compute the probability that a given document is relevant to a query based on a probabilistic model • relevance(d,q) = p(R=1|d,q), where R {0,1} is a binary random variable • E.g., Query Likelihood

  18. Relevance Feedback Results: d1 3.5 d2 2.4 … dk 0.5 ... Retrieval Engine Query Updated query User Document collection Judgments: d1 + d2 - d3 + … dk - ... Feedback Users make explicit relevance judgments on the initial results (judgments are reliable, but users don’t want to make extra effort)

  19. Pseudo/Blind/Automatic Feedback top 10 assumed relevant Top-k initial results are simply assumed to be relevant (judgments aren’t reliable, but no user activity is required) Results: d1 3.5 d2 2.4 … dk 0.5 ... Retrieval Engine Query Updated query Document collection Judgments: d1 + d2 + d3 + … dk - ... Feedback

  20. Implicit Feedback Results: d1 3.5 d2 2.4 … dk 0.5 ... Retrieval Engine Query Updated query User Document collection Clickthroughs: d1 + d2 - d3 + … dk - ... Feedback User-clicked docs are assumed to be relevant; skipped ones non-relevant (judgments aren’t completely reliable, but no extra effort from users)

  21. Evaluation: Two Different Reasons • Reason 1: So that we can assess how useful an IR system/technology would be (for an application) • Measures should reflect the utility to users in a real application • Usually done through user studies (interactive IR evaluation) • Reason 2: So that we can compare different systems and methods (to advance the state of the art) • Measures only need to be correlated with the utility to actual users, thus don’t have to accurately reflect the exact utility to users • Usually done through test collections (test set IR evaluation)

  22. What to Measure? • Effectiveness/Accuracy: how accurate are the search results? • Measuring a system’s ability of ranking relevant docucments on top of non-relevant ones • Efficiency: how quickly can a user get the results? How much computing resources are needed to answer a query? • Measuring space and time overhead • Usability: How useful is the system for real user tasks? • Doing user studies

  23. The Cranfield Evaluation Methodology • A methodology for laboratory testing of system components developed in 1960s • Idea: Build reusable test collections & define measures • A sample collection of documents (simulate real document collection) • A sample set of queries/topics (simulate user queries) • Relevance judgments (ideally made by users who formulated the queries)  Ideal ranked list • Measures to quantify how well a system’s result matches the ideal ranked list • A test collection can then be reused many times to compare different systems

  24. What You Should Know • Information access modes: pull vs. push • Pull mode: querying vs. browsing • Basic elements of TIS: • search, filtering/recommendation, categorization, summarization • topic analysis, information extraction, clustering, visualization • Know the terms of the major concepts and techniques (e.g., query, document, retrieval model, feedback, evaluation, inverted index, etc)

More Related