
Information Retrieval


Presentation Transcript


  1. Information Retrieval. CSE 8337, Spring 2003. Introduction/Overview.
Material for these slides obtained from:
• Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/
• Data Mining: Introductory and Advanced Topics by Margaret H. Dunham, http://www.engr.smu.edu/~mhd/book

  2. Motivation
• IR: representation, storage, organization of, and access to information items
• Focus is on the user's information need
• Example of a user information need: find all docs containing information on college tennis teams that (1) are maintained by a USA university and (2) participate in the NCAA tournament
• Emphasis is on the retrieval of information (not data)

  3. DB vs IR
• Records (tuples) vs. documents
• Well-defined results vs. fuzzy results
• DB grew out of files and traditional business systems
• IR grew out of library science and the need to categorize/group/access books/articles

  4. DB vs IR (cont’d)
• Data retrieval
  • Which docs contain a set of keywords?
  • Well-defined semantics
  • A single erroneous object implies failure!
• Information retrieval
  • Information about a subject or topic
  • Semantics is frequently loose
  • Small errors are tolerated
• IR system:
  • Interprets the contents of information items
  • Generates a ranking that reflects relevance
  • The notion of relevance is most important

  5. Motivation
• IR in the last 20 years:
  • Classification and categorization
  • Systems and languages
  • User interfaces and visualization
• Still, the area was seen as one of narrow interest
• The advent of the Web changed this perception once and for all:
  • Universal repository of knowledge
  • Free (low-cost) universal access
  • No central editorial board
  • Many problems, though: IR is seen as key to finding the solutions!

  6. Basic Concepts: The User Task
[Figure: the user task, with Retrieval and Browsing as two modes of interacting with the database]
• Retrieval
  • Information or data
  • Purposeful
• Browsing
  • Glancing around, e.g. cars, Le Mans, France, tourism

  7. Basic Concepts: Logical View of the Documents
[Figure: pipeline from docs to index terms via text operations: structure recognition; accents and spacing; stopword removal; noun groups; stemming; manual indexing]
• Document representation is viewed as a continuum, from full text down to a small set of index terms
• The logical view of the docs might shift along this continuum
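To make the pipeline concrete, here is a minimal sketch of such text operations in Python. The stopword list and the suffix-stripping rules are illustrative stand-ins, not the slides' method; a real system would use a full stopword list and a proper stemmer such as Porter's algorithm.

```python
import re

# Toy stopword list and suffix rules (illustrative assumptions only).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on"}
SUFFIXES = ("ing", "ed", "es", "s")

def text_operations(doc: str) -> list[str]:
    """Reduce a document from full text toward index terms."""
    tokens = re.findall(r"[a-z]+", doc.lower())         # lowercase and keep only letter runs
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    stems = []
    for t in tokens:
        for suffix in SUFFIXES:                         # naive suffix stripping, a stand-in for stemming
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems

print(text_operations("The tennis teams participating in the NCAA tournament"))
# -> ['tenni', 'team', 'participat', 'ncaa', 'tournament']
```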

  8. The Retrieval Process
[Figure: the retrieval process. The user's need, expressed through the user interface, is turned by text operations into a logical view; query operations produce a query that is searched against the index (an inverted file built by indexing the text database through the DB manager module); the retrieved docs are ranked, and the ranking may be refined via user feedback. The numbers in the figure (2, 4-8, 10) appear to be chapter references in Baeza-Yates and Ribeiro-Neto.]
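The index in this figure is an inverted file: a mapping from each term to the documents that contain it. A minimal sketch (the function name and toy documents are illustrative, not from the slides):

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Build an inverted file: each term maps to the ids of the docs containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "data mining and databases",
    2: "web information retrieval",
    3: "mining web data",
}
index = build_inverted_index(docs)
print(sorted(index["mining"]))  # [1, 3]
print(sorted(index["web"]))     # [2, 3]
```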

  9. Fuzzy Sets and Logic
• Fuzzy set: the set membership function is a real-valued function with output in the range [0,1]
• f(x): the degree to which x is in F (read on these slides, informally, as the probability that x is in F)
• 1 - f(x): the degree to which x is not in F
• Example:
  • T = {x | x is a person and x is tall}
  • Let f(x) be the degree to which x is tall
  • Here f is the membership function
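A minimal sketch of such a membership function; the 160/190 cm cutoffs and the linear ramp are illustrative assumptions, not from the slides:

```python
def tall(height_cm: float) -> float:
    """Membership function f for the fuzzy set T of tall people.
    Returns a degree in [0, 1]; the cutoffs are made-up for illustration."""
    if height_cm <= 160.0:
        return 0.0                         # definitely not tall
    if height_cm >= 190.0:
        return 1.0                         # definitely tall
    return (height_cm - 160.0) / 30.0      # linear ramp in between

for h in (155, 175, 195):
    print(h, tall(h))  # 155 -> 0.0, 175 -> 0.5, 195 -> 1.0
```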

  10. Fuzzy Sets

  11. IR is Fuzzy
[Figure: two accept/reject decision boundaries, a crisp ("simple") step cutoff vs. a fuzzy, gradual one]

  12. Information Retrieval
• Information retrieval (IR): retrieving desired information from textual data
• Settings: library science, digital libraries, Web search engines
• Traditionally keyword based (see the sketch below)
• Sample query: find all documents about “data mining”
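One common reading of such a keyword query is a Boolean AND over an inverted index: a document matches if it contains every keyword. A self-contained sketch under that assumption (the toy index mirrors the one built after slide 8):

```python
def keyword_search(index: dict[str, set[int]], query: str) -> set[int]:
    """Boolean AND retrieval: return the docs containing every query keyword."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Toy inverted index (term -> doc ids), e.g. as built in the earlier sketch.
index = {"data": {1, 3}, "mining": {1, 3}, "web": {2, 3}}
print(keyword_search(index, "data mining"))  # {1, 3}
```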

  13. Information Retrieval (cont’d)
• Similarity: a measure of how close a query is to a document
• Documents that are “close enough” are retrieved
• Metrics:
  • Precision = |Relevant ∩ Retrieved| / |Retrieved|
  • Recall = |Relevant ∩ Retrieved| / |Relevant|
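A worked example of these two metrics over sets of doc ids; the relevant and retrieved sets are made up for illustration:

```python
def precision_recall(relevant: set[int], retrieved: set[int]) -> tuple[float, float]:
    """Precision = |Relevant ∩ Retrieved| / |Retrieved|;
    Recall = |Relevant ∩ Retrieved| / |Relevant|."""
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {1, 2, 3, 4}   # docs that actually answer the query (made up)
retrieved = {3, 4, 5}     # docs the system returned (made up)
p, r = precision_recall(relevant, retrieved)
print(p, r)  # 0.666... and 0.5: 2 of the 3 retrieved docs are relevant; 2 of the 4 relevant docs were found
```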

  14. IR Query Result Measures
[Figure: Venn diagram over the document space for an IR query, showing the retrieved set, the relevant set, and their overlap, which is exactly what the precision and recall measures above compare]
