
Information Retrieval


Presentation Transcript


  1. Information Retrieval January 6, 2003 Handout #1

  2. Course Information • Instructor: Dragomir R. Radev (radev@si.umich.edu) • Office: 3080, West Hall Connector • Phone: (734) 615-5225 • Office hours: TBA • Course page: http://tangra.si.umich.edu/~radev/650/ • Class meets on Mondays, 1-4 PM in 409 West Hall

  3. Introduction

  4. Demos • Google • Vivísimo • AskJeeves • NSIR • Lemur • MG

  5. Syllabus (Part I) CLASSIC IR
  • Week 1: The concept of information need; IR models; vector models; Boolean models
  • Week 2: Retrieval evaluation; precision and recall; F-measure; reference collections; the TREC conferences
  • Week 3: Queries and documents; query languages; natural language querying; relevance feedback
  • Week 4: Indexing and searching; inverted indexes
  • Week 5: XML retrieval
  • Week 6: Language modeling approaches

  6. Syllabus (Part II) WEB-BASED IR
  • Week 7: Crawling the Web; hyperlink analysis; measuring the Web
  • Week 8: Similarity and clustering; bottom-up and top-down paradigms
  • Week 9: Social network analysis for IR; hubs and authorities; PageRank and HITS
  • Week 10: Focused crawling; resource discovery; discovering communities
  • Week 11: Question answering
  • Week 12: Additional topics, e.g., relevance transfer
  • Week 13: Project presentations

  7. Readings
  BOOKS
  • Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley/ACM Press, 1999. http://www.sims.berkeley.edu/~hearst/irbook/
  • Soumen Chakrabarti, Mining the Web, Morgan Kaufmann, 2002. http://www.cse.iitb.ac.in/~soumen/
  PAPERS
  • Bharat and Broder, "A technique for measuring the relative size and overlap of public Web search engines", WWW 1998
  • Barabasi and Albert, "Emergence of scaling in random networks", Science (286) 509-512, 1999
  • Chakrabarti, van den Berg, and Dom, "Focused Crawling", WWW 1999
  • Davison, "Topical locality on the Web", SIGIR 2000
  • Dean and Henzinger, "Finding related pages in the World Wide Web", WWW 1999
  • Jeong and Barabási, "Diameter of the world wide web", Nature (401) 130-131, 1999
  • Hawking, Voorhees, Craswell, and Bailey, "Overview of the TREC-8 Web Track", TREC 2000
  • Haveliwala, "Topic-sensitive PageRank", WWW 2002
  • Lawrence and Giles, "Accessibility of information on the Web", Nature (400) 107-109, 1999
  • Lawrence and Giles, "Searching the World-Wide Web", Science (280) 98-100, 1998
  • Menczer, "Links tell us about lexical and semantic Web content", arXiv 2001
  • Menczer, "Growing and Navigating the Small World Web by Local Content", Proc. Natl. Acad. Sci. USA 99(22), 2002
  • Page, Brin, Motwani, and Winograd, "The PageRank citation ranking: Bringing order to the Web", Stanford TR, 1998
  • Radev, Fan, Qi, Wu, and Grewal, "Probabilistic Question Answering on the Web", WWW 2002
  • Radev et al., "Content Diffusion on the Web Graph"
  CASE STUDIES (IR SYSTEMS)
  • Lemur, MG, Google, AskJeeves, NSIR

  8. Assignments Homeworks: The course will have three homework assignments in the form of problem sets. Each problem set will include essay-type questions, questions designed to show understanding of specific concepts, and hands-on exercises involving existing IR engines. Project: The final course project can be done in three different formats: (1) a programming project implementing a challenging and novel information retrieval application, (2) an extensive survey-style research paper providing an exhaustive look at an area of IR, or (3) a SIGIR-style experimental IR paper.

  9. Grading • Three HW assignments (30%) • Project (30%) • Final (40%)

  10. Topics • IR systems • Evaluation methods • Indexing, search, and retrieval

  11. Need for IR • Advent of the WWW - more than 3 billion documents indexed by Google • How much information? http://www.sims.berkeley.edu/research/projects/how-much-info/ • Search, routing, filtering • User’s information need

  12. Some definitions of Information Retrieval (IR) Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.” Kowalski (1997): “An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects.”

  13. Examples of IR systems • Conventional (library catalog): search by keyword, title, author, etc. • Text-based (Lexis-Nexis, Google, FAST): search by keywords; limited search using queries in natural language. • Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, …). • Question answering systems (AskJeeves, NSIR, Answerbus): search in (restricted) natural language.

  14. Types of queries (AltaVista) Including or excluding words: To make sure that a specific word is always included in your search topic, place the plus (+) symbol before the keyword in the search box. To make sure that a specific word is always excluded from your search topic, place a minus (-) sign before the keyword in the search box. Example: To find recipes for cookies with oatmeal but without raisins, try recipe cookie +oatmeal -raisin. Expand your search using wildcards (*): By typing an * at the end of a keyword, you can search for the word with multiple endings. Example: Try wish* to find wish, wishes, wishful, wishbone, and wishy-washy.
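
A minimal Python sketch of these filters (the toy documents and the tokenizer are illustrative assumptions, not AltaVista's actual implementation; plain keywords are treated as optional, since AltaVista used them only for ranking):

    import re

    docs = [
        "recipe for oatmeal cookie",
        "raisin cookie recipe with oatmeal",
        "wishbone wishes and wishful thinking",
    ]

    def matches(doc, query):
        """+word requires a word, -word excludes it, word* matches any prefix."""
        tokens = set(re.findall(r"\w+", doc.lower()))
        for term in query.lower().split():
            if term.startswith("+") and term[1:] not in tokens:
                return False
            if term.startswith("-") and term[1:] in tokens:
                return False
            if term.endswith("*") and not any(t.startswith(term[:-1]) for t in tokens):
                return False
        return True

    print([d for d in docs if matches(d, "recipe cookie +oatmeal -raisin")])  # first doc only
    print([d for d in docs if matches(d, "wish*")])                           # third doc only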

  15. Types of queries AND (&) Finds only documents containing all of the specified words or phrases. Mary AND lamb finds documents with both the word Mary and the word lamb. OR (|) Finds documents containing at least one of the specified words or phrases. Mary OR lamb finds documents containing either Mary or lamb. The found documents could contain both, but do not have to. NOT (!) Excludes documents containing the specified word or phrase. Mary AND NOT lamb finds documents with Mary but not containing lamb. NOT cannot stand alone--use it with another operator, like AND. NEAR (~) Finds documents containing both specified words or phrases within 10 words of each other. Mary NEAR lamb would find the nursery rhyme, but likely not religious or Christmas-related documents.
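One way these operators could be evaluated over a toy collection, with term positions used to approximate NEAR's within-10-words rule (a sketch under assumed data; the collection and helper names are hypothetical, not from any production engine):

    docs = {
        1: "mary had a little lamb",
        2: "mary went to town",
        3: "the lamb of god",
    }

    def positions(term):
        """Map doc id -> positions where the term occurs."""
        return {d: [i for i, w in enumerate(text.split()) if w == term]
                for d, text in docs.items() if term in text.split()}

    def NEAR(t1, t2, k=10):
        """Docs where t1 and t2 occur within k words of each other."""
        p1, p2 = positions(t1), positions(t2)
        return {d for d in p1.keys() & p2.keys()
                if any(abs(i - j) <= k for i in p1[d] for j in p2[d])}

    mary, lamb = set(positions("mary")), set(positions("lamb"))
    print(mary & lamb)           # AND     -> {1}
    print(mary | lamb)           # OR      -> {1, 2, 3}
    print(mary - lamb)           # AND NOT -> {2}
    print(NEAR("mary", "lamb"))  # NEAR    -> {1}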

  16. Mappings and abstractions (from Korfhage’s book) [Figure: reality is abstracted into data; an information need is abstracted into a query]

  17. Documents • Not just printed paper • collections vs. documents • data structures: representations • document surrogates: keywords, summaries • encoding: ASCII, Unicode, etc.

  18. Typical IR system • (Crawling) • Indexing • Retrieval • User interface

  19. Sample queries (from Excite; reproduced verbatim, typos included)
  • In what year did baseball become an offical sport?
  • play station codes . com
  • birth control and depression
  • government "WorkAbility I"+conference
  • kitchen appliances
  • where can I find a chines rosewood tiger electronics
  • 58 Plymouth Fury
  • How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
  • emeril Lagasse
  • Hubble
  • M.S Subalaksmi
  • running

  20. Size matters • Typical document surrogate: 200 to 2000 bytes • Book: up to 3 MB of data • Stemming: computer, computational, computing

  21. Key Terms Used in IR • QUERY: a representation of what the user is looking for - can be a list of words or a phrase. • DOCUMENT: an information entity that the user wants to retrieve • COLLECTION: a set of documents • INDEX: a representation of information that makes querying easier • TERM: word or concept that appears in a document or a query

  22. Other important terms Classification, Cluster, Similarity, Information Extraction, Term Frequency, Inverse Document Frequency, Precision, Recall, Inverted File, Query Expansion, Relevance, Relevance Feedback, Stemming, Stopword, Vector Space Model, Weighting, TREC/TIPSTER/MUC

  23. Query structures • Query viewed as a document? • Length • repetitions • syntactic differences • Types of matches: • exact • range • approximate

  24. Additional references on IR • Gerard Salton, Automatic Text Processing, Addison-Wesley (1989) • Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997) • Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983) • C. J. van Rijsbergen, Information Retrieval, Butterworths (1979) • Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994) • ACM SIGIR Proceedings, SIGIR Forum • ACM conferences in Digital Libraries

  25. Related courses elsewhere • Berkeley (Marti Hearst and Ray Larson) http://www.sims.berkeley.edu/courses/is202/f00/ • Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze) http://www.stanford.edu/class/cs276a/ • Cornell (Jon Kleinberg) http://www.cs.cornell.edu/Courses/cs685/2002fa/ • CMU (Yiming Yang and Jamie Callan) http://la.lti.cs.cmu.edu/classes/11-741/

  26. Readings for weeks 1–3 MIR (Modern Information Retrieval)
  • Week 1: Chapter 1 “Introduction”; Chapter 2 “Modeling”; Chapter 3 “Evaluation”
  • Week 2: Chapter 4 “Query languages”; Chapter 5 “Query operations”
  • Week 3: Chapter 6 “Text and multimedia languages”; Chapter 7 “Text operations”; Chapter 8 “Indexing and searching”

  27. IR models

  28. Major IR models • Boolean • Vector • Probabilistic • Language modeling • Fuzzy • Latent semantic indexing

  29. Major IR tasks • Ad-hoc • Filtering and routing • Question answering • Spoken document retrieval • Multimedia retrieval

  30. Venn diagrams [Figure: two overlapping document sets D1 and D2, with regions labeled w, x, y, and z]

  31. Boolean model [Figure: Venn diagram of two sets A and B]

  32. Boolean queries restaurants AND (Mideastern OR vegetarian) AND inexpensive • What types of documents are returned? • Stemming • thesaurus expansion • inclusive vs. exclusive OR • confusing uses of AND and OR dinner AND sports AND symphony 4 OF (Pentium, printer, cache, PC, monitor, computer, personal)

  33. Boolean queries • Weighting (Beethoven AND sonatas) • Precedence: coffee AND croissant OR muffin; raincoat AND umbrella OR sunglasses • Use of negation: potential problems • Conjunctive and disjunctive normal forms • Full CNF and DNF

  34. Transformations • De Morgan’s Laws: NOT (A AND B) = (NOT A) OR (NOT B) NOT (A OR B) = (NOT A) AND (NOT B) • CNF or DNF? • Reference librarians prefer CNF - why?
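The laws are easy to verify exhaustively; a two-assertion truth-table check in Python:

    from itertools import product

    # Enumerate all truth assignments for A and B and check both laws.
    for A, B in product([False, True], repeat=2):
        assert (not (A and B)) == ((not A) or (not B))
        assert (not (A or B)) == ((not A) and (not B))
    print("De Morgan's laws hold for every assignment")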

  35. Boolean model • Partition • Partial relevance? • Operators: AND, NOT, OR, parentheses

  36. Exercise • D1 = “computer information retrieval” • D2 = “computer retrieval” • D3 = “information” • D4 = “computer information” • Q1 = “information AND retrieval” • Q2 = “information AND NOT computer”
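
Reading Q1 as information AND retrieval and Q2 as information AND NOT computer, the exercise reduces to set operations; a sketch:

    docs = {
        "D1": "computer information retrieval",
        "D2": "computer retrieval",
        "D3": "information",
        "D4": "computer information",
    }

    def docs_with(term):
        """Set of documents whose text contains the term."""
        return {d for d, text in docs.items() if term in text.split()}

    q1 = docs_with("information") & docs_with("retrieval")
    q2 = docs_with("information") - docs_with("computer")
    print(q1)  # {'D1'}
    print(q2)  # {'D3'}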

  37. Exercise ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))

  38. Stop lists • 250-300 most common words in English account for 50% or more of a given text. • Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%. • Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%). • Token/type ratio: 2256/859 = 2.63
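
These counts are easy to reproduce on any text; a sketch (the one-line string is a stand-in for a real chapter):

    from collections import Counter

    text = "the quick brown fox jumps over the lazy dog the fox"  # stand-in text
    tokens = text.lower().split()
    types = Counter(tokens)

    print("tokens:", len(tokens))                         # 11
    print("types:", len(types))                           # 8
    print("token/type ratio:", len(tokens) / len(types))  # ~1.38
    covered = sum(c for _, c in types.most_common(2))
    print("top-2 type coverage:", covered / len(tokens))  # 5/11, ~0.45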

  39. Vector-based representation [Figure: documents Doc 1, Doc 2, and Doc 3 plotted as vectors in a space whose axes are Term 1, Term 2, and Term 3]

  40. Vector queries • Each document is represented as a vector • inefficient representations (bit vectors) • dimensional compatibility

  41. The matching process • Matching is done between a document and a query - topicality • document space • characteristic function F(d) ∈ {0,1} • distance vs. similarity - mapping functions • Euclidean distance, Manhattan distance, word overlap
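
Sketches of the distance and overlap measures named above, over simple term-count vectors (pure Python; nothing assumed beyond the standard math module):

    import math

    def euclidean(d, q):
        """Straight-line distance between two term vectors."""
        return math.sqrt(sum((di - qi) ** 2 for di, qi in zip(d, q)))

    def manhattan(d, q):
        """Sum of per-dimension absolute differences."""
        return sum(abs(di - qi) for di, qi in zip(d, q))

    def word_overlap(doc_words, query_words):
        """Count of distinct words shared by document and query."""
        return len(set(doc_words) & set(query_words))

    print(euclidean([1, 3], [3, 1]))   # ~2.83
    print(manhattan([1, 3], [3, 1]))   # 4
    print(word_overlap("computer information retrieval".split(),
                       "information retrieval".split()))  # 2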

  42. Vector-based matching • The cosine measure: σ(D,Q) = Σ (dᵢ × qᵢ) / ( √(Σ dᵢ²) × √(Σ qᵢ²) ) • Intrinsic vs. extrinsic measures

  43. Exercise • Compute the cosine measures σ(D1,D2) and σ(D1,D3) for the documents D1 = <1,3>, D2 = <100,300>, and D3 = <3,1> • Compute the corresponding Euclidean distances.
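
A sketch that works the exercise numerically, implementing the cosine formula from the previous slide alongside Euclidean distance:

    import math

    def cosine(d, q):
        """Cosine measure: dot product over the product of vector lengths."""
        num = sum(di * qi for di, qi in zip(d, q))
        den = (math.sqrt(sum(di ** 2 for di in d)) *
               math.sqrt(sum(qi ** 2 for qi in q)))
        return num / den

    def euclidean(d, q):
        return math.sqrt(sum((di - qi) ** 2 for di, qi in zip(d, q)))

    D1, D2, D3 = (1, 3), (100, 300), (3, 1)
    print(cosine(D1, D2))     # 1.0, the vectors are parallel
    print(cosine(D1, D3))     # 0.6
    print(euclidean(D1, D2))  # ~313.07
    print(euclidean(D1, D3))  # ~2.83

Note how D1 and D2 are maximally similar under cosine yet far apart under Euclidean distance: cosine compares direction only and ignores vector length.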

  44. Matrix representations • Term-document matrix (m x n) • term-term matrix (m x m x n) • document-document matrix (n x n) • Example: 3,000,000 documents (n) with 50,000 terms (m) • sparse matrices • Boolean vs. integer matrices
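
At the scale in the example (3,000,000 documents by 50,000 terms) a dense matrix is out of the question, so only nonzero cells are stored; a dict-of-dicts sketch of a sparse term-document matrix:

    from collections import defaultdict

    docs = {
        "D1": "computer information retrieval",
        "D2": "computer retrieval",
        "D3": "information",
    }

    # term -> {doc -> count}; cells that would be zero are simply absent
    tdm = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for term in text.split():
            tdm[term][doc_id] += 1

    print(dict(tdm["information"]))       # {'D1': 1, 'D3': 1}
    print(tdm["retrieval"].get("D3", 0))  # 0, stored nowhere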

  45. Zipf’s law • Rank × Frequency ≈ Constant
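
A quick way to observe the pattern on any sizable text (the file path is a placeholder, not a provided corpus):

    from collections import Counter
    import re

    text = open("corpus.txt").read().lower()   # placeholder: any large text file
    freqs = Counter(re.findall(r"[a-z]+", text))

    # For Zipf-like data, rank * frequency stays roughly constant down the list.
    for rank, (word, freq) in enumerate(freqs.most_common(10), start=1):
        print(rank, word, freq, rank * freq)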

  46. Evaluation

  47. Contingency table

                      retrieved     not retrieved
      relevant            w               x          n1 = w + x
      not relevant        y               z
                      n2 = w + y                     N = w + x + y + z

  48. Precision and Recall • Recall = w / (w + x) • Precision = w / (w + y)
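
In code the two measures fall directly out of the contingency counts; a sketch with illustrative numbers:

    def recall(w, x):
        """Relevant documents retrieved over all relevant documents."""
        return w / (w + x)

    def precision(w, y):
        """Relevant documents retrieved over all retrieved documents."""
        return w / (w + y)

    w, x, y = 30, 20, 10   # illustrative counts, not from a real run
    print("recall:", recall(w, x))        # 30 / 50 = 0.6
    print("precision:", precision(w, y))  # 30 / 40 = 0.75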
