CSM06 Information Retrieval

Presentation Transcript

  1. CSM06 Information Retrieval Lecture 1b – IR Basics Dr Andrew Salway a.salway@surrey.ac.uk

  2. Requirements for IR systems
  When developing or evaluating an IR system the first considerations are…
  • Who are the users of information retrieval systems? General public or specialist researchers?
  • What kinds of information do they want to retrieve? Text, image, audio or video? General or specialist information?

  3. Information Access Process
  Most uses of an information retrieval system can be characterised by this generic process:
  1. Start with an information need
  2. Select a system / collections to search
  3. Formulate a query
  4. Send query
  5. Receive results (i.e. information items)
  6. Scan, evaluate, interpret results
  7. Reformulate query and go to (4) OR stop
  From Baeza-Yates and Ribeiro-Neto (1999), p. 263
  NB. When doing IR on the web a user can browse away from the results returned in step 5 – this may change the process.
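
As a rough illustration, this loop can be sketched in code; the `search`, `reformulate` and `satisfied` functions below are hypothetical stand-ins for the chosen system and the user's own judgement, not part of the lecture material:

```python
# A minimal sketch of the information access loop (assumptions as noted above).
def information_access(initial_query, search, reformulate, satisfied, max_rounds=5):
    """Send a query, scan the results, then reformulate or stop."""
    query, results = initial_query, []
    for _ in range(max_rounds):
        results = search(query)              # steps 4-5: send query, receive results
        if satisfied(results):               # step 6: scan, evaluate, interpret
            break                            # stop
        query = reformulate(query, results)  # step 7: reformulate and go back to step 4
    return results

# Example usage with trivial placeholder behaviour:
results = information_access(
    "industrial revolution urban population",
    search=lambda q: [f"document about {q}"],
    reformulate=lambda q, r: q + " victorian england",
    satisfied=lambda r: len(r) > 0,
)
print(results)
```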

  4. Information Need → Query
  Verbal queries:
  • Single-word queries: a list of words
  • Context queries: phrase (“ ”); proximity (NEAR)
  • Boolean queries: use AND, OR, BUT
  • Natural Language: from sentence to whole text
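
To make the query types concrete, here is a small Python sketch evaluating single-word, Boolean and phrase queries against a toy collection; the documents and helper names are my own illustration, not from the lecture:

```python
# Toy documents and an inverted index (term -> set of document ids).
docs = {
    1: "the industrial revolution and the urban population",
    2: "population growth in victorian england",
    3: "the french revolution",
}
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# Single-word query: documents containing the term.
def single_word(term):
    return index.get(term, set())

# Boolean queries: combine posting sets with AND, OR, BUT (i.e. AND NOT).
def boolean_and(a, b): return single_word(a) & single_word(b)
def boolean_or(a, b):  return single_word(a) | single_word(b)
def boolean_but(a, b): return single_word(a) - single_word(b)

# Context (phrase) query: a naive substring check on the raw text.
def phrase(words):
    return {d for d, text in docs.items() if words in text}

print(boolean_and("revolution", "population"))  # {1}
print(boolean_but("revolution", "french"))      # {1}
print(phrase("victorian england"))              # {2}
```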

  5. Information Need → Query
  EXERCISE: Tina is a user of an information retrieval system who is researching how the industrial revolution affected the urban population in Victorian England.
  • How could her information need be expressed with the different types of query described above?
  • What are the advantages / disadvantages of each query type?

  6. “Ad-hoc” Retrieval Problem
  The ad-hoc retrieval problem is commonly faced by IR systems, especially web search engines. It takes the form “return information items on topic t” – where t is a string of one or more terms characterising a user’s information need.
  • For large collections this needs to happen automatically.
  • Note there is not a fixed list of topics!
  • So, the IR system should return documents relevant to the query.
  • Ideally it will rank the documents in order of relevance, so the user sees the most relevant first.

  7. Generic Architecture of a Text IR System
  • Based on Baeza-Yates and Ribeiro-Neto (1999), Modern Information Retrieval, Figure 1.3, p. 10.

  8. [Architecture diagram showing the components: User Interface, Text Operations, Indexing, Query Operations, Searching, Ranking, the INDEX and the Text Database.]
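
As a rough illustration of how those components fit together for ad-hoc retrieval, here is a minimal Python sketch; the class and method names are my own, and the ranking is a simple term-overlap score rather than any particular retrieval model:

```python
class TinyTextIR:
    """Illustrative pipeline: text operations -> indexing -> query operations -> searching -> ranking."""

    def __init__(self):
        self.index = {}   # term -> set of document ids (the INDEX)
        self.docs = {}    # document id -> raw text (the text database)

    def text_operations(self, text):
        # Text operations: here just lowercasing and whitespace tokenisation.
        return text.lower().split()

    def add_document(self, doc_id, text):
        # Indexing: record each term's postings in the index.
        self.docs[doc_id] = text
        for term in self.text_operations(text):
            self.index.setdefault(term, set()).add(doc_id)

    def search(self, query):
        # Query operations: apply the same text operations used for indexing.
        terms = self.text_operations(query)
        # Searching: collect candidate documents from the index.
        candidates = set().union(*(self.index.get(t, set()) for t in terms))
        # Ranking: score each candidate by how many query terms it contains.
        scored = [(sum(t in self.text_operations(self.docs[d]) for t in terms), d)
                  for d in candidates]
        return [d for _, d in sorted(scored, reverse=True)]

ir = TinyTextIR()
ir.add_document("d1", "The Industrial Revolution and the urban population")
ir.add_document("d2", "Population change in Victorian England")
print(ir.search("industrial revolution population"))  # ['d1', 'd2']
```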

  9. IR compared with data retrieval, and knowledge retrieval • Data Retrieval, e.g. an SQL query to a well-structured database – if the data is stored you get exactly what you want

  10. IR compared with data retrieval, and knowledge retrieval • Information Retrieval – returns information items from an unstructured source; the user must still interpret them

  11. IR compared with data retrieval, and knowledge retrieval • Knowledge Retrieval (see current Information Extraction technology) – answers specific questions by analysing an unstructured information source, e.g. the user could ask “What is the capital of France?” and the system would answer “Paris” by ‘reading’ a book about France
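
To make the contrast concrete, here is a small sketch (my own illustration) of data retrieval via an exact SQL query next to keyword-based information retrieval over unstructured text:

```python
import sqlite3

# Data retrieval: an exact, structured query against a well-structured table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE capitals (country TEXT, capital TEXT)")
con.execute("INSERT INTO capitals VALUES ('France', 'Paris'), ('Spain', 'Madrid')")
print(con.execute("SELECT capital FROM capitals WHERE country = 'France'").fetchone())
# -> ('Paris',)  you get exactly what is stored, or nothing.

# Information retrieval: keyword matching over unstructured text; the user
# still has to read and interpret the returned items.
documents = {
    "doc1": "Paris is the capital of France and its largest city.",
    "doc2": "France is famous for its wine regions.",
}
query_terms = {"capital", "france"}
hits = [d for d, text in documents.items()
        if query_terms & set(text.lower().replace(".", "").split())]
print(hits)  # -> ['doc1', 'doc2']  relevant items, not a direct answer.
```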

  12. How Good is an IR System?
  • We need ways to measure how good an IR system is, i.e. evaluation metrics
  • Systems should return relevant information items (texts, images, etc.); systems may rank the items in order of relevance
  Two ways to measure the performance of an IR system:
  • Precision = “how many of the retrieved items are relevant?”
  • Recall = “how many of the items that should have been retrieved were retrieved?”
  • These should be objective measures.
  • Both require humans to make decisions about what documents are relevant for a given query.

  13. Calculating Precision and Recall
  R = number of documents in the collection relevant to topic t
  A(t) = number of documents returned by the system in response to query t
  C = number of ‘correct’ (relevant) documents returned, i.e. the intersection of R and A(t)
  PRECISION = (C / A(t)) * 100%
  RECALL = (C / R) * 100%
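
A short Python sketch of these two measures; the function names and the worked numbers are my own illustration:

```python
def precision(correct, returned):
    """How many of the retrieved items are relevant, as a percentage."""
    return 100.0 * correct / returned if returned else 0.0

def recall(correct, relevant_in_collection):
    """How many of the relevant items were retrieved, as a percentage."""
    return 100.0 * correct / relevant_in_collection if relevant_in_collection else 0.0

# Hypothetical example: 20 relevant documents exist in the collection,
# the system returns 10 documents, of which 8 are relevant.
print(precision(8, 10))  # 80.0
print(recall(8, 20))     # 40.0
```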

  14. EXERCISE
  • Amanda and Alex each need to choose an information retrieval system. Amanda works for an intelligence agency, so getting all possible information about a topic is important for the users of her system. Alex works for a newspaper, so getting some relevant information quickly is more important for the journalists using his system.
  • See below for statistics for two information retrieval systems (Search4Facts and InfoULike) when they were used to retrieve documents from the same document collection in response to the same query: there were 100,000 documents in the collection, of which 50 were relevant to the given query. Which system would you advise Amanda to choose and which would you advise Alex to choose? Your decisions should be based on the evaluation metrics of precision and recall.
  Search4Facts:
  • Number of Relevant Documents Returned = 12
  • Total Number of Documents Returned = 15
  InfoULike:
  • Number of Relevant Documents Returned = 48
  • Total Number of Documents Returned = 295
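
For a quick check of the arithmetic, the `precision` and `recall` functions sketched above can be applied to the two systems' statistics; which numbers matter more for Amanda and which for Alex is left to the exercise:

```python
# The collection holds 100,000 documents, 50 of them relevant to the query.
# Search4Facts returned 15 documents, 12 of them relevant:
print(precision(12, 15), recall(12, 50))    # 80.0 24.0
# InfoULike returned 295 documents, 48 of them relevant:
print(precision(48, 295), recall(48, 50))   # ~16.3 96.0
```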

  15. Precision and Recall: refinements • May plot graphs of P against R for single queries (see Belew 2000, Table 4.2 and Figs. 4.10 and 4.11) • These graphs are unstable for single queries so may need to combine P/R curves for multiple queries…
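
One common way to obtain such a graph is to walk down a single ranked result list and record precision at each recall point. Below is a matplotlib sketch under that assumption; the ranked list and relevance judgements are made up for illustration:

```python
import matplotlib.pyplot as plt

# Made-up ranked result list for one query: True = relevant, False = not relevant.
ranked_relevance = [True, False, True, True, False, False, True, False]
total_relevant = 5   # assumed number of relevant documents in the collection

precisions, recalls, hits = [], [], 0
for rank, is_relevant in enumerate(ranked_relevance, start=1):
    if is_relevant:
        hits += 1
        precisions.append(hits / rank)          # precision at this rank
        recalls.append(hits / total_relevant)   # recall at this rank

plt.plot(recalls, precisions, marker="o")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-recall curve for a single query")
plt.show()
```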

  16. Reference Collections: TREC
  • A reference collection comprises a set of documents and a set of queries for which all relevant documents have been identified: size is important!!
  • TREC = Text REtrieval Conference
  • TREC Collection = 6GB of text (millions of documents – mainly news related)!!
  • See http://trec.nist.gov/

  17. Reference Collections: Cystic Fibrosis Collection
  • The Cystic Fibrosis collection comprises 1239 documents from the National Library of Medicine’s MEDLINE database + 100 information requests with relevant documents + relevance scores (0-2) from four experts
  • Available for download: http://www.dcc.ufmg.br/irbook/cfc.html

  18. Cystic Fibrosis Collection: example document
  PN 74001
  RN 00001
  AN 75051687
  AU Hoiby-N. Jorgensen-B-A. Lykkegaard-E. Weeke-B.
  TI Pseudomonas aeruginosa infection in cystic fibrosis.
  SO Acta-Paediatr-Scand. 1974 Nov. 63(6). P 843-8.
  MJ CYSTIC-FIBROSIS: co. PSEUDOMONAS-AERUGINOSA: im.
  MN ADOLESCENCE. BLOOD-PROTEINS: me. CHILD. CHILD
  AB The significance of Pseudomonas aeruginosa infection in the respiratory tract of 9 cystic fibrosis patients have been studied…. … The results indicate no protective value of the many precipitins on the tissue of the respiratory tract.
  RF 001 BELFRAGE S ACTA MED SCAND SUPPL 173 5 963
     002 COOMBS RRA IN: GELL PGH 317 964
  CT 1 HOIBY N SCAND J RESPIR DIS 56 38 975
     2 HOIBY N ACTA PATH MICROBIOL SCAND (C)83 459 975
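
The record is fielded with two-letter tags (PN, AU, TI, AB, ...). Assuming each field starts on its own line in the downloaded files (an assumption; check the collection's documentation), such a record could be read into a dictionary with a sketch like this:

```python
def parse_cf_record(lines):
    """Collect a fielded record into {tag: text}; continuation lines extend the last tag."""
    record, tag = {}, None
    for line in lines:
        starts_new_field = (len(line) >= 2 and line[:2].isalpha() and line[:2].isupper()
                            and (len(line) == 2 or line[2] == " "))
        if starts_new_field:
            tag = line[:2]
            record[tag] = line[3:].strip()
        elif tag:
            record[tag] += " " + line.strip()
    return record

example = [
    "TI Pseudomonas aeruginosa infection in cystic fibrosis.",
    "AU Hoiby-N. Jorgensen-B-A.",
    "   Lykkegaard-E. Weeke-B.",
]
print(parse_cf_record(example)["AU"])
# -> 'Hoiby-N. Jorgensen-B-A. Lykkegaard-E. Weeke-B.'
```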

  19. CF Collection: example query and details of relevant documents
  QN 00001
  QU What are the effects of calcium on the physical properties of mucus from CF patients?
  NR 00034
  RD 139 1222  151 2211  166 0001  311 0001  370 1010  392 0001  439 0001  440 0011  441 2122  454 0100  461 1121  502 0002  503 1000  505 0001
  Reading an RD pair: 139 = document number; 1222 = expert 1 scored it relevance ‘1’, experts 2-4 scored it relevance ‘2’.
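
Each RD entry pairs a document number with a four-digit string of expert scores. A small sketch of decoding such pairs, based on the reading given above (the exact file layout is an assumption):

```python
def parse_rd(rd_field):
    """Turn '139 1222 151 2211 ...' style pairs into {doc_number: [expert scores]}."""
    tokens = rd_field.split()
    judgements = {}
    for doc_number, scores in zip(tokens[0::2], tokens[1::2]):
        judgements[int(doc_number)] = [int(s) for s in scores]  # one 0-2 score per expert
    return judgements

rd = "139 1222 151 2211 166 0001 311 0001"
print(parse_rd(rd))
# -> {139: [1, 2, 2, 2], 151: [2, 2, 1, 1], 166: [0, 0, 0, 1], 311: [0, 0, 0, 1]}
```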

  20. Further Reading • See Belew (2000), pages 119-128 • See also Belew CD for reference corpus (and lots more!)

  21. Basic Concepts of IR: recap After this lecture, you should be able to explain and discuss: • Information access process; ‘ad-hoc’ retrieval • User information need; query; IR vs. data retrieval / knowledge retrieval; retrieval vs. browsing • Relevance; Ranking • Evaluation metrics - Precision and Recall

  22. Set Reading • To prepare for next week’s lecture, you should look at: • Weiss et al (2005), handout – especially sections 1.4, 2.3, 2.4 and 2.5 • Belew, R. K. (2000), pages: 50-58

  23. Further Reading For more about the IR basics in today’s lecture, see the introductions in: Belew, R. K. (2000); Baeza-Yates, R. and Ribeiro-Neto, B. (1999), pages 1-9; or Kowalski and Maybury (2000).

  24. Further Reading • To keep up-to-date with web search engine developments, see www.searchenginewatch.com • I will put links to some online articles about recent developments in web search technologies on the module web page
