CSA3080: Adaptive Hypertext Systems I

Lecture 6: Information Retrieval II
Dr. Christopher Staff
Department of Computer Science & AI
University of Malta
cstaff@cs.um.edu.mt

Aims and Objectives
• Statistical Model of IR

Aims and Objectives
• Once we know what an AHS user’s interests are, we can find relevant information in the document collection
  • Guide user along path
  • Show relevant document to user
• Boolean/Extended Boolean models have some limitations. Statistical model may provide advantages

Precision and Recall
• What is relevance?
• How do we measure performance?
• Recall: percentage of all relevant docs in the collection that are retrieved
• Precision: percentage of retrieved docs that are relevant
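The two measures can be sketched in a few lines of Python. This is a minimal illustration, assuming relevance judgements are available as a set of document ids; the function name `precision_recall` and the ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, each as a fraction in [0, 1]."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant docs that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d5", "d6", "d7", "d8", "d9", "d10"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.25
```

Here half of what was retrieved is relevant, but only a quarter of the relevant documents were found, so a system can look precise while missing most of what the user needs.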

Boolean Model: Problems
• Blair & Maron, 1985, “An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System”
• Death-knell for pure Boolean approach
• Evaluated IBM’s STorage And Information Retrieval System (STAIRS)
• STAIRS used to index 40,000 legal documents representing c. 350,000 pages of text

Boolean Model: Problems
• Goal: to retrieve all and only those documents that are relevant to a given request for information
• Lawyers who made requests wanted at least 75% of relevant documents
• Retrieval effectiveness discovered to be poor

Boolean Model: Problems
• Lawyers would make a request for information
• Paralegals familiar with the case and trained to use STAIRS would search for relevant documents
• Lawyers would rate docs “vital”, “satisfactory”, “marginally relevant”, or “irrelevant”
• Lawyers could modify the query
• Iteration stops when the lawyer signs off that 75% of relevant docs have been seen

Boolean Model: Problems
• Results:
  • Precision on average 79.0%
  • Recall on average only 20%!

Boolean Model: Problems
• Why?
• Mismatch between terminology used by lawyers/paralegals and authors of documents
• Spelling mistakes in documents
• Use of slang and indirect reference

Extended/Boolean Methods: other problems
• The Vocabulary Problem
• Furnas et al., 1987, “The Vocabulary Problem in Human-System Communication”
• “Armchair” naming of objects/concepts is very inaccurate. Only c. 20% chance of two randomly selected people using the same name to refer to the same object/concept!
• Implications for information retrieval
• Why it appears to be a non-problem for Web-based systems

Extended/Boolean Methods: other problems
• Boolean and Extended Boolean models require a document to satisfy a query by containing the terms as specified in the query
• Document representation is independent of other documents in the collection
• No way of indicating which terms in the query are more significant than others
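A minimal sketch of pure Boolean AND matching shows how these limitations appear in practice. The collection, the query, and the helper `boolean_and` are hypothetical and only meant as an illustration.

```python
# Hypothetical three-document collection, for illustration only.
docs = {
    "d1": "the car was damaged in the accident",
    "d2": "the automobile was damaged in the unfortunate event",
    "d3": "the contract covers accident insurance",
}


def boolean_and(query_terms, docs):
    """Return ids of documents containing every query term (pure Boolean AND)."""
    results = []
    for doc_id, text in docs.items():
        words = set(text.split())
        if all(term in words for term in query_terms):
            results.append(doc_id)
    return results


matches = boolean_and(["car", "accident"], docs)
print(matches)  # ['d1']
```

Document d2 describes the same event in different words and is missed entirely, and the result is an unranked set: the query cannot say that one term matters more than the other.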

Extended/Boolean Methods: problems
• Relevance feedback
• RF is an important tool, much underutilised in “popular” search engines
• Users not always able to describe need fully
• But can always recognise a relevant document!
• After initial query, mark documents in the results set as relevant or non-relevant
• Let the IR system re-compute the query!

Statistical Model of IR
• For a given term, which documents are statistically most likely to be about the term?
• How does the co-occurrence of terms affect the relevance of the document?
• Reference:
  • G. Salton and C. Buckley (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523.

Statistical Model of IR
• A document is considered to be relevant to a query if it is similar enough
• A similarity measure calculates how close a query and a document representation are when both are plotted as vectors in the same space (for example, using the cosine of the angle between them)
• Relevant documents can be ranked in descending order of similarity

Statistical Model of IR
• Boolean model simply used the presence or absence of a term in a document
• Extended Boolean model used other term features, including term frequency, to rank relevant documents
• Statistical model also uses the distribution of a term across the collection: document frequency (DF)
• Inverse DF: size of collection / DF
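Document frequency and inverse document frequency can be sketched as follows. The collection and function names are hypothetical; the plain ratio follows the definition above, and the optional logarithm is the scaling commonly applied in practice.

```python
import math

# Hypothetical tokenised collection, for illustration only.
collection = [
    ["contract", "law", "court"],
    ["court", "appeal", "law"],
    ["football", "court"],
    ["contract", "breach"],
]


def document_frequency(term, collection):
    """Number of documents in which the term occurs at least once."""
    return sum(1 for doc in collection if term in doc)


def inverse_document_frequency(term, collection, use_log=False):
    """Size of collection / DF; optionally log-scaled."""
    df = document_frequency(term, collection)
    if df == 0:
        return 0.0  # assumption: a term absent from the collection gets zero weight
    ratio = len(collection) / df
    return math.log(ratio) if use_log else ratio


for term in ("court", "contract", "breach"):
    df = document_frequency(term, collection)
    idf = inverse_document_frequency(term, collection)
    print(f"{term}: DF={df} IDF={idf:.2f}")
```

A term such as "court" that appears in most documents gets a low IDF, while a rare term such as "breach" gets a high one, so rare terms discriminate more strongly between documents.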

Statistical Model of IR
• The term weight is:
  • Term frequency × Inverse Document Frequency
• Also normalise the term weight, so that the length of the document is taken into account
  • DL(j): no. of terms in document j
  • NDL(j): DL(j) / (average document length)
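The weighting scheme can be sketched as below. How NDL(j) enters the weight is not stated here, so this sketch assumes the TF × IDF product is divided by NDL(j); the collection and the name `term_weights` are hypothetical.

```python
from collections import Counter

# Hypothetical tokenised collection, for illustration only.
collection = {
    "d1": ["court", "court", "law"],
    "d2": ["court", "appeal", "law", "law", "law"],
    "d3": ["football"],
}


def term_weights(collection):
    """Return {doc_id: {term: weight}} with weight = TF x IDF / NDL(j)."""
    n_docs = len(collection)
    avg_len = sum(len(terms) for terms in collection.values()) / n_docs
    # DF: number of documents containing each term at least once.
    df = Counter(term for terms in collection.values() for term in set(terms))
    weights = {}
    for doc_id, terms in collection.items():
        ndl = len(terms) / avg_len  # NDL(j) = DL(j) / average document length
        tf = Counter(terms)         # raw term frequency in document j
        weights[doc_id] = {
            # Assumption: normalise by dividing TF x IDF by NDL(j).
            term: count * (n_docs / df[term]) / ndl
            for term, count in tf.items()
        }
    return weights


weights = term_weights(collection)
for doc_id, doc_weights in weights.items():
    print(doc_id, {term: round(w, 2) for term, w in doc_weights.items()})
```

With this normalisation, a term in a longer-than-average document is discounted and the same term in a short document is boosted, so long documents do not win simply by repeating terms.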

Statistical Model of IR
• Cosine Similarity Measure: the cosine of the angle between the query vector and the document vector
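Assuming the standard definition, cos(Q, D) = (Q · D) / (|Q| |D|), the measure and the resulting ranking can be sketched as follows. Vectors are stored as sparse dictionaries of term weights; the weights and document ids are hypothetical.

```python
import math


def cosine_similarity(query, doc):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * doc.get(term, 0.0) for term, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_q * norm_d)


# Hypothetical term weights, for illustration only.
query = {"court": 1.0, "law": 1.0}
docs = {
    "d1": {"court": 3.0, "law": 1.5},
    "d2": {"court": 0.9, "law": 2.7, "appeal": 1.8},
    "d3": {"football": 9.0},
}

# Rank documents in descending order of similarity to the query.
ranking = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

Because the measure depends only on the angle between the vectors, scaling every weight in a document by the same factor leaves its score unchanged.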

Statistical Model of IR
• Can now rank documents according to similarity
• Can also support relevance feedback in iterative retrieval
• Relevance feedback can help AHS determine unspecified significant terms that also indicate user interests
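One well-known way to re-compute a query from marked documents is Rocchio-style feedback. It is not named in these slides and is used here only as an illustration. The sketch assumes sparse term-weight dictionaries; the parameter values and example vectors are illustrative assumptions.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Re-compute a query vector from documents the user marked (Rocchio-style)."""
    terms = set(query)
    for doc in relevant + non_relevant:
        terms.update(doc)
    new_query = {}
    for term in terms:
        # Average weight of the term in the relevant and non-relevant sets.
        pos = sum(d.get(term, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(term, 0.0) for d in non_relevant) / len(non_relevant) if non_relevant else 0.0
        weight = alpha * query.get(term, 0.0) + beta * pos - gamma * neg
        if weight > 0.0:  # common convention: drop terms whose weight is not positive
            new_query[term] = weight
    return new_query


# Hypothetical vectors, for illustration only.
query = {"court": 1.0}
relevant = [{"court": 2.0, "appeal": 1.0}, {"court": 1.0, "appeal": 3.0}]
non_relevant = [{"court": 1.0, "football": 4.0}]

new_query = rocchio(query, relevant, non_relevant)
print(sorted(new_query))  # ['appeal', 'court']
```

The term "appeal" never appeared in the original query but enters the new one because it is prominent in the documents marked relevant, which is the kind of unspecified significant term an AHS could use as evidence of user interest.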

Statistical Model of IR
• Disadvantage: the same document in different collections can have different IDF values for its terms, which affects term weight
• Modern approaches use statistical language models to estimate the likelihood of a term occurring in the language, rather than in the document collection
• Reference:
  • Djoerd Hiemstra and Franciska de Jong, (19??), Statistical Language Models and Information Retrieval: natural language processing really meets retrieval.

Conclusion
• Statistical model of IR yields improvements over Boolean/Extended Boolean, although it is still not popular for Web-based search
• Why?
• Many approaches to adaptation use statistical evidence (e.g., Amazon)
• Will investigate other models in CSA4080
