CSA3080: Adaptive Hypertext Systems I

CSA3080: Adaptive Hypertext Systems I. Lecture 6: Information Retrieval II. Dr. Christopher Staff, Department of Computer Science & AI, University of Malta.


  1. CSA3080: Adaptive Hypertext Systems I. Lecture 6: Information Retrieval II. Dr. Christopher Staff, Department of Computer Science & AI, University of Malta. cstaff@cs.um.edu.mt

  2. Aims and Objectives • Statistical Model of IR

  3. Aims and Objectives • Once we know what an AHS user’s interests are, we can find relevant information in the document collection • Guide user along path • Show relevant document to user • Boolean/Extended Boolean models have some limitations. Statistical model may provide advantages

  4. Precision and Recall • What is relevance? • How do we measure performance? • Recall: percentage of the relevant docs in the collection that are retrieved • Precision: percentage of the retrieved docs that are relevant
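
The two measures can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall` and the document ids are made up for the example, and relevance is assumed to be a simple yes/no judgement per document.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query as fractions in [0, 1]."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant docs that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: 4 docs retrieved, 3 of them relevant,
# out of 6 relevant docs in the whole collection.
p, r = precision_recall(
    ["d1", "d2", "d3", "d9"],
    ["d1", "d2", "d3", "d4", "d5", "d6"],
)
print(p, r)  # 0.75 0.5
```

Note that recall needs the full set of relevant documents in the collection, which is usually unknown and has to be estimated.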

  5. Boolean Model: Problems • Blair & Maron, 1985, “An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System” • Death-knell for pure Boolean approach • Evaluated IBM’s STorage And Information Retrieval System (STAIRS) • STAIRS used to index 40,000 legal documents representing c. 350,000 pages of text

  6. Boolean Model: Problems • Goal: to retrieve all and only those documents that are relevant to a given request for information • Lawyers who made requests wanted at least 75% of relevant documents • Retrieval effectiveness found to be poor

  7. Boolean Model: Problems • Lawyers would make request for information • Paralegals familiar with case and trained to use STAIRS would search for relevant documents • Lawyers would rate docs “vital”, “satisfactory”, “marginally relevant”, “irrelevant” • Lawyers could modify query • Iteration stops when lawyer signs off that 75% of relevant docs have been seen

  8. Boolean Model: Problems • Results: • Precision on average 79.0% • Recall on average only 20%!

  9. Boolean Model: Problems • Why? • Mismatch between terminology used by lawyers/paralegals and authors of documents • Spelling mistakes in documents • Use of slang and indirect reference

  10. Extended/Boolean Methods: other problems • The Vocabulary Problem • Furnas et al., 1987, “The Vocabulary Problem in Human-System Communication” • “Armchair” naming of objects / concepts very inaccurate. Only c. 20% chance of two randomly selected people using the same name to refer to the same object/concept! • Implications for information retrieval • Why it appears to be a non-problem for Web-based systems

  11. Extended/Boolean Methods: other problems • Boolean and extended Boolean models require a document to satisfy a query by containing the terms as specified in the query • Document representation is independent of other documents in the collection • No way of indicating which terms in the query are more significant than others
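
A minimal sketch of a pure Boolean AND query shows the first limitation. The function name `boolean_and`, the whitespace tokenisation, and the three toy documents are assumptions made for the example, not part of STAIRS or any real system.

```python
def boolean_and(query_terms, docs):
    """Return ids of documents that contain every query term (pure AND query)."""
    wanted = {t.lower() for t in query_terms}
    return [
        doc_id
        for doc_id, text in docs.items()
        if wanted <= set(text.lower().split())  # all query terms must be present
    ]

# Hypothetical collection: d2 is about the same topic but uses "agreement".
docs = {
    "d1": "the contract was breached by the supplier",
    "d2": "the supplier failed to honour the agreement",
    "d3": "breach of contract by the supplier",
}
matches = boolean_and(["contract", "supplier"], docs)
print(matches)  # ['d1', 'd3']
```

A document either matches or it does not: `d2` is lost because of a vocabulary mismatch, and the two matches come back unranked, with no way to say that one query term matters more than the other.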

  12. Extended/Boolean Methods: problems • Relevance feedback • RF is an important tool, much underutilised in “popular” search engines • Users not always able to describe need fully • But can always recognise a relevant document! • After initial query, mark documents in the results set as relevant or non-relevant • Let the IR system re-compute the query!

  13. Statistical Model of IR • For a given term, which documents are statistically most likely to be about the term? • How does the co-occurrence of terms affect the relevance of the document? • Reference: • G. Salton and C. Buckley (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523.

  14. Statistical Model of IR • A document is considered to be relevant to a query if it is similar enough • A similarity measure calculates how close a query and a document representation are when plotted in vector space (e.g. the cosine of the angle between them, which for length-normalised vectors gives the same ranking as Euclidean distance) • Relevant documents can be ranked in descending order of similarity
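
Ranking by similarity can be sketched as follows, using the cosine of the angle between vectors as one common choice of measure. The sparse term-to-weight dictionaries, the weights themselves, and the function names are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as term->weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Return (doc_id, score) pairs in descending order of similarity to the query."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical query and document vectors.
query = {"contract": 1.0, "breach": 1.0}
docs = {
    "d1": {"contract": 2.0, "supplier": 1.0},
    "d2": {"agreement": 1.0, "supplier": 1.0},
    "d3": {"contract": 1.0, "breach": 1.0, "supplier": 1.0},
}
ranking = rank(query, docs)
print([doc_id for doc_id, _ in ranking])  # ['d3', 'd1', 'd2']
```

Unlike the Boolean case, `d1` is still returned even though it lacks one query term; it simply scores lower than `d3`.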

  15. Statistical Model of IR • Boolean model simply used the presence or absence of a term in a document • Extended Boolean model used other term features, including term frequency, to rank relevant documents • Statistical model also uses the distribution of a term in the collection: document frequency (DF) • Size of collection / DF (inverse DF)
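
Document frequency and its inverse can be computed as in the sketch below. Taking the logarithm of (size of collection / DF) is an assumption here: it is the usual form in the term-weighting literature, but the slide only gives the ratio. The function names and the toy collection are also made up.

```python
import math

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = {}
    for terms in docs.values():
        for term in set(terms):  # count each document at most once per term
            df[term] = df.get(term, 0) + 1
    return df

def inverse_document_frequency(docs):
    """IDF = log(N / DF); the log is an assumed, commonly used damping."""
    n = len(docs)
    return {t: math.log(n / d) for t, d in document_frequency(docs).items()}

# Hypothetical tokenised collection of four documents.
docs = {
    "d1": ["the", "stairs", "system"],
    "d2": ["the", "system"],
    "d3": ["the", "lawyer"],
    "d4": ["the", "lawyer", "system"],
}
idf = inverse_document_frequency(docs)
print(idf["the"], idf["stairs"])
```

A term that occurs in every document ("the") gets an IDF of zero and so cannot discriminate between documents, while a rare term ("stairs") gets the highest value.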

  16. Statistical Model of IR • The term weight is: • Term frequency × Inverse Document Frequency • Also normalise term weight, so that length of document is taken into account • DL(j): no. of terms in document j • NDL(j): DL(j) / (Average document length)
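
The weighting can be sketched as below. Two details are assumptions, since the slide does not spell them out: IDF is taken as log(N / DF), and normalisation is done by dividing TF × IDF by NDL(j). Other normalisation schemes exist, and the function name `term_weights` is hypothetical.

```python
import math

def term_weights(docs):
    """Return {doc_id: {term: weight}} with weight = TF * IDF / NDL(j).

    Assumptions: IDF = log(N / DF), and dividing by NDL(j) is the chosen
    way of applying the length normalisation.
    """
    n = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n
    df = {}
    for terms in docs.values():
        for term in set(terms):
            df[term] = df.get(term, 0) + 1
    weights = {}
    for doc_id, terms in docs.items():
        # NDL(j) = DL(j) / average document length
        ndl = len(terms) / avg_len if avg_len else 1.0
        weights[doc_id] = {
            term: terms.count(term) * math.log(n / df[term]) / ndl
            for term in set(terms)
        }
    return weights

# Hypothetical tokenised collection; average document length is 2 terms.
docs = {
    "d1": ["stairs", "stairs", "legal"],
    "d2": ["legal", "text"],
    "d3": ["legal"],
}
w = term_weights(docs)
print(w["d1"]["stairs"], w["d1"]["legal"])
```

Here "legal" appears in every document, so its weight is zero everywhere, while the weight of "stairs" in `d1` is scaled down because `d1` is longer than average.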

  17. Statistical Model of IR • Cosine Similarity Measure
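
The slide's formula is not reproduced in this transcript. The standard form of the cosine measure, which is presumably what was shown, is given below; the symbols are assumptions: w_{iq} is the weight of term i in query Q, w_{ij} is the weight of term i in document D_j, and t is the number of distinct terms.

```latex
\[
\mathrm{sim}(Q, D_j) =
  \frac{\sum_{i=1}^{t} w_{iq}\, w_{ij}}
       {\sqrt{\sum_{i=1}^{t} w_{iq}^{2}}\;\sqrt{\sum_{i=1}^{t} w_{ij}^{2}}}
\]
```

With non-negative term weights the value lies between 0 (no terms in common) and 1 (vectors pointing in the same direction).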

  18. Statistical Model of IR • Can now rank documents according to similarity • Can also support relevance feedback in iterative retrieval • Relevance feedback can help AHS determine unspecified significant terms that also indicate user interests
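
One well-known way to re-compute a query from relevance judgements in vector space is a Rocchio-style update. The slides do not name a specific method, so this is a swapped-in illustration: the function name and the values of `alpha`, `beta` and `gamma` are illustrative choices rather than anything prescribed by the lecture.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Re-compute a query vector from documents the user marked.

    Vectors are sparse term->weight dicts. The new query moves towards the
    average relevant document and away from the average non-relevant one.
    """
    new_query = {t: alpha * w for t, w in query.items()}
    for docs, sign, coeff in ((relevant, 1, beta), (non_relevant, -1, gamma)):
        for doc in docs:
            for t, w in doc.items():
                new_query[t] = new_query.get(t, 0.0) + sign * coeff * w / len(docs)
    # Terms pushed to zero or below are dropped from the new query.
    return {t: w for t, w in new_query.items() if w > 0}

# Hypothetical judgements after an initial one-term query.
query = {"contract": 1.0}
relevant = [{"contract": 1.0, "breach": 1.0}, {"breach": 1.0, "damages": 1.0}]
non_relevant = [{"contract": 1.0, "bridge": 1.0}]
new_query = rocchio(query, relevant, non_relevant)
print(sorted(new_query))  # ['breach', 'contract', 'damages']
```

The terms "breach" and "damages" were never typed by the user but now carry weight in the query, which is the sense in which feedback can surface unspecified significant terms.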

  19. Statistical Model of IR • Disadvantage: the same document in different collections can have different IDF values, which affects term weight • Modern approaches use statistical language models, which use a term's likelihood of occurrence in the language rather than in the document collection • Reference: • Djoerd Hiemstra and Franciska de Jong, (19??), Statistical Language Models and Information Retrieval: natural language processing really meets retrieval.

  20. Conclusion • Statistical model of IR yields improvements over Boolean/Extended Boolean, although it is still not popular for Web-based search • Why? • Many approaches to adaptation use statistical evidence (e.g., Amazon) • Will investigate other models in CSA4080
