
Introduction to Information Retrieval (Part 2)


Presentation Transcript


  1. Introduction to Information Retrieval (Part 2) By Evren Ermis

  2. Introduction to Information Retrieval • Retrieval models • Vector-space-model • Probabilistic model • Relevance feedback • Evaluation • Performance evaluation • Retrieval Performance evaluation • Reference Collections • Evaluation measures

  3. Vector space model • Binary weights are too limiting • Assign non-binary weights to index terms • in queries • in documents • Compute the degree of similarity between documents and the query • Sorting documents in order of similarity allows considering documents that match the query only partially

  4. Vector space model • Consider every document as a vector of index term weights • Similarity is given by the correlation between the document and query vectors (see the formula below)
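
  The similarity equation from this slide is not preserved in the transcript; a standard reconstruction is the cosine of the angle between the document and query vectors (cf. the Baeza-Yates reference):

    sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|}
                = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}

  where w_{i,j} and w_{i,q} denote the weight of index term k_i in document d_j and in the query q, respectively.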

  5. Vector space model • Does not predict whether a document is relevant or not • Instead ranks documents according to their similarity to the query • A document can be retrieved even though it matches the query only partially • Use a threshold d to filter out documents with similarity < d

  6. Vector space model Index term weights • Features that describe the sought documents well: intra-cluster similarity • Features that distinguish the sought documents from the rest: inter-cluster dissimilarity

  7. Vector space model Index term weights • Intra-cluster similarity: the term frequency (tf) factor • Inter-cluster dissimilarity: the inverse document frequency (idf) factor (see the definitions below)
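
  The factor equations from this slide are not preserved; the usual definitions of the two factors (normalized term frequency and inverse document frequency, as in the Baeza-Yates reference) are:

    tf factor:   f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}
    idf factor:  idf_i = \log \frac{N}{n_i}

  where freq_{i,j} is the raw frequency of term k_i in document d_j, N is the total number of documents, and n_i is the number of documents that contain k_i.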

  8. Vector space model Index term weights • The weight of a term in a document is then calculated as the product of the tf factor and the idf factor • An analogous weight is used for the query (see below)
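
  The weighting equations are not preserved in the transcript; the standard tf-idf document weight, and the Salton-Buckley query weight, read:

    w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}
    w_{i,q} = \left( 0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}} \right) \times \log \frac{N}{n_i}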

  9. Vector space model • Advantages • Term weighting improves retrieval performance • Partial matching is allowed • Documents are sorted according to similarity • Disadvantages • Assumes that index terms are mutually independent

  10. Probabilistic model • Assume that there is a set of documents containing exactly the relevant documents and no others (the ideal answer set) • The problem is that we don't know that set's properties • Index terms are used to characterize the properties • Use an initial guess at query time to obtain a probabilistic description of the ideal answer set • Use this to retrieve a first set of documents • Interaction with the user improves the probabilistic description of the ideal answer set

  11. Probabilistic model • Interaction with the user improves the probabilistic description of the ideal answer set • The probabilistic approach is to model the description in probabilistic terms without the user • Problem: we don't know how to compute the probabilities of relevance

  12. Probabilistic model • How to compute the probabilities of relevance? • As a measure of similarity use P(dj relevant-to q) / P(dj non-relevant-to q) • i.e. the odds of document dj being relevant to query q • This yields the similarity function below:
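
  The similarity function from this slide is not preserved; a standard form, obtained from the odds via Bayes' rule and the term-independence assumption (cf. the Baeza-Yates reference), is:

    sim(d_j, q) = \frac{P(R \mid \vec{d_j})}{P(\bar{R} \mid \vec{d_j})}
                \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)

  where R is the (ideal) set of relevant documents and \bar{R} its complement.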

  13. Probabilistic model • Problem: we don't have the set R at the beginning • It is necessary to find initial probabilities • Make two assumptions (formalized below): • P(ki|R) is constant for all index terms • The distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all documents
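
  Written as equations, these two assumptions are commonly formalized as:

    P(k_i \mid R) = 0.5            (constant for all index terms)
    P(k_i \mid \bar{R}) = \frac{n_i}{N}    (approximated by the distribution over all documents)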

  14. Probabilistic model • With these assumptions we get the initial ranking formula below • Now we can retrieve documents containing the query terms and provide an initial probabilistic ranking for them
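
  Substituting the initial estimates into the similarity function of slide 12 yields, as a standard reconstruction of the slide's equation:

    sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \, \log \frac{N - n_i}{n_i}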

  15. Probabilistic model • Now we can use these retrieved documents to improve the assumed probabilities • Let V be a subset of the retrieved documents and Vi the subset of V containing the i-th index term; then:
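
  The update equations are not preserved in the transcript; the usual re-estimation from the retrieved subset is:

    P(k_i \mid R) = \frac{|V_i|}{|V|}
    P(k_i \mid \bar{R}) = \frac{n_i - |V_i|}{N - |V|}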

  16. Probabilistic model • Advantages: • Documents are ranked in decreasing order of their probability of being relevant • Disadvantages: • Needs an initial guess for the separation of relevant and non-relevant documents • Does not consider the frequency of occurrence of an index term within a document

  17. Relevance feedback • A query reformulation strategy • The user marks relevant documents in the retrieved set • The method selects important terms attached to the user-identified documents • The newly gained information is used to formulate a new query and to reweight its terms

  18. Relevance feedback for vector model • Vectors of relevant documents are similar among themselves • Non-relevant documents have vectors that are dissimilar to those of the relevant ones • Reformulate the query such that it moves closer to the term-weight vectors of the relevant documents

  19. Relevance feedback for vector model • The reformulated query is obtained as shown below
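
  The reformulation equation is not preserved in the transcript; the standard Rocchio formula (cf. the Rocchio reference) is a reasonable reconstruction:

    \vec{q_m} = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d_j} \in D_r} \vec{d_j} - \frac{\gamma}{|D_n|} \sum_{\vec{d_j} \in D_n} \vec{d_j}

  where D_r and D_n are the sets of relevant and non-relevant documents identified by the user, and \alpha, \beta, \gamma are tuning constants.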

  20. Relevance feedback for probabilistic model • Replace V by Dr and Vi by Dr,i, where Dr is the set of documents chosen by the user and Dr,i is the subset of Dr containing the index term ki.

  21. Relevance feedback for probabilistic model • Using this replacement and rewriting the similarity function of the probabilistic model, we get the estimates below • Only index terms already in the query are reweighted • The query is not expanded by new index terms
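
  The rewritten equation is not preserved; with the replacement above, the re-estimated probabilities, plugged into the similarity function of slide 12, are:

    P(k_i \mid R) = \frac{|D_{r,i}|}{|D_r|}, \qquad P(k_i \mid \bar{R}) = \frac{n_i - |D_{r,i}|}{N - |D_r|}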

  22. Relevance feedback for probabilistic model • Advantages: • Feedback is directly related to the derivation of new weights • Reweighting is optimal under the assumptions of • term independence • binary document indexing • Disadvantages: • Document term weights are not regarded in the feedback loop • Previous term weights in the query are disregarded • No query expansion • Not as effective as the vector modification method

  23. Evaluation Types of evaluation: • Performance of the system (time and space) • Functional analysis in which the specified system functionalities are tested • How precise is the answer set • Reference collection • Evaluation measure

  24. Performance Evaluation • Performance of the indexing structures • Interaction with the operating system • Delays in communication channels • Overheads introduced by the many software layers

  25. Retrieval performance evaluation • A reference collection consists of • a collection of documents • a set of example information requests • a set of relevant documents for each request • An evaluation measure • uses the reference collection • quantifies the similarity between the documents retrieved by a retrieval strategy and the provided set of relevant documents

  26. Reference collection • Several different reference collections exist • TIPSTER/TREC • CACM • CISI • Cystic Fibrosis • etc. • TIPSTER/TREC is chosen for further discussion

  27. TIPSTER/TREC • Named after the conference "Text REtrieval Conference" (TREC) • Built under the TIPSTER program • Large test collection (over 1 million documents) • For each conference a set of reference experiments is designed • Research groups use these experiments to compare their retrieval systems

  28. Evaluation measure • Several different evaluation measures exist • Recall and precision • Average precision • Interpolated precision • Harmonic mean (F-measure) • E-measure • Satisfaction, frustration, etc. • Recall and precision, the most widely used, are chosen for further discussion

  29. Recall and precision

  30. Recall and precision • Definition of recall: • Recall is the fraction of the relevant documents which has been retrieved. • And precision: • Precision is the fraction of the retrieved documents which is relevant. • (See the formulas below.)
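
  Expressed as formulas, with R the set of relevant documents for the query, A the answer set retrieved, and R_a = R \cap A:

    Recall = \frac{|R_a|}{|R|}, \qquad Precision = \frac{|R_a|}{|A|}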

  31. Precision vs. Recall • Assume that all documents in the answer set A have been examined • But the user is not confronted with all documents at once • Instead they are sorted according to relevance • Recall and precision vary as the user proceeds with the examination of the documents • Proper evaluation requires a precision vs. recall curve

  32. Precision vs. Recall

  33. Average precision • The example figure applies to a single query • To evaluate a retrieval algorithm, several distinct queries have to be run • This yields distinct precision vs. recall curves • Average the precision figures at each recall level (see below)
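
  The averaging equation is not preserved in the transcript; the usual form is:

    \bar{P}(r) = \sum_{i=1}^{N_q} \frac{P_i(r)}{N_q}

  where N_q is the number of queries and P_i(r) is the precision at recall level r for the i-th query.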

  34. Interpolated precision • The recall levels of an individual query are in general distinct from the 11 standard recall levels • An interpolation procedure is necessary • Let rj be the j-th standard recall level with j = 0, 1, …, 10. Then,
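
  The interpolation rule from this slide is not preserved; the usual definition is:

    P(r_j) = \max_{r_j \le r \le r_{j+1}} P(r)

  i.e. the interpolated precision at the j-th standard recall level is the maximum known precision at any recall level between the j-th and the (j+1)-th standard recall level.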

  35. Interpolated precision

  36. Example figures

  37. Harmonic mean (F-measure) • The harmonic mean of recall and precision is defined as shown below • F is high only if both recall and precision are high • Therefore the maximum of F is interpreted as the best compromise between recall and precision
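
  The defining equation is not preserved in the transcript; with r(j) and P(j) denoting recall and precision at the j-th document in the ranking, the harmonic mean is:

    F(j) = \frac{2}{\frac{1}{r(j)} + \frac{1}{P(j)}}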

  38. E-measure • The user specifies whether recall or precision is of more interest • The E-measure is defined as shown below • b is user-specified and reflects the relative importance of recall and precision
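
  The defining equation is not preserved; the usual form is:

    E(j) = 1 - \frac{1 + b^2}{\frac{b^2}{r(j)} + \frac{1}{P(j)}}

  For b = 1 the E-measure reduces to the complement of the harmonic mean, E(j) = 1 - F(j).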

  39. Conclusion • Introduced the two most popular models for information retrieval: • Vector space model • Probabilistic model • Introduced evaluation methods to quantify the performance of information retrieval systems (recall and precision, …)

  40. References • R. Baeza-Yates, B. Ribeiro-Neto: "Modern Information Retrieval" (1999) • G. Salton: "The SMART Retrieval System – Experiments in Automatic Document Processing" (1971) • S.E. Robertson, K. Sparck Jones: "Relevance Weighting of Search Terms", Journal of the American Society for Information Science (1976) • N. Fuhr: "Probabilistic Models in Information Retrieval" (1992) • TREC NIST website: http://trec.nist.gov • J.J. Rocchio: "Relevance Feedback in Information Retrieval" (1971)
