
Probabilistic IR Models




  1. Probabilistic IR Models Based on probability theory. Basic idea: Given a document d and a query q, estimate the likelihood of d being relevant for the information need represented by q, i.e. P(R|q,d). Compared to previous models: Boolean and Vector models rank by a relevance value interpreted as a similarity measure between q and d; probabilistic models rank by the estimated likelihood of d being relevant for query q.

  2. Probabilistic IR Models Again, documents and queries are represented as vectors with binary weights (i.e. wij = 0 or 1). Relevance is seen as a relationship between an information need (expressed as query q) and a document: a document d is relevant if and only if a user with information need q "wants" d. Relevance is a function of various parameters, is subjective, and cannot always be exactly specified. Hence: a probabilistic description of relevance, i.e. instead of a vector space we operate in an event space Q × D (Q = set of possible queries, D = set of all documents in the collection). Interpretation: If a user with information need q draws a random document d from the collection, how big is its likelihood of being relevant, i.e. P(R|q,d)?

  3. The Probability Ranking Principle Probability Ranking Principle (Robertson, 1977): Optimal retrieval performance can be achieved when documents are ranked according to their probabilities of being judged relevant to a query. (Informal definition) Involves two assumptions: 1. Dependencies between documents are ignored. 2. It is assumed that the probabilities can be estimated in the best possible way. Main task: estimation of the probability P(R|q,d) for every document d in the document collection D. Reference: F. Crestani et al., "Is This Document Relevant? ... Probably": A Survey of Probabilistic Models in IR [3]
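To make the principle concrete, here is a minimal Python sketch: documents are simply sorted by an estimated P(R|q,d). The function estimate_prob_relevant is an assumed placeholder for any of the concrete models introduced on the following slides, not part of the original presentation.

```python
# Minimal sketch of the Probability Ranking Principle (PRP):
# rank documents by decreasing estimated probability of relevance.
# estimate_prob_relevant is a hypothetical placeholder for a concrete
# model (BIR, BII, DIA, ...) from the later slides.

def rank_by_prp(query, documents, estimate_prob_relevant):
    """Return documents sorted by decreasing estimated P(R | q, d)."""
    scored = [(estimate_prob_relevant(query, d), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored]
```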

  4. Probabilistic Modeling Given: documents dj = (t1, t2, ..., tn) and queries qi (n = number of index terms). We assume a similar dependence between d and q as before, i.e. relevance depends on the term distribution. (Note: slightly different notation here than before!) Estimating P(R|d,q) directly is often impossible in practice. Instead, use Bayes' theorem, i.e. P(R|d,q) = P(d|R,q) · P(R|q) / P(d|q), or, in odds form, O(R|d,q) = P(R|d,q) / P(¬R|d,q) = [P(d|R,q) / P(d|¬R,q)] · [P(R|q) / P(¬R|q)].
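As a toy numeric illustration of this inversion (all numbers invented for the example):

```python
# Toy numeric check of the Bayes inversion (all numbers invented):
p_d_given_R = 0.4   # P(d | R, q): prob. of drawing d among the relevant docs
p_R = 0.01          # P(R | q): prior probability of relevance
p_d = 0.02          # P(d | q): overall probability of drawing d

p_R_given_d = p_d_given_R * p_R / p_d
print(p_R_given_d)  # 0.2 = P(R | d, q)
```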

  5. Probabilistic Modeling as Decision Strategy The decision about which documents should be returned is based on a threshold calculated with a cost function Cj. Example: decision based on a risk function that minimizes the expected cost: retrieve dj if the expected cost of retrieving it is lower than the expected cost of not retrieving it, i.e. if C1 · P(¬R|q,dj) < C2 · P(R|q,dj), where C1 is the cost of presenting a non-relevant document and C2 the cost of missing a relevant one.

  6. Probabilistic Modeling as Decision Strategy (Cont.)
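As the risk formulas of slides 5-6 did not survive in the text, the following Python sketch shows a standard two-cost decision rule consistent with the description above; the cost names and numbers are invented for illustration.

```python
# Hedged sketch of a cost-based retrieval decision (names and numbers
# invented): retrieve a document iff the expected cost of retrieving it
# is lower than the expected cost of suppressing it.

def should_retrieve(p_rel, cost_show_nonrel, cost_miss_rel):
    """p_rel:            estimated P(R | q, d)
    cost_show_nonrel: cost of presenting a non-relevant document (C1)
    cost_miss_rel:    cost of suppressing a relevant document (C2)"""
    expected_cost_retrieve = (1 - p_rel) * cost_show_nonrel
    expected_cost_skip = p_rel * cost_miss_rel
    return expected_cost_retrieve < expected_cost_skip

# Equivalent threshold form: retrieve iff p_rel > C1 / (C1 + C2).
print(should_retrieve(0.3, cost_show_nonrel=1.0, cost_miss_rel=3.0))  # True
```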

  7. Probability Estimation Different approaches to estimate P(d|R) exist: the Binary Independence Retrieval Model (BIR), the Binary Independence Indexing Model (BII), and the Darmstadt Indexing Approach (DIA). Generally we assume stochastic independence between the terms of one document, i.e. P(d|R) = P(t1|R) · P(t2|R) · ... · P(tn|R).
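A small sketch of what this independence assumption buys computationally; the per-term probabilities below are invented, and the product is computed in log space, a standard trick to avoid underflow for long documents.

```python
import math

# Sketch of the term-independence assumption (per-term probabilities
# invented for the example): P(d | R) = prod_i P(t_i | R).
term_probs_given_R = {"probabilistic": 0.30, "retrieval": 0.25, "model": 0.20}

def prob_doc_given_R(doc_terms, term_probs):
    """Compute prod_i P(t_i | R) in log space to avoid underflow."""
    log_p = sum(math.log(term_probs[t]) for t in doc_terms)
    return math.exp(log_p)

print(prob_doc_given_R(["probabilistic", "retrieval"], term_probs_given_R))
# 0.075 = 0.30 * 0.25
```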

  8. Binary Independence Retrieval Model (BIR) [Diagram: term-document matrix with queries; the learning region covers the judged documents for one query, the application region the remaining documents] Learning: estimation of the probability distribution based on a query qk, a set of documents dj, and the respective relevance judgments. Application: generalization to different documents from the collection (but restricted to the same query and the terms from training).
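A hedged sketch of this learning/application split, assuming binary term occurrence and the widely used add-0.5 smoothed Robertson/Sparck-Jones log-odds term weights; the data layout (judged_docs as (term-set, judgment) pairs) is invented for illustration.

```python
import math

def train_bir(query_terms, judged_docs):
    """Learning step: estimate one weight per query term from relevance
    judgments for ONE fixed query.
    judged_docs: list of (set_of_terms, is_relevant) pairs."""
    N = len(judged_docs)                         # number of judged docs
    R = sum(1 for _, rel in judged_docs if rel)  # number of relevant ones
    weights = {}
    for t in query_terms:
        n = sum(1 for terms, _ in judged_docs if t in terms)
        r = sum(1 for terms, rel in judged_docs if rel and t in terms)
        p = (r + 0.5) / (R + 1.0)           # P(t present | R), smoothed
        q = (n - r + 0.5) / (N - R + 1.0)   # P(t present | not R), smoothed
        weights[t] = math.log(p * (1 - q) / (q * (1 - p)))
    return weights

def bir_score(doc_terms, weights):
    """Application step: retrieval status value of a new document for the
    SAME query = sum of the weights of its matching terms."""
    return sum(w for t, w in weights.items() if t in doc_terms)
```

Note the restriction stated above: the learned weights are only meaningful for the query they were trained on. The BII model on the next slide swaps the roles of documents and queries.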

  9. Binary Independence Indexing Model (BII) [Diagram: term-document matrix with queries; the learning region covers the judged queries for one document, the application region new queries] Learning: estimation of the probability distribution based on a document dj, a set of queries qk, and the respective relevance judgments. Application: generalization to different queries (but restricted to the same document and the terms from training).

  10. Darmstadt Indexing Approach (DIA) [Diagram: term-document matrix with queries; learning from judged query-document pairs, application to new queries and documents] Learning: estimation of the probability distribution based on a set of queries qk, an abstract description of a set of documents dj, and the respective relevance judgments. Application: generalization to different queries and documents.

  11. DIA - Description Step Basic idea: instead of term-document pairs, consider relevance descriptions x(ti, dm). These contain the values of certain attributes of term ti, of document dm, and of their relation to each other. Examples: dictionary information about ti (e.g. IDF); parameters describing dm (e.g. length or number of unique terms); information about the appearance of ti in dm (e.g. in title or abstract), its frequency, the distance between two query terms, etc. Reference: Fuhr & Buckley [4]
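A minimal sketch of such a relevance description as a feature tuple; the concrete attribute choice below is illustrative, not the original DIA feature set, and the document representation (doc dict, idf table) is assumed for the example.

```python
# Illustrative relevance description x(t, d); the attribute set is an
# invented example, not the original DIA features.

def relevance_description(term, doc, idf):
    """Attributes of the term, the document, and their relation."""
    return (
        round(idf.get(term, 0.0), 2),   # dictionary information about the term
        len(doc["terms"]),              # document parameter: length
        term in doc["title_terms"],     # relational: occurs in the title?
        doc["terms"].count(term),       # relational: frequency in the document
    )

doc = {"terms": ["probabilistic", "ir", "models", "ir"],
       "title_terms": {"probabilistic", "ir"}}
print(relevance_description("ir", doc, {"ir": 1.2}))  # (1.2, 4, True, 2)
```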

  12. DIA - Decision Step Estimation of the probability P(R | x(ti, dm)). P(R | x(ti, dm)) is the probability of a document dm being relevant to an arbitrary query, given that a term common to both document and query has the relevance description x(ti, dm). Advantages: abstraction from specific term-document pairs and thus generalization to arbitrary documents and queries; enables individual, application-specific relevance descriptions.
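The simplest estimator for P(R | x), sketched here under the assumption that descriptions are discrete tuples, is the relative frequency of relevance among all training events with the same description; with many or continuous attributes one would fit a regression model instead, as Fuhr & Buckley do [4]. The data layout is invented for the sketch.

```python
from collections import Counter

# Relative-frequency estimate of P(R | x): among all training events
# with description x, the fraction judged relevant.

def estimate_p_relevant(training_events):
    """training_events: list of (x_description_tuple, is_relevant) pairs."""
    total = Counter(x for x, _ in training_events)
    relevant = Counter(x for x, rel in training_events if rel)
    return {x: relevant[x] / total[x] for x in total}
```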

  13. DIA - (Very) Simple Example Relevance description: x(ti, dm) = (x1, x2) with x1 = 1 if ti occurs in the title of dm, 0 otherwise; and x2 = 1 if ti occurs in dm exactly once, 2 if ti occurs in dm at least twice. Training set: q1, q2, d1, d2, d3. Event space: [table not preserved in the transcript]
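Since the event-space table itself did not survive, here is a hypothetical set of training events in the same (x1, x2) format, only to show how the relative-frequency estimate from slide 12 would be computed; none of these judgments are from the original example.

```python
from collections import Counter

# Hypothetical training events (NOT the original table): one entry
# (x1, x2, is_relevant) per (query, document, term) triple.
events = [
    (1, 2, True), (1, 1, True), (0, 2, True),
    (0, 1, False), (0, 1, False), (1, 1, False),
]

total = Counter((x1, x2) for x1, x2, _ in events)
relevant = Counter((x1, x2) for x1, x2, rel in events if rel)
for x in sorted(total):
    print(x, relevant[x] / total[x])
# (0, 1) 0.0   (0, 2) 1.0   (1, 1) 0.5   (1, 2) 1.0
```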
