
Language Models for Information Retrieval


Presentation Transcript


  1. Language Models for Information Retrieval Andy Luong and Nikita Sudan

  2. Outline • Language Model • Types of Language Models • Query Likelihood Model • Smoothing • Evaluation • Comparison with other approaches

  3. Language Model • A language model is a function that puts a probability measure over strings drawn from some vocabulary.

  4. Language Models • Rank documents by P(q | Md), the probability that the document's language model Md generates the query, instead of directly estimating the relevance probability P(R = 1 | q, d)

  5. Example • Doc1: “frog said that toad likes frog” • Doc2: “toad likes frog” • M1: P(frog) = 2/6 = 1/3, P(said) = P(that) = P(toad) = P(likes) = 1/6 • M2: P(toad) = P(likes) = P(frog) = 1/3

  6. Example Continued q = “frog likes toad” P(q | M1) = (1/3) · (1/6) · (1/6) · 0.8 · 0.8 · 0.2 ≈ 0.0012 P(q | M2) = (1/3) · (1/3) · (1/3) · 0.8 · 0.8 · 0.2 ≈ 0.0047 P(q | M1) < P(q | M2) • The 0.8 factors are the probability of continuing after a term and 0.2 is the probability of stopping
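The computation on this slide can be sketched in Python. The function names and the reading of 0.8 as a per-term continue probability (with 0.2 for STOP) are assumptions for illustration:

```python
def unigram_lm(doc):
    """Maximum-likelihood unigram model: term -> relative frequency."""
    tokens = doc.split()
    return {t: tokens.count(t) / len(tokens) for t in set(tokens)}

def query_likelihood(query, model, p_continue=0.8):
    """P(q | Md): product of term probabilities with continue/STOP factors."""
    tokens = query.split()
    p = 1.0
    for i, t in enumerate(tokens):
        p *= model.get(t, 0.0)
        # continue (0.8) after every term except the last, then STOP (0.2)
        p *= p_continue if i < len(tokens) - 1 else (1 - p_continue)
    return p

m1 = unigram_lm("frog said that toad likes frog")
m2 = unigram_lm("toad likes frog")
p1 = query_likelihood("frog likes toad", m1)  # (1/3)(1/6)(1/6) * 0.8*0.8*0.2
p2 = query_likelihood("frog likes toad", m2)  # (1/3)^3 * 0.8*0.8*0.2
```

As on the slide, the shorter document wins because every query term is relatively frequent in it.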

  7. Types of Language Models • Chain rule: P(t1 t2 t3 t4) = P(t1) P(t2 | t1) P(t3 | t1 t2) P(t4 | t1 t2 t3) • Unigram LM: P(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4) • Bigram LM: P(t1 t2 t3 t4) = P(t1) P(t2 | t1) P(t3 | t2) P(t4 | t3)
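A minimal sketch of estimating a bigram model by maximum likelihood, i.e. counting adjacent token pairs; names are illustrative:

```python
def bigram_lm(doc):
    """Maximum-likelihood bigram model: P(t_i | t_{i-1}) from pair counts."""
    toks = doc.split()
    counts = {}
    for prev, cur in zip(toks, toks[1:]):
        row = counts.setdefault(prev, {})
        row[cur] = row.get(cur, 0) + 1
    # normalize each row of counts into conditional probabilities
    return {prev: {cur: c / sum(row.values()) for cur, c in row.items()}
            for prev, row in counts.items()}

m = bigram_lm("frog said that toad likes frog")
# in this document, every observed history is followed by exactly one term
```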

  8. Multinomial Distribution • P(d) = L_d! / (tf_{t1,d}! tf_{t2,d}! ⋯ tf_{tM,d}!) · P(t1)^tf_{t1,d} P(t2)^tf_{t2,d} ⋯ P(tM)^tf_{tM,d} • The multinomial coefficient accounts for the orderings consistent with the observed term frequencies; L_d is the length of d and M is the size of the term vocabulary
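A sketch of this multinomial document probability, assuming whitespace tokenization; the helper name is hypothetical:

```python
from math import factorial, prod

def multinomial_doc_prob(doc, model):
    """P(d) = L_d! / (prod of tf!) * prod of P(t)^tf under a unigram model."""
    tokens = doc.split()
    tfs = {t: tokens.count(t) for t in set(tokens)}
    coeff = factorial(len(tokens))          # L_d!
    for tf in tfs.values():
        coeff //= factorial(tf)             # divide by each tf_{t,d}!
    return coeff * prod(model[t] ** tf for t, tf in tfs.items())

# e.g. "toad likes frog" under a uniform model: 3! * (1/3)^3 = 6/27 = 2/9
p = multinomial_doc_prob("toad likes frog",
                         {"toad": 1/3, "likes": 1/3, "frog": 1/3})
```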

  9. Query Likelihood Model

  10. Query Likelihood Model • Infer LM for each document • Estimate P(q | Md(i)) • Rank documents based on probabilities

  11. MLE • Maximum likelihood estimate: P̂(t | Md) = tf_{t,d} / L_d, the frequency of term t in d divided by the length of d

  12. Smoothing • Basic intuition: a query term that is new or unseen in the document gets P(t | Md) = 0 • A single zero makes the whole product P(q | Md) = 0 • Why else should we smooth? Documents are small samples, so the MLE overestimates seen terms and underestimates unseen ones

  13. Smoothing Continued • Non-occurring term bound: for a term t absent from d, the smoothed P(t | Md) should be at most cf_t / T, where cf_t is the collection frequency of t and T is the total number of tokens in the collection • Linear interpolation language model: P(t | d) = λ P̂(t | Md) + (1 − λ) P̂(t | Mc), mixing the document model with the collection model Mc

  14. Example • Doc1: “frog said that toad likes frog” • Doc2: “toad likes frog” • Collection model Mc over all 9 tokens: P(frog) = 3/9 = 1/3, P(said) = P(that) = 1/9, P(toad) = P(likes) = 2/9

  15. Example Continued q = “frog said”, λ = 1/2 P(q | M1) = [(1/3 + 1/3) · (1/2)] · [(1/6 + 1/9) · (1/2)] = 5/108 ≈ .046 P(q | M2) = [(1/3 + 1/3) · (1/2)] · [(0 + 1/9) · (1/2)] = 1/54 ≈ .019 P(q | M1) > P(q | M2)
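This smoothed calculation can be checked with a short sketch of linear-interpolation (Jelinek-Mercer) smoothing; the function names are illustrative:

```python
def lm(text):
    """Maximum-likelihood unigram model over whitespace tokens."""
    toks = text.split()
    return {t: toks.count(t) / len(toks) for t in set(toks)}

def smoothed_likelihood(query, doc_model, coll_model, lam=0.5):
    """P(q | d) with P(t | d) = lam * P(t | Md) + (1 - lam) * P(t | Mc)."""
    p = 1.0
    for t in query.split():
        p *= lam * doc_model.get(t, 0.0) + (1 - lam) * coll_model.get(t, 0.0)
    return p

d1, d2 = "frog said that toad likes frog", "toad likes frog"
coll = lm(d1 + " " + d2)                     # collection model over all 9 tokens
p1 = smoothed_likelihood("frog said", lm(d1), coll)  # 5/108
p2 = smoothed_likelihood("frog said", lm(d2), coll)  # 1/54
```

Doc2 still gets a nonzero score for "said" via the collection model, which is the point of smoothing.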

  16. Evaluation • Precision = |relevant documents ∩ retrieved documents| / |retrieved documents| • Recall = |relevant documents ∩ retrieved documents| / |relevant documents|
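A direct translation of these set definitions; the document ids are made up for illustration:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall from sets of document ids."""
    hits = len(set(retrieved) & set(relevant))   # relevant AND retrieved
    return hits / len(retrieved), hits / len(relevant)

# hypothetical run: 4 documents retrieved, 3 are actually relevant overall
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
```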

  17. Tf-idf • The importance of a term increases proportionally to the number of times it appears in the document but is offset by the frequency of the term in the corpus: tf-idf_{t,d} = tf_{t,d} · log(N / df_t)
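One common tf-idf variant (raw term frequency times log inverse document frequency) as a sketch; the tiny corpus is invented for illustration:

```python
from math import log

def tf_idf(term, doc, corpus):
    """tf-idf weight: term frequency in doc scaled by log(N / df)."""
    tf = doc.split().count(term)
    df = sum(1 for d in corpus if term in d.split())  # documents containing term
    return tf * log(len(corpus) / df)

corpus = ["frog said that toad likes frog", "toad likes frog", "cats purr"]
w_frog = tf_idf("frog", corpus[0], corpus)  # frequent in corpus -> low idf
w_said = tf_idf("said", corpus[0], corpus)  # rare in corpus -> higher idf
```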

  18. Ponte and Croft’s Experiments

  19. Pros and Cons • “Mathematically precise, conceptually simple, computationally tractable and intuitively appealing.” • Relevance is not explicitly modeled

  20. Query vs. Document Model (a) Query Likelihood (b) Document Likelihood (c) Model Comparison

  21. KL Divergence • Documents can also be ranked by the Kullback-Leibler divergence between the query model and the document model: KL(Mq ‖ Md) = Σ_t P(t | Mq) log [ P(t | Mq) / P(t | Md) ] • A smaller divergence means the document model is a better match for the query model
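A sketch of the divergence itself, plugging in the smoothed Doc1 probabilities from the earlier example as the document model; assuming a query model that is uniform over the query terms:

```python
from math import log

def kl_divergence(p, q):
    """KL(p || q) over p's support; terms with p(t) = 0 contribute 0."""
    return sum(p[t] * log(p[t] / q[t]) for t in p if p[t] > 0)

query_model = {"frog": 0.5, "said": 0.5}      # uniform over q = "frog said"
doc_model = {"frog": 1/3, "said": 5/36}       # smoothed Doc1 values from slide 15
div = kl_divergence(query_model, doc_model)
```

A document whose smoothed model sits closer to the query model gets a smaller divergence and therefore a better rank.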

  22. Thank you.

  23. Questions?
