490 likes | 517 Views
This study presents the development of a better generative model for information retrieval using textual analysis. By hyper-learning the model, parameters are learned from queries to rank documents more effectively. The empirical investigation compares the performance of the exponential framework model with existing probabilistic models, aiming to improve retrieval in applications such as classification and relevance feedback.
E N D
Empirical Development of anExponential Probabilistic Model Using Textual Analysis to Build a Better Model Jaime Teevan & David R. Karger CSAIL (LCS+AI), MIT
Goal: Better Generative Model • Generative v. discriminative model • Applies to many applications • Information retrieval (IR) • Relevance feedback • Using unlabeled data • Classification • Assumptions explicit
Using a Model for IR Hyper-learn • Define model • Learn parameters from query • Rank documents • Better model improves applications • Trickle down to improve retrieval • Classification, relevance feedback, … • Corpus specific models
Overview • Related work • Probabilistic models • Example: Poisson Model • Compare model to text • Hyper-learning the model • Exponential framework • Investigate retrieval performance • Conclusion and future work
Related Work • Using text for retrieval algorithm • [Jones, 1972], [Greiff, 1998] • Using text to model text • [Church & Gale, 1995], [Katz, 1996] • Learning model parameters • [Zhai & Lafferty, 2002] Hyper-learn the model from text!
Probabilistic Models • Rank documents by RV =Pr(rel|d) • Naïve Bayesian models RV =Pr(rel|d)
Probabilistic Models • Rank documents by RV =Pr(rel|d) • Naïve Bayesian models # occs in doc = Pr(dt|rel) features t RV =Pr(rel|d) Pr(d|rel) 8 words • Open assumptions • Feature definition • Feature distribution family Defines the model!
Using a Naïve Bayesian Model • Define model • Learn parameters from query • Rank documents
Using a Naïve Bayesian Model • Define model • Learn parameters from query • Rank documents Pr(dt|rel) =
Using a Naïve Bayesian Model • Define model • Learn parameters from query • Rank documents • Poisson Model • θ: specifies term distribution dt -θ θ e Pr(dt|rel) = dt!
Example Poisson Distribution + θ=0.0006 Pr(dt|rel) Pr(dt|rel)≈1E-15 Term occurs exactlydt times
Using a Naïve Bayesian Model • Define model • Learn parameters from query • Rank documents • Learn a θ for each term • Maximum likelihood θ • Term’s average number of occurrence • Incorporate prior expectations
Using a Naïve Bayesian Model • Define model • Learn parameters from query • Rank documents
Using a Naïve Bayesian Model • Define model • Learn parameters from query • Rank documents • For each document, find RV • Sort documents by RV = Pr(dt|rel). words t RV
Using a Naïve Bayesian Model • Define model • Learn parameters from query • Rank documents Which step goes wrong? • For each document, find RV • Sort documents by RV = Pr(dt|rel). words t RV
Using a Naïve Bayesian Model • Define model • Learn parameters from query • Rank documents
Using a Naïve Bayesian Model • Define model • Learn parameters from query • Rank documents dt -θ θ e Pr(dt|rel) = dt!
How Good is the Model? + θ=0.0006 Pr(dt|rel) 15 times Term occurs exactlydt times
How Good is the Model? + θ=0.0006 Pr(dt|rel) Misfit! 15 times Term occurs exactlydt times
Hyper-learning a Better FitThrough Textual Analysis Using an Exponential Framework
Hyper-Learning Framework • Need framework for hyper-learning Mixtures Poisson Bernoulli Normal
Hyper-Learning Framework • Need framework for hyper-learning • Goal: Same benefits as Poisson Model • One parameter • Easy to work with (e.g., prior) Mixtures Poisson Bernoulli Normal One parameter exponential families
Exponential Framework • Well understood, learning easy • [Bernardo & Smith, 1994], [Gous, 1998] Pr(dt|rel) = f(dt)g(θ)e • Functions f(dt) and h(dt) specify family • E.g., Poisson: f(dt) = (dt!)-1,h(dt) = dt • Parameter θ term’s specific distribution θh(dt)
Using a Hyper-learned Model • Define model • Learn parameters from query • Rank documents
Using a Hyper-learned Model • Hyper-learn model • Learn parameters from query • Rank documents
Using a Hyper-learned Model • Hyper-learn model • Learn parameters from query • Rank documents • Want “best” f(dt) and h(dt) • Iterative hill climbing • Local maximum • Poisson starting point
Using a Hyper-learned Model • Hyper-learn model • Learn parameters from query • Rank documents • Data: TREC query result sets • Past queries to learn about future queries • Hyper-learn and test with different sets
Recall the Poisson Distribution + Pr(dt|rel) 15 times Term occurs exactlydt times
Poisson Starting Point - h(dt) + h(dt) Pr(dt|rel) =f(dt)g(θ)e θh(dt) dt
Hyper-learned Model - h(dt) Hyper-learned Model - h(dt) + h(dt) Pr(dt|rel) =f(dt)g(θ)e θh(dt) dt
Poisson Distribution + Pr(dt|rel) 15 times Term occurs exactlydt times
Hyper-learned Distribution Hyper-learned Distribution + Pr(dt|rel) 15 times Term occurs exactlydt times
Hyper-learned Distribution Hyper-learned Distribution + Pr(dt|rel) 5 times Term occurs exactlydt times
Hyper-learned Distribution Hyper-learned Distribution + Pr(dt|rel) 30 times Term occurs exactlydt times
Hyper-learned Distribution Hyper-learned Distribution + Pr(dt|rel) 300 times Term occurs exactlydt times
Performing Retrieval • Hyper-learn model • Learn parameters from query • Rank documents
Performing Retrieval Labeled docs • Hyper-learn model • Learn parameters from query • Rank documents θh(dt) Pr(dt|rel) = f(dt)g(θ)e • Learn θ for each term
Learning θ • Sufficient statistics • Summarize all observed data • τ1: # of observations • τ2: Σobservations d h(dt) • Incorporating prior easy • Map τ1 and τ2θ 20 labeled documents
Performing Retrieval • Hyper-learn model • Learn parameters from query • Rank documents
Results: Labeled Documents Results: Labeled Documents Precision Recall
Results: Labeled Documents Results: Labeled Documents Precision Recall
Performing Retrieval • Hyper-learn model • Learn parameters from query • Rank documents Short query
Retrieval: Query Retrieval: Query • Query = single labeled document • Vector space-like equation RV = Σa(t,d) + Σb(q,d) • Problem: Document dominates • Solution: Use only query portion • Another solution: Normalize t in doc q in query
Retrieval: Query Precision Recall
Retrieval: Query Precision Recall
Retrieval: Query Precision Recall
Conclusion • Probabilistic models • Example: Poisson Model • Hyper-learning the model • Exponential framework • Learned a better model • Investigate retrieval performance - Bad text model - Easy to work with - Heavy tailed! - Better …
Future Work • Use model better • Use for other applications • Other IR applications • Classification • Correct for document length • Hyper-learn on different corpora • Test if learned model generalizes • Different for genre? Language? People? • Hyper-learn model better
Questions? Contact us with questions: Jaime Teevan teevan@ai.mit.edu David Karger karger@theory.lcs.mit.edu