## Overview of Text Retrieval: Part 2 (Dragon Star Program Course: Information Retrieval / 龙星计划课程: 信息检索)


Presentation Transcript

**龙星计划课程: 信息检索 Overview of Text Retrieval: Part 2**

ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, czhai@cs.uiuc.edu

**Outline**

- Other retrieval models
- Implementation of a TR system
- Applications of TR techniques

**P-norm (Extended Boolean) (Salton et al. 83)**

- Motivation: how to rank documents with a Boolean query?
- Intuitions:
  - Docs satisfying the query constraint should get the highest ranks
  - Partial satisfaction of the query constraint can be used to rank other docs
- Question: how to capture "partial satisfaction"?

**P-norm: Basic Ideas**

- Normalize term weights in the doc representation to [0,1]
- Define similarity between a Boolean query and a doc vector geometrically:
  - For Q = T1 OR T2, similarity is the normalized distance of the doc point (x, y) from (0,0), the point that fully violates the query
  - For Q = T1 AND T2, similarity is one minus the normalized distance of (x, y) from (1,1), the point that fully satisfies the query
- (The original slide illustrates this with two unit-square diagrams, one per operator.)

**P-norm: Formulas**

- sim(Q = T1 OR T2, d) = ((x^p + y^p) / 2)^(1/p)
- sim(Q = T1 AND T2, d) = 1 - (((1-x)^p + (1-y)^p) / 2)^(1/p)
- Since the similarity value is normalized to [0,1], these two formulas can be applied recursively
- The parameter p interpolates between models: p = 1 gives vector-space behavior, and p → ∞ gives Boolean/fuzzy logic

**P-norm: Summary**

- A general (and elegant) similarity function for a Boolean query and a regular document vector
- Connects the Boolean model and the vector space model, with models in between
- Allows different "confidence" on Boolean operators (a different p for different operators)
- A model worth more exploration (e.g., how to learn optimal p values from feedback?)

**Overview of Retrieval Models**

(The original slide is a diagram relating the models; its structure is roughly as follows.)

- Relevance measured via similarity of representations (Rep(q), Rep(d)):
  - Vector space model (Salton et al. 75)
  - Prob. distr. model (Wong & Yao 89)
- Relevance via probabilistic inference, P(d→q) or P(q→d), with different inference systems and representations:
  - Inference network model (Turtle & Croft 91)
  - Prob. concept space model (Wong & Yao 95)
- Probability of relevance P(r=1|q,d), r ∈ {0,1}:
  - Regression model (Fox 83)
  - Generative models, split by factoring direction (doc generation vs. query generation):
    - Doc generation: the classical prob.
model (Robertson & Sparck Jones 76)
  - Query generation: the LM approach (Ponte & Croft 98; Lafferty & Zhai 01a)
  - Learning to rank (Joachims 02; Burges et al. 05)

**Formally…**

- Three random variables: query Q, document D, relevance R ∈ {0,1}
- Given a particular query q and a particular document d: p(R=1|Q=q, D=d) = ?
- The basic question: what is the probability that THIS document is relevant to THIS query?

**Probability of Relevance**

- Three random variables: query Q, document D, relevance R ∈ {0,1}
- Goal: rank D based on P(R=1|Q,D)
- Evaluate P(R=1|Q,D); actually, we only need to compare P(R=1|Q,D1) with P(R=1|Q,D2), i.e., rank documents
- Several different ways to refine P(R=1|Q,D)

**Refining P(R=1|Q,D), Method 1: Conditional Models**

- Basic idea: relevance depends on how well a query matches a document
- Define features on Q × D, e.g., # matched terms, highest IDF of a matched term, doc length, …
- P(R=1|Q,D) = g(f1(Q,D), f2(Q,D), …, fn(Q,D), θ)
- Use training data (known relevance judgments) to estimate the parameters θ
- Apply the model to rank new documents
- Special case: logistic regression

**Logistic Regression (Cooper 92, Gey 94)**

- logit function: logit(p) = log(p / (1 - p))
- logistic (sigmoid) function: P(R=1|Q,D) = 1 / (1 + exp(-(β0 + Σi βi Xi))), which maps scores into (0, 1)
- Uses six features X1, …, X6

**Features/Attributes**

- Average absolute query frequency
- Query length
- Average absolute document frequency
- Document length
- Average inverse document frequency
- Inverse document frequency
- Number of terms in common between query and document (logged)

**Logistic Regression: Pros & Cons**

- Advantages:
  - An absolute probability of relevance is available
  - Can reuse all past relevance judgments
- Problems:
  - Performance is very sensitive to the selection of features
  - Not much guidance on feature selection
- In practice, performance tends to be average

**Refining P(R=1|Q,D), Method 2: Generative Models**

- Basic idea:
  - Define P(Q,D|R)
  - Compute the odds O(R=1|Q,D) using Bayes' rule
- Special cases:
  - Document "generation": P(Q,D|R) = P(D|Q,R) P(Q|R), where P(Q|R) can be ignored for ranking D
  - Query "generation": P(Q,D|R) = P(Q|D,R) P(D|R)

**Document Generation**

- Contrast a model of relevant docs for Q with a model of non-relevant docs for Q
- Assume independent attributes A1 … Ak (why?)
- Let D = d1 … dk, where dk ∈ {0,1} is the value of attribute Ak (similarly Q = q1 … qk)
- Assume non-query terms are equally likely to appear in relevant and non-relevant docs

**Robertson-Sparck Jones Model (Robertson & Sparck Jones 76)**

- The RSJ model has two parameters for each term Ai:
  - pi = P(Ai=1|Q,R=1): prob. that term Ai occurs in a relevant doc
  - qi = P(Ai=1|Q,R=0): prob. that term Ai occurs in a non-relevant doc
- How to estimate the parameters? With relevance judgments, use smoothed counts such as pi ≈ (# relevant docs containing Ai + 0.5) / (# relevant docs + 1); the "+0.5" and "+1" can be justified by Bayesian estimation

**RSJ Model: No Relevance Info (Croft & Harper 79)**

- How to estimate the RSJ parameters when we do not have relevance judgments?
  - Assume pi to be a constant
  - Estimate qi by assuming all documents to be non-relevant
- N: # documents in the collection; ni: # documents in which term Ai occurs

**RSJ Model: Summary**

- The most important classic probabilistic IR model
- Uses only term presence/absence, thus also referred to as the Binary Independence Model
- Essentially naïve Bayes for doc ranking
- Most natural for relevance/pseudo feedback
- Without relevance judgments, the model parameters must be estimated in an ad hoc way
- Performance isn't as good as a tuned VS model

**Improving RSJ: Adding TF**

- Basic doc generation model: let D = d1 … dk, where dk is the frequency count of term Ak
- Model term frequencies with a 2-Poisson mixture model
- Many more parameters to estimate! (How many exactly?)

**BM25/Okapi Approximation (Robertson et al. 94)**

- Idea: approximate p(R=1|Q,D) with a simpler function that shares similar properties
- Observations:
  - log O(R=1|Q,D) is a sum of term weights Wi
  - Wi = 0 if TFi = 0
  - Wi increases monotonically with TFi
  - Wi has an asymptotic limit
- The simple function is TFi (k1 + 1) / (k1 + TFi)

**Adding Doc Length & Query TF**

- Incorporating doc length:
  - Motivation: the 2-Poisson model assumes equal document lengths
  - Implementation: "carefully" penalize long docs
- Incorporating query TF:
  - Motivation: appears not to be well justified
  - Implementation: a similar TF transformation
- The final formula is called BM25, achieving top TREC performance

**The BM25 Formula**

- (The original slide shows the full formula, built from the "Okapi TF/BM25 TF" transformation; in its standard form it is)
  score(Q,D) = Σ_{i ∈ Q} log[(N - ni + 0.5)/(ni + 0.5)] · [(k1+1) TFi / (k1((1-b) + b·|D|/avdl) + TFi)] · [(k3+1) QTFi / (k3 + QTFi)]

**Extensions of "Doc Generation" Models**

- Capture term dependence (Rijsbergen & Harper 78)
- Alternative ways to incorporate TF (Croft 83, Kalt 96)
- Feature/term selection for feedback (Okapi's TREC reports)
- Other possibilities (machine learning, …)

**Query Generation**

- Ranking is based on the query likelihood p(q|θd) times the document prior p(d)
- Assuming a uniform prior, ranking reduces to the query likelihood alone
- Now, the question is how to compute p(q|θd); generally this involves two steps:
  1. Estimate a language model θd based on D
  2. Compute the query likelihood according to the estimated model
- This leads to the so-called "Language Modeling Approach" …

**What is a Statistical LM?**

- A probability distribution over word sequences, e.g.:
  - p("Today is Wednesday") ≈ 0.001
  - p("Today Wednesday is") ≈ 0.0000000000001
  - p("The eigenvalue is positive") ≈ 0.00001
- Context-dependent!
- Can also be regarded as a probabilistic mechanism for "generating" text, thus also called a "generative" model

**The Simplest Language Model (Unigram Model)**

- Generate a piece of text by generating each word INDEPENDENTLY
- Thus, p(w1 w2 … wn) = p(w1) p(w2) … p(wn)
- Parameters: {p(wi)}, with p(w1) + … + p(wN) = 1 (N is the vocabulary size)
- Essentially a multinomial distribution over words
- A piece of text can be regarded as a sample drawn according to this word distribution

**Text Generation with Unigram LM**

- (The original slide illustrates sampling a document from a unigram LM p(w|θ) for each of two topics.)
- Topic 1 ("text mining"): text 0.2, mining 0.1, clustering 0.02, association 0.01, …, food 0.00001, …
- Topic 2 ("health"): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …

**Estimation of Unigram LM**

- The inverse problem: given a document, estimate p(w|θ) = ?
- Example: a "text mining paper" with total # words = 100 and counts text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1
- Maximum likelihood gives p(text) = 10/100, p(mining) = 5/100, p(association) = 3/100, p(database) = 3/100, …, p(query) = 1/100

**Language Models for Retrieval (Ponte & Croft 98)**

- Which model would most likely have generated this query?
- Query = "data mining algorithms"
- A text-mining paper's LM (text ?, mining ?, association ?, clustering ?, …, food ?, …) assigns the query a much higher likelihood than a food-nutrition paper's LM (food ?, nutrition ?, healthy ?, diet ?, …)

**Ranking Docs by Query Likelihood**

- Estimate a doc LM θdi for each document di, then rank the documents d1, d2, …, dN by the query likelihoods p(q|θd1), p(q|θd2), …, p(q|θdN)

**Retrieval as Language Model Estimation**

- Document ranking based on query likelihood: log p(q|d) = Σi log p(wi|d) for query q = w1 w2 … wn
- The retrieval problem thus becomes estimation of the document language model p(wi|d)
- Smoothing is an important issue, and distinguishes different approaches

**How to Estimate p(w|d)?**

- Simplest solution: the maximum likelihood estimator, p(w|d) = relative frequency of word w in d
- What if a word doesn't appear in the text? Then p(w|d) = 0
- In general, what probability should we give a word that has not been observed?
- If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of observed words
- This is what "smoothing" is about …

**Language Model Smoothing (Illustration)**

- (The original slide plots P(w) over words w for the max-likelihood estimate and the smoothed LM: smoothing lowers the peaks of seen words and lifts unseen words above zero.)

**How to Smooth?**

- All smoothing methods try to:
  - discount the probability of words seen in a document
  - re-allocate the extra counts so that unseen words get a non-zero count
- A simple method (additive smoothing): add a constant to the count of each word
- Problems?
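The additive ("add one") smoothing just described can be sketched in a few lines of Python; the toy document and vocabulary below are invented for illustration:

```python
from collections import Counter

def additive_smoothing(doc_tokens, vocab, delta=1.0):
    """p(w|d) = (c(w,d) + delta) / (|d| + delta * |V|).
    delta = 1 gives "add one" (Laplace) smoothing."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens) + delta * len(vocab)
    return {w: (counts[w] + delta) / total for w in sorted(vocab)}

doc = ["text", "mining", "text"]
vocab = {"text", "mining", "retrieval", "model"}
p = additive_smoothing(doc, vocab)
```

Note the problem the slide hints at: every unseen word gets the same probability mass regardless of how common it is in the collection, which is what the reference-model smoothing schemes avoid.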
- "Add one" (Laplace) smoothing: p(w|d) = (c(w,d) + 1) / (|d| + |V|), where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size

**A General Smoothing Scheme**

- All smoothing methods try to:
  - discount the probability of words seen in a doc
  - re-allocate the extra probability so that unseen words get a non-zero probability
- Most use a reference model (the collection language model) to discriminate unseen words:
  p(w|d) = p_seen(w|d) (a discounted ML estimate) if w is seen in d, and αd p(w|C) otherwise

**Smoothing & TF-IDF Weighting**

- Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain a sum of term weights with a TF weighting part (from p_seen), an IDF-like weighting part (from dividing by p(w|C)), and a doc length normalization part (a long doc is expected to have a smaller αd); the query-only term can be ignored for ranking
- In short: smoothing with p(w|C) plays the role of TF-IDF weighting + length normalization

**Derivation of the Query Likelihood Retrieval Formula**

- Retrieval formula using the general smoothing scheme (discounted ML estimate p_seen plus reference language model p(w|C)), for q = w1 … wn:
  log p(q|d) = Σ_{i: c(wi,d)>0} log[ p_seen(wi|d) / (αd p(wi|C)) ] + n log αd + Σ_{i=1..n} log p(wi|C)
- The key rewriting step splits the sum into seen and unseen words; the last term does not depend on d and can be dropped for ranking
- Similar rewritings are very common when using LMs for IR …

**More Smoothing Methods**

- Method 1 (absolute discounting): subtract a constant δ from the count of each seen word:
  p(w|d) = max(c(w,d) - δ, 0)/|d| + (δ |d|u / |d|) p(w|REF), where |d|u is the # unique words in d
- Method 2 (linear interpolation, Jelinek-Mercer): "shrink" uniformly toward p(w|REF), with parameter λ and the ML estimate c(w,d)/|d|:
  p(w|d) = (1 - λ) c(w,d)/|d| + λ p(w|REF)

**More Smoothing Methods (cont.)**

- Method 3 (Dirichlet prior/Bayesian): assume μ pseudo counts distributed according to p(w|REF), with parameter μ:
  p(w|d) = (c(w,d) + μ p(w|REF)) / (|d| + μ)
- Method 4 (Good-Turing): assume the total # of unseen events equals n1 (the # of singletons), and adjust the counts of seen events in the same way

**Dirichlet Prior Smoothing**

- ML estimator: θ_ML = argmax_θ p(d|θ)
- Bayesian estimator:
  - First consider the posterior: p(θ|d) = p(d|θ) p(θ) / p(d)
  - Then take the mean or mode of the posterior dist.
- p(d|θ): the sampling distribution (of the data)
- p(θ) = p(θ1, …, θN): our prior on the model parameters
- A conjugate prior can be interpreted as "extra"/"pseudo" data
- The Dirichlet distribution is a conjugate prior for the multinomial sampling distribution, with "extra"/"pseudo" word counts αi = μ p(wi|REF)

**Dirichlet Prior Smoothing (cont.)**

- The posterior distribution of the parameters is again a Dirichlet distribution
- The predictive distribution is the same as the posterior mean, which is exactly Dirichlet prior smoothing

**Advantages of Language Models**

- Solid statistical foundation
- Parameters can be optimized automatically using statistical estimation methods
- Can easily model many different retrieval tasks
- To be covered more later

**What You Should Know**

- The global relationship among different probabilistic models
- How logistic regression works
- How the Robertson-Sparck Jones model works
- The BM25 formula
- All document-generation models have trouble when no relevance judgments are available
- How the language modeling approach (query likelihood scoring) works
- How Dirichlet prior smoothing works
- Three state-of-the-art retrieval models: pivoted normalization, Okapi/BM25, and query likelihood (with Dirichlet prior smoothing)

**IR System Architecture**

- (The original slide is a diagram.) Docs flow through INDEXING into a doc representation; the user's query flows through a query representation into SEARCHING, which ranks documents against the index and returns results through the INTERFACE; the user's judgments feed back into QUERY MODIFICATION.

**Indexing**

- Indexing = converting documents to data structures that enable fast search
- The inverted index is the dominant indexing method (used by all search engines)
- Other indices (e.g., a document index) may be needed for feedback

**Inverted Index**

- Fast access to all docs containing a given term (along with frequency and position information)
- For each term, we get a list of tuples (docID, freq, pos)
- Given a query, we can fetch the lists for all query terms and work on the involved documents
- Boolean query: set operations on posting lists
- Natural language query: term weight summing
- More efficient than scanning docs (why?)

**Inverted Index Example**

- Doc 1: "This is a sample document with one sample sentence"
- Doc 2: "This is another sample document"
- The dictionary maps each term to its postings, e.g., "sample" → (Doc 1, freq 2), (Doc 2, freq 1)

**Data Structures for Inverted Index**

- Dictionary: modest size
  - Needs fast random access
  - Preferred to be in memory
  - Hash table, B-tree, trie, …
- Postings: huge
  - Sequential access is expected
  - Can stay on disk
  - May contain docID, term freq., term pos., etc.
  - Compression is desirable
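As a concrete illustration of the P-norm slides above, here is a minimal sketch of the OR/AND similarity formulas (the function names are mine, and p = 2 is an arbitrary choice):

```python
def pnorm_or(x, y, p=2.0):
    """P-norm similarity for Q = T1 OR T2: normalized distance of (x, y) from (0,0)."""
    return ((x ** p + y ** p) / 2) ** (1 / p)

def pnorm_and(x, y, p=2.0):
    """P-norm similarity for Q = T1 AND T2: 1 - normalized distance of (x, y) from (1,1)."""
    return 1 - (((1 - x) ** p + (1 - y) ** p) / 2) ** (1 / p)
```

With p = 1 the operators collapse to simple averaging (vector-space-like behavior), while letting p grow recovers strict Boolean/fuzzy semantics, which is exactly the spectrum the summary slide describes.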
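The Croft & Harper estimate from the RSJ slides (constant pi, all docs assumed non-relevant for qi) yields an IDF-like term weight; a sketch under that assumption:

```python
import math

def rsj_weight(N, n_i):
    """RSJ term weight with no relevance info (Croft & Harper 79 style):
    log((N - n_i + 0.5) / (n_i + 0.5)), an IDF-like quantity, where the
    0.5 terms are the Bayesian "+0.5" correction from the RSJ slide."""
    return math.log((N - n_i + 0.5) / (n_i + 0.5))
```

Rarer terms (small n_i relative to the collection size N) receive larger weights, matching the binary-independence-model intuition.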
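The BM25 slides lost their equation to a slide image; the sketch below implements one common form of the scoring function (the defaults k1 = 1.2, b = 0.75 are typical values, not necessarily the lecture's, and query-TF saturation is omitted for brevity):

```python
import math

def bm25(query, doc_tf, doc_len, avg_doc_len, df, N, k1=1.2, b=0.75):
    """Sum of term weights W_i: an RSJ-style IDF times the saturating
    'Okapi TF' transformation tf*(k1+1) / (tf + k1*length_norm)."""
    score = 0.0
    for t in query:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue  # W_i = 0 when TF_i = 0
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        norm = k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score
```

The tests below check the two properties the approximation slide lists: W_i increases monotonically with TF_i but approaches an asymptotic limit of idf * (k1 + 1).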
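The query-likelihood scoring with Jelinek-Mercer and Dirichlet smoothing described above can be sketched end to end; the toy documents and parameter values are invented for illustration:

```python
import math
from collections import Counter

def query_loglik(query, doc, collection, method="dirichlet", lam=0.5, mu=2000):
    """log p(q|d) under a smoothed unigram document LM.
    'jm':        p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|C)
    'dirichlet': p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)"""
    c_d, c_coll = Counter(doc), Counter(collection)
    d_len, c_len = len(doc), len(collection)
    loglik = 0.0
    for w in query:
        p_c = c_coll[w] / c_len  # collection (reference) model
        if method == "jm":
            p_w = (1 - lam) * c_d[w] / d_len + lam * p_c
        else:
            p_w = (c_d[w] + mu * p_c) / (d_len + mu)
        loglik += math.log(p_w)
    return loglik

d1 = ["text", "mining", "mining", "algorithms"]  # a "text mining" doc
d2 = ["food", "nutrition", "healthy", "diet"]    # a "health" doc
collection = d1 + d2
q = ["mining", "algorithms"]
```

Ranking all documents by this log-likelihood is the query-likelihood method of Ponte & Croft; the Dirichlet branch is the posterior-mean predictive distribution derived on the Dirichlet-prior slides.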
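The inverted-index slides can be made concrete with a small sketch that builds (docID, freq, positions) postings for the two example documents and answers a Boolean AND query by posting-list intersection:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, freq, positions)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term in sorted(positions):
            index[term].append((doc_id, len(positions[term]), positions[term]))
    return index

docs = ["This is a sample document with one sample sentence",
        "This is another sample document"]
index = build_inverted_index(docs)

# Boolean AND query = set intersection of the two postings lists
hits = {p[0] for p in index["sample"]} & {p[0] for p in index["document"]}
```

In a real engine the dictionary would live in memory (hash table, B-tree, or trie) and the much larger postings lists would stay on disk, compressed, as the final slide notes.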
