Information Retrieval Models

Presentation Transcript


  1. Information Retrieval Models http://net.pku.edu.cn/~wbia 黄连恩 hle@net.pku.edu.cn 北京大学信息工程学院 10/14/2014

  2. Outline of This Lecture Information Retrieval Models Vector Space Model (VSM) Latent Semantic Model (LSI) Language Model (LM)

  3. Vector Space Model

  4. Documents as vectors Each document j can be viewed as a vector: each term is a dimension, and the value on each dimension is the log-scaled tf.idf weight. So we have a vector space: terms are axes, and docs live in this space. It is high-dimensional: even with stemming, there may be 20,000+ dimensions.
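As an illustration of this idea, here is a minimal Python sketch of building log-scaled tf.idf document vectors; the toy documents and helper names are assumptions for this example, not from the slides.

```python
import math
from collections import Counter

# Toy collection (illustrative only).
docs = {
    "d1": "digital cameras and video cameras".split(),
    "d2": "digital video editing".split(),
    "d3": "best cameras for travel".split(),
}

N = len(docs)
vocab = sorted({t for toks in docs.values() for t in toks})
df = Counter(t for toks in docs.values() for t in set(toks))  # document frequency

def tfidf_vector(tokens):
    """One dimension per vocabulary term: (1 + log10 tf) * log10(N / df)."""
    tf = Counter(tokens)
    return [
        (1 + math.log10(tf[t])) * math.log10(N / df[t]) if tf[t] > 0 else 0.0
        for t in vocab
    ]

vectors = {d: tfidf_vector(toks) for d, toks in docs.items()}
```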

  5. Intuition [Slide figure: documents d1-d5 plotted as vectors over term axes t1, t2, t3, with angles θ and φ between them] Postulate: documents that are "close together" in the vector space talk about the same things. Use cases: query-by-example, and free-text queries represented as vectors.

  6. Cosine similarity The "closeness" of vectors d1 and d2 can be measured by the angle between them; concretely, the cosine of that angle is used as the similarity. Vectors are normalized by length. [Slide figure: d1 and d2 at angle θ over axes t1, t2, t3]
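A small sketch of the cosine measure described above, with explicit length normalization:

```python
import math

def cosine(v1, v2):
    """Cosine of the angle between v1 and v2: dot product of the length-normalized vectors."""
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(x * x for x in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return sum(a * b for a, b in zip(v1, v2)) / (norm1 * norm2)
```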

  7. #1. COS Similarity Compute the similarity between the query "digital cameras" and the document "digital cameras and video cameras". Assume N = 10,000,000; both the query and the document use logarithmic term weighting (wf columns); the query additionally uses idf weighting, and the document uses cosine normalization. "and" is treated as a stop word.

  8. Sec. 6.4 ddd.qqq = lnc.ltc Columns headed ‘n’ are acronyms for weight schemes.
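The exercise on slide 7 can be sketched under the lnc.ltc scheme named here. Note that the original slide's df table did not survive in this transcript, so the df values below are illustrative assumptions only.

```python
import math

N = 10_000_000
df = {"digital": 10_000, "video": 100_000, "cameras": 50_000}   # assumed df values

query_tf = {"digital": 1, "cameras": 1}            # "and" is a stop word
doc_tf   = {"digital": 1, "video": 1, "cameras": 2}

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(weights):
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

# Query (ltc): log tf * idf, then cosine-normalize.
q = normalize({t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()})
# Document (lnc): log tf, no idf, cosine-normalize.
d = normalize({t: log_tf(tf) for t, tf in doc_tf.items()})

score = sum(q[t] * d.get(t, 0.0) for t in q)
print(round(score, 3))
```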

  9. Relevance Feedback Query Expansion

  10. A Brief History of IR Slides from Prof. Ray Larson University of California, Berkeley School of Information http://courses.sims.berkeley.edu/i240/s11/

  11. Experimental IR systems Probabilistic indexing – Maron and Kuhns, 1960 SMART – Gerard Salton at Cornell – Vector space model, 1970’s SIRE at Syracuse I3R – Croft Cheshire I (1990) TREC – 1992 Inquery Cheshire II (1994) MG (1995?) Lemur (2000?)

  12. Historical Milestones in IR Research 1958 Statistical Language Properties (Luhn) 1960 Probabilistic Indexing (Maron & Kuhns) 1961 Term association and clustering (Doyle) 1965 Vector Space Model (Salton) 1968 Query expansion (Rocchio, Salton) 1972 Statistical Weighting (Sparck-Jones) 1975 2-Poisson Model (Harter, Bookstein, Swanson) 1976 Relevance Weighting (Robertson, Sparck-Jones) 1980 Fuzzy sets (Bookstein) 1981 Probability without training (Croft)

  13. Historical Milestones in IR Research (cont.) 1983 Linear Regression (Fox) 1983 Probabilistic Dependence (Salton, Yu) 1985 Generalized Vector Space Model (Wong, Raghavan) 1987 Fuzzy logic and RUBRIC/TOPIC (Tong, et al.) 1990 Latent Semantic Indexing (Dumais, Deerwester) 1991 Polynomial & Logistic Regression (Cooper, Gey, Fuhr) 1992 TREC (Harman) 1992 Inference networks (Turtle, Croft) 1994 Neural networks (Kwok) 1998 Language Models (Ponte, Croft)

  14. Information Retrieval – Historical View Research: Boolean model, statistics of language (1950’s) Vector space model, probabilistic indexing, relevance feedback (1960’s) Probabilistic querying (1970’s) Fuzzy set/logic, evidential reasoning (1980’s) Regression, neural nets, inference networks, latent semantic indexing, TREC (1990’s) Industry: DIALOG, Lexis-Nexis, STAIRS (Boolean based) Information industry (O($B)) Verity TOPIC (fuzzy logic) Internet search engines (O($100B?)) (vector space, probabilistic)

  15. Research Systems Software INQUERY (Croft) OKAPI (Robertson) PRISE (Harman) http://potomac.ncsl.nist.gov/prise SMART (Buckley) MG (Witten, Moffat) CHESHIRE (Larson) http://cheshire.berkeley.edu LEMUR toolkit Lucene Others

  16. Latent Semantic Model

  17. Vector Space Model: Pros Automatic selection of index terms Partial matching of queries and documents (dealing with the case where no document contains all search terms) Ranking according to similarity score (dealing with large result sets) Term weighting schemes (improves retrieval performance) Various extensions Document clustering Relevance feedback (modifying query vector) Geometric foundation

  18. Problems with Lexical Semantics Polysemy: words typically have a multitude of meanings and different usages. The Vector Space Model cannot distinguish between different senses of the same word, i.e. it cannot resolve ambiguity. Synonymy: different terms may have an identical or similar meaning. The Vector Space Model cannot express associations between words.

  19. Issues in the VSM Terms are assumed to be independent, yet some terms are more likely to appear together: synonyms, related words, spelling mistakes, etc. Depending on context, a term may have different meanings. The term-document matrix is very high-dimensional: does every document/term really have that many important features?

  20. Singular Value Decomposition Apply singular value decomposition (SVD) to the term-document matrix: W_{t×d} = T_{t×r} Σ_{r×r} (D_{d×r})^T, where r is the rank of the matrix, Σ is the diagonal matrix of singular values (sorted in descending order), and T and D have orthonormal columns (T^T T = I, D^T D = I). The squared singular values are the eigenvalues of W W^T, and the columns of T and D are the eigenvectors of W W^T and W^T W, respectively.
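A minimal NumPy sketch of the decomposition above on a toy term-document matrix (the matrix values are made up for illustration):

```python
import numpy as np

W = np.array([[1., 0., 1., 0.],      # rows: terms
              [1., 1., 0., 0.],      # columns: documents
              [0., 1., 0., 1.],
              [0., 0., 1., 1.]])

T, sigma, Dt = np.linalg.svd(W, full_matrices=False)   # W = T @ diag(sigma) @ Dt
Sigma = np.diag(sigma)                                 # singular values in descending order

assert np.allclose(W, T @ Sigma @ Dt)                  # exact reconstruction at full rank
assert np.allclose(T.T @ T, np.eye(T.shape[1]))        # orthonormal columns of T
```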

  21. Singular Values Σ gives an ordering to the dimensions. The values fall off very quickly, and the trailing singular values represent "noise". Cutting off the low-value dimensions can reduce noise and improve performance.

  22. Low-rank Approximation Full decomposition: W_{t×d} = T_{t×r} Σ_{r×r} (D_{d×r})^T. Keeping only the k largest singular values gives the rank-k approximation: W'_{t×d} ≈ T_{t×k} Σ_{k×k} (D_{d×k})^T.
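Continuing the NumPy sketch, the truncation step can be written directly (the function name is mine):

```python
import numpy as np

def low_rank(W, k):
    """Return the rank-k approximation of W and its truncated SVD factors."""
    T, sigma, Dt = np.linalg.svd(W, full_matrices=False)
    Tk, Sk, Dkt = T[:, :k], np.diag(sigma[:k]), Dt[:k, :]
    return Tk @ Sk @ Dkt, (Tk, Sk, Dkt)
```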

  23. Latent Semantic Indexing (LSI) Perform a low-rank approximation of term-document matrix (typical rank 100-300) General idea Map documents (and terms) to a low-dimensional representation. Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space). Compute document similarity based on the inner product in this latent semantic space

  24. What it is From the original term-document matrix A_r we compute its approximation A_k. In A_k each row still corresponds to a term and each column to a document; the difference is that documents now live in a new space with k << r dimensions. How do we compare two terms? Two documents? A term and a document? A_k A_k^T = T Σ D^T D Σ T^T = (TΣ)(TΣ)^T compares terms; A_k^T A_k = D Σ T^T T Σ D^T = (DΣ)(DΣ)^T compares documents; the entry A_k[i,j] compares term i with document j.
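A hedged sketch of the three comparisons above, using the truncated factors (variable names are mine):

```python
import numpy as np

def lsi_similarities(A, k):
    """Term-term, doc-doc, and term-doc similarities in the rank-k LSI space."""
    T, sigma, Dt = np.linalg.svd(A, full_matrices=False)
    Tk, Sk, Dk = T[:, :k], np.diag(sigma[:k]), Dt[:k, :].T
    term_vecs = Tk @ Sk                      # one row per term  (T Sigma)
    doc_vecs = Dk @ Sk                       # one row per document (D Sigma)
    Ak = Tk @ Sk @ Dk.T                      # rank-k approximation
    term_term = term_vecs @ term_vecs.T      # = A_k A_k^T
    doc_doc = doc_vecs @ doc_vecs.T          # = A_k^T A_k
    return term_term, doc_doc, Ak            # Ak[i, j] compares term i and document j
```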

  25. LSI Term matrix T The T matrix holds each term's vector in the LSI space. In the original matrix, term vectors are d-dimensional; in T they are much smaller. The dimensions correspond to groups of terms that tend to "co-occur" with the term in the same documents: synonyms, contextually-related words, variant endings. (TΣ) is used to compute term similarity.

  26. Document matrix D The D matrix is the representation of the documents in the LSI space, with the same dimensionality as the T vectors. (DΣ) is used to compute document similarity, and can also be used to compute the similarity between a query and a document.

  27. Retrieval with LSI The LSI retrieval process: the query is mapped/projected into the LSI document space, which is called "folding in". Since W = T Σ D^T, if the projection of q into that space is q', then q = T Σ q'^T, and hence q' = (Σ^{-1} T^{-1} q)^T = q^T T Σ^{-1}. Folding in therefore means multiplying the document/query vector by T Σ^{-1}. The document vectors of the collection are given by the rows of DΣ, and query and documents are compared with the dot product.
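A sketch of folding in a query and scoring documents by dot product, following the formulas above (names are mine; it assumes the top-k singular values are nonzero):

```python
import numpy as np

def fold_in_and_score(A, q, k):
    """Project raw query vector q (length = #terms) into the k-dim LSI space and score docs."""
    T, sigma, Dt = np.linalg.svd(A, full_matrices=False)
    Tk, Sk, Dk = T[:, :k], np.diag(sigma[:k]), Dt[:k, :].T
    q_lsi = q @ Tk @ np.linalg.inv(Sk)   # folded-in query: q' = q^T T_k Sigma_k^{-1}
    doc_vecs = Dk @ Sk                   # one row per document (rows of D Sigma)
    return doc_vecs @ q_lsi              # dot-product score for each document
```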

  28. Improved Retrieval with LSI The performance gains come from... removal of noise; no need to stem terms (variants will co-occur); no stop list needed. There is no improvement in speed or space, though...

  29. [Worked example on the slide: a toy term-document matrix C and its SVD factors T_r and Σ_r; the matrices themselves did not survive in this transcript]

  30. [Worked example continued: D_r^T, the truncated Σ_2, and D_2^T; the matrices themselves did not survive in this transcript]

  31. Example Map into a 2-dimensional space

  32. Latent Semantic Analysis Latent semantic space: illustrating example courtesy of Susan Dumais

  33. Empirical evidence Experiments on TREC 1/2/3 – Dumais. Precision at or above median TREC precision. Top scorer on almost 20% of TREC topics. Slightly better on average than straight vector spaces. [Slide chart: effect of dimensionality]

  34. LSI has many other applications In many settings we have a feature-object matrix. The matrix is high-dimensional and highly redundant, which is what makes a low-rank approximation possible. In text retrieval, the terms are the features and the docs are the objects: Latent Semantic Indexing. Another example is opinions and users... when the data is incomplete (e.g., users' opinions), the missing values can be recovered in the low-dimensional space. A powerful, general analytical technique.

  35. Language Models

  36. IR based on Language Model (LM) The usual search approach: guess the words the author would have used when writing a relevant document, and use them to form the query. The LM approach directly exploits that idea! [Slide figure: each document d1 ... dn in the collection has a model that generates the query expressing the information need]

  37. Formal Language (Model) The traditional generative model: it generates strings. Finite state machines or regular grammars, etc. Example: I wish I wish I wish I wish ... i.e. the language (I wish)*

  38. Stochastic Language Models Models the probability of generating strings in the language (commonly all strings over alphabet ∑). Model M: P(the) = 0.2, P(a) = 0.1, P(man) = 0.01, P(woman) = 0.01, P(said) = 0.03, P(likes) = 0.02, ... For s = "the man likes the woman", multiply the per-word probabilities 0.2 × 0.01 × 0.02 × 0.2 × 0.01, giving P(s | M) = 0.00000008.
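The calculation above can be reproduced with a few lines of Python (the probability table is the one shown on the slide):

```python
from functools import reduce

M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

def p_string(s, model):
    """Multiply the unigram probabilities of the words in s (0 if a word is unseen)."""
    return reduce(lambda p, w: p * model.get(w, 0.0), s.split(), 1.0)

print(p_string("the man likes the woman", M))   # 0.2 * 0.01 * 0.02 * 0.2 * 0.01 = 8e-08
```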

  39. Stochastic Language Models Model the probability of generating any string. Model M1: the 0.2, class 0.01, sayst 0.0001, pleaseth 0.0001, yon 0.0001, maiden 0.0005, woman 0.01. Model M2: the 0.2, class 0.0001, sayst 0.03, pleaseth 0.02, yon 0.1, maiden 0.01, woman 0.0001. For s = "the class pleaseth yon maiden" the per-word probabilities are 0.2, 0.01, 0.0001, 0.0001, 0.0005 under M1 and 0.2, 0.0001, 0.02, 0.1, 0.01 under M2, so P(s | M2) > P(s | M1).

  40. Stochastic Language Models A statistical model used to generate text: a probability distribution over strings in a given language. For a string of words w1 w2 w3 w4 drawn from model M: P(w1 w2 w3 w4 | M) = P(w1 | M) · P(w2 | M, w1) · P(w3 | M, w1 w2) · P(w4 | M, w1 w2 w3).

  41. Unigram and higher-order models The full chain rule gives P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w1 w2) P(w4 | w1 w2 w3). Unigram Language Models drop the conditioning: P(w1) P(w2) P(w3) P(w4). Bigram (generally, n-gram) Language Models condition on the previous word(s): P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3). Other Language Models: grammar-based models (PCFGs), etc. are probably not the first thing to try in IR. Unigram models are easy and effective!
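A small sketch contrasting the two factorizations, given hypothetical unigram and bigram probability tables (the numbers below are assumptions for illustration):

```python
def p_unigram(words, p1):
    """Unigram model: product of P(w_i)."""
    prob = 1.0
    for w in words:
        prob *= p1[w]
    return prob

def p_bigram(words, p1, p2):
    """Bigram model: P(w1) times the product of P(w_i | w_{i-1})."""
    prob = p1[words[0]]
    for prev, w in zip(words, words[1:]):
        prob *= p2[(prev, w)]
    return prob

p1 = {"I": 0.1, "wish": 0.05}                                   # hypothetical probabilities
p2 = {("I", "wish"): 0.5, ("wish", "I"): 0.4}
print(p_unigram(["I", "wish", "I", "wish"], p1))
print(p_bigram(["I", "wish", "I", "wish"], p1, p2))
```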

  42. The fundamental problem of LMs The model M is unknown; we only have a sample of text that is representative of the model. So: estimate the model from the sample text, then compute the probability of the observed text under it: P(s | M(sample)).

  43. Using Language Models in IR Each document corresponds to one model; rank documents by P(d | q). P(d | q) = P(q | d) × P(d) / P(q). P(q) is the same for all documents, so ignore it. P(d) [the prior] is often treated as the same for all d, but we could use criteria like authority, length, genre. P(q | d) is the probability of q given d's model. A very general formal approach.

  44. Language Models for IR Language Modeling approaches model the query generation process: documents are ranked by the probability that the query would be observed as a random sample from the respective document model. A multinomial approach.

  45. Retrieval based on probabilistic LM Treat the generation of the query as a random process. Approach: infer a language model for each document; estimate the probability that each document model generates the query; rank the documents by this probability. Usually a unigram model is used.

  46. Query generation probability (1) Ranking formula, using the maximum likelihood estimate: P(q | M_d) = ∏_{t ∈ q} P_mle(t | M_d) = ∏_{t ∈ q} tf_{t,d} / dl_d, where M_d is the language model of document d, tf_{t,d} is the raw tf of term t in document d, and dl_d is the total number of tokens in document d. Unigram assumption: given a particular language model, the query terms occur independently.
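A minimal sketch of this maximum-likelihood query-likelihood ranking (the toy documents are assumptions for illustration):

```python
from collections import Counter

def query_likelihood_mle(query_tokens, doc_tokens):
    """P(q | M_d) = product over query terms of tf_{t,d} / dl_d."""
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)
    prob = 1.0
    for t in query_tokens:
        prob *= tf[t] / dl          # zero if t does not occur in the document
    return prob

docs = {
    "d1": "digital cameras and video cameras".split(),
    "d2": "digital video editing software".split(),
}
q = "digital cameras".split()
ranking = sorted(docs, key=lambda d: query_likelihood_mle(q, docs[d]), reverse=True)
print(ranking)
```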

  47. Insufficient data Zero probability: when a document contains none of some query term, the product above becomes zero. General approach: a term that does not occur in the document is assigned the probability of its occurrence in the whole collection: if tf_{t,d} = 0, then P(t | M_d) = cf_t / cs, where cf_t is the raw count of term t in the collection and cs is the raw collection size (the total number of tokens in the collection).

  48. Insufficient data Zero probabilities spell disaster, so smooth the probabilities: discount the nonzero probabilities and give some probability mass to unseen events. There are many approaches, such as adding 1, ½ or ε to counts, Dirichlet priors, discounting, and interpolation [see FSNLP ch. 6 if you want more]. A common choice is a mixture model: use a mixture between the document multinomial and the collection multinomial distribution.
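A sketch of the mixture idea, in the spirit of Jelinek-Mercer interpolation; the λ value and the toy inputs are assumptions, not from the slides:

```python
from collections import Counter

def smoothed_query_likelihood(query_tokens, doc_tokens, collection_tokens, lam=0.5):
    """P(q | d) with P(t | d) = lam * P_mle(t | M_d) + (1 - lam) * P_mle(t | M_c)."""
    tf_d, dl = Counter(doc_tokens), len(doc_tokens)
    tf_c, cs = Counter(collection_tokens), len(collection_tokens)
    prob = 1.0
    for t in query_tokens:
        p_doc = tf_d[t] / dl        # document model (may be zero)
        p_col = tf_c[t] / cs        # collection model gives unseen terms some mass
        prob *= lam * p_doc + (1 - lam) * p_col
    return prob
```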
