
Document Representation


Presentation Transcript


  1. Document Representation • Bag-of-words • Bag-of-facts • Bag-of-sentences • Bag-of-nets

  2. Language Modeling in IR 2008-03-06

  3. Document = Bag of Words • Document = Bag of Sentences, where a sentence is a word sequence • Segmentation matters: for the ambiguous string 南京市长江大桥, p(南京市长)p(江大桥|南京市长) ("Nanjing mayor" + "Jiang Daqiao") << p(南京市)p(长江大桥|南京市) ("Nanjing city" + "Yangtze River Bridge") • Word order matters: p(中国人民大学) ("Renmin University of China") >> p(中国大学人民) (the same characters reordered)

  4. Agenda • Introduction to Language Models • What is an LM? • How can we use LMs? • What are the major issues in LM?

  5. What is an LM? • A "language" can be viewed as a probability distribution over strings of its alphabet; the distribution gives the likelihood that any symbol sequence is a sentence (or any other linguistic unit) of that language. This distribution is called a language model. • Given a language, we can estimate the probability of any "sentence" (symbol string) occurring in it. • For example, for English we would expect p1(a quick brown dog) > p2(dog brown a quick) > p3(brown dog 棕熊) > p4(棕熊), where 棕熊 is Chinese for "brown bear". • If p1 = p2 (word order does not matter), the model is a first-order language model; otherwise it is a higher-order model.

  6. Basic Notation • M: the language we are trying to model; it can be thought of as a source • s: an observation (a string of tokens) • P(s|M): the probability of observation "s" under M, that is, the probability of getting "s" when sampling randomly from M

  7. Basic Notation • Let S = s1s2…sn be any sentence • P(S) = P(s1)P(s2|s1)…P(sn|s1,s2,…,sn-1) • Under an n-gram model, P(si|s1,s2,…,si-1) = P(si|si-n+1,…,si-1) • For n = 1 (unigram), P(si|s1,s2,…,si-1) = P(si)
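
To make the chain rule and its n-gram (here bigram) approximation concrete, below is a minimal Python sketch. The toy corpus and the <s> start symbol are illustrative assumptions, not from the slides.

```python
from collections import defaultdict

# Chain rule with the bigram approximation P(si | s1..si-1) ≈ P(si | si-1),
# estimated from toy counts.
corpus = [["a", "quick", "brown", "dog"],
          ["a", "brown", "dog"]]

context_count = defaultdict(int)   # count of the conditioning word
pair_count = defaultdict(int)      # count of (previous word, word) pairs

for sent in corpus:
    prev = "<s>"                   # sentence-start symbol (an assumption)
    for w in sent:
        context_count[prev] += 1
        pair_count[(prev, w)] += 1
        prev = w

def p_bigram(w, prev):
    """Maximum-likelihood bigram probability P(w | prev)."""
    if context_count[prev] == 0:
        return 0.0
    return pair_count[(prev, w)] / context_count[prev]

def sentence_prob(sent):
    """P(S) = P(s1|<s>) * P(s2|s1) * ... * P(sn|sn-1)."""
    p, prev = 1.0, "<s>"
    for w in sent:
        p *= p_bigram(w, prev)
        prev = w
    return p

print(sentence_prob(["a", "quick", "brown", "dog"]))  # 0.5: "quick" follows "a" in 1 of 2 cases
```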

  8. How can we use LMs in IR • Use an LM to model the process of query generation: • Every document in a collection defines a "language" • P(s|MD) is the probability that the document's author would write down the string "s" • Now suppose "q" is the user's query • P(q|MD) is the probability of obtaining "q" when sampling randomly from MD, and can be treated as the ranking score of document D in the collection
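
A minimal sketch of this query-likelihood ranking under a unigram document model. The documents and query are made up, and plain maximum-likelihood estimates are used; the zero-probability problem this causes is addressed by the smoothing slides later.

```python
from collections import Counter

docs = {
    "d1": "a quick brown dog chases a cat".split(),
    "d2": "the brown bear sleeps".split(),
}

def p_w_given_doc(w, doc_tokens):
    """Unsmoothed ML estimate P(w | MD) = #(w, D) / |D|."""
    return Counter(doc_tokens)[w] / len(doc_tokens)

def query_likelihood(query_tokens, doc_tokens):
    """P(q | MD) under the unigram assumption: product over query terms."""
    score = 1.0
    for w in query_tokens:
        score *= p_w_given_doc(w, doc_tokens)
    return score

query = "brown dog".split()
ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d]), reverse=True)
print(ranking)   # d1 first: it contains both "brown" and "dog"
```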

  9. Other ways to rank • Query likelihood: rank by P(Q|MD), i.e., by how well the document model can generate the query. • Document likelihood: rank by P(D|MQ), i.e., by how well the query model can generate the document. • Model comparison: rank by the similarity between the query model and the document model, e.g., via KL(MQ ‖ MD).
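
The model-comparison view can be sketched as follows: the query and each document are turned into smoothed unigram models and documents are ranked by negative KL divergence. The toy collection, the uniform use of the collection as background, and the mixing constant are illustrative assumptions.

```python
import math
from collections import Counter

def model(tokens, background, lam=0.8):
    """Unigram model smoothed with a background distribution (avoids zeros)."""
    counts, n = Counter(tokens), len(tokens)
    return {w: lam * counts[w] / n + (1 - lam) * background[w] for w in background}

def neg_kl(mq, md):
    """Negative KL(MQ || MD); larger means the models are closer."""
    return -sum(p * math.log(p / md[w]) for w, p in mq.items() if p > 0)

collection = "a quick brown dog chases a cat the brown bear sleeps".split()
bg_counts, bg_len = Counter(collection), len(collection)
background = {w: bg_counts[w] / bg_len for w in bg_counts}

mq = model("brown dog".split(), background)
md1 = model("a quick brown dog chases a cat".split(), background)
md2 = model("the brown bear sleeps".split(), background)
print(neg_kl(mq, md1) > neg_kl(mq, md2))   # True: d1's model is closer to the query model
```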

  10. Major issues in applying LMs • What kind of language model should we use? • Unigram or higher-order models • How can we estimate model parameters? • Basic or advanced models • Data smoothing approaches

  11. What kind of model is better? • Unigram model • Bigram model • Higher-order model

  12. Unigram LMs • Words are "sampled" independently of each other • The joint probability decomposes into a product of marginals • P(xyz) = P(x)P(y)P(z) • P("brown dog") = P("dog")P("brown") • Estimation of probabilities: simple counting
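
A short sketch of the independence assumption: because the joint probability is a product of marginals estimated by counting, word order has no effect on the score. The toy text is illustrative.

```python
from collections import Counter

text = "a quick brown dog a brown dog".split()
counts, n = Counter(text), len(text)

def p(w):
    """Unigram probability by simple counting."""
    return counts[w] / n

def p_seq(words):
    """P(xyz) = P(x) P(y) P(z): a product of marginals."""
    prob = 1.0
    for w in words:
        prob *= p(w)
    return prob

print(p_seq(["brown", "dog"]) == p_seq(["dog", "brown"]))   # True: order is ignored
```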

  13. Higher-order Models • n-gram: condition on preceding words • Cache: condition on a window • Grammar: condition on a parse tree • Are they useful? • Parameter estimation is very expensive!

  14. Comparison • Song and Croft reported that mixing a unigram model with a bigram model improves retrieval effectiveness by about 8% over the unigram model alone. However, Victor Lavrenko pointed out that the higher-order models used by Song and Croft do not consistently outperform the unigram model. • David R. H. Miller also reported that a mixture of unigram and bigram models outperforms the unigram model alone. • Other studies suggest that word order has little effect on retrieval results.

  15. Major issues in applying LMs • What kind of language model should we use? • Unigram or higher-order models • How can we estimate model parameters? • Basic or advanced models • Data smoothing approaches

  16. Estimation of parameters • Given a string of text S (= Q or D), estimate its LM: Ms • Basic LMs • Maximum-likelihood estimation • Zero-frequency problem • Discounting techniques • Interpolation methods

  17. Maximum-likelihood estimation • Let V be the vocabulary, Q = q1q2…qm a query with qi ∈ V, and S = d1d2…dn a document. • Let Ms be the language model of S • P(Q|Ms) = ?, called the query likelihood • P(Ms|Q) = P(Q|Ms)P(Ms)/P(Q) can be treated as the ranking score of document S; it is proportional to P(Q|Ms)P(Ms) • So we need to estimate P(Q|Ms) and P(Ms)

  18. Maximum-likelihood estimation • Two ways to estimate P(Q|Ms): • Multivariate Bernoulli model • Multinomial model • Bernoulli model • Only whether a word appears in the query is considered, not how many times it appears. The query is viewed as the outcome of |V| independent Bernoulli trials • P(Q|Ms) = ∏w∈Q P(w|Ms) · ∏w∉Q (1 − P(w|Ms))

  19. Maximum-likelihood estimation • Multinomial model • The query is viewed as the outcome of a sequence of multinomial trials, so the number of times each word occurs in the query is taken into account • P(Q|Ms) = ∏qi∈Q P(qi|Ms) = ∏w∈Q P(w|Ms)^#(w,Q) • Both approaches reduce to estimating P(w|Ms); that is, the IR problem becomes the problem of estimating a document language model, so existing LM research can be reused.
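
A sketch contrasting the multivariate Bernoulli estimate from the previous slide with the multinomial one here. The vocabulary V and the values of P(w|Ms) are illustrative toy numbers, not estimated from any document.

```python
from collections import Counter

V = ["brown", "dog", "quick", "cat"]
p_w = {"brown": 0.4, "dog": 0.3, "quick": 0.2, "cat": 0.1}   # toy P(w | Ms)

def bernoulli_likelihood(query):
    """Multivariate Bernoulli: only presence/absence of each vocabulary word counts."""
    present = set(query)
    score = 1.0
    for w in V:
        score *= p_w[w] if w in present else (1.0 - p_w[w])
    return score

def multinomial_likelihood(query):
    """Multinomial: each occurrence of a query word contributes a factor P(w | Ms)."""
    score = 1.0
    for w, freq in Counter(query).items():
        score *= p_w[w] ** freq
    return score

q = ["brown", "dog", "dog"]
print(bernoulli_likelihood(q))    # the repeated "dog" changes nothing
print(multinomial_likelihood(q))  # the repeated "dog" contributes P(dog|Ms)^2
```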

  20. Maximum-likelihood estimation • The simplest approach is maximum-likelihood estimation: count relative frequencies of words in S, P(w|Ms) = #(w,S)/|S| • Zero-frequency problem (caused by data sparseness) • If some event does not occur in S, its probability is estimated as 0! • This is not correct, and we need to avoid it
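
A minimal sketch of the ML estimate and the zero-frequency problem it causes; the document text is made up.

```python
from collections import Counter

S = "a quick brown dog chases a cat".split()
counts, length = Counter(S), len(S)

def p_ml(w):
    """Maximum-likelihood estimate P(w | Ms) = #(w, S) / |S|."""
    return counts[w] / length

print(p_ml("dog"))    # 1/7
print(p_ml("bear"))   # 0.0: "bear" never occurs in S, so any query containing it scores 0
```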

  21. Discounting Methods • Laplace correction (add-one approach): • Add 1 to every count and renormalize • P(w|Ms) = (#(w,S)+1)/(|S|+|V|) • Problematic for large vocabularies (when |V| is very large) • Lidstone correction (generalized add-one): • Add a small constant to every count • Leave-one-out discounting • Remove one word, compute P(S|Ms), repeat for every word in the document, and maximize the overall likelihood • Ref. Chen SF and Goodman JT, An empirical study of smoothing techniques for language modeling, Proc. 34th Annual Meeting of the ACL, 1996
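
A sketch of the Laplace and Lidstone corrections; the document, the extra vocabulary words, and eps = 0.1 are illustrative assumptions.

```python
from collections import Counter

S = "a quick brown dog chases a cat".split()
V = set(S) | {"bear", "the"}          # vocabulary includes words unseen in S
counts, length = Counter(S), len(S)

def p_laplace(w):
    """Add 1 to every count, then renormalize over |S| + |V|."""
    return (counts[w] + 1) / (length + len(V))

def p_lidstone(w, eps=0.1):
    """Generalized add-one: add a small constant eps instead of 1."""
    return (counts[w] + eps) / (length + eps * len(V))

print(p_laplace("bear"))    # unseen word now gets a small nonzero probability
print(p_lidstone("bear"))   # even smaller with eps < 1
```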

  22. Smoothing methods • Discounting treats all unseen words identically, but in reality they still differ; background knowledge can be used, e.g., knowledge of the English language (a reference model). • P(w|Ms) = c·PML(w|Ms) + (1 − c)·P(w) • PML(w|Ms) is the conditional probability (the document's ML model) • P(w) = P(w|REF) is the prior (background/reference) probability

  23. Additive smoothing methods • P(w|Ms) = [#(w,S) + c] / [|S| + c|V|] • Background model: P(w) = 1/|V| • Per-word background contribution: c / [|S| + c|V|]
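
A quick numerical check (with a toy document and c = 0.5) that the additive estimate is the same as interpolating the ML estimate with the uniform background P(w) = 1/|V|, where the ML estimate gets weight |S| / (|S| + c|V|).

```python
from collections import Counter

S = "a quick brown dog chases a cat".split()
V = set(S) | {"bear"}
counts, n = Counter(S), len(S)
c = 0.5                                      # additive constant, illustrative

def p_additive(w):
    return (counts[w] + c) / (n + c * len(V))

def p_interpolated(w):
    lam = n / (n + c * len(V))               # weight on the ML estimate
    return lam * (counts[w] / n) + (1 - lam) * (1 / len(V))

print(abs(p_additive("dog") - p_interpolated("dog")) < 1e-12)   # True
```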

  24. Jelinek-Mercer method • Set c to be a constant, independent of document and query • Tune c to optimize retrieval performance on different databases, query sets, etc.
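
A sketch of Jelinek-Mercer smoothing in the slide's notation: a fixed constant c mixes the document's ML estimate with a background (collection) model. The documents, the collection, and c = 0.8 are illustrative.

```python
from collections import Counter

collection = "a quick brown dog chases a cat the brown bear sleeps".split()
doc = "a quick brown dog chases a cat".split()

coll_counts, coll_len = Counter(collection), len(collection)
doc_counts, doc_len = Counter(doc), len(doc)

def p_jm(w, c=0.8):
    """P(w|Ms) = c * PML(w|Ms) + (1 - c) * P(w|REF), with c fixed for all documents."""
    p_ml = doc_counts[w] / doc_len
    p_ref = coll_counts[w] / coll_len
    return c * p_ml + (1 - c) * p_ref

print(p_jm("dog"))    # seen in the document
print(p_jm("bear"))   # unseen in the document, but nonzero via the background
```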

  25. Dirichlet method • c = N/(N + u), 1 − c = u/(N + u) • N: the sample size, i.e., the length of S; u is a parameter
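
A sketch of Dirichlet smoothing: here c depends on the document length N, c = N/(N + u), so longer documents rely less on the background model. The documents and u = 5 are illustrative.

```python
from collections import Counter

collection = "a quick brown dog chases a cat the brown bear sleeps".split()
doc = "a quick brown dog chases a cat".split()

coll_counts, coll_len = Counter(collection), len(collection)
doc_counts, N = Counter(doc), len(doc)

def p_dirichlet(w, u=5.0):
    """Equivalent closed form: P(w|Ms) = (#(w,S) + u * P(w|REF)) / (N + u)."""
    p_ref = coll_counts[w] / coll_len
    return (doc_counts[w] + u * p_ref) / (N + u)

print(p_dirichlet("dog"))
print(p_dirichlet("bear"))   # unseen word, smoothed toward the collection probability
```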

  26. Effect of smoothing on retrieval performance • Zhai CX, Lafferty J. A study of smoothing methods for language models applied to ad hoc information retrieval. ACM SIGIR 2001 • Zhai CX, Lafferty J. A study of smoothing methods for language models applied to information retrieval. ACM TOIS 22(2):179-214 • Smoothing plays two roles: estimation, solving the zero-probability problem; and query modeling, removing or reducing the influence of noise.

  27. Translation Models • Basic LMs do not address word synonymy. • P(q|M) = ∑w P(w|M) P(q|w) • P(q|w) captures the relationship between q and w; if q and w are near-synonyms, this value is relatively large. • P(q|w) can be computed from word co-occurrence, shared stems, dictionaries, etc.; this is the key to the method. • P(w|M) is the probability of w under the language model.
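
A sketch of the translation-model score P(q|M) = ∑w P(w|M)·P(q|w). Both distributions below are toy values; in practice P(q|w) would come from co-occurrence, stemming, or a thesaurus, as the slide notes.

```python
p_w_given_M = {"dog": 0.5, "canine": 0.3, "cat": 0.2}     # document model P(w|M), illustrative
p_q_given_w = {                                           # "translation" probabilities P(q|w), illustrative
    ("dog", "dog"): 0.9, ("dog", "canine"): 0.7, ("dog", "cat"): 0.05,
}

def translation_score(q, model):
    """P(q|M) = sum over document words w of P(w|M) * P(q|w)."""
    return sum(p_w * p_q_given_w.get((q, w), 0.0) for w, p_w in model.items())

# The query term "dog" picks up probability from the near-synonym "canine" as well.
print(translation_score("dog", p_w_given_M))   # 0.5*0.9 + 0.3*0.7 + 0.2*0.05 = 0.67
```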

  28. LM Tools • LEMUR • www.cs.cmu.edu/~lemur • CMU/UMass joint project • C++, good documentation, forum-based support • Ad-hoc IR, clustering, Q/A systems • ML + smoothing, … • YARI • lavrenko@cs.umass.edu • Ad-hoc IR, cross-language retrieval, classification • ML + smoothing, …

  29. Other applications of LM • Topic detection and tracking • Treat “q” as a topic description • Classification/ filtering • Cross-language retrieval • Multi-media retrieval

  30. References • Ponte JM, Croft WB. A Language Modeling Approach to Information Retrieval. ACM SIGIR 1998, pp. 275-281 • Ponte JM. A Language Modeling Approach to Information Retrieval. PhD Dissertation, UMass, 1998

  31. Bag-of-nets • What if the concepts of a text were represented with an ontology, i.e., the concepts extracted from the text were placed against a domain ontology, forming a network of concepts? • Could a Bayesian network be used? The key question is how to interpret the relations between words: are they causal? • For example, hypernym/hyponym relations? Association relations?
