
Chapter 23: Probabilistic Language Models


Presentation Transcript


  1. Chapter 23: Probabilistic Language Models April 13, 2004

  2. Corpus-Based Learning • Information Retrieval • Information Extraction • Machine Translation

  3. 23.1 Probabilistic Language Models • Probabilistic language models have several advantages • Can be trained from data • Robust (accept any sentence) • Reflect the fact that not all speakers agree on which sentences are part of a language • Can be used for disambiguation

  4. Unigram Model: P(wi) • Bigram Model: P(wi | wi-1) • Trigram Model: P(wi | wi-2, wi-1)
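
A minimal sketch of how these models are estimated by maximum likelihood from counts; the toy corpus and tokenization are illustrative assumptions, not from the slides:

```python
# Maximum-likelihood unigram and bigram estimates from a toy corpus.
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)  # words in corpus

def p_unigram(w):
    # P(wi) = count(wi) / N
    return unigrams[w] / N

def p_bigram(w, prev):
    # P(wi | wi-1) = count(wi-1 wi) / count(wi-1)
    return bigrams[(prev, w)] / unigrams[prev]

print(p_unigram("the"))        # 3/9
print(p_bigram("cat", "the"))  # 2/3
```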

  5. Smoothing • Problem: many pairs (triples, etc.) of words never occur in the training text. • N: number of words in the corpus • B: number of possible bigrams • c: actual count of a given bigram • Add-One Smoothing: P = (c + 1) / (N + B)
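
A sketch of the add-one estimate, reusing the toy counts from the previous sketch; taking B to be V² possible bigrams is an assumption consistent with the slide's definition:

```python
# Add-one (Laplace) smoothing for bigrams: (c + 1) / (N + B).
V = len(unigrams)  # vocabulary size
B = V * V          # possible bigrams

def p_addone(w, prev):
    c = bigrams[(prev, w)]  # actual count of the bigram
    return (c + 1) / (N + B)

print(p_addone("cat", "the"))  # (2 + 1) / (9 + 36)
print(p_addone("sat", "mat"))  # an unseen bigram still gets probability 1/45
```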

  6. Smoothing • Linear Interpolation Smoothing: P̂(wi | wi-2, wi-1) = c3 P(wi | wi-2, wi-1) + c2 P(wi | wi-1) + c1 P(wi), where c1 + c2 + c3 = 1
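
A sketch of the interpolation estimate, again on the toy counts; the weights here are illustrative, while in practice they are tuned on held-out data subject to c1 + c2 + c3 = 1:

```python
# Linear interpolation of trigram, bigram, and unigram estimates.
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

def p_trigram(w, w2, w1):
    # P(wi | wi-2, wi-1) = count(wi-2 wi-1 wi) / count(wi-2 wi-1)
    c = bigrams[(w2, w1)]
    return trigrams[(w2, w1, w)] / c if c else 0.0

def p_interp(w, w2, w1, c1=0.1, c2=0.3, c3=0.6):
    return (c3 * p_trigram(w, w2, w1)
            + c2 * p_bigram(w, w1)
            + c1 * p_unigram(w))

print(p_interp("sat", "the", "cat"))  # backs off gracefully on sparse counts
```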

  7. Segmentation • The task is to find the word boundaries in a text written without spaces • P(“with”) = .2 • P(“out”) = .1 • P(“with out”) = .2 × .1 = .02 (unigram model) • P(“without”) = .05, so the one-word reading wins • Figure 23.1, Viterbi-based segmentation algorithm (see the sketch below)
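
A dynamic-programming sketch in the spirit of the Viterbi segmentation of Figure 23.1 (not the book's code); the tiny unigram lexicon is an assumption for illustration:

```python
# Segment text by maximizing the product of unigram word probabilities.
P = {"with": 0.2, "out": 0.1, "without": 0.05}

def segment(text):
    # best[i] = (probability, words) of the best segmentation of text[:i]
    best = [(1.0, [])] + [(0.0, None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(i):
            word = text[j:i]
            if word in P and best[j][1] is not None:
                p = best[j][0] * P[word]
                if p > best[i][0]:
                    best[i] = (p, best[j][1] + [word])
    return best[-1]

print(segment("without"))  # (0.05, ['without']): beats "with out" at 0.02
```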

  8. Probabilistic CFG (PCFG) • N-gram models have no notion of grammar at distances greater than n • Figure 23.2, PCFG example • Figure 23.3, PCFG parse • Problem: context-free, so rule probabilities cannot depend on surrounding context • Problem: preference for short sentences
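
As a minimal illustration of the idea behind Figures 23.2 and 23.3 (the grammar fragment here is invented, not the book's): a parse's probability is the product of the probabilities of the rules used in its derivation, and the probabilities of all rules with the same left-hand side sum to 1:

```python
# Probability of a PCFG derivation = product of its rule probabilities.
rules = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("noun",)): 0.6,
    ("NP", ("article", "noun")): 0.4,
    ("VP", ("verb",)): 0.7,
    ("VP", ("verb", "NP")): 0.3,
}

def derivation_prob(derivation):
    p = 1.0
    for rule in derivation:
        p *= rules[rule]
    return p

# S -> NP VP, NP -> noun, VP -> verb: 1.0 * 0.6 * 0.7
print(derivation_prob([("S", ("NP", "VP")),
                       ("NP", ("noun",)),
                       ("VP", ("verb",))]))  # 0.42
```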

  9. Learning PCFG Probabilities • Parsed Data: straightforward (count how often each rule is used) • Unparsed Data: two challenges • Learning the structure of the grammar rules. A Chomsky Normal Form bias can be used (X → Y Z, X → t). Something similar to SEQUITUR can be used. • Learning the probabilities associated with each rule (inside-outside algorithm, based on dynamic programming)

  10. 23.2 Information Retrieval • Components of IR System: • Document Collection • Query Posed in Query Language • Result Set • Presentation of Result Set

  11. Boolean Keyword Model • Boolean queries • Each word in a document is treated as a boolean feature • Drawbacks • Each word contributes only a single bit of relevance, so results cannot be ranked • Boolean logic can be difficult for the average user to use correctly

  12. General Framework • r: Boolean random variable; r = true means the document is relevant • D: Document • Q: Query • P(r | D, Q) • Order results by decreasing probability of relevance

  13. Language Modeling • P(r | D, Q) • = P(D, Q | r) P(r) / P(D, Q) (Bayes' rule) • = P(Q | D, r) P(D | r) P(r) / P(D, Q) (chain rule) • = P(Q | D, r) P(r | D) P(D) / P(D, Q) (Bayes' rule again, applied to P(D | r) P(r)) • Since D and Q are fixed, rank instead by the odds ratio P(r | D, Q) / P(¬r | D, Q)

  14. Language Modeling • = [P(Q | D, r) P(r | D)] / [P(Q | D, ¬r) P(¬r | D)] • Eliminate P(Q | D, ¬r): if a document is irrelevant to a query, then knowing the document won't help determine the query, so this factor does not depend on D and can be dropped for ranking • = P(Q | D, r) P(r | D) / P(¬r | D)

  15. Language Modeling • P(r | D) / P(¬r | D) is a query-independent measure of document quality. This can be estimated from references to the document, the recency of the document, etc. • P(Q | D, r) = ∏j P(Qj | D, r), where each Qj is a word in the query. • Figure 23.4.
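
A sketch of ranking by this model, with unsmoothed word frequencies standing in for P(Qj | D, r) and a placeholder constant for the quality term; a real system would smooth the estimates as in Section 23.1:

```python
# Query-likelihood ranking: quality * prod_j P(Qj | D, r).
from collections import Counter

def query_score(query_words, doc_words, quality=1.0):
    counts = Counter(doc_words)
    n = len(doc_words)
    score = quality  # stands in for the P(r | D) / P(~r | D) term
    for q in query_words:
        score *= counts[q] / n  # unsmoothed estimate of P(Qj | D, r)
    return score

docs = {"d1": "the cat sat on the mat".split(),
        "d2": "dogs chase cats".split()}
ranked = sorted(docs, key=lambda d: query_score(["cat", "mat"], docs[d]),
                reverse=True)
print(ranked)  # ['d1', 'd2']
```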

  16. Evaluating IR Systems • Precision. Proportion of documents in the result set that are actually relevant. • Recall. Proportion of relevant documents in the collection that are in the result set. • Average Reciprocal Rank. Average, over queries, of 1 / rank of the first relevant result. • Time to Answer. Length of time for the user to find the desired answer.
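
The first three measures are simple to compute; a minimal sketch with hypothetical document ids:

```python
# Precision, recall, and reciprocal rank for one query's result set.
def precision(results, relevant):
    return sum(d in relevant for d in results) / len(results)

def recall(results, relevant):
    return sum(d in relevant for d in results) / len(relevant)

def reciprocal_rank(results, relevant):
    # 1 / rank of the first relevant result; 0 if none is returned
    for rank, d in enumerate(results, start=1):
        if d in relevant:
            return 1 / rank
    return 0.0

results = ["d3", "d1", "d7"]
relevant = {"d1", "d2"}
print(precision(results, relevant),        # 1/3
      recall(results, relevant),           # 1/2
      reciprocal_rank(results, relevant))  # first hit at rank 2 -> 0.5
```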

  17. IR Refinements • Stemming. Can help recall, can hurt precision. • Case Folding. • Synonyms. • Use a bigram model. • Spelling Corrections. • Metadata.

  18. Result Sets • Relevance feedback from user. • Document classification. • Document clustering. • K-Means clustering (see the sketch below) • 1. Pick k documents at random as category seeds • 2. Assign every document to the closest category • 3. Compute the mean of each cluster and use these means as the new seeds. • 4. Go to step 2 until convergence occurs.
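
A sketch of the four steps, on points in the plane for brevity; document clustering would use term-frequency vectors instead, and a fixed iteration cap stands in for a convergence test:

```python
# K-means: seed, assign, re-compute means, repeat.
import random

def closest(p, seeds):
    # index of the seed nearest to p (squared Euclidean distance)
    return min(range(len(seeds)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(p, seeds[i])))

def kmeans(points, k, iters=20):
    seeds = random.sample(points, k)          # 1. k random seeds
    for _ in range(iters):                    # 4. repeat (approximates convergence)
        clusters = [[] for _ in range(k)]
        for p in points:                      # 2. assign each point to closest seed
            clusters[closest(p, seeds)].append(p)
        for i, c in enumerate(clusters):      # 3. cluster means become the new seeds
            if c:
                seeds[i] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return clusters
```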

  19. Implementing IR Systems • Lexicon. Given a word, return the location in the inverted index. Stop words are often omitted. • Inverted Index. Might be a list of (document, count) pairs.
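
A minimal sketch of that layout, with the lexicon and index folded into one Python dict mapping each word to its (document, count) postings; stop-word removal is omitted:

```python
# Inverted index: word -> list of (document id, count) pairs.
from collections import Counter, defaultdict

def build_index(docs):
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for word, count in Counter(text.split()).items():
            index[word].append((doc_id, count))
    return index

index = build_index({"d1": "the cat sat", "d2": "the cat and the dog"})
print(index["the"])  # [('d1', 1), ('d2', 2)]
```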

  20. Vector Space Model • Used more often in practice than the probabilistic model • Documents are represented as vectors of unigram word frequencies. • A query is represented as a vector consisting of 0s and 1s, e.g. [0 1 1 0 0].
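
A sketch of scoring in the vector space model; ranking by cosine similarity is the usual choice, though the slide does not name a specific similarity measure:

```python
# Cosine similarity between a document frequency vector and a 0/1 query vector.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["cat", "dog", "mat", "sat"]
doc = [2, 0, 1, 1]    # unigram frequencies over vocab
query = [0, 1, 1, 0]  # query contains "dog" and "mat"
print(cosine(doc, query))  # ~0.29
```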
