
CPSC 503 Computational Linguistics

This lecture covers topics such as n-grams, model evaluation, and the introduction of Markov models in computational linguistics.


Presentation Transcript


  1. CPSC 503 Computational Linguistics Lecture 6 Giuseppe Carenini CPSC503 Winter 2009

  2. Today 25/9 • Finish n-grams • Model evaluation • Start Markov Models CPSC503 Winter 2009

  3. Knowledge-Formalisms Map (including probabilistic formalisms) • State Machines and prob. versions (Finite State Automata, Finite State Transducers, Markov Models): Morphology • Rule systems and prob. versions (e.g., (Prob.) Context-Free Grammars): Syntax • Logical formalisms (First-Order Logics): Semantics • AI planners: Pragmatics, Discourse and Dialogue CPSC503 Winter 2009

  4. Two problems with applying the n-gram estimates [formulas not in transcript] to word sequences CPSC503 Winter 2009

  5. Problem (1) • We may need to multiply many very small numbers (underflows!) • Easy Solution: • Convert probabilities to logs and then add them (instead of multiplying) • To get the real probability (if you need it) go back to the antilog. CPSC503 Winter 2009
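As a side note (not from the slides), here is a minimal Python sketch of the log-space trick; the per-word probabilities are made-up values:

```python
import math

# Per-word probabilities from some hypothetical n-gram model (made-up values).
probs = [0.01, 0.003, 0.0007, 0.02, 0.0001] * 50   # 250 small factors

# Direct multiplication underflows to 0.0 for long enough sequences.
direct = 1.0
for p in probs:
    direct *= p
print(direct)                                       # 0.0 (underflow)

# Log space: take logs and add instead of multiplying.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)                                     # a large negative number, no underflow

# The antilog of such a tiny probability underflows too, so in practice
# you stay in log space and only exponentiate when the value is representable.
```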

  6. Problem (2) • The probability matrix for n-grams is sparse • How can we assign a probability to a sequence where one of the component n-grams has a value of zero? • Solutions: • Add-one smoothing • Good-Turing • Back off and Deleted Interpolation CPSC503 Winter 2009

  7. Add-One • Make the zero counts 1. • Rationale: If you had seen these “rare” events, chances are you would only have seen them once. [unigram and discount formulas not in transcript] CPSC503 Winter 2009
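To make the discounting concrete, here is a small illustrative sketch (my own toy corpus, not the BERP data) of add-one smoothed bigram estimates, assuming the usual formula P(wi | wi-1) = (C(wi-1 wi) + 1) / (C(wi-1) + V):

```python
from collections import Counter

corpus = "i want to eat i want chinese food".split()
V = len(set(corpus))                        # vocabulary size

bigrams = Counter(zip(corpus, corpus[1:]))  # bigram counts
unigrams = Counter(corpus)                  # unigram counts

def p_addone(prev, word):
    """Add-one smoothed bigram probability P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_addone("i", "want"))   # seen bigram: (2 + 1) / (2 + 6) = 0.375
print(p_addone("i", "food"))   # unseen bigram: gets count 1 instead of 0
```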

  8. Add-One: Bigram Counts [table not in transcript] CPSC503 Winter 2009

  9. BERP Original vs. Add-one Smoothed Counts [table not in transcript] CPSC503 Winter 2009

  10. Add-One Smoothed Problems • An excessive amount of probability mass is moved to the zero counts • When compared empirically with other smoothing techniques, it performs poorly • -> Not used in practice • Detailed discussion [Gale and Church 1994] CPSC503 Winter 2009

  11. Better smoothing techniques • Good-Turing Discounting (clear example in the textbook) More advanced: Backoff and Interpolation • To estimate an n-gram, use “lower order” n-grams • Backoff: rely on the lower-order n-grams when we have zero counts for a higher-order one; e.g., use unigrams to estimate zero-count bigrams • Interpolation: mix the probability estimates from all the n-gram estimators; e.g., we do a weighted interpolation of trigram, bigram and unigram counts CPSC503 Winter 2009
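A rough sketch of simple linear interpolation (the weights and probability values below are hand-picked stand-ins; in practice the lambdas are estimated on held-out data, e.g., by deleted interpolation):

```python
def interpolated_trigram(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Weighted mix of trigram, bigram and unigram ML estimates.
    The lambdas are illustrative; they must sum to 1 and are normally
    tuned on held-out data."""
    l3, l2, l1 = lambdas
    assert abs(l3 + l2 + l1 - 1.0) < 1e-9
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# Example: the trigram count was zero, but bigram and unigram still contribute.
print(interpolated_trigram(p_tri=0.0, p_bi=0.02, p_uni=0.001))   # 0.0061
```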

  12. N-Grams Summary: final • Estimating the full joint probability directly is impossible: the sample space is much bigger than any realistic corpus, and the chain rule does not help • Markov assumption: • Unigram… sample space? • Bigram… sample space? • Trigram… sample space? • Sparse matrix: smoothing techniques • Look at practical issues: sec. 4.8 CPSC503 Winter 2009

  13. You still need a big corpus… • The biggest one is the Web! • Impractical to download every page from the Web to count the ngrams => • Rely on page counts (which are only approximations) • Page can contain an ngram multiple times • Search Engines round-off their counts • Such “noise” is tolerable in practice  CPSC503 Winter 2009

  14. Today 25/9 • Finish n-grams • Model evaluation • Start Markov Models CPSC503 Winter 2009

  15. Model Evaluation: Goal • You may want to compare: • 2-grams with 3-grams • two different smoothing techniques (given the same n-grams) On a given corpus… CPSC503 Winter 2008

  16. Model Evaluation: Key Ideas • A: split the corpus into a Training Set and a Testing Set • B: train the models Q1 and Q2 on the Training Set (counting, frequencies, smoothing) • C: apply the models to the Testing Set and compare the results CPSC503 Winter 2008

  17. Entropy • Def1. Measure of uncertainty • Def2. Measure of the information that we need to resolve an uncertain situation • Let p(x) = P(X = x), where x ∈ X. • H(p) = H(X) = -∑x∈X p(x) log2 p(x) • It is normally measured in bits. CPSC503 Winter 2009
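A small sketch of the definition above, applied to a made-up four-outcome distribution:

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) * log2 p(x), in bits; zero-probability outcomes contribute 0."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

# Toy distribution over four outcomes.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
print(entropy(p))   # 1.75 bits
```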

  18. Model Evaluation • Actual distribution p vs. our approximation q: how different are they? • Relative Entropy (KL divergence): D(p||q) = ∑x∈X p(x) log( p(x) / q(x) ) CPSC503 Winter 2008
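A companion sketch of the KL divergence formula, reusing the toy distribution p from the previous sketch and a uniform approximation q (both made up):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log2( p(x) / q(x) ); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
q = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}   # uniform approximation of p

print(kl_divergence(p, p))   # 0.0  (a distribution diverges from itself by 0)
print(kl_divergence(p, q))   # 0.25 bits
```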

  19. Entropy rate and language entropy [formulas not in transcript] • Entropy of NL? Assumptions: ergodic and stationary • Shannon-McMillan-Breiman: the entropy can then be computed by taking the average log probability of a looooong sample CPSC503 Winter 2008

  20. Applied to Language: Cross-Entropy • Between a probability distribution P and another distribution Q (a model for P) [formula not in transcript] • Between two models Q1 and Q2, the more accurate is the one that assigns higher probability to the data => lower cross-entropy => lower perplexity CPSC503 Winter 2008
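In practice the cross-entropy of a model on held-out text is approximated by the average negative log probability per word, and usually reported as perplexity = 2^cross-entropy. A sketch with stand-in per-word probabilities for two hypothetical models Q1 and Q2:

```python
import math

def cross_entropy(model_probs):
    """Approximate cross-entropy (bits/word): average -log2 Q(w_i | history) over a test sample."""
    return -sum(math.log2(p) for p in model_probs) / len(model_probs)

def perplexity(model_probs):
    return 2 ** cross_entropy(model_probs)

# Per-word probabilities that two hypothetical models assign to the same test text.
q1 = [0.2, 0.1, 0.05, 0.1]
q2 = [0.1, 0.05, 0.01, 0.05]

print(perplexity(q1))   # 10.0  -> lower cross-entropy, more accurate model
print(perplexity(q2))   # ~25   -> higher cross-entropy
```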

  21. Model Evaluation: In practice • A: split the corpus into a Training Set and a Testing Set • B: train the models Q1 and Q2 on the Training Set (counting, frequencies, smoothing) • C: apply the models to the Testing Set and compare their (cross-)perplexities CPSC503 Winter 2008

  22. k-fold cross validation and t-test • Randomly divide the corpus into k subsets of equal size • Use each subset for testing (all the others for training) • In practice you do k times what we saw in the previous slide • Now for each model you have k perplexities • Compare the models' average perplexities with a t-test CPSC503 Winter 2008
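A schematic sketch of the k-fold loop; train_and_eval is a hypothetical placeholder for the whole train-then-measure-perplexity pipeline of the previous slides:

```python
import random

def k_fold_perplexities(corpus_sentences, k, train_and_eval):
    """Split the corpus into k random folds; each fold is the test set once.
    train_and_eval(train, test) is assumed to train a model on `train` and
    return its perplexity on `test`."""
    shuffled = corpus_sentences[:]
    random.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    perplexities = []
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        perplexities.append(train_and_eval(train, test))
    return perplexities

# Compare two models: collect k perplexities each, then compare the averages
# with a t-test, e.g. scipy.stats.ttest_rel(ppl_q1, ppl_q2) if SciPy is available.
# ppl_q1 = k_fold_perplexities(corpus, 10, train_and_eval_q1)
# ppl_q2 = k_fold_perplexities(corpus, 10, train_and_eval_q2)
```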

  23. Today 25/9 • Finish n-grams • Model evaluation • Start Markov Models CPSC503 Winter 2009

  24. Example of a Markov Chain [state-transition diagram over the states t, i, p, a, h, e with Start probabilities; figure not in transcript] CPSC503 Winter 2007

  25. Markov-Chain • Formal description: • 1. Stochastic transition matrix A over the states t, i, p, a, h, e [matrix values not recoverable from the transcript] • 2. Probability of the initial states: Start -> t = .6, Start -> i = .4 CPSC503 Winter 2007

  26. Markov Assumptions • Let X = (X1, .., Xt) be a sequence of random variables taking values in some finite set S = {s1, …, sn}, the state space. The Markov properties are: • (a) Limited Horizon: for all t, P(Xt+1 | X1, .., Xt) = P(Xt+1 | Xt) • (b) Time Invariant: for all t, P(Xt+1 | Xt) = P(X2 | X1), i.e., the dependency does not change over time. CPSC503 Winter 2007

  27. Markov-Chain • Probability of a sequence of states X1 … XT: P(X1, …, XT) = P(X1) ∏t=2..T P(Xt | Xt-1) • Example: Similar to …….? CPSC503 Winter 2007
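A minimal sketch of that product for a small chain; the transition probabilities below are illustrative stand-ins (the diagram's exact values did not survive the transcript), and only the Start probabilities (t = .6, i = .4) come from the slide:

```python
# Markov chain as (initial-state probabilities, transition probabilities).
initial = {"t": 0.6, "i": 0.4}
trans = {
    ("t", "i"): 0.3, ("t", "a"): 0.7,
    ("i", "p"): 1.0,
    ("p", "a"): 0.4, ("p", "h"): 0.6,
    ("a", "h"): 1.0,
    ("h", "e"): 1.0,
}

def sequence_probability(states):
    """P(X1..XT) = P(X1) * prod_t P(Xt | Xt-1)  (limited horizon + time invariance)."""
    p = initial.get(states[0], 0.0)
    for prev, cur in zip(states, states[1:]):
        p *= trans.get((prev, cur), 0.0)
    return p

print(sequence_probability(["t", "i", "p", "a", "h", "e"]))   # 0.6*0.3*1.0*0.4*1.0*1.0 = 0.072
```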

  28. HMMs (and MEMM) intro • They are probabilistic sequence classifiers / sequence labelers: assign a class/label to each unit in a sequence. We have already seen a non-prob. version... • Used extensively in NLP • Part-of-Speech Tagging, e.g., Brainpower_NN ,_, not_RB physical_JJ plant_NN ,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ asset_NN ._. • Partial parsing • [NP The HD box] that [NP you] [VP ordered] [PP from] [NP Shaw] [VP never arrived] • Named entity recognition • [John Smith PERSON] left [IBM Corp. ORG] last summer CPSC503 Winter 2007

  29. Hidden Markov Model (State Emission) [state-transition and emission diagram over states s1–s4 with output symbols a, b, i; figure not in transcript] CPSC503 Winter 2007

  30. Hidden Markov Model • Formal specification as a five-tuple: • Set of states • Output alphabet • Initial state probabilities • State transition probabilities • Symbol emission probabilities [diagram values not in transcript] CPSC503 Winter 2007
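A sketch of the five-tuple as plain Python data structures; the states, alphabet, and probability values are made up for illustration, since the diagram's numbers are garbled in the transcript:

```python
# Hidden Markov Model as a five-tuple (illustrative values only).
hmm = {
    "states": ["s1", "s2", "s3", "s4"],        # set of states (hidden)
    "alphabet": ["a", "b", "i"],               # output alphabet (observed symbols)
    "initial": {"s1": 0.6, "s2": 0.4},         # initial state probabilities
    "transitions": {                           # state transition probabilities
        ("s1", "s1"): 0.3, ("s1", "s2"): 0.7,
        ("s2", "s3"): 0.5, ("s2", "s4"): 0.5,
        ("s3", "s4"): 1.0,
        ("s4", "s4"): 1.0,
    },
    "emissions": {                             # symbol emission probabilities
        "s1": {"i": 0.9, "b": 0.1},
        "s2": {"a": 0.6, "b": 0.4},
        "s3": {"a": 0.5, "b": 0.5},
        "s4": {"a": 0.4, "b": 0.6},
    },
}

# Unlike a plain Markov chain, we observe only the emitted symbols, not the states.
print(hmm["emissions"]["s1"]["i"])   # P(emit "i" | state s1) = 0.9
```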

  31. Next Time • Hidden Markov-Models (HMM) • Part-of-Speech (POS) tagging CPSC503 Winter 2009
