
N-grams Language Modeling and HW 5: Gaussians for Speaker Identification


    1. N-grams & Language Modeling and HW 5: Gaussians for Speaker Identification. With slides/figures/text/ideas/comics borrowed from: Dan Klein, Gary Larson, Chris Manning, Mark Mao, Andrew Ng, Claude Shannon, Ivica Rogina, Jim Unger, and most of all The State of the Art in Language Modeling, a tutorial at AAAI 2002 by Joshua Goodman and Eugene Charniak.

    2. Administrative Things. HW #5 is posted: speaker identification using single-Gaussian models (a special case of a Gaussian Mixture Model, with only a single Gaussian in the “mix”). Instead of modeling phonemes, we’ll be using Gaussians to model speakers; a rough sketch of the idea follows the training/testing previews below.

    3. HW #5 Preview: Training

    4. HW #5 Preview: Testing
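
As a rough sketch of what training and testing look like, here is a minimal single-Gaussian speaker-identification example in Python; it is not the HW 5 starter code, and the feature matrices, variance floor, and diagonal-covariance choice are illustrative assumptions.

import numpy as np

def train_speaker_model(features):
    # Fit one diagonal-covariance Gaussian to a speaker's training
    # frames (an N x D array of acoustic features, e.g. MFCCs).
    mean = features.mean(axis=0)
    var = features.var(axis=0) + 1e-6  # small floor avoids zero variance
    return mean, var

def log_likelihood(features, mean, var):
    # Total log-likelihood of all frames under the diagonal Gaussian.
    d = features.shape[1]
    per_frame = -0.5 * (d * np.log(2 * np.pi)
                        + np.log(var).sum()
                        + ((features - mean) ** 2 / var).sum(axis=1))
    return per_frame.sum()

def identify(test_features, models):
    # Pick the speaker whose Gaussian scores the test utterance highest.
    return max(models, key=lambda spk: log_likelihood(test_features, *models[spk]))

# Hypothetical usage, with train_data mapping speaker name -> feature matrix:
# models = {spk: train_speaker_model(X) for spk, X in train_data.items()}
# print(identify(test_utterance_features, models))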

    5. A bad language model

    6. A bad language model

    7. A bad language model

    8. What’s a Language Model? A language model is a probability distribution over word sequences: P(“And nothing but the truth”) ≈ 0.001, P(“And nuts sing on the roof”) ≈ 0.

    9. What’s a language model for? Speech recognition, handwriting recognition, spelling correction, optical character recognition, machine translation (and anyone doing statistical modeling).

    10. How is language modeling used in speech recognition?

    11. Overview: N-grams, Smoothing, Backoff, Caching, Skipping, Beyond N-grams (Parsing, Trigger Words).

    12. How Language Models Work. It is hard to compute P(“And nothing but the truth”) directly. Step 1: decompose the probability with the chain rule: P(“And nothing but the truth”) = P(“And”) × P(“nothing” | “And”) × P(“but” | “And nothing”) × P(“the” | “And nothing but”) × P(“truth” | “And nothing but the”).

    13. The n-gram Approximation. Assume each word depends only on the previous n−1 words (n words total, counting the word being predicted). For example, for trigrams (3-grams): P(“the” | “… whole truth and nothing but”) ≈ P(“the” | “nothing but”), and P(“truth” | “… whole truth and nothing but the”) ≈ P(“truth” | “but the”).

    14. n-grams, continued. How do we find probabilities? Get real text, and start counting! P(“the” | “nothing but”) ≈ C(“nothing but the”) / C(“nothing but”).
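
As a concrete illustration of “get real text and start counting,” here is a minimal sketch in Python (the toy corpus and tokenization are placeholders):

from collections import Counter

tokens = "and nothing but the truth and nothing but the truth".split()  # toy corpus

trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_ml(z, x, y):
    # Maximum-likelihood estimate P(z | x y) = C(x y z) / C(x y).
    return trigram_counts[(x, y, z)] / bigram_counts[(x, y)] if bigram_counts[(x, y)] else 0.0

print(p_ml("the", "nothing", "but"))  # C("nothing but the") / C("nothing but") = 1.0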

    15. Unigram probabilities (1-gram): see http://www.wordcount.org/main.php – “the” is the most likely word, “conquistador” among the least likely. Bigram probabilities (2-gram): given “the” as the previous word, it is more likely to go to “conquistador” than to “the” again.

    16. N-grams for Language Generation. C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656, July and October, 1948.
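
A minimal Shannon-style bigram generator, sketched in Python (the toy corpus and start word are placeholders, not examples from the slide):

import random
from collections import Counter, defaultdict

tokens = "the dog sold the salespeople the dog biscuits".split()  # toy corpus

bigram_counts = defaultdict(Counter)
for w1, w2 in zip(tokens, tokens[1:]):
    bigram_counts[w1][w2] += 1

def generate(start, length=8):
    # Random walk over the bigram counts, Shannon-style.
    out = [start]
    for _ in range(length):
        followers = bigram_counts[out[-1]]
        if not followers:
            break
        words, counts = zip(*followers.items())
        out.append(random.choices(words, weights=counts)[0])
    return " ".join(out)

print(generate("the"))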

    17. Evaluation. How can you tell a good language model from a bad one? Task-dependent approach: run a speech recognizer (or your application of choice) and calculate the word error rate. Drawbacks: slow, and specific to your recognizer.

    18. Task-Independent Evaluation: Perplexity Intuition. Ask a speech recognizer to recognize digits “0, 1, 2, 3, 4, 5, 6, 7, 8, 9” – easy – perplexity 10. Ask a speech recognizer to recognize one of 30,000 names at Microsoft – hard – perplexity 30,000. Ask a speech recognizer to recognize “Operator” (1 in 4), “Technical support” (1 in 4), “Sales” (1 in 4), or one of 30,000 names (1 in 120,000 each) – perplexity 54. Perplexity is a weighted equivalent branching factor.

    19. Evaluation: perplexity “A, B, C, D, E, F, G…Z”: perplexity is 26 “Alpha, bravo, charlie, delta…yankee, zulu”: perplexity is 26 Perplexity measures language model difficulty, not acoustic difficulty.

    20. Perplexity: Math. Perplexity is the geometric average inverse probability of the test data under the model. Imagine the model: “Operator” (1 in 4), “Technical support” (1 in 4), “Sales” (1 in 4), 30,000 names (1 in 120,000 each). Imagine the test data: all 30,003 outcomes equally likely. Example: the perplexity of the test data, given the model, is 119,829. Remarkable fact: the true model for the data has the lowest possible perplexity.

    21. Perplexity: Math. Same model: “Operator” (1 in 4), “Technical support” (1 in 4), “Sales” (1 in 4), 30,000 names (1 in 120,000 each); same test data: all 30,003 equally likely. We can compute three different perplexities. Model (ignoring test data): perplexity 54. Test data (ignoring model): perplexity 30,003. Model on test data: perplexity 119,829. When we say “perplexity,” we mean “model on test data.” Remarkable fact: the true model for the data has the lowest possible perplexity.
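
To make the three numbers concrete, here is a sketch of the arithmetic in Python; the slide’s figures involve some rounding, so the printed values come out close to, but not exactly, 54 and 119,829.

import math

# Model: "Operator", "Technical support", "Sales" at 1/4 each,
# plus 30,000 names at 1/120,000 each.
model_probs = [0.25] * 3 + [1 / 120_000] * 30_000

# Test data: all 30,003 outcomes equally likely, each seen once.
N = 30_003

# Perplexity of the model, ignoring the test data: 2 ** (model entropy).
model_entropy = -sum(p * math.log2(p) for p in model_probs)
print(2 ** model_entropy)          # about 53 (slide rounds to 54)

# Perplexity of the test data, ignoring the model: uniform over 30,003.
print(N)                           # 30,003

# Perplexity of the model on the test data:
# geometric average inverse probability of the test items.
avg_log_inv_prob = sum(-math.log2(p) for p in model_probs) / N
print(2 ** avg_log_inv_prob)       # about 119,900 (slide: 119,829)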

    22. Perplexity: Is lower better? Remarkable fact: the true model for the data has the lowest possible perplexity, so the lower the perplexity, the closer we are to the true model. Typically, perplexity correlates well with speech recognition word error rate; it correlates better when both models are trained on the same data, and it doesn’t correlate well when the training data changes.

    23. Evaluation: entropy Entropy = log2 perplexity
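
Spelled out for a test set of N words w_1 … w_N:

H = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_1 \dots w_{i-1}), \qquad \text{perplexity} = 2^{H}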

    24. Back to N-grams… Trigram Probability, Before Smoothing. This is called the Maximum Likelihood estimate. It gives the lowest-perplexity trigram model on the training data, but it is terrible on test data: if C(xyz) = 0, the estimated probability is 0.
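
In standard notation, the maximum likelihood trigram estimate is the counting rule from slide 14 written generally:

P_{ML}(z \mid x\,y) = \frac{C(x\,y\,z)}{C(x\,y)}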

    25. Why should we smooth?

    26. Smoothing is like Robin Hood: Steal from the rich and give to the poor (in probability mass)

    27. Simplest Smoothing: Add One. What is P(“sing” | “nuts”)? Zero? That leads to infinite perplexity! Add-one smoothing: works very badly. DO NOT DO THIS. Add-delta smoothing: still very bad. DO NOT DO THIS.
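
In standard notation, with vocabulary size V, add-one and add-delta smoothing are:

P_{+1}(z \mid x\,y) = \frac{C(x\,y\,z) + 1}{C(x\,y) + V}, \qquad P_{+\delta}(z \mid x\,y) = \frac{C(x\,y\,z) + \delta}{C(x\,y) + \delta V}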

    28. Smoothing: Simple Interpolation. The trigram is very context-specific but very noisy; the unigram is context-independent and smooth. Interpolate trigram, bigram, and unigram for the best combination. Find the interpolation weights (each between 0 and 1) by optimizing on “held-out” data. Almost good enough.
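
The usual form of this interpolation, with weights tuned on held-out data:

P_{\text{interp}}(z \mid x\,y) = \lambda\, P_{ML}(z \mid x\,y) + \mu\, P_{ML}(z \mid y) + (1 - \lambda - \mu)\, P_{ML}(z), \qquad \lambda, \mu \ge 0,\ \lambda + \mu \le 1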

    29. Smoothing: Interpolated Absolute Discount. Backoff: ignore the bigram if we have the trigram. Interpolated: always combine the bigram and the trigram.
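
One standard form of interpolated absolute discounting (a sketch of the idea, not necessarily the exact formula on the slide), with discount D and the weight α(x y) chosen so the distribution sums to one:

P_{\text{abs}}(z \mid x\,y) = \frac{\max\bigl(C(x\,y\,z) - D,\ 0\bigr)}{C(x\,y)} + \alpha(x\,y)\, P(z \mid y)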

    30. Smoothing: Interpolated Multiple Absolute Discounts. One discount is good; different discounts for different counts are better. Multiple discounts: one for 1-count n-grams, one for 2-count, one for counts greater than 2.

    31. Smoothing: Kneser-Ney. Compare P(“Francisco” | “eggplant”) vs. P(“stew” | “eggplant”). “Francisco” is common, so backoff and interpolated methods say it is likely – but it only occurs in the context of “San”. “Stew” is common too, and it appears in many contexts. Solution: weight the backoff distribution by the number of contexts a word occurs in.

    32. Smoothing: Kneser-Ney. Interpolated, absolute-discounted, and with a modified backoff distribution: consistently the best technique.
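
The key Kneser-Ney idea, in standard notation: the lower-order (“continuation”) distribution weights a word by how many distinct contexts it follows, not by its raw count:

P_{\text{cont}}(z) = \frac{\bigl|\{\, y : C(y\,z) > 0 \,\}\bigr|}{\bigl|\{\, (y, z') : C(y\,z') > 0 \,\}\bigr|}

Under this distribution “Francisco” scores low (it follows almost only “San”), while “stew” scores high.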

    33. Caching. If you say something, you are likely to say it again later. So: interpolate the trigram with a cache of recently seen words.
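
One simple cache model of this kind (a sketch; the weight λ and the window size are tuning choices) interpolates the trigram with the empirical distribution of the recent history:

P_{\text{cache}}(z \mid \text{history}) = \lambda\, P_{\text{trigram}}(z \mid x\,y) + (1 - \lambda)\, \frac{C_{\text{history}}(z)}{|\text{history}|}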

    34. Caching: Real Life. Someone says “I swear to tell the truth”; the system hears “I swerve to smell the soup”. The cache remembers! The person then says “The whole truth”, and, with the cache, the system hears “The whole soup” – the errors are locked in. Caching works well when users correct errors as they go; it works poorly, or even hurts, without correction.

    35. Skipping. P(z | …rstuvwxy) ≈ P(z | vwxy). Why not P(z | v_xy) – a “skipping” n-gram that skips the value of the 3-back word? Example: P(“time” | “show John a good”) → P(“time” | “show ____ a good”). Combine them: P(z | …rstuvwxy) ≈ λ P(z | vwxy) + μ P(z | vw_y) + (1 − λ − μ) P(z | v_xy).

    36. What actually works?

    37. No data like mo’ data

    38. Tools: CMU Language Modeling Toolkit Can handle bigram, trigrams, more Can handle different smoothing schemes Many separate tools – output of one tool is input to next: easy to use Free for research purposes http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html

    39. Tools: SRI Language Modeling Toolkit More powerful than CMU toolkit Can handle clusters, lattices, n-best lists, hidden tags Free for research use http://www.speech.sri.com/projects/srilm

    40. Beyond N-grams: Decision Trees / Random Forests, Parsing (“Structured Language Model”), Class Models. For more about random forests, check out: Fred Jelinek (Johns Hopkins University), “Random Forests for Language Modeling”, Feb. 24, 4:15pm, Lane History Corner, 200-205; http://nlp.stanford.edu/events.shtml

    41. Parsing

    42. Probabilistic Context-Free Grammars (PCFGs). S → NP VP (1.0); VP → V NP (0.5); VP → V NP NP (0.5); NP → Det N (0.5); NP → Det N N (0.5); N → salespeople (0.3); N → dog (0.4); N → biscuits (0.3); V → sold (1.0).

    43. Producing a Single “Best” Parse. The parser finds the most probable parse tree given the sentence s. For a PCFG we have the following, where r varies over the rules used in the tree:
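
In the standard PCFG formulation, the probability of a tree is the product of the probabilities of its rules, and the parser returns the most probable tree whose yield is the sentence s:

T^{*} = \arg\max_{T}\ P(T \mid s) = \arg\max_{T:\ \text{yield}(T) = s}\ \prod_{r \in T} P(r)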

    44. Parsers and Language Models. Generative parsers are of particular interest because they can be turned into language models. If there is no parse for the sentence, then the model assigns it probability zero.
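
As a language model, such a parser sums over all parses of the sentence:

P(s) = \sum_{T:\ \text{yield}(T) = s}\ \prod_{r \in T} P(r)

If the sentence has no parse, the sum is empty, which is exactly the probability-zero problem noted above.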

    45. Class Models. CLUSTERING = CLASSES (the same thing). What is P(“Tuesday” | “party on”)? It should be similar to P(“Monday” | “party on”) and to P(“Tuesday” | “celebration on”). So put words in clusters: WEEKDAY = Sunday, Monday, Tuesday, …; EVENT = party, celebration, birthday, …
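
A standard class-based decomposition of the kind motivated here predicts the class from the class history, then the word within its class:

P(z \mid x\,y) \approx P\bigl(\text{class}(z) \mid \text{class}(x)\,\text{class}(y)\bigr) \times P\bigl(z \mid \text{class}(z)\bigr)

For the example above, the model first predicts WEEKDAY from the (class) history, then multiplies by P(“Tuesday” | WEEKDAY).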

    46. Conclusions. Use trigram models. Use any reasonable smoothing algorithm (Katz, Kneser-Ney). Use caching if you have correction information. Parsing is a promising technique. Clustering, sentence mixtures, and skipping are not usually worth the effort.

    47. More Resources. Caching: R. Kuhn. Speech recognition and the frequency of recently used words: A modified Markov model for natural language. In 12th International Conference on Computational Linguistics, pages 348-350, Budapest, August 1988. R. Kuhn and R. De Mori. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(6):570-583, 1990. R. Kuhn and R. De Mori. Correction to a cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(6):691-692, 1992.

    48. More Resources: Clustering. The seminal reference: P. F. Brown, V. J. Della Pietra, P. V. de Souza, J. C. Lai, and R. L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479, December 1992. Two-sided clustering: H. Yamamoto and Y. Sagisaka. Multi-class composite n-gram based on connection direction. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Phoenix, Arizona, May 1999. Fast clustering: D. R. Cutting, D. R. Karger, J. R. Pedersen, and J. W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In SIGIR 92, 1992. Other: R. Kneser and H. Ney. Improved clustering techniques for class-based statistical language modeling. In Eurospeech 93, volume 2, pages 973-976, 1993.

    49. More Resources Structured Language Models Eugene’s web page Ciprian Chelba’s web page: http://www.clsp.jhu.edu/people/chelba/ Maximum Entropy Roni Rosenfeld’s home page and thesis http://www.cs.cmu.edu/~roni/ Stolcke Pruning A. Stolcke (1998), Entropy-based pruning of backoff language models. Proc. DARPA Broadcast News Transcription and Understanding Workshop, pp. 270-274, Lansdowne, VA. NOTE: get corrected version from http://www.speech.sri.com/people/stolcke

    50. More Resources: Skipping. X. Huang, F. Alleva, H.-W. Hon, M.-Y. Hwang, K.-F. Lee, and R. Rosenfeld. The SPHINX-II speech recognition system: An overview. Computer Speech and Language, 2:137-148, 1993.
