
Search and Decoding in Speech Recognition


Presentation Transcript


  1. Search and Decoding in Speech Recognition N-Grams

  2. N-Grams • Problem of word prediction. • Example: “I’d like to make a collect …” • Very likely words: • “call”, • “international call”, or • “phone call”, and NOT • “the”. • The idea of word prediction is formalized with probabilistic models called N-grams. • N-grams predict the next word from the previous N-1 words. • Statistical models of word sequences are also called language models or LMs. • Computing the probability of the next word will turn out to be closely related to computing the probability of a sequence of words. • Example: • “… all of a sudden I notice three guys standing on the sidewalk …”, vs. • “… on guys all I of notice sidewalk three a sudden standing the …” Veton Këpuska

  3. N-grams • Estimators like N-grams that assign a conditional probability to possible next words can be used to assign a joint probability to an entire sentence. • N-gram models are one of the most important tools in speech and language processing. • N-grams are essential in any task in which words must be identified from ambiguous and noisy inputs. • Speech Recognition – the input speech sounds are very confusable and many words sound extremely similar. Veton Këpuska

  4. N-gram • Handwriting Recognition – probabilities of word sequences help in recognition. • Woody Allen, in his movie “Take the Money and Run”, tries to rob a bank with a sloppily written hold-up note that the teller incorrectly reads as “I have a gub”. • Any speech and language processing system could avoid making this mistake by using the knowledge that the sequence “I have a gun” is far more probable than the non-word “I have a gub” or even “I have a gull”. Veton Këpuska

  5. N-gram • Statistical Machine Translation – example of choosing among a set of potential rough English translations of a Chinese source sentence: • he briefed to reporters on the chief contents of the statement • he briefed reporters on the chief contents of the statement • he briefed to reporters on the main contents of the statement • he briefed reporters on the main contents of the statement Veton Këpuska

  6. N-gram • An N-gram grammar might tell us that briefed reporters is more likely than briefed to reporters, and main contents is more likely than chief contents. • Spelling Correction – need to find and correct spelling errors like the following, which accidentally result in real English words: • They are leaving in about fifteen minuets to go to her house. • The design an construction of the system will take more than a year. • Problem – these are real words, so a dictionary search will not help. • Note: “in about fifteen minuets” is a much less probable sequence than “in about fifteen minutes”. • A spell-checker can use a probability estimator both to detect these errors and to suggest higher-probability corrections. Veton Këpuska

  7. N-gram • Augmentative Communication – helping people who are unable to use speech or sign language to communicate (e.g., Stephen Hawking). • Using simple body movements to select words from a menu that are spoken by the system. • Word prediction can be used to suggest likely words for the menu. • Other areas: • Part-of-speech tagging • Natural Language Generation • Word Similarity • Authorship Identification • Sentiment Extraction • Predictive Text Input (cell phones). Veton Këpuska

  8. Corpora & Counting Words • Probabilities are based on counting things. • Must decide what to count. • Counting things in natural language is based on a corpus (plural corpora) – an on-line collection of text or speech. • Popular corpora “Brown” and “Switchboard”. • Brown corpus is a 1 million word collection of samples from 500 written texts from different genres (newspaper, novels, non-fiction, academic, etc.) assembled at Brown university 1963-1964. • Example sentence from Brown corpus: • He stepped out into the hall, was delighted to encounter a water brother. • 13 words if don’t’ count punctuation-marks as words – 15 if we count punctuation. • Treatment of “,” and “.” depends on the task. • Punctuation marks are critical for identifying boundaries (, . ;) of things and for identifying some aspects of meaning (?!”) • For some tasks (part-of-speech tagging or parsing or sometimes speech synthesis) punctuation are treated as being separate words. Veton Këpuska

  9. Corpora & Counting Words • Switchboard Corpus – collection of 2430 telephone conversations averaging 6 minutes each – total of 240 hours of speech with about 3 million words. • This kind of corpora do not have punctuation. • Complications with defining words. • Example: I do uh main- mainly business data processing. • Two kinds of disfluencies. • Broken-off word main- is called a fragment. • Words like uhum are called fillers or filled pauses. • Counting disfluencies as words depends on the application: • Automatic Dictation System based on Automatic Speech Recognition will remove disfluencies. • Speaker Identification application can use disfluencies to identify a person. • Parsing and word prediction can use disfluencies – Stolcke and Shriberg (1996) found that treating uh as a word improves next-word prediciton (?) and thus most speech recognition systems treat uh and um as words. Veton Këpuska

  10. N-gram • Are capitalized tokens like “They” and un-capitalized tokens like “they” the same word? • In speech recognition they are treated the same. • In part-of-speech tagging capitalization is retained as a separate feature. • In this chapter models are not case sensitive. • Inflected forms – cats versus cat. These two words have the same lemma “cat” but are different wordforms. • A lemma is a set of lexical forms having the same • stem, • major part-of-speech, and • word sense. • A wordform is the full • inflected or • derived form of the word. Veton Këpuska

  11. N-grams • In this chapter N-grams are based on wordforms. • N-gram models, and counting words in general, require the kind of tokenization or text normalization that was introduced in the previous chapter: • Separating out punctuation • Dealing with abbreviations (m.p.h.) • Normalizing spelling, etc. Veton Këpuska

  12. N-gram • How many words are there in English? • Must first distinguish • types – the number of distinct words in a corpus, or vocabulary size V, from • tokens – the total number N of running words. • Example: • They picnicked by the pool, then lay back on the grass and looked at the stars. • 16 Tokens • 14 Types Veton Këpuska
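
A quick check of these counts (a minimal sketch assuming whitespace tokenization with the punctuation stripped, matching the counts above):

    # Count tokens (running words) and types (distinct words) in the example.
    sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."
    tokens = sentence.replace(",", "").replace(".", "").split()
    print(len(tokens))       # 16 tokens
    print(len(set(tokens)))  # 14 types ("the" occurs three times)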

  13. N-gram • The Switchboard corpus has: • ~20,000 wordform types • ~3 million wordform tokens • Shakespeare’s complete works have: • 29,066 wordform types • 884,647 wordform tokens • The Brown corpus has: • 61,805 wordform types • 37,851 lemma types • 1 million wordform tokens • A very large corpus (Brown et al., 1992a) was found to include: • 293,181 different wordform types • 583 million wordform tokens • The American Heritage Dictionary (third edition) lists 200,000 boldface forms. • It seems that the larger the corpus, the more word types are found: • It is suggested that vocabulary size (the number of types) grows at least as fast as the square root of the number of tokens. Veton Këpuska

  14. Brief Introduction to Probability Discrete Probability

  15. Discrete Probability Distributions • Definition: • A set S, called the sample space, contains all possible outcomes. • For each element x of the set S, x ∊ S, a probability value is assigned as a function of x, P(x), with the following properties: • 1. P(x) ∊ [0,1], ∀ x ∊ S • 2. ∑x∊S P(x) = 1 Veton Këpuska

  16. Discrete Probability Distributions • An event is defined as any subset E of the sample space S. • The probability of the event E is defined as: P(E) = ∑x∊E P(x). • The probability of the entire space S is 1, as indicated by property 2 on the previous slide. • The probability of the empty or null event is 0. • The function P(x), mapping a point in the sample space to a “probability” value, is called a probability mass function (pmf). Veton Këpuska

  17. Properties of Probability Function • If A and B are mutually exclusive events in S, then: • P(A∪B) = P(A)+P(B) • Mutually exclusive events are those for which A∩B = ∅ • In general, for n mutually exclusive events A1, …, An: P(A1∪A2∪…∪An) = P(A1)+P(A2)+…+P(An) • (Venn diagram of two disjoint events A and B) Veton Këpuska

  18. Elementary Theorems of Probability • If A is any event in S, then • P(A′) = 1−P(A), where A′ is the complement of A (all outcomes not in A). • Proof: • P(A∪A′) = P(A)+P(A′), considering that • P(A∪A′) = P(S) = 1 • P(A)+P(A′) = 1 Veton Këpuska

  19. Elementary Theorems of Probability • If A and B are any events in S, then • P(A∪B) = P(A)+P(B)−P(A∩B) • Proof: • P(A∪B) = P(A∩B′)+P(A∩B)+P(A′∩B) • P(A∪B) = [P(A∩B′)+P(A∩B)] + [P(A′∩B)+P(A∩B)] − P(A∩B) • P(A∪B) = P(A)+P(B)−P(A∩B) • (Venn diagram: A∪B partitioned into A∩B′, A∩B, A′∩B within S) Veton Këpuska

  20. Conditional Probability • If A and B are any events in S, and P(B)≠0, the conditional probability of A relative to B is given by: P(A|B) = P(A∩B)/P(B). • If A and B are any events in S, then P(A∩B) = P(A|B)·P(B) = P(B|A)·P(A). Veton Këpuska

  21. Independent Events • If A and B are independent events, then: P(A∩B) = P(A)·P(B), or equivalently P(A|B) = P(A). Veton Këpuska

  22. Bayes Rule • If B1, B2, B3, …, Bn are mutually exclusive events of which one must occur, that is: B1∪B2∪…∪Bn = S and Bi∩Bj = ∅ for i≠j, then: P(Bi|A) = P(A|Bi)·P(Bi) / ∑k=1..n P(A|Bk)·P(Bk) Veton Këpuska
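
Written out in LaTeX, the rule follows from the definition of conditional probability together with the law of total probability (a standard derivation, restated here for reference):

    \begin{align*}
    P(A) &= \sum_{k=1}^{n} P(A \mid B_k)\,P(B_k) && \text{(law of total probability)} \\
    P(B_i \mid A) &= \frac{P(A \cap B_i)}{P(A)}
                   = \frac{P(A \mid B_i)\,P(B_i)}{\sum_{k=1}^{n} P(A \mid B_k)\,P(B_k)} && \text{(Bayes rule)}
    \end{align*}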

  23. End of Brief Introduction to Probability End

  24. Simple N-grams

  25. Simple (Unsmoothed) N-Grams • Our goal is to compute the probability of a word w given some history h: P(w|h). • Example: • h ⇒ “its water is so transparent that” • w ⇒ “the” • P(the | its water is so transparent that) • How can we compute this probability? • One way is to estimate it from relative frequency counts. • From a very large corpus, count the number of times we see “its water is so transparent that” and count the number of times it is followed by “the”. Out of the times we saw the history h, how many times was it followed by the word w: • P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that) Veton Këpuska
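
As a rough sketch of this relative-frequency estimate (assuming the corpus is simply a list of tokens; corpus_tokens and the function name are illustrative, not part of the original):

    def relative_frequency(tokens, history, word):
        # C(history followed by word) / C(history), with the history counted
        # as a contiguous sequence of tokens.
        h, n = list(history), len(history)
        count_h = sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == h)
        count_hw = sum(1 for i in range(len(tokens) - n) if tokens[i:i + n] == h and tokens[i + n] == word)
        return count_hw / count_h if count_h else 0.0

    # Usage, e.g.:
    # relative_frequency(corpus_tokens, "its water is so transparent that".split(), "the")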

  26. Estimating Probabilities • Estimating probabilities from counts works fine in many cases, but it turns out that even the WWW is not big enough to give us good estimates in most cases. • Language is creative: • new sentences are created all the time. • It is not possible to count entire sentences. Veton Këpuska

  27. Estimating Probabilities • Joint Probabilities – the probability of an entire sequence of words like “its water is so transparent”: • Out of all possible sequences of 5 words, how many of them are “its water is so transparent”? • Must count all occurrences of “its water is so transparent” and divide by the sum of counts of all possible 5-word sequences. • This seems like a lot of work for a simple estimate. Veton Këpuska

  28. Estimating Probabilities • Must figure out cleverer ways of estimating the probability of • a word w given some history h, or • an entire word sequence W. • Introduction of formal notation: • Random variable – Xi • Probability of Xi taking on the value “the” – P(Xi = “the”) = P(the) • Sequence of n words: w1 w2 … wn, written w1..wn for short. • Joint probability of each word in a sequence having a particular value: P(X1 = w1, X2 = w2, …, Xn = wn), written P(w1, w2, …, wn) or P(w1..wn). Veton Këpuska

  29. Chain Rule • Chain rule of probability: P(X1, X2, …, Xn) = P(X1) P(X2|X1) P(X3|X1,X2) … P(Xn|X1..Xn-1) • Applying the chain rule to words we get: P(w1..wn) = P(w1) P(w2|w1) P(w3|w1..w2) … P(wn|w1..wn-1) = ∏k=1..n P(wk|w1..wk-1) Veton Këpuska
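
For instance, applied to the running example phrase (one concrete expansion of the formula above):

    \begin{align*}
    P(\text{its water is so transparent}) ={} & P(\text{its}) \times P(\text{water} \mid \text{its}) \times P(\text{is} \mid \text{its water}) \\
        & \times P(\text{so} \mid \text{its water is}) \times P(\text{transparent} \mid \text{its water is so})
    \end{align*}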

  30. Chain Rule • The chain rule provides the link between computing the joint probability of a sequence and computing the conditional probability of a word given previous words. • The equation presented on the previous slide provides a way of computing the joint probability estimate of an entire sequence as a product of a number of conditional probabilities. • However, we still do not know any way of computing the exact probability of a word given a long sequence of preceding words, P(wn|w1..wn-1). Veton Këpuska

  31. N-grams • Approximation: • The idea of the N-gram model is to approximate the history by just the last few words instead of computing the probability of a word given its entire history. • Bigram: • The bigram model approximates the probability of a word given all the previous words by the conditional probability given the preceding word alone. • Example: Instead of computing the probability P(the | its water is so transparent that), it is approximated with the probability P(the | that). Veton Këpuska

  32. Bi-gram • The following approximation is used when the bi-gram probability is applied: P(wn|w1..wn-1) ≈ P(wn|wn-1) • The assumption that the conditional probability of a word depends only on the previous word is called a Markov assumption. • Markov models are the class of probabilistic models that assume that we can predict the probability of some future unit without looking too far into the past. Veton Këpuska

  33. Bi-gram Generalization • Tri-gram: looks two words into the past. • N-gram: looks N-1 words into the past. • The general equation for the N-gram approximation to the conditional probability of the next word in a sequence is: P(wn|w1..wn-1) ≈ P(wn|wn-N+1..wn-1) • The simplest and most intuitive way to estimate probabilities is the method called Maximum Likelihood Estimation, or MLE for short. Veton Këpuska

  34. Maximum Likelihood Estimation • The MLE estimate for the parameters of an N-gram model is obtained by taking counts from a corpus and normalizing them so they lie between 0 and 1. • Bi-gram: to compute a particular bi-gram probability of a word y given a previous word x, the count C(xy) is computed and normalized by the sum of all bi-grams that share the same first word x: P(y|x) = C(xy) / ∑w C(xw) Veton Këpuska

  35. Maximum Likelihood Estimate • The previous equation can be further simplified by noting that the sum of all bigram counts that start with a given word x equals the unigram count of x: ∑w C(xw) = C(x), so P(y|x) = C(xy) / C(x). Veton Këpuska

  36. Example • Mini-corpus containing three sentences marked with the beginning-of-sentence marker <s> and the end-of-sentence marker </s>: <s> I am Sam </s> <s> Sam I am </s> <s> I do not like green eggs and ham </s> (From the Dr. Seuss book “Green Eggs and Ham”) • Some of the bi-gram calculations from this corpus: • P(I|<s>) = 2/3 ≈ 0.67 P(Sam|<s>) = 1/3 ≈ 0.33 • P(am|I) = 2/3 ≈ 0.67 P(Sam|am) = 1/2 = 0.5 • P(</s>|Sam) = 1/2 = 0.5 P(</s>|am) = 1/2 = 0.5 • P(do|I) = 1/3 ≈ 0.33 Veton Këpuska
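
The counts behind these numbers can be reproduced with a small sketch (the corpus is the one on the slide; the helper name is just for illustration):

    from collections import Counter

    corpus = [
        "<s> I am Sam </s>",
        "<s> Sam I am </s>",
        "<s> I do not like green eggs and ham </s>",
    ]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        unigram_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))

    def bigram_prob(w, prev):
        # MLE: P(w | prev) = C(prev w) / C(prev)
        return bigram_counts[(prev, w)] / unigram_counts[prev]

    print(bigram_prob("I", "<s>"))     # 2/3 ≈ 0.67
    print(bigram_prob("Sam", "am"))    # 1/2 = 0.5
    print(bigram_prob("</s>", "Sam"))  # 1/2 = 0.5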

  37. N-gram Parameter Estimation • In the general case, the MLE for an N-gram model is calculated using the following: P(wn|wn-N+1..wn-1) = C(wn-N+1..wn-1 wn) / C(wn-N+1..wn-1) • This equation estimates the N-gram probability by dividing the observed frequency of a particular sequence by the observed frequency of its prefix. This ratio is called a relative frequency. • Using relative frequencies is one way to estimate probabilities in Maximum Likelihood Estimation. • The conventional MLE is not always the best way to compute probability estimates (it is biased toward its training corpus – e.g., Brown). • MLE can be modified to better address those considerations. Veton Këpuska

  38. Example 2 • Data used from the Berkeley Restaurant Project corpus, consisting of 9332 sentences (available from the WWW): • can you tell me about any good cantonese restaurants close by • mid priced thai food is what i’m looking for • tell me about chez panisse • can you give me a listing of the kinds of food that are available • i’m looking for a good place to eat breakfast • when is caffe venezia open during the day Veton Këpuska

  39. Bigram counts for eight of the words (out of V=1446) in Berkeley Restaurant Project corpus of 9332 sentences Veton Këpuska

  40. Bigram Probabilities (After Normalization by Unigram Counts) • Some other useful probabilities: • P(i|<s>) = 0.25 P(english|want) = 0.0011 • P(food|english) = 0.5 P(</s>|food) = 0.68 • Clearly now we can compute the probability of a sentence like: • “I want English food”, or • “I want Chinese food” by multiplying the appropriate bigram probabilities together, as follows: Veton Këpuska

  41. Bigram probabilities for eight words (out of V=1446) in Berkeley Restaurant Project corpus of 9332 sentences Veton Këpuska

  42. Bigram Probability • P(<s> i want english food </s>) = P(i|<s>) P(want|i) P(english|want) P(food|english) P(</s>|food) = 0.25 x 0.33 x 0.0011 x 0.5 x 0.68 = 0.000031 • Exercise: Compute the probability of “I want chinese food”. • Some of the bigram probabilities encode facts that we think of as strictly syntactic in nature: • What comes after eat is usually • a noun or • an adjective, or • What comes after to is usually • a verb. Veton Këpuska
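
The same product can be computed directly, using only the bigram probabilities quoted on the slide (in practice, log probabilities are summed instead to avoid numerical underflow for long sentences):

    import math

    # P(i|<s>), P(want|i), P(english|want), P(food|english), P(</s>|food)
    bigram_probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

    print(math.prod(bigram_probs))              # ≈ 0.000031
    log_prob = sum(math.log(p) for p in bigram_probs)
    print(math.exp(log_prob))                   # same value, computed in log space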

  43. Trigram Modeling • Although we will generally show bigram models in this chapter for pedagogical purposes, note that when there is sufficient training data we are more likely to use trigram models, which condition on the previous two words rather than the previous word. • To compute trigram probabilities at the very beginning of a sentence, we can use two pseudo-words for the first trigram (i.e., P(I|<s><s>)). Veton Këpuska
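
A tiny sketch of this pseudo-word padding (assuming simple whitespace tokens, as in the sketches above):

    def trigrams(sentence):
        # Pad with two <s> pseudo-words so the first real word can be
        # conditioned on <s> <s>, and close the sentence with </s>.
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        return list(zip(words, words[1:], words[2:]))

    print(trigrams("I want english food")[0])  # ('<s>', '<s>', 'I')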

  44. Training and Test Sets • N-gram models are estimated from the corpus they are trained on. • Those models are then used on some new data in some task (e.g., speech recognition). • The new data or task will not be exactly the same as the data that was used for training. • Formally: • Data that is used to build the N-gram (or any) model is called the Training Set or Training Corpus. • Data that is used to test the model comprises the Test Set or Test Corpus. Veton Këpuska

  45. Model Evaluation • The training-and-testing paradigm can also be used to evaluate different N-gram architectures: • Comparing N-grams of different order N, or • Using different smoothing algorithms (to be introduced later). • Train various models using the training corpus. • Evaluate each model on the test corpus. • How do we measure the performance of each model on the test corpus? • Perplexity (introduced later in the chapter) – based on computing the probability of each sentence in the test set: the model that assigns a higher probability to the test set (hence more accurately predicts the test set) is assumed to be a better model. • Because the evaluation metric is based on test set probability, it is important not to let the test sentences into the training set, i.e., to avoid training on the test set data. Veton Këpuska

  46. Other Divisions of Data • Sometimes an extra source of data to augment the training set is needed. This data is called a held-out set. • The N-gram counts themselves are based only on the training set. • The held-out set is used to set additional (other) parameters of our model, • e.g., the interpolation parameters of an N-gram model. • Multiple test sets: • A test set that is used often in measuring performance of the model is typically called a development (test) set. • Due to its heavy usage, the models may become tuned to it. Thus a completely unseen (or seldom used) data set should be used for the final evaluation. This set is called the evaluation (test) set. Veton Këpuska

  47. Picking Training, Development Test and Evaluation Test Data • For training we need as much data as possible. • However, for testing we need sufficient data in order for the resulting measurements to be statistically significant. • In practice, the data is often divided into 80% training, 10% development, and 10% evaluation. Veton Këpuska
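
A rough sketch of such an 80/10/10 split (assuming the data is a list of sentences and that shuffling before splitting is acceptable for the task):

    import random

    def split_data(sentences, seed=0):
        # Shuffle, then carve off 80% training, 10% development, 10% evaluation.
        data = list(sentences)
        random.Random(seed).shuffle(data)
        n = len(data)
        return data[:int(0.8 * n)], data[int(0.8 * n):int(0.9 * n)], data[int(0.9 * n):]

    # train, dev, evaluation = split_data(corpus_sentences)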

  48. N-gram Sensitivity to the Training Corpus • N-gram modeling, like many statistical models, is very dependent on the training corpus. • Often the model encodes very specific facts about a given training corpus. • N-grams do a better and better job of modeling the training corpus as we increase the value of N. • This is another aspect of the model being tuned too specifically to the training data, at the expense of generality. Veton Këpuska

  49. Visualization of N-gram Modeling • Shannon (1951) and Miller & Selfridge (1950). • The simplest way to visualize how this works is the unigram case: • Imagine all the words of the English language covering the probability space between 0 and 1, each word covering an interval of size equal to its (relative) frequency. • We choose a random number between 0 and 1, and print out the word whose interval includes the value we have chosen. • We continue choosing random numbers and generating words until we randomly generate the sentence-final token </s>. • The same technique can be used to generate bigrams by • first generating a random bigram that starts with <s> (according to its bigram probability), • then choosing a random bigram to follow it (again according to its conditional probability), and so on. Veton Këpuska
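
The unigram version of this sampling procedure might look roughly like the following sketch (the unigram_probs dictionary stands in for relative frequencies estimated from a corpus; it is assumed to include </s> and to sum to 1):

    import random

    def sample_word(unigram_probs):
        # Pick a random point in [0, 1) and return the word whose
        # cumulative-probability interval contains it.
        r = random.random()
        cumulative = 0.0
        for word, p in unigram_probs.items():
            cumulative += p
            if r < cumulative:
                return word
        return "</s>"  # guard against floating-point rounding

    def generate_sentence(unigram_probs):
        # Keep sampling words until the sentence-final token is drawn.
        words = []
        while True:
            w = sample_word(unigram_probs)
            words.append(w)
            if w == "</s>":
                return " ".join(words)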

  50. Visualization of N-gram Modeling: Unigram • To provide an intuition of the increasing power of higher-order N-grams, the example below shows random sentences generated from unigram, bigram, trigram, and quadrigram models trained on Shakespeare’s works. • To him swallowed confess hear both. Which. Of save on trail for are ay device an rote life have • Every enter noe severally so, let • Hill he late speaks; or! A more to leg less first you enter • Are where exeunt and sighs have rise excellency took of. Sleep knave we. Near; vile like Veton Këpuska
