
Resolving Word Ambiguities


Presentation Transcript


1. Resolving Word Ambiguities
• Description: after determining word boundaries, the speech recognition process must choose among an array of possible word sequences for the spoken audio
• Issues to consider
  • How do we determine the intended word sequence?
  • How do we resolve grammatical and pronunciation errors?
• Applications: spell-checking, allophonic variations of pronunciation, automatic speech recognition
• Implementation: establish word-sequence probabilities
  • Use existing corpora
  • Train the program with run-time data

2. Entropy
• Measures the quantity of information in a signal stream
• H(X) = −∑i=1,r p(i) lg p(i), where X is a random variable that can assume r values and lg is log base 2
• Questions answered if we can compute entropy:
  • How much information is there in a particular grammar, word sequence, set of parts of speech, phonemes, etc.?
  • How predictive is a language in computing the next word based on previous words?
  • How difficult is a speech recognition task?
• Entropy is measured in bits
  • What is the least number of bits required to encode a piece of information?
• The entropy of spoken languages could focus implementation possibilities
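
As a quick illustration (not from the slides), the definition above can be computed directly; a minimal sketch in Python:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * lg p) over outcomes with p > 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniform distribution over 8 outcomes needs lg(8) = 3 bits.
print(entropy([1/8] * 8))   # 3.0
```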

3. Entropy Example
• Eight horses are in an upcoming race, and we want to take bets on the winner
• The naïve approach is to use 3 bits (8 equally long codes)
• A better approach is to use fewer bits for the horses bet on more frequently
• Entropy: what is the minimum average number of bits needed?
  H = −∑i=1,8 p(i) log p(i)
    = −½ log ½ − ¼ log ¼ − 1/8 log 1/8 − 1/16 log 1/16 − 4 × 1/64 log 1/64
    = 1/2 + 2/4 + 3/8 + 4/16 + 4 × 6/64 = (32 + 32 + 24 + 16 + 24)/64 = 2 bits
• The table to the right (on the original slide) shows the optimal coding scheme
• Question: what if the odds were all equal (1/8)?
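
The same calculation with the entropy function sketched above, using the odds given on the slide:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

horse_odds = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
print(entropy(horse_odds))   # 2.0 bits -> a variable-length code beats 3 fixed bits
print(entropy([1/8] * 8))    # 3.0 bits when all the odds are equal
```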

4. Entropy of Words and Languages
• What is the entropy of a sequence of words?
  H(w1,w2,…,wn) = −∑ p(w1n) log p(w1n), summed over all sequences w1n ∈ L
  where p(w1n) is the probability of the n-word sequence w1n = w1…wn
• What is the per-word entropy of a word appearing in an n-word sequence?
  H(w1n)/n = −1/n ∑ p(w1n) log p(w1n)
• What is the entropy of a language?
  H(L) = lim n→∞ (−1/n ∑ p(w1n) log p(w1n))

5. Cross Entropy
• We want to know the entropy of a language L but don't know its true distribution p
• We model L by an approximation m to its probability distribution
• We take sequences of words, phonemes, etc. from the real language but use the following formula:
  H(p,m) = lim n→∞ −1/n log m(w1,w2,…,wn)
• Cross entropy is always an upper bound on the actual entropy of the language
• Example:
  • Trigram model trained on 583 million words of English
  • Test corpus of 1,014,312 tokens
  • Character entropy computed based on the tri-gram grammar
  • Result: 1.75 bits per character
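
A minimal sketch (not from the slides) of estimating per-word cross entropy on a test sequence; the per-word probabilities below are hypothetical values assigned by some model m:

```python
import math

def cross_entropy_per_word(model_probs):
    """Approximate H(p, m) on a test sequence: -1/n * sum(lg m(w_i))."""
    n = len(model_probs)
    return -sum(math.log2(p) for p in model_probs) / n

# Hypothetical probabilities a model m assigns to each word of a 4-word test sentence.
print(cross_entropy_per_word([0.1, 0.02, 0.3, 0.05]))   # bits per word
```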

6. Probability Chain Rule
• Conditional probability: P(A1,A2) = P(A1) · P(A2|A1)
• The chain rule generalizes to multiple events:
  P(A1,…,An) = P(A1) P(A2|A1) P(A3|A1,A2) … P(An|A1…An-1)
• Examples:
  • P(the dog) = P(the) P(dog | the)
  • P(the dog bites) = P(the) P(dog | the) P(bites | the dog)
• Conditional probabilities work better than individual relative word frequencies because they consider the context
  • Dog may be a relatively rare word in a corpus
  • But if we see barking, P(dog | barking) is much more likely
• In general, the probability of a complete string of words w1…wn is:
  P(w1n) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wn|w1…wn-1)
• Detecting likely word sequences using probabilities
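
An illustrative sketch of the chain-rule decomposition for the slide's example sentence; the conditional probabilities are made-up values for illustration only:

```python
# Chain rule: P(w1..wn) = P(w1) * P(w2|w1) * ... * P(wn|w1..wn-1).
# The probabilities below are hypothetical, not taken from any corpus.
p_the = 0.06                      # P(the)
p_dog_given_the = 0.0005          # P(dog | the)
p_bites_given_the_dog = 0.01      # P(bites | the dog)

p_sentence = p_the * p_dog_given_the * p_bites_given_the_dog
print(f"P('the dog bites') ~ {p_sentence:.2e}")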

7. Counts
• What's the probability of "canine"?
• What's the probability of "canine tooth", i.e. P(tooth | canine)?
• What's the probability of "canine companion"?
• P(tooth | canine) = P(canine & tooth) / P(canine)
• Sometimes we can use counts to deduce probabilities
• Example, according to Google:
  • canine occurs 1,750,000 times
  • canine tooth occurs 6,280 times
  • P(tooth | canine) ≈ 6280/1750000 ≈ .0036
  • P(companion | canine) ≈ .01
  • So companion is the more likely next word after canine
• Detecting likely word sequences using counts/table look-up
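
A small sketch of estimating the conditional probability from the counts quoted above:

```python
# Estimate P(next | "canine") from raw co-occurrence counts (figures quoted on the slide).
count_canine = 1_750_000
count_canine_tooth = 6_280

p_tooth_given_canine = count_canine_tooth / count_canine
print(f"P(tooth | canine) ~ {p_tooth_given_canine:.4f}")   # ~0.0036
# Against P(companion | canine) ~ 0.01, "companion" is the likelier continuation.
```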

8. Single Word Probabilities
Candidate words w for the observed pronunciation O = [ni], with likelihoods and priors:

Word   P(O|w)   P(w)       P(O|w)P(w)
new    .36      .001       .00036
neat   .52      .00013     .000068
need   .11      .00056     .000062
knee   1.00     .000024    .000024

• Single word probability
  • Limitation: ignores context
  • We might need to factor in the surrounding words
  • Use P(need | I) instead of just P(need)
  • Note: P(new | I) < P(need | I)
• Compute the likelihood P([ni] | w), then multiply by the prior:
  P([ni] | new) P(new), P([ni] | neat) P(neat), P([ni] | need) P(need), P([ni] | knee) P(knee)
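
A sketch of picking the most probable word for the observed pronunciation by maximizing P(O|w)·P(w), using the numbers from the table:

```python
# P(O|w): likelihood of hearing [ni] given the word; P(w): unigram prior (values from the table).
candidates = {
    "new":  (0.36, 0.001),
    "neat": (0.52, 0.00013),
    "need": (0.11, 0.00056),
    "knee": (1.00, 0.000024),
}

best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
print(best)   # "new" wins on P(O|w)*P(w) alone; context (e.g. P(need|I)) can change the ranking
```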

9. Word Prediction Approaches: Simple vs. Smart
• Simple: every word follows every other word with equal probability (0-gram)
  • Assume |V| is the size of the vocabulary
  • Likelihood of a sentence S of length n is 1/|V| × 1/|V| × … × 1/|V| (n times)
  • If English has 100,000 words, the probability of each next word is 1/100,000 = .00001
• Smarter: the probability of each next word is related to its word frequency
  • Likelihood of sentence S = P(w1) × P(w2) × … × P(wn)
  • Assumes the probability of each word is independent of the probabilities of the other words
• Even smarter: look at the probability given the previous words
  • Likelihood of sentence S = P(w1) × P(w2|w1) × … × P(wn|wn-1)
  • Assumes the probability of each word depends on the probabilities of the preceding words
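
A sketch contrasting the three scoring schemes on a toy corpus; the corpus and test sentence are made up for illustration:

```python
from collections import Counter

corpus = "the dog barks the dog bites the cat sleeps".split()
sentence = ["the", "dog", "bites"]

vocab = set(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n_tokens = len(corpus)

# 0-gram: every word equally likely.
p_zerogram = (1 / len(vocab)) ** len(sentence)

# Unigram: product of relative frequencies.
p_unigram = 1.0
for w in sentence:
    p_unigram *= unigrams[w] / n_tokens

# Bigram: P(w1) * P(w2|w1) * P(w3|w2), using maximum-likelihood estimates.
p_bigram = unigrams[sentence[0]] / n_tokens
for prev, w in zip(sentence, sentence[1:]):
    p_bigram *= bigrams[(prev, w)] / unigrams[prev]

print(p_zerogram, p_unigram, p_bigram)
```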

10. Common Spelling Errors
Spell checking without considering context will fail on errors like these (each misspelling is itself a valid word):
• They are leaving in about fifteen minuets
• The study was conducted manly be John Black.
• The design an construction of the system will take more than a year.
• Hopefully, all with continue smoothly in my absence.
• Can they lave him my messages?
• I need to notified the bank of….
• He is trying to fine out.
Difficulty: detecting grammatical errors or nonsensical expressions

11. N-grams
How many previous words should we consider?
• 0-gram: every word's likelihood is equal
  • Each word of a 300,000-word vocabulary has probability 1/300,000 ≈ .0000033
• Uni-gram: a word's likelihood depends on its frequency count
  • The occurs 69,971 times in the Brown corpus of 1,000,000 words
• Bi-gram: word likelihood is determined by the previous word
  • P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
  • The appears with frequency .07 and rabbit with frequency .00001, yet rabbit is a more likely word than the to follow the word white
• Tri-gram: word likelihood is determined by the previous two words
  • P(wn | wn-2 wn-1) = C(wn-2 wn-1 wn) / C(wn-2 wn-1)
• Question: how many previous words should we consider?
  • Test: generate random sentences from Shakespeare
  • Result: trigram sentences start looking like those of Shakespeare
  • Tradeoffs: computational overhead and memory requirements
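
A sketch of the "generate random sentences" test using a bigram model; the training text here is a tiny stand-in for the Shakespeare corpus:

```python
import random
from collections import Counter, defaultdict

text = "to be or not to be that is the question".split()  # stand-in corpus

# Build bigram counts and turn them into next-word distributions.
successors = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    successors[prev][nxt] += 1

def generate(start, length=8):
    words = [start]
    for _ in range(length - 1):
        nxt = successors.get(words[-1])
        if not nxt:
            break
        choices, weights = zip(*nxt.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("to"))
```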

12. The Sparse Data Problem
• Definitions
  • Maximum likelihood: finding the most probable sequence of tokens based on the context of the input
  • N-gram sequence: a sequence of n words whose context speech algorithms consider
  • Training data: a set of probabilities computed from a corpus of text data
  • Sparse data problem: how should algorithms handle n-grams that have very low probabilities?
• Data sparseness is a frequently occurring problem; algorithms will make incorrect decisions if it is not handled
• Problem 1: low-frequency n-grams
  • Assume n-gram x occurs twice and n-gram y occurs once
  • Is x really twice as likely to occur as y?
• Problem 2: zero counts
  • Probabilities compute to zero for n-grams not seen in the corpus
  • If n-gram y does not occur, should its probability be zero?

13. Smoothing
• Corpus: a collection of written or spoken material in machine-readable form
• Smoothing: an algorithm that redistributes the probability mass
• Discounting: reduces the probabilities of n-grams with non-zero counts to accommodate the n-grams with zero counts (those unseen in the corpus)

14. Add-One Smoothing
• The naïve smoothing technique
• Add one to the count of all seen and unseen n-grams
• Add the total increased count to the probability mass
• Example: uni-grams
  • Un-smoothed probability for word w: P(w) = c(w)/N
  • Add-one revised probability for word w: P+1(w) = (c(w) + 1)/(N + V)
• N = number of words encountered, V = vocabulary size, c(w) = number of times word w was encountered
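
A minimal sketch of add-one smoothing for unigrams, following the formulas above (toy corpus for illustration):

```python
from collections import Counter

corpus = "the cat sat on the mat".split()
counts = Counter(corpus)
N = len(corpus)                       # number of words encountered
V = len(counts) + 1                   # vocabulary size; the +1 stands in for one unseen word

def p_mle(w):
    return counts[w] / N              # un-smoothed: c(w)/N (zero for unseen words)

def p_add_one(w):
    return (counts[w] + 1) / (N + V)  # add-one: (c(w)+1)/(N+V)

print(p_mle("dog"), p_add_one("dog"))   # 0.0 vs a small non-zero probability
print(p_mle("the"), p_add_one("the"))   # common words give up a little mass
```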

15. Add-One Smoothing Example
• Note: this example assumes bi-gram counts and a vocabulary of V = 1616 words
• Note: each row entry gives the number of times the word in the column precedes the word on the left, or starts a sentence
• Un-smoothed: P(wn|wn-1) = C(wn-1 wn) / C(wn-1)
• Add-one: P+1(wn|wn-1) = [C(wn-1 wn) + 1] / [C(wn-1) + V]
• Note: C(I)=3437, C(want)=1215, C(to)=3256, C(eat)=938, C(Chinese)=213, C(food)=1506, C(lunch)=459
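
A sketch of the add-one bi-gram calculation using the unigram counts quoted above; the bigram count C(I want) is a hypothetical placeholder, since the full count table is not reproduced in this transcript:

```python
V = 1616                 # vocabulary size from the slide
c_I = 3437               # C(I), from the slide
c_I_want = 800           # hypothetical C(I want); the actual table is not in the transcript

p_mle = c_I_want / c_I                   # P(want | I)
p_add_one = (c_I_want + 1) / (c_I + V)   # P+1(want | I)
print(p_mle, p_add_one)
```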

16. Add-One Discounting
• Original counts: c(wi,wi-1)
• Revised counts: c′(wi,wi-1) = (c(wi,wi-1) + 1) × N/(N + V), where N = c(wi-1) and V = vocabulary size = 1616
• Note: high counts are reduced by approximately a third in this example
• Note: low counts get larger

17. Evaluation of Add-One Smoothing
• Advantage:
  • Simple technique to implement and understand
• Disadvantages:
  • Too much probability mass moves to the unseen n-grams
  • Underestimates the probabilities of the common n-grams
  • Overestimates the probabilities of rare (or unseen) n-grams
  • Relative smoothing of all unseen n-grams is the same
  • Relative smoothing of rare n-grams is still incorrect
• Alternative:
  • Use a smaller add value
  • Disadvantage: does not fully solve the problem

18. Unigram Witten-Bell Discounting
Add probability mass to un-encountered words; discount the rest
• Compute the probability of a first-time encounter of a new word
  • Note: every one of the O observed words had a first encounter
  • How many unseen words? U = V − O
  • What is the probability of encountering a new word?
  • Answer: P(any newly encountered word) = O/(V + O)
• Divide this probability equally across all unobserved words
  • P(any specific newly encountered word) = 1/U × O/(V + O)
  • Adjusted count = V × 1/U × O/(V + O)
• Discount each encountered word wi to preserve probability space
  • Probability from counti/V to counti/(V + O)
  • Discounted counts from counti to counti × V/(V + O)
• O = observed words, U = words never seen, V = corpus vocabulary words
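
A sketch of the unigram scheme exactly as laid out above, on toy counts; the number of never-seen words U is supplied as an assumption, since it depends on the full vocabulary:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat".split()
counts = Counter(corpus)

V = sum(counts.values())   # total observed count (what the slide calls V)
O = len(counts)            # number of distinct observed words
U = 5                      # assumed number of words never seen

p_new_total = O / (V + O)              # mass reserved for all first-time encounters
p_each_unseen = p_new_total / U        # spread equally over the U unseen words

def p_witten_bell(w):
    if w in counts:
        return counts[w] / (V + O)     # discounted from counts[w]/V
    return p_each_unseen

print(p_witten_bell("the"), p_witten_bell("zebra"))
# The discounted seen probabilities plus U * p_each_unseen still sum to 1.
```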

19. Bi-gram Witten-Bell Discounting
Add probability mass to un-encountered bi-grams; discount the rest
• Consider the bi-gram wn-1 wn
  • O(wn-1) = number of uniquely observed bi-grams starting with wn-1
  • V(wn-1) = count of bi-grams starting with wn-1
  • U(wn-1) = number of un-observed bi-grams starting with wn-1
• Compute the probability of a new bi-gram starting with wn-1
  • Answer: P(any newly encountered bi-gram) = O(wn-1)/(V(wn-1) + O(wn-1))
  • Note: we observed O(wn-1) first encounters in V(wn-1) + O(wn-1) events
  • Note: an event is either a bi-gram occurrence or a first-time encounter
• Divide this probability equally among all unseen bi-grams starting with wn-1
  • Adjusted P(new bi-gram starting with wn-1) = 1/U(wn-1) × O(wn-1)/(V(wn-1) + O(wn-1))
  • Adjusted count = V(wn-1) × 1/U(wn-1) × O(wn-1)/(V(wn-1) + O(wn-1))
• Discount observed bi-grams starting with wn-1 to preserve probability space
  • Probability from c(wn-1 wn)/V(wn-1) to c(wn-1 wn)/(V(wn-1) + O(wn-1))
  • Counts from c(wn-1 wn) to c(wn-1 wn) × V(wn-1)/(V(wn-1) + O(wn-1))
• O = observed bi-grams, U = bi-grams never seen, V = corpus bi-gram counts
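
A sketch of the per-history bi-gram version, following the formulas above; the closed vocabulary (and hence U) is an assumption of the toy example:

```python
from collections import Counter, defaultdict

corpus = "i want to eat i want chinese food i want to eat lunch".split()
vocab = set(corpus)                    # assumed closed vocabulary for the toy example

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def p_wb(w, prev):
    seen = bigram_counts[prev]
    V = sum(seen.values())             # V(wn-1): count of bi-grams starting with prev
    O = len(seen)                      # O(wn-1): distinct observed continuations
    U = len(vocab) - O                 # U(wn-1): continuations never seen
    if w in seen:
        return seen[w] / (V + O)       # discounted probability for an observed bi-gram
    return O / (U * (V + O))           # equal share of the reserved new-bigram mass

print(p_wb("to", "want"), p_wb("lunch", "want"))
```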

20. Witten-Bell Smoothing: Adjusted Counts
Note: the V, O and U values refer to wn-1 and are given on the next slide
• Original counts: c(wn, wn-1)
• Adjusted add-one counts (slide 16): c′(wn, wn-1) = (c(wn, wn-1) + 1) × c(wn-1)/(c(wn-1) + vocabulary size)
• Adjusted Witten-Bell counts (slide 19):
  c′(wn, wn-1) = (O/U) × V/(V + O) if c(wn, wn-1) = 0
  c′(wn, wn-1) = c(wn, wn-1) × V/(V + O) otherwise

21. Bi-gram Counts for the Example
• O(wn-1) = number of uniquely observed bi-grams starting with wn-1
• V(wn-1) = count of bi-grams starting with wn-1
• U(wn-1) = number of un-observed bi-grams starting with wn-1

22. Evaluation of Witten-Bell
• Uses the probabilities of already-encountered n-grams to compute probabilities for unseen n-grams
• Has a smaller impact on the probabilities of already-encountered n-grams
• Generally computes reasonable probabilities

23. Back-off Discounting
• The general concept
  • Consider the trigram (wn-2, wn-1, wn)
  • If c(wn-2, wn-1) = 0, consider the 'back-off' bi-gram (wn-1, wn)
  • If c(wn-1) = 0, consider the 'back-off' unigram wn
• The goal is to use a hierarchy of approximations
  • trigram > bigram > unigram
  • Degrade gracefully when higher-level grams don't exist
• Given a word sequence fragment wn-2 wn-1 wn …, utilize the following preference rule
  1. p(wn | wn-2 wn-1) if c(wn-2 wn-1 wn) ≠ 0
  2. α1 p(wn | wn-1) if c(wn-1 wn) ≠ 0
  3. α2 p(wn)
• Note: α1 and α2 are values carefully computed to preserve probability mass
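
A minimal sketch of the preference rule; the back-off weights are supplied as assumed constants rather than the carefully computed values the slide calls for:

```python
from collections import Counter

tokens = "i want to eat i want to drink i need to eat".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
N = len(tokens)

ALPHA1, ALPHA2 = 0.4, 0.4   # assumed weights; real ones must preserve probability mass

def p_backoff(w, w1, w2):
    """Prefer the trigram estimate, then the weighted bigram, then the weighted unigram."""
    if trigrams[(w1, w2, w)] > 0:
        return trigrams[(w1, w2, w)] / bigrams[(w1, w2)]
    if bigrams[(w2, w)] > 0:
        return ALPHA1 * bigrams[(w2, w)] / unigrams[w2]
    return ALPHA2 * unigrams[w] / N

print(p_backoff("eat", "want", "to"))    # trigram "want to eat" was seen
print(p_backoff("sleep", "want", "to"))  # falls back to the unigram estimate (0 here: unseen word)
```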

24. N-grams for Spell Checks
• Non-word error detection (easiest)
  • Example: graffe => (giraffe)
• Isolated-word (context-free) error correction
  • By definition, cannot correct an error when the mistyped word is itself a valid word
• Context-dependent error correction (hardest)
  • Example: your an idiot => you're an idiot
  • Needed when the mis-typed word happens to be a real word
  • Such real-word errors account for 15% of spelling errors (Peterson, 1986) to 25%-40% (Kukich, 1992)
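
A sketch of using bi-gram probabilities to flag a real-word error such as "your an idiot"; the counts are hypothetical stand-ins for values a checker would pull from a large corpus:

```python
# Hypothetical bigram and unigram counts from a large corpus.
bigram_counts = {("your", "an"): 20, ("you're", "an"): 9_500}
unigram_counts = {"your": 1_200_000, "you're": 450_000}

def p_next(prev, w):
    return bigram_counts.get((prev, w), 0) / unigram_counts[prev]

# Compare the written word against a confusable alternative in context.
for candidate in ("your", "you're"):
    print(candidate, p_next(candidate, "an"))
# "you're an" scores far higher, so the checker suggests replacing "your" with "you're".
```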
