Better Recognition by Manipulation of ASR Results
Generic concepts for post-computation recognizer result components
Emmett Coin, Industrial Poet
ejTalk, Inc. | www.ejTalk.com
Who? • Emmett Coin • Industrial Poet • Rugged solutions via compact and elegant techniques • Focused on creating more powerful and richer dialog methods • ejTalk • Frontiers of Human-Computer conversation • What does it take to “talk with the machine”? • Can we make it meta?
What this talk is about • How applications typically use the recognition result • Why accuracy is not that important, BUT error rate is. • How some generic techniques can sometimes help reduce the effective recognition error rate.
How do most apps deal with recognition? • Specify a grammar (CFG or SLM) • Specify a level of “confidence” • Wait for the recognizer to decide what happens (no result, bad, good) • Use the first n-best result when it is “good” • Leave all the errors and uncertainties to the dialog management level (a minimal sketch of this pattern follows)
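A minimal sketch of that typical pattern, assuming a hypothetical `RecResult`/`Hypothesis` shape and an arbitrary 0.45 threshold; this is not any particular vendor's API:

```python
# Sketch of how a typical app consumes a recognizer result.
# The data shapes and the threshold are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    text: str          # decoded word string
    confidence: float  # vendor score, usually 0.0..1.0

@dataclass
class RecResult:
    hypotheses: list = field(default_factory=list)  # n-best, best first

CONFIDENCE_THRESHOLD = 0.45  # arbitrary example value

def handle_result(result: RecResult) -> str | None:
    """Return the accepted text, or None to punt to dialog management."""
    if not result.hypotheses:
        return None                        # "no result"
    best = result.hypotheses[0]            # only the 1st n-best is consulted
    if best.confidence < CONFIDENCE_THRESHOLD:
        return None                        # "bad" -- reject
    return best.text                       # "good" -- accept blindly
```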
Accuracy: confusing concept • 95% accuracy is good, 97% is a little better … or is it? • Think of roofing a house: nobody praises the shingles that don’t leak. • Do people accurately perceive the ratio of “correct” vs. “incorrect” recognition? • Users hardly notice when you “get it right”. They expect it. • When you get it wrong…
Confidence: What is it? • A sort of “closeness” of fit • Acoustic scores • How well it matches the expected sounds • Language model scores • How much work it took to find the phrase • A splash of recognizer vendor voodoo • How voice-like, admix of noise, etc. • All mixed together and reformed as a number between 0.0 and 1.0 (usually)
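The exact recipe is proprietary and vendor-specific; the sketch below only illustrates the general idea of blending acoustic and language-model scores and squashing them onto (0, 1). The weights, the noise penalty, and the logistic squash are all assumptions:

```python
import math

def blend_confidence(acoustic_logprob: float,
                     lm_logprob: float,
                     noise_penalty: float = 0.0,
                     w_am: float = 0.7,
                     w_lm: float = 0.3) -> float:
    """Illustrative blend of per-utterance log scores into one 0..1 value.
    Real engines use undisclosed recipes; this is only the shape of the idea."""
    raw = w_am * acoustic_logprob + w_lm * lm_logprob - noise_penalty
    return 1.0 / (1.0 + math.exp(-raw))  # squash onto (0, 1)
```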
Confidence: How good is it? • Does it correlate with how a human would rank things? • Does it behave consistently? • long vs. short utterances? • Different word groups? • What happens when you rely on it?
Can we add more to the model? • We already use • Sounds – the Acoustic Model (AM) • Words – the Language Model (LM) • We can add • Meaning – the Semantic Model (SM) • “Re-thinking” results against that richer model (the basis of the strategies that follow)
Strategies that humans use • Rejection • Don’t hear repeated wrong utterances • Also called “skip lists” • Acceptance • Intentionally allowing only the likely utterances • Aka “pass lists” • Anticipation • Asking a question where the answer is known • Sometimes called “hints”
Rejection (skip) • People and computers should not make the same mistake twice. • Keep a list of confirmed mis-recs • Remove those from the next recognition’s n-best list (sketched below) • But, beware the dark side… • …the Chinese finger trap: if a skip entry is itself a mis-rec, you can lock out the truth. • Remember: knowing what to reject is based on recognition too!
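A minimal sketch of a skip-list filter over an n-best list of decoded strings (the list-of-strings shape is an assumption):

```python
def apply_skip_list(nbest: list[str], skip_list: set[str]) -> list[str]:
    """Drop hypotheses already confirmed as mis-recognitions.
    nbest is a list of decoded strings, best first; skip entries are
    stored lowercased. Expiring entries after a few turns is one way
    to blunt the finger-trap risk noted above."""
    return [hyp for hyp in nbest if hyp.lower() not in skip_list]

# Example: "Austin" was confirmed wrong on the previous turn.
skip = {"austin"}
print(apply_skip_list(["Austin", "Boston", "Houston"], skip))
# -> ['Boston', 'Houston']
```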
Acceptance (pass) • It is possible to specify the relative weights in the language model (grammar). • But there is a danger: it is a little like cutting down a chair’s legs to make it level. Hasty modifications will have unintended interactions. • Another way is to create a sieve (sketched below). • This has the advantage of not changing the balance of the model. The parts that do not pass the sieve become a de facto garbage collector.
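A sieve can be as simple as intersecting the n-best with the pass list; the grammar weights stay untouched, which is the whole point. A sketch, assuming text-only n-best entries:

```python
def apply_pass_sieve(nbest: list[str], pass_list: set[str]) -> list[str]:
    """Keep only hypotheses on the pass list. Everything else falls
    through and acts as a de facto garbage collector. The language
    model itself is never modified."""
    return [hyp for hyp in nbest if hyp.lower() in pass_list]

# An empty return means nothing survived the sieve -- treat it as a
# rejection upstream rather than forcing a match.
print(apply_pass_sieve(["Boston", "Austin"], {"boston", "houston"}))
# -> ['Boston']
```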
Anticipation • Explicit • e.g. confirming identity, amounts, etc. • Probabilistic • Dialogs are journeys • Some parts of the route are routine, predictable
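One way to realize hints without touching the grammar is to re-rank the n-best, nudging anticipated answers upward. A sketch with an arbitrary additive boost (the (text, score) tuple shape is an assumption):

```python
def apply_hints(nbest: list[tuple[str, float]],
                hints: set[str],
                boost: float = 0.15) -> list[tuple[str, float]]:
    """Re-rank (text, score) pairs, boosting anticipated answers.
    The additive boost value is an arbitrary example; hints are
    stored lowercased."""
    rescored = [(text, score + boost if text.lower() in hints else score)
                for text, score in nbest]
    return sorted(rescored, key=lambda p: p[1], reverse=True)

# Example: dialog context says a yes/no answer is likely.
print(apply_hints([("Oslo", 0.52), ("no", 0.48)], {"yes", "no"}))
# -> [('no', 0.63), ('Oslo', 0.52)]
```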
What should we disregard? • When is a recognition event truly the human talking to the computer? • The human is speaking • But not to the computer • But saying the wrong thing • Some human is saying something • Other noise • Car horn, mic bump, radio music, etc. • As dialogs get longer we need to politely ignore what we were not intended to respond to
In and Out of Grammar (OOG) • The recognizer returned some text • Was it really what was said? • Can we improve over the “confidence”? • Look at the “scores” of the n-best • Use them as a “feature space” • Use example waves to discover clusters in feature space that correlate with in-grammar and out-of-grammar utterances (sketched below)
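One possible realization: reduce each n-best score column to a small feature vector, then make a nearest-centroid call against centroids learned offline from labeled example waves. The feature choice and the classifier are illustrative assumptions, not the only way to mine this space:

```python
def score_features(nbest_scores: list[float]) -> tuple[float, float, float]:
    """Turn an n-best score column into a small feature vector:
    top score, gap to the runner-up, and spread across the list."""
    top = nbest_scores[0]
    gap = top - nbest_scores[1] if len(nbest_scores) > 1 else top
    spread = top - nbest_scores[-1]
    return (top, gap, spread)

def centroid(rows: list[tuple[float, ...]]) -> tuple[float, ...]:
    """Mean of labeled feature vectors, computed offline from example waves."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def classify_oog(features, in_centroid, out_centroid) -> str:
    """Nearest-centroid call: 'in' vs 'out' of grammar."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return ("in" if dist(features, in_centroid) <= dist(features, out_centroid)
            else "out")
```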
Where do we put it? • Where does all this heuristic post analysis go? Out in the dialog? • How can we minimize the cognitive load on the application developer? • We need to wrap up all this extra functionality inside a new container to hide the extra complexity
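One shape such a container could take, folding the rejection, acceptance, and anticipation heuristics behind the single listen() call the application already makes. The `recognize(audio, grammar)` method on the wrapped engine is an assumed interface:

```python
class SmartListener:
    """Hypothetical wrapper that hides the post-recognition heuristics.
    The app still just sees an n-best list of strings."""

    def __init__(self, recognizer):
        self.recognizer = recognizer             # assumed: .recognize(audio, grammar) -> list[str]
        self.skip_list: set[str] = set()         # confirmed mis-recs
        self.pass_list: set[str] | None = None   # optional sieve
        self.hints: set[str] = set()             # anticipated answers

    def listen(self, audio, grammar) -> list[str]:
        nbest = self.recognizer.recognize(audio, grammar)
        # rejection: drop confirmed mis-recognitions
        nbest = [h for h in nbest if h.lower() not in self.skip_list]
        # acceptance: apply the sieve when one is active
        if self.pass_list is not None:
            nbest = [h for h in nbest if h.lower() in self.pass_list]
        # anticipation: float hinted answers to the front (stable sort)
        nbest.sort(key=lambda h: h.lower() not in self.hints)
        return nbest
```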
Re-listening • If an utterance is going to be rejected, then try again (re-listen to the same wave). • If you can infer a smaller scope, then listen with a grammar that “leans” that way. • Merge the n-bests via some heuristic • Re-think the combined utterance to see if it can now be considered “good and in grammar”
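A sketch of the re-listen loop, assuming a `recognize(wave, grammar)` callable that returns (text, score) pairs; the 0.5 threshold and the score-summing merge are arbitrary example heuristics:

```python
def relisten(recognize, wave, wide_grammar, narrow_grammar,
             threshold: float = 0.5):
    """Re-listen to the SAME captured wave with a grammar that 'leans'
    toward the inferred scope, then combine the two n-best lists."""
    first = recognize(wave, wide_grammar)
    if first and first[0][1] >= threshold:
        return first                      # already "good and in grammar"
    second = recognize(wave, narrow_grammar)
    # simple merge heuristic: hypotheses that appear in both lists
    # accumulate score and rise to the top
    combined: dict[str, float] = {}
    for text, score in first + second:
        combined[text] = combined.get(text, 0.0) + score
    return sorted(combined.items(), key=lambda p: p[1], reverse=True)
```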
Serial Listening • The last utterance is not “good enough” • Prompt for a repeat and listen again (live audio from the user) • If it is “good” by itself, then use it • Otherwise, heuristically merge the n-bests based on similarities • Re-think the combined utterance to see if it can now be considered “good and in grammar”
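For merging across two separate attempts, one plausible heuristic rewards hypotheses that nearly match across both n-best lists; the similarity threshold and the fallback are illustrative choices:

```python
from difflib import SequenceMatcher

def merge_serial(nbest_a: list[tuple[str, float]],
                 nbest_b: list[tuple[str, float]],
                 min_similarity: float = 0.8) -> list[tuple[str, float]]:
    """Merge n-best lists from two separate user attempts. Hypotheses
    that nearly match across attempts accumulate weighted score."""
    merged = []
    for text_a, score_a in nbest_a:
        for text_b, score_b in nbest_b:
            sim = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
            if sim >= min_similarity:
                merged.append((text_a, score_a + score_b * sim))
    # if nothing agrees, fall back to the fresher second attempt
    return sorted(merged, key=lambda p: p[1], reverse=True) or nbest_b
```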
Parallel Listening • Listen on two recognizers • One with the narrow “expectation” grammar • The other with the wide “possible” grammar • If the utterance is in both results, process the “expectation” result • If not, process the “possible” results
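A sketch of the arbitration logic, shown with one `recognize` callable invoked twice for simplicity (two genuinely parallel engines would be arbitrated the same way):

```python
def parallel_listen(recognize, wave, expectation_grammar, possible_grammar):
    """Run the same audio against a narrow and a wide grammar. Prefer
    the narrow 'expectation' result when the wide pass corroborates it.
    recognize(wave, grammar) -> [(text, score), ...] is an assumed API."""
    narrow = recognize(wave, expectation_grammar)
    wide = recognize(wave, possible_grammar)
    wide_texts = {text.lower() for text, _ in wide}
    for text, score in narrow:
        if text.lower() in wide_texts:
            return [(text, score)]        # corroborated: use the expectation
    return wide                           # otherwise trust the wide pass
```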
Conclusions • Error rate is the metric to watch • There is more information in the recognition result than the first good n-best • Putting conventional recognition inside a heuristic “box” makes sense • The information needed by the “box” is a logical extension of the listening context
Thank you Emmett Coin ejTalk, Inc emmett@ejTalk.com