Better Recognition by Manipulation of ASR Results
Generic concepts for post-computation recognizer result components
Emmett Coin, Industrial Poet
ejTalk, Inc. | www.ejTalk.com
Who? • Emmett Coin • Industrial Poet • Rugged solutions via compact and elegant techniques • Focused on creating more powerful and richer dialog methods • ejTalk • Frontiers of Human-Computer conversation • What does it take to “talk with the machine”? • Can we make it meta?
What this talk is about • How applications typically use the recognition result • Why accuracy is not that important, BUT error rate is. • How some generic techniques can sometimes help reduce the effective recognition error rate.
How do most apps deal with recognition? • Specify a grammar (CFG or SLM) • Specify a level of “confidence” • Wait for the recognizer to decide what happens (no result, bad, good) • Use the first n-best result when it is “good” • Leave all the errors and uncertainties to the dialog management level (a minimal sketch of this pattern follows)
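A minimal sketch of that typical pattern, assuming a hypothetical `RecResult`/`Hypothesis` shape and an arbitrary 0.45 threshold; this is not any particular vendor's API:

```python
# Sketch of how a typical app consumes a recognizer result.
# The data shapes and the threshold are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    text: str          # decoded word string
    confidence: float  # vendor score, usually 0.0..1.0

@dataclass
class RecResult:
    hypotheses: list = field(default_factory=list)  # n-best, best first

CONFIDENCE_THRESHOLD = 0.45  # arbitrary example value

def handle_result(result: RecResult) -> str | None:
    """Return the accepted text, or None to punt to dialog management."""
    if not result.hypotheses:
        return None                        # "no result"
    best = result.hypotheses[0]            # only the 1st n-best is consulted
    if best.confidence < CONFIDENCE_THRESHOLD:
        return None                        # "bad" -- reject
    return best.text                       # "good" -- accept blindly
```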
Accuracy: confusing concept • 95% accuracy is good, 97% is a little better … or is it? • Think of roofing a house: nobody praises the shingles that don’t leak. • Do people accurately perceive the ratio of “correct” vs. “incorrect” recognition? • Users hardly notice when you “get it right”. They expect it. • When you get it wrong…
Confidence: What is it? • A sort of “closeness” of fit • Acoustic scores • How well it matches the expected sounds • Language model scores • How much work it took to find the phrase • A splash of recognizer vendor voodoo • How voice-like, admix of noise, etc. • All mixed together and reformed as a number between 0.0 and 1.0 (usually)
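The exact recipe is proprietary and vendor-specific; the sketch below only illustrates the general idea of blending acoustic and language-model scores and squashing them onto (0, 1). The weights, the noise penalty, and the logistic squash are all assumptions:

```python
import math

def blend_confidence(acoustic_logprob: float,
                     lm_logprob: float,
                     noise_penalty: float = 0.0,
                     w_am: float = 0.7,
                     w_lm: float = 0.3) -> float:
    """Illustrative blend of per-utterance log scores into one 0..1 value.
    Real engines use undisclosed recipes; this is only the shape of the idea."""
    raw = w_am * acoustic_logprob + w_lm * lm_logprob - noise_penalty
    return 1.0 / (1.0 + math.exp(-raw))  # squash onto (0, 1)
```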
Confidence: How good is it? • Does it correlate with how a human would rank things? • Does it behave consistently? • long vs. short utterances? • Different word groups? • What happens when you rely on it?
Can we add more to the model? • We already use • Sounds – the Acoustic Model (AM) • Words – the Language Model (LM) • We can add • Meaning – the Semantic Model (SM) • “Re-thinking” results against that richer model (the basis of the strategies that follow)
Strategies that humans use • Rejection • Don’t hear repeated wrong utterances • Also called “skip lists” • Acceptance • Intentionally allowing only the likely utterances • Aka “pass lists” • Anticipation • Asking a question where the answer is known • Sometimes called “hints”
Rejection (skip) • People and computers should not make the same mistake twice. • Keep a list of confirmed mis-recs • Remove those from the next recognition’s n-best list (sketched below) • But, beware the dark side… • …the Chinese finger trap: if a skip entry is itself a mis-rec, you can lock out the truth. • Remember: knowing what to reject is based on recognition too!
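A minimal sketch of a skip-list filter over an n-best list of decoded strings (the list-of-strings shape is an assumption):

```python
def apply_skip_list(nbest: list[str], skip_list: set[str]) -> list[str]:
    """Drop hypotheses already confirmed as mis-recognitions.
    nbest is a list of decoded strings, best first; skip entries are
    stored lowercased. Expiring entries after a few turns is one way
    to blunt the finger-trap risk noted above."""
    return [hyp for hyp in nbest if hyp.lower() not in skip_list]

# Example: "Austin" was confirmed wrong on the previous turn.
skip = {"austin"}
print(apply_skip_list(["Austin", "Boston", "Houston"], skip))
# -> ['Boston', 'Houston']
```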
Acceptance (pass) • It is possible to specify the relative weights in the language model (grammar). • But there is a danger: it is a little like cutting down a chair’s legs to make it level. Hasty modifications will have unintended interactions. • Another way is to create a sieve (sketched below). • This has the advantage of not changing the balance of the model. The parts that do not pass the sieve become a de facto garbage collector.
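A sieve can be as simple as intersecting the n-best with the pass list; the grammar weights stay untouched, which is the whole point. A sketch, assuming text-only n-best entries:

```python
def apply_pass_sieve(nbest: list[str], pass_list: set[str]) -> list[str]:
    """Keep only hypotheses on the pass list. Everything else falls
    through and acts as a de facto garbage collector. The language
    model itself is never modified."""
    return [hyp for hyp in nbest if hyp.lower() in pass_list]

# An empty return means nothing survived the sieve -- treat it as a
# rejection upstream rather than forcing a match.
print(apply_pass_sieve(["Boston", "Austin"], {"boston", "houston"}))
# -> ['Boston']
```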
Anticipation • Explicit • e.g. confirming identity, amounts, etc. • Probabilistic • Dialogs are journeys • Some parts of the route are routine, predictable
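One way to realize hints without touching the grammar is to re-rank the n-best, nudging anticipated answers upward. A sketch with an arbitrary additive boost (the (text, score) tuple shape is an assumption):

```python
def apply_hints(nbest: list[tuple[str, float]],
                hints: set[str],
                boost: float = 0.15) -> list[tuple[str, float]]:
    """Re-rank (text, score) pairs, boosting anticipated answers.
    The additive boost value is an arbitrary example; hints are
    stored lowercased."""
    rescored = [(text, score + boost if text.lower() in hints else score)
                for text, score in nbest]
    return sorted(rescored, key=lambda p: p[1], reverse=True)

# Example: dialog context says a yes/no answer is likely.
print(apply_hints([("Oslo", 0.52), ("no", 0.48)], {"yes", "no"}))
# -> [('no', 0.63), ('Oslo', 0.52)]
```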
What should we disregard? • When is a recognition event truly the human talking to the computer? • The human is speaking • But not to the computer • But saying the wrong thing • Some human is saying something • Other noise • Car horn, mic bump, radio music, etc. • As dialogs get longer we need to politely ignore what we were not intended to respond to
In and Out of Grammar (OOG) • The recognizer returned some text • Was it really what was said? • Can we improve over the “confidence”? • Look at the “scores” of the n-best • Use them as a “feature space” • Use example waves to discover clusters in feature space that correlate with in-grammar and out-of-grammar utterances (sketched below)
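One possible realization: reduce each n-best score column to a small feature vector, then make a nearest-centroid call against centroids learned offline from labeled example waves. The feature choice and the classifier are illustrative assumptions, not the only way to mine this space:

```python
def score_features(nbest_scores: list[float]) -> tuple[float, float, float]:
    """Turn an n-best score column into a small feature vector:
    top score, gap to the runner-up, and spread across the list."""
    top = nbest_scores[0]
    gap = top - nbest_scores[1] if len(nbest_scores) > 1 else top
    spread = top - nbest_scores[-1]
    return (top, gap, spread)

def centroid(rows: list[tuple[float, ...]]) -> tuple[float, ...]:
    """Mean of labeled feature vectors, computed offline from example waves."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def classify_oog(features, in_centroid, out_centroid) -> str:
    """Nearest-centroid call: 'in' vs 'out' of grammar."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return ("in" if dist(features, in_centroid) <= dist(features, out_centroid)
            else "out")
```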
Where do we put it? • Where does all this heuristic post analysis go? Out in the dialog? • How can we minimize the cognitive load on the application developer? • We need to wrap up all this extra functionality inside a new container to hide the extra complexity
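One shape such a container could take, folding the rejection, acceptance, and anticipation heuristics behind the single listen() call the application already makes. The `recognize(audio, grammar)` method on the wrapped engine is an assumed interface:

```python
class SmartListener:
    """Hypothetical wrapper that hides the post-recognition heuristics.
    The app still just sees an n-best list of strings."""

    def __init__(self, recognizer):
        self.recognizer = recognizer             # assumed: .recognize(audio, grammar) -> list[str]
        self.skip_list: set[str] = set()         # confirmed mis-recs
        self.pass_list: set[str] | None = None   # optional sieve
        self.hints: set[str] = set()             # anticipated answers

    def listen(self, audio, grammar) -> list[str]:
        nbest = self.recognizer.recognize(audio, grammar)
        # rejection: drop confirmed mis-recognitions
        nbest = [h for h in nbest if h.lower() not in self.skip_list]
        # acceptance: apply the sieve when one is active
        if self.pass_list is not None:
            nbest = [h for h in nbest if h.lower() in self.pass_list]
        # anticipation: float hinted answers to the front (stable sort)
        nbest.sort(key=lambda h: h.lower() not in self.hints)
        return nbest
```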
Re-listening • If an utterance is going to be rejected, then try again (re-listen to the same wave). • If you can infer a smaller scope, then listen with a grammar that “leans” that way. • Merge the n-bests via some heuristic • Re-think the combined utterance to see if it can now be considered “good and in grammar”
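A sketch of the re-listen loop, assuming a `recognize(wave, grammar)` callable that returns (text, score) pairs; the 0.5 threshold and the score-summing merge are arbitrary example heuristics:

```python
def relisten(recognize, wave, wide_grammar, narrow_grammar,
             threshold: float = 0.5):
    """Re-listen to the SAME captured wave with a grammar that 'leans'
    toward the inferred scope, then combine the two n-best lists."""
    first = recognize(wave, wide_grammar)
    if first and first[0][1] >= threshold:
        return first                      # already "good and in grammar"
    second = recognize(wave, narrow_grammar)
    # simple merge heuristic: hypotheses that appear in both lists
    # accumulate score and rise to the top
    combined: dict[str, float] = {}
    for text, score in first + second:
        combined[text] = combined.get(text, 0.0) + score
    return sorted(combined.items(), key=lambda p: p[1], reverse=True)
```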
Serial Listening • The last utterance is not “good enough” • Prompt for a repeat and listen again (live audio from the user) • If it is “good” by itself, then use it • Otherwise, heuristically merge the n-bests based on similarities • Re-think the combined utterance to see if it can now be considered “good and in grammar”
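For merging across two separate attempts, one plausible heuristic rewards hypotheses that nearly match across both n-best lists; the similarity threshold and the fallback are illustrative choices:

```python
from difflib import SequenceMatcher

def merge_serial(nbest_a: list[tuple[str, float]],
                 nbest_b: list[tuple[str, float]],
                 min_similarity: float = 0.8) -> list[tuple[str, float]]:
    """Merge n-best lists from two separate user attempts. Hypotheses
    that nearly match across attempts accumulate weighted score."""
    merged = []
    for text_a, score_a in nbest_a:
        for text_b, score_b in nbest_b:
            sim = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
            if sim >= min_similarity:
                merged.append((text_a, score_a + score_b * sim))
    # if nothing agrees, fall back to the fresher second attempt
    return sorted(merged, key=lambda p: p[1], reverse=True) or nbest_b
```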
Parallel Listening • Listen on two recognizers • One with the narrow “expectation” grammar • The other with the wide “possible” grammar • If the utterance is in both results, process the “expectation” result • If not, process the “possible” results
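A sketch of the arbitration logic, shown with one `recognize` callable invoked twice for simplicity (two genuinely parallel engines would be arbitrated the same way):

```python
def parallel_listen(recognize, wave, expectation_grammar, possible_grammar):
    """Run the same audio against a narrow and a wide grammar. Prefer
    the narrow 'expectation' result when the wide pass corroborates it.
    recognize(wave, grammar) -> [(text, score), ...] is an assumed API."""
    narrow = recognize(wave, expectation_grammar)
    wide = recognize(wave, possible_grammar)
    wide_texts = {text.lower() for text, _ in wide}
    for text, score in narrow:
        if text.lower() in wide_texts:
            return [(text, score)]        # corroborated: use the expectation
    return wide                           # otherwise trust the wide pass
```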
Conclusions • Error rate is the metric to watch • There is more information in the recognition result than the first good n-best • Putting conventional recognition inside a heuristic “box” makes sense • The information needed by the “box” is a logical extension of the listening context
Thank you Emmett Coin ejTalk, Inc emmett@ejTalk.com