Better Recognition by manipulation of ASR results

Generic concepts for post computation recognizer result components.

Emmett Coin

Industrial Poet

ejTalk, Inc. www.ejTalk.com


Who?

  • Emmett Coin

    • Industrial Poet

      • Rugged solutions via compact and elegant techniques

      • Focused on creating more powerful and richer dialog methods

  • ejTalk

    • Frontiers of Human-Computer conversation

      • What does it take to “talk with the machine”?

      • Can we make it meta?


What this talk is about

  • How applications typically use the recognition result

  • Why accuracy is not that important, BUT error rate is.

  • How some generic techniques can sometimes help reduce the effective recognition error rate.


How do most apps deal with recognition?

  • Specify a grammar (CFG or SLM)

  • Specify a level of “confidence”

  • Wait for the recognizer to decide what happens (no result, bad, good)

  • Use the 1st nbest result when it is “good”

  • Leave all the errors and uncertainties to the dialog management level
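
The typical flow above can be sketched as follows. This is a minimal illustration, not any vendor's API: the `Hypothesis` type, the `handle_result` function, and the 0.5 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hypothesis:
    """One nbest entry (hypothetical shape, not a real recognizer API)."""
    text: str
    confidence: float  # assumed to lie in 0.0..1.0


def handle_result(nbest: List[Hypothesis], threshold: float = 0.5) -> Optional[str]:
    """Return the top hypothesis if it clears the threshold, else None.

    Everything below the top entry is discarded, which is the habit
    the rest of this talk questions.
    """
    if not nbest:
        return None  # "no result"
    best = nbest[0]
    if best.confidence < threshold:
        return None  # "bad": left to the dialog manager
    return best.text  # "good"
```

Note that the caller only ever sees one string or nothing; the remaining nbest entries and their scores never reach the dialog.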


Accuracy: confusing concept

  • 95% accuracy is good, 97% is a little better … or is it?

    • Think of roofing a house.

  • Do people accurately perceive the ratio of “correct” vs. “incorrect” recognition?

    • Users hardly notice when you “get it right”. They expect it.

    • When you get it wrong…
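
One way to read the 95% vs. 97% question is to count errors instead of successes. The sketch below (function names are illustrative) shows that a two-point gain in accuracy removes about 40% of the errors, and errors are what users notice.

```python
def error_rate(accuracy: float) -> float:
    """Fraction of utterances that are misrecognized."""
    return 1.0 - accuracy


def relative_error_reduction(old_accuracy: float, new_accuracy: float) -> float:
    """Share of the old errors that the new accuracy removes."""
    old_err = error_rate(old_accuracy)
    new_err = error_rate(new_accuracy)
    return (old_err - new_err) / old_err


# 95% accuracy: about 50 errors per 1000 utterances.
# 97% accuracy: about 30 errors per 1000 utterances.
reduction = relative_error_reduction(0.95, 0.97)  # about 0.40
```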


Confidence: What is it?

  • A sort of “closeness” of fit

    • Acoustic scores

      • How well it matches the expected sounds

    • Language model scores

      • How much work it took to find the phrase

    • A splash of recognizer vendor voodoo

      • How voice-like, admix of noise, etc.

    • All mixed together and reformed as a number between 0.0 and 1.0 (usually)


Confidence: How good is it?

  • Does it correlate with how a human would rank things?

  • Does it behave consistently?

    • long vs. short utterances?

    • Different word groups?

  • What happens when you rely on it?


Can we add more to the model?

  • We already use

    • Sounds – the Acoustic Model (AM)

    • Words – the Language Model (LM)

  • We can add

    • Meaning – the Semantic Model (SM)

    • Rethinking


Strategies that humans use

  • Rejection

    • Don’t hear repeated wrong utterances

      • Also called “skip lists”

  • Acceptance

    • Intentionally allowing only the likely utterances

      • Aka “pass lists”

  • Anticipation

    • Asking a question where the answer is known

      • Sometimes called “hints”


Rejection (skip)

  • People and computers should not make the same mistake twice.

    • Keep a list of confirmed mis-recs

    • Remove those from the next recognition’s nbest list

  • But, beware the dark side ...

    • …the Chinese finger puzzle.

    • Remember: knowing what to reject is based on recognition too!
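
A skip list can be sketched as a filter applied to the next nbest list. The `(text, score)` tuple shape and the function name are assumptions for illustration, not a recognizer API.

```python
from typing import List, Set, Tuple

Nbest = List[Tuple[str, float]]  # (text, score), best first


def apply_skip_list(nbest: Nbest, skip: Set[str]) -> Nbest:
    """Drop hypotheses that were already confirmed as mis-recognitions."""
    return [(text, score) for text, score in nbest if text not in skip]


skip: Set[str] = set()

# First attempt: the system offers "austin" and the user says no.
skip.add("austin")

# Second attempt: "austin" tops the list again, but is filtered out.
retry: Nbest = [("austin", 0.60), ("boston", 0.59)]
filtered = apply_skip_list(retry, skip)
```

Because the rejection itself was recognized (the user's “no” could be a mis-rec), a real system would probably want to scope the skip list to the current question and clear it afterwards.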


Acceptance (pass)

  • It is possible to specify the relative weights in the language model (grammar).

    • But there is a danger. It is a little like cutting the legs on a chair to make it level. Hasty modifications will have unintended interactions.

  • Another way is to create a sieve

    • This has the advantage of not changing the balance of the model. The other parts that do not pass the sieve become a de facto garbage collector.
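
A sieve can be sketched as a filter applied after recognition, leaving the grammar and its weights untouched. The names and the `(text, score)` shape are assumptions for the example.

```python
from typing import List, Set, Tuple

Nbest = List[Tuple[str, float]]  # (text, score), best first


def sieve(nbest: Nbest, pass_list: Set[str]) -> Tuple[Nbest, Nbest]:
    """Split an nbest list into (passed, garbage) without touching the grammar."""
    passed = [h for h in nbest if h[0] in pass_list]
    garbage = [h for h in nbest if h[0] not in pass_list]
    return passed, garbage


# The grammar knows many cities, but only two are plausible at this point.
nbest: Nbest = [("houston", 0.61), ("boston", 0.57), ("austin", 0.40)]
passed, garbage = sieve(nbest, {"boston", "austin"})
```

Here the unrelated city absorbs the mismatch instead of being forced onto a near neighbor, which is the garbage-collector effect the slide describes.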


Anticipation

  • Explicit

    • e.g. confirming identity, amounts, etc.

  • Probabilistic

    • Dialogs are journeys

    • Some parts of the route are routine, predictable
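
The slides do not say how a hint is applied. One possible sketch, offered only as an assumption, is to nudge the score of the expected answer and re-sort the nbest list; the `boost` value of 0.1 is arbitrary.

```python
from typing import List, Tuple

Nbest = List[Tuple[str, float]]  # (text, score), best first


def apply_hint(nbest: Nbest, expected: str, boost: float = 0.1) -> Nbest:
    """Raise the score of the anticipated answer, then re-rank.

    The boosted score is capped at 1.0. sorted() is stable, so entries
    with equal scores keep their original order.
    """
    rescored = [
        (text, min(1.0, score + boost) if text == expected else score)
        for text, score in nbest
    ]
    return sorted(rescored, key=lambda h: h[1], reverse=True)


# Confirming an amount that the application already believes is "fifty".
nbest: Nbest = [("fifteen", 0.55), ("fifty", 0.50)]
hinted = apply_hint(nbest, expected="fifty")
```

A hint for an answer that is not in the nbest list leaves the ranking unchanged, so an unexpected reply can still win.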


What should we disregard?

  • When is a recognition event truly the human talking to the computer?

    • The human is speaking

      • But not to the computer

      • But saying the wrong thing

    • Some human is saying something

    • Other noise

      • Car horn, mic bump, radio music, etc.

  • As dialogs get longer we need to politely ignore what we were not intended to respond to


In and Out of Grammar (OOG)

  • The recognizer returned some text

  • Was it really what was said?

  • Can we improve over the “confidence”?

    • Look at the “scores” of the nbest

    • Use them as a “feature space”

    • Use example waves to discover clusters in the feature space that correlate with “in” and “out” of vocabulary
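
As a rough sketch of the feature-space idea, the nbest scores can be turned into a small feature vector and compared with the centroids of labeled examples. The three features, the nearest-centroid rule, and the toy scores below are all assumptions chosen for illustration; they are not measurements and not the author's method.

```python
from math import dist
from typing import List, Sequence, Tuple

Features = Tuple[float, ...]


def nbest_features(scores: Sequence[float]) -> Features:
    """Features from a non-empty, best-first score list:
    top score, gap between the top two, and mean score."""
    top = scores[0]
    gap = top - scores[1] if len(scores) > 1 else top
    mean = sum(scores) / len(scores)
    return (top, gap, mean)


def centroid(points: List[Features]) -> Features:
    """Component-wise mean of a list of feature vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def classify(scores: Sequence[float], in_c: Features, out_c: Features) -> str:
    """Label an utterance by whichever centroid its features are closer to."""
    f = nbest_features(scores)
    return "in" if dist(f, in_c) <= dist(f, out_c) else "out"


# Invented training scores: in-grammar utterances tend to have a clear winner,
# out-of-grammar utterances tend to have a flat nbest list.
in_examples = [[0.90, 0.40, 0.30], [0.85, 0.50, 0.20]]
out_examples = [[0.50, 0.48, 0.45], [0.45, 0.44, 0.40]]
in_c = centroid([nbest_features(s) for s in in_examples])
out_c = centroid([nbest_features(s) for s in out_examples])
```

With these toy centroids, `classify([0.88, 0.45, 0.30], in_c, out_c)` lands on the “in” side and `classify([0.48, 0.47, 0.46], in_c, out_c)` on the “out” side. Real clusters would have to be discovered from recorded example waves.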


Where do we put it?

  • Where does all this heuristic post analysis go? Out in the dialog?

  • How can we minimize the cognitive load on the application developer?

  • We need to wrap up all this extra functionality inside a new container to hide the extra complexity


Re-listening

  • If an utterance is going to be rejected then try again. (Re-listen to the same wave)

  • If you can infer a smaller scope then listen with a grammar that “leans” that way.

  • Merge the nbests via some heuristic

  • Re-think the combined utterance to see if it can now be considered “good and in grammar”
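
The slides leave the merge heuristic open. One simple possibility, shown purely as an assumption, is to add up the scores of hypotheses across the two passes so that agreement between passes reinforces a hypothesis.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Nbest = List[Tuple[str, float]]  # (text, score), best first


def merge_nbests(first: Nbest, second: Nbest) -> Nbest:
    """Sum scores per hypothesis text and re-rank, best first."""
    combined: Dict[str, float] = defaultdict(float)
    for text, score in list(first) + list(second):
        combined[text] += score
    return sorted(combined.items(), key=lambda h: h[1], reverse=True)


# Pass 1 used the full grammar; pass 2 re-listened with a narrower one.
first_pass: Nbest = [("boston", 0.45), ("austin", 0.44)]
second_pass: Nbest = [("austin", 0.48), ("houston", 0.30)]
merged = merge_nbests(first_pass, second_pass)
```

Here "austin" was second in the first pass but wins after merging because both passes support it. The summed scores are no longer confidences in the 0.0 to 1.0 range, so any “good” threshold applied to the merged list would need its own calibration.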


Serial Listening

  • The last utterance is not “good enough”

  • Prompt for a repeat and listen again (live audio from the user)

  • If it is “good” by itself then use it

  • Otherwise, heuristically merge the nbests based on similarities

  • Re-think the combined utterance to see if it can now be considered “good and in grammar”


Parallel Listening

  • Listen on two recognizers

    • One with the narrow “expectation” grammar

    • The other with the wide “possible” grammar

  • If the utterance is in both results, process the “expectation” results

  • If not, process the “possible” results
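
The selection rule can be sketched as below. Reading “in both results” as “at least one hypothesis text appears in both nbest lists” is an interpretation, and the function name and `(text, score)` shape are assumptions.

```python
from typing import List, Tuple

Nbest = List[Tuple[str, float]]  # (text, score), best first


def choose_parallel(narrow: Nbest, wide: Nbest) -> Tuple[str, Nbest]:
    """Pick which recognizer's results to process.

    narrow: nbest from the "expectation" grammar
    wide:   nbest from the "possible" grammar
    """
    wide_texts = {text for text, _ in wide}
    agreed = any(text in wide_texts for text, _ in narrow)
    return ("expectation", narrow) if agreed else ("possible", wide)


# Both recognizers heard "yes": trust the narrow grammar.
source_a, _ = choose_parallel([("yes", 0.7)], [("yes", 0.6), ("yesterday", 0.5)])

# The narrow grammar forced a match the wide grammar does not support.
source_b, _ = choose_parallel([("yes", 0.4)], [("guess", 0.6)])
```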


Conclusions

  • Error rate is the metric to watch

  • There is more information in the recognition result than the 1st good nbest

  • Putting conventional recognition inside a heuristic “box” makes sense

  • The information needed by the “box” is a logical extension of the listening context


Thank you

Emmett Coin

ejTalk, Inc

[email protected]

