
Confidence Measures in Speech Recognition

Stephen Cox

School of Computing Sciences

University of East Anglia

Norwich, UK.

[email protected]


Talk Outline

  • Why do we need confidence measures in speech systems?

  • Motivation for recogniser-independent measures

  • PART 1: Two methods for estimating confidence measures, based on phone/word models

    • Phone correlation

    • Metamodels

  • PART 2: Using semantic information to estimate confidence measures

  • Discussion


Why Confidence Measures?

  • A confidence measure (CM) is a number between 0 and 1 indicating our degree of belief that a unit output by a recogniser (phrase, word, phone etc.) is correct

  • The most important application of CMs is in speech dialogue systems e.g. ticket booking, call-routing, information provision etc.

    • Uncorrected errors can be disastrous in a dialogue system, but confirmation of each content word is tedious

    • The system can use a CM to decide which words are correct and which need to be confirmed or corrected.

  • Unsupervised speaker adaptation: use the CM when adapting the acoustic models (adapt only the models of words that the system considers likely to be correct)

  • Aids selection of multiple hypotheses


Previous Work

  • Confidence measures (CMs) are mostly based on deriving ad hoc features from “side-output” of the recogniser, e.g.

    • number of competing hypotheses when a word is decoded

    • likelihood ratio of hypotheses

    • stability of word in the output lattice (N-best)

    • number of instances of the word, or of the phonemes in the word, in the training data, etc.

  • Problem: These are usually highly recogniser-specific


Example: Number of hypothesized word-ends as a confidence measure


PART 1: A General Approach I

Speech recognition relies on Bayes’ Theorem:

Pr(W | A) = Pr(A | W) Pr(W) / Pr(A)

where W = word sequence and A = acoustics of the speech signal. Pr(W | A) is the probability of a word sequence W given some acoustics A; Pr(A | W) is the probability of some acoustics A given a word sequence W (the ACOUSTIC MODELS); Pr(W) is the probability of a word sequence W (the LANGUAGE MODEL); Pr(A) is the probability of some acoustics A.
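As a toy illustration of how these terms combine, the sketch below scores candidate word sequences by log Pr(A|W) + log Pr(W); Pr(A) is constant across candidates and drops out of the argmax. All numbers are invented:

```python
# Bayes' rule in a recogniser: pick the word sequence W maximising
# log Pr(A|W) + log Pr(W).  Pr(A) is the same for every candidate W,
# so it drops out of the argmax.  All numbers are invented.
candidates = {
    "recognise speech":   {"log_p_acoustic": -120.0, "log_p_lm": -8.1},
    "wreck a nice beach": {"log_p_acoustic": -118.5, "log_p_lm": -14.3},
}

def log_posterior(c):
    return c["log_p_acoustic"] + c["log_p_lm"]

best = max(candidates, key=lambda w: log_posterior(candidates[w]))
print(best)  # "recognise speech": the LM term outweighs the small acoustic gap
```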


A General Approach II

  • Errors occur when either Pr(W) (the language model) or Pr(A|W) (the acoustic models) is inaccurate

  • In decoding words in a recogniser, these two probabilities are integrated

  • We can attempt to disentangle their effects by using a parallel phone recogniser

  • Two approaches:

    • use correlation between phone recogniser string and word recogniser string as confidence measure

    • use phone recogniser to hypothesise word strings and correlate with word recogniser output


Use of a parallel phone recogniser


Pre-processing for phone correlation

[Figure: the speech is passed to the word recogniser and, in parallel, to a phoneme recogniser. The word recogniser’s output is converted to a phoneme transcription and DP-aligned with the phoneme recogniser’s output, giving aligned phoneme pairs (p1 q1, p2 q2, p3 q3, …). Frames are tagged so that each aligned phoneme p is associated with the word k it falls within.]
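The DP alignment step can be sketched as a standard edit-distance alignment with backtrace. This is a minimal illustration with unit costs, not the talk’s actual alignment code:

```python
def dp_align(p, q, sub_cost=1, gap_cost=1):
    """Align two phoneme sequences p and q by dynamic programming
    (Levenshtein-style) and return aligned pairs; None marks an
    insertion or deletion.  A sketch: the talk's local costs may differ."""
    n, m = len(p), len(q)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap_cost
    for j in range(1, m + 1):
        D[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i-1][j-1] + (0 if p[i-1] == q[j-1] else sub_cost),
                D[i-1][j] + gap_cost,   # delete p[i-1]
                D[i][j-1] + gap_cost,   # insert q[j-1]
            )
    # Backtrace to recover the aligned phoneme pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                D[i][j] == D[i-1][j-1] + (0 if p[i-1] == q[j-1] else sub_cost)):
            pairs.append((p[i-1], q[j-1]))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i-1][j] + gap_cost:
            pairs.append((p[i-1], None))
            i -= 1
        else:
            pairs.append((None, q[j-1]))
            j -= 1
    return list(reversed(pairs))

# e.g. aligning the word recogniser's phoneme transcription with the
# parallel phoneme recogniser's output:
print(dp_align(["k", "ae", "t"], ["k", "ah", "t", "s"]))
# [('k', 'k'), ('ae', 'ah'), ('t', 't'), (None, 's')]
```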


Phone correlation: distance measure

[Figure: the aligned phoneme sequences (p1 p2 p3 … against q1 q2 q3 …) or tagged frames are scored using a phoneme confusion matrix to give a distance measure.]
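One plausible form of the resulting distance, sketched below, is the mean negative log confusion probability over the aligned pairs. The confusion values are invented, and the talk’s exact measure may differ:

```python
import math

# Confusion matrix estimated on training data:
# conf[p][q] = Pr(phone recogniser outputs q | word recogniser phone is p).
# Toy values for illustration only.
conf = {
    "ae": {"ae": 0.7, "ah": 0.2, "eh": 0.1},
    "t":  {"t": 0.9, "d": 0.1},
}

def confusion_distance(aligned_pairs, floor=1e-4):
    """Average negative log confusion probability over aligned phoneme
    pairs; insertions/deletions get the floor probability.  One plausible
    distance -- the talk's exact measure may differ."""
    total = 0.0
    for p, q in aligned_pairs:
        if p is None or q is None:
            prob = floor
        else:
            prob = conf.get(p, {}).get(q, floor)
        total += -math.log(prob)
    return total / len(aligned_pairs)

print(confusion_distance([("ae", "ah"), ("t", "t")]))  # small = consistent outputs
```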


Phone correlation: likelihood ratio

[Figure: distributions of the correlation-based distance for correctly and incorrectly decoded words, from which a likelihood ratio is formed.]
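The figure’s idea can be sketched as follows: fit the distance distributions for correct and incorrect words on held-out data, then convert a word’s distance into a posterior confidence via the likelihood ratio. The Gaussian form, the parameters and the prior below are all assumptions:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Distance distributions fitted (here: assumed) on held-out data.
MU_C, SIG_C = 0.5, 0.3   # correctly decoded words: small distances
MU_I, SIG_I = 1.5, 0.5   # incorrectly decoded words: larger distances
P_CORRECT = 0.75         # prior probability that a decoded word is correct

def confidence(distance):
    """Posterior Pr(correct | distance) from the likelihood ratio."""
    lc = gauss_pdf(distance, MU_C, SIG_C) * P_CORRECT
    li = gauss_pdf(distance, MU_I, SIG_I) * (1 - P_CORRECT)
    return lc / (lc + li)

print(round(confidence(0.6), 3))  # high confidence
print(round(confidence(1.8), 3))  # low confidence
```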


Hypothesising words from phone strings

Pr(W | A) ≈ Pr(W | P*) Pr(P* | A), where P* is the most likely phoneme sequence

Pr(P* | A) can be estimated from a parallel phoneme recogniser

Pr(W | P*) is estimated using two techniques: LexList and Metamodels


LexList: Constructing hypothetical word-sequences


Hypotheses Made by a Sliding Window of Length 3 Phonemes
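A hedged sketch of the sliding-window idea: slide a 3-phoneme window along the decoded phone string and propose every lexicon word whose pronunciation contains that trigram. The lexicon is a toy; the talk’s handling of window length and short words is more involved:

```python
# Toy pronunciation lexicon: word -> phoneme sequence.
lexicon = {
    "can":      ["k", "ae", "n"],
    "cat":      ["k", "ae", "t"],
    "increase": ["ih", "n", "k", "r", "iy", "s"],
}

def contains(seq, window):
    return any(seq[i:i + len(window)] == window
               for i in range(len(seq) - len(window) + 1))

def lexlist_hypotheses(phones, win=3):
    """For each 3-phoneme window of the decoded phone string, list the
    lexicon words whose pronunciation contains that window."""
    hyps = []
    for i in range(len(phones) - win + 1):
        window = phones[i:i + win]
        words = [w for w, pron in lexicon.items() if contains(pron, window)]
        hyps.append((tuple(window), words))
    return hyps

print(lexlist_hypotheses(["k", "ae", "n", "k", "r", "iy"]))
# [(('k','ae','n'), ['can']), (('ae','n','k'), []),
#  (('n','k','r'), ['increase']), (('k','r','iy'), ['increase'])]
```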


MetaModels—candidate word lists built using phoneme confusions

Motivation:

  • LexList method requires some ad hoc decisions about window-length, short words etc.

  • Combinatorial explosion in candidate words when confusion-matrix is used

  • MetaModel uses knowledge of phoneme confusions within an HMM framework to produce candidate word lists for CM estimation (a simplified sketch follows this list)
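The simplified sketch promised above: approximate log Pr(decoded phone string | word) from per-phoneme confusion probabilities with a dynamic program over substitutions, insertions and deletions. Real metamodels are HMMs; all probability values here are invented:

```python
import math

def metamodel_score(obs, pron, conf, p_ins=0.05, p_del=0.05, floor=1e-4):
    """Approximate log Pr(obs phone string | word pronunciation) by a
    best-path DP over substitutions (confusion probs), insertions and
    deletions.  A flattened sketch of what a metamodel computes; the
    talk's metamodels are proper HMMs trained on phoneme confusions."""
    n, m = len(pron), len(obs)
    NEG = float("-inf")
    D = [[NEG] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if D[i][j] == NEG:
                continue
            if i < n and j < m:  # emit obs[j] from pron[i]
                p = conf.get(pron[i], {}).get(obs[j], floor)
                D[i+1][j+1] = max(D[i+1][j+1], D[i][j] + math.log(p))
            if i < n:            # skip (delete) pron[i]
                D[i+1][j] = max(D[i+1][j], D[i][j] + math.log(p_del))
            if j < m:            # insert obs[j]
                D[i][j+1] = max(D[i][j+1], D[i][j] + math.log(p_ins))
    return D[n][m]

lexicon = {"cat": ["k", "ae", "t"], "cut": ["k", "ah", "t"]}
conf = {"k": {"k": 0.9}, "ae": {"ae": 0.7, "ah": 0.2},
        "ah": {"ah": 0.8, "ae": 0.15}, "t": {"t": 0.9}}
obs = ["k", "ah", "t"]
for word, pron in lexicon.items():
    print(word, round(metamodel_score(obs, pron, conf), 2))
# "cut" scores higher than "cat": the observed phones fit it better
```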


Building a Set of Metamodels


Obtaining a confidence measure from a set of metamodels


Data and Models

  • Recogniser built using WSJCAM0 database

    • Acoustic model training: SI-TR data, ~10000 sentences, 92 speakers

    • Testing: SI-DT dataset, ~1900 sentences, 20 speakers

    • Models: tied-state triphone HMMs with 8-component Gaussian mixtures (~3500 states)

    • Bigram language model with backoff, 20000 word vocabulary, perplexity ~160

  • Confidence measures

    • Independent training and testing sets from SI-DT dataset


Performance measurement

  • Use the CM to tag each decoded word as ‘C’ (correct) or ‘I’ (incorrect)

  • Guessing measure (G) error-rate: the error-rate obtained by tagging every word with the more frequent label without using the CM, i.e. G = min(p, 1 − p), where p is the proportion of incorrectly decoded words

  • Confidence measure (CM) error-rate: the proportion of words whose CM-based tag disagrees with the true label

  • Improvement I: the proportion of guessing errors removed by the CM, I = (G − CM)/G × 100% (a worked sketch follows this list)
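A worked sketch of these quantities, assuming the definitions just given, on invented tags and reference labels:

```python
def error_rates(cm_tags, truth):
    """cm_tags: CM-based tags ('C' or 'I') for each decoded word;
    truth: actual correctness labels.  Returns the guessing error-rate G
    (always guess the majority label), the CM error-rate, and the
    improvement I as a percentage."""
    n = len(truth)
    p_incorrect = truth.count("I") / n
    g = min(p_incorrect, 1 - p_incorrect)               # best blind guess
    cm = sum(t != r for t, r in zip(cm_tags, truth)) / n
    improvement = (g - cm) / g * 100                    # % of guessing errors removed
    return g, cm, improvement

truth   = list("CCIICCICCC")   # 30% of words incorrect -> G = 0.3
cm_tags = list("CCICCCICCC")   # the CM misses one incorrect word
print(error_rates(cm_tags, truth))  # (0.3, 0.1, 66.7): CM removes 2/3 of errors
```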


Baseline: “N-best” Confidence

Confidences:

can = 4/9, an = 5/9, increase = 8/9 etc. etc.
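A minimal sketch of this baseline: a word’s confidence is the fraction of N-best hypotheses that contain it. Real implementations align the hypotheses word-by-word; plain membership and a 3-best toy list stand in for the slide’s 9-best example:

```python
def nbest_confidence(best, nbest):
    """Confidence of each word in the 1-best hypothesis = fraction of
    N-best hypotheses containing that word.  Sketch only: proper
    implementations align the hypotheses rather than test membership."""
    n = len(nbest)
    return {w: sum(w in hyp for hyp in nbest) / n for w in best}

nbest = [
    "we can increase the price".split(),
    "we can increase a price".split(),
    "he an increase the price".split(),
    # ... the slide's example uses nine hypotheses
]
print(nbest_confidence(nbest[0], nbest))
# {'we': 0.67, 'can': 0.67, 'increase': 1.0, 'the': 0.67, 'price': 1.0}
```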


Performance comparison


PART II: Use of semantic information in confidence measures

  • It is possible to identify incorrect words in an utterance on semantic grounds, e.g. Exxon corporations said earlier this week that it replaced one hundred forty percent its violin gas production in nineteen eighty serve on. (violin = “oil and”)

  • Clearly, only a small proportion of incorrect words can be identified on such grounds

  • However, this information is likely to be independent of measures based on decoder output, and so might be advantageously combined with other CMs.

  • Also, it requires no recogniser side information at all.


Preliminary Experiment

  • Examined decodings of about 600 sentences from our recogniser

  • Marked any word that we considered to be incorrect on grounds of semantics

  • Checked results against the transcriptions:

  • Only 470 words were marked as incorrect (Recall = 470/3141 = 15%)

  • Of these words, 421 were actually incorrect (Precision = 421/470 = 90%)

  • So human performance may be useful, but at low recall


Latent Semantic Analysis

  • We need a way of identifying words that are “semantically distant” from the other decoded words in a sentence

  • Clustering words only works up to a point because of data sparsity

  • Also, many semantically close word-pairs may rarely co-occur and so not cluster, e.g. movie and film (synonyms); striker and batsman (both sporting roles, but different games)

  • Latent Semantic Analysis (LSA) has been successfully used to associate semantically similar words


Co-occurrence matrix W

[Figure: the co-occurrence matrix W is an M-word by N-document array of counts. Rows are the M words (a, about, access, account, …, you, you’ve, your), columns are the N documents (Doc 1, Doc 2, Doc 3, …, Doc N), and each entry is the number of times the word occurs in the document.]


Singular Value Decomposition of W

W ≈ U S V^T

where W is the M x N word/document matrix, U is M x R (one row per word), S is R x R diagonal (the singular values), and V^T is R x N (one column per document). This maps the WORD/DOCUMENT SPACE into an R-dimensional LSA SPACE; W = U S V^T exactly when R = N. In this case, M ≈ 20000, N ≈ 20000, R ≈ 100.


Data and Representation

  • Use the Wall Street Journal corpus (same material as utterances for recognition experiments).

  • The “Documents” are the paragraphs: each paragraph is (pretty much) semantically coherent

  • 19396 documents and 19685 different words

  • Each word represented in the LSA space by a 100-d vector

  • Computed “semantic similarity” between two words as the dot-product of the vectors representing the words: S(w_i, w_j) = u_i · u_j, where u_i is the LSA vector for word w_i (a sketch follows)
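A small numpy sketch of the whole pipeline on a toy matrix: build the word-document counts, take the rank-R truncated SVD, and score word pairs by the dot product of their LSA vectors. At the talk’s scale (~20000 x ~20000, R = 100) one would use a sparse or truncated SVD routine rather than a dense one:

```python
import numpy as np

# Toy word-document count matrix W (M words x N documents); the talk
# uses ~19685 words x 19396 WSJ paragraphs and R = 100.
words = ["stock", "shares", "film", "movie"]
W = np.array([
    [3, 2, 0, 0, 1],   # stock
    [2, 3, 0, 1, 0],   # shares
    [0, 0, 4, 2, 0],   # film
    [0, 1, 3, 3, 0],   # movie
], dtype=float)

R = 2
U, S, Vt = np.linalg.svd(W, full_matrices=False)
word_vecs = U[:, :R] * S[:R]      # each row: one word in the R-d LSA space

def semantic_score(w1, w2):
    """Dot product of LSA word vectors, as on the slide above."""
    u1 = word_vecs[words.index(w1)]
    u2 = word_vecs[words.index(w2)]
    return float(u1 @ u2)

print(semantic_score("film", "movie"))   # high: the words co-occur
print(semantic_score("stock", "movie"))  # low: they rarely co-occur
```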


Semantic score distributions for four words

Semantic Score Distributions for Four Words

OUTRank 1

CAUTIOUSRank 6763

DENOMINATIONRank 19666

ABOARDRank 13892


Confidence measures from LSA

  • Several confidence measures for a decoded word were evaluated:

    • 1. Mean semantic score to the other decoded words (MSS)

    • 2. Mean rank of semantic score to other decoded words given the complete distribution of scores to all words (MR).

    • 3. Probability of observing the set of scores to the other decoded words, given the distribution of scores for the word (PSS). The score distribution was approximated by a five-component Gaussian mixture. (A sketch of the first two measures follows.)
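A hedged sketch of the first two measures. The scorer here is a stand-in for the LSA dot product above; PSS would additionally fit a five-component Gaussian mixture to each word’s score distribution (e.g. with sklearn.mixture.GaussianMixture(n_components=5)):

```python
import numpy as np

def mss(word, others, score):
    """Measure 1 (MSS): mean semantic score of `word` to the other decoded words."""
    return float(np.mean([score(word, w) for w in others]))

def mean_rank(word, others, vocabulary, score):
    """Measure 2 (MR): mean rank of the scores to the other decoded words
    within the full distribution of `word`'s scores to the vocabulary."""
    dist = np.sort([score(word, w) for w in vocabulary])
    return float(np.mean([np.searchsorted(dist, score(word, w)) for w in others]))

# Demo with a stand-in scorer (in practice: the LSA dot product above).
toy = {("oil", "gas"): 0.9, ("oil", "violin"): 0.1, ("gas", "violin"): 0.05}
score = lambda a, b: toy.get((a, b), toy.get((b, a), 0.0))
print(mss("violin", ["oil", "gas"]))   # low mean score -> low confidence
```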


Use of a Stop List

  • Very commonly occurring words (e.g. function words) co-occur with most words, so have high scores to most words, and so contribute noise.

  • Hence words whose mean semantic score to all words in the training set was above a threshold LT were omitted (a sketch follows).

  • Recogniser baseline performance increases when these words are omitted, and this is taken into account in the results.
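A minimal sketch of the stop-list construction; the threshold LT is a tuned parameter, and the vectors are assumed to come from the LSA sketch above:

```python
import numpy as np

def build_stop_list(words, word_vecs, lt):
    """Stop-list the words whose mean semantic score to all other words
    exceeds the threshold LT.  word_vecs: one LSA vector per word (rows);
    lt is a tuned threshold -- its value is an assumption."""
    scores = word_vecs @ word_vecs.T                         # pairwise dot products
    mean_to_others = (scores.sum(axis=1) - np.diag(scores)) / (len(words) - 1)
    return {w for w, m in zip(words, mean_to_others) if m > lt}
```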


Distribution of PSS scores

[Figure: distributions of PSS for correct and incorrect words: a large difference between the distributions at high scores, little difference at low scores.]


Discussion I

  • We expected this technique to work by identifying as incorrect those decoded words that were semantically distant from the other words in the sentence.

  • However, PSS derives its discrimination by identifying the correctly decoded words.

  • Analysis revealed that the words associated with high values of PSS were predominantly words that commonly occurred in the WSJ data (numbers, financial terms etc.). These are highly cognate with each other.


Discussion II

  • Inspection of the decoded words that had very low values of PSS associated with them showed that some of these were very common words that had been correctly decoded.

  • It is possible that the corpus used for making the LSA analysis does not have enough material to capture the large set of words that these common words co-occur with.

  • Hence the decoded utterances in the test-set contain previously unseen co-occurrences that lead to a low semantic score for these words.

  • Some test-set words are also out-of-vocabulary


Performance of semantic CM


Final Comments

  • We have developed techniques for identifying incorrect words in the output of a speech recogniser that do not depend on “side-information” from the recogniser, which is highly recogniser-specific.

  • The most successful is the “metamodels” technique. This uses a parallel phone recogniser working with the word recogniser and then correlates the output of the word recogniser with possible words constructed using metamodels.

  • Using semantic information gives a small but significant confidence gain and requires no other recogniser. This may well be domain-dependent.

  • The final test of the utility of these measures comes when they are used in a real system.

