
Speaker Verification: From Research to Reality

ICASSP Tutorial

Salt Lake City, UT

7 May 2001

Douglas A. Reynolds, PhD

Senior Member of Technical Staff

M.I.T. Lincoln Laboratory

Larry P. Heck, PhD

Speaker Verification R&D

Nuance Communications

This work was sponsored by the Department of Defense under Air Force contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the Department of Defense.



This material may not be reproduced in whole or part without written permission from the authors



Tutorial Outline

  • Part I : Background and Theory

    • Overview of area

    • Terminology

    • Theory and structure of verification systems

    • Channel compensation and adaptation

  • Part II : Evaluation and Performance

    • Evaluation tools and metrics

    • Evaluation design

    • Publicly available corpora

    • Performance survey

  • Part III : Applications and Deployments

    • Brief overview of commercial speaker verification systems

    • Design requirements for commercial verifiers

    • Steps to deployment

    • Examples of deployments



Goals of Tutorial

  • Understand major concepts behind modern speaker verification systems

  • Identify the key elements in evaluating performance of a speaker verification system

  • Define the main issues and tasks in deploying a speaker verification system


Part I : Background and Theory Outline

  • Overview of area

    • Applications

    • Terminology

  • General Theory

    • Features for speaker recognition

    • Speaker models

    • Verification decision

  • Channel compensation

  • Adaptation

  • Combination of speech and speaker recognizers


Extracting Information from Speech

Goal: Automatically extract information transmitted in the speech signal

[Diagram: the speech signal feeds three systems: speech recognition → words ("How are you?"); language recognition → language name (English); speaker recognition → speaker name (James Wilson).]


Evolution of Speaker Recognition

[Timeline, 1930s to 2001: aural and spectrogram matching; template matching; dynamic time-warping and vector quantization; hidden Markov models and Gaussian mixture models; commercial application of speaker recognition technology. Evaluation data has grown from small databases of clean, controlled speech to large databases of realistic, unconstrained speech.]

  • This tutorial will focus on techniques and performance of state-of-the-art systems


Speaker Recognition Applications

  • Access Control

    • Physical facilities

    • Computer networks and websites

  • Transaction Authentication

    • Telephone banking

    • Remote credit card purchases

  • Law Enforcement

    • Forensics

    • Home parole

  • Speech Data Management

    • Voice mail browsing

    • Speech skimming

  • Personalization

    • Intelligent answering machine

    • Voice-web / device customization


Terminology

  • The general area of speaker recognition can be divided into two fundamental tasks: identification and verification

  • Any work on speaker recognition should identify which task is being addressed


Terminology: Identification

  • Determines who is talking from a set of known voices

  • No identity claim from user (one-to-many mapping)

  • Often assumed that the unknown voice must come from the set of known speakers - referred to as closed-set identification

[Diagram: "Whose voice is this?"]


Terminology: Verification / Authentication / Detection

  • Determines whether a person is who they claim to be

  • User makes an identity claim (one-to-one mapping)

  • Unknown voice could come from a large set of unknown speakers - referred to as open-set verification

  • Adding a "none-of-the-above" option to closed-set identification gives open-set identification

[Diagram: "Is this Bob's voice?"]


Terminology: Segmentation and Clustering

  • Determine when a speaker change has occurred in the speech signal (segmentation)

  • Group together speech segments from the same speaker (clustering)

  • Prior speaker information may or may not be available

[Diagram: a two-speaker waveform. Where are the speaker changes? Which segments are from the same speaker (Speaker A vs. Speaker B)?]


Terminology: Speech Modalities

Application dictates different speech modalities:

  • Text-dependent recognition

    • Recognition system knows text spoken by person

    • Examples: fixed phrase, prompted phrase

    • Used for applications with strong control over user input

    • Knowledge of spoken text can improve system performance

  • Text-independent recognition

    • Recognition system does not know text spoken by person

    • Examples: User selected phrase, conversational speech

    • Used for applications with less control over user input

    • More flexible system but also more difficult problem

    • Speech recognition can provide knowledge of spoken text


Terminology: Voice Biometric

  • Speaker verification is often referred to as a voice biometric

  • Biometric: a human generated signal or attribute for authenticating a person’s identity

  • Voice is a popular biometric:

    • natural signal to produce

    • does not require a specialized input device

    • ubiquitous: telephones and microphone-equipped PCs

  • Voice biometric can be combined with other forms of security

  • Something you have - e.g., badge

  • Something you know - e.g., password

  • Something you are - e.g., voice

[Diagram: combining something you have, know, and are gives the strongest security.]


Part I : Background and Theory Outline

  • Overview of area

    • Applications

    • Terminology

  • General Theory

    • Features for speaker recognition

    • Speaker models

    • Verification decision

  • Channel compensation

  • Adaptation

  • Combination of speech and speaker recognizers


General Theory: Phases of Speaker Verification System

Two distinct phases to any speaker verification system

[Diagram, Enrollment Phase: enrollment speech for each speaker (Bob, Sally) → feature extraction → model training → model (voiceprint) for each speaker.]

[Diagram, Verification Phase: claimed identity (Sally) + input speech → feature extraction → verification decision → "Accepted!".]


General Theory: Features for Speaker Recognition

  • Humans use several levels of perceptual cues for speaker recognition

[Diagram, hierarchy of perceptual cues: high-level cues (learned traits) are difficult to extract automatically; low-level cues (physical traits) are easy to extract automatically.]

  • There are no exclusive speaker identity cues

  • Low-level acoustic cues are most applicable for automatic systems


General Theory: Features for Speaker Recognition

  • Desirable attributes of features for an automatic system (Wolf '72)

    • Practical: occur naturally and frequently in speech; easily measurable

    • Robust: not change over time or be affected by the speaker's health; not be affected by reasonable background noise nor depend on specific transmission characteristics

    • Secure: not be subject to mimicry

  • No feature has all these attributes

  • Features derived from the spectrum of speech have proven to be the most effective in automatic systems


General Theory: Speech Production

  • Speech production model: source-filter interaction

    • Anatomical structure (vocal tract/glottis) conveyed in speech spectrum

[Diagram: glottal pulses (source) filtered by the vocal tract produce the speech signal; waveforms shown vs. time (sec).]


General Theory: Features for Speaker Recognition

[Figure: vocal tract cross-sections and spectra (magnitude in dB vs. frequency, 0-6000 Hz) of the sounds /I/ and /AE/ for a male and a female speaker.]

  • Different speakers will have different spectra for similar sounds

  • Differences are in location and magnitude of peaks in spectrum

    • Peaks are known as formants and represent resonances of vocal cavity

  • The spectrum captures the formant locations and, to some extent, pitch without explicit formant or pitch tracking


General Theory: Features for Speaker Recognition

  • Speech is a continuous evolution of the vocal tract

    • Need to extract time series of spectra

    • Use a sliding window - 20 ms window, 10 ms shift

[Diagram: sliding window frames → Fourier transform → magnitude spectrum.]

  • Produces a time-frequency evolution of the spectrum (frequency vs. time)


General Theory: Features for Speaker Recognition

[Diagram: magnitude spectrum with an overlaid triangular filterbank (magnitude vs. frequency).]

  • The number of discrete Fourier transform samples representing the spectrum is reduced by averaging frequency bins together

    • Typically done by a simulated filterbank

  • A perceptually based filterbank is used such as a Mel or Bark scale filterbank

    • Linearly spaced filters at low frequencies

    • Logarithmically spaced filters at high frequencies


General Theory: Features for Speaker Recognition

[Diagram: each frame of the spectrum is reduced to a cepstral feature vector, e.g., (3.4, 3.6, 2.1, 0.0, -0.9, 0.3, 0.1).]

  • The primary features used in speaker recognition systems are cepstral feature vectors

  • Log() function turns linear convolutional effects into additive biases

    • Easy to remove using blind-deconvolution techniques

  • Cosine transform helps decorrelate elements in feature vector

    • Less burden on model and empirically better performance

[Pipeline: Fourier transform → magnitude → log() → cosine transform → one feature vector every 10 ms]
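As a concrete illustration of this front end (window → Fourier transform → magnitude → mel filterbank → log → cosine transform), here is a minimal numpy sketch. The 20 ms / 10 ms framing and the 300-3300 Hz band come from the slides; the FFT size, Hamming window, filter count, and cepstral order are typical assumptions, not the authors' exact configuration.

```python
import numpy as np

def mel(f):       # Hz -> mel
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):   # mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr, f_lo=300.0, f_hi=3300.0):
    # Triangular filters spaced evenly on the mel scale:
    # linear at low frequencies, logarithmic at high frequencies
    edges = mel_inv(np.linspace(mel(f_lo), mel(f_hi), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def cepstra(signal, sr=8000, win=0.020, shift=0.010, n_filters=24, n_ceps=13):
    """Frame -> window -> |FFT| -> mel filterbank -> log -> DCT
    (one feature vector every 10 ms)."""
    n_win, n_shift, n_fft = int(sr * win), int(sr * shift), 256
    frames = [signal[i:i + n_win] * np.hamming(n_win)
              for i in range(0, len(signal) - n_win, n_shift)]
    spec = np.abs(np.fft.rfft(frames, n_fft))                 # magnitude spectrum
    fbank = np.log(spec @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # DCT-II decorrelates the log filterbank energies
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return fbank @ dct.T                                      # (num_frames, n_ceps)
```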


General Theory: Features for Speaker Recognition

[Figure: an example waveform passing through the pipeline: Fourier transform → magnitude → log() → cosine transform.]

General Theory: Features for Speaker Recognition

  • Additional processing steps for speaker recognition features

  • To help capture some temporal information about the spectra, delta cepstra are often computed and appended to the cepstral feature vector

    • 1st-order linear fit used over a 5-frame (50 ms) span

  • For telephone speech processing, only the voice pass-band frequency region is used

    • Use only the output of filters in the range 300-3300 Hz
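A sketch of that delta computation (the slope of a 1st-order linear fit over a 5-frame span), operating on the output of the cepstra() sketch above:

```python
import numpy as np

def add_deltas(ceps, span=2):
    """Append delta cepstra: the linear-regression slope over a 5-frame
    window (span=2 frames on each side, ~50 ms) around each frame."""
    padded = np.pad(ceps, ((span, span), (0, 0)), mode="edge")
    k = np.arange(-span, span + 1)            # regression weights -2..2
    num = sum(i * padded[span + i : span + i + len(ceps)] for i in k)
    return np.hstack([ceps, num / (k @ k)])   # (num_frames, 2 * n_ceps)
```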


General Theory: Features for Speaker Recognition

  • To help remove channel convolutional effects, cepstral mean subtraction (CMS) or RASTA filtering is applied to the cepstral vectors

[Diagram: the channel h(t) enters the pipeline FT() → |.| → log() → cosine transform as an additive bias on the cepstra.]

  • Some speaker information is lost, but generally CMS is highly beneficial to performance

  • RASTA filtering is like a time-varying version of CMS (Hermansky, 92)
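CMS itself is one line; a sketch, assuming cepstral vectors stacked one frame per row as above:

```python
import numpy as np

def cms(ceps):
    """Cepstral mean subtraction: a time-invariant linear channel appears as a
    constant additive bias in the cepstra, so subtracting the per-utterance
    mean removes it (at the cost of some speaker information)."""
    return ceps - ceps.mean(axis=0, keepdims=True)
```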


General Theory: Phases of Speaker Verification System

Two distinct phases to any speaker verification system

[Diagram, Enrollment Phase: enrollment speech for each speaker (Bob, Sally) → feature extraction → model training → model (voiceprint) for each speaker.]

[Diagram, Verification Phase: claimed identity (Sally) + input speech → feature extraction → verification decision → "Accepted!".]


General Theory: Speaker Models

  • Speaker models are used to represent the speaker-specific information conveyed in the feature vectors

  • Desirable attributes of a speaker model

    • Theoretical underpinning

    • Generalizable to new data

    • Parsimonious representation (size and computation)

  • Modern speaker verification systems employ some form of Hidden Markov Models (HMM)

    • Statistical model for speech sound representation

    • Solid theoretical basis

    • Existing parameter estimation techniques


General Theory: Speaker Models

  • Treat the speaker as a hidden random source generating the observed feature vectors

    • Source has "states" corresponding to different speech sounds

[Diagram: speaker (source) in a hidden speech state emitting observed feature vectors.]


General Theory: Speaker Models

[Diagram: HMM states with transition probabilities; each state i has its own feature distribution.]

  • Feature vectors generated from each state follow a Gaussian mixture distribution

  • Transitions between states are based on the modality of speech

    • Text-dependent case will have ordered states

    • Text-independent case will allow all transitions

  • Model parameters

    • Transition probabilities

    • State mixture parameters

  • Parameters are estimated from training speech using Expectation Maximization (EM) algorithm
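For the text-independent case (the single-state HMM, i.e. a GMM, of a later slide), EM training is available off the shelf. A minimal sketch using scikit-learn's GaussianMixture, chosen here purely for illustration; any EM implementation would do:

```python
from sklearn.mixture import GaussianMixture

def train_gmm(feats, n_mix=256):
    """Fit a diagonal-covariance Gaussian mixture to feature vectors via EM.
    feats: (num_frames, num_features) array of cepstral vectors.
    The mixture order n_mix is set empirically, as the slides note."""
    gmm = GaussianMixture(n_components=n_mix, covariance_type="diag",
                          max_iter=50, reg_covar=1e-6)
    return gmm.fit(feats)
```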


General Theory: Speaker Models

  • HMMs encode the temporal evolution of the features (spectrum)

  • HMMs represent underlying statistical variations in the speech state (e.g., phoneme) and temporal changes of speech between the states.

  • This provides a statistical model of how a speaker produces sounds

  • Designer needs to set

    • Topology (# states and allowed transitions)

    • Number of mixtures


General Theory: Speaker Models

Form of HMM depends on the application:

  • Fixed phrase ("Open sesame") → word/phrase models

  • Prompted phrases/passwords → phoneme models (e.g., /t/ /e/ /n/)

  • General speech, text-independent → single-state HMM (GMM)


General Theory: Speaker Models

  • The dominant model factor in speaker recognition performance is the number of mixtures used (Matsui and Furui, ICASSP92)

  • Selection of mixture order is dependent on a number of factors

    • Topology of HMM

    • Amount of training data

    • Desired model size

  • No good theoretical technique to pick the mixture order

    • Usually set empirically

  • Parameter tying techniques can help increase the effective number of Gaussians with limited total parameter increase


General Theory: Speaker Models

  • The likelihood of an HMM λ given a sequence of feature vectors X = x(1), ..., x(T) is computed as

    Full likelihood score: p(X | λ) = Σ over all state sequences q of p(X, q | λ)

    Viterbi (best-path) score: p(X | λ) ≈ max over state sequences q of p(X, q | λ)

[Diagram: trellis of states vs. time for x(1) x(2) x(3) x(4).]


General Theory: Phases of Speaker Verification System

Two distinct phases to any speaker verification system

[Diagram, Enrollment Phase: enrollment speech for each speaker (Bob, Sally) → feature extraction → model training → model (voiceprint) for each speaker.]

[Diagram, Verification Phase: claimed identity (Sally) + input speech → feature extraction → verification decision → "Accepted!".]


General Theory: Verification Decision

  • The verification task is fundamentally a two-class hypothesis test

    • H0: the speech S is from an impostor

    • H1: the speech S is from the claimed speaker

  • We select the most likely hypothesis (Bayes test for minimum error): accept the claim if p(S | H1) / p(S | H0) exceeds a threshold θ

  • This is known as the likelihood ratio test


General Theory: Verification Decision

[Diagram: speech S → front-end processing → log-likelihood against the speaker model (+) minus log-likelihood against the impostor model (-) → score Λ.]

  • Usually the log-likelihood ratio is used: Λ(S) = log p(S | speaker model) - log p(S | impostor model)

  • The H1 likelihood is computed using the claimed speaker model

  • Requires an alternative or impostor model for H0 likelihood
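Continuing the GMM sketch from above, the decision reduces to a few lines; where the threshold comes from is an application question (see Part II):

```python
def verify(spk_gmm, impostor_gmm, feats, threshold=0.0):
    """Log-likelihood ratio test: average per-frame log-likelihood under the
    claimed speaker's model minus that under the impostor/background model.
    sklearn's score() returns the mean log-likelihood per sample."""
    llr = spk_gmm.score(feats) - impostor_gmm.score(feats)
    return llr, llr > threshold
```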


General Theory: Background Model

  • There are two main approaches for creating an alternative model for the likelihood ratio test

  • Cohorts / Likelihood Sets / Background Sets (Higgins, DSPJ91)

    • Use a collection of other speaker models

    • The likelihood of the alternative is some function, such as the average, of the individual impostor model likelihoods

  • General / World / Universal Background Model (Carey, ICASSP91)

    • Use a single speaker-independent model

    • Trained on speech from a large number of speakers to represent general speech patterns

[Diagram: the speaker model is scored against a set of background models (cohort) or against a single universal model.]


General Theory: Background Model

  • The background model is crucial to good performance

    • Acts as a normalization to help minimize non-speaker related variability in decision score

  • Just using speaker model’s likelihood does not perform well

    • Too unstable for setting decision thresholds

    • Influenced by too many non-speaker dependent factors

  • The background model should be trained using speech representative of the expected impostor speech

    • Same type of speech as speaker enrollment (modality, language, channel)

    • Representation of the impostor genders and microphone types to be encountered


General Theory: Background Model

  • Selected highlights of research on background models

  • Near/far cohort selection (Reynolds, SpeechComm95)

    • Select cohort speakers to cover the speaker space around speaker model

  • Phonetic based cohort selection (Rosenberg, ICASSP96)

    • Select speech and speakers to match the same speech modality as used for speaker enrollment

  • Microphone dependent background models (Heck, ICASSP97)

    • Train background model using speech from same type microphone as used for speaker enrollment

  • Adapting speaker model from background model (Reynolds, Eurospeech97, DSPJ00)

    • Use Maximum A Posteriori (MAP) estimation to derive speaker model from a background model
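A sketch of that last idea: mean-only MAP adaptation of a UBM toward a speaker's enrollment features, in the style of Reynolds. The relevance factor r = 16 is a conventional choice, not a value from the slides:

```python
import copy
import numpy as np

def map_adapt_means(ubm, feats, r=16.0):
    """Derive a speaker model from the background model by MAP-adapting the
    Gaussian means toward the speaker's data (mean-only adaptation)."""
    post = ubm.predict_proba(feats)                 # (frames, mixtures) responsibilities
    n = post.sum(axis=0)                            # soft frame counts per mixture
    ex = (post.T @ feats) / np.maximum(n, 1e-10)[:, None]   # per-mixture data means
    alpha = (n / (n + r))[:, None]                  # more data -> trust the data more
    spk = copy.deepcopy(ubm)
    spk.means_ = alpha * ex + (1.0 - alpha) * ubm.means_
    return spk
```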


General Theory: Components of Speaker Verification System

[Diagram: input speech ("My name is Bob") plus identity claim → feature extraction → scored against Bob's speaker model and the impostor model(s) → Λ → decision → ACCEPT or REJECT.]


Part I : Background and Theory Outline

  • Overview of area

    • Applications

    • Terminology

  • General Theory

    • Features for speaker recognition

    • Speaker models

    • Verification decision

  • Channel compensation

  • Adaptation

  • Combination of speech and speaker recognizers



Channel Compensation

The largest challenge to practical use of speaker verification systems is channel variability

  • Variability refers to changes in channel effects between enrollment and successive verification attempts

  • Channel effects encompass several factors

    • The microphones

      • Carbon-button, electret, hands-free, etc

    • The acoustic environment

      • Office, car, airport, etc.

    • The transmission channel

      • Landline, cellular, VoIP, etc.

  • Anything which affects the spectrum can cause problems

    • Speaker and channel effects are bound together in spectrum and hence features used in speaker verifiers

  • Unlike speech recognition, speaker verifiers cannot "average out" these effects using large amounts of speech

    • Limited enrollment speech


Channel Compensation: Examples

[Figures: spectral comparisons labeled "Different!" and "The Same!", illustrating how channel effects confound speaker comparisons.]


Channel Compensation

Using compensation techniques has driven down error rates in NIST evaluations

[Figure: error rates under channel mismatch start a factor of 20 worse than matched conditions and improve to a factor of 2.5 worse as compensation techniques are applied.]

  • Three areas where compensation has been applied

    • Feature-based approaches: CMS and RASTA; nonlinear mappings

    • Model-based approaches: handset-dependent background models; Synthetic Model Synthesis (SMS)

    • Score-based approaches: Hnorm, Tnorm


Channel Compensation: Feature-based Approaches

[Diagrams: (1) nonlinear handset mapping: electret speech mapped to carbon-button speech through cascades of linear filters and nonlinearities; (2) discriminative feature design: input features → feature analysis → ANN transform → output features for the speaker recognition system, with discriminative training.]

  • CMS and RASTA only address linear channel effects on features

  • Several approaches have looked at non-linear effects

  • Non-linear mapping (Quatieri, TrSAP 2000)

  • Use Volterra series to map speech between different types of handsets

  • Discriminative feature design (Heck, SpeechCom 2000)

  • Use a neural net to find features that discriminate speakers, not channels


Channel Compensation: Model-based Approaches

[Diagram: a speaker model enrolled on one handset type (e.g., electret) is synthetically mapped to models for other types (carbon button, cellular).]

  • It is generally difficult to get enrollment speech from all microphone types to be used

  • The SMS approach addresses this by synthetically generating speaker models as if they came from different microphones (Teunen, ICSLP 2000)

    • A mapping of model parameters between different microphone types is applied


Channel Compensation: Score-based Approaches

  • Speaker model LR scores have different biases and scales for utterances from different handset types

  • Hnorm attempts to remove these bias and scale differences from the LR scores (Reynolds, NIST eval96)

    • Estimate the mean and standard deviation of scores from impostor, same-sex utterances for each microphone type

    • During verification, normalize the LR score based on the microphone label of the utterance

[Diagram: per-speaker score distributions for electret (elec) and carbon-button (carb) utterances align after Hnorm.]
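A sketch of Hnorm, assuming the per-handset impostor statistics for this speaker's model have already been estimated:

```python
def hnorm(raw_score, handset_label, handset_stats):
    """Hnorm: standardize an LR score using the mean/std of impostor scores
    for this speaker's model on the labeled handset type.
    handset_stats: e.g. {"elec": (mu, sigma), "carb": (mu, sigma)}."""
    mu, sigma = handset_stats[handset_label]
    return (raw_score - mu) / sigma
```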


Channel Compensation: Score-based Approaches

[Diagram: a test utterance is scored against the speaker model and a set of cohort models; the cohort score statistics produce the Tnorm score.]

  • Tnorm/HTnorm - Estimates bias and scale parameters for score normalization using “cohort” set of speaker models (Auckenthaler, DSP Journal 2000)

    • Test time score normalization

    • Normalizes target score relative to a non-target model ensemble

    • Similar to standard cohort normalization except for standard deviation scaling

  • Use cohorts of the same gender and channel as the speaker

  • Can be used in conjunction with Hnorm
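Tnorm in the same spirit, normalizing at test time with the same utterance's scores against the cohort models:

```python
import numpy as np

def tnorm(target_score, cohort_scores):
    """Tnorm: standardize the target-model score by the mean/std of the
    utterance's scores against an ensemble of non-target (cohort) models."""
    mu, sigma = np.mean(cohort_scores), np.std(cohort_scores) + 1e-10
    return (target_score - mu) / sigma
```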


Part I : Background and Theory Outline

  • Overview of area

    • Applications

    • Terminology

  • General Theory

    • Features for speaker recognition

    • Speaker models

    • Verification decision

  • Channel compensation

  • Adaptation

  • Combination of speech and speaker recognizers


Adaptation

[Diagram: speech S → front-end processing → speaker model vs. impostor model → decision; on "Accept", the utterance is fed back to update the speaker model.]

  • Model adaptation is important for maintaining performance in speaker verification systems

    • Limited enrollment speech

    • Speaker and speech environment change over time (Furui)

  • Most useful approach is unsupervised adaptation

    • Use verifier decision to select data to update speaker model

    • Adjust model parameters to better match new data (MAP adaptation)
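A sketch of the unsupervised loop, reusing the verify() and map_adapt_means() sketches above (this is continuous adaptation with a fixed relevance factor; the schedules on the next slide vary the adaptation weight instead):

```python
def verify_and_adapt(spk_gmm, ubm, feats, threshold, r=16.0):
    """Unsupervised adaptation: when the verifier accepts an utterance,
    MAP-adapt the speaker model toward it, using the current model as prior."""
    llr, accepted = verify(spk_gmm, ubm, feats, threshold)
    if accepted:
        # Caveat noted on the next slide: an accepted impostor contaminates the model
        spk_gmm = map_adapt_means(spk_gmm, feats, r=r)
    return accepted, spk_gmm
```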


Adaptation

[Figure: EER vs. adaptation session; the largest gain comes at the start.]

  • Adaptation parameter can be set in several ways

    • As a fixed value : Continuous adaptation

    • As a function of likelihood score : Adjust adaptation based on certainty of decision

    • As a function of verification sessions : Adapt aggressively early and taper off later

  • Experiments have shown that adapting with N utterances produces performance comparable to having N extra utterances during initial training

  • Potential problems with adaptation

    • Impostor contamination

    • Novel channels may be rejected and so never learned


Part I : Background and Theory Outline

  • Overview of area

    • Applications

    • Terminology

  • General Theory

    • Features for speaker recognition

    • Speaker models

    • Verification decision

  • Channel compensation

  • Adaptation

  • Combination of speech and speaker recognizers



Combination of Speech and Speaker Recognizers

  • There are four basic ways speech recognition is used with speaker verifiers

    • For front-end speech segmentation

    • For prompted text verification

    • For knowledge verification

    • To extract idiolectal information


Speech and Speaker Recognizers: Front-end Segmentation

[Diagram: the speech recognizer produces a unit segmentation (/a/ /b/ /c/ ...) that feeds the speaker verifier's speaker and impostor HMMs.]

  • Speech recognizer used to segment speech for training and verification

  • Depending on the task, different linguistic units are recognized

    • Words, phones, broad phonetic classes

  • The recognized phrase could also provide the claimed identity to the verifier

    • E.g., account number


Speech and Speaker Recognizers: Prompted Text Verification

[Diagram: prompted phrase (e.g., 82-32-71) → speaker verifier (speaker vs. impostor HMMs) yields a speaker score; speech recognizer yields a text score; the two scores are combined.]

  • Prompted text systems are used to help thwart playback attacks

  • Need to verify voice and that prompted text was said

  • Possible to have integrated speaker and text verification using speaker-dependent phrase decoding


Speech and Speaker Recognizers: Knowledge Verification

[Diagram: a question drawn from a personal info DB is asked; the response goes to the speaker verifier and, when the verifier is unsure, to a speech recognizer that compares it to the known answer (KV score).]

  • Compare response to personal question to known answer

    • E.g., “What is your date of birth?”

  • Can be used for initial enrollment speech collection

    • Use KV for first three accesses while collecting speech

  • Can also be used as fall-back verification in case speaker verifier is unsure after some number of attempts

  • Also known as Verbal Information Verification (Q. Li, ICSLP98)


Speech and Speaker Recognizers: Idiolectal Information Extraction

[Diagram: recognized word stream ("Uh I think yeah ...") → bigram (n=2) frequencies for the speaker (e.g., uh-I 0.022, uh-yeah 0.001, uh-well 0.025) and for a background population (uh-I 0.001, uh-yeah 0.049, uh-well 0.071) → LR computation.]

  • Recent work by Doddington has found significant speaker information using ngrams of recognized words (Eurospeech 2001 and NIST website)

  • During training, create counts of ngrams from the speaker's training data and from a collection of background speakers

  • During verification, compute an LR score between the speaker and background ngram models

  • Good example of using higher levels of speaker information
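A toy sketch of the ngram LR computation over recognized word streams (bigrams, with a simple floor for unseen pairs; the smoothing is an assumption for illustration):

```python
from collections import Counter
import math

def bigram_model(words):
    """Relative frequencies of adjacent word pairs in a recognized word stream."""
    pairs = Counter(zip(words, words[1:]))
    total = sum(pairs.values()) or 1
    return {bg: c / total for bg, c in pairs.items()}

def idiolect_llr(test_words, spk_bigrams, bkg_bigrams, floor=1e-6):
    """Average per-bigram log-likelihood ratio, speaker vs. background."""
    pairs = list(zip(test_words, test_words[1:]))
    llr = sum(math.log(spk_bigrams.get(bg, floor) / bkg_bigrams.get(bg, floor))
              for bg in pairs)
    return llr / max(len(pairs), 1)
```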



Speaker Verification: From Research to Reality

Part II : Evaluation and Performance


Part II : Evaluation and Performance Outline

  • Evaluation metrics

  • Evaluation design

  • Publicly available corpora

  • Performance survey



Evaluation metrics

  • In speaker verification, there are two types of errors that can occur

    • False reject: incorrectly reject a true speaker (also known as a miss or a Type I error)

    • False accept: incorrectly accept an impostor (also known as a false alarm or a Type II error)

  • The performance of a verification system is a measure of the trade-off between these two errors

    • The tradeoff is usually controlled by adjusting the decision threshold

  • In an evaluation, Ntrue true trials (speech from the claimed speaker) and Nfalse false trials (speech from an impostor) are conducted, and the probabilities of false reject and false accept are estimated at different thresholds


Evaluation metrics

  • Evaluation errors are estimates of the true errors made using a finite number of trials:

    Pr(miss) ≈ (# true trials rejected) / Ntrue

    Pr(fa) ≈ (# false trials accepted) / Nfalse


Evaluation metrics: ROC and DET Curves

  • Receiver Operating Characteristic (ROC): a plot of Pr(miss) vs. Pr(fa) shows system performance; the decision threshold decreases along the curve

  • Detection Error Tradeoff (DET): plots Pr(miss) and Pr(fa) on a normal-deviate scale; better performance lies toward the lower left

[Figures: ROC and DET curves, probability of false reject (%) vs. probability of false accept (%).]
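A sketch of how the error trade-off is computed from pooled trial scores (assuming higher scores are more speaker-like); the EER is where the two error curves cross:

```python
import numpy as np

def error_tradeoff(true_scores, imp_scores):
    """Sweep the threshold over all observed scores; return per-threshold
    Pr(miss) and Pr(fa). Plotting one against the other gives the ROC/DET."""
    thr = np.sort(np.concatenate([true_scores, imp_scores]))
    p_miss = np.array([(true_scores < t).mean() for t in thr])
    p_fa = np.array([(imp_scores >= t).mean() for t in thr])
    return thr, p_miss, p_fa

def eer(true_scores, imp_scores):
    """Equal error rate: the operating point where Pr(miss) = Pr(fa)."""
    _, p_miss, p_fa = error_tradeoff(true_scores, imp_scores)
    i = np.argmin(np.abs(p_miss - p_fa))
    return (p_miss[i] + p_fa[i]) / 2.0
```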


Evaluation metrics: DET Curve

  • Application operating point depends on the relative costs of the two errors

    • High security (e.g., wire transfer): false acceptance is very costly; users may tolerate rejections for security

    • High convenience (e.g., toll fraud): false rejections alienate customers; any fraud rejection is beneficial

  • Equal Error Rate (EER) is often quoted as a summary performance measure

[Figure: DET curve (probability of false reject vs. probability of false accept, in %) with high-security, balance, and high-convenience operating points; EER = 1% at the balance point.]


Evaluation metrics: Decision Cost Function

  • In addition to EER, a decision cost function (DCF) is also used to measure performance:

    DCF = C(miss) × Pr(miss) × Pr(spkr) + C(fa) × Pr(fa) × Pr(imp)

    where C(miss) = cost of a miss, C(fa) = cost of a false alarm, Pr(spkr) = prior probability of a true speaker attempt, and Pr(imp) = 1 - Pr(spkr) = prior probability of an impostor attempt

  • For application-specific costs and priors, compare systems based on the minimum value of the DCF
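The DCF and its minimum over thresholds as a sketch, reusing error_tradeoff() from above; the default costs and prior here are illustrative assumptions, not values from the tutorial:

```python
def dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_spkr=0.01):
    """Decision cost function at one operating point."""
    return c_miss * p_miss * p_spkr + c_fa * p_fa * (1.0 - p_spkr)

def min_dcf(true_scores, imp_scores, **costs):
    """Minimum DCF over all thresholds."""
    _, p_miss, p_fa = error_tradeoff(true_scores, imp_scores)
    return min(dcf(m, f, **costs) for m, f in zip(p_miss, p_fa))
```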


Evaluation metrics: Thresholds

  • A deployed verification system must make decisions; that is, it must set and use a priori thresholds

    • DET curves and EER are independent of setting thresholds

  • The DCF can be used as an objective target for setting and measuring goodness of a priori thresholds

    • Set threshold during development to minimize DCF

    • Measure how close to minimum DCF the threshold achieves in evaluation

  • For measuring system performance, speaker-independent thresholds should be used

    • Pr(miss) is computed by pooling all true trial scores from all speakers in evaluation

    • Pr(fa) is computed by pooling all false trial scores from all impostors in evaluation

  • Using speaker-dependent-threshold DETs produces very optimistic performance which cannot be achieved in practice


Part II : Evaluation and Performance Outline

  • Evaluation metrics

  • Evaluation design

  • Publicly available corpora

  • Performance survey


Evaluation Design: Data Selection Factors

Speech quality

  • Channel and microphone characteristics

  • Ambient noise level and type

  • Variability between enrollment and verification speech

Speech modality

  • Fixed/prompted/user-selected phrases

  • Free text

Speech duration

  • Duration and number of sessions of enrollment and verification speech

Speaker population

  • Size and composition

  • Experience

The evaluation data and design should match the target application domain of interest

  • Performance numbers are only meaningful when evaluation conditions are known


Evaluation Design: Sizing of Evaluation

  • The overarching concern is to design an evaluation which produces statistically significant results

    • Number and composition of speakers

    • Number of true and false trials

  • For performance goals of Pr(miss) = 1% and Pr(fa) = 0.1%, the "rule of 30" below implies

    • 3,000 true trials → 0.7% < Pr(miss) < 1.3% with 90% confidence

    • 30,000 impostor trials → 0.07% < Pr(fa) < 0.13% with 90% confidence

  • Independence of trials is still an open issue

  • For the number of trials, we can use the “rule of 30” based on binomial distribution and independence assumption (Doddington)

To be 90 percent confident that the true error rate is within +/- 30% of the observed error rate, there must be at least 30 errors
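The trial counts above follow directly from the rule; a one-liner to reproduce them:

```python
import math

def trials_needed(target_error_rate, min_errors=30):
    """Doddington's "rule of 30": observe at least 30 errors so the true
    error rate lies within +/-30% of the estimate with 90% confidence."""
    return math.ceil(min_errors / target_error_rate)

# trials_needed(0.01) -> 3000 true trials; trials_needed(0.001) -> 30000 impostor trials
```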


Evaluation Design: Trials

  • True trials are the limiting factor in evaluation design

  • False trials are easily generated by scoring all speaker models against all utterances

    • May not be possible for speaker-specific fixed phrases

[Diagram: matrix of speaker models vs. test utterances.]

  • Important that each trial use only the utterance and model under test

    • Otherwise the system is using "known" impostors (closed set)

  • Can design trials to examine performance on sub-conditions

    • E.g., train on electret and test on carbon-button


Part II : Evaluation and Performance Outline

  • Evaluation metrics

  • Evaluation design

  • Publicly available corpora

  • Performance survey


Publicly Available Corpora: Data Providers

  • Linguistic Data Consortium

    • http://www.ldc.upenn.edu/

  • European Language Resources Association

    • http://www.icp.inpg.fr/ELRA/home.html


Publicly Available Corpora: Partial Listing

  • TIMIT et al. (LDC) - Not particularly good for evaluations

  • SIVA (ELRA) – Italian telephone prompted speech

  • PolyVar (ELRA) – French telephone prompted and spontaneous speech

  • POLYCOST (ELRA) – European languages prompted and spontaneous speech

  • KING (LDC) – Dual wideband and telephone monologs

  • YOHO (LDC) – Office environment, combination-lock phrases

  • Switchboard I-II & NIST Eval Subsets (LDC) – Telephone conversational speech

  • Tactical Speaker Identification, TSID (LDC) – Military radio communications

  • Speaker Recognition Corpus (OGI) – Long term telephone prompted and spontaneous speech

Summary of corpora characteristics can be found at

http://www.apl.jhu.edu/Classes/Notes/Campbell/SpkrRec/


Part II : Evaluation and Performance Outline

  • Evaluation metrics

  • Evaluation design

  • Publicly available corpora

  • Performance survey


Performance Survey: Range of Performance

[Figure: DET curves (probability of false reject vs. probability of false accept, in %) spanning a wide range of conditions; performance improves with increasing constraints. Approximate error rates:]

  • Text-independent (read sentences); military radio data; multiple radios & microphones; moderate amount of training data: ~25%

  • Text-independent (conversational); telephone data; multiple microphones; moderate amount of training data: ~10%

  • Text-dependent (combinations); clean data; single microphone; large amount of train/test speech: ~1%

  • Text-dependent (digit strings); telephone data; multiple microphones; small amount of training data: ~0.1%


Performance Survey: NIST Speaker Recognition Evaluations

[Diagram: the Linguistic Data Consortium (data provider) supplies data, NIST (evaluation coordinator) runs a comparison of technologies on a common task, and technology developers evaluate and improve in a loop.]

  • Annual NIST evaluations of speaker verification technology (since 1995)

  • Aim: Provide a common paradigm for comparing technologies

  • Focus: Conversational telephone speech (text-independent)


http://www.nist.gov/speech/tests/spk/index.htm


Performance Survey: NIST Speaker Recognition Evaluation 2000

  • DET curves for 10 US and European sites

    • Variable duration test segments (average 30 sec)

    • Two minutes of training speech per speaker

    • 1003 speakers (546 female, 457 male)

    • 6096 true trials, 66520 false trials

  • Equal error rates range between 8% and 19%

  • Dominant approach is adapted Gaussian Mixture Model based system (single state HMM)


Performance Survey: Effect of Training and Testing Duration

  • Results from the 1998 NIST evaluation

[Figure: DET curves improve with increasing training data and with increasing testing data.]


Performance Survey: Effect of Microphone Mismatch

  • In the NIST evaluation, performance was measured when speakers used the same and different telephone handset microphone types (carbon-button vs. electret)

  • With microphone mismatch, the equal error rate increases by over a factor of 2

[Figure: DET curves using same vs. different handset types; roughly a 2.5x gap.]


Performance Survey: Effect of Speech Coding

  • Recognition from reconstructed speech

    • Error rate increases as bit rate decreases

    • GSM speech performs as well as uncoded speech

  • Recognition from speech coder parameters

    • Negligible increase in EER with increased computational efficiency

  • Coder rates: T1 - 64.0 kb/s; GSM - 12.2 kb/s; G.729 - 8.0 kb/s; G.723 - 5.3 kb/s; MELP - 2.4 kb/s


Performance Survey: Human vs. Machine

[Figure: error rates for computer vs. human listeners; humans were 44% better in the mismatched-handset tests and 15% worse in the matched tests.]

  • Motivation for comparing human to machine

    • Evaluating speech coders and potential forensic applications

  • Schmidt-Nielsen and Crystal used NIST evaluation (DSP Journal, January 2000)

    • Same amount of training data

    • Matched Handset-type tests

    • Mismatched Handset-type tests

    • Used 3-sec conversational utterances from telephone speech

  • Humans have more robustness to channel variabilities

    • Use different levels of information


Performance Survey: Human Forensic Performance

[Diagram: a recorded threat is compared against a suspect's voice.]

  • In 1986, the Federal Bureau of Investigation published a survey of two thousand voice identification comparisons made by FBI examiners

    • Forensic comparisons completed over a period of fifteen years, under actual law enforcement conditions

    • The examiners had a minimum of two years experience, and had completed over 100 actual cases

    • The examiners used both aural and spectrographic methods

    • http://www.owlinvestigations.com/forensic_articles/aural_specetrographic/fulltext.html#research

From “Spectrographic voice identification: A forensic survey ,” J. Acoust. Soc. Am, 79(6) June 1986, Bruce E. Koenig


Performance Survey: Comparison to Other Biometrics

  • Raw accuracy is generally not a good way to compare different biometric techniques

    • The application will dictate other important factors

    • See “Fundamentals of Biometric Technology” at http://www.engr.sjsu.edu/biometrics/publications_tech.html for good discussion and comparison of biometrics

From “A Practical Guide to Biometric Security Technology ,” IEEE Computer Society, IT Pro - Security, Jan-Feb 2001, Simon Liu and Mark Silverman


Performance Survey: Comparison to Other Biometrics

From CESG Biometric Test Programme Report (http://www.cesg.gov.uk/biometrics/)



Speaker Verification: From Research to Reality

Part III : Applications and Deployments


Part III : Applications and Deployments Outline

  • Brief overview of commercial speaker verification systems

  • Design requirements for commercial verification systems

    • General considerations

    • Dialog design

  • Steps to deploying speaker verification systems

    • Initial data collection

    • Tuning

    • Limited Deployment and Final Rollout

  • Examples of real deployments


Part III : Applications and Deployments Outline

  • Brief overview of commercial speaker verification systems

  • Design requirements for commercial verification systems

    • General considerations

    • Dialog design

  • Steps to deploying speaker verification systems

    • Initial data collection

    • Tuning

    • Limited Deployment and Final Rollout

  • Examples of real deployments


Commercial Speaker Verification

[Timeline, 1980 to 2001, moving from small-scale deployments (100s) to large-scale deployments (1M+):]

  • Telecom: Sprint's Voice FONCARD (Texas Instruments)

  • Access Control: TI corporate facility (Texas Instruments)

  • Law Enforcement: prison call monitoring (T-Netix); home incarceration (ITT Industries)

  • Financial: Charles Schwab (Nuance)

  • Telecom: Swisscom (Nuance)

  • Access Control: Mac OS 9 (Apple)

  • Commerce: Home Shopping Network (Nuance)


Applications and Deployments Commercial Speaker Verification Systems

[Slide: Nuance Verifier.]


Part III : Applications and Deployments Outline

  • Brief overview of commercial speaker verification systems

  • Design requirements for commercial verification systems

    • General considerations

    • Dialog design

  • Steps to deploying speaker verification systems

    • Initial data collection

    • Tuning

    • Limited Deployment and Final Rollout

  • Examples of real deployments



Applications and Deployments Design Requirements in Commercial Verifiers

Requirements:

  • Fast

    • Example: 50 simultaneous verifications on single PIII 500MHz processor

  • Accurate

    • < 0.1% FAR @ < 5% FRR with ~1-5% reprompt rate

  • Robust (channel/noise variability)

  • Compact Storage of Speaker Models

    • < 100KB/model with support for 1M+ users on standard DBs (e.g., Oracle)

  • Scalable (1 Million+ users with standard DBs)

  • Easy to deploy

  • International language/region support

  • Variety of operating modes:

    • Text-independent, text-prompted, text-dependent

  • Fully Integrated with state-of-the-art speech recognizer

Biggest challenge: robustness to channel variability



Applications and Deployments Design Requirements: Online Adaptation

Online Unsupervised Adaptation

  • Adaptation is one of the most powerful technologies to address robustness & ease of deployment

  • Additional requirements:

    • Minimizes cross-channel corruption: adapting on cellular improves performance on the office phone

    • Minimizes cross-channel effects with no growth in storage: saves new information from additional channels in one channel model

    • Minimizes model corruption from impostor attack

  • SMS with online adaptation (Heck, ICSLP 2000):

    • Addresses the above requirements

    • 5222 speakers, 8 calls @ 12.5% impostor attack rate → 61% reduction in EER (unsupervised)



Applications and Deployments Dialog Design: General Principles

  • Dialog should be designed to be secure and convenient

    • Security is often compromised by users if the dialog is not convenient. Example: 4-digit PIN

      • Security = 1 out of 10,000 false accepts? No! Users compromise the security of PINs to make them easier to remember (writing them down in a wallet, on-line, etc.)

  • Dialog should be maximally constrained but flexible

    • More constraints → better accuracy for fixed-length training

    • Example: balance between constraints on the acoustic space while maintaining flexibility → digit sequences

Dialog design goal: a constrained but flexible dialog to maximize security while maintaining convenience



Applications and Deployments Dialog Design: Rules of Thumb

  • Enrollment:

    • must be secure (e.g., rely on knowledge)

    • should be completed in single session

  • Identity claim should be:

    • unique (but perhaps not unique for multi-user accounts)

    • easy to recognize over large populations

    • useful for simultaneous verification

  • Verification utterances should be:

    • easy to remember

      • YES: SSN, DOB, home telephone number

      • NO: PIN, password

    • easy to recognize (both recognizer and verifier)

    • perceived as short, but contain lots of speech

      • Names: “Smith S M I T H”

      • Digits: “3 5 6 7, 3 5 6 7”

    • known only by user

    • widely accepted by user population (e.g., not too private)

    • difficult to record/synthesize



Applications and Deployments Dialog Design: Simultaneous ID Claim/Verification

Simultaneous Identity Claim and Verification:

  • Buffer the identity claim utterance ("My name is John Doe")

  • Recognize the identity claim and retrieve the corresponding model

  • "Re-process" the data by verifying the same utterance against the model

[Diagram: start buffering data → "My name is John Doe" → start verification ("john_doe.model") → stop verification & stop buffering data.]



Applications and Deployments Dialog Design: Confidence-based Reprompting

Confidence-based reprompting:

  • Minimize the average length of the authentication process

  • Improve the effective FAR/FRR by reprompting when unsure

  • The "reprompt rate" (RPR) is controlled by two new thresholds (an initial positive and an initial negative score threshold); scores between them trigger a reprompt

    RPR = Pr(spkr) (FRR1 - FRR2) + Pr(imp) (FAR2 - FAR1)

  • Example, 1st utterance: FAR = 0.07%, FRR = 0.2%, RPR = 5.7%

[Diagram: score distributions with the two thresholds; FRR1/FRR2 and FAR1/FAR2 are the error rates at the two thresholds.]
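The two-threshold logic as a sketch:

```python
def decide(score, accept_threshold, reject_threshold):
    """Confidence-based reprompting: accept above the positive threshold,
    reject below the negative one, and reprompt for scores in between."""
    if score >= accept_threshold:
        return "accept"
    if score <= reject_threshold:
        return "reject"
    return "reprompt"
```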



Applications and Deployments Dialog Design: Knowledge Verification

[Diagram: recognize "who you are" (speaker verification against voiceprints) and "what you know" (knowledge verification against a knowledge base); the results are combined into an accept/reject decision.]

Example dialog:

  "Please enter your account number" → "5551234"

  "Say your date of birth" → "October 13, 1964"

  "You're accepted by the system"



Applications and Deployments Dialog Design: Knowledge Verification

  • Methods to combine knowledge and speaker verification:

  • Sequential ("and" of decisions): knowledge verification followed by speaker verification

    FAR = FAR(sv) * FAR(kv)

    FRR = FRR(kv) + (1 - FRR(kv)) * FRR(sv)

  • Parallel ("or" of decisions):

    FAR = FAR(sv) + FAR(kv)

    FRR = FRR(sv) * FRR(kv)

  • Weighted scores: combine the two scores before making a single decision



Applications and Deployments Dialog Design: Knowledge Verification

[Diagram: as above, speaker verification ("who you are") combined sequentially with knowledge verification ("what you know").]

  • Example: sequential combination of decisions

  • Easy to implement; focuses on improving overall security

    FAR = FAR(sv) * FAR(kv): 0.01% = 0.1% * 10%

    FRR = FRR(kv) + (1 - FRR(kv)) * FRR(sv): 1.1% = 0.1% + (1 - 0.1%) * 1%
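The arithmetic of the sequential combination, reproducing the example's numbers:

```python
def sequential_rates(far_sv, frr_sv, far_kv, frr_kv):
    """Error rates for the sequential ("and") combination from this slide."""
    far = far_sv * far_kv
    frr = frr_kv + (1 - frr_kv) * frr_sv
    return far, frr

# Slide example: FAR(sv)=0.1%, FRR(sv)=1%, FAR(kv)=10%, FRR(kv)=0.1%
# -> FAR = 0.0001 (0.01%), FRR ~= 0.011 (1.1%)
print(sequential_rates(0.001, 0.01, 0.10, 0.001))
```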



Applications and Deployments Dialog Design: Security Against Recordings

[Diagram: as in Part I, the prompted phrase (e.g., 82-32-71) is scored by the speaker verifier (speaker score) and the speech recognizer (text score), and the scores are combined.]

Prompted Text Verification:

  • Prompt user to repeat random phrase

    • Example: “Please say 82-32-71, 82-32-71”

    • Serves as “liveness” test

  • Requires modification of enrollment dialog

    • (typically) longer enrollment to adequately cover acoustics


Part III : Applications and Deployments Outline

  • Brief overview of commercial speaker verification systems

  • Design requirements for commercial verification systems

    • General considerations

    • Dialog design

  • Steps to deploying speaker verification systems

    • Initial data collection

    • Tuning

    • Limited Deployment and Final Rollout

  • Examples of real deployments



Applications and Deployments Deployment Steps

[Flow: initial data collection → tune → limited deployment → tune → rollout]



Applications and Deployments Deployment Steps: Initial Data Collection

How do you collect the data?

  • “Probing/sampling” approach?

    • Employ persons to call the application under supervision

    • Not widely used (too difficult to collect enough data)

  • Assessment from actual in-field data?

    • Much easier to get volumes of data and more realistic

    • Impostor trials: common enrollment utterance for impostor trials

    • True Speaker Trials: Sort scores. Manually transcribe poor-scoring utts.

  • Need ~50 callers/gender

  • Need to observe 30 errors of each type/condition (“rule of 30”)

  • Each speaker enroll/verifies several times (across multiple channels)




Applications and Deployments Deployment: Tuning

What components can be tuned?


  • Operating point (threshold)

    • Setting the operating point a priori is very difficult!

    • Speaker-independent and/or speaker-dependent thresholds?

    • Picking the correct operating point is key to a successful deployment!

  • Dialog Design

    • Customer feedback and/or usage patterns can be used to simplify dialog design (e.g., removing confirmation steps, reducing reprompt rate)

  • Impostor Models (Acoustic)

    • Training with real application data results in more competitive impostor models, with better representation of linguistics, noise, and channels



Applications and Deployments Deployment Steps: Limited Deployment/Rollout

What steps are there to deployment?

  • Begin with a limited set of actual users

    • Representative of the entire caller population

    • Representative sampling of the (telephone) network

    • Representative of noise and channel mismatch conditions

  • After rollout, track the following statistics:

    • Successful enrollment sessions (# of speaker models)

    • Successful verification sessions

    • In-grammar/out-of-grammar analysis (recognition)

    • Verification rejects (correct & false) for each speaker

    • Duration of sessions

[Flow: initial data collection → tune → limited deployment → tune → rollout]


Part III : Applications and Deployments Outline

  • Brief overview of commercial speaker verification systems

  • Design requirements for commercial verification systems

    • General considerations

    • Dialog design

  • Steps to deploying speaker verification systems

    • Initial data collection

    • Tuning

    • Limited Deployment and Final Rollout

  • Examples of real deployments



Applications and Deployments First High-Volume Deployment

  • Application

    • Speaker verification and identification based on home phone number

    • Provides secure access to customer record & credit card information

  • Benefits

    • Security

    • Personalization

  • Size & Volume

    • 600k customers enrolled @ … calls/day

    • Full deployment: 5 million customers @ 170K calls/day

  • Implementation

    • Nuance Verifier™

    • Edify telephony platform

    • Deployed July 1999



Applications and Deployments First High-Volume Deployment

[Figures: successful enrollment and successful authentication statistics.]



Applications and Deployments Transaction Authentication

  • Toll fraud prevention

  • Telephone credit card purchases

  • Telephone brokerage (e.g., stock trading)


Applications and Deployments 1st Large-Scale High Security Deployment

  • Charles Schwab "Service Broker"

    "No PIN to remember, no PIN to forget"

    • Built on Nuance Verifier

    • Pilot (2000): 10,000 users (SF Bay Area, NY); deployment: ~3 million users

    • National rollout beginning Q3, 2001

[Flow: account number → prompted random 4-digit phrase → confident? → if not, a 2nd utterance → if still unsure, fall back to a PIN → make decision.]



Applications and Deployments Law Enforcement

  • Monitoring

    • Remote time and attendance logging

    • Home parole verification

    • Prison telephone usage

  • ITT:

    • SpeakerKey - telephone-based home incarceration service

    • Deployed at 10 sites in Wisconsin, Ohio, Pennsylvania, Georgia, and California

    • More than 12,000 home incarceration sessions in June 1995

    • 0.66% false acceptance, 4.3% false rejection

  • T-Netix:

    • Contain - validates the identity and location of a parolee

    • PIN-LOCK - validates the identity of an inmate prior to allowing an outbound prison call

    • Deployed in Arizona, Colorado, and Maryland

    • 10K inmates using PIN-LOCK

    • Roughly 25,000 - 30,000 verifications performed daily



Applications and Deployments Demonstrations

  • Nuance 1-888-NUANCE-8

    • http://www.nuance.com/demos/demo-shoppingnetwork.html

  • T-Netix 1-800-443-2748

    • http://www.t-netix.com/SpeakEZ/SpeakEZDemo.html

  • ITT

    • http://www.buytel.com/WebKey/index.asp

  • Voice Security

    • http://www.Voice-Security.com/KeyPad.html


Speaker Verification: From Research to Reality (Recap)

  • Part I : Background and Theory

    • Major concepts behind theory and operation of modern speaker verification systems

  • Part II : Evaluation and Performance

    • Key elements in evaluating performance of a speaker verification system

  • Part III : Applications and Deployments

    • Main issues and tasks in deploying a speaker verification system



Conclusions

Speaker recognition is one of the few recognition areas where machines can outperform humans

Speaker recognition technology is a viable technique currently available for applications

Speaker recognition can be augmented with other authentication techniques to increase security



Future Directions

Research will focus on using speaker recognition for more unconstrained, uncontrolled situations

  • Audio search and retrieval

  • Increasing robustness to channel variability

  • Incorporating higher-levels of knowledge into decisions

Speaker recognition technology will become an integral part of speech interfaces

  • Personalization of services and devices

  • Unobtrusive protection of transactions and information


To Probe Further: General Resources

Conferences and Workshops:

2001: A Speaker Odyssey - The Speaker Recognition Workshop, Crete, Greece, 2001 http://www.odyssey.westhost.com/

Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques (RLA2C), Avignon, France, 1998 [proceedings in English]

ESCA Workshop on Automatic Speaker Recognition, Identification, and Verification, Martigny, Switzerland 1994 http://www.isca-speech.org/workshops.html

Audio and Visual Based Person Authentication (AVBPA) 1997, 1999, 2001 http://www.hh.se/avbpa/

International Conference on Acoustics Speech and Signal Processing (ICASSP), annual [sessions on speaker recognition] http://www.icassp2001.org/

European Conference on Speech Communication and Technology (Eurospeech), biennial [sessions on speaker recognition] http://eurospeech2001.org/

International Conference on Spoken Language Processing (ICSLP), biennial [sessions on speaker recognition] http://www.icslp2000.org/

Journals:

IEEE Transactions on Speech and Audio Processing http://www.ieee.org/organizations/society/sp/tsa.html

Speech Communication http://www.elsevier.nl/locate/specom

Computer Speech & Language http://www.academicpress.com/www/journal/0/@/0/la.htm

Web:

The Linguistic Data Consortium, http://www.ldc.upenn.edu/

European Language Resources Association http://www.icp.inpg.fr/ELRA/home.html

NIST Speaker Recognition Benchmarks http://www.nist.gov/speech/tests/spk/index.htm

Joe Campbell’s Site for Speaker Recognition Speech Corpora http://www.apl.jhu.edu/Classes/Notes/Campbell/SpkrRec/

The Biometric Consortium http://www.biometrics.org/

Comp.Speech FAQ on speaker recognition http://www.speech.cs.cmu.edu/comp.speech/Section6/Q6.6.html

Search for “speaker verification” in the Google search engine http://www.google.com/



To Probe Further: Selected References

Tutorials:

B. Atal, “Automatic recognition of speakers from their voices,” Proceedings of the IEEE, vol. 64, pp. 460–475, April 1976.

A. Rosenberg, “Automatic speaker verification: a review,” Proceedings of the IEEE, vol. 64, pp. 475–487, April 1976.

G. Doddington, “Speaker recognition: identifying people by their voices,” Proceedings of the IEEE, vol. 73, pp. 1651–1664, November 1985.

D. O'Shaughnessy, “Speaker recognition,” IEEE ASSP Magazine, vol. 3, pp. 4–17, October 1986.

J. Naik, “Speaker verification: a tutorial,” IEEE Communications Magazine, vol. 28, pp. 42–48, January 1990.

H. Gish and M. Schmidt, “Text-independent speaker identification,” IEEE Signal Processing Magazine, vol. 11, pp. 18–32, October 1994.

S. Furui, “An overview of speaker recognition technology,” in Automatic Speech and Speaker Recognition (C.-H. Lee and F. K. Soong, eds.), pp. 31–56, Kluwer Academic, 1996.

J. P. Campbell, “Speaker recognition: a tutorial,” Proceedings of the IEEE, vol. 85, pp. 1437–1462, September 1997.

Special Issue on Speaker Recognition, Digital Signal Processing, vol. 10, January 2000. http://www.idealibrary.com/links/toc/dspr/10/1/0

Technology:

M. Carey, E. Parris, and J. Bridle, “A speaker verification system using alphanets,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 397–400, May 1991.

G. Doddington, “Speaker recognition based on idiolectal differences between speakers,” in Proceedings of the European Conference on Speech Communication and Technology, 2001.

C. Fredouille, J. Mariethoz, C. Jaboulet, J. Hennebert, J.-F. Bonastre, C. Mokbel, and F. Bimbot, “Behavior of a Bayesian adaptation method for incremental enrollment in speaker verification,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2000.

L. P. Heck and M. Weintraub, “Handset-dependent background models for robust text-independent speaker recognition,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 1071–1073, April 1997.

G. Doddington, M. Przybocki, A. Martin, and D. A. Reynolds, “The NIST speaker recognition evaluation - overview, methodology, systems, results, perspective,” Speech Communication, vol. 31, pp. 225–254, March 2000.


To Probe Further: Selected References (continued)

L. Heck and N. Mirghafori, “On-line unsupervised adaptation for speaker verification,” in Proceedings of the International Conference on Spoken Language Processing, 2000.

L. Heck, Y. Konig, M. K. Sonmez, and M. Weintraub, “Robustness to telephone handset distortion in speaker recognition by discriminative feature design,” Speech Communication, vol. 31, pp. 181–192, 2000.

H. Hermansky, N. Morgan, A. Bayya, and P. Kohn, “RASTA-PLP speech analysis technique,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. I.121–I.124, March 1992.

A. Higgins, L. Bahler, and J. Porter, “Speaker verification using randomized phrase prompting,” Digital Signal Processing, vol. 1, pp. 89–106, 1991.

B. Koenig, “Spectrographic voice identification: a forensic survey,” Journal of the Acoustical Society of America, vol. 79, pp. 2088–2090, June 1986.

Q. Li and B.-H. Juang, “Speaker verification using verbal information verification for automatic enrollment,” in Proceedings of the International Conference on Spoken Language Processing, 1998.


A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, “The DET curve in assessment of detection task performance,” in Proceedings of the European Conference on Speech Communication and Technology, pp. 1895–1898, 1997.

T. Matsui and S. Furui, “Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. II-157–II-164, March 1992.

M. Newman, L. Gillick, Y. Ito, D. McAllaster, and B. Peskin, “Speaker verification through large vocabulary continuous speech recognition,” in Proceedings of the International Conference on Spoken Language Processing, pp. 2419–2422, 1996.

T. F. Quatieri, D. A. Reynolds, and G. C. O'Leary, “Estimation of handset nonlinearity with application to speaker recognition,” IEEE Transactions on Speech and Audio Processing, August 2000.

D. A. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models,” Speech Communication, vol. 17, pp. 91–108, August 1995.


To Probe Further: Selected References (continued)

D. A. Reynolds, “HTIMIT and LLHDB: speech corpora for the study of handset transducer effects,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 1535–1538, April 1997.

D. A. Reynolds, “Comparison of background normalization methods for text-independent speaker verification,” in Proceedings of the European Conference on Speech Communication and Technology, pp. 963–967, September 1997.

D. Reynolds, M. Zissman, T. Quatieri, G. O'Leary, and B. Carlson, “The effects of telephone transmission degradations on speaker recognition performance,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 329–332, May 1995.

D. A. Reynolds and B. A. Carlson, “Text-dependent speaker verification using decoupled and integrated speaker and speech recognizers,” in Proceedings of the European Conference on Speech Communication and Technology, pp. 647–650, September 1995.

D. A. Reynolds, “The effects of handset variability on speaker recognition performance: experiments on the Switchboard corpus,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 113–116, May 1996.

A. E. Rosenberg and C.-H. Lee, “Connected word talker verification using whole-word hidden Markov models,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 381–384, 1991.

A. E. Rosenberg and S. Parthasarathy, “Speaker background models for connected digit password speaker verification,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 81–84, May 1996.

F. Soong and A. Rosenberg, “On the use of instantaneous and transitional spectral information in speaker recognition,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 877–880, 1986.

R. Teunen, B. Shahshahani, and L. Heck, “A model-based transformational approach to robust speaker recognition,” in Proceedings of the International Conference on Spoken Language Processing, 2000.

