Automatic Speaker Recognition: Technologies, Evaluations and Possible Future

Gérard CHOLLET, CNRS-LTCI, GET-ENST

Outline


  • Why Speaker Recognition?

  • Taxonomy (i.e. tasks)

  • Applications (security, forensic,…)

  • Pros and Cons

  • Speaker Characteristics in the Speech Signal

  • How to perform Speaker Recognition?

  • Evaluation (NIST,…)

  • Voice Transformations and Forgery (occasional, dedicated)

  • Audio-visual Speaker Verification

  • Conclusions, Perspectives

1. Why Should a Computer Recognize Who Is Speaking?

  • Protection of individual property (habitation, bank account, personal data, messages, mobile phone, PDA,...)

  • Limited access (secured areas, data bases)

  • Personalization (a device responds only to its owner’s voice)

  • Locate a particular person in an audio-visual document (information retrieval)

  • Who is speaking in a meeting?

  • Is a suspect the criminal? (forensic applications)

2. Taxonomy of the Automatic Speaker Recognition Tasks

  • Speaker verification (voice biometrics)

    • Are you really who you claim to be?

  • Speaker identification (speaker ID):

    • Does this speech segment come from a known speaker?

    • How large is the set of speakers (population of the world)?

  • Speaker detection, segmentation, indexing, retrieval, tracking:

    • Looking for recordings of a particular speaker

  • Combining speech and speaker recognition

    • Adaptation to a new speaker, speaker typology

    • Personalization in dialogue systems

3. Applications

  • Access Control

    • Physical facilities, Computer networks, Websites

  • Transaction Authentication

    • Telephone banking, e-Commerce

  • Speech data Management

    • Voice messaging, Search engines

  • Law Enforcement

    • Forensics, Home incarceration

4. Advantages and Disadvantages

  • Advantages

    • Best-suited biometric modality for use over the telephone

    • Low cost (microphone, A/D), Ubiquity

    • Possible integration on a smart (SIM) card

    • Natural bimodal fusion : speaking face

  • Disadvantages

    • Lack of discretion

    • Possibility of imitation and electronic imposture

    • Lack of robustness to noise, distortion,…

    • Temporal drift

5. Speaker Characteristics in the Speech Signal

  • Differences in

    • Vocal tract shapes and muscular control

    • Fundamental frequency (typical values)

      • 100 Hz (Male), 200 Hz (Female), 300 Hz (Child)

    • Glottal waveform

    • Phonotactics

    • Lexical usage

  • The difference between the voices of twins is a limiting case

  • Voices can also be imitated, disguised and electronically transformed
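As an illustration of one of the characteristics listed above, the fundamental frequency of a voiced frame can be estimated from the peak of its autocorrelation. The sketch below is a minimal example under stated assumptions, not the method used in the presentation: the function name `estimate_f0`, the 60 to 400 Hz search range, and the synthetic test frame are all hypothetical.

```python
import numpy as np

def estimate_f0(signal, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate the fundamental frequency (Hz) of one voiced frame.

    Picks the lag with the largest autocorrelation inside the
    assumed search range [f0_min, f0_max].
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                                   # remove DC offset
    # Autocorrelation for non-negative lags only.
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag_min = int(sample_rate / f0_max)                # shortest period searched
    lag_max = int(sample_rate / f0_min)                # longest period searched
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return sample_rate / best_lag

# Synthetic 40 ms frame: 200 Hz fundamental plus one harmonic (assumed data).
sample_rate = 8000
t = np.arange(int(0.04 * sample_rate)) / sample_rate
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
f0 = estimate_f0(frame, sample_rate)
```

Real speech needs voicing detection and protection against octave errors, which this sketch omits.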

5.1 Speaker Characteristics: different factors

  • Supra-segmental factors (> 30 ms)

    • Speaking speed (timing and rhythm of speech units)

    • Intonation patterns

    • Dialect, accent, pronunciation habits

  • Segmental factors (~30 ms)

    • Glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness)

    • Vocal tract: characterized by its transfer function and represented by MFCCs (Mel-Frequency Cepstral Coefficients)

[Figure: spectral envelope of /i:/ for Speaker A and Speaker B]



5.2 Speaker Characteristics: Acoustic Features

  • Short-term spectral analysis
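Short-term spectral analysis can be sketched as follows: the signal is cut into overlapping frames, each frame is windowed, and a log power spectrum is computed per frame. This is a minimal sketch, not the front end used in the presentation. The 30 ms frame length follows the segmental time scale mentioned earlier; the 10 ms hop, the Hamming window, and the function name `short_term_log_spectra` are assumptions. MFCCs would additionally apply a mel filterbank and a DCT to these spectra.

```python
import numpy as np

def short_term_log_spectra(signal, sample_rate, frame_ms=30.0, hop_ms=10.0):
    """Return an array (n_frames, n_bins) of log power spectra."""
    x = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one analysis frame")
    n_frames = 1 + (len(x) - frame_len) // hop_len
    window = np.hamming(frame_len)                     # taper to reduce leakage
    frames = np.stack([
        x[i * hop_len: i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # one spectrum per frame
    return np.log(power + 1e-10)                       # small floor avoids log(0)

# One second of a 1 kHz tone sampled at 8 kHz (assumed test signal).
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
spectra = short_term_log_spectra(np.sin(2 * np.pi * 1000 * t), sample_rate)
```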

5.3 Speaker Characteristics: Intra- and Inter- Speaker Variability

6.1 How: history

6.2 How: current approaches

6.3 How: HMM Structure is Application Dependent

6.4 How: Gaussian Mixture Models (GMMs)

  • Parametric representation of the probability distribution of observations:
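The formula on the original slide did not survive extraction. As a hedged illustration, a GMM models the density of an observation as a weighted sum of Gaussian components, and a speaker model is scored by the average log-likelihood of the feature vectors. The sketch below assumes diagonal covariance matrices, which the slides do not specify; the function name `gmm_log_likelihood` and all parameter names are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Average per-frame log-likelihood of X under a diagonal-covariance GMM.

    X:         (T, D) feature vectors
    weights:   (M,)   mixture weights summing to 1
    means:     (M, D) component means
    variances: (M, D) component variances (diagonal covariances, assumed)
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    diff = X[:, None, :] - means[None, :, :]                          # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (M,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)  # (T, M)
    log_weighted = np.log(weights) + log_comp

    # Log-sum-exp over components for numerical stability.
    peak = log_weighted.max(axis=1, keepdims=True)
    log_px = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return float(log_px.mean())

# A single standard normal component evaluated at 0 gives log(1/sqrt(2*pi)).
score = gmm_log_likelihood([[0.0]], [1.0], [[0.0]], [[1.0]])
```

In practice the weights, means, and variances would be estimated from training speech, for example with the EM algorithm.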

6.5 How: GMM Example

8 Gaussians per mixture

6.6 How: Decision Theory for Speaker Verification

  • Two types of errors:

    • False rejection (a client is rejected)

    • False acceptance (an impostor is accepted)

  • Decision theory: given an observation O and a claimed identity

    • H0 hypothesis: it comes from an impostor

    • H1 hypothesis: it comes from our client

  • H1 is chosen if and only if P(H1|O) > P(H0|O), which can be rewritten (using Bayes' rule) as:
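The formula that followed on the original slide was lost in extraction. Applying Bayes' rule to both posteriors and cancelling the common factor P(O) gives the standard likelihood-ratio test, which is presumably the form intended:

```latex
P(H_1 \mid O) > P(H_0 \mid O)
\;\Longleftrightarrow\;
P(O \mid H_1)\,P(H_1) > P(O \mid H_0)\,P(H_0)
\;\Longleftrightarrow\;
\frac{P(O \mid H_1)}{P(O \mid H_0)} > \frac{P(H_0)}{P(H_1)}
```

The claim is accepted when the likelihood ratio exceeds a threshold that, in this form, depends only on the prior probabilities of the two hypotheses.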

6.8 How: Decision

6.9 How: Distribution of scores

6.10 How: Detection Error Tradeoff (DET) Curve
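The figure for this slide is not available. As a hedged sketch, the points of a DET curve are the (false acceptance, false rejection) rate pairs obtained by sweeping the decision threshold over client and impostor scores. The function names and the toy scores below are assumptions, and an actual DET plot would additionally draw both axes on a normal-deviate scale.

```python
import numpy as np

def error_rates(client_scores, impostor_scores, threshold):
    """Return (false_rejection_rate, false_acceptance_rate) at a threshold.

    A trial is accepted when its score is >= threshold.
    """
    client = np.asarray(client_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    fr = float(np.mean(client < threshold))       # clients wrongly rejected
    fa = float(np.mean(impostor >= threshold))    # impostors wrongly accepted
    return fr, fa

def det_points(client_scores, impostor_scores):
    """Sweep the threshold over all observed scores; return (t, fr, fa) tuples."""
    thresholds = np.sort(np.concatenate([client_scores, impostor_scores]))
    return [(t, *error_rates(client_scores, impostor_scores, t))
            for t in thresholds]

def equal_error_rate(client_scores, impostor_scores):
    """Approximate EER: the operating point where FR and FA are closest."""
    _, fr, fa = min(det_points(client_scores, impostor_scores),
                    key=lambda p: abs(p[1] - p[2]))
    return (fr + fa) / 2.0

# Toy scores (assumed, for illustration only).
clients = [2.0, 3.0, 4.0, 5.0]
impostors = [0.0, 1.0, 2.5, 1.5]
eer = equal_error_rate(clients, impostors)
```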

7. Evaluation

  • Decision cost (FA, FR, priors, costs,…)

  • Reference systems (open software)

    • Torch, a machine learning library

    • ALIZE

    • BECARS

  • Evaluations (algorithms, field trials, ergonomics,…)

    • NIST Speaker detection campaigns
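The decision cost mentioned above is commonly written as a weighted sum of the two error rates, with weights given by the error costs and the prior probability of a target (client) trial. The sketch below is a minimal illustration of that form; the function name `detection_cost` and the default values of `p_target`, `c_miss`, and `c_fa` are illustrative assumptions, not values taken from the slides.

```python
def detection_cost(p_miss, p_fa, p_target=0.01, c_miss=10.0, c_fa=1.0):
    """Expected cost of a verification system at one operating point.

    p_miss:   false rejection rate (client rejected)
    p_fa:     false acceptance rate (impostor accepted)
    p_target: assumed prior probability that a trial is a true client
    c_miss:   assumed cost of a false rejection
    c_fa:     assumed cost of a false acceptance
    """
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

# Example operating point: 10% false rejections, 2% false acceptances.
cost = detection_cost(p_miss=0.10, p_fa=0.02)
```

Changing the costs or the prior moves the threshold that minimizes this cost, which is why an evaluation has to state them explicitly.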

7.1 Evaluations: National Institute of Standards & Technology (NIST)

  • Annual evaluation since 1995

  • Common paradigm for comparing technologies

7.2 Evaluations: NIST 2004

8. Voice Transformations and Forgery (occasional, dedicated)

  • Isolated individuals with few resources, or “professional impostors” with a dedicated budget, can threaten the security of speaker recognition systems

  • Voice transformation technologies (e.g. segmental synthesis using an inventory of client speech data) are now available

  • Speaker recognition research should explicitly address this forgery issue and define appropriate countermeasures

    • Prevention by predicting many different forgery scenarios


Speaking Faces: Motivations

A person speaking in front of a camera offers two modalities for identity verification (speech and face).

The sequence of face images and the synchronisation of speech and lip movements could be exploited.

Imposture is much more difficult than with single modalities.

Many PCs, PDAs, and mobile phones are equipped with a camera. Audio-Visual Identity Verification will offer non-intrusive security for e-commerce, e-banking,…

9.1 Speaking faces: Audio-Visual Approach


A talking face model

Using Hidden Markov Models (HMMs)

Each state of the model generates a sequence of feature vectors

10. Conclusions and Perspectives

  • Deliberate forgery is a challenge for speech-only systems

  • Verification of identity based on features extracted from talking faces should be developed

  • Common databases and evaluation protocols are necessary

  • Free access to reference systems and databases will facilitate future developments

  • Apply this paradigm to more than audio-visual modalities => see BioSecure-NoE
