Speaker Recognition

Speaker Recognition. G. CHOLLET, G. GRAVIER, J. KHARROUBI, D. PETROVSKA-DELACRETAZ. (chollet, kharroub, petrovsk)@tsi.enst.fr, ggravier@infres.enst.fr. ENST/CNRS-LTCI, 46 rue Barrault, 75634 PARIS cedex 13. http://www.tsi.enst.fr/~chollet


Presentation Transcript


  1. Speaker Recognition • G. CHOLLET, G. GRAVIER, J. KHARROUBI, D. PETROVSKA-DELACRETAZ • (chollet, kharroub, petrovsk)@tsi.enst.fr, ggravier@infres.enst.fr • ENST/CNRS-LTCI, 46 rue Barrault, 75634 PARIS cedex 13 • http://www.tsi.enst.fr/~chollet

  2. Our affiliations • ENST: Ecole Nationale Supérieure des Télécommunications, http://www.enst.fr • CNRS: Centre National de la Recherche Scientifique, http://www.cnrs.fr • LTCI: Laboratoire de Traitement et Communication de l’Information, http://www.enst.fr/ura/ura.html

  3. What is ENST? Ecole Nationale Supérieure des Télécommunications • Classed among the ‘Grandes Ecoles d'Ingénieurs’ • 250 state-certified engineers graduate each year • Part of the ‘Groupement des Ecoles de Télécommunications’

  4. Modalities for Identity Verification [Figure labels: PIN 111111111; SECURED SPACE; Bla-bla]

  5. Modalities for Identity Verification • A device you own (key, smart card, …) • A code you remember (password, …) • Could be lost or stolen • Physiological characteristics: • Face, iris, fingerprint, hand shape, … • Need special equipment • Behavioral characteristics: • Speech, signature, keystroke, … • Speech is the preferred modality over the telephone (but a ‘voice print’ is much more variable than a fingerprint)

  6. Outline • Where is the information about speaker identity in the speech signal? • How well can humans recognize a speaker? • Applications of Speaker Recognition • Prior knowledge of what the speaker said • Combining Speech Recognition and Speaker Verification • Some research activities at ENST: • Speaker verification: • The CAVE-PICASSO projects (text dependent) • The ELISA consortium, NIST evaluations (text independent) • The EUREKA !2340 MAJORDOME project • Multimodal Identity Verification: • The M2VTS and BIOMET projects • Perspectives

  7. Speaker Identity in Speech • Differences in • Vocal tract shapes and muscular control • Fundamental frequency (typical values) • 100 Hz (Male), 200 Hz (Female), 300 Hz (Child) • Glottal waveform • Phonotactics • Lexical usage • The difference between the voices of twins is a limiting case • Voices can also be imitated or disguised
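
As a rough illustration of how the fundamental frequency listed on this slide can be measured, the sketch below estimates F0 by picking the autocorrelation peak. Autocorrelation is an assumption chosen for brevity, not a method named in the presentation; the function name `estimate_f0`, the 50 to 400 Hz search range and the synthetic 100 Hz test tone are all hypothetical.

```python
import math


def estimate_f0(samples, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate F0 in Hz as the lag with the largest autocorrelation."""
    min_lag = int(sample_rate / f_max)  # shortest period searched
    max_lag = int(sample_rate / f_min)  # longest period searched
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation over the overlapping part of the frame.
        value = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


# Synthetic stand-in for a voiced frame: a pure 100 Hz tone, 0.1 s at 8 kHz.
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 100.0 * n / sample_rate) for n in range(800)]
f0 = estimate_f0(tone, sample_rate)
```

Real speech is far less regular than a pure tone, so practical pitch trackers add voicing detection and smoothing on top of a step like this.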

  8. Speaker Identity • Suprasegmental factors • speaking speed (timing and rhythm of speech units) • intonation patterns • dialect, accent, pronunciation habits • Segmental factors (~30 ms) • glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness) • vocal tract: formant frequencies and bandwidths [Figure: spectral envelope of /i:/ for Speaker A and Speaker B; axes labelled A and f]

  9. Inter-speaker Variability • Example utterance: "We were away a year ago."

  10. Intra-speaker Variability • Example utterance: "We were away a year ago."

  11. Vocal Apparatus

  12. Speech production

  13. Glottal Waveform Modeling • Fitting a glottal pulse model to the excitation waveform allows perceptually relevant modifications to voice quality [Figure: original residual in blue, synthetic residual in red; axes labelled A and t]

  14. Applications of Speaker Recognition • Identification from an open set (unrealistic) • Identification from a closed set (who is speaking in a videoconference?) • Verification of a claimed identity (risk of deliberate imposture) • Human performance in speaker recognition is far from perfect (it depends heavily on familiarity with the speaker)

  15. Speaker Verification • Typology of approaches (EAGLES Handbook) • Text dependent • Public password • Private password • Customized password • Text prompted • Text independent • Incremental enrolment • Evaluation

  16. What are the sources of difficulty? • Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions, …) • Recording conditions (filtering, noise, …) • Temporal drift • Intentional imposture • Voice disguise

  17. Text-dependent Speaker Verification • Uses Automatic Speech Recognition techniques (DTW, HMM, …) • Client model adaptation from speaker independent HMM (‘World’ model) • Synchronous alignment of client and world models for the computation of a score.

  18. Dynamic Time Warping (DTW)
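
The slide gives only the title, so the following is a minimal sketch of the textbook DTW recursion rather than the authors' implementation. It assumes each utterance is a list of feature vectors (for example cepstral frames), uses Euclidean frame distance, and allows the three basic local moves; the function name `dtw_distance` is hypothetical.

```python
import math


def dtw_distance(reference, test):
    """Cumulative DTW cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(reference), len(test)
    # cost[i][j] = best cumulative cost aligning reference[:i] with test[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(reference[i - 1], test[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # reference frame repeated
                                     cost[i][j - 1],      # test frame repeated
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# The second sequence holds the middle frame twice as long; DTW absorbs it.
template = [[1.0], [2.0], [3.0]]
slower = [[1.0], [2.0], [2.0], [3.0]]
distance = dtw_distance(template, slower)
```

In a text-dependent setting, a low cumulative cost between an access utterance and the client's enrolment template would count as evidence for the claimed identity.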

  19. HMM structure depends on the application

  20. Signal detection theory
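
The transcript keeps only the slide title. As a hedged reminder of the standard formulation (the symbols below are assumptions, not taken from the slide), verification can be framed as a test between the hypothesis that the utterance X comes from the claimed client and the hypothesis that it comes from someone else:

```latex
\Lambda(X) = \log p(X \mid \lambda_{\mathrm{client}}) - \log p(X \mid \lambda_{\mathrm{world}}),
\qquad
\text{accept if } \Lambda(X) \ge \theta, \quad \text{reject otherwise.}
```

Moving the threshold \theta trades false acceptances (an impostor is accepted) against false rejections (the true client is rejected).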

  21. Score normalisation • World model • Cohort normalisation • Discriminant techniques
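
One possible reading of cohort normalisation, sketched under assumptions: the score obtained against the claimed client model is standardised by the mean and spread of the scores the same utterance obtains against a cohort of other speakers' models. The presentation does not give a formula, and the name `cohort_normalise` is hypothetical.

```python
import statistics


def cohort_normalise(claimed_score, cohort_scores):
    """Standardise a verification score using cohort-model scores."""
    mean = statistics.mean(cohort_scores)
    spread = statistics.pstdev(cohort_scores)
    if spread == 0.0:
        # Degenerate cohort: fall back to a plain mean offset.
        return claimed_score - mean
    return (claimed_score - mean) / spread


# Scores of one test utterance against the claimed model and three cohort models.
normalised = cohort_normalise(-41.0, [-48.0, -50.0, -52.0])
```

The aim is that a single decision threshold can then be shared across clients and recording conditions.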

  22. Detection Error Tradeoff (DET) Curve
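
A DET curve plots the false rejection rate against the false acceptance rate as the decision threshold varies. The sketch below computes those operating points and the equal error rate (EER) from two lists of scores; the function names and the toy scores are hypothetical, and the EER is approximated at the threshold where the two rates are closest.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Return (false acceptance rate, false rejection rate) at a threshold."""
    false_rejects = sum(score < threshold for score in target_scores)
    false_accepts = sum(score >= threshold for score in impostor_scores)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(target_scores))


def det_points(target_scores, impostor_scores):
    """One (threshold, FA rate, FR rate) triple per distinct score value."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    return [(t, *error_rates(target_scores, impostor_scores, t))
            for t in thresholds]


def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER at the point where FA and FR rates are closest."""
    _, fa, fr = min(det_points(target_scores, impostor_scores),
                    key=lambda point: abs(point[1] - point[2]))
    return (fa + fr) / 2.0


targets = [2.0, 3.0, 4.0, 5.0]    # scores from true-client trials
impostors = [0.0, 1.0, 1.5, 2.5]  # scores from impostor trials
eer = equal_error_rate(targets, impostors)
```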

  23. CAVE – PICASSO http://www.picasso.ptt-telecom.nl/project/

  24. Incremental enrolment of customised password • The client chooses his password using some feedback from the system. • The system attempts a phonetic transcription of the password. • Incremental enrolment is achieved on further repetitions of that password. • Speaker-independent phone HMMs are adapted with the client enrolment data. • Synchronous alignment likelihood ratio scoring is performed on access trials.

  25. Deliberate imposture • The impostor has some recordings of the target client's voice. He can record the same sentences and align these speech signals with the recordings of the client. • A transformation (Multiple Linear Regression) is computed from these aligned data. • The impostor has heard the target client's password. • He records that password and applies the transformation to this recording. • The PICASSO reference system, which has less than 1% EER, is defeated by this procedure (more than 30% EER)

  26. Speaker Verification (text independent) • The ELISA consortium • ENST, LIA, IRISA, ... • http://www.lia.univ-avignon.fr/equipes/RAL/elisa/index_en.html • NIST evaluations • http://www.nist.gov/speech/tests/spk/index.htm • Ergodic HMM • Gaussian Mixture Model

  27. Gaussian Mixture Model • Parametric representation of the probability distribution of observations:
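
The formula that followed on the slide did not survive in the transcript. The standard GMM density, given here as an assumption about what was shown, is a weighted sum of M Gaussian components with weights w_i, means \mu_i and covariances \Sigma_i for a D-dimensional observation x:

```latex
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x;\, \mu_i, \Sigma_i),
\qquad \sum_{i=1}^{M} w_i = 1, \quad w_i \ge 0,
\\[4pt]
\mathcal{N}(x;\, \mu_i, \Sigma_i) =
\frac{1}{(2\pi)^{D/2}\, |\Sigma_i|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \right)
```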

  28. Gaussian Mixture Models • 8 Gaussians per mixture

  29. National Institute of Standards & Technology (NIST) Speaker Verification Evaluations • Annual evaluation since 1995 • Common paradigm for comparing technologies

  30. GMM speaker modeling • World data → Front-end → GMM modeling → World GMM model • Target speaker → Front-end → GMM model adaptation → Target GMM model
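
The slide does not say which adaptation scheme produces the target model. One common choice, assumed here, is MAP adaptation of the component means only: each world-model mean is pulled towards the target speaker's data in proportion to how much data that component explains. The sketch uses scalar features to stay short; `map_adapt_means` and the `relevance` factor are hypothetical names.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian."""
    return (math.exp(-0.5 * (x - mean) ** 2 / variance)
            / math.sqrt(2.0 * math.pi * variance))


def map_adapt_means(weights, means, variances, data, relevance=16.0):
    """Return world-model means adapted towards the target speaker data."""
    count = len(weights)
    occupancy = [0.0] * count     # soft number of frames per component
    weighted_sum = [0.0] * count  # posterior-weighted sum of frames
    for x in data:
        likelihoods = [w * gaussian_pdf(x, m, v)
                       for w, m, v in zip(weights, means, variances)]
        total = sum(likelihoods)
        if total == 0.0:
            continue  # frame too far from every component to assign
        for i in range(count):
            posterior = likelihoods[i] / total
            occupancy[i] += posterior
            weighted_sum[i] += posterior * x
    adapted = []
    for i in range(count):
        if occupancy[i] == 0.0:
            adapted.append(means[i])  # no data seen: keep the world mean
            continue
        alpha = occupancy[i] / (occupancy[i] + relevance)
        data_mean = weighted_sum[i] / occupancy[i]
        adapted.append(alpha * data_mean + (1.0 - alpha) * means[i])
    return adapted


# One-component world model at 0.0; four enrolment frames at 2.0.
target_means = map_adapt_means([1.0], [0.0], [1.0], [2.0] * 4, relevance=4.0)
```

With little enrolment data the adapted means stay close to the world model, which is the motivation for adapting rather than training the target model from scratch.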

  31. Baseline GMM method • Test speech → Front-end → scoring against the hypothesized target GMM model and the world GMM model → Λ = LLR score
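
The log-likelihood ratio (LLR) score of the baseline can be sketched as the average, over test frames, of the difference between the log-likelihood under the hypothesized target GMM and under the world GMM. This is a minimal sketch with scalar features; the function names and the `(weights, means, variances)` model layout are assumptions made for the example.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log density of scalar x under a one-dimensional GMM (log-sum-exp)."""
    terms = [math.log(w) - 0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(term - peak) for term in terms))


def llr_score(frames, target_model, world_model):
    """Average per-frame log-likelihood ratio, target model versus world model."""
    total = sum(gmm_log_likelihood(x, *target_model)
                - gmm_log_likelihood(x, *world_model)
                for x in frames)
    return total / len(frames)


# Each model is a (weights, means, variances) triple.
target_gmm = ([1.0], [1.0], [1.0])
world_gmm = ([1.0], [0.0], [1.0])
score = llr_score([1.0], target_gmm, world_gmm)
```

A positive score means the test frames are better explained by the target model than by the world model; the accept or reject decision then compares the score with a threshold.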

  32. Support Vector Machines and Speaker Verification • A hybrid GMM-SVM system is proposed • The SVM scoring model is trained on development data to separate true-target speaker accesses from impostor accesses, using a new feature representation based on GMMs [Diagram labels: GMM Modeling; Scoring; SVM]

  33. SVM principles • Separating hyperplanes H, with the optimal hyperplane Ho [Figure labels: Input space, X; Feature space, y(X); Class(X); H; Ho]

  34. Results

  35. Combining Speech Recognition and Speaker Verification. • Speaker independent phone HMMs • Selection of segments or segment classes which are speaker specific • Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)

  36. Selection of nasals in words in -ing • being, everything, getting, anything, thing, something, things, going

  37. «MAJORDOME» Unified Messaging System • Eureka project no. 2340 • Partners: Vecsys, EDF, Software602, KTH, Mensatec, UPC, Airtel • D. Bahu-Leyser, G. Chollet, K. Hallouli, J. Kharroubi, L. Likforman, D. Mostefa, D. Petrovska, M. Sigelle, P. Vaillant

  38. Majordome’s Functionalities • Speaker verification • Dialogue • Routing • Updating the agenda • Automatic summary [Figure labels: MAJORDOME; Voice; Fax; E-mail]

  39. Voice technology in Majordome • Server side background tasks: continuous speech recognition applied to voice messages upon reception • Detection of sender’s name and subject • User interaction: • Speaker identification and verification • Speech recognition (receiving user commands through voice interaction) • Text-to-speech synthesis (reading text summaries, E-mails or faxes)

  40. BIOMET [Figure labels: PIN 111111111; SECURED SPACE; Bla-bla]

  41. BIOMET • An extension of the M2VTS and DAVID projects to include such modalities as signature, fingerprint, and hand shape. • Initial support (two years) is provided by GET (Groupement des Ecoles de Télécommunications). • Emphasis will be on fusion of scores obtained from two or more modalities.
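
The slide does not describe the fusion rule. A weighted sum of per-modality scores is one simple possibility, sketched here under the assumption that the scores have already been normalised to a comparable range; `fuse_scores`, the modality names and the weights are all hypothetical.

```python
def fuse_scores(scores, weights):
    """Weighted-average fusion of per-modality verification scores."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same modalities")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in scores) / total_weight


# Normalised scores from two modalities, with speech trusted more than signature.
fused = fuse_scores({"speech": 1.0, "signature": 3.0},
                    {"speech": 3.0, "signature": 1.0})
```

In practice the weights would be tuned on development data, or replaced by a trained classifier operating on the score vector.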

  42. Conclusions and Perspectives • Evaluation trials (as conducted by NIST) help improve technology. • A strategy combining speech recognition and segmental scoring seems to be a promising approach for speaker verification. • Whenever possible, text-independent speaker verification should be confirmed by text-dependent verification. • Whenever possible, fusion of multiple experts (preferably multimodal) should be performed.
