
Time-Varying Robustness of Speaker Recognition Systems


Presentation Transcript


  1. Thomas Fang Zheng, 16 Sep. 2014, Nanyang Technological University, Singapore. Time-Varying Robustness of Speaker Recognition Systems

  2. Outline: Introduction; Creation of Time-varying Voiceprint Database; The Discrimination-emphasized Frequency-warping Method; Experimental Results; Summary

  3. On Biometric Recognition • Information technology has made people live in quite a different style: more convenience and richer information, • yet less and less privacy and safety because of illegal access through unsafe access control, • so biometric recognition is becoming more and more popular. • Biometric recognition refers to technologies for measuring and analyzing a person's physiological or behavioral characteristics, which can be used to verify or identify a person. • The term "biometrics" is derived from the Greek words bio (life) and metric (to measure).

  4. Examples of biometrics • Fingerprint • Face • Palmprint • Hand geometry • Iris • Retina scan • DNA • Signatures • Gait/gesture • Keystroke • Voiceprint

  5. Rich information contained in speech • Where is he/she from? → Accent Recognition • What was spoken? → Speech Recognition • What language was spoken? → Language Recognition • Positive? Negative? Happy? Sad? → Emotion Recognition • Male or female? → Gender Recognition • Who spoke? → Speaker Recognition

  6. Speaker recognition / Voiceprint recognition • Speaker recognition (or voiceprint recognition) is the process of automatically identifying or verifying the identity of a person from his/her voice, using the characteristic vocal information included in speech. It enables voice-based access control for various services. [Kunzel 94][Furui 97] • Various applications: • Access control (e.g., security control for confidential information, remote access to computers, information and reservation services); • Transaction authentication (e.g., telephone banking, telephone shopping); • Security and forensic prospects (e.g., public security, criminal verification); • Rich transcription for conference meetings (e.g., "who spoke when" and "who spoke what" speaker diarization); • etc.

  7. Speaker recognition categories • Speaker Identification • Determining which identity in a specified speaker set is speaking during a given speech segment. • Closed-set / Open-set • Speaker Verification • Determining whether a claimed identity is speaking during a speech segment. It is a binary decision task. • Speaker Detection • Determining whether a specified target speaker is speaking during a given speech segment. • Speaker Tracking (Speaker Diarization = Who Spoke When) • Performing speaker detection as a function of time, giving the timing indices of the specified speaker.

  8. Performance evaluation (for verification and open-set identification) • Detection Error Trade-off (DET) Curve • A plot of error rates for binary classification systems: false rejection rate (FRR) vs. false acceptance rate (FAR). • Equal Error Rate (EER) • The error rate at the point on a DET curve where FAR and FRR are equal. • Minimum Detection Cost Function (MinDCF) • Cdet = Cmiss × Pmiss × Ptarget + CFalseAlarm × PFalseAlarm × (1 − Ptarget). A small scoring sketch follows.
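
To make these metrics concrete, here is a minimal sketch (not from the original slides) that computes EER and MinDCF from arrays of target and impostor scores; the cost parameters Cmiss = 10, CFalseAlarm = 1, and Ptarget = 0.01 are the common NIST SRE defaults, assumed here since the slide does not fix them:

```python
import numpy as np

def eer_and_min_dcf(target_scores, impostor_scores,
                    c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Sweep a decision threshold over all observed scores and return
    (EER, MinDCF). Cost parameters are assumed NIST SRE defaults."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    eer, min_dcf, best_gap = 1.0, np.inf, np.inf
    for t in thresholds:
        p_miss = np.mean(target_scores < t)    # false rejection rate (FRR)
        p_fa = np.mean(impostor_scores >= t)   # false acceptance rate (FAR)
        dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
        min_dcf = min(min_dcf, dcf)
        if abs(p_miss - p_fa) < best_gap:      # EER: where FRR == FAR
            best_gap, eer = abs(p_miss - p_fa), (p_miss + p_fa) / 2.0
    return eer, min_dcf
```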

  9. Open issues for speaker recognition research [Furui 1997] 1. How can human beings correctly recognize speakers? 2. Is it useful to study the mechanism of speaker recognition by human beings? 3. Is it useful to study the physiological mechanism of speech production to get new ideas for speaker recognition? 4. What feature parameters are appropriate for speaker recognition? 5. How can we fully exploit the clearly evident encoding of identity in prosody and other supra-segmental features of speech? 6. Is there any feature that can separate speakers whose voices sound identical, such as twins or imitators? 7. How do we deal with long-term variability in people's voices (ageing)? 8. How do we deal with short-term alteration due to illness, emotion, fatigue, …? 9. What are the conditions that speaker recognition must satisfy to be practical? 10. What about combining speech and speaker recognition? Furui, S., "Recent advances in speaker recognition," Pattern Recognition Letters 18 (1997) 859-872

  10. Factors affecting the speaker recognition system performance: • The quality of the speech signal; • The length of the training speech signal; • The length of the testing speech signal; • The size of the population tested by the system; • The phonetic content of the speech signal; • ...

  11. Key issues for ROBUST speaker recognition • Cross channel • Multiple speakers • Background noise • Emotions • Short utterances • Time-varying (or ageing)

  12. The time-varying (or ageing) problem • In all these typical applications, training and testing are usually separated by some period of time (a TIME GAP), which poses a possible threat to speaker recognition systems.

  13. Open questions on the ageing problem • "Does the voice of an adult change significantly with time? If so, how?" [Kersta 1962] • "How to deal with the long-term variability in people's voices? Is there any systematic long-term variation that could help update speaker models to cope with the gradual changes in people's voices?" [Furui 1997] • "Voice changes over time, either in the short term (at different times of day), the medium term (times of the year), or in the long term (with age)." [Bonastre et al. 2003]

  14. Observations • Performance degrades in the presence of time intervals: the longer the separation between the training and testing recordings, the worse the performance. [Soong et al. 1985] • A significant loss in accuracy (4–5% in EER) between two sessions separated by 3 months was reported [Kato & Shimizu 2003], and ageing was considered to be the cause [Hebert 2008]. • Few researchers have figured out the exact reasons behind this time-varying phenomenon.

  15. More enrollment data -- a solution? • Using training data with a larger time span. [Markel 1979] Performance can be improved; however, the enrollment is quite time-consuming! What's more, in some situations it is impractical to obtain such data! • Accepted testing/recognition speech segments can be augmented to the previous enrollment data to retrain the speaker model. [Beigi 2009, Beigi 2010] Performance can be improved too, but the initial training data must be kept for later use, which is storage-consuming!

  16. Ageing-dependent decision boundary -- a solution? • Using an ageing-dependent decision boundary in the score domain. [Kelly 2011, Kelly 2012] • Performance can be improved. The problem is: how can the time lapse be determined in practice?

  17. Model updating (adaptation) -- a solution? • A simple and straightforward way [Lamel 2000, Beigi 2009, Beigi 2010]: update speaker models from time to time. • It is effective at maintaining representativeness. • Obviously, it is costly, user-unfriendly, and sometimes perhaps unrealistic.

  18. Efforts in the frequency domain … • The most essential way to stabilize performance is to extract acoustic features that are speaker-specific and, further, stable across sessions. This has long been more of a dream! • A practical alternative is to fold such findings into existing techniques… • NUFCC [Lu & Dang 2007]: assign frequency bands different resolutions according to their discrimination sensitivity for speaker-specific information, which is a good idea!

  19. The idea of frequency warping! • To emphasize frequency bands that are more sensitive to speaker-specific information, yet not so sensitive to time-related session-specific information. • To identify frequency bands that reveal high discrimination sensitivity for speaker-specific information but low discrimination sensitivity for session-specific information. • Once these frequency bands are identified, something can be done to emphasize them: the Discrimination-emphasized Frequency-warping method.

  20. Outline: Introduction; Creation of Time-varying Voiceprint Database; The Discrimination-emphasized Frequency-warping Method; Experimental Results; Summary

  21. MARP Corpus • A proper long-time-span database is necessary, in which time-related variability is the only focus. • The MARP corpus has been the only one published so far [Lawson 2009], though it contains other sources of variability. • The MARP corpus: 32 participants, 672 sessions from June 2005 to March 2008; 10 minutes of free-flowing conversation per session. • "While the impact on speaker recognition accuracy between any two sessions is considerable, the long-term trend is statistically quite small." • "The detrimental impact is clearly not a function of ageing or of the voice changing within this timeframe."

  22. • In free-flowing conversations, the speech content is not fixed, and a speaker's emotion, speaking style, or engagement can easily be influenced by his/her partner. • Hence, the creation of a voiceprint database that focuses specifically on the time-varying effect in speaker recognition is imperative for both research and practical applications.

  23. Database design principles • The time-varying effect is the only focus; therefore other factors should be kept as constant as possible throughout all recording sessions, including: • recording equipment, software, conditions, environment, and so on. • In the database design, two major factors were carefully considered: • prompt text design, and • time interval design.

  24. Factor I: Fixed prompt texts • Speakers were requested to read fixed prompt texts rather than hold free-style conversations. • The prompt texts were designed to remain unchanged throughout all recording sessions, • to avoid, or at least reduce, the impact of speech content on speaker recognition accuracy. • They take the form of sentences and isolated words.

  25. • 100 Chinese sentences and 10 isolated Chinese words. • The length of each sentence ranges from 8 to 30 Chinese characters, with an average of 15. • Each isolated Chinese word contains 2 to 5 Chinese characters and was read five times in each session. • Of the 10 isolated words, 5 were kept unchanged throughout all sessions, just like the sentences, while • the other 5 changed from session to session and are reserved for future research on other topics.

  26. • Database statistics: Table 1. Acoustic coverage of prompt texts

  27. Factor II: Gradient time intervals • There was no precedent time-interval design to refer to. • We did not use a fixed time interval, because recording at a fixed-length interval more than 10 times just to reveal a possible trend would be costly and unnecessary. • Instead, gradient time intervals were used: initial sessions had shorter time intervals, while later sessions had longer and longer intervals. • The impacts of different time intervals can thus be easily analyzed, and some labor costs are reduced.

  28. • Plan: 16 sessions in total, from January 2010 to 2012. • Five different time intervals are used: one week, one month, two months, four months, and half a year, as illustrated in the figure below. • The design of the time intervals deliberately avoids recordings during summer or winter vacations (university holidays). • In actual recording it is unrealistic to have all speakers record on exactly one specific day, so each session day is relaxed to a session interval. Figure 1. Illustration of different time intervals and session days

  29. Speakers • 60 freshman students: 30 male, 30 female. • Today, about 50 valid speakers remain, as some dropped out midway. • Speakers were born between 1989 and 1993, with a majority born in 1990. • They come from various departments, such as computer science, biology, English, humanities, and journalism. • All of them speak standard Chinese well.

  30. Recording conditions • An ordinary laboratory room was used for recording: no burst noise, only low-level environmental noise. • Speakers were asked to read the prompt texts at a normal speaking rate, while the volume could be controlled through the recording software. • Most of the speakers could complete a session smoothly in about 25 minutes. • Speech signals are digitized at both 8 kHz and 16 kHz sampling rates simultaneously, with 16-bit precision. • 15 recording sessions have been finished so far; the last one will be finished by the end of 2012.

  31. Database evaluation -- a first and quick look • Experimental setup: a 1024-mixture GMM-UBM system with 32-dim MFCCs (a baseline sketch follows the figure caption). • Experimental results: the system performs best when training and testing utterances are taken from the same session; however, performance gets worse and worse as the recording-date difference between training and testing grows. Figure 2. EER curves when using different sessions for model training
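
For reference, a minimal sketch of this kind of GMM-UBM baseline; the 1024-mixture, 32-dim MFCC setup comes from the slide, while the scikit-learn modeling, the simplified adaptation, and the placeholder data are assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder features: rows are frames, columns the 32 MFCC dims.
background_mfccs = rng.standard_normal((5000, 32))
enroll_mfccs = rng.standard_normal((2000, 32))
test_mfccs = rng.standard_normal((300, 32))

n_mix = 1024  # from the slide; reduce for a quick toy run
ubm = GaussianMixture(n_components=n_mix,
                      covariance_type='diag').fit(background_mfccs)

# Simplified speaker "adaptation": re-fit starting from the UBM
# parameters (true MAP adaptation would update the means only).
spk = GaussianMixture(n_components=n_mix, covariance_type='diag',
                      weights_init=ubm.weights_, means_init=ubm.means_)
spk.fit(enroll_mfccs)

# Verification score: average per-frame log-likelihood ratio.
llr = spk.score(test_mfccs) - ubm.score(test_mfccs)
print("accept" if llr > 0.0 else "reject")  # 0.0 is a placeholder threshold
```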

  32. Outline: Introduction; Creation of Time-varying Voiceprint Database; The Discrimination-emphasized Frequency-warping Method; Experimental Results; Summary

  33. How to find IMPORTANT frequency bands? • The proposed solution is to highlight, during feature extraction, the frequency bands that reveal high discrimination sensitivity for speaker-specific information but low discrimination sensitivity for session-specific information. • How to determine the discrimination sensitivity of each frequency band? The F-ratio serves as the criterion for producing the discrimination scores. • How to highlight the target frequency bands? By weighting the filter-bank outputs, or by frequency warping on the basis of the mel scale or hertz scale.

  34. F-ratio [Wolf 1972] • The ratio of the between-group variance to the within-group variance (in symbols below). • A higher F-ratio value means the feature separates the target groups better; that is to say, a feature with a higher F-ratio possesses higher discrimination sensitivity with respect to the target grouping.
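
In symbols (a standard formulation consistent with [Wolf 1972]; the slide gives only the verbal definition): with M groups, N_i samples in group i, x_ij the feature value of sample j in group i, μ_i the group means, and μ̄ the global mean,

```latex
F \;=\; \frac{\text{between-group variance}}{\text{within-group variance}}
  \;=\; \frac{\frac{1}{M}\sum_{i=1}^{M}\left(\mu_i - \bar{\mu}\right)^{2}}
             {\frac{1}{M}\sum_{i=1}^{M}\frac{1}{N_i}\sum_{j=1}^{N_i}\left(x_{ij} - \mu_i\right)^{2}}
```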

  35. For F-ratio calculation in time-varying speaker recognition tasks • There exist two kinds of grouping: by speakers within each session, and by sessions within each speaker. • The whole frequency range is divided uniformly into K frequency bands. • Linear-frequency-scale triangular filters are used to process the power spectrum of the utterances. • Two F-ratio values are calculated for each frequency band (a computational sketch follows Figure 3).

  36. Figure 3. An illustration of the two kinds of grouping
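
A minimal sketch of that per-band computation (the data layout, with band-k energies indexed by speaker and session, is an assumed convention; the slides do not fix one). The two groupings of Figure 3 reduce to swapping which index forms the groups:

```python
import numpy as np

def f_ratio(groups, eps=1e-12):
    """groups: list of 1-D arrays, one per group, holding band-k
    filter-bank energies. Returns the between-group variance over
    the average within-group variance."""
    means = np.array([g.mean() for g in groups])
    between = means.var()
    within = np.mean([g.var() for g in groups])
    return between / (within + eps)

def speaker_and_session_f_ratios(energies):
    """energies[s][j]: 1-D array of band-k energies for speaker s,
    session j (assumed layout). Returns (F_speaker, F_session)."""
    n_spk, n_ses = len(energies), len(energies[0])
    # Grouping 1: by speakers within each session, averaged over sessions.
    f_spk = np.mean([f_ratio([energies[s][j] for s in range(n_spk)])
                     for j in range(n_ses)])
    # Grouping 2: by sessions within each speaker, averaged over speakers.
    f_ses = np.mean([f_ratio([energies[s][j] for j in range(n_ses)])
                     for s in range(n_spk)])
    return f_spk, f_ses
```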

  37. For each frequency band k, a discrimination score is defined from its two F-ratio values (the defining equation is not reproduced in this transcript). Note: our previous experiments used a different definition, which was not as good.
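
Since the defining equation did not survive the transcript, the form below is only one plausible instantiation consistent with the stated goal (reward a high speaker F-ratio, penalize a high session F-ratio); it is an assumption, not the slide's actual definition:

```python
def discrimination_score(f_spk_k, f_ses_k, eps=1e-8):
    # Assumed form: a band scores high when it separates speakers well
    # (large f_spk_k) but is insensitive to sessions (small f_ses_k).
    # The slide's exact equation may differ.
    return f_spk_k / (f_ses_k + eps)
```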

  38. How to EMPHASIZE the frequency bands? • A straightforward idea is to weight! • Weight each frequency band's filter output during the MFCC calculation based on its discrimination score -- referred to as Weighted MFCC (WMFCC); a sketch follows.
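
A sketch of the weighting step, assuming librosa for the spectral front end; the score-to-weight mapping (a simple mean normalization) is also an assumption, since the slide only says the weights are based on the discrimination scores:

```python
import numpy as np
import scipy.fftpack
import librosa

def wmfcc(y, sr, band_scores, n_fft=512, n_mels=40, n_mfcc=32):
    """Weighted MFCC sketch: scale each mel filter-bank output by a
    weight derived from its discrimination score before log + DCT.
    band_scores must have one entry per mel band."""
    S = np.abs(librosa.stft(y, n_fft=n_fft)) ** 2             # power spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_energies = mel_fb @ S                                 # (n_mels, frames)
    weights = band_scores / band_scores.mean()                # assumed mapping
    weighted = mel_energies * weights[:, None]
    return scipy.fftpack.dct(np.log(weighted + 1e-10),
                             axis=0, norm='ortho')[:n_mfcc]
```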

  39. Another try -- frequency warping! • Another way is to warp; the strategy could be: • uniform warping of those target frequency bands whose discrimination scores are above a threshold, or • non-uniform warping of the whole frequency range according to the discrimination scores (one possible realization is sketched below). Figure 4. The relationship among Hz, the mel scale, and the MFW scale
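
One way to realize the non-uniform strategy (a sketch under assumptions; the slide's MFW scale may be defined differently): treat the discrimination scores as a density, integrate them into a monotone warping curve, and place filter centers uniformly on that curve, so that high-scoring regions receive more, narrower filters:

```python
import numpy as np

def warped_filter_centers(band_scores, f_max=4000.0, n_filters=40):
    """Map discrimination scores over K uniform bands to filter center
    frequencies in Hz; higher-scoring regions get denser filters."""
    K = len(band_scores)
    band_edges = np.linspace(0.0, f_max, K + 1)
    # The cumulative score acts as the warping function w(f).
    w = np.concatenate([[0.0], np.cumsum(band_scores)])
    w = w / w[-1]                                  # normalize to [0, 1]
    # Place centers uniformly on the warped axis, invert back to Hz.
    targets = np.linspace(0.0, 1.0, n_filters + 2)
    return np.interp(targets, w, band_edges)[1:-1]
```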

  40. Frequency warping in cepstral coefficient calculation. Figure 5. A comparison of the WMFCC (mel-scale warping) and WFCC (hertz-scale warping) extraction procedures with the traditional MFCC

  41. Outline: Introduction; Creation of Time-varying Voiceprint Database; The Discrimination-emphasized Frequency-warping Method; Experimental Results; Summary

  42. Discrimination scores across the whole frequency range … Figure 7. Discrimination scores of frequency bands

  43. Comparison Table 3. Overall performance comparison among MFCC, weighted MFCC, WMFCC and WFCC

  44. Questions on generalization • Does this idea still work for other databases? • What if there is not sufficient data for discrimination curve estimation? • Can it be used with the state-of-the-art i-vector method?

  45. Answers or solutions? • Does this idea still work for other databases? Take the NIST SRE 2008 Speaker Recognition Database as the test set. • What if there is not sufficient data for discrimination curve estimation? Use the curve estimated from the time-varying database. • Can it be used with the state-of-the-art i-vector method? Integrate the discrimination curve into the i-vector framework.

  46. Modular representation of the F-ratio in the i-vector system

  47. Databases • Fisher Database: 7,196 female speakers selected to train the projection matrix Tc for the i-vector extractor and the projection matrix G for LDA/PLDA. • NIST SRE 2008 Speaker Recognition Evaluation Database: 1,997 female speakers selected from the core evaluation data set (short2-short3), with 59,343 trials (incl. 47,184 impostor trials) made; taken as the test database.

  48. Configurations • 19-dim MFCCs + log energy extracted; delta and delta-delta coefficients were appended to form 60-dim features (a front-end sketch follows). • The frequency range was divided into 30 bands, and the 30 corresponding Fk values trained. • 2048-mixture gender-dependent UBM. • 400-dim i-vectors. • 150-dim LDA/PLDA.
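
A sketch of that front end with librosa (only the dimensionalities come from the slide; the window, hop, and mel settings are assumptions):

```python
import numpy as np
import librosa

def front_end(y, sr=8000, n_fft=256, hop=80):
    """19 MFCCs + log energy, with deltas and delta-deltas appended,
    giving 60 coefficients per frame as in the slide."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_mels=24,
                                n_fft=n_fft, hop_length=hop)[1:]  # drop c0: 19 dims
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
    log_e = np.log((frames ** 2).sum(axis=0) + 1e-10)[None, :]    # log energy
    base = np.vstack([mfcc[:, :log_e.shape[1]], log_e])           # 20 x T
    d1 = librosa.feature.delta(base, order=1)                     # deltas
    d2 = librosa.feature.delta(base, order=2)                     # delta-deltas
    return np.vstack([base, d1, d2])                              # 60 x T
```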

  49. NIST SRE 2008 Test Conditions (Tab. 1)

  50. Experimental results • In both the LDA and PLDA systems, the Fbank-weighted MFCC is significantly better than the standard MFCC under most conditions.
