
Speaker Recognition



Presentation Transcript


  1. University of Joensuu, Department of Computer Science. PUMS 2003-2004 seminar, 14.10.2004, Turku. Speaker Recognition. Pasi Fränti, Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, Tomi Kinnunen, Ismo Kärkkäinen

  2. Research Group (PUMS project): Juhani Saastamoinen (project manager), Pasi Fränti (professor), Evgeny Karpov (project researcher), Tomi Kinnunen (researcher), Ismo Kärkkäinen (clustering algorithms), Ville Hautamäki (project researcher)

  3. PUMS &amp; JoY • Speaker Recognition • PUMS season 2003-2004: • Identification, no verification • Port to a mobile phone • Feature fusion • Real-time • http://cs.joensuu.fi/pages/pums

  4. Speaker Recognition: two application scenarios • Speaker identification: "Whose voice is this?" (find the closest speaker in a database) • Speaker verification: "Is this Bob's voice?" (accept or reject a claimed identity; a rejected claimant is an impostor)

  5. Identification System • Speech audio → signal processing → feature vectors → speaker modelling • Trained speaker profiles are added to the speaker profile database, and all profiles are used in recognition • Recognition decision: the profile with minimum MSE within the DB over the input speech
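The matching rule on this slide (minimum MSE within the database over the input speech) can be sketched in a few lines. This is an illustrative NumPy sketch, not code from the project's srlib; the function names and toy profiles are mine:

```python
import numpy as np

def avg_distortion(features, codebook):
    # Mean squared distance from each input feature vector
    # to its nearest code vector in the speaker's codebook.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).mean()

def identify(features, profiles):
    # Decision rule from the slide: pick the profile with the
    # minimum average distortion (MSE) over the input speech.
    return min(profiles, key=lambda name: avg_distortion(features, profiles[name]))
```

For example, with two toy profiles `{"alice": ..., "bob": ...}` and a handful of input vectors, `identify` returns the name whose codebook quantizes the input with the smallest average error.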

  6. Results 2003-2004 • srlib core library: speech features (HY), fusion, real-time, profile DB • common speaker recognition application interface • front ends: sprofiler (TCL/TK console UI, HY), SpeakerProfiler, Winsprofiler (Windows console UI), Epocsprofiler / ProfMatch (Series 60)

  7. Planned Results • extend the 2003-2004 software (srlib with speech features (HY), common speaker recognition application interface, sprofiler / SpeakerProfiler / Winsprofiler / ProfMatch / Epocsprofiler) with verification, segmentation and VAD • large-scale database • application scenarios: teleconference applications, access control, mobile phone login?

  8. System in Mobile Phone Port to Symbian OS with Series 60 UI platform

  9. Symbian Phones (UI platforms: UIQ, Series 60, Series 80) • Series 60 phone features: • 16 MB ROM • 8 MB RAM • 176 x 208 display • 32-bit ARM processor • No floating-point unit!

  10. FFTGEN • Multiplication results must fit in 32 bits, so multiplication inputs are truncated • FFTGEN truncates both inputs to 16 bits ("16/16 FFT"): a 16-bit FFT layer input times a 16-bit FFT twiddle factor gives a 32-bit multiplication result with 16 used bits and 16 crop-off bits; the crop-off for the next layer is again 16 bits, so the layer output is a 16-bit integer
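A single fixed-point multiply of this 16/16 scheme can be sketched as follows; Python's arbitrary-precision integers stand in for the 32-bit ARM registers, and the function name is mine, not FFTGEN's:

```python
def butterfly_mul_16_16(x, w):
    # One fixed-point multiply of the "16/16 FFT": both the FFT
    # layer input x and the twiddle factor w are 16-bit signed
    # integers, the product fits in 32 bits, and the low 16 bits
    # are cropped off so the result is again a 16-bit layer value.
    assert -(1 << 15) <= x < (1 << 15), "x must fit 16 signed bits"
    assert -(1 << 15) <= w < (1 << 15), "w must fit 16 signed bits"
    return (x * w) >> 16  # arithmetic shift, like ARM's ASR
```

Note that Python's `>>` on negative integers floors toward minus infinity, matching an arithmetic right shift in C on ARM.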

  11. Proposed Information-Preserving "22/10 FFT" • Approximate the DFT operator F with G: accept a larger ||F - G|| but preserve more signal information • Twiddle factors minimize the maximum relative error in scaled sine values with respect to the scale; scale 980 is good for FFT sizes up to 1024 • Truncate multiplication inputs to 22/10 bits (signal/twiddle): a 32-bit FFT layer input with 22 used bits times a 16-bit twiddle factor with 10 used bits gives a 32-bit multiplication result; 10 bits are cropped off for the next layer
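Under the same assumptions as above, the 22/10 multiply and the 10-bit twiddle factors scaled by 980 can be sketched like this; the helper names and the exact rounding convention are illustrative, only the bit budget and the scale come from the slide:

```python
import math

SCALE = 980  # scale from the slide: good for FFT sizes up to 1024

def twiddle_10bit(k, n):
    # 10-bit twiddle factor W_n^k: cosine and sine scaled by 980
    # and rounded, so each component fits in 10 bits plus sign.
    return (round(SCALE * math.cos(2 * math.pi * k / n)),
            -round(SCALE * math.sin(2 * math.pi * k / n)))

def mul_22_10(x, w):
    # 22-bit signal value times a 10-bit twiddle component: the
    # product still fits in 32 bits, and 10 bits are cropped off
    # so the result is back in the 22-bit signal format.
    assert -(1 << 21) <= x < (1 << 21), "signal must fit 22 signed bits"
    assert abs(w) <= SCALE, "twiddle component must fit 10 bits"
    return (x * w) >> 10
```

Keeping 22 signal bits instead of 16 is what preserves the extra information; the price is the coarser 10-bit twiddle, whose rounding error the scale choice of 980 minimizes.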

  12. Scale of Error in Proposed FFT (figure comparing the error magnitudes of the 16/16 and 22/10 FFTs)

  13. Mobile Phone Results

  14. Improving Accuracy by Information Fusion • Each feature set produces its own feature vectors and classifier: feature set 1 (e.g. 5 MFCCs) → classifier 1 → score 1; feature set 2 (e.g. F0 + ΔF0) → classifier 2 → score 2; feature set 3 (e.g. formants F1, F2, F3) → classifier 3 → score 3 • The scores go to a score combiner, which makes the decision
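A minimal sketch of the score-combiner stage, assuming lower-is-better match scores (e.g. VQ distortions) and a simple weighted sum; the weighting scheme is a free design choice, not something the slide specifies:

```python
def fuse_scores(scores, weights=None):
    # Score-level fusion: weighted sum of the per-classifier
    # match scores for one speaker (lower = better).
    # Uniform weights by default; the weighting is an
    # illustrative assumption, not from the slide.
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))

def decide(per_speaker_scores):
    # Pick the speaker whose fused score is lowest.
    return min(per_speaker_scores,
               key=lambda spk: fuse_scores(per_speaker_scores[spk]))
```

For example, `decide({"alice": [1.0, 2.0], "bob": [3.0, 0.5]})` fuses each speaker's classifier scores before comparing speakers, rather than letting any single classifier decide.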

  15. Information Fusion Results (identification error rates, %; baseline = best individual feature set; slide annotations mark fusion as successful in some cells and harmful in others, e.g. the FMT row)

  Feature set combination   Baseline   Feature-level   Score-level   Decision-level
  MFCC + ΔMFCC                16.8        15.8            14.6           N/A
  LPCC + ΔLPCC                16.0        19.8            14.7           N/A
  ARCSIN + ΔARCSIN            17.1        18.2            16.8           N/A
  FMT + ΔFMT                  19.4        29.9            52.0           N/A
  All feature sets            16.0        21.2            15.2           12.6

  16. Real-Time Speaker Identification • Speech input stream: fill buffer with new data, frame blocking (all frames), silence detection (keep non-silent frames), feature extraction (feature vectors) • Reducing # vectors by pre-quantization: 1. averaging, 2. random sampling, 3. decimation, 4. clustering (LBG) → reduced set of vectors • Speed up NN search: vantage-point tree (VPT) indexing of the code vectors • Reduce # speakers by database pruning over speaker models 1..N: 1. static pruning, 2. hierarchical pruning, 3. adaptive pruning, 4. confidence-based pruning (active vs. pruned speakers, list of candidate speakers) • Matching repeats until a decision is reached
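Of the four vector-reduction options above, averaging is the simplest to sketch; the window size and function name here are illustrative choices, not values from the slides:

```python
def prequantize_avg(vectors, window):
    # Pre-quantization by averaging: replace each window of
    # consecutive feature vectors by its mean, shrinking the
    # number of vectors that must be matched against every
    # speaker model in the database.
    reduced = []
    for i in range(0, len(vectors), window):
        block = vectors[i:i + window]
        dim = len(block[0])
        reduced.append([sum(v[j] for v in block) / len(block)
                        for j in range(dim)])
    return reduced
```

A window of size W cuts the matching work roughly W-fold; the slides' results compare this against random sampling, decimation, and LBG clustering of the test sequence.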

  17. Results: Baseline System (TIMIT) (average length of test utterance = 8.9 s) • Runs at 4 x realtime: real-time requirement satisfied

  18. Results: Pre-Quantization (TIMIT) (codebook size = 64) • 9 x realtime • Averaging performs worst, clustering best • About 2:1 speed-up over full search (no pre-quantization) without degradation in accuracy

  19. Results: Pruning Variants (TIMIT) (codebook size = 64) • 11 x realtime • Recommended method: adaptive pruning (AP)
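Adaptive pruning is commonly formulated as periodically dropping speakers whose accumulated distortion exceeds the mean plus a multiple of the standard deviation of the active speakers' scores. A sketch under that assumption (the threshold form and the parameter `eta` are not given on the slides):

```python
import statistics

def adaptive_prune(scores, eta=1.0):
    # One adaptive-pruning step: keep only the active speakers
    # whose accumulated distortion is below the mean plus eta
    # standard deviations of the current scores. eta is a
    # tunable control parameter, assumed here, not from the slides.
    mean = statistics.mean(scores.values())
    spread = statistics.pstdev(scores.values())
    return {spk: d for spk, d in scores.items()
            if d < mean + eta * spread}
```

Calling this after each new batch of frames shrinks the candidate list as evidence accumulates, so most of the utterance is matched against only a few remaining speakers.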

  20. Results: PQ, Pruning and PQP (TIMIT) (codebook size = 64) • 33 x realtime • Recommended method: combination of pre-quantization and pruning (PQP)

  21. Results: VQ vs. GMM (TIMIT) (average length of test utterance = 8.9 s) • VQ: 13:1 speed-up without degradation; best time 0.27 s = 33 x realtime @ error rate 0.32 %; smallest error 0.00 % @ 0.31 s = 28 x realtime • GMM: 9:1 to 10:1 speed-up without degradation; best time 0.18 s = 49 x realtime @ error rate 0.16 %; smallest error 0.16 % @ 0.18 s = 49 x realtime

  22. Results: VQ vs. GMM (NIST-1999) (average length of test utterance = 30.4 s) • VQ: 13:1 to 16:1 speed-up with minor degradation; best time 0.82 s = 37 x realtime @ error rate 19.36 %; smallest error 16.90 % @ 37.9 s = 0.8 x realtime • GMM: 23:1 to 34:1 speed-up with minor degradation; best time 0.48 s = 63 x realtime @ error rate 19.22 %; smallest error 17.34 % @ 11.4 s = 3 x realtime
