
A Speaker Pruning Algorithm for Real-Time Speaker Identification

AVBPA 2003 Guildford, UK, June 9-11, 2003. A Speaker Pruning Algorithm for Real-Time Speaker Identification. University of Joensuu, FINLAND Department of Computer Science. Tomi Kinnunen, Evgeny Karpov, Pasi Fränti. Abstract. Speaker identification task is computationally very expensive


Presentation Transcript


  1. AVBPA 2003 Guildford, UK, June 9-11, 2003 A Speaker Pruning Algorithm for Real-Time Speaker Identification University of Joensuu, FINLAND Department of Computer Science Tomi Kinnunen, Evgeny Karpov, Pasi Fränti

  2. Abstract • The speaker identification task is computationally very expensive • Most of the computation originates from calculating the matching scores • Proposed method: drop out unlikely speakers “on the fly” • Result: reduced computation time with a slightly increased error rate

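  3. VQ-Based Speaker Identification [Diagram: feature extraction turns the unknown voice into a vector set X; X is matched against every codebook C1, …, Ci, …, CN in the speaker model database by looping over the whole database, giving the scores { D(X,C1), …, D(X,Ci), …, D(X,CN) }; the speaker with the minimum score is selected.]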
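The full-search baseline in the diagram can be sketched as follows. The slides do not define D(X, Ci); this sketch assumes the average squared Euclidean distance from each feature vector to its nearest code vector, a common choice for VQ matching. The function names are hypothetical.

```python
import numpy as np

def avg_distortion(X, codebook):
    """Assumed form of D(X, C): mean squared distance to the nearest code vector."""
    # Pairwise squared distances, shape (len(X), len(codebook)).
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).mean())

def identify_full_search(X, codebooks):
    """Score X against every codebook and return (best index, all scores)."""
    scores = [avg_distortion(X, C) for C in codebooks]
    return int(np.argmin(scores)), scores
```

Every codebook is scored against all of X, which is the cost the pruning variants try to avoid.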

  4. Match Score Saturation

  5. Towards Speaker Pruning ... • Only a few vectors are enough to rule out most of the speakers • Confidence increases when more vectors are processed • Speaker pruning: drop the unlikely speakers out of the competition as more data arrives → No more distance calculations are needed for the pruned speakers

  6. Illustration of Pruning [Figure: the unknown speaker's voice sample, marked with the 1st pruning, 2nd pruning, 3rd pruning, and the final decision.]

  7. Variant 1: Static Pruning Idea: maintain an ordered list of match scores, and prune out the K worst speakers

     Let C = {C1, …, CN} be the set of all speaker models ;
     Let X = Ø ;
     WHILE (C ≠ Ø AND vectors left in input buffer) DO
         Insert M new vectors from input buffer to set X ;
         Re-evaluate dissimilarities D(X, Ci) for all Ci in C ;
         Remove K most dissimilar models from C ;
     END
     RETURN arg min_i { D(X, Ci) | Ci ∈ C } ;
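A minimal Python sketch of static pruning, under two assumptions that are not on the slide: D(X, Ci) is the average squared distance to the nearest code vector (so it can be updated from running sums instead of being recomputed over all of X), and at least one speaker is always kept so that the final arg min is defined. The names `static_pruning` and `nearest_dist_sum` are hypothetical.

```python
import numpy as np

def nearest_dist_sum(vectors, codebook):
    """Sum of squared distances from each vector to its nearest code vector."""
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

def static_pruning(X, codebooks, M=10, K=1):
    """Read X in blocks of M vectors; after each block drop the K worst speakers."""
    active = list(range(len(codebooks)))
    totals = np.zeros(len(codebooks))  # running distortion sums per speaker
    for start in range(0, len(X), M):
        if len(active) <= 1:
            break  # only one candidate left, no need to read more input
        chunk = X[start:start + M]
        for i in active:  # pruned speakers cost nothing from here on
            totals[i] += nearest_dist_sum(chunk, codebooks[i])
        # All active speakers have seen the same vectors, so ranking by
        # sums is equivalent to ranking by average distortion.
        active.sort(key=lambda i: totals[i])
        n_drop = min(K, len(active) - 1)  # assumption: never empty the set
        if n_drop > 0:
            active = active[:-n_drop]
    return min(active, key=lambda i: totals[i])
```

With N speakers, the candidate set shrinks to one after roughly N / K blocks, so M and K together bound how much input is examined.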

  8. Variant 2: Adaptive Pruning Idea: determine a pruning threshold θ from the distribution of the active speakers' distances

     Let C = {C1, …, CN} be the set of all speaker models ;
     Let X = Ø ;
     WHILE (C ≠ Ø AND vectors left in input buffer) DO
         Insert M new vectors from input buffer to set X ;
         Re-evaluate dissimilarities D(X, Ci) for all Ci in C ;
         Compute μ and σ of the distribution { D(X, Ci) | Ci ∈ C } ;
         Let θ = μ + η σ be the pruning threshold ;
         Remove all speakers i from C satisfying D(X, Ci) > θ ;
     END
     RETURN arg min_i { D(X, Ci) | Ci ∈ C } ;
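The adaptive variant can be sketched the same way. As before, the form of D(X, Ci) (average squared distance to the nearest code vector) is an assumption, as is the fallback that keeps the best speaker if the threshold would remove everyone (which can only happen for negative η). The function names are hypothetical.

```python
import numpy as np

def nearest_dist_sum(vectors, codebook):
    """Sum of squared distances from each vector to its nearest code vector."""
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

def adaptive_pruning(X, codebooks, M=10, eta=1.0):
    """After each block of M vectors, drop speakers with D > mu + eta * sigma."""
    active = list(range(len(codebooks)))
    totals = np.zeros(len(codebooks))  # running distortion sums per speaker
    seen = 0
    for start in range(0, len(X), M):
        if len(active) <= 1:
            break
        chunk = X[start:start + M]
        seen += len(chunk)
        for i in active:
            totals[i] += nearest_dist_sum(chunk, codebooks[i])
        d = totals[active] / seen          # D(X, Ci) for the active speakers
        theta = d.mean() + eta * d.std()   # pruning threshold
        survivors = [i for i, di in zip(active, d) if di <= theta]
        # Assumption: if nothing survives, keep the current best speaker.
        active = survivors or [active[int(np.argmin(d))]]
    return min(active, key=lambda i: totals[i])
```

For η ≥ 0 the best-scoring speaker always satisfies D ≤ μ ≤ θ, so it is never pruned in the step where it is best.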

  9. Illustration of Adaptive Pruning [Figure: histograms of matching scores as a function of time, with the pruned speakers marked; x-axis: match score (distance), y-axis: frequency of occurrence.]

  10. Parameters of the Variants [Figure labels: μ, μ + ησ] • Static pruning: number of speakers to prune at each interval • Adaptive pruning: the η parameter in the pruning threshold • It is assumed that distances follow a Gaussian distribution with mean μ and variance σ² → η specifies a certain confidence interval
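Under the Gaussian assumption stated on this slide, the share of active speakers expected to fall above θ = μ + ησ is the upper tail 1 − Φ(η) of the standard normal distribution. The helper below is a hypothetical illustration of that relationship, not something taken from the slides.

```python
import math

def expected_pruned_fraction(eta):
    """Upper-tail probability P(Z > eta) for a standard normal Z."""
    return 0.5 * math.erfc(eta / math.sqrt(2.0))

for eta in (0.0, 0.5, 1.0, 2.0):
    print(f"eta = {eta:3.1f}: about {100 * expected_pruned_fraction(eta):.1f} % pruned per step")
```

A larger η therefore prunes more cautiously: η = 0 removes about half of the active speakers per step, while η = 1 removes about 16 %, if the distances really are Gaussian.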

  11. Experimental Setup • TIMIT corpus: N = 630 American English speakers, clean speech; sample rate Fs = 8 kHz, 16 bits per sample • Pre-processing and MFCC feature extraction: silence removed; pre-emphasis H(z) = 1 − 0.97 z⁻¹; 30 ms Hamming window, shifted by 10 ms; 27 triangular bandpass filters spaced equally on the mel scale; 0th cepstral coefficient excluded • Speaker models: codebooks of 64 vectors by the Linde-Buzo-Gray algorithm • Training data: 8.8 seconds / speaker (without silence)
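The first two pre-processing steps translate directly into code. The sketch below covers only pre-emphasis and Hamming-windowed framing with the parameters listed above; silence removal, the mel filterbank, and the cepstral transform are left out, and the function names are hypothetical.

```python
import numpy as np

FS = 8000                       # sample rate in Hz (from the slide)
FRAME_LEN = FS * 30 // 1000     # 30 ms window  -> 240 samples
FRAME_SHIFT = FS * 10 // 1000   # 10 ms shift   -> 80 samples

def pre_emphasize(x, alpha=0.97):
    """Apply H(z) = 1 - alpha * z^-1, i.e. y[n] = x[n] - alpha * x[n-1]."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= alpha * x[:-1]
    return y

def frame_signal(x):
    """Split x into overlapping frames and apply a Hamming window to each."""
    if len(x) < FRAME_LEN:
        return np.empty((0, FRAME_LEN))
    n_frames = 1 + (len(x) - FRAME_LEN) // FRAME_SHIFT
    frames = np.stack([x[i * FRAME_SHIFT:i * FRAME_SHIFT + FRAME_LEN]
                       for i in range(n_frames)])
    return frames * np.hamming(FRAME_LEN)
```

At 8 kHz, one second of speech yields 98 full frames of 240 samples each.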

  12. Evaluation Criteria • Identification error rate + average identification time per speaker → Combined: error rate as a function of time • Reference point: full search (no speaker pruning) achieves a 0.15 % error rate (one misclassified speaker) in 230 seconds (≈ 4 minutes) on average

  13. Static Pruning: error < 0.5 % in 50 seconds [Full search: 0.15 % in 230 seconds]

  14. Adaptive Pruning: error < 0.5 % in 25 seconds [Full search: 0.15 % in 230 seconds]

  15. Comparison of the Variants • At 25 s: static 5.5 %, adaptive 0.5 % • At 50 s: static 0.5 %, adaptive 0.18 % [Full search: 0.15 % in 230 seconds]

  16. Conclusions • Speed-up ratio of 9:1 with only minor degradation in accuracy • Full search: 629/630 correct in 220 seconds • Static pruning: 595/630 correct in 25 seconds • Adaptive pruning: 627/630 correct in 25 seconds • Adaptive variant outperforms static variant • Selection of the parameters is not crucial → easy to apply in practice • Both variants are straightforward to implement • Easily extendable to other models (e.g. GMM)
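The error rates and the speed-up ratio follow from the counts on this slide by simple arithmetic. The snippet below reproduces them; it uses the 220 s full-search time quoted here (the earlier slides quote 230 s, which would give a ratio of 9.2 instead of 8.8). The dictionary and helper names are hypothetical.

```python
N_SPEAKERS = 630

# (correctly identified speakers, identification time in seconds), from the slide
RESULTS = {
    "full search": (629, 220),
    "static pruning": (595, 25),
    "adaptive pruning": (627, 25),
}

def error_rate_percent(correct, total=N_SPEAKERS):
    """Identification error rate in percent."""
    return 100.0 * (total - correct) / total

def speed_up(method, baseline="full search"):
    """Ratio of the baseline time to the method's time."""
    return RESULTS[baseline][1] / RESULTS[method][1]

for name, (correct, seconds) in RESULTS.items():
    print(f"{name}: {error_rate_percent(correct):.2f} % error in {seconds} s, "
          f"speed-up {speed_up(name):.1f}:1")
```

At the same 25-second budget, the adaptive variant misclassifies 3 speakers against 35 for the static variant.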
