
Research effort partially sponsored by the US Air Force Research Laboratories, Rome, NY

Speech Processing Laboratory Temple University. Design of Keyword Spotting System Based on SEGMENTAL Time warping of quantized features Presented by: Piush Karmacharya Thesis Advisor: Dr. R. Yantorno Committee Members: Dr. Joseph Picone Dr. Dennis Silage.


Presentation Transcript


  1. Speech Processing Laboratory, Temple University • Design of Keyword Spotting System Based on Segmental Time Warping of Quantized Features • Presented by: Piush Karmacharya • Thesis Advisor: Dr. R. Yantorno • Committee Members: Dr. Joseph Picone, Dr. Dennis Silage • Research effort partially sponsored by the US Air Force Research Laboratories, Rome, NY

  2. Outline • Introduction • What – Problem Definition • Why – Research Motivation • How – Techniques and Challenges • Related Research • System Design • Results • Future Work

  3. Introduction • Branch of a more sophisticated stream – Speech Recognition • Identify a keyword in a written document or in audio (recorded or real time) • Confusion Matrix • True Positive – Hits • True Negative • False Negative – Misses • False Positive – False Alarms (FA) • Location if Present Results

  4. Introduction.. • Speaker dependent • High accuracy, limited application • Speaker Independent • Lower accuracy, Wide application • Performance Evaluation – hits, misses and false alarms • Receiver Operating Characteristic – Hits vs FA • Design Objective – maximize hits while keeping false alarms low • Accuracy -
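The evaluation terms on the two introduction slides can be illustrated with a small sketch. It assumes that a detection counts as a hit when it falls within some tolerance of a true keyword location. The function name `score_detections` and the tolerance rule are hypothetical and are not taken from the presentation.

```python
def score_detections(true_locs, detected_locs, tolerance):
    """Count hits, misses and false alarms for one keyword.

    true_locs and detected_locs are positions (for example frame indices).
    Assumption: a detection is a hit if it lies within `tolerance` of a
    true location that has not already been matched.
    """
    unmatched = list(true_locs)
    hits = 0
    false_alarms = 0
    for d in detected_locs:
        match = next((t for t in unmatched if abs(t - d) <= tolerance), None)
        if match is None:
            false_alarms += 1          # detection with no keyword nearby
        else:
            hits += 1
            unmatched.remove(match)    # each true keyword is matched once
    misses = len(unmatched)            # true keywords never detected
    return hits, misses, false_alarms
```

For example, `score_detections([100, 500], [105, 300], 10)` gives one hit, one miss and one false alarm.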

  5. Introduction • What – Problem Definition • Why – Research Motivation • How – Techniques • Related Research • System Design • Results • Future Work

  6. Motivation • Speech – Most general form of human communication • Information – embedded in redundant words • I would like to have French toast for breakfast. • Non-intentional sounds – cough, exclamation, noise • Efficient human-machine interface • Applications: Audio Document retrieval, Surveillance Systems, Voice commands/dialing

  7. Introduction • What – Problem Definition • Why – Research Motivation • How – Techniques and Challenges • Related Research • System Design • Results • Future Work

  8. Challenges • Similar (1, 3, 6; 13, 14) and different keywords • Variation in length (4 – around 1500 samples, 14 – 4100 samples) • Different instances of the keyword REALLY from different speakers /KINGLISHA/ /Hey you have to speak English for some English/

  9. Common Approaches • Template Based Approach • Hidden Markov Models • Neural Network • Hybrid Methods • Discriminative Methods

  10. Template Matching • Started in the 1970s • One (or more) keyword templates available • Search string – keyword; Search Space – the utterance • Flexible time search – Dynamic Time Warping • 1971, H. Sakoe, S. Chiba • Suitable for small scale applications • Drawback • Requires segmenting the utterance into isolated words • Fails to learn from the existing speech data

  11. Dynamic Time Warping • Time stretch/compress one signal so that it aligns with the other signal • Extremely efficient time-series similarity measure • Minimizes the effects of shifting and distortion • Prototype of the test keyword stored as a template; compared to each word in the incoming utterance

  12. Dynamic Time Warping • Reference and test keyword arranged along two sides of the grid • Template keyword – vertical axis, test keyword – horizontal • Each block in the grid – distance between corresponding feature vectors • Best match – path through the grid that minimizes cumulative distance • But the number of possible paths increases exponentially with length!!

  13. DTW • Constraints • Monotonic condition: no backward steps • Continuity condition: no break in path • Adjustment window: optimal path does not wander away from diagonal • Boundary condition: starting/ending fixed • Constraints manipulated as desired (e.g. for connected word recognition [Myers, C.; Rabiner, L.; Rosenberg, A.; 1980])
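The four constraints listed on the DTW slide map directly onto a standard dynamic-programming formulation, which also avoids enumerating the exponentially many paths. The following is a minimal sketch of textbook DTW, not the presenter's code. The function name `dtw_distance`, the `dist` callback and the `window` parameter are illustrative assumptions.

```python
import math

def dtw_distance(template, test, dist, window=None):
    """Cumulative DTW distance between two sequences.

    dist(a, b) is the local distance between two elements.
    window, if given, is the adjustment window: cell (i, j) is only
    visited when |i - j| <= window.
    """
    n, m = len(template), len(test)
    if window is None:
        window = max(n, m)
    window = max(window, abs(n - m))   # keep the end point reachable
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0                   # boundary condition: fixed start
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = dist(template[i - 1], test[j - 1])
            # monotonic and continuous: arrive only from the previous
            # row, the previous column, or the diagonal neighbour
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]                  # boundary condition: fixed end
```

With `dist=lambda a, b: abs(a - b)`, the sequences `[1, 2, 3]` and `[1, 2, 2, 3]` align at zero cost because the repeated `2` is absorbed by a horizontal step.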

  14. Hidden Markov Models • 1988 – Lawrence R. Rabiner • Statistical model – Hidden States/Observable Outputs • Emission probability – p(x|q1) • Transition probability – p(q2|q1) • First order Markov Process – probability of next state depends only on current state • Infer output given the underlying system • Estimate most likely system from observed sequence of output

  15. HMM • KWS Implementation • Large Vocabulary Continuous Speech Recognizer (LVCSR) • Model non-keywords using Garbage/Filler Models • Limitation • Large amount of training data required • Training data has to be transcribed at word level and/or phone level • Transcribed data costs time and money • Not available in all languages

  16. Neural Networks • Late 90’s • Classifier – learns from existing data • Multiple layers of interconnected nodes (neurons) • Different weights assigned to inputs; updated in every iteration • Requires large amount of transcribed data for training • Hybrid Systems – HMM/NN • Discriminative Approaches – Support Vector Machines

  17. Introduction • What – Problem Definition • Why – Research Motivation • How – Techniques and Challenges • Related Research • System Design • Results • Future Work

  18. Related Work • Segmental Dynamic Time Warping – Alex S. Park, James R. Glass, 2008 • Segmentation into isolated words not required • Choose starting point and adjustment window size • Proposed breaking words into smaller Acoustic Units • Speech – sequence of sounds • Acoustic units • Phonemes – Timothy J. Hazen, Wade Shen, Christopher White, 2009 • Gaussian Mixture Models (GMMs) – Yaodong Zhang, James R. Glass, 2009

  19. Related Work • Phonetic Posteriorgrams • Phonemes as Acoustic unit • Gaussian Posteriorgrams • Acoustic unit - GMMs • Posteriorgram - Probability vector representing the posterior probabilities of a set of classes for a speech frame • Every speech frame associated with one or more phonemes /SH/ /AA/ /ER/

  20. Introduction • What – Problem Definition • Why – Research Motivation • How – Techniques and Challenges • Related Research • System Design • Results • Future Work

  21. Research Methodology • Acoustic Unit – Mean of the cluster • Simple – K-means clustering • Likelihood – Euclidean distance from the cluster centroid • Segmental Dynamic Time Warping – Keyword Detection • Covariance information not required • Corpus • Call-Home Database • 30 minutes of stereo conversations over telephone channel • Switchboard Database • 2,400 two-sided telephone conversations among 543 speakers

  22. Steps • Training • Keyword Template Processing • Keyword Detection • Training pipeline: Training Speech → Feature Extraction → K-means Clustering → Trained Cluster, Distance Matrix • Training Speech • Diverse sound • Diverse speakers

  23. Speech Processing • Feature vector – MFCC • Pipeline: Speech Signal → Pre-Emphasis → Windowing → FFT → Mel-Scaling → Log → DCT → MFCC • Pre-Emphasis • High Pass Filter • š[n] = s[n] - αs[n - 1]; α = 0.95 • Speech spectrum falls off at high frequencies • Emphasizes higher formants • Windowing • Short-time stationary • Divide speech into short frames (20ms with 5ms spacing) • Mel-Scaling • Model Human Perception • Multiplying with Filter banks
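The first two stages of the feature pipeline, pre-emphasis and framing, can be sketched in a few lines. This covers only the filter š[n] = s[n] - αs[n - 1] and the 20 ms / 5 ms framing quoted on the slide; the FFT, mel filter bank, log and DCT stages are left out. The 8000 Hz sample rate in the usage note is an assumption (a typical telephone rate), and the function names are hypothetical.

```python
def pre_emphasis(signal, alpha=0.95):
    """High-pass filter s'[n] = s[n] - alpha * s[n-1].

    The first sample has no predecessor and is passed through unchanged.
    """
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, sample_rate, frame_ms=20, step_ms=5):
    """Split a signal into overlapping frames.

    frame_ms is the frame length and step_ms the spacing between frame
    starts, so 20 ms frames with 5 ms spacing overlap by three quarters.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000
    step = sample_rate * step_ms // 1000
    last_start = len(signal) - frame_len
    return [signal[s:s + frame_len] for s in range(0, last_start + 1, step)]
```

At an assumed 8000 Hz, one second of speech yields frames of 160 samples whose starts are 40 samples apart.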

  24. K-Means Clustering • Feature space populated by entire training data • Select k random cluster centers • Each data point finds the center it is closest to and associates itself with it • Each cluster then finds the centroid of the points it owns • Centroid updated with new means • Repeat the assignment and update steps until convergence
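The clustering steps on this slide correspond to the standard K-means (Lloyd) iteration. Below is a minimal pure-Python sketch under a few assumptions: initial centers are drawn at random from the data, an empty cluster keeps its previous center, and the loop stops when the centers no longer change. The names `kmeans` and `sq_dist` are illustrative.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100, seed=0):
    """Return k cluster centroids for a list of feature vectors."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]   # random start
    for _ in range(iterations):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its points
        new_centers = []
        for c in range(k):
            members = clusters[c]
            if members:
                dim = len(members[0])
                new_centers.append([sum(p[d] for p in members) / len(members)
                                    for d in range(dim)])
            else:
                new_centers.append(centers[c])   # assumption: keep old center
        if new_centers == centers:               # converged
            break
        centers = new_centers
    return centers
```

On the four points `(0, 0)`, `(0, 1)`, `(10, 10)`, `(10, 11)` with `k=2`, the centroids settle at `(0, 0.5)` and `(10, 10.5)`.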

  25. Distance Matrix • A feature far away from the centroid might fall into an adjacent cluster • Likelihood Measure – Euclidean Distance • Vectors in regions 3, 4 and 5 are closer to region 1 than region 6 • 2-D distance matrix D optimizes the detection process
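One plausible reading of this slide is that D holds the Euclidean distance between every pair of cluster centroids, so that a local distance during detection becomes a table lookup instead of a vector computation. The sketch below follows that reading; it is an assumption about the construction, and `centroid_distance_matrix` is a hypothetical name.

```python
import math

def centroid_distance_matrix(centroids):
    """Return D where D[i][j] is the Euclidean distance between
    centroid i and centroid j (symmetric, zero on the diagonal)."""
    k = len(centroids)
    return [[math.dist(centroids[i], centroids[j]) for j in range(k)]
            for i in range(k)]
```

With this table, two frames quantized to neighbouring clusters receive a small but non-zero distance instead of a hard mismatch.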

  26. Keyword Templates • MFCC Feature Vector • Each frame associated to a cluster • 1-D template(s) stored into a folder • Pipeline: Keywords → Feature Extraction → Vector Quantization → 1-D string of cluster indices
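Turning a keyword's feature vectors into a 1-D string of cluster indices is a nearest-centroid lookup. A minimal sketch, assuming Euclidean distance to the trained centroids; `quantize` is an illustrative name.

```python
import math

def quantize(frames, centroids):
    """Map each feature vector to the index of its nearest centroid,
    producing a 1-D sequence of cluster indices."""
    return [min(range(len(centroids)),
                key=lambda c: math.dist(frame, centroids[c]))
            for frame in frames]
```

For centroids `[(0, 0), (10, 10)]`, the frames `[(0, 0.2), (9, 9), (1, 0)]` become the template `[0, 1, 0]`.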

  27. Keyword Detection • Speech utterance divided into overlapping segments (not isolated words) • Warping distance for each segment computed separately • Pipeline: Speech → Feature Extraction → Vector Quantization → 1-D cluster index → Segmental DTW → Decision Logic → Keyword Detection

  28. Distance Plot • Kwd- C1-C2-C4-C6-C1 • Utterance – C2-C4-C5-C6-C1-C3-C4 • Keyword – vertical axis; utterance – horizontal axis • Each cell – distance measure • Grayscale • Dark – Low distance • Bright – Large distance • Minimum distance path – candidate keyword

  29. Segmental DTW • Speech utterance divided into overlapping segments • Choose the starting point • Adjustment window constraint – Segment Span ± R (=3) • Segment Width – 2R + 1 • Segment Overlap – R • Each segment has its own warping distance score • Candidate Keyword – ones with low warping distance • Precision Error – 2R • Figure: Segment-1, Segment-2, Segment-3 • Distance: S1 = (0+0+0+7+9)/5 = 3.2; S2 = (0+5+0+7+9)/5 = 4.2
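The segment-by-segment search described on this slide can be sketched as a band-constrained DTW that is restarted at regularly spaced starting points. This is a simplified illustration, not the presenter's implementation. It assumes that starting points are spaced R apart, that a path must begin at its segment's starting point and stay within R cells of that segment's diagonal, and that the score is the cumulative distance divided by the keyword length (mirroring the division by 5 in the slide's example). The name `segmental_dtw` and the table `D` of cluster-to-cluster distances are illustrative.

```python
import math

def segmental_dtw(keyword, utterance, D, R):
    """Return a list of (start, score), one entry per utterance segment.

    keyword, utterance: sequences of cluster indices.
    D[a][b]: distance between cluster a and cluster b.
    """
    n, m = len(keyword), len(utterance)
    results = []
    for start in range(0, m, R):
        prev = {}                        # costs of the previous keyword row
        for i in range(n):
            cur = {}
            lo = max(0, start + i - R)   # band of width 2R + 1 around
            hi = min(m - 1, start + i + R)   # the segment diagonal
            for j in range(lo, hi + 1):
                d = D[keyword[i]][utterance[j]]
                if i == 0:
                    # the path begins at the segment's starting point
                    best = 0.0 if j == start else cur.get(j - 1, math.inf)
                else:
                    best = min(prev.get(j, math.inf),
                               prev.get(j - 1, math.inf),
                               cur.get(j - 1, math.inf))
                cur[j] = d + best
            prev = cur
        if prev:                         # skip segments that run off the end
            results.append((start, min(prev.values()) / n))
    return results
```

Segments with the lowest scores are the candidate keyword locations; for instance `min(results, key=lambda r: r[1])` picks the single best segment.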

  30. Introduction • What – Problem Definition • Why – Research Motivation • How – Techniques and Challenges • Related Research • System Design • Results • Future Work

  31. Results • Some templates fail to produce low distance at keyword location • Average score can be used with a Threshold • Figure: Distortion Score for keyword UNIVERSITY

  32. Decision Logic • N/2 Voting-based Approach, N – No. of templates available • Top ten lowest distance segments for each keyword template • Frequency of occurrence for each segment • Segments among the top ten scorers for more than half the keyword templates – considered the keyword
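The N/2 voting rule can be sketched as a simple count over per-template rankings. The input layout (one dictionary of segment scores per template) and the name `vote_segments` are assumptions made for illustration.

```python
from collections import Counter

def vote_segments(scores_per_template, top_n=10):
    """Return the segments accepted by N/2 voting.

    scores_per_template: one dict per keyword template, mapping a
    segment id to its warping distance for that template.
    A segment is accepted if it is among the top_n lowest-distance
    segments for more than half of the templates.
    """
    votes = Counter()
    for scores in scores_per_template:
        top = sorted(scores, key=scores.get)[:top_n]   # lowest distances
        votes.update(top)
    needed = len(scores_per_template) / 2
    return sorted(seg for seg, count in votes.items() if count > needed)
```

With three templates and `top_n=2`, a segment needs to rank in the top two for at least two templates to be reported.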

  33. Experimental Setup • Feature Vector • 13MFCC, 13∆ and 13 ∆ ∆ = 39 Features • 24 Filterbank • 20ms frame with ¾ overlap • Cluster Size – 64, 128, 256 • Training Data – 14 speakers (10 male, 4 female) * 5 mins = 70 mins • Segment span R = 2 to 20 • Number of keywords – 14 • Test Utterance – 10 sec to 2 min • Keyword Location - Cut-off Precision Error – 30%

  34. Keyword Statistics • Long length keywords – easier to detect? • Higher variance – lower detection rate • Question: Syllable vs Phoneme [http://www.howmanysyllables.com/, http://www.speech.cs.cmu.edu/cgi-bin/cmudict]

  35. Operation Characteristic • Hits vs Segment Span – R • Smaller R – Restrictive • Large R – More flexible • Larger R – Large precision error • Maximum Hits at R = 5-7 • Compared to result on S-DTW on Gaussian Posteriorgram for Speech Pattern discovery [Y. Zhang, J. R. Glass, 2010]

  36. Operation Characteristic • Misses vs R • Small R – restrictive • Larger R – Flexible, more noise • Minimum misses at R = 5-7 • False Alarm vs R • Small R – Less false alarm • Large R – Flexible, more FA

  37. Operation Characteristic • Speed vs R • No. of Segments = (UL - margin - 1)/R + 1 • Smaller R – More segments/Processing time • Larger R – Fewer segments/Less time • For R=5, 1 minute of utterance – 5 secs per keyword template ≈ 12 templates possible in real time • 1 hr speech – 10 mins on 200 CPUs using GP and graph clustering on SDTW segments [Y. Zhang and J. R. Glass, 2010] • Figure: Execution time per keyword template per minute of utterance
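The segment-count formula and the real-time estimate on this slide are simple arithmetic, shown here as a sketch. UL is taken to be the utterance length in frames and `margin` is passed in as a plain parameter because the slide does not define it; integer division is also an assumption. Both function names are hypothetical.

```python
def num_segments(utterance_len, margin, R):
    """No. of Segments = (UL - margin - 1)/R + 1, using integer division."""
    return (utterance_len - margin - 1) // R + 1

def templates_in_real_time(utterance_seconds, seconds_per_template):
    """How many keyword templates can be processed within the
    duration of the utterance itself."""
    return int(utterance_seconds // seconds_per_template)
```

The slide's figure follows directly: 60 seconds of speech at 5 seconds per template allows `templates_in_real_time(60, 5)`, which is 12 templates. Doubling R roughly halves the segment count, which is why larger R runs faster.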

  38. Results • Results vary for different keywords • Frequency of use of the word more important than length (University/Relationship vs. Something) • Pronunciation – context dependent [ H. Ketabdar, J. Vepa, S Bengio and H. Bourlard, 2006]

  39. Future Work • Implement relevance feedback technique so that generic templates are assigned higher weights after every iteration [Hazen T.J., Shen W., White C.M., 2009] • Retraining the clusters for different environments • Testing on more data with refined keyword templates (isolation of keywords from the speech data was time consuming and required several iterations) • Using a model keyword instead of several keyword templates [*Olakunle]

  40. Thank You

  41. Backup Slides

  42. Model Keyword • Develop a model keyword from all available keyword templates • Implement Self Organizing Maps (SOMs) • Cluster grouping is random in K-means clustering • Data belonging to same clusters are grouped into one in SOM

  43. System Design • Vector Quantization • Quantize data into finite clusters – training data for populating the feature space need not be transcribed. • Feature for same sound fall into same cluster • Reduce dimension – feature vector reduced to codebook • Likelihood Estimation • Account for data that might fall just outside the cluster • Segmental Dynamic Time Warping • DTW requires fixed ends – utterance segmentation into isolated words • Divide the utterance into segments (not necessarily words) and compute distortion score for each segment using DTW

  44. Hidden Markov Models • HMM for Speech Recognition • Each word – sequence of unobservable states with certain emission probabilities (features) and transition probabilities (to next state) • Estimate the model for each word in the training vocabulary • For each test keyword, model that maximizes the likelihood is selected as a match – Viterbi Algorithm • Grammatical constraints are applied to improve recognition accuracy • Large Vocabulary Continuous Speech Recognizer (LVCSR) • Model non-keywords using Garbage/Filler Models
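The Viterbi algorithm named on this slide finds the most likely hidden state sequence for a sequence of observations, given start, transition and emission probabilities. The following is a generic textbook sketch with a toy two-state model in the usage note; none of the names or numbers come from the presentation.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations.

    start_p[s]: probability of starting in state s.
    trans_p[r][s]: probability of moving from state r to state s.
    emit_p[s][o]: probability that state s emits observation o.
    """
    # best[t][s] = (probability of the best path ending in s, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None)
             for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            prob, prev = max((best[-1][r][0] * trans_p[r][s], r)
                             for r in states)
            row[s] = (prob * emit_p[s][obs], prev)
        best.append(row)
    # backtrack from the most probable final state
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]
```

As a toy usage, take states `A` and `B` where `A` mostly emits `x` and `B` mostly emits `y`; an observation sequence ending in `y` pulls the final state toward `B`.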

  45. Phonetic Posteriorgrams • Each element represents the posterior probability of a specific phonetic class for a specific time frame. • Can be computed directly from the frame-based acoustic likelihood scores for each phonetic class at each time frame. • Time vs Class matrix representation

  46. Gaussian Posteriorgrams • Each dimension of the feature vector approximated by a weighted sum of Gaussians – GMM • Parameterized by the mixture weights, mean vectors, and covariance matrices • Gaussian posteriorgram is a probability vector representing the posterior probabilities of a set of Gaussian components for a speech frame • GMM can be computed over unsupervised training data instead of using a phonetic recognizer

  47. GP • S = (s_1, s_2, …, s_n) • GP(S) = (q_1, q_2, …, q_n) • q_i = (P(C_1|s_i), P(C_2|s_i), …, P(C_m|s_i)) • C_i – i-th Gaussian component of a GMM • m – number of Gaussian components • Difference between two GPs • D(p, q) = -log(p · q) • DTW is used to find low-distortion segments
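The distance D(p, q) = -log(p · q) between two posterior vectors is short enough to show directly. This sketch assumes p and q are plain lists of probabilities of equal length; returning infinity when the dot product is zero is an added convention, not something stated on the slide.

```python
import math

def posteriorgram_distance(p, q):
    """D(p, q) = -log(p . q) for two posterior probability vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    if dot <= 0:
        return math.inf      # assumption: no shared probability mass
    return -math.log(dot)
```

Two frames that put all their mass on the same component are at distance 0, while two uniform two-component vectors are at distance log 2.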
