## Speech Recognition

**Speech Recognition**
Mital Gandhi, Brian Romanowski

**Objective - Speech Recognition**
- Isolated Word Recognition
- Portable and Fast

**Recognition – Conceptually**
- Data Acquisition
- Training Hidden Markov Models for word set
- Recognition & Analysis

**Theory – Hidden Markov Models**
- Used to model semi-stationary random processes, like speech
- Example: cat = /k a t/

**Viterbi-based Recognition**
- Calculates the maximum log-likelihood of a series of observations given a particular HMM.
  - “Which model did this set of data most likely come from?”
- Saves time by calculating only a subset of possible paths through the HMM network.
  - At each new frame, only the most likely transition/observation state pairs are used.
- Concepts similar to Dynamic Time Warping

**System Components I: Volume Box**
- Sound Input
  - Amplifier
- Reference Voltage
  - Resistor network (Voltage Dividers)
  - Voltage followers
- Comparator
  - Microphone voltage vs. Reference
- Output
  - LED bargraph

**System Components II: Hidden Markov Model Toolkit (HTK)**
- Data Acquisition
- Data Preparation
- Parameter Enhancements
- Recognition & Analysis

**System Components II (cont.) HTK: Data Acquisition & Preparation**
- Data Acquisition
  - Recording using HSLab
  - Live audio input using HVite
- Data Preparation
  - External files: dictionary, config, word lists
  - Initialization of prototype models (HCompV)

**System Components II (cont.) HTK: Sample External Files**
- Config
- Prototype Model

**System Components II (cont.) HTK: Training & Recognition**
- HERest: parameter re-estimation and enhancement tool
  - Uses information from the energy, delta, & acceleration features in the cepstral domain
- HVite for Recognition
  - Recognition of pre-recorded files or live audio input
  - A host of external files to support the recognition
- Analysis tool HResults to compute accuracy & correctness results

**System Components II (cont.) HTK: Results & Analysis**
- HResults
  - Computes % values for recognition accuracy and correctness
- Results Analysis
  - %Corr = percentage of the N reference labels correctly recognized
  - Correctness does not penalize insertion errors; accuracy does

**System Components II (cont.) HTK: Preliminary Results**

    ====================== HTK Results Analysis ======================
    Date: Mon Sep 30 16:50:59 2002
    Ref : 4word_word.mlf
    Rec : recout.mlf
    ------------------------ Overall Results --------------------------
    SENT: %Correct=25.00 [H=1, S=3, N=4]
    WORD: %Corr=25.00, Acc=25.00 [H=9, D=0, S=3, I=0, N=12]
    ======================

**System Components II (cont.) HTK: Techniques, Solutions**
- Input File Specifications
  - Config
    - Cepstral mean subtraction, energy normalization
  - Prototype model
    - Number of states per word model
    - “Optimality” in transition probability assignments (matrix)
- Data
  - “Noise-free” data
  - As many tokens/samples of each word as possible for training

**DSP – System Overview**
- Initialization
- Threshold/Recording
- MFCC
- Viterbi
- Output

**DSP - Matlab**
- Prototype of all important algorithms
- Pre-calculated data
- Run-time altering of data (debugging)
- Downloading and visualization of data
- MFCCs

**DSP – Recording/Thresholding**
- Speech Input
- Process
  - Poll A/D for input data (TI-provided code used)
  - Take only one channel as input
  - Downsample
  - Save samples only when signal threshold has been crossed
    - Lead buffer
    - Tail buffer
- PROBLEMS
  - Sample transfer modes, single channel selection, threshold values, external microphones
- TESTING
  - Visual and audio inspection in Matlab

**DSP – MFCC calculation (1)**
- Thank you to Takuya Ooura for his public-domain FFT code.
- MFCCs provide a small, largely uncorrelated set of observation vectors for the HMMs
- Process:
  - Remove DC offset
  - Pre-emphasize
  - Hamming window
  - FFT magnitude
  - Mel-filter bank
  - DCT
  - Lifter

**DSP – MFCC calculation (2)**
- PROBLEMS:
  - An incorrectly coded pre-emphasis filter
- TESTING:
  - Graphically compared DSP-generated MFCCs to:
    - Matlab MFCCs -> DSP numerical issues
    - HTK MFCCs -> reference implementation

**DSP – Viterbi/Recognition**
- Uses HTK-derived HMMs whose data is contained in a Matlab-generated #include file
- PROBLEMS
  - Numerical concerns
  - Errors in deriving and coding the formulas

**Final Component Results I: HTK**
- Pre-recorded Files:

      ====================== HTK Results Analysis ======================
      Date: Mon Dec 02 11:37:46 2002
      Ref : testwords.mlf
      Rec : testwordsoutput.mlf
      ------------------------ Overall Results --------------------------
      SENT: %Correct=94.85 [H=92, S=5, N=97]
      WORD: %Corr=98.28, Acc=98.28 [H=286, D=0, S=5, I=0, N=291]
      ======================

- Live Audio Input: ~83%
- DSP MFCC Files: ~65%

**Final Component Results II: DSP**
- 95% recognition accuracy over 90 trials
  - 4 words
  - Trained speaker
- Speaker Independence
  - Indication of some recognition for non-modeled speakers, but not much
- Speech to decision takes approximately 0.88 seconds

**Challenges**
- Speed
- Complex project
- System integration
  - Microphone input
  - Volume Box
  - HTK
  - MATLAB & DSP

**Recommendations**
- HTK and DSP
  - Larger training corpus
  - Multiple Gaussian mixtures
  - Channel independence
  - Continuous Recognition
  - Real-time MFCC transmission from DSP to HTK
- DSP
  - Code style fixes
  - Better user interface

**Thank You**
- Dan Block, for use of his lab and equipment
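
The first three steps of the MFCC process listed in the DSP slides (remove the DC component, pre-emphasize, apply a Hamming window) can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's DSP or Matlab code: the function names are hypothetical, and the pre-emphasis coefficient of 0.97 is an assumed, commonly used value that the slides do not state.

```python
import math


def remove_dc(frame):
    """Subtract the frame mean so the samples are centred on zero."""
    mean = sum(frame) / len(frame)
    return [x - mean for x in frame]


def pre_emphasize(frame, coeff=0.97):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1].

    coeff=0.97 is an assumption, not a value taken from the slides.
    The first sample has no predecessor and is passed through unchanged,
    which is only one of several possible conventions.
    """
    return [frame[0]] + [
        frame[n] - coeff * frame[n - 1] for n in range(1, len(frame))
    ]


def hamming(frame):
    """Taper the frame edges with a Hamming window."""
    n_samples = len(frame)
    if n_samples == 1:
        return list(frame)
    return [
        x * (0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1)))
        for n, x in enumerate(frame)
    ]


def front_end(frame):
    """Time-domain steps that precede the FFT magnitude stage."""
    return hamming(pre_emphasize(remove_dc(frame)))
```

Keeping each step as a separate function makes it easier to compare intermediate outputs against a reference, which is the kind of stage-by-stage check the testing bullets describe.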
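
The Viterbi-based recognition described in the slides scores an observation sequence against each word's HMM and picks the best-scoring model. The sketch below shows that idea with discrete observation symbols. It is an assumption-laden toy: the project's HTK-derived models use continuous output densities, the two models and their probabilities are invented for illustration, and no path pruning is done here even though the slides mention it.

```python
import math

NEG_INF = float("-inf")


def log(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi_log_likelihood(obs, start, trans, emit):
    """Log-probability of the single best state path through one HMM.

    obs   : list of observation symbol indices
    start : start[i]    = P(first state is i)
    trans : trans[i][j] = P(state j at t+1 | state i at t)
    emit  : emit[i][k]  = P(symbol k | state i)
    """
    n_states = len(start)
    # Best log-score of any path ending in each state after the first frame.
    score = [log(start[i]) + log(emit[i][obs[0]]) for i in range(n_states)]
    for symbol in obs[1:]:
        score = [
            max(score[i] + log(trans[i][j]) for i in range(n_states))
            + log(emit[j][symbol])
            for j in range(n_states)
        ]
    return max(score)


def recognize(obs, models):
    """Return the name of the model with the highest Viterbi score."""
    return max(
        models, key=lambda name: viterbi_log_likelihood(obs, *models[name])
    )


# Two hypothetical left-to-right, two-state word models over symbols {0, 1}.
LEFT_TO_RIGHT = [[0.5, 0.5], [0.0, 1.0]]
MODELS = {
    "low_high": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.9, 0.1], [0.1, 0.9]]),
    "high_low": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.1, 0.9], [0.9, 0.1]]),
}

print(recognize([0, 0, 1, 1], MODELS))  # low_high
```

Working in log-probabilities turns the products along a path into sums, which avoids underflow on long utterances.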
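
The recording step saves samples only once the signal crosses a threshold, with lead and tail buffers keeping a little audio on either side of the trigger. A minimal sketch of that idea follows. The slides do not say how the buffers were implemented, so the function name, the stop condition (a run of quiet samples), and the default buffer lengths are all assumptions made for illustration.

```python
from collections import deque


def capture_utterance(samples, threshold, lead=3, tail=3):
    """Return the samples of the first utterance found in `samples`.

    Recording starts at the first sample whose magnitude reaches
    `threshold` and stops once `tail` consecutive quiet samples have
    followed. Up to `lead` samples from just before the trigger are
    kept as well. All parameter values are hypothetical.
    """
    lead_buffer = deque(maxlen=lead)  # most recent pre-trigger samples
    recorded = []
    quiet_run = 0
    recording = False
    for x in samples:
        loud = abs(x) >= threshold
        if not recording:
            if loud:
                recording = True
                recorded.extend(lead_buffer)
                recorded.append(x)
            else:
                lead_buffer.append(x)
        else:
            recorded.append(x)
            quiet_run = 0 if loud else quiet_run + 1
            if quiet_run >= tail:
                break
    return recorded
```

The lead buffer matters because the quiet onset of a word often falls below the trigger level; without it the start of the word would be clipped.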
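
The percentages in the HResults output follow from the bracketed counts: H hits, D deletions, S substitutions, I insertions, and N reference labels. The sketch below recomputes them using the standard HTK definitions, %Corr = H/N and Acc = (H − I)/N. The function name is hypothetical; the counts in the usage lines are the ones shown in the final pre-recorded results.

```python
def hresults_scores(hits, insertions, n_ref):
    """Return (%Corr, Acc) as HResults defines them.

    %Corr ignores insertion errors; Acc subtracts them from the hits.
    """
    percent_correct = 100.0 * hits / n_ref
    accuracy = 100.0 * (hits - insertions) / n_ref
    return percent_correct, accuracy


# Word-level counts from the final pre-recorded test: H=286, I=0, N=291.
corr, acc = hresults_scores(hits=286, insertions=0, n_ref=291)
print(f"{corr:.2f} {acc:.2f}")  # 98.28 98.28
```

With no insertions the two figures coincide, which is why the report shows the same value for both.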