## Speech Recognition


**Speech Recognition: Feature Extraction**

**Speech recognition simplified block diagram**
[Block diagram with the blocks: Training, Speech Capture, Feature Extraction, Models, Pattern Matching, Process Results, Text]

**Speech capture**
• Use a good-quality noise-cancelling microphone
• Use a bandwidth of 4 kHz for phone
• Use a bandwidth of 8 kHz for desktop
• Sample at 8 kHz or 16 kHz
• Anti-alias filter the input
• Avoid background noise
• Speak clearly but naturally

**Spectral Features**
• Need to extract the key frequency components
• These are visible in a spectrogram (2D real-time examples)

**Feature extraction**
• Need to extract the frequency content (spectrogram)
• Matching on raw data is inefficient
• Much of the data is redundant and carries no information
• Analyse the signal and extract the key features
• The same word spoken by different people looks very different in the time domain
• In the frequency domain, patterns are more evident
• Generally use Mel Frequency Cepstral Coefficients (MFCCs)

**The process**
MFCCs are short-term spectral features. They are calculated as follows:
• Divide the signal into frames
• For each frame, obtain the power spectrum
• Convert to the mel spectrum
• Take the natural logarithm
• Take the discrete cosine transform (DCT) to obtain the mel cepstrum

**Divide signal into frames**
• Apply a window function, typically a Hamming window
• Select about 25 ms of speech data and window it to cut it cleanly out of the data stream
• Shift the window by about 10 ms and repeat, continuously

**Now have a series of vectors being produced**
• If sampling at 8 kHz, the sample period is 125 µs
• Vector size = 25 ms / 125 µs = 25000 / 125 = 200-element array

**Feed the speech frame into an FFT to get the frequency components of that slice**
• Calculate the power spectrum for each element of the vector
• $s[k] = (\mathrm{Re}\,X[k])^2 + (\mathrm{Im}\,X[k])^2$, where $X[k]$ are the FFT coefficients
• Use a set of filters to split the spectrum into frequency bands
• Typically use mel-scale filters to approximate the frequency response of the basilar membrane
• Get the energy in each band
• Sphinx III uses 40 filters over an 8 kHz bandwidth

**Frequency response is non-linear**
• Mel (from "melody"): $m = 1127.01048 \ln(1 + f/700)$
• Inverse: $f = 700\,(e^{m/1127.01048} - 1)$
• Bark $= 13 \arctan(0.76 f / 1000) + 3.5 \arctan((f/7500)^2)$, with $f$ in Hz

**Calculate the mel spectrum**
• Multiply the power spectrum by each of the triangular mel weighting filters and integrate the result.

**Calculate the mel cepstrum**
• A DCT is applied to the natural logarithm of the mel spectrum to obtain the mel cepstrum.
• In the DCT, C is the number of cepstral coefficients required (n = 0 to 12 gives 13 for Sphinx III), L is the number of filter banks, and S[i] is the mel spectrum coefficient, one for each filter output.
• C is usually smaller than L, because the DCT compresses the spectrum so that the bulk of the information lies in the first few coefficients.
• Sphinx III uses 40 filters but keeps only the first 13 cepstral coefficients.
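
The framing step can be sketched in a few lines of Python. This is only an illustration using the slide's figures (8 kHz sampling, 25 ms frames, 10 ms shift); the function names `frame_length`, `hamming` and `split_into_frames` are made up for this example, and the 0.54/0.46 coefficients are the usual Hamming window definition rather than something stated in the slides.

```python
import math

def frame_length(sample_rate_hz, duration_ms):
    """Number of samples covering duration_ms at the given sample rate."""
    return int(round(sample_rate_hz * duration_ms / 1000.0))

def hamming(n):
    """Hamming window of length n (standard 0.54 - 0.46*cos form)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def split_into_frames(signal, sample_rate_hz=8000, frame_ms=25, shift_ms=10):
    """Cut the signal into overlapping, Hamming-windowed frames."""
    size = frame_length(sample_rate_hz, frame_ms)    # 200 samples at 8 kHz
    shift = frame_length(sample_rate_hz, shift_ms)   # 80 samples at 8 kHz
    window = hamming(size)
    frames = []
    # Only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(signal) - size + 1, shift):
        chunk = signal[start:start + size]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

# Hypothetical input: one second of a 440 Hz tone sampled at 8 kHz.
tone = [math.sin(2.0 * math.pi * 440.0 * n / 8000.0) for n in range(8000)]
frames = split_into_frames(tone)
print(len(frames), len(frames[0]))  # 98 200
```

Each frame is the 200-element vector from the slide's arithmetic, and consecutive frames overlap by 15 ms because the shift is shorter than the frame.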
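
The mel and Bark conversions can be checked numerically. The sketch below assumes the usual form of these formulas, with the frequency in Hz, division by 1127.01048 in the inverse, and f/1000 and f/7500 in the Bark expression; the function names are illustrative only.

```python
import math

MEL_SCALE = 1127.01048  # constant from the mel formula in the slides

def hz_to_mel(f_hz):
    """m = 1127.01048 * ln(1 + f/700)."""
    return MEL_SCALE * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping: f = 700 * (exp(m / 1127.01048) - 1)."""
    return 700.0 * (math.exp(m / MEL_SCALE) - 1.0)

def hz_to_bark(f_hz):
    """Bark = 13*atan(0.76*f/1000) + 3.5*atan((f/7500)^2), f in Hz."""
    return (13.0 * math.atan(0.76 * f_hz / 1000.0)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))

for f in (100.0, 1000.0, 4000.0):
    m = hz_to_mel(f)
    # The round trip through mel_to_hz should give back the input frequency.
    print(f, round(m, 1), round(mel_to_hz(m), 1), round(hz_to_bark(f), 2))
```

With these constants, 1000 Hz maps to almost exactly 1000 mel, which is a quick sanity check on the scale factor.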
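
The remaining steps for a single frame (power spectrum, mel filter bank, logarithm, DCT) could look roughly like this. It is a sketch under several assumptions, not the Sphinx III implementation: a direct DFT stands in for the FFT to keep the example dependency-free, the triangular filters are spaced evenly on the mel scale from 0 Hz to half the sample rate, the DCT is written as an unnormalised DCT-II (c[n] = Σ ln S[i] · cos(πn(i + 0.5)/L)), and a small floor is added before the logarithm. All function names are hypothetical.

```python
import cmath
import math

def hz_to_mel(f_hz):
    return 1127.01048 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (math.exp(m / 1127.01048) - 1.0)

def power_spectrum(frame):
    """s[k] = Re(X[k])^2 + Im(X[k])^2 for k = 0..N/2, using a direct DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(x_k.real ** 2 + x_k.imag ** 2)
    return spectrum

def mel_filterbank(num_filters, num_bins, sample_rate_hz):
    """Triangular weights; filter centres are equally spaced in mel."""
    nyquist = sample_rate_hz / 2.0
    top_mel = hz_to_mel(nyquist)
    # num_filters + 2 edge frequencies, expressed as fractional bin positions.
    edges = [mel_to_hz(top_mel * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [e / nyquist * (num_bins - 1) for e in edges]
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        weights = []
        for k in range(num_bins):
            if lo < k <= mid:
                weights.append((k - lo) / (mid - lo))   # rising edge
            elif mid < k < hi:
                weights.append((hi - k) / (hi - mid))   # falling edge
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def mel_cepstrum(frame, sample_rate_hz=8000, num_filters=40, num_coeffs=13):
    """Power spectrum -> mel spectrum -> natural log -> DCT, for one frame."""
    power = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(power), sample_rate_hz)
    mel_spectrum = [sum(w * p for w, p in zip(weights, power)) for weights in bank]
    log_mel = [math.log(s + 1e-10) for s in mel_spectrum]  # floor avoids log(0)
    L = num_filters
    return [sum(log_mel[i] * math.cos(math.pi * n * (i + 0.5) / L) for i in range(L))
            for n in range(num_coeffs)]

# Hypothetical 200-sample frame: a cosine that falls exactly on DFT bin 10.
frame = [math.cos(2.0 * math.pi * 10 * t / 200) for t in range(200)]
coeffs = mel_cepstrum(frame)
print(len(coeffs))  # 13
```

Keeping 13 of the 40 possible DCT outputs mirrors the Sphinx III figures quoted above; in practice an FFT routine from a numerical library would replace `power_spectrum`, and the frame would be Hamming-windowed first.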
