
Cepstrum and MFCC


pearlgreene

Presentation Transcript


  1. Cepstrum and MFCC Cepstrum MFCC Speech processing

  2. Cepstrum
     • A new word made by reversing the first 4 letters of "spectrum": spectrum → cepstrum.
     • It is the spectrum of the (log magnitude) spectrum of a signal.

  3. Cepstrum

  4. Glottis and cepstrum
     • Speech wave (X) = Excitation (E) . Filter (H); the output is labeled (S) in the figure.
     • (E): glottal excitation from the vocal cords (glottis). (H): vocal tract filter.
     • So voice has strong glottal excitation frequency content.
     • In the cepstrum we can easily identify and remove the glottal excitation.
     Figure: http://home.hib.no/al/engelsk/seksjon/SOFF-MASTER/ill061.gif

  5. Cepstral analysis
     • The signal s is the convolution (*) of the glottal excitation e and the vocal tract filter h:
       s(n) = e(n) * h(n), where n is the time index
     • After the Fourier transform (FT): FT{s(n)} = FT{e(n) * h(n)}
     • Convolution (*) becomes multiplication (.), and n (time) → w (frequency):
       S(w) = E(w) . H(w)
     • Take the magnitude of the spectrum:
       |S(w)| = |E(w)| . |H(w)|
     • Take the logarithm:
       log10 |S(w)| = log10 |E(w)| + log10 |H(w)|
     Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
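The chain of identities on this slide can be checked numerically. The sketch below is only an illustration: the impulse train `e`, the decaying exponential `h`, and the FFT size `n_fft` are assumed toy values, not signals from the presentation.

```python
import numpy as np

# Toy signals (assumed for illustration, not taken from the slides).
e = np.zeros(400)
e[::80] = 1.0                 # impulse train standing in for the glottal excitation e(n)
h = 0.9 ** np.arange(50)      # decaying exponential standing in for the vocal tract filter h(n)

s = np.convolve(e, h)         # s(n) = e(n) * h(n), length 400 + 50 - 1 = 449

n_fft = 512                   # at least len(s), so the DFT product matches the linear convolution
E = np.fft.rfft(e, n_fft)
H = np.fft.rfft(h, n_fft)
S = np.fft.rfft(s, n_fft)

log_S = np.log10(np.abs(S))
log_sum = np.log10(np.abs(E)) + np.log10(np.abs(H))

# Convolution in time is multiplication in frequency ...
print(np.allclose(S, E * H))
# ... and multiplication becomes addition after taking the log magnitude.
print(np.allclose(log_S, log_sum))
```

Zero-padding to `n_fft` matters here: with a DFT shorter than the convolved signal, the product E(w).H(w) would correspond to circular rather than linear convolution.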

  6. Cepstrum
     • Processing chain: s(n) → windowing → x(n) → DFT → X(w) → log|X(w)| → IDFT → c(n)
     • n = time index, w = frequency, IDFT = inverse discrete Fourier transform
     • c(n) = IDFT[log10 |S(w)|] = IDFT[log10 |E(w)| + log10 |H(w)|]
     • In c(n), the contributions of e(n) and h(n) appear at two different positions
     • Application: useful for analysis of (i) the glottal excitation and (ii) the vocal tract filter
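A minimal sketch of this processing chain, assuming a Hamming window (the slide does not name the window type) and a small `eps` added before the logarithm to avoid log(0). The function name `real_cepstrum` is hypothetical.

```python
import numpy as np

def real_cepstrum(frame, eps=1e-12):
    """Window one frame, take the DFT, the log10 magnitude, then the inverse DFT."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))      # windowing (Hamming is an assumption)
    spectrum = np.fft.fft(windowed)                # DFT
    log_mag = np.log10(np.abs(spectrum) + eps)     # log10 |X(w)|
    return np.fft.ifft(log_mag).real               # IDFT gives c(n)

# Usage on an arbitrary test frame (random noise, for shape checking only).
x = np.random.default_rng(0).standard_normal(256)
c = real_cepstrum(x)
print(c.shape)
```

Because the log magnitude spectrum of a real signal is real and even, the result is real and symmetric about n = 0, so only the first half of `c` carries independent information.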

  7. Cepstrum for pitch detection
     • The theory behind the cepstral detector is that the Fourier transform of a pitched signal usually has a number of regularly spaced peaks, which represent the harmonic spectrum.
     • When the log magnitude of the spectrum is taken, these peaks are reduced (their amplitude is brought into a usable scale).
     • The result is a periodic waveform in the frequency domain, whose period is related to the fundamental frequency of the original signal.
     • This means that a Fourier transform of this waveform has a peak representing the fundamental frequency; the peak sits at the pitch period, the reciprocal of the fundamental frequency.
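The detector described on this slide can be sketched as follows. Everything specific here is an assumption for illustration: the function name `cepstral_pitch`, the 60 to 400 Hz search range, the Hamming window, and the synthetic test frame (a 100 Hz impulse train passed through a single damped resonance).

```python
import numpy as np

def cepstral_pitch(frame, fs, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) of one frame from its cepstral peak."""
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log10(np.abs(np.fft.fft(windowed)) + 1e-12)
    cepstrum = np.fft.ifft(log_mag).real
    q_min = int(fs / f_max)    # shortest pitch period searched, in samples
    q_max = int(fs / f_min)    # longest pitch period searched, in samples
    peak = q_min + np.argmax(cepstrum[q_min:q_max + 1])
    return fs / peak           # frequency = fs / quefrency

# Synthetic voiced frame: one impulse every 80 samples at fs = 8000 Hz (100 Hz pitch).
fs = 8000
excitation = np.zeros(1024)
excitation[::80] = 1.0
n = np.arange(100)
h = 0.95 ** n * np.cos(2 * np.pi * 700 * n / fs)   # toy vocal tract resonance
frame = np.convolve(excitation, h)[:1024]

f0 = cepstral_pitch(frame, fs)
print(round(f0, 1))
```

Restricting the search to a plausible pitch range keeps the large low-time values (vocal tract contribution) from being mistaken for the pitch peak.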

  8. MFCC
     • MFCC (mel-frequency cepstral coefficients) is an efficient speech feature based on human hearing perception, i.e. it is based on the known variation of the human ear's critical bandwidth with frequency.

  9. MFCC (cont'd)

  10. MFCC
     • If x(n) is the input signal, then the short-time Fourier transform for frame a is given by [equation not preserved in the transcript].
     • [Expression not preserved in the transcript] is called the power spectrum, and it is passed through the triangular filters of the mel-frequency filter bank.

  11. MFCC (cont'd)
     • Human perception of the frequency content of speech sounds does not follow a linear scale.
     • Therefore, for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the mel scale. The mel frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. To compute the mels for a given frequency f in Hz, the following approximate formula is used:
       Mel(f) = Sk = 2595 * log10(1 + f/700)
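The formula translates directly into code. In the sketch below, `hz_to_mel` is the slide's formula; `mel_to_hz` is its algebraic inverse, which is not on the slide and is included only as a convenience. Both function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Mel(f) = 2595 * log10(1 + f/700), as given on the slide."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Algebraic inverse of hz_to_mel (derived, not from the slide)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

for f in (100, 500, 1000, 4000, 8000):
    print(f, round(hz_to_mel(f), 1))
```

With these constants a 1000 Hz tone maps to almost exactly 1000 mels, which matches the reference point used to define the scale.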

  12. MFCC (cont'd)
     • The subjective spectrum is simulated with a filter bank, one filter for each desired mel-frequency component. Each filter has a triangular band-pass frequency response, and the spacing as well as the bandwidth are determined by a constant mel-frequency interval.
     • Furthermore, the log mel spectrum is converted back to the time domain: the discrete cosine transform (DCT) of the logarithm of S(m) is calculated to find the MFCC [equation not preserved in the transcript].
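A rough sketch of the filter bank and DCT steps for a single frame. Since the slide's own MFCC equation is not preserved, the DCT-II form used here is an assumption (a commonly used choice), as are the function names, the Hamming window, and the values `n_filters=20` and `n_ceps=12`.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    bank = np.zeros((n_filters, len(freqs)))
    for m in range(n_filters):
        lo, mid, hi = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc_from_frame(frame, fs, n_filters=20, n_ceps=12):
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2   # power spectrum
    energies = mel_filterbank(n_filters, n_fft, fs) @ power       # S(m), one value per filter
    log_e = np.log10(energies + 1e-12)
    # DCT-II of the log filter outputs (assumed form), skipping the 0th coefficient.
    k = np.arange(n_filters)
    n = np.arange(1, n_ceps + 1)[:, None]
    basis = np.cos(np.pi * n * (k + 0.5) / n_filters)
    return basis @ log_e

fs = 8000
t = np.arange(256) / fs
frame = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
coeffs = mfcc_from_frame(frame, fs)
bank = mel_filterbank(20, 256, fs)
print(coeffs.shape, bank.shape)
```

Library implementations differ in details such as filter normalization, the logarithm base, and DCT scaling, so the absolute values from this sketch should not be expected to match any particular toolkit.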

  13. Filtering
     • Ways to find the spectral envelope
     • Filter banks: uniform
     • Filter banks can also be non-uniform
     Figure: spectral envelope (energy versus frequency), sampled by the outputs of filters 1 to 4.

  14. Filtering method
     • For each frame (e.g. 10 to 30 ms), a set of filter outputs is calculated (e.g. with a frame overlap of 5 ms).
     • There are many different methods for setting the filter bandwidths: uniform or non-uniform.
     Figure: input waveform divided into 30 ms time frames i, i+1, i+2 with 5 ms overlap; the frames yield filter outputs (v1,v2,…), (v'1,v'2,…), (v''1,v''2,…).
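The framing step can be sketched as below, using the 30 ms frame length and 5 ms overlap from the slide's figure. The sampling rate of 8000 Hz, the function name `split_into_frames`, and the choice to drop a trailing partial frame are assumptions.

```python
def split_into_frames(signal, fs, frame_ms=30, overlap_ms=5):
    """Cut a signal into fixed-length frames that overlap by overlap_ms."""
    frame_len = fs * frame_ms // 1000          # samples per frame
    hop = frame_len - fs * overlap_ms // 1000  # samples between frame starts
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

fs = 8000                      # assumed sampling rate
signal = list(range(fs))       # one second of dummy samples
frames = split_into_frames(signal, fs)
print(len(frames), len(frames[0]))
```

Each frame would then be passed to the filter bank to produce one vector of filter outputs per frame.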

  15. How to determine filter band ranges
     • Uniform filter banks
     • Log frequency banks
     • Mel filter bands

  16. Uniform filter banks
     • Bandwidth B = sampling frequency (Fs) / number of banks (N)
     • For example, Fs = 10 kHz and N = 20 give B = 500 Hz
     • Simple to implement but not too useful
     Figure: filter outputs v1, v2, v3, ... for filters 1, 2, 3, 4, 5, ... along the frequency axis at 500 Hz, 1 kHz, 1.5 kHz, 2 kHz, 2.5 kHz, 3 kHz, ...
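A small sketch of the band layout for the slide's example. It takes B = Fs/N exactly as the slide states it; the function name `uniform_bands` and the choice to start the first band at 0 Hz are assumptions.

```python
def uniform_bands(fs, n_banks):
    """Return (low, high) edges in Hz for n_banks equal-width bands of width fs / n_banks."""
    bandwidth = fs / n_banks
    return [(k * bandwidth, (k + 1) * bandwidth) for k in range(n_banks)]

bands = uniform_bands(10_000, 20)   # Fs = 10 kHz, N = 20
print(bands[:3])
```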

  17. Non-uniform filter banks: log frequency
     • Log frequency scale: close to the human ear
     Figure: filter outputs v1, v2, v3, ... along the frequency axis at 200, 400, 800, 1600, 3200 Hz.
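The tick values in the figure double at each step, so one way to sketch such a bank is with octave-spaced band edges. Treating the ticks as band edges, and the function name `octave_band_edges`, are assumptions made for illustration.

```python
def octave_band_edges(f_start, n_edges):
    """Band edges in Hz that double at every step (equal spacing on a log axis)."""
    return [f_start * 2 ** k for k in range(n_edges)]

edges = octave_band_edges(200, 5)
bands = list(zip(edges[:-1], edges[1:]))   # consecutive edges form one band each
print(edges)
print(bands)
```

Unlike the uniform case, each band here is twice as wide as the one before it, so resolution is finer at low frequencies.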

  18. Inner ear and the cochlea (humans also have filter bands)
     • Ear and cochlea
     Figures: http://universe-review.ca/I10-85-cochlea2.jpg and http://www.edu.ipa.go.jp/chiyo/HuBEd/HTML1/en/3D/ear.html

  19. Mel filter bands (found by psychological and instrumentation experiments)
     • Frequencies lower than 1 kHz have narrower bands (on a linear scale)
     • Higher frequencies have wider bands (on a log scale)
     • More filters below 1 kHz
     • Fewer filters above 1 kHz
     Figure (filter output versus frequency): http://instruct1.cit.cornell.edu/courses/ece576/FinalProjects/f2008/pae26_jsc59/pae26_jsc59/images/melfilt.png

  20. Mel scale (melody scale), from http://en.wikipedia.org/wiki/Mel_scale
     • Measures the relative strength in perception of different frequencies.
     • The mel scale, named by Stevens, Volkmann and Newman in 1937, is a perceptual scale of pitches judged by listeners to be equal in distance from one another. The reference point between this scale and normal frequency measurement is defined by assigning a perceptual pitch of 1000 mels to a 1000 Hz tone, 40 dB above the listener's threshold. … The name mel comes from the word melody to indicate that the scale is based on pitch comparisons.

  21. Critical band scale: mel scale
     • Based on perceptual studies
     • Log scale when the frequency is above 1 kHz
     • Linear scale when the frequency is below 1 kHz
     • Popular scales are the "Mel" (stands for melody) or "Bark" scales
     Figure: mel scale (m) versus frequency in Hz (f). Below 1 kHz, Δf ≈ Δm (linear); above 1 kHz, Δf > Δm (log scale).
     Ref: http://en.wikipedia.org/wiki/Mel_scale

  22. Worked examples:
     • Exercise 1: When the input frequency ranges from 200 to 800 Hz (Δf = 600 Hz), what is the delta mel (Δm) on the mel scale?
     • Exercise 2: When the input frequency ranges from 6000 to 7000 Hz (Δf = 1000 Hz), what is the delta mel (Δm) on the mel scale?

  23. Worked examples:
     • Answer 1: Δm is also about 600 mels, because the scale is approximately linear below 1 kHz (the formula used in Answer 2 gives about 576 mels).
     • Answer 2: By observation, in the mel scale diagram the range runs from about 2600 to 2750 mels, so Δm ≈ 150. It is a log scale change. We can recalculate the result using the formula M = 2595 log10(1 + f/700):
       M_low = 2595 log10(1 + f_low/700) = 2595 log10(1 + 6000/700)
       M_high = 2595 log10(1 + f_high/700) = 2595 log10(1 + 7000/700)
       Δm = M_high - M_low = 156.7793 (agrees with the observation; the mel scale is a log scale in this range)
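Both answers can be checked with a few lines of code using the same formula. The helper names `hz_to_mel` and `delta_mel` are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def delta_mel(f_low, f_high):
    """Width in mels of the band from f_low to f_high (both in Hz)."""
    return hz_to_mel(f_high) - hz_to_mel(f_low)

d1 = delta_mel(200, 800)      # Exercise 1: 600 Hz wide, below 1 kHz
d2 = delta_mel(6000, 7000)    # Exercise 2: 1000 Hz wide, above 1 kHz
print(f"{d1:.4f} {d2:.4f}")
```

The wider band in Hz (Exercise 2) comes out much narrower in mels, which is the compression at high frequencies that the mel scale is meant to capture.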

  24. Example of cepstrum
     • http://www.cse.cuhk.edu.hk/%7Ekhwong/www2/cmsc5707/demo_for_ch4_cepstrum.zip
     • Run spCepstrumDemo in MATLAB
     • 'sor1.wav': sampling frequency 22.05 kHz

  25. Figure panels:
     • s(n): time domain signal
     • x(n) = windowed(s(n)): the window suppresses the two ends of the frame
     • |X(w)| = |dft(x(n))|: frequency domain signal (dft = discrete Fourier transform)
     • log(|X(w)|)
     • c(n) = idft(log(|X(w)|)) gives the cepstrum, in which the vocal tract cepstrum and the glottal excitation cepstrum can be seen
     Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1

  26. Liftering (to remove glottal excitation)
     • Low-time liftering: magnify (or inspect) the low-time part to find the vocal tract filter cepstrum, which is used for speech recognition.
     • High-time liftering: magnify (or inspect) the high-time part to find the glottal excitation cepstrum, which is useless for speech recognition (remove this part for speech recognition).
     • The cut-off between the two parts is found by experiment.
     • Frequency = FS / quefrency (quefrency in samples), where FS = sample frequency = 22050

  27. Reasons for liftering: cepstrum of speech
     • Why do we need this?
     • Answer: to remove the ripples of the spectrum caused by glottal excitation.
     • There are too many ripples in the spectrum caused by vocal cord vibrations (glottal excitation), but we are more interested in the speech envelope for recognition and reproduction.
     Figure: input speech signal x → Fourier transform → spectrum of x.
     Ref: http://isdl.ee.washington.edu/people/stevenschimmel/sphsc503/files/notes10.pdf

  28. Liftering method: select the high-time and low-time liftering
     Figure: signal X → cepstrum → select high time (C_high) and select low time (C_low).
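A minimal sketch of this selection step, assuming a simple rectangular lifter. The function name `lifter`, the cut-off of 30 samples, and the synthetic test frame are illustrative assumptions; the slides state only that the cut-off is found by experiment.

```python
import numpy as np

def lifter(cepstrum, cutoff):
    """Split a real cepstrum into low-time and high-time parts at `cutoff` samples."""
    n = len(cepstrum)
    idx = np.arange(n)
    quefrency = np.minimum(idx, n - idx)     # distance from n = 0, respecting symmetry
    low_mask = quefrency < cutoff
    c_low = np.where(low_mask, cepstrum, 0.0)    # vocal tract part
    c_high = np.where(low_mask, 0.0, cepstrum)   # glottal excitation part
    return c_low, c_high

# Synthetic voiced frame (assumed): 100 Hz impulse train through one damped resonance.
fs = 8000
excitation = np.zeros(1024)
excitation[::80] = 1.0
t = np.arange(100)
h = 0.95 ** t * np.cos(2 * np.pi * 700 * t / fs)
frame = np.convolve(excitation, h)[:1024] * np.hamming(1024)

log_mag = np.log10(np.abs(np.fft.fft(frame)) + 1e-12)
cepstrum = np.fft.ifft(log_mag).real

c_low, c_high = lifter(cepstrum, cutoff=30)
envelope = np.fft.fft(c_low).real   # smoothed log spectrum without the pitch ripples
print(envelope.shape)
```

Transforming `c_low` back to the frequency domain gives a smooth spectral envelope, while `c_high` retains the pitch-related peaks.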
