
Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology

Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology. Mark Hasegawa-Johnson jhasegaw@uiuc.edu University of Illinois at Urbana-Champaign, USA. Lecture 6: Speech Recognition Acoustic & Auditory Model Features.



  1. Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology. Mark Hasegawa-Johnson, jhasegaw@uiuc.edu, University of Illinois at Urbana-Champaign, USA

  2. Lecture 6: Speech Recognition Acoustic & Auditory Model Features • Log spectral features: log FFT, cepstrum, MFCC • Time-domain features: energy, zero crossing rate, autocorrelation • Model-based features: LPC, LPCC, PLP • Modulation filtering: cepstral mean subtraction, RASTA • Auditory model based features: auditory spectrogram, correlogram, summary correlogram

  3. Log Magnitude STFT

  4. The Problem with FFT: Euclidean Distance ≠ Perceptual Distance

  5. The “Complex Cepstrum” Cepstrum = Even Part of Complex Cepstrum

  6. Euclidean Distance Between Two Spectra = Cepstral Distance… … but Windowed Cepstral Distance = Distance Between Smoothed Spectra

  7. Cepstrally smoothed spectra

  8. Short-Time Fourier Transform = Filterbank with Uniformly Spaced Bands

  9. How to Implement Non-Uniform Filters Using the STFT

  10. Mel-Scale Bandpass Filters

  11. The Mel Frequency Scale: Humans Can Distinguish Tones 3 Mel Apart

  12. The Bark Scale (a.k.a. “Critical Band Scale”): Noise Within 1 Bark Can “Mask” a Tone

  13. Bark-Scale Warped Spectrum

  14. Mel-Scale Spectral Coefficients (MFSC)

  15. Mel-Scale Spectra of Music (Petruncio, B.S. Thesis, University of Illinois, 2003). Figure panels: Piano, Saxophone, Tenor Opera Singer, Drums

  16. Mel-Scale Cepstral Coefficients (MFCC)
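
Slides 8 through 16 describe a pipeline: STFT power spectrum, mel-spaced bandpass filters, log (MFSC), then a cosine transform (MFCC). The sketch below is one minimal way to write that pipeline, not the lecture's own implementation. The mel formula, the triangular filter shape, the filter count, and the names `hz_to_mel`, `mel_to_hz`, `mel_filterbank`, and `mfcc` are all assumptions made for illustration.

```python
import numpy as np

def hz_to_mel(f_hz):
    # One common mel formula (assumption: the slides do not say which one they use).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters whose edges are uniformly spaced on the mel axis.
    top_mel = float(hz_to_mel(fs / 2.0))
    edges_hz = mel_to_hz(np.linspace(0.0, top_mel, n_filters + 2))
    bin_hz = np.arange(n_fft // 2 + 1) * fs / n_fft
    bank = np.zeros((n_filters, bin_hz.size))
    for m in range(n_filters):
        lo, mid, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(frame, fs, n_filters=24, n_ceps=13):
    # Returns (MFSC, MFCC) for a single frame.
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    mfsc = np.log(mel_filterbank(n_filters, n_fft, fs) @ power + 1e-12)
    # Unnormalized DCT-II of the log mel spectrum gives the cepstral coefficients.
    m = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), m + 0.5) / n_filters)
    return mfsc, basis @ mfsc
```

With this unnormalized cosine basis, the zeroth coefficient is simply the sum of the log mel-band energies, which is why it behaves like an overall log-energy term.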

  17. MFCC of Music (Petruncio, 2003). Figure panels: Piano, Saxophone, Tenor Opera Singer, Drums

  18. Time-Domain Features

  19. “Time-Domain Features” = Features that can be computed frequently (e.g., once/millisecond) • Energy-based features: energy, sub-band energies • Low-order cepstral features: energy, spectral tilt, spectral centrality • Zero-crossing rate • Spectral flatness • Autocorrelation

  20. Example: 3 Features/1 ms (Niyogi and Burges, 2002). Figure panels: Waveform, Energy, HF Energy, Spectral Flatness, Stop-Detection SVM (Target, Output). Figure from Niyogi & Burges, 2002

  21. Short-Time Analysis: First, Window with Overlapping Windows

  22. Energy-Based Features • Filter the signal, to get the desired band • [0, 400] Hz: is the signal voiced? (doesn’t work for telephone speech) • [300, 1000] Hz: is the signal sonorant? • [1000, 3000] Hz: distinguish nasals from glides • [2000, 6000] Hz: detect frication energy • Full band (no filtering): syllable detection • Window with a short window (4-6 ms in length) • Compute the energy:
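
The three steps on slide 22 (band-limit, short window, sum of squares) can be sketched as follows. This is an illustration under stated assumptions: the band-limiting uses an ideal FFT mask rather than any particular filter design, the window is a Hamming window, and the function name `band_energy` and its parameters are hypothetical.

```python
import numpy as np

def band_energy(signal, fs, band_hz, win_ms=5.0, hop_ms=1.0):
    # Step 1: keep only the desired band (ideal FFT mask, an illustrative shortcut).
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    lo, hi = band_hz
    spectrum[(freqs < lo) | (freqs > hi)] = 0.0
    filtered = np.fft.irfft(spectrum, n=len(signal))
    # Step 2: short overlapping windows (5 ms long, advanced by 1 ms by default).
    win = int(round(win_ms * 1e-3 * fs))
    hop = int(round(hop_ms * 1e-3 * fs))
    window = np.hamming(win)
    starts = range(0, len(filtered) - win + 1, hop)
    # Step 3: energy = sum of squared windowed samples in each frame.
    return np.array([np.sum((filtered[s:s + win] * window) ** 2) for s in starts])
```

For example, `band_energy(x, 16000, (2000.0, 6000.0))` would give a once-per-millisecond frication-band energy track for a signal `x` sampled at 16 kHz.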

  23. Cepstrum-Based Features • Average(log(energy)) = c[0] • $c[0] = \int \log|X(\omega)|\,d\omega = \tfrac{1}{2}\int \log|X(\omega)|^2\,d\omega$ • Not the same as log(average(energy)), which is $\log \int |X(\omega)|^2\,d\omega$ • Spectral Tilt: one measure is $-c[1]$ • $-c[1] = -\int \log|X(\omega)|\cos(\omega)\,d\omega \approx$ HF log energy – LF log energy • A More Universally Accepted Measure: • Spectral Tilt = $\int (\omega - \pi/2)\,\log|X(\omega)|\,d\omega$ • Spectral Centrality: $-c[2]$ • $-c[2] = -\int \log|X(\omega)|\cos(2\omega)\,d\omega$ • $-c[2] \approx$ Mid Frequency Energy ($\pi/4$ to $3\pi/4$) – Low and High Frequency Energy (0 to $\pi/4$ and $3\pi/4$ to $\pi$)
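
The low-order cepstral features on slide 23 are projections of the log magnitude spectrum onto cos(0), cos(ω), and cos(2ω). The sketch below approximates each integral by an average over FFT bins between 0 and π. The normalization, the small floor added before the logarithm, and the name `low_order_cepstral_features` are assumptions for illustration.

```python
import numpy as np

def low_order_cepstral_features(frame):
    # Log magnitude spectrum on a grid of frequencies from 0 to pi.
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-12)
    omega = np.linspace(0.0, np.pi, log_mag.size)
    c0 = np.mean(log_mag)                      # average log magnitude
    c1 = np.mean(log_mag * np.cos(omega))      # positive when low frequencies dominate
    c2 = np.mean(log_mag * np.cos(2 * omega))  # positive when band edges dominate
    return {"avg_log_energy": c0, "tilt": -c1, "centrality": -c2}
```

A decaying exponential such as `0.9 ** n` is low-pass and gives a negative tilt; the alternating version `(-0.9) ** n` is high-pass and gives a positive tilt.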

  24. Measures of Turbulence • Zero Crossing Rate: • Count the number of times that the signal crosses zero in one window. Many: frication. Some: sonorant. Few: silence. • A related measure, used often in speech coding: “alternation rate” = the number of times the derivative crosses zero • Spectral Flatness: • average(log(energy)) – log(average(energy)) • Equal to zero if spectrum is flat (white noise, e.g., frication) • Negative if spectrum is peaky (e.g., vowels)
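
Both turbulence measures on slide 24 fit in a few lines. This is a minimal sketch: the function names are hypothetical, the spectral flatness is computed from the FFT power spectrum of a single frame, and a small floor is added to the power so the logarithm stays finite.

```python
import numpy as np

def zero_crossing_count(frame):
    # Number of sign changes between consecutive samples in one window.
    signs = np.signbit(frame)
    return int(np.count_nonzero(signs[1:] != signs[:-1]))

def spectral_flatness(frame):
    # average(log(energy)) - log(average(energy)); zero for a flat spectrum,
    # negative for a peaky one.
    power = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12
    return float(np.mean(np.log(power)) - np.log(np.mean(power)))
```

A unit impulse has a perfectly flat spectrum and gives a flatness of zero, while a pure sinusoid gives a strongly negative value.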

  25. Autocorrelation • Autocorrelation: measures the similarity of the signal to a delayed version of itself • Sonorant (low-frequency) signals: R[1] is large • Fricative (high-frequency) signals: R[1] is small or negative • R[0] is the energy • -R[0] ≤ R[k] ≤ R[0] for all k
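
The properties listed on slide 25 are easy to check numerically. The sketch below uses the plain (unnormalized) short-time autocorrelation of one frame; the function name `autocorrelation` is an illustrative choice.

```python
import numpy as np

def autocorrelation(frame, max_lag):
    # R[k] = sum over n of frame[n] * frame[n + k], for k = 0 .. max_lag.
    frame = np.asarray(frame, dtype=float)
    return np.array([np.dot(frame[:len(frame) - k], frame[k:])
                     for k in range(max_lag + 1)])
```

For a slowly varying sinusoid R[1] is close to R[0], while for a sample-by-sample alternating signal R[1] is negative, matching the sonorant and fricative cases on the slide.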

  26. Model-Based Features: LPC, LPCC, PLP

  27. During Vowels and Glides, VT Transfer Function is All-Pole (All-Pole Model sometimes OK at other times too)

  28. Finding LPC Coefficients: Solve the “Normal Equations” • LPC Filter Prediction of $s[n]$ is $\sum_k a_k\, s[n-k]$. Error is $E_n$: • $a_k$ minimize the error if they solve the Normal Equations:
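
The equations themselves did not survive in this transcript, so the sketch below assumes the common autocorrelation form of the normal equations, in which the matrix of autocorrelations R[|i-k|] times the coefficient vector equals the vector of R[i], for i and k from 1 to p. The lecture may instead use the covariance form; treat the function name `lpc_normal_equations` and this formulation as assumptions.

```python
import numpy as np

def lpc_normal_equations(frame, order):
    frame = np.asarray(frame, dtype=float)
    # Short-time autocorrelation R[0..order].
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    # Toeplitz matrix with entries R[|i - k|].
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[lags], r[1:])
    # Minimum prediction error energy for these coefficients.
    error = r[0] - np.dot(a, r[1:])
    return a, error
```

Applied to the impulse response of a known two-pole filter, this recovers the filter's own coefficients, which is a convenient sanity check.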

  29. Roots of the LPC Polynomial • Roots of the LPC Polynomial: • Roots include: • Complex pole pair at most formant frequencies, $r_k$ and $r_k^*$ • In a vowel or glide, there are additional poles at zero frequency: • One or two with bandwidth ≈ 100-300 Hz; these give a negative tilt to the entire spectrum • One or two with bandwidth ≈ 2000-3000 Hz; these attenuate high frequencies • In a fricative: poles may be at $\omega=\pi$, causing the whole spectrum to be high-pass

  30. Reflection Coefficients • LPC Speech Synthesis Filter can be implemented using a reflection line. This reflection line is mathematically equivalent to a p-tube model of the vocal tract: • PARCOR coefficients (= reflection coefficients) are found using the Levinson-Durbin recursion:
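
The recursion named on slide 30 is not reproduced in this transcript. The sketch below is a standard textbook form of Levinson-Durbin, written for the predictor convention in which the prediction of s[n] is the sum of a_k s[n-k]. Sign conventions for PARCOR coefficients differ between references, so the signs returned here are an assumption, as is the function name `levinson_durbin`.

```python
import numpy as np

def levinson_durbin(r, order):
    # r: autocorrelation values R[0..order].
    r = np.asarray(r, dtype=float)
    a = np.zeros(order + 1)   # a[1..order] hold predictor coefficients; a[0] is unused
    k = np.zeros(order + 1)   # k[1..order] hold reflection (PARCOR) coefficients
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k[i] = acc / error
        new_a = a.copy()
        new_a[i] = k[i]
        # Update lower-order coefficients using the reversed previous solution.
        new_a[1:i] = a[1:i] - k[i] * a[i - 1:0:-1]
        a = new_a
        error *= 1.0 - k[i] ** 2
    return a[1:], k[1:], error
```

Each reflection coefficient has magnitude below one whenever the autocorrelation comes from a real signal, which is what keeps the equivalent tube model (and the synthesis filter) stable.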

  31. LAR and LSF • Log Area Ratio (LAR) is a bilinear transform of the reflection coefficients: • Line Spectral Frequencies (LSF) are the resonances of two lossless vocal tract models. Set $U(0,j\Omega)=0$ at glottis; result is P(z). Set $P(0,j\Omega)=0$ at glottis; result is Q(z). (Hasegawa-Johnson, JASA 2000)

  32. LSFs Tend to Track Formants • When LPC finds the formants (during vowels), the roots of P(z) and the roots of Q(z) each tend to “bracket” one formant, with a Q(z) root below, and a P(z) root above. • When LPC can’t find the formants (e.g., aspiration), LSFs interpolate between neighboring syllables

  33. LPC Cepstrum: Efficient Recursive Formula

  34. LPC Cepstrum: Efficient Recursive Formula
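
The recursive formula on slides 33 and 34 is missing from this transcript. The sketch below uses the standard recursion for an all-pole model 1/A(z) with A(z) = 1 - sum of a_k z^(-k), which is an assumption about the lecture's convention. The gain term c[0] is omitted, and the name `lpc_to_cepstrum` is hypothetical.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    # a[j - 1] holds the predictor coefficient a_j, for j = 1 .. p.
    p = len(a)
    c = np.zeros(n_ceps + 1)  # c[0] (log gain) is not computed here
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        # c[n] = a_n + sum over k of (k / n) * c[k] * a_{n-k}, with 1 <= n - k <= p.
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```

For a single real pole at z = 0.5 the result is 0.5**n / n, matching the power-series expansion of -log(1 - 0.5 z^(-1)); no FFT or logarithm is needed, which is the efficiency the slide title refers to.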

  35. Perceptual LPC (Hermansky, J. Acoust. Soc. Am., 1990) • First, warp the spectrum to a Bark scale: • The filters, $H_b(k)$, are uniformly spaced in Bark frequency. Their amplitudes are scaled by the equal-loudness contour (an estimate of how loud each frequency sounds):

  36. Perceptual LPC • Second, compute the cube-root of the power spectrum • Cube root replaces the logarithm that would be used in MFCC • Loudness of a tone is proportional to cube root of its power: $Y(b) = S(b)^{0.33}$ • Third, inverse Fourier transform to find the “Perceptual Autocorrelation:”

  37. Perceptual LPC • Fourth, use Normal Equations to find the Perceptual LPC (PLP) coefficients: • Fifth, use the LPC Cepstral recursion to find Perceptual LPC Cepstrum (PLPCC):

  38. Modulation Filtering: Cepstral Mean Subtraction, RASTA

  39. Reverberation • Reverberation adds echoes to the recorded signal: • Reverberation is a linear filter: $x[n] = \sum_{k=0}^{\infty} a_k\, s[n-d_k]$ • If $a_k$ dies away fast enough ($a_k \approx 0$ for $d_k > N$, the STFT window length), we can model reverberation in the STFT frequency domain: $X(z) = R(z)\,S(z)$ • Usually, STFT frequency-domain modeling of reverberation works for • Electric echoes (e.g., from the telephone network) • Handset echoes (e.g., from the chin of the speaker) • But NOT for free-field echoes (e.g., from the walls of a room, recorded by a desktop microphone)

  40. Reverberation: Recorded and Simulated Room Response

  41. Cepstral Mean Subtraction: Subtract out Short-Term Reverb • Log Magnitude Spectrum: Constant Filter → Constant Additive Term • Reverberation R(z) is Constant during the whole sentence • Therefore: Subtract the average value from each frame’s cepstrum → log R(z) is completely subtracted away • Warning: if the utterance is too short (contains too few phonemes), CMS will remove useful phonetic information!
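
Cepstral mean subtraction as described on slide 41 amounts to removing the per-coefficient average over the utterance. A minimal sketch, assuming the cepstra are stored as a frames-by-coefficients array (the function name and layout are illustrative):

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    # cepstra: array of shape (n_frames, n_coefficients) for one utterance.
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

Adding any constant vector to every frame, which is what a fixed channel does in the cepstral domain, leaves the output unchanged. The same property explains the warning on the slide: anything that is constant over a short utterance, including phonetic content, is removed as well.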

  42. Modulation Filtering • Short-Time Log-Spectrum, $\log|X_t(\omega)|$, is a function of $t$ (frame number) and $\omega$. • Speaker information ($\log|P_t(\omega)|$), Transfer function information ($\log|T_t(\omega)|$), and Channel/Reverberation Information ($\log|R_t(\omega)|$) may vary at different speeds with respect to frame number $t$: $\log|X_t(\omega)| = \log|R_t(\omega)| + \log|T_t(\omega)| + \log|P_t(\omega)|$ • Assumption: Only $\log|T_t(\omega)|$ carries information about phonemes. Other components are “noise.” • Wiener filtering approach: filter $\log|X_t(\omega)|$ to compute an estimate of $\log|T_t(\omega)|$: $\log|T_t^*(\omega)| = \sum_k h_k \log|X_{t-k}(\omega)|$

  43. RASTA (RelAtive SpecTral Amplitude) (Hermansky, IEEE Trans. Speech and Audio Proc., 1994) • Modulation-filtering of the cepstrum is equivalent to modulation-filtering of the log spectrum: $c_t^*[m] = \sum_k h_k\, c_{t-k}[m]$ • RASTA is a particular kind of modulation filter:
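
The filter itself is missing from this transcript. The sketch below applies a commonly quoted form of the RASTA band-pass filter, with numerator 0.1 × (2, 1, 0, -1, -2) and a single pole at 0.98, along the frame axis of each cepstral trajectory. Treat the coefficients as an assumption rather than the slide's exact formula; the implementation is causal, so it delays the output by four frames relative to the non-causal textbook form, and the name `rasta_filter` is hypothetical.

```python
import numpy as np

def rasta_filter(cepstra, pole=0.98):
    # cepstra: shape (n_frames,) or (n_frames, n_coefficients); filtering runs over frames.
    cepstra = np.asarray(cepstra, dtype=float)
    numer = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    out = np.zeros_like(cepstra)
    for t in range(cepstra.shape[0]):
        # Recursive (pole) part: leaky memory of the previous output.
        acc = pole * out[t - 1] if t > 0 else 0.0
        # FIR part: a smoothed time derivative of the input trajectory.
        for j, b in enumerate(numer):
            if t - j >= 0:
                acc = acc + b * cepstra[t - j]
        out[t] = acc
    return out
```

Because the numerator coefficients sum to zero, a constant offset in any trajectory (for example a fixed channel) decays away, much as in cepstral mean subtraction but without needing the whole utterance in advance.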

  44. Features Based on Models of Auditory Physiology

  45. Processing of Sound by the Inner Ear • Bones of the middle ear act as an impedance matcher, ensuring that not all of the incoming wave is reflected from the fluid-air boundary at the surface of the cochlea. • The basilar membrane divides the top half of the cochlea (scala vestibuli) from the bottom half (scala tympani). The basal end is light and stiff, therefore tuned to high frequencies; the apical end is loose and floppy, therefore tuned to low frequencies. Thus the whole system acts like a bank of mechanical bandpass filters, with Q = center frequency / bandwidth ≈ 6. • Hair cells on the surface of the basilar membrane release neurotransmitter when they are bent down, but not when they are pulled up. Thus they half-wave rectify the wave-like motion of the basilar membrane. • Neurotransmitter, in the cleft between hair cell and neuron, takes a little while to build up or to dissipate. The inertia of neurotransmitter acts to low-pass filter the half-wave rectified signal, with a cutoff around 2 kHz. Result is a kind of localized energy in a ~0.5 ms window.
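
The last two stages on slide 45 (half-wave rectification followed by low-pass filtering near 2 kHz) can be sketched for a single band-pass channel as below. The one-pole smoother is a stand-in chosen for brevity, not a physiological model, and the choice of which polarity counts as excitatory is arbitrary here. The name `hair_cell_envelope` and its parameters are hypothetical.

```python
import numpy as np

def hair_cell_envelope(band_signal, fs, cutoff_hz=2000.0):
    # Half-wave rectification: only one direction of motion produces a response.
    rectified = np.maximum(np.asarray(band_signal, dtype=float), 0.0)
    # One-pole low-pass filter standing in for neurotransmitter build-up and decay.
    alpha = np.exp(-2.0 * np.pi * cutoff_hz / fs)
    out = np.zeros_like(rectified)
    state = 0.0
    for n, value in enumerate(rectified):
        state = (1.0 - alpha) * value + alpha * state
        out[n] = state
    return out
```

Running this on every output of a band-pass filterbank gives one row per channel, which is the kind of simulated neural response that the auditory spectrogram and correlogram features start from.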

  46. Filtering: Different Frequencies Excite Different Positions on the Basilar Membrane Inner and Outer Hair Cells on the Basilar Membrane. Each column of hair cells is tuned to a slightly different center frequency.

  47. Half-Wave Rectification: Only Down-Bending of the Hair Cells Excites a Neural Response Close-up view of outer hair cells, in a “V” configuration

  48. Neural Response to a Synthetic Vowel (Cariani, 2000)

  49. Temporal Structure of the Neural Response • Neural response patterns carry more information than just average energy (spectrogram) • For example: periodicity • Correlogram (Licklider, 1951): Measure periodicity on each simulated neuron by computing its autocorrelation • Recursive Neural Net (Cariani, 2000): Measure periodicity by building up response strength in an RNN with different delay loops • YIN pitch tracker (de Cheveigne and Kawahara, 2002): Measure periodicity using the absolute value of the difference between delayed signals
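
The difference-based idea in the last bullet can be illustrated with an average absolute difference between a signal and a delayed copy of itself, which is smallest when the delay equals the period. This is only a sketch of that single idea; the published YIN method includes further steps that are not shown, and the function names here are hypothetical.

```python
import numpy as np

def difference_function(x, lag_min, lag_max):
    # Mean absolute difference between x and x delayed by each lag (lag_min must be >= 1).
    x = np.asarray(x, dtype=float)
    lags = np.arange(lag_min, lag_max + 1)
    d = np.array([np.mean(np.abs(x[:-lag] - x[lag:])) for lag in lags])
    return lags, d

def best_period(x, lag_min, lag_max):
    # The lag with the smallest difference is the period estimate, in samples.
    lags, d = difference_function(x, lag_min, lag_max)
    return int(lags[np.argmin(d)])
```

Replacing the absolute difference with a product of the two delayed signals gives the autocorrelation used in the correlogram bullet; the two measures carry closely related periodicity information.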

  50. Correlogram of a Sine Wave: Center Frequency vs. Autocorrelation Delay, Snapshot at one Instant in Time
