Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology. Mark Hasegawa-Johnson, University of Illinois at Urbana-Champaign, USA. Lecture 6: Speech Recognition Acoustic & Auditory Model Features.
Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology

Mark Hasegawa-Johnson

University of Illinois at Urbana-Champaign, USA

Lecture 6: Speech Recognition Acoustic & Auditory Model Features
  • Log spectral features: log FFT, cepstrum, MFCC
  • Time-domain features: energy, zero crossing rate, autocorrelation
  • Model-based features: LPC, LPCC, PLP
  • Modulation filtering: cepstral mean subtraction, RASTA
  • Auditory model based features: auditory spectrogram, correlogram, summary correlogram
The “Complex Cepstrum”

Cepstrum = Even Part of Complex Cepstrum

Euclidean Distance Between Two Spectra = Cepstral Distance…

… but Windowed Cepstral Distance = Distance Between Smoothed Spectra
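The equality on this slide follows from Parseval's theorem, since the cepstrum is the inverse DFT of the log magnitude spectrum. A minimal sketch that checks it numerically, assuming two toy frames (`frame_a` and `frame_b` are illustrative, not data from the lecture) and a naive DFT written with the standard library only:

```python
import cmath
import math


def log_magnitude_spectrum(x):
    """Log magnitude of the N-point DFT of x (naive O(N^2) DFT, fine for a demo)."""
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [math.log(abs(value)) for value in spectrum]


def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum."""
    log_mag = log_magnitude_spectrum(x)
    n = len(log_mag)
    # log_mag is real and even for a real signal, so the inverse DFT is real.
    return [
        sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * m / n) for k in range(n)).real / n
        for m in range(n)
    ]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Hypothetical toy frames: decaying exponentials, whose DFTs have no zeros.
frame_a = [0.9 ** t for t in range(16)]
frame_b = [(-0.5) ** t for t in range(16)]

# Mean squared log-spectral difference versus summed squared cepstral difference.
spectral = squared_distance(log_magnitude_spectrum(frame_a),
                            log_magnitude_spectrum(frame_b)) / len(frame_a)
cepstral = squared_distance(real_cepstrum(frame_a), real_cepstrum(frame_b))
print(spectral, cepstral)
```

Keeping only the low-order cepstral coefficients before taking the distance would, by the same argument, compare smoothed versions of the two log spectra.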

Mel-Scale Spectra of Music (Petruncio, B.S. Thesis, University of Illinois, 2003)

MFCC of Music (Petruncio, 2003)

“Time-Domain Features” = Features that can be computed frequently (e.g., once/millisecond)
  • Energy-based features: energy, sub-band energies
  • Low-order cepstral features: energy, spectral tilt, spectral centrality
  • Zero-crossing rate
  • Spectral flatness
  • Autocorrelation
Example: 3 Features/1ms (Niyogi and Burges, 2002)

[Figure from Niyogi & Burges, 2002. Panel labels: HF Energy, Spectral Flatness, Stop-Detection SVM]

Energy-Based Features
  • Filter the signal, to get the desired band
    • [0, 400] Hz: is the signal voiced? (doesn’t work for telephone speech)
    • [300, 1000] Hz: is the signal sonorant?
    • [1000, 3000] Hz: distinguish nasals from glides
    • [2000, 6000] Hz: detect frication energy
    • Full Band (no filtering): syllable detection
  • Window with a short window (4-6ms in length)
  • Compute the energy:
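The windowing and energy steps above can be sketched as follows. This is a minimal illustration, assuming the signal has already been band-limited to the band of interest and assuming a Hamming window; `short_time_energy`, `win_ms`, and `hop_ms` are hypothetical names, and the energy is taken as the sum of squared windowed samples.

```python
import math


def short_time_energy(x, fs, win_ms=5.0, hop_ms=1.0):
    """Energy of x in short Hamming-windowed frames, one value per hop."""
    win_len = max(1, int(round(fs * win_ms / 1000.0)))
    hop = max(1, int(round(fs * hop_ms / 1000.0)))
    if win_len > 1:
        window = [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (win_len - 1))
                  for m in range(win_len)]
    else:
        window = [1.0]
    energies = []
    for start in range(0, len(x) - win_len + 1, hop):
        # Sum of squared windowed samples in this frame.
        energies.append(sum((window[m] * x[start + m]) ** 2 for m in range(win_len)))
    return energies


# Hypothetical test signal: 10 ms of silence followed by 10 ms of a 1 kHz tone.
fs = 8000
signal = [0.0] * 80 + [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(80)]
energy = short_time_energy(signal, fs)
print(len(energy), energy[0], energy[-1])
```

With a 5 ms window and a 1 ms hop the energy contour rises within a few frames of the tone onset, which is the property that makes such features usable at a once-per-millisecond rate.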
Cepstrum-Based Features
  • Average(log(energy)) = c[0]
    • $c[0] = \int \log|X(\omega)|\,d\omega = \tfrac{1}{2}\int \log|X(\omega)|^2\,d\omega$
    • Not the same as log(average(energy)), which is $\log \int |X(\omega)|^2\,d\omega$
  • Spectral Tilt: one measure is -c[1]
    • $-c[1] = -\int \log|X(\omega)|\cos(\omega)\,d\omega \approx$ HF log energy – LF log energy
  • A More Universally Accepted Measure:
    • Spectral Tilt = $\int (\omega - \pi/2)\,\log|X(\omega)|\,d\omega$
  • Spectral Centrality: -c[2]
    • $-c[2] = -\int \log|X(\omega)|\cos(2\omega)\,d\omega$
    • $-c[2] \approx$ Mid Frequency Energy ($\pi/4$ to $3\pi/4$) – Low and High Frequency Energy (0 to $\pi/4$ and $3\pi/4$ to $\pi$)
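A rough numerical sketch of the low-order cepstral measures above, assuming a zero-padded naive DFT in place of the integrals and a 1/N normalization. `low_order_cepstrum` and the two-tap test filters are illustrative assumptions, not part of the lecture.

```python
import cmath
import math


def low_order_cepstrum(x, n_fft=64, n_coef=3):
    """First few real-cepstrum coefficients c[0..n_coef-1] of a short frame."""
    padded = list(x) + [0.0] * (n_fft - len(x))
    log_mag = []
    for k in range(n_fft):
        bin_value = sum(padded[t] * cmath.exp(-2j * cmath.pi * k * t / n_fft)
                        for t in range(n_fft))
        log_mag.append(math.log(abs(bin_value)))
    # Cosine transform of the log magnitude spectrum, averaged over all bins.
    return [
        sum(log_mag[k] * math.cos(2.0 * math.pi * k * m / n_fft)
            for k in range(n_fft)) / n_fft
        for m in range(n_coef)
    ]


# Hypothetical frames: a high-pass and a low-pass two-tap impulse response.
c_highpass = low_order_cepstrum([1.0, -0.9])
c_lowpass = low_order_cepstrum([1.0, 0.9])

tilt_highpass = -c_highpass[1]       # positive: more HF than LF log energy
tilt_lowpass = -c_lowpass[1]         # negative: more LF than HF log energy
centrality_highpass = -c_highpass[2]
print(tilt_highpass, tilt_lowpass, centrality_highpass)
```

The sign of `-c[1]` flips between the two frames, matching the reading of spectral tilt as HF log energy minus LF log energy.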
Measures of Turbulence
  • Zero Crossing Rate:
    • Count the number of times that the signal crosses zero in one window. Many: frication. Some: sonorant. Few: silence.
    • A related measure, used often in speech coding: “alternation rate” = the number of times the derivative crosses zero
  • Spectral Flatness:
    • average(log(energy)) – log(average(energy))
    • Equal to zero if spectrum is flat (white noise, e.g., frication)
    • Negative if spectrum is peaky (e.g., vowels)
  • Autocorrelation: measures the similarity of the signal to a delayed version of itself
    • Sonorant (low-frequency) signals: R[1] is large
    • Fricative (high-frequency) signals: R[1] is small or negative
  • R[0] is the energy
    • -R[0] ≤ R[k] ≤ R[0] for all k
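Two of the turbulence measures above, zero-crossing count and autocorrelation, can be sketched in a few lines. The test frames are hypothetical stand-ins: a slow sinusoid for a sonorant and a sample-by-sample alternation for an extreme high-frequency (fricative-like) signal.

```python
import math


def zero_crossings(frame):
    """Number of sign changes between consecutive samples in one window."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)


def autocorrelation(frame, lag):
    """R[lag] = sum over n of frame[n] * frame[n - lag]; R[0] is the energy."""
    return sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))


# Hypothetical frames, not taken from real speech.
sonorant_like = [math.sin(2.0 * math.pi * n / 40.0) for n in range(80)]
fricative_like = [1.0 if n % 2 == 0 else -1.0 for n in range(8)]

print(zero_crossings(fricative_like))                      # many crossings
print(autocorrelation(sonorant_like, 1) / autocorrelation(sonorant_like, 0))
print(autocorrelation(fricative_like, 1))                  # negative
```

Dividing R[1] by R[0] gives a value between -1 and 1 that does not depend on signal level, which is often more convenient than R[1] alone.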
During Vowels and Glides, VT Transfer Function is All-Pole (All-Pole Model sometimes OK at other times too)
Finding LPC Coefficients: Solve the “Normal Equations”
  • LPC Filter Prediction of s[n] is $\sum_k a_k s[n-k]$. Error is $E_n$:
  • $a_k$ minimize the error if they solve the Normal Equations:
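The equations for this slide did not survive in the transcript. For orientation, the standard autocorrelation-method form is sketched below; the model order $p$, the sample index $m$, and the autocorrelation $R[\cdot]$ are notation assumed here, and the slide's own notation may differ.

```latex
% Prediction error and its energy over the analysis frame
e[m] = s[m] - \sum_{k=1}^{p} a_k\, s[m-k], \qquad E_n = \sum_{m} e^2[m]

% Setting \partial E_n / \partial a_i = 0 for i = 1, \dots, p gives the normal equations
\sum_{k=1}^{p} a_k\, R[|i-k|] = R[i], \qquad R[j] = \sum_{m} s[m]\, s[m-j]
```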
Roots of the LPC Polynomial
  • Roots of the LPC Polynomial:
  • Roots include:
    • Complex pole pair at most formant frequencies, $r_k$ and $r_k^*$
    • In a vowel or glide, there are additional poles at zero frequency:
      • One or two with bandwidth ≈ 100-300Hz; these give a negative tilt to the entire spectrum
      • One or two with bandwidth ≈ 2000-3000Hz; these attenuate high frequencies
    • In a fricative: poles may be at $\omega = \pi$, causing the whole spectrum to be high-pass
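The polynomial itself is missing from the transcript. As a hedged reminder of the usual textbook relations, assuming the predictor convention $\hat{s}[n] = \sum_k a_k s[n-k]$ and a sampling rate $f_s$ (both assumptions here), the roots and the frequency and bandwidth they imply are:

```latex
A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k} = \prod_{k=1}^{p} \left(1 - r_k z^{-1}\right)

% Centre frequency and 3 dB bandwidth associated with a root r_k
F_k = \frac{f_s}{2\pi} \arg(r_k), \qquad B_k = -\frac{f_s}{\pi} \ln |r_k|
```

Under these relations a root on the positive real axis has $F_k = 0$, and a root on the negative real axis sits at $\omega = \pi$, which are the two special cases listed above.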
Reflection Coefficients
  • LPC Speech Synthesis Filter can be implemented using a reflection line. This reflection line is mathematically equivalent to a p-tube model of the vocal tract:
  • PARCOR coefficients (= reflection coefficients) are found using the Levinson-Durbin recursion:
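The recursion itself is not preserved in the transcript. A minimal sketch of the textbook Levinson-Durbin recursion follows, assuming the predictor convention $\hat{s}[n] = \sum_k a_k s[n-k]$; the sign convention for reflection coefficients varies between references, and `levinson_durbin` is an illustrative name.

```python
def levinson_durbin(r, order):
    """Solve the normal equations from autocorrelation values r[0..order].

    Returns (a, reflection, error): predictor coefficients a_1..a_p,
    reflection (PARCOR) coefficients k_1..k_p, and the final prediction
    error energy.
    """
    a = [0.0] * (order + 1)          # a[0] is unused padding
    error = r[0]
    reflection = []
    for i in range(1, order + 1):
        # Part of r[i] not explained by the order-(i-1) predictor.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error
        reflection.append(k)
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= (1.0 - k * k)
    return a[1:], reflection, error


# Hypothetical autocorrelation sequence, chosen only for illustration.
r_example = [1.0, 0.5, 0.2]
coeffs, parcor, err = levinson_durbin(r_example, 2)
print(coeffs, parcor, err)
```

Each pass raises the model order by one, so the reflection coefficients of every lower-order model are obtained along the way.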
LAR and LSF
  • Log Area Ratio (LAR) is the log of a bilinear transform of the reflection coefficients:
  • Line Spectral Frequencies (LSF) are the resonances of two lossless vocal tract models. Set $U(0,j\Omega)=0$ at the glottis; the result is P(z). Set $P(0,j\Omega)=0$ at the glottis; the result is Q(z).

(Hasegawa-Johnson, JASA 2000)
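A minimal sketch of both representations under stated assumptions. The LAR is taken as log((1 - k)/(1 + k)), although some references use the reciprocal ratio, and the two lossless polynomials are built as A(z) plus or minus its time-reversed copy. Which of the two is labeled P(z) and which Q(z) depends on convention, so the code uses the neutral names `symmetric` and `antisymmetric`.

```python
import math


def log_area_ratios(reflection):
    """Log area ratios from reflection coefficients (each |k| < 1)."""
    return [math.log((1.0 - k) / (1.0 + k)) for k in reflection]


def lsf_polynomials(a):
    """Sum and difference polynomials whose roots give the LSFs.

    a holds predictor coefficients a_1..a_p, with A(z) = 1 - sum a_k z^-k.
    Returns coefficient lists in powers z^0 .. z^-(p+1).
    """
    a_ext = [1.0] + [-c for c in a] + [0.0]
    reversed_ext = a_ext[::-1]       # coefficients of z^-(p+1) A(1/z)
    symmetric = [u + v for u, v in zip(a_ext, reversed_ext)]
    antisymmetric = [u - v for u, v in zip(a_ext, reversed_ext)]
    return symmetric, antisymmetric


# Hypothetical second-order predictor and reflection coefficients.
lar = log_area_ratios([0.5, -0.2])
sym, anti = lsf_polynomials([0.6, -0.1])
print(lar)
print(sym, anti)
```

The LSFs are the angles of the unit-circle roots of these two polynomials; finding them needs a polynomial root finder, which is left out of this sketch.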

LSFs Tend to Track Formants
  • When LPC finds the formants (during vowels), the roots of P(z) and the roots of Q(z) each tend to “bracket” one formant, with a Q(z) root below, and a P(z) root above.
  • When LPC can’t find the formants (e.g., aspiration), LSFs interpolate between neighboring syllables
Perceptual LPC (Hermansky, J. Acoust. Soc. Am., 1990)
  • First, warp the spectrum to a Bark scale:
  • The filters, Hb(k), are uniformly spaced in Bark frequency. Their amplitudes are scaled by the equal-loudness contour (an estimate of how loud each frequency sounds):
Perceptual LPC
  • Second, compute the cube-root of the power spectrum
    • Cube root replaces the logarithm that would be used in MFCC
    • Loudness of a tone is proportional to cube root of its power

$Y(b) = S(b)^{0.33}$

  • Third, inverse Fourier transform to find the “Perceptual Autocorrelation:”
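Steps two and three can be sketched together. The sketch assumes the Bark-warped, loudness-weighted power spectrum S(b) is already available as a list covering 0 to the Nyquist frequency, and it takes the inverse transform of the even extension of the compressed spectrum; `perceptual_autocorrelation` is a hypothetical helper, not Hermansky's reference code.

```python
import math


def perceptual_autocorrelation(auditory_spectrum, n_lags):
    """Cube-root compress S(b), then inverse-transform to autocorrelation lags."""
    compressed = [s ** 0.33 for s in auditory_spectrum]   # Y(b) = S(b)^0.33
    last = len(compressed) - 1
    lags = []
    for m in range(n_lags):
        # Inverse DFT of the even-symmetric extension of Y(b): the end points
        # appear once, every interior band appears twice.
        total = compressed[0] + compressed[last] * math.cos(math.pi * m)
        total += 2.0 * sum(compressed[b] * math.cos(math.pi * b * m / last)
                           for b in range(1, last))
        lags.append(total / (2.0 * last))
    return lags


# Hypothetical flat auditory spectrum: only lag 0 should be non-zero.
flat_lags = perceptual_autocorrelation([1.0] * 9, 4)
# Hypothetical peaky spectrum: neighbouring lags become correlated.
peaky_lags = perceptual_autocorrelation(
    [8.0, 4.0, 2.0, 1.0, 0.5, 0.25, 0.1, 0.05, 0.02], 4)
print(flat_lags)
print(peaky_lags)
```

The resulting lags play the role of R[0..p] in the normal equations, which is how the PLP coefficients are obtained in the fourth step.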
Perceptual LPC
  • Fourth, use Normal Equations to find the Perceptual LPC (PLP) coefficients:
  • Fifth, use the LPC Cepstral recursion to find Perceptual LPC Cepstrum (PLPCC):
  • Reverberation adds echoes to the recorded signal:
  • Reverberation is a linear filter:

$x[n] = \sum_{k=0}^{\infty} a_k\, s[n-d_k]$

  • If $a_k$ dies away fast enough ($a_k \approx 0$ for $d_k > N$, the STFT window length), we can model reverberation in the STFT frequency domain:

X(z) = R(z) S(z)

  • Usually, STFT frequency-domain modeling of reverberation works for
    • Electric echoes (e.g., from the telephone network)
    • Handset echoes (e.g., from the chin of the speaker)
    • But NOT for free-field echoes (e.g., from the walls of a room, recorded by a desktop microphone)
Cepstral Mean Subtraction: Subtract out Short-Term Reverb
  • Log Magnitude Spectrum: Constant Filter → Constant Additive Term
  • Reverberation R(z) is Constant during the whole sentence
  • Therefore: Subtract the average value from each frame’s cepstrum → log R(z) is completely subtracted away
  • Warning: if the utterance is too short (contains too few phonemes), CMS will remove useful phonetic information!
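A minimal sketch of cepstral mean subtraction, assuming the utterance is stored as a list of per-frame cepstral vectors. The `clean` frames and the `channel` vector are made-up numbers that stand in for the speech cepstra and the constant log R(z) term.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    n_frames = len(frames)
    n_coef = len(frames[0])
    means = [sum(frame[m] for frame in frames) / n_frames for m in range(n_coef)]
    return [[frame[m] - means[m] for m in range(n_coef)] for frame in frames]


# Hypothetical cepstra (3 frames, 2 coefficients) and a constant channel term.
clean = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.0]]
channel = [0.7, -0.3]
filtered = [[c + h for c, h in zip(frame, channel)] for frame in clean]

cms_clean = cepstral_mean_subtraction(clean)
cms_filtered = cepstral_mean_subtraction(filtered)
print(cms_clean)
print(cms_filtered)
```

Both outputs agree up to rounding, which shows the constant channel term being removed. The mean of the clean cepstra is removed as well, which is why very short utterances lose phonetic information.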
Modulation Filtering
  • Short-Time Log-Spectrum, $\log|X_t(\omega)|$, is a function of t (frame number) and $\omega$.
  • Speaker information ($\log|P_t(\omega)|$), Transfer function information ($\log|T_t(\omega)|$), and Channel/Reverberation Information ($\log|R_t(\omega)|$) may vary at different speeds with respect to frame number t.

$\log|X_t(\omega)| = \log|R_t(\omega)| + \log|T_t(\omega)| + \log|P_t(\omega)|$

  • Assumption: Only $\log|T_t(\omega)|$ carries information about phonemes. Other components are “noise.”
  • Wiener filtering approach: filter $\log|X_t(\omega)|$ to compute an estimate of $\log|T_t(\omega)|$.

$\log|T_t^*(\omega)| = \sum_k h_k \log|X_{t-k}(\omega)|$

RASTA (RelAtive SpecTral Amplitude) (Hermansky, IEEE Trans. Speech and Audio Proc., 1994)
  • Modulation-filtering of the cepstrum is equivalent to modulation-filtering of the log spectrum:

$c_t^*[m] = \sum_k h_k\, c_{t-k}[m]$

  • RASTA is a particular kind of modulation filter:
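The filter definition is missing from the transcript. The sketch below assumes the commonly quoted RASTA transfer function, H(z) = 0.1 (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1), applied causally along the frame index of one cepstral or log-spectral trajectory. Some implementations use a different pole value, so treat `pole=0.98` as an assumption.

```python
def rasta_filter(trajectory, pole=0.98):
    """Band-pass filter one coefficient's trajectory over frames (causal form)."""
    numerator = [0.2, 0.1, 0.0, -0.1, -0.2]   # 0.1 * (2, 1, 0, -1, -2)
    output = []
    previous = 0.0
    for t in range(len(trajectory)):
        # FIR part: a smoothed slope estimate over the last five frames.
        fir = sum(b * trajectory[t - j]
                  for j, b in enumerate(numerator) if t - j >= 0)
        # IIR part: leaky integrator with a single real pole.
        previous = fir + pole * previous
        output.append(previous)
    return output


# Hypothetical trajectories: a constant (channel-like) offset and a step.
constant_out = rasta_filter([5.0] * 600)
step_out = rasta_filter([0.0] * 50 + [1.0] * 50)
print(constant_out[-1], step_out[50])
```

The numerator taps sum to zero, so a constant offset such as a fixed channel term decays toward zero, while changes from frame to frame pass through.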
Processing of Sound by the Inner Ear
  • Bones of the middle ear act as an impedance matcher, ensuring that not all of the incoming wave is reflected from the fluid-air boundary at the surface of the cochlea.
  • The basilar membrane divides the top half of the cochlea (scala vestibuli) from the bottom half (scala tympani). The basal end is light and stiff, therefore tuned to high frequencies; the apical end is loose and floppy, therefore tuned to low frequencies. Thus the whole system acts like a bank of mechanical bandpass filters, with Q = center frequency / bandwidth ≈ 6.
  • Hair cells on the surface of the basilar membrane release neurotransmitter when they are bent down, but not when they are pulled up. Thus they half-wave rectify the wave-like motion of the basilar membrane.
  • Neurotransmitter, in the cleft between hair cell and neuron, takes a little while to build up or to dissipate. The inertia of neurotransmitter acts to low-pass filter the half-wave rectified signal, with a cutoff around 2kHz. Result is a kind of localized energy in a ~0.5ms window.
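The last two stages above, half-wave rectification followed by low-pass filtering near 2 kHz, can be sketched for a single channel. The sketch assumes the input is already the output of one cochlear band-pass filter and uses a one-pole smoother as a stand-in for neurotransmitter dynamics; `hair_cell_response` is an illustrative name, and which polarity is kept by the rectifier is arbitrary here.

```python
import math


def hair_cell_response(band_signal, fs, cutoff_hz=2000.0):
    """Half-wave rectify one band-pass channel, then smooth with a one-pole low-pass."""
    alpha = math.exp(-2.0 * math.pi * cutoff_hz / fs)
    output = []
    state = 0.0
    for value in band_signal:
        rectified = max(value, 0.0)               # keep one polarity only
        state = (1.0 - alpha) * rectified + alpha * state
        output.append(state)
    return output


# Hypothetical channel output: a 500 Hz tone sampled at 16 kHz (10 periods).
fs = 16000
tone = [math.sin(2.0 * math.pi * 500.0 * n / fs) for n in range(320)]
response = hair_cell_response(tone, fs)
mean_response = sum(response) / len(response)
print(min(response), mean_response)
```

The output is non-negative and has a non-zero mean even though the input averages to zero, which is the sense in which the result behaves like a short-window energy measure.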
Filtering: Different Frequencies Excite Different Positions on the Basilar Membrane

Inner and Outer Hair Cells on the Basilar Membrane. Each column of hair cells is tuned to a slightly different center frequency.

Half-Wave Rectification: Only Down-Bending of the Hair Cells Excites a Neural Response

Close-up view of outer hair cells, in a “V” configuration

Temporal Structure of the Neural Response
  • Neural response patterns carry more information than just average energy (spectrogram)
  • For example: periodicity
    • Correlogram (Licklider, 1951): Measure periodicity on each simulated neuron by computing its autocorrelation
    • Recursive Neural Net (Cariani, 2000): Measure periodicity by building up response strength in an RNN with different delay loops
    • YIN pitch tracker (de Cheveigne and Kawahara, 2002): Measure periodicity using the squared difference between the signal and a delayed copy of itself
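The correlogram idea can be sketched for a single channel: compute the autocorrelation over a range of delays and pick the delay with the strongest response. The test signal is a hypothetical "missing fundamental" tone with energy only at the second and third harmonics of a 20-sample period; `best_period` is an illustrative helper.

```python
import math


def autocorrelation(x, lag):
    """Sum over n of x[n] * x[n - lag] within the available samples."""
    return sum(x[n] * x[n - lag] for n in range(lag, len(x)))


def best_period(x, min_lag, max_lag):
    """Delay (in samples) with the largest autocorrelation in the search range."""
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorrelation(x, lag))


# Hypothetical signal: harmonics 2 and 3 of a 20-sample period, no fundamental.
period = 20
signal = [math.sin(2.0 * math.pi * 2 * n / period)
          + math.sin(2.0 * math.pi * 3 * n / period)
          for n in range(200)]

estimated_period = best_period(signal, 5, 30)
print(estimated_period)
```

Both harmonics line up with themselves only at multiples of the common period, so the peak falls at the period of the absent fundamental. A full correlogram repeats this per filterbank channel and per time frame.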
Correlogram of a Sine Wave: Center Frequency vs. Autocorrelation Delay, Snapshot at one Instant in Time
Correlogram of a Periodic Signal with spectral peaks at 2F0, 3F0, etcetera but none at F0 (missing fundamental)
Correlogram of an Owl Hooting
  • Y axis = neuron’s center frequency
  • X axis = autocorrelation delay (same as on previous two slides)
  • Time = time lapsed in the movie (real-time movie)
  • Notice: pitch fine structure, within each band, could be used to separate two different audio input signals, performing simultaneous recognition of two speech signals.
  • Log spectrum, once/10ms, computed with a window of about 25ms, seems to carry lots of useful information about place of articulation and vowel quality
    • Euclidean distance between log spectra is not a good measure of perceptual distance
    • Euclidean distance between windowed cepstra is better
    • Frequency warping (mel-scale or Bark-scale) is even better
    • Fitting an all-pole model (PLP) seems to improve speaker-independence
    • Modulation filtering (CMS, RASTA) improves robustness to channel variability (short-impulse-response reverb)
  • Time-domain features (once/1ms) can capture important information about manner of articulation and landmark times
  • Auditory model features (correlogram, delayogram) are useful for recognition of multiple simultaneous talkers