Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology

Mark Hasegawa-Johnson

University of Illinois at Urbana-Champaign, USA

Lecture 6: Speech Recognition Acoustic & Auditory Model Features

  • Log spectral features: log FFT, cepstrum, MFCC

  • Time-domain features: energy, zero crossing rate, autocorrelation

  • Model-based features: LPC, LPCC, PLP

  • Modulation filtering: cepstral mean subtraction, RASTA

  • Auditory model based features: auditory spectrogram, correlogram, summary correlogram

Log Magnitude STFT

The Problem with FFT: Euclidean Distance ≠ Perceptual Distance

The “Complex Cepstrum”

Cepstrum = Even Part of Complex Cepstrum

Euclidean Distance Between Two Spectra = Cepstral Distance…

… but Windowed Cepstral Distance = Distance Between Smoothed Spectra
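
The two slide titles above can be read as an application of Parseval's theorem. The block below is a sketch of that reasoning, not the slide's own derivation: the symbols $c_i[n]$ (real cepstrum of spectrum $i$), $w[n]$ (cepstral window, or "lifter") and $\tilde{L}_i(\omega)$ (smoothed log spectrum) are names assumed here for illustration.

```latex
% Unwindowed: Euclidean distance between log spectra equals cepstral distance.
\frac{1}{2\pi}\int_{-\pi}^{\pi}
  \bigl(\log|X_1(\omega)| - \log|X_2(\omega)|\bigr)^2 \, d\omega
  \;=\; \sum_{n=-\infty}^{\infty} \bigl(c_1[n] - c_2[n]\bigr)^2

% Windowed: multiplying c[n] by w[n] convolves log|X| with W(omega),
% so the windowed distance compares smoothed log spectra.
\sum_{n=-\infty}^{\infty} w[n]^2 \bigl(c_1[n] - c_2[n]\bigr)^2
  \;=\; \frac{1}{2\pi}\int_{-\pi}^{\pi}
  \bigl(\tilde{L}_1(\omega) - \tilde{L}_2(\omega)\bigr)^2 \, d\omega,
\qquad
\tilde{L}_i(\omega) = \sum_{n} w[n]\, c_i[n]\, e^{-j\omega n}
```

A short window $w[n]$ keeps only the low-order cepstral coefficients, which is why the resulting spectra look smooth.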

Cepstrally Smoothed Spectra

Short-Time Fourier Transform = Filterbank with Uniformly Spaced Bands

How to Implement Non-Uniform Filters Using the STFT

Mel-Scale Bandpass Filters

The Mel Frequency Scale: Humans Can Distinguish Tones 3 Mel Apart
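
The transcript does not preserve the formula used for the mel scale on this slide. As a rough illustration, the sketch below assumes one widely used analytic approximation, mel = 2595 log10(1 + f/700); the function names are hypothetical, and other mel formulas exist.

```python
import math


def hz_to_mel(f_hz):
    """Convert Hz to mel using the assumed 2595*log10(1 + f/700) approximation."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_spaced_centers(n_bands, f_max_hz):
    """Center frequencies (Hz) of n_bands bands equally spaced in mel below f_max_hz."""
    m_max = hz_to_mel(f_max_hz)
    return [mel_to_hz(m_max * (i + 1) / (n_bands + 1)) for i in range(n_bands)]
```

Under this approximation the band centers are close together at low frequencies and spread out at high frequencies, which is the non-uniform spacing the mel-scale filters rely on.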

The Bark Scale (a.k.a. “Critical Band Scale”): Noise Within 1 Bark Can “Mask” a Tone

Bark-Scale Warped Spectrum

Mel-Scale Spectral Coefficients (MFSC)
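
The slide's own equations are not preserved in this transcript. One common way to obtain mel-scale spectral coefficients from an STFT power spectrum is to weight the bins with triangular filters spaced uniformly in mel and take the log of each band energy. The sketch below assumes that approach, the 2595 log10(1 + f/700) mel formula, and hypothetical function names and parameters.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_bins, fs):
    """Triangular weights, one row per filter, over n_bins STFT bins spanning 0..fs/2."""
    m_max = hz_to_mel(fs / 2.0)
    # n_filters + 2 edge frequencies, uniformly spaced in mel.
    edges = [mel_to_hz(m_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * (fs / 2.0) / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising edge
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfsc(power_spectrum, bank, floor=1e-12):
    """Log band energies; `floor` (an assumption) avoids log(0) in empty bands."""
    return [math.log(max(floor, sum(w * p for w, p in zip(row, power_spectrum))))
            for row in bank]
```

For example, `mfsc(frame_power, mel_filterbank(20, 257, 16000.0))` would give 20 coefficients for one 512-point STFT frame at a 16 kHz sampling rate.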

Mel-Scale Spectra of Music (Petruncio, B.S. Thesis, University of Illinois, 2003)

Mel-Scale Cepstral Coefficients (MFCC)
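
Mel-scale cepstral coefficients are typically obtained by applying a cosine transform to the log mel-band energies. The sketch below assumes an unnormalized DCT-II for that step; the function name is hypothetical, and the scaling conventions used on the original slide are not preserved in the transcript.

```python
import math


def dct_ii(log_band_energies, n_coeffs):
    """Unnormalized DCT-II of the log mel-band energies (first n_coeffs terms)."""
    n = len(log_band_energies)
    return [
        sum(log_band_energies[b] * math.cos(math.pi * m * (b + 0.5) / n)
            for b in range(n))
        for m in range(n_coeffs)
    ]
```

Coefficient 0 is proportional to the average log band energy; higher coefficients describe progressively finer detail of the mel-warped spectral envelope.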

MFCC of Music (Petruncio, 2003)

Time-Domain Features

“Time-Domain Features” = Features that can be computed frequently (e.g., once/millisecond)

  • Energy-based features: energy, sub-band energies

  • Low-order cepstral features: energy, spectral tilt, spectral centrality

  • Zero-crossing rate

  • Spectral flatness

  • Autocorrelation

Example: 3 Features/1ms (Niyogi and Burges, 2002)

[Figure from Niyogi & Burges, 2002. Panels: HF Energy, Spectral Flatness, Stop-Detection SVM]

Short-Time Analysis: First, Window with Overlapping Windows

Energy-Based Features

  • Filter the signal, to get the desired band

    • [0,400] Hz: is the signal voiced? (doesn’t work for telephone speech)

    • [300,1000] Hz: is the signal sonorant?

    • [1000,3000] Hz: distinguish nasals from glides

    • [2000,6000] Hz: detect frication energy

    • Full Band (no filtering): syllable detection

  • Window with a short window (4-6ms in length)

  • Compute the energy:
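
The energy formula itself is missing from the transcript. A minimal sketch of the last two steps, assuming energy means the sum of squared windowed samples and that the signal has already been band-pass filtered; the function name and arguments are hypothetical:

```python
def short_time_energy(x, window, hop):
    """Energy of each windowed frame: sum over n of (window[n] * x[start + n])**2."""
    n = len(window)
    energies = []
    for start in range(0, len(x) - n + 1, hop):
        energies.append(sum((window[i] * x[start + i]) ** 2 for i in range(n)))
    return energies
```

At a 16 kHz sampling rate, for instance, a 5 ms window is 80 samples and a 1 ms frame step is a hop of 16 samples.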

Cepstrum-Based Features

  • Average(log(energy)) = c[0]

    • $c[0] = \int \log|X(\omega)|\,d\omega = \tfrac{1}{2}\int \log|X(\omega)|^2\,d\omega$

    • Not the same as log(average(energy)), which is $\log \int |X(\omega)|^2\,d\omega$

  • Spectral Tilt: one measure is -c[1]

    • $-c[1] = -\int \log|X(\omega)|\cos(\omega)\,d\omega \approx$ HF log energy – LF log energy

  • A More Universally Accepted Measure:

    • Spectral Tilt = $\int (\omega-\pi/2)\,\log|X(\omega)|\,d\omega$

  • Spectral Centrality: -c[2]

    • $-c[2] = -\int \log|X(\omega)|\cos(2\omega)\,d\omega$

    • $-c[2] \approx$ Mid Frequency Energy ($\pi/4$ to $3\pi/4$) – Low and High Frequency Energy (0 to $\pi/4$ and $3\pi/4$ to $\pi$)
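
As a numerical illustration of the low-order cepstral features above, the sketch below approximates the integrals by averaging over samples of the log magnitude spectrum on [0, π]. The sampling grid, the 1/K normalization (the slide omits constants), and the function name are all assumptions.

```python
import math


def low_order_cepstrum(log_mag, order=2):
    """Approximate c[m] = (1/pi) * integral_0^pi log|X(w)| cos(m w) dw.

    log_mag[k] is assumed to be log|X(w_k)| at w_k = (k + 0.5) * pi / K.
    """
    K = len(log_mag)
    return [
        sum(v * math.cos(m * (k + 0.5) * math.pi / K) for k, v in enumerate(log_mag)) / K
        for m in range(order + 1)
    ]


def tilt_and_centrality(log_mag):
    """Return (average log energy, spectral tilt = -c[1], spectral centrality = -c[2])."""
    c0, c1, c2 = low_order_cepstrum(log_mag, order=2)
    return c0, -c1, -c2
```

With this sign convention, a spectrum that rises with frequency gives a positive tilt, and a spectrum concentrated between π/4 and 3π/4 gives a positive centrality.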

Measures of Turbulence

  • Zero Crossing Rate:

    • Count the number of times that the signal crosses zero in one window. Many: frication. Some: sonorant. Few: silence.

    • A related measure, used often in speech coding: “alternation rate” = the number of times the derivative crosses zero

  • Spectral Flatness:

    • average(log(energy)) – log(average(energy))

    • Equal to zero if spectrum is flat (white noise, e.g., frication)

    • Negative if spectrum is peaky (e.g., vowels)
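
The two turbulence measures above can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes the frame is a list of samples, the spectrum is a list of power values |X(ω)|², and samples equal to zero count as positive.

```python
import math


def zero_crossing_count(frame):
    """Number of sign changes between consecutive samples in one window."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def spectral_flatness(power_spectrum, floor=1e-12):
    """average(log(energy)) - log(average(energy)); zero if flat, negative if peaky."""
    p = [max(v, floor) for v in power_spectrum]  # floor (an assumption) avoids log(0)
    mean_log = sum(math.log(v) for v in p) / len(p)
    log_mean = math.log(sum(p) / len(p))
    return mean_log - log_mean
```

The flatness value can never be positive, because the arithmetic mean of the energies is at least as large as their geometric mean.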



  • Autocorrelation: measures the similarity of the signal to a delayed version of itself

    • Sonorant (low-frequency) signals: R[1] is large

    • Fricative (high-frequency) signals: R[1] is small or negative

  • R[0] is the energy

    • -R[0] ≤ R[k] ≤ R[0] for all k

Model-Based Features: LPC, LPCC, PLP

During Vowels and Glides, VT Transfer Function is All-Pole (All-Pole Model sometimes OK at other times too)

Finding LPC Coefficients: Solve the “Normal Equations”

  • LPC Filter Prediction of s[n] is $\sum_k a_k s[n-k]$. Error is $E_n$:

  • $a_k$ minimize the error if they solve the Normal Equations:
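
The equations from this slide are not preserved in the transcript. The block below sketches the standard derivation under the autocorrelation method, which is an assumption here; the symbols $e[n]$, $E$, $p$ and $R[m]$ are chosen for illustration and may differ from the slide's notation.

```latex
% Prediction error at sample n, and total squared error over the frame
e[n] = s[n] - \sum_{k=1}^{p} a_k\, s[n-k],
\qquad
E = \sum_{n} e[n]^2

% Setting dE/da_i = 0 for each i gives the orthogonality condition
\frac{\partial E}{\partial a_i} = -2 \sum_{n} e[n]\, s[n-i] = 0,
\qquad i = 1, \dots, p

% With R[m] = \sum_n s[n]\, s[n-m], this becomes the normal equations
\sum_{k=1}^{p} a_k\, R[\,|i-k|\,] = R[i],
\qquad i = 1, \dots, p
```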

Roots of the LPC Polynomial

  • Roots of the LPC Polynomial:

  • Roots include:

    • Complex pole pair at most formant frequencies, $r_k$ and $r_k^*$

    • In a vowel or glide, there are additional poles at zero frequency:

      • One or two with bandwidth ≈ 100-300 Hz; these give a negative tilt to the entire spectrum

      • One or two with bandwidth ≈ 2000-3000 Hz; these attenuate high frequencies

    • In a fricative: poles may be at $\omega=\pi$, causing the whole spectrum to be high-pass

Reflection Coefficients

  • LPC Speech Synthesis Filter can be implemented using a reflection line. This reflection line is mathematically equivalent to a p-tube model of the vocal tract:

  • PARCOR coefficients (= reflection coefficients) are found using the Levinson-Durbin recursion:
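
The recursion itself is missing from the transcript. A minimal sketch of the Levinson-Durbin recursion, assuming the autocorrelation method and the predictor convention s[n] ≈ Σ a_k s[n-k]; the function name is hypothetical, and the sign convention for reflection coefficients varies between texts:

```python
def levinson_durbin(R, order):
    """Solve sum_k a_k R[|i-k|] = R[i], i = 1..order, from autocorrelation R[0..order].

    Returns (a, refl, err): a[k-1] = a_k, refl[i-1] = i-th reflection (PARCOR)
    coefficient under this sign convention, err = final prediction error energy.
    """
    a = []
    refl = []
    err = R[0]
    for i in range(1, order + 1):
        # Part of R[i] not explained by the current order-(i-1) predictor.
        acc = R[i] - sum(a[j] * R[i - 1 - j] for j in range(i - 1))
        k = acc / err
        # Order update: a_j <- a_j - k * a_{i-j}, then append a_i = k.
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [k]
        refl.append(k)
        err *= (1.0 - k * k)
    return a, refl, err
```

Each pass raises the model order by one, so the reflection coefficients for all lower orders come out as a by-product of solving the order-p normal equations.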

LAR and LSF

  • Log Area Ratio (LAR) is a bilinear transform of the reflection coefficients:

  • Line Spectral Frequencies (LSF) are the resonances of two lossless vocal tract models. Set $U(0,j\Omega)=0$ at the glottis; the result is P(z). Set $P(0,j\Omega)=0$ at the glottis; the result is Q(z).

    (Hasegawa-Johnson, JASA 2000)

LSFs Tend to Track Formants

  • When LPC finds the formants (during vowels), the roots of P(z) and the roots of Q(z) each tend to “bracket” one formant, with a Q(z) root below, and a P(z) root above.

  • When LPC can’t find the formants (e.g., aspiration), LSFs interpolate between neighboring syllables

LPC Cepstrum: Efficient Recursive Formula
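
The recursive formula is not preserved in the transcript. The sketch below uses the standard recursion for an all-pole filter H(z) = 1 / (1 - Σ a_k z^-k), namely c[n] = a_n + Σ_{k=1}^{n-1} (k/n) c[k] a_{n-k}, with a_n = 0 for n > p. The function name is hypothetical, and the gain term c[0] is left out.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c[1..n_ceps] of 1 / (1 - sum_k a_k z^-k), where a[k-1] = a_k."""
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        # Only terms with 1 <= n - k <= p contribute, since a_m = 0 for m > p.
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c
```

For a single pole at z = 0.5, `lpc_to_cepstrum([0.5], 4)` follows c[n] = 0.5**n / n, which matches the series expansion of -log(1 - 0.5 z^-1).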

Perceptual LPC (Hermansky, J. Acoust. Soc. Am., 1990)

  • First, warp the spectrum to a Bark scale:

  • The filters, $H_b(k)$, are uniformly spaced in Bark frequency. Their amplitudes are scaled by the equal-loudness contour (an estimate of how loud each frequency sounds):

Perceptual LPC

  • Second, compute the cube-root of the power spectrum

    • Cube root replaces the logarithm that would be used in MFCC

    • Loudness of a tone is proportional to cube root of its power

      $Y(b) = S(b)^{0.33}$

  • Third, inverse Fourier transform to find the “Perceptual Autocorrelation:”
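
The second and third steps can be sketched as below. This is only an illustration: the slide's inverse-transform formula is missing from the transcript, so the sketch assumes the Bark-warped power spectrum S(b) is sampled at B points covering half the (warped) frequency axis and uses a cosine transform as the inverse Fourier transform of that symmetric spectrum. The function name and grid are hypothetical.

```python
import math


def perceptual_autocorrelation(bark_power, max_lag, exponent=0.33):
    """Cube-root compress S(b), then inverse cosine transform to lags 0..max_lag."""
    B = len(bark_power)
    loudness = [s ** exponent for s in bark_power]  # Y(b) = S(b)^0.33
    return [
        sum(y * math.cos(k * (b + 0.5) * math.pi / B) for b, y in enumerate(loudness)) / B
        for k in range(max_lag + 1)
    ]
```

The output plays the role of R[0..p] in the normal equations, which is why the fourth step can reuse the ordinary LPC machinery.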

Perceptual LPC

  • Fourth, use Normal Equations to find the Perceptual LPC (PLP) coefficients:

  • Fifth, use the LPC Cepstral recursion to find Perceptual LPC Cepstrum (PLPCC):

Modulation Filtering: Cepstral Mean Subtraction, RASTA



  • Reverberation adds echoes to the recorded signal:

  • Reverberation is a linear filter:

    $x[n] = \sum_{k=0}^{\infty} a_k\, s[n-d_k]$

  • If $a_k$ dies away fast enough ($a_k \approx 0$ for $d_k > N$, the STFT window length), we can model reverberation in the STFT frequency domain:

    $X(z) = R(z)\,S(z)$

  • Usually, STFT frequency-domain modeling of reverberation works for

    • Electric echoes (e.g., from the telephone network)

    • Handset echoes (e.g., from the chin of the speaker)

    • But NOT for free-field echoes (e.g., from the walls of a room, recorded by a desktop microphone)

Reverberation: Recorded and Simulated Room Response

Cepstral Mean Subtraction: Subtract out Short-Term Reverb

  • Log Magnitude Spectrum: Constant Filter → Constant Additive Term

  • Reverberation R(z) is Constant during the whole sentence

  • Therefore: Subtract the average value from each frame’s cepstrum → log R(z) is completely subtracted away

  • Warning: if the utterance is too short (contains too few phonemes), CMS will remove useful phonetic information!
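
A minimal sketch of cepstral mean subtraction over one utterance, assuming the cepstra are stored as a list of frames, each a list of coefficients (the function name is hypothetical):

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over all frames of one utterance."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[m] for f in frames) / n for m in range(dim)]
    return [[f[m] - mean[m] for m in range(dim)] for f in frames]
```

Adding the same constant vector to every frame, which is what a fixed channel does in the cepstral domain, leaves the output unchanged. The same property explains the warning above: any phonetic content that happens to be constant across a short utterance is removed as well.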

Modulation Filtering

  • Short-Time Log-Spectrum, $\log|X_t(\omega)|$, is a function of t (frame number) and $\omega$.

  • Speaker information ($\log|P_t(\omega)|$), Transfer function information ($\log|T_t(\omega)|$), and Channel/Reverberation Information ($\log|R_t(\omega)|$) may vary at different speeds with respect to frame number t.

    $\log|X_t(\omega)| = \log|R_t(\omega)| + \log|T_t(\omega)| + \log|P_t(\omega)|$

  • Assumption: Only $\log|T_t(\omega)|$ carries information about phonemes. Other components are “noise.”

  • Wiener filtering approach: filter $\log|X_t(\omega)|$ to compute an estimate of $\log|T_t(\omega)|$.

    $\log|T_t^*(\omega)| = \sum_k h_k \log|X_{t-k}(\omega)|$

RASTA (RelAtive SpecTral Amplitude) (Hermansky, IEEE Trans. Speech and Audio Proc., 1994)

  • Modulation-filtering of the cepstrum is equivalent to modulation-filtering of the log spectrum:

    $c_t^*[m] = \sum_k h_k\, c_{t-k}[m]$

  • RASTA is a particular kind of modulation filter:
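
The filter on this slide is not preserved in the transcript. The sketch below assumes a commonly cited form of the RASTA band-pass filter, with numerator 0.1 (2 + z^-1 - z^-3 - 2 z^-4) and a single pole at 0.98, applied causally to the time trajectory of one cepstral (or log-spectral) coefficient. The coefficients, pole value, and function name are assumptions; some implementations use a different pole.

```python
def rasta_filter(track, pole=0.98):
    """Band-pass filter one coefficient's trajectory over frames t = 0, 1, 2, ...

    `track[t]` is the coefficient value in frame t. Numerator and pole follow a
    commonly cited RASTA form and are assumptions, not taken from the slide.
    """
    num = [0.2, 0.1, 0.0, -0.1, -0.2]  # sums to zero, so constant offsets are removed
    out = []
    prev = 0.0
    for t in range(len(track)):
        fir = sum(num[k] * track[t - k] for k in range(len(num)) if t - k >= 0)
        prev = fir + pole * prev       # one-pole recursive part
        out.append(prev)
    return out
```

Because the numerator sums to zero, a constant offset (such as a fixed channel term in the log spectrum) decays away, while changes at typical syllable rates pass through.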

Features Based on Models of Auditory Physiology

Processing of Sound by the Inner Ear

  • Bones of the middle ear act as an impedance matcher, ensuring that not all of the incoming wave is reflected from the fluid-air boundary at the surface of the cochlea.

  • The basilar membrane divides the top half of the cochlea (scala vestibuli) from the bottom half (scala tympani). The basal end is light and stiff, therefore tuned to high frequencies; the apical end is loose and floppy, therefore tuned to low frequencies. Thus the whole system acts like a bank of mechanical bandpass filters, with Q = center frequency / bandwidth ≈ 6.

  • Hair cells on the surface of the basilar membrane release neurotransmitter when they are bent down, but not when they are pulled up. Thus they half-wave rectify the wave-like motion of the basilar membrane.

  • Neurotransmitter, in the cleft between hair cell and neuron, takes a little while to build up or to dissipate. The inertia of neurotransmitter acts to low-pass filter the half-wave rectified signal, with a cutoff around 2kHz. Result is a kind of localized energy in a ~0.5ms window.
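
The last two stages, applied to the output of one band-pass channel, can be sketched as half-wave rectification followed by a low-pass filter. The one-pole smoother below is an assumption chosen for brevity, as are the function name and the choice to treat positive samples as the excitatory direction.

```python
import math


def hair_cell_envelope(band_signal, fs, cutoff_hz=2000.0):
    """Half-wave rectify, then smooth with a one-pole low-pass (cutoff about 2 kHz)."""
    alpha = math.exp(-2.0 * math.pi * cutoff_hz / fs)
    out = []
    state = 0.0
    for v in band_signal:
        rectified = max(v, 0.0)  # only one bending direction excites the neuron
        state = (1.0 - alpha) * rectified + alpha * state
        out.append(state)
    return out
```

Running this on every channel of a band-pass filterbank gives one row per channel, a rough stand-in for an auditory spectrogram.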

Filtering: Different Frequencies Excite Different Positions on the Basilar Membrane

Inner and Outer Hair Cells on the Basilar Membrane. Each column of hair cells is tuned to a slightly different center frequency.

Half-Wave Rectification: Only Down-Bending of the Hair Cells Excites a Neural Response

Close-up view of outer hair cells, in a “V” configuration

Neural Response to a Synthetic Vowel (Cariani, 2000)

Temporal Structure of the Neural Response

  • Neural response patterns carry more information than just average energy (spectrogram)

  • For example: periodicity

    • Correlogram (Licklider, 1951): Measure periodicity on each simulated neuron by computing its autocorrelation

    • Recursive Neural Net (Cariani, 2000): Measure periodicity by building up response strength in an RNN with different delay loops

    • YIN pitch tracker (de Cheveigne and Kawahara, 2002): Measure periodicity using the absolute value of the difference between delayed signals

Correlogram of a Sine Wave: Center Frequency vs. Autocorrelation Delay, Snapshot at one Instant in Time

Correlogram of a Periodic Signal with spectral peaks at 2F0, 3F0, etcetera but none at F0 (missing fundamental)

Correlogram of an Owl Hooting

  • Y axis = neuron’s center frequency

  • X axis = autocorrelation delay (same as on previous two slides)

  • Time = time lapsed in the movie (real-time movie)

  • Notice: pitch fine structure, within each band, could be used to separate two different audio input signals, performing simultaneous recognition of two speech signals.

Gandhi and Hasegawa-Johnson, ICSLP 2004



  • Log spectrum, once/10ms, computed with a window of about 25ms, seems to carry lots of useful information about place of articulation and vowel quality

    • Euclidean distance between log spectra is not a good measure of perceptual distance

    • Euclidean distance between windowed cepstra is better

    • Frequency warping (mel-scale or Bark-scale) is even better

    • Fitting an all-pole model (PLP) seems to improve speaker-independence

    • Modulation filtering (CMS, RASTA) improves robustness to channel variability (short-impulse-response reverb)

  • Time-domain features (once/1ms) can capture important information about manner of articulation and landmark times

  • Auditory model features (correlogram, delayogram) are useful for recognition of multiple simultaneous talkers
