
Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology. Mark Hasegawa-Johnson, University of Illinois at Urbana-Champaign, USA. Lecture 6: Speech Recognition Acoustic & Auditory Model Features.


Lecture 6: Speech Recognition Acoustic & Auditory Model Features

  • Log spectral features: log FFT, cepstrum, MFCC

  • Time-domain features: energy, zero crossing rate, autocorrelation

  • Model-based features: LPC, LPCC, PLP

  • Modulation filtering: cepstral mean subtraction, RASTA

  • Auditory model based features: auditory spectrogram, correlogram, summary correlogram

The Problem with FFT: Euclidean Distance ≠ Perceptual Distance

The “Complex Cepstrum”

Cepstrum = Even Part of Complex Cepstrum

Euclidean Distance Between Two Spectra = Cepstral Distance…

… but Windowed Cepstral Distance = Distance Between Smoothed Spectra

Bark-Scale Warped Spectrum: Sounds Within 1 Bark Can “Mask” a Tone

Mel-Scale Spectral Coefficients (MFSC)

Mel-Scale Spectra of Music (Petruncio, B.S. Thesis, University of Illinois, 2003)







Mel-Scale Cepstral Coefficients (MFCC)

MFCC of Music (Petruncio, 2003)







Time-Domain Features

“Time-Domain Features” = Features that can be computed frequently (e.g., once/millisecond)

  • Energy-based features: energy, sub-band energies

  • Low-order cepstral features: energy, spectral tilt, spectral centrality

  • Zero-crossing rate

  • Spectral flatness

  • Autocorrelation

Example: 3 Features/1ms (Niyogi and Burges, 2002)

[Figure from Niyogi & Burges, 2002. Panel labels: HF Energy, Spectral Flatness, Stop-Detection SVM]

Energy-Based Features

  • Filter the signal, to get the desired band

    • [0, 400] Hz: is the signal voiced? (doesn’t work for telephone speech)

    • [300, 1000] Hz: is the signal sonorant?

    • [1000, 3000] Hz: distinguish nasals from glides

    • [2000, 6000] Hz: detect frication energy

    • Full Band (no filtering): syllable detection

  • Window with a short window (4-6ms in length)

  • Compute the energy:
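The windowed-energy step can be sketched as follows. This is only an illustration, assuming the signal has already been band-pass filtered to the band of interest; the Hamming window and the names `short_time_energy`, `win_len`, and `hop` are assumptions, not taken from the slides.

```python
import math

def short_time_energy(signal, win_len, hop):
    """Energy of each Hamming-windowed frame of an already band-limited signal."""
    # win_len and hop are in samples, e.g. 5 ms and 1 ms times the sampling rate.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (win_len - 1))
              for n in range(win_len)]
    energies = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len]
        # Energy = sum of squared windowed samples.
        energies.append(sum((w * x) ** 2 for w, x in zip(window, frame)))
    return energies

# Example: 8 kHz sampling, 5 ms window (40 samples), 1 ms hop (8 samples).
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
band_energy = short_time_energy(tone, win_len=40, hop=8)
```

With a 1 ms hop, this gives one energy value per millisecond for each band.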

Cepstrum-Based Features

  • Average(log(energy)) = c[0]

    • c[0] = ∫ log|X(ω)| dω = ½ ∫ log|X(ω)|² dω

    • Not the same as log(average(energy)), which is log ∫ |X(ω)|² dω

  • Spectral Tilt: one measure is −c[1]

    • −c[1] = −∫ log|X(ω)| cos(ω) dω ≈ HF log energy − LF log energy

  • A More Universally Accepted Measure:

    • Spectral Tilt = ∫ (ω − π/2) log|X(ω)| dω

  • Spectral Centrality: −c[2]

    • −c[2] = −∫ log|X(ω)| cos(2ω) dω

    • −c[2] ≈ Mid-Frequency Energy (π/4 to 3π/4) − Low- and High-Frequency Energy (0 to π/4 and 3π/4 to π)
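A minimal sketch of these low-order cepstral features, assuming the integrals are approximated by an average over DFT bins (so each c[m] is normalized by the number of bins). The function name `real_cepstrum` and the small floor added before the logarithm are assumptions for illustration.

```python
import cmath
import math

def real_cepstrum(frame, n_coeffs=3):
    """c[m] = average over DFT bins of log|X[k]| * cos(2*pi*k*m/N)."""
    N = len(frame)
    log_mag = []
    for k in range(N):
        X_k = sum(x * cmath.exp(-2j * math.pi * k * n / N)
                  for n, x in enumerate(frame))
        log_mag.append(math.log(abs(X_k) + 1e-12))  # floor avoids log(0)
    return [sum(lm * math.cos(2 * math.pi * k * m / N)
                for k, lm in enumerate(log_mag)) / N
            for m in range(n_coeffs)]

# A low-pass frame: X(w) = 1 + 0.9 exp(-jw) has more LF than HF energy.
frame = [1.0, 0.9] + [0.0] * 62
c = real_cepstrum(frame)
tilt = -c[1]        # negative here: low frequencies dominate
centrality = -c[2]
```

For this low-pass frame the tilt measure −c[1] comes out negative, matching the "HF log energy − LF log energy" reading.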

Measures of Turbulence

  • Zero Crossing Rate:

    • Count the number of times that the signal crosses zero in one window. Many: frication. Some: sonorant. Few: silence.

    • A related measure, used often in speech coding: “alternation rate” = the number of times the derivative crosses zero

  • Spectral Flatness:

    • average(log(energy)) – log(average(energy))

    • Equal to zero if spectrum is flat (white noise, e.g., frication)

    • Negative if spectrum is peaky (e.g., vowels)
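Both turbulence measures are easy to sketch. The version below assumes a naive DFT over bins 0 to N/2 for the flatness measure and treats a sample of exactly zero as non-negative when counting crossings; both choices, and the function names, are illustrative assumptions.

```python
import cmath
import math

def zero_crossings(frame):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def spectral_flatness(frame):
    """average(log(energy)) - log(average(energy)) over DFT bins 0..N/2."""
    N = len(frame)
    power = []
    for k in range(N // 2 + 1):
        X_k = sum(x * cmath.exp(-2j * math.pi * k * n / N)
                  for n, x in enumerate(frame))
        power.append(abs(X_k) ** 2 + 1e-12)  # floor avoids log(0)
    avg_log = sum(math.log(p) for p in power) / len(power)
    log_avg = math.log(sum(power) / len(power))
    return avg_log - log_avg  # always <= 0, by Jensen's inequality

impulse = [1.0] + [0.0] * 15                                      # flat spectrum
tone = [math.sin(2 * math.pi * 2 * n / 16) for n in range(16)]   # one sharp peak
flat_value = spectral_flatness(impulse)
peaky_value = spectral_flatness(tone)
```

The impulse has a perfectly flat spectrum, so its flatness is zero; the pure tone is as peaky as a spectrum gets, so its flatness is strongly negative.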

Autocorrelation

  • Autocorrelation: measures the similarity of the signal to a delayed version of itself

    • Sonorant (low-frequency) signals: R[1] is large

    • Fricative (high-frequency) signals: R[1] is small or negative

  • R[0] is the energy

    • -R[0] ≤ R[k] ≤ R[0] for all k
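The properties listed above can be checked with a short sketch. It assumes the simple "biased" estimate R[k] = Σ_n s[n] s[n+k] over one frame; the function name is an assumption.

```python
import math

def autocorrelation(frame, max_lag):
    """R[k] = sum over n of frame[n] * frame[n + k], for k = 0..max_lag."""
    N = len(frame)
    return [sum(frame[n] * frame[n + k] for n in range(N - k))
            for k in range(max_lag + 1)]

# Low-frequency (sonorant-like) signal: R[1] is close to R[0].
low = [math.sin(2 * math.pi * n / 50) for n in range(200)]
R_low = autocorrelation(low, 5)

# Rapidly alternating (fricative-like) signal: R[1] is negative.
high = [(-1.0) ** n for n in range(200)]
R_high = autocorrelation(high, 5)
```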

Model-Based Features: LPC, LPCC, PLP

During Vowels and Glides, VT Transfer Function is All-Pole (All-Pole Model sometimes OK at other times too)

Finding LPC Coefficients: Solve the “Normal Equations”

  • The LPC filter's prediction of s[n] is Σ_k a_k s[n−k]. The error is E_n:

  • The a_k minimize the error if they solve the Normal Equations:
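The equations for this slide did not survive extraction. One standard form, assuming the autocorrelation method with predictor order p and R[k] the autocorrelation of the windowed frame, is given below; the slide's own notation for E_n may have differed.

```latex
E = \sum_{n} \Big( s[n] - \sum_{k=1}^{p} a_k\, s[n-k] \Big)^{2},
\qquad
\sum_{k=1}^{p} a_k\, R[\,|i-k|\,] = R[i], \quad i = 1, \dots, p
```

The second set of equations follows from setting ∂E/∂a_i = 0 for each i.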

Roots of the LPC Polynomial

  • Roots of the LPC Polynomial:

  • Roots include:

    • Complex pole pair at most formant frequencies, r_k and r_k*

    • In a vowel or glide, there are additional poles at zero frequency:

      • One or two with bandwidth ≈ 100-300Hz; these give a negative tilt to the entire spectrum

      • One or two with bandwidth ≈ 2000-3000Hz; these attenuate high frequencies

    • In a fricative: poles may be at ω = π, causing the whole spectrum to be high-pass

Reflection Coefficients

  • LPC Speech Synthesis Filter can be implemented using a reflection line. This reflection line is mathematically equivalent to a p-tube model of the vocal tract:

  • PARCOR coefficients (= reflection coefficients) are found using the Levinson-Durbin recursion:
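A minimal sketch of the Levinson-Durbin recursion, assuming the predictor convention s[n] ≈ Σ_k a_k s[n−k] and autocorrelation values R[0..p] as input. Sign conventions for reflection coefficients differ between texts (some define PARCOR coefficients as the negative of the values returned here), so the signs below are an assumption.

```python
def levinson_durbin(R, order):
    """Solve the normal equations recursively.

    Returns (lpc, reflection, error): predictor coefficients a_1..a_p,
    reflection coefficients k_1..k_p, and the final prediction error.
    """
    a = []        # a[j] holds a_{j+1} for the current model order
    ks = []
    err = R[0]    # order-0 error is the signal energy
    for i in range(1, order + 1):
        # Correlation not yet explained by the order-(i-1) predictor.
        acc = R[i] - sum(a[j] * R[i - 1 - j] for j in range(i - 1))
        k = acc / err
        # Update lower-order coefficients, then append the new one.
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [k]
        ks.append(k)
        err *= (1.0 - k * k)
    return a, ks, err

# Autocorrelation of a first-order process with R[k] = 0.5 ** k.
lpc, reflection, error = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For this first-order example the second reflection coefficient is zero, since one pole already explains the autocorrelation.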

LAR and LSF

  • Log Area Ratio (LAR) is the logarithm of a bilinear transform of the reflection coefficients:

  • Line Spectral Frequencies (LSF) are the resonances of two lossless vocal tract models. Set U(0,jΩ)=0 at glottis; result is P(z). Set P(0,jΩ)=0 at glottis; result is Q(z).

    (Hasegawa-Johnson, JASA 2000)
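Both transformations can be sketched in a few lines. The sign convention for the LAR (log((1+k)/(1−k)) rather than its negative), and the assignment of the symmetric polynomial to P and the antisymmetric one to Q, are assumptions; texts differ on both. The LSFs themselves are the angles of the unit-circle roots of P(z) and Q(z), which this sketch does not compute.

```python
import math

def log_area_ratios(reflection):
    """LAR_i = log((1 + k_i) / (1 - k_i)); requires |k_i| < 1."""
    return [math.log((1.0 + k) / (1.0 - k)) for k in reflection]

def lsf_polynomials(lpc):
    """Return coefficient lists of P(z) and Q(z) in powers of z^-1.

    A(z) = 1 - sum_k a_k z^-k; P = A + z^-(p+1) A(1/z), Q = A - z^-(p+1) A(1/z).
    """
    A = [1.0] + [-a for a in lpc] + [0.0]   # A(z), padded to degree p+1
    rev = A[::-1]                           # coefficients of z^-(p+1) A(1/z)
    P = [x + y for x, y in zip(A, rev)]
    Q = [x - y for x, y in zip(A, rev)]
    return P, Q

lar = log_area_ratios([0.5, 0.0])
P, Q = lsf_polynomials([0.5, 0.0])
```

P comes out palindromic and Q anti-palindromic, which is what places their roots on the unit circle for a stable A(z).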

LSFs Tend to Track Formants

  • When LPC finds the formants (during vowels), the roots of P(z) and the roots of Q(z) each tend to “bracket” one formant, with a Q(z) root below, and a P(z) root above.

  • When LPC can’t find the formants (e.g., aspiration), LSFs interpolate between neighboring syllables

LPC Cepstrum: Efficient Recursive Formula
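The formula on these slides was not preserved. A commonly used recursion, assuming the all-pole model H(z) = 1 / (1 − Σ_k a_k z^−k) with unit gain, is c[n] = a_n + Σ_{k=1}^{n−1} (k/n) c[k] a_{n−k}, with a_n taken as zero for n > p. The sketch below implements that recursion; the function name is hypothetical.

```python
def lpc_to_cepstrum(lpc, n_ceps):
    """Cepstrum c[1..n_ceps] of H(z) = 1 / (1 - sum_k lpc[k-1] z^-k)."""
    p = len(lpc)
    c = [0.0] * (n_ceps + 1)   # c[0] (log gain) is left at 0 for unit gain
    for n in range(1, n_ceps + 1):
        acc = lpc[n - 1] if n <= p else 0.0
        # Only terms with 1 <= n-k <= p contribute.
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * lpc[n - k - 1]
        c[n] = acc
    return c[1:]

# One-pole check: for a_1 = 0.5 the cepstrum is 0.5**n / n.
ceps = lpc_to_cepstrum([0.5], n_ceps=3)
```

The recursion costs on the order of p operations per coefficient and needs no FFT or logarithm.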

Perceptual LPC (Hermansky, J. Acoust. Soc. Am., 1990)

  • First, warp the spectrum to a Bark scale:

  • The filters, H_b(k), are uniformly spaced in Bark frequency. Their amplitudes are scaled by the equal-loudness contour (an estimate of how loud each frequency sounds):

Perceptual LPC (continued)

  • Second, compute the cube-root of the power spectrum

    • Cube root replaces the logarithm that would be used in MFCC

    • Loudness of a tone is proportional to cube root of its power

      Y(b) = S(b)^0.33

  • Third, inverse Fourier transform to find the “Perceptual Autocorrelation:”

Perceptual LPC (continued)

  • Fourth, use Normal Equations to find the Perceptual LPC (PLP) coefficients:

  • Fifth, use the LPC Cepstral recursion to find Perceptual LPC Cepstrum (PLPCC):
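Steps one to three can be sketched as follows, under several assumptions: the Bark warping is the arcsinh form 6·asinh(f/600) with f in Hz, the equal-loudness weighting and the filter shapes H_b(k) are omitted, and the inverse Fourier transform is approximated by a cosine sum over band centers. All function names are hypothetical.

```python
import math

def hz_to_bark(f_hz):
    """Bark warping, assuming the arcsinh form: 6 * asinh(f / 600)."""
    return 6.0 * math.asinh(f_hz / 600.0)

def cube_root_compress(band_power):
    """Step two: Y(b) = S(b) ** 0.33 (intensity-to-loudness compression)."""
    return [s ** 0.33 for s in band_power]

def perceptual_autocorrelation(loudness, n_lags):
    """Step three: inverse cosine transform of the compressed spectrum."""
    B = len(loudness)
    return [sum(y * math.cos(math.pi * m * (b + 0.5) / B)
                for b, y in enumerate(loudness)) / B
            for m in range(n_lags)]

# A flat Bark spectrum gives R[0] = 1 and R[1] = 0 (no predictability).
R_flat = perceptual_autocorrelation(cube_root_compress([1.0] * 8), n_lags=2)
```

The resulting autocorrelation values would then be passed to the Normal Equations (step four) and the LPC cepstral recursion (step five).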

Modulation Filtering: Cepstral Mean Subtraction, RASTA

Reverberation

  • Reverberation adds echoes to the recorded signal:

  • Reverberation is a linear filter:

    x[n] = Σ_{k=0}^{∞} a_k s[n − d_k]

  • If a_k dies away fast enough (a_k ≈ 0 for d_k > N, where N is the STFT window length), we can model reverberation in the STFT frequency domain:

    X(z) = R(z) S(z)

  • Usually, STFT frequency-domain modeling of reverberation works for

    • Electric echoes (e.g., from the telephone network)

    • Handset echoes (e.g., from the chin of the speaker)

    • But NOT for free-field echoes (e.g., from the walls of a room, recorded by a desktop microphone)
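The echo model x[n] = Σ_k a_k s[n − d_k] can be sketched directly, assuming a finite list of echoes with integer sample delays; the function name and argument names are illustrative.

```python
def add_echoes(s, gains, delays):
    """x[n] = sum_k gains[k] * s[n - delays[k]] for a finite set of echoes."""
    x = [0.0] * (len(s) + max(delays))
    for a_k, d_k in zip(gains, delays):
        for n, sample in enumerate(s):
            x[n + d_k] += a_k * sample
    return x

# Direct path (delay 0, gain 1) plus one echo at 2 samples with gain 0.5.
x = add_echoes([1.0, 0.0, 0.0], gains=[1.0, 0.5], delays=[0, 2])
```

Feeding in an impulse returns the impulse response itself, which makes the delays and gains easy to read off.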

Reverberation: Recorded and Simulated Room Response

Cepstral Mean Subtraction: Subtract out Short-Term Reverb

  • Log Magnitude Spectrum: Constant Filter → Constant Additive Term

  • Reverberation R(z) is Constant during the whole sentence

  • Therefore: Subtract the average value from each frame’s cepstrum ⇒ log R(z) is completely subtracted away

  • Warning: if the utterance is too short (contains too few phonemes), CMS will remove useful phonetic information!
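A minimal sketch of cepstral mean subtraction, assuming the utterance is given as a list of per-frame cepstral vectors; the function name is an assumption. The second call shows that a constant additive channel term leaves the output unchanged.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over all frames of one utterance."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    mean = [sum(f[m] for f in frames) / n_frames for m in range(n_coeffs)]
    return [[f[m] - mean[m] for m in range(n_coeffs)] for f in frames]

clean = [[1.0, 2.0], [3.0, 6.0]]
channel = [10.0, -5.0]   # constant additive term, standing in for log R(z)
filtered = [[c + r for c, r in zip(f, channel)] for f in clean]

cms_clean = cepstral_mean_subtraction(clean)
cms_filtered = cepstral_mean_subtraction(filtered)   # same as cms_clean
```

The sketch also makes the warning concrete: the mean of the clean frames is removed too, so with few frames that mean carries phonetic information.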

Modulation Filtering

  • Short-Time Log-Spectrum, log|X_t(ω)|, is a function of t (frame number) and ω.

  • Speaker information (log|P_t(ω)|), Transfer function information (log|T_t(ω)|), and Channel/Reverberation Information (log|R_t(ω)|) may vary at different speeds with respect to frame number t.

    log|X_t(ω)| = log|R_t(ω)| + log|T_t(ω)| + log|P_t(ω)|

  • Assumption: Only log|T_t(ω)| carries information about phonemes. Other components are “noise.”

  • Wiener filtering approach: filter log|X_t(ω)| to compute an estimate of log|T_t(ω)|.

    log|T_t*(ω)| = Σ_k h_k log|X_{t−k}(ω)|

RASTA (RelAtive SpecTrAl) (Hermansky, IEEE Trans. Speech and Audio Proc., 1994)

  • Modulation-filtering of the cepstrum is equivalent to modulation-filtering of the log spectrum:

    c_t*[m] = Σ_k h_k c_{t−k}[m]

  • RASTA is a particular kind of modulation filter:
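The filter shown on this slide was not preserved. The sketch below assumes the band-pass form commonly quoted for RASTA, H(z) = 0.1 (2 + z^−1 − z^−3 − 2z^−4) / (1 − 0.98 z^−1), implemented causally (so the output lags by four frames relative to the usual non-causal statement). The pole value is an assumption; other values such as 0.94 are also in use.

```python
def rasta_filter(trajectory, pole=0.98):
    """Band-pass filter one cepstral (or log-spectral) coefficient over time."""
    num = [0.2, 0.1, 0.0, -0.1, -0.2]   # sums to zero: removes constant offsets
    out = []
    prev = 0.0
    for t in range(len(trajectory)):
        # FIR part: a smoothed slope estimate over the last five frames.
        fir = sum(b * trajectory[t - i] for i, b in enumerate(num) if t - i >= 0)
        # IIR part: leaky integration with the chosen pole.
        prev = fir + pole * prev
        out.append(prev)
    return out

# A constant trajectory (e.g. a fixed channel offset) decays toward zero.
constant_offset = [1.0] * 500
filtered = rasta_filter(constant_offset)
```

Because the numerator coefficients sum to zero, anything constant over time (such as a fixed channel) is suppressed, while the integrator pole passes slower modulations than a pure difference would.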

Features Based on Models of Auditory Physiology

Processing of Sound by the Inner Ear

  • Bones of the middle ear act as an impedance matcher, ensuring that not all of the incoming wave is reflected from the fluid-air boundary at the surface of the cochlea.

  • The basilar membrane divides the top half of the cochlea (scala vestibuli) from the bottom half (scala tympani). The basal end is light and stiff, therefore tuned to high frequencies; the apical end is loose and floppy, therefore tuned to low frequencies. Thus the whole system acts like a bank of mechanical bandpass filters, with Q = center frequency / bandwidth ≈ 6.

  • Hair cells on the surface of the basilar membrane release neurotransmitter when they are bent down, but not when they are pulled up. Thus they half-wave rectify the wave-like motion of the basilar membrane.

  • Neurotransmitter, in the cleft between hair cell and neuron, takes a little while to build up or to dissipate. The inertia of neurotransmitter acts to low-pass filter the half-wave rectified signal, with a cutoff around 2kHz. Result is a kind of localized energy in a ~0.5ms window.
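The last two stages (half-wave rectification followed by low-pass smoothing) can be sketched for a single band-pass channel. The one-pole smoother and its coefficient are illustrative assumptions and are not tuned to the 2 kHz cutoff mentioned above; mapping positive samples to the excitatory direction is also an assumption.

```python
def hair_cell_envelope(band_signal, smoothing=0.8):
    """Half-wave rectify one filterbank channel, then low-pass filter it."""
    out = []
    state = 0.0
    for x in band_signal:
        rectified = max(x, 0.0)   # only one bending direction excites a response
        # One-pole low-pass: a stand-in for neurotransmitter build-up and decay.
        state = smoothing * state + (1.0 - smoothing) * rectified
        out.append(state)
    return out

envelope = hair_cell_envelope([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
```

The output is non-negative and varies more slowly than the input, which is the "localized energy" described in the last bullet.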

Filtering: Different Frequencies Excite Different Positions on the Basilar Membrane

Inner and Outer Hair Cells on the Basilar Membrane. Each column of hair cells is tuned to a slightly different center frequency.

Half-Wave Rectification: Only Down-Bending of the Hair Cells Excites a Neural Response

Close-up view of outer hair cells, in a “V” configuration

Neural Response to a Synthetic Vowel (Cariani, 2000)

Temporal Structure of the Neural Response

  • Neural response patterns carry more information than just average energy (spectrogram)

  • For example: periodicity

    • Correlogram (Licklider, 1951): Measure periodicity on each simulated neuron by computing its autocorrelation

    • Recursive Neural Net (Cariani, 2000): Measure periodicity by building up response strength in an RNN with different delay loops

    • YIN pitch tracker (de Cheveigné and Kawahara, 2002): Measure periodicity using the absolute value of the difference between delayed signals
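The difference-based periodicity measure in the last bullet can be sketched as an average absolute difference between the signal and a delayed copy of itself. This follows the description above; the published YIN algorithm uses a squared difference with further normalization steps, which are not reproduced here. Function names are hypothetical.

```python
def lag_difference(frame, max_lag):
    """d[lag] = mean |frame[n] - frame[n + lag]| for lag = 1..max_lag."""
    N = len(frame) - max_lag
    return [sum(abs(frame[n] - frame[n + lag]) for n in range(N)) / N
            for lag in range(1, max_lag + 1)]

def best_period(frame, max_lag):
    """Lag (in samples) at which the signal is most similar to itself."""
    d = lag_difference(frame, max_lag)
    return 1 + min(range(len(d)), key=d.__getitem__)

# A signal that repeats every 8 samples.
periodic = [0.0, 3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0] * 8
period = best_period(periodic, max_lag=12)
```

Applying the same measure to each channel of an auditory filterbank gives a picture analogous to the correlogram, with minima where the correlogram has peaks.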

Correlogram of a Sine Wave: Center Frequency vs. Autocorrelation Delay, Snapshot at One Instant in Time

Correlogram of a Periodic Signal with spectral peaks at 2F0, 3F0, etc., but none at F0 (missing fundamental)

Correlogram of an Owl Hooting

  • Y axis = neuron’s center frequency

  • X axis = autocorrelation delay (same as on previous two slides)

  • Time = time lapsed in the movie (real-time movie)

  • Notice: pitch fine structure, within each band, could be used to separate two different audio input signals, performing simultaneous recognition of two speech signals.

Gandhi and Hasegawa-Johnson, ICSLP 2004

Summary

  • Log spectrum, once/10ms, computed with a window of about 25ms, seems to carry lots of useful information about place of articulation and vowel quality

    • Euclidean distance between log spectra is not a good measure of perceptual distance

    • Euclidean distance between windowed cepstra is better

    • Frequency warping (mel-scale or Bark-scale) is even better

    • Fitting an all-pole model (PLP) seems to improve speaker-independence

    • Modulation filtering (CMS, RASTA) improves robustness to channel variability (short-impulse-response reverb)

  • Time-domain features (once/1ms) can capture important information about manner of articulation and landmark times

  • Auditory model features (correlogram, delayogram) are useful for recognition of multiple simultaneous talkers