
Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology

Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology. Mark Hasegawa-Johnson, University of Illinois at Urbana-Champaign, USA. Assistant Professor, Electrical and Computer Engineering Department.


Presentation Transcript


  1. Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology. Mark Hasegawa-Johnson, University of Illinois at Urbana-Champaign, USA. Assistant Professor, Electrical and Computer Engineering Department; Assistant Professor, Beckman Institute for Advanced Science and Technology; Adjunct Professor, Speech and Hearing Sciences Department.

  2. Lecture 1: Introduction to Spectrogram Reading
     • Review
       • Laplace and Fourier transforms
       • Short-time Fourier transform (STFT) and windowing
       • White noise
       • Periodic signals
     • Spectrogram reading: Pitch
       • Wideband and narrowband spectrograms
     • Spectrogram reading: Manner
       • Speech physiology
       • Manner classification of phonemes
     • Spectrogram reading: Formants
       • Log-linear form of a rational filter

  3. Laplace and Fourier Transforms

  4. Transform Properties

  5. Transforms worth knowing: Impulses

  6. Transforms worth knowing: Filters

  7. Rectangular Window

  8. Hamming & Hanning Windows

  9. Periodic Signals

  10. Random Signals (Noise)

  11. The Short-Time Fourier Transform

  12. The Spectrogram

  13. Narrowband Spectrogram: N > 2T0

  14. Wideband Spectrogram: N < T0
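The two window-length conditions on slides 13 and 14 can be tried out numerically. The following is a minimal sketch, not code from the lecture: it assumes a synthetic impulse train as the signal, an 8 kHz sampling rate, a Hamming window, and a hypothetical helper named `spectrogram`. N is the window length in samples and T0 is the pitch period in samples.

```python
import numpy as np

def spectrogram(x, win_len, hop, n_fft):
    """Log-magnitude STFT in dB. Rows are frames, columns are frequency bins."""
    window = np.hamming(win_len)
    starts = range(0, len(x) - win_len + 1, hop)
    frames = np.stack([x[s:s + win_len] * window for s in starts])
    # The small constant avoids log(0) in frames that contain no signal.
    return 20 * np.log10(np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) + 1e-10)

fs = 8000            # sampling rate in Hz (assumed)
f0 = 100             # pitch in Hz (assumed)
t0 = fs // f0        # pitch period T0 = 80 samples
x = np.zeros(fs)     # one second of signal
x[::t0] = 1.0        # periodic impulse train, one pulse per pitch period

narrow = spectrogram(x, win_len=4 * t0, hop=10, n_fft=1024)   # N > 2*T0
wide = spectrogram(x, win_len=t0 // 2, hop=10, n_fft=1024)    # N < T0

harmonic_bin = round(1000 * 1024 / fs)   # bin at 1000 Hz, the 10th harmonic
print(narrow[:, harmonic_bin].std(), wide[:, harmonic_bin].std())
```

With the long window every frame spans four pitch periods, so the harmonics of F0 appear as steady horizontal bands. With the short window each frame holds at most one pulse, so the spectrum inside a frame is flat and the pitch shows up as striations along the time axis.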

  15. Fundamental Frequency (Pitch): F0 = 1/T0 (figure annotations: 10F0, 4T0)
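The relation F0 = 1/T0 suggests one simple way to measure pitch: find the period T0 and invert it. The sketch below uses autocorrelation peak picking, which is a common technique but is not stated in the slides. The function name `estimate_f0`, the search range of 60 to 400 Hz, and the test signal are all assumptions made for illustration.

```python
import numpy as np

def estimate_f0(x, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 in Hz as fs / T0, where T0 is the lag of the autocorrelation peak."""
    x = x - np.mean(x)
    # Keep only non-negative lags: r[k] is the autocorrelation at lag k samples.
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag_min = int(fs / f0_max)   # shortest period considered
    lag_max = int(fs / f0_min)   # longest period considered
    t0 = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    return fs / t0

fs = 8000
t = np.arange(800) / fs
# Periodic test signal with a 125 Hz fundamental and two harmonics.
x = (np.sin(2 * np.pi * 125 * t)
     + 0.5 * np.sin(2 * np.pi * 250 * t)
     + 0.25 * np.sin(2 * np.pi * 375 * t))
f0_est = estimate_f0(x, fs)
print(f0_est)
```

Restricting the lag search to a plausible pitch range is a design choice that keeps the estimator from locking onto lag zero or onto very long multiples of the period.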

  16. On to New Material: Manner Features, Speech Production, and Landmarks

  17. Anatomy of Speech Production (figure labels): Hard Palate, Nasal Cavity, Lips, Soft Palate (Open), Oral Cavity, Pharynx, Epiglottis, Tongue Blade, Vocal Folds, Tongue Body, Jaw, Tongue Root

  18. Speech sources: Voicing, Turbulence, and Transients
      • The vocal folds:
        • A nonlinear, high-impedance oscillator
        • Excitation is like a periodic impulse train
      • Turbulence:
        • Vortices striking an obstacle produce white noise
        • Excitation is like white noise
      • Transient:
        • High pressure, suddenly released
        • Excitation is like a single loud impulse, δ(t)
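The three excitation types listed on slide 18 are easy to synthesize. This is a rough sketch under stated assumptions: the function names, the 8 kHz sampling rate, the 100 Hz pitch, and the burst amplitude are all hypothetical choices, and real glottal pulses are smoother than ideal impulses.

```python
import numpy as np

def voicing_source(n_samples, fs, f0):
    """Periodic impulse train: one pulse every T0 = fs / f0 samples."""
    e = np.zeros(n_samples)
    e[::int(round(fs / f0))] = 1.0
    return e

def turbulence_source(n_samples, seed=0):
    """Zero-mean Gaussian white noise."""
    return np.random.default_rng(seed).standard_normal(n_samples)

def transient_source(n_samples, amplitude=10.0):
    """A single loud impulse at time zero, a discrete stand-in for an impulse."""
    e = np.zeros(n_samples)
    e[0] = amplitude
    return e

fs = 8000
voiced = voicing_source(800, fs, f0=100)
noise = turbulence_source(800)
burst = transient_source(800)

# Spacing between voiced pulses, in samples (T0 = fs / f0).
print(np.diff(np.flatnonzero(voiced))[:3])
```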

  19. The vocal folds: A nonlinear, high-impedance oscillator. The vocal tract “rings” like a bell, shaping the sound produced by the vocal folds (cross-sectional area of the vocal tract: 0.5-10 cm²). The glottis (the opening between the vocal folds, inside the larynx) has an open area of 0.03 cm². In order to get through, air from the lungs must speed up into a high-speed jet. The vocal folds flap back and forth, driven by the jet, at a rate of 100-200 pulses/second.

  20. Turbulence: Vortices striking an obstacle produce white noise. In a fricative, the area of the tongue constriction is about 0.2 cm². In order to get through, air speeds up into a turbulent jet. The turbulent jet strikes against downstream obstacles, like the teeth. The jet contains vortices of all different radii, between 0 mm and 0.2 cm; therefore the resulting sound contains noise at all frequencies above about 700 Hz.

  21. Transient: High pressure, suddenly released. While the tongue tip is closed, air pressure builds up behind the constriction. When the constriction is released, there is a sudden change in air flow through the constriction (from 0 to nonzero). The sudden change in airflow is heard as a “pop.”

  22. The Source-Filter Model of Speech Production. Corresponds to: S(s) = H(s)E(s), where
      • S(s) = Recorded speech spectrum
      • E(s) = Source spectrum
      • H(s) = Transfer function = Filtering by the vocal tract
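The product S(s) = H(s)E(s) has a direct discrete-time counterpart: filtering the source in the time domain multiplies the two spectra. The sketch below checks this with the DFT in place of the Laplace transform. The single resonance at 500 Hz with an 80 Hz bandwidth, the helper name `resonator_impulse_response`, and the 8 kHz sampling rate are assumptions chosen only to make the example concrete.

```python
import numpy as np

def resonator_impulse_response(freq_hz, bandwidth_hz, fs, n_samples):
    """Impulse response of one resonance: a sinusoid decaying as exp(-pi * B * t)."""
    t = np.arange(n_samples) / fs
    return np.exp(-np.pi * bandwidth_hz * t) * np.sin(2 * np.pi * freq_hz * t)

fs = 8000
e = np.zeros(800)
e[::80] = 1.0                                            # voiced source, F0 = 100 Hz
h = resonator_impulse_response(500.0, 80.0, fs, 400)     # vocal-tract-like filter
s = np.convolve(e, h)                                    # time domain: s = h * e

# Zero-pad every transform to len(s) so the DFT product matches linear convolution.
n_fft = len(s)
S = np.fft.rfft(s, n_fft)
H = np.fft.rfft(h, n_fft)
E = np.fft.rfft(e, n_fft)
product_matches = np.allclose(S, H * E)
print(product_matches)
```

Taking the log magnitude of both sides turns the product into a sum, log|S| = log|H| + log|E|, which is why source and filter contributions can be viewed as additive layers in a log-magnitude spectrogram.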

  23. Manner Classification of Phonemes: [continuant]
      • [-continuant] = lips or tongue close COMPLETELY on the midline of the vocal tract:
        • stops (p, b, t, d, k, g)
        • nasals (m, n, ng)
        • affricates (q, j, ch, zh)
        • syllable-initial lateral (l, e.g., “lake”)
      • [+continuant] = no complete closure:
        • fricatives (f, v, s, z, sh, x, Chinese h)
        • glides (w, y, r, English h)
        • vowels (a, e, i, o, u)
        • diphthongs (in “buy,” “boy,” “bow”)

  24. Manner Classification of Phonemes: [sonorant]
      • [+sonorant] = “a sound you can sing” (Latin)
        • nasals (m, n, ng)
        • lateral (l)
        • glides (w, y, r)
        • vowels (a, e, i, o, u)
        • diphthongs (buy, boy, bow)
      • [-sonorant] = air pressure builds up behind the constriction; voicing amplitude drops (also called an “obstruent consonant”)
        • stops (p, b, t, d, k, g)
        • affricates (q, j, ch, zh)
        • fricatives (f, v, s, z, sh, x)
      • Special status of “sonorant” in Chinese:
        • “initial” must be all-sonorant (“liang”) or all-obstruent (“qing”)
        • “final” must be all-sonorant

  25. Sonorant Consonants: Glide, Lateral, Nasal. Examples: “layya ton”: /l/, /y/, /t/, /n/ (the /y/ is [+continuant]; the others are [-continuant]). “ame”: /m/ is [-continuant].

  26. Obstruent Consonants: Fricatives, Affricates, and Stops. Examples: sa (+continuant), shi (+continuant), qe (-continuant), iji (-continuant), ba (-continuant), ita (-continuant)

  27. Place of Primary Articulation
      • Palatal (Blade): q, j, sh, y, i
      • Alveolar (Blade): t, d, s, z, n, l
      • Retroflex (Blade): ch, zh, x, r, er
      • Dental (Blade): th, dh
      • Velar (Body): k, g, ng, w, u
      • Labial (Lips): p, b, f, v, m, w, u, o
      • Uvular (Body): h, o
      • Pharyngeal (Body): a, ae
      • Laryngeal: h

  28. Features of Secondary Articulators: [lateral], [nasal], [affricated], [aspirated]
      • [+sonorant,+continuant]: vowels, glides
      • [+sonorant,-continuant]:
        • [+nasal] = soft palate is open; air escapes through the nose
        • [+lateral] = tongue is open on the sides; air can escape around the edges of the tongue
      • [-sonorant,+continuant]: fricatives
      • [-sonorant,-continuant]:
        • [+affricated]: tongue stays nearly closed after release, causing frication (q, j, ch, zh)
        • [+aspirated]: larynx stays open after release, causing aspiration (p, t, k)
        • [-affricated,-aspirated]: nothing special happens after release; vowel starts immediately (b, d, g)

  29. Sonorant Consonants: Glide, Lateral, Nasal. Examples: “layya ton”: /l/, /y/, /t/, /n/ (the /y/ is [+continuant]; the others are [-continuant]). “ame”: /m/ is [-continuant].

  30. Waveforms and Spectrograms: Aspirated and Unaspirated Stops. Unaspirated: /b/; Aspirated: /t/

  31. Phonetic Subsegments in the Release of an Aspirated Stop

  32. Waveforms and Spectrograms: Fricatives and Affricates (figure panels: iji, qe, sa, shi)

  33. Landmarks: Changes in the features [continuant], [sonorant] (figure labels: /m/ release, /t/ release, /k/, /l/ release, /m/ closure, /n/ release, /v/ release, /t/ closure, /v/ closure, /n/ closure)
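Since landmarks are defined as changes in [continuant] or [sonorant], they can be located mechanically once each segment has a manner class. The sketch below encodes the feature values given on slides 23 and 24; the table name `MANNER`, the function `landmarks`, and the representation of an utterance as a list of manner-class names are hypothetical conveniences, not part of the lecture.

```python
# (continuant, sonorant) for each manner class, following slides 23 and 24.
MANNER = {
    "vowel": (True, True),
    "glide": (True, True),
    "fricative": (True, False),
    "nasal": (False, True),
    "lateral": (False, True),    # syllable-initial lateral
    "stop": (False, False),
    "affricate": (False, False),
}

def landmarks(segments):
    """Return (index, changed_features) for each boundary where a feature flips.

    The index is the position of the segment that starts after the boundary.
    """
    marks = []
    for i in range(1, len(segments)):
        prev, cur = MANNER[segments[i - 1]], MANNER[segments[i]]
        changed = [name
                   for name, a, b in zip(("continuant", "sonorant"), prev, cur)
                   if a != b]
        if changed:
            marks.append((i, changed))
    return marks

# "ame" from slide 25: vowel, nasal /m/, vowel.
print(landmarks(["vowel", "nasal", "vowel"]))
```

A vowel-glide-vowel sequence produces no landmarks under this definition, because neither feature changes across those boundaries.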

  34. The Vocal Tract Transfer Function

  35. Log-Spectral Separation of Source and Filter

  36. Formant Frequencies = Resonant Frequencies of the Vocal Tract

  37. Formant Frequencies of a Vowel From Peterson and Barney, “Control Methods in a Study of the Vowels,” Journal of the Acoustical Society of America, 1952

  38. Classifying Vowels
      • Vowel: F1 = 800 Hz, F2 = 1200 Hz; therefore the vowel is /AH/
      • Diphthong: F1 starts at 800 Hz, falls to 300 Hz; F2 starts at 1200 Hz, rises to 2000 Hz; therefore the diphthong is /AY/
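The reasoning on slide 38 amounts to matching measured (F1, F2) pairs against known vowel targets, once for a vowel and twice (start and end) for a diphthong. The sketch below uses nearest-neighbor matching in the F1-F2 plane. The prototype values are illustrative round numbers chosen to agree with the slide, not measurements from Peterson and Barney, and the names `PROTOTYPES`, `nearest_vowel`, and `classify_diphthong` are hypothetical.

```python
import math

# Illustrative (F1, F2) targets in Hz. These are assumed values, not published data.
PROTOTYPES = {
    "IY": (300, 2300),
    "AH": (800, 1200),
    "UW": (300, 900),
}

def nearest_vowel(f1, f2):
    """Return the prototype label closest to (f1, f2) in Euclidean distance."""
    return min(PROTOTYPES, key=lambda v: math.dist((f1, f2), PROTOTYPES[v]))

def classify_diphthong(start, end):
    """Label a diphthong by the vowels nearest to its starting and ending formants."""
    return nearest_vowel(*start), nearest_vowel(*end)

print(nearest_vowel(800, 1200))
print(classify_diphthong((800, 1200), (300, 2000)))
```

A trajectory that starts near /AH/ and ends near /IY/ corresponds to the /AY/ reading given on the slide. Plain Euclidean distance in Hz is the simplest choice here; a perceptual frequency scale would weight F1 and F2 differently.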

  39. Rational Filters: Obstruents

  40. Example: Front Cavity Resonance of /ch/ (q) is near F3 of Following Vowel

  41. Rational Filters: Nasal Consonants

  42. Examples: Nasal Consonants
      • /m/: This talker makes /m/ with resonances at 1000 Hz and 1800 Hz uncancelled, but with the resonance at 300 Hz cancelled by zeros.
      • /ng/: This talker makes /ng/ with resonances at 300 Hz and 1000 Hz uncancelled, but with the resonance at 1800 Hz cancelled by zeros.
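Cancellation of a resonance by zeros can be illustrated with a rational (pole-zero) filter. The sketch below is an idealized example: it assumes an 8 kHz sampling rate, a 100 Hz bandwidth for every resonance, and zeros placed exactly on top of the poles they cancel, which a real vocal tract only approximates. The helper names are hypothetical.

```python
import numpy as np

def second_order_section(freq_hz, bandwidth_hz, fs, w):
    """Evaluate 1 - 2 r cos(theta) z^-1 + r^2 z^-2 at w = z^-1 on the unit circle."""
    r = np.exp(-np.pi * bandwidth_hz / fs)
    theta = 2 * np.pi * freq_hz / fs
    return 1 - 2 * r * np.cos(theta) * w + (r * w) ** 2

def magnitude_db(poles_hz, zeros_hz, freqs_hz, fs=8000, bandwidth_hz=100.0):
    """Magnitude response in dB of a filter with the given resonances and anti-resonances."""
    w = np.exp(-2j * np.pi * np.asarray(freqs_hz, dtype=float) / fs)
    num = np.ones_like(w)
    den = np.ones_like(w)
    for f in zeros_hz:
        num = num * second_order_section(f, bandwidth_hz, fs, w)
    for f in poles_hz:
        den = den * second_order_section(f, bandwidth_hz, fs, w)
    return 20 * np.log10(np.abs(num / den))

freqs = np.array([300.0, 1000.0, 1800.0])
all_pole = magnitude_db([300, 1000, 1800], [], freqs)
m_like = magnitude_db([300, 1000, 1800], [300], freqs)      # 300 Hz resonance cancelled
ng_like = magnitude_db([300, 1000, 1800], [1800], freqs)    # 1800 Hz resonance cancelled

# Drop in level at the cancelled resonance, in dB, relative to the all-pole filter.
print(all_pole[0] - m_like[0], all_pole[2] - ng_like[2])
```

Because the zero sits exactly on the pole, the /m/-like response is identical to that of a filter with only the 1000 Hz and 1800 Hz resonances.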

  43. Summary
      • Spectrogram is the log magnitude of the STFT.
        • Wideband spectrogram: N < T0, pitch shows up in the time domain
        • Narrowband spectrogram: N > 2T0, pitch shows up in the frequency domain
      • Landmarks occur at changes in the values of the distinctive features [continuant] and [sonorant]:
        • [+continuant,+sonorant]: vowels, glides, diphthongs
        • [+continuant,-sonorant]: fricatives
        • [-continuant,+sonorant]: nasals, laterals
        • [-continuant,-sonorant]: stops, affricates
      • Recognition of Vowels and Glides: F1 and F2 are usually enough
      • Recognition of Diphthongs: F1 and F2 at two separate points in time (beginning and ending of the vowel).
      • Obstruent Consonants: Back cavity formants are cancelled by zeros, leaving only the front cavity formants (e.g., F3 for /sh/, /q/)
      • Nasal Consonants: Resonances of the mouth-nose system are often cancelled by zeros, leaving primarily low-frequency energy.
