
Automatic Speech Recognition



Presentation Transcript


  1. Automatic Speech Recognition Application 3

  2. The Malevolent HAL

  3. The Turing Test: A Philosophical Interlude

  4. The Chinese Room

  5. What Do the Two Thought Experiments Have in Common?

  6. Types of Performance

  7. The Model. Do You Believe This?

  8. Why ASR is Hard

  9. A Tale of Aspiration • [t] in "tunafish" • Word-initial • Vocal cords don't vibrate; produces a puff of air (aspirated) • [t] in "starfish" • [t] preceded by [s] • Still voiceless, but no puff of air (unaspirated) • [k]: vocal cords don't vibrate; produces a puff of air • [g]: vocal cords vibrate; no puff of air • But a preceding [s] changes things: the [k] loses its puff and sounds like [g] • Leads to the mishearing of [sk]/[sg] • the sky • this guy • There's more going on: which hearing is more probable?

  10. Ambiguity Exists at Different Levels (Jim Glass, 2007) • Acoustic-Phonetic: Let us pray / Lettuce spray • Syntactic: Meet her at home / Meter at home • Semantic: Is the baby crying / Is the bay bee crying • Discourse: It is easy to recognize speech / It is easy to wreck a nice beach • Prosody: I'm FLYING to Spokane / I'm flying to SPOKANE

  11. What to Do? • Language is not a system of rules • [t] makes a certain sound • "To whom" is correct; "to who" is incorrect • Language is a collection of probabilities

  12. Goal of a Probabilistic Noisy Channel Architecture: What is the most likely sequence of words W, out of all word sequences in the language L, given some acoustic input O? • O is a sequence of observations, O = o1, o2, o3, ..., ot • each oi is a floating point value representing the energy in a ~10 ms slice of O • W = w1, w2, w3, ..., wn • each wi is a word in L

  13. ASR as a Conditional Probability
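The title refers to the standard noisy-channel objective: choose, from all word sequences in L, the one that is most probable given the acoustics O. A minimal statement of it in the usual textbook notation (assumed here; the slide's own equation is not in the transcript):

```latex
\hat{W} = \operatorname*{argmax}_{W \in L} P(W \mid O)
```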

  14. An Historical Aside • ASR is as old as computing • 50s: Bell Labs, RCA Research, Lincoln Labs • Discoveries in acoustic phonetics applied to recognition of single digits, syllables, vowels • 60s: Pattern recognition techniques used in the US, Japan, and the Soviet Union • Two developments in the 80s • DARPA funding for LVCSR • Application of HMMs to speech recognition

  15. A Sentimental Journey • Recall the decoding task • Given an HMM, M • Given a hidden state sequence Q and an observation sequence O • Determine p(Q|O) • Recall the learning task • Given O and Q, create M • Where M consists of two matrices • Priors: A = a11, ..., a1n, ..., an1, ..., ann, where aij = p(qj|qi) • Likelihoods: p(oi|qi) • But how do we get from p(W|O) to our likelihoods and priors?

  16. Parson Bayes to the Rescue. Author of Divine Benevolence, or an Attempt to Prove That the Principal End of the Divine Providence and Government is the Happiness of His Creatures (1731)

  17. Bayes' Rule lets us transform p(W|O) into p(O|W)p(W)/p(O). In fact, since p(O) is the same for every candidate W, we can drop it. • p(O|W): likelihoods, referred to as the acoustic model • p(W): priors, referred to as the language model
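A sketch of that transformation in the usual textbook form: Bayes' rule rewrites the posterior, and the denominator p(O) drops out because it is constant across candidate word sequences:

```latex
P(W \mid O) = \frac{P(O \mid W)\,P(W)}{P(O)}
\qquad\Rightarrow\qquad
\hat{W} = \operatorname*{argmax}_{W \in L}
\underbrace{P(O \mid W)}_{\text{acoustic model}}\;
\underbrace{P(W)}_{\text{language model}}
```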

  18. LVCSR (Large-Vocabulary Continuous Speech Recognition)

  19. Diagram of an LVCSR System (another view): signal → Feature Extractor (digital signal processing) → feature set (a rep. of the acoustic signal) → Decoder (Viterbi, consulting the Acoustic Model p(O|W) and the Language Model p(W)) → output symbols

  20. Creating Feature Vectors: DSP • Digitize the analog signal through sampling • Decide on a window size and perform an FFT • Output: the amount of energy at each frequency range: the spectrum • Warp to the mel scale and take the log of the values • Take the inverse FFT of the previous value: the cepstrum • The cepstrum is a model of the vocal tract • Save 13 values • Compute the change in these 13 over the next window (deltas) • Compute the change in the 13 deltas (delta-deltas) • Total: a 39-dimensional feature vector per frame (see the sketch below)
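A minimal sketch of this front end in Python/NumPy. It follows the slide's steps (window, FFT, log, inverse transform, 13 coefficients, deltas, delta-deltas) but skips the mel-scale warping for brevity; the 25 ms window and 10 ms hop are conventional assumptions, not values taken from the deck:

```python
import numpy as np

def frame_signal(x, sr, win_s=0.025, hop_s=0.010):
    # Slice the signal into overlapping windows (25 ms every 10 ms)
    win, hop = int(sr * win_s), int(sr * hop_s)
    n = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])

def cepstral_features(x, sr, n_coeffs=13):
    frames = frame_signal(x, sr) * np.hamming(int(sr * 0.025))
    spectrum = np.abs(np.fft.rfft(frames, axis=1))   # energy at each frequency
    log_spec = np.log(spectrum + 1e-10)              # log compression
    # Inverse transform of the log spectrum gives the cepstrum; keep 13 values
    cepstrum = np.fft.irfft(log_spec, axis=1)[:, :n_coeffs]
    deltas = np.diff(cepstrum, axis=0, prepend=cepstrum[:1])  # change over time
    ddeltas = np.diff(deltas, axis=0, prepend=deltas[:1])     # change of the change
    return np.hstack([cepstrum, deltas, ddeltas])    # 13 + 13 + 13 = 39 per frame

# e.g. cepstral_features(np.random.randn(16000), 16000).shape -> (num_frames, 39)
```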

  21. Left Out • Computing the likelihood of feature vectors • Given an HMM state • The HMM state is a partial representation of a linguistic unit • p(ot|qi) • But first: what are these speech sounds?

  22. Fundamental Theory of Phonetics • A spoken word is composed of smaller units of speech called phones • Def: a phone is a speech sound • Phones are represented by symbols • IPA • ARPABET

  23. English Vowels

  24. Human Vocal Organs

  25. Close-Up

  26. Types of Sound • Glottis: the space between the vocal folds • Vocal folds vibrate / don't vibrate: • Voiced consonants like [b], [d], [g], [v], [z], and all English vowels • Unvoiced consonants like [p], [t], [k], [f], [s] • Sounds passing through the nose: nasals • [m], [n], [ng]

  27. Phone Classes • Consonants: produced by restricting the airflow • Vowels: unrestricted, usually voiced, and longer lasting • Semivowels: e.g., [y], voiced but shorter

  28. Consonant: Place of Articulation • Labial: [b], [m] • Labiodental: [v], [f] • Dental: [th] • Alveolar: [s], [z], [t], [d] • Palatal: [sh], [ch], [zh] (Asian), [jh] (jar) • Velar: [k] (cuckoo), [g] (goose), [ng] (kingfisher)

  29. Consonant: Manner of Articulation (how the airflow is restricted) • Stop or plosive: [b], [d], [g] • Nasal: air passes into the nasal cavity: [n], [m], [ng] • Fricative: airflow is not cut off completely: [f], [v], [th], [dh], [s], [z] • Affricate: a stop followed by a fricative: [ch] (chicken), [jh] (giraffe) • Approximant: two articulators are close together but not close enough to cause turbulent airflow: [y], [w], [r], [l]

  30. Vowels: characterized by height and backness • High front: tongue raised toward the front: [iy] (lily) • High back: tongue raised toward the back: [uw] (tulip) • Low front: [ae] (bat) • Low back: [aa] (poppy)

  31. Acoustic Phonetics: based on the sine wave • f: cycles per second (frequency) • A: height of the wave (amplitude) • T = 1/f: the amount of time it takes one cycle to complete

  32. Sound Waves: [iy] in "She just had a baby" • Plot the change in air pressure over time • Imagine an eardrum blocking the air pressure waves • The graph measures the amount of compression and rarefaction

  33. "She Just Had a Baby": notice the vowels, the fricative [sh], and the stop release [b]

  34. Fourier Analysis • Two waveforms: 10 Hz and 100 Hz • Every complex waveform can be represented as a sum of component sine waves

  35. Spectrum • The spectrum of a signal is a representation of each of its frequency components and their amplitudes • Spectrum of the 10 Hz + 100 Hz waveform: note the two spikes (reproduced in the sketch below)
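The two-spike picture is easy to reproduce. A small NumPy sketch (the sampling rate and amplitudes are arbitrary choices for illustration):

```python
import numpy as np

sr = 1000                                     # samples per second
t = np.arange(0, 1.0, 1.0 / sr)               # one second of samples
wave = np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 100 * t)  # 10 Hz + 100 Hz

spectrum = np.abs(np.fft.rfft(wave))          # amplitude of each frequency component
freqs = np.fft.rfftfreq(len(wave), 1.0 / sr)  # frequency in Hz for each bin

print(freqs[spectrum > spectrum.max() / 2])   # the two spikes: [ 10. 100.]
```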

  36. Waveform for [ae] in "had" • Note • 10 major waves, with 4 smaller waves inside each of the larger ones • The frequency of the larger: 10 cycles / .0427 s ≈ 234 Hz • The frequency of the smaller is about 4 times that, or ~930 Hz • Also • Some of the 930 Hz waves contain two smaller waves • f ≈ 2 × 930 = 1860 Hz

  37. Spectrum for [ae] • Notice one of the peaks at just under 1000 Hz • Another at just under 2000 Hz

  38. Conclusion: the spectral peaks visible in a spectrum are characteristic of different phones

  39. What Remains • Computing the likelihood of feature vectors given a triphone: p(ot|qi) • The language model: p(W)

  40. Spectrogram (shown for [ih], [ae], [ah]) • A representation of the different frequencies that make up a waveform, over time (a spectrum is a single point in time) • x-axis: time • y-axis: frequency in Hz • darkness: amplitude

  41. We Need a Sequence Classifier

  42. HMMs in Action • Observation sequence in ASR • Acoustic feature vectors • 39 real-valued features • Represent changes in energy in different frequency bands • Each vector represents 10 ms • Hidden states • Words, for simple tasks like digit recognition or yes/no • Phones or, more usually, subphones

  43. SIX (an HMM for the word "six") • Bakis network: a left-to-right HMM • Each aij is an entry in the priors matrix • Likelihood probabilities are not shown • For each state there is a collection of likelihood observations • Each observation (now a vector of 39 features) has a probability given the state (see the decoding sketch below)
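Decoding such a network is the Viterbi task recalled earlier. A minimal sketch, assuming the priors and likelihoods have already been put into matrices; this is the generic dynamic-programming recursion, not the deck's actual model:

```python
import numpy as np

def viterbi(priors, likelihoods, initial):
    """priors[i, j] = p(q_j | q_i); likelihoods[t, i] = p(o_t | q_i);
    initial[i] = probability of starting in state i."""
    T, N = likelihoods.shape
    v = np.zeros((T, N))            # v[t, i]: best path probability ending in i at t
    back = np.zeros((T, N), int)    # backpointers to recover that path
    v[0] = initial * likelihoods[0]
    for t in range(1, T):
        scores = v[t - 1][:, None] * priors   # extend every path by one transition
        back[t] = scores.argmax(axis=0)       # best predecessor for each state
        v[t] = scores.max(axis=0) * likelihoods[t]
    path = [int(v[-1].argmax())]              # best final state...
    for t in range(T - 1, 0, -1):             # ...then walk the backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

For a Bakis network the priors matrix simply has zeros below the diagonal (no backward transitions); nothing else in the recursion changes.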

  44. But Phones Change Over Time

  45. Necessary to Model Subphones • As before • Bakis network: a left-to-right HMM • Each aij is an entry in the priors matrix • Likelihood probabilities are not shown • For each state there is a collection of likelihood observations • Each observation (now a vector of 39 features) has a probability given the state: p(ot|qi)

  46. Coarticulation: notice the difference in the 2nd formant of [eh] in each context

  47. Solution • Triphone: a phone plus its left context and right context • Notation: [y-eh+l] means [eh] preceded by [y] and followed by [l] • Suppose there are 50 phones in a language: 50³ = 125,000 possible triphones • Not all will appear in a corpus • English disallows [ae-eh+ow] and [m-j+t] • WSJ study: 55,000 triphones needed, but only 18,500 found (see the sketch below)
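A tiny illustration of the notation (a hypothetical helper, not from the deck): expanding a phone sequence into context-dependent triphones, falling back to plain biphones at the edges. A real system also handles cross-word context:

```python
def triphones(phones):
    """Expand a phone sequence into [left-phone+right] triphones."""
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i < len(phones) - 1 else None
        if left and right:
            out.append(f"[{left}-{p}+{right}]")   # full triphone
        elif right:
            out.append(f"[{p}+{right}]")          # word-initial: right context only
        elif left:
            out.append(f"[{left}-{p}]")           # word-final: left context only
        else:
            out.append(f"[{p}]")                  # one-phone word
    return out

print(triphones(["y", "eh", "l"]))  # -> ['[y+eh]', '[y-eh+l]', '[eh-l]']
```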

  48. Data Sparsity • Lucky for us, different contexts sometimes have similar effects • Notice [w iy]/[r iy] and [m iy]/[n iy]

  49. State Tying • The initial subphones of similar triphones, e.g., [t-iy+n] and [t-iy+ng], share acoustic representations (and likelihood probabilities) • How: a clustering algorithm

  50. Acoustic Likelihood / Transition Probability Computation • Problem 1: which observation corresponds to which state? • p(ot|qi): likelihoods • Problem 2: what is the transition probability between states? • Priors • Approach 1: hand labeling • Training corpus of isolated words in wav files • Start and stop time of each phone marked by hand • Can compute the observation likelihoods by counting (as in the ice cream example) • But it takes 400 hours of labor to label one hour of speech • And humans are bad at labeling units smaller than a phone • Approach 2: embedded training • Wav file + corresponding transcription • Pronunciation lexicon • Raw (untrained) HMM • Baum-Welch sums over all possible segmentations of words and phones (see the counting sketch below)
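The "counting" idea for the hand-labeled case is just relative frequency. A minimal sketch for the priors, using toy hand-labeled phone-state sequences (illustrative data, not a real corpus); observation likelihoods would be counted the same way, state by state:

```python
from collections import Counter

def estimate_priors(state_sequences):
    """Estimate p(q_j | q_i) by counting transitions in labeled sequences."""
    transitions = Counter()   # how often state a is followed by state b
    visits = Counter()        # how often state a is followed by anything
    for seq in state_sequences:
        for a, b in zip(seq, seq[1:]):
            transitions[(a, b)] += 1
            visits[a] += 1
    return {(a, b): n / visits[a] for (a, b), n in transitions.items()}

# Toy hand-labeled data: each list is the phone sequence of one spoken word
data = [["s", "ih", "k", "s"], ["s", "ih", "k", "s"], ["s", "eh", "t"]]
print(estimate_priors(data))  # e.g. p(ih|s) = 2/3, p(eh|s) = 1/3, ...
```

Embedded training replaces these hard counts with expected counts summed over every possible segmentation, which is exactly what Baum-Welch does.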
