IRCS/CCN Summer Workshop June 2003 Speech Recognition


Presentation Transcript


  1. IRCS/CCN Summer Workshop, June 2003: Speech Recognition

  2. Why is perception hard?
  • Task: available signals → model of the world around
    • signals are mostly accidental and inadequate
    • sometimes disguised or falsified
    • always mixed up and ambiguous
  • Reasoning about the source of signals:
    • integration of context: what do you expect?
    • “sensor fusion”: integration of vision, sound, smell, etc.
    • source (and noise) separation: there’s more than one thing out there
    • variable perspective, source variation, etc.
      • depends on the type of signal
      • depends on the type of object
  • Much harder than chess or calculus!

  3. Bayesian probability estimation
  • Thomas Bayes (1702-1761)
    • Minister of the Presbyterian Chapel at Tunbridge Wells
    • amateur mathematician
    • “An Essay towards solving a Problem in the Doctrine of Chances”, published (posthumously) in 1764
  • Crucial idea: background (prior) knowledge about the plausibility of different theories can be combined with knowledge about the relation of theories to evidence
    • in a mathematically well-defined way
    • even if all knowledge is uncertain
    • to reason about the most likely explanation of the available evidence
  • Bayes’ theorem:
    • “the most important equation in the history of mathematics” (?)
    • a simple consequence of basic definitions, or
    • a still-controversial recipe for the probability of alternative causes for a given event, or
    • the implicit foundation of human reasoning, or
    • a general framework for solving the problems of perception
  • Tutorial on Bayes’ Theorem
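
  For reference, here is the theorem itself in modern notation; this statement is standard and is added here rather than taken from the slide:

```latex
% Bayes' theorem: posterior plausibility of hypothesis H given evidence E,
% with the evidence term expanded as a sum over alternative hypotheses H'.
\[
P(H \mid E) \;=\; \frac{P(E \mid H)\,P(H)}{P(E)}
           \;=\; \frac{P(E \mid H)\,P(H)}{\sum_{H'} P(E \mid H')\,P(H')}
\]
```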

  4. Fundamental theorem of speech recognition
  P(W|S) ∝ P(S|W) P(W)
  where W is “Word(s)” (i.e. the message text) and S is “Sound(s)” (i.e. the speech signal)
  • “Noisy channel model” of communications engineering, due to Shannon 1949
  • New algorithms, especially relevant to speech recognition, due to L.E. Baum et al., ~1965-1970
  • Applied to speech recognition by Jim Baker (CMU PhD 1975) and Fred Jelinek (IBM speech group, from 1975 onward)
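
  A one-step derivation of the slide’s proportionality, added for completeness (standard, not on the original slide): by Bayes’ rule the denominator P(S) does not depend on W, so it can be dropped when searching for the best word sequence:

```latex
\[
\hat{W} \;=\; \arg\max_{W} P(W \mid S)
        \;=\; \arg\max_{W} \frac{P(S \mid W)\,P(W)}{P(S)}
        \;=\; \arg\max_{W} P(S \mid W)\,P(W)
\]
```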

  5. Motivations for a Bayesian approach
  • a consistent framework for integrating previous experience and current evidence
  • a quantitative model for “abduction” = reasoning about the best explanation
  • a general method for turning a generative model into an analytic one = “analysis by synthesis”, helpful where |categories| << |signals|
  These motivations apply both in engineering practice and in the evolution of biological systems

  6. Basic architecture of standard speech recognition technology
  1. Bayes’ Rule: P(W|S) ∝ P(S|W) P(W)
  2. Approximate P(S|W) P(W) as a Hidden Markov Model: a probabilistic function [to get P(S|W)] of a Markov chain [to get P(W)]
  3. Use the Baum-Welch (= EM) algorithm to “learn” the HMM parameters
  4. Use Viterbi decoding to find the most probable W given S under the estimated HMM
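
  To make step 2 concrete, here is a minimal illustrative sketch (not from the slides; the toy numbers are assumptions) of an HMM as a probabilistic function of a Markov chain: the chain generates a hidden state sequence, and each state emits an observation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy HMM: 2 hidden states, 3 discrete observation symbols.
pi = np.array([0.6, 0.4])            # initial state probabilities
A  = np.array([[0.7, 0.3],           # A[i, j] = P(next state j | state i)
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],      # B[i, k] = P(emit symbol k | state i)
               [0.1, 0.3, 0.6]])

def sample(T):
    """Generate T observations: a Markov chain over states,
    plus a probabilistic emission from each visited state."""
    states, obs = [], []
    s = rng.choice(2, p=pi)
    for _ in range(T):
        states.append(int(s))
        obs.append(int(rng.choice(3, p=B[s])))
        s = rng.choice(2, p=A[s])
    return states, obs

print(sample(10))
```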

  7. HMM parameter estimation given labelled/aligned training data...
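
  When the training data are labelled and aligned, parameter estimation reduces to relative-frequency counting. A minimal sketch along those lines; the toy state/symbol pairs below are hypothetical, not workshop data:

```python
import numpy as np

# Toy aligned training data: each frame is (hidden state, observed symbol).
# 2 states, 3 symbols; the pairs are made up for illustration.
data = [(0, 0), (0, 1), (1, 2), (1, 2), (0, 0), (1, 1), (1, 2), (0, 0)]

n_states, n_symbols = 2, 3
A_counts = np.zeros((n_states, n_states))   # state -> next-state transitions
B_counts = np.zeros((n_states, n_symbols))  # state -> emitted symbol

for (s, o), (s_next, _) in zip(data, data[1:]):
    A_counts[s, s_next] += 1
    B_counts[s, o] += 1
B_counts[data[-1][0], data[-1][1]] += 1      # emission of the final frame

# Normalize counts into conditional probabilities (real systems would
# also smooth the counts; omitted here for clarity).
A = A_counts / A_counts.sum(axis=1, keepdims=True)
B = B_counts / B_counts.sum(axis=1, keepdims=True)
print(A, B, sep="\n")
```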

  8. Viterbi decoding given HMM & observed signal...
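
  A compact sketch of Viterbi decoding for a discrete-emission HMM, computed in log space to avoid underflow; the toy parameters are assumptions for illustration:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most probable hidden-state path for a discrete-emission HMM."""
    T, N = len(obs), len(pi)
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, obs[0]]   # best log-prob ending in each state
    back = np.zeros((T, N), dtype=int)     # backpointers
    for t in range(1, T):
        scores = delta[:, None] + logA     # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    # Trace the best path backwards from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2, 2, 0], pi, A, B))
```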

  9. Sketch of Baum-Welch (EM) algorithm for estimating HMM parameters given unaligned (or even unlabelled) training data
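
  The core of Baum-Welch is the forward-backward recursion: compute posterior state occupancies and transition posteriors under the current model (E step), then re-estimate the parameters from those expected counts (M step). A one-iteration sketch with assumed toy values (unscaled alpha/beta, so suitable only for short sequences):

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """E step: state occupancies gamma[t, i] and transition
    posteriors xi[t, i, j] for one observation sequence."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                       # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):              # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    evidence = alpha[-1].sum()                  # P(obs | model)
    gamma = alpha * beta / evidence
    # xi[t, i, j] = alpha[t, i] * A[i, j] * B[j, obs[t+1]] * beta[t+1, j] / evidence
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / evidence
    return gamma, xi

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 2, 0, 1]

# M step: re-estimate parameters from the expected counts.
gamma, xi = forward_backward(obs, pi, A, B)
pi_new = gamma[0]
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
B_new = np.zeros_like(B)
for k in range(B.shape[1]):
    B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0)
B_new /= gamma.sum(axis=0)[:, None]
print(pi_new, A_new, B_new, sep="\n")
```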

  10. Other typical details: complex elaborations of the basic ideas
  • HMM states ← triphones ← words
    • each triphone → 3-5 states + a connection pattern
    • phone sequence from a pronouncing dictionary
    • clustering for estimation
  • Acoustic features
    • RASTA-PLP etc.
    • vocal tract length normalization, speaker clustering
  • Output pdf for each state as a mixture of Gaussians
  • Language model as an N-gram model over words
    • recency/topic effects
  • Empirical weighting of language vs. acoustic models
  • etc., etc.
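
  As a taste of the language-model item above, a minimal bigram (N = 2) model with add-one smoothing; the toy corpus and the smoothing choice are illustrative assumptions:

```python
from collections import Counter

# Toy corpus; real systems train on far more text. <s>/</s> mark boundaries.
corpus = [["<s>", "speech", "recognition", "is", "hard", "</s>"],
          ["<s>", "speech", "perception", "is", "hard", "</s>"]]

bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
contexts = Counter(a for sent in corpus for a in sent[:-1])
V = len({w for sent in corpus for w in sent})   # vocabulary size

def p_bigram(b, a):
    """P(b | a) with add-one smoothing over the vocabulary."""
    return (bigrams[(a, b)] + 1) / (contexts[a] + V)

def p_sequence(words):
    """P(W) for a word sequence as a product of bigram probabilities."""
    p = 1.0
    for a, b in zip(words, words[1:]):
        p *= p_bigram(b, a)
    return p

print(p_sequence(["<s>", "speech", "recognition", "is", "hard", "</s>"]))
```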

  11. Some limitations of the standard architecture
  • Problems with Markovian assumptions
    • modeling trajectory effects
    • variable coordination of articulatory dimensions
    • ...
