

  1. Speech Recognition: A 50 Year Retrospective. Paper at ASA 2004 in Honor of the Contributions of James Flanagan. Raj Reddy, School of Computer Science, Carnegie Mellon University, Pittsburgh. November 15, 2004

  2. Speech Recognition • Objective: Recognize, interpret, and execute spoken language input to computers • Background: • AT&T, CMU, IBM, and MIT have been working on the problem for over 40 years • Other Key Contributors: BBN, Dragon Systems, Kurzweil, SRI, Japan Inc., Europe Inc. • Research and Development Level of Effort: About $200 million/year worldwide • Long-Term Goal: Make speech the preferred mode of communication with computers

  3. Why Has Speech Recognition Been Difficult? • Too Many Sources of Variability • Noise • Microphones • Speakers • Different Speech Sounds • Different Pronunciations • Non-Grammaticality • Imprecision of Language

  4. Why Has Speech Recognition Been Difficult? (Cont.) • Too Many Sources of Knowledge • Acoustics • Phonetics and Phonology • Lexical Information • Syntax • Semantics • Context • Task-Dependent Knowledge

  5. Syntax: Use of Sentence Structure • How do we incorporate syntax into a recognition algorithm? • Recognize the grammar state and sub-select the vocabulary (see the sketch after this slide) • Example: Video from Here! Hear! (1968) • Imposing constraints on sentence and lexical structure reduces ambiguity
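
A minimal sketch of state-conditioned vocabulary sub-selection; the grammar, states, and scoring function below are illustrative assumptions, not the Here! Hear! implementation:

```python
# State-conditioned vocabulary sub-selection: the grammar state
# determines which words the recognizer may even consider.

# Finite-state grammar: state -> {allowed word: next state}.
GRAMMAR = {
    "START": {"move": "PIECE", "show": "OBJECT"},
    "PIECE": {"pawn": "SQUARE", "knight": "SQUARE"},
    "OBJECT": {"board": "END"},
}

def recognize_word(acoustic_scores, state):
    """Pick the acoustically best word among only the words the
    grammar allows in the current state."""
    allowed = GRAMMAR[state]
    # Restricting the candidates to `allowed` is the syntactic
    # constraint: fewer candidates, less acoustic ambiguity.
    best = max(allowed, key=lambda w: acoustic_scores.get(w, float("-inf")))
    return best, allowed[best]

# Noisy acoustics rank "board" above "pawn", but in state "PIECE"
# only piece names are legal, so the confusion never arises.
scores = {"board": -2.0, "pawn": -2.1, "knight": -5.3}
print(recognize_word(scores, "PIECE"))  # -> ('pawn', 'SQUARE')
```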

  6. Semantics: Use of Task-Level Knowledge • What is semantics in the context of ASR, and how do we harness its power? • Convert knowledge into constraints that limit the search space (see the sketch after this slide) • Video: Hearsay (1973) • Chess Task Semantics Constrains the Commands (and the Vocabulary) to Only the Legal Moves • Lesson: Task-level semantics can provide powerful constraints in situations like chess, but much less so in information retrieval and medical diagnosis
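
A minimal sketch of task-level semantic pruning in the spirit of the Hearsay chess demo; the hypothesis list and the set of legal moves are illustrative assumptions (a real system would query a chess engine's move generator):

```python
# Task-level semantic pruning: recognizer hypotheses that are not
# legal moves in the current position are discarded before the
# acoustic scores get to decide.

def semantic_filter(hypotheses, legal_moves):
    """Keep only hypotheses the task (chess) allows, then return
    the acoustically best survivor."""
    viable = [(text, score) for text, score in hypotheses
              if text in legal_moves]
    if not viable:
        return None  # no legal reading; fall back to re-prompting
    return max(viable, key=lambda h: h[1])

# N-best output from the acoustic decoder: (transcript, log score).
hypotheses = [
    ("pawn to king four",   -3.2),  # acoustically best, but illegal here
    ("knight to king four", -3.4),  # legal: semantics rescues it
    ("bishop to king four", -6.0),
]

# Stand-in for a move generator's output in the current position.
legal_moves = {"knight to king four", "queen takes rook"}

print(semantic_filter(hypotheses, legal_moves))
# -> ('knight to king four', -3.4)
```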

  7. Representation: FSG and HMMs • How do we effectively use all the disparate sources of knowledge? • Blackboard model using the Hypothesize-and-Test paradigm (Hearsay system) • Represent linguistic, lexical, phonological, and acoustic-phonetic knowledge as a single integrated FSG (Dragon system) • Example from the Dragon and Harpy Systems • Compiling all knowledge into an integrated network permits efficient execution (see the compilation sketch after this slide) • Lesson: An integrated representation provides a single abstract model, leading to great conceptual simplicity.
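
A minimal sketch of compiling a word-level grammar and a pronunciation lexicon into one integrated phone-level network, in the spirit of Dragon and Harpy; the grammar, lexicon, and state-naming scheme are illustrative assumptions:

```python
# Compile word-level grammar arcs plus pronunciations into a single
# phone-level finite-state network, so the decoder searches one
# integrated structure instead of consulting separate knowledge sources.

GRAMMAR = [("START", "show", "MID"), ("MID", "board", "END")]
LEXICON = {"show": ["SH", "OW"], "board": ["B", "AO", "R", "D"]}

def compile_network(grammar, lexicon):
    """Expand each word arc into its phone sequence; the result is a
    network whose arcs carry phones rather than words."""
    arcs = []
    for src, word, dst in grammar:
        prev = src
        phones = lexicon[word]
        for i, phone in enumerate(phones):
            # The last phone of a word rejoins the grammar; earlier
            # phones pass through word-internal states.
            nxt = dst if i == len(phones) - 1 else f"{word}.{i}"
            arcs.append((prev, phone, nxt))
            prev = nxt
    return arcs

for arc in compile_network(GRAMMAR, LEXICON):
    print(arc)
# ('START', 'SH', 'show.0'), ('show.0', 'OW', 'MID'),
# ('MID', 'B', 'board.0'), ..., ('board.2', 'D', 'END')
```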

  8. Search: Beam Search • Optimal search requires consideration of every path, at huge cost! • Given that the probability estimates are approximate anyway, why not ignore unpromising alternatives? • Example: Beam Search from the Harpy System (see the sketch after this slide) • Speed-up by ignoring unpromising alternatives • Eliminate backtracking • Lesson: Beam search improved speed by one to two orders of magnitude with little degradation of accuracy compared to best-first search techniques such as branch-and-bound and A*.
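
A minimal sketch of time-synchronous beam search; the state network, frame scores, and beam width are illustrative assumptions, not Harpy's actual design:

```python
# Time-synchronous beam search: advance all surviving paths one frame
# at a time, then prune every path that falls too far below the best.
import math

def beam_search(frames, transitions, beam_width=10.0):
    """frames: list of {state: acoustic log-prob} per time step.
    transitions: {state: [successor states]}.
    Keeps only paths within `beam_width` of the best each frame."""
    paths = {"START": 0.0}  # state -> best log score so far
    for frame in frames:
        new_paths = {}
        for state, score in paths.items():
            for nxt in transitions.get(state, []):
                s = score + frame.get(nxt, -math.inf)
                if s > new_paths.get(nxt, -math.inf):
                    new_paths[nxt] = s
        # Prune: drop everything more than beam_width below the best.
        # No backtracking is ever needed; pruned paths are gone.
        best = max(new_paths.values())
        paths = {st: sc for st, sc in new_paths.items()
                 if sc >= best - beam_width}
    return max(paths.items(), key=lambda kv: kv[1])

transitions = {"START": ["a", "b"], "a": ["a", "c"],
               "b": ["b", "c"], "c": ["c"]}
frames = [{"a": -1.0, "b": -3.0}, {"a": -1.0, "c": -2.0}, {"c": -0.5}]
print(beam_search(frames, transitions))  # -> ('c', -2.5)
```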

  9. Speaker-Independent Recognition: Use of Large Data Sets • Is speaker-specific training essential for high performance? • No. Equivalent performance can be obtained from multi-speaker training data, though it usually needs 3 to 10 times more data than speaker-specific training • Leads to a more robust system • Example: Kai-Fu Lee Video • Lesson: One hour of speech from each of 100 different speakers can lead to a more robust and equally accurate system than 10 hours of speech from one speaker!

  10. Unlimited Vocabulary Dictation: Statistical Language Modeling • Can a system be used for unlimited-vocabulary dictation? • Trigram and N-gram language models provide a flexible representation (see the sketch after this slide) • Examples: • WSJ Dictation (1994) • Unlimited Vocabulary Email Dictation (1995) • Lesson: Given a large enough corpus of data, statistical language modeling can lead to respectable system performance
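
A minimal sketch of a trigram language model with linear interpolation, so that unseen trigrams still receive non-zero probability; the toy corpus and the interpolation weights are illustrative assumptions:

```python
# Trigram language model estimated by counting, with linear
# interpolation of trigram, bigram, and unigram estimates.
from collections import Counter

corpus = "the stock market rose today . the market fell yesterday .".split()

uni, bi, tri = Counter(), Counter(), Counter()
for i, w in enumerate(corpus):
    uni[w] += 1
    if i >= 1:
        bi[(corpus[i-1], w)] += 1
    if i >= 2:
        tri[(corpus[i-2], corpus[i-1], w)] += 1

def p_trigram(w1, w2, w3, l1=0.6, l2=0.3, l3=0.1):
    """P(w3 | w1, w2): interpolate trigram, bigram, and unigram
    relative frequencies so unseen trigrams keep some probability."""
    p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p1 = uni[w3] / sum(uni.values())
    return l1 * p3 + l2 * p2 + l3 * p1

# The model prefers continuations seen in the training text:
print(p_trigram("the", "market", "fell"))  # ~0.76: trigram was observed
print(p_trigram("the", "market", "rose"))  # ~0.16: only bigram support
```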

  11. Non-Grammaticality in Spoken Language • Unlike written language, spoken language tends to be non-grammatical and full of non-verbal disfluencies • Semantic Case-Frame Parsing (see the sketch after this slide) • Example: Air Travel Information from an open population (1994) • Wayne Ward Video • Lesson: Conventional NL parsing breaks down for spoken language and needs to be replaced with less rigid structures
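
A minimal sketch of semantic case-frame slot filling in the spirit of the air-travel systems; the patterns, slot names, and city list are illustrative assumptions:

```python
# Case-frame parsing: instead of a full grammatical parse, scan a
# disfluent utterance for phrases that fill slots of a task frame,
# ignoring false starts and filler words entirely.
import re

CITIES = {"pittsburgh", "boston", "denver"}

FRAME_PATTERNS = {
    "origin":      r"from\s+(\w+)",
    "destination": r"to\s+(\w+)",
    "day":         r"on\s+(monday|tuesday|wednesday|thursday|friday)",
}

def parse_case_frame(utterance):
    """Fill whatever slots can be found; skip everything else."""
    frame = {}
    text = utterance.lower()
    for slot, pattern in FRAME_PATTERNS.items():
        for m in re.finditer(pattern, text):
            word = m.group(1)
            # Word-class check: only known city names may fill the
            # origin/destination slots (so "want to go" is skipped).
            if slot in ("origin", "destination") and word not in CITIES:
                continue
            frame[slot] = word
            break
    return frame

# A typical spoken query: a restart and fillers, yet the frame
# still comes out right.
print(parse_case_frame(
    "uh I want to go I need a flight from pittsburgh to boston on friday"))
# -> {'origin': 'pittsburgh', 'destination': 'boston', 'day': 'friday'}
```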

  12. Landmarks • Dragon Dictate and NaturallySpeaking • IBM ViaVoice dictation • Nuance-based Tellme 800 services allow voice queries for directory information, stocks, sports, news, weather, and horoscopes • Microsoft Speech Server, e.g., voice dialing

  13. On the Need for Interdisciplinary Teams • AI and Pattern Recognition • Knowledge Representation and Search • Approximate Matching • Natural Language Processing • Human-Computer Interaction • Cognitive Science • Design • Social Networks • Computer Science • Hardware, Parallel Systems • Algorithm Optimization • Signal Processing • Fourier Transforms, DFT, FFT • Acoustics • Physics of Sounds & Speech • Vocal Tract Model • Phonetics and Linguistics • Sounds (Acoustic-Phonetics) • Words (Lexicon) • Grammar (Syntax) • Meaning (Semantics) • Statistics • Probability Theory • Hidden Markov Models • Clustering • Dynamic Programming

  14. Future Challenges • Unrehearsed Spontaneous Speech • Non-Native Speakers of English • Dynamic Learning from Sparse Data • New Words • New Speakers • New Grammatical Forms • New Languages • No Silver Bullet on the Horizon! • 50 more years? • A million times greater computational power, memory, and bandwidth?

  15. Speech Research and Jim Flanagan • Pervasive Influence Across The Spectrum Of Speech Research • Source of Encouragement and Inspiration
