Speech Recognition: A 50 Year Retrospective Paper at ASA 2004 in Honor of Contributions of James Flanagan Raj Reddy School of Computer Science Carnegie Mellon University Pittsburgh November 15, 2004
Speech Recognition • Objective: Recognize, interpret, and execute spoken language input to computers • Background: • AT&T, CMU, IBM, and MIT working on the problem for over 40 years • Other Key Contributors: BBN, Dragon Systems, Kurzweil, SRI, Japan Inc., Europe Inc. • Research and Development Level of Effort: About $200 million/year worldwide • Long Term Goal: Make speech the preferred mode of communication with computers
Why Has Speech Recognition Been Difficult? • Too Many Sources of Variability • Noise • Microphones • Speakers • Different Speech Sounds • Different Pronunciations • Non-Grammaticality • Imprecision of Language
Why Has Speech Recognition Been Difficult? (cont.) • Too Many Sources of Knowledge • Acoustics • Phonetics and Phonology • Lexical Information • Syntax • Semantics • Context • Task-Dependent Knowledge
Syntax: Use of Sentence Structure • How do we incorporate syntax into a recognition algorithm? • Recognize the state and sub-select vocabulary • Example: Video from Here! Hear! (1968) • Imposing constraints on sentence and lexical structure reduces ambiguity
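The state-and-vocabulary idea above can be sketched in a few lines. This is a toy illustration (the grammar, states, and words are invented for this example, not taken from the Here! Hear! system): the recognizer's current grammar state determines which small subset of the vocabulary must be matched next.

```python
# Toy finite-state grammar: each state maps allowed words to successor
# states, so the active vocabulary at any moment is tiny (assumed example).
GRAMMAR = {
    "START": {"show": "VERB", "list": "VERB"},
    "VERB":  {"files": "END", "moves": "END"},
}

def active_vocabulary(state):
    """Words the grammar allows next from this state."""
    return sorted(GRAMMAR.get(state, {}))

def parse(words):
    """Accept a sentence only if every word follows the grammar."""
    state = "START"
    for w in words:
        if w not in GRAMMAR.get(state, {}):
            return False
        state = GRAMMAR[state][w]
    return state == "END"

print(active_vocabulary("START"))   # ['list', 'show']
print(parse(["show", "files"]))     # True
print(parse(["files", "show"]))     # False
```

At each step only two words are candidates instead of the full vocabulary, which is exactly how sentence-structure constraints reduce acoustic ambiguity.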
Semantics: Use of Task-Level Knowledge • What is semantics in the context of ASR, and how can we harness its power? • Convert knowledge into constraints that limit the search space • Video: Hearsay (1973) • Chess Task Semantics Constrain the Commands (and the Vocabulary) to Only the Legal Moves • Lesson: Task-level semantics can provide powerful constraints in situations like chess, but much less so in information retrieval and medical diagnosis
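The chess constraint can be made concrete with a small sketch (the candidate commands, scores, and legal-move set below are invented for illustration, not from the Hearsay system): acoustically confusable hypotheses are filtered against the set of moves that are legal in the current position.

```python
def rerank(candidates, legal_moves):
    """Keep only candidates that are legal in the current position,
    preserving their acoustic-score order."""
    return [(cmd, score) for cmd, score in candidates if cmd in legal_moves]

# Hypothetical acoustically confusable hypotheses with made-up scores:
candidates = [("pawn to king four", 0.61),
              ("pawn to queen four", 0.58),
              ("bishop to king four", 0.55)]

# Suppose the bishop move is illegal in the current position:
legal = {"pawn to king four", "pawn to queen four"}

print(rerank(candidates, legal))
# [('pawn to king four', 0.61), ('pawn to queen four', 0.58)]
```

In a domain like chess the legality predicate is cheap and sharp; in information retrieval or medical diagnosis no comparably crisp filter exists, which is the slide's point.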
Representation: FSGs and HMMs • How can we effectively use all the disparate sources of knowledge? • Blackboard model using the hypothesize-and-test paradigm (Hearsay system) • Represent linguistic, lexical, phonological, and acoustic-phonetic knowledge as a single integrated FSG (Dragon system) • Example from the Dragon and Harpy Systems • Compiling all knowledge into an integrated network permits efficient execution • Lesson: An integrated representation provides a single abstract model, leading to great conceptual simplicity.
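A minimal sketch of the compile-into-one-network idea (assumed toy lexicon and phone symbols; this is not the actual Dragon or Harpy compiler): word pronunciations are merged into a single prefix-shared network, so common phone sequences are stored and searched once.

```python
def compile_network(lexicon):
    """Compile phone-string pronunciations into one integrated network:
    a prefix tree whose nodes are dicts; a word label sits at '$'."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for p in phones:
            node = node.setdefault(p, {})
        node["$"] = word
    return root

def decode(network, phones):
    """Follow a phone sequence through the network; return the word,
    or None if the sequence is not a complete pronunciation."""
    node = network
    for p in phones:
        if p not in node:
            return None
        node = node[p]
    return node.get("$")

# Hypothetical two-word lexicon sharing the prefix s-p-iy:
lexicon = {"speech": ["s", "p", "iy", "ch"],
           "speed":  ["s", "p", "iy", "d"]}
net = compile_network(lexicon)
print(decode(net, ["s", "p", "iy", "d"]))   # speed
```

Real systems attach acoustic models (HMM states) to the arcs, but the structural benefit is the same: one integrated graph that search can traverse efficiently.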
Search: Beam Search • Optimal search requires consideration of every path, at huge cost! • Given that the probability estimates are approximate anyway, why not ignore unpromising alternatives? • Example: Beam Search from the Harpy System • Speed-up by ignoring unpromising alternatives • Eliminate backtracking • Lesson: Beam search improved speed by one to two orders of magnitude with little degradation in accuracy compared to best-first search techniques such as branch-and-bound and A*.
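Beam search itself fits in a few lines. This is an illustrative toy (the word hypotheses and log-probabilities are invented, and real Harpy pruned within an integrated network rather than a word lattice): at each time step only the best `beam_width` partial paths survive, so cost grows linearly with time instead of exponentially.

```python
def beam_search(frames, beam_width):
    """frames: list of {word: log_prob} hypotheses per time step.
    Returns the best surviving (path, score) pair."""
    beam = [((), 0.0)]
    for frame in frames:
        expanded = [(path + (w,), score + lp)
                    for path, score in beam
                    for w, lp in frame.items()]
        # Prune: keep only the top-scoring partial paths (no backtracking).
        expanded.sort(key=lambda ps: ps[1], reverse=True)
        beam = expanded[:beam_width]
    return beam[0]

# Hypothetical per-frame word hypotheses with made-up log-probabilities:
frames = [{"a": -0.1, "the": -0.3},
          {"cat": -0.2, "cap": -0.9},
          {"sat": -0.1, "sap": -1.2}]
path, score = beam_search(frames, beam_width=2)
print(path)   # ('a', 'cat', 'sat')
```

With a well-chosen beam width the pruned paths are almost never the ones an exhaustive search would have returned, which is why accuracy degrades so little.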
Speaker Independent Recognition: Use of Large Data Sets • Is speaker-specific training essential for high performance? • No. Equivalent performance can be obtained from multi-speaker training data, which usually requires 3 to 10 times more data than speaker-specific training. • Leads to a more robust system • Example: Kai-Fu Lee Video • Lesson: One hour of speech from each of 100 different speakers can lead to a more robust and equally accurate system than 10 hours of speech from one speaker!
Unlimited Vocabulary Dictation: Statistical Language Modeling • Can a system be used for unlimited vocabulary dictation? • Trigram and N-gram language models provide a flexible representation • Examples: • WSJ Dictation (1994) • Unlimited Vocabulary Email Dictation (1995) • Lesson: Given a large enough corpus of data, statistical language modeling can lead to respectable system performance
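The trigram idea can be shown on a toy corpus (the corpus below is invented; real systems train on millions of words and add smoothing for unseen trigrams, which this maximum-likelihood sketch omits):

```python
from collections import Counter

# Tiny assumed training corpus:
corpus = "the cat sat on the mat the cat ran".split()

tri = Counter(zip(corpus, corpus[1:], corpus[2:]))  # trigram counts
bi  = Counter(zip(corpus, corpus[1:]))              # bigram (context) counts

def p_trigram(w1, w2, w3):
    """Maximum-likelihood P(w3 | w1, w2); 0 if the context is unseen."""
    if bi[(w1, w2)] == 0:
        return 0.0
    return tri[(w1, w2, w3)] / bi[(w1, w2)]

# "the cat" occurs twice, followed once by "sat" and once by "ran":
print(p_trigram("the", "cat", "sat"))   # 0.5
```

The decoder multiplies such probabilities along each hypothesis, so the language model steers recognition toward word sequences the corpus makes plausible, with no hand-written grammar needed.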
Non-Grammaticality in Spoken Language • Unlike written language, spoken language tends to be non-grammatical and includes non-verbal disfluencies • Semantic Case Frame Parsing • Example: Air Travel Information from an open population (1994) • Wayne Ward Video • Lesson: Conventional NL parsing breaks down for spoken language, which requires less rigid structures
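A minimal sketch of the case-frame approach (the frame, slot patterns, and utterance are invented for illustration, not the actual ATIS grammar): rather than demanding a full grammatical parse, the parser scans for slot fillers and simply skips the disfluencies around them.

```python
import re

# Hypothetical air-travel case frame with two slots:
PATTERNS = {
    "origin":      re.compile(r"from (\w+)"),
    "destination": re.compile(r"to (\w+)"),
}

def parse_frame(utterance):
    """Fill whatever slots can be found; unmatched words are ignored,
    so fillers like 'uh' and 'um' do not break the parse."""
    frame = {}
    for slot, pat in PATTERNS.items():
        m = pat.search(utterance)
        if m:
            frame[slot] = m.group(1)
    return frame

# A disfluent spoken utterance still yields a usable frame:
print(parse_frame("uh show me flights from boston um to denver please"))
# {'origin': 'boston', 'destination': 'denver'}
```

A conventional parser would reject this utterance outright; the case frame extracts the task-relevant meaning anyway, which is the slide's lesson.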
Landmarks • Dragon Dictate and Dragon NaturallySpeaking • IBM ViaVoice dictation • Nuance-based Tellme 800 services allow voice queries for directory information, stocks, sports, news, weather, and horoscopes • Microsoft Speech Server, e.g. voice dialing
On the Need for Interdisciplinary Teams • AI and Pattern Recognition • Knowledge Representation and Search • Approximate Matching • Natural Language Processing • Human Computer Interaction • Cognitive Science • Design • Social Networks • Computer Science • Hardware, Parallel Systems • Algorithm Optimization • Signal Processing • Fourier Transforms, DFT, FFT • Acoustics • Physics of sounds & speech • Vocal tract model • Phonetics and Linguistics • Sounds (Acoustic-Phonetics) • Words (Lexicon) • Grammar (Syntax) • Meaning (Semantics) • Statistics • Probability Theory • Hidden Markov Models • Clustering • Dynamic Programming
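The signal-processing entry above can be illustrated with a direct DFT (this O(N²) sketch is for clarity only; the FFT computes the identical result in O(N log N), which is what makes real-time front ends feasible):

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

# A one-cycle cosine over 8 samples concentrates its energy in bins 1 and 7:
x = [math.cos(2 * math.pi * t / 8) for t in range(8)]
mags = [round(abs(c), 6) for c in dft(x)]
print(mags)   # [0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0]
```

Speech front ends apply such transforms to short windowed frames to obtain the spectral features the acoustic models consume.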
Future Challenges • Unrehearsed Spontaneous Speech • Non-Native Speakers of English • Dynamic Learning from Sparse Data • New Words • New Speakers • New Grammatical Forms • New Languages • No Silver Bullet on the Horizon! • 50 more years? • A million times greater computational power, memory, and bandwidth?
Speech Research and Jim Flanagan • Pervasive Influence Across The Spectrum Of Speech Research • Source of Encouragement and Inspiration