Speech Recognition: A 50 Year Retrospective Paper at ASA 2004 in Honor of Contributions of James Flanagan Raj Reddy School of Computer Science Carnegie Mellon University Pittsburgh November 15, 2004
Speech Recognition • Objective: Recognize, interpret, and execute spoken language input to computers • Background: • AT&T, CMU, IBM, and MIT working on the problem for over 40 years • Other Key Contributors: BBN, Dragon Systems, Kurzweil, SRI, Japan Inc., Europe Inc. • Research and Development Level of Effort: About $200 million/year worldwide • Long Term Goal: Make speech the preferred mode of communication with computers
Why Has Speech Recognition Been Difficult? • Too Many Sources of Variability • Noise • Microphones • Speakers • Different Speech Sounds • Different Pronunciations • Non-Grammaticality • Imprecision of Language
Why Has Speech Recognition Been Difficult? (cont.) • Too Many Sources of Knowledge • Acoustics • Phonetics and Phonology • Lexical Information • Syntax • Semantics • Context • Task-Dependent Knowledge
Syntax: Use of Sentence Structure • How do we incorporate syntax into a recognition algorithm? • Recognize the state and sub-select vocabulary • Example: Video from Here! Hear! (1968) • Imposing constraints on sentence and lexical structure reduces ambiguity
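The state-and-vocabulary idea above can be sketched in a few lines. This is a toy illustration (the grammar, states, and words are invented for this example, not taken from the Here! Hear! system): the recognizer's current grammar state determines which small subset of the vocabulary must be matched next.

```python
# Toy finite-state grammar: each state maps allowed words to successor
# states, so the active vocabulary at any moment is tiny (assumed example).
GRAMMAR = {
    "START": {"show": "VERB", "list": "VERB"},
    "VERB":  {"files": "END", "moves": "END"},
}

def active_vocabulary(state):
    """Words the grammar allows next from this state."""
    return sorted(GRAMMAR.get(state, {}))

def parse(words):
    """Accept a sentence only if every word follows the grammar."""
    state = "START"
    for w in words:
        if w not in GRAMMAR.get(state, {}):
            return False
        state = GRAMMAR[state][w]
    return state == "END"

print(active_vocabulary("START"))   # ['list', 'show']
print(parse(["show", "files"]))     # True
print(parse(["files", "show"]))     # False
```

At each step only two words are candidates instead of the full vocabulary, which is exactly how sentence-structure constraints reduce acoustic ambiguity.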
Semantics: Use of Task-Level Knowledge • What is semantics in the context of ASR, and how can we harness its power? • Convert knowledge into constraints that limit the search space • Video: Hearsay (1973) • Chess Task Semantics Constrain the Commands (and the Vocabulary) to Only the Legal Moves • Lesson: Task-level semantics can provide powerful constraints in situations like chess, but much less so in information retrieval and medical diagnosis
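The chess constraint can be made concrete with a small sketch (the candidate commands, scores, and legal-move set below are invented for illustration, not from the Hearsay system): acoustically confusable hypotheses are filtered against the set of moves that are legal in the current position.

```python
def rerank(candidates, legal_moves):
    """Keep only candidates that are legal in the current position,
    preserving their acoustic-score order."""
    return [(cmd, score) for cmd, score in candidates if cmd in legal_moves]

# Hypothetical acoustically confusable hypotheses with made-up scores:
candidates = [("pawn to king four", 0.61),
              ("pawn to queen four", 0.58),
              ("bishop to king four", 0.55)]

# Suppose the bishop move is illegal in the current position:
legal = {"pawn to king four", "pawn to queen four"}

print(rerank(candidates, legal))
# [('pawn to king four', 0.61), ('pawn to queen four', 0.58)]
```

In a domain like chess the legality predicate is cheap and sharp; in information retrieval or medical diagnosis no comparably crisp filter exists, which is the slide's point.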
Representation: FSGs and HMMs • How can we effectively use all the disparate sources of knowledge? • Blackboard model using the hypothesize-and-test paradigm (Hearsay system) • Represent linguistic, lexical, phonological, and acoustic-phonetic knowledge as a single integrated FSG (Dragon system) • Example from the Dragon and Harpy Systems • Compiling all knowledge into an integrated network permits efficient execution • Lesson: An integrated representation provides a single abstract model, leading to great conceptual simplicity.
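A minimal sketch of the compile-into-one-network idea (assumed toy lexicon and phone symbols; this is not the actual Dragon or Harpy compiler): word pronunciations are merged into a single prefix-shared network, so common phone sequences are stored and searched once.

```python
def compile_network(lexicon):
    """Compile phone-string pronunciations into one integrated network:
    a prefix tree whose nodes are dicts; a word label sits at '$'."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for p in phones:
            node = node.setdefault(p, {})
        node["$"] = word
    return root

def decode(network, phones):
    """Follow a phone sequence through the network; return the word,
    or None if the sequence is not a complete pronunciation."""
    node = network
    for p in phones:
        if p not in node:
            return None
        node = node[p]
    return node.get("$")

# Hypothetical two-word lexicon sharing the prefix s-p-iy:
lexicon = {"speech": ["s", "p", "iy", "ch"],
           "speed":  ["s", "p", "iy", "d"]}
net = compile_network(lexicon)
print(decode(net, ["s", "p", "iy", "d"]))   # speed
```

Real systems attach acoustic models (HMM states) to the arcs, but the structural benefit is the same: one integrated graph that search can traverse efficiently.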
Search: Beam Search • Optimal search requires consideration of every path, at huge cost! • Given that the probability estimates are approximate anyway, why not ignore unpromising alternatives? • Example: Beam Search from the Harpy System • Speed-up by ignoring unpromising alternatives • Eliminate backtracking • Lesson: Beam search improved speed by one to two orders of magnitude with little degradation in accuracy compared to best-first search techniques such as branch-and-bound and A*.
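Beam search itself fits in a few lines. This is an illustrative toy (the word hypotheses and log-probabilities are invented, and real Harpy pruned within an integrated network rather than a word lattice): at each time step only the best `beam_width` partial paths survive, so cost grows linearly with time instead of exponentially.

```python
def beam_search(frames, beam_width):
    """frames: list of {word: log_prob} hypotheses per time step.
    Returns the best surviving (path, score) pair."""
    beam = [((), 0.0)]
    for frame in frames:
        expanded = [(path + (w,), score + lp)
                    for path, score in beam
                    for w, lp in frame.items()]
        # Prune: keep only the top-scoring partial paths (no backtracking).
        expanded.sort(key=lambda ps: ps[1], reverse=True)
        beam = expanded[:beam_width]
    return beam[0]

# Hypothetical per-frame word hypotheses with made-up log-probabilities:
frames = [{"a": -0.1, "the": -0.3},
          {"cat": -0.2, "cap": -0.9},
          {"sat": -0.1, "sap": -1.2}]
path, score = beam_search(frames, beam_width=2)
print(path)   # ('a', 'cat', 'sat')
```

With a well-chosen beam width the pruned paths are almost never the ones an exhaustive search would have returned, which is why accuracy degrades so little.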
Speaker Independent Recognition: Use of Large Data Sets • Is speaker-specific training essential for high performance? • No. Equivalent performance can be obtained from multi-speaker training data, which usually requires 3 to 10 times more data than speaker-specific training. • Leads to a more robust system • Example: Kai-Fu Lee Video • Lesson: One hour of speech from each of 100 different speakers can lead to a more robust and equally accurate system than 10 hours of speech from one speaker!
Unlimited Vocabulary Dictation: Statistical Language Modeling • Can a system be used for unlimited vocabulary dictation? • Trigram and N-gram language models provide a flexible representation • Examples: • WSJ Dictation (1994) • Unlimited Vocabulary Email Dictation (1995) • Lesson: Given a large enough corpus of data, statistical language modeling can lead to respectable system performance
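The trigram idea can be shown on a toy corpus (the corpus below is invented; real systems train on millions of words and add smoothing for unseen trigrams, which this maximum-likelihood sketch omits):

```python
from collections import Counter

# Tiny assumed training corpus:
corpus = "the cat sat on the mat the cat ran".split()

tri = Counter(zip(corpus, corpus[1:], corpus[2:]))  # trigram counts
bi  = Counter(zip(corpus, corpus[1:]))              # bigram (context) counts

def p_trigram(w1, w2, w3):
    """Maximum-likelihood P(w3 | w1, w2); 0 if the context is unseen."""
    if bi[(w1, w2)] == 0:
        return 0.0
    return tri[(w1, w2, w3)] / bi[(w1, w2)]

# "the cat" occurs twice, followed once by "sat" and once by "ran":
print(p_trigram("the", "cat", "sat"))   # 0.5
```

The decoder multiplies such probabilities along each hypothesis, so the language model steers recognition toward word sequences the corpus makes plausible, with no hand-written grammar needed.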
Non-Grammaticality in Spoken Language • Unlike written language, spoken language tends to be non-grammatical and includes non-verbal disfluencies • Semantic Case Frame Parsing • Example: Air Travel Information from an open population (1994) • Wayne Ward Video • Lesson: Conventional NL parsing breaks down for spoken language, which requires less rigid structures
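A minimal sketch of the case-frame approach (the frame, slot patterns, and utterance are invented for illustration, not the actual ATIS grammar): rather than demanding a full grammatical parse, the parser scans for slot fillers and simply skips the disfluencies around them.

```python
import re

# Hypothetical air-travel case frame with two slots:
PATTERNS = {
    "origin":      re.compile(r"from (\w+)"),
    "destination": re.compile(r"to (\w+)"),
}

def parse_frame(utterance):
    """Fill whatever slots can be found; unmatched words are ignored,
    so fillers like 'uh' and 'um' do not break the parse."""
    frame = {}
    for slot, pat in PATTERNS.items():
        m = pat.search(utterance)
        if m:
            frame[slot] = m.group(1)
    return frame

# A disfluent spoken utterance still yields a usable frame:
print(parse_frame("uh show me flights from boston um to denver please"))
# {'origin': 'boston', 'destination': 'denver'}
```

A conventional parser would reject this utterance outright; the case frame extracts the task-relevant meaning anyway, which is the slide's lesson.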
Landmarks • Dragon Dictate and Dragon NaturallySpeaking • IBM ViaVoice dictation • Nuance-based Tellme 800 services allow voice queries for directory information, stocks, sports, news, weather, and horoscopes • Microsoft Speech Server, e.g. voice dialing
On the Need for Interdisciplinary Teams • AI and Pattern Recognition • Knowledge Representation and Search • Approximate Matching • Natural Language Processing • Human Computer Interaction • Cognitive Science • Design • Social Networks • Computer Science • Hardware, Parallel Systems • Algorithm Optimization • Signal Processing • Fourier Transforms, DFT, FFT • Acoustics • Physics of sounds & speech • Vocal tract model • Phonetics and Linguistics • Sounds (Acoustic-Phonetics) • Words (Lexicon) • Grammar (Syntax) • Meaning (Semantics) • Statistics • Probability Theory • Hidden Markov Models • Clustering • Dynamic Programming
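The signal-processing entry above can be illustrated with a direct DFT (this O(N²) sketch is for clarity only; the FFT computes the identical result in O(N log N), which is what makes real-time front ends feasible):

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

# A one-cycle cosine over 8 samples concentrates its energy in bins 1 and 7:
x = [math.cos(2 * math.pi * t / 8) for t in range(8)]
mags = [round(abs(c), 6) for c in dft(x)]
print(mags)   # [0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0]
```

Speech front ends apply such transforms to short windowed frames to obtain the spectral features the acoustic models consume.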
Future Challenges • Unrehearsed Spontaneous Speech • Non-Native Speakers of English • Dynamic Learning from Sparse Data • New Words • New Speakers • New Grammatical Forms • New Languages • No Silver Bullet on the Horizon! • 50 more years? • A million times greater computational power, memory, and bandwidth?
Speech Research and Jim Flanagan • Pervasive Influence Across The Spectrum Of Speech Research • Source of Encouragement and Inspiration