Superhuman Speech Recognition: Technology Challenges & Market Adoption
David Nahamoo, IBM Fellow, Speech CTO, IBM Research
July 2, 2008
Overall Speech Market Opportunity
WW Voice-Driven Conversation Access Technology Forecast
• This chart represents all revenue for speech-related ecosystem activity
• Revenue exceeded $1B for the first time in 2006
• Note also that hosted services are forecast to represent half of speech-related revenue in 2011
*Opus Research, February 2007
Speech Market Segments
• Improved accuracy
• Much larger-vocabulary speech recognition systems
New Opportunity Areas
• Contact Center Analytics
  • Quality assurance, real-time alerts, compliance
• Media Transcription
  • Closed captioning
• Accessibility
  • Government, lectures
• Content Analytics
  • Audio indexing, cross-lingual information retrieval, multimedia mining
• Dictation
  • Medical, legal, insurance, education
• Unified Communication
  • Voicemail, conference calls, email and SMS on handheld devices
[Chart: word error rate over time, showing the target zone relative to the human baseline for conversations]
Performance Results (2004 DARPA EARS Evaluation)
(Last public evaluation of English telephony transcription)
[Chart: word error rate vs. decoding speed across sites; IBM shows the best speed-accuracy tradeoff]
MALACH: Multilingual Access to Large Spoken ArCHives
• Funded by NSF; 5-year project (started October 2001)
• Project participants
  • IBM, Visual History Foundation, Johns Hopkins University, University of Maryland, Charles University, and University of West Bohemia
• Objective
  • Improve access to large multilingual collections of spontaneous speech by advancing the state of the art in the technologies that work together to achieve this objective: Automatic Speech Recognition, Computer-Assisted Translation, Natural Language Processing, and Information Retrieval
MALACH: A Challenging Speech Corpus
Multimedia digital archive: 116,000 hours of interviews with over 52,000 survivors, liberators, rescuers, and witnesses of the Nazi Holocaust, recorded in 32 languages.
Goal: improved access to large multilingual spoken archives
Challenges:
• Frequent interruptions: "CHURCH TWO DAYS these were the people who were to go to march TO MARCH and your brother smuggled himself SMUGGLED IN IN IN IN"
• Disfluencies: "A- a- a- a- band with on- our- on- our- arm"
• Emotional speech: "young man they ripped his teeth and beard out they beat him"
Effects of Customization (MALACH Data)
• State-of-the-art ASR system trained on SWB data (8 kHz)
• MALACH training data seen by AM and LM
• fMPE, MPE, consensus decoding
[Chart: word error rate improvements from successive customization steps]
Progress in Word Error Rate – IBM WebSphere Voice Server
• 45% relative improvement in WER in the last 2.5 years
• 20% relative improvement in speed in the last 1.5 years
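Word error rate, the metric behind these figures, is the edit distance between hypothesis and reference divided by the reference length. A minimal sketch (illustrative only, not IBM's evaluation tooling):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("i feel fine today", "i veal fine toady"))  # 0.5
```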
Multi-Talker Speech Separation Task
Male and female speaker mixed at 0 dB, e.g.:
• "Lay white at X 8 soon"
• "Bin green with F 7 now"
Two-Talker Speech Separation Challenge Results
[Examples: mixture signals and the corresponding recognition errors]
IBM's Superhuman Speech Recognition
• Universal Recognizer
  • Any accent
  • Any topic
  • Any noise conditions
  • Broadcast, phone, in-car, or live
  • Multiple languages
  • Conversational
Human Experiments
[Figure: a "sausage" (confusion network) of recognizer hypotheses, with competing words such as "stem / could / that / on / it / cuts / down / and / I / comes / stay / I'm / they / cut / them"]
• Question
  • Can post-processing of recognizer hypotheses by humans improve accuracy?
  • What is the relative contribution of linguistic vs. acoustic information in this post-processing operation?
• Experiment
  • Produce recognizer hypotheses in the form of "sausages" (see the sketch after this list)
  • Allow humans to correct the output with either linguistic information alone or short segments of acoustic information
• Results
  • Human performance is still far from the maximum possible given the information in the "sausages"
  • The recognizer's hypothesized linguistic context information is not useful by itself
  • Acoustic information over a limited span (1 sec. on average) is marginally useful
• What we learned about such experiments
  • Hard to design
  • Expensive to conduct
  • Hard to decide whether they are valuable
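A "sausage" is a sequence of slots, each holding competing words with posterior probabilities; the maximum-possible baseline comes from picking the best word per slot. A minimal sketch with made-up words and posteriors (not the experiment's actual data):

```python
# Each slot of the confusion network maps competing words to posteriors.
sausage = [
    {"I": 0.7, "eye": 0.3},
    {"feel": 0.5, "veal": 0.4, "fill": 0.1},
    {"fine": 0.8, "shine": 0.2},
    {"today": 0.6, "toady": 0.4},
]

# Best path: pick the highest-posterior word in each slot independently.
best = [max(slot, key=slot.get) for slot in sausage]
print(" ".join(best))  # "I feel fine today"
```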
Acoustic Modeling Today
• Approach: Hidden Markov Models
  • Observation densities (GMMs) for P(feature | class) (see the sketch after this list)
  • Mature mathematical framework; easy to combine with linguistic information
  • However, does not directly model what we want, i.e., P(words | acoustics)
• Training: use transcribed speech data
  • Maximum likelihood
  • Various discriminative criteria
• Handling training/test mismatches
  • Avoid mismatches by collecting "custom" data
  • Adaptation and adaptive-training algorithms
• Significantly worse than humans for tasks with little or no linguistic information, e.g., digit/letter recognition
• Human performance is extremely robust to acoustic variations
  • due to speaker, speaking style, microphone, channel, noise, accent, and dialect variations
• Steady progress over the years; continued progress using the current methodology is very likely in the future
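A minimal sketch of the GMM observation density P(feature | class) named above, shown for a one-dimensional feature with made-up weights, means, and variances (a real system trains these on transcribed speech and uses multi-dimensional cepstral features):

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """log P(x | class) under a 1-D Gaussian mixture model."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        log_terms.append(math.log(w) + log_gauss)
    m = max(log_terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

# Hypothetical 3-component mixture for one HMM state.
print(gmm_log_likelihood(0.4, [0.5, 0.3, 0.2], [0.0, 1.0, -1.0], [1.0, 0.5, 0.5]))
```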
Towards a Non-Parametric Approach to Acoustics
• General idea: back to pattern-recognition basics!
  • Break the test utterance into a sequence of larger segments (phone, syllable, word, phrase)
  • Match segments to the closest ones in the training corpus using some metric (possibly using long-distance models)
  • It helps to get it right if you've heard it before
• Why prefer this approach over HMMs?
  • HMMs compress the training data by ~1000x; too many modeling assumptions
    • 1000 hrs ≈ 30 GB; state-of-the-art acoustic models ≈ 30 MB
  • Relaxing assumptions has been key to all recent improvements in acoustic modeling
• How can we accomplish this?
  • Store and index the training data for rapid access to training segments close to test segments
  • Develop a metric D(train_seq, test_seq); the obvious candidate is DTW with an appropriate local metric and warping rules (a minimal sketch follows below)
• Back to the future?
  • Reminiscent of DTW and segmental models from the late '80s; maximum entropy (ME) was missing
  • Limited then by computational resources (storage/CPU/data), so HMMs won
• Implications
  • Need 100x more data to handle larger units (hence 100x more computing resources)
  • Better performance with more data; more likely to have "heard it before"
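A minimal sketch of the D(train_seq, test_seq) metric named above, using classic DTW over toy scalar features; a real system would compare cepstral feature vectors with a vector distance:

```python
def dtw_distance(train_seq, test_seq, dist=lambda a, b: abs(a - b)):
    """Dynamic time warping distance between two feature sequences.

    Standard symmetric warping rules; `dist` is the per-frame local metric
    (absolute difference here, since the toy features are scalars).
    """
    n, m = len(train_seq), len(test_seq)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(train_seq[i - 1], test_seq[j - 1])
            d[i][j] = cost + min(d[i - 1][j],       # stretch the train frame
                                 d[i][j - 1],       # stretch the test frame
                                 d[i - 1][j - 1])   # advance both
    return d[n][m]

# Toy nearest-neighbor match: the closest stored training segment wins.
train_segments = {"yes": [1.0, 2.0, 3.0], "no": [3.0, 1.0, 0.0]}
test = [1.1, 1.9, 2.0, 3.1]
print(min(train_segments, key=lambda k: dtw_distance(train_segments[k], test)))  # "yes"
```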
Utilizing Linguistic Information in ASR
• Today's standard ASR does not explicitly use linguistic information
• But recent work at JHU, SRI, and IBM all shows promise (a toy rescoring sketch follows this list)
  • A semantically structured LM improves ASR significantly for limited domains
  • Reduces WER by 25% across many tasks (air travel, medical)
• A large amount of linguistic knowledge sources is now available, but not used for ASR
  • Inside IBM
    • WWW text: raw text of 50 million pages ≈ 25 billion words, ~10% useful after cleanup
    • News text: 3-4 billion words, broadcast or newswire
    • Named-entity annotated text: 2 million words tagged
    • Ontologies
    • Linguistic knowledge used in the rule-based MT system
  • External
    • WordNet, FrameNet, Cyc ontologies
    • Penn Treebank, Brown corpus (syntactically and semantically annotated)
    • Online dictionaries and thesauri
    • Google
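One common way linguistic information enters ASR is by rescoring N-best hypotheses with a language model. The sketch below uses a plain smoothed bigram model, not the semantically structured LM the slide reports; the counts and hypotheses are made up for illustration:

```python
import math

# Toy bigram counts standing in for a trained language model.
bigram_counts = {("i", "feel"): 90, ("i", "veal"): 1,
                 ("feel", "fine"): 80, ("veal", "fine"): 5,
                 ("fine", "today"): 70, ("fine", "toady"): 2}

def lm_score(words, alpha=1.0, vocab=1000):
    """Add-alpha smoothed bigram log probability of a word sequence."""
    score = 0.0
    for prev, cur in zip(words, words[1:]):
        c = bigram_counts.get((prev, cur), 0)
        total = sum(v for (p, _), v in bigram_counts.items() if p == prev)
        score += math.log((c + alpha) / (total + alpha * vocab))
    return score

# Rescore an N-best list from the recognizer; the LM picks the fluent one.
nbest = ["i veal fine today", "i feel fine toady", "i feel fine today"]
print(max(nbest, key=lambda h: lm_score(h.split())))  # "i feel fine today"
```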
Super Structured LM for LVCSR
[Diagram: the word sequence W_1, ..., W_N conditioned on multiple knowledge sources: semantic parser, syntactic parser, world knowledge, dialogue state, named entities, document type, embedded grammar, speaker (turn, gender, ID), and word class]
• Acoustic confusability: the LM should be optimized to distinguish between acoustically confusable sets, rather than based on N-gram counts
• Automatic LM adaptation at different levels: discourse, semantic structure, and phrase
Combination Decoders
• "ROVER" is used in all current systems
  • A NIST tool that combines multiple system outputs through voting, e.g.:
    "I feel shine today" + "I veal fine today" + "I feel fine toady" → "I feel fine today"
  • Individual systems are currently designed in an ad hoc manner
  • Only 5 or so systems are practical
• An army ("million") of simple decoders
  • Each makes uncorrelated errors (a minimal voting sketch follows below)
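A minimal sketch of ROVER-style word-level voting, assuming the hypotheses are already aligned word-for-word; the real NIST ROVER first builds that alignment before voting:

```python
from collections import Counter

hypotheses = ["I feel shine today",
              "I veal fine today",
              "I feel fine toady"]

# Assume equal-length, pre-aligned hypotheses; vote per word position.
columns = zip(*(h.split() for h in hypotheses))
consensus = " ".join(Counter(col).most_common(1)[0][0] for col in columns)
print(consensus)  # "I feel fine today"
```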
Million Feature Paradigm: Acoustic Information for ASR
[Diagram: information sources (segmental analysis, broadband features, narrowband features, trajectory features, onset features) vs. noise sources; transient noise is discarded, stationary noise handled by global adaptation]
• Feature definition is the key challenge
• A maximum entropy model is used to compute word probabilities (a minimal sketch follows below)
• Information sources are combined in a unified theoretical framework
• Long-span segmental analysis is inherently robust to both stationary and transient noise
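A minimal sketch of a maximum entropy (multinomial logistic) model combining heterogeneous feature streams into word probabilities; the stream names, values, and weights are made up, and a real system would train the weights on data:

```python
import math

def maxent_word_probs(features, weights, words):
    """P(word | features) = exp(w_word . f) / Z under a maximum entropy model."""
    scores = {w: sum(weights[w][name] * val for name, val in features.items())
              for w in words}
    m = max(scores.values())  # subtract the max for numerical stability
    z = sum(math.exp(s - m) for s in scores.values())
    return {w: math.exp(scores[w] - m) / z for w in scores}

# Hypothetical streams: trajectory, onset, and segmental feature scores.
features = {"trajectory": 0.8, "onset": 0.2, "segmental": 0.5}
weights = {"fine":  {"trajectory": 2.0, "onset": 1.0, "segmental": 1.5},
           "shine": {"trajectory": 0.5, "onset": 2.0, "segmental": 0.5}}
print(maxent_word_probs(features, weights, ["fine", "shine"]))
```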
Implications of the Data-Driven Learning Paradigm
• ASR systems give the best results when the test data is similar to the training data
• Performance degrades as the test data diverges from the training data
• Differences can occur at both the acoustic and linguistic levels, e.g.:
  • A system designed to transcribe standard telephone audio (8 kHz) cannot transcribe compressed telephony archives (6 kHz)
  • A system designed for a given domain (e.g., broadcast news) will perform worse on a different domain (e.g., dictation)
• Hence the training and test sets have to be chosen carefully if the task at hand involves a variety of acoustic sources
Generalization Dilemma
[Conceptual plot: performance vs. test conditions, from in-domain to out-of-domain. A complex model built by brute-force learning wins in-domain but falls into "The Gutter of Data Addiction" out-of-domain; a simple model degrades more gracefully. Where we want to get: a correct complex model (a simple model on the right manifold). Model combination: can we at least get the best of both worlds?]
Summary
• Continue the current tried-and-true technical approach
  • Continue the yearly milestones and evaluations
  • Continue the focus on accuracy, robustness, and efficiency
• Increase the focus on quantum-leap innovation
• Increase the focus on language modeling
• Plan for a 2-orders-of-magnitude increase in
  • Access to annotated speech and text data
  • Computing resources
• Improve cross-fertilization among different projects