
Presentation Transcript


  1. Superhuman Speech Recognition: Technology Challenges & Market Adoption • David Nahamoo, IBM Fellow, Speech CTO, IBM Research • July 2, 2008

  2. Overall Speech Market Opportunity • [Chart: WW Voice-Driven Conversation Access Technology Forecast] • This chart represents all revenue for speech-related ecosystem activity • Revenue exceeded $1B for the first time in 2006 • Note also that hosted services will represent half of speech-related revenue in 2011 (*Opus Research, 02/2007)

  3. Speech Market Segments • Improved accuracy • Much larger-vocabulary speech recognition systems

  4. New Opportunity Areas • Contact Center Analytics • Quality assurance, real-time alerts, compliance • Media Transcription • Closed captioning • Accessibility • Government, lectures • Content Analytics • Audio indexing, cross-lingual information retrieval, multimedia mining • Dictation • Medical, legal, insurance, education • Unified Communication • Voicemail, conference calls, email and SMS on handheld devices

  5. [Chart: target zone relative to the human baseline for conversations]

  6. Performance Results (2004 DARPA EARS Evaluation) • Last public evaluation of English telephony transcription • IBM: best speed-accuracy tradeoff

  7. MALACH: Multilingual Access to Large Spoken ArCHives • Funded by NSF; 5-year project (started in Oct. 2001) • Project participants: IBM, Visual History Foundation, Johns Hopkins University, University of Maryland, Charles University, and University of West Bohemia • Objective: improve access to large multilingual collections of spontaneous speech by advancing the state of the art in the technologies that work together to achieve this objective: automatic speech recognition, computer-assisted translation, natural language processing, and information retrieval

  8. MALACH: A Challenging Speech Corpus • Multimedia digital archive: 116,000 hours of interviews with over 52,000 survivors, liberators, rescuers, and witnesses of the Nazi Holocaust, recorded in 32 languages • Goal: improved access to large multilingual spoken archives • Challenges: • Frequent interruptions: "CHURCH TWO DAYS these were the people who were to go to march TO MARCH and your brother smuggled himself SMUGGLED IN IN IN IN" • Disfluencies: "A- a- a- a- band with on- our- on- our- arm" • Emotional speech: "young man they ripped his teeth and beard out they beat him"

  9. Effects of Customization (MALACH Data) • State-of-the-art ASR system trained on SWB data (8 kHz) • MALACH training data seen by AM and LM • fMPE, MPE, consensus decoding

  10. Improvement in Word Error Rate for IBM embedded ViaVoice

  11. Progress in Word Error Rate – IBM WebSphere Voice Server • 45% relative improvement in WER in the last 2.5 years • 20% relative improvement in speed in the last 1.5 years
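For readers who want the arithmetic behind such figures: WER is the edit distance (substitutions, insertions, deletions) between the recognizer output and a reference transcript, divided by the reference length. A minimal Python sketch; the absolute WER numbers at the end are purely illustrative, since the deck only gives relative improvements:

```python
# Minimal sketch: word error rate (WER) via edit distance.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("i feel fine today", "i veal fine toady"))  # 0.5

# A 45% relative improvement means the new WER is 55% of the old one;
# the 20% -> 11% example below is invented for illustration.
old_wer, new_wer = 0.20, 0.11
relative = (old_wer - new_wer) / old_wer  # 0.45
```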

  12. Multi-Talker Speech Separation Task • Male and female speaker mixed at 0 dB • Example utterances: "Lay white at X 8 soon", "Bin green with F 7 now"

  13. Two-Talker Speech Separation Challenge Results • [Examples: mixture audio and recognition error]

  14. IBM’s Superhuman Speech Recognition • Universal Recognizer • Any accent • Any topic • Any noise conditions • Broadcast, phone, in car, or live • Multiple languages • Conversational

  15. Human Experiments • [Figure: recognizer "sausage" fragment with word hypotheses such as "stem", "could", "that", "on", "it", "cuts down", "and I", "comes", "stay", "I'm", "they", "cut them"] • Questions: • Can post-processing of recognizer hypotheses by humans improve accuracy? • What is the relative contribution of linguistic vs. acoustic information in this post-processing operation? • Experiment: • Produce recognizer hypotheses in the form of "sausages" (word confusion networks) • Allow humans to correct the output using either linguistic information alone or short segments of acoustic information • Results: • Human performance is still far from the maximum possible given the information in the "sausages" • The recognizer's hypothesized linguistic context information is not useful by itself • Acoustic information over a limited span (1 sec. on average) is marginally useful • What we learned: • Hard to design • Expensive to conduct • Hard to decide whether the results are valuable
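For context, a "sausage" collapses the recognizer's output into a sequence of slots, each holding competing words with posterior probabilities; the machine baseline that the human subjects try to beat simply takes the top word in every slot. A minimal sketch with invented slot data:

```python
# Minimal sketch: consensus decoding of a confusion network ("sausage").
# Each slot holds (word, posterior) alternatives; the toy data is invented.
sausage = [
    [("they", 0.6), ("i", 0.4)],
    [("cut", 0.7), ("cuts", 0.3)],
    [("them", 0.55), ("stem", 0.45)],
]

# Baseline: pick the highest-posterior word in each slot. A human
# post-editor instead chooses among the alternatives using linguistic
# or short-span acoustic evidence.
best = " ".join(max(slot, key=lambda wp: wp[1])[0] for slot in sausage)
print(best)  # "they cut them"
```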

  16. Acoustic Modeling Today • Approach: hidden Markov models • Observation densities (GMMs) for P(feature | class) • Mature mathematical framework, easy to combine with linguistic information • However, does not directly model what we want, i.e., P(words | acoustics) • Training: use transcribed speech data • Maximum likelihood • Various discriminative criteria • Handling training/test mismatches: • Avoid mismatches by collecting "custom" data • Adaptation & adaptive-training algorithms • Significantly worse than humans for tasks with little or no linguistic information, e.g., digit/letter recognition • Human performance is extremely robust to acoustic variations due to speaker, speaking style, microphone, channel, noise, accent, & dialect • Steady progress over the years; continued progress using the current methodology is very likely
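As a concrete reference point for the observation densities mentioned above, here is a minimal sketch of scoring one feature frame against a diagonal-covariance Gaussian mixture for a single HMM state; all parameters are invented:

```python
# Minimal sketch: the GMM observation density P(feature | HMM state)
# used in classical acoustic models. All parameters here are invented.
import math

def log_gmm(x, weights, means, variances):
    """Log P(x | state) under a diagonal-covariance Gaussian mixture."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(ll)
    m = max(log_terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

# One 3-dimensional feature frame scored against a 2-component mixture.
x = [0.2, -1.1, 0.5]
print(log_gmm(x,
              weights=[0.6, 0.4],
              means=[[0.0, -1.0, 0.4], [1.0, 0.0, 0.0]],
              variances=[[1.0, 0.5, 0.3], [0.8, 1.2, 0.6]]))
```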

  17. Towards a Non-Parametric Approach to Acoustics • General idea: back to pattern-recognition basics! • Break the test utterance into a sequence of larger segments (phone, syllable, word, phrase) • Match segments to the closest ones in the training corpus using some metric (possibly using long-distance models) • It helps to get it right if you've heard it before • Why prefer this approach over HMMs? • HMMs compress the training data by ~1000x, with too many modeling assumptions: 1,000 hrs of audio is ~30 GB, while a state-of-the-art acoustic model is ~30 MB • Relaxing assumptions has been key to all recent improvements in acoustic modeling • How can we accomplish this? • Store & index the training data for rapid access to training segments close to test segments • Develop a metric D(train_seq, test_seq): the obvious candidate is DTW with an appropriate local metric and warping rules (see the sketch after this list) • Back to the future? • Reminiscent of DTW & segmental models from the late 80's; maximum entropy (ME) modeling was missing then • Limited by computational resources (storage/CPU/data) at the time, so HMMs won • Implications: • Need 100x more data to handle larger units (hence 100x more computing resources) • Better performance with more data: more likely to have "heard it before"
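A minimal sketch of the segment-matching idea, using plain DTW with a Euclidean local metric as the candidate D(train_seq, test_seq); the stored segments and feature vectors are invented, and a real system would index the corpus rather than scan it linearly:

```python
# Minimal sketch: DTW as the segment metric D(train_seq, test_seq).
import math

def dtw(a, b, dist=lambda u, v: math.dist(u, v)):
    """Dynamic time warping distance between two feature-vector sequences."""
    INF = float("inf")
    d = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = dist(a[i - 1], b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[len(a)][len(b)]

# Match a test segment against stored training segments; closest wins.
# Labels and 2-dimensional "features" are invented for illustration.
train = {"hello": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.4]],
         "world": [[0.9, 0.1], [0.8, 0.2]]}
test = [[0.1, 0.25], [0.45, 0.4]]
print(min(train, key=lambda w: dtw(train[w], test)))  # "hello"
```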

  18. Utilizing Linguistic Information in ASR • Today's standard ASR does not explicitly use linguistic information • But recent work at JHU, SRI, and IBM all shows promise • A semantically structured LM improves ASR significantly for limited domains • Reduces WER by 25% across many tasks (air travel, medical) • A large amount of linguistic knowledge sources is now available but not used for ASR • Inside IBM: • WWW text: raw text of 50 million pages, ~25 billion words, ~10% useful after cleanup • News text: 3-4 billion words, broadcast or newswire • Named-entity annotated text: 2 million words tagged • Ontologies • Linguistic knowledge used in the rule-based MT system • External: • WordNet, FrameNet, Cyc ontologies • Penn Treebank, Brown corpus (syntactically & semantically annotated) • Online dictionaries and thesauri • Google

  19. Super Structured LM for LVCSR • [Diagram: word sequence W_1, ..., W_N feeding a structured LM that combines a semantic parser, a syntactic parser, world knowledge, dialogue state, named entities, document type, embedded grammar, speaker (turn, gender, ID), and word class] • Acoustic confusability: the LM should be optimized to distinguish between acoustically confusable sets, rather than being based on N-gram counts • Automatic LM adaptation at different levels: discourse, semantic structure, and phrase

  20. Combination Decoders • "ROVER" is used in all current systems • A NIST tool that combines multiple system outputs through voting, e.g., "I feel shine today", "I veal fine today", and "I feel fine toady" vote to yield "I feel fine today" • Individual systems are currently designed in an ad-hoc manner • Only 5 or so systems are practical • Instead: an army ("million") of simple decoders • Each makes uncorrelated errors
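A minimal sketch of the word-level voting idea behind ROVER, using the slide's example hypotheses. The real NIST tool first aligns the hypotheses with dynamic programming and can weight votes by confidence scores; here the outputs are assumed pre-aligned word-for-word:

```python
# Minimal sketch: ROVER-style word-level voting over aligned hypotheses.
from collections import Counter

hyps = [
    "i feel shine today".split(),
    "i veal fine today".split(),
    "i feel fine toady".split(),
]

# At each word position, the majority word wins; uncorrelated errors
# are outvoted by the systems that got that position right.
voted = [Counter(words).most_common(1)[0][0] for words in zip(*hyps)]
print(" ".join(voted))  # "i feel fine today"
```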

  21. Million Feature Paradigm: Acoustic Information for ASR • [Diagram: information sources (segmental analysis, broadband features, narrowband features, trajectory features, onset features) set against noise sources; transient noise is discarded, and stationary noise is handled by global adaptation] • Feature definition is the key challenge • A maximum entropy model is used to compute word probabilities • Information sources are combined in a unified theoretical framework • Long-span segmental analysis is inherently robust to both stationary and transient noise
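A minimal sketch of how a maximum entropy (log-linear) model can combine such information sources into word probabilities; the feature streams, values, and weights below are all invented for illustration:

```python
# Minimal sketch: maximum entropy (log-linear) combination of feature
# streams into word probabilities: P(word) ~ exp(sum_k w_k * f_k(word)).
import math

def maxent_word_probs(feats_per_word, weights):
    """Softmax over weighted feature sums, one score per candidate word."""
    scores = {w: sum(weights[k] * v for k, v in f.items())
              for w, f in feats_per_word.items()}
    m = max(scores.values())  # subtract max for numerical stability
    z = sum(math.exp(s - m) for s in scores.values())
    return {w: math.exp(s - m) / z for w, s in scores.items()}

# Two candidate words scored by three invented feature streams.
feats = {
    "fine":  {"trajectory": 0.8, "onset": 0.6, "segmental": 0.7},
    "shine": {"trajectory": 0.2, "onset": 0.5, "segmental": 0.1},
}
weights = {"trajectory": 1.5, "onset": 0.5, "segmental": 1.0}
print(maxent_word_probs(feats, weights))
```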

  22. Implications of the Data-Driven Learning Paradigm • ASR systems give the best results when the test data is similar to the training data • Performance degrades as the test data diverges from the training data • Differences can occur at both the acoustic and linguistic levels, e.g.: • A system designed to transcribe standard telephone audio (8 kHz) cannot transcribe compressed telephony archives (6 kHz) • A system designed for a given domain (e.g., broadcast news) will perform worse on a different domain (e.g., dictation) • Hence the training and test sets have to be chosen carefully if the task at hand involves a variety of acoustic sources

  23. Generalization Dilemma • [Chart: performance vs. test conditions, from in-domain to out-of-domain. A simple model degrades gracefully; a complex, brute-force-learned model excels in-domain but falls into "The Gutter of Data Addiction" out-of-domain. We want to get to the correct complex model (a simple model on the right manifold). Model combination: can we at least get the best of both worlds?]

  24. Summary • Continue the current tried-and-true technical approach • Continue the yearly milestones and evaluations • Continue the focus on accuracy, robustness, & efficiency • Increase the focus on quantum-leap innovation • Increase the focus on language modeling • Plan for a two-orders-of-magnitude increase in: • Access to annotated speech and text data • Computing resources • Improve cross-fertilization among different projects
