CS 378 Natural Language Processing *** Speech Processing: Present, Past and Future

Inge M. R. De Bleecker
Department of Linguistics
inge@mail.utexas.edu
October 14, 2003

Main Types of NLP Applications
• Text processing: information retrieval and search engines, information extraction, text summarization, machine translation, question answering…
• Speech processing: speech recognition (ASR), both over-the-telephone (OTT) and desktop dictation systems; speaker verification; text-to-speech (TTS)

Overview
• Speech industry: history (late 80’s to present)
• Practical overview of current applications and their future directions:
  • Speech recognition accuracy
  • Text-to-speech accuracy
  • Usability and design
  • Application building tools
• Working in the speech industry

History: Late Eighties
Sentiment: OTT ASR finally ready for commercial applications…
Technology: OTT speaker-independent discrete digit and yes/no apps. Word-based language models. TTS mostly used for numbers, if used at all; pre-recorded strings much more common.
Applications: simple in structure and functionality.
• OTT: banking, e.g. ask for account balance.
• Desktop: first dictation systems, medical applications.
Companies: few. Small research-oriented companies, or research arms of big companies, e.g. Dragon Systems, VPC, VCS, Kurzweil, BBN, AT&T, …

History: Early Nineties
Sentiment: credibility and usability of apps grow. Multilingual developments.
Technology: OTT speaker-independent (SI) continuous digit, yes/no, and command-word apps. Move to phoneme-based language models.
Applications:
• OTT: still simple, system-directed dialog (vs. user-directed or mixed-initiative).
• Desktop: more dictation systems, command-and-control systems (user-directed).
Companies: more companies pop up. Most grow out of research communities.

History: Mid to Late Nineties
Technology: maturing of technologies used.
Companies:
• overall growth
• dirty politics (L & H)
• mergers and buyouts start (still ongoing today)

History: Late Nineties to Present
Technology: maturing of technology continues
• better recognition accuracy
• unrestricted ASR input (natural speech)
• move to more sophisticated dialog systems (see next slide)
• tool standardization
Applications: wider use of apps. More attention to usability, dialog design, etc.

Dialog System Architecture
ASR → Parser → Reasoning → Output Generation → TTS
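The five components named on this slide can be read as a processing chain for one dialog turn. The sketch below is only an illustration of that reading, not the architecture of any particular product: the function names, the `Frame` type, the keyword-based parser, and the banking example are all assumptions, and the ASR and TTS stages are stand-ins that pass text through instead of handling audio.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Hypothetical meaning representation passed between components."""
    intent: str
    slots: dict = field(default_factory=dict)


def recognize(audio: str) -> str:
    # Stand-in for ASR: the "audio" is already a transcript in this sketch.
    return audio.lower().strip()


def parse(text: str) -> Frame:
    # Toy keyword parser: maps the recognized text to an intent and slots.
    if "balance" in text:
        account = "savings" if "savings" in text else "checking"
        return Frame("get_balance", {"account": account})
    return Frame("unknown")


def reason(frame: Frame, balances: dict) -> Frame:
    # Reasoning: decide what the system should say next.
    if frame.intent == "get_balance":
        account = frame.slots["account"]
        return Frame("inform_balance",
                     {"account": account, "amount": balances[account]})
    return Frame("reprompt")


def generate(frame: Frame) -> str:
    # Output generation: turn the system's decision into a sentence.
    if frame.intent == "inform_balance":
        return (f"Your {frame.slots['account']} balance is "
                f"{frame.slots['amount']:.2f} dollars.")
    return "Sorry, I did not understand. Please say, for example, 'checking balance'."


def synthesize(text: str) -> str:
    # Stand-in for TTS: returns a marker string instead of audio.
    return f"<speech>{text}</speech>"


def dialog_turn(audio: str, balances: dict) -> str:
    # ASR -> Parser -> Reasoning -> Output Generation -> TTS
    return synthesize(generate(reason(parse(recognize(audio)), balances)))
```

Calling `dialog_turn("What is my savings balance", {"checking": 120.5, "savings": 300.0})` runs one turn through all five stages. Keeping each stage behind its own function is what would let a real recognizer or synthesizer replace the stand-ins without touching the rest.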

Speech Recognition Accuracy
Present: reasonable accuracy on natural speech. Most systems still use a grammar to help the recognizer. Grammars are written in VoiceXML or a vendor-specific language and are not very sophisticated from a linguistics point of view. Some systems are (theoretically) purely statistical, e.g. Nuance’s Accuroute.
Future: need to add more linguistic principles to current statistical methods. Make signal processing more robust, encourage reusability.
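One way to picture how a grammar helps the recognizer is as a filter on its candidate transcriptions. The sketch below assumes a recognizer that returns an n-best list of (text, score) pairs and uses a regular expression as a stand-in grammar covering yes/no and a four-digit string. The grammar, the scores, and the function name are hypothetical, and real recognizers typically apply the grammar during search instead of filtering afterwards.

```python
import re

# Hypothetical grammar: "yes", "no", or exactly four spoken digits.
DIGIT = r"(?:zero|oh|one|two|three|four|five|six|seven|eight|nine)"
GRAMMAR = re.compile(rf"(?:yes|no|{DIGIT}(?: {DIGIT}){{3}})")


def best_in_grammar(nbest):
    """Return the highest-scoring hypothesis the grammar accepts, or None.

    nbest is a list of (text, score) pairs; a higher score is better.
    """
    accepted = [(score, text) for text, score in nbest
                if GRAMMAR.fullmatch(text)]
    if not accepted:
        return None
    return max(accepted)[1]


# Illustrative n-best list with made-up scores: the top hypothesis is
# acoustically plausible but is not a sentence of the grammar.
nbest = [
    ("won to three four", 0.62),
    ("one two three four", 0.58),
    ("yes", 0.10),
]
result = best_in_grammar(nbest)
```

Here the grammar rejects the top-scoring but out-of-grammar hypothesis and the second one is chosen, which is the sense in which even a linguistically simple grammar can raise effective accuracy.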

TTS Accuracy
Present: getting better all the time. During the last few years, additional research in prosody and intonation has paid off: more natural-sounding speech. Also deals with abbreviations, etc. Current TTS can be used to patch up ‘real’ speech, e.g. AT&T, Scansoft (Speechworks).
Future: probably never a complete substitute for pre-recorded strings.

Usability – Dialog Design
Present
Dialog design (VUI) is becoming more sophisticated through
• use of natural speech input
• mixed-initiative dialogs (more complicated for novice users)
• chatty applications which provide graceful ways of dealing with low accuracy confirmations and errors, fall-back to system-directed dialog, …
• use of persona, e.g. Bell Canada’s Emily
Future
Continued improvements in dialog design are necessary (e.g. usability studies). Dialog design is easier with current (and future) tools, but… still an art! It is (too) easy to design bad speech applications…
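The confirmation and fall-back behaviour mentioned above is often driven by the recognizer's confidence in its result. The following is a minimal sketch of such a policy, assuming the recognizer reports a confidence between 0 and 1. The thresholds, the action names, and the error limit are illustrative values, not figures from any deployed system.

```python
def next_action(confidence, consecutive_errors,
                accept_at=0.8, confirm_at=0.5, max_errors=2):
    """Choose the next dialog move for one recognition result.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    # After repeated failures, stop accepting open-ended input and
    # fall back to a system-directed dialog (one question at a time).
    if consecutive_errors >= max_errors:
        return "fallback_system_directed"
    # High confidence: accept the result without bothering the caller.
    if confidence >= accept_at:
        return "accept"
    # Medium confidence: confirm explicitly ("Did you say ...?").
    if confidence >= confirm_at:
        return "confirm"
    # Low confidence: apologize and ask again.
    return "reprompt"
```

With these defaults, `next_action(0.6, 0)` asks for confirmation, while `next_action(0.9, 2)` falls back even though the last result looked good, because the caller has already hit two errors in a row. Choosing the thresholds is part of the design work the slide describes as still an art.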

Usability – Other Issues
Present
• Natural language generation (NLG) is not receiving much attention
• Reasoning components are very limited
Future
• NLG needs to adapt to the user and conform more to human speech patterns
• multimodal applications
• multilingual systems
• use of e.g. ontologies in reasoning components, …

Application Building Tools
Present
• Standardization: VoiceXML and VoiceXML platforms (alternative: SALT)
• Many platform companies: VoiceGenie, Bevocal, Audium, …
• Also companies developing tools for platforms: Aptera VoiceXML
• World of VoiceXML: comprehensive site on all things VoiceXML
• Free developer’s resources: e.g. Bevocal
• Small companies: can have a VoiceXML app hosted by a platform company
• Big companies: in-house platforms (telco-industry-grade equipment), quite costly
Future
Development of better tools that make it harder to build bad applications!

Speech Apps State-of-the-Art
Conclusion: ASR and TTS are usable in real-world applications right now. To develop better applications, we need to improve accuracy, usability, etc., or… think about some radically different approaches to the current problems! (=> the “age-old” argument)

Working in the Speech Industry
Working for:
• A speech recognition/text-to-speech company: a CS undergraduate can work on software development of tools, deployments. With the addition of some linguistics classes: dialog designer, QA of deployments, …
• A VoiceXML platform company: general software development, …
• A tools company: general software development, …
• A consulting (services) company: dialog design, deployments.
Or… get a Ph.D. in EE and become a speech scientist who develops the next-generation speech recognizer…
