Speech recognition, understanding and conversational interfaces
Alexander Rudnicky
School of Computer Science
http://www.cs.cmu.edu/~air
Outline
• Speech
• Types of speech interfaces
• Speech systems and their structure
• Designing speech interfaces
• Some applications
  • SpeechWear
  • Communicator
Speech as a signal
• The difference between speech and sound
• "CD" quality vs. intelligible quality
  • high quality is 44.1/48 kHz
  • desirable speech bandwidth: 0-8 kHz, 16 bits
  • at 16 bits/sample: 256 kbps (tethered mic)
  • telephone: 64 kbps (and lower)
• Compression:
  • MPEG: 64 kbps/channel and up (but not speech-optimal)
  • CELP: 16 kbps … 2.4 kbps (optimized for speech)
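These rates follow directly from sample rate times sample width; a minimal sketch in plain Python that reproduces the arithmetic (the telephone figure assumes the standard 8 kHz, 8-bit channel coding):

```python
def bit_rate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Uncompressed audio bit rate in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0

# A 0-8 kHz speech bandwidth needs a 16 kHz sample rate (Nyquist).
print(bit_rate_kbps(16_000, 16))      # 256.0 kbps: tethered microphone
print(bit_rate_kbps(8_000, 8))        # 64.0 kbps: telephone channel
print(bit_rate_kbps(44_100, 16, 2))   # 1411.2 kbps: "CD quality" stereo
```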
Speech for communication
• The difference between speech and language
• Speech recognition and speech understanding
Computers and speech
• Transcription
  • dictation, information retrieval
• Command and control
  • data entry, device control, navigation
• Information access
  • airline schedules, stock quotes
• Problem solving
  • travel planning, logistics
Speech system architecture
• SIGNAL PROCESSING
• DECODING
• UNDERSTANDING
• DISCOURSE
• ACTION
A generic speech system
• [Figure: speech input passes through signal processing, the decoder, the parser, and the post parser to the dialog manager, which consults domain agents; the language generator feeds a speech synthesizer, a display, and an effector on the output side.]
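To make the flow concrete, here is a toy end-to-end pass through the pipeline; every function is a placeholder returning canned output (the utterances and flight numbers are made up), not real signal processing, recognition, or synthesis code:

```python
# Each stage stands in for one box in the diagram above.
def signal_processing(audio):    return "feature-vectors"
def decode(features):            return "show me flights to boston"
def parse(words):                return {"request": "flights", "destination": "boston"}
def post_parse(frame, context):  return {**context, **frame}
def dialog_manager(frame):       return ("query_backend", frame)
def domain_agent(action):        return ["UA 123", "US 456"]          # pretend back-end results
def generate(results):           return f"I found {len(results)} flights."
def synthesize(text):            print("SYSTEM SAYS:", text)

def one_turn(audio, context):
    """Audio in, speech out: one pass through the generic architecture."""
    frame = post_parse(parse(decode(signal_processing(audio))), context)
    action = dialog_manager(frame)
    synthesize(generate(domain_agent(action)))

one_turn(audio=b"\x00" * 16000, context={"date": "today"})   # SYSTEM SAYS: I found 2 flights.
```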
Decoding
• Signal processing: reduce the dimensionality of the signal; noise conditioning
• Decoder: transcribe speech to words, using acoustic models and language models (corpus-based statistical models)
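The slide leaves the decoding criterion implicit, but the two model types fit the standard noisy-channel formulation: the decoder searches for the word string W that best explains the acoustic observations A,

  W* = argmax_W P(W | A) = argmax_W P(A | W) · P(W),

where the acoustic models supply P(A | W) and the language model supplies P(W).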
Creating models for recognition
• Acoustic models: trained on (transcribed*) speech data
• Language models: trained on text data
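As a toy illustration of the text-data path only, here is a sketch of bigram language-model estimation by counting (plain Python; a real system would train on far more data and smooth the estimates, and the acoustic-model path is not shown):

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate P(word | previous word) from a text corpus by counting."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        context_counts.update(words[:-1])
        bigram_counts.update(zip(words[:-1], words[1:]))
    return {(prev, w): n / context_counts[prev] for (prev, w), n in bigram_counts.items()}

corpus = ["I want to fly to Boston", "I want to fly to Denver"]
lm = train_bigram_lm(corpus)
print(lm[("to", "boston")])   # 0.25: "to" is followed by "boston" in 1 of its 4 occurrences
```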
Understanding speech
• Parser: extract semantic content from the utterance (grammar; ontology design, language acquisition)
• Post parser: introduce context and world knowledge into the interpretation (context, domain agents; grounding, knowledge engineering)
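A minimal sketch of the parser / post-parser split, using a made-up pattern grammar rather than a real one: the parser pulls slots out of the word string, and the post parser uses context (here, today's date) to resolve what the utterance left relative:

```python
import re
import datetime

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]

def parse(utterance):
    """Parser: extract semantic content from the words (toy pattern grammar)."""
    text, frame = utterance.lower(), {}
    m = re.search(r"\b(?:fly|go|travel)\s+to\s+(\w+)", text)
    if m:
        frame["destination"] = m.group(1)
    for day in WEEKDAYS:
        if day in text:
            frame["departure_day"] = day
    return frame

def post_parse(frame, today):
    """Post parser: use context (today's date) to resolve a relative weekday."""
    if "departure_day" in frame:
        target = WEEKDAYS.index(frame["departure_day"])
        days_ahead = (target - today.weekday()) % 7 or 7    # the next such weekday
        frame["departure_date"] = today + datetime.timedelta(days=days_ahead)
    return frame

frame = parse("I'd like to go to Boston on Friday")
print(post_parse(frame, today=datetime.date(2000, 3, 1)))   # a Wednesday
# {'destination': 'boston', 'departure_day': 'friday', 'departure_date': datetime.date(2000, 3, 3)}
```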
Interacting with the user
• Dialog manager: guide the interaction through the task; map user inputs and system state into actions (task schemas, task analysis, context)
• Domain agents: interact with back-end(s) such as databases and live data (e.g., the Web); interpret information using domain knowledge (domain expert, knowledge engineering)
Communicating with the user
• Language generator: decide what to say to the user (and how to phrase it)
• Output rendered by the speech synthesizer, display generator, or action generator
Speech recognition and understanding
• Sphinx system
  • speaker-independent
  • continuous speech
  • large vocabulary
• ATIS system
  • air travel information retrieval
  • context management
• film clip
Command and control systems
• Small vocabularies, fixed syntax
  • OPEN WINDOW <window_id>
  • MOVE OBJECT <object_id> to <position>
• Applications: data entry (e.g., zip codes), process control (e.g., electron microscope, darkroom equipment)
• Large vocabulary, fixed syntax
  • Web browsing (?)
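Fixed-syntax commands like the two above can be captured with an explicit pattern per command; a minimal regex-based sketch (the slot values are illustrative, and a deployed system would compile such a grammar into the recognizer rather than match text after the fact):

```python
import re

# One pattern per command in the fixed syntax; slots are named capture groups.
COMMANDS = [
    ("open_window", re.compile(r"^OPEN WINDOW (?P<window_id>\w+)$")),
    ("move_object", re.compile(r"^MOVE OBJECT (?P<object_id>\w+) TO (?P<position>\w+)$")),
]

def interpret(utterance):
    """Map a recognized word string onto a command and its slot values."""
    for name, pattern in COMMANDS:
        match = pattern.match(utterance.upper())
        if match:
            return name, match.groupdict()
    return None, {}    # out-of-grammar: reject or re-prompt

print(interpret("open window W3"))         # ('open_window', {'window_id': 'W3'})
print(interpret("move object O7 to A2"))   # ('move_object', {'object_id': 'O7', 'position': 'A2'})
```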
SpeechWear
• Vehicle inspection task: USMC mechanics, fixed inspection form
• Wearable computer (COTS components)
• HTML-based task representation
• film clip
Information access
• Moderate to very large vocabulary
• IVR and frame-based systems
• Commercial systems:
  • Nuance: http://www.nuance.com/demo/index.html
  • SpeechWorks: http://www.speechworks.com/demos/demos.htm
  • lots of others...
IVR and frame-based systems
• Interactive voice response (IVR)
  • interactions specified by a graph (typically a tree)
• Frame systems
  • ergodic graphs
  • states defined by multi-item forms
Graph-based systems
• "Welcome to Bank ABC! Please say one of the following: Balance, Hours, Loan, ..."
• "What type of loan are you interested in? Please say one of the following: Mortgage, Car, Personal, ..."
• ...
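The bank menu is just a tree walked one prompt at a time. A minimal sketch of such a graph-based dialog follows; the prompts are taken from the slide, while the leaf responses and the simulated recognizer are invented for illustration:

```python
# Each node is (prompt, {spoken option -> child node}); leaves are (prompt, None).
MENU = (
    "Welcome to Bank ABC! Please say one of the following: Balance, Hours, Loan.",
    {
        "balance": ("Your balance is ...", None),
        "hours": ("We are open 9 to 5, Monday through Friday.", None),
        "loan": (
            "What type of loan are you interested in? Please say one of the following: Mortgage, Car, Personal.",
            {
                "mortgage": ("Connecting you to a mortgage officer ...", None),
                "car": ("Connecting you to an auto-loan officer ...", None),
                "personal": ("Connecting you to a personal-loan officer ...", None),
            },
        ),
    },
)

def run_ivr(node, recognize):
    """Walk the menu tree: prompt, recognize one keyword, descend until a leaf."""
    prompt, children = node
    while children is not None:
        print("SYSTEM:", prompt)
        choice = recognize()                                           # one word from the ASR
        prompt, children = children.get(choice, (prompt, children))    # unknown word: re-prompt
    print("SYSTEM:", prompt)

# Simulated recognizer output for one call:
answers = iter(["loan", "car"])
run_ivr(MENU, recognize=lambda: next(answers))
```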
Frame-based systems
• Example frame:
  Destination_City: Boston
  Departure_Date: ______
  Departure_Time: ______
  Preferred_Airline: ______
  ...
• "I would like to fly to Boston"
• "I'd like to go to Boston on Friday, ..."
• "When would you like to fly?"
Frame-based systems
• [Figure: a network of frames (multi-slot forms); the system moves from one frame to another on a keyword or phrase.]
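A minimal sketch of the frame mechanism: each frame is a multi-item form whose slots can be filled in any order, and a keyword or phrase moves the system to another frame. The frame names, slots, and keywords here are made up for illustration:

```python
# Each frame is a form: a set of named slots plus the keywords that switch to other frames.
FRAMES = {
    "flight": {"slots": ["destination_city", "departure_date", "departure_time", "preferred_airline"],
               "transitions": {"hotel": "hotel", "car": "car"}},
    "hotel":  {"slots": ["city", "check_in", "check_out"],
               "transitions": {"flight": "flight", "car": "car"}},
    "car":    {"slots": ["city", "pickup_date"],
               "transitions": {"flight": "flight", "hotel": "hotel"}},
}

def next_prompt(active, filled):
    """Prompt for the first empty slot in the active frame, or act when the form is complete."""
    for slot in FRAMES[active]["slots"]:
        if slot not in filled:
            return f"What is the {slot.replace('_', ' ')}?"
    return f"{active} form complete: {filled}"

def update(active, filled, parsed):
    """Fill whatever slots the utterance provided; switch frames on a transition keyword."""
    filled.update(parsed.get("slots", {}))
    keyword = parsed.get("keyword")
    return FRAMES[active]["transitions"].get(keyword, active)

# One turn: the parser found a departure date and the keyword "hotel".
flight_slots = {"destination_city": "Boston"}
frame = update("flight", flight_slots, {"slots": {"departure_date": "Friday"}, "keyword": "hotel"})
print(frame)                      # 'hotel'
print(next_prompt(frame, {}))     # 'What is the city?'
```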
Some problems
• IVR systems work great, but only for well-structured (and "shallow") tasks
• Frame systems are good for "tasks" that correspond to a single form leading to an action
• Neither approach does well with more complex problem-solving activities
Dialog Systems
• Problem-solving activity; complex task
• The order of progression through the task depends on user goals (which can change) and system state (e.g., a back-end retrieval) and is not predictable
• Track progress and help the task along
  • mixed-initiative dialog
• Discourse phenomena
  • Users expect to "converse" with the system
Carnegie Mellon Communicator
• A dialog system that supports complex problem solving in a travel planning domain
  • create an itinerary using air schedule, hotel and car information
  • 186 U.S. airports (>140k enplanements/yr)
  • currently: >500 world airports
• Web-based data resources
  • Live and cached flight information
  • Airport, airline, etc. information
Value schema/handlers
• [Figure: inside a domain agent, receptors feed a transform, which produces a value.]
Compound schema
• [Figure: inside a domain agent, a transform combines several values (Value_1, Value_2, Value_3) into a single compound value, e.g., an SQL query.]
Schema ordering
• [Figure: a Flight Leg schema combines the destination airport, date, and time values; a database lookup returns the available flights. In general, schemas i, j, k each produce a value, and a transform combines those values into a single value.]
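As a sketch of how a compound schema's transform might turn individual values into a single back-end query (the table and column names are hypothetical, and real code would use parameterized queries rather than string interpolation):

```python
def flight_leg_transform(values):
    """Combine individual schema values into one compound value: an SQL query."""
    required = ("destination_airport", "date")
    missing = [name for name in required if name not in values]
    if missing:
        return None, missing      # the dialog manager should prompt for these first

    query = (
        "SELECT * FROM flights "
        f"WHERE destination = '{values['destination_airport']}' "
        f"AND departure_date = '{values['date']}'"
    )
    if "time" in values:
        query += f" AND departure_time >= '{values['time']}'"
    return query, []

query, _ = flight_leg_transform({"destination_airport": "BOS", "date": "2000-03-03", "time": "09:00"})
print(query)   # SELECT * FROM flights WHERE destination = 'BOS' AND departure_date = ...
```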
Carnegie Mellon Communicator
• CMU Communicator
• Call: 268-5144
• The information is accurate; you can use it for your own travel planning...
User-aware speech interfaces
• Predictable behavior on the system's part
• Users communicate at different levels
• http://www.speech.cs.cmu.edu/air/papers/InterfaceChars.html
User-aware speech interfaces
• Content: task-centric utterances
• Possibility: What can I do?
• Orientation: Where are we?
• Navigation: moving through the task space
• Control: verbose/terse, listen!
• Customization: define this word
Speech interface guidelines
• Speech recognition is errorful
• System state is often opaque to the user
• http://www.speech.cs.cmu.edu/air/papers/SpInGuidelines/SpInGuidelines.html
Interface guidelines
• State transparency
• Input control
• Error recovery
  • Error detection
  • Error correction
• Log performance
• Application integration
Summary
• Speech and language communication
• Dialog structure
• Interface design