Speech recognition, understanding and conversational interfaces


Alexander Rudnicky

School of Computer Science

http://www.cs.cmu.edu/~air


Outline

  • Speech

  • Types of speech interfaces

  • Speech systems and their structure

  • Designing speech interfaces

  • Some applications

    • SpeechWear

    • Communicator


Speech as a signal

  • The difference between speech and sound

    • “CD” quality vs. intelligible quality

      • high-quality audio is sampled at 44.1 / 48 kHz

      • desirable speech bandwidth: 0-8 kHz, 16 bits

        • sampled at 16 kHz with 16 bits/sample: 256 kbps (tethered mic)

        • telephone: 64 kbps (and lower)

    • Compression:

      • MPEG: 64kbps/channel and up (but not speech-optimal)

      • CELP: 16kbps … 2.4kbps (optimized for speech)
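
The uncompressed bit rates on this slide follow from sampling rate times bits per sample. A minimal sketch of that arithmetic, assuming 16 kHz sampling for a 0-8 kHz bandwidth (Nyquist) and 8 kHz, 8-bit samples for telephone speech; the function name is illustrative, not from the slides:

```python
def pcm_bit_rate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Uncompressed PCM bit rate in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


# 0-8 kHz bandwidth needs 16 kHz sampling (Nyquist); 16 bits per sample.
wideband = pcm_bit_rate_kbps(16_000, 16)
# Telephone speech: assumed 8 kHz sampling, 8 bits per sample.
telephone = pcm_bit_rate_kbps(8_000, 8)
# CD audio for comparison: 44.1 kHz, 16 bits, two channels.
cd = pcm_bit_rate_kbps(44_100, 16, channels=2)

print(wideband, telephone, cd)
```

The first two values match the 256 kbps and 64 kbps figures on the slide; the CELP range of 16 to 2.4 kbps is what remains after speech-specific compression.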


Speech for communication

  • The difference between speech and language

  • Speech recognition and speech understanding


Computers and speech

  • Transcription

    • dictation, information retrieval

  • Command and control

    • data entry, device control, navigation

  • Information access

    • airline schedules, stock quotes

  • Problem solving

    • travel planning, logistics


Speech system architecture

  • SIGNAL PROCESSING

  • DECODING

  • UNDERSTANDING

  • DISCOURSE

  • ACTION
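
The five stages above can be read as a chain in which each stage consumes the output of the previous one. The sketch below only illustrates that chaining: every function body is a toy stand-in (the "signal" is already text), and `LAST_OBJECT`, `STAGES` and `run` are hypothetical names, not part of any system described in the slides.

```python
from typing import Callable, Dict, List

# Hypothetical context: the object most recently mentioned in the dialog.
LAST_OBJECT = "window 2"


def signal_processing(raw: str) -> List[str]:
    # Stand-in for feature extraction; the "signal" here is already text.
    return raw.lower().split()


def decoding(frames: List[str]) -> str:
    # Stand-in for the recognizer: turn "frames" into a word string.
    return " ".join(frames)


def understanding(words: str) -> Dict[str, str]:
    # Stand-in for parsing: the first word is the verb, the rest its object.
    verb, _, obj = words.partition(" ")
    return {"verb": verb, "object": obj}


def discourse(meaning: Dict[str, str]) -> Dict[str, str]:
    # Stand-in for context handling: resolve "it" to the last mentioned object.
    resolved = dict(meaning)
    if resolved["object"] == "it":
        resolved["object"] = LAST_OBJECT
    return resolved


def action(meaning: Dict[str, str]) -> str:
    # Stand-in for the back-end call: describe the action to perform.
    return f"{meaning['verb'].upper()}({meaning['object']})"


STAGES: List[Callable] = [signal_processing, decoding, understanding, discourse, action]


def run(raw: str):
    """Pass the input through every stage in order."""
    result = raw
    for stage in STAGES:
        result = stage(result)
    return result


print(run("Close it"))
```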


Varieties of speech systems


A generic speech system

  • Components (block diagram): Signal processing, Decoder, Parser, Post parser, Dialog manager, Domain agents (three shown), Language Generator, Speech synthesizer

  • Input: speech

  • Outputs: speech, display, effector


Decoding speech

  • Signal processing

    • Reduce dimensionality of signal

    • noise conditioning

  • Decoder

    • Transcribe speech to words

    • Acoustic models, Language models

  • Corpus-based statistical models
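
The slide names acoustic models and language models but does not say how the decoder combines them. A common approach, assumed here, is to pick the word sequence W that maximizes log P(A|W) + log P(W). The hypotheses and probabilities below are made up for illustration.

```python
import math
from typing import Dict


def decode(acoustic_logp: Dict[str, float], lm_logp: Dict[str, float],
           lm_weight: float = 1.0) -> str:
    """Return the hypothesis maximizing acoustic score + weighted LM score."""
    return max(acoustic_logp,
               key=lambda w: acoustic_logp[w] + lm_weight * lm_logp[w])


# Made-up scores: the acoustic model slightly prefers the wrong hypothesis,
# the language model strongly prefers the right one.
acoustic = {
    "recognize speech": math.log(0.40),
    "wreck a nice beach": math.log(0.45),
}
language = {
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.0001),
}

best = decode(acoustic, language)
acoustic_only = decode(acoustic, language, lm_weight=0.0)
print(best, "|", acoustic_only)
```

With the language model switched off (`lm_weight=0.0`), the acoustically closer but less plausible hypothesis wins, which is why the decoder draws on both models.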


Creating models for recognition

  • Speech data → Transcribe* → Train → Acoustic models

  • Text data → Train → Language models
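
As a rough illustration of the text-data path on this slide (text data, train, language models), the sketch below estimates an unsmoothed bigram model from a tiny made-up corpus. The slides do not specify the model type; the bigram choice, the corpus and the function name are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def train_bigram_lm(sentences: List[str]) -> Dict[Tuple[str, str], float]:
    """Maximum-likelihood bigram probabilities P(next | previous)."""
    history_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        # Count each word as a history, and each adjacent pair as a bigram.
        history_counts.update(words[:-1])
        bigram_counts.update(zip(words[:-1], words[1:]))
    return {(prev, nxt): n / history_counts[prev]
            for (prev, nxt), n in bigram_counts.items()}


corpus = [
    "show flights to boston",
    "show flights to denver",
    "show fares to boston",
]
lm = train_bigram_lm(corpus)
print(lm[("to", "boston")], lm[("<s>", "show")])
```

A practical model would add smoothing so that unseen word pairs do not receive zero probability.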


Understanding speech

  • Parser

    • Extract semantic content from utterance

    • Resource: Grammar (ontology design, language acquisition)

  • Post parser

    • Introduce context and world knowledge into interpretation

    • Resources: Context, Domain Agents (grounding, knowledge engineering)


Interacting with the user

  • Dialog manager

    • Guide interaction through task

    • Map user inputs and system state into actions

    • Resources: Task schemas (task analysis), Context

  • Domain agent (three shown)

    • Interact with back-end(s)

    • Interpret information using domain knowledge

    • Back-ends: Database, Live data (e.g. Web), Domain expert (knowledge engineering)


Communicating with the user

  • Language Generator

    • Decide what to say to user (and how to phrase it)

  • Speech synthesizer

  • Display Generator

  • Action Generator


Speech recognition and understanding

  • Sphinx system

    • speaker-independent

    • continuous speech

    • large vocabulary

  • ATIS system

    • air travel information retrieval

    • context management

  • film clip


Command and control systems

  • Small vocabularies, fixed syntax

    • OPEN WINDOW <window_id>

    • MOVE OBJECT <object_id> to <position>

    • Applications:

      • data entry (e.g., zip codes), process control (e.g., electron microscope, darkroom equipment)

  • Large vocabulary, fixed syntax

    • Web browsing (?)
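
Fixed-syntax commands such as the two patterns above can be matched with a small set of templates. A minimal sketch, assuming the recognizer has already produced a word string and that each slot is a single token; the command names and the regular expressions are illustrative:

```python
import re
from typing import Dict, Optional

# One regular expression per command template; named groups are the slots.
COMMANDS = [
    ("open_window", re.compile(r"^open window (?P<window_id>\w+)$")),
    ("move_object",
     re.compile(r"^move object (?P<object_id>\w+) to (?P<position>\w+)$")),
]


def parse_command(utterance: str) -> Optional[Dict[str, str]]:
    """Return the command name and slot values, or None if nothing matches."""
    text = " ".join(utterance.lower().split())  # normalize case and spacing
    for name, pattern in COMMANDS:
        match = pattern.match(text)
        if match:
            return {"command": name, **match.groupdict()}
    return None


print(parse_command("OPEN WINDOW 3"))
print(parse_command("move object lamp to corner"))
print(parse_command("please open the window"))
```

Anything outside the fixed syntax is rejected, which keeps recognition simple but requires users to learn the exact phrasing.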


SpeechWear

  • Vehicle inspection task

    • USMC mechanics, fixed inspection form

    • Wearable computer (COTS components)

    • html-based task representation

  • film clip


Information access

  • Moderate to very large vocabulary

    • IVR and frame based systems

  • Commercial systems:

    • Nuance: http://www.nuance.com/demo/index.html

    • SpeechWorks: http://www.speechworks.com/demos/demos.htm

    • many others


IVR and frame-based systems

  • Interactive voice response (IVR)

    • interactions specified by a graph (typically a tree)

  • Frame systems

    • ergodic graphs

    • states defined by multi-item forms


Graph-based systems

Welcome to Bank ABC!

Please say one of the following:

Balance, Hours, Loan, ...

What type of loan are you interested in?

Please say one of the following:

Mortgage, Car, Personal, ...

. . . .
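
The bank menu above can be represented as a graph whose edges are labeled with the keywords the caller may say. A minimal sketch, assuming a dictionary-based menu and only the options named on the slide; node names and `run_ivr` are hypothetical:

```python
from typing import Dict, List

# Each node has a prompt and a map from spoken keyword to the next node.
# Nodes not listed here (e.g. "balance") are treated as leaves.
MENU: Dict[str, dict] = {
    "root": {
        "prompt": "Welcome to Bank ABC! Please say one of the following: "
                  "Balance, Hours, Loan",
        "next": {"balance": "balance", "hours": "hours", "loan": "loan"},
    },
    "loan": {
        "prompt": "What type of loan are you interested in? Please say one "
                  "of the following: Mortgage, Car, Personal",
        "next": {"mortgage": "mortgage", "car": "car", "personal": "personal"},
    },
}


def run_ivr(menu: Dict[str, dict], replies: List[str],
            start: str = "root") -> List[str]:
    """Walk the menu graph and return the names of the nodes visited."""
    node = start
    path = [node]
    for reply in replies:
        options = menu.get(node, {}).get("next", {})
        target = options.get(reply.strip().lower())
        if target is None:
            # Unrecognized reply or leaf node: stay put (a real system re-prompts).
            continue
        node = target
        path.append(node)
    return path


print(run_ivr(MENU, ["Loan", "Car"]))
```

The caller can only move along edges the designer anticipated, which is what makes the interaction predictable and also what limits it to shallow tasks.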


Frame-based systems

Destination_City: Boston

Departure_Date: ______

Departure_Time: ______

Preferred_Airline: ______

. . .

  • I would like to fly to Boston

    • I’d like to go to Boston on Friday, …

  • When would you like to fly?
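
The exchange above can be sketched as slot filling: each utterance fills whatever slots it mentions, and the system then asks about the first slot that is still empty. The slot names and the prompt "When would you like to fly?" come from the slide; the other prompts, the regular expressions, and the small city list are assumptions made for illustration.

```python
import re
from typing import Dict, Optional

SLOTS = ["Destination_City", "Departure_Date", "Departure_Time", "Preferred_Airline"]

PROMPTS = {
    "Destination_City": "Where would you like to fly?",      # hypothetical wording
    "Departure_Date": "When would you like to fly?",          # from the slide
    "Departure_Time": "What time would you like to leave?",   # hypothetical wording
    "Preferred_Airline": "Which airline do you prefer?",      # hypothetical wording
}

# Toy extractors; a real system would use the parser's output instead.
PATTERNS = {
    "Destination_City": re.compile(r"\bto (boston|denver|pittsburgh)\b", re.I),
    "Departure_Date": re.compile(
        r"\bon (monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b", re.I),
    "Departure_Time": re.compile(r"\bat (\d{1,2}(?::\d{2})? ?(?:am|pm))\b", re.I),
}


def new_frame() -> Dict[str, Optional[str]]:
    return {slot: None for slot in SLOTS}


def update_frame(frame: Dict[str, Optional[str]], utterance: str) -> None:
    """Fill every slot that the utterance mentions, in any order."""
    for slot, pattern in PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            frame[slot] = match.group(1)


def next_prompt(frame: Dict[str, Optional[str]]) -> Optional[str]:
    """Ask about the first empty slot; None means the form is complete."""
    for slot in SLOTS:
        if frame[slot] is None:
            return PROMPTS[slot]
    return None


frame = new_frame()
update_frame(frame, "I would like to fly to Boston")
print(frame["Destination_City"], "->", next_prompt(frame))
```

Unlike a menu tree, the user may supply several items at once (for example the city and the day), and the system skips the prompts it no longer needs.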


Frame-based systems

  • Multiple forms, each with its own set of slots (the diagram shows five forms with placeholder slot names)

  • Transition between forms on keyword or phrase


Some problems

  • IVR systems work great, but only for well-structured (& “shallow”) tasks

  • Frame systems are good for “tasks” that correspond to a single form leading to an action

  • Neither approach does well with more complex problem-solving activities


Dialog Systems

  • Problem solving activity; complex task

    • Order of progression through task depends on user goals (which can change) and system state (a back-end retrieval) and is not predictable.

  • Track progress and help task along

    • mixed-initiative dialog

  • Discourse phenomena

    • Users expect to “converse” with the system


Carnegie Mellon Communicator

  • A dialog system that supports complex problem solving in a travel planning domain

    • create an itinerary using air schedule, hotel and car information

    • 186 U.S. airports (>140k enplanements/yr)

      • currently: >500 world airports

  • Web-based data resources

    • Live and cached flight information

    • Airport, airline, etc. information


Value schema/handlers

  • Diagram labels: receptors, transform, value, Domain Agent


Compound schema

  • Diagram labels: Value_1, Value_2, Value_3, +, transform (e.g. SQL query), value, Domain Agent


Schema ordering

  • Example: Destination airport, Date, Time → Flight Leg → Database lookup → Available flights

  • General form: Schema i (Value i), Schema j (Value j), Schema k (Value k) → transform → Value
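
The schema slides survive only as diagram labels, so the following is one possible reading rather than the Communicator implementation: a compound schema collects component values (destination airport, date, time) and, once all are present, a transform turns them into a single value such as an SQL query. The class, the table name and the column names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Query = Tuple[str, Tuple[str, ...]]


@dataclass
class CompoundSchema:
    name: str
    components: List[str]
    transform: Callable[[Dict[str, str]], Query]
    values: Dict[str, str] = field(default_factory=dict)

    def receive(self, component: str, value: str) -> None:
        """Store one component value; components may arrive in any order."""
        if component not in self.components:
            raise KeyError(component)
        self.values[component] = value

    def missing(self) -> List[str]:
        return [c for c in self.components if c not in self.values]

    def value(self) -> Optional[Query]:
        """Apply the transform only when every component has a value."""
        if self.missing():
            return None
        return self.transform(self.values)


def flight_leg_query(v: Dict[str, str]) -> Query:
    # Parameterized query; table and column names are made up.
    sql = "SELECT * FROM flights WHERE dest = ? AND date = ? AND time >= ?"
    return sql, (v["Destination airport"], v["Date"], v["Time"])


leg = CompoundSchema("Flight Leg", ["Destination airport", "Date", "Time"],
                     flight_leg_query)
leg.receive("Destination airport", "BOS")
print(leg.missing())   # components the dialog still has to obtain
leg.receive("Date", "Friday")
leg.receive("Time", "09:00")
print(leg.value())
```

Under this reading, the list returned by `missing()` is what would drive the dialog: the system asks for whichever component values the next transform still needs.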


Carnegie Mellon Communicator

  • CMU Communicator

    • Call: 268-5144

    • the information is accurate; you can use it for your own travel planning...


User-aware speech interfaces

  • Predictable behavior on the system’s part

  • Users communicate at different levels

  • http://www.speech.cs.cmu.edu/air/papers/InterfaceChars.html


User-aware speech interfaces

  • Content: task-centric utterances

  • Possibility: What can I do?

  • Orientation: Where are we?

  • Navigation: moving through the task space

  • Control: verbose/terse, listen!

  • Customization: define this word


Speech interface guidelines

  • Speech recognition is errorful

  • System state is often opaque to the user

  • http://www.speech.cs.cmu.edu/air/papers/SpInGuidelines/SpInGuidelines.html


Interface guidelines

  • State transparency

  • Input control

  • Error recovery

  • Error detection

  • Error correction

  • Log performance

  • Application integration


Summary

  • Speech and language communication

  • Dialog structure

  • Interface design

