Multi-Modal Dialogue in Personal Navigation Systems

By Arthur Chan


Introduction

  • The term “multi-modal”

    • A general description of an application that can be operated in multiple input/output modes.

    • E.g.:

      • Input: voice, pen, gesture, facial expression.

      • Output: voice, graphical output.

Multi-Modal Dialogue (MMD) in Personal Navigation Systems

  • Motivation of this presentation

    • Navigation systems that provide MMD are

      • an interesting scenario

      • a case for why MMD is useful

  • Structure of this presentation

    • Three system papers:

      • AT&T MATCH

        • speech and pen input, including pen gestures

      • SpeechWorks Walking Directions System

        • speech and stylus input

      • Saarland University REAL

        • speech and pen input

        • both GPS and a magnetic tracker were used

Overall Function

  • A working city guide and navigation system

    • Easy access to restaurant and subway information

  • Runs on a Fujitsu pen computer

  • Users are free to

    • give speech commands

    • draw on the display with a stylus

Types of Inputs

  • Speech Input

    • “Show cheap Italian restaurants in Chelsea”

  • Simultaneous Speech and Pen Input

    • Circle an area

    • Say “show cheap Italian restaurants in neighborhood” at the same time

  • Functionalities include

    • Review

    • Subway routes

Input Overview

  • Speech Input

    • Uses the AT&T Watson speech recognition engine

  • Pen Input (electronic ink)

    • Allows the use of pen gestures

    • Pen input can be complex (multiple strokes and gestures)

      • Special aggregation techniques are used to combine these gestures

  • Inputs are combined using lattice combination
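
An illustrative sketch of the idea (not MATCH's actual algorithm, which operates on full lattices with finite-state methods): simplify each lattice to an n-best list and rank joint hypotheses by a weighted combination of the two scores. All hypothesis strings, scores, and the weight below are invented.

```python
# Toy combination of speech and gesture n-best lists by weighted joint score.
# A real system would intersect full recognition lattices instead.

def combine_nbest(speech_nbest, gesture_nbest, speech_weight=0.6):
    """Each input is a list of (hypothesis, score) pairs with scores in [0, 1].
    Returns joint (speech, gesture) hypotheses ranked by weighted score."""
    gesture_weight = 1.0 - speech_weight
    joint = []
    for s_hyp, s_score in speech_nbest:
        for g_hyp, g_score in gesture_nbest:
            score = speech_weight * s_score + gesture_weight * g_score
            joint.append(((s_hyp, g_hyp), score))
    return sorted(joint, key=lambda pair: pair[1], reverse=True)

speech = [("show cheap italian restaurants in neighborhood", 0.8),
          ("show cheap italian restaurants in manhattan", 0.5)]
gesture = [("circle:chelsea", 0.9), ("circle:soho", 0.3)]
best, best_score = combine_nbest(speech, gesture)[0]
```

Here the circled area resolves the spoken word “neighborhood”, which is the point of combining the two modalities.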

Pen Gesture and Speech Input

  • For example:

    • U: “How do I get to this place?”

      • <user circles one of the restaurants displayed on the map>

    • S: “Where do you want to go from?”

    • U: “25th St & 3rd Avenue”

      • <user writes “25th St & 3rd Avenue”>

  • <System computes the shortest route>
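
The papers do not spell out the routing method; as a sketch, shortest routes on a street graph are commonly computed with Dijkstra's algorithm. The intersection names and distances below are invented for illustration.

```python
# Illustrative shortest-route computation on a toy street graph
# using Dijkstra's algorithm with a priority queue.
import heapq

def shortest_route(graph, start, goal):
    """graph: {node: [(neighbor, distance), ...]}. Returns (cost, path)."""
    queue = [(0.0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, dist in graph.get(node, []):
            if neighbor not in seen:
                heapq.heappush(queue, (cost + dist, neighbor, path + [neighbor]))
    return float("inf"), []

streets = {
    "25th&3rd": [("25th&Lex", 1.0), ("26th&3rd", 1.0)],
    "25th&Lex": [("restaurant", 3.0)],
    "26th&3rd": [("26th&Lex", 1.0)],
    "26th&Lex": [("restaurant", 1.0)],
}
cost, path = shortest_route(streets, "25th&3rd", "restaurant")
```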

Summary

  • Interesting aspects of the system

    • Illustrates a real-life scenario where multi-modal inputs can be used

    • Design issue:

      • how should different inputs be used together?

    • Algorithmic issue:

      • how should different inputs be combined?

Overview

  • Work by SpeechWorks

    • Jointly conducted by speech recognition and user interface teams

    • Two distinct elements

      • Speech recognition

        • In an embedded domain, which speech recognition paradigm should be used?

          • embedded speech recognition?

          • network speech recognition?

          • distributed speech recognition?

      • User interface

        • How to “situationalize” the application?

Overall Function

  • Walking Directions Application

    • Assumes the user is walking in an unknown city

    • Runs on a Compaq iPAQ 3765 Pocket PC

    • Users can

      • select a city and start/end addresses

      • display a map

      • control the display

      • display directions

      • display interactive directions in the form of a list of steps

    • Accepts speech input and stylus input

      • no pen gestures

Choice of Speech Recognition Paradigm

  • Embedded speech recognition

    • Only simple commands can be used, due to on-device computation limits

  • Network speech recognition

    • Requires bandwidth

    • The network connection may occasionally be cut off

  • Distributed speech recognition

    • The client computes the acoustic front end

    • The server performs the decoding

    • Issue: higher code complexity
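
A minimal sketch of the distributed split described above: the client reduces raw audio to compact per-frame features and ships only those to the server, which runs the decoder. Per-frame log energy stands in for real features such as MFCCs; the frame size, threshold, and stand-in "decoder" are illustrative assumptions, not SpeechWorks' design.

```python
# Toy distributed-speech-recognition split: client-side front end,
# server-side decoding.
import math

FRAME = 160  # samples per frame, e.g. 10 ms at 16 kHz

def client_front_end(samples):
    """Client side: reduce raw audio to one log-energy value per frame
    (far less bandwidth than streaming the audio itself)."""
    features = []
    for i in range(0, len(samples) - FRAME + 1, FRAME):
        frame = samples[i:i + FRAME]
        energy = sum(x * x for x in frame) / FRAME
        features.append(math.log(energy + 1e-10))
    return features

def server_decode(features, threshold=-5.0):
    """Server side: a stand-in for the real decoder -- here it merely
    flags which frames contain speech-like energy."""
    return [f > threshold for f in features]

audio = [0.0] * 160 + [0.5] * 160   # one silent frame, one loud frame
feats = client_front_end(audio)
speech_frames = server_decode(feats)
```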

User Interface

  • Situationalization

    • Potential scenarios

      • Sitting at a desk

      • Getting out of a cab, building, or subway and preparing to walk somewhere

      • Walking somewhere hands-free

      • Walking somewhere carrying things

      • Driving somewhere in heavy traffic

      • Driving somewhere in light traffic

      • Being a passenger in a car

      • Being in a highly noisy environment

Their Conclusion

  • Balance of audio and visual information

    • Can be reduced to 4 complementary components

      • Single-modal

        • 1. Visual mode

        • 2. Audio mode

      • Multi-modal

        • 3. Visual dominant

        • 4. Audio dominant
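
One way to picture this classification is as a mapping from the situationalization scenarios to one of the four audio/visual components. The assignments below are my illustrative guesses, not the paper's findings:

```python
# Hypothetical scenario -> output-mode mapping; the choices are
# illustrative, not taken from the SpeechWorks paper.
SCENARIO_MODE = {
    "sitting at a desk":         "visual-dominant",
    "preparing to walk":         "visual-dominant",
    "walking hands-free":        "audio-dominant",
    "walking carrying things":   "audio-dominant",
    "driving in heavy traffic":  "audio",            # eyes must stay on the road
    "driving in light traffic":  "audio-dominant",
    "passenger in a car":        "visual-dominant",
    "highly noisy environment":  "visual",           # audio is unintelligible
}

def choose_output_mode(scenario):
    # Fall back to visual-dominant when the situation is unknown.
    return SCENARIO_MODE.get(scenario, "visual-dominant")
```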

Summary

  • Interesting aspects

    • Great discussion of

      • how speech recognition can be used in an embedded domain

      • how users would use the dialogue application

Overview

  • Pedestrian Navigation System

    • Two components:

      • IRREAL: indoor navigation system

        • uses a magnetic tracker

      • ARREAL: outdoor navigation system

        • uses GPS

Speech Input/Output

  • Speech Input:

    • HTK, IBM ViaVoice Embedded, and Logox were being evaluated

  • Speech Output:

    • Festival

Visual Output

  • Both 2D and 3D spatialization are supported

Interesting Aspects

  • Tailoring the system for elderly people

    • Speaker clustering

      • to improve the recognition rate for elderly speakers

    • Model selection

      • choose between two models based on likelihood

        • elderly models

        • normal adult models
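
A minimal sketch of likelihood-based model selection: score the same features under both models and decode with whichever fits better. Single Gaussians stand in for the real acoustic models, and the means, variances, and feature values are invented:

```python
# Pick between "elderly" and "normal adult" models by log-likelihood.
# Toy 1-D Gaussians stand in for full acoustic models.
import math

def log_likelihood(features, mean, var):
    """Total log-likelihood of 1-D features under a Gaussian model."""
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in features)

MODELS = {
    "elderly": (1.8, 0.25),        # hypothetical (mean, variance)
    "normal_adult": (1.0, 0.25),
}

def select_model(features):
    """Return the name of the model that fits the features best."""
    return max(MODELS, key=lambda name: log_likelihood(features, *MODELS[name]))

picked = select_model([1.7, 1.9, 1.6])   # closer to the elderly model's mean
```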

Conclusion

  • Aspects of multi-modal dialogue

    • What kinds of inputs should be used?

    • How can speech and other inputs be combined and interact?

    • How will users use the system?

    • How should the system respond to the users?