
Multi-Modal Dialogue in Personal Navigation Systems

Arthur Chan

Introduction


  • The term “multi-modal”

    • General description of an application that could be operated in multiple input/output modes.

    • E.g.:

      • Input: voice, pen, gesture, facial expression

      • Output: voice, graphical output

Multi-modal Dialogue (MMD) in Personal Navigation System

  • Motivation of this presentation

    • Navigation systems that provide MMD are

      • an interesting scenario

      • a case for why MMD is useful

  • Structure of this presentation

    • 3 system papers

      • AT&T MATCH

        • speech and pen input with pen gesture

      • Speechworks Walking Directions System

        • speech and stylus input

      • Univ. of Saarland REAL

        • Speech and pen input

        • Both GPS and a magnetic tracker were used.

Multi-modal Language Processing for Mobile Information Access

Overall Function

  • A working city guide and navigation system

    • Easy access to restaurant and subway information

  • Runs on a Fujitsu pen computer

  • Users are free to

    • give speech command

    • draw on display with stylus

Types of Inputs

  • Speech Input

    • “show cheap italian restaurants in chelsea”

  • Simultaneous Speech and Pen Input

    • Circle and area

    • Say “show cheap italian restaurants in neighborhood” at the same time.

  • Functionalities include

    • Review

    • Subway routing

Input Overview

  • Speech Input

    • Use AT&T Watson speech recognition engine

  • Pen Input (electronic ink)

    • Allows the use of pen gestures.

    • Pen input can be complex

      • Special aggregation techniques are used for such gestures.

  • Inputs would be combined using lattice combination.
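The combination step can be illustrated with a toy n-best sketch (my own simplification, with made-up hypotheses and scores; the actual system combines full weighted lattices, not n-best lists): each modality produces scored hypotheses, and compatible speech/gesture pairs are merged by adding their log-scores.

```python
# Toy multi-modal combination: score each compatible speech/gesture pair
# by summing the per-modality log-scores, then rank the joint hypotheses.
# (A simplification of lattice combination, which merges whole lattices.)

def combine(speech_nbest, gesture_nbest, compatible):
    """Return joint hypotheses ranked by combined log-score."""
    joint = []
    for s_hyp, s_score in speech_nbest:
        for g_hyp, g_score in gesture_nbest:
            if compatible(s_hyp, g_hyp):
                joint.append(((s_hyp, g_hyp), s_score + g_score))
    return sorted(joint, key=lambda pair: pair[1], reverse=True)

# Hypothetical n-best lists: a deictic "here" must bind to a map gesture.
speech = [("show restaurants here", -1.0), ("show restrooms here", -2.5)]
gesture = [("circle:chelsea", -0.5), ("point:restaurant_3", -1.75)]
ranked = combine(speech, gesture, lambda s, g: "here" in s)
```

Here the top joint hypothesis pairs the best speech string with the best gesture; in the real system the combination would also resolve which map entity the gesture denotes.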

Pen Gesture and Speech Input

  • For example:

    • U: “How do I get to this place?”

      • <user circled one of the restaurants displayed on the map>

    • S: “Where do you want to go from?”

    • U: “25th St & 3rd Avenue”

      • <user writes 25th St & 3rd Avenue>

  • <System computes the shortest route>


  • Interesting aspects of the system

    • Illustrates a real-life scenario where multi-modal inputs could be used

    • Design issue:

      • How should different inputs be used together?

    • Algorithmic issue:

      • How should different inputs be combined?

Multi-modal Spoken Dialog with Wireless Devices


  • Work by Speechworks

    • Jointly conducted by speech recognition and user interface folks

    • Two distinct elements

      • Speech recognition

        • In an embedded domain, which speech recognition paradigm should be used?

          • embedded speech recognition?

          • network speech recognition?

          • distributed speech recognition?

      • User interface

        • How to “situationalize” the application?

Overall Function

  • Walking Directions Application

    • Assumes the user is walking in an unfamiliar city

    • Compaq iPAQ 3765 PocketPC

    • Users could

      • Select a city, start-end addresses

      • Display a map

      • Control the display

      • Display directions

      • Display interactive directions as a list of steps

    • Accept speech input and stylus input

      • No pen gestures.

Choice of speech recognition paradigm

  • Embedded speech recognition

    • Only simple commands could be used due to computation limits.

  • Network speech recognition

    • Requires network bandwidth

    • The network connection may sometimes be cut off

  • Distributed speech recognition

    • Client takes care of front-end

    • Server takes care of decoding

    • <Issue: higher complexity of the code>
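The client/server split can be sketched as follows (a minimal illustration using a toy frame-energy feature; the system's actual front-end and wire protocol are not described at this level of detail):

```python
import json

def client_front_end(samples, frame_size=4):
    """Client side: reduce raw audio to compact per-frame features
    (here a toy frame-energy value) so only a small payload is sent."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples), frame_size)]
    features = [sum(x * x for x in frame) for frame in frames]
    return json.dumps(features)  # compact payload shipped to the server

def server_decode(payload):
    """Server side: the expensive decoder would run here; this stub just
    labels each frame as speech or silence by thresholding energy."""
    features = json.loads(payload)
    return ["speech" if energy > 1.0 else "silence" for energy in features]

audio = [0.0, 0.1, 0.0, 0.1, 0.9, 1.0, 0.8, 0.9]  # pretend waveform
labels = server_decode(client_front_end(audio))
```

The point of the split is that feature extraction is cheap enough for the handheld while decoding stays on the server, at the cost of maintaining code on both sides.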

User Interface

  • Situationalization

    • Potential scenario

      • Sitting at a desk

      • Getting out of a cab, building, subway and preparing to walk somewhere

      • Walking somewhere with hands free

      • Walking somewhere carrying things

      • Driving somewhere in heavy traffic

      • Driving somewhere in light traffic

      • Being the passenger in a car

      • Being in a highly noisy environment

Their conclusion

  • Balances of audio and visual information

    • Can be reduced to 4 complementary components

      • Single-modal

        • 1, Visual Mode

        • 2, Audio Mode

      • Multi-modal

        • 3, Visual dominant

        • 4, Audio dominant

A Glance at the UI


  • Interesting aspects

    • Great discussion on

      • how speech recognition could be used in an embedded domain

      • how the user would use the dialogue application

Multi-modal Dialog in a Mobile Pedestrian Navigation System


  • Pedestrian Navigation System

    • Two components:

      • IRREAL: indoor navigation system

        • Use magnetic tracker

      • ARREAL: outdoor navigation system

        • Use GPS

Speech Input/Output

  • Speech Input:

    • HTK, IBM ViaVoice Embedded, and Logox were evaluated

  • Speech Output:

    • Festival

Visual output

  • Both 2D and 3D spatialization supported

Interesting aspects

  • Tailor the system for elderly people

    • Speaker clustering

      • to improve recognition rate for elderly people

    • Model selection

      • Choose from two models based on likelihood

        • Elderly models

        • Normal adult models
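The likelihood-based choice can be sketched as follows (a toy single-Gaussian "model" over a made-up scalar feature; the real system would score full acoustic models for elderly and normal adult speakers):

```python
import math

def log_likelihood(feature, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (feature - mean) ** 2 / var)

def select_model(feature, models):
    """Pick the model under which the observed feature is most likely."""
    return max(models, key=lambda name: log_likelihood(feature, *models[name]))

# Hypothetical (mean, variance) of a pitch-like feature per speaker group.
models = {"elderly": (160.0, 400.0), "adult": (120.0, 400.0)}
chosen = select_model(150.0, models)
```

With equal variances the decision reduces to picking the nearer mean; applying the same comparison per utterance lets the system route each speaker to the better-matched models.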


  • Aspects of multi-modal dialogue

    • What kinds of inputs should be used?

    • How can speech and other inputs be combined and interact?

    • How will users use the system?

    • How should the system respond to the users?
