Multi-Modal Dialogue in Personal Navigation Systems


Arthur Chan

Introduction

  • The term “multi-modal”
    • A general description of an application that can be operated in multiple input/output modes.
    • E.g.,
      • Input: voice, pen, gesture, facial expression.
      • Output: voice, graphical output.
Multi-modal Dialogue (MMD) in Personal Navigation System
  • Motivation of this presentation
    • Navigation System provides MMD
      • an interesting scenario
      • a case for why MMD is useful
  • Structure of this presentation
    • 3 system papers
      • AT&T MATCH
        • speech and pen input with pen gesture
      • Speechworks Walking Direction System
        • speech and stylus input
      • Univ. of Saarland REAL
        • Speech and pen input
        • Both GPS and a magnetic tracker were used.
Overall Function
  • A working city guide and navigation system
    • Easy access to restaurant and subway information
  • Runs on a Fujitsu pen computer
  • Users are free to
    • give speech commands
    • draw on the display with a stylus
Types of Inputs
  • Speech Input
    • “show cheap italian restaurants in chelsea”
  • Simultaneous Speech and Pen Input
    • Circle an area
    • Say “show cheap italian restaurants in this neighborhood” at the same time.
  • Functionalities include
    • Review
    • Subway route
Input Overview
  • Speech Input
    • Uses the AT&T Watson speech recognition engine
  • Pen Input (electronic ink)
    • Allows the use of pen gestures.
    • Pen input can be complex
      • Special aggregation techniques are used for these gestures.
  • Speech and pen inputs are combined using lattice combination.
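
As a rough illustration of how the two input streams might be combined, the sketch below uses small n-best lists instead of full lattices. The hypotheses, the scores, the `<area>` slot marker, and the `combine` function are all illustrative assumptions, not the MATCH implementation. Each speech hypothesis may leave an area slot open, a compatible gesture hypothesis fills it, and the joint score is taken as the product of the two scores.

```python
from itertools import product

# Hypothetical n-best lists: (interpretation, probability-like score).
speech_nbest = [
    ("show cheap italian restaurants in <area>", 0.6),
    ("show cheap italian restaurants in chelsea", 0.3),
]
gesture_nbest = [
    ("area:circle_1", 0.7),   # pen stroke read as a circled region
    ("point:dot_1", 0.2),     # same stroke read as a single point
    (None, 0.1),              # no gesture was intended
]

def combine(speech, gesture):
    """Pair every speech hypothesis with every compatible gesture
    hypothesis and rank the joint interpretations by score product."""
    results = []
    for (words, p_s), (gest, p_g) in product(speech, gesture):
        needs_area = "<area>" in words
        is_area = gest is not None and gest.startswith("area:")
        if needs_area and is_area:
            # The gesture fills the slot left open by speech.
            meaning = words.replace("<area>", gest.split(":", 1)[1])
        elif not needs_area and gest is None:
            # Speech alone is a complete command.
            meaning = words
        else:
            continue  # incompatible pair, e.g. open slot with no area
        results.append((meaning, p_s * p_g))
    return sorted(results, key=lambda r: r[1], reverse=True)

best = combine(speech_nbest, gesture_nbest)
```

Here the top-ranked joint interpretation is the spoken command with the circled region substituted for the open slot; the speech-only reading survives as a lower-scored alternative.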
Pen Gesture and Speech Input
  • For example:
    • U: “How do I get to this place?”
      • <user circles one of the restaurants displayed on the map>
    • S: “Where do you want to go from?”
    • U: “25th St & 3rd Avenue”
      • <user writes 25th St & 3rd Avenue>
    • <System computes the shortest route>
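
The route computation step can be sketched with Dijkstra's algorithm. The slides do not say which algorithm the system uses, so this is only an assumed stand-in, and the street graph, node names, and distances below are made up.

```python
import heapq

# Hypothetical street graph: intersection -> {neighbor: distance in meters}.
graph = {
    "25th & 3rd": {"25th & Park": 300, "23rd & 3rd": 160},
    "25th & Park": {"25th & 3rd": 300, "restaurant": 400},
    "23rd & 3rd": {"25th & 3rd": 160, "restaurant": 450},
    "restaurant": {"25th & Park": 400, "23rd & 3rd": 450},
}

def shortest_route(graph, start, goal):
    """Dijkstra's algorithm; returns (total distance, list of nodes)."""
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, step in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (dist + step, neighbor, path + [neighbor]))
    return float("inf"), []  # goal not reachable from start

distance, route = shortest_route(graph, "25th & 3rd", "restaurant")
```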
  • Interesting aspects of the system
    • Illustrates a real-life scenario in which multi-modal inputs can be used
    • Design issue:
      • How should different inputs be used together?
    • Algorithmic issue:
      • How should different inputs be combined?
  • Work by Speechworks
    • Jointly conducted by speech recognition and user interface folks
    • Two distinct elements
      • Speech recognition
        • In an embedded domain, which speech recognition paradigm should be used?
          • embedded speech recognition?
          • network speech recognition?
          • distributed speech recognition?
      • User interface
        • How to “situationalize” the application?
Overall Function
  • Walking Directions Application
    • Assumes the user is walking in an unfamiliar city
    • Compaq iPAQ 3765 PocketPC
    • Users could
      • Select a city, start-end addresses
      • Display a map
      • Control the display
      • Display directions
      • Display interactive directions in the form of a list of steps.
    • Accepts speech input and stylus input
      • Pen gestures are not supported.
Choice of speech recognition paradigm
  • Embedded speech recognition
    • Only simple commands can be used because of computation limits.
  • Network speech recognition
    • Bandwidth is required
    • The network connection is sometimes cut off
  • Distributed speech recognition
    • Client takes care of front-end
    • Server takes care of decoding
    • Issue: higher code complexity.
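
The distributed split can be sketched as two functions: a client-side front-end that turns samples into compact per-frame features, and a server-side decoder that consumes them. Everything here is an illustrative assumption. The log-energy feature stands in for a real front-end, and the threshold "decoder" stands in for a real recognizer.

```python
import math

def client_front_end(samples, frame_len=4):
    """Runs on the device: reduce raw samples to one log-energy
    value per frame, so only features cross the network."""
    features = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        features.append(math.log(energy + 1e-10))
    return features

def server_decode(features, threshold=-2.0):
    """Runs on the server: a toy stand-in for decoding that labels
    each frame as speech or silence by its log-energy."""
    return ["speech" if f > threshold else "silence" for f in features]

# Toy signal: four quiet samples followed by four loud ones.
samples = [0.01, -0.01, 0.01, -0.01, 0.8, -0.7, 0.9, -0.8]
features = client_front_end(samples)   # this is what would be transmitted
labels = server_decode(features)
```

Even in this toy form, the two halves must agree on frame length and feature definition, which hints at where the extra code complexity comes from.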
User Interface
  • Situationalization
    • Potential scenario
      • Sitting at a desk
      • Getting out of a cab, building, subway and preparing to walk somewhere
      • Walking somewhere with hands free
      • Walking somewhere carrying things
      • Driving somewhere in heavy traffic
      • Driving somewhere in light traffic
      • Being the passenger in a car
      • Being in a highly noisy environment.
Their conclusion
  • Balance of audio and visual information
    • Could be reduced to 4 complementary components
      • Single-modal
        • 1. Visual mode
        • 2. Audio mode
      • Multi-modal
        • 3. Visual dominant
        • 4. Audio dominant
  • Interesting aspects
    • Great discussion on
      • how speech recognition could be used in an embedded domain
      • how the user would use the dialogue application
  • Pedestrian Navigation System
    • Two components:
      • IRREAL: indoor navigation system
        • Uses a magnetic tracker
      • ARREAL: outdoor navigation system
        • Uses GPS
Speech Input/Output
  • Speech Input:
    • HTK / IBM ViaVoice Embedded and Logox were being evaluated
  • Speech Output:
    • Festival
Visual output
  • Both 2D and 3D spatialization are supported
Interesting aspects
  • Tailor the system for elderly people
    • Speaker clustering
      • to improve recognition rate for elderly people
    • Model selection
      • Choose from two models based on likelihood
        • Elderly models
        • Normal adult models
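
Likelihood-based model selection can be sketched as follows. The one-dimensional Gaussian models, their parameters, and the feature values are invented for illustration; a real system would score each utterance with full acoustic models rather than a single Gaussian.

```python
import math

def log_likelihood(frames, mean, var):
    """Sum of per-frame Gaussian log-densities."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )

# Hypothetical models: name -> (mean, variance) of a single feature.
models = {
    "elderly": (1.0, 0.5),
    "normal_adult": (3.0, 0.5),
}

def select_model(frames, models):
    """Return the name of the model under which the frames are most likely."""
    return max(models, key=lambda name: log_likelihood(frames, *models[name]))

utterance = [0.8, 1.3, 1.1, 0.7]   # made-up feature values
chosen = select_model(utterance, models)
```

The selected model would then be used to decode the utterance, so that elderly speakers are not scored against models trained only on other adults.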
  • Aspects of multi-modal dialogue
    • What kinds of input should be used?
    • How can speech and other inputs be combined and interact?
    • How will users use the system?
    • How should the system respond to users?