
Multimodal Dialog

Intelligent Robot Lecture Note


Multimodal Dialog System

A system that supports human-computer interaction over multiple different input and/or output modes.

Input: voice, pen, gesture, facial expression, etc.

Output: voice, graphical output, etc.

Applications

GPS

Information guide system

Smart home control

Etc.

Example: voice: “Tell me the fastest way to get from here to here.” Pen: pointing at two locations on the map.

Motivations

Speech: the Ultimate Interface?

+ Natural interaction style (free speech)

Natural repair process for error recovery

+ Richer channel: conveys the speaker's disposition and emotional state (if systems knew how to deal with that...)

- Inconsistent input (high error rates), and errors are hard to correct

e.g., we may get a different result each time we speak the same words.

- Slow (sequential) output style: using TTS (text-to-speech)

How to overcome these weak points?

→ Multimodal interface!


Advantages of Multimodal Interface

Task performance and user preference

Migration of Human-Computer Interaction away from the desktop

Adaptation to the environment

Error recovery and handling

Special situations where mode choice helps

Task Performance and User Preference

Task performance and user preference for multimodal over speech only interfaces [Oviatt et al., 1997]

10% faster task completion,

23% fewer words (shorter and simpler linguistic constructions),

36% fewer task errors,

35% fewer spoken disfluencies,

90-100% user preference to interact this way.

  • Speech-only dialog system

    • Speech: “Bring the drink on the table to the side of the bed.”

  • Multimodal dialog system

    • Speech: “Bring this to here.” + pen gesture indicating the drink and the target location

Easy, simplified user utterances!

Migration of Human-Computer Interaction away from the desktop

Small portable computing devices

Such as PDAs, organizers, and smart-phones

Limited screen real estate for graphical output

Limited input: no keyboard/mouse (arrow keys, thumbwheel)

→ Complex GUIs are not feasible

Augment limited GUI with natural modalities such as speech and pen

Use less space

Rapid navigation over menu hierarchy

Other devices

Kiosks, car navigation systems, etc.

No mouse or keyboard

→ Speech + pen gesture

Adaptation to the environment

Multimodal interfaces enable rapid adaptation to changes in the environment

Allow user to switch modes

Mobile devices that are used in multiple environments

Environmental conditions can be either physical or social

Physical

Noise: increases in ambient noise can degrade speech performance → switch to GUI or stylus/pen input

Brightness: Bright light in outdoor environment can limit usefulness of graphical display

Social

Speech may be easiest for passwords, account numbers, etc., but in public places users may be uncomfortable being overheard → switch to GUI or keypad input

Error Recovery and Handling

Advantages for recovery and reduction of error:

Users intuitively pick the mode that is less error-prone.

Language is often simplified.

Users intuitively switch modes after an error

The same problem is not repeated.

Multimodal error correction

Cross-mode compensation - complementarity

Combining inputs from multiple modalities can reduce the overall error rate

Multimodal interfaces thus have the potential to reduce errors and to recover from them more gracefully.

Special Situations Where Mode Choice Helps

Users with disabilities

People with a strong accent or a cold

People with RSI (repetitive strain injury)

Young children or non-literate users

Other users who have trouble handling the standard devices: mouse and keyboard

Multimodal interfaces let people choose their preferred interaction style depending on the actual task, the context, and their own preferences and abilities.

Multimodal Dialog System Architecture

Architecture of QuickSet [Cohen et al., 1997]

Multi-agent architecture

[Architecture diagram: agents such as Speech/TTS, Natural Language, Sketch/Gesture, Multimodal Integration, the Map Interface, VR/AR interfaces (MAVEN, BARS), simulators, Java-enabled web pages, COM objects, other user interfaces, a CORBA bridge, web services (XML, SOAP, …), and databases communicate through a Facilitator (routing, triggering, dispatching) over the Inter-agent Communication Language (ICL, Horn clauses), which can also link to other Facilitators.]

Multimodal Language Processing


Multimodal Reference Resolution


Need to resolve references (what the user is referring to) across modalities.

A user may refer to an item on a display by using speech, by pointing, or both.

Closely related to multimodal integration

Example: voice: “Tell me the fastest way to get from here to here.” Pen: pointing at two locations on the map.

Multimodal Reference Resolution

Finds the most appropriate referents for referring expressions [Chai et al., 2004]

Referring expression

Refer to a specific entity or entities

Given by a user’s inputs (most likely in speech inputs)

Referent

An entity to which the user refers

The referent can be an object that is not specified by the current utterance.

[Example: Speech: “여기에서 여기로 가는 가장 빠른 길 좀 알려줘” (“Tell me the fastest way from here to here”). The two referring expressions “여기” (“here”) align with pen gestures g1 and g2, which select the objects Lotte Department Store and Burger King on the map.]

Multimodal Reference Resolution

Hard cases

Multiple and complex gesture inputs

e.g., in an information guide system:

User: “이건 가격이 얼마지?” (“How much is this?”) (selects one item)

System: “만 오천원 입니다.” (“It is 15,000 won.”)

User: “이거랑 이것들이랑 가격 좀 비교 해 줄래” (“Can you compare the prices of this and these?”) (selects three items)

[Timelines: the referring expressions “이거” (“this”) and “이것들” (“these”) in the speech stream must be aligned with the gestures g1, g2, g3 on the time axis; with multiple gestures, the correct grouping (e.g., whether “these” covers g2 and g3 or some other subset) is ambiguous.]

Multimodal Reference Resolution

Using linguistic theories to guide the reference resolution process. [Chai et al., 2005]

Conversational Implicature

Givenness Hierarchy

Greedy algorithm for finding the best assignment for a referring expression given a cognitive status.

Calculates a match score between referring expressions and referent candidates

Matching score components: object selectivity, compatibility measurement, and the likelihood of the cognitive status

The best assignments are found with a greedy algorithm (see the sketch below).
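A minimal sketch of the greedy assignment idea described above (not the exact algorithm of Chai et al.; the expressions, candidates, and scores are hypothetical, and each expression is given a single referent for simplicity):

```python
# Greedy assignment of referring expressions to referent candidates.
# In a full system, match_score would combine object selectivity, compatibility,
# and the likelihood of the cognitive status; here it is given as a table.
def greedy_resolve(match_scores):
    """match_scores: dict mapping (expression, candidate) -> score."""
    assignment = {}
    used_candidates = set()
    # Consider the highest-scoring pairs first.
    for (expr, cand), score in sorted(match_scores.items(),
                                      key=lambda kv: kv[1], reverse=True):
        if expr not in assignment and cand not in used_candidates:
            assignment[expr] = (cand, score)
            used_candidates.add(cand)
    return assignment

# Hypothetical scores for the expressions "this" and "these".
scores = {
    ("this", "obj_7"): 0.9, ("this", "obj_3"): 0.4,
    ("these", "obj_3"): 0.7, ("these", "obj_5"): 0.6,
}
print(greedy_resolve(scores))
# {'this': ('obj_7', 0.9), 'these': ('obj_3', 0.7)}
```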

Multimodal Integration

[Diagram: the meaning derived from each input modality is combined by multimodal integration / fusion into a single combined meaning.]

  • Combining information from multiple input modalities to understand the user's intention and attention

    • Multimodal reference resolution is a special case of multimodal integration

      • Speech + pen gesture

      • The case where pen gestures can express only deictic or grouping meanings

Multimodal Integration

  • Issues:

    • Nature of multimodal integration mechanism

      • Algorithmic – procedural

      • Parser / Grammars – Declarative

    • Does approach treat one mode as primary?

      • Is gesture a secondary dependent mode?

        • Multimodal reference resolution

    • How temporal and spatial constraints are expressed

    • Common meaning representation for speech and gesture

  • Two main approaches

    • Unification-based multimodal parsing and understanding [Johnston, 1998]

    • Finite-state transducers for multimodal parsing and understanding [Johnston et al., 2000]

Unification-based multimodal parsing and understanding

  • Parallel recognizers and “understanders”

  • Time-stamped meaning fragments for each stream

  • Common framework for meaning representation – typed feature structures

  • Meaning fusion operations – unification

    • Unification is an operation that determines whether two pieces of partial information are consistent

    • If they are consistent, it combines them into a single result

      • e.g., whether a given gestural input is compatible with a given piece of spoken input

      • and, if so, combining the two into a single interpretation (see the sketch after this list)

    • Semantic and spatiotemporal constraints

  • Statistical ranking

  • Flexible asynchronous architecture

  • Must handle unimodal and multimodal input
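A minimal sketch of what unification of partial meaning fragments looks like, using plain dictionaries instead of typed feature structures (the attribute names loosely follow the “draw a line” example shown later in this deck):

```python
# Recursive unification of two partial feature structures (dicts).
# Returns the combined structure, or None if the inputs are inconsistent.
def unify(fs1, fs2):
    if isinstance(fs1, dict) and isinstance(fs2, dict):
        result = dict(fs1)
        for key, val2 in fs2.items():
            if key in result:
                sub = unify(result[key], val2)
                if sub is None:          # conflicting values -> unification fails
                    return None
                result[key] = sub
            else:
                result[key] = val2       # partiality: missing features are simply added
        return result
    return fs1 if fs1 == fs2 else None   # atomic values must match exactly

# Partial meaning from speech: a create_line command with an unfilled location.
speech = {"type": "create_line", "color": "green", "location": {"type": "line"}}
# Partial meaning from the pen gesture: a line-shaped location with coordinates.
gesture = {"location": {"type": "line", "coords": [(12143, 12134), (12146, 12134)]}}
print(unify(speech, gesture))            # combined create_line command
# A point-shaped gesture fails: location types "line" vs "point" conflict.
print(unify(speech, {"location": {"type": "point"}}))   # -> None
```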

Unification-based multimodal parsing and understanding

  • Temporal Constraints [Oviatt et al., 1997]

    • Speech and gesture overlap, or

    • Gesture precedes speech by <= 4 seconds

    • Speech does not precede gesture

      Given sequence: speech1; gesture; speech2

      Possible grouping: speech1; (gesture; speech2)

      Finding [Oviatt et al., 2004, 2005]:

      Users have a consistent temporal integration style → adapt to it (see the sketch below)
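A minimal sketch of the temporal constraints above; the timestamped input dictionaries are hypothetical:

```python
# Decide whether a gesture and a spoken segment should be grouped, following the
# constraints above: they overlap, or the gesture precedes the speech by at most
# 4 seconds (speech never precedes gesture).
def should_group(gesture, speech, max_gap=4.0):
    """gesture, speech: dicts with 'start' and 'end' timestamps in seconds."""
    overlaps = gesture["start"] < speech["end"] and speech["start"] < gesture["end"]
    gesture_first_gap = speech["start"] - gesture["end"]
    precedes_within_window = 0.0 <= gesture_first_gap <= max_gap
    return overlaps or precedes_within_window

circle = {"start": 2.0, "end": 2.8}                           # pen gesture
utterance = {"start": 4.5, "end": 6.0}                        # "show me this one"
print(should_group(circle, utterance))                        # True: gap of 1.7 s
print(should_group({"start": 10.0, "end": 11.0}, utterance))  # False: speech precedes gesture
```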

Unification-based multimodal parsing and understanding

  • Each unimodal input is represented as a feature structure [Holzapfel et al., 2004]

    • Very common representation in Comp. Ling. – FUG, LFG, PATR

      • e.g., lexical entries, grammar rules, etc.

    • e.g., “please switch on the lamp”

  • Predefined rules resolve deictic references and integrate the multimodal inputs

[Figure: a typed feature structure of type Type with attributes Attr1: val1, Attr2: val2, and Attr3 whose value is an embedded structure of type Type2 with Attr4: val4.]

Unification-based multimodal parsing and understanding

  • An example: speech “Draw a line” + a pen gesture

[Figure: one speech hypothesis yields a create_line feature structure whose object has color: green and label: “draw a line”, with an underspecified location of type line; the pen gesture yields a location feature structure containing a coordinate list (e.g., (12143,12134), (12146,12134), …). Unifying the two fills the command's location with the gesture's coordinate list. Cross-mode compensation: a competing point-typed gesture hypothesis (xcoord: 15487, ycoord: 19547) is rejected because its location type does not unify with the line type required by the speech; the ISA type hierarchy governs which types are compatible.]

Unification-based multimodal parsing and understanding

  • Advantages of multimodal integration via typed feature structure unification

    • Partiality

    • Structure sharing

    • Mutual Compensation (cross-mode compensation)

    • Multimodal discourse

Unification-based multimodal parsing and understanding

  • Mutual Disambiguation (MD)

    • Each input mode provides a set of scored recognition hypotheses

    • MD derives the best joint interpretation by unification of meaning representation fragments

    • PMM = α·PS + β·PG + C (see the scoring sketch after the figure below)

      • Learn α, β and C over a multimodal corpus

    • MD stabilizes system performance in challenging environments

[Figure: mutual disambiguation example: the speech n-best list (s1, s2, s3), the gesture n-best list (g1, g2, g3, g4), and the object hypotheses (o1, o2, o3) are combined into a ranked list of multimodal interpretations (mm1, mm2, mm3, mm4); the best joint interpretation need not come from the top-ranked speech or gesture hypothesis.]
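A minimal sketch of the mutual-disambiguation scoring above: every compatible (speech, gesture) pair is scored with the weighted combination PMM = α·PS + β·PG + C, and the best joint interpretation wins. The hypothesis lists, the compatibility test, and the weights are hypothetical; in the real system α, β and C are learned from a multimodal corpus.

```python
# Rank joint (speech, gesture) interpretations by the weighted score above.
ALPHA, BETA, C = 0.6, 0.4, 0.0            # would be learned from a multimodal corpus

def compatible(speech_hyp, gesture_hyp):
    # Stand-in for unification: a deictic phrase pairs with a selection gesture,
    # a non-deictic phrase pairs with a non-selection gesture.
    deictic = "this" in speech_hyp["words"].split()
    selection = gesture_hyp["type"] == "select"
    return deictic == selection

def best_joint(speech_nbest, gesture_nbest):
    joint = [(ALPHA * s["score"] + BETA * g["score"] + C, s, g)
             for s in speech_nbest for g in gesture_nbest if compatible(s, g)]
    return max(joint, key=lambda t: t[0], default=None)

speech_nbest = [
    {"words": "is bad noon", "score": 0.55},     # top ASR hypothesis, no deictic word
    {"words": "this bathroom", "score": 0.50},
]
gesture_nbest = [
    {"type": "select", "score": 0.9},
    {"type": "scribble", "score": 0.3},
]
score, speech, gesture = best_joint(speech_nbest, gesture_nbest)
print(speech["words"], gesture["type"], round(score, 2))
# -> this bathroom select 0.66  (the gesture pulls the lower ASR hypothesis to the top)
```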

Finite-state Multimodal Understanding

  • Modeled by a 3-tape finite state device

    • Speech and gesture stream (gesture symbols)

    • Their combined meaning (meaning symbols)

  • The device takes speech and gesture as inputs and produces the meaning as output.

  • Simulated by two transducers

    • G:W → aligns the gesture and speech streams

    • G_W:M → takes the composite alphabet of speech and gesture symbols as input and outputs meaning

  • The speech and gesture inputs are first composed with G:W

  • The resulting G_W is then composed with G_W:M

Finite-state Multimodal Understanding

  • Representation of the speech input modality

    • A lattice of words (e.g., “show phone numbers for these two restaurants”, with competing paths such as “ten” and “new”)

  • Representation of the gesture input modality

    • The range of gesture recognitions is represented as a lattice of gesture symbols (e.g., G area loc SEM(points…), G sel 2 rest SEM(r12,r15), hw)

Finite-state Multimodal Understanding

  • Representation of the combined meaning

    • Also represented as a lattice

    • Paths in the meaning lattice are well-formed XML (the lattice symbols are XML tags and content such as <cmd>, <obj>, <type>, phone, <rest>, SEM(r12,r15)), e.g.:

<cmd>

<info>

<type>phone</type>

<obj><rest>r12,r15</rest></obj>

</info>

</cmd>

Finite-state Multimodal Understanding

  • Multimodal Grammar Formalism

    • Multimodal context-free grammar (MCFG)

      • e.g., HEADPL → restaurants:rest:<rest> ε:SEM:SEM ε:ε:</rest>

    • Terminals are multimodal tokens consisting of three components:

      • Speech stream : Gesture stream : Combined meaning (W:G:M)

    • e.g., “put that there”

      S → ε:ε:<cmd> PUTV OBJNP LOCNP ε:ε:</cmd>

      PUTV → ε:ε:<act> put:ε:put ε:ε:</act>

      OBJNP → ε:ε:<obj> that:Gvehicle:ε ε:SEM:SEM ε:ε:</obj>

      LOCNP → ε:ε:<loc> there:Garea:ε ε:SEM:SEM ε:ε:</loc>

[Figure: parse of “put that there”: S covers PUTV (speech “put”), OBJNP (speech “that” aligned with gesture Gvehicle v1), and LOCNP (speech “there” aligned with gesture Garea a1); the resulting meaning is <cmd><act>put</act><obj>v1</obj><loc>a1</loc></cmd>. A small walk-through sketch follows.]
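A minimal sketch of how one aligned path of W:G:M triples turns the speech and gesture streams into a meaning string; this is a toy single-path walk, not the actual finite-state machinery, and the gesture contents v1 and a1 follow the figure above:

```python
# Walk one accepting path of the 3-tape relation for "put that there".
EPS = ""   # epsilon

# One aligned path through the multimodal grammar, as (speech, gesture, meaning) triples.
# "SEM" on the gesture tape means: copy the specific content attached to the most
# recently consumed gesture symbol onto the meaning tape.
PATH = [
    (EPS, EPS, "<cmd>"),
    (EPS, EPS, "<act>"), ("put", EPS, "put"), (EPS, EPS, "</act>"),
    (EPS, EPS, "<obj>"), ("that", "Gvehicle", EPS), (EPS, "SEM", "SEM"), (EPS, EPS, "</obj>"),
    (EPS, EPS, "<loc>"), ("there", "Garea", EPS), (EPS, "SEM", "SEM"), (EPS, EPS, "</loc>"),
    (EPS, EPS, "</cmd>"),
]

def interpret(words, gestures):
    """words: list of recognized words; gestures: list of (symbol, content) pairs."""
    w, g, meaning, last_content = 0, 0, [], None
    for speech, gesture, out in PATH:
        if speech:                       # consume a word from the speech tape
            assert words[w] == speech, f"speech mismatch at {words[w]}"
            w += 1
        if gesture == "SEM":             # copy gesture content onto the meaning tape
            meaning.append(last_content)
            continue
        elif gesture:                    # consume a gesture symbol
            symbol, last_content = gestures[g]
            assert symbol == gesture, f"gesture mismatch at {symbol}"
            g += 1
        if out:
            meaning.append(out)
    return "".join(meaning)

print(interpret(["put", "that", "there"], [("Gvehicle", "v1"), ("Garea", "a1")]))
# -> <cmd><act>put</act><obj>v1</obj><loc>a1</loc></cmd>
```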

Finite-state Multimodal Understanding

[Figure: the multimodal grammar below compiled into a finite-state device whose arc labels are W:G:M triples such as email:ε:email([, this:ε:ε, person:Gp:person(, ε:SEM:SEM, and ε:ε:]).]

  • Multimodal Grammar Example

    • Speech: email this person and that organization

    • Gesture: Gp SEM Go SEM

    • Meaning: email([ person(SEM) , org(SEM) ])

      S → V NP ε:ε:])

      NP → DET N

      NP → NP CONJ NP

      CONJ → and:ε:,

      V → email:ε:email([

      V → page:ε:page([

      DET → this:ε:ε

      DET → that:ε:ε

      N → person:Gp:person( ε:SEM:SEM ε:ε:)

      N → organization:Go:org( ε:SEM:SEM ε:ε:)

      N → department:Gd:dept( ε:SEM:SEM ε:ε:)


Finite-state Multimodal Understanding

[Figure: integration processing: the speech lattice (e.g., “show phone numbers for these two restaurants”) and the gesture lattice (e.g., G sel 2 rest SEM(r12,r15)) are composed with the 3-tape multimodal finite-state device, producing the meaning lattice (e.g., <cmd> <type> phone …).]

Finite-state Multimodal Understanding

  • An example

[Figure: the speech lattice for “email this person and that organization” and the gesture lattice Gp SEM Go SEM are composed with the multimodal grammar FST; the resulting meaning lattice yields email([ person(SEM) , org(SEM) ]), with the gesture-specific SEM contents filling the person and org slots.]

Robustness in Multimodal Dialog

Robustness in Multimodal Dialog

  • Gain robustness via

    • Fusion of inputs from multiple modalities

    • Using strengths of one mode to compensate for weaknesses of others—design time and run time

    • Avoiding/correcting errors

    • Statistical architecture

    • Confirmation

    • Dialogue context

    • Simplification of language in a multimodal context

    • Output affecting/channeling input

  • Example approaches

    • Edit machines in FST-based multimodal integration and understanding

    • Salience driven approach to robust input interpretation

    • N-best re-ranking method for improving speech recognition performance

Edit Machines in FST-based Multimodal Integration

  • Problem with FST-based multimodal integration: a mismatch between the user's input and the language encoded in the grammar

    ASR: show cheap restaurants thai places in in chelsea

    Grammar: show cheap thai places in chelsea

  • How to parse it?

  • → Determine which in-grammar string it is most like

    Edits: show cheap ε thai places in ε chelsea

    (“restaurants” and the second “in” are deleted)

    To find this, employ the edit machine! (see the sketch below)
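A minimal sketch of the idea, assuming the grammar's language is given as a small set of strings and using plain word-level edit distance (the real system encodes the edits as a finite-state transducer composed with the grammar):

```python
# Find the in-grammar string closest to the ASR output by word-level edit distance.
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    dp = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tok_b in enumerate(b, 1):
            cur = min(dp[j] + 1,                 # deletion
                      dp[j - 1] + 1,             # insertion
                      prev + (tok_a != tok_b))   # substitution (0 if tokens match)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def closest_in_grammar(asr_output, grammar_strings):
    words = asr_output.split()
    return min(grammar_strings, key=lambda s: edit_distance(words, s.split()))

grammar = [
    "show cheap thai places in chelsea",
    "show cheap italian places in chelsea",
    "subway to the cloisters",
]
print(closest_in_grammar("show cheap restaurants thai places in in chelsea", grammar))
# -> "show cheap thai places in chelsea"  (two deletions)
```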

Handcrafted Finite-state Edit Machines

  • Edit-based Multimodal Understanding – Basic edit

    • Transform ASR output so that it can be assigned a meaning by the FST-based Multimodal Understanding model

    • Find the string with the least costly sequence of edits that can be assigned an interpretation by the grammar (see the formula below)

      • λg: Language encoded in the multimodal grammar

      • λs: String encoded in the lattice resulting from ASR

      • ◦ : composition of transducers
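The formula these definitions belong to did not survive the export; presumably it is the usual finite-state formulation, along the lines of

$$ \hat{s} \;=\; \mathrm{BestPath}\left( \lambda_s \circ \lambda_{\mathrm{edit}} \circ \lambda_g \right) $$

where λedit is a transducer applying weighted insertions, deletions, and substitutions, so the cheapest path yields the in-grammar string closest to the ASR string.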

Handcrafted Finite-state Edit Machines

  • Edit-based Multimodal Understanding – 4-edit

    • Basic edit is quite large and adds an unacceptable amount of latency (5s on average).

    • Limited number of edit operations (at most 4)

Handcrafted Finite-state Edit Machines

  • Edit-based Multimodal Understanding – Smart edit

    • Smart edit is a 4-edit machine + heuristics + refinements

      • Deletion of SLM-only words (words not found in the grammar)

        • thai restaurant listings in midtown -> thai restaurant in midtown

      • Deletion of doubled words

        • subway to to the cloisters -> subway to the cloisters

      • Subdivided cost classes (icost, dcost → 3 classes)

        • High cost: slot fillers (e.g., chinese, cheap, downtown)

        • Low cost: dispensable words (e.g., please, would)

        • Medium cost: all other words

      • Auto-completion of place names

        • The algorithm enumerates all possible shortenings of place names

        • e.g., Metropolitan Museum of Art → Metropolitan Museum

Learning Edit Patterns

  • User’s input is considered a “noisy” version of the parsable input (clean).

    Noisy (S): show cheap restaurants thai places in in chelsea

    Clean (T): show cheap ε thai places in ε chelsea

  • Translating the user’s input to a string that can be assigned a meaning representation by the grammar

Learning Edit Patterns

  • Noisy Channel Model for Error Correction

    • Translation probability

      • Sg: string that can be assigned a meaning representation by the grammar

      • Su: user’s input utterance

      • Using a Markov assumption (trigram)

        • where Su = Su1 Su2 … Sun and Sg = Sg1 Sg2 … Sgm (see the formula below)

    • Word Alignment (Sui,Sgi)

      • GIZA++
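The equations on this slide did not survive the export; under a standard noisy-channel reading of the bullets above, the model is roughly

$$ \hat{S}_g \;=\; \arg\max_{S_g} P(S_g \mid S_u) \;=\; \arg\max_{S_g} P(S_u \mid S_g)\, P(S_g), \qquad P(S_g) \;\approx\; \prod_i P(S_{g_i} \mid S_{g_{i-1}}, S_{g_{i-2}}) $$

with the translation probability P(Su | Sg) decomposed over the word alignments (Sui, Sgi) produced by GIZA++.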

Learning Edit Patterns

  • Deriving Translation Corpus

    • The finite-state transducer can generate the input strings for a given meaning.

    • Training the translation model

[Figure: deriving the translation corpus: for each (string, meaning) pair in the corpus, the multimodal grammar generates the strings for the given meaning; the generated string closest to the user's string is selected as the target string, and the resulting (user string, target string) pairs are used to train the translation model.]

Experiments and Results

  • 16 first time users (8 male, 8 female).

  • 833 user interactions (218 multimodal / 491 speech-only / 124 pen-only)

  • Finding restaurants of various types and getting their names, phone numbers, addresses.

  • Getting subway directions between locations.

  • Avg. ASR sentence accuracy: 49%

  • Avg. ASR word accuracy: 73.4%

Experiments and Results

  • Improvements in concept accuracy

[Tables: concept accuracy results under 6-fold and 10-fold cross-validation.]

A Salience Driven Approach

  • Modify the language model score, and rescore recognized hypotheses

    • Using information from the gesture input

    • Primed language model

      • W* = argmax_W P(O|W) P(W)

A Salience Driven Approach

  • “People do not make any unnecessary deictic gestures”

    • Cognitive theory of Conversational Implicature

      • Speakers tend to make their contribution as informative as is required

      • and not make their contribution more informative than is required

  • “Speech and gesture tend to complement each other”

    • When a speech utterance is accompanied by a deictic gesture,

      • the speech input issues commands or inquiries about properties of objects

      • the deictic gesture indicates the objects of interest

  • Gesture is an early indicator that anticipates the content of the subsequent spoken utterance

    • 85% of the time, gestures occurred before the corresponding speech unit

A Salience Driven Approach

  • A deictic gesture can activate several objects on the graphical display

    • It will signal a distribution of objects that are salient

[Figure: graphical display with salience weights: as the utterance “Move this to here” is spoken, the accompanying deictic gesture activates several objects on the display, signaling a salience distribution over them; the most salient object here is a cup.]

A Salience Driven Approach

  • Salient object ‘a cup’ is mapped to the physical world representation

    • to indicate a salient part of the representation

      • such as relevant properties or tasks related to the salient object

  • This salient part of the physical world is likely to be the potential content of speech

[Figure: on the timeline, the gesture selecting the cup precedes the speech “Move this to here”; the cup's entry in the physical world representation is highlighted.]

A Salience Driven Approach

  • Physical world representation

    • Domain Model

      • Relevant knowledge about the domain

        • Domain objects

        • Properties of objects

        • Relations between objects

        • Task models related to objects

      • Frame-based representation

        • Frame: domain object

        • Frame elements: attributes and tasks related to the objects

    • Domain Grammar

      • Specifies grammar and vocabularies used to process language inputs

        • Semantics-based context free grammar

          • Non-terminal: semantic tag

          • Terminal: word (value of semantic tag)

        • Annotated user spoken utterance

          • Relevant semantic information

          • N-grams

Salience Modeling

  • Calculating a salience distribution of entities in the physical world

    • The salience value of an entity at time tn is influenced by the joint effect of the sequence of gestures that happen before tn

Salience Modeling

The salience distribution (reconstructed from the annotations on this slide):

$$ P_{t_n}(e_k) \;=\; \frac{\sum_{g:\, t_g < t_n} w_{\alpha}(t_n - t_g)\, P(e_k \mid g)}{\sum_{j} \sum_{g:\, t_g < t_n} w_{\alpha}(t_n - t_g)\, P(e_j \mid g)} $$

i.e., the summation of P(ek|g) over all gestures before time tn, weighted by α so that a closer gesture has a higher impact on the salience distribution, and divided by a normalizing factor (the summation of the salience values of all entities at time tn).

Salience Driven Spoken Language Understanding

  • Maps the salience distribution to the physical world representation

  • Uses salient world to influence spoken language understanding

  • Primes language models to facilitate language understanding

    • Rescoring the speech recognizer's hypotheses using the primed language model score


Primed Language Model

  • Primed language model is based on the class-based bigram model

    • Class : semantic and functional class for domain

      • e.g., this → Demonstrative, price → AttrPrice

    • Modify the word class probability

      • Originally it measures the probability of seeing a word wi given a class ci

      • It is modified so that the choice of word wi depends on the salient physical world,

        • which is represented as the salience distribution P(e)

      • P(wi,ci|ek) and P(ci|ek) do not depend on the time ti

      • → so they can be estimated from the training data (see the formula below)

  • Speech hypotheses are reordered according to primed language model.
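The formulas referenced above (class-based bigram, word class probability, class transition probability) were images in the original slides; based on the components named here, they presumably have the form

$$ P(W) \;\approx\; \prod_i P(w_i \mid c_i)\, P(c_i \mid c_{i-1}) $$

$$ P_{\mathrm{primed}}(w_i \mid c_i) \;=\; \frac{\sum_k P(w_i, c_i \mid e_k)\, P_{t_i}(e_k)}{\sum_k P(c_i \mid e_k)\, P_{t_i}(e_k)} $$

where P_{t_i}(e_k) is the salience distribution at the time of word wi; only that term changes over time, which is why P(wi,ci|ek) and P(ci|ek) can be estimated once from training data.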

Evaluation - WER

  • Domain : real estate properties

  • Interface : speech + pen gesture

  • 11 users tested, five non-native speakers and six native speakers

  • 226 user inputs with an average of 8 words per utterance

  • Average WER reduction is about 12% (t=4.75, p<0.001)

Evaluation – Concept Identification

  • Examples of improved cases

    • Transcription: What is the population of this town

    • Baseline: What is the publisher of this time

    • Salience-based: What is the population of this town

    • Transcription: How much is this gray house

    • Baseline: How much is this great house

    • Salience-based: How much is this gray house

N-best re-ranking for improving speech recognition performance

  • Using multimodal understanding features

[Figure: the user says “이것 좀 여기에 갖다 놔” (“Put this here”) together with a pen gesture, but the ASR output contains a speech error (“이다 좀 여기에 갖 다 가”). SLU still produces speech act: request, main goal: move, and the slot Target.Loc: “여기” (“here”), but the slot Source.item: “이것” (“this”) is missing because of the recognition error.]

N-best re-ranking for improving speech recognition performance

  • Using N-best ASR Hypotheses

    • Rescore the hypotheses with information that is not available during speech recognition

    • → We use multimodal understanding features

[Figure: the recognizer's N-best list (hypotheses such as “이다 좀 여기 갖 다 가”, “이다 좀 여기 갖 다 줘”, “이것 좀 여기 갖 다 가”, “이것 좀 이것 갖 다 가”) is passed to a re-ranking model that uses many features; after re-ranking, hypotheses containing the correct word “이것” (“this”) are promoted above the misrecognized “이다”.]

Speech Recognizer Features

  • Speech recognizer score: P(W|X).

  • Acoustic model score: P(X|W).

  • Language model score: P(W).

  • N-best word rate: To give more confidence to a particular word which occurs in many hypotheses.

  • N-best homogeneity: To give more weight to a word which appears in a higher ranked hypothesis, we weigh each word by the score of the hypothesis in which it appears.

SLU features

  • CRF confidence score: the confidence score of the SLU results

    • Confidence scores of the speech act and the main goal:

      P(speech act | word sequence), P(main goal | word sequence)

      • Derived from a CRF formulation (see the formula below)

      • y: output variable

      • x: input variable

      • Z: normalization factor

      • fk(yt-1,yt,x,t): arbitrary linguistic feature function (often binary-valued)

      • λk: trained parameter associated with feature fk
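The equation itself was an image in the original slide; with the symbols defined above, it is the standard linear-chain CRF

$$ P(y \mid x) \;=\; \frac{1}{Z(x)} \exp\!\left( \sum_{t} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t) \right) $$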

SLU features

  • CRF confidence Score (cont.)

    • Confidence score of component slot

      • yt: component slot

      • xt: corresponding word

Multimodal Understanding Features

  • Multimodal reference resolution score

    • Well-recognized speech hypotheses tend to resolve well.

    • (a) → well recognized

    • (b), (c): “this bathroom” is misrecognized as “this bad noon”

    • (b) “this bad noon” cannot be a referring expression, so the second pen gesture gets a low reference resolution score

    • (c) “this bad noon” is treated as a referring expression but receives a low reference resolution score

    (a combined re-ranking sketch follows)
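A minimal sketch of the re-ranking step, combining the speech recognizer, SLU, and multimodal understanding features described on the preceding slides into a single weighted score; the feature values and weights are hypothetical, and in the real system the weights would be trained:

```python
# Re-rank N-best ASR hypotheses with a weighted sum of features that are
# not available during recognition (SLU confidence, reference resolution, ...).
WEIGHTS = {
    "asr_score": 1.0,          # speech recognizer score
    "nbest_word_rate": 0.5,    # how often the hypothesis' words occur across the N-best list
    "slu_confidence": 2.0,     # CRF confidence of speech act / main goal / slots
    "ref_resolution": 2.0,     # multimodal reference resolution score
}

def rerank(nbest):
    """nbest: list of dicts mapping feature name -> value (plus 'words')."""
    def score(hyp):
        return sum(WEIGHTS[f] * hyp[f] for f in WEIGHTS)
    return sorted(nbest, key=score, reverse=True)

nbest = [
    # Top ASR hypothesis, but its deictic slot cannot be resolved against the pen gesture.
    {"words": "이다 좀 여기 갖 다 가", "asr_score": 0.62,
     "nbest_word_rate": 0.70, "slu_confidence": 0.40, "ref_resolution": 0.10},
    # Lower ASR score, but SLU and reference resolution strongly prefer it.
    {"words": "이것 좀 여기 갖 다 가", "asr_score": 0.58,
     "nbest_word_rate": 0.65, "slu_confidence": 0.80, "ref_resolution": 0.90},
]
print(rerank(nbest)[0]["words"])   # -> "이것 좀 여기 갖 다 가"
```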

Experimental Setup

  • Corpus

    • 617 Multimodal inputs

      • 118 (speech + pen gesture) + 499 (speech only)

      • 3135 words, 5.08 words per utterance.

      • Vocabulary size: 396

  • Speech Recognizer

    • An HTK-based Korean speech recognizer was trained with 39-dimensional MFCC feature vectors.

    • Output → 75-best lists

Experimental Result (WER)

  • Comparison of the word error rate between the baseline and the N-best re-ranking model with various feature sets

    • Relative error reduction: 7.95%

    • The re-ranking model has a significantly smaller word error rate than the baseline system (p < 0.001).

Experimental Results (WER)

  • Word error rates of the N-best re-ranking model as the size of N varies

  • If N is too large → many noisy hypotheses

  • If N is too small → a small candidate set and few clues for re-ranking

Experimental Results (CER)

  • Comparison of the concept error rate between the baseline and the N-best re-ranking model

    • Relative error reduction: 10.13%

    • The re-ranking model has a significantly smaller concept error rate than the baseline system (p < 0.01).

Reading List

  • R. A. Bolt, 1980, “Put that there: Voice and gesture at the graphics interface,” Computer Graphics Vol. 14, no. 3, 262-270.

  • J. Chai,  S. Pan, M. Zhou, and K. Houck, 2002, Context-based Multimodal Understanding in Conversational Systems. Proceedings of the Fourth International Conference on Multimodal Interfaces (ICMI).

  • J. Chai, P. Hong, and M. Zhou, 2004, A Probabilistic Approach to Reference Resolution in Multimodal User Interfaces. Proceedings of 9th International Conference on Intelligent User Interfaces (IUI-04), 70-77.

  • J. Chai, Z. Prasov, J. Blaim, and R. Jin., 2005, Linguistic Theories in Efficient Multimodal Reference Resolution: an Empirical Investigation. Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI-05), 43-50.


  • J. Chai, S. Qu, 2005, A Salience Driven Approach to Robust Input Interpretation in Multimodal Conversational Systems. Proceedings of HLT/EMNLP 2005.

  • H. Holzapfel, K. Nickel, and R. Stiefelhagen, 2004, Implementation and Evaluation of a Constraint-Based Multimodal Fusion System for Speech and 3D Pointing Gestures. Proceedings of the International Conference on Multimodal Interfaces (ICMI).

  • M. Johnston, 1998, Unification-based multimodal parsing. Proceedings of the International Joint Conference of the Association for Computational Linguistics and the International Committee on Computational Linguistics, 624-630.

  • M. Johnston, and S. Bangalore. 2000. Finite-state multimodal parsing and understanding. Proceedings of COLING-2000.


  • M. Johnston, S. Bangalore, G. Vasireddy, A. Stent, P. Ehlen, M. Walker, S. Whittaker, and P. Maloor. 2002. MATCH: An architecture for multimodal dialogue systems. In Proceedings of ACL-2002.

  • M. Johnston and S. Bangalore, 2006, Learning Edit Machines for Robust Multimodal Understanding. Proceedings of ICASSP 2006.

  • P.R. Cohen, M. Johnston, D.R. McGee, S.L. Oviatt, J.A. Pittman, I. Smith, L. Chen, and J. Clow, 1997, "QuickSet: Multimodal Interaction for Distributed Applications," Intl. Multimedia Conference, 31-40.

  • S. L. Oviatt , A. DeAngeli, and K. Kuhn, 1997, Integration and synchronization of input modes during multimodal human-computer interaction. In Proceedings of Conference on Human Factors in Computing Systems: CHI '97.
