Presentation Transcript


  1. Multi-modal Reference Resolution in Situated Dialogue by Integrating Linguistic and Extra-Linguistic Clues. Ryu Iida, Masaaki Yasuhara, Takenobu Tokunaga (Tokyo Institute of Technology). IJCNLP 2011 (Nov 9 2011)

  2. Research background
  • Typical coreference/anaphora resolution
    • Researchers have tackled the problems provided by the MUC, ACE and CoNLL shared tasks (a.k.a. OntoNotes)
    • Mainly focused on the linguistic aspects of the reference function
  • Multi-modal research community (Byron, 2005; Prasov and Chai, 2008; Prasov and Chai, 2010; Schütte et al., 2010; Iida et al., 2010)
    • Essential for human-computer interaction
    • Identifies the referents of referring expressions in a static scene or a situated world, taking extra-linguistic clues into account

  3. Multi-modal reference resolution
  • Dialogue history: "Rotate the triangle at top right 60 degrees clockwise." "All right… done it." "Move the triangle to the left." "O.K."
  • Eye-gaze: the speaker's fixations on the puzzle pieces during the dialogue
  • Action history: …piece 1: move (X:230, Y:150); piece 7: move (X:311, Y:510); piece 3: rotate 60°

  4. Aim
  • Integrate several types of multi-modal information into a machine learning-based reference resolution model
  • Investigate which kinds of clues are effective for multi-modal reference resolution

  5. Multi-modal problem setting: related work
  • 3D virtual world (Byron 2005, Stoia et al. 2008)
    • e.g. participants controlled an avatar in a virtual world to explore for hidden treasures
    • Frequently occurring scene updates → referring expressions are relatively skewed toward exophoric cases
  • Static scene (Dale 1992)
    • The centrality and size of each object in the computer display is fixed throughout the dialogues → changes in the visual salience of objects are not observed

  6. Evaluation data set creation
  • REX-J corpus (Spanger et al. 2010)
    • Dialogues and transcripts of collaborative work (solving Tangram puzzles) by two Japanese participants
    • The puzzle-solving task was designed to require frequent use of both anaphoric and exophoric referring expressions

  7. Setting of collecting data
  [Figure: experimental setup. The solver and the operator each view the working area on their own display, separated by a shield screen; the goal shape is shown to the solver but is not available to the operator.]

  8. Collecting eye gaze data
  • Recruited 18 Japanese graduate students and split them into 9 pairs
    • All pairs knew each other previously and were of the same sex and approximately the same age
  • Each pair was asked to solve 4 different Tangram puzzles
  • Used the Tobii T60 Eye Tracker (60 Hz sampling rate, 0.5 degrees accuracy) to record the participants' eye gaze
  • 5 dialogues in which the tracking results contained more than 40% errors were removed

  9. Annotating referring expressions
  • Conducted using a multimedia annotation tool, ELAN
  • An annotator manually detects a referring expression and then selects its referent out of the possible puzzle pieces shown on the computer display
  • Total number of annotated referring expressions: 1,462 instances in 27 dialogues
    • 1,192 instances in solver's utterances (81.5%)
    • 270 instances in operator's utterances (18.5%)

  10. Multi-modal reference resolution
  • Base model
    • Ranking candidate referents is important for better accuracy (Iida et al. 2003, Yang et al. 2003, Denis & Baldridge 2008)
    • Apply the Ranking SVM algorithm (Joachims, 2002), which learns a weight vector to rank candidates from a given partial ranking of referents
  • Training instances
    • To define the partial ranking of candidate referents, simply rank the referents referred to by a given referring expression in first place and all other candidates in second place (see the sketch below)
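The following is a minimal sketch of how such pairwise training instances could be built, using scikit-learn's LinearSVC as a stand-in for the Ranking SVM implementation (Joachims, 2002); the function name `build_pairs` and the feature-vector representation are illustrative assumptions, not the paper's code.

```python
# Pairwise reduction for Ranking SVM: the partial ranking puts the true
# referent(s) first and every other candidate piece second, so each pair
# becomes a difference vector (correct - other) labelled +1, plus the
# mirrored pair labelled -1 to keep the classes balanced.
import numpy as np
from sklearn.svm import LinearSVC  # stand-in for SVM-rank

def build_pairs(instances):
    """instances: list of (candidate_vectors, gold_indices) per referring expression."""
    X, y = [], []
    for candidates, gold in instances:
        for i in gold:                          # referents ranked 1st
            for j in range(len(candidates)):
                if j in gold:
                    continue                    # skip other 1st-place referents
                diff = candidates[i] - candidates[j]
                X.append(diff);  y.append(+1)
                X.append(-diff); y.append(-1)
    return np.vstack(X), np.array(y)

# Training and ranking: the learned weight vector scores each candidate by w . x
# X_pairs, y_pairs = build_pairs(training_instances)
# ranker = LinearSVC(C=1.0).fit(X_pairs, y_pairs)
# scores = candidate_vectors @ ranker.coef_.ravel()   # higher = better rank
```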

  11. Feature set
  • Linguistic features: Ling (Iida et al. 2010): 10 features
    • Capture the linguistic salience of each referent based on the discourse history
  • Task-specific features: TaskSp (Iida et al. 2010): 12 features
    • Consider the visual salience based on recent movements of the mouse cursor and the pieces recently manipulated by the operator
  • Eye-gaze features: Gaze (proposed): 14 features

  12. Eye gaze as a clue to the reference function
  • Eye gaze
    • Saccades: quick, simultaneous movements of both eyes in the same direction
    • Eye fixations: maintaining the visual gaze on a single location
  • The direction of eye gaze directly reflects the focus of attention (Richardson et al., 2007)
    • Eye fixations are used as clues for identifying the pieces being focused on
  • Separating saccades from eye fixations: dispersion-threshold identification (Salvucci and Anderson, 2001), sketched below
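A rough sketch of dispersion-threshold fixation identification, assuming gaze samples arrive as (timestamp_ms, x, y) tuples; the dispersion and duration thresholds below are illustrative defaults, not the settings used in the paper.

```python
def detect_fixations(samples, max_dispersion=30.0, min_duration=100.0):
    """samples: list of (timestamp_ms, x, y), sorted by time."""
    fixations = []
    i, n = 0, len(samples)
    while i < n:
        # initial window: the shortest span of samples covering min_duration
        j = i
        while j < n - 1 and samples[j][0] - samples[i][0] < min_duration:
            j += 1
        if samples[j][0] - samples[i][0] < min_duration:
            break  # not enough data left to form a fixation
        if dispersion(samples[i:j + 1]) <= max_dispersion:
            # extend the window while dispersion stays below the threshold
            while j < n - 1 and dispersion(samples[i:j + 2]) <= max_dispersion:
                j += 1
            xs = [s[1] for s in samples[i:j + 1]]
            ys = [s[2] for s in samples[i:j + 1]]
            fixations.append((samples[i][0], samples[j][0],          # start, end
                              sum(xs) / len(xs), sum(ys) / len(ys))) # centroid
            i = j + 1          # samples consumed by this fixation
        else:
            i += 1             # treat the first sample as part of a saccade
    return fixations

def dispersion(window):
    """Dispersion = (max(x) - min(x)) + (max(y) - min(y)) over the window."""
    xs = [s[1] for s in window]
    ys = [s[2] for s in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))
```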

  13. Eye gaze features
  [Figure: a timeline of fixations on puzzle pieces (labelled a–g, e.g. fixating on piece_a, then piece_b) while uttering "First you need to move the smallest triangle to the left"; the features are computed over the time period [t − T, t], with T = 1500 msec (Prasov and Chai 2010).]
  • The features capture how frequently or how long the speaker fixates on each piece within this window (a sketch follows)
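A hedged sketch of the window-based gaze features: for each puzzle piece, how long and how often the speaker fixated on it within [t − T, t]. The helper `piece_at(x, y)` (mapping a screen coordinate to a piece) and the choice of anchoring t at the expression's onset are assumptions for illustration, not details from the paper.

```python
T_MS = 1500  # gaze window length, following Prasov and Chai (2010)

def gaze_features(fixations, piece_at, t_ms, piece_ids):
    """fixations: (start_ms, end_ms, x, y) tuples, e.g. from detect_fixations().
    t_ms: the reference time t of the expression (assumed here to be its onset)."""
    window_start = t_ms - T_MS
    duration = {p: 0.0 for p in piece_ids}
    count = {p: 0 for p in piece_ids}
    for f_start, f_end, x, y in fixations:
        overlap = min(f_end, t_ms) - max(f_start, window_start)  # clip to [t - T, t]
        if overlap <= 0:
            continue
        piece = piece_at(x, y)
        if piece is None or piece not in duration:
            continue
        duration[piece] += overlap
        count[piece] += 1
    longest = max(duration, key=duration.get) if any(duration.values()) else None
    return {p: {"fixated_in_window": duration[p] > 0,        # cf. Gaze10
                "fixation_time_ms": duration[p],
                "fixation_count": count[p],
                "longest_fixation": p == longest}            # cf. Gaze9
            for p in piece_ids}
```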

  14. Empirical evaluation
  • Compared models with different combinations of the three types of features
  • Conducted 5-fold cross-validation
  • Proposed model with model separation (Iida et al. 2010)
    • The referential behaviour of pronouns is completely different from that of non-pronouns → separately create two reference resolution models (see the sketch below):
    • Pronoun model: identifies the referent of a given pronoun
    • Non-pronoun model: identifies the referent of all other expressions (e.g. NPs)
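A minimal sketch of the model-separation setup, assuming two independently trained rankers routed by a simple pronoun check; `is_pronoun`, the pronoun list, and the rankers' `score` method are illustrative names, not the paper's interface.

```python
PRONOUNS = {"それ", "これ", "あれ"}   # illustrative Japanese demonstrative pronouns

def is_pronoun(expression):
    return expression in PRONOUNS

def resolve_referent(expression, candidate_vectors, pronoun_ranker, nonpronoun_ranker):
    """Route the expression to the matching model and return the index of the
    top-ranked candidate referent."""
    ranker = pronoun_ranker if is_pronoun(expression) else nonpronoun_ranker
    scores = [ranker.score(vec) for vec in candidate_vectors]
    return max(range(len(scores)), key=scores.__getitem__)
```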

  15. Results of (non-)pronouns

  16. Overall results

  17. Investigation of the significance of features
  • Calculate the weight of each feature f according to the following formula (a small sketch follows):
    w(f) = \sum_{x \in SV} \alpha_x \cdot I(f \in x)
    where SV is the set of support vectors in a ranker, \alpha_x is the weight of support vector x, and I(f \in x) is a function that returns 1 if f occurs in x (0 otherwise)
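A minimal sketch of this calculation, assuming binary features and access to the ranker's support vectors and their alpha weights (the names `support_vectors` and `alphas` are illustrative, each support vector being represented as the set of its active features).

```python
def feature_weight(f, support_vectors, alphas):
    """w(f) = sum over support vectors x of alpha_x * I(f occurs in x)."""
    return sum(alpha for x, alpha in zip(support_vectors, alphas) if f in x)
```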

  18. Weights of features in each model
  • TaskSp1: the mouse cursor was over a piece at the beginning of uttering a referring expression
  • TaskSp3: the time distance is less than or equal to 10 sec after the mouse cursor was over a piece

  19. Weights of features in each model
  • Ling6: the shape attributes of a piece are compatible with the attributes of a referring expression
  • Gaze10: a fixation on a piece exists in the time period [t − T, t]
  • Gaze9: the fixation time of a piece in the time period [t − T, t] is the longest out of all pieces

  20. Summary
  • Investigated the impact of multi-modal information on reference resolution in Japanese situated dialogues
  • The results demonstrate that
    • The referents of pronouns rely on the visual focus of attention, such as that indicated by moving the mouse cursor
    • Non-pronouns are strongly related to eye fixations on their referents
    • Integrating these two types of multi-modal information with linguistic information contributes to increasing the accuracy of reference resolution

  21. Future work
  • Further data collection is needed
    • All objects in the Tangram puzzle (i.e. puzzle pieces) have nearly the same size, which excludes the factor that a relatively larger object occupying the computer display gains higher prominence than smaller objects
  • Zero anaphors in utterances need to be annotated, given their frequent use in Japanese
