Multimodal Expressive Embodied Conversational Agents

Presentation Transcript

  1. Multimodal Expressive Embodied Conversational Agents • Université Paris 8 • Catherine Pelachaud • Elisabetta Bevacqua • Nicolas Ech Chafai, FT • Maurizio Mancini • Magalie Ochs, FT • Christopher Peters • Radek Niewiadomski

  2. ECAs Capabilities • Anthropomorphic autonomous figures • New form of human-machine interaction • Study of human communication, human-human interaction • ECAs ought to be endowed with dialogic and expressive capabilities • Perception: an ECA must be able to pay attention to and perceive the user and the context she is placed in.

  3. ECAs capabilities • Interaction: • speaker and addressee emit signals • speaker perceives feedback from addressee • speaker may decide to adapt to addressee’s feedback • consider social context • Generation: expressive synchronized visual and acoustic behaviors. • produce expressive behaviours • words, voice, intonation, • gaze, facial expression, gesture • body movements, body posture

  4. Synchrony tool - BEAT • Cassell et al., MIT Media Lab • Decomposition of text into theme and rheme • Linked to WordNet • Computation of: • intonation • gaze • gesture

  5. Virtual Training Environments • MRE (J. Gratch, L. Johnson, S. Marsella…, USC)

  6. Interactive System • Real estate agent • Gesture synchronized with speech and intonation • Small talk • Dialog partner

  7. MAX, S. Kopp, U of Bielefeld • Gesture understanding and imitation

  8. Gilbert and George at the Bank (UPenn, 1994)

  9. Greta

  10. Problem to Be Solved • Human communication is endowed with three devices to express communicative intention: • Verbs and formulas • Intonation and paralinguistics • Facial expression, gaze, gesture, body movement, posture… • Problem: For any communicative act, the Speaker has to decide: • Which nonverbal behaviors to show • How to execute them

  11. Verbal and Nonverbal Communication • Suppose I want to advise a friend to take her umbrella because it is raining. • Which signals do I use? • Verbal signal: use of a syntactically complex sentence: Take your umbrella because it is raining • Verbal + nonverbal signals: Take your umbrella + point to the window to show the rain, by a gesture or by gaze

  12. Multimodal Signals • The whole body communicates by using: • Verbal acts (words and sentences) • Prosody, intonation (nonverbal vocal signals) • Gesture (hand and arm movements) • Facial action (smile, frown) • Gaze (eyes and head movements) • Body orientation and posture (trunk and leg movements) • All these systems of signals have to cooperate in expressing overall meaning of communicative act.

  13. Multimodal Signals • Accompany flow of speech • Synchronized at the verbal level • Punctuate accented phonemic segments and pauses • Substitute for word(s) • Emphasize what is being said • Regulate the exchange of speaking turn

  14. Synchronization • There exists an isomorphism between patterns of speech, intonation and facial actions • Different levels of synchrony: • Phoneme level (blink) • Word level (eyebrow) • Phrase level (hand gesture) • Interactional synchrony: Synchrony between speaker and addressee

  15. Taxonomy of Communicative Functions (I. Poggi) • The speaker may provide three broad types of information: • Information about the world: deictic, iconic (adjectival),… • Information about the speaker’s mind: • belief (certainty, adjectival) • goal (performative, rheme/theme, turn-system, belief relation) • emotion • meta-cognitive • Information about the speaker’s identity (sex, culture, age…)

  16. Multimodal Signals (Isabella Poggi) • Characterization of multimodal signals by their placement with respect to the linguistic utterance and their significance in transmitting information. E.g.: • Raised eyebrow may signal surprise, emphasis, question mark, suggestion… • Smile may express happiness, be a polite greeting, be a backchannel signal… • Two pieces of information are needed to characterize multimodal signals: • Their meaning • Their visual action

  17. Expression meaning • deictic: this, that, here, there • adjectival: small, difficult • certainty: certain, uncertain… • performative: greet, request • topic-comment: emphasis • belief-relation: contrast,… • turn allocation: take/give turn • affective: anger, fear, happy-for, sorry-for, envy, relief,… Expression signal • Deictic: gaze direction • Certainty: Certain: palm up open hand; Uncertain: raised eyebrow • Adjectival: small eye aperture • Belief-relation: Contrast: raised eyebrow • Performative: Suggest: small raised eyebrow, head aside; Assert: horizontal ring • Emotion: Sorry-for: head aside, inner eyebrow up; Joy: raising fist up • Emphasis: raised eyebrows, head nod, beat • Lexicon = (meaning, signal)
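
The (meaning, signal) lexicon above can be pictured as a simple lookup table. The following is a minimal sketch only: the entries are copied from the slide, but the dictionary layout, the key format and the `signals_for` helper are assumptions made for illustration, not the format used by the actual system.

```python
# Hypothetical in-memory lexicon: (function class, meaning) -> list of signals.
# Entries come from the slide; the data layout is an assumption.
LEXICON = {
    ("certainty", "certain"): ["palm up open hand"],
    ("certainty", "uncertain"): ["raised eyebrow"],
    ("belief-relation", "contrast"): ["raised eyebrow"],
    ("performative", "suggest"): ["small raised eyebrow", "head aside"],
    ("performative", "assert"): ["horizontal ring"],
    ("affective", "sorry-for"): ["head aside", "inner eyebrow up"],
    ("affective", "joy"): ["raising fist up"],
    ("topic-comment", "emphasis"): ["raised eyebrows", "head nod", "beat"],
}


def signals_for(function, meaning):
    """Return the signals stored for a meaning, or an empty list if unknown."""
    return LEXICON.get((function.lower(), meaning.lower()), [])


suggest_signals = signals_for("Performative", "Suggest")
```

Returning an empty list for an unknown pair keeps the caller simple: a meaning without a lexicon entry just produces no nonverbal signal.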

  18. Representation Language • Affective Presentation Markup Language (APML) • describes the communicative functions • works at the meaning level and not the signal level • Example: <APML> <turn-allocation type="take turn"> <performative type="greet"> Good Morning, Angela. </performative> <affective type="happy"> It is so <topic-comment type="comment"> wonderful </topic-comment> to see you again. </affective> <certainty type="certain"> I was <topic-comment type="comment"> sure </topic-comment> we would do so, one day!</certainty> </turn-allocation> </APML>
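
Because APML is plain XML, the communicative functions in the example above can be listed with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the helper name `communicative_functions` and the tuple layout are assumptions for illustration, and this is not the parser used in the system described here.

```python
import xml.etree.ElementTree as ET

# The APML example from the slide, laid out over several lines.
APML = """<APML>
<turn-allocation type="take turn">
<performative type="greet">Good Morning, Angela.</performative>
<affective type="happy">It is so <topic-comment type="comment">wonderful</topic-comment> to see you again.</affective>
<certainty type="certain">I was <topic-comment type="comment">sure</topic-comment> we would do so, one day!</certainty>
</turn-allocation>
</APML>"""


def communicative_functions(apml_text):
    """List (tag, type, spanned text) for every element below the root."""
    root = ET.fromstring(apml_text)
    result = []
    for elem in root.iter():          # document order, root first
        if elem is root:
            continue
        # Join all nested text and normalise whitespace.
        text = " ".join("".join(elem.itertext()).split())
        result.append((elem.tag, elem.get("type"), text))
    return result


functions = communicative_functions(APML)
```

Each tuple pairs a meaning-level tag with the words it spans, which is the information a later stage needs in order to look up signals and time them against speech.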

  19. Facial Description Language • Facial expressions defined as (meaning, signal) pairs stored in a library • Hierarchical set of classes: • Facial basis (FB) class: basic facial movement • An FB may be represented as a set of MPEG-4 compliant FAPs or, recursively, as a combination of other FBs using the '+' operator • FB={fap3=v1,…,fap69=vk}; • FB'=c1*FB1+c2*FB2; • where c1 and c2 are constants and FB1 and FB2 can be: • Previously defined FBs • FBs of the form: {fap3=v1,…,fap69=vk}

  20. Facial basis class • Examples of facial basis class: • Eyebrow: small_frown, left_raise, right_raise • Eyelid: upper_lid_raise • Mouth: left_corner_stretch, left_corner_raise

  21. Facial Displays • Every facial display (FD) is made up of one or more FBs: • FD=FB1 + FB2 + FB3 + … + FBn; • surprise=raise_eyebrow+raise_lid+open_mouth; • worried=(surprise*0.7)+sadness;
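
The facial basis and facial display definitions amount to a small algebra: a basis is a set of FAP values, `+` merges two bases, and `*` scales one. The sketch below illustrates that reading under stated assumptions: the `FacialBasis` class, the choice to sum values on a shared FAP, and all FAP indices and values are hypothetical, chosen only to make the arithmetic visible.

```python
class FacialBasis:
    """A facial basis as a sparse map from FAP index to value (assumed layout)."""

    def __init__(self, faps):
        self.faps = dict(faps)

    def __add__(self, other):
        # Assumption: values on the same FAP are summed when bases are combined.
        merged = dict(self.faps)
        for index, value in other.faps.items():
            merged[index] = merged.get(index, 0.0) + value
        return FacialBasis(merged)

    def __mul__(self, factor):
        # Scaling multiplies every FAP value by the same constant.
        return FacialBasis({i: v * factor for i, v in self.faps.items()})

    __rmul__ = __mul__


# Illustrative values only; real FAP indices and amplitudes are not given in the slides.
raise_eyebrow = FacialBasis({31: 100, 32: 100})
raise_lid = FacialBasis({19: -50, 20: -50})
open_mouth = FacialBasis({3: 200})
sadness = FacialBasis({31: 60, 32: 60, 19: 30, 20: 30})

surprise = raise_eyebrow + raise_lid + open_mouth
worried = (surprise * 0.7) + sadness
```

With operator overloading, the Python expressions read almost exactly like the slide's `surprise=raise_eyebrow+raise_lid+open_mouth` and `worried=(surprise*0.7)+sadness`.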

  22. Facial Displays • Probabilistic mapping between the tags and signals: • E.g.: happy_for = (smile*0.5, 0.3) + (smile*0.25) + (smile*2 + raised_eyebrow, 0.35) + (nothing, 0.1) • Definition of a function class for addressee association (meaning, signal) • Class communicative function: • Certainty • Adjectival • Performative • Affective • …
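
A probabilistic mapping of this kind can be sampled with a weighted random choice. In the sketch below, the slide gives no probability for the second alternative, so 0.25 is assumed purely so that the weights sum to 1; the list layout and the `choose_signal` helper are also hypothetical.

```python
import random

# (signal expression, probability) alternatives for the meaning happy_for.
# The 0.25 weight on the second entry is an assumption; the slide omits it.
HAPPY_FOR = [
    ("smile*0.5", 0.3),
    ("smile*0.25", 0.25),
    ("smile*2 + raised_eyebrow", 0.35),
    ("nothing", 0.1),
]


def choose_signal(alternatives, rng=random):
    """Pick one signal expression according to its probability."""
    signals = [signal for signal, _ in alternatives]
    weights = [probability for _, probability in alternatives]
    return rng.choices(signals, weights=weights, k=1)[0]


# A seeded generator makes a run reproducible.
picked = choose_signal(HAPPY_FOR, random.Random(0))
```

Passing the random generator as a parameter keeps the selection reproducible in tests while still allowing varied behavior at run time.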

  23. Facial Temporal Course

  24. Gestural Lexicon • Certainty: • Certain: palm up open hand • Uncertain: showing empty hands while lowering forearms • Belief-relation: • List of items of same class: numbering on fingers • Temporal relation: fist with extended hand moves back and forth behind one’s shoulder • Turn-taking: • Hold the floor: raise hand, palm toward hearer • Performative: • Assert: horizontal ring • Reproach: extended index, palm to left, rotating up & down on wrist • Emphasis: beat

  25. Gesture Specification Language • Scripting language for hand-arm gestures, based on formational parameters [Stokoe]: • Hand shape specified using HamNoSys [Prillwitz et al.] • Arm position: concentric squares in front of agent [McNeill] • Wrist orientation: palm and finger base orientation • Gestures are defined by a sequence of timed key poses: gesture frames • Gestures are broken down temporally into distinct (optional) phases: • Gesture phase: preparation, stroke, hold, retraction • Change of formational components over time
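
A gesture described as a sequence of timed key poses, each tagged with a phase, maps naturally onto a pair of small record types. The sketch below is an assumption about how such a specification could be represented in memory; the field names, the symbolic hand-shape and sector labels, and the timings are hypothetical and do not reproduce the actual scripting language.

```python
from dataclasses import dataclass, field
from typing import List

PHASES = ("preparation", "stroke", "hold", "retraction")


@dataclass
class GestureFrame:
    phase: str             # one of PHASES
    time: float            # seconds from gesture start
    hand_shape: str        # symbolic hand-shape name (hypothetical label)
    arm_sector: str        # position in the gesture space (hypothetical label)
    palm_orientation: str  # wrist orientation (hypothetical label)

    def __post_init__(self):
        if self.phase not in PHASES:
            raise ValueError(f"unknown gesture phase: {self.phase}")


@dataclass
class Gesture:
    name: str
    frames: List[GestureFrame] = field(default_factory=list)

    def phase_order_is_valid(self):
        """Phases are optional but must appear in order, with increasing times."""
        indices = [PHASES.index(f.phase) for f in self.frames]
        times = [f.time for f in self.frames]
        return indices == sorted(indices) and times == sorted(times)


# "Certain" is rendered as a palm-up open hand in the gestural lexicon.
certain = Gesture("certain", [
    GestureFrame("preparation", 0.0, "open_hand", "center", "palm_up"),
    GestureFrame("stroke", 0.4, "open_hand", "center_low", "palm_up"),
    GestureFrame("retraction", 0.9, "relaxed", "rest", "palm_down"),
])
```

The `hold` phase is simply absent here, which matches the slide's remark that phases are optional.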

  26. Gesture specification example: Certain

  27. Gesture Temporal Course • Figure: rest position, preparation, stroke start to stroke end, retraction, rest position

  28. ECA architecture

  29. ECA Architecture • Input to the system: APML-annotated text • Output of the system: animation files and a WAV file for the audio • System: • Interprets APML-tagged dialogs, i.e. all communicative functions • Looks up in a library the mapping between the meaning (specified by the XML tag) and signals • Decides which signals to convey on which modalities • Synchronizes the signals with speech at different levels (word, phoneme or utterance)

  30. Behavioral Engine

  31. Modules • APML Parser: XML parser • TTS Festival: manages the speech synthesis and gives the list of phonemes and their durations • Expr2Signal Converter: given a communicative function and its meaning, this module returns the list of facial signals • Conflicts Resolver: resolves the conflicts that may happen when more than one facial signal should be activated on the same facial part • Face Generator: converts the facial signals into MPEG-4 FAP values • Viseme Generator: converts each phoneme, given by Festival, into a set of FAPs • MPEG4 FAP Decoder: an MPEG-4 compliant Facial Animation Engine

  32. TTS Festival • Drives the synchronization of facial expressions • Synchronization implemented at word level • Timing of a facial expression connected to the text embedded between the markers • Use of the tree structure of Festival to compute expression duration
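
Word-level synchronization follows from phoneme durations: summing the phonemes of each word gives word boundaries, and an expression lasts from the first to the last word enclosed by its markers. The sketch below does not call Festival; the input format, the helper names and the millisecond durations are assumptions used only to show the arithmetic.

```python
def word_timings(words):
    """words: list of (word, [phoneme durations in ms]).

    Returns a list of (word, start_ms, end_ms).
    """
    timings = []
    clock = 0
    for word, durations in words:
        end = clock + sum(durations)
        timings.append((word, clock, end))
        clock = end
    return timings


def expression_interval(timings, first_word, last_word):
    """Start and end time of an expression spanning the given word indices."""
    return timings[first_word][1], timings[last_word][2]


# Hypothetical phoneme durations for "It is so wonderful".
words = [
    ("It", [60, 50]),
    ("is", [70, 80]),
    ("so", [90, 110]),
    ("wonderful", [80, 60, 70, 60, 90, 70, 80, 90]),
]
timings = word_timings(words)
comment_interval = expression_interval(timings, 3, 3)    # tag around "wonderful"
affective_interval = expression_interval(timings, 0, 3)  # tag around the whole phrase
```

A nested tag such as the comment on "wonderful" therefore gets an interval contained in that of the enclosing affective tag.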

  33. Expr2Signal Converter • Instantiation of APML tags: meaning of a given communicative function • Converts markers into facial signals • Use of a library containing the lexicon of the type (meaning, facial expressions)

  34. Gaze Model • Based on the communicative functions model of Isabella Poggi • This model predicts what the value of gaze should be in order to convey a given meaning in a given conversational context. • For example: • if the agent wants to emphasize a given word, the model will output that the agent should gaze at her conversant.

  35. Gaze Model • Very deterministic behavior model: to every Communicative Function associated with a meaning corresponds the same signal (with probabilistic changes) • Event-driven model: the associated signals are computed only when a Communicative Function is specified • The corresponding behavior may vary only when a Communicative Function is specified

  36. Gaze Model • Several drawbacks as there is no temporal consideration: • No consideration of past and current gaze behavior to compute the new one • No consideration of how long the current gaze state of S and L has lasted

  37. Gaze Algorithm • Two steps: • Communicative prediction: • Apply the communicative function model to compute the gaze behavior so as to convey a given meaning for S and L • Statistical prediction: • The communicative gaze model is probabilistically modified by a statistical model defined with constraints: • what is the communicative gaze behavior of S and L • in which gaze behavior S and L were • the duration of the current state of S and L

  38. Temporal Gaze Parameters • The gaze behaviors depend on the communicative functions, the general purpose of the conversation (persuasive discourse, teaching...), personality, cultural roots, social relations... • A full model would be too complex; instead, propose parameters that control the overall gaze behavior: • $T^{\max}_{S=1,L=1}$: maximum duration the mutual gaze state may remain active. • $T^{\max}_{S=1}$: maximum duration of gaze state S=1. • $T^{\max}_{L=1}$: maximum duration of gaze state L=1. • $T^{\max}_{S=0}$: maximum duration of gaze state S=0. • $T^{\max}_{L=0}$: maximum duration of gaze state L=0.
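
The five maximum-duration parameters can be read as caps on how long each gaze state may last. The sketch below shows only that capping step, not the statistical model itself: the `GazeLimits` record, the `apply_limits` function, the numeric limits, and the rule that the speaker is the one who averts when mutual gaze lasts too long are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class GazeLimits:
    max_s1: float      # maximum duration of state S=1 (speaker looks at listener)
    max_s0: float      # maximum duration of state S=0
    max_l1: float      # maximum duration of state L=1 (listener looks at speaker)
    max_l0: float      # maximum duration of state L=0
    max_mutual: float  # maximum duration of mutual gaze (S=1 and L=1)


def apply_limits(s, l, dur_s, dur_l, dur_mutual, limits):
    """Return the (s, l) gaze states after enforcing the maximum durations.

    dur_s and dur_l are how long S and L have been in their current state;
    dur_mutual is how long mutual gaze has lasted (all in seconds).
    """
    new_s, new_l = s, l
    if dur_s >= (limits.max_s1 if s == 1 else limits.max_s0):
        new_s = 1 - s
    if dur_l >= (limits.max_l1 if l == 1 else limits.max_l0):
        new_l = 1 - l
    # Assumption: when mutual gaze has lasted too long, the speaker looks away.
    if new_s == 1 and new_l == 1 and dur_mutual >= limits.max_mutual:
        new_s = 0
    return new_s, new_l


limits = GazeLimits(max_s1=3.0, max_s0=2.0, max_l1=4.0, max_l0=1.5, max_mutual=1.0)
after_long_mutual = apply_limits(1, 1, 0.5, 0.5, 1.2, limits)
```

In a fuller model these caps would act on top of the gaze states proposed by the communicative prediction step.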

  39. Mutual Gaze

  40. Gaze Aversion

  41. Gesture Planner • Adaptive instantiation: • Preparation and retraction phase adjustments • Transition key and rest gesture insertion • Joint-chain follow-through • Forward time shifting of child joints • Stroke of gesture on stressed word • Stroke expansion • During planning phase, identify rheme clauses with closely repeated emphases/pitch accents • Indicate secondary accents by repeating the stroke of the primary gesture with decreasing amplitude
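
Stroke expansion, as described above, repeats the primary stroke on each secondary accent with a smaller amplitude. A minimal sketch follows, assuming a geometric decay; the `decay` factor and the function name are hypothetical, since the slides do not say how the amplitude decreases.

```python
def expand_stroke(primary_amplitude, accent_times, decay=0.7):
    """Return (time, amplitude) strokes, one per accent.

    The first accent gets the full amplitude; each repeat is scaled by `decay`
    (an assumed geometric fall-off).
    """
    strokes = []
    amplitude = primary_amplitude
    for time in accent_times:
        strokes.append((time, amplitude))
        amplitude *= decay
    return strokes


# Three pitch accents in one rheme clause, halving the amplitude each time.
strokes = expand_stroke(1.0, [0.2, 0.6, 0.9], decay=0.5)
```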

  42. Gesture Planner • Determination of gesture: • Look in dictionary • Selection of gesture • Gestures associated with the most embedded tags have priority (except beat): adjectival, deictic • Duration of gesture: • Coarticulation between successive gestures close in time • Hold for gestures belonging to tags higher up in the tag hierarchy (e.g. performative, belief-relation) • Otherwise go to rest position
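
The selection rule (most embedded tag wins, except for beats) can be sketched as a maximum over nesting depth with beats set aside. The candidate format and the fallback to a beat when nothing else is available are assumptions made for illustration.

```python
def select_gesture(candidates):
    """candidates: list of (nesting_depth, gesture_name).

    The gesture from the most deeply embedded tag wins; a beat is chosen
    only when no other gesture is available (assumed fallback).
    """
    non_beats = [c for c in candidates if c[1] != "beat"]
    pool = non_beats or candidates
    if not pool:
        return None
    return max(pool, key=lambda c: c[0])[1]


chosen = select_gesture([(1, "horizontal ring"), (3, "beat"), (2, "deictic point")])
```

Here the beat sits on the deepest tag but is skipped, so the deictic gesture at depth 2 is selected over the performative gesture at depth 1.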

  43. Behavior Expressivity • Behavior is related to (Wallbott, 1998): • quality of the mental state (e.g. emotion) it refers to • quantity (somehow linked to the intensity factor of the mental state) • Behaviors encode: • content information (the ‘what is communicated’) • expressive information (the ‘how it is communicated’) • Behavior expressivity refers to the manner of execution of the behavior

  44. Expressivity Dimensions • Spatial: amplitude of movement • Temporal: duration of movement • Power: dynamic property of movement • Fluidity: smoothness and continuity of movement • Repetitiveness: tendency to rhythmic repeats • Overall Activation: quantity of movement across modalities

  45. Overall Activation • Threshold filter on atomic behaviors during APML tag matching • Determines the number of nonverbal signals to be executed.
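
A threshold filter of this kind can be sketched as follows. The slides do not say what quantity is compared with the threshold, so the per-behavior `importance` score, the `1 - overall_activation` threshold and the function name are all assumptions; the sketch only shows how one parameter can control how many signals survive.

```python
def filter_by_activation(behaviors, overall_activation):
    """behaviors: list of (name, importance in [0, 1]).

    Keeps the behaviors whose importance reaches the threshold
    1 - overall_activation, so a higher activation lets more signals through.
    """
    threshold = 1.0 - overall_activation
    return [name for name, importance in behaviors if importance >= threshold]


behaviors = [("smile", 0.9), ("head nod", 0.5), ("beat", 0.2)]
low = filter_by_activation(behaviors, 0.25)
high = filter_by_activation(behaviors, 1.0)
```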

  46. Spatial Parameter • Amplitude of movement controlled through asymmetric scaling of the reach space that is used to find IK goal positions • Expand or condense the entire space in front of agent
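
Asymmetric scaling of the reach space can be pictured as scaling each IK goal position about a center point, with a different gain per axis. The sketch below is an assumption about one way to do this: the `axis_gain` weights, the range of the `spatial` parameter and the choice of center are hypothetical.

```python
def scale_goal(goal, center, spatial, axis_gain=(1.0, 0.5, 0.25)):
    """Scale an IK goal position about `center`.

    goal, center: (x, y, z) tuples in the same units.
    spatial: expressivity value, assumed to lie in [-1, 1]; 0 leaves the goal unchanged.
    axis_gain: assumed per-axis weights that make the scaling asymmetric.
    """
    return tuple(
        c + (g - c) * (1.0 + spatial * k)
        for g, c, k in zip(goal, center, axis_gain)
    )


expanded = scale_goal((10.0, 20.0, 30.0), (0.0, 0.0, 0.0), 1.0)
condensed = scale_goal((10.0, 20.0, 30.0), (0.0, 0.0, 0.0), -1.0)
```

A positive value expands the space in front of the agent and a negative value condenses it, with the first axis affected most under these assumed gains.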

  47. Temporal parameter • Determine the speed of the arm movement of a gesture's meaning-carrying stroke phase • Modify speed of stroke • Figure: stroke shift / velocity control of a beat gesture (Y position of wrist w.r.t. shoulder [cm] vs. frame #)

  48. Fluidity • Continuity control of TCB interpolation splines and gesture-to-gesture coarticulation • Continuity of arms’ trajectory paths • Control the velocity profiles of an action • Figure: X position of wrist w.r.t. shoulder [cm] vs. frame #

  49. Power • Tension and Bias control of TCB splines • Overshoot reduction • Acceleration and deceleration of limbs • Hand shape control for gestures that do not need hand configuration to convey their meaning (beats).
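
Fluidity and Power both act on TCB (Kochanek-Bartels) splines, whose tension, continuity and bias parameters shape the tangents at each key pose. The sketch below implements the standard tangent formulas for one scalar coordinate (for example the wrist's X position) together with cubic Hermite interpolation; how the expressivity values map onto tension, continuity and bias is not given in the slides, so the sample parameter values are assumptions.

```python
def tcb_tangents(p_prev, p, p_next, tension=0.0, continuity=0.0, bias=0.0):
    """Incoming and outgoing Kochanek-Bartels tangents at key value p."""
    t, c, b = tension, continuity, bias
    before = p - p_prev   # chord arriving at p
    after = p_next - p    # chord leaving p
    incoming = 0.5 * (1 - t) * ((1 + b) * (1 - c) * before + (1 - b) * (1 + c) * after)
    outgoing = 0.5 * (1 - t) * ((1 + b) * (1 + c) * before + (1 - b) * (1 - c) * after)
    return incoming, outgoing


def hermite(p0, m0, p1, m1, s):
    """Cubic Hermite interpolation between p0 and p1 for s in [0, 1]."""
    s2, s3 = s * s, s * s * s
    return ((2 * s3 - 3 * s2 + 1) * p0 + (s3 - 2 * s2 + s) * m0
            + (-2 * s3 + 3 * s2) * p1 + (s3 - s2) * m1)


# Key wrist positions 0, 1, 4 (arbitrary units); default parameters give Catmull-Rom tangents.
smooth = tcb_tangents(0.0, 1.0, 4.0)
tense = tcb_tangents(0.0, 1.0, 4.0, tension=1.0)
broken = tcb_tangents(0.0, 1.0, 4.0, continuity=1.0)
```

With all three parameters at zero the incoming and outgoing tangents coincide, giving a smooth path; full tension collapses both tangents to zero, and a non-zero continuity makes them differ, which produces a visible corner at the key pose.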