multimodal emotion recognition


a.k.a. ‘better than the sum of its parts’

Kostas Karpouzis

Assoc. researcher, ICCS/NTUA


multimodal emotion recognition
  • Three very different (and interesting!) problems
    • What is ‘multimodal’, why do we need it, what do we earn from that?
    • What is ‘emotion’ in HCI applications?
    • What can we recognize and, better yet, what should we recognize?
multimodal emotion recognition
  • In terms of R&D, emotion/affect-aware human-computer interaction is a hot topic
    • Novel, interesting application for existing algorithms
    • Demanding test bed for feature extraction and recognition tasks
    • …and just wait until we bring humans in the picture!
multimodal emotion recognition
  • In terms of R&D, emotion/affect-aware human-computer interaction is a hot topic
    • Dedicated conferences (e.g. ACII, IVA, etc.) and planned journals
    • Humaine Network of Excellence → Humaine Association
    • Integrated Projects (CALLAS, Companions, LIREC, Feelix Growing, etc.)
yours truly
  • Associate researcher at ICCS/NTUA, Athens
  • Completed post-doc within Humaine
    • Signals to signs of emotion
    • Co-editor of Humaine Handbook
  • Member of the EC of the Humaine Association
  • Emotion modelling and development in Callas, Feelix Growing FP6 Projects
what next
  • first we define ‘emotion’
    • terminology
    • semantics and representations
    • computational models
    • emotion in interaction
    • emotion in natural interaction
what next
  • then ‘multimodal’
    • modalities related to emotion and interaction
    • fusing modalities (how?, why?)
    • handling uncertainty, noise, etc.
    • which features from each modality?
    • semantics of fusion
what next
  • and ‘recognition’
    • from individual modalities (uni-modal)
    • across modalities (multi-modal)
    • static vs. dynamic recognition
    • what can we recognize?
      • can we extend/enrich that?
    • context awareness
what next
  • affect and emotion aware applications
    • can we benefit from knowing a user’s emotional state?
  • missing links
    • open research questions for the following years
  • Emotions, mood, personality
  • Can be distinguished by
    • time (short-term vs. long-term)
    • influence (unnoticed vs. dominant)
    • cause (specific vs. diffuse)
  • Affect classified by time
    • short-term: emotions (dominant, specific)
    • medium-term: moods (unnoticed, diffuse)
    • and long-term: personality (dominant)
  • what we perceive is the expressed emotion at a given time
    • on top of a person’s current mood, which may change over time, but not drastically
    • and on top of their personality
      • usually considered a base line level
  • which may differ from what a person feels
    • e.g. we despise someone, but are forced to be polite
  • Affect is an innately structured, non-cognitive evaluative sensation that may or may not register in consciousness
  • Feeling is defined as affect made conscious, possessing an evaluative capacity that is not only physiologically based, but that is often also psychologically oriented.
  • Emotion is psychosocially constructed, dramatized feeling
how it all started
  • Charles Darwin, 1872
  • Ekman et al. since the 60s
  • Mayer and Salovey, papers on emotional intelligence, 90s
  • Goleman’s book: Emotional Intelligence: Why It Can Matter More Than IQ
  • Picard’s book: Affective Computing, 1997
why emotions?
  • “Shallow” improvement of subjective experience
  • Reason about emotions of others
    • To improve usability
    • Get a handle on another aspect of the "human world"
    • Affective user modeling
    • Basis for adaptation of software to users
name that emotion
  • so, we know what we’re after
    • but we have to assign it a name
    • on which we all agree
    • and means the same thing for all (most?) of us
  • different emotion representations
    • different context
    • different applications
    • different conditions/environments
emotion representations
  • most obvious: labels
    • people use them in everyday life
    • ‘happy’, ‘sad’, ‘ironic’, etc.
    • may be extended to include user states, e.g. ‘tired’, which are not emotions
    • CS people like them
      • good match for classification algorithms
  • but…
    • we have to agree on a finite set
      • if we don’t, we’ll have to change the structure of our neural nets with each new label
    • labels don’t work well with measurements
      • is ‘joy’ << ‘exhilaration’ and in what scale?
      • do scales mean the same to the expresser and all perceivers?
  • Ekman’s set is the most popular
    • ‘anger’, ‘disgust’, ‘fear’, ‘joy’, ‘sadness’, and ‘surprise’
    • added ‘contempt’ in the process
  • Main difference to other sets of labels:
    • universally recognizable across cultures
    • when confronted with a smile, all people will recognize ‘joy’
from labels to machine learning
  • when reading the claim that ‘there are six facial expressions recognized universally across cultures’…
  • …CS people misunderstood, causing a whole lot of issues that still dominate the field
strike #1
  • ‘we can only recognize these six expressions’
  • as a result, all video databases used to contain images of sad, angry, happy or fearful people
  • a while later, the same authors discussed ‘contempt’ as a possible universal, but CS people weren’t listening
strike #2
  • ‘only these six expressions exist in human expressivity’
  • as a result, more sad, angry, happy or fearful people, even when data involved HCI
    • can you really be afraid when using your computer?
strike #3
  • ‘we can only recognize extreme emotions’
  • now, happy people grin, sad people cry and fearful people are scared to death
  • however, extreme emotions are scarce in everyday life
    • so, subtle emotions and additional labels were out of the picture
labels are good, but…
  • don’t cover subtle emotions and natural expressivity
    • many more emotions occur in everyday life, and they are usually masked
    • hence the need for alternative emotion representations
  • can’t approach dynamics
  • can’t approach magnitude
    • extreme joy is not defined
other sets of labels
  • Plutchik
    • Acceptance, anger, anticipation, disgust, joy, fear, sadness, surprise
    • Relation to adaptive biological processes
  • Frijda
    • Desire, happiness, interest, surprise, wonder, sorrow
    • Forms of action readiness
  • Izard
    • Anger, contempt, disgust, distress, fear, guilt, interest, joy, shame, surprise
other sets of labels
  • James
    • Fear, grief, love, rage
    • Bodily involvement
  • McDougall
    • Anger, disgust, elation, fear, subjection, tender-emotion, wonder
    • Relation to instincts
  • Oatley and Johnson-Laird
    • Anger, disgust, anxiety, happiness, sadness
    • Do not require propositional content
going 2D
  • vertical: activation (active/passive)
  • horiz.: evaluation (negative/positive)
going 2D
  • emotions correspond to points in 2D space
  • evidence that some vector operations are valid, e.g. ‘fear’ + ‘sadness’ = ‘despair’
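The point-in-2D-space idea, including the vector operation above, can be sketched as follows (the coordinates are illustrative assumptions for the sketch, not a published mapping):

```python
import math

# Illustrative (evaluation, activation) coordinates -- hypothetical
# values chosen for this sketch, not taken from a published mapping.
EMOTIONS = {
    "joy":     ( 0.8,  0.5),
    "fear":    (-0.6,  0.7),
    "sadness": (-0.7, -0.5),
    "despair": (-1.0,  0.1),
}

def combine(a, b):
    """Vector sum of two emotion points in 2D space."""
    return (a[0] + b[0], a[1] + b[1])

def nearest(point, labels):
    """Label whose point lies closest (Euclidean) to the given point."""
    return min(labels, key=lambda name: math.dist(point, labels[name]))

# in this toy layout, 'fear' + 'sadness' lands nearest to 'despair'
mixed = combine(EMOTIONS["fear"], EMOTIONS["sadness"])
```

With this layout, `nearest(mixed, EMOTIONS)` returns `"despair"`, which is the kind of evidence the slide alludes to.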
going 2D
  • quadrants useful in some applications
    • e.g. need to detect extreme expressivity in a call-centre application
going 3D
  • Plutchik adds another dimension
  • vertical → intensity, circle → degrees of similarity
    • four pairs of opposites
going 3D
  • Mehrabian considers pleasure, arousal and dominance
  • Again, emotions are points in space
what about interaction?
  • these models describe the emotional state of the user
  • no insight as to what happened, why the user reacted and how the user will react
    • action selection
  • OCC (Ortony, Clore, Collins)
  • Scherer’s appraisal checks
OCC (Ortony, Clore, Collins)
  • each event, agent and object has properties
    • used to predict the final outcome/expressed emotion/action
OCC (Ortony, Clore, Collins)
  • Appraisals
    • Assessments of events, actions, objects
  • Valence
    • Whether emotion is positive or negative
  • Arousal
    • Degree of physiological response
  • Generating appraisals
    • Domain-specific rules
    • Probability of impact on agent’s goals
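A domain-specific appraisal rule of the kind listed above can be sketched as a toy function (illustrative only; a real OCC model distinguishes events, agents and objects and many more emotion types, and the goal names here are made up):

```python
# Toy OCC-style appraisal: the desirability of an event is its weighted
# impact on the agent's goals; the sign gives valence, the magnitude a
# crude arousal estimate.  Goal names are hypothetical.
def appraise(event_impacts, goals):
    """Return (valence, arousal) for an event given goal importances."""
    desirability = sum(weight * goals.get(goal, 0.0)
                       for goal, weight in event_impacts.items())
    valence = "positive" if desirability > 0 else "negative"
    arousal = min(1.0, abs(desirability))
    return valence, arousal

# an event that strongly furthers the 'win_game' goal:
# appraise({"win_game": 0.8}, {"win_game": 1.0}) -> ("positive", 0.8)
```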
Scherer’s appraisal checks

2 theoretical approaches:

  • “Discrete emotions” (Ekman, 1992; Ekman & Friesen, 1975: EMFACS)
  • “Appraisal theory” of emotion (Scherer, 1984, 1992)
Scherer’s appraisal checks
  • Componential Approach
    • Emotions are elicited by a cognitive evaluation of antecedent events.
    • The patterning of reactions is shaped by this appraisal process. Appraisal dimensions are used to evaluate the stimulus, adapting to changes in it.
  • Appraisal Dimensions: Evaluation of significance of event, coping potential, and compatibility with the social norms

Autonomic responses contribute to the intensity of the emotional experience:
  • general autonomic arousal (heart races)
  • the particular emotion experienced (fear)
  • the emotion experienced will affect future interpretations of stimuli and continuing autonomic arousal
Scherer’s appraisal checks
  • 2 theories, 2 sets of predictions: the example of Anger
summary on emotion
  • perceived emotions are usually short-lasting events across modalities
  • labels and dimensions are used to annotate perceived emotions
    • pros and cons for each
  • additional requirements for interactive applications
a definition
  • Raisamo, 1999
  • “Multimodal interfaces combine many simultaneous input modalities and may present the information using synergistic representation of many different output modalities”
Twofold view
  • A Human-Centered View
    • common in psychology
    • often considers human input channels, i.e., computer output modalities, and most often vision and hearing
    • applications: a talking head, audio-visual speech recognition, ...
  • A System-Centered View
    • common in computer science
    • a way to make computer systems more adaptable
going multimodal
  • ‘multimodal’ is this decade’s ‘affective’!
  • plethora of modalities available to capture and process
    • visual, aural, haptic…
    • ‘visual’ can be broken down to ‘facial expressivity’, ‘hand gesturing’, ‘body language’, etc.
    • ‘aural’ to ‘prosody’, ‘linguistic content’, etc.
multimodal design

Adapted from [Maybury and Wahlster, 1998]

paradigms for multimodal user interfaces
  • Computer as a tool
    • multiple input modalities are used to enhance direct manipulation behavior of the system
    • the machine is a passive tool and tries to understand the user through all different input modalities that the system recognizes
    • the user is always responsible for initiating the operations
    • follows the principles of direct manipulation [Shneiderman, 1982; 1983]
paradigms for multimodal user interfaces
  • Computer as a dialogue partner
    • the multiple modalities are used to increase the anthropomorphism in the user interface
    • multimodal output is important: talking heads and other human-like modalities
    • speech recognition is a common input modality in these systems
    • can often be described as an agent-based conversational user interface
why multimodal?
  • well, why not?
    • recognition from traditional unimodal databases had reached its ceiling
    • new kinds of data available
  • what’s in it for me?
    • have recognition rates improved?
    • or have we just introduced more uncertain features?
essential reading
  • Communications of the ACM,Nov. 1999, Vol. 42, No. 11, pp. 74-81
putting it all together
  • myth #1: If you build a multimodal system, users will interact multimodally
    • Users have a strong preference to interact multimodally rather than unimodally
    • no guarantee that they will issue every command to a system multimodally
    • users express commands multimodally when describing spatial information, but not when e.g. they print something
putting it all together
  • myth #2: Speech and pointing is the dominant multimodal integration pattern
  • myth #3: Multimodal input involves simultaneous signals
    • consider the McGurk effect:
    • when the spoken sound /ba/ is superimposed on the video of a person uttering /ga/, most people perceive the speaker as uttering the sound /da/.
    • opening the mouth does not coincide temporally with uttering a word
putting it all together
  • myth #4: Speech is the primary input mode in any multimodal system that includes it
    • Mehrabian indicates that most of the conveyed message is contained in facial expressions
      • wording → 7%, paralinguistic → 38%
    • Do you talk to your computer?
    • People look at the face and body more than any other channel when they judge nonverbal behavior [Ambady and Rosenthal, 1992].
putting it all together
  • myth #6: multimodal integration involves redundancy of content between modes
  • you have features from a person’s
    • facial expressions and body language
    • speech prosody and linguistic content,
    • even their heartbeat rate
  • so, what do you do when their face tells you something different than their …heart?
putting it all together
  • myth #7: Individual error-prone recognition technologies combine multimodally to produce even greater unreliability
  • wait for multimodal results later
  • hint:
    • facial expressions + speech >> facial expressions!
    • facial expressions + speech > speech!
but it can be good
  • what happens when one of the available modalities is not robust?
    • better yet, when the ‘weak’ modality changes over time?
  • consider the ‘bartender problem’
    • very little linguistic content reaches its target
    • mouth shape available (viseme)
    • limited vocabulary
fusing modalities
  • so you have features and/or labels from a number of modalities
  • if they all agree…
    • no problem, shut down your PC and go for a beer!
  • but life is not always so sweet 
    • so how do you decide?
fusing modalities
  • two main fusion strategies
    • feature-level (early, direct)
    • decision level (late, separate)
  • and some complicated alternatives
    • dominant modality (a dominant modality drives the perception of others) – example?
    • hybrid, majority vote, product, sum, weighted (all statistical!)
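The statistical combination rules named above can be sketched over per-modality class posteriors (a toy decision-level example with made-up scores; real systems normalise and calibrate these first):

```python
from collections import Counter

def sum_rule(posteriors):
    """Add each class's scores across modalities; pick the max."""
    classes = posteriors[0].keys()
    return max(classes, key=lambda c: sum(p[c] for p in posteriors))

def product_rule(posteriors):
    """Multiply scores across modalities (assumes independence)."""
    classes = posteriors[0].keys()
    def prod(c):
        result = 1.0
        for p in posteriors:
            result *= p[c]
        return result
    return max(classes, key=prod)

def majority_vote(posteriors):
    """Each modality votes for its top class; most votes wins."""
    votes = Counter(max(p, key=p.get) for p in posteriors)
    return votes.most_common(1)[0][0]

# per-modality posteriors for face, prosody, gesture (made-up numbers)
face    = {"anger": 0.6,  "joy": 0.4}
prosody = {"anger": 0.3,  "joy": 0.7}
gesture = {"anger": 0.55, "joy": 0.45}
```

Note how the rules can disagree: with these numbers, majority voting returns "anger" while the sum and product rules return "joy", which is exactly the "how do you decide?" problem of the previous slide.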
fusing modalities
  • feature-level
    • one expert for all features
    • may lead to high dimensional feature spaces and very complex datasets
    • what happens within each modality is collapsed to a 1-D feature vector
    • features from robust modalities are considered in the same manner as those from uncertain ones
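The collapse into a single feature vector is trivial to sketch (the dimensionalities below are made up for illustration):

```python
# Feature-level (early) fusion: per-modality feature vectors are simply
# concatenated into one vector for a single classifier.  The sizes here
# are hypothetical, chosen only to illustrate the idea.
face_feats    = [0.2] * 24   # e.g. facial feature-point distances
speech_feats  = [0.5] * 32   # e.g. prosodic functionals
gesture_feats = [0.1] * 6    # e.g. expressivity parameters

fused = face_feats + speech_feats + gesture_feats   # one 62-D vector
# a single classifier is then trained on `fused`, so robust and
# uncertain modalities are treated alike -- the drawback noted above
```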
fusing modalities
  • feature-level
    • as a general rule, sets of correlated features and sets of most relevant features determine the decision
    • features may need clean-up!
    • e.g. a neural net will depend on relevant features (and indicate them!) after successful training
    • inconsistent features assigned lower weights
fusing modalities
  • decision-level
    • one expert for each modality
    • fails to model interplay between features across modalities
      • e.g. a particular phoneme is related with a specific lip formation
      • perhaps some are correlated, so selecting just one would save time and complexity
    • assigning weights is always a risk
    • what happens if your robust (dominant?) modality changes over time?
    • what happens if unimodal decisions differ?
fusing modalities
  • decision-level
    • if you have a robust modality (and you know which), you can get good, consistent results
    • sometimes, a particular modality is dominant
      • e.g. determined by the application
    • however, in practice, feature-based fusion outperforms decision-level
      • even by that much…
fusing modalities
  • for a specific user
    • dominant modality can be identified almost immediately
    • remains highly consistent over a session
    • remains stable across their lifespan
    • highly resistant to change, even when they are given strong selective reinforcement or explicit instructions to switch patterns
  • S. Oviatt, “Toward Adaptive Information Fusion in Multimodal Systems”
fusing modalities
  • humans are able to recognize an emotional expression in face images with about 70-98% accuracy
    • 80-98% automatic recognition on 5-7 classes of emotional expression from face images
    • computer speech recognition: 90% accuracy on neutrally-spoken speech → 50-60% accuracy on emotional speech
    • 81% automatic recognition on 8 categories of emotion from physiological signals
again, why multimodal?
  • holy grail: assigning labels to different parts of human-human or human-computer interaction
  • yes, labels can be nice!
    • humans do it all the time
    • and so do computers (it’s called classification!)
    • OK, but what kind of label?
      • GOTO STEP 1 
it’s all about the data!
  • Sad, but true 
    • very few multimodal (audiovisual) databases exist
    • lots of unimodal, though
    • lots of acted emotion
  • comprehensive list at
acted, natural, or…?
  • Acted is easy!
    • just put together a group of students/volunteers and hand them a script
  • Studies show that acted facial expressions are different than real ones
    • both feature- and activation-wise
    • can’t train on acted and test on real
acted, natural, or…?
  • Natural is hard…
    • people don’t usually talk to microphones or look into cameras
    • emotions can be masked, blended, subtle…
  • What about induced?
    • The SAL technique (a la Wizard of Oz or Eliza)
    • Computer provides meaningless cues to facilitate discussion
    • Should you induce sadness or anger?
recognition from speech prosody
  • Historically, one of the earliest attempts at emotion recognition
  • Temporal unit: tune
    • a segment between two pauses
    • emotion does not change within a tune!
    • but also some suprasegmental efforts (extending over more than one sound segment)
recognition from speech prosody
  • Most approaches are based on pitch (F0)
    • and statistical measures on it
    • e.g. distance between peaks/between pauses, etc. [Batliner et al.]
recognition from speech prosody
  • Huge number of available features
    • all of them relevant?
    • imminent need to clean up
    • correlation, ANOVA, sensitivity analysis
    • irrelevant features hamper training
    • good results even with 32 features
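A few of the classic F0 functionals can be sketched as follows (a toy sketch; real feature sets run to hundreds of such functionals, and the contour values below are made up):

```python
import statistics

def prosodic_features(f0_contour):
    """A handful of statistical functionals over a tune's F0 contour.
    Frames with F0 == 0 are treated as unvoiced and skipped."""
    voiced = [f for f in f0_contour if f > 0]
    return {
        "f0_mean":  statistics.mean(voiced),
        "f0_range": max(voiced) - min(voiced),
        "f0_std":   statistics.stdev(voiced),
    }

# one tune's per-frame pitch values in Hz (made-up numbers)
feats = prosodic_features([0.0, 120.0, 130.0, 0.0, 150.0, 140.0])
```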
recent findings
  • Batliner et al, from Humaine NoE
  • The impact of erroneous F0 extraction
    • recent studies question the role of pitch as the most important prosodic feature
    • manually corrected pitch outperforms automatically extracted pitch
    • extraction errors?
recent findings
  • Voice quality and emotion
    • claims that voice quality serves the marking of emotions are not verified in natural speech; they mostly hold for acted or synthesized data
    • at first sight, some emotions might display higher frequencies of laryngealizations
    • rather, it is a combination of speaker-specific traits and lexical/segmental characteristics that causes the specific distribution
recent findings
  • Impact of feature type and functionals on classification performance
  • Emotion recognition with reverberated and noisy speech
    • good microphone quality (close-talk microphone), artificially reverberated speech, and low microphone quality (room microphone) flavours
    • speech recognition deteriorates with low quality speech
    • emotion recognition seems to be less prone to noise!
recognition from facial expressions
  • Holistic approaches
    • image comparison with known patterns, e.g. eigenfaces
    • suffer from lighting, pose, rotation, expressivity, etc.
recognition from facial expressions
  • Facial expressions in natural environments are hard to recognize
    • Lighting conditions (edge artifacts)
    • Colour compression, e.g. VHS video (colour artifacts)
    • Not looking at camera
    • Methods operating on a single feature are likely to fail
    • Why not try them all?!
feature extraction
  • Canny operator for edge detection
    • locates eyebrows, based on (known) eye position
feature extraction
  • Texture information is richer within the eye
    • especially around the borders between eyebrows, eye white and iris
  • Complexity estimator: variance within a window of size n
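The variance-based complexity estimator can be sketched in pure Python (a minimal form: border handling and window scanning over the whole image are omitted):

```python
def window_variance(img, cy, cx, n):
    """Variance of pixel intensities in an n x n window centred at
    (cy, cx) -- the 'complexity estimator' idea in a minimal form.
    `img` is a list of rows; the window must fit inside the image."""
    half = n // 2
    vals = [img[y][x]
            for y in range(cy - half, cy + half + 1)
            for x in range(cx - half, cx + half + 1)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

# flat skin region -> low variance; textured eye region -> high variance
flat     = [[128] * 5 for _ in range(5)]
textured = [[(x + y) % 2 * 255 for x in range(5)] for y in range(5)]
```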




feature extraction
  • same process for the mouth
    • neural network
feature extraction
  • same process for the mouth
    • luminosity
mask fusion
  • comparison with anthropometric criteria
  • better-performing masks are rewarded
  • for a video with good colour conditions → colour-based masks
from areas to points
  • Areas → bounding boxes → points
  • Compatible with MPEG-4 Facial Animation Parameters (FAPs)
from areas to points
  • Sets of FAP values → facial expressions
  • Example in the positive/active quadrant (+,+)
recognition from hand gestures
  • Very few gestures have emotion-related meaning
  • Emotions change the way we perform a particular gesture
    • consider how you wave at a friend or someone you don’t really like
  • We can check motion-based features for correlation with an emotion representation
    • activation half plane
recognition from hand gestures
  • Skin probability
  • Thresholding & Morphological Operations
  • Distance Transform
  • Frame difference
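The motion-related step of the pipeline can be sketched with a simple frame-difference mask (the threshold is an arbitrary assumption; the skin-probability, morphological and distance-transform stages are omitted):

```python
def frame_difference(prev, curr, thresh=20):
    """Binary motion mask from two consecutive grayscale frames:
    1 where a pixel changed by more than `thresh`, else 0."""
    return [[1 if abs(c - p) > thresh else 0
             for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]

def motion_energy(mask):
    """Fraction of pixels in motion -- a crude activation cue."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total
```

`motion_energy` over the hand region is the kind of raw quantity the expressivity features of the next slide are built on.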
expressivity features
  • A set of parameters that modifies the quality of movement
  • Based on studies by Wallbott-Scherer and Gallaher:
    • Spatial: amplitude of movement (arm extension: wrist location)
    • Temporal: duration of movement (velocity of wrist movement)
    • Power: dynamic property of movement (acceleration)
    • Fluidity: smoothness and continuity of movement
    • Repetitiveness: tendency to rhythmic repeats (repetition of the stroke)
    • Overall Activation: quantity of movement across modalities
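Three of these parameters can be sketched from a tracked wrist trajectory (the exact formulas below are simplified assumptions in the spirit of the list, not the published definitions):

```python
import math

def expressivity(wrist, fps=25):
    """Toy spatial/temporal/power parameters from a wrist trajectory
    (list of (x, y) points, one per frame).  Simplified definitions,
    assumed for this sketch only."""
    dt = 1.0 / fps
    # Spatial: amplitude of movement (max excursion of the wrist)
    spatial = max(math.dist(wrist[0], p) for p in wrist)
    # Temporal: duration of the movement
    temporal = len(wrist) * dt
    # Power: peak acceleration of the wrist
    vel = [math.dist(a, b) / dt for a, b in zip(wrist, wrist[1:])]
    power = (max(abs(v2 - v1) / dt for v1, v2 in zip(vel, vel[1:]))
             if len(vel) > 1 else 0.0)
    return {"spatial": spatial, "temporal": temporal, "power": power}
```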
multimodal recognition
  • Neural networks and Bayesian networks → most promising results
    • usually on acted data
    • what about the dynamics of an expression?
    • in natural HCI, when you smile you don’t go neutral → grin → neutral
  • Need to learn/adapt to sequences of samples
recognizing dynamics
  • Modified Elman RNN deployed to capture dynamics of facial expressions and speech prosody
    • Used in tunes lasting >10 frames (i.e. half a second)
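The Elman idea, i.e. feeding the hidden state back as a "context" input so the network can model sequence dynamics, can be sketched minimally (a toy cell with made-up dimensions, not the modified architecture used in the experiments):

```python
import math
import random

class ElmanCell:
    """Minimal Elman-style recurrent cell: the previous hidden state
    is fed back as context at each step, which is what lets the
    network capture expression dynamics over a tune."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w_in  = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                      for _ in range(n_hidden)]
        self.w_ctx = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                      for _ in range(n_hidden)]
        self.h = [0.0] * n_hidden    # context starts at zero

    def step(self, x):
        """One time step: tanh(W_in @ x + W_ctx @ h_prev)."""
        self.h = [math.tanh(sum(w * v for w, v in zip(wi, x)) +
                            sum(w * v for w, v in zip(wc, self.h)))
                  for wi, wc in zip(self.w_in, self.w_ctx)]
        return self.h
```

Feeding the same per-frame feature vector twice generally produces different outputs, because the context has changed in between; that statefulness is the point of using a recurrent net here.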
multimodal excellence!
  • Results from the SALAS dataset
    • As expected, multimodal recognition outperforms visual (by far) and speech recognition
    • Confusion matrix
multimodal excellence!
  • Comparison with other techniques
feature- vs decision-level fusion
  • Experiments in Genoa dataset (acted)
    • Facial expressions, gesture expressivity, speech (tunes)
feature- vs decision-level fusion
  • Decision-level fusion obtained lower recognition rates than feature-level fusion
    • best probability and majority (2 out of 3 modalities) voting
multimodal emotion recognition 2010

two years from now, in a galaxy (not) far, far away…

a fundamental question
  • OK, people may be angry or sad, or express positive/active emotions
  • face recognition answers the ‘who?’ question
  • ‘when?’ and ‘where?’ are usually known or irrelevant
  • but, does anyone know ‘why?’
    • context information is crucial
is it me or?...
  • some modalities may display no cues or, worse, contradicting cues
  • the same expression may mean different things coming from different people
  • can we ‘bridge’ what we know about someone with what we sense?
    • and can we adapt what we know based on that?
    • or can we align what we sense with other sources?
another kind of language
  • sign language analysis poses a number of interesting problems
    • image processing and understanding tasks
    • syntactic analysis
    • context (e.g. when referring to a third person)
    • natural language processing
    • vocabulary limitations
want answers?

see you in 2010!