Vocalic Markers of Deception and Cognitive Dissonance for Automated Emotion Detection Systems

Dr. Aaron C. Elkins, The University of Arizona

Emotional Voice

Can computers perceive vocal emotion? • Yes… but: • The science of the emotional voice is young • Communication is complex and dynamic • Moods and emotions contextually switch • Emotion is computationally ill-defined • Measuring emotion may inform theory

Emotional Dimensions DISGUST?

Four Components of Speech • Voiced vs. Unvoiced sounds • [v] vs. [f] • Airstream through mouth or nose • [m] vs. [o]

Speech Sounds • (1) pitch, (2) loudness, and (3) quality • Sound is small variations in air pressure that occur in rapid succession • Vocal folds superimpose vibration on the outgoing air of voiced sounds • The vocal folds vibrate periodically (100 – 250 Hz) • We measure these features digitally

Recording Father – Digital Audio • Waveform measures pulses of vocal folds • Based on air pressure disturbance (dB) • Voiced vs. Unvoiced (low pressure) • Each peak occurs every 100th of a second (100 Hz)
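
The link between peak spacing and pitch can be illustrated with a short sketch. It is not the Praat or custom signal-processing pipeline used in the studies; it is a plain autocorrelation estimate run on a synthetic signal. The 8000 Hz sample rate, the 75–300 Hz search range, and the function names `synth_voiced` and `estimate_f0` are assumptions made for the example.

```python
import math


def synth_voiced(f0_hz, sample_rate, duration_s):
    """Toy voiced signal: a fundamental plus one weaker harmonic."""
    n = int(sample_rate * duration_s)
    return [
        math.sin(2 * math.pi * f0_hz * i / sample_rate)
        + 0.5 * math.sin(2 * math.pi * 2 * f0_hz * i / sample_rate)
        for i in range(n)
    ]


def estimate_f0(signal, sample_rate, f0_min=75.0, f0_max=300.0):
    """Return the pitch (Hz) whose period best aligns the signal with itself."""
    lag_min = int(sample_rate / f0_max)  # shortest period considered, in samples
    lag_max = int(sample_rate / f0_min)  # longest period considered, in samples
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        n_terms = len(signal) - lag
        # Mean product of the signal with a copy of itself shifted by `lag`.
        score = sum(signal[i] * signal[i + lag] for i in range(n_terms)) / n_terms
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


SAMPLE_RATE = 8000  # Hz, assumed for the example
wave = synth_voiced(100.0, SAMPLE_RATE, 0.05)  # 50 ms of a 100 Hz "voice"
f0 = estimate_f0(wave, SAMPLE_RATE)
print(f"estimated pitch: {f0:.1f} Hz, period: {1 / f0:.3f} s")
```

The best-matching shift is one glottal period, so a peak every 100th of a second corresponds to a pitch of about 100 Hz. Real speech is noisier than this toy signal, and production tools add windowing and voicing decisions on top of this idea.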

Vowel Articulation • Source-Filter Theory (Müller, 1848) • Vocal Folds vibrate at same speed (pitch) • Resonance changes in vocal tract to filter frequencies (formants)

Vocalics • Vocalic Analysis • Examines how it was said • Amplitude • Pitch (frequency) • Response latency • Tempo • Linguistics • Examines what was said

Sound Production is Complex • When we tense our muscles, such as during stress, our larynx tenses • Higher Pitch • The process is complex • Emotions affect the normal operation • Deception takes cognitive resources away and is stressful • More mistakes, lower quality, increased average and variation in pitch • Sympathetic Nervous system response • Increased auditory acuity • Heightened arousal

Standard Vocal Measures Calculated with Praat and Custom Signal Processing Software

Nemesysco LVA 6.50 Commercial Vocalic Software Evaluated

Five Vocalic Studies Summarized • Study One (Deception Experiment) • Study Two (Cognitive Dissonance) • Study Three (Embodied Conversational Agent and Trust) • Study Four (Embodied Conversational Agent Security Screening - Bomber) • Study Five (Embodied Conversational Agent Security Screening - Imposter)

Vocal Deception (Study 1) – Experimental Design • N = 96 • $10 reward for appearing credible to a professional interviewer • Two Sequences • First Sequence: DT DDTT TD TTDD T • Second Sequence: DT TTDD TD DDTT T • 13 Short-Answer Questions • Only 8 had variation both within and between subjects • Two types of questions: Charged and Neutral
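
The count of eight usable questions can be checked directly from the two sequences. The sketch below assumes that D marks a deceptive answer and T a truthful one, and that a question "varies between subjects" when the two sequences assign it different conditions. Both readings are assumptions, and `conditions` is a hypothetical helper name.

```python
FIRST = "DT DDTT TD TTDD T"
SECOND = "DT TTDD TD DDTT T"


def conditions(sequence):
    """Flatten a spaced sequence string into one condition letter per question."""
    return [ch for ch in sequence if ch in "DT"]


first, second = conditions(FIRST), conditions(SECOND)
assert len(first) == len(second) == 13  # 13 short-answer questions

# 1-based question numbers where the two sequences assign different conditions.
varying = [i + 1 for i, (a, b) in enumerate(zip(first, second)) if a != b]
print(len(varying), varying)
```

Under these assumptions the differing positions are questions 3–6 and 9–12, which gives the eight questions mentioned on the slide. The other five questions have the same condition in both sequences.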

Results • Built-in classification performed at chance level • Vocal measures independent of system discriminated deception: FMain, AVJ, and SOS • Possible Latent Variables measuring Conflicting Thoughts, Cognitive Effort, and Emotional Fear • Logistic regression performed best on charged questions • Higher pitch, cognitive effort, and hesitations are predictive of deception in more stressful interactions • The claim that the vocal analysis software measures stress, cognitive effort, or emotion cannot be completely dismissed • Deception and Stress can be predicted by Acoustic measures of Voice Quality and Pitch when controlling for speaker characteristics

Vocal Dissonance (Study 2) – Experimental Design • Modified Induced-Compliance Paradigm • Participants (N = 52) made two vocal counter-attitudinal arguments for cutting funding for services for the disabled • Choice is manipulated, High vs. Low (IV): High N = 24, Low N = 28 • Participants report attitude towards argument issue (DV)

Arousal (Vocal Pitch) • High choice had a 10Hz higher pitch • F(1,50) = 4.43, p = .04 • All participants reduced their pitch over time • F(1,50) = 4.90, p = .03

Cognitive Difficulty • High Choice had nearly 2x the response latency on argument two • F(1,50) = 4.53, p = .04 • Arousal moderation

Cognitive Difficulty • Participants spoke with 33% more nonfluencies on the second argument • F(1,50) = 4.03, p = .05

The Importance of Language (Imagery as Abstract Language)

Vocal Dissonance Model • χ²(1, N = 51), p = .49 • SRMR = .02 • R² Attitude Change = .17, Imagery = .11

From the lab to the AVATAR

First Kiosk

Kiosk from Last Year

Third-Generation Kiosk

Gender and Demeanor

Vocal Trust (Study 3) – Experimental Design • Participants completed pre-survey • Packed a bag before the ECA screening interview • Completed security screening • All responses to ECA recorded for vocal analysis

ECA Demeanor and Gender • N = 88 Participants (53 Males, 35 Females) • Repeated Measures Latin Square Design • All participants interacted with all demeanor and gender ECA combinations • 4 Questions Per Block, 16 Total Questions

Trust and Time • Multilevel Growth Model Specified with Trust as the DV (N = 218) and Subject as a random effect (N = 60) • Main effects • Initial Trust = 4.09 • Trust Rate of Change • .04 per second increase • p < .01 • Duration • .05 decrease in trust for every second spent answering the ECA over the 7.6 second average • p < .001

Vocal Pitch, Time, and Trust • Main Effect of Pitch • For every 1Hz increase in pitch over 156Hz trust drops by .01 • p = .03 • Interaction Pitch and Time • Pitch x Time b = 9.3e-05, p = .03 • Over time pitch predicts trust less and less
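
To see how the reported coefficients combine, the fixed effects can be written out as a small prediction function. This is a sketch rather than the fitted model: the centering of pitch at 156 Hz and of duration at the 7.6 second average, the use of centered pitch in the interaction term, and the names `predicted_trust` and `pitch_slope` are all assumptions. Random effects and any other predictors are left out.

```python
INTERCEPT = 4.09          # initial trust
B_TIME = 0.04             # trust change per second of interaction
B_DURATION = -0.05        # per second of answer duration above the 7.6 s average
B_PITCH = -0.01           # per Hz of pitch above 156 Hz
B_PITCH_X_TIME = 9.3e-05  # pitch-by-time interaction

MEAN_DURATION_S = 7.6       # assumed centering point for duration
PITCH_REFERENCE_HZ = 156.0  # assumed centering point for pitch


def predicted_trust(time_s, duration_s, pitch_hz):
    """Fixed-effects prediction only; the centering choices are assumptions."""
    pitch_c = pitch_hz - PITCH_REFERENCE_HZ
    duration_c = duration_s - MEAN_DURATION_S
    return (INTERCEPT
            + B_TIME * time_s
            + B_DURATION * duration_c
            + B_PITCH * pitch_c
            + B_PITCH_X_TIME * pitch_c * time_s)


def pitch_slope(time_s):
    """Change in predicted trust per 1 Hz of pitch at a given time."""
    return B_PITCH + B_PITCH_X_TIME * time_s


for t in (0, 30, 60):
    print(t, round(pitch_slope(t), 5))
```

Because the interaction coefficient is positive while the pitch coefficient is negative, the pitch slope moves toward zero as time passes. That is one way to read "over time pitch predicts trust less and less".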

Results • Human perceptions of trust transfer to ECA • Time plays an important role in the interaction • All participants trusted the ECA more over time, particularly when it smiled • 48 increase in trust when ECA smiles • Vocal measures of pitch predicted trust, but only early on • For every 1Hz increase in pitch over 156Hz trust drops by .01 • Over time pitch predicts trust less and less

Vocalics of a Bomber (Study 4) – Experimental Design • 29 EU border guards were randomly assigned to build a bomb (N = 16) or to a control condition (N = 13), then pack a bag • Identical to Study 3, but with no breaks in the interview • Only the male, neutral-demeanor ECA interviewed participants • Bomb makers were instructed to successfully smuggle the bomb past the ECA

Vocal Analysis • Recorded responses to the question: “Has anyone given you a prohibited substance to transport through this checkpoint?” • Average Response 2.68 sec (SD = 1.66) • Responses such as “No” or “of course not” • Vocal measures of Pitch and Pitch Variation

Results of Vocal Pitch • Voice Quality, Gender, and Intensity included as covariates • No difference in mean vocal pitch, F(1,22) = 0.38, p = .54 • Main Effect of pitch variation • Bomb Makers had 25.34% more variation, F(1,22) = 4.79, p = .04

Pitch Contours

Eye Gaze: Guilty

Eye Gaze: Innocent

Vocalics of an Imposter (Study 5) – Experimental Design • 38 EU Border Guards • All required to present visa and passport through multiphase screening • E-gate • Manual Processing • AVATAR Screening Interview • Four randomly assigned imposters carried false documents with hostile intentions through screening

AVATAR Interaction Example

iPad Output for Screener

Voice Quality Change from Baseline Question (What is your full name?)

Vocalic Classification Model

Vocalic Resulting Classification • 7 innocents falsely classified as terrorists • 27 correctly classified as innocent • All “guilty” referred to secondary • Overall accuracy = 81% • TPR = 100% • TNR = 79% • FPR = 20% • FNR = 0%
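
The rates on this slide follow from the confusion counts. The sketch below assumes the four imposters from the Study 5 design are the "guilty" cases and that all four were flagged, which together with 7 false alarms and 27 correct clearances accounts for all 38 participants. `screening_rates` is a hypothetical helper name.

```python
def screening_rates(tp, fn, fp, tn):
    """Standard rates from a 2x2 confusion table (positive = flagged as guilty)."""
    positives = tp + fn  # actually guilty
    negatives = fp + tn  # actually innocent
    return {
        "accuracy": (tp + tn) / (positives + negatives),
        "TPR": tp / positives,  # guilty who were flagged
        "TNR": tn / negatives,  # innocent who were cleared
        "FPR": fp / negatives,  # innocent who were flagged
        "FNR": fn / positives,  # guilty who were missed
    }


# Counts assumed from the slides: 4 imposters flagged, 7 false alarms, 27 cleared.
rates = screening_rates(tp=4, fn=0, fp=7, tn=27)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

With these counts the accuracy is 31/38 (about 81.6%), TNR is 27/34 (about 79.4%), and FPR is 7/34 (about 20.6%), in line with the figures shown on the slide.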

Eye Fixations on Visa

Date of Birth Results – Correct?

Final Decision Model

Vocalic Resulting Classification • 3 innocents falsely classified as terrorists • One of these three was actually lying • Actually a True Positive • 31 correctly classified as innocent • All “guilty” referred to secondary • Overall accuracy = 94.47% • TPR = 100% • TNR = 88.24% • FPR = 5.8% (reduced by 3/4) • FNR = 0%

Questions? • Isn’t the voice amazing?
