Enhancing Human-Computer Interaction: A Study on Talking Avatar Head Movements and Perception
This research explores how the head movements of talking avatars shape human perception in applications such as teleconferencing and online education. By quantitatively analyzing user-rated head animations alongside acoustic speech features, the study identifies strong correlations between head motion and perceived naturalness. Drawing on optical motion capture data, it characterizes the head motion frequency range that viewers perceive as natural. The findings aim to make avatars more effective in human-computer interfaces, leading to more engaging online experiences.
Presentation Transcript
Perceptual Analysis of Talking Avatar Head Movements: A Quantitative Perspective
Xiaohan Ma, Binh H. Le, and Zhigang Deng
Department of Computer Science, University of Houston
Motivation • Avatars have been increasingly used in human-computer interfaces • Teleconferencing, computer-mediated communication, distance education, online virtual worlds, etc. • Human-like avatar gestures significantly influence human perception • Facial expressions • Hand gestures • Lip movements • Head movements • Head movement is one of the crucial visual cues for facilitating engaging social interaction and communication
Our Quantitative Perspective: Talking Avatar Head Animations • Uncover how talking avatar head movements affect human perception • User-rated naturalness of head animations • Joint features extracted from head animations (with audio) • Acoustic speech features • Head motion patterns • Quantitatively analyze the association between the extracted joint features and the user ratings [Pipeline diagram: user evaluation and feature extraction produce ratings and joint features, which feed the analysis of their association]
Data Acquisition and Processing • Acquisition of the audio-head motion dataset • Head motion & speech were recorded simultaneously • Head motion: optical motion capture system (120 Hz) • Speech: microphone (48 kHz) • Processing of the captured audio-head motion dataset • Head motion: 3 Euler rotation angles (X-, Y-, and Z-axis rotations) per frame • Speech: pitch and RMS energy • Head & speech data aligned to the same frame rate (24 FPS); see the sketch below
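A minimal sketch of this alignment step in Python, assuming the captured rotations are stored as an (N, 3) array of Euler angles and using librosa for pitch and RMS energy (the slides do not name the authors' tooling; file names and pitch bounds here are illustrative):

```python
# Sketch: aligning a 120 Hz head-motion track and 48 kHz speech to 24 FPS.
import numpy as np
import librosa

MOCAP_HZ, AUDIO_HZ, TARGET_FPS = 120, 48_000, 24

# Head motion: 120 Hz -> 24 FPS by keeping every 5th frame.
head_angles = np.load("head_angles.npy")            # hypothetical file, shape (N, 3)
head_24fps = head_angles[:: MOCAP_HZ // TARGET_FPS]

# Speech: one analysis frame per video frame (hop of 48000/24 = 2000 samples).
y, sr = librosa.load("speech.wav", sr=AUDIO_HZ)     # hypothetical file
hop = sr // TARGET_FPS
pitch = librosa.yin(y, fmin=65, fmax=400, sr=sr, hop_length=hop)
energy = librosa.feature.rms(y=y, hop_length=hop)[0]

# Trim to a common length so every frame has (rotation, pitch, energy).
n = min(len(head_24fps), len(pitch), len(energy))
head_24fps, pitch, energy = head_24fps[:n], pitch[:n], energy[:n]
```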
Subjective Evaluation • Using the captured dataset, we generated 60 head animation clips • Based on 15 recorded speech clips • 4 different audio-head motion generation techniques • The mouth region was mosaicked so that lip movements would not bias the ratings • User study • 18 participants • Ages: 23~28 • Gender: female (16.67%), male (83.33%) • Language: fluent English speakers • Ratings on a 1~5 scale
Speech-Head Motion Features and Perception • Measure the correlation between head motion and speech features • Canonical Correlation Analysis (CCA); see the sketch below • Pitch-head motion coupling vs. human perception • Pearson correlation coefficient: 0.731 • Energy-head motion coupling vs. human perception • The relationship appears random and is clearly not linear
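A sketch of this analysis pattern with scikit-learn's CCA and SciPy's Pearson test, reusing the per-frame arrays from the earlier sketch; `clips` and `cca_strength` are hypothetical names, and the 0.731 figure above is the authors' result, not something this code reproduces:

```python
# Sketch: per-clip speech/head-motion coupling via CCA, then the Pearson
# correlation between that coupling strength and mean user ratings.
import numpy as np
from scipy.stats import pearsonr
from sklearn.cross_decomposition import CCA

def cca_strength(speech_feats, head_angles):
    """First canonical correlation between per-frame speech features
    (e.g. pitch as an (n, 1) array) and head rotation angles (n, 3)."""
    cca = CCA(n_components=1)
    u, v = cca.fit_transform(speech_feats, head_angles)
    return pearsonr(u[:, 0], v[:, 0])[0]

# `clips` is a hypothetical list of (pitch, head_angles, mean_rating) tuples.
strengths = [cca_strength(p.reshape(-1, 1), h) for p, h, _ in clips]
ratings = [r for _, _, r in clips]
r, p_value = pearsonr(strengths, ratings)
print(f"Pearson r between pitch-head CCA and ratings: {r:.3f}")
```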
Speech-Head Motion Features and Perception • Implications for HCI • Validates the tight coordination between speech and head motion: precise timing is required in generation • Delayed head movement generation may significantly degrade human perception • An approximately linear correlation between user ratings and the pitch-head motion CCA • Prosody-driven head motion synthesis could be fundamentally sound • No simple linear correlation between user ratings and the RMS energy-head motion CCA • RMS energy may vary widely among sentences
Frequency-Domain Analysis of Head Motion • Frequency-domain analysis of head motion • Head motion: rotation angles • Frequency spectrum: FFT applied to each head rotation angle vector; see the sketch below • Association between the head motion spectrum and human perception • For squared magnitudes below 5 degrees [3D plot: X-axis = average user rating (2.1~4.2); Y-axis = squared magnitude of the three Euler rotation angles (0~5 degrees); Z-axis = frequency spectrum (0~19 Hz)]
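A minimal FFT sketch of this step; the default `fps=120` assumes the spectrum was computed on the raw motion-capture rate, since the plot's 0~19 Hz range exceeds the 12 Hz Nyquist limit of the 24 FPS aligned data:

```python
# Sketch: squared-magnitude frequency spectrum of one Euler angle channel.
import numpy as np

def rotation_spectrum(angle_series, fps=120):
    """Power spectrum of a 1-D head rotation angle signal (mean removed)."""
    spectrum = np.fft.rfft(angle_series - np.mean(angle_series))
    freqs = np.fft.rfftfreq(len(angle_series), d=1.0 / fps)
    return freqs, np.abs(spectrum) ** 2

# e.g. the Y-axis rotation channel of an (N, 3) Euler angle array:
freqs, power = rotation_spectrum(head_angles[:, 1])
```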
Frequency-Domain Analysis of Head Motion • Key observations • Highly rated clips: low-frequency motion • Natural head motion stays below 10 Hz • Lowly rated clips: high-frequency motion • Typically larger than 12 Hz • With a small range of head movements • Implications for HCI • The comfortable head motion frequency zone: 0~12 Hz • Smooth post-processing for talking avatar head motion generation • Smoothing: post-process the synthesized head motions • Simply crop the high-frequency components from the synthesized head motions; see the sketch below [Figure: example low-frequency vs. high-frequency head motion patterns]
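A sketch of the suggested post-processing, taking the slide's wording literally: zero out spectral components above the 12 Hz cutoff and invert the FFT. A Butterworth low-pass filter would be a gentler alternative, and `fps` is again an assumption about the motion's frame rate:

```python
# Sketch: low-pass "crop" of high-frequency head motion components.
import numpy as np

def smooth_head_motion(angles, fps=120, cutoff_hz=12.0):
    """Low-pass filter each Euler angle channel of an (N, 3) motion array."""
    spectrum = np.fft.rfft(angles, axis=0)
    freqs = np.fft.rfftfreq(angles.shape[0], d=1.0 / fps)
    spectrum[freqs > cutoff_hz] = 0.0          # crop the high-frequency part
    return np.fft.irfft(spectrum, n=angles.shape[0], axis=0)
```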
Conclusion and Future Work • Summary of our findings • The coupling between pitch and head motion has a strong linear correlation with human perception • Perceived-natural head motions consist mainly of low-frequency components; high-frequency components (>12 Hz) significantly damage perceived naturalness • Future work • Multi-party conversation scenarios • Analysis of other fundamental speech features: pauses, repetitions, etc. Acknowledgments: This work is supported in part by NSF IIS-0914965, Texas Norman Hackerman Advanced Research Program 003652-0058-2007, and research gifts from Google and Nokia.