Human Language Technologies (HLT) WP3: Speech and Emotion (Analysis & Recognition)
UERLN: SYMPAFLY • Fully automatic speech dialogue telephone system for flight reservation and booking, different system stages; 270 Dialogues. • Annotations: word-based emotional user states, prosodic and conversational peculiarities; dialogue (step) success; emotional user states distribution follows nested Pareto (80/20) principle
UERLN: AIBO • Children's interaction (age 10-12, 51 children, 9.2 hours of speech) with SONY’s AIBO robot, Wizard-of-Oz scenario; cf. WP5 (plus English and read speech) • Annotations: word-based emotional user states (holistic, 5 labellers) and prosodic peculiarities; alignment of children's utterances with AIBO's actions; manual correction of F0; labelling of voice quality; emotional user states for the English data.
AIBO disobedient: from motherese to angry g'radeaus Aibolein ja M fein M gut M machst M du M *da M | *tz l"aufst du mal bitte nach links | stopp E Aibo stopp | nach links E umdrehen | nein M <*ne> nein M <*ne> nein M <*ne> so M weit M *simma M noch M nicht M aufstehen M Schlafm"utze M komm M hoch M | ja M so M ist M es M <*is> guter M Hund M lauf mal jetzt nach links | nach links Aibo | Aibolein M aufstehen M *son M sonst M werd' M ich M b"ose M hoch E | nach A links A | Aibo A nach A links A | Aibolein A ganz A b"oser A Hund A jetzt A stehst A du A auf A | hoch A | dreh dich ein bisschen | ja M so ist es <*is> gut stopp Aibo stopp | *tz lauf g'radeaus |
UERLN: Different Conceptualizations • Pet dog: Straight on little Aibo ok great You're doing fine now please to the left stop Aibo stop turn to the left no no no we aren't that far yet get up sleepyhead get up yes that's a good dog now go left left Aibo little Aibo get up else I'm getting angry get up Aibo left little Aibo bad boy now get up turn a little ok that's fine stop Aibo stop straight on • Remote control tool: Aibo straight on stop Aibo stop turn round to the left Aibo get up turn round to the left Aibo get up turn round, to the left Aibo get up get up Aibo now go left now straight on Aibo st' straight on
ITC: Targhe • Fully automatic speech dialogue telephone system • 15.6 hours of Italian natural speech • 9444 files (turns), of which 450 are emotionally rich • Word level: orthographic transcription and word segmentation; prosodic peculiarities annotated • Turn level: holistic emotion labels • SympaFly (cf. UERLN): for comparison and benchmarking
UKA: LDC2002S28 • Elicited emotional speech database; native American English • labels: 1 of 15 holistic speaker states per utterance; used in algorithm and feature set development
UKA: ISL Meeting Corpus • 18 recordings of multi-party meetings (mean 5.1 participants, mean duration 35 minutes); American English • Annotations: orthographic transcription, Verbmobil II, and discourse-level annotations.
Assessment of Data Collection: • focus on • spontaneous, realistic data • important/new types of dialogues/interaction • evaluation of annotations • considerable percentage of realistic (processed and available) databases world-wide
UERLN: Features • large feature vector for a context of 2 words: • 95 prosodic (duration, energy, F0, pauses) • 80 spectral (HNR, formant based frequencies and energy) • 24 MFCC • 30 POS • Language Models & dialogue based features
ITC: Features • Baseline feature set: 96 features, based on energy, duration, and pitch • Final feature set: 273 features (many redundant), based on energy, duration, pitch, and pauses • Different pitch extractors tried: Normalized Cross Correlation, Weighted Auto Correlation, UERLN PDA • Different subsets compared • Different tests to reduce the feature space: principal component analysis
UKA: 133 Acoustic Features • pitch, voiced/unvoiced energy, quartiles (15) • voice quality, Praat metrics (11) • harmonicity, quartiles (5) and Praat metrics (3) • zero-crossing rate vs energy, histogram (20) • correlation/regression, coefficients (36) • vocal tract volume, quartiles (25) • duration/timing, Verbmobil features (18)
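Several of the feature groups above are quartile summaries of a frame-level contour. The slides do not list the exact quantities, so the following is only a generic sketch of how such a summary might be computed for an F0 contour. The function name, the convention that 0 marks an unvoiced frame, and the sample values are assumptions for illustration, not the UKA implementation.

```python
import statistics


def quartile_features(contour):
    """Summarize a frame-level contour by its quartiles and related statistics."""
    # Assumption: a value of 0 marks an unvoiced frame and is skipped.
    voiced = [value for value in contour if value > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    q1, median, q3 = statistics.quantiles(voiced, n=4, method="inclusive")
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": q3 - q1,          # interquartile range
        "min": min(voiced),
        "max": max(voiced),
    }


# Made-up F0 values in Hz; zeros stand for unvoiced frames.
f0 = [0.0, 210.0, 220.0, 230.0, 0.0, 240.0, 250.0]
features = quartile_features(f0)
```

Quartiles are less sensitive than mean and variance to the octave errors that pitch trackers occasionally produce, which is one plausible reason to prefer them for utterance-level features.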
Classifiers • UERLN: Linear Discriminant Analysis (LDA), Decision Trees (CARTs), Neural Networks (NN), Support Vector Machines (SVM), Gaussian Mixtures (GM), Language Models (LM) • ITC: Decision Trees (CARTs), Neural Networks (NN) • UKA: Linear, Neural Networks (NN), Support Vector Machines (SVM)
UERLN classification I: SympaFly • GM/NN, 2 classes, neutral vs. problem, l≠t • LDA, 4 classes • SVM/CART, 2 classes, loo • dialogue step success, 2 classes, SVM: CL 82.5 • dialogue success, 2 classes, CART: CL 85.4 • RR: overall rec. rate; CL: class-wise averaged rec. rate
UERLN classification II: AIBO • joyful • surprised • motherese • neutral (default) • rest (non-neutral) • bored • helpless, hesitant • emphatic • touchy (=irritated) • angry • reprimanding 4 classes "AMEN", NN
ITC Classification II: • Final feature set • 273 (acoustic/temporal) features • 2 class problem (neutral and non neutral) RR = overall rec. rate; CL = class-wise averaged rec. rate N = neutral turns; NN = Non neutral turns
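The two scores RR and CL can be illustrated with a small sketch. It assumes that RR is the fraction of all items classified correctly and that CL is the unweighted mean of the per-class recognition rates (per-class recall). The function name and the toy labels are hypothetical and not taken from the project.

```python
from collections import Counter


def recognition_rates(reference, predicted):
    """Return (RR, CL) in percent for two equal-length label sequences."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    totals = Counter(reference)  # number of items per reference class
    hits = Counter(r for r, p in zip(reference, predicted) if r == p)
    # RR: correctly classified items over all items.
    rr = 100.0 * sum(hits.values()) / len(reference)
    # CL: recognition rate of each class, averaged with equal class weights.
    cl = 100.0 * sum(hits[c] / totals[c] for c in totals) / len(totals)
    return rr, cl


# Toy, made-up data: 8 neutral (N) and 2 non-neutral (NN) turns.
reference = ["N"] * 8 + ["NN"] * 2
predicted = ["N"] * 8 + ["N", "NN"]
rr, cl = recognition_rates(reference, predicted)
```

When one class dominates, as neutral does in the corpora described here, RR can stay high while CL exposes weak recognition of the rare class, which is why both are reported.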
UKA Classification II: 133 utterance-level prosodic features, 15 classes, acted speech, 8 speakers:
Assessment of Features • a pool of many different features/feature groups implemented/compared • prosodic features better (more consistent) than "spectral" features in realistic speech • combination of knowledge sources improves performance • relevance of single features (feature classes)?
Assessment of Classifications • not much difference between different classifiers in classification performance (linear classifiers highly competitive in speaker-independent classification) • large differences between speaker-dependent and speaker-independent classification
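Speaker-independent classification means that no speaker in the test set also appears in the training set. The slides do not spell out the partners' exact protocols, so the following is only a sketch of one common scheme, leave-one-speaker-out splitting. The function name and the speaker identifiers are hypothetical.

```python
from collections import defaultdict


def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    by_speaker = defaultdict(list)
    for index, speaker in enumerate(speaker_ids):
        by_speaker[speaker].append(index)
    for held_out in sorted(by_speaker):
        test = by_speaker[held_out]
        # Training data: every utterance from all other speakers.
        train = sorted(
            i for speaker, indices in by_speaker.items()
            if speaker != held_out
            for i in indices
        )
        yield held_out, train, test


# Hypothetical speaker labels for six utterances.
speakers = ["s1", "s1", "s2", "s3", "s3", "s2"]
folds = list(leave_one_speaker_out(speakers))
```

A speaker-dependent evaluation would instead split each speaker's own utterances between training and test, which lets the classifier adapt to individual voices and typically yields higher scores.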
Categories & Dimensions cf. also tomorrow
UKA: Meeting Annotation Meeting audio appears to be rich in non-neutral speech. Open-set holistic labeling of 5 meetings by 3 labellers
UKA: Towards new Dimensions for Social Interaction in Meetings • denoting conflict, building community, skepticism, etc. • Dimensions: power (weak to strong), support (self to group)
Assessment of Categories & Dimensions • New categories, new dimensions, new consistency measure • prototypical "full-blown" emotions are rare • labels depend on type of data (call center, human-robot, different types of multi-party meetings) • new dimensions that do not model emotions but interaction between participants in communication • new entropy-based consistency measure
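The slides name an entropy-based consistency measure but do not define it. As a rough illustration only, one basic ingredient such a measure could use is the Shannon entropy of the labels that several annotators assign to the same item: 0 bits when all labellers agree, higher values when they disagree. The function name and the example labels are assumptions, and this is not the project's actual measure.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy in bits of the labels given to one item by several labellers."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = len(labels)
    # Sum of p * log2(1 / p) over the observed label proportions p.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical labels from five labellers for three different words.
unanimous = label_entropy(["angry"] * 5)
split = label_entropy(["angry", "angry", "angry", "emphatic", "emphatic"])
scattered = label_entropy(["angry", "emphatic", "neutral", "motherese", "angry"])
```

Unlike a simple majority vote, an entropy score keeps graded information about how strongly the labellers disagree on each item.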