
Languages for the Annotation and Specification of Dialogues (updated 31-Oct-2001)








  1. Languages for the Annotation and Specification of Dialogues (updated 31-Oct-2001)
  Gregor Erbach (gor@acm.org)

  2. Course Outline
  1. Introduction to Spoken Dialogue Systems
  2. Linguistic Resources in SDS
  3. Developing Spoken Dialogue Applications
  4. Annotation of Dialogues
  • Uses of annotated dialogues
  • Levels of annotation, multilevel annotation
  • Annotation Graphs
  • Annotation Frameworks (ATLAS)
  5. Introduction to XML
  6. Dialogue Annotation in XML (MATE)

  3. Outline (2)
  7. Evaluation of Spoken Dialogue Systems
  8. Dialogue Specification Languages
  • Behaviouristic Models (pattern-response)
  • Finite-State Models
  • Slot-Filling
  • Condition-Action Rules (HDDL)
  • Planning
  • Re-usable Dialogue Behaviours: SpeechObjects
  9. VoiceXML
  10. Research Challenges

  4. 1. Spoken Dialogue Systems
  • Human-machine dialogue differs from human-human dialogue:
  • limited natural-language understanding
  • limited vocabulary
  • limited back-channel
  • limited world knowledge and inference capabilities
  • limited social and emotional competence
  • speech recognition errors
  • The design and implementation of dialogue systems is a discipline between science and engineering

  5. 1. Spoken Dialogue Systems: Dialogue System Architecture
  [Architecture diagram: speech understanding → dialogue control → speech output, with dialogue control connected to application logic / reasoning and a database / knowledge base]

  6. 1. Spoken Dialogue Systems: Dialogue Modelling
  [Diagram relating the Interaction Model, Language Model and Dialogue Model (from Bernsen, Dybkjær and Dybkjær, 1998)]

  7. 1. Spoken Dialogue Systems: Speech and Audio Processing
  Speech Understanding
  • Signal processing: convert the audio wave into a sequence of feature vectors
  • Speech recognition: decode the sequence of feature vectors into a sequence of words
  • Semantic interpretation: determine the meaning of the recognised words
  Speech Output
  • Speech generation: generate a marked-up word string from the system semantics
  • Speech synthesis: generate synthetic speech from the marked-up word string

  8. 1. Spoken Dialogue Systems: Automatic Speech Recognition (ASR)
  • Research activity since the 1950s
  • In widespread commercial use for a number of years, enabled by increased processor power and memory and by better software engineering
  • Speech recognisers can be implemented on PCs as software-only applications

  9. 1. Spoken Dialogue Systems: ASR Fundamentals
  • Digitisation of the acoustic signal
  • Signal analysis: distribution of acoustic energy over time and frequency, represented as feature vectors
  • Matching against stored patterns (acoustic models)
  • Selection of the best pattern using linguistic knowledge and world knowledge

  10. 1. Spoken Dialogue Systems: Signal Analysis
  [Figure: output of the speech analysis tool PRAAT]

  11. 1. Spoken Dialogue Systems: Challenges in ASR
  • Speaker-independent recognition
  • Variation among speakers (age, dialect, diseases ...)
  • Vocabulary size
  • Continuous speech
  • Spontaneous speech
  • Background noise
  • Distorted signal transmission

  12. 1. Spoken Dialogue Systems: Difficulty vs. Vocabulary Size
  [Chart: task difficulty plotted against vocabulary size (10 to 1M words), ranging from device control and voice dialling through dialogue systems to dictation systems]

  13. 1. Spoken Dialogue Systems: The Speech Recognition Problem
  • Bayes' Law: P(a,b) = P(a|b) P(b) = P(b|a) P(a)
  • The joint probability of a and b = the probability of b times the probability of a given b
  • The recognition problem: find the most likely sequence w of "words" given the sequence of acoustic observation vectors a
  • Use Bayes' law to create a generative model:
  ArgMaxw P(w|a) = ArgMaxw P(a|w) P(w) / P(a) = ArgMaxw P(a|w) P(w)
  • Acoustic Model: P(a|w)
  • Language Model: P(w)
  (from Carpenter and Chu-Carroll, 1998)
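The ArgMax decision above can be sketched in Python with toy scores. All numbers here are invented for illustration; a real recogniser would obtain the acoustic and language model log-probabilities from trained models.

```python
# Toy acoustic and language model scores (log-probabilities).
# The candidate sentences and all values are invented for illustration.
acoustic_logp = {"flights from boston": -12.0, "lights from boston": -11.5}
language_logp = {"flights from boston": -4.0, "lights from boston": -9.0}

def decode(candidates):
    """Pick ArgMax_w P(a|w) P(w), i.e. the best combined log-score."""
    return max(candidates, key=lambda w: acoustic_logp[w] + language_logp[w])

print(decode(acoustic_logp))  # the language model overrules the acoustic preference
```

Note how the acoustic model slightly prefers "lights" but the language model's strong preference for "flights" wins, which is exactly the role of P(w) in the product P(a|w) P(w).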

  14. 1. Spoken Dialogue Systems: Pronunciation Modelling
  • Needed for speech recognition and synthesis
  • Maps the orthographic representation of words to sequence(s) of phones
  • A dictionary doesn't cover the whole language, due to:
  • open classes
  • names
  • inflectional and derivational morphology
  • Pronunciation variation can be modelled with multiple pronunciations and/or acoustic mixtures
  • If multiple pronunciations are given, estimate their likelihoods
  • Use rules (e.g. assimilation, devoicing, flapping) or statistical transducers
  (from Carpenter and Chu-Carroll, 1998)

  15. 1. Spoken Dialogue Systems: Language Modelling
  • Assigns a probability P(w) to a word sequence w = w1, w2, ..., wk
  • The chain rule provides a history-based model:
  P(w1, w2, ..., wk) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wk|w1,...,wk-1)
  • Cluster histories to reduce the number of parameters
  (from Carpenter and Chu-Carroll, 1998)

  16. 1. Spoken Dialogue Systems: N-Gram Language Modelling
  • The n-gram assumption clusters histories based on the last n-1 words:
  P(wj|w1,...,wj-1) ≈ P(wj|wj-n+1,...,wj-2,wj-1)
  • unigrams: P(wj)
  • bigrams: P(wj|wj-1)
  • trigrams: P(wj|wj-2,wj-1)
  • Trigrams are often interpolated with the bigram and unigram:
  P(wj|wj-2,wj-1) ≈ λ3 F(wj|wj-2,wj-1) + λ2 F(wj|wj-1) + λ1 F(wj)
  • The λi are typically estimated by maximum likelihood estimation on held-out data (the F(·|·) are relative frequencies)
  • Many other interpolations exist (another standard is a non-linear backoff)
  (from Carpenter and Chu-Carroll, 1998)
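A minimal sketch of the interpolated trigram estimate. The toy corpus stands in for real training data, and the λ values are assumed here rather than estimated on held-out data as the slide prescribes:

```python
from collections import Counter

# Tiny corpus; its counts stand in for the relative frequencies F(.|.)
corpus = "the cat sat on the mat the cat ran".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def p_interp(w, u, v, lambdas=(0.6, 0.3, 0.1)):
    """Interpolated trigram P(w | u, v). The lambdas would normally be
    estimated on held-out data; here they are assumed values."""
    l3, l2, l1 = lambdas
    f3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    f2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    f1 = uni[w] / N
    return l3 * f3 + l2 * f2 + l1 * f1
```

The interpolation ensures a non-zero probability for any word seen at least once, even when the exact trigram never occurred in training.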

  17. 1. Spoken Dialogue Systems: Recognition Grammars
  • Restrict the possible user inputs at each step of the dialogue
  • Restricting the possible inputs is necessary in speaker-independent systems to improve recognition accuracy
  • Recognition grammars in commercial dialogue systems are generally regular or context-free grammars
  • Dynamically generated grammars can be used, adapted to the state of the dialogue
  • Closed grammars match the user input from beginning to end
  • Open grammars match parts of the user input

  18. 1. Spoken Dialogue Systems: Finite-State Language Models
  • Write a finite-state task grammar (with a non-recursive CFG)
  • Simple Java Speech API example (from the user's guide):
  public <Command> = [<Polite>] <Action> <Object> (and <Object>)*;
  <Action> = open | close | delete;
  <Object> = the window | the file;
  <Polite> = please;
  • Typically assume that all transitions are equiprobable
  • The technology used in most current applications
  • Semantic actions can be put in the grammar
  (from Carpenter and Chu-Carroll, 1998)

  19. 1. Spoken Dialogue Systems: Java Speech Grammar Format
  The Java Speech Grammar Format (JSGF) is a widely used format for recognition grammars:
  <xyz>  grammatical category xyz
  *      repetition (0 to n times)
  +      repetition (1 to n times)
  (...)  grouping
  [...]  grouping, optional
  |      alternatives
  /n/    alternative with weight n

  20. 1. Spoken Dialogue Systems: Recognition Grammar in JSGF
  #JSGF V1.0 ISO8859-1 en;
  grammar com.acme.commands;
  <basicCmd> = <startPolite> <command> <endPolite>;
  <command> = <action> <object>;
  <action> = /10/ open | /2/ close | /1/ delete | /1/ move;
  <object> = [the | a] (window | file | menu);
  <startPolite> = (please | kindly | could you | oh mighty computer)*;
  <endPolite> = [please | thanks | thank you];
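Because this grammar is non-recursive, it denotes a regular language, so it can be hand-translated into a regular expression. A sketch in Python (the translation is mine, and the JSGF weights are simply dropped since they affect scoring, not coverage):

```python
import re

# Hand-translation of the JSGF rules above into a regular expression;
# feasible because the grammar is non-recursive and therefore regular.
action = r"(open|close|delete|move)"              # weights /10/ etc. dropped
obj = r"((the|a) )?(window|file|menu)"
start_polite = r"((please|kindly|could you|oh mighty computer) )*"
end_polite = r"( (please|thanks|thank you))?"
basic_cmd = re.compile(f"^{start_polite}{action} {obj}{end_polite}$")

def matches(utterance: str) -> bool:
    """True iff the utterance is in the language of <basicCmd>."""
    return basic_cmd.match(utterance) is not None

print(matches("please open the window"))  # True
print(matches("window open please"))      # False: not a <basicCmd>
```

This mirrors what a closed recognition grammar does at runtime: only word sequences inside the language of the grammar can be recognised at that dialogue step.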

  21. 1. Spoken Dialogue Systems: Word Hypothesis Graphs
  • Keep multiple tokens and return the n-best paths/scores:
  • p1 flights from Boston today
  • p2 flights from Austin today
  • p3 flights for Boston to pay
  • p4 lights for Boston to pay
  • Can produce a packed word graph (a.k.a. lattice)
  • The likelihoods of paths in the lattice should equal the likelihoods for the n-best list
  (from Carpenter and Chu-Carroll, 1998)
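A packed word graph can be represented as a DAG whose edges carry words and scores; the best hypothesis is then the highest-scoring path. A sketch with an invented toy lattice (node ids and log-probabilities are made up for illustration):

```python
# A toy word lattice as a DAG: each edge is (word, log-probability, next node).
# Node ids and scores are invented for illustration.
lattice = {
    0: [("flights", -1.0, 1), ("lights", -2.0, 1)],
    1: [("from", -0.5, 2), ("for", -1.5, 2)],
    2: [("boston", -0.7, 3), ("austin", -1.2, 3)],
    3: [("today", -0.4, 4), ("to pay", -2.5, 4)],
    4: [],  # final node
}

def best_path(node=0):
    """Return (score, words) of the highest-scoring path from `node`."""
    if not lattice[node]:
        return 0.0, []
    candidates = []
    for word, logp, nxt in lattice[node]:
        score, words = best_path(nxt)
        candidates.append((logp + score, [word] + words))
    return max(candidates)

score, words = best_path()
print(words)  # the 1-best hypothesis through the lattice
```

Enumerating all paths in score order would yield the n-best list of the slide; the lattice packs those hypotheses into shared structure.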

  22. 1. Spoken Dialogue Systems: Measuring Recognition Performance
  • Word Error Rate = (substitutions + insertions + deletions) / number of words in the actual utterance
  • Example scoring:
  actual utterance: four six seven nine three three seven
  recognizer output: four oh six seven five three seven (one insertion, one substitution, one deletion)
  WER: (1 + 1 + 1)/7 = 43%
  • We would like to study concept accuracy instead:
  • typically count only errors on content words [application dependent]
  • ignore case marking (singular, plural, etc.)
  • For word/concept spotting applications:
  • recall: percentage of target words (concepts) found
  • precision: percentage of hypothesized words (concepts) in the target
  (from Carpenter and Chu-Carroll, 1998)
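The WER computation is the standard Levenshtein edit distance over words. A minimal sketch (the function name is mine), reproducing the slide's example:

```python
def wer(reference, hypothesis):
    """Word error rate = (substitutions + insertions + deletions) / len(reference),
    computed with the standard Levenshtein dynamic programme over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# The slide's example: one insertion, one substitution, one deletion.
print(round(wer("four six seven nine three three seven",
                "four oh six seven five three seven"), 2))  # 0.43
```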

  23. 1. Spoken Dialogue Systems: Dictation vs. Dialogue System
  • Speaker dependence: a dictation system is speaker-dependent or speaker-adaptive (it must be trained for each speaker); a dialogue system is speaker-independent
  • Vocabulary size: a dictation system has up to 100,000 words, always active; a dialogue system has several thousand words, of which a subset is active
  • Nature of the user input: unrestricted for a dictation system, including complex sentences; a dialogue system recognises only certain patterns at each step

  24. 1. Spoken Dialogue Systems: Speaker Verification
  • Speaker verification: confirm the claimed identity of a speaker
  • Speaker identification: recognise one speaker among a group of potential candidates
  • Evaluation by means of the "false acceptance" and "false rejection" rates
  • One measure can be improved at the expense of the other
  • For high-security applications, speaker verification should be combined with other methods (password, chip card, biometrics ...)
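The trade-off between the two error rates comes from a single acceptance threshold on the verification score. A sketch with invented scores (all trial data here is made up to illustrate the trade-off):

```python
# Toy verification scores: higher = more similar to the claimed speaker.
# Scores and labels are invented for illustration.
genuine = [0.9, 0.8, 0.6]    # trials where the identity claim is true
impostor = [0.7, 0.3, 0.2]   # trials where the identity claim is false

def error_rates(threshold):
    """Accept a claim iff score >= threshold; return (FAR, FRR)."""
    far = sum(s >= threshold for s in impostor) / len(impostor)  # false acceptance
    frr = sum(s < threshold for s in genuine) / len(genuine)     # false rejection
    return far, frr

print(error_rates(0.5))   # low threshold: some impostors accepted
print(error_rates(0.75))  # high threshold: some genuine speakers rejected
```

Raising the threshold lowers false acceptance but raises false rejection, which is exactly the "one measure at the expense of the other" trade-off from the slide.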

  25. 2. Linguistic Resources for Dialogue Systems
  • Acoustic Models
  • Phonetic Lexicon
  • Language Models (Grammars)
  • Dialogue Specifications
  • System Output (Prompts)
  • Training data: annotated human-human or human-machine dialogues

  26. 2. Linguistic Resources for SDS: Acoustic Models
  • Tri-phone HMMs
  • Transcribed speech is used for training
  • Orthographic transcriptions + noise markers + phonetic lexicon
  • SpeechDat is a standard format for transcription: each audio file is associated with a label file which contains the transcription plus information about the speaker (age, sex, education level) and the call (telephone network, environment)

  27. 2. Linguistic Resources for SDS: SpeechDat Label File
  LHD: SAM, 6.0
  DBN: SpeechDat_Austrian_Mobile
  VOL: MOBIL1AT_01
  SES: 0099
  DIR: \MOBIL1AT\BLOCK00\SES0099
  SRC: B10099C2.ATA
  CCD: C2
  BEG: 0
  END: 63487
  REP: Connect Austria, Vienna
  RED: 02/Jan/2000
  RET: 16:15:45
  SAM: 8000
  SNB: 1
  SSB: 8
  QNT: A-LAW
  SCD: 000099
  SEX: F
  AGE: 22
  ACC: NOE
  REG: Wien
  ENV: HOME
  NET: MOBILE, A1
  PHM: UNKNOWN, EFR
  SHT: 600-0663
  EDU: MATURA
  NLN: DE-AT
  ASS: OK
  LBD:
  LBR: 0,63487,,,,0354/329 851
  LBO: 0,,63487,[sta] null drei fünf vier drei zwei neun acht fünf eins

  28. 2. Linguistic Resources for SDS: Phonetic Lexicon
  • The phonetic lexicon consists of pairs <orthography, phonetic-representation+>, where the phonetic symbols correspond to the acoustic models used in the speech recogniser
  • Phonetic lexicons are also used for text-to-speech synthesis
  • Example (with SAM-PA transcriptions; Abkommen has two variants):
  Abkehr     a p k e: 6
  Abkommen   a p + k O m @ n
  Abkommen   a p k O m @ n
  Abkommens  a p k O m @ n s
  Ablauf     a p l aU f
  Ablegers   a p l e: g 6 s
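Loading such a lexicon is a simple mapping from orthography to one or more phone sequences. A sketch, assuming the one-entry-per-line format shown above (the helper name and exact whitespace conventions are mine):

```python
# A minimal lexicon lookup, assuming one entry per line:
# orthography, whitespace, SAM-PA phone symbols (format assumed from the slide).
LEXICON_TEXT = """\
Abkehr    a p k e: 6
Ablauf    a p l aU f
Ablegers  a p l e: g 6 s
"""

def load_lexicon(text):
    lexicon = {}
    for line in text.splitlines():
        word, *phones = line.split()
        # A word may have several pronunciation variants (one per line),
        # so each entry stores a list of phone sequences.
        lexicon.setdefault(word, []).append(phones)
    return lexicon

lex = load_lexicon(LEXICON_TEXT)
print(lex["Ablauf"])  # [['a', 'p', 'l', 'aU', 'f']]
```

Because the phone symbols name the recogniser's acoustic models, this table is exactly the bridge between the recognition grammar's words and the HMMs that score the audio.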

  29. 2. Linguistic Resources for SDS: Language Models
  • Two kinds of language models are widely used: statistical language models and recognition grammars
  • Statistical LMs are generally used for dictation systems
  • Recognition grammars are often used for speaker-independent dialogue systems
  • Recognition grammars are often finite-state models, or non-left-recursive context-free grammars
  • Statistical LMs and recognition grammars can be combined (e.g. Philips, Nuance 8)
  • Language models can be trained or optimised using text corpora or transcriptions of dialogues

  30. 2. Linguistic Resources for SDS: Dialogue Specifications
  • Dialogue specifications are used to control the flow of the dialogue
  • Dialogue specifications can be expressed:
  • as executable code in some programming language
  • as a task model
  • in some dialogue specification language
  • Dialogue specifications must provide repair strategies to deal with recognition failures and unacceptable user input
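As a concrete illustration, a dialogue specification in its simplest finite-state form can be written as a table of states, prompts, and transitions, with a trivial repair strategy for unrecognised input. Everything here (state names, prompts, inputs) is invented for illustration, not taken from any particular specification language:

```python
# A minimal finite-state dialogue specification: each state has a prompt
# and maps (interpreted) user input to the next state.
SPEC = {
    "ask_item":   {"prompt": "Parcel or letter?",
                   "next": {"parcel": "ask_weight", "letter": "ask_dest"}},
    "ask_dest":   {"prompt": "Domestic or international?",
                   "next": {"domestic": "done", "international": "done"}},
    "ask_weight": {"prompt": "What is the weight?",
                   "next": {"weight": "done"}},
    "done":       {"prompt": "Thank you.", "next": {}},
}

def step(state, user_input):
    """Advance the dialogue; unrecognised input stays in the same state,
    which amounts to the simplest repair strategy: re-prompt."""
    return SPEC[state]["next"].get(user_input, state)

state = "ask_item"
state = step(state, "letter")    # -> ask_dest
state = step(state, "mumble")    # not understood -> stay, re-prompt
state = step(state, "domestic")  # -> done
print(SPEC[state]["prompt"])     # Thank you.
```

Real specification languages (HDDL, VoiceXML, covered later in the outline) are richer, but they refine this same state-prompt-transition core.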

  31. 2. Linguistic Resources for SDS: System Output (Prompts)
  • Prompts are the speech output provided to the user of the dialogue system
  • Prompts should:
  • be clear and understandable
  • encourage the user to produce system-friendly speech input
  • convey the personality chosen for the system
  • Other audio sounds ("earcons") can be used in addition to prompts to provide orientation
  • Prompts can be pre-defined, constructed by concatenating partial prompts, or produced by an NL generator

  32. 1. Spoken Dialogue Systems: Speech Output
  • Recorded vs. synthesised speech
  • Recorded speech has higher user acceptance
  • Ensure smooth transitions and appropriate prosody when concatenating recorded speech
  • With a large or highly variable vocabulary, speech synthesis must be used
  • Speech synthesisers are evaluated according to intelligibility and naturalness

  33. 2. Linguistic Resources for SDS: Training Data: Annotated Dialogues
  • Transcribed speech data (not necessarily dialogues) for training the speech recogniser
  • Text data (ideally transcriptions of dialogues from a running application) for training language models and/or optimising recognition grammars
  • Labelled dialogues to determine the likely sequence of dialogue acts (dialogue grammar)
  • Dialogues labelled with communication failures and emotional markup for optimising dialogue specifications
  • Annotated dialogues as a resource for system evaluation

  34. 3. Developing Spoken Dialogue Applications
  • Conflicting requirements: system "intelligence" vs. control of the dialogue flow
  • Imperfections of speech recognition (errors are the rule, not the exception)
  • Limited "understanding" of user utterances (out of vocabulary, out of grammar)
  • The dialogue system must take the initiative after a dialogue failure and try to recover from the errors
  • Personality of the dialogue application

  35. 3. Developing Spoken Dialogue Applications: Development Process
  1. Requirements specification
  2. Definition of the dialogue flow
  3. Rapid prototyping or Wizard-of-Oz experiment (outputs: annotated dialogues, questionnaires, interviews)
  4. Pilot system with basic functionality
  5. Internal tests
  6. Transcription and annotation of dialogues
  7. Optimisation of system functionality
  8. Tests with external users
  9. Extension and tuning of the system
  10. If system performance is not yet satisfactory, go back to step 5

  36. 3. Developing Spoken Dialogue Applications: Tasks and Roles
  • Gather requirements and produce the requirements specification (Analyst)
  • Specify the dialogue flow (Dialogue Designer)
  • Define prompts (Interaction Designer)
  • Write and optimise recognition grammars (Grammar Writer)
  • Usability testing with "real" users (Usability Tester)
  • Transcribe and annotate dialogues from usability testing and the deployed application (Annotator)
  • Test and optimise grammars, language models and dialogues (Quality Assurance Engineer, Grammar Writer, Dialogue Designer)
  • System integration (Software Engineer)

  37. 3. Developing Spoken Dialogue Applications: Dialogue Initiative
  • System initiative: for systems that are not regularly used by the same users
  • User initiative: experienced users can issue commands without system prompts
  • Mixed initiative: e.g., for user questions or activation of help functionality
  • Over-answering of questions by the user

  38. 3. Developing Spoken Dialogue Applications: Barge-In
  • "Barge-in" is the interruption of system output by user input
  • Advantages:
  • the possibility to interrupt long system outputs (e.g. timetable information, reading of e-mails)
  • faster answering of system questions by regular users
  • Problems:
  • interruption of system output by background noise or side speech (to or from colleagues or children)
  • echo cancellation is required to avoid activation of barge-in by the system output itself

  39. 3. Developing Spoken Dialogue Applications: Verification of User Input
  • Verification is the confirmation of user input by the system, with a possibility of correction
  • Explicit verification: the user must confirm the input explicitly, usually by saying "yes" or "no"
  • Implicit verification: the user's input is repeated back, and accepted if the user does not contradict it

  40. 3. Developing Spoken Dialogue Applications: Repair Strategies
  • Misunderstandings and communication problems are common in human-human and in human-machine dialogues
  • Repair strategies are used for recovering from communication failures
  • The relatively poor performance of speech recognisers causes many misunderstandings
  • Repair strategies must therefore be part of every practical dialogue system

  41. 3. Developing Spoken Dialogue Applications: Causes of Communication Problems
  • No speech detected (volume too low)
  • Failure to detect the beginning or end of speech accurately (endpointing)
  • Misrecognitions or no recognition results due to:
  • background noise
  • distorted speech transmission (microphone, phone line)
  • out-of-vocabulary words
  • out-of-grammar input
  • speaker variation
  • lack of semantic interpretation

  42. 3. Developing Spoken Dialogue Applications: Engineering Issues
  • Telephony interfaces (analogue, digital, Voice over IP)
  • Speech signal quality
  • Echo cancellation
  • Scalability to hundreds or thousands of concurrent dialogues
  • Real-time speed
  • System integration

  43. 3. Developing Spoken Dialogue Applications: Speech Recognition API
  • Abstracts from the recognition/synthesis engines
  • Recogniser and synthesiser loading
  • Acoustic and grammar model loading (dynamic updates)
  • Recognition: online; n-best or lattice output
  • Synthesis: markup; barge-in
  • Acoustic control: telephony interface; microphone/speaker interface
  (from Carpenter and Chu-Carroll, 1998)

  44. 3. Developing Spoken Dialogue Applications: Case Study: Postal Rates System
  • Technical basis:
  • Philips SpeechMania with Austrian German acoustic models
  • Windows NT 4.0
  • Pentium III (minimum: Pentium II, 300 MHz)
  • 256 MB RAM (minimum: 64 MB)
  • ISDN card
  • A co-operation between Forschungszentrum Telekommunikation Wien (ftw.) and the Austrian Post Office

  45. 3. Developing Spoken Dialogue Applications: Case Study: System Requirements
  • Mostly casual users
  • System initiative
  • Over-answering is allowed
  • No barge-in
  • Speed of interaction is important

  46. 3. Developing Spoken Dialogue Applications: Example Dialogue (1)
  (German originals with English glosses in brackets; "|" marks prompt-concatenation boundaries; parentheses give what the user actually said where the recogniser output differs)
  S1 (System): Möchten Sie ein Paket oder eine Briefsendung verschicken? [Would you like to send a parcel or a letter?]
  B2 (Benutzer/User): eine Briefsendung [a letter]
  S3: Möchten Sie|den Brief|ins Inland oder ins Ausland verschicken? [Would you like to send|the letter|domestically or abroad?]
  B4: ins Ausland [abroad]
  S5: In welches Zielland wollen Sie|den Brief|schicken? [To which destination country do you want to send|the letter|?]
  B6: nach Italien [to Italy]
  S7: Soll der Brief|nach Italien|als Priority versendet werden? [Should the letter|to Italy|be sent as priority?]
  B8: nein danke bitte ich richtig (nein danke, das ist nicht nötig) [recognised: "no thanks please I right"; actually said: "no thanks, that is not necessary"]
  S9: Bitte geben Sie das Gewicht des Briefes an [Please state the weight of the letter]

  47. 3. Developing Spoken Dialogue Applications: Example Dialogue (2)
  B10: Niue Chile Sudan Bermuda zwanzig Gramm (na ich schätz so zwei hundert zwanzig Gramm) [recognised: "Niue Chile Sudan Bermuda twenty grams"; actually said: "well, I'd guess about two hundred and twenty grams"]
  S11: Nach Italien|sind folgende Sonderbehandlungen zulässig: Einschreiben, Eilsendung und Nachnahme. Wollen Sie eine dieser Sonderbehandlungen nutzen, wenn ja, welche? [To Italy|the following special services are available: registered mail, express delivery and cash on delivery. Do you want to use one of these special services, and if so, which one?]
  B12: nein richtig bitte kostet hier (nein das ist mir zu kompliziert) [recognised: "no right please costs here"; actually said: "no, that is too complicated for me"]
  S13: Danke für Ihre Angaben [Thank you for your information]
  S14: Ein Brief|nach Italien|20|Gramm schwer, der|priority|versendet wird, kostet|7|Schilling [A letter|to Italy|weighing|20|grams, sent|priority|, costs|7|schillings]
  S15: Benötigen Sie eine weitere Auskunft? [Do you need any further information?]
  B16: nein danke [no, thank you]

  48. 4. Dialogue Annotation
  • Purposes of dialogue annotation:
  • linguistic description and analysis on different levels
  • resources for conversation analysis (sociological, socio-linguistic research)
  • resources for system engineering (acoustic models, language models)
  • resources for application development (prompts, recognition grammars, dialogue design)
  • resources for system evaluation

  49. 4. Dialogue Annotation: Annotation Schemas
  • Corpus Encoding Standard
  • MATE
  • ATLAS
  • DAMSL
  The MATE project provides a good overview of annotation schemas

  50. 4. Dialogue Annotation: Spoken Dialogue Corpora
  Human-Human
  • CallHome (spontaneous telephone speech)
  • Map Task (direction giving on a map)
  • Switchboard (task-oriented human-human dialogues)
  • CHILDES (child language dialogues)
  • Verbmobil (appointment scheduling dialogues)
  • TRAINS (task-oriented dialogues in the railroad freight domain)
  Human-Machine
  • Danish Dialogue System (57 dialogues, domestic flight reservation)
  • Philips (13,500 dialogues, train timetable information)
  • Sundial (100 Wizard-of-Oz dialogues, British flight information)
