
Prosody in Generation




Presentation Transcript


  1. Prosody in Generation

  2. Natural Language Generation (NLG)
  • A typical NLG system does:
    • Text planning: transforms a communicative goal into a sequence or structure of elementary goals
    • Sentence planning: chooses linguistic resources to achieve those goals
    • Realization: produces the surface output
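The three stages above can be sketched as a toy program; the function names, the travel-domain goals, and the tiny resource table are all hypothetical illustrations, not any real system's API.

```python
# A minimal sketch of the three NLG stages: text planning, sentence
# planning, and realization. All goals and resources are invented.

def text_planner(goal):
    """Text planning: break a communicative goal into elementary goals."""
    if goal == "confirm_trip":
        return ["confirm_destination", "ask_date"]
    return [goal]

def sentence_planner(elementary_goal):
    """Sentence planning: choose linguistic resources for one goal."""
    resources = {
        "confirm_destination": ("declarative",
                                ["you", "want", "to", "go", "to", "Boston"]),
        "ask_date": ("interrogative",
                     ["when", "do", "you", "want", "to", "travel"]),
    }
    return resources.get(elementary_goal, ("declarative", [elementary_goal]))

def realizer(plan):
    """Realization: produce a surface string from a sentence plan."""
    mood, words = plan
    text = " ".join(words)
    text = text[0].upper() + text[1:]
    return text + ("?" if mood == "interrogative" else ".")

def generate(goal):
    return [realizer(sentence_planner(g)) for g in text_planner(goal)]

print(generate("confirm_trip"))
# ['You want to go to Boston.', 'When do you want to travel?']
```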

  3. Research Directions in NLG
  • Past focus
    • Hand-crafted rules inspired by small corpora
    • Very little evaluation
    • Monologue text generation
  • New directions
    • Large-scale corpus-based learning of system components
    • Evaluation important, but how to do it still unclear
    • Spoken monologue and dialogue

  4. How to produce speech instead of text?

  5. Overview
  • Spoken NLG in Dialogue Systems
  • Text-to-Speech (TTS) vs. Concept-to-Speech (CTS)
  • Current Approaches to CTS
    • Hand-built systems
    • Corpus-based systems
  • NLG Evaluation
  • Open Questions

  6. Importance of NLG in Dialogue Systems
  • Conveying information intonationally, for conciseness and naturalness
    • System turns in dialogue systems can be shorter
      S: Did you say you want to go to Boston?
      S: (You want to go to) Boston H-H%
  • Not providing mis-information through misleading prosody
      …S: (You want to go to) Boston L-L%

  7. Silverman et al ‘93
  • Mimicking human prosody improves transcription accuracy in a reverse telephone directory task
  • Sanderman & Collier ‘97
    • Subjects were quicker to respond to ‘appropriately phrased’ ambiguous responses to questions in a monitoring task
      Q: How did I reserve a room? vs. Which facility did the hotel have?
      A: I reserved a room L-H% in the hotel with the fax.
      A: I reserved a room in the hotel L-H% with the fax.

  8. Overview
  • Spoken NLG in Dialogue Systems
  • Text-to-Speech (TTS) vs. Concept-to-Speech (CTS)
  • Current Approaches to CTS
    • Hand-built systems
    • Corpus-based systems
  • NLG Evaluation
  • Open Questions

  9. Prosodic Generation for TTS
  • Default prosodic assignment from simple text analysis
  • Hand-built rule-based systems: hard to modify and adapt to new domains
  • Corpus-based approaches (Sproat et al ’92)
    • Train prosodic variation on large labeled corpora using machine learning techniques
    • Accent and phrasing decisions
    • Associate prosodic labels with simple features of transcripts

  10. • # of words in phrase
  • distance from beginning or end of phrase
  • orthography: punctuation, paragraphing
  • part of speech, constituent information
  • Apply learned rules to new text
  • Incremental improvements continue:
    • Adding higher-accuracy parsing (Koehn et al ‘00)
      • Collins ‘99 parser
    • More sophisticated learning algorithms (Schapire & Singer ‘00)
    • Better representations: tree-based?
  • Rules always impoverished
  • How to define a Gold Standard?
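The rule-application step can be illustrated with a toy predictor over exactly these features (part of speech, position in phrase, punctuation); the rules below are hand-written stand-ins for what a learner would induce, and the POS tag set is an assumption.

```python
# Toy corpus-style prosodic assignment: predict pitch accents and
# phrase boundaries from simple text features. The rules are invented
# stand-ins for machine-learned ones.

FUNCTION_POS = {"DET", "PREP", "CONJ", "PRON", "AUX"}  # assumed tag set

def predict_prosody(tokens):
    """tokens: list of (word, pos) pairs for one sentence.
    Returns (word, accent-or-None, boundary-or-None) triples."""
    out = []
    n = len(tokens)
    for i, (word, pos) in enumerate(tokens):
        accent = pos not in FUNCTION_POS        # content words get accented
        boundary = word.endswith((",", "."))    # punctuation marks a boundary
        if i == n - 1:
            boundary = True                     # end of phrase
        out.append((word,
                    "H*" if accent else None,
                    "L-L%" if boundary else None))
    return out

sent = [("the", "DET"), ("poachers", "NOUN"), ("control", "VERB"),
        ("the", "DET"), ("trade.", "NOUN")]
for word, acc, bnd in predict_prosody(sent):
    print(word, acc or "-", bnd or "-")
```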

  11. Spoken NLG
  • Decisions in Text-to-Speech (TTS) depend on syntax, information status, topic structure, … information that is explicitly available to NLG
  • Concept-to-Speech (CTS) systems should be able to specify “better” prosody: the system knows what it wants to say and can specify how
  • But … generating prosody for CTS isn’t so easy

  12. Overview
  • Spoken NLG in Dialogue Systems
  • Text-to-Speech (TTS) vs. Concept-to-Speech (CTS)
  • Current Approaches to CTS
    • Hand-built systems
    • Corpus-based systems
  • NLG Evaluation
  • Open Questions

  13. Relying upon Prior Research
  • MIMIC CTS (Nakatani & Chu-Carroll ‘00)
    • Uses the domain attribute/value distinction to drive phrasing and accent: critical information is focused
      Movie: October Sky
      Theatre: Hoboken Theatre
      Town: Hoboken
    • Attribute names and values always accented
    • Values set off by phrase boundaries
    • Information status conveyed by varying accent type (Pierrehumbert & Hirschberg ‘90)
      • Old (given): L*
      • Inferrable (by MIMIC, e.g. theatre name from town): L*+H

  14. • Key (to formulating a valid query): L+H*
  • New: H*
  • Marking Dialogue Acts
    • NotifyFailure:
      U: Where is “The Corrupter” playing in Cranford?
      S: “The Corrupter” [L+H*] is not [L+H*] playing in Cranford [L*+H].
  • Other rules for logical connectives, clarification and confirmation subdialogues
  • Contrastive accent for semantic parallelism (Rooth ‘92, Pulman ‘97), used in GoalGetter and OVIS (Theune ‘99)
      The cat eats fish. The dog eats meat.
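The MIMIC-style rules can be sketched as a small rule-based annotator. The status-to-accent table (given → L*, inferrable → L*+H, key → L+H*, new → H*) follows the slides; the function names, the attribute accent choice, and the bracketed output format are hypothetical.

```python
# A sketch of MIMIC-style accent assignment (after Nakatani &
# Chu-Carroll '00): attributes and values are always accented, the
# accent TYPE is chosen by information status, and values are set off
# by a phrase boundary. Output format is invented for illustration.

ACCENT_BY_STATUS = {
    "given": "L*",         # old information
    "inferrable": "L*+H",  # e.g. theatre name inferrable from town
    "key": "L+H*",         # key to formulating a valid query
    "new": "H*",
}

def annotate(attr, value, status):
    accent = ACCENT_BY_STATUS[status]
    # attribute name accented; value accented and followed by a boundary
    return f"{attr} [H*] {value} [{accent}] [L-]"

print(annotate("Movie", "October Sky", "new"))
print(annotate("Theatre", "Hoboken Theatre", "inferrable"))
```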

  15. But … many counterexamples
  • The association of prosody with many syntactic, semantic, and pragmatic concepts is still an open question
  • Prosody generation from (past) observed regularities and assumptions:
    • Information can be ‘chunked’ usefully by phrasing for easier user understanding
      • But in many different ways
    • Information status can be conveyed by accent:
      • Contrastive information is accented?
        S: You want to go to L+H* Nijmegen, L+H* not Eindhoven.

  16. • Given information is deaccented? Speaker/hearer givenness
      U: I want to go to Nijmegen.
      S: You want to go to H* Nijmegen?
  • Intonational contours can convey speech acts, speaker beliefs:
    • Continuation rise can maintain the floor?
      S: I am going to get you the train information [L-H%].
    • Backchanneling can be produced appropriately?
      S: Okay. Okay? Okaaay… Mhmm…

  17. • Wh- and yes-no questions can be signaled appropriately?
      S: Where do you want to go.
      S: What is your passport number?
  • Discourse/topic structure can be signaled by varying pitch range, pausal duration, rate?

  18. Overview
  • Spoken NLG in Dialogue Systems
  • Text-to-Speech (TTS) vs. Concept-to-Speech (CTS)
  • Current Approaches to CTS
    • Hand-built systems
    • Corpus-based systems
  • NLG Evaluation
  • Open Questions

  19. MAGIC
  • Multimedia (MM) system for presenting cardiac patient data
  • Developed at Columbia by McKeown and colleagues, in conjunction with Columbia Presbyterian Medical Center, to automate post-operative status reporting for bypass patients
  • Uses mostly traditional, hand-developed NLG components
    • Generate text, then annotate it prosodically
  • Corpus-trained prosodic assignment component
    • Corpus: written and oral patient reports
      • 50 min multi-speaker spontaneous speech + 11 min single-speaker read speech
      • 1.24M-word text corpus of discharge summaries

  20. • Transcribed, ToBI-labeled
  • Generator features labeled/extracted:
    • syntactic function
    • p.o.s.
    • semantic category
    • semantic ‘informativeness’ (rarity in corpus)
    • semantic constituent boundary location and length
    • salience
    • given/new
    • focus
    • theme/rheme
    • ‘importance’
    • ‘unexpectedness’

  21. • Very hard to label features
  • Results: new features to specify TTS prosody
    • Of the CTS-specific features, only semantic informativeness (likelihood of occurring in a corpus) useful so far (Pan & McKeown ‘99)
    • Looking at context and word collocation helps predict accent placement (Pan & Hirschberg ‘00)
      RED CELL (less predictable) vs. BLOOD cell (more predictable)
      • The most predictable words are accented less frequently (40-46%) and the least predictable more (73-80%)
      • A unigram+bigram model predicts accent status with 77% (+/- .51) accuracy
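The predictability-based accent decision can be sketched with a toy interpolated unigram+bigram model: a word that is hard to predict from its context ("RED cell") gets accented, a highly predictable one ("BLOOD cell") does not. The corpus, interpolation weights, and threshold below are invented for illustration.

```python
# Toy unigram+bigram predictability model for accent status, in the
# spirit of Pan & Hirschberg '00. All counts and thresholds are toys.

from collections import Counter

corpus = "blood cell blood cell blood cell red cell red flag red flag".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def predictability(prev, word):
    """Equal-weight interpolation of unigram and bigram probability."""
    uni = unigrams[word] / len(corpus)
    bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return 0.5 * uni + 0.5 * bi

def accent(prev, word, threshold=0.5):
    """Unpredictable words get accented."""
    return predictability(prev, word) < threshold

print(accent("red", "cell"), accent("blood", "cell"))  # True False
```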

  22. Stochastic, Corpus-based NLG
  • Generate from a corpus rather than a hand-built system
  • For an MT task, Langkilde & Knight ‘98 over-generate from a traditional hand-built grammar
    • Output composed into a lattice
    • A linear (bigram) language model chooses the best path
  • But …
    • No guarantee of grammaticality
    • How to evaluate/improve results?
    • How to incorporate prosody into this kind of generation model?
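The over-generate-and-rank idea can be sketched as follows: a lattice encodes all candidate realizations, and a bigram language model scores paths through it. The lattice, the vocabulary, and all log-probabilities below are invented toy values, not from any trained model.

```python
# Toy version of lattice + bigram-LM selection in the style of
# Langkilde & Knight '98. Lattice and probabilities are invented.

import math

# lattice: node -> list of (word, next_node); node 0 is start, None is end
lattice = {
    0: [("the", 1), ("a", 1)],
    1: [("poachers", 2)],
    2: [("control", 3), ("controls", 3)],
    3: [("the", 4)],
    4: [("trade", None)],
}

# toy bigram log-probabilities; "<s>" is the sentence-start symbol
logp = {
    ("<s>", "the"): -0.5, ("<s>", "a"): -1.5,
    ("the", "poachers"): -0.7, ("a", "poachers"): -2.0,
    ("poachers", "control"): -0.6, ("poachers", "controls"): -2.5,
    ("control", "the"): -0.4, ("controls", "the"): -0.4,
    ("the", "trade"): -0.9,
}

def best_path(node=0, prev="<s>"):
    """Return (log-prob, words) of the best path from `node` to the end."""
    if node is None:
        return 0.0, []
    best = (-math.inf, [])
    for word, nxt in lattice[node]:
        score, rest = best_path(nxt, word)
        score += logp.get((prev, word), -10.0)  # unseen bigram penalty
        if score > best[0]:
            best = (score, [word] + rest)
    return best

score, words = best_path()
print(" ".join(words))  # the poachers control the trade
```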

  23. FERGUS (Bangalore & Rambow ‘00)
  • Corpus-based learning to refine syntactic, lexical and prosodic choice
  • Domain is the DARPA Communicator task (air travel information)
  • Uses a stochastic tree model + a linear LM + the XTAG (hand-crafted) grammar
  • Trained on WSJ dependency trees tagged with p.o.s. and morphological information, syntactic SuperTags (grammatical function, subcat frame, argument realization), WordNet sense tags, and prosodic labels (accent and boundary)

  24. Input
  • Dependency tree of lexemes
  • Any feature can be specified, e.g. syntactic or prosodic
    (tree: control → { poachers <L+H*>, now, trade → { the, underground } })

  25. Tree Chooser
  • Selects syntactic/prosodic properties for input nodes based on matches with features of mothers and daughters in the corpus
    (tree: control → { poachers <L+H*>, now, trade → { the, underground } })

  26. Unraveler
  • Produces a lattice of all syntactically possible linearizations of the tree, using the XTAG grammar
    (word lattice over linearizations, e.g. “underground poachers trade now control the …”, “now poachers underground trade …”)

  27. Linear Precedence Chooser
  • Finds the most likely lattice traversal, using a trigram language model
    Now [H*] poachers [L+H*] [L-] control the underground trade [H*] [L-L%].
  • Many ways to implement each step
    • How to choose which works ‘best’?
    • How to evaluate output?
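The Linear Precedence Chooser step can be sketched as trigram ranking: score every linearization the Unraveler produced and keep the most likely one. The candidate strings echo the slides; the tiny training text, the smoothing constant, and all names are toy stand-ins for the WSJ-trained model.

```python
# Toy trigram ranking of candidate linearizations, in the spirit of
# FERGUS's Linear Precedence Chooser. Training data is a toy stand-in.

from collections import Counter

train = "<s> <s> now poachers control the underground trade </s>".split()
tri = Counter(zip(train, train[1:], train[2:]))
bi = Counter(zip(train, train[1:]))

def trigram_score(words, alpha=0.1):
    """Add-alpha-smoothed trigram probability of a candidate string."""
    seq = ["<s>", "<s>"] + words + ["</s>"]
    vocab = len(set(train))
    p = 1.0
    for a, b, c in zip(seq, seq[1:], seq[2:]):
        p *= (tri[(a, b, c)] + alpha) / (bi[(a, b)] + alpha * vocab)
    return p

candidates = [
    "now poachers control the underground trade".split(),
    "poachers now control the underground trade".split(),
    "the underground trade now poachers control".split(),
]
best = max(candidates, key=trigram_score)
print(" ".join(best))  # now poachers control the underground trade
```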

  28. Overview
  • Spoken NLG in Dialogue Systems
  • Text-to-Speech (TTS) vs. Concept-to-Speech (CTS)
  • Current Approaches to CTS
    • Hand-built systems
    • Corpus-based systems
  • NLG Evaluation
  • Open Questions

  29. Evaluating NLG
  • How to judge success/progress in NLG is an open question
  • Qualitative measures: preference
  • Quantitative measures:
    • Task performance measures: speed, accuracy
    • Automatic comparison to a reference corpus (e.g. string edit distance and variants, tree-similarity-based metrics)
  • Not always a single “best” solution
  • Critical for stochastic systems to combine qualitative judgments with quantitative measures (Walker et al ’97)
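String edit distance against a reference, one of the quantitative measures mentioned, can be sketched as word-level Levenshtein distance; the token lists below are illustrative.

```python
# Word-level Levenshtein distance of the kind used to compare generated
# output against a reference corpus; each operation costs 1.

def edit_distance(hyp, ref):
    """Minimum insertions/deletions/substitutions turning hyp into ref."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

ref = "now poachers control the underground trade".split()
hyp = "poachers now control the underground trade".split()
print(edit_distance(hyp, ref))  # 2
```

Note that, as the slides observe, such string metrics treat a harmless word-order swap the same as a meaning-changing substitution, which is one motivation for the tree-based metrics on the next slide.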

  30. Qualitative Validation of Quantitative Metrics
  • Subjects judged understandability and quality of candidates proposed by 4 evaluation metrics, each chosen to minimize distance from a Gold Standard (Bangalore, Rambow & Whittaker ‘00)
  • Tree-based metrics correlate significantly with understandability and quality judgments; string metrics do not
  • New objective metrics learned:
    • Understandability accuracy = (1.31 * simple tree accuracy - .10 * substitutions - .44) / .87
    • Quality accuracy = (1.02 * simple tree accuracy - .08 * substitutions - .35) / .67
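The two learned metrics can be written directly as functions of simple tree accuracy and the number of substitutions; the constants are taken from the slide (reading the garbled constant in the first formula as a subtraction of .44). The sanity check plugs in perfect tree accuracy with no substitutions, which yields 1.0 for both metrics under that reading.

```python
# Direct transcription of the learned objective metrics from
# Bangalore, Rambow & Whittaker '00, as quoted on the slide.

def understandability_accuracy(simple_tree_accuracy, substitutions):
    return (1.31 * simple_tree_accuracy - 0.10 * substitutions - 0.44) / 0.87

def quality_accuracy(simple_tree_accuracy, substitutions):
    return (1.02 * simple_tree_accuracy - 0.08 * substitutions - 0.35) / 0.67

# Sanity check: a perfect tree with no substitutions scores 1.0 on both.
print(round(understandability_accuracy(1.0, 0), 6))  # 1.0
print(round(quality_accuracy(1.0, 0), 6))  # 1.0
```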

  31. Overview
  • Spoken NLG in Dialogue Systems
  • Text-to-Speech (TTS) vs. Concept-to-Speech (CTS)
  • Current Approaches to CTS
    • Hand-built systems
    • Corpus-based systems
  • NLG Evaluation
  • Open Questions

  32. More Open Questions for Spoken NLG
  • How much to model the human original?
    • Planning for appropriate intonational variation is important even in recorded prompts
    • Timing and backchanneling
  • What kind of output is most comprehensible?
  • What kind of output elicits the most easily understood user response? (Gustafson et al ’97, Clark & Brennan ‘99)
  • Implementing variations in dialogue strategy
    • Implicit confirmation
    • Mixed initiative
