Speech output

Speech Output - PowerPoint PPT Presentation


PowerPoint Slideshow about 'Speech Output' - sonja


Speech Output

Reading: Reiter and Dale, chap 7

Note: Simplenlg and Protege

  • Simplenlg Lexicaliser creates an SPhraseSpec from a Protégé instance

    • Based on template mapping rules encoded in Protégé

Example

  • SPIKE:

    • Subject = “there”

    • Verb = “is”

    • Complement = “a spike”

    • Modifier = [“in [channel]”, “to [peak_value]”]

    • Channel, peak_value are features of spikes

  • Results in texts such as

    • There is a spike in HR to 160
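The slot-filling idea behind the SPIKE example can be sketched in plain Java. This is only an illustration of a template mapping rule, not the Simplenlg or Protégé API: the class `SpikeTemplate` and its methods `fill` and `realise` are hypothetical names, and the feature values are passed in as a simple map.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a template mapping rule; not the Simplenlg or Protege API.
final class SpikeTemplate {
    static final String SUBJECT = "there";
    static final String VERB = "is";
    static final String COMPLEMENT = "a spike";
    static final List<String> MODIFIERS = List.of("in [channel]", "to [peak_value]");

    // Replace each [feature] placeholder with the instance's value for that feature.
    static String fill(String template, Map<String, String> features) {
        String result = template;
        for (Map.Entry<String, String> e : features.entrySet()) {
            result = result.replace("[" + e.getKey() + "]", e.getValue());
        }
        return result;
    }

    // Join the slots in subject, verb, complement, modifier order and capitalise.
    static String realise(Map<String, String> features) {
        StringBuilder sb = new StringBuilder();
        sb.append(SUBJECT).append(' ').append(VERB).append(' ').append(COMPLEMENT);
        for (String modifier : MODIFIERS) {
            sb.append(' ').append(fill(modifier, features));
        }
        sb.setCharAt(0, Character.toUpperCase(sb.charAt(0)));
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> spike = new LinkedHashMap<>();
        spike.put("channel", "HR");
        spike.put("peak_value", "160");
        System.out.println(realise(spike));
    }
}
```

A real lexicaliser would build a phrase specification rather than a string, so that the microplanner can still change tense or add modifiers before realisation.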

Usage

  • Document Planner decides which instances to include in the text

  • Lexicaliser produces initial SPhraseSpec from these

  • Microplanner modifies SPhraseSpec

    • Add extra modifiers if necessary

      • Eg, “at 10.40” (if different from the time last mentioned)

    • Aggregation

    • Syntactic choice (passive, tense)

    • Referring expression (HR vs Heart Rate)

  • Realiser produces text

Simplenlg and Protege

  • Complex, very much under development

  • Happy to discuss more with interested students

    • Prof Mellish is very interested in NLG and Semantic Web

Different Modalities

  • Many ways to communicate data

    • Visualisation

    • Written text

    • Spoken text (speech)

    • Combinations of above

Speech output

  • Computers can talk as well as write

  • Prerecorded files (eg, WAV)

  • Text-to-speech (TTS)

    • Speaks arbitrary texts

  • Example app: spoken weather forecasts

    • Output of our weather-forecast generator spoken for premium-rate telephone weather information services

Simple approach

  • Problem: speak aloud a written text

  • Simple approach

    • Record people speaking words

    • Given a text, combine recordings for all the words in the text

      • Telephone directory enquiries
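The simple approach amounts to a word-by-word lookup. The sketch below assumes one recording per word, stored under a hypothetical naming scheme (`word.wav`); it only works out which files would be played, in order, and does not play audio.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordConcatenation {
    // Map each word of the text to the (hypothetical) recording file for that word.
    static List<String> recordingsFor(String text) {
        List<String> files = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            // Strip punctuation so "city." and "city" share one recording.
            String word = token.replaceAll("[^\\p{L}\\p{Nd}]", "").toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                files.add(word + ".wav");
            }
        }
        return files;
    }

    public static void main(String[] args) {
        System.out.println(recordingsFor("He lives on Don St."));
    }
}
```

The lookup sees only spellings, so it cannot tell which recording of "lives" or "St." is the right one, and every word is played with the same intonation regardless of context.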

Problems

  • Intonation/prosody

    • Monotone intonation is difficult to understand

  • Cannot determine which word is meant

    • He lives on Don St.

    • St. Louis is a great city.

  • Conventions

    • £20 is twenty pounds, not pound twenty

  • New words (names, technical terms)

Problems

  • Pronouncing symbols

    • Is £ pronounced “pound” or “pounds”?

    • I have £1 vs I have £5 vs I ate a £5 lunch

  • Pronouncing numbers

    • Individual digits or as a whole

    • 01224 273443 vs 1,224,273,443 people
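A text-to-speech front end has to expand symbols and numbers into words before anything is spoken. The sketch below shows the two decisions from this slide under simplifying assumptions: the caller already knows that a string is a phone number or a money amount, amounts are limited to 0-9, and the class and method names (`SpokenForms`, `phoneNumber`, `pounds`) are made up for illustration.

```java
final class SpokenForms {
    private static final String[] DIGITS = {
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"
    };

    // Phone numbers are read digit by digit; spaces and other separators are skipped.
    static String phoneNumber(String number) {
        StringBuilder spoken = new StringBuilder();
        for (char c : number.toCharArray()) {
            if (c >= '0' && c <= '9') {
                if (spoken.length() > 0) {
                    spoken.append(' ');
                }
                spoken.append(DIGITS[c - '0']);
            }
        }
        return spoken.toString();
    }

    // Amounts are read as a whole; "pound" is singular for 1 and before a noun.
    static String pounds(int amount, boolean beforeNoun) {
        if (amount < 0 || amount > 9) {
            throw new IllegalArgumentException("sketch only covers 0-9");
        }
        boolean singular = amount == 1 || beforeNoun;
        return DIGITS[amount] + (singular ? " pound" : " pounds");
    }

    public static void main(String[] args) {
        System.out.println(phoneNumber("01224 273443"));
        System.out.println("I have " + pounds(1, false));
        System.out.println("I have " + pounds(5, false));
        System.out.println("I ate a " + pounds(5, true) + " lunch");
    }
}
```

Note that the symbol is written before the number but spoken after it, and that the hard part in practice is deciding which reading applies, which this sketch leaves to the caller.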

Lexical Disambiguation

  • Which word is meant

    • a cat has nine lives (noun)

    • She lives here (verb)

    • I have a bow and arrow

    • I will not bow to her
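One crude way to pick between the noun and verb pronunciations is to look at the preceding word. The sketch below is a toy heuristic, not how any particular TTS system works: the cue list, the informal respellings ("lyvz" rhymes with "dives", "boh" rhymes with "go") and the class name `Homographs` are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class Homographs {
    // Words that usually precede a noun; a crude stand-in for a real tagger or parser.
    private static final Set<String> NOUN_CUES = Set.of("a", "an", "the", "my", "nine");

    // word -> {noun pronunciation, verb pronunciation}, as informal respellings.
    private static final Map<String, String[]> PRONUNCIATIONS = Map.of(
        "lives", new String[] {"lyvz", "livz"},
        "bow", new String[] {"boh", "bow"});

    static String pronounce(String sentence, String target) {
        String[] options = PRONUNCIATIONS.get(target);
        if (options == null) {
            throw new IllegalArgumentException("no entry for: " + target);
        }
        String[] words = sentence.toLowerCase(Locale.ROOT).split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(target)) {
                // Treat the word as a noun if a determiner or number comes right before it.
                boolean noun = i > 0 && NOUN_CUES.contains(words[i - 1]);
                return options[noun ? 0 : 1];
            }
        }
        throw new IllegalArgumentException("word not found: " + target);
    }

    public static void main(String[] args) {
        System.out.println(pronounce("a cat has nine lives", "lives"));
        System.out.println(pronounce("She lives here", "lives"));
        System.out.println(pronounce("I have a bow and arrow", "bow"));
        System.out.println(pronounce("I will not bow to her", "bow"));
    }
}
```

A heuristic like this breaks quickly (for example when an adjective sits between the determiner and the noun), which is why fuller systems rely on parsing or statistical tagging.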

Sophisticated text-to-speech

  • Determine grammatical structure

    • parsing

    • statistical techniques

  • Use this to determine

    • How to pronounce symbols, numbers

    • Lexical disambiguation

    • Rhetorical structure (for intonation)

Example: AT&T Natural Voices

  • One of several commercial TTS systems

  • Nice demo at

    • http://www.research.att.com/~ttsweb/tts/demo.php

Prosodic Structure

  • Pitch change shows sentence type [?, !, .]

    • Hello.

    • Hello!

    • Hello?

  • Stress reflects importance, new information

    • *Mary gave John a book

    • Mary *gave John a book

    • etc

Pronunciation of new words

  • Eg, “Inverurie”

  • Rule-based

    • Use rules describing how phonemes are said in different contexts

    • Maybe models of human vocal cords, mouth

  • Concatenative

    • library of acoustic units, human-spoken

    • merged together for new words

  • Problems with both approaches

Markups

  • Speech markups (low-level)

    • pause

    • speed

    • volume

    • pitch

    • type (money, phone number)

  • Competing standards:

    • SAPI (Microsoft)

    • SSML (W3C)
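Low-level markup of this kind can be generated from a program. The sketch below builds a small SSML document as a string using the `break`, `prosody` and `say-as` elements. The helper class `SsmlSketch`, the sample sentence and the `telephone` value for `interpret-as` are assumptions for illustration; which `say-as` types are supported depends on the synthesiser.

```java
final class SsmlSketch {
    // Escape characters that would otherwise be read as markup.
    static String text(String raw) {
        return raw.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // A pause with the synthesiser's default length.
    static String pause() {
        return "<break/>";
    }

    // Speak the enclosed text at a higher volume.
    static String loud(String raw) {
        return "<prosody volume=\"loud\">" + text(raw) + "</prosody>";
    }

    // Tell the synthesiser what kind of thing the text is (e.g. a phone number).
    static String sayAs(String type, String raw) {
        return "<say-as interpret-as=\"" + type + "\">" + text(raw) + "</say-as>";
    }

    // Wrap already-built fragments in the SSML root element.
    static String speak(String... fragments) {
        return "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\""
            + " xml:lang=\"en-GB\">" + String.join(" ", fragments) + "</speak>";
    }

    public static void main(String[] args) {
        System.out.println(speak(
            text("Please call"),
            pause(),
            sayAs("telephone", "01224 273443"),
            loud("now")));
    }
}
```

Every element that is opened has to be closed, and plain text has to be escaped, which is the main reason to build the markup through helpers rather than by hand.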

Example

I want to go

<break/> <prosody volume="loud">



Speech Markups

  • Higher level markups

    • emphasis, deemphasis

    • character (eg, whisper) ??

    • emotion ???

    • Voice (accent, gender, age, …) ??

When is speech useful?

  • Ideas from class?

When (not) useful

  • Useful

    • Get attention (eg, urgent warning)

    • No screen or hands busy (eg, diver in water)

    • For visually impaired users

  • Not useful

    • Distracting (“you have spam”)

    • Long messages (text can be reread!)

    • Noisy environments

    • Deaf users

Systems

  • FreeTTS – free Java-based text-to-speech

    • Low voice quality, limited functionality, easy to use

  • Microsoft – Speech SDK

    • Higher quality, more functionality than FreeTTS

    • Tied to Windows; emphasises VB, .NET, etc

  • Commercial – highest quality

    • Natural Voices, RealSpeak, …

    • rVoice (Scottish software, mostly defunct)

Digression: rVoice

  • From Rhetorical Systems

    • Edinburgh Uni spinout

      • From Festival, also source of FreeTTS (practical)

    • High-profile “success story” of high-tech Scotland

  • rVoice

    • Very high quality voices (best in world?)

    • Could imitate a real person

Digression: rVoice

  • Not very successful as a business

    • Too expensive?

      • Some users (eg, blind people) wanted a cheap solution

      • When high-quality voices needed (weather info), cheaper to hire people to speak messages

  • Recently bought by a competitor

    • Essentially being closed down, customers encouraged to move to the competitor's product

  • Sad…

Speech output from Java

  • Set up system

  • Set up a voice

  • Call “speak” method

  • (some systems) wait until speech finished

    • Speech takes time, system can do something else while speech is happening
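The last point can be sketched with a single-threaded executor: the call to speak returns at once, and the program waits only when it needs to. The `BackgroundSpeech` class is hypothetical and takes any `Consumer<String>` as the speaker, so the demo prints the text instead of calling a real synthesiser.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class BackgroundSpeech {
    // One worker thread, so utterances are spoken one at a time in submission order.
    private final ExecutorService queue = Executors.newSingleThreadExecutor();
    private final Consumer<String> speaker;

    BackgroundSpeech(Consumer<String> speaker) {
        this.speaker = speaker;
    }

    // Returns immediately; the caller can do other work while speech is happening.
    Future<?> speakAsync(String text) {
        return queue.submit(() -> speaker.accept(text));
    }

    // Blocks until the utterance has been spoken.
    void speakAndWait(String text) {
        try {
            speakAsync(text).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while speaking", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("speaker failed", e.getCause());
        }
    }

    void shutdown() {
        queue.shutdown();
    }

    public static void main(String[] args) {
        // Stand-in speaker that prints instead of talking (assumption for the demo).
        BackgroundSpeech speech = new BackgroundSpeech(t -> System.out.println("speaking: " + t));
        speech.speakAsync("Mary had a little lamb.");
        System.out.println("doing other work while speech is happening");
        speech.speakAndWait("Its fleece was white as snow.");
        speech.shutdown();
    }
}
```

Whether a given TTS library blocks in its own speak call or queues the text itself varies, so a wrapper like this is only needed for the blocking kind.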

FreeTTS example

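// Needs com.sun.speech.freetts.Voice and com.sun.speech.freetts.VoiceManager
VoiceManager voiceManager = VoiceManager.getInstance();

Voice helloVoice = voiceManager.getVoice("kevin16");

helloVoice.allocate();  // load the voice before speaking

helloVoice.speak("Mary had a little lamb.");

helloVoice.deallocate();  // release the voice when done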

Advanced topic: concept-to-text

  • Currently NLG systems produce text, which is fed into speech synthesiser

  • But speech quality should improve if the NLG system gave more information

    • Syntactic structure (for pauses)

    • Desired meaning of word (for pronunciation)

    • Importance (for emphasis)

  • How should NLG and speech be integrated?
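One possible answer is for the NLG system to emit speech markup directly instead of plain text. The sketch below assumes the generator knows its clause boundaries and which words are important, and turns them into SSML `break` and `emphasis` elements. The class `ConceptToSpeech` and its input format are hypothetical, and words are assumed to contain no markup characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class ConceptToSpeech {
    // clauses: each clause is a list of words; important: words the generator wants stressed.
    static String toSsml(List<List<String>> clauses, Set<String> important) {
        List<String> rendered = new ArrayList<>();
        for (List<String> clause : clauses) {
            List<String> words = new ArrayList<>();
            for (String word : clause) {
                // Importance becomes emphasis markup.
                words.add(important.contains(word) ? "<emphasis>" + word + "</emphasis>" : word);
            }
            rendered.add(String.join(" ", words));
        }
        // Syntactic structure becomes pauses between clauses.
        return "<speak>" + String.join(" <break/> ", rendered) + "</speak>";
    }

    public static void main(String[] args) {
        System.out.println(toSsml(
            List.of(List.of("Mary", "gave", "John", "a", "book")),
            Set.of("Mary")));
    }
}
```

The point of the design is that the synthesiser no longer has to guess at structure and importance from the surface text, because the generator already had that information.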

Speech Input

  • Talk to the computer instead of typing

  • Commands (select from limited list)

    • Like cinema information line

      • Eg, say the name of the movie you want to watch

  • Dictation

    • Dictate arbitrary texts

    • In recent versions of Office

  • Many errors

Speech dialogue

  • Dialogue with the computer, just like in science fiction movies

    • C: your first ascent was dangerous

    • H: why?

    • C: because you came up too quickly

    • H: what should I have done?

    • C: you should have taken 5 minutes to come up instead of 3 minutes

Speech dialogue

  • Key problems are

    • (a) dealing with speech input errors

      • Need to unobtrusively check that the input was understood correctly

    • (b) dealing with strange things users say

      • Speech allows them to say anything, and they do!

    • (c) interpolating from ambiguous data

      • Does “Aberdeen” mean “Aberdeen, UK”, “Aberdeen, Maryland”, etc?
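Problems (a) and (c) can be handled together: resolve the ambiguous name using context, then echo the interpretation inside the next question so the user can object if it is wrong. The sketch below is a toy version of that idea. The gazetteer, the use of the user's region as context and the class name `PlaceResolver` are assumptions for illustration.

```java
import java.util.List;
import java.util.Map;

final class PlaceResolver {
    // Tiny hypothetical gazetteer: spoken name -> candidate places.
    private static final Map<String, List<String>> GAZETTEER = Map.of(
        "Aberdeen", List.of("Aberdeen, Scotland", "Aberdeen, Maryland"));

    // Prefer the candidate in the user's region; otherwise fall back to the first candidate.
    static String resolve(String spoken, String userRegion) {
        List<String> candidates = GAZETTEER.getOrDefault(spoken, List.of(spoken));
        for (String candidate : candidates) {
            if (candidate.endsWith(", " + userRegion)) {
                return candidate;
            }
        }
        return candidates.get(0);
    }

    // Implicit confirmation: the interpretation is repeated inside the next question.
    static String nextPrompt(String spoken, String userRegion) {
        return "What time do you wish to depart from " + resolve(spoken, userRegion) + "?";
    }

    public static void main(String[] args) {
        System.out.println(nextPrompt("Aberdeen", "Scotland"));
    }
}
```

Repeating the resolved name costs the user nothing when it is right, and gives them a natural point to correct the system when it is wrong.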

Example

User: Hello, I want to fly to London next Thursday

System: What airport will you be flying from when you go to London, UK?

User: Aberdeen

System: What time on Thursday, 16 March, do you wish to depart from Aberdeen, Scotland?

User: mid-morning

System: BA 1305 leaves Aberdeen at 940 and arrives into London Heathrow at 1115. Should I book one seat for you on Thursday, 16 March?

Conclusion

  • Texts can be spoken instead of (or as well as) written

    • Harder than it seems, but technology exists and is getting better

  • Useful in some situations

  • In the longer term, speech input and dialogue