
Speech Technology


mariah

Presentation Transcript


  1. Speech Technology

  2. HOT!

  3. What are the big players in the area up to? • Google • http://googleblog.blogspot.com/2010/12/can-we-talk-better-speech-technology.html • Microsoft • http://gigaom.com/2010/12/06/microsoft-claims-its-place-in-a-voice-enabled-world/ • Apple • http://www.dailyfinance.com/story/company-news/apples-siri-purchase-heats-up-the-race-toward-a-voice-activated/19458344/ • IBM • http://www.ibm.com/news/in/en/2010/08/20/a896686u56875f96.html • Nuance • http://gigaom.com/2011/01/19/nuance-releases-mobile-sdk-to-speechify-apps/ • Voxeo

  4. Apple, and the case of Siri • Siri: http://www.youtube.com/watch?v=MpjpVAB06O4 • Review of Siri: http://www.youtube.com/watch?v=AohzWSkAU7c&feature=watch_response

  5. Types of dialog systems • by modality • text-based • spoken • graphical user interface • multi-modal • by device • telephone-based systems • PDA systems • in-car systems • robot systems • desktop/laptop systems • native • in-browser systems • in-virtual machine • in-virtual environment • robots • by style • command-based • menu-driven • natural language • by initiative • system initiative • user initiative • mixed initiative • by application • information service • command-and-control • entertainment • education/tutorial • edutainment • reminder systems • companion systems • healthcare • eldercare • assistive/access systems

  6. More about application types • Information providing systems: • weather reports • stock quotes • timetables • ... • Transaction-based systems: • calendar functions • shopping • financial transactions • travel reservations • ...

  7. Why Voice?

  8. Why voice? • Wireless devices have small screens and limited input capabilities. • A telephone keypad gives users only a limited number of choices. • Speech technology is improving. • The exchange of information between a person and a computer is becoming more like a real conversation. • Users want hands-free or eyes-free use. • From a business viewpoint, voice applications open up a host of new revenue opportunities. • There are many more telephones than computers with the potential to access the Internet.

  9. Traditional Interactive Voice Response (IVR)

  10. Speech versus Touch Tone

  11. Architecture 1

  12. Architecture 2

  13. Today • Presentation of project ideas • TTS evaluation • Short intro to XML • Speech technology standards overview • Speech Synthesis Markup Language (SSML) • Presentation of home assignment 3: ASR evaluation

  14. Project ideas?

  15. Intro to XML

  16. W3C Speech Standards Torbjörn Lager

  17. VoiceXML – a part of the web • Web servers serve HTML to an HTML browser and VoiceXML to a VoiceXML browser (ASR, TTS, interpreter)

  18. The place of speech technology • … speech technology itself has a very long way to go. … the most important thing may turn out to be not the speech technology itself, but the way in which speech technology connects to all the other technologies. Tim Berners-Lee

  19. The What and Why of Standards • Software standards include terminology, languages and protocols specified by committees of experts for widespread use in the software industry. Software standards have both advantages and disadvantages. • Advantages: • developers can create applications using the standard languages that are portable across a variety of platforms; • products from different vendors are able to interact with each other; • a community of experts evolves around the standard and is available to develop products and services based on the standard. • Disadvantages: • some developers feel that standards may inhibit creativity and stall the introduction of superior technology. • However, in the area of speech, vendors are enthusiastic about standards and frequently complain that standards are not developed fast enough. • Emerging speech-technology standards could give a boost to an industry hampered by proprietary software and hardware.

  20. World Wide Web Consortium http://www.w3.org/

  21. W3C Speech Standards • Speech Recognition Grammar Specification (SRGS) – • What the user can say • Semantic Interpretation for Speech Recognition (SISR) – • What the user means • Speech Synthesis Markup Language (SSML) – • What the user hears • VoiceXML – • Dialog management: What the system is to do

  22. Speech Recognition Grammar Specification (SRGS) • Covers both speech and DTMF (Dual-Tone Multi-Frequency) input. (DTMF is valuable in noisy conditions or when the social context makes it awkward to speak.) • Grammars can be specified in either an XML or an equivalent augmented BNF (ABNF) syntax. • Speech recognition is an inherently uncertain process. Recognizers may report confidence values. • If the utterance has several possible parses, the recognizer may be able to report the most likely alternatives (N-best results). • What about statistical language models? Not covered by SRGS!

  23. Semantic Interpretation for Speech Recognition (SISR) <grammar root="answer"> <rule id="answer" scope="public"> <one-of> <item><ruleref uri="#yes"/></item> <item><ruleref uri="#no"/></item> </one-of> </rule> <rule id="yes"> <one-of> <item>yes</item> <item>yeah<tag>yes</tag></item> <item><token>you bet</token><tag>yes</tag></item> <item xml:lang="fr-CA">oui<tag>yes</tag></item> </one-of> </rule> <rule id="no"> <one-of> <item>no</item> <item>nope</item> <item>no way</item> </one-of> <tag>no</tag> </rule> </grammar>
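The grammar above maps several surface forms onto a single semantic value. A minimal Python sketch of that mapping follows; the table `YES_NO` and the function `interpret` are hypothetical names chosen for illustration and are not part of SRGS or SISR.

```python
# Hypothetical lookup table mirroring the <item>/<tag> pairs of the yes/no grammar.
YES_NO = {
    "yes": "yes",
    "yeah": "yes",
    "you bet": "yes",
    "oui": "yes",
    "no": "no",
    "nope": "no",
    "no way": "no",
}


def interpret(utterance):
    """Return the semantic value of an utterance, or None if it is out of grammar."""
    # Lower-case and collapse whitespace so "You  bet" matches "you bet".
    key = " ".join(utterance.lower().split())
    return YES_NO.get(key)
```

A real recognizer would work on audio and report confidence values; this sketch only shows the text-to-meaning step.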

  24. Semantic Interpretation for Speech Recognition (SISR) • I would like a coca cola and three large pizzas with pepperoni and mushrooms { drink: { liquid:"coke", drinksize:"medium"}, pizza: { number: 3, pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] } }

  25. <grammar root="order"> <rule id="order"> I would like a <ruleref uri="#drink"/> <tag>out.drink = new Object(); out.drink.liquid=rules.drink.type; out.drink.drinksize=rules.drink.drinksize;</tag> and <ruleref uri="#pizza"/> <tag>out.pizza=rules.pizza;</tag> </rule> <rule id="kindofdrink"> <one-of> <item>coke</item> <item>pepsi</item> <item>coca cola<tag>out="coke";</tag></item> </one-of> </rule> <rule id="foodsize"> <tag>out="medium";</tag> <item repeat="0-1"> <one-of> <item>small<tag>out="small";</tag></item> <item>medium</item> <item>large<tag>out="large";</tag></item> <item>regular<tag>out="medium";</tag></item> </one-of> </item> </rule> <rule id="tops"> <tag>out=new Array;</tag> <ruleref uri="#top"/> <tag>out.push(rules.top);</tag> <item repeat="1-"> and <ruleref uri="#top"/> <tag>out.push(rules.top);</tag> </item> </rule> <rule id="top"> <one-of> <item>anchovies</item> <item>pepperoni</item> <item>mushroom<tag>out="mushrooms";</tag></item> <item>mushrooms</item> </one-of> </rule> <rule id="drink"> <ruleref uri="#foodsize"/> <ruleref uri="#kindofdrink"/> <tag>out.drinksize=rules.foodsize; out.type=rules.kindofdrink;</tag> </rule> <rule id="pizza"> <ruleref uri="#number"/> <ruleref uri="#foodsize"/> <tag>out.pizzasize=rules.foodsize; out.number=rules.number;</tag> pizzas with <ruleref uri="#tops"/> <tag>out.topping=rules.tops;</tag> </rule> <rule id="number"> <one-of> <item> <tag>out=1;</tag> <one-of> <item>a</item> <item>one</item> </one-of> </item> <item>two<tag>out=2;</tag></item> <item>three<tag>out=3;</tag></item> </one-of> </rule> </grammar> I would like a coca cola and three large pizzas with pepperoni and mushrooms { drink: { liquid:"coke", drinksize:"medium" }, pizza: { number: 3, pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] }}
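To make the link between the grammar and its result object concrete, here is a rough Python sketch that imitates the order grammar with a regular expression. It is not how an SRGS/SISR processor works internally; the tables, the helper `alt`, and the function `interpret_order` are all hypothetical names introduced for this illustration.

```python
import re

# Hypothetical tables mirroring the <item>/<tag> pairs of the grammar.
SIZES = {"small": "small", "medium": "medium", "large": "large", "regular": "medium"}
DRINKS = {"coke": "coke", "pepsi": "pepsi", "coca cola": "coke"}
NUMBERS = {"a": 1, "one": 1, "two": 2, "three": 3}
TOPPINGS = {"anchovies": "anchovies", "pepperoni": "pepperoni",
            "mushroom": "mushrooms", "mushrooms": "mushrooms"}


def alt(table):
    """Build a regex alternation, longest key first so 'mushrooms' beats 'mushroom'."""
    return "|".join(sorted(map(re.escape, table), key=len, reverse=True))


ORDER = re.compile(
    rf"i would like a (?:(?P<dsize>{alt(SIZES)}) )?(?P<drink>{alt(DRINKS)})"
    rf" and (?P<num>{alt(NUMBERS)}) (?:(?P<psize>{alt(SIZES)}) )?pizzas with"
    rf" (?P<tops>(?:{alt(TOPPINGS)})(?: and (?:{alt(TOPPINGS)}))+)$"
)


def interpret_order(text):
    """Return the semantic result for an order, or None if it is out of grammar."""
    m = ORDER.match(" ".join(text.lower().split()))
    if m is None:
        return None
    return {
        "drink": {
            "liquid": DRINKS[m["drink"]],
            # A missing size defaults to "medium", as in the foodsize rule.
            "drinksize": SIZES.get(m["dsize"], "medium"),
        },
        "pizza": {
            "number": NUMBERS[m["num"]],
            "pizzasize": SIZES.get(m["psize"], "medium"),
            "topping": [TOPPINGS[t] for t in m["tops"].split(" and ")],
        },
    }
```

Like the `tops` rule with `repeat="1-"`, the sketch requires at least two toppings joined by "and", so an order with a single topping is rejected.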

  26. Foundational • Grammar (CFG, PSG) • Automata theory (FSMs, FSTs, etc) • Logic • Phonetics • Linguistics • Computer science

  27. Speech Synthesis Markup Language (SSML) • The key concepts of SSML are • interoperability, or interacting with other markup languages (VoiceXML, etc.); • consistency, or providing predictable control of voice output across platforms and across speech synthesis implementations; and • internationalization, or enabling speech output in a large number of languages within or across documents.

  28. Speech Synthesis Markup Language (SSML) – An Example <speak> <p> <s xml:lang="en-US"> <voice name="David" gender="male" age="25"> For English, press <emphasis>one</emphasis>. </voice> </s> <s xml:lang="es-MX"> <voice name="Miguel" gender="male" age="25"> Para español, oprima el <emphasis>dos</emphasis>. </voice> </s> </p> </speak>
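A document like the one above can also be generated programmatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `language_menu` and the tuple layout of its argument are assumptions made for this example, and only the `name` attribute of `voice` is set.

```python
import xml.etree.ElementTree as ET

# The reserved XML namespace; ElementTree serializes it with the "xml" prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def language_menu(prompts):
    """Build an SSML string from (lang, voice_name, lead_text, key_word) tuples."""
    speak = ET.Element("speak")
    paragraph = ET.SubElement(speak, "p")
    for lang, voice_name, lead_text, key_word in prompts:
        sentence = ET.SubElement(paragraph, "s", {XML_LANG: lang})
        voice = ET.SubElement(sentence, "voice", {"name": voice_name})
        voice.text = lead_text
        emphasis = ET.SubElement(voice, "emphasis")
        emphasis.text = key_word
        emphasis.tail = "."  # closing full stop after the emphasised word
    return ET.tostring(speak, encoding="unicode")
```

For instance, `language_menu([("en-US", "David", "For English, press ", "one")])` yields the English half of the menu.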

  29. Text Structure: p and s Elements • A p element represents a paragraph. An s element represents a sentence. <speak> <p> <s>This is the first sentence of the paragraph.</s> <s>Here's another sentence.</s> </p> </speak>

  30. The phoneme Element • The phoneme element provides a phonemic/phonetic pronunciation for the contained text. <speak> <phoneme alphabet="ipa" ph="t&#x259;mei&#x325;&#x27E;ou&#x325;">tomato</phoneme> </speak>

  31. The sub Element • The sub element is employed to indicate that the text in the alias attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. <?xml version="1.0"?> <speak> <sub alias="World Wide Web Consortium">W3C</sub> </speak>
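The effect of `sub` can be illustrated by computing the text a synthesizer would speak. This is a small sketch, assuming SSML without an XML namespace (as in the slide examples); `spoken_text` is a hypothetical helper, not part of any SSML library.

```python
import xml.etree.ElementTree as ET


def spoken_text(ssml):
    """Return the text to be spoken, replacing each <sub> by its alias."""
    root = ET.fromstring(ssml)

    def walk(element):
        if element.tag == "sub":
            # The alias replaces the written form for pronunciation.
            text = element.get("alias", "")
        else:
            text = (element.text or "") + "".join(walk(child) for child in element)
        return text + (element.tail or "")

    # Collapse whitespace left over from the markup layout.
    return " ".join(walk(root).split())
```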

  32. The voice Element • The voice element is a production element that requests a change in speaking voice. A selection of attributes is: • gender: optional attribute indicating the preferred gender of the voice to speak the contained text. Enumerated values are: "male", "female", "neutral". • age: optional attribute indicating the preferred age in years (since birth) of the voice to speak the contained text. • name: optional attribute indicating a processor-specific voice name to speak the contained text. <?xml version="1.0"?> <speak> <voice gender="female">Mary had a little lamb,</voice> <!-- now request a different female child's voice --> <voice gender="female" age="7">Its fleece was white as snow.</voice> <!-- processor-specific voice selection --> <voice name="Mike">I want to be like Mike.</voice> </speak>

  33. The emphasis Element • The emphasis element requests that the contained text be spoken with emphasis. <speak> That is a <emphasis> big </emphasis> car! That is a <emphasis level="strong"> huge </emphasis> bank account! </speak>

  34. The break Element • The break element is an empty element that controls the pausing or other prosodic boundaries between words. <speak> Take a deep breath <break/> then continue. Press 1 or wait for the tone. <break time="3s"/> I didn't hear you! <break strength="weak"/> Please repeat. </speak>
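As a rough illustration of the `time` attribute, the sketch below adds up the explicitly timed pauses in a document. It assumes SSML without an XML namespace; `parse_time` and `explicit_pause_seconds` are hypothetical helpers. Breaks without a `time` attribute are ignored because their length is left to the processor.

```python
import xml.etree.ElementTree as ET


def parse_time(value):
    """Convert a time value such as "3s" or "250ms" to seconds."""
    value = value.strip()
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    if value.endswith("s"):
        return float(value[:-1])
    raise ValueError("unsupported time value: %r" % value)


def explicit_pause_seconds(ssml):
    """Sum the durations of all <break> elements that carry a time attribute."""
    root = ET.fromstring(ssml)
    return sum(parse_time(b.get("time")) for b in root.iter("break") if b.get("time"))
```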

  35. The prosody Element • The prosody element permits control of the pitch, speaking rate and volume of the speech output. • The attributes, all optional, are: • pitch: the baseline pitch for the contained text. Although the exact meaning of "baseline pitch" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the approximate pitch of the output. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch levels. • contour: sets the actual pitch contour for the contained text. The format is specified in Pitch contour below. • range: the pitch range (variability) for the contained text. Although the exact meaning of "pitch range" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the dynamic range of the output pitch. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch ranges. • rate: a change in the speaking rate for the contained text. Legal values are: a relative change or "x-slow", "slow", "medium", "fast", "x-fast", or "default". Labels "x-slow" through "x-fast" represent a sequence of monotonically non-decreasing speaking rates. When a number is used to specify a relative change it acts as a multiplier of the default rate. For example, a value of 1 means no change in speaking rate, a value of 2 means a speaking rate twice the default rate, and a value of 0.5 means a speaking rate of half the default rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. The default rate for a voice should be such that it is experienced as a normal speaking rate for the voice when reading aloud text. Since voices are processor-specific, the default rate will be as well. • duration: a value in seconds or milliseconds for the desired time to take to read the element contents. Follows the time value format from the Cascading Style Sheet Level 2 Recommendation [CSS2], e.g. "250ms", "3s". • volume: the volume for the contained text in the range 0.0 to 100.0 (higher values are louder and specifying a value of zero is equivalent to specifying "silent"). Legal values are: number, a relative change or "silent", "x-soft", "soft", "medium", "loud", "x-loud", or "default". The volume scale is linear amplitude. The default is 100.0. Labels "silent" through "x-loud" represent a sequence of monotonically non-decreasing volume levels.
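The numeric `rate` and `volume` values described above can be sketched in a few lines of Python. The function names are hypothetical, the default rate in words per minute is an invented figure (real defaults are processor-specific), and clamping out-of-range volumes is an assumption of this sketch rather than something the slide states.

```python
def apply_rate(default_wpm, multiplier):
    """Scale a voice's default speaking rate by a numeric SSML rate value."""
    if multiplier <= 0:
        raise ValueError("rate multiplier must be positive")
    return default_wpm * multiplier


def clamp_volume(volume):
    """Keep a numeric volume inside the 0.0 to 100.0 range (assumed behaviour)."""
    return max(0.0, min(100.0, volume))
```

With an assumed default of 160 words per minute, `apply_rate(160, 2)` gives 320 and `apply_rate(160, 0.5)` gives 80.0.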

  36. The prosody Element (cont’d) • Pitch contour. The pitch contour is defined as a set of white space-separated targets at specified time positions in the speech output. • The algorithm for interpolating between the targets is processor-specific. • In each pair of the form (time position,target), the first value is a percentage of the period of the contained text (a number followed by "%") and the second value is the value of the pitch attribute (a number followed by "Hz", a relative change, or a label value). <?xml version="1.0"?> <speak> <prosody contour="(0%,+20Hz) (10%,+30%) (40%,+10Hz)"> good morning </prosody> </speak>
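The contour format is easy to take apart with a regular expression. The sketch below only splits the attribute into (time position, target) pairs; it does not interpolate between targets, since that algorithm is processor-specific. `PAIR` and `parse_contour` are hypothetical names.

```python
import re

# One "(time%,target)" pair; the target is kept as a string such as "+20Hz" or "+30%".
PAIR = re.compile(r"\(\s*([0-9.]+)%\s*,\s*([^)\s]+)\s*\)")


def parse_contour(contour):
    """Return [(time_percent, target), ...] from a prosody contour attribute."""
    return [(float(position), target) for position, target in PAIR.findall(contour)]
```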

  37. Today • Project reminder • Presentation of the results of the TTS evaluation • Speech Synthesis Poetry Slam • Wrapping up TTS (stages of TTS) • Presentation of home assignment 3: ASR evaluation • Automatic speech recognition (ASR) • Natural language understanding (NLU) • Speech Recognition Grammar Specification (SRGS) • Semantic Interpretation for Speech Recognition (SISR) • Thursday's Lab session

  38. Architecture 1

  39. Wrapping up TTS • Stages of TTS: • Structure analysis (sentence splitting) • Text normalisation • Text to phoneme conversion • Prosody analysis • Waveform production • Speech Synthesis Markup Language • enables developers to override default behavior
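The first two stages can be sketched very naively in Python. Real TTS front ends handle abbreviations, dates, currency and much more; the functions `split_sentences` and `normalise` below are hypothetical and cover only sentence-final punctuation and digit-by-digit reading of numbers.

```python
import re

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def split_sentences(text):
    """Structure analysis: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def normalise(sentence):
    """Text normalisation: read every number digit by digit."""
    return re.sub(
        r"[0-9]+",
        lambda m: " ".join(DIGITS[int(d)] for d in m.group()),
        sentence,
    )
```

SSML elements such as `s`, `sub` and `phoneme` let a developer override exactly these default decisions.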

  40. TTS stages and SSML elements

  41. Prosody analysis • Prosody covers pitch (intonation or melody), timing (rhythm), pauses, speech rate, emphasis on words, and the relative timing of segments and pauses. • Most TTS engines have a prosody analysis algorithm that produces the prosody of the synthesized speech, often based on parts of speech. For example, nouns, verbs, and adjectives may be accented, whereas auxiliary verbs and prepositions may be de-stressed. • Synthesized speech pauses at commas and inflects according to whether the sentence is declarative, interrogative, or exclamatory. • Prosody rules and algorithms are not perfect and are a topic of ongoing research. Prosody rules may differ considerably between languages and dialects. For example, the prosody of American, British, Indian, and Jamaican English differs.

  42. Speech Recognition (ASR)

  43. Architecture 1

  44. ASR Input and Output • A speech recognizer is a component with the following inputs and outputs: • Input • A grammar or multiple grammars as defined by the SRGS specification. These grammars inform the recognizer of the words and patterns of words to listen for. • An audio stream that may contain speech content that matches the grammar(s). • Parameters: timeouts, recognition thresholds, or N-best result counts. • Output • Descriptions of results that indicate details about the speech content detected by the speech recognizer. Recognizers will include at least a transcription of any detected words. • Errors and other performance information, such as confidence values.
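How a recognition threshold and an N-best count shape the output can be sketched as follows. The `Hypothesis` class, the function `select_results`, and the default parameter values are hypothetical; real recognizers expose these settings through their own interfaces.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    transcription: str   # the words the recognizer believes it heard
    confidence: float    # score between 0.0 and 1.0


def select_results(hypotheses, threshold=0.5, n_best=3):
    """Keep at most n_best hypotheses whose confidence reaches the threshold."""
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    return [h for h in ranked if h.confidence >= threshold][:n_best]
```

An empty result list would correspond to a rejection, in which case a dialog system typically asks the user to repeat.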

  45. SRGS

  46. SRGS <grammar root="s"> <rule id="s"> hello </rule> </grammar> s -> "hello"

  47. SRGS <grammar root="s"> <rule id="s"> <one-of> <item>hello</item> <item>goodbye</item> </one-of> </rule> </grammar> s -> "hello" s -> "goodbye" s -> "hello" | "goodbye"
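For grammars like the two above, every sentence can be listed mechanically. The sketch below does this with Python's standard `xml.etree.ElementTree`; it assumes grammars without an XML namespace and handles only plain text, `one-of` and `item` (no `repeat`, `ruleref` or `tag`). The names `expand` and `sentences` are hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET


def expand(element):
    """Return every word sequence an element can produce."""
    if element.tag == "one-of":
        # Alternatives: the union of what each item can produce.
        return [seq for item in element for seq in expand(item)]
    # Sequence: text, then each child followed by the text after it.
    parts = [[(element.text or "").split()]]
    for child in element:
        parts.append(expand(child))
        parts.append([(child.tail or "").split()])
    return [sum(combo, []) for combo in itertools.product(*parts)]


def sentences(srgs):
    """List all sentences generated by the root rule of a grammar."""
    root = ET.fromstring(srgs)
    rule = next(r for r in root.iter("rule") if r.get("id") == root.get("root"))
    return [" ".join(words) for words in expand(rule)]
```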

  48. SRGS <grammar root="s"> <rule id="s"> hello <item repeat="0-1"> how are you </item> </rule> </grammar> s -> "hello" ("how are you")?

  49. SRGS <grammar root="s"> <rule id="s"> <item repeat="1-"> hello </item> </rule> </grammar> s -> "hello" s -> "hello" s s -> "hello"+ NOTE: Listing all sentences is no longer possible, because the grammar generates infinitely many.
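Once `repeat` allows unbounded repetition, a grammar is better treated as an acceptor than as a list of sentences. The sketch below translates a `repeat` value into a regular-expression quantifier and tests whether an utterance consists of a valid number of repetitions of one word. `repeat_to_quantifier` and `accepts` are hypothetical helpers covering only this single-word case.

```python
import re


def repeat_to_quantifier(repeat):
    """Translate an SRGS repeat value ("0-1", "1-", "2-4", "3") to a regex quantifier."""
    if "-" not in repeat:
        return "{%s}" % repeat            # exact count, e.g. "3"
    low, high = repeat.split("-")
    if high == "":                        # open-ended, e.g. "1-"
        return {"0": "*", "1": "+"}.get(low, "{%s,}" % low)
    if (low, high) == ("0", "1"):         # optional
        return "?"
    return "{%s,%s}" % (low, high)


def accepts(word, repeat, utterance):
    """Check whether the utterance is a valid number of repetitions of word."""
    pattern = "(?:%s )%s" % (re.escape(word), repeat_to_quantifier(repeat))
    # Append a space to every word so each repetition has the same shape.
    padded = "".join(w + " " for w in utterance.split())
    return re.fullmatch(pattern, padded) is not None
```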
