
VoiceXML Overview

James A. Larson, Intel Corporation, jim@larson-tech.com. Outline: Motivation for VoiceXML; W3C Speech Interface Framework Languages; Dialog (VoiceXML 2.0); Speech Synthesis (SSML); Grammars (SRGS); Semantic Interpretation (SI); VoiceXML 2.1; VoiceXML in the Marketplace.


Presentation Transcript


  1. VoiceXML Overview James A. Larson Intel Corporation jim@larson-tech.com (c) 2007 Larson Technical Services

  2. Outline
     • Motivation for VoiceXML
     • W3C Speech Interface Framework Languages
     • Dialog: VoiceXML 2.0
     • Speech Synthesis: SSML
     • Grammars: SRGS
     • Semantic Interpretation: SI
     • VoiceXML 2.1

  3. VoiceXML in the Marketplace
     • VoiceXML 2.0 is now ratified as a Recommendation (i.e., an official standard) by the W3C
     • Hundreds of millions of VoiceXML calls are answered every day
     • VoiceXML is the standard for building speech-enabled applications

  4. Motivation for Speech Applications
     • Users access Web sites from any telephone, anywhere, any time.
     • Speaking and listening are the natural usage modes for phones.

  5. Strengths of VoiceXML Applications
     • Traditional system-directed dialogs for novice users
     • Mixed-initiative dialogs for experienced users
     • Novice users smoothly become experienced users at their own pace

  6. Limitations of VoiceXML Applications
     • No special analysis of speech input
     • Not suitable for training speech skills such as reading, ESL, or singing
     • VUI conversational bandwidth is lower than GUI conversational bandwidth
     • Using a VUI is like drinking from Lake Superior with a straw

  7. Exercise 1
     • Name or describe a speech application you could use at work.
     • Name or describe a speech application you or a family member could use at home.

  8. XML
     • XML = eXtensible Markup Language
     • Elements are surrounded by tags
         <prompt>Welcome to the voice system</prompt>
     • Elements may be nested
         <prompt>
           Welcome to Ajax Travel <break/> we have the cheapest fares
         </prompt>
     • Elements may have attributes
         <choice next="#boat">
         <grammar type="application/srgs+xml" version="1.0"
                  root="by_boat" src="boat.grxml"/>
     • Because "<", ">", and "&" have special meanings, use
         "&lt;" in place of "<"
         "&gt;" in place of ">"
         "&amp;" in place of "&"
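
As a small illustration of the escaping rule above, the following sketch writes an ampersand as &amp; both in element text and inside an attribute value. The company name and the URL are invented for the example and are not from the slides.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="welcome">
    <block>
      <!-- In element text, "&amp;" is parsed as a literal ampersand -->
      <prompt>Welcome to Smith &amp; Jones Travel.</prompt>
      <!-- The same applies inside attribute values, such as a query string.
           The URL is hypothetical. -->
      <goto next="http://www.example.com/fares.vxml?from=PDX&amp;to=SEA"/>
    </block>
  </form>
</vxml>
```

A document that contains a bare "&" or "<" in these positions is not well-formed XML, so a voice browser would be expected to reject it before any dialog runs.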

  10. [Figure: system diagram. Labels: Documents, Multimedia Files, HTML Scripts, VoiceXML Scripts, Web Browser, Voice Browser, Web Server, Database Server, DB, Speech Server/Gateway, ASR, Grammars, DTMF, TTS, Capture Voice, Replay Audio, Audio Files]

  11. W3C Speech Interface Framework [Figure: framework diagram. Components: VoiceXML 2.0, Speech Synthesis, Grammar, Semantic Interpretation, Call Control, Other]

  12. Status of W3C Speech Interface Languages [Figure: chart placing each language at a W3C process stage. Stages: Requirements, Working Draft, Last Call Working Draft, Candidate Recommendation, Proposed Recommendation, Recommendation. Languages: VoiceXML 2.0, Grammar (SRGS), Synthesis (SSML), Semantic Interpretation (SISR), VoiceXML 2.1, Call Control (CCXML), V3]

  14. Example of VoiceXML 2.0 Fragment
      <?xml version="1.0"?>
      <vxml version="2.0">
        <form>
          …
          <field name="account">
            <prompt>
              Which account <break/>
              <emphasis> savings </emphasis> or <emphasis> checking </emphasis>
            </prompt>
            <grammar type="application/srgs+xml" root="account_type" mode="voice">
              <rule id="account_type">
                <one-of>
                  <item> savings </item>
                  <item> checking </item>
                  <item> CD </item>
                  <item> certificate of deposit <tag>$ = "CD"</tag> </item>
                </one-of>
              </rule>
            </grammar>
          </field>
          …
        </form>
        …
      </vxml>
      Languages used in the fragment: Dialog Language (VoiceXML 2.0); Speech Synthesis Markup Language (SSML); Speech Recognition Grammar Specification (SRGS); Semantic Interpretation (SI)

  17. Example of VoiceXML 2.0 Fragment (same fragment as slide 14, except that the semantic interpretation tag on the last item assigns to new.account instead of $)
      <item> certificate of deposit <tag>new.account = "CD"</tag> </item>
      Languages used in the fragment: Dialog Language (VoiceXML 2.0); Speech Synthesis Markup Language (SSML); Speech Recognition Grammar Specification (SRGS); Semantic Interpretation (SI)
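
To show what the semantic interpretation result is for, here is a minimal sketch of a field that acts on the returned value. It assumes the grammar from the slide is stored in a hypothetical file named account_type.grxml and that the recognizer places the value "CD" in the field variable `account` for both "CD" and "certificate of deposit".

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="choose_account">
    <field name="account">
      <prompt>Which account, savings or checking?</prompt>
      <!-- account_type.grxml is a hypothetical file holding the slide's grammar -->
      <grammar src="account_type.grxml" type="application/srgs+xml"/>
      <filled>
        <!-- Runs once the field variable has been set from the recognition result -->
        <if cond="account == 'CD'">
          <prompt>You chose a certificate of deposit.</prompt>
        <else/>
          <prompt>You chose <value expr="account"/>.</prompt>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

Because both spoken forms map to the same value, the dialog logic only needs to test for one string.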

  18. VoiceXML 2.0 features
      • Menus, forms, sub-dialogs
         • <menu>, <form>, <subdialog>
      • Inputs
         • Speech recognition <grammar>
         • Recording <record>
         • Keypad <grammar mode="dtmf">
      • Output
         • Audio files <audio>
         • Text-to-speech <prompt>
      • Variables
         • <var> <script> <assign>
      • Events
         • <nomatch>, <noinput>, <help>, <catch>, <throw>
      • Transition and submission
         • <goto>, <submit>
      • Telephony
         • Connection control
            • <transfer>, <disconnect>
         • Telephony information
      • Platform
         • Objects
      • Performance
         • Fetch
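
Several of the features listed above can be seen together in one short document. The following is a sketch only: the menu wording, the form ids, and the prompts are invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu id="main" dtmf="true">
    <!-- enumerate reads the choices back; dtmf="true" also maps them to keys 1 and 2 -->
    <prompt>Main menu. Say one of: <enumerate/></prompt>
    <choice next="#balance">balance</choice>
    <choice next="#goodbye">goodbye</choice>
    <noinput>I did not hear you. <reprompt/></noinput>
  </menu>

  <form id="balance">
    <block>
      <prompt>Your balance is not available in this demo.</prompt>
      <!-- Transition back to the menu -->
      <goto next="#main"/>
    </block>
  </form>

  <form id="goodbye">
    <block>
      <prompt>Goodbye.</prompt>
      <!-- Connection control: hang up -->
      <disconnect/>
    </block>
  </form>
</vxml>
```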

  19. Typical Form Fill-In
      <form>
        <block>
          <prompt>Welcome to the electronic payment system.</prompt>
        </block>
        <field name="card_number">
          <prompt>Please enter your credit card number.</prompt>
          <grammar src="http://www.ajax.com/credit_card_number.grxml"/>
        </field>
        <field name="date">
          <prompt>Please enter your expiration date.</prompt>
          <grammar src="http://www.ajax.com/credit_card_date.grxml"/>
        </field>
      </form>

  20. Exercise 2: Capture "birth date"
      <form>
        <block>
          <prompt> _____________________ </prompt>
        </block>
        <field name="month">
          <prompt> _______________________________ </prompt>
          <grammar src="http://www.ajax.com/month.grxml"/>
        </field>
        <field name="day">
          <prompt> ______________________________ </prompt>
          <grammar src="http://www.ajax.com/day.grxml"/>
        </field>
        <field name="year">
          <prompt> ______________________________ </prompt>
          <grammar src="http://www.ajax.com/year.grxml"/>
        </field>
      </form>

  21. Event Handlers
      • Deal with exceptional or error conditions
      • Control mechanism for dialog turn retries
         • <catch event="noinput"> … </catch>
         • <catch event="nomatch"> … </catch>
         • <catch event="help"> … </catch>
      • Shorthand notation available
         • <noinput> … </noinput>, etc.
      • Scoped according to where they occur
         • <form>, <field>, etc.

  22. Adding Event Handlers
      <form>
        <block>
          <prompt> When were you born? </prompt>
        </block>
        <field name="month">
          <catch event="noinput"> … </catch>
          <catch event="nomatch"> … </catch>
          <prompt> What month? </prompt>
          <grammar src="http://www.ajax.com/month.grxml"/>
        </field>
        …
      </form>

  25. Default Event Handlers
      <catch event="nomatch">
        <prompt> I did not understand, please try again </prompt>
      </catch>
      <catch event="help">
        <prompt> Sorry, no help is available. </prompt>
      </catch>
      <catch event="noinput">
        <prompt> I did not hear anything, please speak again </prompt>
      </catch>
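
Handlers written in a field override defaults like these for that field, and the `count` attribute lets the wording escalate on repeated failures. The following sketch uses the shorthand elements for the month field; the prompt wording is invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="birth_date">
    <field name="month">
      <prompt>What month were you born?</prompt>
      <grammar src="http://www.ajax.com/month.grxml" type="application/srgs+xml"/>

      <!-- First recognition failure: a short retry -->
      <nomatch count="1">
        <prompt>Sorry, which month?</prompt>
      </nomatch>

      <!-- Second and later failures: more guidance -->
      <nomatch count="2">
        <prompt>Please say the name of a month, for example, March.</prompt>
      </nomatch>

      <!-- Silence: apologize, then replay the field prompt -->
      <noinput>
        <prompt>I did not hear anything.</prompt>
        <reprompt/>
      </noinput>

      <help>
        <prompt>Say the month you were born in, such as January.</prompt>
      </help>
    </field>
  </form>
</vxml>
```

Without `<reprompt/>`, the interpreter returns to listening after the handler's own prompt instead of replaying the field prompt, which keeps retries short.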

  26. Exercise 3: Write event handlers for the month field
      <catch event="nomatch">
        <prompt> __________________________ </prompt>
      </catch>
      <catch event="help">
        <prompt> ____________________ </prompt>
      </catch>
      <catch event="noinput">
        <prompt> ___________________________________ </prompt>
      </catch>

  28. Speech Synthesis ML
      Pipeline: Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production
      • Structure Analysis
         • Markup support: p, s
         • Non-markup behavior: infer structure by automated text analysis

  29. Before and after Structure Analysis
      • Before structure analysis
          Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. He plays bass guitar. He also likes to fish; last week he caught a 19 lb. bass.
      • After structure analysis
          <p>
            <s> Dr. Smith lives at 214 Elm Dr. </s>
            <s> He weighs 214 lb. </s>
            <s> He plays bass guitar. </s>
            <s> He also likes to fish; last week he caught a 19 lb. bass. </s>
          </p>

  30. Speech Synthesis ML
      Pipeline: Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production
      • Structure Analysis
         • Markup support: p, s
         • Non-markup behavior: infer structure by automated text analysis
      • Text Normalization
         • Markup support: say-as for dates, times, etc.; sub for aliasing
         • Non-markup behavior: automatically identify and convert constructs

  31. After Text Normalization
      <p>
        <s> <sub alias="doctor">Dr.</sub> Smith lives at 214 Elm <sub alias="drive">Dr.</sub> </s>
        <s> He weighs 214 <sub alias="pounds">lb.</sub> </s>
        <s> He plays bass guitar. </s>
        <s> He also likes to fish; last week he caught a 19 <sub alias="pound">lb.</sub> bass. </s>
      </p>

  32. Speech Synthesis ML
      Pipeline: Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production
      • Structure Analysis
         • Markup support: p, s
         • Non-markup behavior: infer structure by automated text analysis
      • Text Normalization
         • Markup support: say-as for dates, times, etc.; sub for aliasing
         • Non-markup behavior: automatically identify and convert constructs
      • Text-to-Phoneme Conversion
         • Markup support: phoneme, say-as
         • Non-markup behavior: look up in pronunciation dictionary

  33. After Text-to-Phoneme Conversion
      <p>
        <s> <sub alias="doctor">Dr.</sub> Smith lives at <say-as interpret-as="address">214</say-as> Elm <sub alias="drive">Dr.</sub> </s>
        <s> He weighs <say-as interpret-as="number">214</say-as> <sub alias="pounds">lb.</sub> </s>
        <s> He plays <phoneme alphabet="ipa" ph="beɪs">bass</phoneme> guitar. </s>
        <s> He also likes to fish; last week he caught a <say-as interpret-as="number">19</say-as> <sub alias="pound">lb.</sub> <phoneme alphabet="ipa" ph="bæs">bass</phoneme>. </s>
      </p>

  34. Speech Synthesis ML
      Pipeline: Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production
      • Structure Analysis
         • Markup support: p, s
         • Non-markup behavior: infer structure by automated text analysis
      • Text Normalization
         • Markup support: say-as for dates, times, etc.; sub for aliasing
         • Non-markup behavior: automatically identify and convert constructs
      • Text-to-Phoneme Conversion
         • Markup support: phoneme, say-as
         • Non-markup behavior: look up in pronunciation dictionary
      • Prosody Analysis
         • Markup support: emphasis, break, prosody
         • Non-markup behavior: automatically generate prosody through analysis of document structure and sentence syntax

  35. Prosody Analysis (initial text)
      <prompt>
        Environmental control menu. Do you want to adjust the lighting or temperature?
      </prompt>

  36. Prosody Analysis
      <prompt>
        Environmental control menu <break/>
        <emphasis level="reduced"> do you want to adjust the </emphasis>
        <emphasis level="strong"> lighting </emphasis> <break/>
        or <emphasis level="strong"> temperature? </emphasis>
      </prompt>
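
The slides list `prosody` alongside `emphasis` and `break` but do not show it. As a hedged sketch, the same prompt could be marked up with explicit pause lengths and `prosody` attributes instead; the specific values chosen here (500ms, slow, high, loud) are illustrative assumptions, and how they sound depends on the synthesizer.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="environment">
    <block>
      <prompt>
        Environmental control menu.
        <!-- An explicit pause length instead of a default break -->
        <break time="500ms"/>
        <!-- Slow down the carrier phrase -->
        <prosody rate="slow">Do you want to adjust the</prosody>
        <!-- Raise pitch and volume on the two options -->
        <prosody pitch="high" volume="loud">lighting</prosody>
        <break time="200ms"/>
        or
        <prosody pitch="high" volume="loud">temperature?</prosody>
      </prompt>
    </block>
  </form>
</vxml>
```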

  37. Speech Synthesis ML
      Pipeline: Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production
      • Structure Analysis
         • Markup support: paragraph, sentence
         • Non-markup behavior: infer structure by automated text analysis
      • Text Normalization
         • Markup support: say-as for dates, times, etc.; sub for aliasing
         • Non-markup behavior: automatically identify and convert constructs
      • Text-to-Phoneme Conversion
         • Markup support: phoneme, say-as
         • Non-markup behavior: look up in pronunciation dictionary
      • Prosody Analysis
         • Markup support: emphasis, break, prosody
         • Non-markup behavior: automatically generate prosody through analysis of document structure and sentence syntax
      • Waveform Production
         • Markup support: voice, audio (audio icons, branding, advertising)

  38. Waveform Production
      <prompt>
        <audio src="http://www.example.com/adjust.wav">
          <desc> Environmental control menu. Do you want to adjust the lighting or temperature </desc>
        </audio>
      </prompt>

  39. Exercise 4 (insert SSML commands)
      <prompt>
        Welcome to Ajax Bank do you want to withdraw or deposit funds?
      </prompt>

  41. Grammars
      • Describe what the user may say at a point in the dialog
      • Enable the speech recognition engine to work faster and more accurately
      • Consist of one or more "rules"

  42. Example Grammar (XML form of grammars)
      <grammar type="application/srgs+xml" root="zero_to_ten" mode="voice">
        <rule id="zero_to_ten">
          <one-of>
            <item> zero </item>
            <item> <ruleref uri="#single_digit"/> </item>
            <item> ten </item>
          </one-of>
        </rule>
        <rule id="single_digit">
          <one-of>
            <item> one </item>
            <item> two </item>
            <item> three </item>
            <item> four </item>
            <item> five </item>
            <item> six </item>
            <item> seven </item>
            <item> eight </item>
            <item> nine </item>
          </one-of>
        </rule>
      </grammar>

  43. Example Grammar (the grammar from slide 42): the grammar processor should start with the "zero_to_ten" rule
      <grammar type="application/srgs+xml" root="zero_to_ten" mode="voice">

  44. Example Grammar (the grammar from slide 42): mode="voice" marks this as a grammar used by the speech recognizer. (There may also be grammars for DTMF recognizers.)
      <grammar type="application/srgs+xml" root="zero_to_ten" mode="voice">
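
For the DTMF case mentioned above, a grammar looks much the same but declares mode="dtmf" and lists key presses instead of words. The following is a minimal sketch of a four-digit PIN grammar; the rule names and the four-digit length are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         mode="dtmf" root="pin">

  <!-- One key press, 0 through 9 -->
  <rule id="digit">
    <one-of>
      <item>0</item>
      <item>1</item>
      <item>2</item>
      <item>3</item>
      <item>4</item>
      <item>5</item>
      <item>6</item>
      <item>7</item>
      <item>8</item>
      <item>9</item>
    </one-of>
  </rule>

  <!-- Exactly four key presses in a row -->
  <rule id="pin" scope="public">
    <item repeat="4"><ruleref uri="#digit"/></item>
  </rule>
</grammar>
```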

  45. Example Grammar (the grammar from slide 42)
      • <rule id="zero_to_ten">: rule describing the numbers zero through ten
      • <rule id="single_digit">: rule describing the single digits one through nine

  46. Example Grammar (the grammar from slide 42): <one-of> describes alternatives

  47. Example Grammar (the grammar from slide 42): the <ruleref> element references another rule
      <ruleref uri="#single_digit"/>

  48. Example Grammar (the grammar from slide 42). Exercise 5: Write a grammar that recognizes the numbers zero to nineteen

  49. More Grammar Elements
      • Repeat and optional
          <rule id="goodness" scope="public">
            <item repeat="0-3"> very </item> good
          </rule>
      • Sequence
          <rule id="twenty_thru_twentynine">
            twenty <ruleref uri="#single_digit"/>
          </rule>
      • Garbage
          <rule id="James_Lewis">
            <item> James <ruleref special="GARBAGE"/> Lewis </item>
          </rule>
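
The sequence and optional constructs above can be combined. As a sketch, the grammar below accepts "twenty" alone or "twenty" followed by a single digit, so it covers twenty through twenty-nine; the rule name `twenties` and the `en-US` language tag are assumptions for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" mode="voice" root="twenties">

  <rule id="twenties" scope="public">
    <!-- Sequence: the word "twenty" followed by an optional digit -->
    twenty
    <item repeat="0-1"><ruleref uri="#single_digit"/></item>
  </rule>

  <rule id="single_digit">
    <one-of>
      <item> one </item>
      <item> two </item>
      <item> three </item>
      <item> four </item>
      <item> five </item>
      <item> six </item>
      <item> seven </item>
      <item> eight </item>
      <item> nine </item>
    </one-of>
  </rule>
</grammar>
```

Here repeat="0-1" is what makes the digit optional, in the same way that repeat="0-3" allows up to three occurrences of "very" in the slide.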

  50. Reusing existing grammars
      <grammar type="application/srgs+xml" root="size"
               src="http://www.example.com/size.grxml"/>
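
An existing grammar can also be reused from inside another grammar with an external rule reference. The sketch below assumes that size.grxml defines a public rule named "size" (the slide only shows it as the root); the surrounding phrase and the rule name `order` are invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" mode="voice" root="order">

  <rule id="order" scope="public">
    I would like a
    <!-- External reference: grammar URI plus "#" and the rule name.
         Assumes size.grxml exposes a public rule called "size". -->
    <ruleref uri="http://www.example.com/size.grxml#size"/>
    pizza
  </rule>
</grammar>
```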
