Creating User Interfaces

Directed Speech. XML. VoiceXML
Classwork/Homework: Sign up to be a Voxeo developer. Do the tutorials.

Speech recognition
• Encompasses a variety and range of activities
• Totally open-ended as to content and audience
• May claim more than really exists
• Restricted to small[er] set of phrases
• Phrases within longer sections of speech
• Restricted to require training
• OR system learns
• Dictation systems learn your voice

Speech recognition
• User speaks. System 'understands', at least enough to perform some action.
• Related to (but not the same as)
• Natural language understanding
• Voice print identification
• Record information to be re-played to human in compressed form for later interaction
• Speech synthesis (other direction): words to speech
• ?

Natural language understanding
• Skip speech altogether, but type in statements or phrases in normal language
• What is normal? We tend not to speak that grammatically
• Many 'natural language systems' actually use keywords
• History
• Moon rocks example
• Combine speech to natural language …

Continuous versus discrete
• Speaker speaks 'naturally' versus
• Speaker separates words

Examples
• Dictation: no understanding as such, produce words/sentences in a program
• (Telephone) Help desk / Information: generally restricted or directed speech, choosing from alternatives (may or may not be given). Advances the process
• [Restricted] commands: actually carrying out operations
• Factory example: start and stop
• Car: radio, heat/AC
• Phone: call specific number

Training
• Dictation application: user takes time to read a specific text to train the system
• Note: some systems also adapt with use. If & when the user corrects the results, the system may do better next time.
• Phone lookup: user records names. No 'understanding', just record for matching.

Audience & content
• Some systems may allow adapting to audiences, for example, male versus female
• Some systems have restrictions on types of content
• Historical note: IBM system in 1980s & 1990s was restricted to male, American-born speakers (no speech impediments) and legal text.

Speech recognition concepts
• Air pressure → diaphragm in phone → electrical signal → (Fourier Transform) → wave pattern, matched against
• sets of canonical patterns (native speaker of English, perhaps male/female & young/old alternatives)
• generated for the specified grammar (using a segmentation = dividing up of the parts)
Note: interplay of grammar and statistics distinguishes different approaches

Fourier Transform (Fast Fourier Transform, FFT)
• Takes data representing a signal
• And produces numbers representing the combination of sine and cosine waves that make up the signal
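
As a rough illustration of the idea above, the sketch below builds a signal from one sine wave and one weaker cosine wave and recovers their strengths with a naive discrete Fourier transform. It is written in Python purely for illustration; the function names `dft` and `dominant_bin` and the test signal are made up for this example. A real recognizer would use an FFT routine, which computes the same numbers much faster.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform (O(n^2)); an FFT gives the same values faster."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def dominant_bin(samples):
    """Return the index (cycles per window) of the strongest component below n/2."""
    spectrum = dft(samples)
    return max(range(1, len(samples) // 2), key=lambda k: abs(spectrum[k]))


# Hypothetical test signal: 32 samples holding a sine wave with 3 cycles
# plus a weaker cosine wave with 5 cycles.
N = 32
signal = [
    math.sin(2 * math.pi * 3 * t / N) + 0.5 * math.cos(2 * math.pi * 5 * t / N)
    for t in range(N)
]

spectrum = dft(signal)
for k in (3, 5):
    # For a pure sine or cosine component the magnitude is amplitude * N / 2.
    print(k, round(abs(spectrum[k]), 3))
print("strongest component:", dominant_bin(signal))
```

The two printed magnitudes stand in a 2:1 ratio, matching the amplitudes 1 and 0.5 used to build the signal, and the strongest component is the one at 3 cycles.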

Speech recognition
• Works on the product of the FFT
• Uses (in most cases)
• Segmentation: attempt to break up into pieces, perhaps syllables or words
• Grammar: definition of what is to be expected
• Probabilities: if the first part matched X, then there is a greater probability that the next part will match Y
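
The role of probabilities can be sketched with a toy example. Assume the sound of the second word matched "hole" slightly better than "home", but the first word was already recognized as "call". All numbers, the word list, and the function name `rescore` below are hypothetical; this is a sketch of the general idea, not of any particular recognizer.

```python
def rescore(previous_word, acoustic_scores, transitions, floor=0.01):
    """Weight each candidate's acoustic score by how likely it is to follow previous_word."""
    follow = transitions.get(previous_word, {})
    combined = {
        word: score * follow.get(word, floor)
        for word, score in acoustic_scores.items()
    }
    total = sum(combined.values())
    # Normalize so the rescored values add up to 1.
    return {word: value / total for word, value in combined.items()}


# Hypothetical numbers: how well the sound matched each candidate word...
acoustic_scores = {"hole": 0.55, "home": 0.45}
# ...and how often each word follows "call" in the expected phrases.
transitions = {"call": {"home": 0.60, "hole": 0.05}}

rescored = rescore("call", acoustic_scores, transitions)
best = max(rescored, key=rescored.get)
print(best, round(rescored[best], 2))
```

On sound alone "hole" would win; once the preceding word is taken into account, "home" becomes the far more probable match.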

Current State of the Art
• General, unrestricted speech recognition, good enough to act on the speech? Always about to happen?
• Dictation / substitute for keyboard+ exists and satisfies many
• Is this the most important application for most users?
• May not be the killer app, but may be good for motivating research
Extra credit posting: prepare a brief report on [a] current product or application. Can be one you use yourself.

Speech synthesis
• aka TTS (text to speech)
• Application determines that the computer needs to say certain words
• lexical units (syllables of words) → phonemes → pre-recorded (wav) files of phonemes

Speech synthesis
• This is again a segmentation process: need to divide up the words and then put them together so speech sounds 'natural'.
• A particular phoneme may [need to] sound different in different contexts.
• Also need to deal with abbreviations & local accents
• Place names (important in travel & weather applications)
• Special case: detect and use a wav file for each name.
• Older methods were all synthesized
• Similar distinction between all-synthesized and sampled music

Speech synthesis is essentially ‘the computer’ reading ‘out loud’.
• Easy to do most things
• Increasingly difficult to do the complete job
• Other languages may be easier than English.
• People who are not monolingual, please comment!

Restricted / directed speech applications
• The language is VoiceXML
• We will use evolution.voxeo.com to create directed speech applications.
• Free facility: put in a URL pointing to a VoiceXML document. Supplies phone numbers to call in to test.
• You need to register.
• Note: previously used Tellme studios but they stopped offering service.

XML
• Generalization of HTML
• XML documents have markup.
• An element consists of an opening tag (giving the element type, possibly with attributes), content, and a closing tag.
• Document must be well-formed.
• Elements nested in other elements
• Quotation marks around attribute values
• Developers decide on element types.
• So, we need to obey the rules of VoiceXML
• Each element type can only have certain child elements

Notes on VoiceXML
• There are field and filled elements!
• You can start with text-to-speech, keep it as backup and, when appropriate and possible, make wav recordings.
• You can open the file directly or in Voxeo to check for well-formed XML.
• But this doesn't check for legal VoiceXML
• You can include JavaScript in the file or as an external script.
• Can put in pauses and other tricks to improve SR and TTS.
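
The difference between well-formed XML and legal VoiceXML can be seen with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` module as a stand-in for opening the file in a browser or in Voxeo; the helper name `is_well_formed` and the four sample documents are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Return (True, None) if the text parses as XML, else (False, error message)."""
    try:
        ET.fromstring(document.encode("utf-8"))
        return True, None
    except ET.ParseError as err:
        return False, str(err)


hello = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<vxml version="2.1"><form><block>'
    "<prompt>Hello World.</prompt>"
    "</block></form></vxml>"
)
# Attribute value without quotation marks: not well-formed.
unquoted = "<vxml version=2.1><form/></vxml>"
# <form> is opened twice and never closed: not well-formed.
unclosed = '<vxml version="2.1"><form><block/><form></vxml>'
# Well-formed XML, but <banana> is not a VoiceXML element; this check cannot tell.
not_voicexml = '<vxml version="2.1"><banana/></vxml>'

for name, text in [("hello", hello), ("unquoted", unquoted),
                   ("unclosed", unclosed), ("not_voicexml", not_voicexml)]:
    ok, message = is_well_formed(text)
    print(name, "well-formed" if ok else "NOT well-formed: " + message)
```

The last document passes even though no voice platform would accept it, which is why a well-formedness check is only the first step in debugging.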

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1">
  <meta name="Jeanine" content="jeanine.meyer@purchase.edu"/>
  <meta name="speak_exceptions" content="true"/>
  <form>
    <block>
      <prompt> Hello World. This is my first Voxeo application. </prompt>
    </block>
  </form>
</vxml>

My modification of the SouthPark example: outline
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <meta name="jeanine.meyer" content="jeanine.meyer@purchase.edu"/>
  <form id="MainMenu">
    <field name="DowntonCharacter"> … </field>
    <filled namelist="DowntonCharacter"> … </filled>
  </form>
</vxml>

<field name="DowntonCharacter">
  <prompt> Please say your favorite Downton Abbey character's name. </prompt>
  <grammar xml:lang="en-US" root="myrule">
    <rule id="myrule">
      <one-of>
        <item> Carson </item>
        <item> Mrs. Hughes </item>
        …
        <item> Mary </item>
        <item> Cora </item>
      </one-of>
    </rule>
  </grammar>

  <noinput> I did not hear anything. Please try again. <reprompt/> </noinput>
  <nomatch> I did not recognize that character. Please try again. <reprompt/> </nomatch>
</field>

<filled namelist="DowntonCharacter">
  <if cond="DowntonCharacter == 'Carson'">
    <prompt> Carson grew less likeable as the seasons went on. </prompt>
  <elseif cond="DowntonCharacter == 'Mrs. Hughes'"/>
    <prompt> Mrs. Hughes is wise, so why did she marry Carson? </prompt>
  …
  <else/>
    <prompt> A match has occurred, but we have no specific response prepared. Perhaps you liked Mary or Cora. </prompt>
  </if>
  <goto next="#MainMenu"/>
</filled>
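
To reason about the field and filled elements above without calling in by phone, the three possible outcomes (noinput, nomatch, filled) can be mimicked in a few lines of Python. This is only a simplified sketch: the function name `respond` is made up, the argument `heard` stands for text that a recognizer has already produced, and a real platform matches audio against the grammar rather than comparing strings. The grammar items that are elided in the listing are left out here as well.

```python
# Names taken from the grammar shown above; the items elided there are left out.
GRAMMAR = ["Carson", "Mrs. Hughes", "Mary", "Cora"]

# Specific responses, one per branch of the if/elseif in the filled element.
RESPONSES = {
    "Carson": "Carson grew less likeable as the seasons went on.",
    "Mrs. Hughes": "Mrs. Hughes is wise, so why did she marry Carson?",
}
# The else branch: a grammar match with no specific response.
DEFAULT_RESPONSE = ("A match has occurred, but we have no specific response "
                    "prepared. Perhaps you liked Mary or Cora.")


def respond(heard):
    """Mimic one pass through the field: noinput, nomatch, or filled."""
    if not heard or not heard.strip():
        return "I did not hear anything. Please try again."       # noinput
    name = heard.strip()
    if name not in GRAMMAR:
        return "I did not recognize that character. Please try again."  # nomatch
    return RESPONSES.get(name, DEFAULT_RESPONSE)                   # filled


for heard in ["", "Bates", "Carson", "Mary"]:
    print(repr(heard), "->", respond(heard))
```

Note that "Mary" is in the grammar, so it counts as a match, but it has no branch of its own and therefore gets the else response.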

Notes
• The grammar in the field element lists names, such as Mary and Cora, that are not referenced in the filled element; they fall through to the else response.
• If it doesn't work AND you have checked that it is well-formed, and especially after you start to use other elements, check the element documentation to confirm that you are putting elements within allowed elements.
• Consider using the file manager to upload to their storage (www). May give more reliable results.
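
Checking that elements sit inside allowed parents can be partly automated. The sketch below is a hedged illustration only: the table `ALLOWED_CHILDREN` is a deliberately small, hand-written subset covering the elements used in these slides, not the VoiceXML schema, and the function name `misplaced_elements` is made up. It also assumes documents without an `xmlns` attribute, since a namespace would change the tag names the parser reports.

```python
import xml.etree.ElementTree as ET

# Deliberately partial, illustrative table: parent element -> children accepted here.
# It is NOT the VoiceXML schema; consult the element documentation for the real rules.
ALLOWED_CHILDREN = {
    "vxml": {"meta", "form"},
    "form": {"block", "field", "filled"},
    "block": {"prompt"},
    "field": {"prompt", "grammar", "noinput", "nomatch"},
}


def misplaced_elements(document):
    """List (parent, child) pairs that the partial table above does not allow."""
    root = ET.fromstring(document)
    problems = []
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag)
        if allowed is None:
            continue  # parent not covered by the table; skip it
        for child in parent:
            if child.tag not in allowed:
                problems.append((parent.tag, child.tag))
    return problems


good = '<vxml version="2.1"><form><block><prompt>Hi</prompt></block></form></vxml>'
# Well-formed, but <prompt> sits directly inside <form> instead of a <block> or <field>.
bad = '<vxml version="2.1"><form><prompt>Hi</prompt></form></vxml>'

print(misplaced_elements(good))
print(misplaced_elements(bad))
```

The first document reports no problems; the second reports the prompt placed directly in the form, the kind of mistake a well-formedness check alone will not catch.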

Screen shot from Voxeo

Screen shot: phone numbers

My examples
• Family greeting: built-in audio files, use of calculations in VoiceXML to determine number of cranes to be done
• Rock paper scissors: JavaScript code to determine random move for computer, VoiceXML variables, break for timing, count and timeout with prompt
• ?

Homework (over break)
• Sign up to be a Voxeo developer.
• Start VoiceXML tutorials: http://help.voxeo.com/go/help/xml.vxml.tutorials.overview
• Do your own hello, world application.
• Do a second application, involving some speech recognition.
• Do more?
• Check out http://help.voxeo.com/go/help/xml.vxml.elements.overview
• Start planning your VoiceXML project.
