Vocalizer Technical Architecture

Vocalizer Technical Architecture, August 2013

Outline
• Background
• Embedded and Network solutions
• Front-end and Back-end Architecture
• Languages and Voices Portfolio
• Tools & Services
• Strategic directions

Vocalizer = Voice Output Solutions
• Recorded prompts with dynamic slots (traditional)
  - Street names
  - Person names
  - Dates
  - Amounts
• Pure TTS (new)
  - Dialog
  - Email
  - News
  - Knowledge base

TTS Use Cases
• Markets: Mobiles, Enterprise, Automotive, Healthcare, eBooks, TVs, Toys, Home app, Accessibility
• Applications: Traffic info, Tourist info, SMS/Email reading, Banking, Social media updates, Turn-by-turn directions, Talking toys & game characters, Weather info, eBook reading, Talking avatars, News reading, Language learning, Voice UI feedback, Talking appliances, Access to information for the visually impaired, Owner’s manual

Nuance TTS Heritage
Built on extensive expertise in TTS systems
Product lines shown on the timeline, 2001 to 2013:
• TTS 2500
• Speechify & Speechify Solo
• Loquendo TTS
• Nuance Vocalizer
• TTS 3000
• rVoice
• RealSpeak
• SVOX TTS
• RealSpeak & RealSpeak Solo
• Vocalizer 6
• Vocalizer Expressive
• RealSpeak, RealSpeak Solo & Prompt Sculptor
• Vocalizer for Automotive / Network

Requirements for natural and expressive TTS

Embedded and Network Solutions

Network TTS Release Process
Vocalizer 6 stakeholders:
• TTS R&D, ENT Build & Release (NSS integration), ENT QA, NOD, NCS, Customers
• TTS R&D, ENT Continuing Engineering, Tech Support

Architecture Overview
• Text input: raw text, HTML, SSML markup, mixed-lingual text
• Processing stages: Text Preprocessor -> Linguistic Preprocessor -> Parameter Generator -> Synthesizer
• Front-End: linguistic processing
• Back-End: signal generation
• Resources: language data, voice data
• Output: speech signal, 16-bit PCM, 22.1 kHz / 8 kHz
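
The four stages named on the architecture overview slide can be pictured as a chain of functions that each enrich a shared utterance object. The sketch below is illustrative only: the `Utterance` class, the stage function bodies, and the fixed duration and F0 values are assumptions made for this example, not Vocalizer's actual interfaces, and the "synthesizer" just emits silence of the right length.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Utterance:
    """Hypothetical container passed from stage to stage."""
    text: str
    tokens: List[str] = field(default_factory=list)
    phones: List[str] = field(default_factory=list)
    params: List[dict] = field(default_factory=list)
    samples: List[int] = field(default_factory=list)


def text_preprocessor(utt: Utterance) -> Utterance:
    # Stand-in for markup parsing, tokenization and normalization.
    utt.tokens = utt.text.lower().split()
    return utt


def linguistic_preprocessor(utt: Utterance) -> Utterance:
    # Stand-in for grapheme-to-phoneme conversion: one "phone" per letter.
    utt.phones = [ch for tok in utt.tokens for ch in tok if ch.isalpha()]
    return utt


def parameter_generator(utt: Utterance) -> Utterance:
    # Assumed fixed duration and pitch per phone, purely for illustration.
    utt.params = [{"phone": p, "duration_ms": 80, "f0_hz": 120.0}
                  for p in utt.phones]
    return utt


def synthesizer(utt: Utterance, sample_rate: int = 8000) -> Utterance:
    # Placeholder signal generation: silence of the requested length.
    for p in utt.params:
        n_samples = sample_rate * p["duration_ms"] // 1000
        utt.samples.extend([0] * n_samples)
    return utt


PIPELINE = [text_preprocessor, linguistic_preprocessor,
            parameter_generator, synthesizer]


def synthesize(text: str) -> Utterance:
    utt = Utterance(text)
    for stage in PIPELINE:
        utt = stage(utt)
    return utt


utt = synthesize("Hello world")
```

Keeping the stages as separate functions over one data object mirrors the slide's split between linguistic processing and signal generation: either half could be swapped out without touching the other.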

Front-end Architecture
Convert orthographic input to an intermediate phonetic representation
Grapheme-to-phoneme conversion and symbolic prosody prediction
• Text Preprocessor
  - SSML parsing
  - Language identification
  - Tokenization
  - Text normalization: abbreviations, numbers, symbols ($/-...)
• Phonetics
  - DEPES: pattern matching formalism
  - cfg1: limited domain mp3, vad
  - cfg3: large lexicon
  - Post-lexical rules
  - Morpho-syntax: morphological and syntactic parsing, accentuation + phrasing
  - cfg4: large footprint
  - Cross-lingual Phone Mapping (CLM)
• NLU
  - OpenNLP for POS
  - IGtree for prominence and phrasing prediction
  - Research in, among others, corpus-based disambiguation, named entity tagging, genre classification
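
To make the text normalization step concrete (abbreviations, numbers, symbols), here is a minimal rule-based sketch. The abbreviation and symbol tables, the `normalize` function, and the number range are assumptions chosen for illustration; they are not the DEPES formalism or any Vocalizer ruleset, and a real normalizer would also need context to resolve ambiguous cases such as "St." (street or saint) and to reorder currency expressions.

```python
import re

# Hypothetical lookup tables, kept tiny on purpose.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "no.": "number"}
SYMBOLS = {"$": "dollars", "%": "percent", "&": "and"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out 0..999 as a cardinal number."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = "" if rest == 0 else " " + number_to_words(rest)
        return ONES[hundreds] + " hundred" + tail
    raise ValueError("this sketch only handles 0..999")


# A token is a word with an optional trailing period, a digit run,
# or any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z]+\.?|\d+|\S")


def normalize(text: str) -> str:
    words = []
    for token in TOKEN_RE.findall(text):
        lower = token.lower()
        if lower in ABBREVIATIONS:
            words.append(ABBREVIATIONS[lower])
        elif token.isdecimal():
            words.append(number_to_words(int(token)))
        elif token in SYMBOLS:
            words.append(SYMBOLS[token])
        elif lower.rstrip(".").isalpha():
            words.append(lower.rstrip("."))  # plain word, drop final period
        # Any other punctuation is silently dropped in this sketch.
    return " ".join(words)


spoken = normalize("Dr. Smith lives at 42 Main St.")
```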

Back-end Technologies
Convert phonetic representation to speech
• Multi-form Synthesis (MFS)
  - MFS/NVF: 1 GB
  - Statistical unit selection and multi-form synthesis, used in network applications; ~500 MB RAM (not optimized)
• Unit Selection
  - Standard: 20 MB; Plus: 60 MB; Premium: 100 MB; Prem. High: 300 MB
  - Manually tuned unit selection, used on larger embedded platforms and server applications; < 45 MB RAM
• HMM
  - Compact: 2-5 MB; Emb. Pro: ~10 MB
  - Model-based speech (vocoding), used on small devices and for Mandarin (Embedded Pro); < 8 MB RAM

Back-end technologies: below the surface
• HMM
  - Train parameters in HMM states
  - Emit durations, F0, spectra
  - Smooth using delta information
  - Convert to speech (vocoder)
• Unit Selection
  - Segment recordings into units, i.e. demiphones
  - Annotate units with linguistic properties
  - During synthesis, match units with target properties
  - Concatenate selected units
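
The unit selection steps (match candidate units against target properties, then concatenate the chosen ones) are commonly formulated as a shortest-path search over candidate units. The sketch below uses that textbook formulation with a target cost and a join cost; the join cost, the function names, and the pitch-only example are assumptions for illustration and are not confirmed by the slides as Vocalizer's method.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")  # target specification type
U = TypeVar("U")  # recorded unit type


def select_units(targets: Sequence[T],
                 candidates: Sequence[Sequence[U]],
                 target_cost: Callable[[T, U], float],
                 join_cost: Callable[[U, U], float]) -> List[U]:
    """Pick one candidate per target, minimizing the summed target and
    join costs with dynamic programming (Viterbi-style search).
    Every candidates[i] must be non-empty."""
    if not targets:
        return []
    # best[i][j]: cheapest cost of any path ending in candidates[i][j]
    best = [[target_cost(targets[0], u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for unit in candidates[i]:
            costs = [best[i - 1][k] + join_cost(prev, unit)
                     for k, prev in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + target_cost(targets[i], unit))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    return path[::-1]


# Toy example: targets and units are described by pitch (Hz) only.
target_f0 = [100, 110, 120]
unit_f0 = [[95, 130], [100, 112], [90, 125]]
chosen = select_units(
    target_f0, unit_f0,
    target_cost=lambda t, u: abs(t - u),   # distance to the target pitch
    join_cost=lambda a, b: abs(a - b),     # pitch jump at the join
)
```

In a real system the units would carry the annotated linguistic properties mentioned above, and both cost functions would compare many features rather than pitch alone.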

Languages & Voices Portfolio
45 languages, 84 voices (Jun’13)
The slide graphic gives the number of female (F) and male (M) TTS voices per language; in total 60 female and 24 male voices.
• Arabic, Argentinean Spanish, Australian English, Basque, Belgian Dutch, Brazilian Portuguese, British English, Canadian French
• Cantonese, Catalan, Colombian Spanish, Czech, Danish, Dutch, Finnish, French
• Galician, German, Greek, Hebrew, Hindi, Hungarian, Indian English, Indonesian
• Irish English, Italian, Japanese, Korean, Mandarin Chinese, Mexican Spanish, Norwegian, Polish
• Portuguese, Romanian, Russian, Scottish English, Slovak, South African English, Spanish, Swedish
• Taiwanese Mandarin, Thai, Turkish, US English, Valencian

Chinese TTS
News example: 房地产税是国家对不动产保有环节缴纳的法定税赋，两者不存在所谓不可克服的法理障碍和不可解决的重复征收问题。
(Approximate translation: "Property tax is a statutory tax that the state levies on the holding of real estate; between the two there are no so-called insurmountable legal obstacles or unsolvable problems of double taxation.")
• Front-end challenges
  - No word separators -> word disambiguation problem
  - Same characters are used for function words and content words
  - Word or accent group detection
  - Phrasing prediction (breaks)
• Back-end challenges
  - Tones require precise modeling of pitch (F0)
  - No lexical stress defined -> emphasis is hard to model
• Business challenges
  - Fast-growing market (handset, auto) with demanding customers
  - Strong competitor: iFlytek
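
The "no word separators" challenge can be illustrated with greedy forward maximum matching, a classic baseline for Chinese word segmentation. This is a sketch under stated assumptions: the two tiny lexicons and the `forward_max_match` function are hypothetical, and nothing in the slides says Vocalizer segments text this way. The point is only that the same character string splits differently depending on what the lexicon contains.

```python
from typing import List, Set


def forward_max_match(text: str, lexicon: Set[str],
                      max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation: at each position take the
    longest lexicon entry, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in lexicon:
                words.append(piece)
                i += length
                break
    return words


# Opening of the news sentence from the slide.
phrase = "房地产税是国家"

# Hypothetical lexicon that knows "property tax" as one word.
lexicon_a = {"房地产", "房地产税", "国家", "是"}
# Hypothetical lexicon that only knows "real estate".
lexicon_b = {"房地产", "国家", "是"}

seg_a = forward_max_match(phrase, lexicon_a)
seg_b = forward_max_match(phrase, lexicon_b)
```

With `lexicon_a` the phrase becomes three words; with `lexicon_b` the character 税 ("tax") is split off as its own word, which would change downstream accent-group and phrasing decisions.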

Chinese TTS
GM comparison: VfA vs. VE
• Recent benchmarks
  - GM benchmark: “closing the gap to iFlytek”
  - Daimler Newsreader -> independent evaluation by Beijing University: on par with iFlytek
• Back-end improvements
  - Embedded: more accurate tones with HMM technology
  - Network (research): syllable-based pitch modeling and unit selection
• Feb’13: Front-end improvements through syntactic analysis
• Oct’13: 10-20% accuracy increase expected with new statistical components

Vocalizer Studio
• Prompt sculpting, dictionary editor, text normalization rules editor
• Available for Windows, OSX (Nov)
• Sold as Vocalizer Offline in ENT
• Heavily used in PS for creation of
  - tuned Active Prompt databases
  - custom dictionaries
  - custom rulesets

Services / Tuning Through PS
• Prompt projects
  - Recorded Active Prompts (rAPDB)
  - Tuned Active Prompts (tAPDB)
• Text normalization rulesets (RETT)
• Custom dictionaries
• Porting services
  - QNX, WinCE, Win Auto, Neutrino, Osal Linux, Android 4, Meego, …
• Issue reporting and tracking
  - PS projects: fogbugz.nuance.com
  - General: network portal -> Incidents

Custom Voice Development Through R&D
• Rapidly increasing demand for custom voices
  - 2013: ~20 custom voices, incl. Audi, Engram
  - 2014: ~30 custom voices, incl. 4 languages for Aldebaran (talking robot “Nao”); ~5 voices expected for Intel, ~5 for Samsung
• Voice development workflow
  - Trade-off between efficiency and latency
  - Avoid running below capacity by putting new projects in a waiting queue

Demos
• Online web demo: http://gh-vd002:8080/synthdemo/
• Consumer products: BMW, Audi, S-voice, …
• Digital Mobile Assistant (DMA): Android app
  - Currently with Ava Premium and Compact, plans to move to network TTS
• Latest US flagship voice “Zoe”: MFS
• Latest Mandarin: Embedded Pro

TTS Strategic Directions
• Network TTS
  - Highest quality for long-form readout
  - Advanced text analysis and prosody prediction (NLU)
  - MFS models
  - New US voice “Zoe” internal release: Oct’13
• Connected advantages for Embedded TTS
  - Automatic pronunciation updates (OOV framework)
  - Split processing: text analysis in cloud, signal generation on device
  - Mechanical Turk for evaluation
• Asian languages
  - Continue progress for Mandarin
  - Roll out new front-end and back-end capabilities for Korean, Japanese

Backup Slides
