
Linguistic Resources needed by Nuance

Jan Odijk. 060528. Cocosda/Write Workshop.


Presentation Transcript


  1. Linguistic Resources needed by Nuance Jan Odijk 060528 Cocosda/Write Workshop

  2. Overview • Nuance History • Nuance Technologies • Nuance Language Coverage • Which Languages are needed • Which data are needed • Advantages

  3. Nuance History • ScanSoft (Digital Imaging) acquired: • Lernout & Hauspie speech divisions (2001) • Philips Speech Processing embedded and network divisions (2002) • Telelogue (2003) • LocusDialog (2003) • SpeechWorks (2004) • Talks (2004) • ART (2005) • Phonetic Systems (2005) • Rhetorical (2005) • MedRemote (2005) • Nuance (2005); the company was then renamed Nuance • Dictaphone (2006)

  4. Nuance Technologies • Digital Imaging • Speech Technologies • Text-to-Speech (TTS) • Automatic Speech Recognition (ASR) • Dictation • Speaker Verification • Audiomining • Speech Applications/Solutions • Automated Attendant Systems • Directory Assistance Systems • Dictation end-user application • Multimodal applications

  5. Nuance Technologies • Platforms • Server • DeskTop • Embedded • Automotive • Mobile Phones • Domains • Horizontal • Vertical • Medical • Legal • Navigation • ....

  6. Nuance Language Coverage • Broad language coverage • OCR supports 114 languages • DeskTop Dictation in 8 languages • TTS > 23 languages • Telephony ASR > 40 languages • Embedded ASR > 11 languages • Broad language coverage necessary • Most business customers are operating internationally • Want a single provider of language and speech technologies

  7. Nuance Language Coverage • Language Coverage must be further broadened! • Data are needed for that, but ... • Costs are high • No single company can afford the investments

  8. Which Languages? • Priority 1 • Arabic, Chinese (Mandarin, Cantonese), Danish, Dutch, English (UK), English (US), Farsi, Finnish, French, French (Canadian), German, Hindi, Indonesian, Italian, Malaysian, Pilipino (Tagalog), Polish, Portuguese, Portuguese (Brazil), Russian, Spanish, Spanish (American), Swedish, Thai, Turkish, Vietnamese,... • Priority 2 • Bulgarian, Croatian, Czech, Estonian, Greek, Gujarati, Hebrew, Hungarian, Icelandic, Japanese, Kannada, Kazakh, Khmer, Latvian, Lithuanian, Macedonian, Malayalam, Marathi, Norwegian, Punjabi, Romanian, Serbian, Sesotho, Sinhalese, Slovak, Slovenian, Swahili, Tamil, Telugu, Ukrainian, Urdu, Uzbek, Xhosa, Zulu,...

  9. Which Data? • There’s no Data like More Data • but... • Given time and cost constraints, a minimal set is needed to develop technologies/applications for new languages

  10. Which Data? • Network ASR: SpeechDat family • SpeechDat-II, Orientel, SALA (I and II), LILA • Embedded ASR • Automotive: SpeechDat-Car • Consumer Apps: SPEECON • Pronunciation and Grammatical Lexicons: LC-STAR • TTS synthesis: TC-STAR • see • http://www.speechdat.org • http://www.tc-star.org • http://www.lc-star.com

  11. Which Data? • Desktop Office data • Large Text Corpora (>300 million tokens plain text) • news • business / finance • traffic messages, weather messages • e-mail • SMS • ...

  12. Advantages • Research can be done in your own language • Part of the costs can be recovered by licensing data via ELRA to companies • Companies can develop technologies/applications for your languages • Contributes to securing the position of your language in the Internet era • Ask your government for funding and support • Some good examples: • STEVIN Programme Netherlands/Flanders • UPC databases for Catalan (Asunción Moreno)
