
Localization and Language Technology Standards

Localization and Language Technology Standards. Kavi Narayana Murthy University of Hyderabad ELITEX - 2007 New Delhi, 10-11 January 2007. Outline. Character Encoding Standards Fonts, Glyphs, Mapping Standards OS/Browser Support, Drivers Transliteration, Romanization


Presentation Transcript


  1. Localization and Language Technology Standards Kavi Narayana Murthy University of Hyderabad ELITEX - 2007 New Delhi, 10-11 January 2007

  2. Outline • Character Encoding Standards • Fonts, Glyphs, Mapping Standards • OS/Browser Support, Drivers • Transliteration, Romanization • Translation, Linguistic Resources • Speech and OCR Technologies • Enforcement Kavi Narayana Murthy UoH

  3. Goals • Functionality • Whatever we can do with English, we must be able to do with our own languages and scripts with equal ease • Inter-operability, Platform Independence • All Applications must work seamlessly on all hardware and software platforms • Language and Script Independence • Multi-lingual, Multi-Script Support

  4. Standards • Even a poor standard is better than no standard • Standards save us a lot in the long run • Commercial forces promoting non-standard, proprietary, secret systems must not be allowed to succeed • Let us not say “Let the Market Decide”!!!

  5. Character Encoding Standards • ISCII and Unicode • ISCII is a BIS Standard, Unicode is not • Unicode's encoding of Indian scripts is based on ISCII • In some sense, Unicode is a step backward • Let us understand ISCII first

  6. Language and Script • Do not confuse one for the other • Many-to-Many • Script is neither language nor font • Script and SuperScript • Phonetic Basis • Common SuperScript for all ILs • Script Grammar

  7. Language and Script • Sanskrit is written in Devanagari, Telugu, Kannada, Bangla etc. scripts • Devanagari is used for writing Sanskrit, Hindi, Marathi, etc. • English words are often written (transliterated) in local language scripts

  8. Phonetic Basis • Words: Meanings, Sounds, Written Symbols • Meanings are supreme but difficult to quantify and encode • Sounds are the next best • A ‘ka’ sound is a ‘ka’ sound, whatever be the language – Hence ‘Universal’ • No need for ‘Spellings’ • What we write is what we speak, directly

  9. Orthography • Written symbols correspond with phonemes – basic sound units • Minor variations in sounds (allophones, co-articulation effects etc.) are not depicted in orthography • Example, the one letter ‘t’ in: Mountain, tea, truck, spilt, little • Special Symbols not to be confused with basic Characters

  10. What is a Character? • Indian Languages: • No ‘alphabet’, no letters, no spellings • Phoneme-based • Units are syllable-like: called ‘akshara’-s • akshara-s very large in number • Corpus studies not sufficient • Made up of vowels, consonants etc. • Not all sequences valid

  11. Script Grammar • A Grammar for Scripts • Allows all valid sequences, only valid sequences • No need to code all possible akshara-s • Script grammar must be part of standards: ISCII includes one. Does Unicode? • Script Grammar to be enforced by software
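A script grammar of this kind can be sketched as a small finite state machine. The version below is only an illustration, not the grammar given in ISCII: the character classes use Unicode Devanagari ranges, and the transition table is a simplified assumption (independent vowel, consonant, virama, vowel sign, modifier) that ignores nukta, avagraha, digits and other symbols.

```python
# Minimal sketch of a script grammar as a finite state machine.
# Assumption: a simplified Devanagari grammar, not the one defined in ISCII.

def char_class(ch):
    """Map a Devanagari code point to a coarse character class."""
    cp = ord(ch)
    if 0x0905 <= cp <= 0x0914:
        return "V"  # independent vowel
    if 0x0915 <= cp <= 0x0939:
        return "C"  # consonant
    if 0x093E <= cp <= 0x094C:
        return "M"  # dependent vowel sign (matra)
    if cp == 0x094D:
        return "H"  # virama (halant)
    if 0x0901 <= cp <= 0x0903:
        return "D"  # candrabindu, anusvara, visarga
    return None     # outside this simplified grammar

# state -> {character class -> next state}; a missing entry means "invalid"
TRANSITIONS = {
    "START": {"V": "V", "C": "C"},
    "V": {"V": "V", "C": "C", "D": "D"},
    "C": {"V": "V", "C": "C", "H": "H", "M": "M", "D": "D"},
    "H": {"V": "V", "C": "C"},
    "M": {"V": "V", "C": "C", "D": "D"},
    "D": {"V": "V", "C": "C"},
}

def is_valid(text):
    """Return True if every character continues a valid sequence."""
    state = "START"
    for ch in text:
        state = TRANSITIONS[state].get(char_class(ch))
        if state is None:
            return False
    return True

print(is_valid("\u0939\u093F\u0902\u0926\u0940"))  # ha, i-sign, anusvara, da, ii-sign
print(is_valid("\u093F\u0915"))                    # vowel sign with no consonant before it
```

Because the machine rejects a vowel sign that does not follow a consonant, software built on it cannot store such a sequence, which is the sense in which a grammar "allows all valid sequences, only valid sequences".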

  12. SuperScript • ILs: 10 Scripts with a nearly common sound system – all derived from the ancient ‘braahmi’ script • => SuperScript • Super Set of all Phonemes • Common encoding: ISCII • Extendable to all languages of the world

  13. ISCII: (BIS – 1991: IS 13194) • 128 codes more than sufficient • Uses the upper half (128–255) of the 8-bit code space, the ASCII half is untouched – allows mixing with English • SuperScript: Transliteration built-in • Long Standing: ISCII 1988, 1991 • Well thought out and well designed
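Because ISCII keeps the 7-bit ASCII range intact and places its own codes in the upper half of the byte, English and Indian-language text can be told apart by the top bit alone. The sketch below illustrates just that property; the function name `split_runs` is hypothetical, and the upper-half byte values in the example are arbitrary placeholders, not a claim about which ISCII characters they encode.

```python
from itertools import groupby

def split_runs(data: bytes):
    """Split a byte string into runs of ASCII (0x00-0x7F) and upper-half (0x80-0xFF) bytes."""
    return [
        ("upper" if is_upper else "ascii", bytes(group))
        for is_upper, group in groupby(data, key=lambda b: b >= 0x80)
    ]

# Placeholder bytes: the two upper-half values stand in for any ISCII-coded text.
for kind, run in split_runs(b"Hello \xb3\xda ok"):
    print(kind, run)
```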

  14. Why did ISCII fail to catch on? • Silent on Character-to-Font mapping • A complex many-to-many mapping • Fonts not standardized, fonts not available • Not registered, no OS/Browser Support • (BIS – 1991: IS 13194) • Rationale not explained • Not publicized, not enforced

  15. History • Proprietary, non-standard, secret font based encoding schemes • Promoted by commercial companies • Near Zero Inter-operability • Ad-hoc ISCII-to-font mapping schemes • Mapping schemes not made public • To be made Illegal and Punishable • Put India back by at least a decade!

  16. Improving ISCII • Register - To get OS/Browser Support • Remove encoding of allophones, allographs • Script Grammar: an FSM is enough, a CFG is not needed • Include Rationale, explanatory notes • Remove Attribute/Extension codes • Standardize ISCII-to-Font Mapping Scheme • Promote, Enforce

  17. Character-to-Font Mapping • Complex scripts – not linear • Glyphs: shape units convenient for rendering • Poor correspondence with sound units • Many-to-Many mappings • Glyph selection, scaling, positioning • No Glyph Encoding Standard

  18. From Character to Font • Must be provably complete and 100% consistent • Current systems are all ad-hoc – neither complete nor consistent • Finite State Transducers: • Necessary and Sufficient • Without restricting Creativity and Flexibility • Simple, Efficient, Re-Usable
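To make the transducer idea concrete, here is a minimal sketch that handles one well-known non-linear case in Devanagari: the short i vowel sign is stored after its consonant but drawn before it. Everything else about the example is an assumption made for illustration: the glyph identifiers are hypothetical, no real font is consulted, and conjuncts (where the sign moves in front of the whole cluster) are not handled.

```python
# Sketch of a character-to-glyph transducer with a single pending consonant.
# Assumptions: hypothetical glyph names; only the short-i reordering is modelled.

I_SIGN = "\u093F"  # DEVANAGARI VOWEL SIGN I, rendered to the left of its consonant

def is_consonant(ch):
    return "\u0915" <= ch <= "\u0939"

def glyph(ch):
    """Hypothetical glyph identifier derived from the code point."""
    return f"glyph_{ord(ch):04X}"

def to_glyphs(text):
    """Transduce characters in storage order into glyphs in visual order."""
    out = []
    pending = None  # consonant read but not yet emitted
    for ch in text:
        if pending is not None and ch == I_SIGN:
            # Emit the vowel sign first, then the consonant it attaches to.
            out += [glyph(ch), glyph(pending)]
            pending = None
            continue
        if pending is not None:
            out.append(glyph(pending))
            pending = None
        if is_consonant(ch):
            pending = ch
        else:
            out.append(glyph(ch))
    if pending is not None:
        out.append(glyph(pending))
    return out

print(to_glyphs("\u0915\u093F"))  # ka + i-sign: the sign's glyph comes out first
print(to_glyphs("\u0915\u093E"))  # ka + aa-sign: order unchanged
```

The machine needs only a bounded amount of memory (one held consonant), which is why a finite state device is a plausible fit for this kind of mapping.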

  19. Encoding Standards: Unicode • For Language/Script/SuperScript? • CJK. Why not for ILs? • Script Grammar? • Character-to-Font: • relegated to font level • font effects • ISCII-88 Based, Has Errors • Once added, cannot be deleted!

  20. ISCII or Unicode? • Unicode: • To be with the World, to know and be known • ‘Correcting’ Mistakes, Improving Standards • Support (OS, Fonts, etc.), Education, Training • Converting Legacy Data – A Huge Task • ISCII-to-Unicode is not trivial • Ignore BIS Standard and embrace what is not yet ‘standardized’? • Why not co-exist? – Internal and External Views

  21. Keyboard Layouts, Drivers • Several de-facto standards and many variations in use • To select a few and standardize • So-called Roman Phonetic Typing • ILs through English! • OK for oldies, not for future! • INSCRIPT: ISCII Standard, Good for newcomers • To strictly enforce Script Grammar

  22. Document Encoding Standards • Plain Text: pure ISCII/UNICODE • Mono-lingual Plain Text? • Annotated Text (Ex. Word Processors) • XML Style, Open, Readable formats to be encouraged • Proprietary, secret, non-standard encodings must be discouraged

  23. Transliteration • Widely used, part of our Tradition • Sanskrit texts in local scripts • English, Hindi, Urdu words in local scripts • Music Compositions • Automatic in ISCII. Unicode? • Quality of transliteration • To and From English?
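On the "Unicode?" question: Unicode gives each Indian script its own 128-code-point block, and those blocks follow the ISCII ordering, so a rough script-to-script transliteration can be had by shifting code points from one block to another. The sketch below assumes exactly that and nothing more; the function name `shift_script` is hypothetical, and the result is not guaranteed to be valid, since some positions are unassigned in some scripts and the output would need checking.

```python
# Rough transliteration between Indian scripts by shifting Unicode blocks.
# Assumption: parallel block layout only; no check that the target code
# point is actually assigned in the destination script.

BLOCK_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}

def shift_script(text, src, dst):
    """Move characters from the src block to the same offset in the dst block."""
    lo = BLOCK_BASE[src]
    delta = BLOCK_BASE[dst] - lo
    return "".join(
        chr(ord(ch) + delta) if lo <= ord(ch) < lo + 0x80 else ch
        for ch in text
    )

word = "\u0915\u092E\u0932"  # ka, ma, la in Devanagari
print(shift_script(word, "devanagari", "telugu"))
print(shift_script(word, "devanagari", "kannada"))
```

In ISCII the same bytes are simply displayed in a different script, so no conversion step is needed at all; with Unicode the shift has to be performed explicitly, as here.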

  24. Romanization • Need: • Where there is no support for local languages • English dailies, posters, advertisements etc. • Lack of support: OS/Browser/Fonts etc. • Where users prefer Roman • A variety of ad-hoc schemes in use • iTRANS, RTS, W-X, etc. • Standards badly wanted

  25. Romanization • Multi-dimensional optimization problem • Case Mix-up • 26 Letters not sufficient • 52 nearly sufficient • Not always supported • Storage space, Ease of Typing, Aesthetics • Scientific/Logical Design/Naturalness • English-like – for the oldies: a, ee, oo, a, oa ??? • Futuristic: aa/ii/uu/ee/oo

  26. Romanization • Clashes: a+u/au, k+h/kh, s’ • Two way conversion, cyclic check • Ex. Long Vowels: • a: (– clashes with colon) • diacritic (– not supported) • IPA (– not understood, – not supported) • A (+ single char., + saves space, – ugly, – difficult to type, – case mix-up) • aa (+ logical (like ee), + easy to type)

  27. Romanization: An Example • a aa i ii u uu R RR e ee ai o oo au M H • k kh g gh n~ • c ch j jh n` • T TH D DH N • t th d dh n • p ph b bh m • y r l v s’ S s h L
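The clashes and the cyclic check mentioned on the Romanization slides can be demonstrated with the symbol set listed in the example scheme. The sketch below is an illustration under stated assumptions: it uses a greedy longest-match tokenizer (one possible reading rule, not necessarily the one intended), writes the s’ symbol with a plain ASCII apostrophe, and the function names `tokenize` and `round_trips` are hypothetical.

```python
# Cyclic (round-trip) check for a Romanization scheme.
# Assumptions: greedy longest-match reading; s' typed with an ASCII apostrophe.

SYMBOLS = (
    "a aa i ii u uu R RR e ee ai o oo au M H "
    "k kh g gh n~ c ch j jh n` T TH D DH N "
    "t th d dh n p ph b bh m y r l v s' S s h L"
).split()

# Try longer symbols first so that "aa" is preferred over "a" + "a".
BY_LENGTH = sorted(SYMBOLS, key=len, reverse=True)

def tokenize(text):
    """Split Roman text into scheme symbols using longest match."""
    tokens = []
    i = 0
    while i < len(text):
        for sym in BY_LENGTH:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            raise ValueError(f"no symbol matches at position {i}: {text[i:]!r}")
    return tokens

def round_trips(tokens):
    """True if writing the symbols out and reading them back gives the same symbols."""
    return tokenize("".join(tokens)) == list(tokens)

print(round_trips(["k", "a", "m", "a", "l", "a"]))  # no clash in this sequence
print(round_trips(["a", "u"]))                      # read back as the single symbol "au"
print(round_trips(["k", "h"]))                      # read back as the single symbol "kh"
```

A sequence that fails the round trip marks a place where the scheme needs a separator or a different symbol, which is the kind of clash the slides list (a+u/au, k+h/kh).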

  28. Translation • Create Material Afresh • Translate by Hand • Automatic/Machine Translation • Machine Aided Translation • English – Local Language Translation • Local – Local Language Translation

  29. Translation • Resource Intensive • Manpower, Time, Cost • Quality/Uniformity • Standards, Bench-Mark Data, Testing and Evaluation Procedures • Dictionaries, Terminology Databases • Pan-Indian Terms/Sanskritize/Localize

  30. Linguistic Resources • Dictionaries – General, Domain Specific • Terminological Databases • Thesauri, WordNets, Ontologies • Morphological Analyzers, Generators • Spell/Grammar/Style Checkers • Annotated Text and Speech Corpora

  31. India: Future is in Speech • One Billion People, A Sixth of the World • More than 150 Languages, 22 Recognized • 95% not comfortable with English • Computers, Current, Connectivity • Info Revolution benefits: Majority Deprived • 10 M Computers, 100 M Phones • Future is in Speech

  32. Speech • Natural • Easy, Fast • Hands-Free • No need to Learn • Technology • Language • Available to all

  33. Text and Speech • Speech is Natural • Reading/Writing is learnt, Artificial • Some never learn – Illiterates • Oral Tradition • Speech is more permanent than Text! • “I did not steal that ring of gold” • Trust Yourself!

  34. Speech Technologies • Speech Recognition: Speech to Text • Speech Synthesis: Text to Speech • Speaker Recognition, Verification, ID • Speech Coding/Decoding, Compression • Slow down, Speed up • Speech as Evidence

  35. Applications • Telephone Dialing • Form Filling • Dictation Machine • Command and Control • Voice enabled Web • OCR+WP+TTS • MT: Cross-Lingual IR, S2S

  36. OCR • OCR in Local Scripts Needed • To digitize and save legacy data • To compile/process/edit/refine data • For Printed Texts/Manuscripts • Old Data • deterioration of paper • old type fonts, problems of type-setting

  37. Multi-Modal Interfaces • To Reach out to 1 Billion People, we must get the best of many worlds: • Speech Recognition and Synthesis • Graphics and iconic Interfaces • OCR Technologies • Translation, CLIR • Camera, Gestures, Touch Screen

  38. Balance • Between Backward Compatibility and Future-Proof Designs • Quick Fix Solutions and Long Haul • One Standard or Several? • Economics and Business Sense versus Social Responsibilities • Acceptance versus Enforcement

  39. The 3 Most Important Things 1. Develop/Refine/Update Standards • Detailed Documentation • Including rationale, issues, evaluation, etc. 2. Education and Training 3. Enforcement • Make use of non-standard methods illegal and punishable under law • Technical Workshops for detailing

  40. Thank You! Visit www.LanguageTechnologies.ac.in
