Localization and Language Technology Standards

Kavi Narayana Murthy

University of Hyderabad

ELITEX - 2007

New Delhi, 10-11 January 2007

Outline

  • Character Encoding Standards

  • Fonts, Glyphs, Mapping Standards

  • OS/Browser Support, Drivers

  • Transliteration, Romanization

  • Translation, Linguistic Resources

  • Speech and OCR Technologies

  • Enforcement

Kavi Narayana Murthy UoH

Goals

  • Functionality

    • Whatever we can do with English, we must be able to do with our own languages and scripts with equal ease

  • Inter-operability, Platform Independence

    • All Applications must work seamlessly on all hardware and software platforms

  • Language and Script Independence

    • Multi-lingual, Multi-Script Support

Standards

  • Even a poor standard is better than no standard

  • Standards save us a lot in the long run

  • Commercial forces promoting non-standard, proprietary, secret systems must not be allowed to succeed

    • Let us not say “Let the Market Decide”!!!

Character Encoding Standards

  • ISCII and Unicode

  • ISCII is a BIS Standard, Unicode is not

  • Unicode is based on ISCII

  • In some sense, Unicode is a step in the backward direction

  • Let us understand ISCII first

Language and Script

  • Do not confuse one for the other

  • Many-to-Many

  • Script is neither language nor font

  • Script and SuperScript

  • Phonetic Basis

    • Common SuperScript for all ILs

  • Script Grammar

Language and Script

  • Sanskrit is written in Devanagari, Telugu, Kannada, Bangla etc. scripts

  • Devanagari is used for writing Sanskrit, Hindi, Marathi, etc.

  • English words are often written (transliterated) in local language scripts

Phonetic Basis

  • Words: Meanings, Sounds, Written Symbols

  • Meanings are supreme but difficult to quantify and encode

  • Sounds are the next best

    • A ‘ka’ sound is a ‘ka’ sound, whatever be the language – Hence ‘Universal’

    • No need for ‘Spellings’

      • What we write is what we speak - directly

Orthography

  • Written symbols correspond with phonemes – basic sound units

  • Minor variations in sounds (allophones, co-articulation effects etc.) are not depicted in orthography

    • t: Mountain, tea, truck, spilt, little

  • Special Symbols not to be confused with basic Characters

What is a Character?

  • Indian Languages:

    • No ‘alphabet’, not letters, no spellings

    • Phoneme-based

    • Units are syllable-like: called ‘akshara’-s

  • akshara-s very large in number

    • Corpus studies not sufficient

  • Made up of vowels, consonants etc.

  • Not all sequences valid

Script Grammar

  • A Grammar for Scripts

  • Allows all valid sequences, only valid sequences

  • No need to code all possible akshara-s

  • Script grammar must be part of the standard: ISCII includes one. Does Unicode?

  • Script Grammar to be enforced by s/w
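
As a rough illustration of software enforcing a script grammar, the sketch below checks one akshara with a small finite-state machine. The grammar is a simplified, hypothetical one invented for this example and is not the grammar given in IS 13194; the class letters V, C, H, M and D are also assumptions made here.

```python
# Hypothetical, simplified script grammar; NOT the grammar given in IS 13194.
# Symbol classes (assumed for this sketch):
#   V = independent vowel, C = consonant, H = halant (virama),
#   M = vowel sign (matra), D = modifier such as anusvara or visarga.
TRANSITIONS = {
    ("start", "V"): "vowel",
    ("start", "C"): "cons",
    ("cons", "H"): "halant",   # consonant loses its inherent vowel
    ("cons", "M"): "vowel",    # consonant takes a vowel sign
    ("cons", "D"): "end",
    ("halant", "C"): "cons",   # conjunct: next consonant joins the cluster
    ("vowel", "D"): "end",
}
ACCEPTING = {"vowel", "cons", "halant", "end"}


def is_valid_akshara(classes):
    """Return True if the sequence of class letters forms one valid akshara."""
    state = "start"
    for symbol in classes:
        state = TRANSITIONS.get((state, symbol))
        if state is None:          # no transition: the sequence is rejected
            return False
    return state in ACCEPTING


print(is_valid_akshara("CHCM"))  # consonant cluster plus vowel sign
print(is_valid_akshara("MC"))    # vowel sign with no consonant before it
```

A keyboard driver or editor could run such a check on every keystroke and refuse input that has no transition, which is one way to accept all valid sequences and only valid sequences without listing every akshara.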

SuperScript

  • ILs: 10 Scripts with a nearly common sound system – all derived from the ancient ‘braahmi’ script

  • => SuperScript

    • Super Set of all Phonemes

  • Common encoding: ISCII

  • Extendable to all languages of the world

ISCII: (BIS – 1991: IS 13194)

  • 128 codes more than sufficient

  • Uses second half of ASCII, first half untouched – allows mixing with English

  • SuperScript: Transliteration built-in

  • Long Standing: ISCII 1988, 1991

  • Well thought out and well designed
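
Because the first half of the 8-bit range is left untouched, a single high-bit test is enough to separate English from Indian-language text in a mixed byte stream. The sketch below shows this under one loud assumption: every byte of 0x80 or above is treated as belonging to the upper half, without checking which codes the standard actually assigns. The function name and the sample bytes are made up for illustration.

```python
from itertools import groupby


def split_runs(data: bytes):
    """Split a mixed byte string into ("ascii", bytes) and ("upper", bytes) runs.

    Assumption for this sketch: any byte >= 0x80 counts as upper-half
    (ISCII-side) data; no check is made against the codes the standard assigns.
    """
    runs = []
    for is_upper, group in groupby(data, key=lambda b: b >= 0x80):
        runs.append(("upper" if is_upper else "ascii", bytes(group)))
    return runs


# Arbitrary sample: two ASCII letters, two upper-half bytes, then ASCII again.
print(split_runs(b"ab\xb3\xda cd"))
```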

Why did ISCII fail to catch on?

  • Silent on Character-to-Font mapping

    • A complex many-to-many mapping

    • Fonts not standardized, fonts not available

  • Not registered, no OS/Browser Support

  • (BIS – 1991: IS 13194)

  • Rationale not explained

  • Not publicized, not enforced

History

  • Proprietary, non-standard, secret font based encoding schemes

    • Promoted by commercial companies

    • Near Zero Inter-operability

    • Ad-hoc ISCII-to-font mapping schemes

    • Mapping schemes not made public

    • To be made Illegal and Punishable

  • Put India back by at least a decade!

Improving ISCII

  • Register - To get OS/Browser Support

  • Remove encoding of allophones, allographs

  • Script Grammar: FSM enough, CFG - not needed

  • Include Rationale, explanatory notes

  • Remove Attribute/Extension codes

  • Standardize ISCII-to-Font Mapping Scheme

  • Promote, Enforce

Character-to-Font Mapping

  • Complex scripts – not linear

  • Glyphs: shape units convenient for rendering

  • Poor correspondence with sound units

  • Many-to-Many mappings

    • Glyph selection, scaling, positioning

  • No Glyph Encoding Standard

From Character to Font

  • Must be provably complete and 100% consistent

  • Current systems are all ad-hoc – neither complete nor consistent

  • Finite State Transducers:

    • Necessary and Sufficient

    • Without restricting Creativity and Flexibility

    • Simple, Efficient, Re-Usable
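
To give a feel for the transducer idea, here is a minimal sketch that handles one well-known non-linear case: the Devanagari vowel sign i (U+093F) is stored after its consonant but drawn to its left. The code holds back at most one consonant, so it needs only finite memory. It is an illustration only: it uses Unicode code points rather than ISCII codes, ignores conjuncts, and outputs reordered characters instead of real font glyphs.

```python
PRE_BASE_SIGNS = {"\u093f"}  # DEVANAGARI VOWEL SIGN I, drawn before its consonant


def is_consonant(ch):
    """True for Devanagari consonants KA (U+0915) through HA (U+0939)."""
    return "\u0915" <= ch <= "\u0939"


def to_visual_order(text):
    """Map stored (phonetic) order to drawing order for the single case above."""
    out = []
    pending = None  # transducer state: one consonant held back, or None
    for ch in text:
        if pending is not None:
            if ch in PRE_BASE_SIGNS:
                out.extend([ch, pending])  # emit the sign first, then the consonant
                pending = None
                continue
            out.append(pending)            # nothing to reorder: release the consonant
            pending = None
        if is_consonant(ch):
            pending = ch
        else:
            out.append(ch)
    if pending is not None:
        out.append(pending)
    return "".join(out)


stored = "\u0915\u093f"  # KA followed by vowel sign I, as stored
print([hex(ord(c)) for c in to_visual_order(stored)])
```

A full character-to-font mapping would add states for conjuncts, glyph selection and positioning, but each rule would still be a transition of the same kind, which is what makes completeness and consistency checkable.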

Encoding Standards: Unicode

  • For Language/Script/SuperScript?

    • CJK. Why not for ILs?

  • Script Grammar?

  • Character-to-Font:

    • relegated to font level

    • font effects

  • ISCII-88 Based, Has Errors

    • Once added, cannot be deleted!

ISCII or Unicode?

  • Unicode:

    • To be with the World, to know and be known

    • ‘Correcting’ Mistakes, Improving Standards

    • Support (OS, Fonts, etc.), Education, Training

    • Converting Legacy Data – A Huge Task

      • ISCII-to-Unicode is not trivial

    • Ignore BIS Standard and embrace what is not yet ‘standardized’?

  • Why not co-exist? – Internal and External Views

Keyboard Layouts, Drivers

  • Several de-facto standards and many variations in use

    • To select a few and standardize

  • So called Roman Phonetic Typing

    • ILs through English!

    • OK for oldies, not for future!

  • INSCRIPT: ISCII Standard, Good for newcomers

  • To strictly enforce Script Grammar

Document Encoding Standards

  • Plain Text: pure ISCII/UNICODE

    • Mono-lingual Plain Text?

  • Annotated Text (Ex. Word Processors)

    • XML Style, Open, Readable formats to be encouraged

    • Proprietary, secret, non-standard encodings must be discouraged

Transliteration

  • Widely used, part of our Tradition

    • Sanskrit texts in local scripts

    • English, Hindi, Urdu words in local scripts

    • Music Compositions

  • Automatic in ISCII. Unicode?

    • Quality of transliteration

  • To and From English?
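
On the Unicode side, a crude form of script-to-script transliteration is possible because the Indic blocks are 128 code points each and follow a parallel, ISCII-derived layout. The sketch below simply shifts code points from one block to another. The function name and the choice of four scripts are assumptions for this example, and the result is only approximate: not every position is assigned in every script, which is part of the quality question raised above.

```python
# Start of each 128-code-point Unicode block (only a few scripts listed here).
BLOCK_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
}


def transliterate(text, source, target):
    """Shift characters of the source block to the same offset in the target block.

    Characters outside the source block (spaces, digits, Latin) pass through.
    No check is made that the target code point is actually assigned.
    """
    src, dst = BLOCK_BASE[source], BLOCK_BASE[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + 0x80:
            out.append(chr(cp - src + dst))
        else:
            out.append(ch)
    return "".join(out)


# Devanagari KA, MA, LA mapped to the corresponding Telugu letters.
print(transliterate("\u0915\u092e\u0932", "devanagari", "telugu"))
```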

Romanization

  • Need:

    • Where there is no support for local languages

      • English dailies, posters, advertisements etc.

      • Lack of support: OS/Browser/Fonts etc.

    • Where users prefer Roman

  • A variety of ad-hoc schemes in use

    • iTRANS, RTS, W-X, etc.

  • Standards badly wanted

Romanization

  • Multi-dimensional optimization problem

    • Case Mix-up

      • 26 Letters not sufficient

      • 52 nearly sufficient

      • Not always supported

    • Storage space, Ease of Typing, Aesthetics

    • Scientific/Logical Design/Naturalness

      • English-like – for the oldies: a, ee, oo, a, oa ???

      • Futuristic: aa/ii/uu/ee/oo

Romanization

  • Clashes: a+u/au, k+h/kh, s’

    • Two way conversion, cyclic check

  • Ex. Long Vowels:

    • a: – clashes with colon

    • diacritic – not supported

    • IPA – not understood, – not supported

    • A + single char., + saves space, – ugly, – difficult to type, – case mix-up

    • aa +logical (like ee) +easy to type

Romanization: An Example

  • a aa i ii u uu R RR e ee ai o oo au M H

  • k kh g gh n~

  • c ch j jh n`

  • T TH D DH N

  • t th d dh n

  • p ph b bh m

  • y r l v s’ S s h L
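
The clashes and the cyclic check mentioned earlier can be tried out on this scheme with a short sketch. It tokenizes romanized text by greedy longest match against the symbols listed above and then tests whether a token sequence survives being joined and re-tokenized. The longest-match strategy, the function names, and the use of a plain ASCII apostrophe in s' are assumptions of this example, not part of the scheme as presented.

```python
# Symbols copied from the example scheme; s' uses an ASCII apostrophe here.
SYMBOLS = (
    "a aa i ii u uu R RR e ee ai o oo au M H "
    "k kh g gh n~ c ch j jh n` T TH D DH N "
    "t th d dh n p ph b bh m y r l v s' S s h L"
).split()
SYMBOL_SET = set(SYMBOLS)
MAX_LEN = max(len(s) for s in SYMBOLS)


def tokenize(text):
    """Split romanized text into scheme symbols, longest match first."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(MAX_LEN, 0, -1):
            piece = text[i:i + size]
            if piece in SYMBOL_SET:
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"unknown symbol at position {i}")
    return tokens


def survives_round_trip(tokens):
    """Cyclic check: joining and re-tokenizing must give the same symbols back."""
    return tokenize("".join(tokens)) == tokens


print(survives_round_trip(["kh", "a"]))      # aspirated consonant plus vowel
print(survives_round_trip(["k", "h", "a"]))  # k+h cluster collides with kh
print(survives_round_trip(["k", "a", "u"]))  # a+u collides with au
```

Sequences that fail the check are exactly the clashes a romanization standard has to resolve, for example with an explicit separator.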

Translation

  • Create Material Afresh

  • Translate by Hand

  • Automatic/Machine Translation

  • Machine Aided Translation

  • English – Local Language Translation

  • Local – Local Language Translation

Translation

  • Resource Intensive

    • Manpower, Time, Cost

  • Quality/Uniformity

    • Standards, Bench-Mark Data, Testing and Evaluation Procedures

  • Dictionaries, Terminology Databases

    • Pan-Indian Terms/Sanskritize/Localize

Linguistic Resources

  • Dictionaries – General, Domain Specific

  • Terminological Databases

  • Thesauri, WordNets, Ontologies

  • Morphological Analyzers, Generators

  • Spell/Grammar/Style Checkers

  • Annotated Text and Speech Corpora

India: Future is in Speech

  • One Billion People, A Sixth of the World

  • More than 150 Languages, 22 Recognized

  • 95 % not comfortable with English

  • Computers, Current, Connectivity

  • Info Revolution benefits: Majority Deprived

  • 10 M Computers, 100 M Phones

  • Future is in Speech

Speech

  • Natural

  • Easy, Fast

  • Hands-Free

  • No need to Learn

    • Technology

    • Language

  • Available to all

Text and Speech

  • Speech is Natural

  • Reading/Writing is learnt, Artificial

  • Some never learn – Illiterates

  • Oral Tradition

  • Speech is more permanent than Text!

  • “I did not steal that ring of gold”

  • Trust Yourself!

Speech Technologies

  • Speech Recognition: Speech to Text

  • Speech Synthesis: Text to Speech

  • Speaker Recognition, Verification, ID

  • Speech Coding/Decoding, Compression

  • Slow down, Speed up

  • Speech as Evidence

Applications

  • Telephone Dialing

  • Form Filling

  • Dictation Machine

  • Command and Control

  • Voice enabled Web


  • MT: Cross-Lingual IR, S2S

OCR

  • OCR in Local Scripts Needed

    • To digitize and save legacy data

    • To compile/process/edit/refine data

  • For Printed Texts/Manuscripts

  • Old Data

    • deterioration of paper

    • old type fonts, problems of type-setting

Multi-Modal Interfaces

  • To Reach out to 1 Billion People, we must get the best of many worlds:

    • Speech Recognition and Synthesis

    • Graphics and iconic Interfaces

    • OCR Technologies

    • Translation, CLIR

    • Camera, Gestures, Touch Screen

Balance

  • Between Backward Compatibility and Future-Proof Designs

  • Quick Fix Solutions and Long Haul

  • One Standard or Several?

  • Economics and Business Sense versus Social Responsibilities

  • Acceptance versus Enforcement

The 3 Most Important Things

1. Develop/Refine/Update Standards

  • Detailed Documentation

  • Including rationale, issues, evaluation, etc.

2. Education and Training

3. Enforcement

  • Make use of non-standard methods illegal and punishable under law

  • Technical Workshops for detailing

Thank You!