
Localization and Language Technology Standards

Kavi Narayana Murthy

University of Hyderabad

ELITEX - 2007

New Delhi, 10-11 January 2007

Outline
  • Character Encoding Standards
  • Fonts, Glyphs, Mapping Standards
  • OS/Browser Support, Drivers
  • Transliteration, Romanization
  • Translation, Linguistic Resources
  • Speech and OCR Technologies
  • Enforcement

Kavi Narayana Murthy UoH

Goals
  • Functionality
    • Whatever we can do with English, we must be able to do with our own languages and scripts with equal ease
  • Inter-operability, Platform Independence
    • All Applications must work seamlessly on all hardware and software platforms
  • Language and Script Independence
    • Multi-lingual, Multi-Script Support

Standards
  • Even a poor standard is better than no standard
  • Standards save us a lot in the long run
  • Commercial forces promoting non-standard, proprietary, secret systems must not be allowed to succeed
    • Let us not say “Let the Market Decide”!!!

Character Encoding Standards
  • ISCII and Unicode
  • ISCII is a BIS Standard, Unicode is not
  • Unicode is based on ISCII
  • In some sense, Unicode is a step in the backward direction
  • Let us understand ISCII first

Language and Script
  • Do not confuse one for the other
  • Many-to-Many
  • Script is neither language nor font
  • Script and SuperScript
  • Phonetic Basis
    • Common SuperScript for all ILs
  • Script Grammar

Language and Script
  • Sanskrit is written in Devanagari, Telugu, Kannada, Bangla etc. scripts
  • Devanagari is used for writing Sanskrit, Hindi, Marathi, etc.
  • English words are often written (transliterated) in local language scripts

Phonetic Basis
  • Words: Meanings, Sounds, Written Symbols
  • Meanings are supreme but difficult to quantify and encode
  • Sounds are the next best
    • A ‘ka’ sound is a ‘ka’ sound, whatever be the language – Hence ‘Universal’
    • No need for ‘Spellings’
      • What we write is what we speak, directly

Orthography
  • Written symbols correspond with phonemes – basic sound units
  • Minor variations in sounds (allophones, co-articulation effects etc.) are not depicted in orthography
    • t: Mountain, tea, truck, spilt, little
  • Special Symbols not to be confused with basic Characters

What is a Character?
  • Indian Languages:
    • No ‘alphabet’, no letters, no spellings
    • Phoneme-based
    • Units are syllable-like: called ‘akshara’-s
  • akshara-s very large in number
    • Corpus studies not sufficient
  • Made up of vowels, consonants etc.
  • Not all sequences valid

Script Grammar
  • A Grammar for Scripts
  • Allows all valid sequences, only valid sequences
  • No need to code all possible akshara-s
  • Script grammar must be part of standards: ISCII includes one. Does Unicode?
  • Script Grammar to be enforced by s/w
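
As a rough illustration of a script grammar enforced by software, the sketch below runs a small finite state machine over Devanagari text encoded in Unicode. The character classes, the transition table, and the name `is_valid` are assumptions made for this example; the grammar is deliberately simplified and is not the one specified in ISCII.

```python
# A simplified, hypothetical script grammar for Devanagari, checked by a
# finite state machine. Real script grammars cover more cases (nukta, etc.).

def char_class(ch):
    """Map a character to a grammar class, or None if it is not covered."""
    cp = ord(ch)
    if 0x0905 <= cp <= 0x0914:
        return "V"  # independent vowel
    if 0x0915 <= cp <= 0x0939:
        return "C"  # consonant
    if 0x093E <= cp <= 0x094C:
        return "M"  # dependent vowel sign (matra)
    if cp == 0x094D:
        return "H"  # virama (halant)
    if 0x0901 <= cp <= 0x0903:
        return "D"  # candrabindu, anusvara, visarga
    return None

# state -> {input class -> next state}; missing entries mean "reject".
TRANSITIONS = {
    "START": {"V": "V", "C": "C"},
    "V": {"V": "V", "C": "C", "D": "D"},
    "C": {"V": "V", "C": "C", "M": "M", "H": "H", "D": "D"},
    "H": {"C": "C"},
    "M": {"V": "V", "C": "C", "D": "D"},
    "D": {"V": "V", "C": "C"},
}
ACCEPTING = {"V", "C", "H", "M", "D"}

def is_valid(word):
    """Return True if every character sequence in `word` is allowed."""
    state = "START"
    for ch in word:
        cls = char_class(ch)
        if cls is None:
            return False
        state = TRANSITIONS[state].get(cls)
        if state is None:
            return False
    return state in ACCEPTING
```

With this table, consonant + virama + consonant is accepted, while a word that starts with a vowel sign, or that carries two vowel signs on one consonant, is rejected. A keyboard driver or editor could apply the same check as each key is typed.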

SuperScript
  • ILs: 10 Scripts with a nearly common sound system – all derived from the ancient ‘braahmi’ script
  • => SuperScript
    • Super Set of all Phonemes
  • Common encoding: ISCII
  • Extendable to all languages of the world

ISCII: (BIS – 1991: IS 13194)
  • 128 codes more than sufficient
  • Uses second half of ASCII, first half untouched – allows mixing with English
  • SuperScript: Transliteration built-in
  • Long Standing: ISCII 1988, 1991
  • Well thought out and well designed
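
Because the Indian-language codes occupy only the upper half of the 8-bit range, mixed text can be separated from English by looking at the high bit alone. The sketch below shows this under stated assumptions: the function name `split_runs` is hypothetical, and the high-byte values in the usage note are placeholders rather than a claim about specific ISCII code assignments.

```python
from itertools import groupby

def split_runs(data: bytes):
    """Split a byte string into (kind, bytes) runs.

    Bytes 0x00-0x7F are labelled 'ascii'; bytes 0x80-0xFF are labelled
    'indic' (the half of the 8-bit range that ISCII uses).
    """
    return [
        ("indic" if high else "ascii", bytes(group))
        for high, group in groupby(data, key=lambda b: b >= 0x80)
    ]
```

For example, `split_runs(b"abc\xb3\xcc def")` yields an ASCII run, a two-byte upper-half run, and a second ASCII run, with no escape sequences or byte-order marks needed.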

Why did ISCII fail to catch on?
  • Silent on Character-to-Font mapping
    • A complex many-to-many mapping
    • Fonts not standardized, fonts not available
  • Not registered, no OS/Browser Support
  • (BIS – 1991: IS 13194)
  • Rationale not explained
  • Not publicized, not enforced

History
  • Proprietary, non-standard, secret font based encoding schemes
    • Promoted by commercial companies
    • Near Zero Inter-operability
    • Ad-hoc ISCII-to-font mapping schemes
    • Mapping schemes not made public
    • To be made Illegal and Punishable
  • Put India back by at least a decade!

Improving ISCII
  • Register - To get OS/Browser Support
  • Remove encoding of allophones, allographs
  • Script Grammar: FSM enough, CFG - not needed
  • Include Rationale, explanatory notes
  • Remove Attribute/Extension codes
  • Standardize ISCII-to-Font Mapping Scheme
  • Promote, Enforce

Character-to-Font Mapping
  • Complex scripts – not linear
  • Glyphs: shape units convenient for rendering
  • Poor correspondence with sound units
  • Many-to-Many mappings
    • Glyph selection, scaling, positioning
  • No Glyph Encoding Standard

From Character to Font
  • Must be provably complete and 100% consistent
  • Current systems are all ad-hoc – neither complete nor consistent
  • Finite State Transducers:
    • Necessary and Sufficient
    • Without restricting Creativity and Flexibility
    • Simple, Efficient, Re-Usable
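
A toy example of a finite state transducer for character-to-glyph ordering: in Devanagari the vowel sign I (U+093F) is stored after its consonant but drawn to its left. The sketch below handles only that one reordering for a single consonant; the name `to_visual` is hypothetical, conjuncts are ignored, and real rendering engines do considerably more (glyph selection, scaling, positioning).

```python
VOWEL_SIGN_I = "\u093f"  # stored after the consonant, drawn before it

def is_consonant(ch):
    return 0x0915 <= ord(ch) <= 0x0939

def to_visual(text):
    """Map logical (stored) order to visual (drawn) order.

    The transducer state is `held`: either None (idle) or one consonant
    whose output is delayed until the next input symbol is seen.
    """
    out = []
    held = None
    for ch in text:
        if held is not None and ch == VOWEL_SIGN_I:
            out += [ch, held]  # emit the sign first, then the consonant
            held = None
            continue
        if held is not None:
            out.append(held)   # no reordering needed: flush the consonant
            held = None
        if is_consonant(ch):
            held = ch          # delay output by one symbol
        else:
            out.append(ch)
    if held is not None:
        out.append(held)
    return "".join(out)
```

Since the state is either idle or one of a fixed set of consonants, the machine is finite, and the mapping can be checked exhaustively rather than by ad-hoc testing.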

Encoding Standards: Unicode
  • For Language/Script/SuperScript?
    • CJK. Why not for ILs?
  • Script Grammar?
  • Character-to-Font:
    • relegated to font level
    • font effects
  • ISCII-88 Based, Has Errors
    • Once added, cannot be deleted!

ISCII or Unicode?
  • Unicode:
    • To be with the World, to know and be known
    • ‘Correcting’ Mistakes, Improving Standards
    • Support (OS, Fonts, etc.), Education, Training
    • Converting Legacy Data – A Huge Task
      • ISCII-to-Unicode is not trivial
    • Ignore BIS Standard and embrace what is not yet ‘standardized’?
  • Why not co-exist? – Internal and External Views

Keyboard Layouts, Drivers
  • Several de-facto standards and many variations in use
    • To select a few and standardize
  • So called Roman Phonetic Typing
    • ILs through English!
    • OK for oldies, not for future!
  • INSCRIPT: ISCII Standard, Good for newcomers
  • To strictly enforce Script Grammar

Document Encoding Standards
  • Plain Text: pure ISCII/UNICODE
    • Mono-lingual Plain Text?
  • Annotated Text (Ex. Word Processors)
    • XML Style, Open, Readable formats to be encouraged
    • Proprietary, secret, non-standard encodings must be discouraged

Transliteration
  • Widely used, part of our Tradition
    • Sanskrit texts in local scripts
    • English, Hindi, Urdu words in local scripts
    • Music Compositions
  • Automatic in ISCII. Unicode?
    • Quality of transliteration
  • To and From English?
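
One way to explore the "Unicode?" question: the Unicode blocks for the major Indian scripts follow the ISCII ordering, so corresponding characters sit at the same offset within each 128-code block. The sketch below transliterates by shifting code points between blocks. The function name and the small block table are assumptions for illustration, and the approach is only approximate, since some positions are unassigned in some scripts and would need per-script exception tables.

```python
# Start of each script's 128-code block in Unicode.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
}
BLOCK_SIZE = 0x80

def transliterate(text, src, dst):
    """Shift characters of the `src` block to the same offset in `dst`.

    Characters outside the source block (spaces, digits, English) are
    passed through unchanged.
    """
    lo = BLOCK_START[src]
    shift = BLOCK_START[dst] - lo
    return "".join(
        chr(ord(ch) + shift) if lo <= ord(ch) < lo + BLOCK_SIZE else ch
        for ch in text
    )
```

For instance, Devanagari KA (U+0915) maps to Telugu KA (U+0C15) and Kannada KA (U+0C95). The quality of the result still depends on how each script treats the shifted character, which is the concern raised in the slide.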

Romanization
  • Need:
    • Where there is no support for local languages
      • English dailies, posters, advertisements etc.
      • Lack of support: OS/Browser/Fonts etc.
    • Where users prefer Roman
  • A variety of ad-hoc schemes in use
    • iTRANS, RTS, W-X, etc.
  • Standards badly wanted

Romanization
  • Multi-dimensional optimization problem
    • Case Mix-up
      • 26 Letters not sufficient
      • 52 nearly sufficient
      • Not always supported
    • Storage space, Ease of Typing, Aesthetics
    • Scientific/Logical Design/Naturalness
      • English-like – for the oldies: a, ee, oo, a, oa ???
      • Futuristic: aa/ii/uu/ee/oo

Romanization
  • Clashes: a+u/au, k+h/kh, s’
    • Two way conversion, cyclic check
  • Ex. Long Vowels:
    • a: – clashes with colon
    • diacritic – not supported
    • IPA – not understood, – not supported
    • A: + single char., + saves space, – ugly, – difficult to type, – case mix-up
    • aa: + logical (like ee), + easy to type

Romanization: An Example
  • a aa i ii u uu R RR e ee ai o oo au M H
  • k kh g gh n~
  • c ch j jh n`
  • T TH D DH N
  • t th d dh n
  • p ph b bh m
  • y r l v s’ S s h L
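
The clash problem and the "cyclic check" can be tried out on the scheme listed above. The sketch below tokenizes a Roman string by longest match and tests whether a symbol sequence survives being joined and split again. The function names are hypothetical, the longest-match rule is an assumption (the slides do not specify a parsing rule), and `s'` is written with a plain ASCII apostrophe.

```python
# Symbols of the example scheme, in the order given on the slide.
SYMBOLS = (
    "a aa i ii u uu R RR e ee ai o oo au M H "
    "k kh g gh n~ c ch j jh n` T TH D DH N "
    "t th d dh n p ph b bh m y r l v s' S s h L"
).split()

def tokenize(text):
    """Split a Roman string into scheme symbols, longest match first."""
    by_len = sorted(SYMBOLS, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        for sym in by_len:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            raise ValueError(f"no symbol matches at position {i}")
    return tokens

def round_trips(symbols):
    """Cyclic check: join the symbols, tokenize, compare with the input."""
    return tokenize("".join(symbols)) == list(symbols)
```

Under this rule `round_trips(["k", "a", "m", "a", "l", "a"])` holds, but `["a", "u"]` and `["k", "h"]` do not, because they are read back as `au` and `kh`. A standard would need either a separator convention or a symbol set that avoids such prefixes.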

Translation
  • Create Material Afresh
  • Translate by Hand
  • Automatic/Machine Translation
  • Machine Aided Translation
  • English – Local Language Translation
  • Local – Local Language Translation

Translation
  • Resource Intensive
    • Manpower, Time, Cost
  • Quality/Uniformity
    • Standards, Bench-Mark Data, Testing and Evaluation Procedures
  • Dictionaries, Terminology Databases
    • Pan-Indian Terms/Sanskritize/Localize

Linguistic Resources
  • Dictionaries – General, Domain Specific
  • Terminological Databases
  • Thesauri, WordNets, Ontologies
  • Morphological Analyzers, Generators
  • Spell/Grammar/Style Checkers
  • Annotated Text and Speech Corpora

India: Future is in Speech
  • One Billion People, A Sixth of the World
  • More than 150 Languages, 22 Recognized
  • 95 % not comfortable with English
  • Computers, Current, Connectivity
  • Info Revolution benefits: Majority Deprived
  • 10 M Computers, 100 M Phones
  • Future is in Speech

Speech
  • Natural
  • Easy, Fast
  • Hands-Free
  • No need to Learn
    • Technology
    • Language
  • Available to all

Text and Speech
  • Speech is Natural
  • Reading/Writing is learnt, Artificial
  • Some never learn – Illiterates
  • Oral Tradition
  • Speech is more permanent than Text!
  • “I did not steal that ring of gold”
  • Trust Yourself!

Speech Technologies
  • Speech Recognition: Speech to Text
  • Speech Synthesis: Text to Speech
  • Speaker Recognition, Verification, ID
  • Speech Coding/Decoding, Compression
  • Slow down, Speed up
  • Speech as Evidence

Applications
  • Telephone Dialing
  • Form Filling
  • Dictation Machine
  • Command and Control
  • Voice enabled Web
  • OCR+WP+TTS
  • MT: Cross-Lingual IR, S2S

OCR
  • OCR in Local Scripts Needed
    • To digitize and save legacy data
    • To compile/process/edit/refine data
  • For Printed Texts/Manuscripts
  • Old Data
    • deterioration of paper
    • old type fonts, problems of type-setting

Multi-Modal Interfaces
  • To Reach out to 1 Billion People, we must get the best of many worlds:
    • Speech Recognition and Synthesis
    • Graphics and iconic Interfaces
    • OCR Technologies
    • Translation, CLIR
    • Camera, Gestures, Touch Screen

Balance
  • Between Backward Compatibility and Future-Proof Designs
  • Quick Fix Solutions and Long Haul
  • One Standard or Several?
  • Economics and Business Sense versus Social Responsibilities
  • Acceptance versus Enforcement

The 3 Most Important Things

1. Develop/Refine/Update Standards

    • Detailed Documentation
    • Including rationale, issues, evaluation, etc.

2. Education and Training

3. Enforcement

    • Make use of non-standard methods illegal and punishable under law
  • Technical Workshops for detailing


Thank You!

Visit

www.LanguageTechnologies.ac.in
