
Linguistic Enrichment of Statistical Transliteration

लिंगुइस्टिक एनरिच्मेंट ऑफ़ स्टटिस्टिकल ट्रांसलिटरेशन

MTP Final Stage Presentation

Guided by: Prof. Pushpak Bhattacharyya

Presented by: Abhijeet Padhye (06305902)

Department of Computer Science & Engineering

IIT Bombay


Presentation Pathway

  • Problem Statement

  • Motivation

  • What is Transliteration?

  • Syllables and their Structure

  • Sonority Theory

  • Concept of Schwa

  • Proposed Transliteration Model

  • Experiments and Results

  • Discussions

  • Conclusion and Future Work

  • References


Problem Statement

To exploit the phonological similarities between the Roman and Devanagari scripts in order to linguistically aid the process of statistical transliteration.


Motivation

  • An important component of Machine Translation

  • When you cannot translate, transliterate.

  • Critical for handling out-of-vocabulary (OOV) words and proper nouns

  • Proves crucial when translating named entities for cross-lingual information retrieval (CLIR)

  • Transliteration is a phonetic translation process, so it is apt to exploit phonetic and phonological properties


What is Transliteration?

  • A process of phonetically translating words, such as named entities or technical terms, from the source-language alphabet to the target-language alphabet.

  • Examples:

    • Gandhiji – गाँधीजी

    • OOV words like नमस्कार – Namaskar



Humans translate/transliterate frequently for different reasons

An example of how transliteration comes to the rescue when no translation exists


Overview of Transliteration

Fig : Source Word → Transliteration Units → Transliteration Units → Target Word, where the transliteration units can be character n-grams or syllables


Basics of Syllables

“Syllable is a unit of spoken language consisting of a single uninterrupted sound formed generally by a Vowel and preceded or followed by one or more consonants.”

  • Vowels are the heart of a syllable (the most sonorous element)

  • Consonants act as sounds attached to vowels.


Syllable Structure

  • Simple syllables – Baba (Ba + ba), दादा (दा + दा)

  • Complex syllables – Andrew (An + drew)

  • Alert! The basic structure doesn’t suffice: is “An” VC? Is “drew” CVC?


Possible syllable structures

  • The Nucleus is always present

  • Onset and Coda may be absent

  • Possible structures

    • V

    • CV

    • VC

    • CVC
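
The four structures above can be sketched as an onset, nucleus and coda split. This is only an illustration: it assumes that the letters a, e, i, o, u stand in for vowel sounds, and `split_syllable` and `structure` are hypothetical helper names, not part of the presented system.

```python
import re

# Assumption: orthographic vowels a, e, i, o, u approximate vowel sounds.
_SYLLABLE = re.compile(r"^([^aeiou]*)([aeiou]+)([^aeiou]*)$")


def split_syllable(syllable):
    """Return (onset, nucleus, coda); onset and coda may be empty strings."""
    match = _SYLLABLE.match(syllable.lower())
    if match is None:
        raise ValueError(f"no single vowel nucleus found in {syllable!r}")
    return match.groups()


def structure(syllable):
    """Name the structure (V, CV, VC or CVC), one C per non-empty margin."""
    onset, _nucleus, coda = split_syllable(syllable)
    return ("C" if onset else "") + "V" + ("C" if coda else "")


for s in ["a", "ba", "an", "drew"]:
    print(s, split_syllable(s), structure(s))
```

Here each C stands for a whole onset or coda, so a consonant cluster such as the "dr" of "drew" occupies a single C slot.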


Introduction to sonority theory

“The Sonority of a sound is its loudness relative to other sounds with the same length, stress and pitch.”

  • Some sounds are more sonorous

  • Words in a language can be divided into syllables

  • Sonority theory distinguishes syllables on the basis of sounds.


Sonority Hierarchy

  • Obstruents can be further classified into:

    • Fricatives

    • Affricates

    • Stops


Sonority sequencing principle

“The Sonority Profile of a syllable must rise until its Peak (Nucleus), and then fall.”

Fig : Sonority profile rising from the Onset to the Peak (Nucleus) and falling to the Coda


Example

  • ABHIJEET

    • Sonority Profile 1

    • Sonority Profile 2

Fig : In both profiles the vowels A, I, E, E form the peaks, H and J sit at an intermediate level, and B and T are the least sonorous
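
A rough sketch of how the sonority sequencing principle can be checked for a candidate syllable. The numeric scale below is an assumption made for illustration (a coarse, letter-level hierarchy: stops < affricates < fricatives < nasals < liquids < glides < vowels); it is not the hierarchy used in the presented system, and `sonority_profile` and `obeys_ssp` are hypothetical names.

```python
# Assumed letter-level sonority scale, lowest to highest.
_LEVELS = [
    ("bdgkpqt", 1),  # stops
    ("cj", 2),       # affricates (treating c as in "ch")
    ("fhsvxz", 3),   # fricatives
    ("mn", 4),       # nasals
    ("lr", 5),       # liquids
    ("wy", 6),       # glides
    ("aeiou", 7),    # vowels
]
SONORITY = {ch: level for letters, level in _LEVELS for ch in letters}


def sonority_profile(syllable):
    """Map every letter of the syllable to its assumed sonority level."""
    return [SONORITY[ch] for ch in syllable.lower()]


def obeys_ssp(syllable):
    """True if sonority strictly rises to one peak (a plateau of equal
    maxima is allowed, e.g. a double vowel) and then strictly falls."""
    profile = sonority_profile(syllable)
    peak = max(profile)
    first = profile.index(peak)
    last = len(profile) - 1 - profile[::-1].index(peak)
    rising, falling = profile[: first + 1], profile[last:]
    return (
        all(a < b for a, b in zip(rising, rising[1:]))
        and all(v == peak for v in profile[first : last + 1])
        and all(a > b for a, b in zip(falling, falling[1:]))
    )


for candidate in (["a", "bhi", "jeet"], ["abh", "i", "jeet"]):
    print(candidate, all(obeys_ssp(s) for s in candidate))
```

Under this assumed scale the split a + bhi + jeet satisfies the principle in every syllable, while abh + i + jeet does not, because sonority rises again from B to H after the peak of the first syllable.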


The concept of schwa

  • First letter of IAL – {a}

  • Unstressed and toneless neutral vowel

  • Some schwas are deleted and some are not

  • Schwa deletion is an important issue for grapheme-to-phoneme conversion

  • Handled using a well-established schwa deletion algorithm

  • Example:

    • Priyatama – the last “a” changes the gender: प्रियतम vs. प्रियतमा


Proposed Transliteration Model

Fig : Proposed model. Training: Source Language Words and Target Language Words → Syllabification Modules → Source Language Syllables and Target Language Syllables → Moses Training → Phrase translation tables, with the Target Language Model built using SRILM. Decoding: Source Language Words → Moses Decoder → Transliterated output


Transliteration system workflow

  • Syllabification of a parallel list of names in Roman and Devanagari

  • Using these parallel lists for:

    • Alignment of syllables

    • Training Moses translation toolkit

    • Language model generation using SRILM

  • Decoding using trained phrase-translation tables and language model

  • Comparing results to analyze performance


Experiments and Results

  • Syllabification of Roman and Devanagari words

Fig : Syllabification Algorithm


Transliteration Process

  • Syllabification of list of 10000 parallel names written in Roman and Devanagari and preparing a parallel aligned list of syllables.

  • Training Language Models for target language using SRILM toolkit.

  • Training MOSES with aligned corpus of 7500 names and target language model as input.

  • Testing with a list of 2500 proper names using the trained model for transliteration.
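
The data preparation in the steps above might look roughly like the following. It is a sketch under assumptions: syllables are written as space-separated tokens so that a word-based toolkit treats each syllable as one unit, the split is a simple in-order 75/25 cut, and the function names, file names and example syllabifications are all hypothetical.

```python
from pathlib import Path


def split_corpus(pairs, train_fraction=0.75):
    """Split the parallel list, in order, into training and test parts."""
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]


def to_token_line(syllables):
    """Join syllables with spaces so that each one becomes a token."""
    return " ".join(syllables)


def write_parallel(pairs, src_path, tgt_path):
    """Write line-aligned source and target files, one name per line."""
    src = "\n".join(to_token_line(s) for s, _ in pairs) + "\n"
    tgt = "\n".join(to_token_line(t) for _, t in pairs) + "\n"
    Path(src_path).write_text(src, encoding="utf-8")
    Path(tgt_path).write_text(tgt, encoding="utf-8")


# Hypothetical syllabified pairs: (Roman syllables, Devanagari syllables).
pairs = [
    (["Gan", "dhi", "ji"], ["गाँ", "धी", "जी"]),
    (["Na", "mas", "kar"], ["न", "मस्", "कार"]),
    (["Ba", "ba"], ["बा", "बा"]),
    (["Da", "da"], ["दा", "दा"]),
]

train, test = split_corpus(pairs)
print(len(train), len(test))
print(to_token_line(train[0][0]), "|", to_token_line(train[0][1]))
```

With a list of 10000 names the default `train_fraction` gives the 7500 / 2500 split described above.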


Roman to Devanagari Transliteration

Fig : Result for Roman to Devanagari Transliteration

Fig : Top-n Inclusion results
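
One plausible reading of top-n inclusion is the fraction of test words whose reference transliteration appears among the first n ranked candidates. The sketch below follows that reading; the definition, the function name `top_n_inclusion` and the toy data are assumptions, not taken from the presented experiments.

```python
def top_n_inclusion(ranked_outputs, references, n):
    """Fraction of words whose reference is among the top-n candidates.

    ranked_outputs: one best-first list of candidates per test word.
    references:     the expected transliteration for each test word.
    """
    if len(ranked_outputs) != len(references):
        raise ValueError("need exactly one candidate list per reference")
    if not references:
        return 0.0
    hits = sum(
        1
        for candidates, ref in zip(ranked_outputs, references)
        if ref in candidates[:n]
    )
    return hits / len(references)


# Toy data: the second word is only recovered at rank 2, the third never.
ranked = [
    ["gandhiji", "gandhijee"],
    ["namaskaar", "namaskar"],
    ["baaba"],
]
refs = ["gandhiji", "namaskar", "baba"]

for n in (1, 2):
    print(n, top_n_inclusion(ranked, refs, n))
```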


Devanagari to Roman Transliteration

Fig : Result for Devanagari to Roman Transliteration

Fig : Top-n translation results


Comparison with Character n-gram based model

  • Same experimental setup; transliteration units changed to n-grams

    • Bigrams (Sandeep → Sa, an, nd, de, ee, ep)

    • Trigrams (Sandeep → San, and, nde, dee, eep)

    • Quadrigrams (Sandeep → Sand, ande, ndee, deep)

  • Observations suggest a performance improvement when syllables are used as transliteration units

  • n-gram based models are blind to phonological properties such as unstressed vowels

Fig : Comparison with N-gram based model
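
The n-gram units listed above are overlapping character windows. A minimal sketch (the function name `char_ngrams` is hypothetical):

```python
def char_ngrams(word, n):
    """Return all overlapping character n-grams of the word, left to right."""
    return [word[i : i + n] for i in range(len(word) - n + 1)]


for n in (2, 3, 4):
    print(n, char_ngrams("Sandeep", n))
```

Unlike syllables, these windows are cut at fixed widths, so they carry no information about where a nucleus, onset or coda begins.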


Comparison with State-of-the-art Systems

  • Google transliteration engine and Quillpad used as benchmarks for comparison

  • A list of 1000 words written in the Roman alphabet used as test input

  • Our system outperforms Quillpad and falls just short of Google’s results.

  • More intensive training with a larger training set might improve system performance.

Fig : Comparison with State-of-the-art transliteration systems


Discussions

  • Accents

    • थोड़ा: Thoda or Thora?

  • Mapping of sounds

    • Mahaan – महान; Kahaan – कहाँ

  • Silent letters

    • Psychiatrist – सायकेट्रिस्ट


Discussions (contd.)

  • Improper Schwa deletion

    • Venkatachalam – वेंकटचलम: वें + कट + च + लम or वेंक + टच + लम

  • Improper placement (Onset or Coda)

    • सिराजउद्दीन – सि राज उद् दिन or सि रा जउद् दिन

  • Similar phonological structure but different pronunciation

    • सोमलता (सोम + लता) and कोमलता (को + मल + ता)


Conclusion and Future Work

  • Transliteration can prove critical in supporting Machine Translation

  • Phonologically aware transliteration units such as syllables show strong signs of performance improvement

  • Syllable-based transliteration performs comparably to state-of-the-art systems.

  • Syllabification algorithms need further improvement

  • The developed system should be trained on a larger and more accurate training set.

  • Some of the linguistic issues discussed above remain very challenging cases for future work on transliteration


References

  • Pirkola A., Toivonen J., Keskustalo H., Visala K., Jarvelin K. 2003. Fuzzy Translation of Cross-Lingual Spelling Variants. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

  • Gao W., Lam W., and Wang K. 2004. Phoneme-based Transliteration of Foreign Names for OOV Problem. International Joint Conference on Natural Language Processing.

  • Fujimura O. 1975. Syllable as a Unit of Speech Recognition. IEEE Transactions on Acoustics, Speech and Signal Processing.

  • Koehn P. et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL), Demonstration Session, Prague, Czech Republic.

  • Laver J. 1994. Principles of Phonetics. Cambridge University Press. Pg. 114.

  • Knight K. and Graehl J. 1997. Machine Transliteration. Proceedings of ACL 1997. Pg. 128-135.

  • Stolcke A. 2002. SRILM – An Extensible Language Modeling Toolkit. In: Proceedings of the International Conference on Spoken Language Processing.

  • Choudhury M. and Basu A. 2002. A Rule Based Schwa Deletion Algorithm for Hindi. Technical Report. Dept. of Comp. Sci. & Engg., Indian Institute of Technology, Kharagpur.




Complex Syllable structure

Fig : Detailed syllable structure

Fig : Complex syllables fitting in above structure


Sonority theory & syllables

“A Syllable is a cluster of sonority, defined by a sonority peak acting as a structural magnet to the surrounding lower sonority elements.”

  • Represented as waves of sonority or Sonority Profile of that syllable

Fig : Sonority profile with the Nucleus at the peak, flanked by the Onset and the Coda


Sonority Hierarchy for English and Hindi

Fig : Sonority hierarchy for English

Fig : Sonority hierarchy for Hindi


Maximal Onset Principle

“The Intervocalic consonants are maximally assigned to the Onsets of syllables in conformity with Universal and Language-Specific Conditions.”

  • When a word has two valid syllabifications, the one with the maximum onset length is preferred.

  • Example – Diploma

    • Di + plo + ma (preferred: maximal onset “pl”)

    • Dip + lo + ma
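
A minimal sketch of syllabification by the maximal onset principle. It is not the syllabification algorithm of the presented system: vowels are approximated by the letters a, e, i, o, u, and `LEGAL_ONSETS` is a small illustrative subset of permissible onsets chosen for this example.

```python
import re

VOWELS = "aeiou"  # assumption: orthographic vowels approximate nuclei

# Illustrative subset: every single consonant plus a few common clusters.
LEGAL_ONSETS = set("bcdfghjklmnpqrstvwxyz") | {
    "bl", "br", "cl", "cr", "dr", "fl", "fr", "gl", "gr", "pl", "pr",
    "tr", "sl", "sm", "sn", "sp", "st", "sw", "str", "spr", "spl",
}


def split_cluster(cluster, legal_onsets):
    """Split an intervocalic cluster into (coda, onset), longest onset first."""
    for k in range(len(cluster)):
        if cluster[k:] in legal_onsets:
            return cluster[:k], cluster[k:]
    return cluster, ""


def syllabify(word, legal_onsets=LEGAL_ONSETS):
    """Syllabify a word, giving each onset as many consonants as is legal."""
    parts = re.findall(r"[aeiou]+|[^aeiou]+", word.lower())
    syllables, current = [], ""
    for i, part in enumerate(parts):
        if part[0] in VOWELS or i == 0:
            current += part              # nucleus, or word-initial onset
        elif i == len(parts) - 1:
            current += part              # word-final consonants form the coda
        else:
            coda, onset = split_cluster(part, legal_onsets)
            syllables.append(current + coda)
            current = onset
    syllables.append(current)
    return syllables


print(syllabify("Diploma"))
print(syllabify("Andrew"))
```

Because "pl" is in the assumed onset list, the cluster is kept whole and "Diploma" splits as di + plo + ma rather than dip + lo + ma; for "Andrew" the cluster "ndr" is not a legal onset, so only "dr" moves to the second syllable.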


Schwa deletion algorithm

Procedure delete_schwa (DS)

Input : word (String of alphabets)

Output : Input word with some schwas deleted.

  • Mark all the full vowels and consonants followed by vowels other than the inherent schwas (i.e., consonants with Matras) and all the hs in the word as F unless it is explicitly marked as half by use of halant. Mark all the consonants immediately followed by consonants or halants (i.e., consonants of conjugate syllables) as H. Mark all the remaining consonants, which are followed by implicit schwas, as U.

  • If in the word, y is marked as U and preceded by i, I, ri, u or U, mark it F.

  • If y, r, l or v are marked U and preceded by consonants marked H, then mark them F.

  • If a consonant marked U is followed by a full vowel, then mark that consonant as F.

  • While traversing the word from left to right, if a consonant marked U is encountered before any consonant or vowel marked F, then mark that consonant as F.

  • If the last consonant is marked U, mark it H.

  • If any consonant marked U is immediately followed by a consonant marked H, mark it F.

  • While traversing the word from left to right, for every consonant marked U, mark it H if it is preceded by F and followed by F or U, otherwise mark it F.

  • For all consonants marked H, if it is followed by a schwa in the original word, then delete the schwa from the word. The resulting new word is the required output.

    End procedure delete_schwa
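
The procedure above can be sketched in Python for words written directly in Devanagari. This is one possible reading, not the reference implementation: it works on Unicode text (so "followed by a consonant" in the first step reduces to an explicit halant), attaches signs such as anusvara to the preceding letter without using them in the rules, and shows each deleted schwa by adding an explicit halant. The names `Unit`, `parse`, `mark_units` and the constants are hypothetical.

```python
from dataclasses import dataclass

HALANT, NUKTA, HA, YA = "\u094d", "\u093c", "\u0939", "\u092f"
VOWELS = {chr(c) for c in range(0x0904, 0x0915)}      # independent vowels
CONSONANTS = ({chr(c) for c in range(0x0915, 0x093A)}
              | {chr(c) for c in range(0x0958, 0x0960)})
MATRAS = {chr(c) for c in range(0x093E, 0x094D)}      # dependent vowel signs
# i, ii, u, uu, ri: first as matras, then as independent vowels (step 2)
HIGH_VOWELS = set("\u093f\u0940\u0941\u0942\u0943\u0907\u0908\u0909\u090a\u090b")
YRLV = set("\u092f\u0930\u0932\u0935")                # ya, ra, la, va (step 3)


@dataclass
class Unit:
    base: str        # the bare consonant or vowel letter
    text: str        # letter plus any nukta, matra or halant
    mark: str        # "F" (full), "H" (half) or "U" (undecided)
    vowel: str = ""  # explicit vowel carried by this unit, if any
    tail: str = ""   # other signs (e.g. anusvara), kept only for output


def parse(word):
    """Step 1: split the word into units and assign the initial marks."""
    units, i = [], 0
    while i < len(word):
        ch = word[i]
        i += 1
        if ch in VOWELS:
            units.append(Unit(ch, ch, "F", vowel=ch))
        elif ch in CONSONANTS:
            text, mark, vowel = ch, ("F" if ch == HA else "U"), ""
            if i < len(word) and word[i] == NUKTA:
                text += word[i]
                i += 1
            if i < len(word) and word[i] == HALANT:
                text, mark = text + HALANT, "H"
                i += 1
            elif i < len(word) and word[i] in MATRAS:
                text, mark, vowel = text + word[i], "F", word[i]
                i += 1
            units.append(Unit(ch, text, mark, vowel))
        elif units:
            units[-1].tail += ch     # anusvara and similar signs
    return units


def mark_units(word):
    """Apply steps 2 to 8 and return the units with their final marks."""
    units = parse(word)
    n = len(units)
    for k in range(1, n):
        u, prev = units[k], units[k - 1]
        if u.mark == "U" and u.base == YA and prev.vowel in HIGH_VOWELS:
            u.mark = "F"                                   # step 2
        elif u.mark == "U" and u.base in YRLV and prev.mark == "H":
            u.mark = "F"                                   # step 3
    for k in range(n - 1):
        if units[k].mark == "U" and units[k + 1].base in VOWELS:
            units[k].mark = "F"                            # step 4
    for u in units:                                        # step 5
        if u.mark != "H":
            u.mark = "F"
            break
    if n and units[-1].mark == "U":
        units[-1].mark = "H"                               # step 6
    for k in range(n - 1):
        if units[k].mark == "U" and units[k + 1].mark == "H":
            units[k].mark = "F"                            # step 7
    for k in range(n):                                     # step 8
        if units[k].mark == "U":
            after_f = k > 0 and units[k - 1].mark == "F"
            before_fu = k + 1 < n and units[k + 1].mark in ("F", "U")
            units[k].mark = "H" if after_f and before_fu else "F"
    return units


def delete_schwa(word):
    """Step 9: show every deleted inherent schwa as an explicit halant."""
    out = []
    for u in mark_units(word):
        deleted = u.mark == "H" and not u.text.endswith(HALANT)
        out.append(u.text + (HALANT if deleted else "") + u.tail)
    return "".join(out)


print(delete_schwa("कमल"))
print("".join(u.mark for u in mark_units("वेंकटचलम")))
```

For कमल only the final schwa is marked for deletion. For वेंकटचलम this reading marks the consonants F H F H F H, that is, वेंक + टच + लम.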


Example of Schwa deletion

Fig : Application of Schwa deletion Algorithm


Examples

  • Correct Transliterations

  • Incorrect Transliterations