GH-MAP: Rule Based Token Mapping For Translation between Sibling Language Pair: Gujarati-Hindi

  1. GH-MAP: Rule Based Token Mapping For Translation between Sibling Language Pair: Gujarati-Hindi • Kalyani Patel, K.S. School of Business Management, Gujarat University. patel_kalyani_05@yahoo.co.in • Dr. Jyoti Pareek, Department of Computer Science, Gujarat University. drjyotipareek@yahoo.com

  2. Contents • Introduction • Hindi-Gujarati : A comparative study • The Rule Base • Translation • Token Mapping Engine • Algorithm • Example • Evaluation • Conclusion ICON 2009

  3. Introduction • GH-MAP is designed for one particular language pair, to take advantage of the similarity between the sibling languages Gujarati and Hindi. • It uses rule-based token mapping for effective word-to-word translation. • GH-MAP can be utilized for MT, CLIR, GurjerNet, and multilingual dictionaries.

  5. Hindi-Gujarati: A comparative study • Both belong to the Indo-Aryan family (Hindi, Bangla, Assamese, Punjabi, Marathi, Oriya and Gujarati). • Because they belong to the same group, there is a high degree of structural similarity. • Hindi and Gujarati have: • bijectively mappable characters (Varna Maala), excluding ळ; • relatively free word order, where the noun groups can come in any order, generally followed by the verb group.
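The character-level correspondence described on this slide can be illustrated with a minimal sketch. It assumes that pairing characters by their Unicode names (for example GUJARATI LETTER KA with DEVANAGARI LETTER KA) is an acceptable stand-in for the Varna Maala mapping; that pairing, the function names, and the `EXCLUDED` set are assumptions made for illustration, not the table GH-MAP actually uses.

```python
import unicodedata

# The slide excludes the counterpart of Gujarati LLA (U+0AB3) from the mapping,
# so this sketch leaves that character untouched.
EXCLUDED = {"\u0ab3"}


def build_char_table():
    """Pair each Gujarati character with the Devanagari character of the same name."""
    table = {}
    for code_point in range(0x0A80, 0x0B00):  # Unicode Gujarati block
        ch = chr(code_point)
        name = unicodedata.name(ch, "")  # "" for unassigned code points
        if not name.startswith("GUJARATI ") or ch in EXCLUDED:
            continue
        try:
            table[ch] = unicodedata.lookup(name.replace("GUJARATI", "DEVANAGARI", 1))
        except KeyError:
            # No Devanagari character with the same name: leave it unmapped.
            pass
    return table


CHAR_TABLE = build_char_table()


def transliterate(text):
    """Replace every mapped Gujarati character; pass everything else through."""
    return "".join(CHAR_TABLE.get(ch, ch) for ch in text)
```

With this table, `transliterate("પુસ્તક")` yields the Devanagari spelling of the same word, character by character, and no two Gujarati characters share a Devanagari target.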

  6. Continued • Nouns in Hindi and Gujarati are inflected for case (direct or oblique), number (singular or plural), and gender (masculine or feminine). In addition, Gujarati also has a common gender. • Verbs in both languages are inflected for gender, number, person, tense, aspect, modality, formality, and voice.

  7. Continued • Many words in the two languages have a shared origin (Sanskrit), and because of the shared culture they usually also share meaning, e.g. (book) ‘પુસ્તક’/‘puswaka’ in Gujarati is similar to ‘पुस्तक’/‘puswaka’ in Hindi. • A sentence in one language can be mapped to a sentence in the other by substituting each word group in the source language with the appropriate word group in the target language.

  9. The Rule Base • Rule base for translation: • Domain-specific monolingual data: stores typologically different words and their relations. • Domain-independent bilingual data: stores cases, pronouns, adjectives, adverbs, etc. • Substring substitution rules: store the Hindi substring corresponding to each Gujarati substring, together with the location of the substring. • Stem-suffix rules: store bilingual stem and suffix rules. • Phrases: store bilingual compound words.

  11. Translation • GH-MAP translates a text in the source language into a text in the target language, retaining a flavor of the source language. • GH-MAP uses the Token Mapping Engine for translation. • [Flowchart: START, sentences in language 1, phrase check (Yes/No), tokenize the sentence, Token Mapping Engine, language 2 tokens, sentences in language 2, STOP]

  13. Token Mapping Engine • The Token Mapping Engine uses the Rule Base to find the match for a given token in the target language.

  15. TME Algorithm • (1) For each SL (source language) word (token), apply the following steps in order. • (2) Search for the word in the table of pronouns, cases, and adjectives. If a match is found, take the TL (target language) word from the same table and go to step 7. • (3) Search for the word in the table of domain-specific words. If a match is found, take the corresponding TL word from the table of TL domain-specific words and go to step 7. • (4) Remove the suffix and search for the stem in the stem table. If a match is found, take the TL stem and the corresponding TL suffix, generate the TL word, and go to step 7. • (5) Search repeatedly for substrings (affixes) in the SL word. Wherever a match is found, substitute the SL substring with the corresponding TL substring, then go to step 6. • (6) Transliterate the remaining untranslated characters into TL characters. • (7) Move on to the next token.
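The lookup cascade above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the table names and layout are hypothetical, the handful of entries are the ones used in the deck's "lotus" example, whitespace tokenization is assumed, and transliteration is approximated by the fixed offset between the Unicode Gujarati and Devanagari blocks.

```python
# Assumption: the Gujarati block (U+0A80..U+0AFF) mirrors the Devanagari block
# (U+0900..U+097F) closely enough that a fixed offset works for common letters.
BLOCK_OFFSET = 0x0A80 - 0x0900

# Toy rule base (hypothetical layout; entries taken from the deck's example).
FUNCTION_WORDS = {"નું": "का"}   # pronouns, cases, adjectives
DOMAIN_WORDS = {}                # domain-specific words (empty in this sketch)
STEMS = {"ખીલ": "खिल"}           # bilingual stems
SUFFIXES = {"વું": "ना"}          # bilingual suffixes
SUBSTRINGS = {"ળ": "ल"}          # substring substitution rules


def transliterate(text):
    """Step 6: shift any remaining Gujarati character into the Devanagari block."""
    return "".join(
        chr(ord(ch) - BLOCK_OFFSET) if "\u0a80" <= ch <= "\u0aff" else ch
        for ch in text
    )


def map_token(token):
    """Map one Gujarati token to Hindi by trying the rule tables in order."""
    if token in FUNCTION_WORDS:            # step 2
        return FUNCTION_WORDS[token]
    if token in DOMAIN_WORDS:              # step 3
        return DOMAIN_WORDS[token]
    for suffix, tl_suffix in SUFFIXES.items():   # step 4
        if token.endswith(suffix):
            stem = token[: -len(suffix)]
            if stem in STEMS:
                return STEMS[stem] + tl_suffix
    for sl_sub, tl_sub in SUBSTRINGS.items():    # step 5
        token = token.replace(sl_sub, tl_sub)
    return transliterate(token)            # step 6


def translate(sentence):
    """Steps 1 and 7: loop over whitespace-separated tokens."""
    return " ".join(map_token(token) for token in sentence.split())
```

For instance, `translate("કમળ નું ખીલવું")` exercises three different branches: the substring rule plus transliteration, the function-word table, and the stem-suffix rule. A real rule base would also need the substring location that the Rule Base slide mentions, which this sketch ignores.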

  17. Example (‘The lotus blossoms’ (E)) • Tokenize the sentence: ‘કમળ’/kamalYa + ‘નું’/nuM + ‘ખીલવું’/KIlavuM. • The tokens are given to the Token Mapping Engine. • The first token ‘કમળ’/kamalYa is translated by substituting the substring ‘ળ’/lYa with ‘ल’/la; the remaining Gujarati characters ‘કમ’/kama are transliterated to ‘कम’/kama, generating ‘कमल’/kamala.

  18. Continued • The second token ‘નું’/nuM is translated to ‘का’/kA using the case (Karaka) table. • The third token ‘ખીલવું’/KIlavuM is translated by first removing the suffix ‘વું’/vuM to obtain the stem ‘ખીલ’/KIla. The stem is looked up in the stem table to obtain the corresponding Hindi stem ‘खिल’/Kila, and the target-language suffix corresponding to ‘વું’/vuM, i.e. ‘ना’/nA, is appended to generate ‘खिलना’/KilanA.

  20. Contribution of various approaches in translation • [Chart not included in the transcript]

  21. Evaluation • We conclude that, for the given test bed, GH-MAP produced about 88% correct translations.

  23. Conclusion • To the best of our knowledge, this is the first attempt at rule-based token mapping for the sibling language pair Hindi-Gujarati. • In this model, only lexical analysis is carried out. • It requires only limited linguistic effort and tools to achieve this goal. • The test results for a small set of data are encouraging. • GH-MAP has some limitations, which need to be addressed.

  24. Limitations • Karaka: ‘का’/kA (H) can map to નો/no, ની/nI, નું/nuM, or ના/nA (G) [of (E)]. • Pronoun: ‘उसे’/use (H) can map to તેનો/weno, તેની/wenI, or તેને/wene (G) [He/She/It (E)]. • Adjective: ‘नया’/nayA (H) can map to નવું/navuM or નવા/navA (G) [New (E)]. • Work is in progress towards overcoming these limitations. • With further enhancement of the rule base, GH-MAP is expected to yield better results.
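A small sketch may make the nature of these limitations concrete: when one Hindi form corresponds to several Gujarati forms, a plain lookup table can list the candidates but cannot choose among them. The table layout and function names below are hypothetical; the entries are copied from the slide.

```python
# Hindi -> Gujarati candidates copied from the slide (hypothetical table layout).
HINDI_TO_GUJARATI = {
    "का": ("નો", "ની", "નું", "ના"),   # karaka "of"
    "नया": ("નવું", "નવા"),            # adjective "new"
}


def candidates(hindi_word):
    """Return every Gujarati form listed for a Hindi word (empty if unknown)."""
    return HINDI_TO_GUJARATI.get(hindi_word, ())


def is_ambiguous(hindi_word):
    """True when token-level lookup alone cannot pick a single Gujarati form."""
    return len(candidates(hindi_word)) > 1
```

Choosing among the candidates would need information beyond the token itself, which is consistent with the conclusion that only lexical analysis is carried out in the current model.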

  25. Thank You