
CURRENT STATUS OF ILMT


alina

Presentation Transcript


  1. CURRENT STATUS OF ILMT: A perspective of translation from Marathi to Hindi

  2. Architecture

  3. Example Flow

  4. Modules Extensively Improved
     • Morph Analyzer
     • Lexical Transfer

  5. Marathi MA changes
     • The morph analyzer (MA) was modified to resolve the issues found while testing its output.
     • The resources were updated by adding 16,000 new roots to the lexicon and by creating several new SRRs.
     • The lexicon now covers all the words in the Marathi wordnet.
     • The TAM labels were revised.
     • Methods were developed for handling Taddhitas (i.e. words derived from nouns, adjectives and adverbs) and compounds, but they are not yet integrated into the ILMT pipeline.
     • Current accuracy is 95% on ILMT data.
     • The stand-alone morphological analyzer also reports the derivational process.

  6. Marathi Compounding
     • In linguistics, a compound word is a lexeme that consists of more than one stem. Compounds are a kind of MWE.
     • Their properties are easier to predict than those of MWEs in general.
     • Example: मामामामी {mamamami} {uncle-aunt (maternal)} (a noun).
     • Marathi compounds mostly have two stems; three-stem cases are rare.
     • भाऊबहीण {bhaubahin} {brother-sister} has a Hindi equivalent भाई-बेहेन {bhai-behen}.
     • Individual components are directly translated.
     • This is an advantage for close languages like Marathi and Hindi.
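
The idea that the components of a compound are translated individually can be sketched as below. This is only a minimal illustration, not the project's method: the function name `translate_compound`, the toy dictionary, and the hyphen joiner are assumptions, and the sketch ignores inflection and suffixes entirely. The two dictionary entries are the component pairs implied by the slide's own example.

```python
from typing import Optional

# Toy component dictionary (assumption): only the two pairs implied by the
# slide's example भाऊबहीण -> भाई-बेहेन. The real ILMT dictionary is not shown.
COMPONENT_DICT = {
    "भाऊ": "भाई",
    "बहीण": "बेहेन",
}


def translate_compound(word: str, dictionary: dict[str, str],
                       joiner: str = "-") -> Optional[str]:
    """Try every two-way split of `word`.

    Returns the joined translations of the first split whose two halves are
    both dictionary entries, or None if no such split exists.
    """
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in dictionary and right in dictionary:
            return dictionary[left] + joiner + dictionary[right]
    return None


print(translate_compound("भाऊबहीण", COMPONENT_DICT))
```

A real system would first strip inflectional suffixes from the second component, as the problem definition on the next slide requires.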

  7. Problem Definition
     • Given a word containing two components (and hence two roots) a and b, inflected and appended with suffixes, identify each component and provide its linguistic information and the category of the compound word:
        • Field 1: <input word>
        • Field 2: <‘root-word-1a,CGNPTAM-1a’&‘root-word-1b,CGNPTAM-1b,suffix-1b’&fincat=lexcat1>;…;<‘root-word-na,CGNPTAM-na’&‘root-word-nb,CGNPTAM-nb,suffix-nb’&fincat=lexcatn>
     • fsaf means ‘feature structure in abbreviated form’.
     • CGNPTAM means ‘grammatical category, gender, number, person, tense, aspect and modality’.
     • fincat: grammatical category of the resultant word.
     • If no features are available, give only the root words with a short description.
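
To make the Field 2 layout concrete, the sketch below serializes one or more candidate analyses into that shape. The class and function names are hypothetical, the feature strings are left as opaque placeholders (the slides do not give the internal CGNPTAM syntax), and plain ASCII quotes stand in for the curly quotes on the slide.

```python
from dataclasses import dataclass


@dataclass
class CompoundAnalysis:
    """One candidate analysis of a two-component compound (hypothetical type)."""
    root_a: str    # root of the first component
    feats_a: str   # CGNPTAM features of the first component (opaque string)
    root_b: str    # root of the second component
    feats_b: str   # CGNPTAM features of the second component (opaque string)
    suffix_b: str  # suffix attached to the second component
    fincat: str    # grammatical category of the resulting compound

    def to_field(self) -> str:
        """Serialize as <'root-a,feats-a'&'root-b,feats-b,suffix-b'&fincat=cat>."""
        return (f"<'{self.root_a},{self.feats_a}'"
                f"&'{self.root_b},{self.feats_b},{self.suffix_b}'"
                f"&fincat={self.fincat}>")


def format_field2(analyses: list[CompoundAnalysis]) -> str:
    """Join alternative analyses with ';' as in the Field 2 template."""
    return ";".join(a.to_field() for a in analyses)


# Placeholder values only; "n" for noun is an assumed category label.
example = CompoundAnalysis("mama", "CGNPTAM-1a", "mami", "CGNPTAM-1b", "suffix-1b", "n")
print(format_field2([example]))
```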

  8. Taxonomy of Compounds

  9. Results

  10. Marathi Synset Linkage
     • Total number of synsets for which words were cross-linked: 18,000
     • Now reflected in the bilingual dictionary used for lexical transfer
     • Total Marathi synsets: 26,557
     • Total unique words: 36,394
     • Total linked synsets: 23,967

  11. Corpus Statistics

  12. Lexical Transfer Module changes
     • The dictionary currently has 316 Akhyata pairs, 68 Kridanta pairs, and 40 entries for irregular mappings.
     • A number of bugs involving the transfer of the base forms of verbs have been eliminated.
     • Bugs that caused sudden system crashes due to coding errors have been eliminated.
     • The lexical transfer module now selects the first synset in sequence corresponding to the given word.
     • Transfer of ordinals, conjunctions, etc. has also been included.
     • The features of the NER module are now properly utilized for the transliteration of the necessary named entities.
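
The "first synset in sequence" selection can be illustrated with a small lookup. Everything here is an assumption made for illustration: the dictionary layout, the function name `transfer_first_synset`, and the choice of returning the first word inside the chosen synset (the slide only says which synset is selected, not which member).

```python
from typing import Optional

# Hypothetical layout: source word -> list of linked synsets in stored order,
# each synset being a list of target-language words.
BilingualDict = dict[str, list[list[str]]]


def transfer_first_synset(word: str, dictionary: BilingualDict) -> Optional[str]:
    """Return a target word from the first stored synset for `word`.

    Returns None when the word is missing or its first synset is empty.
    """
    synsets = dictionary.get(word)
    if not synsets or not synsets[0]:
        return None
    return synsets[0][0]  # assumption: take the first member of the synset


# Placeholder entries, not real dictionary content.
toy_dict: BilingualDict = {
    "src_word": [["tgt_sense1_a", "tgt_sense1_b"], ["tgt_sense2_a"]],
}
print(transfer_first_synset("src_word", toy_dict))
```

Because the choice depends purely on stored order, its quality depends on how the synsets are ordered in the wordnet.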

  13. Current Status
     • Results by CDAC Pune
     • For Health:
        • Comprehensibility/Adequacy: 81%
        • Fluency: 53%
     • For Tourism:
        • Comprehensibility/Adequacy: 78%
        • Fluency: 52%

  14. Evaluation Method
     • S5: number of sentences with score 5; S4: number of sentences with score 4; S3: number of sentences with score 3; N: total number of sentences.
     • Score = [formula missing from the transcript]
     • Linguists give each sentence a score out of 5 without foreknowledge of its meaning. The score reflects the subjective quality of the sentence.
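
The slide defines S5, S4, S3 and N, but the formula that combines them is not present in the transcript. The sketch below therefore only shows the general shape such a score could take: a weighted count of well-scored sentences, normalized by N and expressed as a percentage. The function name, the percentage scaling, and above all the weights are assumptions; the weights are passed in explicitly so that nothing here is mistaken for the formula actually used.

```python
def weighted_score(s5: int, s4: int, s3: int, n: int,
                   weights: tuple[float, float, float]) -> float:
    """Weighted share of sentences scored 5, 4 and 3, as a percentage of n.

    `weights` = (w5, w4, w3) must be supplied by the caller; the slides do not
    preserve the actual values.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if min(s5, s4, s3) < 0 or s5 + s4 + s3 > n:
        raise ValueError("counts must be non-negative and sum to at most n")
    w5, w4, w3 = weights
    return 100.0 * (w5 * s5 + w4 * s4 + w3 * s3) / n


# Illustrative call with made-up counts and made-up weights.
print(weighted_score(50, 30, 10, 100, weights=(1.0, 0.8, 0.6)))
```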

  15. Examples
     • Extremely Fluent
        • कॉंग्रेसने बापूजींचा "इव्हेंट' केला.
        • Congress made an event about Baapuji
        • कॉंग्रेस ने बापूजी के इव्हेंट किया ।
     • Moderately/Syntactically Fluent
        • आइन्स्टाइन एकदा म्हणाला होता, की नव्या युगातील तरुणाईला बापूजी म्हणजे एखाद्या चमत्कारासारखे वाटतील.
        • Einstein once said that the youth in the new age will feel that Baapuji is like a miracle
        • आइन्स्टाइन कभी कहा है , कि नया युग के तरुणाई को बापूजी अर्थात् एकाध चमत्कारसारखे बाँटएंगे ।
     • Poor Fluency
        • कारण मुळात ती बापूजींची राहिलेली नाही.
        • Because it basically did not remain of Baapuji
        • क्योंकि आदि में वह बापूजी के बचलेली नाह ।

  16. Examples
     • Exact meaning transfer
        • येथे काही जातींची माकडे आणि कांगारू दिसतात.
        • Here we/one (can) see a few species of monkeys and kangaroos
        • यहाँ कुछ प्रकारों के बंदर और कंगारू दिखते हैं ।
     • Medium level meaning transfer
        • कारण ग्रंथ खरेच एक महत्त्वाचे ठिकाण आणि ऊर्जा केंद्रही असते.
        • Because a book truly is an important place and a source of power
        • क्योंकि ग्रंथ सचही एक महत्व के स्थान और ऊर्जा केंद्रभी रहता है ।
     • Complete distortion
        • स्वाभाविकच गांधींचा शोध-पुनर्शोधही चालूच राहिला.
        • Naturally, Gandhiji’s search-research (of self and the world) continued
        • स्वाभाविकही गाँधियों के खोज - पुनर्शोध चलएंगे च राहा ।

  17. Another example of high fluency
     • पहिल्या टप्प्यासाठी पाचऐवजी सहा रुपये, तर जलद बससाठी सात रुपये भाडे प्रस्तावित आहे.
     • Pahilya tappyasathi paachaiwaji saha rupaye, tar jalad bussathi saat rupaye bhaade prastavit ahe.
     • For the first few steps, six instead of five rupees and for fast buses seven rupees have been proposed.
     • पहले फ़ासले के लिए पाँच के बदले छ रुपये , तो द्रुत बस के लिए सात रुपये किराया प्रस्तावित रह ।

  18. DEMO
     • http://www.cfilt.iitb.ac.in/~ilmt/ilmtinterface/admin/login.php
     • Complete replication of the offline dashboard tool

  19. Current pain points
     • Fluency depends on the proper translation of suffixes, case markers and function words.
     • Marathi has two kinds of verb suffixes: Kridantas (non-finite) and Akhyatas (finite).
     • The verb chunk label determines which dictionary is consulted for suffix translation.
     • Poor chunking therefore leads to poor fluency.
     • There are many mistakes in suffix transfer.
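
The dependence of suffix transfer on the chunk label can be sketched as a two-level lookup. The labels `VGF` and `VGNF`, the function name, and the placeholder entries are all assumptions made for illustration; the slides name neither the tagset nor any dictionary entries.

```python
from typing import Optional

# Placeholder suffix dictionaries (assumption): real entries are not in the slides.
AKHYATA_DICT = {"akhyata_suffix_mr": "akhyata_suffix_hi"}     # finite verb suffixes
KRIDANTA_DICT = {"kridanta_suffix_mr": "kridanta_suffix_hi"}  # non-finite verb suffixes

# Assumed chunk labels: VGF = finite verb chunk, VGNF = non-finite verb chunk.
DICT_BY_CHUNK_LABEL = {"VGF": AKHYATA_DICT, "VGNF": KRIDANTA_DICT}


def transfer_suffix(suffix: str, chunk_label: str) -> Optional[str]:
    """Look up `suffix` in the dictionary selected by `chunk_label`.

    Returns None if the label is unknown or the suffix is not in that dictionary.
    """
    table = DICT_BY_CHUNK_LABEL.get(chunk_label)
    if table is None:
        return None
    return table.get(suffix)


# Correct chunk label: the suffix is found.
print(transfer_suffix("kridanta_suffix_mr", "VGNF"))
# Wrong chunk label: the lookup goes to the wrong dictionary and fails.
print(transfer_suffix("kridanta_suffix_mr", "VGF"))
```

The second call shows why a chunking error propagates: the suffix exists in the resources, but the mislabelled chunk routes the lookup to a dictionary that does not contain it.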

  20. Current pain points
     • Synsets in the wordnet are not ordered by first sense.
     • First-sense WSD is therefore not applicable to words that the current WSD engine does not disambiguate.
     • This affects comprehensibility.

  21. Action plan for Lexical Transfer Module
     • Splitting the current transfer module into two parts: one for lexical transfer and the other for grammar transfer.
     • Looking into statistical mechanisms for grammar transfer as well as lexical transfer to improve accuracy.
     • Including mechanisms to handle the double Vibhaktis reported by the Vibhakti Computation module.

  22. Action plan for MA
     • Improving the accuracy of the system further by adding new roots and SRR rules.
     • Revising the FSM rules for Kridantas to eliminate some glaring mistakes.
     • Creating more rules to handle more Taddhitas and compounds, and integrating them into the ILMT pipeline.
     • Using other fields in the morph analyzer’s output, e.g. a flag to indicate the emphatic marker.
     • Updating the morph analyzer to handle the double feature structure of genitive forms.

  23. Other steps
     • Developing a simple parser for Marathi.
     • Improving the chunker.
     • Continuing to link more Marathi synsets and completing the linkage of the current 37,617 Hindi synsets.
     • Evaluating on randomly selected web documents (about 20-40 per week) and improving the outputs immediately.
