Efforts in Language & Speech Technology: Natural Language Processing Lab, Centre for Development of Advanced Computing

Presentation Transcript


  1. Efforts in Language & Speech Technology
     Natural Language Processing Lab, Centre for Development of Advanced Computing (Ministry of Communications & Information Technology)
     ‘Anusandhan Bhawan’, C 56/1 Sector 62, Noida – 201 307, India
     karunesharora@cdacnoida.com
     © C-DAC Noida 2004

  2. Translation Support System
     Technology: Angla Bharati (rule-based), developed by IIT Kanpur
     System developed jointly by IIT Kanpur and C-DAC Noida
     Operating system support: Linux / Windows
     Performance: 85% correct parsing, 60% correct translation
     Embedded text editor, pre-processor and post-editor
     Lexicon: 25,000 root words

  3. Translation Support System (English to Hindi)
     [Block diagram; component labels: Pattern Directed Parsing, English Sentence, Morphological Analyzer, Lexical Dictionary, Corpus, Rule Base, Pseudo Language Output, Post Editor, Hindi Text Generator]

  4. Global Concept for Translation System: BhashaSetu
     [Workflow diagram with numbered steps; labels: Document to Translate, Translator, Reviewer, User, Project, Machine Translation Export, Machine Translation Import, Machine Translation Software, Pre Process Filter to Universal Format, Tokenizer, Source Language Parser, Parsing, Dictionary Setup, Translation Memory Setup, Example Memory Setup, Editor, Dictionary Editor, Translation Memory Editor, Translation Process, Dictionary, Translation Memory, Example Memory, Dictionaries, Translation Memories, Example Memories, Dictionary Merge, Translation Memory Merge, Example Memory Merge, Post Process Filter, Translated Document]
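The diagram above names a Translation Memory component but the transcript does not say how it is queried. As a rough illustration only, a translation memory lookup is commonly done by fuzzy-matching a new source segment against stored segments. The sketch below uses Python's standard `difflib`; the memory contents, the `lookup` function and the 0.75 threshold are all hypothetical and are not taken from BhashaSetu.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source segment -> stored translation.
# The entries are placeholders, not data from BhashaSetu.
TRANSLATION_MEMORY = {
    "Open the file.": "<stored translation 1>",
    "Save the file.": "<stored translation 2>",
    "Close the window.": "<stored translation 3>",
}


def lookup(segment, memory, threshold=0.75):
    """Return (source, translation, score) for the best fuzzy match.

    Returns None when no stored segment reaches the threshold.
    The threshold value is an arbitrary assumption for illustration.
    """
    best = None
    for source, target in memory.items():
        # ratio() is a similarity score in [0, 1] based on matching blocks.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if best is None or score > best[2]:
            best = (source, target, score)
    if best is not None and best[2] >= threshold:
        return best
    return None


if __name__ == "__main__":
    print(lookup("Open the files.", TRANSLATION_MEMORY))
    print(lookup("qqq", TRANSLATION_MEMORY))
```

A near-duplicate segment such as "Open the files." retrieves the stored entry for "Open the file.", while a segment with no close match returns `None` and would fall through to the other translation components.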

  5. [Slide image; no text in transcript]

  6. Test suite for Translation Support Systems

  7. Knowledge Management: Parallel Corpus & Tools

  8. Gyan Nidhi: Parallel Corpus
     ‘GyanNidhi’, which stands for ‘Knowledge Resource’, is a parallel corpus in 12 languages (English and 11 Indian languages), a project sponsored by TDIL, DIT, MC&IT, Govt. of India.

  9. Gyan Nidhi: Multi-Lingual Aligned Parallel Corpus
     What is it? A multilingual parallel text corpus contains the same text translated into more than one language.
     What does Gyan Nidhi contain? The GyanNidhi corpus consists of text in English and 11 Indian languages (Hindi, Punjabi, Marathi, Bengali, Oriya, Gujarati, Telugu, Tamil, Kannada, Malayalam, Assamese). It aims to digitize 1 million pages altogether, with at least 50,000 pages in each Indian language and in English.
     Sources for the parallel corpus:
     • National Book Trust India
     • Sahitya Akademi
     • Navjivan Publishing House
     • Publications Division
     • SABDA, Pondicherry

  10. GyanNidhi Block Diagram

  11. Gyan Nidhi: Multi-Lingual Aligned Parallel Corpus
      Platform: Windows
      Data encoding: XML, Unicode
      Portability of data: data in XML format supports various platforms
      Applications of GyanNidhi:
      • Automatic dictionary extraction
      • Creation of translation memory
      • Example-Based Machine Translation (EBMT)
      • Language research, study and analysis
      • Language modeling
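The slide states only that the corpus is stored as Unicode XML. To make that concrete, the sketch below shows one way paragraph-aligned text could be laid out and read back as source/target pairs, for example to seed a translation memory. The element names (`corpus`, `para`, `text`), the `lang` attribute, the sample sentences and the `aligned_pairs` function are all assumptions made for illustration; this is not GyanNidhi's actual schema or data.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: one <para> per aligned unit, one <text> per language.
# Element names, attributes and sentences are invented for this sketch.
SAMPLE = """
<corpus>
  <para id="p1">
    <text lang="en">Knowledge is a resource.</text>
    <text lang="hi">ज्ञान एक संसाधन है।</text>
  </para>
  <para id="p2">
    <text lang="en">This paragraph has no Hindi counterpart yet.</text>
  </para>
</corpus>
"""


def aligned_pairs(xml_text, src, tgt):
    """Return [(para_id, src_text, tgt_text)] for paragraphs present in both languages."""
    root = ET.fromstring(xml_text)
    pairs = []
    for para in root.iter("para"):
        # Map language code -> paragraph text for this aligned unit.
        by_lang = {
            t.get("lang"): (t.text or "").strip() for t in para.findall("text")
        }
        # Keep only units where both sides of the pair exist.
        if src in by_lang and tgt in by_lang:
            pairs.append((para.get("id"), by_lang[src], by_lang[tgt]))
    return pairs


if __name__ == "__main__":
    for para_id, en, hi in aligned_pairs(SAMPLE, "en", "hi"):
        print(para_id, "|", en, "|", hi)
```

Paragraphs missing one side of the requested language pair are skipped, so partially translated material does not produce broken pairs.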

  12. Tools: Prabandhika: Corpus Manager
      • Categorisation of corpus data into user-defined domains
      • Addition, deletion and modification of Indian-language data files in HTML / RTF / TXT / XML format
      • Selection of languages for viewing the parallel corpus, with data aligned up to paragraph level
      • Automatic selection and viewing of parallel paragraphs in multiple languages
      • Abstract and metadata
      • Printing and saving parallel data in Unicode format

  13. Sample Screen Shot: Prabandhika

  14. Tools: Vishleshika: Statistical Text Analyzer
      • Vishleshika is a tool for statistical text analysis of Hindi, extensible to text in other Indian languages
      • It examines input text and generates various statistics, e.g.:
        • Sentence statistics
        • Word statistics
        • Character statistics
      • The Text Analyzer presents its analysis in textual as well as graphical form
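The three kinds of statistics listed above can be illustrated with a few lines of Python. This is a minimal sketch and not Vishleshika's implementation: the function name, the choice of sentence delimiters (danda, double danda and Latin terminators) and the decision to count only letters and combining marks as characters are all assumptions.

```python
import re
import unicodedata
from collections import Counter


def text_statistics(text):
    """Compute simple sentence, word and character statistics for a text."""
    # Sentences: split on danda (।), double danda (॥) or Latin terminators.
    sentences = [s.strip() for s in re.split(r"[।॥.?!]+", text) if s.strip()]
    # Words: maximal runs of characters that are not whitespace or punctuation.
    words = re.findall(r"[^\s।॥.,;:?!]+", text)
    # Characters: letters (L*) and combining marks (M*) only, so Devanagari
    # vowel signs are counted but spaces, digits and punctuation are not.
    chars = Counter(
        ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")
    )
    return {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "word_frequencies": Counter(words),
        "char_frequencies": chars,
    }


if __name__ == "__main__":
    # Made-up sample text, not taken from the corpus.
    stats = text_statistics("राम घर गया। सीता घर आई।")
    print(stats["sentence_count"], stats["word_count"])
    print(stats["word_frequencies"].most_common(3))
```

The frequency counters returned here are the kind of data a graphical view (bar chart or histogram) would be drawn from.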

  15. Sample output: Character statistics
      The graph shows that the distribution is almost equal in Hindi and Nepali in the sample text.
      [Figure panels: Most frequent consonants in Hindi; Most frequent consonants in Nepali]
      The results also show that these six consonants account for more than 50% of consonant usage.
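The "six consonants cover more than 50%" observation is a cumulative-share calculation over consonant frequencies. The sketch below shows how such a figure could be computed; it is not the tool's code. It assumes consonants are the Devanagari code points U+0915 (क) to U+0939 (ह), which leaves out the precomposed nukta letters at U+0958 to U+095F, and the function names are hypothetical.

```python
from collections import Counter


def is_consonant(ch):
    """True for Devanagari consonants in the basic block range क..ह (an assumption)."""
    return "\u0915" <= ch <= "\u0939"


def top_consonant_share(text, k=6):
    """Return the k most frequent consonants and their combined share of all consonants."""
    counts = Counter(ch for ch in text if is_consonant(ch))
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(k)
    share = sum(n for _, n in top) / total
    return top, share


if __name__ == "__main__":
    # Made-up sample text; a real analysis would run over the whole corpus.
    top, share = top_consonant_share("राम घर गया। सीता घर आई।", k=6)
    print(top)
    print(f"share of top consonants: {share:.0%}")
```

Running the same function over Hindi and Nepali samples and comparing the two `top` lists would reproduce the kind of side-by-side comparison the slide describes.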

  16. Vishleshika: Word and Sentence Statistics

  17. Speech Technology and Tools

  18. Annotated Speech Corpora for Hindi, Punjabi and Marathi languages
      [Flow diagram; labels: Vishleshika Statistical Analysis Tool, Gyan Nidhi Corpus, Phonetically Rich Sentence Set, Manual Verification and Editing, Studio Recording by Professionals, XML Metadata Creation, Segmentation and Labeling using Praat / Emulabel]

  19. [Slide image; no text in transcript]

  20. Modules under TTS

  21. Other Areas of Expertise
      • OCR for Devanagari script
      • Digital library for Indian languages
      • Word processing tools such as spell checker, transliteration, terminology development, document analysis and font converters
      • Indian-language eContent creation

  22. Areas for Future Work
      • Machine Translation
        • Standardization of Lexware database design
        • Working on the global approach ‘BhashaSetu’, an amalgamation of different approaches that takes the best of each
        • Development of a translation system test bed
      • Knowledge Management
        • Automatic text summarization tool for Hindi and other Indian languages
        • Standardization of a part-of-speech tagset for Hindi, extensible to other Indian languages
        • Part-of-speech tagger development for Indian languages
        • Automated terminology development tools
        • Sentence alignment tool for Indian languages
        • Development of a manually tagged parallel corpus aligned up to word level
      • Speech Technology
        • Speech-to-speech translation system
        • Development of semi-automated speech annotation tools

  23. Thank You
