
A PROPOSAL FOR CREATION OF A


Presentation Transcript


  1. A PROPOSAL FOR CREATION OF A LINGUISTIC DATA CONSORTIUM FOR INDIA. Focus: linguistic data

  2. What is ‘Linguistic Data’? • Printed words - in different scripts, fonts, platforms & environments • Domain-specific texts (e.g. the 90-odd domains in current Indian languages corpora) • Samples of Spoken Corpus – telephone talk, public lectures, formal discussions, in-group conversations, radio talks, natural language queries, etc. • Hand-written samples • Ritualistic use of languages – scriptures, chanting, etc. • Language of performance – reading, recitations, enactment. But this data is of use only if it comes with linguistic analysis, because it must be tagged and aligned to be of use. THAT’S WHAT CREATES AN IMPORTANT ROLE FOR LINGUISTS IN THIS ENTERPRISE.

  3. How did the idea of an Indian LDC come about? • The Brown University text corpus was adopted to build statistical language models. • The TI-46 & TI DIGITS databases of Texas Instruments (early 80's) were distributed by NIST. • The LDC at U-Penn was established in 1992. • CIIL houses 45 million word corpora in 15 Indian languages, built with DoE-TDIL support, and has been distributing them to R&D groups the world over. • These have now been converted into UNICODE jointly with the University of Lancaster; together with another 45 million word corpora from five Indian languages coming in under the Emille project, the collection was released in early 2004. • CIIL is now working with the University of Uppsala on corpora of lesser-known languages of India; see www.ciiluppsala-spokencorpus.net • SO WHAT made us PROPOSE an LDC-IL? • The giant strides in IT that India has made. • Demands made by several software and telecom giants – Reliance, IBM, HP Labs, Modular Systems & Infosys. • Suggestions of the Hindi Committee. • The decision taken in the 1st ILPC meeting, 2004.

  4. RECOLLECTING THE EVOLUTION OF THE PROPOSAL. The proposal evolved through discussions held with many institutions in India and abroad. August 13, 2003: 1st presentation at the MHRD, with the then ES in the chair, and the FA, AS, J.S.(L), Director (L) and experts from C-DAC and IIT-Kanpur. August 17 and 18, 2003: An International Workshop on LDC was held at the CIIL, Mysore in collaboration with IIIT-Hyderabad and HP Labs, India. It was inaugurated by Smt. Kumud Bansal (the then AS & now Secretary, Elementary Ed), and attended by the J.S. (L). Those who created the LDC in the USA participated. August 19, 2003: A follow-up meeting of a smaller group was held at the Indian Institute of Science to thrash out further details. A Project Committee was set up.

  5. The Project Drafting Committee had top NLP specialists and linguists, with the Director, CIIL as the Coordinator: five experts from IIT-B, IIT-M, IISc, IIIT-Hyd & CIIL, with inputs from the industry. All changes were made through email chats and exchanges, and after four teleconferences during Sept-Oct, 2003. Nov 18, ’03: Modified proposal submitted. Dec 19, 2003: During the 2nd ICON, representatives of lead Institutes met in Mysore to discuss the draft sent to the Ministry. Prof. Aravind Joshi also participated. January, 2004: With additional inputs, the proposal was modified. Feb 24, '04: A number of suggestions were made (see minutes) during the 2nd presentation for the ES, AS, JS(L) & IFD. April 16, 2004: After the presentation before the TDIL Advisory Committee, DoE offered full support.

  6. Why LDC-IL? The importance of creating a large data archive of Indian languages is undeniable. In fact, it is this realization that resulted in the government’s plan for corpora development in the early ’90s. Indian languages often pose a difficult challenge for specialists in AI/NLP. The technology developers building mass-application tools/products have for long been calling for the availability of linguistic data on a large scale. However, the data should be collected, organized and stored in a manner that suits different groups of technology developers. These issues require us to involve a number of disciplines like linguistics, statistics & CS. Further, this data must be of high quality with defined standards. Resources must be shared, so that all R&D groups benefit. All these are possible with a data consortium.

  7. Spoken language data & importance of phoneticians • India has numerous languages, each with many sound patterns that phoneticians have identified and studied for centuries. • The IPA inventory is invaluable for a spoken language corpus, but identifying its symbols in speech data requires finesse. • For speech technology, we have to create both phonetic and acoustic models of languages. • Even though this is now aided and eased by Visual Phonetics technology, as available in the CIIL or TIFR labs, what we need in addition is trained phoneticians.

  8. THE MODEL • An ideal model of a consortium is the Linguistic Data Consortium (LDC) hosted by the University of Pennsylvania. • LDC (USA) is an open consortium of universities, companies & government R&D labs that creates, collects and distributes speech and text databases, lexicons, and other resources for R&D. • This ‘LDC’ has 100-plus agencies as its active users and members, and includes some non-western languages: Arabic, Chinese, Korean. • The core operations of LDC are self-supporting after ten years. • Its activities include maintaining the data archives, producing and distributing CD-ROMs, arranging networked data distribution, etc. • All these have provided a great impetus to R&D in the field of language technology for English and other European languages. • It is proposed to adopt a similar approach in the Indian context.

  9. Who managed? 1. Govt 2. Industry 3. University. Who funded LDC in the US? • LDC was supported initially by US Govt grant IRI-9528587 from the Information and Intelligent Systems division. • Also by grant 9982201 from the Human Computer Interaction Program of the National Science Foundation. • Powered in part by Academic Equipment Grant 7826-990 237-US from Sun Microsystems. • No member institution could afford to produce this individually.

  10. Who will set up LDC-IL in India? What will it do actually? • The Ministry of HRD, through the Central Institute of Indian Languages (CIIL), Mysore, along with other institutions working on Indian language technology, namely the Indian Institute of Science, Bangalore, the Indian Institutes of Technology at Mumbai and Chennai, and the International Institute of Information Technology, Hyderabad, proposes to set up this LDC-IL. • It is proposed that they will be the Lead Institutions in this initiative, with CIIL as the coordinating body. • LDC-IL will be an archive plus. • Besides data, tools and standards of data representation and analysis must be developed. • It will create, analyze, segment, tag, align, and upload different kinds of linguistic resources. • It will accept electronic resources from authors, newspapers, publishers, film, TV and radio, and process them for the use of the community.

  11. Potential Participants / Institutions in India All academic institutes, research organizations and Corporate R&D groups from India and abroad working on Indian languages will be encouraged to participate in LDC-IL. The following have already shown interest: • IISc Bangalore; • All Indian Institutes of Technology; • IIITs at Hyderabad and elsewhere; • ISI Calcutta/Hyderabad/Bangalore; • C-DAC, Pune; • TIFR Mumbai; • Universities like U of Hyderabad; DU; JNU; NEHU • HP Labs India; • IBM; Infosys; Reliance Infocom; • Language institutions like CIEFL, KHS, NCPUL & RSKS;

  12. Major areas of Linguistic Resource Development as proposed • Speech Recognition and Synthesis • Character Recognition • Creation of different kinds of Corpora • NLP • By-products: word finders, lexicons of different kinds, thesauri, usage compilations, etc.

  13. Other possible applications • Collocational restrictions for OCR • Building TTS: statistical probability models • Build a speech recognition model • Auto-summarization • Develop tree-bank tools • Skeletal parses • Will form a basis of MAT or MT systems. IN A WAY, ALL THESE WILL ONLY BE COMPLEMENTARY TO WHAT IS BEING PLANNED / ENCOURAGED BY TDIL of MCIT.

  14. Funding & Management • The core funding will come from the Government of India and will span two plan periods. • All activities will be in project mode and run through CIIL’s PL account. • All staff will be on contract. • All receipts and payments, whether through internet gateways or through conventional means, will go through this special bank account. • LDC-IL will attempt to leverage expertise already available to cut avoidable cost and delay. • As the nodal agency, CIIL will further distribute the relevant funding for specific sub-components of the scheme to other academic institutions. • An annual progress report will be submitted to the government.

  15. Arrangements

  16. PAC of LDC-IL

  17. Membership: Differential rate of annual fee. India: 1. Individual Researchers: Rs. 2,000/- per annum 2. Educational Institutions: Rs. 20,000/- per annum 3. Software and related industry: Rs. 2,00,000/- per annum. Other countries: 1. Individual Researchers: $ 2,000/- per annum 2. Educational Institutions: $ 20,000/- per annum 3. Software and related industry: $ 50,000/- per annum. IT GOES WITHOUT SAYING THAT THIS WOULD REQUIRE CONSTANT UPDATING AND UPGRADING AS WELL AS EXPANSION OF OUR DATA / TOOLS / PRODUCTS.

  18. Estimation • It is estimated that by the third year, LDC-IL will have 50 institutional members from India and 200 Indian scholars as individual members, contributing Rs. 12 lakh annually. • In addition, it is estimated to have at least 20 researchers from abroad as individual members, contributing $ 40,000, or Rs. 20 lakhs more. • The attempt will be to secure industrial support from the IT sector internationally to raise at least 10 institutional memberships initially, creating a corpus of $ 200,000 annually by the third year. Should that happen, it will generate a substantial amount for LDC-IL.

  19. Budget: A broad indication. Rs. 221.60 lakhs per year. Total: Rs. 1772.8 lakhs for the next 8 years. • 1. Human Resources: 69,84,000 • 2. Tasks: 64,76,000 • 3. Events (meetings, workshops, seminars & training programs): 50,00,000 • 4. Equipment & maintenance: 27,00,000 • 5. IPR costs & publications: 10,00,000 Total: Rs. 2,21,60,000 • NB: The Director, CIIL, on the advice of the Project Advisory Committee of the LDC-IL, may be authorized to re-appropriate funds among the heads indicated here, without exceeding the overall budget. • In case people in service in the Government or Autonomous Institutions in a substantive capacity are selected, their service and salary will be protected.
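
The totals on this slide can be checked with a few lines of arithmetic. The sketch below simply re-adds the five budget heads as listed and multiplies by the eight project years; the variable names are illustrative and not part of the proposal.

```python
LAKH = 100_000  # 1 lakh = 100,000 rupees

# Annual budget heads in rupees, written with Indian digit grouping.
heads = {
    "Human Resources": 69_84_000,
    "Tasks": 64_76_000,
    "Events": 50_00_000,
    "Equipment & maintenance": 27_00_000,
    "IPR costs & publications": 10_00_000,
}
PROJECT_YEARS = 8

annual_total = sum(heads.values())            # rupees per year
project_total = annual_total * PROJECT_YEARS  # rupees over the project

print(f"Annual: Rs. {annual_total / LAKH:.2f} lakhs")
print(f"Over {PROJECT_YEARS} years: Rs. {project_total / LAKH:.1f} lakhs")
```

The sum of the heads comes to Rs. 221.60 lakhs per year and Rs. 1772.8 lakhs over eight years, matching the figures quoted on the slide.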

  20. Resource Generation: Details • The first 2 years of the project are incubation years. It will take time to set up, test-run tools and deliverables, and advertise. • It is estimated that from the third year onwards, the annual revenue may be 8% to 10% of the annual investment, i.e. Rs. 17.73 lakhs to Rs. 22.16 lakhs, contributing to the Corpus Fund. • From the 6th year on, it will be around 25% to 30% of the amount invested, i.e. Rs. 55.4 lakhs to Rs. 66.48 lakhs annually. • At the end of eight years, there will be at least Rs. 201.66 lakhs to Rs. 243.76 lakhs, plus interest, in corpus funds. • Hopefully, there will be new lead institutions to contribute further to the corpus fund once LDC-IL works in full swing.
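
The revenue bands above can be reproduced from the annual investment of Rs. 221.60 lakhs. The sketch below is a reading of the figures rather than a statement of the authors' method: the rupee amount Rs. 66.48 lakhs corresponds to 30% of the annual investment, and the eight-year totals are matched when the lower band is counted for two years and the higher band for three. That two-plus-three split is an assumption inferred from the numbers; it is not stated on the slide.

```python
ANNUAL_INVESTMENT = 221.60  # Rs. lakhs per year, from the budget slide


def revenue(rate: float) -> float:
    """Annual revenue in Rs. lakhs as a fraction of the annual investment."""
    return ANNUAL_INVESTMENT * rate


early_low, early_high = revenue(0.08), revenue(0.10)  # 8% to 10% band
late_low, late_high = revenue(0.25), revenue(0.30)    # band from the 6th year on

# Assumption (inferred, not stated in the source): two revenue years in the
# early band and three in the later band reproduce the quoted totals.
EARLY_YEARS = 2
LATE_YEARS = 3

corpus_low = EARLY_YEARS * early_low + LATE_YEARS * late_low
corpus_high = EARLY_YEARS * early_high + LATE_YEARS * late_high

print(f"Early band: {early_low:.2f} to {early_high:.2f} lakhs per year")
print(f"Later band: {late_low:.2f} to {late_high:.2f} lakhs per year")
print(f"Corpus after eight years: {corpus_low:.2f} to {corpus_high:.2f} lakhs")
```

Under these assumptions the corpus fund comes to Rs. 201.66 lakhs at the low end and Rs. 243.76 lakhs at the high end, before interest.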

  21. Core Operations to be self-supporting • Beyond eight years, the Govt may support only events (Rs. 50 lakhs from CIIL’s OC-Plan), tasks of software development (Rs. 64.76 lakhs from our OE-Plan), and maintenance of equipment (Rs. 15.24 lakhs from OE-Non-Plan), i.e. Rs. 130 lakhs a year. • The services of the personnel and the IPR costs will be paid from the 6% interest on the corpus funds (Rs. 14.63 lakhs) plus the anticipated annual income of Rs. 66.48 lakhs, i.e. Rs. 81.11 lakhs generated annually. With the Rs. 130 lakhs as above, the total comes to Rs. 211.11 lakhs (approx).
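
As a cross-check of the figures on this slide, the following sketch adds up the two funding streams. It assumes the 6% interest is earned on the upper corpus estimate of Rs. 243.76 lakhs, which is what reproduces the quoted Rs. 14.63 lakhs; the names used are illustrative only.

```python
# Continuing government support after year eight, in Rs. lakhs per year.
govt_support = {
    "events (OC-Plan)": 50.00,
    "software development tasks (OE-Plan)": 64.76,
    "equipment maintenance (OE-Non-Plan)": 15.24,
}

CORPUS_FUND = 243.76    # assumption: upper eight-year corpus estimate, Rs. lakhs
INTEREST_RATE = 0.06    # 6% interest on the corpus fund
ANNUAL_INCOME = 66.48   # anticipated annual income, Rs. lakhs

govt_total = sum(govt_support.values())
interest = round(CORPUS_FUND * INTEREST_RATE, 2)
own_funds = interest + ANNUAL_INCOME
grand_total = govt_total + own_funds

print(f"Government support: {govt_total:.2f} lakhs")
print(f"Interest on corpus: {interest:.2f} lakhs")
print(f"Self-generated:     {own_funds:.2f} lakhs")
print(f"Total per year:     {grand_total:.2f} lakhs")
```

The self-generated share (Rs. 81.11 lakhs) is meant to cover personnel and IPR costs, while the government share (Rs. 130 lakhs) covers events, tasks and equipment maintenance.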

  22. Thank you

  23. Speech Recognition and Synthesis: Objectives • 1. Primarily to build speech recognition and synthesis systems. • 2. Although there are ASR & TTS systems for many western languages, commercially viable speech systems are unavailable for Indian languages. • 3. Voice User Interfaces for IT applications and services, useful especially in telephony-based applications. • 4. If such technology is available in Indian languages, people in various semi-urban and rural parts of India will be able to use telephones and the Internet to access a wide range of services and information on health, agriculture, travel, etc. • 5. However, for this a computer has to be able to accept speech input in the user’s language and provide natural speech output. • 6. In India, speech technology could also be coupled with translation systems between the various Indian languages. • 7. The main obstacle to customizing this technology for various Indian languages is the lack of appropriate annotated speech databases. • 8. Focus: (i) to collect data that can be used for building speech-enabled systems in Indian languages and (ii) to develop tools that facilitate collection of high-quality speech data.

  24. Goals – long & short term

  25. Methodology

  26. Possible Applications: • Speech to Speech translation for a pair of Indian languages, namely, Hindi and Telugu. • Command and control applications. • Multimodal interfaces to the computer in Indian languages. • E-mail readers over the telephone. • Readers for the visually disadvantaged. • Speech enabled Office Suite. The effort for both Speech Recognition and Speech Synthesis will be repeated across all 22 Scheduled languages. For Speech Recognition, spontaneous speech data will be collected along with read speech. For speech synthesis, data will be collected from professional speakers, with very good voice quality. Additional speech data will be collected to come out with models for prosody (intonation, duration, etc.) to improve the naturalness of synthesized speech. A database (lexicon) of proper names (of Indian origin) will be created, with the equivalent phonetic representation for each of the names.

  27. Character Recognition • Character Recognition refers to the conversion of printed or handwritten characters to a machine-interpretable form. • “Online” handwriting recognition or Online HWR refers to the interpretation of handwriting captured dynamically using a handheld or tablet device. It allows the creation of more natural handwriting-based alternatives to keyboards for data entry in Indian scripts, and also the imparting of handwriting skills using computers. • “Offline” handwriting recognition or Offline HWR refers to the interpretation of handwriting captured statically as an image. • Optical character recognition or OCR refers to the interpretation of printed text captured as an image. It can be used for conversion of printed or typewritten material such as books and documents into electronic form. • These different areas of language technology require different algorithms and linguistic resources. • They are all hard research problems because of the variety of writing styles and fonts encountered. • Of these, OCR has seen some research in a few Indian scripts because of support from the TDIL program. However, the technology is not yet mature and there is only one commercial offering.

  28. Possible Applications

  29. Natural Language Processing • Electronic dictionaries: electronic dictionaries are a primary requisite for developing any software in NLP. • ED 1. Monolingual/bilingual dictionaries: 25,000 words per year (per language) • ED 2. Transfer Lexicon and Grammar (TransLexGram) (per language) • The Transfer Lexicon and Grammar involves developing a language resource which would contain: • English headwords • Their grammatical category • Their various senses in Hindi • The corresponding sense in the other Indian language • An example sentence in English for each sense of a word • The corresponding translation in the concerned Indian language • In case of verbs, parallel verb-frames from English to the Indian language. • As is obvious from the above, TransLexGram will be a rich lexicon which will contain not only word-level information but also the crucial information of verb-argument structure and the vibhaktis with specific senses of a verb. • The resource, once created, will be a parallel resource not only between English and Indian languages but also across all Indian languages.
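
One way to picture a TransLexGram record is as a small data structure with one field per item in the list above. The sketch below is only an illustration: the class names, field names and the `missing_verb_frames` helper are hypothetical, the proposal does not prescribe a storage format, and the angle-bracket strings are placeholders rather than real lexical data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    """One sense of an English headword with its cross-language material."""
    hindi_sense: str                          # the sense as rendered in Hindi
    target_sense: str                         # corresponding sense in the other Indian language
    english_example: str                      # example sentence in English for this sense
    target_translation: str                   # its translation in the concerned language
    english_verb_frame: Optional[str] = None  # verbs only
    target_verb_frame: Optional[str] = None   # verbs only, parallel to the English frame


@dataclass
class TransLexGramEntry:
    headword: str         # English headword
    category: str         # grammatical category, e.g. "verb"
    target_language: str  # the Indian language paired with English and Hindi
    senses: List[Sense] = field(default_factory=list)

    def missing_verb_frames(self) -> List[int]:
        """Indices of senses of a verb that still lack a parallel verb frame."""
        if self.category != "verb":
            return []
        return [i for i, s in enumerate(self.senses)
                if not (s.english_verb_frame and s.target_verb_frame)]


# Hypothetical entry; placeholders stand in for real Hindi and Telugu data.
entry = TransLexGramEntry(
    headword="give",
    category="verb",
    target_language="Telugu",
    senses=[
        Sense("<Hindi sense 1>", "<Telugu sense 1>",
              "She gave him a book.", "<Telugu translation 1>",
              english_verb_frame="<English frame>",
              target_verb_frame="<Telugu frame with vibhaktis>"),
        Sense("<Hindi sense 2>", "<Telugu sense 2>",
              "The branch gave under his weight.", "<Telugu translation 2>"),
    ],
)
print(entry.missing_verb_frames())  # the second sense has no frames yet
```

Keeping the verb frames on each sense, rather than on the headword, mirrors the slide's point that vibhaktis go with specific senses of a verb.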

  30. Creation of Corpora • Domain-Specific Corpora: apart from the creation of basic text corpora, an attempt will be made to create domain-specific corpora in the following areas: • a. Newspaper corpora • b. Child language corpus • c. Pathological speech/language data • d. Speech error data • e. Historical/inscriptional databases of Indian languages, which are among the most important resources, both as living documents of Indian history and for the historical linguistics of Indian languages. • f. Comparative, descriptive and reference grammars, which need to be considered as a corpus of databases. • g. Morphological analyzers and morphological generators.

  31. POS tagged corpora • Part-of-speech (or POS) tagged corpora are collections of texts in which the part-of-speech category of each word is marked. • They are to be developed in a bootstrapping manner. • First, manual tagging will be done on some amount of text. • Then, a POS tagger which uses learning techniques will be trained on the tagged data. • After the training, the tool will automatically tag another set of the raw corpus. • The automatically tagged corpus will then be manually validated, and the validated text will be used as additional training data for enhancing the performance of the tool.
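
The bootstrapping cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tagger the proposal has in mind: a most-frequent-tag lookup stands in for the learning-based tagger, the `validate` callback stands in for the human annotator, and the toy English tokens and tag names are purely illustrative.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Learn the most frequent tag per word, plus a fallback tag for unknown words."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]
    return lexicon, default_tag


def tag(model, sentence):
    """Tag a list of words with the current model."""
    lexicon, default_tag = model
    return [(w, lexicon.get(w.lower(), default_tag)) for w in sentence]


def bootstrap(seed, raw_batches, validate):
    """Seed with manual tags, then repeat: auto-tag, validate, retrain."""
    training = list(seed)
    model = train(training)
    for batch in raw_batches:
        auto_tagged = [tag(model, s) for s in batch]
        corrected = [validate(s) for s in auto_tagged]  # manual validation step
        training.extend(corrected)
        model = train(training)                         # retrain on the larger set
    return model, training


# Toy data: one manually tagged seed sentence and one raw batch.
seed = [[("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")]]
raw_batches = [[["the", "cat", "runs"]]]
corrections = {"cat": "NOUN"}  # what the human validator would fix


def validate(sentence):
    return [(w, corrections.get(w, t)) for w, t in sentence]


model, training = bootstrap(seed, raw_batches, validate)
print(tag(model, ["the", "cat"]))
```

Each pass enlarges the validated training set, so the share of tags the annotator has to correct should shrink as the corpus grows.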

  32. Other kinds of Corpora • Semantically tagged corpora: The real challenge in any NLP and text information processing application is the task of disambiguating senses. In spite of long years of R&D in this area, fully automatic WSD with 100% accuracy has remained an elusive goal. One of the reasons for this shortcoming is understood to be the lack of appropriate and adequate lexical resources and tools. One such resource is the "semantically tagged corpora". • Chunked corpora: The chunked corpora will also be prepared in a manner similar to the POS tagging. Here too the initial training set will be a completely manual effort. Thereafter, it will be a man-machine effort. That is why the target in the first year is lower, and doubles in the successive years. Chunked corpora are a useful resource for various applications.

  33. Syntactic tree bank: Preparation of this resource requires a higher level of linguistic expertise and needs more human effort. First, experts will manually tag the data for syntactic parsing. A crucial point related to this task is to arrive at a consensus regarding the tags, the degree of fineness in analysis and the methodology to be followed. This calls for discussions amongst scholars from varying fields such as Sanskritists, linguists and computer scientists. It will be achieved through the conduct of workshops and meetings. Parallel aligned corpora: A text available in multiple languages through translation constitutes parallel corpora. NBT & Sahitya Akademi are some of the official agencies that develop parallel texts in different languages through translation. Such institutions have given permission to CIIL to use their works for creation of electronic versions of the same as parallel corpora. The literary magazines and newspaper houses with multiple language editions will have to be approached for parallel corpora. Computer programmes have to be written for creating [I] Aligned texts; [II] Aligned sentences; and [III] Aligned chunks.
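
To give a feel for what a sentence alignment programme might do, here is a minimal sketch of length-based alignment by dynamic programming. It is a swapped-in illustration, not the method the proposal specifies: the cost is a plain difference in character length with fixed penalties for skipped or merged sentences, the function name and penalty values are assumptions, and the strings are placeholders rather than real parallel text.

```python
def align_sentences(src, tgt, skip_penalty=10.0, merge_penalty=3.0):
    """Align two sentence lists by character length.

    Returns a list of (source_indices, target_indices) pairs ("beads").
    """
    # Allowed bead shapes: (sentences taken from src, from tgt, fixed penalty).
    moves = [(1, 1, 0.0), (1, 0, skip_penalty), (0, 1, skip_penalty),
             (2, 1, merge_penalty), (1, 2, merge_penalty)]
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Fill the table; every predecessor of a cell is visited before the cell.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + abs(src_len - tgt_len) + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Trace the cheapest path back from the end to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


# Placeholder "sentences": the second and third source sentences correspond
# to a single target sentence.
source = ["aaaa", "bb", "cc", "dddddd"]
target = ["AAAA", "BBCC", "DDDDDD"]
print(align_sentences(source, target))
```

The same table-filling idea applies at the text and chunk levels by changing what counts as a unit; a practical aligner for Indian language pairs would need a better cost than raw character length, for instance one informed by a bilingual lexicon.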

  34. Corpora Tools • 1. Tools for Transfer Lexicon Grammar (including creation of an interface for building Transfer Lexicon Grammar) • 2. Spellchecker and corrector tools • 3. Tools for POS tagging (a trainable tagging tool + an interface for editing POS-tagged corpora) • 4. Tools for chunking (rule-based language-independent chunkers) • 5. Interface for chunking (an interface for editing and validating the chunked corpora) • 6. Tools for syntactic tree bank, incl. an interface for developing the syntactic tree bank • 7. Tools for semantic tagging, whose basic resources are the Indian language WordNets: a browser with two windows, in which the senses (i.e., synsets) from the WordNet appear in the second window, after which a manual selection of the sense can be done • 8. (Semi-)automatic tagger based on statistical NLP (a preliminary version of which is ready at IITB) • 9. Tools for text alignment, including a text alignment tool, a sentence alignment tool and a chunk alignment tool, as well as an interface for aligning corpora
