A PROPOSAL FOR CREATION OF A. FOR INDIA
Focus: linguistic data
What is ‘Linguistic Data’?
Printed words - in different scripts, fonts, platforms & environments
But this data is of use only if it comes with linguistic analysis
Because it must be tagged and aligned to be of use
THAT’S WHAT CREATES AN IMPORTANT ROLE FOR LINGUISTS IN THIS ENTERPRISE
RECOLLECTING EVOLUTION OF THE PROPOSAL?
The proposal evolved through discussions held with many institutions in India and abroad.
August 13, 2003: 1st presentation at the MHRD, with the then ES in the chair, and FA, AS, J.S.(L), Director (L) and experts from C-DAC and IIT-Kanpur.
August 17 and 18, 2003: An International Workshop on LDC was held at the CIIL, Mysore in collaboration with IIIT-Hyderabad and HPLabs, India. It was inaugurated by Smt. Kumud Bansal (the then AS & now Secretary, Elementary Ed), and attended by the J.S. (L). Those who created the LDC in the USA participated.
August 19, 2003: A follow-up meeting of a smaller group was held at the Indian Institute of Science to thrash out further details. A Project Committee was set up.
Five experts from IIT-B, IIT-M, IISc, IIIT-Hyd, & CIIL with inputs from the industry.
All changes were made through email chats and exchanges, and after four teleconferences during Sept-Oct, 2003.
Nov 18, ’03: Modified proposal submitted.
Dec 19, 2003: During the 2nd ICON, representatives of lead Institutes met in Mysore to discuss the draft sent to the Ministry. Prof. Aravind Joshi also participated.
January, 2004: With additional inputs, the proposal was modified.
Feb 24, '04: A number of suggestions were made (see minutes) during the 2nd presentation for ES, AS, JS(L), & IFD.
April 16, 2004: After the presentation before TDIL Advisory Comm., DoE offers full support.
Indian languages often pose a difficult challenge for the specialists in AI/NLP.
Technology developers building mass-application tools and products have long been calling for linguistic data to be made available on a large scale.
However, the data should be collected, organized and stored in a manner that suits different groups of technology developers.
These issues require us to involve a number of disciplines like linguistics, statistics, & CS.
Further, this data must be of high quality with defined standards.
Resources must be shared, so that all R&D groups benefit.
All these are possible with a data consortium.
Who funded LDC in US?
3. University
• LDC-IL will be an archive plus.
• Besides data, tools and standards of data representation and analysis must be developed.
• It will create, analyze, segment, tag, align, and upload different kinds of linguistic resources.
• It will accept electronic resources from authors, newspapers, publishers, film, TV & radio, and process them for use by the community.
All academic institutes, research organizations and Corporate R&D groups from India and abroad working on Indian languages will be encouraged to participate in LDC-IL. The following have already shown interest:
Other possible applications
TTS: Statistical Probabilities models
Build a speech recognition model
Develop Tree-bank tools
Will form a basis of MAT or MT systems
IN A WAY, ALL THESE WILL ONLY BE COMPLEMENTARY TO WHAT IS BEING PLANNED / ENCOURAGED BY TDIL of MCIT, and will complement it perfectly
Membership
Differential rate of annual fee
1. Individual Researchers: Rs. 2,000/- per annum
2. Educational Institutions: Rs. 20,000/- per annum
3. Software and related industry: Rs. 2,00,000/- per annum
Other countries:
1. Individual Researchers: $ 2,000/- per annum
2. Educational Institutions: $ 20,000/- per annum
3. Software and related industry: $ 50,000/- per annum
IT GOES WITHOUT SAYING THAT THIS WOULD REQUIRE CONSTANT UPDATING AND UPGRADING AS WELL AS EXPANSION OF OUR DATA / TOOLS / PRODUCTS
seminars & Training programs) : 50,00,000
Total: Rs. 2,21,60,000
The effort for both Speech Recognition and Speech Synthesis will be repeated across all 22 Scheduled languages. For Speech Recognition, spontaneous speech data will be collected along with read speech. For speech synthesis, data will be collected from professional speakers, with very good voice quality. Additional speech data will be collected to come out with models for prosody (intonation, duration, etc.) to improve the naturalness of synthesized speech. A database (lexicon) of proper names (of Indian origin) will be created, with the equivalent phonetic representation for each of the names.
Semantically tagged corpora:
The real challenge in any NLP and text information processing application is the task of disambiguating senses. In spite of long years of R & D in this area, fully automatic WSD with 100% accuracy has remained an elusive goal. One of the reasons for this shortcoming is understood to be the lack of appropriate and adequate lexical resources and tools. One such resource is the "semantically tagged corpora".
Preparation of this resource requires a higher level of linguistic expertise and more human effort. First, experts will manually tag the data for syntactic parsing.
A crucial point related to this task is to arrive at a consensus regarding the tags, the degree of fineness in analysis and the methodology to be followed. This calls for discussions amongst scholars from varying fields, such as Sanskritists, linguists and computer scientists. It will be achieved through workshops and meetings.
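As a rough illustration of what a semantically tagged corpus might contain, the sketch below stores each token with a part-of-speech tag and a sense label, and lists the word forms that occur with more than one sense (the cases a WSD system has to resolve). The `TaggedToken` structure, the tag names and the sense labels are all assumptions made for this example; the actual tagset and granularity are exactly what the workshops and meetings are meant to settle. English words are used only for readability.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedToken:
    """One corpus token with hypothetical annotation layers."""
    form: str    # surface word form
    pos: str     # part-of-speech tag (placeholder tagset)
    sense: str   # sense label assigned by a human annotator (placeholder)


def ambiguous_forms(tokens):
    """Return {form: sorted list of senses} for forms tagged with 2+ senses."""
    senses = defaultdict(set)
    for tok in tokens:
        senses[tok.form].add(tok.sense)
    return {form: sorted(s) for form, s in senses.items() if len(s) > 1}


# Illustrative data only; not taken from any real corpus.
corpus = [
    TaggedToken("bank", "NOUN", "bank/river-side"),
    TaggedToken("bank", "NOUN", "bank/financial"),
    TaggedToken("river", "NOUN", "river/water-body"),
]

print(ambiguous_forms(corpus))
```

Keeping the sense label as a separate field from the part-of-speech tag lets the two annotation layers be revised independently if the agreed tagset changes.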
Parallel aligned corpora:
A text available in multiple languages through translation constitutes parallel corpora.
NBT & Sahitya Akademi are among the official agencies that develop parallel texts in different languages through translation.
Such institutions have given permission to CIIL to use their works to create electronic versions of them as parallel corpora.
Literary magazines and newspaper houses with multiple language editions will have to be approached for parallel corpora.
Computer programmes have to be written for creating
[I] Aligned texts; [II] Aligned sentences; and [III] Aligned chunks.
which is ready in IITB)
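To make the idea of sentence-level alignment concrete, here is a minimal sketch of a length-based aligner using dynamic programming, loosely in the spirit of the Gale and Church approach. It is not the programme envisaged in the proposal. The function name `align_sentences`, the cost function and the `skip_penalty` value are assumptions chosen for illustration, and sentence length is measured with Python's `len`, which counts Unicode code points and may need refinement for Indian scripts.

```python
def align_sentences(src, tgt, skip_penalty=10.0):
    """Align two lists of sentences by character length.

    Returns a list of (src_indices, tgt_indices) pairs ("beads").
    Allowed beads: 1-1, 1-0, 0-1, 2-1 and 1-2.
    """
    beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                if di == 0 or dj == 0:
                    # Leaving a sentence unaligned is heavily penalised.
                    step = skip_penalty + len_s + len_t
                else:
                    step = abs(len_s - len_t)
                    if (di, dj) != (1, 1):
                        step += 1.0  # small bias towards 1-1 beads
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    # Trace back from the end of both texts to recover the beads.
    result = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        result.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    result.reverse()
    return result


# Placeholder strings standing in for sentences of a text and its translation.
source = ["aaaa", "bbbbbbbbbb", "cc"]
target = ["AAAA", "BBBBB", "BBBBB", "CC"]
for s_idx, t_idx in align_sentences(source, target):
    print(s_idx, t_idx)
```

In this toy input the second source sentence is paired with two target sentences because their combined length matches. A practical aligner would add lexical cues on top of length, and chunk-level alignment would need a chunker in addition.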