
Challenges of development of Language Technology and services in multicultural and multilingual Indian Scenario. Swaran Lata, Director and HoD, Technology Development for Indian Languages Programme (TDIL), Dept of Information Technology, Govt. of India.

Presentation Transcript

Challenges of development of Language Technology and services in multicultural and multilingual Indian Scenario

Swaran Lata, Director and HoD

Technology Development for Indian Languages Programme (TDIL)

Dept of Information Technology, Govt. of India

Organization of Presentation

  • India – cultural diversity

  • Linguistic Diversity in India

  • Present Knowledge Society and Indian Scenario

  • ICT scenario in India

  • Internet penetration – Haves & Have-nots

  • Mind-set – Still an inhibition

  • Bridging the gap – Service delivery – reaching citizens' doorsteps

  • Localization – Key enabler

  • Challenges and Issues

  • TDIL’s efforts

  • National Roll Out Plan – A big Step forward

  • Localization of Applications

  • Putting Standards in place

  • Collaboration and Hand-holding

  • India – A civilization more than 5000 years old

  • Vast ancient knowledge base

  • Diverse culture and heritage – probably one of the most spectacular in the world

  • One of the largest economies in the world today

  • Rapid strides in information and communications technology

  • Yet: a widening knowledge divide among various strata of citizens

Linguistic Diversity in India

  • According to Census 2001 India has 122 major languages and 2371 dialects.

  • Out of 122 languages 22 are constitutionally recognized languages.

  • Linguistic Diversity is very rich and wide in India

  • One language – many scripts

  • Many languages – one script

  • Cultural differences by region, even where different languages share the same script

  • Wide differences even within the same language across different countries

Marathi

Though the script is the same (Devanagari), content varies between Hindi and Marathi, depicting cultural and linguistic differences

Present ICT Scenario in India

  • Despite a reputation as an emerging technology powerhouse, India’s scores on the 2009 Connectivity Scorecard are poor in the vital consumer and business segments.

  • These poor scores should not be surprising, since many of the individual metrics that we utilise are effectively measuring “penetration rates.”

  • This means that India is judged as a whole, and not by the pockets of ICT excellence that it undoubtedly possesses.

  • India scores especially low on broadband and Internet penetration rates.

  • Broadband penetration in India is below 2 percent of households, compared to 20 percent of households or more in Turkey, Chile, and Mexico.

  • On the consumer usage front, India is not a strong performer in terms of Internet usage, with below 10 percent of the population regularly using the Internet. The country is hampered by a relatively low literacy rate.

Global Broadband Divide

India is still in the low broadband penetration region

Low rural tele-density compared to urban tele-density

Mind-set: Still favouring English as medium of excellence

  • English and Hindi serve as link languages

  • Learning English is viewed as a passport to better economic and social prospects; even people from low-income strata now hold this view.

  • Due to the recent surge in ICT and ICT-enabled services, English has now become the 2nd most common medium of instruction from school level

  • Study by the National University of Educational Planning and Administration (NUEPA): under Sarva Shiksha Abhiyan, the number of students opting for English grew by 150% between 2003 and 2008, while the corresponding figure for Hindi is only 32%

  • Example: Uttar Pradesh, West Bengal and .. are now using English as the medium of instruction for schools and colleges

Result:

  • Although Hindi (ranked 3rd) and Bengali (ranked 8th) are among the top 10 languages spoken across the world, no Indian language is among the top 10 languages used on the Internet.

  • Minuscule Internet usage in Indian Languages

  • Confinement of Knowledge

  • Low usage of knowledge sources and applications

UNESCO’s Vision for Multilingualism in Cyberspace

  • Language constitutes the foundation of communication and is fundamental to cultural and historical heritage.

  • Increasingly, knowledge and information are key determinants of wealth creation, social transformation and human development.

  • Language is the primary vector for communicating knowledge and traditions, thus the opportunity to use one’s language on global information networks such as the Internet will determine the extent to which one can participate in the emerging knowledge society.

  • Thousands of languages worldwide are absent from Internet content and there are no tools for creating or translating information into these excluded tongues.

  • Huge sections of the world’s population are thus prevented from enjoying the benefits of technological advances and obtaining information essential to their wellbeing and development.

An Uneven Growth

The Indian software export industry is growing its global presence at a very fast pace

However, the root is not expanding its base within the country

Fallout: Domestic requirements in Indian languages are not being addressed within the country

Result: Non-availability of information and knowledge to a vast section of citizens

Expanding Software Export

Low penetration in Indian Market

Requirements:

Reaching out to the doorsteps of citizens, offering better services for wider dissemination of knowledge.

Localization of software solutions, content and services as per local requirements.

Common Services Centre – Its objectives

  • CSC is a strategic cornerstone of the National e-Governance Plan (NeGP) – Front end service Interface for major G2C services

  • CSC is one of the three infrastructure pillars of e-governance which the government is committed to building, to ensure “anytime anywhere” web enabled delivery of government services.

  • To provide e-governance services.

  • 100,000 CSCs for 600,000 village clusters

  • To cater to service needs of major rural areas

  • Being implemented in PPP Model

Local Language Interface – Not a desirable but An essential Component

  • The success of CSC hinges upon effective delivery of the G2C applications to rural masses

  • Since most of the citizens communicate in their local languages – Local Language Interface to G2C solutions at CSC is essential

  • Hosting of content in local languages helps citizens to interact in a better way in today’s knowledge society

  • Thus , Local Language Interface is

    “Not a desirable but An essential Component”

NeGP – Mission Mode Projects

Initiatives already taken to enable G2C applications such as Land Records, Civil Supplies and Municipal applications with Indian Language Interface

Service Delivery Model of CSC

Requires Language Interface

Localization Requirements for Service Delivery Applications

  • To ensure seamless access to services, a language component (localization and interface) is required at:

  • Storage level – Server end

  • Data exchange – Traffic (language tags need to be properly embedded)

  • Display & Rendering

  • Language interface for differently-abled citizens, for more inclusive societal benefits
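
The storage and data-exchange levels listed above can be illustrated with a minimal sketch. It assumes a record is exchanged as UTF-8 encoded JSON that carries its own language tag; the field names `lang` and `text` and the helper names are hypothetical and do not come from any CSC or NeGP specification.

```python
import json

def build_payload(text, lang_tag):
    """Serialize text plus its language tag as UTF-8 JSON bytes.

    The field names "lang" and "text" are illustrative assumptions.
    """
    record = {"lang": lang_tag, "text": text}
    # ensure_ascii=False keeps Indian-script characters as themselves
    # instead of \uXXXX escapes; the bytes on the wire are UTF-8.
    return json.dumps(record, ensure_ascii=False).encode("utf-8")

def read_payload(raw):
    """Decode UTF-8 JSON bytes back into a dictionary."""
    return json.loads(raw.decode("utf-8"))

# "Land records" in Hindi, tagged with the language tag "hi".
payload = build_payload("भूमि अभिलेख", "hi")
record = read_payload(payload)
print(record["lang"], record["text"])
```

Because the tag travels with the text, the receiving end can choose fonts, rendering and sorting rules without guessing the language from the script alone.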

Globalization of IT

Globalization & Localization

Key Enablers

  • Standards

  • Localization Tools

  • Locale Data Repository

  • Linguistic Resources

Complexities

Quality Assurance

  • Testing methodologies

  • Metrics for Linguistic Testing

  • Certification by Government for linguistic compliance

Language Technologies


  • Machine Translation

  • Optical Character Recognition

  • Speech Technologies

  • Cross Lingual Information Retrieval

  • Certified Localization professionals

  • PG Specialization in Localization

  • PhD Programmes

Locale Data

  • Presentation of dates, times, numbers, lists, and other values.

  • Collation and sorting

  • Alternate calendars, which may include holidays, work rules, weekday/weekend.

  • Currency

  • Tax or regulatory regime


  • Encoding Standards

  • Multimodal input device standards

  • Fonts & Rendering Engines

  • Transliteration & Translation

Education & Outreach

  • Guidelines

  • Best Practices

  • Case Studies

  • Consultancy

  • Showcasing of Tools & Technologies

Localization Tools

  • Project Management

  • Translation Memory

  • Translation Tools

  • Natural language for text processing: parsing, spell checking, and grammar checking etc

  • Automatic Testing Tools

Linguistic resources

  • Parallel Corpora

  • Speech Corpora

  • Lexical resources

  • Ontologies

  • Dictionaries

  • Thesaurus

  • Reference Terminologies

Shipping issues

  • Minimizing Time lag

  • Benchmarking w.r.t. English version

  • Political sensitivity

  • Pricing issues

The Tree of Localization Complexities

Globalization and Localization Issues

Language Issues

Language issues result from differences in how languages around the world handle display, alphabets, grammar, and syntactical rules.

  • Bidirectional scripts

  • Capitalization, Uppercasing and Lowercasing

  • Code Pages

  • Complex Script Awareness

  • Fonts

  • Input Method Editors

  • Keyboards

  • Line and Word Breaks

  • Mirroring Awareness

  • Unicode
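
Complex script awareness can be made concrete with a short sketch using Python's standard `unicodedata` module. It shows that one Hindi word is stored as a sequence of base letters and dependent marks, so the number of code points differs from what a reader perceives as characters. The helper name `describe` is an illustrative assumption.

```python
import unicodedata

def describe(text):
    """Return (code point, general category, Unicode name) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]

# The word "Hindi" in Devanagari, written with escapes so that the
# exact code point sequence is unambiguous: HA, vowel sign I, NA,
# virama, DA, vowel sign II.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

rows = describe(word)
for code_point, category, name in rows:
    print(code_point, category, name)

# Categories starting with "M" are marks that attach to a base letter.
mark_count = sum(1 for _, category, _ in rows if category.startswith("M"))
print(len(word), "code points, of which", mark_count, "are marks")
```

Software that counts, truncates or breaks text per code point can split a mark from its base letter, which is one reason line breaking, cursor movement and rendering need script-aware handling.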

Formatting Issues

  • From the user's perspective, formatting issues are the primary source of discrepancies when working with applications originally written for another language or culture/locale.

  • Developers should use the National Language Support (NLS) APIs in Windows or the System.Globalization namespace to handle most of these issues automatically.

    • Addresses

    • Currency

    • Dates

    • Numerals

    • Paper Sizes

    • Telephone Numbers

    • Time

    • Units of Measure
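
As an illustration of the numeral and currency items above, the following sketch groups digits the way Indian locales conventionally do: the last three digits form one group and the remaining digits are grouped in twos (thousand, lakh, crore). It is a hand-written teaching example, not the NLS or System.Globalization API; the function name `group_indian` is hypothetical, and production code would normally rely on locale data instead.

```python
def group_indian(n):
    """Format an integer with Indian digit grouping, e.g. 12,34,567."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    if len(digits) <= 3:
        return sign + digits
    # The final three digits form the first group from the right.
    head, tail = digits[:-3], digits[-3:]
    parts = []
    # Remaining digits are grouped in pairs, right to left.
    while len(head) > 2:
        parts.insert(0, head[-2:])
        head = head[:-2]
    parts.insert(0, head)
    return sign + ",".join(parts + [tail])

print(group_indian(4100000))   # 41 lakh
print(group_indian(700000))    # 7 lakh
```

The same value grouped Western style would read 4,100,000, which is why number formatting has to follow the user's locale and cannot be hard-coded.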

Localization – Tool for increasing Financial Sustainability

  • Training of local youth in Localized Content Creation

  • Working with Self Help Groups to up-lift their business

  • Identify Dynamically changing Local Content which helps in their local professions

  • E-Tutor

  • Entertainment during non-official hours

TDIL’s Efforts

  • A sustained, major national initiative running for more than a decade

  • Leading to development and consolidation of various language tools, resources and components

  • Continuous and untiring representation in various international and national standards bodies: ISO, UNICODE, W3C, IETF, ELRA and BIS

  • Represented and included 22 Indian languages in UNICODE

  • First in India to launch consortium-mode projects in the technology-intensive areas of Machine Translation, Cross-lingual Information Access, Text to Speech etc., to develop state-of-the-art technologies in Indian languages

  • Promotes futuristic research in Language Technology

National Roll-Out Plan – A Big Step Forward

  • CDs containing Software Tools and Fonts for all 22 Officially Recognized Languages released in public domain for free use

  • Contains Fonts, Localized Open Office, Keyboard drivers, E-mail clients and Firefox browsers in Indian languages

  • Freely downloadable from Indian Language Data centre –

  • Already crossed about 41 lakh downloads and 7.0 lakh shipments

  • NASSCOM may take active role towards proliferating the benefits of these language CDs

  • These free CDs would also benefit NGOs and CSC operators for developing and promoting local language contents.

Putting Standards in place


  • UNICODE – Default Text Encoding Standard.

  • Compatible with ISO 10646

  • Seamless data storage and search if data is stored in UNICODE

  • All 22 Officially recognized Indian Languages including Vedic Sanskrit represented in UNICODE

  • Declared as Text Encoding Standard for All E-Governance Applications
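
The point about seamless storage and search can be sketched as follows. Unicode allows some Devanagari letters to be written either as one code point or as a base letter plus a nukta, so a search that compares raw code points may miss matches. The sketch below assumes that text is normalized to NFC before it is stored or compared; the helper names `normalize` and `contains` are illustrative, and the choice of NFC is an assumption, not something prescribed in this presentation.

```python
import unicodedata

def normalize(text):
    """Bring text to one canonical form (NFC) before storing or comparing."""
    return unicodedata.normalize("NFC", text)

def contains(haystack, needle):
    """Substring search that ignores canonically equivalent spellings."""
    return normalize(needle) in normalize(haystack)

# Two canonically equivalent spellings of the same Devanagari letter:
precomposed = "\u0958"        # DEVANAGARI LETTER QA as a single code point
decomposed = "\u0915\u093c"   # DEVANAGARI LETTER KA followed by NUKTA

print(precomposed == decomposed)                        # raw comparison
print(normalize(precomposed) == normalize(decomposed))  # after normalization

# UTF-8 storage round trip: the bytes decode back to the same text.
stored = normalize(precomposed).encode("utf-8")
restored = stored.decode("utf-8")
```

Applying the same normalization on both the storage side and the query side is what keeps searches consistent across applications that produce different spellings.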

Extracting Knowledge from our vast ancient knowledge base

UNICODE encoding for Vedic Sanskrit and Grantha scripts: key to computerization of the knowledge base


Capturing Region Specific Requirements : Common Locale Data Repository (CLDR)

  • The Unicode CLDR provides key building blocks for software to support the world's languages.

  • CLDR is by far the largest and most extensive standard repository of locale data.

  • This data is used by a wide spectrum of companies for their software internationalization and localization: adapting software to the conventions of different languages for such common software tasks as formatting of dates, times, time zones, numbers, and currency values; sorting text; etc.

  • Locale data for Indian languages is in the process of modification

  • CLDR data for six languages (Hindi, Nepali, Bengali, Assamese, Malayalam and Gujarati) is finalized.

  • Other languages in process

Example of CLDR: Hindi

All region-specific requirements have been captured and put in the Hindi locale repository

Putting Standards in place… Contd. W3C

  • The World Wide Web Consortium (W3C) develops web standards for interoperable web solutions across platforms, devices and access methods

  • Ensures interoperability across major browsers: IE, Firefox, Opera etc.

  • Work has already started to represent all Indian languages in W3C standards.

  • Desirable: proactive participation by industry and industry bodies like NASSCOM

Putting Standards in place… Contd.

  • Keyboard Layouts

  • Open Type Fonts.. SakalBharti Fonts

  • Locale Data

  • Language Tags (for language negotiation on the Internet)

  • Domain Names in Indian Languages

  • IT Terminology

    … and Standards for major Linguistic Resources and Tools
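
To show how language tags support language negotiation, here is a minimal sketch that picks a content language from an HTTP `Accept-Language` header. It handles only the common `tag;q=value` form and falls back from a regional tag such as `mr-IN` to its primary language `mr`. The function names, the default of `en` and the list of supported languages are assumptions made for illustration, not part of any TDIL standard.

```python
def parse_accept_language(header):
    """Return (tag, quality) pairs sorted by descending quality."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        quality = 1.0  # a tag without q= has full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat malformed weights as "not wanted"
        prefs.append((tag.strip().lower(), quality))
    # sorted() is stable, so equal weights keep their header order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)

def negotiate(header, supported, default="en"):
    """Pick the best supported language for the given header."""
    by_lower = {tag.lower(): tag for tag in supported}
    for tag, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        if tag in by_lower:
            return by_lower[tag]
        primary = tag.split("-")[0]  # e.g. "mr-in" falls back to "mr"
        if primary in by_lower:
            return by_lower[primary]
    return default

# A browser preferring Marathi, then Hindi, then English.
print(negotiate("mr-IN,hi;q=0.8,en;q=0.5", ["en", "hi", "mr"]))
```

A service that hosts content in several Indian languages could use this kind of matching to serve each citizen the best available language without asking them to choose it on every visit.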

Collaboration and Hand Holding

  • Collaborative efforts are required for wider proliferation and sustained initiatives.

  • Govt., industry bodies and academia need to join hands to address the challenges of local language computing and to bring services closer to the doorsteps of millions of citizens in their own languages

धन्यवाद

Thank You

Swaran Lata, Director and HoD