Asian Languages on the Web

S. T. Nandasara, Lecturer

USCS, University of Colombo, Sri Lanka

Ashu Marasinghe

Associate Professor, LOP, Nagaoka University of Technology, Japan

Yoshiki Mikami

Professor, Leader, LOP, Nagaoka University of Technology, Japan

Asian Languages on the Web
  • Introduction of Asian Languages
  • Survey Objectives and Methodology
  • Asian Language Presence on the Web
  • Multilingualism in the Asian Web
  • Script and Encoding Issues
  • Asian Language Resource Network (ALRN) Project

Survey Objectives

  • To give an overview of Asian languages on the web
  • To describe the state of multilingualism in Asian country domains
    • Defined at various levels, from a personal or document level to a societal level
    • Multiple language presence in each country domain
    • Give an overview of cross-border languages
  • To shed light on script and encoding issues of Asian languages
    • To what extent is UCS/Unicode employed for Asian languages?
    • What scripts are actually used to represent a specific language?
    • To what extent are locally developed encodings used?
  • To define a future agenda that can guide us in realizing the vision of creating an observation-collection instrument for Asian languages

Survey Methodology

  • A web crawler (UbiCrawler) was used
    • It traces the links within each page and recursively crawls the newly discovered pages
  • The collection of downloaded web pages was passed to the language identification engine
  • The language properties of the pages were then identified
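The link-tracing step above can be sketched as a breadth-first traversal restricted to the country domains of interest. This is only an illustration, not UbiCrawler's actual design: the `crawl` and `in_scope` functions, the three-entry ccTLD subset, and the in-memory `TOY_WEB` link table are all hypothetical, and a queue stands in for the recursion described on the slide so that no network access is needed.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical subset of the 42 ccTLDs, just for the demonstration.
ASIAN_CCTLDS = {"lk", "np", "th"}

def in_scope(url, cctlds=ASIAN_CCTLDS):
    """True if the URL's host ends in one of the target country-code TLDs."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] in cctlds

def crawl(seeds, fetch_links, cctlds=ASIAN_CCTLDS):
    """Visit every in-scope page reachable from the seeds, each exactly once."""
    queue = deque(u for u in seeds if in_scope(u, cctlds))
    seen = set(queue)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen and in_scope(link, cctlds):
                seen.add(link)
                queue.append(link)
    return visited

# Invented link graph standing in for real pages (assumption, not survey data).
TOY_WEB = {
    "http://a.example.lk/": ["http://b.example.lk/", "http://c.example.com/"],
    "http://b.example.lk/": ["http://a.example.lk/", "http://d.example.np/"],
    "http://d.example.np/": [],
}

pages = crawl(["http://a.example.lk/"], lambda url: TOY_WEB.get(url, []))
print(pages)
```

The `.com` link is discovered but never followed, which is how a crawl can be confined to a chosen set of country domains.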

Web Pages Collected

  • Focused on web pages in 42 country domains in Asia.
  • The crawl was begun from a seed file containing 13,286 URLs
  • The list of ccTLDs contains ae, af, az, bd, bh, bn, bt, cy, id, il, in, iq, ir, jo, kg, kh, kw, kz, la, lb, lk, mm, mn, mv, my, np, om, ph, pk, ps, qa, sa, sg, sy, th, tj, tm, tp, tr, uz, vn and ye.
  • The Asia crawl started from 5th July 2006 at 11:00hrs and ended on 19th July 2006 at 19:03hrs
  • Downloaded 107,141,679 web pages in total, 652,710,237,381 bytes in size
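As a rough sanity check on the figures above, the crawl duration, the average page size, and the average download rate can be derived directly from the reported numbers. The variable names are illustrative, and the sketch assumes both timestamps are in the same time zone, which the slide does not state.

```python
from datetime import datetime

# Figures as reported on the slide.
start = datetime(2006, 7, 5, 11, 0)
end = datetime(2006, 7, 19, 19, 3)
pages = 107_141_679
total_bytes = 652_710_237_381

duration = end - start
avg_page_bytes = total_bytes / pages
pages_per_second = pages / duration.total_seconds()

print(duration)                    # 14 days, 8:03:00
print(round(avg_page_bytes))       # 6092
print(round(pages_per_second, 1))  # 86.5
```

So the crawl ran for a little over two weeks, averaging roughly 6 KB per page and about 86 pages per second.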

Language Identification Process

  • The language identification engine LIM (Language Identification Module) was used
  • LIM consists of two components
    • Training component
      • Training data is translations of the Universal Declaration of Human Rights (UDHR) provided by the United Nations Office of the High Commissioner for Human Rights
    • Identification component
  • LIM can simultaneously detect the triplet of language, script and encoding scheme
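The train-then-identify split, and the idea of detecting language, script and encoding in one step, can be illustrated with a toy classifier. The slides do not say how LIM scores a page, so the byte-trigram overlap used here is an assumption chosen for simplicity. The names `byte_trigrams`, `train` and `identify` are hypothetical, and the short sample sentences merely stand in for full UDHR translations.

```python
def byte_trigrams(data: bytes) -> set:
    """Return the set of overlapping 3-byte windows in the data."""
    return {data[i:i + 3] for i in range(len(data) - 2)}

# (language, script, encoding) -> (sample text, Python codec name).
# One model per triplet, so the same language appears once per encoding.
TURKISH = "Bütün insanlar hür, haysiyet ve haklar bakımından eşit doğarlar."
TRAINING = {
    ("English", "Latin", "US-ASCII"): (
        "All human beings are born free and equal in dignity and rights.",
        "ascii",
    ),
    ("Turkish", "Latin", "ISO-8859-9"): (TURKISH, "iso8859_9"),
    ("Turkish", "Latin", "UTF-8"): (TURKISH, "utf-8"),
    ("Hindi", "Devanagari", "UTF-8"): (
        "सभी मनुष्यों को गौरव और अधिकारों के मामले में",
        "utf-8",
    ),
}

def train(training):
    """Training component: build one byte-trigram model per triplet."""
    return {label: byte_trigrams(text.encode(codec))
            for label, (text, codec) in training.items()}

def identify(data: bytes, models):
    """Identification component: pick the triplet whose model shares the most trigrams."""
    query = byte_trigrams(data)
    return max(models, key=lambda label: len(query & models[label]))

MODELS = train(TRAINING)
page = "insanlar eşit doğarlar".encode("iso8859_9")
print(identify(page, MODELS))
```

Because the models are built from raw bytes rather than decoded text, the same Turkish sentence yields different models under ISO-8859-9 and UTF-8, which is what lets a single lookup return the encoding along with the language and script.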

Discovered 55 Asian languages

Chinese, Japanese and Korean are excluded from the analysis

Hebrew, Thai, Turkish, Vietnamese, Arabic, Tatar, Farsi, Javanese, Indonesian, Malay, Sundanese, Hindi, Dari, Uzbek, Mongolian, Kazakh, Madurese, Uighur, Kashmiri, Pushtu, Balochi, Turkmen, Minangkabau, Bikol, Kyrgyz, Balinese, Punjabi, Sindhi, Achehnese, Sinhala, Kapampangan, Iloko, Bengali & Assamese, Filipino, Waray, Buginese, Burmese, Kurdish, Tajiki, Azeri, Tamil, Hiligaynon, Dhivehi, Bhojpuri, Tibetan, Cebuano, Telugu, Saraiki, Lao, Gujarati, Pashto, Kannada, Urdu, Khmer, Hani


Multilingualism by Country Domain

  • The most recent version of Ethnologue lists close to seven thousand languages around the world.
  • More than 2600 of them are spoken in the Asian region.
  • Large-scale linguistic diversity is observable in Asia. Among the 2600, only around 51 languages are recognized by Asian governments as official or national languages
    • Indonesia has the richest diversity of languages in the region
    • Interestingly, there are significantly more pages in Javanese than in either Indonesian or Malay
    • The major languages found in Indonesia, Malaysia, Brunei, Singapore, Southern Thailand and the Philippines can be grouped under a single root Malay language spoken in different dialects.
    • Javanese has a dominating web presence in Indonesia.
    • The lesser Sundanese, Madurese, Achehnese and Buginese languages are found to be of great importance to Indonesia’s local language diversity on the Internet

Cross-Border Languages

  • Another aspect of multilingualism in the region is the overwhelming presence of cross-border languages on the web
  • Two categories of languages are defined
    • The first category is “local languages”, which are the officially recognized language(s) and home speakers’ languages of the state
    • The second category is “cross-border languages”, such as English, French, Russian and Arabic, which are used as languages of communication among the peoples of different nations
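The two categories above suggest a simple way to summarize a crawl: tally identified pages per country domain and compute the share written in cross-border languages. The sketch below shows the idea only. The `records` list is invented toy data, not survey results, and the function names are hypothetical. Treating Arabic as cross-border everywhere is a simplification, since it is also a local language in several of the surveyed domains.

```python
from collections import Counter, defaultdict

# The four cross-border languages named on the slide.
CROSS_BORDER = {"English", "French", "Russian", "Arabic"}

# Invented (ccTLD, language) records for illustration only.
records = [
    ("lk", "Sinhala"), ("lk", "Tamil"), ("lk", "English"), ("lk", "English"),
    ("kz", "Kazakh"), ("kz", "Russian"), ("kz", "Russian"), ("kz", "Russian"),
]

def tally(records):
    """Count pages per language within each country domain."""
    by_domain = defaultdict(Counter)
    for cctld, language in records:
        by_domain[cctld][language] += 1
    return by_domain

def cross_border_share(counts):
    """Fraction of a domain's pages written in a cross-border language."""
    total = sum(counts.values())
    cross = sum(n for lang, n in counts.items() if lang in CROSS_BORDER)
    return cross / total

by_domain = tally(records)
shares = {cctld: cross_border_share(c) for cctld, c in by_domain.items()}
print(shares)  # {'lk': 0.5, 'kz': 0.75}
```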

Same Script Shared by Various Languages

Devanagari Script used by

  • More than 480 million speakers
    • Hindi
  • More than 10 million speakers
    • Marathi
    • Nepali
  • More than 1 million speakers
    • Awadhi
    • Bhojpuri
    • Braj Bhasha
    • Chhattisgarhi
    • Konkani
    • Kachchi
    • Marwari
    • Maithili
    • Magahi
  • Scholars’ language
    • Sanskrit

Less than 1 million speakers
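Because so many languages share Devanagari, detecting the script of a page is not enough to identify its language. A minimal sketch of script detection makes the point. It relies on a heuristic, not an official API: Python's standard library does not expose the Unicode Script property, so the first word of each character's Unicode name is used as a stand-in. The `dominant_script` function and the one-word samples are illustrative.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Guess the main script from the first word of each character's Unicode name."""
    counts = Counter()
    for ch in text:
        # Count letters and combining marks only; skip spaces, digits, punctuation.
        if unicodedata.category(ch)[0] in ("L", "M"):
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# The names of three languages written in their own script, plus English.
samples = {
    "Hindi": "हिन्दी",
    "Marathi": "मराठी",
    "Nepali": "नेपाली",
    "English": "English",
}
scripts = {lang: dominant_script(text) for lang, text in samples.items()}
print(scripts)
```

All three Indic samples come back as `DEVANAGARI`, so telling Hindi from Marathi or Nepali has to rely on language-level evidence rather than on the script alone.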

UDHR Document by Major Script Grouping

Representation of the UDHR Document by Major Script Grouping

[1] Cumulative speaker population based on Ethnologue, “Languages of the World”, 15th ed. (2005)


Asian Language Resources – Agenda

ALRN Mission

To create a network of qualified Asian partners to specify and support the development of high-priority Language Resources (LRs) for Asian languages in a systematic, standards-driven, collaborative and learning context. The project will focus on identifying the state of the art of LRs in the region, assessing priority requirements through consultations with language research, industry and communication players, and establishing a protocol and standards for developing an LR Network for the languages spoken in the region.

ALRN Action Plan

  • The project will focus on South, South East, Central & West Asian languages
  • Act as an umbrella for Asian Language Resources (ALR)
  • Accommodate secure and sustainable UTF-based encoding
  • Take advantage of existing organizations such as the Language Observatory Project (LOP, TCL)
  • Corpus collection from the web using LO’s crawler/language identifier
  • Language resources originating from Japan, with parallel corpora available in other languages (UDHR, Oshin, One Straw Revolution, etc.)
  • Multilingual terminology dictionary
  • Information standards for language corpus building
  • Liaison with international organizations such as UNESCO, UDHR, etc.
  • Information resource sharing web site (

Asian Academy of Languages …?


Thank you

Danke schön

Language Presence in Asian Countries

(The exact number of languages may never be determined)


Language Diversity

(Half of the world’s languages are spoken in only eight countries)


Will Cover

4 Asian Regions (West, Central, South & South East Asia)

42 Countries

9 Language Families

62 Languages

18 Major Scripts

Asian Language Resources Network - Agenda