
Asian Languages on the Web

S. T. Nandasara

Lecturer, USCS, University of Colombo, Sri Lanka

Ashu Marasinghe

Associate Professor, LOP, Nagaoka University of Technology, Japan

Yoshiki Mikami

Professor, Leader, LOP, Nagaoka University of Technology, Japan



Asian Languages on the Web

  • Introduction of Asian Languages

  • Survey Objectives and Methodology

  • Asian Language Presence on the Web

  • Multilingualism in the Asian Web

  • Script and Encoding Issues

  • Asian Language Resource Network (ALRN) Project



Survey Objectives

  • To give an overview of Asian languages on the web

  • To describe the state of multilingualism in Asian country domains

    • Defined at various levels, from a personal or document level to a societal level

    • Multiple language presence in each country domain

    • Give an overview of cross-border languages

  • To shed light on script and encoding issues of Asian languages

    • To what extent is UCS/Unicode employed for Asian languages?

    • What scripts are actually used to represent a specific language?

    • To what extent are locally developed encodings used?

  • To define a future agenda that can guide us in realizing the vision of an observation and collection instrument for Asian languages



Survey Methodology

  • A web crawler (UbiCrawler) was used

    • It follows the links within each page and recursively crawls the newly discovered pages

  • The collection of downloaded web pages was passed to the language identification engine

  • The language properties of the pages were identified
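
The link-following step described above can be sketched roughly as follows. This is a minimal illustration and not UbiCrawler's implementation: the `crawl` function, the `fetch` callback and the toy pages are all hypothetical, and an iterative breadth-first queue stands in for the recursive crawl.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, allowed_tlds):
    """Breadth-first crawl from the seed URLs, staying inside the allowed ccTLDs.

    `fetch` is a caller-supplied function (hypothetical here) that maps a URL
    to its HTML text, or returns None when the page cannot be retrieved.
    """
    seen = set(seeds)
    queue = deque(seeds)
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            host = urlparse(target).hostname or ""
            tld = host.rsplit(".", 1)[-1]
            if tld in allowed_tlds and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# Toy in-memory "web" (made-up pages) so the sketch runs without network access.
TOY_WEB = {
    "http://a.example.lk/": '<a href="/about">About</a> <a href="http://b.example.np/">B</a>',
    "http://a.example.lk/about": '<a href="http://c.example.com/">Outside</a>',
    "http://b.example.np/": '<a href="http://a.example.lk/">Back</a>',
    "http://c.example.com/": "never fetched: .com is outside the crawl",
}

collected = crawl(["http://a.example.lk/"], TOY_WEB.get, {"lk", "np"})
```

The `seen` set keeps the crawl from revisiting pages that link back to each other, and the ccTLD check keeps it inside the surveyed country domains.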



Web Pages Collected

  • Focused on web pages in 42 country domains in Asia.

  • The crawl was begun from a seed file containing 13,286 URLs

  • The list of ccTLDs contains ae, af, az, bd, bh, bn, bt, cy, id, il, in, iq, ir, jo, kg, kh, kw, kz, la, lb, lk, mm, mn, mv, my, np, om, ph, pk, ps, qa, sa, sg, sy, th, tj, tm, tp, tr, uz, vn and ye.

  • The Asia crawl ran from 5 July 2006, 11:00 hrs to 19 July 2006, 19:03 hrs

  • Downloaded 107,141,679 web pages in total, 652,710,237,381 bytes in size
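
As a rough sanity check on these figures, the crawl duration, average throughput and average page size can be derived from the numbers on this slide. The sketch assumes that both timestamps are in the same time zone and that the crawl ran without interruption, neither of which the slide states.

```python
from datetime import datetime

# ccTLD list as given on the slide
CCTLDS = (
    "ae af az bd bh bn bt cy id il in iq ir jo kg kh kw kz la lb lk "
    "mm mn mv my np om ph pk ps qa sa sg sy th tj tm tp tr uz vn ye"
).split()

start = datetime(2006, 7, 5, 11, 0)
end = datetime(2006, 7, 19, 19, 3)
total_pages = 107_141_679
total_bytes = 652_710_237_381

duration = end - start
pages_per_second = total_pages / duration.total_seconds()
avg_page_bytes = total_bytes / total_pages

print(len(CCTLDS), "country domains")
print("crawl duration:", duration)
print(f"{pages_per_second:.1f} pages/s, {avg_page_bytes:.0f} bytes per page on average")
```

Under those assumptions the crawl lasted a little over 14 days, at roughly 86 pages per second and about 6 KB per page.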



Downloaded Pages by ccTLD – Top 10



Downloaded Pages by ccTLD – Least 10



Language Identification Process

  • The language identification engine LIM (Language Identification Module) was used

  • LIM consists of two components

    • Training component

      • Training data: translations of the Universal Declaration of Human Rights (UDHR) provided by the United Nations Office of the High Commissioner for Human Rights

    • Identification component

  • LIM can simultaneously detect the triplet of language, script and encoding scheme
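
The slides do not describe how LIM works internally, so the following is only a sketch of one common approach, byte n-gram matching, showing how a training step and an identification step can yield a (language, script, encoding) triplet. The function names are hypothetical, and two single sentences stand in for the full UDHR translations.

```python
from collections import Counter


def ngrams(data: bytes, max_n: int = 3):
    """Yield every byte n-gram of length 1..max_n found in `data`."""
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            yield data[i:i + n]


def train(samples):
    """Training step: build one set of byte n-grams per triplet."""
    return {
        (lang, script, enc): set(ngrams(text.encode(enc)))
        for (lang, script, enc), text in samples.items()
    }


def identify(profiles, page: bytes):
    """Identification step: pick the triplet whose profile covers the most n-grams."""
    counts = Counter(ngrams(page))

    def score(profile):
        return sum(c for gram, c in counts.items() if gram in profile)

    return max(profiles, key=lambda triplet: score(profiles[triplet]))


# Tiny stand-in training texts; a real profile would use far more text.
ENGLISH = "All human beings are born free and equal in dignity and rights."
TURKISH = "Bütün insanlar hür, haysiyet ve haklar bakımından eşit doğarlar."

profiles = train({
    ("English", "Latin", "utf-8"): ENGLISH,
    ("Turkish", "Latin", "utf-8"): TURKISH,
    ("Turkish", "Latin", "iso8859_9"): TURKISH,  # same text, legacy encoding
})

print(identify(profiles, "eşit doğarlar".encode("iso8859_9")))
```

Because the profiles are built from bytes rather than characters, the same Turkish text encoded as UTF-8 and as ISO-8859-9 produces different profiles, which is what lets a single pass distinguish the encoding as well as the language.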



Discovered 55 Asian languages

Chinese, Japanese and Korean are excluded from the analysis

Hebrew, Thai, Turkish, Vietnamese, Arabic, Tatar, Farsi, Javanese, Indonesian, Malay, Sundanese, Hindi, Dari, Uzbek, Mongolian, Kazakh, Madurese, Uighur, Kashmiri, Pushtu, Balochi, Turkmen, Minangkabau, Bikol, Kyrgyz, Balinese, Punjabi, Sindhi, Achehnese, Sinhala, Kapampangan, Iloko, Bengali & Assamese, Filipino, Waray, Buginese, Burmese, Kurdish, Tajiki, Azeri, Tamil, Hiligaynon, Dhivehi, Bhojpuri, Tibetan, Cebuano, Telugu, Saraiki, Lao, Gujarati, Pashto, Kannada, Urdu, Khmer, Hani



Number of web pages per 1,000 population



Number of pages by language – Top 10



Number of pages by language – Least 10



Multilingualism by Country Domain

  • The most recent version of Ethnologue lists close to seven thousand languages around the world.

  • More than 2600 of them are spoken in the Asian region.

  • Large scale linguistic diversity is observable in Asia. Among the 2600, only around 51 languages are recognized by Asian governments as official or national language(s)

    • Indonesia has the richest diversity of languages in the region

    • Notably, there are significantly more pages in Javanese than in either Indonesian or Malay

    • The major languages found in Indonesia, Malaysia, Brunei, Singapore, Southern Thailand and the Philippines can be traced to a single root language, Malay, spoken in different dialects

    • Javanese has a dominating web presence in Indonesia.

    • The lesser Sundanese, Madurese, Achehnese and Buginese languages are found to be of great importance to Indonesia’s local language diversity on the Internet



Cross-Border Languages

  • Another aspect of the multilingualism in the region is the overwhelming presence of cross-border languages on the web

  • Two categories of languages were defined

    • The first category is “local languages”: the officially recognized language(s) of the state and the languages its people speak at home

    • The second category is “cross-border languages”, such as English, French, Russian and Arabic, which are used for communication among the peoples of different nations
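
One way to apply the two categories to per-domain page counts is sketched below. The rule that a language recognized in a state counts as local there even if it also serves as a cross-border language elsewhere (for example Arabic in an Arabic-speaking country) is an assumption, as are the function names and the made-up page counts.

```python
# Cross-border languages named on the slide
CROSS_BORDER = {"English", "French", "Russian", "Arabic"}


def classify(language, local_languages):
    """Label a language as local, cross-border or other for one country domain."""
    if language in local_languages:  # assumption: local status takes precedence
        return "local"
    if language in CROSS_BORDER:
        return "cross-border"
    return "other"


def shares(page_counts, local_languages):
    """Return the fraction of pages falling into each category."""
    total = sum(page_counts.values())
    result = {"local": 0.0, "cross-border": 0.0, "other": 0.0}
    if total == 0:
        return result
    for language, pages in page_counts.items():
        result[classify(language, local_languages)] += pages / total
    return result


# Made-up page counts for a hypothetical country domain
example_counts = {"Sinhala": 200, "Tamil": 100, "English": 600, "Hindi": 100}
example_shares = shares(example_counts, local_languages={"Sinhala", "Tamil"})
print(example_shares)
```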



Cross-Border Language Presence

West Asia



Central Asia



South East Asia



South Asia



Script Diversity of Asia



Same Script Shared by Various Languages

Devanagari Script used by

  • More than 480 million speakers

    • Hindi

  • More than 10 million speakers

    • Marathi

    • Nepali

  • More than 1 million speakers

    • Awadhi

    • Bhojpuri

    • Braj Bhasha

    • Chhattisgarhi

    • Konkani

    • Kachchi

    • Marwari

    • Maithili

    • Magahi

  • Scholars’ language

    • Sanskrit

  • Less than 1 million speakers

    • Kului

    • Kumaoni

    • Khadiya

    • Khortha

    • Kurku

    • Kurukh

    • Kurmali

    • Palpa

    • Panchpargania

    • Santali

    • Nagpuri

    • Kankan

    • Limbu

    • Sherpa

    • Garhwali

    • Mundari

    • Newari

    • Bagheli

    • Bhatneri

    • Bathi

    • Bateri

    • Bhili

    • Gondi

    • Jaipuri

    • Harauti

    • Ho

    • Kachchhi

    • Kanauji

    • Khorthi
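
Because so many languages share Devanagari, detecting the script of a page does not by itself determine the language. A small sketch of that point follows: it uses the first word of each character's Unicode name as a rough proxy for its script, which is an approximation and not an official script property. The function name and the sample words are illustrative only.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most frequent script label among letters and marks, or None."""
    scripts = Counter()
    for ch in text:
        # Only letters (L*) and combining marks (M*) carry script information here.
        if unicodedata.category(ch)[0] in "LM":
            name = unicodedata.name(ch, "")
            if name:
                scripts[name.split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else None


# The names of three languages, each written in its own language
samples = {"Hindi": "हिन्दी", "Marathi": "मराठी", "Nepali": "नेपाली"}

for language, word in samples.items():
    print(language, dominant_script(word))
```

All three samples report the same script label, so a further language-level step is still needed to tell Hindi, Marathi and Nepali pages apart.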



UDHR Document by Major Script Grouping

Representation of the UDHR Document by Major Script Grouping

[1] Cumulative speaker population based on Ethnologue: Languages of the World, 15th ed. (2005)



UTF-8 Encoding in Selected Languages
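
The slides do not say how UTF-8 usage was measured. One simple check, sketched here as an assumption, is to test whether the raw bytes of each page decode as strict UTF-8 and then compute the share of pages that do. Note that pure ASCII pages also pass this test, so a real survey would need to treat them separately.

```python
def is_valid_utf8(page: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        page.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def utf8_share(pages):
    """Fraction of pages whose bytes are valid UTF-8."""
    pages = list(pages)
    if not pages:
        return 0.0
    return sum(is_valid_utf8(p) for p in pages) / len(pages)


# Illustrative byte strings standing in for downloaded pages
example_pages = [
    "සිංහල".encode("utf-8"),
    "नेपाली".encode("utf-8"),
    "Türkçe".encode("utf-8"),
    "Türkçe".encode("iso8859_9"),  # legacy 8-bit encoding, not valid UTF-8
]

print(utf8_share(example_pages))
```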


Asian Language Resources – Agenda

ALRN Mission

To create a network of qualified Asian partners to specify and support the development of high priority Language Resources (LRs) for Asian Languages in a systematic, standards-driven, collaborative and learning context. The project will focus on identifying the state of the art of LRs in the region, assessing priority requirements through consultations with language research, industry and communication players, and establishing a protocol and standards for developing a LR Network for the languages spoken in the region.



ALRN Action Plan

  • The project will focus on South, South East, Central and West Asian languages

  • Act as an umbrella with Asian Language Resources (ALR)

  • Accommodate secure and sustainable UTF-based encoding

  • Take advantage of existing organizations such as the Language Observatory Project (LOP, TCL)

  • Corpus collection from the web using LO’s crawler/language identifier

  • Language resources originating from Japan with parallel corpora available in other languages (UDHR, Oshin, One Straw Revolution, etc.)

  • Multilingual Terminology Dictionary

  • Information standards for language corpus building

  • Liaison with international organizations such as UNESCO, UDHR, etc.

  • Information resource sharing web site (www.language-resource.net)

Asian Academy of Languages …?



Thank you

Danke schön

Merci

Gracias

Obrigado

Grazie

Danke

Spasibo

Ευχαριστώ



Language Presence in Asian Countries

(The exact number of languages may never be determined)



Language Diversity

(Half of the world’s languages are spoken in only eight countries)



Asian Language Recognition


Asian Language Resources Network - Agenda

Will Cover

4 Asian Regions (West, Central, South & South East Asia)

42 Countries

9 Language Families

62 Languages

18 Major Scripts

