Asian Languages on the Web
S. T. Nandasara, Lecturer, USCS, University of Colombo, Sri Lanka
Ashu Marasinghe, Associate Professor, LOP, Nagaoka University of Technology, Japan
Yoshiki Mikami, Professor, Leader, LOP, Nagaoka University of Technology, Japan

Presentation Transcript


  1. Asian Languages on the Web • S. T. Nandasara, Lecturer, USCS, University of Colombo, Sri Lanka • Ashu Marasinghe, Associate Professor, LOP, Nagaoka University of Technology, Japan • Yoshiki Mikami, Professor, Leader, LOP, Nagaoka University of Technology, Japan

  2. Asian Languages on the Web • Introduction of Asian Languages • Survey Objectives and Methodology • Asian Language Presence on the Web • Multilingualism in the Asian Web • Script and Encoding Issues • Asian Language Resource Network (ALRN) Project

  3. Survey Objectives • Give an overview of Asian languages on the web • Describe the state of multilingualism in Asian country domains • Multilingualism is defined at various levels, from a personal or document level to a societal level • Multiple language presence in each country domain • Give an overview of cross-border languages • Shed light on script and encoding issues of Asian languages • To what extent is UCS/Unicode employed for Asian languages? • What scripts are actually used to represent a specific language? • To what extent are locally developed encodings used? • Define a future agenda to guide the creation of an observation and collection instrument for Asian languages.

  4. Survey Methodology • A web crawler (UbiCrawler) was used • It follows the links within each page and recursively crawls the newly discovered pages • The collection of downloaded web pages was passed to the language identification engine • The language properties of the pages were identified
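The link-following step on this slide can be sketched as a breadth-first crawl. This is only an illustration under stated assumptions, not the implementation of the crawler used in the survey: `crawl`, `in_scope`, `fetch_links` and the `FAKE_WEB` table are hypothetical names, an in-memory link table stands in for real HTTP requests, and restricting the crawl to a set of ccTLDs is an assumption about how the country-domain focus might be enforced.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> outgoing links (stands in for HTTP fetches).
FAKE_WEB = {
    "http://a.example.lk/": ["http://a.example.lk/p1", "http://b.example.in/"],
    "http://a.example.lk/p1": ["http://a.example.lk/", "http://c.example.com/"],
    "http://b.example.in/": ["http://a.example.lk/p1"],
    "http://c.example.com/": [],
}


def in_scope(url, cctlds):
    """True if the URL's host ends in one of the allowed country-code TLDs."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] in cctlds


def crawl(seeds, fetch_links, cctlds):
    """Breadth-first crawl: visit seeds, then every newly discovered in-scope link."""
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen and in_scope(seed, cctlds):
            seen.add(seed)
            queue.append(seed)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            # Skip pages already queued and pages outside the country domains.
            if link not in seen and in_scope(link, cctlds):
                seen.add(link)
                queue.append(link)
    return visited


pages = crawl(["http://a.example.lk/"], lambda u: FAKE_WEB.get(u, []), {"lk", "in"})
print(pages)
```

In a real crawl the `fetch_links` callback would download the page and extract its anchors; the visited pages would then be handed to the language identification step.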

  5. Web Pages Collected • The survey focused on web pages in 42 country domains in Asia. • The crawl began from a seed file containing 13,286 URLs. • The list of ccTLDs comprises ae, af, az, bd, bh, bn, bt, cy, id, il, in, iq, ir, jo, kg, kh, kw, kz, la, lb, lk, mm, mn, mv, my, np, om, ph, pk, ps, qa, sa, sg, sy, th, tj, tm, tp, tr, uz, vn and ye. • The Asia crawl started on 5 July 2006 at 11:00 hrs and ended on 19 July 2006 at 19:03 hrs. • A total of 107,141,679 web pages were downloaded, 652,710,237,381 bytes in size.
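The totals on this slide allow two derived figures: the average page size and the average download rate. The short calculation below is a sketch that assumes both timestamps are in the same time zone and that the crawl ran without interruption, neither of which the slide states. Under those assumptions it gives roughly 6.1 KB per page and about 86 pages per second.

```python
from datetime import datetime

# Figures taken from the slide.
pages = 107_141_679
total_bytes = 652_710_237_381
start = datetime(2006, 7, 5, 11, 0)   # assumed same time zone as `end`
end = datetime(2006, 7, 19, 19, 3)

elapsed_seconds = (end - start).total_seconds()
avg_page_bytes = total_bytes / pages
pages_per_second = pages / elapsed_seconds

print(f"crawl duration:    {elapsed_seconds / 86400:.2f} days")
print(f"average page size: {avg_page_bytes:.0f} bytes")
print(f"download rate:     {pages_per_second:.1f} pages/s")
```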

  6. Downloaded Pages by ccTLD – Top 10

  7. Downloaded Pages by ccTLD – Least 10

  8. Language Identification Process • The language identification engine LIM (Language Identification Module) was used • LIM consists of two components • The first is the training component • The training data are translations of the Universal Declaration of Human Rights (UDHR) provided by the United Nations Office of the High Commissioner for Human Rights • The second is the identification component • LIM can simultaneously detect the triplet of language, script and encoding scheme
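The slide does not say how LIM matches a page against its training data. The sketch below assumes one common approach, byte n-gram profile matching, purely to illustrate how a single pass can return the (language, script, encoding) triplet. The function names `ngrams`, `train` and `identify` are hypothetical, and short sample sentences stand in for the full UDHR translations.

```python
from collections import Counter


def ngrams(data: bytes, n: int = 3) -> Counter:
    """Count every run of n consecutive bytes in the data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def train(samples):
    """Build one byte n-gram profile per (language, script, encoding) label.

    Each training text is encoded with the encoding named in its label, so
    the same language yields different profiles under different encodings.
    """
    return {label: ngrams(text.encode(label[2])) for label, text in samples.items()}


def identify(page: bytes, profiles):
    """Return the label whose profile shares the most n-grams with the page."""
    target = ngrams(page)

    def score(profile):
        return sum(min(count, profile[gram]) for gram, count in target.items())

    return max(profiles, key=lambda label: score(profiles[label]))


HINDI = "सभी मनुष्यों को गौरव और अधिकारों के मामले में जन्मजात स्वतन्त्रता और समानता प्राप्त है"
SAMPLES = {
    ("English", "Latin", "ascii"):
        "all human beings are born free and equal in dignity and rights",
    ("Hindi", "Devanagari", "utf-8"): HINDI,
    ("Hindi", "Devanagari", "utf-16-le"): HINDI,
}

profiles = train(SAMPLES)
print(identify("मनुष्यों के अधिकारों".encode("utf-8"), profiles))
print(identify("born free and equal".encode("ascii"), profiles))
```

Because the profiles are built from bytes rather than decoded characters, the same Hindi text produces distinct profiles for UTF-8 and UTF-16, which is what lets one lookup separate encoding from language and script.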

  9. Discovered 55 Asian Languages • Chinese, Japanese and Korean are excluded from the analysis • Hebrew, Thai, Turkish, Vietnamese, Arabic, Tatar, Farsi, Javanese, Indonesian, Malay, Sundanese, Hindi, Dari, Uzbek, Mongolian, Kazakh, Madurese, Uighur, Kashmiri, Pushtu, Balochi, Turkmen, Minangkabau, Bikol, Kyrgyz, Balinese, Punjabi, Sindhi, Achehnese, Sinhala, Kapampangan, Iloko, Bengali & Assamese, Filipino, Waray, Buginese, Burmese, Kurdish, Tajiki, Azeri, Tamil, Hiligaynon, Dhivehi, Bhojpuri, Tibetan, Cebuano, Telugu, Saraiki, Lao, Gujarati, Pashto, Kannada, Urdu, Khmer, Hani

  10. Number of web pages per 1,000 population

  11. Number of pages by language – Top 10

  12. Number of pages by language – Least 10

  13. Multilingualism by Country Domain • The most recent version of Ethnologue lists close to seven thousand languages around the world. • More than 2,600 of them are spoken in the Asian region. • Large-scale linguistic diversity is observable in Asia. Of those 2,600, only around 51 languages are recognized by Asian governments as official or national languages. • Indonesia has the richest diversity of languages in the region. • Notably, there are significantly more pages in Javanese than in either Indonesian or Malay. • The major languages found in Indonesia, Malaysia, Brunei, Singapore, southern Thailand and the Philippines can be categorized under a single root language, Malay, spoken in different dialects. • Javanese has a dominant web presence in Indonesia. • The smaller Sundanese, Madurese, Achehnese and Buginese languages are found to be of great importance to Indonesia’s local language diversity on the Internet.

  14. Cross-Border Languages • Another aspect of multilingualism in the region is the overwhelming presence of cross-border languages on the web • Two categories of languages are defined • The first category is “local languages”, which are the officially recognized language(s) of the state and the home languages of its speakers • The second category is “cross-border languages”, such as English, French, Russian and Arabic, which are used as languages of communication among the peoples of different nations

  15. Cross-Border Language Presence West Asia

  16. Central Asia

  17. South East Asia

  18. South Asia

  19. Script Diversity of Asia

  20. Same Script Shared by Various Languages • Devanagari script is used by: • More than 480 million speakers: Hindi • More than 10 million speakers: Marathi, Nepali • More than 1 million speakers: Awadhi, Bhojpuri, Braj Bhasha, Chhattisgarhi, Konkani, Kachchi, Marwari, Maithili, Magahi • Scholars’ language: Sanskrit • Less than 1 million speakers: Kului, Kumaoni, Khadiya, Khortha, Kurku, Kurukh, Kurmali, Palpa, Panchpargania, Santali, Nagpuri, Kankan, Limbu, Sherpa, Garhwali, Mundari, Newari, Bagheli, Bhatneri, Bathi, Bateri, Bhili, Gondi, Jaipuri, Harauti, Ho, Kachchhi, Kanauji, Khorthi

  21. UDHR Document by Major Script Grouping • Representation of the UDHR document by major script grouping • [1] Cumulated speaker population based on Ethnologue, “Languages of the World”, 15th ed. (2005)

  22. UTF-8 Encoding in Selected Languages

  23. Asian Language Resources – Agenda: ALRN Mission • To create a network of qualified Asian partners to specify and support the development of high-priority Language Resources (LRs) for Asian languages in a systematic, standards-driven, collaborative and learning context. • The project will focus on identifying the state of the art of LRs in the region, assessing priority requirements through consultations with language research, industry and communication players, and establishing a protocol and standards for developing an LR network for the languages spoken in the region.

  24. ALRN Action Plan • The project will focus on South, South East, Central and West Asian languages • Act as an umbrella for Asian Language Resources (ALR) • Accommodate secure and sustainable UTF-based encoding • Take advantage of existing organizations such as the Language Observatory Project (LOP, TCL) • Corpus collection from the web using LO’s crawler/language identifier • Language resources originating from Japan, with parallel corpora available in other languages (UDHR, Oshin, One Straw Revolution, etc.) • Multilingual terminology dictionary • Information standards for language corpus building • Liaison with international organizations such as UNESCO, UDHR, etc. • Information resource sharing web site (www.language-resource.net) • Asian Academy of Languages …?

  25. Thank you • Danke schön • Merci • Gracias • Obrigado • Grazie • Danke • Spasibo • Ευχαριστώ

  26. Language Presence in Asian Countries (The exact number of languages may never be determined)

  27. Language Diversity (Half of the world’s languages are spoken in only eight countries)

  28. Asian Language Recognition

  29. Asian Language Resources Network – Agenda • Will cover 4 Asian regions (West, Central, South & South East Asia) • 42 countries • 9 language families • 62 languages • 18 major scripts
