Asian Languages on the Web. S. T. Nandasara, Lecturer, USCS, University of Colombo, Sri Lanka; Ashu Marasinghe, Associate Professor, LOP, Nagaoka University of Technology, Japan; Yoshiki Mikami, Professor, Leader, LOP, Nagaoka University of Technology, Japan.

Presentation Transcript
slide1
Asian Languages on the Web

S. T. Nandasara, Lecturer

USCS, University of Colombo, Sri Lanka

Ashu Marasinghe

Associate Professor, LOP, Nagaoka University of Technology, Japan

Yoshiki Mikami

Professor, Leader, LOP, Nagaoka University of Technology, Japan

slide2
Asian Languages on the Web
  • Introduction of Asian Languages
  • Survey Objectives and Methodology
  • Asian Language Presence on the Web
  • Multilingualism in the Asian Web
  • Script and Encoding Issues
  • Asian Language Resource Network (ALRN) Project
slide3
Survey Objectives
  • To give an overview of Asian languages on the web
  • To describe the state of multilingualism in Asian country domains
    • Defined at various levels, from a personal or document level to a societal level
    • Multiple language presence in each country domain
    • An overview of cross-border languages
  • To shed light on script and encoding issues of Asian languages
    • To what extent is UCS/Unicode employed for Asian languages?
    • What scripts are actually used to represent a specific language?
    • To what extent are locally developed encodings used?
  • To define a future agenda that can guide us toward the vision of creating an observation-collection instrument for Asian languages
slide4
Survey Methodology
  • A web crawler (UbiCrawler) was used
    • It traces links within pages and recursively crawls to gather the newly discovered pages
  • The collection of downloaded web pages was passed to the language identification engine
  • The language properties of the pages were then identified
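The link-tracing step described above can be sketched in a few lines. This is only an illustration of the general idea, not UbiCrawler's implementation: the `crawl` function, the `TOY_WEB` link graph and its URLs are hypothetical, and a queue is used in place of literal recursion so that large crawls do not exhaust the call stack.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> links found on that page.
TOY_WEB = {
    "http://a.lk/": ["http://a.lk/p1", "http://b.lk/"],
    "http://a.lk/p1": ["http://a.lk/", "http://c.lk/"],
    "http://b.lk/": [],
    "http://c.lk/": ["http://a.lk/p1"],
}

def crawl(seeds, fetch_links):
    """Visit the seed pages, then every page reachable through links, once each."""
    seen = set(seeds)
    queue = deque(seeds)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)          # a real crawler would store the page here
        for link in fetch_links(url):
            if link not in seen:     # only newly discovered pages are queued
                seen.add(link)
                queue.append(link)
    return visited

pages = crawl(["http://a.lk/"], lambda url: TOY_WEB.get(url, []))
```

The `visited` list stands in for the collection of downloaded pages that would then be handed to the language identification engine.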
slide5
Web Pages Collected
  • Focused on web pages in 42 country domains in Asia.
  • The crawl was begun from a seed file containing 13,286 URLs
  • The list of ccTLDs contains ae, af, az, bd, bh, bn, bt, cy, id, il, in, iq, ir, jo, kg, kh, kw, kz, la, lb, lk, mm, mn, mv, my, np, om, ph, pk, ps, qa, sa, sg, sy, th, tj, tm, tp, tr, uz, vn and ye.
  • The Asia crawl started on 5 July 2006 at 11:00 and ended on 19 July 2006 at 19:03
  • Downloaded 107,141,679 web pages in total, 652,710,237,381 bytes in size
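The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers and the ccTLD list given above; the helper name `in_survey_scope` and the example URLs are assumptions made for illustration, and the survey's own scope filter may have worked differently.

```python
from datetime import datetime
from urllib.parse import urlparse

# The 42 country-code domains listed on the slide.
ASIAN_CCTLDS = frozenset(
    "ae af az bd bh bn bt cy id il in iq ir jo kg kh kw kz la lb lk "
    "mm mn mv my np om ph pk ps qa sa sg sy th tj tm tp tr uz vn ye".split()
)

def in_survey_scope(url):
    """True if the URL's host ends in one of the surveyed ccTLDs."""
    host = urlparse(url).hostname or ""
    return host.rpartition(".")[2] in ASIAN_CCTLDS

# Crawl window and totals as reported on the slide.
start = datetime(2006, 7, 5, 11, 0)
end = datetime(2006, 7, 19, 19, 3)
duration = end - start                      # 14 days, 8 hours, 3 minutes

pages = 107_141_679
total_bytes = 652_710_237_381
avg_bytes_per_page = total_bytes / pages    # a little over 6,000 bytes
```

The crawl therefore ran for just over two weeks, and the average downloaded page was roughly 6 KB.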
slide8
Language Identification Process
  • The language identification engine LIM (Language Identification Module) was used
  • LIM consists of two components
    • Training component
      • Training data are translations of the Universal Declaration of Human Rights (UDHR) provided by the United Nations Office of the High Commissioner for Human Rights
    • Identification component
  • LIM can simultaneously detect the triplet of language, script and encoding scheme
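To make the two-component structure concrete, here is a minimal sketch of a byte n-gram identifier with a training step and an identification step. It is an assumption-laden toy, not LIM's actual algorithm: the class `TinyIdentifier`, the overlap score, and the three short training sentences are all chosen for illustration. Because profiles are built from encoded bytes, each label can be a (language, script, encoding) triplet.

```python
from collections import Counter

def byte_ngrams(data, n=3):
    """Count overlapping byte n-grams in a byte string."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

class TinyIdentifier:
    def __init__(self, n=3):
        self.n = n
        self.profiles = {}   # (language, script, encoding) -> n-gram counts

    def train(self, language, script, encoding, text):
        """Training component: build a byte profile for one triplet."""
        key = (language, script, encoding)
        self.profiles[key] = byte_ngrams(text.encode(encoding), self.n)

    def identify(self, data):
        """Identification component: return the best-overlapping triplet."""
        grams = byte_ngrams(data, self.n)
        def overlap(profile):
            return sum(min(count, profile[g]) for g, count in grams.items())
        return max(self.profiles, key=lambda key: overlap(self.profiles[key]))

# Illustrative training sentences (short samples, not a full UDHR corpus).
ENGLISH = "All human beings are born free and equal in dignity and rights."
TURKISH = "Bütün insanlar hür, haysiyet ve haklar bakımından eşit doğarlar."
HINDI = "सभी मनुष्यों को गौरव और अधिकारों के मामले में जन्मजात स्वतन्त्रता और समानता प्राप्त है।"

lim = TinyIdentifier()
lim.train("English", "Latin", "utf-8", ENGLISH)
lim.train("Turkish", "Latin", "utf-8", TURKISH)
lim.train("Turkish", "Latin", "iso-8859-9", TURKISH)
lim.train("Hindi", "Devanagari", "utf-8", HINDI)

guess = lim.identify("insanlar eşit doğarlar".encode("iso-8859-9"))
```

The same Turkish text is trained under two encodings, so the identifier can tell a legacy ISO-8859-9 page from a UTF-8 one even though the language and script are identical.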
slide9
Discovered 55 Asian languages

Chinese, Japanese and Korean are excluded from the analysis

Hebrew, Thai, Turkish, Vietnamese, Arabic, Tatar, Farsi, Javanese, Indonesian, Malay, Sundanese, Hindi, Dari, Uzbek, Mongolian, Kazakh, Madurese, Uighur, Kashmiri, Pushtu, Balochi, Turkmen, Minangkabau, Bikol, Kyrgyz, Balinese, Punjabi, Sindhi, Achehnese, Sinhala, Kapampangan, Iloko, Bengali & Assamese, Filipino, Waray, Buginese, Burmese, Kurdish, Tajiki, Azeri, Tamil, Hiligaynon, Dhivehi, Bhojpuri, Tibetan, Cebuano, Telugu, Saraiki, Lao, Gujarati, Pashto, Kannada, Urdu, Khmer, Hani

slide13
Multilingualism by Country Domain
  • The most recent version of Ethnologue lists close to seven thousand languages around the world.
  • More than 2600 of them are spoken in the Asian region.
  • Large-scale linguistic diversity is observable in Asia. Among the 2600, only around 51 languages are recognized by Asian governments as official or national language(s)
    • Indonesia has the richest diversity of languages in the region
    • It is interesting to note that there is a significantly larger number of pages in Javanese than in either Indonesian or Malay
    • The major languages found in Indonesia, Malaysia, Brunei, Singapore, Southern Thailand and the Philippines can be traced to a single root language, Malay, spoken in different dialects.
    • Javanese has a dominating web presence in Indonesia.
    • The smaller Sundanese, Madurese, Achehnese and Buginese languages are found to be of great importance to Indonesia’s local language diversity on the Internet
slide14
Cross-Border Languages
  • Another aspect of multilingualism in the region is the overwhelming presence of cross-border languages on the web
  • Two categories of languages are defined
    • The first category is “local languages”: the officially recognized language(s) of the state and the home languages of its speakers
    • The second category is “cross-border languages”, such as English, French, Russian and Arabic, which are used for communication among the peoples of different nations
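One way to apply these two categories to per-domain page counts is sketched below. The function name `split_by_category`, the extra "other" bucket, and the page counts are all made up for illustration; the counts are toy values, not survey results. A language is treated as local when it is in the state's own list, even if it also serves as a cross-border language elsewhere.

```python
# Cross-border languages named on the slide.
CROSS_BORDER = {"English", "French", "Russian", "Arabic"}

def split_by_category(page_counts, local_languages):
    """Sum page counts into local, cross-border and other languages."""
    totals = {"local": 0, "cross_border": 0, "other": 0}
    for language, pages in page_counts.items():
        if language in local_languages:
            totals["local"] += pages
        elif language in CROSS_BORDER:
            totals["cross_border"] += pages
        else:
            totals["other"] += pages
    return totals

# Toy input for one country domain (invented numbers, for illustration only).
toy_counts = {"Sinhala": 6, "Tamil": 3, "English": 10, "Hindi": 1}
toy_split = split_by_category(toy_counts, local_languages={"Sinhala", "Tamil"})
```

With the toy input, ten of the twenty pages fall into the cross-border bucket, which is the kind of ratio the slide describes as an overwhelming presence.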
slide20
Same Script Shared by Various Languages

The Devanagari script is used by

  • More than 480 million speakers
    • Hindi
  • More than 10 million speakers
    • Marathi
    • Nepali
  • More than 1 million speakers
    • Awadhi
    • Bhojpuri
    • Braj Bhasha
    • Chhattisgarhi
    • Konkani
    • Kachchi
    • Marwari
    • Maithili
    • Magahi
  • Scholars’ language
    • Sanskrit

  • Less than 1 million speakers
    • Kului
    • Kumaoni
    • Khadiya
    • Khortha
    • Kurku
    • Kurukh
    • Kurmali
    • Palpa
    • Panchpargania
    • Santali
    • Nagpuri
    • Kankan
    • Limbu
    • Sherpa
    • Garhwali
    • Mundari
    • Newari
    • Bagheli
    • Bhatneri
    • Bathi
    • Bateri
    • Bhili
    • Gondi
    • Jaipuri
    • Harauti
    • Ho
    • Kachchhi
    • Kanauji
    • Khorthi
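Because so many languages share Devanagari, detecting the script of a page narrows the candidates but does not settle the language. The sketch below shows this with Python's standard `unicodedata` module; the function `dominant_script` and its shortcut of taking the first word of each character's Unicode name as the script are assumptions for illustration, not the survey's method.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Return the most frequent script among letters and marks, or None."""
    scripts = Counter()
    for ch in text:
        # Only letters (L*) and combining marks (M*) carry script information.
        if unicodedata.category(ch)[0] in ("L", "M"):
            name = unicodedata.name(ch, "")
            if name:
                scripts[name.split()[0]] += 1   # e.g. "DEVANAGARI", "LATIN"
    return scripts.most_common(1)[0][0] if scripts else None

# The names of three different languages, all written in the same script.
samples = {"Hindi": "हिन्दी", "Marathi": "मराठी", "Nepali": "नेपाली"}
detected = {language: dominant_script(word) for language, word in samples.items()}
```

All three samples report the same script, so a further language-level step is still required to tell Hindi, Marathi and Nepali apart.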

slide21
UDHR Document by Major Script Grouping

Representation of the UDHR Document by Major Script Grouping

[1] Cumulative speaker population based on Ethnologue: Languages of the World, 15th ed. (2005)

slide23
ALRN Mission

To create a network of qualified Asian partners to specify and support the development of high-priority Language Resources (LRs) for Asian languages in a systematic, standards-driven, collaborative and learning context. The project will focus on identifying the state of the art of LRs in the region, assessing priority requirements through consultations with language research, industry and communication players, and establishing a protocol and standards for developing an LR network for the languages spoken in the region.

Asian Language Resources – Agenda

slide24
ALRN Action Plan
  • The project will focus on South, South East, Central and West Asian languages
  • Act as an umbrella for Asian Language Resources (ALR)
  • Accommodate secure and sustainable UTF-based encoding
  • Take advantage of existing organizations such as the Language Observatory Project (LOP, TCL)
  • Corpus collection from the web using LO’s crawler and language identifier
  • Language resources originating from Japan whose parallel corpora are available in other languages (UDHR, Oshin, One Straw Revolution, etc.)
  • Multilingual terminology dictionary
  • Information standards for language corpus building
  • Liaison with international organizations such as UNESCO, UDHR, etc.
  • Information resource sharing web site (www.language-resource.net)

Asian Academy of Languages …?

slide25
Thank you

Danke schön

Merci

Gracias

Obrigado

Grazie

Danke

Spasibo

Ευχαριστώ

slide26
Language Presence in Asian Countries

(The exact number of languages may never be determined)

slide27
Language Diversity

(Half of the world’s languages are spoken in only eight countries)

slide29
Will Cover

4 Asian Regions (West, Central, South & South East Asia)

42 Countries

9 Language Families

62 Languages

18 Major Scripts

Asian Language Resources Network - Agenda
