eWika: Towards the Digitalization of Philippine Languages




eWika: Towards the Digitalization of Philippine Languages

Charibeth K. Cheng

DLSU, College of Computer Studies

Natural Language Processing Research Lab

Isalin

Translate


MT Research in RP

  • started in 1993 at UP-Los Baños

  • Dr. Rachel Roxas and Allan Borra

    • grammar-based

  • started at DLSU in 2004

    • hybrid approach


ENG-FIL MT System Project

  • 3-year project

  • started 2005

  • funded by DOST-PCASTRD

  • composition:

    • 6 faculty members of College of Computer Studies

    • 15 computer science majors

    • assisted by the Filipino Department and the Department of English and Applied Linguistics of DLSU-M


Architectural Design of the Program

  • Diagram components:

    • Source Text

    • User Interface

    • Target Text

    • Translator Engine

    • MT: Rule-based

    • MT: Example-based

    • Output Modeller

  • Language Resources:

    • Lexicon (electronic dictionary)

    • Morphological Analyzer & Generator

    • Part-of-Speech tagger

    • Grammar

    • Corpus (Tagged)


Rule-Based Approach

  • English: The boy ate apples.

  • Apply translation rules

  • Filipino: Kumain ng mga mansanas ang batang lalaki.

  • Where do we get the translation rules?
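The "apply translation rules" step on this slide can be sketched in Python for this single sentence. This is a toy under stated assumptions: the lexicon entries, the part-of-speech labels, the `RULES` table, and the `translate` function are all hypothetical and are not taken from the project's actual rule set.

```python
# Toy rule-based transfer: look up each word, match the POS pattern,
# then reorder according to one hand-written rule.
# LEXICON, RULES and translate() are illustrative assumptions only.
LEXICON = {
    "the": ("ang", "DET"),
    "boy": ("batang lalaki", "NOUN"),
    "ate": ("kumain", "VERB"),
    "apples": ("mga mansanas", "NOUN"),
}

# POS pattern -> target order. Integers index the source words;
# strings are literal target words inserted by the rule.
RULES = {
    ("DET", "NOUN", "VERB", "NOUN"): [2, "ng", 3, 0, 1],
}

def translate(sentence: str) -> str:
    words = sentence.rstrip(".").lower().split()
    entries = [LEXICON[w] for w in words]      # KeyError if a word is unknown
    pattern = tuple(pos for _, pos in entries)
    order = RULES[pattern]                     # KeyError if no rule covers the pattern
    parts = [entries[i][0] if isinstance(i, int) else i for i in order]
    out = " ".join(parts)
    return out[0].upper() + out[1:] + "."

print(translate("The boy ate apples."))
```

Every rule in `RULES` has to be written by hand, which is the question the slide raises.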


Example-Based

  • Learn the rules from examples

  • English: The (A) boy (B) ate (C) apples (D).

  • Filipino: Kumain (C) ng mga mansanas (D) ang (A) batang lalaki (B).

  • Rule Learned: A B C D -> C ng D A B
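The rule-learning step on this slide can be sketched as follows. It assumes the chunking and the word alignment are already given by hand; computing them automatically is the hard part and is not shown. The function name `learn_rule` and the data layout are hypothetical.

```python
# Learn a reordering rule from one chunk-aligned example pair.
# The chunks and the alignment are supplied manually (an assumption);
# a real system would have to derive them from the corpus.
from string import ascii_uppercase

def learn_rule(source_chunks, target_chunks, alignment):
    """alignment maps each target chunk to the source chunk it translates."""
    labels = {chunk: ascii_uppercase[i] for i, chunk in enumerate(source_chunks)}
    pattern = []
    for chunk in target_chunks:
        if chunk in alignment:
            pattern.append(labels[alignment[chunk]])  # aligned chunk becomes a slot
        else:
            pattern.append(chunk)                     # unaligned word stays literal
    return list(labels.values()), pattern

source = ["The", "boy", "ate", "apples"]
target = ["Kumain", "ng", "mga mansanas", "ang", "batang lalaki"]
alignment = {
    "Kumain": "ate",
    "mga mansanas": "apples",
    "ang": "The",
    "batang lalaki": "boy",
}

lhs, rhs = learn_rule(source, target, alignment)
print(" ".join(lhs), "->", " ".join(rhs))  # A B C D -> C ng D A B
```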


Using the Rule

  • Rule: A B C D -> C ng D A B

  • English: The (A) mother (B) cooked (C) fish (D).

  • Filipino: Nagluto (C) ng isda (D) ang (A) nanay (B).


Using the Rule

  • Rule: A B C D -> C ng D A B

  • English: The (A) mother (B) went (C) home (D).

  • Filipino: Umuwi (C) ng bahay (D) ang (A) nanay (B).
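Applying the learned rule to the two sentences on these slides might look like the sketch below. The `LEXICON` dictionary stands in for the project's electronic dictionary and, like `apply_rule`, is a hypothetical name introduced only for illustration.

```python
# Apply the learned rule "A B C D -> C ng D A B" to new four-word sentences.
# LEXICON is an assumed stand-in for the real electronic dictionary.
RULE_RHS = ["C", "ng", "D", "A", "B"]
LEXICON = {
    "the": "ang", "mother": "nanay", "cooked": "nagluto",
    "fish": "isda", "went": "umuwi", "home": "bahay",
}

def apply_rule(sentence: str) -> str:
    words = sentence.rstrip(".").lower().split()
    if len(words) != 4:
        raise ValueError("rule A B C D needs exactly four words")
    # Bind each source word's translation to its slot letter.
    slots = dict(zip("ABCD", (LEXICON[w] for w in words)))
    # Slot letters are replaced by translations; other tokens are literals.
    out = " ".join(slots.get(tok, tok) for tok in RULE_RHS)
    return out[0].upper() + out[1:] + "."

print(apply_rule("The mother cooked fish."))  # Nagluto ng isda ang nanay.
print(apply_rule("The mother went home."))    # Umuwi ng bahay ang nanay.
```

A five-word input such as "The boy ate the fish." raises `ValueError` in this sketch, because the four-slot pattern does not cover it.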


Limitation of a Rule

  • Rule: A B C D -> C ng D A B

  • English: The boy ate the fish.


Results of the MT Engine

  • Qualities of a Good Translation

    • Clarity – 3.3

    • Accuracy – 3.2

    • Naturalness – 2.8

  • highest possible score: 5

  • 100 respondents (5 linguists)


Challenge!

  • Language resources

    • Translation quality depends on them.

    • Must be built from sources that barely exist in digital form

    • manual vs. automatic construction

    • Examples: dictionary, grammar, sample translations


Lexicon

  • Diksyunaryo ng Wikang Filipino

  • automatic construction (AeFLEX):

    • accuracy rate - 57%

  • Currently contains more than 30,000 entries

  • Challenge: Lexical resources

    • translation documents

    • part-of-speech tagger


Morphological Analyzer and Generator

  • Dictionary is incomplete

  • Create software that:

    • analyzes – determines the root word

    • generates – produces the inflected word

      Given: eating -> eat -> kain -> kumakain

  • Challenge: Lexical resources

    • lexicon

    • part-of-speech tagger
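The analyze-then-generate chain on this slide (eating -> eat -> kain -> kumakain) can be illustrated with a toy sketch. It covers only one Filipino pattern, the infix -um- on consonant-initial roots, and strips only a plain English "-ing". The function names and the one-entry lexicon are hypothetical, and real Filipino morphology needs far more rules than this.

```python
# Toy morphological analyzer/generator for a single pattern.
# Covers only the -um- infix on roots that start with consonant + vowel.
# EN_TO_FIL_ROOT, analyze_english and generate_um are illustrative names.
EN_TO_FIL_ROOT = {"eat": "kain"}  # assumed lexicon entry

def analyze_english(word: str) -> str:
    """Naively strip a progressive -ing suffix to recover the English root."""
    return word[:-3] if word.endswith("ing") else word

def generate_um(root: str, progressive: bool) -> str:
    """Insert -um- after the first consonant; reduplicate the first CV if progressive."""
    stem = root[:2] + root if progressive else root  # kain -> kakain
    return stem[0] + "um" + stem[1:]                 # kakain -> kumakain

english = "eating"
root_en = analyze_english(english)
root_fil = EN_TO_FIL_ROOT[root_en]
inflected = generate_um(root_fil, progressive=True)
print(english, "->", root_en, "->", root_fil, "->", inflected)
```

With `progressive=False` the same function yields "kumain", the completed form used in the earlier translation example.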


Part-Of-Speech Tagger

  • automatic association of parts-of-speech to words in a document

    • can – kaya (to be able) vs. lata (tin can)

    • baba – chin (noun) or go down (verb)

  • Challenge: Lexical resources

    • corpora

    • lexicon

    • morphological analyzer

    • grammar
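The "can" ambiguity above shows why the tag matters for translation. A minimal sketch, assuming a made-up tag set and a single hand-written context rule (neither comes from the project), could look like this:

```python
# Toy disambiguation of English "can" using the tag of the previous word.
# The tag names, the context rule, and LEXICON are illustrative assumptions.
LEXICON = {
    ("can", "MODAL"): "kaya",
    ("can", "NOUN"): "lata",
}

def tag_can(previous_tag: str) -> str:
    # After a determiner or adjective, "can" is most likely a noun;
    # otherwise this sketch treats it as a modal verb.
    return "NOUN" if previous_tag in {"DET", "ADJ"} else "MODAL"

def translate_can(previous_tag: str) -> str:
    return LEXICON[("can", tag_can(previous_tag))]

print(translate_can("DET"))   # as in "the can" -> lata
print(translate_can("PRON"))  # as in "I can"   -> kaya
```

A real tagger would learn such context preferences from a tagged corpus rather than rely on one fixed rule, which is why the slide lists corpora among the required resources.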


Corpora

  • collection of translation-pair documents

  • used by the lexicon extractor, the part-of-speech tagger, and example-based MT

  • drawn from translations produced by DLSU English majors and verified by linguists

  • consists of 207,000 words


Lexicon Resource Dependency

Lexicon

Corpus

POS Tagger

Morph AG
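The diagram for this slide did not survive in the transcript, so the sketch below rebuilds the dependencies from the "Challenge" bullets on the earlier slides, restricted to the four resources named here (grammar is left out). The edge list is therefore an assumption, and `find_cycle` is a hypothetical helper that simply shows the resources depend on each other in a loop.

```python
# Resource dependencies as read from the "Challenge" bullets on earlier slides.
# The edges are an assumption; the original diagram is not available.
DEPENDS_ON = {
    "Lexicon": ["Corpus", "POS Tagger"],
    "Morph AG": ["Lexicon", "POS Tagger"],
    "POS Tagger": ["Corpus", "Lexicon", "Morph AG"],
    "Corpus": [],
}

def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Found a node already on the current path: report the loop.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dep in graph[node]:
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in graph:
        cycle = visit(start)
        if cycle:
            return cycle
    return None

print(find_cycle(DEPENDS_ON))  # ['Lexicon', 'POS Tagger', 'Lexicon']
```

Only the corpus has no prerequisites in this reading, which fits the emphasis the later slides put on building the corpus first.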


Bringing it home …

  • 171 Philippine Languages (SIL)

  • No Philippine Corpora

  • Unfortunately, today, the Philippines has one of the highest rates of dying languages (Solfed Foundation Inc)

  • “Without our language, we have no culture, we have no identity, we are nothing.” (Thorrson)


eWika: Digitalization of Philippine Languages

  • Build the Philippine Corpus

  • Build software tools to study or use the corpus

    • Across Regions

    • Across Forms and Genres

    • Across Languages


Across Regions

  • Web-based application: GLOBALIZATION

    • upload, download, tools

  • Contributors (Main players)

  • Verifiers

  • Server: DLSU-M commits to host the server for the next three years.

  • Terms of Use: Research purposes.


Across Languages

  • 171 Philippine Languages (SIL List)

  • start with 8 major languages

    • Tagalog, Cebuano, Ilocano, Hiligaynon, Bikol, Waray, Kapampangan, Boholano

  • Filipino Sign Language


Across Forms and Genres

  • In various forms:

    • Text

    • Speech

    • Video: Filipino sign language

  • In various Genres:

    • Text – literary & creative, essays, news articles, religious, etc.

    • Speech – scripted, conversations, etc.

    • Video – common signs, regional signs, signs for specific purposes (legal, IT, etc.)


eWika: Towards the Digitalization of Philippine Languages

  • The dream of building electronic, online Philippine language resources and tools

  • Many many many major hurdles to overcome

  • NEEDED: Language Resources, Tools, & Peopleware

