eWika: Towards the Digitalization of Philippine Languages


Charibeth K. Cheng ([email protected])

DLSU, College of Computer Studies

Natural Language Processing Research Lab

Isalin

Translate

MT Research in RP
  • started in 1993 at UP-Los Baños
  • Dr. Rachel Roxas and Allan Borra
    • grammar-based
  • started at DLSU in 2004
    • hybrid approach
ENG-FIL MT System Project
  • 3-year project
  • started 2005
  • funded by DOST-PCASTRD
  • composition:
    • 6 faculty members of College of Computer Studies
    • 15 computer science majors
    • assisted by the Filipino Dept and the Dept of English & Applied Linguistics of DLSU-M
Architectural Design of the Program

[Diagram components: Source Text, User Interface, Target Text, Translator Engine, MT: Example-based, MT: Rule-based, Output Modeller]

  • Language Resources:
    • Lexicon (electronic dictionary)
    • Morphological Analyzer & Generator
    • Part-of-Speech tagger
    • Grammar
    • Corpus (Tagged)
Rule-Based approach

The boy ate apples.

Apply translation rules

Kumain ng mga mansanas ang batang lalaki.

Where do we get the translation rules?

Example-Based
  • Learn the rules from examples

The [A] boy [B] ate [C] apples [D].

Kumain [C] ng mga mansanas [D] ang [A] batang lalaki [B].

Rule Learned: A B C D → C ng D A B
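
As a rough illustration of learning a rule from one labelled example, the sketch below reads off the reordering pattern from the sentence pair on this slide. The function name `learn_rule` and the data layout are assumptions made for this sketch; the slides do not describe how the project's actual learner works.

```python
def learn_rule(source, target):
    """Derive a reordering rule from one labelled sentence pair.

    source: list of (label, phrase) pairs for the English sentence.
    target: list of (label or None, phrase) pairs for the Filipino sentence;
            a None label marks a word with no English counterpart, which is
            kept literally in the rule.
    """
    lhs = [label for label, _ in source]
    rhs = [label if label is not None else phrase for label, phrase in target]
    return lhs, rhs


# The example pair from the slide, segmented by hand (an assumption of this sketch).
source = [("A", "The"), ("B", "boy"), ("C", "ate"), ("D", "apples")]
target = [
    ("C", "Kumain"),
    (None, "ng"),
    ("D", "mga mansanas"),
    ("A", "ang"),
    ("B", "batang lalaki"),
]

lhs, rhs = learn_rule(source, target)
rule_text = " ".join(lhs) + " -> " + " ".join(rhs)
print(rule_text)  # A B C D -> C ng D A B
```

The labels stand in for aligned phrases, so the hard part in practice is the alignment itself, which this sketch takes as given.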

Using the rule

Rule: A B C D → C ng D A B

The [A] mother [B] cooked [C] fish [D].

Nagluto [C] ng isda [D] ang [A] nanay [B].

Using the rule

Rule: A B C D → C ng D A B

The [A] mother [B] went [C] home [D].

Umuwi [C] ng bahay [D] ang [A] nanay [B].

Limitation of a Rule

Rule: A B C D → C ng D A B

The boy ate the fish.
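
The rule-application slides can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `TOY_LEXICON` and `apply_rule` are hypothetical names, the lexicon covers only the slide sentences, and each slot is assumed to be exactly one English word. One plausible reading of the limitation slide is that a five-word sentence such as "The boy ate the fish." no longer fits the four-slot pattern, so the sketch returns `None` for it.

```python
# Rule from the slides: A B C D -> C ng D A B
RULE_LHS = ["A", "B", "C", "D"]
RULE_RHS = ["C", "ng", "D", "A", "B"]

# Hypothetical toy lexicon, only large enough for the slide sentences.
TOY_LEXICON = {
    "the": "ang",
    "boy": "batang lalaki",
    "mother": "nanay",
    "ate": "kumain",
    "cooked": "nagluto",
    "went": "umuwi",
    "apples": "mga mansanas",
    "fish": "isda",
    "home": "bahay",
}


def apply_rule(sentence):
    """Translate a sentence with the single reordering rule, or return None."""
    words = sentence.rstrip(".").lower().split()
    if len(words) != len(RULE_LHS):
        return None  # the four-slot pattern does not match this sentence
    # Bind each label to the translation of the word in that position.
    slots = {label: TOY_LEXICON[word] for label, word in zip(RULE_LHS, words)}
    # Emit the right-hand side; tokens that are not labels (here "ng") stay literal.
    output = " ".join(slots.get(token, token) for token in RULE_RHS)
    return output[0].upper() + output[1:] + "."


print(apply_rule("The mother cooked fish."))  # Nagluto ng isda ang nanay.
print(apply_rule("The mother went home."))    # Umuwi ng bahay ang nanay.
print(apply_rule("The boy ate the fish."))    # None
```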

Results of the MT Engine
  • Qualities of a Good Translation
    • Clarity – 3.3
    • Accuracy – 3.2
    • Naturalness – 2.8
  • highest possible score: 5
  • 100 respondents (5 linguists)
Challenge!
  • Language resources
    • Quality of translation depends on them.
    • Built from almost non-existent digital forms
    • manual vs. automatic construction

Dictionary

Grammar

Sample Translations

Lexicon
  • Diksyunaryo ng Wikang Filipino
  • automatic construction (AeFLEX):
    • accuracy rate - 57%
  • Currently contains about 30,000+ entries
  • Challenge: Lexical resources
    • translation documents
    • part-of-speech tagger
Morphological Analyzer and Generator
  • Dictionary is incomplete
  • Create software that:
    • analyzes – determines the root word
    • generates – produces the inflected word

Given: eating -> eat -> kain -> kumakain

  • Challenge: Lexical resources
    • lexicon
    • part-of-speech tagger
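
The chain eating -> eat -> kain -> kumakain can be sketched as below. This is a toy illustration rather than the project's analyzer: the function names and the three-entry dictionary are hypothetical, the English side only strips "-ing", and the Filipino side handles only the -um- progressive form for roots that begin with a vowel or with a single consonant followed by a vowel.

```python
VOWELS = "aeiou"

# Hypothetical toy dictionary of English roots to Filipino roots.
ROOT_DICTIONARY = {"eat": "kain", "read": "basa", "drink": "inom"}


def analyze_english(word):
    """Analysis step: recover the root by stripping '-ing' (toy rule only)."""
    return word[:-3] if word.endswith("ing") else word


def generate_um_progressive(root):
    """Generation step: reduplicate the first syllable and add the affix -um-.

    Assumes the root starts with a vowel, or with one consonant plus a vowel.
    """
    if root[0] in VOWELS:
        # inom -> um + i + inom = umiinom
        return "um" + root[0] + root
    # kain -> k + um + a + kain = kumakain
    return root[0] + "um" + root[1] + root


def translate_progressive(english_word):
    english_root = analyze_english(english_word)
    filipino_root = ROOT_DICTIONARY[english_root]
    return generate_um_progressive(filipino_root)


print(translate_progressive("eating"))    # kumakain
print(translate_progressive("drinking"))  # umiinom
```

Because the inflected form is generated from the root, the dictionary only needs root entries, which is the point of pairing an analyzer and generator with an incomplete dictionary.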
Part-Of-Speech Tagger
  • automatic association of parts-of-speech to words in a document
    • English "can": kaya (modal, "be able to") vs. lata (noun, "tin can")
    • Filipino "baba": "chin" (noun) vs. "go down" (verb)
  • Challenge: Lexical resources
    • corpora
    • lexicon
    • morphological analyzer
    • grammar
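
To show why the tag matters for translation, here is a deliberately tiny sketch that tags each occurrence of "can" using only the preceding word and then picks kaya or lata. The one-word context rule, the tag names, and the function names are assumptions for illustration; a real tagger would draw on the corpora, lexicon, morphological analyzer, and grammar listed above.

```python
DETERMINERS = {"a", "an", "the", "this", "that"}

# Translation of "can" depends on its part of speech (from the slide).
TRANSLATION_BY_TAG = {"NOUN": "lata", "MODAL": "kaya"}


def tag_can(tokens, index):
    """Toy rule: 'can' right after a determiner is a noun, otherwise a modal."""
    previous = tokens[index - 1].lower() if index > 0 else ""
    return "NOUN" if previous in DETERMINERS else "MODAL"


def gloss_can(sentence):
    """Return the Filipino gloss chosen for each 'can' in the sentence."""
    tokens = sentence.rstrip(".?!").split()
    return [
        TRANSLATION_BY_TAG[tag_can(tokens, i)]
        for i, token in enumerate(tokens)
        if token.lower() == "can"
    ]


print(gloss_can("I can open the can."))  # ['kaya', 'lata']
```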
Corpora
  • collection of translation-pair documents
  • used by the lexicon extractor, the part-of-speech tagger, and example-based MT
  • came from translation works of DLSU English majors, verified by linguists
  • consists of 207,000 words
Lexicon Resource Dependency

Lexicon

Corpus

POS Tagger

Morph AG

Bringing it home …
  • 171 Philippine Languages (SIL)
  • No Philippine Corpora
  • Unfortunately, today, the Philippines has one of the highest rates of dying languages (Solfed Foundation Inc)
  • “Without our language, we have no culture, we have no identity, we are nothing.” (Thorrson)
eWika: Digitalization of Philippine Languages
  • Build the Philippine Corpus
  • Build software tools to study or use the corpus
    • Across Regions
    • Across Forms and Genres
    • Across Languages
Across Regions
  • Web-based application: GLOBALIZATION
    • upload, download, tools
  • Contributors (Main players)
  • Verifiers
  • Server: DLSU-M commits to host the server for the next three years.
  • Terms of Use: Research purposes.
Across Languages
  • 171 Philippine Languages (SIL List)
  • start with 8 major languages
    • Tagalog, Cebuano, Ilocano, Hiligaynon, Bikol, Waray, Kapampangan, Boholano
  • Filipino Sign Language
Across Forms and Genres
  • In various forms:
    • Text
    • Speech
    • Video: Filipino sign language
  • In various Genres:
    • Text – literary & creative, essays, news articles, religious, etc
    • Speech – scripted, conversations, etc
    • Video – common signs, regional signs, signs for specific purposes (legal, IT, etc.)
The dream of building electronic, online Philippine language resources and tools
  • Many many many major hurdles to overcome
  • NEEDED: Language Resources, Tools, & Peopleware