eWika: Digitalization of Philippine Languages

Charibeth K. Cheng

March 19, 2008

Machine Translation
  • Isalin (Translate)
  • Sentence in SOURCE LANGUAGE → MT System → Sentence in TARGET LANGUAGE
  • Automates translation
  • A field of study under Natural Language Processing
ENG-FIL MT System Project
  • 3-year project
  • started last year
  • funded by DOST-PCASTRD
  • composition:
    • 6 faculty members of College of Computer Studies
    • 15 computer science majors
    • assisted by the Filipino Department and the Department of English and Applied Linguistics of DLSU-M
Agenda
  • Architecture of the MT System
  • Linguistic resources
  • Demo of the Translation Engine
  • Results for English to Japanese translation
Architectural Design of the Program

  • Diagram components: Source Text, User Interface, Target Text, Translator Engine, MT: Example-based, MT: Rule-based, Output Modeller
  • Language Resources:
    • Lexicon (electronic dictionary)
    • Morphological Analyzer & Generator
    • Part-of-Speech Tagger
    • Grammar
    • Corpus (tagged)
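
How the labelled components connect is not recoverable from the transcript. The following is a minimal sketch of one possible wiring, assuming the translator engine asks both MT modules for a candidate and the output modeller picks one. The function names, the tiny example store, and the toy lexicon are all hypothetical and are not taken from the project.

```python
from typing import List, Optional

# Hypothetical example store for the example-based module (assumption).
EXAMPLES = {"good morning": "magandang umaga"}

# Hypothetical word lexicon standing in for the rule-based module (assumption).
LEXICON = {"water": "tubig", "house": "bahay"}


def example_based(sentence: str) -> Optional[str]:
    """Return a stored translation if the whole sentence was seen before."""
    return EXAMPLES.get(sentence.lower().strip())


def rule_based(sentence: str) -> str:
    """Toy stand-in for rules: word-by-word lookup, unknown words kept."""
    return " ".join(LEXICON.get(w, w) for w in sentence.lower().split())


def output_modeller(candidates: List[Optional[str]]) -> str:
    """Pick the first available candidate (assumed selection policy)."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    raise ValueError("no candidate translation")


def translate(source_text: str) -> str:
    """Source text in, target text out, via both MT modules."""
    candidates = [example_based(source_text), rule_based(source_text)]
    return output_modeller(candidates)
```

A real rule-based module would apply the grammar and morphological resources listed above rather than a word-by-word lookup; the lookup is only there to keep the sketch runnable.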
Challenge!
  • Language resources
    • Translation quality depends on them.
    • Must be built from almost non-existent digital sources
    • manual vs. automatic construction
Lexicon Builder
  • Used IsaWika! database as initial lexicon
  • Created a lexicon extraction program to automatically determine candidate translation pairs from corpora
  • Currently contains about 23,000 entries
  • Co-occurring words are likely translations of each other
  • Challenge: Lexical resources
    • parallel corpora
    • part-of-speech tagger
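
The co-occurrence idea above can be illustrated with a short sketch. It assumes a sentence-aligned parallel corpus and scores word pairs with the Dice coefficient; the slides do not say which measure the project's extraction program uses, so the measure, the threshold, and the toy corpus are assumptions.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

# Toy sentence-aligned English-Filipino corpus (illustrative only).
PARALLEL = [
    ("the man ate", "kumain ang lalaki"),
    ("the man slept", "natulog ang lalaki"),
    ("the dog ate", "kumain ang aso"),
]


def candidate_pairs(
    parallel: List[Tuple[str, str]], min_score: float = 0.9
) -> Dict[Tuple[str, str], float]:
    """Score (source word, target word) pairs by Dice coefficient."""
    src_count: Counter = Counter()
    tgt_count: Counter = Counter()
    pair_count: Counter = Counter()
    for src, tgt in parallel:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)   # sentences containing each source word
        tgt_count.update(tgt_words)   # sentences containing each target word
        pair_count.update(product(src_words, tgt_words))  # co-occurrences
    scored = {}
    for (s, t), both in pair_count.items():
        dice = 2 * both / (src_count[s] + tgt_count[t])
        if dice >= min_score:
            scored[(s, t)] = dice
    return scored
```

On the toy corpus, `candidate_pairs(PARALLEL)` keeps pairs such as ("man", "lalaki") and drops pairs such as ("man", "kumain") that co-occur in only some of their sentences. Candidates produced this way would still need verification before entering the lexicon.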

Database

Morphological Analyzer
  • Initially collected morphological rules from grammar books
  • Developed an example-based morphological phenomenon learner
    • learns from <inflected word, root word> pairs
    • example: <kumakain, kain>
  • Challenge: Lexical resources
    • lexicon
    • part-of-speech tagger
    • morphological rules
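
A minimal sketch of learning from <inflected word, root word> pairs follows. It assumes a small, hand-listed space of word-formation operations (first-syllable reduplication and the -um- infix) and records which combinations explain the training pairs. This is a simplification for illustration and not the project's learner; all names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

VOWELS = "aeiou"


def reduplicate(word: str) -> str:
    """Copy everything up to and including the first vowel: kain -> kakain."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[: i + 1] + word
    return word


def infix_um(word: str) -> str:
    """Insert -um- before the first vowel: kain -> kumain."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[:i] + "um" + word[i:]
    return word


# Hypothesis space: named sequences of operations (assumption).
PATTERNS = {
    "um": (infix_um,),
    "redup": (reduplicate,),
    "redup+um": (reduplicate, infix_um),
}


def apply_pattern(name: str, root: str) -> str:
    word = root
    for op in PATTERNS[name]:
        word = op(word)
    return word


def learn(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count which patterns turn each root into its inflected form."""
    seen: Counter = Counter()
    for inflected, root in pairs:
        for name in PATTERNS:
            if apply_pattern(name, root) == inflected:
                seen[name] += 1
    return seen


def analyze(word: str, roots: List[str], learned: Counter) -> Optional[Tuple[str, str]]:
    """Find a known root and a learned pattern that regenerate the word."""
    for root in roots:
        for name in learned:
            if apply_pattern(name, root) == word:
                return root, name
    return None
```

After `learn([("kumakain", "kain")])`, the analyzer can relate "bumibili" to the root "bili" through the same pattern, which also shows why the lexicon of roots is listed as a required resource.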

Generator

Part-Of-Speech Tagger
  • automatic association of parts-of-speech to words in a document
  • existing Filipino tagger achieves less than 80% accuracy
  • Challenge: Lexical resources
    • tagged parallel corpora
    • lexicon
    • morphological analyzer
    • grammar
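
To make "automatic association of parts-of-speech to words" concrete, here is a minimal baseline sketch: a most-frequent-tag tagger trained on a tagged corpus. It is not the existing Filipino tagger mentioned above, and the tag names (VERB, DET, NOUN) are a hypothetical tagset chosen for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]


def train(tagged_sentences: List[TaggedSentence]) -> Dict[str, str]:
    """Map each word to the tag it carries most often in the training data."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(words: List[str], model: Dict[str, str], default: str = "NOUN") -> TaggedSentence:
    """Tag each word; unseen words fall back to a default tag (assumption)."""
    return [(w, model.get(w.lower(), default)) for w in words]


def accuracy(gold: List[TaggedSentence], model: Dict[str, str]) -> float:
    """Fraction of tokens whose predicted tag matches the gold tag."""
    total = correct = 0
    for sentence in gold:
        predicted = tag([w for w, _ in sentence], model)
        for (_, gold_tag), (_, pred_tag) in zip(sentence, predicted):
            total += 1
            correct += gold_tag == pred_tag
    return correct / total if total else 0.0
```

A baseline like this depends entirely on the size of the tagged corpus, which is why tagged corpora appear among the required resources.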
Grammar
  • Derived manually
  • Challenge: Free word order in sentence formation
  • Example: "The man bought an umbrella from the store." can be rendered as:
    • Bumili ang lalaki ng payong sa tindahan.
    • Bumili sa tindahan ng payong ang lalaki.
    • Ang lalaki ay bumili ng payong sa tindahan.
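
The three Filipino sentences differ in order but not in content. A minimal sketch of that observation, assuming single-word phrases introduced by the markers ang, ng, and sa, and treating "ay" as an inversion marker to skip: each sentence is reduced to the same marker-to-word mapping. This is an illustration, not the project's grammar.

```python
from typing import Dict

MARKERS = {"ang", "ng", "sa"}


def constituents(sentence: str) -> Dict[str, str]:
    """Map each marker to the word it introduces, plus the verb."""
    tokens = sentence.lower().rstrip(".").split()
    roles: Dict[str, str] = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in MARKERS and i + 1 < len(tokens):
            roles[token] = tokens[i + 1]   # marker introduces the next word
            i += 2
        elif token == "ay":
            i += 1                         # inversion marker carries no role
        else:
            roles["verb"] = token          # unmarked word is taken as the verb
            i += 1
    return roles


SENTENCES = [
    "Bumili ang lalaki ng payong sa tindahan.",
    "Bumili sa tindahan ng payong ang lalaki.",
    "Ang lalaki ay bumili ng payong sa tindahan.",
]
```

All three entries in `SENTENCES` reduce to the same mapping, so a grammar for Filipino has to accept several surface orders for one underlying structure.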
Corpora
  • used by the lexicon extractor, the part-of-speech tagger, and example-based MT
  • came from translation works of DLSU English majors, verified by linguists
  • consists of 207,000 words, 5,000 of which are tagged
Translation Rules
  • currently learned from the corpora
  • disadvantages
    • garbage-in-garbage-out
    • comprehensiveness
  • need for linguist-verified rules
Bringing it home …
  • 171 Philippine Languages (SIL)
  • No Philippine Corpora
  • Unfortunately, today, the Philippines has one of the highest rates of dying languages (Solfed Foundation Inc)
  • “Without our language, we have no culture, we have no identity, we are nothing.” (Thorrson)
eWika: Digitalization of Philippine Languages
  • Build the Philippine Corpus
  • Build software tools to study or use the corpus
    • Across Languages
    • Across Regions
    • Across Forms and Genres
    • Across Land and Sea
Across Languages
  • 171 Philippine Languages (SIL List)
  • Summer Institute of Linguistics http://www.ethnologue.com/
  • Major languages
  • Near-extinct languages
  • How about the languages in-between?
Filipino Sign Language
  • The History of Sign Language in the Philippines: Piecing Together the Puzzle (Abat & Martinez, 9th Phil Linguistics Congress, 2006)
  • Deaf individuals: handicapped vs members of a linguistic minority
  • Sign languages as true languages
Across Boundaries
  • Across Languages
  • Across Regions
  • Across Forms and Genres
  • Across Land and Sea
Across Regions
  • e-Wika: Connecting the Philippine Islands through Language
  • 17 regions: Ilocos Region (Region I), Cagayan Valley (Region II), Central Luzon (Region III), CALABARZON (Region IV-A), MIMAROPA (Region IV-B), Bicol Region (Region V), Western Visayas (Region VI), Central Visayas (Region VII), Eastern Visayas (Region VIII), Zamboanga Peninsula (Region IX), Northern Mindanao (Region X), Davao Region (Region XI), SOCCSKSARGEN (Region XII), Caraga (Region XIII), Autonomous Region in Muslim Mindanao (ARMM), Cordillera Administrative Region (CAR), National Capital Region (NCR) (Metro Manila)
Across Boundaries
  • Across Time: historical, contemporary
  • Across Languages
  • Across Regions
  • Across Forms and Genres
  • Across Land and Sea
Across Forms and Genres
  • In various forms:
    • Text
    • Speech: speech-to-text system (ongoing project)
    • Video: Filipino Sign Language
  • In various genres: categories of entries in the corpus
Across Boundaries
  • Across Time: historical, contemporary
  • Across Languages
  • Across Regions
  • Across Forms and Genres
  • Across Land and Sea
Across Land and Sea
  • Web-based application: c/o Solomon See (upload, download, tools)
  • Contributors (Main players)
  • Verifiers
  • Facilitators
  • Server: DLSU-M commits to host the server for the next three years.
  • Terms of Use: Research purposes.
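
The boundaries named in the earlier slides (time, language, region, form, genre) suggest the metadata a corpus entry would carry. The following is a minimal sketch of such a record with a contributor and verifiers attached; the field names and the `select` helper are hypothetical and do not describe the actual web application.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CorpusEntry:
    """One uploaded item with the dimensions the slides enumerate."""
    content: str          # the text, or a reference to a speech/video file
    language: str
    region: str
    form: str             # "text", "speech", or "video"
    genre: str
    period: str           # "historical" or "contemporary"
    contributor: str
    verified_by: List[str] = field(default_factory=list)

    @property
    def verified(self) -> bool:
        """An entry counts as verified once at least one verifier signs off."""
        return bool(self.verified_by)


def select(entries: List[CorpusEntry], **criteria: str) -> List[CorpusEntry]:
    """Return entries whose fields equal every given criterion."""
    return [
        entry
        for entry in entries
        if all(getattr(entry, key) == value for key, value in criteria.items())
    ]
```

For instance, `select(entries, language="Cebuano", form="text")` would return the text entries for one language, which is the kind of cross-cutting query that the "Across ..." slides describe.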
The dream of building Philippine language resources and tools
  • Many, many, many major hurdles to overcome
  • Language Resources, Tools, & Peopleware: Needed