
The Sawa Corpus: A Parallel Corpus English - Swahili

Guy De Pauw (guy.depauw@aflat.org), Peter Waiganjo Wagacha (waiganjo@aflat.org), Gilles-Maurice de Schryver (gillesmaurice.deschryver@aflat.org)


Presentation Transcript


  1. The Sawa Corpus: A Parallel Corpus English - Swahili. Guy De Pauw (guy.depauw@aflat.org), Peter Waiganjo Wagacha (waiganjo@aflat.org), Gilles-Maurice de Schryver (gillesmaurice.deschryver@aflat.org)

  2. Resource-scarceness
     • Language technology vs the digital divide
     • Digital data increasingly important for African languages (web, mobile phone, …)
     • But: most research on African languages is rooted in the knowledge-based paradigm (↔ LT for Indo-European languages):
       • Hand-crafted expert systems
       • Typically high accuracy for domain
       • Limited portability to other languages and subdomains
       • Costly development phase
     • Limited resources (linguistic, expertise, financial, …)
     • Need for a cheaper and faster (language-independent) alternative for developing African language technology

  3. Data-driven approaches
     • For Indo-European and Asian languages: the data-driven, corpus-based approach has been the dominant paradigm since the 1990s
     • Basic methodology: automatically extract linguistic knowledge from annotated text material (corpus) and bootstrap the development of language technology components
     • Advantages:
       • Language independence: portability (!)
       • Knowledge-acquisition bottleneck → data-acquisition bottleneck
       • Robustness
     • AfLaT-team: explore application of data-driven paradigm to African languages (Swahili, Gikuyu, Luo, Northern Sotho, …)

  4. Machine Translation
     3 paradigms:
     • Rule-based MT
     • Statistical MT (data-driven)
     • Example-based MT (data-driven)
     Data-driven: learn translation from examples. Parallel corpus!

  5. Parallel Corpus
     Collection of translated texts in two different languages, aligned on paragraph, sentence, phrase and/or word level.
     Sawa Corpus: parallel corpus English - Swahili
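The definition on slide 5 can be pictured as a simple data structure. The following is a minimal sketch only: the class name `AlignedPair`, its fields, and the in-memory list are assumptions made for illustration and do not describe how the Sawa Corpus is actually stored. The two example pairs are taken from the Universal Declaration of Human Rights excerpt in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class AlignedPair:
    """One aligned unit: an English segment and its Swahili translation."""
    english: str
    swahili: str
    # Optional word-level links as (english_token_index, swahili_token_index).
    # Left empty until word alignment has been performed.
    word_links: list[tuple[int, int]] = field(default_factory=list)

    def tokens(self) -> tuple[list[str], list[str]]:
        """Whitespace-tokenize both sides (a deliberately naive tokenizer)."""
        return self.english.split(), self.swahili.split()


# A toy corpus: a list of aligned pairs (hypothetical storage format).
corpus = [
    AlignedPair(
        english="Universal Declaration of Human Rights",
        swahili="TAARIFA YA ULIMWENGU JUU YA HAKI ZA BINADAMU",
    ),
    AlignedPair(english="Preamble", swahili="UTANGULIZI"),
]

for pair in corpus:
    en, sw = pair.tokens()
    print(f"{len(en)} English tokens <-> {len(sw)} Swahili tokens")
```

Keeping the word-level links as an optional field mirrors the phases described in the slides: texts are first collected and aligned at a coarse level, and word alignment is added later.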

  6. Example
     Swahili:
     Katika Disemba 10, 1948, Baraza kuu la Umoja wa Mataifa lilikubali na kutangaza Taarifa ya Ulimwengu juu ya Haki za Binadamu. Maelezo kamili ya Taarifa hiyo yamepigwa chapa katika kurasa zifuatazo. Baada ya kutangaza taarifa hii ya maana Baraza Kuu lilizisihi nchi zote zilizo Wanachama wa Umoja wa Mataifa zitangaze na "zifanye ienezwe ionyeshwe, isomwe na ielezwe mashuleni na katika vyuo vinginevyo bila kujali siasa ya nchi yo yote."
     • UMOJA WA MATAIFA OFISI YA IDARA YA HABARI TAARIFA YA ULIMWENGU JUU YA HAKI ZA BINADAMU
     • UTANGULIZI
     • Kwa kuwa kukiri heshima ya asili na haki sawa kwa binadamu wote ndio msingi wa uhuru, haki na amani duniani,
     • Kwa kuwa kutojali na kudharau haki za binadamu kumeletea vitendo vya kishenzi ambavyo vimeharibu dhamiri ya binadamu na kwa sababu taarifa ya ulimwengu ambayo itawafanya binadamu wafurahie uhuru wao wa kusema, kusadiki na wa kutoogopa cho chote imekwisha kutangazwa kwamba ndio hamu kuu ya watu wote,
     English:
     • Universal Declaration of Human Rights
     • Preamble
     • Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world,
     • Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people,

  7. 3 phases
     • Data-collection: finding parallel texts
     • Data-constitution: aligning the parallel texts on word level
     • Data-exploitation
       • Statistical Machine Translation
       • Bootstrapping linguistic annotation

  8. Data Collection
     • Limited availability of parallel texts English – Kiswahili:
       • Smaller documents: investment reports, political texts, e.g. Universal Declaration of Human Rights
       • Bible, Quran, secular literature
       • New translations
     “There is no data like more data”

  9. Data Collection • Even if the source data is digitally available beforehand, we are often faced with tough alignment problems during data constitution. e.g. paragraph alignment

  10. Swahili:
      Katika Disemba 10, 1948, Baraza kuu la Umoja wa Mataifa lilikubali na kutangaza Taarifa ya Ulimwengu juu ya Haki za Binadamu. Maelezo kamili ya Taarifa hiyo yamepigwa chapa katika kurasa zifuatazo. Baada ya kutangaza taarifa hii ya maana Baraza Kuu lilizisihi nchi zote zilizo Wanachama wa Umoja wa Mataifa zitangaze na "zifanye ienezwe ionyeshwe, isomwe na ielezwe mashuleni na katika vyuo vinginevyo bila kujali siasa ya nchi yo yote."
      • UMOJA WA MATAIFA OFISI YA IDARA YA HABARI TAARIFA YA ULIMWENGU JUU YA HAKI ZA BINADAMU
      • UTANGULIZI
      • Kwa kuwa kukiri heshima ya asili na haki sawa kwa binadamu wote ndio msingi wa uhuru, haki na amani duniani,
      • Kwa kuwa kutojali na kudharau haki za binadamu kumeletea vitendo vya kishenzi ambavyo vimeharibu dhamiri ya binadamu na kwa sababu taarifa ya ulimwengu ambayo itawafanya binadamu wafurahie uhuru wao wa kusema, kusadiki na wa kutoogopa cho chote imekwisha kutangazwa kwamba ndio hamu kuu ya watu wote,
      English:
      • Universal Declaration of Human Rights
      • Preamble
      • Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world,
      • Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people,

  11. e.g. sentence alignment
      English:
      • Article 12
      • No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honour and reputation.
      • Everyone has the right to the protection of the law against such interference or attacks.
      Swahili:
      • Kifungu cha 12
      • Kila mtu asiingiliwe bila sheria katika mambo yake ya faragha, ya jamaa yake, ya nyumbani mwake au ya barua zake.
      • Wala asivunjiwe heshima na sifa yake.
      • Kila mmoja ana haki ya kulindwa na sheria kutokana na pingamizi au mambo kama hayo.
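The Article 12 example has three English segments but four Swahili ones, so a strict one-to-one pairing is not possible. One common way to record such cases is as alignment "beads" that group one or more segments on each side. The sketch below is illustrative only: the bead representation, the helper names `bead_type` and `covers_in_order`, and the specific grouping (the first English sentence paired with the first two Swahili sentences) are assumptions based on reading the example, not the corpus's documented format.

```python
# Segments from the Article 12 example, in document order.
english = [
    "Article 12",
    "No one shall be subjected to arbitrary interference with his privacy, "
    "family, home or correspondence, nor to attacks upon his honour and reputation.",
    "Everyone has the right to the protection of the law against such "
    "interference or attacks.",
]
swahili = [
    "Kifungu cha 12",
    "Kila mtu asiingiliwe bila sheria katika mambo yake ya faragha, ya jamaa "
    "yake, ya nyumbani mwake au ya barua zake.",
    "Wala asivunjiwe heshima na sifa yake.",
    "Kila mmoja ana haki ya kulindwa na sheria kutokana na pingamizi au mambo "
    "kama hayo.",
]

# Each bead pairs a group of English indices with a group of Swahili indices.
# Assumed manual alignment: the second English segment spans two Swahili ones.
beads = [
    ([0], [0]),
    ([1], [1, 2]),
    ([2], [3]),
]


def bead_type(bead):
    """Return the bead shape as a string, e.g. '1-2' for one-to-two."""
    src, tgt = bead
    return f"{len(src)}-{len(tgt)}"


def covers_in_order(beads, n_src, n_tgt):
    """Check that every segment on both sides is used exactly once, in order."""
    src = [i for s, _ in beads for i in s]
    tgt = [j for _, t in beads for j in t]
    return src == list(range(n_src)) and tgt == list(range(n_tgt))


for bead in beads:
    print(bead_type(bead), "|", " ".join(english[i] for i in bead[0])[:40], "...")
print("complete:", covers_in_order(beads, len(english), len(swahili)))
```

The coverage check is a cheap sanity test for manually produced alignments: it catches skipped or duplicated segments before the data is passed on to word alignment.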

  12. Available data in Sawa Corpus. All manually sentence aligned!


  14. Available data in Sawa Corpus. Thanks to Mahmoud Shokrollahi-Far, University College of NabiyeAkram (Iran). All manually sentence aligned!


  19. Available data in Sawa Corpus. Thanks to Dr. James Omboga Zaja, University of Nairobi. All manually sentence aligned!


  21. Word alignment
      Most difficult task: relate words between languages
      English: No , she 's , uh , up north
      Swahili: La , yuko , aa , juu kaskazini

  22. Word alignment
      English: You caught me skiving , I 'm afraid .
      Swahili: Samahani , umenidaka nikihepa .

  23. Word alignment • Can be done automatically using established tools (GIZA++) • Provide manual reference to evaluate automatic word alignment tools (5000 words)
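Evaluating an automatic aligner against the manual reference amounts to comparing two sets of word links. The slides do not say which metric was used, so the following is a sketch under assumptions: it computes precision, recall and F1 over links, and both the function name `alignment_scores` and the toy link sets are hypothetical. Parsing actual GIZA++ output is not shown.

```python
def alignment_scores(predicted, reference):
    """Compare predicted word links against manually annotated reference links.

    Both arguments are iterables of (source_index, target_index) pairs.
    Returns (precision, recall, f1).
    """
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): four reference links, one of which the aligner gets wrong.
reference_links = {(0, 0), (1, 1), (2, 2), (3, 3)}
predicted_links = {(0, 0), (1, 1), (2, 3), (3, 3)}

p, r, f = alignment_scores(predicted_links, reference_links)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In practice the scores would be accumulated over all sentence pairs in the manually aligned reference set rather than computed for a single pair.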

  24. Current results: still a lot of room for improvement

  25. Word alignment
      Some alignment patterns are easy
      English: No , she 's , uh , up north
      Swahili: La , yuko , aa , juu kaskazini

  26. Alignment problems
      English: I have turned him down
      Swahili: nimemkatalia

  27. Morphological decomposition
      English: I have turned him down
      Swahili: ni+ me+ m+ katalia
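The effect of decomposition on alignment can be made concrete by counting how many English words have to link to a single Swahili unit. In the sketch below, the segmentation ni+me+m+katalia comes from the slide, but the individual morpheme links and the helper `max_fan_in` are assumptions added for illustration.

```python
from collections import Counter

english = "I have turned him down".split()
swahili_word = ["nimemkatalia"]
swahili_morphemes = ["ni", "me", "m", "katalia"]  # segmentation from the slide

# Word level: all five English words must link to the single Swahili token.
word_links = [(i, 0) for i in range(len(english))]

# Morpheme level (assumed links, for illustration only):
# I -> ni, have -> me, him -> m, turned/down -> katalia
morpheme_links = [(0, 0), (1, 1), (3, 2), (2, 3), (4, 3)]


def max_fan_in(links):
    """Largest number of source words linked to any one target unit."""
    counts = Counter(target for _, target in links)
    return max(counts.values())


print("word level:    ", max_fan_in(word_links))      # five-to-one
print("morpheme level:", max_fan_in(morpheme_links))  # at most two-to-one
```

Lower fan-in means the links are closer to one-to-one, which is generally easier for a statistical aligner to learn; the cost is that translation output then has to be reassembled from morphemes into words.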

  28. Current results: morpheme/word alignment. Better alignment, but more complicated decoding


  31. Future work
      • Projection of Annotation
      • Refine GIZA++ alignment
      • Part-of-speech tagger
      • No data like more data: web-mining & comparable corpora
      • Example-based MT (OmegaT)
      • Statistical MT (Moses)

  32. Conclusion
      • Modest, but workable parallel corpus English – Swahili
      • Bi-directional Machine Translation is now in the cards
      • Modest, but encouraging word alignment scores
      • Data-driven approach is viable for African languages
