
Towards Developing a Multi-Dialect Morphological Analyser for Arabic

4th International Conference on Arabic Language Processing, May 2–3, 2012, Rabat, Morocco. Khalid Almeman and Mark Lee, The University of Birmingham, www.almeman.com.


Presentation Transcript


  1. Towards Developing a Multi-Dialect Morphological Analyser for Arabic 4th International Conference on Arabic Language Processing May 2–3, 2012, Rabat, Morocco Khalid Almeman and Mark Lee The University of Birmingham www.almeman.com

  2. Outline • Introduction • Multi dialect Morphology Analyser • Adopt MSA morphology analyser • Segment unknown words • Check on web corpus • Conclusions and future work

  3. Contents • Introduction • Multi dialect Morphology Analyser • Adopt MSA morphology analyser • Segment unknown words • Check on web corpus • Conclusions and future work

  4. Introduction The usage of: MSA vs. Dialects

  5. Introduction • Dialectal Morphology & Variation • MSA has a rich morphology in two main aspects: • Affixes and stems (word level) • Syntax (context level) • Dialects share the complexity of MSA and also differ substantially from it at both the word and syntax levels

  6. Introduction • Dialectal Morphology & Variation (the changes) • Shifts in some sounds • e.g. s to h (N Africa), q to a (LEV), s to H (EGY) • New sounds • e.g. k to ts or ch (Gulf), j to g (EGY) • Changes in syntax between MSA and the dialects • No standardisation in writing • e.g. the loanword ‘sandwich’ can be written in many forms

  7. Introduction • Dialectal Morphology & Variation (the changes) • Phonetic changes in the Arabic dialects compared with MSA, e.g.

  8. Introduction What is the problem? • The rich morphology of the Arabic language • The variation between MSA and the dialects • The variation among the dialects themselves • No standardisation in the Arabic dialects • State of the art: MAGEAD • Restricted to verbs • Levantine; rules need to be defined for new dialects • Hence the need for a dialect morphological analyser

  9. Contents • Introduction • Multi dialect Morphology Analyser • Adopt MSA morphology analyser • Segment unknown words • Check on web corpus • Conclusions and future work

  10. Multi dialect morphology analyser • Three methods have been applied: • 1. Modify the MSA analyser • 2. Segment the remaining words • 3. Check the frequency in the web corpus

  11. Multi dialect morphology analyser • Baseline experiment: we extracted 2229 dialect words from the web and checked them with an MSA morphological analyser (Al Khalil, 2011). The result:

  12. Contents • Introduction • Multi dialect Morphology Analyser • Adopt MSA morphology analyser • Segment unknown words • Check on web corpus • Conclusions and future work

  13. Multi dialect morphology analyser The first method: Adopt MSA analyser • According to Haack (1996), the stem patterns of Arabic dialects are identical to those of MSA in many cases • So the suggestion is to add new dialect affixes to the MSA morphological analyser
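A minimal sketch of the idea on this slide, not the authors' implementation: if dialect stems usually match MSA stems, an MSA analyser can be reused by stripping a dialect affix first. The affix lists are illustrative only, taken from examples elsewhere in the talk (ب, ح, ه as prefixes, ش as a suffix), and `msa_accepts` is a hypothetical stand-in (a set lookup) for a real MSA analyser.

```python
# Illustrative dialect affixes drawn from the talk's examples; NOT the authors' full list.
DIALECT_PREFIXES = ["ب", "ح", "ه"]
DIALECT_SUFFIXES = ["ش"]


def msa_accepts(word, msa_lexicon):
    """Hypothetical stand-in for an MSA analyser: a plain set lookup."""
    return word in msa_lexicon


def analyse(word, msa_lexicon):
    """Return (prefix, stem, suffix) if the word is accepted, else None."""
    # Words the MSA analyser already knows need no dialect affix.
    if msa_accepts(word, msa_lexicon):
        return ("", word, "")
    # Otherwise try every combination of one dialect prefix and/or suffix.
    for prefix in [""] + DIALECT_PREFIXES:
        for suffix in [""] + DIALECT_SUFFIXES:
            if not prefix and not suffix:
                continue
            if not (word.startswith(prefix) and word.endswith(suffix)):
                continue
            stem = word[len(prefix): len(word) - len(suffix)]
            if stem and msa_accepts(stem, msa_lexicon):
                return (prefix, stem, suffix)
    return None


# Toy lexicon standing in for the MSA analyser's knowledge.
lexicon = {"يلعب", "يحب"}
result = analyse("بيلعب", lexicon)
```

Words that still return `None` after this step are the ones left for the segmenter.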

  14. The Results after the first layer: An example of output after first layer

  15. Contents • Introduction • Multi dialect Morphology Analyser • Adopt MSA morphology analyser • Segment unknown words • Check on web corpus • Conclusions and future work

  16. Multi dialect morphology analyser The second method: the segmenter • Segments the remaining words by extracting four candidate forms of each word • At this stage we do not yet know which one is correct
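The slide does not say what the four forms are. The sketch below assumes one plausible reading, purely for illustration: the full word, the word without its first letter, without its last letter, and without both. The function name and the one-letter affix length are assumptions, not details from the talk.

```python
def four_candidates(word):
    """Return up to four (prefix, stem, suffix) splits of a word.

    Assumed reading of "four shapes": full word, minus first letter,
    minus last letter, minus both. Splits with an empty stem are dropped.
    """
    splits = [
        ("", word, ""),
        (word[:1], word[1:], ""),
        ("", word[:-1], word[-1:]),
        (word[:1], word[1:-1], word[-1:]),
    ]
    return [s for s in splits if s[1]]


candidates = four_candidates("بيصطاد")
```

Each split reassembles to the original word, so a later step only has to decide which stem is the real one.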

  17. Contents • Introduction • Multi dialect Morphology Analyser • Adopt MSA morphology analyser • Segment unknown words • Check on web corpus • Conclusions and future work

  18. Multi dialect morphology analyser The third method: Use web corpus • الواد حيلعب الكورة / وولدي ما بيلعب كمان / ولدي ماهيلعب • FULL WORD usage: DISAGREED between Arab countries in many cases • However, STEM usage: AGREED between Arab countries in many cases • ابني بيحب يلعب الكورة / ولدي مابيحبش يلعب الكورة / وولدي مابيحب يلعب الكورة كمان • So

  19. Multi dialect morphology analyser The third method (cont.) According to the hypothesis, we check the frequency in the web corpus: • Full word: بيصطاد (16500); Prefix: ب; Suffix: none; Stem: يصطاد (800000) • Full word: بيتارجح (2850); Prefix: ب; Suffix: none; Stem: يتارجح (212000) • Full word: بيتهجأ (5); Prefix: ب; Suffix: none; Stem: يتهجأ (10100) • Full word: بيركع (13100); Prefix: ب; Suffix: none; Stem: يركع (568000) • Then we choose the form with the greatest frequency if it is >= 10000
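The selection rule on this slide can be sketched as follows, using the frequencies shown above. The dictionary `WEB_FREQ` is a hypothetical stand-in for a real web-corpus lookup, and `choose_form` is an illustrative name; how the authors query the corpus is not stated in the transcript.

```python
# Frequencies copied from the slide; a real system would query a web corpus instead.
WEB_FREQ = {
    "بيصطاد": 16500, "يصطاد": 800000,
    "بيتارجح": 2850, "يتارجح": 212000,
    "بيتهجأ": 5, "يتهجأ": 10100,
    "بيركع": 13100, "يركع": 568000,
}

THRESHOLD = 10000  # minimum frequency stated on the slide


def choose_form(forms, freq=WEB_FREQ, threshold=THRESHOLD):
    """Pick the most frequent candidate form, or None if it is below the threshold."""
    best = max(forms, key=lambda form: freq.get(form, 0))
    return best if freq.get(best, 0) >= threshold else None


chosen = choose_form(["بيصطاد", "يصطاد"])
```

In all four slide examples the stem is far more frequent than the full word, which matches the hypothesis that stems are shared across dialects while full words are not.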

  20. The final Results: An example of the output after last layer

  21. Last focus

  22. Last focus

  23. Contents • Introduction • Multi dialect Morphology Analyser • Adopt MSA morphology analyser • Segment unknown words • Check on web corpus • Conclusions and future work

  24. Conclusions and future work • Work on a larger corpus • Deal with diacritisation • Add more linguistic rules, both to the adopted MSA morphological analyser and to the web search, to improve accuracy

  25. Any questions? Thank you
