Arabic morphology and POS-tagging

A short intro with a couple of demonstrations. Outline: Arabic morphology: overview of the problem; prior art, with a demonstration of Buckwalter’s AraMorph; sketch of enhancements to AraMorph; demonstration; future directions.

kreeli

Presentation Transcript


  1. UW CLMA • Arabic morphology and POS-tagging • A short intro with a couple of demonstrations

  2. Outline • Arabic morphology: overview of the problem • Prior art, with a demonstration of Buckwalter’s AraMorph • Sketch of enhancements to AraMorph • Demonstration • Future directions

  3. Arabic morphology: overview of the problem • Short vowels are not represented • The contrast between diphthongs and long vowels is not represented • Most closed-class morphemes are written as affixes to the content-word categories: nouns, adjectives, verbs and prepositions

  4. Arabic morphology: overview (cont.) • Some examples (glossing over a lot of detail): • شاهد الرجل الفيلم فرجع إلى البيت • $Ahd Alrjl Alfylm frjE {lA Albyt • $aaHada r-rajul-u l-fiylm-a fa-rajaEa {ilaA l-bayt-i • saw-3sg.m the-man-nom the-film-acc and-so-returned-3sg.m to the-house-gen • The man watched the film and then went home. • This example is not so bad

  5. Regular expressions for orthographic words • (conj)? (enclitic_preposition)? noun_stem (plural) (possessive_pronoun) • (conj)? (definiteness marker)? noun_stem (plural)? • (conj)? full_word_preposition (genitive_pronoun)? • (conj)? complementizer (object_pron)? • (conj)? (modal)? (ImpVerbSubjAgr) verb_stem (plural_subject_marker)? (object_pronoun)? • (conj)? (modal)? verb_stem (perfVerbSubjAgr)? (object_pronoun)?
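
The slot templates on this slide translate fairly directly into ordinary regular expressions. Below is a minimal Python sketch of the second template only, (conj)? (definiteness marker)? noun_stem (plural)?, over Buckwalter-transliterated input. The morpheme inventories and the name `parse_noun` are illustrative assumptions, not taken from the slides or from AraMorph, and the stem pattern is a placeholder where a real analyzer would consult a lexicon.

```python
import re

# Hypothetical, deliberately tiny morpheme inventories (Buckwalter
# transliteration). A real analyzer would draw these from a lexicon.
CONJ = r"(?P<conj>[wf])?"            # wa- 'and', fa- 'and so'
DET = r"(?P<det>Al)?"                # definite article
STEM = r"(?P<stem>[A-Za-z]{2,}?)"    # placeholder: any 2+ letters, lazy
PLURAL = r"(?P<plural>wn|yn|At)?"    # sound plural endings

# Template: (conj)? (definiteness marker)? noun_stem (plural)?
NOUN_WORD = re.compile("^" + CONJ + DET + STEM + PLURAL + "$")

def parse_noun(word):
    """Return the slot fillers of the first regex match, or None."""
    m = NOUN_WORD.match(word)
    return m.groupdict() if m else None

print(parse_noun("wAlbyt"))    # 'and the house'
print(parse_noun("AlmElmwn"))  # 'the teachers'
print(parse_noun("fylm"))      # 'film', misread as f + ylm
```

A regex returns only one match, so `fylm` comes back with `f` split off as a conjunction. Without a stem lexicon the template overgenerates, which is one reason a lexicon-driven analyzer is needed on top of the patterns.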

  6. Inherent ambiguity • Some strings with multiple analyses • فقد = fqd: either the verb • fqd = he lost • OR • f qd = and so (verbal modal) • فقد سمعته = fqd smEth can be analyzed as • a) f qd smE t h (and so I had heard him) • b) fqd smEp h (he lost his reputation)
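
The fqd ambiguity can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the two-entry prefix and stem tables and the function name `analyses` are invented for illustration and are far smaller than any real lexicon.

```python
# Toy lexicon in Buckwalter transliteration; entries and glosses are
# illustrative assumptions, not AraMorph's actual tables.
PREFIXES = {"": "", "f": "and so"}
STEMS = {"fqd": "he lost", "qd": "(verbal modal)"}

def analyses(word):
    """Enumerate every prefix + stem split licensed by the toy lexicon."""
    results = []
    for i in range(len(word) + 1):
        prefix, stem = word[:i], word[i:]
        if prefix in PREFIXES and stem in STEMS:
            results.append((prefix, stem))
    return results

print(analyses("fqd"))  # both readings: the bare verb, and f + qd
```

Both splits are licensed, so the choice between them has to come from context rather than from the word itself.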

  7. Other issues beyond the scope of this talk • Arabic spans 14 centuries and 22 countries • It is the liturgical language of over 1 billion Muslims • The Standard Language has never been a spoken variety. • The vernaculars have never been standardized. • The LDC corpus is the only annotated corpus that is readily available. • The last time I looked, the treebank part was less than a million tokens

  8. Prior Art • Buckwalter’s AraMorph from LDC (a port from work done @ Xerox) • Ported to Java on top of Lucene(!) by Pierrick Brihaye circa 2003: http://cvs.savannah.gnu.org/viewvc/aramorph • Tagset and segmentation description: http://www.ldc.upenn.edu/Catalog/docs/LDC2003T06/POS-info.txt • Buckwalter’s transliteration scheme: http://www.qamus.org/transliteration.htm
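
For readers unfamiliar with the transliteration scheme linked on this slide, here is a minimal Python sketch of the Arabic-to-Buckwalter direction. It covers only the plain letters in two contiguous Unicode ranges. The hamza forms, diacritics and tatweel are left out for brevity, so this is a partial table, and `to_buckwalter` is a hypothetical helper name.

```python
# Partial Buckwalter table built from two contiguous Unicode ranges:
# U+0627..U+063A (alef .. ghain) and U+0641..U+064A (feh .. yeh).
# Hamza forms, diacritics and tatweel are intentionally omitted.
ARABIC = [chr(c) for c in range(0x0627, 0x063B)] + \
         [chr(c) for c in range(0x0641, 0x064B)]
LATIN = "AbptvjHxd*rzs$SDTZEg" + "fqklmnhwYy"
assert len(ARABIC) == len(LATIN)
AR2BW = dict(zip(ARABIC, LATIN))

def to_buckwalter(text):
    """Map each known Arabic letter; pass anything else through unchanged."""
    return "".join(AR2BW.get(ch, ch) for ch in text)

print(to_buckwalter("\u0634\u0627\u0647\u062f"))  # sheen alef heh dal -> $Ahd
print(to_buckwalter("\u0641\u0642\u062f"))        # feh qaf dal -> fqd
```

The mapping is one-to-one, so the reverse direction is just the inverted dictionary.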

  9. And now a demonstration of AraMorph • The point here is that most word strings have more than one legal analysis. • The other point is that the number of types is quite high, unless you do something to reveal the content word behind all the function-morpheme affixes. • Kitaab (book) • Al-kitaab (the book) • These two queries in Arabic return different sets of results on Google

  10. A few words WRT AraMorph • AraMorph will generate all the legal analyses for which it has an entry in its lexicon • Pierrick Brihaye ported AraMorph to Java • AraMorph is the first stage in a lot of Arabic text processing done by researchers in the US. • The Java port was done on top of Lucene, which is an open source indexing and IR system

  11. Enhancements to AraMorph • I built this POS tagger in stages on top of Pierrick Brihaye’s port of AraMorph • The first thing I did was to bring in a bigram model of segmented text from the LDC • This was used to choose the most likely segmentation sequence out of all of the analyses returned by Buckwalter’s analyzer
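
The idea of ranking candidate segmentations with a bigram model can be sketched as follows. Everything here is an assumption made for illustration: the four-sentence training set is invented, add-one smoothing is my choice, and the function names are hypothetical. The slides do not say how the actual model was estimated or smoothed.

```python
import math
from collections import Counter

# Invented toy corpus of segmented text (Buckwalter transliteration).
TRAIN = [
    ["f", "qd", "smE", "t", "h"],
    ["w", "qd", "rjE"],
    ["fqd", "Al", "rjl", "smEp", "h"],
    ["f", "qd", "rjE"],
]

unigrams, bigrams = Counter(), Counter()
for sent in TRAIN:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
VOCAB_SIZE = len(unigrams)

def logprob(segments):
    """Add-one smoothed bigram log-probability of a segment sequence."""
    tokens = ["<s>"] + list(segments) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + VOCAB_SIZE))
        for a, b in zip(tokens, tokens[1:])
    )

def best_segmentation(candidates):
    """Return the candidate segment sequence with the highest score."""
    return max(candidates, key=logprob)

print(best_segmentation([("f", "qd", "rjE"), ("rjE", "qd", "f")]))
```

A raw product of bigram probabilities favors analyses with fewer segments. Whether the original tagger corrected for that is not stated in the slides.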

  12. Architecture (as it evolved) • With a 5-word sliding window • generate all sequences of segmentations for that 5-word window • based on all the analyses returned by AraMorph. • This scheme produced acceptable results • Some time later a trigram model of the tags was added and • given 50% weight with the segmentation scores to decide which tags to keep with the segments
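
The windowed search described on this slide can be sketched in Python. Several things are assumptions: the slide does not say how the 50% weight was applied, so the sketch uses a linear mix of two log scores; the scoring functions are passed in as stand-ins for the bigram segment model and the trigram tag model; and the tag names in the usage example are made up.

```python
from itertools import product

WINDOW = 5         # words per window, as on the slide
TAG_WEIGHT = 0.5   # the slide's 50% weight for the tag model

def windows(words, size=WINDOW):
    """Yield overlapping windows of up to `size` consecutive words."""
    for i in range(max(1, len(words) - size + 1)):
        yield words[i:i + size]

def best_window(analyses_per_word, seg_score, tag_score):
    """Pick the best combination of per-word analyses for one window.

    analyses_per_word: one list per word, each holding (segments, tags) pairs.
    seg_score, tag_score: callables returning a log score for a flat list
    of segments or tags (stand-ins for the bigram and trigram models).
    """
    best, best_score = None, float("-inf")
    for combo in product(*analyses_per_word):
        segments = [s for segs, _ in combo for s in segs]
        tags = [t for _, tgs in combo for t in tgs]
        score = ((1 - TAG_WEIGHT) * seg_score(segments)
                 + TAG_WEIGHT * tag_score(tags))
        if score > best_score:
            best, best_score = combo, score
    return best

# Toy usage with made-up tags and trivial scorers.
candidates = [
    [(("fqd",), ("VERB",)), (("f", "qd"), ("CONJ", "PART"))],
    [(("smEp", "h"), ("NOUN", "PRON")), (("smE", "t", "h"), ("VERB", "SUBJ", "PRON"))],
]
print(best_window(candidates, lambda segs: -len(segs), lambda tags: 0.0))
```

Exhaustive enumeration is affordable only because the window is short. The number of combinations is the product of the per-word analysis counts, which is presumably why the window was capped at five words.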

  13. This bears some similarity to other work done in 2005 • Habash, Nizar and Owen Rambow. Arabic Tokenization, Morphological Analysis, and Part-of-Speech Tagging in One Fell Swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05). • Their team used Ripper (Cohen, 1996) to learn a rule-based classifier (Rip). • They also used AraMorph as their starting point to produce all legal morphological sequences. • http://www.mt-archive.info/ACL-2005-Habash-1.pdf

  14. How well does the POS tagger perform? • Good question, still TBD • I meant to hold out some of the training data and test against that piece of the LDC corpus. • I ran out of time • Hand analysis puts it at better than 90%. • At some point I turned on the option to keep the vowels provided by AraMorph instead of tossing them. • This is observably less accurate

  15. First: a word from my sponsor • I’m allowed to talk about this system • I was told that I could expose its functionality on a website • I am not allowed to distribute it or use it for commercial purposes • There is an earlier tagger that does not incorporate Lucene or AraMorph. It is based on Brill’s transformation-based (TB) learning @ http://innerbrat.org/segmentTagDownload

  16. The demos • Tag to Buckwalter transliteration output • Tag to ENAMEX-style tags • Tag to UTF-8 Arabic • Re-attaching the segments • Reduced tagset • Reloading the dictionary every time is annoying • Tag with a server and thin client

  17. Future directions • Any further work will require me to rebuild everything from scratch • Uncouple it from Lucene • Port it to C++ or C# • Bring in a statistical language model or two for recovering the short vowels. • Use some state-of-the-art machine learning toolkits to improve performance • Start annotating some of my corpora

  18. Future directions • See if I can embed it in some practical applications such as • language teaching document production • preprocessing for machine translation systems • preprocessing ASR • Text to speech • Bootstrap annotation tools for other Afro-Asiatic languages • Tigrinya, Somali, Hausa, Hebrew, Arabic vernaculars, Amharic, Amazigh, Coptic, Egyptian hieroglyphs, Babylonian, Punic • Help with ODIN??

  19. The end
