Prague Arabic Dependency Treebank


  1. Prague Arabic Dependency Treebank: Introduction & Related Projects
     MALACH Workshop in Prague, August 28, 2003
     Otakar Smrž et al.

  2. PADT Project at a Glance
     • Dependency treebank of Modern Standard Arabic
     • Morphology 58,148 tokens
     • Analytical syntax 41,288 tokens
     • Tectogrammatical description in preparation
     • Experience of the Prague Dependency Treebank
     • Guidelines and annotations by Charles University
     • Since 2001 ~ five annotators ~ three researchers
     • Cooperation with the Linguistic Data Consortium
     • Source corpora, morphological analyzer, workshops
     Prague Arabic Dependency Treebank: Introduction & Related Projects

  3. Presentation Outline
     • Introductory issues in Arabic
     • Morphology and the writing system
     • Elementary syntactic constructs
     • LDC Arabic Treebank
     • Reference to ConDep conversion
     • Prague Arabic Dependency Treebank
     • Progress in the project, applications
     • Related projects and perspectives
     • Exchange of tools and ideas
     • Workshops and cooperation

  4. Arabic Language and Script
     • Semitic language, inner flexion and concatenation, consonantal roots, weak derivation patterns
     • Phonemic script, non-vocalized script, word tying, other omissions
     Example: readings of the written string فهم الافراد
       fahima al-'afra~du    the members understood
       fuhima al-'afra~du    the members were understood
       fahima al-'afra~da    he understood the members
       fahmu al-'ifra~di     understanding to the isolation
       fa-hum al-'afra~du    and they are the individuals
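
The ambiguity of the non-vocalized script can be illustrated with a minimal sketch. It assumes the four vocalizations of the first word that the slide transliterates (fahima, fuhima, fahmu, fa-hum), spelled here with Unicode escapes, and removes the marks in the range U+064B to U+0652 the way ordinary print omits them. The name `strip_diacritics` is illustrative and does not come from any PADT tool.

```python
# Arabic diacritics normally omitted in print:
# tanwin (U+064B-U+064D), fatha, damma, kasra, shadda, sukun (U+064E-U+0652).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}

FEH, HEH, MEEM = "\u0641", "\u0647", "\u0645"
FATHA, DAMMA, KASRA, SUKUN = "\u064E", "\u064F", "\u0650", "\u0652"

# Vocalized spellings of the readings transliterated on the slide.
READINGS = {
    "fahima (understood)": FEH + FATHA + HEH + KASRA + MEEM + FATHA,
    "fuhima (was understood)": FEH + DAMMA + HEH + KASRA + MEEM + FATHA,
    "fahmu (understanding)": FEH + FATHA + HEH + SUKUN + MEEM + DAMMA,
    "fa-hum (and they)": FEH + FATHA + HEH + DAMMA + MEEM + SUKUN,
}


def strip_diacritics(text):
    """Drop short vowels and related marks, keeping only the letters."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


# All readings collapse to the same three-letter surface string.
surface_forms = {strip_diacritics(word) for word in READINGS.values()}
print(len(READINGS), "readings share", len(surface_forms), "surface form")
```

A disambiguator has to recover which of these readings was intended from context alone, which is why the next slide treats tokenization and morphological disambiguation as prerequisites for syntactic annotation.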

  5. Morphology Issues
     • Arabic strings are extremely ambiguous
     • Short vowels, consonantal geminations, glottal-stop marks etc. normally omitted in the script
     • Strings need not correspond to single words
     • Morphonological changes increase the homonymy
     • Tokenization of input surface strings
     • Necessary pre-requisite to analytical annotation
     • Requires morphological disambiguation
     • Lexicon update, foreign names and terms
     • Use those analyzers which are flexible in this respect

  6. Elementary Syntax Issues
     • Mostly VSO in verbal sentences, but …
     • … not so in clauses with non-verbal predication
     • … neither if topicalizers are present
     • Non-verbal predication of several types
     • Verbal nature of some nominal formations
     • Grammatical co-reference, accusative of the inner object
     • Complex referencing, rich expressions

  7. Dependency Formalism
     Example tree nodes, given as form [analytical function] gloss:
       da~ma [Pred] lasted
       iqtira~Hu [Sb] proposal
       sa~Eatayni [Adv] two-hours[acc.]
       -hu [Atr] his
       al-Eamali~yata [Obj] the-operation[acc.]
       Eala~ [AuxP] on
       kabi~run [Pnom] a-big
       zumala~'i [Obj] colleagues
       al-baytu [Sb] the-house
       -hi [Atr] his
       la- [PredP] for
       -hu [Obj] him
       baytun [Sb] a-house[nom.]
       a~mili~na [???] hoping[acc.]
       qubu~la [Obj] accepting[acc.]
       -kum [Atr] your
       daEwata [Obj] invitation[acc.]
       -na~ [Atr] our

  8. Constituency X Dependency
     Constituency (Linguistic Data Consortium, University of Pennsylvania)
     • Non-terminal nodes + text tokens
     • Constituent labeling on non-terminals
     • Slots and traces
     Dependency (CCL & IFAL & ICL, Charles University in Prague)
     • Sentence root node + text tokens
     • Analytical function for every tree node
     • Government and roles
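
The dependency side of this comparison (a sentence root, one node per token, an analytical function on every node) can be sketched as a small data structure. The forms, functions and glosses below are taken from the Dependency Formalism slide; the parent links are an assumption, since the tree drawings did not survive in the transcript, and the class and function names are hypothetical rather than PADT's actual file format.

```python
from dataclasses import dataclass


@dataclass
class Node:
    form: str    # token in the slide's transliteration
    afun: str    # analytical function, e.g. Pred, Sb, Atr
    gloss: str   # English gloss from the slide
    parent: int  # 1-based index of the governing node; 0 = sentence root


# Assumed attachments: the figure itself is not preserved in the transcript.
NODES = [
    Node("da~ma", "Pred", "lasted", 0),
    Node("iqtira~Hu", "Sb", "proposal", 1),
    Node("-hu", "Atr", "his", 2),
    Node("sa~Eatayni", "Adv", "two-hours", 1),
]


def children(nodes, head):
    """Return the 1-based indices of the nodes governed by `head`."""
    return [i for i, node in enumerate(nodes, start=1) if node.parent == head]


def render(nodes, head=0, depth=0):
    """List the subtree under `head` as indented 'form [afun] gloss' lines."""
    lines = []
    for i in children(nodes, head):
        node = nodes[i - 1]
        lines.append("  " * depth + f"{node.form} [{node.afun}] {node.gloss}")
        lines.extend(render(nodes, i, depth + 1))
    return lines


print("\n".join(render(NODES)))
```

Storing only a parent index per token keeps the representation independent of word order, in contrast to a constituency tree, which needs extra non-terminal nodes and traces to express the same relations.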

  9. Model Arabic Phrase I
     • Trace of the antecedent subject
     • Compound function of the head of the clause (outer and inner perspectives)
     • Free word-order compliant

  10. Model Arabic Phrase II
     • Sister-like co-ordination
     • Conjunction of co-ordination
     • Status constructus

  11. LDC Arabic Treebanking
     • Arabic Treebank: Part 1, version 2.0 (syntax)
       • 160,275 words, 4,113 trees
     • Arabic Treebank: Part 2, version 1.0 (morphology)
       • 144,199 words, 2,591 paragraphs
     • Arabic Treebank: Part 1, Arabic-English Parallel
       • 10K-word parallel translation
     • Arabic Gigaword
       • Agence France Presse, Al Hayat, Al Nahar, Xinhua
       • 391,619,000 words, 1,256,719 documents

  12. PADT Annotation Progress
     • AFP Data Exchange Experiment
       • Dependency annotation of LDC’s ~10k words
       • 12,936 nodes, 374 trees (34.6 nodes per tree)
       • Additional Xerox morphological annotation
     • UMMAH Corpus Annotation
       • Morphology with the LDC tools, ~50k words
       • 45,212 nodes, 1,039 trees (43.5 nodes per tree)
       • Dependency annotation, ~30k words ready
       • 28,352 nodes, 646 trees (43.9 nodes per tree)
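
The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below recomputes the nodes-per-tree averages and, on the assumption that the AFP data counts toward both annotation layers, compares the sums with the totals quoted on the "PADT Project at a Glance" slide (58,148 and 41,288). The variable names are illustrative.

```python
# (nodes, trees) exactly as reported on the slide.
AFP = (12936, 374)
UMMAH_MORPHOLOGY = (45212, 1039)
UMMAH_DEPENDENCY = (28352, 646)


def nodes_per_tree(nodes, trees):
    """Average tree size, rounded to one decimal as on the slide."""
    return round(nodes / trees, 1)


averages = [nodes_per_tree(*part)
            for part in (AFP, UMMAH_MORPHOLOGY, UMMAH_DEPENDENCY)]

# Assumption: AFP nodes are included in both the morphological and the
# analytical totals reported on the overview slide.
morphology_total = AFP[0] + UMMAH_MORPHOLOGY[0]
analytical_total = AFP[0] + UMMAH_DEPENDENCY[0]

print(averages, morphology_total, analytical_total)
```

Under that assumption both sums reproduce the overview figures, which suggests the two slides count the same units.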

  13. Algorithm Progress
     • Constituency-Dependency Transformation
       • Based on the AFP Exchange Experiment
       • EACL ’03 Research Note
     • Arabic Dependency Parser & Analytical Function Assignment
       • Incorporated into the annotation process
       • Machine-learning methods involved
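
For readers unfamiliar with constituency-to-dependency conversion, the general idea can be sketched with head rules: each phrase selects one child as its head, and the heads of the remaining children become its dependents. This is the generic textbook technique, not the procedure described in the EACL ’03 note; the rule table, the labels and the bracketing of the example are all hypothetical.

```python
# Hypothetical head rules: for each phrase label, the child labels that may
# serve as its head, in priority order. Not the rules used for PADT.
HEAD_RULES = {
    "S": ["VP", "NP"],
    "VP": ["V"],
    "NP": ["N"],
    "PP": ["P"],
}


def to_dependencies(node, arcs):
    """Return the head word of `node`; append (dependent, head) pairs to `arcs`.

    A node is (label, word) for a preterminal or (label, [children]) otherwise.
    Words stand in for token identities, so repeated words would need indices.
    """
    label, children = node
    if isinstance(children, str):
        return children
    heads = [to_dependencies(child, arcs) for child in children]
    labels = [child[0] for child in children]
    head_index = 0  # fall back to the first child when no rule applies
    for wanted in HEAD_RULES.get(label, []):
        if wanted in labels:
            head_index = labels.index(wanted)
            break
    for i, word in enumerate(heads):
        if i != head_index:
            arcs.append((word, heads[head_index]))
    return heads[head_index]


# Hypothetical bracketing of a VSO clause built from tokens on the slides.
TREE = ("S", [("VP", [("V", "da~ma"),
                      ("NP", [("N", "iqtira~Hu")]),
                      ("NP", [("N", "sa~Eatayni")])])])

arcs = []
root = to_dependencies(TREE, arcs)
print(root, arcs)
```

A real conversion additionally has to resolve traces and assign analytical functions to the arcs, which a head table alone does not cover.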

  14. Application Progress
     • TrEd Tree Editor
       • Highly powerful and reusable annotation tool
     • NetGraph Tree Search
       • Extra version for Arabic
       • Server/Client system architecture
     • Perl Modules
       • AG2MorphoXML, MorphoMap, Encode::Arabic

  15. TrEd Tree Editor
     • Perl and Perl/Tk interactive application or batch processor
     • General editor for trees and tree-like graphs
     • Analytical dependency annotation
     • Tectogrammatical dependency annotation
     • Phrase-structure trees, MT solution forests …
     • Comparison of parser/human results
     • Language and platform independent

  16. NetGraph Tree Search
     • Java client, C server
     • Interactive tree search, viewing, counting …
     • Query in the form of a generalized subtree
     • Server-side data search, client-side rendering
     • Dependency trees, phrase-structure trees, trees
     • Linguistic research, verifying of hypotheses
     • Quick & easy system, language and platform independent
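
The notion of a query as a generalized subtree can be illustrated with a toy matcher. This is only a sketch of the idea under simplifying assumptions: trees and queries are plain dictionaries, a query node matches when all of its attributes agree, and each query child must match some child of the tree node. It does not reproduce NetGraph's actual query language, and the attachments in the sample tree are assumed.

```python
def matches(node, query):
    """True if `node` satisfies the attributes and child patterns of `query`."""
    for key, value in query.items():
        if key != "children" and node.get(key) != value:
            return False
    return all(
        any(matches(child, pattern) for child in node.get("children", []))
        for pattern in query.get("children", [])
    )


def search(node, query):
    """Collect every node in the tree under `node` that matches `query`."""
    found = [node] if matches(node, query) else []
    for child in node.get("children", []):
        found.extend(search(child, query))
    return found


# Sample tree built from tokens on the Dependency Formalism slide.
TREE = {
    "form": "da~ma", "afun": "Pred", "children": [
        {"form": "iqtira~Hu", "afun": "Sb", "children": [
            {"form": "-hu", "afun": "Atr"},
        ]},
        {"form": "sa~Eatayni", "afun": "Adv"},
    ],
}

# "Find every subject that governs an attribute."
QUERY = {"afun": "Sb", "children": [{"afun": "Atr"}]}
hits = [node["form"] for node in search(TREE, QUERY)]
print(hits)
```

Running such a search on the server and sending back only the matching trees corresponds to the division of labour the slide describes.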

  17. Perl Modules
     • AG2MorphoXML
       • Token reconstruction from morpheme sequence
       • Various readings/annotations, Prague XML
     • MorphoMap
       • Conversion from AraMorph multi-word POS tags to positional/bit-vector compact description
     • Encode::Arabic
       • Incorporation of Buckwalter and ArabTeX transliterations into the useful Encode module
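
What a Buckwalter transliteration codec does can be sketched in a few lines. The example below is written in Python for illustration and does not use or mirror the API of the Perl module Encode::Arabic; it covers only a small part of the Buckwalter table (enough for the string used earlier in the talk), and the function names are hypothetical.

```python
# Partial Buckwalter table: ASCII symbol -> Arabic code point.
# Only a subset of the full scheme is listed here.
BUCKWALTER_TO_ARABIC = {
    "A": "\u0627", "b": "\u0628", "t": "\u062A", "d": "\u062F",
    "r": "\u0631", "f": "\u0641", "l": "\u0644", "m": "\u0645",
    "n": "\u0646", "h": "\u0647", "w": "\u0648", "y": "\u064A",
    "a": "\u064E", "u": "\u064F", "i": "\u0650",  # short vowels
    "~": "\u0651", "o": "\u0652",                 # shadda, sukun
}
ARABIC_TO_BUCKWALTER = {ar: bw for bw, ar in BUCKWALTER_TO_ARABIC.items()}


def decode(text):
    """Buckwalter ASCII -> Arabic script; unknown characters pass through."""
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


def encode(text):
    """Arabic script -> Buckwalter ASCII; unknown characters pass through."""
    return "".join(ARABIC_TO_BUCKWALTER.get(ch, ch) for ch in text)


arabic = decode("fhm AlAfrAd")
print(arabic, encode(arabic))
```

Because the mapping is one-to-one, the conversion is lossless in both directions, which is what makes such a transliteration convenient for tools that handle only ASCII.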

  18. Related Projects Prague/Penn
     • Tectogrammatical description guidelines
     • Excellent PhD students joining the project
     • Taggers, parsers, tree-node classifiers
     • AraMorph re-implementation, spell-checkers
     • Dictionaries, on-line or printed
     • Projects in CR, USA and the Netherlands
     • ACE named entity annotation
     • Currently in LDC, “included” in tectogrammatics
     • LDC’s CallHome & CallFriend for Arabic Dialects

  19. Workshops Penn/Prague
     • Philadelphia, July 2002
       • Setting up, POS tool demo, intro to descriptions
       • AFP data exchange experiment
     • Prague, May 2003
       • Reports, tutorials on applications and theories
       • Morphology improvements, Arabic Gigaword
       • Tool exchange and data revision plans
     • Lisbon, April 2004
       • Open workshop proposed for the LREC ’04
       • Publication of the projects, the results & consequences