
A Unified Database of Dependency Treebanks: Integrating, Quantifying & Evaluating Dependency Data

Olga Pustylnikov, Alexander Mehler, Bielefeld University. Motivation: exploring similarities among languages by means of syntactic treebanks. We collected a database covering 11 languages.


Presentation Transcript


  1. A Unified Database of Dependency Treebanks: Integrating, Quantifying & Evaluating Dependency Data. Olga Pustylnikov, Alexander Mehler, Bielefeld University

  2. Motivation • Exploring similarities among languages by means of syntactic treebanks • We collected a database covering 11 languages • The treebanks were developed separately by different research projects • Quantitative investigations on these treebanks -> the need for unification

  3. Motivation: Demands on the unified format of treebanks (+) generic: able to represent as many treebanks as possible (+) extensible to new treebanks (+) complete: preserving all corpus-specific information (+) transferable to other kinds of corpora (–) complex: exhibiting minimal complexity -> graph representations

  4. Motivation: GXL (Holt et al., 2006) • The Graph eXchange Language (GXL) is a graph model representing corpora in terms of graphs [Diagram labels: XML, multimodal data, GXL, eGXL, tools, wiki, treebanks] • GXL can be applied to any kind of corpus. (See e.g. Mehler and Gleim (2005), Ferrer i Cancho et al. (2007), Pustylnikov and Mehler (2008))

  5. Agenda

  6. eGXL: 2-level data model
     Types graph:
       <graph id="Types">
         <node id="POS" />
         <node id="t245" name="VERB" />
         …
       </graph>
     Sentences graph (each pos attribute is an IDREF to a node of the Types graph):
       <graph id="Sentences">
         <graph id="g8">
           <node id="s8_1" form="Detta" pos="t151" />
           <node id="s8_2" form="vill" pos="t245" />
           ...
           <rel>
             <relend direction="in" target="s8_2" />
             <relend direction="out" target="s8_1" />
           </rel>
           ...
         </graph>

  7. The eGXL Sentences-graph
     [Dependency tree diagram over the tokens: vill, ., Detta, jag, bestämt, bemöta]
       <graph id="Sentences">
         <graph id="g8">
           <node id="s8_1" form="Detta" pos="t151" />
           <node id="s8_2" form="vill" pos="t245" />
           ...
           <rel>
             <relend direction="in" target="s8_2" />
             <relend direction="out" target="s8_1" />
           </rel>
           ...
         </graph>
     Annotations:
       node: each token of a treebank
       form: word form
       pos: an IDREF to the POS-node of the Types-graph
       rel: a (syntactic) relation from (e.g. a head verb) to (e.g. a dependent argument)
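
The two-level model from slides 6 and 7 can be read back with a few lines of Python. This is only a minimal sketch, not the authors' tooling: the `<egxl>` wrapper element, the helper name `read_sentence`, and the tag name `PRON` for `t151` are assumptions made for illustration (the slides only name `t245` as `VERB`). It also assumes, as the example suggests, that the `relend` with `direction="in"` marks the head and the one with `direction="out"` marks the dependent.

```python
import xml.etree.ElementTree as ET

# Toy document modeled on the slide's snippet. The <egxl> root and the
# name "PRON" for t151 are hypothetical; only t245 = VERB is in the slides.
EGXL = """
<egxl>
  <graph id="Types">
    <node id="POS" />
    <node id="t151" name="PRON" />
    <node id="t245" name="VERB" />
  </graph>
  <graph id="Sentences">
    <graph id="g8">
      <node id="s8_1" form="Detta" pos="t151" />
      <node id="s8_2" form="vill" pos="t245" />
      <rel>
        <relend direction="in" target="s8_2" />
        <relend direction="out" target="s8_1" />
      </rel>
    </graph>
  </graph>
</egxl>
"""


def read_sentence(root, sentence_id):
    """Return (tokens, edges) for one sentence graph.

    tokens maps a node id to (word form, POS name), resolving the pos
    IDREF against the Types graph. edges is a list of (head, dependent)
    node ids, assuming direction="in" marks the head.
    """
    types_graph = root.find("graph[@id='Types']")
    pos_names = {n.get("id"): n.get("name") for n in types_graph.iter("node")}

    sentence = root.find(f"graph[@id='Sentences']/graph[@id='{sentence_id}']")
    tokens = {
        n.get("id"): (n.get("form"), pos_names.get(n.get("pos")))
        for n in sentence.findall("node")
    }

    edges = []
    for rel in sentence.findall("rel"):
        ends = {e.get("direction"): e.get("target") for e in rel.findall("relend")}
        edges.append((ends["in"], ends["out"]))
    return tokens, edges


root = ET.fromstring(EGXL)
tokens, edges = read_sentence(root, "g8")
for head, dependent in edges:
    print(tokens[head], "->", tokens[dependent])
```

Because the POS inventory lives in the Types graph, the reader never needs to know a treebank's tagset in advance; it only follows the IDREF.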

  8. Agenda

  9. 11 Dependency Treebanks, 7 different formats

  10. Input vs. Output Formats: Examples from Dutch, Swedish, Italian treebanks

  11. Unification is possible due to the separation of the core from the secondary parts
     Types graph (diversity):
       <graph id="Types">
         <node id="POS" />
         <node id="t245" name="VERB" />
         …
       </graph>
     Sentences graph (commonality):
       <graph id="Sentences">
         <graph id="g8">
           <node id="s8_1" form="Detta" pos="t151" />
           <node id="s8_2" form="vill" pos="t245" />
           ...
           <rel>
             <relend direction="in" target="s8_2" />
             <relend direction="out" target="s8_1" />
           </rel>
           ...
         </graph>
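
To illustrate why this separation makes unification feasible, the sketch below converts generic token rows into the two graphs. It is a hedged illustration under stated assumptions, not the converter used for the database: the function name `to_egxl`, the `<egxl>` wrapper element, the `(form, pos_tag, head_index)` input layout, and the generated `t1`, `t2`, ... identifiers are all hypothetical. As in the slide's example, the `relend` with `direction="in"` is assumed to point at the head.

```python
import xml.etree.ElementTree as ET


def to_egxl(sentences):
    """Build an eGXL-like tree from generic dependency rows.

    sentences maps a sentence number to a list of
    (form, pos_tag, head_index) rows; head_index is 1-based and 0 means
    the token has no head. This input layout is an assumption.
    """
    root = ET.Element("egxl")  # hypothetical wrapper element
    types = ET.SubElement(root, "graph", id="Types")
    ET.SubElement(types, "node", id="POS")
    sents = ET.SubElement(root, "graph", id="Sentences")

    tag_ids = {}  # treebank-specific tag -> node id in the Types graph
    for sid, rows in sentences.items():
        graph = ET.SubElement(sents, "graph", id=f"g{sid}")
        # Tokens: the tagset goes to the Types graph, tokens only refer to it.
        for i, (form, pos_tag, _head) in enumerate(rows, start=1):
            if pos_tag not in tag_ids:
                tag_ids[pos_tag] = f"t{len(tag_ids) + 1}"
                ET.SubElement(types, "node", id=tag_ids[pos_tag], name=pos_tag)
            ET.SubElement(graph, "node", id=f"s{sid}_{i}",
                          form=form, pos=tag_ids[pos_tag])
        # Relations: one <rel> per head-dependent pair.
        for i, (_form, _pos_tag, head) in enumerate(rows, start=1):
            if head == 0:
                continue
            rel = ET.SubElement(graph, "rel")
            ET.SubElement(rel, "relend", direction="in", target=f"s{sid}_{head}")
            ET.SubElement(rel, "relend", direction="out", target=f"s{sid}_{i}")
    return root


example = {8: [("Detta", "PRON", 2), ("vill", "VERB", 0)]}
print(ET.tostring(to_egxl(example), encoding="unicode"))
```

Only the tag inventory differs between treebanks in this sketch; the node and relation structure of the Sentences graph stays the same whatever the input format was.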

  12. The TreebankWiki http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/

  13. Agenda

  14. Complexity of eGXL: Logical Scaling Factor (LSF): the number of logical elements (e.g. XML elements) required to represent a treebank unit (e.g. a word form, POS etc.) [Chart labels: node, rel, other, eGXL]
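
The slide defines the LSF only in words, so the following is one possible reading rather than the authors' measurement: count the XML elements of a serialization and divide by the number of word forms it encodes. The function names `element_counts` and `elements_per_token`, and the choice to treat every element carrying a `form` attribute as one treebank unit, are assumptions for this sketch.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Fragment modeled on the slide's Sentences graph (two tokens, one relation).
SAMPLE = """
<graph id="Sentences">
  <graph id="g8">
    <node id="s8_1" form="Detta" pos="t151" />
    <node id="s8_2" form="vill" pos="t245" />
    <rel>
      <relend direction="in" target="s8_2" />
      <relend direction="out" target="s8_1" />
    </rel>
  </graph>
</graph>
"""


def element_counts(xml_text):
    """Count XML elements per tag name, including the root element."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter())


def elements_per_token(xml_text):
    """Total XML elements divided by the number of word forms.

    A word form is assumed to be any element with a 'form' attribute.
    """
    root = ET.fromstring(xml_text)
    total = sum(1 for _ in root.iter())
    tokens = sum(1 for el in root.iter() if el.get("form") is not None)
    return total / tokens


print(element_counts(SAMPLE))
print(elements_per_token(SAMPLE))
```

Breaking the count down by tag, as `element_counts` does, mirrors the node/rel/other split in the chart labels and makes it possible to compare where different formats spend their elements.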

  15. Agenda

  16. DTDB

  17. Agenda

  18. Conclusions • A database covering 11 languages • eGXL – a generic XML graph model adapted to syntactic treebanks • Use of treebanks within a single application (Ariadne) olga.pustylnikov@uni-bielefeld.de alexander.mehler@uni-bielefeld.de ruediger.gleim@uni-bielefeld.de SFB 673 Thank you for your attention!
