Tectogrammatical Representation of English in Prague Czech-English Dependency Treebank
Lucie Mladová, Silvie Cinková, Kristýna Čermáková, Anja Nedoluzhko, Jana Šindlerová, Josef Toman, Zdeněk Žabokrtský

Presentation Transcript


  1. 3rd PIRE Meeting: Tectogrammatical Representation of English in Prague Czech-English Dependency Treebank. Lucie Mladová, Silvie Cinková, Kristýna Čermáková, Anja Nedoluzhko, Jana Šindlerová, Josef Toman, Zdeněk Žabokrtský

  2. Outline:
     • Functional Generative Description
     • Parallel Treebanks
     • PCEDT 2.0 – Project Report
       • tectogrammatical level of annotation
       • valency treatment
       • annotation manual for English
       • interannotator agreement

  3. Functional Generative Description
     • Basic approach for the Prague treebanks
     • dependency
     • stratificational description of the language:
       • from structure to function (meaning), in 3 layers of annotation:
         • morphological
         • analytical (= surface syntax)
         • tectogrammatical (= "deep" syntax, semantics)

  4. Functional Generative Description
     • Since 1995: Prague Dependency Treebank (PDT), Czech data (1.0 released by LDC in 2001, 2.0 by LDC in 2006)
     • The idea of a parallel corpus: English data plus Czech data (translated), the Prague Czech-English Dependency Treebank (PCEDT) (1.0 released by LDC in 2004)

  5. The Idea of a Parallel, Syntactically Annotated Corpus
     • Build an English corpus in the same formalism as PDT (data source: Wall Street Journal section of the Penn Treebank)
     • Translate it into Czech
     • Annotate both parts of the corpus manually
     • Train tectogrammar-based machine translation

  6. Phrasal vs. Dependency Tree
     Example sentence: Mr. Payson, an art dealer and collector, sold Vincent van Gogh's "Irises" at a Sotheby's auction in November 1987 to Australian businessman Alan Bond.

  7. Dependency Trees: a-layer = surface syntax; t-layer = underlying syntax, semantics
     Example sentence: It may have been painted instead by a Rubens associate.

  8. Dependency Trees: a-layer = surface syntax; t-layer = underlying syntax, semantics
     Example sentence: It may have been painted instead of Rubens by a Rubens associate.

  9. Tectogrammatical Representation (t-tree)
     Contains:
     • syntactic dependency and coordination: edges
     • semantic relations: tectogrammatical functors
       • verb arguments (inner participants)
         • semantic ACT, PAT
         • syntactic ADDR, ORIG, EFF
       • free modifications (e.g. TWHEN, LOC, DIR, MANN, CAUS, CPR, ACMP...)
       • other: rhematizers, idiomatic expressions, foreign phrases...
     • valency of the verbs: valency lexicon EngValLex

  10. Tectogrammatical Representation (t-tree)
      Contains:
      • links to the lower layers
      • grammatical (and textual) coreference
      • topic-focus articulation
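
A minimal Python sketch of how a t-tree node with the properties listed above might be modelled. The class name `TNode`, its field names, and the a-layer ids are hypothetical and chosen for illustration; they are not the PCEDT data format. The functor assignment for the example sentence is likewise illustrative, not taken from the treebank.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TNode:
    """Hypothetical t-layer node (illustrative, not the PCEDT schema)."""
    t_lemma: str
    functor: str                                        # e.g. "PRED", "ACT", "PAT"
    children: List["TNode"] = field(default_factory=list)
    a_links: List[str] = field(default_factory=list)    # ids of linked a-layer nodes
    coref_target: Optional["TNode"] = None              # coreference link, if any

    def add(self, child: "TNode") -> "TNode":
        """Attach a dependent node and return it."""
        self.children.append(child)
        return child


def functors(node: TNode) -> List[Tuple[str, str]]:
    """Depth-first list of (t_lemma, functor) pairs."""
    pairs = [(node.t_lemma, node.functor)]
    for child in node.children:
        pairs.extend(functors(child))
    return pairs


# "It may have been painted instead by a Rubens associate." (simplified)
# Auxiliaries and the preposition get no node of their own; they are only
# reachable through the links to the a-layer (made-up ids below).
root = TNode("paint", "PRED", a_links=["a-may", "a-have", "a-been", "a-painted"])
root.add(TNode("it", "PAT", a_links=["a-it"]))
root.add(TNode("associate", "ACT", a_links=["a-by", "a-associate"]))
```

Calling `functors(root)` walks the tree and returns the lemma and functor of each node, which is the kind of view an annotator compares against the valency lexicon.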

  11. Building the PCEDT 2.0, the Current Annotation of the English Data
      work with the corpus data
      • input: WSJ texts (PTB), approx. 50,000 sentences (1.2 million words), automatically converted into a PDT-like shape (a-layer)
      • automatic t-layer processing
      • manual annotation running (approx. 4,000 trees annotated)
      • meanwhile: t-layer annotation of the Czech section launched
      additional work
      • conversion of the PropBank lexicon into EngValLex (verbs only)
      • tools adjustment (TrEd, unified macros for both Czech and English annotation)
      • interannotator agreement measuring
      • first version of the annotation manual (being revised)
      • training of new annotators

  12. EngValLex
      • adaptation of PropBank into the format of PDT-Vallex (valency lexicon for Czech)
      • manual correction
      • continuous checking during the annotation
      • current version contains only verbs
      future work on EngValLex:
      • defining surface realizations: morphosyntactic characteristics of the semantic roles
      • valency of nouns and adjectives
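
The idea of checking annotation against a valency lexicon can be sketched as follows. This is an assumption-laden illustration: the classes `Slot` and `ValencyFrame`, the `missing` method, and the frame given for "sell" are hypothetical and do not reproduce an actual EngValLex entry or file format.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Slot:
    """One valency slot: a functor plus whether it is obligatory."""
    functor: str
    obligatory: bool = True


@dataclass(frozen=True)
class ValencyFrame:
    """Hypothetical valency frame for one verb sense."""
    lemma: str
    slots: Tuple[Slot, ...]

    def missing(self, present_functors: Iterable[str]) -> List[str]:
        """Return obligatory functors that the annotated node lacks."""
        present = set(present_functors)
        return [s.functor for s in self.slots
                if s.obligatory and s.functor not in present]


# Illustrative frame only, not copied from EngValLex.
SELL = ValencyFrame("sell", (Slot("ACT"), Slot("PAT"), Slot("ADDR", obligatory=False)))

# A verb node annotated with an actor and a time modifier, but no patient.
print(SELL.missing(["ACT", "TWHEN"]))
```

A check of this kind would flag trees where an obligatory participant has neither been annotated nor restored as a generated node.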

  13. Annotation Manual = "Annotation of English on the tectogrammatical level: Reference book"
      • based on the abbreviated version of the annotation manual for PDT (Czech)
      • chapters specific to English data annotation added
      • first rough version 1.0.1: April 2007
      • revision in progress
      • extensions planned (concurrently with the annotation)

  14. Interannotator Agreement
      • monthly check of annotation consistency
      • approx. 30 trees
      • measured:
        • structure: agreement in parent node
        • functors
      • further analysis:
        • list of unpaired nodes
        • statistics for diverging functors
      • elimination of detected annotation divergences at annotator meetings
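
The two measures and the follow-up analysis listed above can be sketched in Python. This is a simplified illustration, not the project's evaluation script: the function name `agreement`, the input format (node id mapped to a parent id and a functor), and the assumption that node ids are already aligned between annotators are all hypothetical. In practice, pairing nodes across two annotations is itself a non-trivial step.

```python
from collections import Counter


def agreement(ann_a, ann_b):
    """Compare two annotations of the same tree.

    Each annotation maps node_id -> (parent_id, functor).
    Node ids are assumed to be aligned between the annotators.
    """
    paired = sorted(set(ann_a) & set(ann_b))
    unpaired = sorted(set(ann_a) ^ set(ann_b))   # nodes only one annotator has
    if not paired:
        raise ValueError("no paired nodes to compare")

    same_parent = sum(ann_a[n][0] == ann_b[n][0] for n in paired)
    same_functor = sum(ann_a[n][1] == ann_b[n][1] for n in paired)
    diverging = Counter(
        (ann_a[n][1], ann_b[n][1]) for n in paired if ann_a[n][1] != ann_b[n][1]
    )
    return {
        "structure": same_parent / len(paired),   # agreement in parent node
        "functors": same_functor / len(paired),
        "unpaired": unpaired,
        "diverging_functors": diverging,
    }


# Toy data: the annotators disagree on one parent and one functor,
# and annotator B has one extra node.
annotator_a = {"n1": (None, "PRED"), "n2": ("n1", "ACT"),
               "n3": ("n1", "PAT"), "n4": ("n3", "LOC")}
annotator_b = {"n1": (None, "PRED"), "n2": ("n1", "ACT"),
               "n3": ("n1", "EFF"), "n4": ("n1", "LOC"), "n5": ("n1", "TWHEN")}
result = agreement(annotator_a, annotator_b)
```

For the toy data, both the structural and the functor agreement come out at 0.75 over the four paired nodes, `n5` is reported as unpaired, and the only diverging functor pair is PAT versus EFF.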

  15. Average Interannotator Agreement

  16. Future goals
      • annotation expansion
        • 500 trees/annotator/month
      • increasing (or at least maintaining) the interannotator agreement
      • training of new annotators
      • EngValLex precision
      • annotation manual precision and expansion

  18. Acknowledgements
      The work on the PCEDT project is supported by the grants PIRE ČR ME838 and GA405/06/0589.
      Thank you for your attention!
