
Syntactically annotated corpora of Estonian



  1. Syntactically annotated corpora of Estonian. Heli Uibo, Institute of Computer Science, University of Tartu, Heli.Uibo@ut.ee

  2. Outline • Who? • Why? • Three initiatives: • CG-corpus • Sofie Parallel Treebank • Arborest • What next?

  3. Who are we? Kaili Müürisep, PhD; Tiina Puolakainen, PhD; Mare Koit, PhD; Tiit Roosmaa, PhD; Kadri Muischnek, M.A.; Heli Uibo, M.Sc.; Andriela Rääbis, M.A.; Heili Orav, M.A.; Kaarel Kaljurand, M.Sc.; plus students of computational linguistics (experienced in shallow syntactic annotation of texts)

  4. Why do we need syntactically annotated corpora? • To evaluate language technology software (tools for information retrieval and extraction, automatic summarization, machine translation) • To build a new, up-to-date description of Estonian syntax that takes real language usage into account

  5. Three syntactically annotated corpora for Estonian 1. Constraint Grammar (CG) Corpus • size – 200 000 running words (ca. 15 000 sentences) • 184 000 words of Estonian original fiction • 10 000 words of newspaper texts • 6 000 words of legal texts • shallow annotation, using Constraint Grammar: a syntactic function is determined for every word-form

  6. Three syntactically annotated corpora for Estonian (2) Two small-scale experimental treebanks: 2. Sofie Parallel Treebank – a Penn-style phrase structure treebank of 50 sentences 3. Arborest – a VISL-style hybrid treebank of 2500 sentences (first 149 sentences manually revised)

  7. Constraint Grammar Corpus • Built to train and test the Constraint Grammar shallow syntactic parser ESTCG • Currently the precision of ESTCG is 76.4–79.2% and its recall is 95.5–96.9%.
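
The gap between recall and precision is typical of a Constraint Grammar parser that leaves some words with more than one syntactic tag. The sketch below assumes the usual CG evaluation convention, which the slides do not spell out: recall is the share of words whose correct tag survives among the retained tags, and precision is the share of retained tags that are correct. The function name and the toy data are hypothetical.

```python
def precision_recall(predicted, gold):
    """predicted: one set of retained tags per word; gold: one correct tag per word.

    Assumed convention (not stated in the slides):
      recall    = words whose correct tag was retained / all words
      precision = correct retained tags / all retained tags
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same words")
    correct = sum(1 for tags, g in zip(predicted, gold) if g in tags)
    retained = sum(len(tags) for tags in predicted)
    recall = correct / len(gold) if gold else 0.0
    precision = correct / retained if retained else 0.0
    return precision, recall


# Toy data: two of the five words remain syntactically ambiguous.
predicted = [{"@SUBJ"}, {"@+FMV"}, {"@NN>", "@OBJ"}, {"@AN>"}, {"@PRD", "@SUBJ"}]
gold = ["@SUBJ", "@+FMV", "@NN>", "@AN>", "@PRD"]
p, r = precision_recall(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f}")
```

In this toy case every correct tag is retained, so recall is perfect, while the two leftover ambiguities pull precision down.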

  8. ESTCG: Syntactic tags @SUBJ – subject @OBJ – object @PRD – predicative @ADVL – adverbial @+FMV, @-FMV, @+FCV, @-FCV – parts of the predicate @AN> @<AN – adjective as attribute @NN> @<NN – noun as attribute, apposition @AD> @<AD – adverb as attribute @Q> @<Q – complements of a quantifier @P> @<P – complements of an adposition ...

  9. CG-corpus: example
     <s>
     Mitmekesisus
         mitme_kesi=sus+0 //_S_ com sg nom #cap // **CLB @SUBJ
     on
         ole+0 //_V_ main indic pres ps3 sg ps af #FinV #Intr // @+FMV
     elu
         elu+0 //_S_ com sg gen // @NN>
     vaieldamatu
         vaieldamatu+0 //_A_ pos sg nom // @AN>
     omapära
         oma_pära+0 //_S_ com sg nom // @PRD
     $.
         . //_Z_ Fst //
     </s>
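
As an illustration of how the annotation in this example can be read mechanically, here is a minimal sketch of a parser for it. It assumes a layout in which each word-form sits on its own line and each analysis on an indented line below it, with `//` separating the lemma, the morphological tags, and the syntactic tags; the function name `parse_sentence` and the record fields are hypothetical.

```python
import re

# The slide's example sentence, laid out one word-form per line (assumed layout).
SAMPLE = """<s>
Mitmekesisus
    mitme_kesi=sus+0 //_S_ com sg nom #cap // **CLB @SUBJ
on
    ole+0 //_V_ main indic pres ps3 sg ps af #FinV #Intr // @+FMV
elu
    elu+0 //_S_ com sg gen // @NN>
vaieldamatu
    vaieldamatu+0 //_A_ pos sg nom // @AN>
omapära
    oma_pära+0 //_S_ com sg nom // @PRD
$.
    . //_Z_ Fst //
</s>"""


def parse_sentence(text):
    """Return one record per analysis line: word-form, lemma, POS, syntactic tags."""
    records = []
    form = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped in ("<s>", "</s>"):
            continue
        if not line[0].isspace():
            form = stripped  # an unindented line introduces a new word-form
            continue
        parts = [p.strip() for p in line.split("//")]
        lemma = parts[0]
        morph = parts[1] if len(parts) > 1 else ""
        syntax = parts[2] if len(parts) > 2 else ""
        pos_match = re.match(r"_([A-Za-z]+)_", morph)
        records.append({
            "form": form,
            "lemma": lemma,
            "pos": pos_match.group(1) if pos_match else None,
            # keep only function tags; markers such as **CLB are skipped
            "tags": [t for t in syntax.split() if t.startswith("@")],
        })
    return records


tokens = parse_sentence(SAMPLE)
for tok in tokens:
    print(tok["form"], tok["pos"], tok["tags"])
```

A word-form with several analysis lines would simply yield several records, which is how remaining ambiguity could be detected.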

  10. CG-corpus: the process of extending the corpus • Input: morphologically hand-annotated text • Automatic syntactic analysis (ESTCG parser) • Hand-correction by two linguists working in parallel (annotation manual + GUI-based annotation tool) • Automatic comparison of the two versions • Discussion of problematic cases • Creation of the final version
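
The automatic comparison step can be pictured roughly as follows. This is only a sketch, assuming each linguist's version is available as a list of (word-form, syntactic tag) pairs over the same tokens; the slides do not describe the actual comparison tool, and the function name and the disagreement shown are invented for illustration.

```python
def compare_annotations(version_a, version_b):
    """Return the positions where two annotators disagree, plus the agreement rate."""
    if len(version_a) != len(version_b):
        raise ValueError("both versions must annotate the same tokens")
    disagreements = []
    for i, ((word_a, tag_a), (word_b, tag_b)) in enumerate(zip(version_a, version_b)):
        if word_a != word_b:
            raise ValueError(f"token mismatch at position {i}: {word_a!r} vs {word_b!r}")
        if tag_a != tag_b:
            disagreements.append((i, word_a, tag_a, tag_b))
    total = len(version_a)
    agreement = 1 - len(disagreements) / total if total else 1.0
    return disagreements, agreement


# Hypothetical versions from two annotators; the second differs on the last word.
annotator_1 = [("Mitmekesisus", "@SUBJ"), ("on", "@+FMV"), ("elu", "@NN>"),
               ("vaieldamatu", "@AN>"), ("omapära", "@PRD")]
annotator_2 = [("Mitmekesisus", "@SUBJ"), ("on", "@+FMV"), ("elu", "@NN>"),
               ("vaieldamatu", "@AN>"), ("omapära", "@SUBJ")]

disagreements, agreement = compare_annotations(annotator_1, annotator_2)
for position, word, tag_a, tag_b in disagreements:
    print(f"token {position} ({word}): {tag_a} vs {tag_b}")
```

The list of disagreements is what would feed the discussion of problematic cases before the final version is fixed.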

  11. Sofie Parallel Treebank • Sofie Parallel Treebank is being developed within the Nordic Treebank Network, which is funded by the NorFA language technology programme and joins 15 academic institutions from Sweden, Norway, Denmark, Finland, Estonia and Iceland. • Material – the 1st chapter of Jostein Gaarder's novel "Sophie's World". • Currently, the parallel treebank includes Swedish, German, Norwegian, Estonian and two versions of Danish, with 50-100 sentences for each language.

  12. Sofie Parallel Treebank (cont'd) • The syntactic structure represented in the trees of the different languages is not uniform: • Danish: Discontinuous Grammar dependency treebank and VISL-style phrase structure treebank • Swedish: dependency treebank • German: NEGRA-style treebank • Norwegian: phrase structure treebank • Estonian: Penn-style phrase structure treebank. • The representation format of the trees is TIGER XML.
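
To give a feel for TIGER XML, the sketch below reads one sentence graph and prints it as a bracketed tree using Python's standard `xml.etree.ElementTree`. The fragment follows the general TIGER XML layout (terminals `t`, nonterminals `nt`, `edge` elements pointing at children), but the tree itself, the ids, and the category and POS labels are made up for illustration and are not taken from the Sofie treebank.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: structure and labels are assumptions.
SAMPLE = """
<s id="s1">
  <graph root="s1_500">
    <terminals>
      <t id="s1_1" word="Her" pos="ADV"/>
      <t id="s1_2" word="begynte" pos="V"/>
      <t id="s1_3" word="den" pos="DET"/>
      <t id="s1_4" word="dype" pos="ADJ"/>
      <t id="s1_5" word="skogen" pos="N"/>
    </terminals>
    <nonterminals>
      <nt id="s1_501" cat="NP">
        <edge label="--" idref="s1_3"/>
        <edge label="--" idref="s1_4"/>
        <edge label="--" idref="s1_5"/>
      </nt>
      <nt id="s1_500" cat="S">
        <edge label="--" idref="s1_1"/>
        <edge label="--" idref="s1_2"/>
        <edge label="--" idref="s1_501"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""


def bracketed(sentence_xml):
    """Render one TIGER XML sentence graph as a bracketed string."""
    sentence = ET.fromstring(sentence_xml)
    graph = sentence.find("graph")
    words = {t.get("id"): t.get("word") for t in graph.iter("t")}
    nodes = {nt.get("id"): nt for nt in graph.iter("nt")}

    def render(node_id):
        if node_id in words:  # terminal: just the word
            return words[node_id]
        node = nodes[node_id]
        children = " ".join(render(e.get("idref")) for e in node.findall("edge"))
        return f"({node.get('cat')} {children})"

    return render(graph.get("root"))


print(bracketed(SAMPLE))
```

The sketch visits children in the order the edges are listed, which matches surface order here; real treebank files may need the children sorted by terminal position, especially for discontinuous constituents.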

  13. Estonian part of Sofie treebank: how did we do it? • Trees drawn on paper by K. Muischnek and H. Nigol. • “Electronic” trees drawn by H. Uibo and K. Kaljurand with the ANNOTATE tool, using the Penn treebank tagset • Database of trees exported from ANNOTATE in NEGRA format • TigerRegistry and TigerSearch used to convert the trees into TIGER XML • Website of Sofie Parallel Treebank: http://omilia.uio.no/sofie

  14. Sample trees from Sofie treebank Her begynte den dype skogen.

  15. Straks Sofie hadde lukket porten bak seg, åpnet hun konvolutten.

  16. Sofie Parallel Treebank – example from web-interface Sophie's father was the captain of a big oil tanker, and was away for most of the year.

  17. Arborest • Joint work with Dr. Eckhard Bick, University of Southern Denmark • VISL-style experimental treebank • Annotated for both function (S = subject, P = predicate, O = object, A = adverbial, STA = statement, QUE = question, etc.) and form (np, vp, pp, advp, adjp, fcl = finite clause, par = paratagma, etc.)

  18. Arborest (cont'd) • Automatically generated from a sample of CG-corpus (2500 sentences) with CG→PSG rules • 149 sentences revised • 1/3 of sentences correct • CG→PSG rules are being improved • Webpage: http://corp.hum.sdu.dk/arborest.html

  19. Arborest – sample tree

  20. What next? • To enlarge all three syntactically annotated corpora. • To improve the CG-to-PSG rules so that an Estonian treebank can be built more easily in a semi-automatic way. • To create another, syntactic-semantic dependency treebank for Estonian, which will be semi-automatically generated from one of the existing experimental phrase structure treebanks. → How much semantic information can be derived from the syntactic dependency structure?
