An example of parallel corpora as currently being constructed for linguistic research
An example of parallel corpora as currently being constructed for linguistic research. The Multext-East "1984" Corpus http://nl.ijs.si/ME/CD/docs/1984.html. Corpus Markup COP Project 106 MULTEXT-East Work Package WP2 - Task 2.3 Deliverable D2.3 F Final Report 21 December 1997.



The Multext-East "1984" Corpus

http://nl.ijs.si/ME/CD/docs/1984.html

Corpus Markup

COP Project 106 MULTEXT-East

Work Package WP2 - Task 2.3

Deliverable D2.3 F

Final Report

21 December 1997

http://nl.ijs.si/ME/CD/docs/mte-d23f/mte-D23F.html


Overview of the corpus


<!DOCTYPE text PUBLIC "-//CES//DTD cesDoc//EN">

<text>

<body lang=en id=Oen>

<div id="Oen.1" type=part n=1>

<div id="Oen.1.1" type=chapter n=1>

<p id="Oen.1.1.1">

<s id="Oen.1.1.1.1">

It was a bright cold day in April,

and the clocks were striking thirteen.

</s>

<s id="Oen.1.1.1.2">

<name type=person>Winston Smith</name>,

his chin nuzzled into his breast in an effort to escape the

vile wind, slipped quickly through the glass doors of

<name type=place>Victory Mansions</name>,

though not quickly enough to prevent a swirl of gritty dust

from entering along with him.

</s>

</p>


The aligned corpus uses the cesAna standard rather than cesDoc:

  • the TEXT is encoded as CHUNKLIST

  • the BODY is encoded as CHUNK

  • the DIV tags are omitted

  • the QUOTE tags are omitted

  • the P-level elements are encoded as PAR elements:

    • P is PAR, with implied TYPE;

    • HEAD elements, if present, are encoded as PAR TYPE=HEAD;

    • LIST and POEM elements can be omitted; if present, they are encoded as PAR TYPE=LIST and PAR TYPE=POEM respectively

  • the S-level elements are encoded as S elements:

    • S is S, with implied TYPE;

    • ITEM and L, if present, are marked as S TYPE=ITEM and S TYPE=L.

  • P-level and S-level IDs are referred to in the FROM attribute of PAR and S.

  • the Q tags are omitted

  • other cesDoc (sub-S level) tags such as DATE, NAME, ABBR, etc., are encoded as values of the CLASS attribute of the TOK (token) element.
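
The mapping in the list above can be summarized as a small lookup table. This is only an illustrative sketch: the names `STRUCTURAL`, `OMITTED`, `SUB_S` and `to_cesana` are hypothetical and are not part of any MULTEXT-East tool, and the sub-S set contains only the three tags named in the list.

```python
# Hypothetical summary of the cesDoc -> cesAna mapping described above.
# A TYPE of None means the TYPE attribute is implied (left out).
STRUCTURAL = {
    "TEXT": ("CHUNKLIST", "TEXT"),
    "BODY": ("CHUNK", "BODY"),
    "P": ("PAR", None),
    "HEAD": ("PAR", "HEAD"),
    "LIST": ("PAR", "LIST"),
    "POEM": ("PAR", "POEM"),
    "S": ("S", None),
    "ITEM": ("S", "ITEM"),
    "L": ("S", "L"),
}

# Tags that are dropped entirely in the cesAna encoding.
OMITTED = {"DIV", "QUOTE", "Q"}

# Sub-S tags named in the list; the "etc." in the source is not expanded here.
SUB_S = {"DATE", "NAME", "ABBR"}


def to_cesana(tag):
    """Return a description of the cesAna encoding, or None if the tag is omitted."""
    tag = tag.upper()
    if tag in OMITTED:
        return None
    if tag in STRUCTURAL:
        element, type_value = STRUCTURAL[tag]
        return element if type_value is None else f"{element} TYPE={type_value}"
    if tag in SUB_S:
        # Sub-S tags survive only as a CLASS value on the token element.
        return f"TOK CLASS={tag}"
    raise KeyError(f"no mapping listed for cesDoc tag {tag!r}")


for name in ("p", "head", "div", "name"):
    print(name, "->", to_cesana(name))
```

Unknown tags raise an error rather than being guessed, since the list does not enumerate every sub-S element.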


<chunkList type=TEXT lang=en>

<chunk type=BODY lang=en>

<par from='Oen.1.1.1'>

<s from='Oen.1.1.1.1'>

<tok type=WORD><orth>It</orth></tok>

<tok type=WORD><orth>was</orth></tok>

<tok type=WORD><orth>a</orth></tok>

<tok type=WORD><orth>bright</orth></tok>

<tok type=WORD><orth>cold</orth></tok>

<tok type=WORD><orth>day</orth></tok>

<tok type=WORD><orth>in</orth></tok>

<tok type=WORD><orth>April</orth></tok>

<tok type=PUNCT><orth>,</orth><ctag>COMMA</ctag></tok>

<tok type=WORD><orth>and</orth></tok>

<tok type=WORD><orth>the</orth></tok>

<tok type=WORD><orth>clocks</orth></tok>

<tok type=WORD><orth>were</orth></tok>

<tok type=WORD><orth>striking</orth></tok>

<tok type=WORD><orth>thirteen</orth></tok>

<tok type=PUNCT><orth>.</orth><ctag>PERIOD</ctag></tok>

</s>

Used for stand-off annotations
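
As a rough illustration of how a sentence ends up in this token-per-line form, the sketch below splits a sentence with a simple regular expression and emits `tok` elements in the same unquoted SGML style as the excerpt. It is an assumption-laden toy, not the tokenizer used by the project: the function names are hypothetical, and only the two `ctag` values visible in the excerpt (COMMA, PERIOD) are known.

```python
import re
from xml.sax.saxutils import escape

# Only the two punctuation ctags that appear in the excerpt; others are not guessed.
PUNCT_CTAGS = {",": "COMMA", ".": "PERIOD"}


def tokenize(sentence):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def to_tok(token):
    """Render one token as a cesAna-style <tok> element."""
    orth = f"<orth>{escape(token)}</orth>"
    if token in PUNCT_CTAGS:
        return f"<tok type=PUNCT>{orth}<ctag>{PUNCT_CTAGS[token]}</ctag></tok>"
    if re.fullmatch(r"[^\w\s]", token):
        # Punctuation without a known ctag: emit the token without one.
        return f"<tok type=PUNCT>{orth}</tok>"
    return f"<tok type=WORD>{orth}</tok>"


def encode_sentence(sentence, source_id):
    """Wrap the tokens in an <s> whose FROM attribute points back at the cesDoc ID."""
    lines = [f"<s from='{source_id}'>"]
    lines.extend(to_tok(t) for t in tokenize(sentence))
    lines.append("</s>")
    return "\n".join(lines)


print(encode_sentence(
    "It was a bright cold day in April, and the clocks were striking thirteen.",
    "Oen.1.1.1.1",
))
```

The `from` attribute is what makes the annotation stand-off: the tokenized file refers to the sentence ID in the cesDoc text instead of repeating its markup.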


<p id="Oen.1.1.7">

<s id="Oen.1.1.7.1"><name type=org>Ministry of Truth</name>,

&mdash;

<name type=org lang=ns>Minitrue</name>,

in

<name type=language>Newspeak</name>

<ptr id="Oen.1.1.7.1.4" target="Oen.1.1.8" rend=asterisk>

&mdash; was startlingly different from any other object in sight.</s>

<s id="Oen.1.1.7.2">It

was an enormous pyramidal structure of glittering white concrete,

soaring up, terrace after terrace,

<num>300</num>

metres into the air.</s>

<s id="Oen.1.1.7.3">From where

<name type=person>Winston</name>

stood it was just possible to read, picked out on its white face in

elegant lettering, the three slogans of the

<name type=org>Party</name>:

<q id="Oen.1.1.7.3.3" rend="CE CA" type=slogan>War is peace</q>

<q id="Oen.1.1.7.3.4" rend="CE CA" type=slogan>Freedom is slavery</q>

<q id="Oen.1.1.7.3.5" rend="CE CA" type=slogan>Ignorance is strength.</q></s>

</p>

<note id="Oen.1.1.8" place=foot>

<name type=language>Newspeak</name>

was the official language of

<name type=place>Oceania</name>.

For an account of its structure and etymology see Appendix.

</note>


<p id="Oet.1.2.7">

<s id="Oet.1.2.7.1">

<name type=org> T&otilde;eministeerium </name> &mdash; <name type=language> uuskeeles </name> <ptr target=oet.N1 rend=asterisk> <name type=org> T&otilde;min </name> &mdash; erines rabavalt k&otilde;igest muust, mida oli n&auml;ha. </s>

<s id="Oet.1.2.7.2"> See oli tohutu kiiskavvalgest betoonist p&uuml;ramiidne ehitis, mis kerkis astanguliselt <num> 300 </num> meetri k&otilde;rgusele. </s>

<s id="Oet.1.2.7.3"> Sealt, kus <name type=person> Winston </name> seisis, seletas silm veel parajasti valgel seinal elegantses kirjas ilutsevat <name type=org> Partei </name> kolme loosungit:

<q id="Oet.1.2.7.3.3" rend=CA type=slogan> S&otilde;da on rahu </q>

<q id="Oet.1.2.7.3.4" rend=CA type=slogan> Vabadus on orjus </q>

<q id="Oet.1.2.7.3.5" rend=CA type=slogan> Teadmatus on j&otilde;ud </q> </s> </p>


Alignment across languages in the corpus

The following hypothetical Slovene-English Orwell alignment illustrates the overall structure of a MULTEXT-East alignment document; each link shows one possible type of alignment (one, many, zero):

<!DOCTYPE cesAlign PUBLIC "-//CES//DTD cesAlign//EN">

<cesAlign version="4.1">

<linkList id="Oslen">

<linkGrp id="Oslen.1" type="body" targtype="s" domains="Osl Oen">

<link xtargets="Osl1.1 ; Oen1.1">

<link xtargets="Osl1.2 Osl1.3 ; Oen1.2">

<link xtargets="Osl1.4 ; ">

</linkGrp>

</linkList>

</cesAlign>

As can be seen, the only link group in the link list is of type BODY, its target type is S, and its domains are the Slovene and English Orwell. The first link represents a 1-1 alignment, the second a 2-1 alignment, and the third a 1-0 alignment.
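
The alignment type of a link can be read directly off its `xtargets` value by counting the IDs on each side of the semicolon. The following is a minimal sketch under that assumption; `parse_xtargets` and `alignment_type` are hypothetical helper names, and the link strings are shaped like the ones in the example rather than taken from the real corpus.

```python
def parse_xtargets(xtargets):
    """Split an xtargets value into (source IDs, target IDs)."""
    left, sep, right = xtargets.partition(";")
    if not sep:
        raise ValueError(f"expected ';' between the two sides: {xtargets!r}")
    # IDs on each side are whitespace-separated; an empty side yields [].
    return left.split(), right.split()


def alignment_type(xtargets):
    """Describe a link as 'n-m', e.g. '2-1' for two source sentences to one target."""
    source_ids, target_ids = parse_xtargets(xtargets)
    return f"{len(source_ids)}-{len(target_ids)}"


links = [
    "Osl1.1 ; Oen1.1",
    "Osl1.2 Osl1.3 ; Oen1.2",
    "Osl1.4 ; ",
]
for value in links:
    print(f"{value!r:30} {alignment_type(value)}")
```

An empty side simply produces a count of zero, which covers the 1-0 case without special handling.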


Corpus overview

Tag usage in Orwell's "1984"

