
An example of parallel corpora as currently being constructed for linguistic research

The Multext-East "1984" Corpus http://nl.ijs.si/ME/CD/docs/1984.html. Corpus Markup COP Project 106 MULTEXT-East Work Package WP2 - Task 2.3 Deliverable D2.3 F Final Report 21 December 1997.

kiele

Presentation Transcript


  1. An example of parallel corpora as currently being constructed for linguistic research

  2. The Multext-East "1984" Corpus (http://nl.ijs.si/ME/CD/docs/1984.html). Corpus Markup, COP Project 106 MULTEXT-East, Work Package WP2, Task 2.3, Deliverable D2.3 F, Final Report, 21 December 1997 (http://nl.ijs.si/ME/CD/docs/mte-d23f/mte-D23F.html).

  3. Overview of the corpus

  4. <!DOCTYPE text PUBLIC "-//CES//DTD cesDoc//EN">
     <text>
     <body lang=en id=Oen>
     <div id="Oen.1" type=part n=1>
     <div id="Oen.1.1" type=chapter n=1>
     <p id="Oen.1.1.1">
       <s id="Oen.1.1.1.1"> It was a bright cold day in April, and the clocks were striking thirteen. </s>
       <s id="Oen.1.1.1.2"> <name type=person>Winston Smith</name>, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of <name type=place>Victory Mansions</name>, though not quickly enough to prevent a swirl of gritty dust from entering along with him. </s>
     </p>

  5. The aligned corpus used the cesAna standard rather than cesDoc:
     • the TEXT is encoded as CHUNKLIST
     • the BODY is encoded as CHUNK
     • the DIV tags are omitted
     • the QUOTE tags are omitted
     • the P-level elements are encoded as PAR elements:
       • P is PAR, with implied TYPE;
       • HEAD elements, if present, are encoded as PAR TYPE=HEAD;
       • LIST and POEM elements can be omitted; if present, they are encoded as PAR TYPE=LIST and PAR TYPE=POEM respectively
     • the S-level elements are encoded as S elements:
       • S is S, with implied TYPE;
       • ITEM and L, if present, are marked as TYPE=ITEM and TYPE=L
     • P-level and S-level IDs are referred to in the FROM attribute of PAR and S
     • the Q tags are omitted
     • other cesDoc (sub-S level) tags such as DATE, NAME, ABBR, etc. are encoded as values of the CLASS attribute of the TOK (token) element
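The mapping rules on this slide can be summarized as a small lookup table. The sketch below is illustrative only: the function name `ces_ana_encoding` and the tuple layout are hypothetical, the TYPE values for TEXT and BODY are taken from the cesAna sample on the next slide rather than from the rules themselves, and treating every unlisted tag as a sub-S tag is a simplifying assumption.

```python
from typing import Optional, Tuple

# (cesAna element, attribute name, attribute value); None means "implied".
Encoding = Tuple[str, Optional[str], Optional[str]]

# cesDoc tags that the slide says are dropped in the aligned corpus.
OMITTED = {"DIV", "QUOTE", "Q"}

# Structural cesDoc tags and their cesAna counterparts.
STRUCTURAL = {
    "TEXT": ("CHUNKLIST", "TYPE", "TEXT"),  # TYPE value as in the slide-6 sample
    "BODY": ("CHUNK", "TYPE", "BODY"),      # TYPE value as in the slide-6 sample
    "P": ("PAR", None, None),               # TYPE is implied
    "HEAD": ("PAR", "TYPE", "HEAD"),
    "LIST": ("PAR", "TYPE", "LIST"),
    "POEM": ("PAR", "TYPE", "POEM"),
    "S": ("S", None, None),                 # TYPE is implied
    "ITEM": ("S", "TYPE", "ITEM"),
    "L": ("S", "TYPE", "L"),
}


def ces_ana_encoding(ces_doc_tag: str) -> Optional[Encoding]:
    """Return the cesAna encoding for a cesDoc tag, or None if it is omitted."""
    tag = ces_doc_tag.upper()
    if tag in OMITTED:
        return None
    if tag in STRUCTURAL:
        return STRUCTURAL[tag]
    # Assumption: anything else is a sub-S tag (DATE, NAME, ABBR, ...),
    # which becomes a CLASS value on the TOK element.
    return ("TOK", "CLASS", tag)


for name in ("div", "p", "head", "item", "name"):
    print(name, "->", ces_ana_encoding(name))
```
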

  6. <chunkList type=TEXT lang=en>
     <chunk type=BODY lang=en>
     <par from='Oen.1.1.1'>
     <s from='Oen.1.1.1.1'>
       <tok type=WORD><orth>It</orth></tok>
       <tok type=WORD><orth>was</orth></tok>
       <tok type=WORD><orth>a</orth></tok>
       <tok type=WORD><orth>bright</orth></tok>
       <tok type=WORD><orth>cold</orth></tok>
       <tok type=WORD><orth>day</orth></tok>
       <tok type=WORD><orth>in</orth></tok>
       <tok type=WORD><orth>April</orth></tok>
       <tok type=PUNCT><orth>,</orth><ctag>COMMA</ctag></tok>
       <tok type=WORD><orth>and</orth></tok>
       <tok type=WORD><orth>the</orth></tok>
       <tok type=WORD><orth>clocks</orth></tok>
       <tok type=WORD><orth>were</orth></tok>
       <tok type=WORD><orth>striking</orth></tok>
       <tok type=WORD><orth>thirteen</orth></tok>
       <tok type=PUNCT><orth>.</orth><ctag>PERIOD</ctag></tok>
     </s>

     Used for stand-off annotations
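As a rough illustration of how the token-level encoding relates back to the cesDoc sentence, the following sketch pulls the `from` ID and the `orth` strings out of a cesAna fragment and rebuilds the sentence text. It uses regular expressions because the sample is SGML with unquoted attribute values, which a strict XML parser would reject. The helper names `sentences` and `detokenize` are hypothetical, and the rule "attach punctuation to the preceding word" is a naive assumption that only suits simple cases like this one.

```python
import re

# The sentence from the slide, several tokens per line to save space.
CES_ANA = """
<par from='Oen.1.1.1'>
<s from='Oen.1.1.1.1'>
<tok type=WORD><orth>It</orth></tok> <tok type=WORD><orth>was</orth></tok>
<tok type=WORD><orth>a</orth></tok> <tok type=WORD><orth>bright</orth></tok>
<tok type=WORD><orth>cold</orth></tok> <tok type=WORD><orth>day</orth></tok>
<tok type=WORD><orth>in</orth></tok> <tok type=WORD><orth>April</orth></tok>
<tok type=PUNCT><orth>,</orth><ctag>COMMA</ctag></tok>
<tok type=WORD><orth>and</orth></tok> <tok type=WORD><orth>the</orth></tok>
<tok type=WORD><orth>clocks</orth></tok> <tok type=WORD><orth>were</orth></tok>
<tok type=WORD><orth>striking</orth></tok> <tok type=WORD><orth>thirteen</orth></tok>
<tok type=PUNCT><orth>.</orth><ctag>PERIOD</ctag></tok>
</s>
"""

SENT_RE = re.compile(r"<s from='([^']+)'>(.*?)</s>", re.DOTALL)
TOKEN_RE = re.compile(r"<tok type=(\w+)><orth>(.*?)</orth>")


def sentences(ces_ana):
    """Map each sentence's FROM id to its list of (type, orth) tokens."""
    return {sent_id: TOKEN_RE.findall(body)
            for sent_id, body in SENT_RE.findall(ces_ana)}


def detokenize(tokens):
    """Join orth strings, attaching PUNCT tokens to the preceding word."""
    text = ""
    for tok_type, orth in tokens:
        if text and tok_type != "PUNCT":
            text += " "
        text += orth
    return text


for sent_id, tokens in sentences(CES_ANA).items():
    print(sent_id, "->", detokenize(tokens))
```

The recovered ID (`Oen.1.1.1.1`) is the same one carried by the `<s>` element in the cesDoc version of the text, which is what lets the two encodings be linked.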

  7. <p id="Oen.1.1.7">
       <s id="Oen.1.1.7.1"><name type=org>Ministry of Truth</name>, &mdash; <name type=org lang=ns>Minitrue</name>, in <name type=language>Newspeak</name> <ptr id="Oen.1.1.7.1.4" target="Oen.1.1.8" rend=asterisk> &mdash; was startlingly different from any other object in sight.</s>
       <s id="Oen.1.1.7.2">It was an enormous pyramidal structure of glittering white concrete, soaring up, terrace after terrace, <num>300</num> metres into the air.</s>
       <s id="Oen.1.1.7.3">From where <name type=person>Winston</name> stood it was just possible to read, picked out on its white face in elegant lettering, the three slogans of the <name type=org>Party</name>:
         <q id="Oen.1.1.7.3.3" rend="CE CA" type=slogan>War is peace</q>
         <q id="Oen.1.1.7.3.4" rend="CE CA" type=slogan>Freedom is slavery</q>
         <q id="Oen.1.1.7.3.5" rend="CE CA" type=slogan>Ignorance is strength.</q></s>
     </p>
     <note id="Oen.1.1.8" place=foot> <name type=language>Newspeak</name> was the official language of <name type=place>Oceania</name>. For an account of its structure and etymology see Appendix. </note>

  8. <p id="Oet.1.2.7">
       <s id="Oet.1.2.7.1"> <name type=org> T&otilde;eministeerium </name> &mdash; <name type=language> uuskeeles </name> <ptr target=oet.N1 rend=asterisk> <name type=org> T&otilde;min </name> &mdash; erines rabavalt k&otilde;igest muust, mida oli n&auml;ha. </s>
       <s id="Oet.1.2.7.2"> See oli tohutu kiiskavvalgest betoonist p&uuml;ramiidne ehitis, mis kerkis astanguliselt <num> 300 </num> meetri k&otilde;rgusele. </s>
       <s id="Oet.1.2.7.3"> Sealt, kus <name type=person> Winston </name> seisis, seletas silm veel parajasti valgel seinal elegantses kirjas ilutsevat <name type=org> Partei </name> kolme loosungit:
         <q id="Oet.1.2.7.3.3" rend=CA type=slogan> S&otilde;da on rahu </q>
         <q id="Oet.1.2.7.3.4" rend=CA type=slogan> Vabadus on orjus </q>
         <q id="Oet.1.2.7.3.5" rend=CA type=slogan> Teadmatus on j&otilde;ud </q> </s>
     </p>

  9. Alignment across languages in the corpus. The following hypothetical Slovene-English Orwell alignment illustrates the overall structure of a MULTEXT-East alignment document; each link shows one possible type of alignment (one, many, zero):

     <!DOCTYPE cesAlign PUBLIC "-//CES//DTD cesAlign//EN">
     <cesAlign version="4.1">
     <linkList id="Oslen">
     <linkGrp id="Oslen.1" type="body" targtype="s" domains="Osl Oen">
       <link xtargets="Osl1.1 ; Oen1.1">
       <link xtargets="Osl1.2 Osl1.3 ; Oen1.2">
       <link xtargets="Osl1.4 ; ">
     </linkGrp>
     </linkList>
     </cesAlign>

     As can be seen, the only link group in the link list is of type BODY, its target type is S, and its domains are the Slovene and English Orwell. The first link represents a 1-1 alignment, the second a 2-1 alignment, and the third a 1-0 alignment.
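The alignment type of each link follows directly from how many IDs sit on either side of the semicolon in `xtargets`. A minimal sketch of that reading, assuming the link elements look exactly like those in the slide (the function name `alignment_types` is hypothetical, and the regular expression is not a general SGML parser):

```python
import re

SAMPLE = """
<linkGrp id="Oslen.1" type="body" targtype="s" domains="Osl Oen">
<link xtargets="Osl1.1 ; Oen1.1">
<link xtargets="Osl1.2 Osl1.3 ; Oen1.2">
<link xtargets="Osl1.4 ; ">
</linkGrp>
"""


def alignment_types(ces_align):
    """Return (source_ids, target_ids, 'n-m') for each <link> element."""
    results = []
    # xtargets holds two space-separated ID lists divided by a semicolon.
    for xtargets in re.findall(r'<link\s+xtargets="([^"]*)"', ces_align):
        source_part, _, target_part = xtargets.partition(";")
        source_ids = source_part.split()
        target_ids = target_part.split()
        kind = f"{len(source_ids)}-{len(target_ids)}"
        results.append((source_ids, target_ids, kind))
    return results


for source_ids, target_ids, kind in alignment_types(SAMPLE):
    print(kind, source_ids, target_ids)
# 1-1 ['Osl1.1'] ['Oen1.1']
# 2-1 ['Osl1.2', 'Osl1.3'] ['Oen1.2']
# 1-0 ['Osl1.4'] []
```

An empty side of the semicolon yields an empty list, which is how the 1-0 case (a Slovene sentence with no English counterpart) shows up.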

  10. Corpus overview: tag usage in Orwell's "1984"
