Automatic translation quality control using Eurovoc descriptors

Marko Tadić, Božo Bekavac (marko.tadic@ffzg.hr, http://www.hnk.ffzg.hr/mt/; bbekavac@ffzg.hr, http://www.hnk.ffzg.hr/bb/)

Department of Linguistics / Institute of Linguistics, Faculty of Philosophy, University of Zagreb (www.ffzg.hr, www.hnk.ffzg.hr)

JRC Ispra / Arona, 2005-09-27

Talk plan
  • motivation
  • automatic translation quality control
  • resources: glossary and test corpus
  • results
  • further directions
Motivation
  • Acquis communautaire (AC) is still being translated into Croatian
  • former Ministry of European Integrations (MEI), today Ministry of External Affairs and European Integrations
  • AC
    • ca 200,000 pages of EU OJ
    • AC corpus: from 8 Mw (Estonian) to 82 Mw (Spanish)
    • not precisely delimited (lawyers are working on that!)
    • constantly growing
    • legal texts
      • many repetitive and formulaic expressions
      • low polysemy of terms expected
Motivation 2
  • different EU accession candidates → different organization of the translation process
    • several years of work
    • large number of translators
    • in-house / outsourced (tenders)
    • large-scale document translation and revision
  • MEI
    • outsourcing to ca 100 translators or translating companies
    • use of a glossary with pre-established translation equivalents (TEs)
    • glossaries being translated in advance
      • Eurovoc
      • EU Law Glossary / Čtyřjazyčný slovník práva Evropské unie, Prague 1999
  • maintain the consistency of translation
    • by use of the same glossary only?
Preparing AC originals for translation
  • project proposed by our Institute to MEI in 2002
  • entries from glossary marked in original text before translation
  • signal of the existence of pre-established TE to the translator
  • obligatory usage of existing TE in legal texts, e.g.:
    • Council of Europe = Vijeće Europe
    • European Council = Europsko vijeće
    • Council of the European Union = Vijeće Europske unije
  • ...
  • AC had to be converted to XML
  • MEI dropped the project in 2003 for lack of funding
  • now: AC corpus in XML
Revision of Translation
  • largest effort was put on translation in all candidate countries
  • revision of translation always comes last
    • quality: consistency
  • task underestimated by all candidate countries
    • large portions of official translation of AC poorly revised
  • usually done
    • manually
    • simple search & replace commands
    • no terms/entries marked in texts
  • automatic approach?
    • lexical level and idiomatic level
Automatic Translation Quality Control
  • use a system to check whether all pre-established TEs are used
    • sentence aligned parallel corpus
    • glossary entries marked in
      • original text
      • translated text
  • if a glossary entry is found in the original but its TE is not found in the aligned translated sentence → the translation departs from the pre-established TE

  • e.g.:
    • Eurovoc: (en) President of the Commission = (hr) Predsjednik Komisije
    • Corpus:
      • (en) … if the President of the Commission declares …
      • (hr1) … ako Predsjednik Komisije objavi …
      • (hr2) … ako Predsjednik objavi …
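
The check can be sketched in a few lines of Python on the example above. This is only an illustration: the function name `departs_from_te` and the plain case-insensitive substring test are assumptions made here, not the system described in the talk. Surface matching like this ignores inflection, which a real system has to handle.

```python
def departs_from_te(src_sentence, tgt_sentence, glossary):
    """Return glossary entries found in the source sentence whose
    pre-established translation equivalent (TE) is missing from the
    aligned target sentence."""
    missing = []
    src_lower = src_sentence.lower()
    tgt_lower = tgt_sentence.lower()
    for src_term, tgt_term in glossary.items():
        if src_term.lower() in src_lower and tgt_term.lower() not in tgt_lower:
            missing.append((src_term, tgt_term))
    return missing


# Glossary pair and sentences taken from the example above.
glossary = {"President of the Commission": "Predsjednik Komisije"}
en = "if the President of the Commission declares"
hr1 = "ako Predsjednik Komisije objavi"
hr2 = "ako Predsjednik objavi"

print(departs_from_te(en, hr1, glossary))  # []
print(departs_from_te(en, hr2, glossary))  # [('President of the Commission', 'Predsjednik Komisije')]
```

An empty result means every glossary entry in the source has its TE in the translation; a non-empty result lists the departures a reviser would need to inspect.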
Resources
  • our lexicon: Eurovoc 4.1
    • glossary intended for document indexing
    • ca 6000 entries (descriptors) covering topics found in EU legal texts
    • accompanied by non-descriptors (synonyms)
    • translated to Croatian in 2000
    • + 4000 Croatian-specific descriptors
    • translation always 1:1
    • combination of nouns, adjectives, prepositions, conjunctions
  • our corpus: 9 documents from AC corpus and their translations from MEI
    • size: 16,053 tokens (en), 13,590 tokens (hr)
    • Croatian translations converted to AC corpus XML format
Method
  • simple glossary look-up?
  • problem of inflection
    • English
      • at least: sg, pl, ’s
    • Croatian
      • 7 cases × 2 numbers for nouns
      • 7 cases × 2 numbers × 3 genders × 2 definiteness × 3 comparison for adjectives
  • lemmatization of corpus or glossary?
  • Eurovoc lemmatized and converted to FSA: Intex
    • English lemmatizer from Intex for English Eurovoc
    • Croatian Lemmatization Server (hml.ffzg.hr) for Croatian Eurovoc
    • FSA with 10430 states
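
To make the inflection problem concrete, the sketch below multiplies out the category sizes listed above and then shows why matching on lemmas lets a single glossary entry cover several inflected forms. The lemma table is a hypothetical toy stand-in for a real lemmatizer, and the n-gram lookup is a simplification of FSA matching, not the Intex implementation; the entry and its ID are taken from the Eurovoc listing in this presentation.

```python
from math import prod

# Paradigm slots implied by the category sizes listed above
# (an upper bound on distinct forms, since many slots share a form).
noun_slots = prod([7, 2])                # cases x numbers
adjective_slots = prod([7, 2, 3, 2, 3])  # cases x numbers x genders x definiteness x comparison
print(noun_slots, adjective_slots)       # 14 252

# Hypothetical toy lemma table; a real lemmatizer covers the whole lexicon.
LEMMAS = {"prava": "pravo", "pravu": "pravo", "nadzora": "nadzor"}

# Glossary keyed by lemma sequence, so one entry covers all inflected forms.
GLOSSARY = {("pravo", "nadzor"): "003646"}


def lemmatize(tokens):
    """Map each token to its lemma; unknown tokens are kept lowercased."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]


def find_entries(tokens, glossary=GLOSSARY, max_len=3):
    """Return (token position, descriptor ID) for every lemma n-gram
    that is a glossary entry."""
    lemmas = lemmatize(tokens)
    hits = []
    for start in range(len(lemmas)):
        for length in range(1, max_len + 1):
            key = tuple(lemmas[start:start + length])
            if len(key) == length and key in glossary:
                hits.append((start, glossary[key]))
    return hits


print(find_entries("o pravu nadzora".split()))  # [(1, '003646')]
```

Lemmatizing the glossary once, rather than listing every inflected variant of every entry, keeps the number of entries equal to the number of descriptors.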
Eurovoc as FSA

"<diskrecijski><pravo>/<EVD lang=\"hr\" id=\"001444\">" 320 300 1 2

"<pravo><imenovanje>/<EVD lang=\"hr\" id=\"003048\">" 320 300 1 2

"<pravo><nadzor>/<EVD lang=\"hr\" id=\"003646\">" 320 300 1 2

"<nadzoran><tijelo>/<EVD lang=\"hr\" id=\"005492\">" 320 300 1 2

"<pravo><odlučivanje>/<EVD lang=\"hr\" id=\"003043\">" 320 300 1 2

"<pravo><poticanje>/<EVD lang=\"hr\" id=\"003045\">" 320 300 1 2

"<pravo><pregovaranje>/<EVD lang=\"hr\" id=\"003049\">" 320 300 1 2

"<pravo><procjena>/<EVD lang=\"hr\" id=\"003042\">" 320 300 1 2

"<pravo><provedba>/<EVD lang=\"hr\" id=\"003044\">" 320 300 1 2

"<pravo><ratifikacija>/<EVD lang=\"hr\" id=\"003046\">" 320 300 1 2

"<savjetodavan><pravo>/<EVD lang=\"hr\" id=\"000717\">" 320 300 1 2

"<veto>/<EVD lang=\"hr\" id=\"003964\">" 320 300 1 2

"<politički><sustav>/<EVD lang=\"hr\" id=\"000153\">" 320 300 1 2

"<autoritaran><režim>/<EVD lang=\"hr\" id=\"000849\">" 320 300 1 2

"<diktatura>/<EVD lang=\"hr\" id=\"001428\">" 320 300 1 2

"<dvostranački><sustav>/<EVD lang=\"hr\" id=\"003851\">" 320 300 1 2

"<federalizam>/<EVD lang=\"hr\" id=\"001830\">" 320 300 1 2

"<jednostranački><sustav>/<EVD lang=\"hr\" id=\"002835\">" 320 300 1 2

"<monokracija>/<EVD lang=\"hr\" id=\"002659\">" 320 300 1 2

"<narodan><demokracija>/<EVD lang=\"hr\" id=\"002933\">" 320 300 1 2

"<oligarhija>/<EVD lang=\"hr\" id=\"033057\">" 320 300 1 2

"<parlamentaran><sustav>/<EVD lang=\"hr\" id=\"002903\">" 320 300 1 2

"<pobunjenički><vlada>/<EVD lang=\"hr\" id=\"003233\">" 320 300 1 2

"<predsjednički><sustav>/<EVD lang=\"hr\" id=\"034211\">" 320 300 1 2

"<promjena><politički><sustav>/<EVD lang=\"hr\" id=\"001128\">" 320 300 1 2

"<republika>/<EVD lang=\"hr\" id=\"003302\">" 320 300 1 2

"<ustavan><monarhija>/<EVD lang=\"hr\" id=\"001254\">" 320 300 1 2

"<višestranački><sustav>/<EVD lang=\"hr\" id=\"002676\">" 320 300 1 2

"<vlada><u><progonstvo>/<EVD lang=\"hr\" id=\"002028\">" 320 300 1 2

"<vojni><režim>/<EVD lang=\"hr\" id=\"002617\">" 320 300 1 2

"<politički><stranka>/<EVD lang=\"hr\" id=\"000003\">" 320 300 1 2

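Each line of the listing above pairs a sequence of lemmas in angle brackets with the `<EVD>` tag to be emitted when that sequence is recognized. As a rough illustration of the structure, the Python sketch below pulls the lemma sequence, language and descriptor ID out of one such line. The regular expression and the name `parse_entry` are assumptions made for this example; the meaning of the trailing numbers is not stated in the slides, so they are ignored.

```python
import re

# One listing line: "<lemma>...<lemma>/<EVD lang=\"xx\" id=\"NNNNNN\">" followed by numbers.
# The backslashes before the inner quotes are literal characters in the listing.
LINE_RE = re.compile(
    r'^"(?P<lemmas>(?:<[^<>/]+>)+)/<EVD lang=\\"(?P<lang>\w+)\\" id=\\"(?P<id>\d+)\\">"'
)


def parse_entry(line):
    """Return (lemma tuple, language, descriptor ID), or None if the
    line does not have the expected shape."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    lemmas = tuple(re.findall(r"<([^<>/]+)>", m.group("lemmas")))
    return lemmas, m.group("lang"), m.group("id")


# Two lines copied from the listing above (raw strings keep the backslashes).
sample_1 = r'"<pravo><nadzor>/<EVD lang=\"hr\" id=\"003646\">" 320 300 1 2'
sample_2 = r'"<vlada><u><progonstvo>/<EVD lang=\"hr\" id=\"002028\">" 320 300 1 2'

print(parse_entry(sample_1))  # (('pravo', 'nadzor'), 'hr', '003646')
print(parse_entry(sample_2))  # (('vlada', 'u', 'progonstvo'), 'hr', '002028')
```

Note that every element of the sequence is a lemma, including function words such as the preposition `u`, which is why the corpus text has to be lemmatized before matching.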
Method 2
  • glossary entries marked in corpus together with IDs
    • <EVD lang="hr" id="001747">
  • checking whether the same ID appears on both sides of alignment (Perl script)

P1/005749 P1/005494 P2/005749 P2/005494 P7/002840 P8/001952 P8/001952 P9/000060 ...

  • statistics:

                               en      hr
    <P>                        652     656
    <EVD>                      1328    1484
    <EVD> with matched IDs     803 (60.47%)

  • matched <EVD>s also serve as word/phrase-level alignment points below the <P> level
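
The ID comparison was done with a Perl script that is not shown in the slides. The Python sketch below is a stand-in under stated assumptions: each side of the alignment is given as a list of `P<n>/<id>` strings in the format of the sample output above, and an ID counts as matched when it occurs in the same paragraph on both sides. Which IDs occur on which side in the toy lists is invented for illustration. The last line only checks that the reported percentage is consistent with 803 matches out of the 1328 English `<EVD>`s.

```python
from collections import Counter


def ids_by_paragraph(pairs):
    """Group 'P<n>/<id>' strings into {paragraph: Counter of IDs}."""
    grouped = {}
    for pair in pairs:
        par, evd_id = pair.split("/")
        grouped.setdefault(par, Counter())[evd_id] += 1
    return grouped


def compare(en_pairs, hr_pairs):
    """Return (number of matched IDs, list of unmatched (paragraph, ID))
    for source-side IDs, comparing paragraph by paragraph."""
    en, hr = ids_by_paragraph(en_pairs), ids_by_paragraph(hr_pairs)
    matched, unmatched = 0, []
    for par, en_ids in en.items():
        hr_ids = hr.get(par, Counter())
        common = en_ids & hr_ids  # multiset intersection
        matched += sum(common.values())
        for evd_id, n in (en_ids - common).items():
            unmatched.extend([(par, evd_id)] * n)
    return matched, unmatched


# Hypothetical toy input in the 'P<n>/<id>' format.
en_side = ["P1/005749", "P1/005494", "P2/005749", "P2/005494", "P7/002840"]
hr_side = ["P1/005749", "P1/005494", "P2/005749", "P7/002840"]

print(compare(en_side, hr_side))    # (4, [('P2', '005494')])
print(round(100 * 803 / 1328, 2))   # 60.47
```

Each unmatched pair points a reviser to a paragraph where the translation may have departed from the pre-established TE, or where one of the drawbacks listed next produced a false alarm.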
Drawbacks
  • syntactic merging
    • abbreviations not matched / marked (e.g. EP delegation vs. European Parliament delegation)
    • merged terms not matched / marked (e.g. head of State, head of government vs. heads of State or Government)
  • EUROVOC = glossary intended for indexing
    • many real terms (MWU) not matched / marked (e.g. country candidate to EU accession, Stabilisation and Association Agreement) because they do not exist as entries
    • no semantic processing → polysemous terms wrongly matched / marked (e.g. "...which might lead to a common defence...", where the verb lead is marked as the descriptor for the metal, hr olovo)
  • Intex English lemmatizer didn’t cover all Eurovoc entries
Further directions
  • evaluation of matched pairs of <EVD> regarding
    • single-word units
    • multi-word units
  • improving Intex English lemmatizer / lexicon
  • use Eurovoc non-descriptors as synonyms
    • to capture a wider departure from expected TE in translation more precisely
  • use / include other glossaries
    • EU Law Glossary
  • test the whole system on a larger corpus
  • use it with other languages