
Automatic translation quality control using Eurovoc descriptors

Marko Tadić (marko.tadic@ffzg.hr, http://www.hnk.ffzg.hr/mt/), Božo Bekavac (bbekavac@ffzg.hr, http://www.hnk.ffzg.hr/bb/)

Department of Linguistics / Institute of Linguistics, Faculty of Philosophy, University of Zagreb (www.ffzg.hr, www.hnk.ffzg.hr)

JRC Ispra / Arona, 2005-09-27

Talk plan
  • motivation
  • automatic translation quality control
  • resources: glossary and test corpus
  • results
  • further directions
Motivation
  • Acquis communautaire (AC) is still being translated into Croatian
  • former Ministry of European Integrations (MEI), today Ministry of External Affairs and European Integrations
  • AC
    • ca. 200,000 pages of the EU Official Journal (OJ)
    • AC corpus: from 8 million words (Estonian) to 82 million words (Spanish)
    • not precisely delimited (lawyers are working on that!)
    • constantly growing
    • legal texts
      • a lot of repetitious and formulaic expressions
      • low polysemy of terms expected
Motivation 2
  • different EU accession candidates → different organization of the translation process
    • several years of work
    • large number of translators
    • in-house / outsourced (tenders)
    • large-scale document translation and revision
  • MEI
    • outsourcing to ca. 100 translators or translation companies
    • use of a glossary with pre-established translation equivalents (TEs)
    • glossaries being translated in advance
      • Eurovoc
      • EU Law Glossary / Čtyřjazyčný slovník práva Evropské unie, Prague 1999
  • maintain the consistency of translation
    • by use of the same glossary only?
Preparing AC originals for translation
  • project proposed by our Institute to MEI in 2002
  • entries from glossary marked in original text before translation
  • signal of the existence of pre-established TE to the translator
  • obligatory usage of existing TE in legal texts, e.g.:
    • Council of Europe = Vijeće Europe
    • European Council = Europsko vijeće
    • Council of the European Union = Vijeće Europske unije
  • ...
  • AC had to be converted to XML
  • MEI dropped the project in 2003 for lack of funding
  • now: AC corpus in XML
Revision of Translation
  • largest effort went into translation in all candidate countries
  • revision of translation always came last
    • quality: consistency
  • task underestimated by all candidate countries
    • large portions of the official translation of the AC poorly revised
  • usually done
    • manually
    • with simple search & replace commands
    • with no terms/entries marked in texts
  • automatic approach?
    • lexical level and idiomatic level
Automatic Translation Quality Control
  • use a system to check whether all pre-established TEs are used
    • sentence-aligned parallel corpus
    • glossary entries marked in
      • original text
      • translated text
  • if the TE of a glossary entry found in the original is not found in the aligned translated sentence
    → the translation departs from the pre-established TE
  • e.g.:
    • Eurovoc: (en) President of the Commission = (hr) Predsjednik Komisije
    • Corpus:
      • (en) … if the President of the Commission declares …
      • (hr1) … ako Predsjednik Komisije objavi …
      • (hr2) … ako Predsjednik objavi …
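
The check described on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' system: the function name `find_departures` and the one-entry glossary are assumptions built from the slide's own example, and plain substring matching is used, so inflected forms are not handled.

```python
# Hypothetical one-entry glossary taken from the slide's example.
GLOSSARY = {"President of the Commission": "Predsjednik Komisije"}


def find_departures(source: str, target: str, glossary: dict[str, str]) -> list[str]:
    """Return source terms whose pre-established TE is missing from the target."""
    src, tgt = source.lower(), target.lower()
    return [
        term
        for term, te in glossary.items()
        if term.lower() in src and te.lower() not in tgt
    ]


en = "if the President of the Commission declares"
hr1 = "ako Predsjednik Komisije objavi"
hr2 = "ako Predsjednik objavi"

print(find_departures(en, hr1, GLOSSARY))  # []
print(find_departures(en, hr2, GLOSSARY))  # ['President of the Commission']
```

Here hr1 passes and hr2 is flagged, because the full equivalent "Predsjednik Komisije" is missing from the aligned sentence.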
Resources
  • our lexicon: Eurovoc 4.1
    • glossary intended for document indexing
    • ca. 6,000 entries (descriptors) covering topics found in EU legal texts
    • accompanied by non-descriptors (synonyms)
    • translated into Croatian in 2000
    • plus 4,000 Croatian-specific descriptors
    • translation always 1:1
    • entries are combinations of nouns, adjectives, prepositions, conjunctions
  • our corpus: 9 documents from the AC corpus and their translations from MEI
    • size: 16,053 tokens (en), 13,590 tokens (hr)
    • Croatian translations converted to the AC corpus XML format
Method
  • simple glossary look-up?
  • problem of inflection
    • English
      • at least: sg, pl, ’s
    • Croatian
      • 7 cases × 2 numbers for nouns
      • 7 cases × 2 numbers × 3 genders × 2 definiteness × 3 comparison for adjectives
  • lemmatization of corpus or glossary?
  • Eurovoc lemmatized and converted to an FSA with Intex
    • English lemmatizer from Intex for English Eurovoc
    • Croatian Lemmatization Server (hml.ffzg.hr) for Croatian Eurovoc
    • FSA with 10,430 states
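
Taking the multiplication on this slide at face value, a Croatian noun has up to 7 × 2 = 14 case/number slots and an adjective up to 7 × 2 × 3 × 2 × 3 = 252, which is why plain string look-up is not enough. The sketch below shows the general idea of matching on lemmas instead of surface forms. It is a toy stand-in, not Intex or the Croatian Lemmatization Server: the lemma table, the entry ID `X00001`, and the function names are all hypothetical.

```python
# Toy lemma table (hypothetical); the slides use the Intex lemmatizer for
# English and the Croatian Lemmatization Server for Croatian instead.
LEMMAS = {
    "predsjednika": "predsjednik",
    "predsjedniku": "predsjednik",
    "komisije": "komisija",
    "komisiji": "komisija",
}

# Glossary keyed by lemma sequence; the ID is a made-up placeholder.
GLOSSARY = {("predsjednik", "komisija"): "X00001"}


def lemmatize(token: str) -> str:
    """Map an inflected form to its lemma, falling back to the lowercased form."""
    low = token.lower()
    return LEMMAS.get(low, low)


def mark_entries(tokens, glossary):
    """Return (start, end, entry_id) for each glossary entry found, longest match first."""
    lemmas = [lemmatize(t) for t in tokens]
    longest = max(len(key) for key in glossary)  # assumes a non-empty glossary
    hits, i = [], 0
    while i < len(lemmas):
        for n in range(min(longest, len(lemmas) - i), 0, -1):
            entry_id = glossary.get(tuple(lemmas[i:i + n]))
            if entry_id is not None:
                hits.append((i, i + n, entry_id))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return hits


print(mark_entries("ako Predsjednik Komisije objavi".split(), GLOSSARY))
print(mark_entries("odluka Predsjednika Komisije".split(), GLOSSARY))
```

Both calls find the same entry at token positions 1 to 3, although the second sentence uses the genitive form "Predsjednika".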
Eurovoc as FSA

"<diskrecijski><pravo>/<EVD lang=\"hr\" id=\"001444\">" 320 300 1 2
"<pravo><imenovanje>/<EVD lang=\"hr\" id=\"003048\">" 320 300 1 2
"<pravo><nadzor>/<EVD lang=\"hr\" id=\"003646\">" 320 300 1 2
"<nadzoran><tijelo>/<EVD lang=\"hr\" id=\"005492\">" 320 300 1 2
"<pravo><odlučivanje>/<EVD lang=\"hr\" id=\"003043\">" 320 300 1 2
"<pravo><poticanje>/<EVD lang=\"hr\" id=\"003045\">" 320 300 1 2
"<pravo><pregovaranje>/<EVD lang=\"hr\" id=\"003049\">" 320 300 1 2
"<pravo><procjena>/<EVD lang=\"hr\" id=\"003042\">" 320 300 1 2
"<pravo><provedba>/<EVD lang=\"hr\" id=\"003044\">" 320 300 1 2
"<pravo><ratifikacija>/<EVD lang=\"hr\" id=\"003046\">" 320 300 1 2
"<savjetodavan><pravo>/<EVD lang=\"hr\" id=\"000717\">" 320 300 1 2
"<veto>/<EVD lang=\"hr\" id=\"003964\">" 320 300 1 2
"<politički><sustav>/<EVD lang=\"hr\" id=\"000153\">" 320 300 1 2
"<autoritaran><režim>/<EVD lang=\"hr\" id=\"000849\">" 320 300 1 2
"<diktatura>/<EVD lang=\"hr\" id=\"001428\">" 320 300 1 2
"<dvostranački><sustav>/<EVD lang=\"hr\" id=\"003851\">" 320 300 1 2
"<federalizam>/<EVD lang=\"hr\" id=\"001830\">" 320 300 1 2
"<jednostranački><sustav>/<EVD lang=\"hr\" id=\"002835\">" 320 300 1 2
"<monokracija>/<EVD lang=\"hr\" id=\"002659\">" 320 300 1 2
"<narodan><demokracija>/<EVD lang=\"hr\" id=\"002933\">" 320 300 1 2
"<oligarhija>/<EVD lang=\"hr\" id=\"033057\">" 320 300 1 2
"<parlamentaran><sustav>/<EVD lang=\"hr\" id=\"002903\">" 320 300 1 2
"<pobunjenički><vlada>/<EVD lang=\"hr\" id=\"003233\">" 320 300 1 2
"<predsjednički><sustav>/<EVD lang=\"hr\" id=\"034211\">" 320 300 1 2
"<promjena><politički><sustav>/<EVD lang=\"hr\" id=\"001128\">" 320 300 1 2
"<republika>/<EVD lang=\"hr\" id=\"003302\">" 320 300 1 2
"<ustavan><monarhija>/<EVD lang=\"hr\" id=\"001254\">" 320 300 1 2
"<višestranački><sustav>/<EVD lang=\"hr\" id=\"002676\">" 320 300 1 2
"<vlada><u><progonstvo>/<EVD lang=\"hr\" id=\"002028\">" 320 300 1 2
"<vojni><režim>/<EVD lang=\"hr\" id=\"002617\">" 320 300 1 2
"<politički><stranka>/<EVD lang=\"hr\" id=\"000003\">" 320 300 1 2

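Each line of the listing above appears to pair a sequence of lemmas in angle brackets with the `<EVD>` tag to be output, followed by four numbers. The Python sketch below reads one such line under that assumption. The regular expression and the name `parse_entry` are illustrative only and are not taken from Intex; the meaning of the trailing numbers is not stated on the slide, so they are ignored.

```python
import re

# Assumed shape of one listing line: a quoted "<lemma>...<lemma>/<EVD ...>"
# string followed by four numbers, which this sketch ignores.
ENTRY_RE = re.compile(
    r'^"(?P<lemmas>(?:<[^<>/]+>)+)/<EVD lang=\\"(?P<lang>\w+)\\" id=\\"(?P<id>\d+)\\">"'
)


def parse_entry(line: str):
    """Return (lemma_tuple, lang, descriptor_id), or None if the line does not match."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    lemmas = tuple(re.findall(r"<([^<>]+)>", m.group("lemmas")))
    return lemmas, m.group("lang"), m.group("id")


line = r'"<vlada><u><progonstvo>/<EVD lang=\"hr\" id=\"002028\">" 320 300 1 2'
print(parse_entry(line))  # (('vlada', 'u', 'progonstvo'), 'hr', '002028')
```

A table built this way, keyed by lemma tuple, is one possible input for marking descriptors in a lemmatized text.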
Method 2
  • glossary entries marked in corpus together with their IDs
    • <EVD lang="hr" id="001747">
  • checking whether the same ID appears on both sides of the alignment (Perl script)

P1/005749 P1/005494 P2/005749 P2/005494 P7/002840 P8/001952 P8/001952 P9/000060 ...

  • statistics

                              en      hr
    <P>                       652     656
    <EVD>                     1328    1484
    <EVD> with matched IDs    803 (60.47%)

  • matched <EVD>s are also word/phrase-aligned units below the <P> level
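
The ID comparison can be sketched in Python as follows (the slides mention a Perl script, which is not shown, so this is a re-imagining rather than the authors' code). It assumes that paragraph labels such as P1 refer to the same aligned unit on both sides, and the sample marks are made up in the slide's notation. Note that the reported 60.47% equals 803 / 1328, so it appears to be computed against the English `<EVD>` count.

```python
from collections import Counter, defaultdict


def group_by_paragraph(marks):
    """Turn 'P1/005749'-style marks into {paragraph: Counter of descriptor IDs}."""
    grouped = defaultdict(Counter)
    for mark in marks:
        paragraph, descriptor_id = mark.split("/", 1)
        grouped[paragraph][descriptor_id] += 1
    return grouped


def count_matches(en_marks, hr_marks):
    """Count IDs present on both sides of each aligned paragraph (multiset overlap)."""
    en, hr = group_by_paragraph(en_marks), group_by_paragraph(hr_marks)
    return sum(sum((en[p] & hr[p]).values()) for p in en.keys() & hr.keys())


# Made-up marks in the slide's notation; not data from the test corpus.
en_marks = ["P1/005749", "P1/005494", "P2/005749", "P7/002840"]
hr_marks = ["P1/005749", "P2/005749", "P2/005494", "P8/001952"]

matched = count_matches(en_marks, hr_marks)
print(matched, f"{matched / len(en_marks):.2%}")  # 2 50.00%
```

English IDs left unmatched within a paragraph (here 005494 in P1 and 002840 in P7) are the candidates for a departure from the pre-established TE.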
Drawbacks
  • syntactic merging
    • abbreviations not matched / marked (e.g. EP delegation vs. European Parliament delegation)
    • merged terms not matched / marked (e.g. head of State, head of government vs. heads of State or Government)
  • Eurovoc = glossary intended for indexing
    • many real terms (multi-word units, MWU) not matched / marked because they do not exist as entries (e.g. country candidate to EU accession, Stabilisation and Association Agreement)
    • no semantic processing → polysemous terms wrongly matched / marked (e.g. the verb lead in "...which might lead to a common defence..." marked as the descriptor lead, hr: olovo, i.e. the metal)
  • Intex English lemmatizer did not cover all Eurovoc entries
Further directions
  • evaluation of matched pairs of <EVD> regarding
    • single-word units
    • multi-word units
  • improving Intex English lemmatizer / lexicon
  • use Eurovoc non-descriptors as synonyms
    • to capture a wider range of departures from the expected TE more precisely
  • use / include other glossaries
    • EU Law Glossary
  • test the whole system on a larger corpus
  • use it with other languages