Automatic translation quality control using Eurovoc descriptors. Marko Tadić, Božo Bekavac (marko.tadic@ffzg.hr, http://www.hnk.ffzg.hr/mt/; bbekavac@ffzg.hr, http://www.hnk.ffzg.hr/bb/)
Automatic translation quality control using Eurovoc descriptors. Marko Tadić, Božo Bekavac (marko.tadic@ffzg.hr, http://www.hnk.ffzg.hr/mt/; bbekavac@ffzg.hr, http://www.hnk.ffzg.hr/bb/). Department of Linguistics / Institute of Linguistics, Faculty of Philosophy, University of Zagreb (www.ffzg.hr, www.hnk.ffzg.hr). JRC Ispra / Arona, 2005-09-27
Talk plan • motivation • automatic translation quality control • resources: glossary and test corpus • results • further directions
Motivation • Acquis communautaire (AC) is still being translated into Croatian • former Ministry of European Integrations (MEI), today Ministry of External Affairs and European Integrations • AC • ca 200,000 pages of EU OJ • AC corpus: from 8 Mw (Estonian) to 82 Mw (Spanish) • not precisely delimited (lawyers are working on that!) • constantly growing • legal texts • a lot of repetitious and formulaic expressions • low polysemy of terms expected
Motivation 2 • different EU accession candidates → different organization of the translation process • several years of work • large number of translators • in-house / outsourced (tenders) • large-scale document translation and revision • MEI • outsourcing to ca 100 translators or translation companies • use of a glossary with pre-established translation equivalents (TEs) • glossaries translated in advance • Eurovoc • EU Law Glossary / Čtyřjazyčný slovník práva Evropské unie, Prague 1999 • maintain the consistency of translation • by use of the same glossary only?
Preparing AC originals for translation • project proposed by our Institute to MEI in 2002 • entries from the glossary marked in the original text before translation • signals the existence of a pre-established TE to the translator • obligatory usage of existing TEs in legal texts, e.g.: • Council of Europe = Vijeće Europe • European Council = Europsko vijeće • Council of the European Union = Vijeće Europske unije • ... • AC had to be converted to XML • MEI dropped the project in 2003 for lack of funding • now: AC corpus in XML
Revision of Translation • in all candidate countries the largest effort was put into translation • revision of translation always comes last • quality: consistency • task underestimated by all candidate countries • large portions of the official translation of the AC poorly revised • usually done • manually • with simple search & replace commands • no terms/entries marked in texts • automatic approach? • lexical level and idiomatic level
Automatic Translation Quality Control • use a system to check whether all pre-established TEs are used • sentence-aligned parallel corpus • glossary entries marked in • original text • translated text • if the TE of a glossary entry found in the original is not found in the aligned translated sentence → the translation departs from the pre-established TE • e.g.: • Eurovoc: (en) President of the Commission = (hr) Predsjednik Komisije • Corpus: (en) … if the President of the Commission declares … (hr1) … ako Predsjednik Komisije objavi … (hr2) … ako Predsjednik objavi …
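The check on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the authors' system: it uses plain case-insensitive substring matching and so ignores inflection, and the function name `find_departures` and the data layout are invented for the example. The glossary entry and the sentences are the ones from the slide.

```python
def find_departures(glossary, aligned_pairs):
    """Return (pair_index, source_term, expected_te) for every glossary term
    found in a source sentence whose pre-established translation equivalent
    (TE) is missing from the aligned target sentence."""
    departures = []
    for index, (source, target) in enumerate(aligned_pairs):
        source_lc, target_lc = source.lower(), target.lower()
        for term, expected_te in glossary.items():
            # Assumption: naive substring test, no lemmatization.
            if term.lower() in source_lc and expected_te.lower() not in target_lc:
                departures.append((index, term, expected_te))
    return departures


# Glossary entry and sentence fragments taken from the slide's example.
glossary = {"President of the Commission": "Predsjednik Komisije"}
aligned_pairs = [
    ("if the President of the Commission declares", "ako Predsjednik Komisije objavi"),
    ("if the President of the Commission declares", "ako Predsjednik objavi"),
]

for index, term, expected_te in find_departures(glossary, aligned_pairs):
    print(f"pair {index}: '{term}' not rendered as '{expected_te}'")
```

Only the second pair (hr2) is reported, because "Predsjednik Komisije" is shortened to "Predsjednik" there.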
Resources • our lexicon: Eurovoc 4.1 • documentational indexing glossary • ca 6000 entries (descriptors) covering topics found in EU legal texts • accompanied by non-descriptors (synonyms) • translated into Croatian in 2000 • + 4000 Croatian-specific descriptors • translation always 1:1 • combination of nouns, adjectives, prepositions, conjunctions • our corpus: 9 documents from the AC corpus and their translations from MEI • size: 16,053 tokens (en), 13,590 tokens (hr) • Croatian translations converted to AC corpus XML format
Method • simple glossary look-up? • problem of inflection • English • at least: sg, pl, ’s • Croatian • 7 cases × 2 numbers for nouns • 7 cases × 2 numbers × 3 genders × 2 definiteness values × 3 degrees of comparison for adjectives • lemmatization of corpus or glossary? • Eurovoc lemmatized and converted to FSA: Intex • English lemmatizer from Intex for English Eurovoc • Croatian Lemmatization Server (hml.ffzg.hr) for Croatian Eurovoc • FSA with 10430 states
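A noun with 7 cases and 2 numbers already has up to 14 paradigm slots, so matching surface forms directly is impractical; matching on lemmas avoids the problem. The sketch below illustrates the idea with a longest-match lookup over lemma sequences. It is not the Intex FSA itself: the toy form-to-lemma table `LEMMAS` is hypothetical and stands in for a real lemmatizer, and `mark_descriptors` is an invented name. The three glossary entries and their IDs are taken from the Croatian Eurovoc listing in these slides.

```python
# Hypothetical toy table: inflected form -> lemma.
# A real system would query a lemmatizer instead.
LEMMAS = {
    "političkog": "politički",
    "političkom": "politički",
    "sustava": "sustav",
    "sustavu": "sustav",
    "republike": "republika",
}

# Lemmatized glossary: tuple of lemmas -> Eurovoc descriptor ID.
GLOSSARY = {
    ("politički", "sustav"): "000153",
    ("promjena", "politički", "sustav"): "001128",
    ("republika",): "003302",
}


def mark_descriptors(tokens):
    """Return (start, end, descriptor_id) spans, preferring the longest match."""
    lemmas = [LEMMAS.get(token.lower(), token.lower()) for token in tokens]
    max_len = max(len(key) for key in GLOSSARY)
    spans = []
    i = 0
    while i < len(lemmas):
        # Try the longest candidate first so multi-word entries win.
        for n in range(min(max_len, len(lemmas) - i), 0, -1):
            key = tuple(lemmas[i:i + n])
            if key in GLOSSARY:
                spans.append((i, i + n, GLOSSARY[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return spans


tokens = "promjena političkog sustava republike".split()
print(mark_descriptors(tokens))
```

Trying the longest candidate first means "promjena političkog sustava" is marked once as descriptor 001128 rather than also as the shorter entry 000153 nested inside it.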
Eurovoc as FSA
"<diskrecijski><pravo>/<EVD lang=\"hr\" id=\"001444\">" 320 300 1 2
"<pravo><imenovanje>/<EVD lang=\"hr\" id=\"003048\">" 320 300 1 2
"<pravo><nadzor>/<EVD lang=\"hr\" id=\"003646\">" 320 300 1 2
"<nadzoran><tijelo>/<EVD lang=\"hr\" id=\"005492\">" 320 300 1 2
"<pravo><odlučivanje>/<EVD lang=\"hr\" id=\"003043\">" 320 300 1 2
"<pravo><poticanje>/<EVD lang=\"hr\" id=\"003045\">" 320 300 1 2
"<pravo><pregovaranje>/<EVD lang=\"hr\" id=\"003049\">" 320 300 1 2
"<pravo><procjena>/<EVD lang=\"hr\" id=\"003042\">" 320 300 1 2
"<pravo><provedba>/<EVD lang=\"hr\" id=\"003044\">" 320 300 1 2
"<pravo><ratifikacija>/<EVD lang=\"hr\" id=\"003046\">" 320 300 1 2
"<savjetodavan><pravo>/<EVD lang=\"hr\" id=\"000717\">" 320 300 1 2
"<veto>/<EVD lang=\"hr\" id=\"003964\">" 320 300 1 2
"<politički><sustav>/<EVD lang=\"hr\" id=\"000153\">" 320 300 1 2
"<autoritaran><režim>/<EVD lang=\"hr\" id=\"000849\">" 320 300 1 2
"<diktatura>/<EVD lang=\"hr\" id=\"001428\">" 320 300 1 2
"<dvostranački><sustav>/<EVD lang=\"hr\" id=\"003851\">" 320 300 1 2
"<federalizam>/<EVD lang=\"hr\" id=\"001830\">" 320 300 1 2
"<jednostranački><sustav>/<EVD lang=\"hr\" id=\"002835\">" 320 300 1 2
"<monokracija>/<EVD lang=\"hr\" id=\"002659\">" 320 300 1 2
"<narodan><demokracija>/<EVD lang=\"hr\" id=\"002933\">" 320 300 1 2
"<oligarhija>/<EVD lang=\"hr\" id=\"033057\">" 320 300 1 2
"<parlamentaran><sustav>/<EVD lang=\"hr\" id=\"002903\">" 320 300 1 2
"<pobunjenički><vlada>/<EVD lang=\"hr\" id=\"003233\">" 320 300 1 2
"<predsjednički><sustav>/<EVD lang=\"hr\" id=\"034211\">" 320 300 1 2
"<promjena><politički><sustav>/<EVD lang=\"hr\" id=\"001128\">" 320 300 1 2
"<republika>/<EVD lang=\"hr\" id=\"003302\">" 320 300 1 2
"<ustavan><monarhija>/<EVD lang=\"hr\" id=\"001254\">" 320 300 1 2
"<višestranački><sustav>/<EVD lang=\"hr\" id=\"002676\">" 320 300 1 2
"<vlada><u><progonstvo>/<EVD lang=\"hr\" id=\"002028\">" 320 300 1 2
"<vojni><režim>/<EVD lang=\"hr\" id=\"002617\">" 320 300 1 2
"<politički><stranka>/<EVD lang=\"hr\" id=\"000003\">" 320 300 1 2
Method 2 • glossary entries marked in corpus together with IDs • <EVD lang="hr" id="001747"> • checking whether the same ID appears on both sides of the alignment (Perl script) P1/005749 P1/005494 P2/005749 P2/005494 P7/002840 P8/001952 P8/001952 P9/000060 ... • statistics (en / hr): <P> 652 / 656 • <EVD> 1328 / 1484 • <EVD> with matched IDs: 803 (60.47%) • matched <EVD>s are also word/phrase-aligned segments below the <P> level
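The ID comparison can be sketched as a multiset intersection over "aligned unit / descriptor ID" marks in the notation shown on the slide (e.g. P1/005749). The authors used a Perl script that is not shown; the Python below is an independent illustration, `match_ids` is an invented name, and the two mark lists are hypothetical sample data. The reported 60.47% appears to be relative to the English side, since 803 / 1328 ≈ 0.6047.

```python
from collections import Counter


def match_ids(en_marks, hr_marks):
    """Return the marks (aligned unit / descriptor ID) present on both sides.

    Multiset intersection: a repeated ID counts as matched only as many
    times as it occurs on both sides of the same aligned unit.
    """
    matched = Counter(en_marks) & Counter(hr_marks)
    return sorted(matched.elements())


# Hypothetical marks, written in the slide's "P<n>/<descriptor ID>" notation.
en_marks = ["P1/005749", "P1/005494", "P3/000153", "P8/001952", "P8/001952"]
hr_marks = ["P1/005749", "P1/005494", "P4/000153", "P8/001952", "P8/001952",
            "P8/003302"]

matched = match_ids(en_marks, hr_marks)
print(matched)
print(f"{len(matched)} of {len(en_marks)} English marks matched "
      f"({len(matched) / len(en_marks):.2%})")
```

Descriptor 000153 is not counted here because it occurs in different aligned units (P3 vs. P4), which is the kind of mismatch the check is meant to surface.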
Drawbacks • syntactic merging • abbreviations not matched / marked (e.g. EP delegation vs. European Parliament delegation) • merged terms not matched / marked (e.g. head of State, head of government vs. heads of State or Government) • Eurovoc = glossary intended for indexing • a lot of real terms (MWU) not matched / marked because they do not exist as entries (e.g. country candidate to EU accession, Stabilisation and Association Agreement) • no semantic processing → polysemous terms wrongly matched / marked (e.g. ...which might lead (olovo) to a common defence..., where the verb "lead" is marked as the metal, hr olovo) • Intex English lemmatizer didn’t cover all Eurovoc entries
Further directions • evaluation of matched pairs of <EVD> regarding • single-word units • multi-word units • improving Intex English lemmatizer / lexicon • use Eurovoc non-descriptors as synonyms • to capture a wider departure from expected TE in translation more precisely • use / include other glossaries • EU Law Glossary • test the whole system on larger corpus • use it with other languages