Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation

Inguna Skadiņa, Andrejs Vasiļjevs, Raivis Skadiņš, Robert Gaizauskas, Dan Tufiş and Tatiana Gornostay

Challenge of Data Driven MT
3rd BUCC, Malta, 22-05-10

Problem of availability of linguistic resources
• Relevant for “smaller” or under-resourced languages
• Example of Latvian:
  • few parallel corpora of reasonable size (e.g., JRC-Acquis, EMEA)
  • SMT trained on these corpora performs well on in-domain documents, but gives unacceptable results for other domains (en-lv: 43.4 BLEU in domain, 10.2 BLEU out of domain)
• Solution: comparable corpora are much more widely available than parallel translation data

Accurat project
• The ACCURAT mission is to significantly improve MT quality
  • for under-resourced languages and narrow domains
  • by researching novel approaches to how comparable corpora can compensate for a shortage of linguistic resources
• ACCURAT methods will be:
  • Adjustable to new languages and domains
  • Language independent where possible
• 2.5-year project, started on January 1, 2010

Key objectives
• To create comparability metrics: to develop the methodology and determine criteria to measure the comparability of source and target language documents in comparable corpora
• To develop, analyze and evaluate methods for automatic acquisition of comparable corpora from the Web
• To elaborate advanced techniques for extraction of lexical, terminological and other linguistic data from comparable corpora to provide training and customization data for MT
• To measure improvements from applying acquired data against baseline results from SMT and RBMT systems
• To evaluate and validate the ACCURAT project results in practical applications

Use Cases
• Adjusting MT to a narrow domain: automotive engineering, assistive technology and data processing domains
• Application for Web authoring: blog and social networking (Zemanta application)
• Using SMT in software localization: increasing efficiency in localization, integration with CAT tools

Language Coverage
• Focus on under-resourced languages: Latvian, Lithuanian, Estonian, Greek, Croatian, Romanian and Slovenian
• Major translation directions like English-Lithuanian, English-Croatian, German-Romanian
• Minor translation directions like Lithuanian-Romanian, Romanian-Greek and Latvian-Lithuanian

Work Plan
• WP1: To create comparability metrics: to develop the methodology and determine criteria to measure the comparability of source and target language documents in comparable corpora (M3-M24)
• WP2: To elaborate advanced techniques for extraction of lexical, terminological and other linguistic data from comparable corpora to provide training and customization data for MT (M3-M23)
• WP3: To develop, analyze and evaluate methods for automatic acquisition of a comparable corpus from the Web (M1-M22)
• WP4: To measure improvements from applying acquired data against results from baseline SMT and RBMT systems (M7-M26)

Work Plan
• WP5: To evaluate and validate the ACCURAT project results in three practical applications (M7-M30)
• WP6: To disseminate project results and to transfer the project knowledge, technologies, lessons learned and best practices to interested communities, and thus to ensure their worldwide impact and long-term sustainability (M1-M30)
• WP7: To coordinate the project and provide administrative and financial management (M1-M30)

Milestones

Key Results
• Comparability metrics developed and tools provided
• Comparable corpora for under-resourced languages collected and tools provided
• Methods and tools for multi-level alignment from comparable corpora developed
• Methods for using comparable corpora in both SMT and RBMT developed
• Proven application scenarios prepared
• Strong increase in MT quality for under-resourced languages and narrow domains

Initial comparable corpora (ICC)
• 1 million tokens for each under-resourced language
• domain corpus for en-de

Recommended proportions
• parallel: 10%
• strongly comparable (heavily edited translations, or independent but closely related texts reporting the same event or describing the same subject): 40%
• weakly comparable (e.g., texts within the same broader domain and genre but varying in subdomains and specific genres; texts in the same narrow subject domain and genre but describing different events): 50%
• length of each document should be between 500 and 3000 words
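
The recommended proportions can be checked mechanically once each document carries a comparability level and a word count. The following is a minimal sketch under stated assumptions: the slide does not say whether the shares are measured in words or in documents, so shares are computed over words here, and the function names and the (level, word_count) layout are hypothetical.

```python
# Target shares and length limits taken from the slide; everything else
# (names, data layout, shares measured in words) is an assumption.
RECOMMENDED = {
    "parallel": 0.10,
    "strongly comparable": 0.40,
    "weakly comparable": 0.50,
}
MIN_WORDS, MAX_WORDS = 500, 3000


def composition(docs):
    """Return the share of words per comparability level.

    docs is a list of (comparability_level, word_count) tuples.
    """
    words = {level: 0 for level in RECOMMENDED}
    for level, count in docs:
        words[level] += count
    total = sum(words.values())
    return {level: count / total for level, count in words.items()}


def out_of_range(docs):
    """Return the documents whose length falls outside 500-3000 words."""
    return [(level, n) for level, n in docs if not MIN_WORDS <= n <= MAX_WORDS]
```

Comparing the result of `composition` with `RECOMMENDED` shows how far a collected corpus deviates from the target mix.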

Initial comparable corpora: results

Metadata
• Language
• Domain
• Genre
• Source
• Number of words
• IPR status
• Comparability level
Parallel and strongly comparable texts are also aligned at the document level.

CES (Corpus Encoding Standards)

Extension to CES-CCES

CES Alignment–Extension to CCES Alignment

Criteria of Comparability and Parallelism
• Lack of definite methods to determine the criteria of comparability
• Some attempts to measure the degree of comparability according to distribution of topics and publication dates of documents in comparable corpora to estimate the global comparability of the corpora (Saralegi et al., 2008)
• Some attempts to determine different kinds of document parallelism in comparable corpora, such as complete parallelism, noisy parallelism and complete non-parallelism
• Some attempts to define criteria of parallelism of similar documents in comparable corpora, such as similar number of sentences, sharing sufficiently many links (up to 30%), and monotony of links (up to 90% of links do not cross each other) (Munteanu, 2006)
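
The monotony criterion can be illustrated with a small computation. This is a minimal sketch, assuming that a document pair is represented as a list of (source_sentence_index, target_sentence_index) links and that a link counts as crossing when it crosses at least one other link; this representation and the function names are hypothetical and are not taken from (Munteanu, 2006).

```python
def crosses(a, b):
    """Two links cross when their source order and target order disagree."""
    return (a[0] - b[0]) * (a[1] - b[1]) < 0


def monotony(links):
    """Fraction of links that do not cross any other link."""
    if not links:
        return 0.0
    non_crossing = sum(
        1 for a in links if not any(crosses(a, b) for b in links)
    )
    return non_crossing / len(links)


# Sentences 2 and 3 are swapped in the target document, so two of the
# five links cross each other.
example_links = [(0, 0), (1, 1), (2, 3), (3, 2), (4, 4)]
example_monotony = monotony(example_links)
```

A fully parallel document pair with no reordering gives a monotony of 1.0; the value drops as more sentences appear in a different order.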

Criteria of Comparability and Parallelism
• To investigate criteria for comparability between corpora, concentrating on different sets of features:
  • Lexical features: measuring the degree of 'lexical overlap' between frequency lists derived from corpora
  • Lexical sequence features: computing N-gram distances in terms of tokens
  • Morpho-syntactic features: computing N-gram distances in terms of Part-of-Speech codes

First experiment
• Comparability of corpora is measured in terms of lexical features (Greek-English and German-English language pairs)
• The set-up is similar to (Kilgarriff, 2001):
  • For each corpus, take the top 500 most frequent words
  • Relative frequency is used (the absolute frequency, or word count, divided by the length of the corpus)
  • Frequency lists are mapped between languages with dictionaries generated automatically by GIZA++ from the parallel Europarl corpus
• We compare corpora pairwise using a standard Chi-Square distance measure: ChiSquare = ∑ {w1 ... w500} ((FrqObserved - FrqExpected)^2 / FrqObserved)

First experiment
• Asymmetric method: relative frequencies in the corpus in language A are treated as “expected” values, and those mapped from the corpus in language B as “observed” values. Then we swap corpora A and B and repeat the calculation.
• Asymmetry comes from words which are missing in one of the lists as compared to the other. Missing words have different relative frequencies that are added to the score, so the distance from A to B can differ from the distance from B to A.
• We use the minimum of these two distances as the final score for the pair of corpora.
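
The asymmetric procedure can be sketched in a few lines. This is a minimal sketch, not the project's implementation. One assumption is labeled loudly: the slide writes the denominator as FrqObserved, whereas the code below divides by the expected frequency, as in the standard Pearson chi-square statistic. With that choice a word missing from the observed list (observed frequency 0) contributes exactly its expected relative frequency, which matches the description of missing words. The function names, the word-to-word dictionary format and the handling of unmapped words are also assumptions.

```python
from collections import Counter


def relative_frequencies(tokens, top_n=500):
    """Relative frequencies (count / corpus length) of the top_n words."""
    total = len(tokens)
    top = Counter(tokens).most_common(top_n)
    return {word: count / total for word, count in top}


def map_frequencies(freqs, dictionary):
    """Project a frequency list into the other language.

    dictionary maps a word to a single translation; words without an
    entry are dropped (an assumption of this sketch).
    """
    mapped = {}
    for word, freq in freqs.items():
        if word in dictionary:
            target = dictionary[word]
            mapped[target] = mapped.get(target, 0.0) + freq
    return mapped


def chi_square(expected, observed):
    """Directed distance, summed over the words of the expected list.

    Assumption: divides by the expected frequency (standard Pearson form),
    not by the observed frequency as printed on the slide.
    """
    return sum(
        (observed.get(word, 0.0) - exp) ** 2 / exp
        for word, exp in expected.items()
    )


def comparability_distance(freq_a, freq_b, dict_a_to_b, dict_b_to_a):
    """Minimum of the two directed distances between corpora A and B."""
    a_expected = chi_square(freq_a, map_frequencies(freq_b, dict_b_to_a))
    b_expected = chi_square(freq_b, map_frequencies(freq_a, dict_a_to_b))
    return min(a_expected, b_expected)
```

Lower scores indicate more similar frequency profiles. Words that appear only in the observed list are ignored in each direction, which is one source of the asymmetry.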

Features
• To extract the features which may be used to identify the comparability between documents

General Idea
[Diagram: features are extracted from EN-EL document pairs in the Initial Comparable Corpora, labelled as parallel, strongly comparable, weakly comparable or not comparable, and passed to a classifier. For new EN-EL documents the classifier outputs a predicted comparability level, e.g., strongly comparable.]

Metrics of Comparability and Parallelism
• Using defined criteria for parallelism, we would like to develop formal automated metrics for determining the degree of comparability
• Lack of comparability metrics to evaluate corpus usability for different tasks, such as machine translation, information extraction, cross-language information retrieval
• Recent studies (Kilgarriff, 2001; Rayson and Garside, 2000) have added a quantitative dimension to the issue of comparability by studying objective measures for detecting how similar (or different) two corpora are in terms of their lexical content
• Further studies (Sharoff, 2007) investigated automatic ways for assessing the composition of web corpora in terms of domains and genres

Sentence Alignment on Parallel Texts
• Danielsson, Pernilla and Ridings, Daniel. Practical presentation of a vanilla aligner. Göteborgs universitet, 1997
• Melamed, Dan. A Geometric Approach to Mapping Bitext Correspondence. University of Pennsylvania, 1996
• Chen, Stanley F. Aligning sentences in bilingual corpora using lexical information. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (Columbus, Ohio, 1993), Association for Computational Linguistics, Morristown, NJ, USA, 9-16
• State of the art: Moore, Robert C. Fast and Accurate Sentence Alignment of Bilingual Corpora. In Proceedings of the 5th Conference of the Association for Machine Translation in the Americas (Tiburon, California, 2002), Springer-Verlag, Heidelberg, 135-144:
  • provisional alignment based on sentence lengths
  • IBM Model 1 is used to estimate a Translation Equivalents (TE) table
  • one-to-one links are generated based on sentence lengths and the TE table
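
The TE table step relies on IBM Model 1, which estimates word translation probabilities with expectation-maximization. The following is a minimal sketch of plain IBM Model 1 training, assuming tokenized sentence pairs and omitting the NULL word; it is not Moore's implementation, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def ibm_model1(bitext, iterations=10):
    """Estimate t[(target_word, source_word)] = P(target_word | source_word).

    bitext is a list of (source_tokens, target_tokens) sentence pairs.
    """
    target_vocab = {tw for _, tgt in bitext for tw in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (target, source) pairs
        total = defaultdict(float)  # expected counts of source words
        # E-step: distribute each target word over the source words
        # of the same sentence pair.
        for src, tgt in bitext:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    delta = t[(tw, sw)] / norm
                    count[(tw, sw)] += delta
                    total[sw] += delta
        # M-step: re-estimate the translation probabilities.
        t = defaultdict(
            float,
            {(tw, sw): c / total[sw] for (tw, sw), c in count.items()},
        )
    return t


toy_bitext = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
te_table = ibm_model1(toy_bitext)
```

On the toy data the probability mass for each source word concentrates on its actual translation after a few iterations, even though no word-level links were given.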

Our Sentence Alignment on Parallel Texts
• Reification: a link in the alignment is treated as a context-independent structured object
• Using SVM (libsvm solution)
• Features:
  • translation equivalence
  • word length correlation (Pearson)
  • special characters occurrence similarity
  • word frequency ranks correlation
• Crossed links are allowed
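
One of the listed features, word length correlation, can be illustrated with a plain Pearson coefficient. The slide does not define how the feature is computed, so this is only one possible reading: the correlation between the lengths of source words and the lengths of their translation equivalents inside a candidate link. The function names and the word-pair input are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant sequences
    return cov / math.sqrt(var_x * var_y)


def word_length_correlation(word_pairs):
    """Correlate word lengths over (source_word, target_word) pairs.

    word_pairs holds translation equivalents found in a candidate link
    (an assumed input format).
    """
    source_lengths = [len(s) for s, _ in word_pairs]
    target_lengths = [len(t) for _, t in word_pairs]
    return pearson(source_lengths, target_lengths)
```

A value like this would be one component of the feature vector passed to the SVM, next to the other features listed above.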

Scenarios for Aligning Comparable Corpora
• Based on previous experience, the literature and current constraints (time, manpower, computational resources), we envisaged three possible ways of tackling the alignment of comparable corpora in order to get useful results:
  • QA techniques
  • Clustering
  • Windowing

Accurat partners

Contact information:
Andrejs Vasiljevs
andrejs tilde.lv
Tilde, Vienibas gatve 75a, Riga LV1004, Latvia
www.accurat-project.eu
The ACCURAT project has received funding from the EU 7th Framework Programme for Research and Technological Development under Grant Agreement N° 248347
Project duration: January 2010 – June 2012
