ACCURAT: Metrics for the evaluation of comparability of multilingual corpora

Andrejs Vasiljevs, Inguna Skadina (Tilde), Bogdan Babych, Serge Sharoff (CTS), Ahmet Aker, Robert Gaizauskas, David Guthrie (USFD)

LREC 2010 Workshop on Methods for the automatic acquisition of Language Resources and their evaluation

May 23, 2010

Comparable Corpora
  • Non-parallel bi- or multilingual text resources
  • Collection of documents that are:
    • gathered according to a set of criteria, e.g. the proportion of texts of the same genre, in the same domains and from the same period
    • in two or more languages
    • containing overlapping information
  • Examples:
    • multilingual news feeds,
    • multilingual websites,
    • Wikipedia articles,
    • etc.
Key objectives
  • To create comparability metrics - to develop the methodology and determine criteria to measure the comparability of source and target language documents in comparable corpora
  • To develop, analyze and evaluate methods for automatic acquisition of comparable corpora from the Web
  • To elaborate advanced techniques for extraction of lexical, terminological and other linguistic data from comparable corpora to provide training and customization data for MT
  • To measure improvements from applying acquired data against baseline results from SMT and RBMT systems
  • To evaluate and validate the ACCURAT project results in practical applications
ACCURAT Languages
  • Focus on under-resourced languages: Latvian, Lithuanian, Estonian, Greek, Croatian, Romanian, Slovenian
  • Major translation directions, e.g. English-Lithuanian, English-Croatian, German-Romanian
  • Minor translation directions, e.g. Lithuanian-Romanian, Romanian-Greek and Latvian-Lithuanian
  • Methods will be adaptable to new languages and domains, and language-independent where possible
  • Applicability of methods will be evaluated in usage scenarios
Project Partners
  • Tilde (Project Coordinator) - Latvia
  • University of Sheffield - UK
  • University of Leeds - UK
  • Athena Research and Innovation Center in Information Communication and Knowledge Technologies - Greece
  • University of Zagreb - Croatia
  • DFKI - Germany
  • Institute of Artificial Intelligence - Romania
  • Linguatec - Germany
  • Zemanta - Slovenia
Objectives for comparability metrics
  • To develop criteria and automated metrics to determine the kind and degree of comparability of comparable corpora and the parallelism of documents and individual sentences within documents
  • To evaluate metrics designed for determining similar documents in comparable corpora
  • To develop a methodology to assess comparable corpora collected from the Web and to choose an alignment strategy and lexical data extraction methods
Criteria of Comparability and Parallelism
  • Lack of definite methods to determine the criteria of comparability
  • Some attempts to measure the degree of comparability according to distribution of topics and publication dates of documents in comparable corpora to estimate the global comparability of the corpora (Saralegi et al., 2008)
  • Some attempts to determine different kinds of document parallelism in comparable corpora, such as complete parallelism, noisy parallelism and complete non-parallelism
  • Some attempts to define criteria of parallelism of similar documents in comparable corpora, such as similar number of sentences, sharing sufficiently many links (up to 30%), and monotony of links (up to 90% of links do not cross each other) (Munteanu, 2006)
  • Research on automated methods for assessing the composition of web corpora in terms of domains and genres (Sharoff, 2007)
Towards establishing metrics

Metrics: intralingual and interlingual comparability for genres, domains and topics

  • intralingual: distance between corpora and documents within corpora in the same language
    • methods: distance in feature spaces and machine learning
  • interlingual: distance between corpora and documents in different languages
    • methods: dictionaries and existing MT to map feature spaces between languages

Evaluation: validation of the scale by independent annotation

Criteria of Comparability and Parallelism
  • To investigate criteria for comparability between corpora concentrating on different sets of features:
    • Lexical features: measuring the degree of 'lexical overlap' between frequency lists derived from corpora
    • Lexical sequence features: computing N-gram distances in terms of tokens
    • Morpho-syntactic features: computing N-gram distances in terms of Part-of-Speech codes

Initial set of the features which may be used to identify the comparability between documents
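
As a rough illustration of the first two feature types, the sketch below computes a lexical overlap between top-N frequency lists and a distance between N-gram profiles. The slides do not define either measure precisely, so the Jaccard overlap, the L1 profile distance and all function names here are assumptions chosen for illustration. The same `ngram_profile` function can be applied to Part-of-Speech codes instead of tokens.

```python
from collections import Counter


def top_words(tokens, n=500):
    """Return the set of the n most frequent tokens."""
    return {word for word, _ in Counter(tokens).most_common(n)}


def lexical_overlap(tokens_a, tokens_b, n=500):
    """Jaccard overlap between two top-n frequency lists (assumed measure)."""
    top_a, top_b = top_words(tokens_a, n), top_words(tokens_b, n)
    if not top_a and not top_b:
        return 0.0
    return len(top_a & top_b) / len(top_a | top_b)


def ngram_profile(symbols, n=2):
    """Relative frequencies of n-grams over a sequence of tokens or POS codes."""
    grams = Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()} if total else {}


def profile_distance(profile_a, profile_b):
    """L1 distance between two n-gram profiles (0 = identical, 2 = disjoint)."""
    keys = set(profile_a) | set(profile_b)
    return sum(abs(profile_a.get(k, 0.0) - profile_b.get(k, 0.0)) for k in keys)
```

Both corpora must be in the same language (or mapped into one) for these comparisons to be meaningful.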

Initial comparable corpora
  • An Initial Comparable Corpus has been collected for the development of the comparability metrics
  • 11 M words, 9 languages
Comparable Test Corpora

Collected for evaluation of the comparability metrics

34% parallel texts, 33% strongly comparable texts and 33% weakly comparable texts

9 languages, 247 000 running words

Benchmarking comparability
  • Problems with human labelling:
    • Human labels are too coarse-grained
    • Symbolic labels make it hard to establish a correlation with the numeric scores produced by the metric
    • Labelling criteria and/or human annotation may be unsystematic
  • Proposal to benchmark comparability metric against the score of resulting MT quality (e.g., the standard BLEU/NIST scores)
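
One way to read this proposal is as a correlation check: for a set of corpus pairs, compare the metric's scores with the BLEU scores of MT systems trained on data extracted from those pairs. The sketch below is a minimal, assumed version of that check using Pearson's correlation; the function name and any input values are hypothetical and no results from the project are implied.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)
```

If the metric is a distance (lower means more comparable), a useful metric would be expected to correlate negatively with BLEU.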
Initial experiment: Cross-lingual comparison of corpora
  • Mapping feature spaces using bilingual dictionaries or translation probability tables
  • Purpose: to see how large the difference is between the frequencies of words and the frequencies of their translations in another language
  • Set-up: 500 most frequent words; Relative frequencies; Bilingual translation probability tables (Europarl)
  • χ²-score (cross-lingual intersection)
  • Pearson's correlation with the degree of comparability
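
The mapping step can be pictured as redistributing each word's relative frequency over its translations. The sketch below assumes a translation probability table stored as nested dictionaries, `translation_probs[word_b][word_a] = P(word_a | word_b)`; that layout and the function name are illustrative assumptions, not the project's actual data format.

```python
def map_frequencies(rel_freq_b, translation_probs):
    """Project relative frequencies from language B into language A's vocabulary.

    Each word's frequency is split across its translations in proportion to
    the translation probabilities; words without a table entry are dropped.
    """
    mapped = {}
    for word_b, freq in rel_freq_b.items():
        for word_a, prob in translation_probs.get(word_b, {}).items():
            mapped[word_a] = mapped.get(word_a, 0.0) + freq * prob
    return mapped


# Toy German-to-English example with invented numbers.
rel_freq_de = {"haus": 0.02, "und": 0.05}
table_de_en = {"haus": {"house": 0.8, "home": 0.2}, "und": {"and": 1.0}}
mapped_en = map_frequencies(rel_freq_de, table_de_en)
```

The mapped frequencies can then be compared directly with the frequency list of the corpus in language A.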
Initial experiment
  • Comparability of corpora is measured in terms of lexical features (Greek—English and German—English language pairs)
  • The set-up is similar to (Kilgarriff, 2001):
    • For each corpus take the top 500 most frequent words
    • relative frequency is used (the absolute frequency, or the word count, divided by the length of the corpus)
  • Dictionaries are generated automatically with GIZA++ from the parallel Europarl corpus
  • We compare corpora pairwise using a standard Chi-Square distance measure:

ChiSquare = ∑ over w in {w1 ... w500} of (FrqObserved(w) - FrqExpected(w))^2 / FrqObserved(w)
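
A minimal sketch of this computation follows, assuming tokenised corpora and frequency lists that already share one vocabulary. The default denominator follows the formula as printed on the slide (the observed frequency); the textbook Pearson statistic divides by the expected frequency instead, so that variant is offered as an option. Restricting the sum to words present in both lists is a simplifying assumption of this sketch.

```python
from collections import Counter


def relative_frequencies(tokens, top_n=500):
    """Relative frequency (count / corpus length) of the top_n most frequent words."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.most_common(top_n)}


def chi_square(observed, expected, denominator="observed"):
    """Sum of squared frequency differences over words present in both lists.

    denominator="observed" follows the formula as printed on the slide;
    denominator="expected" gives the textbook Pearson form.
    """
    if denominator not in ("observed", "expected"):
        raise ValueError("denominator must be 'observed' or 'expected'")
    score = 0.0
    for word in observed.keys() & expected.keys():
        o, e = observed[word], expected[word]
        score += (o - e) ** 2 / (o if denominator == "observed" else e)
    return score
```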

3rd BUCC, Malta, 22-05-10

Initial experiment
  • Asymmetric method: relative frequencies in Corpus in language A are treated as “expected” values, and those mapped from the Corpus in language B – as “observed”. Then we swap Corpora A and B and repeat the calculation. Asymmetry comes from words which are missing in one of the lists as compared to the other. Missing words have different relative frequencies that are added to the score, so distance from A to B can be different than from B to A. We use the minimum of these distances as the final score for the pair of corpora.
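
The two-direction calculation might be sketched as below. The slide does not say exactly how a missing word is scored, so this sketch assumes that a word in the "expected" list with no counterpart in the "observed" list contributes its own relative frequency to the score; that penalty, like the function names, is an assumption made for illustration. Shared words use the chi-square term with the observed frequency in the denominator, as in the formula on the previous slide.

```python
def directed_distance(expected, observed):
    """Distance with `expected` as the reference list and `observed` as the mapped list."""
    score = 0.0
    for word, e in expected.items():
        o = observed.get(word, 0.0)
        if o == 0.0:
            score += e  # assumed penalty: the missing word's own relative frequency
        else:
            score += (o - e) ** 2 / o
    return score


def corpus_distance(freq_a, mapped_freq_b):
    """Compute both directions and keep the minimum as the final score."""
    a_to_b = directed_distance(freq_a, mapped_freq_b)
    b_to_a = directed_distance(mapped_freq_b, freq_a)
    return min(a_to_b, b_to_a)
```

Because each direction only penalises words missing from the other list, the two directed scores generally differ, which is the asymmetry the slide describes.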


Collecting comparable texts from the Web
  • How should we collect comparable texts from the Web?
  • Crawl all documents on the Web and use comparability metrics to align them
    • Inefficient
    • High computational effort to align
  • Instead: build a classifier/retrieval system that, given a document (or certain characteristics), can find other comparable documents
  • Combine crawling with document pair classification or searching
  • Use the initial comparable corpora for:
    • feature extraction
    • training a classifier/ranking system
  • Predict whether pairs of documents are
    • parallel
    • strongly comparable
    • weakly comparable
    • not comparable
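
The slides do not name a learning algorithm, so the following is only a sketch of the idea using a nearest-centroid classifier written in plain Python. The feature vectors stand for whatever is extracted from a document pair (for example a lexical overlap score and a length ratio); the features, the training values and the function names are all hypothetical.

```python
from collections import defaultdict
from math import dist

LABELS = ("parallel", "strongly comparable", "weakly comparable", "not comparable")


def train_centroids(examples):
    """Average the feature vectors of annotated document pairs per label.

    examples: iterable of (feature_vector, label) pairs.
    """
    sums, counts = {}, defaultdict(int)
    for features, label in examples:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest (Euclidean) to the features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Invented training data: (lexical overlap, length ratio) per document pair.
training = [
    ([0.90, 1.00], "parallel"),
    ([0.60, 0.90], "strongly comparable"),
    ([0.30, 0.70], "weakly comparable"),
    ([0.05, 0.40], "not comparable"),
]
centroids = train_centroids(training)
```

In practice the training examples would come from the annotated initial comparable corpora, and a ranking system could replace the classifier when the task is retrieval rather than labelling.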
General Idea

[Diagram; labels: Initial Comparable Corpora, Feature extraction, New Documents, Predicted Comparability Level (strongly comparable, weakly comparable, not comparable)]

  • Evaluation of comparability metrics against manually annotated Comparable Test Corpora (precision, recall)
  • Evaluation of document level alignment methods against manually annotated Comparable Test Corpora (precision, recall)
  • Evaluation of sentence and phrase level alignment for corpora with different levels of comparability; this needs aligned data, and one idea is to deliberately "spoil" a parallel corpus
  • Automated evaluation of applicability in MT against baseline systems (BLEU, NIST)
  • User evaluation in gisting and post-editing scenarios
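
For the first two bullets, precision and recall can be computed per comparability level by comparing predicted labels with the manual annotation. The sketch below assumes both are given as parallel lists of label strings, one entry per document pair; the function name and that input format are illustrative assumptions.

```python
def precision_recall(gold, predicted, label):
    """Precision and recall for one label, given gold and predicted label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Reporting the scores per label keeps errors on the rarer levels visible instead of averaging them away.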

Contact information:

Andrejs Vasiljevs


Tilde, Vienibas gatve 75a, Riga

LV1004, Latvia

The ACCURAT project has received funding from the EU 7th Framework Programme for Research and Technological Development under Grant Agreement N°248347

Project duration: January 2010 – June 2012