
Presentation Transcript


  1. How to Measure the Success of Machine Translation Dion Wiggins Chief Executive Officer dion.wiggins@asiaonline.net

  2. The World We Live In. Every day 15 petabytes of new information are generated; a petabyte is one million gigabytes. That is 8 times more than the information stored in all US libraries, the equivalent of 20 million four-drawer filing cabinets filled with text. In 2012 we have 5 times more data stored than we did in 2008. The volume of data is growing exponentially and is expected to increase 20-fold by 2020. We now have access to more data than at any time in human history.

  3. The World We Live In. By 2015 there will be more than 15 billion devices connected to the internet. We live in a world that is increasingly instrumented and interconnected. The number of "smart" devices is growing every day, and the volume of data they produce is growing exponentially, doubling every 18 months. All these devices create new demand for access to information: access now, on demand, in real time.

  4. The World We Live In. Google translates more in 1 day than all human translators in 1 year. Google's message to the market has long been that its business is making the world's information searchable, and that MT is part of that mission.

  5. The World We Live In. LSPs translate a mere 0.00000067% of the text information created every day. How much new text information should be translated? Even if only 1% of the new text information created each day should be translated, that still means LSPs translate only 0.000067% of it. • Common Sense Advisory calculates: US$31.4 billion earned for language services in 2011, divided by 365 days, divided by 10 cents per word.
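The Common Sense Advisory figures quoted above reduce to a simple back-of-the-envelope calculation; the sketch below (Python, using only the numbers on the slide) shows the resulting daily word count.

```python
# Back-of-the-envelope check of the Common Sense Advisory calculation quoted above.
# Inputs are the slide's own figures: US$31.4B revenue (2011), 365 days, 10 cents per word.
revenue_usd = 31.4e9
price_per_word_usd = 0.10
words_per_day = revenue_usd / 365 / price_per_word_usd
print(f"LSP output: about {words_per_day / 1e6:.0f} million words per day")
# -> about 860 million words per day, the basis for the tiny percentages above
```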

  6. The World We Live In Translation Demand is increasing. Translator Supply is decreasing

  7. The Impact of a Translator Shortage. Skilled translator supply is shrinking.
• At 2,000-3,000 words per day per translator, it is already clear that demand is many multiples of supply.
• LSPs are having trouble finding qualified and skilled translators, in part due to lower rates in the market and more competition for resources.
• A wave of new LSPs and translators will try to capitalize on the market opportunity created by the translator shortage, but will deliver sub-standard services owing to lack of experience among both new LSPs and translators.
• Lower quality translations will become more commonplace.

  8. Expanding the Reach of Translation
Existing markets (US$31.4B), primarily human translation:
• Corporate (partly multilingual): corporate brochures, 2,000 words
• Products: product brochures, 10,000 words
• User interface: software products, 50,000 words
• User documentation: manuals / online help, 200,000 words
New markets, primarily machine translation:
• Enterprise information: HR / training / reports, 500,000 words
• Communications: email / IM, 10,000,000 words
• Support / knowledge base: call center / help desk, 20,000,000+ words
• User generated content: blogs / reviews, 50,000,000+ words

  9. Interesting Question. VinodBhaskar: Machine vs. Human Translation. • Are machine translations gaining ground? Can they put translators out of circulation like cars did to horses and cart drivers?

  10. Evolution of the Automotive Industry: Early Innovation; Production Line; Service Industries; Fuel and Energy Industry; Transportation; Research and Development; Customization and Parts; Mass Production.

  11. Translation is Evolving Along a Similar Path
• Early Innovation: Human Translator; Single Language Vendors; Translation Memory; Dictionary / Glossary; Multi Language Vendors
• Production Line: Specialization; Editing / Proofing; Translation / Localization; Global work groups; Quality assurance
• Service Industries: Managed crowd and community; professionally managed amateur workforce; mass translation; knowledge dissemination
• Technology Industry: Data automation; information processing; multilingual software
• Research and Development: Research funding and grants; natural language processing; language technologies (search, voice, text, etc.); preservation of dying languages; automated and enhanced post-editing; European-to-Asian language pair automation
• Communications: Data and reports; phone and mobile data; internet and broadband; website translation
• Translation for the Masses: World Lingo; Google Translate; Yahoo Babelfish
• Customization: Custom engine development; mass translation at high quality; newly justifiable business models

  12. Evolution of Machine Translation Quality (quality over time, rising from Experimental to Gist to the Good Enough Threshold and toward Near Human)
• Experimental: early RBMT improved rapidly as new techniques were discovered, then quality plateaued as RBMT reached its limits in many languages, with only marginal improvement. Babelfish; 9/11 -> research funding.
• Gist: early SMT vendors emerged as processors became powerful enough and large volumes of digital data became available; Google switches from Systran to SMT and drives MT acceptance.
• Above the Good Enough Threshold: hybrid MT platforms and new techniques mature; businesses start to consider MT; LSPs start to adopt MT; new skills develop in editing MT.
• Near Human quality is the top of the curve.

  13. Machine Translation Hype Cycle (*not an official Gartner Hype Cycle; y-axis: Visibility, x-axis: Time / Maturity, with years marked 1947, 1954, 1966, 1990, 2001, 2007, 2011, 2015)
• Phases: Technology Trigger; Peak of Inflated Expectations; Trough of Disillusionment; Slope of Enlightenment; Plateau of Productivity.
• Events plotted along the curve: the "Translation" memorandum; the Georgetown experiment; the ALPAC report; IBM Research; the move from mainframe to PC; Babelfish; 9/11; Google switches to SMT; Moses; Microsoft & Google announce paid APIs; early LSP adopters; notable quality improvement; near human quality examples emerge; mainstream LSP use.

  14. Top 5 Reasons For Not Adopting MT
1. Perception of Quality: many believe Google Translate is as good as it gets / state of the art. This is true for scale, but not for quality.
2. Perception of Quality: perfect quality is expected from the outset, and tests using Google or other out-of-the-box translation tools are disappointing. Combined with #1, other MT is quickly ruled out as an option.
3. Perception of Quality: the opposite of #2, human resistance to MT; the "a machine will never be able to deliver human quality" mindset.
4. Perception of Quality: few understand that out-of-the-box or free MT and customized MT are different, so they don't see why they should pay for commercial MT when quality is perceived as the same.
5. Perception of Quality: quality is not good enough as raw MT output. But the equation is not MT OR Human; it is MT AND Human.

  15. What Is Different This Time Around? MT: Machine Translation, 50 Years of eMpTy Promises. Q: Why does an industry that has spent 50 years failing to deliver on its promises still exist? A: Infinite demand, a well defined and growing problem that has always been looking for a solution; what was missing was QUALITY.

  16. Definition of Quality: whatever the customer says it is!

  17. Quality Depends on the Purpose Establish Clear Quality Goals Step 1 – Define the purpose Step 2 – Determine the appropriate quality level • Document Search and Retrieval • Purpose: To find and locate information • Quality: Understandable, technical terms key • Technique: Raw MT + Terminology Work • Knowledge Base • Purpose: To allow self support via web • Quality: Understandable, can follow directions provided • Technique: MT & Human for key documents • Search Engine Optimization (SEO) • Purpose: To draw users to site • Quality: Higher quality, near human • Technique: MT + Human (student, monolingual) • Magazine Publication • Purpose: To publish in print magazine • Quality: Human quality • Technique: MT + Human (domain specialist, bilingual)

  18. Reality Check – What Do You Really Get?

  19. Translation Speed (words per day per translator, chart scale 0 to 28,000): the chart compares Human Translation, Typical MT + Post Editing Speed, and Fastest MT + Post Editing*. The average person reads 200-250 words per minute, or 96,000-120,000 words in 8 hours, roughly 35 times faster than human translation. *Fastest MT + Post Editing speed as reported by clients.
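The reading-speed comparison is straightforward arithmetic; a minimal check in Python, using only the figures quoted on the slides (200-250 words per minute of reading, an 8-hour day, and 2,000-3,000 words per day of human translation):

```python
# Reading speed vs. human translation speed, using the figures from the slides.
minutes_per_day = 8 * 60
reading_per_day = (200 * minutes_per_day, 250 * minutes_per_day)   # (96000, 120000) words
human_translation_per_day = (2000, 3000)                           # typical translator output
print(reading_per_day)
print(reading_per_day[0] / human_translation_per_day[1])   # ~32x at the conservative ends
# Mid-range assumptions give a ratio in the region of the ~35x quoted on the slide.
```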

  20. Success Factors: Understanding Return On Investment

  21. Success Factors: Understanding Return On Investment

  22. Objective Measurement is Essential Targets should be defined, set and managed from the outset

  23. Why Measure?

  24. What to measure? Objective measurement is the only means to understand MT performance.
• Automated metrics: useful to some degree, but not enough on their own.
• Post editor feedback: useful for sentiment, but not a reliable metric; when compared to technical metrics, reality is often very different.
• Number of errors: useful, but can be misleading; the complexity of error correction is often overlooked.
• Time to correct: on its own useful for productivity metrics, but not enough when more depth and understanding is required.
• Difference between projects: combined, the measures above give an understanding of each project, but they are much more valuable when compared over several similar projects.
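To make the "difference between projects" point concrete, the sketch below combines the measures listed above into one comparable per-project summary. It is illustrative only: the project names, fields and numbers are invented and this is not Asia Online's reporting format.

```python
# Illustrative only: combining the measures above into one per-project summary.
from dataclasses import dataclass

@dataclass
class ProjectMeasurement:
    name: str
    words: int          # words post-edited in the project
    bleu: float         # automated metric on the project's test set
    errors: int         # errors found in review
    edit_hours: float   # total post-editing time

    def errors_per_1000_words(self) -> float:
        return 1000 * self.errors / self.words

    def words_per_hour(self) -> float:
        return self.words / self.edit_hours

projects = [
    ProjectMeasurement("Manuals Q1", 120_000, 0.46, 310, 95.0),
    ProjectMeasurement("Manuals Q2", 135_000, 0.51, 240, 88.0),
]
for p in projects:
    print(p.name, round(p.errors_per_1000_words(), 1), "errors/1k words,",
          round(p.words_per_hour()), "words/hour, BLEU", p.bleu)
```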

  25. Rapid MT Quality Assessment
• Human assessments: long-term consistency, repeatability and objectivity are important; Butler Hill Group has developed a protocol that is widely accepted and used; assessments can be based on error categorization such as SAE J2450; they should be used together with automated metrics; they will focus more on post-editing characteristics in future.
• Automated metrics: BLEU is the most commonly used metric ("... the closer the machine translation is to a professional human translation, the better it is"); METEOR, TERp and many others are in development; limited, but still useful for MT engine development if properly used.

  26. Different Automated Metrics All four metrics compare a machine translation to human translations • BLEU (Bilingual Evaluation Understudy) • BLEU was one of the first metrics to achieve a high correlation with human judgements of quality, and remains one of the most popular. • Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations. • Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality. • Intelligibility or grammatical correctness are not taken into account. • BLEU is designed to approximate human judgement at a corpus level, and performs badly if used to evaluate the quality of individual sentences. • More: http://en.wikipedia.org/wiki/BLEU • NIST • Name comes from the US National Institute of Standards and Technology. • It is based on the BLEU metric, but with some alterations: • Where BLEU simply calculates n-gram precision adding equal weight to each one, NIST also calculates how informative a particular n-gram is. That is to say when a correct n-gram is found, the rarer that n-gram is, the more weight it will be given. • NIST also differs from BLEU in its calculation of the brevity penalty insofar as small variations in translation length do not impact the overall score as much. • More: http://en.wikipedia.org/wiki/NIST_(metric)
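As a concrete illustration of corpus-level BLEU, here is a minimal sketch using NLTK's implementation. This is an assumption of convenience: any BLEU tool could be substituted, and it is not the scoring pipeline of Language Studio™ or any particular vendor.

```python
# Minimal corpus-level BLEU with NLTK (a sketch, not a vendor's scoring pipeline).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of references per hypothesis; each reference is a token list.
references = [
    [["the", "switch", "has", "two", "ports"]],
    [["press", "ok", "to", "continue"]],
]
hypotheses = [
    ["the", "switch", "has", "2", "ports"],
    ["press", "ok", "to", "continue"],
]
score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"Corpus BLEU: {score:.3f}")
```

Note the smoothing function: on tiny inputs the higher-order n-gram precisions are often zero, which is exactly why, as the slide says, BLEU should be read at corpus level rather than sentence by sentence.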

  27. Different Automated Metrics • F-Measure (F1 Score or F-Score) • In statistics, the F-Measure is a measure of a test's accuracy. • It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results and r is the number of correct results divided by the number of results that should have been returned. • The F-Measure score can be interpreted as a weighted average of the precision and recall, where a score reaches its best value at 1 and worst score at 0. • More: http://en.wikipedia.org/wiki/F1_Score • METEOR (Metric for Evaluation of Translation with Explicit ORdering) • The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. • It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. • The metric was designed to fix some of the problems found in the more popular BLEU metric, and also produce good correlation with human judgement at the sentence or segment level. • This differs from the BLEU metric in that BLEU seeks correlation at the corpus level. • More: http://en.wikipedia.org/wiki/METEOR
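The F-Measure definition above fits in a few lines; the sketch below implements the weighted harmonic mean directly (the example precision and recall values are invented).

```python
# The F-Measure as defined above: a weighted harmonic mean of precision and recall.
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """beta=1 gives the balanced F1 score; higher beta weights recall more heavily."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example (invented numbers): 8 of 10 returned words are correct (p = 0.8),
# and 8 of the 16 words that should have been returned were found (r = 0.5).
print(round(f_measure(0.8, 0.5), 3))   # 0.615
```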

  28. Usability and Readability Criteria Evaluation Criteria of MT output

  29. Human evaluators can develop a custom error taxonomy to help identify key error patterns, or use an error taxonomy from standards such as the LISA QA Model or SAE J2450.

  30. Sample Metric Report From Language Studio™

  31. Multiple References Increase Scores
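A small demonstration of this effect, using NLTK's sentence-level BLEU with one versus two references (the sentences are invented for illustration):

```python
# One vs. two references with NLTK sentence-level BLEU (invented sentences).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
hyp = "the engine was trained on client data".split()
ref1 = "the engine was built from customer data".split()
ref2 = "the engine was trained on client data sets".split()

print(sentence_bleu([ref1], hyp, smoothing_function=smooth))        # one reference
print(sentence_bleu([ref1, ref2], hyp, smoothing_function=smooth))  # two references: higher
```

With the second reference, more of the hypothesis n-grams find a match somewhere, so the score rises even though the translation itself has not changed.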

  32. Why do you need rules? Normalization: "2 Port Switch", "Double Port Switch" and "Dual Port Switch" all name the same product and should be normalized to a single preferred form.
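A minimal sketch of such a normalization rule, mapping the slide's three variants to one preferred form before the text reaches the engine. The rule syntax here is my own illustration, not Language Studio™ rule syntax.

```python
# A single normalization rule for the variants on this slide (illustrative syntax only).
import re

NORMALIZATION_RULES = [
    (re.compile(r"\b(?:2|two|double|dual)[- ]port switch\b", re.IGNORECASE),
     "dual port switch"),
]

def normalize(text: str) -> str:
    for pattern, replacement in NORMALIZATION_RULES:
        text = pattern.sub(replacement, text)
    return text

print(normalize("Connect the 2 Port Switch to the Double Port Switch."))
# -> Connect the dual port switch to the dual port switch.
```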

  33. Why do you need rules? Terminology control and management: a glossary of non-translatable terms such as product names, and job-specific preferred terminology.
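One common way to enforce a non-translatable glossary is to mask the protected terms before translation and restore them afterwards; the sketch below shows the idea. The placeholder format and the term list are illustrative assumptions, not Asia Online's implementation.

```python
# Mask non-translatable terms before MT and restore them afterwards (illustrative).
NON_TRANSLATABLE_TERMS = ["Language Studio", "Asia Online"]

def protect(text: str):
    mapping = {}
    for i, term in enumerate(NON_TRANSLATABLE_TERMS):
        token = f"__NT{i}__"
        if term in text:
            text = text.replace(term, token)
            mapping[token] = term
    return text, mapping

def restore(text: str, mapping: dict) -> str:
    for token, term in mapping.items():
        text = text.replace(token, term)
    return text

protected, mapping = protect("Asia Online customizes each Language Studio engine.")
translated = protected  # ...send `protected` through the MT engine here...
print(restore(translated, mapping))
```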

  34. BLEU scores and other translation quality metrics will vary based upon:
• The test set being measured: different test sets will give very different scores. A test set that is out of domain will usually score lower than one that is in the domain of the translation engine being tested. The test set should be of gold-standard quality; lower quality test set data will give a less meaningful score.
• How many human reference translations were used: if there is more than one human reference translation, the resulting BLEU score will be higher, as there are more opportunities for the machine translation to match part of a reference.
• The complexity of the language pair: Spanish is a simpler language in terms of grammar and structure than Finnish or Chinese. Typically, if the source or target language is more complex, the BLEU score will be lower.

  35. BLEU scores and other translation quality metrics will vary based upon:
• The complexity of the domain: a patent has far more complex text and structure than a children's story book, and very different metric scores will be calculated depending on the complexity of the domain. It is not practical to compare two different test sets and conclude that one translation engine is better than the other.
• The capitalization of the segments being measured: when comparing metrics, the most common form of measurement is case insensitive; however, when publishing, case-sensitive output also matters and may also be measured.
• The measurement software: there are many measurement tools for translation quality, and each may vary slightly in how a score is calculated, or the settings of the measurement tools may not be the same. The same measurement software should be used for all measurements. Asia Online provides Language Studio™ Pro free of charge, which measures a variety of quality metrics.
It is clear from the above list of variations that a BLEU score by itself has no real meaning.
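The capitalization factor above is easy to demonstrate: lowercasing both hypothesis and reference before scoring gives the case-insensitive figure, which is typically equal to or higher than the case-sensitive one (NLTK assumed; the sentences are invented).

```python
# Case-sensitive vs. case-insensitive scoring of the same output (NLTK; invented sentences).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
ref = "Open the Dual Port Switch settings".split()
hyp = "open the dual port switch settings".split()

case_sensitive = sentence_bleu([ref], hyp, smoothing_function=smooth)
case_insensitive = sentence_bleu([[t.lower() for t in ref]],
                                 [t.lower() for t in hyp],
                                 smoothing_function=smooth)
print(case_sensitive, case_insensitive)   # the case-insensitive score is higher here
```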

  36. Metrics Will Vary: Even the Same Metrics!

  37. What is your BLEU score? This is the single most irrelevant question relating to translation quality, yet one of the most frequently asked.


  40. Basic Test Set Criteria Checklist

  41. Basic Test Set Criteria Checklist. The criteria specified by this checklist are absolute: not complying with any of the checklist items will result in a score that is unreliable and less meaningful.
• Test set data should be very high quality: if the test set data are of low quality, then the resulting metric cannot be relied upon. Proofread the test set; don't just trust existing translation memory segments.
• The test set should be in domain: the test set should represent the type of information that you are going to translate, and the domain, writing style and vocabulary should be representative of what you intend to translate. Testing on out-of-domain text will not produce a useful metric.
• Test set data must not be included in the training data: if you are creating an SMT engine, you must make sure that the data you are testing with, or very similar data, are not in the data the engine was trained with. If the test data are in the training data, the scores will be artificially high and will not represent the level of quality produced when other data are translated.
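The third criterion can be checked mechanically; a minimal sketch that flags exact overlaps between a candidate test set and the training corpus (near-duplicates would need fuzzy matching on top of this):

```python
# Flag test segments that already occur verbatim in the training data.
# Exact string matching only; near-duplicates would need fuzzy matching as well.
def find_overlap(train_segments, test_segments):
    train_set = {s.strip().lower() for s in train_segments}
    return [s for s in test_segments if s.strip().lower() in train_set]

train = ["Insert the battery before powering on the device.", "Press OK to continue."]
test = ["Press OK to continue.", "Remove the cover carefully."]
print(find_overlap(train, test))   # segments to drop from the test set
```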

  42. Basic Test Set Criteria Checklist
• Test set data should be data that can be translated: test set segments should contain a minimal amount of dates, times, numbers and names. While a valid part of a segment, these are not parts that are translated; they are usually transformed or mapped. The focus of a test set should be on words that are to be translated.
• Test set segments should be between 8 and 15 words in length: short segments artificially raise quality scores, as most metrics do not take segment length into account, and a short segment is more likely to match an entire reference exactly, which is less a translation than a 100% translation memory match. The longer the segment, the more opportunity there is for variation in what is being translated, which results in artificially lower scores even when the translation is good. A small number of segments shorter than 8 words or longer than 15 words is acceptable, but these should be very few.
• The test set should be at least 1,000 segments: while it is possible to get a metric from smaller test sets, a reasonable statistical representation can only be obtained when there are sufficient segments to build statistics from. With only a small number of segments, small anomalies in one or two segments can artificially raise or lower the test set score.
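The length and size criteria above are easy to enforce when assembling a test set. A small sketch: the 8-15 word window and the 1,000-segment floor come from the checklist, while the function itself is my own illustration.

```python
# Keep segments of 8-15 words and require at least 1,000 of them, per the checklist.
def build_test_set(segments, min_len=8, max_len=15, min_size=1000):
    usable = [s for s in segments if min_len <= len(s.split()) <= max_len]
    if len(usable) < min_size:
        raise ValueError(f"only {len(usable)} usable segments; at least {min_size} needed")
    return usable

# usage: test_set = build_test_set(candidate_segments)
```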

  43. Comparing Translation Engines Initial Assessment Checklist

  44. Comparing Translation Engines: Initial Assessment Checklist. All conditions of the Basic Test Set Criteria must be met. If any condition is not met, the results of the test could be flawed and not meaningful or reliable.
• The test set must be consistent: the exact same test set must be used for comparison across all translation engines. Do not use different test sets for different engines.
• Test sets must be "blind": if the MT engine has seen the test set before, or the test set data were included in the training data, the quality of the output will be artificially high and will not represent a true metric.
• Tests must be carried out transparently: where possible, submit the data yourself to the MT engine and get it back immediately; do not rely on a third party to submit the data. If there are no tools or APIs for test set submission, the test set should be returned within 10 minutes of being submitted to the vendor via email. This removes any possibility of the MT vendor tampering with the output or fine-tuning the engine based on the output.

  45. Comparing Translation Engines: Initial Assessment Checklist
• Word segmentation and tokenization must be consistent: if word segmentation is required (i.e. for languages such as Chinese, Japanese and Thai), the same word segmentation tool should be used on the reference translations and on all the machine translation outputs. The same tokenization should also be used. Language Studio™ Pro provides a simple means to ensure all tokenization is consistent with its embedded tokenization technology.
• Provide each MT vendor with a sample of at least 20 in-domain documents: this allows each vendor to better understand the type of documents and customize accordingly. This sample should not be the same as the test set data.
• Test in three stages: Stage 1, starting quality without customization; Stage 2, initial quality after customization; Stage 3, quality after the first round of improvement, which should include post-editing of at least 5,000 segments, preferably 10,000.
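A sketch of the consistency rule in the first item: run one tokenizer over the reference and every engine's output before any scoring. A simple regex tokenizer stands in here for whatever tool is actually chosen; it is not Language Studio™ Pro's embedded tokenizer.

```python
# Apply one tokenizer to the reference and to every engine's output before scoring.
import re

def tokenize(text: str):
    return re.findall(r"\w+|[^\w\s]", text.lower())

reference = tokenize("The engine was trained on client data.")
outputs = {
    "engine_a": tokenize("The engine was trained on client data."),
    "engine_b": tokenize("The engine is trained with client data."),
}
for name, tokens in outputs.items():
    matches = sum(1 for tok in tokens if tok in reference)
    print(name, f"{matches}/{len(tokens)} tokens match the identically tokenized reference")
```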

  46. Understanding and Comparing Improvements

  47. The Language Studio™ 4 Step Quality Plan
1. Customize: create a new custom engine using foundation data and your own language assets.
2. Measure: measure the quality of the engine for rating and for future improvement comparisons.
3. Improve: provide corrective feedback, removing potential for translation errors.
4. Manage: manage translation projects while generating corrective data for quality improvement.

  48. A Simple Recipe for Quality Improvement
• Asia Online develops a specific improvement roadmap for each custom engine, ensuring the fastest possible path to quality.
• You can start from any level of data.
• The roadmap is based on your quality goals, the amount of data available in the foundation engine, and the amount of data you can provide.
• Quality expectations are set from the outset.
• Asia Online performs the majority of the tasks; many are fully automated.

  49. The Data Path To Higher Quality: Entry Points. You can start your custom engine with just monolingual data and improve it over time as more data becomes available; data can come from post-editing feedback on the initial custom engine, so quality constantly improves.
• Entry point 1 (highest quality, least editing): high volume, high quality translation memories; rich glossaries; large amounts of high quality monolingual data
• Entry point 2: some high quality translation memories; some high quality monolingual data; glossaries
• Entry point 3: limited high quality translation memories; some high quality monolingual data; glossaries
• Entry point 4: limited high quality monolingual data; glossaries
• Entry point 5 (lower starting quality, more editing): limited high quality monolingual data
