
Feasibility of Human-in-the-loop Minimum Error Rate Training

Omar F. Zaidan and Chris Callison-Burch. The Center for Language and Speech Processing, Johns Hopkins University. EMNLP 2009, Singapore. Thursday August 6th, 2009. {ozaidan|ccb}@cs.jhu.edu


Presentation Transcript


  1. Feasibility of Human-in-the-loop Minimum Error Rate Training Omar F. Zaidan Chris Callison-Burch The Center for Language and Speech Processing Johns Hopkins University EMNLP 2009 – Singapore Thursday August 6th, 2009 {ozaidan|ccb}@cs.jhu.edu

  2. CCB ’09:

  3. CCB ’09: quixotic things like human-in-the-loop minimum error rate training. quixotic: foolishly impractical, especially in the pursuit of ideals; especially: marked by rash lofty romantic ideas or extravagantly chivalrous action.

  4. Log-linear MT in One Slide • MT systems rely on several models. • A candidate c is represented as a feature vector: h(c) = ⟨h1(c), …, hM(c)⟩. • Corresponding weight vector: λ = ⟨λ1, …, λM⟩. • Each candidate is assigned a score: score(c) = λ · h(c) = Σm λm hm(c). • System selects the highest-scoring translation: ĉ = argmaxc score(c).
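The scoring rule on this slide can be sketched in a few lines of Python. This is a toy illustration, not the authors' system; the feature names and values are invented:

```python
# Log-linear candidate scoring: score(c) = sum_m lambda_m * h_m(c),
# then pick the argmax over candidates.

def score(features, weights):
    """Inner product of feature vector h(c) and weight vector lambda."""
    return sum(weights[name] * value for name, value in features.items())

def select_best(candidates, weights):
    """Return the highest-scoring candidate translation."""
    return max(candidates, key=lambda c: score(c["features"], weights))

# Invented weights and features for two toy candidates.
weights = {"lm": 0.5, "tm": 0.3, "word_penalty": -0.2}

candidates = [
    {"text": "the patient was isolated .",
     "features": {"lm": -2.1, "tm": -1.0, "word_penalty": 5}},
    {"text": "the patient isolated .",
     "features": {"lm": -3.5, "tm": -0.8, "word_penalty": 4}},
]

best = select_best(candidates, weights)
```

MERT's job, in these terms, is choosing the `weights` vector; decoding with any fixed `weights` is just this argmax.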

  5. Minimum Error Rate Training • Och (2003): the weight vector should be chosen by optimizing directly toward the evaluation metric of interest (the MERT phase). • But the error surface is ugly. • Och suggests an efficient line optimization method…

  6.–16. Visualizing Och’s Method [a sequence of animation slides; the figures are not preserved in this transcript] We want to plot the metric score as the weight vector moves along a line in parameter space. The plot is assembled from TER-like sufficient statistics, which makes the line optimization fast with an automatic metric (“Fast!”). Whether it stays fast with a human-based metric is the question (“Fast?”).

  17. BLEU & MERT • The metric most often optimized is BLEU: BLEU = BP · exp(Σn wn log pn). • Why BLEU? • It is usually the reported metric, • it has been shown to correlate well with human judgment, and • it can be computed efficiently.

  18. Problems with BLEU-based MERT • General critiques of BLEU • Chiang et al. (2008): weaknesses in BLEU. • Callison-Burch et al. (2006): not always appropriate to use BLEU to compare systems. • Metric disparity • Actual evaluations have a human component (e.g. GALE uses H-TER). • What is the alternative? H-TER-based MERT?

  19. H-TER-based MERT? • In theory, MERT is applicable to any metric. • In practice, scoring 1000’s of candidate translations with H-TER is expensive. • H-TER cost estimate: • Assume a sentence takes 10 seconds to post-edit, at a cost of $0.10. • 100 candidates for each of 1,000 source sentences → 35 work days and $10,000 per iteration(!). • vs. BLEU: minutes per iteration (and free).
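A back-of-the-envelope check of the cost estimate above, using the slide's own assumptions (10 seconds and $0.10 per post-edit, 8-hour work days):

```python
# Sanity-check the H-TER cost estimate from the slide.
seconds_per_postedit = 10      # slide's assumption
cents_per_postedit = 10        # $0.10, kept in cents for exact arithmetic
n_candidates = 100             # candidates per source sentence
n_sentences = 1000

total_postedits = n_candidates * n_sentences            # 100,000 post-edits
total_hours = total_postedits * seconds_per_postedit / 3600
work_days = total_hours / 8                             # 8-hour work days
total_dollars = total_postedits * cents_per_postedit / 100

print(round(work_days), total_dollars)   # about 35 days and $10,000
```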

  20. A Human-Based Automatic Metric • We suggest a metric that is: • viable to be used in MERT, yet • based on human judgment. • Viability: relies on a prebuilt database; no human involvement during MERT. • Human-based: the database is a repository of human judgments.

  21. Our Metric: RYPT • Main idea: reward syntactic constituents in source that are aligned to “acceptable” substrings in candidate translation. • When scoring a candidate: • Obtain parse tree for source sentence. • Align source words to candidate words. • Count number of subtrees translated in an “acceptable” manner. • RYPT = Ratio of Yes nodes in the Parse Tree.
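The RYPT definition above reduces to a single traversal of the labeled source parse tree. A minimal sketch, with an invented tree encoding (nested dicts) rather than the authors' data structures:

```python
# RYPT = ratio of YES-labeled nodes among all nodes in the source parse tree.

def count_labels(node):
    """Return (yes, total) node counts over the subtree rooted at node."""
    yes = 1 if node["label"] == "Y" else 0
    total = 1
    for child in node.get("children", []):
        c_yes, c_total = count_labels(child)
        yes += c_yes
        total += c_total
    return yes, total

def rypt(root):
    yes, total = count_labels(root)
    return yes / total

# Toy tree: 4 of its 5 nodes were judged acceptable ("Y").
tree = {"label": "Y", "children": [
    {"label": "Y", "children": [{"label": "Y"}, {"label": "N"}]},
    {"label": "Y"},
]}
```

With the toy tree, `rypt(tree)` is 4/5 = 0.8.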

  22.–26. RYPT (Ratio of Y in Parse Tree) [a sequence of animation slides showing the source parse tree, the source sentence, and the candidate translation to be scored; figures not preserved in this transcript] A label Y on a node indicates, for example, that “forecasts” was deemed an acceptable translation of “prognosen”.

  27. Is RYPT Good? • Is RYPT acceptable? • Must show RYPT is a reasonable substitute for human judgment. • Is RYPT feasible? (next…) • Must show that collecting the necessary judgments is efficient and affordable.

  28. Feasibility: Reusing Judgments • For each source sentence, we build a database, where each entry is a tuple: <source substring, candidate substring, judgment>, e.g. <der patient, the patient, YES>. • A judgment is reused across candidates: der patient wurde isoliert . → the patient was isolated . / the patient isolated . / the patient was in isolation . / the patient has been isolated .

  29. Feasibility: Reusing Judgments • Likewise for a negative judgment, e.g. <der patient, of the patient, NO>. • A judgment is reused across candidates: der patient wurde isoliert . → of the patient was isolated . / of the patient isolated . / of the patient was in isolation . / of the patient has been isolated .
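The reuse idea in the last two slides amounts to memoizing judgments keyed on (source substring, candidate substring) pairs. A sketch, with invented function names:

```python
# Judgment database sketch: one human label per
# (source substring, candidate substring) pair, reused across every
# candidate translation that contains the same substring.

judgments = {}

def store(source_sub, cand_sub, label):
    judgments[(source_sub, cand_sub)] = label

def lookup(source_sub, cand_sub):
    """Return a cached judgment, or None if a worker must be asked."""
    return judgments.get((source_sub, cand_sub))

store("der patient", "the patient", "YES")

cached = lookup("der patient", "the patient")      # reused, no new query
missing = lookup("der patient", "of the patient")  # None: new query needed
```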

  30. Feasibility: Label Percolation • Minimize label collection even further by percolating labels through the parse tree: • If a node is labeled NO, its ancestors are likely labeled NO → percolate NO up the tree. • If a node is labeled YES, its descendants are likely labeled YES → percolate YES down the tree.
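The two percolation heuristics can be sketched as a pair of tree walks. The nested-dict tree encoding is invented for illustration; the authors' implementation details are in the paper:

```python
# Percolation sketch: a YES spreads down to descendants,
# a NO spreads up to ancestors.

def percolate_yes_down(node):
    """If a node is labeled YES, label all of its descendants YES."""
    if node.get("label") == "YES":
        for child in node.get("children", []):
            child["label"] = "YES"
    for child in node.get("children", []):
        percolate_yes_down(child)

def percolate_no_up(node):
    """If any node in the subtree is labeled NO, label its ancestors NO.
    Returns True when this subtree forces NO on the nodes above it."""
    forces_no = node.get("label") == "NO"
    for child in node.get("children", []):
        if percolate_no_up(child):
            forces_no = True
    if forces_no:
        node["label"] = "NO"
    return forces_no
```

Every label obtained this way is one fewer query that has to be sent to a human annotator.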

  31. Maximizing Label Percolation • Queries are performed in batch mode. • For maximum percolation, queries should avoid overlapping substrings. • One extreme: select the root node (a YES there would percolate everywhere, but that never happens…). • Other extreme: select all preterminals (too much focus on individual words; no percolation).

  32.–36. Query Selection [a sequence of animation slides; figures not preserved in this transcript] Middle ground: select a frontier node set, with each selected node covering at most some number of source words (maxLen).
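The frontier-node selection described above can be sketched as a top-down walk: keep the highest nodes whose source span fits within maxLen, and split larger nodes into their children. The `span_len` field on each node is an invented encoding of how many source words the constituent covers:

```python
# Frontier query selection sketch: the selected nodes partition the
# sentence into the largest constituents of at most max_len source words.

def select_frontier(node, max_len):
    if node["span_len"] <= max_len:
        return [node]            # query this whole constituent at once
    selected = []
    for child in node.get("children", []):
        selected.extend(select_frontier(child, max_len))
    return selected

# Toy binary tree over a 4-word sentence.
tree = {"span_len": 4, "children": [
    {"span_len": 2, "children": [{"span_len": 1}, {"span_len": 1}]},
    {"span_len": 2, "children": [{"span_len": 1}, {"span_len": 1}]},
]}
```

With `max_len = 4` the whole sentence is one query (the root extreme); with `max_len = 1` every preterminal is queried (the other extreme); `max_len = 2` is the middle ground with two queries.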

  37. OK, so how do you obtain these labels?

  38. Amazon Mechanical Turk • We use Amazon Mechanical Turk (AMT) to collect judgment labels. • AMT: a virtual marketplace that allows “requesters” to create and post tasks to be completed by “workers” around the world. • The requester provides an HTML template and a CSV database. • AMT creates individual tasks for the workers. • Task = Human Intelligence Task = HIT

  39. HIT Example [screenshot not preserved] Source: prozent. Reference: percent. Candidate translations: %, per cent.

  40. HIT Example [screenshot not preserved] Source: des zentralen statistischen amtes. Reference: statistics office data. Candidate translations: from the central statistical office, from the central statistics office, in the central statistical office, in the central statistics office, of central statistical office, of central statistics office, of the central statistical office, of the central statistics office.

  41. Data Summary • 3,873 HITs created, each with 3.4 judgments on average → 13k labels. • 115 distinct workers put in 30.8 hours. • One label per 8.4 seconds (426 labels/hr). • Hourly ‘wage’: $1.95. • Cost: $21.43 Amazon fees + $53.47 wages + $6.54 bonuses = $81.44 → 161 labels per $.
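The slide's throughput and cost figures are internally consistent, as a quick arithmetic check shows:

```python
# Rough consistency check of the data-summary numbers on the slide.
labels = 3873 * 3.4                 # HITs x avg. judgments: ~13k labels
hours = 30.8                        # total worker time
total_cost = 21.43 + 53.47 + 6.54   # Amazon fees + wages + bonuses

labels_per_hour = labels / hours             # ~427 labels/hr
seconds_per_label = 3600 / labels_per_hour   # ~8.4 s per label
labels_per_dollar = labels / total_cost      # ~161 labels per dollar
hourly_wage = (53.47 + 6.54) / hours         # ~$1.95 paid per worker-hour
```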

  42. Is RYPT Good? • Is RYPT acceptable? (next…) • Must show RYPT is a reasonable substitute for human judgment. • Is RYPT feasible? Yes! • Must show that collecting the necessary judgments is efficient and affordable.

  43. Is RYPT Acceptable? • Is RYPT a reasonable alternative to human judgment? • Our experiment: compare the predictive power of RYPT vs. BLEU. • Compare the top-1 candidate by BLEU score vs. the top-1 candidate by RYPT score. • Which candidate looks better to a human?

  44. RYPT vs. BLEU [figure not preserved: the candidate list (cand 1, cand 2, …, cand 7, …) shown ranked by RYPT on one side and by BLEU on the other]

  45. RYPT vs. BLEU [figure not preserved] RYPT’s choice (cand 5) vs. BLEU’s choice (cand 3). • Which one would be preferred by a human? • Ask a Turker! • Actually, ask 3 Turkers… • 3 judgments × 250 sentence pairs = 750 judgments.

  46. RYPT vs. BLEU • RYPT’s choice is preferred 46.1% of the time, vs. 36.0% for BLEU’s choice. • Majority vote breakdown:

  47. RYPT vs. BLEU • RYPT’s choice is preferred 46.1% of the time, vs. 36.0% for BLEU’s choice. • Majority vote breakdown: majority vote picks RYPT’s choice 48.0%, neither 16.8%, majority vote picks BLEU’s choice 35.2%. • Majority vote strongly prefers RYPT’s choice 24.0%, strongly prefers BLEU’s choice 13.2% (strong preference for X = no votes for Y).

  48. BLEU’s Inherent Advantage • When comparing candidate translations, the worker was shown the references. • BLEU’s choice, by definition, will have high overlap with the reference. • An annotator might judge BLEU’s choice to be ‘better’ because it ‘looks’ like the reference. • With references shown: RYPT’s choice 46.1%, neither 17.9%, BLEU’s choice 36.0%. • With no references shown (and restricted to workers in Germany): RYPT’s choice 45.2%, neither 25.6%, BLEU’s choice 29.2%.

  49. See Paper for… • Source-candidate alignment method, which takes advantage of derivation trees given by Joshua (see 3.1). • Percolation coverage and accuracy, and effect of maxLen (see 5.1). • Related work (see 6). • Nießen et al. (2000): DB of judgments. • WMT Workshops: manual evaluation; metric correlation with human judgment. • Snow et al. (2008): AMT is “fast and cheap.”

  50. Future Work • This was a pilot study… • Complete MERT run (already in progress). • Beyond a single iteration. • Using AMT’s API. • Probabilistic approach to labeling nodes. • Treat a node label as a random variable. • Existing labels = observed, others inferred. Stay tuned for our next paper 
