
Linguistically Targeted Test Suites


Presentation Transcript


  1. Linguistically Targeted Test Suites November 2, 2012 Lori Levin Jason Baldridge Chris Dyer Vijay John Kyle Jerro

  2. Linguistic Core evaluation for Linguistic Core MT • Corpus of naturally occurring sentences in Kinyarwanda and Malagasy • Sentences are annotated with tags showing constructions of interest (relative clauses, passives, etc.; a sketch of one annotated item follows) • Example tags: conditional, relative clause, headless relative clause, VOS, voice alternation, proximity, adjectival predicate • English: To really increase farmers’ representation in national politics, it is not enough to increase the number of delegates elected by farmers. • Malagasy: Tsy ampy ny mampitombo ny isan'ny solontena fidian'ny tantsaha raha tiana ny hampitomboana ny solontenan'ny tantsaha eo amin'ny sehatra nasionaly.
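As a concrete sketch of the annotation, here is how one such test item might be stored and filtered by construction tag in Python. The field names and tag spellings are hypothetical, not the project's actual scheme.

```python
# A minimal sketch of one annotated test item; field names and tag
# spellings are illustrative, not the project's actual annotation scheme.
test_item = {
    "id": "mlg-0001",
    "source": ("Tsy ampy ny mampitombo ny isan'ny solontena fidian'ny "
               "tantsaha raha tiana ny hampitomboana ny solontenan'ny "
               "tantsaha eo amin'ny sehatra nasionaly."),
    "reference": ("To really increase farmers' representation in national "
                  "politics, it is not enough to increase the number of "
                  "delegates elected by farmers."),
    "tags": ["conditional", "relative_clause", "headless_relative_clause",
             "VOS", "voice_alternation", "proximity", "adjectival_predicate"],
}

def items_with_tag(corpus, tag):
    """Select every test item annotated with a given construction tag."""
    return [item for item in corpus if tag in item["tags"]]

print(len(items_with_tag([test_item], "relative_clause")))  # -> 1
```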

  3. Lexical similarity is not always a good measure of translation quality • Good translations get low scores when higher-order n-grams don’t match (see the sketch below) • Bad translations persist when function words are undervalued: errors in tense, definiteness, and negation persist • Lack of error analysis: well-understood constructions like relative clauses and passive voice are not modelled
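To see why reordering hurts, here is a minimal sketch of modified n-gram precision, the per-order quantity BLEU combines: a fluent paraphrase keeps most of its unigram credit but loses most of its 3- and 4-gram credit. The example sentences are invented for illustration.

```python
from collections import Counter

def ngram_precision(hyp, ref, n):
    """Modified n-gram precision: clipped n-gram overlap over hypothesis n-grams."""
    hyp_ngrams = Counter(zip(*(hyp[i:] for i in range(n))))
    ref_ngrams = Counter(zip(*(ref[i:] for i in range(n))))
    overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return overlap / max(1, sum(hyp_ngrams.values()))

ref = "several rockets and mortar shells fell in southern israel".split()
hyp = "in southern israel several rockets and mortar shells landed".split()

# Precision decays sharply with n even though the wording barely differs:
for n in range(1, 5):
    print(n, round(ngram_precision(hyp, ref, n), 2))  # 0.89, 0.75, 0.57, 0.33
```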

  4. Underrating good translations From Giménez and Màrquez, 2010, page 212. HYP: On Tuesday several missiles and mortar shells fell in southern Israel, but there were no casualties. R1: Several Qassam rockets and mortar shells were fired on southern Israel today Tuesday without victims. R2: Several Qassam rockets and mortars hit southern Israel today without causing any casualties. R3: A number of Quassam rockets and Howitzer missiles fell over southern Israel today, Tuesday, without causing any casualties. R4: Several Qassam rockets and mortar shells fell today, Tuesday on southern Israel without causing any victim. R5: Several Qassam rockets and mortar shells fell today, Tuesday, in southern Israel without causing any casualties. Acceptable to human translators, but a low BLEU score because there are no higher-order n-gram matches.
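For reference, here is a sketch of how one might reproduce this kind of sentence-level, multi-reference comparison with NLTK's BLEU implementation; exact numbers depend on tokenization and smoothing choices, so treat the setup as illustrative.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

hyp = ("on tuesday several missiles and mortar shells fell in southern "
       "israel , but there were no casualties .").split()
refs = [r.lower().split() for r in [
    "Several Qassam rockets and mortar shells were fired on southern Israel today Tuesday without victims .",
    "Several Qassam rockets and mortars hit southern Israel today without causing any casualties .",
    "A number of Quassam rockets and Howitzer missiles fell over southern Israel today , Tuesday , without causing any casualties .",
    "Several Qassam rockets and mortar shells fell today , Tuesday on southern Israel without causing any victim .",
    "Several Qassam rockets and mortar shells fell today , Tuesday , in southern Israel without causing any casualties .",
]]

# Smoothing avoids a hard zero when no 4-gram matches any reference.
smooth = SmoothingFunction().method1
print(sentence_bleu(refs, hyp, smoothing_function=smooth))
```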

  5. Underrating good translations • From our Malagasy-English system: • Low BLEU score: 0.0149826 • HYP: many held for many months but have no right to a lawyer . • REF: many got arrested for months without any right to have access to any lawyer . • High BLEU score: 0.510864 • HYP: in a long post , called for freedom for other members of the committee for zon'oombelona i koohyargoodarzi . • REF: koohyargoodarzi in a long post asked for freedom for other members of the committee of human rights .
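One symptom the high-BLEU hypothesis shows is untranslated source-language material (compare its "zon'oombelona" with the reference's "human rights"). A crude, illustrative check flags output tokens outside a target-language vocabulary; the vocabulary below is a stand-in, and a real check would also whitelist proper names like "koohyargoodarzi".

```python
# Stand-in English vocabulary; a real check would use a large wordlist
# plus a whitelist of proper names.
english_vocab = {"in", "a", "long", "post", ",", "called", "for", "freedom",
                 "other", "members", "of", "the", "committee", "i", "."}

hyp = ("in a long post , called for freedom for other members of the "
       "committee for zon'oombelona i koohyargoodarzi .").split()

# Tokens not in the vocabulary: untranslated words and unlisted names.
unknown = [tok for tok in hyp if tok not in english_vocab]
print(unknown)  # -> ["zon'oombelona", 'koohyargoodarzi']
```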

  6. Overrating bad translations: lack of focus on function words • Google Translate, October 31, 2012 • Chinese to English: lost tense and missed preposition • I saw the person you talked to • 我看到了你交談的人 • I see the person you are talking • English to Japanese: trouble with the negative determiner “no” • No students bought books. • いかなる生徒は本を買った。 • Any student bought a book.
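A linguistically targeted suite can test exactly this failure mode. Below is a minimal sketch that assumes annotation tells us the source is negated and checks the (back-)translated English output for a negation cue; the cue list is illustrative, not exhaustive.

```python
# Illustrative negation cues; not an exhaustive list.
NEGATION_CUES = {"no", "not", "n't", "none", "never", "nothing"}

def preserves_negation(source_is_negated, hypothesis):
    """A negated source should yield an output containing some negation cue."""
    if not source_is_negated:
        return True
    return any(tok in NEGATION_CUES for tok in hypothesis.lower().split())

# "No students bought books." came back as "Any student bought a book.":
print(preserves_negation(True, "Any student bought a book ."))  # -> False
```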

  7. Not identifying the source of errors • In mature MT systems, many systematic errors occur in well-understood linguistic constructions: • Relative clauses (non-subject gaps) • Google Translate, October 31, 2012 • I saw the person you gave a book to. • 私はあなたに本をくれた人を見た。 • I saw a man who gave me the book for you.
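One way a targeted suite can localize such errors is to pair each construction with probe sentences and catalogued bad outputs that signal the error pattern. A minimal sketch; the structure and tag name are hypothetical.

```python
# An illustrative targeted test case for relative clauses with
# non-subject gaps; the structure and tag name are hypothetical.
test_case = {
    "construction": "relative_clause_nonsubject_gap",
    "source": "I saw the person you gave a book to.",
    # Back-translations showing the gap's grammatical role was mis-assigned:
    "known_bad": ["I saw a man who gave me the book for you."],
}

def flags_known_error(case, back_translation):
    """True if the output matches a catalogued error pattern for this case."""
    return back_translation.strip() in case["known_bad"]

print(flags_known_error(test_case, "I saw a man who gave me the book for you."))
```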

  8. Linguistic Evaluation • Evaluation based on syntactic or semantic roles is not reliable in the early stages of development, when the output cannot be parsed well. • Och et al. 2003; Giménez and Màrquez 2010

  9. Early Stage Linguistic Core Evaluation • How well are we translating specific constructions? • Preliminary list of constructions of interest in Kinyarwanda and Malagasy: • Relative clauses • Passives and other non-active voices • Clefts and focus constructions • Conditional sentences • Comparatives • VOS word order • Causatives • Applicatives
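Given such a tag inventory, a first sanity check is how many test sentences cover each construction. A minimal sketch over a toy annotated corpus; IDs and tags are invented.

```python
from collections import Counter

# Toy annotated corpus; IDs and tags are invented for illustration.
corpus = [
    {"id": "kin-0001", "tags": ["relative_clause", "applicative"]},
    {"id": "mlg-0002", "tags": ["VOS", "passive", "relative_clause"]},
    {"id": "mlg-0003", "tags": ["cleft", "conditional"]},
]

# Tally how many test sentences exercise each construction.
coverage = Counter(tag for item in corpus for tag in item["tags"])
for tag, count in coverage.most_common():
    print(f"{tag}: {count} test sentences")
```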

  10. Early stage linguistically targeted evaluation • Automatic measures of lexical similarity • Which constructions correlate with low scores? • Error analysis conducted by human system developers
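A sketch of the first step: group sentence-level scores by construction tag and report the mean, so constructions that correlate with low scores surface first. The scores and tags below are invented.

```python
from collections import defaultdict
from statistics import mean

# Invented sentence-level scores, tagged with the constructions each
# sentence exhibits.
scored = [
    {"tags": ["relative_clause"], "bleu": 0.08},
    {"tags": ["VOS", "passive"], "bleu": 0.12},
    {"tags": ["conditional"], "bleu": 0.31},
    {"tags": ["relative_clause", "cleft"], "bleu": 0.05},
]

by_tag = defaultdict(list)
for item in scored:
    for tag in item["tags"]:
        by_tag[tag].append(item["bleu"])

# Lowest-scoring constructions first: candidates for human error analysis.
for tag, scores in sorted(by_tag.items(), key=lambda kv: mean(kv[1])):
    print(f"{tag}: mean BLEU {mean(scores):.2f} over {len(scores)} sentences")
```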

  11. Plans • More constructions • Possessives, rates, questions, tense, mood, aspect, etc. • Evaluation metrics based on linguistic structure • Such as lexical similarity of syntactic and semantic functions
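As a sketch of what "lexical similarity of syntactic and semantic functions" could look like, compare head words per grammatical relation rather than over the raw string. The (relation, head word) pairs would come from a parser in practice; they are hand-written here for illustration.

```python
def role_overlap(hyp_roles, ref_roles):
    """Fraction of reference (relation, head word) pairs the hypothesis recovers."""
    matched = sum(1 for pair in ref_roles if pair in hyp_roles)
    return matched / max(1, len(ref_roles))

# Hand-written stand-ins for parser output on a reference and a hypothesis.
ref_roles = {("subj", "rockets"), ("verb", "fell"), ("loc", "israel")}
hyp_roles = {("subj", "missiles"), ("verb", "fell"), ("loc", "israel")}

print(round(role_overlap(hyp_roles, ref_roles), 2))  # -> 0.67 (2 of 3 roles)
```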
