
Recognizing Textual Entailment with LCC’s Groundhog System


Presentation Transcript


  1. Recognizing Textual Entailment with LCC’s Groundhog System Andrew Hickl, Jeremy Bensley, John Williams, Kirk Roberts, Bryan Rink and Ying Shi  groundhog@languagecomputer.com

  2. Introduction • We were grateful for the opportunity to participate in this year’s PASCAL RTE-2 Challenge • Our first exposure to RTE came as part of the Fall 2005 AQUAINT “Knowledge Base” Evaluation • Included PASCAL veterans: University of Colorado at Boulder, University of Illinois at Urbana-Champaign, Stanford University, University of Texas at Dallas, and LCC (Moldovan) • While this year’s evaluation represented our first foray into RTE, our group has worked extensively on the types of textual inference that are crucial for: • Question Answering • Information Extraction • Multi-Document Summarization • Named Entity Recognition • Temporal and Spatial Normalization • Semantic Parsing

  3. Outline of Today’s Talk • Introduction • Groundhog Overview • Preprocessing • New Sources of Training Data • Performing Lexical Alignment • Paraphrase Acquisition • Feature Extraction • Entailment Classification • Evaluation • Conclusions (2 Feb: Groundhog Day, the RTE-2 deadline)

  4. Architecture of the Groundhog System • [Diagram: t-h pairs from the RTE Dev and Test sets, together with training corpora of positive and negative examples gathered from the WWW, flow through Preprocessing (Named Entity Recognition, Name Aliasing, Name Coreference, Temporal Normalization, Temporal Ordering, Syntactic Parsing, Semantic Parsing, Semantic Annotation), Lexical Alignment, Paraphrase Acquisition, and Feature Extraction, feeding an Entailment Classification module that outputs YES or NO.]

  5. A Motivating Example • Example 139 (Task=SUM, Judgment=YES, LCC=YES, Conf = +0.8875) • Text: [The Bills]Arg0 now appear ready to hand [the reins]Arg1 over to [one]Arg2 of their two top picks from a year ago in quarterback J.P. Losman, who missed most of last season with a broken leg. • Hypothesis: The Bills plan to give the starting job to J.P. Losman. • Questions we need to answer: • What are the “important” portions that should be considered by a system? Can lexical alignment be used to identify these strings? • How do we determine that the same meaning is being conveyed by phrases that may not necessarily be lexically related? Can phrase-level alternations (“paraphrases”) help? • How do we deal with the complexity that reduces the effectiveness of syntactic and semantic parsers? Annotations? Rules? Compression?

  6. Preprocessing • Groundhog starts the process of RTE by annotating t-h pairs with a wide range of lexicosemantic information: • Named Entity Recognition • LCC’s CiceroLite NER software is used to categorize over 150 different types of named entities: [The Bills]SPORTS_ORG [now]TIMEX appear ready to hand the reins over to one of their two top picks from [a year ago]TIMEX in quarterback [J.P. Losman]PERSON, who missed most of [last season]TIMEX with [a broken leg]BODY_PART. [The Bills]SPORTS_ORG plan to give the starting job to [J.P. Losman]PERSON. • Name Aliasing and Coreference • Lexica and grammars found in CiceroLite are used to identify coreferential names and to identify potential antecedents for pronouns: [The Bills]ID=01 now appear ready to hand the reins over to [one of [their]ID=01 two top picks]ID=02 from a year ago in [quarterback]ID=02 [J.P. Losman]ID=02, [who]ID=02 missed most of last season with a broken leg. [The Bills]ID=01 plan to give the starting job to [J.P. Losman]ID=02.
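To make the shape of this preprocessing output concrete, here is a minimal sketch (plain Python, not LCC's CiceroLite) of how per-chunk annotations such as named-entity class and coreference chain ID might be stored; the Chunk structure and its field names are illustrative assumptions, not the system's actual data model.

```python
# A minimal, illustrative container for preprocessing output: each chunk
# carries its surface form, an optional named-entity class, and an optional
# coreference chain ID shared by coreferential mentions.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Chunk:
    text: str                        # surface string of the chunk
    ne_class: Optional[str] = None   # e.g. "SPORTS_ORG", "PERSON", "TIMEX"
    coref_id: Optional[int] = None   # chain ID for coreferential mentions

# Hand-annotated hypothesis from Example 139
hypothesis: List[Chunk] = [
    Chunk("The Bills", ne_class="SPORTS_ORG", coref_id=1),
    Chunk("plan to give"),
    Chunk("the starting job"),
    Chunk("to"),
    Chunk("J.P. Losman", ne_class="PERSON", coref_id=2),
]

for chunk in hypothesis:
    print(chunk)
```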

  7. Preprocessing • Temporal Normalization and Ordering • Heuristics found in LCC’s TASER temporal normalization system are then used to normalize time expressions to their ISO 8601 values and to compute the relative order of time expressions within a context: The Bills [now]2006/01/01 appear ready to hand the reins over to one of their two top picks from [a year ago]2005/01/01-2005/12/31 in quarterback J.P. Losman, who missed most of [last season]2005/01/01-2005/12/31 with a broken leg. The Bills plan to give the starting job to J.P. Losman. • POS Tagging and Syntactic Parsing • We use LCC’s own implementation of the Brill POS tagger and the Collins Parser in order to syntactically parse sentences and to identify phrase chunks, phrase heads, relative clauses, appositives, and parentheticals.
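As a rough illustration of what temporal normalization does here (this is not LCC's TASER; the rules and the assumed document date are toy values chosen to reproduce the annotations on the slide), a few relative expressions can be anchored to ISO 8601-style values:

```python
# Toy normalization of relative time expressions against a document date.
from datetime import date

DOC_DATE = date(2006, 1, 1)  # assumed publication date for the example

def normalize(expr, doc_date=DOC_DATE):
    """Map a handful of relative expressions to ISO dates or ranges."""
    prev_year = doc_date.year - 1
    rules = {
        "now": doc_date.isoformat(),
        "a year ago": f"{prev_year}-01-01/{prev_year}-12-31",
        "last season": f"{prev_year}-01-01/{prev_year}-12-31",
    }
    return rules.get(expr.lower(), "UNRESOLVED")

for expression in ["now", "a year ago", "last season", "next spring"]:
    print(f"{expression!r} -> {normalize(expression)}")
```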

  8. Preprocessing • Semantic Parsing • Semantic parsing is performed using a Maximum Entropy-based semantic role labeling system trained on PropBank annotations: The Bills now appear ready to hand the reins over to one of their two top picks from a year ago in quarterback J.P. Losman, who missed most of last season with a broken leg. • [Predicate-argument structure: appear ([The Bills]Arg0, [now]ArgM); hand ([The Bills]Arg0, [the reins]Arg1, [one of their two top picks]Arg2); miss ([who]Arg0, [most of last season]Arg1, [a broken leg]Arg3)]

  9. Preprocessing • Semantic Parsing • Semantic parsing is performed using a Maximum Entropy-based semantic role labeling system trained on PropBank annotations: The Bills plan to give the starting job to J.P. Losman. • [Predicate-argument structure: plan ([The Bills]Arg0); give ([The Bills]Arg0, [the starting job]Arg1, [J.P. Losman]Arg2)]
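A small sketch of how these predicate-argument frames might be represented and compared downstream; the dictionaries simply transcribe the roles shown on the two slides, and the helper function is our own illustration rather than part of the system.

```python
# Predicate-argument frames for Example 139 as plain dictionaries, plus a
# helper that reports which roles are filled by both an aligned text
# predicate and an aligned hypothesis predicate.
text_frames = {
    "hand": {"Arg0": "The Bills", "Arg1": "the reins",
             "Arg2": "one of their two top picks"},
    "miss": {"Arg0": "who", "Arg1": "most of last season",
             "Arg3": "a broken leg"},
}
hypothesis_frames = {
    "plan": {"Arg0": "The Bills"},
    "give": {"Arg0": "The Bills", "Arg1": "the starting job",
             "Arg2": "J.P. Losman"},
}

def shared_roles(text_frame, hyp_frame):
    """Role labels filled in both frames (used later as a match feature)."""
    return sorted(set(text_frame) & set(hyp_frame))

print(shared_roles(text_frames["hand"], hypothesis_frames["give"]))
# ['Arg0', 'Arg1', 'Arg2']
```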

  10. Preprocessing • Semantic Annotation • Heuristics were used to annotate the following semantic information: • Polarity: Predicates and nominals were assigned a negative polarity value when found in the scope of an overt negative marker (no, not, never) or when associated with a negation-denoting verb (refuse). Never before had ski racing [seen]FALSE the likes of Alberto Tomba. Members of Iraq's Governing Council refused to [sign]FALSE an interim constitution. • Factive Verbs: Predicates such as acknowledge, admit, and regret conventionally imply the truth of their complement; complements associated with a list of factive verbs were always assigned a positive polarity value. Both owners and players admit there is [unlikely]TRUE to be much negotiating.

  11. Preprocessing • Semantic Annotation (Continued) • Non-Factive Verbs: We refer to predicates that do not imply the truth of their complements as non-factive verbs. • Predicates found as complements of the following contexts were marked as unresolved: • non-factive speech act verbs (deny, claim) • psych verbs (think, believe) • verbs of uncertainty or likelihood (be uncertain, be likely) • verbs marking intentions or plans (scheme, plot, want) • verbs in conditional contexts (whether, if) Congress approved a different version of the COCOPA Law, which did not include the autonomy clauses, claiming they [were in contradiction]UNRESOLVED with constitutional rights. Defence Minister Robert Hill says a decision would need to be made by February next year, if Australian troops [extend]UNRESOLVED their stay in southern Iraq.
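The truth-value heuristic described on slides 10 and 11 can be sketched roughly as follows; the word lists are tiny illustrative stand-ins for the system's actual lexica, and the function signature is our own.

```python
# Toy truth-value assignment: FALSE under overt negation or negation-denoting
# verbs, TRUE under factives, UNRESOLVED under non-factive contexts, and TRUE
# for plain assertions.
NEGATION = {"no", "not", "never", "refuse", "refused"}
FACTIVE = {"acknowledge", "admit", "regret"}
NON_FACTIVE = {"deny", "claim", "think", "believe",
               "want", "plan", "if", "whether"}

def truth_value(governing_words):
    """Assign a truth value to a predicate given the words that govern it."""
    governors = {w.lower() for w in governing_words}
    if governors & NEGATION:
        return "FALSE"
    if governors & FACTIVE:
        return "TRUE"
    if governors & NON_FACTIVE:
        return "UNRESOLVED"
    return "TRUE"  # asserted predicates are taken to be true by default

print(truth_value(["refused"]))  # FALSE       (refused to [sign])
print(truth_value(["admit"]))    # TRUE        (admit there is [unlikely] ...)
print(truth_value(["claim"]))    # UNRESOLVED  (claiming they [were ...])
print(truth_value(["if"]))       # UNRESOLVED  (if ... troops [extend] ...)
```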

  12. Preprocessing • Semantic Annotation (Continued) • Supplemental Expressions: Following (Huddleston and Pullum 2002), constructions that are known to trigger conventional implicatures – including nominal appositives, epithets/name aliases, as-clauses, and non-restrictive relative clauses – were also extracted from text and appended to the end of each text or hypothesis. Nominal Appositives Shia pilgrims converge on Karbala to mark the death of Hussein, the prophet Muhammad’s grandson, 1300 years ago. Shia pilgrims converge on Karbala to mark the death of Hussein 1300 years ago AND Hussein is the prophet Muhammad’s grandson. Epithets / Name Aliases Ali al-Timimi had previously avoided prosecution, but now the radical Islamic cleric is behind bars in an American prison. Ali al-Timimi had previously avoided prosecution but now the radical Islamic cleric is behind bars... AND Ali al-Timimi is a radical Islamic cleric.

  13. Preprocessing • Semantic Annotation (Continued) • Supplemental Expressions: Following (Huddleston and Pullum 2002), constructions that are known to trigger conventional implicatures – including nominal appositives, epithets/name aliases, as-clauses, and non-restrictive relative clauses – were also extracted from text and appended to the end of each text or hypothesis. As-Clauses The LMI was set up by Mr. Torvalds with John Hall as a non-profit organization to license the use of the word Linux. The LMI was set up by Mr. Torvalds with John Hall as a non-profit organization to license the use of the word Linux AND the LMI is a non-profit organization. Non-Restrictive Relative Clauses The Bills now appear ready to ... quarterback J.P. Losman, who missed most of last season with a broken leg. The Bills now appear ready to ... quarterback J.P. Losman, who missed most of last season with a broken leg AND J.P. Losman missed most of last season...
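As an illustration of the supplemental-expression step, the sketch below handles just one construction type, the nominal appositive, with a deliberately naive regular expression (not the grammar the system actually uses) and appends the implied clause with AND, as on the slides.

```python
# Turn "X, the Y," into the original sentence plus "AND X is the Y."
import re

# Toy pattern: a capitalized name followed by a comma-delimited "the ..." NP.
APPOSITIVE = re.compile(r"(?P<head>(?:[A-Z][\w.\-]+\s?)+), (?P<appos>the [^,]+),")

def add_supplement(sentence):
    match = APPOSITIVE.search(sentence)
    if not match:
        return sentence
    return f"{sentence} AND {match.group('head')} is {match.group('appos')}."

sentence = ("Shia pilgrims converge on Karbala to mark the death of Hussein, "
            "the prophet Muhammad's grandson, 1300 years ago.")
print(add_supplement(sentence))
# ... AND Hussein is the prophet Muhammad's grandson.
```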

  14. Lexical Alignment • We believe that these lexicosemantic annotations – along with the individual forms of the words – can provide us with the input needed to identify corresponding tokens, chunks, or collocations from the text and the hypothesis. • hand ↔ give (Alignment Probability: 0.74; Unresolved, WN Similar) • The Bills ↔ The Bills (Alignment Probability: 0.94; Arg0, ID=01, Organization) • J.P. Losman ↔ J.P. Losman (Alignment Probability: 0.91; ID=02, Person / Arg2, ID=02, Person) • the reins ↔ the starting job (Alignment Probability: 0.49; Arg1)

  15. Lexical Alignment • In Groundhog, we used a Maximum Entropy classifier to compute the probability that an element selected from a text corresponds to – or can be aligned with – an element selected from a hypothesis. • Three-step Process: • First, sentences were decomposed into a set of “alignable chunks” that were derived from the output of a chunk parser and a collocation detection system. • Next, chunks from the text (Ct) and hypothesis (Ch) were assembled into an alignment matrix (Ct × Ch). • Finally, each pair of chunks was then submitted to a classifier which output the probability that the pair represented a positive example of alignment.
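A minimal sketch of this three-step loop, assuming pre-chunked input and substituting a simple token-overlap score for the trained Maximum Entropy classifier:

```python
# Build the Ct x Ch alignment matrix and score every chunk pair.
from itertools import product

def align_prob(text_chunk, hyp_chunk):
    """Stand-in scorer: Jaccard overlap of lowercased tokens."""
    a = set(text_chunk.lower().split())
    b = set(hyp_chunk.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

text_chunks = ["The Bills", "appear ready to hand", "the reins",
               "to", "J.P. Losman"]
hyp_chunks = ["The Bills", "plan to give", "the starting job",
              "to", "J.P. Losman"]

matrix = {(ct, ch): align_prob(ct, ch)
          for ct, ch in product(text_chunks, hyp_chunks)}

# Report the best-scoring hypothesis chunk for each text chunk.
for ct in text_chunks:
    best_ch, prob = max(((ch, matrix[(ct, ch)]) for ch in hyp_chunks),
                        key=lambda pair: pair[1])
    print(f"{ct!r:25} -> {best_ch!r:20} {prob:.2f}")
```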

  16. Lexical Alignment • Four sets of features were used: • Statistical Features: • Cosine Similarity • (Glickman and Dagan 2005)’s Lexical Entailment Probability • Lexicosemantic Features: • WordNet Similarity (Pedersen et al. 2004) • WordNet Synonymy/Antonymy • Named Entity Features • Alternations • String-based Features • Levenshtein Edit Distance • Morphological Stem Equality • Syntactic Features • Maximal Category • Headedness • Structure of entity NPs (modifiers, PP attachment, NP-NP compounds)
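Three of these features are simple enough to sketch directly (the WordNet- and parse-based features are omitted here); the implementations below are straightforward textbook versions rather than the system's own code.

```python
# Cosine similarity over bag-of-words counts, Levenshtein edit distance,
# and a crude prefix-based "stem equality" test.
from collections import Counter
from math import sqrt

def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = sqrt(sum(v * v for v in va.values()))
    norm_b = sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def levenshtein(a, b):
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def same_stem(a, b, k=4):
    return a.lower()[:k] == b.lower()[:k]  # crude prefix "stemmer"

print(cosine("hand the reins", "give the starting job"))
print(levenshtein("hand", "give"))
print(same_stem("missed", "misses"))
```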

  17. Training the Alignment Classifier • Two developers annotated a held-out set of 10,000 alignment chunk pairs from the RTE-2 Development Set as either positive or negative examples of alignment. • Performance for two different classifiers on a randomly selected set of 1000 examples from the RTE-2 Dev Set is presented below. • While both classifiers performed relatively satisfactorily, F-measure varied significantly (p < 0.05) on different test sets.

  18. Creating New Sources of Training Data • In order to perform more robust alignment, we experimented with two techniques for gathering training data: • Positive Examples: • Following (Burger and Ferro 2005), we created a corpus of 101,329 positive examples of entailment by pairing the headline and first sentence from newswire documents. First Line: Sydney newspapers made a secret bid not to report on the fawning and spending made during the city’s successful bid for the 2000 Olympics, former Olympics Minister Bruce Baird said today. Headline: Papers Said To Protect Sydney Bid • Examples were filtered extensively in order to select only those examples where the headline and the first line both synopsized the content of a document • In an evaluation set of 2500 examples, annotators found 91.8% to be positive examples of “rough” entailment
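A schematic sketch of the headline pairing follows; the document format and the sentence-splitting step are simplified assumptions, and the real corpus was filtered far more aggressively than this.

```python
# Pair a newswire headline with the story's first sentence to form a
# "rough" positive entailment example (first sentence = text,
# headline = hypothesis).
def headline_pairs(documents):
    for doc in documents:
        first_sentence = doc["body"].split(". ")[0].strip() + "."
        yield first_sentence, doc["headline"]

documents = [{
    "headline": "Papers Said To Protect Sydney Bid",
    "body": ("Sydney newspapers made a secret bid not to report on the "
             "fawning and spending made during the city's successful bid "
             "for the 2000 Olympics, former Olympics Minister Bruce Baird "
             "said today. The remaining sentences of the story follow."),
}]

for text, hypothesis in headline_pairs(documents):
    print("TEXT:      ", text)
    print("HYPOTHESIS:", hypothesis)
```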

  19. Creating New Sources of Training Data • Negative Examples: • We gathered 119,113 negative examples of textual entailment by: • Selecting sequential sentences from newswire texts that featured a repeated mention of a named entity (98,062 examples) Text: One player losing a close friend is Japanese pitcher Hideki Irabu, who was befriended by Wells during spring training last year. Hypothesis: Irabu said he would take Wells out to dinner when the Yankees visit Toronto. • Extracting pairs of sentences linked by discourse connectives such as even though, although, otherwise, and in contrast (21,051 examples) Text: According to the professor, present methods of cleaning up oil slicks are extremely costly and are never completely efficient. Hypothesis: [In contrast], Clean Mag has a 1000 percent pollution retrieval rate, is low cost, and can be recycled.
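The discourse-connective case can be sketched as below; the connective list comes straight from the slide, while sentence splitting and the repeated-named-entity filter used for the first case are left out.

```python
# Harvest negative examples from adjacent sentence pairs where the second
# sentence opens with a contrastive discourse connective.
CONNECTIVES = ("even though", "although", "otherwise", "in contrast")

def negative_pairs(sentences):
    for previous, current in zip(sentences, sentences[1:]):
        if current.lower().startswith(CONNECTIVES):
            yield previous, current

sentences = [
    "According to the professor, present methods of cleaning up oil slicks "
    "are extremely costly and are never completely efficient.",
    "In contrast, Clean Mag has a 1000 percent pollution retrieval rate, "
    "is low cost, and can be recycled.",
]

for text, hypothesis in negative_pairs(sentences):
    print("TEXT:      ", text)
    print("HYPOTHESIS:", hypothesis)
```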

  20. Training the Alignment Classifier • For performance reasons, the hillclimber trained on the 10K human-annotated pairs was used to annotate 450K chunk pairs drawn equally from these two corpora. • These annotations were then used to train a final MaxEnt classifier that was used in our final submission. • A comparison of the three alignment classifiers is presented below for the same evaluation set of 1000 examples:
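The two-stage setup amounts to a simple self-training recipe, sketched below with scikit-learn's logistic regression standing in for both the seed hillclimber and the final MaxEnt model, and random vectors standing in for the real chunk-pair features.

```python
# Stage 1: a seed classifier trained on hand-annotated pairs labels a large
# automatically gathered pool. Stage 2: a MaxEnt-style (logistic regression)
# classifier is trained on those silver labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_seed = rng.normal(size=(100, 5))       # stand-in for the 10K annotated pairs
y_seed = (X_seed[:, 0] > 0).astype(int)  # toy gold labels

seed_model = LogisticRegression().fit(X_seed, y_seed)

X_pool = rng.normal(size=(5000, 5))      # stand-in for the 450K chunk pairs
y_silver = seed_model.predict(X_pool)    # automatic labels from the seed model

final_model = LogisticRegression().fit(X_pool, y_silver)
print("agreement with gold seed labels:", final_model.score(X_seed, y_seed))
```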

  21. Paraphrase Acquisition • Groundhog uses techniques derived from automatic paraphrase acquisition (Dolan et al. 2004, Barzilay and Lee 2003, Shinyama et al. 2002) in order to identify phrase-level alternations for each t-h pair. • Output from an alignment classifier can be used to determine a “target region” of high correspondence between a text and a hypothesis: Text: The Bills now appear ready to hand the reins over to one of their two top picks from a year ago in quarterback J.P. Losman, who missed most of last season with a broken leg. Hypothesis: The Bills plan to give the starting job to J.P. Losman. • If paraphrases can be found for the “target regions” of both the text and the hypothesis, we may have strong evidence that the two sentences exist in an entailment relationship.

  22. Paraphrase Acquisition • For example, if a passage (or set of passages) can be found that paraphrases both a text and a hypothesis, those paraphrases can be said to encode the meaning that is common to the t and the h. • Passages anchored by the aligned pair (The Bills, J.P. Losman) that paraphrase the target regions “... appear ready to hand the reins over to ...” and “... plan to give the starting job to ...”: • ... may go with quarterback ... • ... could decide to put their trust in ... • ... might turn the keys of the offense over to ... • However, not all sentences containing both aligned entities will be true paraphrases: • ... benched Bledsoe in favor of ... • ... is molding their QB of the future ... • ... are thinking about cutting ...

  23. Paraphrase Acquisition • Like Barzilay and Lee (2003), our approach focuses on creating clusters of potential paraphrases acquired automatically from the WWW. • Step 1. The two entities with the highest alignment confidence were selected from each t-h pair. • Step 2. Text passages containing both aligned entities (and a context window of m words) were extracted from each original t and h. • Step 3. The top 500 documents containing each pair of aligned entities are retrieved from Google; only the sentences that contain both entities are kept. • Step 4. Text passages containing the aligned entities are extracted from the sentences collected from the WWW. • Step 5. WWW passages and original t-h passages are then clustered using the complete-link clustering algorithm outlined in Barzilay and Lee (2003); clusters with fewer than 10 passages are discarded, even if they include the original t-h passage.
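Step 5 can be sketched with an off-the-shelf complete-link clusterer; the token-overlap distance below is a stand-in for the similarity function actually used, and the 10-passage cluster cutoff is dropped so the toy data produces a visible cluster.

```python
# Complete-link clustering of candidate paraphrase passages.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def tokens(passage):
    return set(passage.lower().split())

passages = [
    "appear ready to hand the reins over to",
    "plan to give the starting job to",
    "might turn the keys of the offense over to",
    "may go with quarterback",
    "benched Bledsoe in favor of",
]

# Condensed pairwise distance matrix: 1 - Jaccard token overlap.
distances = []
for i in range(len(passages)):
    for j in range(i + 1, len(passages)):
        a, b = tokens(passages[i]), tokens(passages[j])
        similarity = len(a & b) / len(a | b) if a | b else 0.0
        distances.append(1.0 - similarity)

Z = linkage(np.array(distances), method="complete")
labels = fcluster(Z, t=0.85, criterion="distance")
for passage, cluster_id in zip(passages, labels):
    print(cluster_id, passage)
# The first three "hand the job to" passages share a cluster; the two
# unrelated passages end up as singletons.
```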

  24. Entailment Classification • As with other approximation-based approaches to RTE (Haghighi et al. 2005, MacCartney et al. 2006), we use a supervised machine learning classifier in order to determine whether an entailment relationship exists for a particular t-h pair. • Experimented with a number of machine learning techniques: • Support Vector Machines (SVMs) • Maximum Entropy • Decision Trees • February 2006: Decision Trees outperformed MaxEnt, SVMs • April 2006: MaxEnt comparable to Decision Trees, SVMs still lag behind

  25. Entailment Classification • Information from the previous three components is used to extract 4 types of features to inform this entailment classifier. • Selected examples of features used: • Alignment Features: • Longest Common Substring: Longest contiguous string common to both t and h • Unaligned Chunk: Number of chunks in h not aligned with chunks in t • Dependency Features: • Entity Role Match: Aligned entities assigned the same role • Entity Near Role Match: Collapsed semantic roles commonly confused by the semantic parser (e.g. Arg1, Arg2 >> Arg1&2; ArgM, etc.) • Predicate Role Match: Roles assigned by aligned predicates • Predicate Role Near Match: Compared collapsed set of roles assigned by aligned predicates
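Two of the alignment-derived features are simple enough to sketch directly; the threshold and the toy alignment probabilities below are illustrative assumptions, not values used by the system.

```python
# Longest contiguous substring shared by t and h, and a count of hypothesis
# chunks whose best alignment probability falls below a threshold.
from difflib import SequenceMatcher

def longest_common_substring(t, h):
    match = SequenceMatcher(None, t, h).find_longest_match(0, len(t), 0, len(h))
    return t[match.a:match.a + match.size]

def unaligned_chunks(hyp_chunks, best_alignment_prob, threshold=0.5):
    return sum(1 for c in hyp_chunks
               if best_alignment_prob.get(c, 0.0) < threshold)

t = "The Bills now appear ready to hand the reins over to J.P. Losman."
h = "The Bills plan to give the starting job to J.P. Losman."
print(repr(longest_common_substring(t, h)))   # ' to J.P. Losman.'

print(unaligned_chunks(
    ["The Bills", "the starting job", "J.P. Losman"],
    {"The Bills": 0.94, "the starting job": 0.49, "J.P. Losman": 0.91}))  # 1
```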

  26. Entailment Classification • Classifier Features (Continued) • Paraphrase Features • Single Paraphrase Match: Paraphrase from a surviving cluster matches either the text or the hypothesis • Did we select the correct entities at alignment? • Are we dealing with something that can be expressed in multiple ways? • Both Unique Paraphrase Match: Paraphrase P1 matches t, while paraphrase P2 matches h; P1 ≠ P2 • Category Match: Paraphrase P1 matches t, while paraphrase P2 matches h; P1 and P2 are found in the same surviving cluster of paraphrases. • Semantic Features • Truth-Value Mismatch: Aligned predicates differ in any truth value (true, false, unresolved) • Polarity Mismatch: Aligned predicates assigned truth values of opposite polarity
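The two semantic features reduce to simple comparisons over the truth values assigned during preprocessing; a tiny sketch, with our own function names, follows.

```python
# Flag aligned predicate pairs whose truth values differ at all, and the
# stricter case where the values are of opposite polarity (TRUE vs. FALSE).
def truth_value_mismatch(text_value, hyp_value):
    return text_value != hyp_value

def polarity_mismatch(text_value, hyp_value):
    return {text_value, hyp_value} == {"TRUE", "FALSE"}

# Example 139: hand/give are both UNRESOLVED, so neither feature fires.
print(truth_value_mismatch("UNRESOLVED", "UNRESOLVED"))  # False
# Example 734: "not recognize" (FALSE) aligned with "took" (TRUE).
print(truth_value_mismatch("FALSE", "TRUE"))             # True
print(polarity_mismatch("FALSE", "TRUE"))                # True
```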

  27. Entailment Classification (Example 139) • Alignment Features: What elements align in the t or h? The Bills ↔ The Bills (Good Alignment, 0.94); hand ↔ give (Passable Alignment, 0.79); the reins ↔ the starting job (Marginal Alignment, 0.49); J.P. Losman ↔ J.P. Losman (Good Alignment, 0.91) • Dependency Features: Are the same dependencies assigned to corresponding entities in the t and h? (hand: Arg0, Arg1, --; give: Arg0, Arg2, Arg1) • Paraphrase Features: Were any paraphrases found that could be paraphrases of portions of the t and the h? (e.g. “... have gone with quarterback ...”, “... has turned the keys of the offense over to ...”) • Semantic Features: Were predicates assigned the same truth values? (hand: unresolved; give: unresolved) → Likely Entailment!

  28. Another Example • Not all examples, however, include as many complementary features as Example 139: • Example 734 (Task=IR, Judgment=NO, LCC=NO, Conf = -0.8344) • Text: In spite of that, the government’s “economic development first” priority did not initially recognize the need for preventative measures to halt pollution, which may have slowed economic growth. • Hypothesis: The government took measures to reduce pollution.

  29. Example 734 • Even though this pair has a number of points of alignment, annotations suggest that there are significant discrepancies between the sentences. [Alignment diagram: chunk pairs such as “the government’s priority” ↔ “the government” (Partial Alignment, non-head, Arg Role Match, NE Category Mismatch; Passable Alignment 0.39), “not recognize” ↔ “took” (POS Alignment, Polarity Mismatch, Non-Synonymous; Poor 0.23), “the need for preventative measures” ↔ “measures” (Partial Alignment, non-head, Arg Role Match; Passable Alignment 0.41), “halt” ↔ “reduce” (Degree, POS Match; Good 0.84), “pollution” ↔ “pollution” (Lemma Match, Arg Role Match; Good 0.93)] → Unlikely Entailment! • In addition, few “paraphrases” could be found that clustered with passages extracted from either the t or the h (e.g. “... not recognize need for measures to halt ...”, “... has allowed companies to get away with ...”, “... is looking for ways to deal with ...”, “... wants to forget about ...”, “... took measures to reduce ...”).

  30. Evaluation: 2006 RTE Performance • Groundhog correctly recognized entailment in 75.38% of examples in this year’s RTE-2 Test Set. • Performance differed markedly across the 4 subtasks: while the system netted 84.5% of the examples in the summarization set, Groundhog correctly categorized only 69.5% of the examples in the question-answering set. • This has something to do with our training data: • The headline corpus features a large number of “sentence compression”-like examples; when Groundhog is trained on a balanced training corpus, performance on the SUM task falls to 79.3%.

  31. Evaluation: Role of Training Data • Training data did play an important role in boosting our overall accuracy on the 2006 Test Set: performance increased from 65.25% to 75.38% when the entire training corpus was used. • Refactoring features has allowed us to obtain some performance gains with smaller training sets, however: our performance when using only the 800 examples from the 2006 Dev Set has increased by 5.25%. • The performance increase appears to be tapering off as the amount of training data increases...

  32. Evaluation: Role of Features in Entailment Classifier • While the best results were obtained by combining all 4 sets of features used in our entailment classifier, the largest gains were observed by adding Paraphrase features: • Single feature sets: Alignment 65.25%, Dependency 62.50%, Paraphrase 65.88%, Semantic 58.00% • Alignment + Dependency: 68.00%; Alignment + Paraphrase: 69.13%; Alignment + Semantic: 66.25% • Alignment + Dependency + Paraphrase: 73.62%; Alignment + Dependency + Semantic: 71.25% • All four feature sets: 75.38%

  33. Conclusions • We have introduced a three-tiered approach for RTE: • Alignment Classifier: Identifies “aligned” constituents using a wide range of lexicosemantic features • Paraphrase Acquisition: Derives phrase-level alternations for passages containing high-confidence aligned entities • Entailment Classifier: Combines lexical, semantic, and syntactic information with phrase-level alternation information in order to make an entailment decision • In addition, we showed that it is possible – by relaxing the notion of strict entailment – to create training corpora that can prove effective in training systems for RTE • 200K+ examples (100K positive, 100K negative)
