Relation Extraction (CSCI-GA.2591)
Ralph Grishman, NYU
ACE Relations
An ACE relation mention connects two entity mentions in the same sentence:
• the CEO of Microsoft → OrgAff:employment(the CEO of Microsoft, Microsoft)
• in the West Bank, a passenger was wounded → Phys:Located(a passenger, the West Bank)
ACE 2005 had 6 types of relations and 18 subtypes
• most papers report on types only
Most relations are local …
• in roughly 70% of relations, the arguments are adjacent or separated by one word
• so chunking is important, but full parsing is not critical
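A relation mention like the ones above can be represented as a small record tying a type and subtype to two argument mentions. A minimal sketch (the class and field names are ours, not from the ACE specification):

```python
from dataclasses import dataclass

@dataclass
class EntityMention:
    text: str
    start: int   # token offsets within the sentence
    end: int

@dataclass
class RelationMention:
    rel_type: str        # e.g. "OrgAff"
    subtype: str         # e.g. "employment"
    arg1: EntityMention
    arg2: EntityMention

# "the CEO of Microsoft" -> OrgAff:employment(the CEO of Microsoft, Microsoft)
m1 = EntityMention("the CEO of Microsoft", 0, 3)
m2 = EntityMention("Microsoft", 3, 3)
r = RelationMention("OrgAff", "employment", m1, m2)
print(r.rel_type, "/", r.subtype)  # → OrgAff / employment
```

Note both arguments live inside one sentence, which is what makes the local-context features discussed below effective.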
Benchmarks
• ACE 2003 / 2004 / 2005 corpora
• generally assuming perfect entity mentions on input
• some work assumes only position (and not semantic type) is given
• SemEval-2010 Task 8
• carefully selected examples of 9 relations (plus Other)
• a classification task
Using MaxEnt
• First description of an ACE relation extractor: the IBM system [Kambhatla ACL 2004]
• Features used:
• words
• entity type
• mention level
• overlap
• dependency tree
• parse tree
• used 2003 ACE data
• F = 55 (perfect mentions); F = 23 (system mentions)
• good system mentions are important
Lots of features
• Singapore system [Zhou et al. ACL 2005] used a very rich feature set, including:
• 11 chunk-based features
• family-relative feature
• 2 country-name features
• 7 dependency-based features
• …
• highly tuned to the ACE task
• F = 68 (relation type); F = 55 (subtype)
• reports a gain of several points F over the IBM system
• used perfect mentions
• further extended at NYU, on ACE 2004: F = 70.1
Kernel methods and SVMs
• As an alternative to a feature-based model, one can provide a kernel function: a similarity function between pairs of the objects being classified
• the kernel can be used directly by a k-nearest-neighbor (kNN) classifier
• or can be used in training an SVM (Support Vector Machine)
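The kNN route needs nothing beyond the kernel itself: classify an item by the label of the training example it is most similar to. A toy sketch (the token-overlap kernel and the examples are illustrative stand-ins, not the kernels from the papers below):

```python
def overlap_kernel(x, y):
    """Similarity = number of shared tokens (a stand-in kernel)."""
    return len(set(x) & set(y))

def knn_classify(item, train):
    """1-nearest-neighbor: label of the most similar training example."""
    best = max(train, key=lambda ex: overlap_kernel(item, ex[0]))
    return best[1]

train = [
    (["ceo", "of", "Microsoft"], "OrgAff"),
    (["wounded", "in", "the", "West", "Bank"], "Phys:Located"),
]
print(knn_classify(["president", "of", "IBM"], train))  # → OrgAff
```

The SVM usage is analogous: the learner only ever consults the kernel, never an explicit feature vector.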
SVM
• When trained, an SVM creates a separating hyperplane
• if the data is fully separable, all data on one side of the hyperplane is classified +, on the other side –
• inherently a binary classifier
Benefit of kernel methods
• provide a natural way of handling structured input of variable size: sequences and trees
• a feature-based system may require a large number of features for the same effect
Shortest-path kernel
• [Bunescu & Mooney EMNLP 2005]
• ACE 2002 corpus
• based on the dependency path between the arguments
• kernel function between two paths x and y of lengths m and n:
K(x, y) = 0 if m ≠ n, otherwise ∏ᵢ c(xᵢ, yᵢ)
• c = degree of match at each path position (lexical / POS)
• train an SVM
• F = 52.5
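A sketch of this kernel's core idea, under our reading of it: each path position carries a set of features (word, POS, edge direction, …); paths of different lengths get similarity 0; otherwise the kernel is the product of per-position match counts. The example paths are illustrative:

```python
from math import prod

def path_kernel(x, y):
    """x, y: lists of feature sets, one set per position on the path.
    Returns 0 for paths of different lengths, else the product of the
    number of shared features at each position."""
    if len(x) != len(y):
        return 0
    return prod(len(xi & yi) for xi, yi in zip(x, y))

# Two dependency paths, each position = {word, POS} (or an edge marker).
x = [{"protesters", "NNS"}, {"->"}, {"seized", "VBD"}]
y = [{"troops", "NNS"}, {"->"}, {"raided", "VBD"}]
print(path_kernel(x, y))  # → 1 (NNS matches, -> matches, VBD matches)
```

Because any position with zero shared features zeroes the whole product, only paths matching at every step get a nonzero score, which keeps the kernel sparse and selective.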
Tree kernel
• To take account of more of the tree than the dependency path, use the PET (path-enclosed tree)
• PET = portion of the tree enclosed by the shortest path
• using the entire sentence tree introduces too much irrelevant data
• use a tree kernel which recursively compares the two trees
• for example, counts the number of shared subtrees
• the best kernel is a composite kernel: tree kernel + entity kernel
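A toy version of the shared-subtree count (a much-simplified convolution kernel over complete subtrees only, not the composite kernel from the literature). Trees are nested tuples, `(label, child, child, ...)`; the labels below are illustrative:

```python
def subtrees(t):
    """Yield every complete subtree of t (including t itself)."""
    yield t
    for child in t[1:]:
        yield from subtrees(child)

def tree_kernel(t1, t2):
    """Count identical (subtree-of-t1, subtree-of-t2) pairs."""
    s2 = list(subtrees(t2))
    return sum(1 for a in subtrees(t1) for b in s2 if a == b)

# Two small path-enclosed trees sharing the NP(DT NN) fragment.
pet1 = ("NP", ("NP", ("DT",), ("NN",)), ("PP", ("IN",), ("NP",)))
pet2 = ("NP", ("NP", ("DT",), ("NN",)), ("VP", ("VBD",)))
print(tree_kernel(pet1, pet2))  # → 3 (shared: NP(DT NN), DT, NN)
```

The real kernels are computed by dynamic programming rather than enumeration, but the quantity being counted is the same kind of shared structure.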
Lexical Generalization
• Test data will include words not seen in training
• Remedies:
• use lemmas
• use Brown clusters
• use word embeddings
• can be used with feature-based or kernel-based methods
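The Brown-cluster remedy can be sketched as a feature lookup: words sharing a cluster bit-string prefix emit the same feature, so an unseen word still fires a feature seen in training. The cluster bit-strings below are made up for illustration:

```python
# Hypothetical Brown-cluster bit-strings (real ones come from clustering
# a large corpus; these values are invented for the example).
clusters = {
    "president": "0110", "CEO": "0110", "chairman": "0111",
    "Microsoft": "1010", "IBM": "1010",
}

def word_features(w, prefix_len=2):
    """Lexical features for word w: the word itself, plus a coarse
    cluster-prefix feature when the word has a known cluster."""
    feats = [f"word={w}"]
    bits = clusters.get(w)
    if bits:
        feats.append(f"cluster{prefix_len}={bits[:prefix_len]}")
    return feats

print(word_features("CEO"))        # → ['word=CEO', 'cluster2=01']
print(word_features("president"))  # shares the cluster feature with "CEO"
```

So even if "president" never occurred in training, its `cluster2=01` feature did (via "CEO"), which is exactly the generalization being bought.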
FCM: Feature-rich Compositional Embedding Model
• Combines word embeddings and hand-made discrete features: s = fᵀ T e, where
• e is the word embedding vector
• f is a vector of hand-coded features
• T is a matrix of weights
• If e is fixed during training, this is a feature-rich log-linear model
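A pure-Python sketch of this scoring idea, under our reading of it (toy dimensions and values, not the trained model): combine the hand-coded feature vector f with the embedding e via an outer product, then score with the weight matrix T, which is the same quantity as fᵀ T e:

```python
def outer(f, e):
    """Outer product f ⊗ e as a list of rows."""
    return [[fi * ej for ej in e] for fi in f]

def score(T, f, e):
    """<T, f ⊗ e> = sum of elementwise products = fᵀ T e."""
    return sum(T[i][j] * v
               for i, row in enumerate(outer(f, e))
               for j, v in enumerate(row))

f = [1.0, 0.0]          # hand-coded features, e.g. "arg1 is PERSON" fires
e = [0.5, -0.2, 0.1]    # word embedding (toy values)
T = [[1.0, 2.0, 0.0],   # weights tying each (feature, embedding-dim) pair
     [0.0, 0.0, 1.0]]   # to the score
print(round(score(T, f, e), 6))  # → 0.1  (= 0.5·1 + (-0.2)·2 + 0.1·0)
```

Because f gates which rows of the outer product are nonzero, the discrete features decide which slice of T a word's embedding is scored against.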
Neural Networks
• neural networks
• provide a richer model than log-linear
• reduce the need for feature engineering
• although it may help to add features to embeddings
• but are slow to train and hard to inspect
• several types of networks have been used
• convolutional NNs
• recurrent NNs
• an ensemble of different NN types appears most effective
• may even include a log-linear model in the ensemble
Some comparisons
• ACE 2005; train on nw + bn, test on bc; perfect mentions, including entity types
• Log-linear system 57.8
• FCM 61.9
• hybrid FCM 63.5
• CNN 63.0
• NN ensemble 67.0
• The richer model of even a simple NN beats a log-linear (maxent) system
[Nguyen and Grishman, IJCAI Workshop 2016]
Comparing scores
• Using a subset of ACE 2005 (news); feature-based system; perfect mention position but no type info
• Baseline 51.4
• Single Brown cluster 52.3
• Multiple clusters 53.7
• Word embedding (WE) 54.1
• Multiple clusters + WE 55.5
• Multiple clusters + WE + regularization 59.4
• Moral: lexical generalization & regularization are worthwhile (probably for all ACE tasks)
[Nguyen & Grishman ACL 2014]
Distant Supervision
• We have focused on supervised methods, which produce the best performance
• If we have a large database with instances of the relations of interest, we can use distant supervision
• use the database to tag the corpus
• if the DB has relation R(x, y), tag all sentences in the corpus containing x and y as examples of R
• train a model from the tagged corpus
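The tagging step can be sketched in a few lines (the database facts and sentences are invented for illustration): any sentence containing both arguments of a known fact is labeled, noisily, as an example of that relation:

```python
# Toy "database" of known facts: (relation, arg1, arg2).
db = [("employment", "Jobs", "Apple"),
      ("located_in", "Paris", "France")]

sentences = [
    "Jobs returned to Apple in 1997 .",
    "Jobs visited Paris last spring .",
]

def tag(sentences, db):
    """Label every sentence containing both arguments of a DB fact
    as a (noisy) positive example of that fact's relation."""
    labeled = []
    for s in sentences:
        toks = set(s.split())
        for rel, x, y in db:
            if x in toks and y in toks:
                labeled.append((s, rel, x, y))
    return labeled

print(tag(sentences, db))
# only the first sentence matches a DB pair, so only it gets labeled
```

Note the second sentence stays unlabeled even though it mentions Jobs and Paris, because that pair is not a fact in the database; conversely, a matched sentence may not actually express the relation, which is exactly the noise problem raised on the next slide.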
Distant Supervision
• By itself, distant supervision is too noisy
• if the same pair <x, y> is connected by several relations, which one do we label?
• But it can be combined with selective manual annotation to produce a satisfactory result