Rapid Training of Information Extraction with Local and Global Data Views • Dissertation Defense • Ang Sun, Computer Science Department, New York University • April 30, 2012 • Committee: Prof. Ralph Grishman, Prof. Satoshi Sekine, Prof. Heng Ji, Prof. Ernest Davis, Prof. Lakshminarayanan Subramanian
Outline • Introduction • Relation Type Extension: Active Learning with Local and Global Data Views • Relation Type Extension: Bootstrapping with Local and Global Data Views • Cross-Domain Bootstrapping for Named Entity Recognition • Conclusion
Part I Introduction
Tasks • Named Entity Recognition (NER) • Relation Extraction (RE) • Relation Extraction between Names • Relation Mention Extraction
NER • Bill Gates, born October 28, 1955 in Seattle, is the former chief executive officer (CEO) and current chairman of Microsoft.
RE • Relation Extraction between Names • Adam, a data analyst for ABC Inc. → NER → RE → <<Adam, ABC Inc.>, Employment>
RE • Relation Mention Extraction • Adam, a data analyst for ABC Inc. → Entity Extraction
RE • Relation Mention Extraction • Adam, a data analyst for ABC Inc. → RE → <<a data analyst, ABC Inc.>, Employment>
Prior Work – Supervised Learning • Learn with labeled data • < Bill Gates, PERSON > • < <Adam, ABC Inc. >, Employment >
Prior Work – Supervised Learning Expensive!
Prior Work – Supervised Learning • Expensive • A trained model is typically domain-dependent • Porting it to a new domain usually involves annotating data from scratch
Prior Work – Supervised Learning Annotation is tedious! 15 minutes 1 hour 2 hours
Prior Work – Semi-supervised Learning • Learn with both • a small amount of labeled data • a large amount of unlabeled data • The learning is an iterative process: 1. Train an initial model with labeled data 2. Apply the model to tag unlabeled data 3. Select good tagged examples as additional training examples 4. Re-train the model 5. Repeat from Step 2.
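The five steps above can be sketched as a generic self-training loop. This is a minimal illustration, not the systems surveyed in the talk: the 1-D nearest-centroid model, the margin-based confidence score, the `threshold` value, and the fixed `max_iters` cap are all assumptions made for the example.

```python
def train(labeled):
    """Toy model (assumption): mean feature value per class, 1-D nearest centroid."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    """Return (label, confidence) for a two-class model; confidence is a margin in [0, 1]."""
    dists = sorted((abs(x - c), y) for y, c in model.items())
    (d1, label), (d2, _) = dists[0], dists[1]
    conf = 1.0 if d1 + d2 == 0 else (d2 - d1) / (d1 + d2)
    return label, conf

def self_train(labeled, unlabeled, threshold=0.5, max_iters=10):
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = train(labeled)                                    # Step 1: initial model
    for _ in range(max_iters):                                # Step 5: repeat (fixed cap)
        tagged = [(x, *predict(model, x)) for x in unlabeled] # Step 2: tag unlabeled data
        good = [(x, y) for x, y, conf in tagged if conf >= threshold]  # Step 3: select
        if not good:
            break
        labeled.extend(good)
        accepted = {x for x, _ in good}
        unlabeled = [x for x in unlabeled if x not in accepted]
        model = train(labeled)                                # Step 4: re-train
    return model, labeled

model, final_labeled = self_train([(0.0, "A"), (10.0, "B")], [1.0, 9.0, 5.0])
print(model, len(final_labeled))
```

The ambiguous point `5.0` is never accepted because its confidence stays below the threshold. Lowering the threshold would admit it and illustrates how noisy selections can accumulate.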
Prior Work – Semi-supervised Learning • Problem 1: Semantic Drift • Example 1: • A learner for PERSON names ends up learning flower names • Because women's first names intersect with names of flowers (Rose, ...) • Example 2: • A learner for LocatedIn relation patterns ends up learning patterns for other relations (birthPlace, governorOf, …)
Prior Work – Semi-supervised Learning • Problem 2: Lacks a good stopping criterion • Most systems • either use a fixed number of iterations • or use a labeled development set to detect the right stopping point
Prior Work – Unsupervised Learning • Learn with only unlabeled data • Unsupervised Relation Discovery • Context based clustering • Group pairs of named entities with similar context to the same relation cluster
Prior Work – Unsupervised Learning • Unsupervised Relation Discovery (Hasegawa et al., 2004)
Prior Work – Unsupervised Learning • Unsupervised Relation Discovery • The semantics of clusters are usually unknown • Some clusters are coherent, so they can be labeled consistently • Some are mixed, containing different topics, so they are difficult to label
Part II Relation Type Extension: Active Learning with Local and Global Data Views
Relation Type Extension • Extend a relation extraction system to new types of relations • Multi-class Setting: • Target relation: one of the ACE relation types • Labeled data: 1) a few labeled examples of the target relation (possibly by random selection); 2) all labeled auxiliary relation examples • Unlabeled data: all other examples in the ACE corpus
Relation Type Extension • Extend a relation extraction system to new types of relations • Binary Setting: • Target relation: one of the ACE relation types • Labeled data: a few labeled examples of the target relation (possibly by random selection) • Unlabeled data: all other examples in the ACE corpus
LGCo-Testing • LGCo-Testing := co-testing with local and global views • The general idea • Train one classifier based on the local view (the sentence that contains the pair of entities) • Train another classifier based on the global view (distributional similarities between relation instances) • Reduce annotation cost by only requesting labels of contention data points (instances on which the two classifiers disagree)
The local view • Token Sequence • Syntactic Parsing Tree • <e1>President Clinton</e1> traveled to <e2>the Irish border</e2> for an evening ceremony.
The local view • Dependency Parsing Tree • <e1>President Clinton</e1> traveled to <e2>the Irish border</e2> for an evening ceremony.
The local view • The local view classifier • Binary Setting: MaxEnt binary classifier • Multi-class Setting: MaxEnt multiclass classifier
The General Idea • The global view • Corpus of 2,000,000,000 tokens • Compile corpus to database of 7-grams (* * * * * * *) • Represent each relation instance as a relational phrase • Compute distributional similarities between phrases in the 7-grams database • Build a relation classifier based on the K-nearest neighbor idea • Relation Instance → Relational Phrase: • <e1>Clinton</e1> traveled to <e2>the Irish border</e2> for … → traveled to • … <e2><e1>his</e1> brother</e2> said that … → his brother
Compute distributional similarities • <e1>President Clinton</e1> traveled to <e2>the Irish border</e2> for an evening ceremony.
Query: > * * * traveled to * *
Count  7-gram
3  's headquarters here traveled to the U.S.
4  laundering after he traveled to the country
3  , before Paracha traveled to the United
3  have never before traveled to the United
3  had in fact traveled to the United
4  two Cuban grandmothers traveled to the United
3  officials who recently traveled to the United
6  President Lee Teng-hui traveled to the United
4  1996 , Clinton traveled to the United
4  commission members have traveled to the United
4  De Tocqueville originally traveled to the United
4  Fernando Henrique Cardoso traveled to the United
3  Elian 's grandmothers traveled to the United
Compute distributional similarities • <e1>Ang</e1> arrived in <e2>Seattle</e2> on Wednesday.
Query: > * * * arrived in * *
Count  7-gram
4  Arafat , who arrived in the other
   of sorts has arrived in the new
5  inflation has not arrived in the U.S.
3  Juan Miguel Gonzalez arrived in the U.S.
   it almost certainly arrived in the New
4  to Brazil , arrived in the country
4  said Annan had arrived in the country
21  he had just arrived in the country
5  had not yet arrived in the country
3  when they first arrived in the country
3  day after he arrived in the country
5  children who recently arrived in the country
4  Iraq Paul Bremer arrived in the country
3  head of counterterrorism arrived in the country
3  election monitors have arrived in the country
Compute distributional similarities • Represent each phrase as a feature vector of contextual tokens • Compute cosine similarity between two feature vectors • Feature weight? • Example: President Clinton traveled to the Irish border → <L2_President, L1_Clinton, R1_the, R2_Irish, R3_border>
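A minimal sketch of the two operations on this slide: extracting position-tagged context tokens around a relational phrase, and computing cosine similarity between sparse feature vectors. The function names and the window sizes (`left=2`, `right=3`, chosen to match the example vector) are assumptions for illustration, and raw counts stand in for the feature weights.

```python
import math
from collections import Counter

def context_features(tokens, start, end, left=2, right=3):
    """Position-tagged context tokens around the phrase tokens[start:end]."""
    feats = []
    for i in range(1, left + 1):
        if start - i >= 0:
            feats.append(f"L{i}_{tokens[start - i]}")
    for i in range(1, right + 1):
        if end - 1 + i < len(tokens):
            feats.append(f"R{i}_{tokens[end - 1 + i]}")
    return feats

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

tokens = "President Clinton traveled to the Irish border".split()
feats = context_features(tokens, 2, 4)   # the phrase "traveled to"
print(feats)
print(cosine(Counter(feats), Counter(["R1_the", "L1_Clinton"])))
```

In practice each phrase vector would aggregate the features of all its corpus instances before similarities are computed.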
Feature Weight Use Frequency ?
Feature Weight • Use tf-idf • tf: the number of corpus instances of P having feature f divided by the number of instances of P • idf: the total number of phrases in the corpus divided by the number of phrases with at least one instance with feature f
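The tf and idf definitions above can be written out directly. The sketch below follows the slide's wording literally, so idf is a plain ratio with no logarithm (many tf-idf variants apply one). The input layout (a dict from phrase to a list of per-instance feature sets) and the toy data are assumptions for illustration.

```python
def tfidf_vectors(instances_by_phrase):
    """instances_by_phrase: {phrase: [set of features for each corpus instance]}."""
    n_phrases = len(instances_by_phrase)
    # Number of phrases that have at least one instance with feature f
    phrase_freq = {}
    for insts in instances_by_phrase.values():
        for f in set().union(*insts):
            phrase_freq[f] = phrase_freq.get(f, 0) + 1
    vectors = {}
    for phrase, insts in instances_by_phrase.items():
        vec = {}
        for f in set().union(*insts):
            tf = sum(1 for inst in insts if f in inst) / len(insts)
            idf = n_phrases / phrase_freq[f]
            vec[f] = tf * idf
        vectors[phrase] = vec
    return vectors

toy = {
    "traveled to": [{"L1_Clinton", "R1_the"}, {"R1_the"}],
    "arrived in": [{"R1_the"}],
    "his brother": [{"R1_said"}],
}
print(tfidf_vectors(toy))
```

Here `R1_the` occurs with two of the three phrases, so it receives a lower idf than `R1_said`, which occurs with only one.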
Compute distributional similarities Sample of similar phrases.
The global view classifier • k-nearest neighbor classifier: classify an unlabeled example based on closest labeled examples • Labeled: <e1>President Clinton</e1> traveled to <e2>the Irish border</e2> → PHYS-LocatedIn • Labeled: … <e2><e1>his</e1> brother</e2> said that … → PER-SOC • Unlabeled: <e1>Ang Sun</e1> arrived in <e2>Seattle</e2> on Wednesday. → ? → PHYS-LocatedIn • sim(arrived in, traveled to) = 0.763 • sim(arrived in, his brother) = 0.012
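The slide's example can be reproduced with a small nearest-neighbor sketch. The two similarity values are the ones shown on the slide. The lookup table, the function names, and the choice `k=1` are assumptions for illustration.

```python
from collections import Counter

def knn_classify(query, labeled, sim, k=1):
    """Label `query` by majority vote among its k most similar labeled phrases."""
    ranked = sorted(labeled, key=lambda item: sim(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Similarities taken from the slide; any pair not listed defaults to 0.0
SIM = {
    ("arrived in", "traveled to"): 0.763,
    ("arrived in", "his brother"): 0.012,
}

def sim(a, b):
    return SIM.get((a, b), SIM.get((b, a), 0.0))

labeled = [("traveled to", "PHYS-LocatedIn"), ("his brother", "PER-SOC")]
print(knn_classify("arrived in", labeled, sim))  # prints PHYS-LocatedIn
```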
LGCo-Testing Procedure in Detail Use KL-divergence to quantify the disagreement between the two classifiers • KL-divergence: • 0 for identical distributions • max when distributions are peaked and prefer different class labels • Rank instances by descending order of KL-divergence • Pick the top 5 instances to request human labels during a single iteration
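A hedged sketch of the selection step: compute the KL-divergence between the class distributions predicted by the two views, rank instances in descending order, and request labels for the top 5. The direction of the divergence (local relative to global), the smoothing constant `eps`, and the function names are assumptions, since the slide does not state them.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same class labels."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def select_for_labeling(local_probs, global_probs, batch_size=5):
    """Indices of the instances on which the two views disagree most."""
    scores = [(kl_divergence(p, q), i)
              for i, (p, q) in enumerate(zip(local_probs, global_probs))]
    scores.sort(key=lambda s: s[0], reverse=True)
    return [i for _, i in scores[:batch_size]]

local_view = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
global_view = [[0.1, 0.9], [0.5, 0.5], [0.2, 0.8]]
print(select_for_labeling(local_view, global_view, batch_size=2))
```

Instance 0 ranks first because both views are peaked but prefer different labels. Instance 1 has identical distributions and a divergence of 0.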
Active Learning Baselines • RandomAL • UncertaintyAL • Local view classifier • Sample selection: • UncertaintyAL+ • Local view classifier (with phrase cluster features) • Sample selection: • SPCo-Testing • Co-Testing (sequence view classifier and parsing view classifier) • Sample selection: KL-divergence
Results for PER-SOC (Multi-class Setting) • Supervised: 36K instances, 180 hours • LGCo-Testing: 300 instances, 1.5 hours • Annotation speed: • 4 instances per minute • 200 instances per hour (the annotator takes a 10-minute break in each hour) • Results for other types of relations show similar trends (in both binary and multiclass settings)
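The hour figures follow from the stated annotation speed. A short check, with a hypothetical helper name:

```python
def annotation_hours(n_instances, per_minute=4, break_minutes=10):
    """Hours needed at `per_minute` instances/minute with a break in every hour."""
    per_hour = per_minute * (60 - break_minutes)   # 4 * 50 = 200 instances per hour
    return n_instances / per_hour

print(annotation_hours(36_000))  # supervised: 180.0 hours
print(annotation_hours(300))     # LGCo-Testing: 1.5 hours
```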
Comparing LGCo-Testing with the Two Settings • F1 difference (in percentage) = F1 of active learning minus F1 of supervised learning • The reduction of annotation cost from incorporating auxiliary types is more pronounced in early learning stages (#labels < 200) than in later ones
Part III Relation Type Extension: Bootstrapping with Local and Global Data Views
Basic Idea • Consider a bootstrapping procedure to discover semantic patterns for extracting relations between named entities
Basic Idea • It starts from some seed patterns, which are used to extract named entity (NE) pairs, which in turn result in more semantic patterns learned from the corpus.
Basic Idea • Semantic drift occurs because • a pair of names may be connected by patterns belonging to multiple relations • the bootstrapping procedure is looking at the patterns in isolation
Unguided Bootstrapping vs. Guided Bootstrapping • Unguided NE Pair Ranker: • Use local evidence • Look at the patterns in isolation • Guided NE Pair Ranker: • Use global evidence • Take into account the clusters (Ci) of patterns
Unguided Bootstrapping • Initial Settings: • The seed patterns for the target relation R have precision 1 and all other patterns have precision 0 • All NE pairs have confidence 0
Unguided Bootstrapping • Step 1: Use seed patterns to match new NE pairs and evaluate NE pairs • if many of the k patterns connecting the two names are high-precision patterns • then the name pair should have a high confidence • The confidence of an NE pair is estimated from the precisions of the k patterns that connect it • Problem: this over-rates NE pairs that are connected by patterns belonging to multiple relations
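The confidence formula itself did not survive on this slide. One estimate that matches the stated intuition (more high-precision connecting patterns give higher confidence) is a noisy-or combination of pattern precisions. It is used below purely as an assumption and should not be read as the formula from the talk.

```python
def pair_confidence(pattern_precisions):
    """Noisy-or (assumed): 1 minus the chance that every connecting pattern is wrong."""
    all_wrong = 1.0
    for prec in pattern_precisions:
        all_wrong *= (1.0 - prec)
    return 1.0 - all_wrong

print(pair_confidence([]))          # no connecting pattern: 0.0
print(pair_confidence([1.0]))       # one seed pattern with precision 1: 1.0
print(pair_confidence([0.5, 0.5]))  # two mediocre patterns: 0.75
```

Under this estimate a pair connected by several moderately precise patterns scores highly even when those patterns belong to different relations, which is the over-rating problem noted above.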
Unguided Bootstrapping • Step 2: Use NE pairs to search for new patterns and rank patterns • Similarly, for a pattern p, • if many of the NE pairs it matches are very confident • then p has many supporters and should have a high ranking • Estimation of the confidence of patterns uses: • Sup(p): the sum of the support from the |H| pairs • |H|: the number of unique NE pairs matched by p
Unguided Bootstrapping • Step 2: Use NE pairs to search for new patterns and rank patterns • Sup(p) is the sum of the support p can get from the |H| pairs • The precision of p is given by the average confidence of the NE pairs matched by p • Averaging normalizes the precision to the range 0 to 1 • As a result, the confidence of each NE pair is also normalized to between 0 and 1
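The two quantities described here, Sup(p) and the precision of p, can be sketched as follows. The function name, the data layout, and the toy confidences are assumptions for illustration. How the two values are combined into a final ranking score is not shown, since the slide's formula was not recoverable.

```python
def pattern_support_and_precision(matched_pairs, pair_conf):
    """Return (Sup(p), Prec(p)) for one pattern p.

    matched_pairs: NE pairs matched by p (duplicates allowed)
    pair_conf:     {NE pair: confidence in [0, 1]}; unseen pairs count as 0
    """
    unique_pairs = set(matched_pairs)              # H: unique NE pairs matched by p
    if not unique_pairs:
        return 0.0, 0.0
    sup = sum(pair_conf.get(pair, 0.0) for pair in unique_pairs)
    return sup, sup / len(unique_pairs)            # average confidence stays in [0, 1]

conf = {("Jordan", "China"): 1.0, ("Clinton", "Ireland"): 0.5}
matches = [("Jordan", "China"), ("Clinton", "Ireland"),
           ("Jordan", "China"), ("Ang", "Seattle")]
print(pattern_support_and_precision(matches, conf))  # (1.5, 0.5)
```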
Unguided Bootstrapping • Step 3: Accept patterns • accept the K top ranked patterns in Step 2 • Step 4: Loop or stop • The procedure now decides whether to repeat from Step 1 or to terminate. • Most systems simply do NOT know when to stop
Guided Bootstrapping • Pattern Clusters: clustering steps • Extract features for patterns • Compute the tf-idf value of extracted features • Compute the cosine similarity between patterns • Build a pattern hierarchy by complete linkage • Figure: sample features for “X visited Y” as in “Jordan visited China”
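The last clustering step can be sketched as agglomerative clustering with complete linkage, where the similarity between two clusters is the similarity of their least similar members. The toy patterns, the similarity values, and the choice to cut the hierarchy at a fixed `threshold` are assumptions for illustration. The talk builds a full hierarchy and does not state a cut-off here.

```python
def complete_linkage(items, sim, threshold):
    """Merge clusters greedily while the best complete-link similarity >= threshold."""
    clusters = [[x] for x in items]

    def link(a, b):
        # Complete linkage: similarity of the least similar cross-cluster pair
        return min(sim(x, y) for x in a for y in b)

    while len(clusters) > 1:
        best, (i, j) = max(
            ((link(clusters[i], clusters[j]), (i, j))
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[0])
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical pattern similarities (in practice: cosine over tf-idf features)
TABLE = {
    frozenset(["X visited Y", "X arrived in Y"]): 0.8,
    frozenset(["X visited Y", "X traveled to Y"]): 0.7,
    frozenset(["X arrived in Y", "X traveled to Y"]): 0.9,
}

def pattern_sim(a, b):
    return TABLE.get(frozenset([a, b]), 0.1)

patterns = ["X visited Y", "X arrived in Y", "X traveled to Y", "X, brother of Y"]
print(complete_linkage(patterns, pattern_sim, threshold=0.5))
```

With these values the three movement patterns end up in one cluster and the kinship pattern stays on its own.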