Co-training Internal and External Extraction Models

By Thomas Packer

Bootstrapped Knowledge and Language Acquisition
• Tom Mitchell’s Co-training Theory
  • “Combining Labeled and Unlabeled Data with Co-training”, Avrim Blum and Tom Mitchell, 1998.
• Tom Mitchell’s Coupled Bootstrap Learning
  • “Coupling Semi-Supervised Learning of Categories and Relations”, Andrew Carlson, Justin Betteridge, Estevam Rafael Hruschka Jr. and Tom M. Mitchell, 2009.
• David Yarowsky
  • “Unsupervised Word Sense Disambiguation Rivaling Supervised Methods”, 1995.

Source Document Types
• Semi-structured, noisy OCR’d historical documents:
  • (This presentation.)
• Semi-structured, clean(-ish) HTML web pages:
  • Using multiple ontology constraints (Tom Mitchell, Andrew Carlson paper).
  • Adding the learning and use of cardinality constraints.

OCR Documents

OCR Documents
380,641,672,686 WOMEN'S
670,641,893,686 HOME
891,641,1316,685 MISSIONARY
1314,639,1622,686 SOCIETY
909,886,1091,931 Officers
192,969,450,1029 Presidenlt
1032,972,1077,1011 M
1086,986,1135,1011 RS
1142,972,1391,1019 CHARLES
1388,974,1464,1013 A
1475,973,1692,1026 JEWELL
207,1037,309,1077 Vice
308,1038,597,1077 Presidenl
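The token listing above can be read back into text lines with a few lines of code. The following is a minimal sketch, assuming each record is `left,top,right,bottom` followed by the token text; that field order is inferred from the numbers on the slide and is not stated in the source. The names `OcrToken`, `parse_ocr_tokens`, `group_into_lines`, and the 30-pixel `tolerance` are hypothetical.

```python
from dataclasses import dataclass

# Tokens copied from the slide. The field order (left, top, right, bottom)
# is an assumption inferred from the numbers, not stated in the source.
SAMPLE = """\
380,641,672,686 WOMEN'S
670,641,893,686 HOME
891,641,1316,685 MISSIONARY
1314,639,1622,686 SOCIETY
909,886,1091,931 Officers
192,969,450,1029 Presidenlt
1032,972,1077,1011 M
1086,986,1135,1011 RS
1142,972,1391,1019 CHARLES
1388,974,1464,1013 A
1475,973,1692,1026 JEWELL
"""


@dataclass
class OcrToken:
    left: int
    top: int
    right: int
    bottom: int
    text: str


def parse_ocr_tokens(raw):
    """Parse 'l,t,r,b TEXT' records, one per line, into OcrToken objects."""
    tokens = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        coords, text = line.split(" ", 1)
        left, top, right, bottom = (int(v) for v in coords.split(","))
        tokens.append(OcrToken(left, top, right, bottom, text))
    return tokens


def group_into_lines(tokens, tolerance=30):
    """Group tokens whose top edge lies within `tolerance` pixels of the
    first (topmost) token of the current line, then order each line
    left to right."""
    lines = []
    for tok in sorted(tokens, key=lambda t: t.top):
        if lines and tok.top - lines[-1][0].top <= tolerance:
            lines[-1].append(tok)
        else:
            lines.append([tok])
    return [sorted(line, key=lambda t: t.left) for line in lines]


text_lines = [
    " ".join(t.text for t in line)
    for line in group_into_lines(parse_ocr_tokens(SAMPLE))
]
for text in text_lines:
    print(text)
```

Grouping by the top edge is the simplest choice that works for this sample; skewed scans would likely need a more robust baseline estimate.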

OCR Documents
WOMEN'S HOME MISSIONARY SOCIETY
Officers
Presidenlt M RS CHARLES A JEWELL
Vice Presidenl MRS FRANCIS B COOI EN
MRS P W ELILSWVORT
MRs HERBERT C ADSWVORTH
MRS HENRY E TAINTOR
MR DANIEl H WELLS
MRS ARTHUR L GOODRICH
Recording Secretarv Miss JOSEPHINE WHITE
Corresponding S retary Mss JULIA A GRAVES
Treasurer Ms H B LANGDON
Chairman IWorkComtmittee Miss MARY H ADAMS
Chairman of 31emb 'ership Miss ELIZA F Mix
Chairman of Purchlasizng Cont 'MRs MIARY C ST )NEC
Chairman of Socia I ConnilLt'e MIRS AI I ERT H PITKIN
Secretary's Report
This Society is auxiliary to the Women's Home Missionary Union of Connecticut Its membership is 120 and its active season extends from November to April Meetings are held semi-monthly on Friday afternoons from 2 until 5 o'clock The time is occupied in sewing hearing letters from the home missionary field transacting business and in social intercourse often ending with tea

Co-trainable Extraction Models
• Internal Model:
  • Decision list.
  • Maps a word to a label with a confidence score.
  • “James” → ‘Given Name’ 0.9
• External Model:
  • Decision list.
  • Maps a collocation pattern to a label with a confidence score.
  • Left token is ‘Given Name’, right token is ‘Surname’, current token has length=1 → ‘Initial’ 0.95
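The two decision lists can be sketched as plain rule lists that are scanned in order of decreasing confidence. This is an illustrative sketch, not the presenter's implementation: the class and function names are hypothetical, and case-insensitive word matching is an assumption. The two rules are the examples from the slide.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InternalRule:
    word: str          # lowercased word form (case folding is an assumption)
    label: str
    confidence: float


@dataclass
class ExternalRule:
    left_label: Optional[str]            # None means "any left neighbour"
    right_label: Optional[str]           # None means "any right neighbour"
    token_test: Callable[[str], bool]    # test on the current token itself
    label: str
    confidence: float


def apply_internal(rules, token):
    """Return (label, confidence) of the most confident rule matching the word."""
    for rule in sorted(rules, key=lambda r: -r.confidence):
        if rule.word == token.lower():
            return rule.label, rule.confidence
    return None


def apply_external(rules, left_label, token, right_label):
    """Return (label, confidence) of the most confident rule matching the context."""
    for rule in sorted(rules, key=lambda r: -r.confidence):
        if rule.left_label not in (None, left_label):
            continue
        if rule.right_label not in (None, right_label):
            continue
        if rule.token_test(token):
            return rule.label, rule.confidence
    return None


# The two example rules from the slide.
internal_rules = [InternalRule("james", "Given Name", 0.9)]
external_rules = [
    ExternalRule("Given Name", "Surname", lambda t: len(t) == 1, "Initial", 0.95)
]

print(apply_internal(internal_rules, "James"))
print(apply_external(external_rules, "Given Name", "A", "Surname"))
```

The internal model looks only at the token itself, while the external model looks only at its neighbours' labels and a shape feature, which is what lets each model supply training evidence for the other.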

Bootstrapping Approach
• Initialize empty models (internal and external).
• Manually create seed ontology, e.g. list of first names, last names, etc.
• Process documents, extracting instances and features.
• Loop:
  • Label words with top-precision labels based on current models.
  • Propose new model elements based on newly labeled tokens.
  • Update model parameters based on label statistics.
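The loop above can be sketched in a few dozen lines. This is a simplified sketch under stated assumptions, not the presenter's system: the seed list is loaded as internal rules with confidence 1.0, the only external feature is the pair of neighbouring labels, confidence is the relative frequency of the majority label, and the function name `bootstrap`, the `threshold`, and the `<S>`/`</S>` boundary symbols as dictionary keys are all hypothetical. The token lines are cleaned-up versions of names in the OCR sample.

```python
from collections import Counter, defaultdict


def context(labels, j):
    """Labels of the left and right neighbours, with line-boundary symbols."""
    left = labels[j - 1] if j > 0 else "<S>"
    right = labels[j + 1] if j + 1 < len(labels) else "</S>"
    return left, right


def bootstrap(lines, seed, rounds=3, threshold=0.8):
    # Assumption: seed words enter the internal model with confidence 1.0.
    internal = {word: (label, 1.0) for word, label in seed.items()}
    external = {}  # (left_label, right_label) -> (label, confidence)
    labels = [[None] * len(line) for line in lines]

    for _ in range(rounds):
        # Step 1: label tokens with rules that clear the confidence threshold.
        for i, line in enumerate(lines):
            for j, tok in enumerate(line):
                if labels[i][j] is not None:
                    continue
                rule = internal.get(tok.lower())
                if rule is None:
                    rule = external.get(context(labels[i], j))
                if rule is not None and rule[1] >= threshold:
                    labels[i][j] = rule[0]

        # Step 2: propose rules by counting labels per word and per context.
        word_counts = defaultdict(Counter)
        ctx_counts = defaultdict(Counter)
        for i, line in enumerate(lines):
            for j, tok in enumerate(line):
                label = labels[i][j]
                if label is None:
                    continue
                word_counts[tok.lower()][label] += 1
                left, right = context(labels[i], j)
                if left is not None and right is not None:
                    ctx_counts[(left, right)][label] += 1

        # Step 3: update parameters; confidence = share of the majority label.
        for model, counts_by_key in ((internal, word_counts), (external, ctx_counts)):
            for key, counts in counts_by_key.items():
                label, n = counts.most_common(1)[0]
                model[key] = (label, n / sum(counts.values()))

    return labels, internal, external


lines = [
    ["MRS", "HENRY", "E", "TAINTOR"],
    ["MR", "DANIEL", "H", "WELLS"],
    ["MRS", "ARTHUR", "L", "GOODRICH"],
]
seed = {
    "mrs": "Prefix", "mr": "Prefix",
    "henry": "Given Name", "daniel": "Given Name",
    "e": "Initial", "h": "Initial", "l": "Initial",
    "wells": "Surname", "goodrich": "Surname",
}
labels, internal, external = bootstrap(lines, seed)
for line, line_labels in zip(lines, labels):
    print(list(zip(line, line_labels)))
```

In this run "TAINTOR" and "ARTHUR" are not in the seed list; they receive labels in the second round from context rules learned in the first, and then become new internal rules.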

OCR Documents
M RS CHARLES A JEWELL
MRS FRANCIS B COOI EN
MRS P W ELILSWVORT
MRs HERBERT C ADSWVORTH
MRS HENRY E TAINTOR
MR DANIEl H WELLS
MRS ARTHUR L GOODRICH
Miss JOSEPHINE WHITE
Mss JULIA A GRAVES
Ms H B LANGDON
Miss MARY H ADAMS
Miss ELIZA F Mix
'MRs MIARY C ST )NEC
MIRS AI I ERT H PITKIN
• Seed models:
  • Prefix: “Mrs”, “Miss”, “Mr”
  • Initials: “A”, “B”, “C”, …
  • Given Name: “Charles”, “Francis”, “Herbert”
  • Surname: “Goodrich”, “Wells”, “White”
  • Stopword: “Jewell”, “Graves”
• Updates:
  • Prefix: first token in line
  • Given Name: between ‘Prefix’ and ‘Initial’
  • Surname: between ‘Initial’ and </S>

Evaluation
• Measure and compare (trade-off):
  • Precision
  • Recall
  • Human time
• Compare bootstrapping to baselines:
  • Simple dictionary matching
  • Dictionary + hand-coded patterns (regular expressions matching labels)
• Possibly combining evidence from multiple matching lines in the decision list (e.g. noisy-OR, naïve Bayes).
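Two of the quantities named above are easy to make concrete. The sketch below computes token-level precision and recall against a gold labeling, and combines the confidences of several matching decision-list rules with noisy-OR. It assumes the rules act as independent evidence for the same label, which is the usual noisy-OR assumption and is not stated on the slide; the function names and the example labelings are hypothetical.

```python
import math


def precision_recall(predicted, gold):
    """Token-level precision and recall.

    `predicted` and `gold` are equal-length label lists; None means
    "no label". A prediction is correct when it equals the gold label.
    """
    n_predicted = sum(1 for p in predicted if p is not None)
    n_gold = sum(1 for g in gold if g is not None)
    n_correct = sum(
        1 for p, g in zip(predicted, gold) if p is not None and p == g
    )
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return precision, recall


def noisy_or(confidences):
    """Combine rule confidences for one label: 1 - prod(1 - c_i)."""
    return 1.0 - math.prod(1.0 - c for c in confidences)


gold = ["Prefix", "Given Name", "Initial", "Surname"]
predicted = ["Prefix", "Given Name", None, "Given Name"]
precision, recall = precision_recall(predicted, gold)
print(precision, recall)

# Two rules that both propose the same label for a token.
print(noisy_or([0.9, 0.5]))
```

Here three tokens are labeled and two are correct, so precision is 2/3 and recall is 2/4. Under noisy-OR, rules with confidence 0.9 and 0.5 combine to 1 - 0.1 × 0.5 = 0.95, so agreement between rules raises the score above either rule alone.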

Questions
