
Co-training Internal and External Extraction Models

Co-training Internal and External Extraction Models. By Thomas Packer. Bootstrapped Knowledge and Language Acquisition. Tom Mitchell’s Co-training Theory: “Combining Labeled and Unlabeled Data with Co-training”, Avrim Blum and Tom Mitchell, 1998.


Presentation Transcript


  1. Co-training Internal and External Extraction Models By Thomas Packer

  2. Bootstrapped Knowledge and Language Acquisition
     • Tom Mitchell’s Co-training Theory
       • “Combining Labeled and Unlabeled Data with Co-training”, Avrim Blum and Tom Mitchell, 1998.
     • Tom Mitchell’s Coupled Bootstrap Learning
       • “Coupling Semi-Supervised Learning of Categories and Relations”, Andrew Carlson, Justin Betteridge, Estevam Rafael Hruschka Jr. and Tom M. Mitchell, 2009.
     • David Yarowsky
       • “Unsupervised Word Sense Disambiguation Rivaling Supervised Methods”, 1995.

  3. Source Document Types
     • Semi-structured, noisy OCR’d historical documents:
       • (This presentation.)
     • Semi-structured, clean(-ish) HTML web pages:
       • Using multiple ontology constraints (Tom Mitchell, Andrew Carlson paper).
       • Adding the learning and use of cardinality constraints.

  4. OCR Documents

  5. OCR Documents (OCR output: bounding-box coordinates followed by the recognized token, one token per line)
     380,641,672,686 WOMEN'S
     670,641,893,686 HOME
     891,641,1316,685 MISSIONARY
     1314,639,1622,686 SOCIETY
     909,886,1091,931 Officers
     192,969,450,1029 Presidenlt
     1032,972,1077,1011 M
     1086,986,1135,1011 RS
     1142,972,1391,1019 CHARLES
     1388,974,1464,1013 A
     1475,973,1692,1026 JEWELL
     207,1037,309,1077 Vice
     308,1038,597,1077 Presidenl

  6. OCR Documents WOMEN'S HOME MISSIONARY SOCIETY Officers Presidenlt M RS CHARLES A JEWELL Vice Presidenl MRS FRANCIS B COOI EN MRS P W ELILSWVORT MRs HERBERT C ADSWVORTH MRS HENRY E TAINTOR MR DANIEl H WELLS MRS ARTHUR L GOODRICH Recording Secretarv Miss JOSEPHINE WHITE Corresponding S retary Mss JULIA A GRAVES Treasurer Ms H B LANGDON Chairman IWorkComtmittee Miss MARY H ADAMS Chairman of 31emb 'ership Miss ELIZA F Mix Chairman of Purchlasizng Cont 'MRs MIARY C ST )NEC Chairman of Socia I ConnilLt'e MIRS AI I ERT H PITKIN Secretary's Report This Society is auxiliary to the Women's Home Missionary Union of Connecticut Its membership is 120 and its active season extends from November to April Meetings are held semi-monthly on Friday afternoons from 2 until 5 o'clock The time is occupied in sewing hearing letters from the home missionary field transacting business and in social intercourse often ending with tea

  7. Co-trainable Extraction Models
     • Internal Model:
       • Decision list.
       • Maps a word to a label with a confidence score.
       • “James” → ‘Given Name’ 0.9
     • External Model:
       • Decision list.
       • Maps collocation patterns to labels with a confidence score.
       • Left token is ‘Given Name’, right token is ‘Surname’, current token has length=1 → ‘Initial’ 0.95
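The two decision lists on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the presenter's implementation: the `Rule` type, the feature-string encoding (`word=...`, `left=...|right=...|len=...`), the case-insensitive word matching, and the choice to return the single highest-confidence matching rule are all hypothetical. Only the two example rules and their confidences come from the slide.

```python
from typing import Iterable, NamedTuple, Optional, Set, Tuple


class Rule(NamedTuple):
    feature: str       # feature string the rule tests for (hypothetical encoding)
    label: str         # label the rule assigns
    confidence: float  # confidence attached to the rule


def best_label(rules: Iterable[Rule], features: Set[str]) -> Optional[Tuple[str, float]]:
    """Return (label, confidence) of the most confident matching rule, or None."""
    matches = [r for r in rules if r.feature in features]
    if not matches:
        return None
    top = max(matches, key=lambda r: r.confidence)
    return top.label, top.confidence


def internal_features(token: str) -> Set[str]:
    # Internal view: the token's own spelling (lowercased by assumption).
    return {f"word={token.lower()}"}


def external_features(left_label: str, right_label: str, token: str) -> Set[str]:
    # External view: labels of the neighbours plus the token's length.
    return {f"left={left_label}|right={right_label}|len={len(token)}"}


# The two example rules from the slide.
internal_rules = [Rule("word=james", "Given Name", 0.9)]
external_rules = [Rule("left=Given Name|right=Surname|len=1", "Initial", 0.95)]

print(best_label(internal_rules, internal_features("James")))
print(best_label(external_rules, external_features("Given Name", "Surname", "A")))
```

Keeping the two feature extractors separate mirrors the co-training setup: each model sees a different view of the same token, so a confident label from one view can serve as training evidence for the other.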

  8. Bootstrapping Approach
     • Initialize empty models (internal and external).
     • Manually create seed ontology, e.g. list of first names, last names, etc.
     • Process documents, extracting instances and features.
     • Loop:
       • Label words with top-precision labels based on current models.
       • Propose new model elements based on newly labeled tokens.
       • Update model parameters based on label statistics.
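The loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the presenter's code: the function name `bootstrap`, the thresholds `min_conf` and `min_count`, the fixed number of rounds, the use of (left label, right label) pairs as the only external pattern, and the toy input lines (cleaned-up, hypothetical versions of the OCR names) are all assumptions.

```python
from collections import Counter, defaultdict


def bootstrap(lines, seeds, rounds=3, min_conf=0.8, min_count=2):
    # Internal model: lowercased word -> label, initialised from the seed ontology.
    internal = {w.lower(): lab for lab, words in seeds.items() for w in words}
    # External model: (left label, right label) -> label, initially empty.
    external = {}
    docs = [line.split() for line in lines]

    for _ in range(rounds):
        # Step 1: label tokens with the current models (internal first,
        # then external for tokens the internal model does not know).
        labeled = []
        for toks in docs:
            labs = [internal.get(t.lower()) for t in toks]
            padded = ["<S>"] + labs + ["</S>"]
            for i, lab in enumerate(labs):
                if lab is None:
                    labs[i] = external.get((padded[i], padded[i + 2]))
            labeled.append(labs)

        # Step 2: propose new model elements from the labeled tokens.
        ctx_counts = defaultdict(Counter)
        for toks, labs in zip(docs, labeled):
            padded = ["<S>"] + labs + ["</S>"]
            for i, lab in enumerate(labs):
                if lab is None:
                    continue
                internal.setdefault(toks[i].lower(), lab)  # new word rule
                left, right = padded[i], padded[i + 2]
                if left is not None and right is not None:
                    ctx_counts[(left, right)][lab] += 1    # candidate context rule

        # Step 3: update the external model from label statistics,
        # keeping only frequent, high-precision contexts.
        for ctx, counts in ctx_counts.items():
            lab, n = counts.most_common(1)[0]
            if n >= min_count and n / sum(counts.values()) >= min_conf:
                external[ctx] = lab

    return internal, external


# Toy, hand-cleaned lines and seeds (hypothetical data for illustration).
toy_lines = ["MRS HENRY E TAINTOR", "MR DANIEL H WELLS",
             "MRS ARTHUR L GOODRICH", "MISS JULIA A GRAVES"]
toy_seeds = {
    "Prefix": ["Mrs", "Mr", "Miss"],
    "Initial": ["E", "H", "L", "A"],
    "Given Name": ["Henry", "Daniel"],
    "Surname": ["Taintor", "Wells", "Goodrich"],
}
internal_model, external_model = bootstrap(toy_lines, toy_seeds)
print(internal_model.get("arthur"), internal_model.get("graves"))
```

In this toy run the first round learns context rules from fully labeled lines, and the second round uses them to label the unseen words, which are then added to the internal model.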

  9. OCR Documents
     M RS CHARLES A JEWELL
     MRS FRANCIS B COOI EN
     MRS P W ELILSWVORT
     MRs HERBERT C ADSWVORTH
     MRS HENRY E TAINTOR
     MR DANIEl H WELLS
     MRS ARTHUR L GOODRICH
     Miss JOSEPHINE WHITE
     Mss JULIA A GRAVES
     Ms H B LANGDON
     Miss MARY H ADAMS
     Miss ELIZA F Mix
     'MRs MIARY C ST )NEC
     MIRS AI I ERT H PITKIN
     • Seed models:
       • Prefix: “Mrs”, “Miss”, “Mr”
       • Initials: “A”, “B”, “C”, …
       • Given Name: “Charles”, “Francis”, “Herbert”
       • Surname: “Goodrich”, “Wells”, “White”
       • Stopword: “Jewell”, “Graves”
     • Updates:
       • Prefix: first token in line
       • Given Name: between ‘Prefix’ and ‘Initial’
       • Surname: between ‘Initial’ and </S>
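To make the seed models and the three update patterns concrete, here is a small sketch that applies them to two lines from this slide. It is an assumption-laden illustration: the names `SEEDS`, `seed_label` and `label_line` are hypothetical, matching is case-insensitive by assumption, any single alphabetic character is treated as an ‘Initial’ (standing in for the “A”, “B”, “C”, … list), and the update patterns are hard-coded here rather than learned.

```python
SEEDS = {
    "Prefix": {"mrs", "miss", "mr"},
    "Given Name": {"charles", "francis", "herbert"},
    "Surname": {"goodrich", "wells", "white"},
    "Stopword": {"jewell", "graves"},
}


def seed_label(token):
    """Label a token from the seed models alone, or return None."""
    t = token.lower()
    for label, words in SEEDS.items():
        if t in words:
            return label
    if len(token) == 1 and token.isalpha():
        return "Initial"  # assumption: every single letter is a seed initial
    return None


def label_line(line):
    """Apply seed labels, then the slide's three positional update patterns."""
    toks = line.split()
    labs = [seed_label(t) for t in toks]
    # Update: an unlabeled first token in a line is a Prefix.
    if labs and labs[0] is None:
        labs[0] = "Prefix"
    for i in range(1, len(toks)):
        if labs[i] is not None:
            continue
        prev = labs[i - 1]
        nxt = labs[i + 1] if i + 1 < len(toks) else "</S>"
        if prev == "Prefix" and nxt == "Initial":
            labs[i] = "Given Name"   # between 'Prefix' and 'Initial'
        elif prev == "Initial" and nxt == "</S>":
            labs[i] = "Surname"      # between 'Initial' and end of line
    return list(zip(toks, labs))


print(label_line("MRS HENRY E TAINTOR"))
print(label_line("Mss JULIA A GRAVES"))
```

The second line shows both effects at once: the OCR-damaged “Mss” is recovered as a Prefix by position, while “GRAVES” keeps its seed label because seed labels take priority over the positional patterns in this sketch.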

  10. Evaluation
      • Measure and compare (trade-off):
        • Precision
        • Recall
        • Human time
      • Compare bootstrapping to baselines:
        • Simple dictionary matching
        • Dictionary + hand-coded patterns (regular expressions matching labels)
      • Possibly combining evidence from multiple matching lines in the decision list (e.g. noisy-OR, naïve Bayes).
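As a rough illustration of the measures named on this slide, the sketch below computes precision and recall over sets of (token position, label) pairs and combines the confidences of several matching decision-list rules with noisy-OR. The function names, the pair-based representation, and the example data are hypothetical, and noisy-OR here assumes the matching rules are independent.

```python
import math


def precision_recall(predicted, gold):
    """predicted, gold: sets of (token_position, label) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def noisy_or(confidences):
    """Combine rule confidences as 1 - prod(1 - c_i), assuming independence."""
    return 1.0 - math.prod(1.0 - c for c in confidences)


# Hypothetical example: three predicted labels scored against four gold labels.
predicted = {(0, "Prefix"), (1, "Given Name"), (3, "Surname")}
gold = {(0, "Prefix"), (1, "Given Name"), (2, "Initial"), (3, "Stopword")}
p, r = precision_recall(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f}")

# Two rules that both vote for the same label, with confidences 0.9 and 0.95.
print(f"noisy-OR={noisy_or([0.9, 0.95]):.3f}")
```

Noisy-OR can only raise the combined confidence as more rules agree, which is why it suits the case where several decision-list lines match the same token.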

  11. Questions
