
Presentation Transcript


  1. Learning Ensembles of First-Order Clauses for Recall-Precision Curves: A Case Study in Biomedical Information Extraction. Mark Goadrich, Louis Oliphant and Jude Shavlik Department of Computer Sciences University of Wisconsin – Madison USA 19 Sept 2004

  2. Talk Outline • Inductive Logic Programming • Biomedical Information Extraction • Our Gleaner Approach • Aleph Ensembles • Evaluation and Results • Future Work

  3. Inductive Logic Programming • Machine Learning • Classify data into categories • Divide data into train and test sets • Generate hypotheses on train set and then measure performance on test set • In ILP, data are Objects … • person, block, molecule, word, phrase, … • and Relations between them • grandfather, has_bond, is_member, …

  4. Learning daughter(A,B) • Positive • daughter(mary, ann) • daughter(eve, tom) • Negative • daughter(tom, ann) • daughter(eve, ann) • daughter(ian, tom) • daughter(ian, ann) • … • Background Knowledge • mother(ann, mary) • mother(ann, tom) • father(tom, eve) • father(tom, ian) • female(ann) • female(mary) • female(eve) • male(tom) • male(ian) [Family-tree figure: Ann is the mother of Mary and Tom; Tom is the father of Eve and Ian] • Possible Rules • daughter(A,B) :- true. • daughter(A,B) :- female(A). • daughter(A,B) :- female(A), male(B). • daughter(A,B) :- female(A), father(B,A). • daughter(A,B) :- female(A), mother(B,A). • …
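
To make the generate-and-test idea concrete, here is a minimal Python sketch (the facts and candidate rules come from the slide; encoding clauses as Python functions is purely illustrative): each candidate clause is scored by how many positive and how many negative examples it covers.

    # Background knowledge from the slide, encoded as plain Python sets.
    mother = {("ann", "mary"), ("ann", "tom")}
    father = {("tom", "eve"), ("tom", "ian")}
    female = {"ann", "mary", "eve"}
    male = {"tom", "ian"}

    positives = {("mary", "ann"), ("eve", "tom")}   # pairs where daughter(A,B) holds
    negatives = {("tom", "ann"), ("eve", "ann"), ("ian", "tom"), ("ian", "ann")}

    # A few of the slide's candidate clauses, each as a test over (A, B).
    candidates = {
        "daughter(A,B) :- true.":                   lambda a, b: True,
        "daughter(A,B) :- female(A).":              lambda a, b: a in female,
        "daughter(A,B) :- female(A), male(B).":     lambda a, b: a in female and b in male,
        "daughter(A,B) :- female(A), father(B,A).": lambda a, b: a in female and (b, a) in father,
        "daughter(A,B) :- female(A), mother(B,A).": lambda a, b: a in female and (b, a) in mother,
    }

    for clause, covers in candidates.items():
        tp = sum(covers(a, b) for a, b in positives)   # positives covered
        fp = sum(covers(a, b) for a, b in negatives)   # negatives covered
        print(f"{clause:45}  covers {tp} pos, {fp} neg")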

  5. ILP Domains • Object Learning • Trains, Carcinogenesis • Link Learning • Binary predicates

  6. Biomedical Information Extraction *image courtesy of National Human Genome Research Institute

  7. Yeast Protein Database

  8. Biomedical Information Extraction • Given: Medical Journal abstracts tagged with protein localization relations • Do: Construct system to extract protein localization phrases from unseen text NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism.

  9. Biomedical Information Extraction [Parse-tree figure: the sentence “NPL3 encodes a nuclear protein with …” segmented into noun, verb, article, adjective and preposition constituents grouped into noun, verb and prepositional phrases, with “NPL3” tagged alphanumeric and the location phrase marked up]

  10. Sample Extraction Structure • Find structures using ILP [Figure: an example clause structure over sentence S linking a protein phrase P and a location phrase L, e.g. P contains an alphanumeric word and a noun, L contains a marked-up location and a noun, and no verb occurs between the two phrases]

  11. Protein Localization Extraction • Hand-labeled dataset (Ray & Craven ’01) • 7,245 sentences from 871 abstracts • Examples are phrase-phrase combinations • 1,810 positive & 279,154 negative • 1.6 GB of background knowledge • Structural, Statistical, Lexical and Ontological • In total, 200+ distinct background predicates

  12. Our Generate-and-Test Approach [Figure: the parsed sentence “NPL3 encodes a nuclear protein with …” with its noun phrases highlighted, and the candidate extractions generated from it: one rel(Prot, Loc) candidate for each protein-phrase / location-phrase pair]

  13. Some Ranking Predicates • High-scoring words in protein phrases • repressor, ypt1p, nucleoporin • High-scoring words in location phrases • cytoskeleton, inner, predominately • High-scoring BETWEEN prot & loc • cofraction, mainly, primarily, …, locate • Stemming seemed to hurt here … • Warning: must do PER fold

  14. Some Biomedical Predicates • On-Line Medical Dictionary • natural source for semantic classes • eg, word occurs in category ‘cell biology’ • Medical Subject Headings (MeSH) • canonized method for indexing biomedical articles • ISA hierarchy of words and subcategories • Gene Ontology (GO) • another ISA hierarchy of biological knowledge

  15. Some More Predicates • Look-ahead Phrase Predicates • few_POS_in_phrase(Phrase, POS) • phrase_contains_specific_word_triple(Phrase, W1, W2, W3) • phrase_contains_some_marked_up_arg(Phrase, Arg#, Word, Fold) • Relative Location of Phrases • protein_before_location(ExampleID) • word_pair_in_between_target_phrases(ExampleID, W1, W2)

  16. Link Learning • Large skew toward negatives • 500 relational objects • 5000 positive links means 245,000 negative links • Difficult to measure success • Always negative classifier is 98% accurate • ROC curves look overly optimistic • Enormous quantity of data • 4,285,199,774 web pages indexed by Google • PubMed includes over 15 million citations

  17. Our Approach • Develop fast ensemble algorithms focused on recall and precision evaluation • Key Ideas of Gleaner • Keep wide range of clauses • Create separate theories for different recall ranges • Evaluation • Area Under Recall-Precision Curve (AURPC) • Time = Number of clauses considered

  18. Gleaner - Background • Prediction vs Actual • Positive or Negative • True or False • Focus on positive examples • Recall = TP / (TP + FN) • Precision = TP / (TP + FP) [Confusion matrix: predicted vs actual labels, with cells TP, FP, FN, TN]
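
A small Python sketch of the two measures on this slide, computed from confusion-matrix counts (the counts in the example call are made up):

    def recall_precision(tp, fp, fn):
        # Recall = TP / (TP + FN); Precision = TP / (TP + FP).
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        return recall, precision

    # 150 positives found, 50 false alarms, 350 positives missed.
    print(recall_precision(tp=150, fp=50, fn=350))   # (0.3, 0.75)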

  19. Gleaner - Background • Seed Example • A positive example that our clause must cover • Bottom Clause • All predicates which are true about seed example • Rapid Random Restart (Zelezny et al ILP 2002) • Stochastic selection of starting clause • Time-limited local heuristic search • We store variety of clauses (based on recall)

  20. Gleaner - Learning • Create B Bins • Generate Clauses • Record Best • Repeat for K seeds [Figure: clauses plotted in recall-precision space, with the best clause kept in each recall bin]
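
A minimal Python sketch of the bookkeeping this slide describes (the recall/precision values fed in at the end are random stand-ins for real clause statistics): every clause seen during search falls into one of B recall bins, and each bin keeps only its best clause so far, scored here by precision × recall as on slide 40.

    import random

    NUM_BINS = 20                      # B equal-width recall bins
    best_in_bin = [None] * NUM_BINS    # each bin keeps its single best clause

    def consider(clause, recall, precision):
        # Place a clause encountered during search into its recall bin,
        # keeping it only if it beats the bin's current best clause.
        b = min(int(recall * NUM_BINS), NUM_BINS - 1)
        score = precision * recall
        if best_in_bin[b] is None or score > best_in_bin[b][0]:
            best_in_bin[b] = (score, clause)

    # In Gleaner, every clause evaluated during Rapid Random Restart search
    # from each of the K seeds would pass through consider(...).
    for i in range(1000):
        consider("clause_%d" % i, random.random(), random.random())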

  21. Gleaner - Combining • Combine K clauses per bin • If at least L of K clauses match, call example positive • How to choose L? • L=1 then high recall, low precision • L=K then low recall, high precision • Our method • Choose L such that ensemble recall matches bin b • Bin b’s precision should be higher than any clause in it • We should now have a set of high-precision rule sets spanning the space of recall levels
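
One way to read the combination rule, as a Python sketch (the argument names and data layout are assumptions): given, for each tuning-set example, how many of the bin's K clauses match it, sweep L from 1 to K and keep the value whose "at least L of K" recall is closest to the bin's target recall, reporting that ensemble's precision alongside it.

    def choose_threshold(pos_counts, neg_counts, K, target_recall):
        # pos_counts / neg_counts: for each tuning-set positive / negative
        # example, the number of the bin's K clauses that match it.
        best = None
        for L in range(1, K + 1):
            tp = sum(c >= L for c in pos_counts)
            fp = sum(c >= L for c in neg_counts)
            recall = tp / len(pos_counts)
            precision = tp / (tp + fp) if tp + fp else 0.0
            gap = abs(recall - target_recall)
            if best is None or gap < best[0]:
                best = (gap, L, recall, precision)
        _, L, recall, precision = best
        return L, recall, precision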

  22. How to use Gleaner • Generate Curve • User Selects Recall Bin • Return Classifications With Precision Confidence [Figure: recall-precision curve with the chosen operating point, e.g. Recall = 0.50, Precision = 0.70]

  23. Aleph - Learning • Aleph learns theories of clauses (Srinivasan, v4, 2003) • Pick positive seed example, find bottom clause • Use heuristic search to find best clause • Pick new seed from uncovered positives and repeat until threshold of positives covered • Theory produces one recall-precision point • Learning complete theories is time-consuming • Can produce ranking with ensembles

  24. Aleph Ensembles • We compare to ensembles of theories • Algorithm (Dutra et al ILP 2002) • Use K different initial seeds • Learn K theories containing C clauses • Rank examples by the number of theories • Need to balance C for high performance • Small C leads to low recall • Large C leads to converging theories
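
A Python sketch of the ranking step (representing each theory as a list of clause-matching functions is an assumption for illustration): an example's score is simply how many of the K theories classify it as positive, i.e. how many theories contain at least one matching clause.

    def rank_by_theory_votes(examples, theories):
        # theories: K theories, each a list of clause predicates over examples.
        def votes(example):
            return sum(any(clause(example) for clause in theory)
                       for theory in theories)
        return sorted(examples, key=votes, reverse=True)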

  25. Aleph Ensembles (100 theories)

  26. Evaluation Metrics • Area Under Recall-Precision Curve (AURPC) • All curves standardized to cover full recall range • Averaged AURPC over 5 folds • Number of clauses considered • Rough estimate of time • Both are “stop anytime” parallel algorithms [Figure: the unit recall-precision square, both axes from 0 to 1.0]
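
As a rough Python sketch, once a curve has been interpolated finely enough (see the next slide) and extended to cover the full recall range, its area reduces to the trapezoid rule over the recall-precision points; the example curve here is made up.

    def aurpc(points):
        # points: (recall, precision) pairs, assumed densely interpolated
        # and spanning recall 0 to 1.
        pts = sorted(points)
        return sum((r1 - r0) * (p0 + p1) / 2.0
                   for (r0, p0), (r1, p1) in zip(pts, pts[1:]))

    print(aurpc([(0.0, 1.0), (0.5, 0.6), (1.0, 0.2)]))   # 0.6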

  27. AURPC Interpolation • Convex interpolation in RP space? • Precision interpolation is counterintuitive • Example: 1000 positive & 9000 negative [Figure: panels showing example counts, ROC curves and RP curves; a classifier with TP = 750 and FP = 4750 sits at (0.53, 0.75) in ROC space but at recall 0.75 and precision only 0.14 in RP space]
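
To interpolate the way the underlying counts behave, rather than drawing a straight line between two points in RP space, one can step TP between the endpoints and let FP grow linearly with it, recomputing precision at each step. Below is a Python sketch with hypothetical endpoint counts (1000 total positives); the precision it produces sags well below the straight line between the endpoint precisions.

    def interpolate_rp(tp_a, fp_a, tp_b, fp_b, total_pos):
        # Interpolate between two RP points via their TP/FP counts,
        # assuming FP grows linearly as TP goes from tp_a to tp_b.
        slope = (fp_b - fp_a) / (tp_b - tp_a)
        points = []
        for tp in range(tp_a, tp_b + 1):
            fp = fp_a + slope * (tp - tp_a)
            points.append((tp / total_pos, tp / (tp + fp)))
        return points

    # Hypothetical endpoints: (TP=750, FP=250) and (TP=1000, FP=9000).
    curve = interpolate_rp(750, 250, 1000, 9000, total_pos=1000)
    print(curve[0], curve[-1])   # (0.75, 0.75) ... (1.0, 0.1)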

  28. AURPC Interpolation

  29. Experimental Methodology • Performed five-fold cross-validation • Variation of parameters • Gleaner (20 recall bins) • # seeds = {25, 50, 75, 100} • # clauses = {1K, 10K, 25K, 50K, 100K, 250K, 500K} • Ensembles (0.75 minacc, 35,000 nodes) • # theories = {10, 25, 50, 75, 100} • # clauses per theory = {1, 5, 10, 15, 20, 25, 50}

  30. Results: Testfold 5 at 1,000,000 clauses [Figure: recall-precision curves for Gleaner and the Aleph ensembles]

  31. Results: Gleaner vs Aleph Ensembles

  32. Further Results

  33. Conclusions • Gleaner • Focuses on recall and precision • Keeps wide spectrum of clauses • Good results in few cpu cycles • Aleph ensembles • ‘Early stopping’ helpful • Require more cpu cycles • AURPC • Useful metric for comparison • Interpolation unintuitive

  34. Future Work • Improve Gleaner performance over time • Explore alternate clause combinations • Better understanding of AURPC • Search for clauses that optimize AURPC • Examine more ILP link-learning datasets • Use Gleaner with other ML algorithms

  35. Acknowledgements • USA NLM Grant 5T15LM007359-02 • USA NLM Grant 1R01LM07050-01 • USA DARPA Grant F30602-01-2-0571 • USA Air Force Grant F30602-01-2-0571 • Condor Group • David Page • Vitor Santos Costa, Ines Dutra • Soumya Ray, Marios Skounakis, Mark Craven Dataset available at (URL in proceedings) ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets/IE-protein-location

  36. Deleted Scenes • Clause Weighting • Gleaner Algorithm • Director Commentary: on / off

  37. Take-Home Message • Definition of Gleaner • One who gathers grain left behind by reapers • Gleaner and ILP • Many clauses constructed and evaluated in ILP hypothesis search • We need to make better use of those that aren’t the highest scoring ones • Thanks, Questions?

  38. Clause Weighting • Single Theory Ensemble • rank by how many clauses cover examples • Weight clauses using tuneset statistics • CN2 (average precision of matching clauses) • Lowest False Positive Rate Score • Cumulative F1 score • Recall • Precision • Diversity
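
A Python sketch of the CN2-style weighting mentioned above (representing clauses as pairs of a match function and a tuning-set precision is an assumption for illustration): an example is scored by the average tuning-set precision of the clauses that cover it.

    def cn2_score(example, clauses):
        # clauses: list of (matches_fn, tuneset_precision) pairs.
        precisions = [p for matches, p in clauses if matches(example)]
        return sum(precisions) / len(precisions) if precisions else 0.0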

  39. Clause Weighting

  40. Gleaner Algorithm • Create B equal-sized recall bins • For K different seeds • Generate rules using Rapid Random Restart • Record best rule (precision × recall) found for each bin • For each recall bin B • Find threshold L of K clauses such that recall of “at least L of K clauses match examples” = recall for this bin • Find recall and precision on testset using each bin’s “at least L of K” decision process
