Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text Mining Group Lead
More projects than people • Ongoing: • Coreference resolution • Software engineering perspectives on natural language processing • Odd problems of full text • Tuberculosis and translational medicine • Discourse analysis annotation • In need of fresh blood: • Metagenomics/Microbiome studies • Translational medicine from the clinical side • Summarization • Negation • Question-answering: Why? • Nominalizations • Metamorphic testing for natural language processing
Metagenomics/microbiome studies • Experiments not interpretable/comparable without large amounts of metadata • Metadata in various places • (fielded) • GenBank isolation_source field • GOLD description fields • Journal articles (full text)
Metagenomics/microbiome studies • Various standards: • MIMS • MIMARKS (Nat Biotech, forthcoming) • Ontology terms • Continuous variables • ??
Metagenomics/microbiome studies • “Metagenomic sequence data that lack an environmental context have no value.” • Crucial to replication, analysis • Do microbial gene richness and evenness patterns (at some specific sampling density) correlate with other environmental characteristics? • Which microbial phylotypes or functional guilds co-occur with high statistical probability in different environments? • Do specific phylotypes track particular geographic or physico-chemical clines (latitudes, isotherms, isopycnals, etc.)? • Do specific microbial community ORFs (functionally identified or not) track specific bioenergetic gradients (solar, geothermal, digestive tracts, etc.)? • What is the percentage of genes with a given role, as a function of some physical feature, e.g. the average temperature of the sample sites? • Do microbial community protein families, amino acid content, or sequence motifs vary systemically as a function of habitat of origin? Are specific protein sequence motifs characteristic of specific habitats? • What is the “resistome” in soil? (Phenotype) • Habitat change over time, host-to-host variation, within-host variation—biodefense and forensics applications
Metagenomics/microbiome studies • Investigation type: eukaryote, bacteria, virus, plasmid, organelle, metagenome • Experimental factor: Experimental Factor Ontology, Ontology for Biomedical Investigations • Latitude, longitude, depth, elevation, humidity, CO2, CO, salinity, temperature, … • Geographic location (country, region, sea) from Gaz Ontology • Collection date/time • Environment, biome and features, material: Environment Ontology • Trophic level; aerobe/anaerobe • Sample collection device or method • Sample material processing: Ontology for Biomedical Investigations • Amount or size of sample • Targeted gene or locus name • PCR primer, conditions • Sequencing method • Chemicals administered: ChEBI • Diseases: Disease Ontology • Body site • Phenotype: PATO
Metagenomics/microbiome studies • Where do you find this stuff? • Text fields in databases • Isolation_source in GenBank • Description in GOLD • TBD in microbiome studies, but hopefully coming • Full text of journal articles • Marine secondary products corpus coming (pharmacogenomics connection) • Problem of tables • Multiple sentences, coreference Timeline: July 2011
Translational medicine from the clinical side • Factors affecting inclusion/exclusion from clinical trials • Sharpening phenotypes (7% of patients in Schwarz’s PIF study) • ICD9-CM prioritization • Gazillions of named entity recognition problems (drugs, assays, signs, symptoms, vital signs, …)
Translational medicine from the clinical side • History: foundational • Practice: difficult—access issues • Technical problems related to data availability (e.g. will you have enough for machine learning?) • TREC EMR track: yes • i2b2 obesity data: probably • Time for a renaissance • Strategy: break in via TREC; deadline: summer/fall
Summarization • Task: Given one or more documents, produce a shorter version that preserves information • Difficulties (multi-document): Duplication, aggregation, presentation • Holy grail: abstraction
Extraction vs. abstraction “An abstract is a summary at least some of whose material is not present in the input.” (Mani) • Extraction: • “Extract” strings from the input text • Assemble them to produce the summary • Abstraction: • Find meaning • Produce text that communicates the meaning “An extract is a summary consisting entirely of material copied from the input.” (Mani)
Extract, or abstract? Abstract
Relationship between summarization and generation • Natural language generation: producing textual output • Coherence (good) • Redundancy (bad) • Unresolvable anaphora (bad) • Gaps in reasoning (bad) • Lack of organization (bad)
Summarization and generation • Input record: GENE: BRCA1; SPECIES: Hs.; DISEASE_ASSOC.: Breast cancer • Generated text, name repeated: BRCA1 is found in humans. BRCA1 plays a role in breast cancer. • Generated text, with pronoun: BRCA1 is found in humans. It plays a role in breast cancer.
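The step from the BRCA1 record to the two text versions can be sketched as template-based generation with an optional pronoun. This is a minimal illustration, not the group's system: the function name `realize`, the sentence templates, and the species abbreviation table are all assumptions made for the example.

```python
# Hypothetical abbreviation table; only the entry needed for the slide's record.
SPECIES_NAMES = {"Hs.": "humans"}


def realize(record, use_pronoun=True):
    """Turn a flat gene record into two sentences.

    Field names mirror the slide. The templates are illustrative assumptions.
    """
    gene = record["GENE"]
    species = SPECIES_NAMES.get(record["SPECIES"], record["SPECIES"])
    disease = record["DISEASE_ASSOC."].lower()
    first = f"{gene} is found in {species}."
    # Pronominalize the second mention to avoid repeating the gene name.
    subject = "It" if use_pronoun else gene
    second = f"{subject} plays a role in {disease}."
    return f"{first} {second}"


record = {"GENE": "BRCA1", "SPECIES": "Hs.", "DISEASE_ASSOC.": "Breast cancer"}
print(realize(record, use_pronoun=False))
print(realize(record, use_pronoun=True))
```

Even this toy version shows why generation matters for summarization: the pronoun is only safe because the antecedent is in the immediately preceding sentence, which is exactly the kind of coherence constraint an extractive system cannot enforce.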
A multi-document summary Caenorhabditis elegans p53: role in apoptosis, meiosis, and stress resistance. Bcl-2 and p53: role in dopamine-induced apoptosis and differentiation. P53 role in DNA repair and tumorigenesis.
Another multi-document summary P53: role in apoptosis, meiosis, and stress resistance, dopamine-induced apoptosis and differentiation, DNA repair and tumorigenesis.
Another multi-document summary P53 has a role in apoptosis, meiosis, and stress resistance. It also has a role in dopamine-induced apoptosis and differentiation, DNA repair, and tumorigenesis.
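The aggregation step behind the last two summaries can be sketched as merging the role lists found in each source document, dropping duplicates, and realizing one sentence. This is a rough sketch under stated assumptions: the role lists are assumed to have been extracted already, `aggregate_roles` is a hypothetical name, and the sketch flattens coordination scope (it treats "dopamine-induced apoptosis" and "differentiation" as separate roles).

```python
def aggregate_roles(entity, role_lists):
    """Merge per-document role lists into one sentence, removing duplicates."""
    seen = set()
    merged = []
    for roles in role_lists:
        for role in roles:
            key = role.lower()
            if key not in seen:  # duplication is the first multi-document difficulty
                seen.add(key)
                merged.append(role)
    if not merged:
        return f"No roles found for {entity}."
    if len(merged) == 1:
        joined = merged[0]
    elif len(merged) == 2:
        joined = " and ".join(merged)
    else:
        joined = ", ".join(merged[:-1]) + ", and " + merged[-1]
    return f"{entity} has a role in {joined}."


per_document_roles = [
    ["apoptosis", "meiosis", "stress resistance"],
    ["dopamine-induced apoptosis", "differentiation"],
    ["DNA repair", "tumorigenesis", "apoptosis"],
]
print(aggregate_roles("P53", per_document_roles))
```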
Summarization and generation • Examples of non-coherent summaries that wouldn’t be bad… • A table • A table of contents? • An index? • A diagram? Timeline: no pressure
Negation • Classic problem • Reasonably well-studied in clinical domain (NegEx), but heavily restricted by semantic class • Biological domain: 0.20-0.43 F-measure • Pattern-learning for OpenDMAP, machine learning, semantic role labelling…
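For readers unfamiliar with the trigger-based approach, here is a simplified sketch in the spirit of NegEx: a negation trigger opens a scope of a few tokens, and any concept inside that scope is marked as negated. The trigger list, terminator list, and window size below are illustrative assumptions, not the published NegEx resources, and the function names are hypothetical.

```python
import re

# Illustrative trigger and terminator lists; NOT the published NegEx lists.
TRIGGERS = ["no", "not", "without", "denies", "absence of"]
TERMINATORS = {"but", "however", "although"}
WINDOW = 5  # assumed scope: at most this many tokens after a trigger


def find_sub(tokens, sub):
    """Return every index at which the token list `sub` starts in `tokens`."""
    n = len(sub)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == sub]


def negated_concepts(sentence, concepts):
    """Return the subset of `concepts` that falls inside a negation scope."""
    tokens = re.findall(r"\w+", sentence.lower())
    negated = set()
    for trigger in TRIGGERS:
        trig = trigger.split()
        for start in find_sub(tokens, trig):
            scope = []
            begin = start + len(trig)
            for tok in tokens[begin:begin + WINDOW]:
                if tok in TERMINATORS:  # a terminator closes the scope early
                    break
                scope.append(tok)
            for concept in concepts:
                if find_sub(scope, re.findall(r"\w+", concept.lower())):
                    negated.add(concept)
    return negated


print(negated_concepts("The patient denies chest pain but reports fever.",
                       ["chest pain", "fever"]))
```

A fixed window like this is one reason such approaches are restricted by semantic class and transfer poorly to biological text, where negation scope often depends on syntax rather than distance.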
Semantic role labelling • Arg1: experiencer • Arg2: origin • Arg3: distance • Arg4: destination • Figure adapted from Haghighi et al. (2005) Timeline: no pressure
Question-answering: Why? • Why did David Koresh ask for a typewriter? • Why did I have a Clif bar for breakfast? versus Why did I have a Clif bar for breakfast instead of cereal? • Need for data set collection • Need novel methods—pattern-matching doesn’t work well
Question-answering: Why? • Overall performance is poor • 0.00 mean reciprocal rank (MRR) versus 0.69 on birthyear (Ravichandran and Hovy 2002) • 0.33 MRR versus 0.75 on location (Ravichandran and Hovy 2002) • 45% at least partially correct (Higashinaka and Isozaki 2007) • 0.35 MRR (Verberne et al. 2010) • Pattern-based approaches outperformed by ML
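Since most of these figures are mean reciprocal rank scores, a short sketch of the metric may help. For each question, the score is 1 divided by the rank of the first correct answer (0 if none is returned), averaged over all questions. The data structures and the toy questions below are assumptions for illustration only.

```python
def mean_reciprocal_rank(ranked_answers, gold):
    """Average of 1/rank of the first correct answer per question.

    ranked_answers: question id -> list of candidate answers, best first
    gold:           question id -> set of acceptable answers
    """
    if not ranked_answers:
        return 0.0
    total = 0.0
    for qid, candidates in ranked_answers.items():
        for rank, candidate in enumerate(candidates, start=1):
            if candidate in gold.get(qid, set()):
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(ranked_answers)


ranked = {
    "q1": ["a", "b", "c"],        # correct answer at rank 1
    "q2": ["x", "y", "z", "w"],   # correct answer at rank 4
    "q3": ["m", "n"],             # no correct answer returned
}
gold = {"q1": {"a"}, "q2": {"w"}, "q3": {"k"}}
print(mean_reciprocal_rank(ranked, gold))
```

An MRR of 0.35 therefore means that, on average, the first correct answer sits at roughly the third position in the ranked list.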
Question-answering: Why? • …why-questions are one of the most complex types. This is mainly because the answers to why-questions are not named entities (which are in general clearly identifiable), but text passages giving a (possibly implicit) explanation (Maybury 2002 in Verberne 2007) • Answers to why-questions cannot be stated in a single phrase but they are passages of text that contain some form of (possibly implicit) explanation (Verberne et al. 2010)
Question-answering: Why? • How can we improve on machine learning methods? • Don’t try—improve pattern learning, instead • Apply what we’re learning about inference and knowledge representation from Hanalyzer-related work • Improved recognition of semantic classes in text (more on this later)
Nominalization • Nominalization: noun derived from a verb • Verbal nominalization: activation, inhibition, induction • Argument nominalization: activator, inhibitor, inducer, mutant
Nominalizations are dominant in biomedical texts Data from CRAFT corpus
Relevant points for text mining • Nominalizations are an obvious route for scaling up recall • Nominalizations are more difficult to handle than verbs… • …but can yield higher precision (Cohen et al. 2008)
Alternations of nominalizations: positions of arguments • Any combination of the set of positions for each argument of a nominalization • Pre-nominal: phenobarbital induction, trkA expression • Post-nominal: increases of oxygen • No argument present: Induction followed a slower kinetic… • Noun-phrase-external: this enzyme can undergo activation
Result 1: attested alternations are extraordinarily diverse • Inhibition, a 3-argument predicate—Arguments 0 and 1 only shown
Implications for system-building • Distinction between absent and noun-phrase-external arguments is crucial and difficult, and finite state approaches will not suffice; merging data from different clauses and sentences may be useful • Pre-nominal arguments are undergoer by ratio of 2.5:1 • For predicates with agent and patient, post/post and pre/post patterns predominate, but others are common as well
What can be done? • External arguments: • semantic role labelling approach • …but, very important to recognize the absent/external distinction, especially with machine learning • pattern-based approach • …but, approaches to external arguments (RLIMS-P) are so far very predicate-specific
What can be done? • Pre-nominal arguments: • apply heuristic that we have identified based on distributional characteristics • for most frequent nominalizations, manual encoding may be tractable Timeline: no pressure
Metamorphic testing for NLP • Metamorphic testing motivation: situations where input/output space is intractably large and it’s not clear what would constitute right answers • Use domain knowledge to specify broad categories of changes to output that should occur with broad categories of changes to input
Metamorphic testing for NLP • Gene regulatory networks: • Add an unconnected node to network G to obtain G': G should be subsumed by G' • SeqMap: • Given a genome (reference string) p and a set of sequence reads T = {t1, t2, ..., tn}, we form a new genome p' by deleting an arbitrary portion of either the beginning or the end of p. After mapping T to both p and p' independently, all reads in T that are unmappable to p should also be unmappable to p'.
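The SeqMap relation can be sketched as an executable check. The mapper here is a toy stand-in (a read "maps" if it occurs verbatim in the genome) and is an assumption for illustration, not SeqMap itself; `check_truncation_relation` and `buggy_mapper` are hypothetical names. The point is that the check needs no gold-standard alignments, only the relation between the two runs.

```python
import random


def map_reads(genome, reads):
    """Toy mapper (assumption): a read maps if it is a substring of the genome."""
    return {r for r in reads if r in genome}


def truncate(genome, rng):
    """Delete an arbitrary portion of either the beginning or the end."""
    cut = rng.randint(1, len(genome) - 1)
    return genome[cut:] if rng.random() < 0.5 else genome[:-cut]


def check_truncation_relation(mapper, genome, reads, trials=100, seed=0):
    """Reads unmappable to p must stay unmappable to every truncated p'."""
    rng = random.Random(seed)
    unmappable_full = set(reads) - mapper(genome, reads)
    for _ in range(trials):
        shorter = truncate(genome, rng)
        if unmappable_full & mapper(shorter, reads):
            return False  # a read became mappable after deletion: relation violated
    return True


def buggy_mapper(genome, reads):
    """Deliberately broken mapper: claims everything maps to short genomes."""
    return set(reads) if len(genome) < 5 else map_reads(genome, reads)


genome = "ACGTACGGTCA"
reads = ["ACG", "GGT", "TTT", "CAA"]
print(check_truncation_relation(map_reads, genome, reads))
print(check_truncation_relation(buggy_mapper, genome, reads))
```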
Metamorphic testing for NLP • Non-linguistic • Add non-informative feature, see if feature selection screens it out • Subtract informative features, see if performance goes down • Linguistic • ? Timeline: no pressure
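The first non-linguistic relation can be sketched the same way: adding a feature that carries no information should leave the selected feature set unchanged. The selector below (a Pearson correlation filter with an arbitrary threshold) is a stand-in chosen for the example, and the feature names and data are made up; any feature selection method could be plugged in.

```python
def pearson(xs, ys):
    """Pearson correlation; defined as 0.0 when either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def select_features(columns, labels, threshold=0.5):
    """Stand-in selector (assumption): keep features with |r| >= threshold."""
    return {name for name, values in columns.items()
            if abs(pearson(values, labels)) >= threshold}


def noise_is_screened_out(selector, columns, labels, noise_name="noise"):
    """Metamorphic relation: a constant feature must not change the selection."""
    before = selector(columns, labels)
    augmented = dict(columns)
    augmented[noise_name] = [1.0] * len(labels)  # non-informative by construction
    return selector(augmented, labels) == before


columns = {"mentions_gene": [1, 1, 0, 0], "sentence_length": [3, 5, 4, 4]}
labels = [1, 1, 0, 0]
print(select_features(columns, labels))
print(noise_is_screened_out(select_features, columns, labels))
```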
Wide range of projects over the past few years • Named entity recognition: • Shuhei Kinoshita, K. Bretonnel Cohen, Philip V. Ogren, and Lawrence Hunter (2005). BioCreative Task 1A: entity identification with a stochastic tagger. BMC Bioinformatics 6(Suppl. 1):S4. • Information extraction: • William A. Baumgartner, Jr., ...K. Bretonnel Cohen, and Lawrence Hunter (submitted). Leveraging concept recognition to extract protein interaction relations from biomedical text. Genome Biology. • Summarization: • Zhiyong Lu, K. Bretonnel Cohen, and Lawrence Hunter (2006). Finding GeneRIFs via Gene Ontology annotations. Pacific Symposium on Biocomputing 11:52-63. • Word sense disambiguation: • William A. Baumgartner, Jr., ...K. Bretonnel Cohen, and Lawrence Hunter (2007). An integrated approach to concept recognition in biomedical text. Proceedings of BioCreative II. • Question-answering/IR: • J. Gregory Caporaso, William A. Baumgartner Jr., Hyunmin Kim, Zhiyong Lu, Helen L. Johnson, Olga Medvedeva, Anna Lindemann, Lynne Fox, Elizabeth White, K. Bretonnel Cohen, and Lawrence Hunter (2006). Concept recognition, information retrieval, and machine learning in genomics question-answering. Proceedings of the Fifteenth Text Retrieval Conference. • Document classification/IR: • J. Gregory Caporaso, William A. Baumgartner Jr., K. Bretonnel Cohen, Helen L. Johnson, Jesse Paquette, and Lawrence Hunter (2005). Concept recognition and the TREC Genomics tasks. Proceedings of the Fourteenth Text Retrieval Conference, National Institute of Standards and Technology. • Computational lexical semantics: • Philip V. Ogren, K. Bretonnel Cohen, George K. Acquaah-Mensah, Jens Eberlein, and Lawrence Hunter (2004). The compositional structure of Gene Ontology terms. Pacific Symposium on Biocomputing 2004, pp. 214-225. • Philip V. Ogren, K. Bretonnel Cohen, and Lawrence Hunter (2005). Implications of compositionality in the Gene Ontology for its curation and usage. Pacific Symposium on Biocomputing 2005, pp. 174-185. • Helen L. Johnson, K. Bretonnel Cohen, William A. Baumgartner Jr., Zhiyong Lu, Michael Bada, Todd Kester, Hyunmin Kim, and Lawrence Hunter (2006). Evaluation of lexical methods for detecting relationships between concepts from multiple ontologies. Pacific Symposium on Biocomputing 11:28-39. • Corpus linguistics: • K. Bretonnel Cohen, Lynne Fox, Philip Ogren, and Lawrence Hunter (2005). Empirical data on corpus design and usage in biomedical natural language processing. AMIA 2005 symposium proceedings, pp. 156-160. • K. Bretonnel Cohen, Lynne Fox, Philip V. Ogren, and Lawrence Hunter (2005). Corpus design for biomedical natural language processing. Proceedings of the ACL-ISMB workshop on linking biological literature, ontologies and databases, pp. 38-45. Association for Computational Linguistics.
Other recent projects • Characterizing biomedical language • Open Access versus traditional journals • Full text versus abstracts • Nominalization and alternations • Biological event extraction • Ontology quality assurance • Evaluation from many angles—shared task organization and participation; many angles on testing • SciKnowMine and BASILISK evaluation (with Ellen Riloff) • GO term recognition (with Michael and Karin) • Grant-writing
Coreference defined • Sophia Loren says she will always be grateful to Bono. The actress revealed that the U2 singer helped her calm down when she became scared by a thunderstorm while travelling on a plane.
Coreference defined • Sophia Loren says she will always be grateful to Bono. The actress revealed that the U2 singer helped her calm down when she became scared by a thunderstorm while travelling on a plane. • Coreference chain: Sophia Loren, she, The actress, her, she
Coreference defined • Sophia Loren says she will always be grateful to Bono. The actress revealed that the U2 singer helped her calm down when she became scared by a thunderstorm while travelling on a plane. • Coreference chain: Bono, the U2 singer
How do humans do this? • Linguistic factors: • Kevin saw Larry. He liked him. • Knowledge about the world: • Sophia Loren will always be grateful to Bono. The actress… • Sophia Loren will always be grateful to Bono. The singer… • Sophia Loren will always be grateful to Bono. The storm… • A combination of world knowledge and linguistic factors: • Sophia Loren says she will always be grateful to Bono… • Sophia Loren says he will always be grateful to Bono…
Computers are bad at this • Linguistic features don’t always help. • Each child ate a biscuit. They were delicious. • Each child ate a biscuit. They were delighted. • Programming enough knowledge about the world into a computer has proven to be very difficult.
Our approach • Matching semantic categories helps • BRCA1, the gene • Cell proliferation, leukocyte proliferation • Minimal work on using ontologies • WordNet (General English, mostly) • Replacing ontology with web search • We’re going to use ontologies, and more than anyone • First step: broad semantic class assignment
Our approach • Broad semantic class assignment • Coreference resolution benefits from knowing whether semantic classes match • Semantic class ≈ what ontology you should belong to • Looking at headwords, frequent words, informativeness measures Timeline (coref, not semantic class assignment): this spring
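A minimal sketch of how broad semantic class assignment could feed a coreference decision: look up the headword of each noun phrase in a lexicon of broad classes and allow a link only when the classes agree. The lexicon, the class labels, and the "last token is the head" heuristic are all assumptions for illustration; the slide's other signals (frequent words, informativeness measures) are not modeled here.

```python
# Hypothetical lexicon mapping headwords and names to broad semantic classes.
CLASS_LEXICON = {
    "brca1": "gene or gene product",
    "gene": "gene or gene product",
    "protein": "gene or gene product",
    "proliferation": "biological process",
    "apoptosis": "biological process",
    "leukocyte": "cell type",
}


def headword(noun_phrase):
    """Assume the head of a simple English noun phrase is its last token."""
    tokens = noun_phrase.lower().split()
    return tokens[-1] if tokens else ""


def semantic_class(noun_phrase):
    """Return the broad class of the phrase, or None if the head is unknown."""
    return CLASS_LEXICON.get(headword(noun_phrase))


def classes_match(mention_a, mention_b):
    """Two mentions are coreference candidates only if their classes agree."""
    class_a = semantic_class(mention_a)
    class_b = semantic_class(mention_b)
    return class_a is not None and class_a == class_b


print(classes_match("BRCA1", "the gene"))
print(classes_match("cell proliferation", "leukocyte proliferation"))
print(classes_match("the gene", "cell proliferation"))
```

Matching classes is a filter, not a decision: "cell proliferation" and "leukocyte proliferation" share a class, but whether they corefer still depends on the ontology relation between them and on context.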
Software engineering perspectives on natural language processing
Two paradigms of evaluation • Traditional approach: use a corpus • Expensive • Time-consuming to produce • Redundancy for some things… • …underrepresentation of others (Oepen et al. 1998) • Slow run-time (Cohen et al. 2008) • Non-traditional approach: structured test suite • Controls redundancy • Ensures representation of all phenomena • Easy to evaluate results and do error analysis • Used successfully in grammar engineering
Structured test suite • GO ID: canonical / non-canonical • GO:0000133: Polarisome / Polarisomes • GO:0000108: Repairosome / Repairosomes • GO:0000786: Nucleosome / Nucleosomes • GO:0001660: Fever / Fevers • GO:0001726: Ruffle / Ruffles • GO:0005623: Cell / Cells • GO:0005694: Chromosome / Chromosomes • GO:0005814: Centriole / Centrioles • GO:0005874: Microtubule / Microtubules
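A suite like this can be generated and scored mechanically. The sketch below builds one canonical and one non-canonical (plural) test case per GO term and reports results per category, which is what makes error analysis easy. The naive pluralizer and the exact-match recognizer are stand-ins assumed for the example; a real concept recognizer would be plugged in as `recognizer`.

```python
# Canonical term names as listed on the slide.
GO_TERMS = {
    "GO:0000133": "Polarisome",
    "GO:0000108": "Repairosome",
    "GO:0000786": "Nucleosome",
    "GO:0001660": "Fever",
    "GO:0001726": "Ruffle",
    "GO:0005623": "Cell",
    "GO:0005694": "Chromosome",
    "GO:0005814": "Centriole",
    "GO:0005874": "Microtubule",
}


def pluralize(term):
    """Naive pluralization; sufficient for the regular nouns in this suite."""
    return term + "s"


def build_suite(terms):
    """One canonical and one non-canonical test case per term."""
    suite = []
    for go_id, name in terms.items():
        suite.append((go_id, name, "canonical"))
        suite.append((go_id, pluralize(name), "non-canonical"))
    return suite


def exact_match_recognizer(text, terms):
    """Stand-in system under test (assumption): case-insensitive exact match."""
    lookup = {name.lower(): go_id for go_id, name in terms.items()}
    return lookup.get(text.lower())


def run_suite(recognizer, terms):
    """Return (passed, total) per category."""
    counts = {"canonical": [0, 0], "non-canonical": [0, 0]}
    for go_id, text, category in build_suite(terms):
        counts[category][1] += 1
        if recognizer(text, terms) == go_id:
            counts[category][0] += 1
    return {category: tuple(pair) for category, pair in counts.items()}


print(run_suite(exact_match_recognizer, GO_TERMS))
```

The per-category breakdown immediately localizes the failure: the exact-match stand-in passes every canonical case and fails every plural, a pattern that would be buried in an aggregate corpus score.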