Biomedical natural language processing and text mining

What is natural language processing? NLP, text mining, computational linguistics • Computational modeling of human language • Access to knowledge in linguistic form • Information retrieval • Information extraction • Document classification • Machine translation • Summarization • …

Why Biomedical NLP?

Exponential knowledge growth • 1,170 peer-reviewed gene-related databases in 2009 NAR db issue • 804,399 PubMed entries in 2008 (> 2,200/day) • Breakdown of disciplinary boundaries; more of it relevant to each of us • “Like drinking from a firehose” – Jim Ostell Slide from Larry Hunter

The Biological Data Cycle • Experimental data • Databases and ontologies: Genbank, SwissProt • Literature collections: MEDLINE • Expert curation • Bottleneck: getting knowledge from literature to databases • Solution: text mining

Model Organism Curation Pipeline • MEDLINE • 1. Select papers • 2. List genes for curation • 3. Curate genes from paper • From Hirschman et al. BMC Bioinformatics 2005 6(Suppl 1):S1

The world’s best justification for BioNLP Baumgartner et al. (2007b)

Scientific Publishing & Semantics • Content enrichment • Direct access to (relevant) external data • Structured digital abstracts • Enables • Interactivity • targeted searches • relevance linking • formalizing content; actionable data

Text mining improves biological data analysis • Leverage information from the literature in the biological data mining process • Homology searches: • Filter unlikely sequence alignments through assessment of literature similarity • Score literature similarity independently of sequence similarity, and combine into unified score • Subcellular localization • Build literature term vectors based on PubMed/MEDLINE abstracts or SWISS-PROT textual annotations • Gene expression clusters: • Assign biological explanations through extraction of significant literature terms for genes in cluster • Measure literature correlations independently, and combine with microarray correlations before clustering

Evaluation of NLP systems • Precision (aka positive predictive value) and recall (aka sensitivity). Tradeoffs between them. • Against a “gold standard” of human generated representations of texts • Humans don’t always agree, therefore calculate inter-annotator agreement • Post-hoc judgments (particularly of IR relevance) • “Shared task” paradigm • TREC Genomics (IR) • BioCreative (IE)
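The slide does not say which agreement measure to use. As one common choice, the sketch below computes Cohen's kappa for two annotators; the function name and the toy labels are illustrative assumptions, not taken from the slides.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical token-level annotations (GENE vs. other).
annotator_1 = ["GENE", "GENE", "O", "O", "GENE", "O"]
annotator_2 = ["GENE", "O", "O", "O", "GENE", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (such as "not a gene") dominates.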

Evaluation of NLP systems • Precision: • True positives / (True positives + False positives) • Recall: • True positives / (True positives + False negatives) • F-measure: “harmonic mean” of precision and recall

Evaluation of NLP systems • Formal definition: $F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}{(\beta^2 \cdot \mathrm{precision}) + \mathrm{recall}}$ • Typical definition: β = 1, so…

Evaluation of NLP systems • Typical definition: $F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$ • …or just F: β is usually assumed to be 1

Evaluation of NLP systems • β allows you to weight precision and recall differently • Increasing β weights recall more highly • Decreasing β weights precision more highly • Rarely used, but designated by value of β, e.g. F0.5 or F2
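The definitions on the preceding slides can be sketched in a few lines. The function name and the counts in the example are illustrative assumptions.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall and F-beta from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta


# Hypothetical system output: 8 correct hits, 2 spurious hits, 8 misses.
p, r, f1 = precision_recall_f(tp=8, fp=2, fn=8)
_, _, f2 = precision_recall_f(tp=8, fp=2, fn=8, beta=2.0)
_, _, f_half = precision_recall_f(tp=8, fp=2, fn=8, beta=0.5)
```

Here precision (0.8) is higher than recall (0.5), so F2, which weights recall more, comes out lower than F1, and F0.5 comes out higher.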

Chang et al.’s improvement on PSI-BLAST (2001) Ng (2006)

Significant improvement in precision

Goal: Predict subcellular localization to understand function • Signal peptides and other sequences are indicative of localization • Machine learning based predictors are moderately accurate • Try adding text…

Subcellular localization (Stapley et al. 2002, Eskin and Agichtein 2004) • Single SVM • Build specialized amino acid and text kernels, then build combined kernel • Ng (2006)

Text improves clustering of gene expression profiles, too • Create per-gene distance matrices based on expression data • Create per-gene distance matrices based on literature data • Combine using Fisher’s omnibus • …then cluster
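The slide names Fisher's omnibus but not how it is applied. A minimal sketch, assuming each gene pair has one p-value from expression data and one from literature data (the example values are made up): Fisher's method combines k independent p-values through the statistic X = -2 Σ ln p, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis.

```python
import math


def fisher_omnibus(p_values):
    """Combine independent p-values with Fisher's method."""
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X / 2
    # Chi-square survival function with 2k degrees of freedom has a
    # closed form for even df: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values for one gene pair: expression-based and literature-based.
combined = fisher_omnibus([0.05, 0.05])
```

Two moderately significant p-values combine into a smaller one, so pairs supported by both data sources end up closer together in the merged matrix.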

Matrix merging (Glenisson et al. 2003) Ng (2006)

More sophisticated text analysis can improve these results • See the YouTube Hanalyzer demo for a better sense of the process • Leach et al. (2009)

APPLICATIONS

Textpresso

Chilibot (www.chilibot.net) Chen and Sharp (2004)

Chilibot Chen and Sharp (2004)

iHOP (http://www.ihop-net.org/UniPub/iHOP)

Reflect (www.reflect.ws) • Firefox plug-in • Recognises proteins and small molecules mentioned in a web page, and links them to information-rich summaries. Karin Verspoor

GoPubMed Doms, A. et al. Nucl. Acids Res. 2005 33:W783-W786; doi:10.1093/nar/gki470

Biomedical Language Processing

“There is little reason for the data on which a linguist works to have the right to name that work.” Surely Shuy jests...

Tokenization is different • Commas • 2,6-diaminohexanoic acid • tricyclo(3.3.1.13,7)decanone • Hyphens • “Syntactic” (Calcium-dependent, Hsp-60) • Knocked-out gene: lush-- flies • Negation: -fever • Electric charge: Cl- • PMID: 10516078 B-cell-CD4(+)-T-cell interactions
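To illustrate why this matters, the sketch below contrasts a general-purpose tokenizer that splits off every punctuation mark with a more conservative one that only peels sentence punctuation from the end of a whitespace-delimited chunk. Both functions are hypothetical illustrations, not tools named in the slides.

```python
import re


def naive_tokenize(text):
    """Treat every punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def conservative_tokenize(text):
    """Split on whitespace; separate only trailing sentence punctuation."""
    tokens = []
    for chunk in text.split():
        match = re.fullmatch(r"(.*?)([.,;:!?]*)", chunk)
        core, trailing = match.group(1), match.group(2)
        if core:
            tokens.append(core)
        tokens.extend(trailing)  # one token per trailing punctuation mark
    return tokens


sentence = "Cl- binds 2,6-diaminohexanoic acid."
naive = naive_tokenize(sentence)
conservative = conservative_tokenize(sentence)
```

The naive tokenizer breaks the chemical name into `2`, `,`, `6`, `-`, `diaminohexanoic` and strips the charge from `Cl-`; the conservative one keeps both intact, at the cost of not handling cases such as a comma that really does end a clause after a number.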

Named Entity Recognition is different • Genes have names? • to, the, there, a, I, … • sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A [SEMA5A]

It really is different on every level • Corpus construction • Semantic representation … Ultimately, we need specific knowledge of the domain to do a good job with the language.

Linguistic Levels of Analysis From Hunter & Cohen, Biomedical Language Processing: What’s Beyond PubMed?, Molecular Cell 21, 589–594, 2006 DOI 10.1016/j.molcel.2006.02.012

SUBTASKS AND TOOLS

Information Retrieval • Retrieving from a collection of indexed documents • Indices based on • Words (perhaps without “stop words”) • Stems (e.g. expresses, expressed, expression ⇒ express) • Synonyms and expansions • Meta-data fields (e.g. author, title) • Keywords or “controlled vocabularies” (e.g. MeSH) • Retrieval rankings based on • Number of matching terms • TF*IDF • Independent document characteristics (citations, links, etc.) • Familiar as Google, PubMed, etc.
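A minimal sketch of such an index, assuming a tiny stop-word list and a deliberately crude suffix-stripping stemmer (both hypothetical, chosen only so that the slide's example words collapse to "express"). Ranking here uses the simplest criterion from the slide, the number of matching terms.

```python
from collections import defaultdict

STOP_WORDS = {"the", "of", "in", "is", "a", "and"}  # tiny illustrative list
SUFFIXES = ("ion", "es", "ed")  # crude, hypothetical stemming rules


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, drop surrounding punctuation and stop words, then stem."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    return [stem(w) for w in words if w and w not in STOP_WORDS]


def build_index(docs):
    """Map each index term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by the number of distinct query terms they match."""
    scores = defaultdict(int)
    for term in set(index_terms(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "d1": "The gene is expressed in the liver.",
    "d2": "Expression of the receptor in liver and kidney.",
    "d3": "The protein binds DNA.",
}
index = build_index(docs)
hits = search(index, "kidney expression")
```

Because of stemming, the query word "expression" also retrieves the document that only says "expressed". A real system would use an established stemmer and stop-word list rather than these toy rules.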

TF*IDF • Term frequency * Inverse Document Frequency • TF = how many times a term appears in a document • IDF = inverse of the fraction of documents in which a term appears (often log-scaled) • Measure of how informative a term is • Occurrence of a rare term is more informative than that of a widely used term • Terms used frequently in a document are more informative than terms used only once • Lots of variants
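One of the many variants can be sketched as follows: raw term counts for TF and log(N / document frequency) for IDF. The choice of variant, the function name, and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights


docs = [
    "the gene encodes the kinase".split(),
    "the kinase binds the receptor".split(),
    "the receptor gene".split(),
]
weights = tf_idf(docs)
```

In this toy collection "the" occurs in every document, so its weight is zero no matter how often it appears, while "encodes", which occurs in only one document, gets the highest weight in the first document.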

Documents as queries • Use a whole document to define a query (find things similar to…) • Represent the document as: • “Bag of words” • Binary or frequency based vector of words or stems • Can add bigrams or trigrams • Reduced dimensionality (Latent Semantic Analysis) • Calculate distance to all other documents in a collection (various metrics)
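A minimal sketch of the frequency-based bag-of-words case, using cosine similarity as the metric (one of the "various metrics"; the function names and the toy collection are assumptions).

```python
import math
from collections import Counter


def bag_of_words(text):
    """Frequency vector over lowercased, whitespace-delimited words."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_text, collection):
    """Return document ids ordered from most to least similar to the query."""
    query = bag_of_words(query_text)
    return sorted(
        collection,
        key=lambda doc_id: cosine_similarity(query, bag_of_words(collection[doc_id])),
        reverse=True,
    )


collection = {
    "a": "heat shock protein expression",
    "b": "gene expression clustering in yeast",
    "c": "subcellular localization of proteins",
}
ranking = most_similar("heat shock protein 60 expression", collection)
```

Swapping `bag_of_words` for stems, n-grams, or TF*IDF weights changes the representation without changing the ranking code.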

Named entity recognition • HSP60 • Hsp-60 • heat shock protein 60 • Cerberus • wingless • Ken and Barbie • the

Entity normalization • Entity normalization: find concepts in text and map them to unique identifiers • A locus has been found, an allele of which causes a modification of some allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are two alleles of this locus, one of which is dominant to the other and results in increased electrophoretic mobility of affected allozymes. The locus responsible has been mapped to 3-56.7 on the standard genetic map (Est-6 is at 3-36.8). Of 13 other enzyme systems analyzed, only leucine aminopeptidase is affected by the modifier locus. Neuraminidase incubations of homogenates altered the electrophoretic mobility of esterase 6 allozymes, but the mobility differences found are not large enough to conclude that esterase 6 is sialylated.

Entity normalization • Perfect named entity recognition finds 5 mentions; they correspond to just 2 genes: • FBgn0000592 (esterase 6) • FBgn0026412 (leucine aminopeptidase) • A locus has been found, an allele of which causes a modification of some allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are two alleles of this locus, one of which is dominant to the other and results in increased electrophoretic mobility of affected allozymes. The locus responsible has been mapped to 3-56.7 on the standard genetic map (Est-6 is at 3-36.8). Of 13 other enzyme systems analyzed, only leucine aminopeptidase is affected by the modifier locus. Neuraminidase incubations of homogenates altered the electrophoretic mobility of esterase 6 allozymes, but the mobility differences found are not large enough to conclude that esterase 6 is sialylated.

Entity normalization • Partial list of synonyms for FBgn0000592: • Esterase 6 • Carboxyl ester hydrolase • CG6917 • Est6 • Est-D • Est-5
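A minimal sketch of dictionary-based normalization, assuming the mentions have already been found by a named entity recognizer. The synonym list for FBgn0000592 is the partial list from the slide; for FBgn0026412 only the name used in the abstract is included. The matching rule (ignore case, spaces and hyphens) is an illustrative assumption.

```python
import re

SYNONYMS = {
    "FBgn0000592": ["Esterase 6", "Carboxyl ester hydrolase", "CG6917",
                    "Est6", "Est-D", "Est-5"],
    "FBgn0026412": ["leucine aminopeptidase"],
}


def normalize_key(name):
    """Lowercase and drop spaces and hyphens so spelling variants collapse."""
    return re.sub(r"[\s\-]+", "", name.lower())


# Lookup table from normalized synonym to gene identifier.
LEXICON = {
    normalize_key(synonym): gene_id
    for gene_id, names in SYNONYMS.items()
    for synonym in names
}


def normalize_mentions(mentions):
    """Pair each mention with its identifier, or None if it is unknown."""
    return [(m, LEXICON.get(normalize_key(m))) for m in mentions]


# The five mentions in the example abstract, in order of appearance.
mentions = ["esterase 6", "Est-6", "leucine aminopeptidase",
            "esterase 6", "esterase 6"]
normalized = normalize_mentions(mentions)
gene_ids = {gene_id for _, gene_id in normalized}
```

The five mentions collapse to two identifiers. A lookup this simple cannot resolve ambiguous synonyms shared by several genes, which is where much of the real difficulty lies.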

Biological Nomenclature: “V-SNARE” • V-SNARE = Vesicle SNARE • SNARE = SNAP Receptor • SNAP = Soluble NSF Attachment Protein • NSF = N-Ethylmaleimide-Sensitive Fusion Protein • N-Ethylmaleimide = Maleic acid N-ethylimide • Fully expanded: Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor • (Alex Morgan, MITRE)

Information/relation extraction • Information extraction: relationships between things • BINDING_EVENT • Binder: • Bound:

Information/relation extraction • Met28 binds to DNA. • BINDING_EVENT • Binder: Met28 • Bound: DNA
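Filling this template from the example sentence can be sketched with a single surface pattern. The regular expression and the dictionary layout are illustrative assumptions; real systems rely on entity recognition and syntactic analysis rather than one hard-coded phrase.

```python
import re

# Hypothetical pattern: "<binder> binds to <bound>" with single-token arguments.
BINDING_PATTERN = re.compile(
    r"\b(?P<binder>[A-Za-z0-9\-]+) binds to (?P<bound>[A-Za-z0-9\-]+)"
)


def extract_binding_events(text):
    """Return one BINDING_EVENT template per pattern match."""
    return [
        {
            "type": "BINDING_EVENT",
            "Binder": match.group("binder"),
            "Bound": match.group("bound"),
        }
        for match in BINDING_PATTERN.finditer(text)
    ]


events = extract_binding_events("Met28 binds to DNA.")
```

The pattern misses paraphrases such as "DNA is bound by Met28" or "Met28-DNA binding", which is why pattern sets grow quickly in practice.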

Document clustering • For browsing large numbers of relevant documents • In biomedicine, unlike most Google searches, the goal is not one relevant document, but many • Statistical measures of document distance • Cosine distance over term (or stem) vectors • PubMed document neighbors (TF*IDF clustering) • Latent Semantic Analysis (LSA) • Knowledge-based approaches: • Mapping documents to a predefined set of types • Use information extraction as basis for clustering

Automated summarization • Useful for browsing retrieved documents • Multidocument summarization can characterize document clusters • Select the “best” sentence/passage • Based on appearance of query terms (a la Google) • Other useful criteria: • Cues (“we conclude”, “demonstrating that”…) • Presence of supporting data (“Figure 6 shows that…”) • Sentence position (last sentence of abstract) • Frequency in multiple documents
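A minimal sketch of sentence selection for a single abstract, combining query terms, cue phrases, mentions of supporting data, and sentence position. The weights are arbitrary assumptions, the example sentences are made up, and the cross-document frequency criterion is left out.

```python
import re

CUE_PHRASES = ("we conclude", "demonstrating that")  # cues from the slide
DATA_PATTERN = re.compile(r"\b(figure|table)\s+\d+", re.IGNORECASE)


def score_sentence(sentence, position, n_sentences, query_terms):
    """Heuristic score; higher means a better summary candidate."""
    lowered = sentence.lower()
    words = set(re.findall(r"\w+", lowered))
    score = sum(1 for term in query_terms if term.lower() in words)
    score += sum(2 for cue in CUE_PHRASES if cue in lowered)
    if DATA_PATTERN.search(sentence):
        score += 1  # mentions supporting data
    if position == n_sentences - 1:
        score += 1  # last sentence of the abstract
    return score


def best_sentence(sentences, query_terms):
    """Return the highest-scoring sentence (earliest one on ties)."""
    if not sentences:
        raise ValueError("no sentences to choose from")
    n = len(sentences)
    scores = [score_sentence(s, i, n, query_terms) for i, s in enumerate(sentences)]
    return sentences[scores.index(max(scores))]


abstract = [
    "Met28 is a transcription factor in yeast.",
    "Figure 6 shows that Met28 enhances complex formation.",
    "We conclude that Met28 binds to DNA.",
]
summary = best_sentence(abstract, ["Met28", "DNA"])
```

With these weights the concluding sentence wins because it matches both query terms, contains a cue phrase, and sits last in the abstract.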

Document zoning • Different “sections” or zones of a document • Introduction vs. methods vs. references, etc. • Many want to focus on (or exclude) certain zones from search or other processing • No straightforward way to identify zones • Journals often have their own structures • Section titles, HTML/XML/SGML formatting helps (PubMedCentral DTD) • Treat as discrimination problem

Extracting factual information from text • Information extraction (IE) involves parsing text for patterns encoding particular facts • Biomedical literature is full of useful information potentially amenable to IE (e.g. consequences of mutations) • BioCreative 2006/2009 Competitions on extracting protein-protein interaction statements from literature • Subtasks: • Entity identification / normalization • Finding relationships • Filling in predefined schemata
