
Research in the Verspoor Lab


Text Mining

  • Information extraction from the biomedical literature

    • Entity recognition and normalization

    • Relation and event extraction

  • Last time, I promised that we would look at:

    • Ontologies as constraints for information extraction

Making BioNLP relevant

  • Recognition of OBO terms, relations

  • CRAFT corpus (first release later this year)

OpenDMAP extracts typed relations from the literature

  • Concept recognition tool

    • Connect ontological terms to literature instances

    • Built on Protégé knowledge representation system

  • Language patterns associated with concepts and slots

    • Patterns can contain text literals, other concepts, constraints (conceptual or syntactic), ordering information, or outputs of other processing.

    • Linked to many text analysis engines via UIMA

  • Best performance in BioCreative II IPS task

  • >500,000 instances of three predicates (with arguments) extracted from Medline Abstracts

  • [Hunter, et al., 2008]














protein protein interaction:

interactor1: cyclin E2

interactor2: cdk2




Cyclin E2 interacts with Cdk2 in a functional kinase complex.


Protein protein interaction :=

[int1] interacts with [int2]



CLASS: protein protein interaction

SLOT: interactor1

TYPE: molecule

SLOT: interactor2

TYPE: molecule


{c-interact} := [interactor1] interacts with [interactor2]

{c-interact} := [interactor1] is bound by [interactor2]



BioCreative II Example

  • Some BioCreative patterns for interact

    {c-interact} := [interactor1]{w-is}{w-interact-verb1}{w-preposition} the? [interactor2];

    {w-is} := is, are, was, were;

    {w-interact-verb1} := co-immunoprecipitate, co-immunoprecipitates, co-immunoprecipitated, co-localize, co-localizes, co-localized;

    {w-preposition} := among, between, by, of, with, to;

  • Matched text:

    PMID 16494873, SENT_ID 16494873_114

    Upon precipitation of the SOX10 protein with anti-HA antibody, Western blot detection revealed expression of UBC9-V5 (25 kDa) in the sample (Fig. 1, line 6), indicating that {UBC9 was co-immunoprecipitated with SOX10}.

    INTERACTOR_1: UBC9 resolved to UniProtID: UBC9_RAT

    INTERACTOR_2: SOX10 resolved to UniProtID: SOX10_RAT

    {c-interact}:= [UBC9_RAT]interactor_1, [SOX10_RAT]interactor_2
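The pattern above can be approximated with an ordinary regular expression. This is a minimal sketch, not OpenDMAP itself: the word classes are copied from the slide, while `KNOWN_PROTEINS`, `alternation`, and `extract_interactions` are hypothetical names, and the two-entry protein dictionary stands in for a real entity recognizer and normalizer.

```python
import re

# Word classes copied from the slide's pattern definitions.
W_IS = ["is", "are", "was", "were"]
W_INTERACT_VERB1 = [
    "co-immunoprecipitate", "co-immunoprecipitates", "co-immunoprecipitated",
    "co-localize", "co-localizes", "co-localized",
]
W_PREPOSITION = ["among", "between", "by", "of", "with", "to"]

# Hypothetical stand-in for entity recognition: surface form -> identifier.
KNOWN_PROTEINS = {"UBC9": "UBC9_RAT", "SOX10": "SOX10_RAT"}


def alternation(words):
    """Build a non-capturing regex group matching any word, longest first."""
    ordered = sorted(words, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(w) for w in ordered) + ")"


PROTEIN = alternation(KNOWN_PROTEINS)

# [interactor1] {w-is} {w-interact-verb1} {w-preposition} the? [interactor2]
C_INTERACT = re.compile(
    rf"\b(?P<interactor1>{PROTEIN})\s+{alternation(W_IS)}\s+"
    rf"{alternation(W_INTERACT_VERB1)}\s+{alternation(W_PREPOSITION)}\s+"
    rf"(?:the\s+)?(?P<interactor2>{PROTEIN})\b"
)


def extract_interactions(sentence):
    """Return (interactor1_id, interactor2_id) pairs found in the sentence."""
    return [
        (KNOWN_PROTEINS[m["interactor1"]], KNOWN_PROTEINS[m["interactor2"]])
        for m in C_INTERACT.finditer(sentence)
    ]
```

Unlike this sketch, the slot fillers in the slides are typed by ontology classes rather than by a fixed list of names, which is what lets the same pattern apply to any recognized protein.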

BioCreative Results

  • 359 full-text articles in the test set

  • 385 interaction assertions produced

  • Performance averaged per article (to avoid dominance of a few assertion-heavy articles)

    P = 0.39, R = 0.31, F = 0.29

  • Best result in the evaluation!

    • F score 10% higher than next-scoring system

    • F score > 3 standard deviations above mean

    • Recall 20% higher than next-scoring system
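The reported F (0.29) is lower than the harmonic mean of the reported P and R (about 0.35). That is what one would expect if F is computed per article and then averaged, although the slides do not spell out the procedure. The sketch below illustrates the effect with invented counts; `prf` and `per_article_average` are hypothetical helper names.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from counts; 0.0 when a ratio is undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def per_article_average(counts):
    """Average P, R and F over articles; counts is a list of (tp, fp, fn)."""
    scores = [prf(*c) for c in counts]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Invented (tp, fp, fn) counts for three articles, for illustration only.
articles = [(1, 0, 3), (0, 2, 1), (3, 3, 0)]

avg_p, avg_r, avg_f = per_article_average(articles)

# F computed from the averaged P and R, for comparison with avg_f.
f_of_averages = 2 * avg_p * avg_r / (avg_p + avg_r)
```

With these counts the averaged per-article F is smaller than the F computed from the averaged P and R, because an article with no correct assertions contributes an F of zero regardless of how the other articles score.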

BioCreative conclusions

  • Information extraction in biomedical text is hard

    • Linguistic variability in how concepts are expressed

    • Complex concepts with multiple “slots”

  • OpenDMAP advances the state of the art

    • Use of an ontology grounds the search for information

    • Flexibility of the pattern language to incorporate constraints at different levels (conceptual, lexical, word order, linguistic)

BioNLP’09: Methods

Protein_transport :=

[TRANSPORTED-ENTITY] translocation



Bax translocation to mitochondria from the cytosol

Bax translocation from the cytosol to the mitochondria

Slide credit: Kevin B. Cohen

BioNLP’09: Methods

Protein_transport :=

[TRANSPORTED-ENTITY] translocation




(Sequence Ontology)

Cellular Component

(Gene Ontology)

Slide credit: Kevin B. Cohen

BioNLP’09: Methods

Slide credit: Kevin B. Cohen

BioNLP’09: Methods

  • All event types represented as frames

    • Elements from ontology constrain every slot


      AtLoc: instance of biological_entity

      Cause: instance of protein

      CSite: instance of biological_concept or polypeptide_region

      Event_action: instance of trigger_word or detection_method

      Site: instance of biological_concept or polypeptide_region

      Theme: instance of protein or biological_process

      ToLoc: instance of biological_entity
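A minimal sketch of how ontology classes can constrain slot fillers, assuming a simple child-to-parent hierarchy. The slot constraints are taken from the list above; the `IS_A` hierarchy, the `ancestors` and `violations` helpers, and the example frames are assumptions made for illustration and do not reproduce the actual ontology.

```python
# Toy is-a hierarchy (child -> parent). An assumption for illustration only.
IS_A = {
    "protein": "biological_entity",
    "cellular_component": "biological_entity",
    "mitochondrion": "cellular_component",
    "cytosol": "cellular_component",
    "Bax": "protein",
}

# Slot constraints from the slide: each slot lists its allowed classes.
SLOT_TYPES = {
    "AtLoc": {"biological_entity"},
    "Cause": {"protein"},
    "CSite": {"biological_concept", "polypeptide_region"},
    "Event_action": {"trigger_word", "detection_method"},
    "Site": {"biological_concept", "polypeptide_region"},
    "Theme": {"protein", "biological_process"},
    "ToLoc": {"biological_entity"},
}


def ancestors(term):
    """Yield the term and every class above it in the toy hierarchy."""
    while term is not None:
        yield term
        term = IS_A.get(term)


def violations(frame):
    """Return the slots whose filler is not an instance of an allowed class."""
    return [
        slot
        for slot, filler in frame.items()
        if not SLOT_TYPES[slot] & set(ancestors(filler))
    ]


ok_frame = {"Theme": "Bax", "ToLoc": "mitochondrion"}
bad_frame = {"Theme": "cytosol", "ToLoc": "mitochondrion"}
```

Here `violations(ok_frame)` is empty, while `violations(bad_frame)` flags the Theme slot, since a cellular component is neither a protein nor a biological process in the toy hierarchy.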

Sequence Ontology

Molecular Interaction Ontology

Gene Ontology

Cell Cycle Ontology

Slide credit: Kevin B. Cohen

BioNLP’09: Methods

Partial view of ontology—reality is a little bit less clean

Slide credit: Kevin B. Cohen

BioNLP’09: Methods

BTO: BRENDA Tissue Ontology

CCO: Cell Cycle Ontology

CTO: Cell Type Ontology

GO: Gene Ontology

SO: Sequence Ontology

Slide credit: Kevin B. Cohen

BioNLP’09: Methods

  • Manual pattern-writing

    • Before availability of training data: based on native speaker intuitions, examples from PubMed, and variations on same, as in Cohen et al. (2004)

    • After release of training data: based on examination of corpus data, targeting high-frequency predicates only

    • Nominalizations predominated; used insights from Cohen et al. (2008) regarding Theme placement

    • Protein binding rules re-used from BioCreative II protein-protein interaction task

    • Eschewed use of wildcards

Slide credit: Kevin B. Cohen

BioNLP’09: Results

Task 1: P 10 points higher than second-highest

Task 2: P 14 points higher than second-highest

Task 3: P 3.4 points lower than highest (3/6)

Slide credit: Kevin B. Cohen

BioNLP’09: Results

Unofficial results: contribution of bug repairs

Still the highest precision (#2 was 62.21)

Slide credit: Kevin B. Cohen

BioNLP’09: Results

  • Contribution of coördination-handling

    • Bug-fixed results: F 27.62 (Task 1)

    • Without coordination-handling: F 24.72

    • Decrease in F of 2.9 without coördination-handling

Slide credit: Kevin B. Cohen

Syntax helps

  • 125I-labeled C3b was covalently deposited on CR2, when hemolytically active 125I-labeled C3 was added to Raji cells preincubated with iC3, factor B, properdin, and factor D, thus proving functionality of CR2-bound C3 convertase. <cr2> BINDS <c3 convertase>

  • CD8alpha(alpha) binds one HLA-A2/peptide molecule, interfacing with the alpha2 and alpha3 domains of HLA-A2 and also contacting beta2-microglobulin. <cd8alpha ( alpha )> BINDS <hla a2 / peptide molecule>

  • The binding of 109Cd to metallothionein and the thiol density of the protein were determined after incubation of a purified Zn/Cd-metallothionein preparation with either hydrogen peroxide alone, or with a number of free radical generating systems. <109cd> BINDS <metallothionein>

  • Although these shifts in alpha3 may provide a synergistic modulation of affinity, the binding of CD8 to MHC is clearly consistent with an avidity-based contribution from CD8 to TCR- peptide-MHC interactions. <Cd8> BINDS <major histocompatibility complex>

More complex examples

  • Complex noun phrases

  • The inactive C3 (iC3), which forms spontaneously in serum in low amounts by reaction of native C3 with H2O, binds noncovalently to the N-terminal part of CR2. <inactive c3> BINDS <cr2>

  • RelB binds transcriptionally active kappaB motifs in the TNF-alpha promoter in normal cells, and in vitro studies with macrophages isolated from RelB- deficient animals revealed impaired production of TNF-alpha in response to LPS and IFN-gamma. <relb> BINDS <tnf - alpha promoter>

  • Negation

  • TNP-BSA, however, did not bind to the CD4 receptor. <trinitrophenyl-bovine serum albumin> DOES_NOT_BIND <cd4 receptor>

  • Similarly, when cells expressing the wild type FSHR were treated with tunicamycin to prevent N-linked glycosylation, the resulting nonglycosylated FSHR was not able to bind FSH. <resulting nonglycosylated fsh receptor> DOES_NOT_BIND <follicle-stimulating hormone>

Coordination is particularly hard

In contrast both the S4GGnM-R and the Man-R are able to bind Man-BSA. <mannose receptor> BINDS <man bsa> <s4ggnm - r> BINDS <man bsa>

Purified recombinant NC1, like authentic NC1, also bound specifically to fibronectin, collagen type I, and a laminin 5/6 complex. <authentic nc1> BINDS <laminin 5 / 6 complex> <authentic nc1> BINDS <collagen type I> <authentic nc1> BINDS <fibronectin> <purified recombinant nc1> BINDS <laminin 5 / 6 complex> <purified recombinant nc1> BINDS <collagen type I> <purified recombinant nc1> BINDS <fibronectin>

The nonvisual arrestins, beta-arrestin and arrestin3, but not visual arrestin, bind specifically to a glutathione S-transferase-clathrin terminal domain fusion protein. *<Arrestin3> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein> <beta arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein> <nonvisual arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
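Once the conjuncts on each side have been identified, the NC1 example amounts to distributing the predicate over every subject/object pair. The sketch below shows only that final step; `distribute` is a hypothetical helper. It assumes the conjunct lists are already available, which is where the real difficulty lies (deciding, for instance, that "but not visual arrestin" must be excluded, or that "nonvisual arrestins" is a category label rather than a third binder).

```python
from itertools import product


def distribute(subjects, predicate, objects):
    """Expand coordinated arguments into one binary relation per pair."""
    return [(s, predicate, o) for s, o in product(subjects, objects)]


# Conjuncts taken from the NC1 sentence above.
relations = distribute(
    ["purified recombinant nc1", "authentic nc1"],
    "BINDS",
    ["fibronectin", "collagen type I", "laminin 5 / 6 complex"],
)
```

Two subjects and three objects yield the six relations listed for that sentence.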

BioNLP Shared Task ‘11

  • Extension of BioNLP’09 tasks

    • Generalization to full text (from abstracts)

    • Additional event types: post-translational modifications and catalysis

  • Methods:

    • Based on empirically derived patterns

    • Derived from training data + manual refinement

    • Using dependency relations (syntax)

    • Work of Haibin Liu (postdoc)

Integrating background knowledge

  • Can improve OpenDMAP precision with minimal cost to recall

    • Take advantage of background knowledge

    • Tighten constraints on slot fillers in the ontology

    • No change to existing patterns

  • Proof of concept:

    • Distinguish among several types of protein activation (enzyme and receptor) in GeneRIFs

    • Utilize Gene Ontology annotations

Refining selectional restrictions

TP: [GeneRIF 104155 ]

an ER stress induces the activation of [caspase-12_protein - catalytic activity]activated_entity via [caspase-3_protein]activator

prevented FP: [GeneRIF 105594]

factor Xa can induce mesangial cell proliferation through the activation of ERK_protein via PAR2_protein in mesangial cells
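One way to picture the tightened selectional restriction: a pattern match is kept only when the protein filling the activated_entity slot carries a Gene Ontology annotation compatible with the activation type. This is a sketch under stated assumptions. The annotation table, the placeholder protein names, the mapping in `REQUIRED_FUNCTION`, and the `keep_match` helper are all hypothetical, and the slides do not say which annotation ruled out the false positive above.

```python
# Hypothetical GO-style annotations; the protein names are placeholders.
GO_ANNOTATIONS = {
    "protein_A": {"catalytic activity"},
    "protein_B": {"signaling receptor activity"},
    "protein_C": set(),
}

# Assumed mapping from activation type to the annotation its filler must carry.
REQUIRED_FUNCTION = {
    "enzyme_activation": "catalytic activity",
    "receptor_activation": "signaling receptor activity",
}


def keep_match(activation_type, activated_entity):
    """Keep a pattern match only if the slot filler has the required annotation."""
    required = REQUIRED_FUNCTION[activation_type]
    return required in GO_ANNOTATIONS.get(activated_entity, set())
```

Because the check is applied to slot fillers, the language patterns themselves stay unchanged, which matches the "no change to existing patterns" point in the slides.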

Biological entities

  • Genes (and their products) are particularly valuable to recognize, but are not the only entities of interest:

    • Diseases

    • Drugs, Chemicals, and other treatments

    • Anatomical and other locations

    • Time and temporal relationships

    • Methods and evidence

    • Molecular functions, biological processes

Two dictionary-based tools tested against CRAFT

  • UIMA ConceptMapper

    • stemming and case matching relaxation

    • non-contiguous spans

    • ignore stopwords

    • order-independent lookup

  • Open Biomedical Annotator

    • ignore stopwords

    • partial word matches
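The matching options listed above can be illustrated with a small whole-phrase lookup. This is not the ConceptMapper or Open Biomedical Annotator API: the function names, the stopword list, and the `EX:` identifiers are invented for the sketch, and it matches complete phrases only rather than scanning running text.

```python
import re

STOPWORDS = {"of", "the", "a", "an", "in"}  # assumed stopword list

# Made-up identifiers; only the term strings come from the slides.
TERMS = {"EX:0001": "induction of apoptosis", "EX:0002": "cell migration"}


def normalize(text, case_sensitive=False, drop_stopwords=True):
    """Tokenize, optionally lowercasing and removing stopwords."""
    tokens = re.findall(r"\w+", text if case_sensitive else text.lower())
    if drop_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


def make_key(tokens, ordered):
    """Order-independent lookup sorts the tokens before comparing."""
    return tuple(tokens) if ordered else tuple(sorted(tokens))


def build_index(dictionary, ordered=False, **opts):
    """Map each normalized term to its concept identifier."""
    return {
        make_key(normalize(name, **opts), ordered): concept_id
        for concept_id, name in dictionary.items()
    }


def lookup(phrase, index, ordered=False, **opts):
    """Return the concept identifier for a phrase, or None if not found."""
    return index.get(make_key(normalize(phrase, **opts), ordered))
```

With the default settings, `lookup("apoptosis induction", build_index(TERMS))` finds the first term, whereas an index built and queried with `ordered=True` does not. The choice of options changes which surface variants are recognized.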

Best run results

  • CM/CTO: stemming + FindAllMatches: false

  • OBA/CTO: using default stop words

  • CM/GO_CC: stemming + caseMatch: insensitive

  • CM/ChEBI: caseMatch: sensitive

Concept Matching Conclusions

  • The kinds of terms in the ontology matter

  • The strategies used in the dictionary matching tools matter

  • OpenDMAP will support strategies that go beyond dictionary matching …

Evaluation via Test Suite

  • Big picture: How to evaluate ontology concept recognition systems?

  • Traditional approach: “corpus”

    • Expensive

    • Time-consuming to produce

    • Redundancy for some things…

    • …underrepresentation of others

  • Immediate (narrow) goal of this work: Use techniques from software testing and descriptive linguistics to build test suites that:

    • Control test data

    • Eliminate redundancy

    • Systematic coverage (Oepen 1998)

  • Immediate (broad) goal of this work: Are there general principles for test suite design?

Slide credit: Kevin B. Cohen


  • Steps: develop “catalogue” of dimensions along which terms vary

  • Use insights from linguistics and from how we know concept recognition systems work

    • Structural aspects: length

    • Content aspects: typography, orthography, lexical contents (function words)…

  • …to build a structured set of test cases

  • Also compare to other test suite work (Cohen et al. 2004) to look for common principles

Slide credit: Kevin B. Cohen

Structured test suite



GO:0000133 Polarisomes

GO:0000108 Repairosomes

GO:0000786 Nucleosomes

GO:0001660 Fevers

GO:0001726 Ruffles

GO:0005623 Cells

GO:0005694 Chromosomes

GO:0005814 Centrioles

GO:0005874 Microtubules

  • GO:0000133 Polarisome

  • GO:0000108 Repairosome

  • GO:0000786 Nucleosome

  • GO:0001660 Fever

  • GO:0001726 Ruffle

  • GO:0005623 Cell

  • GO:0005694 Chromosome

  • GO:0005814 Centriole

  • GO:0005874 Microtubule

induction of apoptosis -> apoptosis induction (Syntax)

cell migration -> cell migrated (Part of speech)

ensheathment of neurons -> ensheathment of some neurons
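The non-canonical forms above can be generated from canonical terms by simple rewrite rules, one rule per dimension of variation. The sketch below covers three of the transformations shown: plural number, syntactic alternation, and an inserted word. The function names are hypothetical and the rules are deliberately naive (appending "s" is enough for the listed terms but not for English in general).

```python
def pluralize(term):
    """Naive plural: append 's' (sufficient for the terms listed above)."""
    return term + "s"


def swap_of_phrase(term):
    """Rewrite 'X of Y' as 'Y X'; return None when the term has no ' of '."""
    head, sep, modifier = term.partition(" of ")
    return f"{modifier} {head}" if sep else None


def insert_after_of(term, word="some"):
    """Rewrite 'X of Y' as 'X of <word> Y'; return None without ' of '."""
    head, sep, modifier = term.partition(" of ")
    return f"{head} of {word} {modifier}" if sep else None


def variants(term):
    """Return the applicable non-canonical variants, keyed by rule name."""
    candidates = {
        "plural": pluralize(term),
        "syntax": swap_of_phrase(term),
        "inserted_word": insert_after_of(term),
    }
    return {rule: v for rule, v in candidates.items() if v is not None}
```

Keeping one rule per dimension makes it possible to attribute a recognition failure to a specific kind of variation.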

Slide credit: Kevin B. Cohen

Methods results

  • Gene Ontology, revision 9/24/2009

  • Canonical: 188

  • Non-canonical: 117

  • Observation:

    • 5:1 “dirty” versus 5:1 “clean” is mark of “mature” testing

  • Applied publicly available concept recognition system

Slide credit: Kevin B. Cohen


  • 97.9% of canonical terms were recognized

    • All exceptions contain the word “in”

  • No non-canonical terms were recognized

  • What would it take to recognize the error pattern with canonical terms with a corpus-based approach??

  • General principles: Length, ortho/typography (numerals/punctuation), function/stopwords, syntactic context
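A minimal sketch of how a controlled test suite can surface an error pattern such as the one above. The recognizer is a deliberately buggy toy that strips the stopword "in" from its input but not from its dictionary, and the terms and `T` identifiers are invented. It is not the publicly available system evaluated in the slides, whose failure mechanism is not described.

```python
# Invented terms and identifiers, for illustration only.
DICTIONARY = {
    "cell": "T1",
    "microtubule": "T2",
    "protein folding in endoplasmic reticulum": "T3",
    "transport in cell": "T4",
}
STOP = {"in"}


def buggy_recognizer(text):
    """Toy recognizer: strips stopwords from the input but not the dictionary."""
    cleaned = " ".join(w for w in text.split() if w not in STOP)
    return DICTIONARY.get(cleaned)


def evaluate(recognizer, cases):
    """Return (recognition rate, missed terms) for a list of (id, term) cases."""
    missed = [term for cid, term in cases if recognizer(term) != cid]
    return 1 - len(missed) / len(cases), missed


def shared_tokens(terms):
    """Tokens that occur in every one of the given terms."""
    token_sets = [set(t.split()) for t in terms]
    return set.intersection(*token_sets) if token_sets else set()


cases = [(cid, term) for term, cid in DICTIONARY.items()]
rate, missed = evaluate(buggy_recognizer, cases)
common = shared_tokens(missed)
```

Because each test case is controlled, intersecting the tokens of the missed terms points directly at the shared word. In a corpus, the same pattern would be spread across many unrelated contexts.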

Slide credit: Kevin B. Cohen