Research in the Verspoor Lab


Text Mining

  • Information extraction from the biomedical literature

    • Entity recognition and normalization

    • Relation and event extraction

  • Last time, I promised that we would look at:

    • Ontologies as constraints for information extraction


Making BioNLP relevant

  • Recognition of OBO terms, relations

  • CRAFT corpus (first release later this year)


OpenDMAP extracts typed relations from the literature

  • Concept recognition tool

    • Connect ontological terms to literature instances

    • Built on Protégé knowledge representation system

  • Language patterns associated with concepts and slots

    • Patterns can contain text literals, other concepts, constraints (conceptual or syntactic), ordering information, or outputs of other processing.

    • Linked to many text analysis engines via UIMA

  • Best performance in BioCreative II IPS task

  • >500,000 instances of three predicates (with arguments) extracted from Medline Abstracts

  • [Hunter, et al., 2008] http://bionlp.sourceforge.net


OpenDMAP

[Diagram: free text, an ontology, and patterns are the inputs to OpenDMAP; extracted information is the output.]


OpenDMAP

[Diagram: free text, an ontology, and patterns are the inputs to OpenDMAP; extracted information is the output.]

Free text:

Cyclin E2 interacts with Cdk2 in a functional kinase complex.

Ontology:

<ontology>

Pattern:

Protein protein interaction := [int1] interacts with [int2]

Extracted information:

protein protein interaction:

interactor1: cyclin E2

interactor2: cdk2


OpenDMAP

PROTÉGÉ ONTOLOGY

CLASS: protein protein interaction

SLOT: interactor1 (TYPE: molecule)

SLOT: interactor2 (TYPE: molecule)

PATTERNS

{c-interact} := [interactor1] interacts with [interactor2]

{c-interact} := [interactor1] is bound by [interactor2]


BioCreative II Example

  • Some BioCreative patterns for interact

    {c-interact} := [interactor1]{w-is}{w-interact-verb1}{w-preposition} the? [interactor2];

    {w-is} := is, are, was, were;

    {w-interact-verb1} := co-immunoprecipitate, co-immunoprecipitates, co-immunoprecipitated, co-localize, co-localizes, co-localized;

    {w-preposition} := among, between, by, of, with, to;

  • Matched text:

    PMID 16494873, SENT_ID 16494873_114

    Upon precipitation of the SOX10 protein with anti-HA antibody, Western blot detection revealed expression of UBC9-V5 (25 kDa) in the sample (Fig. 1, line 6), indicating that {UBC9 was co-immunoprecipitated with SOX10}.

    INTERACTOR_1: UBC9 resolved to UniProtID: UBC9_RAT

    INTERACTOR_2: SOX10 resolved to UniProtID: SOX10_RAT

    {c-interact}:= [UBC9_RAT]interactor_1, [SOX10_RAT]interactor_2
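The pattern above can be approximated with a regular expression. This is a minimal sketch, not OpenDMAP's actual matching engine: the word classes are copied from the slide, but treating each interactor as a single alphanumeric token (`ENTITY`) and the helper names `alternation` and `extract_interaction` are assumptions made for illustration. In the real system the interactor slots would be filled by recognized, ontology-typed entities.

```python
import re

# Word classes copied from the slide's pattern definitions.
WORD_CLASSES = {
    "w-is": ["is", "are", "was", "were"],
    "w-interact-verb1": [
        "co-immunoprecipitate", "co-immunoprecipitates", "co-immunoprecipitated",
        "co-localize", "co-localizes", "co-localized",
    ],
    "w-preposition": ["among", "between", "by", "of", "with", "to"],
}

# Assumption: an interactor is a single alphanumeric token. A real system
# would take these spans from an entity recognizer instead.
ENTITY = r"[A-Za-z0-9-]+"


def alternation(name: str) -> str:
    """Turn a word class into a non-capturing regex alternation."""
    words = sorted(WORD_CLASSES[name], key=len, reverse=True)
    return "(?:" + "|".join(re.escape(w) for w in words) + ")"


# [interactor1]{w-is}{w-interact-verb1}{w-preposition} the? [interactor2]
C_INTERACT = re.compile(
    rf"(?P<interactor1>{ENTITY})\s+{alternation('w-is')}\s+"
    rf"{alternation('w-interact-verb1')}\s+{alternation('w-preposition')}\s+"
    rf"(?:the\s+)?(?P<interactor2>{ENTITY})"
)


def extract_interaction(sentence: str):
    """Return the two slot fillers as a dict, or None if the pattern does not match."""
    match = C_INTERACT.search(sentence)
    return match.groupdict() if match else None
```

Applied to the matched sentence from PMID 16494873, the sketch fills interactor1 with UBC9 and interactor2 with SOX10; resolving those strings to UniProt identifiers is a separate normalization step that is not shown here.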


BioCreative Results

  • 359 full-text articles in the test set

  • 385 interaction assertions produced

  • Performance averaged per article (to avoid dominance of a few assertion-heavy articles)

    P = 0.39, R = 0.31, F = 0.29

  • Best result in the evaluation!

    • F score 10% higher than next-scoring system

    • F score > 3 standard deviations above mean

    • Recall 20% higher than next-scoring system
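Per-article averaging can be sketched as follows. This is an illustration under stated assumptions, not the official BioCreative scorer: the function names and the (true positive, false positive, false negative) count format are hypothetical. It also shows why an averaged F can fall below both the averaged P and the averaged R, as in the figures above: F is computed per article and then averaged, rather than derived from the averaged P and R.

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 for one article (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def macro_average(per_article_counts):
    """Average P, R, and F over articles so that each article counts equally."""
    scores = [prf(*counts) for counts in per_article_counts]
    n = len(scores)
    return tuple(sum(score[i] for score in scores) / n for i in range(3))


# Hypothetical counts for two articles: (tp, fp, fn).
p, r, f = macro_average([(1, 0, 3), (1, 3, 0)])
print(p, r, f)
```

In this made-up example each article has F = 0.4, so the averaged F is 0.4 even though the averaged precision and recall are both 0.625.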


BioCreative conclusions

  • Information extraction in biomedical text is hard

    • Linguistic variability in how concepts are expressed

    • Complex concepts with multiple “slots”

  • OpenDMAP advances the state of the art

    • Use of an ontology grounds the search for information

    • Flexibility of the pattern language to incorporate constraints at different levels (conceptual, lexical, word order, linguistic)


BioNLP’09: Methods

Protein_transport :=

[TRANSPORTED-ENTITY] translocation

@(from {DET}? [TRANSPORT-ORIGIN])

@(to {DET}? [TRANSPORT-DESTINATION])

Bax translocation to mitochondria from the cytosol

Bax translocation from the cytosol to the mitochondria

Slide credit: Kevin B. Cohen
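The two example phrases suggest that the `@(...)` constituents may appear in either order after the head. A minimal sketch of that behaviour, assuming this reading of `@(...)`, single-word slot fillers, and a three-word determiner class (all assumptions; this is not OpenDMAP's pattern language implementation):

```python
import re

# Assumption: {DET} covers only these three determiners.
DET = r"(?:the|a|an)\s+"

HEAD = re.compile(r"(?P<entity>\w+)\s+translocation\b(?P<rest>.*)")
FROM = re.compile(rf"\bfrom\s+(?:{DET})?(?P<origin>\w+)")
TO = re.compile(rf"\bto\s+(?:{DET})?(?P<destination>\w+)")


def match_protein_transport(text: str):
    """Fill the Protein_transport slots; origin and destination may come in any order."""
    head = HEAD.search(text)
    if head is None:
        return None
    rest = head.group("rest")
    origin = FROM.search(rest)
    destination = TO.search(rest)
    return {
        "transported_entity": head.group("entity"),
        "transport_origin": origin.group("origin") if origin else None,
        "transport_destination": destination.group("destination") if destination else None,
    }
```

Both example phrases yield the same frame: Bax as the transported entity, cytosol as the origin, and mitochondria as the destination. The sketch does no type checking of the fillers.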


BioNLP’09: Methods

Protein_transport :=

[TRANSPORTED-ENTITY] translocation

@(from {DET}? [TRANSPORT-ORIGIN])

@(to {DET}? [TRANSPORT-DESTINATION])

Protein (Sequence Ontology)

Cellular Component (Gene Ontology)

Slide credit: Kevin B. Cohen


BioNLP’09: Methods

Slide credit: Kevin B. Cohen


BioNLP’09: Methods

  • All event types represented as frames

    • Elements from ontology constrain every slot

      EVENT TYPE: REGULATION

      AtLoc: instance of biological_entity

      Cause: instance of protein

      CSite: instance of biological_concept or polypeptide_region

      Event_action: instance of trigger_word or detection_method

      Site: instance of biological_concept or polypeptide_region

      Theme: instance of protein or biological_process

      ToLoc: instance of biological_entity

Sequence Ontology

Molecular Interaction Ontology

Gene Ontology

Cell Cycle Ontology

Slide credit: Kevin B. Cohen
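A frame with ontology-constrained slots can be sketched as a mapping from slot names to allowed classes, plus an is-a check. The slot constraints below are copied from the REGULATION frame on the slide; the tiny `PARENT` hierarchy and the function names are hypothetical stand-ins for the real ontologies and for whatever Protégé provides.

```python
# Slot constraints for the REGULATION event type, as listed on the slide.
REGULATION = {
    "AtLoc": {"biological_entity"},
    "Cause": {"protein"},
    "CSite": {"biological_concept", "polypeptide_region"},
    "Event_action": {"trigger_word", "detection_method"},
    "Site": {"biological_concept", "polypeptide_region"},
    "Theme": {"protein", "biological_process"},
    "ToLoc": {"biological_entity"},
}

# Hypothetical is-a links, for illustration only.
PARENT = {
    "protein": "biological_entity",
    "cellular_component": "biological_entity",
}


def ancestors(cls: str):
    """Return the class followed by its ancestors in the toy hierarchy."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT.get(cls)
    return chain


def slot_accepts(frame: dict, slot: str, filler_class: str) -> bool:
    """True if the filler's class, or one of its ancestors, is allowed in the slot."""
    return any(c in frame[slot] for c in ancestors(filler_class))


def violations(frame: dict, fillers: dict):
    """List the slots whose proposed filler breaks the frame's constraints."""
    return [
        slot for slot, cls in fillers.items()
        if slot not in frame or not slot_accepts(frame, slot, cls)
    ]
```

A candidate event is kept only if `violations` returns an empty list, which is one simple way to let the ontology constrain every slot.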


BioNLP’09: Methods

Partial view of ontology—reality is a little bit less clean

Slide credit: Kevin B. Cohen


BioNLP’09: Methods

BTO: BRENDA Tissue Ontology

CCO: Cell Cycle Ontology

CTO: Cell Type Ontology

GO: Gene Ontology

SO: Sequence Ontology

Slide credit: Kevin B. Cohen


BioNLP’09: Methods

  • Manual pattern-writing

    • Before availability of training data: based on native speaker intuitions, examples from PubMed, and variations on same, as in Cohen et al. (2004)

    • After release of training data: based on examination of corpus data, targeting high-frequency predicates only

    • Nominalizations predominated; used insights from Cohen et al. (2008) regarding Theme placement

    • Protein binding rules re-used from BioCreative II protein-protein interaction task

    • Eschewed use of wildcards

Slide credit: Kevin B. Cohen


BioNLP’09: Results

Task 1: P 10 points higher than second-highest

Task 2: P 14 points higher than second-highest

Task 3: P 3.4 points lower than highest (3/6)

Slide credit: Kevin B. Cohen


BioNLP’09: Results

Unofficial results: contribution of bug repairs

Still the highest precision (#2 was 62.21)

Slide credit: Kevin B. Cohen


BioNLP’09: Results

  • Contribution of coördination-handling

    • Bug-fixed results: F 27.62 (Task 1)

    • Without coordination-handling: F 24.72

    • Decrease in F of 2.9 without coördination-handling

Slide credit: Kevin B. Cohen


Syntax helps

  • 125I-labeled C3b was covalently deposited on CR2, when hemolytically active 125I-labeled C3 was added to Raji cells preincubated with iC3, factor B, properdin, and factor D, thus proving functionality of CR2-bound C3 convertase. <cr2> BINDS <c3 convertase>

  • CD8alpha(alpha) binds one HLA-A2/peptide molecule, interfacing with the alpha2 and alpha3 domains of HLA-A2 and also contacting beta2-microglobulin. <cd8alpha ( alpha )> BINDS <hla a2 / peptide molecule>

  • The binding of 109Cd to metallothionein and the thiol density of the protein were determined after incubation of a purified Zn/Cd-metallothionein preparation with either hydrogen peroxide alone, or with a number of free radical generating systems. <109cd> BINDS <metallothionein>

  • Although these shifts in alpha3 may provide a synergistic modulation of affinity, the binding of CD8 to MHC is clearly consistent with an avidity-based contribution from CD8 to TCR- peptide-MHC interactions. <Cd8> BINDS <major histocompatibility complex>


More complex examples

  • Complex noun phrases

  • The inactive C3 (iC3), which forms spontaneously in serum in low amounts by reaction of native C3 with H2O, binds noncovalently to the N-terminal part of CR2. <inactive c3> BINDS <cr2>

  • RelB binds transcriptionally active kappaB motifs in the TNF-alpha promoter in normal cells, and in vitro studies with macrophages isolated from RelB- deficient animals revealed impaired production of TNF-alpha in response to LPS and IFN-gamma. <relb> BINDS <tnf - alpha promoter>

  • Negation

  • TNP-BSA, however, did not bind to the CD4 receptor. <trinitrophenyl-bovine serum albumin> DOES_NOT_BIND <cd4 receptor>

  • Similarly, when cells expressing the wild type FSHR were treated with tunicamycin to prevent N-linked glycosylation, the resulting nonglycosylated FSHR was not able to bind FSH. <resulting nonglycosylated fsh receptor> DOES_NOT_BIND <follicle-stimulating hormone>
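The negation examples can be approximated by looking for a negation cue a few tokens before a form of "bind". This is a rough surface-level sketch, not the syntax-based method used in the lab: the cue list, the three-token window, and the function name are all assumptions.

```python
import re

# Assumption: a negation cue at most three tokens before the verb negates it.
NEGATED_BIND = re.compile(
    r"\b(?:not|never|unable|fails?|failed)\b(?:\s+\w+){0,3}?\s+(?:bind|binds|bound|binding)\b",
    re.IGNORECASE,
)
ANY_BIND = re.compile(r"\b(?:bind|binds|bound|binding)\b", re.IGNORECASE)


def bind_predicate(sentence: str):
    """Return DOES_NOT_BIND, BINDS, or None when no binding verb is present."""
    if NEGATED_BIND.search(sentence):
        return "DOES_NOT_BIND"
    if ANY_BIND.search(sentence):
        return "BINDS"
    return None
```

A window over surface tokens cannot determine the true scope of negation, which is one reason a dependency-based analysis is preferable for sentences like these.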


Coordination is particularly hard

In contrast both the S4GGnM-R and the Man-R are able to bind Man-BSA. <mannose receptor> BINDS <man bsa> <s4ggnm - r> BINDS <man bsa>

Purified recombinant NC1, like authentic NC1, also bound specifically to fibronectin, collagen type I, and a laminin 5/6 complex. <authentic nc1> BINDS <laminin 5 / 6 complex> <authentic nc1> BINDS <collagen type I> <authentic nc1> BINDS <fibronectin> <purified recombinant nc1> BINDS <laminin 5 / 6 complex> <purified recombinant nc1> BINDS <collagen type I> <purified recombinant nc1> BINDS <fibronectin>

The nonvisual arrestins, beta-arrestin and arrestin3, but not visual arrestin, bind specifically to a glutathione S-transferase-clathrin terminal domain fusion protein. *<Arrestin3> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein> <beta arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein> <nonvisual arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
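Once the conjuncts on each side of the verb have been identified, the assertions in the NC1 example follow from distributing the predicate over every subject-object pair. A minimal sketch of that distribution step, assuming the conjunct lists are already available from a parser (finding them is the hard part and is not shown; the function name is hypothetical):

```python
from itertools import product


def expand_coordination(subjects, predicate, objects):
    """Emit one binary assertion per (subject, object) pair."""
    return [(s, predicate, o) for s, o in product(subjects, objects)]


assertions = expand_coordination(
    ["purified recombinant nc1", "authentic nc1"],
    "BINDS",
    ["fibronectin", "collagen type I", "laminin 5 / 6 complex"],
)
for subject, predicate, obj in assertions:
    print(f"<{subject}> {predicate} <{obj}>")
```

Two subjects and three objects give the six assertions listed for that sentence.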


BioNLP Shared Task ’11

  • Extension of BioNLP’09 tasks

    • Generalization to full text (from abstracts)

    • Additional event types: post-translational modifications and catalysis

  • Methods:

    • Based on empirically derived patterns

    • Derived from training data + manual refinement

    • Using dependency relations (syntax)

    • Work of Haibin Liu (postdoc)


Integrating background knowledge

  • Can improve OpenDMAP precision with minimal cost to recall

    • Take advantage of background knowledge

    • Tighten constraints on slot fillers in the ontology

    • No change to existing patterns

  • Proof of concept:

    • Distinguish among several types of protein activation (enzyme and receptor) in GeneRIFs

    • Utilize Gene Ontology annotations


Refining selectional restrictions

TP: [GeneRIF 104155]

an ER stress induces the activation of [caspase-12_protein - catalytic activity]activated_entity via [caspase-3_protein]activator

prevented FP: [GeneRIF 105594]

factor Xa can induce mesangial cell proliferation through the activation of ERK_protein via PAR2_protein in mesangial cells
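Tightening a slot constraint with background knowledge can be sketched as a lookup against Gene Ontology annotations before a slot filler is accepted. Everything below is illustrative: the annotation table is a hand-written stand-in for real GO annotations, `protein_X` is a hypothetical entity, and the mapping from activation type to required annotation is an assumption. Only the caspase-12 "catalytic activity" annotation is taken from the example above.

```python
# Stand-in for GO annotations; real data would come from a GO annotation file.
GO_ANNOTATIONS = {
    "caspase-12": {"catalytic activity"},   # as shown in the TP example
    "protein_X": {"receptor activity"},     # hypothetical entity and annotation
}

# Assumed constraint: the activated entity must carry this annotation.
REQUIRED_ANNOTATION = {
    "enzyme_activation": "catalytic activity",
    "receptor_activation": "receptor activity",
}


def activation_types(activated_entity: str):
    """Return the activation types whose slot constraint the entity satisfies."""
    annotations = GO_ANNOTATIONS.get(activated_entity, set())
    return [
        event_type
        for event_type, required in REQUIRED_ANNOTATION.items()
        if required in annotations
    ]
```

An entity with no matching annotation satisfies no constraint, so the candidate extraction is dropped without any change to the patterns themselves.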


Results


Biological entities

  • Genes (and their products) are particularly valuable to recognize, but are not the only entities of interest:

    • Diseases

    • Drugs, Chemicals, and other treatments

    • Anatomical and other locations

    • Time and temporal relationships

    • Methods and evidence

    • Molecular functions, biological processes


Biological Concept Recognition


Two dictionary-based tools tested against CRAFT

  • UIMA ConceptMapper

    http://incubator.apache.org/uima/sandbox.html#concept.mapper.annotator

    • stemming and case matching relaxation

    • non-contiguous spans

    • ignore stopwords

    • order-independent lookup

  • Open Biomedical Annotator

    http://bioportal.bioontology.org/annotator

    • ignore stopwords

    • partial word matches
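The matching options listed above can be illustrated with a toy dictionary matcher. This is not the ConceptMapper or Open Biomedical Annotator API: the function names, the stopword list, the one-rule stemmer, and the `EX:0001` identifier are all hypothetical, chosen only to show how each relaxation changes what counts as a match.

```python
import re

STOPWORDS = {"of", "the", "a", "an", "in"}  # assumed list, for illustration


def naive_stem(token: str) -> str:
    """Strip a trailing 's' from longer tokens; a stand-in for a real stemmer."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def lookup_key(text, *, case_insensitive=True, stem=True,
               ignore_stopwords=True, order_independent=True):
    """Normalize a term or mention into a tuple used as the dictionary key."""
    if case_insensitive:
        text = text.lower()
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if ignore_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    if order_independent:
        tokens = sorted(tokens)
    return tuple(tokens)


def build_index(terms: dict, **options):
    """Map each normalized term name to its concept identifier."""
    return {lookup_key(name, **options): term_id for term_id, name in terms.items()}


def find_concept(index: dict, mention: str, **options):
    """Look up a mention using the same normalization as the index."""
    return index.get(lookup_key(mention, **options))


terms = {"EX:0001": "cell migration"}  # hypothetical identifier
relaxed = build_index(terms)
print(find_concept(relaxed, "Migration of cells"))
```

With all relaxations on, "Migration of cells" maps to the same key as "cell migration"; with all of them switched off, only the exact string matches.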


Best run results

  • CM/CTO: stemming + FindAllMatches: false

  • OBA/CTO: using default stop words

  • CM/GO_CC: stemming + caseMatch: insensitive

  • CM/ChEBI: caseMatch: sensitive


Concept Matching Conclusions

  • The kinds of terms in the ontology matter

  • The strategies used in the dictionary matching tools matter

  • OpenDMAP will support strategies that go beyond dictionary matching …


Evaluation via Test Suite

  • Big picture: How to evaluate ontology concept recognition systems?

  • Traditional approach: “corpus”

    • Expensive

    • Time-consuming to produce

    • Redundancy for some things…

    • …underrepresentation of others

  • Immediate (narrow) goal of this work: Use techniques from software testing and descriptive linguistics to build test suites that:

    • Control test data

    • Eliminate redundancy

    • Systematic coverage (Oepen 1998)

  • Immediate (broad) goal of this work: Are there general principles for test suite design?

Slide credit: Kevin B. Cohen


Methods

  • Steps: develop “catalogue” of dimensions along which terms vary

  • Use insights from linguistics and from how we know concept recognition systems work

    • Structural aspects: length

    • Content aspects: typography, orthography, lexical contents (function words)…

  • …to build a structured set of test cases

  • Also compare to other test suite work (Cohen et al. 2004) to look for common principles

Slide credit: Kevin B. Cohen


Structured test suite

Canonical -> Non-canonical

  • GO:0000133 Polarisome -> Polarisomes

  • GO:0000108 Repairosome -> Repairosomes

  • GO:0000786 Nucleosome -> Nucleosomes

  • GO:0001660 Fever -> Fevers

  • GO:0001726 Ruffle -> Ruffles

  • GO:0005623 Cell -> Cells

  • GO:0005694 Chromosome -> Chromosomes

  • GO:0005814 Centriole -> Centrioles

  • GO:0005874 Microtubule -> Microtubules

  • induction of apoptosis -> apoptosis induction (Syntax)

  • cell migration -> cell migrated (Part of speech)

  • ensheathment of neurons -> ensheathment of some neurons

Slide credit: Kevin B. Cohen
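Some of the non-canonical forms above can be generated mechanically from the canonical terms. The sketch below covers three of the transformations shown (plural, "X of Y" reordering, and quantifier insertion); the function names and the deliberately naive rules are assumptions, and the part-of-speech variation is left out because it needs morphological knowledge.

```python
import re


def pluralize(term: str):
    """Naive plural for single-word terms only."""
    return term + "s" if " " not in term else None


def swap_of_phrase(term: str):
    """Rewrite 'X of Y' as 'Y X' (syntactic variation)."""
    match = re.fullmatch(r"(\w+) of (.+)", term)
    return f"{match.group(2)} {match.group(1)}" if match else None


def insert_quantifier(term: str, quantifier: str = "some"):
    """Insert a quantifier after the first 'of'."""
    return term.replace(" of ", f" of {quantifier} ", 1) if " of " in term else None


def non_canonical_variants(term: str) -> dict:
    """Return every applicable variant, keyed by the dimension it varies."""
    candidates = {
        "plural": pluralize(term),
        "syntax": swap_of_phrase(term),
        "quantifier": insert_quantifier(term),
    }
    return {kind: variant for kind, variant in candidates.items() if variant is not None}


print(non_canonical_variants("induction of apoptosis"))
```

Keying each variant by the dimension it varies keeps the suite structured, so a failure can be traced back to one controlled change.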


Methods/Results

  • Gene Ontology, revision 9/24/2009

  • Canonical: 188

  • Non-canonical: 117

  • Observation:

    • 5:1 “dirty” versus 5:1 “clean” is mark of “mature” testing

  • Applied publicly available concept recognition system

Slide credit: Kevin B. Cohen


Results

  • 97.9% of canonical terms were recognized

    • All exceptions contain the word in

  • No non-canonical terms were recognized

  • What would it take to recognize the error pattern with canonical terms with a corpus-based approach??

  • General principles: Length, ortho/typography (numerals/punctuation), function/stopwords, syntactic context

Slide credit: Kevin B. Cohen
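Running a recognizer over a structured test suite and reporting results per category can be sketched as below. The harness, the exact-match toy recognizer, and the two-term suite are hypothetical and only illustrate the reporting; they are not the system that was evaluated. For reference, 97.9% of the 188 canonical terms would correspond to 184 recognized and 4 missed (184/188 ≈ 0.979).

```python
def evaluate(recognizer, suite: dict) -> dict:
    """Return the fraction of test cases recognized correctly in each category."""
    results = {}
    for category, cases in suite.items():
        hits = sum(1 for case in cases if recognizer(case["text"]) == case["id"])
        results[category] = hits / len(cases)
    return results


# Toy recognizer: exact, case-insensitive dictionary lookup.
DICTIONARY = {"polarisome": "GO:0000133", "cell": "GO:0005623"}


def exact_match_recognizer(text: str):
    return DICTIONARY.get(text.lower())


suite = {
    "canonical": [
        {"id": "GO:0000133", "text": "Polarisome"},
        {"id": "GO:0005623", "text": "Cell"},
    ],
    "non-canonical": [
        {"id": "GO:0000133", "text": "Polarisomes"},
        {"id": "GO:0005623", "text": "Cells"},
    ],
}

print(evaluate(exact_match_recognizer, suite))
```

Because every test case records which dimension it varies, a systematic failure shows up as a whole category scoring zero, rather than being scattered across a corpus.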

