Text Mining Applications for Literature Curation. Kimberly Van Auken, WormBase Consortium, Textpresso, Gene Ontology Consortium. WormBase: A Database for C. elegans and Other Nematodes. www.wormbase.org

Presentation Transcript
slide1

Text Mining Applications for Literature Curation

Kimberly Van Auken

WormBase Consortium

Textpresso

Gene Ontology Consortium

slide3

Curating Diverse Data Types

Aggregation Behavior

Which worms aggregate with other worms and what contributes to that behavior?

Bendesky et al., 2012, PLoS Genetics

slide4

Curating Diverse Data Types

Aggregation Behavior

Which worms (Strain) aggregate with other worms and what contributes to that behavior?

Bendesky et al., 2012, PLoS Genetics

slide5

Curating Diverse Data Types

Aggregation Behavior

Which worms (Strain) aggregate with other worms and what contributes to that behavior?

Bendesky et al., 2012, PLoS Genetics

  • Strain information:
  • August 1, 1972
  • Pineapple field in Hawaii
slide6

Curating Diverse Data Types

Aggregation Behavior

Which worms aggregate with other worms (Phenotype) and what contributes to the behavior?

Bendesky et al., 2012, PLoS Genetics

slide7

Curating Diverse Data Types

Aggregation Behavior

Which worms aggregate with other worms (Phenotype) and what contributes to that behavior?

Bendesky et al., 2012, PLoS Genetics

  • Worm Phenotype Ontology (WPO): Bordering (WBPhenotype:0001820)
  • Life stage ontology, e.g., L3 larval stage
  • Assay, e.g., food source
slide8

Curating Diverse Data Types

Aggregation Behavior

Which worms (Strain) aggregate with other worms (Phenotype) and what contributes to that behavior (Molecular Basis)?

Bendesky et al., 2012, PLoS Genetics

slide9

Curating Diverse Data Types

Aggregation Behavior

Which worms (Strain) aggregate with other worms (Phenotype) and what contributes to that behavior (Molecular Basis)?

Bendesky et al., 2012, PLoS Genetics

  • Gene: npr-1
  • Variation: ad609 (T(83)->I and T(144)->A)
  • Gene Ontology for npr-1:
    • Biological Process: feeding behavior
    • Molecular Function: neuropeptide receptor activity
    • Cellular Component: integral to plasma membrane
slide10

Literature Curation Workflow

PubMed keyword search – ‘elegans’

Full text paper acquisition

Data type flagging and entity recognition

Detailed curation/Fact extraction

slide11

Finding Papers: Daily, automated PubMed searches using keyword ‘elegans’

Download citation XML

[Screenshot: citation table with columns for article type, curator actions, PMID, title, authors, abstract, and journal]
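
The search-and-download step could be sketched roughly as follows. This is an illustration only, not the WormBase script: it builds an NCBI E-utilities ESearch URL for the keyword 'elegans' and pulls the fields named on the slide out of PubMed citation XML. The function names, the `retmax` value, and the sample record (including its placeholder PMID) are assumptions made for the example, and no network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def daily_search_url(term="elegans", days=1):
    """Build an ESearch URL for records added to PubMed in the last `days` days."""
    params = {"db": "pubmed", "term": term, "reldate": days,
              "datetype": "edat", "retmax": 500}  # retmax is an arbitrary choice
    return ESEARCH + "?" + urlencode(params)

def text_of(elem):
    """Full text of an element, including text inside inline markup."""
    return "".join(elem.itertext()).strip() if elem is not None else ""

def parse_citations(xml_text):
    """Extract PMID, title, authors, abstract, journal and article type."""
    records = []
    root = ET.fromstring(xml_text.strip())
    for entry in root.iter("PubmedArticle"):
        citation = entry.find("MedlineCitation")
        article = citation.find("Article")
        records.append({
            "pmid": text_of(citation.find("PMID")),
            "title": text_of(article.find("ArticleTitle")),
            "authors": [
                f"{text_of(a.find('LastName'))} {text_of(a.find('Initials'))}".strip()
                for a in article.findall("AuthorList/Author")
            ],
            "abstract": " ".join(text_of(t) for t in article.findall("Abstract/AbstractText")),
            "journal": text_of(article.find("Journal/Title")),
            "article_type": [text_of(t) for t in
                             article.findall("PublicationTypeList/PublicationType")],
        })
    return records

# Hypothetical record, invented for this example.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">00000000</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>A hypothetical <i>C. elegans</i> paper</ArticleTitle>
        <Abstract><AbstractText>Toy abstract.</AbstractText></Abstract>
        <AuthorList><Author><LastName>Doe</LastName><Initials>J</Initials></Author></AuthorList>
        <PublicationTypeList><PublicationType>Journal Article</PublicationType></PublicationTypeList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

if __name__ == "__main__":
    print(daily_search_url())
    for record in parse_citations(SAMPLE_XML):
        print(record)
```

In a real pipeline the PMIDs returned by ESearch would be passed to EFetch to obtain the citation XML, and the curator actions shown in the screenshot would be recorded alongside each parsed record.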

slide12

Literature Curation Workflow – Full Text Acquisition

  • Fully manual step
  • Done for all papers we select
  • Electronic copies stored in curation database
slide13

Data Type Flagging/Triage

  • Data Type Flagging/Triage:
  • General classification of papers
  • What types of experiments are in a paper?
  • e.g. RNAi phenotypes, Variation phenotypes, Expression patterns, Physical interactions
slide14

Data Type Flagging Methods

  • Main pipeline:
  • Support Vector Machines (SVMs)
  • Other methods:
  • Textpresso category searches
  • Hidden Markov models
  • Pattern matching scripts
slide15

Support Vector Machines: Document Classification

  • Machine learning models
  • Use positive and negative gold standard sets of papers to train (e.g., papers with/without RNAi experiments)
    • Positives: 100s, Negatives: 1000s
  • Resulting model classifies all new papers as negative or positive (high, medium, low confidence)
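
To make the idea concrete, here is a toy, pure-Python linear SVM trained by sub-gradient descent on the hinge loss over bag-of-words features. It is a sketch under stated assumptions, not the classifier described in Fang et al.: the four training sentences, the stopword list, the hyperparameters, and the score thresholds used for the high/medium/low confidence bins are all invented for illustration.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "by", "to", "a", "and", "in", "was", "were", "we"}

def featurize(text):
    """Bag-of-words term counts with a tiny stopword filter."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def score(weights, bias, features):
    """Signed distance-like score: positive means 'contains the data type'."""
    return bias + sum(weights.get(t, 0.0) * c for t, c in features.items())

def train_linear_svm(docs, labels, epochs=50, lr=0.1, lam=0.01):
    """Minimise L2-regularised hinge loss; labels are +1 or -1."""
    weights, bias = {}, 0.0
    examples = [(featurize(d), y) for d, y in zip(docs, labels)]
    for _ in range(epochs):
        for x, y in examples:
            margin = y * score(weights, bias, x)
            for t in weights:                 # L2 shrinkage on every step
                weights[t] *= 1.0 - lr * lam
            if margin < 1.0:                  # inside the margin: hinge sub-gradient step
                for t, c in x.items():
                    weights[t] = weights.get(t, 0.0) + lr * y * c
                bias += lr * y
    return weights, bias

def classify(weights, bias, text, medium=0.5, high=1.0):
    """Map the score to negative or positive with a confidence bin (assumed thresholds)."""
    s = score(weights, bias, featurize(text))
    if s <= 0:
        return "negative"
    if s >= high:
        return "positive (high)"
    if s >= medium:
        return "positive (medium)"
    return "positive (low)"

# Toy gold standard: +1 = paper text with RNAi experiments, -1 = without.
DOCS = [
    "RNAi by feeding caused a knockdown phenotype",
    "The crystal structure of the protein domain was solved",
    "Injection of dsRNA for RNAi produced a phenotype",
    "We report the genome assembly and protein annotation",
]
LABELS = [+1, -1, +1, -1]

if __name__ == "__main__":
    w, b = train_linear_svm(DOCS, LABELS)
    print(classify(w, b, "dsRNA feeding for RNAi"))
    print(classify(w, b, "crystal structure of a protein"))
```

A production setup would train on the hundreds of positive and thousands of negative papers mentioned above, using full-text features and an established SVM library rather than this hand-rolled trainer.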
slide16

Data Type Flagging – Support Vector Machines

SVMs trained for ten different data types:

  • Antibody
  • Genetic Interactions
  • Physical Interactions
  • Gene Expression
  • Regulation of Gene Expression
  • Variation Phenotypes
  • Overexpression Phenotypes
  • RNAi Phenotypes
  • Variation Sequence Change
  • Gene Structure Correction

See: Fang R, et al. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics. 13(1):16

slide17

Curation from Support Vector Machine Results

  • SVM results lead directly to manual curation:
  • e.g. RNAi Phenotypes
  • Results from SVMs are processed further
  • e.g. Variation Sequence Change

Pattern Matching Script – regular expressions

New variations (entity recognition)

e.g. mg366, ju43, e1360
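
A minimal sketch of such a pattern-matching step is shown below. The regular expression is an assumption inferred from the three example names (one to three lowercase letters followed by digits); the actual script's patterns are not given on the slide, and the set of already-known variations is hypothetical.

```python
import re

# Assumed pattern: 1-3 lowercase letters followed by digits, e.g. mg366, ju43, e1360.
# It does not cover suffixed names such as e2406ts and will also match
# unrelated tokens of the same shape, so results still need curator review.
VARIATION = re.compile(r"\b[a-z]{1,3}\d+\b")

def find_variations(text):
    """Return candidate variation names in order of first appearance."""
    seen = []
    for name in VARIATION.findall(text):
        if name not in seen:
            seen.append(name)
    return seen

def new_variations(text, known):
    """Keep only candidates that are not already in the known set."""
    return [v for v in find_variations(text) if v not in known]

if __name__ == "__main__":
    sentence = "We sequenced mg366, ju43 and e1360 in npr-1 mutants."
    known = {"e1360"}  # hypothetical: variations already in the database
    print(find_variations(sentence))
    print(new_variations(sentence, known))
```

Running the pattern only on papers the SVM flagged for Variation Sequence Change keeps the number of spurious matches manageable.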

slide18

Data Type Flagging – Textpresso

Full text of articles

Terms, phrases, entities – semantically tagged

Keyword or category search

Match within sentence or entire paper

Wnt Pathway

HIV

Nematodes

S. cerevisiae

RegulonDB

….many others

C. elegans

Mouse

D. melanogaster

Neuroscience

Arabidopsis

Dicty

www.textpresso.org

slide19

Textpresso Categories

  • Pre-existing dictionaries, vocabularies:
  • Gene names
  • ChEBI (Chemical Entities of Biological Interest)
  • PATO
  • Sequence Ontology (SO)
  • Manually constructed by curators using language from published literature:
    • Sequence similarity – orthologous, conserved
    • Localization assays – GFP, antibody, fluorescence
    • Experimental verbs – required, regulates, exhibits
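
As a rough illustration of how curator-built categories can be used to tag text, the sketch below marks each token of a sentence with the categories whose term lists contain it. The category keys and the test sentence are invented for the example; the term lists are only the three examples per category given above, and Textpresso's real lexica and markup are far richer.

```python
import re

# Term lists taken from the examples above; keys are hypothetical names.
CATEGORIES = {
    "sequence_similarity": {"orthologous", "conserved"},
    "localization_assay": {"gfp", "antibody", "fluorescence"},
    "experimental_verb": {"required", "regulates", "exhibits"},
}

def tag_sentence(sentence):
    """Return (token, [categories]) pairs for every token in the sentence."""
    tagged = []
    for token in re.findall(r"[A-Za-z0-9-]+", sentence):
        cats = sorted(c for c, terms in CATEGORIES.items() if token.lower() in terms)
        tagged.append((token, cats))
    return tagged

def categories_in(sentence):
    """Set of all categories that occur anywhere in the sentence."""
    return {c for _, cats in tag_sentence(sentence) for c in cats}

if __name__ == "__main__":
    toy = "The GFP reporter is required in conserved neurons."
    for token, cats in tag_sentence(toy):
        print(token, cats)
```

Once sentences carry category tags like these, a search can ask for combinations of categories rather than literal keywords.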
slide20

Data Type Flagging - Textpresso Category Searches

  • Data Type: C. elegans Human Disease Homologs
  • Three-category Textpresso search:
  • C. elegans gene
  • ’Ortholog’, ’Homolog’, ’Similar’, ’Model’
  • Human disease

“We map this defect in dauer response to a mutation in the scd-2 gene, which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) homolog, a proto-oncogene receptor tyrosine kinase.”
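
A three-category search of this kind might be sketched as follows. This is not Textpresso's implementation: the gene-name regex, the short disease list, and the naive sentence splitter are assumptions for illustration, while the homology terms come from the slide. The `scope` argument mirrors the choice between matching within a sentence and matching within the entire paper.

```python
import re

GENE = re.compile(r"\b[a-z]{3,4}-\d+\b")                        # assumed gene-name shape, e.g. scd-2
HOMOLOGY = re.compile(r"\b(ortholog|homolog|similar|model)\w*", re.IGNORECASE)
DISEASE = re.compile(r"\b(lymphoma|cancer|diabetes)\b", re.IGNORECASE)  # hypothetical list
CATEGORY_PATTERNS = [GENE, HOMOLOGY, DISEASE]

def split_sentences(text):
    """Naive splitter; abbreviations such as 'C. elegans' would be split wrongly."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def matches_all(unit):
    """True when every category has at least one hit in this unit of text."""
    return all(p.search(unit) for p in CATEGORY_PATTERNS)

def category_search(text, scope="sentence"):
    """Return the sentences (or the whole text) in which all categories co-occur."""
    units = split_sentences(text) if scope == "sentence" else [text]
    return [u for u in units if matches_all(u)]

if __name__ == "__main__":
    hit = ("We map this defect in dauer response to a mutation in the scd-2 gene, "
           "which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) "
           "homolog, a proto-oncogene receptor tyrosine kinase.")
    print(category_search(hit))
```

Sentence-level matching is stricter and tends to return fewer, more relevant hits; paper-level matching is useful when the three concepts are discussed in different sentences.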

slide21

Literature Curation Workflow

PubMed keyword search – ‘elegans’

Full text paper acquisition

Data type flagging and entity recognition

Detailed curation/Fact extraction

slide22

Textpresso: Semi-Automated Fact Extraction

  • Genetic Interactions
    • Interestingly, pph-5 (tm2979) behaved similarly to pph-5 (av101) in its ability to dominantly (but weakly) suppress sep-1 (e2406ts), but recessively suppress sep-1(ax110) (supplementary material Table S1).
  • Physical Interactions – after SVM document classifier
    • Remarkably, only AIN-1 coimmunoprecipitated HA-tagged CePAB-1 (Figure 3A and B, lane 7).
  • Gene Ontology – Cellular Component Curation
    • During embryogenesis, PAN-1 protein is uniformly distributed throughout the cytoplasm of the germline and somatic blastomeres, as seen for pan-1 mRNA (Fig. 2A), with no obvious concentration of PAN-1 in the P granules (Fig. 2K, N).
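
For the genetic interaction case, one way to surface candidate sentences is to require at least two distinct gene(allele) mentions plus an interaction verb in the same sentence. The sketch below is a simplified stand-in for the category-based searches described here; both regular expressions and the verb list are assumptions, and the output is a candidate for curator review rather than a finished annotation.

```python
import re

# Assumed shapes: gene "pph-5", allele in parentheses "(tm2979)" or "(e2406ts)".
GENE_ALLELE = re.compile(r"\b([a-z]{3,4}-\d+)\s*\(([a-z]{1,3}\d+[a-z]*)\)")
INTERACTION = re.compile(r"\b(suppress\w*|enhanc\w*)\b", re.IGNORECASE)

def candidate_genetic_interaction(sentence):
    """Return genes, alleles and verbs if the sentence looks like an interaction."""
    mentions = GENE_ALLELE.findall(sentence)
    genes = sorted({gene for gene, _ in mentions})
    verbs = [m.group(1).lower() for m in INTERACTION.finditer(sentence)]
    if len(genes) >= 2 and verbs:
        return {"genes": genes, "alleles": mentions, "verbs": verbs}
    return None

if __name__ == "__main__":
    sentence = ("Interestingly, pph-5 (tm2979) behaved similarly to pph-5 (av101) in its "
                "ability to dominantly (but weakly) suppress sep-1 (e2406ts), but "
                "recessively suppress sep-1(ax110) (supplementary material Table S1).")
    print(candidate_genetic_interaction(sentence))
```

The curator still decides which allele pairs interact and in what manner (dominant or recessive suppression here); the extraction step only narrows down where to look.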
slide23

Textpresso: Semi-Automated GO Cellular Component Curation

[Screenshot labels: Textpresso Component, Gene Products, Suggested GO Annotations, Textpresso Search Results]

See: Van Auken KM, Jaffery J, Chan J, Müller HM, Sternberg PW. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) cellular component curation. BMC Bioinformatics. 10:228.
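
The step from a matching sentence to a suggested annotation could look roughly like the sketch below. It is an assumption-laden illustration, not the published tool: the protein-name regex and the three-entry component lookup are invented for the example, and the GO identifiers should be checked against the current GO release before any real use.

```python
import re

# Small hypothetical lookup from component terms to GO cellular component IDs.
COMPONENT_TO_GO = {
    "cytoplasm": "GO:0005737",
    "nucleus": "GO:0005634",
    "p granule": "GO:0043186",
}
PROTEIN = re.compile(r"\b[A-Z]{2,4}-\d+\b")  # assumed protein-name shape, e.g. PAN-1

def suggest_annotations(sentence):
    """Pair every protein mention with every component term found in the sentence."""
    proteins = sorted(set(PROTEIN.findall(sentence)))
    lowered = sentence.lower()
    suggestions = []
    for term, go_id in COMPONENT_TO_GO.items():
        if term in lowered:
            for protein in proteins:
                suggestions.append((protein, term, go_id))
    return suggestions

if __name__ == "__main__":
    sentence = ("During embryogenesis, PAN-1 protein is uniformly distributed throughout "
                "the cytoplasm of the germline and somatic blastomeres, as seen for pan-1 "
                "mRNA (Fig. 2A), with no obvious concentration of PAN-1 in the P granules "
                "(Fig. 2K, N).")
    for suggestion in suggest_annotations(sentence):
        print(suggestion)
```

On this sentence the sketch suggests both cytoplasm and P granule for PAN-1, although the text reports no concentration in P granules. That is the reason the process is semi-automated: a curator accepts or rejects each suggestion.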

slide24

Future Directions

  • Textpresso, other methods (HMMs) applied to additional data types
    • e.g. GO Biological Process curation (Phenotypes)
  • Focusing triage and fact extraction on novel findings
      • How best to integrate existing knowledge into curation pipelines to focus curator effort on new experimental results?
      • e.g. Commonly used molecular markers
slide25

Literature Annotation Tool – Tracking Evidence

WB, GO Common Annotation Framework, BioCreative

slide26

Summary

  • Text Mining Applications for Literature Curation:
  • Paper approval and full text acquisition
  • Data type flagging and entity recognition
  • Fact extraction – record evidence
  • All steps of our pipeline incorporate some form of semi- or fully automated approaches:
  • Scripts for downloads, pattern matching
  • Support Vector Machines for document classification
  • Textpresso for flagging and fact extraction
  • (Hidden Markov Models for flagging, fact extraction)
slide27

The WormBase Consortium, Textpresso

WormBase - Caltech

Textpresso - Caltech

Hans-Michael Müller

Yuling Li

James Done

Former member: Arun Rangarajan

Paul Sternberg

Juancarlos Chan

Wen Chen

Chris Grove

Ranjana Kishore

Raymond Lee

Cecilia Nakamura

Daniela Raciti

Gary Schindelman

Kimberly Van Auken

Daniel Wang

Xiaodong Wang

Karen Yook

Former member: Ruihua Fang

WormBase – OICR, Toronto

Lincoln Stein

Abigail Cabunoc

Todd Harris

JD Wong

WormBase – Washington University

John Spieth

Tamberlyn Bieri

Phil Ozersky

CGC – Oxford University, Oxford, UK

Jonathan Hodgkin

WormBase – EBI, Sanger, Hinxton, UK

Richard Durbin

Paul Kersey

Matt Berriman

Paul Davis

Michael Paulini

Kevin Howe

Mary Ann Tuli

Gary Williams

slide28

Hidden Markov Models: Semi-Automated GO Molecular Function Curation

  • For each sentence, HMM yields:
  • True positive score
  • False positive score
  • For each sentence, curator assigns:
  • Fully curatable (entity + indication of enzymatic activity)
  • Positive (experiment was performed and a result reported, but no entity)
  • False Positive (not about enzymatic activity at all)
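
The slide does not say how the two scores are computed. One plausible reading, sketched here purely as an assumption, is that a sentence is scored under two small HMMs, one fitted to curator-confirmed sentences and one to false positives, using the forward algorithm. The token classes, activity word list, and all probabilities below are invented for illustration.

```python
import math
import re

SYMBOLS = ("ENTITY", "ACTIVITY", "OTHER")
ACTIVITY_WORDS = {"kinase", "phosphatase", "activity", "catalyzes", "phosphorylates"}

def to_symbols(sentence):
    """Map each token to a coarse class (assumed scheme, not the real feature set)."""
    symbols = []
    for token in re.findall(r"[A-Za-z0-9-]+", sentence):
        if re.fullmatch(r"[A-Z]{2,4}-\d+", token):
            symbols.append("ENTITY")
        elif token.lower() in ACTIVITY_WORDS:
            symbols.append("ACTIVITY")
        else:
            symbols.append("OTHER")
    return symbols

def likelihood(model, symbols):
    """Forward algorithm: P(symbols | model), summed over all hidden state paths."""
    if not symbols:
        return 1.0
    start, trans, emit = model["start"], model["trans"], model["emit"]
    n = len(start)
    alpha = [start[s] * emit[s][symbols[0]] for s in range(n)]
    for sym in symbols[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][sym]
                 for s in range(n)]
    return sum(alpha)

# Two hypothetical two-state models with made-up parameters.
TP_MODEL = {
    "start": [0.6, 0.4],
    "trans": [[0.7, 0.3], [0.4, 0.6]],
    "emit": [{"ENTITY": 0.3, "ACTIVITY": 0.3, "OTHER": 0.4},
             {"ENTITY": 0.2, "ACTIVITY": 0.6, "OTHER": 0.2}],
}
FP_MODEL = {
    "start": [0.5, 0.5],
    "trans": [[0.8, 0.2], [0.3, 0.7]],
    "emit": [{"ENTITY": 0.1, "ACTIVITY": 0.05, "OTHER": 0.85},
             {"ENTITY": 0.15, "ACTIVITY": 0.05, "OTHER": 0.8}],
}

def score_sentence(sentence):
    """Log-likelihood of the sentence under each model."""
    symbols = to_symbols(sentence)
    return {
        "true_positive_score": math.log(likelihood(TP_MODEL, symbols)),
        "false_positive_score": math.log(likelihood(FP_MODEL, symbols)),
    }

if __name__ == "__main__":
    # "XYZ-1" is a placeholder name, not a real gene product.
    print(score_sentence("XYZ-1 has kinase activity"))
    print(score_sentence("animals were grown on plates"))
```

Sentences whose true positive score exceeds the false positive score would be shown to the curator first, and the curator's label (fully curatable, positive, or false positive) could then be fed back to refit the models.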