Research in the Verspoor Lab

Presentation Transcript
Linguistics, Lexicons, and Biomedical Verbs
  • I could go on and on and on and on
  • But I probably won’t…
Gene Normalization
  • Mapping a gene or protein name to an identifier (e.g. in GenBank)
  • Very important task for using extracted information (more useful than just a name)
  • Ambiguity
    • with English words (“to”, “dunce”, “wingless”)
    • in naming (1168 genes in Entrez named “p60”)
    • in species (949 species have a gene named “p53”)
Normalization methods
  • Heuristic approach can be effective
    • Edit distance is too coarse (some characters matter more than others)
  • Some heuristics that appear to help
    • Ignore hyphens, commas, some other interrupting punctuation (but not, e.g., ' )
    • Ignore parenthetical elements
    • Consider translations between Arabic and Roman numerals, and between Latin and Greek letters
    • Special words for compound noun phrases: receptor, precursor, mRNA, gene, protein, Greek letter names, etc.
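
The heuristics above could be combined into a single normalization function. The following is a minimal sketch, not the lab's implementation: the function name `normalize_name`, the lookup tables, and the choice to map Roman numerals to Arabic digits and Greek letters to single Latin letters are all assumptions made for illustration.

```python
import re

# Hypothetical lookup tables; the slides do not list the exact mappings used.
GREEK_CHARS = str.maketrans({"α": "a", "β": "b", "γ": "g", "δ": "d"})
GREEK_WORDS = {"alpha": "a", "beta": "b", "gamma": "g", "delta": "d"}
ROMAN = {"i": "1", "ii": "2", "iii": "3", "iv": "4", "v": "5"}


def normalize_name(name: str) -> str:
    """Reduce a gene or protein name to a canonical comparison key."""
    # Lowercase and replace Greek symbols with Latin letters.
    text = name.lower().translate(GREEK_CHARS)
    # Ignore parenthetical elements such as "(ILK)".
    text = re.sub(r"\([^)]*\)", " ", text)
    # Hyphens, commas and slashes only separate tokens; apostrophes are kept.
    tokens = [t for t in re.split(r"[\s,/\-]+", text) if t]
    key = []
    for token in tokens:
        token = GREEK_WORDS.get(token, token)  # "beta" -> "b"
        token = ROMAN.get(token, token)        # "ii" -> "2"
        key.append(token)
    return "".join(key)


if __name__ == "__main__":
    for example in ["β-parvin", "Beta parvin", "IL-2", "IL II"]:
        print(example, "->", normalize_name(example))
```

Because variants such as "IL-2" and "IL II" collapse to the same key, a plain string comparison on the keys replaces a coarse edit-distance comparison. A real system would need a guard against single letters such as "v" being read as Roman numerals.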
Gene Normalization: a species-based approach
  • Based on species detection (NCBI Taxonomy terms)
    • Global cues:
      • First (first species mention)
      • Abstract (most frequent species in abstract)
      • Majority (most frequent species in doc)
    • Local cues, close to gene reference:
      • Recency
      • Window (most frequent in window)
    • “mixed” strategy setting confidence
      • “First” >> “Recency” >> “Window” >> “Majority”
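
One possible reading of the "mixed" strategy is that every cue proposes a species and the proposals are ranked by cue confidence. The sketch below follows that reading under stated assumptions: the numeric weights, the `window` size, and the function names are hypothetical, and only the ordering of the cues comes from the slide. The "Abstract" cue is omitted because it needs abstract boundaries as extra input.

```python
from collections import Counter
from typing import Dict, List, Tuple

Mention = Tuple[int, str]  # (token offset, species name)

# Hypothetical weights; only their relative order is taken from the slide.
CUE_CONFIDENCE = {"first": 4, "recency": 3, "window": 2, "majority": 1}


def most_frequent(names: List[str]) -> str:
    """Most common name; ties go to the name seen first."""
    return Counter(names).most_common(1)[0][0]


def species_candidates(gene_pos: int, mentions: List[Mention],
                       window: int = 20) -> List[Tuple[str, str]]:
    """Return (species, cue) pairs, most confident cue first."""
    if not mentions:
        return []
    ordered = sorted(mentions)
    # Global cue: first species mentioned in the document.
    proposals: Dict[str, str] = {"first": ordered[0][1]}
    # Local cue: last species mentioned at or before the gene reference.
    before = [name for pos, name in ordered if pos <= gene_pos]
    if before:
        proposals["recency"] = before[-1]
    # Local cue: most frequent species within the window around the gene.
    nearby = [name for pos, name in ordered if abs(pos - gene_pos) <= window]
    if nearby:
        proposals["window"] = most_frequent(nearby)
    # Global cue: most frequent species in the whole document.
    proposals["majority"] = most_frequent([name for _, name in ordered])
    ranked = sorted(proposals.items(), key=lambda kv: -CUE_CONFIDENCE[kv[0]])
    return [(species, cue) for cue, species in ranked]


if __name__ == "__main__":
    doc = [(0, "human"), (30, "mouse"), (35, "mouse"), (90, "yeast")]
    print(species_candidates(40, doc))
```

Returning all ranked proposals, rather than a single winner, leaves room for a downstream step to check which proposed species actually has a gene with the mentioned name.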
Putting it all together: BioCreative II.5 System Architecture






UniProt Dictionary Match
  • Trie-based data structure
  • Protein names and synonyms normalized upon insertion
    • reduces number of variants
    • same form we search for in the text
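
A trie keyed on normalized names might look like the sketch below. It is an illustration only: the class `NameTrie`, the `"$ids"` marker key, and the simplified `normalize` helper (lowercase, letters and digits only) are assumptions, not the data structure actually used in the system.

```python
import re
from typing import Dict, Optional, Set, Tuple


def normalize(text: str) -> str:
    # Simplified stand-in: lowercase and keep ASCII letters and digits only.
    return re.sub(r"[^a-z0-9]", "", text.lower())


class NameTrie:
    """Character trie storing identifiers under normalized names."""

    def __init__(self) -> None:
        self.root: Dict = {}

    def insert(self, name: str, identifier: str) -> None:
        # Names are normalized on insertion, so variants share one path.
        node = self.root
        for ch in normalize(name):
            node = node.setdefault(ch, {})
        node.setdefault("$ids", set()).add(identifier)

    def longest_match(self, text: str,
                      start: int = 0) -> Optional[Tuple[int, Set[str]]]:
        """Return (end offset, ids) for the longest name starting at `start`."""
        node, best = self.root, None
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if "$ids" in node:
                best = (i + 1, node["$ids"])
        return best


if __name__ == "__main__":
    trie = NameTrie()
    trie.insert("Affixin", "Q9HBI1")
    sentence = normalize("Affixin is a focal adhesion protein")
    print(trie.longest_match(sentence))
```

The text is passed through the same `normalize` function before lookup, which is what "same form we search for in the text" implies. Token boundary checks would have to be tracked separately, since normalization removes spaces.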
Gene candidate selection
  • Normalized string match against SwissProt names and synonyms
    • lowercase
    • eliminating punctuation (apostrophes, hyphens, and parentheses)
    • converting Greek letters and Roman numerals to a standard form
    • removing spaces
  • Left and right token boundary constraints (right constraint relaxed for plurals)
Protein Match example
  • Sentence:

Affixin/β-parvin is an integrin-linked kinase (ILK)-binding focal adhesion protein highly expressed in skeletal muscle and heart.

  • Normalized Sentence: affixinbparvinisanintegrinlinkedkinaseilkbindingfocaladhesion


  • Match Affixin to affixin (ID: Q9HBI1)
  • Match β-parvin to bparvin (ID: Q9HBI1)
Species detection
  • Dictionary lookup using UIMA Concept Mapper loaded with NCBI Taxonomy
  • Match species and sub-species; traverse is-a hierarchy for sub-species
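
Traversing the is-a hierarchy for sub-species can be sketched as a walk up parent links until a node of rank "species" is reached. The tables below are a toy fragment standing in for NCBI Taxonomy; the in-memory layout and the function name `resolve_to_species` are assumptions for illustration.

```python
from typing import Dict, Optional

# Toy fragment standing in for NCBI Taxonomy (hypothetical layout).
PARENT: Dict[str, str] = {
    "Mus musculus domesticus": "Mus musculus",
    "Mus musculus": "Mus",
    "Homo sapiens": "Homo",
}
RANK: Dict[str, str] = {
    "Mus musculus domesticus": "subspecies",
    "Mus musculus": "species",
    "Mus": "genus",
    "Homo sapiens": "species",
    "Homo": "genus",
}


def resolve_to_species(taxon: str) -> Optional[str]:
    """Follow is-a links upward until a species-rank node is found."""
    node: Optional[str] = taxon
    while node is not None:
        if RANK.get(node) == "species":
            return node
        node = PARENT.get(node)
    return None  # no species-rank ancestor, e.g. a genus-level mention


if __name__ == "__main__":
    print(resolve_to_species("Mus musculus domesticus"))
```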
BC II.5 results



KNoGM and KaBOB
  • KNoGM: Knowledge-based Normalization of Gene Mentions
  • Strategy based on WSD methods from Agirre and Soroa, based on knowledge graphs
  • Taking advantage of biological knowledge resources
  • KaBOB: Knowledge Base Of Biology
    • Integrated resource across biological databases
    • Integrated resource across biological databases
Knowledge-based methods in Word Sense Disambiguation
  • Disambiguate words based on relations represented in a semantic graph
  • Take advantage of connections among word senses and prefer word senses that are semantically connected
  • Intuition: Spreading Activation
    • Can perform static analysis of the graph to determine most likely disambiguations based only on the state of connections in the graph
    • More effective: dynamic, consider words in context
UKB: Agirre & Soroa knowledge-based WSD
  • Knowledge-based word sense disambiguation method
    • knowledge = WordNet graph
    • algorithm = (personalized) PageRank

PageRank: ranks vertices in a graph according to their relative structural importance

Personalized PageRank: bias certain vertices; “activation” from a vertex increases
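
To make the idea concrete, here is a small power-iteration sketch of Personalized PageRank on a toy sense graph. It is not UKB's code: the function name, the damping value, the iteration count, and the toy graph are assumptions. The graph is given as an adjacency dictionary in which every neighbour also appears as a key, and the seed weights play the role of the context words.

```python
from typing import Dict, List


def personalized_pagerank(graph: Dict[str, List[str]],
                          seeds: Dict[str, float],
                          damping: float = 0.85,
                          iterations: int = 50) -> Dict[str, float]:
    """Power iteration with teleportation restricted to the seed vertices."""
    nodes = list(graph)
    total = sum(seeds.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("at least one seed must be a vertex of the graph")
    teleport = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            neighbours = graph[n]
            if neighbours:
                share = damping * rank[n] / len(neighbours)
                for m in neighbours:
                    nxt[m] += share
            else:
                # Dangling vertex: hand its mass back to the seeds.
                for m in nodes:
                    nxt[m] += damping * rank[n] * teleport[m]
        rank = nxt
    return rank


if __name__ == "__main__":
    # Toy sense graph: two senses of "bank", each linked to one context sense.
    toy = {
        "bank#finance": ["money#1"],
        "money#1": ["bank#finance"],
        "bank#river": ["water#1"],
        "water#1": ["bank#river"],
    }
    scores = personalized_pagerank(toy, {"money#1": 1.0})
    print(max(["bank#finance", "bank#river"], key=scores.get))
```

With the seed placed on the context sense `money#1`, rank mass flows only to senses connected to it, so the connected sense of the ambiguous word scores higher. The same mechanism would apply when the vertices are gene identifiers and the seeds are GO terms or species found in the text.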

Knowledge-based methods in Gene Normalization
  • Knowledge typically brought to bear based on textual matching of concepts known to be associated with genes
    • Gene ontology concepts
    • Chromosome locations
    • Species names
  • KNoGM takes advantage of such knowledge in a broader relational context
KaBOB: Knowledge Base of Biology
  • Goal: construction of an integrated, broad-coverage semantic resource of biological knowledge
    • information artifacts
    • abstracted biological knowledge
    • RDF representation using ontological relations
  • KaBOB v.0
    • iRefWeb protein interaction data
    • GO annotations
    • Homologene
    • NCBI Taxonomy
From knowledge-based WSD to KNoGM
  • knowledge: KaBOB
  • dictionary: gene name → gene identifiers
  • context: mentions of gene names, GO terms, NCBI Taxonomy terms
Automated validation of high-throughput predictions
  • Collaboration with Mike Wall @ LANL
  • Combine structure-based predictions of active sites on proteins with literature-based validation
    • Given a PDB protein structure, and a prediction for residues in that structure that are active (ligand binding sites, catalytic sites, etc.)
    • Search the literature for evidence supporting the prediction
Protein Fold vs. Function
  • Many amino acids in a protein are responsible for defining the overall fold
  • However, only a small fraction of the residues in a protein are directly responsible for its behavior
  • The evolutionary pressures on these residues are different from other residues, and can cause mutations to be correlated with function (Lichtarge)
Functional Residues Are Often Remote in Sequence
  • Difficult to identify as motifs








α-amylase from Alteromonas haloplanctis: Asp174, Glu200, Asp264

Functional Sites
  • Types of Functional Sites
    • Catalytic sites
    • Allosteric Sites
    • Ligand-binding sites
    • Protein-protein interaction sites
  • Used to define motifs
    • Geometric hashing and other methods (TESS, Thornton lab)
  • Targets for Drug Design
DPA Prediction of Functional Sites




Catalytic Triad

Predicted Residues

NLP Validation of Protein Active Site Predictions
  • Combine structure-based predictions of active sites on proteins with literature-based validation
    • Given a PDB protein structure, and a prediction for residues in that structure that are active (ligand binding sites, catalytic sites, etc.)
    • Search the literature for evidence supporting the prediction
Residue mention detection, examples
  • This missense mutation converts a highly conserved glycine (Gly17 of neurophysin) to a valine residue.
  • Killer of prune (Kpn) is a mutation in the awd gene which substitutes Ser for Pro at position 97 and causes dominant lethality in individuals that do not have a functional prune gene.
  • Residues in both the N-terminal (Arg-66 and Glu-70) and C-terminal (Arg-200, Asp-254, Asp-255, and Asp-276) thirds of the protein are implicated in binding to cells.
  • … where cysteines at positions 6, 42, 48, 90 and 393 were replaced by serine.
  • Other outliers of possible functional relevance include D18, R23, R59, R390 and A391.

Patterns must handle 3-letter and 1-letter abbreviations, various connectors, mutations, linguistic constructs such as coordination, and other variations in surface forms.

Some regular expressions for AA mentions


AA_long = "(alanine|asparagine|aspartic|cysteine|glutamic|glutamic acid|glutamine|glycine|histidine|allo\





AA_short = "(arg|asn|asp|cys|gln|gly|glu|his|ile|leu|lys|met|phe|pro|ser|thr|trp|tyr|val|asx|glx|xle|xaa|ala|ctt)"

AA_initial = "(A|C|D|E|F|G|H|I|K|L|M|N|P|Q|R|S|T|V|W|Y)"

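# group the alternation so that the boundaries below apply to both branches
AA_unbounded = "(" + AA_long + "|" + AA_short + ")"

# raw strings (assuming Python) keep \b a word boundary rather than a backspace
AA_bounded = r"\b" + AA_unbounded + r"\b"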

AA_position_variant1 = "(\d+)([ \-]+)" + AA_bounded

#AA plus the position tyr85 with optional parenthesis around the position tyr(85)

AA_position_variant2 = AA_unbounded + "[ \-]*\(?\d+\)?"

# (tyr85 to ser85, Tyr 85 Ser 85, trp27-gly360)

connection = "[ \-]?(\-|to|\s|\\)[ \-]?"

grammatical_expressions = "([ \-]?(to|substitution of|at position|acid)[ \-]?)"

pattern3 = AA_unbounded + ".?\d+" + connection + AA_unbounded + ".?\d+"
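
The patterns above can be condensed into a small runnable sketch. This is a simplified illustration rather than the lab's pattern set: it covers only the three-letter and one-letter forms followed by a position, leaves out the full-name alternation (which is truncated on the slide), and the names `THREE_LETTER`, `ONE_LETTER` and `find_residue_mentions` are hypothetical.

```python
import re

# Three-letter codes for the 20 standard amino acids.
AA_SHORT = (r"(?:ala|arg|asn|asp|cys|gln|glu|gly|his|ile|leu|lys|met|"
            r"phe|pro|ser|thr|trp|tyr|val)")
AA_INITIAL = r"[ACDEFGHIKLMNPQRSTVWY]"

# Case-insensitive three-letter form: Gly17, Arg-66, Tyr(85).
THREE_LETTER = re.compile(
    r"\b(" + AA_SHORT + r")[ \-]?\(?(\d+)\)?", re.IGNORECASE)
# Case-sensitive one-letter form: D18, R390.
ONE_LETTER = re.compile(r"\b(" + AA_INITIAL + r")(\d+)\b")


def find_residue_mentions(text):
    """Return (residue, position) pairs in order of appearance."""
    hits = [(m.start(), m.group(1).capitalize(), int(m.group(2)))
            for m in THREE_LETTER.finditer(text)]
    hits += [(m.start(), m.group(1), int(m.group(2)))
             for m in ONE_LETTER.finditer(text)]
    return [(aa, pos) for _, aa, pos in sorted(hits)]


if __name__ == "__main__":
    print(find_residue_mentions(
        "Residues in both the N-terminal (Arg-66 and Glu-70) and C-terminal "
        "(Arg-200, Asp-254, Asp-255, and Asp-276) thirds of the protein are "
        "implicated in binding to cells."))
    print(find_residue_mentions(
        "Other outliers of possible functional relevance include "
        "D18, R23, R59, R390 and A391."))
```

Patterns this loose produce false positives, for example "his 3 papers" or identifiers such as "H2", and they do not handle coordination such as "cysteines at positions 6, 42, 48", so additional context checks would be needed in practice.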

Current pattern performance

Corpus 1: 61 full-text journal publications derived from Protein Data Bank (PDB) records that have known functional sites

Corpus 2: 7 full-text journal publications; 5 abstracts. Derived from PDB records that are known drug targets.

Corpus 3: 100 journal abstracts; obtained from Nagel et al (2009).

Some initial results of integration
  • For 32,195 PDB entries:
    • 26,829 entries map to a PubMed ID
    • 14,851 unique PubMed abstracts processed
    • 23,477 residues identified
      • 69% match surface residues on the relevant protein
        • 50% of these match predicted active sites
      • 79% of PDB entries have at least one residue identified
Complicating factors
  • AA numbering in sequences may not be consistent
    • Different “reference” sequences for the protein
    • Mutant or other variant sequences
  • Explicit mentions of mutations
  • Namespace ambiguity, possibly
NLP validation: infrastructure
  • Requires scaling our architecture to process large volumes of full-text publications
    • UIMA-AS (Asynchronous Scaleout)
    • Cloud/cluster computing
  • Take software engineering seriously
    • Robust, scalable, modular architectures
    • Consider the kinds of knowledge structures we need to be able to represent and manipulate
      • hierarchical controlled vocabularies
      • patterns of expression

Annotation Representation
  • Example text: “…regulation of transcription of mouse Mapk7…”
  • Figure labels: “biological regulation”, “regulation of transcription”, “M. musculus Mapk7”, rdf:Property (s p o)

In a nutshell
  • Ontologies and Semantic graph analysis
  • Vocabularies and Linguistic knowledge for the biomedical domain
  • Text Mining
  • Information Extraction
  • Addressing the needs of the biological user
  • Biological data analysis integrating multiple data sources
  • Larry Hunter (Lab director)
  • Eneko Agirre and Aitor Soroa at EHU (UKB)
  • Kevin Livingston (KaBOB)
  • Kevin Cohen (NLP)
  • Helen Johnson (Linguist)
  • Software engineers: Bill Baumgartner, Chris Roeder, Tom Christiansen
  • Other Lab members: Mike Bada, Hannah Tipney, Yuriy Malenkiy, Lynne Fox
  • Mike Wall and Judith Cohn at LANL
  • NIH grants: R01 LM 010120-01, R01 LM 009254, R01 LM 008111, R01 GM 083649, G08 LM 009639, T15 LM 009451
  • Guillaume Achaz for the gnome image