NLP pipeline for protein mutation knowledgebase construction

Jonas B. Laurila, Nona Naderi, René Witte, Christopher J.O. Baker

Background • Knowledge about mutations is crucial for many applications, e.g. protein engineering and biomedicine. • Protein mutations are described in the scientific literature. • The amount of information grows faster than manual database curation can handle. • Automatic reuse of mutation impact information from documents is needed.

Example excerpts "The W125F mutant showed only a slight reduction of activity (Vmax) and a larger increase of Km with 1,2-dibromoethane." • Mutation • Directionality of impact • Protein property "Haloalkane dehalogenase (DhlA) from Xanthobacter autotrophicus GJ10 hydrolyses terminally chlorinated and brominated n-alkanes to the corresponding alcohols." • Protein name • Gene name • Organism name

Mutation impact ontology

NLP framework

Named entity recognition • Protein, gene and organism names • Gazetteer lists based on SwissProt • Mappings encoded in the MGDB • Mutation mentions • MutationFinder (~700 regular expressions) • Normalized into wNm format (wild-type residue, position, mutant residue)
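The normalization step can be illustrated with a minimal Python sketch. The two patterns below are illustrative assumptions only; they are not taken from MutationFinder, whose pattern set is far larger. The function name `find_mutations` is hypothetical.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}
ONE = "ACDEFGHIKLMNPQRSTVWY"
THREE = "|".join(THREE_TO_ONE)

# Illustrative patterns (assumptions, not MutationFinder's own):
#   1. compact one-letter form, e.g. "W125F"
#   2. three-letter form, e.g. "Trp125Phe" or "Asp-124 to Ala"
PATTERNS = [
    re.compile(rf"\b(?P<w>[{ONE}])(?P<n>[1-9]\d*)(?P<m>[{ONE}])\b"),
    re.compile(
        rf"\b(?P<w>{THREE})-?(?P<n>[1-9]\d*)(?:-?|\s*(?:->|to)\s*)(?P<m>{THREE})\b",
        re.IGNORECASE,
    ),
]


def find_mutations(text):
    """Return every matched mutation mention, normalized to wNm format."""
    found = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            w, m = match["w"].upper(), match["m"].upper()
            # One-letter codes are not keys of the table and pass through unchanged.
            w = THREE_TO_ONE.get(w, w)
            m = THREE_TO_ONE.get(m, m)
            found.append(f"{w}{match['n']}{m}")
    return found
```

Results are grouped by pattern rather than by text position, which is enough for a sketch; a real pipeline would also keep character offsets for each mention.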

Named entity recognition: protein properties • Protein functions • Noun phrases extracted with MuNPEx • Activity, binding, affinity, specificity as head nouns • Kinetic variables • JAPE rules to extract Km, kcat and kcat/Km in the current implementation
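As a rough illustration of the two property types, the following Python sketch filters noun phrases by head noun and spots kinetic variable symbols with a regular expression. It is an assumption-laden stand-in: the pipeline itself uses MuNPEx and JAPE rules inside GATE, the "last alphabetic token" head finder is a naive heuristic, and all function names are hypothetical.

```python
import re

# Head nouns listed on the slide.
HEAD_NOUNS = {"activity", "binding", "affinity", "specificity"}

# Kinetic variable symbols; ratios are tried before the single symbols
# so that a ratio is not split into two separate matches.
KINETIC = re.compile(r"\b(?:kcat\s*/\s*Km|Km\s*/\s*kcat|Km|kcat)\b")


def head_noun(noun_phrase):
    """Naive head finder: the last alphabetic token of the noun phrase."""
    tokens = re.findall(r"[A-Za-z]+", noun_phrase)
    return tokens[-1].lower() if tokens else None


def is_protein_function(noun_phrase):
    """True if the noun phrase is headed by one of the listed head nouns."""
    return head_noun(noun_phrase) in HEAD_NOUNS


def kinetic_variables(text):
    """Return kinetic variable mentions in order of appearance."""
    return KINETIC.findall(text)
```

For example, `is_protein_function("haloalkane dehalogenase activity")` holds because the phrase ends in "activity", while "the active site" is rejected.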

Mutation grounding • Linking mutations to the positionally correct residues of the target sequence • Important for reuse of mutation mentions • Levels of grounding:
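One way to picture positional grounding is the sketch below: given mutations in wNm format and a candidate protein sequence, it searches for a numbering offset at which every wild-type residue matches the sequence. This is a simplified assumption for illustration, not the grounding algorithm used in the pipeline; the names `parse_mutation`, `ground` and `max_offset` are hypothetical.

```python
import re


def parse_mutation(mutation):
    """Split a wNm mention such as 'W125F' into (wild type, position, mutant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"not in wNm format: {mutation!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def ground(mutations, sequence, max_offset=50):
    """Return the smallest-magnitude numbering offset at which all wild-type
    residues agree with the sequence, or None if no offset in range fits.

    Offset 0 means the numbering in the text equals the sequence numbering
    (1-based). A non-zero offset models a shifted numbering scheme, for
    example when the text numbers residues without a leading segment.
    """
    parsed = [parse_mutation(m) for m in mutations]
    for offset in sorted(range(-max_offset, max_offset + 1), key=abs):
        consistent = True
        for wild_type, position, _mutant in parsed:
            index = position - 1 + offset
            if not 0 <= index < len(sequence) or sequence[index] != wild_type:
                consistent = False
                break
        if consistent:
            return offset
    return None
```

With the toy sequence `"MAGWTLK"`, the mentions `W4F` and `L6A` ground at offset 0, whereas `W2F` and `L4A` only ground at offset 2. A single mutation on a long sequence can match by chance, so several mutations from the same document give a much more reliable grounding.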

mSTRAPviz: structure annotation visualization • Mutations extracted from text are visualized on the protein structure, for which mutation grounding is a prerequisite.

Protein function grounding • Mentions of protein functions are linked to the correct Gene Ontology concepts. • Previously grounded proteins and mutations provide hints. • Groundings are scored by string similarity (the score is later used during impact extraction).
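The string-similarity scoring can be sketched as follows. The slides do not name the similarity measure, so the use of `difflib.SequenceMatcher` here is an assumption, and the candidate identifiers are placeholders rather than real Gene Ontology accessions.

```python
from difflib import SequenceMatcher

# Candidate concepts, e.g. those suggested by an already grounded protein.
# The identifiers are placeholders, not real Gene Ontology accessions.
CANDIDATES = {
    "GO:PLACEHOLDER_1": "haloalkane dehalogenase activity",
    "GO:PLACEHOLDER_2": "hydrolase activity",
    "GO:PLACEHOLDER_3": "DNA binding",
}


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def ground_function(mention, candidates):
    """Return (concept id, score) for the best-matching candidate label."""
    scored = [(similarity(mention, label), concept_id)
              for concept_id, label in candidates.items()]
    score, concept_id = max(scored)
    return concept_id, score
```

Keeping the score alongside the chosen concept lets a later step, such as impact extraction, discard or down-weight weak groundings.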

Relation detection • Impacts • Words describing directionality + protein properties • Mutants • Set of mutations giving rise to altered proteins • Mutant – Impacts • The causal relation between mutants and their impacts
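A toy version of these three relations, applied to a single sentence, might look like the Python sketch below. The word lists, the "direction word followed by the next property" pairing rule and the function name `extract_relations` are all assumptions made for illustration; the actual pipeline works on GATE annotations.

```python
import re

# Small illustrative vocabularies (assumptions, not the pipeline's lists).
DIRECTION = {
    "reduction": "decrease", "decrease": "decrease", "loss": "decrease",
    "increase": "increase", "enhancement": "increase",
}
PROPERTIES = {"activity", "affinity", "specificity", "binding", "Km", "kcat"}
MUTATION = re.compile(
    r"\b[ACDEFGHIKLMNPQRSTVWY][1-9]\d*[ACDEFGHIKLMNPQRSTVWY]\b"
)


def extract_relations(sentence):
    """Return (mutant, direction, property) triples found in one sentence.

    The mutant is the tuple of all mutations mentioned in the sentence.
    An impact is a directionality word followed by the next property term.
    """
    mutant = tuple(MUTATION.findall(sentence))
    if not mutant:
        return []
    relations = []
    pending = None  # direction waiting for its property
    for token in re.findall(r"[A-Za-z0-9/]+", sentence):
        if token.lower() in DIRECTION:
            pending = DIRECTION[token.lower()]
        elif pending is not None and token in PROPERTIES:
            relations.append((mutant, pending, token))
            pending = None
    return relations
```

On the example excerpt about the W125F mutant, this yields a decrease of activity and an increase of Km, both attributed to the mutant `("W125F",)`.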

OwlExporter • Translates GATE Annotations to OWL instances • Application independent • Literature Specifications added automatically • Used here to populate our Mutation impact ontology to create a mutation knowledgebase

Example query Retrieve mutations that do not have an impact on haloalkane dehalogenase activity (also retrieve the SwissProt identifier of the protein being mutated).

Example query Retrieve mutations on haloalkane dehalogenase that do not have a negative impact on the Michaelis constant.

Evaluation Mutation grounding performance

What’s next? • Modularize into a set of web services • Database (re-)creation • Reuse in phenotype prediction algorithms (SNAP*) *Bromberg and Rost, 2007

Jonas B. Laurila, CSAS, UNB, Saint John, j02h9@unb.ca
Nona Naderi, CSE, Concordia University, Montréal, n_nad@encs.concordia.ca
René Witte, CSE, Concordia University, Montréal, rwitte@cse.concordia.ca
Christopher J.O. Baker, CSAS, UNB, Saint John, bakerc@unb.ca

Acknowledgement This research was funded in part by: • New Brunswick Innovation Foundation, New Brunswick, Canada • NSERC, Discovery Grant, Canada • Quebec-New Brunswick University Co-operation in Advanced Education - Research Program, Government of New Brunswick, Canada
