Species Independent Protein Localization Prediction for Multi-compartmentalized Proteins

Mark Doderer, Kihoon Yoon, and Stephen Kwek
Department of Computer Science, University of Texas at San Antonio, San Antonio, Texas 78249, USA

Presentation Outline
  • Problem Overview
    • Background
    • Problem Statement and Approach
  • Methods and Materials
    • Similarity Searching
    • modECOC
    • Datasets
  • Summary of Results
  • Conclusion
Protein Localization
  • To carry out its function, a protein must localize to its intended cellular location
  • Knowing a protein's location provides information that can be used to solve other problems
  • Experimental determination relies on cell fractionation, electron microscopy, and fluorescence microscopy, which are time-consuming, subjective, and highly variable
  • Computational (putative) determination has been shown to be accurate and faster, and it can annotate unknown proteins
Problem Statement
  • Single-location prediction
  • Multi-location prediction
  • Many predictors focus on the majority class
A hybrid algorithm
  • If a similar protein can be found, use the known protein's localization to predict that of the unknown protein
  • If a similar protein cannot be found, use a machine learning predictor built from the known data to predict the unknown protein
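The two rules above amount to a simple dispatch. The following is a minimal sketch of that control flow only; the function names (`hybrid_predict`, `find_similar`, `ml_predict`) and the toy stand-ins are hypothetical and are not taken from the presentation.

```python
from typing import Callable, Optional, Sequence


def hybrid_predict(
    sequence: str,
    find_similar: Callable[[str], Optional[Sequence[str]]],
    ml_predict: Callable[[str], Sequence[str]],
) -> Sequence[str]:
    """Use a similar known protein's labels if one exists, else the ML predictor."""
    known_labels = find_similar(sequence)
    if known_labels is not None:
        return known_labels          # similarity search succeeded
    return ml_predict(sequence)      # fall back to the learned classifier


# Toy stand-ins (hypothetical): a lookup table plays the role of the
# similarity search, and a constant function plays the role of the model.
KNOWN = {"MKTAYIAK": ["cyto"]}


def toy_lookup(seq: str) -> Optional[Sequence[str]]:
    return KNOWN.get(seq)


def toy_model(seq: str) -> Sequence[str]:
    return ["nucl"]
```

Passing the two predictors in as callables keeps the dispatch independent of how the similarity search and the classifier are implemented.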
Similarity Searching Classifier
  • blastall (NCBI BLAST)
  • PAM30 substitution matrix
  • Bit score threshold of 100
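One way to apply the bit score threshold is to filter BLAST tabular output, where the bit score is the twelfth tab-separated column. This is a sketch under assumptions: the slide does not say whether the cutoff is inclusive or whether only the single best hit is used, and `best_hit` and the sample rows are illustrative.

```python
BIT_SCORE_CUTOFF = 100.0  # threshold stated on the slide


def best_hit(tabular_lines, cutoff=BIT_SCORE_CUTOFF):
    """Return (subject_id, bit_score) of the top hit at or above the cutoff, or None."""
    best = None
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        subject, bit_score = fields[1], float(fields[11])
        # Assumption: inclusive cutoff, keep only the highest-scoring hit.
        if bit_score >= cutoff and (best is None or bit_score > best[1]):
            best = (subject, bit_score)
    return best


# Made-up rows in the 12-column tabular layout (query, subject, % identity,
# length, mismatches, gap opens, q.start, q.end, s.start, s.end, e-value, bit score).
SAMPLE = [
    "q1\tP1\t98.0\t120\t2\t0\t1\t120\t1\t120\t1e-60\t230.0",
    "q1\tP2\t40.0\t80\t48\t0\t1\t80\t5\t84\t0.002\t35.4",
]
```

A return value of `None` signals that no sufficiently similar protein was found, which is the case handed to the machine learning classifier.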
modECOC – machine learning classifier
  • Dietterich and Bakiri proposed the Error-Correcting Output Codes (ECOC) classifier
    • Handles problems with many classes
    • Gives reliable class probability estimates
    • Does not ignore the minority classes
    • Can use any classifier as the base classifier
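For readers unfamiliar with ECOC, the sketch below shows the basic idea: each class gets a binary codeword, each bit position defines one binary base classifier, and a prediction is decoded to the class whose codeword is nearest in Hamming distance. The code matrix and class names are illustrative assumptions, not the ones used in the presentation.

```python
# Illustrative code matrix: one 5-bit codeword per class (assumed values).
# Every pair of codewords differs in at least 3 bits, so a single
# base-classifier error can still be corrected.
CODE_MATRIX = {
    "cyto": (0, 0, 1, 1, 0),
    "nucl": (0, 1, 0, 1, 1),
    "plas": (1, 0, 0, 0, 1),
    "mito": (1, 1, 1, 0, 0),
}


def column_labels(bit_index):
    """Binary training target of each class for base classifier `bit_index`."""
    return {cls: code[bit_index] for cls, code in CODE_MATRIX.items()}


def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(predicted_bits):
    """Return the class whose codeword is closest to the base classifiers' outputs."""
    return min(CODE_MATRIX, key=lambda cls: hamming(CODE_MATRIX[cls], predicted_bits))
```

For example, if the fourth base classifier errs on a "nucl" protein and the outputs are `(0, 1, 0, 0, 1)`, decoding still recovers "nucl".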
Modification to ECOC to allow for multi-location prediction
  • Modify the labeling used by each base classifier
    • “cyto_plas” is re-labeled as 0
    • “cyto_nucl” is left out of this base classifier
  • Predict through class scores obtained from voting
    • Find the mean of the class probabilities
    • Find how many standard deviations each class lies from the mean
    • Predict the classes whose scores differ significantly from those of the other classes
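The sketch below is one possible reading of these two steps, not the authors' implementation. It assumes that in the example base classifier "cyto" and "plas" are both coded 0 while "nucl" is coded 1, so a dual label is kept only when both parts map to the same bit. It also assumes "significantly different" means more than `k` standard deviations above the mean, with `k = 1` chosen arbitrarily; the scores and the extra class names are made up.

```python
from statistics import mean, pstdev

# Assumed bit assignment for one base classifier (hypothetical).
COLUMN = {"cyto": 0, "plas": 0, "nucl": 1}


def relabel(label, column=COLUMN):
    """Binary target for a (possibly dual) label, or None if it is left out."""
    bits = {column[part] for part in label.split("_")}
    # Both parts agree -> use that bit; they disagree -> exclude the example.
    return bits.pop() if len(bits) == 1 else None


def select_classes(scores, k=1.0):
    """Pick classes scoring more than k standard deviations above the mean."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    chosen = [cls for cls, s in scores.items() if s > mu + k * sigma]
    # Assumption: always predict at least the top-scoring class.
    return chosen or [max(scores, key=scores.get)]


# Made-up class scores for a protein found in two compartments.
EXAMPLE_SCORES = {"cyto": 0.9, "plas": 0.85, "nucl": 0.2,
                  "mito": 0.15, "extr": 0.1, "pero": 0.1}
```

With the example scores, only "cyto" and "plas" clear the threshold, so the protein receives a two-location prediction.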
Features – characterizing the data
  • Amino acid frequency and sequence length
  • Physicochemical characteristics
    • Following Betts and Russell
    • Hydrophobicity, polarity, size, aliphatic, charge, aromatic, and Cβ-branched
  • For example, hydrophobicity
    • Very hydrophobic: valine, isoleucine, leucine, methionine, phenylalanine, tryptophan, and cysteine
    • The remaining groups: least, partial, and other
  • Gapped pairs with a gap of 0, 1, and 2 amino acids
    • Offers spatial information
  • The N- and C-terminal regions contain the signal peptides, if they exist. Using 30 amino acids from each region and the reduced alphabet gives 19 × 2 features.
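Several of these features can be computed directly from the sequence string. The sketch below covers amino acid frequency, the very hydrophobic fraction, gapped pairs, and extraction of the terminal regions. Function names are illustrative, and the reduced alphabet and the exact 19 × 2 terminal encoding are not specified on the slide, so they are not reproduced here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Very hydrophobic residues as listed on the slide: V, I, L, M, F, W, C.
VERY_HYDROPHOBIC = set("VILMFWC")


def composition(seq):
    """Relative frequency of each of the 20 standard amino acids."""
    counts = Counter(seq)
    n = len(seq) or 1  # avoid division by zero on an empty sequence
    return {aa: counts[aa] / n for aa in AMINO_ACIDS}


def very_hydrophobic_fraction(seq):
    """Share of residues that fall in the very hydrophobic group."""
    return sum(aa in VERY_HYDROPHOBIC for aa in seq) / (len(seq) or 1)


def gapped_pairs(seq, gap):
    """Count residue pairs separated by `gap` intervening residues (0, 1, or 2)."""
    return Counter(seq[i] + seq[i + gap + 1] for i in range(len(seq) - gap - 1))


def terminal_regions(seq, n=30):
    """First and last n residues, where signal peptides would be located."""
    return seq[:n], seq[-n:]
```

For the sequence `"ACAC"`, a gap of 0 yields the pairs AC (twice) and CA (once), while a gap of 1 yields AA and CC.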
Datasets
  • WoLF PSORT
    • Three groups of species
      • 12,771 animal, 2,333 plant, and 2,113 fungal proteins
    • From Swiss-Prot
    • 12 unique labels
    • Maximum of two labels
    • Very imbalanced
  • PHPD
    • 5,191 yeast proteins
    • 22 unique labels
    • Ranges from 2 to 5 possible labels
Experiments
  • Cross-validation (2-fold for PHPD, 5-fold for WoLF PSORT)
  • Prediction and scoring
    • WoLF PSORT
      • Partial score for partially correct predictions
      • Never predicts more than 2 locations
    • PHPD
      • Always predicts three locations
      • Three measures: anything correct, average score for the labels that are correct, and each class score for that class prediction
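The slides do not give the scoring formulas, so the following is only an illustration of what partial credit and the "anything correct" measure could look like. The choice of overlap divided by the larger of the two label sets is an assumption, as are the function names.

```python
def partial_score(true_labels, predicted_labels):
    """Assumed partial credit: overlap size over the larger label set."""
    true_set, pred_set = set(true_labels), set(predicted_labels)
    if not true_set and not pred_set:
        return 1.0  # nothing to predict and nothing predicted
    return len(true_set & pred_set) / max(len(true_set), len(pred_set))


def any_correct(true_labels, predicted_labels):
    """True if at least one predicted location is a true location."""
    return bool(set(true_labels) & set(predicted_labels))
```

Under this assumed formula, predicting only "cyto" for a protein labeled "cyto" and "nucl" scores 0.5, and an exact match scores 1.0.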
Conclusion
  • A hybrid classifier exploits the strengths of both BLAST similarity searching and machine learning classification, and it can work on a variety of datasets without parameter changes
  • ECOC is well suited to representing protein localization problems
  • modECOC handles multi-label problems and remains flexible at prediction time