Species Independent Protein Localization Prediction for Multi-compartmentalized Proteins

Mark Doderer, Kihoon Yoon, and Stephen Kwek

Department of Computer Science,

University of Texas at San Antonio,

San Antonio, Texas 78249, USA


Presentation Outline

  • Problem Overview

    • Background

    • Problem Statement and Approach

  • Methods and Materials

    • Similarity Searching

    • modECOC

    • Datasets

  • Summary of Results

  • Conclusion


Protein Localization

  • To carry out its function, a protein must localize to its intended location

  • Localization information can be used to solve other problems

  • Experimental determination relies on cell fractionation, electron microscopy and fluorescence microscopy. These methods are time-consuming, subjective and highly variable

  • Putative (predicted) determination has been shown to be accurate and faster, and it can annotate unknown proteins


Problem Statement

  • Single location prediction

  • Multi location prediction

  • Many predictors focus on the majority class


A hybrid algorithm

  • If a similar protein can be found, use the known protein to predict the unknown protein

  • If a similar protein cannot be found, use a machine learning predictor built from the known data to predict the unknown protein
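
The two-step rule above can be sketched as a small dispatch function. This is only an illustration under stated assumptions: `hybrid_predict`, `similarity_lookup` and `ml_predict` are hypothetical names, and the two callables stand in for the similarity search and the machine learning predictor described on the following slides.

```python
from typing import Callable, Optional, Set


def hybrid_predict(
    sequence: str,
    similarity_lookup: Callable[[str], Optional[Set[str]]],
    ml_predict: Callable[[str], Set[str]],
) -> Set[str]:
    """Use a similar known protein if one exists, else fall back to the ML model.

    similarity_lookup returns the known locations of a similar protein,
    or None when no sufficiently similar protein is found (hypothetical contract).
    """
    known_locations = similarity_lookup(sequence)
    if known_locations is not None:
        return known_locations
    return ml_predict(sequence)


# Toy stand-ins for the two predictors (illustrative only).
toy_db = {"MKVLA": {"cyto", "nucl"}}
locations = hybrid_predict("MKVLA", toy_db.get, lambda seq: {"mito"})
```

Keeping both predictors behind plain callables means either one can be swapped without touching the dispatch logic.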


Similarity Searching Classifier

  • BlastAll

  • PAM30 Matrix

  • Bit score of 100
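
As a rough sketch of how the bit score could be applied, the function below picks the best hit from BLAST output in the 12-column tabular format (the format that legacy `blastall` writes with `-m 8`). Treating 100 as a minimum bit score is an assumption, as are the function name `best_hit` and the choice to keep only the single top-scoring hit; the slides do not spell out these details.

```python
from typing import Optional

BIT_SCORE_CUTOFF = 100.0  # value from the slide; using it as a minimum is an assumption


def best_hit(tabular_output: str, cutoff: float = BIT_SCORE_CUTOFF) -> Optional[str]:
    """Return the subject ID of the highest-scoring hit at or above the cutoff.

    Expects tab-separated lines with 12 columns: the subject ID is column 2
    and the bit score is column 12. Returns None if no hit reaches the cutoff.
    """
    best_id, best_score = None, cutoff
    for line in tabular_output.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id, bit_score = fields[1], float(fields[11])
        if bit_score >= best_score:
            best_id, best_score = subject_id, bit_score
    return best_id


# Made-up example rows, not real search results.
example = (
    "q1\tP11111\t45.0\t120\t60\t2\t1\t120\t5\t124\t1e-20\t85.1\n"
    "q1\tP22222\t80.0\t130\t20\t1\t1\t130\t1\t130\t1e-60\t150\n"
)
hit = best_hit(example)
```

When `best_hit` returns None, the hybrid scheme would fall through to the machine learning classifier.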


modECOC – machine learning classifier

  • Dietterich and Bakiri proposed the Error Correcting Output Code Classifier

    • Handles problems with many classes

    • Reliable class probability estimates

    • Doesn’t ignore the minority classes

    • Can use any classifier for the base classifiers
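
To make the ECOC idea concrete, here is a minimal decoding sketch. Each class gets a binary codeword, one base classifier is trained per bit position, and a new example is assigned to the class whose codeword is closest in Hamming distance to the predicted bits. The four class names and the 7-bit exhaustive code are illustrative assumptions, not the code matrix used in the presentation, and the base classifiers are replaced here by a ready-made bit vector.

```python
from typing import Dict, Sequence, Tuple

# Exhaustive 7-bit code for four classes; any two codewords differ in 4 bits,
# so a single wrong base classifier can be corrected.
CODE_MATRIX: Dict[str, Tuple[int, ...]] = {
    "cyto": (1, 1, 1, 1, 1, 1, 1),
    "nucl": (0, 0, 0, 0, 1, 1, 1),
    "plas": (0, 0, 1, 1, 0, 0, 1),
    "mito": (0, 1, 0, 1, 0, 1, 0),
}


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(predicted_bits: Sequence[int]) -> str:
    """Return the class whose codeword is nearest to the predicted bits."""
    return min(CODE_MATRIX, key=lambda c: hamming(CODE_MATRIX[c], predicted_bits))


# The "plas" codeword with its first bit flipped, as if one base classifier erred.
noisy = (1, 0, 1, 1, 0, 0, 1)
decoded = decode(noisy)
```

Because every class participates in every base classifier, minority classes still shape the decision, which is one reason the approach suits imbalanced data.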


Relabeling a dataset


Modification to ECOC to allow for multi-location prediction

  • Modify base classifier labeling

    • “cyto_plas” will be re-labeled as 0.

    • “cyto_nucl” will be left out of this base classifier

  • Prediction through class score from voting

    • Find mean of class probabilities

    • Find standard deviations from mean for each class

    • Predict the classes whose scores differ significantly from those of the other classes
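
The two modifications can be sketched as follows. This is a reading of the slide, not the authors' implementation: the rule "keep a multi-location example when all of its locations share the same bit, otherwise leave it out" is inferred from the two examples given, and the one-standard-deviation threshold `k` is a made-up default. The names `relabel` and `select_locations` are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional


def relabel(label: str, column: Dict[str, int]) -> Optional[int]:
    """Map a label such as 'cyto_plas' to one base classifier's 0/1 target.

    column gives that base classifier's bit for each single location.
    Returns None when the component locations disagree, meaning the
    example is left out of this base classifier's training set.
    """
    bits = {column[location] for location in label.split("_")}
    return bits.pop() if len(bits) == 1 else None


def select_locations(scores: Dict[str, float], k: float = 1.0) -> List[str]:
    """Return classes whose score lies more than k standard deviations above the mean."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # no class stands out from the others
    return sorted(c for c, s in scores.items() if (s - mu) / sigma > k)


# One base classifier's bits (illustrative values).
column = {"cyto": 0, "plas": 0, "nucl": 1}
kept = relabel("cyto_plas", column)      # both locations map to 0
dropped = relabel("cyto_nucl", column)   # locations disagree

# Illustrative class scores from voting.
votes = {"cyto": 0.45, "nucl": 0.40, "plas": 0.05, "mito": 0.05, "extr": 0.05}
predicted = select_locations(votes)
```

Selecting by deviation from the mean, rather than taking a fixed number of top classes, lets the number of predicted locations vary per protein.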


Features – characterizing the data

  • Amino acid frequency and sequence length

  • Physicochemical Characteristics

    • Betts and Russell

    • Hydrophobicity, polarity, size, aliphatic, charge, aromatic and C-beta branching

  • For example, hydrophobicity

    • Very hydrophobic - valine, isoleucine, leucine, methionine, phenylalanine, tryptophan, and cysteine

    • Least, partial and other

  • Gapped amino acid pairs with gaps of 0, 1 and 2 residues

    • Offers spatial information

  • The N- and C-terminal regions contain the signal peptides, if they exist. Using 30 amino acids from each region with the reduced alphabet gives 19 x 2 features.
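
The feature types listed above can be sketched in a few lines. Several details here are assumptions made for illustration: the reduced alphabet is collapsed to two letters (very hydrophobic versus everything else) because the slide names only the very hydrophobic group, all function names are hypothetical, and sequences are assumed to be non-empty.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VERY_HYDROPHOBIC = set("VILMFWC")  # the residues listed on the slide


def composition(seq: str) -> List[float]:
    """Frequency of each of the 20 amino acids, followed by the sequence length."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS] + [float(len(seq))]


def reduce_hydrophobicity(seq: str) -> str:
    """Two-letter reduced alphabet: H = very hydrophobic, O = other (a simplification)."""
    return "".join("H" if a in VERY_HYDROPHOBIC else "O" for a in seq)


def gapped_pair_counts(seq: str, gap: int, alphabet: str) -> Dict[str, int]:
    """Count ordered pairs (seq[i], seq[i + gap + 1]); gap=0 means adjacent residues."""
    counts = Counter(zip(seq, seq[gap + 1:]))
    return {a + b: counts[(a, b)] for a, b in product(alphabet, repeat=2)}


def terminal_regions(seq: str, n: int = 30) -> Tuple[str, str]:
    """First n and last n residues, where signal peptides would be found."""
    return seq[:n], seq[-n:]


reduced = reduce_hydrophobicity("MVLKAC")
adjacent = gapped_pair_counts(reduced, 0, "HO")
one_apart = gapped_pair_counts(reduced, 1, "HO")
```

Counting pairs at several gaps is what gives the classifier some positional context that plain composition lacks.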


Datasets

  • WolfPsort

    • Three groups of species

      • 12771 animal, 2333 plant and 2113 fungal proteins

    • From SwissProt

    • 12 unique labels

    • Maximum of two labels

    • Very imbalanced

  • PHPD

    • 5191 yeast proteins

    • 22 unique labels

    • ranges from 2 to 5 possible labels


Experiments

  • Cross-validation (2-fold for PHPD, 5-fold for WolfPsort)

  • Prediction and Scoring

    • WolfPsort

      • Partial score for partially correct predictions

      • Never predicts more than 2 locations

    • PHPD

      • Always predicts three locations

      • Three measures: any label correct, average score over the correct labels, and each class's score for that class's prediction
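
The slides do not give the scoring formulas, so the sketch below shows only one plausible way such measures could be computed. `any_correct` follows the "anything correct" idea directly. `partial_score` uses set overlap (intersection over union) purely as an assumed stand-in for the partial-credit scheme. Both function names are hypothetical.

```python
from typing import Set


def any_correct(predicted: Set[str], actual: Set[str]) -> bool:
    """True if at least one predicted location is among the true locations."""
    return bool(predicted & actual)


def partial_score(predicted: Set[str], actual: Set[str]) -> float:
    """Assumed partial-credit score: shared labels divided by all labels involved.

    1.0 for an exact match, 0.0 for no overlap, and a fraction in between.
    """
    if not predicted and not actual:
        return 1.0
    return len(predicted & actual) / len(predicted | actual)


exact = partial_score({"cyto", "nucl"}, {"cyto", "nucl"})
half = partial_score({"cyto", "nucl"}, {"cyto"})
hit = any_correct({"cyto", "nucl", "mito"}, {"mito", "plas"})
```

A set-based score treats the order of predicted locations as irrelevant, which matches labels such as "cyto_nucl" that name locations without ranking them.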


Results compared with WolfPsort


Results compared with PHPD


Conclusion

  • A hybrid classifier exploits the strengths of both BLAST similarity searching and machine learning classification, and can work on a variety of datasets without parameter changes

  • ECOC is well suited to representing protein localization problems

  • modECOC handles multi-label problems with flexibility during prediction