Species Independent Protein Localization Prediction for Multi-compartmentalized Proteins


Mark Doderer, Kihoon Yoon, and Stephen Kwek

Department of Computer Science, University of Texas at San Antonio, San Antonio, Texas 78249, USA


Presentation Outline

  • Problem Overview

    • Background

    • Problem Statement and Approach

  • Methods and Materials

    • Similarity Searching

    • modECOC

    • Datasets

  • Summary of Results

  • Conclusion


Protein Localization

  • To perform its function, a protein must localize to its intended location in the cell

  • Knowing where a protein localizes can therefore help solve other problems

  • Experimental determination relies on cell fractionation, electron microscopy and fluorescence microscopy. These methods are time-consuming, subjective and highly variable

  • Putative (computational) determination has been shown to be accurate and faster, and it can annotate unknown proteins


Problem Statement

  • Single location prediction

  • Multi location prediction

  • Many predictors focus on the majority class


A hybrid algorithm

  • If a similar protein can be found, use the known protein to predict the location of the unknown protein

  • If a similar protein cannot be found, use a machine learning predictor built from the known data to predict the location of the unknown protein
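
The two-branch rule above can be sketched in a few lines. This is only an illustration of the dispatch logic, not the authors' implementation: the names `hybrid_predict`, `toy_search` and `toy_classifier`, and the toy data, are hypothetical.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical interfaces: the similarity search returns the labels of a
# sufficiently similar known protein, or None when no such protein exists.
SimilaritySearch = Callable[[str], Optional[Sequence[str]]]
Classifier = Callable[[str], Sequence[str]]


def hybrid_predict(sequence: str,
                   search: SimilaritySearch,
                   classifier: Classifier) -> List[str]:
    """Prefer the labels of a similar known protein; otherwise fall back
    to the machine learning classifier."""
    hit_labels = search(sequence)
    if hit_labels is not None:
        return list(hit_labels)
    return list(classifier(sequence))


# Toy stand-ins for illustration only (not real data or real predictors).
KNOWN = {"MKTAYIAK": ["cyto"]}


def toy_search(sequence: str) -> Optional[Sequence[str]]:
    return KNOWN.get(sequence)


def toy_classifier(sequence: str) -> Sequence[str]:
    return ["nucl"]
```

With these stand-ins, a sequence present in `KNOWN` takes the similarity branch, and any other sequence falls through to the classifier.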


Similarity Searching Classifier

  • BlastAll (BLAST similarity search)

  • PAM30 substitution matrix

  • Bit score of 100 as the similarity threshold
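
As a rough sketch of how such a cutoff might be applied, the snippet below picks the best-scoring hit from BLAST tabular output. It assumes the 12-column tab-separated layout with the bit score in the last column (as produced, for example, by `blastall -m 8`) and assumes that 100 is used as a minimum bit score; the function name `best_hit` and the example rows are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Value taken from the slide; treating it as a minimum cutoff is an assumption.
BIT_SCORE_THRESHOLD = 100.0


def best_hit(tabular_lines: Iterable[str],
             threshold: float = BIT_SCORE_THRESHOLD
             ) -> Optional[Tuple[str, float]]:
    """Return (subject_id, bit_score) of the highest-scoring hit at or
    above the threshold, or None if no hit qualifies."""
    best = None
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id, bit_score = fields[1], float(fields[11])
        if bit_score >= threshold and (best is None or bit_score > best[1]):
            best = (subject_id, bit_score)
    return best


# Made-up rows: query, subject, % identity, alignment length, mismatches,
# gap openings, q.start, q.end, s.start, s.end, e-value, bit score.
example = [
    "q1\tP12345\t98.0\t120\t2\t0\t1\t120\t1\t120\t1e-60\t231.0",
    "q1\tQ67890\t40.0\t50\t30\t0\t1\t50\t5\t55\t0.01\t35.4",
]
```

When `best_hit` returns `None`, a hybrid predictor would hand the sequence to the machine learning classifier instead.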


modECOC – machine learning classifier

  • Dietterich and Bakiri proposed the Error-Correcting Output Code (ECOC) classifier

    • Handles problems with many classes

    • Reliable class probability estimates

    • Doesn’t ignore the minority classes

    • Can use any classifier for the base classifiers
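
For readers unfamiliar with ECOC, here is a minimal sketch of the decoding step only. Each class gets a binary codeword, each column of the code matrix defines one binary base classifier, and a new example is assigned to the class whose codeword is closest in Hamming distance to the base classifiers' outputs. The four class names and the 7-bit exhaustive code below are illustrative assumptions, not the code matrix used in this work.

```python
from typing import Dict, List, Sequence

# Hypothetical exhaustive code for four classes (minimum Hamming distance 4,
# so any single base-classifier error can be corrected).
CODE_MATRIX: Dict[str, List[int]] = {
    "cyto": [1, 1, 1, 1, 1, 1, 1],
    "nucl": [0, 0, 0, 0, 1, 1, 1],
    "plas": [0, 0, 1, 1, 0, 0, 1],
    "mito": [0, 1, 0, 1, 0, 1, 0],
}


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits: Sequence[int],
           code_matrix: Dict[str, List[int]] = CODE_MATRIX) -> str:
    """Return the class whose codeword is nearest to the predicted bits."""
    return min(code_matrix, key=lambda label: hamming(code_matrix[label], bits))
```

Because every pair of codewords here differs in four positions, the output `[0, 0, 0, 0, 1, 1, 0]` (one bit away from the `nucl` codeword) still decodes to `nucl`.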


Relabeling a dataset


Modification to ECOC to allow for multi-location prediction

  • Modify base classifier labeling

    • “cyto_plas” will be re-labeled as 0.

    • “cyto_nucl” will be left out of this base classifier

  • Prediction through class score from voting

    • Find mean of class probabilities

    • Find standard deviations from mean for each class

    • Predict the classes whose scores are significantly different from those of the other classes
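
One possible reading of these two modifications is sketched below. For relabeling, a multi-location protein keeps a 0/1 target only when all of its locations fall on the same side of a base classifier's split, and is left out otherwise. For prediction, classes whose vote score lies more than `k` standard deviations above the mean are reported. The split in `column`, the threshold `k`, the cap `max_labels`, and the function names are all assumptions made for illustration; the slides do not state the exact significance criterion.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional


def relabel(labels: List[str], column: Dict[str, int]) -> Optional[int]:
    """Target bit for one base classifier, or None to leave the protein out
    when its locations are split across both sides of the column."""
    bits = {column[label] for label in labels}
    return bits.pop() if len(bits) == 1 else None


def select_locations(scores: Dict[str, float],
                     k: float = 1.0,
                     max_labels: int = 2) -> List[str]:
    """Return classes scoring more than k standard deviations above the mean,
    highest score first (k and max_labels are assumed parameters)."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all classes tie, so none stands out
    ranked = sorted(scores, key=scores.get, reverse=True)
    chosen = [c for c in ranked if (scores[c] - mu) / sigma > k]
    return chosen[:max_labels]


# Hypothetical column of the code matrix for one base classifier.
column = {"cyto": 0, "plas": 0, "nucl": 1}
```

Under this hypothetical column, `relabel(["cyto", "plas"], column)` gives 0 and `relabel(["cyto", "nucl"], column)` gives `None`, which matches the “cyto_plas” and “cyto_nucl” cases on the slide.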


Features – characterizing the data

  • Amino acid frequency and sequence length

  • Physicochemical Characteristics

    • Betts and Russell

    • Hydrophobicity, polarity, size, aliphatic, charge, aromatic and C-beta branching

  • For example hydrophobicity

    • Very hydrophobic - valine, isoleucine, leucine, methionine, phenylalanine, tryptophan, and cysteine

    • The remaining groups are least hydrophobic, partially hydrophobic and other

  • Gapped pairs with a gap of 0, 1 and 2 amino acids

    • Offers spatial information

  • The N- and C-terminal regions contain the signal peptides, if they exist. Using 30 amino acids from each region and the reduced alphabet gives 19 × 2 features.
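
A minimal sketch of these feature types follows. It is not the authors' feature extractor: the reduced alphabet is collapsed to two symbols (the very hydrophobic group listed above versus everything else) because the slides do not give the membership of the other groups, and all function names are hypothetical. Sequences are assumed to be non-empty.

```python
from collections import Counter
from typing import Dict, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VERY_HYDROPHOBIC = set("VILMFWC")  # the group listed on the slide


def composition(seq: str) -> Dict[str, float]:
    """Amino acid frequencies plus sequence length (seq must be non-empty)."""
    counts = Counter(seq)
    feats = {"freq_" + aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}
    feats["length"] = float(len(seq))
    return feats


def reduce_hydrophobicity(seq: str) -> str:
    """Two-symbol reduction: 'H' for very hydrophobic, 'o' otherwise.
    (Simplification: the slide's other groups are merged into 'o'.)"""
    return "".join("H" if aa in VERY_HYDROPHOBIC else "o" for aa in seq)


def gapped_pairs(symbols: str, gap: int) -> Counter:
    """Count symbol pairs separated by `gap` intervening positions."""
    pairs: Counter = Counter()
    for i in range(len(symbols) - gap - 1):
        pairs[symbols[i] + symbols[i + gap + 1]] += 1
    return pairs


def terminal_regions(seq: str, n: int = 30) -> Tuple[str, str]:
    """First and last n residues, where signal peptides would be found."""
    return seq[:n], seq[-n:]
```

For the toy sequence `"MVLAK"`, the reduced string is `"HHHoo"`; with a gap of 0 the pair counts are `HH`: 2, `Ho`: 1, `oo`: 1, and with a gap of 1 they are `HH`: 1, `Ho`: 2.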


Datasets

  • WolfPsort

    • Three groups of species

      • 12771 animal, 2333 plant and 2113 fungal proteins

    • From SwissProt

    • 12 unique labels

    • Maximum of two labels

    • Very imbalanced

  • PHPD

    • 5191 yeast proteins

    • 22 unique labels

    • ranges from 2 to 5 possible labels


Experiments

  • Cross-validation (2-fold for PHPD, 5-fold for WolfPsort)

  • Prediction and Scoring

    • WolfPsort

      • Partial score for partially correct predictions

      • Never predicts more than 2 locations

    • PHPD

      • Always predicts three locations

      • Three measures: anything correct, the average score for correct labels, and a per-class score for each class prediction
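
To make the scoring ideas concrete, here is a small sketch. The "anything correct" measure follows directly from its name: a prediction counts if it shares at least one location with the true labels. The slides do not define how partial credit is computed, so `partial_score` below uses the Jaccard overlap purely as an assumed stand-in; both function names are hypothetical.

```python
from typing import Iterable


def any_correct(predicted: Iterable[str], actual: Iterable[str]) -> bool:
    """True if at least one predicted location is among the true locations."""
    return bool(set(predicted) & set(actual))


def partial_score(predicted: Iterable[str], actual: Iterable[str]) -> float:
    """Assumed partial-credit rule (Jaccard overlap), not the authors' formula:
    shared labels divided by all labels that appear in either set."""
    p, a = set(predicted), set(actual)
    if not p and not a:
        return 0.0
    return len(p & a) / len(p | a)
```

For example, predicting `["cyto", "nucl"]` for a protein annotated only as `cyto` counts as "anything correct" and receives a partial score of 0.5 under this assumed rule.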


Results compared with WolfPsort


Results compared with PHPD


Conclusion

  • A hybrid classifier exploits the strengths of both BLAST similarity searching and machine learning classification, and it can work on a variety of datasets without parameter changes

  • ECOC is well suited to representing protein localization problems

  • modECOC handles multi-label problems with flexibility during prediction

