Species Independent Protein Localization Prediction for Multi-compartmentalized Proteins

Mark Doderer, Kihoon Yoon, and Stephen Kwek

Department of Computer Science,

University of Texas at San Antonio,

San Antonio, Texas 78249, USA


Presentation Outline

  • Problem Overview

    • Background

    • Problem Statement and Approach

  • Methods and Materials

    • Similarity Searching

    • modECOC

    • Datasets

  • Summary of Results

  • Conclusion


Protein Localization

  • To carry out its function, a protein must localize to its intended cellular location

  • Knowing where a protein localizes can help solve other problems

  • Experimental determination relies on cell fractionation, electron microscopy and fluorescence microscopy. These methods are time-consuming, subjective and highly variable

  • Computational (putative) determination has been shown to be accurate and faster, and it can annotate unknown proteins


Problem Statement

  • Single location prediction

  • Multi location prediction

  • Many predictors focus on the majority class


A hybrid algorithm

  • If a similar protein can be found, use the known protein to predict the location of the unknown protein

  • If a similar protein cannot be found, use a machine learning predictor built from the known data to predict the location of the unknown protein
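
The two rules above can be sketched as a small dispatch function. This is a minimal illustration, not the authors' implementation: `hybrid_predict`, `find_hits` and `ml_predict` are hypothetical names, and the default cutoff simply reuses the bit score of 100 listed on the Similarity Searching Classifier slide.

```python
from typing import Callable, Sequence, Set, Tuple

# A similarity hit: (bit score, location labels of the matched known protein).
Hit = Tuple[float, Set[str]]


def hybrid_predict(
    sequence: str,
    find_hits: Callable[[str], Sequence[Hit]],  # hypothetical similarity search
    ml_predict: Callable[[str], Set[str]],      # hypothetical trained classifier
    min_bit_score: float = 100.0,               # assumed cutoff, taken from the slides
) -> Set[str]:
    """Return predicted locations, preferring a sufficiently similar known protein."""
    hits = [h for h in find_hits(sequence) if h[0] >= min_bit_score]
    if hits:
        # A similar protein exists: transfer the labels of the best-scoring hit.
        best = max(hits, key=lambda h: h[0])
        return set(best[1])
    # No similar protein: fall back to the machine learning predictor.
    return set(ml_predict(sequence))
```

Passing the two predictors in as callables keeps the sketch independent of any particular search tool or classifier.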


Similarity Searching Classifier

  • BlastAll

  • PAM30 Matrix

  • Bit score cutoff of 100
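
As an illustration of applying the bit score of 100, the sketch below filters BLAST hits and keeps the best one per query. It assumes the standard 12-column tab-separated BLAST output (as written by `blastall -m 8` or BLAST+ `-outfmt 6`), where the bit score is the last column. The function name `best_hits` is hypothetical, and treating 100 as a minimum is an assumption about how the slide's value is used.

```python
from typing import Dict, Tuple


def best_hits(tabular: str, min_bit_score: float = 100.0) -> Dict[str, Tuple[str, float]]:
    """Map each query id to its best (subject id, bit score) at or above the cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column hit line
        query, subject, bit_score = fields[0], fields[1], float(fields[11])
        if bit_score < min_bit_score:
            continue  # too dissimilar to trust for label transfer
        if query not in best or bit_score > best[query][1]:
            best[query] = (subject, bit_score)
    return best
```

Queries that do not appear in the returned dictionary have no sufficiently similar known protein and would be passed to the machine learning classifier.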


modECOC – machine learning classifier

  • Dietterich and Bakiri proposed the Error-Correcting Output Codes (ECOC) classifier

    • Handles problems with many classes

    • Reliable class probability estimates

    • Doesn’t ignore the minority classes

    • Can use any classifier for the base classifiers
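
For readers unfamiliar with ECOC, the following sketch shows the general idea: each class gets a binary codeword, each column of the code matrix defines one binary base classifier, and a prediction is decoded to the class with the nearest codeword. The code matrix and class names here are made up for illustration and are not taken from the presentation.

```python
from typing import Dict, List

# Hypothetical 4-class code matrix: one codeword (row) per class,
# one binary base classifier per column. Minimum Hamming distance
# between rows is 3, so a single wrong base classifier is tolerated.
CODE_MATRIX: Dict[str, List[int]] = {
    "cyto": [0, 0, 0, 0, 0],
    "nucl": [0, 1, 1, 0, 1],
    "mito": [1, 0, 1, 1, 0],
    "plas": [1, 1, 0, 1, 1],
}


def binary_targets(labels: List[str], column: int) -> List[int]:
    """Relabel a multiclass dataset as 0/1 targets for one base classifier."""
    return [CODE_MATRIX[label][column] for label in labels]


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions at which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits: List[int]) -> str:
    """Return the class whose codeword is nearest to the base classifier outputs."""
    return min(CODE_MATRIX, key=lambda c: hamming(CODE_MATRIX[c], bits))
```

Because every base classifier sees all classes split into two groups, minority classes contribute to every binary problem instead of being outvoted by the majority class.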


Relabeling a dataset


Modification to ECOC to allow for multi-location prediction

  • Modify base classifier labeling

    • “cyto_plas” will be re-labeled as 0.

    • “cyto_nucl” will be left out of this base classifier

  • Prediction through class score from voting

    • Find mean of class probabilities

    • Find standard deviations from mean for each class

    • Predict the classes whose scores differ significantly from those of the other classes
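
One possible reading of these two modifications is sketched below. The relabeling rule assumes that a dual-label protein keeps a binary label when both of its locations fall on the same side of a base classifier's split and is left out otherwise. The prediction rule assumes that "significantly different" means at least `z_cutoff` standard deviations above the mean class score. Both rules, the cutoff value, the cap of two labels and all function names are assumptions, not details confirmed by the slides.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional


def relabel(label: str, column_bits: Dict[str, int]) -> Optional[int]:
    """Binary target for one base classifier, or None to leave the example out.

    A combined label such as "cyto_plas" is split on "_". If every part maps
    to the same bit, the example gets that bit; otherwise it is excluded.
    """
    bits = {column_bits[part] for part in label.split("_")}
    return bits.pop() if len(bits) == 1 else None


def predict_locations(
    class_scores: Dict[str, float],
    z_cutoff: float = 1.0,   # assumed significance threshold
    max_labels: int = 2,     # assumed cap on predicted locations
) -> List[str]:
    """Return classes whose score stands out from the mean of all class scores."""
    scores = list(class_scores.values())
    mu, sigma = mean(scores), pstdev(scores)
    ranked = sorted(class_scores, key=class_scores.get, reverse=True)
    if sigma == 0:
        return ranked[:1]  # all scores tie; fall back to a single class
    chosen = [c for c in ranked if (class_scores[c] - mu) / sigma >= z_cutoff]
    return chosen[:max_labels] or ranked[:1]
```

With a split of `{"cyto": 0, "plas": 0, "nucl": 1}`, `relabel("cyto_plas", ...)` gives 0 and `relabel("cyto_nucl", ...)` gives `None`, matching the two examples on the slide.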


Features – characterizing the data

  • Amino acid frequency and sequence length

  • Physicochemical Characteristics

    • Betts and Russell

    • Hydrophobicity, polarity, size, aliphatic, charge, aromatic and C-beta branching

  • For example hydrophobicity

    • Very hydrophobic - valine, isoleucine, leucine, methionine, phenylalanine, tryptophan, and cysteine

    • Least, partial and other

  • Gapped amino acid pairs with gaps of 0, 1 and 2 residues

    • Offers spatial information

  • The N- and C-terminal regions contain the signal peptides, if any exist. Using 30 residues from each region with the reduced alphabet gives 19 x 2 features.
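
The feature families listed above could be computed along the following lines. This is a simplified sketch: it collapses hydrophobicity to two letters (the "very hydrophobic" residues named on the slide versus everything else) rather than the four categories mentioned, and it does not reproduce the 19 x 2 terminal-region features, whose exact definition is not given. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# "Very hydrophobic" residues listed on the slide: V, I, L, M, F, W, C.
VERY_HYDROPHOBIC = set("VILMFWC")


def composition(seq: str) -> Dict[str, float]:
    """Amino acid frequencies plus sequence length."""
    counts = Counter(seq)
    n = max(len(seq), 1)  # avoid division by zero on an empty sequence
    feats = {f"freq_{aa}": counts[aa] / n for aa in AMINO_ACIDS}
    feats["length"] = float(len(seq))
    return feats


def reduce_hydrophobicity(seq: str) -> str:
    """Two-letter reduced alphabet: H (very hydrophobic) or O (everything else)."""
    return "".join("H" if aa in VERY_HYDROPHOBIC else "O" for aa in seq)


def gapped_pairs(reduced: str, gap: int) -> Dict[str, int]:
    """Count residue pairs separated by `gap` intervening positions."""
    counts = Counter(
        reduced[i] + reduced[i + gap + 1] for i in range(len(reduced) - gap - 1)
    )
    return {f"gap{gap}_{a}{b}": counts[a + b] for a, b in product("HO", repeat=2)}


def terminal_regions(seq: str, size: int = 30) -> Tuple[str, str]:
    """First and last `size` residues, where signal peptides would be found."""
    return seq[:size], seq[-size:]
```

For example, `reduce_hydrophobicity("MVLKAC")` yields `"HHHOOH"`, and calling `gapped_pairs` on that string with gaps 0, 1 and 2 gives three small count vectors that capture local spacing between hydrophobic residues.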


Datasets

  • WolfPsort

    • Three groups of species

      • 12771 animal, 2333 plant and 2113 fungal proteins

    • From SwissProt

    • 12 unique labels

    • Maximum of two labels

    • Very imbalanced

  • PHPD

    • 5191 yeast proteins

    • 22 unique labels

    • Ranges from 2 to 5 possible labels per protein


Experiments

  • Cross-validation (2-fold for PHPD, 5-fold for WolfPsort)

  • Prediction and Scoring

    • WolfPsort

      • Partial score for partially correct predictions

      • Never predicts more than 2 locations

    • PHPD

      • Always predicts three locations

      • Three measures: any label correct, average score over the correct labels, and a per-class score for each class prediction
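
The slides do not give the exact scoring formulas, so the sketch below shows only one plausible scheme for multi-label scoring. `any_correct` mirrors the "anything correct" measure, while `partial_score` is an assumed definition of partial credit (overlap divided by the larger of the two label sets). Both function names are hypothetical.

```python
from typing import Set


def any_correct(predicted: Set[str], actual: Set[str]) -> bool:
    """True if at least one predicted location is among the true locations."""
    return bool(predicted & actual)


def partial_score(predicted: Set[str], actual: Set[str]) -> float:
    """Assumed partial-credit scheme: overlap divided by the larger label set.

    A fully correct prediction scores 1.0; missing or extra labels lower the score.
    """
    if not predicted or not actual:
        return 0.0
    return len(predicted & actual) / max(len(predicted), len(actual))
```

Under this scheme, predicting only "cyto" for a protein labeled "cyto" and "nucl" scores 0.5, and predicting both locations scores 1.0.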


Results compared with WolfPsort


Results compared with PHPD


Conclusion

  • A hybrid classifier exploits the strengths of both BLAST similarity searching and machine learning classification, and can work on a variety of datasets without parameter changes

  • ECOC is well suited to representing protein localization problems

  • modECOC handles multi-label problems with flexibility during prediction

