


Species Independent Protein Localization Prediction for Multi-compartmentalized Proteins

Mark Doderer, Kihoon Yoon, and Stephen Kwek

Department of Computer Science,

University of Texas at San Antonio,

San Antonio, Texas 78249, USA


Presentation Outline

  • Problem Overview

    • Background

    • Problem Statement and Approach

  • Methods and Materials

    • Similarity Searching

    • modECOC

    • Datasets

  • Summary of Results

  • Conclusion


Protein Localization

  • To perform its function, a protein must localize to its intended cellular location

  • This information can be used to solve other problems

  • Experimental determination relies on cell fractionation, electron microscopy, and fluorescence microscopy. These methods are time-consuming, subjective, and highly variable

  • Putative (computational) determination has been shown to be accurate and faster, and it can annotate unknown proteins


Problem Statement

  • Single-location prediction

  • Multi-location prediction

  • Many predictors focus on the majority class


A hybrid algorithm

  • If a similar protein can be found, use the known protein's localization to predict that of the unknown protein

  • If a similar protein cannot be found, use a machine learning predictor built from the known data to predict the unknown protein
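The two-step decision above can be sketched in a few lines of Python. This is only an illustration of the control flow: the names `hybrid_predict`, `find_similar`, and `ml_predict` are hypothetical, and the toy stand-ins at the bottom replace the real similarity search and trained classifier.

```python
from typing import Callable, Optional, Set

# Hypothetical signatures: none of these names come from the presentation.
SimilaritySearch = Callable[[str], Optional[Set[str]]]
Predictor = Callable[[str], Set[str]]


def hybrid_predict(sequence: str,
                   find_similar: SimilaritySearch,
                   ml_predict: Predictor) -> Set[str]:
    """Return localization labels for one protein sequence."""
    labels = find_similar(sequence)
    if labels:  # a sufficiently similar annotated protein was found
        return labels
    return ml_predict(sequence)  # otherwise fall back to the learned model


# Toy stand-ins so the sketch runs on its own.
known = {"MKTAYIAKQR": {"cyto", "nucl"}}


def toy_similarity(seq: str) -> Optional[Set[str]]:
    return known.get(seq)  # exact match stands in for a real similarity search


def toy_model(seq: str) -> Set[str]:
    return {"mito"}  # constant output stands in for a trained classifier


print(hybrid_predict("MKTAYIAKQR", toy_similarity, toy_model))
print(hybrid_predict("GGGGGGGGGG", toy_similarity, toy_model))
```

Passing the two predictors in as callables keeps the fallback logic independent of how similarity or learning is actually implemented.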


Similarity Searching Classifier

  • BlastAll

  • PAM30 Matrix

  • Bit score of 100
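One way such a similarity filter might look in Python is shown below. It assumes the search results are available in BLAST's 12-column tabular format (legacy `blastall` writes this with `-m 8`), where the last column is the bit score, and it assumes that 100 acts as a minimum cutoff; the slide lists only the value. The function name `best_hit` and the sample hits are made up for illustration.

```python
from typing import Optional, Tuple

BIT_SCORE_CUTOFF = 100.0  # value from the slide; its use as a minimum is assumed


def best_hit(tabular_output: str,
             cutoff: float = BIT_SCORE_CUTOFF) -> Optional[Tuple[str, float]]:
    """Return (subject_id, bit_score) of the top hit at or above the cutoff."""
    best = None
    for line in tabular_output.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id, bit_score = fields[1], float(fields[11])
        if bit_score >= cutoff and (best is None or bit_score > best[1]):
            best = (subject_id, bit_score)
    return best


# Made-up hits in tabular format, for illustration only.
sample = (
    "query1\tP11111\t45.0\t60\t33\t0\t1\t60\t5\t64\t2e-04\t38.5\n"
    "query1\tP22222\t92.0\t120\t9\t1\t1\t120\t1\t119\t1e-60\t215.0\n"
)
print(best_hit(sample))  # ('P22222', 215.0)
```

When `best_hit` returns `None`, no known protein is similar enough and the machine learning classifier would take over.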


modECOC – machine learning classifier

  • Dietterich and Bakiri proposed the Error Correcting Output Code Classifier

    • Handles problems with many classes

    • Reliable class probability estimates

    • Doesn’t ignore the minority classes

    • Can use any classifier for the base classifiers
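As a reminder of how plain ECOC decoding works, here is a minimal Python sketch. Each class gets a binary codeword, each codeword position corresponds to one binary base classifier, and a prediction is decoded to the class whose codeword is nearest in Hamming distance. The code matrix and class names below are hypothetical, chosen so that any two codewords differ in four positions; they are not the matrix used in the presentation.

```python
from typing import Dict, List

# Hypothetical 7-bit code matrix: one row (codeword) per class,
# one column per binary base classifier.
CODE_MATRIX: Dict[str, List[int]] = {
    "cyto": [0, 0, 0, 0, 0, 0, 0],
    "nucl": [1, 1, 1, 1, 0, 0, 0],
    "plas": [1, 1, 0, 0, 1, 1, 0],
    "mito": [1, 0, 1, 0, 1, 0, 1],
}


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions at which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits: List[int]) -> str:
    """Return the class whose codeword is closest to the base classifier outputs."""
    return min(CODE_MATRIX, key=lambda cls: hamming(CODE_MATRIX[cls], bits))


# The "plas" codeword with its last bit flipped by one faulty base classifier.
print(decode([1, 1, 0, 0, 1, 1, 1]))  # plas
```

Because the codewords are four bits apart, a single wrong base classifier still decodes to the right class, which is the error-correcting property the name refers to.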


Relabeling a dataset


Modification to ECOC to allow for multi-location prediction

  • Modify base classifier labeling

    • “cyto_plas” will be re-labeled as 0.

    • “cyto_nucl” will be left out of this base classifier

  • Prediction through class score from voting

    • Find mean of class probabilities

    • Find how many standard deviations each class probability lies from the mean

    • Predict the classes whose scores differ significantly from those of the other classes
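A rough Python sketch of both modifications follows, under stated assumptions. For relabeling, it assumes a base classifier column that maps "cyto" and "plas" to 0 and "nucl" to 1, so a dual-label protein is kept only when both of its locations fall on the same side. For prediction, it assumes a one-sided rule with a cutoff of one standard deviation above the mean; the slides do not give the actual threshold. The names `relabel` and `predict_locations` and all scores are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional, Set


def relabel(labels: Set[str], column: Dict[str, int]) -> Optional[int]:
    """Binary target for one base classifier, or None to leave the sample out."""
    bits = {column[label] for label in labels}
    return bits.pop() if len(bits) == 1 else None


def predict_locations(scores: Dict[str, float], z_cutoff: float = 1.0) -> List[str]:
    """Classes whose score lies more than z_cutoff std devs above the mean."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all classes scored alike, nothing stands out
    return sorted(c for c, s in scores.items() if (s - mu) / sigma > z_cutoff)


# Assumed column for one base classifier.
column = {"cyto": 0, "plas": 0, "nucl": 1}
print(relabel({"cyto", "plas"}, column))  # 0: both locations on the same side
print(relabel({"cyto", "nucl"}, column))  # None: locations on opposite sides

# Made-up class scores from voting.
scores = {"cyto": 0.80, "nucl": 0.75, "plas": 0.10,
          "mito": 0.05, "extr": 0.05, "pero": 0.05}
print(predict_locations(scores))  # ['cyto', 'nucl']
```

A relative cutoff of this kind lets the number of predicted locations vary per protein instead of being fixed in advance.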


Features – characterizing the data

  • Amino acid frequency and sequence length

  • Physicochemical Characteristics

    • Betts and Russell

    • Hydrophobicity, polarity, size, aliphatic, charge, aromatic, and C-beta branching

  • For example hydrophobicity

    • Very hydrophobic - valine, isoleucine, leucine, methionine, phenylalanine, tryptophan, and cysteine

    • Remaining groups: least hydrophobic, partially hydrophobic, and other

  • Gapped amino acid pairs with gaps of 0, 1, and 2 residues

    • Offers spatial information

  • The N- and C-terminal regions contain the signal peptides, if they exist. Using 30 amino acids from each region with the reduced alphabet gives 19 x 2 features.
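The feature families listed above might be computed along these lines. This is a sketch, not the authors' feature extractor: the function names are hypothetical, sequences are assumed to be non-empty strings of standard one-letter codes, and the terminal-region function handles a single residue group rather than reproducing the full 19 x 2 reduced-alphabet features.

```python
from collections import Counter
from typing import Dict, List, Set

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VERY_HYDROPHOBIC = set("VILMFWC")  # the "very hydrophobic" group from the slide


def composition(seq: str) -> List[float]:
    """20 amino acid frequencies followed by the sequence length."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS] + [float(len(seq))]


def gapped_pairs(seq: str, gap: int) -> Dict[str, int]:
    """Count residue pairs separated by `gap` intervening positions."""
    step = gap + 1
    return dict(Counter(seq[i] + seq[i + step] for i in range(len(seq) - step)))


def terminal_fraction(seq: str, group: Set[str], n: int = 30) -> Dict[str, float]:
    """Fraction of residues from `group` in the first and last n positions."""
    n_term, c_term = seq[:n], seq[-n:]
    return {
        "n_term": sum(r in group for r in n_term) / len(n_term),
        "c_term": sum(r in group for r in c_term) / len(c_term),
    }


seq = "MKVLAAGIVGLLLAQ"  # made-up sequence
print(composition(seq)[-1])               # 15.0
print(gapped_pairs(seq, 0)["LL"])         # 2 adjacent LL pairs
print(gapped_pairs(seq, 1).get("LL", 0))  # 1 LL pair with one residue between
print(terminal_fraction(seq, VERY_HYDROPHOBIC))
```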


Datasets

  • WolfPsort

    • Three groups of species

      • 12771 animal, 2333 plant and 2113 fungi proteins

    • From SwissProt

    • 12 unique labels

    • Maximum of two labels

    • Very imbalanced

  • PHPD

    • 5191 yeast proteins

    • 22 unique labels

    • Number of possible labels per protein ranges from 2 to 5


Experiments

  • Cross-validation (2-fold for PHPD, 5-fold for WolfPsort)

  • Prediction and Scoring

    • WolfPsort

      • partial score for partially correct predictions

      • Never predicts more than 2 locations

    • PHPD

      • Always predicts three locations

      • Three measures: any label correct, average score over the correct labels, and a per-class score for each class prediction
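The slides do not give the scoring formulas, so the Python sketch below only illustrates the general idea of crediting partially correct multi-label predictions. The partial-credit rule (overlap divided by the larger of the two label sets) is an assumption made for this example, and both function names are hypothetical.

```python
from typing import Iterable, Set


def any_correct(predicted: Iterable[str], actual: Set[str]) -> bool:
    """True if at least one predicted location is a true location."""
    return any(p in actual for p in predicted)


def partial_score(predicted: Set[str], actual: Set[str]) -> float:
    """Assumed partial credit: overlap divided by the larger label set."""
    if not predicted or not actual:
        return 0.0
    return len(predicted & actual) / max(len(predicted), len(actual))


print(partial_score({"cyto", "nucl"}, {"cyto", "nucl"}))  # 1.0, exact match
print(partial_score({"cyto", "nucl"}, {"cyto"}))          # 0.5, one extra label
print(partial_score({"mito"}, {"cyto", "nucl"}))          # 0.0, nothing correct
print(any_correct(["cyto", "mito", "plas"], {"cyto", "nucl"}))  # True
```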


Results compared with WolfPsort


Results compared with PHPD


Conclusion

  • A hybrid classifier exploits the strengths of both BLAST similarity searching and machine learning classification, and it can work on a variety of datasets without parameter changes

  • ECOC is well suited to representing protein localization problems

  • modECOC handles multi-label problems with flexibility during prediction

