
Using a Mixture of Probabilistic Decision Trees for Direct Prediction of Protein Functions


Presentation Transcript


  1. Using a Mixture of Probabilistic Decision Trees for Direct Prediction of Protein Functions Paper by Umar Syed and Golan Yona, Department of Computer Science, Cornell University. Presentation by Andrejus Parfionovas, Department of Math & Stat, USU

  2. Classical methods to predict the structure of a new protein: • Sequence comparison against known proteins in search of similarities • Sequences often diverge and become unrecognizable • Structure comparison against known structures in the PDB database • Structural data are sparse and not available for newly sequenced genes

  3. What other features can be used to improve prediction? • Domain content • Subcellular location • Tissue specificity • Species type • Pairwise interaction • Enzyme cofactors • Catalytic activity • Expression profiles, etc.

  4. With so many features, it is important: • To extract relevant information • Directly from the sequence • Predicted secondary structure • Features extracted from databases • To combine the data in a feasible model • A mixture model of Probabilistic Decision Trees (PDT) was used

  5. Features extracted directly from the sequence (percentages): • 20 individual amino acids • 16 amino acid groups (positively or negatively charged, polar, aromatic, hydrophobic, acidic, etc.) • 20 most informative dipeptides
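To make these sequence-derived features concrete, here is a minimal Python sketch (not from the paper) that computes amino-acid and dipeptide percentages for a single sequence; the dipeptide list passed in is a placeholder, not the exact "20 most informative" set used by Syed and Yona.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_features(seq, dipeptides):
    """Return amino-acid and dipeptide percentages for one sequence."""
    seq = seq.upper()
    n = len(seq)
    counts = Counter(seq)
    # 20 individual amino-acid percentages
    aa_pct = {aa: 100.0 * counts[aa] / n for aa in AMINO_ACIDS}
    # Percentages of selected dipeptides (placeholder for the paper's
    # "20 most informative" list)
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    dp_pct = {dp: 100.0 * pair_counts[dp] / max(n - 1, 1) for dp in dipeptides}
    return aa_pct, dp_pct

# Example: composition_features("MKTAYIAKQR", ["AK", "KQ"])
```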

  6. Features predicted from the sequence: • Secondary structure predicted by PSIPRED: • Coil • Helix • Strand

  7. Features extracted from the SWISS-PROT database: • Binary features (presence/absence) • Alternative products • Enzyme cofactors • Catalytic activity • Nominal features • Tissue specificity (2 different definitions) • Subcellular location • Organism and species classification • Continuous features • Number of patterns exhibited by each protein (the “complexity” of a protein)

  8. Mixture model of PDTs (Probabilistic Decision Trees) • Can handle nominal data • Robust to errors • Missing data is allowed

  9. How to select an attribute for a decision node? • Use entropy to measure the impurity • Impurity must decrease after the split • Alternative measure – the López de Mántaras distance metric (has a lower bias towards attributes with low split information).
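A minimal sketch of the standard entropy-based criterion, assuming categorical attributes and examples given as (feature dict, label) pairs; the alternative de Mántaras distance normalizes differently and is not reproduced here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (impurity) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Drop in impurity from splitting `examples` on `attribute`.
    Each example is a (feature_dict, label) pair."""
    labels = [y for _, y in examples]
    before = entropy(labels)
    after = 0.0
    for value in {x[attribute] for x, _ in examples}:
        subset = [y for x, y in examples if x[attribute] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return before - after
```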

  10. Enhancements of the algorithm: • Dynamic attribute filtering • Discretizing numerical features • Multiple values for attributes • Missing attributes • Binary splitting • Leaf weighting • Post-pruning • 10-fold cross-validation
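As an illustration of one of these enhancements, here is a hypothetical entropy-based discretization of a continuous feature (for example the "complexity" count): scan candidate cut points and keep the one with the lowest size-weighted entropy. The paper's exact discretization procedure may differ.

```python
import math
from collections import Counter

def entropy(labels):
    # Same Shannon entropy as in the earlier sketch
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the cut point on a continuous feature that minimizes the
    size-weighted entropy of the two resulting subsets."""
    pairs = sorted(zip(values, labels))
    best_cut, best_impurity = None, float("inf")
    for i in range(1, len(pairs)):
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for v, y in pairs if v <= cut]
        right = [y for v, y in pairs if v > cut]
        impurity = (len(left) * entropy(left) +
                    len(right) * entropy(right)) / len(pairs)
        if impurity < best_impurity:
            best_cut, best_impurity = cut, impurity
    return best_cut
```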

  11. The probabilistic framework • An attribute is selected with a probability that depends on its information gain • Trees are weighted by their performance
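A minimal sketch of probabilistic attribute selection, assuming attributes are sampled with probability proportional to their information gain; the paper's exact sampling and tree-weighting scheme may differ.

```python
import random

def sample_attribute(gains):
    """gains: dict mapping attribute name -> information gain (>= 0).
    Returns one attribute, sampled with probability proportional to its gain."""
    total = sum(gains.values())
    if total == 0:
        return random.choice(list(gains))
    r = random.uniform(0, total)
    cumulative = 0.0
    for attribute, gain in gains.items():
        cumulative += gain
        if r <= cumulative:
            return attribute
    return attribute  # guard against floating-point rounding

# Example: sample_attribute({"hydrophobic_pct": 0.40, "subcellular_loc": 0.10})
```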

  12. Evaluation of decision trees • Accuracy = (tp + tn)/total • Sensitivity = tp/(tp + fn) • Selectivity = tp/(tp + fp) • Jensen-Shannon divergence score
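The confusion-matrix scores on this slide, plus a generic Jensen-Shannon divergence, can be written out as follows; tp, tn, fp, fn are counts from comparing predicted and true labels, and the divergence shown is the textbook definition, not necessarily the paper's exact scoring variant.

```python
import math

def scores(tp, tn, fp, fn):
    """Confusion-matrix based scores."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "selectivity": tp / (tp + fp),  # also known as precision
    }

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```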

  13. Handling skewed distributions (unequal class sizes) • Re-weight cases by 1/(# of counts) • Increases the impurity and the number of false positives • Mixed entropy • Uses the average of the weighted and unweighted information gain to split and prune trees • Interlaced entropy • Start with weighted samples and later switch to the unweighted entropy
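An illustrative sketch of the re-weighting and "mixed entropy" ideas, assuming each example is weighted by the inverse of its class size and the split criterion blends the weighted and unweighted gains; the paper's exact blending is not reproduced here.

```python
import math
from collections import Counter, defaultdict

def class_weights(labels):
    """Weight each class by 1 / (# of counts)."""
    return {y: 1.0 / c for y, c in Counter(labels).items()}

def weighted_entropy(labels, weights):
    """Entropy computed on weighted class masses instead of raw counts."""
    totals = defaultdict(float)
    for y in labels:
        totals[y] += weights[y]
    z = sum(totals.values())
    return -sum((w / z) * math.log2(w / z) for w in totals.values() if w > 0)

def mixed_gain(gain_unweighted, gain_weighted, alpha=0.5):
    """'Mixed entropy': average the weighted and unweighted information gain."""
    return alpha * gain_weighted + (1.0 - alpha) * gain_unweighted
```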

  14. Model selection (simplification) • Occam’s razor: of two models with the same result, choose the simpler one • Bayesian approach: the most probable model has the maximum posterior probability
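In generic terms (not the paper's specific prior over trees), the Bayesian criterion picks the tree T with maximal posterior probability given the data D, with simpler trees receiving a larger prior P(T), which formalizes the Occam's razor preference:

```latex
P(T \mid D) \;=\; \frac{P(D \mid T)\, P(T)}{P(D)} \;\propto\; P(D \mid T)\, P(T)
```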

  15. Learning strategy optimization

  16. Pfam classification test (comparison to BLAST) • PDT performance – 81% • BLAST performance – 86% • Main reasons: • Nodes remain impure because the weighted entropy stops learning too early • Important branches were eliminated by post-pruning when the validation set is small

  17. EC classification test (comparison to BLAST) • PDT performance on average – 71% • BLAST performance was often lower

  18. Conclusions • Many protein families cannot be defined by sequence similarity alone • The new method makes use of other features (structure, dipeptides, etc.) • Besides classification, PDTs allow feature selection for further use • Results are comparable to BLAST

  19. Modifications and Improvements • Use global optimization for pruning • Use probabilities for attribute values • Use boosting techniques (combine weighted trees) • Use the Gini index to measure node impurity
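For reference, the Gini index mentioned here as an alternative impurity measure is straightforward to compute; this is the standard definition, not code from the paper.

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# Example: gini(["kinase", "kinase", "protease"]) is about 0.44
```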
