
Predicting Protein Function Using Machine-Learned Hierarchical Classifiers

This research paper focuses on predicting protein function using machine-learned hierarchical classifiers. The study explores different predictors and evaluates their performance in a hierarchy. Experimental results and conclusions are presented.


Presentation Transcript


  1. Predicting Protein Function Using Machine-Learned Hierarchical Classifiers Roman Eisner Supervisors: Duane Szafron and Paul Lu eisner@cs.ualberta.ca

  2. Outline • Introduction • Predictors • Evaluation in a Hierarchy • Local Predictor Design • Experimental Results • Conclusion


  4. Proteins • Functional units in the cell • Perform a variety of functions • e.g. catalysis of reactions, structural and mechanical roles, transport of other molecules • Can take years to study a single protein • Any good leads would be helpful!

  5. Protein Function Prediction and Protein Function Determination • Prediction: • An estimate of what function a protein performs • Determination: • Work in a laboratory to observe and discover what function a protein performs • Prediction complements determination

  6. Proteins • Chain of amino acids • 20 amino acids • FASTA format: >P18077 – R35A_HUMAN MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL PAKAIGHRIRVMLYPSRI
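The FASTA record on this slide can be read with a few lines of code. The sketch below is a minimal parser, not a full implementation of the format (it ignores comment lines and wrapped-header edge cases), and the record text mirrors the slide's example:

```python
def parse_fasta(text):
    """Parse FASTA text into {header: sequence} -- a minimal sketch."""
    records, header, parts = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:           # flush the previous record
                records[header] = "".join(parts)
            header, parts = line[1:], []
        elif line:
            parts.append(line)               # sequence lines are concatenated
    if header is not None:
        records[header] = "".join(parts)
    return records

record = """>P18077 - R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI"""
sequences = parse_fasta(record)
```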

  7. Ontologies • Standardized Vocabularies (Common Language) • In biological literature, different terms can be used to describe the same function • e.g. “peroxiredoxin activity” and “thioredoxin peroxidase activity” • Can be structured in a hierarchy to show relationships

  8. Gene Ontology • Directed Acyclic Graph (DAG) • Always changing • Describes 3 aspects of protein annotations: • Molecular Function • Biological Process • Cellular Component
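A DAG of terms is commonly stored as a child-to-parents map, and the "true path" semantics used later in the talk amount to taking the ancestor closure of a term. The fragment below is a hypothetical slice of the Molecular Function aspect (real GO terms carry numeric identifiers), just to make the data structure concrete:

```python
# Hypothetical mini-DAG: child term -> list of parent terms.
PARENTS = {
    "thioredoxin peroxidase activity": ["peroxidase activity"],
    "peroxidase activity": ["oxidoreductase activity", "antioxidant activity"],
    "oxidoreductase activity": ["catalytic activity"],
    "antioxidant activity": ["molecular_function"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}

def ancestors(node, parents=PARENTS):
    """All ancestors of `node` in the DAG (the node itself excluded)."""
    seen, stack = set(), list(parents.get(node, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return seen
```

Because the graph is a DAG rather than a tree, a term can reach `molecular_function` through more than one path, which is why the traversal tracks visited nodes.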


  10. Hierarchical Ontologies • Can help to represent a large number of classes • Represent general and specific data • Some data is incomplete – could become more specific in the future

  11. Incomplete Annotations

  12. Goal • To predict the function of proteins given their sequence

  13. Data Set • Protein Sequences • UniProt database • Ontology • Gene Ontology Molecular Function aspect • Experimental Annotations • Gene Ontology Annotation project @ EBI • Pruned Ontology: 406 nodes (out of 7,399) with ≥ 20 proteins • Final Data Set: 14,362 proteins
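The pruning step keeps only nodes with at least 20 annotated proteins, where a protein annotated at a node also counts toward every ancestor of that node. The slides do not give the exact procedure, so the sketch below is one plausible reading; all names are assumptions, and the toy example uses a threshold of 2 instead of 20:

```python
from collections import Counter

def prune_ontology(protein_terms, parents, min_proteins=20):
    """Keep nodes annotated with >= min_proteins proteins after
    propagating each protein's terms up to all ancestors (sketch)."""
    counts = Counter()
    for terms in protein_terms.values():
        closure, stack = set(terms), list(terms)
        while stack:
            for p in parents.get(stack.pop(), ()):
                if p not in closure:
                    closure.add(p)
                    stack.append(p)
        counts.update(closure)  # each protein counted once per node
    return {n for n, c in counts.items() if c >= min_proteins}

# Toy hierarchy: B and C are children of A.
toy_parents = {"B": ["A"], "C": ["A"], "A": []}
toy_annotations = {"p1": ["B"], "p2": ["C"], "p3": ["B"]}
kept = prune_ontology(toy_annotations, toy_parents, min_proteins=2)
# A has 3 proteins after propagation, B has 2, C has only 1 -> C is pruned.
```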

  14. Outline • Introduction • Predictors • Evaluation in a Hierarchy • Local Predictor Design • Experimental Results • Conclusion

  15. Predictors • Global: • BLAST NN • Local: • PA-SVM • PFAM-SVM • Probabilistic Suffix Trees

  16. Predictors • Global: • BLAST NN • Local: • PA-SVM (linear) • PFAM-SVM (linear) • Probabilistic Suffix Trees

  17. Why Linear SVMs? • Accurate • Explainable • Each term in the dot product is meaningful
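The explainability point follows from the form of a linear SVM's decision function, w·x + b: each feature's contribution w_i·x_i can be read off directly. The toy numbers and feature names below are made up purely to illustrate this, not taken from the trained models in the talk:

```python
# Toy linear model: decision value = w.x + b.
w = {"Pfam:PF00578": 1.7, "keyword:peroxidase": 0.9, "keyword:membrane": -0.4}
b = -1.2
x = {"Pfam:PF00578": 1.0, "keyword:peroxidase": 1.0}   # query protein's features

# Per-feature contributions are directly inspectable -- this is the
# "each term in the dot product is meaningful" property.
contributions = {f: w[f] * x.get(f, 0.0) for f in w}
decision = sum(contributions.values()) + b              # 1.7 + 0.9 + 0.0 - 1.2 = 1.4
# decision > 0 -> predict the function, and `contributions` shows which
# features drove the prediction.
```

A kernelized SVM offers no such decomposition, which is one argument for staying linear when predictions must be justified to biologists.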

  18. PA-SVM Proteome Analyst

  19. PFAM-SVM Hidden Markov Models

  20. PST • Probabilistic Suffix Trees • Efficient variable-order Markov chains • Model the protein sequences directly • Prediction: score a query sequence under each node’s model
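A probabilistic suffix tree encodes a variable-order Markov model of the sequence. The slide's formulas did not survive extraction, so as a stand-in, the sketch below uses a fixed-order Markov model, which captures the same scoring idea (context counts plus smoothed log-likelihood) without the variable-length contexts a real PST would choose:

```python
import math
from collections import defaultdict

def train_markov(seqs, order=2):
    """Count symbol frequencies per length-`order` context -- a fixed-order
    simplification of the variable-order model a PST encodes."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return counts

def log_likelihood(seq, counts, order=2, alpha=1.0, alphabet=20):
    """Smoothed log-likelihood of `seq` under the model; a query is assigned
    to the function whose model scores it highest."""
    ll = 0.0
    for i in range(order, len(seq)):
        ctx, sym = seq[i - order:i], seq[i]
        total = sum(counts[ctx].values())
        ll += math.log((counts[ctx][sym] + alpha) / (total + alpha * alphabet))
    return ll
```

Training one such model per ontology node and comparing likelihoods gives a local predictor that works on the raw sequence, with no alignment step.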

  21. BLAST • Protein Sequence Alignment for a query protein against any set of protein sequences

  22. BLAST
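The BLAST NN (nearest-neighbour) predictor transfers the annotations of the best BLAST hit to the query. Assuming the alignment itself has already been run (a real system would invoke BLAST to produce the hit lists), the transfer step might look like this; the data-structure names and the e-value cutoff are assumptions:

```python
def blast_nn(query, hits, annotations, e_cutoff=1e-3):
    """Transfer the annotations of the best (lowest e-value) BLAST hit.
    `hits`: query -> [(subject_id, e_value), ...] from a prior BLAST run."""
    good = [(s, e) for s, e in hits.get(query, []) if e <= e_cutoff]
    if not good:
        return set()          # no good hit: no prediction (this costs coverage)
    best_subject = min(good, key=lambda se: se[1])[0]
    return set(annotations.get(best_subject, ()))

toy_hits = {"Q1": [("P2", 1e-30), ("P3", 1e-5)], "Q2": [("P4", 0.5)]}
toy_annotations = {"P2": ["peroxidase activity"], "P3": ["kinase activity"]}
```

Note the failure mode: when no hit passes the cutoff, this global predictor abstains entirely, which is exactly the coverage gap discussed in the results slides.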

  23. Outline • Introduction • Predictors • Evaluation in a Hierarchy • Local Predictor Design • Experimental Results • Conclusion

  24. Evaluating Predictions in a Hierarchy • Not all errors are equivalent • An error to a sibling differs from an error to an unrelated part of the hierarchy • Proteins can perform more than one function • Need to combine predictions of multiple functions into a single measure

  25. Evaluating Predictions in a Hierarchy • Semantics of the hierarchy – True Path Rule • Protein labeled with: {T} -> {T, A1, A2} • Predicted functions: {S} -> {S, A1, A2} • Precision = 2/3 = 67% • Recall = 2/3 = 67%

  26. Evaluating Predictions in a Hierarchy • Protein labeled with {T} -> {T, A1, A2} • Predicted: {C1} -> {C1, T, A1, A2} • Precision = 3/4 = 75% • Recall = 3/3 = 100%
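Both worked examples follow from one computation: expand each label set with its ancestors under the True Path Rule, then take set-overlap precision and recall. A sketch, using a toy hierarchy matching the slides (A2 above A1, with T and S as siblings under A1, and C1 a child of T):

```python
def hierarchical_pr(true_terms, pred_terms, parents):
    """(precision, recall) after expanding both label sets with their
    ancestors, per the True Path Rule."""
    def expand(terms):
        out, stack = set(terms), list(terms)
        while stack:
            for p in parents.get(stack.pop(), ()):
                if p not in out:
                    out.add(p)
                    stack.append(p)
        return out
    t, q = expand(true_terms), expand(pred_terms)
    overlap = len(t & q)
    return overlap / len(q), overlap / len(t)

toy = {"T": ["A1"], "S": ["A1"], "C1": ["T"], "A1": ["A2"], "A2": []}
# Slide 25: true {T}, predicted sibling {S} -> shared ancestors still count.
p1, r1 = hierarchical_pr({"T"}, {"S"}, toy)    # 2/3, 2/3
# Slide 26: true {T}, predicted child {C1} -> over-specific but fully correct path.
p2, r2 = hierarchical_pr({"T"}, {"C1"}, toy)   # 3/4, 3/3
```

This is what makes an error to a sibling cheaper than an error to an unrelated subtree: siblings share ancestors, so part of the expanded prediction is still right.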

  27. Supervised Learning

  28. Cross-Validation • Used to estimate performance of classification system on future data • 5 Fold Cross-Validation:
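The 5-fold scheme partitions the proteins into five disjoint sets and holds each out once. The sketch below shows only the partitioning idea (a round-robin split; real pipelines shuffle first and must keep all annotations of a protein in the same fold):

```python
def five_folds(n, folds=5):
    """Round-robin split of indices 0..n-1 into `folds` disjoint test sets."""
    return [list(range(f, n, folds)) for f in range(folds)]

# Each fold is held out once; the other four folds form the training set.
for test_idx in five_folds(25):
    train_idx = [i for i in range(25) if i not in set(test_idx)]
    # fit the classification system on train_idx, score it on test_idx,
    # then average the five fold scores to estimate future performance
```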

  29. Outline • Introduction • Predictors • Evaluation in a Hierarchy • Local Predictor Design • Experimental Results • Conclusion

  30. Inclusive vs Exclusive Local Predictors • In a system of local predictors, how should each local predictor behave? • Two extremes: • A local predictor predicts positive only for those proteins that belong exactly at that node • A local predictor predicts positive for those proteins that belong at or below its node in the hierarchy • No a priori reason to prefer either
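The two extremes differ only in which proteins count as positive training examples for a node's predictor. A sketch of that selection, with hypothetical names (`protein_terms` maps each protein to its annotated nodes, `descendants` maps a node to everything below it):

```python
def local_positives(node, protein_terms, descendants, inclusive):
    """Positive training examples for the local predictor at `node`:
    exclusive -> proteins annotated exactly at `node`;
    inclusive -> proteins annotated at `node` or any of its descendants."""
    target = {node} | (set(descendants.get(node, ())) if inclusive else set())
    return {p for p, terms in protein_terms.items() if set(terms) & target}

desc = {"catalytic activity": {"peroxidase activity"}}
prots = {"p1": ["catalytic activity"], "p2": ["peroxidase activity"]}
# Exclusive: only p1 (annotated exactly at the node).
# Inclusive: p1 and p2 (p2 is annotated at a descendant).
```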

  31. Exclusive Local Predictors

  32. Inclusive Local Predictors

  33. Training Set Design • Proteins in the current fold’s training set can be used in any way • Need to select for each local predictor: • Positive training examples • Negative training examples

  34. Training Set Design


  39. Comparing Training Set Design Schemes • Using PA-SVM

  40. Exclusive local predictors have more exceptions

  41. Lowering the Cost of Local Predictors • Top-Down • Compute local predictors top to bottom until a negative prediction is reached


  44. Top-Down Search
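The top-down strategy described above can be sketched in a few lines: evaluate local predictors from the root downward and prune a subtree as soon as its predictor answers negative, so deep predictors are only computed when their ancestors all fired. Here `is_positive(node, protein)` stands in for a trained local classifier:

```python
def top_down_predict(protein, root, children, is_positive):
    """Evaluate local predictors from the root down, pruning a subtree as
    soon as its local predictor predicts negative."""
    labels, frontier = set(), [root]
    while frontier:
        node = frontier.pop()
        if is_positive(node, protein):
            labels.add(node)
            frontier.extend(children.get(node, ()))  # descend only on positive
    return labels

# Toy tree: root has children A and B; A has child A1, B has child B1.
toy_children = {"root": ["A", "B"], "A": ["A1"], "B": ["B1"]}
```

The saving is that predictors below a negative node are never evaluated at all, which matters when the ontology has hundreds of nodes.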

  45. Outline • Introduction • Predictors • Evaluation in a Hierarchy • Local Predictor Design • Experimental Results • Conclusion

  46. Predictor Results

  47. Similar and Dissimilar Proteins • 89% of proteins – at least one good BLAST hit • Proteins which are similar (often homologous) to the set of well studied proteins • 11% of proteins – no good BLAST hit • Proteins which are not similar to the set of well studied proteins

  48. Coverage • Coverage: Percentage of proteins for which a prediction is made

  49. Similar Proteins – Exploiting BLAST • BLAST is fast and accurate when a good hit is found • Can exploit this to lower the cost of local predictors • Generate candidate nodes • Only compute local predictors for candidate nodes • Candidate node set should have: • High Recall • Minimal Size

  50. Similar Proteins – Exploiting BLAST • Candidate node generation methods: • Searching outward from a BLAST hit • Taking the union of more than one BLAST hit’s annotations
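One way the two generation methods combine: take the union of the top hits' annotation sets, then search one step outward in the ontology to raise recall while keeping the candidate set small. The slides give no concrete algorithm, so the sketch below is an illustrative reading with assumed data-structure names:

```python
def candidate_nodes(query, hits, annotations, parents, children, top_k=3):
    """Union the annotations of the top-k BLAST hits, then expand one step
    outward (to parents and children) in the ontology."""
    cand = set()
    for subject, _e in sorted(hits.get(query, []), key=lambda se: se[1])[:top_k]:
        cand |= set(annotations.get(subject, ()))       # union of hit annotations
    for node in list(cand):
        cand |= set(parents.get(node, ())) | set(children.get(node, ()))
    return cand

toy_hits = {"Q": [("P1", 1e-20), ("P2", 1e-10)]}
toy_ann = {"P1": ["B"], "P2": ["C"]}
toy_parents = {"B": ["A"], "C": ["A"]}
toy_children = {"B": ["D"]}
```

Local predictors are then computed only for the returned candidates, which is where the cost saving over evaluating all 406 nodes comes from.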
