
Bayesian Classification of Protein Data

Thomas Huber (huber@maths.uq.edu.au), Computational Biology and Bioinformatics Environment (ComBinE), Department of Mathematics, The University of Queensland. Today’s talk: protein score functions from mining protein data; Bayesian classification; a toy example.


Presentation Transcript


  1. Bayesian Classification of Protein Data • Thomas Huber • huber@maths.uq.edu.au • Computational Biology and Bioinformatics Environment (ComBinE) • Department of Mathematics • The University of Queensland

  2. Today’s talk • Protein score functions from mining protein data • Bayesian classification • A toy example • A protein scoring function for fold recognition • Where are score/energy functions useful? • A few examples

  3. Why do we care about Protein Structures/Prediction? • Academic curiosity? • Understanding how nature works • Urgency of prediction • 10^4 structures are determined • insignificant compared to all proteins • sequencing = fast & cheap • structure determination = hard & expensive • [Figure legend: TrEMBL sequences (computer annotated); transistors in Intel processors; SwissProt sequences (annotated); structures in PDB]

  4. Three basic choices in (molecular) modelling • Representation • Which degrees of freedom are treated explicitly • Scoring • Which scoring function (force field) • Searching • Which method to search or sample conformational space

  5. Protein Scoring Functions from Mining Protein Data • Classification Theory • Find a set of classes and their descriptors (a classification) for n data points with q attributes (shape, amino acid type, etc.) • Theory of finite mixtures • Class = attribute probability distribution of all members
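
The finite-mixture idea on this slide can be written out as a formula. This is a sketch in assumed notation: the mixing weights \pi_j and class parameters \theta_j do not appear on the slide, m is the number of classes, and X_i is the i-th of the n data points.

```latex
P(X_i \mid S) \;=\; \sum_{j=1}^{m} \pi_j \, P(X_i \mid X_i \in c_j, \theta_j),
\qquad \sum_{j=1}^{m} \pi_j = 1, \quad i = 1, \dots, n
```

Each class c_j is thus a probability distribution over the q attributes, and a data point belongs to every class with some probability rather than to exactly one.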

  6. Bayesian approach • Simplifications • Stating a simplified model • Assume attributes are independently distributed • P(X_i ∈ c_j | S) requires class description • Expectation Maximization (EM)
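
A minimal sketch of EM for a mixture with independently distributed attributes, assuming one continuous attribute (modelled as a Gaussian) and one discrete attribute (modelled as a probability table). The function names, the quantile-based starting point, and the pseudocount are illustrative assumptions, not AutoClass itself; treating an angle as a plain Gaussian also ignores its periodicity.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def em_mixture(data, n_classes, n_symbols, n_iter=100):
    """Fit a mixture to data = [(x, symbol), ...] with symbol in range(n_symbols).

    Within a class the two attributes are assumed independent, so the class
    likelihood is the product of a Gaussian term and a discrete-table term.
    """
    n = len(data)
    xs = sorted(x for x, _ in data)
    overall_mean = sum(xs) / n
    overall_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in xs) / n) or 1.0

    # Assumed starting point: means at evenly spaced quantiles, flat tables.
    weights = [1.0 / n_classes] * n_classes
    means = [xs[int((j + 0.5) * n / n_classes)] for j in range(n_classes)]
    sds = [overall_sd] * n_classes
    tables = [[1.0 / n_symbols] * n_symbols for _ in range(n_classes)]
    resp = []

    for _ in range(n_iter):
        # E-step: probability that each data point belongs to each class.
        resp = []
        for x, s in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], sds[j]) * tables[j][s]
                     for j in range(n_classes)]
            z = sum(joint)
            resp.append([p / z for p in joint] if z > 0
                        else [1.0 / n_classes] * n_classes)

        # M-step: re-estimate class descriptions from the weighted members.
        for j in range(n_classes):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, (x, _) in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, (x, _) in zip(resp, data)) / nj
            sds[j] = max(math.sqrt(var), 1e-3)  # floor keeps the density finite
            counts = [1.0] * n_symbols  # pseudocount of 1 per symbol (assumption)
            for r, (_, s) in zip(resp, data):
                counts[s] += r[j]
            total = sum(counts)
            tables[j] = [c / total for c in counts]

    return weights, means, sds, tables, resp
```

The returned memberships are those computed in the last E-step. A single deterministic start is used here for simplicity; a real search would restart from several initial guesses and keep the best result.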

  7. How many classes? • Again Bayes’ rule • P(m) favours smaller number of classes • No over-fitting of data (unlike maximum likelihood methods)
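
Applying Bayes' rule to the number of classes m can be sketched as follows. The notation is assumed: X is the full data set and \theta collects all class parameters of an m-class model.

```latex
P(m \mid X) \;\propto\; P(X \mid m)\, P(m),
\qquad
P(X \mid m) \;=\; \int P(X \mid \theta, m)\, P(\theta \mid m)\, d\theta
```

Because P(X | m) averages the likelihood over all parameter values instead of taking its maximum, adding classes is rewarded only when the data support them. A maximum likelihood fit, by contrast, can always improve by adding another class.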

  8. A Toy Example: Dihedral preference of Valine • Four interesting degrees of freedom • φ,ψ-dihedral angles • Adjacent amino acid types (positions i-1 and i+1) • Data: 893 non-redundant proteins • 12074 four-dimensional data points

  9. Valine Data Classification • AutoClass classification • Model: Gaussian distribution for φ/ψ, discrete probabilities for amino acids • Total of 50 tries with the number of classes in [2:11] • Each try refined until fully converged • Best classification has 5 classes

  10. Amino Acid Attribute vectors of α-helix Classes • Log-Preferences

  11. Re-invention of the Wheel • Textbook secondary structure pattern • Helices are likely on the outside of proteins • i, i+3 and i+4 hydrophobic interface • From C.-I. Branden and J. Tooze, Introduction to Protein Structure

  12. Fragment-based Protein Scoring • Find classification for fragments of size 7 residues • 237566 fragments (1494 non-redundant protein chains) • 28 descriptors • 7 amino acid types • 14 φ/ψ-dihedral angles • 7 numbers of neighbours (one per amino acid) • 200 CPU hours on National Facility computers • 325 classes (modelling the probability distribution of native fragments) • Use this classification to evaluate likelihood of a fragment sequence-structure match • Total score = Σ fragment scores
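
The summation over fragments can be sketched as a sliding window. This assumes overlapping 7-residue windows shifted by one residue; `fragment_score` is a hypothetical placeholder for whatever evaluates a fragment against the 325-class classification, which the slides do not spell out.

```python
from typing import Callable, List, Sequence, Tuple

FRAGMENT_SIZE = 7  # fragment length stated on the slide


def fragment_windows(n_residues: int, size: int = FRAGMENT_SIZE) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of all overlapping windows of the given size."""
    return [(start, start + size) for start in range(n_residues - size + 1)]


def total_score(
    sequence: Sequence,
    structure: Sequence,
    fragment_score: Callable[[Sequence, Sequence], float],
    size: int = FRAGMENT_SIZE,
) -> float:
    """Sum the fragment scores over every window of an aligned sequence-structure pair.

    `sequence` holds one amino acid per position and `structure` one structural
    descriptor per position (for example dihedral angles and neighbour count).
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    return sum(
        fragment_score(sequence[a:b], structure[a:b])
        for a, b in fragment_windows(len(sequence), size)
    )
```

A chain shorter than one fragment yields no windows and therefore a total score of zero in this sketch.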

  13. Fold Recognition = Computer Matchmaking • Structure Disco

  14. Does it work? • Discrimination (TIM 1amk_) • Generalisation

  15. Sequence-Structure Matching: The search problem • Gapped alignment = combinatorial nightmare
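
To give a feel for the size of the search space, the sketch below counts gapped global alignments of a sequence of length n against a structure of length m. It assumes one common counting convention, in which every step is a match, a gap in the sequence, or a gap in the structure, and different orderings of adjacent gaps count as different alignments. The function name is illustrative.

```python
def count_gapped_alignments(n: int, m: int) -> int:
    """Count alignment paths built from match, insertion and deletion steps."""
    # With one side empty there is exactly one alignment: all gaps.
    counts = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            counts[i][j] = (
                counts[i - 1][j]        # residue i aligned to a gap
                + counts[i][j - 1]      # structure position j aligned to a gap
                + counts[i - 1][j - 1]  # residue i matched to position j
            )
    return counts[n][m]
```

Even for lengths of 3 and 3 there are already 63 such alignments, and the count grows exponentially with length, which is why exhaustive enumeration is not an option.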

  16. Why is Fold Recognition better than Sequence Comparison? • Comparison is done in structure space not in sequence space

  17. Finding Remote Homologues with sausage • 572 sequence-structure pairs • Structures are similar (FSSP) • > 70% structurally aligned • < 20% sequence identity

  18. RNA-dependent RNA Polymerases

  19. A Real Case Example: RNA-dependent RNA polymerases • Dengue virus • Bacteriophage φ6

  20. Is this Yet Another Profile Method? • Yes, but a much more general profile method • Profile is not residue based (like profile-like threading force fields) • Profiles are not for protein families (like in HMMs or Ψ-Blast) • BUT local sequence profiles for optimally chosen classes of fragments • Local profiles can be arbitrarily assembled • Extreme flexibility • Sequence-structure alignment (= assembling best profile matches) • Deterministic, using dynamic programming
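
A minimal sketch of deterministic alignment by dynamic programming, assuming a generic global alignment with a linear gap penalty. The `match` matrix stands in for the profile-derived scores of placing each sequence residue at each structure position. The function name and the gap value are assumptions, not the scheme used in sausage.

```python
from typing import List


def best_alignment_score(match: List[List[float]], gap: float = -1.0) -> float:
    """Return the optimal global alignment score.

    match[i][j] is the score for placing sequence residue i at structure
    position j; every unaligned residue or position costs `gap`.
    """
    n = len(match)
    m = len(match[0]) if n else 0

    # best[i][j] = best score aligning the first i residues to the first j positions.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + match[i - 1][j - 1],  # place residue at position
                best[i - 1][j] + gap,                      # leave residue unaligned
                best[i][j - 1] + gap,                      # skip structure position
            )
    return best[n][m]
```

The table has (n+1) x (m+1) cells and each is filled once, so the optimum is found in time proportional to n times m, without enumerating alignments.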

  21. People • sausage • Andrew Torda (RSC) • Oliver Martin (RSC) • GlnB/GlnK, RdR polymerases • Subhash Vasudevan (JCU) • Sausage and Cassandra freely available • http://rsc.anu.edu.au/~torda • huber@maths.uq.edu.au
