Conditional Graphical Models for Protein Structure Prediction

Conditional Graphical Models for Protein Structure Prediction. Yan Liu Language Technologies Institute School of Computer Science Carnegie Mellon University Oct 24, 2006. Nobelprize.org. DSCTFTTAAAAKAGKAKAG. Protein sequence. +. Protein function. Protein structure.

Presentation Transcript


  1. Conditional Graphical Models for Protein Structure Prediction Yan Liu Language Technologies Institute School of Computer Science Carnegie Mellon University Oct 24, 2006

  2. Nobelprize.org DSCTFTTAAAAKAGKAKAG Protein sequence + Protein function Protein structure Snapshot of Cell Biology

  3. Protein Structures and Functions Example: triple beta-spiral fold Adenovirus Fibre Shaft Virus Capsid Courtesy of Nobelprize.org

  4. Protein Structure Determination • Lab experiments: time- and labor-consuming • X-ray crystallography Nobel Prize, Kendrew & Perutz, 1962 • NMR spectroscopy Nobel Prize, Kurt Wüthrich, 2002 • The gap between sequence and structure necessitates computational methods of protein structure determination • 3,023,461 sequences vs. 36,247 resolved structures (1.2%) 1MBN 1BUS

  5. Protein Structure Hierarchy We focus on predicting the topology of the structures from sequences APAFSVSPASGACGPECA

  6. Major Challenges • Protein structures are non-linear • Long-range dependencies • Structural similarity often does not indicate sequence similarity • Sequence alignment reaches twilight zone (under 25% similarity) β-α-β motif Ubiquitin (blue) Ubx-Faf1 (gold)

  7. Previous Work • Sequence similarity perspective (fails to capture structural properties) • Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997] • Profile HMM, e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998] • Window-based methods, e.g. PSIPRED [Jones, 2001] • Physical forces perspective (generative models based on physical free energy) • Homology modeling or threading, e.g. Threader [Jones, 1998] • Structural biology perspective (hard to generalize due to the various informative features) • Methods of careful design for specific structures, e.g. αα- and ββ-hairpins, β-turn and β-helix [Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al, 2001]

  8. Structured Prediction • Many prediction tasks involve outputs with correlations or constraints • Output structures: sequence, tree, grid • Input: John ate the cat . / SEQUENCEXS…WGIKQLQAR • Output: HHHCCCEEE…EECCCCEEE • Fundamental importance in many areas • Potential for significant theoretical and practical advances

  9. Graphical Models • A graphical model is a graph representation of probability dependencies [Pearl 1993; Jordan 1999] • Node: random variables • Edges: dependency relations • Directed graphical model (Bayesian networks) • Undirected graphical model (Markov random fields)

  10. Conditional Random Fields • Hidden Markov model (HMM)[Rabiner, 1989] • Conditional random fields (CRFs)[Lafferty et al, 2001] • Model conditional probability directly • Allow arbitrary dependencies in observation • Adaptive to different loss functions and regularizers • Promising results in multiple applications
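The contrast between the two model families on this slide can be written out explicitly. The block below is the textbook form, not something recovered from the slide: x is the observed residue sequence, y the label sequence, f_k are feature functions with weights λ_k, and Z(x) is the normalizer. All of these symbols are assumptions borrowed from the standard CRF literature.

```latex
% HMM: generative, models the joint distribution of observations and labels
P(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{n} P(y_i \mid y_{i-1})\, P(x_i \mid y_i)

% Linear-chain CRF: discriminative, models the conditional distribution directly
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(y_{i-1}, y_i, \mathbf{x}, i) \Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(y'_{i-1}, y'_i, \mathbf{x}, i) \Big)
```

Because each f_k may inspect the whole of x, the CRF can use overlapping or long-range features of the observation without modelling how x itself is generated.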

  11. Protein Structure Prediction • Dependency between residues (single observation) • Dependency between components (subsequences of observations)

  12. Outline • Brief introduction to protein structures • Graphical models for structured-prediction • Conditional graphical models for protein structure prediction • General framework • Specific models • Experiment results • Conclusion and discussion

  13. Our Solution: Conditional Graphical Models Local dependency Long-range dependency • Outputs Y = {M, {Wi} }, where Wi = {pi, qi, si} • Feature definition • Node feature • Local interaction feature • Long-range interaction feature

  14. Conditional Graphical Models (II) • Conditional probability given observed sequences x is defined as • Prediction: • Training phase : learn the model parameters λ • Minimizing regularized negative log loss • Iterative search algorithms by seeking the direction whose empirical values agree with the expectation
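The formulas on this slide did not survive in the transcript. A hedged reconstruction, using the standard log-linear form for conditional graphical models, is sketched below. The clique set C_G, the feature functions f_k, the normalizer Z(x), the training pairs (x^(j), y^(j)), and the Gaussian-prior variance σ² are assumptions taken from the usual CRF formulation, not from the slide.

```latex
% Conditional probability over the cliques c of the graph G
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\Big( \sum_{c \in C_G} \sum_{k} \lambda_k\, f_k(\mathbf{x}, \mathbf{y}_c) \Big)

% Prediction
\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x})

% Training: regularized negative log loss
L(\lambda) = -\sum_{j} \log P\big(\mathbf{y}^{(j)} \mid \mathbf{x}^{(j)}\big) + \frac{\lVert \lambda \rVert^2}{2\sigma^2}

% Gradient: empirical feature values minus their model expectation
\frac{\partial L}{\partial \lambda_k} =
  -\sum_{j} \Big( F_k\big(\mathbf{x}^{(j)}, \mathbf{y}^{(j)}\big)
  - \mathbb{E}_{P(\mathbf{y} \mid \mathbf{x}^{(j)})}\big[ F_k\big(\mathbf{x}^{(j)}, \mathbf{y}\big) \big] \Big)
  + \frac{\lambda_k}{\sigma^2},
\qquad F_k(\mathbf{x}, \mathbf{y}) = \sum_{c \in C_G} f_k(\mathbf{x}, \mathbf{y}_c)
```

Setting the gradient to zero gives the condition the slide describes in words: at the optimum, the empirical feature values agree with their expectation under the model, up to the regularization term.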

  15. Major Components • Graph topology • Secondary structure prediction: CRF, kernel CRF • Tertiary fold recognition: Segmentation CRF, Chain graph model • Quaternary fold recognition: Linked segmentation CRF • Efficient inference • Prefer exact inference with O(nd) complexity • Resort to approximate inference • Features • Allows flexible and rich feature definition

  16. Protein Secondary Structure Prediction • Given a protein sequence, predict its secondary structure assignments • Three classes: helix (H), sheets (E) and coil (C) • Input: APAFSVSPASGACGPECA • Output: CCEEEEECCCCCHHHCCC

  17. CRF on Secondary Structure Prediction [Liu et al, Bioinformatics 2004] C C E E … C • Node semantics - secondary structure assignment • Graphical model - conditional random fields (CRFs) or kernel CRFs • Inference algorithm - efficient inference exists, such as the forward-backward or Viterbi algorithm
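As an illustration of the Viterbi inference mentioned on this slide, here is a minimal sketch for a linear-chain model. It assumes the per-position and transition scores have already been computed from the features; the function name `viterbi` and the score layout are hypothetical and not taken from the presentation.

```python
def viterbi(node_scores, trans_scores):
    """Return the highest-scoring label path and its score.

    node_scores[t][s]  : score of label s at position t (assumed precomputed)
    trans_scores[a][b] : score of moving from label a to label b
    """
    n, k = len(node_scores), len(node_scores[0])
    best = [list(node_scores[0])]  # best[t][s]: best score of a path ending in s at t
    back = []                      # back[t-1][s]: best predecessor of s at position t
    for t in range(1, n):
        row, ptr = [], []
        for s in range(k):
            prev = max(range(k), key=lambda a: best[-1][a] + trans_scores[a][s])
            row.append(best[-1][prev] + trans_scores[prev][s] + node_scores[t][s])
            ptr.append(prev)
        best.append(row)
        back.append(ptr)
    last = max(range(k), key=lambda s: best[-1][s])
    path = [last]
    for ptr in reversed(back):     # follow back-pointers from the last position
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy usage with labels indexed as H=0, E=1, C=2 and no transition preference.
labels = "HEC"
path, score = viterbi(
    [[0, 0, 2], [0, 3, 0], [0, 3, 0], [1, 0, 0]],
    [[0] * 3 for _ in range(3)],
)
print("".join(labels[s] for s in path), score)
```

The running time is O(n·k²) for a sequence of length n with k labels, which is what makes exact inference practical for the three-class secondary structure task.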

  18. Protein Fold Recognition and Alignment • Protein fold: identifiable regular arrangement of secondary structural elements • Different from previous simple fold classification • Provide important information and novel biological insights • Training phase • Testing phase • Input: ..APAFSVSPASGACGPECA.. • Output 1: Does the target fold exist? Yes • Output 2: ..NNEEEEECCCCCHHHCCC..

  19. Conditional Graphical Model for Fixed Template Fold[Liu et al, RECOMB 2005] • Node semantics - secondary structure elements of variable lengths • Graphical model - segmentation conditional random fields (SCRFs) • Inference - forward-backward and Viterbi-like algorithm can be derived given some assumptions β-α-β motif
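To show what a Viterbi-like algorithm over variable-length segments can look like, the sketch below implements a generic semi-Markov decoder. It is not the SCRF algorithm from the cited paper: it only handles interactions between adjacent segments, and the names `segment_viterbi`, `toy_seg_score`, and `PREF` are hypothetical, introduced for illustration.

```python
def segment_viterbi(n, states, max_len, seg_score):
    """Best labelled segmentation of positions 0..n-1.

    seg_score(prev_state, state, start, end) scores the segment [start, end)
    labelled `state` that follows a segment labelled `prev_state`
    (prev_state is None for the first segment). Adjacent segments must
    carry different labels, and no segment is longer than max_len.
    """
    neg = float("-inf")
    best = [{s: neg for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for end in range(1, n + 1):
        for s in states:
            for length in range(1, min(max_len, end) + 1):
                start = end - length
                if start == 0:
                    cands = [(None, 0.0)]
                else:
                    cands = [(p, best[start][p]) for p in states if p != s]
                for p, prev_score in cands:
                    if prev_score == neg:
                        continue
                    score = prev_score + seg_score(p, s, start, end)
                    if score > best[end][s]:
                        best[end][s] = score
                        back[end][s] = (start, p)
    s = max(states, key=lambda t: best[n][t])
    total = best[n][s]
    if total == neg:
        raise ValueError("no valid segmentation under the given constraints")
    segs, end = [], n
    while end > 0:                 # follow back-pointers segment by segment
        start, p = back[end][s]
        segs.append((start, end, s))
        end, s = start, p
    segs.reverse()
    return segs, total


# Toy scoring: reward positions whose (hypothetical) preferred label matches
# the segment label, and charge 0.5 per segment to discourage fragmentation.
PREF = "HHHEEH"

def toy_seg_score(prev_state, state, start, end):
    return sum(PREF[i] == state for i in range(start, end)) - 0.5

print(segment_viterbi(len(PREF), "HE", 4, toy_seg_score))
```

The cost is O(n · max_len · k²) for k segment labels, so bounding the segment length is one of the assumptions that keeps exact inference tractable.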

  20. Conditional Graphical Model for Repetitive Fold Recognition [Liu et al, ICML 2005] • Node semantics - two-layer segmentation Y = {M, {Ξi}, T} • Level 1: envelope, or one repeat; level 2: components of one repeat • Graphical model - chain graph model • A graph containing both directed and undirected edges • Inference - forward-backward algorithm and Viterbi-like algorithm

  21. Conditional Graphical Model for Quaternary Fold Recognition [Liu et al, IJCAI 2007] • Node semantics - secondary structure elements and/or simple fold • Graphical model - linked segmentation CRF (L-SCRF) • Fixed template and/or repetitive subunits • Inter-chain and intra-chain interactions

  22. Approximate Inference • Varying dimensionality requires reversible jump MCMC sampling [Green, 1995; Schmidler et al, 2001] • Four types of Metropolis proposals • State switching • Position switching • Segment split • Segment merge • Simulated annealing reversible jump MCMC [Andrieu et al, 2000] • Replace the sample with RJ MCMC • Theoretically converges to the global optimum

  23. Conditional Graphical Models for Protein Structure Prediction

  24. Model Roadmap • Models: conditional random fields, kernel CRFs, segmentation CRFs, chain graph model, linked segmentation CRFs • Extensions: kernelization, segment correlations, local and global tradeoff, inter-chain segment correlations • Generalized as conditional graphical models

  25. Outline • Brief introduction to protein structures • Graphical models for structured prediction • Conditional graphical models for protein structure prediction • Experiment results • Fold recognition • Fold alignment prediction • Discovery of potential membership proteins • Conclusion and discussion

  26. Experiments: Target Fold • Right-handed β-helix fold [Yoder et al, 1993] • Bacterial infection of plants, binding the O-antigen and so on • Leucine-rich repeats (LRR) [Kobe & Deisenhofer, 1994] • Structural framework for protein-protein interaction

  27. Experiments: Target Quaternary Fold • Triple beta-spirals [van Raaij et al. Nature 1999] • Virus fibers in adenovirus, reovirus and PRD1 • Double barrel trimer [Benson et al, 2004] • Coat protein of adenovirus, PRD1, STIV, PBCV

  28. Tertiary Fold Recognition: β-Helix fold • Histogram and ranks for known β-helices against PDB-minus dataset • Chain graph model reduces the real running time of the SCRF model by around 50 times

  29. Quaternary Fold Recognition: Triple β-Spirals • Histogram and ranks for known triple β-spirals against PDB-minus dataset

  30. Quaternary Fold Recognition: Double Barrel-Trimer • Histogram and ranks for known double barrel-trimer against PDB-minus dataset

  31. Fold Alignment Prediction: β-Helix • Predicted alignment for known β-helices on cross-family validation

  32. Fold Alignment Prediction: LRR and Triple β-Spirals • Predicted alignment for known LRRs using chain graph model (left) and triple β-spirals using L-SCRFs

  33. Discovery of Potential β-helices • Hypothesize potential β-helices from UniProt reference databases • Full list can be accessed at www.cs.cmu.edu/~yanliu/SCRF.html • Verification on proteins with later resolved structures from different organisms • 1YP2: potato tuber ADP-glucose pyrophosphorylase • 1PXZ: major allergen from cedar pollen • GP14 of Shigella bacteriophage as a β-helix protein

  34. Conclusion • Thesis Statement • Conditional graphical models are effective for protein structure prediction • Strong claims • Effective representation for protein structural properties • Flexibility to incorporate different kinds of informative features • Efficient inference algorithms for large-scale applications • Weak claims • Ability to handle long-range interactions • Best performance bounded by prior knowledge

  35. Contribution and Limitation • Contribution to machine learning • Enrichment of graphical models • Formulation to incorporate domain knowledge • Contribution to computational biology • Effective for protein structure prediction and fold recognition • Solutions for the long-range interactions (inter-chain and intra-chain) • Limitation • Manual feature extraction • Difficulty in verification • High complexity

  36. Future Work • Computational biology • Protein structure prediction • Protein function and protein-protein interaction prediction • Drug target design • Machine learning • Graph-based semi-supervised learning • Active learning for structured data • Graph topology learning

  37. Acknowledgement • Jaime Carbonell, Eric Xing, John Lafferty, Vanathi Gopalakrishnan • Chris Langmead, Yiming Yang, Roni Rosenfeld, Peter Weigele, Jonathan King, Judith Klein-Seetharaman, Ivet Bahar, James Conway and many more • And fellow graduate students …

  38. Features for Tertiary Fold Recognition • Node features • Regular expression template, HMM profiles • Secondary structure prediction scores • Segment length • Inter-node features • β-strand Side-chain alignment scores • Preferences for parallel alignment scores • Distance between adjacent B23 segments • Features are general and easy to extend

  39. Features for Protein Fold Recognition

  40. Discovery of Potential Double Barrel-Trimer • Potential proteins suggested in [Benson, 2005]

  41. Inference Algorithm for SCRF • Forward-backward algorithm* • Viterbi algorithm* p(state y_r ends at r | x_{l+1} x_{l+2} … x_{r-1} x_r and state y_l ends at l) =

  42. Contrastive Divergence

  43. Reversible jump MCMC Algorithm • Three types of proposals • Position switching: randomly select a segment j and a new position assignment d_j^(i+1) ~ U(d_{j-1}^(i), d_{j+1}^(i)) • Segment split: randomly select a segment j and split it into two segments, where (d_j^(i+1), d_{j+1}^(i+1)) = G(d_{j-1}^(i), u^(i)) and u^(i) ~ U • Segment merge: randomly select a segment j and merge segments j and j+1 • Simulated annealing reversible jump MCMC for computing y = argmax P(y|x) [Andrieu et al, 2000]
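The three proposal types can be illustrated with a deliberately simplified annealing search over segment boundaries. This is a sketch under strong assumptions, not the sampler from the presentation: it uses a plain Metropolis-style acceptance rule and omits the proposal-ratio and Jacobian terms that a proper reversible jump sampler requires. The names `propose`, `anneal`, `toy_score`, and `PREF`, and the cooling schedule, are hypothetical.

```python
import math
import random


def propose(bounds, n, rng):
    """Return a new sorted list of interior boundaries (values in 1..n-1)."""
    new = list(bounds)
    move = rng.choice(["shift", "split", "merge"])
    if move == "shift" and new:
        # Position switching: move one boundary between its two neighbours.
        j = rng.randrange(len(new))
        lo = new[j - 1] + 1 if j > 0 else 1
        hi = new[j + 1] - 1 if j + 1 < len(new) else n - 1
        new[j] = rng.randint(lo, hi)
    elif move == "split":
        # Segment split: add a boundary at an unused position.
        free = [p for p in range(1, n) if p not in new]
        if free:
            new.append(rng.choice(free))
            new.sort()
    elif move == "merge" and new:
        # Segment merge: drop a boundary, joining two adjacent segments.
        new.pop(rng.randrange(len(new)))
    return new


def anneal(score, n, steps=2000, t0=1.0, seed=0):
    """Simulated-annealing search for a high-scoring segmentation."""
    rng = random.Random(seed)
    cur = []
    cur_s = score(cur)
    best, best_s = cur, cur_s
    for i in range(steps):
        temp = t0 / (1 + i)  # assumed cooling schedule
        cand = propose(cur, n, rng)
        cand_s = score(cand)
        if cand_s >= cur_s or rng.random() < math.exp((cand_s - cur_s) / temp):
            cur, cur_s = cand, cand_s
            if cur_s > best_s:
                best, best_s = cur, cur_s
    return best, best_s


# Toy objective: reward label-homogeneous segments, charge 0.5 per segment.
PREF = "HHHEEH"

def toy_score(bounds):
    edges = [0] + list(bounds) + [len(PREF)]
    total = 0.0
    for a, b in zip(edges, edges[1:]):
        seg = PREF[a:b]
        total += max(seg.count("H"), seg.count("E")) - 0.5
    return total

print(anneal(toy_score, len(PREF)))
```

Split and merge change the number of segments, which is the varying dimensionality that motivates reversible jump moves in the first place; the shift move keeps the dimension fixed.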

  44. Simulated annealing reversible jump MCMC

  45. Protein Structural Graph for Beta-helix

  46. Protein Structure Determination • Lab experiments: time- and labor-consuming • X-ray crystallography • NMR spectroscopy • Electron microscopy and many more • Computational methods: • Homology modeling: ≥ 30% sequence similarity • Fold recognition: < 30% sequence similarity • Ab initio methods: no template structure needed • Active research area in multiple scientific fields

  47. Evaluation Measure • Q3 (accuracy) • Precision, recall • Segment overlap quantity (SOV) • Matthews correlation coefficient
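Three of the measures on this slide are simple enough to sketch directly. The code below assumes predictions and references are equal-length strings over H, E, and C; the function names are hypothetical, the predicted string is made up for illustration, and SOV is omitted because its segment-based definition is more involved.

```python
import math


def q3(pred, true):
    """Fraction of residues whose three-state label is predicted correctly."""
    if len(pred) != len(true):
        raise ValueError("sequences must have equal length")
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def _counts(pred, true, label):
    """Per-class confusion counts (tp, fp, fn, tn) for one label."""
    tp = sum(p == label and t == label for p, t in zip(pred, true))
    fp = sum(p == label and t != label for p, t in zip(pred, true))
    fn = sum(p != label and t == label for p, t in zip(pred, true))
    tn = len(true) - tp - fp - fn
    return tp, fp, fn, tn


def precision_recall(pred, true, label):
    tp, fp, fn, _ = _counts(pred, true, label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mcc(pred, true, label):
    """Matthews correlation coefficient for one label against the rest."""
    tp, fp, fn, tn = _counts(pred, true, label)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


true = "CCEEEEECCCCCHHHCCC"   # reference assignment from the earlier slide
pred = "CCEEEECCCCCCHHHHCC"   # made-up prediction with two errors
print(q3(pred, true), precision_recall(pred, true, "H"), mcc(pred, true, "H"))
```

Q3 treats every residue equally, so the per-class measures are useful when one class, often coil, dominates the sequence.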

  48. Outline • Brief introduction to protein structures • Discriminative graphical models • Generalized discriminative graphical models for protein fold recognition • Experiment results • Conclusion and discussion

  49. Graphical Models for Structured Prediction • Conditional Random Fields • Model conditional probability directly, not joint probability • Allow arbitrary dependencies in observation (e.g. long range, overlapping) • Adaptive to different loss functions and regularizers • Promising results in multiple applications • Recent developments • Alternative estimation algorithms (Collins, 2002; Dietterich et al, 2004) • Alternative loss functions, use of kernels (Taskar et al, 2003; Altun et al, 2003; Tsochantaridis et al, 2004) • Bayesian formulation (Qi and Minka, 2005) and semi-Markov version (Sarawagi and Cohen, 2004)
