
Modeling Protein Structures and Gene Regulatory Networks by Mining Protein and RNA-Seq Data

Modeling Protein Structures and Gene Regulatory Networks by Mining Protein and RNA-Seq Data. Jianlin Jack Cheng, PhD Computer Science Department University of Missouri, Columbia March 7, 2012. Two Challenges. Protein Structure Modeling Gene Regulatory Network Modeling. The Genomic Era.


Presentation Transcript


  1. Modeling Protein Structures and Gene Regulatory Networks by Mining Protein and RNA-Seq Data Jianlin Jack Cheng, PhD Computer Science Department University of Missouri, Columbia March 7, 2012

  2. Two Challenges • Protein Structure Modeling • Gene Regulatory Network Modeling

  3. The Genomic Era Collins, Venter, Human Genome, 2000

  4. Sequencing Revolution • $1000 Personal Genome in 2010s • Transcriptome • Proteome

  5. Genome Implications to Information Sciences and Life Sciences Elements and Systems

  6. Growth of Protein Sequences AGCWY…

  7. Growth of Protein Structures in PDB

  8. Computational Protein Structure Folding / Prediction: Structure = f(sequence)? E = mc²

  9. Template-Based Approach: Protein sequence space is astronomical! Protein structure space is limited! (Chothia, Nature, 1992) Target protein (MWLKKFGINKH…) → fold recognition against the Protein Data Bank → template → alignment

  10. Modeller Fisher, 2005

  11. Template-Free Protein Structure Prediction http://pubs.acs.org/subscribe/archive/mdd/v03/i09/html/willis.html

  12. Template-Free Approach. Sampling: MCMC and simulated annealing. Sequence (MWLKKFGINLLIGQSV…) → simulation → select the structure with minimum free energy
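The sampling idea on this slide can be illustrated with a minimal simulated-annealing sketch that uses the Metropolis acceptance rule. Everything here is an assumption made for illustration: `toy_energy` is a hypothetical stand-in for a real free-energy function, the "conformation" is just a list of angles, and the cooling schedule and step size are arbitrary choices, not the speaker's settings.

```python
import math
import random

def toy_energy(angles):
    """Hypothetical stand-in energy: lowest when every angle is near 1.0 rad."""
    return sum((a - 1.0) ** 2 for a in angles)

def simulated_annealing(energy, n_angles=10, steps=20000,
                        t_start=2.0, t_end=0.01, seed=0):
    """Minimise `energy` over a vector of angles with Metropolis moves."""
    rng = random.Random(seed)
    state = [rng.uniform(-math.pi, math.pi) for _ in range(n_angles)]
    e = energy(state)
    best, best_e = list(state), e
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        # Proposal: perturb one randomly chosen angle.
        i = rng.randrange(n_angles)
        candidate = list(state)
        candidate[i] += rng.gauss(0.0, 0.3)
        e_new = energy(candidate)
        # Metropolis rule: always accept downhill moves, and accept
        # uphill moves with probability exp(-(e_new - e) / t).
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / t):
            state, e = candidate, e_new
            if e < best_e:
                best, best_e = list(state), e
    return best, best_e

best_state, best_energy = simulated_annealing(toy_energy)
```

At high temperature the walk accepts many uphill moves and explores broadly; as the temperature drops it settles into a low-energy basin, and the lowest-energy state seen is returned.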

  13. Major Challenges in Protein Structure Prediction • Select best templates? • Generate best alignments? • Generate best models? • Select best models? Like finding a needle in a haystack!

  14. Major Challenges in Protein Structure Prediction • Select best templates? • Generate best alignments? • Generate best models? • Select best models? Wang, Eickholt, Cheng, Bioinformatics, 2010

  15. A Conformation Ensemble Approach • P(conformation) ∝ P(−energy) • Conformation Distribution • Maximum Likelihood & Maximum a Posteriori Brooks et al., 2001
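One common way to write the link between conformation probability and energy is the Boltzmann distribution. The slide does not spell out a formula, so this is an assumption about what is intended. The symbols are introduced here for illustration: x is a conformation, E(x) its energy, k_B the Boltzmann constant, T the temperature, and Z the normalizing partition function.

```latex
P(x) = \frac{1}{Z} \exp\!\left(-\frac{E(x)}{k_B T}\right),
\qquad
Z = \sum_{x'} \exp\!\left(-\frac{E(x')}{k_B T}\right)
```

Under this form the minimum-energy conformation is also the most probable one, which is why selecting by lowest energy and selecting by highest probability agree.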

  16. New Views on Protein Modeling Protein structure modeling problem is simply a grand computational and statistical sampling problem. • Random sampling (template-free) • Targeted sampling (template-based)

  17. A Unified Protein Structure Prediction Pipeline: Input query sequence → 1. Template Ranking → 2. Multiple-Template Combination (query-template alignments are combined) → 3. Model Generation → 4. Evaluation & Refinement → Output. Wang et al., Bioinformatics, 2010

  18. Sampling in Alignment and Fold Space • PSI-BLAST (sequence-profile) • SAM (sequence-HMM) • HMMER (sequence-HMM) • COMPASS (profile-profile) • HHSearch (HMM-HMM) • PRC (HMM-HMM) • FOLDpro (machine learning) • MSACompro (profile-profile) Cheng, Baldi, Bioinformatics, 2006; Deng, Cheng, BMC Bioinformatics, 2011

  19. Multi-Template Combination in Template and Alignment Space Query VR-RNNMGMPLIESSSYHDALFTLGYAGDRISQMLGMRYANNLHDLFLAEGYYEASQRKR Temp1 IAHIYANNLHDLFLAEGYYEASQRLFEIEL FGLMGN LSSWVGA (10^-80) Temp2 LLAQ-GRLSEMAGADALDVNIYIDSNG (10^-70) Temp3 QGTARDRAWQLEVERHRAQGTSASFL (10^-10) Temp4 AANQLDAMRALGYAQERYFEMDLMRRAPAGELSELFGAKAVDLK (10^-5) Cheng, BMC Structural Biology, 2008

  20. Multi-Template Combination in Template and Alignment Space
      Query  VR-RNNMGMPLIESSSYHDALFTLGYAGDRISQMLGMRYANNLHDLFLAEGYYEASQRKR
      Temp1  IAHIYANNLHDLFLAEGYYEASQRLFEIEL------FGLMGN------LSSWVGA-----  (10^-80)
      Temp2  LLAQ-GRLSEMAGADALDVNIYIDSNG---------------------------------  (10^-70)
      Temp3  ---------------------------ARDRAWQLEVERHRAQGTSASFL----------  (10^-10)
      Temp4  ----------------------------------------------------GAKAVDLK  (10^-5)

  21. Advantage: reduce variance of modeling. Cheng, BMC Structural Biology, 2008

  22. Multi-Template vs. Single-Top-Template: improved 38 of 45 targets; improvement of 6.8%; P-value < 10^-4. Cheng, BMC Structural Biology, 2008

  23. Combination of Template-Free and Template-Based Sampling. Protein modeling spectrum: 100% TBM ↔ 50% TBM + 50% FM ↔ 100% FM

  24. Recursive Protein Modeling – Integrate TBM and FM: Initial region decomposition → model aligned / certain regions by TBM → keep certain regions / core fixed → model unaligned / uncertain regions by FM (divide & conquer, conditional sampling) → compose TBM and FM components into larger certain components (increase fitness & reduce bias) → Satisfactory? No: repeat. Yes: finish. Cheng et al., 2011

  25. Recursive Modeling Mimics Protein Folding Cascade ks.uiuc.edu

  26. Core-Constrained Tail Refinement

  27. Template-Based + Template-Free & Recursive Modeling (CASP9) Cheng et al., 2011

  28. Insights – A Bayesian Approach • Incorporate prior information: template-based region • Conditional sampling: use certain regions to constrain uncertain regions • Reduce uncertainty gradually • Iteratively optimize the conformation

  29. Model Selection • Single model approach • Ensemble approach Wang et al., Proteins, 2009 Cheng et al., Proteins, 2009 Wang et al., Bioinformatics, 2011

  30. Model Quality Evaluation Select top 5 ranked models as references . . .

  31. Model Quality Assessment: Compare each model with the top five reference models → average the global quality scores → re-rank models (+10%). Cheng et al., Proteins, 2009; Wang and Cheng, 2011
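The reference-based re-ranking described on slides 30 and 31 can be sketched as follows. This is only an illustration under stated assumptions: models are reduced to short lists of numbers, `position_agreement` is a hypothetical toy similarity standing in for a real structural score such as GDT-TS, and the initial scores are made up.

```python
def position_agreement(a, b, cutoff=1.0):
    """Toy similarity: fraction of positions that differ by at most `cutoff`."""
    close = sum(1 for x, y in zip(a, b) if abs(x - y) <= cutoff)
    return close / len(a)

def rerank_by_references(models, initial_scores, n_refs=5):
    """Score each model by its average similarity to the top-ranked references."""
    # Step 1: pick the n_refs best models under the initial ranking.
    refs = sorted(initial_scores, key=initial_scores.get, reverse=True)[:n_refs]
    consensus = {}
    for name in models:
        # Step 2: compare the model with every reference except itself.
        others = [r for r in refs if r != name]
        total = sum(position_agreement(models[name], models[r]) for r in others)
        # Step 3: the average similarity is the new global quality estimate.
        consensus[name] = total / len(others)
    # Step 4: re-rank by the consensus score.
    ranking = sorted(consensus, key=consensus.get, reverse=True)
    return ranking, consensus

# Hypothetical toy pool: m4 is an outlier that the initial scoring over-rated.
models = {
    "m1": [0.0, 1.0, 2.0, 3.0],
    "m2": [0.0, 1.0, 2.0, 3.5],
    "m3": [0.0, 1.0, 2.2, 3.0],
    "m4": [5.0, 6.0, 7.0, 8.0],
}
initial_scores = {"m1": 0.8, "m2": 0.7, "m3": 0.6, "m4": 0.9}
ranking, consensus = rerank_by_references(models, initial_scores, n_refs=3)
```

In this toy pool the outlier `m4` starts at the top of the initial ranking but drops to the bottom after re-ranking, because it disagrees with the other reference models.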

  32. Iterative Ranking Wang and Cheng, 2011 Randomly selecting five reference models seems to work

  33. Model Refinement by Model Combination Structure comparison . . . Select top 5 models as seed models . . . Identify similar models or fragments Model ranking

  34. Model Combination and Averaging Average Advantage: reduce variance of modeling – maximize likelihood
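The variance-reduction claim can be made concrete with the standard result for the variance of an average. The setup is an assumption for illustration, not something stated on the slide: n model estimates X_1, …, X_n with equal variance σ² and equal pairwise correlation ρ.

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{n}\,\sigma^{2}
```

For independent estimates (ρ = 0) the variance falls as σ²/n. For correlated models the gain is bounded below by ρσ², so averaging helps most when the combined models are diverse.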

  35. CASP9 Top 20 Servers http://predictioncenter.org/casp9/

  36. CASP9 Top 20 Servers on Ab Initio Targets http://predictioncenter.org/casp9/

  37. Some High-Quality CASP Predictions T0390 GDT=0.90 T0426 GDT=0.97 T0432 GDT=0.92 T0458 GDT=0.97 Orange: structure; Green: model. 50 of 120 CASP8 targets were predicted with high accuracy (RMSD < 2 Å). Wang et al., 2010

  38. Modeling Gene Regulation Process by Mining RNA-Seq Data • Tens of thousands of genes • Expression of gene is regulated • Genes tend to function in groups • Regulators and targets Hasty et al., 2001

  39. Gene Regulatory Network Modeling (RNA-Seq, Microarray) Zhu et al., in preparation

  40. RNA-Seq Data Processing Steps • Isolate RNA • Prepare an RNA library • RNA sequencing by NGS • Read mapping • Quantification and analysis Pepke et al., 2009

  41. RNA-Seq Data Mapping • Unmapped reads • Ambiguous reads • Biological variance versus technical variance • Tools: TopHat, Bowtie Haas & Zody, 2010

  42. Construct Gene Expression Profiles • Count the number of reads mapped to each gene • Normalize counts into quantitative values by gene length and total number of reads • Tools: Cufflinks, HTSeq, MULTICOM • RPKM: reads per kilobase per million mapped reads
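The RPKM normalization named on this slide is a one-line calculation. Here is a minimal sketch; the function name and the example numbers are illustrative assumptions, not values from the talk.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads.

    Dividing by (length / 1e3) and (total / 1e6) is the same as
    multiplying the raw count by 1e9 / (length * total).
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical example: 500 reads on a 2,000 bp gene, 10 million mapped reads.
example_value = rpkm(500, 2000, 10_000_000)  # 25.0
```

Normalizing by gene length stops long genes from appearing more highly expressed only because they collect more reads, and normalizing by library size makes samples sequenced to different depths comparable.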

  43. Mapping Results of Mouse Transcriptome Perturbed by Drug-like Compounds Li et al., 2011

  44. Identify Differentially Expressed Genes • t-test (Bioconductor) • Poisson distribution (DEGseq) • Negative binomial distribution (edgeR)
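As an illustration of the t-test option, the sketch below computes Welch's t statistic and its approximate degrees of freedom for one gene measured in two conditions. The expression values are hypothetical, and the named packages do considerably more than this (for example variance moderation and multiple-testing correction), so treat it as a reading aid, not as a substitute for them.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic (group_b minus group_a) and its degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    # Squared standard error contributed by each group.
    va = variance(group_a) / na
    vb = variance(group_b) / nb
    t = (mean(group_b) - mean(group_a)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical normalized expression of one gene, four replicates per condition.
control = [10.0, 12.0, 11.0, 9.0]
treated = [20.0, 22.0, 19.0, 21.0]
t_stat, dof = welch_t(control, treated)
```

A large absolute t value relative to the t distribution with `dof` degrees of freedom suggests the gene is differentially expressed. Count-based models are often preferred for RNA-Seq because read counts are discrete and the number of replicates is usually small.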

  45. Differential Expression Analysis Li et al., 2011

  46. Scatter Plot of Expression Values Li et al., 2011

  47. Costa, 2010

  48. Expression Profiles of Genes in Multiple Conditions

  49. Gene Regulatory Network • A cluster of genes having similar expression profiles • Several regulators whose expression can explain the expression of the cluster of genes Segal et al., Nature Genetics, 2003

  50. Expectation Maximization Approach: Generate initial clusters using K-means → recursively select TFs to construct decision trees that maximize likelihood → reassign each gene to the tree that maximizes its likelihood → Likelihood increased? Yes: repeat. No: stop.
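The control flow of this loop (initialize, refit the per-cluster model, reassign genes, stop when the likelihood no longer increases) can be sketched in a heavily simplified form. It rests on two stated assumptions. First, the TF decision tree for each cluster is replaced by the cluster's mean expression profile, with a unit-variance Gaussian likelihood, which makes the loop equivalent to K-means and means it is not the method from the talk. Second, the initial centers are simply genes picked at evenly spaced positions in sorted order, where the slide uses K-means clusters.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_profile(profiles):
    """Column-wise mean of a list of expression profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def hard_em_modules(expr, k, max_iter=50):
    """expr maps gene name -> list of expression values across conditions."""
    genes = sorted(expr)
    # Simplified initialisation: k genes at evenly spaced positions.
    centres = [list(expr[genes[i * len(genes) // k]]) for i in range(k)]
    assign, best_score = {}, float("-inf")
    for _ in range(max_iter):
        # Reassign each gene to the module that maximises its likelihood
        # (for a unit-variance Gaussian, that is the nearest centre).
        new_assign = {
            g: min(range(k), key=lambda j: sq_dist(expr[g], centres[j]))
            for g in genes
        }
        # Log-likelihood up to an additive constant.
        score = -0.5 * sum(sq_dist(expr[g], centres[new_assign[g]]) for g in genes)
        if score <= best_score:
            break  # likelihood did not increase: stop
        best_score, assign = score, new_assign
        # Refit each module's model (a mean profile stands in for the TF tree).
        for j in range(k):
            members = [expr[g] for g in genes if assign[g] == j]
            if members:
                centres[j] = mean_profile(members)
    return assign, best_score

# Hypothetical expression profiles over three conditions.
expr = {
    "g1": [1.0, 1.0, 1.0], "g2": [1.1, 0.9, 1.0],
    "g3": [5.0, 5.0, 5.0], "g4": [5.2, 4.9, 5.0],
}
assignment, loglik = hard_em_modules(expr, k=2)
```

In a fuller version, the refit step would grow a regression tree whose split variables are regulator (TF) expression levels, and the reassignment step would score each gene under every tree.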
