Achim Tresch Computational Biology PowerPoint Presentation
Presentation Transcript

  1. ‘Omics’ - Analysis of high-dimensional data. Achim Tresch, Computational Biology

  2. Regulation of gene expression. Diagram labels: protein; translation, localization, stability; post-transcriptional; mRNA, 3’UTR; transcriptional; Pol II, DNA; activation, repression

  3. Regulation of gene expression
     • Where does each transcription factor bind in the genome, in each cell type, at a given time? Near which genes?
     • What is the “cis-regulatory code” of each factor? Does it require any co-factors?
     Diagram labels: DNA, activation, repression

  4. Chromatin Immunoprecipitation (ChIP). Diagram labels: transcription factor of interest, antibody, sequencing

  5. Chromatin Immunoprecipitation (ChIP). Control: input DNA; sequencing

  6. Chromatin Immunoprecipitation (ChIP): sonication. Diagram labels: 25-40 bp; average length ~250 bp. Example double-stranded sequence:
     ACCAATAACCGAGGCTCATGCTAAGGCGTTAGCCACAGATGGAAGTCCGACGGCTTGATCCAGAATGGTGTGTGGATTGCCTTGGAACTGATTAGTGAATTC
     TGGTTATTGGCTCCGAGTACGATTCCGCAATCGGTGTCTACCTTCAGGCTGCCGAACTAGGTCTTACCACACACCTAACGGAACCTTGACTAATCACTTAAG

  8. Chromatin Immunoprecipitation (ChIP): ChIP-Seq Analysis Workflow. Steps, in order: alignment (ELAND, Bowtie, SOAP, SeqMap, …), peak detection (FindPeaks, CHiPSeq, BS-Seq, SISSRs, QuEST, MACS, CisGenome, …), motif analysis, annotation, visualization

  9. Read Alignment: Read direction provides extra information (Hongkai Ji et al., Nature Biotechnology 26: 1293-1300, 2008)

  10. Read Alignment. Figure: read count and expected read count along the genome. Expected read count = total number of reads * extended fragment length / chromosome length
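
The expected read count on this slide is a single ratio, so a minimal sketch is enough to make it concrete. The function name and the example numbers are hypothetical and chosen only for illustration.

```python
def expected_read_count(total_reads, extended_fragment_length, chromosome_length):
    """Expected number of extended reads covering one position,
    assuming reads fall uniformly along the chromosome."""
    if chromosome_length <= 0:
        raise ValueError("chromosome_length must be positive")
    return total_reads * extended_fragment_length / chromosome_length

# Hypothetical numbers, not taken from the slides.
lam = expected_read_count(total_reads=1_000_000,
                          extended_fragment_length=250,
                          chromosome_length=500_000_000)
print(lam)
```

With these made-up inputs the expected count per position is 0.5 reads.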

  11. Read Alignment

  12. Peak Detection
      • We need to correct for input DNA reads (control), which are
        - non-uniformly distributed (they form peaks too)
        - present in vastly different numbers in ChIP and input
      • Calculate the read count at each position (bp) in the genome
      • Determine whether the read count is greater than expected

  13. Peak Detection: Is the observed read count at a given genomic position greater than expected? The Poisson distribution (plot of frequency against read count), with x = observed read count and λ = expected read count

  14. Peak Detection: Is the observed read count at a given genomic position greater than expected? Example: x = 10 reads (observed), λ = 0.5 reads (expected). Under the Poisson distribution, P(X>=10) = 1.7 x 10^-10, so log10 P(X>=10) = -9.77 and -log10 P(X>=10) = 9.77
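
The numbers on this slide can be checked with a short sketch that sums the upper tail of the Poisson distribution directly. It uses only the standard library; the function names are illustrative, and λ is assumed to be strictly positive.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), evaluated in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_tail(x, lam, max_terms=1000):
    """P(X >= x), summing the upper tail term by term until it stops changing."""
    total = 0.0
    for k in range(x, x + max_terms):
        term = poisson_pmf(k, lam)
        total += term
        if term < total * 1e-17:
            break
    return total

p = poisson_tail(10, 0.5)
score = -math.log10(p)
print(f"P(X>=10) = {p:.1e}, -log10 P = {score:.2f}")
```

Summing the tail directly, rather than computing 1 minus the cumulative probability, avoids cancellation when the p-value is this small.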

  15. Peak Detection. Figure tracks: read count, expected read count, -Log(p). Expected read count = total number of reads * extended fragment length / chromosome length

  16. Peak Detection. Figure tracks: read count, expected read count, -Log(p), input reads. Expected read count = total number of reads * extended fragment length / chromosome length

  17. Peak Detection. Figure panels: ChIP and input, each showing read count and expected read count against genome position (bp), with tracks -Log(Pc) and -Log(Pi). Further labels: Threshold; Log(Pc) - Log(Pi)

  18. Peak Detection: Normalized peak score (at each bp): R = -log10( P(X_ChIP) / P(X_input) )
      • Determine all genomic regions with R >= 15
      • Merge peaks separated by less than 100 bp
      • Output all peaks with length >= 100 bp
      Will detect peaks with high read counts in ChIP, low in input
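
The three steps above can be sketched as a small function over a per-base score track. This is a minimal illustration, not the implementation of any particular peak caller. The function names, the half-open interval convention, and the reading of "separated by less than 100 bp" as the number of sub-threshold bases between two runs are assumptions.

```python
import math

def normalized_score(p_chip, p_input):
    """R = -log10(P(X_ChIP) / P(X_input)) for one position."""
    return math.log10(p_input) - math.log10(p_chip)

def call_peaks(scores, threshold=15.0, max_gap=100, min_length=100):
    """scores[i] is R at base pair i; returns half-open (start, end) intervals."""
    # 1. Collect maximal runs of positions with R >= threshold.
    runs, start = [], None
    for i, r in enumerate(scores):
        if r >= threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(scores)))
    # 2. Merge runs separated by fewer than max_gap base pairs.
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    # 3. Keep only peaks that are at least min_length base pairs long.
    return [(s, e) for s, e in merged if e - s >= min_length]

# Toy score track: two nearby runs that merge, and one short isolated run.
track = [0] * 50 + [20] * 60 + [0] * 30 + [20] * 80 + [0] * 200 + [20] * 40 + [0] * 40
peaks = call_peaks(track)
print(peaks)
```

In the toy track the first two runs are 30 bp apart and merge into one 170 bp peak, while the isolated 40 bp run is discarded by the length filter.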

  19. Peak Detection: The constant rate assumption does not hold! A negative binomial model fits the data better! (Hongkai Ji et al., Nature Biotechnology 26: 1293-1300, 2008)
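
To see why the choice of background model matters, the sketch below computes the same tail probability under a negative binomial with the same mean. The mean/size parameterisation and the value size = 1 are assumptions made for illustration; the slides do not state the fitted parameters.

```python
import math

def nb_pmf(k, mean, size):
    """Negative binomial pmf with the given mean and dispersion parameter
    `size` (variance = mean + mean**2 / size)."""
    p = size / (size + mean)
    log_pmf = (math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
               + size * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)

def nb_tail(x, mean, size):
    """P(X >= x), computed as 1 - P(X <= x - 1)."""
    return max(0.0, 1.0 - sum(nb_pmf(k, mean, size) for k in range(x)))

# Same observed and expected counts as the Poisson example (x = 10, mean = 0.5),
# with a hypothetical dispersion of size = 1.
p_nb = nb_tail(10, 0.5, 1.0)
print(f"negative binomial P(X>=10) = {p_nb:.2e}")
```

With size = 1 the tail probability is (1/3)^10, about 1.7 x 10^-5, which is five orders of magnitude larger than the Poisson value of 1.7 x 10^-10. An overdispersed background therefore makes the same read count far less surprising.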

  20. Visualization. Tracks: ChIP reads, input reads, detected peaks. 80% of detected peaks are within 20 kb of a known gene

  21. Motif Search. Each region carries two yes/no labels: whether it is a true TF binding peak (target regions: yes; random regions: no) and whether the motif is present or absent. The dependence between the two is quantified using the mutual information
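
A minimal sketch of the mutual information between "region is a true peak" and "motif is present", computed from a 2x2 table of region counts. The function name and the use of log base 2 (bits) are assumptions; the slides do not state the unit.

```python
import math

def mutual_information(n_peak_motif, n_peak_nomotif, n_random_motif, n_random_nomotif):
    """Mutual information in bits from a 2x2 contingency table of region counts."""
    table = [[n_peak_motif, n_peak_nomotif],
             [n_random_motif, n_random_nomotif]]
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    mi = 0.0
    for i in range(2):
        for j in range(2):
            n = table[i][j]
            if n > 0:  # empty cells contribute nothing
                mi += (n / total) * math.log2(n * total / (row_sums[i] * col_sums[j]))
    return mi

print(mutual_information(10, 10, 10, 10))  # motif unrelated to peak status
print(mutual_information(10, 0, 0, 10))    # motif perfectly predicts peak status
```

The two calls cover the extremes: 0 bits when the motif is equally common in both sets, and 1 bit when it occurs in every target region and no random region.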

  22. Motif Search: mutual information (MI) per k-mer
      k-mer     MI
      Highly informative:
      CTCATCG   0.0618
      TCATCGC   0.0485
      AAAATTT   0.0438
      GATGAGC   0.0434
      AAAAATT   0.0383
      ATGAGCT   0.0334
      TTGCCAC   0.0322
      TGCCACC   0.0298
      ATCTCAT   0.0265
      ...       ...
      Not informative:
      ACGCGCG   0.0018
      CGACGCG   0.0012
      TACGCTA   0.0011
      ACCCCCT   0.0010
      CCACGGC   0.0009
      TTCAAAA   0.0005
      AGACGCG   0.0004
      CGAGAGC   0.0003
      CTTATTA   0.0002
      Figure labels: MI=0.081, MI=0.045, MI=0.040, ...

  23. Motif Search: Optimizing k-mers into more informative degenerate motifs. Starting from a k-mer such as ATCCGTACA, single positions are made degenerate (e.g. ATCC[C/G]TACA), asking at each step: which character increases the mutual information by the largest amount? The motif is scored against true TF binding peaks (target regions: yes; random regions: no). Degenerate position labels in the figure: A/G, C/G/T, T/G, A/T/G, C/G, A/C/G
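
One way to read this slide is as a greedy search: try every single-character addition at every position, keep the one that raises the mutual information most, and stop when nothing improves. The sketch below illustrates that idea under stated assumptions. It is not the authors' implementation, the function names are hypothetical, and the toy sequences are made up.

```python
import math
import re

def mi_bits(labels, hits):
    """Mutual information (bits) between two equal-length boolean lists."""
    n = len(labels)
    mi = 0.0
    for a in (True, False):
        for b in (True, False):
            joint = sum(1 for l, h in zip(labels, hits) if l == a and h == b)
            n_a = sum(1 for l in labels if l == a)
            n_b = sum(1 for h in hits if h == b)
            if joint:
                mi += joint / n * math.log2(joint * n / (n_a * n_b))
    return mi

def motif_regex(motif):
    """Turn a list of character sets into a regex such as [A][T][CG]."""
    return re.compile("".join("[" + "".join(sorted(s)) + "]" for s in motif))

def score(motif, seqs, labels):
    rx = motif_regex(motif)
    return mi_bits(labels, [rx.search(s) is not None for s in seqs])

def optimize(kmer, seqs, labels):
    """Greedily add the single character that increases MI the most."""
    motif = [{c} for c in kmer]
    best = score(motif, seqs, labels)
    while True:
        candidate = None
        for pos, allowed in enumerate(motif):
            for c in "ACGT":
                if c in allowed:
                    continue
                trial = [set(s) for s in motif]
                trial[pos].add(c)
                s = score(trial, seqs, labels)
                if s > best + 1e-12:
                    best, candidate = s, trial
        if candidate is None:
            return motif, best
        motif = candidate

# Toy data: target regions carry ATCCGTACA or ATCCCTACA, random regions neither.
seqs = ["GGATCCGTACAGG", "TTATCCCTACATT", "ATCCGTACA", "CCATCCCTACAAA",
        "GGGGGGGGGGGGG", "ATATATATATATA", "CCCCAAAATTTT", "ACGTACGTACGTA"]
labels = [True] * 4 + [False] * 4
motif, best = optimize("ATCCGTACA", seqs, labels)
print(motif_regex(motif).pattern, round(best, 3))
```

On the toy data the seed k-mer matches only two of the four target regions, and allowing C at the fifth position recovers the other two, which mirrors the ATCC[C/G]TACA example on the slide.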

  24. Motif Search change

  25. Motif Analysis: Motif co-occurrence analysis. Figure labels: discovered motifs, enrichment, depletion

  26. The ENCODE Project
      Goal: Define all functional elements in the human genome
      How:
      • Lots of groups
      • Lots of assays
      • Lots of cell lines
      • Lots of communication/consortium analysis
      • Standardization of methods, reagents, analysis
      • Genome-wide
      • A lot of money

  27. The ENCODE Project
      • 2 Tier 1 cell lines
        - GM12878 (B cell)
        - K562 (CML cells)
      • 5 Tier 2 cells
        - HeLa S3, HepG2, HUVEC, primary keratinocytes, hESC
      • Many Tier 3 cells
      RNA profiling (Scott Tenenbaum): Inter-cell line differences are greater than inter-lab differences

  28. The ENCODE Project: Lots of data and data types generated by RNA-seq, RNA-array, TF ChIP-seq, histone modification ChIP-seq, DNaseHS-seq, Methyl-seq, Methyl27-bisulfite, 1M SNP genotyping

  29. Integrative Data Analysis. Data types and their processed form: open chromatin (std. peaks), transcription factor ChIP-seq (std. peaks), histone modification ChIP-seq (region calls), RNA (active regions), … Methods: HMM segmentation, PCA analysis, dynamic Bayesian networks. Outcome: biological interpretation

  30. Integrative Data Analysis. Input: 12 histone modifications and 2 transcription factors in GM12878 and K562. Model: 25-state HMM with “standard” EM training; genome decoding by posterior probability and Viterbi path. Example state labels: State E, State C, State F, State I, State A. Data: Entire ENCODE Consortium. Analysis: Jason Ernst/Manolis Kellis

  31. Metagene Analysis of RNA transcription: initiation. ChIP-chip profiles, averaged across ~300 expressed genes of medium length. Diagram labels: F, pA, Pol II, B, H, Kin28, CTD. (Lidschreiber et al., NSMB 2010; Mayer et al., Science 2012)
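
A metagene profile is an average of per-gene profiles placed on a common axis. The sketch below rescales each gene to a fixed number of bins by linear interpolation and then averages bin by bin. Whether the original analysis rescaled genes or only aligned them at fixed landmarks is not stated on the slide, so the rescaling step, the function names and the toy data are assumptions.

```python
def rescale(profile, n_bins):
    """Linearly interpolate one gene's profile onto n_bins equally spaced points.
    Requires n_bins >= 2."""
    if len(profile) == 1:
        return [float(profile[0])] * n_bins
    out = []
    for b in range(n_bins):
        pos = b * (len(profile) - 1) / (n_bins - 1)
        lo = int(pos)
        hi = min(lo + 1, len(profile) - 1)
        frac = pos - lo
        out.append(profile[lo] * (1 - frac) + profile[hi] * frac)
    return out

def metagene(profiles, n_bins=100):
    """Average occupancy per bin across all genes."""
    rescaled = [rescale(p, n_bins) for p in profiles]
    return [sum(column) / len(rescaled) for column in zip(*rescaled)]

# Two toy genes of different length.
average = metagene([[0, 1, 2], [2, 2, 2, 2, 2]], n_bins=3)
print(average)
```

Restricting the average to genes of similar length, as the slide does with genes of medium length, keeps the amount of stretching small.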

  32. Metagene Analysis of RNA transcription: initiation, promoter escape. Diagram labels: nascent RNA, F, pA, Pol II, B, H, Kin28, 5', CTD, S5P, S7P. (Lidschreiber et al., NSMB 2010; Mayer et al., Science 2012)

  33. Metagene Analysis of RNA transcription: initiation, promoter escape, elongation. Diagram labels: Elf1, nascent RNA, F, pA, Pol II, B, H, Kin28, Spt4/5, m7G, CE, CBP 20/80, Ctk1, Spt6, CTD, S5P, S2P, S7P. (Lidschreiber et al., NSMB 2010; Mayer et al., Science 2012)

  34. Metagene Analysis of RNA transcription: initiation, promoter escape, elongation, termination. Diagram labels: Elf1, nascent RNA, F, pA, Pol II, B, H, Kin28, Spt4/5, m7G, CE, CBP 20/80, Pcf11, Ctk1, Spt6, CTD, S5P, S2P, S7P. (Lidschreiber et al., NSMB 2010; Mayer et al., Science 2012)

  35. Metagene Analysis of RNA transcription: initiation, promoter escape, elongation, termination. Diagram labels: Elf1, F, pA, Pol II, B, H, Kin28, Spt4/5, m7G, CE, CBP 20/80, Pcf11, Ctk1, Spt6, CTD, S5P, S2P, S7P. Is the sequence of binding, dissociation and modification events universal?

  36. HMM Analysis of RNA transcription. Figure: ChIP-chip occupancy profiles along genomic position. Reference: Ernst and Kellis (2012): ChromHMM: automating chromatin state discovery and characterization

  37. HMM Analysis of RNA transcription ChIP-chip occupancy vectors

  38. HMM Analysis of RNA transcription. Figure: typical occupancy vector(s) for state 1 to state 5, and the transition matrix

  39. Textbook: Hidden Markov Models (HMMs)
      • D: Data (occupancy vectors) D1, D2, D3, ordered by genomic position
      • X: Hidden (transcription) states X1, X2, X3
      • Ψ: Emission distributions ΨX1, ΨX2, ΨX3
      • Γ: Transition probabilities ΓX1X2, ΓX2X3
      • [less important: P(X1): Initial state distribution]
      Likelihood: [formula shown on the slide]
      Decoding: Viterbi algorithm
      Parameter Learning: Baum-Welch algorithm
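
For a fixed state path, the standard HMM joint probability is P(X1) · ΨX1(D1) · ΓX1X2 · ΨX2(D2) · ΓX2X3 · ΨX3(D3) · … and Viterbi decoding finds the path that maximises it. The sketch below is a generic textbook implementation in log space, not code from the presentation. The toy parameters are made up, and the per-position emission likelihoods are passed in precomputed so that any emission distribution can be used.

```python
import math

def viterbi(log_init, log_trans, log_emit):
    """Most probable state path.
    log_init[i]     : log P(X1 = i)
    log_trans[i][j] : log transition probability from state i to state j
    log_emit[t][i]  : log emission density of observation t under state i
    """
    n_states = len(log_init)
    score = [log_init[i] + log_emit[0][i] for i in range(n_states)]
    back = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j] + log_emit[t][j])
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def logs(values):
    return [math.log(v) for v in values]

# Toy two-state model with sticky transitions (hypothetical numbers).
log_init = logs([0.5, 0.5])
log_trans = [logs([0.9, 0.1]), logs([0.1, 0.9])]
log_emit = [logs(row) for row in
            [[0.8, 0.2], [0.8, 0.2], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8]]]
path = viterbi(log_init, log_trans, log_emit)
print(path)
```

Working with log probabilities avoids numerical underflow, which matters when the sequence runs over an entire chromosome rather than five positions.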

  40. Results on the S. cerevisiae data set. Figure labels: Viterbi paths, genes, transcription start site

  41. Results on the S. cerevisiae data set. State descriptions (state numbers shown on the figure: 2, 5, 1, 8, 8):
      • Initiation state: TFIIB high; Nucl., Spt5, Ser2P, Elf1 low
      • Initiation-elongation transition: Nucl. high, Ser2P low
      • Productive elongation: Elf1, Ser2P high
      • Termination: Pcf11 high
      • Untranscribed regions: all low except Nucl.

  42. Results on the S. cerevisiae data set. Observation: The transition matrix is almost symmetric, due to transcription in forward and reverse direction. Figure: transition matrix and transition graph over the states initiation, initiation-elongation, early elongation, productive elongation, termination, intergenic/untranscribed

  43. Sense vs. antisense transcription. Figure: ChIP-chip tracks (multivariate Gaussian emissions) D1 to D6 with hidden states X1 to X6, shown against the transcript annotation: transcription on Crick strand, transcription on Watson strand, transcription on Crick strand

  44. The bidirectional Hidden Markov Model. States: “Watson” transcription states Ψ1, Ψ2, ..., Ψk; “Crick” transcription states Ψk, ..., Ψ2, Ψ1; intergenic state
      Additional constraint 1: Corresponding Watson and Crick states have identical emission distributions
      Additional constraint 2: Γ12 = P(Xt+1 = Ψ2 | Xt = Ψ1) = P(Xt = Ψ1 | Xt+1 = Ψ2) = Γ21 · P(Xt = Ψ1) / P(Xt = Ψ2)

  45. State transitions reflect biochemical transitions. Mayer et al. (2010): transition from initiation to elongation at +150 bp. Figure labels: 8, 2, 1, 4, 7, 5, 9; standard transcription; 10; untranscribed genes

  46. Different transcription cycles?! Standard transcription (stepwise recruitment) versus highly transcribed genes (immediate recruitment). State numbers on the figure: 8, 8, 2, 2, 1, 1, 4, 3, 7, 5, 9

  47. A grammar of transcription. Gene groups, each followed by the state labels shown beside it:
      • very low synthesis rate; high decay rate; enrichment of stress response genes. States: P, I
      • low synthesis rate; very high decay rate; enrichment of genes involved in epigenetic regulation of gene expression, cell cycle. States: PE, EE2, EE1, E1, P, T
      • medium synthesis rate; medium decay rate; enrichment of genes involved in reproduction. States: PE, EE2, EE1, E2, P, T
      • high synthesis rate; low decay rate; enrichment of genes involved in ribosome biogenesis, rRNA processing. States: PPE, E3, P, T

  48. A grammar of transcription. Gene groups, each followed by the state labels shown beside it:
      • medium synthesis rate; medium decay rate; enrichment of genes involved in G1 phase of cell cycle. States: PE, EE1, EE2, P, T
      • very high synthesis rate; very low decay rate; enrichment of ribosomal protein genes, intron containing genes. States: P, T, PPE

  49. Annotation of bidirectional promoters Viterbi sequence from directional HMM 696 bidirectional promoters ...........-pE-pE-pE-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-T-T-T-T-T-T-T-T-T-T-T-T-T-T-T-T-T-T-T-eE2--eE2-eE2-eE2-eE2-eE2-eE2-eE2-eE2-eE2-eE2-eE2-eE2-eE2-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-.............. Text search (Regular Expression) PE- P+ P- PE+
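
Searching a Viterbi sequence with a regular expression can be sketched as follows. Each strand-specific state is mapped to a single character so that an ordinary regex applies and match offsets equal positions in the path. The one-letter codes and the state order in the pattern (PE-, P-, P+, PE+ from left to right) are assumptions for illustration; the slide lists the four labels but the exact expression is not recoverable from the transcript.

```python
import re

# Hypothetical one-letter codes for the strand-specific states of interest;
# every other state is encoded as "x".
CODES = {"PE-": "a", "P-": "b", "P+": "c", "PE+": "d"}

# Assumed pattern: one or more PE-, then P-, then P+, then PE+ positions.
PATTERN = re.compile(r"a+b+c+d+")

def find_bidirectional_promoters(path):
    """Return half-open (start, end) index ranges of `path` matching the pattern."""
    encoded = "".join(CODES.get(state, "x") for state in path)
    return [(m.start(), m.end()) for m in PATTERN.finditer(encoded)]

# Toy Viterbi path with one divergent promoter flanked by intergenic positions.
path = ["I"] * 3 + ["PE-"] * 2 + ["P-"] * 2 + ["P+"] * 3 + ["PE+"] * 2 + ["I"] * 2
hits = find_bidirectional_promoters(path)
print(hits)
```

Encoding states as single characters also sidesteps the clash between the strand suffix "+" and the regex quantifier "+".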

  50. Annotation of 45 unknown transcripts. Strand-specific transcription data from Xu et al., Nature 2009. Figure labels: stable transcripts on the - strand; cryptic transcripts on the - strand; Viterbi sequence from directional HMM; two new transcripts