Special Topics in Genomics Lecture 1: Introduction

Special Topics in Genomics Lecture 1: Introduction. Instructor: Hongkai Ji Department of Biostatistics Email: hji@jhsph.edu. Outline of today’s lecture. Introduction to genome and genomics Topics and tools Relevance of statistics. DNA.

Presentation Transcript


  1. Special Topics in Genomics, Lecture 1: Introduction. Instructor: Hongkai Ji, Department of Biostatistics. Email: hji@jhsph.edu

  2. Outline of today’s lecture • Introduction to genome and genomics • Topics and tools • Relevance of statistics

  3. DNA
     DNA (deoxyribonucleic acid) molecules store the genetic information of a living organism. DNA consists of two polymers made from four types of nucleotides: adenine (A), guanine (G), cytosine (C), and thymine (T). Purines: A, G; pyrimidines: C, T. The two polymers are complementary to each other and form a double-helix structure:
     5’-ACCGTTCGACGGTAA-3’
        |||||||||||||||
     3’-TGGCAAGCTGCCATT-5’
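The pairing rule on this slide (A with T, C with G) can be checked in a few lines. This is a minimal Python sketch; the function name `complement` is an illustrative choice and not part of the lecture.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the complementary strand, base by base, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand)

top = "ACCGTTCGACGGTAA"    # the 5'->3' strand from the slide
bottom = complement(top)   # read 3'->5', as drawn beneath the top strand
print(bottom)              # TGGCAAGCTGCCATT
```

Applying `complement` twice returns the original strand, which is what "complementary to each other" means here.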

  4. Chromosome

  5. Genome TCAGTTGGAGCTGCTCCCCCACGGCCTCTCCTCACATTCCACGTCCTGTAGCTCTATGACCTCCACCTTTGAGTCCCTCCTCTCACACCTGACATGAAAAGGCACATGAGGATCCTCAAATACCCCGTGATCAGTCTCAGGGTAGCTCTCATAGCCTGGACAGGGCCCCCCTCGGGGGTTGCGCCCAGGTCCAGGCGGGGGATGCACAGCAACAGTCACCGAAGCAGAAGCCGTCACAGTGGTGATGGGCTGGCAGTAGCTGGGCACAGAGCTGCCCATGGCGGTGGACGTTGGGTTCCGAGGGTTGTGAGAACGGGCCCCACGGGGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTCAAAAGCGTCTCTGCGCGGTCTGTAGGGGGGTGGCCGCAAGCCTTCTCTAGGGGGATCCCTTCGAGGCTGCTGGCCTTGCCGTCCAGGGGACAAGGAGCCAGAGTCCAGGTGGGGCTGTTGCCGAGGGGTCAAGGGAGGCTGATGTCTGGAGTCCGGATGGACCACCTGCAGAGGAGAGACATAGGTCAACACAGGGAGGTAGGATGGTGGTGATGTTCCACCCACAAAAGAAAACCTATTCCTTTAGAAACCTCCAGGATGTGAATCCTGCCTGCACCTGCACAGCTGGCTGGAGGCATATAGCCACTGCCCATAGATCTCAACTTACCCTCACAACCAACTGCCCCCAGGCCTAAGTTCTCTGCCTCAAAACTGCCAAGGCCTGGATAGCCAAGAGCCTGGGTGTCTTGGAAATATGCAACCATAAATAGTAGCTTTTAGAAGTATAAGGCTCCTGTTTCTGGGTCATATTAGTGTTGTTTTCACCTGTCCCCAGCCCTAAGCCAGGTGTGGCCAGAAGCAAATGTACTGTAAGAGCAGAGCAAAAACTTCCACACAGATAGTTCTGTTAGGCAATACATCTCTGCCTGACTATTAGGAATCTGGTTTCTGGGTCCTCTGTACAAAGCTCGGAGCAACACAGTGGCCACATCAATCAAAAGGACCGTGACCAACTTCAAAGTCGGTGAGCTTGTACCTATTTTTAGGCTCCTGCTGAACAGAACCAGATTCACACTACAGCTCAGCAGGGCATCGTCACGGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGGGGGGGGGGGGTGGACAGAGGACGGGGACACAATTCACTGGCCAGCCCTTCTCTCCTTCAAGGAAGGCTGCTCTAGCCTGGGACTGGAATACACATTTCCTGTAAACATGGTGGGGGCCTCAGGCAAGCCAGAGTTTTGGAGCCTTCCTTAACTCTTCAAGGTGAGCATCTTGACTTGGAGGGTGGGGGTGCGGGTAAGGAAGGAACCTGTGGACTCCTCCCTACAAGACAGAAAAGGAATAAGCCACGAAGACAATAACGATTTTTGTATCAAGCGTCCTCTCCCATTTCAGCTTACCTGACAATGAAATCAAATTCGGACCCTGCAAGCATCAGTACACCCAGCAGAGTGGACACAGCACCGTCCAGAACGGGAGCAAACATGTGCTCCAGAGCGAGCATAGCCCTGTGGTTCTTGTCCCCAATGGCTGTCAGAAAGGCCTGAACAAAGGAGAAAATTGACACGGTCACATTCTGGGTGTGGTAAAGTGCTCAGCTGTGTCTATACTTGGGTTTTGTAT… Total amount of DNA in the human genome: 3 × 10^9 base pairs (bp)

  6. Gene [Figure: six regions, each labeled "Gene"]

  7. Central Dogma Gene expression

  8. Topic 1: gene expression and microarray [Figure labels: A, B, C; X, Y, Z; Expression / No Expression; Spatially; Temporally]

  9. Microarray [Figure labels: cDNA sample; probe]

  10. Microarray data

  11. Topic 2: transcriptional regulation
      Transcription factors (TF): TF1, TF2
      Transcription factor binding sites (TFBS): CCACCCAC, TAATAAAAT
      [Figure: the sequence below shown three times, with TF1 and TF2 bound at their sites]
      TTATGTAACCTGCACTTACTACCACCCACAACATAATAAAATCTAAACCACTGAATGAAATACAAAATCTATGTATGA...
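The two binding sites named on this slide can be located in the example sequence by exact string matching. This is a minimal Python sketch; `find_sites` is a hypothetical helper name, and real binding sites usually tolerate mismatches, so exact matching is only a first approximation.

```python
def find_sites(sequence, site):
    """Return the 0-based start of every exact (possibly overlapping) match of site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions

# Sequence and binding sites copied from the slide.
SEQ = ("TTATGTAACCTGCACTTACTACCACCCACAACATAATAAAATCTAAACC"
       "ACTGAATGAAATACAAAATCTATGTATGA")
TFBS = ["CCACCCAC", "TAATAAAAT"]

hits = {site: find_sites(SEQ, site) for site in TFBS}
for site, positions in hits.items():
    print(site, positions)
```

Each site occurs once in this sequence, with the first site a few bases upstream of the second.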

  12. Transcription factor binding motif
      Sequences bound by the TF:
      GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
      TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
      CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
      TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG
      AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC
      ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG
      Transcription Factor Binding Sites (TFBS), aligned:
      123456789
      TGGGTGGTC
      TGGGTGGTA
      TGGGAGGTC
      TGGGTGGTG
      TGAGTGGTC
      TGGGTGGTC
      Motif (base counts at each position):
         1  2  3  4  5  6  7  8  9
      A  0  0  1  0  1  0  0  0  1
      C  0  0  0  0  0  0  0  0  4
      G  0  6  5  6  0  6  6  0  1
      T  6  0  0  0  5  0  0  6  0
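The count matrix on this slide follows directly from the six aligned binding sites. This is a minimal Python sketch that tallies the bases column by column; the function names are illustrative, not from the lecture.

```python
# The six aligned binding sites listed on the slide.
SITES = [
    "TGGGTGGTC",
    "TGGGTGGTA",
    "TGGGAGGTC",
    "TGGGTGGTG",
    "TGAGTGGTC",
    "TGGGTGGTC",
]

def count_matrix(sites):
    """Count how often each base occurs at each motif position."""
    width = len(sites[0])
    counts = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for pos, base in enumerate(site):
            counts[base][pos] += 1
    return counts

def frequency_matrix(counts):
    """Turn counts into per-position base frequencies (each column sums to 1)."""
    n_sites = sum(counts[base][0] for base in "ACGT")
    return {base: [c / n_sites for c in counts[base]] for base in "ACGT"}

counts = count_matrix(SITES)
freqs = frequency_matrix(counts)
for base in "ACGT":
    print(base, counts[base])
```

Dividing each count by the number of sites gives the frequency form of the motif, in which position 9 is the only position where three different bases are observed.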

  13. Motifs are regulatory codes in the genome TCAGTTGGAGCTGCTCCCCCACGGCCTCTCCTCACATTCCACGTCCTGTAGCTCTATGACCTCCACCTTTGAGTCCCTCCTCTCACACCACCCATGTTTTGTTTATGAGGATCCTCAAATACCCCGTGATCAGTCTCAGGGTAGCTCTCATAGCCTGGACAGGGCCCCCCTCGGGGGTTGCGCCCAGGTCCAGGCGGGGGATGCACAGCAACAGTCACCGAAGCAGAAGCCGTCACAGTGGTGATGGGCTGGCAGTAGCTGGGCACAGAGCTGCCCATGGCGGTGGACGTTGGGTTCCGAGGGTTGTGAGAACGGGCCCCACGGGGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTCAAAAGCGTCTCTGCGCGGTCTGTAGGGGGGTGGCCGCAAGCCTTCTCTAGGGGGATCCCTTCGTTGCTGCTGGCCTTGCCGTCCAGGGGACAAGGAGCCAGAGTCCAGGTGGGGCTGTTGCCGAGGGGTCAAGGGAGGCTGATGTCTGGAGTCCGGATGGACCACCTGCAGAGGAGAGACATAGGTCAACACAGGGAGGTAGGATGGTGGTGATGTTCCACCCACAAAAGAAAACCTATTCCTTTAGAAACCTCCAGGATGTGAATCCTGCCTGCACCTGCACAGCTGGCTGGAGGCATATAGCCACTGCCCATAGATCTCAACTTACCCTCACAACCAACTGCCCCCAGGCCTAAGTTCTCTGCCTCAAAACTGCCAAGGCCTGGATAGCCAAGAGCCTGGGTGTCTTGGAAATATGCAACCATAAATAGTAGCTTTTAGAAGTATAAGGCTCCTGTTTCTGGGTCATATTAGTTTTGTTTTCACCTGTCCCCACCCATAAGCCAGGTGTGGCCAGAAGCAAATGTACTGTAAGAGCAGAGCAAAAACTTCCACACAGATAGTTCTGTTAGGCAATACATCTCTGCCTGACTATTAGGAATCTGGTTTCTGGGTCCTCTGTACAAAGCTCGGAGCAACACAGTGGCCACATCAATCAAAAGGACCGTGACCAACTTCAAAGTCGGTGAGCTTGTACCTATTTTTAGGCTCCTGCTGAACAGAACCAGATTCACACTACAGCTCAGCAGGGCATCGTCACGGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGGGGGGGGGGGGTGGACAGAGGACGGGGACACAATTCACTGGCCAGCCCTTCTCTCCTTCAAGGAAGGCTGCTCTAGCCTGGGACTGGAATACACATTTCCTGTAAACATGGTGGGGGCCTCAGGCAAGCCAGAGTTTTGGAGCCTTCCTTAACTCTTCAAGGTGAGCATCTTGACTTGGAGGGTGGGGGTGCGGGTAAGGAAGGAACCTGTGGACTCCACCCAACAAGACAGAAAAGGAATAAGCCACGAAGACAATAACGATTTTTGTATCAAGCGTCCTCTCCCATTTCAGCTTACCTGACAATGAAATCAAATTCGGACCCTGCAAGCATCAGTACACCCAGCAGAGTGGACACAGCACCGTCCAGAACGGGAGCAAACATGTGCTCCAGAGCGAGCATAGCCCTGTGGTTCTTGTCCCCAATGGCTGTCAGAAAGGCCTGAACAAAGGAGAAAATTGACACGGTCACATTCTGGGTGTGGTAAAGTGCTCAGCTGTGTCTATACTTGGGTTTTGTAT Transcription Factor Binding Sites (TFBS) Gene

  14. Gene regulatory network [Figure: transcription factors TF1, TF2, and TF3 bind sites in the sequences TACTACCACCCACAACATAATAAAATCTAA and TTAATAAAATACCACCCACAACCTAAGGAT; labels: Activation, Repression, Other Interactions, Gene1, Gene2, Gene3, Other genes; Misregulation, Diseases]

  15. Motif discovery and decoding regulatory programs in the genome
      Genomic language:
      GGCCCTGAGCGGTCCCTATTGCTGGGTGGTCAATGCCCTTCATCTGAAATTTCAAAAGCGTCTCTGCGCGGTCTGTAGGGGGGTGGCCGCAAGCCTTCTCTAGGGGGGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTC
      step 1: Dictionary
      step 2: GGCCCTGAGCGGTCCCTATTGCTGGGTGGTCAATGCCCTTCATCTGGAATTTCAAAAGCGTCTCTGCGCGGTCTGTAGGGGGGTGGCCGCAAGCCTTCTCTAGGGGGGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTC
      Human language:
      guesswhatthestoryisaslongasyouknowthelanguageitshouldbeprettyeasy
      step 1: Dictionary (Know, Guess, Be, …)
      step 2: Guess what the story is. As long as you know the language, it should be pretty easy.
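Step 2 of the human-language analogy, reading unspaced text once the dictionary is known, can be sketched with a short dynamic program. This Python example is an illustration only: the `segment` function and the word list are assumptions made for this sketch, and in motif discovery the harder part is step 1, because the dictionary itself is unknown.

```python
def segment(text, dictionary):
    """Split text into dictionary words; return None if no full split exists."""
    # best[i] holds one segmentation of text[:i], or None if none was found.
    best = [None] * (len(text) + 1)
    best[0] = []
    longest = max(len(word) for word in dictionary)
    for end in range(1, len(text) + 1):
        # Try candidate last words, longest first.
        for start in range(max(0, end - longest), end):
            word = text[start:end]
            if best[start] is not None and word in dictionary:
                best[end] = best[start] + [word]
                break
    return best[len(text)]

# Hypothetical dictionary covering the words of the slide's example sentence.
DICTIONARY = {"guess", "what", "the", "story", "is", "as", "long", "you",
              "know", "language", "it", "should", "be", "pretty", "easy"}
TEXT = "guesswhatthestoryisaslongasyouknowthelanguageitshouldbeprettyeasy"

words = segment(TEXT, DICTIONARY)
print(" ".join(words))
```

With a complete dictionary the parse is straightforward. Remove a word such as "language" and `segment` returns `None`, which mirrors how an incomplete motif dictionary leaves parts of a regulatory sequence undecoded.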

  16. Finding motifs from co-regulated genes (Roth et al., 1998; Hughes et al., 2000; etc.)
      [Figure: expression of Gene 1, Gene 2, Gene 3, …, Gene N under Condition1 and Condition2; sequences of the co-regulated genes:]
      Gene1: GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
      Gene2: CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
      Gene3: TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA

  17. Motif discovery is difficult in mammalian genomes due to a low signal-to-noise ratio [Figure: sequence to search per gene (Gene1, Gene2, Gene3): yeast 100~1000 bp; human 10k~1000k bp]
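A rough calculation shows why a longer search region lowers the signal-to-noise ratio. The Python sketch below assumes independent, equally likely bases and a single strand, which is a simplification and not a model stated in the lecture; under that assumption a specific k-mer is expected to appear by chance about (L - k + 1) / 4^k times in L bp.

```python
def expected_chance_matches(region_length, motif_length):
    """Expected chance occurrences of one specific k-mer in a random sequence.

    Assumes independent, uniformly distributed bases and a single strand.
    """
    windows = max(region_length - motif_length + 1, 0)
    return windows / 4 ** motif_length

# Region sizes taken from the slide's yeast and human ranges; 8 bp motif.
for label, length in [("yeast, 1000 bp", 1_000), ("human, 1000k bp", 1_000_000)]:
    print(label, round(expected_chance_matches(length, 8), 3))
```

A true site is a single occurrence either way, so a region that is a thousand times longer carries roughly a thousand times more chance matches for the same signal.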

  18. Topic 3: ChIP-chip and tiling array. ChIP-chip: Chromatin ImmunoPrecipitation coupled with Microarray. [Figure labels: fragments 500~2000 bp long; IP; No IP]

  19. ChIP-chip on tiling arrays
      Probe: 25~60 bp long, 35~300 bp spacing. Fragments: 500~2000 bp long.
      Intensities for two IP samples and two control (CT) samples, one column per probe:
      IP1   1000    20    32  1120   800    50    12  1700   600    11    20    17    80   780    60
      IP2   1200    30    25  1500   730    45    11  1650   700    15    30    23    90   790    70
      CT1     80    32    30    21    32    35    22    50    30    24    25    33    12    30    10
      CT2     20    25    27    50    29    60    17    45    20    13    15    29    21    45    13
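One simple way to read the intensity table on this slide is to compare the mean IP signal with the mean control signal at each probe. The Python sketch below does that with a log2 ratio. The 8-fold cutoff (log2 ratio above 3) is an arbitrary assumption chosen for illustration, not a method from the lecture, and a real analysis would also account for probe-level variance.

```python
import math

# Probe intensities copied from the slide (15 probes, 2 IP and 2 control samples).
IP1 = [1000, 20, 32, 1120, 800, 50, 12, 1700, 600, 11, 20, 17, 80, 780, 60]
IP2 = [1200, 30, 25, 1500, 730, 45, 11, 1650, 700, 15, 30, 23, 90, 790, 70]
CT1 = [80, 32, 30, 21, 32, 35, 22, 50, 30, 24, 25, 33, 12, 30, 10]
CT2 = [20, 25, 27, 50, 29, 60, 17, 45, 20, 13, 15, 29, 21, 45, 13]

def log2_enrichment(ip_reps, ct_reps):
    """Per-probe log2 ratio of mean IP intensity to mean control intensity."""
    scores = []
    for probe in range(len(ip_reps[0])):
        ip_mean = sum(rep[probe] for rep in ip_reps) / len(ip_reps)
        ct_mean = sum(rep[probe] for rep in ct_reps) / len(ct_reps)
        scores.append(math.log2(ip_mean / ct_mean))
    return scores

scores = log2_enrichment([IP1, IP2], [CT1, CT2])
# Hypothetical cutoff: call a probe enriched if IP exceeds control more than 8-fold.
enriched = [probe + 1 for probe, score in enumerate(scores) if score > 3]
print(enriched)  # 1-based probe indices
```

Under this cutoff, probes 1, 4, 5, 8, 9, and 14 stand out, which matches the columns where both IP samples are in the hundreds or thousands while the controls stay low.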

  20. A combined approach to study gene regulation
      ChIP-chip (500~2000 bp) combined with sequence analysis (6~30 bp):
      GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
      CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
      TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
      TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG
      AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC

  21. Topic 4: alternative splicing and exon array [Figure labels: gene; promoter; transcription start site (TSS); exon, intron, exon, intron, exon; splicing]

  22. Alternative splicing [Figure labels: exon 1, exon 2, exon 3, exon 4, exon 5; Isoform 1, Isoform 2, Isoform 3]

  23. Exon array

  24. Topic 5: single nucleotide polymorphism and SNP array SNPs: • occur every 100 to 1000 bp • make up 90% of genetic variations • minor allele frequency >= 1% (otherwise we call them mutations)

  25. SNP array
      ACCGTGGA[C/T]CTGAACCG
      ||||||||  |  ||||||||
      TGGCACCT[G/A]GACTTGGC

      ACCGTGGA[G]CTGAACCG
      ACCGTGGA[C]CTGAACCG
      ACCGTGGA[T]CTGAACCG
      ACCGTGGA[A]CTGAACCG
      What will happen when the genotype is CC? CT? TT?
      Applications: 1. Genotyping & genome-wide association study 2. Copy number variations and loss of heterozygosity 3. Allele specific expression …

  26. Topic 6: next-generation sequencing Traditional sequencing

  27. Next-generation sequencing: Prepare genomic DNA → Attach DNA to surface → Bridge amplification → Fragments become double stranded → Denature the double-stranded molecules → Complete amplification → Determine first base → Image first base → Determine second base → Image second base → Sequence reads over multiple cycles → Align data. >50 million clusters per flow cell, each with 1000 copies of the same template; 1 billion bases per run; 1% of the cost of the capillary-based method. (From: http://www.illumina.com/downloads/SS_DNAsequencing.pdf)

  28. Array vs. next-generation sequencing

  29. Array vs. next-generation sequencing
      Microarray, Exon array → RNA-seq
      ChIP-chip → ChIP-seq
      SNP array → SNP/mutation detection by sequencing
      … → …

  30. Other topics • Epigenomics • Transposon • miRNA

  31. Relevance of statistics • Genomics → Statistics: need new statistical theories and tools • Statistics → Genomics: guide development of efficient data analysis strategies

  32. Example 1: differential gene expression

  33. Example 1: multiple testing
      Gene          i=1    i=2    i=3    …   i=I
      t-statistic   1.2    6.7    5.1    …   -0.5
      p-value       0.30   0.001  0.002  …   0.56
      Bonferroni adjustment → Rejections …
      • Multiplicity needs to be adjusted in order to determine statistical significance • Bonferroni adjustment too stringent • False discovery rate

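  34. False discovery rate (FDR) (Benjamini & Hochberg, 1995). Let R be the number of rejected hypotheses and V the number of false rejections among them. FDR = E(V/R) = Pr(R>0) E(V/R | R>0), with V/R taken as 0 when R = 0. Family-wise error rate: FWER = Pr(V ≥ 1)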
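The difference between controlling the FWER and controlling the FDR shows up on a small example. The Python sketch below implements the Bonferroni rule and the Benjamini-Hochberg step-up rule; the p-values are hypothetical numbers chosen for illustration and are not data from the lecture.

```python
def bonferroni(pvalues, alpha=0.05):
    """Indices rejected when each p-value is compared with alpha / m (controls FWER)."""
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= alpha / m]

def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure (controls FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_rank = 0
    for rank, i in enumerate(order, start=1):
        # Find the largest rank k with p_(k) <= k * alpha / m.
        if pvalues[i] <= rank * alpha / m:
            largest_rank = rank
    return sorted(order[:largest_rank])

# Hypothetical p-values for eight genes.
PVALS = [0.001, 0.008, 0.012, 0.02, 0.03, 0.3, 0.56, 0.9]
print("Bonferroni:", bonferroni(PVALS))
print("Benjamini-Hochberg:", benjamini_hochberg(PVALS))
```

On these numbers Bonferroni rejects only the smallest p-value, while the Benjamini-Hochberg rule rejects the five smallest, which illustrates why Bonferroni is described as too stringent when many genes are tested.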

  35. Pooling information. Multiplicity causes problems in controlling type I errors, but it can also be used to improve statistical power! [Figure: tests 1, 2, 3, …, I each have a sample variance (df); the variance estimates are treated as coming from a common distribution, which yields modified t-statistics]
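One widely used way to pool information across tests is to shrink each gene's sample variance toward a common value, weighting by degrees of freedom, and to use the shrunken variance in the t-statistic. The Python sketch below uses that weighted-average form. The slide does not say which model the lecture uses, so the formula choice, the prior variance, and the prior degrees of freedom are all assumptions set by hand here; in practice they would be estimated from all genes.

```python
import math

def shrink_variance(s2, df, prior_s2, prior_df):
    """Weighted average of a gene's sample variance and a common prior variance."""
    return (prior_df * prior_s2 + df * s2) / (prior_df + df)

def t_statistic(mean_diff, variance, n1, n2):
    """Two-sample t-statistic for a given variance estimate."""
    return mean_diff / math.sqrt(variance * (1 / n1 + 1 / n2))

# Hypothetical gene: tiny sample variance from only 3 + 3 replicates (df = 4).
s2, df, n1, n2, mean_diff = 0.01, 4, 3, 3, 0.5
PRIOR_S2, PRIOR_DF = 1.0, 4  # assumed values for illustration

s2_shrunk = shrink_variance(s2, df, PRIOR_S2, PRIOR_DF)
ordinary_t = t_statistic(mean_diff, s2, n1, n2)
modified_t = t_statistic(mean_diff, s2_shrunk, n1, n2)
print(round(ordinary_t, 2), round(modified_t, 2))
```

A gene whose variance happens to be underestimated gets an inflated ordinary t-statistic; shrinking the variance toward the common value tempers that, which is one way borrowing strength across many tests can help.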

  36. Example 2: motif discovery
      Background: θ0
           A    C    G    T
      A   .3   .2   .2   .3
      C   .2   .3   .3   .2
      G   .2   .3   .3   .2
      T   .3   .2   .2   .3
      Motif: Θ
            1     2     3     4     5     6     7     8     9
      A  0.00  0.00  0.17  0.00  0.17  0.00  0.00  0.00  0.17
      C  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.66
      G  0.00  1.00  0.83  1.00  0.00  1.00  1.00  0.00  0.17
      T  1.00  0.00  0.00  0.00  0.83  0.00  0.00  1.00  0.00
      S: GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA
      A: 000000000000001000000000000000000000000001000000000000000000000000000000
      f (A, Θ | S)
      Inference by iterative estimation/sampling (Gibbs sampler) → A
      Marginalization: f (A | S) = ∫ f (A, Θ | S) dΘ
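Given the motif matrix Θ, the site indicator A can be illustrated by scoring every 9 bp window of S. The Python sketch below shows only that scoring step, not the Gibbs sampler itself, which alternates between updating A and Θ. It assumes a uniform background of 0.25 per base instead of the slide's background matrix, and it adds a small pseudocount so that zero entries do not give an infinite penalty; both choices are simplifications made for this example.

```python
import math

# Motif frequencies copied from the slide (rows A, C, G, T; columns 1..9).
THETA = {
    "A": [0.00, 0.00, 0.17, 0.00, 0.17, 0.00, 0.00, 0.00, 0.17],
    "C": [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.66],
    "G": [0.00, 1.00, 0.83, 1.00, 0.00, 1.00, 1.00, 0.00, 0.17],
    "T": [1.00, 0.00, 0.00, 0.00, 0.83, 0.00, 0.00, 1.00, 0.00],
}
S = ("GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA"
     "CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA")
WIDTH = 9
PSEUDOCOUNT = 0.01   # assumption: avoids log(0) for unseen bases
BACKGROUND = 0.25    # assumption: uniform background

def log_odds(window):
    """Log2 likelihood ratio of a window under the motif vs. the background."""
    score = 0.0
    for pos, base in enumerate(window):
        column_total = sum(THETA[b][pos] + PSEUDOCOUNT for b in "ACGT")
        prob = (THETA[base][pos] + PSEUDOCOUNT) / column_total
        score += math.log2(prob / BACKGROUND)
    return score

def top_sites(sequence, n_sites):
    """0-based starts of the n_sites best-scoring windows, in sequence order."""
    scored = [(log_odds(sequence[i:i + WIDTH]), i)
              for i in range(len(sequence) - WIDTH + 1)]
    scored.sort(reverse=True)
    return sorted(start for _, start in scored[:n_sites])

starts = top_sites(S, 2)
indicator = "".join("1" if i in starts else "0" for i in range(len(S)))
print(starts)
print(indicator)
```

The two best-scoring windows fall where the slide's A vector has its 1s. In the full problem Θ is unknown as well, which is why the slide infers A and Θ jointly by iterative sampling.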
