
Dark matters in the genomes

Shin-Han Shiu, Plant Biology / Genetics / EEBB / QBMI


Presentation Transcript


  1. Dark matters in the genomes Shin-Han Shiu Plant Biology / Genetics / EEBB / QBMI

  2. About myself

  3. About myself

  4. About myself

  5. About myself

  6. Cell, nucleus, and chromosomes

  7. DNA A G G C G T A G A G A G A T C C T T G A T T C C G C A A C T C T C A A G G A A C A A

  8. DNA and Genome • A genome is all the DNA in a cell, made up of A, T, G, and C. • How many A's, T's, G's, and C's are there in the human genome? 3,200,000,000 letters. • A sizable book, say The Lord of the Rings: The Fellowship of the Ring: 764,470 characters in 410 pages, ~2,000 characters per page. • The book of our life: 1,600,000 pages, or 4,186 copies of The Fellowship of the Ring.
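
The book comparison on this slide is simple arithmetic, and a short sketch makes the rounding visible. The variable names are illustrative; the figures are the ones quoted on the slide.

```python
# Figures quoted on the slide.
GENOME_LETTERS = 3_200_000_000   # bases in the human genome
BOOK_CHARACTERS = 764_470        # characters in the book
BOOK_PAGES = 410                 # pages in the book

# Exact density of the book, and the rounded value the slide works with.
chars_per_page_exact = BOOK_CHARACTERS / BOOK_PAGES
CHARS_PER_PAGE = 2_000

genome_pages = GENOME_LETTERS / CHARS_PER_PAGE    # pages needed for the genome
genome_books = GENOME_LETTERS / BOOK_CHARACTERS   # equivalent number of books

print(f"Characters per page (exact): {chars_per_page_exact:.0f}")
print(f"Pages for the genome:        {genome_pages:,.0f}")
print(f"Copies of the book:          {genome_books:,.0f}")
```

The page count uses the rounded 2,000 characters per page, while the number of copies uses the exact character count of the book.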

  9. Genome sequencing chronology

  10. Genome sequencing chronology

  11. Now, 1366 genomes are sequenced or being sequenced

  12. Between human and other animals • How much do the human and chimp genomes differ? • 0.1% • 1% • 10% • 50% • 90% • How many genes do you think we share with the worm? • 1% • 10% • 50% • 75% • 99%

  13. Genome and better food

  14. Basic understanding of science & environment

  15. Our research interest TTGGCTATCCTTTATATTTTAAGGGTTATTAGGATATTTTTTATTATGACTACATGGGATAAATGTTTAAAAAAAATAAAAAAAAACCTTTCTACGTTTGAGTATAAGACGTGGATAAAGCCTATCCATGTGGAGCAAAA TAGTAACTTATTCACAGTTTACTGTAACAATGAATATTTCAAAAAACATATAAAATCTAAGTATGGAAATCTTATTTTATCAACAATCCAAGAGTGTCATGGTAATGATTTAATTATTGAATATTCTAATAAAAAATTCT CTGGCGAAAAAATTACTGAGGTTATCACAGCTGGACCACAAGCTAATTTTTTTAGCACAACAAGTGTTGAGATAAAAGATGAATCAGAAGATACAAAAGTAGTACAAGAACCTAAAATATCAAAGAAGTCTAATAGTAAA GACTTTTCTTCATCACAAGAGTTATTCGGTTTTGACGAAGCTATGCTAATTACAGCAAAAGAAGATGAGGAATACTCTTTTGGTTTACCGTTAAAAGAAAAATATGTTTTTGATAGTTTTGTTGTTGGAGATGCTAACAA AATTGCTAGAGCAGCGGCTATGCAGGTATCGATAAATCCAGGTAAATTACATAACCCTTTATTCATTTATGGTGGTAGTGGTTTAGGTAAAACTCACTTAATGCAAGCAATAGGTAATCATGCAAGAGAAGTTAATCCTA ATGCCAAAATTATTTATACAAATTCAGAACAATTTATTAAAGATTATGTAAATTCTATTCGTTTACAAGATCAAGATGAGTTTCAAAGAGTTTATAGATCTGCGGATATACTTTTGATTGATGATATTCAATTTATCGCT GGTAAAGAGGGTACTGCTCAGGAGTTTTTCCATACTTTTAATGCATTGTATGAAAATGGTAAACAGATAATTCTAACTAGTGATAAGTATCCAAATGAAATAGAAGGGCTTGAAGAAAGACTAGTTTCGCGTTTTGGTTA TGGTTTAACAGTTTCTGTTGATATGCCAGATTTAGAAACCAGAATTGCTATCTTGCTCAAAAAAGCTCATGATTTAGGTCAGAAATTACCTAACGAAACAGCAGCTTTTATTGCTGAGAATGTACGTACTAATGTCAGAG AACTAGAAGGTGCTCTAAATAGGGTTCTTACTACCTCTAAATTTAATCATAAAGATCCTACTATCGAAGTAGCACAAGCTTGCTTAAGAGATGTTATAAAAATACAAGAAAAGAAAGTAAAAATAGATAATATCCAAAAG GTTGTTGCTGATTTTTATAGAATCAGGGTAAAAGATTTAACTTCTAATCAAAGAAGTAGAAATATAGCTAGACCAAGACAGATAGCAATGAGTTTAGCACGTGAACTAACATCACATAGTTTGCCAGAAATAGGCAATGC TTTTGGTGGTAGAGACCATACGACAGTTATGCATGCTGTCAAAGCTATAACTAAATTAAGACAAAGCAATACTTCAATATCGGATGATTATGAGTTGCTTTTAAATAAAATTTCTCGTTAAATAAAATTAGTAACTTTAT CAAAGGGGTTTTAAAAAATGAATTTTGTACTAAATAGAGATGACTTACTAAAGCCTTTGCAATCTATGCTCTCAGTTGCAAATAGTAAGAGTACAATGCCTTTATTATCATGTATCTTATTTGATATTGATAATAATAAT CTCAAAATTACGGCTTCGGATCTTGATACAGAGATATCATGCAATATAGCAGTTAGTTGTAACACAACTATTAAGTTAGCATTAAATGCTGACAAAATTTATAACATTGTCAGAAGCTTAAATGAAAATTCAATGATTGA 
TTTTAGAATTAATGAAAATAAGGTAACTATTGTTTCTAATAATAGTACTTTTAACCTTATATCACTAAATGCTGACAACTATCCTCTTATTGATAGTAATATCAATGAGCAAGCAAGTTTTGATCTTTCTCAACAAGATT TTCATCATATTATTTCAAAAGTAGATTTCTCAATGGCTAATGATGATACTCGATATTTCTTAAATGGGATGTTTTGGGAAATCAACGCAAATCTACTAAGAGCAGTATCTACAGATGGTCATAGAATGTCTATCACAGAG GCTATAATTGATAGTAAAGTGTTAGATAGTGCTTCTCAGTCGATAATTCCAAAAAAAGCGATTTTAGAGCTTAAAAAGATAGTTGGCAAAACAGAAGAAAATATCAAAATTTGTCTTGGCAAAAATTATCTAAAAGCGAT TTTTGGTAATTATGCTTTTATATCAAAGCTTATAGATGGTCGCTATCCTGATTACCAAAAAGTAATCCCTAAAAATAATACAAAACTATTAGCAGTTGATAAGCAGTTTTTCAAAAATTCATTATTAAGAACATCAATAC TTGCTAATGATAAATATAAAGGTGTTCGTCTTAACATATCTCAAAATCAATTACTTCTATCAGCTAATAACCCTGATAATGAAAAGGCTGAAGATAAAATCGAAGTTCAATATAATGATCAACCAATGGAAATTTGTTTT AATTACAAATATCTTTTGGATATTATAAATGTACTTAGTGAAGAAACTATGTCTATCTACCTTGATAATCCAAATATGAGTGCTTTAGTTAAAGATGAGAAAGATAATAGTTTGTTTATTATTATGCCAATGAAAATTTA AGTAATAAGTAGTTTTAGGAAATAACTATTTTTATAAGCCTTTTGGAATGAATAATAAAGCAATAAAAAAAGGTATGCATAAAAACATTATATAGAAAGCTGGGATTAGATAATTTCCAGTAGTAGTAATTAATAAAGTC ATAAGAAAGGCAACAGTACCTCCAAATAAAGAAACGCTTATATTAAAACTTATTCCAAAACCAGTATTTCTATTTCTAACAGGAAATAGCCCTGCCGTATTTGCAAATATAGGTCCTATTACAGCACCACTTATAATAGC AAGAGAAAAAATAGCTATAGATACTAATTGATGATTTTTTATAATAATTTGGTATATTGGTAAAACAGCTATAAATAAGACTATACAAGAATACATCAGAACTTTTTTACCACCAATTCTATCAGCAATATATCCAAATA TAATTGAAGAAAGCATTAATACTATAGTTAATCCGAGAGTATTTTGTTAAACACATAAGAGATAGCATCACAAAATTTTTCAAAACTATTATTCACTTTTCTAAATATTTTTTTAAAGTTAGCCCAAACCTTTTCAATAG GATTTAAATCTGGAGAATACGGAGGTAGATATAATATTTGTACATCAAATTTATTGGCTATTTCAATCAGCTTAGAGGATTTATGGAAACTAGCATTATCCATTACTATAGTAGTTTTAGGTTTTAATGATGGGCATAAG TGTTCCTCAAACCATTGATTAAAAATTTCAGTATTGGTATATCCACTGTACTCTAATGGAGCTATAATCTTTTTATCTGCATAATTATATCCAGCAACAATACTTCTTCTTTGTGTTTGATATGCTAAAACCTCACCATA ACTAGGCTCACCAATTAGTGACCATCCTCTTAGGATAGAAAGCTTATTGTCACACCCCATCTCATCTATATAAAATAACAAGTTTTGAGCTATTTCTTTTAGTTTTTCTATATACTCCAACCTTTCATGTTCTTTTCTTT GCTTATATTTTGGAGTCTTTTTTTAAAACTAAAACCAAGTCTATTAAGACAATCATAAAATGTACTTCTTGGAATATCAGGGGCTAATGCTTCTTTTATATCTAATGCACTTGCATCTGGATGATCTATCAAATACTGTT 
CAATCAATGTTTTATCGGTAAAGCTAGCGACTCTGCCACAACCAACTCCTTGCTTTGAACTATAATCTCCGGTTCTTTTATAAAACTCTATCCATGAAACAACTGTACGCTTATCTATGTTAAAAAACTTACTCAGCTCG AACTCCGTCATACCTTCTTCATATTTATTAATTACGATGTCTCTAAAATATTGGCTATATGATGGCATTTTTATTAGACATTATAACATTTCTACAAATATCTTTTTCTACAAATATCTTTCGGATTAACTATATAAGTA GAGTCAACAACCATCCAAATCACCCAATTATCTATAATTTTCTGCTTGCTAAAAAAACGCATACCAATGATGCTACACTTGTAAAACCATCCATATATGGCGTTGTTGAATCAGTATAAAATATAAGTAGTTGCGAAACT AGTAACCCAAATACTACAATGCTTACTAGAACTTTTAACCAACCAATGATTTTAAGTCTATGAACAACTATCTTTTTGTGACTAAAGTTGGGTTGCCAACTATACCAACCGTATCCAAAGCTAAATAAAAGAATCATCTG CAATATAGCATCGGCATATAGTCCACTAACAGAAAATAAACCCGCACTCATGATCAAACCAACTATCTCCACAGGCCAACCAATGACATAAAGCCTTGCCAGCAAAAAGGTACACAAAAGATTAACAATCATTGTACAAA AATCAAAAATATGCAGCATATTTATTTTACTAATCAAAGTATTATAAATATTATAATAACTTTGAAGTTGGCGTATTAAAGCCATAAACTTTAGTAGGTTAGTGTTTATACCAATATTTTGAGATGCTTTCTGCAAGCTA ATAACATTTAGCTATCTAGCCTAAATAATTAATATACAAAACTTTCAAGCTTATTGAATTTTTCAACAGATACAGCGCGTTATAACAAATAAGTAATTGACTAAATTAAAAAGCAAGTATAATATCGATTGTGTTTATTA CATAATATAAAACGAGGATAAAAAAAATATGAAATTAAGAAAAGTATTAATCGCGACATTATTAGGAGCTTCTGCTTTATCTTTAAGTAGTTGTTGGTTACTTGTTGGTGCAGCTGTTGGTGGTGGAACTGCTGCGTATA TTTCTGGTGAGTATTCAATGAATATGAGTGGCAGTGTAAAAGATATTTACAATGCTACTTTAAAAGCTGTTCAAAGCAATGATGATTTTGTAATTACTAAAAAATCTATTACTTCTGTTGATGCAGTTGTTGATGGTAGT ACTAAGGTAGACTCAACAAGTTTCTATGTTAAAATAGAAAAACTTACTGATAATGCTTCAAAAGTTACAATTAAGTTTGGTACTTTTGGTGACCAAGCAATGTCAGCAACATTAATGGATCAAATCCAAAAGAATCTTTA ATTAAATAGGTAATTACTATAATGACTTTTCTAAAGAAAGCTTTTATTGCAACTATAGTTTCTATTTCAGCATTAGTTCTAAATAGTTGTATTGTTGCAGCAATAGCTGTTGGTGGTGGAACAGTTGCCTATATTGATGG AAATTATTTTATGAATATAGAAGGCAACTATAAAGCTGTCTATAAAGCTACTCTTAAAGCTATTAATGATAATAATGACTTTGTTCTAGTATCAAAAGATCTTGATCAAACAAAGCAAAATGCCGACATTGAAGGTGCTA CTAAAATTGATAGTACGAGTTTTAGTGTCAAAATTGAAAGACTGACAGATCAGGCTACTAAAGTGACAATCAAATTTGGTACTTTTGGCGATCAAGCAATGTCATCAACATTAATGGATCAGATCCAGGCAGCTGTACAT AAAGCTTCTTAGAAATGTACAAAAAACTCTACTTAATTATATTATCCACAATAATCGCAATCTCTCTTAATAGTTGTGTTGTTGCCGCTGTTTTAATTGGTACAGCAGTTGTTGCTGGAGGTACAGTATATTACATCAAT 
GGTAACTATATAATCGAAGTCCCTAAAGATATTAGAAGTGTATACAATGCTACAATCAAGACTATACAGATGGATAGTCAAAATAAACTAATAAGTCAAACCTATAATACTAAATCTGCTATAATTAAAGCTTTACAAAA AGGTGAAAAAATTAGTATAGATTTAAGCAATATTGATAGTCGTTCAACAGAGATAAAAATTCGTATAGGTGTACTTGGCGATGAGAAAAAATCTGCTGATTTAGCAAACTCAATAACAAAAAATATCACCTAAGCAATAT TTCTCGAACTTTGGTTAACTTTTTCTTTTTAAAAACTTTCAAAAATGTATAATTTGTGTTAGTTTGCAAACTACCCTTATATCCATAATGAGTAATAAGGTATTAGATACATATTATAAAAACAATCGACATATTTGGGT GCTAGTACTATCTGGTGCTGTTATAGGCACAATGATTGGTCTTCTAGCAACAGCATTTCAGCTACTCCTAGACTTTATTTTTAAAATTAAGCTGGCTCTTTTTTCTTTCAGTGGTGGTAATCTTTTTATCGAAATCGCTA TGTCAATATCATTAAGTATTGTGATGGTATTAATTTCGATTTTTATTGTTAAAAAATTTGCGAAAGAGGCTGGTGGTAGCGGTATCCAAGAGGTTGAGGGTGCTTTAAAAGGCTGCCGCAAAATACGTAAAAGAGTTATG CCCGTGAAGTTTATAAGTGGACTTTTTTCGTTAGGCTCAGGTTTAAGTTTAGGTAAAGAGGGACCATCAATTCATATGGCTGCTGCATTAGCGCAGTTTTTTGTTGATAAATTTAAACTTACTACAAAATATGCTAATGC GGTTATCTCTGCTGGGGCTGGAGCTGGACTAGCAGCTGCTTTTAATACCCCACTTTCTGGGATTATCTTTGTTATTGAAGAGATGAATAGAAAGTTTAGATTTAGTGTTTCGGCAATAAAGTGTGTGCTAGTAGCATGTA TCATGAGTACAGTTATCTCTAGAGCTATTATGGGTAATCCTCCAGCAATACGCGTAGAAACTTTCAGCTCAGTACCACAAAATACTCTTTGGTTATTTATGGTATTAGGGATTATATTTGGTTATTTTGGTTTACTATTT AACAAATCCTTAATCAAAGTGGCAAACTTTTTCTCAGAAGGCTCCAAGAAGAGGTATTGGACTTTAGTTATAATTGTTTGCATAATTTTTGGTATTGGTGTTGTTCTATCTCCAAATGCTGTTGGCGGTGGCTATATTGT CATAGCAAATACTCTTGATTATAACTTATCAATCAAGATGCTTTTAGTGCTTTTTGTACTTCGTTTTGCTGGAGTTATTTTCTCATATGGCACCGGCGTTACTGGTGGGATATTCGCACCAATGATTGCGCTTGGTACTG

  16. Evolution of genome sizes • C-value: 1 pg ≈ 0.98 Gb (equivalently, 1 Gb ≈ 1.02 pg) • Thale cress (Arabidopsis thaliana): 0.16 pg • Fruit fly (Drosophila melanogaster): 0.18 pg • Pufferfish (Takifugu rubripes): 0.4 pg • Human (Homo sapiens): 3.5 pg • Onion (Allium cepa): 16.75 pg • Tiger salamander (Ambystoma tigrinum): 32 pg • Marbled lungfish (Protopterus aethiopicus): 132 pg http://www.rbgkew.org.uk/
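
To compare these C-values in base pairs, the sketch below converts picograms to gigabases. It uses the commonly cited factor of about 978 Mb per picogram, which is an assumption here and is exposed as a parameter. The function name `pg_to_gb` is illustrative.

```python
# Commonly cited conversion: 1 pg of DNA is roughly 978 Mb (assumption; adjustable).
MB_PER_PG = 978.0

# C-values (pg) as listed on the slide.
C_VALUES_PG = {
    "Arabidopsis thaliana": 0.16,
    "Drosophila melanogaster": 0.18,
    "Takifugu rubripes": 0.4,
    "Homo sapiens": 3.5,
    "Allium cepa": 16.75,
    "Ambystoma tigrinum": 32.0,
    "Protopterus aethiopicus": 132.0,
}

def pg_to_gb(picograms, mb_per_pg=MB_PER_PG):
    """Convert a DNA mass in picograms to an approximate size in gigabases."""
    return picograms * mb_per_pg / 1000.0

# Print species from smallest to largest genome.
for species, pg in sorted(C_VALUES_PG.items(), key=lambda item: item[1]):
    print(f"{species:26s} {pg:7.2f} pg  ~{pg_to_gb(pg):8.2f} Gb")
```

On these numbers the lungfish genome is roughly 38 times the size of the human genome, which is the point of the slide: genome size does not track organismal complexity.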

  17. Genic region and genome size Dan Graur

  18. What's in the genome • Annotated genes (exon, UTR, intron) • Cis-regulatory elements • Dead genes (pseudogenes) • Novel genes • Selfish elements

  19. "Non-genic": repetitive elements • E.g. human genome (answer choices A, B, C) • Exons take up? A: 1%, B: 24%, C: 25% • Introns account for? A: 24%, B: 1%, C: 25% • Repetitive elements occupy? A: 35%, B: 60%, C: 45% • Unknown? A: 40%, B: 15%, C: 5% Venter et al. (2001) Science 291:1304

  20. What is in the unknown regions? • Investigate with tiling arrays (figure: cDNA array vs. tiling array) • Probe size: 25 bp; gap size: 10 bp • Number of features: • Arabidopsis, 135 Mb, 1 chip, ~6×10^6 features • Human, 3 Gb, 7 chips, ~4.2×10^7 features
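
One way to picture the tiling design (25 bp probes separated by 10 bp gaps) is the sketch below. The function name `tile_probes` and the single-strand layout without repeat masking are assumptions for illustration. Real array designs also filter repetitive or poorly performing probes, so their feature counts will differ from a naive count.

```python
def tile_probes(sequence, probe_size=25, gap_size=10):
    """Return (start, probe) pairs for probes laid end to end with a fixed gap."""
    step = probe_size + gap_size  # distance between consecutive probe starts
    probes = []
    # Stop once a full-length probe no longer fits in the sequence.
    for start in range(0, len(sequence) - probe_size + 1, step):
        probes.append((start, sequence[start:start + probe_size]))
    return probes

# Toy example: a 100 bp sequence yields probes starting at 0, 35 and 70.
toy_sequence = "ACGT" * 25
for start, probe in tile_probes(toy_sequence):
    print(start, probe)
```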

  21. "Non-genic": unannotated genes • Tiling array analysis of human Chr 21, 22 Kapranov et al., 2002. Science

  22. Tiling array analysis of human transcriptome • Human Chr 21, 22 • What do you think these expressed regions represent? Kapranov et al., 2002. Science

  23. Difficulties for coding gene prediction • Training data • You need to know something... • “Biased” toward the properties of the majority. • Real genes that are shorter tend to be much harder to predict.
      Table 3. Accuracy of GISMO, Glimmer and CRITICA in predicting short genes (<300 bp)
      Gene finder   Cor    Sn     Snfk (%)   Sp
      GISMO         0.64   63.0   86.4       69.0
      Glimmer       0.54   72.0   83.7       44.0
      CRITICA       0.60   46.0   67.4       84.0
      Snfk denotes the sensitivity in detecting function-known genes. Krause et al., 2006. Nucleic Acids Res. 35:540

  24. Novel coding sequence identification • Arabidopsis thaliana as an example • 135Mb, ~50% occupied by annotated genes. • Focus on coding sequences 90-300bp long. • What would you do next to eliminate ORFs that are likely false predictions? 133,090 sORFs
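
The starting set of short ORFs can be pictured with a minimal six-frame scan like the one below. This is a sketch under stated assumptions, not the pipeline used for the slide: the names `find_sorfs` and `reverse_complement` are illustrative, an ORF is taken to run from the first ATG after a stop to the next in-frame stop codon, and its length includes the stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_sorfs(genome, min_len=90, max_len=300):
    """Scan both strands in three frames for ATG...stop ORFs within a length range.

    Returns (strand, start, end, sequence) tuples. Coordinates on the "-" strand
    refer to the reverse-complemented sequence.
    """
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            orf_start = None
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if orf_start is None and codon == "ATG":
                    orf_start = pos               # open a new ORF
                elif orf_start is not None and codon in STOP_CODONS:
                    length = pos + 3 - orf_start  # length includes the stop codon
                    if min_len <= length <= max_len:
                        hits.append((strand, orf_start, pos + 3, seq[orf_start:pos + 3]))
                    orf_start = None              # close the ORF
    return hits

# Toy example: one 96 bp ORF embedded in short flanking sequence.
toy_orf = "ATG" + "GCT" * 30 + "TAA"
print(find_sorfs("CC" + toy_orf + "CC"))
```

A scan like this is deliberately permissive, which is why the following slides apply further criteria to remove likely false predictions.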

  25. Criterion 1: Codon usage bias • Some codons are used more frequently than others http://www.cbs.dtu.dk/services/GenomeAtlas/

  26. Criterion 1: Codon usage bias • For example: codons for proline • Suppose the following two sequences both code for poly-proline. Which one is more likely to be a real coding sequence? Seq1 CCT CCA CCT Seq2 CCC CCG CCC
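
The question on this slide can be made concrete by scoring each sequence against a codon usage table. The usage values below are hypothetical numbers chosen only to illustrate a genome that prefers CCT and CCA; they are not measured frequencies. The function names are illustrative as well.

```python
import math

# HYPOTHETICAL relative usage of the four proline codons (sums to 1.0).
PROLINE_CODON_USAGE = {"CCT": 0.38, "CCA": 0.33, "CCG": 0.18, "CCC": 0.11}

def split_codons(seq):
    """Split a DNA string (spaces allowed) into complete codons."""
    seq = seq.replace(" ", "").upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def log_usage_score(seq, usage):
    """Sum of log usage frequencies; higher means more typical codon choice."""
    return sum(math.log(usage[codon]) for codon in split_codons(seq))

seq1 = "CCT CCA CCT"
seq2 = "CCC CCG CCC"
print("Seq1 score:", round(log_usage_score(seq1, PROLINE_CODON_USAGE), 3))
print("Seq2 score:", round(log_usage_score(seq2, PROLINE_CODON_USAGE), 3))
```

Both sequences encode the same peptide, so any difference in score comes from codon choice alone. Under the assumed table, Seq1 uses the preferred codons and scores higher.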

  27. Novel CDS identification • Posterior probability calculation • Bayes' theorem
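
For reference, Bayes' theorem applied to this two-class problem can be written as follows. The symbols are assumptions introduced here, not taken from the slide: S is the observed sequence, C is the hypothesis that it is coding, and N is the hypothesis that it is non-coding.

```latex
P(C \mid S) = \frac{P(S \mid C)\, P(C)}{P(S \mid C)\, P(C) + P(S \mid N)\, P(N)},
\qquad P(C) + P(N) = 1
```

The likelihoods P(S | C) and P(S | N) come from models trained on known coding and non-coding sequences, and the priors P(C) and P(N) express how common each class is expected to be before looking at the sequence.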

  28. Novel CDS identification • Determine base composition probabilities • Feature tables: coding sequences → CDS parameters (c1, c2, c3, c4, c5, c6); non-coding sequences → NCDS parameters (n)
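
A much simplified version of this idea is sketched below. It is not the model from the cited work. As an assumption for illustration, coding sequences are described by base frequencies at each of the three codon positions, non-coding sequences by a single base composition, and the prior is 0.5. All function names are hypothetical, and sequences are expected to contain only A, C, G, and T.

```python
import math
from collections import Counter

BASES = "ACGT"

def _normalise(counts, pseudocount):
    """Turn base counts into probabilities, with a pseudocount for unseen bases."""
    total = sum(counts[base] for base in BASES) + 4 * pseudocount
    return {base: (counts[base] + pseudocount) / total for base in BASES}

def train_coding(sequences, pseudocount=1.0):
    """Base probabilities at codon positions 1, 2, 3 from in-frame coding sequences."""
    counts = [Counter(), Counter(), Counter()]
    for seq in sequences:
        for index, base in enumerate(seq):
            counts[index % 3][base] += 1
    return [_normalise(c, pseudocount) for c in counts]

def train_noncoding(sequences, pseudocount=1.0):
    """Position-independent base probabilities from non-coding sequences."""
    return _normalise(Counter("".join(sequences)), pseudocount)

def posterior_coding(seq, cds_params, ncds_params, prior_coding=0.5):
    """Posterior probability that seq is coding, computed in log space."""
    log_coding = math.log(prior_coding)
    log_noncoding = math.log(1.0 - prior_coding)
    for index, base in enumerate(seq):
        log_coding += math.log(cds_params[index % 3][base])
        log_noncoding += math.log(ncds_params[base])
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_coding, log_noncoding)
    coding = math.exp(log_coding - top)
    noncoding = math.exp(log_noncoding - top)
    return coding / (coding + noncoding)

# Toy training data (illustrative only).
cds = train_coding(["GCAGCAGCAGCA"])
ncds = train_noncoding(["ATATATTTAAAT"])
print(round(posterior_coding("GCAGCA", cds, ncds), 4))
print(round(posterior_coding("ATATTA", cds, ncds), 4))
```

Working in log space matters for real sequences, because multiplying hundreds of small probabilities underflows ordinary floating-point numbers.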

  29. Posterior probability of coding sequence • Compare known non-coding and coding sequences Hanada et al., 2007. Genome Res.

  30. Posterior probability of coding sequence • Scanning Arabidopsis genome Hanada et al., 2007. Genome Res.

  31. After applying the first criterion: 7,442 coding sORFs

  32. How good is the CDS finding measure • For the training data • For 18 Arabidopsis small protein genes • All 18 are predicted as CDS. • For 84 yeast small protein genes • All 84 are predicted as CDS.

  33. So what does this mean? • If a sequence is a true coding sequence, our approach can predict it with high accuracy. • So, the sensitivity is very good. • Is this good enough? • What about specificity? • Namely, how good are the criteria at excluding false positives?
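
The distinction between sensitivity and specificity can be stated in a few lines. The sketch uses the standard confusion-matrix definitions. The sensitivity inputs are the counts from the previous slide (18 of 18 and 84 of 84 known small protein genes recovered), while the specificity inputs are hypothetical, since the slides give no counts for true negatives or false positives.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real coding sequences that are predicted as coding."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Fraction of non-coding sequences that are correctly rejected."""
    return true_negatives / (true_negatives + false_positives)

# Known small protein genes: all were predicted as CDS, so nothing was missed.
print("Arabidopsis sensitivity:", sensitivity(18, 0))
print("Yeast sensitivity:      ", sensitivity(84, 0))

# HYPOTHETICAL counts, only to show that high sensitivity says nothing
# about how many non-coding sequences slip through.
print("Specificity (made-up counts):", specificity(true_negatives=90, false_positives=10))
```

A predictor that calls every sequence coding would also score a sensitivity of 1.0, which is why the slide goes on to ask about false positives.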

  34. Criterion 2: Expression • What expression level would you expect for true CDS compared to false CDS? • Tiling array (probe size: 25 bp; gap size: 10 bp) • Plot: frequency vs. expression level

  35. Comparison of expression levels • Exon, intron, tRNA, rRNA, our predictions • A: Exon B: Intron C: Predicted novel CDS D: tRNA E: rRNA

  36. Applying the second criterion • Predictions are significantly enriched in expressed sequences: 2,996 transcribed sORFs
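
The effect of the successive filters is easiest to see as retention fractions. The sketch below uses the three counts quoted on the slides (133,090 sORFs, 7,442 coding sORFs, 2,996 transcribed sORFs). The step labels and the function name `retention` are illustrative.

```python
# Counts quoted on the slides; labels are paraphrased.
FILTER_STEPS = [
    ("candidate sORFs (90-300 bp)", 133_090),
    ("after criterion 1: coding potential", 7_442),
    ("after criterion 2: expression", 2_996),
]

def retention(steps):
    """Return (label, count, fraction of previous step, fraction of first step)."""
    rows = []
    start = steps[0][1]
    previous = start
    for label, count in steps:
        rows.append((label, count, count / previous, count / start))
        previous = count
    return rows

for label, count, vs_previous, vs_start in retention(FILTER_STEPS):
    print(f"{label:38s} {count:>8,d}  {vs_previous:6.1%} of previous  {vs_start:6.1%} of start")
```

The first criterion removes the great majority of candidates, and a bit under half of the remainder show evidence of expression.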

  37. Criterion 3: Purifying selection • Compare known coding and non-coding sequences

  38. Criterion 3: Purifying selection • Compare known coding and non-coding sequences
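
The slides do not say how selection was measured, so the following is only a crude illustration of the underlying idea, not a Ka/Ks estimate. Under purifying selection, changes that alter the amino acid tend to be removed, so differences between related coding sequences are mostly synonymous. The sketch counts both kinds of codon difference between two aligned, gap-free, in-frame sequences using the standard genetic code. The function name `count_codon_changes` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (third base fastest).
CODON_BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(CODON_BASES, repeat=3), AMINO_ACIDS)
}

def count_codon_changes(seq_a, seq_b):
    """Count codons that differ synonymously and nonsynonymously.

    Both sequences must be aligned without gaps, in frame, and of equal length.
    Returns (synonymous, nonsynonymous).
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must have equal length divisible by 3")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1      # same amino acid, different codon
        else:
            nonsynonymous += 1   # amino acid changed
    return synonymous, nonsynonymous

print(count_codon_changes("CCTCCAATG", "CCCCCGATG"))
print(count_codon_changes("ATGCCT", "ATGCAT"))
```

Proper estimates normalise by the number of synonymous and nonsynonymous sites and correct for multiple substitutions, which this sketch does not attempt.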

  39. Our research interests 17,000 6,000 45,000 10,000 30,000 25,000

  40. Duplication Mechanism and Loss Rate • Gene duplications • Mechanisms • Preferential retention • Consequences

  41. Duplication mechanisms • Whole genome duplication • Tandem duplication • Segmental duplication • Duplicative transposition

  42. Differences in Duplicability • Duplicability • The propensity for the retention of a duplicate gene • Computational analysis of genome-wide trend

  43. Functional Consequences of Duplication • Functional divergence and conservation • Is it because of changes in cis-regulatory elements or in coding sequences? • How are duplicates retained: subfunctionalization or neofunctionalization?

  44. Acknowledgement • Lab members: Kousuke Hanada, Melissa Lehti-Shiu, Cheng Zou • TIGR: Chris Town, Hank Wu • University of Chicago: Wen-Hsiung Li, Justin O. Borevitz, Xu Zhang • Funding
