Genomics 101 DNA sequencing Alignment Gene identification Gene expression Genome evolution …



  1. Genomics 101 • DNA sequencing • Alignment • Gene identification • Gene expression • Genome evolution • …

  2. Next Few Topics
     • Gene Recognition: finding genes in DNA with computational methods
     • Large-scale alignment & multiple alignment: comparing whole genomes, or large families of genes
     • Gene Expression and Regulation: measuring the expression of many genes at a time; finding elements in DNA that control the expression of genes

  3. Gene Recognition • Credits for slides: Marina Alexandersson, Lior Pachter, Serge Saxonov

  4. Reading • GENSCAN • EasyGene • SLAM • Twinscan • Optional: Chris Burge’s thesis

  5. Gene expression
     DNA → (transcription) → RNA → (translation) → Protein
     DNA:     CCTGAGCCAACTATTGATGAA
     RNA:     CCUGAGCCAACUAUUGAUGAA
     Protein: PEPTIDE
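
The two steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the input is the coding strand and the standard genetic code applies; the names `transcribe`, `translate`, and `CODON_TABLE` are made up for this example and do not come from the slides.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace every T with U."""
    return dna.upper().replace("T", "U")


def translate(dna: str) -> str:
    """Translate coding-strand DNA codon by codon, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(transcribe("CCTGAGCCAACTATTGATGAA"))
print(translate("CCTGAGCCAACTATTGATGAA"))
```

Applied to the slide's sequence, the two calls reproduce the RNA string and the one-letter protein string shown above.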

  6. Gene structure
     [Figure: gene laid out as exon1, intron1, exon2, intron2, exon3; processing steps: transcription → splicing → translation]
     • Codon: a triplet of nucleotides that is converted to one amino acid
     • exon = protein-coding
     • intron = non-coding

  7. Where are the genes?

  8. In humans: ~22,000 genes, whose coding sequence makes up ~1.5% of human DNA

  9. Finding Genes
     • Exploit the regular gene structure
       ATG-Exon1-Intron1-Exon2-…-ExonN-STOP
     • Recognize “coding bias”
       CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…
     • Recognize splice sites
       Intron-cAGt-Exon-gGTgag-Intron
     • Model the duration of regions
       In mammals, introns tend to be much longer than exons
       Exons are biased to have a given minimum length
     • Use cross-species comparison
       Gene structure is conserved in mammals
       Exons are more similar (~85%) than introns

  10. Approaches to gene finding
      • Homology: BLAST, Procrustes
      • Ab initio: Genscan, Genie, GeneID
      • Hybrids: GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM

  11. 1. Exploit the regular gene structure
      [Figure: gene drawn 5’ to 3’ with Exon 1, Intron 1, Exon 2, Intron 2, Exon 3; labels: Start codon ATG, Splice sites, Stop codon TAG/TGA/TAA]

  12. Next Exon: Frame 0 Next Exon: Frame 1

  13. 2. Recognize “coding bias”
      • Each exon can be in one of three frames
          ag-gattacagattacagattaca-gtaag   Frame 0
          ag-gattacagattacagattaca-gtaag   Frame 1
          ag-gattacagattacagattaca-gtaag   Frame 2
        The frame of the next exon depends on how many nucleotides are left over from the previous exon
      • Codons “tag”, “tga”, and “taa” are STOP
      • No STOP codon appears in-frame until the end of the gene
      • A stretch with no in-frame STOP is called an open reading frame (ORF)
      • Different codons appear with different frequencies: coding bias
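
A rough sketch of the ORF idea from this slide: scan each of the three forward frames and report stretches with no in-frame STOP. The function names, the 0-based end-exclusive coordinates, and the "leftover" convention in `next_exon_frame` are assumptions made for illustration; real gene finders also handle the reverse strand and use their own frame conventions.

```python
STOPS = {"TAG", "TGA", "TAA"}


def open_reading_frames(dna, min_codons=1):
    """Yield (frame, start, end) for stop-free stretches in the three forward frames.

    Coordinates are 0-based and end-exclusive; the terminating stop codon
    is not part of the reported stretch.
    """
    dna = dna.upper()
    for frame in range(3):
        start = frame
        for i in range(frame, len(dna) - 2, 3):
            if dna[i:i + 3] in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield frame, start, i
                start = i + 3
        # Trailing stretch that reaches the end of the sequence without a stop.
        end = frame + ((len(dna) - frame) // 3) * 3
        if (end - start) // 3 >= min_codons:
            yield frame, start, end


def next_exon_frame(leftover, exon_length):
    """Bases of an incomplete codon left over after this exon.

    `leftover` is the number of bases (0, 1 or 2) carried in from the previous exon.
    """
    return (leftover + exon_length) % 3


for orf in open_reading_frames("ATGAAATAGCC"):
    print(orf)
```

Raising `min_codons` filters out the many short stop-free stretches that occur by chance in non-coding DNA.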

  14. 2. Recognize “coding bias”
      Amino acid       SLC    DNA codons
      Isoleucine       I      ATT, ATC, ATA
      Leucine          L      CTT, CTC, CTA, CTG, TTA, TTG
      Valine           V      GTT, GTC, GTA, GTG
      Phenylalanine    F      TTT, TTC
      Methionine       M      ATG
      Cysteine         C      TGT, TGC
      Alanine          A      GCT, GCC, GCA, GCG
      Glycine          G      GGT, GGC, GGA, GGG
      Proline          P      CCT, CCC, CCA, CCG
      Threonine        T      ACT, ACC, ACA, ACG
      Serine           S      TCT, TCC, TCA, TCG, AGT, AGC
      Tyrosine         Y      TAT, TAC
      Tryptophan       W      TGG
      Glutamine        Q      CAA, CAG
      Asparagine       N      AAT, AAC
      Histidine        H      CAT, CAC
      Glutamic acid    E      GAA, GAG
      Aspartic acid    D      GAT, GAC
      Lysine           K      AAA, AAG
      Arginine         R      CGT, CGC, CGA, CGG, AGA, AGG
      Stop codons      Stop   TAA, TAG, TGA
      Can map the 61 non-stop codons to frequencies & take log-odds ratios
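
One way the "frequencies & log-odds" idea could look in code. This is a sketch under stated assumptions: the training sequences are tiny placeholders, the background model is simply non-coding sequence read in triplets, and add-one pseudocounts are used; none of these choices are taken from the slides.

```python
import math
from itertools import product

STOPS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = ["".join(c) for c in product("ACGT", repeat=3) if "".join(c) not in STOPS]


def codon_frequencies(seqs, pseudocount=1.0):
    """Relative frequency of each of the 61 non-stop codons, reading every sequence in frame 0."""
    counts = {codon: pseudocount for codon in SENSE_CODONS}
    for seq in seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in counts:
                counts[codon] += 1
    total = sum(counts.values())
    return {codon: counts[codon] / total for codon in SENSE_CODONS}


def log_odds_table(coding_seqs, background_seqs):
    """log2(P(codon | coding) / P(codon | background)) for each non-stop codon."""
    coding = codon_frequencies(coding_seqs)
    background = codon_frequencies(background_seqs)
    return {codon: math.log2(coding[codon] / background[codon]) for codon in SENSE_CODONS}


def coding_score(seq, table, frame=0):
    """Sum of codon log-odds along one frame; stop codons contribute nothing."""
    seq = seq.upper()
    return sum(table.get(seq[i:i + 3], 0.0) for i in range(frame, len(seq) - 2, 3))


# Placeholder training data, for illustration only.
table = log_odds_table(["GATGATGAT"], ["GACGACGAC"])
print(coding_score("GATGAT", table), coding_score("GACGAC", table))
```

A positive score means the window looks more like the coding training set than the background; scoring all three frames and comparing them is a common way to guess the reading frame.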

  15. atg caggtg ggtgag cagatg ggtgag cagttg ggtgag caggcc ggtgag tga

  16. Biology of Splicing (http://genes.mit.edu/chris/)

  17. 3. Recognize splice sites
      Information content of the splice-site sequence logos (Stephens & Schneider, 1996):
      • Donor: 7.9 bits
      • Acceptor: 9.4 bits
      (http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
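
The "bits" figures are information content summed over the positions of a sequence logo. A minimal sketch of that calculation follows, assuming a uniform background (2 bits maximum per DNA position) and no small-sample correction; published values such as those above come from large curated alignments and include such a correction, so this function will not reproduce them exactly.

```python
import math
from collections import Counter


def information_content(sites):
    """Total information (bits) of equal-length aligned DNA sites.

    Each position contributes 2 - H, where H is the Shannon entropy (base 2)
    of the observed base frequencies at that position.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = 0.0
    for pos in range(length):
        counts = Counter(site[pos].upper() for site in sites)
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total


# Toy alignment: position 0 is invariant (2 bits), position 1 is a 50/50 mix (1 bit).
print(information_content(["AG", "AT"]))
```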

  18. 3. Recognize splice sites
      [Table: donor site, base percentages (%) by position, 5’ to 3’]

  19. 3. Recognize splice sites
      • WMM: weight matrix model = PSSM (Staden 1984)
      • WAM: weight array model = 1st-order Markov (Zhang & Marr 1993)
      • MDD: maximal dependence decomposition (Burge & Karlin 1997)
        • Decision-tree algorithm to take pairwise dependencies into account
        • For each position i, calculate $S_i = \sum_{j \neq i} \chi^2(C_i, X_j)$
        • Choose i* such that $S_{i^*}$ is maximal and partition into two subsets (sites that match the consensus at i* and sites that do not), repeating until
          • no significant dependencies are left, or
          • there are not enough sequences in the subset
        • Train separate WMM models for each subset
      MDD tree for donor splice sites:
        All donor splice sites
          G5
            G5 G-1
              G5 G-1 A2
                G5 G-1 A2 U6
                G5 G-1 A2, not U6
              G5 G-1, not A2
            G5, not G-1
          not G5
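
A sketch of one MDD step, under simplifying assumptions: $C_i$ is taken as the indicator "site matches the consensus base at position i", $X_j$ is the base at position j, and the chi-square statistic is computed on the resulting 2 x 4 table over observed categories only. The significance test and the minimum-subset-size check from the list above are left out, and all function names are invented for this example.

```python
from collections import Counter


def chi_square(sites, i, j, consensus_base):
    """Chi-square statistic for (matches consensus at i?) versus (base at j)."""
    table = Counter((site[i] == consensus_base, site[j]) for site in sites)
    n = len(sites)
    row_totals, col_totals = Counter(), Counter()
    for (row, col), count in table.items():
        row_totals[row] += count
        col_totals[col] += count
    stat = 0.0
    for row in row_totals:
        for col in col_totals:
            expected = row_totals[row] * col_totals[col] / n
            stat += (table[(row, col)] - expected) ** 2 / expected
    return stat


def dependence_scores(sites, consensus):
    """S_i = sum over j != i of chi2(C_i, X_j), one value per position i."""
    length = len(consensus)
    return [
        sum(chi_square(sites, i, j, consensus[i]) for j in range(length) if j != i)
        for i in range(length)
    ]


def split_on_max(sites, consensus):
    """Pick i* with maximal S_i and split into consensus / non-consensus subsets."""
    scores = dependence_scores(sites, consensus)
    i_star = max(range(len(scores)), key=scores.__getitem__)
    matching = [s for s in sites if s[i_star] == consensus[i_star]]
    others = [s for s in sites if s[i_star] != consensus[i_star]]
    return i_star, matching, others


# Toy data: positions 0 and 1 vary together, position 2 is independent.
toy_sites = ["GAT", "GAC", "CTT", "CTC"]
print(dependence_scores(toy_sites, "GAT"))
print(split_on_max(toy_sites, "GAT"))
```

In a full implementation, `split_on_max` would be applied recursively to each subset, and a separate WMM would be trained on every leaf.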

  20. 4. Model the duration of regions

  21. Hidden Markov Models for Gene Finding
      [Figure: the sequence GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA annotated with hidden states (First Exon State, Intron State, Intergene State); region labels along the DNA: intergene, exon, intron]

  23. Duration HMM for Gene Finding
      [Figure: the sequence TAATATGTCCACGGGTATTGAGCATTGTACACGGGGTATTGAGCATGTAATGAA segmented into Exon1, Exon2, Exon3, each with its own duration]
      Duration modeling
      • Introns: regular HMM states, geometric duration
      • Exons: special duration model
        $V_{E_{0,0}}(i) = \max_{d=1 \ldots D} \left\{ \mathrm{Prob}[\mathrm{duration}(E_{0,0}) = d] \times a_{\mathrm{Intron}_0, E_{0,0}} \times \prod_{j=i-d+1}^{i} e_{E_{0,0}}(x_j) \right\}$
        where i is an admissible exon-ending position and D is restricted by the longest ORF
      GENSCAN: Chris Burge and Sam Karlin, 1997
      • Best performing de novo gene finder
      • HMM with duration modeling for Exon states
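
The exon recurrence can be sketched in log space as follows. The slide's formula lists the duration, transition, and emission terms; a complete Viterbi recurrence also carries the best score of the preceding state ending at position i - d, which this sketch passes in as `log_v_intron`. All names and the array layout are assumptions for illustration, not GENSCAN's actual implementation.

```python
import math


def exon_viterbi(log_v_intron, log_emit, log_duration, log_a):
    """Best log-score of an exon segment ending at each position i = 1..n.

    log_v_intron[k] : log Viterbi score of the preceding intron state ending at
                      position k (k = 0..n, where 0 means "before the first base")
    log_emit[j - 1] : log e_E(x_j), the exon emission log-probability of base j
    log_duration[d - 1] : log Prob[duration(E) = d] for d = 1..D
    log_a : log transition probability from the intron state into the exon state
    """
    n = len(log_emit)
    max_d = len(log_duration)
    v = [-math.inf] * (n + 1)
    for i in range(1, n + 1):
        emit_sum = 0.0
        for d in range(1, min(max_d, i) + 1):
            emit_sum += log_emit[i - d]  # adds base x_{i-d+1} (1-based) to the segment
            score = log_v_intron[i - d] + log_a + log_duration[d - 1] + emit_sum
            if score > v[i]:
                v[i] = score
    return v


# Toy numbers: the intron can only end before the first base.
v = exon_viterbi(
    log_v_intron=[0.0, -math.inf, -math.inf, -math.inf],
    log_emit=[math.log(0.5)] * 3,
    log_duration=[math.log(0.2), math.log(0.5), math.log(0.3)],
    log_a=math.log(0.1),
)
print([round(x, 3) for x in v[1:]])
```

The inner loop over d is what makes duration HMMs more expensive than ordinary HMMs, which is why D is bounded (here by the length of `log_duration`; on the slide, by the longest ORF).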

  24. HMM-based Gene Finders
      • GENSCAN (Burge 1997)
        • Big jump in accuracy of de novo gene finding
        • Currently, one of the best
        • HMM with duration modeling for Exon states
      • FGENESH (Solovyev 1997)
        • Currently one of the best
      • HMMgene (Krogh 1997)
      • GENIE (Kulp 1996)
      • GENMARK (Borodovsky & McIninch 1993)
      • VEIL (Henderson, Salzberg, & Fasman 1997)

  25. Better way to do it: negative binomial
      • EasyGene: prokaryotic gene finder (Larsen TS, Krogh A)
      • Negative binomial with n = 3
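
To see why a negative binomial is a better length model than a single geometric state, compare the two distributions. The sketch assumes the negative binomial arises as the total length of n chained states that each exit with probability p per step; the value p = 0.3 is arbitrary and only for illustration.

```python
import math


def geometric_pmf(length, p):
    """P(L = length) for one state that exits with probability p at each step."""
    if length < 1:
        return 0.0
    return (1 - p) ** (length - 1) * p


def negative_binomial_pmf(length, p, n=3):
    """P(L = length) for the total length of n chained geometric states."""
    if length < n:
        return 0.0
    return math.comb(length - 1, n - 1) * p ** n * (1 - p) ** (length - n)


p = 0.3  # arbitrary exit probability, for illustration only
lengths = range(1, 200)
print(max(lengths, key=lambda l: geometric_pmf(l, p)))
print(max(lengths, key=lambda l: negative_binomial_pmf(l, p, n=3)))
```

The geometric distribution always puts most of its mass on the shortest length, while the n = 3 negative binomial peaks at an intermediate length, which is closer to how real gene and exon lengths behave.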

  26. GENSCAN’s hidden weapon
      • C+G content is correlated with:
        • Gene content (+)
        • Mean exon length (+)
        • Mean intron length (–)
      • These quantities affect the parameters of the model
      • Solution
        • Train the parameters of the model in four different C+G content ranges!
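
Choosing a parameter set by C+G content might look like the sketch below. The slide only says that four ranges are used; the cut points in `GC_BOUNDARIES` are an assumption made for this example and should be replaced with whatever boundaries the trained model actually uses.

```python
from bisect import bisect_right

# Assumed cut points between the four C+G ranges (as fractions); not taken from the slides.
GC_BOUNDARIES = (0.43, 0.51, 0.57)


def gc_fraction(dna):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    dna = dna.upper()
    total = sum(dna.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (dna.count("G") + dna.count("C")) / total


def gc_bin(dna, boundaries=GC_BOUNDARIES):
    """Index (0..3) of the C+G range that the sequence falls into."""
    return bisect_right(boundaries, gc_fraction(dna))


print(gc_bin("ATATATAT"), gc_bin("ATGC"), gc_bin("GGCCGGCC"))
```

A gene finder would then load the transition, duration, and emission parameters trained on sequences from the same bin before decoding.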

  27. Evaluation of Accuracy (Slide by NF Samatova)
      [Figure: actual vs. predicted coding regions along a sequence, with stretches marked TP, FP, TN, FN]
                               Actual: Coding    Actual: No Coding
      Predicted: Coding        TP                FP
      Predicted: No Coding     FN                TN
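
The four counts can be computed per nucleotide and turned into the usual accuracy measures. This sketch assumes annotations are given as strings with "C" for coding and "N" for non-coding, a format invented for this example. Note that in the gene-finding literature "specificity" is commonly defined as TP / (TP + FP), which is the definition used here.

```python
def confusion_counts(actual, predicted):
    """Per-nucleotide TP, FP, FN, TN for 'C' (coding) / 'N' (non-coding) label strings."""
    if len(actual) != len(predicted):
        raise ValueError("annotations must have the same length")
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == "C":
            counts["TP" if a == "C" else "FP"] += 1
        else:
            counts["FN" if a == "C" else "TN"] += 1
    return counts


def sensitivity(counts):
    """Fraction of truly coding nucleotides that were predicted as coding."""
    return counts["TP"] / (counts["TP"] + counts["FN"])


def specificity(counts):
    """Fraction of predicted coding nucleotides that are truly coding (gene-finding convention)."""
    return counts["TP"] / (counts["TP"] + counts["FP"])


counts = confusion_counts("NNCCCCNN", "NCCCNNNN")
print(counts, sensitivity(counts), specificity(counts))
```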

  28. Results of GENSCAN
      • On the initial test dataset (Burset & Guigo)
        • 80% exact exon detection
        • 10% partial exons
        • 10% wrong exons
      • In general
        • HMMs have been best in de novo prediction
        • In practice they overpredict human genes by ~2x
