The Toy Exon Finder
The “Toy” genome
• The genome is very dense with genes, typically no more than 20 bp between successive genes on a chromosome
• The exons tend to be very small, on the order of 20 bp
• The introns in multi-exon genes tend to be very small, about 20 bp
• The exons incorporate fairly strong codon biases and a significant GC bias
• The splice sites, start codons, and stop codons are flanked by positions with fairly strong base composition biases
• The codon usage and base composition statistics can be well characterized with some sample data
• Genes occur only on one strand, which we will call the Forward strand
Toyscan 1
• Use GC content to find exons
• Find all ORFs such that each ORF either
  • Begins with a START and ends with a STOP
  • Begins with a START and ends with a GT
  • Begins with AG and ends with GT
  • Begins with AG and ends with STOP
• Set threshold t such that if an ORF has GC content below t, label it as noncoding
• For all remaining pairs of ORFs p1, p2, do:
  • If p1 and p2 overlap, then discard the ORF with the lower GC content
• Output all ORFs that remain, calling them exons
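The first two Toyscan 1 steps (collecting candidate ORFs and applying the GC threshold) might be sketched in Python as follows. This is only an illustration: the function names, the `max_len` cap, and the coordinate convention are assumptions not stated on the slide. In particular, the sketch assumes that START and STOP codons belong to the candidate span while AG and GT do not, and it makes no reading-frame or in-frame-stop check.

```python
import re

def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)

def candidate_orfs(genome, max_len=60):
    """Return half-open (start, end) spans bounded by the four signal pairs.

    Assumptions (not from the slides): ATG and the stop codon are part of
    the span, AG and GT are excluded, and max_len is a hypothetical cap
    that keeps the number of candidates manageable.
    """
    # Lookahead patterns so that overlapping signals are all found.
    lefts = [m.start() for m in re.finditer("(?=ATG)", genome)]       # span starts at ATG
    lefts += [m.start() + 2 for m in re.finditer("(?=AG)", genome)]   # span starts after AG
    rights = [m.start() for m in re.finditer("(?=GT)", genome)]       # span ends before GT
    rights += [m.start() + 3 for m in re.finditer("(?=TAA|TAG|TGA)", genome)]  # span includes STOP
    spans = []
    for s in sorted(set(lefts)):
        for e in sorted(set(rights)):
            if 0 < e - s <= max_len:
                spans.append((s, e))
    return spans

def gc_filter(genome, spans, t):
    """Keep only spans whose GC content is at least the threshold t."""
    return [(s, e) for s, e in spans if gc_content(genome[s:e]) >= t]

genome = "AAATGGCGCCGTAAAT"          # made-up sequence for illustration
spans = candidate_orfs(genome)
print(gc_filter(genome, spans, t=0.7))
```

The remaining step, discarding the lower-scoring member of each overlapping pair, is independent of how the score is computed.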
Toyscan 2
• Use codon bias to find exons
• Codon frequencies for “true” exons are assumed to be known
  • Stop codons are not included, so they have probability 0
• Define codonBias function:
  • For an input ORF (given), score all 3 frames
  • Ignore the fact that some frames have stop codons in them
  • Score = sum of log probabilities of all codons in that frame
  • Probabilities are taken from the “known” probabilities
  • Divide Score by the number of codons n. This normalizes it.
  • Output the highest score of the 3 frames
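A minimal Python sketch of the codonBias function described above. The slide does not say how a zero-probability codon should be handled when taking logs, so this sketch assumes that stop codons (and any codon missing from the table) are simply skipped, and that the sum is divided by the number of codons actually scored. The name `codon_bias` and the probability table are hypothetical.

```python
import math

def codon_bias(orf, codon_prob):
    """Best per-codon average log-probability over the 3 reading frames.

    Assumption: codons with no positive probability in codon_prob
    (stop codons, unknown codons) are skipped instead of scored as log(0).
    """
    best = -math.inf
    for frame in range(3):
        logs = [
            math.log(codon_prob[orf[i:i + 3]])
            for i in range(frame, len(orf) - 2, 3)   # complete codons only
            if codon_prob.get(orf[i:i + 3], 0.0) > 0.0
        ]
        if logs:
            # Normalize by the number of codons scored in this frame.
            best = max(best, sum(logs) / len(logs))
    return best

# Made-up probabilities for illustration; a real table covers all sense codons.
toy_prob = {"GCC": 0.5, "GCG": 0.25, "CCG": 0.125}
print(codon_bias("GCCGCG", toy_prob))
```

A frame in which no codon can be scored contributes nothing, so an ORF with no scorable codon in any frame gets a score of negative infinity and would fall below any finite threshold.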
Toyscan 2 (cont.)
• Note: the codonBias function achieves its maximum when the observed distribution within an ORF matches the “correct” distribution from real genes
• Define TOYSCAN_2 as:
  • A codon bias score threshold, t, is input
  • For all ORFs, score them with the codonBias function
  • If the score is < t, delete the ORF
  • For all remaining pairs of ORFs p1, p2, do:
    • If p1, p2 overlap then discard the ORF with the lower codonBias score
  • Output all remaining ORFs as exons
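The threshold and overlap steps can be sketched as below, assuming each ORF has already been scored. The slide does not fix the order in which overlapping pairs are examined, so this sketch takes one possible reading: ORFs are visited from highest to lowest score, and an ORF is kept only if it does not overlap one already kept. The function names and the half-open `(start, end)` span convention are assumptions.

```python
def overlaps(a, b):
    """True if half-open spans a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]

def filter_and_resolve(scored_orfs, t):
    """scored_orfs is a list of ((start, end), score) pairs.

    Drops ORFs scoring below t, then greedily resolves overlaps in
    favor of the higher score. Returns the kept spans sorted by start.
    """
    survivors = [(span, score) for span, score in scored_orfs if score >= t]
    survivors.sort(key=lambda item: item[1], reverse=True)
    kept = []
    for span, _ in survivors:
        if not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return sorted(kept)

# Made-up spans and scores for illustration.
scored = [((0, 10), -1.0), ((5, 15), -0.5), ((20, 30), -3.0),
          ((12, 22), -0.8), ((40, 50), -1.5)]
print(filter_and_resolve(scored, t=-2.0))
```

Under a stricter reading, in which an ORF is discarded whenever any overlapping ORF outscores it, chains of overlaps can remove more ORFs than this greedy version does.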
Toyscan 3
• Use codon bias and weight matrix models (WMMs)
• Input includes WMMs for start, stop, donor, and acceptor sites
  • Donor WMM includes 5 positions after GT
  • Acceptor WMM includes 5 positions before AG
  • Start codon WMM includes 5 positions before ATG
  • Stop codon WMM includes 5 positions after TAA/TAG/TGA
Toyscan 3 (cont.)
• Score a weight matrix (scoreWMM):
  • For each position i in the sequence S, sum the log probabilities of the bases in the interval (i,j) using the WMM, where j-i+1 is the width of the WMM
• Score an ORF (scoreORF):
  • Choose the matrices to use on the left and right ends of the ORF
    • E.g., internal exon has acceptor on left, donor on right
  • Score = WMM(left end) + WMM(right end) + codonBias
  • Return Score
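One way scoreWMM and scoreORF might look in Python. The WMM representation (a list of per-position dictionaries mapping base to probability) and the window placement are assumptions: the left matrix is applied to the window ending just before the span, the right matrix to the window starting just after it, and the caller is expected to choose span coordinates so that those windows line up with the positions the matrices were trained on. Every base is assumed to have a positive probability in every column.

```python
import math

def score_wmm(seq, i, wmm):
    """Sum of log-probabilities of seq[i : i + len(wmm)] under the WMM.

    Returns -inf when the window does not fit inside seq.
    Assumes every base has a positive probability in every column.
    """
    if i < 0 or i + len(wmm) > len(seq):
        return -math.inf
    return sum(math.log(column[seq[i + k]]) for k, column in enumerate(wmm))

def score_orf(seq, start, end, left_wmm, right_wmm, codon_bias_score):
    """Score = WMM(left end) + WMM(right end) + codonBias.

    start and end delimit a half-open span; the window positions
    relative to that span are an assumption of this sketch.
    """
    left = score_wmm(seq, start - len(left_wmm), left_wmm)   # window just before the span
    right = score_wmm(seq, end, right_wmm)                   # window just after the span
    return left + right + codon_bias_score

# Made-up two-column matrix for illustration (the slides use width 5).
toy_wmm = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
print(score_orf("ACGTAC", 2, 4, toy_wmm, toy_wmm, codon_bias_score=-1.0))
```

For an internal exon the caller would pass the acceptor matrix as `left_wmm` and the donor matrix as `right_wmm`; for a single-exon gene, the start and stop matrices.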
Toyscan 3 (cont.)
• Now define Toyscan_3 as:
  • Assume a scoring threshold, t, is provided
    • You will have to experiment to find a good value for t
  • Get all ORFs
  • Score all ORFs using the scoreORF procedure
  • If the score is < t, delete the ORF
  • For all remaining pairs of ORFs p1, p2, do:
    • If p1, p2 overlap then discard the ORF with the lower scoreORF score
  • Output all remaining ORFs as exons
GFF format
# coding GC: 49%
# noncoding GC: 50%
1 toy-genome initial-exon 31 46 . + . transgrp=1;
1 toy-genome final-exon 79 98 . + . transgrp=1;
1 toy-genome single-exon 129 140 . + . transgrp=2;
1 toy-genome single-exon 164 193 . + . transgrp=3;
1 toy-genome single-exon 228 260 . + . transgrp=4;
1 toy-genome single-exon 287 304 . + . transgrp=5;
1 toy-genome single-exon 331 354 . + . transgrp=6;
1 toy-genome single-exon 377 400 . + . transgrp=7;
1 toy-genome single-exon 424 435 . + . transgrp=8;
1 toy-genome initial-exon 475 488 . + . transgrp=9;
1 toy-genome internal-exon 512 526 . + . transgrp=9;
1 toy-genome final-exon 545 593 . + . transgrp=9;
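Records in this layout could be written and read with a few lines of Python, sketched here under some assumptions: fields are emitted tab-separated (as GFF normally is), the score and frame columns are always ".", the strand is always "+", and the attribute column contains no spaces. The function names are hypothetical.

```python
def gff_line(seqname, feature, start, end, transgrp, source="toy-genome"):
    """Format one exon as a 9-column, tab-separated GFF record.

    start and end are 1-based and inclusive, as in GFF.
    """
    fields = [str(seqname), source, feature, str(start), str(end),
              ".", "+", ".", f"transgrp={transgrp};"]
    return "\t".join(fields)

def parse_gff(text):
    """Parse records, skipping blank lines and '#' comment lines.

    Splits on any whitespace, which is safe only because the
    attribute column here contains no spaces.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split()
        records.append({
            "seqname": f[0], "source": f[1], "feature": f[2],
            "start": int(f[3]), "end": int(f[4]),
            "strand": f[6], "group": f[8],
        })
    return records

text = "# coding GC: 49%\n" + gff_line("1", "initial-exon", 31, 46, 1)
print(parse_gff(text))
```

Exons that share a `transgrp` value belong to the same transcript, so grouping parsed records by the `group` field recovers the multi-exon genes.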