
The Toy Exon Finder


kelii


  1. The Toy Exon Finder

  2. The “Toy” genome
     • The genome is very dense with genes, typically no more than 20 bp between successive genes on a chromosome
     • The exons tend to be very small, on the order of 20 bp
     • The introns in multi-exon genes tend to be very small, about 20 bp
     • The exons incorporate fairly strong codon biases and a significant GC bias
     • The splice sites, start codons, and stop codons are flanked by positions with fairly strong base composition biases
     • The codon usage and base composition statistics can be well characterized with some sample data
     • Genes occur only on one strand, which we will call the Forward strand

  3. Toyscan 1
     • Use GC content to find exons
     • Find all ORFs such that each ORF either
        • Begins with a START and ends with a STOP
        • Begins with a START and ends with a GT
        • Begins with AG and ends with GT
        • Begins with AG and ends with STOP
     • Set a threshold t such that if an ORF has GC content below t, it is labeled as noncoding
     • For all remaining pairs of ORFs p1, p2, do:
        • If p1 and p2 overlap, then discard the ORF with the lower GC content
     • Output all ORFs that remain, calling them exons
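The Toyscan 1 steps above can be sketched in Python. This is a minimal illustration under several assumptions that the slides do not state: START is `ATG`; exon coordinates are 0-based and half-open, including the start and stop codons but excluding the flanking `AG` and `GT`; the reading frame is checked only for START-to-STOP candidates; and overlaps are resolved greedily from the highest GC content down, which is one possible reading of the pairwise rule. The names `candidate_orfs`, `toyscan1`, `min_len`, and `max_len` are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}

def gc_content(s):
    """Fraction of G and C bases in s (0.0 for an empty string)."""
    return (s.count("G") + s.count("C")) / len(s) if s else 0.0

def candidate_orfs(seq, min_len=3, max_len=60):
    """Enumerate (start, end) half-open candidate exons on the forward strand."""
    lefts, rights = [], []
    for i in range(len(seq) - 1):
        if seq[i:i + 3] == "ATG":
            lefts.append((i, "start"))         # exon includes the ATG
        if seq[i:i + 2] == "AG":
            lefts.append((i + 2, "acceptor"))  # exon begins after the AG
        if seq[i:i + 3] in STOPS:
            rights.append((i + 3, "stop"))     # exon includes the stop codon
        if seq[i:i + 2] == "GT":
            rights.append((i, "donor"))        # exon ends before the GT
    orfs = set()
    for s, lkind in lefts:
        for e, rkind in rights:
            if not (min_len <= e - s <= max_len):
                continue
            if lkind == "start" and rkind == "stop":
                # Single-exon gene: whole codons only, no earlier in-frame stop.
                if (e - s) % 3 != 0:
                    continue
                if any(seq[k:k + 3] in STOPS for k in range(s, e - 3, 3)):
                    continue
            orfs.add((s, e))
    return sorted(orfs)

def toyscan1(seq, t, max_len=60):
    """Keep ORFs with GC content >= t, then resolve overlaps by GC content."""
    scored = [(gc_content(seq[s:e]), s, e)
              for s, e in candidate_orfs(seq, max_len=max_len)]
    scored = [x for x in scored if x[0] >= t]
    scored.sort(key=lambda x: (-x[0], x[1], x[2]))  # highest GC first
    exons = []
    for gc, s, e in scored:
        # Two half-open intervals are disjoint if one ends before the other starts.
        if all(e <= s2 or e2 <= s for _, s2, e2 in exons):
            exons.append((gc, s, e))
    return sorted((s, e) for _, s, e in exons)
```

For example, `toyscan1("ATGGCCGCGTAA", 0.5)` considers the START-to-GT candidate and the START-to-STOP candidate, which overlap, and keeps the one with the higher GC content.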

  4. Toyscan 2
     • Use codon bias to find exons
     • Codon frequencies for “true” exons are assumed to be known
        • Stop codons are not included, so they have probability 0
     • Define the codonBias function:
        • For an input ORF (given), score all 3 frames
        • Ignore the fact that some frames have stop codons in them
        • Score = sum of log probabilities of all codons in that frame
        • Probabilities are taken from the “known” probabilities
        • Divide Score by the number of codons n. This normalizes it.
        • Output the highest score of the 3 frames
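A minimal Python sketch of the codonBias function described above. The slides do not say how a probability-0 codon (a stop codon, for instance) should enter a sum of logs, so this sketch assumes a small probability floor; the `floor` parameter and the name `codon_bias` are hypothetical, and `codon_prob` stands in for the “known” codon frequencies.

```python
import math

def codon_bias(orf, codon_prob, floor=1e-6):
    """Best per-codon average log probability over the 3 reading frames.

    codon_prob maps codon -> probability from "true" exons. Codons that are
    missing or have probability 0 are scored at `floor` (an assumption),
    so frames containing stop codons still receive a finite score.
    """
    best = -math.inf
    for frame in range(3):
        # Complete codons only; a trailing partial codon is ignored.
        codons = [orf[i:i + 3] for i in range(frame, len(orf) - 2, 3)]
        if not codons:
            continue
        total = sum(math.log(max(codon_prob.get(c, 0.0), floor)) for c in codons)
        best = max(best, total / len(codons))  # normalize by codon count n
    return best
```

Dividing by n makes scores comparable between ORFs of different lengths, which matters when a single threshold is applied to all of them.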

  5. Toyscan 2 (cont.)
     • Note: the codonBias function favors ORFs whose observed codon distribution resembles the “correct” distribution from real genes
     • Define TOYSCAN_2 as:
        • A codon bias score threshold, t, is input
        • For all ORFs, score them with the codonBias function
        • If the score is < t, delete the ORF
        • For all remaining pairs of ORFs p1, p2, do:
           • If p1, p2 overlap, then discard the ORF with the lower codonBias score
        • Output all remaining ORFs as exons
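One way to read the note on slide 5 is through a standard identity. The symbols here are introduced for illustration only: $q(c)$ is the fraction of codons in the scored frame equal to $c$, $p(c)$ is the “known” probability of codon $c$, and $c_1, \dots, c_n$ are the codons in the frame.

```latex
\frac{1}{n}\sum_{k=1}^{n}\log p(c_k)
  \;=\; \sum_{c} q(c)\,\log p(c)
  \;=\; -H(q) \;-\; D_{\mathrm{KL}}(q \,\|\, p),
\qquad
H(q) = -\sum_{c} q(c)\log q(c),
\quad
D_{\mathrm{KL}}(q \,\|\, p) = \sum_{c} q(c)\log\frac{q(c)}{p(c)} .
```

Since $D_{\mathrm{KL}}(q \,\|\, p) \ge 0$ with equality only when $p = q$, an ORF with codon usage $q$ scores best under the model that matches it. With $p$ held fixed, however, the score also depends on $-H(q)$, so an ORF that repeats the single most probable codon scores higher still; the threshold t therefore needs to be tuned on sample data rather than derived from this identity.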

  6. Toyscan 3
     • Use codon bias and weight matrix models (WMMs)
     • Input includes WMMs for start, stop, donor, and acceptor sites
        • Donor WMM includes 5 positions after GT
        • Acceptor WMM includes 5 positions before AG
        • Start codon WMM includes 5 positions before ATG
        • Stop codon WMM includes 5 positions after TAA/TAG/TGA

  7. Toyscan 3 (cont.)
     • Score a weight matrix (scoreWMM):
        • For a WMM placed at position i in the sequence S, sum the log probabilities of the bases in the interval [i, j] using the WMM, where j-i+1 is the width of the WMM
     • Score an ORF (scoreORF):
        • Choose the matrices to use on the left and right ends of the ORF
        • E.g., an internal exon has an acceptor on the left and a donor on the right
        • Score = WMM(left end) + WMM(right end) + codonBias
        • Return Score
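The scoreWMM and scoreORF procedures can be sketched as follows. Several details are assumptions, not taken from the slides: a WMM is represented as a list of per-position dictionaries mapping base to probability; probabilities of 0 are floored (hypothetical `floor` parameter); a window that runs off the sequence scores negative infinity; and the caller supplies the window start positions, since the slides leave open whether a WMM covers the consensus dinucleotide or codon in addition to its 5 flanking positions.

```python
import math

def score_wmm(seq, i, wmm, floor=1e-6):
    """Sum of log probabilities of seq[i..j] under wmm, with j - i + 1 = len(wmm)."""
    j = i + len(wmm) - 1
    if i < 0 or j >= len(seq):
        return -math.inf  # assumption: incomplete windows are rejected
    return sum(math.log(max(col.get(seq[i + k], 0.0), floor))
               for k, col in enumerate(wmm))

def score_orf(seq, orf, left, right, codon_bias):
    """Score = WMM(left end) + WMM(right end) + codonBias.

    orf   : (start, end) half-open coordinates of the candidate exon
    left  : (wmm, window_start) for the site at the left end
    right : (wmm, window_start) for the site at the right end
    codon_bias : function taking the ORF sequence and returning its score
    """
    start, end = orf
    left_wmm, left_pos = left
    right_wmm, right_pos = right
    return (score_wmm(seq, left_pos, left_wmm)
            + score_wmm(seq, right_pos, right_wmm)
            + codon_bias(seq[start:end]))
```

For an internal exon at half-open coordinates (s, e), with the AG at s-2..s-1 and the GT at e..e+1, scoring only the flanking positions would place the acceptor window at s-7 and the donor window at e+2.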

  8. Toyscan 3 (cont.)
     • Now define Toyscan_3 as:
        • Assume a scoring threshold, t, is provided
           • You will have to experiment to find a good value for t
        • Get all ORFs
        • Score all ORFs using the scoreORF procedure
        • If the score is < t, delete the ORF
        • For all remaining pairs of ORFs p1, p2, do:
           • If p1, p2 overlap, then discard the ORF with the lower scoreORF score
        • Output all remaining ORFs as exons

  9. GFF format
     # coding GC: 49%
     # noncoding GC: 50%
     1 toy-genome initial-exon 31 46 . + . transgrp=1;
     1 toy-genome final-exon 79 98 . + . transgrp=1;
     1 toy-genome single-exon 129 140 . + . transgrp=2;
     1 toy-genome single-exon 164 193 . + . transgrp=3;
     1 toy-genome single-exon 228 260 . + . transgrp=4;
     1 toy-genome single-exon 287 304 . + . transgrp=5;
     1 toy-genome single-exon 331 354 . + . transgrp=6;
     1 toy-genome single-exon 377 400 . + . transgrp=7;
     1 toy-genome single-exon 424 435 . + . transgrp=8;
     1 toy-genome initial-exon 475 488 . + . transgrp=9;
     1 toy-genome internal-exon 512 526 . + . transgrp=9;
     1 toy-genome final-exon 545 593 . + . transgrp=9;
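Records in the layout shown on slide 9 can be read with a short Python sketch. It assumes nine whitespace-separated columns (sequence, source, feature, start, end, score, strand, frame, attributes), skips `#` comment lines, and treats `transgrp` as the identifier that groups the exons of one gene. GFF files are normally tab-delimited, so splitting on any whitespace is an assumption made to match the text as shown; the names `Exon`, `parse_gff`, and `group_by_gene` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple

class Exon(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int     # 1-based, inclusive
    end: int       # 1-based, inclusive
    strand: str
    transgrp: str

def parse_gff(text):
    """Parse exon records, skipping blank lines and '#' comment lines."""
    exons = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue  # not a complete 9-column record
        attrs = dict(kv.strip().split("=", 1)
                     for kv in fields[8].split(";") if "=" in kv)
        exons.append(Exon(fields[0], fields[1], fields[2],
                          int(fields[3]), int(fields[4]),
                          fields[6], attrs.get("transgrp", "")))
    return exons

def group_by_gene(exons):
    """Map each transgrp value to its exons, in file order."""
    genes = defaultdict(list)
    for exon in exons:
        genes[exon.transgrp].append(exon)
    return dict(genes)
```

Because GFF coordinates are 1-based and inclusive, an exon spanning 31 to 46 is `end - start + 1` = 16 bp long.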
