The Toy Exon Finder


PowerPoint Slideshow about ' The Toy Exon Finder' - kelii


The “Toy” genome
  • The genome is very dense with genes, typically no more than 20 bp between successive genes on a chromosome
  • The exons tend to be very small, on the order of 20 bp
  • The introns in multi-exon genes tend to be very small, about 20 bp
  • The exons incorporate fairly strong codon biases and a significant GC bias
  • The splice sites, start codons, and stop codons are flanked by positions with fairly strong base composition biases
  • The codon usage and base composition statistics can be well characterized with some sample data
  • Genes occur only on one strand, which we will call the Forward strand
Toyscan 1
  • Use GC content to find exons
  • Find all ORFs such that each ORF does one of the following:
    • Begins with a START and ends with a STOP
    • Begins with a START and ends with GT
    • Begins with AG and ends with GT
    • Begins with AG and ends with a STOP
  • Set a threshold t; if an ORF has GC content below t, label it as noncoding
  • For all remaining pairs of ORFs p1, p2, do:
    • If p1 and p2 overlap, then discard the ORF with the lower GC content
  • Output all ORFs that remain, calling them exons
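The first steps of Toyscan 1 can be sketched roughly as follows. This is a minimal illustration, not the original implementation. The function names, the 0-based half-open coordinates, the `min_len`/`max_len` bounds, and the choice to include ATG and the stop codon in the exon while excluding AG and GT are all assumptions made for the example. An ORF is treated as "open" if the relevant reading frame contains no internal stop codon.

```python
STOPS = {"TAA", "TAG", "TGA"}

def gc_content(seq):
    """Fraction of G or C bases in seq (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)

def frame_is_open(body, frame):
    """True if no complete codon of body in this frame is a stop codon."""
    return all(body[i:i + 3] not in STOPS
               for i in range(frame, len(body) - 2, 3))

def is_open(seq, start, end, left, right):
    """Check that the candidate has a stop-free reading frame."""
    length = end - start
    # The terminal stop codon itself is not part of the body being checked.
    body = seq[start:end - 3] if right == "STOP" else seq[start:end]
    if left == "START":
        if right == "STOP" and length % 3 != 0:
            return False              # ATG and stop must share a frame
        frames = [0]                  # frame is fixed by the ATG
    elif right == "STOP":
        frames = [length % 3]         # frame is fixed by the stop codon
    else:
        frames = [0, 1, 2]            # AG..GT: any open frame will do
    return any(frame_is_open(body, f) for f in frames)

def candidate_orfs(seq, min_len=3, max_len=60):
    """Enumerate (start, end, left_signal, right_signal) candidates."""
    lefts, rights = [], []
    for i in range(len(seq) - 1):
        if seq[i:i + 3] == "ATG":
            lefts.append((i, "START"))       # exon includes the ATG
        if seq[i:i + 2] == "AG":
            lefts.append((i + 2, "AG"))      # exon starts after the AG
        if seq[i:i + 3] in STOPS:
            rights.append((i + 3, "STOP"))   # exon includes the stop codon
        if seq[i:i + 2] == "GT":
            rights.append((i, "GT"))         # exon ends before the GT
    return [(s, e, left, right)
            for s, left in lefts
            for e, right in rights
            if min_len <= e - s <= max_len and is_open(seq, s, e, left, right)]

def gc_filter(seq, orfs, t):
    """Keep only candidates whose GC content is at least t."""
    return [o for o in orfs if gc_content(seq[o[0]:o[1]]) >= t]
```

The pairwise enumeration is quadratic in the number of signals, which is acceptable for a toy genome with exons of about 20 bp but would need a smarter scan on real data.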
Toyscan 2
  • Use codon bias to find exons
  • Codon frequencies for “true” exons are assumed to be known
  • Stop codons are not included in the table, so they have probability 0
  • Define codonBias function:
    • For an input ORF (given), score all 3 frames
    • Ignore the fact that some frames have stop codons in them
    • Score = sum of log probabilities of all codons in that frame
    • Probabilities are taken from the “known” probabilities
    • Divide Score by the number of codons n, which normalizes it for ORF length
    • Output the highest score of the 3 frames
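A minimal sketch of the codonBias function, assuming the known codon frequencies arrive as a dictionary from codon to probability. The name `codon_bias` and the `floor` parameter are assumptions: because stop codons have probability 0 and log(0) is undefined, the sketch substitutes a small floor value so that frames containing stops still receive a finite (low) score. Other ways of ignoring stop codons are possible.

```python
import math

def codon_bias(orf, codon_probs, floor=1e-6):
    """Best per-codon average log probability over the 3 reading frames.

    codon_probs maps codon -> probability in "true" exons. Codons missing
    from the table (such as stop codons) are scored with `floor`, which is
    an assumption of this sketch.
    """
    best = -math.inf
    for frame in range(3):
        # Complete codons only; a trailing partial codon is dropped.
        codons = [orf[i:i + 3] for i in range(frame, len(orf) - 2, 3)]
        if not codons:
            continue
        total = sum(math.log(max(codon_probs.get(c, 0.0), floor))
                    for c in codons)
        best = max(best, total / len(codons))   # normalize by n codons
    return best
```

For example, with a toy table `{"GCC": 0.5, "GCG": 0.25, "ATG": 0.25}`, the sequence `"AGCCGCC"` scores best in frame 1, where it reads as two GCC codons.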
Toyscan 2 (cont.)
  • Note: the codonBias function achieves its maximum when the observed distribution within an ORF matches the “correct” distribution from real genes
  • Define TOYSCAN_2 as:
    • A codon bias score threshold, t, is input
    • For all ORFs, score them with the codonBias function
    • If the score is < t, delete the ORF
    • For all remaining pairs of ORFs p1, p2, do:
      • If p1, p2 overlap then discard the ORF with the lower codonBias score
    • Output all remaining ORFs as exons
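The threshold and overlap steps can be sketched as below. The slide does not say in what order pairs are visited, so this example makes an assumption: it processes ORFs from highest to lowest score and keeps an ORF only if it does not overlap one already kept (a greedy reading of the pairwise rule). The names `overlaps` and `select_exons`, and the `(start, end, score)` tuples with half-open coordinates, are hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]

def select_exons(scored_orfs, t):
    """Drop ORFs scoring below t, then resolve overlaps greedily.

    scored_orfs is a list of (start, end, score) tuples. Among any group of
    mutually overlapping ORFs, the highest-scoring one is kept.
    """
    survivors = [orf for orf in scored_orfs if orf[2] >= t]
    chosen = []
    for orf in sorted(survivors, key=lambda o: o[2], reverse=True):
        if not any(overlaps(orf, kept) for kept in chosen):
            chosen.append(orf)
    return sorted(chosen)   # report in genomic order
```

Under this greedy reading, an ORF that has already been discarded cannot eliminate another ORF. A literal all-pairs pass could behave differently on chains of overlapping ORFs.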
Toyscan 3
  • Use codon bias and weight matrix models (WMMs)
  • Input includes WMMs for start, stop, donor, and acceptor sites
  • Donor WMM includes 5 positions after GT
  • Acceptor WMM includes 5 positions before AG
  • Start codon WMM includes 5 positions before ATG
  • Stop codon WMM includes 5 positions after TAA/TAG/TGA
Toyscan 3 (cont.)
  • Score a weight matrix (scoreWMM):
    • For each position i in the sequence S, sum the log probabilities of the bases in the closed interval [i, j] under the WMM, where j - i + 1 is the width of the WMM
  • Score an ORF (scoreORF):
    • choose the matrices to use on the left and right ends of the ORF
      • E.g., internal exon has acceptor on left, donor on right
    • Score = scoreWMM(left end) + scoreWMM(right end) + codonBias
    • return Score
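A rough sketch of scoreWMM and scoreORF under several stated assumptions: a WMM is represented as a list of per-position dictionaries mapping base to probability; each WMM covers only the 5 flanking positions (not the signal itself); ORFs are `(start, end, left_signal, right_signal)` tuples in 0-based half-open coordinates that include ATG and the stop codon but exclude AG and GT; and a window that runs off the sequence scores negative infinity. None of these details come from the slides.

```python
import math

def score_wmm(seq, i, wmm, floor=1e-6):
    """Sum of log probabilities of seq[i..j] under the WMM.

    j = i + len(wmm) - 1, so j - i + 1 equals the WMM width.
    """
    j = i + len(wmm) - 1
    if i < 0 or j >= len(seq):
        return -math.inf          # window falls off the sequence (assumption)
    return sum(math.log(max(column.get(seq[i + k], 0.0), floor))
               for k, column in enumerate(wmm))

def score_orf(seq, orf, wmms, codon_bias_score):
    """Combine left-end WMM, right-end WMM, and a codon bias score.

    wmms maps a signal name ("START", "AG", "STOP", "GT") to its WMM; the
    ORF's own signal labels pick the matrices for each end.
    """
    start, end, left, right = orf
    # Where each signal sits relative to the exon coordinates.
    left_signal_start = start if left == "START" else start - 2
    right_signal_end = end if right == "STOP" else end + 2
    left_wmm, right_wmm = wmms[left], wmms[right]
    # Left matrices score the bases just before the signal,
    # right matrices score the bases just after it.
    left_score = score_wmm(seq, left_signal_start - len(left_wmm), left_wmm)
    right_score = score_wmm(seq, right_signal_end, right_wmm)
    return left_score + right_score + codon_bias_score
```

The codon bias score is passed in as a number so that the sketch stays independent of how it was computed.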
Toyscan 3 (cont.)
  • Now define Toyscan_3 as:
    • assume a scoring threshold, t, is provided
      • You will have to experiment to find a good value for t
    • Get all ORFs
    • Score all ORFs using the scoreORF procedure
    • If the score is < t, delete the ORF
    • For all remaining pairs of ORFs p1, p2, do:
      • If p1, p2 overlap then discard the ORF with the lower scoreORF score
    • Output all remaining ORFs as exons
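One possible way to experiment with t, sketched here as an assumption rather than a prescribed method, is to sweep several thresholds and compare the predictions against a set of known exons. The helper names, the exact-coordinate matching rule, and the two reported measures (fraction of known exons recovered, fraction of predictions that are correct) are choices made for this example. The overlap-resolution step is left out for brevity.

```python
def exon_accuracy(predicted, reference):
    """Return (sensitivity, precision) using exact coordinate matches."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    sensitivity = hits / len(reference) if reference else 0.0
    precision = hits / len(predicted) if predicted else 0.0
    return sensitivity, precision

def sweep_thresholds(scored_orfs, reference, thresholds):
    """Evaluate each threshold t on (start, end, score) candidates."""
    rows = []
    for t in thresholds:
        predicted = [(s, e) for s, e, score in scored_orfs if score >= t]
        rows.append((t, *exon_accuracy(predicted, reference)))
    return rows
```

A threshold that keeps sensitivity high without letting precision collapse is a reasonable starting point for further tuning.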
GFF format

# coding GC: 49%
# noncoding GC: 50%
1 toy-genome initial-exon 31 46 . + . transgrp=1;
1 toy-genome final-exon 79 98 . + . transgrp=1;
1 toy-genome single-exon 129 140 . + . transgrp=2;
1 toy-genome single-exon 164 193 . + . transgrp=3;
1 toy-genome single-exon 228 260 . + . transgrp=4;
1 toy-genome single-exon 287 304 . + . transgrp=5;
1 toy-genome single-exon 331 354 . + . transgrp=6;
1 toy-genome single-exon 377 400 . + . transgrp=7;
1 toy-genome single-exon 424 435 . + . transgrp=8;
1 toy-genome initial-exon 475 488 . + . transgrp=9;
1 toy-genome internal-exon 512 526 . + . transgrp=9;
1 toy-genome final-exon 545 593 . + . transgrp=9;
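Records like those above could be read with a small parser along these lines. This is a sketch under assumptions: fields are split on any whitespace (standard GFF uses tabs, but the listing here shows spaces), the dictionary keys follow the usual GFF column names, and `group_by_transcript` is a hypothetical helper that collects exons sharing a `transgrp` value. GFF coordinates are 1-based and inclusive.

```python
def parse_gff_line(line):
    """Parse one GFF record into a dictionary."""
    fields = line.split(None, 8)
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(fields)}")
    attributes = {}
    if len(fields) == 9:
        # Attributes look like "key=value;" pairs.
        for item in fields[8].split(";"):
            key, sep, value = item.partition("=")
            if sep:
                attributes[key.strip()] = value.strip()
    return {
        "seqname": fields[0], "source": fields[1], "feature": fields[2],
        "start": int(fields[3]), "end": int(fields[4]),
        "score": fields[5], "strand": fields[6], "frame": fields[7],
        "attributes": attributes,
    }

def parse_gff(text):
    """Parse all records, skipping blank lines and '#' comment lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            records.append(parse_gff_line(line))
    return records

def group_by_transcript(records):
    """Map each transgrp value to its list of (start, end) exons."""
    genes = {}
    for rec in records:
        key = rec["attributes"].get("transgrp")
        genes.setdefault(key, []).append((rec["start"], rec["end"]))
    return genes
```

Grouping by `transgrp` recovers the multi-exon genes, for instance the initial, internal, and final exons that share `transgrp=9`.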
