The toy exon finder



PowerPoint Slideshow about ' The Toy Exon Finder' - kelii



The “Toy” genome

  • The genome is very dense with genes, typically no more than 20 bp between successive genes on a chromosome

  • The exons tend to be very small, on the order of 20 bp

  • The introns in multi-exon genes tend to be very small, about 20 bp

  • The exons incorporate fairly strong codon biases and a significant GC bias

  • The splice sites, start codons, and stop codons are flanked by positions with fairly strong base composition biases

  • The codon usage and base composition statistics can be well characterized with some sample data

  • Genes occur only on one strand, which we will call the Forward strand


Toyscan 1

  • Use GC content to find exons

  • Find all ORFs such that each ORF either

    • Begins with a START and ends with a STOP

    • Begins with a START and ends with a GT

    • Begins with AG and ends with GT

    • Begins with AG and ends with STOP

  • Set a threshold t such that any ORF with GC content below t is labeled as noncoding and discarded

  • For all remaining pairs of ORFs p1, p2, do:

    • If p1 and p2 overlap, then discard the ORF with the lower GC content

  • Output all ORFs that remain, calling them exons
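
The Toyscan 1 steps above can be sketched in Python as follows. This is a minimal illustration, not the presenter's implementation, and it rests on several assumptions that the slide does not state: coordinates are 0-based with an exclusive end, the delimiting signals (ATG, AG, GT, stop codon) are counted as part of the candidate span, no reading-frame check is made, candidate length is capped by a hypothetical `max_len`, and the pairwise overlap rule is read greedily (best GC content first). All function and parameter names are invented for the example.

```python
import re
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    kind: str


# Exon type implied by each (left signal, right signal) pair on the slide.
KINDS = {
    ("START", "STOP"): "single-exon",
    ("START", "GT"): "initial-exon",
    ("AG", "GT"): "internal-exon",
    ("AG", "STOP"): "final-exon",
}


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty string)."""
    return sum(b in "GC" for b in seq) / len(seq) if seq else 0.0


def find_candidates(genome: str, min_len: int = 6, max_len: int = 60) -> List[Candidate]:
    """Pair every left signal with every right signal within the length limits.

    Assumption: no reading-frame check is performed in this sketch.
    """
    # Lookahead patterns so that overlapping signal occurrences are all found.
    lefts = [(m.start(), "START") for m in re.finditer(r"(?=ATG)", genome)]
    lefts += [(m.start(), "AG") for m in re.finditer(r"(?=AG)", genome)]
    rights = [(m.start() + 3, "STOP") for m in re.finditer(r"(?=TAA|TAG|TGA)", genome)]
    rights += [(m.start() + 2, "GT") for m in re.finditer(r"(?=GT)", genome)]
    return [
        Candidate(start, end, KINDS[(left, right)])
        for start, left in lefts
        for end, right in rights
        if min_len <= end - start <= max_len
    ]


def resolve_overlaps(scored: List[Tuple[Candidate, float]]) -> List[Candidate]:
    """Greedy reading of the overlap rule: keep the best-scoring candidates first."""
    kept: List[Candidate] = []
    for cand, _ in sorted(scored, key=lambda cs: cs[1], reverse=True):
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept)


def toyscan1(genome: str, t: float, max_len: int = 60) -> List[Candidate]:
    """Drop candidates with GC content below t, then resolve overlaps."""
    scored = []
    for cand in find_candidates(genome, max_len=max_len):
        gc = gc_content(genome[cand.start:cand.end])
        if gc >= t:
            scored.append((cand, gc))
    return resolve_overlaps(scored)
```

For example, `toyscan1("TTTATGGCCGCCTAATTT", 0.5)` finds one candidate, the span from the ATG to the end of the TAA, and keeps it because 7 of its 12 bases are G or C.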


Toyscan 2

  • Use codon bias to find exons

  • Codon frequencies for “true” exons are assumed to be known

  • Stop codons not included so they have probability 0

  • Define codonBias function:

    • For an input ORF (given), score all 3 frames

    • Ignore the fact that some frames have stop codons in them

    • Score = sum of log probabilities of all codons in that frame

    • Probabilities are taken from the “known” probabilities

    • Divide Score by number of codons n. This normalizes it.

    • Output the highest score of the 3 frames
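
A minimal Python sketch of the codonBias function described above. The slide does not say how to handle codons with probability 0 (the stop codons), for which the log is undefined. This sketch assumes they are simply skipped and that n counts only the codons actually scored, which is one possible reading of "ignore the fact that some frames have stop codons". The names `codon_bias` and `codon_prob` are illustrative.

```python
import math
from typing import Dict


def codon_bias(orf: str, codon_prob: Dict[str, float]) -> float:
    """Best per-codon average log probability over the three reading frames.

    Assumption: codons absent from codon_prob, or with probability 0
    (e.g. stop codons), are skipped rather than scored.
    """
    best = -math.inf
    for frame in range(3):
        # Complete codons only; a trailing partial codon is ignored.
        codons = [orf[i:i + 3] for i in range(frame, len(orf) - 2, 3)]
        scored = [c for c in codons if codon_prob.get(c, 0.0) > 0.0]
        if not scored:
            continue  # nothing scorable in this frame
        score = sum(math.log(codon_prob[c]) for c in scored) / len(scored)
        best = max(best, score)
    return best
```

With this convention a sequence that contains no scorable codon in any frame returns negative infinity, so it falls below any finite threshold.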


Toyscan 2 (cont.)

  • Note: the codonBias function achieves its maximum when the observed distribution within an ORF matches the “correct” distribution from real genes

  • Define TOYSCAN_2 as:

    • A codon bias score threshold, t, is input

    • For all ORFs, score them with the codonBias function

    • If the score is < t, delete the ORF

    • For all remaining pairs of ORFs p1, p2, do:

      • If p1, p2 overlap then discard the ORF with the lower codonBias score

    • Output all remaining ORFs as exons
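
The TOYSCAN_2 procedure can be sketched as a filter that takes the scoring function as a parameter. In the sketch below, `score` stands in for the codonBias function, ORFs are assumed to be given as 0-based `(start, end)` pairs with an exclusive end, and the overlap rule is applied greedily from the highest score down. These conventions and all names are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end), 0-based, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if the two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def toyscan2(genome: str, orfs: List[Span],
             score: Callable[[str], float], t: float) -> List[Span]:
    """Delete ORFs scoring below t, then resolve overlaps in favour of higher scores."""
    scored = [(score(genome[s:e]), (s, e)) for s, e in orfs]
    survivors = sorted((item for item in scored if item[0] >= t), reverse=True)
    kept: List[Span] = []
    for _, span in survivors:
        # Keep a span only if it does not overlap a better one already kept.
        if not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return sorted(kept)
```

Passing the score in as a function means the same filter can be reused with a different scoring procedure without changing the overlap logic.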


Toyscan 3

  • Use codon bias and weight matrix models (WMMs)

  • Input includes WMMs for start, stop, donor, and acceptor sites

  • Donor WMM includes 5 positions after GT

  • Acceptor WMM includes 5 positions before AG

  • Start codon WMM includes 5 positions before ATG

  • Stop codon WMM includes 5 positions after TAA/TAG/TGA


Toyscan 3 (cont.)

  • Score a weight matrix (scoreWMM):

    • For a given position i in the sequence S, sum the log probabilities of the bases in the interval [i, j] using the WMM, where j - i + 1 is the width of the WMM

  • Score an ORF (scoreORF):

    • choose the matrices to use on the left and right ends of the ORF

      • E.g., internal exon has acceptor on left, donor on right

    • Score = scoreWMM(left end) + scoreWMM(right end) + codonBias

    • return Score
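
The two procedures above might look like this in Python. The sketch assumes that a WMM is stored as a list of per-position `{base: probability}` dictionaries, that each matrix covers only the flanking positions (before the left signal, after the right signal), that the ORF span itself includes the signals, and that a window running off the sequence or hitting a zero probability scores negative infinity. None of these details are fixed by the slides, and `codon_bias` is passed in as a function so the block stays self-contained.

```python
import math
from typing import Callable, Dict, List

WMM = List[Dict[str, float]]  # one {base: probability} column per position


def score_wmm(seq: str, i: int, wmm: WMM) -> float:
    """Sum of log probabilities of seq[i..j], where j - i + 1 == len(wmm)."""
    j = i + len(wmm) - 1
    if i < 0 or j >= len(seq):
        return -math.inf  # assumption: an incomplete window cannot be scored
    total = 0.0
    for k, column in enumerate(wmm):
        p = column.get(seq[i + k], 0.0)
        if p <= 0.0:
            return -math.inf  # avoid log(0)
        total += math.log(p)
    return total


def score_orf(seq: str, start: int, end: int,
              left_wmm: WMM, right_wmm: WMM,
              codon_bias: Callable[[str], float]) -> float:
    """scoreWMM(left end) + scoreWMM(right end) + codonBias for seq[start:end].

    Assumption: the left matrix scores the positions just before `start`
    and the right matrix scores the positions just after `end`.
    """
    left = score_wmm(seq, start - len(left_wmm), left_wmm)
    right = score_wmm(seq, end, right_wmm)
    return left + right + codon_bias(seq[start:end])
```

The caller picks the matrices by exon type. An internal exon, for instance, would pass the acceptor WMM as `left_wmm` and the donor WMM as `right_wmm`.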


Toyscan 3 (cont.)

  • Now define Toyscan_3 as:

    • assume a scoring threshold, t, is provided

      • You will have to experiment to find a good value for t

    • Get all ORFs

    • Score all ORFs using the scoreORF procedure

    • If the score is < t, delete the ORF

    • For all remaining pairs of ORFs p1, p2, do:

      • If p1, p2 overlap then discard the ORF with the lower scoreORF score

    • Output all remaining ORFs as exons
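
One way to experiment with t is to sweep a few candidate values against a set of known exons and keep the value that agrees best. The sketch below is only an illustration of that idea: the slides do not prescribe an evaluation measure, so the exact-match F1 score, the greedy overlap rule, and all names here are assumptions. Input ORFs are `(score, (start, end))` pairs that have already been scored with scoreORF.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int]  # (start, end), 0-based, end exclusive


def predict(scored: List[Tuple[float, Span]], t: float) -> Set[Span]:
    """Apply threshold t, then keep non-overlapping ORFs from the best score down."""
    kept: List[Span] = []
    for score, span in sorted(scored, reverse=True):
        if score < t:
            break  # sorted descending, so every later ORF is below t as well
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return set(kept)


def f1(predicted: Set[Span], truth: Set[Span]) -> float:
    """Harmonic mean of precision and recall, counting exact span matches only."""
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored: List[Tuple[float, Span]], truth: Set[Span],
                   thresholds: Iterable[float]) -> float:
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(thresholds, key=lambda t: f1(predict(scored, t), truth))
```

A threshold chosen this way is tuned to the sample it was evaluated on, so it is worth checking it on sequence that was not used for the sweep.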


GFF format

# coding GC: 49%

# noncoding GC: 50%

1 toy-genome initial-exon 31 46 . + . transgrp=1;

1 toy-genome final-exon 79 98 . + . transgrp=1;

1 toy-genome single-exon 129 140 . + . transgrp=2;

1 toy-genome single-exon 164 193 . + . transgrp=3;

1 toy-genome single-exon 228 260 . + . transgrp=4;

1 toy-genome single-exon 287 304 . + . transgrp=5;

1 toy-genome single-exon 331 354 . + . transgrp=6;

1 toy-genome single-exon 377 400 . + . transgrp=7;

1 toy-genome single-exon 424 435 . + . transgrp=8;

1 toy-genome initial-exon 475 488 . + . transgrp=9;

1 toy-genome internal-exon 512 526 . + . transgrp=9;

1 toy-genome final-exon 545 593 . + . transgrp=9;
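
Records like the ones above can be read with a few lines of Python. This is a sketch that assumes the nine-column GFF layout (sequence name, source, feature, start, end, score, strand, frame, attributes) and splits on any whitespace, since the tabs of the original file may not have survived in this transcript. The `GffRecord` type and function names are made up for the example.

```python
from typing import Iterable, Iterator, NamedTuple


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int   # 1-based, inclusive (GFF convention)
    end: int     # 1-based, inclusive
    score: str
    strand: str
    frame: str
    attributes: str


def parse_gff(lines: Iterable[str]) -> Iterator[GffRecord]:
    """Yield one record per data line, skipping blank lines and '#' comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 8)  # the attributes column may contain spaces
        if len(fields) < 8:
            raise ValueError(f"expected at least 8 columns: {line!r}")
        attributes = fields[8] if len(fields) > 8 else ""
        yield GffRecord(fields[0], fields[1], fields[2],
                        int(fields[3]), int(fields[4]),
                        fields[5], fields[6], fields[7], attributes)


def exon_length(rec: GffRecord) -> int:
    """Length in bp; both GFF coordinates are inclusive."""
    return rec.end - rec.start + 1


sample = """\
# coding GC: 49%
1 toy-genome initial-exon 31 46 . + . transgrp=1;
1 toy-genome final-exon 79 98 . + . transgrp=1;
"""
records = list(parse_gff(sample.splitlines()))
```

Because both coordinates are inclusive, the first record (31 to 46) describes a 16 bp exon.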

