
Semi-Automated Training of Geneid: Melon Example

This document provides an in-depth explanation of the (semi)-automated training process of Geneid, a protein-coding gene prediction tool, using the example of Melon. It covers the steps involved in training, optimization, and evaluation, along with the advantages and disadvantages of manual training. The document also discusses the potential future improvements and the possibility of making Geneid into a user-friendly software package.



Presentation Transcript


  1. (Semi)-Automated Training of Geneid: Melon Example. Francisco Câmara Ferreira. Group meeting, June 2011

  2. Geneid: • Geneid is a protein-coding gene prediction tool that can be optimized for prediction in different species • Geneid follows a hierarchical structure: signal -> exon -> gene • Exon score: score of exon-defining signals + protein-coding potential • Dynamic programming algorithm: maximize the score of assembled exons -> assembled gene • Training: compute a model for splice sites (PWMs/Markov) and derive a model for coding DNA
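The "maximize score of assembled exons" idea on this slide can be illustrated with a minimal dynamic programming sketch. It is a deliberate simplification and not geneid's actual algorithm: it only picks the highest-scoring set of non-overlapping candidate exons and ignores reading frame, strand, and intron length constraints. The `Exon` tuple layout and the function name `best_exon_chain` are assumptions made for this example.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical representation: (start, end, score), inclusive coordinates.
Exon = Tuple[int, int, float]


def best_exon_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the best total score and the non-overlapping exons achieving it."""
    if not exons:
        return 0.0, []
    ordered = sorted(exons, key=lambda e: e[1])  # sort by end coordinate
    ends = [e[1] for e in ordered]
    best = [0.0] * (len(ordered) + 1)  # best[i]: best score using ordered[:i]
    take = [False] * len(ordered)      # whether exon i is used in best[i + 1]
    prev = [0] * len(ordered)          # number of earlier exons compatible with i
    for i, (start, _end, score) in enumerate(ordered):
        # Count exons among ordered[:i] that end strictly before this start.
        p = bisect_right(ends, start - 1, 0, i)
        prev[i] = p
        with_exon = best[p] + score
        if with_exon > best[i]:
            best[i + 1] = with_exon
            take[i] = True
        else:
            best[i + 1] = best[i]
    # Trace back through the table to recover the chosen exons.
    chain: List[Exon] = []
    i = len(ordered)
    while i > 0:
        if take[i - 1]:
            chain.append(ordered[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    chain.reverse()
    return best[-1], chain
```

For example, given four candidates where two overlap and one has a negative score, the sketch keeps only the compatible exons that increase the total score.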

  3. Parameter file development: • Currently geneid (v.1.4) has 46 parameter files. http://genome.crg.eu/software/geneid/index.html

  4. Training geneid “manually”. Disadvantages: - required running 38 awk/Perl/C programs (often one after the other) - easier to make mistakes at different steps - could take 2-3 days incl. evaluation. Advantages: - more “control” over training and optimization - could potentially generate a better matrix

  5. GeneidTRAINer1_0.pl (a Perl-language integration tool): Twenty-three awk scripts - scripts to derive coding potential (e.g. MarkovMatrices.awk) - scripts to compute PWMs or Markov models of splice sites (e.g. logratio_kmatrix.awk; Getkmatrix.awk). Five C-language programs, including Geneid, SSgff, Evaluation

  6. GeneidTRAINer1_0.pl Command line options: -Species (C. melo) -gff (gff2 format) -fasta (multifasta annots) -sout (stats file name) -branch (meme branch profile: y/n) -reduced (“reduced” training: y/n) {-path (location of programs/scripts)}

  7. Modules used by GeneidTRAINer1_0.pl: use strict; use Getopt::Long; use File::Path; use Data::Dumper; use Geneid::Param; use Geneid::Isocore;

  8. Full/partial training • First-time training for a species (command line): geneidTRAINer1_0.pl -species C.melo -gff c.melo.gff -fastas cmelo.fa -sout stats.txt -branch no -reduced no • Subsequent training for a species (command line): geneidTRAINer1_0.pl -species C.melo -gff c.melo.gff -fastas cmelo.fa -sout stats.txt -branch no -reduced yes • “REDUCED” training excludes the following: - Setting aside sequences for testing (if >500 genes) - Selecting whether to perform 10x cross-validation (if >500 genes) - Extracting introns and CDS - Extracting splice sites and start codons - Extracting (400 nt) flanked gene models and building an “artificial contig” by concatenating gene models (training/test) - Extracting random “background sequences”

  9. User interactivity/program flow (I) [flowchart; “REDUCED” training would exclude all steps shown in this slide]. Start: Branch=no, Reduced=no. Decision points: >500 gene models? Set aside 20% for test set? 10X cross-validation? Every outcome extracts CDS/introns/sites and performs error checking; the outcomes differ in evaluation: (1) 10x cross-validation + test set; (2) test set; (3) 10x cross-validation
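A minimal sketch of the two data-partitioning decisions in this flow: holding out 20% of gene models as a test set when more than 500 are available, and forming 10 cross-validation folds. This is not the GeneidTRAINer code; the function names, the random shuffling, and the fixed seed are assumptions for illustration only.

```python
import random
from typing import Iterable, List, Tuple


def split_train_test(
    gene_ids: Iterable,
    test_fraction: float = 0.2,   # slide: "Set aside 20% for test set?"
    min_genes: int = 500,         # slide: ">500 gene models?"
    seed: int = 0,                # assumption: fixed seed for reproducibility
) -> Tuple[List, List]:
    """Return (training, test); the test list is empty if there are too few genes."""
    ids = list(gene_ids)
    if len(ids) <= min_genes:
        return ids, []
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


def cross_validation_folds(gene_ids: Iterable, k: int = 10) -> List[Tuple[List, List]]:
    """Return k (train, held_out) pairs; every gene is held out exactly once."""
    ids = list(gene_ids)
    folds = [ids[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        train = [g for j, fold in enumerate(folds) if j != i for g in fold]
        pairs.append((train, folds[i]))
    return pairs
```

With 1,000 gene models this yields 800 for training and 200 for testing, and each of the 10 folds trains on 720 and holds out 80.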

  10. User interactivity/program flow (II): Display recommended 1) donor, 2) acceptor, 3) start, 4) (branch) profile -> Modify profile? yes -> compute PWM or Markov on new profile; no -> compute PWM or Markov on suggested profile
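As an illustration of "compute PWM" on a chosen profile, the sketch below builds a log-odds position weight matrix from aligned signal sequences and scores a candidate site with it. It is not a port of logratio_kmatrix.awk: the uniform background, the pseudocount of 1, the natural logarithm, and the toy donor sites in the usage note are all assumptions.

```python
import math
from typing import Dict, List, Optional, Sequence

ALPHABET = "ACGT"


def build_pwm(
    sites: Sequence[str],
    background: Optional[Dict[str, float]] = None,
    pseudocount: float = 1.0,
) -> List[Dict[str, float]]:
    """Build a log-odds PWM: log(P(base at position) / P(base in background))."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all aligned sites must have the same length")
    if background is None:
        background = {b: 0.25 for b in ALPHABET}  # assumption: uniform background
    pwm = []
    for pos in range(length):
        counts = {b: pseudocount for b in ALPHABET}
        for site in sites:
            base = site[pos].upper()
            if base in counts:  # skip ambiguous bases such as N
                counts[base] += 1
        total = sum(counts.values())
        pwm.append({b: math.log((counts[b] / total) / background[b]) for b in ALPHABET})
    return pwm


def score_site(pwm: List[Dict[str, float]], seq: str) -> float:
    """Sum the per-position log-odds for a candidate site of the same length."""
    if len(seq) != len(pwm):
        raise ValueError("candidate length must match the PWM length")
    return sum(col.get(base, 0.0) for col, base in zip(pwm, seq.upper()))
```

For instance, a matrix built from a handful of toy donor sites such as `["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]` scores `GTAAGT` higher than `CCCCCC`.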

  11. User interactivity/program flow (III): > 400,000 coding /100,000 non-coding bases? yes -> derive coding model of order 5; no -> derive coding model of order 4
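The order decision on this slide can be sketched as a small helper, together with the k-mer counting that a Markov model of that order would be estimated from. Reading the "/" as "both thresholds must be exceeded" is an assumption, as are the function names; the real MarkovMatrices.awk may count differently (for example, per codon position).

```python
from collections import Counter
from typing import Iterable


def choose_markov_order(
    coding_bases: int,
    noncoding_bases: int,
    min_coding: int = 400_000,      # threshold from the slide
    min_noncoding: int = 100_000,   # threshold from the slide
) -> int:
    """Pick order 5 only when both training sets are large enough (assumed reading)."""
    if coding_bases > min_coding and noncoding_bases > min_noncoding:
        return 5
    return 4


def kmer_counts(sequences: Iterable[str], order: int) -> Counter:
    """Count all (order + 1)-mers, the raw material for an order-`order` model."""
    k = order + 1
    counts: Counter = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts
```

A higher order needs more data because the number of possible k-mers grows as 4^(order + 1): 1,024 for order 4 and 4,096 for order 5.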

  12. User interactivity/program flow (IV): Display recommended 1) intron range (min/max) / 2) inter-genic distance range -> Modify gene model? yes -> modify gene model with new range; no -> use default intron/intergenic range

  13. User interactivity/program flow (V): Optimize/evaluate on “artificial” contigs? • yes -> Display default eWF/oWF range/step (optimization) -> Modify optimization range? yes -> optimize parameter file using new eWF/oWF range; no -> optimize parameter file using default eWF/oWF range -> EVALUATE on test set-contig, or on training set-contig (if <500 gene models), which may be biased • no -> Display default eWF/oWF range/step (optimization) -> Modify optimization range? yes -> optimize parameter file using new eWF/oWF range; no -> optimize parameter file using default eWF/oWF range -> EVALUATE: 1) test set-single seqs + 10x cross-validation; 2) test set-single seqs; 3) 10x cross-validation (<500 sequences)
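The range/step optimization over eWF and oWF amounts to a grid search: try every pair of values, evaluate the resulting predictions, and keep the best pair. The sketch below shows that loop only. The `evaluate` callback stands in for running geneid and the evaluation program, and the ranges in the usage note are made-up values, not geneid defaults.

```python
from typing import Callable, Iterator, Tuple

Range = Tuple[float, float, float]  # (start, stop, step), stop inclusive


def frange(start: float, stop: float, step: float) -> Iterator[float]:
    """Yield start, start + step, ... up to and including stop."""
    n_steps = int(round((stop - start) / step))
    for i in range(n_steps + 1):
        yield round(start + i * step, 10)  # rounding limits float drift


def optimize_weights(
    evaluate: Callable[[float, float], float],  # hypothetical accuracy callback
    ewf_range: Range,
    owf_range: Range,
) -> Tuple[float, float, float]:
    """Return (eWF, oWF, accuracy) for the best-scoring grid point."""
    best = None
    for ewf in frange(*ewf_range):
        for owf in frange(*owf_range):
            accuracy = evaluate(ewf, owf)
            if best is None or accuracy > best[2]:
                best = (ewf, owf, accuracy)
    return best
```

A call such as `optimize_weights(run_and_score, (-4.5, -1.5, 0.5), (0.2, 0.45, 0.05))` would evaluate 7 x 6 = 42 parameter combinations, where `run_and_score` is whatever function wraps prediction and evaluation.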

  14. User interactivity/program flow (VI): Plot annotations + geneid predictions using gff2ps? yes -> plot predictions + END PROGRAM; no -> END PROGRAM

  15. Statistics File (melon training)

  16. Statistics File (melon training)

  17. Things (still) to do: • Convert awk scripts/programs to Perl (Python?) to get a “cleaner”, easier-to-use software tool • Write better usage instructions • Make it into a “package” (including the source code of geneid and other essential programs?) that can be easily installed by users interested in training geneid without needing much knowledge of the training process itself • Perhaps try to publish in a technical journal?
