
Gene Prediction in Zea Mays


luka




Presentation Transcript


  1. Gene Prediction in Zea Mays 07/20/06

  2. Project Summary • Building a Training Set • Curation and filtering • Training, fine tuning of Twinscan • Parameter Estimation • Model Redesign and Performance Analysis • Maize Twinscan 1.0 • Prediction of novel genes in Maize BACs • Selection, contig masking • Identification of strong novel candidates • Prospectus

  3. Learning • Twinscan must be trained on new species, much like a human learns a new language: Parameter Estimation • Building a Training Set • Curation • Cleaning • Processing • Model Development • Training Set Analysis • Revision • iParameterEstimation

  4. GenBank • 508 mRNA records from GenBank • EST-Genome • 126 were bad alignments • Bad splice sites • In-frame stop codons • 382 good alignments exist • Protein BLAST yielded 273 clusters • No known retrotransposons found
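The two rejection criteria on slide 4 (bad splice sites, in-frame stop codons) can be illustrated with a minimal sketch. The slides do not say how the checks were implemented; the function names, the GT...AG definition of a canonical splice site, and the 0-based coordinate convention below are assumptions made for illustration only.

```python
# Hypothetical helpers; not taken from the presentation or from Twinscan.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_inframe_stop(cds: str) -> bool:
    """Return True if a stop codon occurs before the final codon of the CDS."""
    cds = cds.upper()
    # Walk codon by codon, skipping the last codon (the legitimate stop).
    for i in range(0, len(cds) - 3, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return True
    return False


def has_canonical_splice_sites(genome: str, introns: list[tuple[int, int]]) -> bool:
    """Return True if every intron starts with GT and ends with AG.

    Introns are given as (start, end) with a 0-based start and an
    exclusive end, an assumed convention for this sketch.
    """
    genome = genome.upper()
    return all(
        genome[start:start + 2] == "GT" and genome[end - 2:end] == "AG"
        for start, end in introns
    )
```

An alignment would be kept only if `has_inframe_stop` is false for its spliced coding sequence and `has_canonical_splice_sites` is true for its introns.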

  5. Public/Private Monsanto cDNA Sequences • EST-Genome against Zea Mays 4.0 release • Cleaned for non-canonical splice sites • Clustered redundant genes from GenBank • In all 1257 training sequences • 809 Public Monsanto mRNAs • 212 Proprietary Monsanto mRNAs • 273 GenBank mRNAs

  6. Applying iParameterEstimation • Several retraining iterations steadily improved the performance of Maize Twinscan: • Donor site revamp • Addition of more training data • Addition of geometric tail on intron length distribution
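The "geometric tail on intron length distribution" from slide 6 can be sketched as follows. This is one common way to combine an empirical length distribution with a geometric tail, not a description of Twinscan's actual implementation; the `cutoff` parameter and the function name are hypothetical.

```python
from collections import Counter


def intron_length_model(lengths, cutoff):
    """Build P(length) from observed intron lengths.

    Lengths below `cutoff` use their empirical frequencies. Lengths at or
    above `cutoff` share the remaining probability mass through a
    geometric distribution, so long introns that were never observed in
    training still receive nonzero probability.
    """
    n = len(lengths)
    counts = Counter(l for l in lengths if l < cutoff)
    tail = [l for l in lengths if l >= cutoff]
    tail_mass = len(tail) / n
    # Maximum-likelihood estimate for a geometric distribution on
    # {0, 1, 2, ...}: p = 1 / (1 + mean excess over the cutoff).
    mean_excess = sum(l - cutoff for l in tail) / len(tail) if tail else 0.0
    p = 1.0 / (1.0 + mean_excess)

    def prob(length):
        if length < cutoff:
            return counts.get(length, 0) / n
        return tail_mass * p * (1.0 - p) ** (length - cutoff)

    return prob
```

The appeal of the tail is that a purely empirical histogram assigns zero probability to any intron longer than the longest training example, which would make such introns impossible to predict.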

  7. Benchmarking Performance • 4-fold cross-validation • Twinscan Maize 1.0: Current Best
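The 4-fold cross-validation named on slide 7 partitions the training genes into four groups, trains on three, and tests on the held-out fourth, rotating through all four. A minimal sketch of the split is below; the function name, shuffling, and seed are assumptions, since the slides do not describe how the folds were drawn.

```python
import random


def k_fold_splits(items, k=4, seed=0):
    """Yield (train, test) lists for each of k folds."""
    shuffled = list(items)
    # Shuffle with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(shuffled)
    # Deal items round-robin into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Each gene appears in exactly one test fold, so every training sequence contributes to the reported accuracy without ever being scored by a model that saw it during training.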

  8. Novel Predictions • Release 4 of AZM • Collaborative effort between Wash U, CSH, Iowa State and Arizona • Contains 65,325 contigs from 1,573 BACs • Isolated and repeat masked • Danforth Center repeat library and RepBase • Run with and without masking

  9. Effects of Repeat Masking

  10. Contig Length

  11. Prediction Positioning

  12. Filtering Novel Predictions

  13. Novel Candidates • Filter the remaining contig-internal predictions • BLAST against all of MaizeGDB • Search for retrotransposons from rice and Arabidopsis in predictions • 30 rice homologs found in the remainder • 330 remaining putative novel genes
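Slide 13 refers to "contig-internal" predictions without defining the term. One plausible reading, assumed here, is a prediction that lies wholly inside its contig and does not run up against either contig end, where a gene model is likely to be truncated. The function name, the `margin` parameter, and the 1-based inclusive coordinates are all hypothetical.

```python
def is_contig_internal(pred_start, pred_end, contig_length, margin=0):
    """Return True if a prediction sits at least `margin` bases from both contig ends.

    Coordinates are assumed to be 1-based and inclusive.
    """
    return pred_start > margin and pred_end <= contig_length - margin
```

For example, with a 1,000 bp contig and a 100 bp margin, a prediction spanning 101 to 900 passes, while one starting at base 50 does not.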

  14. Training Set Characteristics

  15. Prospectus for the Next 6 Months • PCR 150 of the novel candidates • Continue improving Twinscan • Train on RT-PCR products • Use rice-trained Twinscan • Make predictions on new sequence as it arrives • Only 1,573 of an estimated 18k BACs are sequenced • Total estimated gene count is 45k-50k • Explore other cereals • Sorghum - JGI 5x shotgun, done later this year • Soy - JGI, possibly finished in 2007
