
WormBase - and the not so stable genome.


Presentation Transcript


  1. WormBase - and the not so stable genome. Paul Davis, WormBase Informatics Meeting

  2. Overview
     • Genome Overview
       • Project.
       • Layout.
       • Data as of 1998.
     • Curated Gene Set and Genome
       • Genome stats.
       • Gene stats.
       • Gene curation.
     • User Community and catering for their needs.

  3. The Genome Sequencing Project
     • Clone-based sequencing venture between the Genome Sequencing Center (St. Louis) and Sanger.
     • C. elegans was the first multicellular organism to have its genome published.
     • 97 Mb made up of:
       • 2527 cosmids,
       • 257 YACs,
       • 113 fosmids,
       • 44 PCR products.
     • 5 major clone gaps.
     • Annotated to find 19,099 protein-coding genes.

  4. Clone Based Strategy
     • Random clone library produced (30-40 kb).
     • Tiling path selected and sequenced.
     • YAC library produced (fragments used to fill gaps).
     • Remaining gaps PCR’d and sequenced.

  5. C. elegans genome sequence
     • Clones are assembled into superlinks based on their overlap tags.
     • The 6 chromosomes are split into 17 superlinks, which are maintained by:
       • G.S.C. St. Louis
       • Sanger.
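
As a rough illustration of joining overlapping clones into longer contiguous stretches, the sketch below merges clone sequences supplied in tiling-path order. It is an assumption-laden toy: it matches raw sequence suffixes against prefixes rather than using the overlap tags mentioned on the slide, and the function names, the `min_overlap` parameter, and the sequences are invented for the example.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def join_clones(clones, min_overlap=3):
    """Join clones given in tiling-path order.

    A clone that does not overlap the growing contig starts a new contig,
    which is how a clone gap would show up in this toy model.
    """
    if not clones:
        return []
    contigs = [clones[0]]
    for clone in clones[1:]:
        size = overlap_length(contigs[-1], clone, min_overlap)
        if size:
            contigs[-1] += clone[size:]   # append only the non-overlapping tail
        else:
            contigs.append(clone)         # gap: begin a new contig
    return contigs


# Invented sequences: the first three overlap by 4 bases, the last does not.
clones = ["ACGTTGCA", "TGCAGGTC", "GGTCAATT", "CCCGGGAA"]
print(join_clones(clones))
```

With these inputs the first three clones collapse into one contig and the fourth stays separate, standing in for a gap that would later need a YAC fragment or PCR product to close.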

  6. Genomic Change Since The 1998 Science Paper.
     [Figure annotations: “Contiguous genome!”; “Prior to scheduled release cycle”.]

  7. Since 1998 Science Publication
     • Continue to re-annotate gene set based on a number of different sources:
       • Transcript data.
       • Comparison to protein databases.
       • Comparison to the C. briggsae data sets.
       • Literature.
     • Last gap closed in October 2002.
     • 2.42% increase in genome size.
     • Identified from a number of sources.
     • Relatively stable genome.
       • Small number of sequencing errors.
       • Small number of repeat errors.
       • List of errors to be validated.

  8. Genomic Change Since Final Gap Closure Oct 2002
     [Figure labels: Sequence updates; Repeat assembly issues; 3rd party submission. Values shown: 5909, 9416, 4419, 8133.]

  9. How errors are identified.
     • Gene predictions may have an incorrect structure compared to available experimental data:
       • mRNA,
       • EST.
     • Identification is biased towards coding regions.
     • A curator may identify a prediction that avoids problems by:
       • Use of incorrect splice donors/acceptors on intron-exon boundaries.
       • Premature truncation of the prediction.
       • Splicing out of internal stop codons.
       • An extra intron to allow for a frame shift.
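
Two of the warning signs listed on this slide can be checked mechanically. The sketch below is illustrative only and is not WormBase code: the function names and sequences are invented, it assumes the standard stop codons TAA/TAG/TGA, and it treats only GT...AG introns as canonical, although rarer valid splice-site pairs do exist.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stop_positions(cds):
    """Return 0-based codon indices of stop codons that occur before the final codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


def has_canonical_splice_sites(intron):
    """True if the intron starts with GT (donor) and ends with AG (acceptor)."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


clean = "ATGGCTGAATTCTAA"    # ATG GCT GAA TTC TAA
broken = "ATGGCTTGATTCTAA"   # ATG GCT TGA TTC TAA (stop at codon index 2)

print(internal_stop_positions(clean))
print(internal_stop_positions(broken))
print(has_canonical_splice_sites("GTAAGTTTTCAG"))
print(has_canonical_splice_sites("GCAAGTTTTCAG"))
```

A prediction that only avoids an internal stop by introducing an intron with non-canonical boundaries is the kind of case a curator would want to look at by hand.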

  10. How errors are identified, cont.
      • WormBase users:
        • Identification of a single-copy prediction annotated as a pseudogene that the user believes, from their research/observations, not to be a pseudogene.
        • Or vice versa.
        • Identification of a prediction that does not follow the “family” structure, missing out a motif/domain to avoid a problem region.
      • Pseudogenes may be real or reflect a sequencing error.
      • Each case is investigated:
        • Clone in archive,
        • PCR,
        • Comparison to multiple transcript reads.

  11. Example of a sequencing error.
      [Figure: two panels, each showing mRNA and EST tracks against reading frames 1 to 3.]
      • Single bp insertion into the genome causing a shift from frame 2 to frame 3.
      • Investigated and corrected: base removed, allowing the original predictions to be corrected.
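
The effect of a single spurious base can be reproduced on a toy sequence. The sketch below uses an invented 18 bp sequence and hypothetical helper names; it shows that codons downstream of the insertion change in the original frame, that the true codons reappear one frame over, and that removing the base restores the original reading.

```python
def codons(seq, frame=0):
    """Split `seq` into complete codons, starting at 0-based offset `frame`."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def insert_base(seq, pos, base):
    """Return `seq` with `base` inserted before 0-based index `pos`."""
    return seq[:pos] + base + seq[pos:]


def delete_base(seq, pos):
    """Return `seq` with the base at 0-based index `pos` removed."""
    return seq[:pos] + seq[pos + 1:]


true_seq = "ATGGCTGAATTCGGATAA"             # invented example: ATG GCT GAA TTC GGA TAA
erroneous = insert_base(true_seq, 7, "C")   # spurious extra base, as in a sequencing error

print(codons(true_seq))             # intended reading
print(codons(erroneous))            # same frame: codons after the insertion are scrambled
print(codons(erroneous, frame=1))   # next frame: the downstream codons reappear
print(delete_base(erroneous, 7) == true_seq)
```

Transcript evidence (mRNA and ESTs) that aligns cleanly across such a position is what lets a curator argue that the genome sequence, rather than the gene model, is wrong.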

  12. The Present Situation.
      • A contiguous genome sequence.
        • Closing the genome into a contiguous sequence has changed the way it can be analysed and has yielded numerous genes that would probably still be unknown.
      • WormBase
        • WormBase has been running since 2000 and has grown to accommodate new data types, curate existing data, and help the worm community access and mine this data.
      • The needs of the community.
        • Always evolving.

  13. Number of Gene Predictions in genome.
      • Gene predictions: 15.7% increase from 1998.
        • 20,066 CDS (22,858 including splice forms).
      • Isoforms.
        • EST/mRNA data,
        • Paper evidence.
      • New gene predictions.
        • Gene family homology studies,
        • EST/mRNA data.
      • Gene predictions also removed & merged.

  14. Coding Gene Predictions Over Time.
      [Figure series: Predictions Including Isoforms; Coding Genes. Annotations: “Collaboration to find new genes based on multiple strategies.”; “Increase in CDS due to new strategies of gene identification”; “Pruning bad gene predictions.”]

  15. Genome View
      [Figure labels: Predicted; Partially Confirmed; Confirmed.]
      • Colour corresponds to strand, not confidence.

  16. Analysis of the gene set.
      [Figure annotations:]
      • Transcript Builder introduced
      • OST (Orfeome Sequence Tags)
      • 70,000 new ESTs submitted to NDB
      • New strategies for gene annotation.

  17. How Gene Curation is Driven.
      • We create a number of curation lists:
        • Confirmed introns not in gene models.
        • ESTs/mRNAs in introns.
        • Overlapping gene predictions.
        • Predictions overlapping known repeats.
        • Short genes <150 bp.
        • Short introns <40 bp.
      • Mainly in maintenance mode.
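
Three of these curation lists can be sketched from exon coordinates alone. The example below is hypothetical throughout: the `Gene` class, the gene names, and the coordinates are invented, "short gene" is assumed to mean total exon length, and overlaps are only reported for predictions on the same chromosome and strand. The 150 bp and 40 bp thresholds are the ones given on the slide.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    strand: str
    exons: tuple  # sorted (start, end) pairs, 1-based inclusive

    @property
    def span(self):
        return self.exons[0][0], self.exons[-1][1]

    @property
    def coding_length(self):
        return sum(end - start + 1 for start, end in self.exons)

    @property
    def intron_lengths(self):
        return [nxt[0] - cur[1] - 1 for cur, nxt in zip(self.exons, self.exons[1:])]


def curation_lists(genes, min_gene=150, min_intron=40):
    """Build simple lists of predictions that a curator might want to review."""
    return {
        "short_genes": [g.name for g in genes if g.coding_length < min_gene],
        "short_introns": [
            g.name for g in genes if any(n < min_intron for n in g.intron_lengths)
        ],
        "overlapping": [
            (a.name, b.name)
            for a, b in combinations(genes, 2)
            if a.chrom == b.chrom and a.strand == b.strand
            and a.span[0] <= b.span[1] and b.span[0] <= a.span[1]
        ],
    }


genes = [
    Gene("geneA", "I", "+", ((100, 180), (260, 400))),   # 222 bp of exon, 79 bp intron
    Gene("geneB", "I", "+", ((350, 420), (450, 500))),   # 122 bp of exon, 29 bp intron
    Gene("geneC", "II", "-", ((10, 400),)),              # single exon, 391 bp
]
print(curation_lists(genes))
```

Lists like these only point at suspicious models; each entry still needs a curator's judgment, since short genes and short introns can be genuine.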

  18. New Direction for Gene Curation.
      • Looking at gene predictor overlaps vs the WormBase gene set.
      • Protein family analysis.
      • Multiple species comparisons.
      • Other transcript data:
        • TEC-RED
        • SAGE

  19. Gene Predictor Overlaps.
      • Within WormBase we supply 2 extra gene sets generated by:
        • Genefinder
        • Twinscan
      • A former curator analysed where the two predictors overlap but we don’t have a curated gene.
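
The overlap analysis described here can be sketched as an interval comparison. This is a minimal illustration, not the curator's actual method: intervals are invented (start, end) pairs on a single chromosome, strand is ignored, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (start, end) inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def candidate_regions(genefinder, twinscan, curated):
    """Regions where both predictors agree on a gene but no curated gene exists."""
    candidates = []
    for gf in genefinder:
        for ts in twinscan:
            if not overlaps(gf, ts):
                continue
            shared = (max(gf[0], ts[0]), min(gf[1], ts[1]))
            if not any(overlaps(shared, gene) for gene in curated):
                candidates.append(shared)
    return candidates


# Invented coordinates for illustration only.
genefinder = [(100, 500), (2000, 2600), (5000, 5400)]
twinscan = [(150, 450), (2100, 2500), (7000, 7300)]
curated = [(90, 520)]

print(candidate_regions(genefinder, twinscan, curated))
```

Here only the second pair of predictions survives: the first pair is already covered by a curated gene, and the third Genefinder prediction has no Twinscan partner.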

  20. Predictor Overlaps.
      [Figure labels: Strong splicing; Good briggsae DNA::DNA alignment; Genefinder prediction; New CDS prediction; Twinscan prediction.]

  21. C. briggsae Comparison
      • C. elegans vs C. briggsae.
      • C. briggsae hybrid gene set analysis (Avril Coghlan).
        • Detailed in PLoS Biol 2003 1:166-192, “The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics.”
      • WormBase has worked to incorporate the ~1300 new genes reported.
      • A number of nematodes are being sequenced, and this data will be a main focus of our curation efforts in the future.

  22. Coding Gene Predictions Over Time.
      [Figure series: Predictions Including Isoforms; Coding Genes.]

  23. Our User Community.
      • Within the worm community there are different needs from a sequenced genome:
        • Bioinformatics groups wanting stability to perform global analysis.
        • Researchers wanting the latest, accurate sets of gene predictions.

  24. How Are We Catering for Different Needs?
      • WormBase 3-week release cycle.
        • Quick turnaround for corrections and data.
        • Good for research groups interested in subsets of genes.
        • Bad for global analysis groups, as sequence changes throw out coordinates.
      • Introduction of WormBase “Frozen” release versions.
        • These take place every 10 releases (~6 months).
        • 1st “Frozen” release was May 2003 (WS100).
        • Separate websites (http://ws**0.wormbase.org/).
        • Remain available on the ftp site.
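
Given that frozen releases start at WS100 and recur every 10 releases, a script that wants a stable reference point can work out which release to pin to. The sketch below assumes exactly that numbering rule and nothing more; the function names are invented for the example.

```python
FIRST_FROZEN = 100     # WS100, the first frozen release named on the slide
FROZEN_INTERVAL = 10   # a frozen release every 10 releases


def is_frozen(release):
    """True if the numbered release falls on the frozen-release schedule."""
    return release >= FIRST_FROZEN and (release - FIRST_FROZEN) % FROZEN_INTERVAL == 0


def latest_frozen(release):
    """Most recent frozen release at or before `release`, or None if there is none."""
    if release < FIRST_FROZEN:
        return None
    return release - (release - FIRST_FROZEN) % FROZEN_INTERVAL


def release_name(release):
    """Format a release number in the WS<number> style used on the slide."""
    return f"WS{release}"


for number in (99, 100, 137, 140):
    frozen = latest_frozen(number)
    label = release_name(frozen) if frozen is not None else "none yet"
    print(release_name(number), "frozen" if is_frozen(number) else "rolling", "->", label)
```

A global analysis would record the pinned release name alongside its results so that coordinates can be interpreted later.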

  25. Genomic Change Since Final Gap Closure Oct 2002
      [Figure values shown: 5909, 9416, 4419, 8133.]

  26. Frozen Release Effects
      • User benefits.
        • Allows bioinformatics groups to coordinate analyses.
        • Can reference a specific release.
        • Continued availability of release.
        • Stability/insulated from sequence changes.
      • Effects on WormBase.
        • Requires more resources.
        • Curation and sequence updates can be processed as they are identified.
      • Other database resources that use WormBase encouraged to use frozen releases.
        • NCBI.
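
To see why insulation from sequence changes matters, consider what a single-base correction does to every coordinate downstream of it. The sketch below is a simplified model with invented positions: each edit is a hypothetical (position, delta) pair on the old sequence, where a positive delta inserts bases before that position and a negative delta deletes bases starting at it.

```python
def remap(pos, edits):
    """Map a 1-based position on the old sequence to the corrected sequence.

    Returns None if the base at `pos` was itself deleted.
    """
    shift = 0
    for edit_pos, delta in edits:
        if delta < 0 and edit_pos <= pos < edit_pos - delta:
            return None               # this base no longer exists
        if edit_pos <= pos:
            shift += delta            # everything at or after the edit moves
    return pos + shift


# Invented edits: one base inserted before 1,000; two bases deleted at 5,000-5,001.
edits = [(1_000, +1), (5_000, -2)]

for old in (500, 1_000, 4_999, 5_000, 5_002):
    print(old, "->", remap(old, edits))
```

Positions upstream of the first edit are untouched, but everything after it moves, which is why an analysis tied to a frozen release keeps working while one tied to the rolling release needs its coordinates remapped.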

  27. Acknowledgements
      • Genome Sequencing Center St. Louis
        • Sequencing and finishing teams etc.
        • WormBase team: Tamberlyn Bieri, Darin Blasiar, Phil Ozersky, John Spieth
      • Wellcome Trust Sanger Institute
        • Sequencing and finishing teams etc.
        • WormBase team: Richard Durbin, Anthony Rogers, Michael Han, Mary Ann Tuli, Gary Williams
      • AceDB: Ed Griffiths, Roy Storey

  28. The End!
