Making best use of TAIR tools and datasets

Making best use of TAIR tools and datasets. Philippe Lamesch, Donghui Li. The Arabidopsis Information Resource, www.arabidopsis.org. Contact us: curator@arabidopsis.org. TAIR (The Arabidopsis Information Resource) collects, curates and distributes information on Arabidopsis.

Presentation Transcript


  1. Making best use of TAIR tools and datasets • Philippe Lamesch, Donghui Li • The Arabidopsis Information Resource, www.arabidopsis.org • Contact us: curator@arabidopsis.org

  2. TAIR: The Arabidopsis Information Resource • collect, curate and distribute information on Arabidopsis • information freely available from arabidopsis.org

  3. Outline • Gene structure – Philippe Lamesch • Gene function – Donghui Li • Metabolic pathway – Donghui Li • New tools – Philippe Lamesch

  4. Slides available from TAIR www.arabidopsis.org

  5. TAIR is used worldwide Visits per month (source: Google Analytics)

  6. TAIR usage in Asia: June 2009-June 2010

  7. What we do: (1) Arabidopsis genome annotation

  8. What we do: (2) manual literature curation • Controlled vocabulary annotations: Gene Ontology (GO), http://www.geneontology.org/; Plant Ontology (PO), http://www.plantontology.org/ • Gene name, symbol • Allele, phenotype • Summary statement composition

  9. What we do: (3) metabolic pathway curation • AraCyc: a metabolic pathway database for Arabidopsis thaliana that contains information about both predicted and experimentally determined pathways, reactions, compounds, genes and enzymes • PlantCyc and PMN (Plant Metabolic Network)

  10. What we do: (4) work with ABRC to distribute research material

  11. Part I: The Arabidopsis genome annotation • A new approach for improving the Arabidopsis genome annotation • Where to find gene structure related data at TAIR • The Arabidopsis gene structure confidence ranking

  12. Arabidopsis genome annotation • Arabidopsis genome sequenced almost 10 years ago • High quality sequence with few gaps • TIGR did initial genome annotation • TAIR took over responsibility in 2005 • Current TAIR9 stats: 27,379 protein-coding genes; 4,827 pseudogenes or transposable elements; 1,312 ncRNAs

  13. Genome annotation at TAIR • Add novel genes • Update exon/intron structures of existing genes • Delete mispredicted genes • Merge and split genes • Change gene types • Add splice-variants

  14. Genome annotation at TAIR • Add novel genes • Update exon/intron structures of existing genes • Delete mispredicted genes • Merge and split genes • Change gene types • Add splice-variants • Annotate ‘atypical’ gene classes: short protein-coding genes, transposable element genes, pseudogenes, uORFs (genes within the UTR of other genes)

  15. Arabidopsis gene structure annotation: a new approach • TAIR6-TAIR9: use ESTs and cDNAs and an assembly tool called PASA to improve gene structures • TAIR10: use new experimental data and new prediction tools to further improve gene structure predictions

  16. Genome annotation TAIR6-TAIR9 Using PASA and ESTs/cDNAs NCBI Clustered transcripts

  17. Genome annotation TAIR6-TAIR9 Using PASA and ESTs/cDNAs NCBI Clustered transcripts Resulting gene model

  18. Genome annotation TAIR6-TAIR9 Using PASA and ESTs/cDNAs NCBI Clustered transcripts Resulting gene model comparison Previous gene model Novel genes New Splice-variants Gene structure updates

  19. Manual annotation at TAIR: Apollo • Short MS peptide • Radish sequence alignments • Eugene prediction • Dicot sequence alignments • Monocot sequence alignments • Aceview gene predictions • cDNAs • ESTs • 2 gene isoforms

  20. TAIR10: using proteomics and RNA-seq data to improve genome annotation • 4-step process: (1) Mapping RNA-seq and peptides (2) Assembly/gene build (3) Manual review (4) Integration (genome release/GBrowse)

  21. Mapping and Assembly • Mapping: RNA-seq sequences (TopHat (C. Trapnell), SuperSplat (T.C. Mockler)); peptides (6-frame translation, spliced exon graph) • Assembly approaches: • Augustus (M. Stanke): uses spliced RNA-seq reads and peptides; aim: identify additional splice-variants, update existing genes • TAU (T.C. Mockler): uses spliced RNA-seq reads; aim: identify additional splice-variants • Cufflinks (C. Trapnell): uses spliced and unspliced RNA-seq data; aim: identify novel genes
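The slide above mentions mapping peptides by 6-frame translation. As a rough illustration of the idea only, here is a minimal Python sketch that translates a DNA string in all six reading frames using the standard genetic code. The function names (`six_frame_translate`, `translate`, `reverse_complement`) are hypothetical and are not taken from the TAIR pipeline, which the slides do not describe in code.

```python
from itertools import product

# Standard genetic code, built in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translate(seq):
    """Hypothetical helper: map frame label (+1..+3, -1..-3) to protein."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translate("ATGGCCTAA").items():
        print(frame, protein)
```

A peptide identified by mass spectrometry could then be searched for as a substring of each frame's translation; peptides that span an intron need the spliced exon graph approach named on the slide, which this sketch does not cover.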

  22. Augustus • RNA-seq datasets (Mockler Lab, Ecker Lab) • TopHat, SuperSplat: 200 million aligned RNA-seq reads • 203,000 clustered spliced RNA-seq junctions • 145,000 RNA-seq junctions based on >1 read

  23. Augustus • 145,000 RNA-seq junctions based on >1 read + 260,000 peptides (Baerenfaller et al., Castellana et al.) + ESTs & cDNAs + AGI models • Augustus gene prediction • 11% of RNA-seq junctions incorporated into Augustus models • 64% of peptide sequences incorporated into Augustus models • Predicted Augustus models: 5461 distinct models, 1596 novel models

  24. Categorisation/Review • Incorrect junction in TAIR model • Unsupported exon • TAIR confidence rank • TAIR model • Augustus model (correction) • TAU models (splice variants, NMD targets) • RNA-seq junctions (colour reflects matching model) • Peptides
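The step from clustered junctions to "junctions based on >1 read" amounts to counting supporting reads per junction and keeping those seen at least twice. A minimal sketch, assuming each spliced read has already been reduced to a junction tuple (the tuple layout and the function name `supported_junctions` are illustrative assumptions, not the actual pipeline):

```python
from collections import Counter


def supported_junctions(read_junctions, min_reads=2):
    """Keep junctions supported by at least `min_reads` spliced reads.

    `read_junctions` is an iterable with one entry per spliced read,
    each entry being a hashable junction key. The layout
    (chromosome, donor, acceptor, strand) is an assumption made here.
    """
    counts = Counter(read_junctions)
    return {junction: n for junction, n in counts.items() if n >= min_reads}


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    reads = [
        ("Chr1", 1000, 1200, "+"),
        ("Chr1", 1000, 1200, "+"),
        ("Chr1", 5000, 5300, "-"),
    ]
    print(supported_junctions(reads))
```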

  25. Example Augustus update

  26. Example 2 Augustus update

  27. Example Augustus splice variant

  28. Example 2 Augustus splice variant

  29. Augustus/TAU/Cufflinks • Augustus: incorporates 64% of peptides not contained in TAIR, 11% of RNA-seq junctions; 5461 potential updated genes; 1596 potential novel genes • TAU: 30,083 junctions distinct to Augustus or TAIR models; 10,902 junctions incorporated into 10,491 TAU models • Cufflinks: 367 novel assemblies that pass the 100 bp and >15 FPKM filter • TE filter applied to Augustus and Cufflinks models

  30. Preliminary Results • Augustus/TAU/Cufflinks predicted models are classified into categories: • Novel genes • Updated genes • Splice-variants • B-list • Rejects

  31. Preliminary Results • Augustus/TAU/Cufflinks predicted models are classified into categories: • Novel genes: 21 • Updated genes: 812 • Splice-variants: 2134 • B-list: 1586 • Rejects: 2318
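The Cufflinks filter named on this slide (100 bp and >15 FPKM) can be pictured as a simple threshold test on each assembly. The sketch below is illustrative only: the `Assembly` record, its field names, and the choice of strict inequalities for both thresholds are assumptions, since the slide does not say whether the length cut-off is inclusive.

```python
from dataclasses import dataclass


@dataclass
class Assembly:
    """Hypothetical record for one assembled transcript."""
    name: str
    length_bp: int
    fpkm: float


def passes_filter(assembly, min_length_bp=100, min_fpkm=15.0):
    """Assumed rule: longer than 100 bp AND expression above 15 FPKM."""
    return assembly.length_bp > min_length_bp and assembly.fpkm > min_fpkm


if __name__ == "__main__":
    # Made-up values, for illustration only.
    candidates = [
        Assembly("asm_1", 850, 42.0),
        Assembly("asm_2", 90, 120.0),   # too short
        Assembly("asm_3", 1500, 3.2),   # expression too low
    ]
    print([a.name for a in candidates if passes_filter(a)])
```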

  32. Where can you find gene structure data on TAIR? • On the gene model page: graphic of exon-intron structure; coordinates of each exon • On GBrowse: graphic display of structure and overlapping evidence data • On the FTP site: GFF files with exact structures of each gene model; files with gene confidence ranking information
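For readers who want to work with the GFF files directly, the following Python sketch collects exon coordinates per gene model from GFF3-style lines (nine tab-separated columns, with the parent model given in the `Parent=` attribute). The function name `exons_by_model` and the example records are illustrative; check the actual column contents and attribute keys against the file you download.

```python
from collections import defaultdict


def exons_by_model(gff_lines):
    """Group exon features by their Parent attribute.

    Returns {model_id: [(seqid, start, end, strand), ...]} with exons
    sorted by start coordinate. Assumes GFF3 layout: seqid, source,
    type, start, end, score, strand, phase, attributes.
    """
    exons = defaultdict(list)
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # An exon may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[parent].append(
                    (cols[0], int(cols[3]), int(cols[4]), cols[6])
                )
    for model in exons:
        exons[model].sort(key=lambda exon: exon[1])
    return dict(exons)


if __name__ == "__main__":
    # Illustrative records only; identifiers and coordinates are examples.
    example = [
        "##gff-version 3",
        "Chr1\tTAIR9\tmRNA\t100\t900\t.\t+\t.\tID=MODEL.1;Parent=GENE",
        "Chr1\tTAIR9\texon\t500\t900\t.\t+\t.\tParent=MODEL.1",
        "Chr1\tTAIR9\texon\t100\t300\t.\t+\t.\tParent=MODEL.1",
    ]
    print(exons_by_model(example))
```

In practice the lines would come from an open file handle, for example `with open(path) as handle: exons_by_model(handle)`, where `path` points to a downloaded GFF file.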

  33. Gene Locus Page

  34. Gene Model Page

  36. GBrowse

  37. GBrowse • Header • Main Browser Window • Track Menu

  39. FTP site

  40. FTP site

  41. FTP site

  43. Gene Confidence Rank • Assigns confidence scores to all exons and gene models based on different types of experimental and computational evidence

  44. Assigning A Confidence Rank E1 E4

  45. Full support No support

  46. New Tools at TAIR • N-Browse • GBrowse • Synteny viewer

  47. New Tools at TAIR • N-Browse (in collaboration with the Kris Gunsalus Lab, NYU) • GBrowse • Synteny viewer

  48. N-Browse

  49. N-Browse: Finding information about edges (interactions)

  50. N-Browse: How to select and move nodes
