Motivation for Reference Genome Effort

Motivation for Reference Genome Effort. Fully and reliably annotated genomes empower scientific research and are essential for use in automatic inference.

ronh

Presentation Transcript


  1. Motivation for Reference Genome Effort
     Fully and reliably annotated genomes:
     • empower scientific research
     • are essential for use in automatic inference.
     We comprehensively capture the experimental data from the most active research communities producing high-confidence functional descriptions to leverage the power of the comparative method for inference.

  2. Deliverables of Reference Genome Effort
     • Proteome sets
     • Annotation best practices documentation
     • Annotation software tool
     • Reference annotations for inference of function in other species

  3. Evolutionary relationships are the “glue” in RefGenome
     • Goal
       • identify genes in reference genomes that may have the same or similar functions, so that comprehensive curation can be done simultaneously
     • Why?
       • Different model organisms have different strengths for investigating gene function, and these can often inform each other
       • Most genes did not first evolve within a given extant species: they were INHERITED from a common ancestor shared with other species. Genes in different organisms have similar functions because they were inherited, and haven’t changed much since the common ancestor.

  4. Current process
     • Selection of “annotation set”, including independent ortholog identification at each MOD
     • Structural annotation of genomes used to build gp2protein files
     • Gp2protein files used to build “ortholog clusters”
     • Individual MODs annotate in-depth each gene in set
     • ISS annotations made independently by each MOD

  5. New process: coordinate and centralize where possible
     • Gp2protein files used to build trees
     • Trees and clusters used to define ref. genome annotation sets
     • Structural annotation of genomes used to build gp2protein files
     • Gp2protein files used to build “ortholog clusters”
     • Individual MODs annotate in-depth each gene in set
     • Inferences made to ancestral proteins
     • Inferences made to extant proteins

  6. [Workflow diagram]
     • Gp2protein files used to build trees
     • Select “gene set for concurrent annotation” from a central resource with more complete information
     • Trees and clusters used to define ref. genome annotation sets
     • Structural annotation of genomes used to build gp2protein files
     • Gp2protein files used to build “ortholog clusters”
     • Individual MODs annotate in-depth each gene in set
     • Inferences made to ancestral proteins
     • Inferences made to extant proteins

  7. [Workflow diagram]
     • Gp2protein files used to build trees
     • Make homology-based annotations concurrently and consistently in the context of an evolutionary tree
     • Trees and clusters used to define ref. genome annotation sets
     • Structural annotation of genomes used to build gp2protein files
     • Gp2protein files used to build “ortholog clusters”
     • Individual MODs annotate in-depth each gene in set
     • Inferences made to ancestral proteins
     • Inferences made to extant proteins

  8. [Workflow diagram]
     • Gp2protein files used to build trees
     • Trees and clusters used to define ref. genome annotation sets
     • Structural annotation of genomes used to build gp2protein files
     • Gp2protein files used to build “ortholog clusters”
     • Individual MODs annotate in-depth each gene in set
     • Inferences made to ancestral proteins
     • Inferences made to extant proteins

  9. Update on progress: comprehensive gene sets from each MOD
     • Short term solution implemented as of 9/4
       • Gp2protein files are now approximately complete
       • Most sets were OK as deposited by the MOD
       • A few sets had to be augmented (missing genes filled in from Ensembl or Entrez Gene); one set had to be reduced by selecting a single “representative” protein sequence per gene
     • Long term solution: UniProt?
       • SwissProt record includes all alternatively spliced exons, which is ideal for evolutionary modeling of protein coding gene history
       • We have already shared the gp2protein files with SwissProt, and they are comparing to UniProt “complete proteome” sets
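The reduction step mentioned on slide 9 (one “representative” protein sequence per gene) can be sketched as follows. The slide does not say how the representative was chosen; picking the longest sequence, with ties broken by protein ID, is purely an assumption for illustration, and the function name, record layout, and IDs are hypothetical.

```python
def pick_representatives(records):
    """Return {gene_id: protein_id} with one protein per gene.

    records: iterable of (gene_id, protein_id, sequence) tuples.
    ASSUMPTION: the representative is the longest sequence; ties go to the
    lexicographically smallest protein_id. The slides do not state the rule.
    """
    best = {}
    for gene_id, protein_id, sequence in records:
        key = (-len(sequence), protein_id)  # smaller key = better candidate
        if gene_id not in best or key < best[gene_id][0]:
            best[gene_id] = (key, protein_id)
    return {gene: protein for gene, (_, protein) in best.items()}


# Hypothetical input: two isoforms for geneA, one for geneB.
records = [
    ("geneA", "A-iso1", "MKT"),
    ("geneA", "A-iso2", "MKTLLV"),
    ("geneB", "B-iso1", "MSS"),
]
representatives = pick_representatives(records)
```

Any other selection rule (for example, a curated canonical isoform) would slot into the `key` computation without changing the rest.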

  10. Proposal made at this meeting
      • Write a white paper describing the “complete protein-coding gene set” needs/requirements for the RefGenome project
      • Michael will approach Amos and discuss options for working together

  11. [Workflow diagram]
      • Gp2protein files used to build trees
      • Trees and clusters used to define ref. genome annotation sets
      • Structural annotation of genomes used to build gp2protein files
      • Gp2protein files used to build “ortholog clusters”
      • Individual MODs annotate in-depth each gene in set
      • Inferences made to ancestral proteins
      • Inferences made to extant proteins

  12. Example: NEDD4
      • Selected for electronic jamboree Oct. 2008
      • Human NEDD4 was “core” target
      • OrthoMCL identified “orthologs” in
        • Drosophila
        • C. elegans
        • Mouse (2)
        • Human (2)
        • Zebrafish
        • Rat
      • Curators at SGD identified an ortholog in yeast from a published paper

  13. Orthologs (green) and paralogs (orange) of human NEDD4 (red)
      • duplications at base of metazoa
      • WWP1/2; SMURF1/2 diverge
      • NEDD4 conserved
      • duplication at base of chordata
      • HACE1 diverges
      • NEDD4 conserved
      • duplication at base of reptilia?

  14. OrthoMCL cluster containing human NEDD4/NEDD4L (blue) and curator-identified yeast ortholog (lt. blue)
      • duplications at base of metazoa
      • duplication at base of chordata
      • duplication at base of reptilia

  15. Orthologs (green) and paralogs (orange) of human NEDD4 (red), and “conserved orthologs” of NEDD4/NEDD4L (yellow)
      • duplications at base of metazoa
      • duplication at base of chordata
      • duplication at base of reptilia

  16. Update on progress: gene trees and “homology set” selection tool
      • Gene trees have been built for all existing PANTHER families, from all RefGenome species, plus 35 other “phylogenetically informative” species
      • Tree Curation Tool has been updated by Paul’s and Suzi’s groups in collaboration
        • Retrieves and displays tree, and UniProt information for each sequence
        • Displays OrthoMCL clustering results; scalable to any number of different clustering algorithms
        • “Pre-alpha” prototype has been installed and is being tested by Pascale
      • GOC has obtained supplemental funding to support
        • Adding multiple homology clustering algorithms
        • A “protein family curator”

  17. Proposal made at this meeting
      • Lead RefGenome Curator and Protein Family Curator work together to define set of genes to be annotated concurrently
      • No need for review by individual MODs

  18. [Workflow diagram]
      • Gp2protein files used to build trees
      • Trees and clusters used to define ref. genome annotation sets
      • Structural annotation of genomes used to build gp2protein files
      • Gp2protein files used to build “ortholog clusters”
      • Review and sign off on r.g. experimental annotations
      • Inferences made to ancestral proteins
      • Inferences made to extant proteins

  19. Annotation inference based on homology
      • We need to make homology inferences correctly and consistently
      • Infer only from annotations with experimental evidence
      • Use explicit evolutionary model: inheritance (maybe with modification) from a common ancestor!
      • Homology inference is actually two inferences
        1. the common ancestor has the same annotation as its descendant that has been characterized
        2. another (unannotated) descendant has the same annotation as its ancestor
      • Need traceable, versioned evidence trail:
        • Inferred annotation -> tree -> experimental annotation(s) -> literature
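The two-step inference on slide 19 can be sketched as below: step 1 assigns the annotation to the ancestor from experimentally characterized descendants, step 2 passes it to the remaining descendants, and each inferred annotation records the tree version and the supporting experimental annotations with their literature references. This is a minimal illustration, not the RefGenome tooling; the `Node` class, field names, gene names, GO term label, and reference string are all hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    experimental: dict = field(default_factory=dict)  # GO term -> literature reference
    inferred: dict = field(default_factory=dict)      # GO term -> evidence trail


def leaves(node):
    """Return the extant (leaf) nodes below a tree node."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def infer_by_homology(ancestor, go_term, tree_version):
    """Return names of leaves that newly receive an inferred annotation."""
    sources = [leaf for leaf in leaves(ancestor) if go_term in leaf.experimental]
    if not sources:
        return []  # infer only from annotations with experimental evidence
    support = [(leaf.name, leaf.experimental[go_term]) for leaf in sources]
    # Step 1: the common ancestor gets the annotation of its characterized descendants.
    ancestor.inferred[go_term] = {"tree_version": tree_version, "support": support}
    # Step 2: unannotated descendants inherit the annotation from the ancestor.
    newly_annotated = []
    for leaf in leaves(ancestor):
        if go_term not in leaf.experimental:
            leaf.inferred[go_term] = {
                "tree_version": tree_version,
                "via_ancestor": ancestor.name,
                "support": support,
            }
            newly_annotated.append(leaf.name)
    return newly_annotated


# Hypothetical two-leaf family: one characterized gene, one uncharacterized.
human = Node("human_gene", experimental={"GO:term_placeholder": "REF:placeholder"})
mouse = Node("mouse_gene")
root = Node("ancestor_1", children=[human, mouse])
newly = infer_by_homology(root, "GO:term_placeholder", tree_version="v1")
```

The sketch ignores the “maybe with modification” case: a curator would still need to decide, per branch, whether the function is likely to have been retained.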

  20. GO process: cellular response to UV

  21. ? ? GO process: positive regulation of synaptogenesis

  22. GO function: ubiquitin-protein ligase activity

  23. Proposal made at this meeting
      • Protein family curator makes first pass at homology inferences
      • Confers with individual MODs as necessary
      • Iterative: protein family curator prepares list of inferred annotations for each MOD, each MOD reviews and can suggest changes

  24. Annotation process
      • Gp2protein files used to build trees
      • Trees and clusters used to define ref. genome annotation sets
      • Structural annotation of genomes used to build gp2protein files
      • Gp2protein files used to build “ortholog clusters”
      • Review and sign off on r.g. experimental annotations
      • Inferences made to ancestral proteins
      • Inferences made to extant proteins

  25. [Workflow diagram]
      • Trees and clusters used to define ref. genome annotation sets
      • Protein family curator (Princeton/Pascale) suggests protein set based on report/examination of trees
      • MOD curators annotate all experimental data to completion
      • Protein family curator mediates annotation review
      • Review and sign off on r.g. experimental annotations
      • Protein family curator
      • Inferences made to ancestral proteins
      • Protein family curator
      • Reviewed by protein family and MOD curators
      • Inferences made to extant proteins
      • Done!

  26. Transformations
      • Gp2protein files used to build trees
      • Trees and clusters used to define ref. genome annotation sets
      • Structural annotation of genomes used to build gp2protein files
      • Gp2protein files used to build “ortholog clusters”
      • Review and sign off on r.g. experimental annotations
      • Inferences made to ancestral proteins
      • Inferences made to extant proteins

  27. Princeton / P-POD update
      • New run with protein sets used by PANTHER under way
      • Implementing algorithms for generation of consensus clusters and other ortholog prediction methods
      • New P-POD features
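Slide 27 mentions consensus clusters without describing the algorithm. One simple way to combine several clustering results, shown here only as an assumed illustration and not as P-POD's method, is to keep gene pairs that at least `min_support` methods place in the same cluster and then merge the supported pairs into groups. The function name, the voting rule, and the gene IDs are hypothetical.

```python
from collections import Counter
from itertools import combinations


def consensus_clusters(clusterings, min_support=2):
    """Group genes whose co-clustering is supported by >= min_support methods.

    clusterings: one entry per method, each a list of sets of gene IDs.
    ASSUMPTION: pairwise voting followed by connected components; genes with
    no supported pair are left out of the result.
    """
    pair_votes = Counter()
    for clusters in clusterings:
        pairs = set()
        for cluster in clusters:
            pairs.update(combinations(sorted(cluster), 2))
        pair_votes.update(pairs)  # each method votes at most once per pair

    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), votes in pair_votes.items():
        if votes >= min_support:
            parent[find(a)] = find(b)

    groups = {}
    for member in list(parent):
        groups.setdefault(find(member), set()).add(member)
    return sorted((sorted(group) for group in groups.values()), key=lambda g: g[0])


# Two hypothetical methods that disagree on where dm1 belongs.
method_a = [{"hs1", "mm1", "dm1"}, {"hs2", "mm2"}]
method_b = [{"hs1", "mm1"}, {"hs2", "mm2", "dm1"}]
consensus = consensus_clusters([method_a, method_b], min_support=2)
```

With these inputs only the pairs both methods agree on survive, so `dm1` is excluded from the consensus.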

  28. P-POD search

  29. P-POD results/disambiguation

  30. P-POD-Notung

  31. [Workflow diagram with open questions]
      • Gp2protein files used to build trees
      • Pascale picks a focal gene
      • Structural annotation of genomes used to build gp2protein files
      • Trees and clusters used to define ref. genome annotation sets
      • Gp2protein files used to build “ortholog clusters”
      • UniProt complete proteome project?
      • Review and sign off on r.g. experimental annotations
      • How to most efficiently incorporate input from all MOD curators?
      • Inferences made to ancestral proteins
      • Inferences made to extant proteins
      • How are resulting homology-based annotations delivered to MODs?
