
Ensembl Gene Set: Understanding Gene Annotation and Biological Evidence

Learn about the Ensembl Gene Set and how it is determined through the gene annotation pipeline, manual curation, and biological evidence. Explore pseudogenes, ncRNAs, and the CCDS project.

amandap


Presentation Transcript


  1. The Ensembl Gene Set: The “Genebuild” (21 April 2008)

  2. Outline • The GeneBuild (determining the Ensembl gene set) • What does it mean for the scientist? • ‘Annotation pipeline’ vs ‘manual curation’ • Pseudogenes • ncRNAs • The CCDS project

  3. Introduction • What is available? • I) Sequence assemblies from genome sequencing efforts

  4. Gene Sequencing: the Assembly • This generates clones (vs. new sequencing methods) • http://seqcore.brcf.med.umich.edu/doc/educ/dnapr/sequencing.html

  5. Clones Available • Human: tilepath clones (used in the assembly) • Ciona intestinalis: shotgun assembly

  6. ContigView: Clones and Contigs • [Screenshot labels: Contigs; Clones (plate/well numbers); Ensembl transcripts]

  7. Task: View the tilepath clone in ContigView for the region containing the human BRCA2 gene. Hint: Start with a search for the BRCA2 gene.

  8. The Ensembl Geneset • How does Ensembl use mRNA and protein information along with the sequence assembly to define distinct genes on the genome? • [Diagram labels: Sequence assembly; Protein; Ensembl geneset]

  9. Once the Assembly is Imported… • Proteins/mRNAs are aligned to it. • These sequences have been submitted to databases such as UniProt/Swiss-Prot (manually curated) and RefSeq (partially manually curated).

  10. The Biological Evidence • All Ensembl gene predictions are based on experimental evidence: • UniProt/Swiss-Prot: a manually curated database and therefore of highest accuracy • NCBI RefSeq: a partially manually curated database • UniProt/TrEMBL: automatically annotated translations of EMBL coding sequence (CDS) features • EMBL / GenBank / DDBJ: primary nucleotide sequence repositories

  11. Database Relationships • [Diagram labels: individual labs’ submissions; EMBL-Bank, GenBank, DDBJ; NCBI RefSeq; UniProt (Swiss-Prot, TrEMBL)]

  12. [Diagram labels: EST; Genebuild; Sequence (assembly); Manual annotation (HAVANA); EMBL-Bank, GenBank, DDBJ; Proteins (e.g. Swiss-Prot); Ensembl; mRNA; EST genes]

  13. Why do I want to know? • Ensembl genes may be based on multiple proteins/mRNAs • What is an Ensembl gene based on?

  14. Task • Look at the evidence for the human EPO gene. • What was this gene based on? • Hint: Go to Exon Information from the GeneView page

  15. EPO gene supporting evidence

  16. Species-Specific GeneBuilds • Pan troglodytes genes are built by projection from human genes. • Zebrafish has many gene duplications. • Homo sapiens genes must have protein evidence, not just mRNA.

  17. Task • When was the chimpanzee (Pan troglodytes) Genebuild performed? • Can you find information as to how genes were annotated? • Hint: Look on the chimpanzee index page

  18. External Gene Set: VEGA/Havana • Human, zebrafish, mouse and dog • Havana transcripts in blue or gold… • What are Havana transcripts?

  19. Havana and Ensembl match • When Havana (manual curation) and Ensembl (automatic methods) predict the same transcript, base pair for base pair, the two transcripts are merged and coloured gold.
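
The merge rule on this slide can be illustrated with a minimal sketch. It assumes a transcript is described only by its sequence region, strand, and exon coordinates; the `TranscriptModel` class, the function names, and all coordinates are hypothetical and do not reflect Ensembl's actual data model or merge code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TranscriptModel:
    # Hypothetical minimal representation, not Ensembl's data model.
    seq_region: str                     # e.g. a chromosome name
    strand: int                         # +1 or -1
    exons: Tuple[Tuple[int, int], ...]  # (start, end) pairs, inclusive


def is_exact_match(havana: TranscriptModel, ensembl: TranscriptModel) -> bool:
    """True only if both models agree on region, strand and every exon boundary."""
    return (
        havana.seq_region == ensembl.seq_region
        and havana.strand == ensembl.strand
        and sorted(havana.exons) == sorted(ensembl.exons)
    )


def display_colour(havana: TranscriptModel, ensembl: TranscriptModel) -> str:
    """Return 'gold' for an exact match, otherwise report that no merge happens."""
    return "gold" if is_exact_match(havana, ensembl) else "not merged"


# Made-up coordinates for illustration only.
havana = TranscriptModel("13", 1, ((100, 250), (400, 520), (900, 1100)))
identical = TranscriptModel("13", 1, ((100, 250), (400, 520), (900, 1100)))
shifted = TranscriptModel("13", 1, ((100, 250), (400, 523), (900, 1100)))

print(display_colour(havana, identical))
print(display_colour(havana, shifted))
```

In this sketch a difference of even one base at a single exon boundary is enough to prevent the merge, which is what "base pair for base pair" implies.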

  20. Manually-curated gene sets in Ensembl • Vega (Havana): Homo sapiens, Danio rerio, Mus musculus and Canis familiaris • WormBase: Caenorhabditis elegans • FlyBase: Drosophila melanogaster • SGD: Saccharomyces cerevisiae

  21. What Can Go Wrong? • I) A gap in the assembly: the gene might not be found in Ensembl • II) Fused genes: the gene might be associated with two names [Figure label: BLAST hit (Swiss-Prot entry)]

  22. Outline • The genome sequence • The Genebuild • ‘Manual curation’ by Havana • Other: EST gene set, pseudogenes, ncRNAs

  23. Expressed Sequence Tags vs ‘cDNA’ • ESTs are annotated separately. Why? • mRNA and cDNA used in the GeneBuild: sequenced to a high standard, often complete. • EST: lower-quality sequence. • ‘One shot’ sequencing of cDNA from the 5’ and 3’ ends creates the EST sequence. • ESTs are only 500-800 nucleotides long. • Low-quality fragment: sequence error of ~2%. • BUT ESTs confer useful expression information: • Discovery of new genes, especially in diseased organisms • Tissue type • Timing/developmental stage • They sample more transcripts and variants
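
A quick worked calculation shows what the slide's figures (500-800 nucleotides, ~2% error) mean per read. This is a minimal sketch that assumes errors are independent and uniformly distributed along the read, which is a simplification introduced here and not a claim from the slides; the function names are hypothetical.

```python
def expected_errors(length_nt: int, error_rate: float = 0.02) -> float:
    """Expected number of erroneous bases in a read of the given length."""
    return length_nt * error_rate


def prob_error_free(length_nt: int, error_rate: float = 0.02) -> float:
    """Probability that a read has no errors, assuming independent errors."""
    return (1.0 - error_rate) ** length_nt


for length in (500, 800):
    print(
        f"{length} nt: ~{expected_errors(length):.0f} expected errors, "
        f"P(error-free) = {prob_error_free(length):.2e}"
    )
```

Under these assumptions a single EST carries roughly 10 to 16 erroneous bases and is almost never error-free, which is consistent with keeping ESTs out of the main GeneBuild and annotating them separately.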

  24. Where Can I See This EST Geneset? • ContigView: choose ‘EST genes’ • [Screenshot label: EST track]

  25. Pseudogenes: ‘False’ Genes • Unprocessed: produced by gene duplication and rearrangement • Processed: mRNA (AAAAAA) → reverse transcription and re-integration → pseudogene (AAAAAA)

  26. ncRNAs (non coding RNAs) • What types are in Ensembl? • tRNA (transfer RNA) • rRNA (ribosomal RNA) • scRNA (small cytoplasmic) • snRNA (small nuclear) • snoRNA (small nucleolar) • miRNA (microRNA)

  27. ncRNAs (2 types) • I) RNA with low sequence homology can be identified through conserved secondary structure (search the genome using an Rfam pattern) • II) High sequence conservation (miRNA): • BLAST alignment • ‘RNA fold’ applied to make sure the sequences can fold (hairpin)
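
The idea behind the hairpin check in step II can be illustrated with a toy example. This is not the folding program the slide refers to and uses no thermodynamics: it only tests whether the two ends of a sequence could base-pair into a stem around a loop. The function name, the default stem and loop lengths, and the example sequences are all assumptions made for illustration.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_form_hairpin(seq: str, stem: int = 4, min_loop: int = 3) -> bool:
    """Toy check: can the first `stem` bases pair with the last `stem` bases?

    Base i from the 5' end is paired with base i from the 3' end, leaving
    at least `min_loop` unpaired bases in between.
    """
    rna = seq.upper().replace("T", "U")  # accept DNA or RNA letters
    n = len(rna)
    if n < 2 * stem + min_loop:
        return False
    return all((rna[i], rna[n - 1 - i]) in PAIRS for i in range(stem))


print(can_form_hairpin("GGGGAAAACCCC"))  # stem of four G-C pairs around a loop
print(can_form_hairpin("AAAAAAAAAAAA"))  # A cannot pair with A
```

A real folding tool scores many alternative structures by free energy; the sketch only shows why a candidate that aligns by BLAST still needs a structural sanity check.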

  28. ncRNAs… where can I see them? • Find them in ContigView, or use BioMart.

  29. Summary – Ensembl Genes • All Ensembl genes are based on biological evidence (protein and mRNA). • One Ensembl gene may come from proteins and mRNAs in various databases. • Havana (manually curated) genes are incorporated into the Ensembl geneset, merged for human. • The CCDS set strives for consensus coding sequences across databases. • Pseudogenes and RNAs are annotated, along with a separate EST gene set.

  30. For more on GeneBuild: • Help and Documentation • (About Ensembl) http://www.ensembl.org/info/about/docs/genome_annotation.html
