2. Presentation Outline
The importance of gene number
Gene definition and detection
Genome inflation arguments
Post-completion changes in model eukaryotes
Ensembl and NCBI gene pipeline numbers
Completed chromosome gene numbers
Man, mouse and fish
Post-genomic transcript and protein increases
Novel gene skimming
Proteomics for gene detection
Conclusions
3. So Who Cares About Gene Number?
Central to evolutionary questions of gene expansion vs. protein diversity from alternative splicing and post-translational modifications
Announcement of genome closure sets expectations for gene closure
Gene delineation essential for genetics and clinical genomics
Defines limits for the number of potential drug targets and therapeutic proteins
Sets the baseline for Human Proteome Organisation and other academic large-scale proteomics initiatives, e.g. Sanger Atlas of Gene Expression
Sets the baseline for commercial initiatives such as OGS/Confirmant (www.confirmant.com) Protein Atlas of the Human Genome™: a database of mass-spec data on human proteins mapped onto genome data
4. Definitions
The Guidelines for Human Gene Nomenclature define a gene as: "a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterised by sequence, transcription or homology"
This presentation is concerned with the protein-coding gene number - defined as: “transcriptional units that translate to one or more proteins that share overlapping sequence identity and are products of the same unique genomic locus and strand orientation”
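The working definition above can be read as a counting rule: transcripts from the same locus and strand collapse into one gene. A minimal Python sketch of that rule follows. It assumes coordinate overlap on the same chromosome and strand as a proxy for "same unique genomic locus"; the function name `count_gene_loci` and the example transcripts are hypothetical, not taken from any annotation pipeline.

```python
from collections import defaultdict

def count_gene_loci(transcripts):
    """Count loci: transcripts on the same chromosome and strand whose
    coordinates overlap are treated as one protein-coding gene."""
    by_strand = defaultdict(list)
    for _id, chrom, strand, start, end in transcripts:
        by_strand[(chrom, strand)].append((start, end))

    loci = 0
    for intervals in by_strand.values():
        intervals.sort()
        current_end = None
        for start, end in intervals:
            if current_end is None or start > current_end:
                loci += 1            # no overlap with the open locus: new gene
                current_end = end
            else:
                current_end = max(current_end, end)  # extend the open locus
    return loci

# Hypothetical transcripts: (id, chromosome, strand, start, end)
example = [
    ("t1", "21", "+", 100, 500),
    ("t2", "21", "+", 400, 900),    # overlaps t1 on the same strand: same gene
    ("t3", "21", "-", 450, 800),    # opposite strand: counted separately
    ("t4", "21", "+", 2000, 2600),  # no overlap: counted separately
]
print(count_gene_loci(example))
```

Under this rule splice variants do not inflate the count, which is why redundant transcript totals can grow much faster than gene totals.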
5. Spread of Estimates in the Literature
6. Evidence for Identifying Genes
Bioinformatic
Detection of protein identity in genomic DNA
Gene prediction with protein similarity support
Matches with ESTs that include ORFs and/or splice sites
Cross-species comparisons for orthologous exon detection
Presence of gene anatomy features e.g. CpG islands, promoters, transcription start sites, polyadenylation signals and the absence of repeat elements
Experimental
Cloning of predicted genes
Detection of active transcription by Northern blot, RT-PCR or microarray hybridisation
Loss-of-function approaches
High-throughput transcript sampling by EST or SAGE tagging
In-vitro expression
Direct verification of protein sequence by Edman sequencing, mass-mapping and/or MS/MS sequencing
7. Arguments for high numbers (I)
Model eukaryotes (yeast/worm/fly) will show a significant post-genomic rise in gene number
Human genome assembly is not complete
Gene prediction programs have a significant false-negative rate
The Ensembl pipeline is conservative
Mammalian protein and transcript coverage is incomplete
Selective skimming experiments have revealed new genes
Extensive human/mouse genomic sequence conservation
8. Arguments for high numbers (II)
There exists a substantial subset of “cryptic” proteins (5,000 to 10,000) with the following characteristics:
Low specificity of detection by gene prediction (predominantly single exon)
Not sampled in any mammals by mRNAs or ESTs (rare or restricted transcripts)
Diverged in sequence from all other proteins in current databases (rapidly evolving and clade-specific)
Predominantly short proteins (smORFs)
9. Model Eukaryotes: No Significant Post-Completion Gene Increases
S. cerevisiae 1.5% increase since 1997
C. elegans 2% increase since 1998
D. melanogaster 2% decrease since 2001
Ciona intestinalis, a protochordate with many vertebrate proteins but ~4,000 fewer genes than C. elegans
S. pombe only 4,824 genes
Massive functional genomics focus on yeast, worm and fly
10. Model Eukaryotes: Close to Gene Number Closure?
S. cerevisiae: remaining uncertainties in ORF totals:
6128 (Snyder & Gerstein 2003)
6202 from EBI
6356 from SGD-Stanford
6449 from CYGD-MIPS
Mass-spectrometric identification of tryptic peptides:
S. cerevisiae 23% ORF confirmation and 60(?) novels; P. falciparum 24% ORF confirmation and 100 orphan peptides
Themes from latest Drosophila re-annotation
45% of genes changed
~800 new genes balanced by reduced gene fragmentation
increased transcript length and exon count
11. Mammals vs. Eukaryotes: Average Protein Length and Exon Number
12. Post-genomic Coverage of Protein and Transcript Data
The cornerstones of genome annotation, locating transcript identities and detecting protein homology, are directly related to coverage in non-genomic data
Since the first draft human genome in early 2001 there has been a massive increase in the human mRNA and protein data
EST data feeds several international high-throughput mRNA projects
EST data is increasingly used as supporting evidence for predicted genes
Data from other mammals can be used for homology detection of human genes
These data increases would be expected to result in an increased gene number
13. Mammalian Transcript Coverage in UniGene, March 2003
14. Human Transcripts: Post-genomic mRNA Growth in UniGene
Rapid growth in redundant mRNA
But slow growth in clustered set ~ 9,000 over 2 years
This will include some splice variants
15. Human Protein Number Changes in the International Protein Index and SPTr
16. Mammalian Post-Genomic Protein Growth: Grist to the Genome Annotation Mill
Growth in SPTr despite 100% redundancy removal
Mouse biggest increase of 7.4-fold
Predominantly re-sampling the same mammalian gene set?
17. Ensembl Gene Number: Essentially Flat
Massive increase in human protein and transcript coverage over 2 years
But 24,847 genes, only 801 more than the first release
Knowns up from 90% in Nov-01 to over 95% in Mar-03
Novel genes down from 12,398 in Nov-01 to 5,421 (21%) in Mar-03
Exons per gene up from 6.5 in Jan-02 to 10.0 in Mar-03
Alternative splicing up from 3,669 in Nov-01 to 12,500 in Mar-02
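The "essentially flat" claim can be checked with a line of arithmetic. The sketch below is only a worked check in Python: it takes the two figures quoted on this slide (24,847 genes, 801 more than the first release) and derives the implied first-release total and the relative growth. The variable names are illustrative.

```python
current_genes = 24_847   # Ensembl gene total quoted on the slide
added_since_first = 801  # increase over the first release, from the slide

first_release = current_genes - added_since_first
growth_percent = 100 * added_since_first / first_release

print(f"first release: {first_release} genes")
print(f"growth since first release: {growth_percent:.1f}%")
```

The implied first-release total is 24,046 genes, so the net growth is about 3.3% over a period in which transcript and protein coverage rose many times faster.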
18. Ensembl and NCBI GP31 Comparison
Gene numbers (24,847 & 26,846) approximately congruent except higher NCBI totals on chromosomes 14 and 22
Ensembl novels approximately 20% for each chromosome with maximum of 43% for Y and minimum of 12% for 17
19. NCBI Gene Number: Still Yo-yoing
NCBI genomic pipeline includes varying proportions of EST-only supported and unsupported ab initio predictions as RefSeq XP protein models
New category “locp” in GP32 statistics page infers 23,270 protein-coding genes
20. Humans, rodents and fish
Rodents have lower gene numbers despite ~500 more ODR genes
Lower exon counts from reduced transcript coverage?
Lower alternative splicing from reduced transcript coverage?
Teleost fish known to have lineage-specific duplications
21. Addressing the smORF Question: Protein Size Distributions in Human SPTr
22. Addressing the smORF Question (II)
No database evidence for substantially increased smORF discovery in eukaryotes or mammals
The observation that only ~1% of mouse genes have no detectable human homology contradicts the idea of large order-specific gene expansion in mammals
Although small proteins are less conserved, i.e. evolve more rapidly, those much shorter than 100 residues will fall below the threshold needed to fold into the domain structures necessary for biological function
23. Pseudogenes: Potential False-Positives in Genome Annotation
Human upper estimate of 20,000 (Harrison et al. 2002)
Mouse estimate of 14,000 with ~ 12% in gene set
Completed chromosome teams annotate an average of ~ 1:4 pseudo:real genes
But this varies between 10:1 for chromosome 7 and 2.3:1 for chromosome 22
Conservative estimate for Ensembl would reduce human gene total to ~ 22,500
NCBI LocusLink lists 2172 pseudogenes, with 80 (3.5%) expressed as mRNA
Difficult to prove null-translation for pseudogenes with minor disablements such as premature stops
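The size of the pseudogene correction can be made explicit. The Python sketch below works backwards from the slide's figures: it assumes the Ensembl total of 24,847 from slide 17 as the starting point, derives the fraction of the gene set that the ~22,500 estimate treats as pseudogenes, and, purely for comparison, shows what the ~12% mouse contamination rate would give if applied to the human set. That second calculation is an illustration, not the method used in the presentation.

```python
ENSEMBL_GENES = 24_847      # from slide 17 (assumed starting total)
ADJUSTED_ESTIMATE = 22_500  # conservative pseudogene-adjusted total (slide 23)
MOUSE_PSEUDO_RATE = 0.12    # ~12% of the mouse gene set (slide 23)

removed = ENSEMBL_GENES - ADJUSTED_ESTIMATE
implied_rate = removed / ENSEMBL_GENES

# Comparison only: applying the mouse rate to the human total
human_at_mouse_rate = ENSEMBL_GENES * (1 - MOUSE_PSEUDO_RATE)

print(f"implied pseudogenes removed: {removed} ({100 * implied_rate:.1f}%)")
print(f"human total at the mouse rate: {round(human_at_mouse_rate)}")
```

The conservative estimate corresponds to removing about 9.4% of the set, somewhat below the mouse rate.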
24. Experimental Transcript Skimming as Evidence for High Gene Numbers
Exon arrays (Dunham et al. 1999)
Gene arrays (Penn et al. 2000)
RT-PCR (Das et al. 2001)
SAGE-tags (Saha et al. 2002, Chen et al. 2002)
Oligo tiling (Kapranov et al. 2002)
No novel proteins were submitted to the primary databases
There is increasing evidence for significant amounts of antisense and other non-ORF transcription in human and mouse
It is now necessary to clone a full-length ORF with the expected features of gene anatomy, and to submit it to the public databases, before the discovery of a novel gene can be claimed
25. Human Proteome Sampling with MS/MS Peptide Identification
615 from the human heart mitochondria (Taylor et al. 2003)
500 from breast cancer cell membranes (Adams et al. 2003)
491 from microsomal fractions (Han et al. 2001)
490 from blood serum (Adkins et al. 2003)
311 from the spliceosome (Rappsilber et al. 2002)
Total approaches ~ 10% of human genes
No reported data on protein prediction confirmation
Technical caveats on search space for novel gene detection by correlative algorithms
One novel gene was reported from a genome-only peptide match by Kuster et al. in 2001, but the same protein appeared from a high-throughput project later in the same year (Tr Q96DA0)
Proteomics will have a key impact on characterising the proteome but there is no evidence so far for significant novel gene discovery
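The "~10% of human genes" figure follows from summing the five studies listed above. The Python sketch below reproduces that sum. The denominator is an assumption: the slide does not state which gene total was used, so the Ensembl figure of 24,847 from slide 17 is taken here. The sum is also an upper bound, since the overlap between the five protein sets is not reported.

```python
# Protein identifications quoted on the slide, by study
identified = {
    "heart mitochondria (Taylor et al. 2003)": 615,
    "breast cancer cell membranes (Adams et al. 2003)": 500,
    "microsomal fractions (Han et al. 2001)": 491,
    "blood serum (Adkins et al. 2003)": 490,
    "spliceosome (Rappsilber et al. 2002)": 311,
}
ENSEMBL_GENES = 24_847  # assumed denominator, from slide 17

total = sum(identified.values())
share_percent = 100 * total / ENSEMBL_GENES

print(f"{total} proteins, {share_percent:.1f}% of {ENSEMBL_GENES} genes")
```

The total is 2,407 identifications, about 9.7% of the assumed gene set if no protein is counted twice.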
26. Gene Numbers for Individual Completed Chromosomes
27. Vertebrate Genome Annotation (VEGA) for Human Chromosomes 14, 20 and 22
Novel CDSs = where an ORF can be determined
Novel transcripts = ORFs not frame-fixed by homology
Putative transcripts = where spliced ESTs define intron/exon boundaries but not an ORF
28. Gene Numbers for Individual Completed Chromosomes
The average over the completed chromosomes exceeds the Ensembl GP31 gene count by ~12%
This extrapolates to ~ 28,000 genes
The five chromosomes still only cover 13% of genome
The chromosome reports were made at different times using different assemblies and different grades of gene definition and evidence support
Difficult to explicitly cross-map VEGA vs. Ensembl chromosome gene numbers
Future status of partial genes unclear
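The extrapolation to ~28,000 can be reproduced with one multiplication. The Python sketch below assumes the figure was obtained by scaling the genome-wide Ensembl GP31 total (24,847, from slides 17 and 18) by the ~12% excess seen on the curated chromosomes; the slide does not spell out the calculation, so this is one plausible reading.

```python
ENSEMBL_GP31_GENES = 24_847  # genome-wide Ensembl total (slides 17 and 18)
CURATED_EXCESS = 0.12        # curated chromosomes exceed Ensembl by ~12%

extrapolated = ENSEMBL_GP31_GENES * (1 + CURATED_EXCESS)

print(f"extrapolated total: {round(extrapolated)}")
print(f"rounded to the nearest thousand: {round(extrapolated, -3):.0f}")
```

Because the scaling factor comes from chromosomes covering only 13% of the genome, a small change in the excess shifts the total noticeably: each percentage point is worth roughly 250 genes.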
29. An Example of Disappearing Novelty
Cysteine and tyrosine-rich 1 (CYYR1), a novel unpredicted gene on human chromosome 21 (21q21.2), encodes a cysteine and tyrosine-rich protein and defines a new family of highly conserved vertebrate-specific genes (Vitale et al. Gene 2002 May 15;290:141-51)
Nineteen additional unpredicted transcripts from human chromosome 21 (Reymond et al. Genomics 2002 Jun;79(6):824-32)
Human accessions
BF676689         834 bp   EST       21-DEC-2000
AY061853         2320 bp  154 aa    11-JUN-2002  Reymond et al., Geneva
AF401639         2686 bp  154 aa    13-JUN-2002  Vitale et al., Bologna
AAM56646         -        154 aa    20-JUN-2002  US Patent 6368794
AL833200         3048 bp  (no CDS)  12-JUL-2002  German Genome Project
AK054581         1678 bp  262 aa    01-AUG-2002  NEDO human project
BC036761         2000 bp  154 aa    26-AUG-2002  NIH-MGC Project
ENSG00000166265  -        154 aa    Sept 2002
Mouse accessions
BB621846         581 bp   EST       26-OCT-2001  RIKEN
AY061854         1064 bp  165 aa    11-JUN-2002  Reymond et al.
AF442733         498 bp   165 aa    13-JUN-2002  Vitale et al.
30. Disappearing Novelty (II)
31. Current Numbers From Major Public Gene Sets
Lower numbers are explicitly non-redundant and exclude pseudogenes
Higher numbers have increasing splice variant content
32. So What Would Constitute Gene Closure?
The human genome was closed on April 14th but the yeast gene number is still not closed after six years
Comparative genomics will contribute to resolving the mammalian gene sets e.g. three-way human/mouse/rat
Closure-by-clone from VEGA
Proteomic closure by confirming at least one protein splice form from all plausible genes (expression in vitro and detection in vivo?)
Grey areas are likely to remain, e.g. transcribed pseudogenes producing truncated proteins, and apparently intact genes that may have undetectable impairments rendering them functionally superfluous and translationally silent
Grey areas may not be numerically large
33. Conclusions
The model eukaryotes have not shown post-genomic rises in gene number
The Ensembl gene number has been essentially flat
The pseudogene-adjusted Ensembl gene total on a largely complete GP is ~22,000
The five curated complete chromosomes extrapolate to ~28,000 but leave many “unclosed” annotations
The massive increase in post-genomic transcript coverage is extending exons but predominantly re-sampling known genes
Database submissions of novel human genes have slowed to a trickle
Initial mouse & rat have lower gene numbers than human
No evidence for large numbers of cryptic smORFs
Widespread occurrence of non-protein transcripts could explain previous high gene estimates from transcript skimming
Gene number closure likely to be well below 30,000
34. Acknowledgments
Paul Kersey for IPI figures
Lucas Wagner of the NCBI for the retrospective UniGene data
Arek Kasprzyk of the EBI for historical and preview Ensembl release statistics
Numerous other people at NCBI, EBI, and Sanger Centre who have graciously answered queries on their data collections
The OGS Proteome Discovery Team for useful discussions