Has the Yo-yo Stopped A Human Gene Number Update Dr Christopher Southan, Proteome Discovery Oxford Glycosciences Pr

2. Presentation Outline
- The importance of gene number
- Gene definition and detection
- Genome inflation arguments
- Post-completion changes in model eukaryotes
- Ensembl and NCBI gene pipeline numbers
- Completed chromosome gene numbers
- Man, mouse and fish
- Post-genomic transcript and protein increases
- Novel gene skimming
- Proteomics for gene detection
- Conclusions

3. So Who Cares About Gene Number?
- Central to evolutionary questions of gene expansion vs. protein diversity from alternative splicing and post-translational modifications
- Announcement of genome closure sets expectations for gene closure
- Gene delineation essential for genetics and clinical genomics
- Defines limits for the number of potential drug targets and therapeutic proteins
- Sets the baseline for Human Proteome Organisation and other academic large-scale proteomics initiatives, e.g. Sanger Atlas of Gene Expression
- Sets the baseline for commercial initiatives such as OGS/Confirmant (www.confirmant.com) Protein Atlas of the Human Genome™, a database of mass-spec data on human proteins mapped onto genome data

4. Definitions
- The Guidelines for Human Gene Nomenclature define a gene as: "a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterised by sequence, transcription or homology"
- This presentation is concerned with the protein-coding gene number, defined as: "transcriptional units that translate to one or more proteins that share overlapping sequence identity and are products of the same unique genomic locus and strand orientation"
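The working definition above can be illustrated with a minimal sketch that counts gene loci from a set of transcripts. It assumes that "same unique genomic locus" can be approximated by overlapping genomic coordinates on the same chromosome and strand. That is a simplification, since the definition itself speaks of overlapping protein sequence identity. The `Transcript` type, the `count_gene_loci` function and the example coordinates are hypothetical and are not taken from any annotation pipeline.

```python
from collections import defaultdict
from typing import NamedTuple


class Transcript(NamedTuple):
    """Hypothetical, simplified transcript record (illustration only)."""
    name: str
    chrom: str
    strand: str  # "+" or "-"
    start: int   # genomic start coordinate
    end: int     # genomic end coordinate, inclusive


def count_gene_loci(transcripts):
    """Count loci, merging transcripts that overlap on the same chromosome and strand."""
    by_strand = defaultdict(list)
    for t in transcripts:
        by_strand[(t.chrom, t.strand)].append(t)

    loci = 0
    for group in by_strand.values():
        group.sort(key=lambda t: t.start)
        current_end = None
        for t in group:
            if current_end is None or t.start > current_end:
                # No overlap with the locus being built: start a new locus.
                loci += 1
                current_end = t.end
            else:
                # Overlap: treat as another transcript of the same locus.
                current_end = max(current_end, t.end)
    return loci


# Made-up example: two overlapping splice forms, one opposite-strand
# transcript and one separate downstream transcript.
example = [
    Transcript("A-variant1", "chr21", "+", 100, 500),
    Transcript("A-variant2", "chr21", "+", 300, 900),
    Transcript("B", "chr21", "-", 400, 800),
    Transcript("C", "chr21", "+", 1000, 1200),
]
print(count_gene_loci(example))
```

Under this approximation, splice variants collapse into one gene while an overlapping transcript on the opposite strand counts separately, which matches the strand-orientation clause of the definition.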

5. Spread of Estimates in the Literature

6. Evidence for Identifying Genes
Bioinformatic
- Detection of protein identity in genomic DNA
- Gene prediction with protein similarity support
- Matches with ESTs that include ORFs and/or splice sites
- Cross-species comparisons for orthologous exon detection
- Presence of gene anatomy features, e.g. CpG islands, promoters, transcription start sites, polyadenylation signals and the absence of repeat elements
Experimental
- Cloning of predicted genes
- Detection of active transcription by Northern blot, RT-PCR or microarray hybridisation
- Loss-of-function approaches
- High-throughput transcript sampling by EST or SAGE tagging
- In-vitro expression
- Direct verification of protein sequence by Edman sequencing, mass-mapping and/or MS/MS sequencing

7. Arguments for high numbers (I)
- Model eukaryotes (yeast/worm/fly) will show a significant post-genomic rise in gene number
- Human genome assembly is not complete
- Gene prediction programs have a significant false-negative rate
- The Ensembl pipeline is conservative
- Mammalian protein and transcript coverage is incomplete
- Selective skimming experiments have revealed new genes
- Extensive human/mouse genomic sequence conservation

8. Arguments for high numbers (II)
There exists a substantial subset of "cryptic" proteins (5,000 to 10,000) that have the following characteristics:
- Low specificity of detection by gene prediction (predominantly single exon)
- Not sampled in any mammals by mRNAs or ESTs (rare or restricted transcripts)
- Diverged in sequence from all other proteins in current databases (rapidly evolving and clade-specific)
- Predominantly short proteins (smORFs)

9. Model Eukaryotes: No Significant Post-Completion Gene Increases
- S. cerevisiae: 1.5% increase since 1997
- C. elegans: 2% increase since 1998
- D. melanogaster: 2% decrease since 2001
- Ciona intestinalis: protochordate with many vertebrate proteins but ~4,000 fewer genes than C. elegans
- S. pombe: only 4,824 genes
- Massive functional genomics focus on yeast, worm and fly

10. Model Eukaryotes: Close to Gene Number Closure?
- S. cerevisiae: remaining uncertainties in ORF totals: 6128 (Snyder & Gerstein 2003), 6202 from EBI, 6356 from SGD-Stanford, 6449 from CYGD-MIPS
- Mass-spectrometric identification of tryptic peptides: S. cerevisiae 23% ORF confirmation and 60(?) novels; P. falciparum 24% ORF confirmation and 100 orphan peptides
- Themes from latest Drosophila re-annotation:
  - 45% of genes changed
  - ~800 new genes balanced by reduced gene fragmentation
  - Increased transcript length and exon count

11. Mammals vs. Eukaryotes: Average Protein Length and Exon Number

12. Post-genomic Coverage of Protein and Transcript Data
- The cornerstones of genome annotation, locating transcript identities and detecting protein homology, are directly related to coverage in non-genomic data
- Since the first draft human genome in early 2001 there has been a massive increase in human mRNA and protein data
- EST data feeds several international high-throughput mRNA projects
- EST data is increasingly used as supporting evidence for predicted genes
- Data from other mammals can be used for homology detection of human genes
- These data increases would be expected to result in an increased gene number

13. Mammalian Transcript Coverage in UniGene, March 2003

14. Human Transcripts: Post-genomic mRNA Growth in UniGene
- Rapid growth in redundant mRNA
- But slow growth in clustered set: ~9,000 over 2 years
- This will include some splice variants

15. Human Protein Number Changes in the International Protein Index and SPTr

16. Mammalian Post-Genomic Protein Growth: Grist to the Genome Annotation Mill
- Growth in SPTr despite 100% redundancy removal
- Mouse biggest increase of 7.4-fold
- Predominantly re-sampling the same mammalian gene set?

17. Ensembl Gene Number: Essentially Flat
- Massive increase in human protein and transcript coverage over 2 years
- But 24,847 genes, only 801 more than the first release
- Knowns up from 90% in Nov-01 to over 95% in Mar-03
- Novel genes down from 12,398 in Nov-01 to 5,421 (21%) in Mar-03
- Exons per gene up from 6.5 in Jan-02 to 10.0 in Mar-03
- Alternative splicing up from 3,669 in Nov-01 to 12,500 in Mar-02
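The "essentially flat" claim can be checked with a short piece of arithmetic. The sketch below uses only the two figures quoted on this slide (24,847 genes, 801 more than the first release). The variable names are illustrative and the derived first-release total is back-calculated, not quoted from Ensembl.

```python
# Figures quoted on the slide.
ENSEMBL_CURRENT = 24_847  # current Ensembl gene count
INCREASE = 801            # increase over the first release

# Derived values (back-calculated, not stated in the source).
first_release = ENSEMBL_CURRENT - INCREASE
growth_pct = 100 * INCREASE / first_release

print(f"Implied first-release total: {first_release:,}")
print(f"Growth since first release: {growth_pct:.1f}%")
```

A rise of this size over two years is small compared with the growth in transcript and protein data described on the preceding slides.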

18. Ensembl and NCBI GP31 Comparison
- Gene numbers (24,847 & 26,846) approximately congruent except higher NCBI totals on 14 and 22
- Ensembl novels approximately 20% for each chromosome, with a maximum of 43% for Y and a minimum of 12% for 17

19. NCBI Gene Number: Still Yo-yoing
- NCBI genomic pipeline includes varying proportions of EST-only supported and unsupported ab initio predictions as RefSeq XP protein models
- New category "locp" in GP32 statistics page infers 23,270 protein-coding genes

20. Humans, rodents and fish
- Rodents lower numbers despite ~500 more ODR genes
- Lower exon counts from reduced transcript coverage?
- Lower alternative splicing from reduced transcript coverage?
- Teleost fish known to have lineage-specific duplications

21. Addressing the smORF Question: Protein Size Distributions in Human SPTr

22. Addressing the smORF Question (II)
- No database evidence for substantially increased smORF discovery in eukaryotes or mammals
- The observation that only ~1% of mouse genes have no detectable human homology contradicts the idea of large order-specific gene expansion in mammals
- Although small proteins are less conserved, i.e. evolve more rapidly, those much shorter than 100 residues will fall below the threshold necessary to fold into the domain structures necessary for biological function

23. Pseudogenes: Potential false-positives in Genome Annotation
- Human upper estimate of 20,000 (Harrison et al. 2002)
- Mouse estimate of 14,000 with ~12% in gene set
- Completed chromosome teams annotate an average of ~1:4 pseudo:real genes
- But this varies between 10:1 for chrom 7 and 2.3:1 for 22
- Conservative estimate for Ensembl would reduce human gene total to ~22,500
- NCBI LocusLink 2172 with 80 (3.5%) expressed as mRNA
- Difficult to prove null-translation for pseudogenes with minor disablements such as premature stops
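The slide does not state what pseudogene fraction underlies the ~22,500 figure. As a rough sketch, the implied fraction can be back-calculated by assuming the starting point is the Ensembl total of 24,847 quoted on slide 17. That base is an assumption, and the variable names are illustrative.

```python
# Assumed base: Ensembl total from slide 17 (the source does not say
# which total the pseudogene adjustment was applied to).
ENSEMBL_TOTAL = 24_847
ADJUSTED_TOTAL = 22_500  # approximate figure quoted on this slide

removed = ENSEMBL_TOTAL - ADJUSTED_TOTAL
implied_pct = 100 * removed / ENSEMBL_TOTAL

print(f"Genes removed as likely pseudogenes: ~{removed:,}")
print(f"Implied pseudogene content of the gene set: ~{implied_pct:.1f}%")
```

Under that assumption the adjustment corresponds to a pseudogene content somewhat below the ~12% quoted for the mouse gene set, consistent with the slide calling the estimate conservative.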

24. Experimental Transcript Skimming as Evidence for High Gene Numbers
- Exon arrays (Dunham et al. 1999)
- Gene arrays (Penn et al. 2000)
- RT-PCR (Das et al. 2001)
- SAGE-tags (Saha et al. 2002, Chen et al. 2002)
- Oligo tiling (Kapranov et al. 2002)
- No novel proteins were submitted to the primary databases
- There is increasing evidence for significant amounts of antisense and other non-ORF transcription in human and mouse
- It now becomes necessary to clone a full-length ORF with the necessary features of gene anatomy, and submit it to the public databases, before the discovery of novel genes can be claimed

25. Human Proteome Sampling with MS/MS Peptide Identification
- 615 from the human heart mitochondria (Taylor et al. 2003)
- 500 from breast cancer cell membranes (Adams et al. 2003)
- 491 from microsomal fractions (Han et al. 2001)
- 490 from blood serum (Adkins et al. 2003)
- 311 from the spliceosome (Rappsilber et al. 2002)
- Total approaches ~10% of human genes
- No reported data on protein prediction confirmation
- Technical caveats on search space for novel gene detection by correlative algorithms
- One novel gene reported from a genome-only peptide match by Kuster et al. in 2001, but this appeared from a high-throughput project later in the same year (Tr Q96DA0)
- Proteomics will have a key impact on characterising the proteome but there is no evidence so far for significant novel gene discovery

26. Gene Numbers for Individual Completed Chromosomes

27. Vertebrate Genome Annotation (VEGA) for Human 14, 20 and 22
- Novel CDSs = where an ORF can be determined
- Novel transcripts = ORFs not frame-fixed by homology
- Putative transcripts = where spliced ESTs define intron/exon boundaries but not an ORF

28. Gene Numbers for Individual Completed Chromosomes
- Averaging the completed chromosomes exceeds Ensembl GP31 genes by ~12%
- This extrapolates to ~28,000 genes
- The five chromosomes still only cover 13% of genome
- The chromosome reports were made at different times using different assemblies and different grades of gene definition and evidence support
- Difficult to explicitly cross-map VEGA vs. Ensembl chromosome gene numbers
- Future status of partial genes unclear
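The extrapolation above can be reproduced in a few lines. This sketch assumes the ~12% excess is applied uniformly to the Ensembl GP31 total of 24,847 quoted on slide 18. The slide gives only the rounded result, so the uniform scaling is an assumption and the variable names are illustrative.

```python
# Figures quoted in the presentation.
ENSEMBL_GP31_GENES = 24_847  # Ensembl total on GP31
EXCESS_FRACTION = 0.12       # curated chromosomes exceed Ensembl by ~12%

# Assumption: the excess seen on five chromosomes (13% of the genome)
# holds uniformly across the whole genome.
extrapolated = ENSEMBL_GP31_GENES * (1 + EXCESS_FRACTION)

print(f"Extrapolated genome-wide total: ~{extrapolated:,.0f}")
```

The result rounds to the ~28,000 quoted on the slide. Because it rests on only 13% of the genome, annotated at different times and to different standards, it is best read as an upper-leaning estimate and not a count.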

29. An Example of Disappearing Novelty
- Cysteine and tyrosine-rich 1 (CYYR1), a novel unpredicted gene on human chromosome 21 (21q21.2), encodes a cysteine and tyrosine-rich protein and defines a new family of highly conserved vertebrate-specific genes (Vitale et al. Gene 2002 May 15;290:141-51)
- Nineteen additional unpredicted transcripts from human chromosome 21 (Reymond et al. Genomics 2002 Jun;79(6):824-32)
Human accessions
- BF676689, 834 bp EST, 21-DEC-2000
- AY061853, 2320 bp, 154 aa, 11-JUN-2002, Reymond et al., Geneva
- AF401639, 2686 bp, 154 aa, 13-JUN-2002, Vitale et al., Bologna
- AAM56646, 154 aa, 20-JUN-2002, US Patent 6368794
- AL833200, 3048 bp (no CDS), 12-JUL-2002, German Genome Project
- AK054581, 1678 bp, 262 aa, 01-AUG-2002, NEDO human project
- BC036761, 2000 bp, 154 aa, 26-AUG-2002, NIH-MGC Project
- ENSG00000166265, 154 aa, Sept 2002
Mouse accessions
- BB621846, 581 bp EST, 26-OCT-2001, RIKEN
- AY061854, 1064 bp, 165 aa, 11-JUN-2002, Reymond et al.
- AF442733, 498 bp, 165 aa, 13-JUN-2002, Vitale et al.

30. Disappearing Novelty (II)

31. Current Numbers From Major Public Gene Sets
- Lower numbers explicitly non-redundant and exclude pseudogenes
- Higher numbers have increasing splice variant content

32. So What Would Constitute Gene Closure?
- The human genome was closed on April 14th but yeast gene number is still not closed after six years
- Comparative genomics will contribute to resolving the mammalian gene sets, e.g. three-way human/mouse/rat
- Closure-by-clone from VEGA
- Proteomic closure by confirming at least one protein splice form from all plausible genes (expression in vitro and detection in vivo?)
- Likely to be remaining grey areas, e.g. transcribed pseudogenes producing truncated proteins, and apparently intact genes that may have undetectable impairments rendering them functionally superfluous and translationally silent
- Grey areas may not be numerically large

33. Conclusions
- The model eukaryotes have not shown post-genomic rises in gene number
- The Ensembl gene number has been essentially flat
- The pseudogene-adjusted Ensembl gene total on a largely complete GP is ~22,000
- The five curated complete chromosomes extrapolate to ~28,000 but leave many "unclosed" annotations
- The massive increase in post-genomic transcript coverage is extending exons but predominantly re-sampling known genes
- Database submissions of novel human genes have slowed to a trickle
- Initial mouse & rat have lower gene numbers than human
- No evidence for large numbers of cryptic smORFs
- Widespread occurrence of non-protein transcripts could explain previous high gene estimates from transcript skimming
- Gene number closure likely to be well below 30,000

34. Acknowledgments
- Paul Kersey for IPI figures
- Lucas Wagner of the NCBI for the retrospective UniGene data
- Arek Kasprzyk of the EBI for historical and preview Ensembl release statistics
- Numerous other people at NCBI, EBI, and Sanger Centre who have graciously answered queries on their data collections
- The OGS Proteome Discovery Team for useful discussions
