The wonderful complexity of the human transcriptome

Danielle and Jean Thierry-Mieg, N.C.B.I., http://www.aceview.org

Multiply connecting annotation
Figure labels: GENE, linked to the following annotations: Phenotype and in vivo function, GO biol. proc.; Protein family, GO molecular function; Gene-gene or protein-protein interactions; Regulation; Alt splicing; Position; Promoters, post transcr; Protein motifs; Protein conservation and orthologs; Level of expression (cDNA counts, microarrays); Cellular compartment; Pattern of expression

How could we find the true genes?
• Use the experimental cDNA data,
• All the cDNA data from public databases,
• Only the cDNA data (no prediction)
Trace each and every cDNA from the public databases to the very place on the genome where it was transcribed. Automatically deduce the genes from these alignments.

Methods

Tools and hardware
• Alignment software: AceView (T-M2 96-)
• Not Blast, Blat
• Database: Acedb (Thierry-Mieg & Durbin, 89-)
• Not Oracle/Sybase
• Interactive graphics at all scales
• Human intervention:
  • debug once, fix everywhere
• Hardware:
  • 2 or 3 days elapsed per human build to align the genes + a week to annotate the function using BLASTP/TaxBlast/PFAM/PSORT/LocusLink/OMIM

AceView alignment specificities
• 1: Masking replaced by seed on rare words
• 2: Auto-adaptive word hit extension
• 3: Coalign the intron boundaries
• 4: Team jump the loose ends
• 5: Trim the read pairs
• 6: Flag and mask the clone anomalies
• 7: Aggressive clean up
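Point 1 above (seeding on rare words instead of masking repeats) can be illustrated with a small sketch. This is not AceView's code: the word length `k`, the threshold `max_count`, and the function names are assumptions chosen for the example. The idea shown is only that words occurring too often in the genome are never used as seeds, so repeats are avoided without masking the sequence.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every word of length k in seq, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def rare_word_index(genome, k=4, max_count=1):
    """Index genome positions of words seen at most max_count times.

    k and max_count are illustrative values, not AceView parameters.
    """
    counts = Counter(kmers(genome, k))
    index = defaultdict(list)
    for pos, word in enumerate(kmers(genome, k)):
        if counts[word] <= max_count:
            index[word].append(pos)
    return index


def seeds(read, index, k=4):
    """Return (read_offset, genome_position) pairs for rare-word hits."""
    return [(i, pos)
            for i, word in enumerate(kmers(read, k))
            for pos in index.get(word, ())]


if __name__ == "__main__":
    genome = "AAAAAAAACGTGAAAAAAAA"
    idx = rare_word_index(genome)
    # The repeated word AAAA is never indexed, so it cannot seed a hit.
    print(seeds("AAAACGTG", idx))
```

In the toy genome, the read only seeds on the unique words around `CGTG`, and all seeds fall on the same diagonal, which is what a later extension step would build on.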

4: Team jump first and last exons
A good alignment drives the noisy neighbors

6: Mask suspect cDNAs
• We flagged suspected internal deletions in ~4% of all aligned clones
• Flag internal priming
• Reassess the strand
• Search and clip vectors
• Identify mosaic and rearranged clones

7: Clean up strategy
Problem: the genome is full of repeats
• We align the RNA in the most compact way
• We measure (aligned length – errors) at each site, compare, keep only the best site by 5 bp.
• As a result, only 0.95% of the RefSeqs, 1.4% of the mRNAs and 2.2% of the ESTs cannot be attributed to a single gene.
• About 1% of the transcribed genes are repeated.
• This ‘cleanup’ procedure automatically excludes most non-transcribed pseudogenes.
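The site-selection rule of the clean up step can be sketched as follows. This is a minimal illustration under stated assumptions, not the AceView implementation: the `Hit` record and `pick_best_site` are hypothetical names, and "best by 5 bp" is read here as "the top score must beat the runner-up by at least 5"; a sequence that fails this test is treated as not attributable to a single site.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One candidate alignment of a cDNA at one genomic site (hypothetical record)."""
    site: str
    aligned_len: int
    errors: int

    @property
    def score(self) -> int:
        # The slide's measure: aligned length minus errors.
        return self.aligned_len - self.errors


def pick_best_site(hits: List[Hit], margin: int = 5) -> Optional[Hit]:
    """Keep the best-scoring site only if it wins by at least `margin`.

    Returns None when there is no hit or when the two best sites are
    too close to call (an assumed reading of "best by 5 bp").
    """
    if not hits:
        return None
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    if len(ranked) == 1 or ranked[0].score - ranked[1].score >= margin:
        return ranked[0]
    return None


if __name__ == "__main__":
    hits = [Hit("gene", 1000, 2), Hit("pseudogene", 990, 30)]
    best = pick_best_site(hits)
    print(best.site if best else "ambiguous")
```

With this reading, a diverged pseudogene copy loses to the transcribed gene because it accumulates more mismatches, while two exact repeats stay ambiguous.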

Post alignment clustering
• We now have about 5M clones aligned on the genome
• Goal:
  • Recognize the genes (sets of transcripts sharing an intron boundary)
  • Distinguish the alternative transcripts

Transcript consistency graph
• Each clone is a vertex
• Draw a GREEN ARC if 2 clones A & B share a genomic base (friends)
• Draw a RED ARC if an intron of A matches an exon of B (foes)
Gene <=> green connected component
Transcript <=> maximal pacific subgraph, i.e. friendly connected with no foes

A and BC are the 2 maximal pacific subgraphs (figure: clones A, B and C)
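The consistency graph can be made concrete with a small sketch. The slides do not give the actual algorithm, so everything below is an assumption for illustration: clones are lists of exons with inclusive coordinates, "share a genomic base" is read as exon overlap, "an intron of A matches an exon of B" is read as intron/exon overlap, and the maximal pacific subgraphs are found by brute force, which is only practical for toy examples.

```python
from itertools import combinations


def introns(exons):
    """Gaps between consecutive exons (inclusive coordinates)."""
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(exons, exons[1:])
            if b_start - a_end > 1]


def any_overlap(xs, ys):
    return any(x[0] <= y[1] and y[0] <= x[1] for x in xs for y in ys)


def green(a, b):
    """Friends: the two clones share at least one exonic base."""
    return any_overlap(a, b)


def red(a, b):
    """Foes: an intron of one clone overlaps an exon of the other."""
    return any_overlap(introns(a), b) or any_overlap(introns(b), a)


def components(names, adjacent):
    """Connected components of the graph induced on `names`."""
    seen, out = set(), []
    for n in names:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(w for w in names if w not in comp and adjacent(v, w))
        seen |= comp
        out.append(frozenset(comp))
    return out


def genes(clones):
    """Gene = green connected component."""
    return components(list(clones), lambda a, b: green(clones[a], clones[b]))


def transcripts(clones, gene):
    """Maximal green-connected subsets of a gene with no red arc (brute force)."""
    found = []
    members = sorted(gene)
    for size in range(len(members), 0, -1):
        for subset in combinations(members, size):
            s = frozenset(subset)
            if any(s < p for p in found):
                continue  # contained in a larger pacific set, so not maximal
            if any(red(clones[a], clones[b]) for a, b in combinations(subset, 2)):
                continue
            if len(components(list(subset), lambda a, b: green(clones[a], clones[b]))) == 1:
                found.append(s)
    return found


if __name__ == "__main__":
    clones = {
        "A": [(100, 200), (300, 400)],
        "B": [(100, 200), (250, 400)],
        "C": [(260, 400)],
    }
    for g in genes(clones):
        print(sorted(g), [sorted(t) for t in transcripts(clones, g)])
```

In this toy case the three clones form one gene, and the two transcripts are A alone and B with C, because the intron of A overlaps exons of both B and C.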

Alignment Quality

How good are the alignments?
Build 35/hg17                 mRNA               EST                  RefSeq
% aligned and not filtered    91.1% (191,058)    75.9% (5,699,664)    99.7% (23,973)
% length aligned              98.8%              93.3%                99.2%
% identity                    99.78%             98.16%               99.90%
• AceView alignments of RNAs and ESTs on the genome are highly reliable

In 31 ENCODE test regions, 1,556 models have the same intron-exon structure in at least two of the seven tracks RefSeq, Known Gene, Ensembl, Gencode, AceView, ECgene and ExonWalk.

Gene counts

How many genes?
• We align individually on the human genome 4,523,877 cDNA sequences from public databases (August 2005) and cluster them into
  • 57,882 main genes (about 150,000 proteins)
  • 40,567 putative genes
  • and 251,183 “cloud” objects.

Classification of the 57,882 main genes from GenBank cDNAs

There are only 17,789 genes from the 23,459 RefSeq
• RefSeq does not aim at completion:
  • it only represents 31% of the genes in GenBank,
  • with an average of 1.34 alternative variants per gene, it only shows 10% of the alternative transcripts submitted to the public databases.
• It is clearly of high quality: its sequences differ from the genome in only 66,529 positions (average 2 errors per 2840 bp/NM)
• yet it represents a biased selection

RefSeq prefers large protein-coding genes

RefSeq prefers genes with introns

RefSeq prefers conserved genes (example: genes with conserved Pfam motif)

Intron structure
We are glad to distribute exon-exon junction sequences for the 226,000 cDNA-supported exon-exon junctions

Human intron sizes

Alternative Splicing

Of various variations… (figure labels: promoter, last exon)

Alternative splicing is heavily used to generate protein diversity
• 77% of human spliced genes with >2 clones (24,709 genes) have alternative splicing or alternative promoters/last exon
• The 44,749 “coding” genes produce 186,752 alternative variants putatively encoding products > 100 aa.
• 141,207 of these variants (from 36,311 genes) are fully supported by single identified clones

Are we close to having them all? NO!
• The more clones, the more variants.
• It does not seem to saturate in human
• But it does in worm

For comparison, % genes with a single variant as a function of number of cDNA clones in worm and human
187 exceptional human genes, highly expressed, are not subject to alternative splicing (out of 8111 genes with more than 128 cDNA clones)

Non-redundant list of best full-length cDNA clones (data from Dec 2004)

Genome organization

Some genes are in antisense, and might use this as a means of negative co-regulation

How frequent?
         Total genes assessed    # genes with antisense    %
Worm     14154                   1179                      8%
Human    31532                   8952                      28%
• Counted only genes with standard introns in antisense to genes with standard introns.
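The counting rule stated on this slide can be sketched as below. The slides do not define "in antisense" operationally, so this sketch assumes it means two genes on opposite strands whose genomic spans overlap; the `Gene` record and both function names are hypothetical. Only genes with standard introns are counted, on both sides of the pair, as the slide specifies.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record with inclusive genomic coordinates."""
    name: str
    strand: str  # "+" or "-"
    start: int
    end: int
    has_standard_intron: bool


def antisense_genes(genes: List[Gene]) -> Set[str]:
    """Names of intron-containing genes overlapping one on the opposite strand."""
    eligible = [g for g in genes if g.has_standard_intron]
    found: Set[str] = set()
    for i, a in enumerate(eligible):
        for b in eligible[i + 1:]:
            overlap = a.start <= b.end and b.start <= a.end
            if a.strand != b.strand and overlap:
                found.update((a.name, b.name))
    return found


def percent_antisense(genes: List[Gene]) -> float:
    """Share of assessed (intron-containing) genes that have an antisense partner."""
    eligible = [g for g in genes if g.has_standard_intron]
    if not eligible:
        return 0.0
    return 100.0 * len(antisense_genes(genes)) / len(eligible)


if __name__ == "__main__":
    genes = [
        Gene("g1", "+", 100, 500, True),
        Gene("g2", "-", 400, 900, True),
        Gene("g3", "+", 2000, 3000, True),
        Gene("g4", "-", 2500, 2600, False),  # no standard intron: not counted
    ]
    print(sorted(antisense_genes(genes)), round(percent_antisense(genes), 1))
```

The percentages on the slide follow the same ratio: 1179 of 14154 worm genes and 8952 of 31532 human genes.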

Antisense involves coding as well as non coding sequences

The cloud

Cloud genes have no standard intron and do not obviously encode a protein. They cover 5% of the genome.

The origin of the cloud… plain artefact or not?
                      Under a gene with introns                       In between such genes
% length of genome    50%                                             50%
% cloud genes         77% (2/3 sense strand, 1/3 antisense strand)    23%
• Cloud genes tend to concentrate in introns of spliced genes and to avoid intergenic regions, similar to chip results of Tom Gingeras (Affymetrix)

Detection of proteins
• cDNAs give experimental evidence for the transcription, but we have very little evidence about translation.
• Where to start: which ATG? NTG? ANG?
• How many products per mRNA?
The set of all cDNA-supported AceView products is ready for download. Preliminary tests show that it increases the number of human mass spectra that can be recognized.

Applications
• Full-length cDNA collections
• ORFeome/secretome project
• DNA chip design
• Mass Spec
• MAQC chip reproducibility project
and we are happy to provide help on any complex question related to the human transcriptome

The Secretome project
• Using PSORT (from Kenta Nakai) we annotate all the AceView products: thousands of the short new proteins contain a signal peptide and are likely secreted.
• PFAM motifs identify families of secreted or extracellular proteins, acting at a distance (growth factors, hormones etc.)
• GO annotation (inherited from GO/LocusLink) adds a few
With Marc Vidal, we are trying to clone in Gateway a sample of 2000 complete, potentially secreted proteins, half of which are new genes.

Microarray design
• At a user’s request, we used AceView to select the regions in each gene most shared by the alternative transcripts
• A recently released Array-it microarray was designed from that work
• We would be happy to collaborate on new designs targeting for instance
  • The 226,000 mRNA-confirmed intron boundaries
  • The alternatively spliced forms (through their alternative exons and exon-exon boundaries)
  • The 9123 genes with introns antisense to genes with introns...

The MAQC Project: MicroArray Quality Control
Leming Shi et al, FDA (http://edkb.fda.gov/MAQC/, Leming.Shi@fda.hhs.gov)
Figure labels: Calibrated RNA Samples A, B; QRT-PCR; Microarrays; Other Technologies; QRT-PCR Datasets; Microarray Datasets; QC Metrics & Thresholds
Further labels: Identification and correction of procedural failures; User; Accuracy; Systematic biases; Precision; Cross-lab/platform comparability; Evaluation of data analysis methods

Many thanks to
• NCBI systems for their excellent support
• The Psort2, Pfam, Blast, TaxBlast, OMIM and LocusLink developers, for their great tools and public annotations.
• Adam Lowe, Vahan Simonyan, Mark Sienkiewicz, who collaborated on AceView one year each over the past 5 years
• Maggie Cam, Mark Reimers, Leming Shi, Damir Herman for introducing us to microarrays
• Yuji Kohara, Sumio Sugano, Yutaka Suzuki for continued collaboration on cDNAs

How to query AceView
www.aceview.org
www.ncbi.nlm.nih.gov/IEB/Research/acembly
Enter a query: PTEN, FGF, mitotic spindle, AF344604…
We search everywhere using precomputed word triplets XXX and combs X_X_X and return the lists of all related genes. On exact matches, the search either stops or continues according to a tuneable heuristic.
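A search over precomputed triplets and combs could look roughly like the sketch below. The slide gives only the pattern shapes (XXX and X_X_X), so the rest is assumed for illustration: keys are taken over lowercase letters of each annotation word, a gene matches when it carries every key of the query, and the names `keys`, `build_index` and `search` are hypothetical. The tuneable stop-or-continue heuristic on exact matches is not modelled.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def keys(word: str) -> Set[str]:
    """Triplets (XXX) and combs (X_X_X) of a word, tagged to keep them apart."""
    w = word.lower()
    triplets = {"T:" + w[i:i + 3] for i in range(len(w) - 2)}
    combs = {"C:" + w[i] + w[i + 2] + w[i + 4] for i in range(len(w) - 4)}
    return triplets | combs


def build_index(annotations: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Precompute key -> set of gene names from each gene's annotation words."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for gene, words in annotations.items():
        for word in words:
            for k in keys(word):
                index[k].add(gene)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Genes whose indexed words contain every key of the query."""
    wanted = keys(query)
    if not wanted:
        return []  # query shorter than 3 letters: nothing to look up
    hits: Dict[str, int] = defaultdict(int)
    for k in wanted:
        for gene in index.get(k, ()):
            hits[gene] += 1
    return sorted(g for g, n in hits.items() if n == len(wanted))


if __name__ == "__main__":
    annotations = {
        "PTEN": ["PTEN", "phosphatase"],
        "FGF2": ["FGF2", "fibroblast", "growth"],
    }
    index = build_index(annotations)
    print(search(index, "fgf"), search(index, "phosphat"))
```

Because the keys are precomputed, a query costs a handful of dictionary lookups whatever the size of the annotation set; the price is occasional false positives, since matching all keys does not guarantee a contiguous match.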

Chromosomes are heterogeneous

All exact repeats where both copies have introns are intra chromosomal
