The wonderful complexity of the human transcriptome



Presentation Transcript


  1. The wonderful complexity of the human transcriptome. Danielle and Jean Thierry-Mieg, NCBI. http://www.aceview.org

  2. Multiply connecting annotation. Diagram with GENE at the center, linked to: • Phenotype and in vivo function, GO biol. proc. • Protein family, GO molecular function • Gene-gene or protein-protein interactions • Regulation • Alt splicing position • Promoters, post transcr • Protein motifs • Protein conservation and orthologs • Level of expression (cDNA counts, microarrays) • Cellular compartment • Pattern of expression

  3. How could we find the true genes? • Use the experimental cDNA data • All the cDNA data from public databases • Only the cDNA data (no prediction). Trace each and every cDNA from the public databases to the very place on the genome where it was transcribed. Automatically deduce the genes from these alignments.

  4. Methods

  5. Tools and hardware • Alignment software: AceView (T-M2 96-), not Blast or Blat • Database: Acedb (Thierry-Mieg & Durbin, 89-), not Oracle/Sybase • Interactive graphics at all scales • Human intervention: debug once, fix everywhere • Hardware: 2 or 3 days elapsed per human build to align the genes, plus a week to annotate the function using BLASTP/TaxBlast/PFAM/PSORT/LocusLink/OMIM

  6. AceView alignment specificities • 1: Masking replaced by seed on rare words • 2: Auto-adaptive word hit extension • 3: Coalign the intron boundaries • 4: Team jump the loose ends • 5: Trim the read pairs • 6: Flag and mask the clone anomalies • 7: Aggressive clean up

  7. 4: Team jump first and last exons. A good alignment drives the noisy neighbors.

  8. 6: Mask suspect cDNAs • We flagged suspected internal deletions in ~4% of all aligned clones • Flag internal priming • Reassess the strand • Search and clip vectors • Identify mosaic and rearranged clones

  9. 7: Clean up strategy. Problem: the genome is full of repeats • We align the RNA in the most compact way • We measure (aligned length – errors) at each site, compare, and keep only the best site by 5 bp • As a result, only 0.95% of the RefSeqs, 1.4% of the mRNAs and 2.2% of the ESTs cannot be attributed to a single gene • About 1% of the transcribed genes are repeated • This "cleanup" procedure automatically excludes most non-transcribed pseudogenes
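The site-selection rule on this slide can be illustrated with a minimal Python sketch. It is not the AceView code: the `Hit` class, the `best_site` function and the reading of "best by 5 bp" as "the top score must exceed the runner-up by at least 5" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One candidate alignment of a cDNA on the genome (hypothetical structure)."""
    site: str         # label of the genomic locus
    aligned_len: int  # number of cDNA bases aligned at this site
    errors: int       # mismatches and indels at this site

    @property
    def score(self) -> int:
        # The slide's measure: aligned length minus errors.
        return self.aligned_len - self.errors


def best_site(hits: List[Hit], margin: int = 5) -> Optional[Hit]:
    """Return the hit that beats every other site by at least `margin`.

    Returns None when no site wins clearly, i.e. the cDNA cannot be
    attributed to a single gene (assumed interpretation of the slide).
    """
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1 or ranked[0].score - ranked[1].score >= margin:
        return ranked[0]
    return None


# A gene and a slightly diverged pseudogene copy: the gene wins by 19.
clear = best_site([Hit("gene", 1000, 3), Hit("pseudogene", 990, 12)])
# Two near-identical repeats: the scores differ by 2, so no site is kept.
ambiguous = best_site([Hit("copy1", 500, 2), Hit("copy2", 500, 4)])
```

Under this reading, a diverged pseudogene copy loses to its parent gene because it accumulates more errors, which is how the cleanup would exclude most non-transcribed pseudogenes.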

  10. Post-alignment clustering • We now have about 5M clones aligned on the genome • Goal: recognize the genes (sets of transcripts sharing an intron boundary) and distinguish the alternative transcripts

  11. Transcript consistency graph • Each clone is a vertex • Draw a GREEN ARC if 2 clones A and B share a genomic base (friends) • Draw a RED ARC if an intron of A matches an exon of B (foes) • Gene <=> green connected component • Transcript <=> maximal pacific subgraph, i.e. friendly connected with no foes

  12. A and BC are the 2 maximal pacific subgraphs (figure with vertices A, B, C)
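The consistency graph of slides 11 and 12 can be sketched in Python as follows. This is an illustrative toy, not the AceView implementation. It assumes that each clone is a list of exons given as closed (start, end) intervals, that "share a genomic base" means two exons overlap, and that an intron "matches" an exon when the two overlap. The coordinates are invented so that the result reproduces the A / BC example, and the subset enumeration is exponential, so it only suits tiny genes.

```python
from itertools import combinations


def overlaps(a, b):
    """True if closed intervals a and b share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def introns(exons):
    """Gaps between consecutive exons of one clone."""
    exons = sorted(exons)
    return [(left[1] + 1, right[0] - 1) for left, right in zip(exons, exons[1:])]


def friends(a, b):
    """Green arc: the two clones share an exonic base (assumed reading)."""
    return any(overlaps(x, y) for x in a for y in b)


def foes(a, b):
    """Red arc: an intron of one clone overlaps an exon of the other."""
    return (any(overlaps(i, e) for i in introns(a) for e in b)
            or any(overlaps(i, e) for i in introns(b) for e in a))


def components(names, linked):
    """Connected components of `names` under the symmetric relation `linked`."""
    seen, result = set(), []
    for start in names:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(w for w in names if w not in comp and linked(v, w))
        seen |= comp
        result.append(comp)
    return result


def maximal_pacific(clones, gene):
    """Maximal green-connected subsets of `gene` that contain no red arc."""
    green = lambda a, b: friends(clones[a], clones[b])
    found = []
    names = sorted(gene)
    for size in range(len(names), 0, -1):  # largest subsets first
        for subset in combinations(names, size):
            s = set(subset)
            if any(s <= p for p in found):
                continue  # contained in a larger pacific subgraph
            if any(foes(clones[a], clones[b]) for a, b in combinations(subset, 2)):
                continue
            if len(components(subset, green)) == 1:
                found.append(s)
    return found


# Invented coordinates: B's and C's shared exon lies inside A's intron.
clones = {
    "A": [(100, 200), (400, 500)],
    "B": [(100, 200), (300, 350)],
    "C": [(300, 350), (600, 700)],
}
genes = components(sorted(clones), lambda a, b: friends(clones[a], clones[b]))
transcripts = maximal_pacific(clones, genes[0])
```

Here the three clones form a single gene (one green component), and the two transcripts are {A} and {B, C}, matching the figure on slide 12.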

  13. Alignment Quality

  14. How good are the alignments? (build 35/hg17)
                                    mRNA       EST          RefSeq
      Number of sequences           191,058    5,699,664    23,973
      % aligned and not filtered    91.1%      75.9%        99.7%
      % length aligned              98.8%      93.3%        99.2%
      % identity                    99.78%     98.16%       99.90%
      • AceView alignments of RNAs and ESTs on the genome are highly reliable

  15. In 31 ENCODE test regions, 1,556 models have the same intron-exon structure in at least two of the seven tracks RefSeq, Known Gene, Ensembl, Gencode, AceView, ECgene and ExonWalk.

  16. Gene counts

  17. How many genes? • We align individually on the human genome 4,523,877 cDNA sequences from public databases (August 2005) and cluster them into • 57,882 main genes (about 150,000 proteins) • 40,567 putative genes • and 251,183 "cloud" objects.

  18. Classification of the 57,882 main genes from GenBank cDNAs

  19. There are only 17,789 genes from the 23,459 RefSeq • RefSeq does not aim at completion: it only represents 31% of the genes in GenBank • With an average of 1.34 alternative variants per gene, it only shows 10% of the alternative transcripts submitted to the public databases • It is clearly of high quality: the RefSeqs differ from the genome in only 66,529 positions (average 2 errors per 2840 bp/NM) • Yet it represents a biased selection
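As a quick arithmetic check of the figures on this slide, the short Python sketch below recomputes two ratios from the quoted counts. It assumes that "the genes in GenBank" refers to the 57,882 main genes of slides 17 and 18; the variable names are illustrative only.

```python
main_genes = 57_882      # main genes from GenBank cDNAs (slides 17 and 18)
refseq_genes = 17_789    # genes represented in RefSeq (this slide)
refseq_records = 23_459  # RefSeq sequences (this slide)
differences = 66_529     # positions where RefSeqs differ from the genome

# Share of the main genes that have a RefSeq: about 31%, as stated.
share_of_genes = refseq_genes / main_genes

# Ratio of the two counts quoted on the slide: differences per RefSeq record.
diffs_per_refseq = differences / refseq_records

print(f"{share_of_genes:.1%}")    # 30.7%
print(f"{diffs_per_refseq:.2f}")  # 2.84
```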

  20. RefSeq prefers large protein-coding genes

  21. RefSeq prefers genes with introns

  22. RefSeq prefers conserved genes (example: genes with conserved Pfam motif)

  23. Intron structure. We are glad to distribute the sequences of the 226,000 cDNA-supported exon-exon junctions.

  24. Human intron sizes

  25. Alternative Splicing

  26. Of various variations… (figure labels: last exon, promoter)

  27. Alternative splicing is heavily used to generate protein diversity • 77% of human spliced genes with >2 clones (24,709 genes) have alternative splicing or alternative promoters/last exons • The 44,749 "coding" genes produce 186,752 alternative variants putatively encoding products > 100 aa • 141,207 of these variants (from 36,311 genes) are fully supported by single identified clones

  28. Are we close to having them all? NO! • The more clones, the more variants. • It does not seem to saturate in human • But it does in worm

  29. For comparison: % of genes with a single variant as a function of the number of cDNA clones, in worm and human. 187 exceptional human genes, highly expressed, are not subject to alternative splicing (out of 8,111 genes with more than 128 cDNA clones).

  30. Non-redundant list of the best full-length cDNA clones (data from December 2004)

  31. Genome organization

  32. Some genes are in antisense, and might use this as a means of negative co-regulation

  33. How frequent?
               Total genes assessed    # genes with antisense    %
      Worm     14,154                  1,179                     8%
      Human    31,532                  8,952                     28%
      • Counted only genes with standard introns in antisense to genes with standard introns.

  34. Antisense involves coding as well as non coding sequences

  35. The cloud

  36. The cloud. Cloud genes have no standard intron and do not obviously encode a protein. They cover 5% of the genome.

  37. The origin of the cloud… plain artefact or not?
                            Under a gene with introns                        In between such genes
      % length of genome    50%                                              50%
      % cloud genes         77% (2/3 sense strand, 1/3 antisense strand)     23%
      • Cloud genes tend to concentrate in introns of spliced genes and to avoid intergenic regions, similar to chip results of Tom Gingeras (Affymetrix)

  38. Detection of proteins • cDNAs give experimental evidence for transcription, but we have very little evidence about translation • Where to start: which ATG? NTG? ANG? • How many products per mRNA? The set of all cDNA-supported AceView products is ready for download. Preliminary tests show that it increases the number of human mass spectra that can be recognized.

  39. Applications • Full-length cDNA collections • ORFeome/secretome project • DNA chip design • Mass spec • MAQC chip reproducibility project. We are happy to provide help on any complex question related to the human transcriptome.

  40. The Secretome project • Using PSORT (from Kenta Nakai) we annotate all the AceView products: thousands of the short new proteins contain a signal peptide and are likely secreted • PFAM motifs identify families of secreted or extracellular proteins acting at a distance (growth factors, hormones, etc.) • GO annotation (inherited from GO/LocusLink) adds a few. With Marc Vidal, we are trying to clone in Gateway a sample of 2,000 complete, potentially secreted proteins; half of them come from new genes.

  41. Microarray design • At a user's request, we used AceView to select the regions in each gene most shared by the alternative transcripts • A recently released Array-it microarray was designed from that work • We would be happy to collaborate on new designs targeting, for instance: • The 226,000 mRNA-confirmed intron boundaries • The alternatively spliced forms (through their alternative exons and exon-exon boundaries) • The 9,123 genes with introns antisense to genes with introns...

  42. The MAQC Project: MicroArray Quality Control. Leming Shi et al., FDA. http://edkb.fda.gov/MAQC/ Leming.Shi@fda.hhs.gov. Diagram labels: Calibrated RNA samples A, B • QRT-PCR • Microarrays • Other technologies • QRT-PCR datasets • Microarray datasets • QC metrics & thresholds • Identification and correction of procedural failures • User • Accuracy • Systematic biases • Precision • Cross-lab/platform comparability • Evaluation of data analysis methods

  43. Many thanks to • NCBI systems for their excellent support • The Psort2, Pfam, Blast, TaxBlast, OMIM and LocusLink developers, for their great tools and public annotations • Adam Lowe, Vahan Simonyan, Mark Sienkiewicz, who collaborated on AceView one year each over the past 5 years • Maggie Cam, Mark Reimers, Leming Shi, Damir Herman for introducing us to microarrays • Yuji Kohara, Sumio Sugano, Yutaka Suzuki for continued collaboration on cDNAs

  44. How to query AceView. www.aceview.org www.ncbi.nlm.nih.gov/IEB/Research/acembly Enter a query: PTEN, FGF, mitotic spindle, AF344604… We search everywhere using precomputed word triplets XXX and combs X_X_X and return the lists of all related genes. On exact matches, the search either stops or continues according to a tuneable heuristic.
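The slide does not say how the triplet and comb index is built, so the Python sketch below shows only one possible reading. It assumes that a "triplet" XXX is three consecutive letters of a word and that a "comb" X_X_X is three letters taken at every other position. The function names, the toy vocabulary and the rule of intersecting all keys are hypothetical and do not describe the AceView search engine.

```python
from collections import defaultdict


def triplets(word):
    """Contiguous letter triplets (XXX) of a word, lower-cased."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}


def combs(word):
    """Spaced letter triplets (X_X_X): positions i, i+2 and i+4."""
    w = word.lower()
    return {f"{w[i]}_{w[i + 2]}_{w[i + 4]}" for i in range(len(w) - 4)}


def build_index(gene_words):
    """Precompute key -> set of genes for every word attached to a gene."""
    index = defaultdict(set)
    for gene, words in gene_words.items():
        for word in words:
            for key in triplets(word) | combs(word):
                index[key].add(gene)
    return index


def search(index, query):
    """Genes whose indexed words contain every key of the query."""
    keys = triplets(query) | combs(query)
    hits = [index.get(key, set()) for key in keys]
    return set.intersection(*hits) if hits else set()


# Toy vocabulary, invented for the example.
index = build_index({
    "PTEN": ["PTEN", "phosphatase", "tensin"],
    "FGF1": ["FGF1", "fibroblast", "growth"],
})
pten_hits = search(index, "pten")
growth_hits = search(index, "growth")
no_hits = search(index, "xyz")
```

Requiring every key to match is the strictest possible choice; a tuneable heuristic like the one mentioned on the slide could instead rank genes by the number of shared keys.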

  45. Chromosomes are heterogeneous

  46. All exact repeats where both copies have introns are intra chromosomal
