
Comparative genome analysis

Comparative genome analysis: hard data and soft interpretations? Peer Bork, EMBL & MDC, Heidelberg & Berlin. bork@embl-heidelberg.de, http://www.bork.embl-heidelberg.de/. Sequenced eukaryotic genomes (Bork and Copley Nature 409(01)818). Sources of uncertainties (human genome draft).


Presentation Transcript


  1. Comparative genome analysis Hard data and soft interpretations? Peer Bork EMBL & MDC Heidelberg & Berlin bork@embl-heidelberg.de http://www.bork.embl-heidelberg.de/

  2. Sequenced eukaryotic genomes Bork and Copley Nature 409(01)818 www.bork.embl-heidelberg.de

  3. Sources of uncertainties (human genome draft) • Sequence coverage • Assembly accuracy • Polymorphism • Sequence accuracy • Annotation accuracy www.bork.embl-heidelberg.de

  4. 70% prediction accuracy is great!

  5. Comparative genome analysis (outline) • Prediction of genes and pseudogenes • Homology-based function prediction • Context-based function prediction 1. Co-occurrence of genes • Context-based function prediction 2. Gene neighbourhood

  6. Number of human genes in time [Figure: estimated number of human genes in thousands (y-axis, 0 to 120) from Feb00 to Apr01 (x-axis), overlaid with the NEMAX50 index (2T to 10T). Labelled sources: HGS, Incyte and co; HGS; textbooks, public opinion; Celera; HGP; others. Plotted values include 52, 39, 38, 32, 27, 24 and 22. One annotation marks the basis for the Feb 01 publications.]

  7. Hunting for pseudogenes: homology search of all human intergenic regions. HUMAN GENOME (3.3·10^9 nucleotides) → masking for repetitive elements and ENSEMBL sequences → 1.4·10^6 DNA fragments → BLASTX vs nr95 protein db (cutoff E < e-8) → 4.4·10^4 DNA fragments → filtering of query and database for low complexity regions → 3.6·10^4 DNA fragments → BLASTX vs nr95 protein db (cutoff E < e-8) → 2.3·10^4 DNA fragments → merging and extension of fragments; construction of gene structure; BLASTX vs ENSEMBL database; removal of all virus-derived sequences → 12526 elements (pseudogenes or genes) with sequence similarity to known proteins
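
The E-value filtering steps of slide 7 can be illustrated with a minimal sketch. It assumes the hits are available in the 12-column tabular layout that BLAST+ writes with `-outfmt 6` (E-value in the eleventh column). The function name, the fragment IDs and the example hits are hypothetical and are not taken from the presentation. Only the cutoff E < e-8 comes from the slide.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-8  # cutoff quoted on the slide (E < e-8)


def fragments_with_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return the IDs of query fragments with at least one hit below the cutoff.

    Assumes 12 tab-separated columns per hit:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    kept = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, evalue = row[0], float(row[10])
        if evalue < cutoff:
            kept.add(query_id)
    return kept


# Hypothetical hits, invented for illustration only.
EXAMPLE_HITS = "\n".join([
    "frag1\tprotA\t62.0\t150\t57\t0\t1\t450\t10\t159\t3e-40\t160",
    "frag2\tprotB\t31.0\t80\t55\t2\t1\t240\t5\t84\t0.002\t35",
    "frag3\tprotC\t40.0\t90\t54\t1\t1\t270\t1\t90\t1e-5\t48",
    "frag3\tprotD\t55.0\t120\t54\t0\t1\t360\t1\t120\t2e-12\t70",
    "frag4\tprotE\t45.0\t100\t55\t0\t1\t300\t1\t100\t1e-8\t58",
])

if __name__ == "__main__":
    print(sorted(fragments_with_hits(EXAMPLE_HITS)))
```

A fragment is kept as soon as any one of its hits passes, so `frag3` survives on its second hit, while `frag4` is dropped because the comparison is strict.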

  8. Synonymous/non-synonymous (dS/dN) substitution rates of functional and pseudogenic human sequences [Figure: histogram of % of sequences (y-axis, 0 to 30) against log (dS/dN) (x-axis, 0 to >1.2) for three sets: pseudogenes reference set (856 seq.), SWISSPROT (1935 seq.), RefSeq (1103 seq.).]

  9. Synonymous/non-synonymous (dS/dN) substitution rates of unannotated regions with homology to known genes. 693 novel genes detected; >4300 expected in our set. [Figure: histogram of % of sequences (y-axis, 0 to 16) against log (dS/dN) (x-axis, 0 to >1.2), divided into regions labelled PSEUDOGENES, UNCERTAIN and GENES. Counts shown: analyzed 693 (19%) + 1858 (50%) + 1161 (31%) = 3712; total 4321 + 8205 = 12526.]

  10. E value distribution of pseudogenic, uncertain and functional exons (BLASTX vs nr95 database), 3712 sequences [Figure: counts (y-axis, 0 to 300) per E value bin (x-axis, from < e-180 to e-8) for three series: pseudogenes, uncertain, functional.]

  11. Comparative genome analysis (outline) • Prediction of genes and pseudogenes • Homology-based function prediction • Context-based function prediction 1. Co-occurrence of genes • Context-based function prediction 2. Gene neighbourhood

  12. Mycoplasma pneumoniae predictions

  13. Molecular functions have to be defined on a domain basis, i.e. separately for each structurally independent unit within a sequence (Henikoff et al. 1997 Science 278, 609)

  14. SMART: Blast-like input • ID or AC sufficient • Access to different databases • Domain annotation www.smart.embl-heidelberg.de

  15. SMART: Digested output • Signal sequence • Transmembrane regions • Comparison of domain context www.smart.embl-heidelberg.de

  16. Domain organization of TAP [Figure: domain diagram of TAP (619 aa) with four LRR units, an NTF2-like domain and a UBA domain; regions labelled RNA-binding, p15-binding and np-bind.; p15 (100 aa) shown with an NTF2-like domain. Annotations: random mutagenesis, directed mutagenesis.] Collaboration with Elisa Izaurralde

  17. Directed mutagenesis confirms predicted TAP/p15 interaction • Red: loss of binding • Gray: alanine scan • Blue: no effect on binding

  18. Top 10 domains* in human. Species: man, fly, worm, yeast, cress • Total no. of genes: 6100, 13300, 18200, 25700, 26500 (26500) • Immunoglobulin: 765 (381), 1, 140, 64, 0 • C2H2 zinc finger: 115, 357, 151, 48, 706 (607) • Protein kinase: 1049, 319, 437, 121, 575 (501) • Rhod.-like GPCR: 16, 97, 358, 0, 569 (616) • P-loop NTPase: 331, 198, 183, 97, 433 • Rev. transcriptase: 80, 10, 50, 6, 350 • RRM (RNA-binding): 255, 157, 96, 54, 300 (224) • WD40 (G-protein): 210, 162, 102, 91, 277 (136) • Ankyrin repeat: 105, 107, 19, 120, 276 (145) • Homeobox: 148, 109, 9, 118, 267 (160). *Only the number of genes is given; the number of domains is higher. Note that only around 90% is sequenced. Nature 409 (01)860; Science 291(01)1304

  19. Comparative genome analysis (outline) • Prediction of genes and pseudogenes • Homology-based function prediction • Context-based function prediction 1. Co-occurrence of genes • Context-based function prediction 2. Gene neighbourhood

  20. Function prediction via genomic context information. Gene context: • Gene fusion as distinct neighborhood subset • Conserved gene neighborhood in genomes • Conserved co-occurrence of genes in species (‘phylogenetic profile’, ‘COG pattern’) • Surrounding and shared regulatory elements. Knowledge-based context: • Pathway data (can overrule homology!) • Gene expression data (co-expression etc.) • Protein interaction/localisation • Scientific literature

  21. Context methods in Mycoplasma: fusion, neighborhood, co-occurrence [Figure: overlap diagram of three context methods (fusion, co-occurrence in genomes, conserved neighborhood) for MG, 480 genes in total. Presence in conserved operons: 213. Further counts shown: 27, 54, 178.]

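  22. Orthology vs paralogy … within homology [Figure: Genome A carries gene A1 and gene A2, Genome B carries gene B1 and gene B2. Genes within one genome (A1/A2, B1/B2) are related by paralogy; corresponding genes across genomes (A1/B1, A2/B2) are related by orthology. A tree labelled "history" shows one ancestral gene splitting into gene 1 (A1, B1) and gene 2 (A2, B2).]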

  23. Exploiting the absence of genes www.bork.embl-heidelberg.de Huynen et al., 1998, FEBS Lett 426, 1-5

  24. Predicting functional interactions between proteins by the co-occurrence of their genes in genomes. Distribution of four M. genitalium genes among 25 genomes (1 = present, 0 = absent):
      MG299 (pta)   0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
      MG357 (ackA)  0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
      MG019 (dnaJ)  0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
      MG305 (dnaK)  0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
      The mutual information between genes is used as a scoring heuristic for their co-occurrence: • M(pta, ackA) = 0.69 (phosphotransacetylase, acetate kinase) • M(dnaJ, dnaK) = 0.55 (heat shock proteins) • M(dnaJ, ackA) = 0.19
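
The scores on slide 24 can be checked with a short sketch that computes the mutual information between two presence/absence profiles. The slide does not state the logarithm base; the sketch assumes natural logarithms, and with that assumption the results round to the three values quoted. The function name and the `PROFILES` dictionary are illustrative choices, and the profiles themselves are copied from the slide.

```python
from collections import Counter
from math import log


def mutual_information(profile_x, profile_y):
    """Mutual information (in nats) between two equally long presence/absence profiles."""
    if len(profile_x) != len(profile_y):
        raise ValueError("profiles must cover the same genomes")
    n = len(profile_x)
    joint = Counter(zip(profile_x, profile_y))  # counts of (x, y) combinations
    count_x = Counter(profile_x)
    count_y = Counter(profile_y)
    # sum over observed combinations of p(x, y) * ln(p(x, y) / (p(x) * p(y)))
    return sum(
        (c / n) * log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )


# Presence (1) / absence (0) of four M. genitalium genes in 25 genomes, from the slide.
PROFILES = {
    "pta":  "0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1".split(),
    "ackA": "0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1".split(),
    "dnaJ": "0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1".split(),
    "dnaK": "0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1".split(),
}

if __name__ == "__main__":
    for a, b in [("pta", "ackA"), ("dnaJ", "dnaK"), ("dnaJ", "ackA")]:
        print(a, b, round(mutual_information(PROFILES[a], PROFILES[b]), 2))
```

Two identical profiles score the entropy of the profile itself, which is why the pta/ackA pair (present in 12 of 25 genomes) scores higher than the dnaJ/dnaK pair (present in 19 of 25) even though both pairs co-occur perfectly.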

  25. The phylogenetic distribution of cyaY (frataxin) is identical to that of hscB/Jac1, indicating a functional role of cyaY in iron-sulfur cluster assembly on proteins, specifically in conjunction with Jac1. [Figure: phylogenetic distribution of iron-sulfur cluster assembly proteins across genomes (species labels include H. pylori and D.melan.). Rows: cyaY/Yfh1 (frataxin), hscB/Jac1, hscA/ssq1, iscS/Nfs1, iscU/Isu1-2, iscA/Isa1-2, fdx/Yah1, ORF1, ORF2, ORF3, Nfu1, Arh1.] Huynen et al. Hum.Mol.Genet 2001

  26. Comparative genome analysis (outline) • Prediction of genes and pseudogenes • Homology-based function prediction • Context-based function prediction 1. Co-occurrence of genes • Context-based function prediction 2. Gene neighbourhood

  27. Genome alignment

  28. Conservation of gene neighborhood: pairwise comparison of 20 prokaryotic genomes [Figure: scatter plot on a logarithmic axis ("log") against "time", with two point series (o and x); marked genome pairs: MG-MP and EC-HI.]
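
What "conserved gene neighborhood" means in a pairwise genome comparison can be sketched as follows. This is a simplified illustration and not the method behind the slide: it assumes each genome is given as a linear list of orthologous-group labels in chromosomal order, and it ignores strand, circular chromosomes and gaps between genes. The labels `g1` to `g7` and both gene orders are hypothetical.

```python
def adjacent_pairs(gene_order):
    """Unordered pairs of genes that are direct neighbours in one genome."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}


def conserved_neighbors(order_a, order_b):
    """Gene pairs that are direct neighbours in both genomes."""
    return adjacent_pairs(order_a) & adjacent_pairs(order_b)


# Hypothetical gene orders; labels stand for orthologous groups.
GENOME_A = ["g1", "g2", "g3", "g4", "g5", "g6"]
GENOME_B = ["g6", "g3", "g4", "g7", "g2", "g1"]

if __name__ == "__main__":
    for pair in sorted(sorted(p) for p in conserved_neighbors(GENOME_A, GENOME_B)):
        print(pair)
```

Using unordered pairs means an inverted segment (g2 next to g1 in the second genome) still counts as a conserved neighborhood.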

  29. Nucleotide salvage/degradation pathway in gram-positive bacteria

  30. STRING server for context retrieval www.bork.embl-heidelberg.de/STRING Example: Tryptophan biosynthesis

  31. Gene neighborhood reflects connections between Tryptophan and Shikimate biosynthesis www.bork.embl-heidelberg.de

  32. Modularity in “genomic association space”: networks based on conserved gene neighborhood reveal ‘natural’ subsystems [Figure: network of the genes tyrA, asd, aroB, truA, aroC, aroE, hemK, hyp, trpF, trpC, trpE, trpG, trpA, trpD, trpB, hyp and 2c-rr, with two modules labelled Shikimate pathway and Tryptophan synthesis pathway.]

  33. (pseudo)genes: Yan Yuan, Mikita Suyama, David Torrents

  34. SMART: Ivica Letunic, Rich Copley

  35. www.bork.embl-heidelberg.de *Frank (D), Yan (C), Peer (D), *Martijn (NL), Tobias (D), *Gert (D), Richard (UK), Shamil (RU), *Luis (E), *Vassily (RU), *Birgit (D), Mikita (J), Miguel (E), *Jörg (D), Warren (US), Berend (NL), David (E), Ivica (Hr), Caroline (E), Steffen (D), Francesca (I), Jan (D), Parantu (In), Christian (D). *left EMBL
