
Predicting function from sequence

Predicting function from sequence. Peer Bork EMBL & MDC Heidelberg & Berlin. bork@embl-heidelberg.de http://www.bork.embl-heidelberg.de/. Bioinformatics. Generation of information (biophysics) Storage and retrieval of information (informatics for biodatabases)



  1. Predicting function from sequence Peer Bork EMBL & MDC Heidelberg & Berlin bork@embl-heidelberg.de http://www.bork.embl-heidelberg.de/

  2. Bioinformatics • Generation of information (biophysics) • Storage and retrieval of information (informatics for biodatabases) • Translation of information into knowledge (computational biology)

  3. Chance of deducing structural and functional features by homology: many homologues, an increasing number of predictable folds, but tough times for automatic function prediction

  4. Function prediction from sequence • Quality and heterogeneity of data • Prediction accuracy: 70% hurdle • Function and domain prediction • Function prediction by gene context

  5. Algorithmic challenges versus data quality and biological diversity • Challenges despite highly similar sequences due to sequencing errors and other artefacts • Challenges due to low sequence similarity, paralogy and multiple domains

  6. Number of human genes in time [Chart: estimated number of human genes, in thousands (axis 0 to 120), plotted from Mar 2000 to Apr 2001; labelled sources: textbooks and public opinion, HGS, Incyte and co, Celera, HGP, others]

  7. Nature 304, 16. November 2000

  8. Heterogeneous data from large-scale approaches • Gene expression (correlation to proteins poor) • Yeast two-hybrid (8% overlap with each other) • Many others...

  9. Mycoplasma pneumoniae predictions Dandekar et al., 2000 NAR Sep

  10. Mycoplasma pneumoniae re-annotation, 1995 vs 1999 • RNAs: +9 = 42 • ORFs: +12 -1 = 688 • ORFs with functions: +105 = 458 • ORFs changed: 16 extended, 8 shortened • Function changed: 30 more, 18 less specific • 57% of all entries were re-annotated!

  11. Function prediction from sequence • Quality and heterogeneity of data • Prediction accuracy: 70% hurdle • Function and domain prediction • Function prediction by gene context

  12. 70% prediction accuracy is great!

  13. Clear homology via Blast; yet, misleading annotation hampers automatic function prediction

  14. Phylogenetic tree of Blast hits reveals that no function prediction is possible

  15. Molecular functions have to be defined on a domain basis, i.e. separately for each structurally independent unit within a sequence (Henikoff et al. 1997, Science 278, 609)

  16. Function prediction from sequence • Quality and heterogeneity of data • Prediction accuracy: 70% hurdle • Function and domain prediction • Function prediction by gene context

  17. Dotplot to reveal residue conservation [Figure: dotplot of Ned4 from human against Rsp5 from yeast; domain labels on both proteins: C2, WW (repeated), HECTc; annotated features: conserved domains, repeat pattern, domain insertion]

  18. Function prediction for disease genes: breast cancer gene BRCA1 • Positionally cloned 1994 (Miki et al. Science 266, pp66) • Features originally deduced from the 1857aa sequence: contains a RING finger (30aa, usually binds diverse proteins) • Function unknown, even localization unclear

  19. Localization experiments on BRCA1 [Table with columns: Title/Journal, Conclusion]

  20. Domain discovery in BRCA1

  21. Domain discovery in disease genes

  22. SMART: Blast-like input - ID or AC sufficient - Access to different databases - Domain annotation www.smart.embl-heidelberg.de

  23. SMART: Digested output - signal sequence - transmembrane regions - comparison of domain context

  24. Non-globular functional features in protein sequences • Transmembrane regions • Signal sequences • GPI anchors • Coiled coils • Other compositionally biased regions • (Short internal repeats)

  25. SMART: Blast with “in between” regions - automatically cuts respective region - cut and paste for other programs - some specific output features

  26. SMART: Digested output - signal sequence - transmembrane regions - comparison of domain context

  27. SMART: Domain annotation - description - multiple alignment - consensus features - residue annotation - search options

  28. SMART: Species distribution - total occurrence - protein and domain statistics - taxonomic breakdown - model organisms

  29. Annotation improvement using domain correlation • Query: VAV (H. sapiens). Find closest hit: selective SMART • Domain architecture of C35B8.2 (C. elegans). Evaluate correlation; scan genome region • Reconstructed structure of C35B8.2 (figure label: SH3)

  30. Domain organization of TAP [Figure: domain diagrams of TAP and p15 with sites of random and directed mutagenesis; domain labels: LRR (four times), NTF2-like (twice), UBA; region labels: RNA-binding, p15-binding, np-bind.; lengths shown: 619aa, 100aa] Collaboration with Elisa Izaurralde

  31. Directed mutagenesis confirms predicted TAP/p15 interaction [Figure legend: red = loss of binding; gray = alanine scan; blue = no effect on binding]

  32. Human genome reveals whole TAP family • In 90% of the human genome: 6 TAP homologues, but of these 1-2 pseudogenes • Independent duplications in fly, worm and human

  33. Sequenced eukaryotic genomes (Bork and Copley, Nature 409(01)818)

  34. History of signaling domain discovery: novel nuclear and cytoplasmic domains • Systematic approach by searching ‘in between’ regions

  35. Top 10 domains* in human. Species: man, fly, worm, yeast, cress. Total no genes: 6100 13300 18200 25700 26500(26500); Immunoglobulin: 765(381) 1 140 64 0; C2H2 zinc finger: 115 357 151 48 706(607); Protein kinase: 1049 319 437 121 575(501); Rhod.-like GPCR: 16 97 358 0 569(616); P-loop NTPase: 331 198 183 97 433; Rev. transcriptase: 80 10 50 6 350; RRM (RNA-binding): 255 157 96 54 300(224); WD40 (G-protein): 210 162 102 91 277(136); Ankyrin repeat: 105 107 19 120 276(145); Homeobox: 148 109 9 118 267(160). *Only no of genes given, no of domains higher; note that only around 90% is sequenced. Nature 409 (01)860; Science 291(01)1304

  36. Top 10 mobile domains in human. Species: man, fly, worm, yeast, cress. Total no genes: 13300 18200 6100 25700 26500; C2H2 zinc finger: 5653 255 1778 587 104; Immunoglobulin: 2 457 530 0 1364; EGF: 53 466 539 1 1207; WD40 (G-protein): 1022 678 488 340 894; Ankyrin repeat: 261 363 344 38 714; Cadherin domain: 0 201 113 0 622; Protein kinases: 1054 259 462 122 586; Fibronectin type 3: 6 217 212 2 557; RRM (RNA-binding): 242 183 94 460 443; CCP/sushi/SCR: 64 80 0 0 277. Only no of domains given, no of proteins lower; note that only around 90% is sequenced. SMART analysis of 31700 predicted human ORFs

  37. Correlation between domains [Figure labels: marker, PX, other, extra, nuclear, intra]

  38. Function prediction from sequence • Quality and heterogeneity of data • Prediction accuracy: 70% hurdle • Function and domain prediction • Function prediction by gene context

  39. Phenotypic features do not coincide with species evolution... but gene content does [Figure label: yeast]

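  40. Orthology vs paralogy ... within homology [Diagram: Genome A carries gene A1 and gene A2, Genome B carries gene B1 and gene B2; the relation within a genome is labelled paralogy, the relation between genomes orthology. History: an ancestral gene splits into gene 1 (leading to gene A1 and gene B1) and gene 2 (leading to gene A2 and gene B2)]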

  41. Differential Genome Display: H. influenzae genome (Huynen et al., 1997, Trends Genet 13, 389)

  42. Exploiting the absence of genes (Huynen et al., 1998, FEBS Lett 426, 1-5)

  43. Predicting functional interactions between proteins by the co-occurrence of their genes in genomes. Distribution of four M. genitalium genes among 25 genomes:

      MG299 (pta)   0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
      MG357 (ackA)  0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
      MG019 (dnaJ)  0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
      MG305 (dnaK)  0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1

      Using the mutual information between genes as a scoring heuristic for their co-occurrence: M(pta, ackA) = 0.69 (phosphotransacetylase, acetate kinase); M(dnaJ, dnaK) = 0.55 (heat shock proteins); M(dnaJ, ackA) = 0.19
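The scores on slide 43 can be checked with a short sketch. The slide does not state the exact formula or the base of the logarithm, so this assumes the standard mutual information between two binary presence/absence profiles, computed with the natural logarithm (that choice reproduces the slide's values to two decimals). The names `mutual_information`, `parse` and `profiles` are illustrative and not taken from the presentation.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length profiles.

    Assumption: the plain definition sum p(a,b) * ln(p(a,b) / (p(a) p(b)))
    over observed (a, b) pairs; the slide does not spell out its formula.
    """
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    n = len(x)
    count_x, count_y = Counter(x), Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # c/n is p(a,b); c*n / (count_x[a]*count_y[b]) equals p(a,b) / (p(a) p(b))
        mi += (c / n) * math.log(c * n / (count_x[a] * count_y[b]))
    return mi


def parse(profile):
    """Turn a space-separated string of 0/1 flags into a list of ints."""
    return [int(v) for v in profile.split()]


# Presence (1) / absence (0) of each gene in the 25 genomes, copied from the slide.
profiles = {
    "pta":  parse("0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1"),
    "ackA": parse("0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1"),
    "dnaJ": parse("0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1"),
    "dnaK": parse("0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1"),
}

for g1, g2 in [("pta", "ackA"), ("dnaJ", "dnaK"), ("dnaJ", "ackA")]:
    score = mutual_information(profiles[g1], profiles[g2])
    print(f"M({g1}, {g2}) = {score:.2f}")
# prints:
# M(pta, ackA) = 0.69
# M(dnaJ, dnaK) = 0.55
# M(dnaJ, ackA) = 0.19
```

Because the pta and ackA profiles are identical, their mutual information equals the entropy of a single profile (12 presences in 25 genomes), which is why that pair scores highest here; dnaJ and dnaK are also identical but present in 19 of 25 genomes, so their profile carries less information.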

  44. The phylogenetic distribution of cyaY (frataxin) is identical to that of hscB/Jac1, indicating a functional role of cyaY in iron-sulfur cluster assembly on proteins, specifically in conjunction with Jac1. [Figure: Phylogenetic distribution of iron-sulfur cluster assembly proteins across sequenced genomes (legible species labels include H. pylori and D.melan.); rows: cyaY / Yfh1 (frataxin), hscB / Jac1, hscA / ssq1, iscS / Nfs1, iscU / Isu1-2, iscA / Isa1-2, fdx / Yah1, ORF1, ORF2, ORF3, Nfu1, Arh1] Huynen et al. Hum.Mol.Genet 2001

  45. Function prediction via gene context information. Genomic context information: - Conserved gene neighborhood in genomes - Gene fusion as distinct neighborhood subset - Conserved co-occurrence of genes in species (‘phylogenetic profile’, ‘COG pattern’) - Surrounding and shared regulatory elements. Knowledge-based context information: - Pathway data (can overrule homology!) - Gene expression data (co-expression etc.) - Protein interaction / localisation - Scientific literature

  46. Evolution of genome organization
