
Comparative genomics: functional characterization of new genes and regulatory interactions using computer analysis

Comparative genomics: functional characterization of new genes and regulatory interactions using computer analysis. Mikhail Gelfand, Institute for Information Transmission Problems (The Kharkevich Institute), RAS. Workshop at the Landau Institute of Theoretical Physics, RAS.


Presentation Transcript


  1. Comparative genomics: functional characterization of new genes and regulatory interactions using computer analysis. Mikhail Gelfand, Institute for Information Transmission Problems (The Kharkevich Institute), RAS. Workshop at the Landau Institute of Theoretical Physics, RAS, September 27-28, 2007, Moscow

  2. The genome is deciphered!

  3. Is it? To intercept a message does not mean to understand it

  4. Fragment of a genome (0.1% of E. coli) • A typical bacterial genome: several million nucleotides, from ~600 to ~9,000 genes (~90% of the genome encodes proteins)

  5. Propaganda: sequences in GenBank (~genes) vs. articles in PubMed (~experiments)

  6. More propaganda • Most genes will never be studied experimentally • Even in E. coli: only 20-30 new genes per year (hundreds are still uncharacterized) • “Universally missing genes” – not a single known gene even for ~10% of the reactions of central metabolism. No genes for >40% of reactions overall. • “Conserved hypothetical genes” (5-15% of any bacterial genome) – essential, but of unknown function.

  7. The local goal: to characterize the genes • What? • function (rather, role) • When? • regulation (conditions) • gene expression • lifetime (mRNA, protein) • Where? • Localization • Cellular/membrane/secreted • How? • Mechanism of action • Specificity, regulation (biochemistry)

  8. Propaganda-2: complete genomes • 2007: • > 1200 bacterial genomes

  9. The global goal: to predict the organism’s properties given its genome (plus some additional information, e.g. the initial state after cell division) and “to understand” the evolution of genomes/organisms

  10. Haemophilus influenzae, 1995

  11. Vibrio cholerae, 2000

  12. The metabolic map, the bird’s view

  13. Metabolic pathways, the eagle’s view

  14. A submap (metabolism of arginine and proline)

  15. Approaches • Similarity => homology (common origin) • Homology => common function • “The Pearson Principle” (after Karl Pearson): important features are conserved • functional sites in proteins • regulatory (protein-binding) sites in DNA • not necessarily sequences: • structure of protein and RNA • gene localization on chromosomes • co-expression of genes • Allows one to annotate 50-75% of genes in a bacterial genome • Necessary first step, may be automated (to some extent)

  16. … but not so simple • Similarity ≠ homology • Low complexity regions, unstructured domains, transmembrane segments and other regions with non-standard amino acid composition • The need for correct similarity measures • Does homology always follow from structural similarity? • What is structural similarity? How can it be measured? • Convergent evolution of structures? Independent emergence of folds? • Homology ≠ same function • What is «the same function»? • Biochemical details and cellular role

  17. “The Fermi principle” (after Enrico Fermi) • Purely homology-based annotation: boring (nothing radically new) • It turns out that one can predict something completely new • Comparative genomics

  18. Positional clustering • Genes that are located in immediate proximity tend to be involved in the same metabolic pathway or functional subsystem • caused by operon structure, but not only • horizontal transfer of loci containing several functionally linked operons • compartmentalisation of products in the cytoplasm • very weak evidence • stronger if observed in many unrelated genomes • May be measured • e.g. the STRING database/server (P. Bork, EMBL) • and other sources
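The idea that positional evidence grows stronger when it recurs in many genomes can be illustrated with a minimal sketch. It is not the STRING scoring scheme. The function name `neighbor_counts`, the `max_gap` parameter, and the toy gene orders are assumptions made for illustration. Chromosomes are treated as linear, and strand and operon boundaries are ignored.

```python
from collections import Counter

def neighbor_counts(genomes, max_gap=1):
    """For each pair of gene families, count the genomes in which the two
    families lie within max_gap positions of each other in the gene order."""
    counts = Counter()
    for order in genomes.values():
        pairs = set()  # each pair is counted at most once per genome
        for i, gene in enumerate(order):
            for j in range(i + 1, min(i + 1 + max_gap, len(order))):
                if order[j] != gene:
                    pairs.add(frozenset((gene, order[j])))
        counts.update(pairs)
    return counts

# Toy gene orders (hypothetical, for illustration only).
genomes = {
    "genome_A": ["trpA", "trpB", "trpC", "xyz1"],
    "genome_B": ["abc1", "trpB", "trpA", "abc2"],
    "genome_C": ["trpA", "abc3", "abc4", "trpB"],
}
counts = neighbor_counts(genomes, max_gap=1)
print(counts[frozenset(("trpA", "trpB"))])  # number of genomes where the pair is adjacent
```

A pair seen next to each other in one genome is weak evidence. The count across unrelated genomes is what such a score would build on.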

  19. STRING: trpB – positional clusters

  20. Functionally dependent genes tend to cluster on chromosomes in many different organisms Vertical axis: number of gene pairs with association score exceeding a threshold. Control: same graph, random re-labeling of vertices

  21. More genomes (stronger links) => highly significant clustering

  22. Fusions • If two (or more) proteins form a single multidomain protein in some organism, they all are likely to be tightly functionally related • Very useful for the analysis of eukaryotes • Sometimes useful for the analysis of prokaryotes

  23. STRING: trpB – fusions

  24. Phyletic patterns • Functionally linked genes tend to occur together • Enzymes with the same function (isozymes) have complementary phyletic profiles
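A rough way to quantify "occur together" is to treat each phyletic pattern as a set of genomes and compare the sets. The sketch below uses the Jaccard index. This is one possible similarity measure chosen for illustration, not necessarily the one used in the talk or in STRING. The function names and toy patterns are hypothetical.

```python
def to_set(pattern):
    """Genomes (one letter each) in which the gene is present; '-' means absent."""
    return {c for c in pattern if c != "-"}

def jaccard(p, q):
    """Shared genomes divided by genomes containing either gene."""
    a, b = to_set(p), to_set(q)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy patterns (hypothetical): identical profiles score 1.0,
# non-overlapping (complementary) profiles score 0.0.
print(jaccard("abc-e", "abc-e"))
print(jaccard("ab---", "--c-e"))
```

Functionally linked genes would be expected to score high, while alternative forms of one enzyme would score low and still cover the full set of genomes between them.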

  25. STRING: trpB – co-occurrence (phyletic patterns)

  26. Phyletic patterns in the Phe/Tyr pathway shikimate kinase

  27. Archaeal shikimate kinase • Chorismate biosynthesis pathway (E. coli)

  28. Arithmetics of phyletic patterns
      3-dehydroquinate dehydratase (EC 4.2.1.10):
        Class I (AroD)      COG0710  aompkzyq---lb-e----n---i--
        Class II (AroQ)     COG0757  ------y-vdr-bcefghs-uj----
        Two forms combined           aompkzyqvdrlbcefghsnuj-i--
      Shikimate kinase (EC 2.7.1.71):
        Typical (AroK)      COG0703  ------yqvdrlbcefghsnuj-i--
        Archaeal-type       COG1685  aompkz--------------------
        Two forms combined           aompkzyqvdrlbcefghsnuj-i--
      Shikimate dehydrogenase (EC 1.1.1.25):
        AroE                COG0169  aompkzyqvdrlbcefghsnuj-i--
      5-enolpyruvylshikimate 3-phosphate synthase (EC 2.5.1.19):
        AroA                COG0128  aompkzyqvdrlbcefghsnuj-i--
      Chorismate synthase (EC 4.2.3.5):
        AroC                COG0082  aompkzyqvdrlbcefghsnuj-i--
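The "two forms combined" rows on this slide are position-wise unions of two patterns. A minimal sketch of that arithmetic follows, using the pattern strings from the slide. The function name `combine` is an assumption for illustration.

```python
def combine(p, q):
    """Position-wise union of two phyletic patterns of equal length:
    a genome letter is kept if it is present in either pattern."""
    if len(p) != len(q):
        raise ValueError("patterns must have the same length")
    return "".join(a if a != "-" else b for a, b in zip(p, q))

aro_d = "aompkzyq---lb-e----n---i--"   # Class I dehydroquinate dehydratase
aro_q = "------y-vdr-bcefghs-uj----"   # Class II dehydroquinate dehydratase
aro_k = "------yqvdrlbcefghsnuj-i--"   # typical shikimate kinase
arch  = "aompkz" + "-" * 20            # archaeal-type shikimate kinase
aro_e = "aompkzyqvdrlbcefghsnuj-i--"   # shikimate dehydrogenase

print(combine(aro_d, aro_q) == aro_e)
print(combine(aro_k, arch) == aro_e)
```

Both unions reproduce the pattern of the single-form enzymes of the pathway, which is the argument for assigning the archaeal-type protein to the shikimate kinase step.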

  29. Distribution of association scores: monotonic for subunits, bimodal for isozymes

  30. Comparative analysis of regulation • Phylogenetic footprinting: regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions • Consistency filtering: regulons (sets of co-regulated genes) are conserved => • true sites occur upstream of orthologous genes • false sites are scattered at random
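Consistency filtering can be sketched as a simple support count: a candidate site is retained only if sites are also found upstream of orthologous genes in other genomes. The sketch below assumes the site search and ortholog assignment have already been done. The function name, the `min_genomes` threshold, and the toy hits are hypothetical.

```python
from collections import defaultdict

def consistent_groups(hits, min_genomes=2):
    """hits: iterable of (genome, ortholog_group) pairs, one per candidate
    site found upstream of a gene. Returns the ortholog groups that have
    candidate sites in at least min_genomes different genomes."""
    support = defaultdict(set)
    for genome, group in hits:
        support[group].add(genome)
    return {g: sorted(gs) for g, gs in support.items() if len(gs) >= min_genomes}

# Toy candidate sites (hypothetical labels, for illustration only).
hits = [
    ("g1", "ribD"), ("g2", "ribD"), ("g3", "ribD"),  # conserved across genomes
    ("g1", "geneX"), ("g2", "geneY"),                # scattered, likely false
]
print(consistent_groups(hits))
```

Scattered hits fall below the threshold, while hits upstream of orthologs in several genomes survive. A real analysis would also weigh phylogenetic distance between the genomes.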

  31. Riboflavin (vitamin B2) biosynthesis pathway

  32. 5’ UTR regions of riboflavin genes from bacteria

  33. Conserved secondary structure of the RFN-element • Capitals: invariant (absolutely conserved) positions. Lower case letters: strongly conserved positions. Dashes and stars: obligatory and facultative base pairs. Degenerate positions: R = A or G; Y = C or U; K = G or U; B = not A; V = not U. N: any nucleotide. X: any nucleotide or deletion
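The degenerate codes listed on this slide map directly onto regular-expression character classes. The sketch below translates a consensus string into a regex for scanning RNA sequences. It ignores the capital/lower-case distinction and the base-pairing constraints, so it checks sequence only. The consensus `GRYNX` is a made-up example, not the RFN consensus.

```python
import re

# Degenerate codes as defined on the slide (RNA alphabet).
CODES = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]", "Y": "[CU]", "K": "[GU]",
    "B": "[CGU]",      # not A
    "V": "[ACG]",      # not U
    "N": "[ACGU]",     # any nucleotide
    "X": "[ACGU]?",    # any nucleotide or deletion
}

def consensus_to_regex(consensus):
    """Compile a degenerate consensus into a regular expression.
    Case is ignored; structural (base-pair) constraints are not checked."""
    return re.compile("".join(CODES[c] for c in consensus.upper()))

pattern = consensus_to_regex("GRYNX")  # hypothetical consensus
print(bool(pattern.fullmatch("GACU")))   # X matched as a deletion
print(bool(pattern.fullmatch("GCCU")))   # C is not allowed at the R position
```

A search for structured RNA elements would additionally require complementarity at the paired positions, which a plain regex cannot express.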

  34. RFN: the mechanism of regulation • Transcription attenuation • Translation attenuation

  35. Early observation: an uncharacterized gene (ypaA) with an upstream RFN element

  36. Phylogenetic tree of RFN-elements (regulation of riboflavin biosynthesis) • Figure labels: no riboflavin biosynthesis; duplications; no riboflavin biosynthesis

  37. YpaA a.k.a. RibU: riboflavin transporter in Gram-positive bacteria • 5 predicted transmembrane segments => a transporter • Upstream RFN element (likely co-regulation with riboflavin genes) => transport of riboflavin or a precursor • S. pyogenes, E. faecalis, Listeria sp.: ypaA, no riboflavin pathway => transport of riboflavin • Prediction: YpaA is a riboflavin transporter (Gelfand et al., 1999) • Validation: • YpaA transports flavins (riboflavin, FMN, FAD): by genetic analysis (Kreneva et al., 2000), by direct measurement (Burgess et al., 2006; Vogl et al., 2007) • ypaA is regulated by riboflavin: by microarray expression study (Lee et al., 2001) • … via attenuation of transcription (and to some extent inhibition of translation) (Winkler et al., 2003)

  38. Conserved structures of riboswitches (circled: X-ray)

  39. Mechanisms • gcvT: ribozyme, cleaves its mRNA (the Breaker group) • THI-box in plants: inhibition of splicing (the Breaker and Hanamoto groups)

  40. Characterized riboswitches (more are predicted)

  41. Properties of riboswitches • Direct binding of ligands • High conservation • Including “unpaired” regions: tertiary interactions, ligand binding • Same structure – different mechanisms: transcription, translation, splicing, (RNA cleavage) • Distribution in all taxonomic groups • diverse bacteria • archaea: thermoplasmas • eukaryotes: plants and fungi • Correlation of the mechanism and taxonomy: • attenuation of transcription (anti-anti-terminator) – Bacillus/Clostridium group • attenuation of translation (anti-anti-sequestor of translation initiation) – proteobacteria • attenuation of translation (direct sequestor of translation initiation) – actinobacteria • Evolution: horizontal transfer, duplications, lineage-specific loss • Sometimes very narrow distribution: evolution from scratch?

  42. Conserved signal upstream of nrd genes

  43. Identification of the candidate regulator by the analysis of phyletic patterns COG1327: the only COG with exactly the same phylogenetic pattern as the signal • “large scale” on the level of major taxa • “small scale” within major taxa: • absent in small parasites among alpha- and gamma-proteobacteria • absent in Desulfovibrio spp. among delta-proteobacteria • absent in Nostoc sp. among cyanobacteria • absent in Oenococcus and Leuconostoc among Firmicutes • present only in Treponema denticola among four spirochetes

  44. COG1327 “Predicted transcriptional regulator, consists of a Zn-ribbon and ATP-cone domains”: regulator of the riboflavin pathway (RibX)?

  45. Additional evidence: co-localization nrdR is sometimes clustered with nrd genes or with replication genes dnaB, dnaI, polA

  46. Additional evidence: co-regulated genes In some genomes, candidate NrdR-binding sites are found upstream of other replication-related genes • dNTP salvage • topoisomerase I, replication initiator dnaA, chromosome partitioning, DNA helicase II

  47. Multiple sites (nrd genes): FNR, DnaA, NrdR

  48. Mode of regulation • Repressor (overlaps with promoters) • Co-operative binding: • most sites occur in tandem (> 90% of cases) • the distance between the copies (centers of palindromes) equals an integer number of DNA turns: • mainly (94%) 30-33 bp, in 84% of cases 31-32 bp – 3 turns • 21 bp (2 turns) in Vibrio spp. • 41-42 bp (4 turns) in some Firmicutes
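The "integer number of DNA turns" argument can be checked with simple arithmetic, assuming the standard B-DNA helical repeat of about 10.5 bp per turn. That value is an assumption added here; it is not stated on the slide. The function name `turns` is hypothetical.

```python
HELICAL_REPEAT_BP = 10.5  # assumed B-DNA helical repeat, bp per turn

def turns(distance_bp):
    """Nearest whole number of helical turns for a center-to-center distance,
    and the remaining offset in bp from that whole number of turns."""
    n = round(distance_bp / HELICAL_REPEAT_BP)
    return n, distance_bp - n * HELICAL_REPEAT_BP

# Distances quoted on the slide.
for d in (21, 31, 32, 41, 42):
    n, offset = turns(d)
    print(f"{d} bp: {n} turns, offset {offset:+.1f} bp")
```

Distances close to a whole number of turns place the two sites on the same face of the helix, which is consistent with co-operative binding of the repressor.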

  49. Experimental validations

  50. Acknowledgements • Dmitry Rodionov (comparative genomics) • Andrei Mironov (software) • Alexei Vitreschak (riboswitches) • Funding: • Howard Hughes Medical Institute • Russian Foundation of Basic Research • RAS, program “Molecular and Cellular Biology” • INTAS
