Seven clusters and four types of symmetry in microbial genomes

Seven clusters and four types of symmetry in microbial genomes. Andrei Zinovyev Bioinformatics service Math@Bio group of M.Gromov. Tatyana Popova R&D Centre in Biberach, Germany. Alexander Gorban Centre for


Presentation Transcript


  1. Seven clusters and four types of symmetry in microbial genomes. Andrei Zinovyev, Bioinformatics service, Math@Bio group of M. Gromov; Tatyana Popova, R&D Centre in Biberach, Germany; Alexander Gorban, Centre for Mathematical Modelling

  2. Symbol of GofG’05

  3. Genomic sequence as a text in an unknown language:
     ..cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc…
     Frequency dictionaries for the example tagggrcgcacgtggtgagctgatgctaggg:
     N = 4 = 4^1:   t a g g g r c g c a c g t g g t g a g c t g a t g c t a g g g
     N = 16 = 4^2:  ta gg gr cg ca cg tg gt ga gc tg at gc ta gg
     N = 64 = 4^3:  tag ggr cgc acg tgg tga gct gat gct agg
     N = 256 = 4^4: tagg grcg cacg tggt gagc tgat gcta gggr
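A minimal sketch of a frequency dictionary as described on this slide: the text is cut into consecutive, non-overlapping words of length k and the relative frequency of each word is recorded. The function name `frequency_dictionary` is hypothetical, and skipping words that contain letters outside a, c, g, t (such as the `r` in the slide's example) is an assumption of this sketch, not something the slide states.

```python
from collections import Counter

ALPHABET = set("acgt")

def frequency_dictionary(text, k):
    """Relative frequencies of non-overlapping words of length k."""
    text = text.lower()
    # Cut the text into consecutive k-letter words; an incomplete tail is dropped.
    words = [text[i:i + k] for i in range(0, len(text) - k + 1, k)]
    # Assumption: words with letters outside a/c/g/t (e.g. 'r') are skipped.
    counts = Counter(w for w in words if set(w) <= ALPHABET)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

if __name__ == "__main__":
    seq = "tagggrcgcacgtggtgagctgatgctaggg"
    for k in (1, 2, 3, 4):
        d = frequency_dictionary(seq, k)
        # A four-letter alphabet allows at most 4**k distinct words of length k.
        print(k, len(d), "distinct words out of at most", 4 ** k)
```

The dictionary for a given k can then be read as a vector with up to 4^k coordinates.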

  4. From text to geometry. The sequence (10^7)
     cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
     is cut into 10000-20000 fragments of length ~200-400:
     cgtggtgagctgatgctagggrcgcac
     ggtgagctgatgctagggrcgcacact
     tgagctgatgctagggrcgcacaattc
     gtgagctgatgctagggrcgcacggtg
     ……
     gagctgatgctagggrcgcacaagtga
     Each fragment becomes a point in R^N.
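The step from text to geometry can be sketched as follows: each fragment is turned into a vector of triplet frequencies, so a set of fragments becomes a cloud of points in R^64. The window width of 300 lies in the 200-400 range given on the slide; the step of 50 and the function names are assumptions made for illustration only.

```python
from itertools import product

# Fixed ordering of the 64 triplets over a/c/g/t, used as vector coordinates.
TRIPLETS = ["".join(p) for p in product("acgt", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}

def triplet_vector(fragment):
    """Frequencies of non-overlapping triplets in one fragment (64 numbers)."""
    counts = [0] * 64
    for i in range(0, len(fragment) - 2, 3):
        idx = INDEX.get(fragment[i:i + 3])
        if idx is not None:          # triplets with other letters are skipped
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 64

def fragments_to_vectors(sequence, width=300, step=50):
    """Slide a window over the sequence and map each fragment to a point in R^64."""
    sequence = sequence.lower()
    starts = range(0, len(sequence) - width + 1, step)
    return [triplet_vector(sequence[s:s + width]) for s in starts]

if __name__ == "__main__":
    toy = "cgtggtgagctgatgctagg" * 100   # toy sequence, not real genomic data
    vectors = fragments_to_vectors(toy)
    print(len(vectors), "fragments,", len(vectors[0]), "coordinates each")
```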

  5. Method of visualization: principal components analysis (PCA). The points in R^N are projected onto the R^2 PCA plot.
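A hedged sketch of the projection step, assuming NumPy is available and using a plain SVD of the centered data; the slide names PCA but does not say how it was computed, so the function `pca_project` and the synthetic data below are illustrative only.

```python
import numpy as np

def pca_project(vectors, n_components=2):
    """Project points from R^N onto their first principal components."""
    X = np.asarray(vectors, dtype=float)
    Xc = X - X.mean(axis=0)                      # center the cloud of points
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:n_components].T            # coordinates on the PCA plot
    explained = S ** 2 / np.sum(S ** 2)          # fraction of variance per component
    return coords, explained[:n_components]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.normal(size=(200, 64))          # synthetic stand-in for fragment vectors
    coords, explained = pca_project(points)
    print(coords.shape, explained)
```

With real fragment vectors, `coords` gives the two coordinates plotted for each fragment.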

  6. PCA plots for Caulobacter crescentus: singles (N=4), doublets (N=16), triplets (N=64), quadruplets (N=256). !!! The information in a genomic sequence is encoded by non-overlapping triplets (Nature, 1961).

  7. First explanation: cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc

  8. Basic 7-cluster structure. Each sequence is read as triplets with three different shifts:
     gtgagctgatgctagggrcgcacgtggtgagc
       gct gat gct agg grc gca cgt
       ctg atg cta ggg rcg cac gtg
       tga tgc tag ggr cgc acg tgg
     gtgaatcggtgggtgagtgtgctgctatgagc
       atc ggt ggg tga gtg tgc tgc
       tcg gtg ggt gag tgt gct gct
       cgg tgg gtg agt gtg ctg ctg
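The rows of triplets on this slide appear to be one sequence read with shifts of 0, 1 and 2 letters. A small sketch of that reading, with a hypothetical helper name `reading_frames`:

```python
def reading_frames(sequence):
    """Split a sequence into non-overlapping triplets for shifts 0, 1 and 2."""
    return {
        shift: [sequence[i:i + 3] for i in range(shift, len(sequence) - 2, 3)]
        for shift in range(3)
    }

if __name__ == "__main__":
    # Part of the first sequence on the slide, starting at "gct".
    frames = reading_frames("gctgatgctagggrcgcacgtgg")
    for shift, triplets in frames.items():
        print(shift, " ".join(triplets))
```

Each shift yields a different triplet distribution for the same stretch of text.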

  9. Non-coding parts. Point mutations: insertions, deletions. Example: a gtgagctgatgctagggr cgcacgaat

  10. The flower-like 7-cluster structure is flat

  11. Seven classes vs seven clusters. Labels on the slide: GenScan, TIGR, Georgia Institute of Technology, Stanford

  12. Computational gene prediction: accuracy >90%

  13. Mean-field approximation for triplet frequencies. F_IJK: frequency of triplet IJK (I, J, K ∈ {A, C, G, T}): F_AAA, F_AAT, F_AAC, …, F_GGC, F_GGG: 64 numbers. Position-specific letter frequencies + correlations: 12 numbers

  14. Why hexagonal symmetry? GC-content = P_C + P_G. Labels at the hexagon vertices: -+0, 0+-, +0-, -0+, +-0, 0-+

  15. Genome codon usage and mean-field approximation (label: correct frameshift).
     … ggtgaATG gat gct agg … gtc gca cgc TAAtgagct: 12 frequencies P^1_I, P^2_J, P^3_K
     … ggtgaATG gat gct agg … gtc gca cgc TAAtgagct: 64 frequencies F_IJK

  16. The position-specific frequencies P^j_I are linear functions of GC-content (eubacteria, archaea)

  17. THE MYSTERY OF TWO STRAIGHT LINES ??? R^64, R^12. F_IJK = P^1_I P^2_J P^3_K + correlations
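The relation F_IJK = P^1_I P^2_J P^3_K + correlations can be sketched numerically: from a list of codons, compute the 64 triplet frequencies, the 12 position-specific letter frequencies, and their product as the mean-field term. The function names are hypothetical, and the toy codon list only reuses the codons visible on slide 15; it is not real codon usage data.

```python
from collections import Counter
from itertools import product

LETTERS = "acgt"

def codon_frequencies(codons):
    """F_IJK: relative frequency of each of the 64 codons."""
    counts = Counter(c for c in codons if len(c) == 3 and set(c) <= set(LETTERS))
    total = sum(counts.values())
    return {"".join(t): counts["".join(t)] / total
            for t in product(LETTERS, repeat=3)}

def position_frequencies(F):
    """P^j_I: frequency of letter I at codon position j (3 x 4 = 12 numbers)."""
    P = [{letter: 0.0 for letter in LETTERS} for _ in range(3)]
    for codon, f in F.items():
        for j, letter in enumerate(codon):
            P[j][letter] += f
    return P

def mean_field(P):
    """Mean-field term: the product P^1_I * P^2_J * P^3_K for every triplet."""
    return {i + j + k: P[0][i] * P[1][j] * P[2][k]
            for i, j, k in product(LETTERS, repeat=3)}

def gc_content(P):
    """GC-content = P_C + P_G, averaged over the three codon positions."""
    return sum(p["c"] + p["g"] for p in P) / 3

if __name__ == "__main__":
    codons = ["atg", "gat", "gct", "agg", "gtc", "gca", "cgc", "taa"]
    F = codon_frequencies(codons)
    P = position_frequencies(F)
    M = mean_field(P)
    correlations = {c: F[c] - M[c] for c in F}   # what the product does not capture
    print("GC-content:", gc_content(P))
    print("largest correlation term:", max(correlations.values()))
```

The `correlations` dictionary is simply the difference between the observed frequencies and the mean-field product, matching the "+ correlations" term on the slide.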

  18. Codon usage signature 0-+

  19. 19 possible eubacterial signatures

  20. Example: Palindromic signatures

  21. Four symmetry types of the basic 7-cluster structure (eubacteria): perpendicular triangles, parallel triangles, degenerated, flower-like

  22. B. halodurans (GC=44%), F. nucleatum (GC=27%), E. coli (GC=51%), S. coelicolor (GC=72%)

  23. Web-site: cluster structures in genomic sequences, http://www.ihes.fr/~zinovyev/7clusters

  24. Human genome (chr19): singles, doublets, triplets; non-repetitive sequences, repetitive sequences

  25. Letter frequencies (3 dimensions): Purine-Pyrimidine (33%), Amino-Keto (17%), GC-content (50%). Letter labels on the slide: a c a t a g c g t c g t
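One way to read the three dimensions named on this slide is as three contrasts between letter frequencies, using the standard chemical groupings of nucleotides (purines A, G; pyrimidines C, T; amino A, C; keto G, T). The sketch below computes them for one fragment; the function name and the sign conventions are assumptions, and the percentages on the slide are not reproduced by this code.

```python
def letter_contrasts(fragment):
    """Three contrasts of single-letter frequencies for one fragment."""
    fragment = fragment.lower()
    valid = [ch for ch in fragment if ch in "acgt"]
    n = len(valid)
    p = {letter: valid.count(letter) / n for letter in "acgt"}
    return {
        # G + C against A + T
        "gc_content": (p["g"] + p["c"]) - (p["a"] + p["t"]),
        # purines (A, G) against pyrimidines (C, T)
        "purine_pyrimidine": (p["a"] + p["g"]) - (p["c"] + p["t"]),
        # amino (A, C) against keto (G, T)
        "amino_keto": (p["a"] + p["c"]) - (p["g"] + p["t"]),
    }

if __name__ == "__main__":
    print(letter_contrasts("cgtggtgagctgatgctagg"))
```

Because the four frequencies sum to one, these three numbers carry all the information in the single-letter frequencies.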

  26. Non-linear good 2D representation (elastic principal manifolds). Labels: 0% to 100%; T, A, G, C

  27. Measuring densities (panel labels: G, A, C, T)

  28. Contrasting density distribution (two ideas): noise is Gaussian; noise is smooth

  29. Contrasted density (panel labels: G, A, C, T)

  30. Excluding repeats (panel labels: G, A, C, T)

  31. Excluding repeats (panel labels: G, A, C, T)

  32. Papers (type Zinovyev in Google):
     Gorban A, Zinovyev A. PCA deciphers genome. 2005. Arxiv preprint.
     Gorban A, Popova T, Zinovyev A. Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences. 2005. Physica A 353, 365-387.
     Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. 2005. In Silico Biology 5, 0025.
     Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology 3, 0039.
     Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).

  33. People: Dr. Tanya Popova, Institute of Computational Modeling, Russia; Professor Alexander Gorban, University of Leicester, UK
