
GEBA Project Summary

Dongying Wu


Presentation Transcript


  1. GEBA Project Summary. Dongying Wu.

  2. Phylogenetic Tree Building (Martin Wu): concatenate the alignments of 31 marker genes and build a PHYML tree covering 667 non-GEBA genomes and 53 GEBA genomes.

  3. Phylogenetic Distance (PD): PD = sum of all the branch lengths. Example (tree figure with taxa A, B, C and branches a, b, c, d): PD{A,B,C} = a + b + c + d.
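A minimal sketch of the PD calculation on slide 3, assuming a rooted tree stored as a child -> (parent, branch length) map. The topology used here (A on branch a; B and C on branches b and c under an internal node reached by branch d), the numeric branch lengths, and the function name `phylogenetic_distance` are illustrative assumptions, as is the convention of counting every branch on the paths from the chosen taxa to the root.

```python
def phylogenetic_distance(parents, taxa):
    """Sum the lengths of all branches on the paths from `taxa` to the root.

    `parents` maps each non-root node to (parent, branch_length).
    Each branch is counted once even when several taxa share it.
    """
    seen = set()
    total = 0.0
    for node in taxa:
        # Walk towards the root, stopping at branches already counted.
        while node in parents and node not in seen:
            seen.add(node)
            parent, length = parents[node]
            total += length
            node = parent
    return total


# Hypothetical branch lengths for the example tree.
a, b, c, d = 0.3, 0.1, 0.2, 0.4
tree = {
    "A": ("root", a),
    "inner": ("root", d),
    "B": ("inner", b),
    "C": ("inner", c),
}

pd_abc = phylogenetic_distance(tree, ["A", "B", "C"])  # a + b + c + d
pd_ab = phylogenetic_distance(tree, ["A", "B"])        # a + b + d
```

Under the same convention, the PD contributed by a group of taxa can be taken as the PD of the full taxon set minus the PD of the set without that group.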

  4. Phylogenetic Distance Contribution of GEBA genomes: The total tree PD is 88.8; the GEBA genomes add 11.0 to the tree. 53 random non-GEBA taxa (from a pool of 667) contribute 3.15 to the tree PD (standard deviation 0.68 over 100 samplings). The 26 GEBA actinobacteria add 4.29 to the total PD (actinobacteria as a whole add 8.128 PD). 26 random non-GEBA actinobacteria (from a pool of 47) contribute 1.37 PD (standard deviation 0.28 over 100 samplings).
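The figures on slide 4 can be put on a common scale with a few lines of arithmetic. This sketch only re-derives ratios from the numbers quoted on the slide; the variable names, the assumption that the total of 88.8 already includes the GEBA contribution, and the choice to express the gap in standard deviations of the random samples are illustrative and not part of the original analysis.

```python
# Assumption: 88.8 is the PD of the tree with the GEBA genomes included.
total_pd = 88.8

# (PD added by GEBA genomes, mean PD of random non-GEBA sets, std. dev.)
all_bacteria = (11.0, 3.15, 0.68)
actinobacteria = (4.29, 1.37, 0.28)


def compare(geba_pd, random_mean, random_sd):
    """Return (fold increase over random, gap in standard deviations)."""
    fold = geba_pd / random_mean
    sd_gap = (geba_pd - random_mean) / random_sd
    return fold, sd_gap


geba_share = all_bacteria[0] / total_pd  # fraction of tree PD from GEBA
fold_all, gap_all = compare(*all_bacteria)
fold_act, gap_act = compare(*actinobacteria)

print(f"GEBA share of total PD: {geba_share:.1%}")
print(f"All bacteria: {fold_all:.1f}x random, {gap_all:.1f} SD above")
print(f"Actinobacteria: {fold_act:.1f}x random, {gap_act:.1f} SD above")
```

In both comparisons the GEBA genomes add roughly three times the PD of an equally sized random set.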

  5. Gene Family Classification: blastp with an E-value cutoff of 1e-10, reporting up to 10000 hits. Only blastp hits that span 80% of the lengths of both genes are kept as links. 227,562 genes from 56 genomes => 17,176,180 links.
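The link filter on slide 5 can be sketched as follows. The `Hit` record and its field names are hypothetical (they are not taken from the slides or from any BLAST output format), and the sketch reads "span 80% of the lengths of both genes" as the aligned region covering at least 80% of each protein. Dropping self-hits is a further assumption.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    # Hypothetical record; field names are not taken from the slides.
    query: str
    subject: str
    evalue: float
    q_start: int  # 1-based, inclusive
    q_end: int
    q_len: int    # full length of the query protein
    s_start: int
    s_end: int
    s_len: int    # full length of the subject protein


def is_link(hit, max_evalue=1e-10, min_span=0.8):
    """Keep a hit only if it passes the E-value cutoff and the aligned
    region covers at least `min_span` of both proteins."""
    q_span = (hit.q_end - hit.q_start + 1) / hit.q_len
    s_span = (hit.s_end - hit.s_start + 1) / hit.s_len
    return (hit.evalue <= max_evalue
            and hit.query != hit.subject  # assumption: drop self-hits
            and q_span >= min_span
            and s_span >= min_span)


hits = [
    Hit("g1", "g2", 1e-50, 1, 290, 300, 5, 295, 310),  # spans both genes
    Hit("g1", "g3", 1e-40, 1, 100, 300, 1, 100, 105),  # short on the query
    Hit("g2", "g3", 1e-5, 1, 300, 310, 1, 100, 105),   # E-value too weak
]
links = [(h.query, h.subject) for h in hits if is_link(h)]
```

Only the first toy hit survives both tests, so `links` holds the single pair g1/g2.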

  6. MCL Clustering Algorithm: links (matrix of sequence identities) -> expansion -> inflation (I=2) -> equilibrium state.
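A toy, dense-matrix sketch of the expansion/inflation loop named on slide 6. It is not the MCL program itself: the function names, the added self-loops, the convergence tolerance, and the way clusters are read off the final matrix are all assumptions made for illustration. Only the inflation value of 2 comes from the slide.

```python
def normalize_columns(m):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation: raise entries to `power`, then re-normalise columns."""
    return normalize_columns([[x ** power for x in row] for row in m])


def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    n = len(weights)
    # Self-loops are a common MCL convention; added here as an assumption.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        change = max(abs(nxt[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = nxt
        if change < tol:  # equilibrium state reached
            break
    # Rows that keep non-zero mass act as attractors; the columns they
    # attract are reported as one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two unconnected groups of genes, linked by hypothetical identities.
w, v = 0.9, 0.8
weights = [
    [0, w, w, 0, 0],
    [w, 0, w, 0, 0],
    [w, w, 0, 0, 0],
    [0, 0, 0, 0, v],
    [0, 0, 0, v, 0],
]
clusters = mcl(weights)
```

Higher inflation values generally split the graph into more, smaller clusters, which is why the setting matters for family sizes.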

  7. [Figure: number of families by family size (genes/genome). 50-100: 1; 20-50: 3; 10-20: 6; 5-10: 27; 1-5: 511; 20/56-1: 1305; 10/56-20/56: 1588; 5/56-10/56: 2755; 2/56-5/56: 10601; 1/56: 46689.]

  8. Evenness estimation. Gene distribution ratio for family X, by genome: A 0.316, B 0.105, C 0.026, D 0, E 0.184, F 0.215, G 0.158; Median 0.184. Distances shown on the slide: 0.132, 0.079, 0.158, 0.184, 0, 0.031, 0.031. dist: average distance = 0.087. Evenness = 100 x e^(-4 x dist).
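A short sketch of the evenness formula on slide 8. It assumes that "dist" is the mean absolute distance between each genome's ratio and the value the slide labels "Median"; that reference value (0.184) is passed in explicitly rather than recomputed. The function names are illustrative.

```python
import math


def mean_distance(ratios, center):
    """Average absolute distance of the per-genome ratios from `center`."""
    return sum(abs(r - center) for r in ratios) / len(ratios)


def evenness(dist):
    """Evenness = 100 * e^(-4 * dist), as written on the slide."""
    return 100.0 * math.exp(-4.0 * dist)


# Gene distribution ratios for family X in genomes A-G (from the slide).
ratios = [0.316, 0.105, 0.026, 0.0, 0.184, 0.215, 0.158]
center = 0.184  # value labelled "Median" on the slide

dist = mean_distance(ratios, center)
score = evenness(dist)
print(f"dist = {dist:.3f}, evenness = {score:.1f}")
```

A perfectly even family (dist = 0) scores 100, and the score decays exponentially as the ratios spread out; with dist near 0.087 it comes out at roughly 70.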

  9. Universality: ratio of genomes that a family appears in. Evenness: even distribution of gene family members across genomes. Size: number of members in a gene family.
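Size and universality as defined on slide 9 can be computed directly from a family's membership list. This is a sketch under assumptions: the (gene, genome) pair representation, the function name `family_stats`, and the toy identifiers are all hypothetical.

```python
from collections import Counter


def family_stats(members, n_genomes):
    """`members` is a list of (gene_id, genome_id) pairs for one family.

    Returns (size, universality, genes per genome).
    """
    per_genome = Counter(genome for _gene, genome in members)
    size = len(members)                        # number of member genes
    universality = len(per_genome) / n_genomes  # share of genomes hit
    return size, universality, per_genome


# Toy family: 5 genes found in 3 of 4 genomes.
members = [
    ("gene1", "A"), ("gene2", "A"), ("gene3", "A"),
    ("gene4", "B"),
    ("gene5", "C"),
]
size, universality, per_genome = family_stats(members, n_genomes=4)
```

The per-genome counts are the raw input for an evenness score, since they show how unevenly the five genes are spread over the genomes.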

  10. Family size

  11. Large families (famID, size, function):
      F2669  4210 (75/genome)  ABC-type transport system ATP-binding protein
      F2670  1542 (27/genome)  multi-sensor hybrid histidine kinase
      F2671  1367 (24/genome)  short chain dehydrogenase
      F2672  1157 (20/genome)  acyl-CoA synthetase
      F2673   782 (14/genome)  serine/threonine protein kinase
      F2674   755 (13/genome)  two-component system response regulator (LuxR family)
      F2675   735 (13/genome)  two-component system response regulator (winged helix family)
      F2676   614 (11/genome)  drug resistance transporter
      F2677   606 (11/genome)  transcriptional regulator, LacI family
      F2678   568 (10/genome)  two-component system sensor histidine kinase
      F2679   543 (10/genome)  sugar ABC transporter, permease component

  12. Low universality large families (famID, size, organism number, family function, taxonomy):
      F2682  461   7  outer membrane protein                 Bacteroidetes; Proteobacteria
      F2699  303   6  outer membrane protein                 Bacteroidetes
      F2736  180   6  anti-sigma factor                      Bacteroidetes; Proteobacteria
      F2760  153   6  transcriptional regulator, AraC family Bacteroidetes; Proteobacteria
      F2772  147   5  RNA polymerase ECF-type sigma factor   Bacteroidetes (Sphingobacteriales)
      F2801  129  11  DNA-binding protein                    Actinobacteria (Actinobacteridae)
      F2827  114   3  FtsX transmembrane transport protein   Bacteroidetes (Sphingobacteriales)
      F2867  103   3  hypothetical protein                   Actinobacteria (Coriobacteriaceae)

  13. 3 out of 9 largest families have very low evenness values (< 5). [Figure: three bar-chart panels titled two-component system response regulator (LuxR), short chain dehydrogenase, and acyl-CoA synthetase.]

  14. Phylum-specific family (Actinobacteria: 26/56 genomes)

  15. 712 families (size >= 10) are phylum specific. [Figure: family size (0-350) plotted against organism number (0-25), annotated with the counts 42 and 670.]

  16. Phylum-specific families from more than two organisms

  17. The largest 6 phylum-specific families:
      F2699   Bacteroidetes = 303;  outer membrane protein
      F2752*  Actinobacteria = 160; RNA polymerase, sigma-24 subunit, ECF family
      F2772   Bacteroidetes = 147;  putative ECF-type RNA polymerase sigma factor
      F2801   Actinobacteria = 129; DNA-binding protein
      F2827   Bacteroidetes = 114;  FtsX-related transmembrane transport protein
      F2867   Actinobacteria = 103; unknown functions
      * From 15 organisms

  18. Novel gene families: none of the genes in the family has a GenBank hit (E-value cutoff: 1e-5).

  19. Streptococcus agalactiae “pan-genome”: Tettelin H. et al., PNAS 2005;102:13950-13955

  20. 217,079 genes from 53 GEBA bacterial genomes form 60024 families. For N genomes, count the number of families present in the selected genomes. A: N runs from 1 to 53. B: for every N, sample the families 100 times.
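The sampling procedure on slide 20 can be sketched as an accumulation curve. This is one reading of the slide, not the original script: it assumes that each of the 100 samples is a random choice of N genomes and that the curve reports the mean number of distinct families found. The function name, the fixed seed, and the toy data are hypothetical.

```python
import random


def family_accumulation(genome_families, n_samples=100, seed=0):
    """For each N, average the number of distinct families found in
    N randomly chosen genomes over `n_samples` random draws."""
    rng = random.Random(seed)
    genomes = list(genome_families)
    curve = []
    for n in range(1, len(genomes) + 1):
        counts = []
        for _ in range(n_samples):
            chosen = rng.sample(genomes, n)
            families = set().union(*(genome_families[g] for g in chosen))
            counts.append(len(families))
        curve.append(sum(counts) / n_samples)
    return curve


# Toy input: genome -> set of family IDs present in it.
genome_families = {
    "g1": {"F1", "F2"},
    "g2": {"F2", "F3"},
    "g3": {"F1", "F2", "F3", "F4"},
}
curve = family_accumulation(genome_families)
```

The last point of the curve always equals the total number of families, and the slope at each N estimates how many new families one more genome adds.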

  21. [Figure: gene family number (including families with single members) against genome number for Bacteria from the GEBA project; a second plot labelled "New Genome families" against number of genomes.]

  22. Streptococcus agalactiae (8 strains); Enterobacteriaceae (40 genomes: Escherichia coli, Yersinia pestis, Salmonella enterica, 3 Shigella flexneri); Actinobacteria (73 genomes, including 26 GEBA genomes); Bacteria (53 GEBA genomes).

  23. [Figure: gene family number (including families with single members) against genome number for S. agalactiae, Enterobacteriaceae, Actinobacteria, and Bacteria from the GEBA project.]

  24. [Figure: total gene number against genome number for S. agalactiae, Enterobacteriaceae, Actinobacteria, and Bacteria from the GEBA project.]

  25. [Figure: gene family number against total gene number for S. agalactiae, Enterobacteriaceae, Actinobacteria, and Bacteria from the GEBA project.]

  26. Calculate the PD (Phylogenetic Diversity) of a sub-tree

  27. [Figure: phylogenetic diversity against genome number for Actinobacteria, Enterobacteriaceae, and Bacteria from the GEBA project.]

  28. [Figure: gene family number against phylogenetic diversity for Actinobacteria, Enterobacteriaceae, and Bacteria from the GEBA project.]

  29. How far down the road GEBA has to go in terms of PD coverage. Pipeline: 232812 Bacterial/Archaeal ss-rRNA sequences from Greengenes; MCL clustering (99% identity at 80% span) gives 45997 clusters; filter out ss-rRNA from genome projects (99% identity cutoff); filter out 18751 low-quality sequences (short sequences <=1200 nt, low-quality sequences, duplicates, chimeras). Representatives: 667 Combo Bacterial ss-rRNA, 50 Combo Archaeal ss-rRNA, 56 GEBA ss-rRNA, 42426 Greengenes Bacterial/Archaeal ss-rRNA. Retrieve alignments from Greengenes, trim by the Greengenes mask, and build a QuickTree distance tree for all representatives.

  30. Same pipeline for non-environmental sequences: 74437 non-environmental Bacterial/Archaeal ss-rRNA sequences from Greengenes; MCL clustering (99% identity at 80% span) gives 10397 clusters; filter out ss-rRNA from genome projects (99% identity cutoff); filter out low-quality sequences (short sequences <=1200 nt, low-quality sequences, duplicates, chimeras). Representatives: 667 Combo Bacterial ss-rRNA, 50 Combo Archaeal ss-rRNA, 56 GEBA ss-rRNA, 9946 Greengenes Bacterial/Archaeal ss-rRNA. Retrieve alignments from Greengenes, trim by the Greengenes mask, and build a QuickTree distance tree for the non-environmental representatives.

  31. Groups: Pre-GEBA, GEBA, Greengenes. * Start from Haemophilus influenzae Rd KW20. ** In each group, the taxa are sorted by their PD contributions in descending order.

  32. [Figure: phylogenetic diversity against organism numbers, with series for pre-GEBA genomes, GEBA genomes, organisms from the greengenes database, and organisms from the greengenes database (excluding environmental samples).]

  33. The slopes of the linear regression lines represent the PD contribution of the genomes (each window contains 50 genomes).
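A sketch of the windowed regression described on slide 33, assuming the input is the cumulative PD after each added genome and that the windows are consecutive and non-overlapping. Both function names and the toy curve are hypothetical; the slide only states that each window contains 50 genomes.

```python
def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def window_slopes(cumulative_pd, window=50):
    """Slope of cumulative PD vs. genome count in consecutive windows.

    A trailing partial window is ignored.
    """
    slopes = []
    for start in range(0, len(cumulative_pd) - window + 1, window):
        xs = list(range(start + 1, start + window + 1))
        ys = cumulative_pd[start:start + window]
        slopes.append(slope(xs, ys))
    return slopes


# Toy curve: the first 50 genomes add 0.2 PD each, the next 50 add 0.05.
cumulative_pd = []
running = 0.0
for i in range(100):
    running += 0.2 if i < 50 else 0.05
    cumulative_pd.append(running)

slopes = window_slopes(cumulative_pd)
```

Each slope approximates the average PD added per genome inside its window, which makes windows from different groups directly comparable.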

  34. Non-environmental tree: Only the top 150 PD contributors out of 717 pre-GEBA genomes have an average PD contribution greater than the GEBA genomes. Genome sequencing efforts to date have covered only 11.5% of the phylogenetic diversity in this study. We can pick an additional 550 organisms and still have an average PD contribution greater than or equal to that of the 56 GEBA genomes. To increase PD coverage to 50%, we need to sequence at least 1520 more genomes.

  35. All-representative tree: Current genome sequences cover only 2.2% of the PD. We can pick an additional 4400 organisms and still have an average PD contribution greater than or equal to that of the 56 GEBA genomes. To cover 50% of the phylogenetic diversity, we have to sequence 9218 more genomes.
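The "how many more genomes" figures on slides 34 and 35 can be read as a greedy count over organisms sorted by PD contribution. The sketch below is an assumption about that logic, not the original calculation: it treats each organism's contribution as fixed and additive, which simplifies how PD behaves on a real tree. The function name and the toy numbers are hypothetical.

```python
def genomes_needed(contributions, covered_pd, total_pd, target=0.5):
    """Count how many extra genomes, taken in descending order of PD
    contribution, are needed to reach `target` coverage of `total_pd`.

    Returns None if the target cannot be reached.
    """
    goal = target * total_pd
    running = covered_pd
    if running >= goal:
        return 0
    ranked = sorted(contributions, reverse=True)
    for count, gain in enumerate(ranked, start=1):
        running += gain
        if running >= goal:
            return count
    return None


# Toy case: 10 of 100 PD units covered, 27 unsequenced organisms.
contributions = [5.0] * 4 + [10.0] * 3 + [1.0] * 20
needed = genomes_needed(contributions, covered_pd=10.0, total_pd=100.0)
```

Because the largest contributors are taken first, the count is a lower bound, which matches the "at least" wording on slide 34.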

  36. rbcL

  37. rbcL active sites: catalytic, RuBP binding

  38. Calvin cycle [pathway diagram]. Metabolites: CO2, Ribulose-5-P, Ribulose-1,5-P, Xylulose-P, Glycerate-3-P, P-glyceroyl-P, GAP, DHAP, Fructose-1,6-P, Fructose-6-P. Genes: rpe, rbcL, tktA, glpX, gap, pgk, tpiA.

  39. Calvin Cycle (* = finished genome)
