Improving Gene Function Prediction Using Gene Neighborhoods Kwangmin Choi Bioinformatics Program School of Informatics Indiana University, Bloomington, IN
Introduction: PLATCOM (A Platform for Computational Comparative Genomics) • PLATCOM is a system for the comparative analysis of multiple genomes. • PLATCOM consists of 3 components: • Databases of biological entities • e.g. fna, faa, ptt, gbk… • Databases of relationships among entities • e.g. genome-genome, protein-protein pairwise comparison • Mining tools over the databases • The web interface of the PLATCOM system is located at http://biokdd.informatics.indiana.edu/kwchoi/platcom/
Background: What is an operon? http://biocyc.org:1555/ECOLI/new-image?object=Transcription-Units • The operon structure was described in 1960 by two French biologists. Jacob, F. and Monod, J. (1961) Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol., 3, 318–356. • An operon is a group of genes that encodes functionally linked proteins. Its component genes are: • Adjacent (intergenic distance within 200-300 nt) • On the same strand (+ or -) • Co-expressed from one promoter.
Background: How to identify or predict operon structure? • When a promoter and terminator are known: • Gene clusters = Transcription Units • Classical concept of operon • When a promoter is not known: • Gene clusters = Directrons • Hypothetical operon candidates • Defined by shared direction and a suitable intergenic distance (200-300 nt) • Computational methods have been developed to find gene clusters in bacterial genomes.
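The directron rule above (same strand, intergenic gap within a cutoff) can be sketched as follows. This is a minimal illustration, not the PLATCOM implementation: the `Gene` record, the `find_directrons` name, the 300 nt default, and the toy coordinates are all assumptions made for the example.

```python
from typing import List, NamedTuple

class Gene(NamedTuple):
    name: str
    start: int   # 1-based start coordinate
    end: int     # 1-based end coordinate (inclusive)
    strand: str  # "+" or "-"

def find_directrons(genes: List[Gene], max_gap: int = 300) -> List[List[Gene]]:
    """Group consecutive same-strand genes whose intergenic gap is <= max_gap."""
    clusters: List[List[Gene]] = []
    current: List[Gene] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if current:
            prev = current[-1]
            gap = gene.start - prev.end - 1  # negative when genes overlap
            if gene.strand == prev.strand and gap <= max_gap:
                current.append(gene)
                continue
            clusters.append(current)
        current = [gene]
    if current:
        clusters.append(current)
    # A single gene is not a cluster; keep runs of two or more.
    return [c for c in clusters if len(c) >= 2]

# Toy coordinates, invented for illustration only.
toy = [
    Gene("a", 100, 400, "+"),
    Gene("b", 450, 900, "+"),
    Gene("c", 2000, 2500, "+"),
    Gene("d", 2550, 3000, "-"),
]
directrons = find_directrons(toy)
```

Here genes `a` and `b` form one candidate, `c` is too far from `b`, and `d` lies on the opposite strand. A directron found this way is only a hypothetical operon, since no promoter or terminator evidence is used.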
PCBBH and PCH R. Overbeek et al. PNAS, 1999, Vol. 96, pp. 2896-2901 • PCBBH: Pair of Close Bidirectional Best Hits • BBH: Bidirectional Best Hits • PCH: Pair of Close Homologs • COG: Clusters of Orthologous Groups
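The BBH and PCBBH definitions can be illustrated with a short sketch. It assumes that best hits between two genomes A and B are already available as dictionaries and that "close" gene pairs have been precomputed for each genome. The function names and the toy gene identifiers are hypothetical and do not come from the cited paper.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]

def bidirectional_best_hits(best_ab: Dict[str, str],
                            best_ba: Dict[str, str]) -> Set[Pair]:
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}

def pcbbh(bbh: Set[Pair], close_a: Set[Pair], close_b: Set[Pair]):
    """Pairs of close genes in A whose BBH partners are also close in B."""
    partner = dict(bbh)  # gene in A -> its BBH partner in B
    result = set()
    for x1, x2 in close_a:
        y1, y2 = partner.get(x1), partner.get(x2)
        if y1 is None or y2 is None:
            continue  # at least one gene has no bidirectional best hit
        if (y1, y2) in close_b or (y2, y1) in close_b:
            result.add(((x1, x2), (y1, y2)))
    return result

# Toy best-hit tables, invented for illustration only.
best_ab = {"a1": "b1", "a2": "b2", "a3": "b9"}
best_ba = {"b1": "a1", "b2": "a2", "b9": "a7"}
bbh = bidirectional_best_hits(best_ab, best_ba)
pairs = pcbbh(bbh, close_a={("a1", "a2"), ("a2", "a3")}, close_b={("b2", "b1")})
```

A PCH would relax the BBH requirement to any sufficiently similar homolog, which this sketch does not cover.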
Background: Über-operon P. Bork et al. Trends Biochem. Sci., Vol. 25, pp. 474-479 • Über-operon: A set of genes with close functional and regulatory contexts that tends to be conserved despite numerous rearrangements. • This concept focuses on the functional themes of operons, not on specific genes or gene order.
Background: Why are gene clusters conserved? • Certain operons, particularly those that encode subunits of multiprotein complexes (e.g. ribosomal proteins), are conserved in phylogenetically distant bacterial genomes. • These gene clusters might have been conserved since the last universal common ancestor. Why? • Selfish-operon hypothesis: Horizontal transfer of an entire operon is favored by natural selection over transfer of individual genes because co-expression and co-regulation are preserved.
Background: Problems in Operon Prediction • Over 150 genomes have been fully sequenced to date, but the biological functions of some genes are still unknown. • There are only a few promoter detection algorithms, and they are not fully satisfactory. • In many cases, genomic data files do not provide full information on genes and their products (e.g. gene name, COG, PID). • Operons tend to undergo multiple rearrangements during evolution. • As a result, gene order at a level above the operon is poorly conserved (e.g. genes involved in de novo purine synthesis).
Background: Problems in Computational Algorithms to Predict Operons • Direct Signal Finding • Experiment-based approach • Transcription promoters (5’-end) and terminators (3’-end) are searched for. • Only effective for species whose transcription signals are well known, e.g. E. coli. • Combination of gene expression data, functional annotation and other experimental data. • Literature-based approach • Primarily applicable to well-studied genomes such as E. coli, because data files are incomplete for other genomes. • In many cases, genomic data files do not provide full information on genes and their products (e.g. gene name, COG, PID).
Procedure • As part of the PLATCOM project, an integrated whole-genome analysis system was built on the BIOKDD server. • A web interface to the all-to-all pairwise comparison DB and tools is also provided. • Several tools for multiple-genome analysis were written in Perl, and gene neighborhoods were then reconstructed from the clustering data. • My gene clustering algorithm was used to compensate for the shortcomings of the literature-based approach. • Connected gene neighborhoods were analyzed to predict gene function and functional coupling between clusters.
Materials / Tools • Raw Data • 22 genomes were chosen for this study. (14 groups) • Protein-Protein Pairwise Comparison Data • e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/L42023.faa.U00096.faa.cmp.txt • PTT files from NCBI site • e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/U00096.ptt.txt • Data Generated by Web Tools • Gene Clustering Data (based on sequence homology) • e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/clustering_13321_23_750.txt • Gene Clusters generated from PTT file (given intergenic distance) • e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/candidates_22211.htm • E. coli database for reality check • http://biocyc.org/ • http://ecocyc.org/
Procedure: My Approach to Reconstructing Genomic Neighborhoods • The idea underlying this study is that • Different genomes contain different, overlapping parts of evolutionarily and functionally connected gene neighborhoods. • By generating a “Tiling Path”, the entire neighborhood can be reconstructed. • The genomic context of a well-known genome (e.g. E. coli) is used as a contextual framework. • Start from this framework and then search for a group of similar gene neighborhoods in the target genomes. • “Genomic context” means the pattern of a series of COGs. If a COG is not given, we can predict the function of an unknown gene based on my gene clustering data. • We can also identify some “Hitchhikers”. “Hitchhikers” are inserted genes that originate from different contexts/themes.
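The tiling idea can be sketched as a simple merge of overlapping fragments. This is a rough illustration under stated assumptions, not the method used in the study: each genome is assumed to contribute one fragment as a list of COG identifiers, the seed plays the role of the contextual framework, and the `min_shared` overlap rule, the function name, and the genome labels are hypothetical.

```python
from typing import Dict, List, Set, Tuple

def reconstruct_neighborhood(seed: List[str],
                             fragments: Dict[str, List[str]],
                             min_shared: int = 2) -> Tuple[Set[str], List[str]]:
    """Grow a COG neighborhood from a seed by absorbing overlapping fragments."""
    neighborhood: Set[str] = set(seed)
    used: List[str] = []  # genomes absorbed so far, in order
    changed = True
    while changed:  # repeat, because each absorbed fragment can enable new overlaps
        changed = False
        for genome, cogs in fragments.items():
            if genome in used:
                continue
            if len(neighborhood & set(cogs)) >= min_shared:
                neighborhood |= set(cogs)
                used.append(genome)
                changed = True
    return neighborhood, used

# Fragment contents are invented for illustration only.
seed = ["COG0779", "COG0195", "COG2740"]
fragments = {
    "genomeB": ["COG2740", "COG0532", "COG0858"],
    "genomeA": ["COG0195", "COG2740", "COG0532"],
    "genomeC": ["COG1869", "COG1129"],
}
neighborhood, used = reconstruct_neighborhood(seed, fragments)
```

In this toy input `genomeB` overlaps the seed by only one COG at first, but it is absorbed on the second pass once `genomeA` has extended the neighborhood. `genomeC` shares nothing with the neighborhood and is never absorbed.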
Tiling Path E.V. Koonin et al. Nucleic Acids Research, 2002, Vol. 30, No. 10, pp. 2212-2223
Results • Case 1 • Relationship between Gene Order and Phylogenetic Distance • Case 2 • One theme: Typical Operon (rbs operon) • Reconstruct gene neighborhoods • Find missing components from the reconstructed gene clusters • Case 3 • Two or more themes: Functional Coupling? • Find genomic hitchhikers • Predict gene function of uncharacterized proteins • Predict functional coupling
Case 1: Gene Order and Phylogenetic Distance • If the gene order of two genomes is well conserved, the sequence of homologs should appear as a line on the genome comparison diagonal plot. • What is the relationship between phylogenetic distance and the conservation of gene order?
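A diagonal plot reduces to a set of (position in genome A, position in genome B) points, and conserved gene order shows up as runs of points that advance together. The sketch below is one possible way to compute such runs. The hit tuples, the gene-order dictionaries, and the greedy non-overlapping run detection are assumptions made for illustration, and the Z-scores are toy values.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]

def dot_plot_points(hits: List[Tuple[str, str, float]],
                    order_a: Dict[str, int],
                    order_b: Dict[str, int],
                    min_z: float = 500.0) -> List[Point]:
    """Convert homolog hits into (index in A, index in B) points above a Z-score cutoff."""
    return sorted((order_a[a], order_b[b]) for a, b, z in hits if z > min_z)

def collinear_runs(points: List[Point], min_len: int = 2) -> List[List[Point]]:
    """Greedy runs that step by +1 in A and consistently by +1 or -1 in B."""
    runs: List[List[Point]] = []
    current: List[Point] = []
    step = 0
    for p in points:
        if current:
            dx, dy = p[0] - current[-1][0], p[1] - current[-1][1]
            if dx == 1 and dy in (1, -1) and (len(current) == 1 or dy == step):
                step = dy
                current.append(p)
                continue
            if len(current) >= min_len:
                runs.append(current)
        current, step = [p], 0
    if len(current) >= min_len:
        runs.append(current)
    return runs

# Toy gene orders and hits, invented for illustration only.
order_a = {f"a{i}": i for i in range(5)}
order_b = {f"b{i}": i for i in range(5)}
hits = [("a0", "b2", 900.0), ("a1", "b3", 800.0), ("a2", "b4", 700.0),
        ("a3", "b0", 650.0), ("a4", "b1", 100.0)]
points = dot_plot_points(hits, order_a, order_b)
runs = collinear_runs(points)
```

Passing the points to any scatter-plot routine gives the diagonal plot itself. The runs correspond to the short lines that may be read as putative operons.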
Phylogenetic Tree V. Daubin et al. Genome Research, Vol. 12, Issue 7, pp. 1080-1090
Genome Comparison Diagonal Plot: Phylogenetically-Distant Species (Z-score > 500)
Genome Comparison Diagonal Plot: Phylogenetically-Close Species (Z-score > 1000)
Case 1: Conclusion • Gene order in phylogenetically distant species is poorly conserved. • But this observation does not mean that gene order is well conserved among phylogenetically close species. • Even for closely related species (e.g. E. coli vs. H. influenzae), gene order is completely scattered. • In most cases, only a small number of genes appear as a short line or cluster, which we may consider a putative operon. • The next step investigates this possibility in more depth.
Case 2: Rbs Operon (Typical Operon) • Theme: Ribose transport across membrane • COG1869 D-ribose high-affinity transport system; membrane-associated protein • COG1129 ATP-binding component of D-ribose high-affinity transport system • COG1172 D-ribose high-affinity transport system • COG1879 D-ribose periplasmic binding protein • COG0524 ribokinase • COG1609 regulator for rbs operon http://biocyc.org:1555/ECOLI/new-image?type=OPERON&object=TU00206
Case 2: Conclusion • All components are involved in ribose transport across the bacterial cell membrane. • In the Rbs operon system, the gene order pattern is 1869-1129-1172-1879-0524-1609. • 10 out of 22 genomes have this operon system. • Except in some cases, this gene order pattern is well conserved. • So it is possible that there exists a kind of “General Contextual Framework” of gene order.
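Checking a genome against the Rbs pattern can be sketched as a search for the longest contiguous stretch of the pattern, read in either orientation, plus a list of components that are absent. The function names and the toy genome are hypothetical, and the `other1`/`other2` entries are placeholders rather than real COGs.

```python
from typing import List

RBS_PATTERN = ["COG1869", "COG1129", "COG1172", "COG1879", "COG0524", "COG1609"]

def longest_common_run(pattern: List[str], genome_cogs: List[str]) -> List[str]:
    """Longest contiguous stretch of `pattern` found in `genome_cogs`, in either orientation."""
    best: List[str] = []
    for candidate in (pattern, pattern[::-1]):  # forward, then reverse strand order
        for i in range(len(candidate)):
            for j in range(len(genome_cogs)):
                k = 0
                while (i + k < len(candidate) and j + k < len(genome_cogs)
                       and candidate[i + k] == genome_cogs[j + k]):
                    k += 1
                if k > len(best):
                    best = candidate[i:i + k]
    return best

def missing_components(pattern: List[str], genome_cogs: List[str]) -> List[str]:
    """Pattern members that do not occur anywhere in the genome's COG list."""
    return [c for c in pattern if c not in genome_cogs]

# Toy gene order, invented for illustration only.
toy_genome = ["other1", "COG1609", "COG0524", "COG1879", "COG1172", "other2"]
run = longest_common_run(RBS_PATTERN, toy_genome)
missing = missing_components(RBS_PATTERN, toy_genome)
```

In the toy genome the operon appears in reverse orientation and lacks two components. A real search would look for those missing components among uncharacterized genes adjacent to the conserved run.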
Case 3: Functional Coupling of 2 or more themes • Theme 1: Transcription • COG0779 Uncharacterized Conserved Protein • COG0195 Transcription elongation factor • COG2740 Predicted nucleic-acid-binding protein (transcription termination?) • Theme 2: Translation • COG1358 Ribosomal protein S17E • COG0532 Translation initiation factor 2 (GTPase) • COG1550 Uncharacterized Conserved Protein • COG0858 Ribosome-binding factor A • COG0184 Ribosomal protein S15P/S13E • COG0130 tRNA Pseudouridine synthase • Hitchhiker? • COG0196 FAD Synthase (Hitchhiker?) http://biocyc.org:1555/ECOLI/new-image?type=OPERON&object=TU341
Case 3: Functional Coupling (Z-score > 750, Intergenic Distance = 300)
Case 3: Conclusion • Functional Coupling: • In bacteria, transcription, translation and RNA modification/degradation are coupled, and the advantages of co-regulating the corresponding genes are obvious. • COG0779 (Uncharacterized) is almost inseparable from COG0195 (Transcription Elongation Factor), so it is likely to be a functional partner of COG0195. • Hitchhiker: • The association of COG0196 (FAD synthase) is not as tight as the connections between the genes belonging to the themes. • Gene function prediction: • The functions of 3 genes in the AE0004092 genome can be predicted by reading the genomic context.
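One way to put a number on how "tight" an association is: among genomes that contain both COGs, count the fraction in which the two lie next to each other. The sketch below assumes each genome is given as an ordered list of COG identifiers. The `adjacency_fraction` name, the `max_offset` parameter, and the genome labels `g1` to `g4` are hypothetical, and the toy gene orders are invented and do not reproduce the results on the slide.

```python
from typing import Dict, List

def adjacency_fraction(cog_x: str, cog_y: str,
                       genomes: Dict[str, List[str]],
                       max_offset: int = 1) -> float:
    """Among genomes containing both COGs, the fraction where they lie within max_offset positions."""
    both = adjacent = 0
    for cogs in genomes.values():
        xs = [i for i, c in enumerate(cogs) if c == cog_x]
        ys = [i for i, c in enumerate(cogs) if c == cog_y]
        if not xs or not ys:
            continue  # this genome cannot inform the pair
        both += 1
        if any(abs(i - j) <= max_offset for i in xs for j in ys):
            adjacent += 1
    return adjacent / both if both else 0.0

# Toy gene orders, invented for illustration only.
toy_genomes = {
    "g1": ["COG0779", "COG0195", "COG0532", "COG0196"],
    "g2": ["COG0779", "COG0195", "COG0532", "other", "COG0196"],
    "g3": ["COG0196", "other", "COG0779", "COG0195", "COG0532"],
    "g4": ["COG0779", "COG0195"],
}
tight = adjacency_fraction("COG0779", "COG0195", toy_genomes)
loose = adjacency_fraction("COG0532", "COG0196", toy_genomes)
```

A pair that scores near 1 behaves like COG0779 and COG0195 in the conclusion above, while a clearly lower score is the pattern expected of a hitchhiker. The cutoff separating the two would have to be chosen empirically.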
Conclusion • The Genome Comparison Diagonal Plot visualizes the sequence comparison of 2 genomes. It is a simple tool, but it gives a strong intuition for genome structure. • Conserved gene neighborhoods reconstructed from many genomes by the Tiling Path Method can be used to predict the functions of uncharacterized genes and functional coupling between well-characterized genes in those genomes. • Ultimately, we can use these methods to reconstruct metabolic and functional subsystems.
Acknowledgements • Haifeng Zhao • Genome Pairwise Comparison DB • Scott Martin • Server Management and Technical Support • Dr. Sun Kim • Graduate Advisor and P.I.