Improving gene function prediction using gene neighborhoods

Kwangmin Choi

Bioinformatics Program

School of Informatics

Indiana University, Bloomington, IN

Introduction: PLATCOM (A Platform for Computational Comparative Genomics)

  • PLATCOM is a system for the comparative analysis of multiple genomes.

  • PLATCOM consists of 3 components:

    • Databases of biological entities

      • e.g. fna, faa, ptt, gbk…

    • Databases of relationships among entities

      • e.g. genome-genome, protein-protein pairwise comparison

    • Mining tools over the databases

  • The web interface of the PLATCOM system is located at

PLATCOM Web Interface Frontpage of Genome Plot

Background: What is an operon?

  • The operon structure was described in 1960 by two French biologists: Jacob, F. and Monod, J. (1961) Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol., 3, 318–356.

  • An operon is a group of genes that encodes functionally linked proteins. Its member genes are:

    • Adjacent (intergenic distance within 200-300 nt)

    • On the same strand (+ or -)

    • Co-expressed from one promoter.

Background: How to identify or predict operon structure?

  • When a promoter and terminator are known:

    • Gene clusters = Transcription Units

    • Classical concept of operon

  • When a promoter is not known:

    • Gene clusters = Directrons

    • Hypothetical operon candidates

    • Defined by shared direction (strand) and a suitable intergenic distance (200-300 nt)

  • Computational methods have been developed to find gene clusters in bacterial genomes.
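
The directron rule above (same strand, intergenic distance within a cutoff) can be sketched as follows. This is a minimal illustration, not the PLATCOM implementation: the `Gene` record, the function name `find_directrons`, the 300 nt default, and the toy coordinates are all assumptions made for the example.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based start coordinate
    end: int     # inclusive end coordinate
    strand: str  # "+" or "-"


def find_directrons(genes: List[Gene], max_gap: int = 300) -> List[List[Gene]]:
    """Group consecutive same-strand genes separated by at most max_gap nt."""
    clusters: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if clusters:
            prev = clusters[-1][-1]
            gap = gene.start - prev.end - 1  # negative when genes overlap
            if gene.strand == prev.strand and gap <= max_gap:
                clusters[-1].append(gene)
                continue
        clusters.append([gene])  # strand change or large gap starts a new cluster
    return clusters


# Toy coordinates, invented for illustration only.
genes = [
    Gene("a", 100, 400, "+"),
    Gene("b", 450, 900, "+"),
    Gene("c", 1000, 1500, "-"),
    Gene("d", 1600, 2000, "-"),
    Gene("e", 3000, 3500, "-"),
]
directrons = find_directrons(genes)
directron_names = [[g.name for g in cluster] for cluster in directrons]
# directron_names == [["a", "b"], ["c", "d"], ["e"]]
```

Single-gene clusters are kept here; a caller interested only in operon candidates would probably filter for clusters with two or more genes.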


R. Overbeek et al., PNAS, 1999, Vol. 96, pp. 2896-2901

PCBBH: Pair of Close Bidirectional Best Hits

BBH: Bidirectional Best Hits

PCH: Pair of Close Homologs

COG: Clusters of Orthologous Groups
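
To make the BBH and PCBBH terms concrete, here is a small sketch. It assumes that similarity scores are already available as dictionaries keyed by (query, hit), and that "closeness" within each genome has been precomputed as a set of gene pairs; the function names, scores, and gene labels are hypothetical and are not taken from the cited paper or from PLATCOM.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Scores = Dict[Tuple[str, str], float]
Pair = Tuple[str, str]


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query gene to its highest-scoring hit."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, hit), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (hit, score)
    return {query: hit for query, (hit, _) in best.items()}


def bidirectional_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> Set[Pair]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return {(a, b) for a, b in a_best.items() if b_best.get(b) == a}


def close_bbh_pairs(
    bbh: Set[Pair],
    close_a: Set[FrozenSet[str]],
    close_b: Set[FrozenSet[str]],
) -> Set[Tuple[Pair, Pair]]:
    """Pairs of BBHs whose genes are close to each other in both genomes."""
    ordered: List[Pair] = sorted(bbh)
    found = set()
    for i, (a1, b1) in enumerate(ordered):
        for a2, b2 in ordered[i + 1:]:
            if frozenset((a1, a2)) in close_a and frozenset((b1, b2)) in close_b:
                found.add(((a1, b1), (a2, b2)))
    return found


# Toy scores, invented for illustration only.
a_vs_b = {("a1", "b1"): 900, ("a1", "b2"): 200, ("a2", "b2"): 800,
          ("a3", "b3"): 700, ("a3", "b1"): 750}
b_vs_a = {("b1", "a1"): 900, ("b2", "a2"): 800, ("b3", "a3"): 700}
bbh = bidirectional_best_hits(a_vs_b, b_vs_a)
# a3's best hit is b1, but b1's best hit is a1, so a3 has no BBH.
pcbbh = close_bbh_pairs(bbh, {frozenset(("a1", "a2"))}, {frozenset(("b1", "b2"))})
```

The closeness sets would normally be derived from gene coordinates, for example from genes that fall in the same directron.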

Background: Über-operon (P. Bork et al., Trends Biochem. Sci., Vol. 25, pp. 474-479)

  • Über-operon: A set of genes with close functional and regulatory contexts that tends to be conserved despite numerous rearrangements.

  • This concept focuses on the functional themes of operons, not on specific genes or gene order.

Background: Why are gene clusters conserved?

  • Certain operons, particularly those that encode subunits of multiprotein complexes (e.g. ribosomal proteins), are conserved in phylogenetically distant bacterial genomes.

  • These gene clusters might have been conserved since the last universal common ancestor. Why?

  • Selfish-operon hypothesis: Horizontal transfer of an entire operon is favored by natural selection over transfer of individual genes because co-expression and co-regulation are preserved.

Background: Problems in Operon Prediction

  • Over 150 genomes have been fully sequenced to date, but the biological functions of some genes are still unknown.

  • There are only a few promoter detection algorithms, and they are not fully satisfactory.

  • In many cases, genomic data files do not provide full information about genes and their products (e.g. gene name, COG, PID).

  • Operons tend to undergo multiple rearrangements during evolution.

    • As a result, gene order at a level above the operon is poorly conserved (e.g. genes involved in de novo purine synthesis).

Background: Problems in Computational Algorithms to Predict Operons

  • Direct Signal Finding

    • Experiment-based approach

    • Transcription promoters (5’-end) and terminators (3’-end) are searched for.

    • Only effective for species whose transcription signals are well known, e.g. E. coli.

  • Combination of gene expression data, functional annotation and other experimental data

    • Literature-based approach

    • Primarily applicable to well-studied genomes such as E. coli, because data files are incomplete for other genomes.

    • In many cases, genomic data files do not provide full information about genes and their products (e.g. gene name, COG, PID).


  • As part of the PLATCOM project, an integrated whole-genome analysis system was built on the BIOKDD server.

    • A web interface for the all-to-all pairwise comparison DB and tools is also provided.

  • Several tools for multiple-genome analysis were written in Perl, and gene neighborhoods were then reconstructed from the clustering data.

    • My gene clustering algorithm was used to compensate for the shortcomings of the literature-based approach.

  • Connected gene neighborhoods were analyzed to predict gene function and functional coupling between clusters.

Materials / Tools

  • Raw Data

    • 22 genomes (14 groups) were chosen for this study.

    • Protein-Protein Pairwise Comparison Data

      • e.g.

    • PTT files from the NCBI site

      • e.g.

  • Data Generated by Web Tools

    • Gene Clustering Data (based on sequence homology)

      • e.g.

    • Gene Clusters generated from a PTT file (for a given intergenic distance)

      • e.g.

  • E. coli database for a reality check
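
As a rough illustration of how a PTT file might be read before clustering, the sketch below parses a tab-delimited protein table. It assumes the commonly seen NCBI layout (a few free-text lines, then a header row beginning with "Location", then one row per protein with a "start..end" location); the column names and the sample rows are assumptions for the example and should be checked against the actual files used.

```python
import io
from typing import Any, Dict, Iterator, TextIO


def parse_ptt(handle: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one dict per protein row of a PTT-style table."""
    columns = None
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if columns is None:
            # Skip the free-text preamble until the column header row.
            if fields[0] == "Location":
                columns = fields
            continue
        if len(fields) != len(columns):
            continue  # ignore blank or malformed rows
        record: Dict[str, Any] = dict(zip(columns, fields))
        start, end = record["Location"].split("..")
        record["start"], record["end"] = int(start), int(end)
        yield record


# Hypothetical sample content, invented for illustration only.
HEADER = ["Location", "Strand", "Length", "PID", "Gene",
          "Synonym", "Code", "COG", "Product"]
ROWS = [
    ["190..255", "+", "21", "1000001", "geneA", "x0001", "-", "-",
     "hypothetical protein"],
    ["337..2799", "-", "820", "1000002", "geneB", "x0002", "-", "COG1879G",
     "example binding protein"],
]
sample = "Example genome, complete genome - 1..5000\n2 proteins\n"
sample += "\n".join("\t".join(row) for row in [HEADER] + ROWS) + "\n"
records = list(parse_ptt(io.StringIO(sample)))
```

Each record keeps the original string columns and adds integer `start` and `end` keys, which is what a distance-based clustering step needs.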




Procedure: My Approach to Reconstructing Genomic Neighborhoods

  • The idea underlying this study is that

    • Different genomes contain different, overlapping parts of evolutionarily and functionally connected gene neighborhoods

    • By generating a “Tiling Path”, the entire neighborhood can be reconstructed.

  • The genomic context of a well-known genome (e.g. E. coli) is used as a contextual framework.

    • Start from this framework, then search for groups of similar gene neighborhoods in the target genomes.

    • “Genomic context” means the pattern of a series of COGs. If a COG is not given, we can predict the function of an unknown gene based on my gene clustering data.

    • We can also identify some “Hitchhikers”. “Hitchhikers” are inserted genes that originate from different contexts/themes.
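
The tiling idea can be sketched as a greedy merge: start from a seed cluster in the reference genome and keep absorbing clusters from other genomes that overlap the growing neighborhood. This is only one possible reading of the approach; the function name, the `min_shared` threshold of 2, and the placeholder COG labels are assumptions for the example.

```python
from typing import Iterable, List, Set


def tile_neighborhood(
    seed: Iterable[str],
    clusters: List[Set[str]],
    min_shared: int = 2,
) -> Set[str]:
    """Grow a neighborhood by merging clusters that overlap it sufficiently."""
    neighborhood = set(seed)
    remaining = [set(c) for c in clusters]
    grew = True
    while grew:  # repeat until no further cluster can be attached
        grew = False
        for cluster in list(remaining):
            if len(cluster & neighborhood) >= min_shared:
                neighborhood |= cluster
                remaining.remove(cluster)
                grew = True
    return neighborhood


# Placeholder labels, invented for illustration only.
seed = {"COG_A", "COG_B", "COG_C"}
clusters = [
    {"COG_B", "COG_C", "COG_D"},  # overlaps the seed, contributes COG_D
    {"COG_C", "COG_D", "COG_E"},  # overlaps after COG_D is added
    {"COG_X", "COG_Y"},           # unrelated cluster
    {"COG_E", "COG_Z"},           # shares only one COG, below the threshold
]
neighborhood = tile_neighborhood(seed, clusters)
# neighborhood == {"COG_A", "COG_B", "COG_C", "COG_D", "COG_E"}
```

A stricter overlap threshold keeps weakly attached genes, such as possible hitchhikers, out of the reconstructed neighborhood; a looser one pulls them in.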

Tiling Path: V. Koonin et al., Nucleic Acids Research, 2002, Vol. 30, No. 10, pp. 2212-2223

Gene Neighborhoods


  • Case 1

    • Relationship between Gene Order and Phylogenetic Distance

  • Case 2

    • One theme : Typical Operon (rbs operon)

      • Reconstruct gene neighborhoods

      • Find missing components from the reconstructed gene clusters.

  • Case 3

    • Two or more themes: Functional Coupling?

      • Find genomic hitchhikers

      • Predict gene function of uncharacterized protein

      • Predict functional coupling

Case 1: Gene Order and Phylogenetic Distance

  • If the gene order of two genomes is well conserved, the sequence of homologs should appear as a line on the genome comparison diagonal plot.

  • What is the relationship between phylogenetic distance and the conservation of gene order?
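
One way to quantify "appears as a line" is to look for runs of homolog pairs that advance by one gene in both genomes. The sketch below assumes each point is (gene index in genome A, gene index in genome B) with at most one homolog per gene, for example from bidirectional best hits; the function name and the toy points are made up for the example. Plotting the same points as a scatter plot would give the diagonal plot itself.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def collinear_runs(points: List[Point], min_len: int = 2) -> List[List[Point]]:
    """Split homolog points into runs that step one gene at a time.

    A run follows the diagonal (B index rising) or the anti-diagonal
    (B index falling, i.e. an inverted segment), but never mixes both.
    """
    runs: List[List[Point]] = []
    current: List[Point] = []
    direction = 0  # 0 = undecided, +1 = diagonal, -1 = anti-diagonal
    for x, y in sorted(points):
        if current:
            dx, dy = x - current[-1][0], y - current[-1][1]
            if dx == 1 and abs(dy) == 1 and direction in (0, dy):
                current.append((x, y))
                direction = dy
                continue
            if len(current) >= min_len:
                runs.append(current)
        current, direction = [(x, y)], 0
    if len(current) >= min_len:
        runs.append(current)
    return runs


# Toy points, invented for illustration only.
points = [(1, 1), (2, 2), (3, 3), (4, 10), (5, 9), (6, 8), (7, 20)]
runs = collinear_runs(points)
run_lengths = [len(run) for run in runs]
# run_lengths == [3, 3]: one forward run and one inverted run
```

Long runs correspond to visible line segments on the plot; many short runs correspond to the fragmented picture described for most genome pairs.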

Phylogenetic Tree: V. Daubin et al., Genome Research, Vol. 12, Issue 7, pp. 1080-1090

Genome Comparison Diagonal Plot: Phylogenetically Distant Species (Z-score > 500)

Genome Comparison Diagonal Plot: Phylogenetically Close Species (Z-score > 1000)

Fragmented Gene Clusters

Case 1: Conclusion

  • Gene order in phylogenetically distant species is poorly conserved.

  • However, this does not mean that gene order is well conserved among phylogenetically close species.

    • Even for closely related species (e.g. E. coli vs. H. influenzae), gene order is completely scattered.

  • In most cases, only a small number of genes appear as a short line or cluster, which may be considered a putative operon.

  • In the next step, this possibility is investigated in more depth.

Case 2: Rbs Operon (Typical Operon)

  • Theme : Ribose transport across membrane

    • COG1869 D-ribose high-affinity transport system; membrane-associated protein

    • COG1129 ATP-binding component of D-ribose high-affinity transport system

    • COG1172 D-ribose high-affinity transport system

    • COG1879 D-ribose periplasmic binding protein

    • COG0524 ribokinase

    • COG1609 regulator for rbs operon

Case 2: Rbs Operon (Z-score > 750, Intergenic Distance = 300)

Case 2: Conclusion

  • All components are involved in ribose transport across the bacterial cell membrane.

  • In the Rbs operon system, the gene order pattern is 1869-1129-1172-1879-0524-1609.

    • 10 out of 22 genomes have this operon system.

    • Except in some cases, this gene order pattern is very well conserved.

  • So it is possible that there exists a kind of “General Contextual Framework” of gene order.
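
Checking a genome's cluster against the 1869-1129-1172-1879-0524-1609 pattern can be sketched as below. The pattern comes from the slide; the function name, the decision to accept the reversed order (a cluster on the opposite strand), and the three example clusters are assumptions for illustration and are not observed data.

```python
from typing import Dict, List, Union

RBS_PATTERN = ["COG1869", "COG1129", "COG1172", "COG1879", "COG0524", "COG1609"]


def compare_to_pattern(
    cluster: List[str],
    pattern: List[str] = RBS_PATTERN,
) -> Dict[str, Union[List[str], bool]]:
    """Report missing pattern members and whether the order is conserved."""
    present = [cog for cog in cluster if cog in pattern]
    missing = [cog for cog in pattern if cog not in cluster]
    expected = [cog for cog in pattern if cog in present]
    # Accept the pattern order or its reverse (cluster read from the other strand).
    order_conserved = present in (expected, expected[::-1])
    return {"missing": missing, "order_conserved": order_conserved}


# Hypothetical clusters, invented for illustration only.
full = compare_to_pattern(list(RBS_PATTERN))
reversed_partial = compare_to_pattern(
    ["COG1609", "COG0524", "COG1879", "COG1172", "COG1129"]
)
shuffled = compare_to_pattern(["COG1129", "COG1869", "COG1172"])
# reversed_partial["missing"] == ["COG1869"], order still conserved
# shuffled["order_conserved"] is False
```

Missing entries are the candidates for "missing components"; a cluster with duplicated COGs is reported as not conserved by this simple check and would need separate handling.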

Case 3: Functional Coupling of 2 or more themes

  • Theme 1 : Transcription

    • COG0779 Uncharacterized Conserved Protein

    • COG0195 Transcription elongation factor

    • COG2740 Predicted nucleic-acid-binding protein (transcription termination?)

  • Theme 2 : Translation

    • COG1358 Ribosomal protein S17E

    • COG0532 Translation initiation factor 2 (GTPase)

    • COG1550 Uncharacterized Conserved Protein

    • COG0858 Ribosome-binding factor A

    • COG0184 Ribosomal protein S15P/S13E

    • COG0130 tRNA Pseudouridine synthase

  • Hitchhiker?

    • COG0196 FAD Synthase (Hitchhiker?)

Case 3: Functional Coupling (Z-score > 750, Intergenic Distance = 300)

Case 3: Conclusion

  • Functional Coupling:

    • In bacteria, transcription, translation and RNA modification/degradation are coupled, and the advantages of co-regulating the corresponding genes are obvious.

    • COG0779 (Uncharacterized) is almost inseparable from COG0195 (Transcription Elongation Factor), so it is likely to be a functional partner of COG0195.

  • Hitchhiker:

    • The association of COG0196 (FAD synthase) is not as tight as the connections between the genes belonging to the themes.

  • Gene function prediction:

    • The functions of 3 genes in the AE0004092 genome can be predicted by reading their genomic context.
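
The notion of a "tight" versus loose association could be quantified, for instance, as the fraction of genomes in which two COGs share a cluster. The sketch below is one such measure under that assumption; the function name and the four toy genomes are invented and do not reproduce the results on the slides.

```python
from typing import Dict, List, Set


def association(cog_a: str, cog_b: str, genomes: Dict[str, List[Set[str]]]) -> float:
    """Fraction of genomes containing cog_a where cog_b shares a cluster with it."""
    with_a = 0
    together = 0
    for clusters in genomes.values():
        a_clusters = [cluster for cluster in clusters if cog_a in cluster]
        if not a_clusters:
            continue  # genome lacks cog_a, so it says nothing about the pair
        with_a += 1
        if any(cog_b in cluster for cluster in a_clusters):
            together += 1
    return together / with_a if with_a else 0.0


# Toy genomes, invented for illustration only.
genomes = {
    "g1": [{"COG0779", "COG0195", "COG2740", "COG0532"}],
    "g2": [{"COG0779", "COG0195", "COG0196"}],
    "g3": [{"COG0779", "COG0195"}, {"COG0196"}],
    "g4": [{"COG0195", "COG0779", "COG0858"}],
}
tight = association("COG0195", "COG0779", genomes)   # 1.0 in this toy data
loose = association("COG0195", "COG0196", genomes)   # 0.25 in this toy data
```

In this toy data, a consistently high score matches the "almost inseparable" behaviour described for COG0779 and COG0195, while a low score is what a hitchhiker would be expected to show.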


  • The Genome Comparison Diagonal Plot visualizes the sequence comparison of 2 genomes. It is a simple tool, but it gives a strong intuition for genome structure.

  • Conserved gene neighborhoods reconstructed from many genomes by the Tiling Path Method can be used to predict the functions of uncharacterized genes and functional coupling between well-characterized genes in those genomes.

  • Ultimately, we can use this method to reconstruct metabolic and functional subsystems.


  • Haifeng Zhao

    • Genome Pairwise Comparison DB

  • Scott Martin

    • Server Management and Technical Support

  • Dr. Sun Kim

    • Graduate Advisor and P.I.
