Improving Gene Function Prediction Using Gene Neighborhoods

Kwangmin Choi

Bioinformatics Program

School of Informatics

Indiana University, Bloomington, IN

Introduction: PLATCOM (A Platform for Computational Comparative Genomics)
  • PLATCOM is a system for the comparative analysis of multiple genomes.
  • PLATCOM consists of 3 components:
    • Databases of biological entities
      • e.g. fna, faa, ptt, gbk…
    • Databases of relationships among entities
      • e.g. genome-genome, protein-protein pairwise comparison
    • Mining tools over the databases
  • The web interface of PLATCOM system is located at
Background: What is an operon? (see BioCyc E. coli transcription units, biocyc.org)
  • The operon structure was discovered in 1960 by two French biologists: Jacob, F. and Monod, J. (1961) Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol., 3, 318–356.
  • An operon is a group of genes that encode functionally linked proteins. Its component genes are:
    • Adjacent (intergenic distance within 200-300 nt)
    • On the same strand (+ or -)
    • Co-expressed from one promoter.
Background: How to identify or predict operon structure?
  • When a promoter and terminator are known:
    • Gene clusters = Transcription Units
    • Classical concept of operon
  • When a promoter is not known:
    • Gene clusters = Directons
    • Hypothetical operon candidates
    • Depending on direction and proper intergenic distance (200-300 nt)
  • Computational methods have been developed to find gene clusters in bacterial genomes.

R. Overbeek et al., PNAS, 1999, Vol. 96, pp. 2896-2901

  • PCBBH: Pair of Close Bidirectional Best Hits
  • BBH: Bidirectional Best Hits
  • PCH: Pair of Close Homologs
  • COG: Clusters of Orthologous Groups
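The BBH and PCBBH definitions above can be illustrated with a minimal Python sketch. It assumes pairwise similarity scores are already available as dictionaries keyed by (query, subject) and that "closeness" within each genome has been precomputed as a set of gene pairs. The function names and input layout are hypothetical and are not taken from the PLATCOM tools or from Overbeek et al.

```python
from typing import Dict, Set, Tuple

Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query gene to its highest-scoring subject gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> Set[Tuple[str, str]]:
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


def pairs_of_close_bbh(bbh, close_a, close_b):
    """Return pairs of BBHs whose genes are close in both genomes.

    close_a / close_b are sets of gene pairs judged "close" within
    genome A / genome B (how closeness is decided is left to the caller).
    """
    partner = dict(bbh)
    result = set()
    for xa, ya in close_a:
        xb, yb = partner.get(xa), partner.get(ya)
        if xb is None or yb is None:
            continue  # one of the two genes has no BBH partner
        if (xb, yb) in close_b or (yb, xb) in close_b:
            result.add(((xa, xb), (ya, yb)))
    return result
```

Keeping the closeness test outside these functions is a deliberate choice in this sketch: the same code can then be reused with whatever distance threshold is applied when the gene clusters are generated.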

Background: Über-operon (P. Bork et al., Trends Biochem. Sci., Vol. 25, pp. 474-479)
  • Über-operon: a set of genes with close functional and regulatory contexts that tends to be conserved despite numerous rearrangements.
  • This concept focuses on the functional themes of operons, not on specific genes or gene order.
Background: Why are gene clusters conserved?
  • Certain operons, particularly those that encode subunits of multiprotein complexes (e.g. ribosomal proteins), are conserved in phylogenetically distant bacterial genomes.
  • These gene clusters might have been conserved since the last universal common ancestor. Why?
  • Selfish-operon hypothesis: horizontal transfer of an entire operon is favored by natural selection over transfer of individual genes because co-expression and co-regulation are preserved.
Background: Problems in Operon Prediction
  • Over 150 genomes have been fully sequenced to date, but the biological functions of some genes are still unknown.
  • There are only a few promoter detection algorithms, and they are not fully satisfactory.
  • In many cases, genomic data files do not provide full information about genes and their products (e.g. gene name, COG, PID).
  • Operons tend to undergo multiple rearrangements during evolution.
    • As a result, gene order at a level above the operon is poorly conserved (e.g. genes involved in de novo purine synthesis).
Background: Problems in Computational Algorithms to Predict Operons
  • Direct Signal Finding
    • Experiment-based approach
    • Transcription promoters (5’-end) and terminators (3’-end) are searched for.
    • Effective only for species whose transcription signals are well known, e.g. E. coli.
  • Combination of gene expression data, functional annotation and other experimental data
    • Literature-based approach
    • Primarily applicable to well-studied genomes such as E. coli, because data files are incomplete for other genomes.
    • In many cases, genomic data files do not provide full information about genes and their products (e.g. gene name, COG, PID).
  • As part of the PLATCOM project, an integrated whole-genome analysis system was built on the BIOKDD server.
    • A web interface for the all-to-all pairwise comparison DB and tools is also provided.
  • Several tools for multiple-genome analysis were written in Perl, and gene neighborhoods were then reconstructed from the clustering data.
    • My gene clustering algorithm was used to compensate for the shortcomings of the literature-based approach.
  • Connected gene neighborhoods were analyzed to predict gene function and functional coupling between clusters.
Materials/Tools
  • Raw Data
    • 22 genomes were chosen for this study. (14 groups)
    • Protein-Protein Pairwise Comparison Data
      • e.g.
    • PTT files from NCBI site
      • e.g.
  • Data Generated by Web Tools
    • Gene Clustering Data (based on sequence homology)
      • e.g.
    • Gene Clusters generated from PTT file (given intergenic distance)
      • e.g.
  • E. coli database for reality check
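Generating gene clusters from gene coordinates for a given intergenic distance could look roughly like the sketch below. It assumes each gene record carries a start, an end and a strand (as a PTT file does), groups consecutive genes that lie on the same strand and are separated by at most `max_gap` nucleotides, and treats the result as operon candidates only. The `Gene` type, the function name and the default of 300 nt are illustrative assumptions, not the original Perl implementation.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based start coordinate
    end: int     # 1-based end coordinate (inclusive)
    strand: str  # "+" or "-"


def cluster_genes(genes: List[Gene], max_gap: int = 300) -> List[List[Gene]]:
    """Group consecutive same-strand genes whose intergenic gap is <= max_gap."""
    clusters: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if clusters:
            previous = clusters[-1][-1]
            gap = gene.start - previous.end - 1  # negative if the genes overlap
            if gene.strand == previous.strand and gap <= max_gap:
                clusters[-1].append(gene)
                continue
        clusters.append([gene])  # strand change or large gap starts a new cluster
    return clusters


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    demo = [
        Gene("g1", 100, 400, "+"),
        Gene("g2", 450, 900, "+"),
        Gene("g3", 1500, 2000, "+"),
        Gene("g4", 2050, 2500, "-"),
        Gene("g5", 2520, 3000, "-"),
    ]
    for cluster in cluster_genes(demo):
        print([g.name for g in cluster])
```

Because no promoter or terminator information is used, single-gene clusters and multi-gene clusters alike remain hypotheses that need a reality check against a curated resource.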
Procedure: My Approach to Reconstruct Genomic Neighborhoods
  • The idea underlying this study is that
    • Different genomes contain different, overlapping parts of evolutionarily and functionally connected gene neighborhoods.
    • By generating a “Tiling Path”, the entire neighborhood can be reconstructed.
  • The genomic context of a well-known genome (e.g. E. coli) is used as a contextual framework.
    • Start from this framework, then search for a group of similar gene neighborhoods in the target genomes.
    • “Genomic context” means the pattern of a series of COGs. If the COG is not given, the function of an unknown gene can be predicted from my gene clustering data.
    • Some “Hitchhikers” can also be identified. “Hitchhikers” are inserted genes that originate from different contexts/themes.
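One way to picture the tiling-path idea is the following sketch, which grows a neighborhood from a seed (the framework genome's COG pattern) by repeatedly absorbing clusters from other genomes that overlap the neighborhood built so far. The overlap threshold `min_overlap`, the function name and the placeholder COG labels are assumptions made for illustration. The actual procedure used in this study may differ.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def tile_neighborhood(
    seed: List[str],
    clusters: Dict[str, List[str]],
    min_overlap: int = 2,
) -> Tuple[Set[str], Counter]:
    """Extend a seed COG set with overlapping clusters from other genomes.

    Returns the reconstructed neighborhood and, for each COG, the number of
    contributing sources (seed plus merged genomes) in which it was seen.
    """
    neighborhood = set(seed)
    support = Counter(set(seed))
    remaining = dict(clusters)
    grown = True
    while grown:  # repeat until no further cluster can be attached
        grown = False
        for genome, cogs in list(remaining.items()):
            if len(neighborhood & set(cogs)) >= min_overlap:
                neighborhood |= set(cogs)
                support.update(set(cogs))
                del remaining[genome]
                grown = True
    return neighborhood, support


if __name__ == "__main__":
    # Placeholder labels, not real COG identifiers.
    seed = ["COG_A", "COG_B", "COG_C"]
    clusters = {
        "genome1": ["COG_B", "COG_C", "COG_D"],
        "genome2": ["COG_C", "COG_D", "COG_E"],
        "genome3": ["COG_X", "COG_Y"],
        "genome4": ["COG_D", "COG_E", "COG_H"],
    }
    neighborhood, support = tile_neighborhood(seed, clusters)
    print(sorted(neighborhood))
    print(support.most_common())
```

In this sketch a cluster such as the one from `genome4` is attached only because earlier merges extended the neighborhood, which is the tiling behavior described above. COGs with very low support are natural candidates to inspect as possible hitchhikers, although low support alone does not show that a gene comes from a different theme.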
  • Case 1
    • Relationship between Gene Order and Phylogenetic Distance
  • Case 2
    • One theme : Typical Operon (rbs operon)
      • Reconstruct gene neighborhoods
      • Find missing components from the reconstructed gene clusters.
  • Case 3
    • Two or more themes: Functional Coupling?
      • Find genomic hitchhikers
      • Predict gene function of uncharacterized protein
      • Predict functional coupling
Case 1: Gene Order and Phylogenetic Distance
  • If the gene order of two genomes is well conserved, the sequence of homologs should appear as a line on the genome comparison diagonal plot.
  • What is the relationship between phylogenetic distance and the conservation of gene order?
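The coordinates behind such a diagonal plot can be sketched as follows: each homolog pair becomes a point (gene index in genome A, gene index in genome B), and conserved gene order shows up as runs of points along a diagonal or, for inverted segments, an anti-diagonal. The function names and the gene-index representation are assumptions for illustration. The plotting itself is omitted.

```python
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[int, int]


def dot_plot_points(
    order_a: Sequence[str],
    order_b: Sequence[str],
    homologs: Iterable[Tuple[str, str]],
) -> List[Point]:
    """Convert homolog pairs into (index in A, index in B) plot coordinates."""
    index_a = {gene: i for i, gene in enumerate(order_a)}
    index_b = {gene: j for j, gene in enumerate(order_b)}
    return sorted(
        (index_a[a], index_b[b])
        for a, b in homologs
        if a in index_a and b in index_b
    )


def diagonal_runs(points: List[Point], min_len: int = 2) -> List[List[Point]]:
    """Find maximal runs of points lying on a diagonal (+1) or anti-diagonal (-1)."""
    point_set = set(points)
    runs: List[List[Point]] = []
    for step in (1, -1):
        for i, j in points:
            if (i - 1, j - step) in point_set:
                continue  # this point continues a run that started earlier
            run = [(i, j)]
            while (run[-1][0] + 1, run[-1][1] + step) in point_set:
                run.append((run[-1][0] + 1, run[-1][1] + step))
            if len(run) >= min_len:
                runs.append(run)
    return runs
```

Requiring strictly consecutive indices is a simplification. A more tolerant version would allow small gaps so that a single inserted gene does not break an otherwise conserved run.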
Case 1: Conclusion
  • Gene order in phylogenetically distant species is poorly conserved.
  • But this observation does not mean that gene order is well conserved among phylogenetically close species.
    • Even for very close species (e.g. E. coli vs. H. influenzae), gene order is completely scattered.
  • In most cases, only a small number of genes appear as a short line or cluster, which may be considered a putative operon.
  • In the next step, this possibility is investigated in more depth.
Case 2: Rbs Operon (Typical Operon)
  • Theme : Ribose transport across membrane
    • COG1869 D-ribose high-affinity transport system; membrane-associated protein
    • COG1129 ATP-binding component of D-ribose high-affinity transport system
    • COG1172 D-ribose high-affinity transport system
    • COG1879 D-ribose periplasmic binding protein
    • COG0524 ribokinase
    • COG1609 regulator for rbs operon

Case 2: Conclusion
  • All components are involved in ribose transport across the bacterial cell membrane.
  • In the Rbs operon system, the gene order pattern is 1869-1129-1172-1879-0524-1609.
    • 10 out of 22 genomes have this operon system.
    • With some exceptions, this gene order pattern is very well conserved.
  • So it is possible that there exists a kind of “General Contextual Framework” of gene order.
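Checking whether a genome carries the gene order pattern above can be sketched as a simple contiguous-sublist search over that genome's COG sequence. The sketch also accepts the reversed pattern, on the assumption that an operon on the opposite strand would be read in the opposite order. The function name and the toy genomes are hypothetical and do not reproduce the 22-genome data set.

```python
from typing import List, Sequence

RBS_PATTERN = ["COG1869", "COG1129", "COG1172", "COG1879", "COG0524", "COG1609"]


def contains_pattern(cog_order: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if `pattern` occurs contiguously in `cog_order`, in either orientation."""
    n = len(pattern)
    targets = (list(pattern), list(reversed(pattern)))
    return any(
        list(cog_order[i:i + n]) in targets
        for i in range(len(cog_order) - n + 1)
    )


def genomes_with_pattern(genomes: dict, pattern: Sequence[str]) -> List[str]:
    """Names of genomes whose COG order contains the pattern."""
    return [name for name, order in genomes.items() if contains_pattern(order, pattern)]


if __name__ == "__main__":
    # Toy genomes for illustration; "OTHER1"/"OTHER2" are filler labels.
    swapped = RBS_PATTERN[:]
    swapped[1], swapped[2] = swapped[2], swapped[1]
    toy = {
        "genome_x": ["OTHER1"] + RBS_PATTERN,
        "genome_y": RBS_PATTERN[::-1] + ["OTHER2"],
        "genome_z": swapped,
    }
    print(genomes_with_pattern(toy, RBS_PATTERN))
```

An exact match is a strict criterion. Counting how many adjacent COG pairs of the pattern are preserved would give a graded measure for the genomes that deviate from it.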
Case 3: Functional Coupling of 2 or More Themes
  • Theme 1 : Transcription
    • COG0779 Uncharacterized Conserved Protein
    • COG0195 Transcription elongation factor
    • COG2740 Predicted nucleic-acid-binding protein (transcription termination?)
  • Theme 2 : Translation
    • COG1358 Ribosomal protein S17E
    • COG0532 Translation initiation factor 2 (GTPase)
    • COG1550 Uncharacterized Conserved Protein
    • COG0858 Ribosome-binding factor A
    • COG0184 Ribosomal protein S15P/S13E
    • COG0130 tRNA Pseudouridine synthase
  • Hitchhiker ?
    • COG0196 FAD Synthase (Hitchhiker?)

Case 3: Conclusion
  • Functional Coupling:
    • In bacteria, transcription, translation and RNA modification/degradation are coupled, and the advantages of co-regulating the corresponding genes are obvious.
    • COG0779 (Uncharacterized) is almost inseparable from COG0195 (Transcription Elongation Factor), so it is likely to be a functional partner of COG0195.
  • Hitchhiker:
    • The association of COG0196 (FAD synthase) is not as tight as the connections between the genes belonging to the theme.
  • Gene function prediction:
    • The functions of 3 genes in the AE0004092 genome can be predicted by reading the genomic context.
  • The Genome Comparison Diagonal Plot visualizes the sequence comparison of 2 genomes. It is a simple tool, but it gives strong intuition for understanding genome structure.
  • Conserved gene neighborhoods reconstructed from many genomes by the Tiling Path Method can be used to predict the functions of uncharacterized genes and functional coupling between well-characterized genes in those genomes.
  • Ultimately, this method can be used to reconstruct metabolic and functional subsystems.
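The notion of how "tight" an association is, used in the Case 3 conclusion to separate a likely functional partner from a possible hitchhiker, could be quantified in many ways. The sketch below uses one simple option: among the genomes that contain either COG, the fraction in which both fall in the same gene cluster. This score, the function name and the toy input are assumptions for illustration. They are not the measure or the data used in the study.

```python
from typing import Dict, List, Set


def association(
    cog_x: str,
    cog_y: str,
    clusters_by_genome: Dict[str, List[Set[str]]],
) -> float:
    """Fraction of genomes containing cog_x or cog_y in which both share a cluster."""
    together = 0
    either = 0
    for clusters in clusters_by_genome.values():
        has_x = any(cog_x in cluster for cluster in clusters)
        has_y = any(cog_y in cluster for cluster in clusters)
        if has_x or has_y:
            either += 1
            if any(cog_x in c and cog_y in c for c in clusters):
                together += 1
    return together / either if either else 0.0


if __name__ == "__main__":
    # Toy clusters, invented for this example; not results from the 22 genomes.
    toy = {
        "g1": [{"COG0779", "COG0195", "COG0196"}],
        "g2": [{"COG0779", "COG0195"}, {"COG0196"}],
        "g3": [{"COG0779", "COG0195"}],
        "g4": [{"COG0195", "COG0779", "COG0532"}, {"COG0196", "COG0184"}],
    }
    print(association("COG0779", "COG0195", toy))
    print(association("COG0196", "COG0195", toy))
```

With a score of this kind, a pair that is always clustered together scores 1.0, while a gene that only occasionally joins the cluster scores much lower. That is the qualitative distinction drawn between a functional partner and a hitchhiker.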
  • Haifeng Zhao
    • Genome Pairwise Comparison DB
  • Scott Martin
    • Server Management and Technical Support
  • Dr. Sun Kim
    • Graduate Advisor and P.I.