
Improving Gene Function Prediction Using Gene Neighborhoods

Kwangmin Choi

Bioinformatics Program

School of Informatics

Indiana University, Bloomington, IN

Introduction: PLATCOM (A Platform for Computational Comparative Genomics)
  • PLATCOM is a system for the comparative analysis of multiple genomes.
  • PLATCOM consists of 3 components:
    • Databases of biological entities
      • e.g. fna, faa, ptt, gbk…
    • Databases of relationships among entities
      • e.g. genome-genome, protein-protein pairwise comparison
    • Mining tools over the databases
  • The web interface of the PLATCOM system is located at http://biokdd.informatics.indiana.edu/kwchoi/platcom/
Background: What is an operon? http://biocyc.org:1555/ECOLI/new-image?object=Transcription-Units
  • The operon structure was described in 1960 by two French biologists: Jacob, F. and Monod, J. (1961) Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol., 3, 318–356.
  • An operon is a group of genes that encodes functionally linked proteins. Its component genes are:
    • Adjacent (intergenic distance of about 200-300 nt)
    • On the same strand (+ or -)
    • Co-expressed from one promoter.
Background: How to identify or predict operon structure?
  • When a promoter and terminator are known:
    • Gene clusters = Transcription Units
    • Classical concept of operon
  • When a promoter is not known:
    • Gene clusters = Directrons
    • Hypothetical operon candidates
    • Defined by shared direction (strand) and a suitable intergenic distance (200-300 nt)
  • Computational methods have been developed to find gene clusters in bacterial genomes.
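
As an illustration of the directron idea above, here is a minimal Python sketch that groups consecutive genes lying on the same strand and separated by no more than a chosen intergenic distance. The `Gene` record, the `find_directrons` name, the default `max_gap=300` and the `min_size=2` filter are assumptions made for this example; they are not taken from the PLATCOM tools.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int   # 1-based start coordinate
    end: int     # 1-based end coordinate
    strand: str  # "+" or "-"


def find_directrons(genes, max_gap=300, min_size=2):
    """Group genes into same-strand runs whose intergenic gaps are <= max_gap."""
    clusters = []
    current = []
    for gene in sorted(genes, key=lambda g: g.start):
        if current:
            prev = current[-1]
            gap = gene.start - prev.end - 1  # nucleotides between the two genes
            if gene.strand == prev.strand and gap <= max_gap:
                current.append(gene)
                continue
            clusters.append(current)
        current = [gene]
    if current:
        clusters.append(current)
    # Single genes are not operon candidates, so drop runs below min_size.
    return [c for c in clusters if len(c) >= min_size]


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    demo = [
        Gene("a", 1, 900, "+"),
        Gene("b", 950, 1800, "+"),
        Gene("c", 1900, 2500, "-"),
        Gene("d", 2600, 3300, "-"),
        Gene("e", 5000, 6000, "-"),
    ]
    for cluster in find_directrons(demo):
        print([g.name for g in cluster])
```

Each returned run is only a hypothetical operon candidate: without promoter and terminator evidence it cannot be called a transcription unit.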
PCBBH and PCH

R. Overbeek et al., PNAS, 1999, Vol. 96, pp. 2896-2901

PCBBH: Pair of Close Bidirectional Best Hits

BBH: Bidirectional Best Hits

PCH: Pair of Close Homologs

COG: Clusters of Orthologous Genes
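
To make the BBH definition concrete, the following Python sketch finds bidirectional best hits from two tables of pairwise similarity scores. The input layout (a dict keyed by `(query, subject)`), the function names and the scores are assumptions for illustration, not the format of the PLATCOM comparison database. A PCBBH would additionally require that two genes forming BBH pairs are close to each other in both genomes, which this sketch does not check.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    scores: dict {(query, subject): similarity score}
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Made-up gene names and scores, for illustration only.
    a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50,
              ("a2", "b2"): 120, ("a3", "b2"): 150}
    b_vs_a = {("b1", "a1"): 200, ("b2", "a3"): 150, ("b2", "a2"): 120}
    print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In this toy input, `a2` has `b2` as its best hit, but `b2` prefers `a3`, so only `(a1, b1)` and `(a3, b2)` qualify.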

Background: Über-operon: P. Bork et al., Trends Biochem. Sci., Vol. 25, pp. 474-479
  • Über-operon: A set of genes with close functional and regulatory contexts that tends to be conserved despite numerous rearrangements.
  • This concept focuses on the functional themes of operons, not on specific genes or gene order.
Background: Why are gene clusters conserved?
  • Certain operons, particularly those that encode subunits of multiprotein complexes (e.g. ribosomal proteins), are conserved in phylogenetically distant bacterial genomes.
  • These gene clusters might have been conserved since the last universal common ancestor. Why?
  • Selfish-operon hypothesis: Horizontal transfer of an entire operon is favored by natural selection over transfer of individual genes because co-expression and co-regulation are preserved.
Background: Problems in Operon Prediction
  • Over 150 genomes have been fully sequenced to date, but the biological functions of some genes are still unknown.
  • There are only a few promoter detection algorithms, and they are not fully satisfactory.
  • In many cases, genomic data files do not provide full information about genes and their products (e.g. gene name, COG, PID).
  • Operons tend to undergo multiple rearrangements during evolution.
    • As a result, gene order at a level above the operon is poorly conserved (e.g. genes involved in de novo purine synthesis).
Background: Problems in Computational Algorithms to Predict Operons
  • Direct Signal Finding
    • Experiment-based approach
    • Transcription promoters (5’-end) and terminators (3’-end) were searched.
    • Only effective for species whose transcription signals are well known, e.g. E. coli.
  • Combination of gene expression data, functional annotation and other experimental data.
    • Literature-based approach
    • Primarily applicable to well-studied genomes such as E. coli, because data files are incomplete for other genomes.
    • In many cases, genomic data files do not provide full information about genes and their products (e.g. gene name, COG, PID).
Procedure
  • As part of the PLATCOM project, an integrated whole-genome analysis system was built on the BIOKDD server.
    • A web interface for the all-to-all pairwise comparison DB and tools is also provided.
  • Several tools for multiple-genome analysis were written in Perl, and gene neighborhoods were then reconstructed from the clustering data.
    • My gene clustering algorithm was used to compensate for the shortcomings of the literature-based approach.
  • Connected gene neighborhoods were analyzed to predict gene function and functional coupling between clusters.
Materials/Tools
  • Raw Data
    • 22 genomes were chosen for this study. (14 groups)
    • Protein-Protein Pairwise Comparison Data
      • e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/L42023.faa.U00096.faa.cmp.txt
    • PTT files from the NCBI site
      • e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/U00096.ptt.txt
  • Data Generated by Web Tools
    • Gene Clustering Data (based on sequence homology)
      • e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/clustering_13321_23_750.txt
    • Gene Clusters generated from a PTT file (for a given intergenic distance)
      • e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/candidates_22211.htm
  • E. coli database for reality check
    • http://biocyc.org/
    • http://ecocyc.org/
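
The PTT files listed under Raw Data are the source of gene coordinates, strands and COG assignments. The Python sketch below reads such a file, assuming the commonly seen tab-separated NCBI layout (Location as `start..end`, Strand, Length, PID, Gene, Synonym, Code, COG, Product) preceded by a few header lines. The layout, the `parse_ptt` name and the sample rows are assumptions for illustration; check them against the actual files before relying on the column positions.

```python
import re

# A data row starts with a location such as "190..255".
LOCATION = re.compile(r"^(\d+)\.\.(\d+)$")


def parse_ptt(lines):
    """Parse PTT-style lines into a list of gene records (dicts).

    Header lines and malformed rows are skipped because their first
    field does not look like a "start..end" location.
    """
    genes = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        match = LOCATION.match(fields[0])
        if not match or len(fields) < 9:
            continue
        genes.append({
            "start": int(match.group(1)),
            "end": int(match.group(2)),
            "strand": fields[1],
            "pid": fields[3],
            "gene": fields[4],
            "synonym": fields[5],
            "cog": fields[7],      # "-" when no COG is assigned
            "product": fields[8],
        })
    return genes


if __name__ == "__main__":
    # Made-up rows in the assumed layout, for illustration only.
    sample = [
        "Example genome, complete genome - 1..5000\n",
        "2 proteins\n",
        "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct\n",
        "100..400\t+\t100\t111\tgeneA\tx0001\t-\tCOG1869\texample product A\n",
        "450..900\t+\t150\t112\t-\tx0002\t-\t-\thypothetical protein\n",
    ]
    for record in parse_ptt(sample):
        print(record["start"], record["end"], record["strand"], record["cog"])
```

Rows whose COG field is `-` are the genes for which the genomic context has to be inferred from clustering data rather than read from the file.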
Procedure: My Approach to Reconstruct Genomic Neighborhoods
  • The idea underlying this study is that:
    • Different genomes contain different, overlapping parts of evolutionarily and functionally connected gene neighborhoods
    • By generating a “Tiling Path”, the entire neighborhood can be reconstructed.
  • The genomic context of a well-known genome (e.g. E. coli) is used as a contextual framework.
    • Start from this framework, then search for groups of similar gene neighborhoods in the target genomes.
    • “Genomic context” means the pattern formed by a series of COGs. If a COG is not given, we can predict the function of an unknown gene based on my gene clustering data.
    • We can also identify some “Hitchhikers”. “Hitchhikers” are inserted genes that originate from different contexts/themes.
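
One way to picture the tiling-path idea is as repeated merging of overlapping gene clusters, starting from a seed neighborhood in the reference genome. The Python sketch below is an assumption about how such a merge could work, not the author's Perl implementation: clusters are sets of COG IDs, and a cluster is absorbed when it shares at least `min_shared` COGs with the neighborhood built so far. The IDs `COG_X` and `COG_Y` are hypothetical placeholders.

```python
def tile_neighborhood(seed, fragments, min_shared=2):
    """Grow a neighborhood from `seed` by absorbing overlapping fragments.

    seed: iterable of COG IDs from the reference genome
    fragments: iterable of COG ID collections from the target genomes
    """
    neighborhood = set(seed)
    remaining = [set(fragment) for fragment in fragments]
    grew = True
    while grew:  # repeat until no further fragment overlaps enough
        grew = False
        for fragment in list(remaining):
            if len(fragment & neighborhood) >= min_shared:
                neighborhood |= fragment
                remaining.remove(fragment)
                grew = True
    return neighborhood


if __name__ == "__main__":
    seed = {"COG1869", "COG1129", "COG1172"}
    fragments = [
        {"COG1129", "COG1172", "COG1879"},
        {"COG1879", "COG1172", "COG0524"},
        {"COG_X", "COG_Y"},  # unrelated cluster, never absorbed
    ]
    print(sorted(tile_neighborhood(seed, fragments)))
```

A higher `min_shared` makes the merge more conservative; with a low value, a single hitchhiker gene could link two unrelated neighborhoods.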
Results
  • Case 1
    • Relationship between Gene Order and Phylogenetic Distance
  • Case 2
    • One theme : Typical Operon (rbs operon)
      • Reconstruct gene neighborhoods
      • Find missing components from the reconstructed gene clusters.
  • Case 3
    • Two or more themes: Functional Coupling?
      • Find genomic hitchhikers
      • Predict the function of uncharacterized proteins
      • Predict functional coupling
Case 1: Gene Order and Phylogenetic Distance
  • If the gene order of two genomes is well conserved, the homologs should appear as a line on the genome comparison diagonal plot.
  • What is the relationship between phylogenetic distance and the conservation of gene order?
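
The diagonal plot can be sketched in Python as a set of points (position in genome A, position in genome B), one per homolog pair; conserved gene order then shows up as runs of points that step along the diagonal. The function names, the one-to-one `homologs` mapping and the `min_len=3` threshold are assumptions for this example. For brevity the sketch only detects forward runs, so inverted segments, which would appear on the anti-diagonal, are not reported.

```python
def dot_plot_points(order_a, order_b, homologs):
    """Return (i, j) points: gene i in genome A has its homolog at j in genome B."""
    position_b = {gene: j for j, gene in enumerate(order_b)}
    return [
        (i, position_b[homologs[gene]])
        for i, gene in enumerate(order_a)
        if gene in homologs and homologs[gene] in position_b
    ]


def collinear_runs(points, min_len=3):
    """Find runs of points that advance by exactly one gene in both genomes."""
    runs, current = [], []
    for point in sorted(points):
        if current and point[0] == current[-1][0] + 1 and point[1] == current[-1][1] + 1:
            current.append(point)
        else:
            if len(current) >= min_len:
                runs.append(current)
            current = [point]
    if len(current) >= min_len:
        runs.append(current)
    return runs


if __name__ == "__main__":
    # Made-up gene orders, for illustration only.
    order_a = ["a1", "a2", "a3", "a4", "a5"]
    order_b = ["b9", "b1", "b2", "b3", "b7"]
    homologs = {"a1": "b1", "a2": "b2", "a3": "b3", "a5": "b9"}
    points = dot_plot_points(order_a, order_b, homologs)
    print(points)
    print(collinear_runs(points))
```

Short runs like the one found here correspond to the small lines or clusters on the plot that can be treated as putative operons.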
Case 1: Conclusion
  • Gene order in phylogenetically distant species is poorly conserved.
  • However, this observation does not mean that gene order is well conserved among phylogenetically close species.
    • In the case of very close species (e.g. E. coli vs. H. influenzae), gene order is completely scattered.
  • In most cases, only a small number of genes appear as a short line or cluster, which we may consider a putative operon.
  • In the next step, this possibility is investigated in more depth.
Case 2: Rbs Operon (Typical Operon)
  • Theme : Ribose transport across membrane
    • COG1869 D-ribose high-affinity transport system; membrane-associated protein
    • COG1129 ATP-binding component of D-ribose high-affinity transport system
    • COG1172 D-ribose high-affinity transport system
    • COG1879 D-ribose periplasmic binding protein
    • COG0524 ribokinase
    • COG1609 regulator for rbs operon

http://biocyc.org:1555/ECOLI/new-image?type=OPERON&object=TU00206

Case 2: Conclusion
  • All components are involved in ribose transport across the bacterial cell membrane.
  • In the rbs operon system, the gene order pattern is 1869-1129-1172-1879-0524-1609.
    • 10 out of 22 genomes have this operon system.
    • Except in some cases, this gene order pattern is very well conserved.
  • So it is possible that there exists a kind of “General Contextual Framework” of gene order.
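
The gene order pattern above can be used as a template when scanning other genomes. The Python sketch below searches a genome's ordered COG series for the full pattern in either orientation and lists which components are absent. The function names and the use of `-` for genes without a COG are assumptions for illustration; an exact-window match is deliberately strict and would miss the rearranged cases mentioned above.

```python
RBS_PATTERN = ["COG1869", "COG1129", "COG1172", "COG1879", "COG0524", "COG1609"]


def find_pattern(cog_series, pattern=RBS_PATTERN):
    """Return (index, orientation) of the first exact match, or None.

    Orientation is "+" if the series matches the pattern as written and
    "-" if it matches the reversed pattern (operon on the other strand).
    """
    size = len(pattern)
    reverse = pattern[::-1]
    for i in range(len(cog_series) - size + 1):
        window = list(cog_series[i:i + size])
        if window == pattern:
            return i, "+"
        if window == reverse:
            return i, "-"
    return None


def missing_components(cog_series, pattern=RBS_PATTERN):
    """List pattern COGs that do not occur anywhere in the series."""
    present = set(cog_series)
    return [cog for cog in pattern if cog not in present]


if __name__ == "__main__":
    genome = ["-"] + RBS_PATTERN + ["-"]  # "-" marks a gene without a COG
    print(find_pattern(genome))
    print(find_pattern(genome[::-1]))
    print(missing_components(["COG1869", "COG1129", "COG1879"]))
```

Genomes where `find_pattern` fails but `missing_components` returns a short list are the interesting ones: they are candidates for finding the missing components elsewhere in the reconstructed neighborhood.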
Case 3: Functional Coupling of 2 or more themes
  • Theme 1 : Transcription
    • COG0779 Uncharacterized Conserved Protein
    • COG0195 Transcription elongation factor
    • COG2740 Predicted nucleic-acid-binding protein (transcription termination?)
  • Theme 2 : Translation
    • COG1358 Ribosomal protein S17E
    • COG0532 Translation initiation factor 2 (GTPase)
    • COG1550 Uncharacterized Conserved Protein
    • COG0858 Ribosome-binding factor A
    • COG0184 Ribosomal protein S15P/S13E
    • COG0130 tRNA Pseudouridine synthase
  • Hitchhiker?
    • COG0196 FAD synthase (Hitchhiker?)

http://biocyc.org:1555/ECOLI/new-image?type=OPERON&object=TU341

Case 3: Conclusion
  • Functional Coupling:
    • In bacteria, transcription, translation and RNA modification/degradation are coupled, and the advantages of co-regulating the corresponding genes are obvious.
    • COG0779 (Uncharacterized) is almost inseparable from COG0195 (Transcription Elongation Factor), so it is likely to be a functional partner of COG0195.
  • Hitchhiker:
    • The association of COG0196 (FAD synthase) is not as tight as the connections between the genes belonging to the theme.
  • Gene function prediction:
    • The functions of 3 genes in the AE0004092 genome can be predicted by reading genomic context.
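
The notion of a "tight" versus loose association can be given a rough number. The Python sketch below computes, for two COGs, the fraction of genomes containing both in which they also fall in the same gene cluster. This score, the `coupling_score` name and the toy genomes `g1` to `g4` are assumptions for illustration; the presentation does not state how tightness was measured.

```python
def coupling_score(cog_a, cog_b, clusters_by_genome):
    """Fraction of genomes with both COGs in which they share a cluster.

    clusters_by_genome: dict {genome ID: list of sets of COG IDs}
    Returns 0.0 when no genome contains both COGs.
    """
    together = 0
    both_present = 0
    for clusters in clusters_by_genome.values():
        all_cogs = set().union(*clusters)
        if cog_a in all_cogs and cog_b in all_cogs:
            both_present += 1
            if any(cog_a in cluster and cog_b in cluster for cluster in clusters):
                together += 1
    return together / both_present if both_present else 0.0


if __name__ == "__main__":
    # Made-up cluster assignments, for illustration only.
    clusters_by_genome = {
        "g1": [{"COG0779", "COG0195", "COG0532"}],
        "g2": [{"COG0779", "COG0195"}, {"COG0196"}],
        "g3": [{"COG0195", "COG0196", "COG0779"}],
        "g4": [{"COG0195"}, {"COG0196"}],
    }
    print(coupling_score("COG0779", "COG0195", clusters_by_genome))
    print(coupling_score("COG0196", "COG0195", clusters_by_genome))
```

Under this toy data the first pair always co-occurs in a cluster, while the second pair does so in only one of three genomes, mirroring the contrast between a functional partner and a possible hitchhiker.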
Conclusion
  • The Genome Comparison Diagonal Plot visualizes the sequence comparison of 2 genomes. It is a simple tool, but it gives strong intuition for understanding genome structure.
  • Conserved gene neighborhoods reconstructed from many genomes by the Tiling Path Method can be used to predict the functions of uncharacterized genes and functional coupling between well-characterized genes in those genomes.
  • Ultimately, we can use this method to reconstruct metabolic and functional subsystems.
Acknowledgements
  • Haifeng Zhao
    • Genome Pairwise Comparison DB
  • Scott Martin
    • Server Management and Technical Support
  • Dr. Sun Kim
    • Graduate Advisor and P.I.