Scale of the ‘unknown’ gene problem

Comparative genomics outline

  • Scale of the ‘unknown’ gene problem

  • Shared plant-prokaryote genes

  • Comparative genomics

    • When Blast tells you nothing….

    • The ‘guilt by association’ principle

    • ‘Two-dimensional’ gene annotation

    • SEED subsystems

  • Plant-prokaryote examples

    • Filling ‘pathway holes’ – FolQ

    • Linking new functions to known systems – COG0354


Whole genome sequencing progress

[Figure: number of genomes (y-axis, 0 to 10,000), ongoing and complete, plotted by date from Dec 1997 to Mar 2011. Source: www.genomesonline.org]

● Functional annotation of genes has nowhere near kept pace

● Functional annotations are often absent, vague, or wrong


Scale of the unknown gene problem

Orphan enzymes

  • 1437/3736 enzymes (38%) with EC numbers have no associated genes

Orphan genes

  • 20-60% of genes in any given genome have no known function or only a vague one (e.g. ‘esterase’)


Percentage of unknown proteins encoded by diverse genomes

[Figure: bar chart of percent of proteins (0 to 100) classed as known vs. unknown, with genomes grouped as Bacteria, Archaea, and Eukarya. Genomes shown: Human, Arabidopsis, Solibacter usitatus, Acidobacterium, Escherichia coli, Chlamydia trachomatis, Staphylococcus aureus, Lactobacillus casei, Synechocystis, Pyrococcus abyssi, Haloarcula marismortui]

The unknown protein problem in various groups

Data from The SEED http://theseed.uchicago.edu/


Plants & prokaryotes share many (unknown) genes

● Estimates for Arabidopsis vary – but all are many thousands

● Functions of most shared genes are metabolic

Source of genes     Number of genes     % of genome
Cyanobacteria       5470                21.0
Proteobacteria      1170                4.6
Gram+ bacteria      2280                9.1
Other bacteria      1160                4.6
Archaea             1090                4.4
Total               11170               43.4

From de Crecy-Lagard & Hanson Trends Microbiol 15: 563 (2007)

● Shared genes are identifiably from various groups

● Plants are conglomerates of microbial metabolic genes

● Many opportunities for comparative genomics


The power of comparative genomics

● Suppose you have an unknown plant protein:

● BlastP search gives various prokaryote hits

● None of them have clear functions

→ Dead end?

● No! This is the beginning of comparative genomics

● It predicts functions via the ‘guilt by association’ principle

● Genes of related function are associated in various ways

● e.g. enzymes in a pathway, proteins in a complex

● Whatever a gene’s associates do, it probably does too
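The ‘guilt by association’ idea can be sketched as a simple vote: each associate with a known function votes for that function. This is only a toy illustration, not the method used in the talk; the gene names, functions, and the `predict_function` helper are all hypothetical.

```python
from collections import Counter


def predict_function(associates, known_functions):
    """Rank candidate functions for an unknown gene.

    Each associate (neighbor, fusion partner, co-expressed gene, ...) that
    has a known function contributes one vote for that function.
    """
    votes = Counter()
    for gene in associates:
        function = known_functions.get(gene)
        if function is not None:
            votes[function] += 1
    # Most frequently seen function first.
    return votes.most_common()


# Hypothetical toy data: names are illustrative, not real annotations.
known_functions = {
    "geneA": "folate synthesis",
    "geneB": "folate synthesis",
    "geneC": "transport",
}
associates_of_unknown = ["geneA", "geneB", "geneC", "geneD"]  # geneD is itself unknown

ranking = predict_function(associates_of_unknown, known_functions)
print(ranking)
```

A real analysis would presumably weight the different kinds of evidence rather than count them equally; the equal weighting here is a simplifying assumption.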


Association evidence

[Figure: association evidence linking Genes W, X, Y, and Z. Evidence types shown: gene clustering, gene fusion (Orf X and Orf Y as Orf XY), shared regulatory sites, phylogenetic occurrence, co-expression, protein-protein interactions, organelle proteomes, essentiality & other phenome data, and structures. The figure divides these into genomic evidence and post-genomic evidence; together they lead to predictions and then to testing (genetics, biochemistry)]


Two-dimensional gene annotation

  • ‘Dimensions’ are:

    • Molecular function (e.g., an enzyme activity with EC no.)

    • Functional context (e.g., other enzymes of a pathway)

  • ‘2-Dimensions good, 1-dimension bad’

    • Even an EC no. function may be wrong if pathway not there

    • Pathway context may be wrong if certain enzymes missing

  • GenBank etc annotations are 1-dimensional (mol. function)
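A minimal sketch of why the second dimension matters: an annotation carries both a molecular function and a pathway context, and the context can be checked against the rest of the genome. The `Annotation` class, the `context_support` helper, and the step names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    gene: str
    molecular_function: str  # dimension 1, e.g. an enzyme activity
    pathway: str             # dimension 2, the functional context


def context_support(annotation, genome_functions, pathway_definitions):
    """Fraction of the pathway's other steps that the genome also encodes.

    A low value suggests the molecular-function call may be wrong,
    because the pathway it belongs to is not there.
    """
    steps = pathway_definitions[annotation.pathway]
    others = [s for s in steps if s != annotation.molecular_function]
    if not others:
        return 1.0
    found = sum(1 for s in others if s in genome_functions)
    return found / len(others)


# Hypothetical toy pathway with four steps.
pathway_definitions = {"toy pathway": ["step1", "step2", "step3", "step4"]}
call = Annotation(gene="gene_0001", molecular_function="step2", pathway="toy pathway")

well_supported = context_support(call, {"step1", "step2", "step3"}, pathway_definitions)
unsupported = context_support(call, {"step2"}, pathway_definitions)
print(well_supported, unsupported)
```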


SEED subsystems

[Figure: folate biosynthesis subsystem spreadsheet, with a pathway hole marked]

  • Subsystems (SSs) capture both annotation dimensions

  • An SS is a set of molecular functions (e.g. enzymes) that together implement a specific biological process (e.g. a pathway)

  • SSs cover many genomes and have the form of a spreadsheet:

    • Columns are molecular functions

    • Rows are genomes

    • Each cell identifies the genes for proteins with the specific molecular functional role in the designated genome
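The spreadsheet layout maps naturally onto a nested dictionary (rows are genomes, columns are functional roles, cells are gene lists), and a ‘pathway hole’ is then an empty cell in a row that is otherwise populated. The sketch below is not the SEED API; the genome names, gene identifiers, and `find_pathway_holes` helper are hypothetical, and only the role names are taken from the folate pathway in this talk.

```python
ROLES = ["FolE", "FolQ", "FolB", "FolK", "FolP", "FolC", "FolA"]


def find_pathway_holes(subsystem, roles):
    """Return {genome: [missing roles]} for genomes with a partial pathway.

    Genomes with every role filled have no holes; genomes with no role
    filled are treated as lacking the pathway and are skipped.
    """
    holes = {}
    for genome, row in subsystem.items():
        missing = [role for role in roles if not row.get(role)]
        if missing and len(missing) < len(roles):
            holes[genome] = missing
    return holes


# Hypothetical rows: each cell lists the genes assigned to that role.
subsystem = {
    "genome_1": {"FolE": ["g1_0001"], "FolQ": [], "FolB": ["g1_0003"],
                 "FolK": ["g1_0004"], "FolP": ["g1_0005"],
                 "FolC": ["g1_0006"], "FolA": ["g1_0007"]},
    "genome_2": {role: ["g2_" + role] for role in ROLES},  # complete pathway
    "genome_3": {role: [] for role in ROLES},              # pathway absent
}

holes = find_pathway_holes(subsystem, ROLES)
print(holes)
```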


Plant – prokaryote examples

  • Prokaryote association evidence is mainly genomic

  • Plant association evidence is mainly post-genomic

  • Post-genomic evidence is noisier but very useful

  • Superb plant post-genomic resources:

    • Microarrays, RNAseq (organ- and environment-specific)

    • Organellar targeting prediction, proteomics (location can rule out function)

    • Phenome databases (chlorosis, lethality can support function)

    • Huge EST databases

    • Vast plant metabolism bibliome


FolQ – Filling a pathway hole

[Figure: folate synthesis pathway. GTP → (FolE) → DHN-P3 → (FolQ) → DHN-P → ([P-ase]) → DHN → (FolB) → HMDHP → (FolK) → HMDHP-P2 → (FolP, with pABA) → DHP → (FolC, with Glu) → DHF → (FolA) → THF. Side branch: Chorismate → (PabAB) → ADC → (PabC) → pABA]

[Figure: Lactococcus lactis folate gene cluster, genes shown: folEK, folP, ylgG, folC]

  • FolQ universally missing (prokaryotes, plants, fungi, protists)

  • Missing step known to be a pyrophosphohydrolase, ~17 kDa

    • Search genomes for small hydrolase clustered with fol genes

    • YlgG candidate in Firmicutes, Nudix hydrolase family, 19 kDa

  • YlgG has a plant homolog – At1g68760
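The search step above (a small hydrolase clustered with fol genes) can be sketched as a filter over an ordered gene list. This is an illustrative sketch, not the authors' procedure: the `small_hydrolases_near` helper, the window and mass cutoffs, the filler genes, and all masses other than the 19 kDa given for YlgG are hypothetical.

```python
def small_hydrolases_near(genes, anchor_prefix="fol", window=2, max_kda=25.0):
    """Names of small hydrolase genes lying within `window` positions of an anchor gene.

    `genes` is a list of dicts in chromosomal order, each with
    'name', 'family', and 'mass_kda' keys.
    """
    anchor_positions = [i for i, g in enumerate(genes)
                        if g["name"].startswith(anchor_prefix)]
    hits = []
    for i, gene in enumerate(genes):
        if gene["name"].startswith(anchor_prefix):
            continue  # anchors are not candidates
        if "hydrolase" not in gene["family"].lower():
            continue
        if gene["mass_kda"] > max_kda:
            continue
        if any(abs(i - j) <= window for j in anchor_positions):
            hits.append(gene["name"])
    return hits


# Toy neighborhood loosely modeled on the gene cluster listed above.
genes = [
    {"name": "geneU", "family": "transporter", "mass_kda": 45.0},
    {"name": "folEK", "family": "folate synthesis", "mass_kda": 40.0},
    {"name": "folP", "family": "folate synthesis", "mass_kda": 30.0},
    {"name": "ylgG", "family": "Nudix hydrolase", "mass_kda": 19.0},
    {"name": "folC", "family": "folate synthesis", "mass_kda": 48.0},
    {"name": "geneW", "family": "hydrolase", "mass_kda": 60.0},   # nearby but too large
    {"name": "geneX", "family": "kinase", "mass_kda": 33.0},
    {"name": "geneY", "family": "regulator", "mass_kda": 22.0},
    {"name": "geneV", "family": "hydrolase", "mass_kda": 18.0},   # small but too far away
]

candidates = small_hydrolases_near(genes)
print(candidates)
```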


FolQ – Experimental tests

[Figure: folate synthesis pathway with two data panels. One panel plots product formation (nmol/assay) against minutes (2, 4, 6) for YlgG and At1g68760 acting on DHN-P3, with DHNP, Pi, and PPi as the measured products; caption: Recombinant proteins release DHN-P + PPi. The other panel shows fluorescence of DHN-P3 in WT and KO]

  • YlgG & At1g68760 act on DHN-P3

  • ylgG KO accumulates DHN-P3


COG0354 – Linking a new function to a known system

COG0354 – A folate protein for Fe/S cluster repair in oxidative stress

[Figure: phylogenetic tree of COG0354 proteins from Rickettsia, Ehrlichia, Anaplasma, Bradyrhizobium, Burkholderia, Neisseria, Xanthomonas, Psychrobacter, E. coli, Shewanella, Thermus, Deinococcus, Synechocystis, Synechococcus, Nostoc, Haloarcula, Natronomonas, Corynebacterium, Streptomyces, Solibacter, Blastopirellula, Pirellula, Mouse, Fly, Yeast, Leishmania, and Arabidopsis (At4g12130, At1g60990), shown with a separate folate-dependent GcvT clade (Yeast, Mouse, Arabidopsis, and Rice GcvT)]

  • In all kingdoms of life

    • Bacteria

    • Archaea

    • Plants

    • Animals

    • Fungi

  • 2 plant proteins

    • 1 related to rickettsias (mitochondria)

    • 1 related to cyanobacteria (plastids)

  • Homolog of GcvT protein

    • But clearly a distinct clade


COG0354 – Comparative genomics & post-genomic data

  • Co-expression in Arabidopsis

    • Mitochondrial COG0354 expression correlates with frataxin (Fe/S assembly)

    • And with ferritin 2 (Fe storage)

[Figure: expression of mitochondrial COG0354, mitochondrial frataxin, and ferritin 2 across a developmental series. Source: Arabidopsis Transcriptome DB (Max Planck Institute, Golm)]
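Co-expression is commonly summarized with a correlation coefficient between expression profiles. The sketch below computes a Pearson correlation in plain Python; it is not the database's own method, and the expression values are invented for illustration rather than taken from the Arabidopsis data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical expression levels across five developmental stages.
query_gene = [1.0, 2.0, 3.0, 4.0, 5.0]
co_expressed = [2.1, 3.9, 6.2, 8.0, 9.8]   # rises with the query gene
unrelated = [5.0, 1.0, 4.0, 2.0, 3.0]      # no clear relationship

r_co = pearson(query_gene, co_expressed)
r_unrelated = pearson(query_gene, unrelated)
print(round(r_co, 3), round(r_unrelated, 3))
```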


COG0354 – Comparative genomics & post-genomic data

  • Co-expression in Arabidopsis

  • Clusters with Fe/S proteins

[Figure: gene clusters in which COG0354 (labeled 0354) lies next to genes for an Fe/S protein or Fe/S partner. Genes shown for each cluster are listed below]

● Nif cluster in Methylococcus capsulatus: 0354, nifQ, fd, nifX, nifN, nifE, fd, nifK, nifD, nifH

● Suf cluster in Rubrobacter xylanophilus: 0354, sufC, sufB, sufS, sufD, thiC

● Sdh operon in Stenotrophomonas maltophilia: 0354, sdhC, sdhD, sdhB, sdhA

● NAD synthesis cluster in Pelagibacter ubique: 0354, nadA, nadC

● MiaB (Radical SAM) in Buchnera aphidicola: 0354, MiaB


COG0354 – Comparative genomics & post-genomic data

  • Co-expression in Arabidopsis

  • Clusters with Fe/S proteins

  • Only occurs if IscA is present

    • IscA proteins are scaffolds in Fe/S cluster assembly

[Figure: phylogenetic occurrence (gene present / gene absent) of COG0354 and IscA. Bacteria: Clostridiales, Firmicutes, Mollicutes, Lactobacillales, Staphylococcaceae, Listeriaceae, Bacillaceae, Fusobacteria, Actinobacteria, Bifidobacterium, Cyanobacteria, Acidobacteria, Campylobacterales, δ/ε-Proteobacteria, Bdellovibrionales, Desulfobacterales, Desulfovibrionales, Desulfuromonadales, Myxococcales, Syntrophobacterales, α-Proteobacteria, β-Proteobacteria, γ-Proteobacteria, Magnetococcus, Spirochaetes, Planctomycetes, Chlamydiales, Chlorobi, Bacteroidetes, Bacteroidales, Flavobacteria, Sphingobacteria, Deinococcus/Thermus, Chloroflexi, Thermotogae. Archaea: Nanoarchaeota, Crenarchaeota, Euryarchaeota, Archaeoglobi, Halobacteria, Methanobacteria, Methanococci, Methanomicrobia, Methanopyri, Thermococci, Thermoplasmata]
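The ‘only occurs if IscA is present’ observation is a one-way dependency in a presence/absence table, and it can be checked by looking for counterexamples. This is a minimal sketch under assumed data: the genome names, the presence sets, and the `genomes_violating` helper are hypothetical.

```python
def genomes_violating(profile, gene, partner):
    """Genomes that carry `gene` but lack `partner`.

    An empty result is consistent with `gene` occurring only where
    `partner` is present (it does not prove a functional link).
    """
    return [genome for genome, present in profile.items()
            if gene in present and partner not in present]


# Hypothetical presence/absence profile: genome -> set of genes present.
profile = {
    "genome_1": {"COG0354", "IscA"},
    "genome_2": {"IscA"},            # partner alone is allowed
    "genome_3": set(),               # neither gene
    "genome_4": {"COG0354", "IscA"},
}

forward = genomes_violating(profile, "COG0354", "IscA")
reverse = genomes_violating(profile, "IscA", "COG0354")
print(forward, reverse)
```

The asymmetry is the point: the forward check finds no counterexamples, while the reverse check does, so in this toy table the dependency runs only one way.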


COG0354 – Comparative genomics & post-genomic data

  • Co-expression in Arabidopsis

  • Clusters with Fe/S proteins

  • Only occurs if IscA is present

  • Associated with aerobic lifestyle


COG0354 – Comparative genomics & post-genomic data

  • Co-expression in Arabidopsis

  • Clusters with Fe/S proteins

  • Only occurs if IscA is present

  • Associated with aerobic lifestyle

  • H2O2-induced in E. coli

  • High-throughput screens

    • Essentiality & phenomics

    • Proteomics

● Essential gene in:

– Mycobacterium tuberculosis

– Haemophilus influenzae

– Pseudomonas aeruginosa

● Important gene in:

– E. coli (slow growth)

– Yeast (petite)

● Plant proteins both expressed

● Cyano-like protein in plastids

● E. coli protein has folate site


COG0354 – Predictions & Experimental Validation

COG0354 PREDICTIONS

● Is a folate-dependent enzyme

  • Folate mutations abolish activity

● Combats oxidative stress

  • Mutant is oxidative stress-sensitive

● Helps make/repair Fe/S clusters

  • Mutant has many Fe/S enzyme defects

● Function is ancient & ubiquitous (like Fe/S proteins themselves)

  • Complementation by all kingdoms

[Figure: complementation tests on LB + plumbagin (oxidative stress). Panel titles: Controls; Plant & mammal; Fungi, protist, Archaea. Labels: E. coli, Vector, Plant C, Plant M, Mammal, Yeast, Protist, Archaea]


The power of comparative genomics

Hypothesis that connects and unifies observations

“The facts are known but they are insulated and unconnected…. The pearls are there but they will not hang together until some one provides the string”

William Whewell (1794-1866)

English Scientist, Philosopher, Anglican priest

An early influence on Charles Darwin

Coined the term “scientist”

