Analysis of Genomes & Transcriptomes in terms of the Occurrence of Parts and Features:
Download
1 / 56

Genomes highlight the Finiteness of the “Parts” in Biology - PowerPoint PPT Presentation


  • 66 Views
  • Uploaded on

Analysis of Genomes & Transcriptomes in terms of the Occurrence of Parts and Features: Surveys of a Finite Parts List. Mark Gerstein Molecular Biophysics & Biochemistry and Computer Science, Yale University H Hegyi, J Lin, B Stenger, P Harrison, N Echols, J Qian, A Drawid, D Greenbaum, R Jansen

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Genomes highlight the Finiteness of the “Parts” in Biology' - payton


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Analysis of Genomes & Transcriptomes in terms of the Occurrence of Parts and Features:Surveys of a Finite Parts List

Mark GersteinMolecular Biophysics & Biochemistry andComputer Science, Yale UniversityH Hegyi, J Lin, B Stenger, P Harrison, N Echols, J Qian, A Drawid, D Greenbaum, R Jansen

Transcriptome 2000, Paris8 November 2000


1995 Occurrence of Parts and Features:

Genomes highlight the Finitenessof the “Parts” in Biology

  • Bacteria, 1.6 Mb, ~1600 genes [Science269: 496]

1997

  • Eukaryote, 13 Mb, ~6K genes [Nature 387: 1]

1998

  • Animal, ~100 Mb, ~20K genes [Science282: 1945]

2000?

  • Human, ~3 Gb, ~100K genes [???]

‘98 spoof

real thing, Apr ‘00


Simplifying the Complexity of Genomes: Occurrence of Parts and Features:Global Surveys of aFinite Set of PartsfromMany Perspectives

~100000 genes

~1000 folds

(human)

(T. pallidum)

~1000 genes

Same logic for sequence families, blocks, orthologs, motifs, pathways, functions....

Functions picture from www.fruitfly.org/~suzi (Ashburner); Pathways picture from, ecocyc.pangeasystems.com/ecocyc (Karp, Riley). Related resources: COGS, ProDom, Pfam, Blocks, Domo, WIT, CATH, Scop....


A parts list approach to bike maintenance
A Parts List Approach to Bike Maintenance Occurrence of Parts and Features:


A parts list approach to bike maintenance1
A Parts List Approach to Bike Maintenance Occurrence of Parts and Features:

How many roles can these play?

How flexible and adaptable are they mechanically?

What are the shared parts (bolt, nut, washer, spring, bearing), unique parts (cogs, levers)? What are the common parts -- types of parts (nuts & washers)?

Where are the parts located?

Which parts interact?


Analysis of genomes transcriptomes in terms of the occurrence of parts features

1 Using Parts to Interpret Genomes. Occurrence of Parts and Features:Shared and/or unique parts. Venn Diagrams, Fold tree with all-b diff. Ortholog tree. Top-10 folds.

2 Using Parts to Interpret Pseudogenomes.In worm, top Y-folds (DNAse, hydrolase) v top-folds (Ig). chr. IV enriched, dead and dying families (90YG v 1G)

3 Using Parts to Interpret Transcriptomes: Expression & Structure.Top-10 parts in mRNA. Enriched in transcriptome: ab folds, energy, synthesis,TIM fold, VGA. Depleted: TMs, transport, transcription, Leu-zip, NS. Compare with prot. abundance.

4 Expression & Localization.Enriched : Cytoplasmic. Depleted: Nuclear. Bayesian localizer

5 Expression & Function. Expression relates to structure & localization but to function, globally? P-value formalism. Weak relation to protein-protein interactions.

5(G)1(T)

Analysis of Genomes & Transcriptomes in terms of the Occurrence of Parts & Features

7(G)15(T)

1(G)2(Y)

bioinfo.mbb.yale.edu

H Hegyi, J Lin, B Stenger,P Harrison, N Echols, R Jansen, A Drawid,J Qian, D Greenbaum, M Snyder


Integratedanalysis system x ref parts with genomes

Folds: Occurrence of Parts and Features:scop+automaticOrthologs: COGs“Families”: homebrew, ProtoMap

finding parts in genome sequencesblast, y-blast, fasta, TM, low-complexity, &c (Altschul, Pearson, Wooton)

IntegratedAnalysis System:X-ref Parts withGenomes

One approach of many...Much previous work on Sequence & Structure ClusteringCATH, Blocks, FSSP, Interpro, eMotif, Prosite, CDD, Pfam, Prints, VAST, TOGA…

Remington, Matthews ‘80; Taylor, Orengo ‘89, ‘94; Thornton,CATH; Artymiuk, Rice, Willett ‘89; Sali, Blundell, ‘90; Vriend, Sander ‘91; Russell, Barton ‘92; Holm, Sander ‘93+ (FSSP); Godzik, Skolnick ‘94; Gibrat, Bryant ‘96 (VAST); F Cohen, ‘96; Feng, Sippl ‘96; G Cohen ‘97; Singh & Brutlag, ‘98

part occurrence profiles


Shared folds

worm Occurrence of Parts and Features:

35

21

42

149

43

8

16

yeast

E. coli

Shared Folds

of 339


Cluster trees grouping initial genomes on basis of shared folds

20 Occurrence of Parts and Features:

10

30

Cluster Trees Grouping Initial Genomes on Basis of Shared Folds

“Classic” Tree

Fold Tree

D=S/T

S = # shared folds

20 Genomes

D = shared fold dist. betw. 2 genomes

T= total # folds in both

D=10/(20+10+30)


Distribution of folds in various classes
Distribution of Folds in Various Classes Occurrence of Parts and Features:

FoldTree

Unusual distribution of all-beta folds


Compare with ortholog occurrence trees part ortholog v fold
Compare with Ortholog Occurrence Trees Occurrence of Parts and Features:(Part = ortholog v fold)

Ortholog

Tree

Fold Tree

(based on COGs scheme of Koonin & Lipman, similar approaches by Dujon, Bork, &c.)


Common folds in genome varies betw genomes
Common Folds in Genome, Varies Betw. Genomes Occurrence of Parts and Features:

Depends on comparison method, DB, sfams v folds, &c (new top superfamilies via y-Blast, Intersection of top-10 to get shared and common)


Common shared folds bab structure

336 Occurrence of Parts and Features:: 42

Common, Shared Folds: bab structure

All share a/b structure with repeated R.H. bab units connecting adjacent strands or nearly so (18+4+2 of 24)

HI, MJ, SC vs scop 1.32

P-loop hydrolase

Flavodoxin like

Rossmann Fold

Thiamin Binding

TIM-barrel


Analysis of genomes transcriptomes in terms of the occurrence of parts features1

1 Using Parts to Interpret Genomes. Occurrence of Parts and Features:Shared and/or unique parts. Venn Diagrams, Fold tree with all-b diff. Ortholog tree. Top-10 folds.

2 Using Parts to Interpret Pseudogenomes.In worm, top Y-folds (DNAse, hydrolase) v top-folds (Ig). chr. IV enriched, dead and dying families (90YG v 1G)

3 Using Parts to Interpret Transcriptomes: Expression & Structure.Top-10 parts in mRNA. Enriched in transcriptome: ab folds, energy, synthesis,TIM fold, VGA. Depleted: TMs, transport, transcription, Leu-zip, NS. Compare with prot. abundance.

4 Expression & Localization.Enriched : Cytoplasmic. Depleted: Nuclear. Bayesian localizer

5 Expression & Function. Expression relates to structure & localization but to function, globally? P-value formalism. Weak relation to protein-protein interactions.

5(G)1(T)

Analysis of Genomes & Transcriptomes in terms of the Occurrence of Parts & Features

7(G)15(T)

1(G)2(Y)

bioinfo.mbb.yale.edu

H Hegyi, J Lin, B Stenger,P Harrison, N Echols, R Jansen, A Drawid,J Qian, D Greenbaum, M Snyder


Pseudogenomics surveying dead parts
Pseudogenomics: Surveying Occurrence of Parts and Features:“Dead” Parts

Example of a potential YG with frameshift in mid-domain

(Our def’n: YG = obvious homolog to known protein with frameshift or stop in mid-domain)


Folds in pseudogenes
Folds in Pseudogenes Occurrence of Parts and Features:

YG identification pipeline to Summary of Pseudogenes in worm

G=19K

GE=8K

YG=4K (2K)

Example of a potential YG with frameshift in mid-domain


Most common worm pseudofolds 1
Most Common Worm “Pseudofolds” #1 Occurrence of Parts and Features:


Most common worm pseudofolds 2
Most Common Worm “Pseudofolds” #2 Occurrence of Parts and Features:


Pseudogene distribution on chomo somes

Y Occurrence of Parts and Features:G--G

29%

(max)

Pseudogene Distribution on Chomo-somes

YG--G

16%

(min)

~50% YG in terminal 3Mb vs ~30% G


Decayed lines of genes
Decayed Lines of Genes? Occurrence of Parts and Features:

Num. genes in family

D1022.6 has 90 dead fragments of itself – a disused line of chemo-receptors?

1

90

Num. pseudogenes in family


Completely dead families

0 Occurrence of Parts and Features:

1

Completely Dead Families


Analysis of genomes transcriptomes in terms of the occurrence of parts features2

1 Using Parts to Interpret Genomes. Occurrence of Parts and Features:Shared and/or unique parts. Venn Diagrams, Fold tree with all-b diff. Ortholog tree. Top-10 folds.

2 Using Parts to Interpret Pseudogenomes.In worm, top Y-folds (DNAse, hydrolase) v top-folds (Ig). chr. IV enriched, dead and dying families (90YG v 1G)

3 Using Parts to Interpret Transcriptomes: Expression & Structure.Top-10 parts in mRNA. Enriched in transcriptome: ab folds, energy, synthesis,TIM fold, VGA. Depleted: TMs, transport, transcription, Leu-zip, NS. Compare with prot. abundance.

4 Expression & Localization.Enriched : Cytoplasmic. Depleted: Nuclear. Bayesian localizer

5 Expression & Function. Expression relates to structure & localization but to function, globally? P-value formalism. Weak relation to protein-protein interactions.

5(G)1(T)

Analysis of Genomes & Transcriptomes in terms of the Occurrence of Parts & Features

7(G)15(T)

1(G)2(Y)

bioinfo.mbb.yale.edu

H Hegyi, J Lin, B Stenger,P Harrison, N Echols, R Jansen, A Drawid,J Qian, D Greenbaum, M Snyder


Gene expression datasets the yeast transcriptome
Gene Expression Datasets: the Yeast Transcriptome Occurrence of Parts and Features:

Yeast Expression Data: 6000 levels! Integrated Gene Expression Analysis System: X-ref. Parts and Features against expression data...

Young, Church…

Affymetrix GeneChips Abs. Exp.

Also: SAGE (mRNA); 2D gels for Protein Abundance (Aebersold, Futcher)

Brown, marrays, Rel. Exp. over Timecourse

Snyder, Transposons, Protein Abundance


Common parts the transcriptome
Common Parts: the Transcriptome Occurrence of Parts and Features:

51

+

-

-

+

+

+

+

-

Leu-zipper

-

18

-

-

118

715


Folds that change a lot in frequency are not common

Common Folds Occurrence of Parts and Features:

Freq.

Change

Foldsthatchangea lotinfrequencyarenotcommon

Changing Folds


Most versatile folds relation to interactions
Most Versatile Folds – Relation to Interactions Occurrence of Parts and Features:

Similar results Martin et al. (1998)

The number of interactions for each fold = the number of other folds it is found to contact in the PDB


Composition of transcriptome in terms of functional classes
Composition of Transcriptome in terms of Functional Classes Occurrence of Parts and Features:

Prot. Syn. 

cell structure energy 

unclassified 

transcription 

transport 

signaling 

Transcriptome Enrichment

Functional Category(MIPS)

ab

TMs


Composition of genome vs transcriptome

Genome Occurrence of Parts and Features:Composition

TranscriptomeComposition

Composition of Genome vs. Transcriptome

NS 

Transcriptome Enrichment

VGA 

Amino Acid


Relating the transcriptome to cellular protein abundance translatome
Relating the Transcriptome to Cellular Protein Abundance (Translatome)

What is Proteome? Protein complement in genome or cellular protein population?

2D-gel electrophoresis Data sets: Futcher (71), Aebersold (156), scaled set with 171 proteinsNew effect is dealing with gene selection bias


Amino acid enrichment
Amino Acid Enrichment (Translatome)

Protein

mRNA

Simple story is translatome is enriched in same way as transcriptome


Amino acid enrichment complexities
Amino Acid Enrichment – (Translatome)Complexities

mRNA

Protein


Analysis of genomes transcriptomes in terms of the occurrence of parts features3

1 Using Parts to Interpret Genomes. (Translatome)Shared and/or unique parts. Venn Diagrams, Fold tree with all-b diff. Ortholog tree. Top-10 folds.

2 Using Parts to Interpret Pseudogenomes.In worm, top Y-folds (DNAse, hydrolase) v top-folds (Ig). chr. IV enriched, dead and dying families (90YG v 1G)

3 Using Parts to Interpret Transcriptomes: Expression & Structure.Top-10 parts in mRNA. Enriched in transcriptome: ab folds, energy, synthesis,TIM fold, VGA. Depleted: TMs, transport, transcription, Leu-zip, NS. Compare with prot. abundance.

4 Expression & Localization.Enriched : Cytoplasmic. Depleted: Nuclear. Bayesian localizer

5 Expression & Function. Expression relates to structure & localization but to function, globally? P-value formalism. Weak relation to protein-protein interactions.

5(G)1(T)

Analysis of Genomes & Transcriptomes in terms of the Occurrence of Parts & Features

7(G)15(T)

1(G)2(Y)

bioinfo.mbb.yale.edu

H Hegyi, J Lin, B Stenger,P Harrison, N Echols, R Jansen, A Drawid,J Qian, D Greenbaum, M Snyder


Composition of transcriptome in terms of broad structural classes
Composition of Transcriptome in terms of Broad Structural Classes

Membrane (TM) Protein 

Transcriptome Enrichment

ab protein 

Fold Class

of Soluble Proteins

# TM helices in yeast protein




6000 yeast genes with expression levels but only 2000 with localization
~6000 yeast genes Classeswith expression levelsbut only ~2000 with localization….


Bayesian system for localizing proteins

Feature Vects ClassesP(feature|loc)

State Vects

Bayesian System for Localizing Proteins

loc=

18 Features: Expression Level (absolute and fluctuations), signal seq., KDEL, NLS, Essential?, aa composition

Represent localization of each protein by the state vector P(loc) and each feature by the feature vector P(feature|loc). Use Bayes rule to update.


Bayesian system for localizing proteins1

Feature Vects ClassesP(feature|loc)

State Vects

Bayesian System for Localizing Proteins

loc=

Represent localization of each protein by the state vector P(loc) and each feature by the feature vector P(feature|loc). Use Bayes rule to update.


Results on testing data
Results on Testing Data Classes

Individual proteins: 75% with cross-validation

Carefully clean training dataset to avoid circular logic

Testing, training data, Priors: ~2000 proteins from

Swiss-Prot Master List

Also, YPD, MIPS, Snyder Lab


Results on testing data 2
Results on Testing Data #2 Classes

Compartment Populations. Like QM, directly sum state vectors to get population. Gives 96% pop. similarity.


Extrapolation to compartment populations of whole yeast genome 4000 predicted 2000 known
Extrapolation to Compartment Populations of Whole Yeast Genome: ~4000 predicted + ~2000 known

ytoplasmic

Mem.

uclear


Analysis of genomes transcriptomes in terms of the occurrence of parts features4

1 Using Parts to Interpret Genomes. Genome: Shared and/or unique parts. Venn Diagrams, Fold tree with all-b diff. Ortholog tree. Top-10 folds.

2 Using Parts to Interpret Pseudogenomes.In worm, top Y-folds (DNAse, hydrolase) v top-folds (Ig). chr. IV enriched, dead and dying families (90YG v 1G)

3 Using Parts to Interpret Transcriptomes: Expression & Structure.Top-10 parts in mRNA. Enriched in transcriptome: ab folds, energy, synthesis,TIM fold, VGA. Depleted: TMs, transport, transcription, Leu-zip, NS. Compare with prot. abundance.

4 Expression & Localization.Enriched : Cytoplasmic. Depleted: Nuclear. Bayesian localizer

5 Expression & Function. Expression relates to structure & localization but to function, globally? P-value formalism. Weak relation to protein-protein interactions.

5(G)1(T)

Analysis of Genomes & Transcriptomes in terms of the Occurrence of Parts & Features

7(G)15(T)

1(G)2(Y)

bioinfo.mbb.yale.edu

H Hegyi, J Lin, B Stenger,P Harrison, N Echols, R Jansen, A Drawid,J Qian, D Greenbaum, M Snyder


Do expression clusters relate to protein function
Do Expression Clusters Relate to Protein Genome: Function?

Can they predict functions?

  • Clustering of expression profiles

  • Grouping functionally related genes together (?)

  • Botstein (Eisen), Lander, Haussler, and Church groups, Eisenberg

Func. A

Func. B


Distributions of gene expression correlations for all possible gene groupings
Distributions Genome: of Gene Expression Correlations, for All Possible Gene Groupings

Sample for Diauxic shift Expt. (Brown),

Ex. Ravg,G=3 =

[ R(gene-1,gene-3) + R(gene-1,gene-4) + R(gene-5,gene-7) ] / 3


Distributions of gene expression correlations for all possible gene groupings 2

P-value for specific 10-gene func. group Genome:

Distributions of Gene Expression Correlations, for All Possible Gene Groupings 2

Sample for Diauxic shift Expt. (Brown),

Ex. Ravg,G=3 =

[ R(gene-1,gene-3) + R(gene-1,gene-4) + R(gene-5,gene-7) ] / 3


Based on distributions correlation of established functional categories computer clusterings

Correlation: Genome:

Always

Significant

Sometimes

Significant

(depends

on expt.)

Never

Significant

Based on Distributions, Correlation of Established Functional Categories, Computer Clusterings


Protein protein interactions expression
Protein-Protein Interactions & Genome: Expression

Use same formalism to assess how closely related expression timecourses to sets of known p-p interactions

Sets of interactions

(all pairs)(control)

(from MIPS)

(Uetz et al.)

between selected expression timecourses in CDC28 expt. (Davis)

(strong interaction, clearly diff.)


Relation of p p interactions to abs expression level
Relation of P-P Interactions to Abs. Expression Level Genome:

Distribution of Normalized Expression Levels

Sets of Interacting Proteins

for


Can we define function well enough to relate to expression

Fold, Localization, Interactions & Regulation Genome: are attributes of proteins that are much more clearly defined

Can we define FUNCTION well enough to relate to expression?

Problems defining function:

Multi-functionality: 2 functions/protein (also 2 proteins/function)

Conflating of Roles: molecular action, cellular role, phenotypic manifestation.

Non-systematic Terminology:‘suppressor-of-white-apricot’ & ‘darkener-of-apricot’

vs.


Phenotype orf clusters from transposon expt

20 Conditions Genome:

20 Conditions

28 ORFs in cluster

28 ORFs in cluster

Phenotype ORF Clusters from Transposon Expt.

Transposon insertions into (almost) each yeast gene to see how yeast is affected in 20 conditions. Generates a phenotype pattern vector, which can be treated similarly to expression data

k-means clustering of ORFs based on “phenotype patterns,” cross-ref. to MIPs Functional Classes

Cluster showing cold phenotype (containing genes most necessary in cold) is enriched in metabolic functions

Metabolism

M Snyder, A Kumar, et al….

Cold


Analysis of genomes transcriptomes in terms of the occurrence of parts features5

1 Using Parts to Interpret Genomes. Genome: Shared and/or unique parts. Venn Diagrams, Fold tree with all-b diff. Ortholog tree. Top-10 folds.

2 Using Parts to Interpret Pseudogenomes.In worm, top Y-folds (DNAse, hydrolase) v top-folds (Ig). chr. IV enriched, dead and dying families (90YG v 1G)

3 Using Parts to Interpret Transcriptomes: Expression & Structure.Top-10 parts in mRNA. Enriched in transcriptome: ab folds, energy, synthesis,TIM fold, VGA. Depleted: TMs, transport, transcription, Leu-zip, NS. Compare with prot. abundance.

4 Expression & Localization.Enriched : Cytoplasmic. Depleted: Nuclear. Bayesian localizer

5 Expression & Function. Expression relates to structure & localization but to function, globally? P-value formalism. Weak relation to protein-protein interactions.

5(G)1(T)

Analysis of Genomes & Transcriptomes in terms of the Occurrence of Parts & Features

7(G)15(T)

1(G)2(Y)

bioinfo.mbb.yale.edu

H Hegyi, J Lin, B Stenger,P Harrison, N Echols, R Jansen, A Drawid,J Qian, D Greenbaum, M Snyder


Genecensus
GeneCensus Genome:

bioinfo.mbb.yale.edu

Detailed Tables

Alignment Database

Alignment Server

ORF Query

Ranks

Trees

PDB Query


Partslist ranking viewers
PartsList Ranking Viewers Genome:

J Qian, B Stenger, J Lin....

Rank Folds by Genome Occurrence, Expression, Fold Clustering, Length, &c



Genecensus dynamic tree viewers

Recluster organisms based on folds, composition, &c and compare to traditional taxonomy

GeneCensus Dynamic Tree Viewers


Analysis of genomes transcriptomes in terms of the occurrence of parts features6

1 Using Parts to Interpret Genomes. compare to traditional taxonomyShared and/or unique parts. Venn Diagrams, Fold tree with all-b diff. Ortholog tree. Top-10 folds.

2 Using Parts to Interpret Pseudogenomes.In worm, top Y-folds (DNAse, hydrolase) v top-folds (Ig). chr. IV enriched, dead and dying families (90YG v 1G)

3 Using Parts to Interpret Transcriptomes: Expression & Structure.Top-10 parts in mRNA. Enriched in transcriptome: ab folds, energy, synthesis,TIM fold, VGA. Depleted: TMs, transport, transcription, Leu-zip, NS. Compare with prot. abundance.

4 Expression & Localization.Enriched : Cytoplasmic. Depleted: Nuclear. Bayesian localizer

5 Expression & Function. Expression relates to structure & localization but to function, globally? P-value formalism. Weak relation to protein-protein interactions.

5(G)1(T)

Analysis of Genomes & Transcriptomes in terms of the Occurrence of Parts & Features

7(G)15(T)

1(G)2(Y)

bioinfo.mbb.yale.edu

H Hegyi, J Lin, B Stenger,P Harrison, N Echols, R Jansen, A Drawid,J Qian, D Greenbaum, M Snyder


ad