predicting the function of a protein form either a sequence or a structure is not trivial n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Predicting the function of a protein form either a sequence or a structure (is not trivial) PowerPoint Presentation
Download Presentation
Predicting the function of a protein form either a sequence or a structure (is not trivial)

Loading in 2 Seconds...

play fullscreen
1 / 49

Predicting the function of a protein form either a sequence or a structure (is not trivial) - PowerPoint PPT Presentation


  • 125 Views
  • Uploaded on

Predicting the function of a protein form either a sequence or a structure (is not trivial). Adam Godzik The Sanford-Burnham Medical Research Institute. Summary - overview. Homology based methods Analogy based methods Physics based methods Why function prediction?. Multilevel definition

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

Predicting the function of a protein form either a sequence or a structure (is not trivial)


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
predicting the function of a protein form either a sequence or a structure is not trivial

Predicting the function of a protein form either a sequence or a structure(is not trivial)

Adam Godzik

The Sanford-Burnham Medical Research Institute

summary overview
Summary - overview
  • Homology based methods
  • Analogy based methods
  • Physics based methods
  • Why function prediction?
what we mean by function
Multilevel definition

Phenotype

Cellular function

Molecular function (activity)

Substrates

Inhibitors

cofactors

Several attempts to develop a unified function classification

EC classification for enzymes 4.2.31.101

Merops (proteases), CAZY (hydrolases)

Gene ontology

What we mean by function
both are amazingly large and diverse
Both are amazingly large and diverse
  • Organisms (species)
    • About 1.5M known today, 10-100 million species estimated to exists, depending on the definition of species and other assumptions
    • Their relations can be described in a tree of life, at least for eukaryotes.
    • Bacterial and archeal tree of life is much more controversial, some even dispute the concepts of species for bacteria
  • Proteins
    • With 20 amino acid alphabet, the number of possible protein sequences is very large (20100 i.e. 1.2*10130 short proteins(!))
    • Total number: >10billions?
      • 10-100M species, with ~4K genes in a bacterial and ~10K in an eukaryotic genome
    • Over 25 million known today, i.e. ~0.2%
      • Representative sample?
from the 25 million proteins known today
From the 25 million proteins known today
  • Direct experimental data is available for few thousand proteins
  • Indirect experimental data are available for perhaps few hundred thousand
  • Structures of ~60 thousands have been solved
many proteins like species are close relatives
Many proteins (like species) are close relatives
  • Histone H1 (human) - histone H1 (chicken)
  • SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA
  • | | || || || ||| ||| | |||||||||||||||||| ||| |||||| ||
  • SKKSTDHPKYSDMIVAAIQAEKNRAGSSRQSIQKYIKSHYKVGENADSQIKLSIKRLVTT
  • similarity: 77% id, BLAST e.value 0.0
  • function: two H1 histones from different species (orthologs)
  • Their functions and structures are obviously very similar
how many protein families are still out there
How many protein families are still out there?

Number of protein clusters (modeling families) grows linearly in number of protein sequences (and exponentially in time) – cumulative total

From Yooseph et al, PloS Biology, (2007) 5:e16

how far can we go
How far can we go?
  • Histone H5 - histone H1
  • TYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRLA
  • | | | | | | | | | ||| | | | |||| ||||||||
  • SVTELITKAVSASKERKGLSLAALKKALAAGGYDVEKNNSRIKLGLKSLVSKGTLVQTKGTGASGSFRLS
  • similarity: 40% seq id, BLAST e.value 10-15
  • function: two histones (paralogs)
  • Structures still very similar, functions somewhat different, but obviously similar
this is surely too far
This is surely too far?
  • Histone H5 - TRANSCRIPTION FACTOR E2F-4
  • PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL | | | | |
  • GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW
  • similarity :7% seq id, BLAST e.value 1
is it
Structure – obviously similar (2.4 Å RMSD over 80 aa)

function – clearly related (both bind DNA)

More subtle similarity can be detected with more sophisticated methods

Is it?
most function assignments are provided by predicted homology
Unknown protein

GLLTTKFVSLLQEAKDGVLDLKLAADTLAVRQKRRIYDITNVLEGIGLIEKKSKNSIQW

Well studied protein

SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA

most “function assignments” are provided by predicted homology

Similarity

->

homology

similarity

?

prediction

similarity homology based annotations
Recognition of close and/or distant homologs based on similarity

Sequence

Sequence/profile, profile/profile

Structure

Problems

How to predict differences? Even homologous proteins evolve and change!

Similarity -> homology based annotations
prediction by homology
Prediction by homology

Are there any well characterized

proteins similar to my protein?

Can we assume they are homologous?

Recognition

What is the position-by-position

target/template equivalence

Alignment

Structure of my protein is similar to the other one

Modeling

Function

prediction

Function of my protein is similar to the other one

we could predict
We could predict

Role in the whole organism

Structure of a

complex

activity

3D structure

important distinction
Important distinction
  • Similarity
      • Two proteins have similar sequences/structures/functions if by some metric the s/s/f of one protein is more similar to the s/s/f of another than to a randomly chosen protein
  • Homology
      • Two proteins are homologous if they have evolved from a common ancestor
  • Common error
    • Two proteins are 65% homologous
      • What we really meant
    • The sequences of two proteins are 65% similar, therefore we can safely assume they are homologous, why else they would be so similar?
if life would be easy this is how it would look like
If life would be easy, this is how it would look like

similar

homologous

not similar

unrelated

not obviously similar but probably homologous
Not (obviously) similar, but (probably) homologous
  • Histon H5 and transcription factor E2F4, identity 7%, similar fold, similar function (DNA binding)
  • PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL | | | | |
  • GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW
similar but not homologous
Similar, but not homologous
  • phosphoribosyltransferaseand viral coat protein, identity: 42%, different folds, different functions
  • . . . . .

99 IRLKSYCNDQSTGDIKVIGGDDLSTLTGKNVLIVEDIIDTGKTMQTLLSLVRQY.NPKMVKVASLLVKRTPRSVGY 173

: ||. ||| || |. || | : | | | | || | || |:| | ||.| |

214 VPLKTDANDQ.IGDSLY....SAMTVDDFGVLAVRVVNDHNPTKVT..SKVRIYMKPKHVRV...WCPRPPRAVPY 279

similarity vs homology
Similarity vs. homology

similar

homologous

not similar

homologous

similar

not homologous

not similar

unrelated

can we return to this simple picture by redefining similarity
Can we return to this simple picture by redefining similarity?

similar

homologous

not similar

unrelated

are these two protein families related
New protein (target)

KAAELEMEKEQILRSLGEISVHNCMFKLEECDREEIEAITDRLTKRTKTVQVVVETPRNEEQKKALEDATLMIDEVGEMMHSNIEKAKLCLQ

Known protein (template)

VKKDALENLRVYLCEKIIAERHFDHLRAKKILSREDTEEISCRTSSRKRAGKLLDYLQENPKGLDTLVESIRREKTQNF

Are these two protein families related?
slide27

Profile-profile similarity

Compare as

vectors in 21

dimensional

space (FFAS)

how to validate a protocol 1 recognition
How to validate a protocol1. Recognition
  • Folding benchmarks
    • from structural clustering of PDB (several sets, 700 pairs used here)compared to sequence based clustering of the same group of proteins
    • correct predictions vs. wrong predictions
  • CASP meetings, CAFASP, LiveBench
  • published and/or publicly available predictions, fold prediction servers, available prediction programs
summary overview1
Summary - overview
  • Homology based methods
  • Analogy based methods
  • Physics based methods
  • Why function prediction?
similarity analogy based annotations
Recognition of potential analogs based on similarity in

Genome organization (non homologous replacements)

Genomic fingerprints

Expression patterns

Specific features

Charge distribution

Presence of specific patterns

Problems

Is this similarity related to function?

Similarity -> analogy based annotations
tm0449 thy1 from prediction to proof
TM0449 (thy1) – from prediction to proof
  • TM0449
    • Hypothetical, uncharacterized protein
    • Multiple homologs in pathogenic and thermophilic bacteria
    • Novel fold
  • evidence
    • Phylogenetic profile complementing thymidylate synthase
    • A homolog complements TS in Dictyostelium
  • Confirmed experimentally
summary overview2
Summary - overview
  • Homology based methods
  • Analogy based methods
  • Physics based methods
  • Why function prediction?
slide36
We know the structure of one protein in the family and functions of some others – is the function conserved?

Newly

solved

target

Gallery of models

and come up with new predicted functions
And come up with new (predicted) functions

Phospholipid vs. retinol vs. short peptide binding

summary overview3
Summary - overview
  • Homology based methods
  • Analogy based methods
  • Physics based methods
  • Why function prediction?
why my interest in function prediction
Why my interest in function prediction?
  • Structural genomics: the structure is often the easiest experimental information to obtain (after sequence)
function vs function
Function vs function

1970

1990

2005

2010 ?

  • We witnessed dramatic technological advances in sequencing and now structure determination, function analysis remain a painstaking, manual effort.
  • We used to know a lot about function even before we started working on a protein. Well, not anymore

1 year

Function discovery

Structure determination

Sequencing

structure determination is now done on an assembly line
Structure determination is now done on an assembly line

target

selection

3 X

1 X

2 X

1 X

1 X

1 X

7 X

1 X

2 X

1 X

1 X

2 X

1 X

1 X

1 X

2 X

5 X

xtal screening

bl xtal mounting

data collection

phasing

tracing

expression

imaging

purification

crystallization

harvesting

cloning

struc. validation

struc. refinement

annotation

publication

PDB

even few years ago functional annotation seemed trivial
Even few years ago functional annotation seemed trivial

target

selection

3 X

1 X

2 X

1 X

1 X

1 X

7 X

1 X

2 X

1 X

1 X

2 X

1 X

1 X

1 X

2 X

5 X

xtal screening

bl xtal mounting

data collection

phasing

tracing

expression

imaging

purification

crystallization

harvesting

cloning

struc. validation

struc. refinement

annotation

publication

PDB

after few years the reality seems to be very different
After few years, the reality seems to be very different

target

selection

2 X

1 X

2 X

1 X

1 X

1 X

1 X

1 X

7 X

xtal screening

bl xtal mounting

data collection

phasing

tracing

expression

imaging

purification

crystallization

harvesting

cloning

struc. validation

struc. refinement

annotation

publication

PDB

reverse order of function and structure determination and it s challenges
The classical way

1. A function is discovered and studied

2. The gene responsible in this function is identified

3. Function is confirmed

4. Product of this gene is isolated, crystallized solved.

5. we have a whole story!

Structure “rationalizes” function and provides molecular details

Post-genomic

1. a new, uncharacterized gene is found in a genome

2. predictions or high-throughput methods prioritize this gene for further studies

3. the protein is studied in detail

Structure is solved in a high throughput center

Structure is the first experimental information about the “hypothetical” protein

“reverse order” of function and structure determination and it’s challenges
summary
Summary
  • For some, function prediction is a practical, day to day problem
  • Analogy based approaches dominate the field
    • Homology seen from sequence similarity
    • structural similarities
    • Potential active sites, clefts, surface features
  • Many useful tools exists, but they are very scattered and not very user-friendly
summary 2
Summary (2)
  • Avoid overconfidence - “easy” predictions contain many surprises
  • Only synergy of several independent lines of reasoning can give a correct answer
  • Elimination of “easy”, but inconsistent predictions is critical
  • So far, AFP doesn’t even come close to expert analysis