Protein Classification and Meta-organization.
Download
1 / 82

Protein Classification and Meta-organization. Methods for Global Organization of the Protein Universe - PowerPoint PPT Presentation


  • 200 Views
  • Uploaded on

Protein Classification and Meta-organization. Methods for Global Organization of the Protein Universe. Golan Yona. Department of Computer Science Cornell University. Ismb02. Outline. Introduction Sequence-based classifications of proteins Methods Sequence comparison algorithms

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Protein Classification and Meta-organization. Methods for Global Organization of the Protein Universe' - hye


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Slide1 l.jpg

Protein Classification and Meta-organization.Methods for Global Organization of the Protein Universe

Golan Yona

Department of Computer Science

Cornell University


Slide2 l.jpg

Ismb02

Outline

  • Introduction

  • Sequence-based classifications of proteins

    • Methods

      • Sequence comparison algorithms

      • Mathematical models of protein families

    • Databases

      • Domain-based protein sequence classification

      • Protein-wise protein sequence classification

    • Alternative representations of proteins

  • Structure-based classifications of proteins

    • Structure comparison algorithms

    • Databases

  • Other classifications


Slide3 l.jpg

C

M

A

L

V

L

E

C

V

Ismb02

What are proteins ?

  • Structural framework (keratin, collagen)

  • Transport and storage of small molecules (hemoglobin)

  • Transmit information (hormones, receptors)

  • Antibodies

  • Blood clotting factors

  • EnzymesThe protein is created in the cell as a unique sequence

  • of amino acids


Slide4 l.jpg

Sequence

ACMVLLCEVEKYP…

folding

Structure

Function

?????

Ismb02


Slide5 l.jpg

Ismb02

Background and Problem definition

About protein sequences are known

today (non-redundant database).

This number keeps rapidly growing

(large scale sequencing projects).

!

The function of 40-50% of the new proteins is unknown.

Understanding biological function is important for:

  • Study of fundamental biological processes

  • Drug design

  • Genetic engineering


Slide6 l.jpg

Ismb02

Basic approach

  • Calculate pairwise distances/similarities with sequences of known proteins

  • Pairwise similarities are computed using the rigorous dynamic programming algorithm (Smith-Waterman), or heuristic algorithms such as Fasta and Blast.

V MCTKKLLVYD

VRSM KLLLGF E


Slide7 l.jpg

Database search

  • A sequence is compared with all sequences in the database

  • Its properties are extrapolated from neighboring sequences (nearest neighbor approach to learning and generalization)

MFGPVRACAATFD...

MCTKKLVTD....

RVSMLILLGFH.....

KAACGCGLMDRFYVVLVV....

.....


Slide8 l.jpg

BLASTP 2.2.2 [Jan-08-2002]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,

Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),

"Gapped BLAST and PSI-BLAST: a new generation of protein database search

programs", Nucleic Acids Res. 25:3389-3402.

Query= nr|001640000880 trembl: (Q935W2) DNA mismatch repair protein

(Fragment).

(164 letters)

Database: /local/databases/nr/nr

933,075 sequences; 292,319,403 total letters

Searching..................................................done

Score E

Sequences producing significant alignments: (bits) Value

nr|008510000010 swissprot: (Q99XL8) DNA mismatch repair protein ... 316 5e-86

nr|001640000880 trembl: (Q935W2) DNA mismatch repair protein (Fr... 298 1e-80

nr|001640000882 trembl: (Q936C5) DNA mismatch repair protein (Fr... 290 3e-78

nr|001640000877 trembl: (Q933P8) DNA mismatch repair protein (Fr... 283 4e-76

nr|001640000883 trembl: (Q936C9) DNA mismatch repair protein (Fr... 280 5e-75

nr|001350001337 trembl: (Q9ETZ4) DNA mismatch repair protein (Fr... 248 2e-65

nr|001350001345 trembl: (Q9EVZ2) DNA mismatch repair protein (Fr... 247 3e-65

nr|001350000805 trembl: (Q933M4) Putative DNA mismatch repair en... 246 7e-65


Slide9 l.jpg

BLASTP 2.2.2 [Jan-08-2002]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,

Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),

"Gapped BLAST and PSI-BLAST: a new generation of protein database search

programs", Nucleic Acids Res. 25:3389-3402.

Query= nr|001640000690 trembl: (Q22102) Hypothetical 19.0 kDa protein.

(164 letters)

Database: /local/databases/nr/nr

933,075 sequences; 292,319,403 total letters

Searching..................................................done

Score E

Sequences producing significant alignments: (bits) Value

nr|001640000690 trembl: (Q22102) Hypothetical 19.0 kDa protein. 310 5e-84

nr|009250000047 trembl: (Q9GUG1) Hypothetical 104.9 kDa protein. 230 4e-60

nr|009660000030 trembl: (Q20175) Hypothetical 110.0 kDa protein. 225 1e-58

nr|008350000033 trembl: (Q17506) Hypothetical 95.8 kDa protein. 152 8e-37

nr|011390000015 trembl: (O76601) H02F09.4 protein. 152 1e-36

nr|002650001251 trembl: (Q9XUR0) T05F1.12 protein. 148 2e-35

nr|002140000359 trembl: (O16440) Hypothetical 24.8 kDa protein. 139 2e-32

nr|002210001099 trembl: (Q9N5M4) Hypothetical 25.9 kDa protein. 135 1e-31


Slide10 l.jpg

BLASTP 2.2.2 [Jan-08-2002]

Query= nr|001640000685 trembl: (Q19708) F22B5.4 protein.

(164 letters)

Database: /local/databases/nr/nr

933,075 sequences; 292,319,403 total letters

Searching..................................................done

Score E

Sequences producing significant alignments: (bits) Value

nr|001670000430 trembl: (P90860) F36A2.7 protein. 153 8e-37

>nr|001670000430 trembl: (P90860) F36A2.7 protein.

Length = 167

Score = 153 bits (386), Expect = 8e-37

Identities = 83/157 (52%), Positives = 108/157 (67%), Gaps = 2/157 (1%)

Query: 8 RKTMFRFVLSRNASTSNVPSPARIQLKKPAEAGHFQYSRNWSRDPRFVKVAIQKGDTPYQ 67

+ T+ R VL RNAS S + +K + G F+Y R+ SRD R+ A + GDT +

Sbjct: 13 KNTIARIVLVRNAS-SGLTLKHEQTIKINDQQGFFKYQRDVSRDTRYSNPA-KPGDTAAR 70

Query: 68 FLVRRLGHAYEVYPLFVLTAAWFVLFCSASYWSFGKAEIWLDRSNSKAPWDWERLRDTYW 127

F+ R+LGHAYE+YPLF L A W VLF ++SF KAEIWLDRS++ APWDWER+R+ YW

Sbjct: 71 FMFRKLGHAYEIYPLFGLLAIWCVLFGYTVWYSFEKAEIWLDRSHTVAPWDWERIRNNYW 130


Slide11 l.jpg

Smith Waterman (Adv. App. Math 1981)

Based on the dynamic programming algorithm. Finds the maximal local similarity given the parameters for point mutations s(ai,bj) and the gap penalties g

The optimal alignment of the a1a2a3….aiwithb1b2b3….bj must end in one of three possible ways ai ORai OR-

bj -bj

If S(i,j) is the best score over all possible alignments of a1a2a3….aiwithb1b2b3….bj then

S(i,j) = MAX{S(i-1,j-1)+s(ai,bj), S(i-1,j)-g, S(i,j-1)-g}


Slide12 l.jpg

Smith Waterman (Adv. App. Math 1981)

Implementation:

0 . . . j . . m

0

.

i

.

n

S(i-1,j-1)+s(ai,bj)

S(i-1,j)-g

S(i,j-1)-g


Slide13 l.jpg

FASTA (Pearson & Lipman, PNAS 1988)

Create a hash table with all k-tuples that appear in the query sequence, index vectors and an offset vector. Run a bounded dynamic programming around the best diagonal/region

Library sequence 0 . . . j . . m

0

.

i

.

n


Slide14 l.jpg

BLAST and Gapped-BLAST

(Altschul et al., JMB 1990, Altschul et al. NAR 1997)

For each word length k in the query generate all k-tuples with score > T. Use these k-tuples to search the library sequence. Extend the seeds to generate HSPs (high scoring segment pairs without gaps). Combine HSPs

0 . . . j . . m

0

.

i

.

n


Slide15 l.jpg

p1 p2 … pn

PSI-BLAST (Altschul et al. NAR 1997)

  • Iterative procedure that uses a position-specific scoring matrix.

  • Generate a profile P = p1p2…pn from hits detected at iteration I, to search the database in iteration I+1, until convergence.

  • The profile stores the probabilities to observe the 20 amino acids at each position along the query sequence (derived from hits with related sequences, after weighting and addition of pseudo-counts)

1 2 n

1

2

.

.

20


Slide16 l.jpg

Limitations of sequence comparison

In many cases sequences have diverged to the extent that their sequence similarity is undetectable with the above algorithms.

More than 50% of the 1,000,000 proteins that are known today have an unidentified function

Iterative procedures such as PSI-BLAST are more powerful, but are more susceptible to parameter tuning and have higher false positive rate

Use reliable mathematical models of protein families to classify new genes


Slide17 l.jpg

Protein classification

More

specific

Sub-families (rab, ras, ran, rac, rho, ARF1, transducin,..)

Families (G-proteins, Motor proteins, DNA helicases,..)

Super-families (P-loop containing nucleotide triphosphate

hydrolases)

Fold families (P-loop containing hydrolases, beta/alpha tim barrel)

Classes (alpha and beta, all-alpha, ...)

Common

evolutionary

origin


Slide18 l.jpg

MCTKKLVTD....

Why is protein classification important?

  • Protein classification can help to elucidate the function of new genes.

  • Comparing a sequence with a database of protein families is more effective than a standard database search.

  • Each family is represented as a model

    • regular expression

    • profile

    • HMM

  • Those models can capture subtle similarities. Protein classification is useful in structure and function prediction, and especially important in large-scale annotation efforts.

Family1

Family2

Family3


Slide19 l.jpg

Databases of protein families

  • Domain-based clusterings:

    • Prosite (Bairoch & Bucher ISMB 1994, Laurent et al. NAR 2002).

    • Pfam (Sonnhammer et al. Proteins 1997, Bateman et al. NAR 2002).

    • ProDom (Gouzy et al. Computers and Chemistry 1999, Corpet et al. NAR 2000).

    • Prints (Attwood et al. NAR 2002).

    • Domo (Gracy & Argos Bioinformatics 1998).

    • Blocks (Henikoff et al. NAR 1999).

  • Protein-based clusterings:

    • ProtoMap (Yona et at. Proteins 1999, Yona et al. NAR 2000).

    • Cogs (Tatusov et al. Science 1997, Tatusov et al. NAR 2001).

    • Systers (Krause & Vingron Bioinformatics 1998, Krause et al. NAR 2002).

    • PIR (Barker et al. NAR 2000, Wu et al. NAR 2002).

  • Structural classifications:

    • SCOP (Murzin et al. JMB 1995).

    • CATH (Orengo et al. Structure 1997).

    • FSSP (Holm & Sander NAR 1999).

Significant overlap, but no two databases are identical.


Slide20 l.jpg

Mathematical representations of protein families

  • Regular expression

  • Position specific scoring matrices (profiles)

  • Hidden Markov Models

  • Probabilistic suffix trees

  • Sparse Markov transducers


Slide21 l.jpg

Regular expression

G-x(2,3)-[MLIV]-x-P-{K,H}-x(2)-C

GAY_LDPAKSC

GMF_VEPNTSC

GKRFVRPFHNC

GKSTVDPFHNC

G - a match with amino acid G

x – a match with any amino acid

x(i,j) – a match with any amino acid at least i times and no more

than j times

[MLIV] – a match with any amino acid of the four M, L, I or V

{K,H} – a match with any amino acid other than K and H


Slide22 l.jpg

Regular expression - examples

ACTININ_1. [EQ]-x(2)-[ATV]-[FY]-x(2)-W-x-N.

HOMEOBOX_1. [LIVMFYG]-[ASLVR]-x(2)-[LIVMSTACN]-x- [LIVM]-x(4)-[LIV]-[RKNQESTAIY]-

[LIVFSTNKH]-W-[FYVC]-x-[NDQTAH]-x(5)- [RKNAIMW].

HELIX_LOOP_HELIX. [DENSTAP]-[KR]-[LIVMAGSNT]- {FYWCPHKR}-[LIVMT]-[LIVM]-x(2)- [STAV]-[LIVMSTACKR]-x-[VMFYH]- [LIVMTA]-{P}-{P}-[LIVMRKHQ].


Slide23 l.jpg

Profile (Gribskov et al. PNAS 1987)

GAA_TGAAGTC

GGT_TTAGGCC

GGCAATAGGGC

GTTAATAACAC

Position specific scoring matrix

A 0 0.25 0.25 1 0.5 0 1 0.5 0 0.25 0

C 0 0 0.25 0 0 0 0 0 0.25 0.25 1

G 1 0.5 0 0 0 0.25 0 0.5 0.75 0.25 0

T 0 0.25 0.5 0 0.5 0.75 0 0 0 0.25 0

Add pseudo counts

Na(position s)= Ntotal(s)* Σamino acid i p(a/i) p(i/s)

Final score:

score(a/position s) = Σamino acid i p(i/s) score(i,a)


Slide24 l.jpg

v1

v2

v3

v4

Hidden Markov Models

v1

v2

a22

v3

v4

w2

a11

a12

a23

w1

w3

b11

v1

b14

v2

v3

v4

  • Hidden states w and observations v

    • w(t) – the state at time t

  • Transition probabilities

  • P(wj(t+1)/wi(t)) = aij

  • Emission probabilities

    • P(vk(t)/wj(t)) = bjk

  • Transition and emission probabilities are learned from training data


  • Slide25 l.jpg

    A

    C

    G

    T

    Hidden Markov Models (cont.)

    a11

    a22

    a12

    w1

    w2

    b11

    A

    a21

    b14

    C

    T

    G

    • A simple intron/exon model for DNA:

    • Hidden states are w1=exon and w2=intron.

    • Observations (visible symbols) are A, C, G and T


    Slide26 l.jpg

    Hidden Markov Models (cont.)

    • HMMs used to model protein families are profile HMMs with three different types of hidden states: Match (M), delete (D) and insertion (I) states (Krogh et al. JMB 1996)

    • The observations (visible symbols) are the amino acids


    Slide27 l.jpg

    Hidden Markov Models (cont.)

    Given a sequence of length T, what is the probability that the sequence was generated by the model?

    Answer: Sum over all possible paths that can generate this sequence

    Define the forward variable αj(t) to be the probability to observe the sequence Vt = v(1)v(2)…v(t) and to reach hidden state j at time t

    Note: P(VT) = Σjαj(T)

    The HMM forward algorithm:

    Initialize: αj(1) = P0(j) bjk where vk=v(1)

    Iteration: for t=1 till T-1

    αj(t+1) = Σi αi(t) aijbjk where vk=v(t+1)


    Slide28 l.jpg

    Hidden Markov Models (cont.)

    What is the most probable sequence of hidden states that generated it?

    The Viterbi algorithm:

    Define the variable δj(t) to be the probability to observe the sequence

    Vt = v(1)v(2)…v(t) and to reach hidden state j at time t maximized over all sequences of hidden states of length t-1.

    Define the pointer variable pathj(t) that contains the pointer to the most probable hidden state at time t-1 from which we can reach state j at time t

    Initialize: δj(1) = P0(j) bjk where vk=v(1)

    pathj(1) = 0

    Iteration: for t=1 till T-1

    δj(t+1) = maxi {δi(t) aij}bjkwhere vk=v(t)

    pathj(t+1) = arg maxi {δi(t) aij}

    Output: probability P = maxi {δi(T)}

    final statewT =arg maxi {δi(T)} and trace back


    Slide29 l.jpg

    Probabilistic Suffix Trees

    (Ron et al. Machine Learning 1996)

    • Identify short significant patterns (contiguous segments)

    • Does not require multiple alignment

    • Induces a probability distribution on the next symbol to appear right after the segment (short term memory)

    • Variable memory length

    • More efficient than order L Markov chains

    • Longer memory length compared to first-order HMMs, and easier to learn


    Slide30 l.jpg

    Probabilistic Suffix Trees (cont)

    • The learning algorithm

    • Initialize: let tree T consist of a single root node (with empty label)

    • let S = {σ/ σ in Σ and P(σ) > Pmin }

    • Building the PST: while S is not empty, pick s in S

    • Remove s from S

    • If there exists a symbol σ in Σ such that

    • P(σ/s) > Tmin

    • andP(σ/s) > r

    • P(σ/suf(s)) < 1/r

    • then add to T the node s and all the nodes on the path to s from the deepest node of T that is a suffix of s

    • If |s| < L then add the strings {σs / σ in Σ and P(σs) > Pmin } to S

    • Finalize: smooth tree probabilities

    { or


    Slide31 l.jpg

    Databases of protein domains

    • Prosite http://www.expasy.ch/prosite/

    • Pfam http://www.sanger.ac.uk/Software/Pfam/

    • Blocks http://www.blocks.fhcrc.org/

    • ProDom http://prodes.toulouse.inra.fr/prodom/doc/prodom.html

    • Prints http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/

    • Domo http://www.infobiogen.fr/services/domo/

    • InterPro http://www.ebi.ac.uk/interpro/

    • Smart http://smart.embl-heidelberg.de/

    • eMotif http://dna.stanford.edu/identify


    Slide32 l.jpg

    Motifs and domains

    • Motif: a simple combination of a few consecutive secondary structure elements with a specific geometric arrangement (e.g., helix-loop-helix). Some, but not all motifs are associated with a specific biological function.

    • Domain: the fundamental unit of structure folding and evolution. It combines several secondary elements and motifs, not necessarily contiguous, which are packed in a compact globular structure. A domain can fold independently into a stable 3D structure, and it has a specific function.

    • Domain family: proteins that share a domain (possibly in combination with other domains)

    • Protein family: proteins that have the same combination of domains


    Slide33 l.jpg

    General approaches

    • Motif based databases

      • Prosite, Prints, Blocks, eMotif, InterPro

    • Domain-based databases

      • Pfam, ProDom, Domo, Smart

    • Manual/Semi-manual

      • Prosite

    • Semi-automatically

      • Pfam, Smart

    • Fully automatic

      • ProDom, Blocks, Domo, eMotif

    • Use different models (regular expressions, profiles, HMMs)

    • Based on each other


    Slide34 l.jpg

    Prosite

    • A dictionary of functional and structural motifs and domains

    • Valuable biological information on each family

    • Each motif/domain/family is represented as a regular expression, a rule or a profile

    • Models are generated from (usually published) multiple alignments, manually calibrated to ensure selectivity and sensitivity

    • Patterns do not always cover complete domains whereas profiles usually span the whole domain

    • Some families have more than one signature pattern and /or profiles

    • Pattern length can vary from a few amino acids to several hundreds

    • As of June 2002 contains 1564 patterns and profiles describing 1144 families or domains

    1 2 3 4 5 6 7 8 9 10 11

    A 0 0.25 0.25 1 0.5 0 1 0.5 0 0.25 0

    C 0 0 0.25 0 0 0 0 0 0.25 0.25 1

    G 1 0.5 0 0 0 0.25 0 0.5 0.75 0.25 0

    T 0 0.25 0.5 0 0.5 0.75 0 0 0 0.25 0

    OR

    G-x(2,3)-[MLIV]-x-P-{K,H}-x(2)-C


    Slide35 l.jpg

    Prosite - example

    Profile:

    ID TEST; MATRIX.

    AC PS50999;

    MA /DISJOINT: DEFINITION=PROTECT; N1=4; N2=27;

    MA /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=1.1126; R2=0.02183468; TEXT='NScore';

    MA /CUT_OFF: LEVEL=0; SCORE=110; N_SCORE=8.5; MODE=1;

    MA /DEFAULT: M0=-8; D=-20; I=-20; B1=-60; E1=-60; MI=-105; MD=-105; IM=-105; DM=-105;

    MA /I: B1=0; BI=-105; BD=-105;

    CC A, B, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y, Z;

    MA /M: SY='R'; M=-12, -9,-30, -9, -6,-24, 10, -8,-33, 16,-24,-13, 0,-19, 0, 36, -7,-13,-23,-20,-17, -6;

    MA /M: SY='R'; M= -2, -4,-24,-10, -4,-19,-16, -6,-14, 8,-14, -7, 4,-17, 4, 23, -2, -3,-12,-23,-12, -3;

    MA /M: SY='G'; M= -5, -7,-27, -8, -7,-18, 18,-15,-30, 7,-26,-14, -1,-16, -8, 0, -2,-11,-21,-19,-15, -7;

    MA /M: SY='G'; M= -9, 4,-27, -2, -9,-24, 24, -6,-32, 0,-27,-17, 16,-20, -6, 12, 0,-12,-27,-25,-21, -9;

    MA /M: SY='K'; M=-11, -1,-30, -1, 9,-29,-20, -9,-30, 48,-29,-10, 0,-11, 10, 34,-10,-10,-20,-20,-10, 9;

    MA /M: SY='K'; M= -7, -6,-24, -7, -1,-19, -4,-12,-20, 9,-19,-10, -5,-15, -4, 5, -3, -1,-11,-21,-10, -3;

    MA /M: SY='V'; M= -1,-30,-12,-31,-30, 0,-31,-30, 32,-21, 11, 11,-29,-29,-29,-21,-11, -1, 48,-29, -9,-30;

    MA /M: SY='T'; M= -1, -3, 7,-13,-13,-11,-21,-21,-13,-13,-11,-11, -3,-14,-13,-13, 16, 42, -1,-33,-13,-13;

    MA /M: SY='V'; M= -6,-18,-19,-21,-16, -5,-26,-20, 9,-11, 8, 4,-15,-21,-13, -6, -7, 6, 12,-25, -7,-16;

    MA /M: SY='V'; M= -4,-30,-17,-33,-29, 1,-33,-29, 35,-24, 17, 14,-27,-27,-26,-23,-15, -4, 40,-26, -6,-29;

    MA /M: SY='S'; M= -3, 4,-19, 1, 7,-22,-12, -5,-20, 3,-22,-14, 9,-11, 10, 5, 16, 13,-16,-31,-15, 8;

    MA /M: SY='G'; M= -2, 0,-28, -4,-16,-28, 56,-14,-36,-16,-30,-20, 12,-20,-16,-16, 2,-16,-30,-24,-28,-16;

    MA /M: SY='L'; M=-10,-29,-21,-33,-23, 14,-31,-21, 23,-28, 36, 20,-27,-28,-21,-21,-25, -9, 15,-17, 3,-23;

    MA /M: SY='E'; M= -6, 20,-26, 26, 30,-31,-12, -1,-28, 3,-23,-20, 11, -7, 12, -4, 4, -6,-26,-32,-19, 21;

    MA /M: SY='L'; M= 5,-15,-18,-19, -8, -8,-19,-15, 3,-10, 10, 10,-16,-18, -9,-12, -9, -2, 5,-23, -9, -8;

    MA /M: SY='F'; M= -6, -7,-21,-11,-12, 21,-18,-10,-12,-10, -9, -9, -4,-21,-17,-11, -5, -5, -9, -9, 11,-13;

    MA /I: I= -6; MD=-32;

    MA /M: SY='G'; M= -2, 3,-25, 7, 2,-27, 18,-10,-27, -9,-18,-14, 0,-14, -2,-12, -1,-12,-22,-23,-20, 0; D=-6;

    MA /I: I= -6; MI=-32; IM=-32; DM=-32;

    MA /M: SY='I'; M= 0,-21,-20,-25,-20, 4,-25,-20, 17,-22, 16, 12,-19,-22,-18,-21,-15, -7, 14,-19, -3,-20;

    MA /M: SY='D'; M= -7, 27,-27, 36, 21,-32, -3, -3,-32, 0,-25,-23, 14, -9, 2, -9, 2, -9,-27,-34,-21, 12;

    MA /M: SY='L'; M= -9,-23,-27,-21,-11, -8,-27,-18, 6,-20, 14, 5,-22, 8, -9,-18,-18, -6, -4,-23,-10,-12;

    MA /M: SY='K'; M=-11, 2,-27, 3, 10,-22,-20, -4,-24, 23,-21,-11, 0,-12, 10, 16, -4, -2,-19,-19, -3, 9;

    MA /M: SY='K'; M= 2, -2,-20, -3, 5,-22,-13,-12,-21, 16,-22,-12, 0,-10, 3, 6, 8, 5,-13,-27,-14, 4;

    MA /M: SY='L'; M= -5,-28,-19,-31,-23, 10,-28,-21, 21,-26, 32, 18,-26,-27,-21,-20,-22, -8, 17,-19, -1,-22;

    MA /M: SY='A'; M= 31,-14, 0,-22,-15,-17, -2,-22, -8,-15, -6, -8,-13,-17,-15,-21, 2, -3, 1,-24,-19,-15;

    MA /M: SY='A'; M= 16, -5,-17, -9, -2,-22, -7, -3,-20, 5,-21,-11, -1,-12, 0, 1, 10, 0,-11,-26,-13, -2;

    MA /M: SY='E'; M= 3, 4,-23, 5, 13,-23,-16, -9,-13, -3,-12, -7, -2, -9, 5,-10, 1, -2,-12,-27,-15, 9;

    MA /I: E1=0; IE=-105; DE=-105;

    Pattern:

    K-x-V-[TC]-x-[IVL]-x(2)-[LMF]

    Courtesy of Nicolas Hulo


    Slide36 l.jpg

    Prints

    • A database of gene- and domain-family fingerprints in the form of collections of short motifs aligned without gaps

    • Fingerprints comprise several motifs from different parts of a multiple alignment

    • Discrimination power increases with the number of motifs in a fingerprint

    • Start from a seed alignment of a few sequences. Search the database and refine the fingerprint based on the set of sequences that match the fingerprint completely (add sequences and/or add/remove motifs), until convergence

    • Partial match – possibly subfamily members

    • As of July 2002, contains ~10,500 motifs grouped into 1750 families


    Slide37 l.jpg

    Prints - a fingerprinting overview

    PRINTS

    annotation

    N

    C

    Courtesy of Terri Attwood


    Slide38 l.jpg

    Prints - example

    SUMMARY INFORMATION

    37 codes involving 8 elements

    0 codes involving 7 elements

    0 codes involving 6 elements

    0 codes involving 5 elements

    0 codes involving 4 elements

    1 codes involving 3 elements

    0 codes involving 2 elements

    COMPOSITE FINGERPRINT INDEX

    8| 37 37 37 37 37 37 37 37

    7| 0 0 0 0 0 0 0 0

    6| 0 0 0 0 0 0 0 0

    5| 0 0 0 0 0 0 0 0

    4| 0 0 0 0 0 0 0 0

    3| 1 0 0 0 1 1 0 0

    2| 0 0 0 0 0 0 0 0

    -+-----------------------------------------

    | 1 2 3 4 5 6 7 8

    True positives..

    PRIO_COLGU PRIO_MACFA PRIO_CEREL PRIO_ODOHE

    PRIO_GORGO PRIO_PANTR PRIO_HUMAN O46648

    PRIO_SHEEP PRIO_CALJA PRIO_BOVIN PRP2_BOVIN

    PRIO_ATEPA PRIO_SAISC PRIO_PREFR PRIO_PONPY

    O75942 PRIO_CAPHI PRIO_CEBAP PRIO_CAMDR

    PRIO_FELCA PRP1_TRAST PRIO_RABIT PRP2_TRAST

    PRIO_PIG PRIO_CANFA PRIO_CRIGR PRIO_CRIMI

    Q15216 PRIO_RAT PRIO_CERAE PRIO_MUSPF

    PRIO_MUSVI PRIO_MESAU PRIO_MOUSE O46593

    PRIO_TRIVU

    Subfamily: Codes involving 3 elements

    Subfamily True positives..

    PRIO_CHICK

    Courtesy of Terri Attwood


    Slide39 l.jpg

    Prints - website

    Courtesy of Terri Attwood


    Slide40 l.jpg

    ProDom

    • A database of automatically generated protein domains

    • Old procedure:

      • Identify high-scoring segment pairs (HSPs) using BLAST for all-versus-all comparison of SwissProt

      • Construct homologous segment sets by transitive closure of overlapping HSPs

      • Break into domains

    • New procedure:

      • Construct homologous segment sets by PSI-BLAST recursive search

    • Generate a multiple alignment and a consensus sequence for each family

    • May break real domains into smaller domains (more than 305,000 domain families as of January 2002)


    Slide41 l.jpg

    ith iteration

    DB

    query

    internal repeat detection

    yes

    no

    query

    PSI-BLAST

    no match

    matches

    repeat matches

    DB changes

    remove newly found domains

    split modified sequences

    sort by size

    (i+1)th iteration

    DB

    query

    Courtesy of Daniel Kahn


    Slide42 l.jpg

    ProDom - website

    • Domain analysis

      • Which proteins contain a given domain?

      • Which proteins share a homologous domain with a given protein?

    Domain motif

    Links to other representations of the family

    Summary alignment

    Parameters to define sub-family levels

    Courtesy of Daniel Kahn


    Slide43 l.jpg

    ProDom - website

    • Synthetic views of alignments and trees

    • Homology search with graphical output

    Courtesy of Daniel Kahn


    Slide44 l.jpg

    Courtesy of Daniel Kahn


    Slide45 l.jpg

    Blocks

    • A collection of blocks (short ungapped alignments)

    • Each represent a conserved region of a protein family (not necessarily associated with a known function)

    • Blocks are generated from multiple alignments of groups of related proteins.

    • Groups are derived from protein families in Prosite, Smart, Pfam, ProDom

    • Used only the SWISSPROT entries to define the blocks (to avoid false positive sequences from TrEMBL)

    • Complementary to PRINTS

    • As of August 2001 contains 8656 blocks representing 2101 groups


    Slide46 l.jpg

    Blocks (cont)

    GAYKLDPAKSC

    GMFMVEPNTSC

    GKRFVRPFHNC

    GKSTVDPFHNC

    • Detection:

      • Identify short motifs using the algorithm by Smith 1990

      • Short patterns are then extended in both directions until the similarity score drops. Maximal length of a block is 60

    • Assembly

      • Create an ordered set of non overlapping blocks (“path”) that occur in a critical number of sequences

    • Scoring

      • Score individual blocks based on length, similarity, #sequences, etc

      • Score a path based on scores of the constituent blocks and #sequences

    DKGAYKLDPAKSCNL

    HSGMFMVEPNTSCHF

    GLGKRFVRPFHNCWM

    MMGKSTVDPFHNCHD

    DKGAYKLDPAKSCNL

    HSGMFMVEPNTSCHF

    GLGKRFVRPFHNCWM

    MMGKSTVDPFHNCHD

    KHAYKLDAKSNLL

    VHMFMVDNTSHFL

    LHKRFVDFHSWMM

    KKVLDEAKSCN

    KLILEENTSCH

    SLFLREFHNCW

    AAVLDEFHNCH

    SKLLEEFSECH


    Slide47 l.jpg

    GAY_LDPAKSC

    GMF_VEPNTSC

    GKRFVRPFHNC

    GKSTVDPFHNC

    Database

    build

    search

    refine

    Pfam

    • A database of Hidden Markov Models for protein domains/families generated semi-automatically

    • Start from a seed multiple alignment (published or from other databases, such as Prosite and ProDom).

    • The alignment usually represent complete domains (confirmed with structural data when available).

    • After manual inspection a profile HMM is built from the alignment

    • The HMM is used to search the database. If a true member is missed, then it is added to the seed alignment and the process is repeated.

    • A full alignment is created for all members of the family. If it is not correct the process is repeated with another seed alignment or another alignment method

    • As of May 2002, contains 3849 families, domains, repeats and motifs (in addition Pfam B derived from ProDom)


    Slide48 l.jpg

    Domo

    • Combines compositional and local similarity search followed by multiple sequence alignments

    • Start by detecting global similarities from the comparison of amino acid and dipeptide composition of each protein.

    • Group similar proteins into clusters

    • Pick one representative from each cluster and compile into a suffix tree

    • Compare the suffix tree with itself local similarities

    • Cluster local similarities and align (without gaps)

    • Analyze alignments to detect domain boundaries and split

    • Create multiple alignments

    • Iterative family extension using automatically generated regular expressions

    • Contains 8877 entries


    Slide49 l.jpg

    InterPro

    • Combined resource of proteins domain and motifs from ProSite, ProDom, Prints, Smart and Pfam


    Slide50 l.jpg

    Databases of protein clusters

    • Cogs http://www.ncbi.nlm.nih.gov/COG/

    • Systers http://systers.molgen.mpg.de/

    • PIR http://wwwnbrf.georgetown.edu/pirwww/pirhome.shtml

    • ProtoMap http://protomap.cornell.edu/

      What are the differences?


    Slide51 l.jpg

    General approach

    • Guide lines:

    • Two proteins are homologousif they evolved from the same ancestor protein.

    • We can deduce homology based on statistically significant sequence similarities

    • Homology is a transitive relation


    Slide52 l.jpg

    protein 1

    protein 2

    protein 3

    Transitive closure leads to false connections

    • Similarity is not transitive and it does not necessarily entail homologyMain traps:

    • 1) Overestimates of the statistical significance of similarity scores.

    • Chance similarities

    • 2) Multi-domain proteins


    Slide53 l.jpg

    Systers

    • Run all-against-all comparison (Smith-Waterman) of database sequences

    • Apply the single linkage clustering algorithm to construct hierarchy of all sequneces

    • Derive superfamilies from the structure of the hierarchy

    • Construct distance graph for each superfamily

    • Cut weak connections in each superfamily distance graph to derive family clusters

      290,811 non-redundant sequences (without inclusions and duplicates) from Swiss-Prot, TrEMBL, PIR, wormbase, FlyBase, etc. (as of August 2000)

      cluster into 64,282 superfamilies,

      split into 82,449 family clusters (including 55,182 single sequence clusters of mostly fragmental sequences),

      annotated with Pfam domains

    Seed

    1e-80

    1e-72

    1e-57

    1e-42

    1e-38

    1e-31

    1e-14

    1e-08

    1e-01

    Seed

    1e-85

    1e-76

    1e-40

    1e-36

    1e-22

    1e-02


    Slide54 l.jpg

    Single Linkage

    Hierarchy

    Superfamilies

    Superfamily

    Distance Graph

    Family

    Cluster

    Systers

    Superfamilies as well as family clusters are derivedfrom the structure generated by the data itself no need for a user defined static cutoff

    Courtesy of Antje Krause


    Slide55 l.jpg

    COGs

    • Based on comparison of 43 complete genomes (as of May 2002)

    • Apply single linkage clustering algorithm to get

    • COGs - clusters of orthologous proteins (genes in different species that evolved from the same ancestral protein) or orthologous sets of paralogs (genes from the same genome which are related by duplication).

    • Each COG consists of at least 3 species.

    • Start by forming minimal COGs (triangles)

    • Merge COGs that share a side edge

    • Refine – split COGs that were incorrectly merged due to multi-domain proteins or other reasons

    • COGNITOR: a tool to add new proteins to COGs on the basis of genome-specific best hits; used to incorporate new genomes.

    • Total of 3307 COGs as of May 2002 (including 74,059 of 104,101 proteins encoded in the analyzed genomes)


    Slide56 l.jpg

    e-value > 10-100

    This process is repeated at thresholds 10-95 ,10-90,10-85 ,10-80 ..... 10-5 ,10-0 (=1)

    e-value < 10-100

    ProtoMap

    A graph-based approach: sequence space is represented as a directed graph where vertices = protein sequences and edges represent relationships between proteins. The width (weight) of an edge reflects the significance.

    1 Starting from a very high resolution classification

    Only connections of very high statistical significance (10-100) are considered.

    The graph splits into connected components

    2 Lower the threshold (consider weaker similarities). Merge together clusters that are strongly connected obtain the classification at the next level of confidence

    Myoglobin

    Hemoglobin alpha

    Leghemoglobin

    Hemoglobin beta


    Slide57 l.jpg

    rejected

    Phase I: identify pairs of clusters candidate

    for merging (local test)

    phase II: identify groups of clusters which

    are strongly connected (global test)

    rejected

    The clustering algorithm (cont.)

    This process is repeated at thresholds 10-95 ,10-90,10-85 ,10-80 ..... 10-5 ,10-0 (=1)

    365,174 proteins are classified into 18,140 clusters (+29,000 singletons)


    Slide58 l.jpg

    ProtoMap - website

    Results of the search on keyword “electron transport”.






    Slide63 l.jpg

    Alternative representations

    • Is there a unique universal representation of proteins?


    Slide64 l.jpg

    Alternative representations

    • Studies based on dipeptide composition - a 400 dimensional vector

    • van Heel 1991:

      • Compared sequences by means of the correlation of their representations (advantages..)

      • Linear projection and clustering (k-means) into fixed number of clusters

    • Ferran et al. 1994

      • Used a two-layer neural network to classify sequences based on their dipeptide composition into a fixed number of clusters (trained using the Kohonen learning algorithm)

      • Meaning of weight vectors..


    Slide65 l.jpg

    Alternative representations

    • Combination of compositional properties and other physical/chemical properties (Hobohm & Sander 1995)

      • Composition of the 20 amino acids

      • Composition of different groups of amino acids (polar, charged, aromatic, bulky, etc)

      • Isoelectric point

      • Molecular weight

      • Average hydrophobicity

      • Length

      • Dipeptide composition (used reduced representation)

    • The combination is weighted so as to achieve maximal separation over the train set

    • Can induce alternative similarity measures between proteins and used to search for close relatives


    Slide66 l.jpg

    Alternative representations

    Casari, et al. (1995) A method to predict functional residues in proteins

    Input: a multiple-sequence alignment

    • each sequence is converted to a vector of size (20 * L) where L is length of the alignment

      Generation of of N x (20*L) matrix

    • one sequence produces a vector of dimensions 20*L

    • N sequences to produce N vectors of dimension 20*L

      Use Principal Component Analysis

    • calculate the covariance matrix- indicates how different dimensions are correlated to one another

    • largest eigenvalues and corresponding eigenvectors are the principal components

    • take the three largest (the largest of which represents consensus sequence)

    • project their 20*L dimensional data onto those 3 dimensions (minimize covariance by projection onto a new basis of the eigenvectors of the covariance matrix)

    • this can be used to predict a protein subfamily for a given protein and functional residues


    Slide67 l.jpg

    Structural classifications

    • SCOP http://scop.mrc-lmb.cam.ac.uk/scop/

    • CATH http://www.biochem.ucl.ac.uk/bsm/cath_new/index.html

    • FSSP http://www.ebi.ac.uk/dali/fssp/fssp.html

    • Structure comparison algorithms

    • Dali

    • CE

    • Structal


    Slide68 l.jpg

    1

    2

    3

    Dali (Holm & Sander JMB 1993)

    • Comparison of protein structures using 2D distance matrices to represent 3D structures

    ROP

    CYC b56

    CYC b56

    ROP


    Slide69 l.jpg

    Dali (cont.)

    • Assignment of equivalent residue pairs


    Slide70 l.jpg

    S = S Sf(i,j)

    L

    L

    i=1

    j=1

    | dAij – dBij |

    w(d*ij)

    q E -

    d*ij

    fE(i,j) =

    q E

    Dali (cont.)

    • The similarity score of two structures

    • i and j are labeled pairs of equivalent (matched) residues (i.e. i = iA,iB).

    • f = similarity measure based on Ca-Ca distances dAij and dBij

    • Largest S corresponds to optimal set of equivalencies.

    • Rigid similarity scores:

    • q R = zero level of similarity

    • Elastic similarity score:

    • d*ij = the average of dAij and dBij

    • q E = tolerance of 20% deviation

    • w(r) = envelope function = exp(-r2/a2)

    fR(i,j) = q R – | dAij – dBij |


    Slide71 l.jpg

    CE (Shindyalov & Bourne, Protein Eng. 1998)

    • Protein Structure Alignment by Incremental Combinatorial Extension (CE) of the Optimal Path

      Define alignment fragment pair (AFP) as a continuous segment of protein A aligned against a continuous segment of protein B (without gaps).

    • An alignment is a path of AFPs s.t. for every two consecutive AFPs there may be gaps inserted into either A or B, but not into both. That is, for every two consecutive AFPs i and i+1

    • and

    • or and

    • or and

    • where piAis the starting position of AFP i in protein A


    Slide72 l.jpg

    CE

    What is a “good”AFP?

    Define the distance between two different AFPs i and j as:

    dA(p,q) represents the distance between the alpha carbon atoms at positions p and q in protein A.

    If you already have n-1 AFPs and consider adding the n-th AFN, do so only if

    Protein B

    i j

    i

    j

    Protein A


    Slide73 l.jpg

    CE (cont.)

    • Select an initial AFP.

    • Build an alignment path by incrementally adding “good” AFPs that satisfy the conditions of paths

    • Repeat step (2) until the proteins are completely matched, or until no good AFPs remain.

    • To assess the significance of the alignment, compare it to the alignment of a random pairs of structures, and compute the Z-score based on the RMSD and number of gaps in the final alignment.

    Protein B

    Protein A


    Slide74 l.jpg

    Structal (Levitt & Gerstein, PNAS 1998)

    An initial equivalence is chosen, based on matching the ends of the two structures.

    Repeat until convergence:

    • Superimpose the two structures so as to minimize the RMS, given the equivalence

    • Given the superposition, calculate the distances dij between any atom i in the first protein and any atom j in the second protein

    • Transform distances into similarities sij = M/[1+ (dij/d0)2] where M=20 and d0 = 2.24A

    • Apply dynamic programming to define a new set of equivalences


    Slide75 l.jpg

    Structal (cont)

    2) Superimpose to minimize RMS

    1) Alignment fixed

    4) Use dynamic prog. to find the best set

    of equivalences

    3) Calculate distances

    between all atoms

    5) Superimpose given

    the new alignment

    6) Recalculate distances

    between all atoms


    Slide76 l.jpg

    SCOP

    Structural classification of proteins with 5 level hierarchy:

    Domains: the individual entries

    Family: homologous proteins with significant sequence similarity

    Superfamily: protein families that share weak sequence similarity but with conserved functional residues (e.g. in active sites) – believed to be evolutionary related

    Fold: protein superfamilies that share he same fold (not necessarily due to common evolutionary ancestry)

    Class: all-alpha, all-beta, alpha/beta, alpha+beta, membrane proteins, small proteins

    The classification is based on manual analysis by experts (Dr. Alexy Murzin)

    As of May 2002, 7 main classes, 686 folds, 1073 superfamilies, 1827 families


    Slide77 l.jpg

    CATH

    Structural classification of proteins with 5 level hierarchy:

    Protein chains: the individual entries

    Homologous superfamily: proteins with highly similar structures and functions.

    Topology: clusters according to the topological connections and numbers of secondary structures.

    Architecture: describes the gross orientation of secondary structures, independent of connectivities (assigned manually).

    Class:derived from secondary structure content, is assigned for more than 90% of protein structures automatically.

    The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.

    As of Jan 2002, 8 main classes, 46 architectures, 1453 topologies, more than 2000 superfamilies.


    Slide78 l.jpg

    FSSP

    Structural classification of proteins into a tree hierarchy:

    Protein domains: the individual entries (defined using the algorithm of Holm and Sander 1994)

    Start with all-vs-all structure comparison of protein domains

    Domains are clustered automatically into clusters using the single linkage algorithm based on the z-scores of the structure similarity scores

    3242 families of more than 30,000 structures as of June 2002


    Slide79 l.jpg

    Combined sequence and structure studies

    • Han & Baker 1996

      • Multiple alignments of protein families

      • Create profiles

      • Cluster profile columns to clusters

      • Each cluster (signature pattern) is mapped to a structural motif

    • Rigoutsos et al. 1999

      • Analysis of 17 complete genomes

      • Unsupervised pattern discovery (represented as extended regular expressions) – 1D dictionary

      • Locate motifs in the PDB database and align the corresponding substructures

      • If the RMSD is below 2.5A enter to 3D dictionary

    • Elofsson & Shonnhammer 1999

      • Mapping of Pfam domains to SCOP domains and vice versa.

    • Kasuya and Thoronton 1999

    • Salamov et al. 1999

    • Jonassen et al. 2000


    Slide80 l.jpg

    Type1 clusters (structure-based)

    Type2 clusters

    BioSpace (Yona & Levitt, ISMB 2000)

    MCAVSST...

    RASWIILTC...

    DYEEPEERT...


    Slide81 l.jpg

    Summary

    Many different classification systems

    • Different source databases, different objects

    • Different representations

    • Different methodologies

      Significant overlap, but differences are too fundamental to accommodate in a single classification system.

      Is there a universal definition and characterization of domain families and protein families?


    Slide82 l.jpg

    Acknowledgments

    Terri Attwood (prints)

    Jerome Gracy (domo)

    Steven Henikoff (blocks)

    Nicolas Hulo (prosite)

    Daniel Kahn (prodom)

    Eugene Koonin (cogs)

    Antje Krause (systers)

    Erik Sonnhammer (Pfam)

    Chris Chau, Nick DeNunzio, Leonid Meyerguz


    ad