Compositionally adjusted substitution matrices for protein database searches

Stephen Altschul

National Center for Biotechnology Information

National Library of Medicine

National Institutes of Health


Collaborators

Yi-Kuo Yu

Alejandro Schäffer

John Wootton

Richa Agarwala

Mike Gertz

Aleksandr Morgulis

National Center for Biotechnology Information

National Library of Medicine

National Institutes of Health

See: Yu, Wootton & Altschul (2003) PNAS 100:15688-15693;

Yu & Altschul (2005) Bioinformatics 21:902-911;

Altschul et al. (2005) FEBS J. 272:5101-5109.


Log-odds scores

The scores of any local-alignment substitution matrix can be written in the form

$$s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p_j}$$

where the $p_i$ are background amino acid frequencies, the $q_{ij}$ are target frequencies and $\lambda$ is an arbitrary scale factor.

(PNAS 87:2264-2268)
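
The log-odds form, s_ij = ln(q_ij / (p_i p_j)) / λ, can be illustrated with a minimal Python sketch. The two-letter alphabet and its target frequencies are hypothetical values chosen only for illustration, and the function name is not from the talk. Choosing λ = ln(2)/2 expresses the scores in half-bit units, which are then rounded to integers.

```python
import math

def log_odds_scores(q, p, lam):
    """Return s[i][j] = ln(q[i][j] / (p[i] * p[j])) / lam."""
    n = len(p)
    return [[math.log(q[i][j] / (p[i] * p[j])) / lam for j in range(n)]
            for i in range(n)]

# Hypothetical two-letter alphabet; these numbers are illustrative only.
q = [[0.4, 0.1],
     [0.1, 0.4]]                     # symmetric target frequencies, summing to 1
p = [sum(row) for row in q]          # background frequencies as marginals of q

# lam = ln(2) / 2 puts the scores in half-bit units.
half_bits = log_odds_scores(q, p, lam=math.log(2) / 2)
rounded = [[round(s) for s in row] for row in half_bits]  # integer score matrix
```

Changing λ only rescales every score by the same factor, which is why the slide calls it arbitrary.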


The BLOSUM-62 matrix

(PNAS 89:10915-10919)

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V


Amino acid compositional bias

Some sources of bias:

Organismal bias

  • AT-rich genomes: tend to have more of the amino acids FLINKYM

  • GC-rich genomes: tend to have more of the amino acids PRAWG

Protein family bias

  • Transmembrane proteins: more hydrophobic residues

  • Cysteine-rich proteins: more cysteines than usual


Construction of an asymmetric log-odds substitution matrix

Given a (not necessarily symmetric) set of target frequencies $q_{ij}$, define two sets of background frequencies $p_i$ and $p'_j$ as the marginal sums of the $q_{ij}$:

$$p_i = \sum_j q_{ij}, \qquad p'_j = \sum_i q_{ij}$$

The substitution scores are then defined as

$$s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p'_j}$$

We call this matrix valid in the context of the $p_i$ and $p'_j$.
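
As a small illustration of this construction, the sketch below takes an asymmetric set of target frequencies, forms the row and column marginals as the two background sets, and builds the scores ln(q_ij / (p_i p'_j)) / λ. The 2x2 numbers and the function names are hypothetical. The last function checks the identity that follows directly from the definition: the sum over i, j of p_i p'_j exp(λ s_ij) equals the sum of the q_ij, which is 1.

```python
import math

def marginals(q):
    """Row sums give p_i; column sums give p'_j."""
    p = [sum(row) for row in q]
    p_prime = [sum(col) for col in zip(*q)]
    return p, p_prime

def asymmetric_scores(q, lam=1.0):
    """s[i][j] = ln(q[i][j] / (p[i] * p'[j])) / lam, backgrounds from marginals."""
    p, p_prime = marginals(q)
    return [[math.log(q[i][j] / (p[i] * p_prime[j])) / lam
             for j in range(len(p_prime))] for i in range(len(p))]

def normalization(s, p, p_prime, lam=1.0):
    """Sum of p_i p'_j exp(lam * s_ij); equals 1 for a matrix built as above."""
    return sum(p[i] * p_prime[j] * math.exp(lam * s[i][j])
               for i in range(len(p)) for j in range(len(p_prime)))

# Hypothetical asymmetric target frequencies (illustrative only).
q = [[0.30, 0.20],
     [0.10, 0.40]]
p, p_prime = marginals(q)
s = asymmetric_scores(q)
total = normalization(s, p, p_prime)
```

Because q is not symmetric here, s[0][1] and s[1][0] differ, and the two background sets differ as well.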


Substitution matrix validity theorem

A substitution matrix can be valid for only a unique set of target and background frequencies, except in certain degenerate cases. (Proof omitted)

One can determine efficiently whether an arbitrary substitution matrix can be valid in some context and, if so, one can extract its unique target and background frequencies, and scale. (Proof and algorithms omitted)
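
The general extraction algorithm is omitted in the talk and is not reproduced here. The sketch below covers only the simpler, standard case in which the background frequencies are already given: it finds the positive scale λ satisfying sum_ij p_i p'_j exp(λ s_ij) = 1 by bisection and then reads off the implied target frequencies q_ij = p_i p'_j exp(λ s_ij). Function names and the 2x2 example are hypothetical.

```python
import math

def solve_lambda(s, p, p_prime, tol=1e-12):
    """Positive root of sum_ij p_i p'_j exp(lam * s_ij) = 1, by bisection."""
    cells = [(p[i] * p_prime[j], s[i][j])
             for i in range(len(p)) for j in range(len(p_prime))]
    if sum(w * x for w, x in cells) >= 0 or max(x for _, x in cells) <= 0:
        raise ValueError("need a negative expected score and a positive score")

    def f(lam):
        return sum(w * math.exp(lam * x) for w, x in cells) - 1.0

    lo, hi = 0.0, 1.0
    while f(hi) <= 0:            # f is negative between 0 and the root
        lo, hi = hi, 2 * hi
    for _ in range(200):         # bounded bisection
        if hi - lo <= tol:
            break
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def implied_targets(s, p, p_prime, lam):
    """q[i][j] = p[i] * p'[j] * exp(lam * s[i][j])."""
    return [[p[i] * p_prime[j] * math.exp(lam * s[i][j])
             for j in range(len(p_prime))] for i in range(len(p))]

# Hypothetical example: build scores from known targets with lam = 0.5,
# then recover the scale and the targets from the scores alone.
q_true = [[0.30, 0.20], [0.10, 0.40]]
p, p_prime = [0.5, 0.5], [0.4, 0.6]
s = [[math.log(q_true[i][j] / (p[i] * p_prime[j])) / 0.5 for j in range(2)]
     for i in range(2)]
lam = solve_lambda(s, p, p_prime)
q_back = implied_targets(s, p, p_prime, lam)
```

Bisection is used only for simplicity; any one-dimensional root finder would do, since the function is convex with a single positive root under the stated conditions.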


Choosing new target frequencies

Given new sets of background frequencies $P_i$ and $P'_j$, how should one choose appropriate target frequencies $Q_{ij}$?

Consistency constraints:

$$\sum_j Q_{ij} = P_i, \qquad \sum_i Q_{ij} = P'_j$$

Close to original $q_{ij}$: minimize

$$D = \sum_{ij} Q_{ij} \ln \frac{Q_{ij}}{q_{ij}}$$

Sometimes, it is desirable to constrain the relative entropy $H$:

$$H = \sum_{ij} Q_{ij} \ln \frac{Q_{ij}}{P_i P'_j}$$
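
For the case with no constraint on H, one way to sketch the idea is iterative proportional fitting: alternately rescale rows and columns of the original q_ij until the marginals match the new backgrounds. For strictly positive q_ij this converges to the Q_ij that satisfies the consistency constraints while minimizing the relative entropy of Q with respect to q. This is a swapped-in technique chosen for brevity, not necessarily the numerical method used in the cited papers, and it does not handle the constrained-H case. All names and numbers below are hypothetical.

```python
def adjust_targets(q, P, P_prime, max_iter=1000, tol=1e-12):
    """Rescale rows and columns of q until the marginals equal P and P'."""
    Q = [row[:] for row in q]
    n, m = len(P), len(P_prime)
    for _ in range(max_iter):
        for i in range(n):                       # match row sums to P
            r = sum(Q[i])
            Q[i] = [x * P[i] / r for x in Q[i]]
        col = [sum(Q[i][j] for i in range(n)) for j in range(m)]
        for i in range(n):                       # match column sums to P'
            for j in range(m):
                Q[i][j] *= P_prime[j] / col[j]
        # Column sums are now exact; stop once row sums also agree.
        if max(abs(sum(Q[i]) - P[i]) for i in range(n)) < tol:
            break
    return Q

# Hypothetical original targets and new background frequencies.
q = [[0.30, 0.20],
     [0.10, 0.40]]
P, P_prime = [0.7, 0.3], [0.6, 0.4]
Q = adjust_targets(q, P, P_prime)
```

Row and column rescaling leaves cross-ratios such as Q[0][0] Q[1][1] / (Q[0][1] Q[1][0]) unchanged, which is one sense in which the adjusted frequencies stay close to the originals.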


Substitution matrices compared

Mode A: Standard BLOSUM-62 matrix.

Mode B: Composition-adjusted matrix; no constraint on relative entropy (H).

Mode C: Composition-adjusted matrix; H constrained to equal a constant (0.44 nats).

Mode D: Composition-adjusted matrix; H constrained to equal that of the standard matrix in the new compositional context.
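
The relative entropy H that distinguishes modes B, C and D can be computed from a set of target frequencies as H = sum_ij Q_ij ln(Q_ij / (P_i P'_j)), in nats, with P and P' the marginals of Q. The sketch below is illustrative; the function names and example matrices are hypothetical.

```python
import math

def relative_entropy(Q):
    """H = sum_ij Q_ij * ln(Q_ij / (P_i * P'_j)) in nats, P and P' from marginals."""
    P = [sum(row) for row in Q]
    P_prime = [sum(col) for col in zip(*Q)]
    return sum(Q[i][j] * math.log(Q[i][j] / (P[i] * P_prime[j]))
               for i in range(len(P)) for j in range(len(P_prime))
               if Q[i][j] > 0)

def nats_to_bits(h):
    """Convert an entropy from nats to bits."""
    return h / math.log(2)

# Hypothetical target frequencies with some association between rows and columns.
Q = [[0.30, 0.20],
     [0.10, 0.40]]
H = relative_entropy(Q)

# Independent frequencies (an outer product of marginals) carry no information.
Q_indep = [[a * b for b in (0.6, 0.4)] for a in (0.3, 0.7)]
H_indep = relative_entropy(Q_indep)
```

H measures the average score per aligned pair in the matrix's own context, so fixing it (mode C) or matching it to the standard matrix (mode D) controls how sharply the adjusted matrix discriminates.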


Performance evaluation (mode D vs. mode A)


BLOSUM-62 and sequence-specific background frequencies (%)

Amino                 P. falciparum    M. tuberculosis
acid     BLOSUM-62    #16805184        #15607948
-----    ---------    -------------    ---------------
A           7.4            4.8              13.9
R           5.2            4.1               7.4
N           4.5            8.9               2.8
D           5.3            5.6               5.9
C           2.5            2.1               1.9
Q           3.4            3.0               3.6
E           5.4            7.0               6.1
G           7.4            6.2               9.5
H           2.6            3.1               1.7
I           6.8            9.0               4.4
L           9.9            8.2               9.3
K           5.8            8.2               1.9
M           2.5            1.3               1.5
F           4.7            5.1               2.5
P           3.9            3.8               5.3
S           5.7            7.4               4.4
T           5.1            2.3               5.7
W           1.3            1.0               0.8
Y           3.2            4.6               2.8
V           7.3            4.4               8.7


Difference between a scaled, standard BLOSUM-62 and a compositionally adjusted BLOSUM-62

P. falciparum

A  -15  -55 -116  -76  -45  -60  -73  -23  -98  -54  -45  -77  -39  -92  -34  -52  -31  -79 -102  -34
R    4   -9  -83  -43  -20  -26  -40    2  -66  -27  -16  -40   -9  -61   -5  -25   -3  -49  -71   -8
N   48   22  -26    6   24   16    3   50  -20   15   25   -2   33  -19   38   21   44   -8  -28   34
D   19   -8  -62  -12   -5  -12  -20   20  -51  -11   -3  -30    4  -47   12   -8   13  -37  -58    6
C   21  -14  -74  -34   19  -20  -35   15  -57  -10   -1  -37    5  -47    6  -11   12  -35  -59    9
Q   23   -2  -65  -24   -3    1  -19   21  -46   -9    1  -24   11  -45   14   -6   15  -30  -54    8
E   22   -5  -66  -20   -6   -7  -14   19  -48  -12   -1  -27    6  -46   12   -8   14  -34  -56    7
G  -25  -59 -115  -77  -52  -64  -77  -14 -102  -61  -51  -80  -45  -94  -38  -57  -37  -82 -106  -42
H   54   27  -31    6   30   23    9   52    2   21   33    4   40   -9   44   25   46    1  -14   39
I   26   -6  -67  -25    5  -11  -26   21  -50    9   13  -28   17  -35   14   -7   20  -28  -50   23
L   23   -7  -70  -29    2  -13  -27   18  -51    1   15  -31   17  -35   12  -10   16  -29  -51   16
K   43   20  -45   -5   17   13   -2   41  -29   11   20    2   28  -25   33   14   36  -13  -34   29
M   30    1  -62  -23    8   -4  -20   25  -44    5   18  -23   31  -31   18   -2   23  -22  -46   22
F   62   34  -29   12   41   26   13   61   -7   39   51    9   55   17   52   32   56   19   -1   55
P  -31  -62 -123  -80  -56  -66  -80  -34 -106  -64  -54  -84  -48  -99  -23  -61  -39  -88 -110  -44
S   19  -14  -72  -32   -6  -18  -32   15  -57  -17   -8  -35    0  -51    7   -7   12  -41  -62    2
T   11  -21  -78  -41  -12  -26  -39    6  -65  -19  -11  -42   -3  -57    0  -17   12  -46  -67   -1
W   60   31  -32    7   39   26   10   59  -12   31   42    6   49    4   48   28   52   37   -5   47
Y   52   23  -38    0   29   17    2   48  -13   23   34   -1   39   -1   40   20   44    9   -5   41
V   13  -21  -83  -42  -10  -28  -41    6  -67  -11   -6  -44    0  -52   -1  -22    4  -46  -65    9
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V

Entries shown: score of standard matrix subtracted from the adjusted one


Optimal alignments implied by modes A and D

Mode A: 29.7 bits (H = 0.51 nats)

Mode D: 31.8 bits (H = 0.51 nats)

Mode C: 33.1 bits (H = 0.44 nats)



Performance of various matrices on 143 pairs of related sequences (FEBS J. 272:5101-5109)


Empirical rules for invoking compositional adjustment when comparing two sequences

1: The length ratio of the longer to the shorter sequence is less than 3.


One metric definition of distance between two composition vectors

(IEEE Trans. Inf. Theory 49:1858-1860)
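
The formula on this slide did not survive in the transcript. As a hedged sketch, the code below assumes the distance is the square root of the Jensen-Shannon divergence between the two compositions, which is a metric of the kind described in the cited IEEE paper. The normalization (the factor of 1/2) is an assumption, and it affects how any numeric threshold should be read. The function name and example vectors are hypothetical.

```python
import math

def composition_distance(x, y):
    """Square root of the Jensen-Shannon divergence (natural log).

    Assumption: the 1/2 normalization; the slide's exact formula is not available.
    """
    value = 0.0
    for a, b in zip(x, y):
        mean = (a + b) / 2
        if a > 0:
            value += a * math.log(a / mean) / 2
        if b > 0:
            value += b * math.log(b / mean) / 2
    return math.sqrt(max(value, 0.0))   # guard against tiny negative rounding

# Hypothetical two-letter compositions (illustrative only).
x = [0.5, 0.5]
y = [0.9, 0.1]
z = [0.1, 0.9]
d_xy = composition_distance(x, y)
d_xz = composition_distance(x, z)
d_yz = composition_distance(y, z)
```

Being a metric means the distance is zero only for identical compositions, is symmetric, and obeys the triangle inequality, which is what allows it to serve as a side length in a triangle of compositions.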


Empirical rules for invoking compositional adjustment when comparing two sequences

1: The length ratio of the longer to the shorter sequence is less than 3.

2: The distance d between the compositions of the two sequences is less than 0.16.


Law of cosines

In a triangle with sides of length a, b and c, the angle opposite the side of length c is

$$\theta = \arccos \frac{a^2 + b^2 - c^2}{2ab}$$
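
The law of cosines gives the angle opposite side c as arccos((a² + b² - c²) / (2ab)). A minimal sketch, with a hypothetical function name, returning the angle in degrees:

```python
import math

def angle_opposite_c(a, b, c):
    """Angle in degrees opposite side c, for a triangle with sides a, b, c."""
    cos_theta = (a * a + b * b - c * c) / (2 * a * b)
    cos_theta = max(-1.0, min(1.0, cos_theta))   # clamp rounding overshoot
    return math.degrees(math.acos(cos_theta))

right = angle_opposite_c(3.0, 4.0, 5.0)        # 3-4-5 right triangle
equilateral = angle_opposite_c(1.0, 1.0, 1.0)  # all angles equal
```

The clamp only protects against floating-point values slightly outside [-1, 1]; it does not validate that the three lengths form a triangle.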


Empirical rules for invoking compositional adjustment when comparing two sequences

1: The length ratio of the longer to the shorter sequence is less than 3.

2: The distance d between the compositions of the two sequences is less than 0.16.

3: The angle θ made by the compositions of the two sequences with the standard composition is less than 70°.
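
The three rules can be sketched as a single check. Several points here are assumptions rather than facts from the slides: the distance is taken to be the square root of the Jensen-Shannon divergence, θ is taken to be the angle at the standard composition in the triangle formed by the three compositions (via the law of cosines), and the slides do not say whether the rules are combined with "and" or "or", so the caller must pass `all` or `any` explicitly. The compositions are the percentages from the background-frequency table in this talk; the sequence lengths are hypothetical.

```python
import math

# Percent compositions from the talk's table, in the order ARNDCQEGHILKMFPSTWYV.
STANDARD = [7.4, 5.2, 4.5, 5.3, 2.5, 3.4, 5.4, 7.4, 2.6, 6.8,
            9.9, 5.8, 2.5, 4.7, 3.9, 5.7, 5.1, 1.3, 3.2, 7.3]
P_FALCIPARUM = [4.8, 4.1, 8.9, 5.6, 2.1, 3.0, 7.0, 6.2, 3.1, 9.0,
                8.2, 8.2, 1.3, 5.1, 3.8, 7.4, 2.3, 1.0, 4.6, 4.4]
M_TUBERCULOSIS = [13.9, 7.4, 2.8, 5.9, 1.9, 3.6, 6.1, 9.5, 1.7, 4.4,
                  9.3, 1.9, 1.5, 2.5, 5.3, 4.4, 5.7, 0.8, 2.8, 8.7]

def normalize(v):
    total = sum(v)
    return [x / total for x in v]

def distance(x, y):
    """Assumed metric: square root of the Jensen-Shannon divergence."""
    value = 0.0
    for a, b in zip(x, y):
        mean = (a + b) / 2
        if a > 0:
            value += a * math.log(a / mean) / 2
        if b > 0:
            value += b * math.log(b / mean) / 2
    return math.sqrt(max(value, 0.0))

def angle_deg(a, b, c):
    """Angle opposite side c; taken as 0 when a or b is 0 (an assumption)."""
    if a == 0 or b == 0:
        return 0.0
    cos_theta = max(-1.0, min(1.0, (a * a + b * b - c * c) / (2 * a * b)))
    return math.degrees(math.acos(cos_theta))

def rule_checks(len1, len2, comp1, comp2, standard):
    """Return (rule 1, rule 2, rule 3) as booleans, using the slide thresholds."""
    ratio = max(len1, len2) / min(len1, len2)
    d = distance(comp1, comp2)
    theta = angle_deg(distance(comp1, standard), distance(comp2, standard), d)
    return (ratio < 3.0, d < 0.16, theta < 70.0)

def invoke_adjustment(checks, combine):
    """combine is the built-in `all` or `any`; the slides do not specify which."""
    return combine(checks)

# Hypothetical sequence lengths of 300 and 250 residues.
checks = rule_checks(300, 250, normalize(P_FALCIPARUM),
                     normalize(M_TUBERCULOSIS), normalize(STANDARD))
```

Keeping the three booleans separate from the final decision makes it easy to see which rule fires for a given pair before committing to a combination policy.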


ROCn curves for Aravind set (NAR 29:2994-3005)


ROCn curves for SCOP set (Proc. IEEE 9:1834-1847)


Future directions

  • Possible less extensive use of SEG when compositional adjustment is invoked.

  • Application to PSI-BLAST.

