Compositionally adjusted substitution matrices for protein database searches


Compositionally Adjusted Substitution Matrices for Protein Database Searches

Stephen Altschul

National Center for Biotechnology Information

National Library of Medicine

National Institutes of Health


Collaborators

Yi-Kuo Yu Alejandro Schäffer

John Wootton Richa Agarwala

Mike Gertz Aleksandr Morgulis

National Center for Biotechnology Information

National Library of Medicine

National Institutes of Health

See: Yu, Wootton & Altschul (2003) PNAS 100:15688-15693;

Yu & Altschul (2005) Bioinformatics 21:902-911;

Altschul et al. (2005) FEBS J. 272:5101-5109.


Log-odds scores

The scores of any local-alignment substitution matrix can be written in the form

$$ s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p_j} $$

where the $p_i$ are background amino acid frequencies, the $q_{ij}$ are target frequencies and $\lambda$ is an arbitrary scale factor. (PNAS 87:2264-2268)
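
To make the log-odds form concrete: if the scores s_ij and background frequencies p_i are given, the scale λ is the positive solution of Σ p_i p_j exp(λ s_ij) = 1, and the implied target frequencies are q_ij = p_i p_j exp(λ s_ij). The following is only a minimal sketch in Python. The function names, the bisection approach and the two-letter toy alphabet are illustrative assumptions, not something taken from the slides.

```python
import math

def solve_lambda(scores, p, tol=1e-12):
    """Positive root of sum_ij p_i p_j exp(lam * s_ij) = 1, found by bisection.

    Assumes a negative expected score and at least one positive score,
    which guarantees that exactly one positive root exists.
    """
    n = len(p)

    def f(lam):
        return sum(p[i] * p[j] * math.exp(lam * scores[i][j])
                   for i in range(n) for j in range(n)) - 1.0

    hi = 1.0
    while f(hi) <= 0.0:      # grow until we are past the root
        hi *= 2.0
    lo = hi
    while f(lo) >= 0.0:      # shrink until we are before the root
        lo /= 2.0
    while hi - lo > tol:     # plain bisection on [lo, hi]
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def target_frequencies(scores, p, lam):
    """Implied target frequencies q_ij = p_i p_j exp(lam * s_ij)."""
    n = len(p)
    return [[p[i] * p[j] * math.exp(lam * scores[i][j]) for j in range(n)]
            for i in range(n)]

# Toy two-letter alphabet (hypothetical values, for illustration only).
p = [0.5, 0.5]
scores = [[1, -2], [-2, 1]]
lam = solve_lambda(scores, p)
q = target_frequencies(scores, p, lam)
```

For this toy matrix the equation reduces to a cubic in exp(λ), so λ can be checked by hand; for a real 20-letter matrix the same routine applies unchanged.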


The BLOSUM-62 matrix

(PNAS 89:10915-10919)

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V


Amino acid compositional bias

Some sources of bias:

Organismal bias

  • AT-rich genomes tend to have more of the amino acids FLINKYM

  • GC-rich genomes tend to have more of the amino acids PRAWG

Protein family bias

  • Transmembrane proteins: more hydrophobic residues

  • Cysteine-rich proteins: more cysteines than usual


Construction of an asymmetric log-odds substitution matrix

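Given a (not necessarily symmetric) set of target frequencies $q_{ij}$, define two sets of background frequencies $p_i$ and $p'_j$ as the marginal sums of the $q_{ij}$:

$$ p_i = \sum_j q_{ij}, \qquad p'_j = \sum_i q_{ij} $$

The substitution scores are then defined as

$$ s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i \, p'_j} $$

We call this matrix valid in the context of the $p_i$ and $p'_j$.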
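
The construction can be sketched in a few lines of Python: take the row sums of the target frequencies as one background, the column sums as the other, and score each pair by the log-odds ratio divided by λ. The function name and the 2x2 target frequencies below are hypothetical and chosen only to show that an asymmetric set of targets yields an asymmetric matrix.

```python
import math

def asymmetric_log_odds(q, lam=1.0):
    """Return (p, p_prime, s) for target frequencies q.

    p[i]       = sum over j of q[i][j]   (row marginals)
    p_prime[j] = sum over i of q[i][j]   (column marginals)
    s[i][j]    = ln(q[i][j] / (p[i] * p_prime[j])) / lam
    Assumes every q[i][j] is strictly positive and the q sum to 1.
    """
    n_rows, n_cols = len(q), len(q[0])
    p = [sum(row) for row in q]
    p_prime = [sum(q[i][j] for i in range(n_rows)) for j in range(n_cols)]
    s = [[math.log(q[i][j] / (p[i] * p_prime[j])) / lam
          for j in range(n_cols)]
         for i in range(n_rows)]
    return p, p_prime, s

# Hypothetical, deliberately asymmetric target frequencies.
q = [[0.4, 0.1],
     [0.2, 0.3]]
p, p_prime, s = asymmetric_log_odds(q)
```

Here the two backgrounds differ (row sums 0.5 and 0.5, column sums 0.6 and 0.4), and s[0][1] is not equal to s[1][0].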


Substitution matrix validity theorem

A substitution matrix can be valid for only a unique set of target and background frequencies, except in certain degenerate cases. (Proof omitted)

One can determine efficiently whether an arbitrary substitution matrix can be valid in some context and, if so, one can extract its unique target and background frequencies, and scale. (Proof and algorithms omitted)


Choosing new target frequencies

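Given new sets of background frequencies $P_i$ and $P'_j$, how should one choose appropriate target frequencies $Q_{ij}$?

Consistency constraints:

$$ \sum_j Q_{ij} = P_i, \qquad \sum_i Q_{ij} = P'_j $$

Close to original $q_{ij}$: minimize the relative entropy of the $Q_{ij}$ with respect to the $q_{ij}$,

$$ D = \sum_{ij} Q_{ij} \ln \frac{Q_{ij}}{q_{ij}} $$

Sometimes, it is desirable to constrain the relative entropy H:

$$ H = \sum_{ij} Q_{ij} \ln \frac{Q_{ij}}{P_i \, P'_j} $$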
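
For the case with no constraint on H, one classical way to find target frequencies that match prescribed marginals while staying close to the original q_ij in the relative-entropy sense is iterative proportional fitting, which alternately rescales rows and columns. The sketch below uses that swapped-in technique under stated assumptions: it is not presented as the numerical method of the cited papers, it does not handle the constrained-H modes, and the function names and toy numbers are hypothetical.

```python
import math

def adjust_targets(q, P, P_prime, iters=500):
    """Iterative proportional fitting (a stand-in technique, see lead-in).

    Rescales rows and columns of q until the row sums approach P and the
    column sums equal P_prime. Assumes all q[i][j] > 0 and that P and
    P_prime each sum to 1.
    """
    Q = [row[:] for row in q]
    n, m = len(Q), len(Q[0])
    for _ in range(iters):
        for i in range(n):                      # match row marginals
            r = P[i] / sum(Q[i])
            Q[i] = [x * r for x in Q[i]]
        for j in range(m):                      # match column marginals
            c = P_prime[j] / sum(Q[i][j] for i in range(n))
            for i in range(n):
                Q[i][j] *= c
    return Q

def relative_entropy(Q):
    """H = sum_ij Q_ij ln(Q_ij / (P_i P'_j)) in nats, using Q's own marginals."""
    n, m = len(Q), len(Q[0])
    P = [sum(row) for row in Q]
    P_prime = [sum(Q[i][j] for i in range(n)) for j in range(m)]
    return sum(Q[i][j] * math.log(Q[i][j] / (P[i] * P_prime[j]))
               for i in range(n) for j in range(m))

# Hypothetical 2x2 example: symmetric targets, skewed new backgrounds.
q = [[0.4, 0.1],
     [0.1, 0.4]]
Q = adjust_targets(q, P=[0.7, 0.3], P_prime=[0.6, 0.4])
H = relative_entropy(Q)
```

Because each step only multiplies a whole row or column by a constant, the adjusted frequencies keep the form a_i q_ij b_j, so cross-ratios such as Q[0][0] Q[1][1] / (Q[0][1] Q[1][0]) are unchanged from the original q.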


Substitution matrices compared

Mode A: Standard BLOSUM-62 matrix.

Mode B: Composition-adjusted matrix; no constraint on relative entropy (H).

Mode C: Composition-adjusted matrix; H constrained to equal a constant (0.44 nats).

Mode D: Composition-adjusted matrix; H constrained to equal that of the standard matrix in the new compositional context.


Performance evaluation (mode D vs. mode A)


BLOSUM-62 and sequence-specific background frequencies

(all values in %)

Amino                P. falciparum   M. tuberculosis
acid    BLOSUM-62    #16805184       #15607948
-----   ---------    -------------   ---------------
A          7.4            4.8             13.9
R          5.2            4.1              7.4
N          4.5            8.9              2.8
D          5.3            5.6              5.9
C          2.5            2.1              1.9
Q          3.4            3.0              3.6
E          5.4            7.0              6.1
G          7.4            6.2              9.5
H          2.6            3.1              1.7
I          6.8            9.0              4.4
L          9.9            8.2              9.3
K          5.8            8.2              1.9
M          2.5            1.3              1.5
F          4.7            5.1              2.5
P          3.9            3.8              5.3
S          5.7            7.4              4.4
T          5.1            2.3              5.7
W          1.3            1.0              0.8
Y          3.2            4.6              2.8
V          7.3            4.4              8.7


Difference between a scaled, standard BLOSUM-62 and a compositionally adjusted BLOSUM-62

P. falciparum

A  -15  -55 -116  -76  -45  -60  -73  -23  -98  -54  -45  -77  -39  -92  -34  -52  -31  -79 -102  -34
R    4   -9  -83  -43  -20  -26  -40    2  -66  -27  -16  -40   -9  -61   -5  -25   -3  -49  -71   -8
N   48   22  -26    6   24   16    3   50  -20   15   25   -2   33  -19   38   21   44   -8  -28   34
D   19   -8  -62  -12   -5  -12  -20   20  -51  -11   -3  -30    4  -47   12   -8   13  -37  -58    6
C   21  -14  -74  -34   19  -20  -35   15  -57  -10   -1  -37    5  -47    6  -11   12  -35  -59    9
Q   23   -2  -65  -24   -3    1  -19   21  -46   -9    1  -24   11  -45   14   -6   15  -30  -54    8
E   22   -5  -66  -20   -6   -7  -14   19  -48  -12   -1  -27    6  -46   12   -8   14  -34  -56    7
G  -25  -59 -115  -77  -52  -64  -77  -14 -102  -61  -51  -80  -45  -94  -38  -57  -37  -82 -106  -42
H   54   27  -31    6   30   23    9   52    2   21   33    4   40   -9   44   25   46    1  -14   39
I   26   -6  -67  -25    5  -11  -26   21  -50    9   13  -28   17  -35   14   -7   20  -28  -50   23
L   23   -7  -70  -29    2  -13  -27   18  -51    1   15  -31   17  -35   12  -10   16  -29  -51   16
K   43   20  -45   -5   17   13   -2   41  -29   11   20    2   28  -25   33   14   36  -13  -34   29
M   30    1  -62  -23    8   -4  -20   25  -44    5   18  -23   31  -31   18   -2   23  -22  -46   22
F   62   34  -29   12   41   26   13   61   -7   39   51    9   55   17   52   32   56   19   -1   55
P  -31  -62 -123  -80  -56  -66  -80  -34 -106  -64  -54  -84  -48  -99  -23  -61  -39  -88 -110  -44
S   19  -14  -72  -32   -6  -18  -32   15  -57  -17   -8  -35    0  -51    7   -7   12  -41  -62    2
T   11  -21  -78  -41  -12  -26  -39    6  -65  -19  -11  -42   -3  -57    0  -17   12  -46  -67   -1
W   60   31  -32    7   39   26   10   59  -12   31   42    6   49    4   48   28   52   37   -5   47
Y   52   23  -38    0   29   17    2   48  -13   23   34   -1   39   -1   40   20   44    9   -5   41
V   13  -21  -83  -42  -10  -28  -41    6  -67  -11   -6  -44    0  -52   -1  -22    4  -46  -65    9
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V

Entries shown: score of standard matrix subtracted from the adjusted one


Optimal alignments implied by modes A and D

Mode A: 29.7 bits (H = 0.51 nats)

Mode D: 31.8 bits (H = 0.51 nats)

Mode C: 33.1 bits (H = 0.44 nats)



Performance of various matrices on 143 pairs of related sequences (FEBS J. 272:5101-5109)


Empirical rules for invoking compositional adjustment when comparing two sequences

1: The length ratio of the longer to the shorter sequence is less than 3.


One metric definition of distance between two composition vectors

(IEEE Trans. Info. Theo. 49:1858-1860)
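
The formula for this distance did not survive in the transcript. As a hedged sketch, the code below assumes the metric is the square root of the Jensen-Shannon divergence between the two composition vectors, computed with natural logarithms. That choice, including the factor of 1/2 and the log base, is an assumption made for illustration; the cited reference may use a different normalization, which would rescale every distance and change how a given threshold should be read. The function name is hypothetical.

```python
import math

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (assumed definition).

    p and q are composition vectors of equal length that each sum to 1.
    Zero entries contribute nothing to the sums.
    """
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]     # midpoint composition
    js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, js))              # guard against tiny negatives

# Hypothetical three-letter compositions.
a = [0.7, 0.2, 0.1]
b = [0.2, 0.5, 0.3]
c = [0.1, 0.1, 0.8]
d_ab, d_bc, d_ac = js_distance(a, b), js_distance(b, c), js_distance(a, c)
```

Under this definition the distance is symmetric, is zero only for identical compositions, and satisfies the triangle inequality, which is what allows three compositions to be treated as the corners of a triangle.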


Empirical rules for invoking compositional adjustment when comparing two sequences

1: The length ratio of the longer to the shorter sequence is less than 3.

2: The distance d between the compositions of the two sequences is less than 0.16.


Law of cosines

In a triangle with sides of length a, b and c, the angle opposite the side of length c is

$$ \theta = \arccos \frac{a^2 + b^2 - c^2}{2ab} $$


Empirical rules for invoking compositional adjustment when comparing two sequences

1: The length ratio of the longer to the shorter sequence is less than 3.

2: The distance d between the compositions of the two sequences is less than 0.16.

3: The angle θ made by the compositions of the two sequences with the standard composition is less than 70°.
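
The three rules can be evaluated with a small helper, sketched here in Python. It assumes the pairwise distances have already been computed with whatever metric is in use: d12 between the two sequence compositions, and d1s and d2s between each sequence composition and the standard composition. The angle at the standard composition then follows from the law of cosines, with d12 as the opposite side. The slides do not say how the rules are combined, so the sketch reports each rule separately instead of guessing; the function and parameter names are hypothetical.

```python
import math

def angle_at_standard(d12, d1s, d2s):
    """Angle in degrees at the standard composition, opposite the side d12.

    Uses the law of cosines; assumes d1s > 0 and d2s > 0.
    """
    cos_theta = (d1s ** 2 + d2s ** 2 - d12 ** 2) / (2 * d1s * d2s)
    cos_theta = max(-1.0, min(1.0, cos_theta))   # clamp rounding error
    return math.degrees(math.acos(cos_theta))

def rule_checks(len1, len2, d12, d1s, d2s,
                max_ratio=3.0, max_dist=0.16, max_angle=70.0):
    """Return (rule1, rule2, rule3) as booleans, one per empirical rule."""
    ratio = max(len1, len2) / min(len1, len2)
    rule1 = ratio < max_ratio                               # length ratio
    rule2 = d12 < max_dist                                  # composition distance
    rule3 = angle_at_standard(d12, d1s, d2s) < max_angle    # angle with standard
    return rule1, rule2, rule3

# Hypothetical inputs: sequences of length 100 and 250, equal distances.
checks = rule_checks(100, 250, d12=0.1, d1s=0.1, d2s=0.1)
```

With three equal distances the triangle is equilateral, so the angle is 60 degrees and all three rules hold for this toy input.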


ROC_n curves for Aravind set (NAR 29:2994-3005)


ROC_n curves for SCOP set (Proc. IEEE 9:1834-1847)


Future directions

  • Possible less extensive use of SEG when compositional adjustment is invoked.

  • Application to PSI-BLAST.

