### Compositionally Adjusted Substitution Matrices for Protein Database Searches

Stephen Altschul

National Center for Biotechnology Information

National Library of Medicine

National Institutes of Health

Collaborators

Yi-Kuo Yu Alejandro Schäffer

John Wootton Richa Agarwala

Mike Gertz Aleksandr Morgulis

National Center for Biotechnology Information

National Library of Medicine

National Institutes of Health

See: Yu, Wootton & Altschul (2003) PNAS 100:15688-15693;

Yu & Altschul (2005) Bioinformatics 21: 902-911;

Altschul et al. (2005) FEBS J. 272:5101-5109.

Log-odds scores

The scores of any local-alignment substitution matrix can be written in the form

$$s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i \, p_j}$$

where the $p_i$ are background amino acid frequencies, the $q_{ij}$ are target frequencies, and $\lambda$ is an arbitrary scale factor (PNAS 87:2264-2268).
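As a minimal sketch of the log-odds form, the following Python computes scores for a toy two-letter alphabet. The function name `log_odds_scores` and all frequencies are illustrative assumptions, not values from the talk. Choosing λ = ln(2)/2 expresses the scores in half-bit units.

```python
import math

def log_odds_scores(q, p, lam):
    """Return s[i][j] = ln(q[i][j] / (p[i] * p[j])) / lam."""
    n = len(p)
    return [[math.log(q[i][j] / (p[i] * p[j])) / lam for j in range(n)]
            for i in range(n)]

# Hypothetical two-letter alphabet; the rows and columns of q sum to p.
p = [0.6, 0.4]
q = [[0.50, 0.10],
     [0.10, 0.30]]
lam = math.log(2) / 2  # half-bit units

scores = log_odds_scores(q, p, lam)
for row in scores:
    print([round(x, 2) for x in row])
```

Pairs that occur more often in the target distribution than expected by chance (here the two identities) receive positive scores; the mismatches receive negative scores.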

The BLOSUM-62 matrix

(PNAS 89:10915-10919)

A 4

R -1 5

N -2 0 6

D -2 -2 1 6

C 0 -3 -3 -3 9

Q -1 1 0 0 -3 5

E -1 0 0 2 -4 2 5

G 0 -2 0 -1 -3 -2 -2 6

H -2 0 1 -1 -3 0 0 -2 8

I -1 -3 -3 -3 -1 -3 -3 -4 -3 4

L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4

K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5

M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5

F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6

P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7

S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4

T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5

W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11

Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7

V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

A R N D C Q E G H I L K M F P S T W Y V

Some sources of bias:

Organismal bias

- AT-rich genomes tend to have more of the amino acids FLINKYM.
- GC-rich genomes tend to have more of the amino acids PRAWG.

Protein family bias

- Transmembrane proteins: more hydrophobic residues.
- Cysteine-rich proteins: more cysteines than usual.

Construction of an asymmetric log-odds substitution matrix

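Given a (not necessarily symmetric) set of target frequencies $q_{ij}$, define two sets of background frequencies $p_i$ and $p'_j$ as the marginal sums of the $q_{ij}$:

$$p_i = \sum_j q_{ij}, \qquad p'_j = \sum_i q_{ij}$$

The substitution scores are then defined as

$$s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i \, p'_j}$$

We call this matrix valid in the context of the $p_i$ and $p'_j$.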
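The construction can be sketched in a few lines of Python. The function name `asymmetric_scores` and the toy target frequencies are assumptions made for illustration only.

```python
import math

def asymmetric_scores(q, lam=1.0):
    """Build background frequencies as marginals of q, then log-odds scores."""
    rows, cols = len(q), len(q[0])
    p = [sum(q[i]) for i in range(rows)]                               # row marginals
    p_prime = [sum(q[i][j] for i in range(rows)) for j in range(cols)]  # column marginals
    s = [[math.log(q[i][j] / (p[i] * p_prime[j])) / lam for j in range(cols)]
         for i in range(rows)]
    return s, p, p_prime

# Hypothetical asymmetric target frequencies (all entries sum to 1).
q = [[0.40, 0.20],
     [0.05, 0.35]]
s, p, p_prime = asymmetric_scores(q)
print(p, p_prime)
print(s[0][1], s[1][0])  # the two off-diagonal scores differ
```

Because the two marginals differ, the resulting score matrix is not symmetric: substituting letter 1 by letter 2 scores differently from the reverse.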

Substitution matrix validity theorem

A substitution matrix can be valid for only a unique set of target and background frequencies, except in certain degenerate cases. (Proof omitted)

One can determine efficiently whether an arbitrary substitution matrix can be valid in some context and, if so, one can extract its unique target and background frequencies, and scale. (Proof and algorithms omitted)
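The general extraction algorithm is omitted in the talk. A simpler, related calculation can be sketched under the assumption that the background frequencies are already known: the scale λ is then the positive root of Σ p_i p'_j exp(λ s_ij) = 1, and the target frequencies follow as q_ij = p_i p'_j exp(λ s_ij). The bisection below, with the hypothetical names `solve_lambda` and `implied_targets`, is only an illustration of that special case.

```python
import math

def solve_lambda(s, p, p_prime, iterations=200):
    """Positive root of sum_ij p[i]*p_prime[j]*exp(lam*s[i][j]) = 1, by bisection."""
    pairs = [(p[i] * p_prime[j], s[i][j])
             for i in range(len(p)) for j in range(len(p_prime))]
    if sum(w * x for w, x in pairs) >= 0 or max(x for _, x in pairs) <= 0:
        raise ValueError("need a negative expected score and a positive score")

    def f(lam):
        return sum(w * math.exp(lam * x) for w, x in pairs) - 1.0

    hi = 1.0
    while f(hi) <= 0:   # grow until we are past the root
        hi *= 2
    lo = hi
    while f(lo) >= 0:   # shrink into the region where f is negative
        lo /= 2
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def implied_targets(s, p, p_prime, lam):
    """Target frequencies implied by the scores, backgrounds and scale."""
    return [[p[i] * p_prime[j] * math.exp(lam * s[i][j])
             for j in range(len(p_prime))] for i in range(len(p))]

# Round trip on a toy matrix built with a known scale of 0.3.
q = [[0.40, 0.20], [0.05, 0.35]]
p, p_prime, true_lam = [0.6, 0.4], [0.45, 0.55], 0.3
s = [[math.log(q[i][j] / (p[i] * p_prime[j])) / true_lam for j in range(2)]
     for i in range(2)]
lam = solve_lambda(s, p, p_prime)
recovered = implied_targets(s, p, p_prime, lam)
```

In this round trip, `lam` recovers the scale used to build the scores and `recovered` reproduces the original target frequencies.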

Choosing new target frequencies

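Given new sets of background frequencies $P_i$ and $P'_j$, how should one choose appropriate target frequencies $Q_{ij}$?

Consistency constraints:

$$\sum_j Q_{ij} = P_i, \qquad \sum_i Q_{ij} = P'_j$$

Close to original $q_{ij}$: minimize

$$D = \sum_{ij} Q_{ij} \ln \frac{Q_{ij}}{q_{ij}}$$

Sometimes, it is desirable to constrain the relative entropy $H$:

$$H = \sum_{ij} Q_{ij} \ln \frac{Q_{ij}}{P_i \, P'_j}$$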
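For the case without a constraint on H, one way to find target frequencies that match the new marginals while staying closest to the original ones, in the sense of minimizing Σ Q_ij ln(Q_ij / q_ij), is iterative proportional fitting. This is a swapped-in textbook technique chosen for brevity, not necessarily the numerical method used by the authors; the function name `fit_targets` and the toy numbers are assumptions.

```python
def fit_targets(q, P, P_prime, iterations=200):
    """Rescale rows and columns of q in turn until its marginals match P and P_prime."""
    n, m = len(P), len(P_prime)
    Q = [row[:] for row in q]
    for _ in range(iterations):
        for i in range(n):                       # match row sums to P
            scale = P[i] / sum(Q[i])
            Q[i] = [x * scale for x in Q[i]]
        for j in range(m):                       # match column sums to P_prime
            scale = P_prime[j] / sum(Q[i][j] for i in range(n))
            for i in range(n):
                Q[i][j] *= scale
    return Q

# Hypothetical original targets and new background frequencies.
q = [[0.50, 0.10],
     [0.10, 0.30]]
P = [0.7, 0.3]
P_prime = [0.65, 0.35]
Q = fit_targets(q, P, P_prime)
```

Each step multiplies a whole row or column by a constant, so ratios such as Q_11 Q_22 / (Q_12 Q_21) are inherited unchanged from the original q while the marginals move to the new compositions.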

Substitution matrices compared

Mode A: Standard BLOSUM-62 matrix.

Mode B: Composition-adjusted matrix; no constraint on relative entropy (H).

Mode C: Composition-adjusted matrix; H constrained to equal a constant (0.44 nats).

Mode D: Composition-adjusted matrix; H constrained to equal that of the standard matrix in the new compositional context.
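Modes C and D both fix the relative entropy H of the target frequencies with respect to the background frequencies, H = Σ Q_ij ln(Q_ij / (P_i P'_j)). A minimal sketch of that quantity follows; the function names and toy inputs are assumptions for illustration.

```python
import math

def relative_entropy(Q, P, P_prime):
    """H = sum_ij Q[i][j] * ln(Q[i][j] / (P[i] * P_prime[j])), in nats."""
    return sum(Q[i][j] * math.log(Q[i][j] / (P[i] * P_prime[j]))
               for i in range(len(P)) for j in range(len(P_prime))
               if Q[i][j] > 0)

def nats_to_bits(x):
    """Convert a quantity in nats to bits."""
    return x / math.log(2)

P = [0.6, 0.4]
independent = [[P[i] * P[j] for j in range(2)] for i in range(2)]  # no signal
related = [[0.50, 0.10], [0.10, 0.30]]                              # hypothetical targets

h_independent = relative_entropy(independent, P, P)
h_related = relative_entropy(related, P, P)
```

H is zero when the target frequencies are just products of the backgrounds, and grows as the targets concentrate on particular pairs, so a higher H corresponds to a matrix tuned to less diverged sequences.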

BLOSUM-62 and sequence-specific background frequencies

| Amino acid | BLOSUM-62 (%) | P. falciparum #16805184 (%) | M. tuberculosis #15607948 (%) |
|---|---|---|---|
| A | 7.4 | 4.8 | 13.9 |
| R | 5.2 | 4.1 | 7.4 |
| N | 4.5 | 8.9 | 2.8 |
| D | 5.3 | 5.6 | 5.9 |
| C | 2.5 | 2.1 | 1.9 |
| Q | 3.4 | 3.0 | 3.6 |
| E | 5.4 | 7.0 | 6.1 |
| G | 7.4 | 6.2 | 9.5 |
| H | 2.6 | 3.1 | 1.7 |
| I | 6.8 | 9.0 | 4.4 |
| L | 9.9 | 8.2 | 9.3 |
| K | 5.8 | 8.2 | 1.9 |
| M | 2.5 | 1.3 | 1.5 |
| F | 4.7 | 5.1 | 2.5 |
| P | 3.9 | 3.8 | 5.3 |
| S | 5.7 | 7.4 | 4.4 |
| T | 5.1 | 2.3 | 5.7 |
| W | 1.3 | 1.0 | 0.8 |
| Y | 3.2 | 4.6 | 2.8 |
| V | 7.3 | 4.4 | 8.7 |

Difference between a scaled, standard BLOSUM-62

and a compositionally adjusted BLOSUM-62

P. falciparum

A -15 -55 -116 -76 -45 -60 -73 -23 -98 -54 -45 -77 -39 -92 -34 -52 -31 -79 -102 -34

R 4 -9 -83 -43 -20 -26 -40 2 -66 -27 -16 -40 -9 -61 -5 -25 -3 -49 -71 -8

N 48 22 -26 6 24 16 3 50 -20 15 25 -2 33 -19 38 21 44 -8 -28 34

D 19 -8 -62 -12 -5 -12 -20 20 -51 -11 -3 -30 4 -47 12 -8 13 -37 -58 6

C 21 -14 -74 -34 19 -20 -35 15 -57 -10 -1 -37 5 -47 6 -11 12 -35 -59 9

Q 23 -2 -65 -24 -3 1 -19 21 -46 -9 1 -24 11 -45 14 -6 15 -30 -54 8

E 22 -5 -66 -20 -6 -7 -14 19 -48 -12 -1 -27 6 -46 12 -8 14 -34 -56 7

G -25 -59 -115 -77 -52 -64 -77 -14 -102 -61 -51 -80 -45 -94 -38 -57 -37 -82 -106 -42

H 54 27 -31 6 30 23 9 52 2 21 33 4 40 -9 44 25 46 1 -14 39

I 26 -6 -67 -25 5 -11 -26 21 -50 9 13 -28 17 -35 14 -7 20 -28 -50 23

L 23 -7 -70 -29 2 -13 -27 18 -51 1 15 -31 17 -35 12 -10 16 -29 -51 16

K 43 20 -45 -5 17 13 -2 41 -29 11 20 2 28 -25 33 14 36 -13 -34 29

M 30 1 -62 -23 8 -4 -20 25 -44 5 18 -23 31 -31 18 -2 23 -22 -46 22

F 62 34 -29 12 41 26 13 61 -7 39 51 9 55 17 52 32 56 19 -1 55

P -31 -62 -123 -80 -56 -66 -80 -34 -106 -64 -54 -84 -48 -99 -23 -61 -39 -88 -110 -44

S 19 -14 -72 -32 -6 -18 -32 15 -57 -17 -8 -35 0 -51 7 -7 12 -41 -62 2

T 11 -21 -78 -41 -12 -26 -39 6 -65 -19 -11 -42 -3 -57 0 -17 12 -46 -67 -1

W 60 31 -32 7 39 26 10 59 -12 31 42 6 49 4 48 28 52 37 -5 47

Y 52 23 -38 0 29 17 2 48 -13 23 34 -1 39 -1 40 20 44 9 -5 41

V 13 -21 -83 -42 -10 -28 -41 6 -67 -11 -6 -44 0 -52 -1 -22 4 -46 -65 9

A R N D C Q E G H I L K M F P S T W Y V

Entries shown: score of standard matrix subtracted from the adjusted one

Optimal alignments implied by modes A and D

Mode A: 29.7 bits (H = 0.51 nats)

Mode D: 31.8 bits (H = 0.51 nats)

Mode C: 33.1 bits (H = 0.44 nats)

Performance of various matrices on 143 pairs of related sequences (FEBS J. 272:5101-5109)

Empirical rules for invoking compositional adjustment when comparing two sequences

1: The length ratio of the longer to the shorter sequence is less than 3.

One metric definition of distance between two composition vectors

(IEEE Trans. Info. Theo. 49:1858-1860)
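The formula on this slide did not survive extraction. One well-known metric on probability vectors is the square root of the Jensen-Shannon divergence; the sketch below assumes that form, including its factor of 1/2, so both the choice of metric and its normalization are assumptions rather than a transcription of the slide. The function name `composition_distance` is likewise hypothetical.

```python
import math

def composition_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two frequency vectors."""
    total = 0.0
    for a, b in zip(p, q):
        m = (a + b) / 2          # midpoint composition
        if a > 0:
            total += 0.5 * a * math.log(a / m)
        if b > 0:
            total += 0.5 * b * math.log(b / m)
    return math.sqrt(max(total, 0.0))  # guard against tiny negative rounding

# Hypothetical three-letter compositions.
x = [0.5, 0.3, 0.2]
y = [0.2, 0.3, 0.5]
z = [0.1, 0.8, 0.1]
print(composition_distance(x, y))
```

Unlike relative entropy itself, a distance of this kind is symmetric and satisfies the triangle inequality, which is what allows compositions to be treated as points forming a triangle.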

2: The distance d between the compositions of the two sequences is less than 0.16.

Law of cosines

In a triangle with sides of length a, b and c, the angle θ opposite the side of length c is

$$\theta = \arccos \frac{a^2 + b^2 - c^2}{2ab}$$
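A short Python sketch of the law of cosines, returning the angle in degrees. The function name `angle_opposite` is an illustrative assumption; the clamp only protects against floating-point rounding pushing the cosine slightly outside [-1, 1].

```python
import math

def angle_opposite(a, b, c):
    """Angle in degrees opposite side c in a triangle with sides a, b, c."""
    cos_theta = (a * a + b * b - c * c) / (2 * a * b)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # clamp against rounding error
    return math.degrees(math.acos(cos_theta))

print(angle_opposite(3, 4, 5))  # right triangle
print(angle_opposite(1, 1, 1))  # equilateral triangle
```

With distances between compositions as side lengths, a and b would be the distances from each sequence's composition to the standard composition and c the distance between the two sequence compositions; that assignment is an interpretation of the slides, not something they state explicitly.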

3: The angle θ made by the compositions of the two sequences with the standard composition is less than 70°.
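The three thresholds can be collected in a small helper. The slides do not say how the rules are combined (all required, or any one sufficient), so this sketch only reports which rules hold and leaves the combination to the caller; the function name `adjustment_rule_checks` and its parameters are hypothetical.

```python
def adjustment_rule_checks(len_a, len_b, distance, angle_degrees):
    """Evaluate each empirical rule separately and return the three outcomes."""
    longer, shorter = max(len_a, len_b), min(len_a, len_b)
    return {
        "length_ratio_below_3": longer / shorter < 3,
        "distance_below_0.16": distance < 0.16,
        "angle_below_70_degrees": angle_degrees < 70,
    }

# Hypothetical pair: lengths 300 and 1200, distance 0.12, angle 65 degrees.
checks = adjustment_rule_checks(300, 1200, 0.12, 65.0)
print(checks)
```

For this example the length ratio is 4, so the first rule fails while the other two hold.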

Future directions

- Possible less extensive use of SEG when compositional adjustment is invoked.
- Application to PSI-BLAST.
