
Compositionally Adjusted Substitution Matrices for Protein Database Searches

Stephen Altschul

National Center for Biotechnology Information

National Library of Medicine

National Institutes of Health

Collaborators

Yi-Kuo Yu Alejandro Schäffer

John Wootton Richa Agarwala

Mike Gertz Aleksandr Morgulis

National Center for Biotechnology Information

National Library of Medicine

National Institutes of Health

See: Yu, Wootton & Altschul (2003) PNAS 100:15688-15693;

Yu & Altschul (2005) Bioinformatics 21:902-911;

Altschul et al. (2005) FEBS J. 272:5101-5109.

Log-odds scores

The scores of any local-alignment substitution matrix can be written in the form

$$ s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i \, p_j} $$

where the $p_i$ are background amino acid frequencies, the $q_{ij}$ are target frequencies, and $\lambda$ is an arbitrary scale factor.

(PNAS 87:2264-2268)
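
As a small illustration of the log-odds form s_ij = ln(q_ij / (p_i p_j)) / λ, the sketch below evaluates one score under two choices of scale. The frequencies are hypothetical numbers picked so the arithmetic is easy to follow; they are not taken from any real matrix. Setting λ = ln 2 expresses the score in bits, and λ = (ln 2)/2 expresses it in half-bits.

```python
import math

def log_odds_score(q_ij, p_i, p_j, lam):
    """Score s_ij = ln(q_ij / (p_i * p_j)) / lam."""
    return math.log(q_ij / (p_i * p_j)) / lam

# Hypothetical frequencies: the aligned pair occurs 8 times more often
# among related sequences (q_ij) than expected by chance (p_i * p_j).
q_ij, p_i, p_j = 0.02, 0.05, 0.05

bits = log_odds_score(q_ij, p_i, p_j, lam=math.log(2))           # scale: bits
half_bits = log_odds_score(q_ij, p_i, p_j, lam=math.log(2) / 2)  # scale: half-bits
```

Changing λ only rescales every score by the same factor, which is why it can be called arbitrary.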

The BLOSUM-62 matrix

(PNAS 89:10915-10919)

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V


Amino acid compositional bias

Some sources of bias:

Organismal bias

AT-rich genomes: tend to encode more of the amino acids FLINKYM (Phe, Leu, Ile, Asn, Lys, Tyr, Met)

GC-rich genomes: tend to encode more of the amino acids PRAWG (Pro, Arg, Ala, Trp, Gly)

Protein family bias

Transmembrane proteins: more hydrophobic residues

Cysteine-rich proteins: more cysteines than usual

Construction of an asymmetric log-odds substitution matrix

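Given a (not necessarily symmetric) set of target frequencies $q_{ij}$, define two sets of background frequencies $p_i$ and $p'_j$ as the marginal sums of the $q_{ij}$:

$$ p_i = \sum_j q_{ij}, \qquad p'_j = \sum_i q_{ij} $$

The substitution scores are then defined as

$$ s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i \, p'_j} $$

We call this matrix valid in the context of the $p_i$ and $p'_j$.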
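
The construction can be sketched in a few lines: take the row and column sums of the target frequencies as the two background distributions, then form the log-odds scores. The function name `asymmetric_log_odds` and the toy two-letter alphabet are illustrative assumptions, not part of the talk.

```python
import math

def asymmetric_log_odds(q, lam=1.0):
    """Return (p, p_prime, s) for a table of target frequencies q[i][j].

    p[i] is the row sum of q, p_prime[j] is the column sum, and
    s[i][j] = ln(q[i][j] / (p[i] * p_prime[j])) / lam.
    """
    n_rows, n_cols = len(q), len(q[0])
    p = [sum(q[i]) for i in range(n_rows)]
    p_prime = [sum(q[i][j] for i in range(n_rows)) for j in range(n_cols)]
    s = [[math.log(q[i][j] / (p[i] * p_prime[j])) / lam
          for j in range(n_cols)]
         for i in range(n_rows)]
    return p, p_prime, s

# Toy, deliberately asymmetric target frequencies (hypothetical values).
q = [[0.30, 0.10],
     [0.20, 0.40]]
p, p_prime, s = asymmetric_log_odds(q)

# Sanity check: with these backgrounds, sum_ij p_i p'_j exp(lam * s_ij) = 1.
total = sum(p[i] * p_prime[j] * math.exp(s[i][j])
            for i in range(2) for j in range(2))
```

Because the target frequencies are not symmetric, s[0][1] and s[1][0] differ, and the two background distributions differ as well.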

Substitution matrix validity theorem

A substitution matrix can be valid for only a unique set of target and background frequencies, except in certain degenerate cases. (Proof omitted.)

One can determine efficiently whether an arbitrary substitution matrix can be valid in some context and, if so, one can extract its unique target and background frequencies, and scale. (Proof and algorithms omitted.)
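
The algorithms referred to here are omitted from the slides and are not reproduced below. As a related and much simpler sketch, suppose the background frequencies are already known: the scale λ is then the positive solution of sum_ij p_i p'_j exp(λ s_ij) = 1, and the implied target frequencies are q_ij = p_i p'_j exp(λ s_ij). The function names, the use of bisection, and the toy data are assumptions made for illustration.

```python
import math

def solve_lambda(s, p, p_prime, tol=1e-12):
    """Positive root of sum_ij p_i p'_j exp(lam * s_ij) = 1, by bisection."""
    pairs = [(p[i] * p_prime[j], s[i][j])
             for i in range(len(p)) for j in range(len(p_prime))]
    expected = sum(w * x for w, x in pairs)
    if expected >= 0 or max(x for _, x in pairs) <= 0:
        raise ValueError("need a negative expected score and a positive score")

    def f(lam):
        return sum(w * math.exp(lam * x) for w, x in pairs) - 1.0

    hi = 1.0
    while f(hi) <= 0.0:   # grow until we are past the positive root
        hi *= 2.0
    lo = hi
    while f(lo) >= 0.0:   # shrink until we are between 0 and the root
        lo /= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def implied_targets(s, p, p_prime, lam):
    """Target frequencies q_ij = p_i p'_j exp(lam * s_ij)."""
    return [[p[i] * p_prime[j] * math.exp(lam * s[i][j])
             for j in range(len(p_prime))] for i in range(len(p))]

# Toy check: build scores from known q and lambda, then recover them.
q = [[0.4, 0.1], [0.1, 0.4]]
p = [0.5, 0.5]
true_lam = 0.5
s = [[math.log(q[i][j] / (p[i] * p[j])) / true_lam for j in range(2)]
     for i in range(2)]
lam = solve_lambda(s, p, p)
q_back = implied_targets(s, p, p, lam)
```

The harder problem stated in the theorem, recovering the background frequencies from the scores alone, is not attempted in this sketch.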


Choosing new target frequencies

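Given new sets of background frequencies $P_i$ and $P'_j$, how should one choose appropriate target frequencies $Q_{ij}$?

Consistency constraints:

$$ \sum_j Q_{ij} = P_i, \qquad \sum_i Q_{ij} = P'_j $$

Close to original $q_{ij}$: minimize the relative entropy of the $Q_{ij}$ with respect to the $q_{ij}$,

$$ D = \sum_{ij} Q_{ij} \ln \frac{Q_{ij}}{q_{ij}} $$

Sometimes, it is desirable to constrain the relative entropy $H$:

$$ H = \sum_{ij} Q_{ij} \ln \frac{Q_{ij}}{P_i \, P'_j} $$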
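
For the case with no constraint on H, one way to find target frequencies that match the new marginals while staying as close as possible (in relative entropy) to the original q_ij is iterative proportional fitting: alternately rescale rows and columns until both sets of sums agree. This is a swapped-in textbook technique, offered as a minimal sketch; it is not necessarily the numerical method the authors use, and it does not handle the constrained-H variants. The function name and the toy inputs are assumptions.

```python
def fit_marginals(q, P, P_prime, iterations=500):
    """Rescale rows and columns of q until row sums match P and
    column sums match P_prime (iterative proportional fitting)."""
    Q = [row[:] for row in q]          # work on a copy
    n, m = len(Q), len(Q[0])
    for _ in range(iterations):
        for i in range(n):             # match row sums to P[i]
            scale = P[i] / sum(Q[i])
            Q[i] = [x * scale for x in Q[i]]
        for j in range(m):             # match column sums to P_prime[j]
            scale = P_prime[j] / sum(Q[i][j] for i in range(n))
            for i in range(n):
                Q[i][j] *= scale
    return Q

# Toy example: symmetric starting targets, new asymmetric backgrounds.
q = [[0.4, 0.1],
     [0.1, 0.4]]
Q = fit_marginals(q, P=[0.7, 0.3], P_prime=[0.6, 0.4])

row_sums = [sum(row) for row in Q]
col_sums = [Q[0][j] + Q[1][j] for j in range(2)]
```

The result has the form Q_ij = a_i q_ij b_j, so ratios such as Q_00 Q_11 / (Q_01 Q_10) are the same as in the original q.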

Substitution matrices compared

Mode A: Standard BLOSUM-62 matrix.

Mode B: Composition-adjusted matrix; no constraint on relative entropy (H).

Mode C: Composition-adjusted matrix; H constrained to equal a constant (0.44 nats).

Mode D: Composition-adjusted matrix; H constrained to equal that of the standard matrix in the new compositional context.
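
Modes C and D both refer to the relative entropy H of a matrix. Assuming the usual definition H = sum_ij Q_ij ln(Q_ij / (P_i P'_j)), with P and P' the marginals of Q, the quantity can be computed as in the sketch below. The function name and the toy frequencies are illustrative; the conversion at the end only re-expresses the 0.44 nats quoted for mode C in bits.

```python
import math

def relative_entropy(Q):
    """H = sum_ij Q_ij * ln(Q_ij / (P_i * P'_j)) in nats,
    where P and P' are the row and column sums of Q."""
    n, m = len(Q), len(Q[0])
    P = [sum(Q[i]) for i in range(n)]
    P_prime = [sum(Q[i][j] for i in range(n)) for j in range(m)]
    H = 0.0
    for i in range(n):
        for j in range(m):
            if Q[i][j] > 0:            # 0 * ln(0) is taken as 0
                H += Q[i][j] * math.log(Q[i][j] / (P[i] * P_prime[j]))
    return H

# Independent frequencies carry no information: H is (numerically) zero.
H_independent = relative_entropy([[0.18, 0.12],
                                  [0.42, 0.28]])

# Toy correlated frequencies give a positive H.
H_toy = relative_entropy([[0.4, 0.1],
                          [0.1, 0.4]])

# Unit conversion: nats to bits.
mode_c_bits = 0.44 / math.log(2)
```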


BLOSUM-62 and sequence-specific background frequencies (%)

Amino                   P. falciparum   M. tuberculosis
acid       BLOSUM-62    #16805184       #15607948
-----      ---------    -------------   ---------------
A             7.4           4.8             13.9
R             5.2           4.1              7.4
N             4.5           8.9              2.8
D             5.3           5.6              5.9
C             2.5           2.1              1.9
Q             3.4           3.0              3.6
E             5.4           7.0              6.1
G             7.4           6.2              9.5
H             2.6           3.1              1.7
I             6.8           9.0              4.4
L             9.9           8.2              9.3
K             5.8           8.2              1.9
M             2.5           1.3              1.5
F             4.7           5.1              2.5
P             3.9           3.8              5.3
S             5.7           7.4              4.4
T             5.1           2.3              5.7
W             1.3           1.0              0.8
Y             3.2           4.6              2.8
V             7.3           4.4              8.7


Difference between a scaled, standard BLOSUM-62 and a compositionally adjusted BLOSUM-62 (P. falciparum)

A  -15  -55 -116  -76  -45  -60  -73  -23  -98  -54  -45  -77  -39  -92  -34  -52  -31  -79 -102  -34
R    4   -9  -83  -43  -20  -26  -40    2  -66  -27  -16  -40   -9  -61   -5  -25   -3  -49  -71   -8
N   48   22  -26    6   24   16    3   50  -20   15   25   -2   33  -19   38   21   44   -8  -28   34
D   19   -8  -62  -12   -5  -12  -20   20  -51  -11   -3  -30    4  -47   12   -8   13  -37  -58    6
C   21  -14  -74  -34   19  -20  -35   15  -57  -10   -1  -37    5  -47    6  -11   12  -35  -59    9
Q   23   -2  -65  -24   -3    1  -19   21  -46   -9    1  -24   11  -45   14   -6   15  -30  -54    8
E   22   -5  -66  -20   -6   -7  -14   19  -48  -12   -1  -27    6  -46   12   -8   14  -34  -56    7
G  -25  -59 -115  -77  -52  -64  -77  -14 -102  -61  -51  -80  -45  -94  -38  -57  -37  -82 -106  -42
H   54   27  -31    6   30   23    9   52    2   21   33    4   40   -9   44   25   46    1  -14   39
I   26   -6  -67  -25    5  -11  -26   21  -50    9   13  -28   17  -35   14   -7   20  -28  -50   23
L   23   -7  -70  -29    2  -13  -27   18  -51    1   15  -31   17  -35   12  -10   16  -29  -51   16
K   43   20  -45   -5   17   13   -2   41  -29   11   20    2   28  -25   33   14   36  -13  -34   29
M   30    1  -62  -23    8   -4  -20   25  -44    5   18  -23   31  -31   18   -2   23  -22  -46   22
F   62   34  -29   12   41   26   13   61   -7   39   51    9   55   17   52   32   56   19   -1   55
P  -31  -62 -123  -80  -56  -66  -80  -34 -106  -64  -54  -84  -48  -99  -23  -61  -39  -88 -110  -44
S   19  -14  -72  -32   -6  -18  -32   15  -57  -17   -8  -35    0  -51    7   -7   12  -41  -62    2
T   11  -21  -78  -41  -12  -26  -39    6  -65  -19  -11  -42   -3  -57    0  -17   12  -46  -67   -1
W   60   31  -32    7   39   26   10   59  -12   31   42    6   49    4   48   28   52   37   -5   47
Y   52   23  -38    0   29   17    2   48  -13   23   34   -1   39   -1   40   20   44    9   -5   41
V   13  -21  -83  -42  -10  -28  -41    6  -67  -11   -6  -44    0  -52   -1  -22    4  -46  -65    9
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V

Entries shown: score of the standard matrix subtracted from that of the adjusted one.


Optimal alignments implied by modes A and D

Mode A: 29.7 bits (H = 0.51 nats)

Mode D: 31.8 bits (H = 0.51 nats)

Mode C: 33.1 bits (H = 0.44 nats)

Empirical rules for invoking compositional adjustment when comparing two sequences

1: The length ratio of the longer to the shorter sequence is less than 3.

One metric definition of distance between two composition vectors

(IEEE Trans. Info. Theo. 49:1858-1860)
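
The formula on this slide did not survive in the transcript. The sketch below assumes the distance is the square root of the Jensen-Shannon divergence, which is the kind of metric the cited IEEE paper is about; the exact normalization used in the talk (natural logarithm, factor of 1/2) is an assumption here, so values from this function should not be compared against thresholds from the talk without checking that definition.

```python
import math

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two
    composition vectors, using natural logarithms (assumed convention)."""
    js = 0.0
    for a, b in zip(p, q):
        m = 0.5 * (a + b)              # mixture of the two compositions
        if a > 0:
            js += 0.5 * a * math.log(a / m)
        if b > 0:
            js += 0.5 * b * math.log(b / m)
    return math.sqrt(max(js, 0.0))     # guard against tiny negative rounding

# Toy compositions over a two-letter alphabet (hypothetical values).
d_same = js_distance([0.2, 0.8], [0.2, 0.8])
d_disjoint = js_distance([1.0, 0.0], [0.0, 1.0])
d_ab = js_distance([0.3, 0.7], [0.6, 0.4])
d_ba = js_distance([0.6, 0.4], [0.3, 0.7])
```

Under this convention the distance is zero for identical compositions, symmetric in its arguments, and at most sqrt(ln 2) for compositions with no letters in common.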

2: The distance d between the compositions of the two sequences is less than 0.16.

Law of cosines

In a triangle with sides of length a, b and c, the angle θ opposite the side of length c is

$$ \theta = \arccos \frac{a^2 + b^2 - c^2}{2ab} $$

3: The angle θ made by the compositions of the two sequences with the standard composition is less than 70°.
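
The three rules can be evaluated as in the sketch below. It takes the pairwise distances as inputs and reads the angle as the one at the standard composition, in the triangle formed by the two sequence compositions and the standard composition, computed with the law of cosines. That reading, the function names, and the example numbers are assumptions. The slides do not say how the rules are combined, so the sketch reports each rule separately and leaves the combination to the caller.

```python
import math

def angle_at_standard(d_p_std, d_q_std, d_p_q):
    """Angle in degrees opposite the side d_p_q, by the law of cosines.

    d_p_std and d_q_std are the distances from each sequence composition
    to the standard composition; d_p_q is the distance between the two.
    """
    cos_theta = (d_p_std ** 2 + d_q_std ** 2 - d_p_q ** 2) / (2.0 * d_p_std * d_q_std)
    cos_theta = max(-1.0, min(1.0, cos_theta))   # guard against rounding
    return math.degrees(math.acos(cos_theta))

def check_rules(len_a, len_b, d_p_q, d_p_std, d_q_std):
    """Evaluate the three empirical rules; returns one boolean per rule."""
    ratio = max(len_a, len_b) / min(len_a, len_b)
    theta = angle_at_standard(d_p_std, d_q_std, d_p_q)
    return {
        "length_ratio_lt_3": ratio < 3.0,
        "distance_lt_0.16": d_p_q < 0.16,
        "angle_lt_70_deg": theta < 70.0,
    }

# Hypothetical inputs: similar lengths, similar compositions.
close_pair = check_rules(300, 120, d_p_q=0.10, d_p_std=0.20, d_q_std=0.20)

# Hypothetical inputs: very different lengths and compositions.
far_pair = check_rules(1000, 100, d_p_q=0.30, d_p_std=0.20, d_q_std=0.20)
```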

Future directions
  • Possible less extensive use of SEG when compositional adjustment is invoked.
  • Application to PSI-BLAST.