Sequence analysis of nucleic acids and proteins: part 1

Sequence analysis of nucleic acids and proteins: part 1 Similarity search Based on Chapter 3 of Post-genome Bioinformatics by Minoru Kanehisa, Oxford University Press, 2000

Search and learning problems in sequence analysis

A comparison of the homology search and the motif search for functional interpretation of sequence information. Homology Search Motif Search New sequence New sequence Knowledge acquisition Motif library (Empirical rules) Sequence database (Primary data) Retrieval Similar sequence Inference Expert knowledge Expert knowledge Sequence interpretation Sequence interpretation

Pairwise sequence alignment by the dynamic programming algorithm. The algorithm involves finding the optimal path in the path matrix. (a), which is equivalent to searching the optimal solution in the search tree (b). (a) Path Matrix(b) Search Tree A I M S A M O S X X . . . . . . . . . . . . . . Alignment AIM-S A-MOS Pruning by an optimization function

Di, j-l Di, j-l Di-1, j-1 Di-1, j Di-1, j Methods for computing the optimal score in the dynamic programming algorithm (a ) the gap penalty is a constant. (b) the gap penalty is a linear function of the gap length. (a) (b) Di-1, j-1 d ws(i), t(j) d b Di, j(2) Di,j ws(i), t(j) b Di,j(1) Di,j(3)

0 0 . . . . . . 0 . . . . 0 Concepts of global and local optimality in the pairwise sequence alignment. The distinction is made as to how the initial values are assigned to the path matrix. (a) Global vs. Global (b) Local vs. Global 0 0 . . . . . . 0 0 (c) Local vs. Local 0 0 . . . . . . 0 . . . . 0 X

The order of computing matrix elements in the path matrix, which is suitable for (a) sequential processing and (b) parallel processing. (a) (i -1, j -1) (I, j -1) (i +1, j-1) (i -1, j ) (i, j) (i +1, j ) (b) (i, j -2) (i+1, j -2) (i -1, j -1) (i, j -1) (i +1, j -1) (i, j) (i -1, j )

The dynamic programming algorithm can be applied to limited areas, rather than to the entire matrix, after rapidly searching the diagonals that contain candidate markers. i 1 n 1 1 j l m m n +m -1 l

The hashing technique for rapid sequence comparison. In this case the horizontal sequence is converted to a hash table, which contains the locations of the four nucleotides. Query Sequence Hash Table A T C A C A C G G C T A T C G C A G T C A A T T C . . Key Address A 1 4 6 C 3 5 7 10 G 8 9 T 2 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * Used in FASTA

An example of the finite state automaton for pattern matching C Q2 B A A B Q1 C A Q0 C B C Q4 A B B A Q3 C Bold arrows lead to ouputs indicating patterns have been found Used in BLAST

The tree-based progressive method for multiple sequence alignment, which utilizes: (a) a dendrogram obtained by cluster analysis and (b) group alignment for pairwise comparison of groups of sequences. (a) L W R D G R G A L Q L W R G G R G A A Q D W R - G R T A S G (b) DEHUG3 DEPGG3 DEBYG3 DEZYG3 DEBSGF L R R - A R T A S A L - R G A R A A A E

A A B B A A B B D C C C C Possible tree topologies in the phylogenetic analysis of: (a) three sequences or (b) four sequences. Filled circles represent extant sequences, while open circles represent common ancestors. (a) D D

Simulated annealing and Metropolis Monte Carlo methods are based on the concept of thermal fluctuations in the energy functions. DE = E (x’n) - E (x n) E 1 When DE £ 0 p = exp(-DElTn ) When DE > 0 x

Dynamic programming to find edit distances - Edit operation: M, R, I, D - Edit transcript: A string over the alphabet M, R, I, D that describes a transformation of one string into another. Example: R D I M D M M A - T H S A - R T - S - Edit (Levens(h)tein) distance: The minimum number of edit operations necessary to transform one string into another. (Note: matches are not counted.) Example: R D I M D M 1+ 1+ 1+ 0+ 1+ 0 = 4

The recurrence - Stage: position in the edit transcript; - State: I, D, M, or R; - Optimal value function: D(i, j) where D(i, j) = edit distance of Seq1[1...i] and Seq2[1...j] - Recurrence relation: 1 +D(i-1, j) D(i, j) = min 1 +D(i, j-1) t(i, j) +D(i-1, j-1) , where t(i, j) = {

The tabulation , D(i, j) Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 M 1 A 2 T 3 H 4 S 5

The tabulation , D(i, j) Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 M 1 A 2 T 3 H 4 S 5

The tabulation , D(i, j) Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 M 1 A 2 T 3 H 4 S 5

The tabulation , D(i, j) Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2 M 1 A 2 T 3 H 4 S 5

The tabulation , D(i, j) Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2 3 4 M 1 1 A 2 2 T 3 3 H 4 4 S 5 5

The tabulation , D(i, j) Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2 3 4 M 1 1 1 A 2 2 T 3 3 H 4 4 S 5 5

The tabulation , D(i, j) Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2 3 4 M 1 1 1 2 A 2 2 T 3 3 H 4 4 S 5 5

The tabulation , D(i, j) Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2 3 4 M 1 1 1 2 3 4 A 2 2 1 2 3 4 T 3 3 H 4 4 S 5 5

The tabulation , D(i, j) Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2 3 4 M 1 1 1 2 3 4 A 2 2 1 2 3 4 T 3 3 2 2 2 3 H 4 4 S 5 5

The tabulation , D(i, j) Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2 3 4 M 1 1 1 2 3 4 A 2 2 1 2 3 4 T 3 3 2 2 2 3 H 4 4 3 3 3 3 S 5 5 4 4 4 3

The traceback Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2 3 4 M 1 1 1 2 3 4 A 2 2 1 2 3 4 T 3 3 2 2 2 3 H 4 4 3 3 3 3 S 5 5 4 4 4 3

The solutions - #1 1 0 1 1 0 = 3 D M R R M M A T H S - A R T S

The solutions - #2 1 0 1 0 1 0 = 3 D M I M D M M A - T H S - A R T - S

The solutions - #3 1 1 0 1 0 = 3 R R M D M M A T H S A R T - S “Life must be lived forwards and understood backwards.” - Søren Kierkegaard

BLOSUM62 SCORING MATRIX 134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI | ||| | | |||||| | || || 137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM D:D = +6 D:R = -2 From Henikoff 1996

Scoring Matrices • Physical/Chemical similarities • comparing two sequences according to the properties of their residues may highlight regions of structural similarity • Identity matrices • by stressing only identities in the alignment, stretches of sequence that may have diverged will not penalise any remaining common features

Scoring Matrices (ctd) • As the direct source of residue by residue comparison scores the scoring matrix you choose will have a major impact on the alignment calculated • The most commonly used will be one of the mutation matrices • PAM, BLOSUM • The matrix that performs best will be the matrix that reflects the evolutionary separation of the sequences being aligned

Some probabilities of observations depend on unknown parameters. E.g. if O = SFFSFFF then under independence pr(O) = p2(1-p)5. We can calculate this for any observation O, so in a sense we have a 2-variable function pr(O,p) or pr(O|p) depending on O and p (0< p <1). Likelihood: holds O fixed, varies p. Maximum Likelihood estimate: the p which maximizes pr(O,p), O fixed, denoted . E.g. above, = 2/7. Probability and Likelihood ^p ^p

AGCTGATCA... AACCGGTTA... Alignment: Hypotheses: H = homologous (indep. sites, Jukes-Cantor) R = random (indep. sites, equal freq.) Statistical motivation for alignment scores pr(data|H) = pr( |H) = pr( |H) x ... = (1-p)apd d = # disagreements, a = # agreements, p = (1-e-8at) pr(data|R) = pr( |R) = pr( |R) x ... = ( )a( )d = a x log + d x log . Since p < , log <0, log >0 score = a x s + d x (-m) s >0 match score, -m <0 mismatch penalty Note that if at 0, p 6at, 1-p 1 and so s log4, while -m log8at is large and negative: a big difference in the two scores. Conversely, if at is large, p = (1-e), = 1-e, and m = log(1-e) -e, while 1-p = (1+3e), = 1+3e, and so s = log(1+3e) 3e. Thus the scores are about 3:1. AA GA AA GA

a gap free alignment of two a.a. sequence fragments a1 ..... am b1 ..... bm data = We can do the same with any other Markov substitution matrix for molecular evolution. E.g. with a PAM or BLOSUM matrix of probabilities, m P 1 P i pr(data|H) = paipaibi(2t) pr(data|R) = paipbi log{} = log{ } paibi(2t)/ pbi S i The elements of a log-odds score matrix are typically > 0 on the diagonal and < 0 off the diagonal, but not always. Also the relative sizes of match and mismatch penalties increase as #PAMs (at) decreases. Thus PAM(120) is more stringent than PAM(250), while PAM(360) is less stringent than it. PAM(0) = the identity matrix is the toughest. There are plenty of score matrices based on other principles.

Local alignment aligns only the most similar regions of two sequences Why? Often distantly related proteins have only isolated regions (e.g. active sites) of similarity. The modular nature of proteins How? The dynamic programming algorithm we have seen needs only a minor modification to yield the best local alignment between two sequences. It is called the Smith-Waterman algorithm, and is named bestfit in GCG.

Similar Amino Acid Sequences: Chance or Common Ancestry? The usual caveats: Title of paper by Russell F. Doolittle, Science 214 (1981)1 The question arises every time an alignment is done without prior knowledge of homology. • the scientific goal is not necessarily the same as the mathematical/statistical goal • significance may not mean homology • non-significance may not mean non-homology

Early use of statistics • Generate random permutations of the sequence(s) • Obtain the average (av) and standard deviation (SD) of the random similarity scores • Compute z=(observed score - av)/SD • Think normal (e.g. 4 is a very large z) • This approach is still used for global alignments, but is no longer seen as appropriate for local alignments, since the score is optimized, and random optimal scores do not follow the normal law.

More recent statistical developments: Theory developed by Karlin and collaborators in 1990-4 and, independently, by Waterman and collaborators in 1988-94. Incorporates the fact that the score has been optimized. Immediately implemented in BLAST. Later appears in a similar form in FASTA and elsewhere.

The theory applies to the ensemble of random • pairs of sequences, with fixed • possibly different lengths, • possibly different residue distributions • and ungapped alignments • (extensions to ungapped alignments coming now)

The theoretical distribution of random similarity scores • is universal in form (see diagram) • with scale parameter depending on the two residue distributions, and the substitution scores used • and location parameter depending on the above, plus the lengths of the two sequences

For m, n large, the optimal random score S has the extreme-value distribution with cdf exp{-exp{-l(s-u)}} where l is the unique positive solution (in t) of Sijpiqjexp(sijt)=1, and u = log(Kmn) and K is given by a series depending on the compositions (pi) and (qj) and the scoring matrix (sij).

Databases searches: why do them? To find exact matches to sequences To find homologous sequences To infer structure and/or function of new protein sequences To locate genes in ESTs or genomic sequences To discover gene structure in DNA sequence And much more...

Database searching Compares a query sequence to each sequence in a database (also called a library). Because of the large size of sequence databases, comparisons are generally carried out using faster heuristic approximations to, rather than the exact Smith-Waterman localalignment algorithm. The two most common of these are FASTA and BLAST, where each of these names corresponds to a family of algorithms used in different contexts.

BLAST variants for different searchesa (after S. Brenner, Trends Guide to Bioinformatics, 1998) aSimilar variant programs are available for FASTA. Protein-level searches of DNA sequences are performed by comparing translations of all six reading frames.

cDNA, ORFs and ESTs • Complementary DNA (cDNA) • Single stranded DNA complementary to an RNA, from which synthesized by reverse transcription. • Open reading frames (ORFs) • Contains a series of triplets coding for amino acids without any termination codons (potentially translatable into proteins) • Many derived from sequencing of cDNAs • Expressed sequence tags (ESTs) • Short (300-500 bp) single reads from mRNA (cDNA) sequencing survey projects. • A snapshot of what is expressed in a given tissue at a given developmental stage.

Sequence analysis of nucleic acids and proteins: part 1