Definitions

# Definitions

##### Presentation Transcript

1. Definitions Optimal alignment - one that exhibits the most correspondences. It is the alignment with the highest score. May or may not be biologically meaningful. Global alignment - Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences. Local alignment - Smith-Waterman (1981) gives the highest scoring local match between two sequences.

2. Pairwise Global Alignment • Global alignment - Needleman-Wunsch (1970) • Maximizes the number of matches between the sequences along the entire length of the sequences. • Reasons for making a global alignment: • Checking minor differences between two sequences • Analyzing polymorphisms (e.g., SNPs) between closely related sequences • …

3. Pairwise Global Alignment • Computationally: • Given: a pair of sequences (strings of characters) • Output: an alignment that maximizes the similarity

4. How can we find an optimal alignment? • Example alignment (positions 1 to 27):

    ACGTCTGATACGCCGTATAGTCTATCT
    CTGAT---TCG-CATCGTC--T-ATCT

• How many possible alignments? C(27,7) gap positions = ~888,000 possibilities • Dynamic programming: The Needleman & Wunsch algorithm
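The figure of roughly 888,000 can be checked directly: it is the number of ways to choose which 7 of the 27 alignment columns hold a gap in the shorter (20-residue) sequence. A minimal sketch, assuming Python 3.8+ for `math.comb`; the helper name `gap_placements` is made up for illustration.

```python
from math import comb


def gap_placements(columns: int, gaps: int) -> int:
    """Count the ways to choose which alignment columns contain a gap."""
    return comb(columns, gaps)


# 27 columns, 7 of them gaps in the shorter sequence
print(gap_placements(27, 7))  # 888030
```

Even this count only covers gaps placed in one sequence, which is why exhaustive enumeration is impractical and dynamic programming is used instead.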

5. Time Complexity • Consider two sequences: AAGT and AGTC • How many possible alignments do the two sequences have? • $\binom{2n}{n} = \frac{(2n)!}{(n!)^2} = \Theta\left(\frac{2^{2n}}{\sqrt{n}}\right) = \Omega(2^n)$
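The exponential growth of the central binomial coefficient $\binom{2n}{n}$ follows from Stirling's approximation. The derivation below is a sketch added for illustration, assuming both sequences have the same length $n$:

```latex
\begin{align*}
n! &\sim \sqrt{2\pi n}\,\left(\frac{n}{e}\right)^{n} \\
\binom{2n}{n} = \frac{(2n)!}{(n!)^2}
  &\sim \frac{\sqrt{4\pi n}\,\left(\frac{2n}{e}\right)^{2n}}
             {2\pi n\,\left(\frac{n}{e}\right)^{2n}}
  = \frac{2^{2n}}{\sqrt{\pi n}}
\end{align*}
```

For $n = 4$ the exact value is $\binom{8}{4} = 70$, while the approximation gives $256/\sqrt{4\pi} \approx 72$.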

6. Scoring a sequence alignment • Match/mismatch score: +1/0 • Gap open/extension penalty: -2/-1

    ACGTCTGATACGCCGTATAGTCTATCT
        ||||| |||   || ||||||||
    ----CTGATTCGC---ATCGTCTATCT

• Matches: 18 × (+1) • Mismatches: 2 × 0 • Open: 2 × (-2) • Extension: 5 × (-1) • Score = 18 + 0 - 4 - 5 = +9
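The tally above can be reproduced with a short scoring routine. This is a minimal sketch, assuming the convention the slide's numbers imply: the first position of each gap run costs the open penalty and every further position in the run costs the extension penalty. The function name `score_alignment` and its parameters are made up for illustration, and columns with a gap in both rows are not handled.

```python
def score_alignment(a, b, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score two already-aligned strings of equal length ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    score = 0
    prev_gap = None  # which row ('a' or 'b') had a gap in the previous column
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            cur = "a" if x == "-" else "b"
            # Continuing a gap run in the same row costs the extension penalty;
            # starting a new run costs the open penalty.
            score += gap_extend if prev_gap == cur else gap_open
            prev_gap = cur
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # 9
```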

7. Pairwise Global Alignment • Computationally: • Given: a pair of sequences (strings of characters) • Output: an alignment that maximizes the similarity

8. Needleman & Wunsch • Place each sequence along one axis • Place score 0 at the upper-left corner • Fill in the first row and column with multiples of the gap penalty • Fill in the rest of the matrix with the max value of 3 possible moves: • Vertical move: Score + gap penalty • Horizontal move: Score + gap penalty • Diagonal move: Score + match/mismatch score • The optimal alignment score is in the lower-right corner • To reconstruct the optimal alignment, trace back where the max at each step came from, stopping when the origin is reached.

9. Example • Let gap = -2, match = 1, mismatch = -1.

         empty    A    A    A    C
    empty    0   -2   -4   -6   -8
    A       -2    1   -1   -3   -5
    G       -4   -1    0   -2   -4
    C       -6   -3   -2   -1   -1

• Alignments with score -1:

    AAAC      AAAC
    A-GC      -AGC
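The fill and traceback steps can be sketched in a few lines. This is an illustrative implementation with a linear gap penalty, not the deck's own code; the function name `needleman_wunsch` and the tie-breaking order during traceback (diagonal first, then vertical, then horizontal) are assumptions, so it reports only one of the equally good alignments.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment; s indexes rows, t indexes columns.

    Returns (score matrix, aligned s, aligned t).
    """
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and row hold multiples of the gap penalty.
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    # Each cell takes the best of the diagonal, vertical and horizontal moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Trace back from the lower-right corner to the origin.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return F, "".join(reversed(a)), "".join(reversed(b))


F, top, bottom = needleman_wunsch("AGC", "AAAC")
print(F[-1])        # [-6, -3, -2, -1, -1]
print(top, bottom)  # -AGC AAAC
```

With AGC on the rows and AAAC on the columns, the bottom row of the matrix ends in the optimal score -1, and the traceback recovers the -AGC / AAAC alignment.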

10. Time Complexity Analysis • Initialize matrix values: O(n), O(m) • Filling in rest of matrix: O(nm) • Traceback: O(n+m) • If the strings are the same length, total time is O(n²)

11. Local Alignment • Problem first formulated by: • Smith and Waterman (1981) • Problem: • Find an optimal alignment between a substring of s and a substring of t • Algorithm: • A variant of the basic algorithm for global alignment

12. Motivation • Searching for unknown domains or motifs within proteins from different families • Proteins encoded by homeobox genes (conserved only in one region, the homeodomain, 60 amino acids long) • Identifying active sites of enzymes • Comparing long stretches of anonymous DNA • Querying databases where the query is much shorter than the sequences in the database • Analyzing repeated elements within a single sequence

13. Local Alignment • Sequences: GATCACCT and GATACCC • Let gap = -2, match = 1, mismatch = -1.

         empty    G    A    T    C    A    C    C    T
    empty    0    0    0    0    0    0    0    0    0
    G        0    1    0    0    0    0    0    0    0
    A        0    0    2    0    0    1    0    0    0
    T        0    0    0    3    1    0    0    0    1
    A        0    0    1    1    2    2    0    0    0
    C        0    0    0    0    2    1    3    1    0
    C        0    0    0    0    1    1    2    4    2
    C        0    0    0    0    1    0    2    3    3

• Alignment:

    GATCACCT
    GAT-ACCC

14. Smith & Waterman • Place each sequence along one axis • Place score 0 at the upper-left corner • Fill in the first row and column with 0s • Fill in the rest of the matrix with the max of 4 possible values: • 0 • Vertical move: Score + gap penalty • Horizontal move: Score + gap penalty • Diagonal move: Score + match/mismatch score • The optimal alignment score is the max in the matrix • To reconstruct the optimal alignment, start at the cell holding that max and trace back where the max at each step came from, stopping when a zero is reached
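The procedure can be sketched as follows. This is an illustrative implementation with a linear gap penalty rather than the deck's own code; the name `smith_waterman`, the choice of the first maximal cell encountered, and the diagonal-first tie-breaking in the traceback are assumptions.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Local alignment; s indexes rows, t indexes columns.

    Returns (best score, aligned part of s, aligned part of t).
    """
    n, m = len(s), len(t)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j
    # Trace back from the maximal cell until a zero is reached.
    a, b = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(smith_waterman("GATACCC", "GATCACCT"))  # (4, 'GAT-ACC', 'GATCACC')
```

For GATACCC against GATCACCT with gap = -2, match = 1, mismatch = -1, the best local score is 4: six matches and one gap.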

15. Exercise • Let gap = -2, match = 1, mismatch = -1. • Find the best local alignment: CGATGAAATGGA

16. Semi-global Alignment • Example:

    CAGCA-CTTGGATTCTCGG
    ---CAGCGTGG--------

    CAGCACTTGGATTCTCGG
    CAGC----G--T----GG

• We like the first alignment much better. • In semiglobal comparison, we score the alignments ignoring some of the end spaces.

17. Global Alignment • Example:

    AAACCC
    A--CCC

• Prefer to see:

    AAACCC
      ACCC

• Do not want to penalize the end spaces

18. SemiGlobal Alignment • Example: s = AAACCC, t = ACCC

19. SemiGlobal Alignment • Example: s = AAACCCG, t = ACCC • [Score matrix figure; surviving values: G 0 -1 -2 -1 2]

20. SemiGlobal Alignment • Summary of end space charging procedures:
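One common way to stop charging for end spaces, given here as an assumption rather than as the deck's own procedure, is to initialize the first row and column with zeros (free leading spaces) and to take the maximum over the last row and last column instead of the lower-right corner (free trailing spaces). The sketch below frees all four ends at once; the function name `alignment_score` and the `free_ends` flag are made up for illustration, and the scoring values reuse the gap = -2, match = 1, mismatch = -1 settings from the earlier examples.

```python
def alignment_score(s, t, free_ends=False, match=1, mismatch=-1, gap=-2):
    """Global alignment score, optionally without charging for end spaces."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    if not free_ends:
        # Ordinary global alignment: leading spaces are charged.
        for i in range(1, n + 1):
            F[i][0] = i * gap
        for j in range(1, m + 1):
            F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + gap, F[i][j - 1] + gap)
    if free_ends:
        # Trailing spaces are free: best value in the last row or last column.
        return max(max(F[n]), max(row[m] for row in F))
    return F[n][m]


print(alignment_score("AAACCC", "ACCC"))                  # 0
print(alignment_score("AAACCC", "ACCC", free_ends=True))  # 4
```

For s = AAACCC and t = ACCC the global score is 0 (four matches, two charged spaces), while the semi-global score is 4 because the two leading spaces in t cost nothing.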

21. Pairwise Sequence Comparison over the Internet • Bioinformatics for Dummies

22. Significance of Sequence Alignment • Consider randomly generated sequences. What distribution do you think the best local alignment score of two sequences of the same length should follow? • Uniform distribution • Normal distribution • Binomial distribution (n Bernoulli trials) • Poisson distribution (n → ∞, np = λ) • Others

23. Extreme Value Distribution • $Y_{ev} = \exp(-x - e^{-x})$
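The density above is the standard extreme value (Gumbel) distribution with location 0 and scale 1. A small sketch, with the function names `gumbel_pdf` and `gumbel_cdf` chosen for illustration, evaluates it together with its cumulative form $\exp(-e^{-x})$ and shows the asymmetry that separates it from a normal curve:

```python
import math


def gumbel_pdf(x: float) -> float:
    """Standard extreme value density: exp(-x - exp(-x))."""
    return math.exp(-x - math.exp(-x))


def gumbel_cdf(x: float) -> float:
    """Cumulative distribution: exp(-exp(-x))."""
    return math.exp(-math.exp(-x))


# The peak is at x = 0; the right tail decays far more slowly than the left.
for x in (-3.0, 0.0, 3.0):
    print(x, gumbel_pdf(x), gumbel_cdf(x))
```

The density at x = 3 is several orders of magnitude larger than at x = -3, so unusually high scores are more likely under this distribution than a symmetric bell curve would suggest.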

24. Extreme Value Distribution vs. Normal Distribution

25. “Twilight Zone” • Some proteins with less than 15% similarity have exactly the same 3-D structure, while some proteins with 20% similarity have different structures. • Neither homology nor non-homology is guaranteed in the twilight zone.