
Multiple Sequence Alignment

Multiple Sequence Alignment. Sushmita Roy, BMI/CS 576, www.biostat.wisc.edu/bmi576/, sroy@biostat.wisc.edu, Sep 19th, 2013. Key concepts: The Multiple Sequence Alignment (MSA) Problem; The need for MSA; Scoring Multiple Sequence Alignments; Heuristic Algorithms for MSA; Progressive alignment


Presentation Transcript


  1. Multiple Sequence Alignment Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ sroy@biostat.wisc.edu Sep 19th, 2013

  2. Key concepts • The Multiple Sequence Alignment (MSA) Problem • The need for MSA • Scoring Multiple Sequence Alignments • Heuristic Algorithms for MSA • Progressive alignment • Star and tree alignments • Iterative alignment

  3. Readings • Durbin Chapter 6 • 6.1 • 6.2 • 6.3: ClustalW • 6.4

  4. What is multiple sequence alignment? The task of locating equivalent regions of more than two sequences so as to maximize their overall similarity.

  5. Why multiple sequence alignment? • Build phylogenetic trees (next module) • Determine evolutionary relationships between sequences • A multiple sequence alignment can represent a family of proteins with similar function • Compare new sequence to a “family” of known proteins • For example the BLOCKS database used for BLOSUM contains several ungapped alignments for known protein families • Discover common signatures or motifs among a group of proteins • Identify genetic variation among individuals of a population

  6. The tasks in Multiple Sequence Alignment • Scoring an alignment • Creating an alignment • Assessing significance

  7. Some notation • Let $m$ denote a multiple sequence alignment • $m_i$ is the $i$th column of the alignment $m$ • $m_{ij}$ is the entry in the $i$th column and $j$th row • $c_{ia}$ is the count of residue $a$ in column $i$

  8. Example using notation
       G A R F I E L D T H E F A T C A T
       G A R F I E L D T H E _ _ _ C A T
       G A R F I E L D T H A T _ _ C A T
       G A R R Y _ L I K E D A _ _ C A T
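The notation can be made concrete with a short Python sketch. The helper names `column` and `residue_counts` are hypothetical, columns are indexed from 0 here, and gaps are assumed not to count as residues.

```python
from collections import Counter

# Rows of the example alignment from the slide; "_" marks a gap.
alignment = [
    "GARFIELDTHEFATCAT",
    "GARFIELDTHE___CAT",
    "GARFIELDTHAT__CAT",
    "GARRY_LIKEDA__CAT",
]

def column(m, i):
    """Return column m_i (0-based i) as a string, one character per row."""
    return "".join(row[i] for row in m)

def residue_counts(m, i, gap="_"):
    """Return the counts c_ia for column i, ignoring gap characters."""
    return Counter(ch for ch in column(m, i) if ch != gap)

print(column(alignment, 3))          # the fourth column of the alignment
print(residue_counts(alignment, 3))  # how often each residue occurs in it
```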

  9. Scoring a Multiple Sequence Alignment (MSA) • Key issue: how do we score a multiple sequence alignment? • Usually, we assume that columns of an alignment are independent • We will simplify the score by assuming a linear gap penalty for now • $S(m) = G + \sum_i S(m_i)$, where $G$ is the gap function and $S(m_i)$ is the score of the $i$th column

  10. Two ways to score are used • Entropy based scores • Sum of pairs

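  11. Scoring an alignment: Entropy based score • Assume all columns are independent, and all rows per column are independent: $P(m_i) = \prod_a p_{ia}^{c_{ia}}$, where $p_{ia}$ is the probability of residue $a$ in column $i$ • Take a log on both sides: $\log P(m_i) = \sum_a c_{ia} \log p_{ia}$ • Allows one to have a position-specific scoring model (PSSM) • We’ll see PSSMs again.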

  12. Score of a column • $S(m_i) = -\sum_a c_{ia} \log p_{ia}$ • This is also a measure of entropy, or uncertainty in the distribution of residues in a column. • High entropy: More uniform distribution/more variability of residues • Low entropy: Less uniform distribution/less variability of residues • Where to get the $p_{ia}$?
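A minimal sketch of the entropy-based column score, under two assumptions not stated on the slide: $p_{ia}$ is estimated from the column itself as $c_{ia}$ divided by the number of non-gap residues, and logs are taken base 2. The function name `entropy_score` is hypothetical.

```python
import math
from collections import Counter

def entropy_score(col, gap="_"):
    """Return -sum_a c_ia * log2(p_ia) for one alignment column.

    Assumption: p_ia = c_ia / (number of non-gap residues in the column).
    """
    counts = Counter(ch for ch in col if ch != gap)
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())

# A fully conserved column scores 0; more variable columns score higher.
for col in ("GGGG", "FFFR", "ACGT"):
    print(col, entropy_score(col))
```

Lower scores indicate better-conserved columns, so an alignment with a lower total entropy score is preferred under this scheme.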

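  13. Scoring an Alignment: Sum of Pairs • Compute the sum of the pairwise scores: $S(m_i) = \sum_{j<l} s(m_{ij}, m_{il})$ • $S(m_i)$ is computed per column • $j$ and $l$ range over the rows in the column • $s$ comes from a substitution matrix such as BLOSUM or PAM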
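The sum-of-pairs score can be sketched in a few lines of Python. To stay self-contained, this sketch replaces the substitution matrix with hypothetical values (match +1, mismatch -1, residue against gap -2, gap against gap 0); a real scorer would look up BLOSUM or PAM entries instead.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2, gap_char="_"):
    """Toy stand-in for s(a, b); the numeric values are assumptions."""
    if a == gap_char and b == gap_char:
        return 0
    if a == gap_char or b == gap_char:
        return gap
    return match if a == b else mismatch

def sp_column(col):
    """Sum of pairwise scores over all pairs of rows in one column."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))

def sp_alignment(rows):
    """Sum of the column scores, assuming independent columns."""
    return sum(sp_column(col) for col in zip(*rows))

print(sp_column("FFFR"))
print(sp_alignment(["DGD", "_GG", "DGG"]))
```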

  14. Algorithms for performing a Multiple Sequence Alignment • Dynamic programming • Not feasible • Progressive alignment algorithms • Guide tree approach • Iterative alignment algorithms

  15. Dynamic Programming for finding an optimal MSA • Assume columns are independent • Score of an alignment is the sum of the per-column scores • Generalization of methods for pairwise alignment • consider a k-dimensional matrix for k sequences (instead of a 2-dimensional matrix) • each matrix element represents the alignment score for k subsequences (instead of 2 subsequences)

  16. Notation for DP • Assume we have $k$ sequences • $i_1$ denotes where we are in sequence 1 • $i_2$ denotes where we are in sequence 2 • … • $i_k$ denotes where we are in sequence $k$ • $x^k_{i_k}$ denotes the $i_k$th position of sequence $x^k$ • $F$: $k$-dimensional matrix where $F(i_1, i_2, \ldots, i_k)$ denotes the score of the best partial alignment of the sequences up to positions $i_1, i_2, \ldots, i_k$

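  17. Recall the DP for the pairwise alignment • With substitution score $s$ and linear gap penalty $d$: $F(i, j) = \max \{ F(i-1, j-1) + s(x_i, y_j),\ F(i-1, j) - d,\ F(i, j-1) - d \}$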

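  18. Dynamic Programming Approach • $F(i_1, \ldots, i_k) = \max_{b \in \{0,1\}^k,\ b \neq 0} \left[ F(i_1 - b_1, \ldots, i_k - b_k) + S(c_1, \ldots, c_k) \right]$, where $c_j = x^j_{i_j}$ if $b_j = 1$ and $c_j$ is a gap otherwise • $F(i_1, \ldots, i_k)$ is the max score of an alignment for the subsequences ending at positions $i_1, \ldots, i_k$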

  19. DP algorithm is too expensive • For $k$ sequences each of length $n$ • $O(n^k)$ space complexity • $O(n^k 2^k)$ time complexity
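A minimal sketch of the k-dimensional dynamic program, assuming sum-of-pairs column scores with hypothetical values (match +1, mismatch -1, residue against gap -2, gap against gap 0) in place of a substitution matrix. The function name `optimal_sp_score` is an assumption, and the sketch returns only the optimal score, not the alignment. Each cell tries all $2^k - 1$ ways of placing residues or gaps in the last column, which is where the exponential cost comes from.

```python
from functools import lru_cache
from itertools import combinations, product

def optimal_sp_score(seqs, match=1, mismatch=-1, gap=-2):
    """Optimal sum-of-pairs score for k sequences via k-dimensional DP."""
    k = len(seqs)
    # Every non-zero 0/1 vector says which sequences put a residue
    # (rather than a gap) into the final column: 2^k - 1 choices.
    moves = [b for b in product((0, 1), repeat=k) if any(b)]

    def column_score(col):
        total = 0
        for a, b in combinations(col, 2):
            if a is None and b is None:      # gap against gap
                continue
            if a is None or b is None:       # residue against gap
                total += gap
            else:
                total += match if a == b else mismatch
        return total

    @lru_cache(maxsize=None)
    def F(idx):
        """Best score for the prefixes of lengths idx[0], ..., idx[k-1]."""
        if not any(idx):
            return 0
        best = float("-inf")
        for b in moves:
            prev = tuple(i - d for i, d in zip(idx, b))
            if min(prev) < 0:
                continue
            col = [seqs[j][idx[j] - 1] if b[j] else None for j in range(k)]
            best = max(best, F(prev) + column_score(col))
        return best

    return F(tuple(len(s) for s in seqs))

print(optimal_sp_score(["GG", "DGG", "DGD"]))
```

This is only practical for a handful of very short sequences, since the memo table holds on the order of $(n+1)^k$ entries and deep recursion limits the sequence lengths.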

  20. Heuristic algorithms to MSA • Progressive alignment • Adding one sequence at a time • Sensitive to the ordering of the sequences • Iterative alignment • Possibly remove some of the aligned sequences and re-align to see if score improves

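  21. Ordering matters • Consider aligning GG, DGG and DGD • For GG and DGD alone, these two alignments are equally good:
       Alignment 1:   D G D      Alignment 2:   D G D
                      _ G G                     G G _
     • But when we include DGG:
       Alignment 1:   D G D      Alignment 2:   D G D
                      _ G G                     G G _
                      D G G                     D G G
     • Alignment 1 is better than alignment 2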

  22. Progressive alignment • Key heuristic: Align the “most similar” sequences first • Assume we can compute pairwise similarity • Pairwise Sequence alignments • Star alignments • Pick a center and align everything to that • Tree alignments • Simple (quick and dirty) tree • At each time merge two, possibly singleton, sets of sequences

  23. Progressive alignment algorithms differ in • The order in which sequences are selected for merges • Alignment of sequences OR Alignment of sequence and partial alignments • How alignments are scored

  24. Star Alignment Approach • given: $k$ sequences to be aligned • pick one sequence $x_c$ as the “center” • for each $x_i \neq x_c$ determine an optimal alignment between $x_i$ and $x_c$ • merge pairwise alignments • return: multiple alignment resulting from aggregate

  25. Star Alignments: Approaches to Picking the Center • Two possible approaches: • try each sequence as the center, return the best multiple alignment • compute all pairwise alignments and select the string $x_c$ that maximizes $\sum_{i \neq c} \mathrm{sim}(x_i, x_c)$
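The second approach can be sketched as follows, assuming the similarity of two sequences is their global alignment score under hypothetical parameters (match +1, mismatch -1, linear gap penalty -2). The names `nw_score` and `pick_center` are illustrative, not taken from the slides.

```python
def nw_score(x, y, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def pick_center(seqs):
    """Return the sequence with the largest summed similarity to the others,
    together with the summed similarity of every candidate."""
    totals = [
        sum(nw_score(s, t) for j, t in enumerate(seqs) if j != i)
        for i, s in enumerate(seqs)
    ]
    best = max(range(len(seqs)), key=totals.__getitem__)
    return seqs[best], totals

center, totals = pick_center(
    ["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ATTGCCGATT"]
)
print(center, totals)
```

This needs all $k(k-1)/2$ pairwise alignments up front, which is far cheaper than trying every sequence as the center and building a full multiple alignment each time.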

  26. Star Alignments: Aggregating Pairwise Alignments • “once a gap, always a gap” • shift entire columns when incorporating gaps

  27. Star Alignment Example
       Given: ATTGCCATT, ATGGCCATT, ATCCAATTTT, ATCTTCTT, ATTGCCGATT
       Pairwise alignments with ATTGCCATT:
         ATGGCCATT      ATC-CAATTTT      ATCTTC-TT      ATTGCCGATT
         ATTGCCATT      ATTGCCATT--      ATTGCCATT      ATTGCC-ATT

  28. Star Alignment Example • merging pairwise alignments
       Step   Present pair     Current multiple alignment
       1.     ATTGCCATT        ATTGCCATT
              ATGGCCATT        ATGGCCATT
       2.     ATC-CAATTTT      ATTGCCATT--
              ATTGCCATT--      ATGGCCATT--
                               ATC-CAATTTT

  29. Star Alignment Example • shift entire columns when incorporating a gap
       Step   Present pair     Current multiple alignment
       3.     ATCTTC-TT        ATTGCCATT--
              ATTGCCATT        ATGGCCATT--
                               ATC-CAATTTT
                               ATCTTC-TT--
       4.     ATTGCCGATT       ATTGCC-ATT--
              ATTGCC-ATT       ATGGCC-ATT--
                               ATC-CA-ATTTT
                               ATCTTC--TT--
                               ATTGCCGATT--

  30. Does ordering matter in star alignment?
       Step   Pair             Current multiple alignment
       1.     ATGGCCATT        ATTGCCATT
              ATTGCCATT        ATGGCCATT
       3.     ATCTTC-TT        ATTGCCATT
              ATTGCCATT        ATGGCCATT
                               ATCTTC-TT
       2.     ATC-CAATTTT      ATTGCCATT--
              ATTGCCATT--      ATGGCCATT--
                               ATCTTC-TT--
                               ATC-CAATTTT

  31. Does ordering matter in star alignment?
       Step   Pair             Current multiple alignment
       4.     ATTGCCGATT       ATTGCC-ATT--
              ATTGCC-ATT       ATGGCC-ATT--
                               ATCTTC--TT--
                               ATC-CA-ATTTT
                               ATTGCCGATT--
     • No.
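The merging rule from the star alignment example ("once a gap, always a gap", shifting entire columns) can be sketched as below. The function name `merge_into_star` is hypothetical, the first row of the multiple alignment is assumed to be the gapped center, and gaps that appear at the same place in both copies of the center are assumed to share a column.

```python
def merge_into_star(msa, pair_center, pair_other, gap="-"):
    """Add one pairwise alignment (center vs. other) to a star alignment.

    msa[0] must be the gapped center; existing gaps are never removed.
    """
    center = msa[0]
    rows = [[] for _ in msa]
    new_row = []
    i = j = 0
    while i < len(center) or j < len(pair_center):
        c_msa = center[i] if i < len(center) else None
        c_pair = pair_center[j] if j < len(pair_center) else None
        if c_msa is not None and c_pair is not None \
                and (c_msa == gap) == (c_pair == gap):
            take_msa, take_pair = True, True    # columns line up
        elif c_msa == gap or c_pair is None:
            take_msa, take_pair = True, False   # keep old gap column
        else:
            take_msa, take_pair = False, True   # insert a new gap column
        for r, row in enumerate(msa):
            rows[r].append(row[i] if take_msa else gap)
        new_row.append(pair_other[j] if take_pair else gap)
        i += int(take_msa)
        j += int(take_pair)
    return ["".join(r) for r in rows] + ["".join(new_row)]

# Pairwise alignments from the example, as (center, other) pairs.
pairs = {
    2: ("ATTGCCATT--", "ATC-CAATTTT"),
    3: ("ATTGCCATT", "ATCTTC-TT"),
    4: ("ATTGCC-ATT", "ATTGCCGATT"),
}

def build(order):
    msa = ["ATTGCCATT", "ATGGCCATT"]   # step 1
    for step in order:
        msa = merge_into_star(msa, *pairs[step])
    return msa

print("\n".join(build([2, 3, 4])))
```

Calling `build([3, 2, 4])` merges the same pairs in the other order used in the example and yields the same rows, only listed in a different order.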

  32. Tree Alignments • Basic idea: organize multiple sequence alignment using a guide tree • leaves represent sequences • internal nodes represent alignments • Determine alignments from bottom of tree upward • return multiple alignment represented at the root of the tree • One common variant: the CLUSTALW algorithm [Thompson et al. 1994]

  33. Doing the Progressive Alignment on the tree • Depending on the internal node in the tree, we may have to align • a sequence with a sequence • a sequence with a partial alignment • a partial alignment with a partial alignment • In all cases we can use dynamic programming • For aligning alignments, we will use sum of pairs scoring

  34. Aligning sequence to a partial alignment • Need to treat each “partial alignment” as a single entity • Partial alignment should not be changed other than gap insertions • Shift entire columns when incorporating gaps
       Sequence:  TGTTAAC

       Partial alignment      After adding the sequence
                              -TGTTAAC
       -TGT AAC               -TGT-AAC
       -TGT -AC               -TGT--AC
       ATGT --C               ATGT---C
       ATGT GGC               ATGT-GGC

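  35. Scoring an alignment of partial alignments • Recall the sum of pairs score for a column $i$: $S(m_i) = \sum_{j<l} s(m_{ij}, m_{il})$ • Let $1$ to $n$ represent sequences from the first alignment • Let $n+1$ to $N$ represent sequences from the second alignment • The score at column $i$ can be written as $S(m_i) = \sum_{j<l \le n} s(m_{ij}, m_{il}) + \sum_{n<j<l \le N} s(m_{ij}, m_{il}) + \sum_{j \le n,\ n<l \le N} s(m_{ij}, m_{il})$ • First term: within the first alignment • Second term: within the second alignment • Third term: between the two alignments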
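A minimal sketch of aligning two partial alignments with sum-of-pairs scoring. It assumes hypothetical scores (match +1, mismatch -1, residue against gap -2, gap against gap 0). Because gap against gap scores 0 under this assumption, inserting an all-gap column leaves the two within-alignment terms unchanged, so the dynamic program only has to optimize the between-alignment term. The function name `align_alignments` is illustrative, and ties may be broken differently from the slides.

```python
def align_alignments(A, B, match=1, mismatch=-1, gap=-2, gap_char="-"):
    """Align two alignments column by column without changing either one
    beyond inserting all-gap columns. Returns (rows, between_score)."""
    def s(a, b):
        if a == gap_char and b == gap_char:
            return 0
        if a == gap_char or b == gap_char:
            return gap
        return match if a == b else mismatch

    def between(col_a, col_b):
        # Sum of pair scores across the two alignments for one column.
        return sum(s(a, b) for a in col_a for b in col_b)

    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    gap_a, gap_b = (gap_char,) * len(A), (gap_char,) * len(B)
    n, m = len(cols_a), len(cols_b)

    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + between(cols_a[i - 1], gap_b)
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + between(gap_a, cols_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + between(cols_a[i - 1], cols_b[j - 1]),
                F[i - 1][j] + between(cols_a[i - 1], gap_b),
                F[i][j - 1] + between(gap_a, cols_b[j - 1]),
            )

    # Trace back from the corner, collecting merged columns in reverse.
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
                F[i][j] == F[i - 1][j - 1] + between(cols_a[i - 1], cols_b[j - 1]):
            merged.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + between(cols_a[i - 1], gap_b):
            merged.append(cols_a[i - 1] + gap_b)
            i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1])
            j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)], F[n][m]

rows, score = align_alignments(["TGTAAC", "TGT-AC"], ["ATGT--C", "ATGTGGC"])
print("\n".join(rows), score)
```

A single sequence is just an alignment with one row, so the same function covers the sequence-to-sequence and sequence-to-alignment cases.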

  36. Tree Alignment example • Starting sequences: TGTTAAC, TGTAAC, TGTAC, ATGTC, ATGTGGC • Create a guide tree • Using pairwise distances and a hierarchical clustering approach • Similar to but simpler than phylogenetic trees

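  37. Converting alignment score to a distance • $D = -\log S_{\mathrm{eff}} = -\log \frac{S_{\mathrm{obs}} - S_{\mathrm{rand}}}{S_{\mathrm{max}} - S_{\mathrm{rand}}}$ • $S_{\mathrm{obs}}$: observed pairwise alignment score • $S_{\mathrm{max}}$: maximum score, the average of the scores of aligning each of the two sequences to itself • $S_{\mathrm{rand}}$: expected score for aligning two random sequences of the same length and residue composition • Used in Feng and Doolittle’s algorithm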
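As a sketch, the score-to-distance conversion attributed to Feng and Doolittle is commonly written as $D = -\log[(S_{obs} - S_{rand}) / (S_{max} - S_{rand})]$; that form is assumed here. The function name and the three inputs are hypothetical, and estimating $S_{rand}$ (for example by shuffling the sequences) is left out.

```python
import math

def score_to_distance(s_obs, s_max, s_rand):
    """Convert a pairwise alignment score to a distance.

    s_obs:  observed alignment score of the two sequences
    s_max:  average of the two self-alignment scores
    s_rand: expected score for random sequences of the same
            length and composition (assumed to be given)
    """
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        raise ValueError("observed score is not above the random baseline")
    return -math.log(s_eff)

print(score_to_distance(s_obs=10, s_max=10, s_rand=0))  # identical sequences
print(score_to_distance(s_obs=5, s_max=10, s_rand=0))
```

The effective score is 1 for identical sequences (distance 0) and approaches 0 as the observed score nears the random baseline, so the distance grows without bound.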

  38. Tree Alignment Example
       Leaves of the guide tree: TGTTAAC, TGTAAC, TGTAC, ATGTC, ATGTGGC
       Alignment at the first internal node:
         TGTAAC
         TGT-AC

  39. Tree Alignment Example
       Leaves of the guide tree: TGTTAAC, TGTAAC, TGTAC, ATGTC, ATGTGGC
       Alignments at the internal nodes so far:
         TGTAAC      ATGT--C
         TGT-AC      ATGTGGC

  40. Tree Alignment Example • Aligning two alignments
       Leaves of the guide tree: TGTTAAC, TGTAAC, TGTAC, ATGTC, ATGTGGC
       Inputs:                     Result:
         TGTAAC      ATGT--C         -TGTAAC
         TGT-AC      ATGTGGC         -TGT-AC
                                     ATGT--C
                                     ATGTGGC

  41. Tree Alignment Example • Aligning sequence to alignment
       Leaves of the guide tree: TGTTAAC, TGTAAC, TGTAC, ATGTC, ATGTGGC
       Inputs:                     Result:
         TGTTAAC     -TGTAAC         -TGTTAAC
                     -TGT-AC         -TGT-AAC
                     ATGT--C         -TGT--AC
                     ATGTGGC         ATGT---C
                                     ATGT-GGC

  42. Iterative refinement methods • The order of selection of sequences can influence the alignment • How to avoid committing to a non-optimal pairwise decision? • Revisit alignments

  43. Barton-Sternberg alignment
       1. Align the two sequences with the highest alignment score using standard DP techniques for pairwise alignment
       2. Repeat until all sequences are in the alignment: • Find the sequence most similar to the current alignment • Add it to the alignment
       3. For all sequences $x_i$: remove $x_i$ from the alignment and re-align it to the alignment of the remaining sequences (for $x_1$, the alignment of $x_2, \ldots, x_n$)
       4. Repeat step 3 until the score does not improve OR we have executed a fixed number of steps
