Practical multiple sequence algorithms

Sushmita Roy

BMI/CS 576

www.biostat.wisc.edu/bmi576/

sroy@biostat.wisc.edu

Sep 24th, 2013


Goals for today

  • Review Guide-tree based multiple sequence alignment

  • Two practical implementations of algorithms for multiple sequence alignment

    • CLUSTALW

    • MUSCLE


The problems with progressive alignment

  • Greedy

    • The tree might be incorrect, that is, it may reflect a wrong order in which the sequences should be joined

    • Errors in alignment

      • Even if the tree is correct, there might be some positions that are misaligned.

  • Choice of alignment parameters

    • Especially when the sequences are diverged and there are more mismatches than identities

      • For closely related sequences, identities dominate over mismatches

    • Different weight matrices might be optimal for different evolutionary distances.

    • Gaps do not occur randomly

      • Gaps are more likely to occur between secondary structure elements than within them.


ClustalW

  • A progressive alignment algorithm with several heuristics

  • Based on a guide tree approach

  • Dynamically varies the gap penalties in a position and residue specific manner

  • Weights different sequences differently

Thompson et al., 1994


Alignments based on guide trees

  • Build up a multiple sequence alignment by progressively adding new sequences by following the order of a phylogenetic tree.

  • Needs sequences to have different extents of divergence

  • Start with aligning the closest pairs of sequences.

  • Gaps inserted in the earlier alignments should be preserved as these gaps are most reliable.


Steps in ClustalW

  • Align all pairs of sequences separately to create a pairwise distance matrix.

  • Calculate a guide tree from the matrix

  • Align sequences progressively according to guide tree starting from the leaves


Calculating the pairwise distance

  • For two sequences with the following alignment

    • AATAATA

    • ATAA_TA

  • Similarity S

    • No. of identical bases / size of alignment

      • 4/7 for the above example

  • Distance=1-S
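
The distance definition above can be sketched in a few lines of Python. This is a minimal illustration, not ClustalW's actual code: it reads the slide's example as the two aligned rows AATAATA and ATAA_TA (an assumption about the original layout), and the function name `alignment_distance` is hypothetical.

```python
def alignment_distance(row1: str, row2: str, gap: str = "_") -> float:
    """Return 1 - S, where S = identical positions / alignment length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    # A position counts as identical only when both rows hold the same base.
    identical = sum(1 for a, b in zip(row1, row2) if a == b and a != gap)
    similarity = identical / len(row1)
    return 1 - similarity


# Slide example: 4 identical positions out of 7, so S = 4/7 and distance = 3/7.
example_distance = alignment_distance("AATAATA", "ATAA_TA")
print(round(example_distance, 3))
```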


Example of creating distance matrix

  • Consider four sequences

    • AAAC

    • AGC

    • ACC

    • GAC

  • Generate pairwise alignments for all pairs of sequences


Pairwise alignment for all the pairs of sequences

Sequence pair   Alignment       Similarity   Distance
1. and 2.       AAAC / _AGC     2/4          0.5
1. and 3.       AAAC / _ACC     2/4          0.5
1. and 4.       AAAC / _GAC     2/4          0.5
2. and 3.       AGC / ACC       2/3          0.33
2. and 4.       AGC / GAC       1/3          0.67
3. and 4.       ACC / GAC       1/3          0.67
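
As a rough sketch, the six pairwise alignments can be turned into a symmetric distance matrix as follows. The alignments are hard-coded from the slide (with "_" as the gap symbol); the names `ALIGNMENTS`, `pair_distance` and `build_distance_matrix` are illustrative assumptions, not part of ClustalW.

```python
from itertools import combinations

# Pairwise alignments of sequences 1..4 (AAAC, AGC, ACC, GAC); "_" marks a gap.
ALIGNMENTS = {
    (1, 2): ("AAAC", "_AGC"),
    (1, 3): ("AAAC", "_ACC"),
    (1, 4): ("AAAC", "_GAC"),
    (2, 3): ("AGC", "ACC"),
    (2, 4): ("AGC", "GAC"),
    (3, 4): ("ACC", "GAC"),
}


def pair_distance(row1, row2, gap="_"):
    """Distance = 1 - (identical positions / alignment length)."""
    identical = sum(1 for a, b in zip(row1, row2) if a == b and a != gap)
    return 1 - identical / len(row1)


def build_distance_matrix(alignments, n):
    """Fill an n x n symmetric matrix from the pairwise alignments."""
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(1, n + 1), 2):
        d = pair_distance(*alignments[(i, j)])
        matrix[i - 1][j - 1] = matrix[j - 1][i - 1] = d
    return matrix


D = build_distance_matrix(ALIGNMENTS, 4)
for row in D:
    print(["%.2f" % d for d in row])
```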


Creating a tree from the distance matrix using UPGMA

  • UPGMA: Unweighted pair group method using arithmetic averages

  • Represent all sequences as the leaf nodes of a tree

  • Merge two closest nodes at a time to create a new node in the tree

    • Set new node at height determined by nodes being merged

  • Let i and j be two existing nodes that are merged to create a new node k

  • Distance between the new node k and every other node l:

    $d_{kl} = \frac{|C_i| \, d_{il} + |C_j| \, d_{jl}}{|C_i| + |C_j|}$

    where $d_{kl}$ is the distance between node k and node l, and $|C_i|$ and $|C_j|$ are the numbers of elements in the clusters associated with nodes i and j


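UPGMA in practice

  • Start from the distance matrix over sequences 1, 2, 3 and 4

  • Merge nodes 2 and 3 into new node 5; place the new node at height d23/2 = 1/6

  • Recompute the distance matrix over nodes 1, 4 and 5

  • Merge nodes 1 and 4 into new node 6; place the new node at height d14/2 = 0.25

  • Merge nodes 5 and 6 into root node 7; place the new node at height d56/2 = 0.29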
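
The UPGMA steps for the four example sequences can be reproduced with a short Python sketch. It assumes the pairwise distances from the earlier example (0.5, 1/3 and 2/3) and breaks ties by taking the first closest pair in node order, which happens to give the same merge order as the slides; the function name `upgma` and its return format are illustrative choices.

```python
from itertools import combinations


def upgma(leaf_distances, leaves):
    """Return the list of merges as (node_i, node_j, new_node, height)."""
    d = {frozenset(pair): dist for pair, dist in leaf_distances.items()}
    size = {leaf: 1 for leaf in leaves}
    active = list(leaves)
    next_id = max(leaves) + 1
    merges = []
    while len(active) > 1:
        # Closest pair of active nodes (first one found wins on ties).
        i, j = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        k = next_id
        next_id += 1
        merges.append((i, j, k, d[frozenset((i, j))] / 2))
        rest = [x for x in active if x not in (i, j)]
        for l in rest:
            # Size-weighted average of the distances from i and j to l.
            d[frozenset((k, l))] = (
                size[i] * d[frozenset((i, l))] + size[j] * d[frozenset((j, l))]
            ) / (size[i] + size[j])
        size[k] = size[i] + size[j]
        active = rest + [k]
    return merges


distances = {
    (1, 2): 0.5, (1, 3): 0.5, (1, 4): 0.5,
    (2, 3): 1 / 3, (2, 4): 2 / 3, (3, 4): 2 / 3,
}
merges = upgma(distances, [1, 2, 3, 4])
for i, j, k, height in merges:
    print(f"merge {i} and {j} into {k} at height {height:.2f}")
```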


Computing the sum of scores for two alignments

  • Assume we have two alignments corresponding to intermediate nodes of the guide tree

  • At each step, take the maximum score over three options:

    • aligning column i in A1 to a column j in A2

    • aligning column i in A1 to gaps in A2

    • aligning column j in A2 to gaps in A1

  • ClustalW uses an average of all pairwise comparisons between two alignments

Alignment A1

AAAC
_GAC

Alignment A2

AGC
ACC


ClustalW scores for aligning columns from two alignments

Assume a score of 1 for mismatch, 2 for match and 0 for gap

Score of aligning column 3 from Alignment 1 and column 2 from Alignment 2

Alignment 1

AAAC
_GAC

Alignment 2

AGC
ACC
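
A minimal sketch of this column score, assuming it is the plain unweighted average over all residue pairs between the two columns (the sequence weights that ClustalW also applies are left out here). Alignment 1 is read as the rows AAAC and _GAC, Alignment 2 as AGC and ACC; `column_score` is a hypothetical helper name.

```python
def column_score(col1, col2, match=2, mismatch=1, gap=0, gap_char="_"):
    """Average score over all pairs (a, b) with a from col1 and b from col2."""
    total = 0
    for a in col1:
        for b in col2:
            if a == gap_char or b == gap_char:
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total / (len(col1) * len(col2))


alignment1 = ["AAAC", "_GAC"]
alignment2 = ["AGC", "ACC"]

col3_of_a1 = [row[2] for row in alignment1]  # ['A', 'A']
col2_of_a2 = [row[1] for row in alignment2]  # ['G', 'C']

# Four pairs (A-G, A-C, A-G, A-C), all mismatches: (1 + 1 + 1 + 1) / 4
score = column_score(col3_of_a1, col2_of_a2)
print(score)
```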


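An example for aligning two alignments

Alignment 1

AAAC
_GAC

Alignment 2

AGC
ACC

Max of three options for the first columns:

  • Column (A, _) of Alignment 1 aligned to column (A, A) of Alignment 2

  • Column (A, _) of Alignment 1 aligned to gaps (_, _) in Alignment 2

  • Gaps (_, _) in Alignment 1 aligned to column (A, A) of Alignment 2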
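
The "max of three options" recurrence can be sketched as a small dynamic program over the columns of the two alignments. This is an illustration under simplifying assumptions, not ClustalW's implementation: columns are scored by the unweighted average of pairwise scores (match 2, mismatch 1, gap 0), gap penalties are not position-specific, and the names `column_score` and `align_alignments` are hypothetical.

```python
GAP = "_"


def column_score(col1, col2, match=2, mismatch=1, gap=0):
    """Unweighted average score over all residue pairs of two columns."""
    total = 0
    for a in col1:
        for b in col2:
            if a == GAP or b == GAP:
                total += gap
            else:
                total += match if a == b else mismatch
    return total / (len(col1) * len(col2))


def align_alignments(aln1, aln2):
    """Globally align two alignments; return (score, merged rows)."""
    cols1, cols2 = list(zip(*aln1)), list(zip(*aln2))
    n, m = len(cols1), len(cols2)
    gaps1, gaps2 = (GAP,) * len(aln1), (GAP,) * len(aln2)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + column_score(cols1[i - 1], gaps2)
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + column_score(gaps1, cols2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + column_score(cols1[i - 1], cols2[j - 1]),
                F[i - 1][j] + column_score(cols1[i - 1], gaps2),
                F[i][j - 1] + column_score(gaps1, cols2[j - 1]),
            )
    # Trace back from the bottom-right cell to recover the merged columns.
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + column_score(
            cols1[i - 1], cols2[j - 1]
        ):
            merged.append(cols1[i - 1] + cols2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + column_score(cols1[i - 1], gaps2):
            merged.append(cols1[i - 1] + gaps2)
            i -= 1
        else:
            merged.append(gaps1 + cols2[j - 1])
            j -= 1
    merged.reverse()
    rows = ["".join(col[r] for col in merged) for r in range(len(aln1) + len(aln2))]
    return F[n][m], rows


best_score, rows = align_alignments(["AAAC", "_GAC"], ["AGC", "ACC"])
print(best_score)
print("\n".join(rows))
```

With these toy scores the first column of Alignment 1 ends up opposite gaps in Alignment 2, and the remaining three columns are aligned one to one.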


Assigning sequence weights in ClustalW

  • ClustalW also considers different weights for different sequences

  • Closely related sequences need to be down-weighted

  • Divergent sequences are up-weighted

  • Uses the branch length of the tree to calculate weights


ClustalW weights of sequences

Weight of a sequence: sum of branch lengths from root to leaf, but sequences sharing a branch share the weight

For example, weight for Hbb_Human=0.081+(0.226/2)+(0.061/4)+(0.015/5)+(0.062/6)
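
The weight formula can be written as a one-line sum. In this sketch each branch on the path from leaf to root is given as a (branch length, number of sequences sharing the branch) pair; the branch lengths are the ones quoted for Hbb_Human above, while the function name and the path representation are assumptions made for illustration.

```python
def sequence_weight(path):
    """Sum branch lengths from leaf to root, each divided by the number
    of sequences that share that branch."""
    return sum(length / n_sharing for length, n_sharing in path)


# (branch length, sequences sharing the branch), from the leaf up to the root.
hbb_human_path = [(0.081, 1), (0.226, 2), (0.061, 4), (0.015, 5), (0.062, 6)]

weight = sequence_weight(hbb_human_path)
print(round(weight, 3))
```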


ClustalW score computation


ClustalW gap handling rules

  • Gap penalties are dynamically adjusted

  • For each position in the alignment, compute a position-specific gap penalty

    • If a gap already exists at the position in any of the sequences being aligned, reduce the gap penalty

    • If there is no gap at the position but it lies fewer than 8 positions from an existing gap, increase the gap opening penalty

    • Reduce the gap penalty for positions inside a hydrophilic stretch of 5 residues

    • Otherwise, use residue-specific gap penalties estimated from known alignments

    • Different amino acid substitution matrices may be selected depending on the estimated divergence of the sequences being aligned at a particular stage


Position-specific gap penalties in ClustalW

High gap penalty within 8 positions of existing gaps

Hydrophilic stretches

Existing gap

Higgins et al., Methods in Enzymology, 1996


Switching weight matrices

  • Dynamically switch between matrices depending upon the average similarity between sequences being aligned

  • PAM

    • 80-100%: PAM20

    • 60-80%: PAM60

    • 40-60%: PAM120

    • 0-40%: PAM350

  • BLOSUM

    • 80-100%: BLOSUM80

    • 60-80%: BLOSUM62

    • 30-60%: BLOSUM45

    • 0-30%: BLOSUM30
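
The switching rule amounts to a threshold lookup on percent identity. The sketch below uses the ranges listed above; how a value exactly on a boundary (for example 80%) is assigned is an assumption here, and the function names are hypothetical.

```python
def choose_blosum(percent_identity: float) -> str:
    """Pick a BLOSUM matrix name from the average percent identity."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    # Assumption: a boundary value goes to the higher-identity range.
    if percent_identity >= 80:
        return "BLOSUM80"
    if percent_identity >= 60:
        return "BLOSUM62"
    if percent_identity >= 30:
        return "BLOSUM45"
    return "BLOSUM30"


def choose_pam(percent_identity: float) -> str:
    """Pick a PAM matrix name from the average percent identity."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity >= 80:
        return "PAM20"
    if percent_identity >= 60:
        return "PAM60"
    if percent_identity >= 40:
        return "PAM120"
    return "PAM350"


print(choose_blosum(72.5), choose_pam(35.0))
```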


Applying ClustalW to SH3 domain proteins

Proteins share <12% sequence identity

Alignment blocks correspond to beta strand secondary structures


Summary of ClustalW

  • Guide tree method

  • Complex gap penalty rules

  • Sequences are weighted to reduce the importance of very similar sequences

  • Adaptive scoring matrix


MUSCLE: Multiple Sequence Comparison by log-expectation

  • Progressive + iterative

  • Has three main stages

  • Stage 1: Draft progressive

  • Stage 2: Improved progressive

  • Stage 3: Refinement

    • Select pairs of subtrees and re-align the alignment for the subtrees.

    • Keep the new alignment if it improves the alignment


Steps in MUSCLE

Stage 1: Draft progressive

Stage 2: Improved progressive

Stage 3: Refinement


MUSCLE Stage 1

1.1 Compute k-mer distance matrix

1.2 Use UPGMA to make tree (TREE1)

1.3 Use guide tree to make first MSA


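K-mer distance

Let k=2

  • K-mer distance is defined from common fractional k-mer count (F)

    $F = \frac{\sum_{\tau} \min\left(n_x(\tau), n_y(\tau)\right)}{\min(L_x, L_y) - k + 1}$

    where $\tau$ is a k-mer, $n_x(\tau)$ and $n_y(\tau)$ are the numbers of instances of $\tau$ in sequence 1 and sequence 2, and $L_x$ and $L_y$ are the lengths of the sequences

  • D=1-F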


K-mer distance example
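
A minimal Python sketch of the k-mer distance with k=2, assuming F is the number of k-mers shared by the two sequences (counted with multiplicity) divided by the number of k-mers in the shorter sequence. The input sequences are borrowed from the earlier distance-matrix example purely for illustration, and the helper names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x: str, y: str, k: int = 2) -> float:
    """D = 1 - F, with F the common fractional k-mer count."""
    if min(len(x), len(y)) < k:
        raise ValueError("both sequences must be at least k long")
    nx, ny = kmer_counts(x, k), kmer_counts(y, k)
    # For each k-mer, the shared count is the smaller of the two counts.
    shared = sum(min(count, ny[kmer]) for kmer, count in nx.items())
    f = shared / (min(len(x), len(y)) - k + 1)
    return 1 - f


# AAAC has 2-mers AA, AA, AC; GAC has GA, AC. One shared 2-mer out of two.
print(kmer_distance("AAAC", "GAC"))
```

Because no alignment is computed, this distance is cheap enough to evaluate for every pair of sequences before the first guide tree is built.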


Stage 2: Improved progressive

2.1 Recompute the similarity of each pair of sequences using their mutual alignment in the MSA

2.2 Construct a phylogenetic tree (TREE2) using an alignment-based distance

2.3 Build a new progressive alignment only for subtrees where the branching order has changed between TREE1 and TREE2

2.4 Repeat 2.3 until the number of “reordered nodes” does not decrease.


Stage 2.1: Recomputing pairwise sequence similarity from a multiple alignment

An MSA

-TGTTAAC
-TGT-AAC
-TGT--AC
ATGT---C
ATGT-GGC

Derived pairwise alignments (exclude columns with gaps in both sequences)

MSA rows    Derived pairwise alignment    Fraction identity
1 and 2     TGTTAAC / TGT-AAC             6/7
1 and 3     TGTTAAC / TGT--AC             5/7
1 and 4     -TGTTAAC / ATGT---C           4/8
1 and 5     -TGTTAAC / ATGT-GGC           4/8
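
The derivation of a pairwise alignment from two MSA rows can be sketched as follows: drop the columns where both rows have a gap, then count identical positions over the remaining columns. The function name `fractional_identity` is an illustrative choice; the MSA rows are the ones shown on the slide.

```python
def fractional_identity(row_a: str, row_b: str, gap: str = "-"):
    """Return (identical positions, alignment length) for two MSA rows,
    after removing columns that are gaps in both rows."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    identical = sum(1 for a, b in pairs if a == b)
    return identical, len(pairs)


msa = ["-TGTTAAC", "-TGT-AAC", "-TGT--AC", "ATGT---C", "ATGT-GGC"]

for other in range(1, len(msa)):
    same, length = fractional_identity(msa[0], msa[other])
    print(f"rows 1 and {other + 1}: {same}/{length}")
```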


Stage 2.2: Phylogenetic tree creation

Construct a phylogenetic tree using a Kimura distance

D: fractional identity of sequences
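
The formula on this slide is not preserved in the transcript. As a hedged sketch, the code below assumes the commonly used Kimura protein-distance correction, d = -ln(1 - p - p^2/5), where p is the fraction of differing positions, taken here as one minus the fractional identity. Whether the slide's D denotes the identity or its complement cannot be read from the text, so treat that mapping as an assumption.

```python
import math


def kimura_distance(fractional_identity: float) -> float:
    """Kimura-corrected distance from a fractional identity in [0, 1]."""
    p = 1.0 - fractional_identity  # fraction of differing positions (assumed)
    inner = 1.0 - p - p * p / 5.0
    if inner <= 0:
        # The logarithm is undefined for very divergent pairs.
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inner)


# Fractional identities 6/7 and 5/7 from the derived pairwise alignments.
print(round(kimura_distance(6 / 7), 3), round(kimura_distance(5 / 7), 3))
```

The correction grows faster than the raw fraction of differences, which reflects that multiple substitutions at one site are hidden in the observed identity.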


Stage 2.3: Re-align only when branching order is changed

Recompute alignment for these nodes

Branching order same

Branching order different: x branches before v


Stage 3: Iterative Refinement

3.1 Select a branch

3.2 Extract profiles

3.3 Re-align profiles

3.4 Update MSA if its score is better than current MSA
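
The four steps form an accept-if-better loop, sketched below. The re-alignment and scoring procedures are passed in as functions because their details are not given here; `toy_realign`, `toy_score` and the candidate alignments are made-up stand-ins for illustration, not MUSCLE's profile alignment or objective function.

```python
def refine(msa, branches, realign, score):
    """Visit branches in order; keep a re-alignment only if it scores higher."""
    current, current_score = msa, score(msa)
    for branch in branches:
        candidate = realign(current, branch)   # steps 3.2 and 3.3
        candidate_score = score(candidate)
        if candidate_score > current_score:    # step 3.4
            current, current_score = candidate, candidate_score
    return current, current_score


# Toy stand-ins (hypothetical): identical columns minus number of gap characters.
def toy_score(msa):
    identical = sum(1 for col in zip(*msa) if len(set(col)) == 1)
    gaps = sum(row.count("-") for row in msa)
    return identical - gaps


toy_candidates = {1: ["LH-IW", "LQS-W"], 2: ["LHIW", "LQSW"]}


def toy_realign(msa, branch):
    return toy_candidates[branch]


final_msa, final_score = refine(["LHI-W", "LQ-SW"], [1, 2], toy_realign, toy_score)
print(final_msa, final_score)
```

In this toy run the candidate for branch 1 ties with the current score and is discarded, while the candidate for branch 2 scores higher and replaces it.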


3.1 Selecting a branch

  • Select a branch in order of decreasing distance from the root

[Figure: guide tree over the sequences MQTIF, LHIW, LQSW and LSF, with branches numbered 1 to 6 and the alignment built at each internal node]

Branch selection order: 1,2,3,4,5,6


3.2 Extracting a profile

[Figure: deleting branch 1 splits the guide tree over MQTIF, LHIW, LQSW and LSF into two subtrees; a profile is extracted for each subtree and the two profiles are re-aligned. A second panel repeats the step for branch 2.]

Is the score better? If yes, keep the new alignment; otherwise discard it.
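
Extracting a profile for a subtree can be sketched as taking the MSA rows of that subtree's sequences and dropping the columns that contain only gaps. The MSA rows below are the ones that appear in the transcript of the figure; the sequence names `seq1` to `seq4` and the function `extract_profile` are assumptions for illustration.

```python
def extract_profile(msa: dict, names: list, gap: str = "-") -> list:
    """Return the rows for the given sequences with all-gap columns removed."""
    rows = [msa[name] for name in names]
    kept = [col for col in zip(*rows) if any(ch != gap for ch in col)]
    return ["".join(col[r] for col in kept) for r in range(len(rows))]


msa = {"seq1": "MQTIF", "seq2": "LH-IW", "seq3": "LQS-W", "seq4": "L-S-W"}

# Deleting a branch splits the sequences into two groups, one per subtree.
profile_a = extract_profile(msa, ["seq2"])
profile_b = extract_profile(msa, ["seq3", "seq4"])
print(profile_a, profile_b)
```

The two profiles would then be re-aligned to each other, and the result kept only if the score of the full alignment improves.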


Summary of MUSCLE

  • Three stage algorithm

  • Stage 1: Draft progressive

    • k-mer distance

    • UPGMA tree (TREE1)

    • Guide tree based alignment (MSA1)

  • Stage 2: Improved progressive

    • Distance derived from MSA1

    • UPGMA tree (TREE2)

    • Redo alignment for nodes with changed orderings

    • Repeat until number of re-ordered nodes does not change

  • Stage 3: Iterative refinement

    • Generate subtree profiles

    • Realign profiles

    • Keep realignment if of higher score

    • Repeat until no more improvement or fixed number of steps.

  • MUSCLE-fast: Stage 1

  • MUSCLE-p: Stage 1 and 2


Accuracy scores of different MSA algorithms on benchmark datasets

Accuracy measures the fraction of residues correctly aligned with the reference alignment

Edgar, 2004, BMC Bioinformatics



Summary of algorithms

  • ClustalW

    • Lots of heuristics for gaps

    • One guide tree and then alignment

    • Weights sequences

    • Dynamically selects scoring matrix depending upon sequence identity

  • MUSCLE

    • Three-stage algorithm: Draft, Improved, Iterative refinement

    • Two guide trees

    • Uses k-mer distance for first tree

    • Selectively re-aligns using second tree

    • Refines iteratively by working on subtree-associated alignments

    • Fast, with alignments of equal or better quality


How do MUSCLE and CLUSTALW work in practice

  • Consider coding sequences of 15 yeast species

  • Consider promoter sequences of 15 yeast species

  • Align with MUSCLE and CLUSTALW


Protein sequence alignment

MUSCLE

CLUSTALW


Promoter sequence alignment

MUSCLE

CLUSTALW


Comparing alignment of promoters to shuffled sequences in CLUSTALW

Original sequences

Shuffled sequences


Comparing alignment of promoters to shuffled sequences in MUSCLE

Original sequences

Shuffled sequences


Conclusion

  • Algorithms seemed similar for protein/coding sequences

  • Algorithms gave different alignments for DNA sequences

    • Possibly DNA sequences are harder to align

    • DNA sequences in non-coding regions are even harder to align


Summary of sequence alignment algorithms

  • Pairwise alignment

    • Global: (Needleman-Wunsch)

    • Local: (Smith-Waterman)

  • Database searching

    • BLAST

  • Multiple sequence alignment

    • Star alignment

    • Progressive alignment with guide tree: CLUSTALW

    • Progressive + Iterative alignment with guide tree: MUSCLE