Multiple Sequence Alignment
1 / 47

Multiple Sequence Alignment A survey of the various programs available and application of MSA in addressing certain b - PowerPoint PPT Presentation

  • Uploaded on

Multiple Sequence Alignment A survey of the various programs available and application of MSA in addressing certain biological problems. Jeff Mower Kiran Annaiah. Sequence Comparison. Aligning two sequences is the cornerstone of Bioinformatics.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Multiple Sequence Alignment A survey of the various programs available and application of MSA in addressing certain b' - sinjin

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Slide1 l.jpg

Multiple Sequence AlignmentA survey of the various programs available and application of MSA in addressing certain biological problems

Jeff Mower

Kiran Annaiah

Sequence comparison l.jpg
Sequence Comparison

  • Aligning two sequences is the cornerstone of Bioinformatics.

    • Sequence alignment is the basic step upon which everything else built.

  • Sequence alignments are employed in

    • predicting de novo secondary structure of proteins,

    • The initial functional assignment

    • knowledge-based tertiary structure predictions,

    • Interpretation of data from genome sequencing projects

    • Inference of phylogenetic trees and resolution of ancestral relationships between species.

Sequence comparisons l.jpg
Sequence Comparisons

  • Homology Searches

    • Look for homologous sequences in databases using FASTA or BLAST program

  • Pattern Searches

    • Used for searching short sequence patterns

  • Multiple Sequence Alignment

    • For aligning and comparing 3 or more related sequences

  • Profile Analysis

    • A profile is created from a multiple sequence alignment

Msa vs pairwise l.jpg
MSA vs Pairwise

  • PSA

    – based on seq. similarity we can identify unknown biological relationship

  • MSA

    • Similar to PSA

    • But also possible to identify conserved sub-patterns based on known biological relationship

  • High seq similarity – functional & structural similarity (PSA)

  • Sequences with functional and structural similarity can differ in sequences

    • PSA cannot detect this case

    • Example : Haemoglobin

Msa vs pairwise5 l.jpg
MSA vs Pairwise

  • Structurally and functionally conserved molecules can differ in sequence – PSA cannot reveal conserved patterns

  • Comparing of 2 sequences with high similarity – patterns detection lost due to high similarity

  • MSA – useful in revealing critical patterns from multiple related sequences

Multiple sequence alignment l.jpg
Multiple Sequence Alignment

  • Homology

    • Homologous sequences, derive from common ancestor

    • Inferred by sequence similarity

    • MSA useful to demonstrate homology

      • Weak similarity – non-significant in pairwise comparison

        could be highly significant if same residues are conserved

        in other distantly related sequences

Multiple sequence alignment7 l.jpg
Multiple Sequence Alignment

  • Global or Local Alignments

  • Substitution Matrices and weighting gaps

    • Best alignment is one that represents an evolutionary scenario

    • Mutational events in evolution considered in MSA

      • Substitutions

      • Insertions

      • Deletions

    • BLOSUM and PAM – based on evolutionary distances

    • Affine gap penalty model

      • p = a + bL

      • p = a + b blog(L)

  • Scoring MSA

    • Sum of Pairs score (SP) for columns

Which msa method l.jpg
Which MSA method…?

  • Global Methods

    • Optimal Algorithms (MSA, MWT, MUSEQAL)


  • Local methods

    • PIMA, DIALIGN, PRALINE, POA, MACAW, BlockMaker, Iteralign


  • Statistical (HMMT, SAGA, SAM, Match Box)

  • Parsimony(MALIGN, TreeAlign)

Progressive algorithm l.jpg
Progressive algorithm

  • Alignment is only an approximate solution (heuristic based)

  • Simplest and effective ways of MSA

  • Less time and less memory

  • Sequences are added one by one to the multiple alignment based on a pre-computed dendogram

  • Sequence addition is by PSA algorithm

  • Disadvantage

    • Once a sequence is aligned, cant be changed even if it conflicts with later sequence additions

  • Examples

    • PileUp, ClustalW, MultAlign, T-Coffee, etc

Exact algorithms l.jpg
Exact Algorithms

  • Useful in cases where sequences are extremely divergent

  • Simultaneously aligns all sequences

  • Disadvantage

    • Need to generalize Needleman-Wunsch algorithm

      • Only for a maximum of 3 sequences

      • Causes exponential increase in time and memory as number of sequences increase

  • Examples

    • MSA (up to 10 closely related sequences)

Iterative algorithms l.jpg
Iterative algorithms

  • Iterate over a existing sub-optimal solution, modifying it at each step, until a convergence point is met.

  • Examples

    • SAGA, AMPS, Praline, IterAlign

Consistency based algorithms l.jpg
Consistency-based Algorithms

  • Given a set of sequences, an optimal MSA is one that agrees most with all possible optimal pairwise alignments

    • Do not depend on specific subs. Matrix

    • Score associated with alignment of 2 residues depends on their indexes (position within protein sequence)

    • Most consistent are often the ones close to truth

  • Examples

    • T-Coffee, DiAlign

Multiple alignment by profile hmm training l.jpg
Multiple Alignment by profile HMM training

Given n sequences , consider the following cases:

  • If the profile HMM is known, the following procedure can be applied:

    • Align each sequence S(i) to the profile separately.

    • Accumulate the obtained alignments to a multiple alignment.

  • If the profile HMM is not known, one can use the following technique in order to obtain an HMM profile from the given sequences:

    • Choose a length L for the profile HMM and initialize the transition and emission probabilities.

    • Train the model using the Baum-Welch algorithm, on all the training sequences.

    • Obtain the multiple alignment from the resulting profile HMM, as in the previous case.

  • Testing the methods l.jpg
    Testing the methods

    • BAliBASE benchmark

      • “Correct” Alignments

      • Core Blocks of Conserved Motifs

      • Typical “Hard Problem” Sets

    Slide18 l.jpg

    BAliBASE - each set Reference set 1

    Slide19 l.jpg

    Scores based on core blocks(V1) each set

    Scores based on full-length alignment(V1)

    Slide20 l.jpg

    Median score for each set Reference set 2

    Results l.jpg
    Results: each set

    • Core blocks aligned well over long sequences – all programs

      • Due to different patterns of conservation

    • Alignment unreliable in the twilight zone

    • Iterative method did well under distinct alignment conditions, but not in the presence of an orphan sequence

    • Global algorithms – accurate and reliable for

      • equidistant sequences

      • divergent families of sequences and

      • alignment of orphan sequence with a family

    • Local algorithm – DiAlign

      • Best for N/C terminal extensions

      • Internal insertions

    Clustalw vs dialign vs t coffee vs poa l.jpg
    ClustalW each set vs DiAlign vs T-Coffee vs POA

    • ClustalW – global progressive method

      • Guide tree created

      • Successive pairwise alignment

    • Poa – progressive using partially ordered graphs

      • No tree to guide alignment of sequences

      • 2 most similar seqs are aligned and others are added to this one profile in a stepwise fashion

    • DiAlign – local algorithm

      • Aligns whole segments

      • PSA performed and ungapped alignments used

    • T-Coffee – progressive global & local

      • Performs PSA – twice, once global (ClustalW) and local (LAlign)

      • Results combined into primary library, then extension step

      • Progressive alignment using info from library

    Results29 l.jpg
    Results: increasingly long sequences

    • Poor performance by all programs with increase in evolutionary distance

    • Increase in seq length – better alignment

    • T-Coffee – best for low – moderate evolutionary distances

    • DiAlign good for high evolutionary distances

    What can we do with msas l.jpg
    What Can We Do With MSAs? increasingly long sequences

    • Motif / pattern identification

    • Structural modeling

    • Phylogenetic analysis

    • Molecular evolutionary analyses

    • Identification of conserved genomic regions across species

    Phylogenetic analysis l.jpg
    Phylogenetic Analysis increasingly long sequences

    • Visual representation of a MSA as a tree of relationships

    • Many methods:

      • Distance: builds tree by clustering sequences according to their similarity

      • Parsimony: finds tree that minimizes the number of changes required

      • Maximum Likelihood: finds tree that maximizes likelihood given parameters

      • Bayesian: Markov chain Monte Carlo simulation to calculate posterior probabilities of trees

    Determining relationships l.jpg
    Determining Relationships increasingly long sequences



    (Baldauf 2003)

    Rooted vs unrooted trees l.jpg
    Rooted vs Unrooted Trees increasingly long sequences



    (Baldauf 2003)

    Slide34 l.jpg

    Many taxa (567), increasingly long sequences

    Few genes (3)

    (Soltis et al. 1999)

    Slide35 l.jpg

    Few taxa (13), increasingly long sequences

    Many genes (61)

    (Goremykin et al. 2003)

    Slide36 l.jpg

    Many taxa, increasingly long sequences

    Many genes

    Coming Soon…

    Detecting gene families l.jpg
    Detecting Gene Families increasingly long sequences

    (Baldauf 2003)

    Slide38 l.jpg

    Human, Gorilla, & Chimp increasingly long sequences

    Glycophorin gene family

    (Wang et al. 2003)

    Bootstrapping l.jpg
    Bootstrapping increasingly long sequences

    (Baldauf 2003)

    Molecular evolution background l.jpg
    Molecular Evolution Background increasingly long sequences

    • Types of changes in protein-coding DNA

      • Silent (synonymous)

      • Replacement (nonsynonymous)

      • Based on degeneracy of the genetic code

    • K – frequency of change between two sequences

      • Ka: # of replacement changes per replacement site

        • Driven by natural selection

        • Reflect level of protein conservation

      • Ks: # of silent changes per silent site

        • Neutral, not affected by selective forces

        • Used to estimate the neutral mutation rate

    Estimation of substitution rates l.jpg
    Estimation of Substitution Rates increasingly long sequences

    • Absolute Rate

      • R = ½ K / T, where T = time since last common ancestor

      • Rates are directly comparable

    • Relative Rate

      • Two species of interest and one outgroup

      • Compare K from species 1 and outgroup against K from species 2 and outgroup

    Relative rate l.jpg
    Relative Rate increasingly long sequences


    K Rat-Kan =

    K Hum-Kan =




    K Rat-Kan - K Hum-Kan =



    Evaluation of selective pressures the k a k s ratio l.jpg
    Evaluation of Selective Pressures, increasingly long sequences The Ka / Ks Ratio

    • Ka / Ks > 1 indicates positive selection

    • Ka / Ks ≈ 1 indicates no selection

    • Ka / Ks < 1 indicates purifying selection

    Slide44 l.jpg

    280 homologs, increasingly long sequences


    (Wang et al. 2003)

    Conserved genomic regions l.jpg
    Conserved Genomic Regions increasingly long sequences

    • Need complete genomes or homologous genomic regions

    • Identify exons from distantly related species

    • Identify regulatory elements from more closely related species

    Slide46 l.jpg

    (Thomas increasingly long sequenceset al. 2003)

    References l.jpg
    References increasingly long sequences

    • Biological Sequence Analysis – R. Durbin, S. Eddy, A. Krogh & G. Mitchison

    • Bioinformatics: Sequence, structure and databanks – D. Higgins & W. Taylor

    • Recent progress in multiple sequence alignment: a survey.

    • Quality assessment of multiple alignment programs.

    • Multiple sequence alignment with the Clustal series of programs.

    • MAVID multiple alignment server

    • Multiple alignment of sequences and structures.

    • Fast algorithms for large-scale genome alignment and comparison.