Multiple sequence alignment


PowerPoint Slideshow about ' Multiple Sequence Alignment' - nasya



Alignment can be easy or difficult

  • Easy

  • Difficult due to insertions or deletions (indels)


Homology: Definition

  • Homology: similarity that is the result of inheritance from a common ancestor. The identification and analysis of homologies is central to phylogenetic systematics.

  • An alignment is a hypothesis of positional homology between bases/amino acids.


Multiple Sequence Alignment - Goals

  • To generate a concise, information-rich summary of sequence data.

  • Sometimes used to illustrate the dissimilarity between a group of sequences.

  • Alignments can be treated as models that can be used to test hypotheses.

  • Used to identify homologous residues within sequences.


Multiple sequence alignments - problems

  • All sequences show some similarity (even random sequences).

  • Similarity levels might be high in some parts of the sequence and low in other parts.

  • Sequences might show substantial length variation and presence/absence of various domains.


SSU rRNA

  • Structural RNA (not translated)

  • Found in the small ribosomal subunit.

  • Widely-used for phylogeny reconstruction (found in every species)

  • Contains stem and loop structures.

  • Stem structures usually conform to Watson-Crick base pairing.


Alignment of 16S rRNA can be guided by secondary structure

Alignment of 16S rRNA sequences from different bacteria


Protein alignment may be guided by tertiary structure interactions

Homo sapiens DjlA protein

Escherichia coli DjlA protein


Multiple Sequence Alignment - Methods

  • 3 main methods of alignment:

    • Manual (using custom-built text editors).

    • Automatic (using custom-built alignment software).

    • Combined.


    Manual Alignment - reasons

    • Might be carried out because:

      • Alignment is easy.

      • There is some extraneous information (structural).

      • Automated alignment methods have encountered the local minimum problem.

      • An automated alignment method can be “improved”.


    Local minimum

    GARFIELDTHEFAT---CAT

    GARFIELDTHEFATFATCAT


    Dotplots

    • The dotplot provides a way of quickly visualizing the similarities between all parts of two sequences simultaneously.

    • Let's consider a dotplot between sperm whale and human myoglobins.

    • Sperm whale myoglobin

    • GLSDGEWQLV LNVWGKVEAD IPGHGQEVLI RLFKGHPETL EKFDKFKHLK SEDEMKASED LKKHGATVLT ALGGILKKKG HHEAEIKPLA QSHATKHKIP VKYLEFISEC IIQVLQSKHP GDFGADAQGA MNKALELFRK DMASNYKELG FQG

    • human myoglobin

    • VLSEGEWQLV LHVWAKVEAD VAGHGQDILI RLFKSHPETL EKFDRFKHLK TEAEMKASED LKKHGVTVLT ALGAILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISEA IIHVLHSRHP GDFGADAQGA MNKALELFRK DIAAKYKELG YQG


    Dotplot example sperm whale vs human myg

    • Put one sequence on top and the other on the side.

    • Where residues are identical, put a dot.

    • Diagonal lines of dots show similarities.

    • To try it, take just the first 10 amino acids of each sequence and make a table with the whale sequence on top and the human sequence on the side (* = identical residues, . = no match):

                      Sperm whale myoglobin

                      G L S D G E W Q L V ...

    Human         V   . . . . . . . . . *

    myoglobin     L   . * . . . . . . * .

                  S   . . * . . . . . . .

                  E   . . . . . * . . . .

                  G   * . . . * . . . . .

                  E   . . . . . * . . . .

                  W   . . . . . . * . . .

                  Q   . . . . . . . * . .

                  L   . * . . . . . . * .

                  V   . . . . . . . . . *

                  ...
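The table-building recipe above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `dotplot` and the use of "." for empty cells are choices made here, not part of the slides. The two strings are the first 10 residues of the two myoglobin sequences listed earlier.

```python
def dotplot(top: str, side: str) -> list[str]:
    """Return one text row per residue of `side`; '*' marks an identical residue."""
    return ["".join("*" if s == t else "." for t in top) for s in side]


top = "GLSDGEWQLV"   # first 10 residues of the sequence written across the top
side = "VLSEGEWQLV"  # first 10 residues of the sequence written down the side

rows = dotplot(top, side)
print("  " + " ".join(top))
for residue, row in zip(side, rows):
    print(residue + " " + " ".join(row))
```

The stars that line up along the diagonal correspond to the shared run EWQLV and its neighbours; the isolated stars elsewhere are chance identities.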


    Dotplot example sperm whale vs human myg

    [Figure: dotplot of the complete sperm whale and human myoglobin sequences]

    • This is the result for the whole sequence.

    • It is easy to see that the diagonal is a line of dots.

    • So sperm whale and human myoglobin are very similar.

    • But the picture is noisy. It can be smoothed using a sliding window, which considers neighbouring residues as well.


    Dotplot example sperm whale vs human myg

    • Noise can be smoothed using a sliding window, which considers neighbouring residues as well.

    • This has been done here; the diagonal can be seen to be highly similar.

    • Also, instead of using a simple identity, use a scoring matrix.
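A minimal sketch of the sliding-window idea, under stated assumptions: the window length (3), the threshold (2) and the function name `windowed_dotplot` are illustrative choices, not values from the slides. The `score` argument is where a substitution matrix lookup could be plugged in; by default it is plain identity.

```python
def windowed_dotplot(top, side, window=3, threshold=2, score=None):
    """Place a dot only where a whole window of residues scores at least `threshold`."""
    if score is None:
        # Default: simple identity scoring; a substitution matrix lookup could go here.
        score = lambda a, b: 1 if a == b else 0
    rows = []
    for i in range(len(side) - window + 1):
        row = ""
        for j in range(len(top) - window + 1):
            total = sum(score(side[i + k], top[j + k]) for k in range(window))
            row += "*" if total >= threshold else "."
        rows.append(row)
    return rows


top = "GLSDGEWQLV"   # first 10 residues of the sequence across the top
side = "VLSEGEWQLV"  # first 10 residues of the sequence down the side

rows = windowed_dotplot(top, side)
for row in rows:
    print(" ".join(row))
```

With these settings the isolated off-diagonal identities disappear and only the main diagonal survives, which is the smoothing effect described above.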


    Dotplots in practice

    • The best tool is an applet* called dotlet

      • www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

      • www.bip.bham.ac.uk/dotlet/Dotlet.html

  • *An applet is a program that runs in a web browser. This means that you can produce dotplots within a Netscape/IE window.

  • Dotplots are often useful to identify things like repeated domains or duplications in big proteins...


    Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

    • Protein has many repeats

    • SLIT_DROME (P24014):MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY

    • Perform a dotplot of the SLIT protein against itself (www.bio.bham.ac.uk/dotlet/Dotlet.html).


    Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

    Swiss-prot entry

    For further discussion of dotplots see Attwood and Parry-Smith, pp. 116-118.


    Dynamic programming

    2 methods:

    • Dynamic programming

      • Consider 2 protein sequences of 100 amino acids in length.

      • If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 to align 4 sequences, etc.

      • Aligning 20 sequences exhaustively would take more time than the universe has existed.

    • Progressive alignment
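The arithmetic behind the dynamic programming estimate can be checked directly. This sketch takes the slide's premise (length raised to the number of sequences, measured in seconds) at face value; the age of the universe, about 13.8 billion years, is an approximate figure assumed here, not taken from the slides.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9  # approximate value, assumed for the comparison


def exhaustive_seconds(n_sequences: int, length: int = 100) -> int:
    """Time for an exhaustive alignment under the slide's premise: length ** n seconds."""
    return length ** n_sequences


for n in (2, 3, 4, 20):
    years = exhaustive_seconds(n) / SECONDS_PER_YEAR
    print(f"{n} sequences: {years:.3g} years")
```

For 20 sequences this is 100^20 = 10^40 seconds, many orders of magnitude beyond the assumed age of the universe.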


    Progressive Alignment

    • Devised by Feng and Doolittle in 1987.

    • Essentially a heuristic method and as such is not guaranteed to find the ‘optimal’ alignment.

    • Requires (n-1)+(n-2)+...+1 = n(n-1)/2 pairwise alignments as a starting point.

    • Most successful implementation is Clustal (Des Higgins). This software is cited 3,000 times per year in the scientific literature.


    Overview of ClustalW Procedure

    CLUSTAL W

    1. Quick pairwise alignment: calculate distance matrix

                       1     2     3     4     5

       Hbb_Human  1    -

       Hbb_Horse  2   .17    -

       Hba_Human  3   .59   .60    -

       Hba_Horse  4   .59   .59   .13    -

       Myg_Whale  5   .77   .77   .75   .75    -

    2. Neighbor-joining tree (guide tree)

       [Figure: guide tree with leaves Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse and Myg_Whale]

    3. Progressive alignment following guide tree

       [Figure label: alpha-helices]

       1 PEEKSAVTALWGKVN--VDEVGG

       2 GEEKAAVLALWDKVN--EEEVGG

       3 PADKTNVKAAWGKVGAHAGEYGA

       4 AADKTNVKAAWSKVGGHAGEYGA

       5 EHEWQLVLHVWAKVEADVAGHGQ


    ClustalW- Pairwise Alignments

    • First perform all possible pairwise alignments between each pair of sequences. There are (n-1)+(n-2)+...+1 = n(n-1)/2 such pairs.

    • Calculate the ‘distance’ between each pair of sequences based on these isolated pairwise alignments.

    • Generate a distance matrix.
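As a rough sketch of these three steps, the code below counts the pairs and turns one pairwise alignment into a distance. The distance used here (fraction of differing residues over ungapped columns) is an assumption made for illustration; the slides do not say which measure ClustalW uses. The two strings are the Hba_Human and Hba_Horse fragments from the overview slide, so the value is for those fragments only, not the full-length sequences.

```python
from itertools import combinations


def n_pairs(n: int) -> int:
    """Number of pairwise alignments needed for n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def p_distance(a: str, b: str) -> float:
    """Fraction of differing residues, ignoring columns where either sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in columns) / len(columns)


names = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]
pairs = list(combinations(names, 2))
print(len(pairs), "pairs for", len(names), "sequences")

hba_human = "PADKTNVKAAWGKVGAHAGEYGA"  # fragment 3 from the overview slide
hba_horse = "AADKTNVKAAWSKVGGHAGEYGA"  # fragment 4 from the overview slide
print(round(p_distance(hba_human, hba_horse), 3))
```

Filling a matrix with one such value per pair gives the lower-triangular table that the guide tree is built from.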



    Possible alignment

    • Scoring Scheme:

      • Match: +1

      • Mismatch: 0

      • Indel: -1

    • Scores along this path: 1, 1, 0, 1, 0, -1

    • Score for this path = 2


    Alignment using this path

    GATTC-

    GAATTC

    Column scores: 1, 1, 0, 1, 0, -1


    Optimal Alignment 1

    Alignment using this path:

    GA-TTC

    GAATTC

    Column scores: 1, 1, -1, 1, 1, 1

    Alignment score: 4


    Optimal Alignment 2

    Alignment using this path:

    G-ATTC

    GAATTC

    Column scores: 1, -1, 1, 1, 1, 1

    Alignment score: 4
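The path scores on these slides can be reproduced with a small dynamic programming sketch. It uses the scoring scheme given earlier (match +1, mismatch 0, indel -1) and a standard global alignment recurrence; the function names are illustrative, and this is not meant as ClustalW's actual implementation.

```python
def column_score(x: str, y: str, match=1, mismatch=0, indel=-1) -> int:
    """Score an already-gapped alignment column by column."""
    total = 0
    for a, b in zip(x, y):
        if a == "-" or b == "-":
            total += indel
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def best_global_score(a: str, b: str, match=1, mismatch=0, indel=-1) -> int:
    """Best achievable global alignment score, filled in row by row."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * indel  # prefix of a aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * indel  # prefix of b aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + indel, score[i][j - 1] + indel)
    return score[-1][-1]


print(column_score("GATTC-", "GAATTC"))     # the first, non-optimal path
print(column_score("GA-TTC", "GAATTC"))     # optimal alignment 1
print(column_score("G-ATTC", "GAATTC"))     # optimal alignment 2
print(best_global_score("GATTC", "GAATTC")) # best score over all paths
```

Both optimal alignments reach the same best score, which is why the slides show two equally good paths.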



    ClustalW- Guide Tree

    • Generate a Neighbor-Joining ‘guide tree’ from these pairwise distances.

    • This guide tree gives the order in which the progressive alignment will be carried out.


    Neighbor joining method

    • The neighbor joining method is a greedy heuristic which joins, at each step, the two closest sub-trees that are not already joined.

    • It is based on the minimum evolution principle.

    • One of the important concepts in the NJ method is neighbors, which are defined as two taxa that are connected by a single node in an unrooted tree.

    [Figure: taxa A and B connected through Node 1]


    What is required for the Neighbour joining method?

    Distance Matrix


    First Step

    PAM distance 3.3 (Human - Monkey) is the minimum. So we'll join Human and Monkey to MonHum and we'll calculate the new distances.

    [Figure: Human and Monkey joined as Mon-Hum; Mosquito, Spinach and Rice not yet joined]


    Calculation of New Distances

    After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

    Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2 = (90.8 + 86.3)/2 = 88.55

    [Figure: Spinach and the Mon-Hum subtree (Human, Monkey)]
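The joining step and the averaging rule can be sketched as follows, using only the three distances quoted on these slides (the rest of the matrix is not reproduced in the transcript). The dictionary layout and function name are illustrative. Note that this follows the simplified averaging described here; the textbook neighbor-joining update also subtracts the distance between the two joined taxa, which this sketch deliberately does not do.

```python
def merged_distance(dist, a, b, other):
    """Distance from `other` to the subtree joining a and b, as a simple average."""
    return (dist[frozenset((other, a))] + dist[frozenset((other, b))]) / 2


# Only the distances quoted on the slides; the remaining entries are not given.
dist = {
    frozenset(("Human", "Monkey")): 3.3,
    frozenset(("Spinach", "Monkey")): 90.8,
    frozenset(("Spinach", "Human")): 86.3,
}

closest = min(dist, key=dist.get)  # the pair to join first
print(sorted(closest))
print(round(merged_distance(dist, "Monkey", "Human", "Spinach"), 2))
```

Repeating the two operations (pick the closest pair, then recompute distances to the new subtree) until one cluster remains gives the order of joins shown on the following slides.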


    Next Cycle

    [Figure: Mosquito joined to Mon-Hum, giving Mos-(Mon-Hum); Spinach and Rice not yet joined]


    Penultimate Cycle

    [Figure: Spinach and Rice joined as Spin-Rice, alongside Mos-(Mon-Hum)]


    Last Joining

    [Figure: Spin-Rice joined to Mos-(Mon-Hum), giving (Spin-Rice)-(Mos-(Mon-Hum))]


    Unrooted Neighbor-Joining Tree

    [Figure: unrooted tree with leaves Human, Monkey, Mosquito, Spinach and Rice]


    Multiple Alignment- First pair

    • Align the two most closely-related sequences first.

    • This alignment is then ‘fixed’ and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged.


    ClustalW- Decision time

    • Consult the guide tree to see what alignment is performed next.

      • Option 1: align a third sequence to the first two

        Or

      • Option 2: align two entirely different sequences to each other.


    ClustalW- Alternative 1

    • If the situation arises where a third sequence is aligned to the first two, then when a gap has to be introduced to improve the alignment, the existing alignment and the new sequence are each treated as a single sequence.

    ClustalW- Alternative 2

    • If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out.


    ClustalW- Progression

    • The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a ‘pair’ having more than one sequence.


    Progressive alignment - step 1

    1. gctcgatacgatacgatgactagcta

    2. gctcgatacaagacgatgacagcta

    3. gctcgatacacgatgactagcta

    4. gctcgatacacgatgacgagcga

    5. ctcgaacgatacgatgactagct

    Sequences 1 and 2 are aligned first:

    1. gctcgatacgatacgatgactagcta

    2. gctcgatacaagacgatgac-agcta


    Progressive alignment - step 2

    1. gctcgatacgatacgatgactagcta

    2. gctcgatacaagacgatgacagcta

    3. gctcgatacacgatgactagcta

    4. gctcgatacacgatgacgagcga

    5. ctcgaacgatacgatgactagct

    Sequences 3 and 4 are aligned next:

    3. gctcgatacacgatgactagcta

    4. gctcgatacacgatgacgagcga


    Progressive alignment - step 3

    1. gctcgatacgatacgatgactagcta

    2. gctcgatacaagacgatgac-agcta

    +

    3. gctcgatacacgatgactagcta

    4. gctcgatacacgatgacgagcga

    The two pairwise alignments are then aligned to each other:

    1. gctcgatacgatacgatgactagcta

    2. gctcgatacaagacgatgac-agcta

    3. gctcgatacacga---tgactagcta

    4. gctcgatacacga---tgacgagcga
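The key property of this step is that a gap is inserted at the same place in every sequence of an already-aligned group, so the group's internal alignment never changes. A minimal sketch of that operation, with an illustrative helper name and the gap position (after 13 bases, 3 columns long) read off the slide rather than computed:

```python
def insert_gap_column(group, position, length=1):
    """Insert the same gap into every sequence of an aligned group."""
    return [seq[:position] + "-" * length + seq[position:] for seq in group]


group_12 = ["gctcgatacgatacgatgactagcta",
            "gctcgatacaagacgatgac-agcta"]   # fixed alignment of sequences 1 and 2
group_34 = ["gctcgatacacgatgactagcta",
            "gctcgatacacgatgacgagcga"]      # fixed alignment of sequences 3 and 4

# Gap placement taken from the slide; finding it is the job of the
# group-to-group dynamic programming step, which is not shown here.
widened_34 = insert_gap_column(group_34, position=13, length=3)

for seq in group_12 + widened_34:
    print(seq)
```

Because both sequences in the group receive the gap together, their alignment relative to each other is untouched.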


    Progressive alignment - final step

    1. gctcgatacgatacgatgactagcta

    2. gctcgatacaagacgatgac-agcta

    3. gctcgatacacga---tgactagcta

    4. gctcgatacacga---tgacgagcga

    +

    5. ctcgaacgatacgatgactagct

    Sequence 5 is added last:

    1. gctcgatacgatacgatgactagcta

    2. gctcgatacaagacgatgac-agcta

    3. gctcgatacacga---tgactagcta

    4. gctcgatacacga---tgacgagcga

    5. -ctcga-acgatacgatgactagct-


    ClustalW- Good points/Bad points

    • Advantages:

      • Speed.

    • Disadvantages:

      • No objective function.

      • No way of quantifying whether or not the alignment is good

      • No way of knowing if the alignment is ‘correct’.


    ClustalW- Local Minimum

    • Potential problems:

      • Local minimum problem. If an error is introduced early in the alignment process, it is impossible to correct this later in the procedure.

      • Arbitrary alignment.


    Increasing the sophistication of the alignment process

    • Should we treat all the sequences in the same way? - even though some sequences are closely-related and some sequences are distant relatives.

    • Should we treat all positions in the sequences as though they were the same? - even though they might have different functions and different locations in the 3-dimensional structure.


    ClustalW- Caveats

    • Sequence weighting

    • Varying substitution matrices

    • Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences), encourage gaps in loops rather than in core regions.

    • Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments


    ClustalW- User-supplied values

    • Two penalties are set by the user (there are default values, but you should know that it is possible to change these).

    • GOP- Gap Opening Penalty is the cost of opening a gap in an alignment.

    • GEP- Gap Extension Penalty is the cost of extending this gap.
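To see how the two penalties interact, here is a small sketch of an affine gap cost. The formula (GOP charged once, GEP charged for every gap position) is one common convention and is an assumption here, as are the numeric defaults; other programs count the first position differently.

```python
def gap_cost(length: int, gop: float = 10.0, gep: float = 0.2) -> float:
    """Cost of one gap of `length` positions: open once, extend per position.

    The default values are illustrative, not taken from the slides.
    """
    if length <= 0:
        return 0.0
    return gop + gep * length


one_long_gap = gap_cost(3)
three_short_gaps = 3 * gap_cost(1)
print(one_long_gap, three_short_gaps)
```

With GOP much larger than GEP, one gap of three positions is far cheaper than three separate single-position gaps, so the aligner prefers a few long gaps over many short ones.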


    Position-Specific gap penalties

    • Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences.

    • The GOP is manipulated in a position-specific manner, so that it can vary over the sequences.

    • If there is a gap at a position, the GOP and GEP penalties are lowered; the other rules do not apply.

    • This makes gaps more likely at positions where gaps already exist.


    Discouraging too many gaps

    • If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap.

    • This discourages gaps that are too close together.

    • At any position within a run of hydrophilic residues, the GOP is decreased.

    • These runs usually indicate loop regions in protein structures.

    • A run of 5 hydrophilic residues is considered to be a hydrophilic stretch.

    • The default hydrophilic residues are:

      • D, E, G, K, N, Q, P, R, S

      • But this can be changed by the user.
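The two position rules on this slide can be sketched as simple scans over a sequence. The code only marks which positions each rule applies to; the size of the increase or decrease in GOP is not given in the slides, so none is invented here. Function names and the test sequences are hypothetical.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues listed on the slide


def hydrophilic_positions(seq: str, run_length: int = 5) -> set[int]:
    """Indices lying inside a run of at least `run_length` hydrophilic residues."""
    positions, start = set(), None
    for i, residue in enumerate(seq + "#"):  # '#' sentinel closes a final run
        if residue in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= run_length:
                positions.update(range(start, i))
            start = None
    return positions


def near_gap_positions(aligned: str, distance: int = 8) -> set[int]:
    """Ungapped indices within `distance` residues of an existing gap."""
    gaps = [i for i, c in enumerate(aligned) if c == "-"]
    return {i for i, c in enumerate(aligned)
            if c != "-" and any(abs(i - g) <= distance for g in gaps)}


print(sorted(hydrophilic_positions("AAKDEGSAA")))              # GOP would be decreased here
print(sorted(near_gap_positions("A" * 10 + "-" + "A" * 10)))   # GOP would be increased here
```

A full implementation would combine these sets with the rule that positions already containing a gap get lowered penalties and skip the other rules.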


    Divergent Sequences

    • The most divergent sequences (most different, on average, from all of the other sequences) are usually the most difficult to align.

    • It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned).

    • The user has the choice of setting a cutoff (default is 40% identity).

    • This will delay the alignment until the others have been aligned.


    T-COFFEE (Tree-based Consistency Objective Function For alignment Evaluation)

    • Generate a library of all the pairwise alignments between the sequences.

    • This gives positional information concerning which residues are homologous to which other residues.

    • This can then be used to guide progressive alignments.


    An example dataset

    SequenceA GARFIELD THE LAST FAT CAT

    SequenceB GARFIELD THE FAST CAT

    SequenceC GARFIELD THE VERY FAST CAT

    SequenceD THE FAT CAT

    Clustal alignment

    Sequence A GARFIELD THE LAST FA-T CAT

    Sequence B GARFIELD THE FAST CA-T ---

    Sequence C GARFIELD THE VERY FAST CAT

    Sequence D -------- THE ---- FA-T CAT


    Primary library

    SeqA GARFIELD THE LAST FAT CAT

    SeqB GARFIELD THE FAST CAT --- Weight = 88

    SeqA GARFIELD THE LAST FA-T CAT

    SeqC GARFIELD THE VERY FAST CAT Weight = 77

    SeqA GARFIELD THE LAST FAT CAT

    SeqD -------- THE ---- FAT CAT Weight = 100

    SeqB GARFIELD THE ---- FAST CAT

    SeqC GARFIELD THE VERY FAST CAT Weight = 100

    SeqB GARFIELD THE FAST CAT

    SeqD -------- THE FA-T CAT Weight = 100

    SeqC GARFIELD THE VERY FAST CAT

    SeqD -------- THE ---- FA-T CAT Weight = 100


    Secondary library

    SeqA GARFIELD THE LAST FAT CAT

    SeqB GARFIELD THE FAST CAT Weight = 88

    SeqA GARFIELD THE LAST FAT CAT

    SeqC GARFIELD THE VERY FAST CAT

    SeqB GARFIELD THE FAST CAT Weight = 77

    SeqA GARFIELD THE LAST FAT CAT

    SeqD THE FAT CAT

    SeqB GARFIELD THE FAST CAT Weight = 100


    Extended library

    SeqA GARFIELD THE LAST FAT CAT

    SeqB GARFIELD THE FAST CAT

    Dynamic programming:

    SeqA GARFIELD THE LAST FA-T CAT

    SeqB GARFIELD THE ---- FAST CAT


    Advice on progressive alignment

    • Progressive alignment is a mathematical process that is completely independent of biological reality.

    • Can be a very good estimate

    • Can be an impossibly poor estimate.

    • Requires user input and skill.

    • Treat cautiously

    • Can be improved by eye (usually)

    • Often helps to have colour-coding.

    • Depending on the use, the user should be able to make a judgement on those regions that are reliable or not.

    • For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable


    Alignment of protein-coding DNA sequences

    • It is not very sensible to align the DNA sequences of protein-coding genes directly.

    Unaligned:

    ATGCTGTTAGGG

    ATGACTCTGTTAGGG

    Aligned as DNA:

    ATG-CT--GTTAGGG

    ATGACTCTGTTAGGG

    The result might be highly implausible and might not reflect what is known about biological processes.

    It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
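A minimal sketch of this translate, align, then re-insert gaps procedure, using the two DNA sequences from the example above. The codon table is the standard genetic code; the protein alignment ("M-LLG" against "MTLLG") is supplied by hand here as an assumption rather than computed, and the function names are illustrative.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order for each codon position.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna: str) -> str:
    """Translate an in-frame DNA sequence (length a multiple of 3)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - len(dna) % 3, 3))


def thread_gaps(aligned_protein: str, dna: str) -> str:
    """Copy gaps from a protein alignment onto the DNA, three bases per residue."""
    codons = iter(dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)


short_dna, long_dna = "ATGCTGTTAGGG", "ATGACTCTGTTAGGG"
print(translate(short_dna), translate(long_dna))

# Hand-made protein alignment (assumed for illustration).
print(thread_gaps("M-LLG", short_dna))
print(thread_gaps("MTLLG", long_dna))
```

The gap now falls on a whole codon, so the reading frame is preserved, unlike the scattered single-base gaps in the direct DNA alignment.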


    Manual Alignment- software

    • GDE- The Genetic Data Environment (UNIX)

    • CINEMA- Java applet available from: http://www.biochem.ucl.ac.uk

    • Seqapp/Seqpup- Mac/PC/UNIX available from: http://iubio.bio.indiana.edu

    • SeAl for Macintosh, available from: http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html

    • BioEdit for PC, available from: http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

