Multiple sequence alignment

Dr Alexei Drummond

Department of Computer Science

[email protected]

Semester 2, 2006


Multiple alignment software

Really need approximation methods: exact alignment by dynamic programming is only practical for a handful of sequences.

Four techniques

  • progressive global alignment of the sequences, starting with an alignment of the most similar sequences and then building a full alignment by adding more sequences,

  • iterative methods that make an initial alignment of groups of sequences and then refine the alignment to achieve a better result (Barton-Sternberg, simulated annealing, stochastic hill climbing),

  • alignments based on locally conserved patterns found in the same order in the sequences, and

  • use of probabilistic models of the indel and substitution process to do statistical inference of alignment (“statistical alignment”).


Scoring a multiple alignment

Usually the score of an alignment $m$ of sequences $1, \dots, N$ is

$$S(m) = G + \sum_i S(m_i)$$

where $G$ is the gaps score and $S(m_i)$ is the score for column $i$.

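Linear gap scores & SP scoring

Treat gap as separate symbol.

s(a,-) = s(-,a) = gap score

s(-,-) = 0

“Sum of Pairs” (SP) scoring function for column $i$ of an alignment of sequences $1, \dots, N$:

$$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$$

where $m_i^k$ is the symbol of sequence $k$ in column $i$.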
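Sum-of-pairs scoring with a linear gap score can be sketched in a few lines. This is only an illustration: the match, mismatch and gap values below are assumptions standing in for a real substitution matrix, and the function names are made up for this example.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """s(a, b), with the gap '-' treated as a separate symbol."""
    if a == "-" and b == "-":
        return 0                      # s(-,-) = 0
    if a == "-" or b == "-":
        return gap                    # s(a,-) = s(-,a) = gap score
    return match if a == b else mismatch

def sp_column(column, **scores):
    """SP score of one column: sum of s over all pairs k < l."""
    return sum(pair_score(a, b, **scores) for a, b in combinations(column, 2))

def sp_score(alignment, **scores):
    """Score of the whole alignment: sum of its column scores."""
    return sum(sp_column(col, **scores) for col in zip(*alignment))
```

For example, `sp_score(["AC", "AC", "A-"])` adds a fully matching column to a column with one gapped row.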


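Multidimensional dynamic programming

Define

$\alpha_{i_1, i_2, \dots, i_N}$ = max score of an alignment up to the subsequences ending with $x^1_{i_1}, x^2_{i_2}, \dots, x^N_{i_N}$.

The recursion takes the maximum over all $2^N - 1$ ways of placing gaps in this column.

For $N$ sequences of length $L$: $O(L^N)$ space, $O(2^N L^N)$ time.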
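A minimal sketch of the full dynamic programme, assuming SP column scores with made-up match, mismatch and gap values. It returns only the optimal score, not the alignment, and is usable only for a few very short sequences, which is exactly the point of the cost estimate above.

```python
from functools import lru_cache
from itertools import combinations, product

def sp_column(column, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column, with s(-,-) = 0 (illustrative values)."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue
        total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total

def optimal_msa_score(seqs):
    """Max SP score over all alignments of seqs, by N-dimensional DP."""
    n = len(seqs)
    # Each move says which sequences contribute a residue to the last column;
    # excluding the all-gap column leaves 2^N - 1 possibilities.
    moves = [m for m in product((0, 1), repeat=n) if any(m)]

    @lru_cache(maxsize=None)
    def alpha(idx):
        if not any(idx):
            return 0                  # all prefixes empty
        best = float("-inf")
        for move in moves:
            prev = tuple(i - d for i, d in zip(idx, move))
            if min(prev) < 0:
                continue              # would run past the start of a sequence
            column = [seqs[k][idx[k] - 1] if move[k] else "-" for k in range(n)]
            best = max(best, alpha(prev) + sp_column(column))
        return best

    return alpha(tuple(len(s) for s in seqs))
```

The memo table holds one entry per index tuple and every entry tries all moves, which is where the space and time costs come from.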


MSA

Carrillo and Lipman (1988),

Lipman, Altschul and Kececioglu (1989).

Can optimally align up to 5-7 protein sequences of up to 200 residues.


Progressive alignment

Align sequences (pairwise) in some (greedy) order.

  • Decisions

  • (1) Order of alignments

  • (2) Alignment of sequence to group (only), or allow group to group

  • (3) Method of alignment, and scoring function


Guide tree

[Figure: two alternative guide trees over the sequences (leaves labelled A, B, C, ...). This one? Or this one?]


Feng & Doolittle (1987)

Overview

(1) Calculate a triangular matrix of the N(N-1)/2 distances between all pairs of N sequences by standard pairwise alignment, converting raw alignment scores to approximate pairwise “distances”.

(2) Construct a guide tree from the distance matrix using an appropriate clustering algorithm.

(3) Starting from the first node added to the tree, align the child nodes (which may be two sequences, a sequence and an alignment, or two alignments). Repeat for all other nodes in the order that they were added to the tree, until all sequences have been aligned.
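The guide-tree step can be sketched as follows. Average-linkage clustering is used here purely as a stand-in for the "appropriate clustering algorithm"; it is not claimed to be the method of the original paper. The function name and the distance values in the usage note are invented for illustration.

```python
def guide_tree(names, dist):
    """Average-linkage clustering of a symmetric distance matrix.

    Returns (tree, order): the tree as nested tuples, and the list of
    merges in the order the internal nodes were added.
    """
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}      # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    order = []
    next_id = n
    while len(clusters) > 1:
        i, j = min(d, key=d.get)                          # closest pair of clusters
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        order.append((ti, tj))
        # Distance from the merged cluster to each remaining cluster:
        # size-weighted mean of the two old distances.
        new_d = {}
        for k in clusters:
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            new_d[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree, order
```

With four sequences where A is close to B and C is close to D, `order` lists the (A, B) merge first; the progressive alignment would then visit the nodes in that same order.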


Feng & Doolittle (1987)

sequence-to-group

Best pairwise alignment determines alignment to group.

[Figure, built up over several slides: a new sequence is aligned to an existing group of aligned sequences. Annotation on one frame: “This column is encouraged because it has no cost”.]


Feng & Doolittle (1987)

group-to-group

Best pairwise alignment determines alignment of groups.

[Figure, built up over several slides: two groups of aligned sequences are merged into a single alignment.]

Feng & Doolittle (1987)

After an alignment is completed, gap symbols are replaced by “X”.

“Once a gap, always a gap”.

This encourages gaps to occur in the same columns in subsequent alignments.

Implemented by PILEUP (from the GCG package).
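A small sketch of the “once a gap, always a gap” rule. It assumes that the neutral symbol “X” scores zero against any other symbol, which is what makes such columns free in later alignments. The scoring values and function names are illustrative only.

```python
def freeze_gaps(alignment):
    """Replace gap symbols in a finished alignment by the neutral symbol 'X'."""
    return [row.replace("-", "X") for row in alignment]

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Pairwise score in which 'X' costs nothing against any symbol (assumption)."""
    if a == "X" or b == "X":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch
```

Because a new gap placed opposite an “X” is not penalised, later alignments tend to reuse the columns where gaps were already opened.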


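Profile alignment

group-to-group

[Figure: two existing alignments, A and B, aligned to one another as groups.]

Total alignment score = score (A) + score (B) + score (A*B), where score (A*B) collects the pairs with one sequence in A and the other in B.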
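The decomposition can be checked numerically under SP scoring with linear gap scores and s(-,-) = 0. The following sketch uses invented scoring values; `cross_score` is a name made up here for the A*B term.

```python
from itertools import combinations

def s(a, b, match=1, mismatch=-1, gap=-2):
    if a == "-" and b == "-":
        return 0                      # all-gap pairs are free
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_score(alignment):
    """Sum over columns of all pairs within one alignment."""
    return sum(s(a, b) for col in zip(*alignment) for a, b in combinations(col, 2))

def cross_score(aln_a, aln_b):
    """Sum over columns of all pairs with one row from A and one from B."""
    total = 0
    for col_a, col_b in zip(zip(*aln_a), zip(*aln_b)):
        total += sum(s(a, b) for a in col_a for b in col_b)
    return total

A = ["AC-T", "AC-T"]   # group A after a gap column was inserted
B = ["ACGT", "A-GT"]   # group B
total = sp_score(A + B)
parts = sp_score(A) + sp_score(B) + cross_score(A, B)
```

Since s(-,-) = 0, inserting an all-gap column into A leaves score (A) unchanged, so only the cross term needs to be optimised when the two groups are aligned.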


CLUSTAL W

  • Thompson, Higgins and Gibson (1994).

  • Widely used implementation of profile-based progressive multiple alignment.

  • Similar to the Feng-Doolittle method, except for its use of profile alignment methods.

  • Overview:

    • Calculate a triangular matrix of the N(N-1)/2 distances between all pairs of N sequences by standard pairwise alignment, converting raw alignment scores to approximate pairwise “distances”.

    • Construct a guide tree from the distance matrix using a neighbour-joining clustering algorithm.

    • Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment.

  • Plus many other heuristics.


CLUSTAL W heuristics

  • Closely related sequences are aligned with hard matrices (BLOSUM80) and distant sequences are aligned with soft matrices (BLOSUM50).

  • Hydrophobic residues (which are more likely to be buried) are given higher gap penalties than hydrophilic residues (which are more likely to be surface-accessible).

  • Gap-open penalties are also decreased if the position is spanned by 5 or more consecutive hydrophilic residues.


  • Both gap-open penalties and gap-extend penalties are increased if there are no gaps in a column but gaps occur nearby in the alignment. This rule tries to force all gaps to occur in the same places in an alignment.

  • In the progressive alignment stage, if the score of an alignment is low, the guide tree may be adjusted on the fly to defer the low scoring alignment until later in the progressive alignment phase when more profile information has been accumulated.


Iterative refinement

i.e. “hill climbing”. Slightly change the solution to improve the score. Converges to a local optimum.

e.g. Barton-Sternberg (1987) multiple alignment

(1) Find the two sequences with the highest pairwise similarity and align them using standard dynamic programming alignment.

(2) Find the sequence most similar to a profile of the alignment of the first two, and align it to the first two by profile-sequence alignment. Repeat until all sequences have been included in the multiple alignment.

(3) Remove sequence $x^1$ and realign it to a profile of the other aligned sequences $x^2, \dots, x^N$ by profile-sequence alignment. Repeat for sequences $x^2, \dots, x^N$.

(4) Repeat the previous realignment step a fixed number of times, or until the alignment score converges.
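The remove-and-realign loop can be sketched as a simple hill climb. This is not the Barton-Sternberg procedure itself: real profile-sequence alignment is replaced by a toy realigner that only tries placing a single contiguous gap block, and the SP scoring values are assumptions.

```python
from itertools import combinations

def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score with s(-,-) = 0 (illustrative values)."""
    total = 0
    for col in zip(*alignment):
        for a, b in combinations(col, 2):
            if a == "-" and b == "-":
                continue
            total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total

def placements(seq, width):
    """Toy realigner: pad seq to `width` with one gap block at every position."""
    pad = width - len(seq)
    return [seq[:i] + "-" * pad + seq[i:] for i in range(len(seq) + 1)]

def refine(alignment, max_rounds=10):
    """Remove each row in turn, realign it, and keep any change that scores higher."""
    alignment = list(alignment)
    best = sp_score(alignment)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(alignment)):
            row = alignment[i]
            seq = row.replace("-", "")            # the removed, ungapped sequence
            for candidate_row in placements(seq, len(row)):
                candidate = alignment[:i] + [candidate_row] + alignment[i + 1:]
                score = sp_score(candidate)
                if score > best:
                    alignment, best, improved = candidate, score, True
        if not improved:
            break                                  # converged to a local optimum
    return alignment, best
```

Starting from `["ACGT", "ACGT", "AGT-"]`, the loop moves the gap in the third row until no single-row change improves the score, then stops.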






C_aminophilum AGCT.YCGCATGRAGCAGTG TGAAAA.... ............ACTCCGGT GGTACAGGAT

C_colinum AGTA..GGCATCTACAAGTT GGAAAA.... ............ACTGAGGT GGTATAGGAG

C_lentocellum GGTATTCGCTTGATTATNATAGTAAA.... ............GATTTATC GCCATAGGAT

C_botulinum_D TTTA.TGGCATCATACATAAAATAATCAAA ..........GGAGCAATCC GCTTTGAGAT

C_novyi_A TTTA.CGGCAT....CGTAG AATAATCAAA ..........GGAGCAATCC GCTTTGAGAT

C_gasigenes AGTT.TCGCATGAAACA... GC.AATTAAA ..........GGAGAAATCC GCTATAAGAT

C_aurantibutyricum A.NT.TCGCATGGAGCA... AC.AATCAAA ..........GGAGCAAT.CACTATAAGAT

C_sp_C_quinii AGTT.T.GCATGGGACA... GC.AATTAAA ..........GGAGCAATCC GCTATGAGAT

C_perfringens AAGA.TGGCAT.CATCA... TTCAACCAAA ..........GGAGCAATCC GCTATGAGAT

C_cadaveris TTTT.CTGCATGGGAAA... GTC.ATGAAA ..........GGAGCAATCC GCTGTAAGAT

C_cellulovorans ATTC.TCGCATGAGAGA... .TGTATCAAA ..........GGAGCAATCC GCTATAAGAT

C_K21 TTGR.TCGCATGATCKAAACATCAAAGGAT ..TTTTCTTTGGAAAATTCCACTTTGAGAT

C_estertheticum TTGA.TCGCATGATCTTAACATCAAAGGAA ..TTT..TTCGG..AATTTCACTTTGAGAT

C_botulinum_A AGAA.TCGCATGATTTTCTTATCAAAGATT ..T............ATT.. GCTTTGAGAT

C_sporogenes AGAA.TCGCATGATTTTCTTATCAAAGATT ..T............ATT.. GCTTTGAGAT

C_argentinense AAGG.TCGCATGACTTTTATACCAAAGGAG ..T............AATCC GCTATGAGAT

C_subterminale AAGG.TCGCATGACTTTTATACCAAAGGAG ..T............AATCC GCTATGAGAT

C_tetanomorphum TTTT.CCGCATGAAAAACTAATCAAAGGAG ..T............AAT.C GCTTTGAGAT

C_pasteurianum AGTT.TCACATGGAGCTTTAATTAAAGGAG ..T............AATCC GCTTTGAGAT

C_collagenovorans TTGA.TCGCATGGTCGAAATATTAAAGGAG ..T............AATCC GCTTACAGAT

C_histolyticum TTTA.ATGCATGTTAGAAAG ATTAAAGGAG ..............CAATCC GCTTTGAGAT

C_tyrobutyricum AGTT.TCACATGGAATTTGG ATGAAAGGAG ..T............AATTC GCTTTGAGAT

C_tetani GGTT.TCGCATGAAACTTTAACCAAAGGAG ..T............AATCT GCTTTGAGAT

C_barkeri GACA.TCGCATGGTGTT... .TTAATGAAA ............ACTCCGGT GCCATGAGAT

C_thermocellum GGCA.TCGTCCTGTTAT... .CAAAGGAGA ............AATCCGGT ...ATGAGAT

Pep_prevotii AGTC.TCGCATGGNGTTATCATCAAAGA.. ..............TTTATC GGTGTAAGAT

C_innocuum ACGGAGCGCATGCTCTGTATATTAAAGCGCCCTTCAAGGCGTGAAC.... ....ATGGAT

S_ruminantium AGTTTCCGCATGGGAGCTTG ATTAAAGATG GCCTCTACTTGTAAGCTATC GCTTTGCGAT


TCAAAGGAG



Alignment - considerations

  • The programs simply try to maximize the number of matches

    • The “best” alignment may not be the correct biological one

  • Multiple alignments are done progressively

    • Such alignments get progressively worse as you add sequences

    • Mistakes that occur during alignment process are frozen in.

  • Unless the sequences are very similar, you will almost certainly have to correct the alignment manually


Manual alignment software

  • Geneious 2.0, Java application: http://www.geneious.com/

  • CINEMA, Java applet, available from: http://www.biochem.ucl.ac.uk

  • Seqapp/Seqpup, Mac/PC/UNIX, available from: http://iubio.bio.indiana.edu

  • Se-Al for Macintosh, available from: http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html

  • BioEdit for PC, available from: http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html



[Figure: alignment annotated “Missing G” and “Extra T”.]


What makes a good alignment

[Figure: a sequence alignment compared with a structural alignment.]



I hate ad hoc algorithms and manual sequence alignment! Is there an alternative?


An evolutionary hypothesis

Knowing the rates of different events (substitutions, insertions and deletions) provides a method of assessing the probability of these observations, given this hypothesis: Pr{D|T,Q}

D: the observations (the sequences)

T: the evolutionary tree

Q: parameters of the evolutionary process

[Figure: the hypothesis/model drawn as a tree, labelled with the sequence AG and the events G->A, Insert CC, Insert T, G->C, T->C and Delete G. Observations: AAT, AAC, AC, ACCG, ACC.]
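As a much reduced illustration of Pr{D|T,Q}, the sketch below scores two already aligned, gap-free DNA sequences separated by a single branch of length t. The Jukes-Cantor substitution model is an assumption chosen for simplicity and is not named in the slides; insertions and deletions are ignored entirely.

```python
import math

def jc69_log_likelihood(x, y, t):
    """log Pr{x, y | t} under Jukes-Cantor, for t > 0 expected substitutions per site."""
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay      # probability a site is unchanged after time t
    p_diff = 0.25 - 0.25 * decay      # probability of one specific different base
    log_lik = 0.0
    for a, b in zip(x, y):
        # 1/4 for the starting base, then the transition probability.
        log_lik += math.log(0.25) + math.log(p_same if a == b else p_diff)
    return log_lik
```

Comparing the value at different t shows how the data favour some hypotheses over others: for `"AAT"` and `"AAC"`, an intermediate branch length scores higher than either a very short or a very long one.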


Statistics: fitting versus modeling

  • Statistical fitting of sequence variation

    • Count frequencies of changes in real data sets

    • Build empirical statistical descriptions of the data (Blosum62)

    • Compare observed frequencies to well defined null hypothesis for testing (log-odds ratio and scores)

    • Use scores in ad hoc algorithms for search and alignment (BLAST and ClustalX)

  • Probabilistic models of sequence evolution

    • Describe a probabilistic model in terms of a process of evolution, rates of substitution, insertion and deletion

    • Estimate parameters of the models and compare models using model comparison (likelihood ratios, Bayes factors)

    • Use maximum likelihood and Bayesian inference to co-estimate (uncertainty in) alignment and evolutionary history.
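The “statistical fitting” route in the first list can be sketched as follows: count aligned pairs, compare the observed pair frequencies with what independence would predict, and take the log-odds ratio as a score. The function name and the toy counts in the usage note are invented, and real matrices such as BLOSUM62 involve further steps (sequence clustering, scaling, rounding) that are not shown.

```python
import math
from collections import Counter

def log_odds_scores(aligned_pairs):
    """Scores log2(q_ab / (p_a * p_b)) estimated from a list of aligned symbol pairs."""
    pair_counts = Counter()
    residue_counts = Counter()
    for a, b in aligned_pairs:
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1      # count both orders so the scores are symmetric
        residue_counts[a] += 1
        residue_counts[b] += 1
    n_pairs = sum(pair_counts.values())
    n_residues = sum(residue_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        q_ab = count / n_pairs                    # observed pair frequency
        p_a = residue_counts[a] / n_residues      # background frequencies (null model)
        p_b = residue_counts[b] / n_residues
        scores[(a, b)] = math.log2(q_ab / (p_a * p_b))
    return scores
```

With three A/A pairs, three B/B pairs and two A/B pairs, identical pairs are seen more often than chance predicts and get a positive score, while A/B gets a negative one.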


Probabilistic models and biology

3D structure of myoglobin, showing six alpha-helices.



What does the future hold?

  • No single “true” alignment

    • In most situations there is a set of alignments that are consistent with the observations

    • Understanding this uncertainty is as important as understanding the “best” alignment

  • Explicit evolutionary model-based methods

    • Methods that co-estimate alignment and phylogeny are beginning to appear

    • Co-estimation of protein structure and alignment using evolutionary models may be on the horizon

  • Death of manual sequence alignment?

