Multiple alignment: Feng-Doolittle algorithm

# Multiple alignment: Feng-Doolittle algorithm - PowerPoint PPT Presentation

## PowerPoint Slideshow about 'Multiple alignment: Feng-Doolittle algorithm' - gema

Presentation Transcript

### Multiple alignment: Feng-Doolittle algorithm

Why multiple alignments?
• Alignment of more than two sequences
• Usually gives better information about conserved regions and function (more data)
• Better estimate of significance when using a sequence of unknown function
• Must use multiple alignments when establishing phylogenetic relationships
Dynamic programming extended to many dimensions?
• No – uses up too much computer time and space
• E.g. 200 amino acids in a pairwise alignment: must evaluate $4 \times 10^{4}$ matrix elements
• If 3 sequences, $8 \times 10^{6}$ matrix elements
• If 6 sequences, $6.4 \times 10^{13}$ matrix elements
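The counts above follow from one matrix axis per sequence. A minimal sketch, assuming every sequence has the same length and ignoring the extra boundary row and column of a real dynamic-programming matrix (the function name `dp_cells` is illustrative, not from the slides):

```python
def dp_cells(length: int, n_sequences: int) -> int:
    """Cells in a full dynamic-programming matrix with one axis per sequence."""
    return length ** n_sequences


for k in (2, 3, 6):
    # Prints the cell count in scientific notation for k sequences of length 200.
    print(f"{k} sequences: {dp_cells(200, k):.1e} matrix elements")
```

The count grows exponentially with the number of sequences, which is why a heuristic is needed.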
Need to find more efficient method
• Give up the guarantee of an optimal alignment in exchange for a good alignment found much faster
Feng-Doolittle algorithm
• Does all pairwise alignments and scores them
• Converts pairwise scores to “distances”
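• $D = -\log S_{\mathrm{eff}} = -\log\left[\dfrac{S_{\mathrm{obs}} - S_{\mathrm{rand}}}{S_{\mathrm{max}} - S_{\mathrm{rand}}}\right]$
• $S_{\mathrm{obs}}$ = observed pairwise alignment score
• $S_{\mathrm{rand}}$ = expected score for a random alignment
• $S_{\mathrm{max}}$ = average of the self-alignment scores of the two sequences
As $S_{\mathrm{obs}}$ approaches $S_{\mathrm{rand}}$ (increasing evolutionary distance), $S_{\mathrm{eff}}$ falls toward 0; taking $-\log$ turns it into a distance that is positive and grows with divergence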
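The score-to-distance conversion can be sketched as follows. The slides do not say which logarithm base is used, so the natural log here is an assumption, and the function and parameter names are illustrative:

```python
import math


def feng_doolittle_distance(s_obs, s_rand, s_self_a, s_self_b):
    """Convert a pairwise alignment score into a distance D = -log(S_eff)."""
    # S_max is the average of the two self-alignment scores.
    s_max = (s_self_a + s_self_b) / 2.0
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        # The observed score is no better than random; the log is undefined.
        raise ValueError("S_eff must be positive to take the logarithm")
    return -math.log(s_eff)


# Hypothetical scores: observed 60, random 20, both self-alignments 100.
# S_eff = (60 - 20) / (100 - 20) = 0.5, so D = ln 2.
print(feng_doolittle_distance(60, 20, 100, 100))
```

Identical sequences give S_eff = 1 and a distance of 0; scores near the random expectation give large distances.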
Once the distances have been calculated, construct a guide tree (more in the phylogeny class); the tree gives the order in which the sequences are grouped
• Sequences can be aligned with sequences or groups; groups can be aligned with groups
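How a guide tree yields a grouping order can be sketched with simple average-linkage clustering. The slides do not specify the tree-building method, so average linkage is a stand-in chosen for brevity, and the names and distances below are hypothetical:

```python
from itertools import combinations


def guide_tree_order(names, dist):
    """Return the merge order implied by average-linkage clustering.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    """
    clusters = [(name,) for name in names]
    merges = []

    def average_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the two closest clusters (sequences or groups).
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


# Hypothetical distances for four sequences.
d = {
    frozenset(("A", "B")): 0.10, frozenset(("C", "D")): 0.20,
    frozenset(("A", "C")): 0.80, frozenset(("A", "D")): 0.90,
    frozenset(("B", "C")): 0.70, frozenset(("B", "D")): 0.85,
}
for left, right in guide_tree_order(["A", "B", "C", "D"], d):
    print(left, "+", right)
```

With these distances, A joins B first, C joins D next, and the two groups are aligned last, which covers both the sequence-sequence and the group-group case.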
Sequence-sequence alignments: dynamic programming
• Sequence-group alignments: the sequence is aligned pairwise against every member of the group; the highest-scoring pair determines how the sequence is aligned to the group
• Group-group alignments: every sequence in one group is aligned pairwise against every sequence in the other; the highest-scoring pair determines how the two groups are aligned
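The "try every pair, keep the best" rule can be sketched as below. This is a simplified illustration: the scoring scheme (match +1, mismatch -1, linear gap -2) is an assumption rather than anything from the slides, and groups are represented as lists of ungapped sequences instead of existing alignments.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_pair(group_a, group_b):
    """Pick the highest-scoring pair, one sequence from each group."""
    pairs = ((x, y) for x in group_a for y in group_b)
    return max(pairs, key=lambda pair: nw_score(*pair))


# A single sequence is treated as a group of one.
print(best_pair(["GATTACA"], ["GATTTCA", "CCCCCCC"]))
```

The pair returned here would then be the one whose pairwise alignment dictates how the two groups are merged.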
Example

[Figure: guide tree joining Seq1 through Seq5; intermediate Alignments 1, 2, and 3 are merged into the final alignment]

Notice that this method does not guarantee the optimum alignment; just a good one.

Gaps are preserved from alignment to alignment: “once a gap, always a gap”
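One way to picture "once a gap, always a gap": when a later step needs a new gap in an existing group, a whole gap column is inserted into every row, and gaps already present are never removed. A minimal sketch, with a hypothetical helper name and column indices that refer to the group before insertion:

```python
def insert_gap_columns(group, columns, gap="-"):
    """Insert a gap column into every row of an aligned group.

    columns lists the indices (in the original alignment) before which
    a gap column is added. Existing gaps are left untouched.
    """
    widened = []
    for row in group:
        chars = list(row)
        # Insert from right to left so earlier indices stay valid.
        for col in sorted(columns, reverse=True):
            chars.insert(col, gap)
        widened.append("".join(chars))
    return widened


group = ["AC-GT", "ACTGT"]
print(insert_gap_columns(group, [2]))  # ['AC--GT', 'AC-TGT']
```

The gap already in the first row survives, and both rows stay the same length.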

In-class exercise
• Retrieve sequences from multalign.apr into BioScout
• Run Gap in BioScout on all combinations of the sequences in multalign.apr; use a gap penalty of 6 and an extension penalty of 2
• Record alignment scores of each pairwise comparison
• Save pairwise alignments
In-class exercise, cont.
• Use raw alignment scores as distance measures; make a guide tree based on these scores
• In Vector NTI, select all sequences in multalign.apr (in the sequence pane); choose Alignment from the toolbar at the top, then Alignment Setup from the pulldown; choose multiple alignment, accept the defaults, and choose OK; choose Alignment again, then Align Selected Sequences from the pulldown
In-class exercise, cont.
• Note that ClustalW does some other things that the Pileup program discussed on the tape does not; we are going to ignore those things for the moment
• Compare ClustalW’s guide tree (visible in the Phylogenetic Tree Pane – tab at bottom of window) with yours
In-class exercise, cont.
• Carefully examine ClustalW’s alignment; compare it to the individual pairwise alignments you saved. Are there differences?
Start refining alignment:
• Use structural info if you have it
• Find patterns if you don’t
• Use amino acid structure handout from beginning of class for substitution decisions!
ClustalW
• Most widely used multiple alignment method
• Similar strategy to the Feng-Doolittle approach implemented as Pileup, but more complex and gives generally superior results
• Ad hoc nature of the program can be mysterious
• Gap penalties vary locally:
• By observed frequency (in database) after each residue
• By simple structure prediction – lower gap penalties in probable loop regions
• By proximity to existing gaps – higher gap penalties when within 8 residues of an existing gap
• Change in substitution matrix choice depending on distance computed for guide tree
• Substitution matrix families
• Profile construction (more later)
• Weighting of sequences in profiles depending on evolutionary distance computed for guide tree
• More similar sequences get less weight than less similar sequences
In-class exercise II
• Change a few parameters in the ClustalW program (gap, gap extension, substitution matrix, etc.) one at a time: this is done in Alignment Setup. After each run with a different change, save the alignment project with some descriptive name that you can remember (e.g., gap20 or blosum)
• Compare alignment results with different parameters changed
MultAlin
• MultAlin is also a heuristic algorithm that builds up a multiple alignment from a group of pairwise alignments
• It differs from Pileup and Clustal in that the guide tree is recalculated based on the results of each alignment step
• Because this leads to cycles of tree building and alignment, MultAlin can take a long time to run. It stops once the overall alignment score stops improving
Scoring a multiple sequence alignment
• Assumptions:
• Sequences (rows) independent
• Positions (columns) independent
• Neither assumption is true …
• Score of a column is the (possibly weighted) sum of all the pairwise comparisons (i.e., substitution matrix values) within that column
• Score of a multiple alignment is the sum of scores for all columns
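The column-sum scoring described above can be sketched as follows. The scoring values (match +1, mismatch -1, residue-gap -2, gap-gap 0) stand in for a real substitution matrix, and weighting each pair by the product of its two sequence weights is one possible convention; both are assumptions, not taken from the slides.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols; stands in for a substitution matrix."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment, weights=None):
    """Sum over columns of the (optionally weighted) pairwise scores."""
    n = len(alignment)
    if weights is None:
        weights = [1.0] * n
    total = 0.0
    for column in zip(*alignment):
        for i, j in combinations(range(n), 2):
            total += weights[i] * weights[j] * pair_score(column[i], column[j])
    return total


# Column scores are 3, -1, -3, and 3, so the total is 2.0.
print(sum_of_pairs(["AC-T", "ACGT", "AGGT"]))
```

Because columns are scored independently and rows are compared only in pairs, the sketch makes the two independence assumptions explicit.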