Multiple Sequence Alignments



## PowerPoint Slideshow about 'Multiple Sequence Alignments' - carina


### Multiple Sequence Alignments

Progressive Alignment

• When the evolutionary tree is known:
• Align the closest sequences first, in the order given by the tree
• In each step, align two sequences x, y, or two profiles p_x, p_y, to generate a new alignment with associated profile p_result

Weighted version:

• Tree edges have weights, proportional to the divergence along each edge
• The new profile is a weighted average of the two old profiles

[Figure: guide tree with leaves x, y, z, w; merging produces profiles p_xy and p_zw, then the root profile p_xyzw]
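The merge step above can be sketched in Python. This is a minimal illustration, assuming a profile is represented as a list of columns, each column a dict mapping symbol to frequency; the representation and function name are assumptions, not from any particular tool.

```python
# Minimal sketch of merging two aligned profiles (representation assumed).

def merge_profiles(p_x, p_y, w_x=0.5, w_y=0.5):
    """Combine two already-aligned profiles of equal length into one,
    taking a weighted average of per-column symbol frequencies
    (w_x, w_y can be set from the divergence on the tree edges)."""
    assert len(p_x) == len(p_y)
    merged = []
    for col_x, col_y in zip(p_x, p_y):
        col = {}
        for sym in set(col_x) | set(col_y):
            col[sym] = w_x * col_x.get(sym, 0.0) + w_y * col_y.get(sym, 0.0)
        merged.append(col)
    return merged
```

With equal weights this reduces to a plain average; the weighted version simply plugs in edge-derived weights.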

Progressive Alignment


• When evolutionary tree is unknown:
• Perform all pairwise alignments
• Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment
• Construct a tree (UPGMA / Neighbor Joining / Other methods)
• Align on the tree

[Figure: sequences x, y, z, w; tree topology unknown]
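The tree-construction step in this pipeline can be sketched compactly for the UPGMA case. The dict-of-pairs distance matrix and nested-tuple tree output below are assumptions made for illustration.

```python
# Sketch of UPGMA guide-tree construction from a precomputed distance matrix.

def upgma(D, names):
    """D[(a, b)]: evolutionary distance for labels a, b (both orders present).
    Repeatedly merge the closest pair of clusters, size-averaging distances;
    return the guide tree as nested tuples."""
    size = {n: 1 for n in names}      # cluster -> number of leaves
    dist = dict(D)
    while len(size) > 1:
        a, b = min(((p, q) for p in size for q in size if p != q),
                   key=lambda pair: dist[pair])
        ab = (a, b)
        for c in size:
            if c not in (a, b):
                d = (size[a] * dist[(a, c)] + size[b] * dist[(b, c)]) \
                    / (size[a] + size[b])
                dist[(ab, c)] = dist[(c, ab)] = d
        size[ab] = size.pop(a) + size.pop(b)
    return next(iter(size))
```

Neighbor joining follows the same merge-loop structure but uses a different criterion for picking the pair to join.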

Heuristics to improve alignments
• Iterative refinement schemes
• A*-based search
• Consistency
• Simulated Annealing
Iterative Refinement

One problem of progressive alignment:

• Initial pairwise alignments are “frozen”, even when new evidence arrives

Example:

x: GAAGTT

y: GAC-TT

z: GAACTG

w: GTACTG

Frozen! Once x and y have been aligned, their pairwise alignment cannot be revised.

With z and w added, it is now clear that the correct alignment is y = G-ACTT.

Iterative refinement fixes this: allow y to vary, keeping the projection onto x and z fixed.

Iterative Refinement

Algorithm (Barton-Sternberg):

• For j = 1 to N:

Remove x_j, and realign it to the profile of x_1…x_{j-1} x_{j+1}…x_N

• Repeat until convergence

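The Barton-Sternberg loop can be sketched as follows; here realign_to_rest is a hypothetical callback standing in for "realign the removed sequence against the profile of the remaining ones".

```python
# Sketch of Barton-Sternberg iterative refinement (callback name assumed).

def iterative_refinement(alignment, realign_to_rest, max_rounds=10):
    """alignment: list of gapped rows. Repeatedly remove each row and
    realign it against the rest, until a full pass changes nothing."""
    for _ in range(max_rounds):
        changed = False
        for j in range(len(alignment)):
            rest = alignment[:j] + alignment[j + 1:]
            new_row = realign_to_rest(alignment[j], rest)
            if new_row != alignment[j]:
                alignment[j] = new_row
                changed = True
        if not changed:
            break
    return alignment
```

The convergence check is the key point: a round in which no sequence moves means the alignment is stable under single-sequence realignment.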

Iterative Refinement

Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTA

y: GAC-TTA

z: GAACTGA

w: GTACTGA

After realigning y:

x: GAAGTTA

y: G-ACTTA (gains 3 matches)

z: GAACTGA

w: GTACTGA

Variant:

Refinement on a tree

Iterative Refinement

Example not handled well:

x: GAAGTTA

y1: GAC-TTA

y2: GAC-TTA

y3: GAC-TTA

z: GAACTGA

w: GTACTGA

• Realigning any single y_i changes nothing: the other identical copies of y dominate the profile and keep the gap where it is
Consistency

[Figure: three sequences x, y, z. The pairwise alignment a_xz matches x_i to z_k, and a_yz matches z_k to y_j′, suggesting x_i should align to y_j′ rather than to y_j as in a_xy]

Consistency


Basic method for applying consistency

• Compute all pairwise alignments a_xy, a_xz, a_yz, …
• When aligning x, y during progressive alignment:
• For each pair (x_i, y_j), let s(x_i, y_j) = function_of(x_i, y_j, a_xz, a_yz)
• Align x and y with DP using the modified s(·,·) function

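The slide leaves function_of abstract. One illustrative, assumed way to combine the direct substitution score with transitive support through a third sequence z:

```python
# Toy sketch of a consistency-modified score; this particular blend is an
# illustrative assumption, not the scheme of any specific aligner.

def consistent_score(s_direct, s_xz, s_zy, alpha=0.5):
    """Blend the direct score s(x_i, y_j) with the transitive evidence
    through z: the scores s(x_i, z_k) and s(z_k, y_j)."""
    return alpha * s_direct + (1 - alpha) * min(s_xz, s_zy)
```

Whatever the exact blend, the effect is the same: a pair (x_i, y_j) supported by third sequences gets a boosted score before the DP alignment of x and y.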

Some Resources

Genome Resources

• Annotation and alignment genome browser at UCSC: http://genome.ucsc.edu/cgi-bin/hgGateway
• Specialized VISTA alignment browser at LBNL: http://pipeline.lbl.gov/cgi-bin/gateway2

Protein Multiple Aligners

• CLUSTALW – most widely used: http://www.ebi.ac.uk/clustalw/
• MUSCLE – most scalable: http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
• PROBCONS – most accurate: http://probcons.stanford.edu/

MUSCLE at a glance
• Fast measurement of all pairwise distances between sequences
• D_DRAFT(x, y) defined in terms of the number of common k-mers (k ≈ 3) – O(N² L log L) time
• Build tree T_DRAFT based on those distances, with UPGMA
• Progressive alignment over T_DRAFT, resulting in multiple alignment M_DRAFT
• Measure new Kimura-based distances D(x, y) based on M_DRAFT
• Build tree T based on D
• Progressive alignment over T, to build M
• Only perform alignment steps for the parts of the tree that have changed
• Iterative refinement; for many rounds, do:
• Tree Partitioning: split M on one branch and realign the two resulting profiles
• If the new alignment M′ has a better sum-of-pairs score than the previous one, accept it
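The first-stage draft distance can be illustrated with a toy k-mer measure. MUSCLE's actual formula differs; this sketch just measures the fraction of k-mers two sequences share, which is cheap because no alignment is needed.

```python
# Illustrative k-mer draft distance (MUSCLE's exact formula differs).

def kmer_distance(x, y, k=3):
    """1 minus the fraction of distinct k-mers shared between x and y."""
    kx = {x[i:i + k] for i in range(len(x) - k + 1)}
    ky = {y[i:i + k] for i in range(len(y) - k + 1)}
    return 1.0 - len(kx & ky) / min(len(kx), len(ky))
```

Computing this for all pairs touches each sequence's k-mers once per pair, which is where the O(N² L log L) bound for the first stage comes from.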
PROBCONS at a glance
• Computation of all posterior matrices M_xy: M_xy(i, j) = Prob(x_i ~ y_j), using an HMM
• Re-estimation of posterior matrices M′_xy with probabilistic consistency:
• M′_xy(i, j) = (1/N) Σ_z Σ_k M_xz(i, k) · M_zy(k, j); that is, M′_xy = Avg_z(M_xz M_zy)
• Compute, for every pair x, y, the maximum expected accuracy alignment:
• A_xy: the alignment A that maximizes Σ_{(i, j) aligned in A} M′_xy(i, j)
• Define E(x, y) = Σ_{(i, j) aligned in A_xy} M′_xy(i, j)
• Build tree T with hierarchical clustering, using similarity measure E(x, y)
• Progressive alignment on T to maximize E(·,·)
• Iterative refinement; for many rounds, do:
• Randomized Partitioning: split the sequences in M into two subsets by flipping a coin for each sequence, and realign the two resulting profiles
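The consistency re-estimation M′_xy = Avg_z(M_xz M_zy) is just an average of matrix products. A NumPy sketch, assuming posteriors are stored keyed by ordered sequence pair, with the identity matrix on the diagonal pairs:

```python
# Sketch of the probabilistic-consistency step (storage scheme assumed).
import numpy as np

def consistency_transform(M, seqs):
    """M[(a, b)]: posterior matrix for pair (a, b); M[(a, a)] is the
    identity. Returns M'[(x, y)] = (1/N) * sum_z M_xz @ M_zy."""
    M_new = {}
    for x in seqs:
        for y in seqs:
            if x == y:
                continue
            M_new[(x, y)] = sum(M[(x, z)] @ M[(z, y)] for z in seqs) / len(seqs)
    return M_new
```

Including z = x and z = y in the sum contributes the original M_xy terms, so the transform mixes direct and transitive evidence.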

### Rapid Global Alignments

How to align genomic sequences in (more or less) linear time

Motivation
• Genomic sequences are very long:
• Human genome ≈ 3 × 10⁹ bp
• Mouse genome ≈ 2.7 × 10⁹ bp
• Aligning genomic regions is useful for revealing common gene structure
• It is useful to compare regions more than 1,000,000 bp long
Main Idea

Genomic regions of interest contain islands of similarity, such as genes

• Find local alignments
• Chain an optimal subset of them
• Refine/complete the alignment

Systems that use this idea to various degrees:

MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA, & others

Saving cells in DP
• Find local alignments
• Chain them – O(N log N) via the longest increasing subsequence (L.I.S.)
• Restricted DP

Methods to CHAIN Local Alignments

Sparse Dynamic Programming

O(N log N)

The Problem: Find a Chain of Local Alignments

(x, y) → (x′, y′) requires

x < x′

y < y′

Each local alignment has a weight.

FIND the chain with highest total weight

• Build a Directed Acyclic Graph (DAG):
• Nodes: local alignments [(x_a, x_b), (y_a, y_b)] with their scores
• Directed edges: between local alignments that can be chained
• Edge ((x_a, x_b, y_a, y_b), (x_c, x_d, y_c, y_d)) exists when
• x_a < x_b < x_c < x_d and
• y_a < y_b < y_c < y_d

Each local alignment is a node v_i with alignment score s_i.

Initialization:

For each node v_a with no incoming edge (u, v_a), set V(a) = s_a

Iteration:

For each v_i, the optimal path ending in v_i has total score:

V(i) = max_{j s.t. there is an edge (v_j, v_i)} ( weight(v_j, v_i) + V(j) )

Termination:

Optimal global chain: j* = argmax_j V(j); trace the chain back from v_{j*}
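The DAG recurrence above can be sketched directly as a quadratic-time DP over alignments sorted by start coordinate; the sparse O(N log N) chaining is the subject of the following slides.

```python
# Quadratic-time sketch of chaining local alignments on the implicit DAG.

def chain_alignments(alns):
    """alns: list of (xa, xb, ya, yb, score) local alignments.
    Returns the best total score of a chain."""
    alns = sorted(alns)                       # by start x-coordinate
    V = [a[4] for a in alns]                  # best chain ending at i
    for i, (xa, xb, ya, yb, s) in enumerate(alns):
        for j in range(i):
            _, jxb, _, jyb, _ = alns[j]
            if jxb < xa and jyb < ya:         # j ends before i starts
                V[i] = max(V[i], V[j] + s)
    return max(V) if V else 0
```

Each edge of the DAG is examined once, giving O(N²); the sparse methods avoid enumerating edges explicitly.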

Sparse Dynamic Programming

Back to the LCS problem:

• Given two sequences
• x = x_1, …, x_m
• y = y_1, …, y_n
• Find the longest common subsequence
• How about when the “hits” x_i = y_j are sparse?
Sparse Dynamic Programming
• Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead
Sparse Dynamic Programming – L.I.S.
• Longest Increasing Subsequence
• Given a sequence over an ordered alphabet
• x = x_1, …, x_m
• Find the longest subsequence
• s = s_1, …, s_k
• such that s_1 < s_2 < … < s_k
Sparse Dynamic Programming – L.I.S.

Let the input be w: w_1, …, w_n

INITIALIZATION:

L: 1-indexed array, L[1] ← w_1

B: 0-indexed array of backpointers; B[0] = 0

P: array used for traceback

// L[j]: smallest last element w_i of a j-long increasing subsequence seen so far

ALGORITHM

for i = 2 to n {

Find j such that L[j] < w[i] ≤ L[j+1]

L[j+1] ← w[i]

B[j+1] ← i

P[i] ← B[j]

}

That’s it!!!

• Running time? O(n log n): each of the n steps is a binary search in L
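A runnable version of this algorithm, using Python's bisect for the binary search and parent pointers for the traceback:

```python
# LIS via binary search on the array of smallest tails, O(n log n).
from bisect import bisect_left

def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w."""
    if not w:
        return []
    tail_vals = []           # tail_vals[j]: smallest last element of a (j+1)-long IS
    tail_idx = []            # index into w of that element
    parent = [-1] * len(w)   # backpointers for traceback
    for i, v in enumerate(w):
        j = bisect_left(tail_vals, v)      # first tail >= v
        if j > 0:
            parent[i] = tail_idx[j - 1]
        if j == len(tail_vals):
            tail_vals.append(v)
            tail_idx.append(i)
        else:
            tail_vals[j] = v
            tail_idx[j] = i
    out, i = [], tail_idx[-1]
    while i != -1:
        out.append(w[i])
        i = parent[i]
    return out[::-1]
```

bisect_left finds exactly the slide's j with L[j] < w[i] ≤ L[j+1], so each of the n iterations costs O(log n).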
Sparse LCS expressed as LIS


Create a sequence w

• Every matching point (i, j) is inserted into w as follows:
• For each column j, from smallest to largest, insert into w the points (i, j) in decreasing row (i) order
• The 11 example points are inserted in the order given
• a = (y, x), b = (y′, x′) can be chained iff
• a comes before b in w, and
• y < y′


Sparse LCS expressed as LIS


Create a sequence w

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

Consider now w’s elements as ordered by their row coordinate:

• (y, x) < (y′, x′) if y < y′

Claim: An increasing subsequence of w is a common subsequence of x and y
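The insertion order described above can be sketched as a one-line sort, assuming the match points are given as (row, column) pairs (the function name is illustrative):

```python
# Sketch: build the sequence w from the sparse match points.

def build_w(matches):
    """Order points by column ascending, breaking ties by row descending,
    so that a strictly increasing-by-row subsequence of w never uses two
    points from the same column."""
    return sorted(matches, key=lambda p: (p[1], -p[0]))
```

The descending-row tiebreak within a column is what makes the claim work: an increasing subsequence can pick at most one point per column, so it is a valid common subsequence.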


Sparse Dynamic Programming for LIS

• Algorithm:

initialize an empty array L

/* at each point, l_j will contain the last element of the j-long increasing subsequence with the smallest possible last element seen so far */

for i = 1 to |w|

binary search for w[i] in L, to find j such that l_j < w[i] ≤ l_{j+1}

replace l_{j+1} with w[i]

keep a backpointer from w[i] to l_j

That’s it!!!

Sparse Dynamic Programming for LIS


Example:

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

L =

• (4,2)
• (3,3)
• (3,3) (10,5)
• (2,5) (10,5)
• (2,5) (8,6)
• (1,6) (8,6)
• (1,6) (3,7)
• (1,6) (3,7) (4,8)
• (1,6) (3,7) (4,8) (7,9)
• (1,6) (3,7) (4,8) (5,9)
• (1,6) (3,7) (4,8) (5,9) (9,10)

Longest common subsequence: the increasing subsequence (1,6) (3,7) (4,8) (5,9) (9,10) gives a common subsequence of length 5

Sparse DP for rectangle chaining
• 1, …, N: rectangles
• (h_j, l_j): y-coordinates (top, bottom) of rectangle j
• w(j): weight of rectangle j
• V(j): optimal score of a chain ending in j
• L: list of triplets (l_j, V(j), j)
• L is sorted by l_j, top to bottom
• L is implemented as a balanced binary tree

[Figure: a rectangle spanning y-coordinates h (top) to l (bottom)]

Sparse DP for rectangle chaining

Main idea:

• Sweep through the x-coordinates
• To the right of b, anything chainable to a is also chainable to b
• Therefore, if V(b) > V(a), rectangle a is “useless” – remove it
• In L, rectangles j kept sorted by increasing l_j are then also sorted by increasing V(j)

Sparse DP for rectangle chaining

Go through the rectangle x-coordinates, from lowest to highest:

• When at the leftmost end of i:
• j: rectangle in L with the largest l_j < h_i
• V(i) = w(i) + V(j)
• When at the rightmost end of i:
• k: rectangle in L with the largest l_k ≤ l_i
• If V(i) > V(k):
• INSERT (l_i, V(i), i) into L
• REMOVE all (l_j, V(j), j) with V(j) ≤ V(i) and l_j ≥ l_i

Example

[Figure: five rectangles with weights 1: 5, 2: 6, 3: 3, 4: 4, 5: 2, swept over the x-coordinates 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]

Time Analysis
• Sorting the x-coords takes O(N log N)
• Going through x-coords: N steps
• Each of N steps requires O(log N) time:
• Searching L takes log N
• Inserting to L takes log N
• All deletions are consecutive, so log N per deletion
• Each element is deleted at most once: N log N for all deletions
• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree
Examples

Human Genome Browser
