Multiple Sequence Alignments
Progressive Alignment


  • When evolutionary tree is known:
    • Align closest first, in the order of the tree
    • In each step, align two sequences x, y, or profiles p_x, p_y, to generate a new alignment with associated profile p_result

Weighted version:

    • Tree edges have weights, proportional to the divergence in that edge
    • New profile is a weighted average of the two old profiles (see the sketch below)

[Figure: guide tree with leaves x, y, z, w, merged profiles p_xy and p_zw, and root profile p_xyzw]
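A minimal sketch of the weighted merge step, assuming the profile–profile alignment has already paired up the columns of p_x and p_y; the column contents and edge weights below are made-up toy values.

from collections import Counter

def column_profile(column):
    # Residue frequencies for one alignment column (gaps counted like residues)
    counts = Counter(column)
    total = sum(counts.values())
    return {res: c / total for res, c in counts.items()}

def merge_columns(col_x, col_y, w_x, w_y):
    # Weighted average of two already-paired profile columns -> one column of p_result
    merged = {}
    for res in set(col_x) | set(col_y):
        merged[res] = (w_x * col_x.get(res, 0.0) + w_y * col_y.get(res, 0.0)) / (w_x + w_y)
    return merged

# Toy example: one paired column from p_x (3 sequences) and p_y (2 sequences),
# with tree-edge weights 1.0 and 3.0 (made-up values)
px_col = column_profile("AAG")
py_col = column_profile("AG")
print(merge_columns(px_col, py_col, w_x=1.0, w_y=3.0))   # {'A': 0.54..., 'G': 0.45...}

Per column, "new profile is a weighted average of the two old profiles" amounts to exactly this weight-proportional average.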

Progressive Alignment


  • When evolutionary tree is unknown:
    • Perform all pairwise alignments
    • Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment
    • Construct a tree (UPGMA / Neighbor Joining / Other methods)
    • Align on the tree (see the sketch below)

[Figure: four sequences x, y, z, w with an unknown tree topology]
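A minimal sketch of this pipeline, assuming toy equal-length sequences and a simple mismatch-fraction distance standing in for a real pairwise-alignment distance; UPGMA is obtained here as average-linkage hierarchical clustering via SciPy.

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, to_tree

seqs = {"x": "GAAGTTA", "y": "GACTTAA", "z": "GAACTGA", "w": "GTACTGA"}   # toy data
names = list(seqs)

def toy_distance(a, b):
    # Fraction of mismatching positions; a real pipeline would use a pairwise alignment score
    return sum(c1 != c2 for c1, c2 in zip(a, b)) / len(a)

# Distance matrix D(x, y) from all pairwise comparisons
D = np.array([[toy_distance(seqs[a], seqs[b]) for b in names] for a in names])

# UPGMA guide tree = average-linkage clustering on the condensed distance matrix
tree = to_tree(linkage(squareform(D), method="average"))

def show(node, depth=0):
    label = names[node.id] if node.is_leaf() else f"internal (height {node.dist:.2f})"
    print("  " * depth + label)
    if not node.is_leaf():
        show(node.left, depth + 1)
        show(node.right, depth + 1)

show(tree)   # progressive alignment would then merge profiles bottom-up along this tree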

Heuristics to improve alignments
  • Iterative refinement schemes
  • A*-based search
  • Consistency
  • Simulated Annealing
Iterative Refinement

One problem of progressive alignment:

  • Initial alignments are “frozen” even when new evidence arrives

Example:

x: GAAGTT

y: GAC-TT

z: GAACTG

w: GTACTG

Frozen!

Now it is clear that the correct alignment is y: G-ACTT


Iterative Refinement

Algorithm (Barton–Sternberg):

  • For j = 1 to N:

    • Remove x_j, and realign it to the profile of x_1…x_{j-1}, x_{j+1}…x_N

  • Repeat until convergence

[Figure: x and z held fixed as a projected profile; y is removed and realigned against it — see the sketch below]
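A minimal sketch of the leave-one-out step, assuming the MSA is a list of equal-length gapped strings: removing one sequence and dropping columns that become all-gap yields the fixed projection that the removed sequence is then realigned against (the realignment itself would use a profile aligner and is omitted here).

def leave_one_out(msa, j):
    # Remove row j from an MSA and drop columns that become all gaps.
    # Returns (projected_msa, removed_ungapped_sequence).
    removed = msa[j].replace("-", "")
    rest = msa[:j] + msa[j + 1:]
    keep = [c for c in range(len(rest[0])) if any(row[c] != "-" for row in rest)]
    projected = ["".join(row[c] for c in keep) for row in rest]
    return projected, removed

# Toy MSA: x, y, z (y's gap placement is the questionable one)
msa = ["GAAGTT",
       "GAC-TT",
       "GAACTG"]
projection, y = leave_one_out(msa, 1)
print(projection)   # ['GAAGTT', 'GAACTG'] -- the fixed x,z projection
print(y)            # 'GACTT' -- would now be realigned against the projection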

Iterative Refinement

Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTA

y: GAC-TTA

z: GAACTGA

w: GTACTGA

After realigning y:

x: GAAGTTA

y: G-ACTTA   (+3 matches)

z: GAACTGA

w: GTACTGA

Variant:

Refinement on a tree

Iterative Refinement

Example not handled well:

x: GAAGTTA

y1: GAC-TTA

y2: GAC-TTA

y3: GAC-TTA

z: GAACTGA

w: GTACTGA

  • Realigning any single y_i changes nothing
Consistency

[Figure: sequences x, y, z with positions x_i, y_j, y_j’, z_k; the pairwise alignments of x with z and of z with y suggest which of y_j, y_j’ should be aligned to x_i]

Consistency


Basic method for applying consistency

  • Compute all pairs of alignments xy, xz, yz, …
  • When aligning x, y during progressive alignment,
    • For each pair of positions (x_i, y_j), let s(x_i, y_j) = function_of(x_i, y_j, a_xz, a_yz)
    • Align x and y with DP using the modified s(.,.) function (see the sketch below)

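A minimal sketch of one possible consistency-modified score, assuming the pairwise alignments a_xz and a_yz are given as dicts of matched positions; the flat bonus is an illustrative stand-in for whatever function_of the aligner actually uses.

def consistency_score(i, j, base_score, a_xz, a_yz, bonus=1.0):
    # Boost s(x_i, y_j) when some z_k is aligned to both x_i (in a_xz) and y_j (in a_yz).
    # a_xz: dict {x position -> z position}, a_yz: dict {y position -> z position}
    supported = i in a_xz and j in a_yz and a_xz[i] == a_yz[j]
    return base_score + (bonus if supported else 0.0)

# Toy pairwise alignments (made-up position maps)
a_xz = {0: 0, 1: 1, 2: 3}   # x_2 ~ z_3
a_yz = {0: 0, 2: 3}         # y_2 ~ z_3
print(consistency_score(2, 2, base_score=0.5, a_xz=a_xz, a_yz=a_yz))   # 1.5: z supports x_2 ~ y_2
print(consistency_score(1, 2, base_score=0.5, a_xz=a_xz, a_yz=a_yz))   # 0.5: no support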

Some Resources

Genome Resources

  • Annotation and alignment genome browser at UCSC: http://genome.ucsc.edu/cgi-bin/hgGateway
  • Specialized VISTA alignment browser at LBNL: http://pipeline.lbl.gov/cgi-bin/gateway2

Protein Multiple Aligners

  • CLUSTALW – most widely used: http://www.ebi.ac.uk/clustalw/
  • MUSCLE – most scalable: http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
  • PROBCONS – most accurate: http://probcons.stanford.edu/

MUSCLE at a glance
  • Fast measurement of all pairwise distances between sequences
    • D_DRAFT(x, y) defined in terms of # common k-mers (k ≈ 3) – O(N² L log L) time (see the sketch below)
  • Build tree T_DRAFT based on those distances, with UPGMA
  • Progressive alignment over T_DRAFT, resulting in multiple alignment M_DRAFT
    • Only perform alignment steps for the parts of the tree that have changed
  • Measure new Kimura-based distances D(x, y) based on M_DRAFT
  • Build tree T based on D
  • Progressive alignment over T, to build M
  • Iterative refinement; for many rounds, do:
    • Tree Partitioning: Split M on one branch and realign the two resulting profiles
    • If the new alignment M’ has a better sum-of-pairs score than the previous one, accept it
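A minimal sketch of a k-mer–based draft distance in the spirit of the first step, assuming distance = 1 − (fraction of shared k-mers); MUSCLE's actual k-mer counting and normalization differ in detail.

def kmer_set(seq, k=3):
    # All k-mers occurring in seq
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def draft_distance(x, y, k=3):
    # 1 - fraction of shared k-mers; a rough stand-in for D_DRAFT(x, y)
    kx, ky = kmer_set(x, k), kmer_set(y, k)
    shared = len(kx & ky)
    return 1.0 - shared / min(len(kx), len(ky))

print(draft_distance("GAAGTTA", "GAACTGA"))   # 0.8: only the 3-mer 'GAA' is shared
print(draft_distance("GAAGTTA", "CCCCCCC"))   # 1.0: no shared 3-mers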
PROBCONS at a glance
  • Computation of all posterior matrices M_xy : M_xy(i, j) = Prob(x_i ~ y_j), using an HMM
  • Re-estimation of posterior matrices M’_xy with probabilistic consistency (see the sketch below)
    • M’_xy(i, j) = (1/N) Σ_z Σ_k M_xz(i, k) M_zy(k, j); in matrix form, M’_xy = Avg_z(M_xz M_zy)
  • Compute for every pair x, y, the maximum expected accuracy alignment
    • A_xy: the alignment that maximizes Σ_{(i, j) aligned in A} M’_xy(i, j)
    • Define E(x, y) = Σ_{(i, j) aligned in A_xy} M’_xy(i, j)
  • Build tree T with hierarchical clustering using similarity measure E(x, y)
  • Progressive alignment on T to maximize E(.,.)
  • Iterative refinement; for many rounds, do:
    • Randomized Partitioning: Split sequences in M in two subsets by flipping a coin for each sequence and realign the two resulting profiles
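A minimal sketch of the consistency re-estimation as matrix products, assuming the posterior matrices M_xy are already available as NumPy arrays (random toys here) and that the average runs over all sequences z, with M_xx taken as the identity.

import numpy as np

def consistency_transform(M, lengths):
    # One round of M'_xy = Avg_z(M_xz @ M_zy).
    # M: dict {(x, z): posterior matrix of shape (len x, len z)} for all ordered pairs,
    #    with M[(x, x)] = identity
    names = list(lengths)
    M_new = {}
    for x in names:
        for y in names:
            if x == y:
                continue
            M_new[(x, y)] = sum(M[(x, z)] @ M[(z, y)] for z in names) / len(names)
    return M_new

# Toy posteriors for three sequences of lengths 4, 5, 3 (random stand-ins, not real HMM posteriors)
rng = np.random.default_rng(0)
lengths = {"x": 4, "y": 5, "z": 3}
M = {}
for a, la in lengths.items():
    for b, lb in lengths.items():
        M[(a, b)] = np.eye(la) if a == b else rng.random((la, lb))

M2 = consistency_transform(M, lengths)
print(M2[("x", "y")].shape)   # (4, 5): re-estimated posteriors for aligning x and y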

Rapid Global Alignments

How to align genomic sequences in (more or less) linear time

Motivation
  • Genomic sequences are very long:
    • Human genome = 3 × 10⁹ bases long
    • Mouse genome = 2.7 × 10⁹ bases long
  • Aligning genomic regions is useful for revealing common gene structure
    • Useful to compare regions > 1,000,000 bases long
Main Idea

Genomic regions of interest contain islands of similarity, such as genes

  • Find local alignments
  • Chain an optimal subset of them
  • Refine/complete the alignment

Systems that use this idea to various degrees:

MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA, & others

Saving cells in DP
  • Find local alignments
  • Chain them – O(N log N) using L.I.S.
  • Restricted DP

Methods to CHAIN Local Alignments

Sparse Dynamic Programming

O(N log N)

The Problem: Find a Chain of Local Alignments

(x, y) → (x’, y’)

requires

x < x’

y < y’

Each local alignment has a weight

FIND the chain with highest total weight

Quadratic Time Solution
  • Build Directed Acyclic Graph (DAG):
    • Nodes: local alignments [(x_a, x_b), (y_a, y_b)] & score
    • Directed edges: local alignments that can be chained
      • edge ( (x_a, x_b, y_a, y_b), (x_c, x_d, y_c, y_d) ) exists iff
      • x_a < x_b < x_c < x_d
      • y_a < y_b < y_c < y_d

Each local alignment is a node v_i with alignment score s_i.

Quadratic Time Solution

Initialization:

Find each node v_a s.t. there is no edge (u, v_a)

Set score V(a) = s_a

Iteration:

For each v_i, the optimal path ending in v_i has total score:

V(i) = max_{j s.t. there is an edge (v_j, v_i)} ( weight(v_j, v_i) + V(j) )

Termination:

Optimal global chain:

j* = argmax_j V(j); trace the chain back from v_{j*}

Worst case time: quadratic
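A minimal sketch of the quadratic chaining DP, assuming each local alignment is a tuple (x_a, x_b, y_a, y_b, score) and that the chain score is simply the sum of member scores (no edge weights).

def chain_quadratic(alignments):
    # O(N^2) chaining: V[i] = best total score of a chain ending at alignment i
    alns = sorted(alignments, key=lambda a: a[0])   # sort by x start so predecessors come earlier
    V = []
    for i, (xa, xb, ya, yb, s) in enumerate(alns):
        best_prev = 0.0
        for j in range(i):
            pxa, pxb, pya, pyb, ps = alns[j]
            if pxb < xa and pyb < ya:   # j ends before i starts, in both x and y
                best_prev = max(best_prev, V[j])
        V.append(s + best_prev)
    return max(V), alns, V

# Toy local alignments: (x_a, x_b, y_a, y_b, score)
hits = [(1, 3, 2, 4, 5.0), (4, 6, 5, 7, 6.0), (2, 8, 1, 9, 4.0), (9, 11, 8, 10, 3.0)]
best, alns, V = chain_quadratic(hits)
print(best)   # 14.0: chain (1,3,2,4) -> (4,6,5,7) -> (9,11,8,10)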

Sparse Dynamic Programming

Back to the LCS problem:

  • Given two sequences
    • x = x_1, …, x_m
    • y = y_1, …, y_n
  • Find the longest common subsequence
    • Quadratic solution with DP
  • How about when “hits” x_i = y_j are sparse?
Sparse Dynamic Programming
  • Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead
Sparse Dynamic Programming – L.I.S.
  • Longest Increasing Subsequence
  • Given a sequence over an ordered alphabet
    • x = x_1, …, x_m
  • Find the longest subsequence
    • s = s_1, …, s_k
    • with s_1 < s_2 < … < s_k
Sparse Dynamic Programming – L.I.S.

Let input be w: w_1, …, w_n

INITIALIZATION:

L: 1-indexed array, L[1] ← w_1

B: 0-indexed array of backpointers; B[0] = 0

P: array used for traceback

// L[j]: smallest last element w_i of a j-long LIS seen so far

ALGORITHM

for i = 2 to n {

    Find j such that L[j] < w[i] ≤ L[j+1]    // binary search; if w[i] exceeds every element of L, j is the current length of L

    L[j+1] ← w[i]

    B[j+1] ← i

    P[i] ← B[j]

}

That’s it!!!

  • Running time?
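Running time is O(n log n) when the “Find j” step is a binary search. A runnable sketch of the same idea, where tails mirrors L and prev mirrors the backpointers; the input below is the sequence of first coordinates of the example w used on the next slides.

import bisect

def longest_increasing_subsequence(w):
    # Patience-style LIS in O(n log n), mirroring the L / backpointer arrays above
    tails = []              # tails[j] = smallest last element of a (j+1)-long increasing subsequence
    tail_idx = []           # index into w of that last element
    prev = [-1] * len(w)    # backpointer for traceback
    for i, val in enumerate(w):
        j = bisect.bisect_left(tails, val)   # first position with tails[j] >= val
        if j == len(tails):
            tails.append(val)
            tail_idx.append(i)
        else:
            tails[j] = val
            tail_idx[j] = i
        prev[i] = tail_idx[j - 1] if j > 0 else -1
    # traceback from the end of the longest subsequence
    out, i = [], tail_idx[-1]
    while i != -1:
        out.append(w[i])
        i = prev[i]
    return out[::-1]

print(longest_increasing_subsequence([4, 3, 10, 2, 8, 1, 3, 4, 7, 5, 9]))   # [1, 3, 4, 5, 9]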
Sparse LCS expressed as LIS

[Figure: dot-plot of the match points between sequences x and y]

Create a sequence w

  • Every matching point (i, j) is inserted into w as follows:
  • For each column j, from smallest to largest, insert in w the points (i, j), in decreasing row i order
  • The 11 example points are inserted in the order given
  • a = (y, x), b = (y’, x’) can be chained iff
    • a is before b in w, and
    • y < y’


Sparse LCS expressed as LIS


Create a sequence w

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

Consider now w’s elements as ordered lexicographically, where

  • (y, x) < (y’, x’) if y < y’

Claim: An increasing subsequence of w is a common subsequence of x and y


Sparse Dynamic Programming for LIS


  • Algorithm:

initialize empty array L

/* at each point, l_j will contain the last element of the longest j-long increasing subsequence that ends with the smallest w_i */

for i = 1 to |w|

    binary search for w[i] in L, to find l_j < w[i] ≤ l_{j+1}

    replace l_{j+1} with w[i]

    keep a backptr l_j → w[i]

That’s it!!!


Sparse Dynamic Programming for LIS


Example:

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

L =

  • (4,2)
  • (3,3)
  • (3,3) (10,5)
  • (2,5) (10,5)
  • (2,5) (8,6)
  • (1,6) (8,6)
  • (1,6) (3,7)
  • (1,6) (3,7) (4,8)
  • (1,6) (3,7) (4,8) (7,9)
  • (1,6) (3,7) (4,8) (5,9)
  • (1,6) (3,7) (4,8) (5,9) (9,10)


Longest common subsequence:

s = 4, 24, 3, 11, 18

Sparse DP for rectangle chaining
  • 1,…, N: rectangles
  • (h_j, l_j): y-coordinates of rectangle j
  • w(j): weight of rectangle j
  • V(j): optimal score of chain ending in j
  • L: list of triplets (l_j, V(j), j)
    • L is sorted by l_j: top to bottom
    • L is implemented as a balanced binary tree

[Figure: a rectangle spanning y-coordinates h (top) to l (bottom)]

Sparse DP for rectangle chaining

Main idea:

  • Sweep through x-coordinates
  • To the right of b, anything chainable to a is chainable to b
  • Therefore, if V(b) > V(a), rectangle a is “useless” – remove it
    • In L, keep rectangles j sorted with increasing l_j-coordinates ⇒ sorted with increasing V(j)


Sparse DP for rectangle chaining

Go through rectangle x-coordinates, from lowest to highest:

  • When on the leftmost end of i:
    • j: rectangle in L, with largest l_j < h_i
    • V(i) = w(i) + V(j)
  • When on the rightmost end of i:
    • k: rectangle in L, with largest l_k ≤ l_i
    • If V(i) > V(k):
      • INSERT (l_i, V(i), i) in L
      • REMOVE all (l_j, V(j), j) with V(j) ≤ V(i) & l_j ≥ l_i

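A runnable sketch of this sweep under assumed conventions: y grows downward so rectangle j can precede i when l_j < h_i, inputs are (x_left, x_right, h, l) tuples plus weights, and the balanced binary tree is emulated with bisect over sorted parallel lists (same logic, though list deletion here is not O(log N)).

import bisect

def chain_rectangles(rects, weights):
    # rects[i] = (x_left, x_right, h_i, l_i) with h_i < l_i; weights[i] = w(i)
    # Returns V, the best chain score ending at each rectangle
    n = len(rects)
    V = [0.0] * n
    # events: (x, is_right_end, index); at equal x, left ends are processed first,
    # so touching rectangles are not chained (strict precedence assumed)
    events = []
    for i, (xl, xr, h, l) in enumerate(rects):
        events.append((xl, 0, i))
        events.append((xr, 1, i))
    events.sort()

    Ls, Vals, Ids = [], [], []   # parallel lists sorted by l; invariant: Vals increasing too

    for x, is_right, i in events:
        xl, xr, h, l = rects[i]
        if not is_right:
            # leftmost end of i: best already-finished rectangle j with l_j < h_i
            pos = bisect.bisect_left(Ls, h) - 1
            V[i] = weights[i] + (Vals[pos] if pos >= 0 else 0.0)
        else:
            # rightmost end of i: skip if dominated by some l_k <= l_i with V(k) >= V(i)
            pos = bisect.bisect_right(Ls, l) - 1
            if pos >= 0 and Vals[pos] >= V[i]:
                continue
            # remove dominated entries: l_j >= l_i and V(j) <= V(i), then insert i
            ins = bisect.bisect_left(Ls, l)
            while ins < len(Ls) and Vals[ins] <= V[i]:
                del Ls[ins]; del Vals[ins]; del Ids[ins]
            Ls.insert(ins, l); Vals.insert(ins, V[i]); Ids.insert(ins, i)
    return V

# Toy rectangles (made-up coordinates): rectangle 1 can precede rectangle 2 but not rectangle 3
rects = [(0, 3, 0, 2), (4, 7, 3, 5), (5, 9, 0, 2)]
print(chain_rectangles(rects, weights=[5.0, 6.0, 3.0]))   # [5.0, 11.0, 3.0]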

Example

[Figure: five example rectangles with weights 1: 5, 2: 6, 3: 3, 4: 4, 5: 2, laid out along the x- and y-axes]

Time Analysis
  • Sorting the x-coords takes O(N log N)
  • Going through x-coords: N steps
  • Each of N steps requires O(log N) time:
    • Searching L takes log N
    • Inserting to L takes log N
    • All deletions are consecutive, so log N per deletion
    • Each element is deleted at most once: N log N for all deletions
      • Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree
Examples

Human Genome Browser
