Multiple Sequence Alignment

# Multiple Sequence Alignment - PowerPoint PPT Presentation

Presentation Transcript

Multiple Sequence Alignment

S1 = AGGTC
S2 = GTTCG
S3 = TGAAC

[Figure: two possible alignments of S1, S2 and S3]

Multiple Sequence Alignment (cont.)

Input: sequences $S_1, S_2, \ldots, S_k$ over the same alphabet

Output: gapped sequences $S'_1, S'_2, \ldots, S'_k$ of equal length

• $|S'_1| = |S'_2| = \cdots = |S'_k|$
• Removing the spaces from $S'_i$ yields $S_i$

The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments it induces.

Multiple Sequence Alignment Example

Consider the following alignment:

AC-CDB-

Scoring scheme: match = 0, mismatch/indel = -1

SP score: (-4) + (-3) + (-5) = -12
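The SP computation can be sketched in a few lines of Python. This is only an illustration: the names `pair_score` and `sp_score` are made up here, a gap-versus-gap column is assumed to cost 0 (as the deck states later with δ(-,-) = 0), and the rows used are the three-row alignment of AGGTC, GTTCG and TGAAC shown elsewhere in the deck, not the example on this slide.

```python
from itertools import combinations

GAP = "-"

def pair_score(a, b, match=0, other=-1):
    """Score one column of an induced pairwise alignment."""
    if a == GAP and b == GAP:
        return 0  # assumption: a gap facing a gap is free
    return match if a == b else other  # mismatch and indel share one penalty

def sp_score(rows):
    """Sum the induced pairwise scores over all pairs of rows."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have equal length")
    return sum(
        pair_score(a, b)
        for r1, r2 in combinations(rows, 2)
        for a, b in zip(r1, r2)
    )

rows = ["AGGT-C-", "-G-TTCG", "TG-AAC-"]
print(sp_score(rows))  # -12: each of the three induced pairs scores -4
```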

Multiple Sequence Alignment Complexity
• Given k strings of length n, there is a generalization of the DP algorithm that finds an optimal SP alignment:
• Instead of a 2-dimensional table we have a k-dimensional table
• Each dimension is of length n+1
• Each entry depends on $2^k - 1$ adjacent entries

Complexity: $O(2^k n^k)$

This problem is known to be NP-hard (no polynomial-time algorithm is known, and none exists unless P = NP)
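A minimal sketch of the k-dimensional DP follows, written with memoization instead of an explicit table. It assumes the match = 0, mismatch/indel = -1 scheme from the example slide and a free gap-gap pair; `optimal_sp_score` is a hypothetical name. It is only practical for a handful of short sequences, which is exactly the point of the complexity bound.

```python
from functools import lru_cache
from itertools import combinations, product

GAP = "-"

def optimal_sp_score(seqs, match=0, other=-1):
    """Best SP score over all multiple alignments of seqs (exponential in k)."""
    k = len(seqs)
    # The 2^k - 1 predecessors: every non-zero 0/1 vector says which
    # sequences contribute a letter to the last column.
    steps = [s for s in product((0, 1), repeat=k) if any(s)]

    def column_score(col):
        total = 0
        for a, b in combinations(col, 2):
            if a == GAP and b == GAP:
                continue  # assumption: gap facing gap is free
            total += match if a == b else other
        return total

    @lru_cache(maxsize=None)
    def best(pos):
        # pos[i] = number of letters of seqs[i] already aligned
        if not any(pos):
            return 0
        result = None
        for step in steps:
            prev = tuple(p - s for p, s in zip(pos, step))
            if min(prev) < 0:
                continue
            col = [seq[p - 1] if s else GAP for seq, p, s in zip(seqs, pos, step)]
            cand = best(prev) + column_score(col)
            if result is None or cand > result:
                result = cand
        return result

    return best(tuple(len(s) for s in seqs))
```

The memo holds one entry per cell of the $(n+1)^k$ table, and each entry inspects up to $2^k - 1$ predecessors, matching the bound above up to the per-column scoring cost.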

Multiple Sequence Alignment Approximation Algorithm
• We use cost instead of score
• Find an alignment of minimal cost
• Assumption: the cost function δ is a distance function
• δ(x,x) = 0
• δ(x,y) = δ(y,x) ≥ 0
• δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality)
• (e.g. cost of a mismatch ≤ cost of two indels)

D(S,T): cost of a minimum-cost global alignment between S and T
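For reference, D(S,T) can be computed with the usual two-row pairwise DP. The sketch below assumes a unit cost function (0 for equal symbols, 1 for a mismatch or an indel), which satisfies the distance properties listed above; `unit_delta` and `alignment_cost` are illustrative names.

```python
GAP = "-"

def unit_delta(x, y):
    """Assumed cost function: 0 for equal symbols, 1 otherwise."""
    return 0 if x == y else 1

def alignment_cost(s, t, delta=unit_delta):
    """D(s, t): minimum cost of a global alignment of s and t."""
    # prev[j] = cost of aligning the processed prefix of s with t[:j]
    prev = [0] * (len(t) + 1)
    for j, y in enumerate(t, 1):
        prev[j] = prev[j - 1] + delta(GAP, y)
    for x in s:
        cur = [prev[0] + delta(x, GAP)]
        for j, y in enumerate(t, 1):
            cur.append(min(
                prev[j - 1] + delta(x, y),   # align x with y
                prev[j] + delta(x, GAP),     # x faces a space
                cur[j - 1] + delta(GAP, y),  # y faces a space
            ))
        prev = cur
    return prev[-1]

print(alignment_cost("AGGTC", "GTTCG"))
```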

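Multiple Sequence Alignment Approximation Algorithm
• The ‘star’ algorithm:
• Input: Γ, a set of k strings $S_1, \ldots, S_k$.
• Find the string $S'$ (the center) that minimizes $\sum_{S \in \Gamma} D(S', S)$
• Denote $S_1 = S'$ and the rest of the strings as $S_2, \ldots, S_k$
• Iteratively add $S_2, \ldots, S_k$ to the alignment as follows:
• Suppose $S_1, \ldots, S_{i-1}$ are already aligned as $S'_1, \ldots, S'_{i-1}$
• Align $S_i$ to $S'_1$ to produce $S'_i$ and $S''_1$, aligned to each other
• Insert the spaces that were added to $S'_1$ into $S'_2, \ldots, S'_{i-1}$ as well, then replace $S'_1$ by $S''_1$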
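The steps above can be sketched as follows. This is one possible reading of the star algorithm, not the presenter's code: it assumes unit costs with δ(-,-) = 0, and the names `delta`, `align` and `star_alignment` are invented for the example.

```python
GAP = "-"

def delta(x, y):
    """Unit cost; identical symbols (including gap vs gap) cost 0."""
    return 0 if x == y else 1

def align(s, t):
    """Optimal global alignment of s (may contain gaps) and t.

    Returns (cost, gapped_s, gapped_t, from_s); from_s[c] is False exactly
    when column c is a space newly inserted into s.
    """
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + delta(s[i - 1], GAP)
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + delta(GAP, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),
                dp[i - 1][j] + delta(s[i - 1], GAP),
                dp[i][j - 1] + delta(GAP, t[j - 1]),
            )
    cols, i, j = [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + delta(s[i - 1], t[j - 1]):
            cols.append((s[i - 1], t[j - 1], True))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + delta(s[i - 1], GAP):
            cols.append((s[i - 1], GAP, True))
            i -= 1
        else:
            cols.append((GAP, t[j - 1], False))
            j -= 1
    cols.reverse()
    return (dp[n][m], "".join(c[0] for c in cols),
            "".join(c[1] for c in cols), [c[2] for c in cols])

def star_alignment(seqs):
    """Return (center_index, rows); rows[i] is the gapped version of seqs[i]."""
    k = len(seqs)
    center = min(range(k), key=lambda c: sum(align(seqs[c], s)[0] for s in seqs))
    rows = {center: seqs[center]}
    for idx in range(k):
        if idx == center:
            continue
        _, _, new_row, from_center = align(rows[center], seqs[idx])
        for r in list(rows):  # copy the new spaces into every aligned row
            symbols = iter(rows[r])
            rows[r] = "".join(next(symbols) if keep else GAP for keep in from_center)
        rows[idx] = new_row
    return center, [rows[i] for i in range(k)]

center, rows = star_alignment(["AGGTC", "GTTCG", "TGAAC"])
print(center)
print("\n".join(rows))
```

Aligning each new string against the gapped center, with gap-gap columns free, keeps the induced distance between the center and every other row equal to their optimal pairwise distance.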

Multiple Sequence Alignment Approximation Algorithm
• Time analysis:
• Choosing $S_1$: execute DP for all sequence pairs, $O(k^2 n^2)$
• Adding $S_i$ to the alignment: execute DP for $S_i$ and $S'_1$, $O(i \cdot n^2)$
• (In the i-th stage the length of $S'_1$ can be up to $i \cdot n$)
• Total complexity: $O(k^2 n^2)$, since $\sum_{i=2}^{k} O(i \cdot n^2) = O(k^2 n^2)$

Multiple Sequence Alignment Approximation Algorithm
• Approximation ratio:
• $M^*$: an optimal alignment
• M: the alignment produced by this algorithm
• d(i,j): the distance M induces on the pair $S_i, S_j$

For all i: $d(1,i) = D(S_1, S_i)$

(we perform an optimal alignment between $S'_1$ and $S_i$, and δ(-,-) = 0)

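Multiple Sequence Alignment Approximation Algorithm

Approximation ratio:

• Triangle inequality: $d(i,j) \le d(1,i) + d(1,j)$ for every pair i, j
• Definition of $S_1$: $\sum_{j} D(S_1, S_j) \le \sum_{j} D(S_i, S_j)$ for every i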
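The two ingredients named on this slide combine into the standard factor-2 bound. The derivation below is a reconstruction of that textbook argument under the notation defined earlier; writing $d(M)$ for the total SP cost of M and $d^*(i,j)$ for the distance $M^*$ induces on $S_i, S_j$ are notational assumptions.

```latex
\[
\begin{aligned}
2\,d(M) &= \sum_{i=1}^{k} \sum_{j \ne i} d(i,j)
  \le \sum_{i=1}^{k} \sum_{j \ne i} \bigl[ d(1,i) + d(1,j) \bigr]
  = 2(k-1) \sum_{l=2}^{k} D(S_1, S_l)
  && \text{(triangle inequality)} \\
2\,d(M^*) &= \sum_{i=1}^{k} \sum_{j \ne i} d^*(i,j)
  \ge \sum_{i=1}^{k} \sum_{j \ne i} D(S_i, S_j)
  \ge k \sum_{l=2}^{k} D(S_1, S_l)
  && \text{(definition of } S_1\text{)} \\
\frac{d(M)}{d(M^*)} &\le \frac{2(k-1)}{k} < 2
\end{aligned}
\]
```

The first line uses $d(1,l) = D(S_1, S_l)$; the second uses that any induced pairwise alignment costs at least the optimal pairwise cost.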

Multiple Sequence Alignment Reminder

S1 = AGGTC
S2 = GTTCG
S3 = TGAAC

[Figure: two possible alignments of S1, S2 and S3]

Multiple Sequence Alignment Reminder

Input: sequences $S_1, S_2, \ldots, S_k$ over the same alphabet

Output: gapped sequences $S'_1, S'_2, \ldots, S'_k$ of equal length

• $|S'_1| = |S'_2| = \cdots = |S'_k|$
• Removing the spaces from $S'_i$ yields $S_i$

The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments it induces.

Multiple Sequence Alignment Reminder
• The ‘star’ algorithm:
• Input: Γ, a set of k strings $S_1, \ldots, S_k$.
• Find the string $S_1$ (the center) that minimizes $\sum_{S \in \Gamma} D(S_1, S)$
• Iteratively add $S_2, \ldots, S_k$ to the alignment
• Finds a multiple alignment (MA) costing at most twice the optimal cost!

Problem: conventional MA does not correctly model evolutionary relationships

Tree Alignment
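• Input: X, a set of sequences
• T, a phylogenetic tree on X (leaves labeled by X)
• Output: labels on the internal vertices of T, such that the sum of the costs of all edges of T is minimal (the cost of an edge is the alignment cost D between the labels at its two ends).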
• How do we label internal vertices?
• Sequences
• Profiles (multiple alignments)

Profile Alignment

A G G T - C -
- G - T T C G
T G - A A C -

A profile of an MA of length n over alphabet Σ is a (|Σ|+1) × n table.

Column i holds the distribution of Σ (and gap) in that position.
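As a small illustration, a profile can be built by counting symbols column by column. The function name `profile` and the DNA alphabet are assumptions for this sketch, and the table is stored as a list of n columns (each with |Σ|+1 entries), i.e. the transpose of the layout described above.

```python
from collections import Counter

def profile(rows, alphabet="ACGT", gap="-"):
    """Return one {symbol: frequency} dict per alignment column."""
    symbols = alphabet + gap
    columns = []
    for col in zip(*rows):
        counts = Counter(col)
        # Counter returns 0 for symbols that do not occur in the column
        columns.append({x: counts[x] / len(col) for x in symbols})
    return columns

rows = ["AGGT-C-", "-G-TTCG", "TG-AAC-"]
for i, column in enumerate(profile(rows), 1):
    print(i, column)
```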

Profile Alignment
• Aligning a sequence to a profile:
• Matching letter to position: weighted average of scores
• Indels: introducing new columns gets special consideration
• (same goes for aligning two profiles)
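The "weighted average of scores" for matching a letter to a profile position might look like the sketch below. The scoring values (match 0, mismatch/indel -1, gap-gap 0) are carried over from the earlier example as an assumption, and `score` and `letter_vs_column` are hypothetical names; the special handling of new columns mentioned above is not modeled.

```python
GAP = "-"

def score(a, b, match=0, other=-1):
    """Assumed letter-vs-letter scoring; a gap facing a gap is free."""
    if a == GAP and b == GAP:
        return 0
    return match if a == b else other

def letter_vs_column(letter, column):
    """Average score of `letter` against a profile column.

    `column` maps each symbol (including the gap) to its frequency,
    so the result is the frequency-weighted mean of pairwise scores.
    """
    return sum(freq * score(letter, x) for x, freq in column.items())

column = {"A": 1 / 3, "T": 1 / 3, "-": 1 / 3}
print(letter_vs_column("A", column))  # about -0.667: matches A, misses T and the gap
```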

Clustal Algorithm
• Iteratively constructs MA for intermediate nodes
• At each point holds profiles for all leaves
• Chooses closest pair of neighbors
• neighbors – have common father in T
• distance - cost of optimal (pairwise) alignment
• Aligns the two profiles to get the ‘father-profile’
• Replaces the two leaves with their father
• Analysis:
• Initialization: $O(k^2)$ alignments
• k-1 iterations
• Iteration i involves k-i-1 new pairwise alignments

Sequences/profiles are weighted

Lifted Tree Alignments

Lifted tree alignment: each internal node is labeled by one of the labels of its daughters.

Internal nodes are sequences and not profiles.

Example:

[Figure: a tree with leaves S1,…,S6 whose internal nodes carry the lifted labels S4, S4, S2 and S5]

We’ll show:

• DP algorithm for optimal lifted tree alignment
• Optimal lifted alignment is a 2-approximation of optimal tree alignment

Lifted Tree Alignments Algorithm

Input: X, a set of sequences

T, a phylogenetic tree on X (leaves labeled by X)

Output: lifted labels on the internal vertices of T, such that the sum of the costs of all edges of T is minimal.

Basic principle: calculate for every node v in T, and every sequence S in X:

d(v,S): the optimal cost of v’s subtree when v is labeled by S

The cost of the optimal tree is $\min_{S \in X} d(r, S)$, where r is the root of T

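Lifted Tree Alignments Algorithm

d(v,S): the optimal cost of v’s subtree when v is labeled by S

Initialization: for a leaf v labeled $S_v$: $d(v, S_v) = 0$

Recurrence: for an internal node v with daughters $u_1, \ldots, u_l$: $d(v,S) = \sum_{i=1}^{l} \min_{S'} \left[ D(S,S') + d(u_i,S') \right]$, where $S'$ ranges over the leaf labels below $u_i$, and $S' = S$ for the daughter whose subtree contains S

Correctness: check the optimal-substructure property

Complexity: $O(k^2)$ pairwise alignments, $O(n^2 k^2)$.

k-1 iterations

For internal node v: $O(k_v^2)$ work (where $k_v$ is the number of leaves below v)

Over all internal nodes: $O(k^2 \, \mathrm{depth}(T)) = O(k^3)$

Total: $O(k^2 (n^2 + \mathrm{depth}(T)))$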
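A compact sketch of this DP is given below. It assumes the tree is given as nested tuples with leaf sequences as strings, that leaf sequences are distinct, and that D is the unit-cost edit distance; `D` is implemented here only to keep the example self-contained, and `lifted_costs` is an illustrative name.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def D(s, t):
    """Unit-cost global alignment cost (edit distance), cached per pair."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, 1):
        cur = [i]
        for j, y in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def lifted_costs(node):
    """Return {S: d(node, S)} for every leaf sequence S below `node`."""
    if isinstance(node, str):
        return {node: 0}  # a leaf can only carry its own label
    tables = [lifted_costs(child) for child in node]
    result = {}
    for j, own in enumerate(tables):
        for S, cost_below in own.items():
            # The daughter whose subtree contains S must also be labeled S
            # (edge cost 0); every other daughter picks its best label.
            total = cost_below
            for i, other in enumerate(tables):
                if i != j:
                    total += min(D(S, S2) + c for S2, c in other.items())
            result[S] = total
    return result

tree = (("AGGTC", "GTTCG"), "TGAAC")
table = lifted_costs(tree)
print(min(table.values()))  # cost of an optimal lifted tree alignment
```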

Lifted Tree Alignments Approximation analysis
• Claim: an optimal LTA 2-approximates general tree alignments
• We’ll show a construction of an LTA which costs at most twice the optimal TA with sequence-labeled nodes
• (can this be generalized to profile-labeled nodes?)
• Notation:
• $T^*$: optimal TA labels
• $S_v^*$: label of node v in $T^*$
• $T^L$: our constructed LTA
• $S_v^L$: label of node v in $T^L$

Lifted Tree Alignments Approximation analysis
• Construction:
• We label the nodes bottom-up.
• For node v with daughters $u_1, \ldots, u_l$:
• we choose the label (from $S_{u_1}^L, \ldots, S_{u_l}^L$) closest to $S_v^*$
• We need to show: $D(T^L) \le 2D(T^*)$

Lifted Tree Alignments Approximation analysis
• Analysis:
• Some edges in $T^L$ have cost 0
• Observe edges (v,u) of cost > 0:
• $S_i$: label of the father (v)
• $S_j$: label of the daughter (u)
• P(v,u): the path in $T^*$ from v to the leaf labeled by $S_j$
• $D(S_i,S_j) \le D(S_i,S_v^*) + D(S_j,S_v^*)$ (triangle inequality)
• $\le 2D(S_j,S_v^*)$ (choice of i)
• $\le 2D(P(v,u))$ (triangle inequality)

Q.E.D.

Lifted Tree Alignments Approximation analysis
• $D(S_i,S_j) \le 2D(P(v,u))$

If (v,u) and (v’,u’) are two different edges with cost > 0 in $T^L$, then P(v,u) and P(v’,u’) are edge-disjoint
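Putting the per-edge bound and the disjointness claim together gives the factor of 2. The summation below is a sketch of that last step in the notation of these slides, where each edge (v,u) of $T^L$ has cost $D(S_i, S_j)$ for the labels $S_i$ of v and $S_j$ of u.

```latex
\[
D(T^L) \;=\; \sum_{\substack{(v,u) \in T^L \\ \text{cost} > 0}} D(S_i, S_j)
\;\le\; 2 \sum_{\substack{(v,u) \in T^L \\ \text{cost} > 0}} D\bigl(P(v,u)\bigr)
\;\le\; 2\, D(T^*)
\]
```

The last inequality holds because the paths are edge-disjoint in $T^*$, so every edge of $T^*$ is counted at most once.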

• Final Remarks:
• The lifted tree alignment $T^L$ is only conceptual (we don’t have $T^*$)
• An optimal LTA cannot cost more than $T^L$
• In the case of profile-labeled nodes:
• the construction and analysis still hold as long as the cost is a distance function