
Aligning Alignments Exactly

Aligning Alignments Exactly. By John Kececioglu and Dean Starrett, CS Dept., Univ. of Arizona. Appeared in the 8th ACM RECOMB, 2004. Presented by Jie Meng. Background, Definition, Hardness, An Exponential time algorithm. Alignments.


Presentation Transcript


  1. Aligning Alignments Exactly • By John Kececioglu and Dean Starrett, CS Dept., Univ. of Arizona • Appeared in the 8th ACM RECOMB, 2004 • Presented by Jie Meng

  2. Background • Definition • Hardness • An Exponential time algorithm

  3. Alignments • Given two (DNA or protein) sequences, an alignment writes one above the other, inserting spaces so that similar parts line up as closely as possible, for example: AT-C-TCGCT over -TG-ATG-AT • There are four kinds of aligned columns: match, mismatch, insertion, and deletion

  4. Scoring Alignments • There are four types of aligned columns: • Match – score_match = 0. • Mismatch – score_mismatch ≥ 0. • Insertion – score_insertion ≥ 0. • Deletion – score_deletion ≥ 0. • The score of an alignment is the sum of the scores of its aligned columns. • The goal is to minimize the score

  5. Gap-cost • We can replace the per-space indel score by a gap-open cost and a gap-extension cost: a gap of size x then costs open + x * extension instead of x * indel. • AT----CGCTTCAT -TGCAT--AT----- • open + 4 * extension
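The gap-cost rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the costs open = 2 and extension = 1 are borrowed from the later NP-hardness slide, and the two rows are assumed to contain no column in which both are spaces.

```python
def gap_cost(length, gap_open, gap_extend):
    """Cost of one gap of `length` consecutive spaces: open + length * extension."""
    return gap_open + length * gap_extend


def gap_runs(row, other, space="-"):
    """Lengths of the maximal runs where `row` has a space and `other` has a letter."""
    runs, current = [], 0
    for a, b in zip(row, other):
        if a == space and b != space:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def affine_gap_total(row_a, row_b, gap_open, gap_extend):
    """Total gap cost of a pairwise alignment, counting the gaps in both rows."""
    lengths = gap_runs(row_a, row_b) + gap_runs(row_b, row_a)
    return sum(gap_cost(x, gap_open, gap_extend) for x in lengths)


# A single gap of four spaces costs open + 4 * extension.
print(gap_cost(4, gap_open=2, gap_extend=1))
# Gap cost of the first two rows of the multiple alignment used on later slides.
print(affine_gap_total("AT-C-TCGAT", "-TGCAT--AT", gap_open=2, gap_extend=1))
```

Under a linear indel cost only the number of spaces matters; with open and extension costs the number of separate gaps matters too, which is what makes the later aligning-alignments problem harder.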

  6. Multiple Alignments • In general we also need to compare multiple sequences and find their similarities. • Multiple alignment generalizes the alignment idea to handle many sequences. • AT-C-TCGAT -TGCAT--AT ATCCA-CGCT

  7. Sum-of-Pairs (SP) Score • Given a multiple alignment, the sum-of-pairs (SP) score is the sum of the scores of the pairwise alignments induced by each pair of rows. • For the rows AT-C-TCGAT, -TGCAT--AT, ATCCA-CGCT: SP = score(AT-C-TCGAT, -TGCAT--AT) + score(AT-C-TCGAT, ATCCA-CGCT) + score(-TGCAT--AT, ATCCA-CGCT)
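As a rough illustration of the SP score, the sketch below sums per-column costs over every pair of rows. The unit costs (match 0, mismatch 1, indel 1, no gap-open cost) and the function names are assumptions made for the example, not values taken from the paper.

```python
from itertools import combinations


def column_cost(a, b, mismatch=1, indel=1, space="-"):
    """Cost of one column of an induced pairwise alignment."""
    if a == space and b == space:
        return 0  # the column disappears from the induced pairwise alignment
    if a == space or b == space:
        return indel
    return 0 if a == b else mismatch


def sp_score(rows, mismatch=1, indel=1):
    """Sum-of-pairs score: add up the induced pairwise scores of all row pairs."""
    return sum(
        column_cost(a, b, mismatch, indel)
        for r, s in combinations(rows, 2)
        for a, b in zip(r, s)
    )


rows = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
print(sp_score(rows))  # 5 + 4 + 6 = 15 with the unit costs assumed above
```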

  8. BAD NEWS • Multiple alignment is NP-hard • One method is to approximate the optimal value: • Progressive alignment • A problem arises naturally: Aligning Alignments

  9. Aligning Alignments • Let S be a collection of strings s1, s2, s3, …, sk over an alphabet Σ; • An alignment of S is a matrix A with k rows such that: i) each entry is either a letter or a space; ii) no column is all spaces; iii) reading across row i and removing the spaces gives string si; • As before, aligned entries are scored as a match, a mismatch (substitution), or a space (insertion or deletion);
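The three conditions in this definition can be checked mechanically. The following is a minimal sketch, assuming rows are Python strings with "-" standing for a space; `is_alignment_of` is a hypothetical helper name.

```python
def is_alignment_of(matrix, strings, space="-"):
    """Check that `matrix` (a list of equal-length rows) is an alignment of `strings`."""
    if not matrix or len(matrix) != len(strings):
        return False
    width = len(matrix[0])
    # All rows must have the same number of columns to form a matrix.
    if any(len(row) != width for row in matrix):
        return False
    # Condition ii: no column consists only of spaces.
    if any(all(row[c] == space for row in matrix) for c in range(width)):
        return False
    # Condition iii: removing the spaces from row i gives string s_i.
    return all(row.replace(space, "") == s for row, s in zip(matrix, strings))


A = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
print(is_alignment_of(A, ["ATCTCGAT", "TGCATAT", "ATCCACGCT"]))
```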

  10. Aligning Alignments • Given two alignments, A with k sequences of length N and B with l sequences of length M, we want to align the columns of A and B; • A: AT-C-TCGAT, -TGCAT--AT, ATCCA-CGAT • B: CT-ATTGGAT, -TTAT-G--T, CTTA-GGGAT

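  11. Aligning Alignments • In other words, we treat the columns of A and B as single letters, just like aligning two sequences. • Example: A has rows CT, GT, -T and B has rows AT, -T, GT. • One alignment of their columns gives rows C-T, G-T, --T for A over rows -AT, --T, -GT for B.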

  12. Aligning Alignments • The score function is still sum-of-pairs, namely the sum of score(Ai', Bj') over all rows Ai' of A and Bj' of B in the combined alignment • Note that the induced alignment of Ai' and Bj' may contain columns with a space in both rows; we simply remove those columns. Ai': a----aa-a Bj': aaa-a-a-a
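Removing the columns that hold a space in both rows is a one-pass filter. The sketch below applies it to the two rows shown on this slide; `induced_pair` is a hypothetical name and "-" is assumed to denote a space.

```python
def induced_pair(row_a, row_b, space="-"):
    """Drop every column in which both rows have a space."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == space and b == space)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


top, bottom = induced_pair("a----aa-a", "aaa-a-a-a")
print(top)     # a---aaa
print(bottom)  # aaaa-aa
```

In this example the fourth and eighth columns are dropped. Note that the four-space run in the first row becomes a three-space gap, so the removal changes gap lengths.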

  13. Aligning Alignments • Without gap cost, aligning alignments is polynomial time solvable. We can apply dynamic programming like we did in aligning sequences; the only difference here is that we align columns.
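A minimal sketch of this column-level dynamic program follows. It assumes unit mismatch and indel costs, scores only row pairs with one row from A and one from B (pairs inside A or inside B are not changed by the alignment), and returns just the optimal score; all names are illustrative.

```python
def pair_cost(a, b, mismatch=1, indel=1, space="-"):
    """Cost of one entry of A against one entry of B."""
    if a == space and b == space:
        return 0
    if a == space or b == space:
        return indel
    return 0 if a == b else mismatch


def column_cost(col_a, col_b, mismatch=1, indel=1):
    """Sum-of-pairs cost of a column of A placed against a column of B."""
    return sum(pair_cost(a, b, mismatch, indel) for a in col_a for b in col_b)


def align_alignments(A, B, mismatch=1, indel=1):
    """Optimal score for aligning the columns of A and B without gap-open costs."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    gap_a, gap_b = ("-",) * len(A), ("-",) * len(B)  # all-space columns
    n, m = len(cols_a), len(cols_b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + column_cost(cols_a[i - 1], gap_b, mismatch, indel)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + column_cost(gap_a, cols_b[j - 1], mismatch, indel)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                # column i of A against column j of B
                D[i - 1][j - 1] + column_cost(cols_a[i - 1], cols_b[j - 1], mismatch, indel),
                # column i of A against an all-space column
                D[i - 1][j] + column_cost(cols_a[i - 1], gap_b, mismatch, indel),
                # an all-space column against column j of B
                D[i][j - 1] + column_cost(gap_a, cols_b[j - 1], mismatch, indel),
            )
    return D[n][m]


print(align_alignments(["CT", "GT", "-T"], ["AT", "-T", "GT"]))
```

With one row in each alignment this reduces to ordinary pairwise sequence alignment, which is a quick sanity check on the recurrence.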

  14. Aligning Alignments • With gap costs, this problem is NP-complete • We can use a reduction from the MAX-CUT problem • MAX-CUT: Given a graph G = (V, E) and an integer c, is there a partition of V into L and R (V = L ∪ R, L ∩ R = ∅) such that the size of the cut is at least c? • The cut is the set of edges with one endpoint in L and the other in R;
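To make the MAX-CUT decision problem concrete, here is a brute-force sketch that tries every partition of V. It takes exponential time and is only meant for tiny graphs; the function names and the example graphs are illustrative.

```python
from itertools import product


def cut_size(edges, left):
    """Number of edges with exactly one endpoint in `left`."""
    return sum((u in left) != (v in left) for u, v in edges)


def has_cut_of_size(vertices, edges, c):
    """Decide MAX-CUT by trying every partition of the vertices into L and R."""
    for mask in product([False, True], repeat=len(vertices)):
        left = {v for v, in_left in zip(vertices, mask) if in_left}
        if cut_size(edges, left) >= c:
            return True
    return False


triangle = [(1, 2), (2, 3), (1, 3)]
print(has_cut_of_size([1, 2, 3], triangle, 2))  # True
print(has_cut_of_size([1, 2, 3], triangle, 3))  # False: a triangle is not bipartite
```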

  15. NP-hardness • Given an instance of MAX-CUT: G = (V, E) with V = {v1, v2, …, vn} and E = {e1, e2, …, em}, and an integer c; • we construct two multiple alignments A and B over the alphabet {0,1}: both A and B have m edge rows and k dummy rows, each edge row corresponding to an edge; A has 2n columns, every two consecutive columns corresponding to a vertex; B has 3n columns, every three consecutive columns corresponding to a vertex;

  16. NP-hardness • The dummy rows in A are (0-)^n; the dummy rows in B are (0--)^n; • Edge rows in A: in the row for edge e = (vi, vj), the column pairs for vi and vj each hold the substring "-1", with spaces elsewhere; • Edge rows in B: in the row for edge e = (vi, vj) with i < j, the columns for vi hold the substring "010" and the columns for vj hold the substring "-10"

  17. NP-hardness • For simplicity, let the match score be 0, the mismatch score 1, the gap-open cost 2, and the gap-extension cost 1, and ask whether there is an alignment with score less than d-c; this gives an instance of Aligning Alignments.

  18. HOMEWORK4 • Given a set of multiple alignments {A1, A2, …, An}, where each Ai is a multiple alignment with ki sequences, and no gap cost: is the problem of multiply aligning {A1, A2, …, An}, using the method in this paper (i.e. aligning columns), hard or easy? If hard, prove it; otherwise, give an efficient algorithm and prove its complexity and correctness.

  19. Exact Algorithm • The basic idea is still dynamic programming; • We have to remember extra information in a set S, the so-called shape: for each row of a multiple alignment, we record the column of its right-most letter.
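The per-row information described on this slide can be computed as below. This sketch follows only the slide's wording (the column of each row's right-most letter); the paper's formal definition of a shape may differ in detail, and the function name and the use of -1 for a row with no letters yet are assumptions.

```python
def rightmost_letters(rows, space="-"):
    """For each row, the 0-based column index of its right-most letter (-1 if none)."""
    return [
        max((c for c, ch in enumerate(row) if ch != space), default=-1)
        for row in rows
    ]


# First two columns of the combined alignment from slide 11:
# rows C-, G-, -- come from A and rows -A, --, -G come from B.
prefix = ["C-", "G-", "--", "-A", "--", "-G"]
print(rightmost_letters(prefix))  # [0, 0, -1, 1, -1, 1]
```

Comparing these positions for a row of A and a row of B tells whether the induced pairwise alignment currently ends in a gap, which is the information needed to decide if appending a new column starts a gap or extends one.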

  20. Exact Algorithm • S(i, j)=

  21. Exact Algorithm • C(i,j,t)=min • Where g(A[i], B[j], s) means the total number of gaps initiated by appending column A[i] and B[j] onto an alignment that ends in shape s;

  22. Exact Algorithm • The optimum value is • The problem here is that the number of shapes may be too large, so in the worst case the time and space complexity is

  23. Any Questions? 423B jmeng@cs.tamu.edu
