
Sequence Local Alignment using Directed Acyclic Word Graph

Sequence Local Alignment using Directed Acyclic Word Graph. Do Huy Hoang. Sequence alignment and sequence similarity: an alignment arranges DNA/protein sequences to show their similarity, with “-” denoting an insertion/deletion event. Other variations: edit distance, longest common substring.


Presentation Transcript


  1. Sequence Local Alignment using Directed Acyclic Word Graph Do Huy Hoang

  2. Sequence Alignment

  3. Sequence Similarity • Alignment • Arrange DNA/protein sequences to show their similarity • “-” denotes an insertion/deletion event

  4. Other variations • Edit distance • Longest common substring • Affine gap scoring • Using scoring matrix (BLOSUM, PAM)

  5. Alignment score computation • Needleman–Wunsch • Dynamic programming
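The dynamic program named on this slide can be sketched as follows. This is a minimal illustration, not the presenter's code: the function name and the scores (match = 1, mismatch = -1, linear gap = -1) are assumptions chosen for readability.

```python
def needleman_wunsch_score(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment score of s and t with a linear gap penalty (assumed scoring)."""
    # prev[j] holds the best score of aligning the processed prefix of s with t[:j].
    prev = [j * gap for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * gap]  # s[:i] aligned against the empty prefix of t
        for j, b in enumerate(t, 1):
            sub = match if a == b else mismatch
            cur.append(max(prev[j - 1] + sub,   # align a with b
                           prev[j] + gap,       # a against a gap
                           cur[j - 1] + gap))   # b against a gap
        prev = cur
    return prev[-1]
```

Only two rows are kept because the score alone is requested; recovering the alignment itself would need the full table or a traceback structure.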

  6. Other variations

  7. Local alignment • Find the best alignment between two substrings, one from each sequence
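A minimal Smith-Waterman sketch of the local alignment problem stated above. It assumes the match and mismatch scores quoted later in the deck (1 and -3) and, for brevity, a linear gap penalty of -5 instead of the affine open/extend scheme; the function name is hypothetical.

```python
def local_align(s, t, match=1, mismatch=-3, gap=-5):
    """Return (best score, substring of s, substring of t) of one best local alignment."""
    n, m = len(s), len(t)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # The extra 0 lets an alignment start anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j
    # Trace back from the best cell until the score drops to zero.
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, s[i:bi], t[j:bj]
```

For example, `local_align("ttacgtaa", "ccacgtcc")` picks out the shared block `acgt` from both inputs.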

  8. BWTSW

  9. BWTSW • Motivation • Scoring 75% similarity • Most entries of the local alignment table are zero • Meaningful alignment • Suffix tree • Meaningful alignment • Meaningful alignment with gap • How good is it?

  10. Meaningful alignment (1) • Sequence similarity sometimes implies functional similarity. • Biologists are usually NOT interested in sequences with less than 70% similarity. • BLAST score • Match = 1 • Mismatch = -3 • Open Gap = -5 • Extending Gap = -2

  11. Meaningful alignment (2) • BLAST score • Match = 1 • Mismatch = -3 • Open Gap = -5 • Extending Gap = -2 • An ungapped alignment needs more than 75% matches to have a positive score: with m matches out of L positions, m - 3(L - m) > 0 means m > 3L/4

  12. Meaningful alignment (3) • BLAST score • Match = 1 • Mismatch = -3 • Open Gap = -5 • Extending Gap = -2 • How many non-zero entries are there in the local alignment DP table?
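One way to explore the question on this slide empirically is to fill the local alignment table and count the positive cells. The sketch below is an illustration under assumptions: a linear gap penalty of -5 stands in for the affine scheme, and the function name is made up for this example.

```python
import random

def count_positive_cells(s, t, match=1, mismatch=-3, gap=-5):
    """Return (number of positive cells, total cells) of the local alignment table."""
    prev = [0] * (len(t) + 1)
    positive = 0
    for a in s:
        cur = [0]
        for j, b in enumerate(t, 1):
            sub = match if a == b else mismatch
            h = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            cur.append(h)
            positive += h > 0  # True counts as 1
        prev = cur
    return positive, len(s) * len(t)

if __name__ == "__main__":
    rng = random.Random(0)
    text = "".join(rng.choice("acgt") for _ in range(500))
    pattern = "".join(rng.choice("acgt") for _ in range(100))
    pos, total = count_positive_cells(text, pattern)
    print(f"{pos} of {total} cells are positive")
```

The zero cells are exactly the ones a sparse representation would not need to store.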

  13. How to improve? • Idea: • Not storing zero score entries • Using suffix tree to prune off early

  14. BWTSW details • FM index for suffix tree representation • Prune zero entries • Store DP vector using linked list
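The pruning idea can be sketched without an FM index. The code below is a simplified stand-in, not BWT-SW itself: it walks the suffix trie of the text implicitly (each node is a list of occurrence positions), keeps one DP vector per node as a plain Python list, uses a linear gap penalty, and drops a branch as soon as no entry is positive. All names are hypothetical; `smith_waterman_score` is included only as a cross-check.

```python
NEG = float("-inf")

def smith_waterman_score(t, p, match=1, mismatch=-3, gap=-5):
    """Reference local alignment score, used only to cross-check the trie version."""
    prev, best = [0] * (len(p) + 1), 0
    for a in t:
        cur = [0]
        for j, b in enumerate(p, 1):
            sub = match if a == b else mismatch
            cur.append(max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best

def trie_pruned_score(t, p, match=1, mismatch=-3, gap=-5):
    """Best local alignment score via DP over the suffix trie of t, pruning non-positive entries."""
    best = 0
    # A node is (positions of the next character after each occurrence, DP vector).
    # Root: the empty string occurs everywhere and scores 0 against every empty substring of p.
    stack = [(list(range(len(t))), [0] * (len(p) + 1))]
    while stack:
        positions, col = stack.pop()
        children = {}
        for pos in positions:
            if pos < len(t):
                children.setdefault(t[pos], []).append(pos + 1)
        for c, nxt in children.items():
            new = [NEG] * (len(p) + 1)
            for j in range(1, len(p) + 1):
                sub = match if c == p[j - 1] else mismatch
                v = max(col[j - 1] + sub, col[j] + gap, new[j - 1] + gap)
                new[j] = v if v > 0 else NEG  # keep only positive ("meaningful") entries
            if any(v > 0 for v in new):       # otherwise prune the whole subtree
                best = max(best, max(new))
                stack.append((nxt, new))
    return best
```

Pruning is safe because a best local alignment can always be chosen so that every prefix of it has a positive score; a non-positive prefix could simply be cut off without lowering the total.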

  15. Analysis • Text length = $N$ • Pattern length = $M$ • Alphabet size = $\sigma$

  16. Average running time (1) • Let $F(L)$ be the number of pairs of strings of length $L$ with $\mathrm{Score}(S_1,S_2) > 0$ • $F(L) = |\{(S_1,S_2) : |S_1| = |S_2| = L,\ \mathrm{Score}(S_1,S_2) > 0\}|$ • $F(L)$ counts the pairs with 75% identity. • $F(L) = \sum_{i=0}^{L/4} \binom{L}{i} (\sigma-1)^{i}$ • $F(L) \le k_1 k_2^{L}$ • $F(\log N) \le k_3 N^{0.68}$
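The sum on this slide is easy to evaluate numerically. The sketch below reads $F(L)$ as the number of strings $S_2$ within $L/4$ mismatches of a fixed $S_1$ (which is what the sum counts), takes the alphabet size as 4 for DNA, and rounds $L/4$ down; these readings and the function names are assumptions. `growth_exponent` returns the $e$ for which $F(L) = \sigma^{eL}$, so that $F(\log_\sigma N) = N^{e}$.

```python
from math import comb, log

def F(L, sigma=4):
    """Number of length-L strings differing from a fixed string in at most L/4 positions."""
    return sum(comb(L, i) * (sigma - 1) ** i for i in range(L // 4 + 1))

def growth_exponent(L, sigma=4):
    """Exponent e such that F(L) == sigma ** (e * L)."""
    return log(F(L, sigma)) / (L * log(sigma))

if __name__ == "__main__":
    for L in (8, 16, 32, 64, 128):
        print(L, F(L), round(growth_exponent(L), 3))
```

For $\sigma = 4$ the exponent stays below the 0.68 quoted on the slide for every $L$, which is consistent with reading that figure as an upper bound.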

  17. Average running time (2) • Given $S_1$, $\Pr(\mathrm{Score}(S_1,S_2) > 0 \mid S_1) = F(L)/\sigma^{L}$ • For $M < \log N$ • The number of entries is • $O(M \cdot F(M)) < O(\log N \cdot F(\log N))$ • For $M > \log N$ • $O(M \cdot N \cdot F(M)/\sigma^{L})$ • On average • Time $= O(M \cdot F(\log N)) = O(M \cdot N^{0.68})$

  18. DAWG

  19. Possible improvement of BWTSW • Worst-case running time $O(N^2 M)$ • When $M = N$ • $O(M N^{0.68} + M^3)$ when the pattern is a substring of the text • What about ST vs. ST?

  20. What we used in BWTSW is a suffix trie (not a suffix tree). • #Prove it# • A suffix trie has $O(N^2)$ nodes • A DAWG is a similar structure with $O(N)$ nodes

  21. DAWG (1)

  22. DAWG (2) • DAWG: Directed Acyclic Word Graph • A DAWG is an acyclic automaton that recognizes all the substrings of a given string.

  23. DAWG (3) • Example: • DAWG of “abcbc”

  24. DAWG (4) • End-set view

  25. Trivial DAWG construction • Using End-set class
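A brute-force sketch of the end-set construction: two substrings share a state exactly when they end at the same set of positions in w. The code enumerates every substring, so it is only meant for small examples; the function names are hypothetical and the representation (frozensets as states) is an assumption made for clarity.

```python
def endset_dawg(w):
    """Build the DAWG of w by grouping substrings with equal end-position sets."""
    n = len(w)

    def end_set(x):
        # Positions (1-based end index) at which an occurrence of x ends; "" ends everywhere.
        return frozenset(i + len(x) for i in range(n - len(x) + 1) if w.startswith(x, i))

    substrings = {w[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    state_of = {x: end_set(x) for x in substrings}
    edges = {}
    for x in substrings:
        for c in set(w):
            if x + c in state_of:
                edges[(state_of[x], c)] = state_of[x + c]
    return state_of[""], set(state_of.values()), edges

def accepts(root, edges, s):
    """True if s labels a path from the root, i.e. s is a substring of w."""
    state = root
    for c in s:
        if (state, c) not in edges:
            return False
        state = edges[(state, c)]
    return True
```

For w = "abcbc" this yields 8 states and 9 edges; for instance "bc" and "c" share the state with end-set {3, 5}.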

  26. DAWG properties • For |w|>2, the Directed Acyclic Word Graph for w has at most 2|w|-1 states, and 3|w|-4 edges
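The size bounds can be checked on small inputs with the standard online suffix-automaton construction, which builds the same minimal automaton in linear time for a fixed alphabet. This is a swapped-in textbook technique, not necessarily the construction used in the talk; the function and variable names are assumptions.

```python
def build_dawg(w):
    """Online suffix automaton of w; returns the list of transition dicts (state 0 is the root)."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for c in w:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone takes over the shorter strings of q's class.
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q]); length.append(length[p] + 1)
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt

def is_substring(nxt, s):
    """Follow transitions from the root; succeeds exactly for substrings of w."""
    state = 0
    for c in s:
        if c not in nxt[state]:
            return False
        state = nxt[state][c]
    return True

if __name__ == "__main__":
    for w in ("abcbc", "abbbb", "aaaa", "abcabcabc"):
        g = build_dawg(w)
        states, edges = len(g), sum(len(d) for d in g)
        print(w, states, "<=", 2 * len(w) - 1, "|", edges, "<=", 3 * len(w) - 4)
```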

  27. D(w) and ST(wR) • There is a map between the nodes of the DAWG D(w) and the nodes of the implicit ST(wR) • Example: w = abcbc, wR = cbcba • Store the DAWG using the ST, which uses only o(N) bits • [Figure: ST(wR) with edge labels a, b, cb, cba]

  28. D(w) and ST(wR) (2) • List all incoming edges of node q in D(w) using ST(wR)

  29. Local Alignment using DAWG • Basis • Induction

  30. Extensions • Meaningful alignment using DAWG • Prune the nodes whose score is less than zero • Shortest-path-style pruning • Cache log(N) nodes → the worst-case running time is M*N*log(N); the average case is the same for M << N.
