
DNA Sequence Alignment

DISP Laboratory, Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan (ROC). Speaker: Che-Ming Hu. Adviser: Prof. Jian-Jiun Ding.


Presentation Transcript


  1. DNA Sequence Alignment • DISP Laboratory, Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan (ROC) • Speaker: Che-Ming Hu • Adviser: Prof. Jian-Jiun Ding • DISP Lab @ MD531

  2. Outline • Motivations • Introduction to DNA sequences • Sequence alignment algorithm • Dynamic Programming Algorithm • FASTA • BLAST • UDCR • CUDCR • Conclusion • Reference

  3. Motivation • The amount of sequence data is huge • Aligning it takes too much computation time

  4. Introduction to DNA sequence(1) • DNA Sequence Assembly • Shotgun Sequencing • Greedy algorithm • Issues of shotgun sequencing • Sequence alignment

  5. Introduction to DNA sequence(2) • Shotgun Sequencing • Original DNA • Copies of the original DNA • Fragments of the copies (shotgun) • Reconstruct the original DNA

  6. Introduction to DNA sequence(3) • Greedy algorithm • Step 1. Calculate pairwise alignments of all fragments. • Step 2. Choose the two fragments with the largest overlap. • Step 3. Merge the chosen fragments. • Step 4. Repeat steps 2 and 3 until no more fragments can be merged.
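The four greedy steps can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the fragments are error-free, all have the same orientation, and overlap means an exact suffix-prefix match. The function names `overlap` and `greedy_assemble` are hypothetical and do not come from the slides.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str]) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest exact overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        # Steps 1 and 2: score every ordered pair and keep the best one.
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        # Step 4: stop when no pair overlaps any more.
        if best_len == 0:
            return frags
        # Step 3: merge the chosen pair into one longer fragment.
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
```

For example, `greedy_assemble(["ATAGTC", "GTCAAT", "AATCCG"])` merges the three fragments into the single contig `"ATAGTCAATCCG"`. Sequencing errors, unknown orientation, coverage gaps, and repeats are not handled by this sketch.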

  7. Introduction to DNA sequence(4) • Issues of shotgun sequencing • The errors in fragments. • The unknown orientation of a fragment. • Gaps in fragment coverage. • Repeats in fragments.

  8. Introduction to DNA sequence(5) • Sequence alignment • Global alignment • Local alignment • Semi-global alignment

  9. Dynamic Programming(1) • The edit distance between two strings • String similarity method

  10. Dynamic Programming(2) • The edit distance between two strings

  11. Dynamic Programming(3) • String similarity method • Alignment: S1: a b c - ; S2: a c - b • Pairwise scores: 1, -2, -2, 0 • Similarity score = 1 - 2 - 2 + 0 = -3

  12. Dynamic Programming(4) • The recurrence relation • Tabular computation • The traceback

  13. Dynamic Programming(5) • The recurrence relation • Initial condition • D(i, 0) = i (the first column) • D(0, j) = j (the first row) • Recurrence relation • D(i,j) = min[D(i-1, j)+1, D(i, j-1)+1, D(i-1, j-1)+t(i,j)] • where t(i,j) = 0 if S1(i) = S2(j) and t(i,j) = 1 otherwise • Here is an example: • S1 = 'vintner'; S2 = 'writers'

  14. Dynamic Programming(6) • Initial condition • D(i, 0)=i • D(0, j)=j

  15. Dynamic Programming(7) • Tabular computation • D(i,j)=min[D(i-1, j)+1, D(i, j-1)+1, D(i-1, j-1)+t(i,j)]
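The tabular computation can be sketched in Python as follows. The sketch assumes the usual unit-cost definition t(i,j) = 0 for a match and 1 for a mismatch; the function name `edit_distance_table` is hypothetical.

```python
def edit_distance_table(s1: str, s2: str) -> list[list[int]]:
    """Fill the table D, where D[i][j] is the edit distance of s1[:i] and s2[:j]."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Initial conditions: first column and first row.
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    # Recurrence, filled row by row.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1  # assumed unit mismatch cost
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
    return D


table = edit_distance_table("vintner", "writers")
print(table[-1][-1])  # edit distance of the two example strings
```

The bottom-right cell holds the edit distance of the full strings; for 'vintner' and 'writers' it is 5.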

  16. Dynamic Programming(8) • The traceback • Set a pointer from (i,j) to cell (i,j-1), denoted by ←, if D(i,j) = D(i,j-1)+1 • Set a pointer from (i,j) to cell (i-1,j), denoted by ↑, if D(i,j) = D(i-1,j)+1 • Set a pointer from (i,j) to cell (i-1,j-1), denoted by ↖, if D(i,j) = D(i-1,j-1)+t(i,j) • where t(i,j) = 0 if S1(i) = S2(j) and t(i,j) = 1 otherwise
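A minimal traceback sketch in Python is given below. Instead of storing pointers, it re-checks the three conditions while walking back from the bottom-right cell, which is equivalent for recovering one optimal alignment. The tie-breaking order (diagonal, then up, then left) and the name `align_by_traceback` are assumptions made for the illustration.

```python
def align_by_traceback(s1: str, s2: str) -> tuple[str, str, int]:
    """Return one optimal alignment (with '-' for gaps) and the edit distance."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    # Walk back from (n, m) to (0, 0), following one valid pointer per step.
    i, j = n, m
    row1, row2 = [], []
    while i > 0 or j > 0:
        t = 0 if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else 1
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + t:
            row1.append(s1[i - 1]); row2.append(s2[j - 1])  # diagonal pointer
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            row1.append(s1[i - 1]); row2.append("-")         # pointer to (i-1, j)
            i -= 1
        else:
            row1.append("-"); row2.append(s2[j - 1])         # pointer to (i, j-1)
            j -= 1
    return "".join(reversed(row1)), "".join(reversed(row2)), D[n][m]


a1, a2, dist = align_by_traceback("vintner", "writers")
print(a1)
print(a2)
print(dist)
```

Several optimal alignments can exist for the same distance; a different tie-breaking order would return a different one with the same cost.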

  17. Dynamic Programming(9) • Simulation(1) • The number "1" in a cell represents the route from right to left, denoted by ← • The number "2" in a cell represents the route from down to up, denoted by ↑ • The number "3" in a cell represents the route from lower right to upper left, denoted by ↖

  18. Dynamic Programming(13) • Simulation(2) • The same similarity score!

  19. Dynamic Programming(10) Simulation(3)

  20. Dynamic Programming(11) Simulation(4)

  21. Dynamic Programming(12) Simulation(5)

  22. FASTA(1) • Searches only for consecutive identities of length k • Much faster than dynamic programming

  23. FASTA(2) • STEP 1. Build the lookup table (or hash table) that records the positions of the k-tuple words in a sequence • STEP 2. Use hashing to reveal regions of alignment between the two sequences • STEP 3. Find the 10 best diagonal regions. • STEP 4. Keep only the highest-scoring diagonal regions. • STEP 5. Try to join the remaining diagonal regions into a longer alignment.

  24. FASTA(3) • Simulation(1) (ex: k=2) • The lookup table, including the offsets, for the two DNA sequences "ATAGTCAATCCG" and "TGAGCAATCAAG", with k=2.

  25. FASTA(4) • Simulation(2) • Each x indicates a word hit, and the word hits sharing the same offset lie on the same diagonal.
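The lookup table and the grouping of word hits by diagonal can be sketched as follows for the two example sequences with k = 2. The offset is taken here as (position in the first sequence) minus (position in the second sequence) with 0-based positions; that convention, and the function names, are assumptions of this sketch rather than details confirmed by the slides.

```python
from collections import Counter, defaultdict


def kmer_positions(seq: str, k: int) -> dict[str, list[int]]:
    """Lookup table: every k-tuple word mapped to its start positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_hits(s1: str, s2: str, k: int) -> Counter:
    """Count word hits per diagonal, keyed by offset i - j (assumed convention)."""
    table = kmer_positions(s1, k)
    hits = Counter()
    for j in range(len(s2) - k + 1):
        for i in table.get(s2[j:j + k], []):
            hits[i - j] += 1
    return hits


hits = diagonal_hits("ATAGTCAATCCG", "TGAGCAATCAAG", k=2)
for offset, count in hits.most_common():
    print(offset, count)
```

With this convention the densest diagonal is offset 1 with four word hits, which corresponds to the shared substring "CAATC". Real FASTA goes on to score and join diagonal regions; this sketch stops at counting hits.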

  26. FASTA(5) Simulation(3)

  27. FASTA(6) Simulation(4)

  28. FASTA Simulation(5) • Suffix • Overlap • Prefix • Merge • Contig

  29. BLAST(1) • Similar to FASTA • The major difference from FASTA: • BLAST keeps only the relatively high-scoring words

  30. BLAST(2) • Take word length 3 as an example • Query sequence: PQGEFG • Word 1: PQG • Word 2: QGE • Word 3: GEF • Word 4: EFG • For example: • the scores obtained by comparing PQG with PEG and PQA are 15 and 12, respectively. • With threshold T = 13, PEG is kept and PQA is discarded.
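The word-splitting and threshold step can be sketched as below. The slide does not name its scoring matrix, so the small `PAIR_SCORES` table is an assumption: it contains only the five residue pairs needed here, with values chosen to reproduce the slide's scores of 15 and 12 (they agree with the corresponding BLOSUM62 entries). Keeping words with score at least T is also an assumption; the slide only shows that 15 passes and 12 fails for T = 13.

```python
# Partial, illustrative score table (assumed; only the pairs used below).
PAIR_SCORES = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("Q", "E"): 2,
    ("G", "G"): 6,
    ("G", "A"): 0,
}


def pair_score(a: str, b: str) -> int:
    """Symmetric lookup in the partial table; raises KeyError for unknown pairs."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def word_score(w1: str, w2: str) -> int:
    """Sum of the pairwise scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))


def query_words(query: str, w: int) -> list[str]:
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def keep_high_scoring(word: str, candidates: list[str], threshold: int) -> list[str]:
    """Keep the candidate words whose score against `word` reaches the threshold."""
    return [c for c in candidates if word_score(word, c) >= threshold]


print(query_words("PQGEFG", 3))
print(keep_high_scoring("PQG", ["PEG", "PQA"], threshold=13))
```

Running it lists the four query words and keeps only PEG, matching the slide.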

  31. BLAST(3) • Query sequence: R P P Q G L F • Database sequence: D P P E G V V • Pairwise scores: -2, 7, 7, 2, 6, 1, -1 • Exact match is scanned. • HSP: maximal aggregate score = 7 + 7 + 2 + 6 + 1 = 23
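The maximal aggregate score of the HSP is the best sum over a contiguous run of pairwise scores. A minimal sketch using a maximum-sum segment scan is shown below; actual BLAST extends outward from a word hit and stops with a drop-off rule, so this is a simplification, and the function name `best_segment` is hypothetical.

```python
def best_segment(scores: list[int]) -> tuple[int, int, int]:
    """Return (best_sum, start, end) of the highest-scoring contiguous run.

    The run is scores[start:end]; `scores` must be non-empty.
    """
    best_sum, best_start, best_end = scores[0], 0, 1
    cur_sum, cur_start = 0, 0
    for idx, s in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart the run here.
            cur_sum, cur_start = s, idx
        else:
            cur_sum += s
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, idx + 1
    return best_sum, best_start, best_end


query, database = "RPPQGLF", "DPPEGVV"
scores = [-2, 7, 7, 2, 6, 1, -1]  # pairwise scores copied from the slide
total, start, end = best_segment(scores)
print(total, query[start:end], database[start:end])
```

For the slide's scores this selects the segment PPQGL against PPEGV with an aggregate score of 23.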

  32. BLAST(4) • Original BLAST • New BLAST (Gapped BLAST) • more sensitive and faster

  33. UDCR(1) • Unitary Discrete Correlation (UDCR) Algorithm • A novel algorithm

  34. UDCR(2) Unitary Mapping Discrete Correlation

  35. UDCR(3) Unitary Mapping

  36. UDCR(4) • Discrete Correlation • Definition: • x, y are two DNA sequences • s[n] (similarity index) • s1[n] (pair-similarity index) • s2[n] (pair-different index)

  37. UDCR(5) • s[n] (similarity index) • the number of nucleotides of x_n that satisfy x_n[τ] = y[τ] • (x_n[τ] = x[τ + n], τ = 0, 1, ..., M-1, n = -M+1, -M+2, ..., N-1)

  38. UDCR(6) • s1[n] (pair-similarity index): • the number of nucleotides of x_n that satisfy b_{x,n}[τ] = -b_y[τ], • where b_{x,n} and b_y are the unitary value representations of x_n and y, respectively • In fact, b_{x,n}[τ] = -b_y[τ] means that x[n+τ] is different from y[τ] but they belong to the same pair (A-T pair or G-C pair).

  39. UDCR(7) • s2[n] (pair-different index): • the number of nucleotides of x_n that satisfy b_{x,n}[τ] = ±j b_y[τ] • (i.e., x[n+τ] and y[τ] do not belong to the same pair).
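The three indices can be checked by direct counting, without any transform. The sketch below follows the verbal definitions: s counts identical nucleotides, s1 counts different nucleotides of the same pair (A-T or G-C), and s2 counts nucleotides of different pairs. Positions where the shifted sequences do not overlap are skipped, which is an assumption of this sketch, as is the function name `similarity_indices`.

```python
# Which complementary pair each nucleotide belongs to.
PAIR = {"A": "AT", "T": "AT", "G": "GC", "C": "GC"}


def similarity_indices(x: str, y: str, n: int) -> tuple[int, int, int]:
    """Return (s[n], s1[n], s2[n]) by comparing x[n + tau] with y[tau]."""
    s = s1 = s2 = 0
    for tau in range(len(y)):
        i = n + tau
        if 0 <= i < len(x):          # only count the overlapping part
            if x[i] == y[tau]:
                s += 1               # identical nucleotide
            elif PAIR[x[i]] == PAIR[y[tau]]:
                s1 += 1              # different, but same pair
            else:
                s2 += 1              # different pair
    return s, s1, s2


x, y = "GTAGCTGAACTGAAC", "AACTGAA"
for n in range(-len(y) + 1, len(x)):
    print(n, similarity_indices(x, y, n))
```

The three counts always add up to the length of the overlap at shift n. This brute-force count costs on the order of N times M operations; the correlation-based formulation is what makes a transform-based speed-up possible.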

  40. UDCR(8)

  41. UDCR(9) • Simulation(1) • x = 'GTAGCTGAACTGAAC', • y = 'AACTGAA'. • Then N = 15 and M = 7. • b_x = [j, -1, 1, j, -j, -1, j, 1, 1, -j, -1, j, 1, 1, -j], • b_y = [1, 1, -j, -1, j, 1, 1]. • z1 = [j, -1+j, 1, 1+j, -j, -1-j, -3+j2, j3, 6+j, 1-j4, -4-j3, -4+j3, 2+j5, 7, 2-j5, -3-j3, -3+j2, 1+j3, 3, 1-j, -j], • z2 = [-1, 0, 3, -2, -1, 0, -1, 1, 5, -5, 1, 1, -3, 7, -3, 0, 1, -2, 3, 0, -1] • Note that
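The z1 sequence can be reproduced with a direct correlation of the unitary-mapped sequences. The mapping A → 1, T → -1, G → j, C → -j and the use of the complex conjugate of b_y are assumptions of this sketch: the slide that defines the mapping did not survive as text, and these choices were picked because they reproduce the z1 values listed for this example. The function names are hypothetical.

```python
# Assumed unitary mapping (not stated in the surviving slide text).
UNITARY = {"A": 1, "T": -1, "G": 1j, "C": -1j}


def unitary_map(seq: str) -> list[complex]:
    """Map each nucleotide to its unitary value."""
    return [complex(UNITARY[c]) for c in seq]


def correlate(bx: list[complex], by: list[complex]) -> dict[int, complex]:
    """z1[n] = sum over tau of bx[n + tau] * conj(by[tau]), n = -M+1 .. N-1."""
    N, M = len(bx), len(by)
    z = {}
    for n in range(-M + 1, N):
        total = 0
        for tau in range(M):
            i = n + tau
            if 0 <= i < N:  # terms outside x contribute nothing
                total += bx[i] * by[tau].conjugate()
        z[n] = total
    return z


x, y = "GTAGCTGAACTGAAC", "AACTGAA"
z1 = correlate(unitary_map(x), unitary_map(y))
print(z1[7])   # a full match gives M
print(z1[2])
```

The double loop is written for clarity only. The same correlation can be evaluated with a fast transform, which is where the computational saving of the method comes from.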

  42. UDCR(10) • Simulation(2) • x = 'GTAGCTGAACTGAAC', y = 'AACTGAA'. • Since s[7] = L_7 = M = 7, we can conclude that the 7-length subsequence of x starting from x[7] (i.e., {x[7], x[8], ..., x[13]}) matches y. • (y_7: y shifted 7 entries rightward)

  43. UDCR(11) • Simulation(3) • x = 'GTAGCTGAACTGAAC', y = 'AACTGAA'. • Since s[2] = L_2 = M = 7, we can conclude that the 7-length subsequence starting from x[2] (i.e., {x[2], x[3], ..., x[8]}) • (y_2: y shifted 2 entries rightward)

  44. UDCR(12) • Simulation(4) • x = 'GTAGCTGAACTGAAC', y = 'AACTGAA'. • Since s[12] = L_12 = N - n = 15 - 12 = 3, we can conclude that the 3-length suffix of x matches the 3-length prefix of y.

  45. UDCR(13) • Simulation(5) • x = 'GTAGCTGAACTGAAC', y = 'AACTGAA'. • Since s[-4] = L_-4 = M + n = 7 + (-4) = 3, we can conclude that the 3-length prefix of x matches the 3-length suffix of y.

  46. Conclusion • UDCR produces the alignment result while saving a large amount of computation time. • In addition, UDCR and the DP algorithm can be combined into a new algorithm called CUDCR. • The advantage of CUDCR is that it saves even more time while remaining as accurate as UDCR. • The NTT is used instead of the DFT because it requires less computation.

  47. Reference • [1] Soo-Chang Pei, Jian-Jiun Ding, "Sequence Comparison and Alignment by Discrete Correlations, Unitary Mapping, and Number Theoretic Transforms" • [2] Kang-Hua Hsu, "Introduction to sequence comparison and alignment" • [3] Michael S. Waterman, "Introduction to Computational Biology" • [4] Dan Gusfield, "Algorithms on Strings, Trees, and Sequences" • [5] Setubal, Meidanis, "Introduction to Computational Molecular Biology"

  48. Thank you
