DNA Sequence Alignment

Speaker: Che-Ming Hu
Adviser: Prof. Jian-Jiun Ding
DISP Laboratory, Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan (ROC)
DISP Lab @ MD531

Outline
• Motivations
• Introduction to DNA sequences
• Sequence alignment algorithm
• Dynamic Programming Algorithm
• FASTA
• BLAST
• UDCR
• CUDCR
• Conclusion
• Reference

Motivation
• A huge number of sequences
• Too much computation time

Introduction to DNA sequence(1)
• DNA Sequence Assembly
• Shotgun Sequencing
• Greedy algorithm
• Issues of shotgun sequencing
• Sequence alignment

Introduction to DNA sequence(2)
• Shotgun Sequencing
  • Original DNA
  • Copies of the original DNA
  • Fragments of the copies (shotgun)
  • Reconstruct the original DNA

Introduction to DNA sequence(3)
• Greedy algorithm
  • Step 1. Calculate pairwise alignments of all fragments.
  • Step 2. Choose the two fragments with the largest overlap.
  • Step 3. Merge the chosen fragments.
  • Step 4. Repeat steps 2 and 3 until no fragments can be merged.
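The greedy steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: fragments are error-free and share one orientation, and "overlap" means an exact suffix-prefix match rather than a full pairwise alignment. The names `overlap`, `greedy_assemble`, and `min_overlap` are hypothetical and do not come from the slides.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_overlap=1):
    """Repeatedly merge the pair of fragments with the largest exact overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        # Steps 1-2: score every ordered pair and remember the best one.
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k < min_overlap:
            break  # Step 4: no remaining pair can be merged.
        # Step 3: merge the chosen pair into a single contig.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


contigs = greedy_assemble(["ATAGTC", "GTCAAT", "AATCCG"])
print(contigs)
```

Because the choice is greedy, repeats in the fragments can lead to a wrong merge, which is one of the shotgun sequencing issues listed in this talk.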

Introduction to DNA sequence(4)
• Issues of shotgun sequencing
  • Errors in fragments.
  • The unknown orientation of a fragment.
  • Gaps in fragment coverage.
  • Repeats in fragments.

Introduction to DNA sequence(5)
• Sequence alignment
  • Global alignment
  • Local alignment
  • Semi-global alignment

Dynamic Programming(1)
• The edit distance between two strings
• String similarity method

Dynamic Programming(2)
• The edit distance between two strings

Dynamic Programming(3)
• String similarity method
• Alignment:
  S1:              a   b   c   -
  S2:              a   c   -   b
  Pairwise score:  1  -2  -2   0
• Similarity score = 1 - 2 - 2 + 0 = -3

Dynamic Programming(4)
• The recurrence relation
• Tabular computation
• The traceback

Dynamic Programming(5)
• The recurrence relation
  • Initial condition
    • D(i, 0) = i (the first column)
    • D(0, j) = j (the first row)
  • Recurrence relation
    • D(i, j) = min[D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i, j)]
    • where t(i, j) = 1 if S1(i) ≠ S2(j), and t(i, j) = 0 if S1(i) = S2(j)
• Here is an example:
  • S1 = ‘vintner’; S2 = ‘writers’

Dynamic Programming(6)
• Initial condition
  • D(i, 0) = i
  • D(0, j) = j

Dynamic Programming(7)
• Tabular computation
  • D(i, j) = min[D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i, j)]
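As a small sketch of the tabular computation, the following Python fills the table D row by row from the initial conditions and the recurrence. It assumes t(i, j) is 0 when the two characters are equal and 1 otherwise; the function name `edit_distance_table` is illustrative only.

```python
def edit_distance_table(s1: str, s2: str):
    """Return the (len(s1)+1) x (len(s2)+1) edit distance table D."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # first column: delete i characters of s1
    for j in range(m + 1):
        D[0][j] = j  # first row: insert j characters of s2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed mismatch indicator: 0 for equal characters, 1 otherwise.
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,      # come from the cell above
                          D[i][j - 1] + 1,      # come from the cell to the left
                          D[i - 1][j - 1] + t)  # come from the diagonal cell
    return D


D = edit_distance_table("vintner", "writers")
print(D[-1][-1])  # edit distance between the two example strings
```

For the example strings ‘vintner’ and ‘writers’, the bottom-right cell D(7, 7) evaluates to 5. The table needs (n+1)(m+1) cells, which is the computation cost that the faster methods later in this talk try to avoid.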

Dynamic Programming(8)
• The traceback
  • Set a pointer from cell (i, j) to cell (i, j-1), denoted by ←, if D(i, j) = D(i, j-1) + 1
  • Set a pointer from cell (i, j) to cell (i-1, j), denoted by ↑, if D(i, j) = D(i-1, j) + 1
  • Set a pointer from cell (i, j) to cell (i-1, j-1), denoted by ↖, if D(i, j) = D(i-1, j-1) + t(i, j)
  • where t(i, j) = 1 if S1(i) ≠ S2(j), and t(i, j) = 0 if S1(i) = S2(j)
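The traceback can be sketched as follows. Instead of storing pointers, this version re-checks the three conditions while walking back from cell (n, m) to cell (0, 0), and it returns just one of the possibly many optimal alignments. The function name `align_by_edit_distance`, the tie-breaking order (diagonal first), and the mismatch indicator (0 for equal characters, 1 otherwise) are assumptions made for illustration.

```python
def align_by_edit_distance(s1: str, s2: str):
    """Return (distance, aligned_s1, aligned_s2) for one optimal alignment."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        t = 0 if (i > 0 and j > 0 and s1[i - 1] == s2[j - 1]) else 1
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + t:
            # Diagonal pointer: match or substitution.
            top.append(s1[i - 1]); bottom.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            # Upward pointer: a character of s1 is aligned with a gap.
            top.append(s1[i - 1]); bottom.append("-"); i -= 1
        else:
            # Leftward pointer: a character of s2 is aligned with a gap.
            top.append("-"); bottom.append(s2[j - 1]); j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))


dist, a1, a2 = align_by_edit_distance("vintner", "writers")
print(dist)
print(a1)
print(a2)
```

Choosing a different tie-breaking order can give a different alignment with the same distance.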

Dynamic Programming(9)
• Simulation(1)
  • The number “1” in a cell represents the route from right to left, denoted by ←
  • The number “2” in a cell represents the route from down to up, denoted by ↑
  • The number “3” in a cell represents the route from lower right to upper left, denoted by ↖

Dynamic Programming(13)
• Simulation(2)
• The same similarity score!

Dynamic Programming(10)
• Simulation(3)

FASTA(1)
• Searches only for consecutive identities of length k
• Faster than dynamic programming

FASTA(2)
• STEP 1. Establish the lookup table (or hash table) to show the positions of the k-tuple words in a sequence.
• STEP 2. Use hashing to reveal regions of alignment between the two sequences.
• STEP 3. Find the 10 best diagonal regions.
• STEP 4. Keep only the highest-scoring diagonal regions.
• STEP 5. Try to join the remaining diagonal regions into a longer alignment.

FASTA(3)
• Simulation(1) (ex: k=2)
• The lookup table, including the offsets, for the two DNA sequences “ATAGTCAATCCG” and “TGAGCAATCAAG”, with k=2.
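The lookup table itself did not survive in this transcript, but its construction can be sketched as below. The sketch maps every k-tuple word to the list of positions where it starts. Positions are 0-based here, which is an assumption (the original table may have used 1-based positions), and `build_lookup_table` is an illustrative name.

```python
from collections import defaultdict


def build_lookup_table(seq: str, k: int) -> dict:
    """Map each k-tuple word in seq to the 0-based positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        table[seq[pos:pos + k]].append(pos)
    return dict(table)


table1 = build_lookup_table("ATAGTCAATCCG", 2)
table2 = build_lookup_table("TGAGCAATCAAG", 2)

# Words that occur in both sequences are the candidate word hits.
for word in sorted(set(table1) & set(table2)):
    print(word, table1[word], table2[word])
```

With k=2 the two example sequences share the words AA, AG, AT, CA, and TC.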

FASTA(4)
• Simulation(2)
• Each x indicates a word hit; word hits sharing the same offset lie on the same diagonal.
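To make the diagonal idea concrete, the sketch below counts the word hits on each diagonal for the two example sequences. It assumes the offset of a hit is (position in the first sequence) minus (position in the second sequence), with 0-based positions; the slides may use the opposite sign convention. The name `diagonal_hits` is illustrative.

```python
from collections import Counter, defaultdict


def diagonal_hits(seq1: str, seq2: str, k: int) -> Counter:
    """Count k-tuple word hits per diagonal, keyed by offset i - j."""
    positions = defaultdict(list)
    for i in range(len(seq1) - k + 1):
        positions[seq1[i:i + k]].append(i)
    hits = Counter()
    for j in range(len(seq2) - k + 1):
        # Every occurrence of the same word in seq1 gives one hit.
        for i in positions.get(seq2[j:j + k], []):
            hits[i - j] += 1
    return hits


hits = diagonal_hits("ATAGTCAATCCG", "TGAGCAATCAAG", 2)
for offset, count in hits.most_common():
    print(offset, count)
```

Under this convention the densest diagonal is offset 1 with four hits (CA, AA, AT, TC), which corresponds to the shared substring CAATC. A FASTA-style search would keep the best-scoring diagonals and discard the rest.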

FASTA(5)
• Simulation(3)

FASTA(6)
• Simulation(4)

FASTA
• Simulation(5)
• (Figure labels: Suffix, Overlap, Prefix, Merge, Contig)

BLAST(1)
• Similar to FASTA
• The major difference from FASTA:
  • BLAST keeps only the relatively high-scoring words

BLAST(2)
• Take a word length of 3 as an example
  • Query sequence: PQGEFG
  • Word 1: PQG
  • Word 2: QGE
  • Word 3: GEF
  • Word 4: EFG
• For example:
  • The scores obtained by comparing PQG with PEG and with PQA are 15 and 12, respectively.
  • With a threshold T of 13, PEG is kept and PQA is abandoned.
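The word-selection step can be sketched as follows. The score table contains only the amino acid pairs needed for this example; its values were chosen to reproduce the scores 15 and 12 quoted above, and they agree with the BLOSUM62 matrix, but the slides do not say which matrix was used, so treat that as an assumption. The function names and the "score >= T" comparison are also illustrative.

```python
# Partial substitution scores, just enough for the PQG example (assumed values).
SCORES = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("Q", "E"): 2,
    ("G", "G"): 6,
    ("G", "A"): 0,
}


def pair_score(a: str, b: str) -> int:
    """Look up a symmetric substitution score."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def split_into_words(query: str, w: int):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def word_score(w1: str, w2: str) -> int:
    """Sum of the position-by-position substitution scores."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))


def keep_high_scoring(word: str, candidates, threshold: int):
    """Keep only the candidates whose score against word reaches the threshold."""
    return [c for c in candidates if word_score(word, c) >= threshold]


print(split_into_words("PQGEFG", 3))
print(word_score("PQG", "PEG"), word_score("PQG", "PQA"))
print(keep_high_scoring("PQG", ["PEG", "PQA"], threshold=13))
```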

BLAST(3)
• Exact match is scanned.
  Query sequence:     R   P   P   Q   G   L   F
  Database sequence:  D   P   P   E   G   V   V
  Score:             -2   7   7   2   6   1  -1
• HSP: maximal aggregate score = 7 + 7 + 2 + 6 + 1 = 23
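The maximal aggregate score of the HSP can be checked with a maximum-sum contiguous segment search over the per-position scores. This is a simplified stand-in, not BLAST's actual extension procedure (which grows a seed hit in both directions and stops when the score drops too far); `best_segment` is an illustrative name.

```python
def best_segment(scores):
    """Return (best_sum, start, end) of the maximum-sum contiguous segment.

    start and end are inclusive 0-based indices.
    """
    best_sum, best_start, best_end = scores[0], 0, 0
    run_sum, run_start = 0, 0
    for idx, value in enumerate(scores):
        if run_sum <= 0:
            # A non-positive running sum cannot help: restart the segment here.
            run_sum, run_start = value, idx
        else:
            run_sum += value
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, idx
    return best_sum, best_start, best_end


scores = [-2, 7, 7, 2, 6, 1, -1]  # R/D, P/P, P/P, Q/E, G/G, L/V, F/V
print(best_segment(scores))
```

For the scores above, the best segment covers positions 1 to 5 (P P Q G L against P P E G V) with a total of 23, which matches the value on the slide.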

BLAST(4)
• Original BLAST
• New BLAST (Gapped BLAST)
  • more sensitive, at increased speed

UDCR(1)
• Unitary Discrete Correlation (UDCR) Algorithm
• A novel algorithm

UDCR(2)
• Unitary Mapping
• Discrete Correlation

UDCR(3)
• Unitary Mapping

UDCR(4)
• Discrete Correlation
• Definitions, for two DNA sequences x and y:
  • s[n]: similarity index
  • s1[n]: pair-similarity index
  • s2[n]: pair-different index

UDCR(5)
• s[n] (similarity index)
  • xn[τ] = x[τ + n], τ = 0, 1, ..., M-1, n = -M+1, -M+2, ..., N-1
  • s[n] is the number of nucleotides of xn that satisfy xn[τ] = y[τ].

UDCR(6)
• s1[n] (pair-similarity index):
  • the number of nucleotides of xn that satisfy bx,n[τ] = -by[τ],
  • where bx,n and by are the unitary value representations of xn and y, respectively.
  • In fact, bx,n[τ] = -by[τ] means that x[n+τ] is different from y[τ] but they belong to the same pair (A-T pair or G-C pair).

UDCR(7)
• s2[n] (pair-different index):
  • the number of nucleotides of xn that satisfy bx,n[τ] = ±j·by[τ]
  • (i.e., x[n+τ] and y[τ] do not belong to the same pair).
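Before any correlation is involved, the three indices can be computed by direct counting, which is a useful reference when checking a faster implementation. The sketch below counts only the positions where the shifted x overlaps y; the function name `similarity_indices` and the example sequences are chosen for illustration.

```python
# Complementary partner of each nucleotide (A-T pair and G-C pair).
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def similarity_indices(x: str, y: str, n: int):
    """Return (s[n], s1[n], s2[n]) by direct counting.

    x is shifted so that x[n + tau] is compared with y[tau]; only the
    positions where both indices are valid are counted.
    """
    N, M = len(x), len(y)
    s = s1 = s2 = 0
    for tau in range(max(0, -n), min(M, N - n)):
        a, b = x[n + tau], y[tau]
        if a == b:
            s += 1    # identical nucleotides
        elif PAIR[a] == b:
            s1 += 1   # different, but from the same pair
        else:
            s2 += 1   # from different pairs
    return s, s1, s2


x, y = "GTAGCTGAACTGAAC", "AACTGAA"
for n in (-4, 2, 7, 12):
    print(n, similarity_indices(x, y, n))
```

Counting this way for every shift costs on the order of N·M comparisons, which is the cost that a correlation-based formulation is meant to reduce.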

UDCR(8)

UDCR(9)
• Simulation(1)
  • x = ‘GTAGCTGAACTGAAC’, y = ‘AACTGAA’
  • Then N = 15 and M = 7.
  • bx = [j, -1, 1, j, -j, -1, j, 1, 1, -j, -1, j, 1, 1, -j]
  • by = [1, 1, -j, -1, j, 1, 1]
  • z1 = [j, -1+j, 1, 1+j, -j, -1-j, -3+j2, j3, 6+j, 1-j4, -4-j3, -4+j3, 2+j5, 7, 2-j5, -3-j3, -3+j2, 1+j3, 3, 1-j, -j]
  • z2 = [-1, 0, 3, -2, -1, 0, -1, 1, 5, -5, 1, 1, -3, 7, -3, 0, 1, -2, 3, 0, -1]
  • z1 and z2 each have N + M - 1 = 21 entries, for n = -6, -5, ..., 14.
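The values above can be reproduced with the sketch below, under several assumptions that the surviving slide text does not state explicitly: the unitary mapping is A → 1, T → -1, G → j, C → -j; z1[n] is the correlation of bx with the conjugate of by; and z2[n] is the correlation of the squared values. With these assumptions each compared pair contributes +1, -1, or ±j to z1 and +1 or -1 to z2, so Re(z1[n]) = s[n] - s1[n] and z2[n] = s[n] + s1[n] - s2[n]. Writing L for the number of overlapping positions at shift n, L = s[n] + s1[n] + s2[n], which gives s[n] = (2·Re(z1[n]) + z2[n] + L) / 4. The function name `udcr_similarity` is illustrative.

```python
# Assumed unitary mapping of the four nucleotides onto the unit circle.
UNITARY = {"A": 1 + 0j, "T": -1 + 0j, "G": 1j, "C": -1j}


def udcr_similarity(x: str, y: str) -> dict:
    """Return {n: s[n]} for n = -M+1, ..., N-1 using the two correlations."""
    bx = [UNITARY[c] for c in x]
    by = [UNITARY[c] for c in y]
    N, M = len(x), len(y)
    s = {}
    for n in range(-M + 1, N):
        taus = range(max(0, -n), min(M, N - n))  # overlapping positions
        # First correlation: unitary values against conjugated unitary values.
        z1 = sum(bx[n + t] * by[t].conjugate() for t in taus)
        # Second correlation: squared unitary values (real, +1 or -1 per term).
        z2 = sum((bx[n + t] ** 2 * by[t] ** 2).real for t in taus)
        L = len(taus)
        s[n] = round((2 * z1.real + z2 + L) / 4)
    return s


s = udcr_similarity("GTAGCTGAACTGAAC", "AACTGAA")
print(s[7], s[2], s[12], s[-4])
```

The sums here are evaluated directly for readability. The speed advantage comes from computing the two correlations with a fast transform, such as the DFT or the NTT mentioned in the conclusion.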

UDCR(10)
• Simulation(2)
  • x = ‘GTAGCTGAACTGAAC’, y = ‘AACTGAA’.
  • Since s[7] = L_7 = M = 7, we can conclude that the 7-length subsequence of x starting from x[7] (i.e., {x[7], x[8], ..., x[13]}) matches y exactly.
  • (y_7: y shifted 7 entries rightward)

UDCR(11)
• Simulation(3)
  • x = ‘GTAGCTGAACTGAAC’, y = ‘AACTGAA’.
  • Here L_2 = M = 7 and s[2] = 6, so the 7-length subsequence of x starting from x[2] (i.e., {x[2], x[3], ..., x[8]} = ‘AGCTGAA’) matches y in all but one position (x[3] = G versus y[1] = A).
  • (y_2: y shifted 2 entries rightward)

UDCR(12)
• Simulation(4)
  • x = ‘GTAGCTGAACTGAAC’, y = ‘AACTGAA’.
  • Since s[12] = L_12 = N - n = 15 - 12 = 3, we can conclude that the 3-length suffix of x (‘AAC’) matches the 3-length prefix of y.

UDCR(13)
• Simulation(5)
  • x = ‘GTAGCTGAACTGAAC’, y = ‘AACTGAA’.
  • Here L_-4 = M + n = 7 + (-4) = 3 and s[-4] = 2, so the 3-length prefix of x (‘GTA’) agrees with the 3-length suffix of y (‘GAA’) in two of the three positions.

Conclusion
• UDCR produces the alignment result while saving a large amount of computation time.
• UDCR can also be combined with the DP algorithm into a new algorithm called CUDCR, which saves even more time while remaining as accurate as UDCR.
• Use the NTT (number theoretic transform) instead of the DFT, because it requires less computation.

Reference
[1] Soo-Chang Pei and Jian-Jiun Ding, “Sequence Comparison and Alignment by Discrete Correlations, Unitary Mapping, and Number Theoretic Transforms”
[2] Kang-Hua Hsu, “Introduction to Sequence Comparison and Alignment”
[3] Michael S. Waterman, “Introduction to Computational Biology”
[4] Dan Gusfield, “Algorithms on Strings, Trees, and Sequences”
[5] Setubal and Meidanis, “Introduction to Computational Molecular Biology”

Thank you
