Presented by: Dong Si 4-6-2011

A Novel Approach to Multiple Sequence Alignment using Hadoop Data Grids - Dr G Sudha Sadasivam, Mr G Baktavatchalam

Contents
1. Introduction
2. Method
3. Result & Application

1.1 What is sequence alignment?
• Protein/DNA sequence: a chain of amino acid residues (protein) or nucleotides (DNA).
• Sequence alignment: a way to identify regions of similarity that may be a consequence of functional, structural or evolutionary relationships between the sequences.

1.2 What is MSA?
• Multiple sequence alignment (MSA): an extension of pairwise alignment that incorporates more than two sequences at a time.
• Current methods
→ Dynamic programming: produces global alignments via the Needleman-Wunsch algorithm.
→ Progressive methods: align the closest sequences first and successively add more distant ones.
• DP methods produce accurate results but are computationally intensive; progressive alignment methods are fast and deterministic but tend to get caught in local maxima.

1.3 How can we improve?
• Dynamic programming algorithms guarantee an optimal alignment, but the computing power required for larger alignments is very high.
• The authors' proposed method runs a dynamic programming algorithm on a Hadoop data grid. Combining the algorithm with the data and computation parallelism available in the Hadoop framework improves computational efficiency as well as accuracy.
• Because the Hadoop framework is highly scalable, the proposed multiple sequence aligner is well suited to large-scale alignment problems.

2.1 Needleman-Wunsch algorithm
• A dynamic programming method for optimization problems: it determines the best solution under a given scoring scheme.
• Break the problem into subproblems.
• Solve each subproblem separately.

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i,j-1) + g \\ F(i-1,j) + g \end{cases}$$

F(i, j): the maximum score for aligning the first i symbols of sequence 1 with the first j symbols of sequence 2
s(x_i, y_j): match/mismatch score for aligning x_i with y_j
g: gap penalty

2.2 Example
Sequence 1: ACAGTAG
Sequence 2: ACTCG
Scoring: Match: 1, Mismatch: 0, Gap: -1
• Initialization
• Matrix filling (scoring)
• Trace back
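
The three steps of the example (initialization, matrix filling, trace back) can be sketched in Python as follows. This is a minimal illustration of the textbook Needleman-Wunsch recurrence with the slide's scoring values, not the authors' Hadoop implementation. The function name `needleman_wunsch`, the border initialization F(i,0) = i·g and F(0,j) = j·g, and the tie-breaking order in the trace back are assumptions of this sketch; other tie-breaking choices give different but equally scored alignments.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=0, gap=-1):
    """Global pairwise alignment; returns (score, aligned_seq1, aligned_seq2)."""
    n, m = len(seq1), len(seq2)

    # Initialization: aligning a prefix against nothing costs one gap per symbol
    # (standard border condition, assumed here; the slide does not spell it out).
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    # Matrix filling: apply the recurrence F(i,j) = max(diagonal, left, up).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x_i with y_j
                          F[i][j - 1] + gap,     # gap in sequence 1
                          F[i - 1][j] + gap)     # gap in sequence 2

    # Trace back from the bottom-right cell, preferring the diagonal on ties.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                a1.append(seq1[i - 1])
                a2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            a1.append(seq1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(seq2[j - 1])
            j -= 1

    return F[n][m], "".join(reversed(a1)), "".join(reversed(a2))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("ACAGTAG", "ACTCG")
    print(score)
    print(top)
    print(bottom)
```

The matrix has (n+1)·(m+1) cells, so time and memory both grow with the product of the sequence lengths, which is the cost that motivates distributing the work.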

2.3 MSA using Hadoop

2.7 Analysis
• For a three-sequence problem, six different combinations of the sequences are possible. All of these combinations are aligned in parallel on the Hadoop data grid.
• Within each combination, pairwise alignment is carried out:
1. S1 and S2 are read from the DFS and aligned to produce the aligned sequences A1S1 and A1S2. These aligned sequences are then stored in the DFS.
2. A1S1 is then aligned with S3, and A1S2 is aligned with S3, to produce the final aligned sequences A2S1, A2S2 and A1S3.
3. The two alignments in step 2 have no dependencies on each other, so they are carried out in parallel.
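
The dependency structure described above can be sketched locally in Python. This is only an illustration under several assumptions, not the authors' Hadoop code: Python threads stand in for Hadoop tasks, a plain dictionary stands in for the DFS, the third sequence `ACTAG` is made up for the demonstration, "combinations" is read as the six orderings of three sequences, and gap characters in an already aligned sequence are treated as ordinary symbols in the second round. The names `nw_align`, `align_ordering` and `align_all_orderings` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import permutations


def nw_align(a, b, match=1, mismatch=0, gap=-1):
    """Compact Needleman-Wunsch; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i][j - 1] + gap, F[i - 1][j] + gap)
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def align_ordering(order, store):
    """Run steps 1-3 for one ordering, e.g. ('S1', 'S2', 'S3')."""
    first, second, third = (store[name] for name in order)
    # Step 1: align the first two sequences (A1S1 and A1S2 on the slide).
    _, a1_first, a1_second = nw_align(first, second)
    # Steps 2-3: the two follow-up alignments are independent, so run them concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        job1 = pool.submit(nw_align, a1_first, third)
        job2 = pool.submit(nw_align, a1_second, third)
        return order, job1.result(), job2.result()


def align_all_orderings(store):
    """Process every ordering of the stored sequences concurrently."""
    orders = list(permutations(sorted(store)))
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda order: align_ordering(order, store), orders))


if __name__ == "__main__":
    # S3 is a made-up sequence used only for this demonstration.
    sequences = {"S1": "ACAGTAG", "S2": "ACTCG", "S3": "ACTAG"}
    for order, (score1, _, _), (score2, _, _) in align_all_orderings(sequences):
        print(order, score1, score2)
```

Because of Python's global interpreter lock, the threads here show which tasks are independent rather than delivering a real speedup; on a Hadoop grid each ordering, and each of the two step-2 alignments, would run as a separate task.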

3. Results
• The sequence is split into blocks in Hadoop, and sequence alignment is done in parallel on these blocks.
• The alignment time was found to decrease as the block size (# of blocks) increases.
• The alignment time was also found to decrease as the number of nodes increases.

Thank you!
