BioSlax Cloud – Distributing your jobs

Distributing Jobs on the BioSlax Cloud

• Stages of distributing jobs
    • Establishing secure communications
    • Splitting data
    • Distributing executables and data
    • Processing at the nodes
    • Collation of results
• Examine a simple example: fuzzy search
    • Use agrep (a fuzzy search grep utility) and Bioperl

The problem: “Find matches in the nr database that include 1 to 4 amino acid mismatches to any given input sequence”

For example, given the hypothetical protein record in a database:

    >gi|284518918_M5|gb|ADB92594.1_M5
    FLDGIDKAQEEHEKYHSNWRAMVSDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDCTHLEG
    KIILVAVHVASGYIEAEVIPAGTGQETAYFLLKLAGRWPVKTIHTDNGSNFTSATVKAACWWAGIKQEFG
    IPYNPQSQGVVESMHKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIVDIIATDLQT
    RELEKEITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNSDIKVVPHKKAKIIRD

and an input sequence of:

    DIQTKELQKQITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNS

the input sequence is found in the protein record with 4 mismatches (matching region of the record on the first line, input sequence on the second):

    DLQTRELEKEITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNS
    DIQTKELQKQITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNS
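To make the mismatch count in the example concrete, here is a minimal sketch that compares the two aligned strings position by position. The helper name count_mismatches is hypothetical and is not part of agrep.pl. It counts substitutions only, whereas a fuzzy search tool may also allow insertions and deletions.

```shell
#!/bin/sh
# count_mismatches SEQ_A SEQ_B
# Hypothetical helper: prints the number of positions at which two
# equal-length sequences differ (substitutions only).
count_mismatches() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (length(a) != length(b)) exit 2   # only defined for equal lengths
        n = 0
        for (i = 1; i <= length(a); i++)
            if (substr(a, i, 1) != substr(b, i, 1)) n++
        print n
    }'
}

record_region=DLQTRELEKEITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNS
input_seq=DIQTKELQKQITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNS
count_mismatches "$record_region" "$input_seq"   # prints 4
```

The four differences are at positions 2, 5, 8 and 10 of the aligned region.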

• Executable: agrep.pl, a Perl script using Bioperl
• Database: sequence.fasta, a small subset db with about 100 sequences

• Finds 5 matches
• Database is only 100 sequences; NR is > 10,000,000 sequences
• Scaled linearly from the 0.55 seconds taken for the 100 sequence db, the full NR database would take (10,000,000/100) x 0.55 seconds = 55,000 seconds, or approximately 15 hours
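The linear extrapolation can be reproduced with a short sketch. The function name estimate_seconds is an assumption made for illustration; the inputs (100 sequences, 0.55 seconds, 10,000,000 sequences) are the figures quoted on the slide.

```shell
#!/bin/sh
# estimate_seconds SEQS_MEASURED SECONDS_MEASURED SEQS_TARGET
# Hypothetical helper: scales a measured run time linearly with db size.
estimate_seconds() {
    awk -v n="$1" -v t="$2" -v target="$3" \
        'BEGIN { printf "%.0f\n", target / n * t }'
}

single_node_s=$(estimate_seconds 100 0.55 10000000)
echo "single node: ${single_node_s} s"
# Convert to hours for comparison with the slide.
awk -v s="$single_node_s" 'BEGIN { printf "about %.1f hours\n", s / 3600 }'
```

This is only as good as the assumption of linear scaling; it ignores disk and memory effects that may appear with a much larger database.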

• Use 4 BioSlax VMs on the Cloud: 1 master node, 3 slave nodes
• remote_process_send: shell script that is executed by each slave node to do the processing and then scp the results file back to the master node
• 01-split_sequence: Perl script to split the db into chunks of X sequences per chunk; the 100 sequences were split by 40 sequences each => 3 chunks (3 files)
• 02-upload_parts: shell script using scp with publickey authentication to upload agrep.pl, one chunk and the ‘remote_process_send’ script to each slave node
• 03-call_slave_to_execute: shell script using ssh with publickey authentication to execute agrep.pl against each chunk on each of the slave nodes concurrently and have the slave scp the results file back to the master, done using the ‘remote_process_send’ shell script
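The number of chunk files follows from a ceiling division of the db size by the chunk size. A minimal sketch, with chunk_count as a hypothetical helper name:

```shell
#!/bin/sh
# chunk_count TOTAL_SEQUENCES SEQUENCES_PER_CHUNK
# Hypothetical helper: prints the number of chunk files needed, rounding up
# so that a partly filled last chunk is counted.
chunk_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

chunk_count 100 40        # prints 3, as in the example
chunk_count 10000000 40   # prints 250000
```

With 3 chunks and 3 slave nodes, each node receives exactly one chunk.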

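remote_process_send

    #!/bin/sh
    # Run agrep.pl against every chunk on this node, appending to one results
    # file named after the host, and copy that file back to the master node.
    HOSTN=`hostname`
    for i in seq_*.fasta
    do
        ./agrep.pl "$i" >> "$HOSTN.results"
        scp "./$HOSTN.results" root@bioslax01:/mnt/hda1/downloads/. 1> /dev/null 2> /dev/null
    done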

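01-split_sequence

    #!/usr/bin/perl
    # Usage: 01-split_sequence <fasta file> <sequences per chunk>
    # Splits a FASTA file into seq_1.fasta, seq_2.fasta, ...
    use strict;
    use warnings;

    die "Usage: $0 <fasta file> <sequences per chunk>\n" unless @ARGV == 2;
    my ($dbfile, $per_chunk) = @ARGV;
    open(my $db, '<', $dbfile) or die "Cannot open $dbfile: $!";

    my ($fcount, $count) = (1, 0);
    open(my $out, '>', "seq_$fcount.fasta") or die "Cannot write seq_$fcount.fasta: $!";
    while (my $line = <$db>) {
        chomp($line);
        if ($line =~ /^>/) {
            # A header line starts a new sequence. Move on to the next chunk
            # file once the current one already holds $per_chunk sequences.
            if ($count == $per_chunk) {
                close($out);
                $fcount += 1;
                $count = 0;
                open($out, '>', "seq_$fcount.fasta") or die "Cannot write seq_$fcount.fasta: $!";
            }
            $count += 1;
        }
        print $out "$line\n";
    }
    close($out);
    close($db);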
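For comparison, the same splitting idea can be sketched with awk from a shell function. This is an alternative illustration, not the script distributed with the slides; split_fasta is a hypothetical name, and the output files follow the seq_N.fasta naming used above.

```shell
#!/bin/sh
# split_fasta FILE SEQS_PER_CHUNK
# Hypothetical helper: writes seq_1.fasta, seq_2.fasta, ... in the current
# directory, each holding at most SEQS_PER_CHUNK sequences.
split_fasta() {
    awk -v per_chunk="$2" '
        /^>/ {
            # Start a new chunk file when none is open yet or the
            # current one is full.
            if (count % per_chunk == 0) {
                if (out != "") close(out)
                out = "seq_" (++fcount) ".fasta"
            }
            count++
        }
        out != "" { print > out }
    ' "$1"
}
```

After running `split_fasta sequence.fasta 40`, a command such as `grep -c '^>' seq_*.fasta` reports how many sequences landed in each chunk, which is a quick way to check the split before uploading.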

02-upload_parts

    #!/bin/sh
    # Upload agrep.pl, remote_process_send and one chunk to each slave node.
    # count starts at 1 and is incremented before use, so the first chunk
    # goes to bioslax02, the second to bioslax03, and so on.
    count=1
    for i in seq_*.fasta
    do
        count=`expr $count + 1`
        scp ./agrep.pl root@bioslax0${count}:. 1> /dev/null 2> /dev/null
        scp ./remote_process_send root@bioslax0${count}:. 1> /dev/null 2> /dev/null
        scp "./$i" root@bioslax0${count}:. 1> /dev/null 2> /dev/null
    done
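Because the upload script sends all scp output to /dev/null, a failed copy goes unnoticed until a node has nothing to process. The following is a minimal sketch of an upload step that reports failures. The function upload_to_node and the SCP override variable are assumptions introduced for illustration and are not part of the original scripts.

```shell
#!/bin/sh
# SCP is the copy command; override it (for example SCP=echo) for a dry run.
SCP=${SCP:-scp}

# upload_to_node HOST FILE...
# Hypothetical helper: copies the files to root's home directory on HOST
# and reports a failure instead of hiding it.
upload_to_node() {
    node=$1
    shift
    if ! $SCP "$@" "root@${node}:."; then
        echo "upload to ${node} failed" >&2
        return 1
    fi
}

# Example call (not run here):
#   upload_to_node bioslax02 agrep.pl remote_process_send seq_1.fasta
```

Sending all three files in one scp call also means one connection per node rather than three.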

03-call_slave_to_execute

    #!/bin/sh
    # Remove any stale results file, then start remote_process_send on each
    # slave node in the background so that all nodes run concurrently.
    if [ -f results ]
    then
        rm results
    fi
    count=1
    for i in seq_*.fasta
    do
        count=`expr $count + 1`
        ssh -l root bioslax0${count} "./remote_process_send" &
    done
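The script above returns as soon as the ssh commands have been started in the background. If the master should block until every node has finished, for example before collating results, the shell built-in wait can be used. The sketch below assumes a hypothetical function run_on_nodes and a REMOTE override variable, neither of which is in the original scripts.

```shell
#!/bin/sh
# REMOTE is the command used to reach a node; override it (for example
# REMOTE=echo) for a dry run.
REMOTE=${REMOTE:-"ssh -l root"}

# run_on_nodes COMMAND HOST...
# Hypothetical helper: starts COMMAND on every host concurrently, then
# blocks until all of them have finished.
run_on_nodes() {
    remote_cmd=$1
    shift
    for node in "$@"; do
        $REMOTE "$node" "$remote_cmd" &
    done
    wait    # returns once every background job has exited
}

# Example call (not run here):
#   run_on_nodes ./remote_process_send bioslax02 bioslax03 bioslax04
```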

Establish secure communications between the master and slave nodes using SSH public key authentication.

• Master node: bioslax01
• Slave nodes: bioslax02, bioslax03, bioslax04

1. Generate public and private keys on bioslax01: run ‘ssh-keygen -t rsa’, which generates id_rsa and id_rsa.pub in /root/.ssh
2. Copy id_rsa.pub to each slave node as /root/.ssh/authorized_keys
3. Repeat step 1 on all the slave nodes
4. Copy the contents of each of the id_rsa.pub files from bioslax02 to bioslax04 into the file /root/.ssh/authorized_keys of bioslax01
5. It should now be possible to ssh and scp/sftp between bioslax01 and bioslax02, bioslax03, bioslax04 without keying in passwords

* This has already been set up between bioslax01 and bioslax02, 03 and 04.
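Step 4 collects several public keys in one authorized_keys file, so keys have to be appended rather than copied over each other. A minimal sketch of that append step follows; authorize_key is a hypothetical helper, and the paths passed to it are up to the caller.

```shell
#!/bin/sh
# authorize_key PUBKEY_FILE AUTHORIZED_KEYS_FILE
# Hypothetical helper: appends the public key to the authorized_keys file
# unless an identical line is already present.
authorize_key() {
    pubkey=$(cat "$1") || return 1
    auth_file=$2
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"      # keep the file writable by its owner only
    if ! grep -qxF -- "$pubkey" "$auth_file"; then
        printf '%s\n' "$pubkey" >> "$auth_file"
    fi
}

# Example call on bioslax01 (not run here), assuming the key from bioslax02
# was first copied over as /tmp/bioslax02.pub:
#   authorize_key /tmp/bioslax02.pub /root/.ssh/authorized_keys
```

Running the helper twice with the same key leaves a single entry, so repeating the setup is harmless.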

• Takes 0.06 seconds to split 100 sequences into chunks of 40 sequences
• Scaled linearly, for 10,000,000 sequences it will take (10,000,000/100) x 0.06 = 6,000 seconds, or approximately 1.7 hours

• The agrep.pl executable is 950 bytes (0.00095 MB)
• The remote_process_send script is 175 bytes (0.000175 MB)
• Each 40 sequence chunk is 14,000 bytes (0.014 MB)
• 1 Gbit/s network => 125 MB/sec transfer rate
• Uploading the executable, the script and one chunk will take (0.014 + 0.00095 + 0.000175) / 125 = 0.000121 seconds
• For 10,000,000 sequences split by 40 sequences there will be 250,000 chunks => approximately 0.000121 x 250,000 seconds to upload all the chunks, or approximately 30 seconds
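The upload estimate can be checked with a few lines of awk. The sketch below uses only the sizes and the transfer rate quoted on the slide; the function name upload_estimate is hypothetical.

```shell
#!/bin/sh
# upload_estimate
# Hypothetical helper: reproduces the raw transfer time estimate from the
# file sizes and network rate quoted above.
upload_estimate() {
    awk 'BEGIN {
        payload_mb = 0.014 + 0.00095 + 0.000175   # chunk + agrep.pl + remote_process_send
        rate_mb_s  = 125                          # 1 Gbit/s = 125 MB/s
        chunks     = 10000000 / 40                # 250,000 chunks
        per_upload = payload_mb / rate_mb_s
        printf "per upload: %.6f s\n", per_upload
        printf "all chunks: about %.0f s\n", per_upload * chunks
    }'
}

upload_estimate
```

This counts raw transfer time only. Each scp call also has to set up an SSH connection, which is not modelled here.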

• Takes 0.51 seconds to run agrep.pl against the 40 sequence chunk on each slave node => NOT SIGNIFICANTLY FASTER THAN PROCESSING ON A SINGLE NODE (0.55 seconds)!
• Takes approximately 0.86 seconds on each slave to run agrep.pl against each chunk AND send the results file back to the master node (done by the ‘remote_process_send’ script) => LONGER THAN PROCESSING ON A SINGLE NODE!

All the nodes take almost the same time to process their chunk and send the results back to the master node.

• Any speed up is dependent on the size of the job
    • distributed computing is not advantageous when applied to small jobs (eg: processing dbs of 100 sequences)
    • distributed computing is most advantageous when applied to large jobs (eg: processing dbs of 100,000 sequences or more)
• Overheads for each node process contribute to the time taken for processing
• Any job that takes an hour or less to run on a single node doesn’t need distributed computing

Common riddle: “1 man digs 1 hole in 1 hour. How long will it take 10 men to dig 10 holes?”

• Each man starts at (approximately) the same time; all variables remaining constant, all of them should finish at the same time => 10 men will take 1 hour to dig 10 holes.
• Answer: 1 hour
• Applied to the problem at hand: 1 slave node processes 1 chunk and submits results to the master node in 0.86 seconds => 3 nodes will process 3 chunks and submit results to the master node in 0.86 seconds.

• Apply extrapolated timings from the example to a db of 10,000,000 sequences
• 10,000,000 split by 40 sequences = 250,000 chunks
• Instantiate 250,000 VMs on the Cloud => process 10,000,000 sequences in approximately 0.86 seconds (in theory), plus some overheads!
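The reasoning behind the riddle can be written as an idealized model: if every node handles its chunks one after another and all nodes work in parallel, the wall-clock time is the number of rounds times the time per chunk. The sketch below is an assumption-laden lower bound (identical nodes, no scheduling or network contention), and ideal_seconds is a hypothetical name.

```shell
#!/bin/sh
# ideal_seconds CHUNKS NODES SECONDS_PER_CHUNK
# Hypothetical helper: idealized wall-clock time when CHUNKS are shared
# evenly across NODES that all run in parallel.
ideal_seconds() {
    awk -v chunks="$1" -v nodes="$2" -v t="$3" 'BEGIN {
        rounds = int((chunks + nodes - 1) / nodes)   # ceil(chunks / nodes)
        printf "%.2f\n", rounds * t
    }'
}

ideal_seconds 3 3 0.86             # 3 chunks on 3 nodes: prints 0.86
ideal_seconds 250000 250000 0.86   # one VM per chunk: prints 0.86
```

Real runs add overheads on top of this figure, so measured or quoted times can be higher than the model suggests.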

For ONLY the compute portion, with 30 nodes, a 10,000,000 sequence db can be processed in 8192 seconds, or approximately 2.3 hours.

Total time taken with 30 nodes
= Time to split db into chunks (6,000 seconds ≈ 1.7 hours)
+ Time to upload executable, db chunk and script (30 seconds)
+ Time for nodes to process all chunks and send results back to master (2.3 hours)
= 1.7 hours + 30 seconds + 2.3 hours
≈ 4 hours

=> ≈ 4x speed up compared to running on a single node against a single 10,000,000 sequence db file
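The total and the speed up can be recomputed in seconds from the slides' own figures. In this sketch the split time is (10,000,000/100) x 0.06 = 6,000 seconds, the upload time is 30 seconds, the compute time for 30 nodes is the quoted 8192 seconds, and the single node time is (10,000,000/100) x 0.55 = 55,000 seconds. The function name total_estimate is hypothetical.

```shell
#!/bin/sh
# total_estimate
# Hypothetical helper: adds up the three stages for 30 nodes and compares
# the result with the single node estimate.
total_estimate() {
    awk 'BEGIN {
        split_s   = 10000000 / 100 * 0.06   # 6,000 s to split the db
        upload_s  = 30                      # upload of all chunks
        compute_s = 8192                    # quoted compute time on 30 nodes
        single_s  = 10000000 / 100 * 0.55   # 55,000 s on a single node
        total_s   = split_s + upload_s + compute_s
        printf "total: %.0f s (%.1f hours)\n", total_s, total_s / 3600
        printf "speed-up: %.1fx\n", single_s / total_s
    }'
}

total_estimate
```

Splitting the db accounts for roughly 40% of the total, so the serial pre-processing step limits the achievable speed up no matter how many nodes are added.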

• Most cases (real world situation)
    • network speeds vary
    • scalability is not linear
• Need to consider overheads
    • pre-processing time (writing sub programs to split files, etc)
    • network delays
    • processing power of the individual VMs
• Despite overheads, for large processing jobs, significant speed up is very likely.
• Nothing more than cluster computing on the cloud, BUT the cloud offers the ability to scale the number of machines in the cluster without hardware costs and without queues

Scripts and sample database are contained in a single tgz file and can be downloaded from:
ftp://sf01.bic.nus.edu.sg/incoming/bioslax/euasiagrid2010/euasiagrid_distcomp.tgz

Note:
• Bioperl must be installed (http://www.bioperl.org)
• Tre agrep must be installed (http://laurikari.net/tre/)
• Bioperl and Tre agrep are available as SLAX LZM packages:
    • ftp://sf01.bic.nus.edu.sg/incoming/bioslax/tre.lzm
    • ftp://sf01.bic.nus.edu.sg/incoming/bioslax/zz01b_perl-update.lzm
