NGS Bioinformatics Workshop 1.3 Tutorial - Sequence Alignment and Searching

NGS Bioinformatics Workshop1.3 Tutorial - Sequence Alignment and Searching March 22nd, 2012 IRMACS 10900 Facilitator: Richard Bruskiewich Adjunct Professor, MBB

Learning Objectives • A few more Linux tips • FASTA and BLAST on the web • Local BLAST • Local installation of BLAST • Making a blast database • Running a local blast query • Parsing search results using a script

First, a few more Linux tips… • “cat”, “more” or “less”: list contents of a (text) file • “head” or “tail”: display the start or end of a file • One can ‘redirect’ the contents of a program into or out of a file: • ‘<‘ for input (‘<<TAG‘ appends from keyboard) • ‘>’ for output (‘>>’ appends rather than overwrites) • TRY: cat >myfile.txt <<EOF This is my file! Another line EOF • “more” or “less”: like “cat” but controllable

More…. • “wget”: downloads an internet file • “which”: displays where a program is located on the system • “mkdir”: makes a directory • “cp”: copies files or directories • “mv”: moves a file or directory • “ln”: creates alias names/locations to files • ln –s sourcetarget# -s link “symbolic” • e.g. ln –s ncbi-blast-2.2.26+ ncbi • “export”: exposes an environment variable, e.g. • export BLAST=/usr/local/ncbi Environment variables provide general configuration and global contextual information that operating system scripts and computer programs can read if they need to know about such context.

File archives (and programs) • “.gz” files: gzip, gunzip archive commands • “.tar” files: tar command • Flags: • -x # extract • -f filename # file to extract • -v # verbose output • -z # uncompressgz file on the fly… • e.g. • tar –xvffile.tar • Tar –zxvf file.tar.gz • “.bz2”: bzip2 and bunzip2 archive commands • “.zip”: zip and unzip archive commands

http://www.ebi.ac.uk/Tools/sss/fasta/ FASTA Walkthrough Database: Knowledgebase (complete source for protein information, combines other databases), Swiss-Prot (only manually-curated proteins) Paste your sequence here!!!! Program: fasta (DNA or protein sequence against a like database) Matrix: scoring matrix used (first click More options… ) Results: interactive or email Don’t forget to push SUBMIT!

http://blast.ncbi.nlm.nih.gov

BLAST Walkthrough Note: the program used; nucleotide blast (blastn), protein blast (blastp), etc. is chosen at an earlier screen Paste your sequence here!!! Database: nr, swissprot, etc. Algorithm: blastp (protein blast) To configure blastp options, click Algorithm parameters.

BLAST Walkthrough Algorithm paramters Expect value: can be changed Matrix Filtering: Off by default, can turn on Don’t forget to push BLAST!

BLAST Results Protein domains in your sequence Graphical representation of the alignment results; each line represents an alignment; color indicates similarity List of the hits from the database you searched against (ID, name, E-value of top HSP) click on score to jump down to textual alignment Individual display for each alignment (HSP) http://www.youtube.com/watch?v=HXEpBnUbAMo&feature=related

You can also use a Script to directly do the BLAST’ing – e.g. Biopython Example 1: If you have a nucleotide sequence you want to search against the nucleotide database (nt) using BLASTN, and you know the GI number of your query sequence, you can use: from Bio.Blast import NCBIWWW result_handle= NCBIWWW.qblast("blastn", "nt", "8332116") Then, save the result… save_file= open("my_blast.xml", "w") # save as XML format save_file.write(result_handle.read()) save_file.close() result_handle.close()

More BLASTing… Example 2: Alternatively, if we have our query sequence already in a FASTA formatted file, we just need to open the file and read in this record as a string, and use that as the query argument: from Bio.Blast import NCBIWWW fasta_string= open("m_cold.fasta").read() result_handle= NCBIWWW.qblast("blastn", "nt", fasta_string) See http://biopython.org/DIST/docs/tutorial/Tutorial.html for more sample BLAST query code (or see the equivalent sections of other open-bio toolkits)

Locally installed BLAST - Advantages • Search many input sequences at once • Customizable databases • No need for internet access • Faster for small to medium-sized databases • Integrate BLAST searches into a larger, automated bioinformatics analysis • Can also run local BLAST through open-bio scripts (e.g. see http://biopython.org/DIST/docs/tutorial/Tutorial.html)

Parts of the standalone BLAST equation FASTA file containing sequences to build database to BLAST against (NCBI or your own file) BLAST Programs File downloaded from NCBI contains: blastp blastn blastx tblastx rpsblast makeblastdb etc… makeblastdb Your Query Sequence BLAST database Output

Let’s try this ourselves…. • We will: • Obtain and install the BLAST executables (Linux) • Set up a BLAST database • Copy an archive • Use ‘makeblastdb’ to create a novel database to search against • Use the ‘blastp’ program to carry out a BLAST analysis over the command-line • Output your BLAST results into a more flexible format • Use a small BioPython script to parse the output

Step 1 – Installing BLAST tools • Go to http://blast.ncbi.nlm.nih.gov and follow links to latest release and follow instructions for favorite operating system: • Windows 32/64: .exe installer • Linux 32/64: compiled binaries (RPM or tar.gz) • Other Unix: compiled binaries (in tar.gz) • Apply platform-specific configuration details for your operating system • Read the good documentation: http://www.ncbi.nlm.nih.gov/books/NBK1763/

Installing from Linux .tar.gz archive wgetftp://ftp.ncbi.nlm.nih.gov/blast/executables/Ã blast+/LATEST/ncbi-blast-2.2.26+-x64-linux.tar.gz tar -zxvfncbi-blast-2.2.26+-x64-linux.tar.gz sudo mv ncbi-blast-2.2.26+ /usr/local cd /usr/local/ sudoln -s ncbi-blast-2.2.26+ ncbi export BLAST=/usr/local/ncbi export PATH=$BLAST/bin:$PATH

Step 2 – Getting a BLAST database • Option A: • Pre-formatted databases: download archives from NCBI: ftp://ftp.ncbi.nlm.nih.gov/blast/db/ • Option B: • “Roll your own”: construct de novo from a file of custom FASTA sequences using makeblastdb

Option A: Obtain a existing NCBI database... (Linux) sudomkdir $BLAST/db wgetftp://ftp.ncbi.nlm.nih.gov/blast/db/swissprot.tar.gz tar -zxvfswissprot.tar.gz $BLAST/db cd

Option B: Create a simple BLAST database from a local FASTA sequence file The makeblastdb application produces BLAST databases from FASTA files. In the simplest case the FASTA definition lines are not parsed by makeblastdb and may be completely unstructured (but can only be BLAST’ed and not be directly retrieved) makeblastdb-in mydb.fsa-dbtypenucl creates a BLAST database from a nucleotide FASTA sequences which can be put into the “db” directory for searching. Of course, like all blast programs, there are a rich set of parameters which can be used to customize the generation of the database (see the BLAST manual).

With either option, still need some configuration details cd # back to home directory # need to point to the database… cat >.ncbirc <<EOF [BLAST] BLASTDB=/usr/local/ncbi/db EOF

Step 3 – Executing a BLAST operation • Command line programs (only) but parameters are generally equivalent to (or a superset of) the NCBI web BLAST application • Sample run: • Retrieve a sequence from the database: blastcmd –dbswissprot–entry Q9MAH0 –outform "%f" –out test_query.txt • Blast it back against the same database: blastp–query text_query.txt –dbswissprot–out output.txt –outfmt 5

Different BLAST programs • blastn – search nucleotide database using a nucleotide query • blastp – search protein database using a protein query • blastx – search protein database using a translated nucleotide query • tblastn – search translated nucleotide database using a protein query • tblastx – search translated nucleotide database using a translated nucleotide query • psiblast – Position-Specific Iterated BLAST • rpsblast – Reversed Position Specific BLAST • See BLAST documentation for related utility programs

Command line parameters: statistics • -evalue: expect value, normally set to 10 • -word_size: “k-tuple” size; increase for speed, decrease for sensitivity • -gapopen: cost to open a gap; increase for stringency • -gapextend: cost to extend a gap; increase for stringency • -matrix: substitution scoring matrix (default BLOSUM62); change if sequences too related or too distant To get more information use option “-help”

Command line parameters: input/output • -query in.txt: specify input file • -out out.txt: specify output file • -db nr: which database (created with makeblastdb) • -dust yes/no: filter low complexity regions in nucleotide sequence search yes/no (default is yes) • -seg yes/no: filter low complexity regions in protein sequence search yes/no (default is no) • -html: format output as HTML • -outfmt: specify output format, e.g. 5 = XML blast output (use –help flag to see other options)

Additional useful program options • Depending on program: • -num_threads: use multiple CPUs (speeds up search) • -subject: specify a second input sequence instead of a database (former ‘bl2seq’) • -task megablast: much faster for highly similar nucleotide sequences • -task blastn_short: find similar short sequences (e.g. primer sequences)

Step 4 – Parse the output • If you just have one query sequence, simply view the BLAST text file • If you are doing a lot of queries on the database and looking for “best hits”, you may wish to use a parsing script (e.g. biopython or equivalent)

Parsing BLAST output with Biopython • Good to use the BLAST XML format for this… result_handle= open("my_blast.xml") • Now that we’ve got a handle, we are ready to parse the output. The code to parse it is really quite small. • If you expect a single BLAST result (i.e. you used a single query): from Bio.Blast import NCBIXML blast_record= NCBIXML.read(result_handle)

More parsing… • or, if you have lots of results (i.e. multiple query sequences): from Bio.Blast import NCBIXML blast_records= NCBIXML.parse(result_handle) for blast_record in blast_records: ... # Do something with blast_record

What’s in a BLAST record? E_VALUE_THRESH = 0.04 for alignment in blast_record.alignments: for hsp in alignment.hsps: if hsp.expect < E_VALUE_THRESH: print '****Alignment****‘ print 'sequence:', alignment.title print 'length:', alignment.length print 'e value:', hsp.expect print hsp.query[0:75] + '...‘ print hsp.match[0:75] + '...‘ print hsp.sbjct[0:75] + '...'

Gives output something like this… ****Alignment**** sequence: >gb|AF283004.1|AF283004 Arabidopsis thaliana cold acclimation protein WCOR413-like protein alpha form mRNA, complete cds length: 783 e value: 0.034 tacttgttgatattggatcgaacaaactggagaaccaacatgctcacgtcacttttagtcccttacatattcctc... ||||||||| | ||||||||||| || |||| || || |||||||| |||||| | | |||||||| | ||| ||... tacttgttggtgttggatcgaaccaattggaagacgaatatgctcacatcacttctcattccttacatcttcttc... Again, see http://biopython.org/DIST/docs/tutorial/Tutorial.html for more details...

NGS Bioinformatics Workshop 1.3 Tutorial - Sequence Alignment and Searching

NGS Bioinformatics Workshop 1.3 Tutorial - Sequence Alignment and Searching

Presentation Transcript

Lecture 4 Sequence alignment and searching

Techniques for Protein Sequence Alignment and Database Searching

Sequence Alignment Tutorial #3

Sequence Analysis, Pair Wise Alignment, and Database Searching

Sequence Alignment and Database Searching

Sequence Alignment Tutorial #2

COT 6930 HPC and Bioinformatics Multiple Sequence Alignment

Bioinformatics Tutorial I BLAST and Sequence Alignment

NGS Bioinformatics Workshop 1.2 Tutorial – Sequence Formats, Databases and Visualization Tools

NGS Bioinformatics Workshop 1.5 Tutorial – Genome Annotation

Sequence Alignment Tutorial

Sequence Alignment and Approaches to Database Searching

Sequence Alignment and Database Searching

Sequence alignment is central to bioinformatics!

COT 6930 HPC and Bioinformatics Sequence Alignment

NGS Bioinformatics Workshop 1.4 Tutorial - Comparative Sequence Analysis and Visualization

Sequence alignments and sequence database searching Igor Kuznetsov Bioinformatics workshop Part I

NGS Bioinformatics Workshop 2.1 Next Generation Sequencing and Sequence Assembly Algorithms

NGS Bioinformatics Workshop 1.3 Sequence Alignment and Searching

NGS Bioinformatics Workshop 1.5 Genome Annotation

CAP5510 – Bioinformatics Multiple Sequence Alignment

Pairwise Sequence Alignment and Database Searching