Blast
Download
1 / 30

Blast - PowerPoint PPT Presentation


  • 137 Views
  • Uploaded on

Blast. Review. Basic Local Alignment Search Tool Initially developed and used to search genomic sequence Ex) I have a short piece of mRNA/EST that I sequenced -- and I'd like to see where it is in a genome Especially useful before genome browsers

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Blast' - china


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Review
Review

  • Basic Local Alignment Search Tool

  • Initially developed and used to search genomic sequence

    • Ex) I have a short piece of mRNA/EST that I sequenced -- and I'd like to see where it is in a genome

    • Especially useful before genome browsers

    • Before browsers and annotation, used to "check" if the region that contained my gene of interest had been sequenced yet

    • Still very useful

  • Initial application

    • query sequence

    • target database


Blast output blast of human promoter vs mouse promoter in bmp4
Blast output: Blast of Human Promoter VS Mouse Promoter in BMP4

Query= range=chr14:52413865-52418867 5'pad=0 3'pad=0 revComp=FALSE

strand=? repeatMasking=none

(5003 letters)

>mm3_dna range=chr14:37718906-37723909 5'pad=0 3'pad=0 revComp=FALSE

strand=? repeatMasking=none

Length = 5004

Score = 387 bits (195), Expect = e-110

Identities = 258/278 (92%), Gaps = 2/278 (0%)

Strand = Plus / Plus

Query: 1 catgcaccgactagtcgccgtaccttccaaaaatacccatgggagtctgggcttccctga 60

|||||||| ||||||| | |||||||||||||||||||||||||||||||||||||||||

Sbjct: 3285 catgcaccaactagtctctgtaccttccaaaaatacccatgggagtctgggcttccctga 3344

Query: 61 gtttagtgtaactgctgcccaaactgatgattaaaacactgagtaatcaccatttaccct 120

|||||| |||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 3345 gtttagaataactgctgcccaaactgatgattaaaacactgagtaatcaccatttacccc 3404

Query: 121 gagtaattagagataatgaaacacctctaaaatcaggttatggagaaaggagctgttgga 180

||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||

Sbjct: 3405 gagtaattagagataatgaaacacctctaaaatcaggttatggagaagggagctgttgga 3464

Query: 181 tcggtttaaaaggtggccttccataaactgttaaaagatttccaaacgttgagaaacagg 240

| |||||||| ||||||||||| |||||||||||| |||| ||||||||| ||||||||

Sbjct: 3465 ttggtttaaagggtggccttccgtaaactgttaaaggattcccaaacgtt--gaaacagg 3522

Query: 241 ctgtgtgcagaactgtgtgccagagagagatacctctt 278

||||| |||| ||||||||| || ||||||||||||

Sbjct: 3523 ctgtgagcagggctgtgtgcccgaaggagatacctctt 3560


Review1
Review BMP4

  • Highly refined and specialized

    • nucleotide-nucleotide (blastn), protein-protein (blastp), translated nucleotide against protein (blastx), protein against translated nucleotide (tblastn), translated nucleotide against translated nucleotide (tblastx)

  • Variations

    • bl2seq

    • Wu-BLAST


Blast algorithms
Blast Algorithms BMP4

  • Remove regions of low complexity

  • Make k-letter words (k-mer) of query sequence (a hash?) Default for blastn is w=11

  • Compare k-mers in query sequence with pre-computed k-mers in target sequence

  • Note, the k-mers in the target sequence ( database) are pre-computed --> significant speed up


How does blast work
How does BLAST work? BMP4

An alignment begins with the query sequence converted into "words" (default for blastn is w=11)

ATGCCCTGACGCAATGACCTTAAGGAT

ATGCCCTGCCG word1

TGCCCTGCCGC word2

GCCCTGCCGCA etc

CCCTGACGCAA

CCTGACGCAAT

:

:


Hash based alignment
Hash-Based Alignment BMP4

base-10 numbers

5805 = 5*10^3 + 8*10^2 + 0*10^1+5*10^0

k=8

ATGCCTGGGCT

A=0, C=1, G=2, and T=3 (base 4 number)

ATGCCTGG = 0*4^7+3*4^6+2*4^5+1*4^4+1*4^3+3*4^2+2*4^1+2*4^0 = 14714

Now we can compare “chunks” of sequence much faster - speed increase by factor of 8

Can pre-compute hashes for entire genome, and only compare hashes

Premise for popular alignment tools – BLAST, BLAT,and UIcluster


Blast

Appendix 1: The Files Produced by Formatdb (From formatdb.html)

------------------------------------------

Using formatdb without the "-o T" indexing option results in three

BLAST database files (.nhr, .nin, ,nsq). Using the "-o T" option will result

in additional files.

If gi's are present in the FASTA definition lines of the source file, there

will be four additional files created (.nsd, nsi, nni, nnd). These are

ISAM indices for mapping a sequence identifier to a particular sequence in

the BLAST database. If gi's are not use there will be only two additional

files created (.nsd, .nsi).

These files are listed below for both an example nucleotide and a protein

sequence database. The actual sequence data is stored in the files with

extension "nsq" or "psq". The compression ratio for these files is

about 4:1 for nucleotides and 1:1 for protein sequence source files.

Extension Content Format

---------------------------------------------

Nucleotide database formatted without "-o T"

nhr deflines binary

nin indices binary

nsq sequence data binary


Blast works
Blast Works formatdb.html)

After a word alignment between query, and target, blast attempts to extend the alignment in both directions.

The alignment continues which contributes to the alignment "score." Once the score falls below a threshold, the extension halts.

The resulting alignment is called an HSP, or high scoring pair.


Blast example
BLAST Example formatdb.html)

  • Processing of BLAST output is not particularly special in terms of Bioinformatics Techniques

  • It is useful for comparing sequences and identifying "similarity"

    • similarity – a measure of how many bases/amino acids are shared between 2 sequences

    • At the 2 extremes:

      • 100 "percent identity" means 2 sequences are identical

      • 0 "percent identity" means 2 sequences share no residues

      • note that 2 random nucleotide sequences, without gaps will have 25% identity

      • 2 sequences may be 80% identical at the aa level, but only 55% at the nt level

      • similarity does not describe length

        • 1 pair is 98% identical for 8 nucleotides

        • 1 pair is 75% identical for 100 nucleotides

        • Which pair of sequences is significant and more closely related?

        • Measures exist that attempt to quantify these issues

        • Not covered (Statistical Methods for Bioinformatics)


Blast

BLASTN 2.2.6 [Apr-09-2003] formatdb.html)

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,

Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),

"Gapped BLAST and PSI-BLAST: a new generation of protein database search

programs", Nucleic Acids Res. 25:3389-3402.

Query=

(500 letters)

Database: blastdb/yeast.nt

17 sequences; 12,155,026 total letters

Searching.done

Score E

Sequences producing significant alignments: (bits) Value

gi|6324971|ref|NC_001148.1| Saccharomyces cerevisiae chromosome ... 32 1.2

gi|6322623|ref|NC_001143.1| Saccharomyces cerevisiae chromosome ... 32 1.2

gi|7839148|ref|NC_001136.2| Saccharomyces cerevisiae chromosome ... 32 1.2

gi|6323989|ref|NC_001146.1| Saccharomyces cerevisiae chromosome ... 30 4.7

gi|6323501|ref|NC_001145.1| Saccharomyces cerevisiae chromosome ... 30 4.7

gi|7276232|ref|NC_001137.2| Saccharomyces cerevisiae chromosome ... 30 4.7

>gi|6324971|ref|NC_001148.1| Saccharomyces cerevisiae chromosome XVI,

complete chromosome sequence

Length = 948061

Score = 32.2 bits (16), Expect = 1.2

Identities = 16/16 (100%)

Strand = Plus / Minus

Query: 1 tctggcgtgacaccag 16

||||||||||||||||

Sbjct: 511257 tctggcgtgacaccag 511242


Query subject hsp high scoring pair
Query, Subject, formatdb.html)HSP (high scoring pair)

>gi|6226515|ref|NC_001224.1| Saccharomyces cerevisiae mitochondrion,

complete genome

Length = 85779

Score = 40.1 bits (20), Expect = 0.050

Identities = 20/20 (100%)

Strand = Plus / Plus

Query: 1562 ttaattattaaaaataataa 1581

||||||||||||||||||||

Sbjct: 74540 ttaattattaaaaataataa 74559

Score = 40.1 bits (20), Expect = 0.050

Identities = 20/20 (100%)

Strand = Plus / Plus

Query: 1562 ttaattattaaaaataataa 1581

||||||||||||||||||||

Sbjct: 2168 ttaattattaaaaataataa 2187


Blast1
BLAST formatdb.html)

www.ncbi.nih.gov (web interface)

Highlight parameters

output in html/text

word size

number of results

quality of hit

>hg16_dna range=chr14:52413865-52418867 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=none

CATGCACCGACTAGTCGCCGTACCTTCCAAAAATACCCATGGGAGTCTGG

GCTTCCCTGAGTTTAGTGTAACTGCTGCCCAAACTGATGATTAAAACACT

GAGTAATCACCATTTACCCTGAGTAATTAGAGATAATGAAACACCTCTAA

AATCAGGTTATGGAGAAAGGAGCTGTTGGATCGGTTTAAAAGGTGGCCTT

CCATAAACTGTTAAAAGATTTCCAAACGTTGAGAAACAGGCTGTGTGCAG

AACTGTGTGCCAGAGAGAGATACCTCTTGGGCTGTTAAAAGGTGAATTTC

CTTCCAAAACATTCTTAAAGTGCGTTTTGTAGAACACCTTGGTGATGACC

CTGAGGTAGACCCCAGTAAATGAGCTCACTTCCCTCCTCCCCCCAAGGCC

TAAAGACAGAGGTGGCCTGCAGACAGGCTGGGGCCACGTTTTTTACAGTT

AGAAAAATCTGTTTCTAGCTCAGAAGCCGTCTTAAGAACCGACACAGCAG

TTATTTTTGTTGGTGTTATGGTTGTTGTTGATTTTTTTGTTCATTTTTGG

TGTACCTAAGAGTCCTGACACTCTGTCTTTCCAAGGAGATAACCACAGAA

  • application

  • database

  • sequence


Blast

Setting up Standalone BLAST for UNIX: formatdb.html)

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Basically, there are three steps needed to setup the Standalone BLAST

executable for the UNIX platform.

1) Download the UNIX binary, uncompress and untar the file. It is

suggested that you do this in a separate directory, perhaps called

"blast".

2) Create a .ncbirc file. In order for Standalone BLAST to operate, you

have will need to have a .ncbirc file that contains the following lines:

[NCBI]

Data="path/data/"

Where "path/data/" is the path to the location of the Standalone BLAST

"data" subdirectory. For Example:

Data=/root/blast/data

The data subdirectory should automatically appear in the directory where

the downloaded file was extracted. Please note that in many cases it may

be necessary to delimit the entire path including the machine name and

or the net work you are located on. Your systems administrator can help

you if you do not know the entire path to the data subdirectory.

Make sure that your .ncbirc file is either in the directory that you

call the Standalone BLAST program from or in your root directory.

3) Format your BLAST database files. The main advantage of Standalone

BLAST is to be able to create your own BLAST databases. This can be done

with any file of FASTA formatted protein or nucleotide sequences. If you

are interested in creating your own database files you should refer to

the sections "Non-redundant defline syntax" and "Appendix 1: Sequence

Identifier Syntax" of the README in the BLAST database directory

(ftp://ftp.ncbi.nih.gov/blast/db/). You can also refer to the FASTA

description available from the BLAST search pages

(http://www.ncbi.nlm.nih.gov/BLAST/fasta.html).


Installing blast on old css itaniums
Installing BLAST on "old" CSS Itaniums formatdb.html)

http://www.ncbi.nih.gov/Ftp/index.html

Because I want to install on a remote machine, I'll use ftp – a program that stands for 'file transfer protocol'

Allows me to access files by command line

Another version is ncftp – with more features

file completion

download stats

ssh login@login-pa.engineering.uiowa.edu (this will log you into an itanium that is still supported by NCBI).

mkdir ~/blast

cd blast

ncftp ftp.ncbi.nih.gov (or ftp: login ftp, passwd e-mail@youraddress.edu)

bin

cd blast/executables/LATEST

get blast-2.2.10-ia32.linux.tar (this should work on an itanium)

## These are out of date: get blast-2.2.6-ia32-linux.tar.gz (linux) (or blast-2.2.6-hppa-hpux.tar.gz) ~ 15 meg

cd ../../../..

cd blast/db/FASTA/

get yeast.nt.gz (~ 3.5 meg)

quit

gunzip blast-2.2.10-ia32.linux.tar.gz

tar –xvf blast*

[for HPUX: cd blast-2.2.6]

[for HPUX, : mv ../yeast.nt.gz .]

gunzip yeast.nt.gz

mkdir blastdb

mv yeast.nt blastdb

./formatdb -i blastdb/yeast.nt -p F


Linux example
Linux Example formatdb.html)

  • Download binary (for CSS machines – "x64"):

  • ftp://ftp.ncbi.nih.gov/blast/executables/release

  • mkdir ~/blast

  • mv blast-2.2.18-x64-linux.tar.gz ~/blast

  • gunzip blast-2.2.18-x64-linux.tar.gz

  • tar –xvf blast-2.2.18-x64-linux.tar.gz


Linux example1
Linux Example formatdb.html)

  • Download database(s) of choice

  • ftp://ftp.ncbi.nih.gov/blast/db/

  • ftp://ftp.ncbi.nih.gov/blast/db/nt.00.tar.gz

    (nt.00.tar.gz-nt.05.tar.gz)

    Note – these are too LARGE for your CSS account

  • mkdir ~/blast/db

  • chmod -R 755 db/

  • mv nt.*.tar.gz ~/blast/db


Create our own db
Create Our Own DB formatdb.html)

  • Download chrX: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_X/hs_ref_chrX.fa.gz

  • mv ~/Deskstop/hs_ref_chrX.fa.gz ~/blast/db

  • gunzip hs_ref_chrX.fa.gz

  • ../blast-2.2.18/bin/formatdb -i hs_ref_chrX.fa -p F -o T -n hs_ref_chrX.fa

  • This will create approx 8 files.


Install blast locally
Install BLAST locally formatdb.html)

  • Have all we need to run locally

  • Note, there are further environment settings to make it easier to run from any location on the system – that is in the readme

  • Need a sequence in a file to blast against database


Install blast locally1
Install Blast Locally formatdb.html)

Create file called dmd.hs (in ~/blast) with the following sequence:

>gi|5032280|ref|NM_000109.2| Homo sapiens dystrophin (DMD), transcript Dp427c, mRNA

CGTTAAATGCAAACGCTGCTCTGGCTCATGTGTTTGCTCCGAGGTATAGGTTTTGTTCGACTGACGTATC

AGATAGTCAGAGTGGTTACCACACCGACGTTGTAGCAGCTGCATAATAAATGACTGAAAGAATCATGTTA

GGCATGCCCACCTAACCTAACTTGAATCATGCGAAAGGGGAGCTGTTGGAATTCAAATAGACTTTCTGGT

TCCCAGCAGTCGGCAGTAATAGAATGCTTTCAGGAAGATGACAGAATCAGGAGAAAGATGCTGTTTTGCA

CTATCTTGATTTGTTACAGCAGCCAACTTATTGGCATGATGGAGTGACAGGAAAAACAGCTGGCATGGAA

cd ~/blast

./blast-2.2.18/bin/blastall -p blastn -i dmd.hs -d db/hs_ref_chrX.fa -o dmd.hs.out

./blast-2.2.18/bin/blastall -p blastn -i dmd.hs -d db/hs_ref_chrX.fa -o dmd.hs.out –e 0.00001


Blast

BLASTN 2.2.15 [Oct-15-2006] formatdb.html)

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,

Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),

"Gapped BLAST and PSI-BLAST: a new generation of protein database search

programs", Nucleic Acids Res. 25:3389-3402.

Query= gi|5032280|ref|NM_000109.2| Homo sapiens dystrophin (DMD),

transcript Dp427c, mRNA

(350 letters)

Database: hs_ref_chrX.fa

22 sequences; 152,577,922 total letters

Searching..................................................done

Score E

Sequences producing significant alignments: (bits) Value

ref|NT_011757.15|HsX_11914 Homo sapiens chromosome X genomic con... 694 0.0

ref|NT_011638.12|HsX_11795 Homo sapiens chromosome X genomic con... 40 0.042

ref|NT_011681.15|HsX_11838 Homo sapiens chromosome X genomic con... 38 0.17

ref|NT_079573.3|HsX_79638 Homo sapiens chromosome X genomic cont... 38 0.17

ref|NT_011669.16|HsX_11826 Homo sapiens chromosome X genomic con... 36 0.65

ref|NT_011651.16|HsX_11808 Homo sapiens chromosome X genomic con... 36 0.65

ref|NT_011630.14|HsX_11787 Homo sapiens chromosome X genomic con... 34 2.6

ref|NT_113965.1|HsX_111684 Homo sapiens chromosome X genomic con... 34 2.6

ref|NT_011786.15|HsX_11943 Homo sapiens chromosome X genomic con... 34 2.6


Blast

>ref|NT_011757.15|HsX_11914 Homo sapiens chromosome X genomic contig, reference assembly

Length = 34879939

Score = 694 bits (350), Expect = 0.0

Identities = 350/350 (100%)

Strand = Plus / Minus

Query: 1 cgttaaatgcaaacgctgctctggctcatgtgtttgctccgaggtataggttttgttcga 60

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 31139409 cgttaaatgcaaacgctgctctggctcatgtgtttgctccgaggtataggttttgttcga 31139350

Query: 61 ctgacgtatcagatagtcagagtggttaccacaccgacgttgtagcagctgcataataaa 120

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 31139349 ctgacgtatcagatagtcagagtggttaccacaccgacgttgtagcagctgcataataaa 31139290

ETC…


What next
What Next? genomic contig, reference assembly

  • We could use BLAST to do something useful

  • Take 400,000 sequence reads, and use blast to identify locations of read sequences in genome

    • BLAST each sequence individually

    • Cluster sequences

    • Generate report

  • How would we do all of that?


  • Blast
    Perl genomic contig, reference assembly

    • A program to read thru a big sequence file – and strip out each individual read

    • Use Perl as a "wrapper" – to run Blast

    • Parse results (or possibly use a PM to parse results)

    • Process, aggregate, summarize, compare to control sequence – and report generation


    Alternatively blat
    Alternatively …BLAT genomic contig, reference assembly

    • BLAST-Like Alignment Tool

    • http://genome.ucsc.edu/FAQ/FAQblat


    From the faq
    From the FAQ genomic contig, reference assembly

    http://genome.ucsc.edu/

    From a practical standpoint, Blat has several advantages over BLAST:

    • speed (no queues, response in seconds) at the price of lesser homology depth

    • the ability to submit a long list of simultaneous queries in fasta format

    • five convenient output sort options

    • a direct link into the UCSC browser

    • alignment block details in natural genomic order

    • an option to launch the alignment later as part of a custom track


    Blast suite
    Blast Suite genomic contig, reference assembly

    http://genome.ucsc.edu/goldenPath/help/blatSpec.html

    Blat produces two major classes of alignments: at the DNA level between two sequences that are of 95% or greater identity, but which may include large inserts, and at the protein or translated DNA level between sequences that are of 80% or greater identity and may also include large inserts.

    The main programs in the blat suite are:

    * gfServer – a server that maintains an index of the genome in memory and uses the index to quickly find regions with high levels of sequence similarity to a query sequence.

    * gfClient – a program that queries gfServer over the network, and then does a detailed alignment of the query sequence with regions found by gfServer.

    * blat –combines client and server into a single program, first building the index, then using the index, and then exiting.

    * webBlat – a web based version of gfClient that presents the alignments in an interactive fashion.


    Example
    Example genomic contig, reference assembly

    • Download binaries (or source files and compile)

    • http://genome-test.cse.ucsc.edu/~kent/exe/

    • (for CSS – use the Linux version)

      blat database query output.psl –out=blast

    • database and query files may be either: .fa , .nib or .2bit file


    System
    system genomic contig, reference assembly

    system LIST

    -- executes any program on the system

    #!/usr/bin/perl

    system("/mnt/r0-blastdb/blast-bin/blastall –p blastn –d /mnt/r0-blastdb/FormattedDBs/nt – i test.txt –o test.out");

    # this assumes test.txt and test.out are in same directory as your program

    .


    Back ticks
    back ticks genomic contig, reference assembly

    #!/usr/bin/perl

    $output = `/mnt/r0-blastdb/blast-bin/blastall –p blastn –d /mnt/r0-blastdb/FormattedDBs/nt – i test.txt –o test.out`;

    print "$output\n";

    # command is passed on, and interpreted by the shell

    # output of command returned