Sequence Alignment and Approaches to Database Searching

Jessica Kissinger 2001

Why do we align sequences? • To discover functional, structural and evolutionary similarities • Because “similarity” may be an indicator of “homology” and thus provide some insight into function or gene identification.

Origins of “similar” sequences • Gene duplication • Speciation • Gene conversion • Horizontal gene transfer [Figure: gene A duplicates into A1 and A2; after gene duplication and speciation, A1 and A2 are present in both Species A and Species B]

The various algorithms • Dynamic programming algorithms provide a rigorous mathematical approach to sequence alignment. They are guaranteed to find the best alignment for a given scoring matrix and gap penalty. • Local alignments, as opposed to global alignments, are better for database (DB) searching and for finding similar domains
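The local dynamic programming approach above can be sketched as follows. This is a minimal illustration of the Smith-Waterman style recurrence, assuming a simple match/mismatch score and a linear gap penalty instead of a real scoring matrix; the function name and default values are hypothetical and chosen only for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Assumed scoring: +match / mismatch per aligned pair, gap per gapped position.
    Only the score is computed (no traceback), using two rows of the DP table.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere, which makes it local.
            curr[j] = max(0,
                          prev[j - 1] + s,    # align a[i-1] with b[j-1]
                          prev[j] + gap,      # gap in b
                          curr[j - 1] + gap)  # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTACGTTT", "GGACGGG"))
```

Because every cell is compared against zero, a shared domain inside two otherwise unrelated sequences still scores well, which is why local alignment suits database searching.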

Scoring Matrices are designed to detect signal above background, to detect similarities beyond what would be observed by chance alone

Why do we need these matrices? • Database searching • Need different levels of sensitivity • Close relationships (low PAM, high BLOSUM) • Distant relationships (high PAM, low BLOSUM)

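Dot Plot Nuts & Bolts
Dot Plot: Word Size = 1 (* = match, . = no match)
    g c t g g a a g g c a t
g   * . . * * . . * * . . .
c   . * . . . . . . . * . .
a   . . . . . * * . . . * .
g   * . . * * . . * * . . .
a   . . . . . * * . . . * .
g   * . . * * . . * * . . .
c   . * . . . . . . . * . .
a   . . . . . * * . . . * .
c   . * . . . . . . . * . .
t   . . * . . . . . . . . *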

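Dot Plot Nuts & Bolts
Dot Plot: Word Size = 2 (* = match, . = no match)
    g c t g g a a g g c a t
g   * . . . . . . . * . . .
c   . . . . . . . . . * . .
a   . . . . . . * . . . . .
g   . . . . * . . . . . . .
a   . . . . . . * . . . . .
g   * . . . . . . . * . . .
c   . . . . . . . . . * . .
a   . . . . . . . . . . . .
c   . * . . . . . . . . . .
t   . . . . . . . . . . . .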

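Dot Plot Nuts & Bolts
Dot Plot: Word Size = 3 (* = match, . = no match)
    g c t g g a a g g c a t
g   . . . . . . . . * . . .
c   . . . . . . . . . . . .
a   . . . . . . . . . . . .
g   . . . . . . . . . . . .
a   . . . . . . . . . . . .
g   . . . . . . . . * . . .
c   . . . . . . . . . . . .
a   . . . . . . . . . . . .
c   . . . . . . . . . . . .
t   . . . . . . . . . . . .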
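The three dot plots above can be reproduced with a short sketch like the one below. It assumes a dot is placed at the first position of each exactly matching word; the function name and the "." filler for empty cells are illustrative choices, not part of the original slides.

```python
def dot_plot(top, side, word=1):
    """Return one text row per position in `side`.

    A '*' at row i, column j means the word starting at side[i]
    exactly matches the word starting at top[j].
    """
    rows = []
    for i in range(len(side)):
        w_side = side[i:i + word]
        row = []
        for j in range(len(top)):
            w_top = top[j:j + word]
            # Words that run off the end of either sequence never match.
            hit = len(w_side) == word and w_side == w_top
            row.append("*" if hit else ".")
        rows.append(" ".join(row))
    return rows


if __name__ == "__main__":
    top, side = "gctggaaggcat", "gcagagcact"
    for w in (1, 2, 3):
        print(f"word size = {w}")
        for base, row in zip(side, dot_plot(top, side, w)):
            print(base, " ", row)
```

Raising the word size removes chance single-letter matches, so only the longer shared runs remain visible.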

Plasmodium falciparum circumsporozoite protein MMRKLAILSVSSFLFVEALFQEYQCYGSSSNTRVLNELNYDNAGTNLYNELEMNYYGKQENWYSLKKNSRSLGENDDGNN NNGDNGREGKDEDKRDGNNEDNEKLRKPKHKKLKQPGDGNPDPNANPNVDPNANPNVDPNANPNVDPNANPNANPNANPN ANPNANPNANPNANPNANPNANPNANPNANPNANPNANPNANPNANPNVDPNANPNANPNANPNANPNANPNANPNANPN ANPNANPNANPNANPNANPNANPNANPNANPNANPNANPNANPNANPNKNNQGNGQGHNMPNDPNRNVDENANANNAVKN NNNEEPSDKHIEQYLKKIKNSISTEWSPCSVTCGNGIQVRIKPGSANKPKDELDYENDIEKKICKMEKCSSVFNVVNSSI GLIMVLSFLFLN Plasmodium vivax circumsporozoite protein MKNFILLAVSSILLVDLFPTHCGHNVDLSKAINLNGVNFNNVDASSLGAAHVGQSASRGRGLGENPDDEEGDAKKKKDGK KAEPKNPRENKLKQPGDRADGQPAGDRADGQPAGDRADGQPAGDRAAGQPAGDRADGQPAGDRADGQPAGDRADGQPAGD RADGQPAGDRAAGQPAGDRAAGQPAGDRADGQPAGDRAAGQPAGDRADGQPAGDRAAGQPAGDRADGQPAGDRAAGQPAG DRAAGQPAGDRAAGQPAGDRAAGQPAGNGAGGQAAGGNAGGGQGQNNEGANAPNEKSVKEYLDKVRATVGTEWTPCSVTC GVGVRVRRRVNAANKKPEDLTLNDLETDVCTMDKCAGIFNVVSNSLGLVILLVLALFN

[Figure: dot plot of Plasmodium falciparum CS protein vs. Plasmodium vivax CS protein, window = 2]

[Figure: dot plot of Plasmodium falciparum CS protein vs. Plasmodium vivax CS protein, window = 7]

Database Searching • Database Searching ≠ Sequence alignment • Database searching is the application of knowledge gained from previous experiments to the problem of gene discovery • Similarity ≠ Homology

Database Searching • The Assumptions • The sequences being sought share an evolutionary ancestral sequence with the query sequence • The best guess at the actual path of evolution is the path that requires the fewest evolutionary events (most parsimonious) • Not all substitutions are equally likely, and they should be weighted accordingly • Insertions and deletions are less likely than substitutions and should be weighted accordingly

Database Searching • Applied Considerations • The choice of search algorithm influences the sensitivity and selectivity of the search • The choice of matrix determines both the pattern and the extent of substitution in the sequences the database search is most likely to discover

Protein vs Nucleotide • Which molecules should you search with? • Which databases should you search, nucleotide or protein?

Why can’t we just look at the DNA sequence for the protein? • It was once thought that we might be able to calculate a minimum mutation matrix, i.e. one in which the minimum number of steps needed to change from one aa to another is counted. The problem is that, because of the degeneracy of the genetic code, likely and unlikely mutations would often receive the same score
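A small sketch can make the problem concrete. It assumes the standard genetic code (written in the usual T, C, A, G codon order) and counts the minimum number of nucleotide changes between any codon of one amino acid and any codon of another; the function and variable names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, []).append("".join(codon))


def min_mutations(aa1, aa2):
    """Minimum number of single-base changes needed to turn aa1 into aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS[aa1]
        for c2 in CODONS[aa2]
    )


if __name__ == "__main__":
    for pair in (("D", "E"), ("W", "C"), ("M", "Y")):
        print(pair, min_mutations(*pair))
```

Under this count, Asp to Glu and Trp to Cys are both one step apart, yet the PAM 120 matrix in this deck scores D/E at +3 and W/C at -8, which is the kind of distinction a minimum mutation matrix cannot express.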

BLAST • BLAST is less sensitive than SW • Basic BLAST uses a word size of 3 for proteins and is more sensitive than FASTA (even though FASTA uses a word of size 2) • Basic BLAST uses a word size of 11 or 12 for nucleic acid sequences • The Heuristic is applied to the words in BLAST via a “threshold value, T” for alignments of words.

Basic BLAST Algorithms • BLASTN - compares a nucleotide query to a nucleotide database • BLASTP - compares a protein query to a protein database • BLASTX - compares a nucleotide query sequence translated in all reading frames against a protein sequence database • TBLASTN - compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames. • TBLASTX - compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page.

BLAST Nuts & Bolts • Search a database with initial words and the expanded word set(neighborhood words) with scores above some threshold, T. • If a match is found between a word and a DB entry, attempt to extend the alignment until the score falls off by some value, i.e. the score is no longer “maximal”
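The "expanded word set" can be illustrated with a brute-force sketch: every candidate word of the same length is scored against the query word, and those scoring at least T are kept. The scoring function, the reduced alphabet and the value of T below are toy assumptions chosen to keep the output small; a real search would use a matrix such as PAM or BLOSUM over all 20 amino acids.

```python
from itertools import product


def neighborhood(word, alphabet, score, T):
    """Return all words whose summed pairwise score against `word` is >= T."""
    hits = []
    for cand in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, cand))
        if total >= T:
            hits.append("".join(cand))
    return hits


def toy_score(a, b):
    """Hypothetical scoring: +5 for identity, -1 otherwise (not a real matrix)."""
    return 5 if a == b else -1


if __name__ == "__main__":
    words = neighborhood("PRE", "PREGA", toy_score, T=9)
    print(len(words), words)
```

Lowering T enlarges the neighborhood, so more database words count as hits; raising it shrinks the set toward the exact word.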

Blast in a Nutshell

PAM 120
A   3
R  -3   6
N  -1  -1   4
D   0  -3   2   5
C  -3  -4  -5  -7   9
Q  -1   1   0   1  -7   6
E   0  -3   1   3  -7   2   5
G   1  -4   0   0  -4  -3  -1   5
H  -3   1   2   0  -4   3  -1  -4   7
I  -1  -2  -2  -3  -3  -3  -3  -4  -4   6
L  -3  -4  -4  -5  -7  -2  -4  -5  -3   1   5
K  -2   2   1  -1  -7   0  -1  -3  -2  -3  -4   5
M  -2  -1  -3  -4  -6  -1  -3  -4  -4   1   3   0   8
F  -4  -5  -4  -7  -6  -6  -7  -5  -3   0   0  -7  -1   8
P   1  -1  -2  -3  -4   0  -2  -2  -1  -3  -3  -2  -3  -5   6
S   1  -1   1   0   0  -2  -1   1  -2  -2  -4  -1  -2  -3   1   3
T   1  -2   0  -1  -3  -2  -2  -1  -3   0  -3  -1  -1  -4  -1   2   4
W  -7   1  -4  -8  -8  -6  -8  -8  -3  -6  -3  -5  -6  -1  -7  -2  -6  12
Y  -4  -5  -2  -5  -1  -5  -5  -6  -1  -2  -2  -5  -4   4  -6  -3  -3  -2   8
V   0  -3  -3  -3  -3  -3  -3  -2  -3   3   1  -4   1  -3  -2  -2   0  -8  -3   5
B   0  -2   3   4  -6   0   3   0   1  -3  -4   0  -4  -5  -2   0   0  -6  -3  -3   4
Z  -1  -1   0   3  -7   4   4  -2   1  -3  -3  -1  -2  -6  -1  -1  -2  -7  -5  -3   2   4
X  -1  -2  -1  -2  -4  -1  -1  -2  -2  -1  -2  -2  -2  -3  -2  -1  -1  -5  -3  -1  -1  -1  -2
*  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8   1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *

Blast Extension • Database Sequence T G Y A A S S S T Y M Q V G P R E G V L K P R E G A I • Word has a “hit” • Extend the word • During extension, the matrix is used to calculate the score. Extension continues until the target score is reached or the score deteriorates from its maximum by the specified drop-off (cutoff) value.
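The extension step can be sketched as an ungapped walk to the right of a word hit that stops once the running score has dropped more than X below the best score seen so far. This is a simplified illustration under stated assumptions: only rightward extension is shown, the scoring function is a toy stand-in for a real matrix, and the sequences and parameter values are made up for the example.

```python
def extend_right(query, subject, q_start, s_start, score, X):
    """Extend an ungapped alignment rightwards from (q_start, s_start).

    Returns (best_score, best_length) of the extension. Stops at the end of
    either sequence, or when the running score falls more than X below the
    best score seen so far.
    """
    running = best = best_len = 0
    k = 0
    while q_start + k < len(query) and s_start + k < len(subject):
        running += score(query[q_start + k], subject[s_start + k])
        k += 1
        if running > best:
            best, best_len = running, k
        elif best - running > X:
            break  # score has deteriorated past the drop-off
    return best, best_len


def toy_score(a, b):
    """Hypothetical scoring: +5 for identity, -4 otherwise (not a real matrix)."""
    return 5 if a == b else -4


if __name__ == "__main__":
    # The word "PRE" is assumed to have hit at query 0 / subject 2,
    # so extension starts just past it.
    print(extend_right("PREGAI", "VGPREGVLK", 3, 5, toy_score, X=3))
```

A full implementation would extend in both directions and add the word's own score; the drop-off test is the part that keeps the search from wandering through unrelated sequence.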

BLAST Nuts & Bolts • Normal word sizes for proteins are W=3 with T = 14 or W=4 with T=16. • Normal word sizes for nucleic acids are W=11 or W=12 • The default scoring matrix for nucleic acid sequences is (+1, -3) for NCBI BLAST and (+5, -4) for WUBLAST

Gapped BLAST - 3 Changes to the Algorithm • Criterion for extending word pairs modified, this gives an increase in speed • Ability to create gapped alignments added • Smith-Waterman calculations are used to produce the final alignment

Word Extension • In older versions of BLAST, any word pair with a score above T that was encountered when screening the DB was extended. • In the newer version, two non-overlapping words located at some distance X (the “hitdist”) from each other must hit the same sequence in the DB before an extension is performed. • To maintain sensitivity, the value of T must be lowered. This yields more hits, but few are extended.
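A rough sketch of the two-hit criterion follows. It assumes, as in the published gapped BLAST description, that the two word hits must lie on the same diagonal (same subject position minus query position), must not overlap, and must fall within a fixed window of each other. The function name, the input format and the window value of 40 are illustrative assumptions.

```python
from collections import defaultdict


def two_hit_seeds(hits, word=3, window=40):
    """Return the word hits that would trigger an extension.

    hits: iterable of (query_pos, subject_pos) start positions of word hits.
    A hit triggers extension when an earlier, non-overlapping hit exists on
    the same diagonal no more than `window` positions away.
    """
    by_diag = defaultdict(list)
    for q, s in hits:
        by_diag[s - q].append(q)

    seeds = []
    for diag, q_positions in by_diag.items():
        last = None
        for q in sorted(q_positions):
            if last is not None and q - last < word:
                continue  # overlaps the most recent hit: ignore it
            if last is not None and q - last <= window:
                seeds.append((q, q + diag))
            last = q
    return seeds


if __name__ == "__main__":
    print(two_hit_seeds([(0, 10), (5, 15), (20, 90), (3, 50)]))
```

Isolated hits such as (20, 90) are never extended, which is where the speed gain comes from even though the lower T produces more hits overall.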

Gapped Alignment • The original BLAST found many HSPs and used all of them to generate a sum statistic • With gapping, only one of the ungapped alignments needs to be found, rather than all of them. • This allows T to be raised, which speeds up the initial scan • Gapped alignments are produced by dynamic programming, extending a central pair of aligned residues in both directions.

PSI-BLAST • Distant relationships are often best detected by motif or profile searches rather than pairwise comparisons • PSI-BLAST searches are iterated, with a position-specific matrix generated from significant alignments found in round i used in round i + 1. • BLAST uses a generalized matrix • May not be as sensitive as motif search but is very general and easy to use.
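The idea of a position-specific matrix can be sketched by turning the columns of a set of aligned sequences into log-odds scores. This is a deliberately simplified illustration: it assumes ungapped sequences of equal length, a uniform background frequency, a flat pseudocount and scores in rounded half-bit units, whereas PSI-BLAST itself uses sequence weighting and more careful pseudocounts. All names and values are hypothetical.

```python
import math


def build_pssm(aligned, alphabet="ARNDCQEGHILKMFPSTWYV", pseudo=1.0):
    """Return one {residue: score} dict per alignment column.

    Scores are rounded half-bit log-odds of the column frequency
    (with pseudocounts) against a uniform background.
    """
    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(len(aligned[0])):
        counts = {aa: pseudo for aa in alphabet}
        for seq in aligned:
            if seq[col] in counts:
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({
            aa: round(2 * math.log2(counts[aa] / total / background))
            for aa in alphabet
        })
    return pssm


if __name__ == "__main__":
    pssm = build_pssm(["NSGW", "NSGW", "NTGW"])
    for pos, column in enumerate(pssm):
        print(pos, column)
```

Each row of the result plays the role of one row in a PSSM: residues seen often at that position score high, so the next search round is tuned to the family rather than to a single query.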

A PSSM (position-specific scoring matrix) for PSI-BLAST
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
20 N    0   0   3  -2  -4   2   0   0  -2   0   0   2  -2  -4  -3   2   0  -5  -3  -3
21 S   -2   0   3   0  -4   0   0   0  -2  -4  -4   1  -3  -4  -3   2   2   4  -3  -3
22 G    1   0   2  -2  -3   0  -2   1   2  -2   0   1  -2  -3  -3   1  -2  -4  -3   0
23 W   -2   2   1   1  -4   0   1   0   2  -1  -3   0  -3   2  -3   1  -2   3  -2  -3
24 D   -3   0   0   4  -4  -1   3  -3   1  -2   0   0  -2  -4   0  -2   0  -5  -3  -1
25 Q   -2   0   1   0  -4   2   3   0  -2  -1  -4  -1  -3  -3  -3   1   2  -4   0  -3

There are 2 BLAST Variants • NCBI BLAST (http://ncbi.nlm.nih.gov/BLAST/), used on the web or via local install • WUBLAST (see http://blast.wustl.edu/ for information). This program is most often used at database web sites and for local installs.

Available GenBank Peptide Sequence Databases
• nr: All non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF
• month: All new or revised GenBank CDS translations + PDB + SwissProt + PIR + PRF released in the last 30 days.
• swissprot: Last major release of the SWISS-PROT protein sequence database (no updates)
• Drosophila genome: Drosophila genome proteins provided by Celera and Berkeley Drosophila Genome Project (BDGP).
• yeast: Yeast (Saccharomyces cerevisiae) genomic CDS translations
• ecoli: Escherichia coli genomic CDS translations
• pdb: Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank
• Patent: Protein sequences derived from the Patent division of GenBank

Available GenBank Nucleotide Sequence Databases
• nr: All GenBank + EMBL + DDBJ + PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant".
• month: All new or revised GenBank + EMBL + DDBJ + PDB sequences released in the last 30 days.
• Drosophila genome: Drosophila genome provided by Celera and Berkeley Drosophila Genome Project (BDGP).
• dbest: Database of GenBank + EMBL + DDBJ sequences from EST Divisions
• dbsts: Database of GenBank + EMBL + DDBJ sequences from STS Divisions
• htgs: Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2 (finished, phase 3 HTG sequences are in nr)
• gss: Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.

Available GenBank Nucleotide Databases continued
• yeast: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
• E. coli: Escherichia coli genomic nucleotide sequences
• pdb: Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank
• Patent: Nucleotide sequences derived from the Patent division of GenBank
• vector: Vector subset of GenBank(R), NCBI, in ftp://ncbi.nlm.nih.gov/blast/db/
• mito: Database of mitochondrial sequences
• alu: Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov (under the /pub/jmc/alu directory). See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994).

Essential BLAST Parameters • W = word size • T = neighborhood word score threshold (varies by word size and matrix used) • V = number of descriptions to report • B = number of alignments to report • M = value of a nucleotide match • N = value of a nucleotide mismatch • X = word hit extension drop-off score • E = expected frequency of chance occurrences • S = score at which a single HSP would satisfy E • -matrix = defines a matrix to use • -filter = defines a specific filter program

Command line BLAST
Format: algorithm db query options
Example: blastp nr myprot.txt -matrix=pam70 V=10 B=10
Example: blastn nt mynuc.txt M=5 N=-4 E=1.0e-5
Example: blastn nt mynuc.txt M=5 N=-4 E=1.0e-5 > blast.out

Making your own BLAST DB • Any file of FASTA-formatted sequences can be turned into a BLAST DB. • How you do this depends on which BLAST variant you are using. • NCBI BLAST-protein DB: setdb myseqfile • NCBI BLAST-nucleotide DB: pressdb myseqfile • WUBLAST-protein DB: formatdb -p myseqfile • WUBLAST-nucleotide DB:
