
Sequence Alignment and Approaches to Database Searching

Sequence Alignment and Approaches to Database Searching. Jessica Kissinger 2001. Why do we align sequences?. To discover functional, structural and evolutionary similarities





Presentation Transcript


  1. Sequence Alignment and Approaches to Database Searching Jessica Kissinger 2001

  2. Why do we align sequences? • To discover functional, structural and evolutionary similarities • Because “similarity” may be an indicator of “homology” and thus provide some insight into function or gene identification.

  3. Origins of “similar” sequences • [Figure: gene A undergoes Gene Duplication to give A1 and A2; Gene Duplication followed by Speciation gives A1 and A2 in Species A and A1 and A2 in Species B] • Gene Conversion • Horizontal Gene Transfer

  4. The various algorithms • Dynamic programming algorithms provide a rigorous mathematical approach to sequence alignment. They are guaranteed to find the best alignment for a given scoring matrix and gap penalty. • Local alignments, as opposed to global alignments, are better for DB searching and for finding similar domains
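The difference between local and global dynamic programming can be sketched in a few lines. This is a minimal illustration, assuming simple match/mismatch scores and a linear gap penalty (the values `match=1`, `mismatch=-1`, `gap=-2` are placeholders, not from the slides); real searches would use a scoring matrix such as PAM or BLOSUM.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman style recurrence)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, so cells never go below 0.
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best end-to-end alignment score (Needleman-Wunsch style recurrence)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


# Two otherwise unrelated sequences that share a short "domain" (CGT).
x, y = "AAAACGTAAA", "TTTCGTTTT"
print(local_alignment_score(x, y), global_alignment_score(x, y))
```

The local score picks out the shared segment, while the global score is dragged down by the unrelated flanks, which is why local methods suit database searching.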

  5. Scoring Matrices are designed to detect signal above background, to detect similarities beyond what would be observed by chance alone
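One common way to express "signal above background" is a log-odds score: the observed frequency of an aligned pair divided by the frequency expected by chance. The sketch below assumes that formulation and half-bit units (the scaling commonly used for BLOSUM matrices); the frequencies are made-up numbers for illustration only.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score; scale=2.0 gives half-bit units.

    pair_freq      -- observed frequency of the aligned pair (hypothetical here)
    freq_a, freq_b -- background frequencies of the two residues
    """
    return round(scale * math.log2(pair_freq / (freq_a * freq_b)))


# A pair seen twice as often as chance predicts scores positive ...
print(log_odds_score(0.02, 0.1, 0.1))
# ... and a pair seen half as often as chance predicts scores negative.
print(log_odds_score(0.005, 0.1, 0.1))
```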

  6. Why do we need these matrices? • Database searching • Need different levels of sensitivity • Close relationships (Low PAM, high Blosum) • Distant relationships (High PAM, low Blosum)

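  7. Dot Plot Nuts & Bolts • Dot Plot: Word Size = 1 (a star marks a match, a dot marks no match)

          g c t g g a a g g c a t
        g * . . * * . . * * . . .
        c . * . . . . . . . * . .
        a . . . . . * * . . . * .
        g * . . * * . . * * . . .
        a . . . . . * * . . . * .
        g * . . * * . . * * . . .
        c . * . . . . . . . * . .
        a . . . . . * * . . . * .
        c . * . . . . . . . * . .
        t . . * . . . . . . . . *

  8. Dot Plots Nuts & Bolts • Dot Plot: Word Size = 2 (a star is placed at the first residue of each matching word)

          g c t g g a a g g c a t
        g * . . . . . . . * . . .
        c . . . . . . . . . * . .
        a . . . . . . * . . . . .
        g . . . . * . . . . . . .
        a . . . . . . * . . . . .
        g * . . . . . . . * . . .
        c . . . . . . . . . * . .
        a . . . . . . . . . . . .
        c . * . . . . . . . . . .
        t . . . . . . . . . . . .

  9. Dot Plot Nuts & Bolts • Dot Plot: Word Size = 3 (a star is placed at the first residue of each matching word)

          g c t g g a a g g c a t
        g . . . . . . . . * . . .
        c . . . . . . . . . . . .
        a . . . . . . . . . . . .
        g . . . . . . . . . . . .
        a . . . . . . . . . . . .
        g . . . . . . . . * . . .
        c . . . . . . . . . . . .
        a . . . . . . . . . . . .
        c . . . . . . . . . . . .
        t . . . . . . . . . . . .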
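The three dot plots above can be reproduced with a short script. This is a minimal sketch: the function names are illustrative, and marking a hit at the first residue of each matching word is an assumption about how the original slides drew the stars.

```python
def dot_plot(seq_x, seq_y, word_size=1):
    """Return the set of (row, col) cells where the word starting at
    seq_y[row] is identical to the word starting at seq_x[col]."""
    hits = set()
    for row in range(len(seq_y) - word_size + 1):
        for col in range(len(seq_x) - word_size + 1):
            if seq_y[row:row + word_size] == seq_x[col:col + word_size]:
                hits.add((row, col))
    return hits


def render(seq_x, seq_y, hits):
    """Draw the plot as text: '*' for a hit, '.' for no hit."""
    lines = ["  " + " ".join(seq_x)]
    for row, residue in enumerate(seq_y):
        cells = ["*" if (row, col) in hits else "." for col in range(len(seq_x))]
        lines.append(residue + " " + " ".join(cells))
    return "\n".join(lines)


top = "gctggaaggcat"   # sequence across the top of the slides
side = "gcagagcact"    # sequence down the side of the slides
for w in (1, 2, 3):
    print(f"word size = {w}")
    print(render(top, side, dot_plot(top, side, w)))
```

Increasing the word size removes most of the chance single-residue matches, which is the point the three slides make.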

  10. Plasmodium falciparum circumsporozoite protein MMRKLAILSVSSFLFVEALFQEYQCYGSSSNTRVLNELNYDNAGTNLYNELEMNYYGKQENWYSLKKNSRSLGENDDGNN NNGDNGREGKDEDKRDGNNEDNEKLRKPKHKKLKQPGDGNPDPNANPNVDPNANPNVDPNANPNVDPNANPNANPNANPN ANPNANPNANPNANPNANPNANPNANPNANPNANPNANPNANPNANPNVDPNANPNANPNANPNANPNANPNANPNANPN ANPNANPNANPNANPNANPNANPNANPNANPNANPNANPNANPNANPNKNNQGNGQGHNMPNDPNRNVDENANANNAVKN NNNEEPSDKHIEQYLKKIKNSISTEWSPCSVTCGNGIQVRIKPGSANKPKDELDYENDIEKKICKMEKCSSVFNVVNSSI GLIMVLSFLFLN Plasmodium vivax circumsporozoite protein MKNFILLAVSSILLVDLFPTHCGHNVDLSKAINLNGVNFNNVDASSLGAAHVGQSASRGRGLGENPDDEEGDAKKKKDGK KAEPKNPRENKLKQPGDRADGQPAGDRADGQPAGDRADGQPAGDRAAGQPAGDRADGQPAGDRADGQPAGDRADGQPAGD RADGQPAGDRAAGQPAGDRAAGQPAGDRADGQPAGDRAAGQPAGDRADGQPAGDRAAGQPAGDRADGQPAGDRAAGQPAG DRAAGQPAGDRAAGQPAGDRAAGQPAGNGAGGQAAGGNAGGGQGQNNEGANAPNEKSVKEYLDKVRATVGTEWTPCSVTC GVGVRVRRRVNAANKKPEDLTLNDLETDVCTMDKCAGIFNVVSNSLGLVILLVLALFN

  11. [Dot plot: Plasmodium falciparum CS protein vs. Plasmodium vivax CS protein, window = 2]

  12. [Dot plot: Plasmodium falciparum CS protein vs. Plasmodium vivax CS protein, window = 7]

  13. Database Searching • Database Searching ≠ Sequence alignment • Database searching is the application of knowledge gained from previous experiments to the problem of gene discovery • Similarity ≠ Homology

  14. Database Searching • The Assumptions • The sequences being sought have an evolutionary ancestral sequence in common with the query sequence • The best guess at the actual path of evolution is the path that requires the fewest evolutionary events (most parsimonious) • Not all substitutions are equally likely, and they should be weighted accordingly • Insertions and deletions are less likely than substitutions and should be weighted accordingly

  15. Database Searching • Applied Considerations • The choice of search algorithm influences the sensitivity and selectivity of the search • The choice of matrix determines both the pattern and the extent of substitution in the sequences the database search is most likely to discover

  16. Protein vs Nucleotide • Which molecules should you search with? • Which databases should you search, nucleotide or protein?

  17. Why can’t we just look at the DNA sequence for the protein? • It was once thought that we might be able to calculate a minimum mutation matrix, i.e. one in which the minimum number of nucleotide changes needed to convert one amino acid into another is counted. The problem is that, because of the degeneracy of the genetic code, likely and unlikely mutations would often receive the same score
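The problem with a minimum mutation matrix can be seen by counting codon differences directly. The sketch below assumes the standard genetic code (written in the compact TCAG ordering) and uses illustrative function names; it is not taken from the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


def min_mutation_steps(aa1, aa2):
    """Fewest single-nucleotide changes turning a codon for aa1 into one for aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa1)
        for c2 in codons_for(aa2)
    )


# Ile -> Leu is a conservative change; Gly -> Trp is a drastic one.
print(min_mutation_steps("I", "L"), min_mutation_steps("G", "W"))
```

Both changes need only one nucleotide substitution, so a minimum mutation matrix scores them identically, whereas the PAM 120 matrix shown in slide 22 gives I/L a score of 1 and G/W a score of -8.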

  18. BLAST • BLAST is less sensitive than Smith-Waterman (SW) • Basic BLAST uses a word size of 3 for proteins and is more sensitive than FASTA (even though FASTA uses a word size of 2) • Basic BLAST uses a word size of 11 or 12 for nucleic acid sequences • The heuristic is applied to the words in BLAST via a “threshold value, T” for alignments of words.

  19. Basic BLAST Algorithms • BLASTN - compares a nucleotide query to a nucleotide database • BLASTP - compares a protein query to a protein database • BLASTX - compares a nucleotide query sequence translated in all reading frames against a protein sequence database • TBLASTN - compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames. • TBLASTX - compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx program cannot be used with the nr database on the BLAST Web page.
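BLASTX, TBLASTN and TBLASTX all rely on translating nucleotide sequence in every reading frame. A minimal sketch of six-frame translation follows, assuming the standard genetic code and an input containing only A, C, G and T; the function names are illustrative and this is not how any BLAST program is actually implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return translations for frames +1..+3 (forward) and -1..-3 (reverse complement)."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


print(six_frame_translations("ATGGCC"))
```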

  20. BLAST Nuts & Bolts • Search a database with initial words and the expanded word set (neighborhood words) with scores above some threshold, T. • If a match is found between a word and a DB entry, attempt to extend the alignment until the score falls off by some value, i.e. the score is no longer “maximal”
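The neighborhood word set can be illustrated on a reduced alphabet. The sketch below uses the A, R, N, D entries of the PAM 120 matrix from slide 22; restricting the alphabet to four residues, the threshold values, and treating "above T" as "at least T" are all assumptions made to keep the example small.

```python
from itertools import product

# Lower-triangle PAM 120 scores for A, R, N, D, copied from slide 22.
PAM120_SUBSET = {
    ("A", "A"): 3,
    ("R", "A"): -3, ("R", "R"): 6,
    ("N", "A"): -1, ("N", "R"): -1, ("N", "N"): 4,
    ("D", "A"): 0, ("D", "R"): -3, ("D", "N"): 2, ("D", "D"): 5,
}


def pair_score(x, y):
    """Look up a symmetric score stored only once in the lower triangle."""
    return PAM120_SUBSET.get((x, y), PAM120_SUBSET.get((y, x)))


def neighborhood_words(word, threshold, alphabet="ARND"):
    """All words of the same length scoring at least `threshold` against `word`."""
    neighbors = {}
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(pair_score(q, c) for q, c in zip(word, candidate))
        if score >= threshold:
            neighbors["".join(candidate)] = score
    return neighbors


print(neighborhood_words("RND", threshold=11))
print(len(neighborhood_words("RND", threshold=10)))
```

Lowering T from 11 to 10 grows the word set from 3 to 7 words, which shows how T trades speed for sensitivity.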

  21. Blast in a Nutshell

  22. PAM 120

      A   3
      R  -3   6
      N  -1  -1   4
      D   0  -3   2   5
      C  -3  -4  -5  -7   9
      Q  -1   1   0   1  -7   6
      E   0  -3   1   3  -7   2   5
      G   1  -4   0   0  -4  -3  -1   5
      H  -3   1   2   0  -4   3  -1  -4   7
      I  -1  -2  -2  -3  -3  -3  -3  -4  -4   6
      L  -3  -4  -4  -5  -7  -2  -4  -5  -3   1   5
      K  -2   2   1  -1  -7   0  -1  -3  -2  -3  -4   5
      M  -2  -1  -3  -4  -6  -1  -3  -4  -4   1   3   0   8
      F  -4  -5  -4  -7  -6  -6  -7  -5  -3   0   0  -7  -1   8
      P   1  -1  -2  -3  -4   0  -2  -2  -1  -3  -3  -2  -3  -5   6
      S   1  -1   1   0   0  -2  -1   1  -2  -2  -4  -1  -2  -3   1   3
      T   1  -2   0  -1  -3  -2  -2  -1  -3   0  -3  -1  -1  -4  -1   2   4
      W  -7   1  -4  -8  -8  -6  -8  -8  -3  -6  -3  -5  -6  -1  -7  -2  -6  12
      Y  -4  -5  -2  -5  -1  -5  -5  -6  -1  -2  -2  -5  -4   4  -6  -3  -3  -2   8
      V   0  -3  -3  -3  -3  -3  -3  -2  -3   3   1  -4   1  -3  -2  -2   0  -8  -3   5
      B   0  -2   3   4  -6   0   3   0   1  -3  -4   0  -4  -5  -2   0   0  -6  -3  -3   4
      Z  -1  -1   0   3  -7   4   4  -2   1  -3  -3  -1  -2  -6  -1  -1  -2  -7  -5  -3   2   4
      X  -1  -2  -1  -2  -4  -1  -1  -2  -2  -1  -2  -2  -2  -3  -2  -1  -1  -5  -3  -1  -1  -1  -2
      *  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8   1
          A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *

  23. Blast Extension • Database Sequence T G Y A A S S S T Y M Q V G P R E G V L K P R E G A I • Word has a “hit” • Extend the word • During extension, the matrix is used to calculate the score. • Extension continues until the score is reached or the score deteriorates by the specified fall or cutoff.
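Ungapped extension with a drop-off can be sketched as follows. This is a simplified illustration, not BLAST's actual code: it uses the nucleotide scores +1/-3 quoted for NCBI BLAST in the next slide instead of a protein matrix, the drop-off value `x_drop=5` is a placeholder, and the function names are hypothetical.

```python
def score_pair(a, b, match=1, mismatch=-3):
    """Nucleotide match/mismatch score."""
    return match if a == b else mismatch


def extend_one_way(query, subject, q_pos, s_pos, step, x_drop):
    """Extend without gaps in one direction (step = +1 or -1).

    Stops when the running score falls more than x_drop below the best
    score seen so far. Returns (best score gained, residues kept).
    """
    best = running = 0
    best_len = length = 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        running += score_pair(query[q_pos], subject[s_pos])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break
        q_pos += step
        s_pos += step
    return best, best_len


def extend_hit(query, subject, q_start, s_start, word_size, x_drop=5):
    """Extend a word hit both ways; returns (score, query_from, query_to)."""
    seed = sum(score_pair(query[q_start + k], subject[s_start + k])
               for k in range(word_size))
    right, r_len = extend_one_way(query, subject, q_start + word_size,
                                  s_start + word_size, +1, x_drop)
    left, l_len = extend_one_way(query, subject, q_start - 1,
                                 s_start - 1, -1, x_drop)
    return seed + left + right, q_start - l_len, q_start + word_size + r_len


print(extend_hit("TTACGTACGGA", "CCACGTACGTC", 2, 2, 3))
```

In this example the 3-residue seed grows into the 7-residue segment `ACGTACG`, and extension stops once the mismatches on either side push the score more than `x_drop` below its maximum.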

  24. BLAST Nuts & Bolts • Normal word sizes for proteins are W=3 with T = 14 or W=4 with T=16. • Normal word sizes for nucleic acids are W=11 or W=12 • The default scoring matrix for nucleic acid sequences is (+1, -3) for NCBI BLAST and (+5, -4) for WUBLAST

  25. Gapped BLAST - 3 Changes to the Algorithm • Criterion for extending word pairs modified, this gives an increase in speed • Ability to create gapped alignments added • Smith-Waterman calculations are used to produce the final alignment

  26. Word Extension • In the older versions of BLAST, if a word pair with a score above T was encountered when screening the DB, it was extended. • In the newer version, two non-overlapping words located within some distance X (the “hitdist”) of each other must hit the same sequence in the DB before an extension is performed. • To maintain sensitivity, the value of T must be lowered. This yields more hits, but fewer are extended.
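The two-hit criterion can be sketched as below. Several simplifications are assumed: words must match exactly (real protein BLAST also uses neighborhood words), the two hits are required to lie on the same diagonal, and `max_distance` stands in for the slide's distance X. The function names are illustrative.

```python
from collections import defaultdict


def word_hits(query, subject, word_size):
    """All exact word matches as (query_pos, subject_pos) pairs."""
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


def two_hit_seeds(hits, word_size, max_distance):
    """Pairs of non-overlapping hits on one diagonal within max_distance.

    Returns (diagonal, first_query_pos, second_query_pos) triples; only
    these would go on to the extension step.
    """
    by_diagonal = defaultdict(list)
    for q, s in hits:
        by_diagonal[s - q].append(q)
    seeds = []
    for diagonal, positions in by_diagonal.items():
        last = None
        for pos in sorted(positions):
            if last is not None and pos - last < word_size:
                continue  # overlaps the previous hit: ignore it
            if last is not None and pos - last <= max_distance:
                seeds.append((diagonal, last, pos))
            last = pos
    return seeds


hits = word_hits("MKVLAAGIWQ", "TTMKVLPAGIWR", 3)
print(hits)
print(two_hit_seeds(hits, word_size=3, max_distance=40))
```

Four word hits collapse into a single seed here, which illustrates why many hits are found but few are extended.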

  27. Gapped Alignment • The original BLAST found many HSPs and used all of them to generate a sum statistic • With gapped alignments, only one of the ungapped alignments needs to be found rather than all of them. • This allows T to be raised, which speeds up the initial scan • Gapped alignments are achieved via dynamic programming to extend a central pair of aligned residues in both directions.

  28. PSI-BLAST • Distant relationships are often best detected by motif or profile searches rather than pairwise comparisons • PSI-BLAST searches are iterated, with a position-specific matrix generated from significant alignments found in round i used in round i + 1. • BLAST uses a generalized matrix • May not be as sensitive as motif search but is very general and easy to use.

  29. A PSSM (position specific scoring matrix) for PSI-BLAST

             A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
      20 N   0   0   3  -2  -4   2   0   0  -2   0   0   2  -2  -4  -3   2   0  -5  -3  -3
      21 S  -2   0   3   0  -4   0   0   0  -2  -4  -4   1  -3  -4  -3   2   2   4  -3  -3
      22 G   1   0   2  -2  -3   0  -2   1   2  -2   0   1  -2  -3  -3   1  -2  -4  -3   0
      23 W  -2   2   1   1  -4   0   1   0   2  -1  -3   0  -3   2  -3   1  -2   3  -2  -3
      24 D  -3   0   0   4  -4  -1   3  -3   1  -2   0   0  -2  -4   0  -2   0  -5  -3  -1
      25 Q  -2   0   1   0  -4   2   3   0  -2  -1  -4  -1  -3  -3  -3   1   2  -4   0  -3
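Scoring a sequence segment against a PSSM differs from using a general matrix in that the score depends on the position as well as the residue. The sketch below uses the six PSSM rows from the slide above (positions 20 to 25); the function name and the ungapped, fixed-offset scoring are simplifying assumptions.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Rows for query positions 20-25, copied from the slide.
PSSM = [
    [0, 0, 3, -2, -4, 2, 0, 0, -2, 0, 0, 2, -2, -4, -3, 2, 0, -5, -3, -3],
    [-2, 0, 3, 0, -4, 0, 0, 0, -2, -4, -4, 1, -3, -4, -3, 2, 2, 4, -3, -3],
    [1, 0, 2, -2, -3, 0, -2, 1, 2, -2, 0, 1, -2, -3, -3, 1, -2, -4, -3, 0],
    [-2, 2, 1, 1, -4, 0, 1, 0, 2, -1, -3, 0, -3, 2, -3, 1, -2, 3, -2, -3],
    [-3, 0, 0, 4, -4, -1, 3, -3, 1, -2, 0, 0, -2, -4, 0, -2, 0, -5, -3, -1],
    [-2, 0, 1, 0, -4, 2, 3, 0, -2, -1, -4, -1, -3, -3, -3, 1, 2, -4, 0, -3],
]


def pssm_score(segment, pssm=PSSM):
    """Sum the position-specific scores of an ungapped segment of len(pssm) residues."""
    if len(segment) != len(pssm):
        raise ValueError("segment length must equal the number of PSSM rows")
    return sum(row[COLUMNS.index(residue)] for row, residue in zip(pssm, segment))


print(pssm_score("NSGWDQ"))  # the query residues themselves
print(pssm_score("AAAAAA"))  # an unrelated segment
```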

  30. There are 2 Blast Variants • NCBI BLAST (http://ncbi.nlm.nih.gov/BLAST/) or via local install • WUBLAST (http://blast.wustl.edu/) for information. This program is most often used at database web sites and for local installs.

  31. Available GenBank Peptide Sequence Databases • nr: All non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF • month: All new or revised GenBank CDS translations + PDB + SwissProt + PIR + PRF released in the last 30 days • swissprot: Last major release of the SWISS-PROT protein sequence database (no updates) • Drosophila genome: Drosophila genome proteins provided by Celera and Berkeley Drosophila Genome Project (BDGP) • yeast: Yeast (Saccharomyces cerevisiae) genomic CDS translations • ecoli: Escherichia coli genomic CDS translations • pdb: Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank • Patent: Protein sequences derived from the Patent division of GenBank

  32. Available GenBank Nucleotide Sequence Databases • nr: All GenBank + EMBL + DDBJ + PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant". • month: All new or revised GenBank + EMBL + DDBJ + PDB sequences released in the last 30 days • Drosophila genome: Drosophila genome provided by Celera and Berkeley Drosophila Genome Project (BDGP) • dbest: Database of GenBank + EMBL + DDBJ sequences from EST Divisions • dbsts: Database of GenBank + EMBL + DDBJ sequences from STS Divisions • htgs: Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2 (finished, phase 3 HTG sequences are in nr) • gss: Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences

  33. Available GenBank Nucleotide Databases continued • yeast: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences • E. coli: Escherichia coli genomic nucleotide sequences • pdb: Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank • Patent: Nucleotide sequences derived from the Patent division of GenBank • vector: Vector subset of GenBank(R), NCBI, in ftp://ncbi.nlm.nih.gov/blast/db/ • mito: Database of mitochondrial sequences • alu: Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov (under the /pub/jmc/alu directory). See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994).

  34. Essential BLAST Parameters • W = word size • T = neighborhood word score threshold (varies by word size and matrix used) • V = number of descriptions to report • B = number of alignments to report • M = value of a nucleotide match • N = value of a nucleotide mismatch • X = word hit extension drop-off score • E = expected frequency of chance occurrences • S = score at which a single HSP would satisfy E • -matrix = defines a matrix to use • -filter = defines a specific filter program
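The link between E and S can be sketched with the Karlin-Altschul relation E = K·m·n·exp(-λS), which the slides do not state explicitly. Here m is the query length and n the database length; the values of K and λ depend on the scoring system, and the defaults below are placeholders chosen only for illustration.

```python
import math


def expected_hits(score, query_len, db_len, k=0.1, lam=0.3):
    """E: number of HSPs with at least this score expected by chance.

    k and lam are placeholder values, not real BLAST parameters.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def score_for_expectation(e_value, query_len, db_len, k=0.1, lam=0.3):
    """S: the score at which a single HSP would satisfy the given E."""
    return math.log(k * query_len * db_len / e_value) / lam


s = score_for_expectation(10.0, query_len=250, db_len=1e8)
print(round(s, 1), expected_hits(s, 250, 1e8))
```

Raising the required score lowers E exponentially, and searching a larger database raises the score needed to reach the same E.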

  35. Command line BLAST • Format: algorithm db query options • Example: blastp nr myprot.txt -matrix=pam70 V=10 B=10 • Example: blastn nt mynuc.txt M=5 N=-4 E=1.0e-5 • Example: blastn nt mynuc.txt M=5 N=-4 E=1.0e-5 > blast.out

  36. Making your own BLAST DB • Any file of FASTA-formatted sequences can be turned into a BLAST DB. • How you do this depends on which BLAST variant you are using. • NCBI BLAST-protein DB: setdb myseqfile • NCBI BLAST-nucleotide DB: pressdb myseqfile • WUBLAST-protein DB: formatdb -p myseqfile • WUBLAST-nucleotideDB:
