Protein Analysis Workshop 2006 Pairwise and multiple sequence alignments Alain Schenkel Tuomas Hätinen Bioinformatics group Institute of Biotechnology University of Helsinki
Overview • Motivation – Why alignments? • Sequence comparison • Dotplot • The alignment problem • Pairwise alignment algorithms • Exact algorithms • Heuristic algorithms • Database searches • Multiple sequence alignments • Web tools: • Build alignments using SRS or EBI server, • Blast at NCBI, EBI, • PairsDB, …
Motivation • Proteins perform most of the functions required in biological systems: • Signaling (kinases, ...) • Enzymes (proteases, …) • Structural (collagen, elastin, …) • Immune system (antibodies, ...) • Storage and transport (hemoglobin, …) • … • Large amount of information available in current databanks. Goal: Want to extrapolate information about the function of a newly discovered sequence by comparing it to annotated sequences.
Does it make sense? • All functional information is ultimately contained within the sequence. • Proteins are evolutionary related: • Selective pressure is on function, and thus on residues with functional role (eg: active site or structural key residues are conserved). • Modular nature of proteins. • Two sequences have the same structure if corresponding residues are similar enough on physico-chemical level.
Application of sequence alignments • Determining function of newly discovered genetic or protein sequences. • Identification of functional patterns/domains. • Predicting structure of proteins. • Determining evolutionaryrelationships among genes, proteins, and entire species. Aligning and comparing sequences, and searching databases for similar sequences – a cornerstoneof bioinformatics!!
Sequence Comparison Alignment Dotplots The pairwise alignment problem
Pairwise alignment Pairwise alignment = identification of residue-residue correspondence. ????? 101 AGVIGTILLISYGIRRLIKKSPSDVKP 115 ||:||.|||::|..|||.|:.|:||.| GLP_HORSE 60 AGIIGIILLLAYVSRRLRKRPPADVPP 86 For the alignment to be meaningful, the correspondence should reflect the functional, or evolutionary, …, relationship (if any). What criteria should we use to obtain biologically meaningful alignments?
Some terminology • Identity: • percentage of pairs of identical residues between two aligned sequences. • Similarity: • percentage of pairs of similar residues between two aligned sequences. • one must define what similar means. Eg: • as observed in well studied evolutionary related protein families, • physico-chemical amino acid properties: hydropathy, size, … • Homology: • two sequences are homologous if and only if they have a common ancestor. • it´s either yes or no. • not to be confused with similarity!
Dotplots Sequence 1 • The simplest way of comparing two sequences: • A dot is placed where both sequence elements are identical. • Gives an overview of all possible alignments. • Each diagonal indicates a possible (ungapped) alignment. Sequence 2 One possible alignment: ATCTTCGAT | |||| ---TACGAT
Filtering Out the Noise in Dotplots • Dots may be scored according to a sliding window and a similarity cutoff to reduce noise: • The smaller the window, the more noise. • With large windows, the sensitivity for short sequences is reduced. Window size = 5, Similarity cutoff = 3 LETVHKKLYAGQYQNAGQFCDDIWLMLDNA L S T I K R K L D * T G Q * Y Q E P W Q … LETVHKKLYAGQYQNAGQFCDDIWLMLDNA | | || |||| | || ||| | LSTIKRKLDTGQYQEPWQYVDDVWLMFNN LETVHKKLYAGQYQNAGQFCDDIWLMLDNA | | || |||| | || ||| | LSTIKRKLDTGQYQEPWQYVDDVWLMFNN LETVHKKLYAGQYQNAGQFCDDIWLMLDNA | | || |||| | || ||| | LSTIKRKLDTGQYQEPWQYVDDVWLMFNN
Using Dotmatcher from SRS • SRS at EBI: http://srs.ebi.ac.uk/ • SRS at EMBnet Austria: http://emb2.bcc.univie.ac.at:8080/srs/ • ... or any servers listed at http://downloads.lionbio.co.uk/publicsrs.html Check out the SRS version (bottom of page): different versions index different databases, so the search results might be different depending on the version.
DotmatcherP (for proteins) Enter sequences in FASTA format! Advanced options: Change default window size, threshold score and scoring matrix
DotmatcherP Comparing a protein with itself. Eg: Drosophila Melanogaster SLIT • Identification of repeated protein domains
DotmatcherP Comparing two different sequences: • Identification of conserved protein domains. • Using the default parameters window size = 10 and threshold = 23:
DotmatcherP • If we lower the window size and the threshold, we observe lots of noise. • Eg, with window size = 5, threshold = 10:
Another Dotplot server: Dotlet • Has more options and provides more flexibility than Dotmatcher. • Some very useful features: • If only one sequence is entered, dotlet automatically compares it against itself (finding repeats, low complexity regions, etc.). • Same application for both nucleic acid and protein sequences. • When comparing nucleic acid to nucleic acid, dotlet will reverse complement one of the sequences and perform a second comparison. Enables, eg, to see structures like stem-loops. • Possible to compare a protein to a nucleic acid sequence. The nucleic acid sequence is translated in the three forward frames and pixels are set to the highest of the scores. Enables, eg, to detect introns/exons, frameshift, etc.
2. Enter name for sequence (optional) 1. Enter sequence 3. Repeat for second sequence (optional) 4. Select scoring matrix, window size and zoom 4. Click ”compute”!
Each pixel corresponds to a residue in the horisontal sequence and to a residue in the vertical sequence The pixels color depends on how similar the two sequences are around these two positions Tuning of grayscale in order to make background noise disappear Residues that match well in the alignment are coloured blue Possible to scroll the dotplot here Dotlet reverse complements one of the sequences stem-loops can be detected Possible to scroll the alignment here
Dotplot - Summary • Comparing a sequence with itself, can be used to identify: • Repeated domains, • Regions of low complexity (eg, …GYCAAAAAAAAALK…). • Comparing two protein sequences, can be used to identify: • Local regions of similarity, • Conserved protein domains.
Dotplot - Summary • Good: • visual detection of feature/similarity, • exploring the sequence organisation. • Bad: • resolving regions of low similarity, • does not provide an alignment (no insertions/deletions). To obtain an alignment, we need a method for lining up the diagonals in a dotplot. GATCTA GATC_A
The Pairwise Alignment Problem • Lign up diagonal by edit operations: • substitution (mutation) • gap or indel (insertion/deletion) sequence 1 substitution sequence 2 deletion seq1 IGTILLISYGIRRLIKKSPSDVKP----LPSPDTDVP || ||| | ||| | | || | || | | seq2 IGIILLLAYVSRRLRKRPPADVPPPASTVPSADAPPP gap insertion But there are many ways to align 2 sequences we need to score alignments to decide which is the best.
Scoring the Edit Operations • For example: • identical: +10 (it´s good) • substitution: +2 for S-A, -1 for K-P, … • gap: -3 PSDVKP--P | || | | PADVPPPAP Score: +50+2-1+2*(-3) = 45 Choosing an appropriate scoring scheme: where biological information is introduced (eg, reward the evolutionary most likely alignment). Standard notation: • | for identical • : for very similar (eg, size and hydropathy) • . for somewhat similar (eg, size or hydropathy)
Gap penalty TIL--------LISYGIRRLIK TILKKSPSDVKLISYGIRRLIK • Different scores for • gap opening, eg: -5 • gap extension, eg: L(-1) with L=length of extension • gap opening > gap extension Few long gaps is better than IG-TI--LYDL-SYYAG---IR IGKIIPRL--LVAY--VLIGSR many small gaps gap opening gap extension TIL--------LISYGIRRLIK TILKKSPSDVKLISYGIRRLIK gap score= -5 -6
Gap penalty • Can also consider special penalty for gaps at end/beginning of alignment (eg, zero penalty). • Need to be careful in adjusting the gap score to the substitution score: • too strong penalty no gaps, • too weak penalty too many gaps. • Insertions and deletions have been found to occur in nature at significantly lower frequency than mutations.
Residue Substitution • A substitution score for each aa pair a substitution matrix. • Most used: based on evolutionary relationship. • Two types: • PAM series, • BLOSUM series.
PAM (Percent Accepted Mutation) PAM250 • PAM1: observed mutations in carefully selected sets of closely related proteins (1572 sequences from 71 families). (1978) • Idea: observed substitutions are the result of 1 mutation (not many). • PAMn: iterate PAM1 n times to obtain substitution rate between more divergent sequences. Use when PAM: 0 30 80 110 200 250 %identity: 100 75 60 50 25 20
BLOSUM (BLOck Substitution Matrix) • Based on a larger set than PAM is. • More recent than PAM. (1992) • Different approach than PAM: • not based on an explicit evolutionary model, • observed aa substitutions in a set of conserved aa patterns called blocks. • BLOSUMn: fromblocks which are n% identical. • BLOSUM62: empirically shown to be among the best at detecting weak similarity. BLOSUM62
BLOSUM 8 BLOSUM 62 BLOSUM 45 PAM 1 PAM 120 PAM 250 Less divergent More divergent Tips for using substitution matrices • Generally, BLOSUM matrices perform better than PAM for local similarity searches. • For database searches, the most commonly used matrix is BLOSUM62. • When comparing closely related proteins, one should use lower PAM or higher BLOSUM, for distantly related proteins higher PAM or lower BLOSUM matrices • Caution: substitution matrices are statistical in nature. In a given alignment, a substitution may or may not correspond to an actual mutation.
Pairwise alignment algorithms Exact algorithms Heuristic algorithms Database scanning
Pairwise Alignment Algorithms • Given a scoring scheme, an alignment algorithm tries to find the best alignment between 2 sequences according to that scheme. • Exact algorithms: • guaranteed to return an alignment with the best possible score. • Heuristic alignments: • not guaranteed to return best alignments. • but they are quicker (and hopefully still return good alignments). • Two types of alignment: • Global: forced over the entire length of 2 sequences. • Local: between substrings of 2 sequences..
Global vs Local Alignment • Global alignments: • are sensitive to gap penalties, • do not take into account the modular nature of proteins, • can be used to compare 2 proteins with same function (in, eg, human/mouse). • Local alignments: are sensitive to modular nature of proteins. They can be used to: • look for conserved domains or motifs in 2 proteins, • search for local similarities in large sequences, • database searches, • scanning an entire genome with a short sequence.
Exact Algorithms: Dynamic Programming How can we find the best alignment between 2 sequences? • Exhaustive search among all possible alignments is not possible (eg, for 2 sequences of 100 and 95 residues: 55 millions alignments with 5 gaps). • Problem solved by dynamic programming: • initialize top row and left column, • compute best local scores iteratively, • keep track of where best local score comes from, • traceback to obtain the best alignments. • May exist several best solutions: an alignment reported to you may be one among a number of possibilities. best global score Example of 2 best solutions: ATTCTCTGA -TAC--TGA ATTCTCTGA -TA--CTGA The example is from www.pasteur.fr
Global Alignment Servers (Exact Algorithm) Use the Needleman-Wunsch algorithm (1970). • Server at SRS: NeedleP. (http://srs.ebi.ac.uk/ Tools) • Server at EBI: EMBOSS-Align • Let´s submit to http://www.ebi.ac.uk/emboss/align/index.html the sequences : >uniprot|P35858|ALS_HUMAN Insulin-like growth factor-binding protein complex MALRKGGLALALLLLSWVALGPRSLEGADPGTPGEAEGPACPAACVCSYDDDADELSVFC SSRNLTRLPDGVPGGTQALWLDGNNLSSVPPAAFQNLSSLGFLNLQGGQLGSLEPQALLG LENLCHLHLERNQLRSLALGTFAHTPALASLGLSNNRLSRLEDGLFEGLGSLWDLNLGWN SLAVLPDAAFRGLGSLRELVLAGNRLAYLQPALFSGLAELRELDLSRNALRAIKANVFVQ LPRLQKLYLDRNLIAAVAPGAFLGLKALRWLDLSHNRVAGLLEDTFPGLLGLRVLRLSHN AIASLRPRTFKDLHFLEELQLGHNRIRQLAERSFEGLGQLEVLTLDHNQLQEVKAGAFLG LTNVAVMNLSGNCLRNLPEQVFRGLGKLHSLHLEGSCLGRIRPHTFTGLSGLRRLFLKDN GLVGIEEQSLWGLAELLELDLTSNQLTHLPHRLFQGLGKLEYLLLSRNRLAELPADALGP LQRAFWLDVSHNRLEALPNSLLAPLGRLRYLSLRNNSLRTFTPQPPGLERLWLEGNPWDC GCPLKALRDFALQNPSAVPRFVQAICEGDDCQPPAYTYNNITCASPPEVVGLDLRDLSEA HFAPC >uniprot|O08770|GPV_RAT Platelet glycoprotein V precursor (GPV) (CD42D). MLRSVLLSAVLSLVGAQPFPCPKTCKCVVRDAVQCSGGSVAHIAELGLPTNLTHILLFRM DRGVLQSHSFSGMTVLQRLMLSDSHISAIDPGTFNDLVKLKTLRLTRNKISHLPRAILDK MVLLEQLFLDHNALRDLDQNLFQKLLNLRDLCLNQNQLSFLPANLFSSLGKLKVLDLSRN NLTHLPQGLLGAQIKLEKLLLYSNRLMSLDSGLLANLGALTELRLERNHLRSIAPGAFDS LGNLSTLTLSGNLLESLPPALFLHVSWLTRLTLFENPLEELPEVLFGEMAGLRELWLNGT HLRTLPAAAFRNLSGLQTLGLTRNPLLSALPPGMFHGLTELRVLAVHTNALEELPEDALR GLGRLRQVSLRHNRLRALPRTLFRNLSSLVTVQLEHNQLKTLPGDVFAALPQLTRVLLGH NPWLCDCGLWPFLQWLRHHLELLGRDEPPQCNGPESRASLTFWELLQGDQWCPSSRGLPP DPPTENALKAPDPTQRPNSSQSWAWVQLVARGESPDNRFYWNLYILLLIAQATIAGFIVF AMIKIGQLFRTLIREELLFEAMGKSSN
gap penalties choose scoring matrix gap penalties
NeedleP at SRS options for gap penalties choose scoring matrix (optional)
Local Alignment Servers (Exact Algorithm) • Server at EMBnet: LALIGN, uses SIM algorithm (1991) • http://www.ch.embnet.org/software/LALIGN_form.html • Server at SRS: • http://srs.ebi.ac.uk/ Tools. • WaterP. Uses the Smith-Waterman algorithm (1981) • MatcherP. Can be used to find various local alignments between 2 sequences. Slower than WaterP. • Server at EBI (Smith-Waterman algorithm). • http://www.ebi.ac.uk/emboss/align/index.html
Heuristic Algorithms • Motivations: • Exact algorithms are exhaustive but computationally expensive. • Exact algorithms are impractical for comparing a query sequence to millions of other sequences in a database (database scanning), • and so, database scanning requires faster alignment algorithm (at the cost of optimality).
Heuristic Algorithms • Probing a database with a query is similar to aligning a query with a very long sequence. • Main idea: • Use dynamic programming, but limited to (sub-)sequences which are likely to produce interesting alignments with the query. • Heuristic part of the algorithm: eliminate from search uninteresting sequences (need to make a guess). • Algorithms: • FASTA : Lipman-Pearson (1985). • BLAST (Basic Local Alignment Search Tool) : Altshul et al. (1990). need fast local alignment methods.
BLAST Overview • Many versions for different query-database cases: • blastp: protein - protein • blastn: nucleotide - nucleotide • blastx: nucleotide protein - protein • tblastn: protein - protein nucleotide • tblastx: nucleotide protein - protein nucleotide • Comes in many flavours. • Fast and reliable. • Easy to use.
BLAST Overview • BLAST computes “an alignment”, not necessarily the exact optimal alignment. • Given the query and the database (long sequence): • Find all words of length k (typical: k=4) that match the query with a score high enough. • Look for subsequences in the database that contain these words. • Extend subsequences to see if match score can be increased. • Compute total score when no more extensions are possible. • Rank the alignments. How should the different matched (sub-)sequences be ranked?
Significance of Alignments • Scores cannot be used to rank alignments: • a bad but long alignment may have a higher score than a good but short alignment. • We need a normalized scoring schemethat would allow to compare alignments, and evaluate their biological significance. • Idea: • Probe the database with random sequences. • This gives a distribution of scores (it follows the extreme-value distribution). • Establish a threshold for significance.
Extreme-Value Distribution Score distribution for random sequences probability that the score of our query is no better than random: P-value score score of our query Difficulty: finding a significance threshold.
Quantifying the Significance of Alignments For an alignment with raw score S: • P-value: • The probability of an alignment occurring with score S or better if the aligned-against sequence is random. • The lower the P-value, the more significant the alignment. • E-value: • Expected number of alignments with scores equivalent to or better than S to occur by chance only. • The lower the E-value, the more significant the alignment. • E-value = P-value * size of database.
Rough Guide for P-values and E-values • P-Value (reported by many programs): 0≤ P-val ≤ 1 • E-value (reported by some programs, eg PSI-Blast): 0 ≤ E-val ≤ size of database
Heuristic Algorithms Servers • Pairwise alignment: • BLAST: http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi • Database screening: • FASTA: http://www.ebi.ac.uk/fasta33/ , SRS, … • BLAST: • SRS (at EBI or ...) • http://www.ncbi.nlm.nih.gov/BLAST/ • http://www.ebi.ac.uk/blast/index.html • http://www.ch.embnet.org/software/bBLAST.html • http://www.ch.embnet.org/software/aBLAST.html • Evaluating the significance of an alignment: • PRSS: http://www.ch.embnet.org/software/PRSS_form.html
BLAST Servers • Blast has many options : • choice of database, substitution matrix, … • basic or advanced section. • BLAST interfaces are different: • NCBI: excellent help pages and tutorial • SRS: easy multiple alignment access • EMBnet: simple text + graphical output. Remark: there is a server with a powerful implementation of Smith-Waterman for database screening: http://www.ebi.ac.uk/MPsrch/. Runs about 50 times slower, but is more sensitive and returns less false positives than Blast.
BLAST at NCBI >1IGR:A INSULIN-LIKE GROWTH FACTOR RECEPTOR EICGPGIDIRNDYQQLKRLENCTVIEGYLHILLISKAEDYRSYR FPKLTVITEYSLGDLFPNLTVIRGWKLFYNYALVIFEMTNLKDI GLYNLRNITRGAIRIEKNADLCYLSTVDWSLILDAVSNNYIVGN KPPKECGDLCPGTMEEKPMCEKTTINNEYNYRCWTTNRCQKMCP STCGKRACTENNECCHPECLGSCSAPDNDTACVACRHYYYAGVC VPACPPNTYRFEGWRCVDRDFCANILSAESSDSEGFVIHDGECM QECPSGFIRNGSQSMYCIPCEGPCPKVCEEEKKTKTIDSVTSAQ MLQGCTIFKGNLLINIRRGNNIASELENFMGLIEVVTGYVKIRH SHALVSLSFLKNLRLILGEEQLEGNYSFYVLDNQNLQQLWDWDH RNLTIKAGKMYFAFNPKLCVSEIYRMEEVTGTKGRQSKGDINTR NNGERASCESDVDDDDKEQKLISEEDLN Let´s submit the query sequence at http://www.ncbi.nlm.nih.gov/BLAST/
We paste our sequence here and launch the search substitution matrix