Sequence alignments and sequence database searching Igor Kuznetsov Bioinformatics workshop Part I

Sequence alignments and sequence database searching Igor Kuznetsov Bioinformatics workshop Part I Sponsored by Kansas NSF EPSCoR and K-BRIN E-mail: igor@ku.edu Office: 1002 Haworth

Pair-wise sequence alignment

What is pair-wise sequence alignment? • Given two sequences of characters and a scoring scheme for comparing individual characters, it is a one-to-one mapping of equivalent characters between the two sequences that preserves the order of characters in both sequences: Sequence 1: THISISASTRING Sequence 2: THISISANOTHERLONGERSTRING THISISA----------------------STRING | | | | | | | | | | | | | THISISANOTHERLONGERSTRING

helix 1 turn helix 2 Here similar amino acid sequences perform the same function and have the same structure: helix-turn-helix motif. Biological sequences consist of characters that encode biological function. Proteins: 20 characters (A,G,E,C,W,S,D,F,H,I,K,L,P,Q,M,N,V,Y,R,T) CAP protein: Unknown: NLRHRETSTLSGVTRETVTRTLGKLEKKGL Known: GTPHRELSSI SGLARETVTRCLTRHFALCF

Alignment is THE tool of bioinformatics. It is used to quantify and to visualize sequence similarity. It is assumed that: • “Good” alignment = Similar sequences • Similar sequences are likely to have similar function and/or structure. • Can use knowledge-based approach – if one sequence has known structure/function, sequence alignment can be used to “map” this knowledge onto other, similar sequences.

The total number of possible alignments is huge. • We need an objective function (scoring system) to identify the best alignment (called an optimal alignment). How to tell that alignment is “good”?

Three possible variants for each position in a sequence alignment : 1. Exact match – a pair of identical characters are aligned. 2. Inexact match (substitution) – two different characters are aligned. 3. Gap (insertion/deletion) – a character is missing in one sequence. gap of length 1 exact match ACT -ACGGATAG ATTTAGGG - -AG substitution gap of length 2 Total: 2 substitutions, 7 exact matches, 2 gaps

S(a,b) X: ATGCTGGGAY: ATCCT -- GA g A general scoring system for sequence alignment consists of two parts: • A score for aligning a pair of characters (a,b): S(a,b)=S(b,a) • A penalty for each gap as a function of its length, k: g=f(k) Then the total pair-wise alignment score between two sequences X and Y is: For the highlighted position: a=G, b=C and S(G,C)=S(C,G)

Amino acid scoring matrix, s(a,b) • A symmetric 20x20 matrix (we have 210 possible residue pairs) • If residues a and b are similar, s(a,b) > 0 • If residues a and b are dissimilar, s(a,b) < 0

Gap penalty function • It is gaps that make alignment computationally complex. • Structural meaning of gaps in proteins – a non-essential part of the sequence is lost or a new part is added. If alignment is wrong, incorrect part will be assumed as inserted/deleted: Seq 1: correct Seq 2: protein 1 protein 2 Seq 1: incorrect Seq 2:

Matrix for s(a,b) sequence 1 sequence 2 An example: scoring an alignment of two DNA sequences using a similarity matrix We will use a linear affine gap penalty: g(k)=+*(k-1)

The meaning of the alignment score When we perform a summation over all positions of the alignment of sequences X and Y, the total score S(X,Y) gives us an estimate of the likelihood that the alignment represents similarity compared to aligning X and Y at random: S(X,Y) > 0 – more likely than random (many similar amino acids are aligned) S(X,Y) < 0 – less likely than random (many dissimilar amino acids are aligned) S(X,Y) = 0 – random alignment

Dot Plot – the simplest way to visualize pair-wise alignment invertedrepeat AT - TA palindrome ATTA Put a dot in the cell (i,j) if character in row i is the same as in column j, A(i) = A(j).

On-line Java-based tool for computing Dot Plots • http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

Meaningless alignment Global alignment Optimal sequence alignments We can compute two types of optimal alignments: • GLOBAL – aligns entire sequence 1 against entire sequence 2 • LOCAL – aligns region of sequence 1 against region of sequence 2 Local alignment Very often local alignment is a better choice, since it finds only similar subsequences and does not attempt to align totally unrelated segments.

Optimal local alignment CCCCCCCCCCCCATATATATACCCCCCCCCCCC vs. GGGGATATTTATAGGGGGGGGGGG ATATATATA ATATTTATA We want only this part

Sequence 1 Sequence 2 2nd best 3rd best Optimal (best) Sub-optimal local alignments • In order to find all biologically meaningful similarities between two sequences we need to find all high scoring local alignments, not just the optimal one. • Using a technique called de-clumping local alignment programs can compute any given number of sub-optimal alignments (2nd best, 3rd best, etc):

This gap is penalized GCGCGCCGCGCGATAT   ATATAGCGCGCCCGCGC ATATTTTATATA             These gaps are not penalized Global alignment with zero end gap penalties • Used to align short sequence with long sequence: GCGCGCCGCGCGATATATATAGCGCGCCCGCGC vs. ATATTTTATATA

Basic notation • Percent identity = #identical pairs/alignment length • Percent similarity = #similar pairs/alignment length(similar pairs are those that receive a positive score) • Alignment score = positions + gaps Alignment length = 36 Alignment score = 19 Percent identity = 16.7% Percent similarity = 52.8%

Example: global vs. local Optimal global alignment obtained using BLOSUM50, A=-12, B=-2 Optimal local alignment obtained usingBLOSUM50, A=-12, B=-2

Global vs. Local: • Use global alignmentif • You expect, based on some biological information, that your sequences will match over the entire length. • Your sequences are of similar length. • Use local alignment if • You expect that only certain parts of two sequences will match (as in the case of conserved segment that can be found in many different proteins). • Your sequences are very different in length. • You want to search a sequence database (we will talk about it in details later). • NOTE: local alignment works only with similarity matrices because the total score must be able to take +, - and 0 values.

Selecting the right protein scoring matrix • PAM matrices: • PAM250 is suitable for highly diverged sequences (>30% identity) and is the best choice for aligning unknown proteins. • PAM160 is suitable for sequences which are 50-60% identical. • PAM40 is suitable for very similar proteins (70-90% identity) • BLOSUM matrices: • BLOSUM50-62 is suitable for highly diverged sequences (>30% identity) and is the best choice for aligning unknown proteins. • BLOSUM80 is suitable for 50% identity level. • BLOSUM90 is suitable for very similar proteins. • PAM250  BLOSUM45, PAM160  BLOSUM62PAM120  BLOSUM80 • Note that an increase in numbering in PAM corresponds to a decrease in BLOSUM numbering and vice versa.

Low gap penalties -> artificially high percent of identity betweensequences Lowering the gap penalties results in a different optimal alignment with higher % identity.

Low complexity regions • Real biological sequences have many regions where one or a few characters are over-represented (so-called low complexity regions): ATGGPTIVLLVAAAAAAAAAAGPTPGLILW | | | | | | | | | EVVIKPSMCDHAAAATAAAAALCMKFC • Such regions will bias the alignment because they tend to align with each other. Even absolutely unrelated sequences will have regions of false “similarity”. • Mask these regions using PSEG for proteins, NSEG or DUST for DNA.

Masking low complexity regions in proteins (PSEG program)

General suggestions for doing pairwise alignment • Always translate protein-coding DNA sequences – protein sequences contain more information.

General suggestions for doing pairwise alignment 2. Make a dotplot of each sequence with itself • Look for low complexity regions:

General suggestions for doing pairwise alignment 2. Make a dotplot of each sequence with itself • Look for repeats:

General suggestions for doing pairwise alignment • Translate protein-coding DNA sequences – protein sequences contain much more information. • Make a pairwise dotplot and a dotplot of each sequence with itself. • Watch out for low complexity regions and repeats. • Mask low complexity regions, remove repeats. • Do a local alignment first to see whether your sequences are similar over their entire lengths. • If they are or if you are absolutely positive that despite the lack of apparent sequence similarity your sequences can be aligned globally, use global alignment. • ALWAYS perform a visual inspection.

General suggestions for doing pairwise alignment • Use the default gap penalties supplied by the program first. The choice of gap penalties depends on the matrix. • Some matrices are good for distantly related sequences, some are good for closely related sequences. • Do alignments with different matrices and gap penalties. Regions that produce consistent alignments usually can be trusted. • Usually, the number of optimal alignments > 1. • The optimal alignment found by alignment program is not necessarily the correct one from the biological point of view. • One of the nearly-optimal alignments may be what you need. You may need to adjust the optimal alignment manually according to your biological “hunch”.

Pairwise alignment programs • Stand-alone programs from FASTA package http://fasta.bioch.virginia.edu/ • ALIGN – global alignment (affine gap penalties). • ALIGN0 – global alignment with 0 end gap penalties (affine gap penalties). • LALIGN – n best scoring local alignments (affine gap penalties). • On-line alignment programs available from KU Bioinformatics: http://jay.bioinformatics.ku.edu/EMBOSS/index.html • NEEDLE (global alignment with affine gap penalties) • WATER (local alignment with affine gap penalties) • DotPlot • All possible types of alignments on one web-server at USC: • http://www-hto.usc.edu/software/seqaln/seqaln-query.html

Application of DP to search sequence databases.

Objective of sequence database searches: • Given a DNA or protein sequence, find all sequences in the database that are similar to this sequence. • Main problem is how to pick all similar sequences and filter out all dissimilar sequences. Final result always depends on the definition of similarity.

Some terminology used in sequence searching • Query sequence – the sequence of interest which is compared to the database sequences. • Significance threshold – the critical value of the variable (alignment score, probability, etc) used to assess the alignment between query sequence and a database sequence and to draw conclusions about similarity between them. • Hit – an alignment between query sequence and database sequence that scores above the significance threshold.

hit X X X X hit hit X X X SEARCHING SEQUENCE DATABASES • OBJECTIVE: given a query sequence, find all database sequences that are significantly similar to the query sequence. • IMPLEMENTATION: • Align query sequence to each database sequence using local alignment (N best alignments in total). • Keep only those alignments that score higher than certain significance threshold.

E-value • The significance of observed similarity between query and database sequence is assessed using the E-value. • E-valueisa normalizedstatistic that givesthe number of matches with a score equal to or greater than the observed score, x, that are expected to occur by chance when searching a database of given size N. • The lower the E-value, the higher the chance that similarity between query and database sequence is significant. • However, similarities we deem “significant” depend on an arbitrary choice of E-value.

The twilight zone • When percent identity between two aligned sequences drops below 25-30%, the estimates of statistical significance fail to distinguish between related and unrelated sequences. In other words, true similarity cannot be distinguished from random match. • This sequence identity threshold is usually referred to as the twilight zone of pairwise sequence alignment. This is the green area of the overlap we saw previously. • Proteins with similar structure/function that have pairwise sequence identity below 25-30% can score lower than structurally and functionally dissimilar proteins.

Mathematically Rigorous Search: Smith-Waterman Method • SSEARCH – a stand-alone program from FASTA package http://fasta.bioch.virginia.edu/ • On-line alignment available from EMBL-EBI: http://www.ebi.ac.uk/MPsrch/ • SW search with column-cost and affine gap penalties • SW is the most rigorous method to do a database search using a single query sequence. It is also the slowest one. • Heuristic search methods are much faster: BLAST – Basic Local Alignment Search Tool.

W=3 W=3 Query sequence A Database sequence hit 1 hit 2 The main idea of BLAST Nucleotide BLAST looks for one exact match of length W (W=11 by default) ATCGCCATGCTTAATTGGGCTT CATGCTTAATT exact match Protein BLAST requires two neighboring hits of length W (W=3 by default)

An alignment that BLAST can’t find using default parametersbecause there is no hit of length 11 1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG || | || || || | || || || || | ||| |||||| | | || | ||| | 1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT | || || || ||| || | |||||| || | |||||| ||||| | | 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT 121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC |||| || ||||| || || | | |||| || ||| 121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Significance of BLAST hits • Statistically significant does not always imply “biologically meaningful”. • No strict rule how to choose E-value. Always a trade-off between sensitivity and specificity. • E-value < 10-6 – almost surely homologous. Will miss remote homologues. • Between 10-2 and 10-6 – probably homologous. • Between 10-2 and 10 – might or might not be interesting… • In any case, use your own judgments. • Pairwise sequence comparison cannot detect remote homologues reliably when sequence identity drops below 25-30%. Need more sophisticated approaches.

Protein databases for BLAST

DNA databases for BLAST

Sequence alignments and sequence database searching Igor Kuznetsov Bioinformatics workshop Part I