437 Views

Download Presentation
##### Pairwise sequence alignment

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Pairwise sequence alignment**Based on presentation by Irit Gat-Viks, which is based on presentation by Amir Mitchel, Introduction to bioinformatics course, Bioinformatics unit, Tel Aviv University. and of Benny shomer, Bar-Ilan university**Where we are in the course?**• Ways to interrogate biobanks: • By identifier-based search (GenBank etc.) • By genome location (genome browsers) • By mining annotation files with scrips • Now: searching by sequence similarity**What is it good for?**• Function inference – if we know something about A and A is similar to B, we can say something about B – guilt by association • Conservation arguments – if we know that A and B do something similar, by looking at the conserved segments we can infer which parts of A and B are important for their function • Looking for repeats etc. • Identifying the position of an mRNA/any transcript in the genome • Resequencing • Etc.**Issues with sequence similarity**• Things we’re after • A score: how well do two sequences fit? • Statistics: is this score significant or expected at random? • Regions: which parts of the query and the target sequence are actually similar/different? • Next time • How to efficiently search a large sequence database**Start from simple: Dot plots**• The most intuitive method to compare two sequences. • Each dot represents a identity of two characters. • No real score/significance, but very easy to assess visually**To Reduce Random Noise in Dot Matrix**• Specify a window size, w • Take w residues from each of the two sequences • Among the w pairs of residues, count how many pairs are matches • Specify a stringency**Protein Sequences**single residue identity 6 out of 23 identical**ABCDEFGEFGHIJKLMNO**tandem duplication compared to no duplication tandem duplication compared to self**What Is This?**5’ GGCGG 3’ Palindrome (Intrastrand)**Identifies low complexity/repeat regions**Compare a sequence with itself…**Dotlet example**• http://myhits.isb-sib.ch/cgi-bin/dotlet • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=672 • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=353120**Definition**Alignment: A matching of two sequences. A good alignment will match many identical (similar) characters in the two sequences VLSPAD-TNVK-AWAKVGAHAAGHG ||| | | |||| | |||| VLSEAEWQ-VLHVWAKVEA--AGHG**How similar are two sequences?**• The common measure of sequence similarity is their alignment score • Simpler measures, e.g., % identity are also common • These require algorithm that compute the optimal alignment between sequences**How to present the alignment?**• | - character-wise identity • : - very similar amino acids • . – less similar amino acids • - gap in out of the sequences**Pairwise Alignment - Scoring**• The final score of the alignment is the sum of the positive scores and penalty scores: + Number of Identities + Number if Similarities - Number of Dissimilarities - Number of Gap openings - Number of Gap extensions Alignment score**Comparison methods**• Global alignment – Finds the best alignment across the whole two sequences. • Local alignment – Finds regions of similarity in parts of the sequences.GlobalLocal _____ _______ __ ____ __ ____ ____ __ ____**Global Alignment**• Algorithm of Needleman and Wunsch (1970) • Finds the alignment of two complete sequences:ADLGAVFALCDRYFQ |||| |||| | ADLGRTQN-CDRYYQ • Semi-global alignment allows “free ends” GFHKKKADLGAVFALCDRYFQ |||| |||| | ADLGRTQN-CDRYYQJKLLKJ**Local Alignment**• Algorithm of Smith and Waterman (1981) • Makes an optimal alignment of the best segment of similarity between two sequences. ADLG CDRYFQ |||| |||| | ADLG CDRYYQ • Can return a number of well aligned segments.**Finding an optimal alignment**• Pairwise alignment algorithms identify the highest scoring alignment from all possible alignments. • Different scoring systems can produce (very) different best alignments!!! • Unfortunately the number of possible alignments if pretty huge • Dynamic programming to the rescue**Intuition of Dynamic Programming**• Lets say we want to align XYZ and ABC • If we already computed the optimal way to: • Align XY and AB – Opt1 • Align XY and ABC – Opt2 • Align XYZ and AB – Opt3 • We now need to test three possible alignments Opt1Z or Opt2Z or Opt3- • Opt1COpt2-Opt3C • (where “-” indicates a gap). • Thus, if we construct small alignments first, we are able to extend then by testing only 3 scenarios.**Formally: solving global alignment**Global Alignment Problem: Input: Two sequences S=s1…sn, T=t1….tm (n~m) Goal: Find an optimal alignment according to the alignment quality (or scoring). Notation: • Let (a,b) be the score (weight) of the alignment of character a with character b. • Let V(i,j) be the optimal score of the alignment of S’=s1…si and T’=t1…tj (0 i n, 0 j m)**V(i,j) := optimal score of the alignment**of S’=s1…si and T’=t1…tj (0 i n, 0 j m) V(k,l) is computed as follows: • Base conditions: • V(i,0) = k=0..i(sk,-) • V(0,j) = k=0..j(-,tk) • Recurrence relation: V(i-1,j-1) + (si,tj) 1in, 1jm: V(i,j) = max V(i-1,j) + (si,-) V(i,j-1) + (-,tj) Alignment with 0 elements spacing S’=s1...si-1 with T’=t1...tj-1 si with tj. S’=s1...si with T’=t1...tj-1 and ‘-’ with tj.**Costs: match 2, mismatch/indel -1**Snapshot of computing the table Optimal Alignment - Tabular Computation • Use dynamic programming to compute V(i,j) for all possible i,j values: for i=1 to n do begin For j=1 to m do begin Calculate V(i,j) using V(i-1,j-1), V(i,j-1), V(i-1,j) end end**Optimal Alignment - Tabular Computation**• Add back pointer(s) from cell (i,j) to father cell(s) realizing V(i,j). • Trace back the pointers from (m,n) to (0,0) • Needleman-Wunsch, ‘70 Backtracking the alignment**Solving Local Alignment**• Algorithm of Smith and Waterman (1981). • V(i,j) : the value of optimal local alignment between S[1..i] and T[1..j] • Assume the weights fulfill the following condition: (x,y) = 0 if x,y match 0 o/w (mismatch or indel)**Algorithm - Recursive Definition**Base Condition: i,j V(i,0) = 0, V(0,j) = 0 Recursion Step: i>0, j>0 0, V(i,j) = max V(i-1, j-1) + (si, tj), V(i, j-1) + (-, tj), V(i-1, j) + (si, -) Compute i*, j* s.t. V(i*, j*) = max1i n, 1 j mV(i,j) Computing Local Alignment (2) • A scheme of the algorithm: • Find maximum similarity between suffixes of S’=s1...si and T’=t1...tj • Discard the prefixes S’=s1...si, and T’=t1...tj whose similarity is 0 (and therefore decrease the overall similarity) • Find the indices i*, j* of S and T respectively after which the similarity only decreases.**As usual the pointers are created while filling the values**in the table, • The alignments are found by tracking the pointers from cell (i*, j*) until reaching an entry (i’, j’) that has value 0.**Computational complexity**• Computing the table requires O(n2) operations for both global and local alignment • Saving the pointers for traceback - O(n2) • But – what if we are only interested in the optimal alignment score? • Only need to remember the last row – O(n) space**Outline**• We now figured out • What an alignment is • What alignment score consists of • How to ± efficiently compute an optimal alignment • Still left to figure out • Where do we obtain good σ(i,j) values • When do we use global/local alignment • How to use alignment to search large databases**Scoring amino acid similarity**• Identity: Count the number of identical matches, divide by length of aligned region. The homology rule: above 25% for amino acids, above 75% for nucleotides. • Similarity: A less well defined measure • A problematic idea: Give positive score for aligning amino acids from the same group Can we find a better definition for similarity?**Scoring System based on evolution**• Some substitutions are more frequent than other substitutions • Chemically similar amino acids can be replaced without severely effecting the protein’s function and structure • Orthologous proteins: proteins derived from the same common ancestor • By comparing reasonably close orthologous proteins we can compute the relative frequencies of different amino acid changes • Amino acid substitution matrices: Families of matrices that list the probability of change from one amino acid to another during evolution (i.e., defining identity and similarity relationships between amino acids). • The two most popular matrices are the PAM and the BLOSUM matrix**PAM matrix**• PAM units measure evolutionary distance. • 1 PAM unit indicates the probability of 1 point mutation per 100 residues. • Multiplying PAM1 by itself gives higher PAMs matrices that are suitable for larger evolutionary distance. • JTT matrices are a newer generation of PAMs**Log Odds matrices**• The score might arise from bias in amino acid frequency -> We use the log odds of the PAM matrix. (120 PAM)**Rules of thumb**• The most widely used PAM250 is good for about 20% identity between the proteins • 40% --> PAM120 • 50% --> PAM80 • 60% --> PAM60**PAM vs. BLUSOM**• Choosing n • Different BLOSUM matrices are derived from blocks with different identity percentage. (e.g., blosum62 is derived from an alignment of sequences that share at least 62% identity.) Larger n smaller evolutionary distance. • Single PAM was constructed from at least 85% identity dataset. Different PAM matrices were computationally derived from it. Larger n larger evolutionary distance • Blosum matrices are newer (based on more sequences) 62 120 250**Gap parameters**• Observation: Alignments can differ significantly when using different gap parameters. • Assumption: For each matrix there are constant default parameters that produce optimum alignments. • Each matrix was checked with different parameters until a “true” alignment was reached. • Where can we obtain “true” alignments? • We can use sequence alignments based on structural alignments. • The structural alignments are “true” for our purpose.**Match**DNA scoring matrices • Uniform substitutions in all nucleotides: Mismatch**DNA scoring matrices**• The bases are divided to two groups, purines (A,G) and pyrmidines (C,T) • Mutations are divided into transitions and transversions. • Transitions – purine to purine or pyrmidine to pyrmidine (4 possibilities) . • Transversions – purine to pyrmidine or pyrmidine to purine. (8 possibilities). • By chance alone transversions should occur twice than transition. • De-facto transitions are more frequent than transversions • Bottom line: Meaningful DNA substitution matrices can be defined**Mismatch**transversion Mismatch transition Match DNA scoring matrices • Non-uniform substitutions in all nucleotides:**Evaluation**• Remember that one of our goals was to estimate significance of the scores • How can we estimate the significance of the alignment? • Which alignment is better ? ? Does the score arise from order or from composition ?**Evaluation - bootstrap approach**• Data with same composition but different order: • Shuffle one of the sequences. • Re-align and score. • Repeat numerous times. • Calculate the mean and standard deviation of shuffled alignments scores.**Evaluation - bootstrap approach**• Data with the same composition but with a different order: Shuffle one of the sequences Compare alignment score with mean of shuffled alignments Align with the second sequence Calculate mean and standard deviation of shuffled alignments**Evaluation**• We can compare the score of the original alignment with the average score of the shuffled alignments. • Thumb rule:If:original alignment >>average score + 6*SDThen:the alignment is statistically significant.