1 / 67

Pairwise Sequence Alignments

Pairwise Sequence Alignments. Bioinformatics. Some Bioinformatics Programming Terminology. Model. A model is a set of propositions or equations describing in simplified form some aspects of experience.

odessa
Download Presentation

Pairwise Sequence Alignments

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Pairwise Sequence Alignments Bioinformatics

  2. Some Bioinformatics Programming Terminology

  3. Model • A model is a set of propositions or equations describing in simplified form some aspects of experience. • A valid model includes all essential elements and their interactions of the concept or system it describes.

  4. Algorithm • An algorithm is a complete, unambiguous procedure for solving a specified problem in a finite number of steps. • Algorithms leave nothing undefined and require no intuition to achieve their end.

  5. Five Features of an Algorithm: • An algorithm must stop after a finite number of steps. • All steps of the algorithm must be precisely defined. • Input to the algorithm must be specified. • Output of the algorithm must be specified. There must be at least one output. • An algorithm must be effective - i.e. its operations must be basic and doable.

  6. Data Structures: Foundation of an Algorithm • One of the most important choices of writing a program. • For the same operation, different data structures can lead to vastly more or less efficient algorithms. • The design of data structures and algorithms goes hand in hand. • Once the data structure is well defined, usually the algorithm can be simple.

  7. Data Structure Primitives: • Strings • Arrays

  8. A String Is a Linear Sequence of Characters. • This implies several important properties: • Finite strings have beginnings and ends. Thus they also have a length. • Strings imply an alphabet. • The elements of a strings are ordered. • A string is a one dimensional array.

  9. Two Dimensional Arrays

  10. Pairwise Sequence Alignment is Fundamental to Bioinformatics • It is used to decide if two proteins (or genes) are related structurally or functionally • Two Dimensional Arrays are the basis of Pairwise Alignments • It is used to identify domains or motifs that are shared between proteins • It is the basis of BLAST searching • It is used in the analysis of genomes

  11. Are there other sequences like this one? • Huge public databases - GenBank, Swissprot, etc. • Sequence comparison is the most powerful and reliable method to determine evolutionary relationships between genes • Similarity searching is based on alignment between two strings in a 2-D array

  12. Why Search for Similarity? 1. I have just sequenced something. What is known about the thing I sequenced? 2. I have a unique sequence. Is there similarity to another gene that has a known function? 3. I found a new protein in a lower organism. Is it similar to a protein from another species? 4. I have decided to work on a new gene. The people in the field will not give me the plasmid. I need the complete cDNA sequence to perform RT-PCR of some other experiment.

  13. Definitions • Similarity: The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation. • Identity: The extent to which two sequences are invariant. • Conservation: Changes at a specific position of an amino acid or (less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue. RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84 + K ++ + + + GTW++ MA + L + AVT + +L+ W+ glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEI V LHRWEN 81

  14. Definitions • Identical- When a corresponding character is shared between two species or populations, that character is said to be identical. • Similar - The degree to which two species or populations share identities. • Homologous - When characters are similar due to common ancestry, they are homologous.

  15. Evolution and Alignment • Homology - two (or more) sequences have a common ancestor. • This is a statement about evolutionary history. • Similarity - two sequences are similar, by some criterion. • It does not refer to any historical process, just to a comparison of the sequences by some method. • It is a logically weaker statement.

  16. Caution • In bioinformatics these two terms are often confused and used interchangeably. • The reason is probably that significant similarity is such a strong argument for homology.

  17. Similarity ≠ Homology 1) 25% similarity ≥ 100 AAs is strong evidence for homology 2) Since homology is an evolutionary statement, there should be additional evidence which indicates “descent from a common ancestor” • common 3D structure • usually common function 3) Homology is all or nothing. You cannot say "50% homologous"

  18. retinol-binding protein (NP_006735) b-lactoglobulin (P02754) Page 42

  19. Definitions, Con’t. • Analogous - Characters are similar due to convergent evolution. • Orthologous - Homologous sequences (or characters) between different species that descended from a common ancestral gene during speciation; They may or may not be responsible for a similar function. • Paralogous - Homologous sequences within a single species that arose by gene duplication. • Homology is therefore NOT synonymous with similarity. • Homology is a judgment, similarity is a measurement.

  20. Proteins or Genes Related by Evolution Share a Common Ancestor • Random mutations in the sequences accumulate over time, so that proteins or genes that have a common ancestor far back in time are not as similar as proteins or genes that diverged from each other more recently. • Analysis of evolutionary relationships between protein or gene sequences depends critically on sequence alignments.

  21. Function is Conserved • Alignments can reveal which parts of the sequences are likely to be important for the function, if the proteins are involved in similar processes. • In parts of the sequence of a protein which are not very critical for its function, random mutations can easily accumulate. • In parts of the sequence that are critical for the function of the protein, hardly any mutations will be accepted; nearly all changes in such regions will destroy the function.

  22. Sequence Alignments • Comparing sequences provides information as to which genes have the same function • Sequences are compared by aligning them – sliding them along each other to find the most matches with a few gaps • An alignment can be scored – count matches, and can penalize mismatches and gaps • It is much easier to align proteins. Why?

  23. Why Search with Protein, not DNA Sequences? 1) 4 DNA bases vs. 20 amino acids - less chance similarity 2) can have varying degrees of similarity between different AAs - # of mutations, chemical similarity, PAM matrix 3) protein databanks are much smaller than DNA databanks

  24. Similarity is Based on Dot Plots 1) two sequences on vertical and horizontal axes of graph 2) put dots wherever there is a match 3) diagonal line is region of identity (local alignment) 4) apply a window filter - look at a group of bases, must meet % identity to get a dot

  25. Definition • Pairwise alignment: • The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.

  26. Dot Plots A Simple Way to Measure Similarity

  27. Simple Dot Plot

  28. Dot plot filtered with 4 base window and 75% identity

  29. Dot matrix provides visual picture of alignment • It is used to easily spot segments of good sequence similarity. • The two sequences are placed on each side of 2-dimensional matrix, and each cell in the matrix is then filled with a value for how well a short window of the sequences match at that point.

  30. Simple Dot Plot

  31. A Limitation to Dot Matrix Comparison • Where part of one sequence shares a long stretch of similarity with the other sequence, a diagonal of dots will be evident in the matrix. • However, when single bases are compared at each position, most of the dots in the matrix will be due to background similarity. • That is, for any two nucleotides compared between the two sequences, there is a 1 in 4 chance of a match, assuming equal frequencies of A,G,C and T.

  32. A Solution • This background noise can be filtered out by comparing groups of l nucleotides, rather than single nucleotides, at each position. • For example, if we compare dinucleotides (l = 2), the probability of two dinucleotides chosen at random from each sequence matching is 1/16, rather than 1/4. • Therefore, the number of background matches will be lower:

  33. A Filtered Dot Plot

  34. The Dot Matrix Algorithm • The dot-matrix algorithm can be generalized for sequences s and t of sizes m and n, respectively, and window size l. • For each position in sequence s, compare a window of l nucleotides centered at that position with each window of l nucleotides in sequence t. • Conceptually, you can think of windows of length l sliding along each axis, so that all possible windows of l nucleotides are compared between the two sequences.

  35. The dot-matrix algorithm can be generalized for sequences s and t of sizes m and n, respectively, and window size l. • For each position in sequence s, compare a window of l nucleotides centered at that position with each window of l nucleotides in sequence t. • Conceptually, you can think of windows of length l sliding along each axis, so that all possible windows of l nucleotides are compared between the two sequences.

  36. Dot Matrix Sequence Comparison Examples

  37. Examples • These examples comes from the webpage www.bioinformaticsonline.org • This has a nice discussion of results in another package, DNA Strider. • I used COMPARE program in SeqWeb. • I used the BLOSUM 62 scoring matrix.

  38. Comparing a Protein with Itself • Proteins can be compared with themselves to show internal duplications or repeating sequences. • A self-matrix produces a central diagonal line through the origin, indicating an exact match between the x and y axes. • The parallel diagonals that appear off the central line are indicative of repeated sequence elements in different locations of the same protein.

  39. Haptoglobin • Haptoglobin is a protein that is secreted into the blood by the liver. This protein binds free hemoglobin. • The concentration of "free" hemoglobin (that is, outside red blood cells) in plasma (the fluid portion of blood) is ordinarily very low. • However, free hemoglobin is released when red blood cells hemolyze for any reason. • After haptoglobin binds hemoglobin, it is taken up by the liver. • The liver recycles the iron, heme, and amino acids contained in the hemoglobin protein.

  40. Our Comparison • Files used • 1006264A Haptoglobin H2 • DNA sequencing shows that the intragenic duplication within the human haptoglobin Hp2 allele was formed by a non-homologous, probably random, crossing-over within different introns of two Hp1 genes. • A repeated sequence (starting with ADDGCP...) is observed beginning at positions 30-90 and 90-150 - probably due to a duplication event in one of these locations.

  41. Window: 30 Stringency: 3 Blosum 62 matrix

  42. One of the strengths of dot-matrix searches is that they make repeats easy to detect by comparing a sequence against itself. • In self comparisons, direct repeats appear as diagonals parallel to the main line of identity.

  43. Comparison of Two Similar Sequences

  44. Our Comparison • Files Used: • P03035 • Repressor protein from E. coli Phage p22 • RPBPL • Repressor protein from E. coli phage Lambda • Lambda phages infect E. coli. They can be lytic and destroys the host cell, making hundreds of progeny. • They can also be lysogenic, and live quietly within the DNA of the bacteria. • A gene makes the repressor protein that prevents the phage from going destructively lytic. • Phage p22 is a related phage that also makes a repressor. • Both proteins form a dimer and bind DNA to prevent lysis.

  45. Dot Matrix Sequence Comparison • A row of dots represents a region of sequence similarity. • Background matching also appears as scattered dots. • There is a decrease in background noise as window and stringency parameters increase.

  46. Window: 10 Stringency: 1 Blosum 62 matrix

More Related