
Topic 1 Outline

Searching for Similarity in Sequences Gary Benson Departments of Computer Science and Biology Boston University. Topic 1 Outline. Similarity and Alignment Define homology, similarity by descent and similarity by convergence Common mutations and their mathematical models Alignments


Presentation Transcript


  1. Searching for Similarity in Sequences. Gary Benson, Departments of Computer Science and Biology, Boston University

  2. Topic 1 Outline Similarity and Alignment • Define homology, similarity by descent and similarity by convergence • Common mutations and their mathematical models • Alignments • Scoring Alignments • Gap penalty functions • Computing the best scoring alignment – the Longest Common Subsequence (LCS) problem

  3. Similarity and Biomolecules Similarity is expected among biomolecules that are descended from a common ancestor. Mutations cause differences, but survival of the organism requires that mutations occur in regions that are less critical to function, while important catalytic, regulatory or structural regions remain similar.

  4. Similarity and Evolution Evolution has duplicated and shuffled bits and pieces of molecules to produce new linear arrangements that combine function in novel ways. Regions of similarity often suggest an evolutionary tie and/or common functional properties between very different molecules.

  5. Three common similarity problems • Start with a query sequence with unknown properties and search within a database of millions of sequences to find those which share similarity with the query. • Start with a small set of sequences and identify similarities and differences among them. • In many sequences or very long sequences, detect commonly occurring patterns.

  6. What is Similarity? How can we measure it?

  7. Morphology Morphology is the form and structure of an organism. Should shared morphology mean similarity?

  8. Hands

  9. Aquatic Shape

  10. Shared morphology Shared morphology does not necessarily imply common ancestry. The animals with hands have all evolved from a common ancestor with a hand. The ichthyosaur, shark and porpoise each evolved adaptations to sea life independently.

  11. Homology When similarity is due to common ancestry, we call it homology.

  12. Modern molecular biology seeks to understand cellular processes through the action of DNA, RNA, and protein molecules. This will ultimately lead to a biochemical understanding of: • The pathogenesis of infectious diseases like AIDS, hepatitis, and SARS. • The mutagenic properties of environmental toxins and how they lead to diseases like cancer. • The etiology of human genetic disease. • Strategies to prevent and treat diseases through drug and vaccine design, gene therapy, risk reduction, etc.

  13. How homology helps Given molecular sequences X and Y: X ~ Y AND INFO(Y) ==> INFO(X) (“ ~ ” means similar)

  14. Are the Sequences Similar?

  15. Are the Sequences Similar • How similar? • What parts are the most similar? Remember, the common ancestor of the two sequences may have existed millions of years ago.

  16. How can we tell if the two sequences are similar? Similarity judgements should be based on: • The types of changes or mutations that occur within sequences. • Characteristics of those different types of mutations. • The frequency of those mutations.

  17. Common mutations in DNA Substitution: A C G T T G A C → A C G A T G A C. Deletion: A C G T T G A C → A C G A C. Insertion: A C G T T G A C → A C G C A A G T T G A C.

  18. Common mutations Duplication: A C G T T G A C → A C G T T G A T T G A C. Inversion (double stranded DNA shown): A C G T T G A C T G C A A C T G A C T C A A C C A C A G T T G G

  19. Frequency of mutations Substitution > Insertion, Deletion >> Duplication > Inversion

  20. Evolutionary history of sequences

  21. Alignments There are many ways to align two sequences. We just saw one way:

      T T A C G T A C A G A T T A
      T - - G G A A C A - - - T A

      Here is another:

      T T A C G T - A C A G A T T A
      T - - - G G A A C - - A T - A

      Which is better? Remember, we cannot choose based on the evolutionary history, because that is unknown.

  22. Alignments and Paths through the Alignment Array

  23. Alignments and Paths through the Alignment Array

      t a c g - c a a - -
      - a c g t g a a t t

  24. Alignments and Paths: An Alternate Alignment

      t - - a c g c a - - a
      a c g t g - - a a t t

  25. Finding the Best Alignment: Ranking Alignments by Score Score an alignment by • Partitioning it into columns • Assigning a weight to each column • Summing the column weights

  26. Distance Scoring Distance scoring: • Alignment gets a non-negative score. • Alignment of identical sequences scores zero, all others > zero. • Best alignment has smallest score. Typical scoring functions are: • d(a,a) = 0; identity • d(a,b) = d(b,a) > 0; a ≠ b; substitution • g = d(a, – ) > 0; indel (gap)

  27. Similarity Scoring Similarity scoring: • Alignment scores may be positive, zero, or negative. • More similar means larger positive score. • The best alignment has largest score. Typical scoring functions are: • s(a,b) is { > 0 if a and b are similar in one or more characteristics or are observed to substitute frequently for each other; ≤ 0 otherwise }; substitution • g = s(a, – ) < 0; indel (gap)

  28. Gap penalty functions • Single character gap penalty g(a, – ) = c (c a constant or a value dependent on a) • Affine (linear) gap penalty g(k) = α + βk for a gap of length k (α is a gap opening penalty, β is a gap extension penalty) • Concave gap penalty g(k) = α + β·m(k), where m(k) is a function like log(k) which grows more slowly as k increases.
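
The three gap penalty shapes can be compared numerically. The following is a minimal sketch, not taken from the slides: the function names and the parameter values (c = 4, α = 5, β = 4, written as positive distance-style penalties) are assumptions chosen only for illustration, and log is used as one possible choice of m(k).

```python
import math

def single_gap(k, c=4.0):
    """Per-character penalty: each of the k gap positions costs c (assumed value)."""
    return c * k

def affine_gap(k, alpha=5.0, beta=4.0):
    """Affine penalty g(k) = alpha + beta * k for a gap of length k >= 1."""
    return alpha + beta * k

def concave_gap(k, alpha=5.0, beta=4.0):
    """Concave penalty g(k) = alpha + beta * m(k), with m(k) = log(k) assumed."""
    return alpha + beta * math.log(k)

# Print the three penalties side by side for a few gap lengths.
for k in (1, 2, 5, 10):
    print(k, single_gap(k), affine_gap(k), round(concave_gap(k), 2))
```

With these assumed values the concave penalty grows much more slowly than the affine one as k increases, which is the property the slide describes.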

  29. Distance Scoring Alignment parameters: d(a, a) = 0; d(a, b) = +2, g = +4

      A - G C C G T A T
      A C G A - - T - T
      0 4 0 2 4 4 0 4 0  = 18

  30. Similarity Scoring Scoring parameters: s(a, a) = +5, s(a, b) = -3, g = -8

       A   -   G   C   C   G   T   A   T
       A   C   G   A   -   -   T   -   T
      +5  -8  +5  -3  -8  -8  +5  -8  +5  = -15
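
The column-by-column scoring used in slides 29 and 30 can be checked with a short script. This is a sketch under stated assumptions: the function name `score_alignment` and the convention of writing gaps as "-" are illustrative choices, and every gap position is charged the same constant penalty.

```python
def score_alignment(top, bottom, pair_score, gap):
    """Sum the column weights of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            total += gap               # indel column
        else:
            total += pair_score(a, b)  # identity or substitution column
    return total

top = "A-GCCGTAT"
bottom = "ACGA--T-T"

# Distance scoring from slide 29: d(a,a) = 0, d(a,b) = 2, g = 4.
distance = score_alignment(top, bottom, lambda a, b: 0 if a == b else 2, gap=4)
# Similarity scoring from slide 30: s(a,a) = 5, s(a,b) = -3, g = -8.
similarity = score_alignment(top, bottom, lambda a, b: 5 if a == b else -3, gap=-8)
print(distance, similarity)  # 18 -15
```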

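  31. Similarity scoring with affine gap Alignment parameters: s(a, a) = +5, s(a, b) = -3, g(k) = α + βk, α = -5, β = -4

       A   -   G   C   C   G   T   A   T
       A   C   G   A   -   -   T   -   T
      +5  -8  +5  -3  -8  -4  +5  -8  +5  = -11

      The column scores charge -8 for the first position of a gap and -4 for each further position, which corresponds to α = -4 in g(k) = α + βk; with α = -5 as stated, the three gaps would cost -9, -13 and -9 for a total of -14.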

  32. Computing the Optimal Alignment: The LCS Problem as Prototype The Longest Common Subsequence (LCS) problem is a method for comparing sequences. Although the solution does not produce an alignment, it illustrates a method of dynamic programming that is very similar to that used by alignment algorithms.

  33. Longest Common Subsequence Problem Let X be a string of characters. A subsequence X’ of X is formed by discarding zero or more letters of X. Note that the letters in X’ maintain their same order as in X. Let X and Y be two strings. A common subsequence Z is a subsequence of both. A longest common subsequence (LCS) is the longest such Z. Examples: for X = a b c d e b a and Y = b e b d c e a c d, Z = b d e a is a common subsequence; for X = a b c d e b a, X’ = a b d b is a subsequence.

  34. LCS Problem Given: Two sequences X and Y. Find: An LCS for X and Y. A divide and conquer solution can be developed by looking at what happens to the last letters in each sequence. That is, are they part of the LCS solution or not?

  35. Possible ways to split the problem

  36. LCS recursion

  37. Filling the dynamic programming array

  38. Filling the dynamic programming array

  39. Necessary values in adjacent cells

  40. Completed LCS array

  41. Tracing back for a solution LCS = bdea

  42. LCS time complexity There are (n + 1)(m + 1) cells in the LCS score array. Each cell is filled by examining 3 other cells in constant time. The time complexity to fill the array is O(nm). Tracing back for an LCS solution takes at most n + m steps. The total time complexity is therefore O(nm).
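
The array fill and traceback described in slides 36 to 42 can be sketched as follows. The recursion images are missing from this transcript, so this is the standard LCS dynamic program rather than a copy of the slides; the names `lcs_table` and `lcs_traceback` are hypothetical, and ties in the traceback are broken arbitrarily, so the string returned may be a different LCS of the same length than the one shown on slide 41.

```python
def lcs_table(x, y):
    """Fill the (n+1) x (m+1) array; L[i][j] is the LCS length of x[:i] and y[:j]."""
    n, m = len(x), len(y)
    L = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 and column 0 stay 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if x[i - 1] == y[j - 1] else 0
            # Each cell looks at three neighbours: up, left, and diagonal.
            L[i][j] = max(L[i - 1][j], L[i][j - 1], L[i - 1][j - 1] + match)
    return L

def lcs_traceback(x, y, L):
    """Walk back from L[n][m]; takes at most n + m steps."""
    i, j = len(x), len(y)
    out = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])   # matching letters are part of the LCS
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1                 # drop the last letter of x
        else:
            j -= 1                 # drop the last letter of y
    return "".join(reversed(out))

x, y = "abcdeba", "bebdceacd"      # the strings from slide 33
table = lcs_table(x, y)
z = lcs_traceback(x, y, table)
print(table[len(x)][len(y)], z)    # the LCS length here is 4
```

The two nested loops visit each of the (n + 1)(m + 1) cells once with constant work per cell, which is where the O(nm) bound comes from.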

  43. Topic 2 Outline Types of Alignment Substitution Matrices • Global vs Local Alignment • Recursions for Global, time complexity • Global alignment with affine gap penalty, time complexity • Similarity scoring and local alignment • Recursion for local, time complexity • Finding suboptimal local alignments: declumping • Substitution Matrices

  44. Global vs Local Alignment Given two strings, X and Y: • global alignment produces an alignment that contains all of X and all of Y. • local alignment produces an alignment that contains only the best matching substrings, one from X and one from Y.

  45. Global vs Local Alignment Global alignment is useful when • The sequences are known to be related throughout their length, for example, similar protein sequences from close species. Local alignment is useful when • The sequences are believed to contain parts that are closely related.

  46. Global Alignment Problem Given: two sequences X and Y and alignment scoring functions, Find: the best scoring alignment that includes all of X and all of Y. Solution: Dynamic Programming

  47. Global Alignment Analysis of global alignment is similar to the LCS. Alignments can end in one of three ways. In terms of the prefix strings x1…xi and y1…yj, we have: 1. xi and yj are aligned with each other. (Here it makes no difference whether xi and yj are the same.) G[i,j] = G[i – 1, j – 1] + s(xi, yj) X: C G T Y: C G C

  48. Global Alignment 2. xi is deleted (aligned against a dash in Y). G[i, j] = G[i – 1, j] + g X: C A T Y: C A - 3. yj is inserted (aligned against a dash in X). G[i, j] = G[i, j – 1] + g X: C A - Y: C A A

  49. Global alignment recursion (similarity scoring) G[i, j] = max { G[i – 1, j – 1] + s(xi, yj), G[i – 1, j] + g, G[i, j – 1] + g }

  50. Global alignment example match = +2, mismatch = - 3, gap = - 4
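
The three cases from slides 47 and 48 combine into a dynamic program very similar to the LCS one. Below is a minimal sketch using the slide 50 parameters (match = +2, mismatch = -3, gap = -4). The function name `global_align`, the boundary initialisation G[i][0] = i·g and G[0][j] = j·g, and the tie-breaking order in the traceback are assumptions; the sequences used on slide 50 are not in the transcript, so the short fragments from slides 47 and 48 are used instead.

```python
def global_align(x, y, match=2, mismatch=-3, gap=-4):
    """Return (score, aligned_x, aligned_y) for a best global alignment."""
    n, m = len(x), len(y)
    G = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        G[i][0] = i * gap              # prefix of x aligned against dashes
    for j in range(1, m + 1):
        G[0][j] = j * gap              # prefix of y aligned against dashes
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            G[i][j] = max(G[i - 1][j - 1] + s,   # xi aligned with yj
                          G[i - 1][j] + gap,     # xi against a dash
                          G[i][j - 1] + gap)     # yj against a dash
    # Trace back from G[n][m] to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if G[i][j] == G[i - 1][j - 1] + s:
                top.append(x[i - 1]); bottom.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and G[i][j] == G[i - 1][j] + gap:
            top.append(x[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])
            j -= 1
    return G[n][m], "".join(reversed(top)), "".join(reversed(bottom))

print(global_align("CGT", "CGC"))  # (1, 'CGT', 'CGC')
print(global_align("CAT", "CA"))   # (0, 'CAT', 'CA-')
```

As with the LCS array, filling the table takes O(nm) time and the traceback takes at most n + m steps.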
