
Multiple Sequence Alignment

Vasileios Hatzivassiloglou, University of Texas at Dallas



  1. Multiple Sequence Alignment • Vasileios Hatzivassiloglou • University of Texas at Dallas

  2. Center star algorithm for multiple sequence global alignment • T is the set of strings that we want to align • Pick S ∈ T that minimizes $\sum_{S' \in T \setminus \{S\}} d(S, S')$ • The initial alignment starts with S (≡ $S_1$) • Suppose we have already aligned $S_1, S_2, \ldots, S_i$ as $S'_1, S'_2, \ldots, S'_i$. Then we add the remaining strings one at a time by aligning $S_{i+1}$ with $S'_1$, obtaining $S'_{i+1}$ and $S''_1$. We replace $S'_1$ with $S''_1$ and add spaces to $S'_2, \ldots, S'_i$ wherever spaces were added to $S'_1$.

  3. Finding S • S is the best representative of the set T in terms of the distance metric d • If T is considered as a cluster of strings, then S is the centroid of the cluster • To find S, align each string with every other ($\binom{k}{2}$ pairs, where k = |T|) and calculate, for each candidate, the sum of its distances to all other strings. Pick the candidate that minimizes this sum
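The selection of S can be sketched in Python as follows. This is a minimal illustration, assuming unit-cost edit distance as the metric d (the slides do not fix a particular d); the names `edit_distance` and `find_center` are made up for this example.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost edit distance: insertion, deletion, substitution each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # ca aligned with a space
                            curr[j - 1] + 1,             # cb aligned with a space
                            prev[j - 1] + (ca != cb)))   # match or substitution
        prev = curr
    return prev[-1]


def find_center(strings, dist=edit_distance):
    """Return (center string, list of distance sums), one sum per candidate."""
    k = len(strings)
    totals = [0] * k
    # Each unordered pair is aligned once and contributes to both candidates.
    for i, j in combinations(range(k), 2):
        d = dist(strings[i], strings[j])
        totals[i] += d
        totals[j] += d
    best = min(range(k), key=totals.__getitem__)
    return strings[best], totals


center, totals = find_center(["GTA", "CGT", "CAG"])
print(center, totals)
```

Because every pair is aligned only once, the loop performs k(k-1)/2 pairwise alignments in total.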

  4. Example • Three strings: GTA, CGT, CAG • Step 1: Calculate all three pairwise distances and pick the string that minimizes the total distance to the others; let's say it's CGT • Step 2-1: Align CGT with GTA • CGT- • -GTA • Step 2-2: Extend uninvolved, processed strings with spaces (not needed now)

  5. Example (continued) • Step 3-1: Align CGT- with CAG • C-GT- • CAG-- • Step 3-2: Extend uninvolved, processed strings with spaces (-GTA becomes --GTA) • C-GT- • --GTA • CAG--
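The progressive steps of the example can be sketched in Python as below. This is an illustrative sketch, not the presenter's code: it assumes unit costs, treats a space already present in the center row as free to align with a new space, and uses the made-up names `align_to_center`, `extend`, and `center_star`. Ties between equally good alignments may be broken differently than on the slides, so the rows can differ from the ones shown while having the same pairwise cost against the center.

```python
GAP = "-"


def align_to_center(center_row, s):
    """Align the (possibly gapped) center row with an ungapped string s.

    Returns (new_center_row, s_row, inserted), where inserted[c] is True
    when column c is a space newly added to the center row.
    """
    m, n = len(center_row), len(s)

    def drop(x):
        # Putting a space under an existing center space costs nothing.
        return 0 if x == GAP else 1

    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + drop(center_row[i - 1])
    for j in range(1, n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j - 1] + (center_row[i - 1] != s[j - 1]),
                          D[i - 1][j] + drop(center_row[i - 1]),
                          D[i][j - 1] + 1)

    # Trace back from the bottom-right corner, building columns in reverse.
    top, bottom, inserted = [], [], []
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and D[i][j] == D[i - 1][j - 1] + (center_row[i - 1] != s[j - 1])):
            top.append(center_row[i - 1]); bottom.append(s[j - 1])
            inserted.append(False); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + drop(center_row[i - 1]):
            top.append(center_row[i - 1]); bottom.append(GAP)
            inserted.append(False); i -= 1
        else:
            top.append(GAP); bottom.append(s[j - 1])
            inserted.append(True); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), inserted[::-1]


def extend(row, inserted):
    """Add a space to an already processed row wherever the center gained one."""
    chars = iter(row)
    return "".join(GAP if new else next(chars) for new in inserted)


def center_star(center, others):
    """Progressively align each string in `others` against the center."""
    rows = [center]
    for s in others:
        new_center, s_row, inserted = align_to_center(rows[0], s)
        rows = [new_center] + [extend(r, inserted) for r in rows[1:]] + [s_row]
    return rows


rows = center_star("CGT", ["GTA", "CAG"])
for r in rows:
    print(r)
```

Tracking the inserted columns explicitly, rather than comparing the old and new center rows character by character, avoids ambiguity when a new space lands next to an existing one.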

  6. Algorithm complexity – Finding S • To find S, we consider k candidates • For each candidate, we calculate the sum of k-1 terms – $O(k^2)$ such terms total • If the maximum string length is n, then each term can be calculated in $O(n^2)$ time • Total for finding S is $O(k^2 n^2)$

  7. Algorithm complexity – Subsequent alignments • Each subsequent alignment at step i+1 aligns the string $S'_1$, of length at most $i \cdot n$, with a string $S_{i+1}$ of length at most n • Each alignment can be found in time $O(in \cdot n) = O(i n^2)$ • Total time for these alignments is $\sum_{i=1}^{k-1} O(i n^2) = O(k^2 n^2)$

  8. Algorithm complexity – Extensions with spaces • At step i+1 there is an extension of i-1 strings, each of length at most $i \cdot n$ • For each such string, we need to consider a total of n new space positions • Time required is $\sum_{i=1}^{k-1} O((i-1)\,n) = O(k^2 n)$ • Overall total time for the algorithm is $O(k^2 n^2)$

  9. Error bounds • It is useful to know how far the solution found by an approximate algorithm is from the true optimal solution • Sometimes (but not always) it is possible to provide error bounds, that is, to give upper and lower bounds for the ratio of the approximate solution's score to the optimal score • Bounds may depend on n and k

  10. Error analysis assumptions • Sometimes we need additional assumptions in order to derive useful bounds • For the approximate algorithm for multiple string alignment, we assume the triangle inequality for measure d: $d(x,y) \le d(x,z) + d(z,y)$

  11. Background on distances • A distance or metric d is formally defined as a function A×A→ℜ on a set A (called a metric space) with the following properties: • d(x,y)≥0 (non-negativity) • d(x,y)=0 iff x=y (identity of indiscernibles) • d(x,y)=d(y,x) (symmetry) • d(x,y)≤d(x,z)+d(z,y) (triangle inequality) • Metric spaces include ℜ (with d(x,y)=|x-y|), all Euclidean spaces, the $L^p$ spaces, and inner product spaces.

  12. Background on distances (continued) • Numbering the four properties in order: (1) d(x,y)≥0 follows from 2, 3, and 4 • Dropping (2) d(x,y)=0 iff x=y, while keeping d(x,x)=0, gives a pseudometric • Dropping (3) d(x,y)=d(y,x) gives a quasimetric • Dropping (4) d(x,y)≤d(x,z)+d(z,y) gives a semimetric
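As a small illustration of these properties, the following Python sketch checks them by brute force on a finite sample of points. Passing the check on a sample does not prove that a function is a metric; it can only expose violations. The function name `metric_violations` and the sample points are made up for this example.

```python
from itertools import product


def metric_violations(points, d):
    """Return the set of metric properties that d violates on `points`."""
    bad = set()
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:
            bad.add("non-negativity")
        if (d(x, y) == 0) != (x == y):
            bad.add("identity of indiscernibles")
        if d(x, y) != d(y, x):
            bad.add("symmetry")
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y):
            bad.add("triangle inequality")
    return bad


# |x - y| on the integers satisfies all four properties on this sample.
print(metric_violations([0, 1, 2, 5], lambda x, y: abs(x - y)))

# The squared difference fails only the triangle inequality: d(0,2) = 4 > 1 + 1.
print(metric_violations([0, 1, 2], lambda x, y: (x - y) ** 2))
```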

  13. Deriving an error bound • Let $v_0$ be the score for the optimal alignment and $v^*$ the score for the alignment produced by the center star algorithm • Let $d_0(i,j)$ and $d^*(i,j)$ be the corresponding induced distances on strings $S_i$ and $S_j$

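  14. Lower bound for $v_0$ • With the score taken as the sum of induced distances over all pairs of strings, $2v_0 = \sum_{i=1}^{k} \sum_{j \ne i} d_0(i,j)$ • $\sum_{i=1}^{k} \sum_{j \ne i} d_0(i,j) \ge \sum_{i=1}^{k} \sum_{j \ne i} d(S_i, S_j)$, because the induced distance can be no less than the distance between the strings themselves • $\sum_{i=1}^{k} \sum_{j \ne i} d(S_i, S_j) \ge k \sum_{j=2}^{k} d(S_1, S_j)$, by the choice of $S_1$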

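  15. Upper bound for $v^*$ • $2v^* = \sum_{i=1}^{k} \sum_{j \ne i} d^*(i,j) \le \sum_{i=1}^{k} \sum_{j \ne i} \left[ d^*(i,1) + d^*(1,j) \right]$ (triangle inequality) • $= 2(k-1) \sum_{j=2}^{k} d^*(1,j)$ (symmetry) • $= 2(k-1) \sum_{j=2}^{k} d(S_1, S_j)$, because each string is aligned with $S_1$ optimally (there may be additional spaces in matching positions, which do not change the distance)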

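  16. Combining the bounds • $\frac{v^*}{v_0} \le \frac{2(k-1) \sum_{j=2}^{k} d(S_1, S_j)}{k \sum_{j=2}^{k} d(S_1, S_j)} = \frac{2(k-1)}{k} < 2$ • Better bound for low k: for k = 3 the ratio is at most 4/3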
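A short numeric sketch of how the two bounds combine, assuming the usual sum-of-pairs analysis of the center star method: with M denoting the sum of d(S1, Sj) over the other k-1 strings, the optimal score satisfies v0 ≥ kM/2 and the center star score satisfies v* ≤ (k-1)M. The function name `center_star_bounds` is made up, and M = 4 is the value obtained for the example strings GTA, CGT, CAG under unit-cost edit distance.

```python
from fractions import Fraction


def center_star_bounds(center_sum, k):
    """Return (lower bound on v0, upper bound on v*, bound on v*/v0).

    center_sum is M, the sum of distances from the center string to the
    other k - 1 strings.
    """
    lower_v0 = Fraction(k * center_sum, 2)
    upper_vstar = (k - 1) * center_sum
    ratio = Fraction(2 * (k - 1), k)
    return lower_v0, upper_vstar, ratio


# Three strings with M = 4: v0 >= 6, v* <= 8, so v*/v0 <= 4/3.
print(center_star_bounds(4, 3))

# The guarantee weakens toward 2 as k grows.
for k in (3, 4, 10, 100):
    print(k, center_star_bounds(1, k)[2])
```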

  17. Motif data notation • A motif is denoted by three parameters • Its length l • The number of allowed spaces g • The number of allowed changes d • (l, d, g) notation • Changes and gaps allowed because of mutations across organisms • In a “good” motif, g and d are small compared to l • Most work assumes g = 0

  18. Finding the motif consensus • Assume known motif instance positions and length (e.g., via multiple alignment) • Also known as the known site problem • Input: A set of motif instances • Output: What is the motif consensus? • Further, is the consensus a valid motif, or is it statistically indistinguishable from what we would expect from other randomly chosen regions?
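One simple way to produce a consensus from known motif instances is a column-wise majority vote, sketched below in Python. This is only an illustration under the assumption g = 0 (all instances have the same length and no spaces); the slides do not say which consensus method is intended, the function name `consensus` is made up, and the four instances are invented data.

```python
from collections import Counter


def consensus(instances):
    """Return the most frequent symbol in each column of the instances."""
    if len({len(s) for s in instances}) != 1:
        raise ValueError("all motif instances must have the same length")
    # zip(*instances) iterates over the columns of the stacked instances.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*instances))


sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]  # invented example data
print(consensus(sites))
```

A majority vote says nothing about whether the consensus is statistically distinguishable from random regions; that question requires a probabilistic model.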

  19. Statistical estimation • An important approach to many data mining and machine learning tasks • Requirement: The problem must be expressed as a probability function that depends on a number of modeled parameters whose value is unknown • The estimation task: Find the optimal values for these parameters

  20. Estimation example • Can be performed without an explicit probabilistic model • Example: Futures markets are exchanges where contracts are traded for future execution • Contract price reflects probabilities of events

  21. Obama contract at intrade.com
