DNA Sequencing Problem


Presentation Transcript


  1. DNA Sequencing Problem Lin Zhou

  2. Contents • Background Information • Formal Definition • NP-Completeness • Conclusions

  3. Background Information • DNA sequences are too long to be sequenced in one piece (billions of base pairs) • Shear the DNA into millions of small fragments • Read 500–700 nucleotides at a time from the small fragments (Sanger method)

  4. Background Information

  5. Background Information • Task: assemble individual short fragments (reads) into a single genomic sequence (“superstring”)

  6. Formal Definition • Shortest Common Superstring (SCSS) • Given: Strings s1, s2, …, sn • Task: Find a string T that contains all of s1, s2, …, sn as substrings, such that the length of T is minimized
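
The optimization problem above can be illustrated with a minimal brute-force sketch. It is not from the slides: the function names `overlap` and `shortest_common_superstring` are hypothetical, and the exhaustive search over all orderings is only practical for a handful of short strings.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_common_superstring(strings):
    """Exhaustive search over all orderings (exponential time)."""
    # Drop duplicates and strings already contained in another string.
    unique = list(dict.fromkeys(strings))
    reads = [s for s in unique if not any(s != t and s in t for t in unique)]
    if not reads:
        return ""
    best = None
    for order in permutations(reads):
        # Merge the strings in this order, reusing the maximal overlap each time.
        merged = order[0]
        for nxt in order[1:]:
            merged += nxt[overlap(merged, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best
```

For example, calling `shortest_common_superstring(["ACGT", "GTAC", "CGTA"])` returns a string of length 6 that contains all three reads.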

  7. Formal Definition as Decision Problem • Shortest Common Superstring (SCSS) • Given: Set of strings S = {s1, s2, …, sn} and integer k • Question: Does there exist a string T such that every si ∈ S is a substring of T, and |T| ≤ k?

  8. An Example

  9. NP-Completeness • To show NP-completeness, we need to show that • the SCSS problem is in NP • another problem that is in NP-C is reducible to the SCSS problem

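  10. NP-Completeness • SCSS Problem is in NP • To show: a proposed solution to SCSS is verifiable by a deterministic machine in polynomial time • Given a candidate superstring T as a certificate for a “yes” instance, we can check that T contains every string si as a substring and that its length meets the bound k, in time polynomial in the input size. • Thus the problem is in NP.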
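
The certificate check described above can be sketched as follows. The function name `verify_superstring` is hypothetical, and the length bound is taken as |T| ≤ k, which is an assumption; swap the comparison if a strict bound is intended.

```python
def verify_superstring(strings, k, candidate):
    """Check a proposed superstring against an SCSS decision instance.

    Assumption: the length bound is non-strict (len(candidate) <= k).
    Each substring test takes polynomial time, so the whole check does too.
    """
    if len(candidate) > k:
        return False
    return all(s in candidate for s in strings)
```

The verifier never searches for a superstring; it only checks the one it is handed, which is all that membership in NP requires.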

  11. NP-Completeness • Another NP-C problem is reducible to SCSS in polynomial time • Which problem to choose? • Minimum Set Cover (MSC) • Hamiltonian Path (HP)

  12. MSC → SCSS? • Attempt: Generate a string from each set? • SCSS only looks for prefix/suffix matches • A Set Cover match can involve an element from anywhere in the set (in fact, the contents of a set have no order) • “No” instances of SCSS constructed from MSC may not be “No” instances of MSC. • Not feasible

  13. HP → SCSS • Hamiltonian Path Problem (Directed): • Given: Graph G=(V,E) where V={v1,v2,…,vn} is the set of vertices and E={e1,e2,…,em} is the set of directed edges. • Question: Is there a path that visits every vertex in V exactly once? • How to transform?
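
To make the Hamiltonian Path question concrete, here is a minimal brute-force sketch for small directed graphs. The function name `has_hamiltonian_path` and the edge-list representation are assumptions for illustration; the search tries every vertex ordering and is therefore exponential.

```python
from itertools import permutations


def has_hamiltonian_path(vertices, edges):
    """Return True if some ordering of the vertices follows directed edges only."""
    edge_set = set(edges)
    return any(
        # Every consecutive pair in the ordering must be a directed edge.
        all((a, b) in edge_set for a, b in zip(order, order[1:]))
        for order in permutations(vertices)
    )
```

For example, the graph with edges 1→2 and 2→3 has a Hamiltonian path, while the graph with edges 1→2 and 1→3 does not.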

  14. HP → SCSS • For each vertex vi in V: • For each outgoing edge and its end vertex (ej, vj), • we generate the strings v'i vj v'i and vj v'i vj+1, where v'i is the complement of vi and vj+1 is the end vertex of the next outgoing edge ej+1 (wrapping around to the first edge after the last one). • E.g., for a vertex v1 with edges (e3, v3), (e4, v4), (e5, v5) we need to generate: • v'1v3v'1, v3v'1v4, v'1v4v'1, v4v'1v5, v'1v5v'1, v5v'1v3 • Create connector strings vi#v'i. • For the start and end vertices v1 and vn, we create ^#v1 and vn#$. • There is an HP iff there is a superstring of length 2m+3n
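
The string construction above can be sketched as follows. This is one literal reading of the slide, not a verified reduction: the function name `reduction_strings` is hypothetical, each string is a tuple of symbols, the complement v'i is written as the vertex name followed by an apostrophe, and a connector is emitted for every vertex listed in `adjacency` (the slide does not say whether the end vertex gets one).

```python
def reduction_strings(adjacency, start, end):
    """Build the strings for a directed graph.

    adjacency: dict mapping each vertex name to its list of out-neighbors,
               in a fixed order (assumed representation).
    start, end: names of the designated start and end vertices.
    """
    strings = []
    for v, targets in adjacency.items():
        comp = v + "'"  # complement symbol of v
        for j, w in enumerate(targets):
            # End vertex of the next outgoing edge, wrapping around cyclically.
            nxt = targets[(j + 1) % len(targets)]
            strings.append((comp, w, comp))
            strings.append((w, comp, nxt))
    # Connector strings v # v' (assumed: one per listed vertex).
    for v in adjacency:
        strings.append((v, "#", v + "'"))
    # Strings anchoring the start and end vertices.
    strings.append(("^", "#", start))
    strings.append((end, "#", "$"))
    return strings
```

Calling `reduction_strings({"v1": ["v3", "v4", "v5"]}, "v1", "v5")` reproduces the six strings of the slide's example for v1 as its first six entries, followed by the connector and the two anchor strings.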

  15. HP → SCSS • Given v'1v3v'1, v3v'1v4, v'1v4v'1, v4v'1v5, v'1v5v'1, v5v'1v3 • For v1 to connect to (in the Hamiltonian Path): • v3: v1#v'1v3v'1v4v'1v5v'1v3#v'3… • v4: v1#v'1v4v'1v5v'1v3v'1v4#v'4… • v5: v1#v'1v5v'1v3v'1v4v'1v5#v'5… • This configuration forces any vertex with x outgoing edges to form a superstring of length 2x+2 that starts with its complement symbol (v'1 in the example) and ends with one neighboring vertex, which is the next vertex on the Hamiltonian Path.

  16. HP → SCSS • v1#v'1v3v'1v4v'1v5v'1v3#v'3… • If we also count the leading ‘v1’ and the ‘#’, each vertex needs on average 2x+3 characters in the superstring, where x is the number of outgoing edges of that vertex; the final symbol of each block is shared with the start of the next vertex's block. • Thus, given m edges and n vertices, summing 2x+3 over all vertices gives a shortest superstring of 2m+3n characters.

  17. HP → SCSS • If there is a superstring of length 2m+3n, the definition of a superstring guarantees that each vertex is visited at least once. • If some vertex were visited more than once, the length would be > 2m+3n. Thus the path visits each vertex exactly once. • Conversely, if there is an HP in the graph, we can generate the superstring by tracing the path, and the generated superstring has length 2m+3n.

  18. Thank you! Questions and answers
