Fragment assembly of DNA

A typical approach to sequencing long DNA molecules is to sample and then sequence fragments from them.

Fragment assembly of DNA
  • Biological background
  • Models
  • Algorithms
  • Heuristics

® Pei-Jie Wu

Biological background
  • Problem as puzzle
  • We do not know which letter from the set {A, C, G, T} is written on each card, but we do know that cards in the same position of opposite strands form a complementary pair.
  • Our goal is to obtain the letters using certain hints, which are (approximate) substrings of the rows.

Biological background
  • Target: The long sequence to reconstruct.
  • Fragment vs. Subsequence
  • Shotgun method: based on fragment overlap
  • Fragment assembly: A collection of fragments to put together

Biological background--The ideal case
  • Case: p.106
  • Align the input set, ignoring spaces at the extremities
  • Overlaps: the end part of a fragment is similar to the beginning of another
  • Consensus sequence based on majority vote
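
The majority vote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the layout is given as equal-length rows, a blank (`" "`) marks positions outside a fragment, `"-"` marks an internal space, and the function name `consensus` is hypothetical. Ties are broken in favor of the symbol seen first.

```python
from collections import Counter

def consensus(layout):
    """Majority-vote consensus of an aligned layout (rows of equal length)."""
    result = []
    for col in range(len(layout[0])):
        # Blanks at the extremities of a fragment do not vote.
        votes = Counter(row[col] for row in layout if row[col] != " ")
        if not votes:
            continue  # column covered by no fragment
        symbol, _ = votes.most_common(1)[0]
        if symbol != "-":  # a winning internal space contributes no base
            result.append(symbol)
    return "".join(result)

layout = [
    "  ACCGT  ",
    "    CGTGC",
    "TTAC     ",
    " TACCGT  ",
]
print(consensus(layout))
```

With an error-free layout like this one every column is unanimous; the vote only matters once fragments disagree.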

Biological background--Complications
  • The main factors that add to the complexity of the problem are:
    • Error
    • Unknown orientation
    • Repeated regions
    • Lack of coverage.

Biological background--Complications

Errors

  • Dealing with errors usually means algorithms that require more time and space.
  • The simplest errors are called base call errors and comprise base substitutions, insertions, and deletions in the fragments.
  • Base call errors occur in practice at rates varying from 1 to 5 errors every 100 characters.
  • Figures 4.2, 4.3, 4.4

Biological background--Complications

Errors

  • Two other types of errors: chimeras and contamination
  • Chimeras arise when two regular fragments from distinct parts of the target molecule join end-to-end to form a fragment that is not a contiguous part of the target.
    • Figure 4.5
    • Solution: chimeras must be recognized as such and removed from the fragment set in a preprocessing stage.
  • Contamination comes from host or vector DNA.
    • Solution: most vectors are well known, so we can screen the data before starting assembly.

Biological background--Complications

Unknown orientation

  • We generally do not know to which strand a particular fragment belongs.
  • We treat the input fragments as approximate substrings of the sought consensus, either as given or in reverse complement.
  • Figure 4.6
  • Complexity: 2^n possible orientation combinations for n fragments
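
As a rough illustration of why unknown orientation is costly, the sketch below computes reverse complements and enumerates every orientation assignment of a fragment set. The names `reverse_complement` and `orientations` are hypothetical, and the input is assumed to use only the letters A, C, G, T.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(fragment):
    """Complement every base, then reverse the string."""
    return fragment.translate(COMPLEMENT)[::-1]

def orientations(fragments):
    """Yield every way of choosing, per fragment, the given or the reversed strand."""
    for flips in product((False, True), repeat=len(fragments)):
        yield tuple(
            reverse_complement(f) if flip else f
            for f, flip in zip(fragments, flips)
        )

fragments = ["ACCGT", "TTAC", "GGC"]
print(reverse_complement("ACCGT"))
print(sum(1 for _ in orientations(fragments)))  # 2 ** len(fragments) combinations
```

Enumerating all combinations is only feasible for tiny inputs, which is why practical assemblers resolve orientation while detecting overlaps instead.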

Biological background--Complications

Repeated regions

  • Repeats are sequences that appear two or more times in the target molecule.
    • Short repeats
    • Longer repeats
  • If the level of similarity between two copies of a repeat is high enough, the differences can be mistaken for base call errors
  • Figure 4.7

Biological background--Complications

Repeated regions

  • Problems:
    • If a fragment is totally contained in a repeat, we may have several places to put it in the final alignment. When the copies are not exactly equal, we may weaken the consensus by placing a fragment in the wrong copy.
    • Repeats can be positioned in such a way as to render assembly inherently ambiguous. (Figures 4.8 and 4.9)
  • Direct repeats: repeated copies in the same strand.
  • Inverted repeats: repeated regions in opposite strands (Figure 4.10)

Biological background--Complications

Lack of coverage

  • Coverage: the coverage at position i of the target is the number of fragments that cover this position.
  • Contigs: the contiguously covered regions
  • Figure 4.11
  • Solutions:
    • Sampling more fragments
    • Directed sequencing or walking


Biological background--Alternative methods for DNA sequencing
  • Directed sequencing: a method that can be used to cover small remaining gaps in a shotgun project.
  • Problems:
    • It is expensive to build special primers
    • Sequential rather than parallel
  • Sequencing by hybridization (SBH): assembles the target molecule based on many hybridization experiments with very short, fixed-length sequences called probes.

Models
  • Shortest common superstring (SCS)
  • RECONSTRUCTION
  • MULTICONTIG
    • All three assume that the fragment collection is free of contamination and chimeras.

Models--Shortest common superstring
  • Seeking the shortest superstring of a collection of given strings
  • PROBLEM: Shortest common superstring (SCS)
  • INPUT: a collection F of strings.
  • OUTPUT: a shortest possible string S such that for every f ∈ F, S is a superstring of f.
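
For very small inputs the SCS problem can be solved by brute force, which helps in checking examples by hand. The sketch below is an illustration, not a practical method: it drops fragments contained in other fragments, tries every ordering of the rest, and merges neighbors with their largest exact overlap. The running time grows factorially with the number of fragments, and the function names are hypothetical.

```python
from itertools import permutations

def merge(a, b):
    """Append b to a, reusing the longest suffix of a that is a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return a + b[t:]
    return a + b

def shortest_common_superstring(fragments):
    # A fragment contained in another one is covered automatically.
    unique = sorted(set(fragments))
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    if not frags:
        return ""
    best = None
    for order in permutations(frags):
        s = order[0]
        for f in order[1:]:
            s = merge(s, f)
        if best is None or len(s) < len(best):
            best = s
    return best

print(shortest_common_superstring(["ACCGT", "CGTGC", "TTAC", "TACCGT"]))
```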

Models--Shortest common superstring
  • Example 4.1
  • Example 4.2
    • Figure 4.12
    • Figure 4.13
  • When the target has repeats, a shortest superstring may contain only one copy of the repeat, which will absorb all fragments totally contained in any of the copies

Models--Reconstruction
  • Takes into account both errors and unknown orientation
  • Dynamic programming sequence comparison algorithm
  • Use distance rather than similarity
  • Expression: p.116

Models--Reconstruction
  • PROBLEM: RECONSTRUCTION
  • INPUT: a collection F of strings and an error tolerance ε between 0 and 1.
  • OUTPUT: (p.117)
  • Find a string S as short as possible such that, for every f ∈ F, either f or its reverse complement is an approximate substring of S at error level ε
  • Does not model repeats, lack of coverage, and size of target
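
The acceptance test behind RECONSTRUCTION can be sketched with a dynamic programming comparison in which the substring of S is free to start and end anywhere. The sketch assumes unit-cost edit distance and reads "approximate substring at error level ε" as "substring distance at most ε·|f|"; the slides defer the exact expression to p.116, so treat that reading and all function names as assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(f):
    return f.translate(COMPLEMENT)[::-1]

def substring_distance(f, s):
    """Smallest unit-cost edit distance between f and any substring of s."""
    prev = [0] * (len(s) + 1)  # the matched substring may start anywhere in s
    for i in range(1, len(f) + 1):
        cur = [i] + [0] * len(s)
        for j in range(1, len(s) + 1):
            cur[j] = min(
                prev[j] + 1,                            # base of f left unmatched
                cur[j - 1] + 1,                         # extra base taken from s
                prev[j - 1] + (f[i - 1] != s[j - 1]),   # match or substitution
            )
        prev = cur
    return min(prev)  # the matched substring may end anywhere in s

def is_approximate_substring(f, s, eps):
    return substring_distance(f, s) <= eps * len(f)

def fits_reconstruction(fragments, s, eps):
    """True if every fragment, or its reverse complement, fits s at error level eps."""
    return all(
        is_approximate_substring(f, s, eps)
        or is_approximate_substring(reverse_complement(f), s, eps)
        for f in fragments
    )

print(fits_reconstruction(["GATT", "AATC", "GACT"], "ACGATTACA", 0.25))
```

This only checks a candidate S; finding the shortest S that passes is the hard part of the problem.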

Models--Multicontig
  • Takes into account the internal linkage of the fragments in the layout
  • Nonlink: an overlap between two fragments is a nonlink if some other fragment properly contains the overlap on both sides; otherwise it is a link
  • Weakest link: the smallest link in the layout
  • t-contig: a layout whose weakest link is at least as large as t
  • Example 4.4
  • Definition: p.119

Algorithms
  • Greedy algorithm
  • Acyclic subgraphs

(both assume no errors and known orientation)

Algorithms--Representing overlaps
  • The overlap multigraph OM(F) of a collection F is a directed, weighted multigraph.
  • The set V of nodes of this structure is just F itself.
  • A directed edge from a to a different fragment b with weight t ≥ 0 exists if the suffix of a with t characters is a prefix of b
  • There may be many edges from a to b, one for each overlap length
  • No self-loops
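
The definition translates almost directly into code. The sketch below lists every edge of the overlap multigraph as an `(a, b, t)` triple; it assumes the fragments are distinct strings, and the function name is hypothetical.

```python
def overlap_multigraph(fragments):
    """All edges (a, b, t): the last t characters of a equal the first t of b."""
    edges = []
    for a in fragments:
        for b in fragments:
            if a == b:
                continue  # no self-loops
            for t in range(min(len(a), len(b)) + 1):
                # a[len(a) - t:] is the suffix of length t (empty when t == 0)
                if a[len(a) - t:] == b[:t]:
                    edges.append((a, b, t))
    return edges

edges = overlap_multigraph(["TACGA", "ACCC", "CTAAAG", "GACA"])
print([t for a, b, t in edges if (a, b) == ("TACGA", "GACA")])
```

Because t = 0 always qualifies, every ordered pair of distinct fragments is joined by at least one edge.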

Algorithms--Paths originating superstrings
  • Edge e = (f, g) in the path has a certain weight t, which means that the last t bases of the tail f of e are the same as the first t bases of the head g of e
  • Figure 4.15
    • Example in p.121
  • Equation 4.3
  • Hamiltonian path: a path that goes through every vertex exactly once
  • Equation 4.4
    • Minimizing |S(P)| ⇔ maximizing w(P)
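
A short sketch shows how a path spells a superstring and why a heavier path gives a shorter string: each edge of weight t saves t characters, so for a path through all fragments the length of the result is the total fragment length minus the path weight. The function name and the input encoding (fragments in path order plus one weight per edge) are assumptions made for illustration.

```python
def superstring_from_path(path, weights):
    """Spell the string of a path; weights[i] is the overlap between path[i] and path[i + 1]."""
    s = path[0]
    for prev, nxt, t in zip(path, path[1:], weights):
        if prev[len(prev) - t:] != nxt[:t]:
            raise ValueError(f"{prev} and {nxt} do not overlap by {t}")
        s += nxt[t:]  # only the non-overlapping part of nxt is new
    return s

path, weights = ["TACGA", "GACA", "ACCC"], [2, 1]
s = superstring_from_path(path, weights)
print(s, len(s), sum(map(len, path)) - sum(weights))
```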

Algorithms--Shortest superstrings as paths
  • A collection F is said to be substring-free if there are no two distinct strings a and b in F such that a is a substring of b.
  • THEOREM 4.1
  • COROLLARY 4.1
  • LEMMA 4.1
  • THEOREM 4.2

Algorithms--The greedy algorithm
  • Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph.
  • OM(F) → OG(F): the overlap graph OG(F) keeps only the heaviest edge between each ordered pair of fragments
  • A “greedy” attempt at computing the heaviest path: the basic idea is to repeatedly add the heaviest available edge

Algorithms--The greedy algorithm
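  • Three conditions we have to test before accepting an edge in our Hamiltonian path:
    • No previously accepted edge leaves the same node
    • No previously accepted edge enters the same node
    • The edge does not close a cycle
  • Edges are processed in nonincreasing order by weight
  • The procedure ends when we have exactly n-1 edges or, equivalently, when the accepted edges induce a connected subgraph.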
  • Figure 4.16
  • Example 4.5
    • Figure 4.17
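
A compact sketch of the greedy procedure follows. It assumes a non-empty, substring-free collection of distinct fragments with no errors and known orientation, uses a union-find structure to detect cycles, and breaks ties by input order; these choices and all names are illustrative, not taken from Figure 4.16. Greedy selection does not guarantee a shortest superstring in general.

```python
def max_overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a[len(a) - t:] == b[:t]:
            return t
    return 0

def greedy_superstring(fragments):
    n = len(fragments)
    edges = [(max_overlap(a, b), i, j)
             for i, a in enumerate(fragments)
             for j, b in enumerate(fragments) if i != j]
    edges.sort(key=lambda e: -e[0])  # nonincreasing weight, stable on ties

    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    out_edge, has_in, accepted = {}, set(), 0
    for w, i, j in edges:
        if accepted == n - 1:
            break
        if i in out_edge or j in has_in:
            continue  # i already has a successor or j a predecessor
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # i and j are on the same partial path: edge would close a cycle
        parent[ri] = rj
        out_edge[i] = (j, w)
        has_in.add(j)
        accepted += 1

    cur = next(i for i in range(n) if i not in has_in)  # start of the path
    s = fragments[cur]
    while cur in out_edge:
        cur, w = out_edge[cur]
        s += fragments[cur][w:]
    return s

print(greedy_superstring(["TACGA", "ACCC", "CTAAAG", "GACA"]))
```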

Algorithms--Acyclic subgraphs
  • Assembling fragments without errors and with known orientation, assuming that the fragments have been obtained from a “good sampling” of the target DNA.
  • “Good sampling”: the fragments cover the entire target molecule, and the collection as a whole exhibits enough linkage to guarantee a safe assembly.
  • Figure 4.18

Algorithms--Acyclic subgraphs
  • The presence of repeated regions, or repeated elements, in the target string S is related to the existence of cycles in the overlap graph.
  • Cycles in an overlap graph are necessarily due to repeats in S. The converse is not necessarily true; that is, we may have repeats but still an acyclic overlap graph.
  • THEOREM 4.5
  • Algorithm: Topological sorting
  • Example 4.6
    • Figure 4.19, 4.20 and 4.21
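
When the overlap graph is acyclic, a topological sort yields the fragment order. The sketch below is a simplified illustration: it keeps only overlaps of at least `t` characters (the threshold `t`, the function names, and the input are assumptions), relies on `graphlib.TopologicalSorter` from the Python standard library (3.9+), and rejects orders in which neighboring fragments are not linked.

```python
from graphlib import TopologicalSorter

def max_overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[len(a) - k:] == b[:k]:
            return k
    return 0

def assemble_acyclic(fragments, t):
    # preds[b] holds every fragment a with an edge a -> b of weight >= t.
    preds = {f: set() for f in fragments}
    for a in fragments:
        for b in fragments:
            if a != b and max_overlap(a, b) >= t:
                preds[b].add(a)
    # static_order() raises graphlib.CycleError if the graph has a cycle,
    # which would point to repeats in the target.
    order = list(TopologicalSorter(preds).static_order())
    s = order[0]
    for prev, nxt in zip(order, order[1:]):
        k = max_overlap(prev, nxt)
        if k < t:
            raise ValueError("consecutive fragments are not linked; sampling is not good enough")
        s += nxt[k:]
    return s

print(assemble_acyclic(["ACCTGA", "ATGCGT", "CGTACC"], t=3))
```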

Heuristics
  • None of the formalisms proposed for fragment assembly are entirely adequate
  • Fragment assembly can be viewed as a multiple alignment problem with some additional features:
    • Each fragment can participate with either the direct or the reverse-complemented sequence.
    • The sequences themselves are usually much shorter than the alignment itself.

Heuristics
  • Three criteria according to the second feature:
    • Scoring
      • Entropy is a quantity that is defined on a group of relative frequencies; it is low when one of these frequencies stands out from the others, and high when they are all more or less equal
      • The lower the entropy, the better
    • Coverage:
      • A fragment covers a column i if it participates in this column either with a character or with an internal space.
    • Linkage
      • The way individual fragments are linked in the layout is another determinant of layout quality.
      • Figure 4.22
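
The entropy criterion can be illustrated per column of a layout. The sketch assumes Shannon entropy with base-2 logarithms over the relative frequencies of the symbols in a column, with blanks (positions outside a fragment) ignored; the slides do not fix the formula, so this choice and the function names are assumptions.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of the symbol frequencies in one column."""
    symbols = [c for c in column if c != " "]
    total = len(symbols)
    return sum(-(n / total) * log2(n / total) for n in Counter(symbols).values())

def layout_entropy(layout):
    """Sum of the column entropies of an aligned layout (rows of equal length)."""
    return sum(column_entropy(col) for col in zip(*layout))

print(column_entropy("AAAA"))  # unanimous column
print(column_entropy("ACGT"))  # all symbols equally frequent
print(layout_entropy(["ACGT", "ACCT", "ACGT"]))
```

A unanimous column scores zero and an evenly split one scores highest, matching the "lower is better" rule.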

Heuristics--Assembly in practice
  • Practical implementations often divide the whole problem into three phases:
    • Finding overlaps
    • Building a layout
    • Computing the consensus

Heuristics--Assembly in practice

Finding overlaps

  • The first step in any assembly problem is fragment overlap detection.
  • Take reverse complements into account when comparing fragments
  • Also detect fragments entirely contained in other fragments
  • Recall Section 3.2.3
    • Figure 4.23
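
Overlap detection in the presence of errors can be sketched as a semiglobal comparison: skipping a prefix of the first fragment and a suffix of the second is free, so the best score measures how well a suffix of `a` matches a prefix of `b`. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration.

```python
def best_overlap(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of aligning a suffix of a with a prefix of b.

    Returns (score, length of the prefix of b used in the overlap).
    """
    # Row 0: no character of a used yet, so a prefix of b can only face gaps.
    prev = [gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)  # column 0 stays 0: skipping a prefix of a is free
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # The alignment must reach the end of a; the rest of b is free.
    score = max(prev)
    return score, prev.index(score)

print(best_overlap("TACGA", "GACA"))
```

Running the same comparison against the reverse complement of `b`, and in both directions, would cover the orientation and containment cases.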

Heuristics--Assembly in practice

Ordering fragments

  • Finding a good ordering of fragments in a contig
  • No known algorithm is both simple and general enough
  • There are four issues to keep in mind when building paths:
    • Every path has a corresponding complement path
    • It is not necessary to include contained fragments
    • Cycles usually indicate the presence of repeats
    • Unbalanced coverage may be related to repeats as well (see Figure 4.13)

Heuristics--Assembly in practice

Alignment and consensus

  • Building a layout from a path in an overlap graph
  • Two techniques related to alignment construction:
    • The first one helps in building a good layout from a path in the presence of errors.
      • Example 4.7
      • Implementation: Figure 4.24
    • The second one focuses on locally improving an already constructed layout.
      • Example 4.8 in Figure 4.25
      • Implementation: sum-of-pairs scoring scheme
