# Fragment assembly of DNA


### Fragment assembly of DNA

A typical approach to sequencing long DNA molecules is to sample and then sequence fragments from them.

Fragment assembly of DNA
• Biological background
• Models
• Algorithms
• Heuristics

® Pei-Jie Wu

Biological background
• Problem as puzzle
• We do not know which letter from the set {A, C, G, T} is written on each card, but we do know that cards in the same position of opposite strands form a complementary pair.
• Our goal is to obtain the letters using certain hints, which are (approximate) substrings of the rows.

Biological background
• Target: The long sequence to reconstruct.
• Fragment vs. Subsequence
• Shotgun method: based on fragment overlap
• Fragment assembly: A collection of fragments to put together

Biological background--The ideal case
• Case: p.106
• Align the input set, ignoring spaces at the extremities
• Overlaps: the end part of one fragment is similar to the beginning of another
• Consensus sequence based on majority vote
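
The majority vote can be sketched as follows. This is a minimal illustration, assuming the layout is already given as rows padded with spaces at the extremities; the function name `consensus` and the sample fragments are hypothetical choices for the example, not taken from the slides.

```python
from collections import Counter

def consensus(layout):
    """Majority-vote consensus of a layout given as padded rows.

    Spaces are padding at the extremities and are ignored;
    '-' marks an internal gap and takes part in the vote.
    """
    width = max(len(row) for row in layout)
    result = []
    for col in range(width):
        # Count the characters that actually cover this column.
        votes = Counter(row[col] for row in layout
                        if col < len(row) and row[col] != " ")
        if not votes:
            result.append("?")  # no fragment covers this column
            continue
        base, _ = votes.most_common(1)[0]
        if base != "-":  # a winning gap contributes no character
            result.append(base)
    return "".join(result)

layout = [
    "  ACCGT  ",
    "    CGTGC",
    "TTAC     ",
    " TACCGT  ",
]
print(consensus(layout))  # TTACCGTGC
```

Ties are broken by whichever character `Counter` saw first, which is an arbitrary choice; a real assembler would flag such columns instead.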

Biological background--Complications
• The main factors that add to the complexity of the problem are:
• Errors
• Unknown orientation
• Repeated regions
• Lack of coverage.

Biological background--Complications

Errors

• Dealing with errors usually means algorithms that require more time and space.
• The simplest errors are called base call errors and comprise base substitutions, insertions, and deletions in the fragments.
• Base call errors occur in practice at rates varying from 1 to 5 errors every 100 characters.
• Figures 4.2, 4.3, 4.4

Biological background--Complications

Errors

• Two other types of errors: chimeras and contamination
• Chimeras arise when two regular fragments from distinct parts of the target molecule join end-to-end to form a fragment that is not a contiguous part of the target
• Figure 4.5
• Solution: chimeras must be recognized as such and removed from the fragment set in a preprocessing stage.
• Contamination comes from host or vector DNA
• Solution: most vectors are well known, so we can screen the data before starting assembly.

Biological background--Complications

Unknown orientation

• We generally do not know to which strand a particular fragment belongs.
• We treat the input fragments as approximate substrings of the consensus sought, either as given or in reverse complement.
• Figure 4.6
• Complexity: 2^n possible orientation combinations for n fragments
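
A short sketch of why unknown orientation is costly: each fragment may be used as given or as its reverse complement, so n fragments admit 2^n orientation assignments. The helper names below are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def orientations(fragments):
    """Yield every way of choosing, per fragment, itself or its reverse complement."""
    return product(*[(f, reverse_complement(f)) for f in fragments])

print(reverse_complement("ACCGT"))                      # ACGGT
print(len(list(orientations(["ACG", "TTA", "GGC"]))))   # 8
```

Enumerating all assignments is only feasible for tiny inputs; it is shown here to make the exponential count concrete, not as a practical method.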

Biological background--Complications

Repeated regions

• Repeats are sequences that appear two or more times in the target molecule.
• Short repeats
• Longer repeats
• If the level of similarity between two copies of a repeat is high enough, the differences can be mistaken for base call errors
• Figure 4.7

Biological background--Complications

Repeated regions

• Problems:
• If a fragment is totally contained in a repeat, we may have several places to put it in the final alignment. When the copies are not exactly equal, we may weaken the consensus by placing a fragment in the wrong copy.
• Repeats can be positioned in such a way as to render assembly inherently ambiguous. (Figures 4.8 and 4.9)
• Direct repeats: repeated copies in the same strand.
• Inverted repeats: repeated regions in opposite strands (Figure 4.10)

Biological background--Complications

Lack of coverage

• Coverage: the coverage at position i of the target is the number of fragments that cover this position.
• Contigs: The contiguously covered regions
• Figure 4.11
• Solutions:
• Sampling more fragments
• Directed sequencing or walking
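
To make the two definitions concrete, here is a small sketch that computes per-position coverage and the contiguously covered regions. It assumes the fragment positions on the target are already known as half-open `(start, end)` intervals, which is of course not the case in a real assembly; the function names are hypothetical.

```python
def coverage(target_length, intervals):
    """Number of fragments covering each position of the target."""
    cov = [0] * target_length
    for start, end in intervals:
        for i in range(start, end):
            cov[i] += 1
    return cov

def contigs(cov):
    """Maximal runs of positions with coverage >= 1, as half-open (start, end)."""
    runs, start = [], None
    for i, c in enumerate(cov):
        if c > 0 and start is None:
            start = i                 # a covered region begins
        elif c == 0 and start is not None:
            runs.append((start, i))   # a gap ends the region
            start = None
    if start is not None:
        runs.append((start, len(cov)))
    return runs

cov = coverage(12, [(0, 5), (3, 8), (10, 12)])
print(cov)           # [1, 1, 1, 2, 2, 1, 1, 1, 0, 0, 1, 1]
print(contigs(cov))  # [(0, 8), (10, 12)]
```

Note the simplification: two fragments that merely touch without overlapping are merged into one region here, although they would share no overlap that an assembler could use to link them.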

• Directed sequencing: a method that can be used to cover small remaining gaps in a shotgun project.
• Problems:
• It is expensive to build special primers
• Sequential rather than parallel
• Sequencing by hybridization (SBH): assembling the target molecule based on many hybridization experiments with very short, fixed-length sequences called probes.

Models
• Shortest common superstring (SCS)
• RECONSTRUCTION
• MULTICONTIG
• All three assume that the fragment collection is free of contamination and chimeras.

Models--Shortest common superstring
• Seeking the shortest superstring of a collection of given strings
• PROBLEM: Shortest common superstring (SCS)
• INPUT: a collection F of strings.
• OUTPUT: a shortest possible string S such that for every f ∈ F, S is a superstring of f.
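
For very small inputs the SCS problem can be solved by exhaustive search, which makes the definition tangible. The sketch below is not an algorithm from the slides: it simply tries every ordering of the fragments and glues neighbours with their maximal overlap. Function names are hypothetical.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def shortest_common_superstring(fragments):
    """Brute-force SCS; only usable for a handful of fragments."""
    unique = sorted(set(fragments))
    # Fragments contained in another fragment are covered for free.
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    if not frags:
        return ""
    best = None
    for order in permutations(frags):
        s = order[0]
        for prev, nxt in zip(order, order[1:]):
            s += nxt[overlap(prev, nxt):]  # append only the non-overlapping tail
        if best is None or len(s) < len(best):
            best = s
    return best

print(shortest_common_superstring(["ACCGT", "CGTGC", "TTAC", "TACCGT"]))  # TTACCGTGC
```

The running time grows factorially with the number of fragments, which is consistent with SCS being NP-hard in general.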

Models--Shortest common superstring
• Example 4.1
• Example 4.2
• Figure 4.12
• Figure 4.13
• When the target contains repeats, a shortest superstring may contain only one copy of a repeat, which will absorb all fragments totally contained in any of the copies

Models--Reconstruction
• Takes into account both errors and unknown orientation
• Dynamic programming sequence comparison algorithm
• Use distance rather than similarity
• Expression: p.116

Models--Reconstruction
• PROBLEM: RECONSTRUCTION
• INPUT: a collection F of strings and an error tolerance ε between 0 and 1.
• OUTPUT: (p.117)
• Find a string S as short as possible such that for every f ∈ F, either f or its reverse complement is an approximate substring of S at error level ε
• Does not model repeats, lack of coverage, and size of target
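
The acceptance test behind RECONSTRUCTION can be sketched with a standard dynamic programming recurrence. The assumptions here are mine, since the slides only point to the page with the exact expression: edit distance with unit costs, and "approximate substring at error level eps" read as "distance to the best substring of S is at most eps times the fragment length". All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def substring_edit_distance(f, s):
    """Minimum unit-cost edit distance between f and any substring of s."""
    # Row 0 is all zeros: the matched substring may start anywhere in s for free.
    prev = [0] * (len(s) + 1)
    for i in range(1, len(f) + 1):
        cur = [i] + [0] * len(s)  # f[:i] against the empty substring costs i
        for j in range(1, len(s) + 1):
            cost = 0 if f[i - 1] == s[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # character of f left unmatched
                         cur[j - 1] + 1)      # extra character taken from s
        prev = cur
    return min(prev)  # the substring may end anywhere in s

def is_approximate_substring(f, s, eps):
    return substring_edit_distance(f, s) <= eps * len(f)

def satisfies_reconstruction(fragments, s, eps):
    """True if every fragment fits s in at least one orientation."""
    return all(is_approximate_substring(f, s, eps)
               or is_approximate_substring(reverse_complement(f), s, eps)
               for f in fragments)

target = "TTACCGTGC"
print(substring_edit_distance("CGAG", target))              # 1
print(satisfies_reconstruction(["CACG"], target, 0.0))      # True
```

The second call succeeds only through the reverse complement: `CACG` is not a substring of the target, but its reverse complement `CGTG` is.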

Models--Multicontig
• Involve internal linkage of the fragments in the layout
• Nonlink: there is a fragment that properly contains the overlap on both sides
• t-contig: the weakest link of a layout is at least as large as t
• Example 4.4
• Definition: p.119

Algorithms
• Greedy algorithm
• Acyclic subgraphs

(no errors and known orientation)

Algorithms--Representing overlaps
• The overlap multigraph OM(F) of a collection F is a directed, weighted multigraph
• The set V of nodes of this structure is just F itself.
• A directed edge from a to a different fragment b with weight t ≥ 0 exists if the suffix of a with t characters is a prefix of b
• May be many edges from a to b
• No self-loops
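
The definition translates almost directly into code. The sketch below represents the multigraph as a list of `(a, b, t)` triples; this representation and the function name are illustrative choices, not something the slides prescribe.

```python
def overlap_multigraph(fragments):
    """Edges (a, b, t): the last t characters of a equal the first t of b."""
    edges = []
    for a in fragments:
        for b in fragments:
            if a == b:
                continue  # no self-loops
            for t in range(min(len(a), len(b)) + 1):
                # t = 0 always qualifies: the empty suffix is a prefix of anything.
                if a[len(a) - t:] == b[:t]:
                    edges.append((a, b, t))
    return edges

for edge in overlap_multigraph(["TACCGT", "CGTGC"]):
    print(edge)

# Two fragments can be joined by several edges of different weight:
print([t for a, b, t in overlap_multigraph(["ATA", "ATAG"]) if a == "ATA"])  # [0, 1, 3]
```

Because every ordered pair has at least the weight-0 edge, the graph is always dense; the many parallel edges are what the later reduction to a simple overlap graph removes.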

Algorithms--Paths originating superstrings
• Each edge e = (f, g) in the path has a certain weight t, which means that the last t bases of the tail f of e are the same as the first t bases of the head g of e
• Figure 4.15
• Example in p.121
• Equation 4.3
• Hamiltonian path: a path that goes through every vertex
• Equation 4.4
• Minimizing |S(P)| is equivalent to maximizing w(P)
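
A small sketch of how a path yields a superstring, and why a heavier path gives a shorter string. Every shared base is written only once, so the length of S(P) equals the total length of the fragments on the path minus the path weight w(P); this is presumably the relation the cited equations express, but the slides do not reproduce them. The function name is hypothetical.

```python
def superstring_from_path(path, weights):
    """Glue the fragments of a path; weights[i] bases are shared by path[i] and path[i + 1]."""
    s = path[0]
    for prev, nxt, t in zip(path, path[1:], weights):
        # The edge weight must describe a real suffix-prefix overlap.
        assert prev[len(prev) - t:] == nxt[:t], "weight is not a valid overlap"
        s += nxt[t:]
    return s

path = ["TTAC", "TACCGT", "CGTGC"]
weights = [3, 3]
s = superstring_from_path(path, weights)
total = sum(len(f) for f in path)

print(s)                                  # TTACCGTGC
print(len(s), total - sum(weights))       # 9 9
```

Since the total fragment length is fixed for a Hamiltonian path, maximizing the weight and minimizing the superstring length are the same optimization.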

Algorithms--Shortest superstrings as paths
• A collection F is said to be substring-free if there are no two distinct strings a and b in F such that a is a substring of b.
• THEOREM 4.1
• COROLLARY 4.1
• LEMMA 4.1
• THEOREM 4.2
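
Checking the substring-free property, and enforcing it by discarding contained fragments, takes only a few lines. This is an illustrative sketch with hypothetical function names; it uses a quadratic scan, which is fine for small collections.

```python
def is_substring_free(fragments):
    """True if no fragment is a substring of a different fragment."""
    return not any(a != b and a in b for a in fragments for b in fragments)

def make_substring_free(fragments):
    """Drop duplicates and every fragment contained in another one, keeping input order."""
    unique = list(dict.fromkeys(fragments))
    return [a for a in unique if not any(a != b and a in b for b in unique)]

frags = ["ACCGT", "CGTGC", "TTAC", "TACCGT"]
print(is_substring_free(frags))     # False: ACCGT is inside TACCGT
print(make_substring_free(frags))   # ['CGTGC', 'TTAC', 'TACCGT']
```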

Algorithms--The greedy algorithm
• Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph.
• OM(F) → OG(F): keep only the heaviest edge between each ordered pair of fragments
• A “greedy” attempt at computing the heaviest path. The basic idea is to continuously add the heaviest available edge

Algorithms--The greedy algorithm
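• Three conditions to test before accepting an edge in our Hamiltonian path: no accepted edge already leaves its tail, no accepted edge already enters its head, and the edge does not close a cycle
• Edges are processed in nonincreasing order by weight
• The procedure ends when we have exactly n-1 edges, or, equivalently,
• when the accepted edges induce a connected subgraph.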
• Figure 4.16
• Example 4.5
• Figure 4.17
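
The greedy procedure can be sketched as below. Assumptions in this sketch: the collection is non-empty and substring-free, zero-weight edges are included so that a Hamiltonian path always exists, and cycle detection uses a small union-find structure, which is an implementation choice of mine. Function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def greedy_superstring(fragments):
    n = len(fragments)
    # Overlap graph: one (heaviest) edge per ordered pair, heaviest first.
    edges = sorted(((overlap(fragments[i], fragments[j]), i, j)
                    for i in range(n) for j in range(n) if i != j),
                   key=lambda e: -e[0])
    succ, pred = {}, {}
    parent = list(range(n))  # union-find over path components

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    accepted = 0
    for t, i, j in edges:
        if accepted == n - 1:
            break  # the accepted edges already form a Hamiltonian path
        # Reject if i already has an outgoing edge, j already has an
        # incoming edge, or i and j are on the same partial path (cycle).
        if i in succ or j in pred or find(i) == find(j):
            continue
        succ[i], pred[j] = (j, t), i
        parent[find(i)] = find(j)
        accepted += 1

    # Walk the path from the only fragment without a predecessor.
    cur = next(i for i in range(n) if i not in pred)
    s = fragments[cur]
    while cur in succ:
        cur, t = succ[cur]
        s += fragments[cur][t:]
    return s

print(greedy_superstring(["CGTGC", "TTAC", "TACCGT"]))  # TTACCGTGC
```

Greedy is a heuristic: it happens to find a shortest superstring on this input, but it is not guaranteed to do so in general.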

Algorithms--Acyclic subgraphs
• Assembling fragments with no errors and known orientation, assuming that the fragments have been obtained from a “good sampling” of the target DNA.
• “Good sampling”: the fragments cover the entire target molecule, and the collection as a whole exhibits enough linkage to guarantee a safe assembly.
• Figure 4.18

Algorithms--Acyclic subgraphs
• The presence of repeated regions, or repeated elements, in the target string S is related to the existence of cycles in the overlap graph.
• Cycles in an overlap graph are necessarily due to repeats in S. The converse is not necessarily true; that is, we may have repeats but still an acyclic overlap graph.
• THEOREM 4.5
• Algorithm: Topological sorting
• Example 4.6
• Figure 4.19, 4.20 and 4.21
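
When the overlap graph is acyclic and contains a Hamiltonian path, topological sorting recovers the fragment order. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The `min_overlap` threshold that decides which overlaps count as edges is an assumption of this example, and the function names are hypothetical.

```python
from graphlib import TopologicalSorter, CycleError

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def assemble_acyclic(fragments, min_overlap=1):
    """Assemble via topological sort; return None if that is not possible."""
    # preds[b] holds every fragment a with an edge a -> b.
    preds = {f: set() for f in fragments}
    for a in fragments:
        for b in fragments:
            if a != b and overlap(a, b) >= min_overlap:
                preds[b].add(a)
    try:
        order = list(TopologicalSorter(preds).static_order())
    except CycleError:
        return None  # a cycle hints at repeats in the target
    if not order:
        return ""
    # A DAG has a Hamiltonian path only if consecutive nodes of a
    # topological order are joined by edges.
    if any(overlap(p, q) < min_overlap for p, q in zip(order, order[1:])):
        return None
    s = order[0]
    for p, q in zip(order, order[1:]):
        s += q[overlap(p, q):]
    return s

frags = ["CGTGC", "TTAC", "TACCGT"]
print(assemble_acyclic(frags, min_overlap=2))  # TTACCGTGC
print(assemble_acyclic(frags, min_overlap=1))  # None
```

With `min_overlap=1` the one-base overlap from `TACCGT` back to `TTAC` creates a cycle, so the threshold controls how much spurious linkage enters the graph.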

Heuristics
• None of the formalisms proposed for fragment assembly are entirely adequate
• Fragment assembly can be viewed as a multiple alignment problem with some additional features:
• Each fragment can participate with either the direct or the reverse-complemented sequence.
• The sequences themselves are usually much shorter than the alignment itself.

Heuristics
• Three criteria according to the second feature:
• Scoring:
• Entropy is a quantity that is defined on a group of relative frequencies; it is low when one of these frequencies stands out from the others, and high when they are all more or less equal
• The lower the entropy, the better
• Coverage:
• A fragment covers a column i if it participates in this column either with a character or with an internal space.
• Linkage: the way individual fragments are linked in the layout is another determinant of layout quality.
• Figure 4.22
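
The entropy criterion for a single layout column can be sketched as follows. The base-2 logarithm is an assumption of this example (the slides do not fix a base), and the function name is hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the character frequencies in one column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(column_entropy("AAAA"))             # unanimous column: entropy is zero
print(round(column_entropy("AAAC"), 3))   # 0.811
print(column_entropy("ACGT"))             # 2.0, all four bases equally frequent
```

A column where one character dominates scores close to zero, which matches the "lower is better" rule; summing the value over all columns gives one possible layout score.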

Heuristics--Assembly in practice
• Practical implementations often divide the whole problem into three phases:
• Finding overlaps
• Building a layout
• Computing the consensus

Heuristics--Assembly in practice

Finding overlaps

• The first step in any assembly problem is fragment overlap detection.
• Determine reverse complements
• Consider fragments entirely contained in other fragments
• Recall Section 3.2.3
• Figure 4.23

Heuristics--Assembly in practice

Ordering fragments

• Finding a good ordering of fragments in a contig
• There is no algorithm that is simple and general enough
• There are four issues to keep in mind when building paths:
• Every path has a corresponding complement path
• It is not necessary to include contained fragments
• Cycles usually indicate the presence of repeats
• Unbalanced coverage may be related to repeats as well (see Figure 4.13)

Heuristics--Assembly in practice

Alignment and consensus

• Building a layout from a path in an overlap graph
• Two techniques related to alignment construction:
• The first one helps in building a good layout from a path in the presence of errors.
• Example 4.7
• Implement: Figure 4.24
• The second one focuses on locally improving an already constructed layout
• Example 4.8 in Figure 4.25
• Implement: sum-of-pairs scoring scheme
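
A sum-of-pairs score for a layout can be sketched as below. The match, mismatch and gap values are placeholders chosen for the example, not values from the slides; padding spaces at the extremities are ignored, and a pair of two gaps scores zero. Function names are hypothetical.

```python
from itertools import combinations

def sum_of_pairs(column, match=1, mismatch=-1, gap=-2):
    """Sum the pairwise scores of all characters in one column."""
    score = 0
    for x, y in combinations(column, 2):
        if x == "-" and y == "-":
            continue          # two gaps aligned: no contribution
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score

def layout_score(rows):
    """Score a layout of equal-length rows, column by column."""
    total = 0
    for col in zip(*rows):
        covered = [c for c in col if c != " "]  # drop padding at the extremities
        total += sum_of_pairs(covered)
    return total

print(sum_of_pairs("AAC"))                       # -1
print(layout_score(["AC-T", "ACGT", " CGT"]))    # 4
```

A local improvement step would shift gaps or fragments in a small window and keep the change only if this score increases.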
