### Fragment assembly of DNA

A typical approach to sequencing long DNA molecules is to sample and then sequence fragments from them.

Biological background

- Problem as puzzle
- We do not know which letter from the set {A, C, G, T} is written on each card, but we do know that cards in the same position of opposite strands form a complementary pair.
- Our goal is to obtain the letters using certain hints, which are (approximate) substrings of the rows.

® Pei-Jie Wu

Biological background

- Target: The long sequence to reconstruct.
- Fragment vs. Subsequence
- Shotgun method: based on fragment overlap
- Fragment assembly: A collection of fragments to put together

Biological background--The ideal case

- Case: p.106
- Align the input set, ignoring spaces at the extremities
- Overlaps: the end part of a fragment is similar to the beginning of another
- Consensus sequence based on majority vote
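
A minimal sketch of the majority vote in the ideal case, assuming the layout (the column where each fragment starts) is already known. The fragments, the offsets, and the names `consensus` and `layout` are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def consensus(layout):
    """Majority-vote consensus of a layout.

    `layout` is a list of (offset, fragment) pairs: the first character
    of the fragment sits in column `offset` of the alignment.
    """
    width = max(offset + len(frag) for offset, frag in layout)
    columns = [Counter() for _ in range(width)]
    for offset, frag in layout:
        for i, base in enumerate(frag):
            columns[offset + i][base] += 1
    # A column that no fragment covers is reported as '-'.
    return "".join(col.most_common(1)[0][0] if col else "-" for col in columns)

# Hypothetical error-free fragments and their offsets.
layout = [(2, "ACCGT"), (4, "CGTGC"), (0, "TTAC"), (1, "TACCGT")]
print(consensus(layout))  # TTACCGTGC
```

Spaces at the extremities are ignored simply because a fragment only votes in the columns it actually occupies.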

Biological background--Complications

- The main factors that add to the complexity of the problem are:
- Error
- Unknown orientation
- Repeated regions
- Lack of coverage.

Biological background--Complications

Errors

- Dealing with errors usually means algorithms that require more time and space.
- The simplest errors are called base call errors and comprise base substitutions, insertions, and deletions in the fragments.
- Base call errors occur in practice at rates varying from 1 to 5 errors every 100 characters.
- Figures 4.2, 4.3, 4.4

Biological background--Complications

Errors

- Two other types of errors: chimeras and contamination
- Chimeras arise when two regular fragments from distinct parts of the target molecule join end-to-end to form a fragment that is not a contiguous part of the target
- Figure 4.5
- Solution: Chimeras must be recognized as such and removed from the fragment set in a preprocessing stage.
- Contamination is from host or vector DNA
- Solution: Most vectors are well known, so we can screen the data before starting assembly.

Biological background--Complications

Unknown orientation

- We generally do not know to which strand a particular fragment belongs.
- We therefore treat the input fragments as approximate substrings of the consensus sought, either as given or in reverse complement.
- Figure 4.6
- Complexity: 2^n orientation combinations to try for n fragments
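
The sketch below illustrates the reverse complement and why trying every orientation is exponential in the number of fragments. The function names and the sample fragments are assumptions made for the example.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(fragment):
    """Complement every base, then reverse the result."""
    return fragment.translate(COMPLEMENT)[::-1]

def orientations(fragments):
    """Every way of orienting the fragments: 2**n tuples for n fragments."""
    choices = [(f, reverse_complement(f)) for f in fragments]
    return product(*choices)

frags = ["ACCGT", "GCAC", "TTAC"]           # hypothetical fragments
print(reverse_complement("ACCGT"))           # ACGGT
print(sum(1 for _ in orientations(frags)))   # 8
```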

Biological background--Complications

Repeated regions

- Repeats are sequences that appear two or more times in the target molecule.
- Short repeats
- Longer repeats
- If the level of similarity between two copies of a repeat is high enough, the differences can be mistaken for base call errors
- Figure 4.7

Biological background--Complications

Repeated regions

- Problems:
- If a fragment is totally contained in a repeat, we may have several places to put it in the final alignment. When the copies are not exactly equal, we may weaken the consensus by placing a fragment in the wrong copy.
- Repeats can be positioned in such a way as to render assembly inherently ambiguous. (Figures 4.8 and 4.9)
- Direct repeats: repeated copies in the same strand.
- Inverted repeats: repeated regions in opposite strands (Figure 4.10)

Biological background--Complications

Lack of coverage

- Coverage: the coverage at position i of the target is the number of fragments that cover this position.
- Contigs: The contiguously covered regions
- Figure 4.11
- Solutions:
- Sampling more fragments
- Directed sequencing or walking
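
As a rough illustration of coverage and contigs, the sketch below assumes we already know where each fragment lies on the target, given as half-open intervals. That information is of course not available in a real project; the interval values and function names are assumptions for the example.

```python
def coverage(target_length, intervals):
    """Number of fragments covering each position; intervals are [start, end)."""
    cov = [0] * target_length
    for start, end in intervals:
        for i in range(start, end):
            cov[i] += 1
    return cov

def contigs(cov):
    """Maximal runs of positions with nonzero coverage, as (start, end) pairs."""
    runs, start = [], None
    for i, c in enumerate(cov):
        if c > 0 and start is None:
            start = i
        elif c == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(cov)))
    return runs

intervals = [(0, 5), (3, 8), (10, 14)]   # hypothetical fragment positions
cov = coverage(15, intervals)
print(contigs(cov))                       # [(0, 8), (10, 14)]
```

Positions 8 and 9 have coverage zero, so the sample splits into two contigs; sampling more fragments or directed sequencing would be needed to close the gap.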

Biological background--Alternative methods for DNA sequencing

- Directed sequencing: a method that can be used to cover small remaining gaps in a shotgun project.
- Problems:
- It is expensive to build special primers
- Sequential rather than parallel
- Sequencing by hybridization (SBH): assembling the target molecule based on many hybridization experiments with very short, fixed-length sequences called probes.

Models

- Shortest common superstring (SCS)
- RECONSTRUCTION
- MULTICONTIG
- All three assume that the fragment collection is free of contamination and chimeras.

Models--Shortest common superstring

- Seeking the shortest superstring of a collection of given strings
- PROBLEM: Shortest common superstring (SCS)
- INPUT: a collection F of strings.
- OUTPUT: a shortest possible string S such that for every f ∈ F, S is a superstring of f.
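
To make the SCS problem statement concrete, here is a brute-force sketch that tries every ordering of the fragments. It is only feasible for a handful of strings and is not an algorithm from the slides; the function names and the sample collection are assumptions.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def shortest_common_superstring(fragments):
    """Exhaustive search over fragment orders; exponential, tiny inputs only."""
    # Fragments contained in another fragment add nothing to the superstring.
    frags = sorted(f for f in set(fragments)
                   if not any(f != g and f in g for g in fragments))
    if not frags:
        return ""
    best = None
    for order in permutations(frags):
        s = order[0]
        for prev, frag in zip(order, order[1:]):
            s += frag[overlap(prev, frag):]   # append only the non-overlapping tail
        if best is None or len(s) < len(best):
            best = s
    return best

print(shortest_common_superstring(["ACT", "CTA", "AGT"]))  # ACTAGT
```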

Models--Shortest common superstring

- Example 4.1
- Example 4.2
- Figure 4.12
- Figure 4.13
- When the target has repeats, a shortest superstring may contain only one copy of the repeat, which will absorb all fragments totally contained in any of the copies

Models--Reconstruction

- Takes into account both errors and unknown orientation
- Dynamic programming sequence comparison algorithm
- Use distance rather than similarity
- Expression: p.116

Models--Reconstruction

- PROBLEM: RECONSTRUCTION
- INPUT: a collection F of strings and an error tolerance ε between 0 and 1.
- OUTPUT: (p.117)
- Find a string S as short as possible such that, for every f ∈ F, either f or its reverse complement is an approximate substring of S at error level ε
- Does not model repeats, lack of coverage, and size of target
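
The sketch below checks whether a candidate string S satisfies the RECONSTRUCTION constraint; it does not search for the shortest such S. It assumes that "approximate substring at error level ε" means that the edit distance between f and the best-matching substring of S is at most ε·|f|, computed with the usual dynamic programming recurrence. All names and sample strings are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(fragment):
    return fragment.translate(COMPLEMENT)[::-1]

def substring_distance(f, s):
    """Minimum edit distance between f and any substring of s."""
    prev = [0] * (len(s) + 1)            # the empty prefix of f matches anywhere for free
    for i in range(1, len(f) + 1):
        cur = [i] + [0] * len(s)         # f[:i] against an empty substring costs i
        for j in range(1, len(s) + 1):
            cur[j] = min(prev[j - 1] + (f[i - 1] != s[j - 1]),  # match or substitution
                         prev[j] + 1,                            # f[i-1] left unmatched
                         cur[j - 1] + 1)                         # s[j-1] left unmatched
        prev = cur
    return min(prev)                     # the match may end anywhere in s

def is_reconstruction(s, fragments, eps):
    """True if every fragment, in one of its two orientations, fits s within eps * |f| edits."""
    return all(min(substring_distance(f, s),
                   substring_distance(reverse_complement(f), s)) <= eps * len(f)
               for f in fragments)

S = "TTACCGTGC"                          # hypothetical candidate
F = ["ACCGT", "ACGGT", "CGAGC"]          # hypothetical fragments
print(is_reconstruction(S, F, 0.25))     # True
print(is_reconstruction(S, F, 0.10))     # False
```

In the example, "ACGGT" fits only through its reverse complement "ACCGT", and "CGAGC" needs one substitution, which is allowed at ε = 0.25 (1 ≤ 1.25) but not at ε = 0.10.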

Models--Multicontig

- Involve internal linkage of the fragments in the layout
- Nonlink: an overlap between two fragments for which some other fragment properly contains the overlap on both sides
- Weakest link: the smallest size of any link
- t-contig: a layout whose weakest link is at least as large as t
- Example 4.4
- Definition: p.119

Algorithms--Representing overlaps

- The overlap multigraph OM(F) of a collection F is a directed, weighted multigraph
- The set V of nodes of this structure is just F itself.
- A directed edge from a to a different fragment b with weight t ≥ 0 exists if the suffix of a with t characters is a prefix of b
- May be many edges from a to b
- No self-loops
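
A small sketch of how the edges of OM(F) could be enumerated directly from the definition above. The function name and the two sample fragments are assumptions; the quadratic, character-by-character comparison is chosen for clarity, not efficiency.

```python
def overlap_multigraph(fragments):
    """Edges (a, b, t): the last t characters of a equal the first t of b."""
    edges = []
    for a in fragments:
        for b in fragments:
            if a == b:
                continue                      # no self-loops
            for t in range(min(len(a), len(b)) + 1):
                if a[len(a) - t:] == b[:t]:   # t = 0 always qualifies
                    edges.append((a, b, t))
    return edges

for edge in overlap_multigraph(["GATA", "ATAG"]):
    print(edge)
```

For this pair there are three edges from GATA to ATAG (weights 0, 1, and 3) and two in the other direction (weights 0 and 1), which shows why the structure is a multigraph.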

Algorithms--Paths originating superstrings

- Edge e = (f, g) in the path has a certain weight t, which means that the last t bases of the tail f of e are a prefix of the head g of e
- Figure 4.15
- Example in p.121
- Equation 4.3
- Hamiltonian path: a path that goes through every vertex exactly once
- Equation 4.4
- Minimizing |S(P)| is equivalent to maximizing w(P)
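
The following sketch spells out the superstring S(P) of a path and checks the length relation behind the last bullet: for a Hamiltonian path, the length of S(P) is the total length of the fragments minus the path weight w(P), so a heavier path gives a shorter superstring. The function name, the path, and the weights are assumptions for the example.

```python
def superstring_from_path(path, weights):
    """Spell S(P) for a path f1, ..., fk; weights[i] is the weight of edge (f_i, f_{i+1})."""
    s = path[0]
    for prev, frag, t in zip(path, path[1:], weights):
        if not prev.endswith(frag[:t]):
            raise ValueError(f"{prev!r} -> {frag!r} is not an overlap of weight {t}")
        s += frag[t:]          # skip the t bases already supplied by prev
    return s

path = ["ACT", "CTA", "AGT"]   # hypothetical Hamiltonian path
weights = [2, 1]               # w(P) = 3
s = superstring_from_path(path, weights)
print(s)                                                        # ACTAGT
print(len(s) == sum(len(f) for f in path) - sum(weights))       # True
```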

Algorithms--Shortest superstrings as paths

- A collection F is said to be substring-free if there are no two distinct strings a and b in F such that a is a substring of b.
- THEOREM 4.1
- COROLLARY 4.1
- LEMMA 4.1
- THEOREM 4.2
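
A collection can be made substring-free by dropping every fragment contained in another one; any superstring of the remaining fragments is also a superstring of the dropped ones. A minimal sketch, with an assumed function name and sample data:

```python
def make_substring_free(fragments):
    """Remove duplicates and every fragment that is a substring of another."""
    unique = sorted(set(fragments))
    return [f for f in unique if not any(f != g and f in g for g in unique)]

print(make_substring_free(["ACCGT", "CCG", "GTTA", "ACCGT"]))  # ['ACCGT', 'GTTA']
```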

Algorithms--The greedy algorithm

- Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph.
- OM(F) → OG(F): the overlap graph keeps only the heaviest edge between each pair of nodes
- The algorithm is a “greedy” attempt at computing the heaviest path. The basic idea is to continuously add the heaviest available edge

Algorithms--The greedy algorithm

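- Three conditions we have to test before accepting an edge in our Hamiltonian path: no accepted edge already leaves the same node, no accepted edge already enters the same node, and the new edge does not close a cycle
- Edges are processed in nonincreasing order by weight
- The procedure ends when we have exactly n-1 edges, or, equivalently, when the accepted edges induce a connected subgraph.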
- Figure 4.16
- Example 4.5
- Figure 4.17
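
A sketch of the greedy procedure under stated assumptions: the input is substring-free, only the heaviest edge between each pair of fragments is considered, and an edge is rejected if its tail already has an outgoing edge, its head already has an incoming edge, or both ends are already in the same partial path (which would close a cycle). The data structures and names are illustrative choices, and the result is not guaranteed to be a shortest superstring.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def greedy_superstring(fragments):
    """Greedy Hamiltonian path in the overlap graph of a nonempty substring-free list."""
    n = len(fragments)
    edges = sorted(((overlap(a, b), i, j)
                    for i, a in enumerate(fragments)
                    for j, b in enumerate(fragments) if i != j),
                   key=lambda e: -e[0])       # nonincreasing order by weight
    succ, pred = {}, {}                       # accepted edges, indexed by tail and by head
    component = list(range(n))                # label of the partial path holding each node
    accepted = 0
    for t, i, j in edges:
        if accepted == n - 1:
            break
        if i in succ or j in pred or component[i] == component[j]:
            continue                          # one of the three conditions fails
        succ[i], pred[j] = (j, t), i
        old, new = component[j], component[i]
        component = [new if c == old else c for c in component]
        accepted += 1
    cur = next(i for i in range(n) if i not in pred)   # start of the path
    s = fragments[cur]
    while cur in succ:
        cur, t = succ[cur]
        s += fragments[cur][t:]
    return s

print(greedy_superstring(["ACT", "CTA", "AGT"]))  # ACTAGT
```

In this run the edge CTA → ACT (weight 1) is rejected because it would close a cycle with the already accepted edge ACT → CTA (weight 2).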

Algorithms--Acyclic subgraphs

- Assembling fragments without errors and with known orientation, assuming that the fragments have been obtained from a “good sampling” of the target DNA.
- “Good sampling”: the fragments cover the entire target molecule, and the collection as a whole exhibits enough linkage to guarantee a safe assembly.
- Figure 4.18

Algorithms--Acyclic subgraphs

- The presence of repeated regions, or repeated elements, in the target string S is related to the existence of cycles in the overlap graph.
- Cycles in an overlap graph are necessarily due to repeats in S. The converse is not necessarily true; that is, we may have repeats but still an acyclic overlap graph.
- THEOREM 4.5
- Algorithm: Topological sorting
- Example 4.6
- Figure 4.19, 4.20 and 4.21
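
A sketch of assembly by topological sorting, assuming error-free fragments of known orientation, a substring-free collection, and an overlap graph that keeps only edges of weight at least `min_overlap` (a hypothetical threshold parameter). It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+); the fragments are made up for the example.

```python
from graphlib import TopologicalSorter

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def assemble_acyclic(fragments, min_overlap):
    """Order the fragments by topological sort and spell out the target."""
    # preds[b] holds every fragment a with an edge a -> b of weight >= min_overlap.
    preds = {b: {a for a in fragments if a != b and overlap(a, b) >= min_overlap}
             for b in fragments}
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    order = list(TopologicalSorter(preds).static_order())
    s = order[0]
    for prev, frag in zip(order, order[1:]):
        t = overlap(prev, frag)
        if t < min_overlap:
            raise ValueError("consecutive fragments are not linked; sampling too sparse")
        s += frag[t:]
    return s

frags = ["CGTGCA", "TTACCG", "TGCAAT", "ACCGTG"]
print(assemble_acyclic(frags, min_overlap=3))  # TTACCGTGCAAT
```

When the acyclic graph contains a Hamiltonian path, the topological order is unique and coincides with that path, which is why consecutive fragments in the order can simply be merged.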

Heuristics

- None of the formalisms proposed for fragment assembly are entirely adequate
- Fragment assembly can be viewed as a multiple alignment problem with some additional features:
- Each fragment can participate with either the direct or the reverse-complemented sequence.
- The sequences themselves are usually much shorter than the alignment itself.

Heuristics

- Three criteria according to the second feature:
- Scoring
- Entropy is a quantity that is defined on a group of relative frequencies; it is low when one of these frequencies stands out from the others, and high when they are all more or less equal
- The lower the entropy, the better
- Coverage:
- A fragment covers a column i if it participates in this column either with a character or with an internal space.
- Linkage
- The way individual fragments are linked in the layout is another determinant of layout quality.
- Figure 4.22
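
To illustrate the entropy criterion, the sketch below computes the entropy of a single alignment column, assuming the usual Shannon entropy with base-2 logarithms over the relative frequencies of the symbols in the column. The function name and sample columns are assumptions.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of the symbol frequencies in one column."""
    counts = Counter(column)
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values())

print(column_entropy("AAAA"))            # 0.0  (one symbol stands out completely)
print(column_entropy("ACGT"))            # 2.0  (all frequencies equal)
print(round(column_entropy("AAAC"), 4))  # 0.8113
```

A layout could then be scored by summing this quantity over all columns, with lower totals indicating stronger agreement among the fragments.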

Heuristics--Assembly in practice

- Practical implementations often divide the whole problem into three phases:
- Finding overlaps
- Building a layout
- Computing the consensus

Heuristics--Assembly in practice

Finding overlaps

- The first step in any assembly problem is fragment overlap detection.
- Determine reverse complements
- Consider fragments entirely contained in other fragments
- Recall Section 3.2.3
- Figure 4.23

Heuristics--Assembly in practice

Ordering fragments

- Finding a good ordering of fragments in a contig
- There is no algorithm that is simple and general enough
- There are four issues to keep in mind when building paths:
- Every path has a corresponding complement path
- It is not necessary to include contained fragments
- Cycles usually indicate the presence of repeats
- Unbalanced coverage may be related to repeats as well (see Figure 4.13)

Heuristics--Assembly in practice

Alignment and consensus

- Building a layout from a path in an overlap graph
- Two techniques related to alignment construction:
- The first one helps in building a good layout from a path in the presence of errors.
- Example 4.7
- Implement: Figure 4.24
- The second one focuses on locally improving an already constructed layout
- Example 4.8 in Figure 4.25
- Implement: sum-of-pairs scoring scheme
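
A sketch of a sum-of-pairs score that could be used to compare two versions of a layout during local improvement. The scoring values (match 1, mismatch -1, space -2), the convention that '-' is an internal space and a blank marks columns outside a fragment, and the two small layouts are all assumptions for illustration.

```python
from itertools import combinations

def sum_of_pairs(rows, match=1, mismatch=-1, space=-2):
    """Sum-of-pairs score of a layout given as equal-length rows."""
    score = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == " " or y == " ":
                continue            # the fragment does not reach this column
            if x == "-" and y == "-":
                continue            # two spaces facing each other score zero
            if x == "-" or y == "-":
                score += space
            elif x == y:
                score += match
            else:
                score += mismatch
    return score

before = ["ACT-GG", "ACTTGG", "AC-TGG"]
after = ["AC-TGG", "ACTTGG", "AC-TGG"]   # the space in the first row moved one column left
print(sum_of_pairs(before), sum_of_pairs(after))  # 6 11
```

Moving the space so that it lines up with the space in the third row raises the score from 6 to 11, which is the kind of local change such a scheme is meant to reward.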
