This lecture delves into the evolution of genome sequencing techniques, focusing on the Sanger method introduced in 1982, which paved the way for automation. We explore two primary approaches to whole-genome sequencing: Clone-By-Clone and Shotgun Assembly, highlighting their respective benefits and challenges. The successful sequencing of Drosophila melanogaster and Homo sapiens serves as case studies, illustrating the complexities of sequencing, assembly processes, and the importance of coverage and error handling in generating accurate genomic data.
ECE697S: Topics in Computational Biology Lecture 4: Sequence Assembly
Modern Sequencing Methods • Sanger (1982) introduced a sequencing method amenable to automation. • Whole-genome sequencing: Clone-By-Clone vs. Shotgun Assembly • Drosophila melanogaster sequenced (Myers et al. 2000) • Homo sapiens sequenced (Venter et al. 2001)
Sanger (1982) introduced chain-termination sequencing. Main idea: Obtain fragments of all possible lengths, each ending in a known base (A, C, T, or G). Using gel electrophoresis, we can separate fragments by length, and then read off the sequence from shortest to longest.
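The chain-termination idea can be illustrated with a toy simulation. This is a minimal sketch under idealized assumptions (every fragment length is observed exactly once, and strand complementarity is ignored); the function name `sanger_read` is hypothetical.

```python
def sanger_read(template):
    """Recover a sequence from idealized chain-termination fragments."""
    # One reaction per base: lane X holds the lengths of all fragments
    # (prefixes of the template) that terminate in base X.
    lanes = {
        base: [i + 1 for i, b in enumerate(template) if b == base]
        for base in "ACGT"
    }
    # Gel electrophoresis orders fragments by length. Reading the lanes
    # from the shortest fragment to the longest spells out the sequence.
    by_length = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in by_length)


print(sanger_read("GATTACA"))
```

Each fragment length appears in exactly one lane, so sorting by length is enough to recover the order of the bases.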
Automated Sequencing Perkin-Elmer 3700: Can sequence ~500bp with 98.5% accuracy
Reads and Contigs Sequencing machines are limited to reads of about 500-750bp, so we must break up DNA into short and long fragments, with reads on either end. Reads are then assembled into contigs, then scaffolds.
Clone-by-Clone vs. Shotgun • Traditionally, long fragments are mapped, and then assembled by finding a minimum tiling path. Then, shotgun assembly is used to sequence long fragments. • Shotgun assembly is cheaper, but requires more computational resources. • Drosophila was successfully sequenced using shotgun assembly.
Difficulties • Good coverage does not guarantee that we can “see” repeats. • Read coverage is generally not “truly” random, due to complications in fragmentation and cloning. • Any automated approach requires extensive post-processing.
The Fruit Fly • Drosophila melanogaster was sequenced in 2000 using whole genome shotgun assembly. • Genome size is ~120Mbp for euchromatic (coding) portion, with roughly 13,600 genes. • The genome is still being refined.
NIH used a Clone-By-Clone strategy; Celera used shotgun assembly. Celera used 300 sequencing machines in parallel to obtain 175,000 reads per day. Efforts were combined, resulting in 8x coverage of the human genome; consensus sequence is 2.91 billion base pairs.
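To get a feel for these numbers, here is a back-of-the-envelope sketch of the coverage arithmetic. It assumes coverage is defined as (number of reads × read length) / genome length, a read length of 500bp (the figure from the automated sequencing slide), and uniformly random read placement for the uncovered-fraction estimate. The function names are hypothetical, and the results are rough estimates, not figures from the sequencing projects.

```python
import math


def coverage(num_reads, read_len, genome_len):
    """Average number of reads covering each base."""
    return num_reads * read_len / genome_len


def reads_needed(target_coverage, read_len, genome_len):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_len / read_len)


def expected_uncovered_fraction(c):
    """Poisson estimate of the fraction of bases hit by no read,
    assuming reads land uniformly at random (an idealization)."""
    return math.exp(-c)


genome_len = 2_910_000_000   # consensus length quoted above
read_len = 500               # assumed read length
n = reads_needed(8, read_len, genome_len)
print(n, "reads for 8x coverage")
print(n / 175_000, "days at 175,000 reads per day")
print(expected_uncovered_fraction(8), "expected uncovered fraction")
```

Since real read placement is not truly random, the Poisson estimate should be read as an optimistic bound on how little of the genome is missed.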
Human Genome Sequence • Taken as the consensus sequence among 5 subjects. • Gene identification is by homology; we have around 20,000 genes. • Euchromatic DNA is “coding”, rest is “junk”.
Abstraction • The basic question is: given a set of fragments from a long string, can we reconstruct the string? • What is the shortest common superstring of the given fragments?
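For very small inputs, the shortest common superstring can be found by brute force. The sketch below tries every ordering of the fragments and merges neighbors at their maximal overlap. It assumes error-free fragments on a single strand, and the helper names `overlap` and `shortest_common_superstring` are hypothetical. The running time is factorial in the number of fragments, so this is only an illustration of the problem statement.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def shortest_common_superstring(fragments):
    """Exhaustive search over fragment orderings (tiny inputs only)."""
    unique = sorted(set(fragments))
    # Fragments contained in another fragment never change the answer.
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    if not frags:
        return ""
    best = None
    for order in permutations(frags):
        s = order[0]
        for prev, nxt in zip(order, order[1:]):
            s += nxt[overlap(prev, nxt):]
        if best is None or len(s) < len(best):
            best = s
    return best


print(shortest_common_superstring(["AGGT", "GGTC", "GTCA"]))
```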
Overlap-Layout-Consensus • Construct a (directed) overlap graph, where nodes represent reads and edges represent overlap. Paths are contigs in this graph. • Problem: Find the consensus sequence by finding a path that visits all nodes in layout graph. • Note: This is an idealization, since we must handle errors!
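The overlap and layout steps can be sketched as follows, assuming exact (error-free) overlaps. The `min_overlap` threshold and the function names are hypothetical choices for illustration; a real assembler would use approximate alignment instead of exact suffix-prefix matching.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_overlap=3):
    """Directed graph: graph[a][b] = overlap length from read a to read b."""
    graph = {r: {} for r in reads}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b)
                if n >= min_overlap:
                    graph[a][b] = n
    return graph


def spell_path(path, graph):
    """Merge the reads along a path in the overlap graph into one contig."""
    contig = path[0]
    for prev, nxt in zip(path, path[1:]):
        contig += nxt[graph[prev][nxt]:]
    return contig


reads = ["ATGCG", "GCGTA", "GTACC"]
g = build_overlap_graph(reads)
print(g)
print(spell_path(reads, g))
```

Comparing all pairs of reads is quadratic in the number of reads, which is one reason the overlap step dominates the computational cost of this paradigm.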
Approximation Algorithms • The shortest common superstring problem is NP-complete. • Greedily choosing edges is a 4-approximation, conjectured to be a 2-approximation. • Another idea: TSP has a 2-approximation if the edge weights are metric (Waterman et al. 1976 gives such metrics).
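A minimal sketch of the greedy idea: repeatedly merge the pair of strings with the largest overlap until one string remains. This is one straightforward variant written for clarity (cubic time, exact overlaps only); the function names are hypothetical, and tie-breaking here is arbitrary.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(fragments):
    """Greedy merging: always join the pair with the largest overlap."""
    unique = sorted(set(fragments))
    # Drop fragments that are substrings of other fragments.
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    while len(frags) > 1:
        best_n, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


print(greedy_superstring(["AGGT", "GGTC", "GTCA"]))
```

The output is always a common superstring of the input, but it is not guaranteed to be the shortest one.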
Handling Repeats • We can estimate how many reads should cover a given region, based on the overall coverage of the genome. • Repeats will “seem” to have unusually good coverage. • Celera’s algorithms are proprietary, but there is no explicit way to handle repeats in the overlap-layout-consensus paradigm.
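The coverage heuristic can be sketched as a simple depth check: count how many reads cover each position of a contig and flag positions whose depth is well above what is expected. The `factor` threshold and the function names are hypothetical, and this is not a description of any particular assembler's method.

```python
def depth_profile(read_starts, read_len, contig_len):
    """Number of reads covering each position of a contig."""
    depth = [0] * contig_len
    for start in read_starts:
        for pos in range(start, min(start + read_len, contig_len)):
            depth[pos] += 1
    return depth


def flag_overcovered(depth, expected, factor=2.0):
    """Positions whose depth is at least factor times the expected depth.
    Such regions are candidates for collapsed repeats."""
    return [i for i, d in enumerate(depth) if d >= factor * expected]


# Toy layout: three reads pile up at position 2, as a collapsed repeat would.
depth = depth_profile([0, 2, 2, 2, 4], read_len=3, contig_len=8)
print(depth)
print(flag_overcovered(depth, expected=1.5))
```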
Hybridization Suppose we had a way to probe fragments of length k that were present in our sequence, from a hybridization assay. Commercial products: Affymetrix GeneChip, Agilent, Amersham, etc.
Sequencing-By-Hybridization • Then instead of reads, we have regularly sized fragments, k-mers. • Construct a multigraph G with (k-1)-mers as nodes, with edges representing k-mers. G is a de Bruijn graph. • Idea: An Eulerian path in G corresponds to the assembled sequence, and we don’t lose repeats (Pevzner 1989).
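The graph construction above can be sketched in a few lines. Here `kmer_spectrum` stands in for idealized, error-free hybridization data, and both function names are hypothetical.

```python
from collections import defaultdict


def kmer_spectrum(seq, k):
    """All k-mers of seq, in order of appearance (with multiplicity)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(kmers):
    """Multigraph with (k-1)-mers as nodes and one edge per k-mer.
    Each k-mer contributes an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


print(de_bruijn(kmer_spectrum("ATGGCGT", 3)))
```

A repeated k-mer shows up as a repeated edge between the same two nodes, which is how the multigraph keeps track of repeats.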
Bridges of Königsberg Theorem (Euler 1736): A graph has a path visiting every edge exactly once if and only if it is connected and has 2 or fewer vertices of odd degree.
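Euler's condition is easy to check directly. The sketch below handles the undirected case stated in the theorem, ignoring isolated vertices; the function name `has_eulerian_path` and the vertex labels for the Königsberg land masses are hypothetical. For the directed de Bruijn graph, the analogous condition compares in-degree and out-degree instead of parity.

```python
from collections import defaultdict


def has_eulerian_path(edges):
    """True if the undirected multigraph has a path using every edge once."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    if not adj:
        return True
    # Connectivity check by depth-first search from an arbitrary vertex.
    start = next(iter(adj))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    if len(seen) != len(adj):
        return False
    # Count vertices of odd degree: there must be 0 or 2 of them.
    odd = sum(1 for v in adj if len(adj[v]) % 2 == 1)
    return odd in (0, 2)


# Seven bridges over four land masses: every land mass has odd degree.
konigsberg = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
              ("A", "D"), ("B", "D"), ("C", "D")]
print(has_eulerian_path(konigsberg))
```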
Pros and Cons • An Eulerian path in a graph can be found in linear time, if one exists. • Errors in the hybridization experiments may prevent us from finding a solution. • Can we just use reads as “virtual” hybridization data?
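The linear-time claim can be made concrete with Hierholzer's algorithm, a standard method for finding Eulerian paths. The sketch below works on a directed multigraph given as a list of edges and returns `None` when no Eulerian path exists. The function names are hypothetical, and the input is assumed to be error-free.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Hierholzer's algorithm on a directed multigraph.
    Returns the list of nodes along the path, or None if there is none."""
    if not edges:
        return []
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())      # dead end: emit node
    path.reverse()
    # Every edge is used at most once, so a full-length walk is Eulerian.
    return path if len(path) == len(edges) + 1 else None


def spell(path):
    """Sequence spelled by a path of (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


edges = [("AT", "TG"), ("TG", "GG"), ("GG", "GC"), ("GC", "CG"), ("CG", "GT")]
print(spell(eulerian_path(edges)))
```

Each edge is pushed and popped once, so the running time is proportional to the number of edges. When the graph has several Eulerian paths, which is typical in the presence of repeats, this returns just one of them.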
Graph Preprocessing • A single read error produces up to k missing edges and up to k erroneous edges. But we cannot correct this until we are done assembling! • Greedily mutate reads to minimize size of set of k-mers. • We also need to deal with repeats, which requires contracting certain paths to single edges…
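The effect of one read error on the k-mer set can be checked with a small experiment. This sketch compares the k-mers of a true sequence with those of a read carrying a single substitution; the sequences and the function names are made up for illustration.

```python
def kmer_set(seq, k):
    """Set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_damage(true_seq, read, k):
    """k-mers lost and k-mers introduced by errors in the read."""
    true_kmers = kmer_set(true_seq, k)
    read_kmers = kmer_set(read, k)
    missing = true_kmers - read_kmers    # correct edges that disappear
    spurious = read_kmers - true_kmers   # erroneous edges that appear
    return missing, spurious


true_seq = "ACGTTGCATGGA"
read = "ACGTTCCATGGA"  # single substitution at position 5 (G -> C)
missing, spurious = kmer_damage(true_seq, read, k=4)
print(sorted(missing))
print(sorted(spurious))
```

With k = 4, the substituted base lies inside four overlapping k-mers, so four correct edges are lost and four erroneous ones appear.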
Multiple Sequence Alignment • Construct a de Bruijn graph as before where each sequence is a path in the graph. • Find the heaviest paths in this graph; these are “consensus” sequences. • Align each consensus sequence with library sequences to find common subsequences.