Ab Initio Whole Genome Shotgun Assembly With Mated Short Reads

Paul Medvedev’s brain + Michael Brudno’s brain = Brainchild! Ab Initio Whole Genome Shotgun Assembly With Mated Short Reads Presented by Lucas Lochovsky

Outline • Whole Genome Assembly • Review of Related Work • The Medvedev-Brudno Method • Methods: Bidirected Overlap Graph • Methods: Adjustments to the Standard Min-cost Biflow Problem • Methods: Maximizing the Global Read-Count Likelihood • Methods: Efficiently Solving a Min-cost Biflow (Linear) • Methods: Show Me the Contigs • Results • Discussion

Who Wants to Assemble an Entire Genome? Two approaches: I) Clone-by-clone assembly (a.k.a. “hierarchical shotgun” approach) • Break genome into 150 kb fragments • Insert into BAC and map onto chromosomes • Shotgun sequence each BAC and assemble into a contig

Who Wants to Assemble an Entire Genome? (cont’d) II) Whole Genome Shotgun (WGS) Assembly • Break genome into shotgun-sized fragments and sequence • Match the overlapping regions of contiguous sequences • Demonstrated by Celera Genomics to be feasible for whole genome assembly • Sequenced human genome at 1/10’th the cost of the public Human Genome Project

Who Wants to Assemble an Entire Genome? (cont’d) Next Generation Sequencing (NGS): potential third player? • Improved speed and cost-effectiveness relative to the other methods… • … but much shorter read length (25-200 bp) • Only proven on resequencing projects, i.e. a reference genome is already available • Not yet proven on ab initio genome sequencing • No advance knowledge available for ab initio (Def’n: “from first principles”) genome sequencing

Review of Related Work • Genome assembly intuition: identify shortest common superstring of all reads • Equivalent to finding the shortest genome that is possible with the given reads • Problem: each repeat only used once – “over-collapsing” the repeats • Accommodate multiple uses of repeat regions: makes sequence assembly amenable to graph-theoretic methods

Review of Related Work (cont’d) EULER assembler (Pevzner, Tang and Waterman) • Represent reads as edges and overlaps as vertices in a de Bruijn graph • Assembly can be efficiently solved as an Eulerian Path Problem: each edge must be visited exactly once • Repeats dealt with by using multiple edges for a single repeat read

Review of Related Work (cont’d) String graph (Myers) • Represent reads as vertices, and read overlaps as edges • Remove redundant edges • Establish edge constraints • Unique? (flow is exactly one) • Required? (min. flow is 1) • Optional? (min. flow is 0) • Find shortest walk

Review of Related Work (cont’d) • Flow entering and exiting each vertex in these graphs must be balanced • Gave rise to the idea of network flow methods for assembly • EULER assembler: minimum cost circulation problem • String graph: place constraints on copy-counts before solving the flow • Network flow fails for the assembly problem when there are repeats longer than the span of a single read

Review of Related Work (cont’d) Sequence assembly using NGS • Several methods available now (e.g. SSAKE, VCAKE, SHARCGS, etc.) • All of these assume that the length of the assembled genome must be minimized • Results in over-collapsing of repeats • Given ubiquity of repeats in eukaryotic genomes, authors considered this a poor assumption

The Medvedev-Brudno Method • Change goal of sequence assembly • Maximize the likelihood that the resultant genome was the source of the given reads • Take advantage of the high coverage of NGS to statistically estimate the copy-count of each read: identify and quantify repeats • Maximizing the likelihood of observed read frequencies can be cast as mininum cost bidirected flow (biflow) problem • Allows solution to be obtained with an off-the-shelf network flow solver • Authors claim 99.99% accuracy

The Medvedev-Brudno Method (cont’d) • Second important aspect is the use of matepair information for joining contigs • Other systems look for all paths between mated reads • The Medvedev-Brudno Method looks only for short paths between some pairs of reads • Question: How do you decide the upper bound for these “short paths”? And how do you decide which pairs of reads to examine?

Methods: Bidirected Overlap Graph • Bidirected graphs are kind of like directed graphs, except each edge has an orientation on each of its ends • Gives rise to three types of edges: • Edges where one arrow points out of a vertex, and one arrow points into a vertex • Edges with both arrows pointing out, and • Edges with both arrows pointing in (easiest one to do in PowerPoint!) • For a walk in a bidirected graph, for each vertex on that walk, the orientation of the edge entering the vertex must be opposite that of the edge leaving the vertex

Methods: Bidirected Overlap Graph (cont’d) • In a bidirected overlap graph, each vertex is a double-stranded read • Edges represent read overlaps • Three possible ways that two double-stranded reads can overlap (corresponds to the three types of edges) • Suppose we have two ds reads r1 and r2 • Each read can be oriented to the left or to the right • The three possible overlaps are: • i) Both strands point in the same direction (both reads can point left, or both can point right, it’s the same overlap either way) • ii) r1 points left and r2 points right • iii) r1 points right and r2 points left

Methods: Bidirected Overlap Graph (cont’d) • A walk along this graph that visits every vertex at least once produces the original double-stranded genome (under the assumptions that the whole genome was covered by the reads, and that the reads are error-free) • In this paper, the overlap graph is constructed by placing an edge between two reads if they overlap by a minimum number of characters omin • Question: How is omin determined? • Then perform transitive edge reduction: remove overlaps covered by two shorter overlaps

Methods: Adjustments to the Standard Min-cost Biflow Problem Standard Min-cost Biflow Problem • Set upper and lower flow bounds on each edge • Flow function f : E→ N must obey the constraint for each edge e • For each vertex, the incoming flow is balanced with the outgoing flow • Objective: Find the flow that minimizes

Methods: Adjustments to the Standard Min-cost Biflow Problem (cont’d) Medvedev-Brudno Min-cost Biflow Problem • Upper and lower flow bounds on vertices as well • Accomplished by splitting every vertex v into two: v+ and v- • v- serves as the “incoming” vertex, and inherits v’s incoming edges • v+ serves as the “outgoing” vertex, and inherits v’s outgoing edges • Finally add one edge between v- and v+ and assign it the upper and lower flow bounds for v

Methods: Adjustments to the Standard Min-cost Biflow Problem (cont’d) • Second variation: represent the cost ce as a convex function • A function is convex if every point on or above it forms a convex set • A convex set refers to an area where, for every pair of points within that area, every point on the straight line segment connecting those points also lies within that area

Methods: Adjustments to the Standard Min-cost Biflow Problem (cont’d) • An area that is not convex would have some sort of concave portion that would contradict the above property of convex sets • In the overlap graph, convex functions are modelled with piecewise-linear approximations, allowing the flow to be solved as a linear min-cost flow problem

Methods: Adjustments to the Standard Min-cost Biflow Problem (cont’d) • Supersource and supersink added to convert flow problem into circulation problem • Each vertex has a lower bound of 1, since each read must appear in the finished genome at least once • Edge bounds are set to 0 (lower bound) and infinity (upper bound) • Put prohibitively large cost on the edge leading from the supersource and the edge leading to the supersink to ensure that the assembly uses the smallest number of contigs possible • Flow through each vertex represents number of times it appears in the assembled genome

Methods: Maximizing the Global Read-Count Likelihood • Start with the probability of a k-mer i being sampled a certain number of times from a genome G • Let N(G) be the length of the genome assembly of G, and let gi be the frequency of i in G • Under the assumption of uniform sampling, the probability of sampling i is gi/N(G) • Let Xibe the random variable that represents the number of trials whose outcome is i • Each random variable for every possible k-mer has a binomial distribution. Their joint distribution is the following multinomial distribution:

Methods: Maximizing the Global Read-Count Likelihood (cont’d) • From this, derive the global read-count likelihood, the likelihood of k-mer distributions (gi) given the sampling outcomes (xi): • Goal is to maximize L, or, equivalently, minimize the negative log of L • To translate this problem into a convex min-cost biflow problem, we need convex functions for each k-mer cis.t. • Problem: the Xirandom variables are not independent…

Methods: Maximizing the Global Read-Count Likelihood (cont’d) • … unless the number of trials approaches infinity • The number of trials is usually large enough to warrant the approximation of the multinomial distribution as the product of the binomial distributions for each Xi • In this binomial approximation, genome length N(G) is constant, and independent of the sampling frequencies • Therefore, use N instead, which is the actual length of the genome G

Methods: Maximizing the Global Read-Count Likelihood (cont’d) • New approximation of L: • Now • And • ci is used as the convex functions for the vertices of the min-cost biflow graph described earlier

Methods: Efficiently Solving a Min-Cost Biflow (Linear) • Problem: No existing efficient implementation of a min-cost biflow algorithm • Solution: Roll your own! Introducing the Medvedev-Brudno Min-cost Biflow Solver! • It’s fast! • It’s 2-approximate! • Get one today! Now available at half-price for one day only!* *offer contingent on willingness of authors to go into business with this

Methods: Efficiently Solving a Min-Cost Biflow (Linear) (cont’d) • Can solve directed network flow by reducing the problem to a linear program (LP) • Use an edge incidence matrix derived from the overlap graph • If cell has a value of 1, then edge n is an in-edge for vertex m • If the value is -1, n is an out-edge • 0 means n and m are not on speaking terms • Use incidence matrix as constraint matrix for LP: optimal LP solution corresponds to a minimum flow

Methods: Efficiently Solving a Min-Cost Biflow (Linear) (cont’d) • The incidence matrix is Totally Unimodular (TU) • Translation from wowspeak: Every linear combination of a TU matrix M and the identity matrix I has integer coefficients. Therefore, every solution to the l.c. consists of integers • Makes it possible to produce an integral solution with LP, rather than resort to Integer Programming -> NP-hard

Methods: Efficiently Solving a Min-Cost Biflow (Linear) (cont’d) • Possible for +2 or -2 to appear in the incidence matrix, since two in-edges/out-edges can enter a single vertex • Incidence matrix is actually a binet matrix • Optimal LP solution for binet matrices is guaranteed to be half-integral (i.e. the coefficients are multiples of 0.5)

Methods: Efficiently Solving a Min-Cost Biflow (Linear) (cont’d) • The binet matrix can be reduced to a TU matrix through monotonization, i.e. doubling the number of columns and rows • TU matrix corresponds to a directed graph • Min-cost directed flow can be solved must faster than LP

Methods: Efficiently Solving a Min-Cost Biflow (Linear) (cont’d) Monotonization Procedure • For every vertex v in the bidirected graph, replace with two vertices v1 and v2 in the new graph • Each of v’s in-edges are replaced with two edges, one of which points into v1, while the other points out of v2 • Likewise, each of v’s out-edges are replaced with two edges, one of which points out of v1, while the other points into v2 • Bounds and costs from original graph are transferred to the new graph, and the solution of the new graph will be transferred to the original graph • Problem can now be solved with off-the-shelf software (no more work for us!)

Methods: Show Me the Contigs • Assuming the flow’s been solved, now it’s time to decompose it into a collection of walks, which translates into assembled contigs • Graph is first simplified by removing all edges with a flow of zero • Additional simplifications possible by removing vertices v where: • There is exactly one edge going into v and one edge leading out of v, and the flow on both edges is the same • Vertices where there is also a loop with the same flow as the other two edges, and • Split and join vertices, where the flow on the in- edges is the same as those of the out-edges

Methods: Show Me the Contigs (cont’d) • After at most 2|V| of these simplifications, the number of edges will be reduced by 105 fold, and only “conflict” vertices remain (those that didn’t match the previous criteria) Conflict Node Resolution • Look for edges at these vertices with opposite orientations supported by matepairs • Use BFS to find all reads within a certain distance from the vertex • Match those reads that are matepairs • For those matepairs where one read is on the incoming side and the other is on the outgoing side, find the shortest path between them using Dijkstra’s algorithm

Methods: Show Me the Contigs (cont’d) • Make note of the number of mates that fall within the expected insert distance • Pairs of in/out edges that have a significant number of matepairs that fall within the insert distance are joined into a common edge • The previous step is repeated until no more edges can be joined in this manner • Graph simplification continues in iterative phases until somebody decides it’s time to stop

Results • Unable to obtain real data for experimental validation • Generated synthetic reads from E. coli genome, which must be 4.6 megabases long, or else it takes up 4.6 MB of hard drive space • Simulated matepairs’ distances were uniformly distributed within 10% of the expected insert size • Reads were 25 bp long, and error-free • Coverage rates involved 50x, 75x, 100x, and 200x • Minimum overlap length varied between 17 and 21 • Overall running time on one machine: ~1 hour • Question: What kind of machine?

Read Count Results • Compared vertex flow with read frequency in the original genome • High degree of accuracy • Error rate between 10-4 and 10-6 • Generally more tendency to overestimate read frequency • Authors claim only slight improvements beyond 75x coverage, but 200x coverage is fantastically good

Assembly Results • Take the edges of the graph produced after the conflict node resolution and generate the sequence it spells out • Compute N50: The length of the shortest contig s.t. 50% of the genome lies in longer contigs • Also compute N90: Similar to N50, but the cutoff is 90% • Finally, compute errors by aligning each contig to the reference genome and seeing how many local alignments it takes to completely tile the contig (minus one because it always takes at least one alignment to do it)

Assembly Results (cont’d) • Length of contigs that contain 50% of the genome varied between 23-28 kb • Length of contigs that contain 90% of the genome varied between 7-8 kb • N50 error rate: ~1/100-180 kb • N90 error rate: ~1/100-160 kb • Greedy algorithm can be fooled by several strong edge matches • Contig size is good relative to other whole genome assemblies involving small read sizes

Discussion • Successfully demonstrated that ab initio genome assembly is feasible with 25 bp reads and matepair information • Still needs improvements for sequencing unknown (bacterial) genomes • In the future, matepair information could be further used to join contigs into supercontigs, and better help resolve conflict nodes • Convex network flow can be more broadly applied within computational biology

Discussion (cont’d) • First major assumption: Reads are error-free • Can be overcome with higher coverage • Second major assumption: Uniform sampling of all genomic regions • Reality: certain portions of the genome are easier to sample than others • More difficult to overcome • Could be overcome by establishing the biases of the sequencing apparatus used

Is NGS The One? The One that will herald the beginning of cost-effective whole genome assembly? Maybe you should ask the Oracle…

Ab Initio Whole Genome Shotgun Assembly With Mated Short Reads

Ab Initio Whole Genome Shotgun Assembly With Mated Short Reads

Presentation Transcript

ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads

Genome Assembly

Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes (pufferfish) Presented by Charu Gupta Kumar

Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads

Genome Assembly

Genome Resequencing with Short Reads

Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities

Whole Genome Assembly Microarray analysis

SNP Discovery in Whole-Genome Light-Shotgun 454 Pyrosequences

Mandarin Chinese Ab Initio

Whole Genome Sequencing, Assembly and Annotation

AB INITIO

Ab initio

Ab initio REMPI

Whole Genome Shotgun Assembly

Correlated ab Initio Methods

Whole Genome Assembly

Ab initio REMPI

RUSSIAN ab Initio

Whole Genome Assembly