
High Throughput Sequencing: Microscope in the Big Data Era

Presentation Transcript


  1. High Throughput Sequencing: Microscope in the Big Data Era. David Tse, EASIT, Chinese University of Hong Kong, July 7, 2014. Research supported by NSF Center for Science of Information.

  2. DNA sequencing …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

  3. High throughput sequencing revolution tech. driver for communications

  4. Shotgun sequencing read

  5. Technologies

  6. High throughput sequencing: Microscope in the big data era. Genomic variations, 3-D structures, transcription, translation, protein interaction, etc. The quantities measured can be dynamic and vary spatially. Example: RNA expression is different in different tissues and at different times.

  7. Computational problems for high throughput data (measure data, manage data, utilize data): • Assembly (de novo) • Variant calling (reference-based assembly) • Genome-wide association studies • Phylogenetic tree reconstruction • Pathogen detection • Compression • Privacy. Scope of this tutorial: assembly.

  8. Assembly: three points of view • Software engineering • Computational complexity theoretic • Information theoretic

  9. Assembly as a software engineering problem • A single sequencing experiment can generate 100's of millions of reads, 10's to 100's of gigabytes of data. • Primary concerns are to minimize time and memory requirements. • No guarantee on optimality of assembly quality, and in fact no optimality criterion at all.

  10. Computational complexity view • Formulate the assembly problem as a combinatorial optimization problem: • Shortest common superstring (Kececioglu-Myers 95) • Maximum likelihood (Medvedev-Brudno 09) • Hamiltonian path on overlap graph (Nagarajan-Pop 09) • Typically NP-hard and even hard to approximate. • Does not address the question of when the solution reconstructs the ground truth.

  11. Information theoretic view. Basic question: What is the quality and quantity of read data needed to reliably reconstruct?

  12. Tutorial outline • De Novo DNA assembly. • De Novo RNA assembly

  13. Themes • Interplay between information and computational complexity. • Role of empirical data in driving theory and algorithm development.

  14. Part I: De Novo DNA Assembly

  15. Shotgun sequencing model. Basic model: uniformly sampled reads. Assembly problem: reconstruct the genome given the reads.
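
As a concrete illustration of the basic model, here is a minimal Python sketch that samples N length-L reads uniformly at random from a genome string. The function name and the toy genome are our own; the sketch ignores reverse complements, circularity and read errors.

    import random

    def sample_reads(genome, N, L, seed=0):
        """Draw N length-L reads uniformly at random from a linear genome
        (a toy stand-in for the shotgun sequencing model on this slide)."""
        rng = random.Random(seed)
        G = len(genome)
        starts = [rng.randrange(G - L + 1) for _ in range(N)]
        return [genome[s:s + L] for s in starts]

    # Example: 20 reads of length 10 from a short synthetic genome.
    toy_genome = "ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACT"
    reads = sample_reads(toy_genome, N=20, L=10)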

  16. A Gigantic Jigsaw Puzzle

  17. Challenges: long repeats, read errors. (Figures: Human Chr 22 repeat length histogram; Illumina read error profile.)

  18. Two-step approach • First, we assume the reads are noiseless. • Derive fundamental limits and near-optimal assembly algorithms. • Then, we add noise and see how things change.

  19. Repeat statistics. (Figure: a genome with many repeats is a harder jigsaw puzzle; one with few repeats is easier.) How exactly do the fundamental limits depend on repeat statistics?

  20. Lower bound: coverage • Introduced by Lander-Waterman in 1988. • What is the number of reads N_LW needed to cover the entire DNA sequence with probability 1-ε? • N_LW only provides a lower bound on the number of reads needed for reconstruction. • N_LW does not depend on the DNA repeat statistics!
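
One way to see the coverage question without quoting a closed form is to estimate the coverage probability by simulation. The Monte Carlo sketch below is our own construction (discrete start positions, a read at every integer offset allowed, no attempt to reproduce N_LW itself); it simply checks whether N uniformly placed length-L reads leave any base uncovered.

    import random

    def covers(starts, G, L):
        """True iff the sorted read start positions leave no base of [0, G) uncovered."""
        if starts[0] > 0 or starts[-1] + L < G:
            return False
        return all(b - a <= L for a, b in zip(starts, starts[1:]))

    def coverage_probability(G, L, N, trials=2000, seed=0):
        """Monte Carlo estimate of the probability that N uniform length-L reads
        cover a length-G sequence. Illustrates the coverage question only."""
        rng = random.Random(seed)
        hits = 0
        for _ in range(trials):
            starts = sorted(rng.randrange(G - L + 1) for _ in range(N))
            hits += covers(starts, G, L)
        return hits / trials

    # Example: how often 300 reads of length 100 cover a 10 kb sequence.
    print(coverage_probability(G=10_000, L=100, N=300))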

  21. Simple model: i.i.d. DNA, G → ∞ (Motahari, Bresler & Tse 12). (Figure: phase diagram of the normalized number of reads versus read length L: reconstruction is impossible without coverage or when there are many repeats of length L; the greedy algorithm reconstructs once there is coverage and no repeats of length L.) What about for finite real DNA?

  22. I.I.D. DNA vs real DNA (Bresler, Bresler & Tse 12). Example: human chromosome 22 (build GRCh37, G = 35M). (Figure: repeat statistics of the real data versus an i.i.d. fit.) Can we derive performance bounds on an individual sequence basis?

  23. Individual sequence performance bounds (Bresler, Bresler, Tse, BMC Bioinformatics 13). Given a genome s. (Figure, Human Chr 19 Build 37: performance of the greedy, de Bruijn, simpleBridging and multiBridging algorithms versus repeat length, against the lower bound, L_critical and Lander-Waterman coverage.)

  24. GAGE Benchmark Datasets (http://gage.cbcb.umd.edu/): Staphylococcus aureus (G = 2,903,081), Rhodobacter sphaeroides (G = 4,603,060), Human Chromosome 14 (G = 88,289,540). (Figure: for each dataset, multiBridging against the lower bound.)

  25. Lower bound: Interleaved repeats. Necessary condition: all interleaved repeats are bridged. In particular: L > longest interleaved repeat length (Ukkonen).

  26. Lower bound: Triple repeats. Necessary condition: all triple repeats are bridged. In particular: L > longest triple repeat length (Ukkonen).

  27. Individual sequence performance bounds (Bresler, Bresler, Tse, BMC Bioinformatics 13). (Figure, Human Chr 19 Build 37: the lower bound as a function of read length, together with Lander-Waterman coverage.)

  28. Greedy algorithm (TIGR Assembler, phrap, CAP3, ...). Input: the set of N reads of length L. 1. Set the initial set of contigs to be the reads. 2. Find the two contigs with the largest overlap and merge them into a new contig. 3. Repeat step 2 until only one contig remains. (A sketch is given below.)
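
A minimal Python sketch of the greedy merging loop described above, assuming error-free reads. The brute-force overlap search is only for illustration; real greedy assemblers such as TIGR, phrap and CAP3 use indexes and quality scores.

    def overlap(a, b):
        """Length of the longest suffix of a that equals a prefix of b."""
        for k in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:k]):
                return k
        return 0

    def greedy_assemble(reads):
        """Toy greedy merger: repeatedly merge the pair of contigs with the
        largest overlap until a single contig remains (error-free reads assumed)."""
        contigs = list(dict.fromkeys(reads))           # drop exact duplicate reads
        while len(contigs) > 1:
            best_len, best_i, best_j = -1, 0, 1
            for i, a in enumerate(contigs):
                for j, b in enumerate(contigs):
                    if i != j:
                        k = overlap(a, b)
                        if k > best_len:
                            best_len, best_i, best_j = k, i, j
            merged = contigs[best_i] + contigs[best_j][best_len:]
            contigs = [c for t, c in enumerate(contigs)
                       if t not in (best_i, best_j)] + [merged]
        return contigs[0]

    # Example: three overlapping reads reassembled by greedy merging.
    print(greedy_assemble(["ATAGAC", "GACCCT", "CCCTAG"]))   # -> ATAGACCCTAG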

  29. Greedy algorithm: the first error occurs at a repeat. (Figure: two already-merged contigs meet at a repeat; a bridging read is needed to merge them correctly.) A sufficient condition for reconstruction: all repeats are bridged.

  30. Back to chromosome 19 (GRCh37 Chr 19, G = 55M). (Figure: the greedy algorithm needs reads longer than the longest repeat, while the lower bound sits at the longest interleaved repeats, of length 2248.) Non-interleaved repeats are resolvable!

  31. Dense Read Model • As the number of reads N increases, one can recover exactly the L-spectrum of the genome. • If there is at least one non-repeating L-mer on the genome, this is equivalent information to having a read at every starting position on the genome. • Key question: What is the minimum read length L for which the genome is uniquely reconstructible from its L-spectrum?
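
For concreteness, the L-spectrum mentioned above is just the multiset of all length-L substrings of the genome. A short Python sketch (ours, for a linear error-free genome):

    from collections import Counter

    def l_spectrum(genome, L):
        """Multiset of all length-L substrings of the genome (its L-spectrum)."""
        return Counter(genome[i:i + L] for i in range(len(genome) - L + 1))

    # Example: the 5-spectrum of the genome used on the next slides.
    print(l_spectrum("ATAGACCCTAGACGAT", 5))   # note TAGAC appears twice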

  32. de Bruijn graph (L = 5). Example genome: ATAGACCCTAGACGAT. 1. Add a node for each (L-1)-mer on the genome. 2. Add k edges between two (L-1)-mers if their overlap has length L-2 and the corresponding L-mer appears k times in the genome.
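
A small Python sketch of this construction, assuming an error-free linear genome. It builds adjacency lists directly from the L-spectrum, with parallel edges represented by repeated entries; the function name is our own.

    from collections import Counter, defaultdict

    def de_bruijn_graph(genome, L):
        """de Bruijn graph as on this slide: one node per (L-1)-mer, and one edge
        per occurrence of each L-mer, from its (L-1)-prefix to its (L-1)-suffix."""
        spectrum = Counter(genome[i:i + L] for i in range(len(genome) - L + 1))
        graph = defaultdict(list)              # node -> list of successor nodes
        for lmer, count in spectrum.items():
            graph[lmer[:-1]].extend([lmer[1:]] * count)
        return graph

    # The slide's example with L = 5: nodes are 4-mers such as ATAG, TAGA, AGAC, ...
    g = de_bruijn_graph("ATAGACCCTAGACGAT", L=5)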

  33. Eulerian path (L = 5). Example genome: ATAGACCCTAGACGAT. Theorem (Pevzner 95): If L > max(l_interleaved, l_triple), then the de Bruijn graph has a unique Eulerian path, which is the original genome.
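
To make the theorem concrete, here is a hedged sketch that finds the Eulerian path with Hierholzer's algorithm and spells the genome back out. It assumes the adjacency lists produced by the de_bruijn_graph sketch above and that an Eulerian path starting at the given node exists.

    def eulerian_path(graph, start):
        """Hierholzer's algorithm on adjacency lists; returns the node sequence
        of an Eulerian path from `start` (assumed to exist)."""
        adj = {u: list(vs) for u, vs in graph.items()}
        stack, path = [start], []
        while stack:
            u = stack[-1]
            if adj.get(u):
                stack.append(adj[u].pop())
            else:
                path.append(stack.pop())
        return path[::-1]

    def spell(path):
        """Glue the (L-1)-mer nodes back into a single sequence."""
        return path[0] + "".join(node[-1] for node in path[1:])

    # Using de_bruijn_graph from the sketch above: the unique Eulerian path
    # of the slide's example spells the original genome.
    nodes = eulerian_path(de_bruijn_graph("ATAGACCCTAGACGAT", 5), start="ATAG")
    print(spell(nodes))   # -> ATAGACCCTAGACGAT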

  34. Resolving non-interleaved repeats. (Figure: condensed sequence graph containing a non-interleaved repeat; the graph has a unique Eulerian path.)

  35. From dense reads to shotgun reads [Idury-Waterman 95, Pevzner et al 01]. Idea: mimic the dense read scenario by looking at K-mers of the length-L reads. Construct the K-mer graph and find an Eulerian path. Success if we have K-coverage of the genome and K > L_critical.
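
A sketch of the same idea starting from reads rather than the genome: collect the K-mers of the reads and build the graph on their (K-1)-mers. This toy version uses each distinct K-mer once; estimating the multiplicity of K-mers that repeat in the genome, which a true Eulerian reconstruction needs, is glossed over here.

    from collections import defaultdict

    def kmer_graph_from_reads(reads, K):
        """Build a K-mer graph from shotgun reads: nodes are (K-1)-mers, with one
        edge per distinct K-mer seen in the reads. Repeat multiplicities ignored."""
        kmers = {r[i:i + K] for r in reads for i in range(len(r) - K + 1)}
        graph = defaultdict(list)
        for kmer in sorted(kmers):
            graph[kmer[:-1]].append(kmer[1:])
        return graph

    # Example: overlapping error-free reads broken into 5-mers.
    reads = ["ATAGACCC", "GACCCTAG", "CCTAGACG", "AGACGAT"]
    g = kmer_graph_from_reads(reads, K=5)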

  36. De Bruijn algorithm: performance. Loss of information from the reads! (Figure, Human Chr 19 Build 37: de Bruijn versus greedy as a function of read length, against the lower bound and Lander-Waterman coverage.)

  37. Resolving bridged interleaved repeats. (Figure: an interleaved repeat with a bridging read.) The bridging read resolves one repeat and the unique Eulerian path resolves the other.

  38. Simple bridging: performance. (Figure, Human Chr 19 Build 37: simpleBridging versus greedy and de Bruijn as a function of read length, against the lower bound and Lander-Waterman coverage.)

  39. Resolving triple repeats. (Figure: when all copies of a triple repeat are bridged, the repeat can be resolved locally within its neighborhood.)

  40. Triple Repeats: subtleties

  41. Multibridging de Bruijn. Theorem (Bresler, Bresler, Tse 13): the original sequence is reconstructible if: 1. triple repeats are all-bridged, 2. interleaved repeats are (single) bridged, 3. coverage. Necessary conditions for ANY algorithm: • triple repeats are (single) bridged • interleaved repeats are (single) bridged • coverage.

  42. Multibridging: near optimality for Chr 19. (Figure, Human Chr 19 Build 37: multiBridging nearly matches the lower bound as a function of read length, improving on greedy, de Bruijn and simpleBridging; Lander-Waterman coverage shown for reference.)

  43. GAGE Benchmark Datasets (http://gage.cbcb.umd.edu/): Staphylococcus aureus (G = 2,903,081), Rhodobacter sphaeroides (G = 4,603,060), Human Chromosome 14 (G = 88,289,540). L_critical = length of the longest triple or interleaved repeat. (Figure: for each dataset, multiBridging against the lower bound, with L_critical marked.)

  44. Gap: Sulfolobus islandicus, G = 2,655,198. (Figure: a gap remains between the MULTIBRIDGING algorithm and the triple repeat and interleaved repeat lower bounds.)

  45. Complexity: Computational vs Informational • Complexity of MULTIBRIDGING: for a length-G genome, O(G²). • Alternate formulations of assembly: Shortest Common Superstring is NP-hard; greedy is O(G) but only a 4-approximation to SCS in the worst case; Maximum Likelihood is NP-hard. • Key differences: we are concerned only with instances where the reads are informationally sufficient to reconstruct the genome; the individual sequence formulation lets us focus on issues arising only in real genomes.

  46. Read Errors. The error rate and nature depend on the sequencing technology. Examples: Illumina: 0.1 – 2% substitution errors; PacBio: 10 – 15% indel errors. We will focus on a simple substitution noise model with noise parameter p. (Figure: a noisy read aligned to the underlying sequence, with substituted bases marked.)
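
A hedged Python sketch of the substitution noise model with parameter p: each base is independently replaced, with probability p, by a uniformly chosen different base. The function and parameter names are our own.

    import random

    def add_substitution_noise(read, p, alphabet="ACGT", seed=None):
        """Simple substitution noise: each base is replaced, independently with
        probability p, by a uniformly chosen different base from the alphabet."""
        rng = random.Random(seed)
        out = []
        for base in read:
            if rng.random() < p:
                out.append(rng.choice([b for b in alphabet if b != base]))
            else:
                out.append(base)
        return "".join(out)

    # Example: an Illumina-like 1% substitution error rate on a short read.
    noisy = add_substitution_noise("ACGTCCTATGCGTATGCGTAATGCC", p=0.01, seed=1)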

  47. Consistency. Basic question: what is the impact of noise on L_critical? This question is equivalent to whether the L-spectrum is exactly recoverable as the number of noisy reads N → ∞. Theorem (C.C. Wang 13): yes, for all p except p = ¾.

  48. What about coverage depth? Theorem (Motahari, Ramchandran, Tse, Ma 13): assume an i.i.d. genome model. If the read error rate p is less than a threshold, then Lander-Waterman coverage is sufficient for L > L_critical. For the uniform distribution on {A,G,C,T}, the threshold is 19%. A separation architecture is optimal: error correction followed by assembly.

  49. Why? Noise averaging. • Coverage means most positions are covered by many reads. • Multiple alignment of overlapping noisy reads is possible if … • Assembly using noiseless reads is possible if … (conditions stated as formulas on the slide; see the sketch below for the averaging idea).
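
The simplest way to see noise averaging in code is a per-position majority vote over reads that are already aligned to the same window. This illustrative sketch is our own, assumes substitution errors only (no indels), and is not the joint-typicality test used in the analysis.

    from collections import Counter

    def consensus(aligned_reads):
        """Per-position majority vote over equal-length reads aligned to one window."""
        length = len(aligned_reads[0])
        return "".join(
            Counter(read[i] for read in aligned_reads).most_common(1)[0][0]
            for i in range(length)
        )

    # Three noisy copies of the same window; the vote recovers the true bases.
    print(consensus(["ACGTA", "ACCTA", "ACGTA"]))   # -> ACGTA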

  50. From theory to practice. Two issues: 1) Multiple alignment is performed by testing joint typicality of M sequences, which is computationally too expensive. Solution: use the technique of fingerprinting. 2) Real genomes are not i.i.d. Solution: replace greedy by multibridging.
