Developing a computational method to help solve the sequencing gap closure problem

Advisors: Dr. Qunfeng Dong(CGB) Prof. Haixu Tang(SOI) INDIANA UNIVERSITY, BLOOMINGTON Developing a computational method to help solve the sequencing gap closure problem Vivek Krishnakumar M.S. Bioinformatics CAPSTONE

A Typical Sequencing Project • Sequencing • Assembly • Finishing • Annotation Steps in a whole genome shotgun sequencing project Claire M. Fraser, Jonathan A. Eisen and Steven L. Salzberg. Microbial genome sequencing. Nature 406, 799-803(2000)

Genome Assembly • Putting together a large number of short DNA sequences called reads to create a representation of the original genome. • Relies on the assumption that reads sharing the same string of letters, originated from the same place on the genome • Ideally, at an 8-10 fold coverage, most of the genome will be assembled into a small number of contigs (around 5 for a 1Mbp genome)1 1. Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis, Genomics 2(3): 231-239 (1988)

Assembly Algorithms • Assembly is a complex computational problem (NP-hard) • Several approximation algorithms to solve this problem. • Types • Simple greedy algorithms • Graph based algorithms • Overlap-layout-consensus (Phrap, TIGR, VCAKE, etc.) • Eulerian Path (Newbler, Euler, ALLPATHS, etc.)

Assembly Algorithms – Graph based Overlap-layout-consensus1 Eulerian Path2 • k-mers are computed • k-mers are represented by edges of a deBruijn graph • reads are represented as paths through k-mers • reconstructing the genome involves finding the path that makes use of all the edges • sequences represented as nodes • edges correspond to read overlap • use the overlap between sequence reads to create a link between them • contigs are eventually formed by reading along the links as far as possible Daniel MacLean, Jonathan D. G. Jones & David J. Studholme. Application of 'next-generation' sequencing technologies to microbial genetics. Nature Reviews Microbiology 7, 287-296 (2009) 1. J.D. Kececioglu and E.W. Myers, Combinatorial Algorithms for DNA Sequence Assembly, Algorithmica,vol. 13, 1995, pp. 7-51. 2. Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. PNAS August 14, 2001 vol. 98 no. 17 9748-9753

Finishing (Post-assembly) • No matter the type of assembly algorithm, in practice, the assemblers produce not one, but hundreds or thousands of unordered contiguous sequences. Reasons: • no or low coverage regions • repeats • sequencing errors • The task of gap closure consists of several steps: • Orienting the contigs • Ordering the contigs • Gap closure

Gap-closure problem A Un-ordered & un-oriented contigs C E F B D G E F A C B D G Ordered & oriented Gap-closing (by PCR or walking) A C

Existing methods for contig ordering (Reference genome based) • For biologists, it is imperative to move beyond the assembly step and elucidate the order of the contigs for effective gap-closure. Existing computational methods provide the biologists with tools to identify the order of the contigs, which enables them to move that much closer to reconstructing the genome • Projector1 • Projector 22 • CAAT-Box or Contigs-Assembly and Annotation Tool-Box3 • Pheromone trail-based genetic algorithm4 • PGAAS or Prokaryotic Genome Assembly Assistant System5 • As observed, the existing methods are based on similar working principles, i.e. they assume that there are reference genomes which are closely related to the organism in consideration 1. van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector: automatic contig mapping for gap closure purposes.Nucleic Acids Res. 2003 Nov 15;31(22):e144. 2. van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W560-6 3.Frangeul L, Glaser P, Rusniok C, Buchrieser C, Duchaud E, Dehoux P, Kunst F. CAAT-Box, Contigs-Assembly and Annotation Tool-Box for genome sequencing projects. Bioinformatics. 2004 Mar 22;20(5):790-7. Epub 2004 Jan 29. 4.Fangqing Zhao, Fanggeng Zhao, Tao Li and Donald A. Bryant. A new pheromone trail-based genetic algorithm for comparative genome assembly.Nucleic Acids Research, 2008, Vol. 36, No. 10 3455-3462 5.Yu Z, Li T, Zhao J, Luo J. PGAAS: a prokaryotic genome assembly assistant system.Bioinformatics. 2002 May;18(5):661-5.

Rationale • Post-sequencing and assembly of a novel genome (one with no known nearest phylogenetic neighbor), the biologists are faced with a set of unordered contigs • It is not feasible to run gap closing Polymerase Chain Reactions (PCR) for each possible pair of contigs • Thus, we propose a computational method, employing graph theoretical concepts, to identify non-repeat contig neighborhoods (determining possible pairs of contigs and the corresponding gap sizes)

Proposed method A C Set of unordered, un-oriented contigs generated after assembly Exhausting all possible pairs of contigs for gap-closing PCR A C A C E A F F F A B D G D G G D • Using the available data, employing a graph theoretical approach, we generate probable non-repeat contig neighborhoods • contigs connected directly with each other • connected through repeat contigs B F E C REPEAT G D

Graph Theory • Graph theory is a mathematical/computational concept of studying graphs. It is used to model pair-wise relationships between objects from a certain collection of such objects. Technically, a graph consists of: • nodes/vertices (representing the object) and • edges (representing the connection between two objects) – they are either directed or undirected. • For our study, we explored the use of the shortest path algorithm (Dijkstra’s Algorithm) undirected edge node directed edge

Simulation studies • Before analyzing the real world data, we conducted a simulation study to explore a method for identifying possible pairs of contigs • Escheirichia coli 536 genome used for simulation METHOD 1 Generate reads at different coverages (10x, 20x, 30x). Introduce N random length gaps into the genome, producing N+1 contigs GENOME CONTIG 1 GAP 1 CONTIG 2 GAP 2 CONTIG N GAP N CONTIG N+1 METHOD 2 Splice the genome with N copies of the same DNA sequence generate reads at different coverages Repeat 1 Repeat 2 Repeat N CONTIG 1 GAP 1 CONTIG 2 GAP 2 CONTIG N GAP N CONTIG N+1

Classifying the reads • From this population of reads we classify and extract two types of reads: • Reads that belong to the gap regions • Readsthat partially overlap with the edges of the contigs CONTIG 1 GAP 1 CONTIG 2 GAP 2 CONTIG N GAP N CONTIG N+1 suffix prefix CONTIG 1 GAP 1 CONTIG 2 Map the shortest path from a suffix to a prefix through a set of junction reads and unassembled reads

Running the simulation Junction Reads Reads from the gap region Contig edges (suffix|prefix) suffix prefix Large-scale sequence alignment using vmatch Compute all possible overlaps. Generate an overlap graph where the reads form the nodes and the overlaps form the edges. Traverse the graph and find the shortest path between a suffix and prefix suffix prefix

Results – METHOD 1 Number of gaps introduced: 10 Number of contiguous sequences: 11 (C1 through C11) Simulated read lengths: 400 bp Coverage: 10x, 20x and 30x All computed paths (at 20x coverage):

Results – METHOD 2 (Repeat-induced gaps) Number of repeat-induced gaps introduced: 2 Number of contiguous sequences: 3 (C1, C2 and C3) Simulated read lengths: 400 bp Coverage: 10x, 20x and 30x All computed paths (at 30x coverage): C1 C2 C3 Repeat 1 Repeat 2

Exploring the real world data • Eventually we want to help the biologists develop PCR primers to close the gaps between the contigs in the genome. • For this study, we obtained assembly data from Prof. Yves Bruns’ Lab in the Biology department for the organism Brevundimonasdiminuta. • Genome sequencing method: 454 (without paired ends) • Sequence Assembly program: Newbler • We intend to derive information from the .ace file produced by the Newbler assembler

Assembly data statistics Brevundimonasdiminuta STATISTICS Total Number Of Reads 613425 Number of contigs 476 Length of contigs 3325300 Avg. Length of contigs 6985 ± 10125 Largest contig 77275 bp Smallest contig 102 bp Number of N50 contigs 61 Length of N50 contigs 14580

.ace file format • .ace file created by the Newbler assembler contains the following assembly information: • the contig sequences • quality values • read alignment information. • contig linking information AS 476 537972 CO contig00001 755 63 28 U TTAC**AGCT*CCC*GCC*A**GGTC*TT*G*CCGG*TCATGCCTTCATA GTTGGTCGCGTGATAGACGCCGCGCGCCACGGCGCGGGCCAGGACGTCGG BQ 64 64646464646464646464646464646464646464646464646464646464646464646464646464646464646464646464646464 64 64646464646464646464646464646464646464646464646464646464646464646464646464646464646464646464646464 AF EYKMWLY02IBXNT.12-1.fm55 C -58 AF EYKMWLY01AHEVV.94-111.fm55 U -92 AF EYKMWLY01E0SVI.79-107.fm55 U -77 AF EYKMWLY01BQQ0K.153-181.fm55 U -151 AF EYKMWLY02GTKZH U 377 AF EYKMWLY01AUK2G U 466 AF EYKMWLY02IS15X C 466 AF EYKMWLY02IYK7T.265-2 C 485 AF EYKMWLY02IT5PF C 511 AF EYKMWLY01B4PII.51-1 C 500 AF EYKMWLY01D8FPB C 520 AF EYKMWLY02HKKY C 600 RD EYKMWLY01AHEVV.94-111.fm55 117 0 0 GGGGCTGAAGGCGCTGAGCGAGACGCTGCCGGTCGCGCAGGCGGTGCCGG CCGGGCGCGCCGATTTAATCAACCCTCTCCCGGCGGGAGAGGGTTAC**A GCTACCC*GCC*A**GG

Inferring information from the .ace file • From the .ace file, we extract information that suggests which reads a particular contig shares with which other contig(s) in the assembly CO contig00016 10537 1778 1241 U AF EYKMWLY02JASI2.1-151.to428 U 10379 AF EYKMWLY01CDRT0.1-130 U 10400 AF EYKMWLY02IF22Q.1-105 U 10426 AF EYKMWLY02HA9IP.1-105 U 10426 AF EYKMWLY01CV3A4.1-107 U 10426 AF EYKMWLY01BNJV1.1-91.to428 U 10440 AF EYKMWLY01D4P1Q C 10445 AF EYKMWLY02FFW9C.1-81 U 10450 AF EYKMWLY01C1SKF.1-75 U 10457 AF EYKMWLY01A3C1V.1-68 U 10465 AF EYKMWLY01CCJHV.1-66.to428 U 10465 AF EYKMWLY01D7WLL.1-52.to428 U 10480 AF EYKMWLY01CPNZS.1-49.to428 U 10484 AF EYKMWLY01CBBAK.216-196.to428 C 10508 AF EYKMWLY01DWD7X.196-176.to428 C 10506 AF EYKMWLY01A27AI.239-214.to428 C 10510 AF EYKMWLY01A14U9.208-187.to428 C 10513 AF EYKMWLY01A4PVY.242-220.to428 C 10513 CO contig00428 21606 2746 2071 U AF EYKMWLY01CPNZS.73-75.fm16 U -71 AF EYKMWLY01D7WLL.76-83.fm16 U -74 AF EYKMWLY02JASI2.175-195.fm16 U -173 AF EYKMWLY01CCJHV.90-117.fm16 U -88 AF EYKMWLY01BNJV1.115-149.fm16 U -113 AF EYKMWLY02JIVGZ.41-1.fm16 C -23 AF EYKMWLY02HY7R3.48-1.fm16 C -25 AF EYKMWLY02FW7CZ.58-1.fm16 C -27 AF EYKMWLY01B9IJ6 U 1 AF EYKMWLY01D4UXN.80-1.fm16 C -34 AF EYKMWLY01DBA88.146-1 C -93 AF EYKMWLY02HKXAA.152-1 C -96 AF EYKMWLY01DWD7X.152-1.fm16 C -47 AF EYKMWLY01B9SVF.62-221 U -60 AF EYKMWLY01A14U9.163-1.fm16 C -44 AF EYKMWLY01CBBAK.172-1.fm16 C -45 AF EYKMWLY01A27AI.190-1.fm16 C -48 AF EYKMWLY01A4PVY.196-1.fm16 C -45

Classification of contigs • Based on the number of recorded connections, we try to classify the contigs into so-called: • Repeat contigs; and • Non-repeat contigs • After classification, we go on to build non-repeat contig neighborhoods (pairs of non-repeat contigs connected through one or more repeat contigs) CONTIG X REPEAT CONTIG Y CONTIG X REPEAT A REPEAT B CONTIG Y

Results

Generated networks

Output from this method • On analysis of the assembly data, biologists are provided with the output in the following format suffix_C466 prefix_C411 1269 suffix_C20 prefix_C439 833 suffix_C20 prefix_C18 833 suffix_C34 prefix_C81 442 suffix_C466 prefix_C51 4095 suffix_C39 prefix_C425 1532 suffix_C42 prefix_C17 1532 suffix_C457 prefix_C18 2087 • This output can be easily plugged into a primer design program such as Primer3 for developing unique PCR primers for each possible pair

Summary • Using this method, we were able to generate probable non-repeat contig neighborhoods that can be effectively utilized by biologists for gap-closing by PCR • Limitations • The simulation studies conducted represent simple cases of the gap closure problem. In reality, we may encounter more complicated repeat structures that cannot be fully resolved by the simple method proposed here. We will conduct more simulation experiments that resemble the real-world sequencing data. • For the contigs that do not possess any connectivity information, further directed experiments can be conducted to close the physical gaps • Future Work • The generated contig-neighborhoods can be aligned to the available optical mapping data to improve the assembly

References • Claire M. Fraser, Jonathan A. Eisen and Steven L. Salzberg. Microbial genome sequencing. Nature 406, 799-803(2000) • Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis, Genomics 2(3): 231-239 (1988) • Daniel MacLean, Jonathan D. G. Jones & David J. Studholme. Application of 'next-generation' sequencing technologies to microbial genetics. Nature Reviews Microbiology 7, 287-296 (2009) • J.D. Kececioglu and E.W. Myers, Combinatorial Algorithms for DNA Sequence Assembly, Algorithmica,vol. 13, 1995, pp. 7-51. • Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. PNAS August 14, 2001 vol. 98 no. 17 9748-9753 • van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector: automatic contig mapping for gap closure purposes.Nucleic Acids Res. 2003 Nov 15;31(22):e144. • van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W560-6 • Frangeul L, Glaser P, Rusniok C, Buchrieser C, Duchaud E, Dehoux P, Kunst F. CAAT-Box, Contigs-Assembly and Annotation Tool-Box for genome sequencing projects. Bioinformatics. 2004 Mar 22;20(5):790-7. Epub 2004 Jan 29. • Fangqing Zhao, Fanggeng Zhao, Tao Li and Donald A. Bryant. A new pheromone trail-based genetic algorithm for comparative genome assembly.Nucleic Acids Research, 2008, Vol. 36, No. 10 3455-3462 • Yu Z, Li T, Zhao J, Luo J. PGAAS: a prokaryotic genome assembly assistant system.Bioinformatics. 2002 May;18(5):661-5. • Mark J. Chaisson1 and Pavel A. Pevzner. Short read fragment assembly of bacterial genomes. Genome Res. 2008. 18: 324-330

Acknowledgments • Dr. Qunfeng Dong • Prof. Haixu Tang • Prof. RajarshiGuha • Jeong-Hyeon (Justin) Choi – Genomics • Prof. Yves Bruns’ Lab • Colleagues at CGB • Faculty of the School of Informatics • CGB Computing Team – For the technical support and resources • Linda Hostetter & Rachel Lawmaster

THANK YOU!

Developing a computational method to help solve the sequencing gap closure problem

Developing a computational method to help solve the sequencing gap closure problem

Presentation Transcript

The Scientific Method A Way to Solve a Problem

UDL to Solve Problem

The Scientific Method ♫ A Way to Solve a Problem ♫

The Scientific Method ♫ A Way to Solve a Problem ♫

How to solve a math problem!

Our Problem To Solve

A GRASP Heuristic to the Extended Car Sequencing Problem

Helping to Solve a Grave Problem

Design: Solve a Problem

A GRASP Heuristic to the Extended Car Sequencing Problem

The Gap Trigger Problem

Computational Methods to study Sequencing data

Car Sequencing Problem

Car Sequencing Problem

How To Solve The Procrastination Problem

Try To Solve The Problem Together

SOLVE THE CUSTOMER’S PROBLEM

Problem Solve

Solve the Story Problem

The Scientific Method ♫ A Way to Solve a Problem ♫