Error Correction in High-Throughput Datasets

Dale Beach, Longwood University
Lisa Scheifele, Loyola University Maryland

The advent of next-gen sequencing requires students and researchers to deal with large datasets
Next-generation sequencing has revolutionized both biological research and clinical medicine: entire human genomes are now sequenced to predict drug responsiveness and to diagnose disease (for example, Choi 2009).

Students must be able to address error in large datasets
In contrast to traditional Sanger sequencing, next-generation sequencing datasets have shorter read lengths and higher error rates. This creates challenges for downstream analysis: because so many reads are produced, even a small per-base error rate leaves a large number of reads containing at least one error. For example, Illumina MiSeq data has a per-base error rate of about 0.1% (Glenn 2011), yet only ~86% of 150 bp sequencing reads (0.999^150 ≈ 0.86) are expected to be error-free.
Figure: Sequencing error in read. Source: http://www.pnas.org/content/106/45/19096/F3.expansion.html (from http://www.pnas.org/content/106/45/19096.full.pdf+html)
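The error-free fraction quoted above can be checked with a few lines of Python. This is a minimal sketch that assumes errors are independent and equally likely at every position, which real reads do not strictly satisfy; the function names and the example read count of one million are illustrative assumptions, not part of the module.

```python
def error_free_fraction(per_base_error_rate, read_length):
    """Probability that a read contains no errors.

    Assumes independent, uniform per-base errors (a simplification).
    """
    return (1.0 - per_base_error_rate) ** read_length


def expected_reads_with_errors(n_reads, per_base_error_rate, read_length):
    """Expected number of reads carrying at least one error."""
    return n_reads * (1.0 - error_free_fraction(per_base_error_rate, read_length))


if __name__ == "__main__":
    frac = error_free_fraction(0.001, 150)
    print(f"Error-free fraction of 150 bp reads: {frac:.3f}")
    # Hypothetical run size, chosen only to illustrate the scale of the problem.
    bad = expected_reads_with_errors(1_000_000, 0.001, 150)
    print(f"Expected reads with errors per million: {bad:.0f}")
```

Students can vary the error rate and read length to see how quickly the error-free fraction falls as reads get longer.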

Background
• This module is designed for a genetics or molecular biology class. It will require 3 lecture/seminar class periods, with optional additional Linux-based lab activities
• Prior to beginning this module, students should be familiar with:
  • Sample preparation techniques for DNA sequencing
  • DNA replication and the enzymes that synthesize DNA
  • Nucleic acid and nucleotide structure

Research Goals
• Initial evaluation of the quality of eukaryotic genome sequencing data
• Implementation of error correction techniques
• Comparison of the quality of sequencing data before and after error correction
Sequencing Requirements
• Completed small eukaryotic genome data on Illumina platform
• If students will not be performing command-line programming themselves, this data should be analyzed with:
  • Jellyfish to produce data on k-mer frequencies that students can use to generate a histogram in Excel
  • Quake to perform error correction so that students can be provided with pre- and post-error correction datasets

Student Learning Goals
• At the completion of this module, students will be able to:
  • Describe the important differences between high-throughput and traditional (low-throughput) experiments
  • Explain the reasons for variations in the quality of high-throughput datasets
  • Utilize computational tools to quantify errors in sequencing data
  • Interpret the quality of a sequencing experiment and implement effective quality control measures

Computer Requirements
• Excel or other analytical packages to create a k-mer frequency distribution
• Galaxy to create a boxplot of PHRED33 scores
• Optional: Quake and Jellyfish on a Linux system to generate k-mer data and perform error correction

Vision and Change Competencies
• This module will develop students’ abilities to:
  • Apply the process of science
    • Carry an experiment from methodological design through data analysis
    • Analyze and interpret data
  • Use modeling and simulation
    • Design experimental strategies and predict outcomes
  • Use quantitative reasoning
    • Depict data using histograms and boxplots
    • Interpret graphs and use the results of their analysis to modify error correction strategies

Timeline: Class 1
Introductory lecture and data upload
• Intro to sequencing history and platforms
• Discuss typical sources of error in sequencing reads
• Discuss sequence output formats and PHRED33 scores
• Upload raw data to Galaxy
• Optional: Quake in Linux to manipulate parameters and improve quality
Image source: http://www.nimr.mrc.ac.uk/mill-hill-essays/bringing-it-all-back-home-next-generation-sequencing-technology-and-you#

Timeline: Class 2
Setting up analysis and adjusting parameters
• Introduce software packages that can be used to assess data quality
• Demonstrate breaking sequencing reads into k-mers
• Use Excel or Jellyfish to create k-mer graph
• Use Excel or Jellyfish to create k-mer graph following manipulation of error correction parameters (variations in k-mer size)
Figure: K-mer frequency distribution
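For instructors who want to show what "breaking reads into k-mers" means before handing the job to Jellyfish, a small Python sketch can stand in for the demonstration. It is an illustration only: the function names and the three toy reads are assumptions, and unlike a production k-mer counter it ignores reverse complements and will not scale to a full genome dataset.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_frequency_histogram(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return dict(sorted(Counter(counts.values()).items()))


if __name__ == "__main__":
    # Toy reads: the second read differs from the others at its last base.
    reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
    counts = count_kmers(reads, k=4)
    for multiplicity, n_kmers in kmer_frequency_histogram(counts).items():
        print(multiplicity, n_kmers)
```

The two printed columns (multiplicity, number of distinct k-mers) are the kind of table students would plot as a k-mer frequency histogram. Rerunning with a different `k` shows how k-mer size changes the shape of the distribution.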

Timeline: Class 3
Assessing quality
• Discussion of using PHRED33 scores to assess data quality
• Create boxplots of PHRED33 scores in Galaxy for raw data
• Create boxplots of PHRED33 scores in Galaxy for data post Quake correction
  • Students can compare outcomes following Quake correction with different parameters
Figures: boxplots for raw data and for data post Quake correction
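The numbers behind a quality boxplot can be illustrated in Python. The sketch below assumes quality strings use the Phred+33 encoding (ASCII code minus 33), converts a score Q to an error probability with P = 10^(-Q/10), and computes a five-number summary per read position. The function names and toy quality strings are assumptions for illustration; this is not how Galaxy implements its boxplot tool.

```python
import statistics


def decode_phred33(quality_string):
    """Convert a Phred+33 quality string to a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def per_position_summary(quality_strings):
    """Return (min, Q1, median, Q3, max) for each read position.

    Assumes all reads have the same length and there are at least two reads.
    """
    summaries = []
    for column in zip(*(decode_phred33(q) for q in quality_strings)):
        q1, median, q3 = statistics.quantiles(column, n=4, method="inclusive")
        summaries.append((min(column), q1, median, q3, max(column)))
    return summaries


if __name__ == "__main__":
    # Toy quality strings: 'I' = Q40, '5' = Q20, '+' = Q10, '!' = Q0.
    quals = ["II5", "I5+", "I!+"]
    for position, summary in enumerate(per_position_summary(quals), start=1):
        print(position, summary)
    print(phred_to_error_probability(30))
```

Each tuple corresponds to one box in the boxplot, so students can connect the picture to the underlying scores.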

Discussion Topics
• Why has next-generation sequencing technology led to a revolution in biology/medicine?
• Discuss and predict how chemical and physical mechanisms lead to errors
• Comparison of sequence improvement based on different parameters
• How do software packages determine which base is in error and which is correct if sequencing reads conflict?
• Why is it important to have a numerical measure of error in addition to the nucleotide sequence?
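To seed the discussion of how software decides which base is wrong, a toy k-mer-based corrector can be sketched in Python. It treats k-mers seen fewer than `cutoff` times as untrusted, narrows the suspect base to the positions shared by all untrusted k-mers in a read, and accepts a substitution only if every k-mer in the read then becomes trusted. This is a deliberately simplified illustration of the general idea, not Quake's actual algorithm (which also weighs quality scores); all names, the cutoff, and the example reads are assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspect_positions(read, counts, k, cutoff):
    """Positions covered by every untrusted (low-count) k-mer in the read."""
    windows = [
        set(range(i, i + k))
        for i in range(len(read) - k + 1)
        if counts[read[i:i + k]] < cutoff
    ]
    return set.intersection(*windows) if windows else set()


def correct_read(read, counts, k, cutoff):
    """Try single-base substitutions that make all k-mers trusted."""
    for pos in sorted(suspect_positions(read, counts, k, cutoff)):
        for base in "ACGT":
            if base == read[pos]:
                continue
            trial = read[:pos] + base + read[pos + 1:]
            if all(counts[trial[i:i + k]] >= cutoff
                   for i in range(len(trial) - k + 1)):
                return trial
    return read  # no confident correction found; leave the read unchanged


if __name__ == "__main__":
    # Five agreeing reads and one read with a T->A error at index 4.
    reads = ["ACGTTGCAAG"] * 5 + ["ACGTAGCAAG"]
    counts = count_kmers(reads, k=4)
    print(correct_read("ACGTAGCAAG", counts, k=4, cutoff=2))
```

The sketch makes one point for discussion: the "correct" base is simply the one supported by many other reads, which is why coverage depth and the choice of cutoff matter.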

Assessment
• This module will be performed as a team-based project, with students preparing and handing in a report at the end. Students will be able to:
  • Predict predominant types or sources of error based on experimental design and sequencing platform
  • Prepare a boxplot using Galaxy for an exemplary dataset and use the boxplot to evaluate the quality of the sequence data
  • Effectively improve the quality of any set of NGS reads prior to assembly

References
• https://banana-slug.soe.ucsc.edu/bioinformatic_tools:jellyfish
• www.en.wikipedia.org/wiki/FASTQ_format
• Kelley DR, Schatz MC, Salzberg SL. 2010. Quake: quality-aware detection and correction of sequencing errors. Genome Biology 11:R116.
• Marcais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27:764-770. [Jellyfish program]
• http://res.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf
• http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2581791/pdf/ukmss-2586.pdf
