
Error Correction in High-Throughput Datasets

Dale Beach, Longwood University; Lisa Scheifele, Loyola University Maryland. The advent of next-gen sequencing requires students and researchers to deal with large datasets.


Presentation Transcript


  1. Dale Beach, Longwood University; Lisa Scheifele, Loyola University Maryland. Error Correction in High-Throughput Datasets

  2. The advent of next-gen sequencing requires students and researchers to deal with large datasets. Next-generation sequencing has revolutionized both biological research and clinical medicine, with sequencing of entire human genomes being used to predict drug responsiveness and to diagnose disease (for example, Choi 2009).

  3. Students must be able to address error in large datasets. In contrast to traditional Sanger sequencing, next-generation sequencing datasets have shorter read lengths and higher error rates. This creates challenges for downstream analysis: because sequencing reads are so abundant, even a small per-base error rate yields a large number of reads that contain errors. Indeed, Illumina MiSeq data have a per-base error rate of about 0.1% (Glenn 2011), yet this corresponds to only ~86% of the 150 bp sequencing reads (0.999^150) being error-free. Figure: sequencing error in a read. Source: http://www.pnas.org/content/106/45/19096/F3.expansion.html http://www.pnas.org/content/106/45/19096.full.pdf+html
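The error-free fraction quoted on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes errors occur independently at a uniform per-base rate, which is a simplification of real sequencing data. The function name `error_free_fraction` is illustrative and not part of the module.

```python
def error_free_fraction(per_base_error_rate: float, read_length: int) -> float:
    """Probability that a read contains no errors.

    Assumes each base is wrong independently with the same probability,
    so the read is error-free with probability (1 - rate) ** length.
    """
    return (1.0 - per_base_error_rate) ** read_length


# Values from the slide: 0.1% per-base error rate, 150 bp reads.
fraction = error_free_fraction(0.001, 150)
print(f"Error-free fraction of 150 bp reads: {fraction:.3f}")
```

Under these assumptions the result is roughly 0.86, so about one read in seven is expected to carry at least one error.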

  4. Background • This module is designed for a genetics or molecular biology class. It requires three lecture or seminar class periods, with optional additional Linux-based lab activities. • Prior to beginning this module, students should be familiar with: • Sample preparation techniques for DNA sequencing • DNA replication and the enzymes that synthesize DNA • Nucleic acid and nucleotide structure

  5. Research Goals • Initial evaluation of the quality of eukaryotic genome sequencing data • Implementation of error correction techniques • Comparison of the quality of sequencing data before and after error correction Sequencing Requirements • Completed small eukaryotic genome data on the Illumina platform • If students will not be performing command-line programming themselves, these data should be analyzed beforehand with: • Jellyfish, to produce data on k-mer frequencies that students can use to generate a histogram in Excel • Quake, to perform error correction so that students can be provided with pre- and post-error correction datasets

  6. Student Learning Goals • At the completion of this module, students will be able to: • Describe the important differences between high-throughput and traditional (low-throughput) experiments • Explain the reasons for variations in the quality of high-throughput datasets • Utilize computational tools to quantify errors in sequencing data • Interpret the quality of a sequencing experiment and implement effective quality control measures

  7. Computer Requirements • Excel or another analytical package to create a k-mer frequency distribution • Galaxy to create a boxplot of PHRED33 scores • Optional: Quake and Jellyfish on a Linux system to generate k-mer data and perform error correction

  8. Vision and Change Competencies • This module will develop students’ abilities to: • Apply the process of science • Design experiments, from methodology through data analysis • Analyze and interpret data • Use modeling and simulation • Design experimental strategies and predict outcomes • Use quantitative reasoning • Depict data using histograms and boxplots • Interpret graphs and use the results of their analysis to modify error correction strategies

  9. Timeline: Class 1 Introductory lecture and data upload • Intro to sequencing history and platforms • Discuss typical sources of error in sequencing reads • Discuss sequence output formats and PHRED33 scores • Upload raw data to Galaxy • Optional: Quake in Linux to manipulate parameters and improve quality http://www.nimr.mrc.ac.uk/mill-hill-essays/bringing-it-all-back-home-next-generation-sequencing-technology-and-you#
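When discussing PHRED33 scores, it can help to show how a FASTQ quality string maps to numbers. The sketch below assumes the standard Phred+33 encoding, in which each quality character's ASCII code minus 33 gives the score Q, and Q corresponds to an estimated error probability of 10^(-Q/10). The function names and the sample quality string are illustrative.

```python
from typing import List


def decode_phred33(quality_string: str) -> List[int]:
    """Convert a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q: int) -> float:
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Hypothetical quality string for a 5-base read.
scores = decode_phred33("II5+!")
for q in scores:
    print(f"Q{q}: error probability {error_probability(q):.4f}")
```

A score of 20 corresponds to a 1 in 100 chance of a wrong base call, and a score of 40 to 1 in 10,000.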

  10. Timeline: Class 2 Setting up analysis and adjusting parameters • Introduce software packages that can be used to assess data quality • Demonstrate breaking sequencing reads into k-mers • Use Excel or Jellyfish to create a k-mer graph • Use Excel or Jellyfish to create a k-mer graph following manipulation of error correction parameters (variations in k-mer size) Figure: k-mer frequency distribution
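Breaking reads into k-mers and tabulating how often each one occurs can be demonstrated on a toy example before students work with real output. The sketch below is a plain-Python illustration and not how Jellyfish is implemented; the reads and the choice of k = 4 are made up for the example.

```python
from collections import Counter
from typing import Iterable


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every overlapping substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def kmer_frequency_distribution(kmer_counts: Counter) -> Counter:
    """Map each occurrence count to the number of distinct k-mers with that count."""
    return Counter(kmer_counts.values())


# Toy reads; the second read differs from the others at its last base.
reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
kmer_counts = count_kmers(reads, k=4)
histogram = kmer_frequency_distribution(kmer_counts)

for occurrences in sorted(histogram):
    print(f"{histogram[occurrences]} distinct k-mer(s) seen {occurrences} time(s)")
```

The resulting pairs of (occurrence count, number of distinct k-mers) are the two columns students would plot as the k-mer graph. Rerunning with a different k shows how the distribution shifts with k-mer size.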

  11. Timeline: Class 3 Assessing quality • Discussion of using PHRED33 scores to assess data quality • Create boxplots of PHRED33 scores in Galaxy for raw data • Create boxplots of PHRED33 scores in Galaxy for data post Quake correction • Optional: have students compare outcomes following Quake correction with different parameters Figures: raw data; data post Quake correction
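To make clear what a per-position quality boxplot summarizes, the numbers behind each box can be computed by hand for a few reads. This is a minimal sketch assuming Phred+33 encoded quality strings; it is not Galaxy's implementation, and the quartile method chosen here (`method="inclusive"`) is one of several conventions, so plotted values from other tools may differ slightly. The sample strings are made up.

```python
import statistics
from typing import List, Tuple

Summary = Tuple[float, float, float, float, float]


def per_position_summary(quality_strings: List[str]) -> List[Summary]:
    """Return (min, q1, median, q3, max) of Phred scores at each read position."""
    per_position = {}
    for qual in quality_strings:
        for pos, ch in enumerate(qual):
            per_position.setdefault(pos, []).append(ord(ch) - 33)

    summaries = []
    for pos in sorted(per_position):
        scores = per_position[pos]
        if len(scores) < 2:
            # A single observation has no spread; reuse it for all quartiles.
            q1 = median = q3 = float(scores[0])
        else:
            q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
        summaries.append((min(scores), q1, median, q3, max(scores)))
    return summaries


# Hypothetical quality strings for four 3-base reads.
summaries = per_position_summary(["II5", "I5+", "I?+", "5I!"])
for pos, (low, q1, median, q3, high) in enumerate(summaries, start=1):
    print(f"position {pos}: min={low} q1={q1} median={median} q3={q3} max={high}")
```

Running the same summary on pre- and post-correction quality strings gives students a numerical counterpart to the two boxplots.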

  12. Discussion Topics • Why has next-generation sequencing technology led to a revolution in biology/medicine? • Discuss and predict how chemical and physical mechanisms lead to errors • Comparison of sequence improvement based on different parameters • How do software packages determine which base is in error and which is correct if sequencing reads conflict? • Why is it important to have a numerical measure of error in addition to the nucleotide sequence?
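For the discussion of how software decides which base is in error, a toy example can show the basic intuition: k-mers that occur only rarely across many reads are more likely to contain an error than k-mers that occur often. The sketch below is a deliberately simplified illustration and is not Quake's algorithm, which also uses quality scores and a statistically chosen coverage cutoff. The cutoff `min_count`, the function names, and the reads are all assumptions made for this example.

```python
from collections import Counter
from typing import List, Set


def find_untrusted_kmers(reads: List[str], k: int, min_count: int) -> Set[str]:
    """Return k-mers seen fewer than min_count times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count < min_count}


def suspicious_positions(read: str, k: int, untrusted: Set[str]) -> List[int]:
    """Return positions where every k-mer covering that base is untrusted."""
    if len(read) < k:
        return []
    flagged = []
    for pos in range(len(read)):
        first_start = max(0, pos - k + 1)
        last_start = min(pos, len(read) - k)
        covering = [read[s:s + k] for s in range(first_start, last_start + 1)]
        if all(kmer in untrusted for kmer in covering):
            flagged.append(pos)
    return flagged


# Five identical reads plus one read with a substitution at index 4 (T to A).
reads = ["ACGTTGCA"] * 5 + ["ACGTAGCA"]
untrusted = find_untrusted_kmers(reads, k=3, min_count=2)
flagged = suspicious_positions("ACGTAGCA", 3, untrusted)
print(f"Suspicious positions (0-based): {flagged}")
```

In this toy case only the substituted base is flagged, because the bases next to it are still covered by at least one frequently seen k-mer.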

  13. Assessment • This module will be performed as a team-based project with students preparing and handing in a report at the end. Students will be able to: • Predict predominant types or sources of error based on experimental design and sequencing platform • Prepare a boxplot using Galaxy for an exemplary dataset and use the boxplot to evaluate the quality of the sequence data • Effectively improve the quality of any set of NGS reads prior to assembly

  14. References • https://banana-slug.soe.ucsc.edu/bioinformatic_tools:jellyfish • www.en.wikipedia.org/wiki/FASTQ_format • Kelley DR, Schatz MC, Salzberg SL. 2010. Quake: quality-aware detection and correction of sequencing errors. Genome Biology 11:R116. • Marçais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27:764-770. [Jellyfish program] • http://res.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf • http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2581791/pdf/ukmss-2586.pdf
