
NGS data format and General Quality Control


phila

Presentation Transcript


  1. NGS data format and General Quality Control

  2. Data format “Flowchart”

  3. Fastq file
     • Used to record raw reads coming off the sequencers
     • Each record contains four lines
     • Parameters such as read length are usually set by the sequencer

  4. Fastq file

  5. • Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
     • Line 2 is the raw sequence letters. The read length is the length of the string.
     • Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
     • Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
     Source: http://en.wikipedia.org/wiki/FASTQ_format
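The four-line layout described above can be sketched as a small reader. This is only an illustration, not part of the original slides: the names `FastqRecord` and `read_fastq` are hypothetical, and the sketch assumes strict four-line records with no wrapped sequences and no blank lines.

```python
from io import StringIO
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # Line 1 without the leading '@' (identifier plus optional description)
    sequence: str    # Line 2: raw sequence letters
    quality: str     # Line 4: encoded quality string, same length as the sequence


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming exactly four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\r\n")
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"line 1 must start with '@', got {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"line 3 must start with '+', got {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("line 4 must have as many symbols as line 2 has letters")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up example record, for illustration only.
example = "@read1 sample description\nGATTACA\n+\nIIIIHHG\n"
records = list(read_fastq(StringIO(example)))
print(records[0].identifier, len(records[0].sequence))
```

The read length falls out of `len(record.sequence)`, matching the note on Line 2.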

  6. General quality control of raw reads
     • Using FastQC
     • A tool that applies a set of general quality-check rules to raw reads and reports the following modules:
       • Basic Statistics
       • Per base sequence quality
       • Per sequence quality scores
       • Per base sequence content
       • Per base GC content
       • Per sequence GC content
       • Per base N content
       • Sequence Length Distribution
       • Sequence Duplication Levels
       • Overrepresented sequences
       • Kmer Content
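To give a feel for what checks such as Basic Statistics and Sequence Length Distribution summarize, here is a rough sketch that computes similar numbers from a list of read sequences. It is not FastQC's own algorithm; the function name `basic_statistics` and the dictionary keys are hypothetical, and the %GC here is simply G plus C over all letters.

```python
from collections import Counter
from typing import Dict, List


def basic_statistics(sequences: List[str]) -> Dict[str, object]:
    """Summarize a non-empty list of reads: count, length range, %GC, length histogram."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    lengths = [len(s) for s in sequences]
    base_counts: Counter = Counter()
    for s in sequences:
        base_counts.update(s.upper())
    total_bases = sum(lengths)
    gc_bases = base_counts["G"] + base_counts["C"]
    return {
        "total_sequences": len(sequences),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "percent_gc": 100.0 * gc_bases / total_bases if total_bases else 0.0,
        "length_distribution": dict(Counter(lengths)),
    }


# Made-up reads, for illustration only.
stats = basic_statistics(["GATTACA", "GGCC", "ATAT"])
print(stats)
```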

  7. Quality scores
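As a rough illustration of how the quality line is read, the sketch below decodes a quality string into Phred scores and error probabilities. It assumes the common Phred+33 (Sanger-style) ASCII offset, which the slides do not state; older Illumina pipelines used an offset of 64. The function names are hypothetical.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert each quality character to its Phred score (assumed ASCII offset)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Made-up quality string, for illustration only.
scores = phred_scores("II5!")
probabilities = [error_probability(q) for q in scores]
print(scores)
```

Under this assumption, a score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 40 to 1 in 10,000.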

  8. Per base “N” percentage
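The per base “N” percentage can be approximated as the share of reads with an uncalled base (N) at each position. The sketch below is a simplified illustration, not FastQC's implementation; the function name is hypothetical, and reads shorter than the longest read simply do not count toward later positions.

```python
from typing import List


def per_base_n_percentage(sequences: List[str]) -> List[float]:
    """Return, for each read position, the percentage of reads with 'N' there."""
    max_len = max((len(s) for s in sequences), default=0)
    n_counts = [0] * max_len   # reads with 'N' at each position
    coverage = [0] * max_len   # reads long enough to cover each position
    for s in sequences:
        for i, base in enumerate(s.upper()):
            coverage[i] += 1
            if base == "N":
                n_counts[i] += 1
    # Every position below max_len is covered by at least one read.
    return [100.0 * n / c for n, c in zip(n_counts, coverage)]


# Made-up reads, for illustration only.
percentages = per_base_n_percentage(["ACGN", "NCGT", "ACG"])
print(percentages)
```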

  9. Sample FastQC reports
     • Good quality: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc/fastqc_report.html
     • Bad quality: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc/fastqc_report.html

  10. Data format “Flowchart”

  11. SAM/BAM
      • SAM stands for Sequence Alignment/Map
      • BAM is the compressed binary form of SAM
      • Used for mapped/aligned reads
      • Generated by NGS mappers/aligners
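For orientation, a SAM alignment line is tab-separated text with eleven mandatory fields followed by optional tags. The sketch below parses one such line. It is a minimal illustration only: the class and function names are hypothetical, the example line is made up, and real work would normally go through an established library rather than hand parsing.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. "7M"
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # encoded base qualities
    tags: List[str]  # optional TAG:TYPE:VALUE fields, left unparsed here

    @property
    def is_mapped(self) -> bool:
        return not (self.flag & 0x4)  # flag bit 0x4 marks an unmapped read


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], fields[11:],
    )


# Made-up alignment line, for illustration only.
example = "read1\t0\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIHHG\tNM:i:0"
record = parse_sam_line(example)
print(record.rname, record.pos, record.cigar, record.is_mapped)
```

BAM stores the same records in compressed binary form, so it cannot be read line by line like this.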

  12. SAM

  13. BAM
