1 / 19

Quality Control of Illumina Data

Quality Control of Illumina Data. Mick Watson Director of ARK-Genomics The Roslin Institute. Quality scores. Quality scores. The sequencer outputs base calls at each position of a read It also outputs a quality value at each position

kat
Download Presentation

Quality Control of Illumina Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Quality Control of Illumina Data Mick Watson Director of ARK-Genomics The Roslin Institute

  2. Quality scores

  3. Quality scores • The sequencer outputs base calls at each position of a read • It also outputs a quality value at each position • This relates to the probability that that base call is incorrect • The most common Quality value is the Sanger Q score, or Phred score • Qsanger -10 * log10(p) • Where p is the probability that the call is incorrect • If p = 0.05, there is a 5% chance, or 1 in 20 chance, it is incorrect • If p = 0.01, there is a 1% chance, or 1 in 100 chance, it is incorrect • If p = 0.001, there is a 0.1% chance, or 1 in 1000 chance, it is incorrect • Using the equation: • p=0.05, Qsanger= 13 • p=0.01, Qsanger= 20 • p=0.001, Qsanger= 30

  4. For the geeks…. • In R, you can investigate this: sangerq<- function(x) {return(-10 * log10(x))} sangerq(0.05) sangerq(0.01) sangerq(0.001) plot(seq(0,1,by=0.00001),sangerq(seq(0,1,by=0.00001)), type="l")

  5. The plot

  6. For the geeks…. • And the other way round…. qtop<- function(x) {return(10^(x/-10))} qtop(30) qtop(20) qtop(13) plot(seq(40,1,by=-1), qtop(seq(40,1,by=-1)), type="l")

  7. The important stuff • Q30 – 1 in 1000 chance base is incorrect • Q20 – 1 in 100 chance base is incorrect

  8. Quality encoding

  9. Quality Encoding • Bioinformaticians do not like to make your life easy! • Q scores of 20, 30 etc take two digits • Bioinformaticians would prefer they only took 1 • In computers, letters have a corresponding ASCII code: • Therefore, to save space, we convert the Q score (two digits) to a single letter using this scheme

  10. The process in full • p(probability base is wrong) : 0.01 • Q (-10 * log10(p)) : 30 • Add 33 : 63 • Encode as character : ?

  11. For the geeks…. code2Q <- function(x) { return(utf8ToInt(x)-33) } code2Q(".") code2Q("5") code2Q("?") code2P <- function(x) { return(10^((utf8ToInt(x)-33)/-10)) } code2P(".") code2P("5") code2P("?")

  12. QC of Illumina data

  13. FastQC • FastQC is a free piece of software • Written by Babraham Bioinformatics group • http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ • Available on Linux, Windows etc • Command-line or GUI

  14. Read the documentation Follow the course notes

  15. Per sequence quality • One of the most important plots from FastQC • Plots a box at each position • The box shows the distribution of quality values at that position across all reads

  16. Obvious problems

  17. Less obvious problems

  18. Really bad problems

  19. Other useful plots • Per sequence N content • May identify cycles that are unreliable • Over-represented sequences • May identify Illumina adapters and primers

More Related