Quality Control of Illumina Data

Mick Watson

Director of ARK-Genomics

The Roslin Institute

Quality scores
  • The sequencer outputs base calls at each position of a read
  • It also outputs a quality value at each position
    • This relates to the probability that that base call is incorrect
  • The most common Quality value is the Sanger Q score, or Phred score
    • Qsanger = -10 * log10(p)
    • Where p is the probability that the call is incorrect
    • If p = 0.05, there is a 5% chance, or 1 in 20 chance, it is incorrect
    • If p = 0.01, there is a 1% chance, or 1 in 100 chance, it is incorrect
    • If p = 0.001, there is a 0.1% chance, or 1 in 1000 chance, it is incorrect
  • Using the equation:
    • p=0.05, Qsanger= 13 (13.01, rounded)
    • p=0.01, Qsanger= 20
    • p=0.001, Qsanger= 30
For the geeks….
  • In R, you can investigate this:

sangerq<- function(x) {return(-10 * log10(x))}

sangerq(0.05)

sangerq(0.01)

sangerq(0.001)

# Start just above 0: log10(0) is -Inf, so the Q score at p = 0 is infinite

p <- seq(0.00001, 1, by=0.00001)

plot(p, sangerq(p), type="l")

For the geeks….
  • And the other way round….

qtop<- function(x) {return(10^(x/-10))}

qtop(30)

qtop(20)

qtop(13)

plot(seq(40,1,by=-1), qtop(seq(40,1,by=-1)), type="l")

The important stuff
  • Q30 – 1 in 1000 chance base is incorrect
  • Q20 – 1 in 100 chance base is incorrect
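
To see what these two thresholds mean for a whole read, here is a minimal sketch in R. The read length of 100 is a hypothetical value chosen for illustration, and the sketch assumes every base in the read has the same Q score, which real reads do not.

```r
# Convert a Phred (Sanger) quality score to an error probability.
q_to_p <- function(q) 10^(-q / 10)

read_length <- 100  # hypothetical read length, for illustration only

# Expected number of wrong base calls if every base had the same Q score.
expected_errors <- function(q, n = read_length) n * q_to_p(q)

expected_errors(20)  # a 100-base read at Q20
expected_errors(30)  # a 100-base read at Q30
```

Under these assumptions a Q20 read is expected to carry about one wrong base, and a Q30 read about a tenth of one.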
Quality Encoding
  • Bioinformaticians do not like to make your life easy!
  • Q scores of 20, 30 etc take two digits
  • Bioinformaticians would prefer they only took 1
  • In computers, every character has a corresponding numeric ASCII code
  • Therefore, to save space, we convert the Q score (two digits) to a single character: add a fixed offset to the score and use the character with that ASCII code
The process in full
  • p (probability base is wrong) : 0.001
  • Q (-10 * log10(p)) : 30
  • Add 33 : 63
  • Encode as character : ?
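
The encoding steps can be sketched in R as follows. The function name p2code is made up for this illustration, and rounding Q to a whole number is an assumption, needed because only whole-number codes map to a character.

```r
# Encode an error probability as a single quality character (offset 33).
p2code <- function(p) {
  q <- round(-10 * log10(p))  # Phred score, rounded to a whole number
  intToUtf8(q + 33)           # add 33 and look up the character with that code
}

p2code(0.001)  # Q30 -> 63 -> "?"
p2code(0.01)   # Q20 -> 53 -> "5"
p2code(0.05)   # Q13 -> 46 -> "."
```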
For the geeks….

code2Q <- function(x) { return(utf8ToInt(x)-33) }

code2Q(".")

code2Q("5")

code2Q("?")

code2P <- function(x) { return(10^((utf8ToInt(x)-33)/-10)) }

code2P(".")

code2P("5")

code2P("?")
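
A real FASTQ record holds one quality character per base, so in practice you decode a whole string at once. This minimal sketch relies on utf8ToInt returning one code per character. The name qual2Q and the quality string are hypothetical.

```r
# Decode a whole Phred+33 quality string into one Q score per base.
qual2Q <- function(qual) utf8ToInt(qual) - 33

qual <- "??5.5"    # hypothetical quality string, for illustration only
q <- qual2Q(qual)  # one Q score per character
p <- 10^(q / -10)  # the matching per-base error probabilities

q
p
```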

FastQC
  • FastQC is a free piece of software
  • Written by Babraham Bioinformatics group
  • http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  • Available on Linux, Windows etc
  • Command-line or GUI
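
As a rough sketch of the command-line route, FastQC can be called from R with system2. The file and directory names below are hypothetical, and the sketch assumes the fastqc executable is on your PATH. Check the FastQC documentation for the options your version supports.

```r
# Hypothetical paths, for illustration only.
fastq_file <- "sample_R1.fastq.gz"
out_dir    <- "fastqc_results"

# Arguments for a command line of the form: fastqc --outdir <dir> <file>
fastqc_args <- c("--outdir", out_dir, fastq_file)

# Only call FastQC if it is on the PATH and the input file exists.
if (nzchar(Sys.which("fastqc")) && file.exists(fastq_file)) {
  dir.create(out_dir, showWarnings = FALSE)  # the output directory must exist
  system2("fastqc", args = fastqc_args)
} else {
  message("FastQC or the input file was not found; nothing was run.")
}
```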

Read the documentation

Follow the course notes

Per base sequence quality
  • One of the most important plots from FastQC
  • Plots a box at each position
  • The box shows the distribution of quality values at that position across all reads
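
To show what goes into this plot, here is a minimal sketch in base R. It is not how FastQC itself is implemented. The four quality strings are made-up toy data and are assumed to be Phred+33 encoded and all the same length.

```r
# Hypothetical Phred+33 quality strings, one per read, all the same length.
quals <- c("IIIIIIII??55..",
           "IIIIII????555.",
           "IIIIIIIII??555",
           "IIIII?????55..")

# One row per read, one column per position in the read.
q_matrix <- t(sapply(quals, function(x) utf8ToInt(x) - 33, USE.NAMES = FALSE))

# One box per position, summarising the Q scores across all reads.
boxplot(q_matrix, xlab = "Position in read", ylab = "Q score")

# Median Q score at each position.
apply(q_matrix, 2, median)
```

With real data the boxes typically sink towards the end of the read, which is the pattern this plot is meant to reveal.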
Other useful plots
  • Per base N content
    • May identify cycles that are unreliable
  • Over-represented sequences
    • May identify Illumina adapters and primers
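
The idea behind the N content plot can be sketched in base R. The reads below are made-up toy data, assumed to be all the same length. A cycle where many reads have an N stands out as a peak.

```r
# Hypothetical reads, for illustration only (all the same length).
reads <- c("ACGTNACGTA",
           "ACGTNACGNA",
           "ACGTAACGTA",
           "ACNTNACGTA")

# One row per read, one column per position (sequencing cycle).
bases <- do.call(rbind, strsplit(reads, ""))

# Percentage of reads with an N call at each position.
n_percent <- 100 * colMeans(bases == "N")

plot(n_percent, type = "l", xlab = "Position in read", ylab = "% N")
```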