1 / 33

Introduction to Short Read Sequencing Analysis

Introduction to Short Read Sequencing Analysis. Jim Noonan GENE 760. Sequence read lengths remain limiting. Chr1: 249 Mb . 249 Mb sequencing read. Current platforms: A moderate number (~500,000) of long reads (~10 kb) A very large number (>200 M) of short reads (100 bp ) .

bianca
Download Presentation

Introduction to Short Read Sequencing Analysis

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760

  2. Sequence read lengths remain limiting Chr1: 249 Mb 249 Mb sequencing read • Current platforms: • A moderate number (~500,000) of long reads (~10 kb) • A very large number (>200 M) of short reads (100 bp) • For most applications reads are aligned to a reference genome • Short reads contain inherently limited information • De novo assembly of short reads is difficult

  3. Determining the identity and location of short sequence reads in the genome/exome/transcriptome @HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG Aligning short reads to much larger reference Need a computationally efficient method to perform accurate alignments of millions of reads

  4. Read length requirements vary depending on the feature being studied Exome: 80-120 bp Splice junctions (connectivity) Transcriptome: 10,000 bp

  5. Determining the identity and location of short sequence reads in the genome/exome/transcriptome @HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG @HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG Aligning short reads to much larger reference Exome or Genome • Considerations • Alignment scoring • Source of the reads • Sequencing format (PE or SE) • Read length • Error rates Transcriptome

  6. Topics • Scoring alignments • Error rates and quality scores for short read sequencing • Mapability • Common algorithms for short read sequence alignment • Scoring short read sequence aligments • Uniform data output formats

  7. Scoring alignments Correct: C | C Match (+1) TAGATTACACAGATTAC ||||||||||||||||| TAGATTACACAGATTAC C T Mismatch (-1, -2, etc.) Wrong: A-TAC ||||| ATTAC Gap penalty: P = a +bN a = cost of opening a gap b = cost of extending gap by 1 N = length of gap TAGATTACTCAGA-TAC |||||||| |||| ||| TAGATTACACAGATTAC A--AC ||||| ATTAC Many short read alignment algorithms allow a fixed number of mismatches Adapted from Mark Gerstein

  8. Scoring alignments Correct (polymorphism): C | C Match (+1) TAGATTACTCAGATTAC |||||||| |||||||| TAGATTACACAGATTAC C T Mismatch (-1, -2, etc.) Wrong: A-TAC ||||| ATTAC Gap penalty: P = a +bN a = cost of opening a gap b = cost of extending gap by 1 N = length of gap TAGATTACTCAGA-TAC |||||||| |||| ||| TAGATTACACAGATTAC A--AC ||||| ATTAC Many short read alignment algorithms allow a fixed number of mismatches Adapted from Mark Gerstein

  9. Quality scores • A quality score (or Q-score) expresses the probability that a basecall is incorrect. • Given a basecall, A: • The estimated probability that A is not correct is P(~A); • The quality score for A is Q (A) = -10 log10 (P(~A)) A quality score of 10 means a probability of 0.1 that A is the wrong basecall. P(~A) is platform-specific; Q-scores can be compared across platforms. Quality scores are logarithmic:

  10. Error rates in lllumina sequencing reads Sequencing by synthesis with reversible dye terminators Reverse termination Add next base, etc. Add base Scan flow cell Individual synthesis reactions go out of phase 1 cycle

  11. Error rates in lllumina sequencing reads • Error rates are mismatch rates relative to reference genome • Error rates increase with increasing cycle number • Contingent on reference genome quality • Reads may be trimmed to improve alignment quality

  12. Illumina quality score encoding in FASTQ format (CASAVA v1.8) >90% Q30 bases in high quality run >80% mappable reads

  13. Sources of error in single-molecule sequencing Illumina: Consensus signal TAGATTACACAGATTAC ||||||||||||||||| TAGATTACACAGATTAC TAGATTA-ACAG-TT-C ||||||| |||| || | TAGATTACACAGATTAC PacBio: One molecule, one read Sequence templates multiple times

  14. Mapability • The genome contains non-unique sequences (repeats, segmental duplications) • Short reads derived from repetitive regions are difficult to map Chr3 repeat Chr7 repeat Longer reads: Paired reads:

  15. Mapability scores at UCSC • The genome contains non-unique sequences (repeats, segmental duplications) • Short reads derived from repetitive regions are difficult to map 36mers, 2 mismatches 75mers, 2 mismatches 100mers, 2 mismatches

  16. Poorly mappable regions of the genome 36mers, 2 mismatches 75mers, 2 mismatches 100mers, 2 mismatches

  17. Common algorithms for mapping short reads to a reference genome • Considerations • Alignment scoring method • Speed • Quality aware • Seeding • Gapped alignment

  18. Seed-based alignment strategy Single seed alignments Reference Seed Critical values are seed length and number of mismatches allowed In ELAND: Seed length = 32 Number of mismatches = 2 Multiseed alignments (ELAND v2, others) Seed interval contingent on read length

  19. Implementation in ELAND v2 A read must have at least one seed with no more than 2 mismatches and no gaps Gapped alignment: extend each alignment to full length of read, allowing gaps up to 10 bp

  20. Resolving ambiguous read alignments with multiple seeds Reference Seed

  21. Resolving ambiguous read alignments with multiple seeds

  22. Utility of gapped alignments RNA-seq Insertions and deletion variants in exome and whole genome sequencing

  23. Mapping paired end reads Read 1 Read 2 Insert size Insert size within specified range

  24. ELAND alignment scoring Base quality values and mismatch positions in a candidate alignment are used to assign a p value P values reflect probability that candidate position in genome would give rise to the observed read if its bases were sequenced at error rates corresponding to the read’s quality values Alignment score for a read is computed from p values of all candidate alignments If there are two candidates for a read with p values 0.9 and 0.3: • 0.9/(0.9+0.3) = 0.75, chance highest scoring alignment is correct • 1- 0.75, chance highest scoring alignment is wrong • Alignment score = -10 log(0.25) = 6.

  25. BaseSpace https://basespace.illumina.com/

  26. Spaced-seed indexing of the reference genome • Need to break up the genome into • manageable segments • Create index of short sequences • Match seeds against genome index alignment Trapnell and Salzberg, Nat Biotechnology 27:455 (2009)

  27. Reference genome indexing using Burrows-Wheeler transform • Reversible encoding scheme • Simplifies genome sequence • Results in “indexed” genome • Very rapid alignments alignment Trapnell and Salzberg, Nat Biotechnology 27:455 (2009)

  28. Bowtie 2 Pre-built Indexed genomes Bowtie 1 and Bowtie 2 indexes are not compatible

  29. Alignments in Bowtie 2 @HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG Multiseedalignment (ungapped) Seed length: 16 nt, every 10 nt # mismatches: 0 Gap = -11 -5 to open -3 to extend by 1 bp Mismatch = -6 Seeds are extended (gaps allowed) to generate alignment Ref Match = 2 TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG Read TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG

  30. Mapping in highly repetitive regions • ELAND is conservative • Non-unique alignments are flagged; only one is reported in export.txt • Post-alignment CASAVA analyses ignore these • Bowtie will report non-unique alignments • User-specified options determine how these are reported http://bowtie-bio.sourceforge.net/manual.shtml http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml

  31. Sequence Alignment/Map (SAM) format • Standard format for reporting short read alignment data • BAM is compressed version Header Alignment info http://samtools.sourceforge.net/

  32. Summary • Read the material posted for this lecture on the class wiki • Next week: first Regulomics lecture

More Related