
Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops. www.bioinformatics.ca. Module 3: Mapping and Genome Rearrangement. Jared Simpson, Ph.D. Bioinformatics for Cancer Genomics, May 30 – June 3, 2016. from: doi:10.1038/nmeth.2258.


Presentation Transcript


  1. Canadian Bioinformatics Workshops www.bioinformatics.ca

  2. Module #: Title of Module 2

  3. Module 3 Mapping and Genome Rearrangement Jared Simpson, Ph.D. Bioinformatics for Cancer Genomics May 30 – June 3, 2016 from: doi:10.1038/nmeth.2258

  4. Learning Objectives of Module • Understand mapping reads to a reference genome • Understand the FASTQ and SAM/BAM file formats • Learn common terminology used to describe alignments • Learn how to find genome rearrangements using read pairs • Run a mapper and rearrangement caller

  5. Sequencing platforms [Figure: sequencing platforms arranged by increasing data per run and increasing run time; labeled throughputs: 14TB/run, 600Gb/10d, 100Gb/15d, 120Gb/1d, 90Gb/10d, 150Mb/3h, 2Gb/27h, 700Mb/23h, 100Mb/1h]

  6. Illumina Sequencing

  7. Basecalling • Prediction of the DNA sequence from the images

  8. Sources of error • Illumina: Pre-phasing & Phasing

  9. Error Profiles • Illumina • Low error rate (~0.5%), mainly substitutions • 454/Ion Torrent • Mainly insertions/deletions in homopolymer runs • PacBio and Oxford Nanopore • Single molecule sequencers • Higher error rate, mixture of insertions, deletions, substitutions

  10. Illumina Error Profile

  11. What is a FASTQ file?

  12. What is a FASTQ file? • Read name

  13. What is a FASTQ file? • Basecalled sequence

  14. What is a FASTQ file? • Quality separator

  15. What is a FASTQ file? • Base quality scores

  16. What is a base quality score? • Phred quality scores: • Estimate of probability the base call is incorrect
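The relationship between a Phred score and an error probability can be written as Q = -10 * log10(P). The sketch below parses one four-line FASTQ record (read name, basecalled sequence, quality separator, base quality scores) and decodes the quality characters. It assumes the common Phred+33 ASCII encoding, and the example record is made up for illustration; it is not taken from the slides.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_qualities(quality_string, offset=33):
    """Decode ASCII quality characters; offset=33 (Phred+33) is an assumption."""
    return [ord(char) - offset for char in quality_string]


def parse_fastq_record(lines):
    """Parse the four lines of one FASTQ record into a dictionary."""
    name, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not name.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return {
        "name": name[1:],
        "sequence": sequence,
        "qualities": decode_qualities(quality),
    }


# Hypothetical record, invented for this example.
record = parse_fastq_record(["@read1", "ACGT", "+", "I5+#"])
for base, q in zip(record["sequence"], record["qualities"]):
    print(base, q, phred_to_error_prob(q))
```

In this example the four quality characters decode to Q40, Q20, Q10 and Q2, so the estimated error probability rises from 1 in 10,000 for the first base to roughly 0.63 for the last.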

  17. Reference Mapping

  18. Reference Mapping • Why do we map reads to the reference? • By comparing the reads from a sequenced individual to a reference genome, we can identify variants such as SNPs and rearrangements • To do this, we need to identify where in the reference genome each read might have come from

  19. Reference Mapping Issues • The genome is very large and repetitive • The mapping program must be efficient and tolerant of repetitive sequences • Mappers like BWA use an index of the reference genome to rapidly identify possible mapping locations

  20. Reference Mapping Issues • The reads contain sequencing errors • The mapping program must tolerate differences between the reads and the reference • Typically the mapper will find exact-match seeds, then refine the seed alignments using dynamic programming • Mapping reads with many errors or insertions/deletions is much harder
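The seed-and-refine idea can be illustrated with a toy mapper. This is a minimal sketch, not how BWA works: it uses a plain k-mer hash table as the index (BWA uses a compressed index of the reference), and it swaps the dynamic programming refinement for a simple mismatch count, so it cannot handle insertions or deletions. The reference string, read, and parameter values are invented for the example.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Find candidate start positions from exact k-mer seeds, then verify each one.

    Verification here counts mismatches only (no gaps); a real mapper would
    refine each seed with a dynamic programming alignment instead.
    """
    candidates = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for ref_pos in index.get(seed, []):
            start = ref_pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)

    hits = []
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Toy data: the read is reference[8:18] with one substitution (A -> G).
reference = "GATTACAGGCTTAACCGTAGCATCG"
index = build_kmer_index(reference, k=4)
hits = map_read("GCTTGACCGT", reference, index, k=4)
print(hits)  # [(8, 1)]
```

The read still maps despite the sequencing error because the seeds on either side of the substitution match the reference exactly. A repetitive reference would produce several candidate positions for the same seed, which is the source of ambiguous mappings.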

  21. Reference Mapping Issues • Short read sequencers produce huge amounts of data • The mapping algorithm must be extremely efficient while accounting for the issues discussed above

  22. Choosing a Mapper • Needs to be accurate: misaligned reads are a source of false positive variant calls • Needs to be sensitive: must allow for differences between the individual and the reference • Needs to be fast

  23. Reference Mapping Reference genome Sequence read ?

  24. Reference Mapping Reference genome x x x Sequence read

  25. Mapping Quality • Phred-scaled estimate of the probability that the chosen mapping is wrong • 1 in 1000 reads with “Q30” alignment will be placed incorrectly • What causes mapping errors? • High error rate • Repetitive sequence • Differences between the reference and sequenced sample

  26. What are Paired Reads? DNA fragment ATCAAGA CTACATG Insert size (IS) Slides by M. Brudno

  27. Paired Reads Reference genome ? Sequence read pair

  28. Paired Mapping Reference genome x x Sequence read pair

  29. Paired Mapping Reference genome x x x x x x x x Sequence read pair

  30. Sequence Alignment/Map Format • SAM/BAM is a format for working with mapped reads • SAM is tab-delimited text representation • BAM is a compressed binary representation SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT #>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77

  31. SAM Format SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT #>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77 Read ID, Flag • The flag indicates the reference strand and pairing information
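The flag is a bit field, so the value 99 in the example record can be unpacked into individual properties. The sketch below uses the bit values defined in the SAM specification; the short property names are labels chosen for this example, not official identifiers.

```python
# Bit values follow the SAM specification; the names are illustrative labels.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


print(decode_flag(99))
# ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
```

Flag 99 is 1 + 2 + 32 + 64: the read is paired, mapped in a proper pair, is the first read of the pair, and its mate is on the reverse strand. The reverse-strand bit (0x10) is not set, so this read itself maps to the forward strand.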

  32. SAM Description SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT #>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77 Chromosome Position

  33. SAM Description SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT #>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77 Mapping Quality

  34. SAM Description SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT #>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77 CIGAR • Ref ACGATACATAC, Read ACGA-ACATAC, CIGAR: 4M1D6M • Ref GACA-AACC, Read GTCATAACC, CIGAR: 4M1I4M
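A CIGAR string is a run-length encoding of the alignment, and the two examples on this slide can be checked by counting how many reference and read bases each operation consumes. The following is a minimal sketch; the operation letters and which sequence each one consumes follow the SAM specification, while the function names are made up for this example.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_READ = set("MIS=X")       # operations that advance along the read


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Reject strings that contain anything other than valid operations.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar):
    """Return (reference bases covered, read bases covered) for a CIGAR string."""
    ref_len = read_len = 0
    for length, op in parse_cigar(cigar):
        if op in CONSUMES_REFERENCE:
            ref_len += length
        if op in CONSUMES_READ:
            read_len += length
    return ref_len, read_len


for cigar in ["4M1D6M", "4M1I4M", "76M"]:
    print(cigar, cigar_lengths(cigar))
```

For `4M1D6M` the alignment covers 11 reference bases with a 10-base read, and for `4M1I4M` it covers 8 reference bases with a 9-base read, which matches the two Ref/Read pairs shown on the slide. Note that `M` means an aligned position and can be either a match or a mismatch, as in the second example.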

  35. SAM Description Mate chromosome, position Insert size SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT #>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77 ATCAA CTAAG Insert size (IS)
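The insert size column can be checked against the other fields of the example record. The sketch below splits the record into the eleven mandatory SAM columns and recomputes the span from this read's leftmost position to the end of its mate. It assumes the mate is also aligned as 76M, which is not shown on the slide, and the column names are the ones used in the SAM specification written in lower case.

```python
SAM_COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]
INT_COLUMNS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_sam_line(line):
    """Parse one tab-delimited SAM alignment line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(SAM_COLUMNS):
        raise ValueError("a SAM line needs at least 11 fields")
    record = dict(zip(SAM_COLUMNS, fields))
    for column in INT_COLUMNS:
        record[column] = int(record[column])
    record["optional"] = fields[len(SAM_COLUMNS):]
    return record


def outer_span(pos, pnext, mate_aligned_length):
    """Span from the first base of this read to the last base of its mate.

    Only valid when this read is the leftmost of the pair.
    """
    return pnext + mate_aligned_length - pos


# The record from the slide, rebuilt with tab separators.
line = "\t".join([
    "SRR013667.1", "99", "19", "8882171", "60", "76M", "=", "8882214", "119",
    "NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT",
    "#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77",
])
record = parse_sam_line(line)

# Assumption: the mate covers 76 reference bases, like this read.
span = outer_span(record["pos"], record["pnext"], mate_aligned_length=76)
print(record["rname"], record["pos"], record["tlen"], span)
```

With a 76-base mate starting at 8882214, the span is 8882214 + 76 - 8882171 = 119, which agrees with the insert size stored in the record. The `=` in the mate chromosome column means the mate maps to the same chromosome as the read.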

  36. Resources • samtools: toolkit for working with SAM/BAM files • Convert between SAM/BAM • Sort alignments • Extract alignments for a given genomic location • SAM/BAM specification: http://samtools.sourceforge.net/SAM1.pdf • Questions/Help: https://lists.sourceforge.net/lists/listinfo/samtools-help, http://www.biostars.org/, http://seqanswers.com/

  37. Viewing Alignments - IGV

  38. Alignment Problems

  39. Alignment Problems

  40. We are now going to start a read mapping exercise

  41. We are on a Coffee Break & Networking Session

  42. Types of variation • Single Nucleotide Variants (SNVs) • Insertions/deletions (INDELs) • Structural variations • Large insertions and deletions • Inversions • Translocations • Copy number variation

  43. Structural variants using paired-end reads • Genomic DNA • Fragmentation and size selection (200-500 bp) • Add sequencing adaptors • Sequence both ends

  44. Read pair orientation Reference read pair • Expected orientation: • one read on the forward strand, one read on the reverse strand

  45. Fragment size distribution from: doi:10.1038/ng.3121 • Fragment/insert size is determined by library preparation • Pairs that match the expected orientation and distance are called concordant • Discordant read pairs give evidence of structural variation
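Because the expected distance comes from the library itself, one way to define "expected" is to estimate a range from the observed insert sizes. The sketch below uses the median and the median absolute deviation (MAD), which are not distorted much by a few extreme pairs. This is only one possible approach: the choice of MAD and the cutoff of 5 MADs are assumptions made for this example, and the insert sizes are invented.

```python
import statistics


def insert_size_bounds(insert_sizes, n_mads=5):
    """Estimate the expected insert size range as median +/- n_mads * MAD.

    n_mads=5 is an arbitrary illustrative cutoff, not a recommended value.
    """
    median = statistics.median(insert_sizes)
    mad = statistics.median([abs(size - median) for size in insert_sizes])
    return median - n_mads * mad, median + n_mads * mad


def has_expected_distance(insert_size, bounds):
    """True if the insert size falls inside the estimated range."""
    low, high = bounds
    return low <= insert_size <= high


# Invented insert sizes: six typical pairs and one far-apart pair.
sizes = [290, 300, 310, 305, 295, 300, 2500]
bounds = insert_size_bounds(sizes)
print(bounds)
print([size for size in sizes if not has_expected_distance(size, bounds)])
```

Here the estimated range is 275 to 325, so only the 2500 bp pair falls outside it. A pair is concordant only if it also has the expected orientation, which this distance check does not examine.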

  46. SV Signatures: Deletion sample reference Slides by M. Brudno

  47. SV Signatures: Deletion sample reference Signature: mapped insert size larger than expected Slides by M. Brudno

  48. SV Signatures: Insertion sample reference Signature: mapped insert size smaller than expected Slides by M. Brudno

  49. SV Signatures: Tandem Duplication sample reference Signature: wrong orientation
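The three signatures above can be combined into a simple rule for labelling a single read pair. This is a minimal sketch under stated assumptions: the function name, labels, and size limits are hypothetical, the expected orientation is taken to be forward strand for the leftmost read and reverse strand for the rightmost read, and a single discordant pair is only a hint. A rearrangement caller would require many pairs supporting the same event.

```python
def classify_pair(left_strand, right_strand, insert_size, min_size, max_size):
    """Label one read pair using the signatures from the slides.

    left_strand / right_strand: "+" or "-" for the leftmost and rightmost read.
    min_size / max_size: the expected insert size range for the library.
    """
    if (left_strand, right_strand) != ("+", "-"):
        # Wrong orientation, e.g. the tandem duplication signature.
        return "wrong_orientation"
    if insert_size > max_size:
        # Reads map farther apart than expected: deletion signature.
        return "deletion_candidate"
    if insert_size < min_size:
        # Reads map closer together than expected: insertion signature.
        return "insertion_candidate"
    return "concordant"


# Hypothetical library with an expected insert size range of 200-500 bp.
pairs = [
    ("+", "-", 350),   # expected orientation and distance
    ("+", "-", 4200),  # too far apart
    ("+", "-", 120),   # too close together
    ("-", "+", 350),   # reads point away from each other
]
for left, right, size in pairs:
    print(left, right, size, classify_pair(left, right, size, 200, 500))
```

Orientation is checked first because an insert size measured between wrongly oriented reads is not comparable to the library's fragment size distribution.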
