Working with Mapped Reads in R and BioConductor

Advanced Genomic Data Analysis, BIOS 691-804, 2012. Mark Reimers.



  1. Working with Mapped Reads in R and BioConductor. Advanced Genomic Data Analysis, BIOS 691-804, 2012. Mark Reimers

  2. Mapped Read File Formats
     • The common standard is Sequence Alignment/Map (SAM)
     • Accommodates most kinds of sequence information commonly used
     • Efficient for stream processing
     • SAM data, once sorted by coordinate and stored as BAM, may be indexed by genomic position to efficiently retrieve all reads aligning to a locus
     • Compressed binary version: Binary Alignment/Map (BAM)
     • Other formats in use (e.g. SOAP) have mostly similar fields
     • Variant Call Format (VCF) is used by the 1000 Genomes Project for SNPs and structural variants
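To make the idea of position-based retrieval concrete, here is a minimal Python sketch. It is not how a BAM index works internally; it substitutes a much simpler technique (a sorted list of start positions searched with `bisect`) purely to illustrate why sorting by coordinate makes locus queries cheap. The function names `build_index` and `fetch` are hypothetical, and the intervals are the reference spans implied by the SAM example on slide 4.

```python
import bisect

def build_index(reads):
    """Sort reads by start position (hypothetical helper, not a samtools API).

    reads: iterable of (start, end, name) with 1-based inclusive coordinates.
    """
    ordered = sorted(reads)
    starts = [r[0] for r in ordered]
    # The longest read bounds how far left of the query a hit can start.
    max_len = max((r[1] - r[0] + 1 for r in ordered), default=0)
    return ordered, starts, max_len

def fetch(index, lo, hi):
    """Return all reads overlapping the locus [lo, hi]."""
    ordered, starts, max_len = index
    first = bisect.bisect_left(starts, lo - max_len + 1)
    last = bisect.bisect_right(starts, hi)
    # Only the candidate slice is scanned, then filtered on the end position.
    return [r for r in ordered[first:last] if r[1] >= lo]

reads = [(7, 22, "r001"), (9, 18, "r002"), (9, 14, "r003"),
         (16, 40, "r004"), (29, 33, "r003"), (37, 45, "r001")]
index = build_index(reads)
hits = fetch(index, 20, 30)
print([name for _, _, name in hits])
```

With sorted starts, the query touches only reads whose start falls in a bounded window rather than scanning the whole file, which is the property an index over a coordinate-sorted file exploits.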

  3. SAM Example

  4. SAM Example - File Format

     @HD VN:1.3 SO:coordinate
     @SQ SN:ref LN:45
     r001 163 ref  7 30 8M2I4M1D3M = 37  39 TTAGATAAAGGATACTG *
     r002   0 ref  9 30 3S6M1P1I4M *  0   0 AAAAGATAAGGATA    *
     r003   0 ref  9 30 5H6M       *  0   0 AGCTAA            * NM:i:1
     r004   0 ref 16 30 6M14N5M    *  0   0 ATAGCTTCAGC       *
     r003  16 ref 29 30 6H5M       *  0   0 TAGGC             * NM:i:0
     r001  83 ref 37 30 9M         =  7 -39 CAGCGCCAT         *

     • r003 – read identifier
     • ref – reference sequence (e.g. chr18)
     • 8M2I4M1D3M – CIGAR string
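The fields in the example can be unpacked with a few lines of code. The following Python sketch is an illustration only, not part of SAMtools or Rsamtools: the helper names `parse_cigar`, `reference_span`, and `parse_sam_line` are hypothetical. It assumes the standard tab-delimited layout of the eleven mandatory SAM columns, and it assumes that the CIGAR operations M, D, N, = and X are the ones that advance along the reference.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference

def parse_cigar(cigar):
    """Split a CIGAR string such as '8M2I4M1D3M' into (length, op) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]

def reference_span(pos, cigar):
    """Return the 1-based inclusive (start, end) the read covers on the reference."""
    length = sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)
    return pos, pos + length - 1

def parse_sam_line(line):
    """Parse the mandatory columns of one tab-delimited alignment line."""
    f = line.rstrip("\n").split("\t")
    flag = int(f[1])
    return {
        "name": f[0],
        "flag": flag,
        "reverse": bool(flag & 0x10),  # bit 0x10: read maps to the reverse strand
        "ref": f[2],
        "pos": int(f[3]),
        "mapq": int(f[4]),
        "cigar": f[5],
        "seq": f[9],
    }

line = "\t".join(["r001", "163", "ref", "7", "30", "8M2I4M1D3M",
                  "=", "37", "39", "TTAGATAAAGGATACTG", "*"])
rec = parse_sam_line(line)
print(rec["name"], reference_span(rec["pos"], rec["cigar"]))
```

For read r001 the insertion (2I) adds bases to the read but not to the reference, while the deletion (1D) does the opposite, so the 17-base read covers 16 reference positions.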

  5. Tools for SAM Data
     • SAMTools
     • Rsamtools
     • GATK: especially good for SNP calling
     • Picard: good for pre-processing SAM files, e.g. removing duplicates

  6. SAMTools
     • Utilities for manipulating alignments in the SAM format, including sorting, merging, and indexing
     • See http://samtools.sourceforge.net/
     • Rsamtools: an R interface to most of the SAMTools functionality

  7. IRanges
     • Provides efficient low-level S4 classes for storing ranges of integers and RLE (Run-Length Encoding) vectors
     • RLE: a run of consecutive elements with the same value is stored as a single value plus a count
     • Provides several methods for manipulating sequences
     • The foundation of GenomicRanges
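Run-length encoding is easy to show in a few lines. This Python sketch is only an illustration of the idea, not the IRanges implementation; the function names `rle_encode` and `rle_decode` are hypothetical, and the input is a made-up per-base coverage vector.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse each run of equal consecutive values into a (value, count) pair."""
    return [(v, len(list(group))) for v, group in groupby(values)]

def rle_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]

coverage = [0, 0, 0, 2, 2, 1, 0, 0]  # hypothetical per-base read coverage
runs = rle_encode(coverage)
print(runs)
```

Coverage vectors over a genome tend to contain long stretches of identical values (often zero), which is why this representation saves so much space for mapped-read data.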

  8. GenomicRanges
     • General-purpose containers for storing genomic intervals, as well as more specialized containers for storing alignments against a reference genome
     • Usually stores many intervals in one object
     • Key function: countOverlaps() counts, for each range in one set, how many ranges in another set overlap it
     • Ideal for counting reads in exons or genes
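The counting that countOverlaps() performs can be sketched in plain Python. This is not the GenomicRanges implementation: it is a simple nested loop meant only to show what is being counted, and the function name `count_overlaps` and the exon and read coordinates are made up for illustration. Intervals are assumed to be 1-based and inclusive at both ends, on a single chromosome and ignoring strand.

```python
def count_overlaps(query, subject):
    """For each query interval, count the subject intervals that overlap it.

    Two inclusive intervals overlap when each starts no later than the other ends.
    """
    counts = []
    for q_start, q_end in query:
        n = sum(1 for s_start, s_end in subject
                if s_start <= q_end and s_end >= q_start)
        counts.append(n)
    return counts

exons = [(1, 10), (20, 30), (50, 60)]            # hypothetical exon coordinates
reads = [(5, 12), (8, 25), (28, 35), (40, 45)]   # hypothetical read spans
print(count_overlaps(exons, reads))
```

Note that a read spanning two exons, such as (8, 25) here, is counted once for each exon it touches, which matters when the counts are later summed per gene.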
