IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis




IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis


[Figure: sequencing cost and throughput; chart labels: Gigabases, Cost per Kb. Credit: Lucinda Fulton, The Genome Center at Washington University]


Sequencing Technologies

http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png


Sequence “Space”

  • Roche 454 – Flow space

    • Measure pyrophosphate released by a nucleotide when it is added to a growing DNA chain

    • Flow space describes sequence in terms of these base incorporations

    • http://www.youtube.com/watch?v=bFNjxKHP8Jc

  • AB SOLiD – Color space

    • Sequencing by DNA ligation via synthetic DNA molecules that contain two nested known bases with a fluorescent dye

    • Each base sequenced twice

    • http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related

  • Illumina/Solexa – Base space

    • Single-base extensions of fluorescently labeled nucleotides with protected 3’ OH groups

    • Sequencing via cycles of base addition/detection followed by deprotection of the 3’ OH

    • http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related

  • GenomeTV – Next Generation Sequencing (lecture)

    • http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related

http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html


  • Flexible

    • Good: with rapidly changing data/tech

    • Poor: validation

  • Human Readable

    • Convenient for debugging

    • Computer doesn’t care!


  • Sequences

    • FASTA

    • FASTQ

    • SAM/BAM

  • Alignments

    • SAM/BAM

    • MAF

  • Annotations

    • BED

    • GTF

    • GFF3

    • GVF

    • VCF

http://genome.ucsc.edu/FAQ/FAQformat.html

http://www.sequenceontology.org/


FASTA

FASTQ


FASTQ: Data Format

Sequence data format

  • FASTQ

    • Text based

    • Encodes sequence calls and quality scores with ASCII characters

    • Stores minimal information about the sequence read

    • 4 lines per sequence

      • Line 1: begins with @; followed by sequence identifier and optional description

      • Line 2: the sequence

      • Line 3: begins with “+”, optionally followed by the same sequence identifier and description as line 1

      • Line 4: encoding of quality scores for the sequence in line 2

  • References/Documentation

    • http://maq.sourceforge.net/fastq.shtml

    • Cock et al. (2009). Nuc Acids Res 38:1767-1771.
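The four-line record layout described above can be sketched in Python as follows. This is a minimal reader, assuming every record occupies exactly four lines (no wrapped sequence or quality lines). The names `FastqRecord`, `read_fastq`, and `EXAMPLE_FASTQ` are hypothetical, and the two reads are made up for illustration; they are not taken from the slides or from Cock et al.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading "@"
    sequence: str    # line 2
    quality: str     # line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream laid out as 4 lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (or a blank line)
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up input: two reads, the second repeating its identifier on line 3.
EXAMPLE_FASTQ = (
    "@read1 first demo read\nGATTACA\n+\nIIIIIII\n"
    "@read2\nACGT\n+read2\n!!!!\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE_FASTQ)))
for record in records:
    print(record.identifier, len(record.sequence))
```

Checking that the sequence and quality lines have equal length catches most truncated or corrupted records early.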


FASTQ Example

For analysis, it may be necessary to convert to the Sanger form of FASTQ.

  • FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.



Quality scores

Q = Phred Quality Scores

P = Base-calling error probabilities
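The relationship between Q and P can be sketched in Python using the standard Phred definition, Q = -10 log10(P). The Solexa-style score is assumed here to use the odds form, Q = -10 log10(P / (1 - P)), which is why it can be negative. All function names are hypothetical.

```python
import math


def phred_from_error(p: float) -> float:
    """Phred quality for a base-calling error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def error_from_phred(q: float) -> float:
    """Base-calling error probability for a Phred quality q."""
    return 10.0 ** (-q / 10.0)


def solexa_from_error(p: float) -> float:
    """Solexa-style quality (odds-based), assuming 0 < p < 1."""
    return -10.0 * math.log10(p / (1.0 - p))


for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {error_from_phred(q):g}")
```

Under this definition Q20 corresponds to a 1 in 100 chance of a wrong base call and Q30 to 1 in 1000.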


Quality score encodings differ among platforms

  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
  33                        59   64       73                             104                   126

    • S - Sanger: Phred+33, raw reads typically (0, 40)

    • X - Solexa: Solexa+64, raw reads typically (-5, 40)

    • I - Illumina 1.3+: Phred+64, raw reads typically (0, 40)

    • J - Illumina 1.5+: Phred+64, raw reads typically (3, 40), with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator

    • L - Illumina 1.8+: Phred+33, raw reads typically (0, 41)

Format/Platform   Quality Score Type   ASCII encoding
Sanger            Phred: 0 to 93       33-126
Solexa            Solexa: -5 to 62     64-126
Illumina 1.3      Phred: 0 to 62       64-126
Illumina 1.5      Phred: 0 to 62       64-126
Illumina 1.8      Phred: 0 to 62       33-126 *** Sanger format!

Most analysis tools require Sanger fastq quality score encoding
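As an illustration of what converting to Sanger encoding involves, the sketch below shifts a Phred+64 quality string (Illumina 1.3/1.5 style) to Phred+33. The function names are hypothetical. Solexa+64 data would additionally need the scores themselves converted from Solexa to Phred scale, which this sketch does not attempt.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Turn an ASCII quality string into integer scores for a given offset."""
    return [ord(ch) - offset for ch in quality]


def illumina13_to_sanger(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger).

    Only the ASCII offset changes; the Phred scores stay the same.
    """
    scores = decode_quality(quality, offset=64)
    if min(scores, default=0) < 0:
        raise ValueError("characters below '@' are not valid Phred+64")
    return "".join(chr(q + 33) for q in scores)


# 'h' is ASCII 104 (Phred 40 at offset 64); '@' is ASCII 64 (Phred 0).
print(illumina13_to_sanger("hhhh@"))
```

In practice an established tool or library is the safer route for real data; the point here is only that the conversion is a fixed shift of 31 in the ASCII code.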


http://main.g2.bx.psu.edu/


SAM (Sequence Alignment/Map)

Alignment data format

  • SAM is the output of aligners that map reads to a reference genome

    • Tab delimited w/ header section and alignment section

      • Header lines begin with @ (the header section is optional)

      • Alignment section has 11 mandatory fields

    • BAM is the compressed binary version of SAM

http://samtools.sourceforge.net/


Mandatory Alignment Fields

http://samtools.sourceforge.net/SAM1.pdf
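The 11 mandatory fields can be sketched as a small Python parser for one alignment line. The column names are those used in recent versions of the SAM specification; the helper name `parse_alignment_line` and the example line are illustrative assumptions, not taken from the slides.

```python
from typing import Any, Dict

# Column order of the 11 mandatory SAM alignment fields.
SAM_FIELDS = (
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
)
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_alignment_line(line: str) -> Dict[str, Any]:
    """Split one tab-delimited SAM alignment line into named fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record: Dict[str, Any] = dict(zip(SAM_FIELDS, columns))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = columns[len(SAM_FIELDS):]  # optional TAG:TYPE:VALUE fields
    return record


# Illustrative alignment line ("*" means quality scores are not stored).
EXAMPLE_LINE = "\t".join([
    "r001", "163", "ref", "7", "30", "8M2I4M1D3M",
    "=", "37", "39", "TTAGATAAAGGATACTG", "*",
])

alignment = parse_alignment_line(EXAMPLE_LINE)
print(alignment["QNAME"], alignment["RNAME"], alignment["POS"], alignment["CIGAR"])
```

Header lines (those starting with @) would need to be skipped before calling this helper.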


Alignment Examples

Alignments in SAM format

CIGAR string -> 8M2I4M1D3M

http://samtools.sourceforge.net/SAM1.pdf
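To make the CIGAR string concrete, the following Python sketch splits it into operations and works out how many read bases and reference bases the alignment covers. The function names are hypothetical. The operation letters follow the SAM specification: M, I, S, = and X consume read bases; M, D, N, = and X consume reference bases.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Re-joining must reproduce the input, otherwise something was skipped.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar: str) -> Tuple[int, int]:
    """Return (read bases consumed, reference bases consumed)."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in "MIS=X")
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    return read_len, ref_len


read_len, ref_len = cigar_lengths("8M2I4M1D3M")
print(read_len, ref_len)
```

For 8M2I4M1D3M the read contributes 8 + 2 + 4 + 3 = 17 bases, while the alignment spans 8 + 4 + 1 + 3 = 16 reference positions.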


Annotation Formats

  • Mostly tab-delimited files that describe the location of genome features (e.g., genes)

  • Also used for displaying annotations on standard genome browsers

  • Important for associating alignments with specific genome feature descriptions

  • Knowing format details can be important to translating results!

    • BED is zero-based

    • GTF/GFF are one-based


GTF

Annotation data format

http://useast.ensembl.org/info/website/upload/gff.html


BED format

Annotation data format

chr1    86114265   86116346   nsv433165
chr2    1841774    1846089    nsv433166
chr16   2950446    2955264    nsv433167
chr17   14350387   14351933   nsv433168
chr17   32831694   32832761   nsv433169
chr17   32831694   32832761   nsv433170
chr18   61880550   61881930   nsv433171

chr1    16759829   16778548   chr1:21667704    270866   -
chr1    16763194   16784844   chr1:146691804   407277   +
chr1    16763194   16784844   chr1:144004664   408925   -
chr1    16763194   16779513   chr1:142857141   291416   -
chr1    16763194   16779513   chr1:143522082   293473   -
chr1    16763194   16778548   chr1:146844175   284555   -
chr1    16763194   16778548   chr1:147006260   284948   -
chr1    16763411   16784844   chr1:144747517   405362   +


BED: zero-based, start inclusive, stop exclusive

Length = stop - start

GTF/GFF: one-based, start and stop inclusive

Length = stop - start + 1
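The two coordinate conventions can be sketched in Python as a pair of conversions. The function names are hypothetical and the interval values are for illustration only.

```python
from typing import Tuple


def bed_to_gff(start: int, stop: int) -> Tuple[int, int]:
    """Zero-based, half-open (BED) to one-based, inclusive (GTF/GFF)."""
    return start + 1, stop


def gff_to_bed(start: int, stop: int) -> Tuple[int, int]:
    """One-based, inclusive (GTF/GFF) to zero-based, half-open (BED)."""
    return start - 1, stop


def bed_length(start: int, stop: int) -> int:
    return stop - start


def gff_length(start: int, stop: int) -> int:
    return stop - start + 1


bed_interval = (86114265, 86116346)
gff_interval = bed_to_gff(*bed_interval)
print(gff_interval, bed_length(*bed_interval), gff_length(*gff_interval))
```

Only the start coordinate shifts between the two conventions; the stop value is numerically the same, and the feature length is unchanged.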


GRCh37

NCBI36

