Imgs 2012 bioinformatics workshop file formats for next gen sequence analysis

IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis



[Figure: sequencing cost and throughput. Labels: Cost, Throughput, Gigabases, Cost per Kb. Credit: Lucinda Fulton, The Genome Center at Washington University]



Sequencing Technologies

http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png



Sequence “Space”

  • Roche 454 – Flow space

    • Measures the pyrophosphate released when a nucleotide is added to a growing DNA chain

    • Flow space describes sequence in terms of these base incorporations

    • http://www.youtube.com/watch?v=bFNjxKHP8Jc

  • AB SOLiD – Color space

    • Sequencing by DNA ligation, using synthetic DNA molecules that contain two nested known bases and a fluorescent dye

    • Each base sequenced twice

    • http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related

  • Illumina/Solexa – Base space

    • Single-base extensions of fluorescently labeled nucleotides with protected 3’ OH groups

    • Sequencing via cycles of base addition and detection, followed by deprotection of the 3’ OH

    • http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related

  • GenomeTV – Next Generation Sequencing (lecture)

    • http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related

http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html



Flexible

  • Good: with rapidly changing data/tech

  • Poor: validation

Human Readable

  • Convenient for debugging

  • Computer doesn’t care!



Sequences

  • FASTA

  • FASTQ

  • SAM/BAM

Alignments

  • SAM/BAM

  • MAF

Annotations

  • BED

  • GTF

  • GFF3

  • GVF

  • VCF

http://genome.ucsc.edu/FAQ/FAQformat.html

http://www.sequenceontology.org/



FASTA

FASTQ



FASTQ: Data Format

Sequence data format

  • FASTQ

    • Text based

    • Encodes sequence calls and quality scores with ASCII characters

    • Stores minimal information about the sequence read

    • 4 lines per sequence

      • Line 1: begins with @; followed by sequence identifier and optional description

      • Line 2: the sequence

      • Line 3: begins with “+”, optionally followed by the same sequence identifier and description as line 1

      • Line 4: encoding of quality scores for the sequence in line 2

  • References/Documentation

    • http://maq.sourceforge.net/fastq.shtml

    • Cock et al. (2009). Nuc Acids Res 38:1767-1771.
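The four-line layout described above can be read with a few lines of Python. This is a minimal sketch, assuming every record uses exactly four lines (no wrapped sequences); the names `FastqRecord` and `read_fastq` and the sample read are made up for illustration and do not come from the workshop.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading "@"
    sequence: str    # line 2
    quality: str     # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses exactly 4 lines per read."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:].strip(), sequence, quality)


# Hypothetical record, for illustration only.
example = "@read_1 demo\nGATTACA\n+\nIIIIHHG\n"
for record in read_fastq(io.StringIO(example)):
    print(record.identifier, record.sequence, record.quality)
```

For real data, an established parser (for example the one in Biopython) is likely a safer choice, since it also handles format variants that this sketch ignores.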



FASTQ Example

For analysis, it may be necessary to convert to the Sanger form of FASTQ.

  • FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.





Quality scores

Q = Phred Quality Scores

P = Base-calling error probabilities
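Q and P are linked by the standard Phred definition, Q = -10 · log10(P). The sketch below converts in both directions; the function names are illustrative only.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Each step of 10 in Q is a tenfold drop in error probability.
for q in (10, 20, 30, 40):
    print(f"Q{q}: P = {phred_to_error_prob(q):g}")
```

Under this definition, Q20 corresponds to a 1-in-100 chance of a wrong call and Q30 to 1-in-1000.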



Quality score encodings differ among platforms

  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126

    • S - Sanger: Phred+33, raw reads typically (0, 40)

    • X - Solexa: Solexa+64, raw reads typically (-5, 40)

    • I - Illumina 1.3+: Phred+64, raw reads typically (0, 40)

    • J - Illumina 1.5+: Phred+64, raw reads typically (3, 40), with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator

    • L - Illumina 1.8+: Phred+33, raw reads typically (0, 41)

Format/Platform   Quality Score Type   ASCII encoding
Sanger            Phred: 0 to 93       33-126
Solexa            Solexa: -5 to 62     64-126
Illumina 1.3      Phred: 0 to 62       64-126
Illumina 1.5      Phred: 0 to 62       64-126
Illumina 1.8      Phred: 0 to 62       33-126   *** Sanger format!

Most analysis tools require the Sanger FASTQ quality score encoding
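Converting a Phred+64 quality string (Illumina 1.3 to 1.7) to the Sanger Phred+33 encoding only requires shifting each character by the difference between the two offsets. The following is a minimal sketch with made-up function names. It assumes the input really is Phred+64; older Solexa+64 scores use a different scale and need a numeric conversion that is not shown here.

```python
from typing import List


def phred64_to_phred33(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Sanger Phred+33."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # decode with the Illumina 1.3+ offset
        if score < 0:
            raise ValueError(f"{ch!r} is below the Phred+64 range")
        converted.append(chr(score + 33))  # re-encode with the Sanger offset
    return "".join(converted)


def decode_phred33(quality: str) -> List[int]:
    """Return the numeric Phred scores of a Sanger-encoded quality string."""
    return [ord(ch) - 33 for ch in quality]


sanger = phred64_to_phred33("@Bh")  # Phred scores 0, 2 and 40
print(sanger, decode_phred33(sanger))
```

Because the two encodings overlap in the ASCII table, it is worth checking which one a file uses before converting; shifting an already Sanger-encoded file would silently corrupt the scores or trigger the range error above.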



http://main.g2.bx.psu.edu/



SAM (Sequence Alignment/Map)

Alignment data format

  • SAM is the output of aligners that map reads to a reference genome

    • Tab-delimited, with a header section and an alignment section

      • Header lines begin with @ (the header section is optional)

      • Alignment section has 11 mandatory fields

    • BAM is the binary format of SAM

http://samtools.sourceforge.net/



Mandatory Alignment Fields

http://samtools.sourceforge.net/SAM1.pdf
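The 11 mandatory fields are, in order, QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL (field names as in the current SAM specification). A minimal sketch of splitting one alignment line into those fields follows; the `SamAlignment` type, the `parse_sam_line` helper and the sample line are hypothetical and only illustrate the layout.

```python
from typing import NamedTuple, Optional, Tuple


class SamAlignment(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # Phred+33 encoded base qualities
    optional: Tuple[str, ...]  # any TAG:TYPE:VALUE fields after column 11


def parse_sam_line(line: str) -> Optional[SamAlignment]:
    """Parse one SAM line; header lines (starting with '@') return None."""
    if line.startswith("@"):
        return None
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamAlignment(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                        f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]))


# Made-up alignment line, for illustration only.
line = "read_1\t0\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIHHG\tNM:i:0"
print(parse_sam_line(line))
```

For BAM files, or for anything beyond inspection of a few lines, tools such as samtools are the usual route, since BAM is binary and cannot be split on tabs.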



Alignment Examples

Alignments in SAM format

CIGAR string -> 8M2I4M1D3M

http://samtools.sourceforge.net/SAM1.pdf
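A CIGAR string such as 8M2I4M1D3M is a run of length/operation pairs: 8 aligned bases, 2 inserted bases, 4 aligned, 1 deleted from the read, 3 aligned. The sketch below splits a CIGAR string into those pairs and reports how many read bases and how many reference bases it covers; the helper names are illustrative, and the operation letters follow the SAM specification.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches characters the pattern skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar: str) -> Tuple[int, int]:
    """Return (read bases consumed, reference bases consumed)."""
    query = ref = 0
    for n, op in parse_cigar(cigar):
        if op in "MIS=X":   # operations that consume read bases
            query += n
        if op in "MDN=X":   # operations that consume reference bases
            ref += n
    return query, ref


print(parse_cigar("8M2I4M1D3M"))
print(cigar_lengths("8M2I4M1D3M"))  # (17, 16)
```

For the slide's example, the read is 17 bases long (8+2+4+3) and spans 16 reference positions (8+4+1+3).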



Annotation Formats

  • Mostly tab-delimited files that describe the location of genome features (e.g., genes)

  • Also used for displaying annotations on standard genome browsers

  • Important for associating alignments with specific genome features and their descriptions

  • Knowing format details can be important to translating results!

    • BED is zero-based

    • GTF/GFF are one-based



GTF

Annotation data format

http://useast.ensembl.org/info/website/upload/gff.html



BED format

Annotation data format

chr1   86114265  86116346  nsv433165
chr2   1841774   1846089   nsv433166
chr16  2950446   2955264   nsv433167
chr17  14350387  14351933  nsv433168
chr17  32831694  32832761  nsv433169
chr17  32831694  32832761  nsv433170
chr18  61880550  61881930  nsv433171

chr1  16759829  16778548  chr1:21667704   270866  -
chr1  16763194  16784844  chr1:146691804  407277  +
chr1  16763194  16784844  chr1:144004664  408925  -
chr1  16763194  16779513  chr1:142857141  291416  -
chr1  16763194  16779513  chr1:143522082  293473  -
chr1  16763194  16778548  chr1:146844175  284555  -
chr1  16763194  16778548  chr1:147006260  284948  -
chr1  16763411  16784844  chr1:144747517  405362  +



BED: zero-based, start inclusive, stop exclusive

  • Length = stop - start

GTF/GFF: one-based, start and stop inclusive

  • Length = stop - start + 1
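Because the two conventions differ only in how the start is counted, converting an interval between them means shifting the start by one and leaving the stop alone. The following sketch uses illustrative function names and the first interval from the BED example (chr1 86114265 86116346).

```python
from typing import Tuple


def bed_to_gff(start: int, stop: int) -> Tuple[int, int]:
    """Zero-based, half-open (BED) -> one-based, inclusive (GTF/GFF)."""
    return start + 1, stop


def gff_to_bed(start: int, stop: int) -> Tuple[int, int]:
    """One-based, inclusive (GTF/GFF) -> zero-based, half-open (BED)."""
    return start - 1, stop


def bed_length(start: int, stop: int) -> int:
    return stop - start


def gff_length(start: int, stop: int) -> int:
    return stop - start + 1


bed = (86114265, 86116346)
gff = bed_to_gff(*bed)
print(gff, bed_length(*bed), gff_length(*gff))  # same feature, same length
```

The feature length is 2081 in both systems; only the written start coordinate changes, which is why mixing the conventions produces off-by-one errors rather than obviously wrong results.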



GRCh37

NCBI36

