
Matúš Kalaš, INF389, CBU, BCCS/UiB, Bergen, Nov 12, 2010



Presentation Transcript


  1. Formats and standards for sequencing data. Matúš Kalaš, INF389, CBU, BCCS/UiB, Bergen, Nov 12, 2010

  2. Read aligners: SHRiMP, Maq, BWA, Bowtie, RMAP, Eland, SOAP, SOAP2, MOSAIK, SOCS, PatMaN, ZOOM, PASS, PerM, RazerS, segemehl, MPSCAN, BFAST, Lastz, BLAT • Sequencing platforms: 454, Solexa/Illumina, SOLiD, … • Results: Genome, Metagenome, Gene annotation, Gene expression, Binding sites, Variation, … • Assemblers: Celera, Newbler, Velvet, Euler, SOAPdenovo, … • Databases: GenBank, EMBL, DDBJ, Genome Catalogue, SNPdb, … • Read archives: NCBI SRA, EMBL-EBI ENA • Your databases

  3. 454 output formats • .sff • .fna • .qual
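A minimal sketch of how a .fna file (bases) and its matching .qual file (Phred scores as space-separated integers) could be merged into Sanger-style FASTQ records. It assumes both files use FASTA-style `>` headers with the read identifier as the first word; the function names are hypothetical and not part of any 454 software.

```python
def parse_fasta_like(text):
    """Yield (read_id, list_of_body_lines) from FASTA-style text."""
    read_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if read_id is not None:
                yield read_id, chunks
            # Assumption: the identifier is the first word after '>'.
            read_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if read_id is not None:
        yield read_id, chunks


def fna_qual_to_fastq(fna_text, qual_text, offset=33):
    """Combine .fna and .qual content into FASTQ text (Phred + offset)."""
    seqs = {rid: "".join(chunks) for rid, chunks in parse_fasta_like(fna_text)}
    records = []
    for rid, chunks in parse_fasta_like(qual_text):
        scores = [int(s) for s in " ".join(chunks).split()]
        seq = seqs[rid]
        if len(seq) != len(scores):
            raise ValueError(f"length mismatch for read {rid}")
        qual = "".join(chr(q + offset) for q in scores)
        records.append(f"@{rid}\n{seq}\n+\n{qual}\n")
    return "".join(records)
```

For real data sets the reads would normally be streamed rather than held in a dictionary; the in-memory version only keeps the sketch short.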

  4. Illumina output formats • .seq.txt • .prb.txt • Illumina FASTQ (ASCII – 64 is Illumina score) • Qseq (ASCII – 64 is Phred score) • Illumina single line format • SCARF

  5. SOLiD output format(s) • CSFASTA
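To illustrate what a CSFASTA read contains, here is a small sketch that translates a color-space read (a primer base followed by the digits 0 to 3) into bases. It relies on the di-base code being expressible as an XOR of 2-bit base codes with A=0, C=1, G=2, T=3; the function name is hypothetical. Note that a single wrong color corrupts every base after it, which is why aligners usually work in color space rather than decoding reads first.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def decode_colorspace(read):
    """Decode a CSFASTA read such as 'T0123' into base space.

    The first character is the known primer base; it is used to start
    the decoding but is not part of the returned sequence.
    """
    previous = BASE_TO_CODE[read[0]]
    bases = []
    for color in read[1:]:
        if color not in "0123":
            # Assumption: missing calls ('.') are not handled in this sketch.
            raise ValueError(f"cannot decode color {color!r}")
        previous ^= int(color)  # next base = previous base XOR color
        bases.append(CODE_TO_BASE[previous])
    return "".join(bases)
```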

  6. Real (“standard”) FASTQ = Sanger FASTQ (ASCII – 33 is Phred score)
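The offsets on the slides (33 for Sanger FASTQ, 64 for the Illumina variants) can be sketched in a few lines. The helper names below are hypothetical. The Solexa-to-Phred conversion uses the commonly cited relation Q_phred = 10·log10(10^(Q_solexa/10) + 1) and only applies to Illumina files that store Illumina (Solexa) scores rather than Phred scores.

```python
import math

SANGER_OFFSET = 33    # "standard" (Sanger) FASTQ
ILLUMINA_OFFSET = 64  # Illumina FASTQ and Qseq


def decode_quality(qual, offset):
    """Turn a quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in qual]


def phred64_to_sanger(qual):
    """Re-encode a Phred+64 quality string as Phred+33."""
    shift = ILLUMINA_OFFSET - SANGER_OFFSET
    return "".join(chr(ord(ch) - shift) for ch in qual)


def solexa_to_phred(q_solexa):
    """Convert an Illumina (Solexa) score to the nearest Phred score."""
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))


def error_probability(q_phred):
    """Probability that the base call is wrong, given a Phred score."""
    return 10 ** (-q_phred / 10)
```

For example, `decode_quality("I5+", SANGER_OFFSET)` and `decode_quality("hTJ", ILLUMINA_OFFSET)` describe the same three scores, which is why the encoding of a FASTQ file has to be known before its qualities are interpreted.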

  7. Example of dealing with diverse read formats: • in Galaxy (http://usegalaxy.org)

  8. If reads should be deposited in a public repository: SRA (Short Read Archive) at NCBI, ENA at EMBL-EBI • SRA format (XML) • SRF format. Or should they be deleted?

  9. Common (“standard”) format for read alignments: • SAM • BAM (= binary SAM)
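Because SAM is tab-separated text, one alignment line can be read with very little code. The sketch below splits a line into the 11 mandatory fields plus optional tags and checks two FLAG bits (0x4 for unmapped, 0x10 for reverse strand). The record type and function names are hypothetical; header lines starting with `@` are assumed to be skipped by the caller, and BAM would need a dedicated library.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line) into a SamRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),   # 1-based leftmost position
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],      # Phred + 33, as in Sanger FASTQ
        tags=fields[11:],     # optional TAG:TYPE:VALUE fields
    )


def is_unmapped(record):
    return bool(record.flag & 0x4)


def is_reverse_strand(record):
    return bool(record.flag & 0x10)
```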

  10. Some common formats for results (genome/gene annotation): • BED format (genome-browser tracks) • GFF format (gene/genome features) • BioXSD (XML) (any annotation; under development)
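A practical pitfall when moving between these two formats is the coordinate system: BED intervals are 0-based and half-open, whereas GFF intervals are 1-based and inclusive. The following sketch converts a six-column BED line into a nine-column GFF line. The `source` and `feature_type` defaults are placeholders chosen for illustration, and the GFF3-style `Name=` attribute is an assumption about which GFF flavour is wanted.

```python
def bed_to_gff(bed_line, source="bed2gff", feature_type="region"):
    """Convert one BED6 line to one GFF line (GFF3-style attributes)."""
    chrom, start, end, name, score, strand = bed_line.rstrip("\n").split("\t")[:6]
    gff_start = int(start) + 1  # 0-based start -> 1-based start
    gff_end = int(end)          # exclusive end == inclusive 1-based end
    columns = [
        chrom, source, feature_type,
        str(gff_start), str(gff_end),
        score, strand,
        ".",                    # phase: not applicable here
        f"Name={name}",
    ]
    return "\t".join(columns)
```

The feature length is unchanged by the conversion: `end - start` in BED equals `end - start + 1` in GFF.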

  11. Deposit genome/metagenome in a public repository: INSDC databases: GenBank, EMBL, DDBJ. Deposit genome/metagenome metadata: • MIGS/MIMS standard by GSC • GCDML format (XML) (under development), following the MIGS/MIMS standard

  12. MIGS: Minimum Information about a Genome Sequence • MIMS: Minimum Information about a Metagenome Sequence/Sample

  13. MIGS/MIMS checklist:

  14. MIGS/MIMS metadata example:

  15. Sequencing experiment metadata: • MINSEQE standard by FGED: Minimum Information about a high-throughput Nucleotide SEQuencing Experiment (under development)

  16. Take-home messages: • Use raw sequencing data when possible • For base-call data, use “standard” FASTQ (Sanger, Phred) • For read alignments, use SAM/BAM format • Use common formats for your results (e.g. GFF or BED format) • Hope for new, generic, extensible standard format(s) • Submit MIGS/MIMS-compliant metadata of genome sequences • Keep an eye on the MINSEQE standard and store your sequencing metadata
