An introduction to CRAM, the reference-based compression format developed by Vadim Zalunin. The slides motivate the need for compression, citing data volumes of 10 petabytes at EMBL-EBI and ~1 petabyte for the SRA, compare BMP, PNG, and JPG file sizes to illustrate lossless and lossy compression, explain what sequencing reads and quality scores are, and show how reference-based encoding and lossy quality scores apply to NGS data. Links to the CRAM toolkit and related publications are given at the end.
CRAM: reference-based compression format developed by Vadim Zalunin
Data horror. EMBL-EBI: 10 petabytes. SRA: ~1 petabyte. Over 2 million DVDs, or 2.5 km. Complete Genomics: 0.5 TB for a single file.
The need for compression Red alert
Compression, what is it? Lossless: BMP, 190 kb; PNG, 100 kb. Lossy: JPG, 21 kb; JPG, 4 kb.
Compression, when we know what to expect. Lossless: BMP, 145 kb; PNG, 2 kb. Lossy: JPG, 6 kb; JPG, 3 kb. But the actual message is only 40 characters (bytes) long!
Compression at its best. "Five little ducks went swimming one day": IMAGE, 145 kb → compress → TEXT, 40 b → uncompress → IMAGE, 145 kb. ~3500 times more efficient.
What are we talking about? Bug (the bug’s DNA is hidden somewhere) → sample → sequencing machines → bunch of huge files.
Looking closer at the data: bunch of huge files. It boils down to a long list of reads: read 1, read 2, read 3, ….., read bazillion. Each read represents a short nucleotide sequence from the genome. Additional information may be attached to it, for example error estimates.
What is a Read? An excerpt from a FASTQ file:
@SRR081241.20758946 (read name)
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG… (read bases)
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI… (read quality scores)
Bases: ACGTN. Quality scores: from ‘!’ (ASCII 33) to ‘~’ (ASCII 126).
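The four-line layout of the excerpt can be read with a few lines of Python. This is a minimal sketch, not a full FASTQ parser: `Read` and `parse_fastq_record` are hypothetical names, and the base and quality strings are only the visible prefix of the truncated excerpt (the elided remainder is not reconstructed).

```python
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    bases: str
    qualities: str


def parse_fastq_record(lines):
    """Split one four-line FASTQ record into name, bases, and quality string."""
    name_line, bases, separator, qualities = [line.strip() for line in lines]
    if not name_line.startswith("@"):
        raise ValueError("read name line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("third line must start with '+'")
    if len(bases) != len(qualities):
        raise ValueError("expected one quality symbol per base")
    return Read(name_line[1:], bases, qualities)


# Visible prefix of the excerpt only; the original strings are longer.
record = [
    "@SRR081241.20758946",
    "CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG",
    "+",
    "IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI",
]
read = parse_fastq_record(record)
print(read.name, len(read.bases))
```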
What is a quality score? The quality score is a Phred quality score encoded as ASCII symbols 33-126. Basically: higher scores are better, so ‘!’ is bad, ‘I’ is good.
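To make the encoding concrete, here is a small Python sketch that decodes quality symbols. It assumes the offset of 33 given above and the standard Phred definition Q = -10 log10(p), where p is the estimated probability that the base call is wrong; the function names are illustrative.

```python
def phred_scores(quality_string, offset=33):
    """Turn ASCII quality symbols into integer Phred scores."""
    return [ord(symbol) - offset for symbol in quality_string]


def error_probability(score):
    """Estimated probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


worst, best = phred_scores("!I")
print(worst, error_probability(worst))  # score 0: the base call is a pure guess
print(best, error_probability(best))    # score 40: about 1 error in 10,000
```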
Reference-based encoding. Each read is described by its read start position and read end position on the reference, plus its mismatching bases.
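The idea can be sketched in Python as follows. This is a toy illustration under simplifying assumptions, not the actual CRAM layout: it handles substitutions only (no insertions, deletions, or clipping), stores a length instead of an end position, and the reference, the read, and the function names are all made up.

```python
def encode_read(reference, read_bases, start):
    """Describe a read as (start, length, mismatches) relative to the reference."""
    window = reference[start:start + len(read_bases)]
    if len(window) != len(read_bases):
        raise ValueError("read runs past the end of the reference")
    mismatches = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read_bases))
        if ref_base != base
    ]
    return start, len(read_bases), mismatches


def decode_read(reference, start, length, mismatches):
    """Rebuild the read bases from the reference and the stored mismatches."""
    bases = list(reference[start:start + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGTACGT"   # made-up reference
read = "TACGAACG"                # made-up read, aligned at 0-based position 3
encoded = encode_read(reference, read, start=3)
print(encoded)
restored = decode_read(reference, *encoded)
```

A read that matches the reference exactly needs only its position and length, which is where the saving comes from.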
Lossy quality scores. Approach 1 (horizontal): Quality scores are usually values from 0 to 39. Let’s shrink them so that they range from 0 to 7. Approach 2 (vertical): Let’s treat quality scores using alignment information. For example: preserve only quality scores for mismatching bases.
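Both approaches can be sketched in a few lines of Python. The slide does not say how the scores are shrunk, so equal-width bins of 5 are an assumption here, and the scores, offsets, and function names are hypothetical.

```python
def bin_quality(score, bin_width=5):
    """Approach 1: map a score in 0..39 to a bin in 0..7 (equal-width bins assumed)."""
    if not 0 <= score <= 39:
        raise ValueError("expected a quality score between 0 and 39")
    return score // bin_width


def keep_mismatch_qualities(scores, mismatch_offsets):
    """Approach 2: keep scores only where the read differs from the reference."""
    return {offset: scores[offset] for offset in sorted(mismatch_offsets)}


scores = [38, 35, 34, 36, 12, 37, 39, 30]    # made-up scores for an 8-base read
binned = [bin_quality(score) for score in scores]
kept = keep_mismatch_qualities(scores, {4})  # the read mismatches at offset 4 only
print(binned)
print(kept)
```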
Comparison study: 1K Genomes exomes. BAM → compress → CRAM → uncompress → BAM. The original BAM and the restored BAM each go through some analysis pipeline, giving original SNPs and restored SNPs to compare.
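One simple way to compare the two call sets is plain set arithmetic, sketched below in Python. The slides do not describe how the comparison was actually done, so this is only an illustration: the SNP tuples (chromosome, position, alternate base) are invented and `compare_snps` is a hypothetical helper.

```python
def compare_snps(original, restored):
    """Split SNP calls into those shared, lost, or gained after the round trip."""
    original, restored = set(original), set(restored)
    return {
        "shared": original & restored,
        "lost": original - restored,
        "gained": restored - original,
    }


# Invented SNP calls, for illustration only.
original_snps = [("chr1", 1005, "A"), ("chr1", 2040, "T"), ("chr2", 311, "G")]
restored_snps = [("chr1", 1005, "A"), ("chr2", 311, "G"), ("chr2", 900, "C")]

result = compare_snps(original_snps, restored_snps)
for label, calls in result.items():
    print(label, sorted(calls))
```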
CRAM NGS data compression. [Chart: bits/base, on an axis running from (bad) to (good), for Untreated, CRAM lossless, CRAM lossy, and CRAM very lossy data; groups: Do nothing, Lossless, Lossy.]
Progressive application of compression. [Figure: axes are sample accessibility (hard to easy) and sample value (low to high); labels: Lossless, 2-fold, 20-fold, 200-fold.]
References
More information:
• http://www.ebi.ac.uk/ena/about/cram_toolkit
Mailing list:
• http://listserver.ebi.ac.uk/mailman/listinfo/cram-dev
Publications:
• Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21 (5), 734-40
• Cochrane G., Cook C.E. and Birney E. (2012) The future of DNA sequence archiving. Gigascience 1