de novo genome assembly using vsmp l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
De Novo Genome Assembly Using vSMP PowerPoint Presentation
Download Presentation
De Novo Genome Assembly Using vSMP

Loading in 2 Seconds...

play fullscreen
1 / 18

De Novo Genome Assembly Using vSMP - PowerPoint PPT Presentation


  • 191 Views
  • Uploaded on

De Novo Genome Assembly Using vSMP. Wayne Pfeiffer (SDSC/UCSD) August 10, 2011. Goal: determine the DNA sequence of the chromosomes that make up a genome. Background Chromosomes consist of two long DNA molecules in a double helix

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'De Novo Genome Assembly Using vSMP' - wren


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
de novo genome assembly using vsmp

De Novo Genome AssemblyUsing vSMP

Wayne Pfeiffer (SDSC/UCSD)

August 10, 2011

goal determine the dna sequence of the chromosomes that make up a genome
Goal: determine the DNA sequence of the chromosomes that make up a genome
  • Background
    • Chromosomes consist of two long DNA molecules in a double helix
    • Humans have 23 pairs of nuclear chromosomes, each 50M to 250M base pairs long
    • This gives 3.1 billion bases in the human haploid genome
    • Humans have 20,000+ genes (which comprise the exome) that are typically a few kbases long, though dystrophin is 2.4 Mbases long
    • High-throughput sequencers output only short reads (typically ~100 bases long) that are paired-end
  • Challenge
    • Assemble a genome or exome sequence from short DNA reads without using a reference sequence (i.e., de novo)
  • Algorithm
    • Generate and analyze a large graph that is stored in shared memory

read

read

conceptual steps in de novo assembly
Conceptual steps in de novo assembly

1. Find reads that overlap by a specified number of bases (the k-mer size)

2. Merge overlapping, “good” reads into longer contigs

3. Link contigs to form scaffolds using paired-end information

Diagrams from Serafim Batzoglou, Stanford

slide4
de Bruijn graph hask-mers as nodes connected by reads;assembly involves finding Eulerian path through graph

Diagram from Michael Schatz, Cold Spring Harbor

velvet soapdenovo are two leading assemblers that use de bruijn graph algorithm
Velvet & SOAPdenovo are two leading assemblersthat use de Bruijn graph algorithm
  • Velvet is from EMBL-EBI
    • Code has two steps: hash & graph
    • Both are parallelized with OpenMP
    • Either step can use more time or memory depending upon problem & computer
    • http://www.ebi.ac.uk/~zerbino/velvet
  • SOAPdenovo is from BGI
    • Code has four steps: pregraph, contig, map, & scaffold
    • pregraph & map are parallelized with Pthreads; others are serial
    • pregraph uses the most time & memory
    • http://soap.genomics.org.cn/soapdenovo.html
  • k-mer size is adjustable parameter
    • Typically it is adjusted to maximize N50 length of scaffolds or contigs
    • N50 length is central measure of distribution weighted by lengths
three central measures of the distribution of sequence lengths can be compared
Three central measures of the distribution ofsequence lengths can be compared
  • Two common measures
    • Mean length: the usual average of the lengths
    • Median length: the length for which half the sequences are shorter & half are longer
  • Less common measure
    • N50 length: the length that splits the total bases in half, after the lengths are ordered
  • Example values for distribution of scaffold lengths for Daphnia genome (from SOAPdenovo with k-mer size=39)
    • Mean length: 627
    • Median length: 200
    • N50 length: 1,718
slide7

We do not get one contig or even one scaffold for each chromosomeof Daphnia genome or for each gene of human exomewhen assembling short reads

Read k-mer Longest Scaf-

Problem Reads length size Contigs contig folds

Daphnia genome: 208M 44, 75 39 684k 24k 351k

Huqhos2

Human exome: 186M 101 73 279k 13k 258k

HT002.s4.dedup

Complete assemblies are difficult to obtain

• Roughly half (!) of the human genome consists of repeatswith peaks around 300 bases & 6 kbases, which are longer than the reads

• However, it helps to have reads that are paired-end and have different(but roughly known) insert lengths (i.e., fragment lengths)

Tabulated results are for

• Daphnia data from Abe Tucker of Indiana & human data from Sam Levy of STSI

• Illumina paired-end reads with k-mer size picked to maximize scaffold N50 length

• SOAPdenovo 1.05

velvet soapdenovo often give similar assemblies though velvet seems better for daphnia genome
Velvet & SOAPdenovo often give similar assemblies,though Velvet seems better for Daphnia genome

SOAP- SOAP-

Velvet denovo Velvet denovo

Read k-mer longest longest scaffold scaffold

Problem length size scaffold scaffold N50 len N50 len

Daphnia genome: 44, 75 31 94k243k 4,017 1,634

Huqhos2 35 99k 283k 4,383 * 1,696

39 95k 225k 4,220 1,718 *

Human exome: 101 63 err 14k err 656

HT002.s4.dedup 67 17k 13k 787 * 692

73 17k 13k 768 702 *

Tabulated results are for

• Daphnia data from Abe Tucker of Indiana & human data from Sam Levy of STSI

• Illumina paired-end reads with * for optimal k-mer size

• Velvet 1.1.03 & SOAPdenovo 1.05

slide9
Velvet runs slower than SOAPdenovo;time roughly increases with GB of reads& decreases with k-mer size

SOAP-

Reads Velvet denovo

Read on disk k-mer time time

Problem Reads length (GB) size (h) (h)

Daphnia genome: 208M 44, 75 28 31 6.42.2

Huqhos2 35 5.0 * 1.6

39 4.4 1.7 *

Human exome: 186M 101 45 63 err 1.7

HT002.s4.dedup 67 2.6 * 1.8

73 2.2 1.5 *

Tabulated results are for

• Daphnia data from Abe Tucker of Indiana & human data from Sam Levy of STSI

• Illumina paired-end reads with * for optimal k-mer size

• One 32-core node of Triton PDAF with 2.5-GHz AMD Shanghais

• Velvet 1.1.03 with 8 threads & SOAPdenovo 1.05 with 16 threads

slide10
Velvet requires more memory than SOAPdenovo;memory roughly increases with GB of reads,but has more complicated dependence on k-mer size

SOAP-

Reads Velvet denovo

Read on disk k-mer memory memory

Problem Reads length (GB) size (GB) (GB)

Daphnia genome: 208M 44, 75 28 31 22118

Huqhos2 35 139 * 33

39 94 33 *

Human exome: 186M 101 45 63 err 59

HT002.s4.dedup 67 124 * 97

73 113 97 *

Tabulated results are for

• Daphnia data from Abe Tucker of Indiana & human data from Sam Levy of STSI

• Illumina paired-end reads with * for optimal k-mer size

• One 32-core node of Triton PDAF with 256 of memory

• Velvet 1.1.03 with 8 threads & SOAPdenovo 1.05 with 16 threads; blue cases plotted in later slides

benchmark tests were run on three computers that allow large shared memory runs
Benchmark tests were run on three computersthat allow large shared-memory runs
  • Triton PDAF from Sun at SDSC
    • 32-core nodes with 2.5-GHz AMD Shanghais
    • 256 & 512 GB of memory per node
  • Dash from Appro at SDSC
    • 8-core nodes with 2.4-GHz Intel Nehalems
    • 48 GB of memory per node
    • vSMP supernode combining 16 regular nodes for larger memory
  • Ember from SGI at NCSA
    • 384-core NUMA nodes with 2.66-GHz Intel Nehalem-EXes
    • 2 TB of memory per NUMA node
    • Highly variable run times
here are some details
Here are some details
  • For vSMP runs, numactl was called before the usual run command to limit threads to specific processors, e.g.,
    • numactl --cpunodebind=0,1 <usual run command>
  • No source code changes were made
  • Memory used was obtained from vmem record in PBS job summary sent by e-mail, e.g.,
    • vmem record: resources_used.vmem=128389148kb
    • Memory used: 128389148 kB / 1.024^2 kB/GB = 122 GB
slide13
Using 8 threads (or cores),graph step of Velvet is fast on Dash with vSMP,but hash step is much slower when vSMP is needed above 48 GB
what is going on
What is going on?
  • Memory access for graph step is fairly regular
    • This is efficient with vSMP
    • Performance has improved significantly over the past year through tuning of vSMP by ScaleMP
  • Memory access for hash step is nearly random
    • This is inefficient with vSMP
    • Changes to Velvet source code are likely needed to obtain better performance
  • Assessments are based on memory profiling by ScaleMP
slide15

For Velvet assemblies of Daphnia genome,Dash with vSMP is fastest from 1 to 8 cores,but its speed drops a lot on 16 cores (i.e., on 2 cores)

slide16
For Velvet assemblies of human exome,Dash with vSMP is much slower than Triton PDAFat all core counts because hash step is slow
slide17

For SOAPdenovo assemblies using 8 threads (or cores),Dash with vSMP is much slower than other computers for human exome; this is similar to behavior observed using Velvet for the same data

here are some take away messages
Here are some take-away messages
  • vSMP performs very well for some Velvet assemblies
    • These are cases where access to large memory is regular
    • Run time is actually faster on Dash with vSMP than on other computers with large shared-memory nodes!
  • vSMP performs less well for other Velvet assemblies & all SOAPdenovo assemblies studied
    • These are cases where access to large memory is random
    • Although run time using vSMP is relatively slow, it provides capability to do analyses that are possible on only a few computers