
High-throughput sequence alignment using Graphics Processing Units

Michael C Schatz, Cole Trapnell, Arthur L Delcher, Amitabh Varshney

UMD

Presented by Steve Rumble

Motivation
  • NGS (next-generation sequencing) technologies produce a ton of data
    • AB SOLiD: 22e6 25-mers
    • Others are even worse…
      • How do 200e6 50-mers sound?
  • Algorithms have been pushed hard, but they typically assume the same workstation CPU
  • Wozniak and others showed that Smith-Waterman (S-W) could be parallelised well on special hardware.
    • What about other algorithms and hardware?
Motivation
  • GPUs have recently gained general-purpose programmability (GPGPU)
  • E.g.: nVidia 8800 GTX
    • 16 multiprocessors
      • 8 processors each
      • => 128 stream processors
    • 768 MB of onboard memory
    • 1.35 GHz clock
    • Almost a year old now…
Short GPU Overview
  • Highly parallel execution (hundreds of simultaneous operations)
  • Hundreds of gigaflops per chip!
  • Large on-board memories (up to 2 GB)
  • Limitations:
    • No recursion (no stacks)
    • All processors within a multiprocessor execute the same instruction
      • Thread divergence due to conditionals hurts…
    • No direct host memory access
    • Small caches (locality is key)
    • High memory latency
    • No dynamic memory allocation (why one would ever do that, I don’t know)
Short GPU Overview
  • GPGPU environments
    • Previously, problems had to be reduced to graphics primitives… no more
    • Simplified C-like programming
      • The paper has very little detail, but the authors make it sound enticingly simple…
    • Each processor runs the same ‘kernel’
Muh-muh-muh… MUMmer!
  • MUM = Maximal Unique Match
  • Find the longest match for each subsequence of a read (of reasonable length)
  • Employs suffix trees
MUMmerGPU
  • Plug-and-play replacement for MUMmer
  • MUMmer is not ‘arithmetic intensive’
    • Is the GPU a good fit?
  • Six-step process
    • 1) Build suffix tree of the reference genome on the host CPU (Ukkonen’s algorithm, O(n))
    • 2) Suffix tree -> GPU memory
    • 3) Queries -> GPU memory
    • 4) Kick off the GPU…
    • 5) Results -> host memory
    • 6) Final processing on the host CPU
Suffix Trees
  • We want to find the longest matching substring of a string (query) quickly
    • Suffix trees permit O(m) string search, where m = length of the query string
    • Space complexity is O(n), where n = length of the indexed string
      • But the constants are apparently pretty big
Suffix Trees
  • Definition (for a string S of length n):
    • Each edge has an edge label
      • A substring of S
      • Non-empty (but may be just the terminating character)
    • A path label is the string formed by concatenating the edge labels from the root to a node
    • 1-1 correspondence between suffixes of S and root-to-leaf path labels
    • Internal nodes have at least 2 children
    • n leaf nodes – one for each suffix of S
Suffix Trees
  • O(n) space
    • n leaf nodes
    • => at most n – 1 internal nodes
    • => at most n + (n – 1) + 1 = 2n nodes, counting the root separately (worst case)
  • Example: n = 3 leaves gives at most n – 1 = 2 internal nodes, so 3 + 2 + root = 6 nodes
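
The definition and the node-count bound can be checked on a small input. The following is a minimal sketch, not the construction used in the paper: it builds a compact suffix tree by inserting suffixes one at a time (quadratic time, whereas Ukkonen's algorithm is linear), and the names `Node`, `build_suffix_tree`, `walk` and `check_suffix_tree` are invented for illustration. It assumes the text ends in a unique terminator such as `$`, and n here counts that terminator.

```python
class Node:
    def __init__(self):
        self.edges = {}     # first character -> (edge label, child node)
        self.suffix = None  # start position of the suffix, set on leaves only


def build_suffix_tree(text):
    """Naive compact suffix tree; text must end in a unique terminator."""
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            first = rest[0]
            if first not in node.edges:
                leaf = Node()
                leaf.suffix = start
                node.edges[first] = (rest, leaf)
                break
            label, child = node.edges[first]
            k = 0  # length of the common prefix of the edge label and rest
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Split the edge: a new internal node takes over the old child.
                mid = Node()
                mid.edges[label[k]] = (label[k:], child)
                node.edges[first] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def walk(node, path=""):
    """Yield (node, path label) for every node in the tree."""
    yield node, path
    for label, child in node.edges.values():
        yield from walk(child, path + label)


def check_suffix_tree(text):
    root = build_suffix_tree(text)
    nodes = list(walk(root))
    leaves = [(n, p) for n, p in nodes if not n.edges]
    internal = [n for n, _ in nodes if n.edges and n is not root]
    n = len(text)
    assert len(leaves) == n                                # one leaf per suffix
    assert all(p == text[l.suffix:] for l, p in leaves)    # path label = suffix
    assert all(len(v.edges) >= 2 for v in internal)        # branching nodes only
    assert len(nodes) <= 2 * n                             # the 2n bound
    return len(leaves), len(internal), len(nodes)


print(check_suffix_tree("TORONTO$"))
```

The asserts mirror the bullets above one by one, so a failure would point at the property that was violated.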

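Suffix Trees
  • Example: TORONTO$
    • ‘$’ is the terminating character

[Figure: suffix tree for S = TORONTO$. Edges from the root: T, O, NTO$ (leaf 4), RONTO$ (leaf 2). Below T: ORONTO$ (leaf 0), O$ (leaf 5). Below O: RONTO$ (leaf 1), NTO$ (leaf 3), $ (leaf 6). Leaf numbers are suffix start positions in S.]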

Suffix Trees
  • Example: TORONTO$
    • Searching for ‘ONT’

[Figure: the TORONTO$ suffix tree from the previous slide, repeated over four animation frames.]

‘ONT’ at position 3 in S
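
The lookup above can be reproduced with a few lines of code. This is a sketch under simplifying assumptions: it uses an uncompressed suffix trie (one node per character, so quadratic space) rather than the compact suffix tree on the slide, and `build_suffix_trie` and `find` are names made up for this example. The search itself still takes one step per pattern character, which is the O(m) behaviour mentioned earlier.

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie; each node records where its path label occurs."""
    root = {"pos": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {"pos": []})
            node["pos"].append(start)  # the path to this node starts at `start`
    return root


def find(root, pattern):
    """Return all start positions of pattern in the indexed text."""
    node = root
    for ch in pattern:  # one step per pattern character
        if ch not in node:
            return []
        node = node[ch]
    return node["pos"]


trie = build_suffix_trie("TORONTO$")
print(find(trie, "ONT"))  # [3]
```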

Suffix Trees
  • MUMmer wants to find all maximal unique matches for all suffixes of the query:
    • E.g., for query ACCGTGCGTC, we want matches for:
      • ACCGTGCGTC
      • CCGTGCGTC
      • CGTGCGTC
      • GTGCGTC
      • …and so on, down to some reasonable minimum length
    • We don’t want to go back to the root of the tree each time…
Suffix Trees
  • Suffix Links
    • All internal, non-root nodes have a suffix link to another node
      • If x is a single character and a is a (possibly empty) string, then the node v whose path label is xa has a suffix link to the node v’ whose path label is a.
      • Got that?
Suffix Trees
  • Example: TORONTO$
    • Suffix links mean we don’t have to backtrack (a bad example, though)

[Figure: the TORONTO$ suffix tree shown earlier.]

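Suffix Trees
  • Example: BANANA$
    • A better example of suffix links

[Figure: suffix tree for BANANA$. Edges from the root: A, NA, BANANA$ (leaf 0). Below A: $ (leaf 5), NA. Below A then NA: $ (leaf 3), NA$ (leaf 1). Below the root's NA: $ (leaf 4), NA$ (leaf 2). Leaf numbers are suffix start positions.]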

Suffix Trees
  • Example: BANANA$
    • Searching for suffixes of ‘ANANA’

[Figure: the BANANA$ suffix tree from the previous slide, repeated over six animation frames.]
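
To make the role of suffix links concrete, here is a small sketch of matching every suffix of a query without restarting at the root. It is an illustration only, not MUMmer's code: it works on an uncompressed suffix trie, where every node has a suffix link and no edge skipping is needed, and it reports only match lengths (no uniqueness or maximality checks). `Node`, `build_suffix_trie` and `match_lengths` are hypothetical names.

```python
from collections import deque


class Node:
    def __init__(self, depth):
        self.children = {}  # character -> child node
        self.link = None    # suffix link: path label xa -> path label a
        self.depth = depth  # length of the path label


def build_suffix_trie(text):
    root = Node(0)
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node.children:
                node.children[ch] = Node(node.depth + 1)
            node = node.children[ch]
    # Assign suffix links breadth-first, so a parent's link is set first.
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for ch, child in node.children.items():
            child.link = root if node is root else node.link.children[ch]
            queue.append(child)
    return root


def match_lengths(root, query):
    """For each i, length of the longest prefix of query[i:] found in the text."""
    lengths = []
    node, end = root, 0  # invariant: query[i:end] is the path label of node
    for i in range(len(query)):
        end = max(end, i)
        while end < len(query) and query[end] in node.children:
            node = node.children[query[end]]
            end += 1
        lengths.append(end - i)
        if node is not root:
            node = node.link  # drop the first character, keep the rest matched
    return lengths


root = build_suffix_trie("BANANA$")
print(match_lengths(root, "ANANA"))  # [5, 4, 3, 2, 1]
```

After the first suffix is matched, each later suffix costs one link hop instead of a fresh walk from the root, which is the point of the animation.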

Memory Limitations
  • Suffix trees take up a fair bit of memory
  • GPUs have hundreds of MBs of memory, but this is still small
  • Divide the target (reference) sequence into k overlapping segments
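
A sketch of the segmenting step, under an assumption the slides do not state: the overlap is taken to be the longest query length minus one, so that any match no longer than a query lies entirely inside at least one segment. `split_with_overlap` and its parameters are names invented here.

```python
def split_with_overlap(reference, k, overlap):
    """Split reference into at most k segments; each extends `overlap` characters
    into the next one. Returns (start offset, segment) pairs."""
    n = len(reference)
    base = -(-n // k)  # ceiling division: core length of each segment
    segments = []
    for i in range(k):
        start = i * base
        if start >= n:
            break
        end = min(n, start + base + overlap)
        segments.append((start, reference[start:end]))
    return segments


max_query_len = 3  # assumed longest query, for illustration
for start, seg in split_with_overlap("ACGTACGTAC", k=3, overlap=max_query_len - 1):
    print(start, seg)
```

Each segment would get its own suffix tree, and match positions are shifted back by the segment's start offset. Matches that fall in an overlap can be reported twice and need de-duplicating on the host.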
Cache Optimisation
  • Memory latency is high, so cache performance is crucial
    • We’re walking a tree here, not crunching numbers down an array
  • Read-only data can be stored in 2D textures; nVidia’s caching scheme optimises access to them
  • Re-order and squish tree nodes into ‘texel blocks’ such that:
    • Nodes near the root are level-ordered (BFS)
    • Nodes further down are stored together with their descendants
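
One way to picture the re-ordering is a hybrid traversal: level order down to some cut-off depth, then depth-first below it so each subtree stays contiguous. The sketch below is an assumption-laden illustration, not the paper's layout: the cut-off `bfs_depth` is a hypothetical parameter, and packing the resulting order into 2D texel blocks is not modelled.

```python
from collections import deque


def hybrid_order(children, root, bfs_depth):
    """Order node ids: BFS above bfs_depth, then DFS preorder per subtree below."""
    order, frontier = [], []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == bfs_depth:
            frontier.append(node)  # root of a subtree to be laid out depth-first
            continue
        order.append(node)
        for child in children.get(node, []):
            queue.append((child, depth + 1))
    for top in frontier:
        stack = [top]
        while stack:
            node = stack.pop()
            order.append(node)
            stack.extend(reversed(children.get(node, [])))
    return order


# A small made-up tree: node id -> list of child ids.
tree = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [7, 8], 5: [9]}
print(hybrid_order(tree, 0, bfs_depth=2))  # [0, 1, 2, 3, 7, 8, 4, 5, 9, 6]
```

The position of a node in the returned list would serve as its address in the flattened node array.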
Cache Optimisation
  • Texture cache is organized in 2x2 blocks.
  • Try to place all children of a node in the same cache block

[Figure: tree with nodes numbered 1 to 29.]

Shamelessly cribbed from:

http://www.cbcb.umd.edu/software/cmatch/FastExactStringMatching.ppt

Cache Optimisation
  • Reference sequence stored in 4 x 2^16 blocks of a 2D array, filled column by column
    • Sequence: A B C D E F G H …

    ……….
    A E
    B F
    C G
    D H
    ……….
    α Φ
    β Χ
    Γ Ψ
    Δ Ω

Why? It worked well.
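
The block layout amounts to a simple index mapping. The sketch below assumes blocks of 4 rows by 2^16 columns filled column by column, as the A to H example suggests. How consecutive blocks are placed within the full texture is not stated on the slide, so the block index is simply returned as a separate number. Function names are invented for this example.

```python
ROWS = 4        # rows per block
COLS = 2 ** 16  # columns per block (assumed from the slide)


def to_block_coords(i):
    """Map a linear sequence position to (block, row, column)."""
    block, within = divmod(i, ROWS * COLS)
    col, row = divmod(within, ROWS)  # column-major inside the block
    return block, row, col


def from_block_coords(block, row, col):
    """Inverse of to_block_coords."""
    return block * ROWS * COLS + col * ROWS + row


for i, ch in enumerate("ABCDEFGH"):
    print(ch, to_block_coords(i))
```

With this mapping, four consecutive characters of the reference share a column, so a short run of reads along the sequence touches neighbouring texels.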

Cache Optimisation
  • Memory layouts were determined heuristically
    • nVidia’s cache details are not public
  • Cache optimisation improves execution speed ‘by several fold’.
Conclusions
  • GPGPU isn’t just good for ‘arithmetic intensive’ applications
  • 5-11x speed-up for NGS data
Conclusions
  • Fine Print:
    • 5-11x is for the suffix tree kernel on the GPU only
    • Reality is different!
    • 3.5x speed-up on real data in terms of total application runtime.
    • Pretty constant across read lengths (35-700+ bp)
  • Careful management of memory layout is crucial
    • Authors claim a several-fold performance increase (could be the difference between some improvement and none)
Conclusions
  • Runtime dominated by serial parts of MUMmer
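
The gap between the kernel speed-up and the whole-application speed-up is what Amdahl's law would predict. The calculation below is my own framing, not something taken from the paper: it asks what fraction of the original runtime would have to sit in the accelerated kernel for a 5x or 11x kernel to yield 3.5x overall. Pairing 3.5x with either kernel figure is an assumption made for illustration, and transfer overheads are ignored.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: total speed-up when only part of the runtime is accelerated."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup)


def implied_fraction(overall, kernel_speedup):
    """Fraction of original runtime in the kernel, given both speed-ups."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / kernel_speedup)


for kernel in (5.0, 11.0):
    f = implied_fraction(3.5, kernel)
    limit = 1.0 / (1.0 - f)  # ceiling even with an infinitely fast kernel
    print(f"kernel {kernel:g}x: fraction {f:.3f}, ceiling {limit:.1f}x")
```

Under these assumptions the kernel would account for roughly 79% to 89% of the original runtime, so the remaining serial work caps the total gain no matter how fast the GPU part becomes.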
Food for Thought
  • 8800 GTX costs ~$400, uses 100-150 watts
  • A Core 2 Quad chip runs ~$250, uses 100-130 watts
  • Each of its cores is approx. 2x faster than the authors’ test CPU
  • MUMmerGPU is at most 3.5x faster than the test CPU
  • What have we won here?
Food for Thought
  • Confusing reports
    • “Fast Exact String Matching on the GPU” (Schatz, Trapnell) claims up to 35x improvement
      • An earlier course paper (early/mid-2007)
    • Why the drop from 35x to 5-11x with MUMmerGPU?
My Impressions…
  • (…whatever they’re worth)
  • GPU is not a clear win (in this case)
    • Suffix trees seem unsuited:
      • Cache locality trouble
      • O(n) footprint, but multiplicative constants are still substantial
    • Host CPUs seem to be as good or better (in $ and watts)
My Impressions…
  • GPGPUs aren’t a great fit here
    • At least for this algorithm…
    • MUMmerGPU isn’t the order-of-magnitude win it claims to be
  • But this is a first-generation, general-purpose chip
    • Geared toward number-crunching, not pointer-traversing
    • I don’t think we’ve seen the last (nor the best) of GPUs…