High throughput sequence alignment using graphics processing units

High-throughput sequence alignment using Graphics Processing Units

Michael C Schatz, Cole Trapnell, Arthur L Delcher, Amitabh Varshney

UMD

Presented by Steve Rumble


Motivation

  • NGS technologies produce a ton of data

    • AB SOLiD: 22e6 25-mers

    • Others are even worse…

      • How does 200e6 50-mers sound?

  • Algorithms have been pushed hard, but typically assume same workstation CPU

  • Wozniak and others showed Smith-Waterman (S-W) could be parallelised well on special-purpose hardware.

    • What of other algorithms/hardware?


Motivation

  • GPUs have recently evolved general purpose programmability (GPGPU)

  • E.g.: nVidia 8800 GTX

    • 16 multiprocessors

      • 8 processors each

      • => 128 stream processors

    • 768MB onboard

    • 1.35GHz clock

    • Almost a year old now…


Short GPU Overview

  • Highly parallel execution (hundreds of simultaneous operations)

  • Hundreds of gigaflops per chip!

  • Large on-board memories (up to 2GB)

  • Limitations:

    • No recursion (no stacks)

    • Each multiprocessor’s constituent processors execute same instruction

      • Thread Divergence due to conditionals hurts…

    • No direct host memory access

    • Small caches (locality is key)

    • High memory latency

    • No dynamic memory allocation (why one would ever do that, I don’t know)


Short GPU Overview

  • GPGPU environments

    • Previously had to reduce problems to graphics primitives… no more

    • Simplified C-like programming

      • Paper has very little detail, but they make it sound enticingly simple…

    • Each processor runs the same ‘kernel’


Muh-muh-muh… MUMmer!

  • Maximal Unique Match

  • Find longest match for each subsequence of a read (of reasonable length)

  • Employs Suffix Trees


MUMmerGPU

  • Plug-and-play replacement for MUMmer

  • MUMmer is not ‘arithmetic intensive’

    • Is the GPU a good fit?

  • Six-step process

    • 1) Build Suffix Tree of reference genome (Ukkonen’s alg. – O(n)) on host CPU

    • 2) Suffix Tree -> GPU Memory

    • 3) Queries -> GPU Memory

    • 4) Kick off the GPU…

    • 5) Results -> Host Memory

    • 6) Final processing on Host CPU


Suffix Trees

  • We want to find the longest match for a query string quickly

    • Suffix Trees permit O(m) string search, m = query length

    • Space complexity is O(n), n = length of the indexed string

      • But constants are apparently pretty big


Suffix Trees

  • Definition:

    • Each edge has an edge label

      • A substring of S

      • Non-empty (but can be just the terminating character)

    • A path label is the string formed by concatenating the edge labels from the root to a node

    • 1-1 correspondence between suffixes of S and root-to-leaf path labels

    • Internal nodes have at least 2 children

    • n leaf nodes – one for each suffix of S


Suffix Trees

  • O(n) space

    • n leaf nodes

    • => at most n – 1 internal nodes

    • => n + (n – 1) + 1 = 2n nodes (worst case)

  • Example from the figure: n = 3 leaves, n – 1 = 2 internal nodes, 3 + 2 + root = 6 nodes


Suffix Trees

  • Example: TORONTO$

    • ‘$’ is terminating character

[Figure: suffix tree for TORONTO$ with edge labels T, O, NTO$, RONTO$, ORONTO$, O$ and $; leaves numbered 0 to 6 by suffix start position]


Suffix Trees

  • Example: TORONTO$

    • Searching for ‘ONT’

[Figure: suffix tree for TORONTO$, leaves numbered 0 to 6 by suffix start position]


Suffix Trees

  • Example: TORONTO$

    • Searching for ‘ONT’

[Figure: suffix tree for TORONTO$, leaves numbered 0 to 6 by suffix start position]

‘ONT’ at position 3 in S
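
The lookup traced above can be reproduced with a small sketch. It builds the tree by inserting one suffix at a time, which is quadratic and is not the linear-time Ukkonen construction mentioned earlier; `Node`, `build_suffix_tree` and `find` are illustrative names, not taken from MUMmer.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character of edge label -> (edge label, child Node)
        self.start = None   # suffix start position, set on leaves only


def build_suffix_tree(s):
    """Naive O(n^2) construction: insert each suffix, splitting edges as needed."""
    assert s and s.count(s[-1]) == 1, "last character must be a unique terminator"
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            entry = node.children.get(rest[0])
            if entry is None:
                # No edge starts with this character: hang a new leaf here.
                leaf = Node()
                leaf.start = i
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = entry
            # Length of the common prefix of the edge label and the remaining suffix.
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # Whole edge matched: descend and keep going.
                node, rest = child, rest[k:]
                continue
            # Mismatch inside the edge: split it at offset k.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[rest[0]] = (label[:k], mid)
            node, rest = mid, rest[k:]
    return root


def find(root, pattern):
    """Return the sorted start positions of pattern, walking O(len(pattern)) characters."""
    node, rest = root, pattern
    while rest:
        entry = node.children.get(rest[0])
        if entry is None:
            return []
        label, child = entry
        k = min(len(label), len(rest))
        if label[:k] != rest[:k]:
            return []
        node, rest = child, rest[k:]
    # Every leaf below the node we stopped at is one occurrence.
    hits, stack = [], [node]
    while stack:
        n = stack.pop()
        if n.start is not None:
            hits.append(n.start)
        stack.extend(c for _, c in n.children.values())
    return sorted(hits)


tree = build_suffix_tree("TORONTO$")
print(find(tree, "ONT"))  # [3]
```

The walk itself touches only as many characters as the pattern has; collecting the leaves afterwards costs extra time proportional to the number of occurrences.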


Suffix Trees

  • MUMmer wants to find all maximal unique matches for all suffixes:

    • E.g., for query ACCGTGCGTC, we want:

      • ACCGTGCGTC

      • CCGTGCGTC

      • CGTGCGTC

      • GTGCGTC

      • Up to some reasonable limit…

    • Don’t want to go back to root of tree each time…
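
The list of query suffixes above can be generated as follows. This is a minimal sketch; `min_len` stands in for the "reasonable limit" on the slide and is a hypothetical parameter, not a MUMmer option.

```python
def query_suffixes(query, min_len):
    """Yield (offset, suffix) for every suffix of query at least min_len long."""
    for i in range(len(query) - min_len + 1):
        yield i, query[i:]


for offset, suffix in query_suffixes("ACCGTGCGTC", 7):
    print(offset, suffix)
```

Matching each of these suffixes from the root would repeat most of the work done for the previous one, which is the cost the last bullet wants to avoid.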


Suffix Trees

  • Suffix Links

    • All internal, non-root nodes have a suffix link to another node

      • If x is a single character and a is a (possibly empty) string, then the node v whose path-label is xa has a suffix link to the node v’ whose path-label is a, i.e. the same string with its first character dropped.

      • Got that?
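
A small sketch may make the definition concrete. It relies on a standard property rather than on anything in the paper: when the string ends in a unique terminator, the internal nodes of its suffix tree are exactly the substrings that are followed by at least two different characters. The function names are illustrative, and the brute-force enumeration is only meant for tiny inputs.

```python
from collections import defaultdict


def internal_node_labels(s):
    """Path labels of the internal nodes of the suffix tree of s (the root is '')."""
    followers = defaultdict(set)
    for i in range(len(s)):
        for j in range(i, len(s)):
            followers[s[i:j]].add(s[j])  # s[j] is a character that follows s[i:j]
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s):
    """Map each non-root internal node's path label to its suffix-link target."""
    labels = internal_node_labels(s)
    return {w: w[1:] for w in labels if w}  # drop the first character


links = suffix_links("BANANA$")
for source in sorted(links, key=len, reverse=True):
    print(repr(source), "->", repr(links[source]))
```

For BANANA$ this gives ANA -> NA, NA -> A and A -> root (the empty label). After matching a query suffix down to ANA, the search for the next suffix can resume at NA instead of restarting at the root.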


Suffix Trees

  • Example: TORONTO$

    • Suffix Links… Don’t backtrack (bad ex.)

[Figure: suffix tree for TORONTO$, leaves numbered 0 to 6 by suffix start position]


Suffix Trees

  • Example: BANANA$

    • Better example of Suffix Links

[Figure: suffix tree for BANANA$ with edge labels A, NA, BANANA$, NA$ and $; leaves numbered 0 to 5 by suffix start position]


Suffix Trees

  • Example: BANANA$

    • Searching for suffixes of ‘ANANA’

[Figure: suffix tree for BANANA$, leaves numbered 0 to 5 by suffix start position]


Memory Limitations

  • Suffix trees take up a fair bit of memory

  • GPUs have 100’s of MBs, but this is still small

  • Divide the target sequence into ‘k’ segments with overlaps
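
The segmenting step can be sketched like this. The slides do not say how large the overlap is; the sketch assumes it is chosen as the longest match of interest minus one, so that any such match starting in one segment's core also ends inside that segment. `split_reference` and its parameters are hypothetical names.

```python
def split_reference(ref, k, overlap):
    """Cut ref into at most k segments; each is extended by `overlap` characters
    into its right-hand neighbour. Returns (start offset, segment text) pairs."""
    step = -(-len(ref) // k)  # ceiling division: core length of each segment
    segments = []
    for start in range(0, len(ref), step):
        end = min(len(ref), start + step + overlap)
        segments.append((start, ref[start:end]))
    return segments


for start, text in split_reference("ACGTACGTAC", 2, 3):
    print(start, text)
# 0 ACGTACGT
# 5 CGTAC
```

Each segment gets its own suffix tree, and match positions found in a segment are shifted by its start offset to recover coordinates in the full sequence.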


Cache Optimisation

  • Memory latency high, cache performance crucial

    • We’re walking a tree here, not crunching numbers down an array

  • Can store read-only data in 2D textures; nVidia caching scheme optimises access

  • Re-order and squish tree nodes into ‘texel blocks’ such that:

    • Nodes near root are level-ordered (BFS)

    • Nodes further down are ordered with descendants
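
One way to read the two ordering rules is sketched below: the top levels are emitted breadth-first, then each remaining subtree is emitted contiguously so that a node sits near its descendants. This only illustrates the idea; the packing into texel blocks is not described in the slides, and `layout_order` and `bfs_depth` are assumed names.

```python
from collections import deque


def layout_order(children, root, bfs_depth):
    """Order nodes level by level down to bfs_depth, then subtree by subtree."""
    order, frontier = [], []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= bfs_depth:
            frontier.append(node)  # root of a subtree to be laid out contiguously
            continue
        order.append(node)
        for child in children.get(node, []):
            queue.append((child, depth + 1))
    for sub_root in frontier:
        stack = [sub_root]
        while stack:  # depth-first preorder keeps descendants next to each other
            node = stack.pop()
            order.append(node)
            stack.extend(reversed(children.get(node, [])))
    return order


children = {1: [2, 3], 2: [4, 5], 3: [6, 7], 4: [8, 9], 6: [10]}
print(layout_order(children, 1, 2))  # [1, 2, 3, 4, 8, 9, 5, 6, 10, 7]
```

Nodes near the root are visited by almost every query, so keeping them together helps all threads; deeper down, a single walk stays within one subtree, so contiguous subtrees keep that walk local.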


Cache Optimisation

  • Texture cache organized in 2x2 blocks.

  • Try to place all children of a node in the same cache block

[Figure: diagram with nodes numbered 1 to 29]

Shamelessly cribbed from:

http://www.cbcb.umd.edu/software/cmatch/FastExactStringMatching.ppt


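Cache Optimisation

  • Reference Sequence stored in 4 x 2^16 blocks of a 2D array

    • Sequence: A B C D E F G H …

[Figure: the sequence is written column by column into blocks 4 rows high: A B C D fill the first column, E F G H the second, and so on; a later block holds α β Γ Δ and Φ Χ Ψ Ω in the same way]

Why? It worked well.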
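
The column-by-column layout in the figure amounts to a simple index mapping, sketched here. `cols_per_block` is a hypothetical parameter set to 2 so the output matches the figure; the real blocks are far wider.

```python
ROWS = 4  # rows per block, as in the figure


def to_2d(i, cols_per_block):
    """Map sequence index i to (block, row, col) in column-major 4-row blocks."""
    block, offset = divmod(i, ROWS * cols_per_block)
    col, row = divmod(offset, ROWS)
    return block, row, col


def render_block(seq, cols_per_block):
    """Return the rows of the first block as strings, for inspection."""
    rows = [""] * ROWS
    for i, ch in enumerate(seq[:ROWS * cols_per_block]):
        _, row, _ = to_2d(i, cols_per_block)
        rows[row] += ch
    return rows


print(render_block("ABCDEFGH", 2))  # ['AE', 'BF', 'CG', 'DH']
```

Consecutive characters of the sequence end up vertically adjacent, so a run of neighbouring characters spans both dimensions of the texture instead of a single long row.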


Cache Optimisation

  • Memory layouts heuristically determined

    • nVidia cache details not public

  • Cache optimisation improves execution speed ‘by several fold’.


Conclusions

  • GPGPU isn’t just good for ‘arithmetic intensive’ applications

  • 5-11x speed-up for NGS data


Conclusions

  • Fine Print:

    • 5-11x is for the Suffix Tree kernel on the GPU

    • Reality is different!

    • 3.5x speed-up for real data in terms of total application runtime.

    • Pretty constant across read lengths (35-700+ bp)

  • Careful management of memory layout is crucial

    • Authors claim several-fold performance increase (could be difference between some improvement and none)


Conclusions

  • Runtime dominated by serial parts of MUMmer
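
Amdahl's law gives a rough feel for the gap between the kernel speed-up and the whole-application speed-up. The sketch below is an illustration under simplifying assumptions, not a result from the paper: it treats everything outside the kernel as unaccelerated, ignores host-to-GPU transfer time, and assumes the 5-11x and 3.5x figures describe comparable runs.

```python
def overall_speedup(p, s):
    """Amdahl's law: a fraction p of the runtime is accelerated s-fold."""
    return 1.0 / ((1.0 - p) + p / s)


def accelerated_fraction(overall, s):
    """Invert Amdahl's law: the fraction p that yields `overall` at kernel speed-up s."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / s)


for s in (5, 11):
    p = accelerated_fraction(3.5, s)
    print(f"kernel {s}x -> accelerated fraction {p:.2f}, "
          f"ceiling with an infinitely fast kernel {1.0 / (1.0 - p):.1f}x")
```

Under these assumptions the kernel would account for roughly 79% to 89% of the original runtime, and the remaining serial work caps the total gain well below the kernel figure however fast the GPU becomes.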


Food for Thought

  • 8800 GTX costs ~$400, uses 100-150 watts

  • Quad Core 2 chip runs ~$250, uses 100-130 watts

  • Each core approx. 2x faster than their test CPU

  • MUMmerGPU maximally 3.5x faster than test CPU

  • What have we won here?


Food for Thought

  • Confusing reports

    • “Fast Exact String Matching on the GPU” (Schatz, Trapnell) claims up to 35x improvement

      • Earlier course paper (early/mid-2007)

    • Why from 35x down to 5-11x with MUMmerGPU?


My Impressions…

  • (…whatever they’re worth)

  • GPU is not a clear win (in this case)

    • Suffix trees seem unsuited:

      • Cache locality trouble

      • O(n) footprint, but multiplicative constants are still substantial

    • Host CPUs seem to be as good or better (in $ and watts)


My Impressions…

  • GPGPUs aren’t a great fit here

    • At least for this algorithm…

    • MUMmerGPU isn’t the order-of-magnitude win it claims to be

  • But this is a first-generation, general-purpose chip

    • geared toward number-crunching, not pointer-traversing

    • I don’t think we’ve seen the last (nor the best) of GPUs…

