Designing Physics Algorithms for GPU Architecture


Takahiro HARADA

AMD


Narrow phase on GPU

  • Narrow phase is parallel

    • How to solve each pair?

  • Design it for a specific architecture


GPU Architecture

  • Radeon HD 5870

    • 2.72 TFLOPS (single), 544 GFLOPS (double), 153.6 GB/s

    • Many cores

      • 20 SIMDs x 64-wide SIMD

      • cf. CPU SSE: 4-wide SIMD

  • The program of a work item is packed into VLIW bundles, then executed (see the sketch after the figure below)

[Figure: Radeon HD 5870 (20 SIMDs) next to a Phenom II X4 (4 cores). 20 (SIMDs) x 16 (thread processors) x 5 (stream cores) = 1600 stream cores.]
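To make the VLIW point concrete, here is a minimal, assumed OpenCL-style sketch (not code from the talk): the five multiplies below are independent of one another, so the compiler can pack them into a single bundle for a 5-wide stream core, whereas a dependent chain would issue one operation per bundle.

__kernel void vliwFriendly(__global const float* a,
                           __global const float* b,
                           __global float* out)
{
    int i = get_global_id(0);
    // Five independent multiplies: candidates for one VLIW5 bundle.
    float x0 = a[5 * i + 0] * b[5 * i + 0];
    float x1 = a[5 * i + 1] * b[5 * i + 1];
    float x2 = a[5 * i + 2] * b[5 * i + 2];
    float x3 = a[5 * i + 3] * b[5 * i + 3];
    float x4 = a[5 * i + 4] * b[5 * i + 4];
    // A dependent chain (x0 * x1 * x2 * ...) could not be packed this way.
    out[i] = x0 + x1 + x2 + x3 + x4;
}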


Memory

  • Register

  • Global memory

    • "Main memory"

    • Large

    • High latency

  • Local Data Store (LDS)

    • Low latency

    • High bandwidth

    • Like a user-managed cache

    • Key to getting high performance (see the sketch after the figure below)

[Figure: GPU memory hierarchy. Global memory (> 1 GB) feeds the 20 SIMDs at 153.6 GB/s; each SIMD has its own 32 KB Local Data Share.]
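As a rough illustration of why the LDS matters (a hedged sketch, not code from this talk), the common OpenCL pattern is to copy data from global memory into __local memory once per work group, synchronize, and then let every work item reuse it at low latency:

// Each work item stages one element into the work group's LDS;
// after the barrier, all work items can read it far more cheaply
// than re-fetching from global memory.
__kernel void ldsStaging(__global const float4* posIn,
                         __global float4* posOut,
                         __local float4* ldsPos)
{
    int lIdx = get_local_id(0);
    int gIdx = get_global_id(0);

    ldsPos[lIdx] = posIn[gIdx];          // one global read per work item
    barrier(CLK_LOCAL_MEM_FENCE);        // LDS is now valid for the whole WG

    // Random accesses into ldsPos[] are cheap; global memory would not be.
    float4 p = ldsPos[(lIdx + 1) % (int)get_local_size(0)];
    posOut[gIdx] = p;
}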


Narrow phase on CPU

void Kernel()
{
    executeX();
    switch (shapeType)
    {
    case A: { executeA(); break; }
    case B: { executeB(); break; }
    case C: { executeC(); break; }
    }
    finish();
}

  • Methods on CPUs (GJK)

    • Handle any convex shape

    • Possible to implement on the GPU

    • But complicated for the GPU

    • Divergence => low use of the ALUs

  • The GPU prefers a simpler algorithm with less logic

  • Why is the GPU not good at complicated logic?

    • Wide SIMD architecture

[Figure: the SIMD lanes split 25% / 25% / 50% across the three cases, so each branch of the switch executes with most lanes masked off.]


Narrow phase on GPU

void Kernel()
{
    prepare();
    collide(p0);
    collide(p1);
    collide(p2);
    collide(p3);
    collide(p4);
    collide(p5);
    collide(p6);
    collide(p7);
}

  • Particles

    • Search for neighboring particles

    • Collide against all of them (sketched below)

    • An accurate shape representation needs

      • Increased resolution

      • An acceleration structure in each shape

        • Increased complexity

      • An exploding number of contacts

      • Etc.

  • Can we make it better but keep it simple?
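A hedged sketch of what the collide() calls above stand for (the neighbor-list layout and the penalty force are assumptions, not the talk's actual code): every work item runs the same branch-light loop over its particle's neighbor candidates.

// One work item per particle: gather neighbor candidates and apply a
// simple penalty force to every overlapping pair. Same code path for all.
__kernel void collideParticles(__global const float4* pos,        // xyz = position
                               __global const int* neighborList,  // MAX_NEIGHBORS entries per particle
                               __global const int* neighborCount,
                               __global float4* force,
                               float radius)
{
    const int MAX_NEIGHBORS = 32;      // assumed fixed-size candidate list
    int i = get_global_id(0);
    float4 p = pos[i];
    float4 f = (float4)(0.0f);

    for (int k = 0; k < neighborCount[i]; k++)   // collide against all candidates
    {
        float4 q = pos[neighborList[i * MAX_NEIGHBORS + k]];
        float3 d = p.xyz - q.xyz;
        float dist = length(d);
        float penetration = 2.0f * radius - dist;
        if (penetration > 0.0f && dist > 0.0f)   // overlapping: push apart
            f.xyz += penetration * (d / dist);
    }
    force[i] = f;
}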


A good approach for GPUs, from the architecture

  • Have to know what the GPU likes

    • Fewer branches

      • Less divergence

    • Use of the LDS on each SIMD

    • Latency hiding

      • Why latency?


Work group (WG), work item (WI)

[Figure: the Radeon HD 5870's 20 SIMDs with work groups mapped onto them. Work Group 0 -> Particle[0-63], Work Group 1 -> Particle[64-127], Work Group 2 -> Particle[128-191]; each work group (64 work items) runs on one SIMD (64 lanes).]
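In OpenCL terms, the mapping in the figure is just the standard indexing below (a sketch assuming a work group size of 64, so that one work group matches one 64-wide SIMD batch):

// Work group g owns particles [64*g .. 64*g + 63]; each work item
// (one SIMD lane) owns exactly one particle.
__kernel void perParticle(__global float4* particles)
{
    int wgIdx = get_group_id(0);            // which work group
    int wiIdx = get_local_id(0);            // lane within the group, 0..63
    int particleIdx = wgIdx * 64 + wiIdx;   // same as get_global_id(0) here

    float4 p = particles[particleIdx];
    // ... per-particle work ...
    particles[particleIdx] = p;
}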


How does the GPU hide latency?

[Figure: four work groups (Work Group 0-3) share one SIMD; while one waits on memory, another runs.]

void Kernel()
{
    readGlobalMem();
    compute();
}

  • Memory access latency

    • Cannot rely on caches

  • The SIMD hides latency by switching WGs

  • The more WGs per SIMD, the better

    • 1 WG/SIMD cannot hide latency

    • Overlap computation with memory requests

  • What determines the # of WGs per SIMD?

    • Local resource usage


Why reduce resource usage?

  • Registers are a limited resource

  • # of WGs per SIMD (worked out after the figure below)

    • SIMD regs / (kernel reg use)

    • SIMD LDS / (kernel LDS use)

    • The smaller of the two limits applies

  • Fewer regs

    • More WGs

    • Better latency hiding

  • Register overflow -> spilled to global memory

[Figure: a SIMD engine with 8 registers. KernelA uses 8 regs -> 1 WG fits; KernelB uses 4 regs -> 2 WGs; KernelC uses 2 regs -> 4 WGs.]
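Written out as the arithmetic behind the figure (a sketch; the numbers are the figure's illustrative values, not real hardware limits): the number of work groups a SIMD can hold is the smaller of the register-limited and LDS-limited counts.

/* How many WGs fit on one SIMD: the resource that runs out first decides. */
int wgsPerSimd(int simdRegs, int kernelRegs, int simdLdsBytes, int kernelLdsBytes)
{
    int byRegs = simdRegs / kernelRegs;          /* e.g. 8 / 8 = 1 (KernelA), 8 / 2 = 4 (KernelC) */
    int byLds  = simdLdsBytes / kernelLdsBytes;  /* e.g. 32768 / 8192 = 4 */
    return (byRegs < byLds) ? byRegs : byLds;    /* more WGs -> better latency hiding */
}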


Preview of the current approach

[Figure: data flows Global Mem -> LDS -> Global Mem; all work items of the WG run the kernel below together.]

void Kernel()
{
    fetchToLDS();
    BARRIER;
    compute();
    BARRIER;
    workTogether();
    BARRIER;
    writeback();
}

  • 1 WG processes 1 pair (sketched below)

    • Reduces resource usage

  • Less branching

    • compute() is branch free

    • No dependencies

  • Use of LDS

    • No global memory access in compute()

    • Random access to the LDS

  • Latency hiding

    • Pair data is per WG, not per WI

  • WIs work together

  • A unified method for all shapes

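Filled in with concrete OpenCL syntax (a structural sketch only: the per-pair data and the math are dummies, not the talk's actual solver), the skeleton above becomes something like:

// One work group per pair: cooperatively stage the pair's data into LDS,
// compute entirely out of LDS, then have the group write one result back.
__kernel void pairKernel(__global const float4* pairDataIn,
                         __global float4* resultOut,
                         __local float4* lds)
{
    int wg = get_group_id(0);                  // the pair index: per WG, not per WI
    int wi = get_local_id(0);
    int n  = (int)get_local_size(0);

    lds[wi] = pairDataIn[wg * n + wi];         // fetchToLDS(): cooperative copy
    barrier(CLK_LOCAL_MEM_FENCE);

    float4 v = lds[(wi + 1) % n];              // compute(): random LDS reads, no branches
    barrier(CLK_LOCAL_MEM_FENCE);              // all reads finish before anyone overwrites
    lds[wi] = 2.0f * v;                        // dummy work
    barrier(CLK_LOCAL_MEM_FENCE);

    if (wi == 0)                               // workTogether() + writeback()
    {
        float4 sum = (float4)(0.0f);
        for (int k = 0; k < n; k++)
            sum += lds[k];
        resultOut[wg] = sum;
    }
}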


Solver


Fusion


Choosing a processor

  • The CPU can do everything

    • But is not as good as the GPU at highly parallel computation

  • The GPU is a very powerful processor

    • But only for parallel computation

  • Real problems have both

  • The GPU is far from the CPU


Fusion

  • The GPU and CPU are close

  • Faster communication between the GPU and CPU

  • Use both the GPU and the CPU

    • Parallel workload -> GPU

    • Serial workload -> CPU

Collision between large and small particles

  • Granularity of computation

    • A large particle collides with more neighbors

    • Inefficient use of the GPU


Q & A

