Designing Physics Algorithms for GPU Architecture

Takahiro HARADA, AMD

Presentation Transcript
Narrow phase on GPU
  • Narrow phase is parallel
    • How to solve each pair?
  • Design it for a specific architecture
GPU Architecture
  • Radeon HD 5870
    • 2.72 TFLOPS (single precision), 544 GFLOPS (double precision), 153.6 GB/s memory bandwidth
    • Many cores
      • 20 SIMD engines, each 64-wide
      • cf. CPU SSE: 4-wide SIMD
  • A work item's program is packed into VLIW instructions, then executed

[Figure: Radeon HD 5870 (20 SIMD engines) vs. Phenom II X4 (4 CPU cores)]

20 (SIMDs) × 16 (thread processors) × 5 (stream cores) = 1600 stream cores
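As a check on the peak numbers above (assuming the HD 5870's 850 MHz engine clock, which the slide does not state): 1600 stream cores × 2 FLOPs per multiply-add × 0.85 GHz ≈ 2.72 TFLOPS single precision; double precision runs at one fifth of that rate, giving 544 GFLOPS.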

Memory
  • Registers
  • Global memory
    • “Main memory”
    • Large
    • High latency
  • Local Data Share (LDS)
    • Low latency
    • High bandwidth
    • Acts like a user-managed cache
    • Key to getting high performance (see the sketch below)

[Figure: memory hierarchy. The GPU's 20 SIMD engines read global memory (> 1 GB) at 153.6 GB/s; each SIMD engine has a 32 KB Local Data Share]
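To make the "user-managed cache" idea concrete, here is a minimal OpenCL-style sketch (the kernel name, the 64-item work-group size and the simple reduction it performs are illustrative assumptions, not from the slides): each work item stages one value from global memory into the LDS, a barrier makes the data visible to the whole work group, and every subsequent read hits the LDS instead of global memory.

// Hypothetical kernel: stage data into the LDS (__local) once, then
// read neighbours from the LDS instead of global memory.
__kernel void shareViaLds(__global const float4* pos, __global float4* out)
{
    __local float4 ldsPos[64];          // one slot per work item in the WG

    int lid = get_local_id(0);          // 0..63 within the work group
    int gid = get_global_id(0);

    ldsPos[lid] = pos[gid];             // one global read per work item
    barrier(CLK_LOCAL_MEM_FENCE);       // make LDS contents visible to the WG

    float4 sum = (float4)(0.0f);        // sum the whole work group's data
    for (int i = 0; i < 64; i++)
        sum += ldsPos[i];               // these reads hit the LDS

    out[gid] = sum;
}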

Narrow phase on CPU

[Figure: eight SIMD lanes (0-7) executing the kernel below and diverging at the switch]

void Kernel()
{
    executeX();
    switch( ... )
    {
        case A: { executeA(); break; }
        case B: { executeB(); break; }
        case C: { executeC(); break; }
    }
    finish();
}

  • Methods used on CPUs (e.g., GJK)
    • Handle any convex shape
    • Possible to implement on the GPU
    • But complicated for the GPU
    • Divergence => low ALU utilization
  • GPUs prefer simpler algorithms with less logic
  • Why are GPUs not good at complicated logic?
    • Wide SIMD architecture

[Figure: ALU utilization under divergence, with 25%, 25% and 50% of lanes active in the three branches]
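For example (the lane split is inferred from the utilization figure above), if a 64-wide wavefront reaches the switch with 16 lanes taking case A, 16 taking case B and 32 taking case C, the hardware executes executeA(), executeB() and executeC() one after another with the inactive lanes masked off, so ALU utilization is 25%, 25% and 50% in the three phases, and every lane pays the cost of all three branches.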

Narrow phase on GPU

[Figure: eight SIMD lanes (0-7) all executing the same sequence of collide() calls, with no divergence]

void Kernel()
{
    prepare();
    collide(p0);
    collide(p1);
    collide(p2);
    collide(p3);
    collide(p4);
    collide(p5);
    collide(p6);
    collide(p7);
}

  • Particles
    • Search for neighboring particles
    • Collide with all of them (a minimal test is sketched below)
    • Accurate shape representation needs:
      • Increased resolution
      • An acceleration structure in each shape
        • Increased complexity
      • An exploding number of contacts
      • Etc.
  • Can we make it better but keep it simple?
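A minimal sketch of the per-pair particle test implied here, in OpenCL-style C (the Contact struct, the function name and the single shared radius are illustrative assumptions; neighbour search is omitted): with spherical particles, a collision is just a distance check.

// Hypothetical helper: collide two spherical particles of radius r.
// Positive depth means the two particles overlap.
typedef struct { float4 normal; float depth; } Contact;

Contact collideParticles(float4 pa, float4 pb, float r)
{
    Contact c;
    float4 d = pb - pa;
    d.w = 0.0f;                      // positions are stored in xyz
    float dist = length(d);          // OpenCL built-in vector length
    c.depth = 2.0f * r - dist;       // penetration depth
    if (dist > 0.0f) c.normal = d / dist;
    else             c.normal = (float4)(0.0f, 1.0f, 0.0f, 0.0f);
    return c;
}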
A good approach for GPUs, from the architecture
  • Have to know what the GPU likes:
    • Fewer branches
      • Less divergence
    • Use of the LDS on each SIMD
    • Latency hiding
      • Why does latency matter?
Work group (WG), work item (WI)

[Figure: mapping onto the Radeon HD 5870. Work Group 0 handles Particle[0-63], Work Group 1 handles Particle[64-127], Work Group 2 handles Particle[128-191]; each work group runs on one SIMD engine, with its 64 work items mapped onto the 64 SIMD lanes]
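The mapping in the figure is just index arithmetic; a minimal OpenCL-style sketch (the kernel name and buffer are assumptions), assuming 64 work items per work group as in the figure:

// Hypothetical kernel launched with a local size of 64:
// work group g handles Particle[64*g .. 64*g + 63], one particle per work item.
__kernel void perParticle(__global float4* particle)
{
    int wg  = get_group_id(0);            // which work group (runs on one SIMD)
    int wi  = get_local_id(0);            // which lane/work item, 0..63
    int idx = wg * 64 + wi;               // equivalent to get_global_id(0) here

    particle[idx].w += 1.0f;              // placeholder per-particle work
}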

How does the GPU hide latency?

[Figure: one SIMD engine switching among Work Groups 0-3; while one work group waits on a memory request, another runs the kernel below]

void Kernel()
{
    readGlobalMem();
    compute();
}

  • Memory access latency
    • Cannot rely on caches
  • A SIMD engine hides latency by switching between WGs
  • The more WGs per SIMD, the better
    • 1 WG/SIMD cannot hide latency
    • Overlap computation with memory requests
  • What determines the number of WGs/SIMD?
    • Local resource usage
Why reduce resource usage?
  • Registers are a limited resource
  • # of WGs/SIMD is the smaller of:
    • (SIMD registers) / (registers used by the kernel)
    • (SIMD LDS) / (LDS used by the kernel)
  • Fewer registers
    • More WGs
    • Better latency hiding
  • Register overflow spills to global memory

[Figure: a SIMD engine with 8 registers. Kernel A uses 8 regs and fits 1 WG, Kernel B uses 4 regs and fits 2 WGs, Kernel C uses 2 regs and fits 4 WGs]
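A small host-side C sketch of the arithmetic in the figure above (the 8-register SIMD is the figure's toy number; the 4 KB of LDS per kernel is an invented placeholder, and real limits depend on the hardware and compiler):

#include <stdio.h>

/* Hypothetical per-SIMD resources, mirroring the figure. */
#define SIMD_REGS 8       /* registers available per SIMD (toy number)   */
#define SIMD_LDS  32768   /* LDS bytes per SIMD (32 KB, as on the slide) */

int workGroupsPerSimd(int kernelRegs, int kernelLdsBytes)
{
    int byRegs = SIMD_REGS / kernelRegs;      /* register-limited count */
    int byLds  = SIMD_LDS / kernelLdsBytes;   /* LDS-limited count      */
    return byRegs < byLds ? byRegs : byLds;   /* the tighter limit wins */
}

int main(void)
{
    /* Kernels A, B, C from the figure: 8, 4, 2 registers, same LDS use. */
    printf("KernelA: %d WGs/SIMD\n", workGroupsPerSimd(8, 4096));
    printf("KernelB: %d WGs/SIMD\n", workGroupsPerSimd(4, 4096));
    printf("KernelC: %d WGs/SIMD\n", workGroupsPerSimd(2, 4096));
    return 0;
}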

Preview of Current Approach

[Figure: eight SIMD lanes (0-7) executing the kernel below; data flows from global memory into the LDS and back to global memory]

void Kernel()
{
    fetchToLDS();
    BARRIER;
    compute();
    BARRIER;
    workTogether();
    BARRIER;
    writeback();
}

  • 1 WG processes 1 pair (see the sketch after this list)
    • Reduces resource usage
  • Fewer branches
    • compute() is branch free
    • No dependencies
  • Use of the LDS
    • No global memory access in compute()
    • Random access goes to the LDS
  • Latency hiding
    • Pair data is per WG, not per WI
  • WIs work together
  • Unified method for all shapes

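A minimal OpenCL-style sketch of this kernel structure (the kernel, the buffer names and the 128-entry LDS staging array are illustrative assumptions; the real per-pair computation is not shown on the slides): the work group stages one pair's data into the LDS, synchronizes, computes on the LDS only, and a single work item writes the result back.

// Hypothetical kernel: one work group processes one pair.
// fetchToLDS / compute / workTogether / writeback are placeholder steps.
__kernel void narrowPhasePerPair(__global const float4* pairData,
                                 __global float4* contactsOut,
                                 int elementsPerPair)   // assumed <= 128
{
    __local float4 lds[128];                 // staging area for one pair

    int pair = get_group_id(0);              // one pair per work group
    int lid  = get_local_id(0);              // work items cooperate on it
    int lsz  = get_local_size(0);

    // fetchToLDS(): each work item copies a slice of the pair's data.
    for (int i = lid; i < elementsPerPair; i += lsz)
        lds[i] = pairData[pair * elementsPerPair + i];
    barrier(CLK_LOCAL_MEM_FENCE);

    // compute(): branch-free work that only touches the LDS (placeholder).
    float4 result = lds[lid % elementsPerPair];
    barrier(CLK_LOCAL_MEM_FENCE);

    // workTogether() + writeback(): one work item writes the pair's result.
    if (lid == 0)
        contactsOut[pair] = result;
}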

Choosing a processor
  • The CPU can do everything
    • But it is not as good as the GPU at highly parallel computation
  • The GPU is a very powerful processor
    • But only for parallel computation
  • Real problems contain both kinds of work
  • The GPU is far from the CPU
Fusion
  • GPU and CPU are close
  • Faster communication between GPU and CPU
  • Use both GPU and CPU
    • Parallel workload -> GPU
    • Serial workload -> CPU
Collision between large and small particles

[Figure: eight SIMD lanes (0-7) with uneven amounts of work per lane]

  • Granularity of computation
    • A large particle collides with more particles
    • Inefficient use of the GPU
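For example (numbers invented for illustration), if one work item's large particle overlaps ten times as many particles as those handled by the other lanes in its wavefront, those lanes sit idle while the busy lane finishes, so the whole wavefront takes roughly ten times longer than a balanced one.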