Challenges in Binary Translation for Desktop Supercomputing

David Kaeli

Rodrigo Dominguez

Department of Electrical and Computer Engineering

Northeastern University

Boston, MA


Current trends in Many-core Computing

  • The CPU industry has elected to jump off the cycle-time scaling bandwagon

    • Power/thermal constraints have become a limiting factor

    • We now see CPU vendors placing multiple (10’s of) cores on a single chip

    • Clock speeds have not changed

    • The memory wall persists and multiple cores that assume a shared-memory model place further pressure on this problem

  • Software vendors are looking for new parallelization technology

    • Multi-core aware operating systems

    • Semi-automatic parallelizing compilers


Current trends in Many-core Computing

  • There has been a renewed interest in parallel computing paradigms and languages

  • Existing many-core architectures are being considered for general-purpose platforms (e.g., Cell, GPUs, DSPs)

  • Heterogeneous systems are becoming a common theme

  • The trend will only accelerate if proper programming frameworks are available to effectively exploit many-core resources


Graphics Processors

  • Graphics Processing Units

    • More than 64% of Americans played a video game in 2009

    • High-end - primarily used for 3-D rendering for videogame graphics and movie animation

    • Mid/low-end – primarily used for computer displays

    • Manufacturers include NVIDIA, AMD/ATI, IBM-Cell

    • Very competitive commodities market


GPU Performance

  • GPUs provide a path for performance growth

  • Cost and power usage numbers are also impressive

Near-exponential growth in performance for GPUs!

Source: NVIDIA 2009


Comparison of CPU and GPU Hardware Architectures

CPU: Cache heavy, focused on individual thread performance

GPU: ALU heavy, massively parallel, throughput-oriented


CPU/GPU Relationship

[Diagram: CPU (host) connected to a GPU with local DRAM (device)]


A wide range of GPU apps

  • Film

  • Financial

  • Languages

  • GIS

  • Holographic cinema

  • Machine learning

  • Mathematics research

  • Military

  • Mine planning

  • Molecular dynamics

  • MRI reconstruction

  • Multispectral imaging

  • N-body simulation

  • Network processing

  • Neural network

  • Oceanographic research

  • Optical inspection

  • Particle physics

  • 3D image analysis

  • Adaptive radiation therapy

  • Acoustics

  • Astronomy

  • Audio

  • Automobile vision

  • Bioinformatics

  • Biological simulation

  • Broadcast

  • Cellular automata

  • Fluid dynamics

  • Computer vision

  • Cryptography

  • CT reconstruction

  • Data mining

  • Digital cinema / projections

  • Electromagnetic simulation

  • Equity trading

  • Protein folding

  • Quantum chemistry

  • Ray tracing

  • Radar

  • Reservoir simulation

  • Robotic vision / AI

  • Robotic surgery

  • Satellite data analysis

  • Seismic imaging

  • Surgery simulation

  • Surveillance

  • Ultrasound

  • Video conferencing

  • Telescope

  • Video

  • Visualization

  • Wireless

  • X-Ray


GPU as a General Purpose Computing Platform

  • Speedups are impressive and ever increasing!

  • Real Time Elimination of Undersampling Artifacts: 2300×

  • Lattice-Boltzmann Method for Numerical Fluid Mechanics: 1840×

  • Genetic Algorithm: 1000×

  • Total Variation Modeling: 2600×

  • Monte Carlo Simulation of Photon Migration: 1000×

  • Stochastic Differential Equations: 675×

  • K-Nearest Neighbor Search: 470×

  • Fast Total Variation for Computer Vision: 1000×

Source: CUDA Zone at www.nvidia.com/cuda/


GPGPU is becoming mainstream research

  • Research activities are expanding significantly

Search result for keyword “GPGPU” in IEEE and ACM


NVIDIA GT200 Architecture

[Diagram: a Streaming Processor Array of 10 Texture Processor Clusters (TPCs); each TPC contains 3 Streaming Multiprocessors (SMs); each SM contains 8 Streaming Processors (SPs), 2 Special Function Units (SFUs), and a Texture Unit. A grid of thread blocks maps onto the array; multiple thread blocks (many warps of threads) map onto an SM; individual threads map onto SPs.]

  • 240 shader cores

  • 1.4B transistors

  • Up to 2GB onboard memory

  • ~150GB/sec BW

  • 1.06 TFLOPS SP

  • CUDA and OpenCL support

  • Programmable memory spaces

  • Tesla S1070 provides 4 GPUs in a 1U unit


AMD/ATI Radeon HD 5870

  • Codename “Cypress” (Evergreen family)

  • 1600 SIMD cores

  • L1/L2 memory architecture

  • 153GB/sec memory bandwidth

  • 2.72 TFLOPS SP

  • OpenCL and DirectX11

  • Hidden memory microarchitecture

  • Provides for vectorized operation




Talk Outline

  • Introduction on GPUs

  • Overview of the tool chains for both CUDA and OpenCL

  • Motivation for pursuing this work

    • Comparing intermediate representations

    • Leveraging/analyzing benefits of Open64 optimization on AMD GPUs

    • Comparing challenges with fundamentally different ISAs (superscalar SIMT versus VLIW SIMT)

  • Discuss PTX and IL

  • Describe new common IR

  • Two examples of PTX->IR->IL binary translation

  • Discuss status of project and future work


GPU Programming Model

  • Single Instruction Multiple Threads (SIMT)

  • Parallelism is implicit

  • Programs (also called kernels or shaders) are generally small and contain nested loops

  • Synchronization is handled explicitly
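The SIMT model above can be pictured as one kernel body run once per thread index. A minimal C++ analogy (ours, for illustration only; on real SIMT hardware the iterations run in parallel rather than in a loop):

```cpp
#include <functional>
#include <vector>

// CPU analogy of the SIMT model: one kernel body, many thread ids. On the
// GPU this loop is implicit -- the hardware launches every threadIdx at once.
void launch(int nThreads, const std::function<void(int)>& kernel) {
    for (int tid = 0; tid < nThreads; ++tid)
        kernel(tid);   // each "thread" sees only its own id
}

// Usage: the per-thread body of the vector-add kernel shown later in the talk.
std::vector<int> vecAddHost(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> c(a.size());
    launch((int)a.size(), [&](int i) { c[i] = a[i] + b[i]; });
    return c;
}
```

Because the loop is implicit on the device, the programmer only writes the body; this is why synchronization, where needed, must be stated explicitly.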


Toolchains

  • Toolchain = compiler + runtime library

  • NVIDIA: C for CUDA / OpenCL → CUDA Runtime → graphics driver → GPU

  • AMD: Brook+ / OpenCL → CAL Runtime → graphics driver → GPU


CUDA Compiler

Compile time: cudafe splits the C for CUDA source into GPU and host code; Open64 compiles the GPU code to ptx*, and the host compiler builds the host code into the executable.

Execution time: the runtime and driver lower the embedded ptx to the GPU binary.

* ptx is included as data in the host application


OpenCL (Dynamic) Compiler

Compile time: the host compiler builds the OpenCL host program into the executable.

Execution time: the OpenCL library compiles kernels through LLVM, and the runtime and driver produce the GPU binary.


Objectives of our work

  • Compare two different IRs from similar massively-threaded architectures

  • Influence future IR design (an active topic in GPGPU research)

  • Leverage/analyze benefits of Open64 optimizations

  • Compare challenges with fundamentally different ISAs: Superscalar/SIMT versus VLIW/SIMT


CUDA Runtime

  • Device Management

    • cudaSetDevice, cudaGetDevice

  • Memory Management

    • Allocation: cudaMalloc, cudaFree

    • Transfer: cudaMemcpy, cudaMemset

  • Execution Control

    • Kernel launch: cudaLaunch

    • Config: cudaConfigureCall

  • Thread Management

    • cudaThreadSynchronize


CUDA Runtime (Vector Add example)

__global__ void vecAdd(int A[ ], int B[ ], int C[ ]) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    int hA[ ] = {…};
    int hB[ ] = {…};
    // dA, dB, dC are device buffers allocated with cudaMalloc (elided on the slide)
    cudaMemcpy(dA, hA, sizeof(hA), HostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), HostToDevice);
    vecAdd<<<1, N>>>(dA, dB, dC);
    cudaMemcpy(hC, dC, sizeof(hC), DeviceToHost);
}

The <<<…>>> launch syntax is lowered by the compiler into runtime calls:

  • cudaConfigureCall

  • cudaSetupArgument

  • cudaLaunch


NVIDIA PTX

  • Low-level IR (close to ISA)

  • Pseudo-assembly style syntax

  • Load-Store instruction set

  • Strongly typed language

    • cvt.s32.u16 %r1, %tid.x;

  • Unlimited virtual registers

  • Predicate registers


AMD IL

  • High-level IR

  • Structured control flow (if-endif, while-end, switch-end)

  • No predication

  • 32-bit registers (4 components) - vectorization


Common PTX and IL instructions

vectorAdd (PTX)

mov.u16 %rh1, %ctaid.x;
mov.u16 %rh2, %ntid.x;
mul.wide.u16 %r1, %rh1, %rh2;
cvt.u32.u16 %r2, %tid.x;
add.u32 %r3, %r2, %r1;
ld.param.s32 %r4, [N];
setp.le.s32 %p1, %r4, %r3;
@%p1 bra $LabelA;
cvt.u64.s32 %rd1, %r3;
mul.lo.u64 %rd2, %rd1, 4;
ld.param.u64 %rd3, [A];
add.u64 %rd4, %rd3, %rd2;
ld.global.f32 %f1, [%rd4+0];
ld.param.u64 %rd5, [B];
add.u64 %rd6, %rd5, %rd2;
ld.global.f32 %f2, [%rd6+0];
add.f32 %f3, %f1, %f2;
ld.param.u64 %rd7, [C];
add.u64 %rd8, %rd7, %rd2;
st.global.f32 [%rd8+0], %f3;
$LabelA:
exit;

  • Data movement (mov)

  • Memory access (ld, st)

  • Arithmetic (mul, add)

  • Conversion (cvt)

  • Comparison and selection (setp)

  • Control flow (bra): uses predication for conditional branch


Common PTX and IL instructions

vectorAdd (IL)

mov r0, vThreadGrpId.x
mov r1, cb0[0].x
imul r2, r0, r1
mov r3, vTidInGrp.x
iadd r4, r3, r2
mov r5, cb1[3]
ige r6, r4, r5
if_logicalz r6
  mov r7, r4
  imul r8, r7, l0
  mov r9, cb1[0]
  iadd r10, r9, r8
  uav_raw_load_id(0) r11, r10
  mov r12, cb1[1]
  iadd r13, r12, r8
  uav_raw_load_id(0) r14, r13
  add r15, r11, r14
  mov r16, cb1[2]
  iadd r17, r16, r8
  uav_raw_store_id(0) mem.xyzw, r17, r15
endif
end

  • Data movement (mov)

  • Memory access (uav_raw)

  • Arithmetic (imul, iadd)

  • No conversion instructions

  • Comparison and Selection (ige)

  • Control Flow (if_logicalz): structured statements
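The predication gap between the two IRs is the crux of the translation: PTX guards a forward branch with a predicate, while IL offers only structured if_logicalz regions. Since "@p bra L" skips to L when p is true, the instructions between the branch and its label run exactly when p is false, which is IL's "if_logicalz p … endif". A toy C++ sketch of that rewrite (our illustration, not the actual translator code), handling a single forward predicated branch at the text level:

```cpp
#include <string>
#include <vector>

// Rewrite one forward predicated branch from PTX-style text into an
// IL-style structured region: "@p bra L;" opens "if_logicalz p", and
// reaching "L:" closes it with "endif". All other lines pass through.
std::vector<std::string> rewriteBranch(const std::vector<std::string>& ptx) {
    std::vector<std::string> il;
    std::string pendingLabel;
    for (const std::string& line : ptx) {
        if (line.rfind("@", 0) == 0) {                 // e.g. "@p1 bra LabelA;"
            std::string pred = line.substr(1, line.find(' ') - 1);
            pendingLabel = line.substr(line.rfind(' ') + 1);
            if (!pendingLabel.empty() && pendingLabel.back() == ';')
                pendingLabel.pop_back();
            il.push_back("if_logicalz " + pred);       // open structured region
        } else if (!pendingLabel.empty() && line == pendingLabel + ":") {
            il.push_back("endif");                     // branch target: close region
            pendingLabel.clear();
        } else {
            il.push_back(line);                        // copy instruction through
        }
    }
    return il;
}
```

A real pass of course works on the IR, and must also invert or re-derive the comparison (setp.le becomes ige in the slides) and handle nested and backward branches.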


Ocelot Framework*

  • Implemented as a CUDA library

  • Intercepts library calls

  • PTX Emulation on the CPU

  • Parses PTX into an internal IR

  • Analysis: CFG, SSA, Data flow, optimizations

  • Our work:

    • IR for IL programs

    • PTX IR -> IL IR translation

    • AMD/CAL Backend

*Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. Modeling GPU-CPU Workloads and Systems. In GPGPU ’10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 31–42, New York, NY, USA, 2010. ACM.


Translation Framework

Compile time: the executable is built as before, with the PTX embedded.

Execution time: Ocelot parses the PTX, runs its analyses, translates it to IL, and drives the CAL back-end through the ATI driver.


IL Control Tree

  • Based on Structural Analysis*

  • Build DFS spanning tree of the control flow graph and traverse in postorder

  • Form regions and collapse the nodes in the CFG

  • Construct the Control Tree in the process

  • Repeat until only 1 node is left in the CFG

*S. Muchnick. Advanced Compiler Design and Implementation, chapter 7.7. Morgan Kaufmann, 1997.
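The collapse loop in the steps above can be sketched in a few lines. This minimal C++ sketch (hypothetical; it recognizes only the if-then region schema, whereas the full algorithm also handles blocks, if-else, loops, and more) repeatedly matches a region pattern in the CFG and absorbs it into a single node:

```cpp
#include <map>
#include <vector>

// Successor lists keyed by node id; node 0 is the entry.
using CFG = std::map<int, std::vector<int>>;

// Match one IF-THEN region -- cond -> {then, join}, then -> {join} --
// and collapse the 'then' node into 'cond', which now acts as the region.
bool collapseOneIfThen(CFG& g) {
    for (auto& [cond, succs] : g) {
        if (succs.size() != 2) continue;
        for (int i = 0; i < 2; ++i) {
            int thenNode = succs[i], join = succs[1 - i];
            auto it = g.find(thenNode);
            if (it != g.end() && it->second == std::vector<int>{join}) {
                g.erase(thenNode);        // absorb 'then' into the region
                g[cond] = {join};         // the region behaves like one node
                return true;
            }
        }
    }
    return false;
}

// Repeat until no region schema matches; returns the remaining node count.
int collapseAll(CFG& g) {
    while (collapseOneIfThen(g)) {}
    return (int)g.size();
}
```

In the real analysis the traversal is a postorder walk of a DFS spanning tree, and each collapse also records an abstract node in the control tree; the process ends when the CFG is a single node.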


IL Control Tree

[Diagram: a control tree in which abstract nodes represent regions. An Entry node leads into the tree; a WHILE node has cond and body children, an IF node has cond, true, and false children, and the leaves are basic blocks (BB).]


Example 1 (if-then)

PTX:

mov.u16
setp.le.s32 p1, r4, r3
@p1 bra LabelA
cvt.u64.s32
LabelA:
exit

[Control tree: Entry → Block { BB: mov…; IF (cond: BB setp…, true: BB cvt…); BB: exit }]


Example 1 (if-then)

IL:

mov
ige r6, r4, r5
if_logicalz r6
  mov
endif
end

[Control tree: same as for the PTX version — Entry → Block { BB: mov…; IF (cond: BB setp…, true: BB cvt…); BB: exit }]


Example 2 (for-loop)

PTX:

mov.u16
setp.le.s32 p1, r5, r3
@p1 bra LabelA
cvt.u64.s32
LabelB:
setp.lt.s32 p2, r4, r5
@p2 bra LabelB
LabelA:
exit

[Control tree: Entry → Block { BB: mov…; IF (cond: BB setp…, true: Block { BB: cvt…; WHILE (cond/body: setp) }); BB: exit }]


Example 2 (for-loop)

IL:

mov
ige r7, r4, r6
if_logicalz r7
  mov
  whileloop
    if_logicalz r17
      break
    endif
  endloop
endif
end

[Control tree: same as for the PTX version.]


Other BT Challenges

  • Pointer arithmetic in CUDA needs to be emulated in CAL

  • Translate Application Binary Interface (ABI), e.g. different calling conventions

  • Architectural bitness: Tesla and Cypress are 32-bit architectures, but Fermi is 64-bit
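For the pointer-arithmetic challenge, one plausible emulation scheme (an illustrative assumption on our part, not necessarily what the translator implements) lowers each CUDA pointer to a resource-id/byte-offset pair, since IL addresses memory as a UAV resource plus an offset, and performs all arithmetic on the offset:

```cpp
#include <cstdint>

// A CUDA pointer lowered for an IL-style target: memory is addressed as a
// UAV resource plus a byte offset, so "pointer arithmetic" becomes integer
// arithmetic on the offset while the resource id rides along unchanged.
struct FatPointer {
    uint32_t resource;   // which UAV the allocation lives in
    uint32_t offset;     // byte offset within that resource
};

// p + n for an element type of 'elemSize' bytes (mirrors C pointer scaling).
FatPointer ptrAdd(FatPointer p, int32_t n, uint32_t elemSize) {
    p.offset += (uint32_t)(n * (int32_t)elemSize);
    return p;
}

// p - q in elements, valid when both point into the same resource.
int32_t ptrDiff(FatPointer p, FatPointer q, uint32_t elemSize) {
    return (int32_t)(p.offset - q.offset) / (int32_t)elemSize;
}
```

The same pair representation also sidesteps part of the bitness mismatch, since the offset width can be chosen per target independently of the host pointer size.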


Project Status

  • Main CUDA library APIs are implemented (cudaMalloc, cudaMemcpy, cudaLaunch, etc.)

  • 3 CUDA applications from the SDK running

  • Code quality comparable to LLVM code generation


Next Steps

  • Enhance translation of the Control Tree to support other IL constructs (e.g., switch-case)

  • Implement other GPGPU abstractions (e.g., shared memory, textures, etc.)

  • Handle PTX predicated instructions (since IL does not support predication directly)


Summary and Future Work

  • GPUs are revolutionizing desktop supercomputing

  • A number of critical applications have been migrated successfully

  • CUDA and OpenCL have made these platforms much more accessible for general purpose computing

  • AMD presently has the highest DP FP performance

  • CUDA presently produces higher performance code for NVIDIA

  • We are developing a platform that leverages the best of both worlds

