PDC09-CL03

DirectCompute: Capturing the Teraflop

Chas. Boyd
Architect
Microsoft Corporation

Overview
  • Describing the GPU as a CPU
    • Fundamental principles in familiar terms
  • Problem Set Definition
    • In what cases will I get the Teraflop?
  • How to DirectCompute
    • Step by Step
  • Managing I/O
    • Most codes are I/O bound

Current CPU

  • 4 cores
  • 4-float-wide SIMD
  • 3 GHz
  • 48-96 GFlops
  • 2x HyperThreaded
  • 64 kB L1 cache per core
  • 20 GB/s to memory
  • $200
  • 200 W

[Diagram: four cores (CPU 0-3) sharing an L2 cache]

Current GPU

  • 32 cores
  • 32-float-wide SIMD
  • 1 GHz
  • 1 TeraFlop
  • 32x "HyperThreaded"
  • 64 kB L1 cache per core
  • 150 GB/s to memory
  • $200
  • 200 W

[Diagram: 32 SIMD cores sharing an L2 cache]

Comparison: Current Processors

[Diagram: the four-core CPU and the 32-SIMD-core GPU side by side at the same scale, each with its own L2 cache]

CPU vs GPU

CPU:
  • Low-latency memory
  • Random accesses
  • 20 GB/s bandwidth
  • 0.1 TFlop compute
  • 1 GFlops/watt
  • Well-known programming model

GPU:
  • High-bandwidth memory
  • Sequential accesses
  • 100 GB/s bandwidth
  • 1 TFlop compute
  • 10 GFlops/watt
  • Niche programming model

An Asymmetric Multi-Processor System

[Diagram: the CPU (50 GFlops) reads its 4-6 GB of CPU RAM at 10 GB/s; the GPU (1 TFlop) reads its 1 GB of GPU RAM at 100 GB/s; the link between the two sides runs at roughly 1 GB/s]

GPUs are Data-Parallel Processors
  • GPU has 1000s of simultaneous ALUs
  • Need 100s of 1000s of threads to hit peak
  • Only data elements come in such numbers

GPUs Need Data-Parallel Algorithms
  • Image processing
    • Reduction, Histogram, FFT, Summed Area Table
  • Video processing
    • transcode, effects, analysis
  • Audio
  • Linear Algebra
  • Simulation/Modeling:
    • Technical, Finance, Academic
    • Some Databases

Applications <> Algorithms
  • Most important algorithms have known data-parallel versions
  • Sometimes the algorithm itself is replaced with a data-parallel version:
    • Sorting: Quicksort is swapped for a bitonic sort (sketched below)
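
To make that concrete, here is a minimal sketch of one compare-exchange pass of a bitonic sorting network in HLSL. The buffer, constants, and entry point are hypothetical names, not from this deck; the host dispatches this shader once per stage/substage pair, since threads in different groups cannot synchronize within one dispatch:

    RWStructuredBuffer<uint> gKeys;   // keys to sort; length a power of two

    cbuffer SortParams
    {
        uint gJ;   // current substage distance
        uint gK;   // current stage size; sort direction flips every gK elements
    };

    [numthreads(256, 1, 1)]
    void BitonicStep( uint3 id : SV_DispatchThreadID )   // one thread per element
    {
        uint i  = id.x;
        uint ix = i ^ gJ;                  // partner element for this substage
        if (ix > i)                        // let only one thread of each pair act
        {
            bool ascending = ((i & gK) == 0);
            uint a = gKeys[i];
            uint b = gKeys[ix];
            if ((a > b) == ascending)      // out of order for this direction
            {
                gKeys[i]  = b;             // compare-exchange
                gKeys[ix] = a;
            }
        }
    }

Every thread performs the identical compare-exchange on its own pair of elements, which is exactly the data-parallel shape the GPU wants.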

N-Body Galaxy Simulation

Demo: DirectCompute on DirectX11, AMD HD 5870

The Teraflop Today

N-Body demo app:

AMD Phenom II X4 940 3 GHz + Radeon HD 5850
  • CPU:  13.7 GFlops (multicore SSE, not cache-aware)
  • GPU: 537 GFlops (DirectCompute)

Intel Xeon E5410 2.33 GHz + Radeon HD 5870
  • CPU:  25.5 GFlops (multicore SSE, not cache-aware)
  • GPU: 722 GFlops (DirectCompute)

Microsoft FFT Performance

[Chart: FFT throughput in GFlops vs. Log2(size)]

Component Relationships

  • Applications: media playback or processing, media UI, recognition, technical computing, etc.
  • Domain Languages: Accelerator, Brook+, Rapidmind, Ct
  • Domain Libraries: MKL, ACML, cuFFT, D3DX, etc.
  • Compute Languages: DirectCompute, CUDA, CAL, OpenCL, LRB Native, etc.
  • Processors: CPU, GPU, Larrabee (nVidia, Intel, AMD, S3, etc.)

DirectCompute Adds Client Scenarios
  • Support for multiple vendors
    • All DirectX11 chips will support DirectCompute
    • Some DirectX10 chips already support it
  • Tight integration with rendering
    • Client scenarios involve interactive playback
  • Support media data-types
    • Hardware format conversion for pixel formats
  • Server scenarios still supported

DirectCompute Usage
  • Initialize DirectCompute
  • Create some GPU code in .hlsl
  • Compile it using DirectX compiler
  • Load the code onto the GPU
  • Set up a GPU buffer for input data
    • And set up a view into it for access
  • Make that data view current
  • Execute the code on the GPU
  • Copy the data back to CPU memory

Initialize DirectCompute

    hr = D3D11CreateDevice(
        NULL,                      // default gfx adapter
        D3D_DRIVER_TYPE_HARDWARE,  // use hw
        NULL,                      // not sw rasterizer
        uCreationFlags,            // Debug, Threaded, etc.
        NULL,                      // feature levels
        0,                         // size of above
        D3D11_SDK_VERSION,         // SDK version
        ppDeviceOut,               // D3D Device
        &FeatureLevelOut,          // of actual device
        ppContextOut );            // subunit of device

Example HLSL code

    #define BLOCK_SIZE 256

    StructuredBuffer<float>   gBuf1;
    StructuredBuffer<float>   gBuf2;
    RWStructuredBuffer<float> gBufOut;

    [numthreads(BLOCK_SIZE, 1, 1)]
    void VectorAdd( uint3 id : SV_DispatchThreadID )
    {
        gBufOut[id.x] = gBuf1[id.x] + gBuf2[id.x];
    }

The HLSL Language

  • HLSL is the most widely used language for data-parallel programming
  • Syntax is similar to C/C++
    • Preprocessor defines (#define, #ifdef, etc.)
    • Basic types (float, int, uint, bool, etc.)
    • Operators, variables, functions
  • Has some important differences
    • No pointers
    • Built-in variables & types (float4, matrix, etc.)
    • Intrinsic functions (mul, normalize, etc.)

Compile the HLSL code

    hr = D3DX11CompileFromFile(
        "myCode.hlsl",   // path to .hlsl file
        NULL,            // macro defines
        NULL,            // include handler
        "VectorAdd",     // entry point
        pProfile,        // target, e.g. "cs_5_0"
        0,               // compile flags
        0,               // effect flags
        NULL,            // async pump
        &pBlob,          // compiled shader
        &pErrorBlob,     // error log
        NULL );

Compilation Steps

  • The compiler (fxc, or the compiler library) generates target-specific instructions (IL) from the shader
  • Different instruction sets exist for different generations of hardware
  • Shader IL is highly optimized

Complete Compilation and Send to GPU

    pD3D->CreateComputeShader(
        pBlob->GetBufferPointer(),
        pBlob->GetBufferSize(),
        NULL,
        &pMyShader );   // hw fmt

    pD3D->CSSetShader(
        pMyShader, NULL, 0 );

Setup Buffer Resource for Input Data

    D3D11_BUFFER_DESC desc;
    ZeroMemory( &desc, sizeof(desc) );
    desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
    desc.StructureByteStride = uElementSize;
    desc.ByteWidth = uElementSize * uCount;
    desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;

    pD3D->CreateBuffer( &desc, pInput, ppBuffer );

Resources
  • Resource Objects are used to store data
  • Resource Views are interfaces to the Resource

[Diagram: a Compute Shader accesses a Resource Object ("My Data Buffer") through views, e.g. a Sampler/Shader Resource View for reads and an Unordered Access View for writes]

DirectX Resources
  • Data objects in memory
  • Enable out-of-bounds memory checking
    • Improves security and reliability of shipped code
    • Reads out of bounds return 0
    • Writes out of bounds are no-ops
  • Facilitate interop with Direct3D for display

DirectX Resource Types
  • Buffer
    • Defines an arbitrary data struct for the records in this buffer object
    • Includes structured, raw, and streaming buffers
  • Texture*
    • Storage for data that will be used in pixel tasks
    • Includes 1-D, 2-D, 3-D, Cubes, and arrays thereof

Buffer Resource Types
  • Structured
    • Defines records with a fixed size
    • Pixel data format is not specified, so automatic type/format conversion is not provided
  • Unstructured
    • Can provide type/format conversion
  • Both types support non-order-preserving I/O
    • For use with Append()/Consume() (see the sketch below)
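
As a rough sketch of how these buffer flavors surface on the HLSL side (the struct and variable names here are illustrative, not from the deck):

    struct MyRecord
    {
        float3 pos;
        float  mass;
    };

    StructuredBuffer<MyRecord>       gRecords;   // structured: fixed-size records
    ByteAddressBuffer                gRaw;       // raw: addressed by byte offset
    AppendStructuredBuffer<MyRecord> gStream;    // non-order-preserving output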

Image/Media Resource Types
  • Texture1D, 2D, 3D, Cube, Array
    • An array of pixels in a specified format
      • R8G8B8A8, R32_UINT, R16G16_UINT

Setup a View into the Buffer

    D3D11_UNORDERED_ACCESS_VIEW_DESC desc;
    ZeroMemory( &desc, sizeof(desc) );
    desc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
    desc.Buffer.FirstElement = 0;
    desc.Format = DXGI_FORMAT_UNKNOWN;
    desc.Buffer.NumElements = uCount;

    pD3D->CreateUnorderedAccessView(
        pBuffer,    // buffer the view is onto
        &desc,      // above data
        &pMyUAV );  // result

Resource Views
  • Resource Views define the access mechanism for data stored in Resources (buffers)
  • Support cool features like:
    • Hardware accelerated format conversion
    • Hardware accelerated linear filtering/sampling
  • Can create multiple views onto one resource
  • Enable data polymorphism while providing info to implementation for optimal layout

Unordered Access View (UAV)
  • Enables unordered/random/scattered I/O to the buffer it is created on, via two alternative usage patterns:
  • Indexed operations for I/O
    • myBuffer[index] = x;
    • For a Texture2D resource, the index is a uint2 (see the sketch below)
  • Or non-order-preserving I/O
    • Using the Append()/Consume() intrinsics
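
A minimal sketch of the indexed pattern (resource names are hypothetical):

    RWStructuredBuffer<float> myBuffer;   // indexed with a uint
    RWTexture2D<float4>       myImage;    // indexed with a uint2

    [numthreads(8, 8, 1)]
    void ScatterCS( uint3 id : SV_DispatchThreadID )
    {
        myBuffer[id.x] = 1.0f;                  // scattered write by index
        myImage[id.xy] = float4(1, 0, 0, 1);    // 2-D indexed write
    }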

Non-Order-Preserving I/O
  • For fastest performance when the ordering of records need not be preserved
  • Or when the number of writes is unknown

    Append( ResourceVar, val );

  • A corresponding read operation is provided for completeness

    Consume( ResourceVar, val );

  • Requires the buffer to have the flag enabling this
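
For reference, in the HLSL that shipped with DirectX11, append and consume are expressed as methods on dedicated buffer types rather than the two-argument form above. A minimal sketch (buffer names are hypothetical; the UAVs must be created with the append/counter flag):

    ConsumeStructuredBuffer<float> gIn;    // records read in no particular order
    AppendStructuredBuffer<float>  gOut;   // records written in no particular order

    [numthreads(64, 1, 1)]
    void CompactPositive( uint3 id : SV_DispatchThreadID )
    {
        float v = gIn.Consume();   // pop one record; hardware tracks the count
        if (v > 0.0f)
            gOut.Append(v);        // push one record; output count is data-dependent
    }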

Shader Resource View (SRV)
  • Enables hardware-accelerated filtered sampling of the buffer
  • This hardware is a significant fraction of chip area
  • Excellent for pixel data (images/video)
  • A single pixel format defined per view
  • Read-only operation
    • The same resource cannot be bound to a shader as an SRV and as another view type at the same time
  • Can also load without filtering (see the sketch below)
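
A minimal sketch of filtered sampling from a compute shader (resource names and the output size are assumptions; compute shaders must supply an explicit LOD, hence SampleLevel):

    Texture2D<float4>   gTex;      // bound through an SRV
    SamplerState        gLinear;   // bilinear sampler state
    RWTexture2D<float4> gDst;      // output, bound through a UAV

    [numthreads(8, 8, 1)]
    void ResampleCS( uint3 id : SV_DispatchThreadID )
    {
        float2 uv = (id.xy + 0.5f) / float2(640.0f, 480.0f);  // assumed output size
        gDst[id.xy] = gTex.SampleLevel( gLinear, uv, 0 );     // filtered read
        // gTex.Load( int3(id.xy, 0) ) would fetch a single texel unfiltered
    }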

Implementation Secrets
  • Resources correspond to ranges of memory
  • Views correspond to hardware logic units that perform data transformation on I/O

Graphics vs Compute I/O

[Diagram: the shader ALUs reach GPU memory through texture samplers on the read path (pixel format conversion, bilinear filtering, gamma correction) and through output mergers on the write path (gamma correction, pixel format conversion, framebuffer prefetch); latencies run from ~50 clocks up to ~250 clocks]

Bind the Data, Launch the Work

    pD3D->CSSetUnorderedAccessViews(
        0,         // start slot
        1,         // number of views
        &pMyUAV,   // the view
        NULL );    // initial counts

    pD3D->Dispatch( GrpsX, GrpsY, GrpsZ );

Thread Groups
  • Not all threads in the call can/should share registers with each other
  • Compute threads are structured into subsets, or groups, of threads
  • Thread indices are available to the code (see the sketch below):
    • SV_DispatchThreadID: index of the thread in the call
    • SV_GroupThreadID: index of the thread in its group
    • SV_GroupID: index of the group in the call
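
As a minimal sketch, the three indices arrive as input semantics on the shader entry point (parameter names are illustrative):

    [numthreads(4, 4, 1)]
    void MyCS( uint3 dtid : SV_DispatchThreadID,   // index of thread in the call
               uint3 gtid : SV_GroupThreadID,      // index of thread in its group
               uint3 gid  : SV_GroupID )           // index of group in the call
    {
        // dtid == gid * uint3(4, 4, 1) + gtid
    }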

Thread Groups

    pDev11->Dispatch(3, 2, 1);

    [numthreads(4, 4, 1)]
    void MyCS(…)

[Diagram: Dispatch(3, 2, 1) launches a 3x2x1 grid of thread groups; [numthreads(4, 4, 1)] makes each group a 4x4x1 array of threads, each cell labeled with the thread's SV_GroupThreadID]

Set up Buffer for Transfer to CPU

    D3D11_BUFFER_DESC desc;
    ZeroMemory( &desc, sizeof(desc) );
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
    desc.Usage = D3D11_USAGE_STAGING;
    desc.BindFlags = 0;
    desc.MiscFlags = 0;

    pD3D->CreateBuffer( &desc, NULL, &StagingBuf );

Transfer Results to CPU

    pD3D->CopyResource( StagingBuf, pBuffer );

The staging copy can then be read on the CPU by mapping it with ID3D11DeviceContext::Map().

Temporary Registers (aka General-Purpose Registers)
  • Used for fast local variable storage
  • Built as a block in each SIMD core
    • 16k 32-bit registers per core
  • Registers available per thread depend on the number of threads in the group (group size)
    • E.g. 16k registers / 1024 threads in a group means each thread gets 16 DWORDs
  • Exceeding this limit has perf impacts:
    • Registers may be spilled to memory, or
    • Threads on the core may be cut back (fewer 'HyperThreads')

Groupshared Memory
  • A new storage class for variables
    • groupshared float sfFoo;
  • A whole group of threads can access the same memory
    • Enables uses like a user-controlled cache
  • A maximum of 32 kB can be shared in DirectX11 (see the sketch below)
    • 8k floats or 2k float4s
    • Vs 64 kB of temporary registers
      • 16k floats or 4k float4s
  • Using less is usually faster
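
To make the 32 kB ceiling concrete, a declaration at exactly that limit might look like this (the array name is illustrative):

    // 2048 float4s x 16 bytes each = 32 kB, the DirectX11 groupshared maximum
    groupshared float4 sTile[2048];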

Barrier Intrinsics

GroupMemoryBarrier
DeviceMemoryBarrier
AllMemoryBarrier
  • All I/O ops at the specified scope (group, device, or both) before this point must complete before any subsequent I/O ops

GroupMemoryBarrierWithGroupSync
DeviceMemoryBarrierWithGroupSync
AllMemoryBarrierWithGroupSync
  • Same memory ordering, AND all threads in the group must reach this point before any can continue

Barrier Example

    groupshared float GS[GROUPSIZE];   // declared at global scope

    [numthreads(GROUPSIZE, 1, 1)]
    void Shader( ... )
    {
        ...compute the indices sid and Tid...

        GS[sid] = myBuffer[Tid];   // load my data element

        GroupMemoryBarrierWithGroupSync();

        // process the data in groupshared memory

        GroupMemoryBarrierWithGroupSync();

        outBuffer[Tid] = GS[sid];   // write my data element
    }

Implementation Secrets
  • Thread Group corresponds to a SIMD core
    • 1 of 16-32 on the die
  • Groupshared memory corresponds to a partition of that core’s L1 cache
  • GroupMemoryBarrier() corresponds to a flush of that core’s I/O

Data Parallel I/O
  • I/O with 1600 active threads is not trivial
  • Reads are broadcast, so they should be fast, but:
  • Writes by many threads to one destination can result in serialization
  • Less obvious: even writing to sequential locations serializes on access to the shared address counter (sketched below)
  • This is why DirectCompute provides a rich set of I/O operations and intrinsics
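
To make the address-counter point concrete, here is a sketch of the serializing pattern written with a DirectX11 atomic (names are hypothetical; every thread contends on one counter, which the Append()/Consume() hardware path is designed to avoid):

    RWStructuredBuffer<float> gOut;      // compacted output records
    RWByteAddressBuffer       gCursor;   // a single shared write cursor at offset 0

    [numthreads(64, 1, 1)]
    void WriteCompactCS( uint3 id : SV_DispatchThreadID )
    {
        uint slot;
        gCursor.InterlockedAdd( 0, 1, slot );   // all threads serialize on this add
        gOut[slot] = (float)id.x;               // the write itself then scatters
    }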

Hardware Support
  • The DirectX11 Compute Shader runs on most current DirectX10 and 10.1 (4.x) parts
    • Explicit thread Dispatch()
    • Random-access I/O via resource variables
    • Private-write/shared-read groupshared data
  • New DirectX11-class (5.x) hardware adds
    • Arbitrary accesses to groupshared data
    • Atomic intrinsic operators
    • Hardware format conversion on I/O
    • More streaming I/O methods

OS Support
  • DirectCompute ships in DirectX11
    • DirectX11 is integrated into Windows 7 and Server 2008 R2
  • Also available on Windows Vista SP2 and Windows Server 2008 via the Platform Update
    • http://support.microsoft.com/kb/971644
    • Supports all new hardware features
  • The developer SDK installs on either OS
    • http://msdn.microsoft.com/directx

Call to Action
  • Install the DirectX11 SDK
  • Try out the DirectCompute samples
  • Look for parts of your code that are data parallel
  • Swap in GPU code using DirectCompute
  • Experience Teraflop computing today

Your Feedback Is Important to Us!

Please fill out session evaluation forms online at MicrosoftPDC.com

Learn More On Channel 9
  • Expand your PDC experience through Channel 9
  • Explore videos, hands-on labs, sample code and demos through the new Channel 9 training courses

channel9.msdn.com/learn

Built by Developers for Developers….