
CUDA Lecture 10: Architectural Considerations

Prepared 10/11/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.



Objective

  • To understand the major factors that dictate performance when using a GPU as a compute accelerator for the CPU

    • The feeds and speeds of the traditional CPU world

    • The feeds and speeds when employing a GPU

    • To form a solid knowledge base for performance programming in modern GPUs

  • Knowing yesterday, today, and tomorrow

    • The PC world is becoming flatter

    • Outsourcing of computation is becoming easier…

Architectural Considerations – Slide 2



Outline

  • Topic 1 (next): The GPU as Part of the PC Architecture

  • Topic 2: Threading Hardware in the G80

  • Topic 3: Memory Hardware in the G80

Architectural Considerations – Slide 3



Recall: Typical Structure of a CUDA Program

  • Global variables declaration

  • Function prototypes

    • __global__ void kernelOne(…)

  • main()

    • allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)

    • transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)

    • execution configuration setup

    • kernel call – kernelOne<<<execution configuration>>>( args… );

    • transfer results from device to host – cudaMemcpy(h_GlblVarPtr,…)

    • optional: compare against golden (host computed) solution

  • Kernel – void kernelOne(type args,…)

    • variables declaration - __local__, __shared__

      • automatic variables transparently assigned to registers or local memory

    • __syncthreads()…

(repeat as needed)
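A minimal sketch of this outline follows (the kernel name kernelOne and the pointer names come from the slides; the data size and the per-element doubling are hypothetical placeholders):

    __global__ void kernelOne(float *d_GlblVarPtr, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // automatic variables live in registers
        if (i < n) d_GlblVarPtr[i] *= 2.0f;              // hypothetical per-element work
    }

    int main(void)
    {
        const int n = 1024;
        size_t bytes = n * sizeof(float);
        float *h_GlblVarPtr = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) h_GlblVarPtr[i] = (float)i;

        float *d_GlblVarPtr;
        cudaMalloc((void **)&d_GlblVarPtr, bytes);                              // allocate device memory
        cudaMemcpy(d_GlblVarPtr, h_GlblVarPtr, bytes, cudaMemcpyHostToDevice);  // host-to-device transfer

        dim3 block(256), grid((n + block.x - 1) / block.x);                     // execution configuration
        kernelOne<<<grid, block>>>(d_GlblVarPtr, n);                            // kernel call

        cudaMemcpy(h_GlblVarPtr, d_GlblVarPtr, bytes, cudaMemcpyDeviceToHost);  // device-to-host transfer
        cudaFree(d_GlblVarPtr);                                                 // optionally compare against a
        free(h_GlblVarPtr);                                                     // host-computed golden result here
        return 0;
    }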

Architectural Considerations – Slide 4



Bandwidth: Gravity of Modern Computer Systems

  • The bandwidth between key components ultimately dictates system performance

    • Especially true for massively parallel systems processing massive amounts of data

    • Tricks like buffering, reordering, caching can temporarily defy the rules in some cases

    • Ultimately, performance falls back to what the “speeds and feeds” dictate

Architectural Considerations – Slide 5



Classic PC Architecture

  • Northbridge connects three components that must communicate at high speed

    • CPU, DRAM, video

    • Video also needs to have first-class access to DRAM

    • Previous NVIDIA cards connected to AGP, with up to 2 GB/sec transfers

  • Southbridge serves as a concentrator for slower I/O devices


Architectural Considerations – Slide 6



(Original) PCI Bus Specification

  • Connected to the Southbridge

    • Originally 33 MHz, 32-bit wide, 132 MB/sec peak transfer rate; more recently 66 MHz, 64-bit, 512 MB/sec peak

    • Upstream bandwidth remains slow for devices (256 MB/sec peak)

    • Shared bus with arbitration

      • Winner of arbitration becomes bus master and can connect to CPU or DRAM through the southbridge and northbridge

Architectural Considerations – Slide 7



PCI as Memory Mapped I/O

  • PCI device registers are mapped into the CPU’s physical address space

    • Accessed through loads/stores (kernel mode)

  • Addresses assigned to the PCI devices at boot time

    • All devices listen for their addresses

Architectural Considerations – Slide 8



PCI Express (PCIe)

  • Switched, point-to-point connection

    • Each card has a dedicated “link” to the central switch, no bus arbitration.

    • Packet-switched messages form virtual channels

    • Prioritized packets for quality of service, e.g., real-time video streaming

Architectural Considerations – Slide 9



PCIe Links and Lanes

  • Each link consists of one or more lanes

    • Each lane is 1-bit wide (4 wires, each 2-wire pair can transmit 2.5 Gb/sec in one direction)

      • Upstream and downstream now simultaneous and symmetric

    • Each link can combine 1, 2, 4, 8, 12, or 16 lanes – x1, x2, etc.

Architectural Considerations – Slide 10



PCIe Links and Lanes (cont.)

  • Each link consists of one or more lanes

    • Each data byte is 8b/10b encoded into 10 bits with an equal number of 1’s and 0’s; net data rate 2 Gb/sec per lane each way.

    • Thus, the net data rates are 250 MB/sec (x1), 500 MB/sec (x2), 1 GB/sec (x4), 2 GB/sec (x8), 4 GB/sec (x16), each way
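As a quick sanity check of these figures, a small back-of-the-envelope calculation (plain C; the rates are taken from the bullets above):

    #include <stdio.h>

    int main(void)
    {
        double raw_gbps  = 2.5;                      // raw signaling rate per lane, each way
        double net_gbps  = raw_gbps * 8.0 / 10.0;    // 8b/10b encoding: 8 data bits per 10 line bits
        double lane_MBps = net_gbps * 1000.0 / 8.0;  // 2 Gb/sec = 250 MB/sec per lane, each way
        int lanes[] = {1, 2, 4, 8, 16};
        for (int i = 0; i < 5; ++i)
            printf("x%-2d link: %.0f MB/sec each way\n", lanes[i], lane_MBps * lanes[i]);
        return 0;
    }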

Architectural Considerations – Slide 11



PCIe PC Architecture

  • PCIe forms the interconnect backbone

    • Northbridge/Southbridge are both PCIe switches

    • Some Southbridge designs have built-in PCI-PCIe bridge to allow old PCI cards

    • Some PCIe cards are PCI cards with a PCI-PCIe bridge

Architectural Considerations – Slide 12



Today’s Intel PC Architecture: Single Core System

  • FSB connection between processor and Northbridge (82925X)

    • Memory control hub

  • Northbridge handles “primary” PCIe to video/GPU and DRAM.

    • PCIe x16 bandwidth at 8 GB/sec (4 GB/sec each direction)

  • Southbridge (ICH6RW) handles other peripherals

Architectural Considerations – Slide 13



Today’s Intel PC Architecture: Dual Core System

  • Bensley platform

    • Blackford Memory Control Hub (MCH) is now a PCIe switch that integrates the Northbridge/Southbridge (NB/SB) functions

    • FBD (Fully Buffered DIMMs) allow simultaneous R/W transfers at 10.5 GB/sec per DIMM

    • PCIe links form backbone

Source: http://www.2cpu.com/review.php?id=109

Architectural Considerations – Slide 14



Today’s Intel PC Architecture: Dual Core System (cont.)

  • Bensley platform

    • PCIe device upstream bandwidth now equal to downstream

    • Workstation version has x16 GPU link via the Greencreek MCH

Source: http://www.2cpu.com/review.php?id=109

Architectural Considerations – Slide 15



Today’s Intel PC Architecture: Dual Core System (cont.)

  • Two CPU sockets

    • Dual Independent Bus to CPUs, each is basically an FSB

    • CPU feeds at 8.5–10.5 GB/sec per socket

    • Compared to the 6.4 GB/sec CPU feed of the current front-side bus

  • PCIe bridges to legacy I/O devices

Source: http://www.2cpu.com/review.php?id=109

Architectural Considerations – Slide 16



Today’s AMD PC Architecture

  • AMD HyperTransport™ Technology bus replaces the Front-side Bus architecture

  • HyperTransport™ similarities to PCIe:

    • Packet based, switching network

    • Dedicated links for both directions

  • Shown in a 4-socket configuration, 8 GB/sec per link

Architectural Considerations – Slide 17



Today’s AMD PC Architecture (cont.)

  • Northbridge/HyperTransport™ is on die

  • Glueless logic

    • to DDR, DDR2 memory

    • PCI-X/PCIe bridges (usually implemented in Southbridge)

Architectural Considerations – Slide 18



Today’s AMD PC Architecture (cont.)

  • “Torrenza” technology

    • Allows licensing of coherent HyperTransport™ to 3rd party manufacturers to make socket-compatible accelerators/co-processors

Architectural Considerations – Slide 19



Today’s AMD PC Architecture (cont.)

  • “Torrenza” technology

    • Allows 3rd party PPUs (Physics Processing Unit), GPUs, and co-processors to access main system memory directly and coherently

Architectural Considerations – Slide 20



Today’s AMD PC Architecture (cont.)

  • “Torrenza” technology

    • Could make the accelerator programming model easier to use than, say, the Cell processor, where each SPE cannot directly access main memory.

Architectural Considerations – Slide 21



HyperTransport™ Feeds and Speeds

  • Primarily a low latency direct chip-to-chip interconnect, supports mapping to board-to-board interconnect such as PCIe

Courtesy HyperTransport™ Consortium

Source: “White Paper: AMD HyperTransport™ Technology-Based System Architecture”

Architectural Considerations – Slide 22



HyperTransport™ Feeds and Speeds (cont.)

  • HyperTransport™ 1.0 Specification

    • 800 MHz max, 12.8 GB/s aggregate bandwidth (6.4 GB/s each way)

Courtesy HyperTransport™ Consortium

Source: “White Paper: AMD HyperTransport™ Technology-Based System Architecture”

Architectural Considerations – Slide 23



HyperTransport™ Feeds and Speeds (cont.)

  • HyperTransport™ 2.0 Specification

    • Added PCIe mapping

    • 1.0 - 1.4 GHz Clock, 22.4 GB/s aggregate bandwidth (11.2 GB/s each way)

Courtesy HyperTransport™ Consortium

Source: “White Paper: AMD HyperTransport™ Technology-Based System Architecture”

Architectural Considerations – Slide 24



HyperTransport™ Feeds and Speeds (cont.)

  • HyperTransport™ 3.0 Specification

    • 1.8 – 2.6 GHz clock, 41.6 GB/s aggregate bandwidth (20.8 GB/s each way)

    • Added AC coupling to extend HyperTransport™ over long distances as a system-to-system interconnect

Courtesy HyperTransport™ Consortium

Source: “White Paper: AMD HyperTransport™ Technology-Based System Architecture”

Architectural Considerations – Slide 25



GeForce 7800 GTX Board Details

  • Board features: SLI connector, single-slot cooling, sVideo TV out, 2 × DVI, 16x PCI-Express interface

  • 256 MB / 256-bit DDR3 memory

    • 600 MHz

    • 8 pieces of 8M×32

Architectural Considerations – Slide 26



Topic 2: Threading in G80

  • Single-Program Multiple-Data (SPMD)

  • A CUDA application is an integrated CPU + GPU C program

    • Serial C code executes on CPU

    • Parallel Kernel C code executes on GPU thread blocks

Architectural Considerations – Slide 27




SPMD (cont.)

(Figure: execution alternates between host and device — CPU serial code, then Grid 0 running the GPU parallel kernel KernelA<<< nBlk, nTid >>>(args);, then more CPU serial code, then Grid 1 running KernelB<<< nBlk, nTid >>>(args); ….)

Architectural Considerations – Slide 28



Grids and Blocks

  • A kernel is executed as a grid of thread blocks

    • All threads share global memory space

Architectural Considerations – Slide 29



Grids and Blocks (cont.)

  • A thread block is a batch of threads that can cooperate with each other by:

    • Synchronizing their execution using a barrier

    • Efficiently sharing data through a low latency shared memory

    • Two threads from two different blocks cannot cooperate
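A minimal sketch of this intra-block cooperation (hypothetical kernel; the block size is assumed to be 256 threads):

    __global__ void reverseInBlock(float *d_out, const float *d_in)
    {
        __shared__ float tile[256];              // low-latency, per-block shared memory
        int t = threadIdx.x;
        int i = blockIdx.x * blockDim.x + t;

        tile[t] = d_in[i];                       // each thread deposits one element
        __syncthreads();                         // barrier: all deposits visible to the whole block

        d_out[i] = tile[blockDim.x - 1 - t];     // read a value written by another thread in the block
    }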

Architectural Considerations – Slide 30



CUDA Thread Block: Review

  • Programmer declares (Thread) Block:

    • Block size 1 to 512 concurrent threads

    • Block shape 1D, 2D, or 3D

    • Block dimensions in threads

(Figure: a CUDA thread block — threads with IDs 0, 1, 2, 3, …, m all running the same thread program. Courtesy: John Nickolls, NVIDIA)

Architectural Considerations – Slide 31



CUDA Thread Block: Review (cont.)

  • All threads in a block execute the same thread program

  • Threads share data and synchronize while doing their share of the work

  • Threads have thread id numbers within the block

  • Thread program uses thread id to select work and address shared data

(Figure: a CUDA thread block — threads with IDs 0, 1, 2, 3, …, m all running the same thread program. Courtesy: John Nickolls, NVIDIA)

Architectural Considerations – Slide 32



GeForce-8 Series Hardware Overview

(Figure: the Streaming Processor Array is an array of Texture Processor Clusters (TPCs); each TPC pairs a TEX unit with two Streaming Multiprocessors (SMs); each SM contains an instruction L1, a data L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs.)

Architectural Considerations – Slide 33



CUDA Processor Terminology

  • SPA: Streaming Processor Array (variable across the GeForce 8 series; 8 in the GeForce 8800)

  • TPC: Texture Processor Cluster (2 SM + TEX)

  • SM: Streaming Multiprocessor (8 SP)

    • Multi-threaded processor core

    • Fundamental processing unit for CUDA thread block

  • SP: Streaming Processor

    • Scalar ALU for a single CUDA thread

Architectural Considerations – Slide 34



Streaming Multiprocessor

  • Streaming Multiprocessor (SM)

    • 8 Streaming Processors (SP)

    • 2 Special Function Units (SFUs)

  • Multi-threaded instruction dispatch

    • 1 to 512 threads active

    • Shared instruction fetch per 32 threads

    • Cover latency of texture/memory loads


Architectural Considerations – Slide 35



Streaming Multiprocessor (cont.)

  • 20+ GFLOPS

  • 16 KB shared memory

  • texture and global memory access


Architectural Considerations – Slide 36



(Figure: G80 in graphics mode — host, input assembler, setup/raster/ZCull, vertex/geometry/pixel thread issue, and a thread processor array of SPs with texture filter (TF) units and L1 caches, backed by L2 caches and frame buffer (FB) partitions.)

G80 Thread Computing Pipeline

  • The future of GPUs is programmable processing

  • So – build the architecture around the processor

Architectural Considerations – Slide 37



(Figure: G80 in compute mode — host, input assembler, and thread execution manager feeding an array of SPs; each group of SPs has a parallel data cache and texture unit, with load/store paths to global memory.)

G80 Thread Computing Pipeline (cont.)

  • Processors execute computing threads

  • Alternative operating mode specifically for computing

  • The thread execution manager generates thread grids based on kernel calls

Architectural Considerations – Slide 38



(Figure: the host launches Kernel 1 on Grid 1, a 3×2 array of thread blocks, and Kernel 2 on Grid 2; Block (1,1) expands into a 5×3 array of threads.)

Thread Life Cycle in Hardware

  • Grid is launched on the streaming processor array (SPA)

  • Thread blocks are serially distributed to all the streaming multiprocessors (SMs)

    • Potentially >1 thread block per SM

  • Each SM launches warps of threads

    • 2 levels of parallelism

Architectural Considerations – Slide 39




Thread Life Cycle in Hardware (cont.)

  • SM schedules and executes warps that are ready to run

  • As warps and thread blocks complete, resources are freed

    • SPA can distribute more thread blocks

Architectural Considerations – Slide 40



(Figure: thread blocks assigned to two SMs, SM 0 and SM 1; each SM has an MT issue unit, SPs, and shared memory and runs threads t0 t1 t2 … tm; the SMs share a texture unit (TF), texture L1, L2, and memory.)

Streaming Multiprocessor Executes Blocks


  • Threads are assigned to SMs in block granularity

    • Up to 8 blocks to each SM as resource allows

    • SM in G80 can take up to 768 threads

      • Could be 256 (threads/block) × 3 blocks

      • Or 128 (threads/block) × 6 blocks, etc.


Architectural Considerations – Slide 41




Streaming Multiprocessor Executes Blocks (cont.)


  • Threads run concurrently

    • SM assigns/maintains thread id numbers

    • SM manages/schedules thread execution


Architectural Considerations – Slide 42




Thread Scheduling/Execution

  • Each thread block is divided into 32-thread warps

    • This is an implementation decision, not part of the CUDA programming model

  • Warps are scheduling units in SM


Architectural Considerations – Slide 43




Thread Scheduling/Execution (cont.)

  • If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in an SM?

    • Each block is divided into 256/32 = 8 warps

    • There are 8 × 3 = 24 warps

    • At any point in time, only one of the 24 warps will be selected for instruction fetch and execution.


Architectural Considerations – Slide 44



(Figure: the SM multithreaded warp scheduler issuing, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96, ….)

Streaming Multiprocessor Warp Scheduling

  • SM hardware implements zero-overhead warp scheduling

    • Warps whose next instruction has its operands ready for consumption are eligible for execution

    • Eligible warps are selected for execution on a prioritized scheduling policy

    • All threads in a warp execute the same instruction when selected


Architectural Considerations – Slide 45




Streaming Multiprocessor Warp Scheduling (cont.)

  • Four clock cycles needed to dispatch the same instruction for all threads in a warp in G80

    • If one global memory access is needed for every 4 instructions, a minimum of 13 warps are needed to fully tolerate 200-cycle memory latency


Architectural Considerations – Slide 46



SM Instruction Buffer: Warp Scheduling

  • Fetch one warp instruction/cycle

    • from instruction L1 cache

    • into any instruction buffer slot

  • Issue one “ready-to-go” warp instruction/cycle

    • from any warp’s instruction buffer slot

    • operand scoreboarding used to prevent hazards

  • Issue selection based on round-robin/age of warp

  • SM broadcasts the same instruction to 32 threads of a warp

(Figure: SM datapath — L1 instruction cache (I$), multithreaded instruction buffer, register file (RF), L1 constant cache (C$), shared memory, operand select, and the MAD and SFU execution units.)

Architectural Considerations – Slide 47



Scoreboarding

  • All register operands of all instructions in the instruction buffer are scoreboarded

    • instruction becomes ready after the needed values are deposited

    • prevents hazards

    • cleared instructions are eligible for issue

Architectural Considerations – Slide 48



Scoreboarding (cont.)

  • Decoupled memory/processor pipelines

    • any thread can continue to issue instructions until scoreboarding prevents issue

    • allows memory/processor ops to proceed in shadow of other waiting memory/processor ops

Architectural Considerations – Slide 49



Granularity Considerations

  • For Matrix Multiplication, should I use 4×4, 8×8, 16×16 or 32×32 tiles?

    • For 4×4, we have 16 threads per block.

    • Since each SM can take up to 768 threads, the thread capacity allows 48 blocks.

    • However, each SM can only take up to 8 blocks, thus there will be only 128 threads in each SM!

      • There are 8 warps but each warp is only half full.

Architectural Considerations – Slide 50



Granularity Considerations (cont.)

  • For Matrix Multiplication, should I use 4×4, 8×8, 16×16 or 32×32 tiles?

    • For 8×8, we have 64 threads per block.

    • Since each SM can take up to 768 threads, it could take up to 12 blocks.

    • However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!

      • There are 16 warps available for scheduling in each SM

      • Each warp spans four slices in the y dimension

Architectural Considerations – Slide 51



Granularity Considerations (cont.)

  • For Matrix Multiplication, should I use 4×4, 8×8, 16×16 or 32×32 tiles?

    • For 16×16, we have 256 threads per block.

    • Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity unless other resource considerations overrule.

      • There are 24 warps available for scheduling in each SM

      • Each warp spans two slices in the y dimension

    • For 32×32, we have 1024 threads per block. Not even one block can fit into an SM!
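The capacity arithmetic above can be summarized in a short host-side sketch (G80 limits assumed: 768 threads, 8 blocks, and 512 threads per block per SM):

    #include <stdio.h>

    int main(void)
    {
        const int max_threads = 768, max_blocks = 8, max_block_size = 512, warp = 32;
        int tiles[] = {4, 8, 16, 32};
        for (int i = 0; i < 4; ++i) {
            int tpb = tiles[i] * tiles[i];                    // threads per block for a square tile
            if (tpb > max_block_size || tpb > max_threads) {
                printf("%2dx%-2d: %4d threads per block - does not fit in an SM\n", tiles[i], tiles[i], tpb);
                continue;
            }
            int blocks = max_threads / tpb;
            if (blocks > max_blocks) blocks = max_blocks;     // the 8-block limit kicks in for small tiles
            int warps = blocks * ((tpb + warp - 1) / warp);   // warps never span blocks
            printf("%2dx%-2d: %d blocks, %4d threads, %2d warps per SM\n",
                   tiles[i], tiles[i], blocks, blocks * tpb, warps);
        }
        return 0;
    }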

Architectural Considerations – Slide 52



(Figure: CUDA device memory spaces — each thread has registers and local memory; each block, e.g. Block (0,0) and Block (1,0), has shared memory; the grid has global, constant, and texture memories, which the host can also access.)

Topic 3: Memory Hardware in G80

  • Review: CUDA Device Memory Space

    • Each thread can:

      • R/W per-thread registers and local memory

      • R/W per-block shared memory

      • R/W per-grid global memory

      • Read only per-grid constant and texture memories

    • The host can R/W global, constant, and texture memories
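A small sketch of how variables map onto these spaces (names and sizes are illustrative only; the block size is assumed to be at most 128 threads):

    __constant__ float c_coeff[16];     // per-grid constant memory: read-only in kernels, written by the host
    __device__   float g_scale;         // per-grid global memory: R/W by all threads and the host

    __global__ void memorySpaces(float *d_out, const float *d_in)
    {
        __shared__ float s_buf[128];    // per-block shared memory (block size assumed <= 128)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float r = d_in[i];              // automatic variable: per-thread register
        s_buf[threadIdx.x] = r * g_scale;
        __syncthreads();
        d_out[i] = s_buf[threadIdx.x] + c_coeff[threadIdx.x % 16];
    }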

Architectural Considerations – Slide 53



Overview: Shared Memory

  • Uses:

    • Inter-thread communication within a block

    • Cache data to reduce global memory accesses

    • Use it to avoid non-coalesced access

  • Organization:

    • 16 banks, each 32 bits wide (Tesla)

    • 32 banks, each 32 bits wide (Fermi)

    • Successive 32-bit words belong to different banks

Architectural Considerations – Slide 54



Overview: Shared Memory (cont.)

  • Performance:

    • 32 bits per bank per 2 clocks per multiprocessor

    • Shared memory accesses are per 16-threads (half-warp)

    • Serialization: if n threads (out of 16) access the same bank, n accesses are executed serially

    • Broadcast: n threads access the same word in one fetch
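A common way to avoid the serialization described above is to pad a shared-memory tile so that threads in a half-warp reading down a column land in different banks; a sketch (16×16 tiles assumed, matrix width assumed to be a multiple of the tile size):

    #define TILE 16

    __global__ void transposeTile(float *d_out, const float *d_in, int width)
    {
        __shared__ float tile[TILE][TILE + 1];    // +1 column of padding breaks the bank conflict
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;

        tile[threadIdx.y][threadIdx.x] = d_in[y * width + x];    // coalesced load into shared memory
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;                     // origin of the transposed tile
        y = blockIdx.x * TILE + threadIdx.y;
        d_out[y * width + x] = tile[threadIdx.x][threadIdx.y];   // column read, now conflict-free
    }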

Architectural Considerations – Slide 55




Parallel Memory Sharing

  • Local Memory: per-thread

    • Private per thread

    • Auto variables, register spill

  • Shared Memory: per-block

    • Shared by threads of the same block

    • Inter-thread communication

Architectural Considerations – Slide 56




Parallel Memory Sharing (cont.)

  • Global Memory: per-application

    • Shared by all threads

    • Inter-grid communication

(Figure: sequential grids in time — Grid 0 and Grid 1 — communicating through global memory.)

Architectural Considerations – Slide 57




Streaming Multiprocessor Memory Architecture


  • Threads in a block share data and results

    • In memory and shared memory

    • Synchronize at barrier instruction


Architectural Considerations – Slide 58




Streaming Multiprocessor Memory Architecture (cont.)


  • Per-block shared memory allocation

    • Keeps data close to processor

    • Minimize trips to global memory

    • Shared memory is dynamically allocated to blocks, one of the limiting resources


Architectural Considerations – Slide 59



Streaming Multiprocessor Register File

  • Register File (RF)

    • 32 KB (8K entries) for each SM in G80

  • TEX pipe can also read/write RF

    • 2 SMs share 1 TEX

    • Load/Store pipe can also read/write RF


Architectural Considerations – Slide 60



Programmer View of Register File

(Figure: the SM register file dynamically partitioned among 3 blocks vs. 4 blocks.)

  • There are 8192 registers in each SM in G80

    • This is an implementation decision, not part of CUDA

    • Registers are dynamically partitioned across all blocks assigned to the SM

    • Once assigned to a block, the register is NOT accessible by threads in other blocks

    • Each thread in the same block only accesses registers assigned to itself


Architectural Considerations – Slide 61



Matrix Multiplication Example

  • If each block has 16 × 16 threads and each thread uses 10 registers, how many threads can run on each SM?

    • Each block requires 10 × 256 = 2560 registers

    • 8192 = 3 × 2560 + change

    • So, three blocks can run on an SM as far as registers are concerned

  • How about if each thread increases the use of registers by 1?

    • Each block now requires 11 × 256 = 2816 registers

    • 8192 < 2816 × 3

    • Only two blocks can run on an SM, ¹⁄₃ reduction of thread-level parallelism!!!
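The register arithmetic above as a short host-side sketch (8192 registers and an 8-block limit per G80 SM assumed):

    #include <stdio.h>

    int main(void)
    {
        const int regs_per_sm = 8192, threads_per_block = 16 * 16, max_blocks = 8;
        for (int regs_per_thread = 10; regs_per_thread <= 11; ++regs_per_thread) {
            int regs_per_block = regs_per_thread * threads_per_block;
            int blocks = regs_per_sm / regs_per_block;           // register-limited block count
            if (blocks > max_blocks) blocks = max_blocks;
            printf("%d regs/thread -> %d regs/block -> %d block(s) per SM\n",
                   regs_per_thread, regs_per_block, blocks);
        }
        return 0;
    }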

Architectural Considerations – Slide 62



More on Dynamic Partitioning

  • Dynamic partitioning gives more flexibility to compilers/programmers

    • One can run a smaller number of threads that require many registers each or a large number of threads that require few registers each

      • This allows for finer grain threading than traditional CPU threading models.

    • The compiler can tradeoff between instruction-level parallelism (ILP) and thread level parallelism (TLP)

Architectural Considerations – Slide 63



ILP vs. TLP Example

  • Assume that a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, and that each thread uses 10 registers; with 200-cycle global loads, 3 blocks can run on each SM

  • If a compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load, only two blocks can run on each SM

Architectural Considerations – Slide 64



ILP vs. TLP Example (cont.)

  • However, one only needs 200/(8×4) = 7 warps to tolerate the memory latency

  • Two blocks have 16 warps, so the performance can actually be higher!
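The latency arithmetic behind both cases, as a sketch (G80 figures from the slides: 200-cycle global loads, 4 cycles to issue a warp instruction):

    #include <stdio.h>

    int main(void)
    {
        const int latency = 200, cycles_per_instr = 4;
        int indep[] = {4, 8};                                 // independent instructions per global load
        for (int i = 0; i < 2; ++i) {
            int work  = indep[i] * cycles_per_instr;          // cycles of useful work per warp per load
            int warps = (latency + work - 1) / work;          // ceiling division
            printf("%d independent instructions: %2d warps needed to hide the latency\n",
                   indep[i], warps);
        }
        return 0;
    }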

Architectural Considerations – Slide 65



Resource Allocation Example

  • Increase in per-thread performance, but fewer threads

    • Lower overall performance

Architectural Considerations – Slide 66



Prefetching

  • One could double buffer the computation, getting better instruction mix within each thread

    • This is classic software pipelining in ILP compilers

Architectural Considerations – Slide 67



Prefetch

(Figure: tiled matrix multiplication layout — Md and Nd tiles of size TILE_WIDTH, indexed by bx, by, tx, ty, producing the Pdsub tile of Pd within a WIDTH × WIDTH matrix.)

  • Deposit the blue tile from registers into shared memory and __syncthreads()

  • Load orange tile into register

  • Compute blue tile

  • Deposit orange tile into shared memory

  • ….
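A minimal sketch of this double-buffered loop for the tiled multiply Pd = Md × Nd (simplified: the matrix width is assumed to be a multiple of TILE_WIDTH):

    #define TILE_WIDTH 16

    __global__ void matMulPrefetch(const float *Md, const float *Nd, float *Pd, int Width)
    {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

        float mReg = Md[row * Width + threadIdx.x];            // load the first tile into registers
        float nReg = Nd[threadIdx.y * Width + col];
        float pValue = 0.0f;

        for (int m = 0; m < Width / TILE_WIDTH; ++m) {
            Mds[threadIdx.y][threadIdx.x] = mReg;              // deposit current tile into shared memory
            Nds[threadIdx.y][threadIdx.x] = nReg;
            __syncthreads();

            if (m + 1 < Width / TILE_WIDTH) {                  // prefetch the next tile into registers
                mReg = Md[row * Width + (m + 1) * TILE_WIDTH + threadIdx.x];
                nReg = Nd[((m + 1) * TILE_WIDTH + threadIdx.y) * Width + col];
            }

            for (int k = 0; k < TILE_WIDTH; ++k)               // compute on the tile in shared memory
                pValue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
            __syncthreads();
        }
        Pd[row * Width + col] = pValue;
    }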


Architectural Considerations – Slide 68




Instruction Mix Considerations

  • Relative to the branch instructions and address calculations, there are very few useful multiplications or additions.

  • Loop unrolling can help.

Architectural Considerations – Slide 69



Unrolling

  • Removal of branch instructions and address calculations

Does this use more registers?
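A sketch of unrolling applied to the inner product of a tiled multiply (tile width and names are illustrative; with a compile-time trip count the compiler can do this via #pragma unroll):

    #define TILE_WIDTH 16

    __global__ void innerProductUnrolled(const float *Mds_row, const float *Nds_col, float *d_out)
    {
        float pValue = 0.0f;
        #pragma unroll                          // replicate the body TILE_WIDTH times: the loop branch
        for (int k = 0; k < TILE_WIDTH; ++k)    // disappears and the indices become constant offsets,
            pValue += Mds_row[k] * Nds_col[k];  // possibly at the cost of extra registers
        d_out[threadIdx.x] = pValue;
    }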

Architectural Considerations – Slide 70



Major G80 Performance Detractors

  • Long-latency operations

    • Avoid stalls by executing other threads

  • Stalls and bubbles in the pipeline

    • Barrier synchronization

    • Branch divergence

  • Shared resource saturation

    • Global memory bandwidth

    • Local memory capacity

Architectural Considerations – Slide 71



End Credits

  • Based on original material from

    • Jon Stokes, PCI Express: An Overview

      • http://arstechnica.com/articles/paedia/hardware/pcie.ars

    • The University of Illinois at Urbana-Champaign

      • David Kirk, Wen-mei W. Hwu

  • Revision history: last updated 10/11/2011.

    • Previous revisions: 9/13/2011.

Architectural Considerations – Slide 72

