
ECE 569: High Performance Processors and Systems

  • Administrative
    • HW2 due Thursday 2/13 @ start of class
  • GPUs
    • Conditional execution
    • Current state of the art

ECE 569 -- 11 Feb 2014

GPUs == SIMD

[Figure: SIMD organization. A single Control Processor fetches from Instruction Memory and broadcasts each instruction to an array of processing elements, each with its own ALU and Registers, all sharing RAM.]

A Simple Execution Model
  • No branch prediction
    • Just evaluate the branch and wait for it to resolve
    • The wait is only a few cycles once the data has been loaded from global memory
  • No speculation
    • Only useful instructions are executed

Example: conditionals

if (threadIdx.x >= 2) {
    out[threadIdx.x] += 100;
}
else {
    out[threadIdx.x] += 10;
}

[Figure: the Instruction Unit broadcasts the compare of threadIdx.x against 2 to all processing elements P0, P1, ..., PM-1, each with its own registers and access to a shared Memory.]

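For concreteness, here is a minimal sketch of a complete kernel and launch built around this conditional (the kernel name add_conditional, the array size, and the single-block launch are illustrative assumptions, not from the slides):

#include <cstdio>

// All threads in the warp fetch the same instruction stream; the hardware masks
// off the threads whose predicate does not match the path currently being issued.
__global__ void add_conditional(int *out)
{
    if (threadIdx.x >= 2)
        out[threadIdx.x] += 100;   // then path: threads 2 .. blockDim.x-1
    else
        out[threadIdx.x] += 10;    // else path: threads 0 and 1
}

int main()
{
    const int N = 8;                          // less than one warp is enough
    int h[N] = {0};
    int *d;
    cudaMalloc(&d, N * sizeof(int));
    cudaMemcpy(d, h, N * sizeof(int), cudaMemcpyHostToDevice);
    add_conditional<<<1, N>>>(d);             // both paths serialize within the warp
    cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("%d ", h[i]);   // prints 10 10 100 100 ...
    printf("\n");
    cudaFree(d);
    return 0;
}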

then part

if (threadIdx.x >= 2) {
    out[threadIdx.x] += 100;
}
else {
    out[threadIdx.x] += 10;
}

All threads in the warp where the condition is true execute the then part:

load …
add 100
store …

The other threads in the warp are "masked off" and wait…

[Figure: the Instruction Unit issues the then-path instructions; P0 and P1 are masked off (X) while P2 ... PM-1 execute.]

else part

if (threadIdx.x >= 2) {
    out[threadIdx.x] += 100;
}
else {
    out[threadIdx.x] += 10;
}

All threads in the warp where the condition is false execute the else part:

load …
add 10
store …

The other threads in the warp are "masked off" and wait…

[Figure: the Instruction Unit issues the else-path instructions; now P0 and P1 execute while the remaining processing elements are masked off (X).]

Terminology
  • Divergent paths
    • Different threads within a warp take different control-flow paths through a kernel function
    • N divergent paths in a warp?
      • Can result from nested conditionals (see the sketch after this list)
      • An N-way divergent warp is issued serially over the N different paths, using a hardware stack and per-thread predication logic so that only the threads taking each path write back results
      • Performance decreases by roughly a factor of N
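A minimal sketch of how nesting produces multi-way divergence (the kernel name four_way_divergent and the specific conditions are illustrative assumptions):

// Nested conditionals split one warp into 4 divergent paths; the warp is
// issued serially over paths A-D, so throughput drops by roughly 4x.
__global__ void four_way_divergent(int *out)
{
    int t = threadIdx.x;
    if (t % 2 == 0) {
        if (t % 4 == 0) out[t] = 1;   // path A: threads 0, 4, 8, ...
        else            out[t] = 2;   // path B: threads 2, 6, 10, ...
    } else {
        if (t % 4 == 1) out[t] = 3;   // path C: threads 1, 5, 9, ...
        else            out[t] = 4;   // path D: threads 3, 7, 11, ...
    }
}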

Example: Vector Reduction

[Figure: reduction tree over array elements 0..11 (iterations vs. array elements). Iteration 1: threads 0, 2, 4, 6, 8, 10 compute 0+1, 2+3, 4+5, 6+7, 8+9, 10+11. Iteration 2: partial sums 0..3, 4..7, 8..11. Iteration 3: partial sums 0..7, 8..15.]

Implementation

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
    __syncthreads();                    // wait for the previous iteration's partial sums
    if (t % (2*stride) == 0)            // only threads at even multiples of 2*stride add
        d_out[t] += d_out[t + stride];
}
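For context, a minimal sketch of this loop inside a kernel together with a host-side launch (the kernel name reduce_naive, the 512-element single-block problem size, and the use of d_out as a global array are illustrative assumptions):

#include <cstdio>

// One block reduces blockDim.x elements of d_out in place; d_out[0] ends up
// holding the block's sum.
__global__ void reduce_naive(float *d_out)
{
    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2 * stride) == 0)
            d_out[t] += d_out[t + stride];
    }
}

int main()
{
    const int N = 512;                        // power-of-two block size
    float h[N], sum = 0.0f;
    for (int i = 0; i < N; i++) h[i] = 1.0f;  // expected sum: 512
    float *d;
    cudaMalloc(&d, N * sizeof(float));
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);
    reduce_naive<<<1, N>>>(d);
    cudaMemcpy(&sum, d, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", sum);
    cudaFree(d);
    return 0;
}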

Some Observations
  • In each iteration there are two control-flow paths
    • Threads that perform the addition and threads that do not
    • Threads that do not perform the addition may still cost cycles, depending on how divergence is implemented
  • No more than half of the threads will be executing at any time
    • All odd-index threads are disabled from the very first iteration!
    • Averaged over all iterations, fewer than ¼ of the threads in each warp are active

A better approach

[Figure: stride-based reduction. Iteration 1 with a stride of 16: thread 0 computes 0+16, ..., thread 15 computes 15+31; later iterations halve the stride, so the active threads always form a contiguous block starting at element 0.]

A better implementation

unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride >>= 1)
{
    __syncthreads();
    if (t < stride)                     // active threads are the contiguous range 0..stride-1
        d_out[t] += d_out[t + stride];
}
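The same loop wrapped as a kernel, with the warp-level behavior spelled out in comments (the kernel name reduce_better is an illustrative assumption); it can be launched exactly like the reduce_naive sketch above:

// Because the active threads are always 0..stride-1, a warp either has every
// thread pass the test or every thread fail it until stride drops below the
// warp size (32); warps that fail as a group skip the body with no divergence.
__global__ void reduce_better(float *d_out)
{
    unsigned int t = threadIdx.x;
    for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            d_out[t] += d_out[t + stride];
    }
}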

Some Observations
  • Only the last 5 iterations have divergence within a warp
    • Divergence appears only once the stride drops below the warp size of 32, and log2(32) = 5
  • Entire warps will be shut down as iterations progress
    • For a 512-thread block, 4 iterations shut down all but one warp in each block
    • Better resource utilization: warps and blocks retire faster
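To make this concrete, a small host-side loop (illustrative, not from the slides) that prints, for a 512-thread block, which threads are active at each stride and how many of the block's 16 warps still do work:

#include <cstdio>

int main()
{
    const int blockThreads = 512, warpSz = 32;
    for (int stride = blockThreads >> 1; stride >= 1; stride >>= 1) {
        int busyWarps = (stride + warpSz - 1) / warpSz;   // warps containing active threads
        printf("stride %3d: threads 0..%3d active, %2d of %d warps busy%s\n",
               stride, stride - 1, busyWarps, blockThreads / warpSz,
               stride < warpSz ? "  (divergence inside the last warp)" : "");
    }
    return 0;
}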


Optimization: Predicated Execution

<p1> LDR r1,r2,0

  • If p1 is TRUE, instruction executes normally
  • If p1 is FALSE, instruction treated as NOP


Predication Example

Source:

    :
    if (X == 10)
        c = c + 1;
    :

Predicated code:

    :
    LDR r5, X
    p1 <- r5 eq 10
    <p1> LDR r1, C
    <p1> ADD r1, r1, 1
    <p1> STR r1, C
    :

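For readers who want something compilable, a source-level analogue of the same transformation (illustrative C/CUDA host code, not the actual GPU ISA; the function names are assumptions):

#include <cstdio>

// Both functions compute the same result. The second has no branch: the
// condition becomes a predicate value and every "instruction" is issued,
// mirroring how predicated hardware guards only the result's effect.
int with_branch(int x, int c)
{
    if (x == 10)
        c = c + 1;
    return c;
}

int predicated(int x, int c)
{
    int p1 = (x == 10);   // predicate register, conceptually (0 or 1)
    return c + p1;        // the add is always issued; p1 selects its effect
}

int main()
{
    printf("%d %d\n", with_branch(10, 5), predicated(10, 5));   // 6 6
    printf("%d %d\n", with_branch(7, 5),  predicated(7, 5));    // 5 5
    return 0;
}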
Predication == better scheduling

Before scheduling (the predicated then-block B and else-block C issue back to back):

    :
    p1,p2 <- r5 eq 10
    <p1> inst 1 from B
    <p1> inst 2 from B
    <p1> :
    <p2> inst 1 from C
    <p2> inst 2 from C
    :

After scheduling (instructions from B and C interleaved, since the predicates make them independent):

    :
    p1,p2 <- r5 eq 10
    <p1> inst 1 from B
    <p2> inst 1 from C
    <p1> inst 2 from B
    <p2> inst 2 from C
    <p1> :
    :

State of the Art with GPUs


NVIDIA Kepler GK110

  • 7.1B transistors
  • 2,880 cores
  • 2012 model :-)


TITAN: 2nd-fastest supercomputer

  • 560,640 total cores: ½ CPU and ½ GPU
  • CPUs: 18,688 AMD Opteron (16-core)
  • GPUs: 18,688 NVIDIA Tesla K20X
  • 710,144 GB RAM
  • 8,000 kW of power
