ECE 569

High Performance Processors and Systems

  • Administrative

    • HW2 due Thursday 2/13 @ start of class

  • GPUs

    • Conditional execution

    • Current state of the art

ECE 569 -- 11 Feb 2014


GPUs == SIMD

[Figure: SIMD block diagram. A control processor with an instruction memory drives an array of ALUs (ALU, ALU, ALU, ...), each with its own registers, alongside RAM.]

A Simple Execution Model

  • No branch prediction

    • Just evaluate branch targets and wait for resolution

    • But the wait is only a small number of cycles once the data has been loaded from global memory

  • No speculation

    • Only execute useful instructions

Lecture 07

Example: conditionals

    if (threadIdx.x >= 2) {
        out[threadIdx.x] += 100;
    }
    else {
        out[threadIdx.x] += 10;
    }

compare threadIdx.x, 2

[Figure: an instruction unit issuing to processors P0, P1, ..., PM-1, each with its own registers (Reg), above a shared memory.]

then part

    if (threadIdx.x >= 2) {
        out[threadIdx.x] += 100;
    }
    else {
        out[threadIdx.x] += 10;
    }

All threads in the warp where cond is true do the then part:

    load …
    add 100
    store …

The other threads in the warp are "masked off" and wait…

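[Figure: the same warp diagram (instruction unit, processors P0, P1, ..., PM-1 with registers, memory), with two processors crossed out (X) to show the masked-off threads.]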

else part

    if (threadIdx.x >= 2) {
        out[threadIdx.x] += 100;
    }
    else {
        out[threadIdx.x] += 10;
    }

All threads in the warp where cond is false do the else part:

    load …
    add 10
    store …

The other threads in the warp are "masked off" and wait…
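The two passes above can be modeled on the host. The following is only a sketch in plain C++, not CUDA device code: the 8-lane warp width (`kWarpSize`) and the function name `run_branch_on_warp` are made up for illustration, and real hardware tracks the lane mask in its own way.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical warp width for this sketch; real NVIDIA warps have 32 threads.
constexpr std::size_t kWarpSize = 8;

// Host-side model of one warp executing
//   if (threadIdx.x >= 2) out[threadIdx.x] += 100; else out[threadIdx.x] += 10;
// Each side of the branch is issued once for the whole warp; the lane mask
// decides which threads actually write. Returns the number of passes issued.
inline int run_branch_on_warp(std::array<int, kWarpSize>& out) {
    std::array<bool, kWarpSize> cond{};
    for (std::size_t tid = 0; tid < kWarpSize; ++tid) {
        cond[tid] = (tid >= 2);              // compare threadIdx.x, 2
    }

    int passes = 0;
    const bool sides[2] = {true, false};     // then part first, else part second
    for (bool side : sides) {
        bool any_active = false;
        for (std::size_t tid = 0; tid < kWarpSize; ++tid) {
            if (cond[tid] != side) continue; // lane is masked off and waits
            out[tid] += side ? 100 : 10;     // load, add, store
            any_active = true;
        }
        if (any_active) ++passes;            // a side with no takers is skipped
    }
    assert(passes >= 1 && passes <= 2);
    return passes;
}
```

Starting from an all-zero `out`, lanes 0 and 1 end with 10, the remaining lanes end with 100, and the warp is issued twice even though each thread needed only one side of the branch.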

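[Figure: the same warp diagram (instruction unit, processors P0, P1, ..., PM-1 with registers, memory), with two processors crossed out (X) to show the masked-off threads.]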

Terminology

  • Divergent paths

    • Different threads within a warp take different control flow paths within a kernel function

    • Can a warp have N divergent paths?

      • Yes, as a result of nested conditionals

      • An N-way divergent warp is serially issued over the N different paths using a hardware stack and per-thread predication logic to only write back results from the threads taking each divergent path.

      • Performance decreases by about a factor of N
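To make the "factor of N" concrete, here is a rough host-side sketch in plain C++. It assumes the cost of a divergent warp is simply one issue pass per distinct path, and the nested conditional in `path_taken` is a made-up example, not taken from the lecture.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical nested conditional: returns a label for the path a thread takes.
inline int path_taken(unsigned tid) {
    if (tid % 2 == 0) {
        return (tid % 4 == 0) ? 0 : 1;   // two paths inside the outer "then"
    }
    return (tid % 4 == 1) ? 2 : 3;       // two paths inside the outer "else"
}

// Number of times the warp must be issued: once per distinct path taken.
inline std::size_t serialized_passes(const std::vector<int>& paths) {
    assert(!paths.empty());
    return std::set<int>(paths.begin(), paths.end()).size();
}

// Paths taken by the threads of one warp of the given size.
inline std::vector<int> warp_paths(unsigned warp_size) {
    std::vector<int> paths;
    for (unsigned tid = 0; tid < warp_size; ++tid) {
        paths.push_back(path_taken(tid));
    }
    return paths;
}
```

For a 32-thread warp, `serialized_passes(warp_paths(32))` is 4: two levels of nesting give a 4-way divergent warp, so under this model it takes about four times as long as a warp whose threads all follow one path.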

Example: Vector Reduction

[Figure: tree reduction in which the even-numbered threads (Thread 0, 2, 4, 6, 8, 10, ...) perform the additions.]

    Array elements:  0    1    2    3    4    5    6    7    8    9    10   11   ...
    Iteration 1:     0+1       2+3       4+5       6+7       8+9       10+11
    Iteration 2:     0..3                4..7                8..11
    Iteration 3:     0..7                                    8..15


Implementation

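    unsigned int t = threadIdx.x;

    // stride doubles each iteration: 1, 2, 4, ...
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();              // wait for the previous iteration's partial sums
        if (t % (2*stride) == 0)      // only every (2*stride)-th thread adds
            d_out[t] += d_out[t+stride];
    }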

Some Observations

  • In each iteration, two control flow paths

    • Threads that perform addition and threads that do not

    • Threads that do not perform addition may cost extra cycles depending on the implementation of divergence

  • No more than half of threads will be executing at any time

    • All odd index threads are disabled right from the beginning!

    • Averaged over all warps and all iterations, fewer than ¼ of the threads are active
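These observations can be checked by replaying the loop on the host. The sketch below is plain C++ rather than CUDA: the vector size stands in for `blockDim.x`, threads run one after another instead of in parallel, and `IterationStats` and `reduce_interleaved` are names invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct IterationStats {
    unsigned stride;
    unsigned active_threads;   // threads that perform an addition
    unsigned divergent_warps;  // warps with both active and idle threads
};

// Host-side replay of the interleaved reduction loop. d_out.size() stands in
// for blockDim.x and must be a power of two; warp_size defaults to 32.
inline std::vector<IterationStats> reduce_interleaved(std::vector<int>& d_out,
                                                      unsigned warp_size = 32) {
    const unsigned n = static_cast<unsigned>(d_out.size());
    assert(n > 0 && (n & (n - 1)) == 0 && warp_size > 0);

    std::vector<IterationStats> stats;
    for (unsigned stride = 1; stride < n; stride *= 2) {
        IterationStats it{stride, 0, 0};
        for (unsigned warp = 0; warp < n; warp += warp_size) {
            const unsigned end = (warp + warp_size < n) ? warp + warp_size : n;
            unsigned active = 0;
            for (unsigned t = warp; t < end; ++t) {
                if (t % (2 * stride) == 0) {       // same test as the kernel
                    d_out[t] += d_out[t + stride];
                    ++active;
                }
            }
            it.active_threads += active;
            if (active > 0 && active < end - warp) ++it.divergent_warps;
        }
        stats.push_back(it);
    }
    return stats;
}
```

For a 512-element block of ones, the first iteration has 256 active threads and all 16 warps are divergent. Over the 9 iterations there are 511 additions in total, out of 9 × 512 thread slots, which is roughly 11% utilization.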

A better approach

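[Figure: reduction in which the active threads are contiguous, starting at Thread 0. Each thread adds the element one stride away to its own element.]

    Array elements:  0   1   2   3   ...   13   14   15   16   17   18   19   ...
    Iteration 1:     0+16   ...   15+31

(The figure also labels later iterations 3 and 4.)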


A better implementation

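    unsigned int t = threadIdx.x;

    // stride halves each iteration: blockDim.x/2, ..., 2, 1
    for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride >>= 1)
    {
        __syncthreads();              // wait for the previous iteration's partial sums
        if (t < stride)               // active threads stay contiguous
            d_out[t] += d_out[t+stride];
    }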

Some Observations

  • Only the last 5 iterations will have divergence

    • within a warp, i.e. when stride is 16, 8, 4, 2, 1 (below the 32-thread warp size)

  • Entire warps will be shut down as iterations progress

    • For a 512-thread block, 4 iterations to shut down all but one warp in each block

    • Better resource utilization, will retire warps and blocks faster
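The counts above can be reproduced with a host-side replay of the loop. This is a sketch in plain C++, not CUDA: the vector size stands in for `blockDim.x`, threads run sequentially, and `IterationStats` and `reduce_contiguous` are names invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct IterationStats {
    unsigned stride;
    unsigned active_warps;     // warps with at least one thread doing an addition
    unsigned divergent_warps;  // warps with both active and idle threads
};

// Host-side replay of the contiguous reduction loop. d_out.size() stands in
// for blockDim.x and must be a power of two; warp_size defaults to 32.
inline std::vector<IterationStats> reduce_contiguous(std::vector<int>& d_out,
                                                     unsigned warp_size = 32) {
    const unsigned n = static_cast<unsigned>(d_out.size());
    assert(n > 0 && (n & (n - 1)) == 0 && warp_size > 0);

    std::vector<IterationStats> stats;
    for (unsigned stride = n >> 1; stride >= 1; stride >>= 1) {
        IterationStats it{stride, 0, 0};
        for (unsigned warp = 0; warp < n; warp += warp_size) {
            const unsigned end = (warp + warp_size < n) ? warp + warp_size : n;
            unsigned active = 0;
            for (unsigned t = warp; t < end; ++t) {
                if (t < stride) {                  // same test as the kernel
                    d_out[t] += d_out[t + stride];
                    ++active;
                }
            }
            if (active > 0) ++it.active_warps;
            if (active > 0 && active < end - warp) ++it.divergent_warps;
        }
        stats.push_back(it);
    }
    return stats;
}
```

For a 512-element block, the active warp count per iteration is 8, 4, 2, 1 and then stays at 1, so by the fourth iteration only one warp is still working. That warp is fully active at stride 32 and divergent only for strides 16 down to 1, the last 5 iterations.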

Optimization: Predicated Execution

<p1> LDR r1,r2,0

  • If p1 is TRUE, instruction executes normally

  • If p1 is FALSE, instruction treated as NOP

Predication Example

Source:

      :
    if (X == 10)
        c = c + 1;
      :

Predicated code:

      :
    LDR r5, X
    p1 <- r5 eq 10
    <p1> LDR r1, C
    <p1> ADD r1, r1, 1
    <p1> STR r1, C
      :
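The predicated sequence can be mimicked in plain C++ as a rough sketch. The helper names (`pred_load`, `pred_add`, `pred_store`, `increment_if_ten`) are invented for illustration, and the `if (p)` inside each helper stands for the hardware's predicate gating rather than a branch in the instruction stream.

```cpp
#include <cassert>

// Each helper models one predicated instruction: it executes normally when
// the predicate is true and behaves as a NOP when it is false.
inline void pred_load(bool p, int& reg, const int& mem) { if (p) reg = mem; }
inline void pred_add(bool p, int& dst, int src, int imm) { if (p) dst = src + imm; }
inline void pred_store(bool p, int reg, int& mem) { if (p) mem = reg; }

// Straight-line equivalent of: if (X == 10) c = c + 1;
// Every instruction is issued in order; there is no jump around the body.
inline void increment_if_ten(int X, int& C) {
    int r1 = 0;
    int r5 = 0;
    r5 = X;                        // LDR r5, X
    const bool p1 = (r5 == 10);    // p1 <- r5 eq 10
    pred_load(p1, r1, C);          // <p1> LDR r1, C
    pred_add(p1, r1, r1, 1);       // <p1> ADD r1, r1, 1
    pred_store(p1, r1, C);         // <p1> STR r1, C
}
```

With `C` equal to 5, calling `increment_if_ten(10, C)` leaves 6 in `C`, while `increment_if_ten(3, C)` issues the same three instructions but changes nothing.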

Predication optimizes execution of simple if-then-else

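[Figure: two diagrams over the blocks A, B, C and D. One is the control-flow graph of an if-then-else (A, then B or C, then D); the other is the predicated version of the same four blocks.]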

Predication == better scheduling

Before scheduling:

      :
    p1,p2 <- r5 eq 10
    <p1> inst 1 from B
    <p1> inst 2 from B
    <p1>   :
    <p2> inst 1 from C
    <p2> inst 2 from C
      :

After scheduling:

      :
    p1,p2 <- r5 eq 10
    <p1> inst 1 from B
    <p2> inst 1 from C
    <p1> inst 2 from B
    <p2> inst 2 from C
    <p1>   :
      :

State of the Art with GPUs

nVidia Kepler GK110

  • 7.1B transistors

  • 2,880 cores

  • 2012 model :-)

TITAN: 2nd fastest supercomputer

560,640 total cores: ½ CPU and ½ GPU

CPUs: 18,688 AMD Opteron (16-core)

GPUs: 18,688 nVidia Tesla K20x

710,144 GB RAM

8,000 kWatts of power
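The totals on this slide can be sanity-checked with a few lines of arithmetic. The per-device figures below are not on the slide and are assumptions of this sketch: each K20x is counted as 14 streaming multiprocessor "cores", and memory is taken as 32 GB of host RAM per CPU plus 6 GB on each GPU.

```cpp
#include <cassert>

constexpr long kCpus = 18688;             // AMD Opteron count (from the slide)
constexpr long kGpus = 18688;             // Tesla K20x count (from the slide)
constexpr long kCoresPerCpu = 16;         // 16-core Opteron (from the slide)
constexpr long kCoresPerGpu = 14;         // ASSUMPTION: 14 SMX units counted per K20x
constexpr long kHostGBPerCpu = 32;        // ASSUMPTION: host RAM per CPU
constexpr long kDeviceGBPerGpu = 6;       // ASSUMPTION: on-board RAM per K20x

constexpr long cpu_cores() { return kCpus * kCoresPerCpu; }
constexpr long gpu_cores() { return kGpus * kCoresPerGpu; }
constexpr long total_cores() { return cpu_cores() + gpu_cores(); }
constexpr long total_ram_gb() {
    return kCpus * kHostGBPerCpu + kGpus * kDeviceGBPerGpu;
}

// Both totals agree with the slide under the assumptions above.
static_assert(total_cores() == 560640, "core total matches the slide");
static_assert(total_ram_gb() == 710144, "RAM total matches the slide");
```

Under these assumptions the split is 299,008 CPU cores to 261,632 GPU cores, about 53% to 47%, which is the "½ CPU and ½ GPU" on the slide.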
