
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications


Introduction

[Diagram: a GPGPU core with SIMD execution units and shared memory, connected through a memory request buffer to DRAM.]

Many-Thread Aware Prefetching Mechanisms (MICRO-43)

  • General-purpose GPUs (GPGPUs) are becoming popular

    • High-performance capability (NVIDIA GeForce GTX 580: 1.5 Tflops)

    • GPGPUs have SIMD execution, many cores, and large-scale multi-threading

      • Warp: the basic unit of execution in a core (SIMD unit)


Memory Latency Problem

[Diagram: legend C = computation, M = memory access, D = dependent on memory. With 4 active threads (T0-T3), the core switches to another thread whenever one issues a memory access, so the memory latency is hidden and there are no stalls.]

  • Tolerating memory latency is critical in CPUs

    • Many techniques have been proposed

      • Caches, prefetching, multi-threading, etc.

  • Is this critical in GPGPUs as well?

    • GPGPUs have employed multi-threading



Memory Problems in GPGPUs

[Diagram: with only 2 active threads (T0, T1), switching cannot cover the memory latency; once both threads are waiting on memory, the core sits in stall cycles until data returns.]

  • What if there are not enough threads in GPGPUs?

  • Limited thread-level parallelism

    • Application behavior

      • Algorithmically, a lack of parallelism

    • Limited by resource constraints

      • # registers per thread, # threads per block, shared memory usage per block
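These resource constraints can be turned into a simple occupancy bound: the number of resident threads is the minimum of the limits implied by each resource. A small C sketch of that reasoning (the `max_threads` helper and all capacities passed to it are illustrative, not any specific GPU's values):

```c
#include <assert.h>

/* Illustrative occupancy bound: resident threads per core are limited by
 * the register file, shared memory, and a hardware cap; the tightest
 * bound wins. Parameter values used with this helper are hypothetical. */
int max_threads(int regfile_regs, int regs_per_thread,
                int smem_bytes, int smem_per_block, int threads_per_block,
                int hw_thread_limit) {
    int by_regs = regfile_regs / regs_per_thread;   /* register-file limit */
    int blocks  = smem_bytes / smem_per_block;      /* shared-memory limit */
    int by_smem = blocks * threads_per_block;
    int limit = hw_thread_limit;
    if (by_regs < limit) limit = by_regs;
    if (by_smem < limit) limit = by_smem;
    return limit;
}
```

For example, with a 16K-entry register file and 32 registers per thread, only 512 threads can be resident regardless of the hardware cap: exactly the "limited by resource constraints" case above.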



Prefetching in GPGPUs


  • Problem: when multi-threading is not enough, how can we hide memory latency?

    • Other solutions

      • Caching (NVIDIA Fermi)

      • Prefetching (in this talk)

  • Many prefetching mechanisms have been proposed for CPUs

    • stride, stream, Markov, CDP, GHB, helper thread, etc.

  • Question: will the existing mechanisms work in GPGPUs?


Characteristic #1. Many Threads

[Diagram: in a CPU, a prefetcher is trained by one or two threads; in a GPU, a single prefetcher must be trained by many threads at once.]


  • Problem #1. Training of the prefetcher (scalability)

    • Accesses from many threads are interleaved

  • Problem #2. Amplified negative effects (SIMT)

    • One useless prefetch request per thread → many useless prefetches


Characteristic #1. Many Threads

[Diagram: with prefetch degree 1, the prefetched blocks fit in the cache; with prefetch degree 2, the flood of prefetches far exceeds the cache capacity and causes capacity misses.]


  • Problem #3. Single-Configuration Many-Threads (SCMT)

    • Too many threads are controlled by one configuration

      Prefetch degree 1: total prefetched data < cache size

      Prefetch degree 2: total prefetched data >> cache size


Characteristic #2. Data Level Parallelism

[Diagram: a thread in a sequential program lives long enough for a prefetch to cover the memory latency before the demand access (useful). A thread in a parallel program has little opportunity: either (1) the thread terminates before the prefetched data is used, or (2) the prefetch is issued too close to the demand.]

  • Problem #4. Short thread lifetime

    • Threads in parallel programs are much shorter than in sequential programs because parallelization splits the work across threads


Goal

  • Design hardware/software prefetching mechanisms for GPGPU applications

    • Step 1. Prefetcher for Many-thread Architecture

      • Many-Thread Aware Prefetching Mechanisms (Problems #1 and #4)

    • Step 2. Feedback mechanism to reduce negative effects

      • Prefetch Throttling (Problems #2 and #3)




Many-Thread Aware Hardware Prefetcher

[Diagram: the hardware prefetcher extends a conventional stride prefetcher (trained on PC, ADDR) into a many-thread version trained on (PC, ADDR, TID): a promotion table captures strides shared across threads (stride promotion), an inter-thread (IP) prefetcher handles cross-thread patterns, and decision logic selects the final prefetch address.]

  • (Conventional) Stride prefetcher

  • Promotion table for stride prefetcher (Problem #1)

  • Inter-Thread prefetcher (Problem #4)

  • Decision logic



Solving Scalability Problem

[Diagram: redundant per-thread entries with the same stride are promoted into a single shared entry.]


  • Problem #1. Training of the prefetcher (scalability)

  • Stride Promotion

    • Threads have similar (or even identical) access patterns

    • Without promotion, the table is occupied by redundant entries

      • Promotion lets the table storage be managed effectively

    • Training time is reduced by reusing earlier threads’ information
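The promotion idea can be sketched in plain C. This is a simplified software model, not the paper's hardware: the table sizes, the vote threshold, and the `train()` interface are all hypothetical.

```c
#include <assert.h>

#define THREADS 8
#define PROMOTE_THRESHOLD 2   /* confirmations before a stride is promoted */

typedef struct { long last_addr, stride; int trained; } Entry;  /* per TID */
typedef struct { long stride; int votes, valid; } Promoted;     /* per PC  */

static Entry per_thread[THREADS];  /* per-thread entries for one PC  */
static Promoted promo;             /* shared promoted entry, same PC */

/* Train on an access (tid, addr); return the prefetch address, or -1
 * while the prefetcher is still learning. */
long train(int tid, long addr) {
    Entry *e = &per_thread[tid];
    if (promo.valid)                       /* promoted: no per-thread warm-up */
        return addr + promo.stride;
    if (e->trained && addr - e->last_addr == e->stride) {
        if (++promo.votes >= PROMOTE_THRESHOLD) {  /* threads agree: promote */
            promo.stride = e->stride;
            promo.valid  = 1;
        }
    } else if (e->last_addr) {             /* learn this thread's stride */
        e->stride  = addr - e->last_addr;
        e->trained = 1;
    }
    e->last_addr = addr;
    return promo.valid ? addr + promo.stride : -1;
}
```

Once two threads confirm the same stride, a brand-new thread gets a prefetch on its very first access: later threads skip the warm-up, and redundant per-thread entries are no longer needed.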


Solving Short Thread Lifetime Problem

[Diagram: a prefetch issued one iteration ahead covers the memory latency before the matching demand access.]

for (ii = 0; ii < 100; ++ii) {
    prefetch(A[ii+1]);
    prefetch(B[ii+1]);
    C[ii] = A[ii] + B[ii];
}

// there are 100 threads
__global__ void KernelFunction(…) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int varA = aa[tid];
    int varB = bb[tid];
    C[tid] = varA + varB;
}

The sequential version has a loop, so it can prefetch for the next iteration. The parallel kernel has no loop: each thread performs just 2 memory accesses and 1 computation.


Problem #4 (Short thread lifetime)

Highly parallelized code often eliminates prefetching opportunities


Inter-Thread Prefetching

// there are 100 threads
__global__ void KernelFunction(…) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int next_tid = tid + 32;
    prefetch(aa[next_tid]);
    prefetch(bb[next_tid]);
    int varA = aa[tid];
    int varB = bb[tid];
    C[tid] = varA + varB;
}

[Diagram: each thread (T0, T1, T2, …) prefetches the data that a thread 32 IDs ahead (T32, T33, …, T64) will access, so each prefetch overlaps with memory accesses in other threads.]

  • Instead, a thread can prefetch for other threads

    • Inter-Thread Prefetching (IP)

    • In CUDA, memory addresses are a function of the thread ID [SIMT]



IP Pattern Detection in Hardware

Req 4

Req 3

Req 2

Req 1

PC:0x1a Addr:2100 TID:1

PC:0x1a Addr:400 TID:3

PC:0x1a Addr:1100 TID:10

PC:0x1a Addr:200 TID:1

Prefetch (addr + stride)

Addr:2100 Stride:100

Trained already

Addr ∆

Delta (Req1-Req2) = = 100

All three deltas are same

We found a pattern

TID ∆

Delta (Req3-Req1) = = 100

Delta (Req3-Reg2) = = 100

Detecting strides across threads

Launch prefetch request
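The delta check above can be sketched in C. This is a simplified software model: the `Req` struct and function names are illustrative, and real hardware would track only a few in-flight requests per PC.

```c
#include <assert.h>

/* One observed memory request: (address, thread ID). */
typedef struct { long addr; int tid; } Req;

/* Return the per-thread-ID stride if every request pair agrees on
 * (address delta) / (TID delta), otherwise 0 (no pattern found). */
long detect_ip_stride(const Req *r, int n) {
    long stride = 0;
    if (n < 3) return 0;                   /* need several samples to trust */
    for (int i = 1; i < n; ++i) {
        long da = r[i].addr - r[0].addr;   /* address delta vs. first req   */
        long dt = r[i].tid  - r[0].tid;    /* thread-ID delta vs. first req */
        if (dt == 0 || da % dt != 0) return 0;
        long s = da / dt;
        if (stride != 0 && s != stride) return 0;
        stride = s;
    }
    return stride;
}

/* Once trained, prefetch for a thread `lookahead` strides ahead. */
long ip_prefetch_addr(long addr, long stride, int lookahead) {
    return addr + stride * lookahead;
}
```

With the slide's example requests (200/TID 1, 1100/TID 10, 400/TID 3), `detect_ip_stride` returns 100, and the trained entry turns Req4's address 2100 into a prefetch for 2200.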



MT-Aware Hardware Prefetcher

[Diagram: the prefetcher in operation: Cycle 1, the stride prefetcher is trained with (PC, ADDR, TID); Cycle 2, the stride is promoted into the promotion table; Cycle 3, the IP prefetcher and decision logic produce the prefetch address.]



Goal

  • Design a hardware/software prefetcher for GPGPU applications

    • Step 1. Prefetcher for Many-thread Architecture

      • Many-Thread Aware Prefetching Mechanisms

    • Step 2. Feedback mechanism to reduce negative effects

      • Prefetch Throttling


Outline

Motivation

Step 1. Many-Thread Aware Prefetching

Step 2. Prefetch Throttling

Evaluation

Conclusion


Prefetch Throttling

  • Recall the problems in GPGPU prefetching

    • Problem #2. Amplified negative effects

    • Problem #3. Single-Configuration Many-Threads

  • To identify whether prefetching is effective

    • Metrics

      • Usefulness: accurate and timely prefetches

      • Harmfulness: inaccurate or too-early prefetches

        • Some late prefetches are tolerable

      • Similar to Srinath et al. [HPCA 2007]


Throttling Metrics

  • Merged memory requests

    • A new request with the same address as an existing entry

      • Inside a core (in the MSHRs)

    • Would be late prefetches in CPUs

    • Indicate accuracy (due to massive multi-threading)

      • Less correlated with timeliness

  • Early block eviction from the prefetch cache

    • Due to capacity misses, regardless of accuracy

    • Indicates both timeliness and accuracy

  • Periodic updates

    • To cope with changes in runtime behavior


Heuristic for Prefetch Throttling

* The ideal case (accurate prefetches with perfect timing) has both a low early-eviction ratio and a low merge ratio.

  • Throttle degree

    • Varies from 0 (prefetch all) to 5 (no prefetching)

    • Default: 2
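A minimal sketch of such a feedback loop in C. The counter names and the 25%/10% thresholds are illustrative assumptions, not the paper's actual heuristic table.

```c
#include <assert.h>

/* Counters collected over one sampling period (names hypothetical). */
typedef struct {
    unsigned prefetches;      /* prefetch requests issued                 */
    unsigned merges;          /* demand requests merged with an in-flight
                                 prefetch (same address in the MSHRs)     */
    unsigned early_evictions; /* prefetched blocks evicted before use     */
} Stats;

/* Throttle degree: 0 = prefetch everything ... 5 = no prefetching.
 * A high early-eviction ratio means prefetches are inaccurate or too
 * early, so throttle harder; low eviction and low merge ratios mean
 * prefetching is accurate and timely, so throttle less. */
int update_degree(int degree, Stats s) {
    if (s.prefetches == 0) return degree;      /* nothing to learn from */
    double evict = (double)s.early_evictions / s.prefetches;
    double merge = (double)s.merges / s.prefetches;
    if (evict > 0.25 && degree < 5)
        degree++;                              /* harmful: back off      */
    else if (evict < 0.10 && merge < 0.25 && degree > 0)
        degree--;                              /* useful: prefetch more  */
    return degree;
}
```

Calling this at the end of every sampling period mirrors the "periodic updates" metric above: the degree drifts toward 5 when prefetches pollute the cache and toward 0 when they are consumed in time.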


Evaluation Methodology

  • MacSim simulator

    • A cycle-accurate, in-house simulator

    • Trace-driven (traces generated by GPUOcelot [Diamos])

  • Baseline

    • 14 cores (8-wide SIMD) at 900 MHz; 16 banks / 8 channels, 1.2 GHz memory frequency, 900 MHz bus, FR-FCFS scheduling

    • NVIDIA G80 architecture

  • 14 memory-intensive benchmarks

    • From CUDA SDK, Merge, Rodinia, and Parboil

    • Types: stride, MP (massively parallel), and uncoalesced

    • Non-memory-intensive benchmarks are covered in the paper


Evaluation Methodology (cont'd)

  • Prefetching

    • Stream, stride, and GHB prefetchers evaluated

    • 16 KB prefetch cache per core (other sizes in the paper)

    • Prefetch distance: 1, degree: 1 (the optimal configuration)

  • Results

    • Hardware prefetcher (this talk)

    • Software prefetcher (in the paper)


MT Hardware Prefetcher Results

15% over Stride


GHB and stride prefetchers do not work on MP- and uncoalesced-type benchmarks.

Inter-Thread Prefetching (IP) can be effective.

Stride Promotion improves the performance of a few benchmarks.


MT-HWP with Throttling Results

15% over Stride + Throttling


  • Feedback-driven hardware prefetchers can be effective

  • Throttling eliminates the negative effects (e.g., of the stream prefetcher)

    * There are more negative cases in the software prefetching mechanisms


Outline

Motivation

Step 1. Many-Thread Aware Prefetching

Step 2. Prefetch Throttling

Evaluation

Conclusion


Conclusion

  • Memory latency is an important problem in GPGPUs as well.

  • GPGPU prefetching faces four problems:

    • scalability, amplified negative effects, SCMT, and short thread lifetime

  • Goal: design hardware/software prefetchers

    • Step 1. Many-thread aware prefetcher (promotion, IP)

    • Step 2. Prefetch throttling

  • The MT-aware hardware prefetcher shows a 15% performance improvement, and prefetch throttling removes all the negative effects.

  • Future work

    • Study other many-thread architectures

      • Other programming models; architectures with caches


THANK YOU!


Many-Thread Aware Prefetching Mechanisms for GPGPU Applications


NVIDIA Fermi Result


Different Prefetch Cache Size


Software MT Prefetcher Results

