Many-Thread Aware Prefetching Mechanisms for GPGPU Applications

Presentation Transcript

Introduction

[Diagram: a GPGPU core with SIMD execution units, shared memory, and a memory request buffer, connected to DRAM.]

  • General-Purpose GPUs (GPGPUs) are becoming popular

    • High-performance capability (NVIDIA GeForce GTX 580: 1.5 TFLOPS)

    • GPGPUs rely on SIMD execution, many cores, and large-scale multi-threading

      • Warp – the basic unit of execution in a core (SIMD unit)


Memory Latency Problem

[Diagram: execution timeline with 4 active threads. Each thread runs computation (C), issues a memory access (M), and then executes instructions dependent on memory (D). The core switches from T0 to T1, T2, and T3 while each memory access is outstanding, so the memory latency is hidden and there are no stalls.]

  • Tolerating memory latency is critical in CPUs

    • Many techniques have been proposed

      • Caches, prefetching, multi-threading, etc.

  • Is this critical in GPGPUs as well?

    • GPGPUs already employ multi-threading to tolerate it



Memory Problems in GPGPUs

[Diagram: the same timeline with only 2 active threads. After switching between T0 and T1 there is no other thread left to run, so the remaining memory latency shows up as stall cycles.]

  • What if there are not enough threads in GPGPUs?

  • Limited thread-level parallelism

    • Application behavior

      • Algorithmic lack of parallelism

    • Limited by resource constraints

      • # registers per thread, # threads per block, shared memory usage per block (a rough calculation follows below)
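To make the resource constraint concrete, here is a small back-of-the-envelope sketch. All of the numbers (registers, shared memory, and the hardware thread limit per core) are illustrative assumptions, not values taken from the talk:

#include <algorithm>
#include <cstdio>

int main() {
  // Assumed per-core resources (illustrative only).
  const int regs_per_core  = 8 * 1024;    // register file entries per core
  const int smem_per_core  = 16 * 1024;   // bytes of shared memory per core
  const int max_threads_hw = 768;         // hardware thread limit per core

  // Assumed per-thread requirements of some kernel.
  const int regs_per_thread = 32;
  const int smem_per_thread = 64;         // bytes

  int by_regs = regs_per_core / regs_per_thread;   // 256 threads allowed by registers
  int by_smem = smem_per_core / smem_per_thread;   // 256 threads allowed by shared memory
  int active  = std::min({by_regs, by_smem, max_threads_hw});

  // Only 256 threads can be active, far below the hardware limit:
  // not enough thread-level parallelism to hide long memory latencies.
  printf("active threads per core = %d\n", active);
  return 0;
}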



Prefetching in GPGPUs

  • Problem: when multi-threading is not enough, how can we hide memory latency?

    • Other solutions

      • Caching (NVIDIA Fermi)

      • Prefetching (in this talk)

  • Many prefetching mechanisms have been proposed for CPUs

    • stride, stream, Markov, CDP, GHB, helper thread, etc.

  • Question: will the existing mechanisms work in GPGPUs?


Characteristic #1. Many Threads

[Diagram: a CPU prefetcher is trained by the access stream of one or two threads, whereas a GPU prefetcher sees the interleaved accesses of many threads at once.]

  • Problem #1. Training of the prefetcher (scalability)

    • Accesses from many threads are interleaved (see the sketch below)

  • Problem #2. Amplified negative effects (SIMT)

    • One useless prefetch request per thread → many useless prefetches
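A minimal sketch of the training problem, using a toy per-PC last-address stride detector (the detector and the access stream are illustrative, not the design evaluated in the talk):

#include <cstdio>

// Toy per-PC stride detector: one entry that remembers the last address and stride.
// A single thread accessing A, A+4, A+8, ... trains a stable stride of 4.
// When accesses from many threads to the same PC are interleaved, the observed
// "strides" jump around and the detector never locks onto the per-thread pattern.
struct StrideEntry { long last_addr = -1; long stride = 0; };

int main() {
  StrideEntry e;                                    // single entry for one PC
  long tids[] = {0, 17, 3, 42, 0, 17, 3, 42};       // interleaved thread ids
  for (int i = 0; i < 8; ++i) {
    long addr = 1000 + 4 * tids[i] + 4 * (i / 4);   // each thread's own stride is 4
    if (e.last_addr >= 0) {
      long observed = addr - e.last_addr;           // 68, -56, 156, -164, ... never settles
      printf("observed stride = %ld\n", observed);
      e.stride = observed;
    }
    e.last_addr = addr;
  }
  return 0;
}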


Characteristic #1. Many Threads (cont'd)

[Diagram: with prefetch degree 1, the outstanding prefetches from all threads still fit in the cache; with prefetch degree 2, the number of prefetched blocks far exceeds the cache and causes capacity misses.]

  • Problem #3. Single-Configuration Many-Threads (SCMT)

    • One prefetcher configuration controls too many threads at once

      Prefetch degree 1: prefetched data fit in the cache

      Prefetch degree 2: prefetched data far exceed the cache size (a rough footprint calculation follows below)
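A rough footprint calculation for the SCMT problem. The active thread count and the cache-line size are assumptions; the 16 KB cache matches the per-core prefetch cache used later in the evaluation:

#include <cstdio>

int main() {
  const int threads_per_core = 160;        // assumed number of active threads issuing prefetches
  const int line_size        = 64;         // assumed bytes per cache line
  const int cache_size       = 16 * 1024;  // 16 KB prefetch cache per core

  // A single prefetch-degree setting applies to every thread at once,
  // so the total footprint grows with (threads x degree x line size).
  for (int degree = 1; degree <= 2; ++degree) {
    int footprint = threads_per_core * degree * line_size;
    printf("degree %d: footprint = %d KB (%s the %d KB cache)\n",
           degree, footprint / 1024,
           footprint <= cache_size ? "fits in" : "exceeds", cache_size / 1024);
  }
  return 0;
}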


Characteristic #2. Data-Level Parallelism

[Diagram: a thread in a sequential program lives long enough that a prefetch issued one memory latency ahead of the demand is useful; a thread in a parallel program offers little opportunity because it either terminates before the data are needed or issues the prefetch too close to the demand.]

  • Problem #4. Short thread lifetime

    • Threads in parallel programs are much shorter than threads in sequential programs because the work is split across many threads


Goal

  • Design hardware/software prefetching mechanisms for GPGPU applications

    • Step 1. Prefetcher for Many-thread Architecture

      • Many-Thread Aware Prefetching Mechanisms (Problems #1 and #4)

    • Step 2. Feedback mechanism to reduce negative effects

      • Prefetch Throttling (Problems #2 and #3)




Many-Thread Aware Hardware Prefetcher

[Diagram: structure of the many-thread aware hardware prefetcher. Each memory request (PC, ADDR, TID) trains a stride prefetcher and an inter-thread (IP) prefetcher; a promotion table holds promoted strides; decision logic selects which component generates the prefetch address.]

  • (Conventional) Stride prefetcher

  • Promotion table for stride prefetcher (Problem #1)

  • Inter-Thread prefetcher (Problem #4)

  • Decision logic



Solving the Scalability Problem

[Diagram: redundant per-thread entries that share the same stride are promoted into a single entry in the promotion table.]

  • Problem #1. Training of the prefetcher (scalability)

  • Stride Promotion

    • Similar (or even identical) access patterns across threads

    • Without promotion, the table is occupied by redundant entries

      • With promotion, we can manage the table storage effectively

    • Reduces training time by reusing earlier threads' information (sketch below)
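A minimal sketch of the promotion idea, assuming a per-(PC, TID) training table and a shared promotion table. The field names, the agreement threshold, and the exact promotion rule are illustrative assumptions rather than the paper's precise design:

#include <cstdio>
#include <map>
#include <utility>

// Per-(PC, TID) training entries. When enough threads learn the same stride for a PC,
// that stride is promoted into a single shared entry: redundant per-thread entries are
// no longer needed, and later threads prefetch immediately without retraining.
struct PerThreadEntry { long last_addr; long stride; int confirmed; };

std::map<std::pair<long,int>, PerThreadEntry> per_thread;  // keyed by (PC, TID)
std::map<long, long> promotion_table;                      // PC -> promoted stride
const int kPromoteThreshold = 2;                           // assumed: threads that must agree

void train(long pc, int tid, long addr) {
  if (promotion_table.count(pc)) {
    printf("tid %d: prefetch %ld (promoted stride)\n", tid, addr + promotion_table[pc]);
    return;                                                // no per-thread training needed
  }
  auto key = std::make_pair(pc, tid);
  auto it = per_thread.find(key);
  if (it == per_thread.end()) { per_thread[key] = {addr, 0, 0}; return; }
  long stride = addr - it->second.last_addr;
  if (stride == it->second.stride) it->second.confirmed = 1;
  it->second.stride = stride;
  it->second.last_addr = addr;
  int agree = 0;                                           // threads agreeing on this stride
  for (auto &kv : per_thread)
    if (kv.first.first == pc && kv.second.confirmed && kv.second.stride == stride) ++agree;
  if (agree >= kPromoteThreshold) promotion_table[pc] = stride;
}

int main() {
  // Threads 0 and 1 independently learn stride 8 at PC 0x1a, which triggers promotion;
  // thread 7 then benefits immediately without any training of its own.
  train(0x1a, 0, 100); train(0x1a, 0, 108); train(0x1a, 0, 116);
  train(0x1a, 1, 200); train(0x1a, 1, 208); train(0x1a, 1, 216);
  train(0x1a, 7, 900);
  return 0;
}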


Solving the Short Thread Lifetime Problem

[Diagram: a prefetch is useful only if it is issued roughly one memory latency before the demand access.]

for (ii = 0; ii < 100; ++ii) {
    prefetch(A[ii+1]);
    prefetch(B[ii+1]);
    C[ii] = A[ii] + B[ii];
}

// there are 100 threads
__global__ void KernelFunction(…) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int varA = aa[tid];
    int varB = bb[tid];
    C[tid] = varA + varB;
}

The sequential version has a loop, so each iteration can prefetch data for the next one; the CUDA kernel has no loop, only 2 memory accesses and 1 computation per thread.

Problem #4 (Short thread lifetime)

Highly parallelized code often eliminates prefetching opportunities


Inter-Thread Prefetching


// there are 100 threads
__global__ void KernelFunction(…) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int next_tid = tid + 32;
    prefetch(aa[next_tid]);
    prefetch(bb[next_tid]);
    int varA = aa[tid];
    int varB = bb[tid];
    C[tid] = varA + varB;
}

[Diagram: threads T0, T1, T2 prefetch the data that threads T32, T33, ... will access; those threads in turn prefetch for T64 and beyond, so each thread covers the memory accesses of a thread 32 thread ids ahead.]

  • Instead, we can prefetch for other threads

    • Inter-Thread Prefetching (IP)

    • In CUDA, memory addresses are typically a function of the thread id [SIMT]



IP Pattern Detection in Hardware

[Worked example: four requests from the same PC (0x1a)]

Req 1: Addr 200, TID 1
Req 2: Addr 1100, TID 10
Req 3: Addr 400, TID 3
Req 4: Addr 2100, TID 1 (arrives after the entry is trained)

Detecting strides across threads: divide the address delta (Addr ∆) by the thread-id delta (TID ∆).

Delta (Req2 - Req1) = 900 / 9 = 100
Delta (Req3 - Req1) = 200 / 2 = 100
Delta (Req3 - Req2) = -700 / -7 = 100

All three deltas are the same, so we found a pattern: a per-thread stride of 100.

When Req 4 (Addr 2100, Stride 100) hits the trained entry, a prefetch request for addr + stride = 2200 is launched.

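A small sketch of the cross-thread delta check from the example above. The structure and the three-pair check are a simplified reading of the slide, not the exact hardware implementation:

#include <cstdio>

// One training entry per PC: keep the last few (addr, tid) samples and check whether
// the address delta divided by the thread-id delta is the same for every pair.
struct Req { long addr; int tid; };

// Returns the per-thread stride if all three pairwise deltas agree, 0 otherwise.
long detect_ip_stride(const Req r[3]) {
  long d01 = (r[1].addr - r[0].addr) / (r[1].tid - r[0].tid);
  long d02 = (r[2].addr - r[0].addr) / (r[2].tid - r[0].tid);
  long d12 = (r[2].addr - r[1].addr) / (r[2].tid - r[1].tid);
  return (d01 == d02 && d01 == d12) ? d01 : 0;
}

int main() {
  // The three training requests from the example: (200, tid 1), (1100, tid 10), (400, tid 3).
  Req reqs[3] = { {200, 1}, {1100, 10}, {400, 3} };
  long stride = detect_ip_stride(reqs);                  // 100 per thread id
  if (stride != 0) {
    // The trained entry now launches a prefetch of addr + stride for the next request
    // at this PC, e.g. Req 4 (Addr 2100) triggers a prefetch of 2200.
    printf("pattern found: stride = %ld, prefetch %ld\n", stride, 2100 + stride);
  }
  return 0;
}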


MT-Aware Hardware Prefetcher

[Diagram: the MT-aware hardware prefetcher pipelined across three cycles; a request (PC, ADDR, TID) trains the stride prefetcher, the promotion table and the inter-thread (IP) prefetcher are consulted, and the decision logic emits the prefetch address.]


Goal

  • Design a hardware/software prefetcher for GPGPU applications

    • Step 1. Prefetcher for Many-thread Architecture

      • Many-Thread Aware Prefetching Mechanisms

    • Step 2. Feedback mechanism to reduce negative effects

      • Prefetch Throttling


Outline

Motivation

Step 1. Many-Thread Aware Prefetching

Step 2. Prefetch Throttling

Evaluation

Conclusion


Prefetch Throttling

  • Recall problems in GPGPU prefetching

    • Problem #2. Amplifying negative effects

    • Problem #3. Single-Configuration Many-Thread

  • To identify whether prefetching is effective, we use two kinds of metrics

    • Metrics

      • Usefulness – accurate and timely prefetches

      • Harmfulness – inaccurate or too-early prefetches

        • Some late prefetches can be tolerated

      • Similar to Srinath et al. [HPCA 2007]


Throttling Metrics

  • Merged memory requests

    • A new request with the same address as an existing entry

      • Inside a core (in the MSHRs)

    • Correspond to late prefetches in CPUs

    • Indicate accuracy (because of massive multi-threading)

      • Less correlated with timeliness

  • Early block evictions from the prefetch cache

    • Caused by capacity misses, regardless of accuracy

    • Indicate both timeliness and accuracy

  • Periodic updates

    • To cope with runtime behavior (a bookkeeping sketch follows below)
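A bookkeeping sketch for these metrics. The counter names, the ratio definitions, and the idea of resetting each period are assumptions made for illustration:

#include <cstdio>

struct PrefFeedback {
  long prefetches_sent = 0;
  long merged_requests = 0;   // demand request merged with an in-flight prefetch (in the MSHRs)
  long early_evictions = 0;   // prefetched block evicted from the prefetch cache before first use
};

// Called once per update interval: turn raw counts into ratios for the throttling logic,
// then reset the counters so each period reflects current runtime behavior.
void periodic_update(PrefFeedback &fb, double &merge_ratio, double &early_evict_ratio) {
  if (fb.prefetches_sent > 0) {
    merge_ratio       = (double)fb.merged_requests / fb.prefetches_sent;
    early_evict_ratio = (double)fb.early_evictions / fb.prefetches_sent;
  } else {
    merge_ratio = early_evict_ratio = 0.0;
  }
  fb = PrefFeedback{};
}

int main() {
  PrefFeedback fb;
  fb.prefetches_sent = 100; fb.merged_requests = 8; fb.early_evictions = 55;
  double merge_ratio = 0.0, early_evict_ratio = 0.0;
  periodic_update(fb, merge_ratio, early_evict_ratio);
  printf("merge ratio = %.2f, early-eviction ratio = %.2f\n", merge_ratio, early_evict_ratio);
  return 0;
}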


Heuristic for Prefetch Throttling

* An ideal case (accurate prefetches with perfect timing) has both a low early-eviction ratio and a low merge ratio.

  • Throttle degree

    • Varies from 0 (prefetch everything) to 5 (no prefetching)

    • Default: 2 (a sketch of a possible update rule follows below)
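A sketch of how the two ratios could drive the throttle degree. The thresholds and the raise/lower rule below are assumptions for illustration; the paper's actual heuristic table differs in its details:

#include <cstdio>

// Throttle degree 0 = prefetch everything, 5 = no prefetching; the default is 2.
// Many early evictions indicate inaccurate or too-early prefetches, so the degree is raised.
// Low early eviction together with a low merge ratio is the ideal case (accurate and timely),
// so the degree is lowered. A high merge ratio with low eviction means accurate-but-late
// prefetches, which the talk treats as tolerable, so the degree is left unchanged.
int update_throttle_degree(int degree, double merge_ratio, double early_evict_ratio) {
  const double kHighEvict = 0.50;   // assumed threshold
  const double kLowEvict  = 0.10;   // assumed threshold
  const double kLowMerge  = 0.10;   // assumed threshold
  if (early_evict_ratio > kHighEvict) {
    if (degree < 5) ++degree;       // prefetching is hurting: throttle harder
  } else if (early_evict_ratio < kLowEvict && merge_ratio < kLowMerge) {
    if (degree > 0) --degree;       // ideal case: throttle less
  }
  return degree;
}

int main() {
  int degree = 2;                                          // default degree
  degree = update_throttle_degree(degree, 0.05, 0.70);     // harmful period -> degree 3
  degree = update_throttle_degree(degree, 0.05, 0.02);     // ideal period   -> degree 2
  printf("throttle degree = %d\n", degree);
  return 0;
}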


Evaluation Methodology

  • MacSim simulator

    • A cycle-accurate, in-house simulator

    • Trace-driven (traces generated with GPUOcelot [Diamos])

  • Baseline

    • 14 cores (8-wide SIMD) at 900 MHz, 16 banks / 8 channels, 1.2 GHz memory frequency, 900 MHz bus, FR-FCFS scheduling

    • NVIDIA G80 architecture

  • 14 memory-intensive benchmarks

    • From CUDA SDK, Merge, Rodinia, and Parboil

    • Types

      • Stride, MP (massively parallel), uncoalesced

    • Non-memory-intensive benchmarks (in the paper)


Evaluation Methodology – cont'd

  • Prefetch configuration

    • Stream, Stride, and GHB prefetchers evaluated

    • 16 KB prefetch cache per core (results for other sizes are in the paper)

    • Prefetch distance: 1, degree: 1 (the optimal configuration)

  • Results

    • Hardware prefetcher

    • Software prefetcher (in the paper)


MT Hardware Prefetcher Results

[Results chart: 15% over Stride]

GHB and Stride prefetchers do not work on the MP-type and uncoalesced-type benchmarks

IP (Inter-Thread Prefetching) can be effective

Stride Promotion improves the performance of a few benchmarks


MT-HWP with Throttling Results

[Results chart: 15% over Stride + Throttling]

  • Feedback-driven hardware prefetchers can be effective

  • Throttling eliminates the negative effects (e.g., of the stream prefetcher)

    * There are more negative cases with the software prefetching mechanisms


Outline

Motivation

Step 1. Many-Thread Aware Prefetching

Step 2. Prefetch Throttling

Evaluation

Conclusion


Conclusion

  • Memory latency is an important problem in GPGPUs as well.

  • GPGPU prefetching has four problems:

    • scalability, amplified negative effects, SCMT, and short thread lifetime

  • Goal: design hardware/software prefetchers

    • Step 1. Many-thread aware prefetcher (promotion, IP)

    • Step 2. Prefetch throttling

  • The MT-aware hardware prefetcher shows a 15% performance improvement, and prefetch throttling removes all the negative effects.

  • Future work

    • Study other many-thread architectures.

      • Other programming models, architectures with caches




NVIDIA Fermi Result


Different Prefetch Cache Size


Software MT Prefetcher Results



