


A Compile-Time Managed Multi-Level Register File Hierarchy

Mark Gebhart, The University of Texas at Austin

Stephen W. Keckler, NVIDIA / The University of Texas at Austin

William J. Dally, NVIDIA / Stanford University

MICRO-44


Motivation

  • All systems are effectively power limited

    • Energy efficiency is a primary design constraint

  • Throughput processors

    • Massive multithreading to tolerate memory latency

    • Large register file consumes significant energy

  • Register file hierarchy localizes most accesses to small structures close to ALUs

    • Compiler controls all data movement through hierarchy


Register Reuse Patterns

  • 70% of values are read only once

  • 50% of values are read only once, within 3 instructions of being written


Outline

  • Motivation

  • Background

    • Baseline GPU

    • Two-Level Warp Scheduler

    • Register File Caching

  • SW Managed Register File Hierarchy

  • Results

  • Conclusions


Baseline GPU Architecture

  • Similar to NVIDIA’s Fermi design

  • 16 streaming multiprocessors (SMs) per chip

  • Memory interface designed to maximize bandwidth rather than minimize latency


Baseline SM

  • Register file heavily banked for high bandwidth

  • 32 SIMT lanes

  • 1024 threads per SM

    • 32 warps

    • 32 threads per warp


Prior Work

  • Two Level Warp Scheduler

    • Only active warps issue instructions

    • Active warps descheduled on possible long latency events

  • Register file cache (RFC)

    • Hardware managed

    • 21 times smaller than MRF

    • Only active warps access RFC

    • RFC flushed when active warp descheduled


Limitations of HW Managed Cache

  • No knowledge of register reuse patterns

  • Writebacks from RFC to MRF

  • Must track RFC tags

  • Can’t privatize RFC to ALUs


Outline

  • Motivation

  • Background

  • SW Managed Register File Hierarchy

    • Microarchitecture

    • Compiler Algorithms

  • Results

  • Conclusions


Microarchitecture

  • Main Register File (MRF)

    • 128KB for 1024 threads

  • Operand Register File (ORF)

    • 3 entries * 256 active threads

  • Last Result File (LRF)

    • 1 entry * 256 active threads

    • Private to ALUs
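The capacities of the three levels can be worked out from the figures above. The following is a minimal sketch that assumes 32-bit (4-byte) register entries, which the slides do not state; the constant names are illustrative only.

```python
# Assumption (not stated on the slide): each register entry is 32 bits wide.
BYTES_PER_ENTRY = 4

MRF_BYTES = 128 * 1024     # main register file, from the slide
THREADS_PER_SM = 1024      # from the baseline SM slide
THREADS_PER_WARP = 32      # from the baseline SM slide
ACTIVE_THREADS = 256       # threads that can access the ORF and LRF
ORF_ENTRIES_PER_THREAD = 3
LRF_ENTRIES_PER_THREAD = 1

# MRF registers available to each thread under the 4-byte assumption.
mrf_regs_per_thread = MRF_BYTES // THREADS_PER_SM // BYTES_PER_ENTRY

# Total storage in the two small structures.
orf_bytes = ORF_ENTRIES_PER_THREAD * ACTIVE_THREADS * BYTES_PER_ENTRY
lrf_bytes = LRF_ENTRIES_PER_THREAD * ACTIVE_THREADS * BYTES_PER_ENTRY

# Number of warps that are active at once.
active_warps = ACTIVE_THREADS // THREADS_PER_WARP

print(mrf_regs_per_thread, orf_bytes, lrf_bytes, active_warps)
```

Under that assumption the ORF and LRF together hold 4 KB, a small fraction of the 128 KB MRF, which is why localizing accesses to them can save energy.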


Allocation for Hierarchical Register File

  • Split program into allocation units called strands

    • A strand is a sequence of instructions with no long-latency dependencies

  • All inter-strand values communicated through MRF

  • To simplify allocation:

    • A backwards branch ends a strand

    • A basic block targeted by a backwards branch begins a strand

  • Compiler marks end of strands

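[Figure: example control-flow graph with basic blocks BB1, BB2, and BB3 divided into Strands 1 through 4. BB1 contains ld.global R1 followed by a read of R1; BB2 contains two add instructions; an end-of-strand marker separates strands.]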
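The strand rules listed above can be sketched for straight-line code as follows. This is only an illustration under stated assumptions: instructions are modeled as hypothetical (opcode, destination, sources) tuples, ld.global is assumed to be the only long-latency operation, and "bra.back" is a made-up opcode standing in for a backward branch. The rule that a block targeted by a backward branch begins a strand is not modeled here.

```python
# Assumption: only global loads are treated as long-latency operations.
LONG_LATENCY_OPS = {"ld.global"}
BACKWARD_BRANCH = "bra.back"  # hypothetical opcode name for illustration

def split_into_strands(instrs):
    """Split a list of (op, dest, srcs) tuples into strands."""
    strands, current, pending = [], [], set()
    for op, dest, srcs in instrs:
        # A read of a pending long-latency result starts a new strand.
        if current and pending & set(srcs):
            strands.append(current)
            current, pending = [], set()
        current.append((op, dest, srcs))
        if op in LONG_LATENCY_OPS and dest is not None:
            pending.add(dest)
        else:
            pending.discard(dest)  # overwritten by a short-latency op
        # A backward branch ends the current strand.
        if op == BACKWARD_BRANCH:
            strands.append(current)
            current, pending = [], set()
    if current:
        strands.append(current)
    return strands

program = [
    ("ld.global", "R1", ["R0"]),
    ("add", "R2", ["R1", "R1"]),   # reads the load result: new strand
    ("add", "R3", ["R2", "R2"]),
    ("bra.back", None, []),        # ends the strand
    ("add", "R4", ["R3", "R3"]),
]
strands = split_into_strands(program)
print([len(s) for s in strands])
```

In this example the load sits alone in the first strand, its consumer opens the second, and the instruction after the backward branch falls into a third.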


Allocation Algorithm

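  • Greedy per strand

  • Metric: energy savings/lifetime

[Figure: placement of R3, R4, R5, R6, and R7 across the LRF, ORF, and MRF columns for the instruction sequence below; R6 is listed twice.]

add R3, R1, R2
sub R4, R1, R3
mul R6, R3, R3
ld.global R5, R4
add R7, R6, R6
div R8, R5, R6
add R9, R7, R8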
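A greedy pass ranked by energy savings per unit of lifetime could look like the sketch below. It is not the authors' implementation: it models only the ORF, ignores the LRF, and uses made-up per-access energy values. The example values are taken from the first five instructions on the slide, and treating those five as a single strand is an assumption.

```python
from dataclasses import dataclass

# Hypothetical per-access energies in arbitrary units (not from the slides).
E_MRF = 1.0
E_ORF = 0.2

@dataclass
class Value:
    name: str
    write: int    # index of the defining instruction within the strand
    reads: list   # indices of the instructions in the strand that read it

def priority(v):
    """Energy saved by using the ORF, divided by the value's lifetime."""
    accesses = 1 + len(v.reads)            # one write plus the reads
    savings = accesses * (E_MRF - E_ORF)
    lifetime = max(v.reads) - v.write      # instructions the entry is held
    return savings / lifetime

def allocate_orf(values, orf_entries=3):
    """Greedily give ORF entries to the highest-priority values that fit."""
    occupancy = {}                         # instruction index -> entries in use
    chosen = []
    for v in sorted(values, key=priority, reverse=True):
        span = range(v.write, max(v.reads))  # held from write until last read
        if all(occupancy.get(i, 0) < orf_entries for i in span):
            for i in span:
                occupancy[i] = occupancy.get(i, 0) + 1
            chosen.append(v.name)
    return chosen

example_values = [
    Value("R3", 0, [1, 2]),   # written by add, read by sub and mul
    Value("R4", 1, [3]),      # written by sub, read by ld.global
    Value("R6", 2, [4]),      # written by mul, read by add
]
print(allocate_orf(example_values, orf_entries=1))
print(allocate_orf(example_values, orf_entries=3))
```

With a single ORF entry, R3 wins because it has the most accesses for its lifetime, R4 is rejected because it overlaps R3, and R6 fits once R3 is dead. With three entries all three values fit.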


Optimizations

  • Partial range allocation

  • Read operand allocation

  • Split LRF


Optimization #1

  • Partial range allocation

[Figure: a strand in which R1 is written to both the ORF and the MRF. Some reads of R1 come from the ORF; another read of R1 comes from the MRF.]
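One way to picture partial range allocation is as a split of a value's reads into those an ORF entry can cover and those that must fall back to the MRF. The sketch below assumes a hypothetical input, orf_free_until, giving the last instruction index at which an ORF entry is available; how the compiler actually picks that point is not described on the slide. Values that are live out of the strand would also need an MRF write, which is not modeled.

```python
def partial_range(reads, orf_free_until):
    """Classify each read of a value as served by the ORF or by the MRF.

    reads: instruction indices within the strand that read the value.
    orf_free_until: last index at which an ORF entry is still available
        for this value (hypothetical input for illustration).
    """
    orf_reads = [r for r in reads if r <= orf_free_until]
    mrf_reads = [r for r in reads if r > orf_free_until]
    # The value is written to every level that will later serve a read.
    write_to = (["ORF"] if orf_reads else []) + (["MRF"] if mrf_reads else [])
    return {"write_to": write_to, "orf_reads": orf_reads, "mrf_reads": mrf_reads}

plan = partial_range(reads=[2, 3, 9], orf_free_until=5)
print(plan)
```

Here the value is written to both levels, the reads at indices 2 and 3 use the ORF, and the read at index 9 uses the MRF.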


Optimization #2

  • Read operand allocation

[Figure: a strand in which R0 is read from the MRF and written to the ORF; subsequent reads of R0 come from the ORF.]
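Whether read operand allocation pays off depends on how many times the value is read in the strand, since the extra ORF write has to be recovered by cheaper later reads. The sketch below illustrates that trade-off with made-up energy numbers; none of the constants come from the slides.

```python
# Hypothetical per-access energies in arbitrary units (not from the slides).
E_MRF_READ = 1.0
E_ORF_READ = 0.2
E_ORF_WRITE = 0.3

def read_operand_savings(num_reads):
    """Energy saved by copying a value into the ORF on its first read."""
    all_from_mrf = num_reads * E_MRF_READ
    # First read from the MRF, one ORF write, remaining reads from the ORF.
    cached = E_MRF_READ + E_ORF_WRITE + (num_reads - 1) * E_ORF_READ
    return all_from_mrf - cached

def should_allocate(num_reads):
    """Allocate only when the later ORF reads recover the ORF write cost."""
    return read_operand_savings(num_reads) > 0

for n in (1, 2, 3):
    print(n, should_allocate(n))
```

With these numbers a value read once is not worth copying, while a value read two or more times is.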


Outline

  • Motivation

  • Background

  • SW Managed Register File Hierarchy

  • Results

    • Register Access Breakdown

    • Energy Savings

  • Conclusions


Breakdown of Register Accesses

  • LRF is able to handle 30% of traffic


Energy Evaluation

  • Max energy savings of 54% with 3 level SW control and 3 ORF entries per thread

    • 44% improvement over prior hardware controlled RFC

[Figure: energy evaluation plot. Series: 2 Level HW, 3 Level HW, 2 Level SW, 3 Level SW. X-axis: Number of ORF/RFC Entries per Thread.]


Individual Benchmark Results

  • 3 level SW design with 3 ORF entries per active thread


Conclusion

  • 3-level SW controlled design reduces register file energy by 54%

    • 8.3% savings in SM dynamic energy

    • 5.8% savings in chip-wide dynamic energy

  • Limit study highlights potential for future work to improve results

    • Instruction scheduling concurrently with allocation

  • Throughput processors have different critical structures

    • Must redesign as we enter power limited world
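The three savings figures in the conclusion can be related with a short calculation. This is a rough sketch, assuming the SM and chip-wide savings come entirely from the register file reduction with no offsetting overhead elsewhere; the slides do not state the register file's share of energy directly.

```python
# Figures from the conclusion slide.
rf_savings = 0.54     # reduction in register file energy
sm_savings = 0.083    # reduction in SM dynamic energy
chip_savings = 0.058  # reduction in chip-wide dynamic energy

# Implied share of dynamic energy spent in the register file,
# under the assumption that nothing else changes.
rf_share_of_sm = sm_savings / rf_savings
rf_share_of_chip = chip_savings / rf_savings

print(f"{rf_share_of_sm:.1%}")    # 15.4%
print(f"{rf_share_of_chip:.1%}")  # 10.7%
```

Under that assumption the register file would account for roughly 15% of SM dynamic energy and about 11% of chip-wide dynamic energy in the baseline.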
