A Compile-Time Managed Multi-Level Register File Hierarchy




Mark Gebhart, The University of Texas at Austin
Stephen W. Keckler, NVIDIA / The University of Texas at Austin
William J. Dally, NVIDIA / Stanford University

MICRO-44

Motivation
  • All systems are effectively power limited
    • Energy efficiency is a primary design constraint
  • Throughput processors
    • Massive multithreading to tolerate memory latency
    • Large register file consumes significant energy
  • Register file hierarchy localizes most accesses to small structures close to ALUs
    • Compiler controls all data movement through hierarchy

Register Reuse Patterns
  • 70% of values only read once
  • 50% of values only read once, within 3 instructions of being written
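Statistics like these two can be gathered from an instruction trace by counting how often each produced value is read. The following is a minimal sketch, not the methodology used in the presentation: the function name `reuse_stats`, the `(dest, srcs)` trace format, and the reading of "within 3 instructions" as a distance of at most 3 are assumptions, and the toy trace is illustrative rather than measured data.

```python
def reuse_stats(trace, window=3):
    """Return (fraction of values read once,
               fraction read once within `window` instructions of the write).

    trace: list of (dest, srcs) tuples in program order (hypothetical format).
    Each operand read counts separately, so `mul R6, R3, R3` reads R3 twice.
    """
    live = {}     # register -> [definition index, list of read indices]
    values = []   # completed (definition index, read indices) records
    for i, (dest, srcs) in enumerate(trace):
        for s in srcs:
            if s in live:
                live[s][1].append(i)
        if dest is not None:
            if dest in live:              # old value is overwritten: retire it
                values.append(tuple(live[dest]))
            live[dest] = [i, []]
    values.extend(tuple(v) for v in live.values())
    if not values:
        return 0.0, 0.0
    once = [v for v in values if len(v[1]) == 1]
    near = [v for v in once if v[1][0] - v[0] <= window]
    return len(once) / len(values), len(near) / len(values)


# Toy trace (illustrative only).
toy_trace = [
    ("R3", ("R1", "R2")),
    ("R4", ("R1", "R3")),
    ("R6", ("R3", "R3")),
    ("R5", ("R4",)),
    ("R7", ("R6", "R6")),
    ("R8", ("R5", "R6")),
    ("R9", ("R7", "R8")),
]
read_once, read_once_near = reuse_stats(toy_trace)
print(read_once, read_once_near)
```

In the toy trace, R4, R5, R7, and R8 are each read exactly once and shortly after being written, which is the kind of value a small structure near the ALUs can serve.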

Outline
  • Motivation
  • Background
    • Baseline GPU
    • Two-Level Warp Scheduler
    • Register File Caching
  • SW Managed Register File Hierarchy
  • Results
  • Conclusions

Baseline GPU Architecture
  • Similar to NVIDIA’s Fermi design
  • 16 streaming multiprocessors (SMs) per chip
  • Memory interface designed to maximize bandwidth rather than minimize latency

Baseline SM
  • Register file heavily banked for high bandwidth
  • 32 SIMT lanes
  • 1024 threads per SM
    • 32 warps
    • 32 threads per warp

Prior Work
  • Two Level Warp Scheduler
    • Only active warps issue instructions
    • Active warps descheduled on possible long latency events
  • Register file cache (RFC)
    • Hardware managed
    • 21 times smaller than MRF
    • Only active warps access RFC
    • RFC flushed when active warp descheduled

Limitations of HW Managed Cache
  • No knowledge of register reuse patterns
  • Writebacks from RFC to MRF
  • Must track RFC tags
  • Can’t privatize RFC to ALUs

Outline
  • Motivation
  • Background
  • SW Managed Register File Hierarchy
    • Microarchitecture
    • Compiler Algorithms
  • Results
  • Conclusions

Microarchitecture
  • Main Register File (MRF)
    • 128KB for 1024 threads
  • Operand Register File (ORF)
    • 3 entries per thread × 256 active threads
  • Last Result File (LRF)
    • 1 entry per thread × 256 active threads
    • Private to ALUs
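The capacities on this slide can be turned into per-thread and total sizes with a little arithmetic. This is a minimal sketch assuming 4-byte (32-bit) register entries, which the slides do not state; the constant names are illustrative.

```python
# Sizing arithmetic for the three-level hierarchy.
# ASSUMPTION: each register entry is 4 bytes (32 bits); the slides do not say.
BYTES_PER_ENTRY = 4

MRF_BYTES = 128 * 1024        # 128KB main register file (from the slide)
THREADS_PER_SM = 1024         # from the slide
ACTIVE_THREADS = 256          # from the slide
THREADS_PER_WARP = 32         # from the Baseline SM slide
ORF_ENTRIES_PER_THREAD = 3    # from the slide
LRF_ENTRIES_PER_THREAD = 1    # from the slide

mrf_entries_per_thread = MRF_BYTES // (THREADS_PER_SM * BYTES_PER_ENTRY)
active_warps = ACTIVE_THREADS // THREADS_PER_WARP
orf_bytes = ORF_ENTRIES_PER_THREAD * ACTIVE_THREADS * BYTES_PER_ENTRY
lrf_bytes = LRF_ENTRIES_PER_THREAD * ACTIVE_THREADS * BYTES_PER_ENTRY

print(mrf_entries_per_thread)            # 32 MRF entries per thread
print(active_warps)                      # 8 active warps
print(orf_bytes, lrf_bytes)              # 3072 1024 (bytes)
print(round(MRF_BYTES / orf_bytes, 1))   # 42.7 (MRF is about 43x larger than the ORF)
```

Under that assumption, the ORF and LRF together hold 4KB, a small fraction of the 128KB MRF.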

Allocation for Hierarchical Register File
  • Split program into allocation units called strands
    • A strand is a sequence of instructions with no long-latency dependencies
  • All inter-strand values communicated through MRF
  • To simplify allocation:
    • A backwards branch ends a strand
    • A basic block targeted by a backwards branch begins a strand
  • Compiler marks end of strands

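[Figure: example control-flow graph with basic blocks BB1, BB2, and BB3 partitioned into Strands 1 to 4. Labels: `ld.global R1` followed by `read R1` in BB1, an end-of-strand marker, and two `add` instructions in BB2.]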
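The strand rules on this slide can be sketched as a single pass over a linear instruction list. This is a simplified illustration, not the compiler pass from the presentation: the `Instr` fields, the function name `split_into_strands`, and the use of a flat list instead of a control-flow graph are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    op: str
    dest: Optional[str] = None
    srcs: Tuple[str, ...] = ()
    long_latency: bool = False      # e.g. a global load
    backward_branch: bool = False   # a backward branch ends a strand
    backward_target: bool = False   # first instruction of a block targeted by a backward branch


def split_into_strands(instrs: List[Instr]) -> List[List[Instr]]:
    strands: List[List[Instr]] = []
    current: List[Instr] = []
    pending = set()  # registers written by long-latency ops in the current strand

    def close() -> None:
        nonlocal current
        if current:
            strands.append(current)
        current = []
        pending.clear()

    for ins in instrs:
        # A consumer of a pending long-latency result, or a backward-branch
        # target, must start a new strand.
        if ins.backward_target or any(s in pending for s in ins.srcs):
            close()
        current.append(ins)
        if ins.dest is not None:
            if ins.long_latency:
                pending.add(ins.dest)
            else:
                pending.discard(ins.dest)
        if ins.backward_branch:
            close()
    close()
    return strands


program = [
    Instr("ld.global", "R1", ("R0",), long_latency=True),
    Instr("add", "R2", ("R3", "R4")),
    Instr("read", None, ("R1",)),   # consumes the load result: new strand
    Instr("add", "R5", ("R2", "R2")),
    Instr("bra", backward_branch=True),
]
strands = split_into_strands(program)
print([[i.op for i in s] for s in strands])
# [['ld.global', 'add'], ['read', 'add', 'bra']]
```

The independent `add` after the load stays in the first strand; the strand ends only when an instruction actually depends on the long-latency result.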

Allocation Algorithm

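[Figure: table with columns LRF, ORF, and MRF showing where the values R3, R4, R6, R5, and R7 from the example instruction sequence are allocated; R6 appears in two columns.]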

  • Greedy per strand
  • Metric: energy savings/lifetime

add R3, R1, R2
sub R4, R1, R3
mul R6, R3, R3
ld.global R5, R4
add R7, R6, R6
div R8, R5, R6
add R9, R7, R8
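A greedy allocator driven by the "energy savings/lifetime" metric might look like the sketch below. It is a simplified illustration under several assumptions, not the algorithm from the presentation: it models only the ORF (no LRF, no partial ranges), skips values that are live out of the strand, assumes each register is written once per strand, and uses hypothetical per-access energies. It also assumes the strand ends before `div`, because `div` consumes the result of the long-latency `ld.global`.

```python
# HYPOTHETICAL per-access energies in arbitrary units (not from the slides).
MRF_ENERGY = 10.0
ORF_ENERGY = 1.0


def greedy_orf_allocation(strand, live_out, orf_entries):
    """Return {register: ORF entry index} for values placed in the ORF.

    strand: list of (dest, srcs) tuples; each dest is assumed to be written once.
    live_out: registers read after the strand (they stay in the MRF here).
    """
    defs, reads = {}, {}
    for i, (dest, srcs) in enumerate(strand):
        for s in srcs:
            if s in defs:
                reads[s].append(i)
        defs[dest] = i
        reads[dest] = []

    candidates = []
    for reg, d in defs.items():
        if reg in live_out or not reads[reg]:
            continue
        last = max(reads[reg])
        accesses = 1 + len(reads[reg])                 # one write plus all reads
        savings = accesses * (MRF_ENERGY - ORF_ENERGY)
        candidates.append((savings / (last - d), reg, d, last))
    candidates.sort(key=lambda c: c[0], reverse=True)  # best savings per lifetime first

    entries = [[] for _ in range(orf_entries)]         # live intervals held by each entry
    placement = {}
    for _, reg, d, last in candidates:
        for idx, intervals in enumerate(entries):
            # An entry is reusable when the intervals do not overlap; a value
            # may be written in the same instruction that last reads the old one.
            if all(last <= d2 or l2 <= d for d2, l2 in intervals):
                intervals.append((d, last))
                placement[reg] = idx
                break
    return placement


# First five instructions of the example listing, treated as one strand.
strand = [
    ("R3", ("R1", "R2")),
    ("R4", ("R1", "R3")),
    ("R6", ("R3", "R3")),
    ("R5", ("R4",)),
    ("R7", ("R6", "R6")),
]
live_out = {"R5", "R6", "R7"}   # read by the later div and add instructions

print(greedy_orf_allocation(strand, live_out, orf_entries=1))  # {'R3': 0}
print(greedy_orf_allocation(strand, live_out, orf_entries=3))  # {'R3': 0, 'R4': 1}
```

With one entry, R3 wins because it has more accesses over the same lifetime as R4; with three entries, both fit.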

Optimizations
  • Partial range allocation
  • Read operand allocation
  • Split LRF

Optimization #1
  • Partial range allocation

[Figure: strand diagram annotated "R1 written to both ORF and MRF", "Reads of R1 come from ORF", and "Read of R1 comes from MRF".]

Optimization #2
  • Read operand allocation

[Figure: strand diagram annotated "R0 is read from MRF and written to ORF" and "Reads of R0 come from ORF".]

Outline
  • Motivation
  • Background
  • SW Managed Register File Hierarchy
  • Results
    • Register Access Breakdown
    • Energy Savings
  • Conclusions

Breakdown of Register Accesses
  • LRF is able to handle 30% of traffic

Energy Evaluation
  • Max energy savings of 54% with 3 level SW control and 3 ORF entries per thread
    • 44% improvement over prior hardware controlled RFC

[Figure: energy results plotted against Number of ORF/RFC Entries per Thread for four designs: 2 Level HW, 3 Level HW, 2 Level SW, and 3 Level SW.]

Individual Benchmark Results
  • 3 level SW design with 3 ORF entries per active thread

Conclusion
  • 3-level SW controlled design reduces register file energy by 54%
    • 8.3% savings in SM dynamic energy
    • 5.8% savings in chip-wide dynamic energy
  • Limit study highlights potential for future work to improve results
    • Instruction scheduling concurrently with allocation
  • Throughput processors have different critical structures
    • Must redesign as we enter power limited world
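The three percentages in the conclusion can be related with simple arithmetic. The sketch below assumes the SM-level and chip-level savings come entirely from the register file and that nothing else changes, which the slides do not state; under that assumption, dividing each figure by the 54% register file savings gives the implied share of dynamic energy spent in the register file.

```python
# Figures from the Conclusion slide.
RF_SAVINGS = 0.54      # register file energy reduction
SM_SAVINGS = 0.083     # SM dynamic energy reduction
CHIP_SAVINGS = 0.058   # chip-wide dynamic energy reduction

# ASSUMPTION: all savings originate in the register file, with no overhead elsewhere.
rf_share_of_sm = SM_SAVINGS / RF_SAVINGS
rf_share_of_chip = CHIP_SAVINGS / RF_SAVINGS

print(f"{rf_share_of_sm:.1%}")    # 15.4%
print(f"{rf_share_of_chip:.1%}")  # 10.7%
```

These implied shares are estimates derived from the slide's numbers, not values reported in the presentation.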
