Pushing Performance, Efficiency and Scalability of Microprocessors

CERCS IAB Meeting, Fall 2006 • Gabriel Loh

Research Overview • Funding from state of GA, Intel, MARCO • Currently 2 PhD students, 2 MS • Active undergrad research as well • Collaborations • Universities: PSU, UO, Rutgers • Industry: Intel, IBM

Research Focus • “Near-term” microprocessor design issues • ~ 5-year time scale • Power/performance/complexity • Traditional uniprocessor performance • Multi-core performance • “Longer-term” • Keeping Moore’s Law alive for the longer term • Primarily, 3D integration for now

Scaling Performance and Efficiency • Multi-cores are here, but single-thread perf still matters • Intel Core 2 Duo is multi-core, but… • Single core is more OOO than ever • Larger instruction window, improved branch prediction, speculative load-store ordering, wider pipe and decoders • But power also really matters • Lower clock speeds, different channel length transistors, more uop fusion, …

Research Focus • Maximum performance within bounds • Bounds = power, area, TDP, … • Single-core performance helps multi-core performance, too • For future multi-core systems, need to strike a good balance between 1T and MT • Most of our research is at the uarch level • Caches, branch predictors, instruction schedulers, memory queue design, memory dependence prediction, etc.

Highlight: Traditional Caching [MICRO’06] • Well known that different apps respond differently to different replacement policies • Previous work in the OS domain has described adaptive replacement with provable bounds on performance • Adapted techniques for on-chip caches

Idea…

Adaptive Cache Implementation • Theoretical Guarantees • Miss rate provably bounded to be within a factor of two of the better algorithm • In practice, it’s much better
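The adaptive idea can be illustrated with a minimal sketch. This is not the MICRO’06 implementation: it assumes a single fully associative set, two component policies (LRU and FIFO, chosen only for illustration), and tag-only "shadow" directories that count how often each policy would have missed. All class and function names are hypothetical, and the sketch makes no claim to preserve the factor-of-two bound.

```python
from collections import OrderedDict


class ShadowTags:
    """Tag-only model of one replacement policy (no data is stored)."""

    def __init__(self, capacity, promote_on_hit):
        self.capacity = capacity
        self.promote_on_hit = promote_on_hit  # True = LRU, False = FIFO
        self.tags = OrderedDict()

    def access(self, tag):
        """Return True if this policy would have missed on `tag`."""
        if tag in self.tags:
            if self.promote_on_hit:
                self.tags.move_to_end(tag)
            return False
        if len(self.tags) >= self.capacity:
            self.tags.popitem(last=False)  # evict the oldest entry
        self.tags[tag] = True
        return True


class AdaptiveSet:
    """One cache set that follows whichever shadow policy has missed less."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.shadows = [ShadowTags(capacity, True), ShadowTags(capacity, False)]
        self.shadow_misses = [0, 0]
        self.blocks = []   # tags actually resident in the adaptive set
        self.misses = 0

    def access(self, tag):
        """Return True on a hit in the adaptive set, False on a miss."""
        for i, shadow in enumerate(self.shadows):
            if shadow.access(tag):
                self.shadow_misses[i] += 1
        if tag in self.blocks:
            return True
        self.misses += 1
        if len(self.blocks) >= self.capacity:
            # Follow the policy with fewer misses so far (ties go to LRU).
            lru_ahead = self.shadow_misses[0] <= self.shadow_misses[1]
            best = self.shadows[0 if lru_ahead else 1]
            # Evict a block the better policy no longer holds, if any;
            # otherwise fall back to the oldest resident block.
            victim = next((b for b in self.blocks if b not in best.tags),
                          self.blocks[0])
            self.blocks.remove(victim)
        self.blocks.append(tag)
        return False


def run_trace(trace, capacity):
    """Replay a list of tags; return (adaptive misses, [LRU, FIFO] misses)."""
    cache = AdaptiveSet(capacity)
    for tag in trace:
        cache.access(tag)
    return cache.misses, cache.shadow_misses
```

For example, `run_trace([1, 2, 1, 3, 1], 2)` replays a five-access trace on a two-entry set and reports the adaptive set's miss count next to the miss counts of the two shadow policies.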

Current Research • Working on multi-core generalizations of adaptive caching and other ways to manage shared resources • Uniprocessor microarchitecture • Scalable memory scheduling [MICRO’06] • Memory dependence prediction [HPCA’06] • Branch prediction […] • And more…

Longer-Term Processor Scaling • Limitations/Obstacles • Wire scaling • Latency/performance • Power • Feature size • Lithography, parametric variations • Off-chip communication

3D Integration • Wire • Power/perf. • Off-chip • Feature size • Limitations, variations • Die/Wafer Stacking (figure: Active Layer 1, Metal Layers 1, Die-to-Die Vias, Metal Layers 2, Active Layer 2) • Less RC → faster, lower-power

Example: Caches • Simplified 2D SRAM Array (figure) • 3D Wordline Stacking • Wordline length halved • In our studies, WL was critical for latency • 3D Bitline Stacking • Bitline length halved • BL reduction has greater impact on power savings • Split decoder → no activity stacking • We’ve studied a wide variety of other CPU building blocks
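A first-order way to see why halving a wordline or bitline helps is the Elmore approximation for an unbuffered distributed RC wire, where delay grows with the square of the length while switched capacitance grows linearly. The sketch below is a textbook model, not the circuit-level methodology behind the slide; the per-unit resistance and capacitance values and the function names are placeholders.

```python
import math


def elmore_wire_delay(r_per_unit, c_per_unit, length):
    """Elmore delay of an unbuffered distributed RC wire: 0.5 * r * c * L^2."""
    return 0.5 * r_per_unit * c_per_unit * length ** 2


def wire_capacitance(c_per_unit, length):
    """Total wire capacitance, which sets switching energy (C * V^2)."""
    return c_per_unit * length


# Placeholder values; only the ratios matter for this illustration.
R, C, L = 1.0, 1.0, 2.0

delay_ratio = elmore_wire_delay(R, C, L / 2) / elmore_wire_delay(R, C, L)
cap_ratio = wire_capacitance(C, L / 2) / wire_capacitance(C, L)

print(f"wire delay after halving length: {delay_ratio:.2f}x")
print(f"wire capacitance after halving length: {cap_ratio:.2f}x")
```

In a real array, decoder, driver, sense-amp, and die-to-die via delays do not shrink this way, so the end-to-end gain is smaller than the wire-only ratio suggests.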

Uarch-level 3D design • Example: 4-die significance-partitioned datapath • Smaller footprint → faster and lower-power • Use uarch prediction mechanism for early determination of width • Width-based gating → even lower power, close to original power density • Overall: 47% performance gain at only 2 degree temperature increase
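Width-based gating on a significance-partitioned datapath can be sketched as follows. The sketch assumes a 64-bit unsigned datapath split into four 16-bit slices, one per die, and a simple last-value width predictor indexed by instruction address; the slice size, table size, and all names are illustrative assumptions rather than details from the slides.

```python
def dies_needed(value, bits_per_die=16, num_dies=4):
    """Number of dies that must be active to hold an unsigned value."""
    width = max(1, value.bit_length())      # treat 0 as a 1-bit value
    slices = -(-width // bits_per_die)      # ceiling division
    return min(num_dies, slices)


class WidthPredictor:
    """Last-value predictor: remembers how many dies each PC used last time."""

    def __init__(self, entries=256, num_dies=4):
        # Start conservative: predict that every die is needed.
        self.table = [num_dies] * entries

    def predict(self, pc):
        return self.table[pc % len(self.table)]

    def update(self, pc, actual_dies):
        self.table[pc % len(self.table)] = actual_dies


predictor = WidthPredictor()
pc = 0x40
first_guess = predictor.predict(pc)         # no history yet, all dies enabled
predictor.update(pc, dies_needed(0x1234))   # the result fit in one slice
second_guess = predictor.predict(pc)        # upper three dies can be gated
```

If a prediction turns out too narrow, the upper dies would have to be enabled late; how that case is handled is not described on the slide and is left out of this sketch.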

3D Research Summary • Circuit-level [ICCD’05,ISVLSI’06,ISCAS’06,GLSVLSI’06] • Uarch-level [MICRO’06 (w/ ),HPCA’07] • Tutorial papers [JETC’06] • Tutorial [MICRO’06] • Tools [DATE’06,TCAD’07] w/ GTCAD & • Parametric Variations w/ Jim Meindl • Funding, equip from ,

Summary • loh@cc • http://www.cc.gatech.edu/~loh • Lots of exciting work going on here
