
Challenges for High Performance Processors



Presentation Transcript


  1. Challenges for High Performance Processors
     Hiroshi NAKAMURA, Research Center for Advanced Science and Technology, The University of Tokyo

  2. What’s the challenge?
     • Our primary goal: performance
     • How?
       • increase the number and/or operating frequency of functional units, AND
       • supply the functional units with sufficient data (bandwidth)
     • Problems:
       • Memory Wall: system performance is limited by poor memory performance
       • Power Wall: power consumption is approaching the cooling limit
     France-Japan PAAP Workshop

  3. Memory Wall Problem
     • Performance improvement
       • CPU: 55% / year
       • DRAM: 7% / year
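
To see how quickly the two growth rates on this slide diverge, the following minimal sketch compounds them year by year. It assumes both start from the same normalized performance of 1; the function name `performance_gap` is illustrative and not from the talk.

```python
def performance_gap(years, cpu_rate=0.55, dram_rate=0.07):
    """Ratio of CPU to DRAM performance after `years`, both starting at 1."""
    cpu = (1.0 + cpu_rate) ** years    # 55% per year, from the slide
    dram = (1.0 + dram_rate) ** years  # 7% per year, from the slide
    return cpu / dram


for years in (1, 5, 10):
    print(f"after {years:2d} years: CPU/DRAM gap = {performance_gap(years):.1f}x")
```

Under these rates the gap grows by roughly 45% per year, so it reaches about 40x within a decade.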

  4. Example of Memory Wall: Performance of 2GHz Pentium4 for a[i]=b[i]+c[i]
     [Figure: measured performance in the L1 hit, L2 hit, and cache miss regions; the cache miss region is annotated "1/6"]
     • non-blocking cache & out-of-order issue → lack of effective memory throughput
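
A rough way to see why this loop becomes throughput-bound is to estimate the memory bandwidth it would need to run at full speed. The sketch below assumes 8-byte elements, one addition per cycle, and no extra traffic from write-allocation; none of these figures are given on the slide.

```python
def required_bandwidth_gb_s(clock_hz, bytes_per_element=8, adds_per_cycle=1.0):
    """Bandwidth needed to stream a[i] = b[i] + c[i] at the given issue rate."""
    # Each iteration reads b[i] and c[i] and writes a[i]: three elements move.
    bytes_per_iteration = 3 * bytes_per_element
    return clock_hz * adds_per_cycle * bytes_per_iteration / 1e9


print(required_bandwidth_gb_s(2e9))  # demand in GB/s at 2 GHz under the assumptions above
```

With these assumptions the loop asks for tens of GB/s once the arrays no longer fit in cache, which is why latency-hiding features alone cannot rescue it.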

  5. Recap: Memory Wall Problem
     • growing gap between processor and memory speed
     • performance is limited by memory ability in High Performance Computing (HPC)
       • long access latency of main memory
       • lack of throughput of main memory
     → making full use of wide-bandwidth local memory (on-chip memory) is indispensable
       (example: Itanium2/Montecito has a huge L3 cache, 12MB x 2)
     • on-chip memory space is a valuable resource
       • not enough for HPC
       • should exploit data locality

  6. Does cache work well in HPC?
     Cache works well in many cases, but it is not the best for HPC.
     • data location and replacement are decided by hardware
       × unfortunate line conflicts occur although most data accesses are regular
         (e.g., data used only once flushes out other useful data)
     • transfer size between cache and off-chip memory is fixed
       • for consecutive data: a larger transfer size is preferable
       • for non-consecutive data: a large line transfer incurs unnecessary data transfer → waste of bandwidth
     • Most HPC applications exhibit regularity in data access, which a cache sometimes fails to exploit.
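
The bandwidth waste for non-consecutive data can be quantified with a small sketch. It assumes aligned elements and a constant stride, and the function name `useful_fraction` is illustrative only.

```python
def useful_fraction(line_bytes, element_bytes, stride_elements):
    """Average fraction of each fetched cache line that the program uses."""
    stride_bytes = max(stride_elements * element_bytes, element_bytes)
    # If the stride is shorter than a line, one element is used per stride;
    # otherwise only one element is used in every line that is fetched.
    return element_bytes / min(stride_bytes, line_bytes)


for line in (32, 128):
    for stride in (1, 2, 16):
        frac = useful_fraction(line, 8, stride)
        print(f"line={line:3d}B stride={stride:2d}: {frac:.1%} of transferred bytes are useful")
```

For a stride of 16 eight-byte elements, a 128B line delivers only 1/16 of its bytes as useful data, while a 32B line delivers 1/4, which is the trade-off the slide describes.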

  7. SCIMA (Software Controlled Integrated Memory Architecture) [kondo-ICCD2000]
     (joint work with Prof. Boku @ Univ. of Tsukuba and others)
     • addressable SCM (Software Controllable Memory) in addition to ordinary cache
       • a part of logical address space
       • no inclusive relations with cache
     • SCM and cache are reconfigurable at the granularity of a way
     [Figures: overview of SCIMA (ALU, FPU, register, reconfigurable SCM and cache, NIA, network, DRAM memory); address space]

  8. Data Transfer Instruction
     • load/store
       • register ↔ SCM/Cache
     • page-load/page-store (new)
       • SCM ↔ Off-Chip Memory
       • large granularity transfer
         • wider effective bandwidth by reducing latency stall
       • block stride transfer
         • avoid unnecessary data transfer
         • more effective utilization of On-Chip Memory
     [Figure: register, SCM, cache, and off-chip memory connected by load/store, line transfer, and page-load/page-store paths]
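
The idea of a block stride transfer can be illustrated in software. The sketch below is only a functional model of gathering strided blocks into a contiguous on-chip buffer; the function name and parameters are hypothetical and do not reflect the actual SCIMA instruction encoding.

```python
def page_load_block_stride(memory, start, block, stride, count):
    """Gather `count` blocks of `block` elements, spaced `stride` elements apart."""
    scm = []  # stand-in for a contiguous region of on-chip SCM
    for k in range(count):
        base = start + k * stride
        scm.extend(memory[base:base + block])
    return scm


memory = list(range(64))  # stand-in for an 8x8 row-major array in off-chip memory
column = page_load_block_stride(memory, start=3, block=1, stride=8, count=8)
print(column)  # column 3 of the array, packed densely
```

Only the eight requested elements are moved, whereas a line-based cache would fetch a whole line around each of them.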

  9. Strategy of Software Control
     • SCM must be controlled by software
     • arrays are classified into 6 groups by consecutiveness and reusability

       Consecutiveness | not-reusable                   | reusable
       consecutive     | (1) use SCM as a stream buffer | (4) reserve SCM for reused data
       stride          | (2) use SCM as a stream buffer | (5) reserve SCM for reused data
       irregular       | (3) not use SCM                | (6) reserve SCM for reused data

     • first, apply (1) and (2): allocate a small stream buffer in SCM
     • second, apply (4), (5), and (6): allocate the rest area of SCM for reused data
     • prototype of semi-automatic compiler: users specify hints on reusability of data arrays
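
The six-group classification on this slide can be written down as a small lookup. This is a sketch of the mapping as read from the slide's diagram, not the compiler's actual implementation; the function name `scm_policy` is hypothetical.

```python
def scm_policy(consecutiveness, reusable):
    """Return (group number, SCM usage) for an array, following the slide's six groups."""
    if consecutiveness not in ("consecutive", "stride", "irregular"):
        raise ValueError(f"unknown access pattern: {consecutiveness}")
    if reusable:
        # Groups (4), (5), (6): reused data is kept in SCM.
        group = {"consecutive": 4, "stride": 5, "irregular": 6}[consecutiveness]
        return group, "reserve SCM for reused data"
    if consecutiveness == "irregular":
        return 3, "not use SCM"
    # Groups (1), (2): regular but not reused, so stream through a small buffer.
    group = 1 if consecutiveness == "consecutive" else 2
    return group, "use SCM as a stream buffer"


print(scm_policy("stride", reusable=False))
print(scm_policy("irregular", reusable=True))
```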

  10. Results of Memory Traffic
      • benchmark programs: CG, FT, QCD
      • assumption
        • cache model: cache size = 64KB (4way), SCM size = 0KB
        • SCIMA mode: cache size = 16KB (1way), SCM size = 48KB
        • total # of ways: 4
        • line size: 32B, 128B
      • memory traffic decreases by 1% - 61% in SCIMA
        • unnecessary memory traffic is suppressed
        • due to full exploitation of data reusability

  11. Results of Performance
      • assumption
        • load/store latency: 2 cycles
        • bus throughput: 4B/cycle
        • memory latency: 40 cycles
      • normalized execution time is broken down into
        • CPU busy time
        • latency stall: elapsed time due to memory latency
        • throughput stall: elapsed time due to lack of throughput
      • 1.3-2.5 times faster than cache
        • latency stall reduction by large granularity of data transfer
        • throughput stall reduction by suppressing unnecessary data transfer
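
The effect of transfer granularity on latency stall can be estimated with the slide's own parameters. The sketch below assumes a deliberately simple model in which transfers do not overlap at all, and the 4096-byte transfer size is a hypothetical stand-in for a page-load, since the slide does not state a page size.

```python
MEMORY_LATENCY = 40  # cycles per transfer, from the slide
BUS_THROUGHPUT = 4   # bytes per cycle, from the slide


def transfer_cycles(total_bytes, transfer_bytes):
    """Cycles to move total_bytes in chunks of transfer_bytes, with no overlap."""
    transfers = -(-total_bytes // transfer_bytes)   # ceiling division
    latency_stall = transfers * MEMORY_LATENCY      # paid once per transfer
    throughput_time = total_bytes / BUS_THROUGHPUT  # paid regardless of chunking
    return latency_stall + throughput_time


for size in (32, 128, 4096):
    print(f"{size:5d}B transfers: {transfer_cycles(4096, size):.0f} cycles for 4096B")
```

Real hardware overlaps some of this latency, so the absolute numbers are pessimistic, but the trend matches the slide: fewer, larger transfers pay the memory latency less often.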

  12. Power Wall
      [Figure: Trends of Heat Density, with a data point marked "Itanium (130W)"]
      • Next Focus: Power Consumption of Processors
        • Is there any room for power reduction?
        • If yes, then how to reduce it?

  13. Observation (1) – Moore’s Law –
      • Number of transistors: doubles every 18 months

  14. Observation (2) – frequency –
      • Frequency: doubles every 3 years
      • Number of transistors: doubles every 18 months
      • Number of switching events on a chip: 8 times every 3 years

  15. Observation (3) – performance –
      • # of switching events on a chip: 8 times every 3 years
      • effective performance: 4 times every 3 years
        • “microprocessor performance improved 55% per year”, from “Computer Architecture: A Quantitative Approach” by J. Hennessy and D. Patterson, Morgan Kaufmann
      • unnecessary switching = chance of power reduction: doubles every 3 years
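
The arithmetic behind these three observations can be checked directly. The sketch below uses only the rates quoted on the slides; the variable names are illustrative.

```python
def growth_over(years, doubling_period_years):
    """Growth factor after `years` for a quantity that doubles every period."""
    return 2.0 ** (years / doubling_period_years)


years = 3
transistors = growth_over(years, 1.5)  # doubles every 18 months
frequency = growth_over(years, 3.0)    # doubles every 3 years
switching = transistors * frequency    # switching events scale with both
performance = 1.55 ** years            # 55% per year, rounded to 4x on the slide
unnecessary = switching / performance  # switching that did not become performance

print(f"transistors x{transistors:.0f}, frequency x{frequency:.0f}, switching x{switching:.0f}")
print(f"performance x{performance:.2f}, unnecessary switching x{unnecessary:.2f}")
```

Compounding 55% per year over three years gives a factor a little under 4, so the ratio of switching to delivered performance comes out slightly above 2, consistent with the slide's "doubles every 3 years".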

  16. An Evidence of the Observation - unnecessary switching = x2 / 3 years -
      [Zyuban00] @ ISLPED’00
      • energy per instruction increases to exploit ILP for higher performance
        • at functional units: no increase
        • at issue window, register file: increase
        • flushed instructions caused by incorrect prediction: increase
      [Figure: access energy per instruction (nJ) vs. issue width (4, 6, 8, 10, 12), broken down into rename map table, bypass mechanism, load/store window, issue window, register file, and functional units, for committed and flushed instructions; the flushed-instruction share is annotated "waste of power"]

  17. Registers
      • Registers consume a lot of power
        • roughly speaking, power ∝ (num. of registers) × (num. of ports)
        • high performance wide issue superscalar processors → more registers, more read/write ports
      • Open Question
        • in HPC, what is the best way to use many functional units (or accelerators) from the perspective of register file design?
          • scalar registers with SIMD operations
          • vector registers with vector operations
          • ………
      • Personal Impression
        • vector registers are accessed in a well-organized fashion, so it is easy to reduce “num. of ports” by a sub-banking technique
        • can vector operations make good use of local on-chip memory? (at least, traditional vector processors cannot!)
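
The proportionality on this slide suggests why sub-banking helps. The sketch below compares a monolithic register file with a banked one under that rule; the register and port counts are hypothetical, and the model ignores the interconnect needed to route operands between banks.

```python
def relative_rf_power(num_registers, num_ports):
    """Relative register file power under power ∝ registers × ports."""
    return num_registers * num_ports


# Hypothetical configuration: 128 registers with 12 ports, either as one
# monolithic file or as 4 banks of 32 registers with 3 ports each.
monolithic = relative_rf_power(128, 12)
banked = 4 * relative_rf_power(32, 3)

print(monolithic, banked, banked / monolithic)
```

The total register count is unchanged, so the saving in this model comes entirely from the reduced port count per bank, which is feasible when accesses are as regular as those of vector registers.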

  18. Dual Core helps …
      Rule of thumb: in the same process technology…

      Design      | Voltage | Freq | Area | Power | Perf
      single core | 1       | 1    | 1    | 1     | 1
      dual core   | -15%    | -15% | 2    | 1     | ~1.8

      [Figure: a single core with cache vs. two cores with cache]

  19. Multi-Core helps more …
      • small core relative to large core: Power = 1/4, Performance = 1/2
      • no need for wider instruction issue
      → Multi-Core: power efficient, with better power and thermal management
      [Figure: a large core with cache vs. four small cores (C1-C4) with cache; bar charts of power and performance with axis ticks 1-4]
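
The slide's figures for a small core (1/4 of the power, 1/2 of the performance of a large core) can be scaled to several cores. The sketch below adds an Amdahl's-law term for the parallel fraction, which is an assumption of this example and not part of the slide.

```python
def multicore(n_cores, core_power=0.25, core_perf=0.5, parallel_fraction=1.0):
    """Power and performance of n small cores, in units of one large core."""
    power = n_cores * core_power
    # Amdahl's law: speedup over a single small core (assumption of this sketch).
    speedup = 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)
    return power, core_perf * speedup


print(multicore(4))                         # fully parallel workload
print(multicore(4, parallel_fraction=0.9))  # 10% of the work stays serial
```

With a fully parallel workload, four small cores match the large core's power while doubling its performance; any serial portion erodes that advantage.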

  20. Leakage problem
      [Figure: Power (W) and Power Density (W/cm2) of a 10 mm die, axis 0-1400, for 90nm, 65nm, 45nm, 32nm, 22nm, and 16nm, broken down into SiO2 Lkg, SD Lkg, and Active [Borkar-MICRO05]]
      [Figure: leakage current from VDD through ON and OFF transistors with input 0 [Borkar-MICRO05]]
      (source: IEEE Computer Magazine)
      • How to attack the leakage problem?

  21. Introduction of our research
      • Innovative Power Control for Ultra Low-Power and High-Performance System LSIs
        • 5-year project started in October 2006
        • supported by JST (Japan Science and Technology Agency) as a CREST (Core Research for Evolutional Science and Technology) program
      • Objective: drastic power reduction of high-performance system LSIs by innovative power control through tight cooperation of various design levels including circuit, architecture, and system software
      • Members:
        • Prof. H. Nakamura (U. Tokyo): architecture & compiler [leader]
        • Prof. M. Namiki (Tokyo Univ of Agri. Tech): OS
        • Prof. H. Amano (Keio Univ): architecture & F/E design
        • Prof. K. Usami (Shibaura I.T.): circuit & B/E design

  22. How to reduce leakage: Power Gating
      • Focusing on Power Gating for reducing leakage
        • Inserting a Power Switch (PS) between VDD and GND
        • Turning off the PS during sleep
      [Figure: logic gates between VDD and GND, compared with logic gates between VDD and a virtual GND that connects to GND through a power switch driven by a Sleep signal]

  23. Run-time Power Gating (RTPG)
      • control power switch at run time
        • Coarse grain: mobile processor by Renesas (independent power domains for BB module, MPEG module, ...)
        • Fine grain (our target): power gating within a module
      [Figure: Circuits A, B, and C, each with a power switch driven by a sleep control circuit]

  24. Fine-grain Run-time Power Gating
      • Longer sleep time is preferable
        • Leakage savings
        • Overheads: power penalties for wakeup
      • Evaluation through a real chip has not been reported
      • Test vehicle: 32b x 32b Multiplier
        • Either or both operands (input data) are likely to be less than 16 bits
        • Circuit portions that compute the upper bits of the product need not operate → they waste leakage power
      • By detecting 0s in the upper 16 bits of the operands, power gate the internal multiplier array
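
The zero-detection step can be sketched in software. The domain names "H" and "M" are borrowed from the measurement slide that follows, but the mapping used here (one narrow operand gates H, two narrow operands gate H and M) is an assumption of this example and is not stated in the talk.

```python
MASK32 = 0xFFFFFFFF


def upper_half_is_zero(x):
    """True if the upper 16 bits of a 32-bit operand are all zero."""
    return (x & MASK32) >> 16 == 0


def sleep_domains(a, b):
    """Power domains of the multiplier array that could sleep (assumed mapping)."""
    narrow_operands = upper_half_is_zero(a) + upper_half_is_zero(b)
    if narrow_operands == 2:
        return {"H", "M"}
    if narrow_operands == 1:
        return {"H"}
    return set()


print(sleep_domains(3, 5))                    # both operands fit in 16 bits
print(sleep_domains(0x10000, 5))              # one operand is wide
print(sleep_domains(0x10000, 0xFFFF0000))     # both operands are wide
```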

  25. Test chip "Pinnacle": real measurement - exhibits good power reduction -
      [Figure: power dissipation (mW), axis 2.0-4.0, at 25C, 85C, and 125C for Sequence 1 (no sleep), Sequence 2 (Domain H sleeps), and Sequence 3 (Domains H and M sleep), with FG-RTPG not applied vs. applied]
      • Current Status
        • Designing a pipelined microprocessor with FG-RTPG
        • Compiler (instruction scheduler) to increase sleep time

  26. Low Power Linux Scheduler based on statistical modeling
      • Co-optimization of System Software and Architecture
      • Objective:
        • a process scheduler which reduces power consumption by DVFS (dynamic voltage and frequency scaling) of each process while satisfying its performance constraint
      • How to find the lowest frequency that satisfies the performance constraint?
        • it depends on hardware and program characteristics
        • performance ratio is different from frequency ratio
        • hard to find the answer directly → modeling by statistical analysis of hardware events
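
Why the performance ratio differs from the frequency ratio, and how a scheduler could pick a frequency, can be shown with a toy model. The sketch below replaces the talk's statistical model of hardware events with a much simpler analytic assumption: only a CPU-bound fraction of the run time scales with frequency. The frequency list and function names are hypothetical.

```python
def predicted_slowdown(freq, f_max, cpu_fraction):
    """Run time at `freq` relative to f_max; only the CPU-bound part scales."""
    return cpu_fraction * (f_max / freq) + (1.0 - cpu_fraction)


def lowest_frequency(freqs, cpu_fraction, max_slowdown):
    """Lowest frequency whose predicted slowdown stays within the constraint."""
    f_max = max(freqs)
    ok = [f for f in freqs
          if predicted_slowdown(f, f_max, cpu_fraction) <= max_slowdown]
    return min(ok) if ok else f_max


freqs = [0.8, 1.2, 1.6, 2.0]  # GHz, hypothetical operating points
for cpu_fraction in (0.9, 0.2, 0.05):
    f = lowest_frequency(freqs, cpu_fraction, max_slowdown=1.10)
    print(f"cpu-bound fraction {cpu_fraction:.2f}: run at {f} GHz")
```

A compute-bound process has to stay at the top frequency, while a memory-bound one tolerates a much lower frequency within the same 10% performance constraint, which is the program dependence the slide points out.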

  27. Evaluation result
      • Pentium M 760 (Max 2.00 GHz, FSB 533 MHz)
      • Specified threshold: black dotted line
      • Performance is within the threshold in all cases except for mgrid
        • 3-7% below the threshold
      • An accurate model is obtained
      • A Linux scheduler using this model has been developed
      May 8, 2007, France-Japan PAAP Workshop

  28. Summary
      • Challenges for high performance processors:
        • Memory Wall and Power Wall
      • One solution to the memory wall
        • make good use of on-chip memory with software controllability
      • Solutions to the power wall
        • many cores will relax the problem, but
        • leakage current is becoming a big problem
          • new research/approach is required
          • our project “Innovative Power Control for Ultra Low-Power and High-Performance System LSIs” is introduced

