1 / 12

GSRC Soft-Systems Vision of Long-term Microarchitecture Research

GSRC Soft-Systems Vision of Long-term Microarchitecture Research. Wen-mei Hwu and Kurt Keutzer with John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li, Hillery C. Hunter, Shane Ryoo, Sain-Zee Ueng, James W. Player, Ian M. Steiner, Chris I. Rodrigues, Robert E. Kidd,

corbin
Download Presentation

GSRC Soft-Systems Vision of Long-term Microarchitecture Research

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. GSRC Soft-SystemsVision of Long-termMicroarchitecture Research Wen-mei Hwu and Kurt Keutzer with John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li, Hillery C. Hunter, Shane Ryoo, Sain-Zee Ueng, James W. Player, Ian M. Steiner, Chris I. Rodrigues, Robert E. Kidd, Dan R. Burke, Nacho Navarro, Steve S. Lumetta University of Illinois at Urbana-Champaign and Yujia Jin, Andrew Mihal, Matt Moskewcz, Will Plishker, Kaushik Ravindran, N. R. Satish, Scott Weber University of California at Berkeley

  2. Powerful trends • Technology trends • Increasing speed, power variability of transistors limit clock frequency increase • Increasing interconnect delay and shrinking clock domains limit the size of individual computing engines. • Soft errors, signal integrity, and defects require new approaches to reliability 1.4 10000 Interconnect RC Delay 1.3 30% 1000 Clock Period 1.2 Normalized Frequency 130nm 100 Delay (ps) Copper Interconnect 1.1 10 5X RC delay of 1mm interconnect 1.0 0.9 1 1 2 3 4 5 Normalized Leakage (Isb) 350 250 180 130 90 65 Source: Shekhar Borkar, Intel

  3. Powerful Trends (cont.) • Architectural trends • No future scaling path for monolithic microarchitectures • Too much power, too much cache, too hard to verify, diminishing returns • No single processing model will cover all applications – lesson from Pentium 4 vs. Itanium 2 • Traditional multi-core approach is unlikely to be fruitful without major software and microarchitecture retooling

  4. Powerful Trends (cont.) • Business trends • >$2B per fabrication facility, >$30M per IC total design cost, >$1M mask set, >$1M CAD tool investment per design engineer – decreasing ASIC design starts and strong demand for programmable replacements • Software trends • Relatively more and more processor cycles will be devoted to applications that deal with physical world – rich in inherent parallelism • More and more application source will be available to advanced compilers – open source movement and Microsoft Phoenix • New programming models and environments motivated by productivity will increase compiler’s ability to enhance parallelism and manage data movement • Deep, scalable software analysis systems becoming available to application compilation at production level

  5. Bottom line analysis • Thread/task-level parallelism will be the main venue for continued scaling • Expending power while developing parallelism is decreasingly viable • Compilers and programmers must deliver, saving energy for computation • Spatial computing, decentralized models key to efficiency and reliability • Already prevalent in certain embedded contexts—will also dominate general purpose • Processors will be heterogeneous compute engines tailored to their application • Some tailoring done at chip design time via special purpose on-chip application engines • Some tailoring done at user programming time via reconfigurable logic, interconnects • Distributed memory model and proactive data management will be critical to sustaining parallel execution • Bulk cache structures need to be replaced or supplemented by localized storage that provide massive bandwidth at much lower power and latency • Pointers and dynamically allocated data structures must be systematically managed in the memory model

  6. Near-term chip multiprocessor

  7. Soft Systems microarchitecture model MTM, with its distributed implementation will be key to modular systems design

  8. More on accuracy Inaccurate Analysis: Steensgaard Accurate Analysis: Andersen, Field-sensitive, Context-sensitive, Heap cloning Just One Heap Object ( calloc( ) allocation ) MPEG-4 Decoder, GetContextInter()

  9. Analyzing large, complex programs[SAS2004] Originally, problem size exploded as more contexts were encountered 1012 This results in an efficient analysis process without loss of accuracy 104 New algorithm contains problem size with each additional context

  10. Question 1: Why is this different from previous parallelizing compilers and multiprocessors? • We are extending these previous work in three critical directions • Extraction of independent memory objects and conversion of implicit memory data flow into managed data movement and communication • Applicability to mass software base - requires pointer analysis, control flow analysis, beyond traditional array dependence analysis • Incorporation of high-level programming models with deep program analysis CPU CPU DRAM PE’s DRAM Az_4 PE’s Az_4 Weight_Ai (Az, F_ga3, Ap3) Weight_Ai (Az, F_g4, Ap4) synth synth Residu (Ap3, &syn_subfr[i],) res2 res2 Copy (Ap3, h, 11) Weight_Ai Weight_Ai Set_zero (&h[11], 11) m_syn m_syn (Ap4, h, h, 22, &h) Syn_filt Copy+ F_g3 Residu F_g3 Set_zero tmp = h[0] * h[0]; for (i = 1 ; i < 22 ; i++) tmp = tmp + h[i] * h[i]; F_g4 F_g4 tmp1 = tmp >> 8; Syn_filt tmp = h[0] * h[1]; for (i = 1 ; i < 21 ; i++) syn syn D R A M tmp = tmp + h[i] * h[i+1]; tmp2 = tmp >> 8; Corr0/Corr1 if (tmp2 <= 0) Ap3 Ap3 tmp2 = 0; else tmp2 = tmp2 * MU; Ap4 preemph Ap4 tmp2 = tmp2/tmp1; preemphasis (res2, temp2, 40) h h Syn_filt Syn_filt (Ap4, res2, &syn_p), tmp tmp 40, mem_syn_pst, 1); tmp1 tmp1 agc (&syn[i_subfr], &syn) agc 29491, 40) tmp2 tmp2

  11. Question 2: Haven’t we tried the compilerapproach in Itanium? • We proved that one can build compilers with aggressive whole program analysis and ILP techniques that achieve excellent results on large applications • Whole program compilation with GCC compatibility, new Microsoft Phoenix initiative, Oracle success – many of the traditional roadblocks to pervasive optimizing compilation have been removed • We did not (attempt to) prove that the compiler is capable of parallelizing applications at the memory and thread/task level • Now a major focus of the GSRC Soft Systems work in compilation of traditional and programming models • Need to develop (micro)architectural models for proactive management of memory and communication

  12. Microarchitecture research focus of Soft Systems • Hardware: predictable PE’s plugged into a heterogeneous multiprocessing system with intelligent interface units • Intelligent interface units cooperatively move data and code to fully utilize processing elements under compiler and runtime control • Interface units help to maintain desirable software properties, contain software errors, and enable debugging • Processing engines that embody major forms of parallelism to cover a wide range of applications: VLIW, systolic arrays, SIMD • Software: software model and tools will make or break the microarchitectures • Goal oriented research on robust memory and thread level analysis and transformations, integration of programming models and analysis techniques to harness computational power of highly parallel systems • Continue to extend compilation into major code bases: open source movement, Oracle, Microsoft Phoenix initiative, etc. • Aggressive migration of important application and system code into spatial accelerators

More Related