
Toward a Sustainable Architecture at Extreme Scale


Presentation Transcript


  1. Toward a Sustainable Architecture at Extreme Scale Zhimin Tang, CTO tangzhm@sugon.com

  2. Outline • Sustainable (Cost-Effective) HPC • Counter-examples in history • Current and future challenges • New computing forms from sensor to cloud • Silicon-based IC processes approaching their physical limit • Strategy • Abandon HPC-only acceleration features • Design a sustainable architecture for HPC and other applications

  3. Considerations of Cost Effectiveness or Sustainability • Application (algorithm) requirements • High performance • Technology constraints • CMOS vs. bipolar, Moore’s Law • Commercial MPU vs. custom ASIP • Economic feasibility • Good ecosystem • Mass production • Low energy consumption

  4. HPCs in History • Vector supercomputers • CMOS came to dominate; SIMD was a weakness

  5. Connection Machine • SIMD PE array • Optimal only for some algorithms • Custom chips, tiny processors

  6. MIMD with Custom CPUs • Chip Level Integration (SoC) • nCube/2, KSR-1 (COMA), … • High NRE cost due to custom design without mass production • Low node processor performance

  7. Why No Cost Effectiveness • HPC is a small market • Architectures designed only for HPC • Lower volume, higher cost (NRE) • Not enough resources to implement a top-level (w.r.t. performance) solution • Longer time-to-market, falling behind Moore’s Law • Result: COTS solutions over the last 20 years • Commercial off-the-shelf • Co-design with the IT ecosystem • From cloud computers to sensors

  8. Ecosystem Requirements • High performance and low cost • Low cost continues to be a must • New cost factors: energy/power, large NRE • Performance is no longer the bottleneck for most applications (much as with cars, trains, and airplanes in transportation) • New measures of performance • Computing: MIPS/MFLOPS • Transaction processing: TPM • Cloud applications: requests serviced per unit time

  9. Energy Efficiency • Two ends of the computing system • Cloud: large-scale power dissipation • Terminal: limited battery life • Energy: compute < memory < communication • For each FLOP in Linpack, the FPU spends ~10 pJ while a memory access costs ~475 pJ (see the rough split sketched below) • Wireless sensor networks: the RF radio consumes most of the power • What do we need besides locality?
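
A rough split of per-FLOP energy using the 10 pJ (FPU) and 475 pJ (memory) figures above; the arithmetic intensity of 4 FLOPs per off-chip access is an illustrative assumption, not a measured value:

    /* Per-FLOP energy split: ~10 pJ per FPU op, ~475 pJ per DRAM access
     * (figures from the slide); 4 FLOPs per access is an assumed intensity. */
    #include <stdio.h>

    int main(void) {
        const double fpu_pj  = 10.0;          /* energy per floating-point op */
        const double dram_pj = 475.0;         /* energy per off-chip access   */
        const double flops_per_access = 4.0;  /* assumed arithmetic intensity */

        double per_flop = fpu_pj + dram_pj / flops_per_access;
        printf("energy per FLOP: %.1f pJ (%.0f%% spent on memory)\n",
               per_flop, 100.0 * (dram_pj / flops_per_access) / per_flop);
        return 0;
    }

Even at 4 FLOPs per access, roughly 90% of the energy goes to memory, which is why locality alone is not enough.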

  10. New Architectures Are Needed • Architectures that consume less energy • Many-core, custom-designed for applications • Flattened software stack • Architectures for new performance metrics • High-volume throughput computers • New algorithms and methodology • Complexity of computation • Complexity of memory access and communication

  11. Constraints on Innovation • Existing software ecosystem • Standard or de facto interfaces, e.g., the ISA (Instruction Set Architecture) • Pro: software compatibility • Con: obstacle to innovation, legacy burden • Huge development expenses • A new architecture needs new processors • NRE of chip development is increasing rapidly as the CMOS process approaches its limit • NRE: non-recurring engineering

  12. CMOS Technology • Approaching its limit, and no replacement! • Moore’s law: 7 nm @ 2024, ~30 atoms • Different from the transition in the 1990s • Bipolar (ECL/TTL) was faster but consumed much more power • CMOS had matured over 20 years: not too slow, low cost, and low power • But now even CMOS needs liquid cooling • In the foreseeable future, still CMOS

  13. More and More than Moore • [Figure: 2011 ITRS Executive Summary, Fig. 4]

  14. Dark Silicon • At 8 nm, more than half of the transistors must be turned off • Speedup of only 4–8x over 5 process generations (ISCA’11, IEEE Micro’12, CACM’13) • A back-of-the-envelope version of this argument is sketched below
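
A back-of-the-envelope sketch of why dark silicon appears (not the ISCA’11 model): each generation doubles the transistor count, but with supply voltage no longer scaling, per-transistor switching power drops by only ~1.4x, so at a fixed chip power budget the active fraction shrinks. The scaling factors are assumptions for illustration:

    /* Dark-silicon estimate under assumed post-Dennard scaling factors. */
    #include <stdio.h>

    int main(void) {
        const double count_growth = 2.0;   /* transistor count per generation */
        const double power_shrink = 1.4;   /* per-transistor power reduction  */
        double active = 1.0;               /* assume 100% usable at the start */

        for (int gen = 1; gen <= 5; ++gen) {
            active *= power_shrink / count_growth;
            printf("generation %d: %.0f%% active, %.0f%% dark\n",
                   gen, 100.0 * active, 100.0 * (1.0 - active));
        }
        return 0;
    }

Under these assumed factors, well over half of the chip is dark after a few generations, consistent in spirit with the slide’s 8 nm claim.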

  15. Economic Feasibility • Moore’s Law provides more transistors • But switching speed is no longer improving • Process development at the nanometer scale increases NRE tremendously • Mass production is essential • Otherwise the chip business is not sustainable • The advantage of general-purpose processors • How about many-core processors? • GPU, Tilera, MIC, …

  16. Pros and Cons of the MPU • Most advanced process, mass production • Stable, reliable, low cost • Mature ecosystem and solutions • Not optimal for many applications • Aim: not too bad for most applications • Over-allocation of resources • Waste of resources and higher energy consumption

  17. MPUs Are Not Good for the Cloud • High L1-I cache miss rate • Processor idles (instruction starvation) • Small ILP and MLP • Wide issue is not effective • Low efficiency of memory access • A large L3 takes half the chip area but barely improves performance • High on-chip bandwidth goes unused • Little data sharing among cores

  18. Low Utilization of Resources • Only about 1/3 of the chip is frequently used • [Figure: die floorplan with a GPU, four OOO FPU cores with per-core L2 caches, and a shared L3 cache]

  19. Pros and Cons of the ASIP • Optimally designed for some applications • High efficiency, low resource use, low power • But there is no free lunch • Much design/verification work • Stability/reliability? • May affect the time to market • How to amortize the huge NRE? • A small market means high cost

  20. MPU + Accelerator • GPU • Pro: mass production • Con: PCIe overhead, small memory size (see the rough break-even sketch below) • MIC (Xeon Phi) • Is mass production possible? • FPGA • Resource utilization • Ease of programming • MPU interface, e.g., QPI or PCIe
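
A rough view of the PCIe overhead mentioned above: the accelerator only pays off when compute time on the offloaded data exceeds the transfer time. All numbers are illustrative assumptions, not measurements of any particular device:

    /* Offload break-even: compare PCIe transfer time with accelerator compute time. */
    #include <stdio.h>

    int main(void) {
        const double pcie_gbs   = 12.0;    /* assumed effective PCIe bandwidth, GB/s */
        const double acc_gflops = 1000.0;  /* assumed sustained accelerator GFLOP/s  */
        const double bytes      = 1e9;     /* data moved to/from the device          */
        const double flops      = 8e9;     /* work done on that data                 */

        double transfer_s = bytes / (pcie_gbs * 1e9);
        double compute_s  = flops / (acc_gflops * 1e9);
        printf("transfer %.3f s vs compute %.3f s -> %s\n", transfer_s, compute_s,
               transfer_s > compute_s ? "dominated by PCIe" : "offload worthwhile");
        return 0;
    }

With only 8 FLOPs per byte moved, the transfer dominates; either the kernel reuses data heavily on the device or the PCIe overhead wipes out the accelerator’s advantage.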

  21. Design of New Processors • Crossing the gap between general and special • Many simple cores • Reduce power consumption • Multiple hardware threads in each core • Massive numbers of threads on chip • Exploit concurrency, tolerate latency (see the toy model below) • Dynamic scheduling of on-chip threads • Improve performance for general applications
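
A toy model of how many resident hardware threads are needed to tolerate memory latency: each thread issues one instruction and then stalls for a fixed number of cycles, while the core issues from whichever threads are ready. The latency value and the single-issue core are illustrative assumptions:

    /* Latency tolerance via hardware multithreading (toy model). */
    #include <stdio.h>

    int main(void) {
        const int mem_latency = 100;  /* assumed stall cycles per memory access */

        for (int threads = 1; threads <= 128; threads *= 2) {
            /* Each thread issues at most one instruction every (1 + mem_latency)
             * cycles; a single-issue core saturates at one per cycle. */
            double utilization = threads / (1.0 + mem_latency);
            if (utilization > 1.0) utilization = 1.0;
            printf("%3d threads -> pipeline utilization %.0f%%\n",
                   threads, 100.0 * utilization);
        }
        return 0;
    }

Roughly one thread per cycle of latency is needed before the pipeline stays busy, which is why the slide calls for massive numbers of threads on chip.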

  22. Combining Multithreading and Vector Pipelining • [Figure: pipelined vector processing engine with I$, ID, IR, RF, D$/SPM and vector registers; the pipeline can switch between a deep scalar pipeline (single thread) and a vector pipeline]

  23. Thread Parallelism and Data Parallelism in Two Dimensions • Deep thread and data parallelism along each pipeline; wide thread parallelism and wide data parallelism across pipelines • [Figure: replicated pipelines (I$, ID, IR, RF, D$/SPM) sharing a vector register file] • A software analogue is sketched below
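
A software analogue of the two dimensions on this slide, under the assumption that OpenMP threads stand in for thread parallelism and a vectorizable inner loop stands in for data parallelism (the proposed hardware schedules threads and lanes directly):

    /* Two-dimensional parallelism: threads across rows, SIMD lanes within a row.
     * Compile with -fopenmp; array sizes are illustrative. */
    #include <stdio.h>

    #define ROWS 1024
    #define COLS 1024

    static float a[ROWS][COLS], b[ROWS][COLS], c[ROWS][COLS];

    int main(void) {
        #pragma omp parallel for          /* wide thread parallelism */
        for (int i = 0; i < ROWS; ++i) {
            #pragma omp simd              /* wide data parallelism   */
            for (int j = 0; j < COLS; ++j)
                c[i][j] = a[i][j] + b[i][j];
        }
        printf("c[0][0] = %f\n", c[0][0]);
        return 0;
    }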

  24. In Conclusion • A universal architecture • Scalable and reconfigurable processor array • Supports thread- and data-level parallelism • Fulfills requirements from the terminal to the cloud data center • High-performance computers • Cloud computing servers • Core-network equipment • Terminals for the cloud and the mobile Internet
