
Inherently Lower-Power High-Performance Superscalar Architectures


Presentation Transcript


  1. Paper Review Inherently Lower-Power High-Performance Superscalar Architectures Rami Abielmona Prof. Maitham Shams 95.575 March 4, 2002

  2. Flynn’s Classifications (1966) [1] • SISD – Single Instruction stream, Single Data stream • Conventional sequential machines • The program executed is the instruction stream, and the data operated on is the data stream • SIMD – Single Instruction stream, Multiple Data streams • Vector and array machines • Processors execute the same program, but operate on different data streams • MIMD – Multiple Instruction streams, Multiple Data streams • Parallel machines • Independent processors execute different programs, using unique data streams • MISD – Multiple Instruction streams, Single Data stream • Systolic array machines • A common data structure is manipulated by separate processors, executing different instruction streams (programs)

  3. Pipelined Execution • Effective way of organizing concurrent activity in a computer system • Makes it possible to execute instructions concurrently • Maximum throughput of a pipelined processor is one instruction per clock cycle • Shown in figure 1 is a two-stage pipeline, with buffer B1 receiving new information at the end of each clock cycle Figure 1, courtesy [2]
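The throughput claim above can be checked with a minimal sketch (illustrative only, not from the paper): a two-stage pipeline in which stage 1 fetches into buffer B1 at the end of each cycle and stage 2 executes B1's contents. After the pipeline fills, one instruction completes per cycle.

```python
def simulate_two_stage(instructions):
    """Return the number of clock cycles needed to run all instructions."""
    b1 = None                      # inter-stage buffer B1
    completed = 0
    cycles = 0
    fetch_queue = list(instructions)
    while completed < len(instructions):
        cycles += 1
        # Stage 2: execute whatever B1 received at the end of the last cycle.
        if b1 is not None:
            completed += 1
        # Stage 1: fetch the next instruction into B1 at the end of this cycle.
        b1 = fetch_queue.pop(0) if fetch_queue else None
    return cycles

# n instructions finish in n + 1 cycles, so throughput tends to 1 instr/cycle.
print(simulate_two_stage(range(8)))    # → 9
```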

  4. Superscalar Processors • Processors equipped with multiple execution units, in order to handle several instructions in parallel [11] • Maximum throughput is greater than one instruction per cycle (multiple-issue) • Baseline architecture is shown in figure 2 [3] Figure 2, courtesy [3]

  5. Important Terminology [2] [4] • Issue Width – the metric designating how many instructions are issued per cycle • Issue Window – comprises the last n entries of the instruction buffer • Register File – set of n-byte, dual-read, single-write bank of registers • Register Renaming – technique used to prevent stalling the processor for false data dependencies between instructions • Instruction Steering – technique used to send decoded instructions to appropriate memory banks • Memory Disambiguation Unit – mechanism for enforcing data dependencies through memory
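Register renaming, as defined above, can be sketched in a few lines (a generic illustration, not any particular processor's design): each write to an architectural register allocates a fresh physical register in a map table, so false (WAR/WAW) dependencies between instructions disappear.

```python
def rename(instrs, num_arch_regs=4):
    """instrs: list of (dest_reg, src_regs) over architectural registers.
    Returns the instructions rewritten over physical registers."""
    map_table = {r: r for r in range(num_arch_regs)}  # arch -> physical
    next_phys = num_arch_regs
    renamed = []
    for dest, srcs in instrs:
        phys_srcs = tuple(map_table[s] for s in srcs)  # read current mapping
        map_table[dest] = next_phys                    # fresh physical dest
        renamed.append((next_phys, phys_srcs))
        next_phys += 1
    return renamed

# r1 = r2 + r3 ; r1 = r0 + r2 -- the WAW hazard on r1 is removed, since the
# two writes now target distinct physical registers 4 and 5:
print(rename([(1, (2, 3)), (1, (0, 2))]))  # → [(4, (2, 3)), (5, (0, 2))]
```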

  6. Motivations and Objectives • To analyze the high-end processor market for BOTH power and area-performance trade-offs (not previously done) • To propose a superscalar architecture which achieves a power reduction without compromising performance • Analysis to be carried out on structures that increase energy dissipation, with an increasing issue width • Register rename logic • Instruction issue window • Memory disambiguation unit • Data bypass mechanism • Multiported register file • Instruction and data caches

  7. Energy Models [5] • Model 1 – Multiported RAM • Access energy (R or W) = E_decode + E_array + E_SA + E_ctlSA + E_pre + E_control • Word-line energy = Vdd^2 · N_bits · (C_gate·W_pass,r + (2·N_write + N_read)·W_pitch·C_metal) • Bit-line energy = Vdd · M_margin · V_sense · C_bl,read · N_bits • Model 2 – CAM (Content-Addressable Memory) • Uses IW write word lines and IW write bit-line pairs • Model 3 – Pipeline latches and clocking tree • Assumes a balanced clocking tree (less power dissipation than grids) • Assumes a lower-power single-phase clocking scheme • Near-minimum transistor sizes used in latches • Model 4 – Functional Units • E_avg = E_const + N_change × E_change • Energy complexity is independent of issue width
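The Model 1 RAM expressions can be coded directly as written. The parameter values below are placeholders chosen purely for illustration (only the 3.3 V supply matches the simulations); the actual technology parameters come from [5].

```python
def word_line_energy(vdd, n_bits, c_gate, w_pass_r, n_write, n_read,
                     w_pitch, c_metal):
    # E_wl = Vdd^2 * N_bits * (C_gate*W_pass,r + (2*N_write + N_read)*W_pitch*C_metal)
    return vdd ** 2 * n_bits * (c_gate * w_pass_r
                                + (2 * n_write + n_read) * w_pitch * c_metal)

def bit_line_energy(vdd, m_margin, v_sense, c_bl_read, n_bits):
    # E_bl = Vdd * M_margin * V_sense * C_bl,read * N_bits
    return vdd * m_margin * v_sense * c_bl_read * n_bits

# Illustrative numbers only: 64-bit word, 1 read port, 1 write port.
e_wl = word_line_energy(3.3, 64, 2e-15, 1.0, 1, 2, 1.0, 0.5e-15)
e_bl = bit_line_energy(3.3, 2.0, 0.2, 1e-13, 64)
print(e_wl, e_bl)
```

Note how both expressions grow with the port counts (N_write, N_read enter the word-line term; each extra port also widens the array), which is why the multiported register file scales badly with issue width.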

  8. Preliminary Simulation Results: E ~ (IW)^γ • Wrote own simulator, incorporating all the developed energy models (based on the SimpleScalar tool set) • Ran simulations for 5 superscalar designs, with IW ranging from 4 to 16 • Results show that total committed energy increases with IW, as wider processors rely on deeper speculation to exploit ILP • Energy/instruction grows linearly for all structures except the functional units (FUs) • Results were obtained for a 0.35-micron, Vdd = 3.3 V technology; FUs scale well with issue width, but RAMs, CAMs and long wires do not, and thus have to be LOW-POWER structures Table 1

  9. Problem Formulation • Energy-Delay Product • E × D = energy/operation × cycles/operation • E × D = (energy/cycle) / IPC^2 • With E ~ (IW)^γ and IPC ~ (IW)^α: E × D ~ (IW)^(γ-α) ~ (IPC)^((γ-α)/α) • Problem Definition • If α = 1, then E × D ~ (IPC)^(γ-1) ~ (IW)^(γ-1) • If α = 0.5, then E × D ~ (IPC)^(2γ-1) ~ (IW)^(γ-1/2) • Need new techniques to achieve more ILP than conventional superscalar design allows
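The scaling argument can be verified numerically. With E ~ IW^γ and IPC ~ IW^α, the energy-delay product grows as IW^(γ − α); the value γ = 1.5 below is an assumed number for illustration only.

```python
def energy_delay(iw, gamma, alpha):
    energy_per_op = iw ** gamma         # E ~ (IW)^gamma
    cycles_per_op = 1.0 / iw ** alpha   # 1/IPC, with IPC ~ (IW)^alpha
    return energy_per_op * cycles_per_op

gamma = 1.5
for alpha in (1.0, 0.5):
    # Going from a 4-wide to a 16-wide machine scales E*D by 4**(gamma - alpha).
    ratio = energy_delay(16, gamma, alpha) / energy_delay(4, gamma, alpha)
    print(alpha, ratio)   # → 1.0 2.0, then 0.5 4.0
```

The α = 0.5 case (diminishing ILP returns) pays twice the E × D penalty of the α = 1 case, which is the quantitative motivation for the decentralized architecture that follows.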

  10. Intermediary Recap • We have discussed • Superscalar processor design and terminology • Energy modeling of microarchitecture structures • Analysis of the energy-delay metric • Preliminary simulation results • We will introduce • General solution methodology • Previous decentralization schemes • Proposed strategy • Simulation results of the multicluster architecture • Conclusions

  11. General Design Solution • Decentralization of microarchitecture • Replace tightly coupled CPU with a set of clusters, each capable of superscalar processing • Can ideally reduce γ to zero, with good cluster partitioning techniques • Solution introduces the following issues • Additional paths for intercluster communication • Need for cluster assignment algorithms • Interaction of cluster with common memory system

  12. Previous Decentralized Solutions

  13. Proposed Multicluster Architecture (1) • Instead of tightly coupled CPUs, proposed architecture will involve a set of clusters, each containing: • instruction issue window • local physical register file • set of execution units • local memory disambiguation unit • one bank of interleaved data cache • Refer to figure 3 on next slide

  14. Proposed Multicluster Arch. (2) Figure 3

  15. Multicluster Architecture Details • Register Renaming and Instruction Steering • Each cluster is provided with a local physical RF • A Global Map Table maintains the mapping between architectural registers and physical registers • Cluster Assignment Algorithm • Tries to minimize • intercluster register dependencies • delay through the cluster assignment logic • The whole-graph solution is NP-complete, therefore near-optimal solutions are devised by a divide-and-conquer method • Intercluster Communication • Remote Access Window (RAW) used for remote RF calls • Remote Access Buffer (RAB) used to keep the remote source operand • One-cycle penalty incurred for a remote RF access
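The objective of the cluster assignment algorithm can be illustrated with a toy greedy sketch (the real algorithm is a near-optimal divide-and-conquer heuristic; the scoring rule and capacity limit here are invented for the example): keep an instruction close to the clusters that produce its source operands, and balance load otherwise.

```python
def assign_clusters(instrs, num_clusters=2, capacity=4):
    """instrs: list of (dest_reg, src_regs). Returns a cluster id per instr."""
    producer = {}                  # reg -> cluster that produces it
    load = [0] * num_clusters
    assignment = []
    for dest, srcs in instrs:
        # Score a cluster by how many source operands are local to it;
        # break ties toward the least-loaded cluster.
        def score(c):
            local = sum(1 for s in srcs if producer.get(s) == c)
            return (local, -load[c])
        candidates = [c for c in range(num_clusters) if load[c] < capacity]
        best = max(candidates, key=score)
        assignment.append(best)
        producer[dest] = best
        load[best] += 1
    return assignment

# A dependent chain stays on one cluster (no intercluster register
# dependencies); the independent instruction goes to the emptier cluster.
print(assign_clusters([(1, ()), (2, (1,)), (3, (2,)), (4, ())]))  # → [0, 0, 0, 1]
```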

  16. Multicluster Architecture Details (Cont’d) • Memory Dataflow • A centralized memory disambiguation unit does not scale with increasing issue width and bigger sizes of the load/store window • Proposed scheme: every cluster is provided with a local load/store window that is hardwired to a particular data cache bank • A bank predictor was developed because, at the decode stage, it is not yet known to which cluster the instruction will be routed • Stack Pointer (SP) References • An eager mechanism was realized for handling SP references • With a new reference to the SP, an entry is allocated in the RAB • Upon instruction completion, results are written into the RF and the RAB • The RAB entry is not freed after an instruction reads its contents • The RAB entry is freed only when a new SP reference commits
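A bank predictor of the simplest kind can be sketched as a last-value table indexed by the instruction's PC (this is a generic illustration; the paper's actual predictor design and the 8-byte interleaving granularity assumed below are in [3]):

```python
class BankPredictor:
    def __init__(self, num_banks=4):
        self.num_banks = num_banks
        self.table = {}                    # pc -> last observed bank

    def predict(self, pc):
        # At decode, guess the bank this memory instruction touched last
        # (default to bank 0 on a cold miss).
        return self.table.get(pc, 0)

    def update(self, pc, address):
        # After the address is computed, record the true interleaved bank.
        self.table[pc] = (address // 8) % self.num_banks

p = BankPredictor()
p.update(pc=0x40, address=0x10)            # true bank: (0x10 // 8) % 4 = 2
print(p.predict(0x40))                     # → 2
```

A misprediction costs an intercluster data transfer over the shared bus, which is why the results section tracks bus traffic due to bank mispredictions.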

  17. Results and Analysis • A single address transfer bus is sufficient for handling intercluster address transfers • A single bus is used to handle intercluster data transfers arising from bank mispredictions • 4-6 entries are used in the RAB for low-power • 2 extra entries are sufficient in the RAB for SP refs. • Intercluster traffic is reduced by 20 % and performance improved by 3 % using SP eager mechanism • Multicluster architecture showed 20 % better performance than the best configurations with centralized architectures, with a 50 % reduction in power dissipation

  18. Conclusions • Main Result of Work Using this architecture will allow the development of high-performance processors while keeping the microarchitecture energy-efficient, as proven by the energy-delay product • Main Contribution of Work A methodology for doing energy-efficiency analysis was derived for use with the next generation high-performance decentralized superscalar processors • Other Major Contributions • Opened analyst’s eyes to the 3-D IPC-area-energy space • A roadmap for future high-performance low-power microprocessor development has been proposed • Coined the energy-efficient family concept, composed of equally optimal energy-efficient configurations

  19. References (1) [1] M.J. Flynn, “Very High-Speed Computing Systems,” Proceedings of the IEEE, vol. 54, no. 12, December 1966, pp. 1901-1909. [2] C. Hamacher, Z. Vranesic and S. Zaky, “Computer Organization,” fifth edition, McGraw-Hill: New York, 2002. [3] V. Zyuban and P. Kogge, “Inherently Lower-Power High-Performance Superscalar Architectures,” IEEE Transactions on Computers, vol. 50, no. 3, March 2001, pp. 268-285. [4] E. Rotenberg, “AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors,” Proceedings of the 29th Fault-Tolerant Computing Symposium, June 1999. [5] V. Zyuban, “Inherently Lower-Power High-Performance Superscalar Architectures,” PhD thesis, Univ. of Notre Dame, Mar. 2000. [6] R. Colwell et al., “A VLIW Architecture for a Trace Scheduling Compiler,” IEEE Trans. Computers, vol. 37, no. 8, pp. 967-979, Aug. 1988. [7] M. Franklin and G.S. Sohi, “The Expandable Split Window Paradigm for Exploiting Fine-Grain Parallelism,” Proc. 19th Ann. Int’l Symp. Computer Architecture, May 1992.

  20. References (2) [8] S. Vajapeyam and T. Mitra, “Improving Superscalar Instruction Dispatch and Issue by Exploiting Dynamic Code Sequences,” Proc. 24th Ann. Int’l Symp. Computer Architecture, June 1997. [9] K. Farkas, P. Chow, N. Jouppi, and Z. Vranesic, “The Multicluster Architecture: Reducing Cycle Time through Partitioning,” Proc. 30th Ann. Int’l Symp. Microarchitecture, Dec. 1997. [10] S. Palacharla, N. Jouppi and J. Smith, “Complexity-Effective Superscalar Processors,” Proc. 24th Ann. Int’l Symp. Computer Architecture, pp. 206-218, June 1997. [11] K. Hwang, “Advanced Computer Architecture: Parallelism, Scalability, Programmability,” McGraw-Hill: New York, 1993.

  21. Questions/Comments ?
