
Core to Memory Interconnection Implications for Forthcoming On-Chip Multiprocessors


  1. Core to Memory Interconnection Implications for Forthcoming On-Chip Multiprocessors Carmelo Acosta 1, Francisco J. Cazorla 2, Alex Ramírez 1,2, Mateo Valero 1,2 (1 UPC-Barcelona, 2 Barcelona Supercomputing Center)

  2. Overview • Introduction • Simulation Methodology • Results • Conclusions

  3. Introduction • As process technology advances, the key question becomes what to do with the growing transistor budget. • The current trend is to replicate cores. • Intel: Pentium 4, Core Duo, Core 2 Duo, Core 2 Quad • AMD: Opteron Dual-Core, Opteron Quad-Core • IBM: POWER4, POWER5 • Sun Microsystems: Niagara T1, Niagara T2

  4. Introduction • Power4 (CMP) and Power5 (CMP+SMT): the memory subsystem (shown in green in the die photos) spreads over more than half of the chip area.

  5. Introduction • Each core's L1 caches are connected to every L2 bank through a bus-based interconnection network.

  6. Goal • Is prior research in the SMT field directly applicable to the new CMP+SMT scenario? • NO… we have to revisit well-known SMT ideas, such as the instruction fetch policy.

  7. [Diagram: baseline SMT pipeline under the ICOUNT fetch policy, showing the Fetch stage and the ROB]

  8. [Diagram: ICOUNT under an L2 miss; fetch for the missing thread is stalled] • The processor's resources are balanced between the running threads. • All resources devoted to the blue (missing) thread remain unused until the L2 miss is resolved.
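The resource-balancing idea behind ICOUNT can be sketched as follows (a minimal illustration in Python, not SMTsim code; the function name and counter representation are hypothetical):

```python
def icount_pick(inflight_counts):
    """Pick the thread to fetch from this cycle under ICOUNT.

    inflight_counts: per-thread number of instructions in the front-end
    and issue queues. Fetch priority goes to the thread with the fewest
    in-flight instructions, balancing resources between threads. A thread
    blocked on an L2 miss keeps accumulating entries, so its count stays
    high and it stops fetching, yet the entries it already holds remain
    unused until the miss is resolved.
    """
    return min(range(len(inflight_counts)), key=lambda t: inflight_counts[t])
```

For example, with per-thread counts [12, 3] the policy fetches from thread 1, the thread with the lighter footprint; ties are broken toward the lowest thread id.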

  9. [Diagram: FLUSH triggered on an L2 miss] • All resources devoted to the pending instructions of the blue thread are freed.

  10. [Diagram: FLUSH after the squash; the blue thread's fetch stays stalled] • The freed resources allow the remaining threads to make additional forward progress. • L2 misses are detected late, which motivates L2 miss prediction.
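The FLUSH mechanism on slides 9 and 10 can be sketched like this (a hypothetical Python illustration, assuming the ROB is modeled as a list of (thread_id, instruction) pairs):

```python
def flush_on_l2_miss(rob, fetch_enabled, thread):
    """React to a detected L2 miss of `thread` under the FLUSH policy.

    All pending instructions of the offending thread are squashed from
    the ROB and its fetch is stalled, so the freed entries can be used
    by the other threads while the miss is serviced. Returns the number
    of squashed instructions.
    """
    before = len(rob)
    rob[:] = [entry for entry in rob if entry[0] != thread]  # in-place squash
    fetch_enabled[thread] = False                            # stall fetch
    return before - len(rob)
```

When the miss returns, the simulator would re-enable fetch for the thread and re-fetch the squashed instructions; that recovery path is omitted here.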

  11. Single vs. Multi Core [Diagram: one core with private I$/D$ vs. several cores, all connected to four shared L2 banks (L2 b0–L2 b3)] • More pressure on both: • the interconnection network • the shared L2 banks

  12. Single vs. Multi Core [Diagram: same single-core vs. multi-core comparison] • More unpredictable L2 access latency: BAD for FLUSH.

  13. Overview • Introduction • Simulation Methodology • Results • Conclusions

  14. Simulation Methodology [Diagram: four-core configuration sharing four L2 banks] • Trace-driven SMT simulator derived from SMTsim. • C2T2, C3T2, and C4T2 multicore configurations (CXTY, where X = number of cores and Y = threads per core). • [Table: core details (* per thread)]

  15. Simulation Methodology • Instruction fetch policies: • ICOUNT • FLUSH • Workloads classified by type: • ILP → all threads have good memory behavior. • MEM → all threads have bad memory behavior. • MIX → mixes both types of threads.
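The workload classification above reduces to a per-thread test (a sketch; the criteria used to label an individual thread as memory bound are not given in the slides):

```python
def classify_workload(mem_bound_flags):
    """Classify a multithreaded workload from per-thread behavior.

    mem_bound_flags: one boolean per thread, True if the thread has bad
    memory behavior (memory bound), False if it is an ILP thread.
    """
    if all(mem_bound_flags):
        return "MEM"   # all threads memory bound
    if not any(mem_bound_flags):
        return "ILP"   # all threads with good memory behavior
    return "MIX"       # a mixture of both thread types
```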

  16. Overview • Introduction • Simulation Methodology • Results • Conclusions

  17. Results: Single-Core (2 threads) • FLUSH yields a 22% average speedup over ICOUNT. • The gains come mainly from MEM and MIX workloads.

  18. Results: Multi-Core (2 threads/core) • More cores → less speedup. • FLUSH drops to a 9% average slowdown relative to ICOUNT on a four-core configuration.

  19. Results: L2 Hit Latency on Multi-Core • More cores → higher latency and higher dispersion. [Plot: L2 hit latency in cycles for each configuration]

  20. Results: L2 miss prediction • In this four-core example, the best choice is to predict an L2 miss after 90 cycles.

  21. Results: L2 miss prediction • But in this other four-core example, the best choice is not to predict L2 misses at all.
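Because a multi-core L2 hit no longer has a fixed latency, FLUSH can only guess whether an outstanding access has missed. A simple threshold predictor in the spirit of the 90-cycle example above might look like this (a hypothetical sketch, not the simulator's actual mechanism):

```python
def l2_miss_predicted(cycles_outstanding, threshold=90):
    """Predict an L2 miss once an access has been outstanding for more
    than `threshold` cycles; FLUSH would then be triggered on the
    prediction rather than on the (late) actual miss detection. With
    threshold=None, misses are never predicted, which slides 20-21 show
    can be the better choice for some four-core workloads.
    """
    if threshold is None:
        return False
    return cycles_outstanding > threshold
```

The tension the two result slides expose is that no single threshold works everywhere: too low, and slow L2 hits are squashed needlessly; too high (or disabled), and resources stay clogged behind real misses.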

  22. Overview • Introduction • Simulation Methodology • Results • Conclusions

  23. Conclusions • Future high-degree CMPs open challenging new research topics in CMP+SMT cooperation. • The characteristics of the CMP's outer cache level and interconnection network may heavily affect intra-core SMT performance. • For example, FLUSH relies on a predictable L2 hit latency, which is heavily disrupted in a CMP+SMT scenario. • FLUSH drops from a 22% average speedup to a 9% average slowdown when moving from a single-core to a quad-core configuration.

  24. Thank you Questions?
