
CS 7810 Lecture 23



  1. CS 7810 Lecture 23: Maximizing CMP Throughput with Mediocre Cores. J. Davis, J. Laudon, K. Olukotun. Proceedings of PACT-14, September 2005.

  2. Niagara
  • Commercial servers require high thread-level throughput and suffer heavily from cache misses
  • Sun’s Niagara focuses on:
    • simple cores (low power, low design complexity, room for more cores on the die)
    • fine-grain multi-threading (to tolerate long memory latencies; see the sketch below)
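A rough way to see why fine-grain multi-threading pays off (a back-of-the-envelope sketch, not a calculation from the paper; the latency and compute-interval numbers are assumed): the core stays busy through one thread's miss if the remaining threads can supply enough work to cover it.

```python
import math

# Hypothetical model: a thread misses and stalls for MISS_LATENCY cycles while
# the other threads each supply COMPUTE cycles of useful work in round-robin.
# The miss is fully hidden when (threads - 1) * COMPUTE >= MISS_LATENCY.
MISS_LATENCY = 120   # assumed DRAM round-trip, in core cycles
COMPUTE = 30         # assumed useful cycles per thread between misses

threads_needed = 1 + math.ceil(MISS_LATENCY / COMPUTE)
print(f"threads needed to hide a {MISS_LATENCY}-cycle miss: {threads_needed}")  # -> 5
```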

  3. Niagara Overview

  4. SPARC Pipe
  • No branch predictor
  • Low clock speed (1.2 GHz)
  • One FP unit shared by all cores

  5. Thread Selection
  • Round-robin among available threads (see the sketch below)
  • Threads that are speculating on a load hit receive lower priority
  • Threads are unavailable while they suffer from cache misses or long-latency ops
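A minimal sketch of this selection policy (a hypothetical implementation for illustration, not Niagara's actual selection logic):

```python
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    available: bool = True          # False while waiting on a cache miss / long-latency op
    speculating_load: bool = False  # True while issuing past a load assumed to hit

def select_thread(threads, last_tid):
    """Pick the next thread to issue from, round-robin starting at last_tid + 1."""
    n = len(threads)
    order = [threads[(last_tid + 1 + i) % n] for i in range(n)]
    ready = [t for t in order if t.available]
    for t in ready:                 # prefer ready threads that are not speculating
        if not t.speculating_load:
            return t
    return ready[0] if ready else None  # fall back to a speculating thread, else stall

threads = [Thread(0), Thread(1, available=False), Thread(2, speculating_load=True), Thread(3)]
print(select_thread(threads, last_tid=0).tid)   # -> 3 (skips 1, deprioritizes 2)
```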

  6. Register File
  • Each procedure has eight local and eight in registers (and eight out registers that serve as in registers for the callee) – each thread has eight such windows (the in/out overlap is sketched below)
  • Total register file size: 640 registers! 3 read and 2 write ports (1 write/cycle for long- and short-latency ops)
  • Implemented as a 2-level structure: the 1st level contains the current register windows
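The in/out overlap between adjacent windows can be made concrete with a small addressing sketch (an illustrative flat layout assuming 8 windows per thread and the convention that a call moves to window cwp+1; globals and Niagara's actual 2-level organization are not modeled):

```python
NWINDOWS = 8   # windows per thread, as on the slide

def flat_index(cwp, regclass, i):
    """Map (current window pointer, register class, index 0-7) to a flat
    per-thread register index; 'out' of window w aliases 'in' of window w+1."""
    assert 0 <= i < 8
    base = (cwp % NWINDOWS) * 16
    if regclass == "local":
        return base + i
    if regclass == "out":
        return base + 8 + i
    if regclass == "in":                       # alias of the previous window's outs
        return ((cwp - 1) % NWINDOWS) * 16 + 8 + i
    raise ValueError(regclass)

# The caller/callee overlap is visible directly: the caller's %o5 is the callee's %i5.
assert flat_index(cwp=3, regclass="out", i=5) == flat_index(cwp=4, regclass="in", i=5)
```

With this layout each window contributes only 16 unique registers (8 locals + 8 outs), so 8 windows come to 128 windowed registers per thread; across Niagara's four threads per core that is 512, with global register sets presumably accounting for the rest of the 640 quoted above.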

  7. Cache Hierarchy
  • 16KB L1I and 8KB L1D; write-through, read-allocate, write-no-allocate
  • Invalidate-based directory protocol – the shared L2 cache (3MB, 4 banks) identifies sharers and sends out the invalidates
  • Rather than store sharers per L2 line, the L1 tags are replicated at the L2 – such a structure is more efficient to search through (see the sketch below)
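The replicated-tag directory can be sketched as follows (hypothetical data structures for illustration, not Sun's implementation; the set/line geometry is assumed):

```python
NUM_CORES = 8          # Niagara's core count; the cache geometry below is assumed
L1_SETS = 128          # e.g. 8KB L1D / (16B lines * 4 ways) = 128 sets

# shadow_tags[core][set] -> tags currently resident in that core's L1
shadow_tags = [[set() for _ in range(L1_SETS)] for _ in range(NUM_CORES)]

def l1_index(addr, line_bytes=16):
    return (addr // line_bytes) % L1_SETS

def l1_tag(addr, line_bytes=16):
    return addr // (line_bytes * L1_SETS)

def record_fill(core, addr):
    """Update the shadow copy when a core's L1 allocates a line (read-allocate)."""
    shadow_tags[core][l1_index(addr)].add(l1_tag(addr))

def invalidate_sharers(writer, addr):
    """On a store, search the replicated tags for sharers and invalidate them."""
    idx, tag = l1_index(addr), l1_tag(addr)
    for core in range(NUM_CORES):
        if core != writer and tag in shadow_tags[core][idx]:
            shadow_tags[core][idx].discard(tag)   # i.e. send an invalidate to 'core'
```

The design point: searching a few copies of small L1 tag arrays is cheaper than storing and maintaining a sharer bit-vector on every line of a 3MB L2, since the L1s are tiny relative to the L2.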

  8. Next Generation: Rock
  • 4 cores; each core has 4 pipelines; each pipeline can execute two threads: 32 threads total

  9. Design Space Exploration: Methodology
  • Workloads: SPEC-JBB (Java middleware), TPC-C (OLTP), TPC-W (transactional web), XML-Test (XML parsing) – all are thread-oriented
  • Sun’s chip design databases were examined to derive area overheads of various features (primarily to evaluate the overhead of threading and out-of-order execution)

  10. Pipelines
  • 8-stage pipelines
  • Scalar processor is fine-grain multi-threaded; superscalar processor is SMT
  • Frequency not more than ½ of the max ITRS-projected frequency
  • 400mm² die: 25% devoted to off-chip interfaces (memory controllers, I/O, clocking), 11% to the inter-core xbar; of the remaining area, 25-75% is allocated to cores vs. L2 cache (worked out below)
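Working the slide's area budget through (numbers from the slide; the 25-75% core/L2 split is the parameter swept in the study):

```python
DIE = 400.0                         # mm^2, from the slide
off_chip = 0.25 * DIE               # memory controllers, I/O, clocking
xbar = 0.11 * DIE                   # inter-core crossbar
remaining = DIE - off_chip - xbar   # 256 mm^2 left for cores + L2

for core_frac in (0.25, 0.50, 0.75):
    cores = core_frac * remaining
    print(f"cores: {cores:5.1f} mm^2, L2: {remaining - cores:5.1f} mm^2")
# cores:  64.0 mm^2, L2: 192.0 mm^2
# cores: 128.0 mm^2, L2: 128.0 mm^2
# cores: 192.0 mm^2, L2:  64.0 mm^2
```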

  11. Area Effect of Multi-Threading
  • The curve is linear for a while – the study is restricted to such designs
  • Multi-threading adds a 5-8% area overhead per thread (primary caches are included in the baseline); see the cost model sketched below
  • A thread is statically assigned to an IDP – multiple threads can share an IDP
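The slide's overhead figure implies a simple linear cost model (an illustrative fit; the 6.5% midpoint and the baseline area below are assumptions, not the paper's numbers):

```python
def core_area(base_area_mm2, threads, per_thread_overhead=0.065):
    """Area of a core with `threads` hardware contexts. The first thread (and
    the primary caches) are included in base_area_mm2; each extra thread adds
    a fixed fraction, matching the linear region noted above."""
    return base_area_mm2 * (1 + per_thread_overhead * (threads - 1))

print(core_area(10.0, 4))   # a 4-thread core at 6.5%/thread -> 11.95 mm^2
```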

  12. Design Space Exploration

  13. Single Core IPC
  • The 4 bars correspond to 4 different L2 sizes; each bar shows the IPC range across different L1 sizes

  14. Aggregate IPC
  • C1: 2p4t with 64KB L1 caches
  • C2: 2p4t with 32KB L1 caches
  • *L1 latencies are always constant

  15. Maximal Aggregate IPCs

  16. Maximal Aggregate IPCs

  17. Observations
  • Scalar cores are better than out-of-order superscalars
  • Too many threads (> 8) can saturate the caches and memory buses
  • Processor-centric design is often better (medium-sized L2s are good enough)

  18. PACT 2001 Paper on CMP Designs
  • Different workload: SPEC2k (multi-programmed)
  • Private L2 caches (no cache coherence)

  19. Effect of L2 Size

  20. Effect of Memory Bandwidth

  21. Optimal Configurations
