
Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing



Presentation Transcript


  1. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing Barroso, Gharachorloo, McNamara, et al. Proceedings of the 27th Annual ISCA, June 2000 Presented by Garver Moore ECE259 Spring 2006 Professor Daniel Sorin

  2. Motivation • Economic: high demand for OLTP (online transaction processing) machines • Disconnect between the industry's ILP focus and this demand • OLTP workloads have high memory latency, little ILP (get, process, store), and large TLP • OLTP is poorly served by aggressive ILP machines • Use simple "old" cores and an ASIC design methodology for "glueless," scalable OLTP machines with low development cost and short time to market • Amdahl's Law favors many simple cores when the parallel fraction is large (see the worked example below)
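A worked Amdahl's Law example makes the point concrete; the parallel fraction f = 0.95 here is an illustrative assumption, not a number from the paper:

$$S(n) = \frac{1}{(1-f) + f/n}, \qquad S(8) = \frac{1}{0.05 + 0.95/8} \approx 5.9$$

With abundant TLP, eight simple cores recover far more performance than aggressive ILP can extract from a single instruction stream, even though each core is individually slower.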

  3. The Piranha Processing Node* CPU: simple Alpha core (ECE152-level complexity) with a single-issue, in-order, 8-stage pipeline. 180 nm process (2000). Almost entirely ASIC design: roughly 50% of the clock speed at 200% of the area versus a full-custom methodology. Separate I/D L1 caches for each CPU. Logically shared, interleaved L2 cache (see the bank-mapping sketch below). Eight memory controllers interface to banks of up to 32 Rambus DRAM chips, for an aggregate peak bandwidth of 12.8 GB/sec. *Figure directly from Barroso et al.
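A minimal sketch of how a logically shared, interleaved L2 might map addresses to banks; the line size and mask/shift values are illustrative assumptions, not Piranha's documented parameters:

```c
#include <stdint.h>

#define LINE_BITS 6   /* assumed 64-byte cache lines */
#define NUM_BANKS 8   /* one L2 bank per on-chip CPU */

/* Interleave cache lines across the eight L2 banks: consecutive
 * lines map to consecutive banks, so independent requests from the
 * eight CPUs can be serviced by different banks in parallel. */
static inline unsigned l2_bank(uint64_t paddr)
{
    return (unsigned)((paddr >> LINE_BITS) % NUM_BANKS);
}
```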

  4. Communication Assist + Home Engine and Remote Engine support shared memory across multiple nodes (see the dispatch sketch below) + System Control handles system miscellany: interrupts, exceptions, initialization, monitoring, etc. + OQ, Router, IQ, and Switch are standard + Total inter-node I/O bandwidth: 32 GB/sec + Each link and block here corresponds to actual wiring and a real module, which allows rapid parallel development under a semi-custom design methodology and also facilitates multiple clock domains + Note: the processing node has NO inherent I/O capability
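A hedged sketch of how inter-node coherence traffic might be steered between the two protocol engines; the types and helper names are hypothetical, but the split follows the slide: the home engine serves memory homed at this node, the remote engine handles everything else:

```c
#include <stdint.h>

typedef struct { uint64_t addr; int src_node; } coh_msg_t;

/* Hypothetical helpers, standing in for the real hardware paths. */
int  is_local_home(uint64_t addr);       /* is the line's home this node? */
void home_engine_handle(coh_msg_t *m);   /* exports local memory to other nodes */
void remote_engine_handle(coh_msg_t *m); /* handles this node's remote requests */

/* Steer a coherence message to the appropriate protocol engine. */
void ca_dispatch(coh_msg_t *msg)
{
    if (is_local_home(msg->addr))
        home_engine_handle(msg);
    else
        remote_engine_handle(msg);
}
```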

  5. I/O Organization + Smaller than the processing node + Router has only 2 links, which removes the need for a routing table (see the sketch below) + Memory is globally visible and part of the coherence scheme + CPU placement is optimized for drivers, translation, and other code with low-latency access needs to I/O + The re-used dL1 design provides the interface to PCI/X + Supports an arbitrary I/O-to-processor ratio and arbitrary network topology + Glueless scaling up to 1024 nodes of either type supports application-specific customization
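With exactly two links, forwarding reduces to a constant-time decision rather than a table lookup; a minimal illustrative sketch (the port names and topology assumption are mine, not the paper's):

```c
enum port { PORT_LOCAL, PORT_LINK_A, PORT_LINK_B };

/* A two-link router needs no routing table: a packet either
 * terminates at this node or leaves on the link it did not
 * arrive on. */
enum port io_route(int dest_node, int my_node, enum port arrived_on)
{
    if (dest_node == my_node)
        return PORT_LOCAL;
    return (arrived_on == PORT_LINK_A) ? PORT_LINK_B : PORT_LINK_A;
}
```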

  6. Coherence: Local + Each L2 bank and its associated controller holds the directory data for intra-chip requests, forming a centralized on-chip directory + The intra-chip switch (ICS) is responsible for all on-chip communication + The L2 is non-inclusive of the L1s: it acts as a "large victim buffer" for them and keeps duplicate copies of the L1 tags and state + The L2 controller can determine whether data is cached remotely, and if so exclusively, so the majority of L1 requests require no CA assistance + On a request, the L2 can service it directly, forward it to the owner L1, forward it to a protocol engine, or fetch the line from memory (see the sketch below) + While a forward is outstanding, the L2 blocks conflicting requests to the same line
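A hedged sketch of the L2 controller's four-way decision as the slide describes it; the state encoding and helper names are assumptions for illustration, not Piranha's actual interfaces:

```c
#include <stdint.h>

typedef struct { uint64_t addr; int requester; } request_t;
typedef enum { L2_HIT, OWNED_BY_L1, OFF_CHIP, IN_MEMORY } l2_state_t;

/* Hypothetical helpers for the four service paths. */
l2_state_t l2_lookup(uint64_t addr);  /* consults L2 tags + duplicated L1 tags */
void serve_from_l2(request_t *r);
void forward_to_owner_l1(request_t *r);
void forward_to_protocol_engine(request_t *r);
void fetch_from_memory(request_t *r);

/* The four-way decision from the slide: service the request from this
 * bank, forward it to the owning L1, hand it to a protocol engine for
 * off-chip handling, or fetch the line from local memory. */
void l2_handle(request_t *req)
{
    switch (l2_lookup(req->addr)) {
    case L2_HIT:      serve_from_l2(req);              break;
    case OWNED_BY_L1: forward_to_owner_l1(req);        break;
    case OFF_CHIP:    forward_to_protocol_engine(req); break;
    case IN_MEMORY:   fetch_from_memory(req);          break;
    }
}
```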

  7. Coherence: Global • Trades ECC granularity for "free" directory storage: computing ECC at 4x granularity leaves 44 bits per 64-byte line for directory state (see the arithmetic below) • Invalidation-based distributed directory protocol • Some optimizations: • No NACKing: deadlock is avoided through I/O, L, and H priority virtual lanes (L: requests to the home node, low priority; H: forwarded requests and replies, high priority) • Targets are also guaranteed to service forwards: e.g. an owner writing back to home holds the data until the home acknowledges • This removes NACK/retry traffic, as well as "ownership change" (DASH), retry counts (Origin), and "no, seriously" re-requests (Token) • Routing old messages toward empty buffers gives a buffer requirement linear in N; buffer space is shared among lanes, and "CMI" invalidations avoid deadlock
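The "free directory" arithmetic can be reconstructed under standard SEC-DED assumptions (the paper's exact code may differ): a 64-byte line carries 512 data bits; protecting each 64-bit word separately costs 8 check bits per word, while one code over each 256-bit block (4x granularity) costs only 10 check bits per block:

$$\underbrace{8 \times 8}_{\text{8-bit ECC per 64-bit word}} - \underbrace{2 \times 10}_{\text{10-bit ECC per 256-bit block}} = 64 - 20 = 44 \ \text{bits freed per 64-byte line}$$

Those 44 bits hold the directory state alongside the data in ordinary memory, so no separate directory storage is needed.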

  8. Evaluation Methodology • Admittedly favorable OLTP benchmarks chosen (modified TPC-B and TPC-D) • Simulated and compared against an aggressive OOO core (Alpha 21364) with integrated coherence and cache hardware • Results "fudged" to estimate the effect of a full-custom implementation • Four configurations: P1 (single-core Piranha @ 500 MHz), INO (1 GHz single-issue in-order aggressive core), OOO (4-issue, 1 GHz out-of-order core), and P8 (the eight-core Piranha system as specified)

  9. Results

  10. Questions/Discussion • Deadlock avoidance w/o NACK • CMP vs SMP • “Fishy” evaluation methodology? • Specialized computing • Buildability?
