Presented by: Nick Kirchem Feb 13, 2004

Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
Luiz A. Barroso et al. (Compaq Computer Corporation)

Target and Motivation
• Commercial applications (databases, OLTP)
    • Most important market for high-performance servers
    • Data-dependent computation (low ILP)
    • Little gained from complex multiple-issue, out-of-order processors
• Complexity of current processors
    • Long design times
    • High development costs
• Better use of transistors?

Project Goals
• Design a Chip Multiprocessing (CMP) system
    • Integrate 8 simple processor cores on a single chip
    • Exploit thread-level parallelism instead of ILP
• High performance, low cost
    • Achieve superior performance on commercial workloads
    • Small team, modest investment, short design time

Architecture Overview

Architecture Elements
• Simple processors (500 MHz, in-order)
• No I/O capability on chip (separate I/O nodes)
• Up to 1024 nodes in a system
• Individual L1 caches (64 KB, 2-way set-associative)
• One logical L2 cache, interleaved, 1 MB
• Intra-Chip Switch
    • Unidirectional crossbar
    • Transaction based, atomic transfers
    • Bandwidth ~3x memory bandwidth
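The cache sizes above can be checked with a little arithmetic. The sketch below reproduces the 1 MB aggregate L1 figure and shows one plausible way an interleaved L2 could map addresses to banks. The split into separate instruction and data L1s, the bank count, and the line size are assumptions made for illustration; the slides give only the 64 KB L1 size, the 8 cores, and the 1 MB interleaved L2.

```python
KB = 1024

NUM_CORES = 8        # from the slides
L1_SIZE = 64 * KB    # per L1 cache, from the slides
L2_SIZE = 1024 * KB  # one logical L2, from the slides

L1_PER_CORE = 2      # ASSUMPTION: separate instruction and data L1 per core
NUM_L2_BANKS = 8     # ASSUMPTION: one L2 bank per core
LINE_BYTES = 64      # ASSUMPTION: cache line size


def aggregate_l1_bytes():
    """Total L1 capacity across all cores."""
    return NUM_CORES * L1_PER_CORE * L1_SIZE


def l2_bank(addr):
    """Pick an L2 bank by interleaving on the low-order line-address bits."""
    return (addr // LINE_BYTES) % NUM_L2_BANKS


if __name__ == "__main__":
    print("aggregate L1 (KB):", aggregate_l1_bytes() // KB)
    for addr in (0, 64, 128, 512):
        print(hex(addr), "-> bank", l2_bank(addr))
```

Under these assumptions, consecutive cache lines land in consecutive banks, which spreads traffic from the eight cores across the L2.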

Intra-Chip Cache Coherence
• MESI protocol
• No inclusion (1 MB aggregate L1, 1 MB L2)
    • But L2 holds a copy of the L1 tags and state (no snooping required at L1)
    • L1 filled directly from memory (L2 = victim cache)
• Coherence handled by the L2 controllers
    • Can service a request directly, forward it to the owner L1, forward it to a protocol engine, or obtain the line from memory
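The four ways an L2 controller can handle an L1 miss can be sketched as a small decision procedure. This is a toy model under stated assumptions, not the Piranha implementation: it tracks a single owner per line rather than full MESI state, the class and method names are hypothetical, the order of the checks is a guess, and the `is_local_address` callback stands in for the home/remote address check.

```python
from enum import Enum


class Action(Enum):
    SERVICE_FROM_L2 = "service directly from L2"
    FORWARD_TO_OWNER_L1 = "forward to owner L1"
    FORWARD_TO_PROTOCOL_ENGINE = "forward to protocol engine"
    FETCH_FROM_MEMORY = "obtain from memory"


class L2Controller:
    """Toy L2 controller (hypothetical names; MESI states omitted)."""

    def __init__(self, is_local_address):
        self.l2_lines = set()   # lines currently held in the L2 (L1 victims)
        self.l1_owner = {}      # duplicate L1 tags: line -> owning core
        self.is_local_address = is_local_address

    def handle_l1_miss(self, core, line):
        owner = self.l1_owner.get(line)
        if owner is not None and owner != core:
            # The duplicate tags show another L1 holds the line.
            action = Action.FORWARD_TO_OWNER_L1
        elif line in self.l2_lines:
            # ASSUMPTION: the line moves to the L1, keeping the L2 non-inclusive.
            self.l2_lines.discard(line)
            action = Action.SERVICE_FROM_L2
        elif self.is_local_address(line):
            # L1 is filled straight from memory; the L2 is not filled here.
            action = Action.FETCH_FROM_MEMORY
        else:
            action = Action.FORWARD_TO_PROTOCOL_ENGINE
        # ASSUMPTION: ownership always transfers to the requester.
        self.l1_owner[line] = core
        return action

    def handle_l1_eviction(self, core, line):
        """An evicted L1 line becomes an L2 victim entry."""
        if self.l1_owner.get(line) == core:
            del self.l1_owner[line]
            self.l2_lines.add(line)


if __name__ == "__main__":
    ctrl = L2Controller(is_local_address=lambda line: line < 100)
    print(ctrl.handle_l1_miss(0, 5).value)
    print(ctrl.handle_l1_miss(1, 5).value)
    ctrl.handle_l1_eviction(1, 5)
    print(ctrl.handle_l1_miss(2, 5).value)
    print(ctrl.handle_l1_miss(0, 500).value)
```

Because the controller consults its own copy of the L1 tags, it can route the request without any L1 having to snoop, which is the point of the duplicate-tag bullet above.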

Inter-Node Coherence
• Protocol engines (microprogrammable controllers)
    • Home: exports local memory
    • Remote: imports remote memory
• Directory storage
    • Compute ECC at coarse granularity and use the freed bits for directory info, so there is no memory space overhead
    • Directory granularity = 1 node (not individual processor)
• Interconnect: I/O queues, router (point-to-point, 4 links)
• No NAKs: avoid deadlock by sufficient buffering, and guarantee that forwarded requests can be serviced
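The directory-storage trick is easier to see with numbers. The sketch below counts SEC-DED check bits for a cache line protected at a fine versus a coarse granularity; the difference is what becomes available for directory state. The 64-byte line and the 64-bit and 256-bit granularities are assumptions chosen for illustration, since the slides do not give them, and the use of a Hamming-style SEC-DED code is likewise an assumption.

```python
def secded_check_bits(data_bits):
    """Check bits for a SEC-DED code protecting `data_bits` of data.

    Single-error correction needs the smallest r with 2**r >= data_bits + r + 1;
    one extra overall parity bit adds double-error detection.
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


def freed_directory_bits(line_bits=512, fine=64, coarse=256):
    """Bits freed per line by moving ECC from `fine` to `coarse` granularity.

    ASSUMPTION: 512-bit (64-byte) lines, 64-bit vs 256-bit ECC words.
    """
    fine_total = (line_bits // fine) * secded_check_bits(fine)
    coarse_total = (line_bits // coarse) * secded_check_bits(coarse)
    return fine_total - coarse_total


if __name__ == "__main__":
    print("check bits per 64-bit word: ", secded_check_bits(64))
    print("check bits per 256-bit word:", secded_check_bits(256))
    print("bits freed per 64-byte line:", freed_directory_bits())
```

Under these assumptions the ECC storage already present in memory is large enough to hold the directory entry, which is why no additional memory is needed.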

Performance Evaluation
• OLTP and DSS workloads: TPC-B/D, Oracle database
• SimOS-Alpha environment
• Compared:
    • Piranha (P8) @ 500 MHz and Full-Custom (P8F) @ 1.25 GHz
    • Next-generation microprocessor (OOO) @ 1 GHz
• Single-chip evaluation
    • OOO outperforms P1 (individual processor) by 2.3x
    • P8 outperforms OOO by 3x
    • Speedup of P8 over P1 = 7x
• Multi-chip configurations
    • Four chips (only 4 CPUs per chip?!)
    • Results show that Piranha scales better than OOO
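The three single-chip figures can be cross-checked against each other. The short sketch below uses only the numbers quoted on the slide; the variable and function names are illustrative, and the "parallel efficiency" metric is an added convenience, not something the slide reports.

```python
OOO_OVER_P1 = 2.3          # OOO vs. a single Piranha core, from the slide
P8_OVER_OOO = 3.0          # eight-core Piranha vs. OOO, from the slide
P8_OVER_P1_REPORTED = 7.0  # reported speedup of P8 over P1, from the slide
NUM_CORES = 8


def implied_p8_over_p1():
    """P8-over-P1 speedup implied by chaining the two OOO comparisons."""
    return OOO_OVER_P1 * P8_OVER_OOO


def parallel_efficiency():
    """Fraction of ideal linear speedup achieved by the eight cores."""
    return P8_OVER_P1_REPORTED / NUM_CORES


if __name__ == "__main__":
    print(f"implied P8/P1 speedup: {implied_p8_over_p1():.1f}x")
    print(f"parallel efficiency:   {parallel_efficiency():.1%}")
```

The chained figure comes out close to the reported 7x, so the three bullets are mutually consistent.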

Questions/Concerns
• Would the Piranha design be worthwhile if there were a well-designed SMT processor (with 4 or 8 threads)?
• Is reliability better or worse with multiple processors per chip?
• Power consumption?
