CMP/CMT Scaling of SPECjbb2005 on UltraSPARC T1 (Niagara)

10th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-10)
CMP/CMT Scaling of SPECjbb2005 on UltraSPARC T1 (Niagara)
Dimitris Kaseridis and Lizy K. John
The University of Texas at Austin, Laboratory for Computer Architecture
http://lca.ece.utexas.edu

Outline
• Brief Description of UltraSPARC T1
• Objectives
• SPECjbb2005 Benchmark
• Results

UltraSPARC T1
• A new multithreaded processor that combines CMP and SMT into CMT
• 8 cores, each handling 4 hardware thread contexts: 32 active hardware threads in total
• Simple in-order pipeline per core, with no branch prediction unit
• Optimized for multithreaded performance, i.e. throughput
• High throughput is achieved by hiding memory and pipeline stalls/latencies: other threads are scheduled with a zero-cycle thread switch penalty
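The latency-hiding argument on this slide can be illustrated with a very simple analytical model. This is a minimal sketch, not a model of the real T1 pipeline: it assumes each thread alternates between a fixed number of compute cycles and a fixed number of stall cycles, that thread switches are free, and that the cycle counts used below are hypothetical rather than measured.

```python
# Idealized latency-hiding model (illustrative assumptions, not T1 measurements).

CORES = 8
THREADS_PER_CORE = 4
HARDWARE_THREADS = CORES * THREADS_PER_CORE  # 32 hardware thread contexts


def core_utilization(threads: int, compute_cycles: float, stall_cycles: float) -> float:
    """Fraction of cycles in which one core issues an instruction.

    Each thread is busy compute_cycles out of every
    (compute_cycles + stall_cycles) cycles; with a zero-cycle switch
    penalty, other threads fill the stall slots until the core saturates.
    """
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads * per_thread)


if __name__ == "__main__":
    # Hypothetical workload: 1 compute cycle followed by 3 stall cycles.
    for n in range(1, THREADS_PER_CORE + 1):
        u = core_utilization(n, compute_cycles=1, stall_cycles=3)
        print(f"{n} thread(s) per core: utilization {u:.2f}")
```

In this model, utilization grows linearly with the number of threads until the stalls are fully covered, after which extra threads add nothing. Real workloads fall short of this because threads also contend for the shared L1 cache, TLBs and execution units.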

SMP vs. CMT
[Figure: SMP vs. CMT]

UltraSPARC T1 Core Pipeline
• Thread group shares L1 cache, TLBs, execution units, pipeline registers and datapath
• Core area = 11 mm² (90 nm technology)
• 4-way MT adds ~20% area to the core

Objectives
• Evaluate CMP/CMT benefits
• Quantify the benefits that additional cores and/or additional hardware threads provide in a multithreaded environment
• Show the effectiveness of latency hiding

SPECjbb2005 Benchmark
Characteristics:
• Models a self-contained 3-tier system: server, database and clients
• Every warehouse is a collection of Java objects with ~25 MB of data
• Each client is represented by an individual thread
• No I/O effects
• Reported score: business operations per second (BOPS)
• Targets the performance of CPUs, caches, the memory hierarchy and the scalability of shared-memory processors
• Stresses the implementations of the JVM (Java Virtual Machine), the JIT (Just-In-Time) compiler, garbage collection and threads
[Figure: SPECjbb2005 3-tier architecture]

Experimental parameters
[Table: parameters]

Measurement Methodology
• On-chip performance counters for real/accurate results
• Niagara: Solaris 10 tools cpustat and cputrack
• 2 counters per hardware thread, one of which is dedicated to instruction count
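One way to turn the per-thread instruction counts into IPC is sketched below. This is an illustration under assumptions, not the tooling used in the talk: the function names are hypothetical, the cycle count is derived from an assumed sampling interval and clock frequency, and parsing of actual cpustat or cputrack output is left out.

```python
# Sketch: IPC from an instruction counter sampled over a fixed interval.
# The interval length and clock frequency are inputs the user must supply.

def ipc(instr_count: int, elapsed_seconds: float, clock_hz: float) -> float:
    """Instructions per cycle for one hardware thread over one interval."""
    cycles = elapsed_seconds * clock_hz
    if cycles <= 0:
        raise ValueError("interval and clock frequency must be positive")
    return instr_count / cycles


def core_ipc(per_thread_counts, elapsed_seconds: float, clock_hz: float) -> float:
    """Aggregate IPC of a core: the sum of its hardware threads' IPC."""
    return sum(ipc(c, elapsed_seconds, clock_hz) for c in per_thread_counts)


if __name__ == "__main__":
    # Hypothetical sample: four threads on one core, 1-second interval, 1 GHz clock.
    counts = [200_000_000, 210_000_000, 190_000_000, 205_000_000]
    print(f"core IPC: {core_ipc(counts, elapsed_seconds=1.0, clock_hz=1e9):.3f}")
```

Since each core issues at most one instruction per cycle, the aggregate IPC of a core computed this way should not exceed 1.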

Results – Latency hiding pays off
[Figures: "Single-thread execution on T1" and "Single-core execution using 4 threads on one core"; SPECjbb score (BOPS) vs. number of warehouses]
• 4 threads on one core give a speedup of 2x instead of 4x

CMP / CMT Scaling – CMP benefits
[Figure: 8 cores x 1 thread/core; SPECjbb score (BOPS) vs. number of warehouses]

CMP / CMT Scaling – CMT benefits
[Figure: 8 cores x 2 threads/core; SPECjbb score (BOPS) vs. number of warehouses]
• 75% of the benefit of adding a single core
• Significantly lower area and power requirements (remember that 4-way MT adds ~20% area to each core)
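The cost argument on this slide can be made concrete with a small piece of arithmetic. This is a rough sketch under two assumptions that are not stated in the talk: the benefit of adding one full core is normalized to 1.0 at an area cost of 1.0 core, and the area cost of the second thread is bounded above by the ~20% quoted for 4-way MT.

```python
# Benefit per unit of added area, normalized to "add one more core" = 1.0.

def benefit_per_area(relative_benefit: float, relative_area: float) -> float:
    """Throughput benefit obtained per unit of added core area."""
    if relative_area <= 0:
        raise ValueError("relative_area must be positive")
    return relative_benefit / relative_area


# Adding a core: full benefit for one full core of area (normalization).
ADD_CORE = benefit_per_area(1.0, 1.0)

# Adding a hardware thread: 75% of that benefit; area assumed to be at
# most the ~20% quoted for 4-way MT (an upper bound for a second thread).
ADD_THREAD = benefit_per_area(0.75, 0.20)

if __name__ == "__main__":
    print(f"add a core:   {ADD_CORE:.2f} benefit per unit area")
    print(f"add a thread: {ADD_THREAD:.2f} benefit per unit area")
```

Under these assumptions, multithreading delivers several times more benefit per unit of area than adding a core, which is the tradeoff the slide points at.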

CMP / CMT Scaling – SMT benefits
[Figure: 8 cores x 4 threads/core; SPECjbb score (BOPS) vs. number of warehouses]

CMP / CMT Scaling – SMT benefits
[Figure: SPECjbb score (BOPS) vs. number of warehouses]
• Going beyond 2 hardware threads per core gives an additional benefit of 45%
• Gradually diminishing returns in terms of SMT efficiency
• The garbage collector significantly affects regions 4 and 5

SPECjbb Score Scaling
[Figures: "IPC of three configurations" and "Best-case SPECjbb score speedup"; IPC and normalized SPECjbb score vs. number of virtual processors]
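A normalized score curve of the kind plotted here can be derived from raw scores as sketched below. The helper names are hypothetical, and the scores in the usage example are made-up placeholders, not results from the talk.

```python
# Sketch: normalize scores to a baseline and compute scaling efficiency.

def normalized_scores(scores: dict, baseline: int = 1) -> dict:
    """Divide every score by the score of the baseline configuration."""
    base = scores[baseline]
    return {n: s / base for n, s in scores.items()}


def scaling_efficiency(speedups: dict) -> dict:
    """Speedup divided by the number of virtual processors (1.0 = linear)."""
    return {n: s / n for n, s in speedups.items()}


if __name__ == "__main__":
    # Placeholder values keyed by number of virtual processors.
    raw = {1: 100.0, 8: 700.0, 16: 1200.0}
    speedups = normalized_scores(raw)
    for n, eff in scaling_efficiency(speedups).items():
        print(f"{n:2d} virtual processors: speedup {speedups[n]:.2f}, efficiency {eff:.2f}")
```

An efficiency that falls as virtual processors are added is what the diminishing returns of higher SMT levels look like in this form.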

Conclusions
• Throughput vs. latency in multiprocessing/multithreaded environments
• Latency hiding is a promising alternative to aggressive speculation
• Adding SMT can give up to 75% of the benefit of CMP at significantly lower cost
• Moving to higher levels of SMT shows diminishing returns, which implies tradeoffs between the number of cores and the number of hardware threads per core

Thank you… Questions?
The Laboratory for Computer Architecture
Web site: http://lca.ece.utexas.edu
