
Performance Analysis of Multiple Threads/Cores Using the UltraSPARC T1 (Niagara)

  1. Dimitris Kaseridis & Lizy K. John The University of Texas at Austin Laboratory for Computer Architecture http://lca.ece.utexas.edu Performance Analysis of Multiple Threads/Cores Using the UltraSPARC T1 (Niagara) Unique Chips and Systems (UCAS-4)

  2. Outline • Brief Description of UltraSPARC T1 Architecture • Analysis Objectives / Methodology • Analysis of Results • Interference on Shared Resources • Scaling of Multiprogrammed Workloads • Scaling of Multithreaded Workloads

  3. UltraSPARC T1 (Niagara) • A multi-threaded processor that combines CMP & SMT in a CMT • 8 cores, each handling 4 hardware context threads → 32 active hardware context threads • Simple in-order pipeline with no branch predictor unit per core • Optimized for multithreaded performance → throughput • High throughput → hide memory and pipeline stalls/latencies by scheduling other available threads with a zero-cycle thread-switch penalty
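
To make the thread organization concrete, here is a minimal sketch of how the 32 hardware contexts could be enumerated as (core, strand) pairs; the contiguous virtual-CPU numbering (vCPU = core × 4 + strand) is an assumption for illustration, not something stated on the slides.

```python
# Hypothetical enumeration of the T1's 8 cores x 4 strands as 32 logical CPUs.
# The contiguous numbering (vCPU = core * 4 + strand) is an assumption.
CORES = 8
STRANDS_PER_CORE = 4

def strand_of(vcpu_id: int) -> tuple[int, int]:
    """Return the (core, strand) pair assumed to back a virtual CPU id."""
    return vcpu_id // STRANDS_PER_CORE, vcpu_id % STRANDS_PER_CORE

if __name__ == "__main__":
    for vcpu in range(CORES * STRANDS_PER_CORE):
        core, strand = strand_of(vcpu)
        print(f"vCPU {vcpu:2d} -> core {core}, strand {strand}")
```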

  4. UltraSPARC T1 Core Pipeline • Thread Group shares L1 cache, TLBs, execution units, pipeline registers and data path • Blue areas are replicated copies per hardware context thread

  5. Objectives • Purpose • Analysis of the interference of multiple executing threads on the shared resources of Niagara • Scaling abilities of CMT architectures for both multiprogrammed and multithreaded workloads • Methodology • Interference on Shared Resources (SPEC CPU2000) • Scaling of a Multiprogrammed Workload (SPEC CPU2000) • Scaling of a Multithreaded Workload (SPECjbb2005)

  6. Analysis Objectives / Methodology

  7. Methodology (1/2) • On-chip performance counters for real/accurate results • Niagara: • Solaris 10 tools: cpustat, cputrack, psrset to bind processes to H/W threads • 2 counters per hardware thread, with one dedicated to the instruction count
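
A minimal sketch, in Python, of the measurement flow these tools imply: create a processor set holding one hardware thread with psrset and run the benchmark under cputrack to sample its two per-strand counters. The event names, the pic0/pic1 assignment, and the parsing of psrset's output are assumptions for illustration; the slides do not give the exact invocations.

```python
# Hedged sketch: pin a benchmark to one hardware thread and sample counters.
# Requires Solaris and root privileges; event names and output parsing are
# illustrative assumptions, not a verified recipe from the paper.
import subprocess

def run_pinned(cmd: list[str], vcpu: int, event: str = "DC_miss") -> str:
    """Run `cmd` restricted to one virtual CPU and return cputrack's output."""
    # Create a processor set containing just this hardware thread.
    out = subprocess.run(["psrset", "-c", str(vcpu)],
                         capture_output=True, text=True, check=True).stdout
    # Assumes the first output line reads "created processor set N".
    pset_id = out.splitlines()[0].split()[-1]
    try:
        # One counter is assumed to track the selected event, the other the
        # dedicated instruction count, as described on the slide.
        track = ["cputrack", "-c", f"pic0={event},pic1=Instr_cnt"]
        return subprocess.run(["psrset", "-e", pset_id] + track + cmd,
                              capture_output=True, text=True).stdout
    finally:
        subprocess.run(["psrset", "-d", pset_id])  # tear the set down

# Example (hypothetical binary path): run_pinned(["./crafty"], vcpu=0)
```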

  8. Methodology (2/2) • Niagara has only one FP unit → only integer benchmarks were considered • The Performance Counter Unit works at the granularity of a single H/W context thread • No way to break down the effects of multiple threads per H/W thread • Software profiling tools are too invasive • Only pairs of benchmarks were considered, to allow correlation of benchmarks with events • Many iterations, using the average behavior
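
A short sketch of the pairing-and-averaging methodology just described: every pair of SPEC CPU2000 integer benchmarks is run repeatedly and the mean behavior is kept. measure_pair_ipc() is a hypothetical stand-in for a run that co-schedules the pair and reads the counters, and the iteration count is an arbitrary placeholder.

```python
# Sketch of the pairing-and-averaging methodology (helper is hypothetical).
from itertools import combinations
from statistics import mean

INT_BENCHMARKS = ["gzip", "vpr", "gcc", "mcf", "crafty", "parser",
                  "eon", "perlbmk", "gap", "vortex", "bzip2", "twolf"]
ITERATIONS = 5          # "many iterations and use average behavior"

def measure_pair_ipc(bench_a: str, bench_b: str) -> float:
    # Placeholder: a real run would pin the two binaries to hardware threads
    # (e.g. with run_pinned() from the earlier sketch) and compute IPC from
    # the dedicated instruction counter and the elapsed cycles.
    return 0.0

avg_ipc = {
    (a, b): mean(measure_pair_ipc(a, b) for _ in range(ITERATIONS))
    for a, b in combinations(INT_BENCHMARKS, 2)
}
```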

  9. Analysis of Results • Interference on shared resources • Scaling of a multiprogrammed workload • Scaling of a multithreaded workload

  10. Interference on Shared Resources Two modes considered: • “Same core” mode executes the benchmark pair on the same core • Sharing of pipeline, TLBs, L1 bandwidth • More like an SMT • “Two cores” mode executes each member of the pair on a different core • Sharing of L2 capacity/bandwidth and main memory • More like a CMP
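
The two placements, expressed with the contiguous vCPU numbering assumed in the earlier sketch (a simplification; the actual strand assignments used in the study are not given in the transcript):

```python
# Illustration of the two co-scheduling modes, assuming vCPU = core * 4 + strand.
def same_core_placement(core: int = 0) -> tuple[int, int]:
    """Both members of the pair on two strands of one core (SMT-like sharing)."""
    return core * 4, core * 4 + 1

def two_cores_placement(core_a: int = 0, core_b: int = 1) -> tuple[int, int]:
    """Each member on its own core (CMP-like): only L2 capacity/bandwidth and
    main memory are shared."""
    return core_a * 4, core_b * 4
```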

  11. Interference “same core” (1/2) • On average, a 12% drop in IPC when running in a pair • Crafty followed by twolf showed the worst performance • Eon showed the best behavior, keeping IPC close to the single-thread case

  12. Interference “same core” (2/2) • DC misses increased 20% on average / 15% when excluding crafty • The worst DC misses are for vortex and perlbmk • The pairs with the highest L2 miss ratios are not the ones showing an important decrease in IPC → mcf and eon pairs with more than 70% L2 misses • Overall, a small performance penalty even when sharing the pipeline and L1/L2 bandwidth → the latency-hiding technique is promising
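
For clarity, the interference figures quoted on these two slides can be read as simple relative changes between a benchmark's solo run and its paired run; the helper names below are illustrative only.

```python
# Illustrative derivation of the quoted interference percentages from
# solo-run vs. paired-run counter readings.
def ipc_drop(solo_ipc: float, paired_ipc: float) -> float:
    """Relative IPC loss when co-scheduled, e.g. 0.12 for the 12% average."""
    return (solo_ipc - paired_ipc) / solo_ipc

def miss_increase(solo_misses: int, paired_misses: int) -> float:
    """Relative growth in misses, e.g. 0.20 for the 20% average DC-miss rise."""
    return (paired_misses - solo_misses) / solo_misses
```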

  13. Interference “two cores” • Only stresses L2 and the shared communication buses • On average, L2 misses are almost the same as in the “same core” case: • the available resources are underutilized • Multiprogrammed workload with no data sharing

  14. Scaling of Multiprogrammed Workload • Reduced benchmark pair set • Scaling 4 → 8 → 16 threads with the configurations shown on the slide

  15. Scaling of Multiprogrammed Workload • “Same core” mode • “Mixed mode”
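
A hedged illustration of how the 4 → 8 → 16 thread counts could be laid out in the two modes; the configuration tables from the original slides are not reproduced in this transcript, so the packing below (fill one core's strands first vs. spread across cores) is an assumption.

```python
# Assumed thread placements for the multiprogrammed scaling experiments,
# using the contiguous numbering vCPU = core * 4 + strand.
CORES, STRANDS_PER_CORE = 8, 4

def same_core_vcpus(n_threads: int) -> list[int]:
    """Pack threads onto as few cores as possible (fill all 4 strands first)."""
    return list(range(n_threads))

def mixed_mode_vcpus(n_threads: int) -> list[int]:
    """Spread threads round-robin across the 8 cores before reusing strands."""
    return [(i % CORES) * STRANDS_PER_CORE + (i // CORES)
            for i in range(n_threads)]

# e.g. mixed_mode_vcpus(16) occupies strands 0 and 1 of every core.
```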

  16. Scaling of Multiprogrammed “same core” (Charts: IPC ratio, DC misses ratio, L2 misses ratio) • 4 → 8 case • IPC / data cache misses not affected • L2 data misses increased, but IPC is not • Enough resources even when running fully occupied • memory latency hiding • 8 → 16 case • More cores running the same benchmark • Additional footprint / requests to L2 / main memory • L2 requirements / shared interconnect traffic decreased performance

  17. Scaling of Multiprogrammed “mixed mode” (Chart: IPC ratio) • Mixed mode case • Significant decrease in IPC when moving both from 4 → 8 and from 8 → 16 threads • Same behavior as the “same core” case for DC and L2 misses, with a 1%-2% difference on average • Overall, for both modes • Niagara demonstrated that moving from 4 to 16 threads can be done with less than a 40% average performance drop • Both modes showed that significantly increased L1 and L2 misses can be handled, favoring throughput

  18. Scaling of Multithreaded Workload • Scaled from 1 up to 64 threads • 1 → 8 threads: 1 thread mapped per core • 8 → 16 threads: at most 2 threads mapped per core • 16 → 32 threads: up to 4 threads per core • 32 → 64 threads: more threads per core than hardware contexts, so swapping is necessary (Table on slide: configuration used for SPECjbb2005)
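
A small sketch of the mapping just listed: warehouses (threads) are spread across the 8 cores, and beyond the 32 hardware contexts the operating system has to swap threads in software. The even spreading is an assumption consistent with the per-range description above.

```python
# Assumed warehouse-to-core mapping for the SPECjbb2005 scaling runs.
import math

CORES, STRANDS_PER_CORE = 8, 4

def threads_per_core(warehouses: int) -> int:
    """Threads each core must handle if warehouses are spread evenly."""
    return math.ceil(warehouses / CORES)

def needs_swapping(warehouses: int) -> bool:
    """True once the 32 hardware contexts (8 cores x 4 strands) are exceeded."""
    return warehouses > CORES * STRANDS_PER_CORE

for w in (8, 16, 32, 64):
    print(f"{w:2d} warehouses -> {threads_per_core(w)} threads/core, "
          f"swapping needed: {needs_swapping(w)}")
```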

  19. Scaling of Multithreaded Workload (Chart: SPECjbb2005 score per warehouse, with the GC effect annotated)

  20. Scaling of Multithreaded Workload • Ratios relative to the 8-thread case with 1 thread per core • Instruction fetch and the DTLB are stressed the most • The L1 data and L2 caches managed to scale even for more than 32 threads (GC effect visible in the chart)

  21. Scaling of Multithreaded Workload • Scaling of performance • Linear scaling of almost 0.66 per thread up to 32 threads • 20x speedup at 32 threads • SMT with 2 threads/core gives on average a 1.8x speedup over the CMP configuration (region 1) • SMT with up to 4 threads/core gives a 1.3x and 2.3x speedup over the 2-way SMT per core and the single-threaded CMP, respectively.
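
As a quick sanity check on the quoted figures: a slope of roughly 0.66 of ideal speedup per thread gives 0.66 × 32 ≈ 21 at 32 threads, which is consistent with the reported ~20x speedup.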

  22. Conclusions • Demonstration of interference on a real CMT system • The long-latency-hiding technique is effective for L1 and L2 misses and therefore could be a good/promising alternative to aggressive speculation • Promising scaling, up to 20x for multithreaded workloads, with an average of 0.66x per thread • The instruction fetch subsystem and the DTLBs are the most contended resources, followed by L2 cache misses

  23. Q/A Thank you… Questions? The Laboratory for Computer Architecture web-site: http://lca.ece.utexas.edu Email: kaseridi@ece.utexas.edu
