
TRANSPARENT THREADS


Presentation Transcript


  1. TRANSPARENT THREADS Gautham K. Dorai and Dr. Donald Yeung, ECE Dept., Univ. of Maryland, College Park

  2. SMT Processors • Priority mechanisms: ICOUNT, IQPOSN, MISSCOUNT, BRCOUNT [Tullsen ISCA’96] • Multiple threads share a single pipeline

  3. Individual Threads Run Slower! • Individual thread performance degrades by 31%

  4. Single-Thread Performance • Multiprogramming (process scheduling) • Subordinate threading (prefetching/pre-execution, cache management, branch prediction, etc.) • Performance monitoring (dynamic profiling)

  5. Transparent Threads • Foreground thread runs alongside a background (transparent) thread with 0% slowdown

  6. Single-Thread Performance • Multiprogramming: latency of a critical high-priority process • Subordinate threading and performance monitoring: benefit vs. cost (overhead) tradeoff

  7. Road Map • Motivation • Transparent Threads • Experimental Evaluation • Transparent Software Prefetching • Conclusion

  8. Shared vs. Private Resources • Transparency – no stealing of shared resources [Diagram: PC, fetch queue, ROB, issue queues, register file, register map, functional units, branch predictor, I-cache, D-cache]

  9. Slots, Buffers and Memories • SLOTS – allocation based on the current cycle only • BUFFERS – allocation based on future cycles • MEMORIES – allocation based on future cycles [Diagram: PC, fetch queue, ROB, issue queues/units, register file, register map, branch predictor, I-cache, D-cache]

  10. Slot Prioritization • ICOUNT(background) = ICOUNT(background) + instruction-window size • Fetch slots arbitrated by ICOUNT.2.N [Diagram: foreground vs. background fetch; I-cache block]


  13. Buffer Transparency [Diagram: fetch hardware (PC1, PC2), fetch queue, issue queue, ROB; foreground and background entries]

  14. Background Thread Window Partitioning • Limit on the number of background-thread instructions in the ROB • Fetch stops when the background thread’s ICOUNT reaches the limit [Diagram: fetch hardware (PC1, PC2), partitioned fetch queue, issue queue, ROB]

  15. Background Thread Window Partitioning • Foreground thread can occupy all available entries [Diagram: fetch hardware (PC1, PC2), partitioned fetch queue, issue queue, ROB]

  16. Background Thread Flushing • No limit on background-thread instructions [Diagram: ROB with head/tail pointers, fetch hardware (PC1, PC2), fetch queue, issue queue]

  17. Background Thread Flushing • No limit on background-thread instructions • Flush triggered on a conflict [Diagram: ROB head/tail pointers move as background entries are flushed]


  19. Foreground Thread Flushing • Instructions remain stagnant in the ROB behind a load miss at the head • Flush triggered on a load miss at the ROB head • Flush the stagnated entries [Diagram: ROB with head/tail pointers, fetch hardware (PC1, PC2), fetch queue, issue queue]

  20. Foreground Thread Flushing • Flush F entries from the tail • Block fetch for T cycles [Diagram: load miss at the ROB head]

  21. Foreground Thread Flushing • After T cycles, allow fetch again • F and T depend on R (the residual cache latency) [Diagram: load miss at the ROB head]

  22. SimpleScalar-based SMT

  23. Benchmark Suites • Evaluate Transparency Mechanisms • Transparent Software Prefetching

  24. Transparency Mechanisms (configurations) • EP – equal priority • SP – slot prioritization • BP – background-thread window partitioning (32 entries) • BF – background-thread flushing • PC – private caches • PP – private predictor

  25. Transparency Mechanisms • Equal priority (EP): 30% slowdown • Slot prioritization (SP): 16% slowdown • Background window partitioning (BP): 9% slowdown • Background-thread flushing (BF): 3% slowdown

  26. Performance Mechanisms (configurations: EP, 2P, 2F, 2B) • Equal priority • ICOUNT.2.8 • ICOUNT.2.8 with flushing • Foreground-thread window partitioning (112 foreground + 32 background entries)

  27. Performance Mechanisms (normalized IPC) • Equal priority: 31% single-thread degradation • ICOUNT.2.8: 41% slower than EP • ICOUNT.2.8 + foreground-thread flushing: 23% slower than EP • Foreground-thread window partitioning: 13% slower than EP

  28. Transparent Software Prefetching

  Conventional software prefetching (in-lined prefetch code):
      for (i = 0; i < N - PD; i += 8) {
          prefetch(&b[i]);
          b[i] = z + b[i];
      }

  Transparent software prefetching – computation thread:
      for (i = 0; i < N - PD; i += 8) {
          b[i] = z + b[i];
      }
  plus a transparent prefetch thread:
      for (i = 0; i < N - PD; i += 8) {
          prefetch(&b[i]);
      }

  • Conventional: in-lined prefetch code; profitability requires a benefit-vs.-cost tradeoff, so profiling is required
  • Transparent: offload the prefetch code to transparent threads; zero overhead, no profiling required

  29. Transparent Software Prefetching (normalized execution time) • NP – no prefetching • PF – naive conventional software prefetching • PS – profiled conventional software prefetching • TSP – transparent software prefetching [Chart: results for VPR]

  30. Transparent Software Prefetching • Naïve software prefetching (PF): 19.6% overhead, 0.8% performance gain • Selective software prefetching (PS): 14.13% overhead, 2.47% gain • Transparent software prefetching (TSP): 1.38% overhead, 9.52% gain [Chart: NP/PF/PS/TSP for VPR, BZIP, GAP, EQUAKE, ART, AMMP, IRREG]

  31. Conclusions • Transparency mechanisms: 3% overhead on the foreground thread; less than 1% without cache and predictor contention • Throughput mechanisms: within 23% of equal priority • Transparent software prefetching: 9.52% gain with 1.38% overhead; eliminates the need for profiling • Spare bandwidth can be used transparently for interesting applications

  32. Related Work • Tullsen’s work on flushing mechanisms [Tullsen Micro-2001] • Raasch’s work on prioritization [Raasch MTEAC Workshop 1999] • Snavely’s work on job scheduling [Snavely ICMM-2001] • Chappell’s work on subordinate multithreading and Dubois’s work on assisted execution [Chappell ISCA-1999][Dubois Tech-Report Oct’98]

  33. Foreground Thread Window Partitioning • Advantage: a guaranteed minimum of entries • Disadvantage: transparency is reduced [Diagram: fetch hardware (PC1, PC2), partitioned fetch queue, issue queue]

  34. Benchmark Suites • Evaluate Transparency Mechanisms • Transparent Software Prefetching

  35. Transparency Mechanisms [Chart: per-benchmark results for the EP, SP, and BF configurations]

  36. Transparency Mechanisms [Chart: per-benchmark results for the EP, SP, and BF configurations]

  37. Transparency Mechanisms

  38. Transparent Software Prefetching [Chart: per-benchmark results for the NP, PF, PS, TSP, and NF configurations]
