
Thread criticality for power efficiency in CMPs



Presentation Transcript


  1. ECE 692 Topic Presentation • Thread criticality for power efficiency in CMPs • Khairul Kabir • Nov. 3rd, 2009

  2. Why Thread Criticality Prediction? • Critical thread: the one with the longest completion time in the parallel region • Problems: performance degradation and energy inefficiency • Sources of variability: algorithm, process variation, thermal emergencies, etc. • Purpose: load balancing for performance improvement; energy optimization using DVFS (Figure: instructions executed by threads T0–T3; I-cache and D-cache misses stall threads at different points, so the early finishers sit idle at the barrier.)
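As a concrete illustration of the definition above, here is a minimal Python sketch (mine, not the authors'): given hypothetical per-thread completion times for one parallel region, it picks out the critical thread and each thread's barrier stall time.

    # Hypothetical per-thread completion times (cycles) for one parallel region.
    completion_times = {"T0": 950, "T1": 1200, "T2": 870, "T3": 990}

    # The critical thread is the one that reaches the barrier last.
    critical = max(completion_times, key=completion_times.get)

    # Every other thread stalls at the barrier for this many cycles.
    stall = {t: completion_times[critical] - c for t, c in completion_times.items()}

    print(critical)  # T1
    print(stall)     # {'T0': 250, 'T1': 0, 'T2': 330, 'T3': 210}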

  3. Related Work • Instruction criticality [Fields et al. 2001, Tune et al. 2001, etc.] • Identifies critical instructions • Thrifty barrier [Li et al. 2005] • Faster cores transition into a low-power mode based on a prediction of barrier stall time • DVFS for energy efficiency at barriers [Liu et al. 2005] • Each faster core tracks its waiting time and predicts the DVFS setting for the next execution of the same parallel loop • Meeting points [Cai et al. 2008] • DVFS non-critical threads by tracking loop-iteration completion rates across cores (parallel loops only)

  4. Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors • Abhishek Bhattacharjee, Margaret Martonosi • Dept. of Electrical Engineering, Princeton University

  5. What is This Paper About? • Thread criticality predictor (TCP) design • Methodology • Identify architectural events impacting thread criticality • Introduce basic TCP hardware • Thread criticality predictor uses • Apply to Intel's Threading Building Blocks (TBB) • Apply to energy efficiency in barrier-based programs

  6. Thread Criticality Prediction Goals • Design goals • 1. Accuracy • 2. Low-overhead implementation • Simple HW (allows SW policies to be built on top) • 3. One predictor, many uses • Design decisions • 1. Find a suitable architectural metric • 2. History-based local approach versus thread-comparative approach • 3. This paper: TBB and DVFS; other uses: shared last-level cache management, SMT and memory priority, …

  7. Methodology • Evaluations on a range of architectures spanning the high-performance and embedded domains • GEMS simulator – evaluates performance on architectures representative of the high-performance domain • ARM simulator – evaluates the performance benefits of TCP-guided task stealing in Intel's TBB • FPGA-based emulator – assesses the energy savings from TCP-guided DVFS

  8. Architectural Metrics • History-based TCP • Requires repetitive barrier behavior • Information is local to the core: no communication • Problem for in-order pipelines: varying IPCs • Inter-core (thread-comparative) TCP metrics • Instruction counts • Cache misses • Control flow changes • Translation lookaside buffer (TLB) misses
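To make the thread-comparative idea concrete, here is a rough sketch of how such a metric could fold these events into a single per-thread score. The penalty weights are illustrative assumptions, not the paper's calibrated values; the following slides show which events actually correlate with criticality.

    # Assumed (illustrative) miss penalties in cycles.
    PENALTY = {"l1_i": 10, "l1_d": 10, "l2": 100, "tlb": 30}

    def criticality_score(misses):
        """misses: dict mapping event type -> count observed this interval."""
        return sum(PENALTY[kind] * n for kind, n in misses.items())

    # The thread with the highest score is predicted to be the most critical.
    per_thread = {
        0: {"l1_i": 5, "l1_d": 40, "l2": 2, "tlb": 0},
        1: {"l1_i": 1, "l1_d": 90, "l2": 12, "tlb": 3},
    }
    predicted = max(per_thread, key=lambda t: criticality_score(per_thread[t]))
    print(predicted)  # 1: the poorly cached thread is predicted slowest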

  9. Thread-Comparative Metrics for TCP: Instruction Counts

  10. Thread-Comparative Metrics for TCP: L1 D Cache Misses

  11. Thread-Comparative Metrics for TCP: L1 I & D Cache Misses

  12. Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses

  13. Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses

  14. Basic TCP Hardware • TCP hardware components • Per-core criticality counters • Interval bound register

  15. Basic TCP Hardware • Per-core criticality counters track poorly cached, slow threads • Criticality counters are periodically refreshed using the interval bound register (Figure: four cores with private L1 I/D caches sharing an L2 cache; the TCP hardware sits at the L2 controller, where L1 and L2 miss information is already visible, and increments the missing core's criticality counter on each L1 and L2 cache miss.)
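A behavioural sketch of this hardware (a model, not RTL; the miss weights and interval size below are assumptions for illustration):

    class TCPHardware:
        """Behavioural model of per-core criticality counters at the L2 controller."""
        L1_WEIGHT, L2_WEIGHT = 1, 10      # assumed relative miss penalties

        def __init__(self, n_cores, interval_bound=100_000):
            self.counters = [0] * n_cores           # per-core criticality counters
            self.interval_bound = interval_bound    # interval bound register
            self.insts = 0

        def on_l1_miss(self, core):
            self.counters[core] += self.L1_WEIGHT

        def on_l2_miss(self, core):
            self.counters[core] += self.L2_WEIGHT

        def on_commit(self, n=1):
            # Periodically refresh the counters so stale history is discarded.
            self.insts += n
            if self.insts >= self.interval_bound:
                self.counters = [0] * len(self.counters)
                self.insts = 0

        def most_critical(self):
            return max(range(len(self.counters)), key=self.counters.__getitem__)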

  16. TBB Task Stealing & Thread Criticality • TBB's dynamic scheduler distributes tasks • Each thread maintains a software queue filled with tasks • Empty queue: the thread "steals" a task from another thread's queue • Approach 1: default TBB uses random task stealing • More failed steals at higher core counts → poor performance • Approach 2: occupancy-based task stealing [Contreras, Martonosi 2008] • Steal based on the number of items in the SW queue • Must track and compare maximum occupancy counts

  17. TCP-Guided TBB Task Stealing • TCP initiates steals from the critical thread • Modest message overhead: L2 access latency • Scalable: 14-bit criticality counters → 114 bytes of storage @ 64 cores (Figure: cores 0–3 with software queues SW Q0–Q3; core 2's queue is empty, so the TCP control logic scans the criticality counters for the maximum value and directs core 2 to steal from core 3, the predicted critical thread.)
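In sketch form, the steal decision reduces to a scan of the criticality counters. The queue layout below is hypothetical, purely to exercise the selection logic:

    def choose_victim(criticality_counters, task_queues, thief):
        """Pick the steal victim: the most critical core that still has tasks."""
        candidates = [c for c, q in enumerate(task_queues)
                      if c != thief and q]
        if not candidates:
            return None   # nothing to steal anywhere
        return max(candidates, key=lambda c: criticality_counters[c])

    queues = [["t0"], ["t4"], [], ["t5", "t6", "t7"]]   # core 2's queue is empty
    counters = [2, 5, 0, 21]
    print(choose_victim(counters, queues, thief=2))     # 3: critical, so offload it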

  18. TCP-Guided TBB Task Stealing • TBB with random task stealing • TBB with TCP-guided task stealing

  19. TCP-Guided TBB Performance • % performance improvement versus random task stealing • Avg. improvement over Random (32 cores) = 21.6% • Avg. improvement over Occupancy (32 cores) = 13.8%

  20. Adapting TCP for Energy Efficiency in Barrier-Based Programs • Approach: DVFS non-critical threads to eliminate barrier stall time • Challenges: • Relative criticalities • Misprediction costs • DVFS overheads (Figure: threads T0–T3; T1 suffers an L2 D-cache miss and becomes critical, so T0, T2, and T3 are DVFS'd down to arrive at the barrier with T1.)

  21. Hardware and Algorithm for TCP-Guided DVFS • TCP hardware components • Criticality counters • SST – switching suggestion table • SCT – suggestion confidence table • Interval bound register • TCP-guided DVFS algorithm – two key steps • Step 1: use the SST to translate criticality counter values into thread criticalities • Triggered when a criticality counter rises above a pre-defined threshold T while the thread runs at the nominal frequency • The counter value is matched against SST entries • A frequency switch is suggested if the matching SST entry differs from the current frequency • Step 2: feed the suggested target frequency from the SST to the SCT • The SCT assesses confidence in the SST's DVFS suggestion
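A behavioural sketch of one plausible reading of these two steps. The SST contents, the mapping from counter lag to SST entries, the threshold T, and the SCT confidence depth are all illustrative assumptions, not the paper's calibrated values:

    T = 50                                    # assumed criticality-counter threshold
    SST = [(200, 0.6), (100, 0.8), (0, 1.0)]  # lag behind critical core -> freq (GHz)

    class SCT:
        """Confidence filter: act only on a suggestion seen repeatedly."""
        def __init__(self, confident_at=2):
            self.last, self.votes, self.confident_at = None, 0, confident_at

        def confirm(self, suggestion):
            self.votes = self.votes + 1 if suggestion == self.last else 1
            self.last = suggestion
            return self.votes >= self.confident_at

    def suggest_frequency(counters, core, current_freq, sct):
        # Step 1: past threshold T and at nominal frequency, translate the
        # counter values into a suggested frequency via the SST.
        if max(counters) <= T or current_freq != 1.0:
            return current_freq
        lag = max(counters) - counters[core]   # how non-critical this core looks
        suggestion = next(f for bound, f in SST if lag >= bound)
        # Step 2: the SCT must be confident before the switch is applied.
        if suggestion != current_freq and sct.confirm(suggestion):
            return suggestion
        return current_freq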

  22. TCP-Guided DVFS – Effect of the Criticality Counter Threshold • Lowest bar – time spent in the pre-calculated (correct) DVFS state, averaged across all barrier instances • Central bar – learning time until the correct DVFS state is first reached • Upper bar – prediction noise, i.e. time spent in erroneous DVFS states after the correct one has been reached • A low T increases susceptibility to temporal noise • Without good suggestion confidence, the result is too many frequency changes and high performance overhead

  23. TCP for DVFS: Results • Average 15% energy savings • Benchmarks with more load imbalance generally save more energy

  24. Conclusions • Goal 1: accuracy • Accurate TCPs based on simple cache statistics • Goal 2: low-overhead hardware • Scalable per-core criticality counters • TCP placed at a central location where cache information is already available • Goal 3: versatility • TBB improved by 13.8% over the best known approach @ 32 cores • DVFS achieves 15% energy savings • Two uses shown; many others possible…

  25. Meeting Points: Using Thread Criticality to Adapt Multi-core Hardware to Parallel Regions • Qiong Cai, José González, Ryan Rakvic, Grigorios Magklis, Pedro Chaparro, Antonio González

  26. Introduction • Meeting point thread characterization • Identifies the critical thread of a single multithreaded application • Estimates the slack of the non-critical threads • Proposed applications • Thread delaying for multi-core systems • Saves energy by scaling down the frequency and voltage of cores running non-critical threads • Thread balancing for simultaneous multi-threaded (SMT) cores • Improves overall performance by giving the critical thread higher priority

  27. Example: a Parallelized Loop from PageRank (lz77 method) • Observations: • The code is already written to achieve workload balance, but imbalance still exists: CPU1 is slower than CPU0 • Reasons for imbalance: (i) different cache misses, (ii) different control paths • How can critical threads be found dynamically?

  28. Identification of Critical Threads • Insertion of meeting points • Placed in a parallel region at a point visited by all threads • Can be done by the hardware, the compiler, or the programmer • Identification technique • A thread-private counter is incremented each time the thread passes the meeting point • The most critical thread is the one with the smallest counter • A thread's slack is estimated as the difference between its counter and the critical (slowest) thread's counter
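In sketch form (the thread count is illustrative):

    counters = [0, 0, 0, 0]    # one thread-private counter per core

    def at_meeting_point(tid):
        # Executed by thread tid once per loop iteration.
        counters[tid] += 1

    def snapshot():
        # Smallest counter = critical thread; slack = iterations a thread is ahead.
        critical = min(range(len(counters)), key=counters.__getitem__)
        slack = [counters[t] - counters[critical] for t in range(len(counters))]
        return critical, slack

    # e.g. with counters == [13, 10, 17, 12]:
    # snapshot() -> (1, [3, 0, 7, 2]); thread 1 is critical.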

  29. Thread Delaying • CPUs running non-critical threads can be put into deep sleep at the barrier • Deep sleep consumes almost zero energy, but it is not the most energy-efficient way to deal with workload imbalance • Instead, run non-critical threads at a lower frequency/voltage level • All threads then arrive at the barrier at the same time

  30. Thread Delaying • Proposal: • Energy = Activity × Capacitance × Voltage² • Reduce voltage when executing parallel threads • Delay threads that would otherwise arrive early at the barrier (Figure: threads 1–4 running at full frequency reach the barrier at different times; the area under each frequency curve represents the energy needed to execute that thread's instructions.)
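A quick worked example of the slide's energy relation. The voltage figure is an assumption; real voltage/frequency pairs are part-specific:

    # Energy = Activity x Capacitance x Voltage^2 (dynamic energy only).
    activity, capacitance = 1.0, 1.0       # normalised: same instructions are run
    v_nominal, v_scaled = 1.0, 0.8         # assumed voltage at the lower frequency

    e_nominal = activity * capacitance * v_nominal ** 2
    e_scaled  = activity * capacitance * v_scaled ** 2

    print(round(1 - e_scaled / e_nominal, 2))  # 0.36: ~36% less dynamic energy,
                                               # and the thread's slack means it
                                               # still reaches the barrier on time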

  31. Thread Delaying (Figure: the same four threads with the non-critical ones stepped down through frequency levels 3, 2, and 1 so that all arrive at the barrier together; the area under each curve again represents the energy needed to execute the thread's instructions.)

  32. Thread Delaying (Figure: per-thread energy bars for threads A–D before and after thread delaying; the difference between the two is the energy saved.)

  33. Implementation of Thread Delaying • MP-COUNTER-TABLE • Contains as many entries as there are cores in the processor • One 32-bit counter per entry • Kept consistent among all cores • HISTORY-TABLE • An entry for each possible frequency level • Each entry is a 2-bit up/down saturating counter • Implementation • Each core broadcasts its counter value on every 10th execution of the meeting point instruction • The broadcast invokes the thread delaying algorithm • The history table is updated
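A behavioural sketch of these tables working together. The frequency ladder, the slack-to-level mapping, and the confidence threshold are illustrative assumptions, and set_core_frequency is a hypothetical platform hook:

    BROADCAST_PERIOD = 10                   # broadcast every 10 meeting points
    FREQ_LEVELS = [1.0, 0.9, 0.8, 0.7]      # assumed available frequencies (GHz)

    def set_core_frequency(core, ghz):      # hypothetical platform hook
        print(f"core {core} -> {ghz} GHz")

    class ThreadDelaying:
        def __init__(self, n_cores):
            self.mp_counters = [0] * n_cores        # MP-COUNTER-TABLE (32-bit each)
            self.history = [0] * len(FREQ_LEVELS)   # 2-bit saturating counters

        def meeting_point(self, core):
            self.mp_counters[core] += 1
            if self.mp_counters[core] % BROADCAST_PERIOD == 0:
                # The broadcast keeps all cores' copies of the table consistent,
                # then the thread delaying algorithm runs.
                self.delay_decision(core)

        def delay_decision(self, core):
            slack = self.mp_counters[core] - min(self.mp_counters)
            level = min(slack // 20, len(FREQ_LEVELS) - 1)  # assumed mapping
            # Up/down saturating counters build confidence in the chosen level.
            for lv in range(len(self.history)):
                if lv == level:
                    self.history[lv] = min(self.history[lv] + 1, 3)
                else:
                    self.history[lv] = max(self.history[lv] - 1, 0)
            if self.history[level] >= 2:            # confident enough: apply DVFS
                set_core_frequency(core, FREQ_LEVELS[level])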

  34. Thread Balancing • Goal: speed up a parallel application running more than one thread on the same core • Baseline: two-way in-order SMT with an issue bandwidth of two instructions per cycle • If both threads have ready instructions, both are allowed to issue • If only one thread has ready instructions, it can issue up to two instructions per cycle • If the threads belong to the same parallel application, prioritize the critical thread • Thread balancing • Identify the critical thread • Give the critical thread higher priority in the issue logic

  35. Thread Balancing Logic • Targeted at 2-way SMT: • Imbalance hardware logic: identifies the critical thread • Issue prioritization logic • If a thread is critical and has two ready instructions, it is allowed to issue both, regardless of how many ready instructions the non-critical thread has • Otherwise, the base issue policy applies
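The prioritization rule in sketch form. The base policy below is a plausible reading of the slide, not the paper's exact microarchitecture:

    def select_issue(ready, critical_tid):
        """ready: {tid: ready-instruction count}; returns instructions issued."""
        issued = {tid: 0 for tid in ready}
        # A critical thread with two ready instructions takes both issue slots.
        if ready.get(critical_tid, 0) >= 2:
            issued[critical_tid] = 2
            return issued
        # Base policy: two slots per cycle, one per thread first, then any
        # leftover slot goes to a thread with more ready work.
        slots = 2
        for tid in ready:
            if slots and ready[tid] > issued[tid]:
                issued[tid] += 1
                slots -= 1
        for tid in ready:
            while slots and ready[tid] > issued[tid]:
                issued[tid] += 1
                slots -= 1
        return issued

    print(select_issue({0: 2, 1: 1}, critical_tid=0))  # {0: 2, 1: 0}
    print(select_issue({0: 1, 1: 2}, critical_tid=0))  # {0: 1, 1: 1}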

  36. Simulation Framework and Benchmarks • SoftSDV for Intel64/IA32 processors • Simulates multithreaded primitives, including locks, synchronization operations, shared memory, and events • RMS (Recognition, Mining, and Synthesis) benchmarks • Highly data-intensive and highly parallel (computer vision, data mining, etc.) • Benchmarks are parallelized with pthreads or OpenMP • 99% of total execution is parallel for all benchmarks except FIMI (28% parallel coverage)

  37. Performance Results for Thread Delaying • The baseline is aggressive • Every core runs at full speed and stops when its work is complete; once stopped, a core consumes zero power • Thread delaying saves 4%–44% energy • The energy savings come from large frequency decreases on non-critical threads

  38. Performance Results for Thread Balancing • The baseline is aggressive • Every core runs at full speed and stops when its work is complete • The performance benefit ranges from 1% to 20% • The benefit correlates with the level of imbalance

  39. Conclusions • Meeting point thread characterization dynamically estimates the criticality of the threads in a parallel execution • Thread delaying combines per-core DVFS with meeting point thread characterization to reduce energy consumption on non-critical threads • Thread balancing gives the critical thread higher priority in the issue queue of an SMT core

  40. Comparison of the Two Papers

  41. Critiques • Paper 1 • It does not explain how the SST values are calculated • The accuracy of barrier-based DVFS depends on these pre-calculated SST values • Paper 2 • Each thread must visit the meeting point roughly the same number of times, so meeting point thread characterization cannot handle variable loop iteration counts • It works well only for parallel loops and fails for large parallel regions that contain no parallel loop • It may not always be feasible for hardware to detect a parallel loop and insert the meeting point

  42. Thank you!
