
Performance and Power Aware CMP Thread Allocation



Presentation Transcript


  1. Performance and Power Aware CMP Thread Allocation. Yaniv Ben-Itzhak, Prof. Israel Cidon, Dr. Avinoam Kolodny. Department of Electrical Engineering, Technion – Israel Institute of Technology

  2. Performance and Power Aware CMP Thread Allocation: Thread Allocation. [Figure: threads being allocated to CMP cores, each core with a private L1 cache, all sharing an L2 cache]

  3. Performance-Power Trade-Off • Performance maximization: use all the cores, at the cost of high power consumption. • Power minimization: use a single core, at the cost of low performance. [Figure: CMP tile with core, router, and shared cache ($)]

  4. Performance Power Metric (PPM) • Less power ↔ more performance: a smaller α favors power savings, a larger α favors performance. • The PPM captures the preferred tradeoff between performance and power in a single number: PPM = Performance^α / Power, with α ≥ 1.
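A minimal sketch of this metric, using the definition given later in the problem statement (slide 15), PPM = Performance^α / Power; the function name and the numeric values below are illustrative, not from the slides.

```python
def ppm(performance, power, alpha):
    """Performance-Power Metric: performance^alpha / power.

    alpha >= 1 encodes the preferred tradeoff: values near 1 favor
    low power, larger values favor high performance."""
    assert alpha >= 1, "the slides assume alpha >= 1"
    return performance ** alpha / power

# Hypothetical configuration: 100 MIPS at 20 W.
print(ppm(100.0, 20.0, alpha=1))  # power-oriented weighting
print(ppm(100.0, 20.0, alpha=2))  # performance-oriented weighting
```

Note that for α = 2 the metric is the MIPS²/Power quantity plotted later in the MU slides.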

  5. Outline • Performance and Power Model • Thread Allocation • Numerical Results

  6. Simplified Performance Model: Single Coarse-Grain Multi-Threaded Core • The model is an extension of Agarwal's model* to asymmetric threads. For simplicity we assume: • No sharing effect. • The miss rate does not depend on the number of threads (holds for a small number of threads and a large private cache). • The miss rate and total memory accesses do not vary over time. • No context-switch overhead. * "Performance tradeoffs in multithreaded processors", A. Agarwal, IEEE Transactions on Parallel and Distributed Systems, 1992

  7. Terminology: Single Coarse-Grain Multi-Threaded Core • Thread i runs δi clocks until it suffers an L1 cache miss. • T: clocks to fetch from the shared cache, T = h·t + T_L2$, where h is the hop count, t the per-hop latency, and T_L2$ the shared-cache access time. [Timing diagram: threads 1 and 2 alternate execution bursts of δ1 and δ2 clocks; each cache miss stalls its thread for T clocks, leaving idle time on the core]

  8. Memory-Bound Case • Thread i performance and core utilization for the case where the miss latency T is not covered by the other threads' work, so the core has idle time. • Each thread gets executed every δi + T clocks. [Timing diagram: bursts δ1 and δ2 separated by miss latency T, with idle time between cache responses]

  9. CPU-Bound Case • Thread i performance and core utilization when the number of threads reaches saturation: the other threads' work covers the miss latency and the core is fully utilized. • Each thread is executed every Σj δj clocks. [Timing diagram: M threads executing back-to-back with no idle time]
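The two cases can be sketched for the symmetric-thread special case of Agarwal's model (the slides handle asymmetric threads; this simplification and the helper names are mine): each of n threads runs delta clocks, then stalls T clocks on an L1 miss, with T = h·t + T_L2$ as defined above.

```python
def core_utilization(n, delta, T):
    """n symmetric threads on one coarse-grain multithreaded core.
    Memory-bound: n*delta clocks of work per delta+T window leaves idle time.
    CPU-bound: the pending work covers the stall and the core saturates."""
    return min(1.0, n * delta / (delta + T))

def per_thread_rate(n, delta, T):
    """Fraction of core clocks each thread receives: delta/(delta+T)
    below saturation (one burst every delta+T clocks), 1/n above it
    (one burst every n*delta clocks)."""
    return min(delta / (delta + T), 1.0 / n)

# Saturation threshold: n = (delta + T) / delta threads, here 5.
delta, T = 10, 40
print([core_utilization(n, delta, T) for n in (1, 3, 5, 7)])
```

Utilization grows linearly with the thread count until the saturation threshold, then stays at 1 while per-thread performance degrades, which is the shape plotted on the next slides.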

  10-12. Performance Per Thread [Plot, built up over three slides: performance per thread vs. number of threads; performance is flat below the saturation threshold and drops beyond it as the core saturates, with separate curves for cores 1 hop, 2 hops, and more hops from the shared cache]

  13. Power Model • Core power consumption: P = P_idle + U·(P_active − P_idle), where U is the core utilization. • P_active: power consumption of a fully utilized core. • P_idle: idle core power consumption.
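The slide's equation itself was an image; a common form consistent with its two parameters (and with the later P_idle/P_active plots) interpolates linearly between them, which is the assumption made in this sketch:

```python
def core_power(u, p_active, p_idle):
    """Power of one operating core at utilization u in [0, 1]:
    P_idle when idle, P_active when fully utilized, linear in between
    (the linearity is an assumption of this sketch)."""
    assert 0.0 <= u <= 1.0
    return p_idle + u * (p_active - p_idle)

def cmp_power(utilizations, p_active, p_idle):
    """Total CMP power: only operating cores (u > 0) contribute, so
    activating a core costs at least P_idle (cf. the MU slides)."""
    return sum(core_power(u, p_active, p_idle) for u in utilizations if u > 0)
```

The discontinuity at u = 0 (a powered-off core draws nothing, an operating one draws at least P_idle) is exactly what motivates the Minimum Utilization concept below.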

  14. Outline • Performance and Power Models • Thread Allocation • Numerical Results

  15. The Thread Allocation Problem • Given: • A CMP topology composed of M identical cores. • P applications, each with Ti symmetric threads (1 ≤ i ≤ P). • α: the preferred tradeoff between performance and power. • Find the thread allocation ni(c) (threads of application i, the application index, on core c, the core index) which maximizes the PPM: (Average Thread Performance)^α / Power, with α ≥ 1. • For simplicity: 1) assume ni(c) is continuous; 2) perform result discretization afterwards.

  16. Minimum Utilization (MU) • Activating a core increases the power consumption by at least P_idle. • To justify operating a core, a corresponding increase in performance is required. • MU is the minimum utilization which justifies operating a core.

  17. Minimum Utilization (MU) Calculation • Compare the PPM values of two cases: • 1st case: the threads are executed by m over-saturated cores. • 2nd case: the threads are executed by m cores at exactly the saturation threshold, with the (m+1)th core's utilization equal to MU. • MU is defined by PPM_1st case = PPM_2nd case.

  18. Minimum Utilization (MU) Calculation, m = 1 • 1st case: all threads are executed by a single over-saturated core. • 2nd case: the first core is at the saturation threshold and the remaining threads are executed by the second core, whose utilization equals MU; power increases by P_idle. • The crossing point PPM_1st case = PPM_2nd case defines MU. [Plot: PPM (MIPS²/Power, ×10⁴) vs. number of threads for both cases]

  19. Minimum Utilization (MU) Calculation • The same comparison for general m: PPM_1st case (m over-saturated cores) = PPM_2nd case (m cores at exactly the saturation threshold, plus the (m+1)th core at utilization MU).

  20. Minimum Utilization (MU): Approximated Value and α Dependency [Plot: minimum utilization (%) vs. α. Less power ↔ more performance: for small α, power is more important, so operate a core only if it is highly utilized; for large α, performance is more important, so operate a core even if its utilization is low]
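One way the crossing point can be found numerically, as a sketch with assumed models rather than the paper's derivation: take a saturated core to deliver unit throughput at power P_active, and a core at utilization u to deliver throughput u at power P_idle + u·(P_active − P_idle). MU of the (m+1)th core is then the u at which the two cases' PPM values meet:

```python
def minimum_utilization(m, alpha, p_active, p_idle, tol=1e-9):
    """Bisect for the utilization u of the (m+1)th core at which
    PPM(m over-saturated cores) == PPM(m saturated cores + one at u)."""
    def ppm_oversaturated():
        # m saturated cores: throughput m, power m * p_active.
        return m ** alpha / (m * p_active)

    def ppm_extra_core(u):
        # m saturated cores plus one extra core at utilization u.
        power = m * p_active + p_idle + u * (p_active - p_idle)
        return (m + u) ** alpha / power

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if ppm_extra_core(mid) < ppm_oversaturated():
            lo = mid  # extra core not yet justified at this utilization
        else:
            hi = mid
    return (lo + hi) / 2
```

Consistent with the slide, the returned MU shrinks as α grows (performance matters more) and approaches 100% as α approaches 1 (power matters more).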

  21. The Thread Allocation Algorithm (ITA): Highlights • Iterative. • In each iteration, the threads with the highest cache miss rate are allocated to the core closest to the shared cache, filling it to at most the saturation threshold. • A core is operated only if the MU threshold is achieved.

  22. Outline • Performance and Power Model • Thread Allocation Problem • Numerical Results

  23. How to Evaluate ITA's PPM? • Compare the average PPM values of: • ITA. • Equal utilization. • Optimization algorithms. • Scenarios: 2-8 cores and 2-8 applications, using the following distributions: [table not preserved in the transcript]

  24. Equal Utilization Comparison • Average improvement of 47%. [Plot over (cores, applications) scenarios, annotated with average operated-core counts: 4.7 vs. 3.6 cores at (2,5), 7.2 vs. 7.9 cores at (5,8)]

  25. Comparison with Optimization Methods • Compared against the best PPM of: • Constrained nonlinear optimization. • Pattern search algorithm. • Genetic algorithm. • These methods were run for 10,000× longer than ITA.

  26. Optimization Methods Comparison • Average improvement of 9%. [Plot over cores and applications, annotated with average operated cores: ITA 4.7 vs. optimization methods 7.1; ITA 3.6 vs. 4.6 at (2,5); ITA 7.1 vs. 7.9; ITA 7.9 vs. 8 at (5,8)]

  27. Summary • Tunable Performance Power Metric. • The Minimum Utilization concept. • An approach for low-computational-cost thread allocation on a CMP. • Future work: • Extension to a distributed cache: co-allocation of threads and data. • Consideration of the sharing effect. • Heterogeneous CMPs.

  28. Questions ?

  29. Backup

  30. Performance Power Metric • Follows definitions used in logic circuit design. • If E is the energy and t is the delay, Penzes & Martin* introduce the metric E·t^α, where α becomes larger as performance becomes more important. * "Energy-delay efficiency of VLSI computations", P.I. Penzes and A.J. Martin, 12th ACM Great Lakes Symposium on VLSI, 2002

  31. Minimum Utilization (MU) Calculation, Cont. • The MU value depends on how many cores are already operating. • For a large enough m, the MU value is nearly constant, so an approximate constant value is reasonable (keep it simple). [Plot: MU of the (m+1)th core vs. m, for α = 1, 1.5, 2, 2.5, 3]

  32. MU vs. P_idle/P_active [Plot: minimum utilization (%) vs. P_idle/P_active, for α = 1, 1.2, 1.4, 1.6, 1.8, 2]

  33. Previous Work

  34. Previous Work: Neglecting the Sharing Effect • Fedorova et al., "Chip multithreading systems need a new operating system scheduler". • Its goal is to keep the cores highly utilized. • It tries to pair high-IPC tasks with low-IPC tasks in order to reduce pipeline resource contention. • It neglects the sharing effect among threads (similar to my research). • It does not take into account the varying distances of cores from the L2 shared cache. • It does not consider the power consumption.

  35. Discretization • There are many discretization methods… • We use the histogram specification method (from image processing).

  36. Results Discretization Example • On average, results discretization reduces the PPM value by 5%. [Plot: number of threads vs. core hop distance, comparing the continuous (C) and discretized (D) allocations]

  37. Results Discretization

  38. Flow Chart • Initialize: current core = the core closest to the shared cache; current application = the application with the highest miss rate. • Allocate threads of the current application over the current core until at most threshold saturation. • If all the threads of the current application were allocated: if it is the last application, finish; otherwise, current application = the unallocated application with the highest cache miss rate. • If the current core is at the saturation threshold: if it is the last core, or the unallocated threads would not achieve MU on the next available closer core, allocate all remaining threads over the already-operating cores (over-saturation) and finish; otherwise, current core = the next available closer core to the shared cache.
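The flow chart above can be sketched in code; the data shapes (applications as (miss_rate, thread_count) pairs, cores indexed from closest to farthest from the shared cache) and the integer round-robin handling of over-saturation are assumptions of this sketch:

```python
def ita_allocate(apps, n_cores, sat_threshold, mu):
    """Return alloc[c] = thread count on core c (core 0 is closest to
    the shared cache), following the ITA flow chart."""
    alloc = [0] * n_cores
    core = 0
    # Serve applications in decreasing cache-miss-rate order.
    remaining = [t for _, t in sorted(apps, key=lambda a: a[0], reverse=True)]
    i = 0
    while i < len(remaining):
        take = min(sat_threshold - alloc[core], remaining[i])
        alloc[core] += take
        remaining[i] -= take
        if remaining[i] == 0:
            i += 1                       # next application (or finish)
            continue
        # Current core reached the saturation threshold.  Open the next
        # core only if the leftover threads would give it at least MU.
        leftover = sum(remaining[i:])
        if core + 1 < n_cores and leftover >= mu * sat_threshold:
            core += 1
        else:
            # Spread the remainder over the already-operating cores
            # (over-saturation) and finish.
            for j in range(leftover):
                alloc[j % (core + 1)] += 1
            return alloc
    return alloc
```

For example, with applications [(0.5, 4), (0.1, 3)], three cores, a saturation threshold of 4 threads per core, and MU = 0.5, the closest core is filled to the threshold by the high-miss-rate application and the second core serves the rest.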

  39. Time Complexity Comparison: Optimization Methods vs. ITA • Ratio: ITA operations / minimum of the optimization methods' operations. • ITA consumes on average 0.01%, and at most 2.5%, of the minimum computational effort required by the optimization methods, with a PPM gap of only 9% from the best of them. [Plot of the ratio over cores and applications]

  40. Notation • [symbol]: ratio of memory-access instructions out of the total instruction mix of thread i. • [symbol]: cache miss rate for thread i.
