METERG: Measurement-Based End-to-End Performance Estimation Technique in QoS-Capable Multiprocessors
Jae W. Lee and Krste Asanovic {leejw, krste}@mit.edu
MIT Computer Science and Artificial Intelligence Lab
April 5, 2006


Presentation Transcript


  1. METERG: Measurement-Based End-to-End Performance Estimation Technique in QoS-Capable Multiprocessors
  Jae W. Lee and Krste Asanovic {leejw, krste}@mit.edu
  MIT Computer Science and Artificial Intelligence Lab
  April 5, 2006

  2. Multiprocessors Are Ubiquitous
  • Our focus: running soft real-time apps (e.g., media codecs, web services) on general-purpose MPs
  • Assuming multiprogrammed workloads: real-time + best-effort apps
  • Problem: contention for shared resources causes wider variation of a task’s execution time.
  [Figure: multiprocessor platforms everywhere (embedded CMPs such as ARM MPCore, handhelds, desktops, servers), each with processors P1..Pn sharing a bus, memory, and I/O]

  3. Execution Time Variation in MPs
  • For a fixed input to a fixed task, execution time varies with the activities of the other processors because of the shared memory system.
  [Figure: execution time of successive instances i..i+4 of task X on a shared-bus/shared-memory system, ranging between the no-contention and severe-contention cases]

  4. Impact of Resource Contention
  • Execution time increases by 85% when the number of background processes grows from 0 to 7.
  • How can we limit the impact of resource contention? → QoS support
  [Figure: normalized execution time of memread on P1 (fixed input) vs. number of background processes (0, 1, 3, 7) on an 8-processor shared-bus system: 1.00, 1.02, 1.05, 1.85]

  5. Memory QoS Reduces MP Execution Time Variation
  • Memory QoS guarantees per-processor memory bandwidth and latency: it guarantees minimum performance and distributes unclaimed bandwidth to sharers to maximize throughput.
  • Challenge: how can we find the minimal QoS parameters that meet an RT task’s given performance goal?
  • Minimal reservation improves total system throughput.
  [Figure: range of execution time of task X instances i..i+4, much narrower with QoS than without]

  6. Translating a Performance Goal in a User Metric into QoS Parameters
  • The user’s performance goal (user metric) should be translated into QoS parameters (system metric).
  • User metrics: execution time, transactions per second (TPS), frames per second (FPS)
  • System metrics: memory bandwidth, latency
  • Example: MP server virtualization. What guarantees can you make to users? Users prefer a user metric (e.g., TPS) to a system metric (e.g., memory bandwidth).
  • Minimal QoS parameters are important to maximize system throughput.
  [Figure: partitions 1..n running on a multiprocessor system substrate]

  7. Finding Minimal Resource Reservation (1): Analysis
  • Static timing analyses: tools for WCET analysis, consisting of hardware modeling + program analysis
  • [+] Can find the minimal resource reservation for all possible inputs
  • [-] Analysis cost is prohibitive, and it gets harder as hardware/software complexity grows in MPs.
  • Possibly overkill for soft RT apps: our primary concern is performance degradation from resource contention (not from input data), and we can tolerate a small number of deadline violations to maximize overall system throughput.

  8. Finding Minimal Resource Reservation (2): Measurement
  • Measurement-based approach: measure the task’s execution time for a certain input set.
  • Motto: “The best model for a system is the system itself.” [Colin, RTSS ’03]
  • Goal: find the execution time under worst-case contention, TMAX(x, i), for a given QoS parameter x and instruction sequence i.
  [Figure: probability distribution of execution time for fixed instruction sequence i; the range of variation spans low to high contention and is bounded above by TMAX(x, i)]
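To make the measurement-based approach concrete, here is a minimal sketch (not from the talk) that estimates TMAX(x, i) by simply taking the maximum over repeated runs of the task for a fixed input. The run_task() harness is hypothetical, and the estimate is only as good as the contention that happens to occur during the runs, which is exactly the weakness METERG addresses.

```python
# Plain measurement-based estimation of TMAX(x, i): run the fixed task
# repeatedly and keep the worst observed time. run_task() is a hypothetical
# harness that executes the fixed instruction sequence i on a fixed input.

import time

def estimate_tmax(run_task, runs: int = 30) -> float:
    """Return the maximum execution time (seconds) observed over `runs` runs."""
    worst = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        run_task()
        worst = max(worst, time.perf_counter() - start)
    return worst
```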

  9. Finding Minimal Resource Reservation (2): Measurement (Cont.)
  • [+] Easy
  • [-] Not absolutely safe, but practically useful for soft real-time apps, and useful if we are given a “near-worst” input set
  • [-] Measured time varies with the degree of contention at runtime → we address this problem in the rest of this talk!
  [Figure: same distribution as the previous slide — execution time for fixed instruction sequence i vs. degree of contention, bounded above by TMAX(x, i)]

  10. Our Approach: METERG
  • The METERG QoS system improves the measurement-based approach: it estimates TMAX(x, i) easily by measurement.
  • Provides hardware support for a safe and tight estimate of TMAX(x, i)
  • Introduces two QoS modes: deployment mode (operation mode) and enforcement mode (measurement mode for estimation)
  [Figure: the enforcement-mode distribution of execution time for fixed instruction sequence i is (1) always greater than TMAX(x, i) (safe) and (2) very narrow and close to TMAX(x, i) (tight), compared with the wide deployment-mode range]

  11. Modifying Shared Blocks to Support Two QoS Modes
  • Each QoS block now supports two operation modes:
  • 1) Deployment mode (conventional QoS mode, operation): QoS parameters are treated as a lower bound on the resources received at runtime.
  • 2) Enforcement mode (new QoS mode, estimation): QoS parameters are treated as an upper bound on the resources received at runtime.
  [Figure: with a QoS parameter of 50, the processor receives at least 50 in deployment mode and at most 50 in enforcement mode]

  12. Assumptions
  • Single-threaded apps; no shared objects
  • No contention within a node (e.g., from multithreaded processors)
  • No preemption of a scheduled task
  • Negligible impact of initial state on execution time

  13. Baseline MP System
  • Simple fixed frame-based (TDM) scheduling for memory access
  • xi: fraction of time slots reserved for processor i (example: x1 = 0.2, xn = 0.3)
  • xi determines both the minimum bandwidth and the worst-case latency.
  • Slots not taken by their owner are given to a requester in round-robin fashion.
  [Figure: processors 1..n with reservations x1..xn attached to a QoS-capable shared bus and shared memory; a frame of 10 time slots with each slot tagged by its owning processor]
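As a rough illustration of what the reservation fraction xi buys under this fixed-frame TDM scheme, the sketch below computes the guaranteed bandwidth floor and the worst-case wait for the next reserved slot. The 10-slot frame and the x1/xn values come from the slide; the even-spacing assumption, function names, and slot parameters are mine.

```python
# Per-processor guarantees implied by fixed frame-based (TDM) bus scheduling.
FRAME_SLOTS = 10  # frame size used in the slide's example

def min_bandwidth_bytes_per_s(x_i: float, slot_bytes: int, slot_time_s: float) -> float:
    """Processor i always owns x_i of all bus slots, giving this bandwidth floor."""
    return x_i * slot_bytes / slot_time_s

def worst_case_wait_slots(x_i: float) -> int:
    """Worst-case wait (in slots) for processor i's next reserved slot, assuming
    its k = x_i * FRAME_SLOTS reserved slots are spread evenly across the frame."""
    k = int(round(x_i * FRAME_SLOTS))
    return -(-FRAME_SLOTS // k)  # ceiling division

# Example reservations from the slide: x1 = 0.2, xn = 0.3. Unreserved slots are
# handed out round-robin at runtime, so these are floors, not exact allocations.
for name, x in (("x1", 0.2), ("xn", 0.3)):
    print(name,
          "min bandwidth (B/s):", min_bandwidth_bytes_per_s(x, 64, 10e-9),
          "worst-case wait (slots):", worst_case_wait_slots(x))
```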

  14. Runtime Usage Example in METERG System
  • Example: x1 = 0.2, i.e., 2 of the 10 time slots in each frame are reserved for processor 1.
  • Runtime usage (deployment): processor 1 uses its reserved slots and may also use slots that are not reserved for it.
  • Runtime usage (enforcement): processor 1 may use only its own reserved slots; all other slots are not allowed, and reserved slots may go unused.
  [Figure: a 10-slot frame annotated per mode with “reserved & used”, “not reserved but used”, “reserved but unused”, and “not allowed to use” slots]
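A sketch of the per-slot grant decision implied by slides 11 and 14: deployment mode treats the reservation as a floor (spare slots are redistributed), while enforcement mode treats it as a ceiling. The function name and argument layout are hypothetical, not the authors' hardware interface.

```python
# Per-slot arbitration in the two METERG QoS modes.

def may_use_slot(mode: str, slot_owner, requester, owner_has_request: bool) -> bool:
    """Decide whether `requester` may be granted this bus time slot."""
    if slot_owner == requester:
        # Own reserved slot: usable in both modes.
        return True
    if mode == "deployment":
        # Reservation is a lower bound: slots that are unreserved, or whose
        # owner is idle, are redistributed (e.g., round-robin among requesters).
        return slot_owner is None or not owner_has_request
    if mode == "enforcement":
        # Reservation is an upper bound: only the requester's own reserved
        # slots may be used, so a measured run never benefits from spare bandwidth.
        return False
    raise ValueError(mode)

# Deployment: P1 can grab an unreserved slot.
assert may_use_slot("deployment", None, "P1", owner_has_request=False)
# Enforcement: P1 may not use it, even if nobody else wants it.
assert not may_use_slot("enforcement", None, "P1", owner_has_request=False)
```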

  15. Obtaining Minimal Resource Reservation Using METERG
  • Measurement phase:
  • Step 1: Measure performance in enforcement mode for a given QoS parameter.
  • Step 2: Iterate Step 1 to find the minimal QoS parameter that still meets the performance goal.
  • Step 3: Store the minimal QoS parameter.
  • Operation phase:
  • Step 4: Execute the program in deployment mode with the stored QoS parameter for guaranteed performance.
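The measurement-phase loop might look like the sketch below: sweep the QoS parameter upward, measure in enforcement mode each time, and return the smallest value that meets the deadline. The harness functions (set_qos_mode, set_qos_parameter, run_and_time) and the granularity are hypothetical stand-ins for whatever the platform exposes.

```python
# Measurement phase (Steps 1-3): find the minimal QoS parameter x whose
# enforcement-mode measurement still meets the performance goal.

def find_minimal_qos_parameter(deadline_s: float,
                               run_and_time,       # runs the task, returns seconds
                               set_qos_parameter,  # programs x into the QoS block
                               set_qos_mode,       # "enforcement" / "deployment"
                               granularity: float = 0.05):
    set_qos_mode("enforcement")           # QoS parameter acts as an upper bound
    steps = int(round(1.0 / granularity))
    for k in range(1, steps + 1):
        x = k * granularity
        set_qos_parameter(x)
        if run_and_time() <= deadline_s:  # goal met even under worst-case contention
            return x                      # Step 3: store this minimal parameter
    return None                           # goal unreachable even with x = 1.0
```

Because enforcement-mode execution time shrinks monotonically as x grows, a binary search over x would work equally well. In the operation phase (Step 4), the returned value is programmed into the QoS block and the task runs in deployment mode.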

  16. Bandwidth Guarantees Are Not Enough: An Example
  • Shared-bus example again (x1 = 0.2)
  • The bandwidth inequality is guaranteed, but the latency inequality is not:
  • Bandwidth_DEP(xi) ≥ Bandwidth_ENF(xi)  (holds)
  • MAX[Latency_DEP(xi)] ≤ MIN[Latency_ENF(xi)]  (does not hold)
  • → May cause end-to-end performance inversion.
  [Figure: the same memory request takes 4 time slots in deployment mode (intervening slots used by other processors) but only 2 time slots in enforcement mode]

  17. Two Enforcement Modes: Safety-Tightness Tradeoff
  • Relaxed enforcement (R-ENF) mode: bandwidth-only guarantees
  • Bandwidth_DEP(xi) ≥ Bandwidth_R-ENF(xi)
  • There is a (small) chance that the observed execution time in enforcement mode is smaller than TMAX(x, i). (Not safe)
  • Strict enforcement (S-ENF) mode: bandwidth and latency guarantees
  • Bandwidth_DEP(xi) ≥ Bandwidth_S-ENF(xi) && MAX[Latency_DEP(xi)] ≤ MIN[Latency_S-ENF(xi)]
  • Observed execution time in enforcement mode is always greater than TMAX(x, i), as long as the processor has no timing anomaly. (Safe)
  • The estimate of TMAX(x, i) is looser.

  18. Memory System with Strict Enforcement Mode: An Implementation
  • Delay queue:
  • In deployment mode: bypassed
  • In enforcement mode: defers delivery to the processor for “lucky” (early) replies
  • TMAX(DEP)(x1): worst-case latency in deployment mode for the given parameter x1
  • Tactual: actual time taken to service a request
  [Figure: METERG QoS memory block serving processors 1..n; each valid delay-queue entry holds the reply data and a timer initialized to TMAX(DEP)(x1) − Tactual]
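A behavioral sketch of the delay queue: in strict enforcement mode, a reply that was serviced in Tactual cycles is held back until TMAX(DEP)(x1) cycles have elapsed, so the measuring processor never observes a faster-than-worst-case reply. The class and method names are mine; the slide specifies only the data/timer/valid entry layout.

```python
# Delay queue for strict enforcement mode (bypassed in deployment mode).
from collections import deque

class DelayQueue:
    def __init__(self, tmax_dep_cycles: int, enforcement: bool):
        self.tmax = tmax_dep_cycles      # worst-case DEP-mode latency for this x
        self.enforcement = enforcement   # False -> deployment mode, queue bypassed
        self.entries = deque()           # each entry: [data, remaining_delay_cycles]

    def push_reply(self, data, t_actual_cycles: int):
        """Accept a reply serviced in t_actual_cycles; return it immediately in
        deployment mode, otherwise enqueue it with the residual delay."""
        if not self.enforcement:
            return data
        self.entries.append([data, max(0, self.tmax - t_actual_cycles)])
        return None

    def tick(self):
        """Advance one cycle; return the head reply once its timer has expired."""
        if self.entries:
            head = self.entries[0]
            if head[1] <= 0:
                return self.entries.popleft()[0]
            head[1] -= 1
        return None
```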

  19. Evaluation: Simulation Setup
  • 8-way SMP running Linux
  • In-order cores clocked 5x faster than the system bus
  • Detailed memory contention model with a simple fixed-frame TDM bus
  • 32 KB unified blocking L1 cache; no L2 cache
  • Synthetic benchmark memread: an infinite loop accessing a large chunk of memory sequentially, so performance is bottlenecked by the memory system.
  [Figure: processors 1..8, each with a METERG QoS block (OpMode, E/D select) in its network interface, exchanging requests/replies over a shared bus with shared memory]
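The talk describes memread only as an infinite loop reading a large chunk of memory sequentially; the sketch below is one guess at such a microbenchmark, not the authors' code. The buffer size, stride, and timing are arbitrary, and a real run would use C rather than Python to actually be memory-bound.

```python
# memread-style microbenchmark: sequentially touch a buffer far larger than
# the 32 KB L1 so every pass streams from memory.
import time

BUF_BYTES = 64 * 1024 * 1024   # much larger than the L1 cache
STRIDE = 64                    # one access per (assumed) 64-byte cache line
buf = bytearray(BUF_BYTES)

checksum = 0
while True:                    # infinite loop, as in the talk
    start = time.perf_counter()
    for i in range(0, BUF_BYTES, STRIDE):
        checksum ^= buf[i]     # sequential, memory-bound accesses
    print(f"pass: {time.perf_counter() - start:.3f} s, checksum={checksum}")
```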

  20. Evaluation: Varying Number of Processes
  • Setup: measure P1 with x1 = 0.25 while the number of best-effort background processes varies from 0 to 7.
  • Execution time degradation from 0 to 7 background processes: 85% without QoS (best-effort), 10% with QoS (deployment mode).
  • Estimated execution time upper bound for the given QoS parameter (x1 = 0.25): 45% (enforcement mode).
  [Figure: normalized execution time vs. number of background processes (0, 1, 3, 7): 1.85 (BE), 1.45 (ENF), 1.10 (DEP)]

  21. Evaluation: Varying QoS Parameters
  • Setup: measure P1 with a variable x1 and 7 best-effort background processes.
  • The performance estimate becomes tighter as the QoS parameter x1 increases, because the worst-case memory access latency decreases accordingly.
  [Figure: normalized execution time vs. QoS parameter x1 in enforcement and deployment modes; the ENF/DEP gap shrinks from 2.35/1.51 toward 1.09/1.01 as x1 grows]

  22. Evaluation: Varying Number of QoS Processes
  • The performance impact on a QoS process from other QoS processes is negligible (<2%) as long as the system is not oversubscribed.
  [Figure: normalized IPC (higher is better) for 1 QoS + 7 BE through 4 QoS + 4 BE configurations, with QoS parameters (0.25, 0, 0, 0, 0, 0, 0, 0), (0.25, 0.25, 0, 0, 0, 0, 0, 0), (0.25, 0.25, 0.25, 0, 0, 0, 0, 0), and (0.25, 0.25, 0.25, 0.20, 0, 0, 0, 0); QoS processes stay at 0.89–0.91, above the estimated lower bound for xi = 0.25, while best-effort processes and the 8-BE case fall to 0.14–0.45]

  23. Conclusion
  • In general-purpose MPs, shared-resource contention causes large variation in a task’s execution time (~2x increase observed in the 8-processor case).
  • The METERG QoS system provides an easy, measurement-based way to estimate execution time under worst-case resource contention for a given QoS parameter, by introducing a new QoS mode (enforcement mode) for performance estimation.
  • Using the METERG QoS system, the minimal QoS parameter can easily be found for a given instruction sequence and performance goal.
