
Chapter 7 Performance Analysis Techniques



  1. Chapter 7 Performance Analysis Techniques

  2. Outline • Real-time performance analysis • Applications of queueing theory • Input/output performance • Analysis of memory requirements

  3. 7.1 Real-time performance analysis

  4. Theoretical preliminaries • Complexity classes P, NP, NP-complete, NP-hard • P: the class of problems that can be solved by an algorithm that runs in polynomial time on a deterministic computing machine. • NP: the class of decision problems that can be solved in polynomial time by a nondeterministic machine; equivalently, a candidate solution can be verified as correct or incorrect by a P-class (polynomial-time) algorithm. • NP-complete: a problem that belongs to the class NP and to which all other problems in NP are polynomially transformable. • NP-hard: a problem to which all problems in NP are polynomially transformable, but which has not been shown to belong to the class NP.

  5. Examples • The Boolean satisfiability problem (SAT, over N Boolean variables) is NP-complete. • However, SAT instances restricted to only 2 or 3 Boolean variables are in P, since a fixed number of variables can be checked by exhaustive enumeration. • SAT can arise in requirements consistency checking. • In general, the NP-complete problems in real-time systems tend to be those relating to resource allocation in multitask scheduling situations. • This implies there is no easy way to find the solutions.

  6. More examples • The problem of deciding whether it is possible to schedule a set of periodic tasks that use only semaphores to enforce mutual exclusion is NP-hard. • The multiprocessor scheduling problem with two processors, no resources, arbitrary partial-order relations, and every task having a 1-unit computation time is polynomial. • The multiprocessor scheduling problem with two processors, no resources, independent tasks, and arbitrary task computation times is NP-complete. • The multiprocessor scheduling problem with two processors, no resources, independent tasks, arbitrary partial order, and task computation times of either 1 or 2 units of time is NP-complete. • Partial order: every task precedes itself (reflexivity); if A precedes B, then B cannot precede A (antisymmetry); and if A precedes B and B precedes C, then A precedes C (transitivity).

  7. Arguments related to parallelization • Amdahl’s law • Statement: For a constant problem size, the incremental speedup approaches zero as the number of processing elements grows. • Formalism: Let N be the number of equal processors available for parallel processing, and let S (0 ≤ S ≤ 1) be the fraction of the program code that is of a serial nature (cannot be parallelized). The achievable speedup is

Speedup(N) = 1 / (S + (1 − S)/N)

which saturates to the limit value 1/S as N approaches infinity.

  8. Some discussion • Amdahl’s pessimistic law has been cited as an argument against parallel systems and, in particular, against massively parallel processors. • It was taken as an insurmountable bottleneck limiting the efficiency and applicability of parallelism to various problems. • Later research provided new insights into Amdahl’s law and its relation to large-scale parallelism.

  9. Flaws of Amdahl’s law • Key assumption of Amdahl’s law: the problem size remains constant. In practice, the problem size tends to scale with the size of the parallel system. • Items that scale with the problem size: the parallel or vector part of a program. • Items that do not grow with the problem size: the inherent time for vector start-up, program loading, serial bottlenecks, and I/O, which make up the serial component.

  10. Gustafson’s law • Definition: If the strictly serial code fragment, S, and the parallelizable fragment, (1 − S), are processed by a parallel computer system with N equal processors, the achievable speedup is

Speedup(N) = S + N(1 − S) = N − S(N − 1)

• Does not saturate as N approaches infinity • Provides a more optimistic picture of speedup • The current “multi-core era” could be viewed as a partial consequence of Gustafson’s law. “A more efficient way to use a parallel computer is to have each processor perform similar work, but on a different section of the data .. where large computations are concerned” (Hillis, 1998)

  11. Gustafson vs. Amdahl • Gustafson’s unbounded speedup compared with Amdahl’s saturating speedup when 50% of the code is suitable for parallelization (S = 0.5). [Figure: speedup versus number of processors N for the two laws] • With S = 0.5, Amdahl’s speedup saturates at 1/S = 2, while Gustafson’s speedup grows without bound as 0.5 + 0.5N.
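
For a feel of the numbers, here is a minimal C sketch (the processor counts chosen are illustrative, not from the slides) that evaluates both speedup formulas for the slide’s S = 0.5 case:

#include <stdio.h>

/* Amdahl (slide 7): constant problem size; saturates at 1/S. */
static double amdahl(double s, int n)    { return 1.0 / (s + (1.0 - s) / n); }

/* Gustafson (slide 10): problem scales with N; grows without bound. */
static double gustafson(double s, int n) { return s + n * (1.0 - s); }

int main(void)
{
    const double s = 0.5;                       /* 50% serial fraction */
    const int n_values[] = { 1, 2, 4, 8, 16, 64, 1024 };
    for (size_t i = 0; i < sizeof n_values / sizeof n_values[0]; ++i) {
        int n = n_values[i];
        printf("N=%5d  Amdahl=%6.3f  Gustafson=%8.1f\n",
               n, amdahl(s, n), gustafson(s, n));
    }
    return 0;   /* Amdahl's column approaches 2.0; Gustafson's grows linearly */
}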

  12. Execution time estimation from program code • Analyzing real-time systems to see if they meet their critical deadlines is rarely possible exactly, due to the NP-completeness of most scheduling problems. • It is, however, possible to get a handle on the system’s behavior through approximate analysis. • The first step in performing schedulability analysis is to predict, estimate, or measure the execution time of essential code units. • Methods to determine a task’s execution time ei: • Using a logic analyzer (most accurate; employed in the final stages during system integration) • Counting CPU-specific instructions manually or using automated tools • Reading the system clock before and after executing the particular program code.

  13. Example: instruction-counting application • A certain program module converts raw sensor pulses into actual acceleration components that are later compensated for temperature and other effects. • The module also decides whether the aircraft is still on the ground, in which case only a small acceleration reading is allowed for each of the X, Y, and Z components (represented by the symbolic constant PRE_TAKE). • The C code with the corresponding assembly instructions is given in the source slides (not reproduced in this transcript).

  14. Example 1 • Tracing the worst-case execution path and counting the instructions shows • 12 integer instructions (7.2 µs) and • 15 floating-point instructions (75 µs), for a total execution time of 82.2 µs. • Since this sequence of code runs in a 5 ms cycle, the corresponding time-loading is only 82.2/5000 ≈ 1.6%.

  15. Example 2: Estimation on a non-pipelined CPU platform • All execution paths: • Path 1: instructions 1–4, 9–10, 12 • 7 instructions @ 0.6 µs each → 4.2 µs (best-case execution time, BCET) • Path 2: instructions 1–7, 11–12 • 9 instructions @ 0.6 µs each → 5.4 µs • Path 3: instructions 1–8, 12 • 9 instructions @ 0.6 µs each → 5.4 µs (worst-case execution time, WCET)

  16. Example 2: Estimation on a pipelined CPU platform • Assume a three-stage pipeline • Fetch (F), decode (D), execute (E) • Each stage takes 0.6 µs / 3 = 0.2 µs (see Figures 7.2, 7.3, and 7.4) • The execution time of all three paths is 2.6 µs.
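
One simple way to arrive at a figure like 2.6 µs is the classic pipeline fill model: n straight-line instructions on a k-stage pipeline with per-stage time t complete in (k + n − 1)·t, and each branch that flushes the pipeline adds roughly (k − 1)·t. The sketch below is an assumption-laden illustration of that model, not the detailed trace of Figures 7.2–7.4:

#include <stdio.h>

/* Simplified pipeline timing: fill time plus a flush penalty per taken branch.
   k = pipeline stages, t = per-stage time (us), n = instructions, b = flushes. */
static double pipeline_time(int k, double t, int n, int b)
{
    return (k + n - 1) * t + b * (k - 1) * t;
}

int main(void)
{
    /* Worst-case path of Example 2: 9 instructions on the 3-stage pipeline at
       0.2 us/stage, with one assumed flush: (3+9-1)*0.2 + 1*(3-1)*0.2 = 2.6 us. */
    printf("%.1f us\n", pipeline_time(3, 0.2, 9, 1));
    return 0;
}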

  17. Some discussion • Real-time system designers frequently use special software to estimate instruction execution times and CPU throughput. • Users can typically input: • CPU type • Memory speeds for different address ranges • Instruction mix • The tool then computes total instruction times and throughput.

  18. Example 3: Timing accuracy with a 60-kHz system clock • Suppose • 2000 repetitions of the program code take 450 ms, and • the clock granularity is 1/60 kHz ≈ 16.67 µs. • The mean execution time per repetition is 450 ms / 2000 = 225 µs. • The whole measurement is uncertain by no more than about one clock tick, so the relative error is only about 16.67 µs / 450 ms ≈ 0.004%. Hence the execution time measurement has high accuracy.

  19. C code to compute the time of instruction execution • API functions • current_clock_time(): a system function that returns the current time • function_to_be_timed(): the actual code to be timed. The timer code itself is sketched below.
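
The slide’s original listing is not included in this transcript; the following is a minimal sketch of such a timer harness, assuming current_clock_time() returns a tick count of the 60-kHz system clock from Example 3 (the tick constant and repetition count are assumptions carried over from that example):

#include <stdio.h>

#define REPETITIONS 2000        /* repeat to average out clock granularity */
#define US_PER_TICK 16.67       /* 60-kHz clock: one tick every 1/60000 s */

extern unsigned long current_clock_time(void);  /* system clock, in ticks */
extern void function_to_be_timed(void);         /* the code under measurement */

int main(void)
{
    unsigned long start = current_clock_time();
    for (int i = 0; i < REPETITIONS; ++i)
        function_to_be_timed();
    unsigned long stop = current_clock_time();

    double total_us = (stop - start) * US_PER_TICK;
    printf("execution time: %.3f us per call\n", total_us / REPETITIONS);
    return 0;
}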

  20. Analysis of polled-loop systems • The response time consists of three components: • The cumulative hardware delays involved in setting the software flag by some external device (nanoseconds) • The time for the polled loop to test the flag (microseconds) • The time needed to process the event associated with the flag (milliseconds) • Assumption: sufficient processing time is available between consecutive events. [Figure: excitation-to-response timeline showing the flag-setting, flag-testing, and processing delays]
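
For concreteness, a polled loop is nothing more than a tight test-and-service loop. Below is a minimal sketch; the flag name packet_here and the handler process_event() are illustrative, not from the slides:

#include <signal.h>

volatile sig_atomic_t packet_here = 0;  /* flag set by the external device */

extern void process_event(void);        /* handles the event tied to the flag */

void polled_loop(void)
{
    for (;;) {                          /* poll forever */
        if (packet_here) {              /* flag-testing delay: microseconds */
            process_event();            /* processing delay: milliseconds */
            packet_here = 0;            /* rearm for the next event */
        }
    }
}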

  21. Analysis of polled-loop systems • If events overlap each other, • i.e., a new event is initiated while a previous one is still being processed, • then the response time becomes worse; the response time for the Nth overlapping event is bounded by

N · (tF + tP)

where • tF is the time to check the flag and • tP is the time to process the event, • ignoring the time for the external device to set the flag. • In practice, we place some limit on N, the number of events that are allowed to overlap. • Overlapping events may not be desirable at all in certain situations.

  22. Review: Coroutine central dispatcher • Two tasks, task_a and task_b, are executing in parallel and in isolation. state_a and state_b are global variables managed by the dispatcher to maintain synchronization and inter-task communication. • The dispatcher invokes the phases alternately: phase_a1(); phase_b1(); phase_a2(); phase_b2(); ...

void task_a() { for(;;) { switch(state_a) { case 1: phase_a1(); break; case 2: phase_a2(); break; case 3: phase_a3(); break; } } }

void task_b() { for(;;) { switch(state_b) { case 1: phase_b1(); break; case 2: phase_b2(); break; case 3: phase_b3(); break; } } }
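
A self-contained sketch of the scheme follows (the phase bodies, the fixed three-phase cycle, and the bounded demo loop are illustrative assumptions; unlike the fragment above, the loop is placed in the dispatcher so that each call executes exactly one phase):

#include <stdio.h>

static int state_a = 1, state_b = 1;    /* dispatcher-managed state variables */

static void phase_a1(void) { puts("phase_a1"); }
static void phase_a2(void) { puts("phase_a2"); }
static void phase_a3(void) { puts("phase_a3"); }
static void phase_b1(void) { puts("phase_b1"); }
static void phase_b2(void) { puts("phase_b2"); }
static void phase_b3(void) { puts("phase_b3"); }

static void task_a(void)                /* one phase per call, chosen by state */
{
    switch (state_a) {
    case 1: phase_a1(); break;
    case 2: phase_a2(); break;
    case 3: phase_a3(); break;
    }
}

static void task_b(void)
{
    switch (state_b) {
    case 1: phase_b1(); break;
    case 2: phase_b2(); break;
    case 3: phase_b3(); break;
    }
}

int main(void)                          /* the central dispatcher */
{
    for (int cycle = 0; cycle < 3; ++cycle) {   /* bounded for demonstration */
        task_a();                       /* phase_a1, a2, a3 ...              */
        task_b();                       /* ... interleaved with b1, b2, b3   */
        state_a = state_a % 3 + 1;      /* advance both state machines       */
        state_b = state_b % 3 + 1;
    }
    return 0;
}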

  23. Analysis of coroutine systems • The absence of interrupts in coroutine systems makes the determination of response time easy. • The response time is obtained by tracing the worst-case execution path through all tasks. • We must first determine the execution time of each phase. • Tracing the execution path in a two-task coroutine system: a central dispatcher calls task_1 and task_2 in turn (its switch statement is not shown). Execution begins at task_1a and the sequence then repeats:

void task_1() { … task_1a(); return; … task_1b(); return; … task_1c(); return; }   /* begin here */

void task_2() { … task_2a(); return; … task_2b(); return; }   /* repeat the sequence */

  24. Review: Round-robin scheduling is simple and predictable • Achieves fair allocation of CPU resources among tasks of the same priority by time multiplexing • Each executable task is assigned a fixed time quantum (time slice) to execute • A fixed-rate clock is used to initiate an interrupt at a rate corresponding to the time slice [Figure: timeline — task A runs its slice and is preempted; B takes over and completes; C runs its slice; A resumes and finishes]

  25. Analysis of round-robin systems • Assumptions and definitions: • n tasks in the ready queue, no new ones arrive after scheduling begins, and none terminates prematurely • Let q be the constant time slice for each task • Possible slack within a time slice is not utilized • Let c = max{c1, …, cn} be the maximum task execution time • Thus the worst-case time T from readiness to completion for any task (upper bound) is

T = ⌈c/q⌉ · (n − 1) · q + c

since a task needs ⌈c/q⌉ slices and may wait for the other n − 1 tasks before each of its slices.

  26. Example: turnaround time calculation without context-switching overhead • Suppose only one task with a maximum execution time of 500 ms, and the time quantum is 100 ms; then T = ⌈500/100⌉ · 0 · 100 + 500 = 500 ms. • Suppose five equally important tasks, each with a maximum execution time of 500 ms, and the time quantum is 100 ms; then T = ⌈500/100⌉ · 4 · 100 + 500 = 2000 + 500 = 2500 ms.

  27. Non-negligible context-switching overhead • Let o be the context-switching overhead per task switch. • Each task now waits no longer than (n − 1)q until its next time slice, plus an inherent overhead of n·o time units each time around for context switching. • The worst-case completion bound therefore becomes

T = ⌈c/q⌉ · [(n − 1) · q + n · o] + c

  28. Examples • Suppose one task with a maximum execution time of 500 ms, a time quantum of 40 ms, and a context-switch time of 1 ms; then T = ⌈500/40⌉ · [0 + 1·1] + 500 = 13 + 500 = 513 ms. • Suppose six equally important tasks, each with a maximum execution time of 600 ms, a time quantum of 40 ms, and a context switch costing 2 ms; then T = ⌈600/40⌉ · [5·40 + 6·2] + 600 = 15 · 212 + 600 = 3780 ms.
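
The two results can be checked mechanically. A small sketch evaluating the bound reconstructed in slides 25 and 27 (with the ceiling read as an upper bound):

#include <stdio.h>

/* Worst-case round-robin turnaround: ceil(c/q) rounds, each costing
   (n-1)*q of waiting plus n*o of context-switch overhead, plus c itself. */
static long rr_worst_case(long n, long q, long c, long o)
{
    long rounds = (c + q - 1) / q;              /* integer ceiling of c/q */
    return rounds * ((n - 1) * q + n * o) + c;
}

int main(void)
{
    printf("%ld ms\n", rr_worst_case(1, 40, 500, 1));   /* 513 ms  */
    printf("%ld ms\n", rr_worst_case(6, 40, 600, 2));   /* 3780 ms */
    return 0;
}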

  29. Selection of the time quantum q • In terms of the time quantum, it is desirable that q < c, so that the round-robin system actually time-shares and achieves fair behavior. • If q is very large, the round-robin algorithm becomes, in effect, the first-come, first-served algorithm, in that each task executes to completion within its very large time quantum.

  30. Review: Fixed-priority scheduling — the rate-monotonic approach • Theorem: Given a set of periodic tasks and preemptive priority scheduling, assigning priorities such that tasks with shorter periods have higher priorities (rate-monotonic, RM) yields an optimal scheduling algorithm. • Optimality implies: if a schedule that meets all the deadlines exists with fixed priorities, then the RM algorithm will also produce a feasible schedule.

  31. Analysis of fixed-period/priority systems • For any task i with an execution time of ei time units, the response time Ri is

Ri = ei + Ii   (7.7)

where Ii is the maximum possible delay in the execution of task i (caused by higher-priority tasks) during the interval [t, t + Ri). • The critical instant is when all higher-priority tasks are released simultaneously with task i; then Ii makes its maximum contribution to Ri.

  32. Analysis of fixed-period systems • Consider a task j of higher priority than task i. • Within the interval [0, Ri), the number of releases of task j will be

⌈Ri / pj⌉   (7.8)

where pj is the execution period of task j. • Each release of task j contributes ej to the amount of interference that task i suffers from higher-priority tasks.

  33. A recursive solution to response time • Each task j of higher priority interferes with task i; hence

Ii = Σj∈HP(i) ⌈Ri/pj⌉ · ej   (7.9)

where HP(i) is the set of higher-priority tasks with respect to task i. • Substituting this into equation 7.7 yields

Ri = ei + Σj∈HP(i) ⌈Ri/pj⌉ · ej   (7.10)

  34. A recursive solution to response time • Because of the ceiling function, it is difficult to solve for Ri directly. A neat recursive solution is the following:

Ri(m+1) = ei + Σj∈HP(i) ⌈Ri(m)/pj⌉ · ej,  starting from Ri(0) = ei   (7.11)

• Compute the consecutive values Ri(m) iteratively until the first value m is found such that Ri(m+1) = Ri(m); this fixed point is the response time Ri. • If the recursive equation does not have a solution, the value of Ri(m) will continue to grow, • as in the overloaded case: a task set whose CPU utilization factor is greater than 100%.
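
A compact sketch of the iteration in equation 7.11 follows. Tasks are assumed sorted by decreasing priority (index 0 = highest, i.e., shortest period under RM), and the example task set is illustrative, not the one from slide 35:

#include <stdio.h>

/* Response-time analysis (eq. 7.11). p[] = periods, e[] = execution times,
   tasks sorted by decreasing priority. Returns the response time of task i,
   or -1 if the iteration grows past the period (deadline = period assumed). */
static long response_time(int i, const long p[], const long e[])
{
    long r = e[i], prev = -1;           /* Ri(0) = ei */
    while (r != prev) {                 /* iterate to the fixed point */
        prev = r;
        r = e[i];
        for (int j = 0; j < i; ++j)     /* interference from higher priority */
            r += ((prev + p[j] - 1) / p[j]) * e[j];   /* ceil(prev/pj) * ej */
        if (r > p[i])
            return -1;                  /* no solution within the period */
    }
    return r;
}

int main(void)
{
    const long p[] = { 10, 20, 40 };    /* periods: shortest = highest prio */
    const long e[] = {  3,  5, 10 };    /* execution times (U = 0.8)        */
    for (int i = 0; i < 3; ++i)
        printf("R%d = %ld\n", i + 1, response_time(i, p, e));
    return 0;                           /* prints R1 = 3, R2 = 8, R3 = 29 */
}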

  35. Example: compute response time in a rate-monotonic case • Consider a task set to be scheduled rate-monotonically, as shown in the accompanying table (not reproduced in this transcript). • Let us first calculate the CPU utilization factor, U = Σ ei/pi, to make sure the real-time system is not overloaded.

  36. Example: compute response time in a rate-monotonic case • The highest-priority task has a response time equal to its execution time, so R1 = 3. • The medium- and lowest-priority tasks have their response times computed iteratively according to equation 7.11.

  37. Analysis of non-periodic systems • In practice, a real-time system having one or more aperiodic or sporadic cycles can be modeled as a rate-monotonic system, • where each non-periodic task is approximated as having a period equal to its worst-case expected inter-arrival time. • If this rough approximation leads to unacceptably high utilization, use some heuristic analysis instead (e.g., queueing theory).

  38. Response times for interrupt-driven systems • The calculation depends on several factors: • Interrupt latency • Scheduling/dispatching times • Negligible when the CPU uses a separate interrupt controller supporting multiple interrupts • Can be computed using simple instruction counting when a single interrupt is supported with an interrupt controller • Context-switch times • Determining context save/restore times is similar to execution time estimation for any application code.

  39. Interrupt latency • Interrupt latency is the varying period between the moment a device requests an interrupt and the moment the first instruction of the associated interrupt service routine executes. • Worst-case interrupt latency occurs when all possible interrupts in the system are requested simultaneously. • Main contributors: • The number of tasks, as the RTOS needs to disable interrupts while it processes lists of blocked or waiting tasks. • Perform some latency analysis to verify that the OS is not disabling interrupts for an unacceptably long time. • In hard real-time systems, keep the number of tasks as low as possible.

  40. Another contributor: the time needed to complete the execution of the particular machine-language (ML) instruction being interrupted. • Find the WCET of every ML instruction by measurement, simulation, or the manufacturer’s datasheet. • The instruction with the longest execution time makes the maximum contribution to interrupt latency if it has just begun executing when the interrupt request arrives. • Example: in a certain 32-bit MCU, • all fixed-point instructions take 2 µs, • floating-point instructions take 10 µs, and • special instructions like trigonometric functions take 50 µs, so an interrupted instruction contributes up to 50 µs.

  41. Deliberate disabling of interrupts by real-time software • Interrupts are disabled for a number of reasons: • Protection of critical regions • Buffering routines • Context switching • Therefore, allow interrupt disabling only by system software, not by application software.

  42. Architecture enhancements can render a system unanalyzable for real-time performance • Instruction and data caches • On a miss, instructions must be fetched from slower main memory, and a time-consuming replacement algorithm brings the missing instructions into the cache. • Instruction pipelines • For worst-case analysis, assume that at every possible opportunity the pipeline needs to be flushed. • Direct memory access (DMA) • For worst-case analysis, assume that cycle stealing is occurring at every chance, inflating instruction fetch times. • These features improve average computing performance but destroy determinism, and thus make prediction troublesome.

  43. Review: DMA controller • During a DMA transfer, the ordinary CPU data-transfer process cannot proceed: • The CPU proceeds only with non-bus-related activities. • The CPU cannot service any interrupts until the DMA cycle is over. • Cycle-stealing mode: • No more than a few bus cycles are used at a time for a DMA transfer. • Thus, a single transfer cycle of a large data block is split into several shorter transfer cycles.

  44. Discussion • Traditional worst-case analysis leads to impractically pessimistic outcomes. • Solution: use probabilistic performance models for caches, pipelines, and DMA. • Instead of an absolute guarantee that all required deadlines are definitely met, it is sufficient to have a probabilistic guarantee very close to 100%. • This practical relaxation dramatically reduces the WCET to be considered in schedulability analysis. • But in hard real-time systems, it remains problematic to use these advanced CPU and memory architectures.
