
Ch. 3-3 CPU



Presentation Transcript


  1. Ch. 3-3 CPU • Input and output. • Supervisor mode, exceptions, traps. • Memory management and address translation. • Caches. • CPU performance and power consumption. • Co-processors.

  2. Elements of CPU performance • Cycle time. • CPU pipeline. • Memory system.

  3. Pipelining • Several instructions are executed simultaneously at different stages of completion. • Various conditions can cause pipeline bubbles that reduce utilization: • branches; • memory system delays; • etc.

  4. Pipeline structures • Both ARM and SHARC have 3-stage pipes: • fetch instruction from memory; • decode opcode and operands; • execute.

  5. ARM pipeline execution
       time             1       2       3        4        5
       add r0,r1,#5     fetch   decode  execute
       sub r2,r3,r6             fetch   decode   execute
       cmp r2,#3                        fetch    decode   execute

  6. Performance measures • Latency: time it takes for an instruction to get through the pipeline. • Throughput: number of instructions executed per time period. • Pipelining increases throughput without reducing latency.
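A minimal sketch of the two measures, assuming an ideal pipeline with no stalls in which every stage takes one cycle. The function names are illustrative and do not come from the slides.

```python
def pipeline_cycles(num_instructions: int, num_stages: int) -> int:
    """Total cycles for an ideal pipeline: fill it once, then finish one instruction per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Total cycles if each instruction must finish all stages before the next one starts."""
    return num_stages * num_instructions


def throughput(num_instructions: int, num_stages: int) -> float:
    """Instructions completed per cycle in the ideal pipelined case."""
    return num_instructions / pipeline_cycles(num_instructions, num_stages)


if __name__ == "__main__":
    # Latency of one instruction is still num_stages cycles in both cases;
    # only the total time (and therefore throughput) changes.
    print(pipeline_cycles(3, 3), unpipelined_cycles(3, 3))
    print(throughput(100, 3))
```

With three instructions on a 3-stage pipe the pipelined total is 5 cycles against 9 unpipelined, while each instruction still needs 3 cycles to get through.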

  7. Pipeline stalls • If a step cannot be completed in the same amount of time as the others, the pipeline stalls. • Bubbles introduced by a stall increase latency and reduce throughput.

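  8. Data stalls
       ldmia r0,{r2,r3}    fetch, decode, ex ld r2, ex ld r3
       sub r2,r3,r6        fetch, decode, ex sub
       cmp r2,#3           fetch, decode, ex cmp
     (Pipeline timing figure, stages listed per instruction: ldmia occupies the execute stage for two cycles, one per loaded register.)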
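The effect of a multi-cycle execute stage can be sketched with a simplified in-order model. It assumes fetch and decode take one cycle each and only one instruction can be in execute at a time. The function name and the model itself are illustrative assumptions, not a description of the real ARM pipeline control logic.

```python
def execute_finish_cycles(execute_cycles, front_stages=2):
    """Return the cycle (counting from 1) in which each instruction leaves execute.

    execute_cycles: execute-stage cycles per instruction, in program order.
    front_stages:   stages before execute (fetch and decode), one cycle each.
    """
    finish = []
    prev_finish = 0
    for i, ex in enumerate(execute_cycles):
        earliest_start = i + front_stages + 1       # first execute cycle with no stalls
        start = max(earliest_start, prev_finish + 1)  # wait for the execute stage to free up
        prev_finish = start + ex - 1
        finish.append(prev_finish)
    return finish


if __name__ == "__main__":
    # ldmia needs two execute cycles (one per register), sub and cmp one each.
    print(execute_finish_cycles([2, 1, 1]))
    # Same three instructions if every execute took one cycle.
    print(execute_finish_cycles([1, 1, 1]))
```

In this model the two-cycle ldmia pushes sub and cmp back by one cycle each, which is the bubble the slide title refers to.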

  9. Control stalls • Branches often introduce stalls (branch penalty). • Stall time may depend on whether branch is taken. • May have to squash instructions that already started executing. • Don’t know what to fetch until condition is evaluated.

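  10. Branch stalls
        bne foo             fetch, decode, ex bne, ex bne, ex bne
        sub r2,r3,r6        fetch, decode
        . . .               fetch
        foo add r0,r1,r2    fetch, decode, ex add
      (Pipeline timing figure, stages listed per instruction: the instructions fetched after bne do not reach execute.)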

  11. Delayed branch • Requires that the n instructions after the branch always be executed, whether the branch is taken or not. • SHARC supports delayed and non-delayed branches. • Specified by a bit in the branch instruction. • If non-delayed, stall for two cycles.

  12. Example: SHARC code scheduling
        L1=5;
        DM(I0,M1)=R1;
        L8=8;
        DM(I8,M9)=R2;
      • The CPU cannot use a DAG in the cycle just after loading one of that DAG’s registers. • The CPU performs a NOP between each register assignment and the DM access that follows it (NOP inserted).

  13. Rescheduled SHARC code
        L1=5;
        L8=8;
        DM(I0,M1)=R1;
        DM(I8,M9)=R2;
      • Avoids two NOP cycles.

  14. Example: ARM execution time • Determine execution time of FIR filter: for (i=0; i<N; i++) f = f + c[i]*x[i]; • Only branch in loop test may take more than one cycle. • BLT loop takes 1 cycle in the best case, 3 in the worst case.

  15. Execution time of a for loop
        ; loop initiation code
              MOV r0,#0        ; use r0 for i
              MOV r8,#0        ; use separate index for arrays
              ADR r2,N         ; get address for N
              LDR r1,[r2]      ; get value of N
              MOV r2,#0        ; use r2 for f
              ADR r3,c         ; load r3 with base of c
              ADR r5,x         ; load r5 with base of x
        ; loop body
        loop  LDR r4,[r3,r8]   ; get c[i]
              LDR r6,[r5,r8]   ; get x[i]
              MUL r4,r4,r6     ; compute c[i]*x[i]
              ADD r2,r2,r4     ; add into running sum
        ; update loop counter and array index
              ADD r8,r8,#4     ; add one word offset to array index
              ADD r0,r0,#1     ; add 1 to i
        ; test for exit
              CMP r0,r1        ; exit?
              BLT loop         ; if i < N, continue
      $t_{loop} = t_{init} + N(t_{body} + t_{update}) + (N-1)\,t_{test,worst} + t_{test,best}$
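The loop-time formula can be evaluated directly. The sketch below assumes, as the previous slide does, that every instruction takes one cycle except BLT, which takes 3 cycles when taken and 1 when it falls through. The default counts are read off the listing (7 initiation, 4 body, 2 update, CMP plus BLT for the test). The function name is illustrative.

```python
def fir_loop_cycles(n, t_init=7, t_body=4, t_update=2,
                    t_test_worst=4, t_test_best=2):
    """Cycle count for the FIR loop: the test branch is taken on the first n-1
    iterations (worst case) and falls through on the last one (best case)."""
    if n < 1:
        # The test sits at the bottom of the loop, so the body always runs once.
        raise ValueError("n must be at least 1")
    return t_init + n * (t_body + t_update) + (n - 1) * t_test_worst + t_test_best


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, fir_loop_cycles(n))
```

Under these assumptions N = 4 gives 7 + 24 + 12 + 2 = 45 cycles. Real ARM cores may spend more than one cycle on LDR and MUL, which this model ignores.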

  16. Superscalar execution • Superscalar processor can execute several instructions per cycle. • Uses multiple pipelined data paths. • Programs execute faster, but it is harder to determine how much faster.

  17. Data dependencies • Execution time depends on operands, not just opcode. • Superscalar CPU checks data dependencies dynamically:
        add r2,r0,r1    ; reads r0, r1; writes r2
        add r3,r2,r5    ; reads r2, r5; writes r3 (data dependency on r2)
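A minimal sketch of the kind of check involved, done here in software on instruction text rather than in hardware. It assumes a three-operand `op dest,src1,src2` format and only looks for read-after-write dependencies. The helper names are hypothetical.

```python
def parse(instr):
    """Split 'add r2,r0,r1' into (opcode, destination, [sources])."""
    opcode, operands = instr.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return opcode, regs[0], regs[1:]


def raw_dependencies(program):
    """Return (producer_index, consumer_index, register) for each read-after-write pair."""
    deps = []
    last_writer = {}  # register name -> index of the most recent instruction writing it
    for idx, instr in enumerate(program):
        _, dest, sources = parse(instr)
        for src in sources:
            if src in last_writer:
                deps.append((last_writer[src], idx, src))
        last_writer[dest] = idx
    return deps


if __name__ == "__main__":
    print(raw_dependencies(["add r2,r0,r1", "add r3,r2,r5"]))
```

For the two adds on the slide the sketch reports that instruction 1 depends on instruction 0 through r2, so the pair could not be issued in the same cycle.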

  18. Memory system performance • Caches introduce indeterminacy in execution time. • Depends on order of execution. • Cache miss penalty: added time due to a cache miss. • Several reasons for a miss: compulsory, conflict, capacity.
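The spread that a cache adds to execution time can be sketched with the usual average-access-time model (hit time plus miss rate times miss penalty). The model and the numbers used below are illustrative assumptions, not figures from the slides.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average cycles per access: every access pays the hit time,
    and a fraction miss_rate additionally pays the miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty_cycles


def access_time_bounds(num_accesses, hit_cycles, miss_penalty_cycles):
    """Best case (all hits) and worst case (all misses) for a run of accesses."""
    best = num_accesses * hit_cycles
    worst = num_accesses * (hit_cycles + miss_penalty_cycles)
    return best, worst


if __name__ == "__main__":
    print(average_access_cycles(1, 0.25, 8))
    print(access_time_bounds(100, 1, 8))
```

The gap between the two bounds is the indeterminacy: which end a program lands near depends on the order in which it touches memory.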

  19. CPU power consumption • Most modern CPUs are designed with power consumption in mind to some degree. • Power vs. energy: • Power is energy consumption per unit time • Heat generation depends on power consumption; • Battery life depends on energy consumption. • CMOS (Complementary metal oxide semiconductor)

  20. CMOS power consumption • Voltage drops: power consumption is proportional to V^2. • Toggling: • A CMOS circuit uses most of its power when it is changing its output value. • Reducing a circuit's operating speed reduces power: more activity means more power. • Eliminating unnecessary changes to CMOS inputs also reduces power. • Leakage: • A basic circuit characteristic. • Some charge leaks out of circuit nodes even when the CMOS circuit is not active. • Can be eliminated by disconnecting power.
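The voltage and toggling effects can be put into numbers with the common textbook model for switching power, P = a · C · V² · f. That formula, the parameter names and the sample values are assumptions for illustration. The slide itself only states the V² proportionality and that more activity means more power.

```python
def dynamic_power(activity, capacitance, voltage, frequency):
    """Switching power under the a * C * V^2 * f model (units follow the inputs)."""
    return activity * capacitance * voltage ** 2 * frequency


def relative_power(v_new, v_old):
    """Power ratio after a supply-voltage change, everything else held equal."""
    return (v_new / v_old) ** 2


if __name__ == "__main__":
    # Halving the supply voltage cuts switching power to a quarter.
    print(relative_power(1.0, 2.0))
    # Halving the clock frequency halves switching power.
    full = dynamic_power(0.5, 2.0, 2.0, 10.0)
    half = dynamic_power(0.5, 2.0, 2.0, 5.0)
    print(full, half)
```

Leakage is not covered by this model, since it is drawn even when nothing toggles.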

  21. CPU power-saving strategies • Reduce power supply voltage. • Run at lower clock frequency. • Disable function units with control signals when not in use. • Disconnect parts from power supply when not in use.

  22. Power management styles • Static power management: does not depend on CPU activity. • Example: user-activated power-down mode. • Entered by executing an instruction; the CPU is re-activated on receipt of an interrupt or other event. • Dynamic power management: based on CPU activity. • Example: disabling function units of the CPU.

  23. Application: PowerPC 603 energy features • Provides doze, nap, sleep modes, used by the program and operating system. • Dynamic power management features: • Can shut down unused execution units. • DC, IC, Load-Store, etc. • Cache organized into subarrays to minimize amount of active circuitry. • Two out of eight subarrays are used on any given clock cycle.

  24. PowerPC 603 activity • Percentage of time units are idle for SPEC integer/floating-point:
        unit               Specint92   Specfp92
        D cache            29%         28%
        I cache            29%         17%
        load/store         35%         17%
        fixed-point        38%         76%
        floating-point     99%         30%
        system register    89%         97%

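  25. Power-down costs • Going into a power-down mode costs: • time; • energy. • Must determine if going into mode is worthwhile. • Can model CPU power states with power state machine. (Figure: two-state power state machine with states run, at power Prun, and sleep, at power Psleep, and transition times trs and tsr between them.)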
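Whether a power-down is worthwhile can be sketched as a break-even calculation. The model below assumes the round trip into and out of the low-power mode costs a fixed energy and a fixed time, both supplied by the caller. These parameters and the function names are hypothetical and the sample numbers are made up.

```python
def sleep_energy(idle_time, p_sleep, e_transition, t_transition):
    """Energy used if the idle period is spent going down, sleeping, and coming back."""
    if idle_time < t_transition:
        raise ValueError("idle period is shorter than the round-trip transition time")
    return e_transition + p_sleep * (idle_time - t_transition)


def break_even_time(p_run, p_sleep, e_transition, t_transition):
    """Shortest idle period for which powering down uses no more energy than staying in run."""
    t = (e_transition - p_sleep * t_transition) / (p_run - p_sleep)
    return max(t, t_transition)


def worthwhile(idle_time, p_run, p_sleep, e_transition, t_transition):
    """True if powering down for this idle period saves energy."""
    if idle_time < t_transition:
        return False
    return sleep_energy(idle_time, p_sleep, e_transition, t_transition) < p_run * idle_time


if __name__ == "__main__":
    print(break_even_time(p_run=5.0, p_sleep=1.0, e_transition=14.0, t_transition=2.0))
    print(worthwhile(4.0, 5.0, 1.0, 14.0, 2.0), worthwhile(2.5, 5.0, 1.0, 14.0, 2.0))
```

Idle periods shorter than the break-even time cost more energy in the power-down mode than simply staying in run.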

  26. Application: StrongARM SA-1100 power saving • Processor takes two supplies: • VDD is main 3.3V supply. • VDDX is 1.5V. • Three power modes: • Run: normal operation. • Idle: stops CPU clock, with other logic still powered. • Real-time clock, operating system timer, interrupt control, general-purpose I/O, and power manager are operational. • Returns to run mode upon receiving an interrupt. • Sleep: shuts off most of the chip’s activity. (Figure: SA-1100 block with signals VDD, VDDX, VDD_FAULT, BATT_FAULT, PWR_EN.)

  27. Sleep mode • Entering sleep mode • Shut down on-chip activity; • Reset the CPU; and • Negate the PWR_EN pin to signal that the chip’s power supply should be driven to 0 volts. • Enable a separate power supply for the power manager. • A low-speed clock is sufficient to manage sleep mode. • CPU software sets several registers to prepare for sleep mode. • Wake-up sequence • PWR_EN pin is asserted, followed by a 10 ms wait; • 3.686-MHz oscillator is ramped up to speed; and • Internal reset is negated and CPU boot sequence begins.

  28. SA-1100 power state machine • States: run (Prun = 400 mW), idle (Pidle = 50 mW), sleep (Psleep = 0.16 mW). • Transition-time labels from the figure, in extraction order: 10 ms, 160 ms, 90 ms, 10 ms, 90 ms.
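The state powers above are enough for a rough energy estimate over a schedule of states. The sketch below uses only the three power figures from the slide. It leaves transition costs out, because the extracted figure does not show which edge each time label belongs to. The names `STATE_POWER_MW` and `energy_mj` are illustrative.

```python
# Power drawn in each SA-1100 state, in milliwatts (values from the slide).
STATE_POWER_MW = {"run": 400.0, "idle": 50.0, "sleep": 0.16}


def energy_mj(schedule):
    """Energy in millijoules for a list of (state, seconds) residencies.

    Transition time and energy are ignored in this sketch.
    """
    total = 0.0
    for state, seconds in schedule:
        if seconds < 0:
            raise ValueError("residency time must be non-negative")
        total += STATE_POWER_MW[state] * seconds  # mW * s = mJ
    return total


if __name__ == "__main__":
    always_run = energy_mj([("run", 10.0)])
    mostly_idle = energy_mj([("run", 1.0), ("idle", 9.0)])
    mostly_sleep = energy_mj([("run", 1.0), ("sleep", 9.0)])
    print(always_run, mostly_idle, mostly_sleep)
```

A fuller model would add a time and energy cost to every edge of the state machine before comparing schedules.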
