ELEC 669 Low Power Design Techniques Lecture 1

ELEC 669Low Power Design TechniquesLecture 1 Amirali Baniasadi amirali@ece.uvic.ca

ELEC 669: Low Power Design Techniques Instructor: Amirali Baniasadi EOW 441, Only by appt. Call or email with your schedule. Email: amirali@ece.uvic.ca Office Tel: 721-8613 Web Page for this class will be at http://www.ece.uvic.ca/~amirali/courses/ELEC669/elec669.html Will use paper reprints Lecture notes will be posted on the course web page.

CourseStructure • Lectures: • 1-2 weeks on processor review • 5 weeks on low power techniques • 6 weeks: discussion, presentation, meetings • Reading paper posted on the web for each week. • Need to bring a 1 page review of the papers. • Presentations: Each student should give to presentations in class.

Course Philosophy • Papers to be used as supplement for lectures (If a topic is not covered in the class, or a detail not presented in the class, that means I expect you to read on your own to learn those details) • One Project (50%) • Presentation (30%)- Will be announced in advance. • Final Exam: take home (20%) • IMPORTANT NOTE: Must get passing grade in all components to pass the course. Failing any of the three components will result in failing the course.

Project • More on project later

Topics • High Performance Processors? • Low-Power Design • Low Power Branch Prediction • Low-Power Register Renaming • Low-Power SRAMs • Low-Power Front-End • Low-Power Back-End • Low-Power Issue Logic • Low-Power Commit • AND more…

Fetch Decode Issue Complete Commit A Modern Processor 1-What do each do? 2-Possible Power Optimizations? Front-end Back-end

Power Breakdown Alpha21464 PentiumPro

Instruction Set Architecture (ISA) Fetch Instruction From Memory DecodeInstruction determine its size & action Fetch Operand data Execute instruction & compute results or status Store Result in memory Determine Next Instruction’s address • Instruction Execution Cycle

What Should we Know? • A specific ISA (MIPS) • Performance issues - vocabulary and motivation • Instruction-Level Parallelism • How to Use Pipelining to improve performance • Exploiting Instruction-Level Parallelism w/ Dynamic Approach • Memory: caches and virtual memory

What is Expected From You? • Read papers! • Be up-to-date! • Come back with your input & questions for discussion!

Power? • Everything is done by tiny switches • Their charge represents logic values • Changing charge  energy • Power  energy over time • Devices are non-ideal  power  heat • Excess heat  Circuits breakdown Need to keep power within acceptable limits

POWER in the real world

Power as a Performance Limiter Conventional Performance Scaling: Goal: Max. performance w/ min cost/complexity How: -More and faster xtors. -More complex structures. Power: Don’t fix if it ain’t broken Not True Anymore: Power has increased rapidly Power-Aware Architecture a Necessity

Power-Aware Architecture Conventional Architecture: Goal: Max. performance How: Do as muchas you can. This WorkPower-Aware Architecture Goal: Min. Power and Maintain Performance How: Do as little as you can, while maintaining performance Challenging and new area

Why is this challenging • Identify actions that can be delayed/eliminated • Don’t touch those that boost performance • Cost/Power of doing so must not out-weight benefits

Definitions • Performance is in units of things-per-second • bigger is better • If we are primarily concerned with response time • performance(x) = 1 execution_time(x) " X is n times faster than Y" means Performance(X) n = ---------------------- Performance(Y)

Amdahl's Law Speedup due to enhancement E: ExTime w/o E Performance w/ E Speedup(E) = -------------------- = --------------------- ExTime w/ E Performance w/o E Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected then, ExTime(with E) = ((1-F) + F/S) X ExTime(without E) Speedup(with E) = ExTime(without E) ÷ ((1-F) + F/S) X ExTime(without E) Speedup(with E) =1/ ((1-F) + F/S)

Amdahl's Law-example A new CPU makes Web serving 10 times faster. The old CPU spent 40% of the time on computation and 60% on waiting for I/O. What is the overall enhancement? Fraction enhanced= 0.4 Speedup enhanced = 10 Speedup overall = 1 = 1.56 0.6 +0.4/10

Why Do Benchmarks? • How we evaluate differences • Different systems • Changes to a single system • Provide a target • Benchmarks should represent large class of important programs • Improving benchmark performance should help many programs • For better or worse, benchmarks shape a field • Good ones accelerate progress • good target for development • Bad benchmarks hurt progress • help real programs v. sell machines/papers? • Inventions that help real programs don’t help benchmark

SPEC first round • First round 1989; 10 programs, single number to summarize performance • One program: 99% of time in single line of code • New front-end compiler could improve dramatically

SPEC95 • Eighteen application benchmarks (with inputs) reflecting a technical computing workload • Eight integer • go, m88ksim, gcc, compress, li, ijpeg, perl, vortex • Ten floating-point intensive • tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fppp, wave5 • Must run with standard compiler flags • eliminate special undocumented incantations that may not even generate working code for real programs

CPU time = Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle Summary • Time is the measure of computer performance! • Remember Amdahl’s Law: Improvement is limited by unimproved part of program

Execution Cycle Obtain instruction from program storage Instruction Fetch Determine required actions and instruction size Instruction Decode Locate and obtain operand data Operand Fetch Compute result value or status Execute Deposit results in storage for later use Result Store Determine successor instruction Next Instruction

What Must be Specified? Instruction Fetch ° Instruction Format or Encoding – how is it decoded? ° Location of operands and result – where other than memory? – how many explicit operands? – how are memory operands located? – which can or cannot be in memory? ° Data type and Size ° Operations – what are supported ° Successor instruction – jumps, conditions, branches Instruction Decode Operand Fetch Execute Result Store Next Instruction

What Is an ILP? • Principle: Many instructions in the code do not depend on each other • Result: Possible to execute them in parallel • ILP: Potential overlap among instructions (so they can be evaluated in parallel) • Issues: • Building compilers to analyze the code • Building special/smarter hardware to handle the code • ILP: Increase the amount of parallelism exploited among instructions • Seeks Good Results out of Pipelining

What Is ILP? • CODE A: CODE B: • LD R1, (R2)100 LD R1,(R2)100 • ADD R4, R1 ADD R4,R1 • SUB R5,R1 SUB R5,R4 • CMP R1,R2 SW R5,(R2)100 • ADD R3,R1 LD R1,(R2)100 • Code A: Possible to execute 4 instructions in parallel. • Code B: Can’t execute more than one instruction per cycle. Code A has Higher ILP

In-Order A B C A: LD R1, (R2) B: ADD R3, R4 C: ADD R3, R5 D: CMP R3, R1 D Out of Order Execution Programmer: Instructions execute in-order Processor: Instructions may execute in any order if results remain the same at the end Out-of-Order B: ADD R3, R4 C: ADD R3, R5 A: LD R1, (R2) D: CMP R3, R1

Source instruction • Dependant instruction • Latency (clock cycles) • FP ALU op • Another FP ALU op • 3 • FP ALU op • Store double • 2 • Load double • FP ALU op • 1 • Load double • Store double • 0 Assumptions • Five-stage integer pipeline • Branches have delay of one clock cycle • ID stage: Comparisons done, decisions made and PC loaded • No structural hazards • Functional units are fully pipelined or replicated (as many times as the pipeline depth) • FP Latencies Integer load latency: 1; Integer ALU operation latency: 0

Simple Loop & Assembler Equivalent • for (i=1000; i>0; i--) x[i] = x[i] + s; • Loop: LD F0, 0(R1) ;F0=array element • ADDD F4, F0, F2 ;add scalar in F2 • SD F4 , 0(R1) ;store result • SUBI R1, R1, #8 ;decrement pointer 8bytes (DW) • BNE R1, R2, Loop ;branch R1!=R2 • x[i] & s are double/floating point type • R1 initially address of array element with the highest address • F2 contains the scalar value s • Register R2 is pre-computed so that 8(R2) is the last element to operate on

Unscheduled Loop: LD F0, 0(R1) stall ADDD F4, F0, F2 stall stall SD F4, 0(R1) SUBI R1, R1, #8 stall BNE R1, R2, Loop stall 10 clock cycles Can we minimize? Scheduled Loop: LD F0, 0(R1) SUBI R1, R1, #8 ADDD F4, F0, F2 stall BNE R1, R2, Loop SD F4, 8(R1) 6 clock cycles 3 cycles: actual work; 3 cycles: overhead Can we minimize further? Schedule • Source instruction • Dependant instruction • Latency (clock cycles) • FP ALU op • Another FP ALU op • 3 • FP ALU op • Store double • 2 • Load double • FP ALU op • 1 • Load double • Store double • 0 Where are the stalls?

LD F0, 0(R1) ADDD F4, F0, F2 SD F4 , 0(R1) SUBI R1, R1, #8 BNE R1, R2, Loop LD F0, 0(R1) ADDD F4, F0, F2 SD F4 , 0(R1) SUBI R1, R1, #8 BNE R1, R2, Loop LD F0, 0(R1) ADDD F4, F0, F2 SD F4 , 0(R1) SUBI R1, R1, #8 BNE R1, R2, Loop LD F0, 0(R1) ADDD F4, F0, F2 SD F4 , 0(R1) SUBI R1, R1, #8 BNE R1, R2, Loop Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD F4, 0(R1) LD F6, -8(R1) ADDD F8, F6, F2 SD F8, -8(R1) LD F10, -16(R1) ADDD F12, F10, F2 SD F12, -16(R1) LD F14, -24(R1) ADDD F16, F14, F2 SD F16, -24(R1) SUBI R1, R1, #32 BNE R1, R2, Loop Eliminate Incr, Branch LD F0, 0(R1) ADDD F4, F0, F2 SD F4 , 0(R1) SUBI R1, R1, #8 BNE R1, R2, Loop LD F0, -8(R1) ADDD F4, F0, F2 SD F4 , -8(R1) SUBI R1, R1, #8 BNE R1, R2, Loop LD F0, -16(R1) ADDD F4, F0, F2 SD F4 , -16(R1) SUBI R1, R1, #8 BNE R1, R2, Loop LD F0, -24(R1) ADDD F4, F0, F2 SD F4 , -24(R1) SUBI R1, R1, #32 BNE R1, R2, Loop Loop Unrolling Four copies of loop Four iteration code Assumption: R1 is initially a multiple of 32 or number of loop iterations is a multiple of 4

Loop: LD F0, 0(R1) stall ADDD F4, F0, F2 stall stall SD F4, 0(R1) LD F6, -8(R1) stall ADDD F8, F6, F2 stall stall SD F8, -8(R1) LD F10, -16(R1) stall ADDD F12, F10, F2 stall stall SD F12, -16(R1) LD F14, -24(R1) stall ADDD F16, F14, F2 stall stall SD F16, -24(R1) SUBI R1, R1, #32 stall BNE R1, R2, Loop stall Loop Unroll & Schedule Loop: LD F0, 0(R1) LD F6, -8(R1) LD F10, -16(R1) LD F14, -24(R1) ADDD F4, F0, F2 ADDD F8, F6, F2 ADDD F12, F10, F2 ADDD F16, F14, F2 SD F4, 0(R1) SD F8, -8(R1) SD F12, -16(R1) SUBI R1, R1, #32 BNE R1, R2, Loop SD F16, 8(R1) Schedule No stalls! 14 clock cycles or 3.5 per iteration Can we minimize further? 28 clock cycles or 7 per iteration Can we minimize further?

Summary Iteration 10 cycles Unrolling 7 cycles Scheduling Scheduling 6 cycles 3.5 cycles (No stalls)

Multiple Issue • Multiple Issue is the ability of the processor to start more than one instruction in a given cycle. • Superscalar processors • Very Long Instruction Word (VLIW) processors

Fetch Decode Issue Complete Commit A Modern Processor Multiple Issue Front-end Back-end

1990’s: Superscalar Processors • Bottleneck: CPI >= 1 • Limit on scalar performance (single instruction issue) • Hazards • Superpipelining? Diminishing returns (hazards + overhead) • How can we make the CPI = 0.5? • Multiple instructions in every pipeline stage (super-scalar) • 1 2 3 4 5 6 7 • Inst0 IF ID EX MEM WB • Inst1 IF ID EX MEM WB • Inst2 IF ID EX MEM WB • Inst3 IF ID EX MEM WB • Inst4 IF ID EX MEM WB • Inst5 IF ID EX MEM WB

Elements of Advanced Superscalars • High performance instruction fetching • Good dynamic branch and jump prediction • Multiple instructions per cycle, multiple branches per cycle? • Scheduling and hazard elimination • Dynamic scheduling • Not necessarily: Alpha 21064 & Pentium were statically scheduled • Register renaming to eliminate WAR and WAW • Parallel functional units, paths/buses/multiple register ports • High performance memory systems • Speculative execution

SS + DS + Speculation • Superscalar + Dynamic scheduling + Speculation Three great tastes that taste great together • CPI >= 1? • Overcome with superscalar • Superscalar increases hazards • Overcome with dynamic scheduling • RAW dependences still a problem? • Overcome with a large window • Branches a problem for filling large window? • Overcome with speculation

The Big Picture issue Static program Fetch & branch predict execution & Reorder & commit

Superscalar Microarchitecture Floating point register file Functional units Memory interface Floating point inst. buffer Inst. Cache Pre-decode Inst. buffer Decode rename dispatch Functional units and data cache Integer address inst buffer Integer register file Reorder and commit

Register renaming methods • First Method: • Physical register file vs. logical (architectural) register file. • Mapping table used to associate physical reg w/ current value of log. Reg • use a free list of physical registers • Physical register file bigger than log register file • Second Method: • physical register file same size as logical • Also, use a buffer w/ one entry per inst. Reorder buffer.

Loop: LD F0, 0(R1) stall ADDD F4, F0, F2 stall stall SD F4, 0(R1) LD F6, -8(R1) stall ADDD F8, F6, F2 stall stall SD F8, -8(R1) LD F10, -16(R1) stall ADDD F12, F10, F2 stall stall SD F12, -16(R1) LD F14, -24(R1) stall ADDD F16, F14, F2 stall stall SD F16, -24(R1) SUBI R1, R1, #32 stall BNE R1, R2, Loop stall Register Renaming Example Loop: LD F0, 0(R1) LD F6, -8(R1) LD F10, -16(R1) LD F14, -24(R1) ADDD F4, F0, F2 ADDD F8, F6, F2 ADDD F12, F10, F2 ADDD F16, F14, F2 SD F4, 0(R1) SD F8, -8(R1) SD F12, -16(R1) SUBI R1, R1, #32 BNE R1, R2, Loop SD F16, 8(R1) Schedule No stalls! 14 clock cycles or 3.5 per iteration Can we minimize further? 28 clock cycles or 7 per iteration Can we minimize further?

r0 r1 r2 r3 r4 R8 r0 r1 r2 r3 r4 R8 R7 R7 R5 R5 R1 R2 R9 R9 R2 R6 R13 R6 R13 Register renaming: first method Mapping table Mapping table Add r3,r3,4 Free List Free List

Superscalar Processors • Issues varying number of instructions per clock • Scheduling: Static (by the compiler) or dynamic(by the hardware) • Superscalar has a varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo). • IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000

More Realistic HW: Register Impact • Effect of limiting the number of renaming registers FP: 11 - 45 Integer: 5 - 15 IPC

Reorder Buffer • Place data in entry when execution finished Reserve entry at tail when dispatched Remove from head when complete Bypass to other instructions when needed

R8 R7 R5 rob6 R9 register renaming:reorder buffer Before add r3,r3,4 Add r3, rob6, 4 add rob8,rob6,4 r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 R8 R7 R5 rob8 R9 8 7 6 0 7 6 0 R3 0 R3 …. r3 …..….. Reorder buffer Reorder buffer

Instruction Buffers Floating point register file Functional units Memory interface Floating point inst. buffer Inst. Cache Pre-decode Inst. buffer Decode rename dispatch Functional units and data cache Integer address inst buffer Integer register file Reorder and commit

Issue Buffer Organization • a) Single, shared queue b)Multiple queue; one per inst. type No out-of-order No Renaming No out-of-order inside queues Queues issue out of order

ELEC 669 Low Power Design Techniques Lecture 1

ELEC 669 Low Power Design Techniques Lecture 1

Presentation Transcript

CMOS VLSI Design Lecture 1 3 : Design for Low Power

Low Power CMOS Design

VHDL Design Tips and Low Power Design Techniques

Low Power Wireless Design

ELEC 5270/6270 Spring 2013 Low-Power Design of Electronic Circuits Energy Source Design

Low power Design Strategies

ELEC 669 Low Power Design Techniques Lecture 2

Low-Power Design Techniques in Digital Systems

ELEC 5270/6270 Spring 2013 Low-Power Design of Electronic Circuits

Various Low-Power SoC Design Techniques

Low Power Processor Design

ELEC 5270/6270 Fall 2007 Low-Power Design of Electronic Circuits Low Voltage Low-Power Devices

Circuit Design Techniques for Low Power DSPs

ELEC ENG 4BD4 Lecture 1

ELEC 5270/6270 Fall 2007 Low-Power Design of Electronic Circuits Power Aware Microprocessors

ELEC 5270/6270 Spring 2011 Low-Power Design of Electronic Circuits Test Power

ELEC 669 Low Power Design Techniques Lecture 2

LOW POWER DESIGN METHODS

Low Power Design Techniques

Low-Power CMOS Design