Revolver: Processor Architecture for Power Efficient Loop Execution - PowerPoint PPT Presentation

Presentation Transcript

  1. Revolver: Processor Architecture for Power Efficient Loop Execution Mitchell Hayenga, Vignyan Reddy and Mikko H. Lipasti • Padmini Gaur (13IS15F) • Sanchi (13IS20F)

  2. Contents • The Need • Approaches and Issues • Revolver: Some basics • Loop Handling • Loop Detection • Detection and Training • Finite State Machine • Loop Execution • Scheduler • Units • Tag Propagation Unit • Loop pre-execution • Conclusion • References

  3. The Need • Slowing per-transistor energy improvements • Increasing computational efficiency • Power-efficient mobile and server processors • Increasing energy constraints • Elimination of unnecessary pipeline activity • Managing energy utilization • Instruction execution itself requires little energy, but control overheads are large

  4. So far: Approaches and Issues • Pipeline-centric instruction caching • Emphasizing temporal instruction locality • Capturing loop instructions in a buffer • Inexpensive retrieval for future iterations • Out-of-order processors: Issues? • Resource allocation • Program ordering • Operand dependency

  5. Instructions serviced by Loop Buffer

  6. Energy Consumption [Power Efficient Loop Execution Techniques: Mitchell Bryan Hayenga]

  7. Revolver: An enhanced approach • Out-of-order back-end • Overall design similar to a conventional processor • Non-loop instructions • Follow the normal pipeline • No Register Alias Table (RAT) at the front-end; instead, a Tag Propagation Unit at the back-end • Loop mode: • Detecting loops and dispatching them to the back-end

  8. The promises • No additional resource allocation across iterations • Front-end energy consumption reduced • Pre-execution of future iterations • Operand dependence linking moved to the back-end

  9. Loop handling • Loop detection • Training feedback • Loop execution • Wakeup logic • Tag Propagation Unit • Load Pre-execution

  10. Loop Detection • Detection stages: • Post-execution • At the decode stage • Enabling loop mode at decode • Calculation of: • Start address • Required resources

  11. Detection and Training • Key mechanisms: • Detection logic at the front-end decides which loops are dispatched • Feedback from the back-end: profitability of loops • Profitability • Disabling future loop mode • Detection control • Loop Detection Finite State Machine

  12. FSM

  13. FSM states • Idle: remains through decode until a valid/profitable loop or a PC-relative backward branch/jump is detected • Profitability logged in the Loop Address Table (LAT) • LAT records: • Composition and profitability • A profitable loop is dispatched • Backward jump/branch with no known loop → Train state

  14. Train state: • Records start address • End address • Allowable unroll factor • Resources required, added to the LAT • Loop ends → Idle state • In the Dispatch state, the decode logic guides the dispatch of loop instructions into the out-of-order back-end.
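The Idle/Train/Dispatch cycle above can be sketched in software. This is a minimal, hypothetical model: the real LAT also records loop composition, resource requirements, and the unroll factor, and detection works on decoded instructions rather than branch tuples as assumed here.

```python
# Minimal sketch of the loop-detection FSM described in the slides.
# The LAT interface (start PC -> profitable?) is a simplification.

IDLE, TRAIN, DISPATCH = "idle", "train", "dispatch"

class LoopDetector:
    def __init__(self):
        self.state = IDLE
        self.lat = {}          # Loop Address Table: start PC -> profitable?

    def on_branch(self, pc, target, taken):
        backward = taken and target < pc        # PC-relative backward branch
        if self.state == IDLE and backward:
            if self.lat.get(target, False):     # known profitable loop
                self.state = DISPATCH           # dispatch into the back-end
            else:                               # unseen loop: learn its extent
                self.state = TRAIN
                self.start, self.end = target, pc
        elif self.state == TRAIN:
            if backward and target == self.start:
                self.lat[self.start] = True     # composition recorded; assumed profitable
                self.state = IDLE
            elif not taken:                     # fell out of the loop mid-training
                self.state = IDLE
        elif self.state == DISPATCH and not taken:
            self.state = IDLE                   # loop exit: resume normal dispatch
```

On the first backward branch the detector trains; once the loop body is recorded and deemed profitable, the next encounter dispatches it in loop mode.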

  15. Disabling loop mode on: • System calls • Memory barriers • Load-linked/store-conditional pairs

  16. Training Feedback • Profitability • 4-bit counter • Default value = 8 • Loop mode enabled if value >= 8 • If a dispatched loop is unrolled more than twice: +2 • Else: -2 • On a misprediction other than the fall-through exit, profitability is set to 0 • Disabled loops: • The front-end increments the counter by 1 for every 2 sequential successful dispatches
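The saturating profitability counter above can be sketched as follows; the class name and method names are illustrative, but the 4-bit width, default of 8, threshold, and +2/-2/reset rules are as stated in the slide.

```python
# Sketch of the 4-bit profitability counter. Clamping to [0, 15]
# reflects the 4-bit width; update rules follow the slide.

class Profitability:
    THRESHOLD = 8

    def __init__(self):
        self.count = 8                       # default value

    def _clamp(self):
        self.count = max(0, min(15, self.count))

    def enabled(self):
        return self.count >= self.THRESHOLD  # loop mode allowed

    def on_loop_commit(self, unroll_factor):
        # Loops unrolled more than twice earn +2, others lose 2.
        self.count += 2 if unroll_factor > 2 else -2
        self._clamp()

    def on_mispredict(self):
        # Any exit other than the fall-through path zeroes profitability.
        self.count = 0
```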

  17. Loop: Basic idea • Unrolling the loop: • Depending on back-end resources • As much as possible • Eliminating additional resource use after dispatch • Loop instructions stay in the issue queue and execute until all iterations complete • Provided resources are maintained across multiple executions • Load-store queues modified to maintain program order

  18. Contd.. • Proper access of destination and source registers • Loop exit: • Removing loop instructions from the back-end • Dispatching the loop's fall-through path

  19. Loop execution: Let’s follow • Green: odd-numbered iterations • Blue: even-numbered iterations • Pointers: • Program order maintenance: loop_start, loop_end • Oldest uncommitted entry: commit

  20. Loop execution, contd.. • Committing: • Start to end • Wrapping to start: next loop iteration • Resetting issue queue entries for the next loop iteration • Load queue entries invalidated • Store queue entries: • Passed to the write buffer • Immediate reuse in the next iteration • Cannot write to the buffer → stall (very rare)
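The commit pointer's walk from loop_start to loop_end, wrapping back for each iteration, can be modeled as a toy generator; the entry indices are illustrative, and real entries hold decoded micro-ops rather than bare numbers.

```python
# Toy model of the commit pointer sweeping the persistent issue-queue
# entries (loop_start .. loop_end) once per iteration, then wrapping.

def commit_order(loop_start, loop_end, iterations):
    """Yield (iteration, entry index) in the order entries commit."""
    for it in range(iterations):
        idx = loop_start
        while True:
            yield it, idx
            if idx == loop_end:
                break                      # wrap to loop_start: next iteration
            idx += 1

# Entries 2..4 commit twice without ever being re-dispatched.
order = list(commit_order(2, 4, iterations=2))
```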

  21. Scheduler: Units • Wake-up array • Identifies ready instructions • Select logic • Arbitrates among ready instructions • Instruction silo • Produces the opcode and physical register identifiers of the selected instruction

  22. Scheduler: The design

  23. Scheduler: The concept • Managed as a queue • Maintains program order among entries • Wakeup array • Utilizes logical register identifiers • Position dependence • Tag Propagation Unit (TPU) • Physical register mapping

  24. Wakeup Logic: Overview • Observes generated results: • Identifying new instructions capable of being executed • Program-based ordering • Broadcast of logical register identifiers • No need for renaming • No physical register identifiers in use

  25. Wakeup: The design

  26. Wake-up array • Rows: instructions • Columns: logical registers • Signals: • Request • Grant • Ready

  27. Wakeup operation • Allocation into the wake-up array • Marking logical source and destination registers • An unscheduled instruction • Deasserts its register column downstream • Preventing younger, dependent instructions from waking up • Request sent when: • All necessary source register broadcasts have been received • All source registers are ready

  28. Select grants the request: • Asserting ready downstream • Waking up younger dependent instructions • Wakeup logic cell: • 2 state bits: sourcing/producing the logical register
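A behavioral sketch of the wakeup/select flow above, assuming one grant per cycle and oldest-first selection. This simplification elides the position-dependence details (downstream column deassertion by younger producers of the same logical register); register names and the function interface are illustrative.

```python
# Behavioral sketch of logical-register wakeup: an instruction requests
# issue once every source register is ready; the grant marks its
# destination ready, waking younger dependents. Oldest-first select,
# one grant per cycle.

def schedule(instrs, initially_ready):
    """instrs: list of (source regs, dest reg) in program order.
    Returns the order in which instructions issue."""
    ready = set(initially_ready)    # logical registers with values available
    issued, order = set(), []
    while len(issued) < len(instrs):
        progress = False
        for i, (srcs, dst) in enumerate(instrs):   # scan oldest-first
            if i in issued:
                continue
            if all(s in ready for s in srcs):      # request: all sources ready
                issued.add(i)                      # select grants the request
                ready.add(dst)                     # wake younger dependents
                order.append(i)
                progress = True
                break                              # one grant per cycle
        if not progress:
            raise RuntimeError("unsatisfiable dependence")
    return order
```

Note that no physical register identifiers appear anywhere: wakeup is driven purely by logical names and program position, as the slides describe.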

  29. The simple logic

  30. An example with dependence

  31. Tag Propagation Unit (TPU) • No renaming! • Maps physical register identifiers to logical registers • Enables reuse of physical registers • Since no additional resources are allocated • Physical register management • Makes speculative execution of the next loop iterations possible

  32. Next loop iteration?? • Impossible if: • An instruction has access to only a single physical destination register • Speculative execution: • An alternative physical register identifier is needed • Solution: 2 physical destination registers • Writes alternate between the two

  33. With 2 destination registers • Double buffering • Maintains previous state during speculative computation • Once iteration N+1 commits, iteration N+2 reuses the destination register of iteration N • No instruction still depends on iteration N's value • Speculative writing into the output register is allowed

  34. With Double buffering • Dynamic linkage between dependent instructions and source registers • Changing the logical register mapping • Overwriting the output register column • Instructions stored in program order: • Downstream instructions obtain the proper source mapping
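The double-buffering idea reduces to alternating between two physical destination registers across iterations, which can be sketched in a few lines; the physical register names are illustrative, and the real TPU also re-links consumers to the correct buffer each iteration.

```python
# Double-buffering sketch: each loop-resident instruction alternates
# between two physical destination registers, so iteration N+1 can
# write speculatively while iteration N's value survives. Iterations
# N and N+2 share a register, which is safe once N+1 has committed.

class DoubleBufferedDest:
    def __init__(self, phys_a, phys_b):
        self.regs = (phys_a, phys_b)

    def dest_for(self, iteration):
        # Even iterations use one register, odd iterations the other.
        return self.regs[iteration % 2]

d = DoubleBufferedDest("p7", "p12")
```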

  35. Source, destination and iteration

  36. Register reclamation • On any branch misprediction: • Flushing downstream instructions • Propagating mappings to all newly scheduled instructions • Simpler than a Register Alias Table (RAT): • Reduced complexity

  37. Queue entries: Lifetime • Allocated prior to dispatch • Retained until the instruction exits the back-end • Reused to execute multiple loop iterations • LSQ entries freed immediately upon commit • Position-based age logic in the LSQ • Load queue entries: • Simply reset for future iterations

  38. Store Queue entries: An extra effort • Need to write back • Drained immediately into a write buffer between the store queue and the L1 cache • If the write cannot complete → stall • Very rare • Commit pointer wraps around

  39. Loop pre-execution • Pre-execution of future loads: • Parallelization • Enables zero-latency loads • No L1 cache access latency • A load executes repeatedly until all iterations complete • Exploits the recurrent nature of loops: • Highly predictable address patterns

  40. Learning from example: String copy • Copies a source array to a destination array • Predictable load addresses • Accesses consecutive bytes from memory • Primary addressing patterns: • Stride • Constant • Pointer-based • Simple pattern-identification hardware placed alongside pre-executed load buffers

  41. Stride-based addressing • Most common • Iterating over a data array • Compute the address Δ between 2 consecutive loads • Third load matches the predicted stride: stride verification • Pre-execution of the next load • Constant: a special case of zero stride • Reading from the same address • Stack-allocated variables / pointer aliasing
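The verification rule above (delta of the first two loads, confirmed by the third) can be sketched as a small predictor; the class and method names are illustrative, and real hardware would track this per static load. A constant address falls out naturally as the zero-stride case.

```python
# Stride-detection sketch for load pre-execution: compute the delta
# between two consecutive load addresses, verify it on the third load,
# then predict the next address. Constant addresses are zero stride.

class StridePredictor:
    def __init__(self):
        self.last = None
        self.stride = None
        self.confirmed = False

    def observe(self, addr):
        """Record a load address; return the predicted next address
        once the stride is verified, else None."""
        if self.last is not None:
            delta = addr - self.last
            if self.stride is None:
                self.stride = delta              # first delta: candidate stride
            elif delta == self.stride:
                self.confirmed = True            # third load matched: verified
            else:
                self.stride, self.confirmed = delta, False
        self.last = addr
        return addr + self.stride if self.confirmed else None
```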

  42. Pointer-based addressing • Value returned by the current load → next address • E.g., linked-list traversals

  43. Pre-execution: more.. • Pre-executed load buffer placed between the load queue and the L1 cache interface • A store clashes with a pre-executed load: • Invalidate the entry • Maintains coherence • Pre-executed loads: • Speculatively wake up dependent operations on the next cycle • On an incorrect address prediction: • The scheduler cancels and re-issues the operation
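The invalidate-on-store rule above can be sketched as a small buffer model; the class name and address-keyed interface are illustrative simplifications (real hardware would match on address ranges and handle cache-line granularity).

```python
# Sketch of a pre-executed load buffer: a hit supplies a zero-latency
# value, while any store to a buffered address invalidates the entry
# so stale data is never consumed.

class PreloadBuffer:
    def __init__(self):
        self.entries = {}                 # address -> pre-fetched value

    def fill(self, addr, value):
        self.entries[addr] = value        # result of a pre-executed load

    def on_store(self, addr):
        self.entries.pop(addr, None)      # clash: invalidate stale entry

    def lookup(self, addr):
        return self.entries.get(addr)     # hit => zero-latency load
```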

  44. Conclusion • Minimizes energy during loop execution • Eliminates front-end overheads from pipeline activity and resource allocation • Benefits exceed those of loop buffers and μop caches • Pre-execution improves performance during loop execution by hiding L1 cache latency • Reported 5.3-18.3% energy-delay benefit

  45. References • Scheduling Reusable Instructions for Power Reduction (J. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, M. Irwin), 2004 • Matrix Scheduler Reloaded (P. G. Sassone, J. Rupley, E. Breckelbaum, G. H. Loh, B. Black) • Instruction Fetch Energy Reduction Using Loop Caches for Embedded Applications with Small Tight Loops (L. H. Lee, B. Moyer, J. Arends) • Power Efficient Loop Execution Techniques (Mitchell Bryan Hayenga)