VLIW Architectures, Stefanos Kaxiras { kaxiras@cs.wisc.edu, kaxiras@ee.upatras.gr }

Presentation Transcript
VLIW Principles
  • ILP (Instruction-Level Parallelism)
  • Superscalar, OoO: hardware finds it
  • VLIW: let the Software, COMPILER, find it!
    • No need for DYNAMIC EXECUTION
      • Register renaming out
      • Reservation Stations out
      • Reorder Buffer out
      • Out-of-order issue out
VLIW execution semantics
  • UAL: Unit Assumed Latencies
    • All operation latencies are assumed to be one cycle
    • A new instruction issues only after the previous one completes
      • It always finds its source results ready
  • NUAL: Non-Unit Assumed Latencies
    • Operation latencies may be longer than one cycle
    • A new instruction issues immediately, while earlier operations may still be in progress
    • The compiler must schedule each instruction so that it issues only once the results it needs are ready (no hardware interlocks)!
VLIW execution semantics
  • NUAL: Non-Unit Assumed Latencies
  • Two models:
    • Equals (EQ) Model: Each operation takes exactly its specified latency. The destination register keeps its old value until the operation completes. Example: TI C6x
    • Less-Than-or-Equals (LEQ) Model: Each operation may complete at any time up to its specified latency
VLIW execution semantics
  • Equals (EQ) Model
    • Reduces register pressure: a destination register keeps its old value until the operation completes, so that value can still be used as a source in the meantime.
    • Can't reduce operation latencies and maintain binary compatibility.
  • Less-Than-or-Equals (LEQ) Model
    • Destination register contents become unreliable as soon as the operation issues
    • Can reduce operation latencies and maintain binary compatibility
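The difference between the EQ and LEQ models can be illustrated with a toy register file that applies writes after a delay. This is only a minimal sketch under stated assumptions: the class NualRegisterFile, its method names, and the use of None to stand for an unreliable value are hypothetical and do not describe any real machine.

```python
class NualRegisterFile:
    """Toy register file with delayed writes (hypothetical model)."""

    def __init__(self, model):
        assert model in ("EQ", "LEQ")
        self.model = model
        self.cycle = 0
        self.regs = {}      # register name -> currently visible value
        self.pending = []   # (ready_cycle, register, value)

    def issue(self, dest, value, latency):
        """Issue an operation whose result is due `latency` cycles from now."""
        self.pending.append((self.cycle + latency, dest, value))
        if self.model == "LEQ":
            # The result may land at any time up to `latency`, so the old
            # contents can no longer be relied upon; None marks that here.
            self.regs[dest] = None

    def tick(self):
        """Advance one cycle and commit results whose latency has elapsed."""
        self.cycle += 1
        still_pending = []
        for ready, dest, value in self.pending:
            if ready <= self.cycle:
                self.regs[dest] = value
            else:
                still_pending.append((ready, dest, value))
        self.pending = still_pending

    def read(self, reg):
        return self.regs.get(reg)


def value_during_and_after(model):
    """Read r1 while a 3-cycle operation targeting r1 is in flight, then after."""
    rf = NualRegisterFile(model)
    rf.regs["r1"] = 10
    rf.issue("r1", 99, latency=3)
    rf.tick()                 # one cycle later, the operation is in flight
    during = rf.read("r1")
    rf.tick()
    rf.tick()                 # the specified latency has now elapsed
    return during, rf.read("r1")
```

Under EQ the old value of r1 is still usable while the operation is in flight, which is where the register-pressure benefit comes from; under LEQ the compiler must treat r1 as dead from the moment of issue.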
VLIW Problems
  • The compiler is tied to the implementation
  • Scheduler must know operation latencies
  • Cannot run binaries on another implementation
  • Dynamically scheduled VLIW
    • Decouples operation latencies from the compiler
Dynamically Scheduled VLIW
  • Compatibility problem: compiler must know latencies
  • Schedule with assumed latencies
  • Delay buffer inserted between FUs and register file, holds register updates and presents to the code the “assumed” latencies not the real latencies (similar to LEQ)
  • Scoreboard dynamically schedules VLIW instructions according to dependencies
  • VERY SIMILAR to OoO but simpler
Role of COMPILER in VLIW
  • Find parallelism -- schedule independent instructions
    • Find independent operations to create VLIW
    • Many available registers to reduce false data dependencies
  • INCREASE ILP (create parallelism)
    • Loop unrolling
    • Software Pipelining
    • Trace scheduling
    • Predication
Loop Unrolling
  • Basic Idea: Unroll loops to get loop with fewer but longer iterations
  • Pros:
    • Creates parallelism -- instructions from different original iterations can be issued in parallel
    • Latency Tolerance -- can issue instructions from one iteration while waiting for instructions from another to complete
    • Reduces overhead -- fewer iterations means fewer compares and branches
Loop Unrolling
  • Cons:
    • Register pressure -- combining multiple iterations means more live values, with the potential for register spills.
  • REQUIRES MANY ARCHITECTURAL REGISTERS
    • Intel's EPIC (Itanium) architecture has 128 general-purpose registers!
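The trade-off described on the loop unrolling slides can be sketched with a dot product unrolled by a factor of 4. This is an illustrative example only: the function names and the choice of four accumulators are assumptions, and Python is used for readability rather than to show real VLIW code.

```python
def dot(a, b):
    """Original loop: one multiply-add and one loop test per element."""
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_unrolled4(a, b):
    """The same loop unrolled by 4, using four independent accumulators."""
    n = len(a)
    s0 = s1 = s2 = s3 = 0
    i = 0
    while i + 4 <= n:
        # These four multiply-adds do not depend on one another, so a
        # VLIW compiler could place them in the same long instruction.
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
        i += 4          # one compare-and-branch per four elements
    total = s0 + s1 + s2 + s3
    while i < n:        # clean-up loop for the leftover elements
        total += a[i] * b[i]
        i += 1
    return total
```

The four accumulators s0 to s3 are live at the same time, which is the register-pressure cost: the unrolled body needs roughly four times as many registers as the original one.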
Software pipelining
  • Idea: Transform a loop that performs one iteration at a time into a loop that performs pipelined steps of different iterations.
    • Scheduling: Increase time between dependent instructions
  • Combines well with loop unrolling
Software Pipelining
  • Modulo Scheduling
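The idea of software pipelining can be sketched by hand on a three-stage loop (load, multiply, store). This is a minimal illustration, not the output of a modulo scheduler: the function names and the three-stage split are assumptions chosen for the example.

```python
def scale_sequential(src, k):
    """One iteration at a time: load, multiply, store."""
    out = [None] * len(src)
    for i in range(len(src)):
        x = src[i]      # stage 1: load
        y = x * k       # stage 2: multiply
        out[i] = y      # stage 3: store
    return out


def scale_pipelined(src, k):
    """Hand-pipelined version: each kernel pass works on three iterations."""
    n = len(src)
    if n < 2:
        return scale_sequential(src, k)
    out = [None] * n
    # Prologue: fill the pipeline.
    x = src[0]          # load(0)
    y = x * k           # multiply(0)
    x = src[1]          # load(1)
    # Kernel: store(i), multiply(i + 1), load(i + 2).
    # None of the three reads a value produced in the same pass.
    for i in range(n - 2):
        out[i] = y
        y = x * k
        x = src[i + 2]
    # Epilogue: drain the pipeline.
    out[n - 2] = y
    y = x * k
    out[n - 1] = y
    return out
```

Inside the kernel the three statements belong to three different original iterations, so the distance between a load and the store that depends on it grows from zero to two kernel passes. The price is the separate prologue and epilogue code.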
Comparison to Superscalar
  • Loop Unrolling + Software pipelining = Register Renaming + Multiple branch prediction (loop branch) + Dynamic Scheduling
COMPILER: Reduce CONTROL dependencies
  • Roughly 1 in 5 instructions is a branch
  • 5-op VLIW => each VLIW instruction contains a branch!
      • Unacceptable ...
  • INCREASE STRAIGHT LINE CODE
    • code without branches
  • 2 Techniques in addition to loop unrolling:
    • TRACE SCHEDULING
    • PREDICATION
TRACE SCHEDULING
  • Parallelism across IF branches vs. LOOP branches
  • Compiler Support - Two steps:
  • Trace Selection
    • Find a likely sequence of basic blocks (a trace) that forms a long, statically predicted stretch of straight-line code
  • Trace Compaction
    • Squeeze trace into few VLIW instructions
    • Need bookkeeping code in case prediction is wrong
Trace Scheduling
  • Similar to branch prediction in SuperScalar OoO
  • When things go wrong: execute fix-up code (undo wrong path). Compiler inserts all necessary code.
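The trace selection step can be sketched as a greedy walk over a control-flow graph annotated with branch probabilities. This is a simplified sketch: the graph representation, the function select_trace, and the example probabilities are all hypothetical, and production trace schedulers use more elaborate heuristics.

```python
def select_trace(cfg, entry):
    """Greedily follow the most probable successor starting at `entry`.

    `cfg` maps a basic-block name to a dict {successor: probability}.
    The walk stops at a block with no successors or when the most likely
    successor is already in the trace (for example a loop back edge).
    """
    trace = [entry]
    seen = {entry}
    block = entry
    while cfg.get(block):
        successors = cfg[block]
        likely = max(successors, key=successors.get)
        if likely in seen:
            break
        trace.append(likely)
        seen.add(likely)
        block = likely
    return trace


# Hypothetical graph: A branches to B (likely) or C, both join at D,
# and D usually loops back to A.
EXAMPLE_CFG = {
    "A": {"B": 0.9, "C": 0.1},
    "B": {"D": 1.0},
    "C": {"D": 1.0},
    "D": {"A": 0.95, "E": 0.05},
    "E": {},
}
```

For this graph the selected trace starting at A is A, B, D. Block C lies off the trace, so any operation moved across the A-to-B edge during compaction needs bookkeeping code on the A-to-C path.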
PREDICATION
  • Avoid branch prediction by turning branches into conditionally executed instructions:
  • if (x) then A = B op C else NOP
    • If false, then neither store result nor cause exception
    • Extended ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move; PA-RISC can annul any following instruction.
  • Drawbacks to conditional instructions
    • Complex conditions reduce effectiveness;
    • Cannot predicate very large blocks
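The behaviour of a predicated operation, and the conversion of an if/else into straight-line code, can be sketched as follows. This is a toy model: the helper predicated_op and the example abs_diff_predicated are hypothetical names, and catching ArithmeticError merely stands in for hardware suppressing an exception.

```python
import operator


def predicated_op(pred, op, *srcs, old):
    """Run `op`, but commit its result only when `pred` is true.

    When `pred` is false the old destination value is kept and an
    arithmetic exception raised by `op` is suppressed.
    """
    try:
        result = op(*srcs)
    except ArithmeticError:
        if pred:
            raise
        return old
    return result if pred else old


def abs_diff_branchy(a, b):
    if a > b:
        return a - b
    return b - a


def abs_diff_predicated(a, b):
    """If-converted version: no branch, both sides are issued."""
    p = a > b                                             # compare sets p
    r = 0
    r = predicated_op(p, operator.sub, a, b, old=r)       # (p)  r = a - b
    r = predicated_op(not p, operator.sub, b, a, old=r)   # (!p) r = b - a
    return r
```

Both subtractions occupy issue slots whichever way the condition goes, which is one reason very large blocks are poor candidates for predication.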
Predication
  • [Figure: Branch Prediction vs. Predication]
Intel/HP EPIC
  • Intel/HP IA-64 “Explicitly Parallel Instruction Computing (EPIC)”
  • IA-64 is the instruction set architecture; EPIC is the architecture style
  • EPIC = 2nd generation VLIW?
  • Itanium™ is name of first implementation (2001)
Intel EPIC VLIW Instructions
  • IA-64 instructions are encoded in bundles, which are 128 bits wide.
    • Each bundle consists of a 5-bit template field and 3 instructions, each 41 bits in length
  • 3 instructions per 128-bit bundle; the template field determines whether the instructions are dependent or independent
    • Smaller code size than old VLIW, larger than x86/RISC
    • Bundles can be linked to show independence across more than 3 instructions
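The bundle layout (5 + 3 × 41 = 128 bits) can be sketched with plain integer bit operations. The field widths come from the slide; placing the template in the low-order bits with the three slots above it is an assumption of this sketch, and the function names are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into a 128-bit integer."""
    assert 0 <= template < (1 << TEMPLATE_BITS)
    assert len(slots) == 3
    bundle = template
    for n, slot in enumerate(slots):
        assert 0 <= slot <= SLOT_MASK
        bundle |= slot << (TEMPLATE_BITS + n * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    assert 0 <= bundle < (1 << 128)
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + n * SLOT_BITS)) & SLOT_MASK
        for n in range(3)
    ]
    return template, slots
```

The three 41-bit slots leave only 5 bits for the template, so independence information is encoded once per bundle rather than once per instruction.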
Intel IA-64 VLIW Instruction groups
  • Instruction group: a sequence of consecutive instructions with no register data dependences
    • All the instructions in a group could be executed in parallel, if sufficient hardware resources existed and if any dependencies through memory were preserved
    • An instruction group can be arbitrarily long, but the compiler must explicitly indicate the boundary between one instruction group and another by placing a stop between 2 instructions that belong to different groups
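How a compiler might place stops can be sketched with a single pass over a list of instructions. This is a simplified sketch: the (dest, sources) tuple format and the function insert_stops are hypothetical, only read-after-write and write-after-write register dependences are checked, and memory dependences are ignored.

```python
def insert_stops(instrs):
    """Split instructions into instruction groups.

    `instrs` is a list of (dest, sources) pairs, where `dest` is a register
    name or None and `sources` is a tuple of register names. A stop is
    placed before any instruction that reads or writes a register already
    written in the current group. Returns a list of groups, each a list of
    instruction indices.
    """
    groups = []
    current = []
    written = set()
    for idx, (dest, srcs) in enumerate(instrs):
        conflict = dest in written or any(s in written for s in srcs)
        if conflict and current:
            groups.append(current)      # stop: close the current group
            current, written = [], set()
        current.append(idx)
        if dest is not None:
            written.add(dest)
    if current:
        groups.append(current)
    return groups


# Hypothetical five-instruction sequence.
EXAMPLE = [
    ("r1", ("r2", "r3")),   # 0
    ("r4", ("r5", "r6")),   # 1: independent of 0
    ("r7", ("r1", "r4")),   # 2: reads r1 and r4, so a stop is needed
    ("r8", ("r7",)),        # 3: reads r7, so another stop is needed
    ("r9", ("r2",)),        # 4: independent of 3
]
```

Here the groups are [0, 1], [2] and [3, 4]. A group is not tied to the 3-instruction bundle size, so it may end in the middle of a bundle or span several bundles.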
Itanium (or Itanic as in Titanic)
  • Highly parallel and deeply pipelined hardware at 800 MHz (2000)
  • 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process
  • Hardware checks dependencies (interlocks => binary compatibility over time)
  • DYNAMICALLY SCHEDULED VLIW
  • Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?
Itanium
  • IA-64 Registers
  • The integer registers are configured to help accelerate procedure calls using a register stack
  • 8 64-bit Branch registers used to hold branch destination addresses for indirect branches
  • 64 1-bit predicate registers
Itanium
  • Both the integer and floating point registers support register rotation for registers 32-127.
  • Register rotation is designed to ease the task of allocating registers in software pipelined loops
  • When combined with predication, possible to avoid the need for unrolling and for separate prologue and epilogue code for a software pipelined loop
  • Makes the SW-pipelining usable for loops with smaller numbers of iterations
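The interplay of register rotation and predication can be sketched with a small model in which a value written to logical register i becomes visible as register i + 1 after each rotation. This is a heavily simplified sketch, not the IA-64 mechanism itself: the class RotatingRegisters, the 4-register window, and the plain if conditions standing in for stage predicates are all assumptions.

```python
class RotatingRegisters:
    """Toy rotating register window (hypothetical model)."""

    def __init__(self, size):
        self.size = size
        self.phys = [None] * size
        self.base = 0               # rotating register base

    def _index(self, logical):
        return (logical + self.base) % self.size

    def write(self, logical, value):
        self.phys[self._index(logical)] = value

    def read(self, logical):
        return self.phys[self._index(logical)]

    def rotate(self):
        """After this call, what was register i is visible as register i + 1."""
        self.base = (self.base - 1) % self.size


def scale_with_rotation(src, k):
    """Three-stage pipelined loop with no separate prologue or epilogue."""
    n = len(src)
    out = [None] * n
    regs = RotatingRegisters(size=4)
    for t in range(n + 2):          # n passes plus 2 to drain the pipeline
        if t < n:                   # stage 1 predicate: load element t
            regs.write(0, src[t])
        if 1 <= t <= n:             # stage 2 predicate: multiply element t - 1
            regs.write(2, regs.read(1) * k)
        if t >= 2:                  # stage 3 predicate: store element t - 2
            out[t - 2] = regs.read(3)
        regs.rotate()
    return out
```

The loop body is the same in every pass: the conditions switch stages on while the pipeline fills and off while it drains, and rotation hands each value to the next stage without explicit copies or unrolling.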