
VLIW Architectures. Stefanos Kaxiras { kaxiras@cs.wisc.edu, kaxiras@ee.upatras.gr }




VLIW Principles

  • ILP (Instruction-Level Parallelism)

  • Superscalar, OoO: hardware finds it

  • VLIW: let the Software, COMPILER, find it!

    • No need for DYNAMIC EXECUTION

      • Register renaming out

      • Reservation Stations out

      • Reorder Buffer out

      • Out-of-order issue out








VLIW execution semantics

  • UAL: Unit Assumed Latencies

    • All operations are assumed to have the same (unit) latency

    • A new instruction issues after the previous one completes

      • It always finds earlier results ready

  • NUAL: Non-Unit Assumed Latencies

    • Operation latencies are non-unit

    • A new instruction issues immediately, while earlier operations may still be in progress

    • The compiler must schedule each instruction for a cycle when its operands are ready (there are no interlocks)!


VLIW execution semantics

  • NUAL: Non-Unit Assumed Latencies

  • Two models:

    • Equals (EQ) Model: Each operation takes exactly its specified latency. Register values don’t change until operation completes. Example: TI C6x

    • Less-Than-or-Equals (LEQ): Operations may take up to their specified latency


VLIW execution semantics

  • Equals (EQ) Model

    • Reduces register pressure because source operands stay around longer.

    • A later implementation cannot reduce operation latencies and stay compatible with already-scheduled code.

  • Less-Than-or-Equals (LEQ):

    • Destination register contents become unreliable as soon as the operation issues

    • Can reduce operation latencies and maintain source code compatibility
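The difference between the two NUAL models can be illustrated with a small sketch. The function name and its parameters are hypothetical, chosen only for illustration; it models what a read of the destination register may observe while the operation is still within its specified latency.

```python
def read_dest_register(model, old_value, new_value, latency, cycles_since_issue):
    """Return the value a read of the destination register would observe.

    None means the architecture guarantees nothing about the value.
    """
    if model not in ("EQ", "LEQ"):
        raise ValueError(f"unknown model: {model}")
    if cycles_since_issue >= latency:
        return new_value   # both models: the result is visible by now
    if model == "EQ":
        return old_value   # EQ: the old value survives until completion
    return None            # LEQ: the write may land at any earlier cycle


# Reading one cycle after issuing a 3-cycle operation:
print(read_dest_register("EQ", 7, 42, 3, 1))    # old value is still usable
print(read_dest_register("LEQ", 7, 42, 3, 1))   # no guarantee
```

Under EQ the compiler can keep using the old value during the latency window, which is where the reduced register pressure comes from. Under LEQ it cannot, but a faster implementation remains legal.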


VLIW Problems

  • The compiler is tied to the implementation

  • Scheduler must know operation latencies

  • Cannot run binaries on another implementation

  • Dynamically scheduled VLIW

    • Decouples operation latencies from the compiler


Dynamically Scheduled VLIW

  • Compatibility problem: compiler must know latencies

  • Schedule with assumed latencies

  • A delay buffer inserted between the FUs and the register file holds register updates and presents the “assumed” latencies to the code, not the real latencies (similar to LEQ)

  • Scoreboard dynamically schedules VLIW instructions according to dependencies

  • VERY SIMILAR to OoO but simpler


Role of COMPILER in VLIW

  • Find parallelism -- schedule independent instructions

    • Find independent operations to create VLIW

    • Many available registers to reduce false data dependencies

  • INCREASE ILP (create parallelism)

    • Loop unrolling

    • Software Pipelining

    • Trace scheduling

    • Predication
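As a rough illustration of "find independent operations to create VLIW", the sketch below greedily packs operations into fixed-width instruction words. It is a simplification under stated assumptions: unit latencies (UAL), every register written at most once (so there are no false dependences), and no functional-unit constraints. The name `pack_vliw` and the `(dest, sources)` encoding are hypothetical.

```python
def pack_vliw(ops, width):
    """Pack ops, given in program order as (dest, sources), into VLIW words.

    Returns a list of words; each word is a list of indices into `ops`.
    Assumes unit latency and that each register is written at most once.
    """
    produced_in = {}   # register name -> index of the word that writes it
    words = []
    for idx, (dest, sources) in enumerate(ops):
        # An op may issue no earlier than the word after its latest producer.
        earliest = max(
            (produced_in[s] + 1 for s in sources if s in produced_in),
            default=0,
        )
        w = earliest
        # Skip words that are already full.
        while w < len(words) and len(words[w]) >= width:
            w += 1
        while len(words) <= w:
            words.append([])
        words[w].append(idx)
        produced_in[dest] = w
    return words


ops = [
    ("r1", ["a", "b"]),
    ("r2", ["c", "d"]),
    ("r3", ["r1", "r2"]),   # depends on the first two ops
    ("r4", ["e"]),
]
print(pack_vliw(ops, width=2))
```

Having many registers matters here: if `r3` had been forced to reuse `r1`, a false dependence would constrain where it could be placed.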


Loop Unrolling

  • Basic Idea: Unroll loops to get loop with fewer but longer iterations

  • Pros:

    • Creates parallelism -- instructions from different original iterations can be issued in parallel

    • Latency Tolerance -- can issue instructions from one iteration while waiting for instructions from another to complete

    • Reduces overhead -- fewer iterations means fewer compares and branches


Loop Unrolling

  • Cons:

    • Register pressure -- combining multiple iterations means more live values, potential for register overflow.

  • REQUIRES MANY ARCHITECTURAL REGISTERS

    • INTEL’s EPIC (ITANIUM) Arch has 128 registers!!!
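A minimal sketch of the transformation, written in Python purely for readability (a real VLIW compiler applies it to machine-level code). The function names and the unroll factor of 4 are illustrative assumptions.

```python
def saxpy_rolled(a, x, y):
    """One element per iteration: one compare-and-branch per element."""
    out = list(y)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out


def saxpy_unrolled4(a, x, y):
    """Four elements per iteration: one compare-and-branch per four elements."""
    n = len(x)
    out = list(y)
    i = 0
    while i + 4 <= n:
        # These four statements are mutually independent, so a VLIW
        # scheduler could issue them in parallel. Note the extra live values.
        out[i] = a * x[i] + y[i]
        out[i + 1] = a * x[i + 1] + y[i + 1]
        out[i + 2] = a * x[i + 2] + y[i + 2]
        out[i + 3] = a * x[i + 3] + y[i + 3]
        i += 4
    while i < n:
        # Remainder loop for trip counts that are not a multiple of 4.
        out[i] = a * x[i] + y[i]
        i += 1
    return out
```

The unrolled body keeps four products and four sums in flight at once, which is the register pressure noted above.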







Software pipelining

  • Idea: Transform loop which performs one iteration at a time into loop which performs pipelined steps of different iterations.

    • Scheduling: Increase time between dependent instructions

  • Combines well with loop unrolling


Software Pipelining

  • Modulo Scheduling
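The idea can be sketched on a loop whose body has three dependent steps: load, multiply, store. In the pipelined form, each kernel iteration performs the store of iteration i, the multiply of iteration i+1, and the load of iteration i+2. This is a hand-written illustration under assumed names, not the output of a modulo scheduler.

```python
def scale_rolled(src, k):
    """Reference loop: load, multiply, store, one iteration at a time."""
    dst = [None] * len(src)
    for i in range(len(src)):
        loaded = src[i]         # L(i)
        product = loaded * k    # M(i)
        dst[i] = product        # S(i)
    return dst


def scale_pipelined(src, k):
    """Same computation with steps of different iterations overlapped."""
    n = len(src)
    if n < 2:
        return scale_rolled(src, k)   # too short to pipeline
    dst = [None] * n
    # Prologue: fill the pipeline.
    loaded = src[0]             # L(0)
    product = loaded * k        # M(0)
    loaded = src[1]             # L(1)
    # Kernel: the three steps belong to three different iterations and do
    # not depend on each other, so they could share one VLIW instruction.
    for i in range(n - 2):
        dst[i] = product        # S(i)
        product = loaded * k    # M(i+1)
        loaded = src[i + 2]     # L(i+2)
    # Epilogue: drain the pipeline.
    dst[n - 2] = product        # S(n-2)
    product = loaded * k        # M(n-1)
    dst[n - 1] = product        # S(n-1)
    return dst
```

The distance between a load and the multiply that consumes it is now a full kernel iteration, which is the "increase time between dependent instructions" point above.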



Comparison to Superscalar

  • Loop Unrolling + Software pipelining = Register Renaming + Multiple branch prediction (loop branch) + Dynamic Scheduling


COMPILER: Reduce CONTROL dependencies

  • 1 in 5 instructions is a branch

  • 5-op VLIW ⇒ each VLIW instruction contains a branch!

    • Unacceptable ...

  • INCREASE STRAIGHT LINE CODE

    • code without branches

  • 2 Techniques in addition to loop unrolling:

    • TRACE SCHEDULING

    • PREDICATION


TRACE SCHEDULING

  • Parallelism across IF branches vs. LOOP branches

  • Compiler support, in two steps:

    • Trace Selection

      • Find a likely sequence of basic blocks (a trace) that forms a long, statically predicted run of straight-line code

    • Trace Compaction

      • Squeeze the trace into a few VLIW instructions

      • Bookkeeping code is needed in case the prediction is wrong


Trace Scheduling

  • Similar to branch prediction in superscalar OoO

  • When things go wrong: execute fix-up code (undo the wrong path). The compiler inserts all necessary code.
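Trace selection can be sketched as a greedy walk over the control-flow graph that always follows the most probable successor. This is one simple heuristic chosen for illustration, not necessarily the one the slides have in mind; the CFG encoding (a dict of successor probabilities) and the function name are assumptions.

```python
def select_trace(cfg, entry):
    """Follow the most likely successor from `entry` until the path ends
    or would revisit a block. `cfg` maps block -> {successor: probability}.
    """
    trace = []
    seen = set()
    block = entry
    while block is not None and block not in seen:
        trace.append(block)
        seen.add(block)
        successors = cfg.get(block, {})
        # Static prediction: pick the successor with the highest probability.
        block = max(successors, key=successors.get) if successors else None
    return trace


# Hypothetical if/else diamond where the "then" side is taken 90% of the time.
cfg = {
    "A": {"B": 0.9, "C": 0.1},
    "B": {"D": 1.0},
    "C": {"D": 1.0},
    "D": {},
}
print(select_trace(cfg, "A"))
```

Block `C` is left off the trace. Compaction then schedules `A`, `B`, `D` as one straight-line region, and the compiler adds fix-up code on the path through `C`.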


PREDICATION

  • Avoid branch prediction by turning branches into conditionally executed instructions:

  • if (x) then A = B op C else NOP

    • If the condition is false, the instruction neither stores its result nor causes an exception

    • The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move; PA-RISC can annul any following instruction.

  • Drawbacks of conditional instructions

    • Complex conditions reduce effectiveness

    • Cannot predicate very large blocks
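The semantics of a predicated instruction can be modelled in a few lines. The register-file dict and the function name are hypothetical; the point is that a false predicate turns the operation into a NOP, so it neither writes its destination nor raises an exception.

```python
import operator


def exec_predicated(regs, pred, dest, op, src1, src2):
    """Execute `dest = op(src1, src2)` only if `pred` is true."""
    if not pred:
        return   # acts as a NOP: no write-back, no exception
    regs[dest] = op(regs[src1], regs[src2])


regs = {"A": 1, "B": 8, "C": 0}

# Predicate false: the division by zero is never performed, A is untouched.
exec_predicated(regs, False, "A", operator.floordiv, "B", "C")
print(regs["A"])

# Predicate true: the operation executes and writes its result.
exec_predicated(regs, True, "A", operator.add, "B", "C")
print(regs["A"])
```

Both calls occupy an issue slot whether or not the predicate holds, which is one reason very large blocks are poor candidates for predication.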


Predication

[Figure: branch prediction compared with predication]

Intel/HP EPIC

  • Intel/HP IA-64 “Explicitly Parallel Instruction Computing (EPIC)”

  • IA-64 is the instruction set architecture; EPIC is the architectural style

  • EPIC = 2nd generation VLIW?

  • Itanium™ is the name of the first implementation (2001)


Intel EPIC VLIW Instructions

  • IA-64 instructions are encoded in bundles, which are 128 bits wide.

    • Each bundle consists of a 5-bit template field and 3 instructions, each 41 bits in length

  • 3 instructions per 128-bit bundle; the template field determines whether the instructions are dependent or independent

    • Smaller code size than old VLIW, larger than x86/RISC

    • Bundles can be linked to express independence across more than 3 instructions
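The arithmetic of the bundle format (5 + 3 × 41 = 128 bits) can be checked with a short packing sketch. Placing the template in the low-order bits, followed by slots 0 to 2, reflects my understanding of the IA-64 layout and should be treated as an assumption here. The function names are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instructions into 128 bits."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly 3 instructions")
    word = template
    for i, instr in enumerate(slots):
        if not 0 <= instr < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        word |= instr << (TEMPLATE_BITS + i * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Inverse of pack_bundle: return (template, [slot0, slot1, slot2])."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [
        (word >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


bundle = pack_bundle(0b10000, [1, 2, 3])
print(bundle.bit_length() <= 128, unpack_bundle(bundle))
```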





Intel IA-64 VLIW Instruction groups

  • Instruction group: a sequence of consecutive instructions with no register data dependences

    • All the instructions in a group could be executed in parallel, if sufficient hardware resources existed and if any dependencies through memory were preserved

    • An instruction group can be arbitrarily long, but the compiler must explicitly indicate the boundary between one instruction group and another by placing a stop between 2 instructions that belong to different groups
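A sketch of how a compiler might place stops: start a new group whenever an instruction reads or writes a register already written in the current group. This deliberately simplifies the real IA-64 dependence rules (it ignores memory dependences and any architectural exceptions), and the `(dest, sources)` encoding and function name are assumptions.

```python
def split_into_groups(instrs):
    """Split (dest, sources) instructions into groups with no register
    read-after-write or write-after-write dependence inside a group.
    Each boundary between returned groups corresponds to a stop.
    """
    groups = []
    current = []
    written = set()   # registers written so far in the current group
    for dest, sources in instrs:
        depends = dest in written or any(s in written for s in sources)
        if depends:
            groups.append(current)   # place a stop before this instruction
            current = []
            written = set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        groups.append(current)
    return groups


program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r5", "r6")),
    ("r7", ("r1", "r4")),   # reads r1 and r4: needs a stop before it
    ("r8", ("r9",)),
]
print([len(g) for g in split_into_groups(program)])
```

Group length is unrelated to bundle size here: a group may end in the middle of a bundle or span several bundles.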



Itanium (or Itanic as in Titanic)

  • Highly parallel and deeply pipelined hardware at 800 MHz (2000)

  • 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process

  • Hardware checks dependencies (interlocks => binary compatibility over time)

  • DYNAMICALLY SCHEDULED VLIW

  • Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?


Itanium

  • IA-64 Registers

    • The integer registers are configured to help accelerate procedure calls using a register stack

    • 8 64-bit branch registers hold branch destination addresses for indirect branches

    • 64 1-bit predicate registers



Itanium

  • Both the integer and floating-point registers support register rotation for registers 32-127.

  • Register rotation is designed to ease the task of allocating registers in software-pipelined loops

  • When combined with predication, it is possible to avoid the need for unrolling and for separate prologue and epilogue code for a software-pipelined loop

  • Makes software pipelining usable for loops with smaller numbers of iterations
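Register rotation can be modelled as a renaming base that shifts by one on every loop iteration, so a value written to register r in one iteration is read as r+1 in the next. The class below is a simplified sketch: the fixed rotating region (registers 32 to 127) and all names are assumptions, and it ignores how the hardware sizes the rotating region or ties rotation to loop branches.

```python
class RotatingRegisterFile:
    """Toy model of a rotating register region starting at register `base`."""

    def __init__(self, size=96, base=32):
        self.size = size
        self.base = base
        self.rrb = 0                 # rotating register base (rename offset)
        self.phys = [0] * size       # physical storage

    def _index(self, reg):
        if not self.base <= reg < self.base + self.size:
            raise ValueError("register is outside the rotating region")
        return (reg - self.base + self.rrb) % self.size

    def read(self, reg):
        return self.phys[self._index(reg)]

    def write(self, reg, value):
        self.phys[self._index(reg)] = value

    def rotate(self):
        """Called once per loop iteration: every name shifts up by one."""
        self.rrb = (self.rrb - 1) % self.size


rf = RotatingRegisterFile()
rf.write(32, "loaded in iteration i")
rf.rotate()
print(rf.read(33))   # the same value, now visible one register higher
```

In a software-pipelined loop this lets each stage address its operand with a fixed register number while a fresh physical register is used every iteration, which is why no unrolling is needed to keep overlapping iterations apart.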



