CS718 : VLIW - Software Driven ILP


Presentation Transcript


  1. CS718 : VLIW - Software Driven ILP Example Architectures 6th Apr, 2006 Anshul Kumar, CSE IITD

  2. Execution model - some issues
     • Register access within an instruction
        • interaction between reads and writes within an instruction to the same register
     • Operation completion under exception
        • which operations are completed when an exception occurs
     • Exposing pipeline latencies
        • what latency information the compiler has

  3. Register access in an instruction
     • Read sees the original value of the register
        • allows swap of two registers in a single instruction
     • Read sees the value written by the write
        • a pair of operations that read and write a pair of registers cannot be resolved
     • Different operations that read and write the same register in an instruction are not allowed
        • parallel operations are not forced to execute in parallel
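The first option (every read sees the original register value) can be sketched in Python as a two-phase update. This is only an illustrative model, not code from the lecture: `execute_instruction` and the `(dest, func, sources)` tuple format are hypothetical names chosen for the example.

```python
def execute_instruction(regs, ops):
    """Execute all operations of one VLIW instruction.

    regs: dict mapping register name -> value
    ops:  list of (dest, func, sources) tuples (hypothetical format)
    All sources are read before any destination is written, so every
    read sees the value the register held before the instruction.
    """
    # Phase 1: read all sources and compute results from the old state.
    results = [(dest, func(*(regs[s] for s in srcs))) for dest, func, srcs in ops]
    # Phase 2: commit all writes.
    for dest, value in results:
        regs[dest] = value
    return regs


def move(x):
    return x


regs = {"R1": 10, "R2": 20}
# Two moves in the same instruction swap R1 and R2.
execute_instruction(regs, [("R1", move, ["R2"]), ("R2", move, ["R1"])])
print(regs)  # {'R1': 20, 'R2': 10}
```

If the writes were applied one operation at a time instead, the second move would read the already overwritten R1 and the swap would fail.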

  4. Operation completion under exception
     • None complete: simplest
     • All that can complete, or all before the excepting operation complete: complex (determine what remains to be fixed up)
     • Free-for-all: no guarantees

  5. Exposing pipeline latencies
     • EQ model
        • the destination is written in a cycle which is known at compile time
     • LEQ model
        • more permissive, allows some binary compatibility
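The difference between the two models can be sketched as a question: which value does a read see while an operation is still in flight? The function below is a hypothetical illustration. It assumes that under EQ the write happens exactly `latency` cycles after issue, that under LEQ it may happen at any point up to that cycle, and that a read in the completion cycle already sees the new value.

```python
def value_seen(model, issue_cycle, latency, read_cycle):
    """Return "old", "new" or "undefined" for a read of the destination register.

    model: "EQ" (write lands exactly at issue_cycle + latency) or
           "LEQ" (write lands no later than issue_cycle + latency).
    The boundary conventions used here are assumptions of this sketch.
    """
    if model not in ("EQ", "LEQ"):
        raise ValueError(f"unknown model {model!r}")
    done = issue_cycle + latency
    if read_cycle <= issue_cycle:
        return "old"        # the operation has not produced anything yet
    if read_cycle >= done:
        return "new"        # the write is guaranteed to have happened
    # Inside the latency window the two models differ.
    return "old" if model == "EQ" else "undefined"


print(value_seen("EQ", 0, 3, 2))   # old
print(value_seen("LEQ", 0, 3, 2))  # undefined
```

Under EQ the compiler may keep using the old value during the latency window. Under LEQ it may not, which is what lets the same binary run on an implementation with a shorter latency.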

  6. VLIW Examples
     • IA-64 and Itanium: HP and Intel
     • Trimedia: Philips
     • Transmeta Crusoe
     • DSPs: Texas Instruments, Analog Devices

  7. IA-64 Register Model
     • 128 general purpose registers, 64 bit
     • 128 floating point registers, 82 bit
     • 64 predicate registers, 1 bit
     • 8 branch registers (indirect branch), 64 bit
     • Registers for system control, memory mapping, performance counters, communication with OS

  8. Register Stack
     • GPRs 0-31 always available
     • GPRs 32-127 used as a stack
     • GPRs and FPRs support register rotation for SW pipelining
     [Figure: register stack frames i-1 and i, each with LOCAL and OUT regions]

  9. IA-64 Execution Units
     Execution Unit   Instruction Type   Description
     I-unit           A                  Arithmetic (integer)
                      I                  non-ALU int (shifts, tests, move)
     M-unit           A                  Arithmetic (integer)
                      M                  Memory (load/store)
     F-unit           F                  Floating point
     B-unit           B                  Branches, calls, loops
     L+X              L+X                Extended immediates (executed by either B or I units)

  10. Flexibility + explicit parallelism
     • Compiler forms groups of instructions which can be executed in parallel if execution resources are available
     • Instructions in a group may be scheduled in one or more cycles, depending upon resource availability
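How a single group can spread over several cycles may be sketched as follows. The model is deliberately simple and rests on assumptions: instructions issue in order, each occupies one unit of its type for one cycle, and the unit counts and issue width are taken from the Itanium slide later in this deck. The name `cycles_for_group` is hypothetical.

```python
ITANIUM_UNITS = {"I": 2, "M": 2, "B": 3, "F": 2}  # counts from the Itanium slide
ISSUE_WIDTH = 6                                   # up to 2 bundles of 3 instructions


def cycles_for_group(group, units, issue_width):
    """Issue the instructions of one group in order; return the cycles used.

    group: list of unit types, e.g. ["M", "M", "I"]
    units: dict mapping unit type -> units available per cycle
    """
    cycles = 0
    i = 0
    while i < len(group):
        free = dict(units)
        issued = 0
        # In-order issue: stop at the first instruction that finds no free unit.
        while i < len(group) and issued < issue_width and free.get(group[i], 0) > 0:
            free[group[i]] -= 1
            issued += 1
            i += 1
        if issued == 0:
            raise ValueError(f"no unit of type {group[i]!r} exists")
        cycles += 1
    return cycles


print(cycles_for_group(["M", "I", "I"], ITANIUM_UNITS, ISSUE_WIDTH))                 # 1
print(cycles_for_group(["M", "M", "I", "M", "M", "F"], ITANIUM_UNITS, ISSUE_WIDTH))  # 2
```

The second group is legal as a single group, but with only two M units it needs two cycles. The same binary would need one cycle on a machine with four M units.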

  11. Instruction Formats
     • Instructions are encoded in 128 bit bundles
     • Each bundle = 5 bit template + 3 × 41 bit instructions
     • 5 bit template field specifies execution unit types required for the 3 instructions and position of stops, if any
        • stops indicate the boundaries of instruction groups
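The bit budget (5 + 3 × 41 = 128) can be checked with a small pack/unpack sketch using Python integers. The layout assumed here, with the template in the low-order bits followed by slots 0, 1 and 2, and the names `pack_bundle` and `unpack_bundle` are assumptions of the example, not something stated on the slide.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3


def pack_bundle(template, slots):
    """Pack a 5 bit template and three 41 bit instructions into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly 3 instructions")
    bundle = template
    for i, instr in enumerate(slots):
        if not 0 <= instr < (1 << SLOT_BITS):
            raise ValueError("instruction does not fit in 41 bits")
        # Assumed layout: slot i starts right after the template and earlier slots.
        bundle |= instr << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Inverse of pack_bundle: return (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
             for i in range(SLOTS_PER_BUNDLE)]
    return template, slots


bundle = pack_bundle(9, [0x123, 0x456, (1 << SLOT_BITS) - 1])
print(bundle.bit_length() <= 128)   # True
print(unpack_bundle(bundle)[0])     # 9
```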

  12. Template examples
     Template   Slot 0   Slot 1   Slot 2
     0          M        I        I
     1          M        I        I
     2          M        I        I
     3          M        I        I
     4          M        L        X
     5          M        L        X
     8          M        M        I
     9          M        M        I
     (templates with the same slot types differ in the position of stops)

  13. Example Schedule 1
     Template   Slot 0            Slot 1            Slot 2            Cycle
     9: MMI     LD F0,0(R1)       LD F6,-8(R1)                        1
     14: MMF    LD F10,-16(R1)    LD F14,-24(R1)    ADD F4,F0,F2      3
     15: MMF    LD F18,-32(R1)    LD F22,-40(R1)    ADD F8,F6,F2      4
     15: MMF    LD F26,-48(R1)    SD F4,0(R1)       ADD F12,F10,F2    6
     15: MMF    SD F8,-8(R1)      SD F12,-16(R1)    ADD F16,F14,F2    9
     15: MMF    SD F16,-24(R1)                      ADD F20,F18,F2    12
     15: MMF    SD F20,-32(R1)                      ADD F24,F22,F2    15
     15: MMF    SD F24,-40(R1)                      ADD F28,F26,F2    18
     28: MFB    SD F28,-48(R1)    ADD R1,R1,-56     BNE R1,R2,Loop    21

  14. Example Schedule 2
     Template   Slot 0            Slot 1            Slot 2            Cycle
     8: MMI     LD F0,0(R1)       LD F6,-8(R1)                        1
     9: MMI     LD F10,-16(R1)    LD F14,-24(R1)                      2
     14: MMF    LD F18,-32(R1)    LD F22,-40(R1)    ADD F4,F0,F2      3
     14: MMF    LD F26,-48(R1)                      ADD F8,F6,F2      4
     15: MMF                                        ADD F12,F10,F2    5
     14: MMF    SD F4,0(R1)                         ADD F16,F14,F2    6
     14: MMF    SD F8,-8(R1)                        ADD F20,F18,F2    7
     15: MMF    SD F12,-16(R1)                      ADD F24,F22,F2    8
     14: MMF    SD F16,-24(R1)                      ADD F28,F26,F2    9
     9: MMI     SD F20,-32(R1)    SD F24,-40(R1)                      11
     28: MFB    SD F28,-48(R1)    ADD R1,R1,-56     BNE R1,R2,Loop    12

  15. Predication Support
     • Almost all instructions predicated
     • 6 bit field specifies predicate register
     • Predicate registers are set by test instructions
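Predicated execution of a small if/else can be sketched as below. The model is hypothetical: `compare_lt` stands in for a test instruction that writes a predicate and its complement, and `run` executes register moves whose guarding predicate is true. Only the register counts come from the slides.

```python
NUM_PREDICATES = 2 ** 6  # a 6 bit field can name 64 predicate registers


def compare_lt(preds, p_true, p_false, a, b):
    """Test instruction: set one predicate to (a < b) and another to its complement."""
    preds[p_true] = a < b
    preds[p_false] = not (a < b)


def run(instrs, regs, preds):
    """Execute (predicate, dest, source) moves; skip those whose predicate is false."""
    for qp, dest, src in instrs:
        if preds[qp]:
            regs[dest] = regs[src]


regs = {"R1": 3, "R2": 7, "R3": 0}
preds = [False] * NUM_PREDICATES
compare_lt(preds, 1, 2, regs["R1"], regs["R2"])
# if (R1 < R2) R3 = R1; else R3 = R2;   expressed without a branch
run([(1, "R3", "R1"), (2, "R3", "R2")], regs, preds)
print(regs["R3"])  # 3
```

Both moves are fetched and issued, but only the one guarded by a true predicate updates R3, so the branch and its possible misprediction disappear.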

  16. Speculation Support
     • Control speculation using poison bit approach
     • One additional bit in GPRs - NaT (not a thing)
     • NaTVal in FPRs
     • Registers with NaT or NaTVal can’t be stored
        • special instructions to save and restore registers with poison bits/values
     • Load/store speculation using advanced load instruction and ALAT table with associative look up
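The poison bit idea can be sketched with a register that carries a value plus a NaT flag. This is a simplified, hypothetical model: `speculative_load`, `add` and `needs_recovery` are illustrative names, memory is a plain dict, and a missing address stands in for any load that would have raised an exception.

```python
from dataclasses import dataclass


@dataclass
class Reg:
    value: int = 0
    nat: bool = False  # poison bit: "not a thing"


def speculative_load(memory, addr):
    """Load hoisted above a branch: a faulting access poisons the result
    instead of raising an exception."""
    if addr not in memory:
        return Reg(0, nat=True)
    return Reg(memory[addr])


def add(a, b):
    """The poison bit propagates through dependent operations."""
    return Reg(a.value + b.value, nat=a.nat or b.nat)


def needs_recovery(reg):
    """Check placed where the load originally stood (stand-in for a
    speculation check): True means recovery code has to run."""
    return reg.nat


memory = {0x100: 5}
good = add(speculative_load(memory, 0x100), Reg(1))
bad = add(speculative_load(memory, 0x200), Reg(1))
print(good.value, needs_recovery(good))  # 6 False
print(needs_recovery(bad))               # True
```

The exception is reported only if the program reaches the check, that is, only if the speculated load turns out to have been needed.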

  17. Itanium Processor
     • Introduced in 2001 with 800MHz clock
     • 3 level cache: first split, first 2 on-chip
     • 2 I units, 2 M units, 3 B units, 2 F units
     • 10 stage pipeline
     • pre-fetch buffer with 8 bundles: 2 bundles pre-fetched per cycle
     • up to 2 bundles issued at a time: up to 6 instructions distributed to 9 execution units, with register renaming (rotation and stacking)
     • Good FP performance but not integer

  18. Trimedia TM32
     • Designed for embedded applications
     • Classic VLIW architecture, completely static scheduling
     • 5 operation slots per instruction
        • each specifies an operation or immediate field
     • no hazard detection hardware
     • compressed code stored in memory and cache, decompressed during fetch
     • each operation can be individually predicated
        • in an instruction with multiple branches, at most one predicate can be true
     • no virtual memory

  19. Trimedia Function Units
     • 23 function units of 11 different types
     • min latency 0 (integer ALU)
     • max latency 16 (FP divide and square root)
     • a function unit can be specified by only certain instruction slots
        • ALU (all), DMem (4, 5), Branch (2, 3, 4), DSPALU (1, 3), FALU (1, 4), FTough (2)
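The slot restrictions listed above are the kind of constraint a scheduler has to check for every instruction. A minimal sketch follows, assuming slots are numbered 1 to 5 and covering only the six unit types named on the slide. `ALLOWED_SLOTS` and `invalid_slots` are hypothetical names.

```python
# Slots from which each function unit type can be used (from the slide).
ALLOWED_SLOTS = {
    "ALU": {1, 2, 3, 4, 5},
    "DMem": {4, 5},
    "Branch": {2, 3, 4},
    "DSPALU": {1, 3},
    "FALU": {1, 4},
    "FTough": {2},
}


def invalid_slots(instruction):
    """Return the slot numbers whose operation uses a unit not reachable from that slot.

    instruction: list of 5 unit type names, with None for an empty slot.
    """
    if len(instruction) != 5:
        raise ValueError("a TM32 instruction has 5 operation slots")
    return [slot for slot, unit in enumerate(instruction, start=1)
            if unit is not None and slot not in ALLOWED_SLOTS[unit]]


print(invalid_slots(["FALU", "FTough", "DSPALU", "DMem", "DMem"]))  # []
print(invalid_slots(["DMem", None, None, None, "Branch"]))          # [1, 5]
```

Since the hardware has no hazard detection, a check of this kind belongs in the compiler or assembler rather than at run time.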

  20. Transmeta Crusoe
     • Designed for low power applications like mobile PC, mobile internet appliances
     • compatibility with x86 through translating software
     • 500 MHz to 1 GHz, 5 to 7 W power consumption
     • 64 bit (2 operations) and 128 bit (4 operations) versions, 64 integer registers [new 256 bit Efficeon]
     • Operation slot types: ALU, compute (int/fp/mm), Memory, Branch, Immediate
     • Support for speculative re-ordering: shadow register file, program-controlled store buffer, memory alias detection, conditional move