Superscalar, VLIW, Vector, Decoupled (Week 4, CDA 5155)
Processor Design Families
• Superscalar
  • Not an architectural specification!
• Vector processors
  • Simplest hardware – great for the right problems
• Statically scheduled multiple issue
  • Better known as Very Long Instruction Word (VLIW)
• Compiler-dominated scheduling
  • Better known as EPIC (almost VLIW)
• Decoupled architectures
  • Tightly interconnected scalar processors
  • Relatively unknown area, but one that influences current designs (also my dissertation research)
Vector Processors
“I’m certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor. Those three were all pioneering processors… One of the problems of being a pioneer is you always make mistakes, and I never, never want to be a pioneer. It’s always best to come second when you can look at the mistakes the pioneers made.”
– Seymour Cray (Cray-1, 1976)
Vector Processor Design
• Early “supercomputers”
• Add special instructions (e.g., ADDV) that operate on sequences (vectors) of data
  • A single instruction defines a long sequence of operations to be performed
  • Elements within a vector have no hazards with one another – no stalling, forwarding, etc.
  • Eliminates the need for overhead instructions for loop iteration
  • Very simple pipeline organization
• More constrained memory accesses let LV/SV instructions be scheduled to match the memory banking design
  • This enables very efficient use of the memory bus (as caches do, to a smaller extent)
(A scalar-versus-vector sketch of this idea follows below.)
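To make the contrast concrete, here is a minimal C sketch (not from the slides; VLEN and the commented vector code are illustrative assumptions) of the scalar loop that a few vector instructions replace. Note that the increment/compare/branch loop overhead disappears entirely in the vector version.

    #include <stdio.h>

    #define VLEN 64   /* assumed hardware vector length (illustrative) */

    int main(void) {
        double a[VLEN], b[VLEN], c[VLEN];

        for (int i = 0; i < VLEN; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Scalar version: VLEN iterations, each with two loads, an add, a
           store, plus the increment/compare/branch loop overhead.          */
        for (int i = 0; i < VLEN; i++)
            c[i] = a[i] + b[i];

        /* Vector version, conceptually:
             LV   V1, a        ; load 64 elements of a
             LV   V2, b        ; load 64 elements of b
             ADDV V3, V1, V2   ; one instruction, 64 independent adds
             SV   V3, c        ; store 64 elements to c
           Four instructions total, and no per-element hazards to check.    */

        printf("c[10] = %f\n", c[10]);
        return 0;
    }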
Handling Vectors in Memory
• LV V1, Mem[R1] loads an entire vector of data starting at location Mem[R1]
  • This looks a lot like a cache-line fill operation
  • Can design the number of memory banks to reflect the vector size
• What about non-contiguous accesses?
  • Column access on a 2D array; elements out of a structure
  • LV V1, Mem[R1], R2 loads a vector starting at Mem[R1] with a stride of R2 bytes
• What about more complex accesses?
  • Indexed (scatter/gather) access
  • LV V1, Mem[R1], V2 where V1[1] ← Mem[R1+V2[1]]; V1[2] ← Mem[R1+V2[2]]; etc.
(A C sketch of strided and gather access follows below.)
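For intuition, here is a small C sketch (assumed names, not from the slides) of the access patterns these addressing modes cover: a strided column access into a 2D array and a gather through an index vector.

    #include <stdio.h>

    #define ROWS 4
    #define COLS 8

    int main(void) {
        double m[ROWS][COLS];
        double col[ROWS], gathered[ROWS];
        int    idx[ROWS] = {3, 0, 7, 5};   /* index vector, as in V2 above */

        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                m[r][c] = 10 * r + c;

        /* Strided access: column 2 of m.  Consecutive elements are
           COLS * sizeof(double) bytes apart, i.e., the stride passed in R2. */
        for (int r = 0; r < ROWS; r++)
            col[r] = m[r][2];

        /* Gather (indexed) access: element i comes from base + idx[i],
           matching V1[i] <- Mem[R1 + V2[i]].                                */
        for (int r = 0; r < ROWS; r++)
            gathered[r] = m[0][idx[r]];

        printf("col[1] = %.0f, gathered[2] = %.0f\n", col[1], gathered[2]);
        return 0;
    }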
Chaining Vectors
Enable forwarding of vectors (DAXPY: Z = aX + Y):

    LV    V1, R1      ; load X
    LV    V2, R2      ; load Y
    MULSV V3, F0, V1  ; calculate aX
    ADDV  V4, V3, V2  ; calculate (aX) + Y
    SV    V4, R3      ; store at Z

How can we overlap these instructions?
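For reference, the same computation as a scalar C loop (a standard DAXPY). With chaining, the result MULSV produces for element i is forwarded straight into ADDV, so the multiply and add units overlap on successive elements instead of waiting for the full product vector.

    #include <stdio.h>

    #define N 64

    /* Scalar DAXPY: z[i] = a * x[i] + y[i]. */
    void daxpy(int n, double a, const double *x, const double *y, double *z) {
        for (int i = 0; i < n; i++)
            z[i] = a * x[i] + y[i];
    }

    int main(void) {
        double x[N], y[N], z[N];
        for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }
        daxpy(N, 2.0, x, y, z);
        printf("z[3] = %f\n", z[3]);   /* 2*3 + 1 = 7 */
        return 0;
    }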
Other Vector Issues
• Compiler analysis to find vectorizable code
• Determining vector length (see the strip-mining sketch below)
• Amdahl’s law
• Complexity
• Code base
• Image processing, scientific code (genomes?), graphics (MMX)
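One standard answer to the vector-length question is strip mining: break a loop of arbitrary length into chunks no longer than the machine’s maximum vector length. A minimal C sketch, assuming a hypothetical maximum vector length MVL:

    #define MVL 64   /* assumed maximum hardware vector length */

    /* Strip mining: process an arbitrary n in chunks of at most MVL elements.
       Each inner loop corresponds to one set of vector instructions issued
       with the vector-length register set to 'len'.                          */
    void vadd(int n, const double *a, const double *b, double *c) {
        for (int i = 0; i < n; i += MVL) {
            int len = (n - i < MVL) ? (n - i) : MVL;   /* last strip may be short */
            for (int j = 0; j < len; j++)
                c[i + j] = a[i + j] + b[i + j];
        }
    }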
VLIW Processors
• What happens to hardware complexity if we make the microarchitecture (pipeline organization) visible to the programmer/compiler?
  • Scheduling is a software problem
  • Hazard detection is a software problem
  • Memory scheduling is (mostly) a software problem
  • Speculation (branch prediction) is (mostly) a software problem
• Hardware is simpler!
• The compiler’s/programmer’s job is much harder
(A sketch of what a long instruction word looks like follows below.)
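To give a feel for what “visible to the compiler” means, here is a hypothetical C sketch of a 4-wide long instruction word with fixed slots (the slot mix, opcodes, and field names are all assumptions, not any real ISA). The compiler fills every slot each cycle, padding with NOPs when it cannot find independent work.

    #include <stdio.h>

    /* One operation within a bundle (illustrative encoding). */
    typedef enum { OP_NOP, OP_ADD, OP_MUL, OP_LOAD, OP_STORE, OP_BRANCH } opcode_t;

    typedef struct {
        opcode_t op;
        int dst, src1, src2;
    } slot_t;

    /* A hypothetical 4-wide VLIW bundle: every cycle the hardware issues all
       four slots with no dependence checking, so the compiler must guarantee
       the slots are independent and insert NOPs otherwise.                    */
    typedef struct {
        slot_t int_slot;      /* integer ALU   */
        slot_t fp_slot;       /* FP multiplier */
        slot_t mem_slot;      /* load/store    */
        slot_t branch_slot;   /* branch unit   */
    } vliw_bundle_t;

    int main(void) {
        vliw_bundle_t b = {
            .int_slot    = { OP_ADD,  1, 2, 3 },   /* r1 = r2 + r3  */
            .fp_slot     = { OP_MUL,  4, 5, 6 },   /* f4 = f5 * f6  */
            .mem_slot    = { OP_LOAD, 7, 8, 0 },   /* r7 = Mem[r8]  */
            .branch_slot = { OP_NOP,  0, 0, 0 }    /* nothing found */
        };
        printf("bundle occupies %zu bytes\n", sizeof b);
        return 0;
    }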
Non-unit latency
• No hazard detection
  • If we write code that reads R3, we get whatever is in R3 at that cycle
  • Note: a superscalar would get the most recent definition (that is what the hazard detector checks for)
• Example:
  • R1 ← 5
  • R1 ← 10
  • R2 ← R1 (5 or 10?)
  • It depends on the structure of the pipeline (which is known by the software)
• Pipeline registers are visible to the compiler (but may not be accessed)
(A small simulation of this effect follows below.)
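A tiny, purely illustrative C simulation of why the answer depends on the pipeline (WRITE_LATENCY is an assumed parameter, not any real machine’s value): a read returns whatever value has committed by that cycle, not the most recent write in program order.

    #include <stdio.h>

    #define WRITE_LATENCY 2   /* assumed: a write becomes visible 2 cycles later */
    #define CYCLES 6

    int main(void) {
        int r1 = 0;
        /* Pending writes to R1, indexed by the cycle in which they commit. */
        int pending[CYCLES + WRITE_LATENCY]     = {0};
        int has_pending[CYCLES + WRITE_LATENCY] = {0};

        for (int cycle = 0; cycle < CYCLES; cycle++) {
            /* Commit any write scheduled for this cycle before reads see R1. */
            if (has_pending[cycle]) r1 = pending[cycle];

            if (cycle == 0) { pending[cycle + WRITE_LATENCY] = 5;  has_pending[cycle + WRITE_LATENCY] = 1; }  /* R1 <- 5  */
            if (cycle == 1) { pending[cycle + WRITE_LATENCY] = 10; has_pending[cycle + WRITE_LATENCY] = 1; }  /* R1 <- 10 */
            if (cycle == 2) printf("R2 <- R1 at cycle 2 reads %d\n", r1);  /* 5: the write of 10 lands at cycle 3 */
        }
        return 0;
    }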
Decoupled Processors
• Multiple processors connected by asynchronous queues (note this is the DAXPY computation again, split across four processors):
  • P1: LD X[i] → P3
  • P2: LD Y[i] → P4
  • P3: Mul a, Mem → P4
  • P4: Add P3, Mem → Mem
(A queue-based sketch of this organization follows below.)
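A minimal single-threaded C sketch (assumed structure, not from the slides) of the decoupled idea: the “access” processors run ahead filling FIFO queues with loaded values, while the “execute” processors consume the queues, so each side can slip past the other’s stalls.

    #include <stdio.h>

    #define N 16

    /* A trivial FIFO standing in for an asynchronous inter-processor queue. */
    typedef struct { double buf[N]; int head, tail; } queue_t;
    static void   push(queue_t *q, double v) { q->buf[q->tail++] = v; }
    static double pop(queue_t *q)            { return q->buf[q->head++]; }

    int main(void) {
        double x[N], y[N], z[N], a = 2.0;
        for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }

        queue_t q13 = {0}, q24 = {0}, q34 = {0};   /* P1->P3, P2->P4, P3->P4 */

        /* P1 and P2: access processors stream loads into their output queues. */
        for (int i = 0; i < N; i++) push(&q13, x[i]);
        for (int i = 0; i < N; i++) push(&q24, y[i]);

        /* P3: multiply the scalar a by whatever arrives from P1, forward to P4. */
        for (int i = 0; i < N; i++) push(&q34, a * pop(&q13));

        /* P4: add P3's result to the value arriving from P2, store to memory. */
        for (int i = 0; i < N; i++) z[i] = pop(&q34) + pop(&q24);

        printf("z[3] = %f\n", z[3]);   /* 2*3 + 1 = 7 */
        return 0;
    }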