Superscalar, VLIW, Vector, Decoupled (Week 4, CDA 5155)
Processor Design Families
• Superscalar
  • Not an architectural specification!
• Vector processors
  • Simplest hardware – great for the right problems
• Statically scheduled multiple issue
  • Better known as Very Long Instruction Word (VLIW)
• Compiler-dominated scheduling
  • Better known as EPIC (almost VLIW)
• Decoupled architectures
  • Tightly interconnected scalar processors
  • Relatively unknown area, but one that influences current designs (also my dissertation research)
Vector Processors
“I’m certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor. Those three were all pioneering processors… One of the problems of being a pioneer is you always make mistakes, and I never, never want to be a pioneer. It’s always best to come second when you can look at the mistakes the pioneers made.”
– Seymour Cray (Cray-1, 1976)
Vector Processor Design
• Early “supercomputers”
• Add special instructions (e.g., ADDV) that operate on sequences (vectors) of data
  • A single instruction defines a long sequence of operations to be performed
  • Elements within a vector have no hazards with one another – no stalling, forwarding, etc.
  • Eliminates the need for overhead instructions for loop iteration
  • Very simple pipeline organization
• More constrained memory accesses let LV/SV instructions be scheduled to match the memory banking design
  • This enables very efficient use of the memory bus (as caches do, to a smaller extent)
(A scalar-versus-vector sketch of this idea follows below.)
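To make the contrast concrete, here is a minimal C sketch (not from the slides; VLEN and the commented vector code are illustrative assumptions) of the scalar loop that a few vector instructions replace. Note that the increment/compare/branch loop overhead disappears entirely in the vector version.

    #include <stdio.h>

    #define VLEN 64   /* assumed hardware vector length (illustrative) */

    int main(void) {
        double a[VLEN], b[VLEN], c[VLEN];

        for (int i = 0; i < VLEN; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Scalar version: VLEN iterations, each with two loads, an add, a
           store, plus the increment/compare/branch loop overhead.          */
        for (int i = 0; i < VLEN; i++)
            c[i] = a[i] + b[i];

        /* Vector version, conceptually:
             LV   V1, a        ; load 64 elements of a
             LV   V2, b        ; load 64 elements of b
             ADDV V3, V1, V2   ; one instruction, 64 independent adds
             SV   V3, c        ; store 64 elements to c
           Four instructions total, and no per-element hazards to check.    */

        printf("c[10] = %f\n", c[10]);
        return 0;
    }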
Handling Vectors in Memory
• LV V1, Mem[R1] loads an entire vector of data starting at location Mem[R1]
  • This looks a lot like a cache-line fill operation
  • Can design the number of memory banks to reflect the vector size
• What about non-contiguous accesses?
  • Column access on a 2D array; elements out of a structure
  • LV V1, Mem[R1], R2 loads a vector starting at Mem[R1] with a stride of R2 bytes
• What about more complex accesses?
  • Indexed (scatter/gather) access
  • LV V1, Mem[R1], V2 where V1[1] ← Mem[R1+V2[1]]; V1[2] ← Mem[R1+V2[2]]; etc.
(A C sketch of strided and gather access follows below.)
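For intuition, here is a small C sketch (assumed names, not from the slides) of the access patterns these addressing modes cover: a strided column access into a 2D array and a gather through an index vector.

    #include <stdio.h>

    #define ROWS 4
    #define COLS 8

    int main(void) {
        double m[ROWS][COLS];
        double col[ROWS], gathered[ROWS];
        int    idx[ROWS] = {3, 0, 7, 5};   /* index vector, as in V2 above */

        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                m[r][c] = 10 * r + c;

        /* Strided access: column 2 of m.  Consecutive elements are
           COLS * sizeof(double) bytes apart, i.e., the stride passed in R2. */
        for (int r = 0; r < ROWS; r++)
            col[r] = m[r][2];

        /* Gather (indexed) access: element i comes from base + idx[i],
           matching V1[i] <- Mem[R1 + V2[i]].                                */
        for (int r = 0; r < ROWS; r++)
            gathered[r] = m[0][idx[r]];

        printf("col[1] = %.0f, gathered[2] = %.0f\n", col[1], gathered[2]);
        return 0;
    }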
Chaining Vectors
Enable forwarding of vectors (DAXPY: Z = aX + Y):

    LV    V1, R1      ; load X
    LV    V2, R2      ; load Y
    MULSV V3, F0, V1  ; calculate aX
    ADDV  V4, V3, V2  ; calculate (aX) + Y
    SV    V4, R3      ; store at Z

How can we overlap these instructions?
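For reference, the same computation as a scalar C loop (a standard DAXPY). With chaining, the result MULSV produces for element i is forwarded straight into ADDV, so the multiply and add units overlap on successive elements instead of waiting for the full product vector.

    #include <stdio.h>

    #define N 64

    /* Scalar DAXPY: z[i] = a * x[i] + y[i]. */
    void daxpy(int n, double a, const double *x, const double *y, double *z) {
        for (int i = 0; i < n; i++)
            z[i] = a * x[i] + y[i];
    }

    int main(void) {
        double x[N], y[N], z[N];
        for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }
        daxpy(N, 2.0, x, y, z);
        printf("z[3] = %f\n", z[3]);   /* 2*3 + 1 = 7 */
        return 0;
    }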
Other Vector Issues
• Compiler analysis to find vectorizable code
• Determining vector length (see the strip-mining sketch below)
• Amdahl’s law
• Complexity
• Code base
• Image processing, scientific code (genomes?), graphics (MMX)
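One standard answer to the vector-length question is strip mining: break a loop of arbitrary length into chunks no longer than the machine’s maximum vector length. A minimal C sketch, assuming a hypothetical maximum vector length MVL:

    #define MVL 64   /* assumed maximum hardware vector length */

    /* Strip mining: process an arbitrary n in chunks of at most MVL elements.
       Each inner loop corresponds to one set of vector instructions issued
       with the vector-length register set to 'len'.                          */
    void vadd(int n, const double *a, const double *b, double *c) {
        for (int i = 0; i < n; i += MVL) {
            int len = (n - i < MVL) ? (n - i) : MVL;   /* last strip may be short */
            for (int j = 0; j < len; j++)
                c[i + j] = a[i + j] + b[i + j];
        }
    }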
VLIW Processors
• What happens to hardware complexity if we make the microarchitecture (pipeline organization) visible to the programmer/compiler?
  • Scheduling is a software problem
  • Hazard detection is a software problem
  • Memory scheduling is (mostly) a software problem
  • Speculation (branch prediction) is (mostly) a software problem
• Hardware is simpler!
• The compiler’s/programmer’s job is much harder
(A sketch of what a long instruction word looks like follows below.)
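To give a feel for what “visible to the compiler” means, here is a hypothetical C sketch of a 4-wide long instruction word with fixed slots (the slot mix, opcodes, and field names are all assumptions, not any real ISA). The compiler fills every slot each cycle, padding with NOPs when it cannot find independent work.

    #include <stdio.h>

    /* One operation within a bundle (illustrative encoding). */
    typedef enum { OP_NOP, OP_ADD, OP_MUL, OP_LOAD, OP_STORE, OP_BRANCH } opcode_t;

    typedef struct {
        opcode_t op;
        int dst, src1, src2;
    } slot_t;

    /* A hypothetical 4-wide VLIW bundle: every cycle the hardware issues all
       four slots with no dependence checking, so the compiler must guarantee
       the slots are independent and insert NOPs otherwise.                    */
    typedef struct {
        slot_t int_slot;      /* integer ALU   */
        slot_t fp_slot;       /* FP multiplier */
        slot_t mem_slot;      /* load/store    */
        slot_t branch_slot;   /* branch unit   */
    } vliw_bundle_t;

    int main(void) {
        vliw_bundle_t b = {
            .int_slot    = { OP_ADD,  1, 2, 3 },   /* r1 = r2 + r3  */
            .fp_slot     = { OP_MUL,  4, 5, 6 },   /* f4 = f5 * f6  */
            .mem_slot    = { OP_LOAD, 7, 8, 0 },   /* r7 = Mem[r8]  */
            .branch_slot = { OP_NOP,  0, 0, 0 }    /* nothing found */
        };
        printf("bundle occupies %zu bytes\n", sizeof b);
        return 0;
    }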
Non-unit latency
• No hazard detection
  • If we write code that reads R3, we get whatever is in R3 at that cycle
  • Note: a superscalar would get the most recent definition (that is what the hazard detector checks for)
• Example:
  • R1 ← 5
  • R1 ← 10
  • R2 ← R1 (5 or 10?)
  • It depends on the structure of the pipeline (which is known by the software)
• Pipeline registers are visible to the compiler (but may not be accessed)
(A small simulation of this effect follows below.)
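A tiny, purely illustrative C simulation of why the answer depends on the pipeline (WRITE_LATENCY is an assumed parameter, not any real machine’s value): a read returns whatever value has committed by that cycle, not the most recent write in program order.

    #include <stdio.h>

    #define WRITE_LATENCY 2   /* assumed: a write becomes visible 2 cycles later */
    #define CYCLES 6

    int main(void) {
        int r1 = 0;
        /* Pending writes to R1, indexed by the cycle in which they commit. */
        int pending[CYCLES + WRITE_LATENCY]     = {0};
        int has_pending[CYCLES + WRITE_LATENCY] = {0};

        for (int cycle = 0; cycle < CYCLES; cycle++) {
            /* Commit any write scheduled for this cycle before reads see R1. */
            if (has_pending[cycle]) r1 = pending[cycle];

            if (cycle == 0) { pending[cycle + WRITE_LATENCY] = 5;  has_pending[cycle + WRITE_LATENCY] = 1; }  /* R1 <- 5  */
            if (cycle == 1) { pending[cycle + WRITE_LATENCY] = 10; has_pending[cycle + WRITE_LATENCY] = 1; }  /* R1 <- 10 */
            if (cycle == 2) printf("R2 <- R1 at cycle 2 reads %d\n", r1);  /* 5: the write of 10 lands at cycle 3 */
        }
        return 0;
    }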
Decoupled Processors
• Multiple processors connected by asynchronous queues (note this is the DAXPY computation again, split across four processors):
  • P1: LD X[i] → P3
  • P2: LD Y[i] → P4
  • P3: Mul a, Mem → P4
  • P4: Add P3, Mem → Mem
(A queue-based sketch of this organization follows below.)
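A minimal single-threaded C sketch (assumed structure, not from the slides) of the decoupled idea: the “access” processors run ahead filling FIFO queues with loaded values, while the “execute” processors consume the queues, so each side can slip past the other’s stalls.

    #include <stdio.h>

    #define N 16

    /* A trivial FIFO standing in for an asynchronous inter-processor queue. */
    typedef struct { double buf[N]; int head, tail; } queue_t;
    static void   push(queue_t *q, double v) { q->buf[q->tail++] = v; }
    static double pop(queue_t *q)            { return q->buf[q->head++]; }

    int main(void) {
        double x[N], y[N], z[N], a = 2.0;
        for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }

        queue_t q13 = {0}, q24 = {0}, q34 = {0};   /* P1->P3, P2->P4, P3->P4 */

        /* P1 and P2: access processors stream loads into their output queues. */
        for (int i = 0; i < N; i++) push(&q13, x[i]);
        for (int i = 0; i < N; i++) push(&q24, y[i]);

        /* P3: multiply the scalar a by whatever arrives from P1, forward to P4. */
        for (int i = 0; i < N; i++) push(&q34, a * pop(&q13));

        /* P4: add P3's result to the value arriving from P2, store to memory. */
        for (int i = 0; i < N; i++) z[i] = pop(&q34) + pop(&q24);

        printf("z[3] = %f\n", z[3]);   /* 2*3 + 1 = 7 */
        return 0;
    }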