CDA 5155

Superscalar, VLIW, Vector, Decoupled: Week 4

Presentation Transcript
Processors Design Families
  • Superscalar
    • Not an Architectural Specification!
  • Vector Processors
    • Simplest hardware – great for the right problems
  • Statically Scheduled Multiple Issue
    • Better known as Very Long Instruction Word (VLIW)
  • Compiler dominated Scheduling
    • Better known as EPIC (almost VLIW)
  • Decoupled Architectures
    • Tightly interconnected Scalar Processors
    • Relatively unknown area, influencing current designs
      • (also my dissertation research)
Vector Processors

“I’m certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI(ASC) processor. Those three were all pioneering processors… One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It’s always best to come second when you can look at the mistakes the pioneers made”

- Seymour Cray (Cray-1 1976)

Vector Processor Design
  • Early “supercomputers”
  • Add special instructions (e.g., ADDV) that operate on sequences (or vectors) of data
    • A single instruction defines a long sequence of operations to be performed.
      • Elements within a sequence are independent, so there are no hazards among them: no stalling, forwarding, etc.
      • Eliminates the need for overhead instructions for loop iteration
      • Very simple pipeline organization
      • More constrained memory access lets LV/SV instructions be scheduled to match the memory banking design
        • This enables very efficient use of the memory bus (as caches do, to a smaller extent)
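To make the point about loop overhead concrete, here is a rough count of dynamic instructions for z[i] = x[i] + y[i]. The per-iteration instruction mix and the 64-element vector register are assumptions chosen for illustration, not figures from the slides.

```python
def scalar_instruction_count(n):
    """Dynamic instructions for a scalar loop computing z[i] = x[i] + y[i].

    Assumed mix per iteration: 2 loads, 1 add, 1 store,
    1 index update, 1 branch.
    """
    per_iteration = 2 + 1 + 1 + 1 + 1
    return n * per_iteration


def vector_instruction_count(n, max_vector_length=64):
    """Dynamic instructions when all n elements fit in one vector register."""
    if n > max_vector_length:
        raise ValueError("n must fit in a single vector register")
    return 4  # LV, LV, ADDV, SV


scalar = scalar_instruction_count(64)
vector = vector_instruction_count(64)
```

Under these assumptions the scalar loop issues 384 instructions and the vector version issues 4; the loop-control instructions disappear entirely.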
Handling Vectors in Memory

LV V1 ← Mem[R1]

Loads an entire vector of data starting at location Mem[R1]

      • This looks a lot like a cache line fill operation
        • Can design the number of memory banks to reflect the vector size.
  • What about non-contiguous accesses?
    • Column access on a 2D array; elements out of a structure
      • LV V1 ← Mem[R1], R2

Loads a vector starting at Mem[R1], with a stride of R2 bytes

  • What about more complex accesses?
    • Indexed (scatter/gather) access
      • LV V1 ← Mem[R1], V2

V1[1] ← Mem[R1+V2[1]]; V1[2] ← Mem[R1+V2[2]]; etc.
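The three access patterns on this slide can be sketched in software as follows. This is only a functional model: memory is a Python list indexed by element rather than by byte, and the function names are hypothetical.

```python
def load_unit_stride(mem, base, length):
    """LV V1 <- Mem[R1]: consecutive elements starting at base."""
    return [mem[base + i] for i in range(length)]


def load_strided(mem, base, stride, length):
    """LV V1 <- Mem[R1], R2: every stride-th element starting at base."""
    return [mem[base + i * stride] for i in range(length)]


def load_indexed(mem, base, index_vector):
    """LV V1 <- Mem[R1], V2: gather using offsets held in another vector."""
    return [mem[base + offset] for offset in index_vector]


# A 3x4 matrix stored row-major: element (r, c) lives at index 4*r + c.
mem = list(range(100, 112))
row0 = load_unit_stride(mem, 0, 4)        # row access is contiguous
col1 = load_strided(mem, 1, 4, 3)         # column access uses stride 4
picked = load_indexed(mem, 0, [11, 0, 5])  # arbitrary gather
```

Here row0 is [100, 101, 102, 103], col1 is [101, 105, 109], and picked is [111, 100, 105].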

Chaining Vectors

Enable forwarding of vectors (DAXPY: Z = aX + Y)

LV V1, R1 ; load X

LV V2, R2 ; load Y

MULSV V3, F0, V1 ; calculate aX

ADDV V4, V3, V2 ; calculate (aX) + Y

SV V4, R3 ; store at Z

How can we overlap instructions?
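One way to see what chaining buys is a simple timing model for a dependent pair such as MULSV followed by ADDV. The model below assumes each functional unit has a fixed startup latency and then produces one element per cycle, and it ignores memory-port and other structural conflicts. The latencies (7 and 6 cycles) and the vector length of 64 are illustrative assumptions.

```python
def unchained_cycles(startups, n):
    """Each instruction waits for the previous one to finish all n elements."""
    return sum(startup + n for startup in startups)


def chained_cycles(startups, n):
    """Each instruction starts as soon as the first element is forwarded,
    so the element streams overlap and only the startups add up."""
    return sum(startups) + n


mul_startup, add_startup, n = 7, 6, 64
without_chaining = unchained_cycles([mul_startup, add_startup], n)
with_chaining = chained_cycles([mul_startup, add_startup], n)
```

With these numbers the pair takes 141 cycles without chaining and 77 cycles with it, because the add consumes each product as it leaves the multiplier.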

Other Vector Issues
  • Compiler analysis to find vectorizable code
  • Determining vector length
  • Amdahl’s law
  • Complexity
  • Code base
    • Image Processing, scientific code (genomes?), graphics (MMX)
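Two of the issues above lend themselves to a short sketch. Vector length is commonly handled by strip mining: splitting a loop of n elements into pieces no longer than the hardware's maximum vector length. Amdahl's law bounds the overall speedup by the fraction of the program that vectorizes. The function names and the sample numbers are assumptions for illustration.

```python
def strip_mine(n, max_vector_length):
    """Yield (start, length) pieces that cover n elements."""
    start = 0
    while start < n:
        length = min(max_vector_length, n - start)
        yield start, length
        start += length


def amdahl_speedup(vector_fraction, vector_speedup):
    """Overall speedup when only vector_fraction of the run time is sped up."""
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)


pieces = list(strip_mine(200, 64))
overall = amdahl_speedup(0.8, 10.0)
```

For 200 elements and 64-element registers, the pieces are three full vectors and a final one of length 8. With 80% of the time vectorizable at a 10x speedup, the overall speedup is only about 3.57x.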
VLIW Processors
  • What happens to hardware complexity if we make the microarchitecture (pipeline organization) visible to the programmer/compiler?
    • Scheduling is a software problem
    • Hazard detection is a software problem
    • Memory Scheduling is (mostly) a software problem
    • Speculation (branch prediction) is (mostly) a software problem
  • Hardware is simpler!
  • Compiler/Programmer’s job is much harder
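As a sketch of what "scheduling is a software problem" means, the toy scheduler below packs operations into fixed-width bundles and fills unused slots with nops, using latencies the compiler is assumed to know. It tracks only true (read-after-write) dependences, each operation may depend only on earlier ones, and the operation names and latencies are hypothetical.

```python
def schedule(ops, width):
    """ops: list of (name, latency, deps) in program order.

    Returns a list of bundles, each a list of `width` slot names.
    """
    ready_at = {}  # name -> first cycle at which its result can be read
    bundles = []
    for name, latency, deps in ops:
        # Earliest cycle at which every input has been produced.
        cycle = max((ready_at[d] for d in deps), default=0)
        while True:
            while len(bundles) <= cycle:
                bundles.append([])
            if len(bundles[cycle]) < width:
                break
            cycle += 1  # bundle full, try the next one
        bundles[cycle].append(name)
        ready_at[name] = cycle + latency
    # The hardware does no checking, so empty slots become explicit nops.
    return [b + ["nop"] * (width - len(b)) for b in bundles]


ops = [
    ("load_x", 2, []),
    ("load_y", 2, []),
    ("mul", 3, ["load_x"]),
    ("add", 1, ["mul", "load_y"]),
    ("store", 1, ["add"]),
]
bundles = schedule(ops, width=2)
```

For this dependence chain the two loads share the first bundle, and most later slots are nops: the waiting that a superscalar would handle with stalls shows up here as explicit empty instructions in the code.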
Non-unit latency
  • No hazard detection
    • If we write code that reads R3, it means whatever is in R3 at that cycle.
      • Note that a superscalar will get the most recent definition (that is what the hazard detector checks for)
      • R1 ← 5
      • R1 ← 10
      • R2 ← R1 (5 or 10?)
        • It depends on the structure of the pipeline (which is known by the software)
        • Pipeline registers are visible to the compiler (but may not be accessed)
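The "5 or 10?" question can be explored with a small simulation of a pipeline that performs no hazard detection. Each instruction is given an issue cycle and a latency; a result reaches the register file only after that latency, and a read simply sees whatever the register holds at its own cycle. The tuple format and the latencies are assumptions for illustration.

```python
def run_exposed_pipeline(program):
    """program: list of (issue_cycle, dest, source, latency).

    source is either an integer literal or a register name.
    At most one instruction issues per cycle; latency is at least 1.
    """
    regs = {}
    pending = []  # (write_cycle, dest, value) for results still in flight
    issue = {cycle: (dest, src, lat) for cycle, dest, src, lat in program}
    last_cycle = max(cycle + lat for cycle, _, _, lat in program)
    for cycle in range(last_cycle + 1):
        # Results whose latency has elapsed become visible first.
        for write_cycle, dest, value in pending:
            if write_cycle == cycle:
                regs[dest] = value
        pending = [p for p in pending if p[0] > cycle]
        if cycle in issue:
            dest, src, lat = issue[cycle]
            # No hazard check: read whatever the register holds right now.
            value = regs[src] if isinstance(src, str) else src
            pending.append((cycle + lat, dest, value))
    return regs


# Second write is slow (3 cycles): the read at cycle 2 still sees 5.
slow = run_exposed_pipeline([(0, "R1", 5, 1), (1, "R1", 10, 3), (2, "R2", "R1", 1)])
# Second write is fast (1 cycle): the read at cycle 2 sees 10.
fast = run_exposed_pipeline([(0, "R1", 5, 1), (1, "R1", 10, 1), (2, "R2", "R1", 1)])
```

In the first run R2 ends up as 5, in the second as 10. The same three instructions give different answers depending on latencies the compiler must know.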
Decoupled Processors

Multiple Processors

Asynchronous Queues

P1: LD X[i] → P3

P2: LD Y[i] → P4

P3: Mul a, Mem → P4

P4: Add P3, Mem → Mem
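The four-processor arrangement above can be mimicked in software with threads standing in for processors and bounded queues standing in for the asynchronous hardware queues. This is only an analogy for the data flow, computing Z = aX + Y; the queue depth of 4, the sentinel object, and the function name are assumptions.

```python
import queue
import threading


def run_decoupled_daxpy(a, x, y):
    """Compute a*x[i] + y[i] with four cooperating workers."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    q_x = queue.Queue(maxsize=4)   # P1 -> P3
    q_y = queue.Queue(maxsize=4)   # P2 -> P4
    q_ax = queue.Queue(maxsize=4)  # P3 -> P4
    done = object()                # end-of-stream marker
    z = []

    def p1():  # streams X toward the multiplier
        for value in x:
            q_x.put(value)
        q_x.put(done)

    def p2():  # streams Y toward the adder
        for value in y:
            q_y.put(value)
        q_y.put(done)

    def p3():  # multiplies each X element by a
        while True:
            value = q_x.get()
            if value is done:
                q_ax.put(done)
                return
            q_ax.put(a * value)

    def p4():  # adds the two streams and stores the result
        while True:
            ax = q_ax.get()
            yv = q_y.get()
            if ax is done or yv is done:
                return
            z.append(ax + yv)

    workers = [threading.Thread(target=f) for f in (p1, p2, p3, p4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return z


result = run_decoupled_daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```

Each worker blocks only when its input queue is empty or its output queue is full, so the loaders can run ahead of the arithmetic units by up to the queue depth.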