WaveScalar and the WaveCache

Steven Swanson

Ken Michelson

Mark Oskin

Tom Anderson

Susan Eggers

University of Washington

CSE P548


Worries to Keep You up at Night

  • In 2016

    • 200,000 RISC-1 processors will fit on a die.

    • It will take 36 cycles to cross the die.

    • Still a lack of ILP.

    • Memory latency is still a problem.

    • For reasonable yields, only 1 transistor in 24 billion may be broken (if one flaw breaks a chip).


WaveScalar’s Solution: Utilize Die Capability

  • A sea of simple, RISC-like processors

    • in-order, single-issue

    • takes advantage of billions of transistors without exacerbating the other problems

      • short design & implementation time

      • operates with a short cycle time

      • does not need lots of ILP

      • fewer defects


WaveScalar Processing Element

[Figure: WaveScalar processing element]


WaveScalar’s Solution: Short Wires

  • Dataflow execution model

    • each processor executes when its operands have arrived

    • same principle as out-of-order execution, but applied across the whole processor & including instruction fetch

      • no single program counter

    • short wires:

      • no long control lines

      • no centralized hardware data structures

      • no need for sequential & individual instruction fetches
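
The firing rule above can be sketched in a few lines of Python. This is a minimal illustration, not the WaveScalar hardware: the graph for (a + b) * (a - b), the node names, and the `run` and `demo` helpers are all assumptions made for the example. Each node fires as soon as its operand buffer is full, and there is no program counter.

```python
from collections import deque
import operator

# Hypothetical dataflow graph for (a + b) * (a - b).
# Each node: (operation, number of inputs, list of (consumer, input slot)).
GRAPH = {
    "add": (operator.add, 2, [("mul", 0)]),
    "sub": (operator.sub, 2, [("mul", 1)]),
    "mul": (operator.mul, 2, []),
}

def run(graph, initial_tokens):
    """Fire each node once all of its operands have arrived."""
    operands = {name: {} for name in graph}   # per-node operand buffers
    results = {}
    fired = []                                # order in which nodes fired
    queue = deque(initial_tokens)             # tokens in flight: (node, slot, value)
    while queue:
        node, slot, value = queue.popleft()
        operands[node][slot] = value
        op, arity, consumers = graph[node]
        if len(operands[node]) == arity:      # the dataflow firing rule
            args = [operands[node][i] for i in range(arity)]
            result = op(*args)
            results[node] = result
            fired.append(node)
            operands[node] = {}
            for consumer, consumer_slot in consumers:
                # Results go straight to the consumers, not to a register file.
                queue.append((consumer, consumer_slot, result))
    return results, fired

def demo():
    a, b = 7, 3
    # Operands arrive in an arbitrary order; execution order follows the data.
    tokens = [("sub", 1, b), ("add", 0, a), ("sub", 0, a), ("add", 1, b)]
    return run(GRAPH, tokens)
```

In `demo()`, the subtraction fires before the addition only because its operands happen to arrive first. Nothing in the program text fixes that order.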


WaveScalar’s Solution: Short Wires

  • Dataflow execution model, cont’d.

    • differs from original dataflow computers

      • distributed tag management (matching between renamed producer-consumer registers)

        • special WaveScalar instructions assign a number to all operands in a wave (think iteration or trace) & coordinate wave execution

        • all instructions in a “wave” execute on data with the same wave number
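
One way to picture wave-number matching is a per-instruction table keyed by wave number. The sketch below is an assumption-laden illustration: `MatchingTable`, `wave_advance`, and `demo` are hypothetical names, and the slides do not describe the real matching hardware at this level. Operands pair up only when they carry the same wave number, so values from different iterations never mix.

```python
from collections import defaultdict

class MatchingTable:
    """Operand store for one instruction, keyed by wave number (illustrative)."""

    def __init__(self, arity):
        self.arity = arity
        self.pending = defaultdict(dict)   # wave number -> {slot: value}

    def arrive(self, wave, slot, value):
        """Record an operand; return the operand list once a wave is complete."""
        row = self.pending[wave]
        row[slot] = value
        if len(row) == self.arity:
            del self.pending[wave]
            return [row[i] for i in range(self.arity)]
        return None

def wave_advance(token):
    """Hypothetical model of moving a value into the next wave."""
    wave, value = token
    return (wave + 1, value)

def demo():
    add = MatchingTable(arity=2)
    fired = []
    # Operands from two waves (think loop iterations 0 and 1) arrive interleaved.
    arrivals = [(0, 0, 10), (1, 0, 20), (1, 1, 2), (0, 1, 1)]
    for wave, slot, value in arrivals:
        args = add.arrive(wave, slot, value)
        if args is not None:
            fired.append((wave, sum(args)))
    return fired
```

Here wave 1 completes before wave 0, yet each sum uses only operands from its own wave.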


WaveScalar’s Solution: Short Wires

  • Dataflow execution model, cont’d.

    • differs from original dataflow computers

      • explicit wave-ordered memory

        • the compiler assigns a sequence number to each memory operation in breadth-first order

        • the sequence numbers of an operation, its predecessor & its successor are all sent with the produced data

        • wave & sequence numbers provide a total order on memory operations through any traversal of a wave

          + normal memory semantics

          + no need for special dataflow languages; C & C++ programs execute just fine
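
The ordering scheme can be sketched as a small buffer that issues a memory operation only when it is known to follow the last one issued. Several details here are assumptions for illustration, not taken from the slides: the `WILDCARD` marker for a neighbor that is unknown at compile time (for example, across a branch), the use of 0 to mark the start of a wave, and the `WaveOrderedBuffer` class itself.

```python
WILDCARD = None  # assumption: marks a predecessor/successor unknown at compile time

class WaveOrderedBuffer:
    """Issues the memory operations of one wave in sequence order (illustrative)."""

    def __init__(self):
        self.waiting = {}            # seq -> (pred, succ)
        self.last_seq = 0            # assumption: 0 marks the start of the wave
        self.last_succ = WILDCARD
        self.issued = []

    def _can_issue(self, pred, seq):
        # Either this operation names the last issued one as its predecessor,
        # or the last issued one named this operation as its successor.
        by_pred = pred is not WILDCARD and pred == self.last_seq
        by_succ = self.last_succ is not WILDCARD and self.last_succ == seq
        return by_pred or by_succ

    def arrive(self, pred, seq, succ):
        """Accept an operation in any order; issue everything that is now ready."""
        self.waiting[seq] = (pred, succ)
        progress = True
        while progress:
            progress = False
            for s in sorted(self.waiting):
                p, n = self.waiting[s]
                if self._can_issue(p, s):
                    del self.waiting[s]
                    self.issued.append(s)
                    self.last_seq, self.last_succ = s, n
                    progress = True
                    break

def demo():
    # A diamond: op 1, then a branch to op 2 or op 3, then op 4 at the join.
    # Execution takes the path 1 -> 3 -> 4; operations arrive out of order.
    buf = WaveOrderedBuffer()
    buf.arrive(WILDCARD, 4, WILDCARD)   # join: predecessor is 2 or 3
    buf.arrive(1, 3, 4)
    buf.arrive(0, 1, WILDCARD)          # successor depends on the branch
    return buf.issued
```

Operation 2 sits on the untaken path and never arrives, but the predecessor and successor links still let the buffer chain 1, 3 and 4 together without waiting for it.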


WaveScalar’s Solution: Short Wires

  • Nearest-neighbor communication

    • code placement to locate consumers near their producers

    • short, fast node-to-node links rather than slow broadcast networks

      • exploits dataflow locality: the probability that a value is produced for one particular consumer instruction, & therefore one particular register (register renaming can destroy this)

    • instructions can dynamically migrate toward their neighbors during execution
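
A rough way to reason about placement is to count hops between producers and consumers on the grid. The sketch below assumes a 2D grid with Manhattan-distance links, a made-up four-instruction chain, and a hypothetical `migrate_once` step that moves one instruction to a free neighboring cell when that lowers the total hop count. None of these details come from the slides.

```python
# Hypothetical producer -> consumer edges of a small dataflow chain.
EDGES = [("load", "add"), ("add", "mul"), ("mul", "store")]

# Two candidate placements on a 4x4 grid: instruction -> (x, y) cell.
SCATTERED = {"load": (0, 0), "add": (3, 0), "mul": (0, 3), "store": (3, 3)}
PACKED = {"load": (0, 0), "add": (1, 0), "mul": (1, 1), "store": (0, 1)}

def manhattan(p, q):
    """Hop count between two grid cells with nearest-neighbor links."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def communication_cost(edges, placement):
    """Total hops summed over all producer -> consumer edges."""
    return sum(manhattan(placement[s], placement[d]) for s, d in edges)

def migrate_once(edges, placement, name, width, height):
    """Move `name` to the free adjacent cell that lowers total cost most, if any."""
    occupied = set(placement.values())
    x, y = placement[name]
    best_cell = (x, y)
    best_cost = communication_cost(edges, placement)
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        cell = (x + dx, y + dy)
        in_grid = 0 <= cell[0] < width and 0 <= cell[1] < height
        if not in_grid or cell in occupied:
            continue
        cost = communication_cost(edges, {**placement, name: cell})
        if cost < best_cost:
            best_cell, best_cost = cell, cost
    return {**placement, name: best_cell}, best_cost
```

With these numbers the scattered placement costs 12 hops and the packed one costs 3. A single migration step for `add` moves it one cell toward `load` and brings the scattered cost down to 10.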


Dynamic Optimization

[Figure: a branch feeds a common-case path and a rare-case path, which meet at a join]

  • The common case has higher costs, and the branch can detect this…

  • …and fix it, by moving. The join can do the same.


WaveScalar’s Solution: Short Wires

[Figure: PE domain]

[Figure: cluster]


WaveScalar’s Solution: Creative Use of Untapped Parallelism

  • Expand the window for exploiting ILP

    • no in-order fetch using only one PC (sucking through a straw)

    • place instructions with the processing elements

    • out-of-order execution on a grand scale

  • Allow multiple threads to execute concurrently

    • OS & applications

    • multiple applications, parallel threads


WaveScalar’s Solution: The I-Cache is the Processor

  • Model is processor-in-memory (PIM)

    • processing element associated with each instruction

  • WaveScalar version

    • processing elements placed in the I-cache to reduce latency


WaveScalar’s Solution: Design to Compensate for Circuit Unreliability

  • Route around processors with flaws

  • Fewer design & implementation errors, thanks to the grid's simple, uniform design

  • decentralized control

  • dynamic instruction migration
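
Routing around flawed processors can be illustrated with a breadth-first search over the grid that skips defective cells. This is only a sketch under stated assumptions: the slides do not say how defects are detected or how routes are computed, and the `route` function, the grid coordinates, and the `defective` set are hypothetical.

```python
from collections import deque

def route(width, height, src, dst, defective):
    """Shortest hop-by-hop path from src to dst that avoids defective cells.

    Returns the list of cells on the path, or None if no path exists.
    """
    if src in defective or dst in defective:
        return None
    parent = {src: None}                 # also serves as the visited set
    queue = deque([src])
    while queue:
        cell = queue.popleft()
        if cell == dst:
            path = []
            while cell is not None:      # walk back to the source
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            in_grid = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if in_grid and nxt not in defective and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None
```

On a 3x3 grid, `route(3, 3, (0, 0), (2, 0), {(1, 0)})` detours through the second row and takes four hops instead of two. If a whole column is defective, the function returns `None`.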


Research Agenda: Architecture

  • WaveScalar ISA

  • Microarchitecture design

    • node design

    • domain size

    • cache-coherence across clusters

    • cluster arrangement

  • Control & memory speculation

  • WaveScalar instruction management

    • hardware for instruction placement & replacement

    • hardware for dynamic, self-optimizing placement


Research Agenda: Architecture

  • Multithreaded WaveScalar

  • Design of the network & routing issues

  • Power management

  • Static & dynamic fault detection & recovery (rerouting instructions)

  • System-level design

  • Application to non-silicon designs


Research Agenda: Compilers

  • Instruction placement

  • Revisit classic optimizations

    • code savings vs. communication costs

    • cache pollution vs. loop parallelism

  • New opportunities for optimization

    • a match between compiler & execution models

    • WaveScalar-specific instructions


Research Agenda: OS & Networking

  • Tension between facilitating short routines & poor instruction locality

  • The software side of thread management

  • A bunch of stuff I don’t know about

    • optimizing the OS interface

    • new thread protection policies

    • memory management issues

    • security

    • lazy context switching

    • utilizing virtual machines


Putting It All Together

  • Grid of hundreds (maybe thousands) of simple, data-flow processing nodes

    • no centralized control; scalable

    • few design errors; increase in yield

  • Processing nodes embedded in the I-cache

  • Instructions execute in place

  • Send results directly to the consumers

    • short, point-to-point links

  • Instructions can dynamically migrate

    • reduce latency to hot consumers

    • map around defects

  • 3X performance without any prediction mechanisms

    • more with them
