WaveScalar and the WaveCache

Steven Swanson

Ken Michelson

Mark Oskin

Tom Anderson

Susan Eggers

University of Washington

CSE P548


Worries to Keep You up at Night

  • In 2016

    • 200,000 RISC-1 processors will fit on a die.

    • It will take 36 cycles to cross the die.

    • Still a lack of ILP.

    • Memory latency is still a problem.

    • For reasonable yields, only 1 transistor in 24 billion may be broken (if one flaw breaks a chip).

WaveScalar’s Solution: Utilize Die Capability

  • A sea of simple, RISC-like processors

    • in-order, single-issue

    • takes advantage of billions of transistors without exacerbating the other problems

      • short design & implementation time

      • operates at a short cycle

      • does not need lots of ILP

      • fewer defects

WaveScalar’s Solution: Short Wires

  • Dataflow execution model

    • each processor executes when its operands have arrived

    • same principle as out-of-order execution but applies to the processor & includes fetching

      • no single program counter

    • short wires:

      • no long control lines

      • no centralized hardware data structures

      • no need for sequential & individual instruction fetches
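
The firing rule above can be illustrated with a minimal sketch. It is not the WaveScalar hardware: the `Node` class, the `run` loop, and the token format `(node, slot, value)` are hypothetical names chosen for illustration. The point is only that an instruction executes when its last operand arrives, with no program counter deciding the order.

```python
from collections import deque
import operator


class Node:
    """One dataflow instruction (hypothetical model, not the real ISA)."""

    def __init__(self, op, n_inputs, consumers):
        self.op = op
        self.n_inputs = n_inputs
        self.consumers = consumers  # list of (consumer name, input slot)
        self.operands = {}          # input slot -> value received so far

    def receive(self, slot, value):
        """Store one operand; return True once every input is present."""
        self.operands[slot] = value
        return len(self.operands) == self.n_inputs

    def fire(self):
        """Execute the instruction and clear its operand slots."""
        args = [self.operands[i] for i in range(self.n_inputs)]
        self.operands = {}
        return self.op(*args)


def run(nodes, initial_tokens):
    """Deliver tokens until none remain; arrival order drives execution."""
    tokens = deque(initial_tokens)  # each token is (node name, slot, value)
    outputs, fired = {}, []
    while tokens:
        name, slot, value = tokens.popleft()
        node = nodes[name]
        if node.receive(slot, value):
            result = node.fire()
            fired.append(name)
            if not node.consumers:
                outputs[name] = result
            for dest, dest_slot in node.consumers:
                tokens.append((dest, dest_slot, result))
    return outputs, fired


# Dataflow graph for (a + b) * (c - d) with a=2, b=3, c=5, d=1.
nodes = {
    "add": Node(operator.add, 2, [("mul", 0)]),
    "sub": Node(operator.sub, 2, [("mul", 1)]),
    "mul": Node(operator.mul, 2, []),
}
# Operands arrive in an arbitrary order; "sub" happens to complete first.
tokens = [("sub", 1, 1), ("add", 0, 2), ("sub", 0, 5), ("add", 1, 3)]
outputs, fired = run(nodes, tokens)
print(outputs, fired)
```

In this run `sub` fires before `add` simply because its operands were complete first; program order plays no role.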

WaveScalar’s Solution: Short Wires

  • Dataflow execution model, cont’d.

    • differs from original dataflow computers

      • distributed tag management (matching between renamed producer-consumer registers)

        • special WaveScalar instructions assign a number to all operands in a wave (think iteration or trace) & coordinate wave execution

        • all instructions in a “wave” execute on data with the same wave number
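
Wave-number matching can be sketched as a small table in front of each instruction. This is an illustrative model only: `MatchingTable` and its `arrive` method are hypothetical names, and wave numbers are plain integers supplied by the caller rather than assigned by special instructions. Operands from different waves (for example, different loop iterations) may interleave, but only operands carrying the same wave number are combined.

```python
from collections import defaultdict


class MatchingTable:
    """Holds partially arrived operands, keyed by wave number (the tag)."""

    def __init__(self, n_inputs):
        self.n_inputs = n_inputs
        self.pending = defaultdict(dict)  # wave number -> {slot: value}

    def arrive(self, wave, slot, value):
        """Record one operand; return all operands once the wave is complete."""
        row = self.pending[wave]
        row[slot] = value
        if len(row) < self.n_inputs:
            return None
        del self.pending[wave]
        return [row[i] for i in range(self.n_inputs)]


# A two-input add instruction receiving operands from waves 0 and 1, interleaved.
table = MatchingTable(n_inputs=2)
results = {}
arrivals = [(0, 0, 10), (1, 0, 20), (1, 1, 2), (0, 1, 1)]  # (wave, slot, value)
for wave, slot, value in arrivals:
    operands = table.arrive(wave, slot, value)
    if operands is not None:
        results[wave] = operands[0] + operands[1]
print(results)
```

Wave 1 completes before wave 0 here, yet each sum uses only operands from its own wave.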

WaveScalar’s Solution: Short Wires

  • Dataflow execution model

    • differs from original dataflow computers

      • explicit wave-ordered memory

        • compiler assigns a sequence number to each memory operation in a breadth-first manner

        • sequence number for an operation, its predecessor & successor all sent with produced data

        • wave & sequence numbers provide a total order on memory operations through any traversal of a wave

          + normal memory semantics

          + no need for special dataflow languages; C & C++ programs execute just fine
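
A minimal sketch of how the (predecessor, sequence number, successor) triple can recover program order for the memory operations of one wave. The names `MemOp`, `WaveOrderedMemory`, and `START` are hypothetical, as is the use of `None` for a predecessor or successor that is unknown at compile time because of a branch or join. The sketch also assumes every path through the wave contains at least one memory operation.

```python
from dataclasses import dataclass
from typing import Optional

START = 0  # assumed marker: pred == START means "first memory operation in the wave"


@dataclass
class MemOp:
    pred: Optional[int]  # predecessor's sequence number (None = unknown)
    seq: int             # this operation's sequence number
    succ: Optional[int]  # successor's sequence number (None = unknown)
    kind: str            # "load" or "store"
    addr: int
    value: int = 0


class WaveOrderedMemory:
    """Issues the memory operations of one wave in program order."""

    def __init__(self):
        self.memory = {}
        self.waiting = {}  # seq -> MemOp that has arrived but cannot issue yet
        self.last = None   # most recently issued MemOp
        self.issued = []   # sequence numbers in issue order
        self.loaded = {}   # seq -> value returned by each load

    def _follows_last(self, op):
        if self.last is None:
            return op.pred == START
        # Either side of the link is enough to prove adjacency.
        return op.pred == self.last.seq or self.last.succ == op.seq

    def _issue(self, op):
        if op.kind == "store":
            self.memory[op.addr] = op.value
        else:
            self.loaded[op.seq] = self.memory.get(op.addr, 0)
        self.last = op
        self.issued.append(op.seq)

    def submit(self, op):
        """Accept an operation in any order; issue everything that is now ordered."""
        self.waiting[op.seq] = op
        progress = True
        while progress:
            progress = False
            for cand in list(self.waiting.values()):
                if self._follows_last(cand):
                    self._issue(self.waiting.pop(cand.seq))
                    progress = True
                    break


# Op 1 precedes a branch (successor unknown), op 2 lies on the taken path,
# op 4 follows the join (predecessor unknown). Op 3 on the other path never runs.
mem = WaveOrderedMemory()
mem.submit(MemOp(pred=None, seq=4, succ=None, kind="load", addr=100))
mem.submit(MemOp(pred=1, seq=2, succ=4, kind="store", addr=100, value=9))
mem.submit(MemOp(pred=START, seq=1, succ=None, kind="store", addr=100, value=7))
print(mem.issued, mem.loaded)
```

Although the load arrives first, it issues last and observes the value written by op 2, which is the ordering a sequential program would expect.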

WaveScalar’s Solution: Short Wires

  • Nearest-neighbor communication

    • code placement to locate consumers near their producers

    • short, fast node-to-node links rather than slow broadcast networks

      • exploits dataflow locality: the probability that a producer's value goes to a particular consumer instruction, & therefore a particular register (register renaming can destroy this)

    • instructions can dynamically migrate toward their neighbors during execution

Dynamic Optimization

[Figure: dataflow graph with nodes labeled Branch, Common Case, Rare Case, and Join]

  • The common case has higher costs, and the branch can detect this…

Dynamic Optimization

[Figure: dataflow graph with nodes labeled Branch, Common Case, Rare Case, and Join]

  • …and fix it, by moving. The join can do the same.
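
The migration idea can be sketched as a greedy, one-step move on a grid. Everything concrete here is an assumption made for illustration: the slides do not say how cost is measured or how a move is chosen, so the sketch uses Manhattan distance as link latency, made-up traffic counts, and hypothetical helper names (`comm_cost`, `migrate_once`).

```python
def distance(a, b):
    """Manhattan distance between two (row, col) grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def comm_cost(placement, traffic):
    """Total cost: messages sent on each producer-consumer edge times distance."""
    return sum(count * distance(placement[src], placement[dst])
               for (src, dst), count in traffic.items())


def migrate_once(placement, traffic, name, grid):
    """Move one instruction to the adjacent free slot that lowers cost the most."""
    rows, cols = grid
    row, col = placement[name]
    occupied = set(placement.values())
    best_pos, best_cost = (row, col), comm_cost(placement, traffic)
    for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        pos = (row + d_row, col + d_col)
        in_grid = 0 <= pos[0] < rows and 0 <= pos[1] < cols
        if not in_grid or pos in occupied:
            continue
        cost = comm_cost({**placement, name: pos}, traffic)
        if cost < best_cost:
            best_pos, best_cost = pos, cost
    placement[name] = best_pos
    return best_cost


# The branch starts next to the rare case and two hops from the common case.
placement = {"branch": (0, 1), "rare": (0, 0), "common": (2, 1), "join": (2, 2)}
traffic = {("branch", "common"): 90, ("branch", "rare"): 10,
           ("common", "join"): 90, ("rare", "join"): 10}
cost_before = comm_cost(placement, traffic)
cost_after = migrate_once(placement, traffic, "branch", grid=(3, 3))
print(cost_before, cost_after, placement["branch"])
```

With these made-up counts the branch moves one slot toward the common case, trading a longer path to the rare case for a shorter path on the edge that carries most of the traffic.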

WaveScalar’s Solution: Short Wires

[Figure: PE Domain]

WaveScalar’s Solution: Short Wires

[Figure: Cluster]

WaveScalar’s Solution: Creative Use of Untapped Parallelism

  • Expand the window for exploiting ILP

    • no in-order fetch using only one PC (sucking through a straw)

    • place instructions with the processing elements

    • out-of-order execution on a grand scale

  • Allow multiple threads to execute concurrently

    • OS & applications

    • multiple applications, parallel threads

WaveScalar’s Solution: The I-Cache is the Processor

  • Model is processor-in-memory (PIM)

    • processing element associated with each instruction

  • WaveScalar version

    • processing elements placed in the I-cache to reduce latency

WaveScalar’s Solution: Design to Compensate for Circuit Unreliability

  • Fewer design & implementation errors from the grid of simple, uniform design

  • decentralized control

  • dynamic instruction migration

Research Agenda: Architecture

  • WaveScalar ISA

  • Microarchitecture design

    • node design

    • domain size

    • cache-coherence across clusters

    • cluster arrangement

  • Control & memory speculation

  • WaveScalar instruction management

    • hardware for instruction placement & replacement

    • hardware for dynamic, self-optimizing placement

Research Agenda: Architecture

  • Multithreaded WaveScalar

  • Design of the network & routing issues

  • Power management

  • Static & dynamic fault detection & recovery (rerouting instructions)

  • System-level design

  • Application to non-silicon designs

Research Agenda: Compilers

  • Instruction placement

  • Revisit classic optimizations

    • code savings vs. communication costs

    • cache pollution vs. loop parallelism

  • New opportunities for optimization

    • a match between compiler & execution models

    • WaveScalar-specific instructions

Research Agenda: OS & Networking

  • Tension between facilitating short routines & poor instruction locality

  • The software side of thread management

  • A bunch of stuff I don’t know about

    • optimizing the OS interface

    • new thread protection policies

    • memory management issues

    • security

    • lazy context switching

    • utilizing virtual machines

Putting It All Together

  • Grid of hundreds (maybe thousands) of simple dataflow processing nodes

    • no centralized control; scalable

    • few design errors; increase in yield

  • Processing nodes embedded in the I-cache

  • Instructions execute in place

  • Send results directly to the consumers

    • short, point-to-point links

  • Instructions can dynamically migrate

    • reduce latency to hot consumers

    • map around defects

  • 3X performance without any prediction mechanisms

    • more with them
