

  1. Data-Centric System Design CS 6501 Fundamental Concepts: Computing Models Samira Khan University of Virginia Sep 9, 2019 The content and concept of this course are adapted from CMU ECE 740

  2. AGENDA • Logistics • Review from last lecture • Fundamental concepts • Computing models

  3. Review Set 2 • Due Sep 11 • Choose 2 from a set of four • Dennis and Misunas, “A Preliminary Architecture for a Basic Data Flow Processor,” ISCA 1974. • Arvind and Nikhil, “Executing a Program on the MIT Tagged-Token Dataflow Architecture”, IEEE TC 1990. • H. T. Kung, “Why Systolic Architectures?,” IEEE Computer 1982. • Annaratone et al., “Warp Architecture and Implementation,” ISCA 1986.

  4. DATA FLOW CHARACTERISTICS • Data-driven execution of instruction-level graphical code • Nodes are operators • Arcs are data (I/O) • As opposed to control-driven execution • Only real dependencies constrain processing • No sequential I-stream • No program counter • Operations execute asynchronously • Execution triggered by the presence of data
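As a concrete illustration of the firing rule, here is a minimal token-driven interpreter in Python. It is an illustrative sketch only: the three-node graph and the run() helper are invented for this example, not taken from the lecture or the papers.

```python
import operator

# Each node: (operator, input arcs). Arcs "a", "b", "c" are program inputs.
NODES = {
    "t1":  (operator.mul, ["a", "b"]),    # t1 = a * b
    "t2":  (operator.add, ["a", "c"]),    # t2 = a + c
    "out": (operator.add, ["t1", "t2"]),  # out = t1 + t2
}

def run(inputs):
    tokens = dict(inputs)   # arcs currently carrying a data token
    fired = set()
    while len(fired) < len(NODES):
        for name, (op, srcs) in NODES.items():
            # Firing rule: a node executes as soon as all of its input
            # tokens are present. There is no program counter; only data
            # availability (real dependencies) orders the execution.
            if name not in fired and all(s in tokens for s in srcs):
                tokens[name] = op(*(tokens[s] for s in srcs))
                fired.add(name)
    return tokens["out"]

print(run({"a": 2, "b": 3, "c": 4}))  # (2*3) + (2+4) = 12
```

Note that t1 and t2 may fire in either order (in hardware, asynchronously in parallel), since neither depends on the other.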

  5. DATA FLOW ADVANTAGES/DISADVANTAGES • Advantages • Very good at exploiting irregular parallelism • Only real dependencies constrain processing • Disadvantages • Debugging difficult (no precise state) • Interrupt/exception handling is difficult (what is precise state semantics?) • Too much parallelism? (Parallelism control needed) • High bookkeeping overhead (tag matching, data storage) • Instruction cycle is inefficient (delay between dependent instructions), memory locality is not exploited

  6. DATA FLOW SUMMARY • Availability of data determines order of execution • A data flow node fires when its sources are ready • Programs represented as data flow graphs (of nodes) • Data Flow at the ISA level has not been (as) successful • Data Flow implementations under the hood (while preserving sequential ISA semantics) have been successful • Out of order execution • Hwu and Patt, “HPSm, a high performance restricted data flow architecture having minimal functionality,” ISCA 1986.

  7. OOO EXECUTION: RESTRICTED DATAFLOW • An out-of-order engine dynamically builds the dataflow graph of a piece of the program • which piece? • The dataflow graph is limited to the instruction window • Instruction window: all decoded but not yet retired instructions • Can we do it for the whole program? • Why would we like to? • In other words, how can we have a large instruction window?
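The toy scheduler below sketches this "dataflow restricted to a window" idea; the instruction format, the single-cycle execution latency, and the retire rule are simplifying assumptions, not a real core.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "dest srcs")

def ooo_issue(program, inputs, window=4):
    """Toy restricted-dataflow scheduler: returns {instr index: issue cycle}.
    Assumes every instruction executes in a single cycle."""
    issued, ready = {}, set(inputs)     # issue cycles; values produced so far
    retired, cycle = 0, 0
    while retired < len(program):
        # Only the oldest `window` un-retired instructions are visible.
        for i in range(retired, min(retired + window, len(program))):
            if i not in issued and all(s in ready for s in program[i].srcs):
                issued[i] = cycle       # issue as soon as operands exist
        for i, c in issued.items():
            if c == cycle:
                ready.add(program[i].dest)   # results visible next cycle
        while retired < len(program) and issued.get(retired, cycle) < cycle:
            retired += 1                # retire strictly in program order
        cycle += 1
    return issued

prog = [Instr("r1", ("a",)),          # r1 = load a
        Instr("r2", ("r1",)),         # depends on r1, must wait
        Instr("r3", ("b",)),          # independent: issues out of order
        Instr("r4", ("r2", "r3"))]
print(ooo_issue(prog, inputs={"a", "b"}))   # {0: 0, 2: 0, 1: 1, 3: 2}
```

Instruction 2 issues in cycle 0, ahead of instruction 1: dataflow order inside the window, sequential semantics outside it.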

  8. ANOTHER WAY OF EXPLOITING PARALLELISM • SIMD: • Concurrency arises from performing the same operations on different pieces of data • MIMD: • Concurrency arises from performing different operations on different pieces of data • Control/thread parallelism: execute different threads of control in parallel → multithreading, multiprocessing • Idea: Use multiple processors to solve a problem

  9. FLYNN’S TAXONOMY OF COMPUTERS • Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966 • SISD: Single instruction operates on single data element • SIMD: Single instruction operates on multiple data elements • Array processor • Vector processor • MISD: Multiple instructions operate on single data element • Closest form: systolic array processor, streaming processor • MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams) • Multiprocessor • Multithreaded processor
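A rough software analogy for the SISD/SIMD distinction, using NumPy purely as an illustration (real SIMD is a property of the hardware, not of a library):

```python
import numpy as np

a = np.arange(8)
b = np.full(8, 10)

# SISD flavor: one operation applied to one data element at a time.
sisd = [int(a[i] + b[i]) for i in range(len(a))]

# SIMD flavor: one (logical) instruction operates on all elements at once;
# NumPy dispatches the whole addition to vectorized code in a single call.
simd = a + b

assert list(simd) == sisd
```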

  10. SYSTOLIC ARRAYS

  11. WHY SYSTOLIC ARCHITECTURES? • Idea: Data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory • Similar to an assembly line of processing elements • Different people work on the same car • Many cars are assembled simultaneously • Why? Special-purpose accelerators/architectures need • Simple, regular design (keep # unique parts small and regular) • High concurrency → high performance • Balanced computation and I/O (memory) bandwidth

  12. SYSTOLIC ARRAYS • H. T. Kung, “Why Systolic Architectures?,” IEEE Computer 1982. • [Figure] The analogy: memory is the heart, the PEs are the cells; memory “pulses” data through the array of PEs

  13. SYSTOLIC ARCHITECTURES • Basic principle: Replace one PE with a regular array of PEs and carefully orchestrate flow of data between the PEs • Balance computation and memory bandwidth • Differences from pipelining: • These are individual PEs • Array structure can be non-linear and multi-dimensional • PE connections can be multidirectional (and different speed) • PEs can have local memory and execute kernels (rather than a piece of the instruction)

  14. SYSTOLIC COMPUTATION EXAMPLE • Convolution • Used in filtering, pattern matching, correlation, polynomial evaluation, etc. • Many image processing tasks

  15. SYSTOLIC ARCHITECTURE FOR CONVOLUTION

  16. [Animation frame] Weights w1, w2, w3 stay in the PEs; x1 enters: y1 = 0 → y1 = w1x1

  17. [Animation frame] x2 enters: y1 = w1x1 → y1 = w1x1 + w2x2

  18. [Animation frame] x3 enters: y1 = w1x1 + w2x2 → y1 = w1x1 + w2x2 + w3x3

  19. CONVOLUTION • y1 = w1x1 + w2x2 + w3x3 • y2 = w1x2 + w2x3 + w3x4 • y3 = w1x3 + w2x4 + w3x5
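One way to emulate this design cycle by cycle in Python, assuming Kung's "broadcast inputs, move results, weights stay" scheme, which matches the accumulation order shown on slides 16–18 (the function name and test values are invented for this sketch):

```python
def systolic_conv(w, x):
    """Emulate a 'broadcast x, weights stay, partial sums move' array.
    PE k holds weight w[k]; each y visits PE 0, 1, ..., K-1 on successive
    cycles, accumulating w1*x1, then w2*x2, and so on."""
    K = len(w)
    sums = [0] * K                 # partial sum sitting at each PE
    out = []
    for t, xt in enumerate(x):     # one broadcast x per cycle
        finished = sums[-1] + w[-1] * xt   # last PE completes a result
        if t >= K - 1:             # earlier cycles only drain garbage
            out.append(finished)
        for k in range(K - 1, 0, -1):      # partial sums shift right
            sums[k] = sums[k - 1] + w[k - 1] * xt
        sums[0] = 0                # a fresh (empty) y enters next cycle
    return out

# y1 = w1x1 + w2x2 + w3x3, y2 = w1x2 + w2x3 + w3x4, y3 = w1x3 + w2x4 + w3x5
print(systolic_conv([1, 2, 3], [1, 2, 3, 4, 5]))   # [14, 20, 26]
```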

  20. Convolution: Another Design

  21. [Animation frame] Weights w1, w2, w3 stay in the PEs; x1 enters the array

  22. [Animation frame] x2 enters

  23. [Animation frame] x3 enters; y1 enters from the opposite end, moving against the x stream

  24. [Animation frame] x4 enters; y2 follows; y1 = w3x3

  25. [Animation frame] x5 enters; y1 = w2x2 + w3x3, y2 = w3x4

  26. [Animation frame] x6 enters; y1 = w1x1 + w2x2 + w3x3 (complete), y2 = w2x3 + w3x4, y3 = w3x5

  27. [Animation frame] x7 enters; y2 = w1x2 + w2x3 + w3x4 (complete), y3 = w2x4 + w3x5, y4 = w3x6

  28. More Programmability • Each PE in a systolic array • Can store multiple “weights” • Weights can be selected on the fly • Eases implementation of, e.g., adaptive filtering • Taken further • Each PE can have its own data and instruction memory • Data memory → to store partial/temporary results, constants • Leads to stream processing, pipeline parallelism • More generally, staged execution
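A tiny sketch of such a programmable PE (the class and its interface are illustrative, not from any Warp-style design):

```python
class ProgrammablePE:
    """One systolic cell with local weight storage (illustrative only)."""
    def __init__(self, weights):
        self.weights = list(weights)     # multiple weights stored locally

    def step(self, x, y_in, sel):
        # One systolic beat: the `sel` tag picks a weight on the fly,
        # which is what makes adaptive filtering easy to implement.
        return y_in + self.weights[sel] * x

pe = ProgrammablePE([0.5, 2.0])
print(pe.step(x=3.0, y_in=1.0, sel=1))   # 1.0 + 2.0*3.0 = 7.0
```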

  29. SYSTOLIC ARRAYS: PROS AND CONS • Advantage: • Specialized (computation needs to fit the PE organization/functions) → improved efficiency, simple design, high concurrency/performance → can do more with less memory bandwidth • Downside: • Specialized → not generally applicable, because the computation needs to fit the PE functions/organization

  30. The WARP Computer • H. T. Kung, CMU, 1984–1988 • Linear array of 10 cells, each cell a 10 MFLOPS programmable processor • Attached to a general-purpose host machine • A high-level language (HLL) and optimizing compiler to program the systolic array • Used extensively to accelerate vision and robotics tasks • Annaratone et al., “Warp Architecture and Implementation,” ISCA 1986. • Annaratone et al., “The Warp Computer: Architecture, Implementation, and Performance,” IEEE TC 1987.

  31. The WARP Computer

  32. The WARP Computer

  33. SLIPSTREAM PROCESSORS • Goal: use multiple hardware contexts to speed up single thread execution (implicitly parallelize the program) • Idea: Divide program execution into two threads: • Advanced thread executes a reduced instruction stream, speculatively • Redundant thread uses results, prefetches, predictions generated by advanced thread and ensures correctness • Benefit: Execution time of the overall program reduces • Core idea is similar to many thread-level speculation approaches, except with a reduced instruction stream • Sundaramoorthy et al., “Slipstream Processors: Improving both Performance and Fault Tolerance,” ASPLOS 2000.

  34. SLIPSTREAMING • “At speeds in excess of 190 m.p.h., high air pressure forms at the front of a race car and a partial vacuum forms behind it. This creates drag and limits the car’s top speed. • A second car can position itself close behind the first (a process called slipstreaming or drafting). This fills the vacuum behind the lead car, reducing its drag. And the trailing car now has less wind resistance in front (and by some accounts, the vacuum behind the lead car actually helps pull the trailing car). • As a result, both cars speed up by several m.p.h.: the two combined go faster than either can alone.”

  35. SLIPSTREAM PROCESSORS • Detect and remove ineffectual instructions; run a shortened “effectual” version of the program (Advanced or A-stream) in one thread context • Ensure correctness by running a complete version of the program (Redundant or R-stream) in another thread context • Shortened A-stream runs fast; R-stream consumes near-perfect control and data flow outcomes from A-stream and finishes close behind • Two streams together lead to faster execution (by helping each other) than a single one alone

  36. SLIPSTREAM IDEA AND POSSIBLE HARDWARE

  37. INSTRUCTION REMOVAL IN SLIPSTREAM • IR detector • Monitors retired R-stream instructions • Detects ineffectual instructions and conveys them to the IR predictor • Ineffectual instruction examples: • dynamic instructions that repeatedly and predictably have no observable effect (e.g., unreferenced writes, non-modifying writes) • dynamic branches whose outcomes are consistently predicted correctly • IR predictor • Removes an instruction from the A-stream after repeated indications from the IR detector • The A-stream skips ineffectual instructions, executes everything else, and inserts the results into the delay buffer • The R-stream executes all instructions but uses results from the delay buffer as predictions
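The silent-store case of the IR detector can be sketched as follows; the trace format and the confirmation threshold are illustrative assumptions, not details from the Sundaramoorthy et al. paper:

```python
CONFIRMATIONS = 3   # threshold is illustrative, not from the paper

def ineffectual_store_pcs(retired_stores, memory):
    """Sketch of the IR-detector idea for one class of ineffectual
    instructions: non-modifying ('silent') stores in the R-stream.
    retired_stores: iterable of (pc, addr, value) in retirement order.
    Returns the PCs the IR predictor would remove from the A-stream."""
    streak = {}                      # pc -> consecutive silent sightings
    removable = set()
    for pc, addr, value in retired_stores:
        if memory.get(addr) == value:            # store changes nothing
            streak[pc] = streak.get(pc, 0) + 1
            if streak[pc] >= CONFIRMATIONS:      # repeated indication
                removable.add(pc)                # A-stream may skip it
        else:
            streak[pc] = 0                       # effectual: reset streak
            memory[addr] = value
    return removable

trace = [(0x40, 8, 7)] * 3 + [(0x44, 8, 9)]
print(ineffectual_store_pcs(trace, memory={8: 7}))   # {0x40}, printed as {64}
```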

  38. WHAT IF THE A-STREAM DEVIATES FROM CORRECT EXECUTION? • Why? • The A-stream deviates due to incorrect instruction removal or stale data accessed in the L1 data cache • How to detect it? • A branch or value misprediction happens in the R-stream (known as an IR misprediction) • How to recover? • Restore the A-stream register state: copy values from R-stream registers using the delay buffer or a shared-memory exception handler • Restore the A-stream memory state: invalidate the A-stream L1 data cache (or just the blocks speculatively written by the A-stream)

  39. Slipstream Questions • How to construct the advanced thread • Original proposal: • Dynamically eliminate redundant instructions (silent stores, dynamically dead instructions) • Dynamically eliminate easy-to-predict branches • Other ways: • Dynamically ignore long-latency stalls • Statically, based on profiling • How to speed up the redundant thread • Original proposal: reuse instruction results (control and data flow outcomes from the A-stream) • Other ways: only use branch results and prefetched data as predictions

  40. RUNAHEAD EXECUTION • A technique to obtain the memory-level parallelism benefits of a large instruction window • When the oldest instruction is a long-latency cache miss: • Checkpoint architectural state and enter runahead mode • In runahead mode: • Speculatively pre-execute instructions • The purpose of pre-execution is to generate prefetches • L2-miss dependent instructions are marked INV and dropped • Runahead mode ends when the original miss returns • Checkpoint is restored and normal execution resumes • Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” HPCA 2003.

  41. Runahead Example [timeline figure] • Perfect caches: Compute → Load 1 hit → Compute → Load 2 hit • Small window: Compute → Load 1 miss (stall for Miss 1) → Compute → Load 2 miss (stall for Miss 2) • Runahead: Compute → Load 1 miss → runahead during Miss 1 pre-executes Load 2, initiating Miss 2 → Compute → Load 2 hit → saved cycles
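The timeline above can be condensed into a toy model of runahead mode; the instruction format, the INV poisoning, and the is_l2_miss test are simplifications of the mechanism in Mutlu et al., invented for this sketch:

```python
INV = object()   # poison marker for L2-miss-dependent values

def runahead_prefetches(instrs, regs, is_l2_miss):
    """instrs: ('load', dest, addr_reg) or ('add', dest, src1, src2).
    Pre-executes the stream once in runahead mode and returns the
    addresses touched (the prefetches); architectural state is
    restored from the checkpoint at the end."""
    checkpoint = dict(regs)          # taken when runahead mode is entered
    prefetches = []
    for kind, dest, *srcs in instrs:
        if any(regs[s] is INV for s in srcs):
            regs[dest] = INV         # miss-dependent: mark INV and drop
        elif kind == "load":
            addr = regs[srcs[0]]
            prefetches.append(addr)  # the access itself is the prefetch
            regs[dest] = INV if is_l2_miss(addr) else 0  # value unknown on a miss
        else:                        # 'add'
            regs[dest] = regs[srcs[0]] + regs[srcs[1]]
    regs.clear(); regs.update(checkpoint)   # miss returns: restore, re-execute
    return prefetches

# Load 2's address does not depend on Load 1, so it gets prefetched:
regs = {"r1": 100, "r2": 200}
print(runahead_prefetches(
    [("load", "r3", "r1"), ("add", "r4", "r3", "r3"), ("load", "r5", "r2")],
    regs, is_l2_miss=lambda a: True))   # [100, 200]
```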

  42. BENEFITS OF RUNAHEAD EXECUTION Instead of stalling during an L2 cache miss: • Pre-executed loads and stores independent of L2-miss instructions generate very accurate data prefetches: • For both regular and irregular access patterns • Instructions on the predicted program path are prefetched into the instruction/trace cache and L2. • Hardware prefetcher and branch predictor tables are trained using future access information.

  43. RUNAHEAD EXECUTION • Advantages: + Very accurate prefetches for data/instructions (all cache levels) + Follows the program path + Simple to implement; most of the hardware is already built in + Uses the same thread context as the main thread, no waste of context + No need to construct a pre-execution thread • Disadvantages/Limitations: -- Extra executed instructions -- Limited by branch prediction accuracy -- Cannot prefetch dependent cache misses -- Effectiveness limited by available “memory-level parallelism” (MLP) -- Prefetch distance limited by memory latency • Implemented in IBM POWER6, Sun “Rock”

  44. Data-Centric System Design CS 6501 Fundamental Concepts: Computing Models Samira Khan University of Virginia Sep 9, 2019 The content and concept of this course are adapted from CMU ECE 740
