SIMD Lane Decoupling: Improved Timing-Error Resilience

Evgeni Krimer (UT Austin), Patrick Chiang (Oregon State), Mattan Erez (UT Austin).


Presentation Transcript


  1. SIMD Lane Decoupling: Improved Timing-Error Resilience. Evgeni Krimer (UT Austin), Patrick Chiang (Oregon State), Mattan Erez (UT Austin)

  2. SIMD Lane Decoupling (C) M. Erez, E. Krimer All systems power/energy bound • The good: • Transistor still following Moore’s Law • The bad: • Transistor power efficiency improving too slowly • Larger fraction of power to non-compute resources • The conclusion: • Better algorithms • More efficient architectures • Proportionality: waste less of what you have • This paper: SIMD + timing speculation • Efficient architecture + proportional guardbands

  3. Outline • Setup: efficient architecture + proportional margining • Proportional margining w/ timing speculation • Timing speculation with SIMD • Problem and DPSP solution • Methodology and modeling • Evaluation

  4. Voltage/timing margins “waste” energy • Illustrative only, not to scale • [Figure: one clock cycle ("Today") split into typical logic delay, maximum logic delay, and guard-bands for process variation, noise, wearout, temperature, …]

  5. Voltage/timing margins “waste” energy • Illustrative only, not to scale • [Figure: the same one-cycle breakdown drawn for "Today" and for "Future": typical logic delay, maximum logic delay, and guard-bands for process variation, noise, wearout, temperature, …]

  6. Timing speculation to the rescue [Ernst04] • Razor latches • Speculate low delay • Detect violations • Early/late mismatch • Recover by stalling • Requires fast “global” signal • Alternative – flush • Requires extra ~10% logic • Path delay restrictions: Δ < t < Δ + cycle
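The path-delay restriction on this slide can be read as a window check. The sketch below is one interpretation, not the authors' circuit: it assumes a Razor-style scheme in which the main latch samples at the end of the cycle and a shadow latch samples Δ later. The function name and the four outcome labels are hypothetical.

```python
def classify_path(t, cycle, delta):
    """Classify a path delay t against the window delta < t < delta + cycle.

    Assumed model: the main latch samples at `cycle`, the shadow
    latch samples at `cycle + delta`. All arguments share one time unit.
    """
    if t <= delta:
        # Too fast: next-cycle data could reach the shadow latch early.
        return "short-path violation"
    if t <= cycle:
        # Arrives before the main latch samples: no error.
        return "on time"
    if t < cycle + delta:
        # Main latch missed it, shadow latch caught it: mismatch is detected.
        return "late, detected"
    # Both latches missed it: speculation cannot recover.
    return "late, undetected"


for delay in (0.05, 0.7, 1.05, 1.3):
    print(delay, classify_path(delay, cycle=1.0, delta=0.1))
```

Under this reading, only delays inside the window are safe to speculate on; the two outer cases are what the restriction rules out at design time.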

  7. Outline • Setup: SIMD architecture + proportional margining • Proportional margining w/ timing speculation • Timing speculation with SIMD • Problem and DPSP solution • Methodology and modeling • Evaluation

  8. SIMD leads to inefficient timing speculation

  9. SIMD leads to inefficient timing speculation
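A back-of-the-envelope way to see why lockstep SIMD amplifies timing errors: if any lane errs, every lane stalls. The sketch below assumes independent errors with the same per-lane, per-instruction probability `p` and a fixed stall penalty; these are simplifying assumptions for illustration, not the model used in the talk.

```python
def any_lane_error(p, lanes):
    """Probability that at least one of `lanes` independent lanes errs."""
    return 1.0 - (1.0 - p) ** lanes


def lockstep_slowdown(p, lanes, penalty=1):
    """Expected cycles per instruction when one error stalls all lanes."""
    return 1.0 + penalty * any_lane_error(p, lanes)


def single_lane_slowdown(p, penalty=1):
    """Expected cycles per instruction for a lane that only pays for its own errors."""
    return 1.0 + penalty * p


for width in (1, 8, 32):
    print(width, round(any_lane_error(0.01, width), 3))
```

With a 1% per-lane error rate, a 32-wide group stalls on roughly 27% of instructions, which is why the error rate that is tolerable for a scalar pipeline is not tolerable for wide SIMD.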

  10. Decoupled Parallel SIMD Pipeline (DPSP) • Shallow FIFO for control (or between stages)

  11. Decoupled Parallel SIMD Pipeline (DPSP) • Decoupling mitigates SIMD impact
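The effect of a shallow control FIFO can be sketched with a toy cycle count. This is a simplified model under stated assumptions, not the DPSP hardware: each instruction takes one cycle per lane plus `penalty` cycles on an error, the control unit issues at most one instruction per cycle, and it may run at most `depth` instructions ahead of the slowest lane. All function names are hypothetical.

```python
import random


def make_errors(n_instr, lanes, p, seed=0):
    """errors[i][l] is True if lane l has a timing error on instruction i."""
    rng = random.Random(seed)
    return [[rng.random() < p for _ in range(lanes)] for _ in range(n_instr)]


def lockstep_cycles(errors, penalty=1):
    """Conventional SIMD: an error in any lane stalls the whole group."""
    return sum(1 + penalty * any(row) for row in errors)


def decoupled_cycles(errors, depth, penalty=1):
    """Lanes stall independently behind per-lane FIFOs of `depth` entries."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if not errors:
        return 0
    lanes = len(errors[0])
    done = []          # done[i][l]: cycle at which lane l finishes instruction i
    prev_issue = -1
    for i, row in enumerate(errors):
        issue = prev_issue + 1
        if i >= depth:
            # FIFO full until every lane has finished instruction i - depth.
            issue = max(issue, max(done[i - depth]))
        prev = done[i - 1] if i else [0] * lanes
        done.append([max(issue, prev[l]) + 1 + penalty * row[l]
                     for l in range(lanes)])
        prev_issue = issue
    return max(done[-1])


errs = make_errors(n_instr=1000, lanes=32, p=0.01)
print(lockstep_cycles(errs), decoupled_cycles(errs, depth=4))
```

In this model a depth of 1 degenerates to lockstep behavior, and deeper FIFOs let lanes absorb each other's stalls, so the decoupled count is never worse than the lockstep count.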

  12. DPSP challenge 1: inter-lane communication • Decoupling may delay producer (store) • Micro barriers • Enforce SIMD semantics • Not a problem in practice with GPUs • Execution model requires explicit sync across CTAs / work-groups

  13. DPSP challenge 2: memory access locality • Loads and stores no longer aligned • Memory “divergence” • May increase pressure on on-chip memory access • May impact off-chip access • Old NVIDIA hardware had memory coalescing issues • No problem with coalescing buffers and caches • Micro-barriers if problematic • Can be done implicitly or explicitly in hardware • Sync before every load • Prediction
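The cost of micro-barriers can be illustrated with a small model: lanes run independently, and at each barrier every lane waits for the slowest one. The sketch assumes FIFOs deep enough never to fill and a fixed per-error penalty; `cycles_with_barriers` and `barrier_after` are hypothetical names, not part of the original design.

```python
def cycles_with_barriers(errors, barrier_after, penalty=1):
    """Total cycles when lanes resynchronize after selected instructions.

    errors[i][l]   True if lane l errs on instruction i
    barrier_after  set of instruction indices followed by a micro-barrier
    """
    if not errors:
        return 0
    lanes = len(errors[0])
    t = [0] * lanes                      # per-lane elapsed cycles
    for i, row in enumerate(errors):
        t = [t[l] + 1 + penalty * row[l] for l in range(lanes)]
        if i in barrier_after:
            t = [max(t)] * lanes         # everyone waits for the slowest lane
    return max(t)


errs = [[True, False], [False, True], [False, False]]
print(cycles_with_barriers(errs, barrier_after=set()))        # fully decoupled
print(cycles_with_barriers(errs, barrier_after={0, 1, 2}))    # sync every step
```

A barrier after every instruction reproduces lockstep timing, so in this model the "sync before every load" option only gives up the slack accumulated between memory operations.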

  14. Outline • Setup: efficient architecture + proportional margining • Proportional margining w/ timing speculation • Timing speculation with SIMD • Problem and DPSP solution • Methodology and modeling • Evaluation

  15. Evaluation flow

  16. Measuring error rate • Inherently circuit and implementation dependent • Used 3 exemplary circuits • SPICE-simulated adder [Ernst04] • FPGA-modeled multiplier [Ernst04] • Multiplier fabricated in our IBM 45nm SOI test chip [Pawlowski12] (Pawlowski, ISSCC’12)

  17. Modeling the error rate function • 2-parameter model • [Plots: Adder [Ernst04], Mul. [Ernst04]]
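The transcript does not preserve the functional form of the 2-parameter model. The sketch below assumes an exponential dependence on Vdd, which matches the summary slide's "error rate exponential with Vdd" but is otherwise a guess; the parameters `v0` and `k` and their example values are hypothetical, not fitted to the data cited here.

```python
import math


def error_rate(vdd, v0, k):
    """Assumed form: p = exp(-k * (vdd - v0)), capped at 1.

    v0: supply voltage at or below which every operation fails (assumed)
    k : how sharply the error rate falls as vdd rises above v0 (assumed)
    """
    return min(1.0, math.exp(-k * (vdd - v0)))


for v in (0.80, 0.85, 0.90, 1.00):
    print(v, error_rate(v, v0=0.80, k=40.0))
```

A form like this has two knobs, a voltage offset and a slope, which can be fit separately to each circuit since the error response is circuit and implementation dependent.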

  18. ET² energy-efficiency metric • Energy × (execution) Time² • In circuit context: time = delay → ED² • Isolates architecture efficiency • Independent of DVFS • Shows improvements in addition to DVFS
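One way to see why ET² is insensitive to DVFS is the usual first-order scaling argument, sketched here under idealized assumptions that are not stated on the slide: energy scales with V² and delay scales with 1/V, ignoring leakage and threshold effects. The helper names are hypothetical.

```python
import math


def et2(energy, time):
    """Energy x Time^2 metric."""
    return energy * time ** 2


def scale_voltage(energy, time, s):
    """Idealized DVFS step by factor s: E scales as s^2, delay as 1/s (assumed)."""
    return energy * s ** 2, time / s


base = (2.0, 3.0)                       # arbitrary energy and time units
scaled = scale_voltage(*base, s=0.8)    # lower the voltage by 20%
print(et2(*base), et2(*scaled))
print(math.isclose(et2(*base), et2(*scaled)))
```

Because the s² and 1/s² factors cancel, any ET² change measured between two designs reflects the architecture rather than the chosen operating point.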

  19. Simple ET² model • Throughput (1/T): • Relative energy: dynamic and static components

  20. GP-GPU simulation adds some realism • Baseline uses ideal margins without speculation • Only max delay vs. typical delay left on table • Timing speculation overhead is 0–15% ET² • GPGPUSim (version 2.1) • Cycle-based extendable GP-GPU simulator from UBC • Developer-recommended parameters • Extended to DPSP • Recovery through stall • Micro-barrier options • Explicit CTA/workgroup synchronization only (no mbarriers) • Implicit sync before every memory operation • Power model based on Hong & Kim, ISCA’10

  21. Outline • Setup: efficient architecture + proportional margining • Proportional margining w/ timing speculation • Timing speculation with SIMD • Problem and DPSP solution • Methodology and modeling • Evaluation • Design-space exploration • Architecture effects

  22. ET² vs. SIMD (no spec.) • DPSP • [Plots: Adder [Ernst04], Mul. [Ernst04]; relative ET², lower elevation is better]

  23. DPSP vs. SIMD (w/ spec.) • SIMD – DPSP • [Plots: Adder [Ernst04], Mul. [Ernst04]; ET² difference, higher elevation is better]

  24. Bringing in architecture effects • [Plots: Fabricated MUL, Adder]

  25. Summary • Design margins inefficiency • Naive timing speculation with SIMD is inefficient • DPSP enables efficient speculation in SIMD • Microbarriers maintain semantics when necessary • With GPU, frequent mbarriers help memory access • Simple models can capture error response • Error rate exponential with Vdd • Dependent on circuit and implementation • Design-space exploration shows potential • When and why timing speculation should (not) be used • DPSP consistently improves ET² (10–45%) • DPSP achieves 10–20% better ET² than SIMD w/ spec.

  26. Backup

  27. Detailed ET² vs. Vdd behavior • [Plots: NN, AES, BFS, MUM]

  28. Frequent micro-barriers improve ET² • [Plots: Adder, Multiplier, Fab.]

  29. Modeling the error rate function • [Plots: Adder [Ernst04], Mul. [Ernst04]]

  30. Proportional margining • Static margin control • Binning • Vdd/frequency/biasing adjustment • Dynamic margin control • Vdd/frequency/biasing for slowly varying effects • Temperature and aging • Clocking tricks • From GALS to dynamic and elastic clocking • [Figure: timeline of typical logic delay, maximum logic delay, and guard-bands for process variation, noise, wearout, clock skew and jitter, and other effects]

  31. Detailed results summary • BFS • High divergence rate • Requires implicit synchronizations • Limits DPSP opportunities • CP, DG, RAY • Sensitive to memory coalescing • Synchronization between memory operations solves it • MUM • Low SIMD occupancy limits the benefit of decoupling • WP • Not enough registers, lots of memory spills • Extremely sensitive to memory latency and the exact scheduling, which DPSP disturbs
