Is SC + ILP = RC?

C. Gniady, B. Falsafi, and T.N. Vijaykumar (Purdue). Presented by: Eric Carty-Fickes.


Presentation Transcript


  1. Is SC + ILP = RC?
     C. Gniady, B. Falsafi, and T.N. Vijaykumar (Purdue)
     Presented by: Eric Carty-Fickes

  2. Introduction
     • SC
        • produces memory order with hardware
        • easier to program
        • worse performance due to conservatism
     • RC
        • produces memory order with software
        • harder to program
        • better performance due to explicitness
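     The programmability gap between SC and RC can be illustrated with the classic store-buffering litmus test. The sketch below is an illustration only, not taken from the slides or the paper: it enumerates every global order of four memory operations and collects the resulting register values. The names `OPS`, `run`, and `outcomes` are made up for this example, and "relaxed" here simply means program order is not enforced because no fence or synchronization is present.

```python
from itertools import permutations

# Store-buffering litmus test; x and y start at 0.
#   P0: x = 1; r0 = y        P1: y = 1; r1 = x
OPS = {
    "P0": [("store", "x", 1), ("load", "y", "r0")],
    "P1": [("store", "y", 1), ("load", "x", "r1")],
}


def run(schedule):
    """Execute one global order of (processor, operation) pairs on one memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for _proc, (kind, addr, arg) in schedule:
        if kind == "store":
            mem[addr] = arg
        else:
            regs[arg] = mem[addr]
    return (regs["r0"], regs["r1"])


def outcomes(keep_program_order):
    """Collect every (r0, r1) result over all global orders.

    keep_program_order=True models SC: each processor's operations appear
    in program order. False lets a load pass that processor's earlier store
    to a different address, as a relaxed model may when no fence intervenes.
    """
    events = [(proc, op) for proc, ops in OPS.items() for op in ops]
    seen = set()
    for order in permutations(events):
        if keep_program_order:
            in_order = all(
                order.index((proc, ops[0])) < order.index((proc, ops[1]))
                for proc, ops in OPS.items()
            )
            if not in_order:
                continue
        seen.add(run(order))
    return seen


sc_outcomes = outcomes(keep_program_order=True)
relaxed_outcomes = outcomes(keep_program_order=False)
```

     Under SC the result r0 = r1 = 0 never appears, so the programmer can reason about plain interleavings. Once program order is relaxed, that result becomes possible, and ruling it out is the job of explicit synchronization in the software.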

  3. Catching up to RC
     • SC limitation: no software guarantees
     • memory order is arbitrary, no devices such as fences
     • SC can allow loads and stores to bypass one another
     • processor state must be remembered, but rollbacks should be avoided because they are slow
     • superscalar rollbacks are faster
     • rollbacks caused by data races, false sharing, cache conflicts
     • encourage load/store speculation but make it transparent
     • check for reading or replacement of speculative blocks

  4. SC++
     • ILP allows more speculation in SC, invisible to the outside world due to in-order retirement
     • branch predictors, superscalar, non-blocking caches
     • may be able to perform up to the level of RC
     • allows stores to bypass as well as loads
     • allows out-of-order operations to hide latency
     • quickly recovers from mis-speculation
     • assumes applications designed for MPs/DSM

  5. SC++ Architecture
     • modelled after R10K
     • SHiQ allows for prefetching and non-blocking caches
     • other processors see SC
     • history buffer allows speculative retirement
     • unblocks RoB stores
     • load/store queue takes stores from RoB
     • BLT holds block addresses for SHiQ
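     The interplay of the history buffer and the BLT can be sketched as a toy software model. This is a loose illustration under strong assumptions, not the hardware described in the paper: the class `SpeculativeCore` and its methods are hypothetical, the "history" logs old memory values rather than processor state, and a conflict undoes all outstanding speculation rather than restoring state to a specific instruction.

```python
class SpeculativeCore:
    """Toy model: operations retire speculatively and can be undone."""

    def __init__(self, memory):
        self.memory = dict(memory)  # this core's view of block contents
        self.history = []           # (addr, old_value) per speculative store
        self.blt = set()            # block addresses touched speculatively

    def speculative_store(self, addr, value):
        # Log the old value first so the store can be undone later.
        self.history.append((addr, self.memory[addr]))
        self.blt.add(addr)
        self.memory[addr] = value

    def speculative_load(self, addr):
        # Loads change no state but must still be tracked for conflicts.
        self.blt.add(addr)
        return self.memory[addr]

    def commit(self):
        """Earlier operations have completed; speculation becomes permanent."""
        self.history.clear()
        self.blt.clear()

    def external_event(self, addr):
        """Model an invalidation or replacement of a block.

        Returns True if the block was speculative, which forces a rollback.
        """
        if addr not in self.blt:
            return False
        # Undo stores newest-first so the oldest logged value wins.
        while self.history:
            old_addr, old_value = self.history.pop()
            self.memory[old_addr] = old_value
        self.blt.clear()
        return True


core = SpeculativeCore({"A": 0, "B": 0})
core.speculative_store("A", 1)
core.speculative_store("A", 2)
core.speculative_load("B")
rolled_back = core.external_event("B")  # another node touches block B
after_rollback = dict(core.memory)

core.speculative_store("A", 5)
core.commit()
ignored = core.external_event("A")      # nothing speculative remains
after_commit = dict(core.memory)
```

     The set lookup stands in for the BLT: an incoming invalidation or replacement only needs to check whether its block address is currently speculative, so the common case with no conflict costs nothing beyond that check.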

  6. Simulations
     • using RSIM for 8-node DSM, 16k L1, 8M L2
     • all use non-blocking caches, prefetching, speculative loads
     • rollbacks = 1 cycle
     • SC++ rollbacks = 4 wide
     • SC blocks at stores
     • RC hides network latency with store overlaps
     • raytrace hurt by lock patterns, slow network

  7. More Simulations
     • RoB increase = more prefetch time
     • unstructured causes many rollbacks for SC
     • SC++o = no speculative stores
     • radix and raytrace = store-intensive, full load/store queue

  8. Another Simulation
     • L2 size reduced
     • less room for speculative state
     • lu sees many rollbacks caused by replacements

  9. Conclusions/Questions
     • SC++ performs nearly as well as RC with minor additional hardware
     • does this really matter? is it that much harder to program with RC?
     • does this add any significant risk of errors due to extra hardware and speculation?
     • do you buy their argument that applications causing rollback are not suited to DSM systems anyway?
