
Morphable Multithreaded Memory Tiles (M3T)


Presentation Transcript


  1. Morphable Multithreaded Memory Tiles (M3T)

     Josep Torrellas (University of Illinois at Urbana-Champaign)
     Ben Abbott (Southwest Research Institute)
     Ted Bapty (Vanderbilt University)
     Bob Bassett, David Ngo (BAE SYSTEMS)
     Hubertus Franke, Jose Moreira (IBM Research)

     [Figure labels: Performance, Parallelism, Overhead; Superscalar, SMT, CMP, TaskScalar; SpecInt, SpecFP, Scientific]

     Architecture

     M3T Architecture
     • 64 processor cores
     • 0.11 um, 309.3 mm2 ASIC: 800 MHz
     • Thermal power: 66 W
     • Multichip support
     • Morph into MIMD, VLIW, TaskScalar and Stream

     [Figure: tiles of CPU+L1 with TST and PTW over Banked L2, connected by a Sync Bus and an On-Chip Network to Off-Chip Memory]
     TST: Task State Table
     PTW: Pending Task Window

     Compiler Support

     Novel compiler algorithms to build tasks

     Compiler flow: Front End, High Level Transformations, Task Selection, Inter-Task Optimizations, Intra-Task Optimizations, Code Generation

     Novel Inter-Task Optimizations
     • Task vectorization
     • Task fusion
     • Task fission
     • Task partitioning
     • Task motion
     • Task telescoping
     • Task elimination

     [Figure labels: Task T1, Task T2, Task T3 (X, X+1, X+2); Task T1,3 (X, X+2)]

     TaskScalar Morph
     • Unit of execution: a task
     • A task can be committed or squashed in one shot
     • Task Ordering: no explicit order between … and … (the task labels are missing from the transcript)

     [Figure labels: Task X, Task Y, task ?]

     Software Productivity

     Debugging Data Races [ISCA03]
     • Detect unsynchronized communication
     • Incremental undo and re-execution
     • Re-execution is deterministic

     Reducing Parallel Programming Effort [ASPLOS02]
     • Do not fine tune synchronization
     • Code coarse critical section
     • Insert perhaps unnecessary barriers
     • Speculative synchronization: speculate past active barriers, locks, flags
     • Detect conflicts, roll back offending threads
     • Use caches to store speculative state
     • Maintain 1 or more safe tasks to guarantee forward progress
       • Lock: owner
       • Flag: producer
       • Barrier: lagging tasks

     Example critical section run on each CPU: lock(L); LD A; INC; ST A; unlock(L)

     [Figure: Speculative Lock, with labels Lock L (ACQUIRE), Wait F, Set F, Unlock L (RELEASE) and code regions A to E]
     [Figure: Speculative Barrier, with code regions A, B, C around BARRIER; Safe and Speculative tasks marked]
     [Figure: Sync Time Reduction; values shown: 6%, 17.7%]
     • Large reduction: 40%

     [Figure: Effectiveness; Code Speedup versus Man-Hours Invested Programming, with TaskScalar Hardware and with No TaskScalar Hardware]
     • TaskScalar attempts to
       • run section in parallel
       • speculate past synchronization
     • Result: appear as if we had invested more man-hours

     TaskScalar Morph Evaluation

     Applications:
     • Matrix: 20x20 element dense matrix multiply
     • Bubble: application with bursty task spawns that creates bubbles of activity
     • Pathological: like Matrix but all activity concentrated on one cluster

     Effect of Task Size
     • Speedups are high even for small-sized tasks

     Effect of Number of Processors
     • Scalability of the speedups significantly depends on the application

     Effect of Network Latency
     • Speedups are very tolerant to network latency

     Timeline of Tasks (Bubble)
     • Bursty spawn and execution of tasks

     Timeline of Tasks (Matrix)
     • Smooth execution of tasks

     Timeline of Tasks (Pathological)
     • High load imbalance; waste of resources
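The speculative synchronization points in the transcript (speculate past a lock, buffer speculative state, detect conflicts, roll back the offender, keep one safe task) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the M3T hardware mechanism: the names `SpeculativeTask` and `run_tasks` are hypothetical, a Python dict stands in for the cache that holds speculative state, tasks are simulated sequentially, and in-order commit with the oldest task treated as the safe one is a simplification chosen for the example.

```python
class SpeculativeTask:
    """One task with a private write buffer (stand-in for a speculative cache)."""

    def __init__(self, body):
        self.body = body      # callable(task, memory) containing the task's code
        self.reads = set()    # addresses loaded while speculating
        self.writes = {}      # buffered stores, invisible until commit

    def load(self, memory, addr):
        self.reads.add(addr)
        if addr in self.writes:          # see our own buffered store first
            return self.writes[addr]
        return memory.get(addr, 0)

    def store(self, addr, value):
        self.writes[addr] = value

    def execute(self, memory):
        # Squashing is just discarding the buffers; then the body runs again.
        self.reads.clear()
        self.writes.clear()
        self.body(self, memory)


def run_tasks(memory, bodies):
    """Speculate all tasks, then commit in order; return the number of squashes."""
    tasks = [SpeculativeTask(b) for b in bodies]

    # Phase 1: every task speculates against the same starting memory,
    # as if all had run past the lock at once.
    for task in tasks:
        task.execute(memory)

    # Phase 2: commit in task order. Task 0 is the safe task and never squashes.
    dirty = set()             # addresses written by already committed tasks
    squashes = 0
    for index, task in enumerate(tasks):
        if index > 0 and task.reads & dirty:
            squashes += 1                 # conflict: the task read stale data
            task.execute(memory)          # roll back and re-execute
        memory.update(task.writes)        # commit the whole task in one shot
        dirty.update(task.writes)
    return squashes


def increment_a(task, memory):
    # The critical section from the transcript: LD A; INC; ST A
    task.store("A", task.load(memory, "A") + 1)


def make_private_writer(addr):
    # A task that only touches its own address, so it never conflicts.
    def body(task, memory):
        task.store(addr, task.load(memory, addr) + 10)
    return body


shared = {"A": 0}
shared_squashes = run_tasks(shared, [increment_a] * 3)

private = {}
private_squashes = run_tasks(private, [make_private_writer("X"), make_private_writer("Y")])
```

In this model the three conflicting increments still leave `shared["A"]` at 3, at the cost of 2 squashes, while the two independent tasks commit with 0 squashes. That mirrors the claim on the slide: coarse synchronization costs little when conflicts are rare, and the safe task guarantees forward progress when they are not.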
