Comparison of Many-Core Architectures: XMT vs. GPU Performance on Irregular Benchmarks

General-Purpose vs. GPU: Comparison of Many-Cores on Irregular Benchmarks George C. Caragea FuatKeceli AlexandrosTzannes Uzi Vishkin XMT: An Easy-to-Program Many-Core XMT: Motivation and Background XMT Programming Model Ease of programming • Many-cores are coming. But 40yrs of parallel computing: • Never a successful general-purpose parallel computer (easy to program, good speedups, up & down scalable). • IF you could program it  great speedups. • XMT: Fix the IF • XMT: Designed from the ground up to address that for on-chip parallelism • Tested HW & SW prototypes • Builds on PRAM algorithmics. Only really successful parallel algorithmic theory. Latent, though not widespread, knowledgebase • At each step, provide all instructions that can execute concurrently (not dependent on each other) • PRAM/XMT abstraction: all such instructions execute immediately (“uniform cost”) • PRAM-like programming: using reduced synchrony • Main construct: spawn-join block. Can start any number of virtual-threads at once • Necessary condition for success of a general-purpose platform • In von Neumann’s 1947 specs • Indications that XMT is easy to program: • XMT is based on rich algorithmic theory (PRAM) • Ease-of-teaching as a benchmark: • Successfully taught parallel programming to middle-school, high-school and up • Evaluated by education experts (SIGCSE 2010) • XMT superior to MPI, OpenMP and CUDA • Programmer’s workflow for deriving efficient programs from PRAM algorithms • DARPA HPCS productivity study: XMT development time half of MPI • Virtual-Threads advance at own speed, not lockstep • Prefix-sum (ps): similar to atomic fetch-and-add Paraleap: XMT PRAM-on-chip silicon XMTC Programming Language Arrzz int A[N],B[N] int base=0; spawn(0,N-1) { int inc=1; if (A[$]!=0) { ps(inc,base); B[inc]=A[$]; } } • C with simple SPMD extensions • spawn: start any number of virtual threads • $: unique thread ID • ps/psm: atomic prefix sum. Efficient hardware implementation • XMTC Example: Array Compaction • Non-zero elements of A copied into B • Order is not necessarily preserved • After atomically executing ps(inc,base) • base = base + inc • inc gets original value of base • Elements copied into unique locations in B • Built FPGA prototype • Announced in SPAA’07 • Built using 3 FPGA chips • 2 Virtex-4 LX200, 1 Virtex-4 FX100 Tesla vs. XMT: Comparison of Architectures TESLA XMT Tested Configurations: GTX280 vs. XMT-1024 • Need configurations with equivalent area constraints (576 mm2 in 65nm) • Can not simply set the number of functional units and memory to the same values • Area estimation of the envisioned XMT chip is based on the 64 TCU XMT ASIC prototype (designed in 90nm IBM technology) • More area intensive side is emphasized in each category. Experimental Evaluation Benchmarks Performance Comparison • When using 1024-TCU XMT configuration: • 6.05x average speedup on irregular applications • 2.07x average slowdown on regular applications • When using 512-TCU XMT configuration • 4.57x average speedup on irregular • 3.06x average slowdown on regular • Case study: BFS on low parallelism dataset • Speedup of 73.4x over Rodinia implementation • Speedup of 6.89x over UIUC implementation • Speedup of 110.6x when using only 64 TCUs (lower latencies for the smaller design) Conclusion and Future Work • SPAA’09: 10X over Intel Core Duo with same silicon area • Current work: • XMT outperforms GPU on all irregular workloads • XMT does not fall behind significantly on regular workloads • No need to pay high performance penalty for ease-of-programming • Promising candidate for pervasive platform of the future: • Highly parallel general-purpose CPUcoupled with: • Parallel GPU • Future work: • Power/energy comparison of XMT and GPU Experimental Platform • XMTSim: The cycle-accurate XMT simulator • Timing modeled after the 64-TCU FPGA prototype • Highly configurable to simulate any configuration • Modular design, enables architectural exploration • Part of XMT Software Release: • http://www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html

Comparison of Many-Core Architectures: XMT vs. GPU Performance on Irregular Benchmarks

Comparison of Many-Core Architectures: XMT vs. GPU Performance on Irregular Benchmarks

Presentation Transcript

Experimental Design and Other Evaluation Methods

Monitoring, Evaluation and (Experimental) Impact Evaluation

Experimental evaluation in education

IAT 334 Experimental Evaluation

Regulation of Network Infrastructure Investments - An Experimental Evaluation

Evaluation and Methodology For Experimental Computer Science

Experimental

Motivation Recommender framework Experimental evaluation Conclusions

EXPERIMENTAL

EXPERIMENTAL

EXPERIMENTAL TECHNIQUES &amp; EVALUATION IN NLP

Experimental

An experimental evaluation of continuous testing during development

Experimental Evaluation in Computer Science: A Quantitative Study

Subjective Evaluation Of Delayed Risky Outcomes: An Experimental Approach

Experimental Evaluation of Voltage Unbalance

Experimental Evaluation in Computer Science: A Quantitative Study

Experimental Evaluation

Experimental Evaluation of Advanced Automated Geospatial Tools

CS 391L: Machine Learning: Experimental Evaluation

Experimental evaluation of Moderated EDCA over WMP

12. Experimental Evaluation 18-749: Fault-Tolerant Distributed Systems

Comparison of Many-Core Architectures: XMT vs. GPU Performance on Irregular Benchmarks