
Experimental Evaluation



Presentation Transcript


General-Purpose vs. GPU: Comparison of Many-Cores on Irregular Benchmarks
George C. Caragea, Fuat Keceli, Alexandros Tzannes, Uzi Vishkin

XMT: An Easy-to-Program Many-Core

XMT: Motivation and Background
• Many-cores are coming, but 40 years of parallel computing have never produced a successful general-purpose parallel computer (easy to program, good speedups, scalable up and down).
• IF you could program it → great speedups. XMT: fix the IF.
• XMT is designed from the ground up to address this for on-chip parallelism; tested HW & SW prototypes.
• Builds on PRAM algorithmics, the only really successful parallel algorithmic theory. A latent, though not widespread, knowledge base.

XMT Programming Model
• At each step, provide all instructions that can execute concurrently (not dependent on each other).
• PRAM/XMT abstraction: all such instructions execute immediately ("uniform cost").
• PRAM-like programming, using reduced synchrony.
• Main construct: the spawn-join block, which can start any number of virtual threads at once.
• Virtual threads advance at their own speed, not in lockstep.
• Prefix-sum (ps): similar to atomic fetch-and-add.

Ease of Programming
• A necessary condition for the success of a general-purpose platform; already in von Neumann's 1947 specs.
• Indications that XMT is easy to program:
  • XMT is based on a rich algorithmic theory (PRAM).
  • Ease of teaching as a benchmark: parallel programming successfully taught from middle school and high school up; evaluated by education experts (SIGCSE 2010); XMT rated superior to MPI, OpenMP and CUDA.
  • A programmer's workflow for deriving efficient programs from PRAM algorithms.
  • DARPA HPCS productivity study: XMT development time was half that of MPI.

Paraleap: XMT PRAM-on-Chip Silicon
• Built an FPGA prototype, announced at SPAA '07.
• Built using 3 FPGA chips: 2 Virtex-4 LX200 and 1 Virtex-4 FX100.

XMTC Programming Language
• C with simple SPMD extensions.
• spawn: start any number of virtual threads.
• $: unique thread ID.
• ps/psm: atomic prefix-sum, with an efficient hardware implementation.

XMTC Example: Array Compaction
• Non-zero elements of A are copied into B; order is not necessarily preserved.
• After atomically executing ps(inc, base): base = base + inc, and inc receives the original value of base.
• Elements are therefore copied into unique locations in B.

  int A[N], B[N];
  int base = 0;
  spawn(0, N-1) {
      int inc = 1;
      if (A[$] != 0) {
          ps(inc, base);
          B[inc] = A[$];
      }
  }

Tesla vs. XMT: Comparison of Architectures
(Side-by-side TESLA vs. XMT feature table; the table body was not preserved in the transcript.)

Tested Configurations: GTX280 vs. XMT-1024
• Configurations must have equivalent area constraints (576 mm² in 65 nm).
• Cannot simply set the number of functional units and the amount of memory to the same values.
• The area estimate for the envisioned XMT chip is based on the 64-TCU XMT ASIC prototype (designed in 90 nm IBM technology).
• The more area-intensive side is emphasized in each category.
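To make the ps semantics in the array-compaction example above concrete, here is a minimal plain-C sketch (an editorial illustration, not part of the original deck): it models ps(inc, base) with C11 atomic_fetch_add, which the slides name as the closest analogue. The function compact_one and the per-thread mapping are assumptions for illustration; on XMT the same body would run as virtual threads inside a spawn block.

  /* Editorial sketch, not from the slides: models XMTC's ps(inc, base)
     with C11 atomic_fetch_add. compact_one() stands in for the body of
     one virtual thread with $ == i; A, B, N as in the XMTC example. */
  #include <stdatomic.h>

  #define N 1024

  int A[N], B[N];
  atomic_int base = 0;            /* the shared "base" counter */

  void compact_one(int i)         /* body of virtual thread $ == i */
  {
      int inc = 1;
      if (A[i] != 0) {
          /* Atomic step: the old value of base is returned, and base
             grows by inc -- exactly the ps semantics stated above. */
          int old = atomic_fetch_add(&base, inc);
          B[old] = A[i];          /* unique slot; order not preserved */
      }
  }

Running compact_one for every i, in any order and with any degree of concurrency, yields the same compaction as the spawn block, since each thread claims a unique output index atomically.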
Experimental Evaluation

Benchmarks
(The benchmark list shown on this slide was not preserved in the transcript.)

Performance Comparison
• With the 1024-TCU XMT configuration: 6.05x average speedup on irregular applications; 2.07x average slowdown on regular applications.
• With the 512-TCU XMT configuration: 4.57x average speedup on irregular applications; 3.06x average slowdown on regular applications.
• Case study: BFS on a low-parallelism dataset:
  • 73.4x speedup over the Rodinia implementation.
  • 6.89x speedup over the UIUC implementation.
  • 110.6x speedup when using only 64 TCUs (lower latencies for the smaller design).

Conclusion and Future Work
• SPAA '09: 10x over an Intel Core Duo with the same silicon area.
• Current work:
  • XMT outperforms the GPU on all irregular workloads.
  • XMT does not fall behind significantly on regular workloads.
  • No need to pay a high performance penalty for ease of programming.
• A promising candidate for the pervasive platform of the future: a highly parallel general-purpose CPU coupled with a parallel GPU.
• Future work: power/energy comparison of XMT and GPU.

Experimental Platform
• XMTSim: the cycle-accurate XMT simulator.
• Timing is modeled after the 64-TCU FPGA prototype.
• Highly configurable to simulate any configuration.
• Modular design enables architectural exploration.
• Part of the XMT software release: http://www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html
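As a hedged illustration of why BFS, the case study above, is an "irregular" workload, the sketch below (editorial, not code from the paper) expands one BFS level over a graph in CSR form in plain C: the work per frontier vertex is data-dependent, and the next frontier is compacted with the same fetch-and-add pattern as the XMTC example. The function name, parameter names, and the CSR layout are all assumptions for illustration.

  /* Editorial sketch, not from the paper: one BFS level over a CSR graph.
     Work per frontier vertex is data-dependent (irregular), and the next
     frontier is built with atomic fetch-and-add, the same compaction
     pattern as ps in the XMTC example. */
  #include <stdatomic.h>

  void bfs_level(const int *row_ptr, const int *col_idx,   /* CSR graph */
                 const int *frontier, int frontier_size,
                 int *next_frontier, atomic_int *next_size,
                 atomic_int *visited)
  {
      /* On XMT this loop would be a spawn block, one virtual thread per
         frontier vertex; it is written as a sequential loop here. */
      for (int t = 0; t < frontier_size; t++) {
          int v = frontier[t];
          for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
              int u = col_idx[e];
              if (atomic_exchange(&visited[u], 1) == 0) {   /* claim u once */
                  int slot = atomic_fetch_add(next_size, 1);
                  next_frontier[slot] = u;
              }
          }
      }
  }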
