General-Purpose vs. GPU: Comparison of Many-Cores on Irregular Benchmarks. George C. Caragea. Fuat Keceli. Alexandros Tzannes. Uzi Vishkin. XMT: An Easy-to-Program Many-Core. XMT: Motivation and Background. XMT Programming Model. Ease of programming.

Presentation Transcript

General-Purpose vs. GPU: Comparison of Many-Cores on Irregular Benchmarks

George C. Caragea

Fuat Keceli

Alexandros Tzannes

Uzi Vishkin

XMT: An Easy-to-Program Many-Core

XMT: Motivation and Background

  • Many-cores are coming. But after 40 years of parallel computing:
  • Never a successful general-purpose parallel computer (easy to program, good speedups, up & down scalable).
  • IF you could program it → great speedups.
  • XMT: Fix the IF
  • XMT: Designed from the ground up to address that for on-chip parallelism
  • Tested HW & SW prototypes
  • Builds on PRAM algorithmics, the only really successful parallel algorithmic theory. Latent, though not widespread, knowledge base

XMT Programming Model

  • At each step, provide all instructions that can execute concurrently (not dependent on each other)
  • PRAM/XMT abstraction: all such instructions execute immediately (“uniform cost”)
  • PRAM-like programming: using reduced synchrony
  • Main construct: spawn-join block. Can start any number of virtual threads at once
  • Virtual threads advance at their own speed, not in lockstep
  • Prefix-sum (ps): similar to atomic fetch-and-add

Ease of programming

  • Necessary condition for success of a general-purpose platform
  • In von Neumann’s 1947 specs
  • Indications that XMT is easy to program:
  • XMT is based on rich algorithmic theory (PRAM)
  • Ease-of-teaching as a benchmark:
    • Successfully taught parallel programming to middle-school, high-school and up
    • Evaluated by education experts (SIGCSE 2010)
    • XMT superior to MPI, OpenMP and CUDA
  • Programmer’s workflow for deriving efficient programs from PRAM algorithms
  • DARPA HPCS productivity study: XMT development time half of MPI
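The spawn-join block and the prefix-sum (ps) primitive listed under the XMT Programming Model can be emulated in ordinary software to see what they guarantee. The following is a minimal Python sketch, not XMT code: `PsBase`, `spawn` and `claim_tickets` are hypothetical names, and a lock stands in for the hardware prefix-sum unit. Each thread runs at its own pace, yet every call to `ps` returns a distinct old value of the base.

```python
import threading


class PsBase:
    """Hypothetical stand-in for a prefix-sum base variable."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()  # emulates atomicity of the hardware ps

    def ps(self, inc):
        """Atomically add inc to the base and return the base's old value."""
        with self._lock:
            old = self.value
            self.value += inc
            return old


def spawn(low, high, body):
    """Start one thread per ID in [low, high], then wait for all of them."""
    threads = [threading.Thread(target=body, args=(tid,))
               for tid in range(low, high + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # the implicit join that closes a spawn block


def claim_tickets(n):
    """Each of n threads claims one ticket with ps; returns (tickets, base)."""
    base = PsBase(0)
    tickets = [None] * n

    def body(tid):
        tickets[tid] = base.ps(1)  # old value of base, unique per thread

    spawn(0, n - 1, body)
    return tickets, base.value
```

Which thread receives which ticket depends on scheduling, so only the set of tickets is deterministic, not their assignment.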

Paraleap: XMT PRAM-on-chip silicon

  • Built FPGA prototype
  • Announced in SPAA’07
  • Built using 3 FPGA chips
    • 2 Virtex-4 LX200, 1 Virtex-4 FX100

XMTC Programming Language

  • C with simple SPMD extensions
  • spawn: start any number of virtual threads
  • $: unique thread ID
  • ps/psm: atomic prefix sum. Efficient hardware implementation
  • XMTC Example: Array Compaction
  • Non-zero elements of A copied into B
  • Order is not necessarily preserved
  • After atomically executing ps(inc,base)
    • base = base + inc
    • inc gets original value of base
    • Elements copied into unique locations in B

int A[N],B[N];
int base=0;
spawn(0,N-1) {
    int inc=1;
    if (A[$]!=0) {
        ps(inc,base);
        B[inc]=A[$];
    }
}
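The array compaction example can be traced in plain Python to check the claims in the bullets: non-zero elements land in unique slots of B, and their order may differ from A. This is an illustrative emulation under stated assumptions, not XMTC: `compact` is a hypothetical name, the `tid` argument plays the role of `$`, and a lock replaces the hardware prefix-sum.

```python
import threading


def compact(a):
    """Copy the non-zero elements of a into b; output order is not guaranteed."""
    n = len(a)
    b = [0] * n
    base = [0]                # shared counter, like `base` in the XMTC example
    lock = threading.Lock()   # software stand-in for the atomic ps

    def ps(inc):
        # Atomically: base = base + inc; return the original value of base.
        with lock:
            old = base[0]
            base[0] += inc
            return old

    def body(tid):            # tid corresponds to `$`
        if a[tid] != 0:
            slot = ps(1)      # unique location in b
            b[slot] = a[tid]

    threads = [threading.Thread(target=body, args=(tid,)) for tid in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b[:base[0]]        # base now holds the number of non-zero elements
```

Because the slot each thread receives depends on scheduling, callers that need a stable order would have to sort the result or use a different algorithm.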

Tesla vs. XMT: Comparison of Architectures



Tested Configurations: GTX280 vs. XMT-1024

  • Need configurations with equivalent area constraints (576 mm² in 65nm)
  • Cannot simply set the number of functional units and memory to the same values
  • Area estimation of the envisioned XMT chip is based on the 64-TCU XMT ASIC prototype (designed in 90nm IBM technology)
  • The more area-intensive side is emphasized in each category.
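To give a feel for what projecting a 90nm prototype onto a 65nm area budget involves, here is a back-of-the-envelope sketch. It is not the estimation method used in the presentation, which is not described here. It assumes ideal scaling (area shrinks with the square of the feature size) and naive linear replication of TCUs; `scale_area` and `estimate_chip_area` are hypothetical helpers, and any input values are placeholders rather than measured data.

```python
def scale_area(area_mm2, from_nm, to_nm):
    """Ideal technology scaling: area is proportional to feature size squared."""
    return area_mm2 * (to_nm / from_nm) ** 2


def estimate_chip_area(proto_area_mm2, proto_tcus, target_tcus, from_nm, to_nm):
    """Naive estimate: replicate the prototype linearly, then scale the node.

    Assumption: area grows linearly with TCU count. Real interconnect and
    memory do not scale this simply, so treat the result as a rough guide.
    """
    replication = target_tcus / proto_tcus
    return scale_area(proto_area_mm2 * replication, from_nm, to_nm)
```

An estimate produced this way could then be compared against the 576 mm² budget to decide whether a given configuration is plausible.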

Experimental Evaluation


Performance Comparison

  • When using 1024-TCU XMT configuration:
    • 6.05x average speedup on irregular applications
    • 2.07x average slowdown on regular applications
  • When using 512-TCU XMT configuration:
    • 4.57x average speedup on irregular applications
    • 3.06x average slowdown on regular applications
  • Case study: BFS on low parallelism dataset
    • Speedup of 73.4x over Rodinia implementation
    • Speedup of 6.89x over UIUC implementation
    • Speedup of 110.6x when using only 64 TCUs (lower latencies for the smaller design)
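The BFS case study is easier to interpret with a picture of how much work each BFS level exposes. The sketch below is a sequential, level-synchronous BFS in Python that records the frontier size per level. It is neither the Rodinia nor the UIUC implementation, and it assumes that "low parallelism" refers to datasets whose frontiers stay small. `bfs_levels`, `make_chain` and `make_star` are hypothetical names.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS; returns (distances, frontier size per level)."""
    dist = {source: 0}
    frontier = [source]
    sizes = []
    while frontier:
        sizes.append(len(frontier))   # work available to run concurrently
        next_frontier = []
        for u in frontier:            # a many-core would process these in parallel
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier
    return dist, sizes


def make_chain(n):
    """Path graph 0 -> 1 -> ... -> n-1: one vertex per level."""
    return {i: ([i + 1] if i + 1 < n else []) for i in range(n)}


def make_star(n):
    """Vertex 0 points to all others: almost all vertices in one level."""
    adj = {i: [] for i in range(n)}
    adj[0] = list(range(1, n))
    return adj
```

On the chain every level holds a single vertex, so there is almost nothing to run concurrently and per-level overheads dominate. On the star nearly all vertices sit in one level.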

Conclusion and Future Work

  • SPAA’09: 10X over Intel Core Duo with same silicon area
  • Current work:
    • XMT outperforms GPU on all irregular workloads
    • XMT does not fall behind significantly on regular workloads
    • No need to pay high performance penalty for ease-of-programming
  • Promising candidate for pervasive platform of the future:
    • Highly parallel general-purpose CPU coupled with:
    • Parallel GPU
  • Future work:
    • Power/energy comparison of XMT and GPU

Experimental Platform

  • XMTSim: The cycle-accurate XMT simulator
  • Timing modeled after the 64-TCU FPGA prototype
  • Highly configurable to simulate any configuration
  • Modular design, enables architectural exploration
  • Part of XMT Software Release: