1 / 14

Performance Potential of an Easy-to-Program PRAM-On-Chip Prototype Versus State-of-the-Art Processor

Performance Potential of an Easy-to-Program PRAM-On-Chip Prototype Versus State-of-the-Art Processor. George C. Caragea – University of Maryland A. Beliz Saybasili – LCB Branch, NHLBL, NIH Xingzhi Wen – NVIDIA Corporation Uzi Vishkin – University of Maryland.

hollie
Download Presentation

Performance Potential of an Easy-to-Program PRAM-On-Chip Prototype Versus State-of-the-Art Processor

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Performance Potential of an Easy-to-Program PRAM-On-Chip Prototype Versus State-of-the-Art Processor George C. Caragea – University of Maryland A. Beliz Saybasili – LCB Branch, NHLBL, NIH Xingzhi Wen – NVIDIA Corporation Uzi Vishkin – University of Maryland www.umiacs.umd.edu/users/vishkin/XMT

  2. Hardware prototypes of PRAM-On-Chip 64-core, 75MHz FPGA prototype [SPAA’07, Computing Frontiers’08] Original explicit multi-threaded (XMT)architecture [SPAA98] (Cray started to use “XMT” ~7 years later) Objective of current paperMeaningful comparison of • Our FPGA design, with • State-of-the-Art (Intel) Processor Interconnection Network for 128-core. 9mmX5mm, IBM90nm process. 400 MHz prototype [HotInterconnects’07] Same design as 64-core FPGA. 10mmX10mm, IBM90nm process. 150 MHz prototype The design scales to 1000+ cores on-chip

  3. XMT: A PRAM-On-Chip Vision • Manycores are coming. But 40yrs of parallel computing: • Never a successful general-purpose parallel computer (easy to program, good speedups, up&down scalable). IF you could program it  great speedups. XMT: Fix the IF • XMT: Designed from the ground up to address that for on-chip parallelism • Unlike matching current HW (Some other SPAA papers) • Tested HW & SW prototypes • Builds on PRAM algorithmics. Only really successful parallel algorithmic theory. Latent, though not widespread, knowledgebase • This paper:~10X relative to Intel Core 2 Duo If there is time: Really serious about ease of programming

  4. Objective for programmer’s model What could I do in parallel at each step assuming unlimited hardware  . . # ops Serial Paradigm Natural (Parallel) Paradigm . . # ops . . .. .. .. .. time time Time = Work Work = total #ops Time << Work • Emerging: not sure, but the analysis should be work-depth. But, why not design for your analysis? (like serial) • XMT: Design for work-depth. Unique among manycores. - 1 operation now. Any #ops next time unit. - Competitive on nesting. (To be published.) - No need to program for locality.

  5. Programmer’s Model: Engineering Workflow • Arbitrary CRCW Work-depth algorithm. Reason about correctness & complexity in synchronous model • SPMD reduced synchrony • Threads advance at own speed, not lockstep • Main construct: spawn-join block. Note: can start any number of processes at once • Prefix-sum (ps). Independence of order semantics (IOS). • Establish correctness & complexity by relating to WD analyses. • Circumvents “The problem with threads”, e.g., [Lee]. • Tune (compiler or expert programmer): (i) Length of sequence of round trips to memory, (ii) QRQW, (iii) WD. [VCL07] • Trial&error contrast: similar startwhileinsufficient inter-thread bandwidth do{rethink algorithm to take better advantage of cache} spawn join spawn join

  6. XMT Architecture Overview • One serial core – master thread control unit (MTCU) • Parallel cores (TCUS) grouped in clusters • Global memory space evenly partitioned in cache banks using hashing • No local caches at TCU • Avoids expensive cache coherence hardware … MTCU Hardware Scheduler/Prefix-Sum Unit Cluster 1 Cluster 2 Cluster C … Parallel Interconnection Network … Shared Memory (L1 Cache) Memory Bank 1 Memory Bank 2 Memory Bank M DRAM Channel 1 DRAM Channel D

  7. Paraleap: XMT PRAM-on-chip silicon • Built FPGA prototype • Announced in SPAA’07 • Built using 3 FPGA chips • 2 Virtex-4 LX200 • 1 Virtex-4 FX100 With no prior design experience, X. Wen completed synthesizable Verilog description AND the new FPGA-based XMT computer in slightly more than two years. X. Wen is one person.. basic simplicity of the XMT architecture simple  faster time to market, lower implementation cost.

  8. Benchmarks • Sparse Matrix – Vector Multiplication (SpMV) • Matrix stored in Compact Sparse Row (CSR) format • Serial version: iterate through rows • Parallel version: one thread per row • 1-D FFT • Fixed-point arithmetic implementation • Serial version: Radix-2 Cooley-Tukey Algorithm • Parallel version: Parallelized each stage of serial algorithm • Quicksort • Serial version: standard textbook implementation • Parallel version: two phases • Phase 1: For large sub-arrays, parallelize partitioning operation using atomic prefix-sum • Phase 2: Process all partitions in parallel using serial partitioning algorithm

  9. Experimental Platforms For meaningful comparison: compare cycle count

  10. Input Datasets • Large dataset represents realistic input sizes • Recommended by Intel engineer for comparison • Gives Intel Core 2 advantage because of larger cache • Small dataset • Fits in both Paraleap and Intel Core 2 cache • Provides most fair comparison for current XMT generation

  11. Clock-Cycle Speedup • Computed as: speedup = #ClockCycles for Core 2 / #ClockCylces for Paraleap • Paraleap outperforms Intel Core 2 on all benchmarks • Lower speed-ups for Large dataset because of smaller cache size • Will not be an issue for future implementations of XMT • Silicon area of 64-TCU XMT roughly the same as one core of Intel Core 2 Duo • No reason for clock frequency of XMT to fall behind

  12. Conclusion • XMT provides viable answer to biggest challenges for the field • Ease of programming • Scalability (up&down) • Preliminary evaluation shows good result of XMT architecture versus state-of-the art Intel Core 2 platform • ICPP’08 paper compares with GPUs.

  13. Software release Allows to use your own computer for programming on an XMT environment and experimenting with it, including: Cycle-accurate simulator of the XMT machine Compiler from XMTC to that machine Also provided, extensive material for teaching or self-studying parallelism, including Tutorial + manual for XMTC (150 pages) Classnotes on parallel algorithms (100 pages) Video recording of 9/15/07 HS tutorial (300 minutes) Video recording of grad Parallel Algorithms lectures (30+hours) www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html Next Major Objective Industry-grade chip. Requires 10X in funding.

  14. Ease of Programming • Benchmark: can any CS major program your manycore? - cannot really avoid it. Teachability demonstrated so far: - To freshman class with 11 non-CS students. Some prog. assignments: merge-sort, integer-sort & samples-sort. Other teachers: - Magnet HS teacher. Downloaded simulator, assignments, class notes, from XMT page. Self-taught. Recommends: Teach XMT first. Easiest to set up (simulator), program, analyze: ability to anticipate performance (as in serial). Can do not just for embarrassingly parallel. Teaches also OpenMP, MPI, CUDA. Lookup keynote at CS4HS’09@CMU + interview with teacher. - High school & Middle School (some 10 year olds) students from underrepresented groups by HS Math teacher.

More Related