

  1. Presentation at the 4th PMEO-PDS Workshop: Benchmark Measurements of Current UPC Platforms. Zhang Zhang and Steve Seidel, Michigan Technological University. Denver, Colorado, 3/22/2005

  2. Presentation Outline • Background • Unified Parallel C, implementations and users. • Previous UPC performance studies. • Experiments • Available UPC platforms • Benchmarks • Performance measurements • Conclusions

  3. UPC Overview • UPC is an extension of C for partitioned shared memory parallel programming (a minimal example follows this slide). • A special case of the shared memory programming model. • Similar languages: Co-Array Fortran, Titanium. • UPC homepage: http://www.upc.gwu.edu • Platforms supported: • Cray X1, Cray T3E, SGI Origin, HP AlphaServer, HP-UX, Linux clusters, IBM SP. • UPC compilers: • Open source: MuPC, Berkeley UPC, Intrepid UPC • Commercial: HP UPC, Cray UPC • Users: • LBNL, IDA, AHPCRC, …
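
For context, the following is a minimal UPC sketch (not taken from the benchmark code) of the partitioned shared memory model: a shared array with the default cyclic layout, updated with upc_forall so that each thread writes only the elements it owns. The array size and names are illustrative.

    /* Illustrative UPC sketch: cyclic shared array updated by its owners. */
    #include <upc.h>
    #include <stdio.h>

    #define N 1024

    shared int a[N * THREADS];   /* default cyclic layout: a[i] has affinity to thread i % THREADS */

    int main(void) {
        int i;
        /* Each iteration executes on the thread that owns &a[i]. */
        upc_forall (i = 0; i < N * THREADS; i++; &a[i])
            a[i] = MYTHREAD;

        upc_barrier;
        if (MYTHREAD == 0)
            printf("initialized %d elements on %d threads\n", N * THREADS, THREADS);
        return 0;
    }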

  4. Related UPC Performance Studies • Performance benchmark suites • UPC_Bench (GWU) • Synthetic microbenchmark based on the STREAM benchmark. • Application benchmarks: Sobel edge detection, matrix multiplication, N-Queens problem • UPC NAS Parallel Benchmarks (GWU) • Performance monitoring • Performance analysis for HP UPC compiler (GWU) • Performance of Berkeley UPC on HP AlphaServer (Berkeley) • Performance of Intrepid UPC on SGI Origin (GWU)

  5. Benchmarking UPC Systems • Extended shared memory bandwidth microbenchmarks to cover various reference patterns: • Scalar references: 11 access patterns • Block memory operations: 9 access patterns • Benchmarked six combinations of available UPC compilers and platforms using both the UPC STREAM (MTU code) and the UPC NAS Parallel Benchmarks (GWU code). • Compilers: MuPC, HP UPC, Berkeley UPC and Intrepid UPC • Platforms: Myrinet Linux cluster, HP AlphaServer SC, and T3E • The first comparison of performance for currently available UPC implementations. • The first report on MuPC performance.
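
As a rough illustration of what one of the scalar reference patterns looks like, the loop below times stride-n shared reads over a cyclically distributed array and reports per-thread bandwidth. The array size, stride, timer, and function name are placeholders, not the parameters used in the MTU code.

    /* Hypothetical stride-n shared read timing loop, in the spirit of the
       extended STREAM microbenchmark; sizes and timer are placeholders. */
    #include <upc.h>
    #include <sys/time.h>

    #define N      (1 << 18)          /* elements per thread; placeholder size  */
    #define STRIDE 7                  /* placeholder stride                     */

    shared double src[N * THREADS];   /* default cyclic layout                  */
    double sink;                      /* defeats dead-code elimination          */

    double strided_read_mbps(void) {
        struct timeval t0, t1;
        double sum = 0.0, secs;
        long   i, nread = 0;

        upc_barrier;
        gettimeofday(&t0, NULL);
        for (i = MYTHREAD; i < N * THREADS; i += STRIDE) {
            sum += src[i];            /* shared read; mostly remote for a cyclic
                                         layout when STRIDE is not a multiple
                                         of THREADS */
            nread++;
        }
        gettimeofday(&t1, NULL);
        upc_barrier;

        sink = sum;
        secs = (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
        return nread * sizeof(double) / secs / 1e6;   /* MB/s per thread */
    }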

  6. Benchmarks • Synthetic benchmarks: • The STREAM microbenchmark was rewritten in UPC with a greater diversity of shared memory access patterns: • Local shared read / write • Unit stride shared read / write / copy • Random shared read / write / copy • Stride-n shared read / write / copy • Block transfers with variations of source and sink affinities (see the sketch after this slide). • NAS Parallel Benchmark Suite v2.4 • The UPC version was developed at GWU. • Five kernels: CG, EP, FT, IS and MG. • Two variations: naïve version and hand-tuned version. • Input size: Class A workload.
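
The block-transfer cases map onto the UPC bulk-copy library functions. A hedged sketch of the three affinity combinations (remote get, remote put, and shared-to-shared copy) is shown below; the buffer size and thread pairing are arbitrary choices for illustration.

    /* Illustrative block transfers; BUFSZ and the peer choice are placeholders. */
    #include <upc.h>

    #define BUFSZ (1 << 12)

    shared [BUFSZ] char shared_buf[BUFSZ * THREADS];  /* one block per thread */
    char private_buf[BUFSZ];

    void block_transfers(void) {
        int peer = (MYTHREAD + 1) % THREADS;

        /* remote shared -> local private (get) */
        upc_memget(private_buf, &shared_buf[peer * BUFSZ], BUFSZ);

        /* local private -> remote shared (put) */
        upc_memput(&shared_buf[peer * BUFSZ], private_buf, BUFSZ);

        /* shared -> shared copy; source and sink may have different affinities */
        upc_memcpy(&shared_buf[MYTHREAD * BUFSZ], &shared_buf[peer * BUFSZ], BUFSZ);
    }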

  7. Local Shared References • Intrepid UPC: performance is poor on local shared accesses. • HP UPC: cache state has significant effects on local shared accesses.
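
A common way to sidestep the local shared-access overhead observed here is to cast a shared pointer with local affinity to an ordinary C pointer, which is legal in UPC and avoids shared-pointer arithmetic on the fast path. The sketch below is illustrative; the names and sizes are not from the benchmarks.

    /* Standard UPC idiom: access locally owned shared data through a private
       pointer so it runs at ordinary C speed. */
    #include <upc.h>

    #define N 1024

    shared [N] double data[N * THREADS];   /* one contiguous block per thread */

    void scale_local(double alpha) {
        /* &data[MYTHREAD * N] has affinity to this thread, so the cast is legal. */
        double *local = (double *) &data[MYTHREAD * N];
        int i;
        for (i = 0; i < N; i++)
            local[i] *= alpha;
    }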

  8. Remote Shared References • HP UPC and MuPC: caches help unit stride remote shared accesses. • Intrepid UPC does the best for remote shared accesses.

  9. Block Memory Operations • HP UPC: performance is poor on certain string functions. • Intrepid UPC: low performance on all categories.

  10. NPB – CG • The only case that scales well: Berkeley UPC + optimized code.

  11. NPB – EP

  12. NPB – FT • HP, Berkeley and MuPC: performance is comparable.

  13. NPB – IS • HP, Berkeley and MuPC: performance is comparable.

  14. NPB – MG • MG performance is very inconsistent.

  15. Conclusions • STREAM benchmarking: • UPC language overhead reduces the performance of local shared references. • Remote reference caching helps stride-1 accesses. • Copying between two locations with the same affinity to a remote thread needs optimization. • NPB benchmarking: • Some implementations failed on some benchmarks; more stable and reliable implementations are needed. • Hand-tuning techniques (e.g. prefetching) are critical to performance (a sketch follows this slide). • Berkeley UPC is the best at handling unstructured, fine-grained references. • MuPC experience shows that it will be more rewarding to optimize remote shared references than to improve network interconnects.
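
As one example of the prefetching style of hand-tuning mentioned above, a thread can bulk-copy a remote block into a private buffer once and then compute on it locally, instead of issuing many fine-grained remote reads. The sketch below is illustrative only; names and sizes are assumptions.

    /* Illustrative prefetch: one bulk get instead of BLK individual remote reads. */
    #include <upc.h>

    #define BLK 4096

    shared [BLK] double grid[BLK * THREADS];
    double local_copy[BLK];

    double sum_neighbor_block(void) {
        int peer = (MYTHREAD + 1) % THREADS;
        double s = 0.0;
        int i;

        /* fetch the neighbor's whole block into private memory */
        upc_memget(local_copy, &grid[peer * BLK], BLK * sizeof(double));

        for (i = 0; i < BLK; i++)
            s += local_copy[i];
        return s;
    }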

  16. Thank you! For more information: http://www.upc.mtu.edu
