The Berkeley UPC Compiler: Implementation and Performance

Wei Chen, Dan Bonachea, Jason Duell,

Parry Husbands, Costin Iancu, Kathy Yelick

the LBNL/Berkeley UPC Group

http://upc.lbl.gov

Outline
  • An Overview of UPC
  • Design and Implementation of the Berkeley UPC Compiler
  • Preliminary Performance Results
  • Communication Optimizations
Unified Parallel C (UPC)
  • UPC is an explicitly parallel global address space language with SPMD parallelism
    • An extension of C
    • Shared memory is partitioned by threads
    • One-sided (bulk and fine-grained) communication through reads/writes of shared variables
    • A collective effort by industry, academia, and government
    • http://upc.gwu.edu

[Figure: the UPC memory model. A shared region, partitioned among the threads and holding X[0], X[1], …, X[P], forms a global address space; below it, each thread has its own private memory.]
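As a minimal sketch of the model in the figure (a shared array with one element per thread, plus a private variable per thread), each SPMD thread writes its own element and can then read any other thread's element with an ordinary assignment:

  #include <upc.h>
  #include <stdio.h>

  shared int X[THREADS];   /* one element of X has affinity to each thread   */
  int priv;                /* an ordinary C object: private to each thread   */

  int main(void) {
      X[MYTHREAD] = MYTHREAD;                /* write the locally owned element */
      upc_barrier;                           /* global synchronization          */
      priv = X[(MYTHREAD + 1) % THREADS];    /* one-sided read, possibly remote */
      printf("thread %d read %d\n", MYTHREAD, priv);
      return 0;
  }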

UPC Programming Model Features
  • Block-cyclically distributed arrays
  • Shared and private pointers
  • Global synchronization -- barriers
  • Pair-wise synchronization -- locks
  • Parallel loops
  • Dynamic shared memory allocation
  • Bulk shared memory accesses
  • Strict vs. relaxed memory consistency models
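A short sketch touching several of these features (block-cyclic distribution, dynamic shared allocation, a lock, a parallel loop, a barrier, and a bulk copy); the sizes are placeholders and a static thread count is assumed:

  #include <upc.h>

  #define N 1024
  shared [4] double a[N];        /* block-cyclically distributed, blocks of 4    */
  shared double sum;             /* shared scalar, affinity to thread 0          */
  upc_lock_t *lock;

  int main(void) {
      double first_block[4];
      int i;
      shared int *tmp = (shared int *) upc_all_alloc(THREADS, sizeof(int));
      lock = upc_all_lock_alloc();               /* pair-wise synchronization    */
      tmp[MYTHREAD] = MYTHREAD;                  /* dynamically allocated shared  */

      upc_forall(i = 0; i < N; i++; &a[i])       /* parallel loop                 */
          a[i] = i;
      upc_barrier;                               /* global synchronization        */

      upc_memget(first_block, a, 4 * sizeof(double)); /* bulk copy of block 0     */

      upc_lock(lock);                            /* lock protects the shared sum  */
      sum += first_block[0];
      upc_unlock(lock);
      upc_barrier;
      return 0;
  }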
Overview of the Berkeley UPC Compiler

Two goals: portability and high performance. The translator lowers UPC code into ANSI C code.

[Figure: the compiler stack. UPC Code -> Translator (platform-independent) -> Translator-Generated C Code (network-independent) -> Berkeley UPC Runtime System (compiler-independent; shared memory management and pointer operations) -> GASNet Communication System (language-independent; a uniform get/put interface over the underlying networks) -> Network Hardware.]

A Layered Design
  • Portable:
    • C is our intermediate language
    • Can run on top of MPI (with performance penalty)
    • GASNet has a layered design with a small core
  • High-Performance:
    • Native C compiler optimizes serial code
    • Translator can perform high-level communication optimizations
    • GASNet can access network hardware directly
Implementing the UPC to C Translator

  • Based on the Open64 compiler
  • Source-to-source transformation
  • Converts shared memory operations into runtime library calls
  • Designed to incorporate the existing optimization framework in Open64
  • Communicates with the runtime via a standard API

[Figure: translator pipeline. Preprocessed File -> C front end -> WHIRL with shared types -> Backend lowering -> WHIRL with runtime calls -> whirl2c -> ANSI-compliant C Code.]
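As a toy model of this lowering (the types and calls below are invented for illustration and are not the Berkeley UPC runtime API), a shared-array store such as a[2] = 3.5 in the translated C becomes pointer-to-shared arithmetic followed by a runtime put call:

  #include <stdio.h>
  #include <string.h>

  /* Toy stand-in for the runtime's opaque pointer-to-shared.                   */
  typedef struct { void *addr; int thread; int phase; } toy_pshared_t;

  /* Toy stand-ins for the calls the translator would emit.                     */
  static toy_pshared_t toy_index(toy_pshared_t p, int i, size_t eltsz) {
      p.addr = (char *)p.addr + (size_t)i * eltsz;  /* layout collapsed to one thread */
      return p;
  }
  static void toy_put_double(toy_pshared_t p, double v) {
      memcpy(p.addr, &v, sizeof v);       /* a real runtime would go through GASNet */
  }

  int main(void) {
      double a[4] = {0};
      toy_pshared_t a_ps = { a, 0, 0 };
      toy_put_double(toy_index(a_ps, 2, sizeof(double)), 3.5);  /* "a[2] = 3.5;"  */
      printf("a[2] = %g\n", a[2]);
      return 0;
  }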

Shared Arrays and Pointers in UPC
  • Cyclic: shared int A[n];
  • Block-cyclic: shared [2] int B[n];
  • Indefinite: shared [0] int *C = (shared [0] int *) upc_alloc(n);
  • Use global pointers (pointer-to-shared) to access shared (possibly remote) data
    • Block size is part of the pointer type
    • A generic pointer-to-shared contains: address, thread id, phase

[Figure: data layout for two threads. T0 holds A[0], A[2], A[4], …; B[0], B[1], B[4], B[5], …; and all of C (C[0], C[1], C[2], …). T1 holds A[1], A[3], A[5], …; B[2], B[3], B[6], B[7], ….]
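The layout above can be checked with the standard UPC introspection functions; a small sketch (n is a placeholder and a static thread count is assumed):

  #include <upc.h>
  #include <stdio.h>

  #define n 8
  shared int A[n];           /* cyclic                    */
  shared [2] int B[n];       /* block-cyclic, blocks of 2 */

  int main(void) {
      if (MYTHREAD == 0) {
          int i;
          for (i = 0; i < n; i++)
              printf("A[%d] -> thread %d   B[%d] -> thread %d, phase %d\n",
                     i, (int)upc_threadof(&A[i]),
                     i, (int)upc_threadof(&B[i]), (int)upc_phaseof(&B[i]));
      }
      return 0;
  }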

Accessing Shared Memory in UPC

[Figure: a pointer-to-shared has three fields (address, thread, phase); in the example the pointer refers into thread 0's partition with phase 2. The address locates a block within the owning thread's portion of shared memory, relative to the start of the array object; the phase is the element offset from the start of that block; the block size and the start of the array object determine how elements map onto threads 0 .. N-1.]

Phaseless Pointer-to-Shared
  • A pointer needs a “phase” to keep track of where it is in a block
    • Source of overhead for pointer arithmetic
  • Special case for “phaseless” pointers: Cyclic + Indefinite
    • Cyclic pointers always have phase 0
    • Indefinite pointers only have one block
    • Don’t need to keep phase in pointer operations for cyclic and indefinite
    • Don’t need to update thread id for indefinite pointer arithmetic
Pointer-to-Shared Representation
  • Pointer Size
    • Want to allow pointers to reside in a register
    • But very large machines may require a longer representation
  • Datatype
    • Use of scalar types (long) rather than a struct may improve backend code quality
      • Faster pointer manipulation, e.g., ptr+int as well as dereferencing
  • Portability and performance balance in UPC compiler
    • 8-byte scalar vs. struct format (configuration time option)
    • The pointer representation is hidden in the runtime layer
    • Modular design makes it easy to add new representations
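A sketch of the trade-off; the 38/14/12-bit split below is invented for illustration and is not the layout Berkeley UPC actually uses. The struct form is simple but 16 bytes; the packed form fits in one 64-bit scalar, so the backend compiler can keep it in a register and pointer-plus-integer arithmetic stays in integer instructions:

  #include <stdint.h>
  #include <stdio.h>

  /* Struct representation: roomy, but 16 bytes on a 64-bit machine.            */
  typedef struct {
      uint64_t addr;        /* local address on the owning thread               */
      uint32_t thread;      /* owning thread id                                 */
      uint32_t phase;       /* element offset within the current block          */
  } ps_struct_t;

  /* Packed 8-byte representation (hypothetical 38-bit addr / 14-bit thread /   */
  /* 12-bit phase split).                                                       */
  typedef uint64_t ps_packed_t;
  #define PS_PACK(addr, thread, phase) \
      (((uint64_t)(addr) << 26) | ((uint64_t)(thread) << 12) | (uint64_t)(phase))
  #define PS_ADDR(p)   ((p) >> 26)
  #define PS_THREAD(p) (((p) >> 12) & 0x3FFFu)
  #define PS_PHASE(p)  ((p) & 0xFFFu)

  int main(void) {
      ps_packed_t p = PS_PACK(1024, 3, 1);
      printf("addr=%llu thread=%llu phase=%llu (size=%zu bytes)\n",
             (unsigned long long)PS_ADDR(p), (unsigned long long)PS_THREAD(p),
             (unsigned long long)PS_PHASE(p), sizeof p);
      return 0;
  }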
Performance Evaluation
  • Testbed
    • HP AlphaServer (1 GHz processors) with a Quadrics interconnect
    • Compaq C compiler for the translated C code
    • Compared with HP UPC 2.1
  • Cost of Language Features
    • Shared pointer arithmetic, shared memory accesses, parallel loops, etc.
  • Application Benchmarks
    • EP: no communication
    • IS: large bulk memory operations
    • MG: bulk memget
    • CG: fine-grained vs. bulk memput
  • Potential of Optimizations
    • Measure the effectiveness of various communication optimizations
Performance of Shared Pointer Arithmetic

[Chart: cycles per pointer-to-shared operation; 1 cycle = 1 ns; the struct representation is 16 bytes.]

  • Phaseless pointers are an important optimization
  • The packed representation also helps
Cost of Shared Memory Access
  • Local accesses somewhat slower than private accesses
    • Layered design does not add additional overhead
  • Remote accesses are a few orders of magnitude slower than local accesses
Parallel Loops in UPC
  • UPC has a "forall" construct for distributing computation:

      shared int v1[N], v2[N], v3[N];
      upc_forall(i = 0; i < N; i++; &v3[i])
          v3[i] = v2[i] + v1[i];

  • An affinity test is performed on every iteration to decide whether it should execute
  • Two kinds of affinity expressions:
    • Integer (compared with the thread id)
    • Shared address (check the affinity of the address)
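For illustration, both affinity forms, together with the ordinary for loop that an integer-affinity upc_forall is equivalent to (the forall-to-for transformation mentioned later removes the per-iteration test); N and the cyclic arrays follow the example above:

  #include <upc.h>
  #define N 1024
  shared int v1[N], v2[N], v3[N];

  void add_all(void) {
      int i;
      /* Integer affinity: iteration i runs on thread (i % THREADS).            */
      upc_forall(i = 0; i < N; i++; i)
          v3[i] = v1[i] + v2[i];

      /* Equivalent ordinary for loop, with the affinity test folded away:      */
      for (i = MYTHREAD; i < N; i += THREADS)
          v3[i] = v1[i] + v2[i];

      /* Address affinity: iteration i runs on upc_threadof(&v3[i]).            */
      upc_forall(i = 0; i < N; i++; &v3[i])
          v3[i] = v1[i] + v2[i];
  }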
NAS Benchmarks (EP and IS)
  • EP shows that the backend C compiler can still successfully optimize the translated C code
  • IS shows that the Berkeley UPC compiler is effective for communication operations
NAS Benchmarks (CG and MG)
  • Berkeley UPC compiler scales well
Performance of Fine-grained Applications
  • Doesn’t scale well due to nature of the benchmark (lots of small reads)
  • HP UPC’s software caching helps performance
Observations on the Results
  • Acceptable worst-case overhead for shared memory access latencies
    • < 10 cycle overhead for shared local accesses
    • ~ 1.5 usec overhead compared to end-to-end network latency
  • Optimizations on pointer-to-shared representation are effective
    • Both phaseless pointers and packed 8-byte format
  • Good performance compared to HP UPC 2.1
Communication Optimizations
  • Hiding Communication Latencies
    • Use of non-blocking operations (sketched below)
    • Placement analysis that separates get()/put() calls as far as possible from the matching sync()
    • Message pipelining to overlap communication with more communication
  • Optimizing Shared Memory Accesses
    • Eliminating the locality test for local shared pointers: flow- and context-sensitive analysis
    • Transforming forall loops into equivalent for loops
    • Eliminating redundant pointer arithmetic for pointers with the same thread and phase
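A sketch of latency hiding with split-phase operations, in the style of GASNet's non-blocking get interface (treat the exact signatures as approximate; the two helper routines are placeholders): both transfers are issued up front, independent computation runs while they are in flight, and each sync is placed as late as possible:

  #include <gasnet.h>

  extern void do_independent_work(void);           /* placeholder               */
  extern void consume(double *buf, size_t nbytes); /* placeholder               */

  void fetch_two_and_compute(double *buf0, gasnet_node_t n0, void *r0,
                             double *buf1, gasnet_node_t n1, void *r1,
                             size_t nbytes) {
      gasnet_handle_t h0 = gasnet_get_nb(buf0, n0, r0, nbytes);  /* issue get 1  */
      gasnet_handle_t h1 = gasnet_get_nb(buf1, n1, r1, nbytes);  /* issue get 2  */

      do_independent_work();          /* computation/communication overlap       */

      gasnet_wait_syncnb(h0);         /* block only when the data is needed      */
      consume(buf0, nbytes);
      gasnet_wait_syncnb(h1);
      consume(buf1, nbytes);
  }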
More Optimizations
  • Message Vectorization and Aggregation (see the sketch below)
    • Scatter/gather techniques
    • Packing generally pays off for small (< 500-byte) messages
  • Software Caching and Prefetching
    • A prototype implementation:
      • Local knowledge only – no coherence messages
      • Cache remote reads and buffer outgoing writes
      • Based on the weak ordering consistency model
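A UPC-level illustration of aggregation (dst is assumed to point at M doubles allocated on a single remote thread, e.g. with upc_alloc there; M is a placeholder): instead of issuing many fine-grained remote stores, the values are packed into a private buffer and shipped with one bulk upc_memput:

  #include <upc.h>
  #define M 64

  shared [] double *dst;     /* indefinite: all M elements on one thread         */

  void send_fine_grained(const double *src) {
      int i;
      for (i = 0; i < M; i++)
          dst[i] = src[i];                        /* M small remote puts          */
  }

  void send_aggregated(const double *src) {
      double buf[M];
      int i;
      for (i = 0; i < M; i++)
          buf[i] = src[i];                        /* pack into a private buffer   */
      upc_memput(dst, buf, M * sizeof(double));   /* one bulk put                 */
  }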
Experimental Results
  • Computation/communication overlap works better than communication/communication overlap on Quadrics
  • Results are likely to differ on other networks
Experimental Results
  • Neither compiler performs well on the naïve version
  • Culprit: pointer-to-shared operations + affinity tests
  • Privatizing local shared accesses improves performance by an order of magnitude (sketched below)
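The privatization referred to above can be written by hand in UPC: when elements are known to have affinity to the calling thread, cast their address to an ordinary C pointer and use plain loads and stores, avoiding pointer-to-shared arithmetic and affinity tests entirely. A sketch, assuming a cyclic array and a static thread count:

  #include <upc.h>
  #define N 1024
  shared double a[N];                /* cyclic: a[i] lives on thread i % THREADS  */

  void scale_local_elements(double s) {
      /* a[MYTHREAD], a[MYTHREAD+THREADS], ... are contiguous in local memory,    */
      /* so cast the first locally owned element to a private pointer...          */
      double *local = (double *)&a[MYTHREAD];
      int i, nlocal = (N - MYTHREAD + THREADS - 1) / THREADS;
      /* ...and walk the local slice with ordinary C pointer arithmetic.          */
      for (i = 0; i < nlocal; i++)
          local[i] *= s;
  }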
Compiler Status
  • A fully UPC 1.1 compliant public release in April
  • Supported Platforms:
    • HP AlphaServer, IBM SP, Linux x86/Itanium, SGI Origin 2000, Solaris SPARC/x86, Mac OS X PowerPC
  • Supported Networks:
    • Quadrics/Elan, Myrinet/GM, IBM/LAPI, and MPI
  • A release this summer will include:
    • Pthreads/System V shared memory support
    • GASNet Infiniband support
Conclusion
  • The Berkeley UPC Compiler achieves both portability and good performance
    • Layered, modular design
    • Effective pointer-to-shared optimizations
    • Good performance compared to a commercially available UPC compiler
    • Still lots of opportunities for communication optimizations
    • Available for download at:

http://upc.lbl.gov