
Performance Portability and Programmability for Heterogeneous Many-core Architectures (PEPPHER)
Siegfried Benkner (on behalf of the PEPPHER Consortium), Research Group Scientific Computing, Faculty of Computer Science, University of Vienna, Austria


Presentation Transcript


  1. Performance Portability and Programmability for Heterogeneous Many-core Architectures (PEPPHER)
  Siegfried Benkner (on behalf of the PEPPHER Consortium)
  Research Group Scientific Computing, Faculty of Computer Science, University of Vienna, Austria

  2. EU Project PEPPHER
  • Performance Portability & Programmability for Heterogeneous Manycore Architectures
  • ICT FP7, Computing Systems; 3 years; finished Feb. 2013
  • 9 partners, coordinated by the University of Vienna
  • http://www.peppher.eu
  • Goal: Enable portable, productive and efficient programming of single-node heterogeneous many-core systems.
  • Holistic approach
    • Component-based high-level program development
    • Auto-tuned algorithms & data structures
    • Compilation strategies
    • Runtime systems
    • Hardware mechanisms

  3. Performance. Portability. Programmability
  • Methodology & framework for the development of performance-portable code.
  • Execute the same application efficiently on different heterogeneous architectures.
  • Support multiple parallel APIs: OpenMP, OpenCL, CUDA, TBB, ...
  • Focus: single-node/chip heterogeneous architectures.
  [Figure: Application (C/C++) → PEPPHER Framework → many-core CPU, CPU+GPU, Intel Xeon Phi, PEPPHERSim, PePU (Movidius)]
  • Approach
    • Multi-architectural, performance-aware components: multiple implementation variants of functions, each with a performance model
    • Task-based execution model & intelligent runtime system: runtime selection of the best task implementation variant for the given platform

  4. PEPPHER Approach
  [Figure: the mainstream programmer writes a component-based application with annotations; the expert programmer (compiler/autotuner) provides component implementation variants for different target platforms, algorithms, inputs, ..., described by platform descriptors (PDL); measured performance is fed back.]
  • Programmer
    • Annotate calls to performance-critical functions (= components)
    • Provide implementation variants for different platforms (PDL)
    • Provide component meta-data

  5. PEPPHER Approach
  [Figure: the component-based application with annotations is transformed/composed into an intermediate task-based representation; the runtime system's heterogeneous task scheduler dynamically selects the "best" implementation variant on the target platforms; measured performance is fed back.]
  • Programmer
    • Annotate calls to performance-critical functions (= components)
    • Provide implementation variants for different platforms (PDL)
    • Provide component meta-data
  • PEPPHER framework
    • Management of components and implementation variants
    • Transformation / composition
    • Implementation variant selection
    • Dynamic, performance-aware task scheduling (StarPU runtime)
  A minimal sketch of an annotated component call is shown below.
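
  For illustration only: a hedged sketch of the programmer's view, using the #pragma pph call syntax from the coordination-language slides; the component name sort_component and its signature are hypothetical, not part of the project's code.

    /* Hedged sketch of an annotated application (hypothetical component).
     * The mainstream programmer only calls the component interface; the
     * PEPPHER framework decides which implementation variant runs. */
    #include <stddef.h>

    /* Component interface provided by an expert programmer. */
    void sort_component(float *data, size_t n);

    void process(float *a, float *b, size_t n)
    {
        #pragma pph call        /* asynchronous component call */
        sort_component(a, n);

        #pragma pph call sync   /* synchronous call: returns when b is sorted */
        sort_component(b, n);
    }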

  6. PEPPHER Framework
  [Figure: framework stack, top to bottom]
  • Applications (embedded, general purpose, HPC): C/C++ source code with annotated component calls; asynchronous calls, data distribution, patterns, SkePU skeletons
  • High-level coordination / patterns / skeletons
  • Components (C/C++, OpenMP, CUDA, OpenCL, TBB, Offload): component implementation variants for different core architectures, algorithms, ...
  • Autotuned data structures & algorithms
  • Transformation tool / composition tool: component glue code; static variant selection (if any)
  • PEPPHER task graph: component task graph with explicit data dependencies; performance models
  • PEPPHER run-time (StarPU): performance-aware, data-aware dynamic scheduling of "best" component variants onto free execution units; scheduling strategies; drivers (CUDA, OpenCL, OpenMP)
  • Single-node heterogeneous many-core hardware: CPU, GPU, Xeon Phi, SIM, PePU (SIM = PEPPHER simulator; PePU = PEPPHER processing unit, Movidius)

  7. PEPPHER Components
  [Figure: «interface» C with f(param-list) and interface meta-data; implementation variants «variant» C1 ... «variant» Cn, each with f(param-list){...} and variant meta-data]
  • Component Interface
    • Specification of functionality
    • Used by mainstream programmers
  • Implementation Variants
    • Different architectures/platforms
    • Different algorithms/data structures
    • Different input characteristics
    • Different performance goals
    • Written by expert programmers (or generated, e.g., by auto-tuning; cf. the EU AutoTune project)
  • Features
    • Different programming languages (C/C++, OpenCL, CUDA, OpenMP)
    • Task & data parallelism
  • Constraints
    • No side effects; non-preemptive
    • Stateless; composition on CPU only
  A hedged sketch of an interface with two implementation variants follows.
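
  The sketch below is illustrative only (the component name vscale and both variant names are invented; in the framework the interface/variant binding is carried in XML meta-data rather than in code): one interface signature, and two variants that implement it.

    #include <stddef.h>

    /* "Interface": the signature the mainstream programmer calls. */
    void vscale(float *data, size_t n, float factor);

    /* Variant 1: plain sequential CPU implementation. */
    void vscale_seq(float *data, size_t n, float factor)
    {
        for (size_t i = 0; i < n; i++)
            data[i] *= factor;
    }

    /* Variant 2: multi-core CPU implementation using OpenMP
     * (same signature, different platform/parallelization). */
    void vscale_omp(float *data, size_t n, float factor)
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            data[i] *= factor;
    }

  A CUDA or OpenCL variant would expose the same signature again; the variant meta-data (slide 16) records which platform each implementation targets.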

  8. Platform Description Language (PDL)
  • Goal: Make platform-specific information explicit for tools and users.
  • Processing Units (PUs)
    • Master (initiates program execution)
    • Worker (executes delegated tasks)
    • Hybrid (master & worker)
  • Memory Regions
    • Express key characteristics of the memory hierarchy
    • Can be defined for all processing units
  • Interconnects
    • Describe communication facilities (data movement) between PUs
  • Hardware and Software Properties
    • e.g., core count, memory sizes, available libraries

  9. PEPPHER Coordination Language
  • Component calls
    • asynchronous & synchronous
  • Patterns
    • e.g., pipeline pattern
  • Other features:
    • Specification of optimization goals (time vs. power) and execution targets
    • Data partitioning; array access patterns; parameter assertions
    • Memory consistency control

  Component calls:

    #pragma pph call        // read A, write B -> meta-data
    cf1(A, N, B, M);
    #pragma pph call
    cf2(B, M);

    #pragma pph call sync
    cf(A, N);

  Pipeline pattern:

    #pragma pph pipeline
    while (inputstream >> file) {
        readImage(file, image);
        #pragma pph stage replicate(N)
        {
            resizeAndColorConvert(image);
            detectFace(image, outImage);
        }
        ...
    }

  10. Transformation System
  • Source-to-source transformation
    • Based on ROSE
    • Generates C++ with calls to the coordination layer and the StarPU runtime
  • Coordination layer
    • Support for parallel patterns (pipelining)
    • Submission of tasks to StarPU
  • Heterogeneous runtime system
    • Based on INRIA's StarPU
    • Selection of implementation variants based on available hardware resources
    • Data-aware & performance-aware task scheduling onto heterogeneous PUs
  [Figure: application with annotations + PDL platform descriptor + PEPPHER component repository → transformation tool → coordination layer / PEPPHER component framework → task-based heterogeneous runtime → hybrid hardware (SMP, GPU, MIC)]
  A StarPU-style task-submission sketch follows.
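
  The generated glue code itself is not shown on the slides; purely to illustrate the runtime interface being targeted, here is a minimal hand-written StarPU example (assuming a StarPU 1.2-style API; the codelet and kernel names are invented for this sketch and are not PEPPHER output).

    /* Minimal StarPU sketch: one codelet with a CPU implementation, a
     * registered vector, and an asynchronous task submission. */
    #include <stdint.h>
    #include <starpu.h>

    static void vscale_cpu(void *buffers[], void *cl_arg)
    {
        struct starpu_vector_interface *vec = buffers[0];
        float *data = (float *)STARPU_VECTOR_GET_PTR(vec);
        unsigned n  = STARPU_VECTOR_GET_NX(vec);
        float factor;
        starpu_codelet_unpack_args(cl_arg, &factor);
        for (unsigned i = 0; i < n; i++)
            data[i] *= factor;
    }

    static struct starpu_codelet vscale_cl = {
        .cpu_funcs = { vscale_cpu },  /* CUDA/OpenCL variants would be added here */
        .nbuffers  = 1,
        .modes     = { STARPU_RW },
    };

    int main(void)
    {
        float data[1024];
        for (int i = 0; i < 1024; i++)
            data[i] = (float)i;

        if (starpu_init(NULL) != 0)
            return 1;

        starpu_data_handle_t handle;
        starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                    (uintptr_t)data, 1024, sizeof(float));

        float factor = 2.0f;
        /* Asynchronous submission; the scheduler picks the execution unit. */
        starpu_task_insert(&vscale_cl,
                           STARPU_RW, handle,
                           STARPU_VALUE, &factor, sizeof(factor),
                           0);

        starpu_task_wait_for_all();      /* wait for outstanding tasks */
        starpu_data_unregister(handle);  /* data is fetched back to main RAM */
        starpu_shutdown();
        return 0;
    }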

  11. Performance Results
  • OpenCV face detection
    • 3425 images
    • Image resolution: 640x480 (VGA)
  • Different implementation variants for the middle stages (CPU vs. GPU)
  • Comparison to the plain OpenCV version and a hand-coded Intel TBB (pipeline) version
  • Architecture: 2 Xeon X5550 (4-core), 2 NVIDIA C2050, 1 NVIDIA C1060

  12. Major Results of PEPPHER
  • Component framework
    • Multi-architectural, resource- & performance-aware components
    • PDL adopted by the Open Community Runtime (OCR) – US XStack program
  • Transformation, composition, compilation
    • Transformation tool (U. Vienna)
    • Composition tool & SkePU (U. Linköping)
    • Offload C++ compiler used by the game industry (Codeplay)
  • Runtime system (U. Bordeaux)
    • StarPU part of the Linux (Debian) distribution and the MAGMA library
  • Superior parallel algorithms and data structures (KIT, Chalmers)
  • PePU experimental hardware platform & simulator (Movidius)
    • PEPPHERSim used in industry

  13. Backup Slides

  14. Example: Tiled Cholesky Factorization

    FOR k = 0..TILES-1
        POTRF(A[k][k])
        FOR m = k+1..TILES-1
            TRSM(A[k][k], A[m][k])
        FOR n = k+1..TILES-1
            SYRK(A[n][k], A[n][n])
            FOR m = n+1..TILES-1
                GEMM(A[m][k], A[n][k], A[m][n])

  • Utilize expert-written components: BLAS kernels from MAGMA and PLASMA
  • Implementation variants:
    • multi-core CPU (PLASMA)
    • GPU (MAGMA)
  • PEPPHER component: interface, implementation variants + meta-data
  A hedged sketch of the tile loop written as annotated component calls follows.
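
  As a hedged illustration of how this loop nest could be expressed with PEPPHER-style annotated component calls (the C signatures of potrf/trsm/syrk/gemm and the tile layout are assumptions; only the loop structure comes from the slide):

    /* Hypothetical component interfaces; their variants would wrap PLASMA
     * (multi-core CPU) and MAGMA (GPU) kernels. A tile is a ts x ts block
     * stored contiguously; A[i*tiles + j] points to tile (i, j). */
    void potrf(float *akk, int ts);
    void trsm(float *akk, float *amk, int ts);
    void syrk(float *ank, float *ann, int ts);
    void gemm(float *amk, float *ank, float *amn, int ts);

    /* All component calls are asynchronous; the runtime derives the task
     * DAG from the read/write intents in the component meta-data and
     * schedules each task onto a CPU or GPU variant. */
    void tiled_cholesky(float **A, int tiles, int ts)
    {
        for (int k = 0; k < tiles; k++) {
            #pragma pph call
            potrf(A[k*tiles + k], ts);

            for (int m = k + 1; m < tiles; m++) {
                #pragma pph call
                trsm(A[k*tiles + k], A[m*tiles + k], ts);
            }
            for (int n = k + 1; n < tiles; n++) {
                #pragma pph call
                syrk(A[n*tiles + k], A[n*tiles + n], ts);

                for (int m = n + 1; m < tiles; m++) {
                    #pragma pph call
                    gemm(A[m*tiles + k], A[n*tiles + k], A[m*tiles + n], ts);
                }
            }
        }
    }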

  15. PEPPHER Approach
  • Transformation/composition
    • Processing of user annotations/meta-data
    • Generation of a task-based representation (DAG)
    • Static pre-selection of variants
  • Task-based execution model
    • Runtime task variant selection & scheduling
    • Data/topology-aware: minimize data transfer
    • Performance-aware: minimize makespan, or another objective (power, ...)
  • Multi-level parallelism
    • Coarse-grained inter-component parallelism
    • Fine(r)-grained intra-component parallelism
    • Exploit ALL execution units
  [Figure: task DAG for the Cholesky example (POTRF, TRSM, SYRK, GEMM nodes); each GEMM task can run as CPU-GEMM or GPU-GEMM]

  16. Component Meta-Data
  • Interface meta-data (XML)
    • Parameter intent (read/write)
    • Supported performance aspects (execution time, power)
  • Implementation variant meta-data (XML)
    • Supported target platforms (PDL)
    • Performance model
    • Input data constraints (if any)
    • Tunable parameters (if any)
    • Required components (if any)
  • Key issues
    • Make platform-specific optimizations/dependencies explicit.
    • Make components performance- and resource-aware.
    • Support runtime variant selection.
    • Support code transformation and auto-tuning.
  [Figure: XML schemas for interface meta-data and variant meta-data]

  17. Performance-Aware Components
  • Each component is associated with an abstract performance model.
  • Invocation Context: captures performance-relevant information about the input data (problem size, data layout, etc.)
  • Resource Context: specifies main HW/SW characteristics (cores, memory, ...)
  • Performance Descriptor: usually includes (relative) runtime and power estimates
  • Generic performance prediction function:

    PerfDsc getPrediction(InvocationContextDsc icd, ResourceContextDsc rcd)

  [Figure: invocation context descriptors and a resource context descriptor (PDL) are fed into the component performance model, which returns a performance descriptor]
  A hedged C rendering of these descriptors follows.
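
  A minimal C sketch of these notions; only the getPrediction signature above comes from the slide, while the struct fields and the simple linear cost model are illustrative assumptions.

    #include <stddef.h>

    typedef struct { size_t problem_size; } InvocationContextDsc;
    typedef struct { int cpu_cores; int gpus; } ResourceContextDsc;
    typedef struct { double runtime_ms; double power_w; } PerfDsc;

    /* Each implementation variant supplies a prediction function of this
     * generic type (cf. getPrediction above). */
    typedef PerfDsc (*getPrediction_fn)(InvocationContextDsc icd,
                                        ResourceContextDsc rcd);

    /* Illustrative model for a CPU variant: runtime grows linearly with the
     * problem size and is divided by the core count from the resource
     * context; the power estimate is a constant placeholder. */
    static PerfDsc example_cpu_prediction(InvocationContextDsc icd,
                                          ResourceContextDsc rcd)
    {
        PerfDsc p;
        int cores = rcd.cpu_cores > 0 ? rcd.cpu_cores : 1;
        p.runtime_ms = 1.0e-6 * (double)icd.problem_size / (double)cores;
        p.power_w    = 95.0;
        return p;
    }

  At run time, the scheduler would evaluate such a function for each candidate variant and pick the one with the best predicted descriptor, matching the dynamic variant selection described on slides 5 and 15.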

  18. Basic Coordination Language
  • Memory consistency
    • flush: ensures consistency between host and workers
  • Component calls
    • implicit memory consistency across workers

    #pragma pph call
    cf1(A, N);
    ...
    #pragma pph flush(A)    // block until A has become available
    int first = A[0];       // explicit flush required since A is accessed

    #pragma pph call
    cf1(A, N);              // A: read/write
    ...                     // implicit memory consistency on workers only;
    ...                     // no explicit flush is needed here provided A
    ...                     // is not accessed within the master process
    #pragma pph call
    cf2(A, N);              // A: read; actual values of A produced by cf1()

  19. Basic Coordination Language
  • Parameter assertions
    • influence component variant selection
  • Optimization goals
    • specify optimization goals to be taken into account by the runtime scheduler
  • Execution target
    • specify a pre-defined target library (e.g., OPENCL) or a processing-unit group from the PDL platform descriptor

    #pragma pph call parameter(size < 1000)
    cf1(A, size);

    #pragma pph call optimize(TIME)
    cf1(A, size);
    ...
    #pragma pph call optimize(POWER < 100 && TIME < 10)
    cf2(A, size);

    #pragma pph call target(OPENCL)
    cf(A, size);

  20. Basic Coordination Language
  • Data partitioning
    • generate multiple component calls, one for each partition (cf. HPF)
  • Access to array sections
    • specify which array section is accessed in a component call (cf. Fortran array sections)

    #pragma pph call partition(A(size:BLOCK(size/2)))
    cf1(A, size);

    #pragma pph call access(A(size:50:size-1))
    cf(A+50, size-50);

  21. Performance Results
  • Leukocyte tracking
    • Adapted from the Rodinia benchmark suite
    • Different implementation variants for Motion Gradient Vector Flow (CPU vs. GPU)
    • Comparison to the OpenMP version
  • Architecture:
    • 2 Xeon X5550 (4-core)
    • 2 NVIDIA C2050
    • 1 NVIDIA C1060
  • 4 different configurations -> PDL descriptors
  [Figure: speedup of PEPPHER vs. the original (Rodinia) version for SEQ, OMP, 8 CPU cores, 7 CPU + 1 GPU, 6 CPU + 2 GPU, and 5 CPU + 3 GPU configurations]

  22. Future Work
  • EU AutoTune project
    • Autotuning of high-level patterns (pipeline replication factor, ...)
    • Tunable parameters specified in component descriptors
  • Work on energy efficiency
    • Energy-aware components; runtime scheduling for energy efficiency
    • User-specified optimization goals
    • Trade-off execution time vs. energy consumption; QoS support
  • Extension towards clusters
    • Combine with a global MPI layer across nodes
