
Performance Portability and Programmability for Heterogeneous Many-core Architectures (PEPPHER)
Siegfried Benkner (on behalf of the PEPPHER Consortium), Research Group Scientific Computing, Faculty of Computer Science, University of Vienna, Austria


Presentation Transcript


  1. Performance Portability and Programmability for Heterogeneous Many-core Architectures (PEPPHER)
  Siegfried Benkner (on behalf of the PEPPHER Consortium)
  Research Group Scientific Computing, Faculty of Computer Science, University of Vienna, Austria

  2. EU Project PEPPHER
  • Performance Portability & Programmability for Heterogeneous Manycore Architectures
  • ICT FP7, Computing Systems; 3 years; finished Feb. 2013
  • 9 partners, coordinated by the University of Vienna
  • http://www.peppher.eu
  • Goal: Enable portable, productive and efficient programming of single-node heterogeneous many-core systems.
  • Holistic approach
    • Component-based high-level program development
    • Auto-tuned algorithms & data structures
    • Compilation strategies
    • Runtime systems
    • Hardware mechanisms

  3. Performance. Portability. Programmability
  • Methodology & framework for the development of performance-portable code.
  • Execute the same application efficiently on different heterogeneous architectures.
  • Support multiple parallel APIs: OpenMP, OpenCL, CUDA, TBB, ...
  • Focus: single-node/chip heterogeneous architectures.
  [Figure: Application (C/C++) → PEPPHER Framework → many-core CPU, CPU+GPU, Intel Xeon Phi, PEPPHERSim, PePU (Movidius)]
  • Approach
    • Multi-architectural, performance-aware components: multiple implementation variants of functions, each with a performance model
    • Task-based execution model & intelligent runtime system: runtime selection of the best task implementation variant for the given platform

  4. PEPPHER Approach
  [Figure: the mainstream programmer writes a component-based application with annotations; the expert programmer (compiler/autotuner) provides component implementation variants for different target platforms, algorithms, inputs, ..., described by platform descriptors (PDL); measured performance is fed back.]
  • Programmer
    • Annotate calls to performance-critical functions (= components)
    • Provide implementation variants for different platforms (PDL)
    • Provide component meta-data

  5. PEPPHER Approach
  [Figure: the component-based application with annotations is transformed/composed into an intermediate task-based representation; the runtime system's heterogeneous task scheduler dynamically selects the "best" implementation variant on the target platforms; measured performance is fed back.]
  • Programmer
    • Annotate calls to performance-critical functions (= components)
    • Provide implementation variants for different platforms (PDL)
    • Provide component meta-data
  • PEPPHER framework
    • Management of components and implementation variants
    • Transformation / composition
    • Implementation variant selection
    • Dynamic, performance-aware task scheduling (StarPU runtime)
  A minimal sketch of an annotated component call is shown below.
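
  For illustration only: a hedged sketch of the programmer's view, using the #pragma pph call syntax from the coordination-language slides; the component name sort_component and its signature are hypothetical, not part of the project's code.

    /* Hedged sketch of an annotated application (hypothetical component).
     * The mainstream programmer only calls the component interface; the
     * PEPPHER framework decides which implementation variant runs. */
    #include <stddef.h>

    /* Component interface provided by an expert programmer. */
    void sort_component(float *data, size_t n);

    void process(float *a, float *b, size_t n)
    {
        #pragma pph call        /* asynchronous component call */
        sort_component(a, n);

        #pragma pph call sync   /* synchronous call: returns when b is sorted */
        sort_component(b, n);
    }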

  6. PEPPHER Framework
  [Figure: framework stack, top to bottom]
  • Applications (embedded, general purpose, HPC): C/C++ source code with annotated component calls; asynchronous calls, data distribution, patterns, SkePU skeletons
  • High-level coordination / patterns / skeletons
  • Components (C/C++, OpenMP, CUDA, OpenCL, TBB, Offload): component implementation variants for different core architectures, algorithms, ...
  • Autotuned data structures & algorithms
  • Transformation tool / composition tool: component glue code; static variant selection (if any)
  • PEPPHER task graph: component task graph with explicit data dependencies; performance models
  • PEPPHER run-time (StarPU): performance-aware, data-aware dynamic scheduling of "best" component variants onto free execution units; scheduling strategies; drivers (CUDA, OpenCL, OpenMP)
  • Single-node heterogeneous many-core hardware: CPU, GPU, Xeon Phi, SIM, PePU (SIM = PEPPHER simulator; PePU = PEPPHER processing unit, Movidius)

  7. PEPPHER Components
  [Figure: «interface» C with f(param-list) and interface meta-data; implementation variants «variant» C1 ... «variant» Cn, each with f(param-list){...} and variant meta-data]
  • Component Interface
    • Specification of functionality
    • Used by mainstream programmers
  • Implementation Variants
    • Different architectures/platforms
    • Different algorithms/data structures
    • Different input characteristics
    • Different performance goals
    • Written by expert programmers (or generated, e.g., by auto-tuning; cf. the EU AutoTune project)
  • Features
    • Different programming languages (C/C++, OpenCL, CUDA, OpenMP)
    • Task & data parallelism
  • Constraints
    • No side effects; non-preemptive
    • Stateless; composition on CPU only
  A hedged sketch of an interface with two implementation variants follows.
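
  The sketch below is illustrative only (the component name vscale and both variant names are invented; in the framework the interface/variant binding is carried in XML meta-data rather than in code): one interface signature, and two variants that implement it.

    #include <stddef.h>

    /* "Interface": the signature the mainstream programmer calls. */
    void vscale(float *data, size_t n, float factor);

    /* Variant 1: plain sequential CPU implementation. */
    void vscale_seq(float *data, size_t n, float factor)
    {
        for (size_t i = 0; i < n; i++)
            data[i] *= factor;
    }

    /* Variant 2: multi-core CPU implementation using OpenMP
     * (same signature, different platform/parallelization). */
    void vscale_omp(float *data, size_t n, float factor)
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            data[i] *= factor;
    }

  A CUDA or OpenCL variant would expose the same signature again; the variant meta-data (slide 16) records which platform each implementation targets.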

  8. Platform Description Language (PDL)
  • Goal: Make platform-specific information explicit for tools and users.
  • Processing Units (PUs)
    • Master (initiates program execution)
    • Worker (executes delegated tasks)
    • Hybrid (master & worker)
  • Memory Regions
    • Express key characteristics of the memory hierarchy
    • Can be defined for all processing units
  • Interconnects
    • Describe communication facilities (data movement) between PUs
  • Hardware and Software Properties
    • e.g., core count, memory sizes, available libraries

  9. PEPPHER Coordination Language
  • Component calls
    • asynchronous & synchronous
  • Patterns
    • e.g., pipeline pattern
  • Other features:
    • Specification of optimization goals (time vs. power) and execution targets
    • Data partitioning; array access patterns; parameter assertions
    • Memory consistency control

  Component calls:

    #pragma pph call        // read A, write B -> meta-data
    cf1(A, N, B, M);
    #pragma pph call
    cf2(B, M);

    #pragma pph call sync
    cf(A, N);

  Pipeline pattern:

    #pragma pph pipeline
    while (inputstream >> file) {
        readImage(file, image);
        #pragma pph stage replicate(N)
        {
            resizeAndColorConvert(image);
            detectFace(image, outImage);
        }
        ...
    }

  10. Transformation System
  • Source-to-source transformation
    • Based on ROSE
    • Generates C++ with calls to the coordination layer and the StarPU runtime
  • Coordination layer
    • Support for parallel patterns (pipelining)
    • Submission of tasks to StarPU
  • Heterogeneous runtime system
    • Based on INRIA's StarPU
    • Selection of implementation variants based on available hardware resources
    • Data-aware & performance-aware task scheduling onto heterogeneous PUs
  [Figure: application with annotations + PDL platform descriptor + PEPPHER component repository → transformation tool → coordination layer / PEPPHER component framework → task-based heterogeneous runtime → hybrid hardware (SMP, GPU, MIC)]
  A StarPU-style task-submission sketch follows.
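
  The generated glue code itself is not shown on the slides; purely to illustrate the runtime interface being targeted, here is a minimal hand-written StarPU example (assuming a StarPU 1.2-style API; the codelet and kernel names are invented for this sketch and are not PEPPHER output).

    /* Minimal StarPU sketch: one codelet with a CPU implementation, a
     * registered vector, and an asynchronous task submission. */
    #include <stdint.h>
    #include <starpu.h>

    static void vscale_cpu(void *buffers[], void *cl_arg)
    {
        struct starpu_vector_interface *vec = buffers[0];
        float *data = (float *)STARPU_VECTOR_GET_PTR(vec);
        unsigned n  = STARPU_VECTOR_GET_NX(vec);
        float factor;
        starpu_codelet_unpack_args(cl_arg, &factor);
        for (unsigned i = 0; i < n; i++)
            data[i] *= factor;
    }

    static struct starpu_codelet vscale_cl = {
        .cpu_funcs = { vscale_cpu },  /* CUDA/OpenCL variants would be added here */
        .nbuffers  = 1,
        .modes     = { STARPU_RW },
    };

    int main(void)
    {
        float data[1024];
        for (int i = 0; i < 1024; i++)
            data[i] = (float)i;

        if (starpu_init(NULL) != 0)
            return 1;

        starpu_data_handle_t handle;
        starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                    (uintptr_t)data, 1024, sizeof(float));

        float factor = 2.0f;
        /* Asynchronous submission; the scheduler picks the execution unit. */
        starpu_task_insert(&vscale_cl,
                           STARPU_RW, handle,
                           STARPU_VALUE, &factor, sizeof(factor),
                           0);

        starpu_task_wait_for_all();      /* wait for outstanding tasks */
        starpu_data_unregister(handle);  /* data is fetched back to main RAM */
        starpu_shutdown();
        return 0;
    }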

  11. Performance Results
  • OpenCV face detection
    • 3425 images
    • Image resolution: 640x480 (VGA)
  • Different implementation variants for the middle stages (CPU vs. GPU)
  • Comparison to the plain OpenCV version and a hand-coded Intel TBB (pipeline) version
  • Architecture: 2 Xeon X5550 (4-core), 2 NVIDIA C2050, 1 NVIDIA C1060

  12. Major Results of PEPPHER
  • Component framework
    • Multi-architectural, resource- & performance-aware components
    • PDL adopted by the Open Community Runtime (OCR) – US XStack program
  • Transformation, composition, compilation
    • Transformation tool (U. Vienna)
    • Composition tool & SkePU (U. Linköping)
    • Offload C++ compiler used by the game industry (Codeplay)
  • Runtime system (U. Bordeaux)
    • StarPU part of the Linux (Debian) distribution and the MAGMA library
  • Superior parallel algorithms and data structures (KIT, Chalmers)
  • PePU experimental hardware platform & simulator (Movidius)
    • PEPPHERSim used in industry

  13. Backup Slides

  14. Example: Tiled Cholesky Factorization

    FOR k = 0..TILES-1
        POTRF(A[k][k])
        FOR m = k+1..TILES-1
            TRSM(A[k][k], A[m][k])
        FOR n = k+1..TILES-1
            SYRK(A[n][k], A[n][n])
            FOR m = n+1..TILES-1
                GEMM(A[m][k], A[n][k], A[m][n])

  • Utilize expert-written components: BLAS kernels from MAGMA and PLASMA
  • Implementation variants:
    • multi-core CPU (PLASMA)
    • GPU (MAGMA)
  • PEPPHER component: interface, implementation variants + meta-data
  A hedged sketch of the tile loop written as annotated component calls follows.
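
  As a hedged illustration of how this loop nest could be expressed with PEPPHER-style annotated component calls (the C signatures of potrf/trsm/syrk/gemm and the tile layout are assumptions; only the loop structure comes from the slide):

    /* Hypothetical component interfaces; their variants would wrap PLASMA
     * (multi-core CPU) and MAGMA (GPU) kernels. A tile is a ts x ts block
     * stored contiguously; A[i*tiles + j] points to tile (i, j). */
    void potrf(float *akk, int ts);
    void trsm(float *akk, float *amk, int ts);
    void syrk(float *ank, float *ann, int ts);
    void gemm(float *amk, float *ank, float *amn, int ts);

    /* All component calls are asynchronous; the runtime derives the task
     * DAG from the read/write intents in the component meta-data and
     * schedules each task onto a CPU or GPU variant. */
    void tiled_cholesky(float **A, int tiles, int ts)
    {
        for (int k = 0; k < tiles; k++) {
            #pragma pph call
            potrf(A[k*tiles + k], ts);

            for (int m = k + 1; m < tiles; m++) {
                #pragma pph call
                trsm(A[k*tiles + k], A[m*tiles + k], ts);
            }
            for (int n = k + 1; n < tiles; n++) {
                #pragma pph call
                syrk(A[n*tiles + k], A[n*tiles + n], ts);

                for (int m = n + 1; m < tiles; m++) {
                    #pragma pph call
                    gemm(A[m*tiles + k], A[n*tiles + k], A[m*tiles + n], ts);
                }
            }
        }
    }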

  15. PEPPHER Approach
  • Transformation/composition
    • Processing of user annotations/meta-data
    • Generation of a task-based representation (DAG)
    • Static pre-selection of variants
  • Task-based execution model
    • Runtime task variant selection & scheduling
    • Data/topology-aware: minimize data transfer
    • Performance-aware: minimize makespan, or another objective (power, ...)
  • Multi-level parallelism
    • Coarse-grained inter-component parallelism
    • Fine(r)-grained intra-component parallelism
    • Exploit ALL execution units
  [Figure: task DAG for the Cholesky example (POTRF, TRSM, SYRK, GEMM nodes); each GEMM task can run as CPU-GEMM or GPU-GEMM]

  16. Component Meta-Data
  • Interface meta-data (XML)
    • Parameter intent (read/write)
    • Supported performance aspects (execution time, power)
  • Implementation variant meta-data (XML)
    • Supported target platforms (PDL)
    • Performance model
    • Input data constraints (if any)
    • Tunable parameters (if any)
    • Required components (if any)
  • Key issues
    • Make platform-specific optimizations/dependencies explicit.
    • Make components performance- and resource-aware.
    • Support runtime variant selection.
    • Support code transformation and auto-tuning.
  [Figure: XML schemas for interface meta-data and variant meta-data]

  17. Performance-Aware Components
  • Each component is associated with an abstract performance model.
  • Invocation Context: captures performance-relevant information about the input data (problem size, data layout, etc.)
  • Resource Context: specifies main HW/SW characteristics (cores, memory, ...)
  • Performance Descriptor: usually includes (relative) runtime and power estimates
  • Generic performance prediction function:

    PerfDsc getPrediction(InvocationContextDsc icd, ResourceContextDsc rcd)

  [Figure: invocation context descriptors and a resource context descriptor (PDL) are fed into the component performance model, which returns a performance descriptor]
  A hedged C rendering of these descriptors follows.
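
  A minimal C sketch of these notions; only the getPrediction signature above comes from the slide, while the struct fields and the simple linear cost model are illustrative assumptions.

    #include <stddef.h>

    typedef struct { size_t problem_size; } InvocationContextDsc;
    typedef struct { int cpu_cores; int gpus; } ResourceContextDsc;
    typedef struct { double runtime_ms; double power_w; } PerfDsc;

    /* Each implementation variant supplies a prediction function of this
     * generic type (cf. getPrediction above). */
    typedef PerfDsc (*getPrediction_fn)(InvocationContextDsc icd,
                                        ResourceContextDsc rcd);

    /* Illustrative model for a CPU variant: runtime grows linearly with the
     * problem size and is divided by the core count from the resource
     * context; the power estimate is a constant placeholder. */
    static PerfDsc example_cpu_prediction(InvocationContextDsc icd,
                                          ResourceContextDsc rcd)
    {
        PerfDsc p;
        int cores = rcd.cpu_cores > 0 ? rcd.cpu_cores : 1;
        p.runtime_ms = 1.0e-6 * (double)icd.problem_size / (double)cores;
        p.power_w    = 95.0;
        return p;
    }

  At run time, the scheduler would evaluate such a function for each candidate variant and pick the one with the best predicted descriptor, matching the dynamic variant selection described on slides 5 and 15.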

  18. Basic Coordination Language
  • Memory consistency
    • flush: ensures consistency between host and workers
  • Component calls
    • implicit memory consistency across workers

    #pragma pph call
    cf1(A, N);
    ...
    #pragma pph flush(A)    // block until A has become available
    int first = A[0];       // explicit flush required since A is accessed

    #pragma pph call
    cf1(A, N);              // A: read/write
    ...                     // implicit memory consistency on workers only;
    ...                     // no explicit flush is needed here provided A
    ...                     // is not accessed within the master process
    #pragma pph call
    cf2(A, N);              // A: read; actual values of A produced by cf1()

  19. Basic Coordination Language
  • Parameter assertions
    • influence component variant selection
  • Optimization goals
    • specify optimization goals to be taken into account by the runtime scheduler
  • Execution target
    • specify a pre-defined target library (e.g., OPENCL) or a processing-unit group from the PDL platform descriptor

    #pragma pph call parameter(size < 1000)
    cf1(A, size);

    #pragma pph call optimize(TIME)
    cf1(A, size);
    ...
    #pragma pph call optimize(POWER < 100 && TIME < 10)
    cf2(A, size);

    #pragma pph call target(OPENCL)
    cf(A, size);

  20. Basic Coordination Language
  • Data partitioning
    • generate multiple component calls, one for each partition (cf. HPF)
  • Access to array sections
    • specify which array section is accessed in a component call (cf. Fortran array sections)

    #pragma pph call partition(A(size:BLOCK(size/2)))
    cf1(A, size);

    #pragma pph call access(A(size:50:size-1))
    cf(A+50, size-50);

  21. Performance Results
  • Leukocyte tracking
    • Adapted from the Rodinia benchmark suite
    • Different implementation variants for Motion Gradient Vector Flow (CPU vs. GPU)
    • Comparison to the OpenMP version
  • Architecture:
    • 2 Xeon X5550 (4-core)
    • 2 NVIDIA C2050
    • 1 NVIDIA C1060
  • 4 different configurations -> PDL descriptors
  [Figure: speedup of PEPPHER vs. the original (Rodinia) version for SEQ, OMP, 8 CPU cores, 7 CPU + 1 GPU, 6 CPU + 2 GPU, and 5 CPU + 3 GPU configurations]

  22. Future Work
  • EU AutoTune project
    • Autotuning of high-level patterns (pipeline replication factor, ...)
    • Tunable parameters specified in component descriptors
  • Work on energy efficiency
    • Energy-aware components; runtime scheduling for energy efficiency
    • User-specified optimization goals
    • Trade-off execution time vs. energy consumption; QoS support
  • Extension towards clusters
    • Combine with a global MPI layer across nodes
