
Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Biagio Cosenza, Ph.D.

DPS Group, Institut für Informatik

Universität Innsbruck, Austria


Outline

  • Complexity in HPC

    • Parallel hardware

    • Optimizations

    • Programming models

  • Harnessing complexity

    • Automatic tuning

    • Automatic parallelization

    • DSLs

    • Abstractions for HPC

  • Related work in Insieme


Complexity in HPC


Complexity in Hardware

  • The need for parallel computing

  • Parallelism in hardware

  • Three walls

    • Power wall

    • Memory wall

    • Instruction-level parallelism (ILP) wall


The Power Wall

Power is expensive, but transistors are free

  • We can put more transistors on a chip than we have the power to turn on

  • Power efficiency challenge

    • Performance per watt is the new metric – systems are often constrained by power & cooling

  • This forces us to concede the battle for maximum performance of individual processing elements, in order to win the war for application efficiency through optimizing total system performance

  • Example

    • Intel Pentium 4 HT 670 (released in May 2005)

      • Clock rate 3.8 GHz

    • Intel Core i7-3930K Sandy Bridge (released in Nov. 2011)

      • Clock rate 3.2 GHz


The Memory Wall

The growing disparity in speed between the CPU and memory outside the chip is becoming an overwhelming bottleneck

  • It changes the way we optimize programs

    • Optimizing for memory vs. optimizing computation

  • E.g., a multiply is no longer considered a painfully slow operation compared to loads and stores
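
To make this concrete, here is a toy C sketch (ours, not from the slides): both functions do identical arithmetic, but the row-major version walks memory contiguously and is typically several times faster, because the cost is dominated by memory traffic rather than by the multiplications.

#define N 2048
static double a[N][N];

/* cache-friendly: walks memory in C's row-major layout order */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j] * 2.0;
    return s;
}

/* same arithmetic, but strided access defeats the cache */
double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j] * 2.0;
    return s;
}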


The ILP Wall

There are diminishing returns on finding more ILP

  • Instruction Level Parallelism

    • The potential overlap among instructions

    • Many ILP techniques

      • Instruction pipelining

      • Superscalar execution

      • Out-of-order execution

      • Register renaming

      • Branch prediction

  • The goal of compiler and processor designers is to identify and take advantage of as much ILP as possible

  • It is increasingly difficult to find enough parallelism in a single instruction stream to keep a high-performance single-core processor busy
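
As a toy illustration (ours): the two C functions below compute the same sum, but the second exposes four independent dependence chains that a pipelined, superscalar core can overlap, while the first is one long serial chain of additions.

double sum_serial(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];                /* each add depends on the previous one */
    return s;
}

double sum_unrolled(const double *x, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {   /* four independent chains */
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)                 /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}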



The “Many-Core” Challenges

[Image: Tilera TILE-Gx many-core processor]

  • Many-core vs multi-core

    • Multi-core architectures and programming models suitable for 2 to 32 processors will not easily evolve incrementally to serve many-core systems with 1000s of processors

    • Many-core is the future


What Does It Mean?

  • Hardware is evolving

    • The number of cores is the new Megahertz

  • We need

    • New programming models

    • New system software

    • New supporting architectures that are naturally parallel


New Challenges

  • Make it easy to write programs that execute efficiently on highly parallel computing systems

    • The target should be 1000s of cores per chip

    • Maximize productivity

  • Programming models should

    • be independent of the number of processors

    • support successful models of parallelism, such as task-level parallelism, word-level parallelism, and bit-level parallelism

  • “Autotuners” should play a larger role than conventional compilers in translating parallel programs


Parallel Programming Models

Pthreads, Erlang, MPI, OpenMP, OpenACC, HMPP, Cilk (MIT), Charm++ (Illinois), MapReduce (Google), OpenCL (Khronos Group), CUDA (NVIDIA), Brook (Stanford), Sequoia (Stanford), NESL (CMU), StreamIt (MIT & Microsoft), DataCutter (Maryland), Borealis (Brown), HPCS Chapel (Cray), HPCS Fortress (Sun), HPCS X10 (IBM), Threading Building Blocks (Intel), Real-Time Workshop (MathWorks), Binary Modular Data Flow Machine (TU Munich and AS Nuremberg)



Reconsidering…

  • Applications

    • What are the common parallel kernels and applications?

    • Parallel patterns

      • Instead of traditional benchmarks, design and evaluate parallel programming models and architectures on parallel patterns

      • A parallel pattern (“dwarf”) is an algorithmic method that captures a pattern of computation and communication

      • E.g., dense linear algebra, sparse linear algebra, spectral methods, …

  • Metrics

    • Scalability

      • An old belief was that less-than-linear scaling for a multi-processor application meant failure

      • With the new hardware trend, this is no longer true

        • Any speedup is OK!
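
(A brief aside, not on the original slide, on why linear scaling is rare: by Amdahl's law, a program whose fraction f is parallelizable achieves a speedup of at most S(p) = 1 / ((1 - f) + f/p) on p processors. Even with f = 0.95, S(100) ≈ 16.8, far below linear.)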



Harnessing Complexity

  • Compiler approaches

    • DSL, automatic parallelization, …

  • Library-based approaches


What Can a Compiler Do for Us?

  • Optimize code

  • Automatic tuning

  • Automatic code generation

    • e.g. in order to support different hardware

  • Automatically parallelize code


Automatic Parallelization

There are critical opinions on parallel programming models. The other way:

  • Auto-parallelizing compilers

    • Sequential code => parallel code

Wen-mei Hwu, University of Illinois at Urbana-Champaign: “Why sequential programming models could be the best way to program many-core systems”

http://view.eecs.berkeley.edu/w/images/3/31/Micro-keynote-hwu-12-11-2006_.pdf


Automatic Parallelization

for (int i = 0; i < 100; i++) {
    A[i] = A[i+1];
}

  • Today's compilers have new “tools” for analysis

    • The polyhedral model

  • …but performance is still far from that of manual parallelization

[Diagram: polyhedral compilation flow]

  • Polyhedral extraction from the IR:

    • SCoP detection

    • Translation to the polyhedral model

  • Polyhedral model of the loop above:

    • D: { i in N : 0 <= i < 100 }

    • R: A[i] for each i in D

    • W: A[i+1] for each i in D

  • Analyses & transformations on the model

  • Code generation: generate IR code back from the model
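
For intuition, a hedged C sketch (ours) of what a parallelizer must conclude for the loop above: iteration i reads A[i+1], which iteration i+1 overwrites, i.e. a loop-carried anti-dependence. One legal transformation is renaming: snapshot the array, after which every iteration is independent and the loop can run in parallel.

#include <stdlib.h>
#include <string.h>

void shift_parallel(double *A, int n) {
    double *tmp = malloc(n * sizeof *tmp);
    memcpy(tmp, A, n * sizeof *tmp);   /* rename: break the anti-dependence */
    #pragma omp parallel for
    for (int i = 0; i < n - 1; i++)
        A[i] = tmp[i + 1];             /* iterations are now independent */
    free(tmp);
}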


Autotuners vs. Traditional Compilers

  • Performance of future parallel applications will crucially depend on the quality of the code generated by the compiler

  • The compiler selects which optimizations to perform, chooses parameters for these optimizations, and selects from among alternative implementations of a library kernel

  • The resulting optimization space is large

  • The programming model may simplify the problem

    • but it does not solve it


Optimizations’ Complexity: An Example

Input

  • OpenMP code

  • Simple parallel codes

    • matrix multiplication, Jacobi, stencil3d, …

  • Few optimizations and tuning parameters

    • 2D/3D tiling

    • # of threads

      Goal: Optimize for performance and efficiency
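
As a minimal sketch of the kind of code and tuning parameters meant here (ours; names and sizes are illustrative), a tiled OpenMP matrix multiplication in which the tile size ts and the thread count nthreads span the search space:

#define N 1024

void matmul_tiled(double A[N][N], double B[N][N], double C[N][N],
                  int ts, int nthreads) {
    /* ts (tile size) and nthreads are the tuning parameters */
    #pragma omp parallel for collapse(2) num_threads(nthreads)
    for (int ii = 0; ii < N; ii += ts)
        for (int jj = 0; jj < N; jj += ts)
            for (int kk = 0; kk < N; kk += ts)   /* sequential within a tile */
                for (int i = ii; i < ii + ts && i < N; i++)
                    for (int j = jj; j < jj + ts && j < N; j++)
                        for (int k = kk; k < kk + ts && k < N; k++)
                            C[i][j] += A[i][k] * B[k][j];
}

Each thread owns disjoint C tiles, so the loop carries no race; only the tile size and thread count change between configurations.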


Optimizations’ Complexity: An Example (cont.)

  • Problem

    • Big search space

      • brute force would take years of computation

    • Analytical models fail to find the best configuration

  • Solution

    • Multi-objective search

      • Offline search of Pareto front solutions

      • Runtime selection according to the objective

    • Multi-versioning (see the sketch below)
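
A hedged sketch (ours, far simpler than the cited framework) of the multi-versioning idea: the compiler emits several differently tuned variants of a region, and the runtime picks one according to the active objective.

typedef enum { OBJ_TIME, OBJ_ENERGY } objective_t;
typedef void (*region_fn)(double *, int);

/* stand-ins for compiler-generated variants of the same region */
void region_fast(double *d, int n)   { for (int i = 0; i < n; i++) d[i] *= 2.0; }
void region_frugal(double *d, int n) { for (int i = 0; i < n; i++) d[i] *= 2.0; }

void run_region(double *d, int n, objective_t obj) {
    /* dynamic selection among the emitted versions */
    region_fn variants[] = { region_fast, region_frugal };
    variants[obj == OBJ_ENERGY ? 1 : 0](d, n);
}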

H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch: A Multi-Objective Auto-Tuning Framework for Parallel Codes. ACM Supercomputing (SC), 2012


Optimizations’ Complexity

[Diagram: multi-objective auto-tuning workflow. At compile time, the analyzer extracts code regions from the input code; the optimizer explores configurations, guided by measurements on the parallel target platform, and retains the best solutions; the backend emits multi-versioned code. At runtime, the runtime system performs dynamic selection among the versions.]

H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch: A Multi-Objective Auto-Tuning Framework for Parallel Codes. ACM Supercomputing (SC), 2012


Domain-Specific Languages

  • Ease of programming

    • Use of domain specific concepts

      • E.g. “color”, “pixel”, “particle”, “atom”

    • Simple interface

  • Hide complexity

    • Data structures

    • Parallelization issues

    • Optimizations’ tuning

    • Address specific parallelization patterns (see the sketch below)
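
As a toy sketch (ours) of the idea: the interface below exposes only domain concepts (pixels, images, a per-pixel function), while the data layout and the parallelization behind image_map stay hidden from the user.

typedef struct { unsigned char r, g, b; } pixel;
typedef struct { int w, h; pixel *data; } image;
typedef pixel (*pixel_fn)(pixel);

void image_map(image *img, pixel_fn f) {
    #pragma omp parallel for            /* hidden from the DSL user */
    for (int i = 0; i < img->w * img->h; i++)
        img->data[i] = f(img->data[i]);
}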


Domain-Specific Languages (cont.)

  • DSLs may help parallelization

    • Focus on domain concepts and abstractions

    • Language constraints may help automatic parallelization by compilers

  • 3 major benefits

    • Productivity

    • Performance

    • Portability and forward scalability


Domain-Specific Languages: GLSL Shaders (OpenGL)

[Diagram: the OpenGL 4.3 pipeline: Vertex Data → Vertex Shader → Tessellation Control Shader → Tessellation Evaluation Shader → Geometry Shader → Primitive Setup and Rasterization → Fragment Shader → Blending, with the Texture Store and Pixel Data feeding the shader stages]

Vertex shader:

attribute vec3 vertex;
attribute vec3 normal;
attribute vec2 uv1;
uniform mat4 _mvProj;
uniform mat3 _norm;
varying vec2 vUv;
varying vec3 vNormal;

void main(void) {
    // compute position
    gl_Position = _mvProj * vec4(vertex, 1.0);
    vUv = uv1;
    // compute light info
    vNormal = _norm * normal;
}

Fragment shader:

varying vec2 vUv;
varying vec3 vNormal;
uniform vec3 mainColor;
uniform float specularExp;
uniform vec3 specularColor;
uniform sampler2D mainTexture;
uniform mat3 _dLight;
uniform vec3 _ambient;

void getDirectionalLight(vec3 normal, mat3 dLight, float specularExp,
                         out vec3 diffuse, out float specular) {
    vec3 ecLightDir = dLight[0]; // light direction in eye coordinates
    vec3 colorIntensity = dLight[1];
    vec3 halfVector = dLight[2];
    float diffuseContribution = max(dot(normal, ecLightDir), 0.0);
    float specularContribution = max(dot(normal, halfVector), 0.0);
    specular = pow(specularContribution, specularExp);
    diffuse = colorIntensity * diffuseContribution;
}

void main(void) {
    vec3 diffuse;
    float spec;
    getDirectionalLight(normalize(vNormal), _dLight, specularExp, diffuse, spec);
    vec3 color = max(diffuse, _ambient.xyz) * mainColor;
    gl_FragColor = texture2D(mainTexture, vUv) * vec4(color, 1.0)
                 + vec4(spec * specularColor, 0.0);
}



DSL Examples

Matlab, DLA DSL (dense linear algebra), Python, shell script, SQL, XML, CSS, BPEL, …

  • Interesting recent research work

Leo A. Meyerovich, Matthew E. Torok, Eric Atkinson, Rastislav Bodík: Superconductor: A Language for Big Data Visualization. LASH-C 2013

Charisee Chiw, Gordon Kindlmann, John Reppy, Lamont Samuels, Nick Seltzer: Diderot: A Parallel DSL for Image Analysis and Visualization. ACM PLDI 2012

A. S. Green, P. L. Lumsdaine, N. J. Ross, and B. Valiron: Quipper: A Scalable Quantum Programming Language. ACM PLDI 2013


Harnessing Complexity

  • Compilers can do

    • Automatic parallelization

    • Optimization of (parallel) code

    • DSL and code generation

  • But well-written, hand-optimized parallel code still outperforms compiler-based approaches


Harnessing Complexity

  • Compiler approaches

    • DSL, automatic parallelization, …

  • Library-based approaches


Some Examples

  • Pattern-oriented

    • MapReduce (Google)

  • Problem specific

    • FLASH, adaptive-mesh refinement (AMR) code

    • GROMACS, molecular dynamics

  • Hardware/programming model specific

    • Cactus

    • libWater*

best performance


Insieme Compiler and Research

  • Compiler infrastructure

  • Runtime support


Insieme Research: Automatic Task Partitioning for Heterogeneous HW

  • Heterogeneous platforms

    • E.g. CPU + 2 GPUs

  • Input: OpenCL for single device

  • Output: OpenCL code for multiple devices

  • Automatic partitioning of work-items between multiple devices (see the sketch below)

    • Based on hardware, program, and input size

  • Machine-learning approach
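
A minimal sketch of the mechanism (ours, not Insieme's implementation; buffer management is omitted): a 1-D NDRange is split at a chosen point, and each device executes its share via a global work offset.

#include <CL/cl.h>

void run_partitioned(cl_command_queue cpu_q, cl_command_queue gpu_q,
                     cl_kernel kernel, size_t global, size_t split) {
    size_t off_cpu = 0,     n_cpu = split;           /* CPU share */
    size_t off_gpu = split, n_gpu = global - split;  /* GPU share */
    clEnqueueNDRangeKernel(cpu_q, kernel, 1, &off_cpu, &n_cpu,
                           NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(gpu_q, kernel, 1, &off_gpu, &n_gpu,
                           NULL, 0, NULL, NULL);
    clFinish(cpu_q);
    clFinish(gpu_q);
}

In the cited approach, the split point itself is predicted by a machine-learning model from hardware, program, and input-size features.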

K. Kofler, I. Grasso, B. Cosenza, T. Fahringer: An Automatic Input-Sensitive Approach for Heterogeneous Task Partitioning. ACM International Conference on Supercomputing, 2013


Results – Architecture 1

[Chart: experimental results]


Results – Architecture 2

[Chart: experimental results]


Insieme Research: OpenCL on Clusters of Heterogeneous Nodes

  • libWater

  • OpenCL extensions for clusters

    • Event-based, extending the OpenCL event model

    • Supporting inter-device synchronization

  • DQL

    • A DSL for device query, management, and discovery

I. Grasso, S. Pellegrini, B. Cosenza, T. Fahringer: libWater: Heterogeneous Distributed Computing Made Easy. ACM International Conference on Supercomputing, 2013


libWater

  • Runtime

    • OpenCL

    • pthreads, OpenMP

    • MPI

  • DAG-based command/event representation (see the sketch below)
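
For intuition, a hedged sketch (ours, using plain OpenCL rather than the libWater API): event dependencies between enqueued commands already form a DAG on a single device; libWater extends this event model across devices and cluster nodes.

#include <CL/cl.h>

void enqueue_chain(cl_command_queue q, cl_kernel k, cl_mem buf,
                   void *host, size_t size, size_t gsize) {
    cl_event write_done, kernel_done;
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, size, host,
                         0, NULL, &write_done);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, NULL,
                           1, &write_done, &kernel_done); /* waits on write */
    clEnqueueReadBuffer(q, buf, CL_FALSE, 0, size, host,
                        1, &kernel_done, NULL);           /* waits on kernel */
}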


libWater: DAG Optimizations

  • Dynamic Collective communication pattern Replacement (DCR)

  • Latency hiding

  • Intra-node copy optimizations


Insieme (Ongoing) Research: Support for DSLs

[Diagram: Insieme DSL support architecture. Input codes and DSLs enter the Frontend and are lowered to the Intermediate Representation; a Transformation Framework (polyhedral model, parallel optimizations, stencil computation, automatic tuning support) operates on the IR with library support (rendering algorithm implementations, geometry loaders, …); the Backend emits output codes (pthreads, OpenCL, MPI) executed by the Runtime System on the target hardware: GPUs, CPUs, heterogeneous platforms, compute clusters.]


About Insieme

  • Insieme compiler

    • Research framework

    • OpenMP, Cilk, MPI, OpenCL

    • Runtime system, IR

    • Support for polyhedral model

    • Multi-objective optimization

    • Machine learning

    • Extensible

  • Insieme (GPL) and libWater (LGPL) soon available on GitHub

