Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Biagio Cosenza, Ph.D.

DPS Group, Institut für Informatik

Universität Innsbruck, Austria


Outline

  • Complexity in HPC

    • Parallel hardware

    • Optimizations

    • Programming models

  • Harnessing complexity

    • Automatic tuning

    • Automatic parallelization

    • DSLs

    • Abstractions for HPC

  • Related work in Insieme


Complexity in HPC


Complexity in Hardware

  • The need for parallel computing

  • Parallelism in hardware

  • Three walls

    • Power wall

    • Memory wall

    • ILP wall (instruction-level parallelism)


The Power Wall

Power is expensive, but transistors are free

  • We can put more transistors on a chip than we have the power to turn on

  • Power efficiency challenge

    • Performance per watt is the new metric – systems are often constrained by power & cooling

  • This forces us to concede the battle for maximum performance of individual processing elements, in order to win the war for application efficiency through optimizing total system performance

  • Example

    • Intel Pentium 4 HT 670 (released in May 2005)

      • Clock rate 3.8 GHz

    • Intel Core i7-3930K Sandy Bridge (released in Nov. 2011)

      • Clock rate 3.2 GHz


The Memory Wall

The growing disparity in speed between the CPU and memory outside the CPU chip becomes an overwhelming bottleneck

  • It changes the way we optimize programs

    • Optimizing for memory access vs. optimizing computation

  • E.g., a multiply is no longer considered a harmfully slow operation compared to a load or a store (see the sketch below)
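
As a hedged illustration of what optimizing for memory means in practice (a generic example, not from the original slides): both functions below do exactly the same arithmetic, but the second walks the row-major matrix along contiguous memory and is usually far faster on cache-based machines.

// Both variants sum an n-by-n row-major matrix.
// Column-by-column traversal: consecutive accesses are n elements apart,
// so nearly every access touches a new cache line.
double sum_colwise(int n, const double *M) {
    double s = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            s += M[i * n + j];
    return s;
}

// Row-by-row traversal: accesses stream through contiguous memory and use
// every element of each cache line. Same arithmetic, very different cost.
double sum_rowwise(int n, const double *M) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += M[i * n + j];
    return s;
}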


The ILP Wall

There are diminishing returns on finding more ILP

  • Instruction Level Parallelism

    • The potential overlap among instructions

    • Many ILP techniques

      • Instruction pipelining

      • Superscalar execution

      • Out-of-order execution

      • Register renaming

      • Branch prediction

  • The goal of compiler and processor designers is to identify and take advantage of as much ILP as possible

  • It is increasingly difficult to find enough parallelism in a single instruction stream to keep a high-performance single-core processor busy (see the sketch below)
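
A small, hedged illustration (not from the slides) of why a single instruction stream runs out of ILP: in the first loop every multiply depends on the previous result, so a wide core cannot keep its execution units busy; splitting the reduction into independent partial products exposes more ILP, at the price of reassociating floating-point operations.

// A long dependence chain: each multiply needs the previous product,
// so the multiplies cannot overlap in the pipeline.
double product_chain(const double *a, int n) {
    double p = 1.0;
    for (int i = 0; i < n; i++)
        p = p * a[i];
    return p;
}

// Two independent partial products: the multiplies of p0 and p1 can
// execute in parallel on a superscalar, out-of-order core.
double product_unrolled(const double *a, int n) {
    double p0 = 1.0, p1 = 1.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        p0 *= a[i];
        p1 *= a[i + 1];
    }
    for (; i < n; i++)
        p0 *= a[i];
    return p0 * p1;
}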


Parallelism in Hardware


The “Many-Core” Challenges

Tilera TILE-Gx807

  • Many-core vs multi-core

    • Multi-core architectures and programming models suitable for 2 to 32 processors will not easily evolve incrementally to serve many-core systems of 1000s of processors

    • Many-core is the future


What does it mean?

  • Hardware is evolving

    • The number of cores is the new Megahertz

  • We need

    • New programming models

    • New system software

    • New supporting architectures that are naturally parallel


New Challenges

  • Make it easy to write programs that execute efficiently on highly parallel computing systems

    • The target should be 1000s of cores per chip

    • Maximize productivity

  • Programming models should

    • be independent of the number of processors

    • support successful models of parallelism, such as task-level parallelism, word-level parallelism, and bit-level parallelism

  • “Autotuners” should play a larger role than conventional compilers in translating parallel programs
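
A minimal sketch of what "independent of the number of processors" looks like in OpenMP (an illustrative example, not part of the original slides): the source names no core count; the runtime maps iterations onto however many threads the machine provides.

// SAXPY written once; the same code runs unchanged on 4 or 4000 cores,
// because the iteration-to-thread mapping is left to the runtime.
void saxpy(int n, float a, const float *x, float *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}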


Parallel Programming Models

Real-Time Workshop (MathWorks), Binary Modular Data Flow Machine (TU Munich and AS Nuremberg), Pthreads, Erlang, Charm (Illinois), MPI, Cilk (MIT), HMPP, OpenMP, OpenACC, MapReduce (Google), OpenCL (Khronos Group), Brook (Stanford), DataCutter (Maryland), CUDA (NVIDIA), NESL (CMU), StreamIt (MIT & Microsoft), Borealis (Brown), HPCS Chapel (Cray), HPCS Fortress (Sun), Thread Building Blocks (Intel), HPCS X10 (IBM), Sequoia (Stanford)


Reconsidering…

  • Applications

    • What are common parallel kernel applications?

    • Parallel patterns

      • Instead of traditional benchmarks, design and evaluate parallel programming models and architectures on parallel patterns

      • A parallel pattern (“dwarf”) is an algorithmic method that captures a pattern of computation and communication

      • E.g. dense linear algebra, sparse algebra, spectral methods, …

  • Metrics

    • Scalability

      • An old belief was that less-than-linear scaling for a multi-processor application is a failure

      • With the new hardware trends, this is no longer true

        • Any speedup is OK!
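
As a concrete instance of one such dwarf, here is a hedged sketch of the pattern captured by sparse linear algebra (illustrative, not from the slides): a CSR sparse matrix-vector product, i.e. irregular, indirect memory accesses combined with a simple reduction, which is what benchmarks built from this dwarf are meant to exercise.

// y = A * x with A stored in compressed sparse row (CSR) format.
void spmv_csr(int nrows, const int *rowptr, const int *colidx,
              const double *vals, const double *x, double *y) {
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = rowptr[r]; k < rowptr[r + 1]; k++)
            sum += vals[k] * x[colidx[k]];   // indirect access through colidx
        y[r] = sum;
    }
}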


Harnessing Complexity


Harnessing Complexity

  • Compiler approaches

    • DSL, automatic parallelization, …

  • Library-based approaches


What can a compiler do for us?

  • Optimize code

  • Automatic tuning

  • Automatic code generation

    • e.g. in order to support different hardware

  • Automatically parallelize code


Automatic Parallelization

Critical opinions on parallel programming models:

The other way:

  • Auto-parallelizing compilers

    • Sequential code => parallel code

Wen-mei Hwu, University of Illinois at Urbana-Champaign: “Why Sequential Programming Models Could Be the Best Way to Program Many-Core Systems”

http://view.eecs.berkeley.edu/w/images/3/31/Micro-keynote-hwu-12-11-2006_.pdf


Automatic Parallelization

for (int i = 0; i < 100; i++) {
    A[i] = A[i + 1];   // writes A[i], reads A[i+1]: a loop-carried anti-dependence
}

  • Nowadays compilers have new “tools” for analysis

    • Polyhedral model

  • …but performance is still far from that of a manual parallelization approach

[Diagram: polyhedral workflow. The compiler's IR goes through polyhedral extraction (SCoP detection, translation to the polyhedral model); analyses and transformations are applied to the model; code generation then emits IR code from the transformed model.]

For the loop above, the polyhedral representation is:

D: { i in N : 0 <= i < 100 }   (iteration domain)

R: A[i+1] for each i in D   (reads)

W: A[i] for each i in D   (writes)
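
A hedged sketch (not Insieme's actual output) of what can be done once the read and write sets above are known: iteration i must read the value A[i+1] had before the loop, so taking a snapshot of the array removes the loop-carried anti-dependence and leaves a trivially parallel loop. The element type of A is not shown on the slide; float is assumed here.

#include <string.h>

// Assumes A holds at least 101 floats, matching the loop bounds above.
void shift_left(float *A) {
    float tmp[101];
    memcpy(tmp, A, sizeof(tmp));     // snapshot of the pre-loop values
    #pragma omp parallel for         // every iteration is now independent
    for (int i = 0; i < 100; i++)
        A[i] = tmp[i + 1];
}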


Autotuners vs. Traditional Compilers

  • Performance of future parallel applications will crucially depend on the quality of the code generated by the compiler

  • The compiler selects which optimizations to perform, chooses parameters for these optimizations, and selects from among alternative implementations of a library kernel

  • The resulting space of optimizations is large

  • A programming model may simplify the problem

    • but it does not solve it


Optimizations’ Complexity: An Example

Input

  • OpenMP code

  • Simple parallel codes

    • matrix multiplication, Jacobi, stencil3d, …

  • Few optimizations and tuning parameters

    • 2D/3D tiling

    • # of threads

      Goal: optimize for both performance and efficiency (see the kernel sketch below)
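
To make the tuning parameters concrete, here is a hedged sketch of the kind of kernel meant above: a 2D-tiled, OpenMP-parallel matrix multiplication whose tile size and thread count (set via OMP_NUM_THREADS) are the knobs an auto-tuner would explore. The values are placeholders, not recommendations, and this is not the benchmark code used in the paper.

#define TILE 32   // tuning parameter: tile edge length

// n-by-n row-major matrices; C is assumed to be zero-initialized.
void matmul_tiled(int n, const double *A, const double *B, double *C) {
    #pragma omp parallel for collapse(2)   // thread count is the second knob
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int kk = 0; kk < n; kk += TILE)
                for (int i = ii; i < ii + TILE && i < n; i++)
                    for (int j = jj; j < jj + TILE && j < n; j++)
                        for (int k = kk; k < kk + TILE && k < n; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}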


Optimizations’ Complexity: An Example (cont.)

  • Problem

    • Big search space

      • brute force would take years of computation

    • Analytical models fail to find the best configuration

  • Solution

    • Multi-objective search

      • Offline search of Pareto front solutions

      • Runtime selection according to the objective

    • Multi-versioning

H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch: “A Multi-Objective Auto-Tuning Framework for Parallel Codes”, ACM Supercomputing (SC), 2012
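
A hedged sketch of the multi-objective idea (an illustration, not the framework's actual code): given a set of already-measured configurations, keep only the Pareto-optimal ones with respect to execution time and energy; these survivors become the code versions among which the runtime later selects.

#include <stdbool.h>

typedef struct {
    int tile, threads;      // tuning parameters of one configuration
    double time, energy;    // measured objectives (filled in by the tuner)
} Config;

// a dominates b if it is no worse in both objectives and better in at least one.
static bool dominates(const Config *a, const Config *b) {
    return a->time <= b->time && a->energy <= b->energy &&
           (a->time < b->time || a->energy < b->energy);
}

// Copies the non-dominated configurations into front[] and returns their count.
int pareto_front(const Config *cands, int n, Config *front) {
    int k = 0;
    for (int i = 0; i < n; i++) {
        bool dominated = false;
        for (int j = 0; j < n; j++)
            if (j != i && dominates(&cands[j], &cands[i])) { dominated = true; break; }
        if (!dominated)
            front[k++] = cands[i];
    }
    return k;
}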


Optimizations’ Complexity

[Figure: the multi-objective auto-tuning workflow. At compile time, the analyzer extracts code regions from the input code; the optimizer explores configurations, guided by measurements taken through the backend on the parallel target platform, and keeps the best solutions, which are emitted as multi-versioned code. At runtime, the runtime system performs dynamic selection among the generated versions.]

H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch: “A Multi-Objective Auto-Tuning Framework for Parallel Codes”, ACM Supercomputing (SC), 2012


Domain Specific Languages

  • Ease of programming

    • Use of domain specific concepts

      • E.g. “color”, “pixel”, “particle”, “atom”

    • Simple interface

  • Hide complexity

    • Data structures

    • Parallelization issues

    • Optimizations’ tuning

    • Address specific parallelization patterns


Domain Specific Languages

  • DSLs may help parallelization

    • Focus on domain concepts and abstractions

    • Language constraints may help automatic parallelization by compilers

  • 3 major benefits

    • Productivity

    • Performance

    • Portability and forward scalability


Domain Specific Languages: GLSL Shader (OpenGL)

[Figure: the OpenGL 4.3 pipeline. Vertex data → vertex shader → tessellation control shader → tessellation evaluation shader → geometry shader → primitive setup and rasterization → fragment shader → blending → pixel data, with access to the texture store.]


// Vertex shader
attribute vec3 vertex;
attribute vec3 normal;
attribute vec2 uv1;

uniform mat4 _mvProj;
uniform mat3 _norm;

varying vec2 vUv;
varying vec3 vNormal;

void main(void) {
    // compute position
    gl_Position = _mvProj * vec4(vertex, 1.0);
    vUv = uv1;

    // compute light info
    vNormal = _norm * normal;
}

// Fragment shader
varying vec2 vUv;
varying vec3 vNormal;

uniform vec3 mainColor;
uniform float specularExp;
uniform vec3 specularColor;
uniform sampler2D mainTexture;
uniform mat3 _dLight;
uniform vec3 _ambient;

void getDirectionalLight(vec3 normal, mat3 dLight, float specularExponent, out vec3 diffuse, out float specular) {
    vec3 ecLightDir = dLight[0];       // light direction in eye coordinates
    vec3 colorIntensity = dLight[1];
    vec3 halfVector = dLight[2];
    float diffuseContribution = max(dot(normal, ecLightDir), 0.0);
    float specularContribution = max(dot(normal, halfVector), 0.0);
    specular = pow(specularContribution, specularExponent);
    diffuse = colorIntensity * diffuseContribution;
}

void main(void) {
    vec3 diffuse;
    float spec;
    getDirectionalLight(normalize(vNormal), _dLight, specularExp, diffuse, spec);
    vec3 color = max(diffuse, _ambient.xyz) * mainColor;
    gl_FragColor = texture2D(mainTexture, vUv) * vec4(color, 1.0) + vec4(spec * specularColor, 0.0);
}

(The first listing is the vertex shader and the second the fragment shader; they implement the vertex and fragment stages of the pipeline shown above.)


DSL Examples

Matlab, DLA DSL (dense linear algebra), Python, shell script, SQL, XML, CSS, BPEL, …

  • Interesting recent research work

Leo A. Meyerovich, Matthew E. Torok, Eric Atkinson, Rastislav Bodík: “Superconductor: A Language for Big Data Visualization”, LASH-C 2013

Charisee Chiw, Gordon Kindlmann, John Reppy, Lamont Samuels, Nick Seltzer: “Diderot: A Parallel DSL for Image Analysis and Visualization”, ACM PLDI 2012

A. S. Green, P. L. Lumsdaine, N. J. Ross, and B. Valiron: “Quipper: A Scalable Quantum Programming Language”, ACM PLDI 2013


Harnessing Complexity

  • Compilers can do

    • Automatic parallelization

    • Optimization of (parallel) code

    • DSL and code generation

  • But well-written, hand-optimized parallel code still outperforms compiler-based approaches


Harnessing Complexity

  • Compiler approaches

    • DSL, automatic parallelization, …

  • Library-based approaches


Some Examples

  • Pattern oriented

    • MapReduce (Google)

  • Problem specific

    • FLASH, adaptive-mesh refinement (AMR) code

    • GROMACS, molecular dynamics

  • Hardware/programming model specific

    • Cactus

    • libWater*



Insieme Compiler and Research

  • Compiler infrastructure

  • Runtime support


Insieme Research: Automatic Task Partitioning for Heterogeneous HW

  • Heterogeneous platforms

    • E.g. CPU + 2 GPUs

  • Input: OpenCL for single device

  • Output: OpenCL code for multiple devices

  • Automatic partitioning of work-items between multiple devices

    • Based on hardware, program features, and input size

  • Machine-learning approach

K. Kofler, I. Grasso, B. Cosenza, T. Fahringer: “An Automatic Input-Sensitive Approach for Heterogeneous Task Partitioning”, ACM International Conference on Supercomputing, 2013
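
A hedged sketch of the work-item partitioning idea (illustrative only: the paper predicts the split with a machine-learning model, and the function and parameter names below are made up). It assumes a single OpenCL context containing both devices, with the kernel, its arguments and buffers already set up; the 1D index space is simply divided between a CPU queue and a GPU queue. Error handling is omitted.

#include <CL/cl.h>

// Launches `global` work-items of a prepared 1D kernel, giving the fraction
// `gpu_share` of them to the GPU device and the remainder to the CPU device.
void launch_partitioned(cl_command_queue q_cpu, cl_command_queue q_gpu,
                        cl_kernel kernel, size_t global, double gpu_share) {
    size_t gpu_items = (size_t)(global * gpu_share);
    size_t cpu_items = global - gpu_items;
    size_t gpu_offset = cpu_items;   // GPU part starts where the CPU part ends

    // First chunk of the index space on the CPU device...
    clEnqueueNDRangeKernel(q_cpu, kernel, 1, NULL, &cpu_items, NULL, 0, NULL, NULL);
    // ...remaining work-items, shifted by a global offset, on the GPU device.
    clEnqueueNDRangeKernel(q_gpu, kernel, 1, &gpu_offset, &gpu_items, NULL, 0, NULL, NULL);

    clFinish(q_cpu);
    clFinish(q_gpu);
}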


Results – Architecture 1


Results – Architecture 2


Insieme Research: OpenCL on Cluster of Heterogeneous Nodes

  • libWater

  • OpenCL extensions for clusters

    • Event-based, an extension of OpenCL events

    • Supporting intra-device synchronization

  • DQL

    • A DSL for device query, management, and discovery

I. Grasso, S. Pellegrini, B. Cosenza, T. Fahringer: “libWater: Heterogeneous Distributed Computing Made Easy”, ACM International Conference on Supercomputing, 2013


libWater

  • Runtime

    • OpenCL

    • pthreads, OpenMP

    • MPI

  • DAG representation of command events


libWater: DAG Optimizations

  • Dynamic Collective communication pattern Replacement (DCR)

  • Latency hiding

  • Intra-node copy optimizations


Insieme (Ongoing) Research: Support for DSLs

[Figure: Insieme architecture for DSL support. DSL and other input codes enter the frontend, which translates them into the intermediate representation; the transformation framework (polyhedral model, parallel optimizations, stencil computation, automatic tuning support) operates on the IR; the backend generates output codes (pthreads, OpenCL, MPI) for the target hardware (GPU, CPU, heterogeneous platforms, compute clusters), assisted by the runtime system and by library support (rendering algorithm implementations, geometry loader, …).]


About Insieme

  • Insieme compiler

    • Research framework

    • OpenMP, Cilk, MPI, OpenCL

    • Runtime system, IR

    • Support for polyhedral model

    • Multi-objective optimization

    • Machine learning

    • Extensible

  • Insieme (GPL) and libWater (LGPL) soon available on GitHub

