
Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Biagio Cosenza, Ph.D.

DPS Group, Institut für Informatik

Universität Innsbruck, Austria


Outline

  • Complexity in HPC

    • Parallel hardware

    • Optimizations

    • Programming models

  • Harnessing complexity

    • Automatic tuning

    • Automatic parallelization

    • DSLs

    • Abstractions for HPC

  • Related work in Insieme


Complexity in HPC


Complexity in Hardware

  • The need for parallel computing

  • Parallelism in hardware

  • Three walls

    • Power wall

    • Memory wall

    • Instruction-level parallelism (ILP) wall


The Power Wall

Power is expensive, but transistors are free

  • We can put more transistors on a chip than we have the power to turn on

  • Power efficiency challenge

    • Performance per watt is the new metric – systems are often constrained by power & cooling

  • This forces us to concede the battle for maximum performance of individual processing elements, in order to win the war for application efficiency through optimizing total system performance

  • Example

    • Intel Pentium 4 HT 670 (released in May 2005)

      • Clock rate: 3.8 GHz

    • Intel Core i7 3930K Sandy Bridge (released in Nov. 2011)

      • Clock rate: 3.2 GHz

    • Six years later the clock is lower: the added transistor budget went into more cores, not higher frequency


The Memory Wall

The growing disparity in speed between the CPU and off-chip memory has become an overwhelming bottleneck

  • It changes the way we optimize programs

    • Optimizing for memory access vs. optimizing computation

  • E.g., a multiply is no longer considered a painfully slow operation when compared to a load or store
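A minimal C++ sketch (added here for illustration, not from the slides) makes the shift concrete: both functions perform exactly the same multiplications, yet the traversal order decides how much work the memory system must absorb.

#include <vector>

// Both loops do n*n multiplications; only the access pattern differs.
void scale_row_major(std::vector<double>& a, int n, double s) {
    for (int i = 0; i < n; ++i)         // row by row: contiguous accesses,
        for (int j = 0; j < n; ++j)     // each cache line is fully reused
            a[i * n + j] *= s;
}

void scale_column_major(std::vector<double>& a, int n, double s) {
    for (int j = 0; j < n; ++j)         // column by column: strided accesses,
        for (int i = 0; i < n; ++i)     // roughly one cache miss per element for large n
            a[i * n + j] *= s;
}

On typical hardware the second version is several times slower for large n, even though the arithmetic is identical: the optimization target is the memory traffic, not the multiply.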


The ILP Wall

There are diminishing returns on finding more ILP

  • Instruction Level Parallelism

    • The potential overlap among instructions

    • Many ILP techniques

      • Instruction pipelining

      • Superscalar execution

      • Out-of-order execution

      • Register renaming

      • Branch prediction

  • The goal of compiler and processor designers is to identify and take advantage of as much ILP as possible

  • There is increasing difficulty in finding enough parallelism in a single instruction stream to keep a high-performance single-core processor busy


Parallelism in Hardware


The “Many-core” challenges

Tilera TILE-Gx (a many-core processor)

  • Many-core vs multi-core

    • Multi-core architectures and programming models suitable for 2 to 32 processors will not easily and incrementally evolve to serve many-core systems of 1000s of processors

    • Many-core is the future


What does it mean?

  • Hardware is evolving

    • The number of cores is the new Megahertz

  • We need

    • New programming models

    • New system software

    • New supporting architectures that are naturally parallel


New Challenges

  • Make it easy to write programs that execute efficiently on highly parallel computing systems

    • The target should be 1000s of cores per chip

    • Maximize productivity

  • Programming models should

    • be independent of the number of processors (see the sketch after this list)

    • support successful models of parallelism, such as task-level parallelism, word-level parallelism, and bit-level parallelism

  • “Autotuners” should play a larger role than conventional compilers in translating parallel programs
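A minimal sketch of what processor-count independence looks like in practice (illustrative OpenMP/C++, not from the slides): the code states only what is parallel; the runtime decides how many threads execute it, so the same source serves 4 or 4000 cores.

#include <vector>

// y += alpha * x: the pragma declares the parallelism, the OpenMP
// runtime picks the thread count for the machine at hand.
void axpy(double alpha, const std::vector<double>& x, std::vector<double>& y) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] += alpha * x[i];
}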


Parallel Programming Models

Real-Time Workshop (MathWorks), Binary Modular Data Flow Machine (TU Munich and AS Nuremberg), Pthreads, Erlang, Charm (Illinois), MPI, Cilk (MIT), HMPP, OpenMP, OpenACC, MapReduce (Google), OpenCL (Khronos Group), Brook (Stanford), DataCutter (Maryland), CUDA (NVidia), NESL (CMU), StreamIt (MIT & Microsoft), Borealis (Brown), HPCS Chapel (Cray), HPCS Fortress (Sun), Threading Building Blocks (Intel), HPCS X10 (IBM), Sequoia (Stanford)


Reconsidering…

  • Applications

    • What are the common parallel kernels and applications?

    • Parallel patterns

      • Instead of traditional benchmarks, design and evaluate parallel programming models and architectures on parallel patterns

      • A parallel pattern (“dwarf”) is an algorithmic method that captures a pattern of computation and communication

      • E.g. dense linear algebra, sparse algebra, spectral methods, …

  • Metrics

    • Scalability

      • An old belief was that less-than-linear scaling for a multiprocessor application is a failure

      • With the new hardware trend, this is no longer true

        • Any speedup is OK!


Harnessing Complexity


Harnessing Complexity

  • Compiler approaches

    • DSL, automatic parallelization, …

  • Library-based approaches


What can a compiler do for us?

  • Optimize code

  • Automatic tuning

  • Automatic code generation

    • e.g. in order to support different hardware

  • Automatically parallelize code


Automatic Parallelization

There are critical opinions on parallel programming models. The other way:

  • Auto-parallelizing compilers

    • Sequential code => parallel code

Wen-mei Hwu, University of Illinois at Urbana-Champaign: Why Sequential Programming Models Could Be the Best Way to Program Many-Core Systems

http://view.eecs.berkeley.edu/w/images/3/31/Micro-keynote-hwu-12-11-2006_.pdf


Automatic Parallelization

for (int i = 0; i < 100; i++) {
    A[i] = A[i+1];
}

  • Nowadays compilers have new “tools” for analysis

    • Polyhedral model

  • …but the resulting performance is still far from that of a manual parallelization approach

The compiler's polyhedral pipeline works on the IR: polyhedral extraction (SCoP detection, translation to the polyhedral model), then analyses and transformations on the model, then code generation (emitting IR code back from the model).

For the loop above, the polyhedral model records the iteration domain and the access sets:

D: { i in N : 0 <= i < 100 }

R: A[i+1] for each i in D

W: A[i] for each i in D

(the statement reads A[i+1] and writes A[i])
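The following C++ sketch (illustrative only, not the Insieme transformation) shows why this loop resists naive parallelization and one legal way to break the dependence:

#include <vector>

// A[i] = A[i+1] reads A[i+1], which iteration i+1 overwrites: a
// loop-carried anti-dependence (write-after-read). Running iterations
// concurrently could let the write happen before the read.
// Reading from a snapshot removes the dependence (at the cost of a copy):
void shift_left_parallel(std::vector<double>& a) {
    const std::vector<double> snapshot(a);       // breaks the WAR dependence
    #pragma omp parallel for
    for (int i = 0; i + 1 < static_cast<int>(a.size()); ++i)
        a[i] = snapshot[i + 1];                  // iterations now independent
}

Exposing exactly this kind of read/write-set reasoning, and proving such transformations legal, is what the polyhedral model gives the compiler.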


Autotuners vs. Traditional Compilers

  • Performance of future parallel applications will crucially depend on the quality of the code generated by the compiler

  • The compiler selects which optimizations to perform, chooses parameters for these optimizations, and selects from among alternative implementations of a library kernel

  • The resulting optimization space is large

  • A programming model may simplify the problem

    • but it does not solve it


Optimizations’ Complexity: An Example

Input

  • OpenMP code

  • Simple parallel codes

    • matrix multiplication, Jacobi, stencil3d, …

  • Few optimizations and tuning parameters

    • Tiling 2D/3D

    • # of threads

      Goal: optimize for performance and efficiency (a sketch of such a kernel follows)
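As an illustration, here is a tiled OpenMP matrix multiplication of the kind being tuned (a sketch under the usual formulation; the actual benchmark codes may differ). tile_i, tile_j, and num_threads are the tuning parameters, so every (tile_i, tile_j, num_threads) triple is one point in the search space.

#include <omp.h>

// C = A * B on n x n row-major matrices, tiled over i and j.
void matmul_tiled(const double* A, const double* B, double* C, int n,
                  int tile_i, int tile_j, int num_threads) {
    #pragma omp parallel for collapse(2) num_threads(num_threads)
    for (int ii = 0; ii < n; ii += tile_i)
        for (int jj = 0; jj < n; jj += tile_j)
            // work on one tile: better cache reuse for suitable tile sizes
            for (int i = ii; i < ii + tile_i && i < n; ++i)
                for (int j = jj; j < jj + tile_j && j < n; ++j) {
                    double sum = 0.0;
                    for (int k = 0; k < n; ++k)
                        sum += A[i * n + k] * B[k * n + j];
                    C[i * n + j] = sum;
                }
}

Even with just two tile sizes and a thread count, the cross product of plausible values quickly grows into thousands of configurations per kernel.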


Optimizations’ Complexity: An Example

  • Problem

    • Big search space

      • brute force takes years of computation

    • Analytical models fail to find the best configuration

  • Solution

    • Multi-objective search

      • Offline search for Pareto-front solutions

      • Runtime selection according to the objective

    • Multi-versioning (see the sketch below)
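A hedged sketch of the multi-versioning idea (all names here are illustrative, not the Insieme API): the compiler emits several versions of a region, each one a Pareto-front point measured offline, and the runtime scalarizes the objectives to select a version.

#include <functional>
#include <vector>

struct Version {
    double est_time;    // offline-measured execution time
    double est_energy;  // offline-measured energy usage
    std::function<void()> run;
};

// Pick and run the version that best matches the current objective;
// assumes a non-empty Pareto front.
void run_best(const std::vector<Version>& pareto_front, double time_weight) {
    auto score = [&](const Version& v) {
        // scalarize the two objectives with the user-supplied weight
        return time_weight * v.est_time + (1.0 - time_weight) * v.est_energy;
    };
    const Version* best = &pareto_front.front();
    for (const Version& v : pareto_front)
        if (score(v) < score(*best)) best = &v;
    best->run();
}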

H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch: A Multi-Objective Auto-Tuning Framework for Parallel Codes. ACM Supercomputing (SC), 2012


Optimizations’ Complexity

[Diagram] The framework's compile-time/runtime flow: (1) the Analyzer extracts code regions from the input code; (2) the Optimizer proposes configurations; (3) the configurations are measured on the parallel target platform; (4) the best solutions are kept; (5) the Backend emits multi-versioned code; (6) at runtime, the runtime system performs dynamic selection among the versions.

H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch: A Multi-Objective Auto-Tuning Framework for Parallel Codes. ACM Supercomputing (SC), 2012


Domain Specific Languages

  • Ease of programming

    • Use of domain specific concepts

      • E.g. “color”, “pixel”, “particle”, “atom”

    • Simple interface

  • Hide complexity

    • Data structures

    • Parallelization issues

    • Optimizations’ tuning

    • Address specific parallelization patterns (see the sketch after this list)
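A hypothetical sketch of such a DSL embedded in C++ (invented for illustration; it does not correspond to any specific framework): the user writes in terms of pixels, while the framework owns the data structures and the parallel loop.

#include <cstdint>
#include <vector>

struct Pixel { std::uint8_t r, g, b; };

struct Image {
    int width = 0, height = 0;
    std::vector<Pixel> data;

    // Apply a per-pixel function; parallelization is hidden here.
    template <typename F>
    void for_each_pixel(F f) {
        #pragma omp parallel for
        for (int i = 0; i < static_cast<int>(data.size()); ++i)
            f(data[i]);
    }
};

// Usage: domain concepts only; no threads, indices, or tuning knobs.
void to_grayscale(Image& img) {
    img.for_each_pixel([](Pixel& p) {
        auto y = static_cast<std::uint8_t>(0.299 * p.r + 0.587 * p.g + 0.114 * p.b);
        p.r = p.g = p.b = y;
    });
}

Because for_each_pixel promises independence across pixels, the framework (or a DSL compiler) is free to retarget the loop to threads, SIMD, or a GPU.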


Domain Specific Languages

  • DSL may help parallelization

    • Focus on domain concepts and abstractions

    • Language constraints may help automatic parallelization by compilers

  • 3 major benefits

    • Productivity

    • Performance

    • Portability and forward scalability


Domain-Specific Languages: GLSL Shaders (OpenGL)

[Diagram] OpenGL 4.3 pipeline: Vertex Data → Vertex Shader → Tessellation Control Shader → Tessellation Evaluation Shader → Geometry Shader → Primitive Setup and Rasterization → Fragment Shader → Blending, with Pixel Data and the Texture Store feeding the stages.


Vertex shader:

attribute vec3 vertex;
attribute vec3 normal;
attribute vec2 uv1;
uniform mat4 _mvProj;
uniform mat3 _norm;
varying vec2 vUv;
varying vec3 vNormal;

void main(void) {
    // compute position
    gl_Position = _mvProj * vec4(vertex, 1.0);
    vUv = uv1;
    // compute light info
    vNormal = _norm * normal;
}

Fragment shader:

varying vec2 vUv;
varying vec3 vNormal;
uniform vec3 mainColor;
uniform float specularExp;
uniform vec3 specularColor;
uniform sampler2D mainTexture;
uniform mat3 _dLight;
uniform vec3 _ambient;

void getDirectionalLight(vec3 normal, mat3 dLight, float specularExp,
                         out vec3 diffuse, out float specular) {
    vec3 ecLightDir = dLight[0]; // light direction in eye coordinates
    vec3 colorIntensity = dLight[1];
    vec3 halfVector = dLight[2];
    float diffuseContribution = max(dot(normal, ecLightDir), 0.0);
    float specularContribution = max(dot(normal, halfVector), 0.0);
    specular = pow(specularContribution, specularExp);
    diffuse = colorIntensity * diffuseContribution;
}

void main(void) {
    vec3 diffuse;
    float spec;
    getDirectionalLight(normalize(vNormal), _dLight, specularExp, diffuse, spec);
    vec3 color = max(diffuse, _ambient.xyz) * mainColor;
    gl_FragColor = texture2D(mainTexture, vUv) * vec4(color, 1.0)
                 + vec4(spec * specularColor, 0.0);
}



DSL Examples

Matlab, DLA DSLs (dense linear algebra), Python, shell scripts, SQL, XML, CSS, BPEL, …

  • Interesting recent research work:

Leo A. Meyerovich, Matthew E. Torok, Eric Atkinson, Rastislav Bodík: Superconductor: A Language for Big Data Visualization. LASH-C 2013

Charisee Chiw, Gordon Kindlmann, John Reppy, Lamont Samuels, Nick Seltzer: Diderot: A Parallel DSL for Image Analysis and Visualization. ACM PLDI 2012

A. S. Green, P. L. Lumsdaine, N. J. Ross, and B. Valiron: Quipper: A Scalable Quantum Programming Language. ACM PLDI 2013


Harnessing Complexity

  • Compilers can do

    • Automatic parallelization

    • Optimization of (parallel) code

    • DSL and code generation

  • But well-written, hand-optimized parallel code still outperforms compiler-based approaches


Harnessing Complexity

  • Compiler approaches

    • DSL, automatic parallelization, …

  • Library-based approaches


Some Examples

  • Pattern-oriented

    • MapReduce (Google)

  • Problem-specific

    • FLASH, an adaptive-mesh refinement (AMR) code

    • GROMACS, molecular dynamics

  • Hardware/programming-model specific

    • Cactus

    • libWater*

[Diagram annotation: an arrow points toward “best performance”; the more specific the approach, the better the achievable performance.]


Insieme Compiler and Research

  • Compiler infrastructure

  • Runtime support


Insieme Research: Automatic Task Partitioning for Heterogeneous HW

  • Heterogeneous platforms

    • E.g. CPU + 2 GPUs

  • Input: OpenCL code for a single device

  • Output: OpenCL code for multiple devices

  • Automatic partitioning of work-items between multiple devices

    • Based on hardware, program features, and input size

  • Machine-learning approach
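A hedged C++ sketch of the partitioning step (identifiers invented for illustration; the paper trains machine-learning models offline on program features and input sizes):

#include <cstddef>

struct Features {
    int input_size;      // known only at runtime
    int arithmetic_ops;  // static code feature
    int memory_ops;      // static code feature
};

// Stand-in for the offline-trained model; placeholder heuristic only:
// favour the GPU for large, compute-intensive inputs.
double predict_gpu_fraction(const Features& f) {
    double intensity = double(f.arithmetic_ops) / double(f.memory_ops + 1);
    return (f.input_size > 100000 && intensity > 2.0) ? 0.9 : 0.5;
}

void partition_work(std::size_t global_items, const Features& f,
                    std::size_t& cpu_items, std::size_t& gpu_items) {
    double gpu_frac = predict_gpu_fraction(f);
    gpu_items = static_cast<std::size_t>(global_items * gpu_frac);
    cpu_items = global_items - gpu_items;
    // each part is then enqueued on its device, e.g. via
    // clEnqueueNDRangeKernel with an adjusted global offset and size
}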

K. Kofler, I. Grasso, B. Cosenza, T. Fahringer: An Automatic Input-Sensitive Approach for Heterogeneous Task Partitioning. ACM International Conference on Supercomputing, 2013


Results – Architecture 1


Results – Architecture 2


Insieme Research: OpenCL on Cluster of Heterogeneous Nodes

  • libWater

  • OpenCL extensions for clusters

    • Event-based, an extension of OpenCL events

    • Supporting intra-device synchronization

  • DQL

    • A DSL for device query, management, and discovery (see the hypothetical sketch after this list)
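Since the exact DQL syntax is not reproduced here, the following C++ sketch is purely hypothetical (both the query string and all identifiers are invented); it only illustrates the idea of querying cluster devices declaratively instead of walking the OpenCL platform hierarchy on every node.

#include <string>
#include <vector>

struct DeviceHandle { int node; int device; };  // invented for illustration

// hypothetical entry point; the real libWater interface differs
std::vector<DeviceHandle> dql_query(const std::string& query) {
    // stub: a real implementation would parse the query and discover
    // matching devices across the cluster
    return {};
}

void example() {
    // invented query syntax: "all GPU devices on the first four nodes"
    std::vector<DeviceHandle> gpus = dql_query("select gpu from nodes where id < 4");
    // the handles would then be used for buffer creation and kernel
    // launches, with inter-node data movement handled by the runtime
    (void)gpus;
}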

I. Grasso, S. Pellegrini, B. Cosenza, T. Fahringer: libWater: Heterogeneous Distributed Computing Made Easy. ACM International Conference on Supercomputing, 2013


libWater

  • Runtime

    • OpenCL

    • pthreads, OpenMP

    • MPI

  • DAG-based representation of command events


libWater: DAG Optimizations

  • Dynamic collective communication pattern replacement (DCR)

  • Latency hiding (see the sketch after this list)

  • Intra-node copy optimizations
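To illustrate the latency-hiding idea, here is standard OpenCL host code (ordinary OpenCL API calls, not the libWater interface): each kernel launch depends only on the event of its own asynchronous upload, so with an out-of-order queue (or one queue per buffer) the runtime is free to overlap chunk k's compute with chunk k+1's transfer.

#include <CL/cl.h>

// Double-buffered upload/compute: the enqueued commands form a small DAG
// linked by events instead of blocking host-side calls.
void process_chunks(cl_command_queue queue, cl_kernel kernel,
                    cl_mem buf[2], const float* host_ptr,
                    size_t chunk_elems, int num_chunks) {
    size_t chunk_bytes = chunk_elems * sizeof(float);
    for (int k = 0; k < num_chunks; ++k) {
        cl_event upload, compute;
        cl_mem b = buf[k % 2];
        // CL_FALSE: non-blocking write, returns immediately with an event
        clEnqueueWriteBuffer(queue, b, CL_FALSE, 0, chunk_bytes,
                             host_ptr + k * chunk_elems, 0, NULL, &upload);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &b);
        // the kernel waits only for its own upload
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &chunk_elems, NULL,
                               1, &upload, &compute);
        clReleaseEvent(upload);
        clReleaseEvent(compute);
    }
    clFinish(queue);  // drain the whole DAG
}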


Insieme (Ongoing) Research: Support for DSLs

[Diagram] Input codes and DSL programs enter the Frontend and are lowered to the Intermediate Representation. The Transformation Framework operates on the IR: polyhedral model, parallel optimizations, stencil computation, automatic tuning support. The Backend emits output codes (pthreads, OpenCL, MPI) for the target hardware: GPUs, CPUs, heterogeneous platforms, and compute clusters. A Runtime System and library support (rendering algorithm implementations, geometry loaders, …) complete the picture.


About Insieme

  • Insieme compiler

    • Research framework

    • OpenMP, Cilk, MPI, OpenCL

    • Runtime system and IR

    • Support for polyhedral model

    • Multi-objective optimization

    • Machine learning

    • Extensible

  • Insieme (GPL) and libWater (LGPL) soon available on GitHub

