Copperhead: A Python-like Data Parallel Language & Compiler



Bryan Catanzaro, UC Berkeley

Michael Garland, NVIDIA Research

Kurt Keutzer, UC Berkeley

Copperhead: A Python-like Data Parallel Language & Compiler

Universal Parallel Computing Research Center

University of California, Berkeley


Intro to CUDA

Overview

Multicore/Manycore

SIMD

Programming with millions of threads


The CUDA Programming Model

  • CUDA is a recent programming model, designed for

    • Manycore architectures

    • Wide SIMD parallelism

    • Scalability

  • CUDA provides:

    • A thread abstraction to deal with SIMD

    • Synchronization & data sharing between small groups of threads

  • CUDA programs are written in C + extensions

  • OpenCL is inspired by CUDA, but HW & SW vendor neutral

    • Programming model essentially identical


Multicore and Manycore

[Figure: Multicore vs. Manycore]

  • Multicore: yoke of oxen

    • Each core optimized for executing a single thread

  • Manycore: flock of chickens

    • Cores optimized for aggregate throughput, deemphasizing individual performance


Multicore & Manycore, cont.

[Figure: Core i7 and GTX285]


SIMD: Neglected Parallelism

  • It is difficult for a compiler to exploit SIMD

  • How do you deal with sparse data & branches?

    • Many languages (like C) are difficult to vectorize

    • Fortran is somewhat better

  • Most common solution:

    • Either forget about SIMD

      • Pray the autovectorizer likes you

    • Or instantiate intrinsics (assembly language)

    • Requires a new code version for every SIMD extension


What to do with SIMD?

[Figure: 4 way SIMD vs. 16 way SIMD]

  • Neglecting SIMD in the future will be more expensive

    • AVX: 8 way SIMD, Larrabee: 16 way SIMD

  • This problem composes with thread level parallelism


CUDA

  • CUDA addresses this problem by abstracting both SIMD and task parallelism into threads

  • The programmer writes a serial, scalar thread with the intention of launching thousands of threads

  • Being able to launch 1 Million threads changes the parallelism problem

    • It’s often easier to find 1 Million threads than 32: just look at your data & launch a thread per element

  • CUDA is designed for Data Parallelism

    • Not coincidentally, data parallelism is the only way for most applications to scale to 1000(+) way parallelism


Hello World


CUDA Summary

CUDA is a programming model for manycore processors

It abstracts SIMD, making it easy to use wide SIMD vectors

It provides good performance on today’s GPUs

In the near future, CUDA-like approaches will map well to many processors & GPUs

CUDA encourages SIMD friendly, highly scalable algorithm design and implementation


A Parallel Scripting Language

  • What is a scripting language?

    • Lots of opinions on this

    • I’m using an informal definition:

      • A language where performance is happily traded for productivity

    • Weak performance requirement of scalability

      • “My code should run faster tomorrow”

  • What is the analog of today’s scripting languages for manycore?


Data Parallelism

Assertion: Scaling to 1000 cores requires data parallelism

Accordingly, manycore scripting languages will be data parallel

They should allow the programmer to express data parallelism naturally

They should compose and transform the parallelism to fit target platforms


Warning: Evolving Project

Copperhead is still in embryo

We can compile a few small programs

Lots more work to be done in both language definition and code generation

Feedback is encouraged


Copperhead = Cu + python

  • Copperhead is a subset of Python, designed for data parallelism

  • Why Python?

    • Extant, well accepted high level scripting language

      • Free simulator(!!)

    • Already understands things like map and reduce

    • Comes with a parser & lexer

  • The current Copperhead compiler takes a subset of Python and produces CUDA code

    • Copperhead is not CUDA specific, but current compiler is


Copperhead is not Pure Python

[Figure: Python and Copperhead]

  • Copperhead is not for arbitrary Python code

    • Most features of Python are unsupported

  • Copperhead is compiled, not interpreted

  • Connecting Python code & Copperhead code will require binding the programs together, similar to Python-C interaction

  • Copperhead is statically typed


Saxpy: Hello world

def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

  • Some things to notice:

    • Types are implicit

      • The Copperhead compiler uses a Hindley-Milner type system with typeclasses similar to Haskell

      • Typeclasses are fully resolved in CUDA via C++ templates

    • Functional programming:

      • map, lambda (or equivalent in list comprehensions)

      • you can pass functions around to other functions

      • Closure: the variable ‘a’ is free in the lambda function, but bound to the ‘a’ in its enclosing scope
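Because Copperhead is a subset of Python, the saxpy definition above can also be tried under the ordinary Python interpreter. The following is a minimal sketch assuming Python 3, where map is lazy and needs list() to force evaluation; the sample inputs are made up for illustration.

```python
def saxpy(a, x, y):
    # 'a' is free in the lambda and is captured from the enclosing scope (a closure).
    return map(lambda xi, yi: a * xi + yi, x, y)

# Python 3's map returns an iterator, so list() forces the computation.
result = list(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
print(result)  # [12.0, 24.0, 36.0]
```

Running the interpreter this way only checks the semantics; it says nothing about how the compiled CUDA code performs.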


Type Inference, cont.

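c = a + b

+ : (Num0, Num0) -> Num0

Before inference, the three variables in c = a + b carry the unique generic types A145, A52 and A207.

After unification against the template for +, all three variables have type Num52.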

Copperhead includes function templates for intrinsics like add, subtract, map, scan, gather

Expressions are mapped against templates

Every variable starts out with a unique generic type, then types are resolved by union find on the abstract syntax tree

Tuple and function types are also inferred
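The union-find step described above can be sketched in a few lines of Python. This is an illustrative toy, not the Copperhead implementation: the TypeVar class and the find and union helpers are hypothetical names, and only the single template for + from the slide is modeled.

```python
class TypeVar:
    """A generic type variable; 'parent' links implement union-find."""
    def __init__(self, name):
        self.name = name
        self.parent = self

def find(t):
    # Follow parent links to the class representative, halving the path as we go.
    while t.parent is not t:
        t.parent = t.parent.parent
        t = t.parent
    return t

def union(s, t):
    # Merge the equivalence class of s into the class of t.
    find(s).parent = find(t)

# Every variable in `c = a + b` starts out with its own generic type.
types = {"c": TypeVar("A145"), "a": TypeVar("A52"), "b": TypeVar("A207")}

# Instantiate the template  + : (Num, Num) -> Num  with one fresh type variable,
# then unify both operands and the result with it.
num = TypeVar("Num52")
for var in ("a", "b", "c"):
    union(types[var], num)

resolved = {var: find(t).name for var, t in types.items()}
print(resolved)  # {'c': 'Num52', 'a': 'Num52', 'b': 'Num52'}
```

A real inference pass would also have to report conflicts, for example when two distinct concrete types end up in the same class; that is left out here.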


Data parallelism

  • Copperhead computations are organized around data parallel arrays

  • map performs a “forall” for each element in an array

    • Accesses must be local

  • Accessing non-local elements is done explicitly

    • shift, rotate, or gather

  • No side effects allowed
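To make the explicit non-local accesses concrete, here is a sequential Python sketch of shift, rotate and gather. The argument order, the fill value for shift and the direction of the offset are assumptions for illustration; the slides do not give the Copperhead signatures.

```python
def gather(x, indices):
    # out[i] = x[indices[i]]: an arbitrary, data-dependent read pattern.
    return [x[i] for i in indices]

def rotate(x, amount):
    # Cyclic access: element i reads its neighbor 'amount' places to the right.
    n = len(x)
    return [x[(i + amount) % n] for i in range(n)]

def shift(x, amount, default):
    # Like rotate, but reads that fall off the end yield 'default' instead of wrapping.
    n = len(x)
    return [x[i + amount] if 0 <= i + amount < n else default for i in range(n)]

x = [10, 20, 30, 40]
print(rotate(x, 1))           # [20, 30, 40, 10]
print(shift(x, 1, 0))         # [20, 30, 40, 0]
print(gather(x, [3, 0, 0]))   # [40, 10, 10]

# A neighbor difference is then an ordinary local map over x and a shifted copy.
diffs = list(map(lambda xi, right: right - xi, x, shift(x, 1, 0)))
print(diffs)                  # [10, 10, 10, -40]
```

Each helper returns a new list and leaves its input untouched, which matches the no-side-effects rule above.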


Copperhead primitives

  • map

  • reduce

  • Scans:

    • scan, rscan, segscan, rsegscan

    • exscan, exrscan, exsegscan, exrsegscan

  • Shuffles:

    • shift, rotate, gather, scatter
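The scan variants differ only in direction, inclusiveness and segmentation. The sequential Python sketch below shows one plausible reading of four of them. The exscan(fn, identity, x) argument order follows the split example later in the deck; the other signatures and the head-flag encoding of segments are assumptions.

```python
from operator import add

def scan(fn, x):
    """Inclusive scan: out[i] combines x[0] through x[i]."""
    out = []
    for xi in x:
        out.append(xi if not out else fn(out[-1], xi))
    return out

def exscan(fn, identity, x):
    """Exclusive scan: out[i] combines only the elements before position i."""
    out, acc = [], identity
    for xi in x:
        out.append(acc)
        acc = fn(acc, xi)
    return out

def rscan(fn, x):
    """Reverse inclusive scan: the same operation, run right to left."""
    return scan(fn, x[::-1])[::-1]

def segscan(fn, x, head_flags):
    """Segmented inclusive scan: a true head flag restarts the running value."""
    out = []
    for xi, head in zip(x, head_flags):
        out.append(xi if head or not out else fn(out[-1], xi))
    return out

print(scan(add, [1, 2, 3, 4]))                          # [1, 3, 6, 10]
print(exscan(add, 0, [1, 2, 3, 4]))                     # [0, 1, 3, 6]
print(rscan(add, [1, 2, 3, 4]))                         # [10, 9, 7, 4]
print(segscan(add, [1, 2, 3, 4, 5], [1, 0, 1, 0, 0]))   # [1, 3, 3, 7, 12]
```

These loops are serial reference semantics only; a parallel implementation relies on fn being associative so the work can be regrouped.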


Implementing Copperhead

Module(
  None,
  Stmt(
    Function(
      None,
      'saxpy',
      ['a', 'x', 'y'],
      0,
      None,
      Stmt(
        Return(
          CallFunc(
            Name('map'),
            Lambda(
              ['xi', 'yi'],
              0,
              Add(
                Mul(
                  Name('a'),
                  Name('xi')
                ),
                Name('yi')
              )
            ),
            Name('x'),
            Name('y'),
            None,
            None
          )
        )
      )
    )
  )
)

def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

The Copperhead compiler is written in Python

Python provides its own Abstract Syntax Tree

Type inference, code generation, etc. are done by walking the AST
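As a rough illustration of walking the AST, the sketch below uses the ast module from the Python 3 standard library. The tree printed on the slide has a different node vocabulary (Module, Stmt, CallFunc), so this is an analogous example rather than the compiler's own code; the CallCollector class is a hypothetical name.

```python
import ast

source = """
def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)
"""

tree = ast.parse(source)

class CallCollector(ast.NodeVisitor):
    """Records the name of every function that is called by a simple name."""
    def __init__(self):
        self.calls = []

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name):
            self.calls.append(node.func.id)
        self.generic_visit(node)  # keep walking into the call's arguments

collector = CallCollector()
collector.visit(tree)
print(collector.calls)  # ['map']

# Free variables of the lambda: names used in its body that are not parameters.
lam = next(n for n in ast.walk(tree) if isinstance(n, ast.Lambda))
params = {arg.arg for arg in lam.args.args}
used = {n.id for n in ast.walk(lam.body) if isinstance(n, ast.Name)}
free = used - params
print(free)  # {'a'}
```

The free-variable set is the information a compiler needs to turn the closure into an explicit extra argument.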


Compiling Copperhead to CUDA

  • Every Copperhead function creates at least one CUDA __device__ function

  • Top level Copperhead functions create a CUDA __global__ function, which orchestrates the __device__ function calls

  • The __global__ function takes care of allocating shared memory and returning data (storing it to DRAM)

  • Global synchronizations are implemented through multiple phases

    • All intermediate arrays & plumbing handled by Copperhead compiler


Saxpy Revisited

def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

// The lambda body; the closed-over variable 'a' becomes an extra argument.
template<typename Num> __device__ Num lambda0(Num xi, Num yi, Num a) {
    return ((a * xi) + yi);
}

// Per-thread work: guarded loads into registers, then one call to the lambda.
template<typename Num> __device__ void saxpy0Dev(Array<Num> x, Array<Num> y, Num a, uint _globalIndex, Num& _returnValueReg) {
    Num _xReg, _yReg;
    if (_globalIndex < x.length) _xReg = x[_globalIndex];
    if (_globalIndex < y.length) _yReg = y[_globalIndex];
    if (_globalIndex < x.length) _returnValueReg = lambda0<Num>(_xReg, _yReg, a);
}

// Kernel entry point: computes the global index and stores the result to DRAM.
template<typename Num> __global__ void saxpy0(Array<Num> x, Array<Num> y, Num a, Array<Num> _returnValue) {
    uint _blockMin = IMUL(blockDim.x, blockIdx.x);
    uint _blockMax = _blockMin + blockDim.x;
    uint _globalIndex = _blockMin + threadIdx.x;
    Num _returnValueReg;
    saxpy0Dev(x, y, a, _globalIndex, _returnValueReg);
    if (_globalIndex < _returnValue.length) _returnValue[_globalIndex] = _returnValueReg;
}
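The index arithmetic in the generated kernel can be traced with a serial Python emulation. This is only a sketch: the function names saxpy0_dev and launch_saxpy0 are hypothetical, and the grid size (enough blocks to cover x) is an assumption, since the slide does not show the launch configuration.

```python
def saxpy0_dev(x, y, a, global_index):
    # Mirrors saxpy0Dev: guarded loads, then the scalar lambda body.
    if global_index < len(x) and global_index < len(y):
        return a * x[global_index] + y[global_index]
    return None  # this thread has no element to process

def launch_saxpy0(x, y, a, block_dim):
    out = [None] * len(x)
    # Assumed launch configuration: round up so every element gets a thread.
    grid_dim = (len(x) + block_dim - 1) // block_dim
    for block_idx in range(grid_dim):           # stands in for blockIdx.x
        block_min = block_dim * block_idx
        for thread_idx in range(block_dim):     # stands in for threadIdx.x
            global_index = block_min + thread_idx
            value = saxpy0_dev(x, y, a, global_index)
            # Guarded store, as in the __global__ function.
            if global_index < len(out):
                out[global_index] = value
    return out

result = launch_saxpy0([1.0, 2.0, 3.0, 4.0, 5.0], [10.0] * 5, 2.0, block_dim=4)
print(result)  # [12.0, 14.0, 16.0, 18.0, 20.0]
```

With five elements and a block size of four, two blocks are launched and the last three threads of the second block fall outside the array, which is why both the loads and the store are guarded.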


Phases

[Figure: phase diagrams for Scan and Reduction, with stages labeled phase 0, phase 1 and phase 2]


Copperhead to CUDA, cont.

B = reduce(map(A))

D = reduce(map(C))

[Figure labels: phase 0, phase 1]

  • Compiler schedules computations into phases

    • Right now, this composition is done greedily

    • Compiler tracks global and local availability of all variables and creates a phase boundary when necessary

    • Fusing work into phases is important for performance
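The idea of fusing a map into the first phase of a reduction can be sketched serially in Python. This is an illustration under assumptions, not the compiler's scheduling algorithm: map_reduce_two_phase and block_size are hypothetical names, the blocks stand in for CUDA thread blocks, and the input is assumed to be non-empty.

```python
from functools import reduce
from operator import add

def map_reduce_two_phase(map_fn, reduce_fn, data, block_size):
    # Phase 0: each block maps and reduces its own slice. The mapped values
    # are consumed immediately, so no intermediate array is written out.
    partials = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        partials.append(reduce(reduce_fn, (map_fn(v) for v in block)))
    # Phase boundary: every partial result must be globally visible here.
    # Phase 1: combine the per-block partial results.
    return reduce(reduce_fn, partials)

total = map_reduce_two_phase(lambda v: v * v, add, [1, 2, 3, 4, 5, 6, 7], block_size=3)
print(total)  # 140
```

Without fusion, the map would be its own phase and its full output would have to be stored and read back before the reduction could start.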


Copperhead to CUDA, cont.

  • Shared memory used only for communicating between threads

    • Caching unpredictable accesses (gather)

    • Accessing elements with a uniform stride (shift & rotate)

  • Each device function returns its intermediate results through registers


Split

def split(input, value):
    flags = map(lambda a: 1 if a <= value else 0, input)              # phase 0
    notFlags = map(lambda a: not a, flags)                            # phase 0
    leftPositions = exscan(lambda a, b: a + b, 0, flags)              # phases 0-2
    rightPositions = exrscan(lambda a, b: a + b, 0, notFlags)         # phases 0-2
    positions = map(lambda a, b, flag: a if flag else len(input) - b - 1,
                    leftPositions, rightPositions, flags)             # phase 2
    return scatter(input, positions)                                  # phase 2

  • This code is decomposed into 3 phases

  • Copperhead compiler takes care of intermediate variables

  • Copperhead compiler uses shared memory for temporaries used in scans here

    • Everything else is in registers
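To check what split computes, it can be run in plain Python with sequential stand-ins for the primitives. The exscan, exrscan and scatter definitions below are assumed reference semantics written for this sketch (Copperhead's own versions are not shown in the deck), and list() is added because Python 3's map is lazy.

```python
def exscan(fn, identity, x):
    # Exclusive scan: out[i] combines the elements strictly before i.
    out, acc = [], identity
    for xi in x:
        out.append(acc)
        acc = fn(acc, xi)
    return out

def exrscan(fn, identity, x):
    # Exclusive reverse scan: out[i] combines the elements strictly after i.
    return exscan(fn, identity, x[::-1])[::-1]

def scatter(src, positions):
    # out[positions[i]] = src[i]
    out = [None] * len(src)
    for value, pos in zip(src, positions):
        out[pos] = value
    return out

def split(input, value):
    flags = list(map(lambda a: 1 if a <= value else 0, input))
    notFlags = list(map(lambda a: not a, flags))
    leftPositions = exscan(lambda a, b: a + b, 0, flags)
    rightPositions = exrscan(lambda a, b: a + b, 0, notFlags)
    positions = list(map(lambda a, b, flag: a if flag else len(input) - b - 1,
                         leftPositions, rightPositions, flags))
    return scatter(input, positions)

print(split([3, 8, 1, 9, 2], 3))  # [3, 1, 2, 8, 9]
```

Elements less than or equal to the pivot value move to the front and the rest to the back, with the original order preserved inside each group, so this is a stable partition of the kind used as a building block for sorting.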


Interpreting to Copperhead

If the interpreter harvested dynamic type information, it could use the Copperhead compiler as a backend

Fun project – see what kinds of information could be gleaned from the Python interpreter at runtime to figure out what should be compiled via Copperhead to a manycore chip
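One way to picture harvesting dynamic type information is a decorator that records the argument types a function is actually called with. Everything in this sketch is hypothetical (record_types, describe and the seen_signatures attribute are invented for illustration), and it only collects the information; handing it to the Copperhead compiler is not shown.

```python
import functools

def describe(value):
    # Summarize non-empty lists by element type, e.g. [1.0, 2.0] -> "list[float]".
    if isinstance(value, list) and value:
        return "list[%s]" % type(value[0]).__name__
    return type(value).__name__

def record_types(fn):
    """Hypothetical decorator: remembers each distinct argument-type signature."""
    seen = []

    @functools.wraps(fn)
    def wrapper(*args):
        signature = tuple(describe(a) for a in args)
        if signature not in seen:
            seen.append(signature)
        return fn(*args)

    wrapper.seen_signatures = seen
    return wrapper

@record_types
def saxpy(a, x, y):
    return list(map(lambda xi, yi: a * xi + yi, x, y))

saxpy(2.0, [1.0, 2.0], [3.0, 4.0])
print(saxpy.seen_signatures)  # [('float', 'list[float]', 'list[float]')]
```

A signature observed this way could supply the concrete types that a statically typed backend needs before it can generate code.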


Future Work

  • Finish support for the basics

  • Compiler transformations

    • Nested data parallelism flattening

      • segmented scans

  • Retargetability

    • Threading Building Blocks/OpenMP/OpenCL

  • Bridge Python and Copperhead

  • Implement real algorithms with Copperhead

    • Vision/Machine Learning, etc.

