JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Christos Kartsaklis

christos.kartsaklis@ichec.ie

ICHEC


Purpose

  • NVIDIA GPUs

    • Many self-conflicting parameters affect performance.

    • Some not nvcc-tunable.

  • JASM

    • Similar to a compiler back-end but programmable itself.

    • Different constructs to generate variants of the same kernel.

    • Explore the optimisation strategy space faster.

    • Aimed at CUDA programmers tackling instruction bottlenecks.


Structure

  • Introduction

    • Features absent from nvcc

    • Code compilation

    • JASM

  • Library Features

    • Dependencies

    • Aliasing

    • Predication

    • Snippets

    • Reverse compilation

  • Summary


Features absent from nvcc – 1/3

  • What is PTX?

    • Virtual machine ISA for NVIDIA’s GPUs.

    • Generated by nvcc (-ptx) and compiled by ptxas.

  • nvcc limitation

    • Cannot write inlined PTX in CUDA C/C++.

    • Some extra #pragmas needed.


Features absent from nvcc – 2/3

  • Predication

    • Not exposed; would like to have:

    • predicate p = ...; // some condition

    • #pragma predicate using p

    • if (p) { ...

  • Address space derivation

    • nvcc must be able to determine what space pointers refer to; cannot do:

    • *(__shared__ float*) someint = ...

    • double d = *(__constant__ double*) foo;


Features absent from nvcc – 3/3

  • Latency Hints

    • Hard for the compiler to determine (non)coalesced accesses; ideally:

    • #pragma !coalesced

    • foo = bar[i];

  • Reflection

    • __device__ float foo(...) { ... }

    • #pragma N=registersUsedBy(foo);


Code compilation – 1/2

  • Instruction Generation

    • Compiler does it for you

      • High-level code → intermediary form (I/F).

      • Transform the I/F.

      • Generate machine code from the transformed I/F.

    • How good is the generated code?

      • Need to manually inspect it.


Code compilation – 2/2

  • Instruction Scheduling

    • Core part of any compiler.

      • Determines the order that instructions will execute in.

    • Purpose

      • Correctness, latency hiding and ILP.

    • Problems

      • Hard to steer from a high-level language.

      • Compiler often generates its own code.

      • #pragma directives & compiler options.


JASM – 1/3

  • Dedicated tools, such as BAGEL

    • User selects instructions and BAGEL schedules.

    • Generating code written in C using the BAGEL lib.

    • Uses the notion of “abstract instructions”.

  • JASM

    • Similar philosophy – enhanced functionality.

  • Focus:

    • Reflection.

    • User-Programmable Instruction Scheduling.


JASM – 2/3

  • Basic Block (BB)

    • A bunch of instructions that do not change the flow of execution.

  • Control Flow Graph (CFG)

    • A directed graph of BBs where edges indicate changes in execution flow.

  • Instructions Stream

    • The order of instructions in memory.

    • Each instruction is “notionally” part of a BB.
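
The three notions above can be modelled in a few lines of Java. This is a minimal sketch for illustration only: `CfgSketch`, `Block`, `addEdge`, `instructionsStream` and `example` are hypothetical names, not JASM's classes, and instructions are kept as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; these are not JASM's classes.
class CfgSketch {
    static final class Block {
        final String label;
        final List<String> instructions = new ArrayList<>(); // straight-line code
        final List<Block> successors = new ArrayList<>();    // outgoing CFG edges
        Block(String label) { this.label = label; }
    }

    // A CFG edge records a possible change in execution flow.
    static void addEdge(Block from, Block to) { from.successors.add(to); }

    // The instructions stream is the memory order: block contents laid out back to back.
    static List<String> instructionsStream(List<Block> layout) {
        List<String> stream = new ArrayList<>();
        for (Block b : layout) stream.addAll(b.instructions);
        return stream;
    }

    // entry either continues into body or skips straight to exit.
    static List<Block> example() {
        Block entry = new Block("entry");
        Block body = new Block("body");
        Block exit = new Block("exit");
        entry.instructions.add("setp.eq.s32 %p, %r1, 0;");
        body.instructions.add("mov.s32 %r2, 5;");
        exit.instructions.add("st.global.s32 [%r3], %r2;");
        addEdge(entry, body);
        addEdge(entry, exit);
        addEdge(body, exit);
        return List.of(entry, body, exit);
    }

    public static void main(String[] args) {
        for (String s : instructionsStream(example())) System.out.println(s);
    }
}
```

The same three blocks could be laid out in a different memory order without changing the graph, which is why the stream and the CFG are kept as separate notions.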


JASM – 3/3

  • Examples

    • Append an instruction to a basic block:

      • lBB.append(LDSHAREDS32, lArg1, 4, lArg2);

    • Branching:

      • lSource.branchTo(lTargetBB, false);

    • Reorder:

      • lBB = lBB.reorder(BASIC_COST_FUNCTION, DA_PTX_DATA_REGISTER_HAZARDS, DA_ALIASING);

    • Predicate:

      • lBB = lBB.predicate(lP, lRegisterFile);

    • Obtain macro:

      • SnippetDescriptor lSD = CToHLA.obtain("x*x / (y-z)", ...);

      • BasicBlock lNewBB = lSD.instantiate(lArg1, ...);


Dependencies – 1/2

  • All contribute to the final instructions stream.

    • What is the ideal layout?

  • What complicates the compiler

    • Not enough information to distinguish between true and false dependencies.

    • Variable-latency instructions

      • E.g. coalesced vs non-coalesced accesses.
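
For reference, the usual register-level distinction between true and false dependencies can be sketched as follows. This is an illustrative model under stated assumptions, not JASM's dependency analysis: `DependencySketch`, `Instr`, `Kind` and `classify` are hypothetical names, and each instruction is reduced to the set of registers it writes and the set it reads. For memory accesses the same classification needs aliasing information, which is where a compiler typically runs short.

```java
import java.util.Set;

// Hypothetical register-level model; JASM's dependency analysis modules are not shown in the slides.
class DependencySketch {
    enum Kind { NONE, TRUE_RAW, FALSE_WAR, FALSE_WAW }

    static final class Instr {
        final Set<String> writes;
        final Set<String> reads;
        Instr(Set<String> writes, Set<String> reads) { this.writes = writes; this.reads = reads; }
    }

    private static boolean overlap(Set<String> a, Set<String> b) {
        for (String s : a) if (b.contains(s)) return true;
        return false;
    }

    // Classifies how 'later' depends on 'earlier' (both in program order).
    // A true dependency takes precedence when several kinds apply.
    static Kind classify(Instr earlier, Instr later) {
        if (overlap(earlier.writes, later.reads)) return Kind.TRUE_RAW;   // read after write
        if (overlap(earlier.reads, later.writes)) return Kind.FALSE_WAR;  // write after read
        if (overlap(earlier.writes, later.writes)) return Kind.FALSE_WAW; // write after write
        return Kind.NONE;
    }

    public static void main(String[] args) {
        Instr mul = new Instr(Set.of("%f1"), Set.of("%f2", "%f3")); // mul.f32 %f1, %f2, %f3;
        Instr add = new Instr(Set.of("%f4"), Set.of("%f1", "%f5")); // add.f32 %f4, %f1, %f5;
        System.out.println(classify(mul, add));
    }
}
```

Only true dependencies carry data; false ones come from reusing a name and can in principle be removed by renaming, which gives a scheduler more freedom.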


Dependencies – 2/2

  • JASM determines instruction order based on:

    • Dependency Analysis (DA) modules & Cost Function.

  • Full space exploration – no heuristics:

    • DAs constrain instructions’ motion in the stream.

    • Cost function estimates execution time for any stream.

    • Scheduling done by external constraints solver.

  • Only applicable to basic blocks.
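
The idea of exploring every legal ordering of a basic block and ranking the orderings by a cost function can be sketched as below. Several things here are assumptions: the slides say JASM hands scheduling to an external constraints solver, whereas this sketch simply enumerates orderings by brute force; the cost model (one instruction issued per cycle, a consumer waits for its producers' latencies) is a made-up stand-in for a real cost function; and `ScheduleSketch`, `cost` and `best` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Brute-force stand-in for a constraint solver; not JASM's scheduler.
class ScheduleSketch {
    // deps[i] lists the instructions that must precede instruction i.
    // latency[i] is the number of cycles before the result of i is usable.
    static int cost(List<Integer> order, int[][] deps, int[] latency) {
        int[] ready = new int[latency.length]; // cycle at which each result is available
        int cycle = 0;
        for (int i : order) {
            int issue = cycle;
            for (int d : deps[i]) issue = Math.max(issue, ready[d]); // stall on producers
            ready[i] = issue + latency[i];
            cycle = issue + 1; // in-order issue, one instruction per cycle
        }
        int end = cycle;
        for (int r : ready) end = Math.max(end, r);
        return end;
    }

    // Returns the cheapest ordering that respects every dependency.
    static List<Integer> best(int[][] deps, int[] latency) {
        List<Integer> best = new ArrayList<>();
        int[] bestCost = { Integer.MAX_VALUE };
        search(new ArrayList<>(), new boolean[latency.length], deps, latency, best, bestCost);
        return best;
    }

    private static void search(List<Integer> prefix, boolean[] placed, int[][] deps,
                               int[] latency, List<Integer> best, int[] bestCost) {
        int n = latency.length;
        if (prefix.size() == n) {
            int c = cost(prefix, deps, latency);
            if (c < bestCost[0]) { bestCost[0] = c; best.clear(); best.addAll(prefix); }
            return;
        }
        for (int i = 0; i < n; i++) {
            if (placed[i]) continue;
            boolean legal = true; // dependencies constrain where i may move
            for (int d : deps[i]) if (!placed[d]) { legal = false; break; }
            if (!legal) continue;
            placed[i] = true;
            prefix.add(i);
            search(prefix, placed, deps, latency, best, bestCost);
            prefix.remove(prefix.size() - 1);
            placed[i] = false;
        }
    }

    public static void main(String[] args) {
        // 0: load a, 1: use a, 2: load b, 3: use b (loads are slow, uses are fast)
        int[][] deps = { {}, {0}, {}, {2} };
        int[] latency = { 4, 1, 4, 1 };
        List<Integer> order = best(deps, latency);
        System.out.println(order + " costs " + cost(order, deps, latency));
    }
}
```

In this toy block, hoisting the second load above the first use overlaps the two load latencies. Enumeration grows factorially with block size, so it is only practical for short basic blocks.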


Aliasing – 1/3

  • PTX is not the final thing.

    • Further optimised by ptxas before machine code generation.

  • Want to specify exactly what is and what’s not aliased.

    • No #pragma aliased / disjoint in PTX.

  • Goal:

    • Simplify declaration of aliasing/disjoint accesses.

    • Handle all memory spaces.


Aliasing – 2/3

  • JASM Addressable Memory Regions (AMRs)

  • An n-ary tree of AMRs where:

    • Root nodes represent spaces (e.g. shared, global)

    • Each node is a memory region and a sub-region of its parent’s.

    • Siblings are disjoint regions that collectively make up their parent’s region.

  • Instructions are associated with AMRs.

    • AMRs predefined for CUDA memory spaces.
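
The tree rules above imply a simple aliasing test: two regions can overlap only if one is an ancestor of the other (or they are the same region). A minimal sketch, assuming a parent-pointer representation; `AmrSketch`, `Region`, `sub` and `mayAlias` are hypothetical names and not JASM's API.

```java
// Hypothetical model of an AMR tree; names and methods are illustrative, not JASM's API.
class AmrSketch {
    static final class Region {
        final String name;
        final Region parent; // null for a root, i.e. a whole memory space
        Region(String name, Region parent) { this.name = name; this.parent = parent; }
        Region sub(String childName) { return new Region(childName, this); }
    }

    // True if 'outer' is 'inner' itself or one of its ancestors.
    private static boolean contains(Region outer, Region inner) {
        for (Region r = inner; r != null; r = r.parent) if (r == outer) return true;
        return false;
    }

    // Siblings are disjoint, so two regions can overlap only if one contains the other.
    static boolean mayAlias(Region a, Region b) { return contains(a, b) || contains(b, a); }

    public static void main(String[] args) {
        Region global = new Region("global", null);
        Region input = global.sub("input");
        Region output = global.sub("output");
        System.out.println(mayAlias(input, output)); // sibling regions
        System.out.println(mayAlias(global, input)); // parent and child
    }
}
```

An instruction tagged with a region deep in the tree is therefore free to move past any instruction tagged with a sibling region, while an instruction tagged with a whole space stays ordered against everything inside that space.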


Aliasing – 3/3

  • Example: instructions associated with the global memory AMR:

    • st.global.f32 [%r0], %f5

    • ld.global.f64 %lf8, [%r1]

    • st.global.f32 [%r3], %f4

    • st.global.s32 [%r8], %s1

  • Generally:

    • No need for pointer grouping (à la “#pragma disjoint” etc.)

    • We work with instructions, not pointers.


Predication – 1/3

  • Conditional execution

    • if (foo==0) bar=5;

    • Thread divergence.

  • Predicated execution

    • setp.eq.s32 %p, foo, 0;

    • @%p mov.s32 bar, 5;

    • Non-divergent cycle-burner.

  • Fine line between the two.

  • Cannot predicate code explicitly in CUDA.


Predication – 2/3

  • Explicit

    • Can allocate predicates, modify them and predicate instructions.

  • Example:

    • Direct:

      Register lP = lRegFile.allocate(BT.PRED, "%p1");

      // @%p1 mov.s32 bar, 5

      lBB.append(PMOVS32, lP, lBar, new Immediate(5));

    • By reflection:

      Instruction lT = new InstructionImpl(

      MOVS32, lBar, new Immediate(5));

      lBB.append(lT.descriptor().toPredicatedVersion(),

      lP, lArg1, lArg2);


Predication – 3/3

  • Any basic block can be predicated.

    • Including already-predicated instructions.

  • Example:

    • @%p mov.s32 %d, %s; // if (%p) %d = %s;

    • Predicate by %q

  • Output:

    • @ %q and.pred %t, %p, %q;

    • @!%q mov.pred %t, %q; // i.e. %t = %q ? (%p && %q) : false;

    • @ %t mov.s32 %d, %s;
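
The three-instruction output can be checked by simulating it in plain Java. The sketch below mirrors each predicated instruction with an `if`; `NestedPredicateSketch` and `run` are hypothetical names, and booleans and ints stand in for the PTX predicate and data registers.

```java
// Simulation of the predicated sequence above; not generated by JASM.
class NestedPredicateSketch {
    // Returns the value of %d after the sequence, given its initial value d.
    static int run(boolean p, boolean q, int d, int s) {
        boolean t = false;      // initial value is irrelevant: exactly one of the next two lines writes %t
        if (q)  t = p && q;     // @ %q and.pred %t, %p, %q;
        if (!q) t = q;          // @!%q mov.pred %t, %q;
        if (t)  d = s;          // @ %t mov.s32  %d, %s;
        return d;
    }

    public static void main(String[] args) {
        for (boolean p : new boolean[] { false, true })
            for (boolean q : new boolean[] { false, true })
                System.out.println("p=" + p + " q=" + q + " d=" + run(p, q, 1, 7));
    }
}
```

The move into %d takes effect only when both %p and %q hold, which is the meaning of nesting the original `if (%p)` inside an outer `if (%q)`.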


Snippets – 1/3

  • Problem:

    • Certain tasks require knowing in advance how the compiler treats a piece of code.

  • Software pipelining

    • template<typename T>
      void vmult(T* aDst, T* aSrc1, T* aSrc2) {
        for (int i = 0; i < N; i++)
          aDst[i] = aSrc1[i] * aSrc2[i];
      }


Snippets – 2/3

  • Consider H/W vs S/W instructions

    • Tradeoff between pipeline stall & register pressure.

  • However, register pressure:

    • Is also a function of the # of thread blocks.

  • Ideally

    • Want to generate pipelined code for a variable number of dependencies.

  • Solution:

    • Encapsulate function in a reflective macro

      • Reflect instructions & dependencies.


Snippets – 3/3

  • Consider the complex “multiplication”

    • (a+ib)*(c+id) → (ac+bd)+i(ad+bc)

    • 2 stages: 2 muls, then 2 madds.

  • Snippet descriptor organisation:

    • Group 0:

      • mul.f32 ?x, ?a, ?c; mul.f32 ?y, ?a, ?d

    • Group 1:

      • mad.f32 ?x, ?b, ?d, ?x; mad.f32 ?y, ?b, ?c, ?y

    • ?* items are parameters.

  • Any basic block can be “snippetised”.
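
A snippet descriptor of this shape (groups of instruction templates with `?`-prefixed parameters) can be mimicked with string templates. This is a rough sketch only: `SnippetSketch`, `GROUPS`, `instantiate` and `parameters` are hypothetical names, JASM's real `SnippetDescriptor` is not reproduced here, and the `mad.f32` operands are written in PTX order (d = a*b + c) as an assumption about the intended computation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// String-template stand-in for a snippet descriptor; not JASM's SnippetDescriptor.
class SnippetSketch {
    // Group 0 must complete before group 1; instructions within a group are independent.
    static final List<List<String>> GROUPS = List.of(
        List.of("mul.f32 ?x, ?a, ?c;", "mul.f32 ?y, ?a, ?d;"),
        List.of("mad.f32 ?x, ?b, ?d, ?x;", "mad.f32 ?y, ?b, ?c, ?y;"));

    // Reflection: the parameter names a snippet expects, in sorted order.
    static Set<String> parameters(List<List<String>> groups) {
        Set<String> names = new TreeSet<>();
        Pattern param = Pattern.compile("\\?(\\w+)");
        for (List<String> group : groups)
            for (String template : group) {
                Matcher m = param.matcher(template);
                while (m.find()) names.add(m.group(1));
            }
        return names;
    }

    // Substitutes concrete registers for parameters, group by group.
    // Assumes no parameter name is a prefix of another (true for single-letter names).
    static List<String> instantiate(List<List<String>> groups, Map<String, String> args) {
        List<String> out = new ArrayList<>();
        for (List<String> group : groups)
            for (String template : group) {
                String instr = template;
                for (Map.Entry<String, String> e : args.entrySet())
                    instr = instr.replace("?" + e.getKey(), e.getValue());
                out.add(instr);
            }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> regs = Map.of("x", "%f0", "y", "%f1",
                                          "a", "%f2", "b", "%f3", "c", "%f4", "d", "%f5");
        System.out.println(parameters(GROUPS));
        for (String instr : instantiate(GROUPS, regs)) System.out.println(instr);
    }
}
```

Keeping the groups explicit is what lets a generator interleave several instances of the same snippet, stage by stage, when building pipelined code.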


Reverse compilation – 1/3

  • What to do with legacy CUDA code?

    • Option: Manually re-write in JASM.

  • No. Any PTX file can be loaded in JASM.

    • Not just loaded in.

    • Organised in basic blocks within a Control Flow Graph.

    • Malleable from then on, like any other JASM code.


Reverse compilation – 2/3

  • Inlined C in JASM

    • Idea: obtain a snippet from a C function.

    • Opposite of “inlined assembly in C”.

    • Why?

      • Reuse what nvcc makes available.

      • Enjoy the benefits that come with snippets.


Reverse compilation – 3/3

  • At the moment, we can do the following:

    • Code:

      • SnippetDescriptor lSD = CToHLA.obtain("(x*y) % 3", "int", "r", "int", "x", "int", "y");

    • r is the return parameter; x & y are arguments.

    • Equivalent to:

      • int r = (x*y) % 3;

  • Now we can write:

    • if(lSD.numberOfRegisters() > 5) { ...


Summary

  • NVIDIA GPUs

    • Many self-conflicting parameters affect performance.

    • Some not nvcc-tunable.

  • JASM

    • Similar to a compiler back-end but programmable itself.

    • Different constructs to generate variants of the same kernel.

    • Explore the optimisation strategy space faster.

  • The optimisations are expressed as a function of the code.

