JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Christos Kartsaklis ([email protected]), ICHEC

Presentation Transcript
Purpose
  • NVIDIA GPUs
    • Many mutually conflicting parameters affect performance.
    • Some not nvcc-tunable.
  • JASM
    • Similar to a compiler back-end but programmable itself.
    • Different constructs to generate variants of the same kernel.
    • Explore the optimisation-strategy space faster.
    • Aimed at CUDA programmers tackling instruction-level bottlenecks.
Structure
  • Introduction
    • Features absent from nvcc
    • Code compilation
    • JASM
  • Library Features
    • Dependencies
    • Aliasing
    • Predication
    • Snippets
    • Reverse compilation
  • Summary
Features absent from nvcc – 1/3
  • What is PTX?
    • Virtual machine ISA for NVIDIA’s GPUs.
    • Generated by nvcc (-ptx) and compiled by ptxas.
  • nvcc limitation
    • Cannot write inlined PTX in CUDA C/C++.
    • The features below would need extra #pragmas.
Features absent from nvcc – 2/3
  • Predication
    • Not exposed; would like to have:
    • predicate p = ...; // some condition
    • #pragma predicate using p
    • if (p) { ...
  • Address space derivation
    • nvcc must be able to determine what space pointers refer to; cannot do:
    • *((__shared__ float*) someint) = ...;
    • double d = *(__constant__ double*) foo;
Features absent from nvcc – 3/3
  • Latency Hints
    • Hard for the compiler to determine (non)coalesced accesses; ideally:
    • #pragma !coalesced
    • foo = bar[i];
  • Reflection
    • __device__ float foo(...) { ... }
    • #pragma N = registersUsedBy(foo);
Code compilation – 1/2
  • Instruction Generation
    • Compiler does it for you
      • High-level code → intermediate form (I/F).
      • Transform the I/F.
      • Generate machine code from the transformed I/F.
      • How good is the generated code?
      • Need to manually inspect it.
Code compilation – 2/2
  • Instruction Scheduling
    • Core part of any compiler.
      • Determines the order in which instructions execute.
    • Purpose
      • Correctness, latency hiding and ILP.
    • Problems
      • Hard to steer from a high-level language.
      • Compiler often generates its own code.
      • #pragma directives & compiler options.
JASM – 1/3
  • Dedicated tools, such as BAGEL
    • User selects instructions and BAGEL schedules.
    • The generator code is written in C using the BAGEL library.
    • Uses the notion of “abstract instructions”.
  • JASM
    • Similar philosophy – enhanced functionality.
  • Focus:
    • Reflection.
    • User-Programmable Instruction Scheduling.
JASM – 2/3
  • Basic Block (BB)
    • A sequence of instructions that does not change the flow of execution (single entry, single exit).
  • Control Flow Graph (CFG)
    • A directed graph of BBs where edges indicate changes in execution flow.
  • Instruction Stream
    • The order of instructions in memory.
    • Each instruction is “notionally” part of a BB.
JASM – 3/3
  • Examples
    • Append an instruction to a basic block:
      • lBB.append(LDSHAREDS32, lArg1, 4, lArg2);
    • Branching:
      • lSource.branchTo(lTargetBB, false);
    • Reorder:
      • lBB = lBB.reorder(BASIC_COST_FUNCTION, DA_PTX_DATA_REGISTER_HAZARDS, DA_ALIASING);
    • Predicate:
      • lBB = lBB.predicate(lP, lRegisterFile);
    • Obtain macro:
      • SnippetDescriptor lSD = CToHLA.obtain("x*x / (y-z)", ...);
      • BasicBlock lNewBB = lSD.instantiate(lArg1, ...);
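
Composed together, these calls form an ordinary Java pipeline. Below is a minimal sketch of such a pipeline; the JASM names are the ones shown above, while the helper method, its parameter list, and the imports are hypothetical illustrations:

// Hypothetical: JASM's real package and import names are not shown in the
// talk; assume the opcode/module constants are statically imported.
static BasicBlock buildFragment(BasicBlock lBB, BasicBlock lTargetBB,
                                Register lArg1, Register lArg2,
                                Register lP, RegisterFile lRegFile) {
    // Append a shared-memory load to the block.
    lBB.append(LDSHAREDS32, lArg1, 4, lArg2);
    // Re-schedule: full search constrained by the two DA modules,
    // ranked by the basic cost function.
    lBB = lBB.reorder(BASIC_COST_FUNCTION,
                      DA_PTX_DATA_REGISTER_HAZARDS, DA_ALIASING);
    // Guard every instruction in the block with predicate lP.
    lBB = lBB.predicate(lP, lRegFile);
    // Close the block with an unconditional branch to the next one.
    lBB.branchTo(lTargetBB, false);
    return lBB;
}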
Structure
  • Introduction
    • Features absent from nvcc
    • Code compilation
    • JASM
  • Library Features
    • Dependencies
    • Aliasing
    • Predication
    • Snippets
    • Reverse compilation
  • Summary
Dependencies – 1/2
  • All dependencies shape the final instruction stream.
    • What is the ideal layout?
  • What complicates the compiler's job
    • Not enough information to distinguish between true and false dependencies.
    • Variable-latency instructions
      • E.g. coalesced vs non-coalesced accesses.
Dependencies – 2/2
  • JASM determines instruction order based on:
    • Dependency Analysis (DA) modules & Cost Function.
  • Full space exploration – no heuristics:
    • DAs constrain instructions’ motion in the stream.
    • Cost function estimates execution time for any stream.
    • Scheduling is done by an external constraint solver.
  • Only applicable to basic blocks.
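
As a sketch of what the exploration looks like from user code (reorder() and the module names come from the example slide; scheduling the same block under two different DA sets is an illustrative use, not a documented idiom):

// Each DA module contributes motion constraints; the external solver
// searches every instruction order that satisfies them, and the cost
// function picks the cheapest stream. Comparing the two results shows
// what the aliasing information is worth for this block.
BasicBlock lWithoutAliasing = lBB.reorder(BASIC_COST_FUNCTION,
                                          DA_PTX_DATA_REGISTER_HAZARDS);
BasicBlock lWithAliasing    = lBB.reorder(BASIC_COST_FUNCTION,
                                          DA_PTX_DATA_REGISTER_HAZARDS,
                                          DA_ALIASING);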
Aliasing – 1/3
  • PTX is not the final code.
    • Further optimised by ptxas before machine code generation.
  • We want to specify exactly what is and is not aliased.
    • No #pragma aliased / disjoint in PTX.
  • Goal:
    • Simplify declaration of aliasing/disjoint accesses.
    • Handle all memory spaces.
Aliasing – 2/3
  • JASM Addressable Memory Regions (AMRs)
  • An n-ary tree of AMRs where:
    • Root nodes represent spaces (e.g. shared, global)
    • Each node is a memory region and a sub-region of its parent.
    • Siblings are disjoint regions that collectively make up their parent.
  • Instructions are associated with AMRs.
    • AMRs predefined for CUDA memory spaces.
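
The talk does not show the AMR-building API itself, so the Java sketch below is hypothetical throughout; it only illustrates the tree structure just described:

// Hypothetical names: AMR.GLOBAL as the predefined root for the global
// space, subRegion() to declare a child, associate() to tag an instruction.
AMR lGlobal = AMR.GLOBAL;                // root: the whole global space
AMR lInput  = lGlobal.subRegion("in");   // a sub-region of global memory
AMR lOutput = lGlobal.subRegion("out");  // sibling of lInput, hence disjoint
// A load tagged with lInput can never alias a store tagged with lOutput,
// so the scheduler is free to move the two past each other.
lLoad.associate(lInput);
lStore.associate(lOutput);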
Aliasing – 3/3
  • Example – four instructions, all associated with the global mem AMR:

01: st.global.f32 [%r0], %f5
02: ld.global.f64 %lf8, [%r1]
03: st.global.f32 [%r3], %f4
04: st.global.s32 [%r8], %s1

  • Generally:
    • No need for pointer grouping (à la “#pragma disjoint” etc.).
    • We work with instructions, not pointers.
Predication – 1/3
  • Conditional execution
    • if (foo==0) bar=5;
    • Thread divergence.
  • Predicated execution
    • setp.eq.s32 %p, foo, 0;
    • @%p mov.s32 bar, 5;
    • Non-divergent cycle-burner.
  • Fine line between the two.
  • Cannot predicate code explicitly in CUDA.
Predication – 2/3
  • Explicit
    • Can allocate predicates, modify them and predicate instructions.
  • Example:
    • Direct:

Register lP = lRegFile.allocate(BT.PRED, "%p1");
// @%p1 mov.s32 bar, 5
lBB.append(PMOVS32, lP, lBar, new Immediate(5));

    • By reflection:

Instruction lT = new InstructionImpl(MOVS32, lBar, new Immediate(5));
lBB.append(lT.descriptor().toPredicatedVersion(), lP, lArg1, lArg2);

Predication – 3/3
  • Any basic block can be predicated.
    • Including already-predicated instructions.
  • Example:
    • @%p mov.s32 %d, %s; // if (%p) %d = %s;
    • Predicate by %q
  • Output:
    • @%q  and.pred %t, %p, %q;
    • @!%q mov.pred %t, %q; // i.e. %t = %q ? (%p && %q) : false;
    • @%t  mov.s32 %d, %s;
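
From user code this nesting needs nothing special. A short sketch using allocate() and predicate() from the preceding slides (variable names are illustrative):

// Allocate the new predicate %q and predicate the whole block with it;
// JASM emits the and.pred/mov.pred sequence above for every instruction
// that already carries a predicate.
Register lQ = lRegFile.allocate(BT.PRED, "%q");
lBB = lBB.predicate(lQ, lRegFile);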
Snippets – 1/3
  • Problem:
    • Certain tasks require knowing in advance how the compiler treats a piece of code.
  • Software pipelining
    • template<typename T>
      void vmult(T* aDst, T* aSrc1, T* aSrc2) {
        for (int i = 0; i < N; i++)
          aDst[i] = aSrc1[i] * aSrc2[i];
      }
Snippets – 2/3
  • Consider H/W vs S/W pipelining
    • Tradeoff between pipeline stall & register pressure.
  • However, register pressure:
    • Is also a function of the # of thread blocks.
  • Ideally
    • Want to generate pipelined code for a variable number of dependencies.
  • Solution:
    • Encapsulate function in a reflective macro
      • Reflect instructions & dependencies.
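
A sketch of that idea, using CToHLA.obtain() and numberOfRegisters() from the reverse-compilation slides; the register budget and the depth computation are hypothetical illustrations:

// Reflect the loop body once, then let its measured register cost decide
// how many iterations can be software-pipelined in flight at a time.
SnippetDescriptor lBody = CToHLA.obtain(
    "x * y", "float", "r", "float", "x", "float", "y");
int lRegisterBudget = 16;  // hypothetical per-thread register budget
int lDepth = lRegisterBudget / lBody.numberOfRegisters();
// Instantiate lDepth copies of the body and interleave their loads and
// multiplies to hide memory latency (instantiation loop omitted).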
Snippets – 3/3
  • Consider the complex “multiplication”
    • (a+ib)*(c+id) → (ac+bd) + i(ad+bc)
    • 2 stages: 2 muls, then 2 mads.
  • Snippet descriptor organisation:
    • Group 0:
      • mul.f32 ?x, ?a, ?c; mul.f32 ?y, ?a, ?d
    • Group 1:
      • mad.f32 ?x, ?b, ?d, ?x; mad.f32 ?y, ?b, ?c, ?y
    • ?* items are parameters.
  • Any basic block can be “snippetised”.
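
Instantiating such a snippet follows the obtain/instantiate pattern from the JASM examples slide; in the sketch below, lCMulSD and the register variables standing in for the ?x … ?d parameters are illustrative:

// lCMulSD: a SnippetDescriptor holding the two groups above (hypothetical
// variable). Group 0's muls issue first; group 1's mads follow once their
// inputs are ready, and the scheduler may interleave several instances.
BasicBlock lCMul = lCMulSD.instantiate(lX, lY, lA, lB, lC, lD);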
Reverse compilation – 1/3
  • What to do with legacy CUDA code?
    • Option: Manually re-write in JASM.
  • No. Any PTX file can be loaded in JASM.
    • Not just loaded in.
    • Organised in basic blocks within a Control Flow Graph.
    • Malleable from there on, like any other JASM code.
Reverse compilation – 2/3
  • Inlined C in JASM
    • Idea: obtain a snippet from a C function.
    • Opposite of “inlined assembly in C”.
    • Why?
      • Reuse what nvcc makes available.
      • Enjoy the benefits that come with snippets.
Reverse compilation – 3/3
  • At the moment, we can do the following:
    • Code:
      • SnippetDescriptor lSD = CToHLA.obtain("(x*y) % 3", "int", "r", "int", "x", "int", "y");
    • r is the return parameter; x & y are arguments.
    • Equivalent to:
      • int r = (x*y) % 3;
  • Now we can write:
    • if(lSD.numberOfRegisters() > 5) { ...
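
That reflective query can drive code generation directly; in the sketch below, the alternative low-pressure snippet and the variable names are hypothetical:

// Choose between two formulations of the same computation based on how
// many registers the nvcc-compiled snippet actually needs.
BasicBlock lNewBB = (lSD.numberOfRegisters() > 5)
    ? lLowPressureSD.instantiate(lR, lX, lY)  // hypothetical alternative
    : lSD.instantiate(lR, lX, lY);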
Summary
  • NVIDIA GPUs
    • Many mutually conflicting parameters affect performance.
    • Some not nvcc-tunable.
  • JASM
    • Similar to a compiler back-end but programmable itself.
    • Different constructs to generate variants of the same kernel.
    • Explore the optimisation-strategy space faster.
  • The optimisations are expressed as a function of the code.