html5-img
1 / 50

Overview of Ocelot: architecture

Overview of Ocelot: architecture. Overview. GPU Ocelot overview Building, configuring, and executing Ocelot programs Ocelot Device Interface and CUDA Runtime API Ocelot PTX Internal Representation PTX Pass Manager. 2. Ocelot: Multiplatform Dynamic Compilation.

kemp
Download Presentation

Overview of Ocelot: architecture

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Overview of Ocelot: architecture

  2. Overview • GPU Ocelot overview • Building, configuring, and executing Ocelot programs • Ocelot Device Interface and CUDA Runtime API • Ocelot PTX Internal Representation • PTX Pass Manager 2

  3. Ocelot: Multiplatform Dynamic Compilation Just-in-time code generation and optimization for data intensive applications esd.lbl.gov Data Parallel IR Language Front-End R. Domingo & D. Kaeli (NEU) • Environment for i) compiler research, ii) architecture research, and iii) productivity tools 3

  4. NVIDIA’s Compute Unified Device Architecture (CUDA) • Integrate the concept of a compute kernel called from standard languages • Multithreaded host programs • The compute kernel specifies data parallel computation as thousands of threads • An accelerator model of computing • Explicit functions for off-loading computation to GPUs • Data movement explicitly managed by the programmer 4

  5. NVIDIA’s Compute Unified Device Architecture (CUDA) Host GPU • For access to CUDA tutorials http://developer.nvidia.com/cuda-education-training 5

  6. Structure of a Compute Kernel Parallel Thread Execution (PTX) instruction set architecture • Arrays of (data parallel) thread blocks called cooperative thread arrays (CTAs) • Barrier synchronization • Mapped to single instruction stream multiple data stream (SIMD) processor 6

  7. NVIDIA Fermi GF 100 • 4 Global Processing Clusters (GPCs) containing 4 SMs each • Each SM has 32 ALUs, 4 SFUs, and 16 LS units • Each ALU has access to 1024 32bit registers (total of 128kB per SM) • Each SM has its own Shared Memory/L1 cache (64kB total) • Unified L2 cache (768kB) • Six 64bit Memory Controllers (total 384bit wide) Streaming multiprocessor (SM) ALU 7

  8. Ocelot Structure1 PTX Kernel CUDA Application nvcc • Ocelot is built with nvcc and the LLVM backend • Structured around a PTX IR LLVM IR Translator • Compile stock CUDA applications without modification • 1G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: A Dynamic Optimizing Compiler for Bulk Synchronous Applications in Heterogeneous Systems,” PACT, September 2010. .

  9. CUDA to PTX • PTX modules stored as string literals in fat binary • We ignore accompanying binary image (GPU native binary) 9

  10. Overview • GPU Ocelot overview • Building, configuring, and executing Ocelot programs • Ocelot Device Interface and CUDA Runtime API • Ocelot PTX Internal Representation • PTX Pass Manager 10

  11. Dependencies • Software • C++ Compiler (GCC 4.5.x) • LexLexer Generator (Flex 2.5.35) • YACC Parser Generator (Bison 2.4.1) • Scons (Python 2.7) • LLVM (3.1) • Libraries • boost_system (1.46) • boost_filesystem (1.46) • boost_serialization (1.46) • GLEW (optional for GL interop) (1.5) • GL (for NVIDIA GPU Devices) • Library headers • Boost (1.46) http://code.google.com/p/gpuocelot/wiki/Installation 11

  12. Ocelot Source Code • Freely available via Google Code project site (New BSD License) • ocelot/ • analysis/ -- analysis passes • api/ -- Ocelot-specific API extensions • cuda/ -- implements CUDA runtime • executive/ -- Device interface and backend implementations • ir/ -- internal representations (PTX, LLVM, AMD IL) • parser/ -- parser (to PTX) • tools/ -- standalone applications using Ocelot • trace/ -- trace generation and analysis tools • translator/ -- translators from PTX to LLVM and AMD IL • transforms/ -- program transformations http://code.google.com/p/gpuocelot/ svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only 12

  13. Building GPU Ocelot • Obtain source code • svn checkout  http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only • Compile with Scons • sudo ./build.py –install • Build and execute unit tests • sudo ./build.py –test=full • Output appears in .release_build • libocelot.so • OcelotConfig • Tests • Installation directory: • /usr/local/include/ocelot • /usr/local/lib http://code.google.com/p/gpuocelot/wiki/Installation

  14. Configuring Ocelot trace: { memoryChecker: { enabled: true, checkInitialization: false }, raceDetector: { enabled: false, ignoreIrrelevantWrites: true }, debugger: { enabled: false, kernelFilter: "_Z13scalarProdGPUPfS_S_ii", alwaysAttach: true }, }, executive: { devices: [ "emulated" ], } } • configure.ocelot • Controls Ocelot’s initial state • Located in application’s startup directory • trace specifies which trace generators are initially attached • executive controls device properties • trace: • memoryChecker – ensures • raceDetector - enforces synchronized access to .shared • debugger - interactive debugger • executive: • devices: • List of Ocelot backend devices that are enabled • nvidia- NVIDIA GPU backend • emulated – Ocelot PTX emulator (trace generators) • llvm – efficient execution of PTX on multicore CPU • amd – translation to AMD IL for PTX on AMD RADEON GPU 14

  15. Building and Executing CUDA Programs • nvcc -c example.cu -arch sm_23 • g++ -o example example.o `OcelotConfig -l` • `OcelotConfig -l` expands to ‘-locelot’ • libocelot.so replaces libcudart.so

  16. Overview • GPU Ocelot overview • Building, configuring, and executing Ocelot programs • Ocelot Device Interface and CUDA Runtime API • Ocelot PTX Internal Representation • PTX Pass Manager 16

  17. CUDA Runtime API • Ocelot implements CUDA Runtime API • Transparent hooks into existing CUDA applications • override methods of cuda::CudaDeviceInterface • Maps CUDA RT onto Ocelot device interface abstraction • cuda::CudaRuntime • Extended through custom Ocelot API • e.g. ocelot::registerPTXModule( ); 17

  18. Ocelot CUDA Runtime Overview • A reimplementation of the CUDA Runtime API • Compatible with existing applications • Link against libocelot.so instead of libcudart R. Domingo & D. Kaeli (NEU) Kernels execute anywhere  Key to portability!

  19. Ocelot CUDA Runtime • Clean device abstraction • All back-ends implement same interface • Ocelot API Extensions • Add/remove trace generators • Compile/launch kernels directly in PTX • Device memory sharing among host threads • Device switching 19

  20. Ocelot Source Code: CUDA Runtime API • ocelot/ • analysis/ -- analysis passes • api/ -- Ocelot-specific API extensions • cuda/ -- implements CUDA runtime • interface/CudaRuntimeInterface.h • interface/CudaRuntime.h • interface/CudaRuntimeContext.h • interface/FatBinaryContext.h • interface/CudaDriverFrontend.h • executive/ -- Device interface and backend implementations • ir/ -- internal representations (PTX, LLVM, AMD IL) • parser/ -- parser (to PTX) • tools/ -- standalone applications using Ocelot • trace/ -- trace generation and analysis tools • translator/ -- translators from PTX to LLVM and AMD IL • transforms/ -- program transformations 20

  21. Ocelot CUDA Runtime API Implementation • Implement interface defined by cuda::CudaRuntimeInterface • ocelot/cuda/interface/CudaRuntime.h • ocelot/cuda/implementation/CudaRuntime.cpp • class cuda::CudaRuntime • cuda::CudaRuntime members • Host thread contexts • Ocelot devices • Registered modules, textures, kernels • Fat binaries • Global mutex • CUDA Runtime API functions • eg. cudaMemcpy, cudaLaunch, __cudaRegisterModule(), • Additional functions • eg. _lock(), _unlock(), _registerModule()

  22. Ocelot Source Code: Device Interface • ocelot/ • executive/ -- Device interface and backend implementations • interface/Device.h • interface/EmulatorDevice.h • interface/NVIDIAGPUDevice.h • interface/MulticoreCPUDevice.h • interface/ATIGPUDevice.h 22

  23. Ocelot Device Interface • class executive::Device • Succinct interface for device objects • Module registration • Memory management • Kernel configuration and launching • Global variable and texture management • OpenGL interoperability • Streams and Events • Trace generators • Minimal set of APIs for device-oriented programming model • 57 functions (versus CUDA Runtime’s 120+) • Capture device state: • Memory allocations, global variables, textures, graphics interoperability • Facilitate creation of backend execution targets • Implement Device interface • Enable multiple API front ends • Implement front ends targeting Device interface

  24. Overview • GPU Ocelot overview • Building, configuring, and executing Ocelot programs • Ocelot Device Interface and CUDA Runtime API • Ocelot PTX Internal Representation • PTX Pass Manager 24

  25. Ocelot PTX Intermediate Representation (IR) • Backend compiler framework for PTX • Full-featured PTX IR • Class hierarchy for PTX instructions/directives • PTX control flow graph • Static single-assignment form • Dataflow/dominance analysis • Enables PTX optimization • IR to IR translation • From PTX to other IRs • LLVM (x86/PowerPC/ARM) • CAL (AMD GPUs) PTX Kernel 25

  26. Ocelot Source Code: Intermediate Representation • ocelot/ • ir/ -- internal representations (PTX, LLVM, AMD IL) • interface/Module.h • interface/PTXInstruction.h • interface/PTXOperand.h • interface/PTXKernel.h • interface/ControlFlowGraph.h • interface/ILInstruction.h • interface/LLVMInstruction.h • parser/ -- parser (to PTX) • interface/PTXParser.h 26

  27. Ocelot PTX Internal Representation • C++ classes representing PTX module • ir::PTXModule • ir::PTXKernel • ir::PTXInstruction • ir::PTXOperand • ir::GlobalVariable • ir::LocalVariable • ir::Parameter • Ocelot PTX Parser target, Emitter source • ir::PTXInstruction::valid( ) • Translator source • PTX to LLVM • PTX to AMD IL • Suitable for analysis and transformation • Executable representation • PTX Emulator

  28. Ocelot PTX IR: Kernels ir::Module .global .f32 globalVariable; .entry sequence ( .param .u64 __cudaparm_sequence_A, .param .s32 __cudaparm_sequence_N) { .reg .u32 %r<11>; .reg .u64 %rd<6>; .local u32 %rp0; . . . . . . $LDWbegin_sequence: ld.param.s32 %r6, [__cudaparm_sequence_N]; setp.le.s32 %p1, %r6, %r5; @%p1 bra $Lt_0_1026; . . . . . . $Lt_0_1026: exit; $LDWend_sequence: } // sequence ir::Global ir::Kernel ir::Parameter ir::Local ir::BasicBlock 28

  29. Ocelot PTX IR: Instructions ir::BasicBlock ir::PTXInstruction add.s32 %r7, %r5, 1; ld .param .u64 %rd1, [__cudaparm_sequence_A]; cvt.s64.s32 %rd2, %r5; mul.wide.s32 %rd3, %r5, 4; add.u64 %rd4, %rd1, %rd3; st .global .s32 [ %rd4 + 0 ], %r7; @%p1 bra $Lt_0_6146; ir::PTXOperand addressMode: address opcodeaddressSpacedataType d a addressMode: register addressMode: immediate addressMode: indirect Guard predicate addressMode: label 29

  30. Control and Data-Flow Graphs • Data structure for representing kernels • Basic blocks • fall-through and branch edges • instruction vector • label • Traversals: • pre-order, topological, post-order • iterator visits blocks • Data-flow graph overlays CFG • definition-use chains explicit • to and from SSA form • CFG Transformations: • split blocks, edges • DFG Transformations: • insert and remove values • iterate over def-use 30

  31. Example: Control-Flow Graphs // example: splits basic blocks containing barriers // for (ir::ControlFlowGraph::iterator bb_it = kernel->cfg()->begin(); bb_it != kernel->cfg()->end(); ++bb_it) { // iterate over basic blocks unsignedint n = 0; ir::BasicBlock::InstructionList::iteratorinst_it; for (inst_it = (bb_it)->instructions.begin(); inst_it != (bb_it)->instructions.end(); ++inst_it, n++) { // iterate over instructions in *bb_it constir::PTXInstruction *inst = static_cast< constir::PTXInstruction *>(*inst_it); if (inst->opcode == ir::PTXInstruction::Bar) { if (n + 1 < (unsigned int)(bb_it)->instructions.size()) { std::string label = (bb_it)->label + "_bar"; kernel->cfg()->split_block(bb_it, n+1, ir::BasicBlock::Edge::FallThrough, label); // split block containing bar.sync // so that it’s always the last } // instruction in a block break; } } // end for (inst_it) } // end for (bb_it) 31

  32. Example: Spilling Live Values // ocelot/analysis/implementation/RemoveBarrierPass.cpp // voidRemoveBarrierPass::_addSpillCode( DataflowGraph::iterator block, constDataflowGraph::Block::RegisterSet& alive ) { unsigned intbytes = 0; ir::PTXInstruction move ( ir::PTXInstruction::Mov ); move.type = ir::PTXOperand::u64; move.a.identifier = "__ocelot_remove_barrier_pass_stack"; move.a.addressMode = ir::PTXOperand::Address; move.a.type = ir::PTXOperand::u64; move.d.reg = _kernel->dfg()->newRegister(); move.d.addressMode = ir::PTXOperand::Register; move.d.type = ir::PTXOperand::u64; _kernel->dfg()->insert( block, move, block->instructions().size() - 1 ); ...

  33. Example: Spilling Live Values ... for( DataflowGraph::Block::RegisterSet::const_iterator reg = alive.begin(); reg != alive.end(); ++reg ) { ir::PTXInstruction save( ir::PTXInstruction::St ); save.type = reg->type; save.addressSpace = ir::PTXInstruction::Local; save.d.addressMode = ir::PTXOperand::Indirect; save.d.reg = move.d.reg; save.d.type = ir::PTXOperand::u64; save.d.offset = bytes; bytes += ir::PTXOperand::bytes( save.type ); save.a.addressMode = ir::PTXOperand::Register; save.a.type = reg->type; save.a.reg = reg->id; _kernel->dfg()->insert( block, save, block->instructions().size() - 1 ); } _spillBytes = std::max( bytes, _spillBytes ); }

  34. IR for AMD and LLVM AMD Backend: R. Domingo & D. Kaeli (NEU) • LLVM IR • Implements all of the LLVM instruction set • Decouples translator with LLVM project • Easier to construct than LLVM’s actual IR • AMD IL • Supports translation from PTX to AMD interface • Emitters construct parseablestring representations of modules 34

  35. Overview • GPU Ocelot overview • Building, configuring, and executing Ocelot programs • Ocelot Device Interface and CUDA Runtime API • Ocelot PTX Internal Representation • PTX Pass Manager 35

  36. PTX PassManager • Orchestrates analysis and transformation passes • Derived from LLVM model • Analysis Passes generate meta-data • Meta-data consumed by transformations • Transformation Passes modify the IR

  37. Using the Pass Manager • Passes added to a manager • Schedules execution • Manages analysis meta-data • Ensures meta-data available • Up to date; not redundantly computed

  38. Analysis Passes • Analysis runs over the PTX IR • Generates meta-data • Modifies PTX IR • Possibly updates or invalidates existing meta-data • Examples • Data-flow graph • Dominator and Post-dominator trees • Thread frontiers

  39. Analysis Passes – Supported Analaysis Structures • Control Flow Graph • ir/interface/ControlFlowGraph.h • Data Flow Graph • analysis/interface/DataflowGraph.h • Dominator and Post-Dominator Trees • analysis/interface/DominatorTree.h • analysis/interface/PostDominatorTree.h • Superblock Analysis • analysis/interface/SuperblockAnalysis.h • Divergence Graph • analysis/interface/DivergenceGraph.h • Thread Frontiers • analysis/interface/ThreadFrontiers.h 39

  40. Transformation Passes • Modify the PTX IR • Consume meta-data • Examples: • Dead-code elimination • transforms/interface/DeadCodeEliminationPass.h • Control-flow structuring • transforms/interface/StructuralTransform.h • Sync elimination • transforms/interface/SyncElimination.h • Dynamic instrumentation

  41. Example: Dead Code Elimination Transformation Pass

  42. Dead Code Elimination • Approach • Run once on each kernel • Consume data-flow analysis meta-data • Delete instructions producing values with no users • Implementation • transforms/interface/DeadCodeEliminationPass.h • transforms/implementation/DeadCodeEliminationPass.cpp

  43. Dead Code Elimination (1 of 5) • Setup pass dependencies DeadCodeEliminationPass::DeadCodeEliminationPass() : KernelPass(Analysis::DataflowGraphAnalysis | Analysis::StaticSingleAssignment, "DeadCodeEliminationPass") { }

  44. Dead Code Elimination (2 of 5) • Run pass void DeadCodeEliminationPass::runOnKernel(ir::IRKernel& k) { • Get analysis metadata Analysis* dfgAnalysis = getAnalysis(Analysis::DataflowGraphAnalysis); assert(dfgAnalysis != 0); // cast up analysis::DataflowGraph& dfg = *static_cast<analysis::DataflowGraph*>(dfgAnalysis); assert(dfg.ssa());

  45. Dead Code Elimination (3 of 5) • Loop until change BlockSet blocks; for (iterator block = dfg.begin(); block != dfg.end(); ++block) { report(" Queueing up BB_" << block->id()); blocks.insert(block); } while(!blocks.empty()) { iterator block = *blocks.begin(); blocks.erase(blocks.begin()); eliminateDeadInstructions(dfg, blocks, block); }

  46. Dead Code Elimination (4 of 5) • Remove unused live-out values AliveKillListaliveOutKillList; for (RegisterSet::iterator aliveOut = block->aliveOut().begin(); aliveOut!= block->aliveOut().end(); ++aliveOut) { if (canRemoveAliveOut(dfg, block, *aliveOut)) { report(" removed " << aliveOut->id); aliveOutKillList.push_back(aliveOut); } } for (AliveKillList::iterator killed = aliveOutKillList.begin(); killed != aliveOutKillList.end(); ++killed) { block->aliveOut().erase(*killed); }

  47. Dead Code Elimination (5 of 5) • Check if an instruction can be removed • if (ptx.hasSideEffects()) return false; • for (RegisterPointerVector::iterator reg = instruction->d.begin(); • reg!= instruction->d.end(); ++reg) { • // the reg is alive outside the block • if (block->aliveOut().count(*reg) != 0) return false; • InstructionVector::iterator next = instruction; • for (++next; next != block->instructions().end(); ++next) { • for (RegisterPointerVector::iterator source = next->s.begin(); • source != next->s.end(); ++source) { • // found a user in the block • if (*source->pointer == *reg->pointer) return false; • } • } • }

  48. Dead Code Elimination • Repeat for • phi instructions • Other instructions • alive-in values • Ensures meta-data is valid

  49. Running Passes on PTX • Static optimizer • PTXOptimizer • Runs passes on PTX assembly files • ocelot/tools/PTXOptimizer.cpp • JIT optimization • Runs passes before kernels are launched • ocelot/api/implementation/OcelotRuntime.cpp

  50. Questions • GPU Ocelot • Google Code site: http://code.google.com/p/gpuocelot • Research Project site: http://gpuocelot.gatech.edu • Mailing list: gpuocelot@googlegroups.com • Contributors • Gregory Diamos, Rodrigo Dominguez, NailaFarooqui, Andrew Kerr, AshwinLele, Si Li, Tri Pho, Jin Wang, Haicheng Wu, SudhakarYalamanchili • Sponsors • AMD, IBM, Intel, LogicBlox, NSF, NVIDIA

More Related