Overview of Ocelot: architecture

Overview of Ocelot: architecture

Overview • GPU Ocelot overview • Building, configuring, and executing Ocelot programs • Ocelot Device Interface and CUDA Runtime API • Ocelot PTX Internal Representation • PTX Pass Manager 2

Ocelot: Multiplatform Dynamic Compilation Just-in-time code generation and optimization for data intensive applications esd.lbl.gov Data Parallel IR Language Front-End R. Domingo & D. Kaeli (NEU) • Environment for i) compiler research, ii) architecture research, and iii) productivity tools 3

NVIDIA’s Compute Unified Device Architecture (CUDA) • Integrate the concept of a compute kernel called from standard languages • Multithreaded host programs • The compute kernel specifies data parallel computation as thousands of threads • An accelerator model of computing • Explicit functions for off-loading computation to GPUs • Data movement explicitly managed by the programmer 4

NVIDIA’s Compute Unified Device Architecture (CUDA) Host GPU • For access to CUDA tutorials http://developer.nvidia.com/cuda-education-training 5

Structure of a Compute Kernel Parallel Thread Execution (PTX) instruction set architecture • Arrays of (data parallel) thread blocks called cooperative thread arrays (CTAs) • Barrier synchronization • Mapped to single instruction stream multiple data stream (SIMD) processor 6

NVIDIA Fermi GF 100 • 4 Global Processing Clusters (GPCs) containing 4 SMs each • Each SM has 32 ALUs, 4 SFUs, and 16 LS units • Each ALU has access to 1024 32bit registers (total of 128kB per SM) • Each SM has its own Shared Memory/L1 cache (64kB total) • Unified L2 cache (768kB) • Six 64bit Memory Controllers (total 384bit wide) Streaming multiprocessor (SM) ALU 7

Ocelot Structure1 PTX Kernel CUDA Application nvcc • Ocelot is built with nvcc and the LLVM backend • Structured around a PTX IR LLVM IR Translator • Compile stock CUDA applications without modification • 1G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: A Dynamic Optimizing Compiler for Bulk Synchronous Applications in Heterogeneous Systems,” PACT, September 2010. .

CUDA to PTX • PTX modules stored as string literals in fat binary • We ignore accompanying binary image (GPU native binary) 9

Dependencies • Software • C++ Compiler (GCC 4.5.x) • LexLexer Generator (Flex 2.5.35) • YACC Parser Generator (Bison 2.4.1) • Scons (Python 2.7) • LLVM (3.1) • Libraries • boost_system (1.46) • boost_filesystem (1.46) • boost_serialization (1.46) • GLEW (optional for GL interop) (1.5) • GL (for NVIDIA GPU Devices) • Library headers • Boost (1.46) http://code.google.com/p/gpuocelot/wiki/Installation 11

Ocelot Source Code • Freely available via Google Code project site (New BSD License) • ocelot/ • analysis/ -- analysis passes • api/ -- Ocelot-specific API extensions • cuda/ -- implements CUDA runtime • executive/ -- Device interface and backend implementations • ir/ -- internal representations (PTX, LLVM, AMD IL) • parser/ -- parser (to PTX) • tools/ -- standalone applications using Ocelot • trace/ -- trace generation and analysis tools • translator/ -- translators from PTX to LLVM and AMD IL • transforms/ -- program transformations http://code.google.com/p/gpuocelot/ svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only 12

Building GPU Ocelot • Obtain source code • svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only • Compile with Scons • sudo ./build.py –install • Build and execute unit tests • sudo ./build.py –test=full • Output appears in .release_build • libocelot.so • OcelotConfig • Tests • Installation directory: • /usr/local/include/ocelot • /usr/local/lib http://code.google.com/p/gpuocelot/wiki/Installation

Configuring Ocelot trace: { memoryChecker: { enabled: true, checkInitialization: false }, raceDetector: { enabled: false, ignoreIrrelevantWrites: true }, debugger: { enabled: false, kernelFilter: "_Z13scalarProdGPUPfS_S_ii", alwaysAttach: true }, }, executive: { devices: [ "emulated" ], } } • configure.ocelot • Controls Ocelot’s initial state • Located in application’s startup directory • trace specifies which trace generators are initially attached • executive controls device properties • trace: • memoryChecker – ensures • raceDetector - enforces synchronized access to .shared • debugger - interactive debugger • executive: • devices: • List of Ocelot backend devices that are enabled • nvidia- NVIDIA GPU backend • emulated – Ocelot PTX emulator (trace generators) • llvm – efficient execution of PTX on multicore CPU • amd – translation to AMD IL for PTX on AMD RADEON GPU 14

Building and Executing CUDA Programs • nvcc -c example.cu -arch sm_23 • g++ -o example example.o `OcelotConfig -l` • `OcelotConfig -l` expands to ‘-locelot’ • libocelot.so replaces libcudart.so

CUDA Runtime API • Ocelot implements CUDA Runtime API • Transparent hooks into existing CUDA applications • override methods of cuda::CudaDeviceInterface • Maps CUDA RT onto Ocelot device interface abstraction • cuda::CudaRuntime • Extended through custom Ocelot API • e.g. ocelot::registerPTXModule( ); 17

Ocelot CUDA Runtime Overview • A reimplementation of the CUDA Runtime API • Compatible with existing applications • Link against libocelot.so instead of libcudart R. Domingo & D. Kaeli (NEU) Kernels execute anywhere  Key to portability!

Ocelot CUDA Runtime • Clean device abstraction • All back-ends implement same interface • Ocelot API Extensions • Add/remove trace generators • Compile/launch kernels directly in PTX • Device memory sharing among host threads • Device switching 19

Ocelot Source Code: CUDA Runtime API • ocelot/ • analysis/ -- analysis passes • api/ -- Ocelot-specific API extensions • cuda/ -- implements CUDA runtime • interface/CudaRuntimeInterface.h • interface/CudaRuntime.h • interface/CudaRuntimeContext.h • interface/FatBinaryContext.h • interface/CudaDriverFrontend.h • executive/ -- Device interface and backend implementations • ir/ -- internal representations (PTX, LLVM, AMD IL) • parser/ -- parser (to PTX) • tools/ -- standalone applications using Ocelot • trace/ -- trace generation and analysis tools • translator/ -- translators from PTX to LLVM and AMD IL • transforms/ -- program transformations 20

Ocelot CUDA Runtime API Implementation • Implement interface defined by cuda::CudaRuntimeInterface • ocelot/cuda/interface/CudaRuntime.h • ocelot/cuda/implementation/CudaRuntime.cpp • class cuda::CudaRuntime • cuda::CudaRuntime members • Host thread contexts • Ocelot devices • Registered modules, textures, kernels • Fat binaries • Global mutex • CUDA Runtime API functions • eg. cudaMemcpy, cudaLaunch, __cudaRegisterModule(), • Additional functions • eg. _lock(), _unlock(), _registerModule()

Ocelot Source Code: Device Interface • ocelot/ • executive/ -- Device interface and backend implementations • interface/Device.h • interface/EmulatorDevice.h • interface/NVIDIAGPUDevice.h • interface/MulticoreCPUDevice.h • interface/ATIGPUDevice.h 22

Ocelot Device Interface • class executive::Device • Succinct interface for device objects • Module registration • Memory management • Kernel configuration and launching • Global variable and texture management • OpenGL interoperability • Streams and Events • Trace generators • Minimal set of APIs for device-oriented programming model • 57 functions (versus CUDA Runtime’s 120+) • Capture device state: • Memory allocations, global variables, textures, graphics interoperability • Facilitate creation of backend execution targets • Implement Device interface • Enable multiple API front ends • Implement front ends targeting Device interface

Ocelot PTX Intermediate Representation (IR) • Backend compiler framework for PTX • Full-featured PTX IR • Class hierarchy for PTX instructions/directives • PTX control flow graph • Static single-assignment form • Dataflow/dominance analysis • Enables PTX optimization • IR to IR translation • From PTX to other IRs • LLVM (x86/PowerPC/ARM) • CAL (AMD GPUs) PTX Kernel 25

Ocelot Source Code: Intermediate Representation • ocelot/ • ir/ -- internal representations (PTX, LLVM, AMD IL) • interface/Module.h • interface/PTXInstruction.h • interface/PTXOperand.h • interface/PTXKernel.h • interface/ControlFlowGraph.h • interface/ILInstruction.h • interface/LLVMInstruction.h • parser/ -- parser (to PTX) • interface/PTXParser.h 26

Ocelot PTX Internal Representation • C++ classes representing PTX module • ir::PTXModule • ir::PTXKernel • ir::PTXInstruction • ir::PTXOperand • ir::GlobalVariable • ir::LocalVariable • ir::Parameter • Ocelot PTX Parser target, Emitter source • ir::PTXInstruction::valid( ) • Translator source • PTX to LLVM • PTX to AMD IL • Suitable for analysis and transformation • Executable representation • PTX Emulator

Ocelot PTX IR: Kernels ir::Module .global .f32 globalVariable; .entry sequence ( .param .u64 __cudaparm_sequence_A, .param .s32 __cudaparm_sequence_N) { .reg .u32 %r<11>; .reg .u64 %rd<6>; .local u32 %rp0; . . . . . . $LDWbegin_sequence: ld.param.s32 %r6, [__cudaparm_sequence_N]; setp.le.s32 %p1, %r6, %r5; @%p1 bra $Lt_0_1026; . . . . . . $Lt_0_1026: exit; $LDWend_sequence: } // sequence ir::Global ir::Kernel ir::Parameter ir::Local ir::BasicBlock 28

Ocelot PTX IR: Instructions ir::BasicBlock ir::PTXInstruction add.s32 %r7, %r5, 1; ld .param .u64 %rd1, [__cudaparm_sequence_A]; cvt.s64.s32 %rd2, %r5; mul.wide.s32 %rd3, %r5, 4; add.u64 %rd4, %rd1, %rd3; st .global .s32 [ %rd4 + 0 ], %r7; @%p1 bra $Lt_0_6146; ir::PTXOperand addressMode: address opcodeaddressSpacedataType d a addressMode: register addressMode: immediate addressMode: indirect Guard predicate addressMode: label 29

Control and Data-Flow Graphs • Data structure for representing kernels • Basic blocks • fall-through and branch edges • instruction vector • label • Traversals: • pre-order, topological, post-order • iterator visits blocks • Data-flow graph overlays CFG • definition-use chains explicit • to and from SSA form • CFG Transformations: • split blocks, edges • DFG Transformations: • insert and remove values • iterate over def-use 30

Example: Control-Flow Graphs // example: splits basic blocks containing barriers // for (ir::ControlFlowGraph::iterator bb_it = kernel->cfg()->begin(); bb_it != kernel->cfg()->end(); ++bb_it) { // iterate over basic blocks unsignedint n = 0; ir::BasicBlock::InstructionList::iteratorinst_it; for (inst_it = (bb_it)->instructions.begin(); inst_it != (bb_it)->instructions.end(); ++inst_it, n++) { // iterate over instructions in *bb_it constir::PTXInstruction *inst = static_cast< constir::PTXInstruction *>(*inst_it); if (inst->opcode == ir::PTXInstruction::Bar) { if (n + 1 < (unsigned int)(bb_it)->instructions.size()) { std::string label = (bb_it)->label + "_bar"; kernel->cfg()->split_block(bb_it, n+1, ir::BasicBlock::Edge::FallThrough, label); // split block containing bar.sync // so that it’s always the last } // instruction in a block break; } } // end for (inst_it) } // end for (bb_it) 31

Example: Spilling Live Values // ocelot/analysis/implementation/RemoveBarrierPass.cpp // voidRemoveBarrierPass::_addSpillCode( DataflowGraph::iterator block, constDataflowGraph::Block::RegisterSet& alive ) { unsigned intbytes = 0; ir::PTXInstruction move ( ir::PTXInstruction::Mov ); move.type = ir::PTXOperand::u64; move.a.identifier = "__ocelot_remove_barrier_pass_stack"; move.a.addressMode = ir::PTXOperand::Address; move.a.type = ir::PTXOperand::u64; move.d.reg = _kernel->dfg()->newRegister(); move.d.addressMode = ir::PTXOperand::Register; move.d.type = ir::PTXOperand::u64; _kernel->dfg()->insert( block, move, block->instructions().size() - 1 ); ...

Example: Spilling Live Values ... for( DataflowGraph::Block::RegisterSet::const_iterator reg = alive.begin(); reg != alive.end(); ++reg ) { ir::PTXInstruction save( ir::PTXInstruction::St ); save.type = reg->type; save.addressSpace = ir::PTXInstruction::Local; save.d.addressMode = ir::PTXOperand::Indirect; save.d.reg = move.d.reg; save.d.type = ir::PTXOperand::u64; save.d.offset = bytes; bytes += ir::PTXOperand::bytes( save.type ); save.a.addressMode = ir::PTXOperand::Register; save.a.type = reg->type; save.a.reg = reg->id; _kernel->dfg()->insert( block, save, block->instructions().size() - 1 ); } _spillBytes = std::max( bytes, _spillBytes ); }

IR for AMD and LLVM AMD Backend: R. Domingo & D. Kaeli (NEU) • LLVM IR • Implements all of the LLVM instruction set • Decouples translator with LLVM project • Easier to construct than LLVM’s actual IR • AMD IL • Supports translation from PTX to AMD interface • Emitters construct parseablestring representations of modules 34

PTX PassManager • Orchestrates analysis and transformation passes • Derived from LLVM model • Analysis Passes generate meta-data • Meta-data consumed by transformations • Transformation Passes modify the IR

Using the Pass Manager • Passes added to a manager • Schedules execution • Manages analysis meta-data • Ensures meta-data available • Up to date; not redundantly computed

Analysis Passes • Analysis runs over the PTX IR • Generates meta-data • Modifies PTX IR • Possibly updates or invalidates existing meta-data • Examples • Data-flow graph • Dominator and Post-dominator trees • Thread frontiers

Analysis Passes – Supported Analaysis Structures • Control Flow Graph • ir/interface/ControlFlowGraph.h • Data Flow Graph • analysis/interface/DataflowGraph.h • Dominator and Post-Dominator Trees • analysis/interface/DominatorTree.h • analysis/interface/PostDominatorTree.h • Superblock Analysis • analysis/interface/SuperblockAnalysis.h • Divergence Graph • analysis/interface/DivergenceGraph.h • Thread Frontiers • analysis/interface/ThreadFrontiers.h 39

Transformation Passes • Modify the PTX IR • Consume meta-data • Examples: • Dead-code elimination • transforms/interface/DeadCodeEliminationPass.h • Control-flow structuring • transforms/interface/StructuralTransform.h • Sync elimination • transforms/interface/SyncElimination.h • Dynamic instrumentation

Example: Dead Code Elimination Transformation Pass

Dead Code Elimination • Approach • Run once on each kernel • Consume data-flow analysis meta-data • Delete instructions producing values with no users • Implementation • transforms/interface/DeadCodeEliminationPass.h • transforms/implementation/DeadCodeEliminationPass.cpp

Dead Code Elimination (1 of 5) • Setup pass dependencies DeadCodeEliminationPass::DeadCodeEliminationPass() : KernelPass(Analysis::DataflowGraphAnalysis | Analysis::StaticSingleAssignment, "DeadCodeEliminationPass") { }

Dead Code Elimination (2 of 5) • Run pass void DeadCodeEliminationPass::runOnKernel(ir::IRKernel& k) { • Get analysis metadata Analysis* dfgAnalysis = getAnalysis(Analysis::DataflowGraphAnalysis); assert(dfgAnalysis != 0); // cast up analysis::DataflowGraph& dfg = *static_cast<analysis::DataflowGraph*>(dfgAnalysis); assert(dfg.ssa());

Dead Code Elimination (3 of 5) • Loop until change BlockSet blocks; for (iterator block = dfg.begin(); block != dfg.end(); ++block) { report(" Queueing up BB_" << block->id()); blocks.insert(block); } while(!blocks.empty()) { iterator block = *blocks.begin(); blocks.erase(blocks.begin()); eliminateDeadInstructions(dfg, blocks, block); }

Dead Code Elimination (4 of 5) • Remove unused live-out values AliveKillListaliveOutKillList; for (RegisterSet::iterator aliveOut = block->aliveOut().begin(); aliveOut!= block->aliveOut().end(); ++aliveOut) { if (canRemoveAliveOut(dfg, block, *aliveOut)) { report(" removed " << aliveOut->id); aliveOutKillList.push_back(aliveOut); } } for (AliveKillList::iterator killed = aliveOutKillList.begin(); killed != aliveOutKillList.end(); ++killed) { block->aliveOut().erase(*killed); }

Dead Code Elimination (5 of 5) • Check if an instruction can be removed • if (ptx.hasSideEffects()) return false; • for (RegisterPointerVector::iterator reg = instruction->d.begin(); • reg!= instruction->d.end(); ++reg) { • // the reg is alive outside the block • if (block->aliveOut().count(*reg) != 0) return false; • InstructionVector::iterator next = instruction; • for (++next; next != block->instructions().end(); ++next) { • for (RegisterPointerVector::iterator source = next->s.begin(); • source != next->s.end(); ++source) { • // found a user in the block • if (*source->pointer == *reg->pointer) return false; • } • } • }

Dead Code Elimination • Repeat for • phi instructions • Other instructions • alive-in values • Ensures meta-data is valid

Running Passes on PTX • Static optimizer • PTXOptimizer • Runs passes on PTX assembly files • ocelot/tools/PTXOptimizer.cpp • JIT optimization • Runs passes before kernels are launched • ocelot/api/implementation/OcelotRuntime.cpp

Questions • GPU Ocelot • Google Code site: http://code.google.com/p/gpuocelot • Research Project site: http://gpuocelot.gatech.edu • Mailing list: gpuocelot@googlegroups.com • Contributors • Gregory Diamos, Rodrigo Dominguez, NailaFarooqui, Andrew Kerr, AshwinLele, Si Li, Tri Pho, Jin Wang, Haicheng Wu, SudhakarYalamanchili • Sponsors • AMD, IBM, Intel, LogicBlox, NSF, NVIDIA

Overview of Ocelot: architecture

Overview of Ocelot: architecture

Presentation Transcript

Architecture Review Board

Chapter 5 Case Study: MVC Architecture for Web Applications

Software Architecture

GSM Protocol Architecture

Fibonacci Numbers in Architecture

The MonetDB Architecture

COMPUTER ORGANIZATION AND ARCHITECTURE

DESIGN OF SOFTWARE ARCHITECTURE

Applications of SOA and Web Services in Grid Computing

Introduction to Architecture

Training Software Version v2.2

V615 CeBS 1.x - Detailed Architecture

Instruction Set Architecture

Software Architecture

Router Architecture

Apache httpd basics

Chapter 5 Case Study: MVC Architecture for Web Applications

-Early Islamic Architecture- -Moorish Architecture-

Training Software Version v2.2

Introduction of Revit Architecture, Structure, and System