Mantle for developers
Johan Andersson – Technical Director, Frostbite, Electronic Arts
Mantle? • Simplify advanced development • Improve performance • Enable developers to innovate • Challenge the status quo
Developer impact areas • GPU performance • CPU performance • Control • Platforms • Programmability
Control New model
Traditional model: black box
• Middle-ground abstraction – compromise between performance & “usability”
• Hidden resource memory & state
• Resource CPU access tied to device context
• Driver analyzes & synchronizes implicitly
Explicit model: Mantle
• Thin low-level abstraction to expose how hardware works
• App explicit memory management
• Resources are globally accessible
• App explicit resource state transitions
Control App responsibility • Tell when render target will be used as a texture • And many more resource state transitions • Don’t destroy resources that GPU is using • Keep track with fences or frames • Manual dynamic resource renaming • No DISCARD for driver resource renaming • Resource memory tiling • Powerful validation layer will help!
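The "keep track with fences or frames" point can be sketched as a small deferred-destruction queue. This is a minimal illustration under stated assumptions, not Mantle API: `DeferredDeleter`, `retire`, and `collect` are hypothetical names, the completed frame index is assumed to come from a fence the app polls, and resources are assumed to be retired in non-decreasing frame order.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical helper (not part of any real API): defers resource
// destruction until the GPU has finished the frame that last used it.
class DeferredDeleter {
public:
    // Queue a destroy callback tagged with the last frame that used the resource.
    // Assumes calls arrive in non-decreasing frame order.
    void retire(std::uint64_t lastUsedFrame, std::function<void()> destroy) {
        pending_.push_back({lastUsedFrame, std::move(destroy)});
    }

    // Call once per frame with the newest frame index the GPU has completed
    // (for example, read back from a fence). Returns how many resources were freed.
    std::size_t collect(std::uint64_t completedFrame) {
        std::size_t freed = 0;
        while (!pending_.empty() && pending_.front().frame <= completedFrame) {
            pending_.front().destroy();
            pending_.pop_front();
            ++freed;
        }
        return freed;
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    struct Entry {
        std::uint64_t frame;
        std::function<void()> destroy;
    };
    std::deque<Entry> pending_;
};
```

The same pattern can cover manual resource renaming: instead of destroying the old copy, the callback would return it to a pool once the GPU is done with it.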
Control Explicit control enables • App high-level decisions & optimizations • Has full scene information • Easier to optimize performance & memory • Flexible & efficient memory management • Linear frame allocators • Memory pools • Pinned memory • Reduced development time • For advanced game engines & apps • Easier to get to target performance & robustness
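A linear frame allocator, one of the memory strategies listed above, can be sketched in a few lines. This is an illustrative sketch only: `LinearFrameAllocator` is a hypothetical name, and it hands out offsets into a fixed range rather than calling any real memory API, so the same logic could sub-allocate a GPU memory block.

```cpp
#include <cstddef>

// Hypothetical linear (bump) allocator over a fixed-size memory range.
class LinearFrameAllocator {
public:
    static constexpr std::size_t kInvalid = static_cast<std::size_t>(-1);

    explicit LinearFrameAllocator(std::size_t capacity) : capacity_(capacity) {}

    // Returns the offset of a block of `size` bytes aligned to `alignment`
    // (which must be a power of two), or kInvalid when the range is exhausted.
    std::size_t allocate(std::size_t size, std::size_t alignment) {
        const std::size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        if (aligned > capacity_ || size > capacity_ - aligned) return kInvalid;
        offset_ = aligned + size;
        return aligned;
    }

    // Frees everything at once; call only when the GPU is done with the frame.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    std::size_t capacity_;
    std::size_t offset_ = 0;
};
```

Allocation is a single add and mask, and freeing is a single reset, which is why this scheme suits per-frame data such as constants and dynamic vertex data.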
Control Explicit control enables • Light-weight driver • Easier to develop & maintain • Reduced CPU draw call overhead • Transient resources • Alias render targets within frame • Major memory savings • No need to pre-allocate everything
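Aliasing render targets within a frame amounts to assigning the same memory to targets whose lifetimes do not overlap. The sketch below is one simple way to do that, assuming (for illustration only) that all targets have the same size and that each lifetime is given as a first and last render pass index. `TransientTarget` and `assignAliasedSlots` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct TransientTarget {
    int firstPass;  // first pass that writes the target
    int lastPass;   // last pass that reads it
    int slot;       // memory slot, filled in by assignAliasedSlots
};

// Greedy assignment: visit targets by first use and reuse any slot whose
// previous occupant is no longer needed. Returns the number of slots used.
std::size_t assignAliasedSlots(std::vector<TransientTarget>& targets) {
    std::sort(targets.begin(), targets.end(),
              [](const TransientTarget& a, const TransientTarget& b) {
                  return a.firstPass < b.firstPass;
              });
    std::vector<int> slotBusyUntil;  // last pass that uses each slot
    for (auto& t : targets) {
        int chosen = -1;
        for (std::size_t s = 0; s < slotBusyUntil.size(); ++s) {
            if (slotBusyUntil[s] < t.firstPass) {
                chosen = static_cast<int>(s);
                break;
            }
        }
        if (chosen < 0) {
            slotBusyUntil.push_back(t.lastPass);
            chosen = static_cast<int>(slotBusyUntil.size()) - 1;
        } else {
            slotBusyUntil[chosen] = t.lastPass;
        }
        t.slot = chosen;
    }
    return slotBusyUntil.size();
}
```

With four targets whose lifetimes are passes 0-1, 1-2, 2-3, and 3-4, this needs two slots instead of four. A real implementation would also have to handle differing sizes and issue the resource state transitions between aliased uses.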
CPU performance Control
CPU perf Core concepts • Descriptor sets • Monolithic pipelines • Command buffers
CPU perf Descriptor sets
• Table with resource references to bind to graphics or compute pipeline
• Replaces traditional resource stage binding
• Major performance & flexibility advantage
  • Closer to how the hardware works
• App managed - lots of strategies possible!
  • Tiny vs huge sets
  • Single vs multiple
  • Static vs semi-static vs dynamic
Example 1: Single simple dynamic descriptor set
• Bind everything you need for a single draw call
• Close to DX/GL model but share between stages
[Diagram: dynamic descriptor set with slots VertexBuffer (VS), Texture0 (VS+PS), Constants (VS), Texture1 (PS), Texture2 (PS), Sampler0 (VS+PS); legend: Image, Memory, Link, Sampler]
CPU perf Descriptor sets
Example 2: Reuse static set with nesting
• Reduce update time & memory usage
[Diagram: dynamic descriptor set with slots Constants (VS), VertexBuffer (VS), Link; the Link points to a static descriptor set with slots Texture0 (VS+PS), Texture1 (PS), Texture2 (PS), Texture3 (PS), Texture4 (PS), Sampler0 (VS+PS), Sampler1 (PS); legend: Image, Memory, Link, Sampler]
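The nesting idea in Example 2 can be modeled as a table whose slots either reference a resource or link to another table. The sketch below is a CPU-side data-structure illustration only; `SlotKind`, `DescriptorSlot`, `DescriptorSet`, and `countReachable` are hypothetical names and do not reflect the real Mantle types.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class SlotKind { Empty, Memory, Image, Sampler, Link };

struct DescriptorSet;

struct DescriptorSlot {
    SlotKind kind;
    std::uint32_t handle;       // hypothetical resource handle
    const DescriptorSet* link;  // used only when kind == SlotKind::Link
};

struct DescriptorSet {
    std::vector<DescriptorSlot> slots;
};

// Counts the resource references reachable from a set, following links.
std::size_t countReachable(const DescriptorSet& set) {
    std::size_t n = 0;
    for (const auto& s : set.slots) {
        if (s.kind == SlotKind::Link) {
            if (s.link) n += countReachable(*s.link);
        } else if (s.kind != SlotKind::Empty) {
            ++n;
        }
    }
    return n;
}
```

A static set holding textures and samplers is built once; each draw then fills only a small dynamic set (constants, vertex buffer, and one link slot), which is where the reduced update time and memory usage come from.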
CPU perf Monolithic pipelines
• Shader stages & select graphics state combined into single object
  • No runtime compilation or patching needed!
  • Significantly less runtime overhead to use
• Supports parallel building & caching
  • Fast loading times
• Usage & management up to the app
  • Static vs dynamic creation
  • Amount of pipelines
  • State usage
[Diagram: pipeline state object spanning IA, VS, HS, Tessellator, DS, GS, RS, PS, DB, CB]
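Because a monolithic pipeline is fully described up front, an app can cache pipelines by their description and build each one only once. The following is a rough sketch under stated assumptions: `PipelineDesc` with its four fields, `PipelineDescHash`, and `PipelineCache` are hypothetical, and the integer handle stands in for an expensive pipeline build.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical, heavily simplified pipeline description.
struct PipelineDesc {
    std::uint32_t vertexShaderId;
    std::uint32_t pixelShaderId;
    std::uint32_t blendMode;
    bool depthTest;

    bool operator==(const PipelineDesc& o) const {
        return vertexShaderId == o.vertexShaderId && pixelShaderId == o.pixelShaderId &&
               blendMode == o.blendMode && depthTest == o.depthTest;
    }
};

struct PipelineDescHash {
    std::size_t operator()(const PipelineDesc& d) const {
        std::size_t h = std::hash<std::uint32_t>{}(d.vertexShaderId);
        auto mix = [&h](std::size_t v) { h ^= v + 0x9e3779b9u + (h << 6) + (h >> 2); };
        mix(std::hash<std::uint32_t>{}(d.pixelShaderId));
        mix(std::hash<std::uint32_t>{}(d.blendMode));
        mix(std::hash<bool>{}(d.depthTest));
        return h;
    }
};

class PipelineCache {
public:
    // Returns a pipeline handle, building a new pipeline only on a cache miss.
    std::uint32_t getOrCreate(const PipelineDesc& desc) {
        auto it = cache_.find(desc);
        if (it != cache_.end()) return it->second;
        const std::uint32_t handle = nextHandle_++;  // stand-in for the real build
        ++buildCount_;
        cache_.emplace(desc, handle);
        return handle;
    }

    std::size_t buildCount() const { return buildCount_; }

private:
    std::unordered_map<PipelineDesc, std::uint32_t, PipelineDescHash> cache_;
    std::uint32_t nextHandle_ = 1;
    std::size_t buildCount_ = 0;
};
```

Whether to populate such a cache statically at load time or dynamically at first use is one of the management choices the slide leaves to the app.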
CPU perf Command buffers • Issue pipelined graphics & compute commands into a command buffer • Bind graphics state, descriptor sets, pipeline • Draw calls • Render targets • Clears • Memory transfers • NOT: resource mapping • Fully independent objects • Create multiple every frame • Or pre-build up front and reuse
CPU perf DX/GL parallelism
• Automatically extracts parallelism out of most apps
• Doesn’t scale beyond 2-3 cores
• Additional latency
• Driver thread often bottleneck – can collide with app threads
[Diagram: Game, Render, and Driver work laid out across CPU 0, CPU 1, and CPU 2]
CPU perf Parallel dispatch with Mantle
• App can go fully wide with its rendering – minimal latency
• Close to linear scaling with CPU cores
• No driver threads – no overhead – no contention
• Frostbite’s approach on all consoles – and on PC with Mantle!
[Diagram: Game on CPU 0; Render on CPU 1, CPU 2, CPU 3, and CPU 4]
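Since command buffers are fully independent objects, each worker thread can record into its own buffer without locking. The sketch below shows the pattern with plain C++ threads. `DrawCommand`, `CommandBuffer`, and `recordInParallel` are hypothetical stand-ins, and a `std::vector` plays the role of a real command buffer.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct DrawCommand {
    int objectId;
};
using CommandBuffer = std::vector<DrawCommand>;  // stand-in for a real command buffer

// Splits objectCount draws across workerCount threads (workerCount must be > 0).
// Each thread records only into its own buffer, so no synchronization is
// needed until the final join.
std::vector<CommandBuffer> recordInParallel(int objectCount, int workerCount) {
    std::vector<CommandBuffer> buffers(static_cast<std::size_t>(workerCount));
    std::vector<std::thread> workers;
    const int perWorker = (objectCount + workerCount - 1) / workerCount;
    for (int w = 0; w < workerCount; ++w) {
        workers.emplace_back([&buffers, w, perWorker, objectCount] {
            const int begin = w * perWorker;
            const int end = std::min(objectCount, begin + perWorker);
            for (int i = begin; i < end; ++i) {
                buffers[static_cast<std::size_t>(w)].push_back({i});
            }
        });
    }
    for (auto& t : workers) t.join();
    return buffers;  // the app would submit these to a queue in index order
}
```

A production engine would more likely use a persistent job system than spawn threads every frame; the point here is only that recording needs no shared state.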
GPU performance CPU performance
GPU perf GPU optimizations
• Thanks to improved CPU performance – CPU will rarely be a bottleneck for the GPU
• CPU could help GPU more:
  • Less brute force rendering
  • Improve culling
• Shader pipeline object – driver optimizations
  • Can optimize with pipeline state knowledge
  • Can optimize across all shader stages
• Resource states
  • Gives driver a lot more knowledge & flexibility
  • Apps can avoid expensive/redundant transitions, such as surface decompression
• Expose existing GPU functionality
  • Quad & Rect-lists
  • HW-specific MSAA & depth data access
  • Programmable sample patterns
  • And more..
GPU perf Queues
• Modern GPUs are heterogeneous machines with multiple engines
  • Graphics pipeline
  • Compute pipeline(s)
  • DMA transfer
  • Video encode/decode
  • More…
• Mantle exposes queues for the engines + synchronization primitives
[Diagram: Graphics, Compute, DMA, ... queues feeding the GPU]
GPU perf Queue use cases
• Async DMA transfers
  • Copy resources in parallel with graphics or compute
[Diagram: DMA and Graphics queue timelines with Copy, Render, Other render, Use copy]
GPU perf Queue use cases
• Async compute together with graphics
  • ALU heavy compute work at the same time as memory/ROP bound work to utilize idle units
[Diagram: Compute and Graphics queue timelines with Non-shadowed lighting, GBuffer, Shadowmap 0, Shadowmap 1, Final lighting]
GPU perf Queue use cases
• Multiple compute kernels collaborating
  • Can be faster than über-kernel
  • Example: Compute geometry backend & compute rasterizer
[Diagram: Compute Geometry on Compute 0; Compute Rasterizer on Compute 1; Ordinary Rendering on Graphics]
GPU perf Queue use cases
• Compute as frontend for graphics pipeline
  • Compute runs asynchronously ahead and prepares & optimizes geometry for graphics pipeline
• Game engines will build large GPU job graphs
  • Move away from single sequential submission
  • Just as we already have done on CPU
[Diagram: Compute queue running Process jobs ahead of Draw0, Draw1, Draw2 on the Graphics queue]
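A GPU job graph can be reduced to a valid submission order with an ordinary topological sort. The sketch below uses Kahn's algorithm on the CPU. `Queue`, `GpuJob`, and `scheduleJobs` are hypothetical names, and the dependency structure in the usage note is an assumption made for illustration, not something the slides specify.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

enum class Queue { Graphics, Compute, Dma };

struct GpuJob {
    Queue queue;                         // engine the job should run on
    std::vector<std::size_t> dependsOn;  // indices of jobs that must finish first
};

// Kahn's algorithm: returns a submission order that respects all dependencies,
// or an empty vector if the graph contains a cycle.
std::vector<std::size_t> scheduleJobs(const std::vector<GpuJob>& jobs) {
    std::vector<std::size_t> remaining(jobs.size(), 0);
    std::vector<std::vector<std::size_t>> dependents(jobs.size());
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        remaining[i] = jobs[i].dependsOn.size();
        for (std::size_t d : jobs[i].dependsOn) dependents[d].push_back(i);
    }
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        if (remaining[i] == 0) ready.push(i);
    }
    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t j = ready.front();
        ready.pop();
        order.push_back(j);
        for (std::size_t n : dependents[j]) {
            if (--remaining[n] == 0) ready.push(n);
        }
    }
    if (order.size() != jobs.size()) order.clear();  // cycle detected
    return order;
}
```

For example, a graph with a GBuffer job and a shadowmap job on the graphics queue, a lighting job on the compute queue that depends on the GBuffer, and a final lighting job that depends on both the shadowmap and the compute lighting would be ordered so the final pass comes last. Each dependency edge that crosses queues would then become a synchronization primitive between those queues.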
GPU performance Programmability
Programmability Explicit Multi-GPU • Explicit control of GPU queues and synchronization, finally! • Implement your own Alternate-Frame-Rendering • Or something more exotic.. • Use case: Workstation rendering with 4-8 GPUs • Super high-quality rendering & simulation • Load balance graphics & compute job graphs across GPUs • 20-40 TFlops in a single machine! • Use case: Low-latency rendering • Important for VR and competitive games • Latency optimized GPU job graph scheduling • VR: Simultaneously drive 2 GPUs (1 per eye)
Programmability New mechanisms
• Command buffer predication & flow control
  • GPU affecting/skipping submitted commands
  • Go beyond DrawIndirect / DispatchIndirect
  • Advanced variable workloads
  • Advanced culling optimizations
• Write occlusion query results into GPU buffer
  • No CPU roundtrip needed
  • Can drive predicated rendering
  • Or use results directly in shaders (lens flares)
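The predication idea can be illustrated with a small CPU-side simulation: commands are recorded unconditionally, and a results buffer decides at execution time which ones actually run. This only models the behavior; on real hardware the check happens on the GPU. `Command`, `predicateIndex`, and `executePredicated` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Command {
    int drawId;
    int predicateIndex;  // index into the results buffer, or -1 for unconditional
};

// Simulates execution of a recorded command list. A predicated draw is skipped
// when its occlusion query wrote zero visible samples into the results buffer.
std::vector<int> executePredicated(const std::vector<Command>& commands,
                                   const std::vector<std::uint64_t>& occlusionResults) {
    std::vector<int> executed;
    for (const auto& c : commands) {
        const bool predicated = c.predicateIndex >= 0;
        if (predicated &&
            occlusionResults[static_cast<std::size_t>(c.predicateIndex)] == 0) {
            continue;  // skipped without any CPU readback of the query
        }
        executed.push_back(c.drawId);
    }
    return executed;
}
```

The command list is built once by the CPU and never patched; only the contents of the results buffer change from frame to frame.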
Programmability Bindless resources
• Mantle supports bindless resources
  • Shaders can select resources to use instead of static binding from CPU
  • Extension of the descriptor set support
• Key component that will open up a lot of opportunities!
• Examples
  • Performance optimizations – less data to update
  • Logic & data structures that live fully on the GPU
  • Scene culling & rendering
  • Material representations
  • Deferred shading
  • Raytracing
Platforms Programmability
Platforms Today • Mantle gives us strong benefits on Windows today • Console-like performance & programmability on both Windows 7 and Windows 8 • For us, well worth the dev time! • DX & GL are the industry standards • Needed for platforms that do not support Mantle • Needed by devs who do not want/need more control • Have to have fallback paths for GL/DX, but not limit oneself to it • Mantle and PlayStation 4 will drive our future Frostbite designs & optimizations • PS4 graphics API has great programmability & performance as well • Share concepts, methods & optimization strategies
Platforms Linux & Mac • Want to see Mantle on Linux and Mac! • Would enable support for our full engine & rendering • Significantly easier to do efficient renderer with Mantle than with OpenGL • Use cases: • Workstations • R&D • Not limited by WDDM • Games • Mantle + SteamOS = powerful combination!
Platforms Mobile • Mobile architectures are getting closer in capabilities to desktop GPUs • Want graphics API that allows apps to fully utilize the hardware • Power efficient • High performance • Programmable • Major opportunity with Mantle – leapfrog GL4, DX11 • For mobile SoC vendors • For Google and Apple
Platforms Multi-vendor? • Mantle is designed to be a thin hardware abstraction • Not tied to AMD’s GCN architecture • Forward compatible • Extensions for architecture- and platform-specific functionality • Mantle would be a much more efficient graphics API for other vendors as well • Most Mantle functionality can be supported on today’s modern GPUs • Want to see future version of Mantle supported on all platforms and on all modern GPUs! • Become an active industry standard with IHVs and ISVs collaborating • Enable us developers to innovate with great performance & programmability everywhere
Frostbite Battlefield 4 • Mantle support is in development • Core renderer (closer to PS4 than DX11) • Implement all rendering techniques used in BF4 (many!) • CPU optimizations (parallel dispatch, descriptor sets) • GPU optimizations (minimize transitions, MSAA) • R&D for advanced GPU optimizations • Memory management • Multi-GPU support • ~2 months of work • Update targeting late December
Frostbite Plants vs Zombies: Garden Warfare • Very different rendering compared to BF4 • Frostbite Mantle renderer will work out of the box • Focus on APU performance
Frostbite Future • All Frostbite games designed with Mantle • 15 games in development across all of EA • Advanced Mantle rendering & use cases • Lots of exciting R&D opportunities! • Want multi-vendor & multi-platform support!
Email: repi@dice.se • Web: http://frostbite.com • Twitter: @repi The end