Allegorithmic Substance

Threaded Middleware

Procedural textures on multi-core

  • Other than framerate and features, what else can you do with extra CPU power?

  • We’ll look at Allegorithmic’s middleware, Substance

Procedural textures are valuable for modern games

  • Have a LOT of textures.

  • Want shorter loading times (faster starts, teleportations or zooms).

  • Need to reduce texture size on disc, for download, and/or in RAM.

  • Can benefit from more flexible and reusable assets.

Introducing Substance

  • In Q2 2007, Allegorithmic started a complete re-engineering of ProFX2 (authoring tool and engine), named Substance.

  • Unit tests were done very early to ensure that Substance could target streaming.

  • Cross-platform: PC, PS3, Xbox, etc.

  • Expected linear multi-thread scalability.

What is Substance?

  • Substance is a middleware product composed of two elements.

  • Substance Authoring Tool lets you

    • create procedural textures

    • create texture packages of only a few kilobytes!

    • A cooker compiles generic data into binaries optimized for a specific platform or user.

  • Substance Engine

    • generates bitmap textures on the fly.
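
As a rough illustration of why a procedural package can stay so small, the sketch below regenerates a whole bitmap from three parameters. It is not Substance code: `CheckerParams` and `generate_checker` are hypothetical names, and a checkerboard stands in for a real procedural source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical "package": a few bytes of parameters instead of a stored bitmap.
struct CheckerParams {
    int cell;            // cell edge length in texels
    std::uint8_t dark;   // grey level of dark cells
    std::uint8_t light;  // grey level of light cells
};

// Generate a size x size 8-bit greyscale texture on the fly.
std::vector<std::uint8_t> generate_checker(const CheckerParams& p, int size) {
    assert(p.cell > 0 && size > 0);
    std::vector<std::uint8_t> texels(static_cast<std::size_t>(size) * size);
    for (int y = 0; y < size; ++y) {
        for (int x = 0; x < size; ++x) {
            // Cells alternate along both axes.
            const bool odd = ((x / p.cell) + (y / p.cell)) % 2 != 0;
            texels[static_cast<std::size_t>(y) * size + x] = odd ? p.light : p.dark;
        }
    }
    return texels;
}
```

Calling `generate_checker(params, 1024)` yields a megabyte of texels from a parameter block of a few bytes, which is the trade the slides describe: CPU time in exchange for storage and streaming bandwidth.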

Lower FPS?

  • More textures, not lower FPS

    • Substance consumes idle cycles, not frames

  • Graphics bitrates follow Moore's law

    • Higher poly count → bigger worlds

    • Higher filter rate → larger textures

    • Desired texture volume grows faster than RAM

  • Streaming is a necessity

    • But HDD net bitrate does not keep up. Bottleneck!

  • Modern gameplay entails sudden bitrate bursts

    • This is worsened by HDD seeks and entails stalls.

No, a stable and high FPS.

  • Even when masked, a stall is actually an FPS drop

  • Substance works in Random Access Memory

  • The gamer zooms or teleports:

    • Give 4 cores and a GPU to Substance

    • Sacrifice 1 or 2 frames

    • Substance generates and caches 1-2M new texels.

    • The stall does not hinder game play.

  • Substance diminishes stalls

  • Substance helps to maintain a high FPS.

Performance issue: streaming in games

  • DVD or HDD net bitrate is 2 or 6 MB/s, respectively

  • Our aim: add a stable 4 MB/s without the GPU

  • Requires billions of intermediate pixels/s.

  • Can CPUs compete with GPUs?

  • Opportunity: cores are still under-exploited in most game engines.

  • Texture processing is privileged in the new multi-core architectures.

The architecture was designed with these issues in mind:

  • Homogeneous CPU and GPU versions

  • Streaming (~1-10 CPU cycles per pixel)

  • SIMD & MT for the multi-core generations

  • No cache or threading pollution

  • Fine-grained jobs and lockless synchronization.

  • Low memory footprint
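
To relate the streaming target of roughly 1-10 cycles per pixel to per-core throughput, here is a small worked calculation. The 3 GHz clock used in the note below is an assumption chosen for illustration; it does not come from the slides.

```cpp
#include <cassert>

// Pixels per second that one core can process at a given cost per pixel.
double pixels_per_second(double clock_hz, double cycles_per_pixel) {
    assert(clock_hz > 0.0 && cycles_per_pixel > 0.0);
    return clock_hz / cycles_per_pixel;
}
```

At an assumed 3 GHz, `pixels_per_second(3e9, 10.0)` gives 0.3G pixels/s and `pixels_per_second(3e9, 1.0)` gives 3G pixels/s per core, which is the order of magnitude needed when billions of intermediate pixels per second are required.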

The theoretical benefit was calculated

  • New architectures come with enhanced SIMD. Expected gain: x10 compared to standard C++.

  • Tricks and algorithmic changes could give another x10 on some filters, such as DXT compression.

  • We were confident that our image processes could be threaded well, partly because we generate textures asynchronously.

  • Hence the CPU version of ProFX2 could be accelerated by a factor of x25 to x100.

This is the approach taken to address the issue:

  • Simple inner-loop tests showed that optimized SSE2-4 code could indeed give a x10 boost.

  • Find a data layout coherent with micro parallelism (SIMD and pipeline), low level threading, cache and memory handling.

  • OpenMP is then used to test strategies before designing a specific MT HAL
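
A minimal sketch of the kind of inner loop such tests might compare: a rounded average of two 8-bit images, 16 pixels per SSE2 instruction, with a scalar fallback. This is an illustration only, not code from Substance; `blend_average` is a hypothetical name, and the x10 figure above is the slides' measurement, not something this sketch demonstrates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Rounded average of two 8-bit images: dst[i] = (a[i] + b[i] + 1) / 2.
void blend_average(const std::uint8_t* a, const std::uint8_t* b,
                   std::uint8_t* dst, std::size_t count) {
    assert(a && b && dst);
    std::size_t i = 0;
#if defined(__SSE2__)
    // 16 pixels per iteration; unaligned loads keep the sketch simple.
    for (; i + 16 <= count; i += 16) {
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_avg_epu8(va, vb));
    }
#endif
    // Scalar tail (and the full loop on targets without SSE2).
    for (; i < count; ++i) {
        dst[i] = static_cast<std::uint8_t>((a[i] + b[i] + 1) / 2);
    }
}
```

Keeping the scalar loop next to the SIMD loop gives a reference to check results against, and hiding the `#if` behind one function is a very small version of what a SIMD HAL does across platforms.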

Here’s the code that was developed to make this possible:

  • A SIMD HAL is ready for PC, Xbox, PS3.

  • OpenMP easily gives an 85% MT linearity.

  • Our MT HAL is converging towards a model of lockless synchronization, 95% expected.

  • The cooker precomputes data that will help synchronization and MT efficiency.

  • Our API exposes asynchronous commands. Perfect to share cores with a game loop!
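
The slides do not show the API itself, so the following is only a sketch of the asynchronous-command pattern: submitting returns at once with a handle, the work runs later, and the game loop polls for completion. `AsyncQueue`, `submit`, `pump` and `is_done` are hypothetical names, and this single-threaded stand-in runs commands when `pump` is called, where a real engine would run them on worker threads.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical handle-based command queue (not the real Substance API).
class AsyncQueue {
public:
    using Handle = std::size_t;

    // Queue a command and return immediately with a handle.
    Handle submit(std::function<void()> command) {
        pending_.push_back({done_.size(), std::move(command)});
        done_.push_back(false);
        return done_.size() - 1;
    }

    // Run at most max_commands queued commands; meant for idle time.
    std::size_t pump(std::size_t max_commands) {
        std::size_t ran = 0;
        while (ran < max_commands && !pending_.empty()) {
            Job job = std::move(pending_.front());
            pending_.pop_front();
            job.command();
            done_[job.handle] = true;
            ++ran;
        }
        return ran;
    }

    // Non-blocking completion check for the game loop.
    bool is_done(Handle h) const {
        assert(h < done_.size());
        return done_[h];
    }

private:
    struct Job { Handle handle; std::function<void()> command; };
    std::deque<Job> pending_;
    std::vector<bool> done_;
};
```

A game loop would call `submit` when it anticipates needing a texture, keep rendering, and bind the texture only once `is_done` reports true, so generation never blocks a frame.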

The compositing graph, node-based image processing

  • Authoring Tool: non-linear editing

  • Engine: efficient high level structure

  • Graph (DAG) contains 3 types of nodes:

    • Sources: procedural noise, bitmaps, SVGs

    • Filters: blend, HSL, TRS, warp, blur, etc.

    • Outputs: coherent diffuse & normal maps, etc.

  • Main advantages:

    • Libraries, capsules: instantiation of subgraphs

    • Complex variants: fast to create and compute

    • Dynamic custom branches (e.g. aging textures)
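
The graph structure described above can be sketched as follows. This is a simplified illustration, not the engine's data structure: `Node`, `evaluate`, `make_constant` and `make_blend` are hypothetical, images are single-channel float arrays, and nodes are assumed to be stored in topological order.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Image = std::vector<float>;  // one channel, values in [0, 1]

// A node computes its image from the images of its input nodes.
struct Node {
    std::vector<std::size_t> inputs;  // indices of upstream nodes
    std::function<Image(const std::vector<const Image*>&)> eval;
};

// Evaluate a DAG whose nodes are stored in topological order
// (every input index is smaller than the node's own index).
std::vector<Image> evaluate(const std::vector<Node>& graph) {
    std::vector<Image> results(graph.size());
    for (std::size_t i = 0; i < graph.size(); ++i) {
        std::vector<const Image*> in;
        for (std::size_t src : graph[i].inputs) {
            assert(src < i);  // topological order, hence no cycles
            in.push_back(&results[src]);
        }
        results[i] = graph[i].eval(in);
    }
    return results;
}

// Source node: constant grey image of n texels.
Node make_constant(std::size_t n, float value) {
    return Node{{}, [=](const std::vector<const Image*>&) { return Image(n, value); }};
}

// Filter node: 50/50 blend of two inputs of equal size.
Node make_blend(std::size_t a, std::size_t b) {
    return Node{{a, b}, [](const std::vector<const Image*>& in) {
        Image out(in[0]->size());
        for (std::size_t i = 0; i < out.size(); ++i)
            out[i] = 0.5f * ((*in[0])[i] + (*in[1])[i]);
        return out;
    }};
}
```

Because each node depends only on its inputs, a variant (for example an aged version of a texture) can reuse every upstream result and recompute only the branch that changed.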


Threading strategies

  • High level threading:

    • Task decomposition: 1 node (filter) per thread

    • Graph splitting ensures task independence

  • Low level threading:

    • Data decomposition: 1 strip of blocks per thread

    • Dispatcher ensures non conflicting areas

    • Pixel to pixel filters are concatenated.

    • Streamed R/W, no L2 cache pollution

    • Temporary blocks in private L1 double buffers

    • Intermediate images never allocated

    • Lockless reactive sync and cache friendly
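
The low-level strategy can be illustrated with a small sketch: two pixel-to-pixel filters are concatenated into one pass, and the image is split into strips of rows that never overlap. This is an assumption-laden simplification, not the engine's dispatcher: `filter_by_strips`, `invert` and `darken` are hypothetical, and OpenMP stands in for the custom MT HAL, as it did in the early tests mentioned earlier.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Two pixel-to-pixel filters, concatenated so no intermediate image is allocated.
inline std::uint8_t invert(std::uint8_t v) { return static_cast<std::uint8_t>(255 - v); }
inline std::uint8_t darken(std::uint8_t v) { return static_cast<std::uint8_t>(v / 2); }

// Apply darken(invert(x)) in place, one strip of rows per loop iteration.
// Strips cover disjoint rows, so iterations never write to the same texel.
void filter_by_strips(std::vector<std::uint8_t>& image, int width, int height,
                      int strip_rows) {
    assert(width > 0 && height > 0 && strip_rows > 0);
    assert(image.size() == static_cast<std::size_t>(width) * height);
    const int strips = (height + strip_rows - 1) / strip_rows;
    // Compilers without OpenMP enabled ignore this pragma and run serially.
    #pragma omp parallel for
    for (int s = 0; s < strips; ++s) {
        const int y0 = s * strip_rows;
        const int y1 = std::min(y0 + strip_rows, height);
        for (int y = y0; y < y1; ++y) {
            std::uint8_t* row = image.data() + static_cast<std::size_t>(y) * width;
            for (int x = 0; x < width; ++x) row[x] = darken(invert(row[x]));
        }
    }
}
```

Since no two strips share a row, the workers need no locks on the image itself; the only coordination left is handing out strips, which is the part a dispatcher takes care of.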

Threading sub graphs (1/11): by nodes (high level)

Threading sub graphs (2/11): by nodes, caching

Threading sub graphs (3/11): by nodes

Threading sub graphs (4/11): by strips (low level)

Threading sub graphs (5/11): remove from cache

Threading sub graphs (6/11): by strips

Threading sub graphs (7/11): remove from cache

Threading sub graphs (8/11): by strips

Threading sub graphs (9/11): remove from cache

Threading sub graphs (10/11): by strips

Threading sub graphs (11/11): update cache, and finished

Expect more streaming bandwidth

  • Substance generates 4 MB/s of compressed textures

  • Combine this with classical streaming

  • 50+ MB/s loading with 4 cores and 1 GPU

Here’s how close we got to the theoretical best performance:

  • DXT compression at 2G pixels/s (the same as what high-end GPUs could do in 2007).

  • 8-bit SVG (cooked) rendering at 20G pixels/s; 8G pixels/s with 4-sub-sample anti-aliasing.

  • In most cases 4 cores give a x3.8 boost

  • Some filters are more problematic, but solutions have been designed in detail and will be implemented between Q2 and Q4 2008.

Here’s the new performance profile:

  • Substance and ProFX2 figures are for one core.

  • 4 cores: 3.8 times more fillrate.

  • ProFX2: SVG GPU

  • Substance: SVG CPU

  • SVG AA: 2G pixels/s per core

This is future-proofed

  • The cooker precomputes whatever helps to linearise computations.

  • Scalable code: SSE4 added in one day thanks to the SIMD HAL

  • Scalable threading: our two strategies scale

  • A few functions dispatch virtual CPU "shaders"

  • 64-core ready ↔ code a new dispatcher?

  • Multiplatform design.

What’s next?

Procedural diffuse map

Coherent procedural normal map

Complex procedural environment map

This scene is made entirely of procedural textures

Future sources of bandwidth

  • SIMD code can be better pipelined in ASM.

  • Our cooker can optimize a lot of things.

  • Authoring tool will have a RT profiler

  • Artists gaining experience with Substance will also optimize their packages better.

  • Artist feedback will also help us to improve the expressiveness of each filter

  • ~30-50 filters per texture: the main performance divisor.

Here’s how you can best take advantage of procedural textures

  • Anticipate texture generation requests.

  • Predict visibility (HOM, PVS).

  • Create mipmaps. Access levels just in time (JIT).

  • Cache the useful texels.

  • Adapt texture resolution to workload.

  • Use texture variants, less tiling textures or details. Show a higher texel/pixel ratio.
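
One way to adapt texture resolution to workload is to pick the finest mip level that fits a texel budget for the current request. The sketch below assumes square textures and a budget expressed in texels; `choose_mip_level` is a hypothetical helper, not part of any Substance API.

```cpp
#include <cassert>
#include <cstdint>

// Pick the finest mip level whose texel count fits a per-request budget.
// Level 0 is the full base_size x base_size texture; each level halves the edge.
int choose_mip_level(int base_size, std::uint64_t texel_budget) {
    assert(base_size > 0);
    int level = 0;
    std::uint64_t edge = static_cast<std::uint64_t>(base_size);
    while (edge > 1 && edge * edge > texel_budget) {
        edge /= 2;
        ++level;
    }
    return level;
}
```

With a budget in the 1-2M texel range quoted earlier for a zoom or teleport, a 1024 x 1024 texture fits at level 0 under a 2M budget and drops to level 1 under a 1M budget; finer levels can then be requested just in time as the budget allows.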

What do you think?

  • Have you tried something like this?

  • Have you rejected trying something like this?
