Parallel Graphics APIs

Gregory S. Johnson


Topics

  • Problem: Host / Graphics Performance Mismatch

  • Conventional Solutions

  • Parallelism

  • IRIS Performer (Rohlf and Helman, 1994)

  • Stanford Parallel API (Igehy et al., 1998)


Problem

  • graphics subsystems can process graphics primitives faster than a single-CPU host can deliver the related sequence of commands

  • when a single-CPU host is busy with non-graphics related tasks (I/O, OS, etc.), the graphics subsystem idles

[Figure: timeline of OpenGL commands issued versus OpenGL* commands processed]


Bottlenecks (Igehy et al.)

  • overhead associated with encoding API commands

  • data bandwidth from the API host

  • data bandwidth into the graphics subsystem

  • overhead associated with decoding API commands


Solution Directions

  • utilize the given resources more effectively

  • add more hardware resources


Conventional Solutions: Packed Primitive Arrays

  • arrays of primitives stored in system memory which can be issued to the graphics system via a small number of API calls

  • the use of primitive arrays can result in reduced API overhead and increased bandwidth utilization via DMA

glVertexPointer(2, GL_FLOAT, 0, verts);
glEnableClientState(GL_VERTEX_ARRAY);
glColorPointer(3, GL_FLOAT, 0, colors);
glEnableClientState(GL_COLOR_ARRAY);

/* "strip" points to an array of indices into the "verts" array, */
/* ordered with triangle strip connectivity */
glDrawElements(GL_TRIANGLE_STRIP, length, GL_UNSIGNED_INT, strip);


Conventional Solutions: Display Lists

  • a display list is a set of graphics commands (low level equivalents) stored on the graphics subsystem and typically used as a macro

  • useful in cases where geometry in a scene is drawn repeatedly

  • even more useful if the geometry fits on the graphics card itself

/* create a "vane" for the tail of the arrow */
glNewList(VANE, GL_COMPILE);
  glBegin(GL_QUADS);
    glColor3f(1.0, 1.0, 1.0);
    glVertex3fv(v1); glVertex3fv(v2);
    glVertex3fv(v3); glVertex3fv(v4);
    ...
  glEnd();
glEndList();


Conventional Solutions: Compression

  • encoding of scene geometry by the host CPU and decoding by the graphics subsystem

  • compression of graphics data can reduce inter-subsystem bandwidth requirements at the expense of decoding time

GL_SUNX_geometry_compression
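
The encode-on-host, decode-on-graphics-subsystem trade-off can be sketched with a simple 16-bit quantization of vertex coordinates. This is only an illustration under assumed conditions (coordinates known to lie in a fixed range [lo, hi]); it is not the format used by the extension named above, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side encode: map a float in [lo, hi] to 16 bits. */
static uint16_t quantize16(float x, float lo, float hi)
{
    assert(hi > lo);
    float t = (x - lo) / (hi - lo);          /* normalize to [0, 1] */
    if (t < 0.0f) t = 0.0f;                  /* clamp out-of-range input */
    if (t > 1.0f) t = 1.0f;
    return (uint16_t)(t * 65535.0f + 0.5f);  /* round to nearest step */
}

/* Hypothetical receiver-side decode: this is the extra work paid
   in exchange for the reduced bandwidth. */
static float dequantize16(uint16_t q, float lo, float hi)
{
    return lo + (hi - lo) * ((float)q / 65535.0f);
}

/* Encode n coordinates; returns the number of bytes to transfer,
   which is half of the n * sizeof(float) bytes of the raw data. */
static size_t encode_positions(const float *in, size_t n,
                               float lo, float hi, uint16_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = quantize16(in[i], lo, hi);
    return n * sizeof(uint16_t);
}
```

The transfer shrinks by half, while every coordinate costs one multiply-add to decode and loses precision of up to half a quantization step.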


Parallelism

  • inherent parallelism: not all graphics-related commands need be issued in strict order (e.g. drawing opaque primitives on Z-buffer equipped hardware)

  • parallelism to cover latency: the graphics subsystem is faster at processing commands than the host CPU is at generating them

[Figure: timeline of OpenGL commands issued versus OpenGL* commands processed]
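
The inherent-parallelism point about opaque primitives on Z-buffered hardware can be checked with a one-pixel sketch: a depth-tested write keeps the nearest fragment, so the final pixel does not depend on the order in which fragments arrive. The types and function names are hypothetical, and the sketch assumes no two fragments share exactly the same depth (ties would make the result order dependent).

```c
#include <assert.h>
#include <float.h>

typedef struct { float depth; int color; } Pixel;

/* Reset a pixel to "nothing drawn yet": farthest depth, background color. */
static void pixel_clear(Pixel *px, int background)
{
    px->depth = FLT_MAX;
    px->color = background;
}

/* Depth-tested write of one opaque fragment: keep it only if it is nearer
   than what the pixel already holds. */
static void pixel_write(Pixel *px, float depth, int color)
{
    assert(depth >= 0.0f);
    if (depth < px->depth) {
        px->depth = depth;
        px->color = color;
    }
}
```

Writing the same set of fragments in any permutation leaves the pixel with the color of the nearest one, which is why opaque geometry from different threads need not be ordered.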


Terminology

  • context is the scope within which graphics state is affected by graphics commands issued (in some sense a binding between graphics state and issued graphics commands)


IRIS Performer: A High Performance Multiprocessing Toolkit for Real-Time 3D Graphics

John Rohlf, James Helman

Silicon Graphics Computer Systems (1994)


Summary

  • discusses the design and implementation of a pair of libraries for developing high performance graphics applications easily

  • a low-level library to provide high performance rendering via specialized graphics primitives and efficient state management

  • a high-level library for multiprocessing which utilizes pipeline parallelism for traversing, culling, and issuing elements of a hierarchically organized scene graph


A Tale of Two Libraries

  • libpr provides efficient graphics primitives, state management, and basic mechanisms in support of efficient rendering

  • libpf provides support for multiprocessing and hierarchical organization of scene elements


libpr: pfGeoSet

  • a data structure similar to a primitive array, which holds homogeneous graphics primitives and their associated color, normal, and texture coordinate data


libpr: State Management

  • libpr provides 3 mechanisms for setting graphics state

  • immediate mode: a “state stack” helps reduce unnecessary state changes and is typically used to set global state

  • display list mode: typically used by libpf to capture a full frame’s worth of data for purposes of multiprocessing

  • encapsulated mode: motivated by the observation that most state applies to the bulk of a scene; typically used to tie a small number of state changes to specific geometry
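
How a state stack can reduce unnecessary state changes is sketched below: the library remembers the last value it sent for each piece of state and skips any request that would not change it. This is a guess at the general technique, not libpr's actual API; every name here is hypothetical and the hardware call is replaced by a counter.

```c
#include <assert.h>

enum { STATE_TEXTURE, STATE_LIGHTING, STATE_COUNT };
#define STACK_DEPTH 8

typedef struct {
    int current[STATE_COUNT];             /* value last sent to the hardware */
    int saved[STACK_DEPTH][STATE_COUNT];  /* pushed copies of the state */
    int depth;                            /* number of pushed copies */
    int hw_changes;                       /* changes actually issued */
} StateCache;

static void state_init(StateCache *s)
{
    for (int i = 0; i < STATE_COUNT; i++) s->current[i] = 0;
    s->depth = 0;
    s->hw_changes = 0;
}

/* Issue a state change only if the value differs from the cached one. */
static void state_set(StateCache *s, int which, int value)
{
    assert(which >= 0 && which < STATE_COUNT);
    if (s->current[which] == value) return;   /* redundant: skip it */
    s->current[which] = value;
    s->hw_changes++;   /* a real library would call the graphics API here */
}

static void state_push(StateCache *s)
{
    assert(s->depth < STACK_DEPTH);
    for (int i = 0; i < STATE_COUNT; i++)
        s->saved[s->depth][i] = s->current[i];
    s->depth++;
}

/* Restore the pushed state; only values that differ are re-issued. */
static void state_pop(StateCache *s)
{
    assert(s->depth > 0);
    s->depth--;
    for (int i = 0; i < STATE_COUNT; i++)
        state_set(s, i, s->saved[s->depth][i]);
}
```

Pushing before a piece of geometry with its own state and popping afterwards restores the global state while re-issuing only what that geometry actually changed.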


libpr: Multiprocessing Support

  • libpr doesn’t implement multiprocessing itself

  • libpr does provide support for shared data including synchronized access

  • includes “multibuffered” arrays which can be thought of as multiple copies of an array, each at a different stage of processing

  • multibuffering solves the problems of data exclusion and synchronization
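
The idea of "multiple copies of an array, each at a different stage of processing" can be sketched as follows. Each pipeline stage touches only its own copy during a frame, so no locking is needed inside the frame; copies rotate at the frame boundary. This is an assumed, simplified design (whole-array copy at each boundary, three stages, fixed length), not Performer's implementation, and all names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define STAGES 3   /* assumed pipeline depth, e.g. APP, CULL, DRAW */
#define LEN 4      /* assumed array length */

typedef struct {
    float copy[STAGES][LEN];   /* one copy per pipeline stage */
    int frame;                 /* number of frame boundaries passed */
} MultiBuffer;

/* Index of the copy owned by a stage during the current frame.
   Stage 0 writes the newest copy; stage k reads the copy that
   stage 0 wrote k frames ago. */
static int mb_index(const MultiBuffer *mb, int stage)
{
    assert(stage >= 0 && stage < STAGES);
    return ((mb->frame - stage) % STAGES + STAGES) % STAGES;
}

static float *mb_data(MultiBuffer *mb, int stage)
{
    return mb->copy[mb_index(mb, stage)];
}

/* Frame boundary: rotate ownership and seed stage 0's new copy
   with the values it wrote in the previous frame. */
static void mb_advance(MultiBuffer *mb)
{
    const float *prev = mb_data(mb, 0);
    mb->frame++;
    memcpy(mb_data(mb, 0), prev, sizeof mb->copy[0]);
}
```

Because the stage indices are always distinct, a write by stage 0 can never be observed half-finished by a later stage; the only synchronization point is the call to mb_advance at the frame boundary.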


libpf: Scene Graphs

  • libpf organizes scene elements into scene graphs, for increased modeling, access, and processing efficiency

  • a scene graph is a tree-like structure containing nodes which correspond to geometry, lights, cameras, coloration, texture, transformations, etc.


libpf: Scene Graph Hierarchy

  • scene graphs promote top-down state inheritance

  • the top-down inheritance restriction enables parallel traversal and processing of the scene graph tree

  • scene graphs also encode a hierarchy of bounding volumes, simplifying intersection testing and culling


libpf: Scene Graph Traversal

  • intersection traversal: application-driven collision detection

  • culling traversal: precedes the draw traversal, culling geometry whose bounding spheres fall outside of the view frustum and placing the remaining geometry in a (possibly sorted) display list

  • draw traversal: traverses the display list generated during the culling phase and issues the appropriate commands to the graphics subsystem
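
The culling traversal can be sketched as a recursive walk over a hierarchy of bounding spheres that prunes a whole subtree as soon as its sphere lies outside any frustum plane, and appends the surviving geometry to a list for the draw traversal. The node layout, the plane convention (a point p is inside when n·p + d >= 0), and all names are assumptions for illustration, not libpf's data structures; sorting of the list is omitted.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z, r; } Sphere;
typedef struct { float nx, ny, nz, d; } Plane;   /* inside: n.p + d >= 0 */

typedef struct Node {
    Sphere bound;            /* encloses this node and all its children */
    int geom_id;             /* >= 0 for drawable geometry, -1 for a group */
    struct Node *child[4];
    int nchild;
} Node;

/* True if the sphere lies entirely outside at least one plane. */
static int sphere_outside(const Sphere *s, const Plane *pl, int nplanes)
{
    for (int i = 0; i < nplanes; i++) {
        float dist = pl[i].nx * s->x + pl[i].ny * s->y
                   + pl[i].nz * s->z + pl[i].d;
        if (dist < -s->r) return 1;
    }
    return 0;
}

/* Append the ids of potentially visible geometry to out[];
   returns the new element count. */
static size_t cull(const Node *n, const Plane *pl, int nplanes,
                   int *out, size_t count, size_t cap)
{
    assert(n != NULL);
    if (sphere_outside(&n->bound, pl, nplanes))
        return count;                       /* prune the whole subtree */
    if (n->geom_id >= 0 && count < cap)
        out[count++] = n->geom_id;
    for (int i = 0; i < n->nchild; i++)
        count = cull(n->child[i], pl, nplanes, out, count, cap);
    return count;
}
```

A draw traversal would then simply iterate over out[0..count-1] and issue the commands for each entry.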


libpf: Optimizations

  • pfFlatten: reduce the number of transformations

  • pfLOD: level-of-detail based on geometry of varying complexity

  • pfSequence: animated sequences

  • pfBillboard: special representation of axially symmetric shapes
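
As an illustration of the level-of-detail item in the list above, one common way to choose among models of varying complexity is by distance from the viewer. The sketch below assumes an ascending table of switch distances; it is a hypothetical helper, not the pfLOD interface.

```c
#include <assert.h>

/* ranges[i] is the distance at which level i stops being used
   (ascending). Returns the level to draw, with 0 the most detailed,
   or -1 if the object is beyond the last range and is not drawn. */
static int select_lod(const float *ranges, int nlevels, float distance)
{
    assert(nlevels > 0);
    for (int i = 0; i < nlevels; i++)
        if (distance < ranges[i])
            return i;
    return -1;
}
```

With ranges of 10, 50, and 200 units, an object 5 units away uses level 0, one 60 units away uses level 2, and one 500 units away is skipped.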


libpf: Multiprocessing

  • a pipelined approach to multiprocessing, whereby different processors execute different stages of the APP -> CULL -> DRAW and APP -> ISECT pipelines
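
The effect of pipelining APP -> CULL -> DRAW across processors can be sketched with a simple timing model. It assumes, purely for illustration, that each stage takes exactly one time step per frame and runs on its own processor; the names are hypothetical.

```c
#include <assert.h>

enum Stage { APP = 0, CULL = 1, DRAW = 2, NSTAGES = 3 };

/* Frame number a stage works on at a given time step, or -1 while
   the pipeline is still filling. */
static int frame_at(enum Stage s, int tick)
{
    int f = tick - (int)s;
    return f >= 0 ? f : -1;
}

/* Time steps needed to finish nframes with one processor per stage. */
static int ticks_pipelined(int nframes)
{
    assert(nframes >= 0);
    return nframes > 0 ? nframes + NSTAGES - 1 : 0;
}

/* Time steps needed when one processor runs all stages in turn. */
static int ticks_serial(int nframes)
{
    assert(nframes >= 0);
    return nframes * NSTAGES;
}
```

Under this model throughput approaches one frame per time step, but each frame still takes three steps from APP to DRAW, so the pipeline trades latency for frame rate.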


The Design of a Parallel Graphics Interface

Homan Igehy, Gordon Stoll, Pat Hanrahan

Stanford University (1998)


Summary

  • discuss several issues (state, mode, order) influencing the design of a graphics API

  • propose a swank parallel API composed of a small number of extensions to OpenGL

  • present an implementation of the API within a custom software graphics pipeline

  • examine the performance of the implementation applied to a pair of graphics-related applications


Parallelism via Existing OpenGL Constructs

  • consider a pair of application threads each with its own graphics context, issuing OpenGL commands for a single framebuffer

  • recall that a stream of OpenGL commands is issued by the host CPU(s) and later executed by the graphics subsystem

Thread 1

DrawPrimitives(opaq[1..256])
appBarrier(appBarrierVar)
DrawPrimitives(tran[1..256])
glFinish()
appBarrier(appBarrierVar)

Thread 2

DrawPrimitives(opaq[257..512])
glFinish()
appBarrier(appBarrierVar)
appBarrier(appBarrierVar)
DrawPrimitives(tran[257..512])


Addition of a Wait Construct

  • glFinish() commands force the issuing threads to wait for the previously issued graphics commands to complete (on the graphics subsystem)

  • but synchronization between the application threads in this example is only needed to ensure that the graphics commands are issued in order

Thread 1

DrawPrimitives(opaq[1..256])
appBarrier(appBarrierVar)
glpWaitContext(Thread2Ctx)
DrawPrimitives(tran[1..256])
appBarrier(appBarrierVar)

Thread 2

DrawPrimitives(opaq[257..512])
appBarrier(appBarrierVar)
appBarrier(appBarrierVar)
glpWaitContext(Thread1Ctx)
DrawPrimitives(tran[257..512])


Improved Synchronization

  • synchronization of the graphics command streams in the previous example is performed by the application threads, stalling them

  • graphics subsystem-level synchronization mechanisms are introduced: barriers (many-to-many) and semaphores (point-to-point)

Thread 1

DrawPrimitives(opaq[1..256])
glpBarrier(glpBarrierVar)
DrawPrimitives(tran[1..256])
glpBarrier(glpBarrierVar)

Thread 2

DrawPrimitives(opaq[257..512])
glpBarrier(glpBarrierVar)
glpBarrier(glpBarrierVar)
DrawPrimitives(tran[257..512])
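
The difference between application-level and graphics-subsystem-level synchronization can be sketched with a toy model in which semaphore operations are ordinary commands inside each context's stream. The application threads only append commands and never block; it is the consumer of the streams (standing in for the graphics subsystem) that holds back a stream whose next command is a P on a zero semaphore. The command encoding, the round-robin scheduling, and all names are assumptions made for illustration, not the Stanford implementation.

```c
#include <assert.h>

enum Op { OP_DRAW, OP_P, OP_V, OP_END };
typedef struct { enum Op op; int arg; } Cmd;  /* arg: primitive or semaphore id */

#define MAX_CTX 4

/* Consume the command streams of nctx contexts round-robin.
   Executed primitive ids are written to out[] in execution order.
   Returns the number of primitives executed, or -1 on deadlock. */
static int run_streams(const Cmd *stream[], int nctx,
                       int sema[], int out[], int cap)
{
    int pc[MAX_CTX] = {0};     /* next command index per context */
    int n = 0, done = 0;
    assert(nctx > 0 && nctx <= MAX_CTX);
    while (done < nctx) {
        int progressed = 0;
        done = 0;
        for (int c = 0; c < nctx; c++) {
            const Cmd *cmd = &stream[c][pc[c]];
            if (cmd->op == OP_END) { done++; continue; }
            if (cmd->op == OP_P) {
                if (sema[cmd->arg] == 0) continue;  /* this stream waits */
                sema[cmd->arg]--;
            } else if (cmd->op == OP_V) {
                sema[cmd->arg]++;
            } else if (n < cap) {
                out[n++] = cmd->arg;                /* "draw" the primitive */
            }
            pc[c]++;
            progressed = 1;
        }
        if (!progressed && done < nctx) return -1;  /* every stream blocked */
    }
    return n;
}
```

For example, if one context's stream is "P(0), draw 2" and another's is "draw 1, V(0)", primitive 1 is always executed before primitive 2, even though both streams were fully issued without either application thread waiting.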


Example: Marching Cubes

Serial

for (i=0; i<M; i++)
  for (j=0; j<N; j++)
    ExtractAndRender(grid[i,j])

Parallel (Unordered)

for (i=0; i<M; i++)
  for (j=(myProc+i)%P; j<N; j+=P)
    ExtractAndRender(grid[i,j])

Parallel (Ordered)

for (i=0; i<M; i++)
  for (j=(myProc+i)%P; j<N; j+=P) {
    if (i>0) glpPSema(sema[i-1,j])
    if (j>0) glpPSema(sema[i,j-1])
    ExtractAndRender(grid[i,j])
    if (i<M-1) glpVSema(sema[i,j])
    if (j<N-1) glpVSema(sema[i,j])
  }


Implementation: Argus

[Figure: InfiniteReality pipeline and Argus pipeline diagrams]


Performance

  • Argus software pipeline on an SGI Origin SMP, applied to Nurbs (a patch tessellator, embarrassingly parallel) and March (parallel marching cubes)


Convergence

  • the Performer approach utilizes pipeline parallelism while the Stanford approach utilizes multithreaded parallelism

  • the authors note that the role of their API is complementary to that of IRIS Performer, which utilizes pipeline parallelism but is constrained by placing one processor in charge of issuing graphics commands


The End

