Operating System Abstractions for GPU Programming

Chris Rossbach, Microsoft Research

Emmett Witchel, University of Texas at Austin

September 23, 2010

Motivation
  • GPU application domains limited
  • CUDA
    • Rich APIs/abstractions
    • Language integration → familiar environment
  • Composition/integration into systems
    • Programmer-visible abstractions expressive
    • Realization/implementations unsatisfactory
  • Better OS-level abstractions are required
    • (IP/Business concerns aside)
Traditional OS-Level abstractions

    #include <stdio.h>

    int main(int argc, char **argv) {
        FILE *fp = fopen("quack", "w");
        if (fp == NULL)
            fprintf(stderr, "failure\n");
        return 0;
    }

    // How do I program just a CPU and a disk?

[Figure: the traditional stack: the programmer-visible interface on top, OS-level abstractions beneath it, and the hardware interface at the bottom.]

GPU Abstractions

[Figure: the GPU stack: a rich programmer-visible interface sits on top of just one OS-level abstraction!]

The programmer gets to work with great abstractions…

Why is this a problem?

Poor OS abstractions limit GPUs

Doing fine without OS support:

    • Gaming/Graphics
      • Shader Languages
      • DirectX, OpenGL
    • “GPU Computing”, née “GPGPU”
      • user-mode/batch
      • scientific algorithms
      • Latency-tolerant
      • CUDA
  • The application ecosystem is more diverse
    • No OS abstractions → no traction
Interactive Applications
  • Gestural Interface
  • Brain-Computer Interface
  • Spatial Audio
  • Image Recognition
  • Processing user input:
    • needs low latency and concurrency
    • must be multiplexed by the OS
    • requires protection/isolation
Gestural Interface

[Figure: gestural interface pipeline. Raw images flow through Image Filtering and a Geometric Transform to produce a point cloud; Gesture Recognition turns the point cloud into “hand” events, which are delivered to the OS as HID input.]

  • High data rates
  • Noisy input
  • Data-parallel algorithms


What I wish I could do

#> catusb | xform | detect | hidinput &

  • catusb: captures image data from USB
  • xform:
    • Noise filtering
    • Geometric transformation
  • detect: extracts gestures from the point cloud
  • hidinput: sends mouse events (or whatever)

xform and detect are data parallel; catusb and hidinput are inherently sequential.

We could parallelize the data-parallel stages on a CMP, but…

So use the GPU! (naïve approach)

#> catusb | xform | detect | hidinput &

  • Run catusb on the CPU
  • Run xform on the GPU
  • Run detect on the GPU
  • Run hidinput on the CPU

Use CUDA to write xform and detect!
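
For concreteness, here is a minimal sketch (not from the slides) of what one data-parallel stage of xform might look like in CUDA; the kernel name, the 3x3 box filter, and the image layout are illustrative assumptions.

    // Hypothetical noise-filtering stage of xform: a 3x3 box blur over an
    // 8-bit grayscale image. Names and image layout are assumptions.
    __global__ void noise_filter(const unsigned char *in, unsigned char *out,
                                 int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        int sum = 0, count = 0;
        for (int dy = -1; dy <= 1; dy++) {
            for (int dx = -1; dx <= 1; dx++) {
                int nx = x + dx, ny = y + dy;
                if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
                    sum += in[ny * width + nx];
                    count++;
                }
            }
        }
        out[y * width + x] = (unsigned char)(sum / count);
    }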

Running a program on a GPU
  • GPUs cannot run OS: different ISA
  • Disjoint memory space, no coherence*
  • Host CPU must manage execution
    • Program inputs explicitly bound at runtime

[Figure: the CPU and main memory on one side, the GPU and GPU memory on the other. User-mode apps must implement the dance themselves: copy inputs to GPU memory, send commands to the GPU, copy outputs back to main memory.]
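
The slides do not show host code, but the copy-inputs / send-commands / copy-outputs pattern looks roughly like the following CUDA sketch (a minimal illustration; the noise_filter kernel above, the sizes, and the omitted error handling are assumptions).

    // Host-side sketch: explicitly copy inputs, launch a kernel, copy outputs.
    void run_xform_once(const unsigned char *host_in, unsigned char *host_out,
                        int width, int height)
    {
        size_t bytes = (size_t)width * height;
        unsigned char *dev_in, *dev_out;
        cudaMalloc((void **)&dev_in, bytes);
        cudaMalloc((void **)&dev_out, bytes);

        // Copy inputs to GPU memory
        cudaMemcpy(dev_in, host_in, bytes, cudaMemcpyHostToDevice);

        // Send commands: launch the (hypothetical) noise_filter kernel
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        noise_filter<<<grid, block>>>(dev_in, dev_out, width, height);

        // Copy outputs back to main memory
        cudaMemcpy(host_out, dev_out, bytes, cudaMemcpyDeviceToHost);

        cudaFree(dev_in);
        cudaFree(dev_out);
    }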

Technology Stack View
  • 12 kernel crossings
    • 6 copy_to_user
    • 6 copy_from_user
  • Performance tradeoffs for runtime/abstractions

[Figure: technology stack. In user mode, catusb, xform, detect, and hidinput run as separate applications; xform and detect sit on the CUDA Runtime and user-mode drivers (DXVA). Below the user/kernel boundary sit the OS executive, kernel-mode drivers, and the HAL, which drive the USB controller, the CPU, and the GPU. Every stage of the pipeline moves its data up and down this stack.]

So, big deal…do it all in the kernel

[Figure: the same stack with catusb, xform, detect, and hidinput pushed down into kernel mode, next to the OS executive, kernel-mode drivers, and HAL, driving the USB controller, CPU, and GPU directly.]

  • No CUDA, no high-level abstractions
  • If you’re Microsoft and/or NVIDIA, this might be tenable…
  • The solution is specialized
  • …but there is still a data migration problem
Hardware View
  • We’d prefer:
    • catusb: USB bus → GPU memory
    • xform, detect: no transfers
    • hidinput: a single GPU → main memory transfer
    • …or even less, if GPUs become coherent with main memory
  • The machine can do this; where are the interfaces?

[Figure: hardware topology. The CPU connects over the FSB to the northbridge, which reaches the DDR2/3 DIMMs and, over PCI-e, the GPU; a DMI link reaches the southbridge and its USB 2.0 controller. Today each stage (catusb, xform, detect, hidinput) bounces its data through main memory on this path, causing cache pollution, wasted bandwidth, and wasted power.]

Outline
  • Motivation
  • Problems with lack of OS abstractions
  • Can CUDA solve these problems?
    • CUDA Streams
    • Asynchrony
    • GPUDirect™
    • OS guarantees
  • New OS abstractions for GPUs
  • Related Work
  • Conclusion
Doesn’t CUDA address these problems?

[Figure: the same hardware topology as before (CPU, FSB, northbridge, DDR2/3 DIMMs, PCI-e GPU, DMI, southbridge, USB 2.0), annotated with the CUDA features that attack pieces of the problem:]

  • CUDA streams, async APIs (overlap capture/transfer/execution)
  • “Write-combining memory” (uncacheable)
  • Page-locked host memory (faster DMA)
  • Portable memory (share page-locked memory)
  • Mapped memory (map host memory into GPU space)
  • GPUDirect™ (transparent transfers, app-level upcalls)
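
As a hedged illustration of how an application opts into a few of those features: the flags below are real CUDA runtime flags, but the buffer, its size, and the usage are placeholders, not taken from the talk.

    // Sketch: page-locked, write-combined host memory that is also mapped
    // into the GPU's address space.
    unsigned char *host_buf = NULL;
    unsigned char *dev_alias = NULL;
    size_t bytes = 640 * 480;                        // placeholder image size

    cudaSetDeviceFlags(cudaDeviceMapHost);           // enable mapped host memory
                                                     // (before the CUDA context is created)
    cudaHostAlloc((void **)&host_buf, bytes,
                  cudaHostAllocMapped |              // visible to the GPU
                  cudaHostAllocWriteCombined |       // uncacheable, fast DMA writes
                  cudaHostAllocPortable);            // usable from any host thread/context
    cudaHostGetDevicePointer((void **)&dev_alias, host_buf, 0);

    // A kernel can now read host_buf through dev_alias without an explicit copy.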

CUDA Streams

Overlap Communication with Computation

[Figure: two streams mapped onto the GPU's copy engine and compute engine. Stream X issues Copy X0, Kernel Xa, Kernel Xb, Copy X1; Stream Y issues Copy Y0, Kernel Y, Copy Y1. Because the copy and compute engines are independent, one stream's copies can overlap the other stream's kernels.]

Streams: explicitly scheduled

Issue order (all of stream X first, then all of stream Y):

    cudaMemcpyAsync(X0…);
    KernelXa<<<…>>>();
    KernelXb<<<…>>>();
    cudaMemcpyAsync(X1…);
    cudaMemcpyAsync(Y0);
    KernelY<<<…>>>();
    cudaMemcpyAsync(Y1);

[Figure: resulting schedule on the copy engine (Copy X0, Copy X1, Copy Y0, Copy Y1) and the compute engine (Kernel Xa, Kernel Xb, Kernel Y): stream Y's work queues up behind stream X's, so copies and kernels barely overlap.]

Each stream proceeds serially, different streams overlap

Naïve programming eliminates potential concurrency
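
A self-contained sketch of the same pattern with concrete CUDA stream calls; the buffers, sizes, and the dummy kernel are placeholders, not from the talk.

    __global__ void process(float *buf, int n) {          // stand-in for real work
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= 2.0f;
    }

    // hX/hY should be page-locked (cudaHostAlloc) for the async copies to overlap.
    void two_streams(float *hX, float *hY, float *dX, float *dY, int n)
    {
        cudaStream_t sX, sY;
        cudaStreamCreate(&sX);
        cudaStreamCreate(&sY);
        size_t bytes = n * sizeof(float);

        // All of stream X is issued first, then all of stream Y -- the naïve
        // order above. Operations within a stream run in order; operations in
        // different streams are allowed to overlap.
        cudaMemcpyAsync(dX, hX, bytes, cudaMemcpyHostToDevice, sX);
        process<<<(n + 255) / 256, 256, 0, sX>>>(dX, n);
        cudaMemcpyAsync(hX, dX, bytes, cudaMemcpyDeviceToHost, sX);

        cudaMemcpyAsync(dY, hY, bytes, cudaMemcpyHostToDevice, sY);
        process<<<(n + 255) / 256, 256, 0, sY>>>(dY, n);
        cudaMemcpyAsync(hY, dY, bytes, cudaMemcpyDeviceToHost, sY);

        cudaStreamSynchronize(sX);
        cudaStreamSynchronize(sY);
        cudaStreamDestroy(sX);
        cudaStreamDestroy(sY);
    }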

Reorder Code → better schedule

Reordered issue sequence (interleave the two streams):

    cudaMemcpyAsync(X0…);
    KernelXa<<<…>>>();
    KernelXb<<<…>>>();
    cudaMemcpyAsync(Y0);
    KernelY<<<…>>>();
    cudaMemcpyAsync(X1…);
    cudaMemcpyAsync(Y1);

[Figure: resulting schedule on the copy engine (Copy X0, Copy Y0, Copy X1, Copy Y1) and the compute engine (Kernel Xa, Kernel Xb, Kernel Y): stream Y's copy now overlaps stream X's kernels, keeping both engines busy.]

  • Our design can't use this anyway!
    • … xform | detect …
    • CUDA streams in xform and detect would live in
      • different processes
      • different address spaces
    • and would require additional IPC coordination

  • Order sensitive
  • Applications must statically determine order
  • Couldn’t a scheduler with a global view do a better job dynamically?
CUDA Asynchrony

[Figure: frames per second for the gestural interface under CUDA, CUDA+async, CUDA+streams, and an OS-supported implementation; higher is better. Workloads: H→D (host-to-device only), D→H (device-to-host only), and H↔D (duplex communication). Test machine: Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, NVIDIA GeForce GT230.]

GPUDirect™
  • “Allows 3rd-party devices to access CUDA memory” (eliminates a data copy)
  • Great! But:
    • it requires per-driver support
      • not just CUDA support!
    • there is no programmer-visible interface
  • The OS can generalize this
The Elephant in the Room

Traditional OS guarantees:

  • Fairness
  • Isolation
  • No user-space runtime can provide these!
    • it can support them…
    • …but it cannot guarantee them
CPU-bound processes hurt GPUs

The CPU scheduler and the GPU scheduler are not integrated!

[Figure: frames per second for the H→D (host-to-device), D→H (device-to-host), and H↔D (duplex) workloads while competing with CPU-bound processes; higher is better. Test machine: Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, NVIDIA GeForce GT230.]

GPU-bound processes hurt CPUs
[Figure: CPU performance while GPU-bound processes run; flatter lines are better. Test machine: Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, NVIDIA GeForce GT230.]

Meaningful “GPU Computing” implies GPUs should be managed like CPUs
  • Process API analogues
  • IPC API analogues
  • Scheduler hint analogues
  • Must integrate with existing interfaces
    • CUDA/DXGI/DirectX
    • DRI/DRM/OpenGL
Outline
  • Motivation
  • Problems with lack of OS abstractions
  • Can CUDA solve these problems?
  • New OS abstractions for GPUs
  • Related Work
  • Conclusion
Proposed OS abstractions
  • ptask
    • Like a process or thread, but it can exist without a user host process
    • OS abstraction…not a full CPU-process
    • List of mappable input/output resources
  • endpoint
    • Globally named kernel object
    • Can be mapped to ptask input/output resources
    • A data source or sink (e.g. buffer in GPU memory)
  • channel
    • Similar to a pipe
    • Connect arbitrary endpoints
    • 1:1, 1:M, M:1, N:M
    • Generalization of GPUDirect™ mechanism
  • Expand the system call interface (a hypothetical sketch follows below):
    • process API analogues
    • IPC API analogues
    • scheduler hints
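
The talk does not spell out a concrete API, so the following is purely a hypothetical sketch of what such a system-call surface could look like; every name, type, and signature here is invented for illustration.

    /* Hypothetical ptask/endpoint/channel syscall surface (illustrative only). */
    typedef int ptask_t;      /* handle to a GPU task managed by the OS   */
    typedef int endpoint_t;   /* handle to a globally named source/sink   */
    typedef int channel_t;    /* handle to a pipe-like connection         */

    /* Process-API analogues */
    ptask_t    ptask_create(const char *gpu_program);        /* cf. fork/exec  */
    int        ptask_set_priority(ptask_t t, int priority);  /* scheduler hint */
    int        ptask_wait(ptask_t t);                        /* cf. waitpid    */

    /* Endpoints: mappable input/output resources of a ptask */
    endpoint_t endpoint_open(const char *global_name);
    int        ptask_bind(ptask_t t, int port, endpoint_t ep);

    /* IPC-API analogues: channels connect endpoints (1:1, 1:M, M:1, N:M) */
    channel_t  channel_connect(endpoint_t src, endpoint_t dst);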
Revised technology stack
  • 1:1 correspondence between programmer-visible and OS-level abstractions
  • Existing APIs can be built on top of the new OS abstractions
Gestural interface revisited

rawimg

g_input

process:

catusb

ptask:

xform

ptask:

detect

cloud

hands

usbsrc

hid_in

process:

hidinput

= process

= ptask

= endpoint

  • Computation expressed as a graph
    • Synthesis [Massalin 89] (streams, pumps)
    • Dryad [Isard 07]
    • StreamIt [Thies 02]
    • Offcodes [Weinsberg 08]
    • others…
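
Continuing the hypothetical API sketch from the "Proposed OS abstractions" slide (all names remain illustrative), assembling this graph might look like:

    /* Hypothetical assembly of the gestural-interface graph. */
    ptask_t xform  = ptask_create("xform_kernels");   /* placeholder program names */
    ptask_t detect = ptask_create("detect_kernels");

    endpoint_t usbsrc = endpoint_open("usbsrc");   /* written by process: catusb  */
    endpoint_t rawimg = endpoint_open("rawimg");   /* input of ptask: xform       */
    endpoint_t cloud  = endpoint_open("cloud");    /* xform output / detect input */
    endpoint_t hands  = endpoint_open("hands");    /* output of ptask: detect     */
    endpoint_t hid_in = endpoint_open("hid_in");   /* read by process: hidinput   */

    ptask_bind(xform,  0, rawimg);
    ptask_bind(xform,  1, cloud);
    ptask_bind(detect, 0, cloud);
    ptask_bind(detect, 1, hands);

    /* Channels wire the graph; new data on a channel triggers the downstream ptask. */
    channel_connect(usbsrc, rawimg);
    channel_connect(hands,  hid_in);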

Gestural interface revisited

[Figure: the same graph annotated with data movement: the channel from catusb into xform is a USB → GPU memory transfer, and the channel from xform to detect stays GPU memory → GPU memory.]

  • Eliminate unnecessary communication…
Gestural interface revisited

New data triggers new computation

[Figure: the same graph; the arrival of new data on a channel triggers the downstream ptask.]

  • Eliminates unnecessary communication
  • Eliminates user/kernel crossings and unnecessary computation
Early Results: potential benefit

[Figure: frames per second for the H→D (host-to-device), D→H (device-to-host), and H↔D (duplex) workloads with the proposed OS abstractions versus the CUDA-based versions; higher is better, with callouts showing up to 10x and 3.9x improvements. Test machine: Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, NVIDIA GeForce GT230.]

Outline
  • Motivation
  • Problems with lack of OS abstractions
  • Can CUDA solve these problems?
  • New OS abstractions for GPUs
  • Related Work
  • Conclusion
Related Work
  • OS support for heterogeneous architectures:
    • Helios [Nightingale 09]
    • Barrelfish [Baumann 09]
    • Offcodes [Weinsberg 08]
  • Graph-based programming models
    • Synthesis [Massalin 89]
    • Monsoon/Id [Arvind]
    • Dryad [Isard 07]
    • StreamIt [Thies 02]
    • DirectShow
    • TCP Offload [Currid 04]
  • GPU Computing
    • CUDA, OpenCL
Conclusions
  • CUDA: the programming interface is right
    • but the OS must get involved
    • Current interfaces waste data movement
    • Current interfaces inhibit modularity/reuse
    • Current interfaces cannot guarantee fairness and isolation
  • OS-level abstractions are required

Questions?