RSVM: a Region-based Software Virtual Memory for GPU

Feng Ji*, Heshan Lin†, Xiaosong Ma*‡

* North Carolina State University

† Virginia Tech

‡ Oak Ridge National Lab

PACT 2013


GPU Presence Today

  • Compute

  • Graphics



GPU Computing Challenge

  • Parallel computing

  • Memory management

[Figure: GPU memory hierarchy with Cores 0-3, L1/shared memory, device memory, and main memory]

Problematic Manual GPU Memory Management

  • Malloc()

    • Resource limit

  • Memcpy()

    • Hardcoded

  • Working set

[Figure: host code and device code for matrix multiplication, C = A x B]

CPU-GPU Memory Management

  • State of the art: host-side memory management

    • GPU compilers [Jablin:PLDI11, Jablin:CGO12, Pai:PACT12]

    • Task scheduling runtimes [Rossbach: SOSP11]

    • GPU ADSM [Gelado: ASPLOS10]

  • Limitations

    • Memory management action before/after GPU kernel

    • No fine-grained control in GPU code

    • Cannot leverage GPU online data access


CPU-GPU memory management

  • Existing solution: host side controlling GPU memory (host-side management)

  • Our solution: enabling memory management in GPU kernel

[Figure: host code, device code, and GPU runtime]
Region-based software virtual memory

  • Match host-side virtual memory

    • Abstract memory domains

    • Automate data movement on demand

    • Swap out device memory

  • Unique challenges of CPU-GPU heterogeneous memory system

    • No existing architecture support for VM

      • Solution: software-based mechanisms

    • GPU processing massively parallel

      • Solution: building on GPU atomic operations

    • GPU and drivers being black boxes

      • Solution: implementing using standard GPGPU APIs

    • CPU-GPU synchronization expensive

      • Solution: asynchronous runtimes, relaxed consistency

    • GPU-initiated communication difficult

      • Solution: GPU callback

  • Region

    • Repeated idea (CRL [Johnson:SOSP95], ADSM [Gelado: ASPLOS10], etc.)

    • Finer granularity



Outline

  • Introduction

  • RSVM: region-based software virtual memory for GPU

    • Region API

    • Design: region table, transparent GPU swap

  • Evaluation results

  • Conclusion

Region-based Software Virtual Memory for GPU (RSVM)

  • User specifies via Region API

    • Defining RSVM managed data unit (create)

    • Annotating data unit access code block (map/unmap)

  • RSVM manages both CPU and GPU memory

    • Moving data on-demand across CPU and GPU

    • Intra-kernel data fetching to GPU

    • Transparent GPU memory swapping to host memory

      • For GPU kernels with excessive memory requirement

RSVM Design

[Figure: RSVM design. Host code and device code each use a Region API, Region Manager, and Region Table; a Callback Server and Callback RPC link the two sides; everything runs on the GPU runtime.]

Region as basic data unit

  • Decide system-managed basic data unit

    • Page? One size fits all?

  • Region

    • User-defined data block

    • Linear or multi-D (CUDA supports 3D memory layout): <width, height, stride>

  • Benefit

    • Abstracts CPU/GPU memory domain

    • Allows optimization of PCIe efficiency by varying region’s definition

      • developers know it better than system

    • No false sharing
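The <width, height, stride> triple above can be read as a 2-D window onto a row-major array. The following is a minimal host-side C++ sketch of that reading, not RSVM's own data structure; the names `Region2D`, `region_offset`, `region_elems`, and `region_span` are hypothetical, and all sizes are counted in elements.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical descriptor for a 2-D region: a width x height window
// inside a row-major array whose rows start `stride` elements apart.
struct Region2D {
    std::size_t width;   // elements per region row
    std::size_t height;  // number of rows in the region
    std::size_t stride;  // elements between consecutive row starts
};

// Offset (in elements, from the region's first element) of (row, col).
inline std::size_t region_offset(const Region2D& r, std::size_t row, std::size_t col) {
    assert(row < r.height && col < r.width);
    return row * r.stride + col;
}

// Number of elements that belong to the region.
inline std::size_t region_elems(const Region2D& r) {
    return r.width * r.height;
}

// Extent of the backing array touched by the region, first to last element.
inline std::size_t region_span(const Region2D& r) {
    return r.height == 0 ? 0 : (r.height - 1) * r.stride + r.width;
}
```

Under this reading, a 16 x 4 region cut from rows of 1024 elements holds 64 elements but spans 3088 elements of the backing array, which is why choosing the region shape matters for how much data a transfer has to touch.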




Region API 1: define region and region collection

rgn_id r_A = rgn_create_cpu (size_A);

rgn_coll_id rc_B = rgn_coll_create (size_B, num_rgns_in_B,
    B_row_length,    // stride
    threadBlock.y,   // width
    B_rows);         // height

rgn_coll_id rc_C = rgn_coll_create (size_C, num_rgns_in_C, C_row_length,



  • Region Collection: a set of regions.

  • rgn_coll_create: iteratively creates all regions in this set.




Region API 2: use region in the host

  • Region Collection Meta: metadata of region collection.

  • An array of rgn_ID’s.

  • Implemented as a region itself.

float *A = rgn_map_cpu (r_A, rgn_op_writeonly, NULL, NULL);

rgn_coll_meta meta_B = rgn_coll_get_meta_cpu (rc_B);

for(i: 0 to blockDim.y) {

float *pB = rgn_map_cpu (meta_B->rgns[i], rgn_op_writeonly, NULL, NULL);






Region API 3: exchange information


mmKernel<<<NB, NT>>> (rc_C, r_A, rc_B);

  • rsvm_sync(): host-side API.

  • Exchange information across CPU-GPU.

  • GPU side runtime knows r_A, rc_B, and rc_C exist.




Region API 4: use region in GPU kernel

int complete, req;

float *dA = rgn_map_gpu (r_A, rgn_op_readonly, &complete, &req);

if (!complete) dA = rgn_wait_map_gpu(r_A, req);


  • The map proceeds asynchronously in the background.

  • The kernel can do useful work before waiting, e.g. map other regions.
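The split between rgn_map_gpu and rgn_wait_map_gpu follows a split-phase pattern: issue the request, overlap other work, and wait only if the data was not already resident. Below is a host-side C++ toy model of that pattern under stated assumptions; `ToyRegion`, `ToyRuntime`, and `sum_first` are hypothetical names, the "transfer" is a plain copy, and none of this is the RSVM device library.

```cpp
#include <cassert>
#include <vector>

// Host-side toy model of a split-phase map; NOT the RSVM device library.
struct ToyRegion {
    std::vector<float> host;    // backing copy in "main memory"
    std::vector<float> device;  // copy in "device memory", empty until fetched
    bool resident = false;
};

struct ToyRuntime {
    std::vector<ToyRegion*> pending;  // outstanding fetch requests, indexed by req

    // Phase 1: returns at once. On a hit, *complete = 1 and the pointer is valid.
    // On a region fault, *complete = 0, *req names the request, result is nullptr.
    float* map(ToyRegion& r, int* complete, int* req) {
        if (r.resident) { *complete = 1; *req = -1; return r.device.data(); }
        pending.push_back(&r);
        *complete = 0;
        *req = static_cast<int>(pending.size()) - 1;
        return nullptr;
    }

    // Phase 2: finish the request (here the "transfer" is a plain copy).
    float* wait_map(ToyRegion& r, int req) {
        assert(req >= 0 && pending[req] == &r);
        r.device = r.host;
        r.resident = true;
        return r.device.data();
    }
};

// Usage pattern from the slide: issue both maps first, then wait where needed.
inline float sum_first(ToyRuntime& rt, ToyRegion& a, ToyRegion& b) {
    int ca, ra, cb, rb;
    float* pa = rt.map(a, &ca, &ra);
    float* pb = rt.map(b, &cb, &rb);  // useful work while a's fetch is in flight
    if (!ca) pa = rt.wait_map(a, ra);
    if (!cb) pb = rt.wait_map(b, rb);
    return pa[0] + pb[0];
}
```

A second map of an already resident region returns immediately with `complete` set, so the wait call is skipped on the fast path.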




Region API 5: use region collection in GPU kernel

rgn_coll_meta meta_B = rgn_coll_get_meta_gpu (rc_B);

float *pdB = rgn_map_gpu (meta_B->rgns[blockIdx.y], rgn_op_readonly, &complete, &req);

if (!complete) pdB = rgn_wait_map_gpu (meta_B->rgns[blockIdx.y], req);





Region States in RSVM

  • Relaxed consistency

    • Host and Device runtimes asynchronously drive state change until rsvm_sync

  • Protocol: adapted from MSI
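For readers unfamiliar with MSI, here is a minimal C++ sketch of the textbook Modified/Shared/Invalid transitions for a region with one CPU copy and one GPU copy. How RSVM adapts the protocol is not spelled out in this transcript, so the function below is an assumption-laden illustration of plain MSI only; `St`, `Op`, and `msi_access` are hypothetical names.

```cpp
#include <cassert>

enum class St { Invalid, Shared, Modified };
enum class Op { Read, Write };

// Textbook MSI step for a region with exactly two copies.
// `local` is the side performing the access, `remote` the other side.
// Returns true if data has to cross PCIe to satisfy the access.
inline bool msi_access(St& local, St& remote, Op op) {
    // A transfer is needed only when the local copy is invalid and the
    // other side actually holds valid data.
    bool transfer = (local == St::Invalid) && (remote != St::Invalid);
    if (op == Op::Read) {
        if (remote == St::Modified) remote = St::Shared;  // write back, then share
        if (local == St::Invalid) local = St::Shared;
    } else {
        remote = St::Invalid;   // a writer invalidates the other copy
        local = St::Modified;
    }
    return transfer;
}
```

With relaxed consistency, the two runtimes would apply such transitions independently and reconcile them at rsvm_sync, rather than on every access.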

Region Table

  • Table replicated on CPU and GPU

    • Relaxed consistency: merge at sync

  • Challenge: keeping operations local while avoiding conflicts

  • Table partitioned

    • 4096 entries / segment, owned by one side

    • New region from unused entry in one’s own segment: local op.

    • Allocating a new segment: synchronous op.

  • Software TLB in the shared memory

  • Consistency

[Figure: region table segments labeled by owner, e.g. "CPU – Owner"]
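The partitioning rule above (4096 entries per segment, each segment owned by one side) can be sketched in a few lines of C++. This is a toy model under assumptions: `ToyRegionTable`, `new_region`, and `owner_of` are hypothetical names, region IDs are assumed to be plain table indices, and the counter `sync_ops` merely stands in for the synchronous segment allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kEntriesPerSegment = 4096;  // figure from the slide

enum class Side { CPU, GPU };

// Toy model: region IDs index a table cut into fixed-size segments,
// and each segment belongs to exactly one side.
struct ToyRegionTable {
    std::vector<Side> owner;          // owner[s] = side owning segment s
    std::vector<std::uint32_t> used;  // used[s]  = entries handed out so far
    int sync_ops = 0;                 // how often a new segment was needed

    // Allocate a region ID for `side`. Taking a free entry in one of the
    // side's own segments is a local operation; otherwise a new segment
    // is claimed, which stands in for the synchronous operation.
    std::uint32_t new_region(Side side) {
        for (std::size_t s = 0; s < owner.size(); ++s) {
            if (owner[s] == side && used[s] < kEntriesPerSegment) {
                std::uint32_t slot = used[s]++;
                return static_cast<std::uint32_t>(s) * kEntriesPerSegment + slot;
            }
        }
        owner.push_back(side);
        used.push_back(1);
        ++sync_ops;
        return static_cast<std::uint32_t>(owner.size() - 1) * kEntriesPerSegment;
    }

    // The owning side follows directly from the ID, with no lookup traffic.
    Side owner_of(std::uint32_t id) const { return owner[id / kEntriesPerSegment]; }
};
```

Because the two sides never hand out entries from the same segment, their replicas of the table can diverge between syncs without producing conflicting IDs.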


GPU region fault: asynchronous map

  • rgn_map_gpu(rgn, op, *complete, *req)

  • rgn_wait_map_gpu(rgn, req)

[Figure: region-fault sequence among Device Code, Callback RPC, Callback Server, and Region Manager. Labels: "Return (req)", "PCIe data to GPU", "Set callback flag on GPU", "Return (complete,"]

  • Callback:

    • Host-side polling [Stuart:Europarw10]

    • Avoid PCIe traffic jam

    • Novel collective callback



GPU Transparent Memory swap

  • Challenge: no specialized GPU thread

    • Solution: embedding swap in map/unmap ops

  • Split operations, triggered by low memory

    • Operation 1: Swap

      • GPU requests CPU to fetch dirty regions

    • Operation 2: Reclaim

      • GPU frees clean buffers

  • Not-Frequently-Used (NFU) counter of each region

    • Updated by GPU in every map op.

    • Sorted by CPU during swap

  • Swap made re-entrant: a concurrent swap requester will

    • Back off when it sees an ongoing swap

    • Prepare its candidate list from the previously completed swap
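The two split operations and the NFU ordering can be sketched as follows. This is a host-side C++ illustration under assumptions, not the RSVM implementation: `ToyRegionInfo`, `pick_candidates`, and `swap_then_reclaim` are hypothetical names, and skipping currently mapped regions is an assumption added to keep the toy model safe.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ToyRegionInfo {
    int id;
    std::uint32_t nfu;  // bumped on every map; a low value means rarely used
    bool dirty;         // device copy is newer than the host copy
    bool mapped;        // currently in use (assumed not evictable)
};

// CPU-side step of the swap: order the unmapped regions from least to most
// used and return the first `want` of them as eviction candidates.
inline std::vector<int> pick_candidates(std::vector<ToyRegionInfo> regions,
                                        std::size_t want) {
    regions.erase(std::remove_if(regions.begin(), regions.end(),
                                 [](const ToyRegionInfo& r) { return r.mapped; }),
                  regions.end());
    std::stable_sort(regions.begin(), regions.end(),
                     [](const ToyRegionInfo& a, const ToyRegionInfo& b) {
                         return a.nfu < b.nfu;
                     });
    std::vector<int> ids;
    for (std::size_t i = 0; i < regions.size() && i < want; ++i)
        ids.push_back(regions[i].id);
    return ids;
}

// Operation 1 (swap): dirty candidates are written back and become clean.
// Operation 2 (reclaim): clean candidate buffers can then be freed.
// Returns the number of buffers reclaimed.
inline int swap_then_reclaim(std::vector<ToyRegionInfo>& regions,
                             const std::vector<int>& candidates) {
    int freed = 0;
    for (ToyRegionInfo& r : regions) {
        if (std::find(candidates.begin(), candidates.end(), r.id) != candidates.end()) {
            r.dirty = false;  // stands in for the write-back over PCIe
            ++freed;          // the buffer is clean, so it can be reclaimed
        }
    }
    return freed;
}
```

Keeping the sort on the CPU side matches the slide's division of labor: the GPU only bumps counters during map, and the ordering work happens where a dedicated thread exists.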

    • Test bed

      • Intel x86 Xeon E5507, 6 GB main mem

      • Nvidia GTX480 (15 SM, 1.5 GB devmem), PCIe 2.0

      • Ubuntu 10.04 LTS, Linux 2.6.32, CUDA 5.0rc

    • Benchmark workload

      • Benchmarks from CUDA SDK, Rodinia [Uva:Rodinia]

      • Case study: MatrixMul, BFS [ORNL:SHOC]

    Benchmarks fit in GPU

    • MatrixMul

      • Computation-intensive, scales well with GPU cores

      • Overhead: device library code compiled into GPU kernel

      • Register file pressure:

        • (# of reg / thread ): 25 -> 60

        • Occupancy (active threads / SM ): 1024 -> 512
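The drop from 1024 to 512 active threads per SM can be reproduced with simple register arithmetic. The sketch below assumes Fermi-class limits for the GTX 480 (32768 registers and at most 1536 resident threads per SM) and a block size of 512 threads; the block size is not stated on the slide and is chosen here only because it reproduces the slide's numbers. The function name `active_threads` is hypothetical, and finer allocation granularity is ignored.

```cpp
#include <cassert>

// Assumed Fermi-class limits for one GTX 480 SM.
const int kRegsPerSM = 32768;       // 32-bit registers per SM
const int kMaxThreadsPerSM = 1536;  // resident-thread limit per SM

// Threads resident on one SM when registers and the thread cap are the
// only constraints. Whole blocks are scheduled, so the result is a
// multiple of block_size.
inline int active_threads(int regs_per_thread, int block_size) {
    int blocks_by_regs = kRegsPerSM / (regs_per_thread * block_size);
    int blocks_by_threads = kMaxThreadsPerSM / block_size;
    int blocks = blocks_by_regs < blocks_by_threads ? blocks_by_regs
                                                    : blocks_by_threads;
    return blocks * block_size;
}
```

With 25 registers per thread, two blocks of 512 fit (2 x 512 x 25 = 25600 registers); with 60 registers per thread only one fits (512 x 60 = 30720), which halves the number of active threads.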

    Discussion: GPU register file for RSVM device library code

    • GPU register assignment to threads

      • Static, equally to each thread

        • Compiler reports max register count requirement for each thread

        • Runtime calculates occupancy, kernel launch success/fail

    • GPU register file not enough for RSVM

    • Not all threads run into RSVM library code path concurrently

    • Possible way of over-subscribing threads for register file usage?

      • Dynamically managing registers among threads?

    Case study: Graph Breadth-first Search

    • Iteration (kernel) by BFS distance

    • Metric: traversed edges/sec (TEPS)

    • Dynamic memory access pattern, input dependent

      • DIMACS challenge [DIMACS]

      • GTgraph [Gtgraph]

    BFS Input

    • m/n: edge factor, the number of edges per vertex

    [Table: BFS input graphs, with n and m in units of 10^6]

    BFS parallelism

    • Graph: nodes + adjacency list (edges)

      • Warp -> each node to visit in current BFS iteration

      • Thread -> each neighbor of this node

    • RSVM’s overhead

      • setup in each kernel (BFS iteration)

      • map the nodes’ region, and then

      • map the adjacency list’s region

    • Overhead decreases with increased edge factor
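The iteration structure described above (one kernel per BFS distance, a warp per frontier node, a thread per neighbor) can be emulated sequentially. The C++ sketch below is an illustration only: `Csr` and `bfs_levels` are hypothetical names, a CSR layout is assumed for the adjacency list, and the loops that a GPU would run in parallel are run one after another.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Graph in CSR form: the neighbors of node v are adj[row[v]] .. adj[row[v+1]-1].
struct Csr {
    std::vector<int> row;
    std::vector<int> adj;
};

// Level-synchronous BFS. Each pass of the outer loop plays the role of one
// kernel launch (one BFS distance). Returns the distance of every node from
// src (-1 if unreachable) and counts the edges that were traversed.
inline std::vector<int> bfs_levels(const Csr& g, int src, long long* edges_traversed) {
    const int n = static_cast<int>(g.row.size()) - 1;
    std::vector<int> dist(n, -1);
    std::vector<int> frontier(1, src);
    dist[src] = 0;
    *edges_traversed = 0;
    for (int level = 1; !frontier.empty(); ++level) {        // one "kernel" per level
        std::vector<int> next;
        for (std::size_t i = 0; i < frontier.size(); ++i) {   // a warp would take node v
            int v = frontier[i];
            for (int e = g.row[v]; e < g.row[v + 1]; ++e) {   // a thread would take edge e
                ++*edges_traversed;
                int w = g.adj[e];
                if (dist[w] < 0) { dist[w] = level; next.push_back(w); }
            }
        }
        frontier.swap(next);
    }
    return dist;
}
```

TEPS is then the traversed-edge count divided by the elapsed time in seconds. A higher edge factor gives each warp more neighbors per mapped adjacency-list region, which is consistent with the overhead shrinking as the edge factor grows.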

    Large Graphs

    UVA performs better than Manual

    • Manual:

      • Partition graphs

      • Manual swapping between GPU buffers and host buffers in each BFS iteration

      • Local data access

      • Depend on used data in each data partition

    • CUDA Unified Virtual Address (UVA):

      • Use host-side 0-copy buffers

      • Access only needed data

      • PCIe bottleneck in traffic jam

    Manual performs better than UVA

    • RSVM improvement due to

      • Caching in GPU memory

      • Batched PCIe data transfer

    • Additional advantage

      • Single code base



    • Virtual memory for CPU-GPU heterogeneous system involving GPU-side runtime is possible

      • GPU as computation engine, rather than co-processor

      • Novel designs: region table, asynchronous region API, CPU assisted GPU swap, software TLB in GPU shared memory

      • Insight: register file pressure

      • Benefit dynamic memory accesses (e.g. Graph)

    Thank you!



    • [Augonnet:ICPADS10] C. Augonnet, J. Clet-Ortega, S. Thibault, and R. Namyst. Data-Aware Task Scheduling on Multi-accelerator Based Platforms. Parallel and Distributed Systems, International Conference on, 0:291–298, 2010.


    • [Diamos:HPDC08] G. F. Diamos and S. Yalamanchili. Harmony: an execution model and runtime for heterogeneous many core systems. In Proceedings of the 17th international symposium on High performance distributed computing, HPDC ’08, pages 197–200, New York, NY, USA, 2008. ACM.

    • [DIMACS] 10th DIMACS Implementation Challenge - Graph Partitioning and Graph Clustering.

    • [Eichenberger:PACT05] A. E. Eichenberger, K. O’Brien, K. O’Brien, P. Wu, T. Chen, P. H. Oden, D. A. Prener, J. C. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, and M. Gschwind. Optimizing Compiler for the CELL Processor. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, PACT ’05, pages 161–172, Washington, DC, USA, 2005. IEEE Computer Society.

    • [Fatahalian:SC06] K. Fatahalian, D. R. Horn, T. J. Knight, L. Leem, M. Houston, J. Y. Park, M. Erez, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan. Sequoia: programming the memory hierarchy. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing, SC ’06, New York, NY, USA, 2006. ACM.

    • [Gelado: ASPLOS10] I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W.-m. W. Hwu. An asymmetric distributed shared memory model for heterogeneous parallel systems. In Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, ASPLOS ’10, pages 347–358, New York, NY, USA, 2010. ACM.

    • [GTgraph] GTgraph: A suite of synthetic random graph generators. madduri/software/GTgraph/index.html.

    • [HSA] The HSA Foundation.



    • [Jablin:CGO12] T. B. Jablin, J. A. Jablin, P. Prabhu, F. Liu, and D. I. August. Dynamically managed data for CPU-GPU architectures. In Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO ’12, pages 165–174, New York, NY, USA, 2012. ACM.

    • [Jablin:PLDI11] T. B. Jablin, P. Prabhu, J. A. Jablin, N. P. Johnson, S. R. Beard, and D. I. August. Automatic CPU-GPU communication management and optimization. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, PLDI ’11, pages 142–151, New York, NY, USA, 2011. ACM.

    • [Johnson:SOSP95] K. L. Johnson, M. F. Kaashoek, and D. A. Wallach. CRL: high performance all-software distributed shared memory. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP ’95), pages 213–226, Copper Mountain Resort, Colorado, December 1995.

    • [Kato:USENIX12] S. Kato, M. McThrow, C. Maltzahn, and S. Brandt. Gdev: First-class GPU resource management in the operating system. In Proceedings of the USENIX Annual Technical Conference (ATC), June 2012.

    • [Linderman:ASPLOS08] M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng. Merge: a programming model for heterogeneous multi-core systems. In Proceedings of the 13th international conference on Architectural support for programming languages and operating systems, ASPLOS XIII, pages 287–296, New York, NY, USA, 2008. ACM.

    • [Luk:MICRO09] C.-K. Luk, S. Hong, and H. Kim. Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages 45–55, New York, NY, USA, 2009. ACM.

    • [Menon:ISCA12] J. Menon, M. De Kruijf, and K. Sankaralingam. iGPU: exception support and speculative execution on GPUs. In Proceedings of the 39th Annual International Symposium on Computer Architecture, ISCA ’12, pages 72–83, Washington, DC, USA, 2012. IEEE Computer Society.

    • [ORNL:SHOC] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter. The Scalable Heterogeneous Computing (SHOC) benchmark suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU ’10, pages 63–74, New York, NY, USA, 2010. ACM.



    • [Pai:PACT12] S. Pai, R. Govindarajan, and M. J. Thazhuthaveetil. Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques, PACT ’12, pages 33–42, New York, NY, USA, 2012. ACM.

    • [Rossbach: SOSP11] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel. PTask: operating system abstractions to manage GPUs as compute devices. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP ’11, pages 233–248, New York, NY, USA, 2011. ACM.

    • [Saha:PLDI09] B. Saha, X. Zhou, H. Chen, Y. Gao, S. Yan, M. Rajagopalan, J. Fang, P. Zhang, R. Ronen, and A. Mendelson. Programming model for a heterogeneous x86 platform. In Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’09, pages 431–440, New York, NY, USA, 2009. ACM.

    • [Silberstein:ASPLOS13] M. Silberstein, B. Ford, I. Keidar, and E. Witchel. GPUfs: Integrating a File System with GPUs. In Proceedings of ASPLOS 2013, 2013.

    • [Uva:Rodinia] Rodinia benchmark. skadron/wiki/rodinia/ index.php/Main Page.

    • [Yan:OSR11] S. Yan, X. Zhou, Y. Gao, H. Chen, G. Wu, S. Luo, and B. Saha. Optimizing a shared virtual memory system for a heterogeneous CPU-accelerator platform. SIGOPS Oper. Syst. Rev., 45:92–100, February 2011.

    • [Stuart:Europarw10] J. Stuart, M. Cox, and J. Owens. GPU-to-CPU Callbacks. In M. Guarracino, F. Vivien, J. Träff, M. Cannataro, M. Danelutto, A. Hast, F. Perla, A. Knüpfer, B. Di Martino, and M. Alexander, editors, Euro-Par 2010 Parallel Processing Workshops.

    Backup slides

    • Start here….

    Related Work

    • Memory hierarchy of accelerator

      • Specialized programming model

        • StarPU [Augonnet:ICPADS10], Harmony [Diamos:HPDC08], Sequoia [Fatahalian:SC06], Merge [Linderman:ASPLOS08], Qilin [Luk:MICRO09]

      • Transparent Software Caching

        • CellBE: [Eichenberger:PACT05]

        • Larrabee: [Saha:PLDI09, Yan:OSR11]

    Related Work (cont’d)

    • Compiler assisted CPU-GPU communication

      • ADSM [Gelado: ASPLOS10]

      • CGCM [Jablin:PLDI11], DyManD [Jablin: CGO12]

      • AMM for X10 [Pai:PACT12]

    • OS support for GPGPU

      • Gdev [Kato:USENIX12]

      • PTask [Rossbach:SOSP11]

      • GPUfs[Silberstein:ASPLOS13]

    Related Work (cont’d)

    • Distributed shared memory

      • ADSM [Gelado:ASPLOS10]

      • CRL [Johnson:SOSP95]

    • GPU virtual memory architecture

      • HSA hUMA [HSA]

      • GPU Exception [Menon:ISCA12]

    Transparent or Manual?

    • Ideal: transparent & good performance

    • In practice: making a compromise

    [Figure: trade-off between manual and transparent management; labels: "easy to control", "hard to reason"]

    Region’s State Protocol

    [Figures: four slides of region state protocol diagrams]

    Software TLB for Region Table on GPU

    TLB in shared memory

    • Shared memory (TLB) consistency with device memory (Region table)

      • Write through

    • Shared memory (TLB) of two SMs

      • A safe cache line: cache hit

        • Define: sharing/modifying

        • Some other warp has cached it

        • Can safely use it

      • Otherwise: cache miss

        • Prepare TLB

        • AtomInc/Dec ref. count

      • Fully-associative

        • warp parallelism

      • Cache line reuse

        • Shared/modified: refcnt 0

        • Number configurable

    Region Table in dev memory

    GPU callback

    • Host-side callback server thread polling a flag [Stuart:Europarw10]

      • GPU code remotely sets flag (in host-side 0-copy memory)

    • Challenge: GPU parallelism

    • Avoid PCIe traffic jam

      • Novel collective callback: non-parameterized requests

        • GPU code detects and sends one signal for all calling threads

      • Host-side callback server batches PCIe data transfers for multiple concurrent callback requests

        • Both incoming parameters and returning values
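The idea of a non-parameterized collective callback, where many callers need the same service but only one signal crosses PCIe, can be modeled with two atomics. The C++ sketch below is a toy under assumptions: `CollectiveCallback`, `request`, and `poll_and_serve` are hypothetical names, an ordinary atomic stands in for the flag in host-side 0-copy memory, and the model is only exercised from a single thread here.

```cpp
#include <atomic>
#include <cassert>

// Toy model of a non-parameterized collective callback: any number of callers
// may request the same service, but only the first one raises the flag that
// the host-side server polls.
struct CollectiveCallback {
    std::atomic<int> waiters{0};       // callers since the last service round
    std::atomic<int> flag{0};          // stands in for the flag in host 0-copy memory
    std::atomic<int> signals_sent{0};  // flag writes (each would cross PCIe)

    // Called by every "GPU thread" that needs the service.
    void request() {
        if (waiters.fetch_add(1) == 0) {  // the first arrival signals for everyone
            flag.store(1);
            signals_sent.fetch_add(1);
        }
    }

    // Called by the host-side polling loop. Returns how many callers this
    // service round satisfied (0 if the flag was not set).
    int poll_and_serve() {
        if (flag.load() == 0) return 0;
        flag.store(0);
        return waiters.exchange(0);
    }
};
```

Thirty-two requests in a row produce a single flag write, and one server round then accounts for all 32 callers, which is the traffic reduction the slide attributes to collective callbacks.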

    GPU callbacks in RSVM

    • Handling region fault

      • non-collective, asynchronous, and parameterized callback

    • Getting new region segment

      • collective, synchronous, and non-parameterized callback

    • Starting swap

      • collective, asynchronous, re-entrant, and non-parameterized callback

    Case 1: Matrix Multiplication

    • Matrix A: single region

    • Matrix B: 2-d regions

    • 1280 MB GPU devmem managed by RSVM

    • RSVM: ~70% efficiency

    • Swap: <10% overhead

    Small Graph BFS

    • TEPS

      • Traversed edges/ sec

    • Iteration (kernel) by BFS distance

    • Parallelism

      • Warp -> each visiting node

      • Thread -> each neighbor of the visiting node

    • Overhead

      • RSVM mapping the region of each visited node’s adjacency list

      • RSVM setup in each kernel

    Future Work

    • RSVM improvement

      • Region table merging optimization

      • CPU callback server optimization

      • Multiple GPU support

      • Multiple process support

    • Compiler assisted region identification

      • Remove manual region creation/deletion

    • Leverage vendor support for GPU faulting

      • Remove manual map/unmap


    GPU Transparent Memory swap

    [Figure: swap sequence among Callback RPC, Callback Server, and Region Manager. Labels: "Available devmem resource low", "Set callback flag on GPU", "PCIe data from GPU buffer to host mem", "Rgn states to shared, Form a candidate list", "Return (swapped", "Available devmem resource keep decreasing. Reclaim candidate rgn’s"]