slide1 l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Mesh Layouts for Block-Based Caches PowerPoint Presentation
Download Presentation
Mesh Layouts for Block-Based Caches

Loading in 2 Seconds...

play fullscreen
1 / 66

Mesh Layouts for Block-Based Caches - PowerPoint PPT Presentation


  • 124 Views
  • Uploaded on

Mesh Layouts for Block-Based Caches. Sung-Eui Yoon Peter Lindstrom Lawrence Livermore National Laboratory. Goal. Provide cache-coherent layouts of meshes and graphs Derive metrics that measure cache-coherence of layouts Generality Simplicity Efficiency Accuracy. Cache-Coherent Metrics.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Mesh Layouts for Block-Based Caches' - nay


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide1

Mesh Layouts

for Block-Based Caches

Sung-Eui Yoon

Peter Lindstrom

Lawrence Livermore National Laboratory

slide2
Goal
  • Provide cache-coherent layouts of meshes and graphs
  • Derive metrics that measure cache-coherence of layouts
    • Generality
    • Simplicity
    • Efficiency
    • Accuracy
cache coherent metrics
Cache-Coherent Metrics
  • Measure the expected number of cache misses of a layout given block-based caches
    • Should correlate well with the observed number of cache misses
  • Cache-aware metrics
    • Measure cache-coherence given known cache parameters (e.g., block size)
  • Cache-oblivious metrics
    • Consider all possible cache parameters
motivation
Motivation
  • Lower growth rate of data access speed

130X

Accumulated

growth rate

during 1993 – 2004

(log scale)

46X

20X

1.5X

during 99 - 04

Courtesy: Anselmo Lastra, http://www.hcibook.com/e3/online/moores-law/

memory hierarchies and block based caches
Memory Hierarchies and Block-Based Caches

Fast memory

or cache

Slow memory

Block

transfer

Disk

CPU

10-2 sec

10-8 sec

10-7 sec

Access time:

main contributions
Main Contributions
  • Propose novel and practical cache-aware and cache-oblivious metrics
    • Derive metrics given block-based caches
    • Propose efficient cache-coherent layout constructions
    • Apply to different applications
related work
Related Work
  • Computation reordering
  • Data layout optimization
computational reordering
Computational Reordering
  • Cache-aware [Coleman and McKinley 95, Vitter 01, Sen et al. 02]
  • Cache-oblivious [Frigo et al. 99, Arge et al. 04]

Focus on specific problems such as

sorting and linear algebra computations

data layout optimization
Data Layout Optimization
  • Graph and matrix layout [Diaz et al. 02]
    • Minimum linear arrangement (MLA), bandwidth, and wavefront, etc.
  • Space-filling curves
    • [Sagan 94, Pascucci and Frank 01, Lindstrom and Pascucci 01, Gopi and Eppstein 04]
  • Rendering and processing sequences
    • [Deering 95, Hoppe 99, Bogomjakov and Gotsman 02, Isenburg and Lindstrom 05]
  • Cache-oblivious mesh layout
    • [Yoon et al. 05]
outline
Outline
  • Computation models
  • Cache-aware and cache-oblivious metrics
  • Results
outline11
Outline
  • Computation models
  • Cache-aware and cache-oblivious metrics
  • Results
general framework of layout computation
General Framework of Layout Computation

na

Input directed

graph, G (N, A)

nb

nd

nc

Cache-coherent metric

Layout algorithm, φ

nd

nb

na

nc

……..

1D layout, φ(N)

two level i o model aggarwal and vitter 88

nd

na

nc

Two-Level I/O Model [Aggarwal and Vitter 88]

na

Input directed

graph

nb

nd

nc

M cache blocks, whose size is B

Layout algorithm

nb

Cache

nd

nb

na

nc

……..

1D layout with block size = 3

graph representation
Graph Representation
  • Directed graph, G = (N, A)
    • Represent access patterns between nodes
  • Nodes, N
    • Data element
    • (e.g., mesh vertex or mesh triangle)
  • Directed arcs, A
    • Connects two nodes if they are accessed sequentially

na

nb

nd

nc

weights of nodes and arcs
Weights of Nodes and Arcs
  • Indicate probabilities that each element will be accessed
  • Computed in an equilibrium status during infinite random walks
    • Assume that applications infinitely access the data according to the input graph
    • Correspond to eigen-values of the probability transition matrix
cache coherence of a layout given block based caches
Cache-Coherence of a Layout given Block-Based Caches
  • Expected number of cache misses of a layout
    • Probability accessing a node from another node by traversing an arc
    • Conditional probability that we will have a cache miss given the above access pattern

na

nb

nd

nc

specialization to meshes
Specialization to Meshes
  • Expected number of cache misses of a layout
    • Probability accessing a node from another node by traversing an arc
    • Conditional probability that we will have a cache miss given the above access pattern

= constant

na

na

1. Two opposite directed

arcs

2. Uniform distribution to

access adjacent nodes

given a node

nb

nd

nb

nd

nc

nc

Implicitly

derived graph

An input mesh

outline18
Outline
  • Computation models
  • Cache-aware and cache-oblivious metrics
  • Results
four different cases
Four Different Cases

Cache-aware case

single cache block,

M=1

Cache-oblivious case

single cache block,

M=1

Cache-aware case

multiple cache blocks,

M>1

Cache-oblivious case

multiple cache blocks,

M>1

cache aware single cache block m 1
Cache-Aware: Single Cache Block, M=1

na

Input directed

graph

nb

nd

nc

Straddling arcs

Cache, whose block size is B

nd

nb

na

nc

……..

1D layout with block size = 3

cache aware multiple cache blocks m 1
Cache-Aware: Multiple Cache Blocks, M>1

na

Input directed

graph

nb

nd

nc

Straddling arcs

Cache

nd

nb

na

nc

……..

1D layout with block boundary

final cache aware metric
Final Cache-Aware Metric
  • Counts the number of straddling arcs of the layout given a block size B

: block index containing the node, i

where

: Unit step function, 1 if x > 0

0 otherwise.

high accuracy of cache aware metric
High Accuracy of Cache-Aware Metric

Tested block size = 4KB

Z-curve on a uniform grid

Tested layouts:

Z-curve, Hilbert curve, H-order,

minimum linear arrangement layout,

βΩ-layout, geometric CO layout,

(bi or uni) row-by-row,

(bi or uni) diagnoal layouts

four different cases24
Four Different Cases

Cache-aware case

single cache block,

M=1

Cache-oblivious case

single cache block,

M=1

Cache-aware case

multiple cache blocks,

M>1

Cache-oblivious case

multiple cache blocks,

M>1

cache oblivious single cache block m 1
Cache-Oblivious: Single Cache Block, M=1

Does not assume a particular block size:

Then, what are good representatives

for block sizes?

Cache

two possible block size progressions
Two Possible Block Size Progressions
  • Arithmetic progression
    • 1, 2, 3, 4, …
  • Geometric progression
    • 20 , 21 , 22 ,23 , …
    • Well reflects current caching architectures
    • E.g., L1: 32B, L2: 64B, Page: 4KB, etc.
probability that an arc is a straddling arc
Probability that an Arc is a Straddling Arc

Computed as a probability

as a function of arc length,l

Is an arc straddling

given a block size?

Arc length, l, = 2

nd

nb

na

nc

……..

two cache oblivious metrics
Two Cache-Oblivious Metrics
  • Arithmetic cache-oblivious metric,
  • Geometric cache-oblivious metric,

MLA metric,

Arithmetic mean

Arc length of arc (i, j)

Geometric mean

of arc lengths

validation for cache oblivious co metrics
Validation for Cache-Oblivious (CO) Metrics

73% of tested

power-of-two block sizes

97% of tested

block sizes

  • Geometric cache-oblivious metric
    • Practical and useful

The number of

cache misses

when M = 1

(log scale)

Geometric

CO layout

Arithmetic

CO layout

correlations between metrics and observed number of cache misses
Correlations between Metrics and Observed Number of Cache Misses

Tested block size = 4KB

Tested with 10 different layouts on a uniform grid

efficient layout computation for our metrics
Efficient Layout Computation for Our Metrics
  • Cache-aware layouts
    • Optimized with cache-aware metric given a block size B
    • Computed from the graph partitioning
  • Geometric cache-oblivious metric
    • Very efficient
    • Can be used in different layout methods
layout computation with geometric cache oblivious metric

Layout Computation with Geometric Cache-Oblivious Metric
  • Multi-level construction method
    • Partition an input mesh into k different sets
    • Layout partitions based on our metric
  • Generalized layout method
  • for unstructured meshes

1. Partition

2. Lay out

outline33
Outline
  • Computation models
  • Cache-aware and cache-oblivious metrics
  • Results
applications
Applications
  • Isosurface extraction
  • View-dependent rendering
iso surface extraction
Iso-Surface Extraction
  • Uses contour tree [van Kreveld et al. 97]
    • Runtime is dominated by the traversal of iso-surface
  • Layout graph
    • Use an input tetrahedral mesh

Spx model

(140K vertices)

high correlation with number of cache misses
High Correlation with Number of Cache Misses

Tested block size = 4KB

Tested with 8 different layouts:

our geometric CO, our cache-aware, breadth-first (and depth-first) layouts, spectral [Juvan and Mohar 92], cache-oblivious mesh [Yoon et al. 05], Z-curve [Sagan 94], X-axis sorted layouts

high correlation with runtime performance
High Correlation with Runtime Performance

Memory access time

is major bottleneck

Disk I/O time is

major bottleneck

comparison with other layouts
Comparison with Other Layouts

The first iso-surface extraction time (sec)

8% - 77%

improvement and

very close to

the cache-aware

performance

view dependent rendering
View-Dependent Rendering
  • Layout vertices and triangles of progressive meshes
    • Used in an efficient VDR system [Yoon et al. 04]
    • Reduce misses in GPU vertex cache
cache miss ratio on bunny model
Cache Miss Ratio on Bunny Model

Universal rendering seq.

[Bogomjakov and Gotsman 02]

GPU vertex

cache miss

ratio

Hoppe [Hoppe 99]

Theoretical

lower bound

[Bar-Yehuda and

Gotsman 96]

Geometric

CO layout

Vertex cache size

cache miss ratio on power plant model
Cache Miss Ratio on Power Plant Model

GPU vertex

cache miss

ratio

Z-curve

COML

[Yoon et al. 05]

Hoppe’s rendering

seq. [Hoppe 99]

Theoretical

lower bound

[Bar-Yehuda and

Gotsman 96]

Geometric

CO layout

Vertex cache size

conclusion
Conclusion
  • Novel cache-aware and cache-oblivious metrics to evaluate layouts
    • Derived metrics based on two-level I/O model
    • Improved the performance of applications without modifying codes

OpenCCL, open source library

http:://gamma.cs.unc.edu/COL/OpenCCL

ongoing and future work
Ongoing and Future Work
  • Derive a lower bound on our geometric cache-oblivious metric
  • Employ mesh compression to further reduce disk I/O accesses
  • Investigate efficient layout method for deforming models
  • Apply to non-graphics applications
    • e.g., shortest path or other graph computations
cache efficient layouts of bounding volume hierarchies
Cache-Efficient Layouts of Bounding Volume Hierarchies
  • Yoon and Manocha, Eurographics 06

Collision detection

Ray tracing

acknowledgements
Acknowledgements
  • Ajith Mascarenhas
  • Martin Isenburg
  • Dinesh Manocha
  • Fabio Bernardon, Joao Comba, and Claudio Silva
    • For their unstructured tetrahedra rendering program
  • Members of data analysis group in LLNL
  • Anonymous reviewers
ucrl pres 225448
UCRL-PRES-225448

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48.

cache coherence of layouts
Cache-Coherence of Layouts
  • Well known heuristics for cache-coherent layouts
    • Space-filling curves [Sagan 94]
  • How can we compute better layouts?
    • Requires metrics measuring cache-coherence of layouts
main results
Main Results
  • Define cache-coherence of layout as:
    • Expected number of cache misses during random walks of a graph given block-based caches
  • Then, the exp. number of cache misses
    • Number of straddling arcs in a cache-aware cache
    • Geometric mean of arc lengths in a cache-oblivious case
data layout optimization50
Data Layout Optimization
  • Rendering sequences
    • Triangle strips
    • [Deering 95, Hoppe 99, Bogomjakov and Gotsman 02]
  • Processing sequences
    • [Isenburg and Gumhold 03, Isenburg and Lindstrom 05]

Assume that access pattern

globally follows the layout order

data layout optimization51
Data Layout Optimization
  • Space-filling curves
    • [Sagan 94, Velho and Gomes 91, Pascucci and Frank 01, Lindstrom and Pascucci 01, Gopi and Eppstein 04]

Assume geometric regularity

data layout optimization52
Data Layout Optimization
  • Graph and matrix layout
    • A survey [Diaz et al. 02]
    • Minimum linear arrangement (MLA)
    • Bandwidth
    • Profile
    • Wavefront, etc.

Does not necessarily produce

good layouts for block-based caches

cache aware metric
Cache-Aware Metric
  • Expected number of cache misses given a block size B
correlation between cache aware metric and observed number of cache misses
Correlation between Cache-Aware Metric and Observed Number of Cache Misses

R2 = 0.97

R2 = 0.97

M = 5 M = 25

Observed number of cache misses

Cache-aware

metric

Cache block size = 4KB

evaluating existing layouts
Evaluating Existing Layouts

An existing layout, φ

  • No known tight bound
  • Compare against the best layout we can construct
    • Employ an efficient sampling method

Is it close to

the optimal layout?

Use it

Build a new one

implementation
Implementation
  • Modify our open source layout computation codes, OpenCCL
    • Based on METIS graph partitioning library [Karypis and Kumar 98]
  • Processing speed
    • 15k triangles / second
comparison with cache oblivious mesh layouts coml yoon et al 05
Comparison with Cache-Oblivious Mesh Layouts (COML) [Yoon et al. 05]
  • Two major improvements
    • Accuracy
    • Usability
pros and cons
Pros and Cons
  • Limitations
    • Not directly applicable to dynamic models
  • Pros
    • Generality
    • Can have benefits without modifying underlying codes
specialization to meshes59
Specialization to Meshes

na

na

  • Assume an equally likelihood to access adjacent nodes given a node

nb

nd

nb

nd

nc

nc

Implicitly derived graph

An input mesh

two possible block size progressions60
Two Possible Block Size Progressions

Uniform

distribution

  • Arithmetic progression
    • 1, 2, 3, 4, …
  • Geometric progression
    • 20 , 21 , 22 ,23 , …
    • Well reflects current caching architectures
    • E.g., L1: 32B, L2: 64B, Page: 4KB, etc.

Pr(B)

B

Geometric

distribution

Pr(B)

B

cache miss ratio with different mesh resolutions of bunny model
Cache Miss Ratio with Different Mesh Resolutions of Bunny Model

GPU vertex

cache miss

ratio

(case size: 32)

Hoppe’s rendering sequence

[Hoppe 99]

COLg

Mesh resolution

correlation between cache aware metric and number of cache misses
Correlation between Cache-Aware Metric and Number of Cache Misses

R2 = 0.97

R2 = 0.97

M = 5 M = 25

Observed number of cache misses

Cache block size = 4KB

correlations with observed number of cache misses
Correlations with Observed Number of Cache Misses

R2 = 0.98

R2 = 0.81

Cache block size = 4KB

,

Cache misses

One blk : Mult blks

Geometric

CO metric

Arithmetic

CO metric

comparison with other layouts64
Comparison with Other Layouts

Correlation:

0.94

0.98

0.94

0.99

X-axis

BFL

Spectral layout

[Juvan and Mohar 92]

Up to 2X speedup

9% lower than cache-aware layout

DFL

COML

[Yoon et al. 05]

Z-curve

COLg

Aware

comparison with other layouts65
Comparison with Other Layouts
  • Compute eight different layouts
    • Our geometric cache-oblivious layout
    • Our cache-aware layout
    • Breadth-first layout
    • Depth-first layout
    • Spectral layout [Juvan and Mohar 92]
    • Cache-oblivious mesh layout [Yoon et al. 05]
    • Z-curve [Sagan 94]
    • X-axis sorted layout