## Mesh Layouts for Block-Based Caches



Sung-Eui Yoon

Peter Lindstrom

Lawrence Livermore National Laboratory

Goal

- Provide cache-coherent layouts of meshes and graphs
- Derive metrics that measure cache-coherence of layouts
- Generality
- Simplicity
- Efficiency
- Accuracy

Cache-Coherent Metrics

- Measure the expected number of cache misses of a layout given block-based caches
- Should correlate well with the observed number of cache misses
- Cache-aware metrics
- Measure cache-coherence given known cache parameters (e.g., block size)
- Cache-oblivious metrics
- Consider all possible cache parameters

Motivation

- Lower growth rate of data access speed

[Figure: accumulated growth rate during 1993 – 2004 (log scale). Values shown: 130X, 46X, 20X, and 1.5X (during 99 - 04).]

Courtesy: Anselmo Lastra, http://www.hcibook.com/e3/online/moores-law/

Memory Hierarchies and Block-Based Caches

[Figure: memory hierarchy of CPU, fast memory or cache, slow memory, and disk, connected by block transfers. Access times shown: 10^-8 sec, 10^-7 sec, and 10^-2 sec.]

Main Contributions

- Propose novel and practical cache-aware and cache-oblivious metrics
- Derive metrics given block-based caches
- Propose efficient cache-coherent layout constructions
- Apply to different applications

Related Work

- Computation reordering
- Data layout optimization

Computational Reordering

- Cache-aware [Coleman and McKinley 95, Vitter 01, Sen et al. 02]
- Cache-oblivious [Frigo et al. 99, Arge et al. 04]

Focus on specific problems such as sorting and linear algebra computations

Data Layout Optimization

- Graph and matrix layout [Diaz et al. 02]
- Minimum linear arrangement (MLA), bandwidth, and wavefront, etc.
- Space-filling curves
- [Sagan 94, Pascucci and Frank 01, Lindstrom and Pascucci 01, Gopi and Eppstein 04]
- Rendering and processing sequences
- [Deering 95, Hoppe 99, Bogomjakov and Gotsman 02, Isenburg and Lindstrom 05]
- Cache-oblivious mesh layout
- [Yoon et al. 05]

Outline

- Computation models
- Cache-aware and cache-oblivious metrics
- Results


General Framework of Layout Computation

[Figure: an input directed graph, G = (N, A), with nodes n_a, n_b, n_c, and n_d, and a cache-coherent metric feed a layout algorithm, φ, which outputs a 1D layout, φ(N).]

Two-Level I/O Model [Aggarwal and Vitter 88]

Cache: M cache blocks, each of size B

[Figure: an input directed graph (nodes n_a, n_b, n_c, n_d), the layout algorithm, the cache, and the resulting 1D layout with block size = 3.]
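
The two-level model above can be exercised with a small simulation. The following is a minimal sketch, assuming an LRU replacement policy (the slides do not name a policy) and a hypothetical access trace given as 1D layout positions; `count_cache_misses` is an illustrative name, not part of any library.

```python
from collections import OrderedDict

def count_cache_misses(access_positions, block_size, num_blocks):
    """Count misses for a sequence of layout positions.

    Assumption: the cache holds `num_blocks` blocks of `block_size`
    consecutive layout positions and evicts the least recently used block.
    """
    cache = OrderedDict()  # block index -> None, least recently used first
    misses = 0
    for pos in access_positions:
        block = pos // block_size
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1                    # miss: load the block
            cache[block] = None
            if len(cache) > num_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return misses

# Hypothetical trace of layout positions (illustrative values only).
trace = [0, 1, 2, 3, 4, 0, 7, 8, 1]
print(count_cache_misses(trace, block_size=3, num_blocks=1))  # prints 5
print(count_cache_misses(trace, block_size=3, num_blocks=2))  # prints 3
```

With a single block (M = 1), every access that changes block is a miss; with more blocks, revisiting a recently used block becomes a hit.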

Graph Representation

- Directed graph, G = (N, A)
- Represent access patterns between nodes
- Nodes, N
- Data element
- (e.g., mesh vertex or mesh triangle)
- Directed arcs, A
- Connects two nodes if they are accessed sequentially


Weights of Nodes and Arcs

- Indicate the probability that each element will be accessed
- Computed at the equilibrium of an infinite random walk
- Assume that applications access the data indefinitely according to the input graph
- Correspond to the dominant eigenvector of the probability transition matrix
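
The equilibrium weights can be approximated by iterating the random walk until the node probabilities stop changing. This is a minimal sketch, assuming a strongly connected graph and uniform transition probabilities over outgoing arcs; it uses a "lazy" walk (stay in place with probability 1/2), which has the same equilibrium but also converges on periodic graphs. The function names are hypothetical.

```python
def stationary_distribution(arcs, num_nodes, iterations=2000):
    """Approximate equilibrium node probabilities of a random walk.

    Assumptions: every node has at least one outgoing arc, the graph is
    strongly connected, and each outgoing arc is taken with equal probability.
    """
    out_neighbors = [[] for _ in range(num_nodes)]
    for i, j in arcs:
        out_neighbors[i].append(j)
    prob = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        nxt = [0.5 * p for p in prob]  # lazy step: stay with probability 1/2
        for i, targets in enumerate(out_neighbors):
            share = 0.5 * prob[i] / len(targets)
            for j in targets:
                nxt[j] += share
        prob = nxt
    return prob

def arc_probabilities(arcs, num_nodes):
    """Probability of traversing each arc at equilibrium."""
    node_prob = stationary_distribution(arcs, num_nodes)
    out_degree = [0] * num_nodes
    for i, _ in arcs:
        out_degree[i] += 1
    return {(i, j): node_prob[i] / out_degree[i] for i, j in arcs}

# Illustrative graph: a triangle (0, 1, 2) with a pendant node 3,
# each undirected edge given as two opposite directed arcs.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
arcs = edges + [(j, i) for i, j in edges]
print(stationary_distribution(arcs, 4))
print(arc_probabilities(arcs, 4))
```

In this example the node probabilities are proportional to node degree, and every arc ends up with the same traversal probability.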

Cache-Coherence of a Layout given Block-Based Caches

- Expected number of cache misses of a layout
- Probability of accessing a node from another node by traversing an arc
- Conditional probability of a cache miss given that access


Specialization to Meshes

- Expected number of cache misses of a layout
- Probability of accessing a node from another node by traversing an arc: constant for meshes
- Conditional probability of a cache miss given that access

The graph is derived implicitly from an input mesh:

1. Two opposite directed arcs for each mesh edge
2. Uniform distribution to access adjacent nodes given a node

[Figure: an input mesh and the implicitly derived graph over nodes n_a, n_b, n_c, and n_d.]
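
Deriving the graph from a mesh can be sketched as follows. This assumes a triangle mesh given as vertex-index triples and treats mesh vertices as the graph nodes; `mesh_to_arcs` is a hypothetical helper, not part of OpenCCL.

```python
def mesh_to_arcs(triangles):
    """Return the directed arcs implied by a triangle mesh.

    Assumption: nodes are mesh vertices, and each mesh edge contributes
    two opposite directed arcs. Edges shared by triangles are counted once.
    """
    arcs = set()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            arcs.add((u, v))
            arcs.add((v, u))
    return sorted(arcs)

# Two triangles sharing the edge between vertices 1 and 2.
triangles = [(0, 1, 2), (1, 3, 2)]
print(mesh_to_arcs(triangles))
```

The two triangles have five distinct edges, so the derived graph has ten directed arcs.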


Four Different Cases

- Cache-aware case, single cache block, M=1
- Cache-oblivious case, single cache block, M=1
- Cache-aware case, multiple cache blocks, M>1
- Cache-oblivious case, multiple cache blocks, M>1

Cache-Aware: Single Cache Block, M=1

[Figure: an input directed graph (nodes n_a, n_b, n_c, n_d) and its 1D layout with block size = 3, with the straddling arcs marked. The cache holds a single block of size B.]

Cache-Aware: Multiple Cache Blocks, M>1

[Figure: an input directed graph (nodes n_a, n_b, n_c, n_d), the cache, and the 1D layout with block boundaries, with the straddling arcs marked.]

Final Cache-Aware Metric

- Counts the number of straddling arcs of the layout given a block size B

where the metric uses:

- the block index containing node i
- the unit step function, which is 1 if x > 0 and 0 otherwise
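
Counting straddling arcs can be sketched in a few lines. This is a minimal sketch, assuming the block index of a node is its 1D layout position divided by B and rounded down; the arcs and the layout below are illustrative, and `count_straddling_arcs` is a hypothetical name.

```python
def count_straddling_arcs(arcs, position, block_size):
    """Count arcs whose two endpoints fall in different blocks.

    Assumption: block index = layout position // block_size. Comparing the
    two block indices plays the role of the unit step function applied to
    the absolute difference of block indices.
    """
    return sum(
        1
        for i, j in arcs
        if position[i] // block_size != position[j] // block_size
    )

# Illustrative layout order d, b, a, c and a four-arc cycle.
position = {"d": 0, "b": 1, "a": 2, "c": 3}
arcs = [("a", "b"), ("b", "d"), ("d", "c"), ("c", "a")]
print(count_straddling_arcs(arcs, position, block_size=3))  # prints 2
```

With block size 3, nodes d, b, and a share a block and c sits in the next one, so only the two arcs touching c straddle a block boundary.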

High Accuracy of Cache-Aware Metric

Tested block size = 4KB

Z-curve on a uniform grid

Tested layouts: Z-curve, Hilbert curve, H-order, minimum linear arrangement layout, βΩ-layout, geometric CO layout, (bi or uni) row-by-row layouts, and (bi or uni) diagonal layouts
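
For reference, a Z-curve layout of a uniform grid can be produced by interleaving the bits of the cell coordinates. This is a minimal sketch of the standard Morton ordering, not the code used for the experiments; placing x in the even bits is an arbitrary convention chosen here.

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of x and y (x in even bits, y in odd bits)."""
    index = 0
    for b in range(bits):
        index |= ((x >> b) & 1) << (2 * b)
        index |= ((y >> b) & 1) << (2 * b + 1)
    return index

def z_curve_layout(width, height):
    """Return the grid cells (x, y) sorted in Z-curve order."""
    cells = [(x, y) for y in range(height) for x in range(width)]
    return sorted(cells, key=lambda cell: morton_index(cell[0], cell[1]))

print(z_curve_layout(2, 2))  # prints [(0, 0), (1, 0), (0, 1), (1, 1)]
```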


Cache-Oblivious: Single Cache Block, M=1

Does not assume a particular block size. Then, what are good representatives for block sizes?

Two Possible Block Size Progressions

- Arithmetic progression
- 1, 2, 3, 4, …
- Geometric progression
- 2^0, 2^1, 2^2, 2^3, …
- Reflects current caching architectures well
- E.g., L1: 32B, L2: 64B, Page: 4KB, etc.

Probability that an Arc is a Straddling Arc

Is an arc straddling given a block size? Computed as a probability that is a function of the arc length, l.

[Figure: a 1D layout of nodes n_d, n_b, n_a, and n_c with an arc of length l = 2.]
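
The dependence on arc length can be checked by brute force. This is a minimal sketch, assuming the arc's starting position is equally likely to be at any offset within a block; under that assumption the probability works out to min(l / B, 1). The function name is hypothetical.

```python
from fractions import Fraction

def straddle_probability(arc_length, block_size):
    """Probability that an arc of the given length crosses a block boundary.

    Assumption: the arc starts at a uniformly random offset within a block.
    """
    straddling = 0
    for offset in range(block_size):
        start, end = offset, offset + arc_length
        if start // block_size != end // block_size:
            straddling += 1
    return Fraction(straddling, block_size)

# An arc of length 2 under a few block sizes.
for block_size in (1, 2, 3, 4, 8):
    print(block_size, straddle_probability(2, block_size))
```

Short arcs rarely straddle large blocks, which is why metrics based on arc length can stand in for a specific block size.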

Two Cache-Oblivious Metrics

- Arithmetic cache-oblivious metric: the arithmetic mean of arc lengths (the MLA metric)
- Geometric cache-oblivious metric: the geometric mean of arc lengths

Both are defined over the arc length of each arc (i, j) in the 1D layout.
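
Both metrics are easy to compute for a given layout. This is a minimal sketch, assuming the arc length of arc (i, j) is the absolute difference of the two layout positions and that all positions are distinct (so every length is at least 1); the function names and the example graph are hypothetical.

```python
import math

def arc_lengths(arcs, position):
    """Arc length = distance between the endpoints in the 1D layout."""
    return [abs(position[i] - position[j]) for i, j in arcs]

def arithmetic_co_metric(arcs, position):
    """Arithmetic mean of arc lengths."""
    lengths = arc_lengths(arcs, position)
    return sum(lengths) / len(lengths)

def geometric_co_metric(arcs, position):
    """Geometric mean of arc lengths, computed via logarithms."""
    lengths = arc_lengths(arcs, position)
    return math.exp(sum(math.log(l) for l in lengths) / len(lengths))

# A four-node cycle under two layouts (illustrative).
arcs = [(0, 1), (1, 2), (2, 3), (3, 0)]
layout_a = {0: 0, 1: 1, 2: 2, 3: 3}   # arc lengths 1, 1, 1, 3
layout_b = {0: 0, 1: 1, 3: 2, 2: 3}   # arc lengths 1, 2, 1, 2
for name, layout in (("A", layout_a), ("B", layout_b)):
    print(name, arithmetic_co_metric(arcs, layout), geometric_co_metric(arcs, layout))
```

The two layouts have the same arithmetic mean (1.5), but the geometric mean is lower for layout A, which keeps most arcs at length 1 at the cost of one longer arc.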

Validation for Cache-Oblivious (CO) Metrics

- Geometric cache-oblivious metric
- Practical and useful

[Figure: the number of cache misses when M = 1 (log scale) for the geometric CO layout and the arithmetic CO layout. Annotations: 73% of tested power-of-two block sizes; 97% of tested block sizes.]

Correlations between Metrics and Observed Number of Cache Misses

Tested block size = 4KB

Tested with 10 different layouts on a uniform grid

Efficient Layout Computation for Our Metrics

- Cache-aware layouts
- Optimized with cache-aware metric given a block size B
- Computed from the graph partitioning
- Geometric cache-oblivious metric
- Very efficient
- Can be used in different layout methods

Layout Computation with Geometric Cache-Oblivious Metric

- Multi-level construction method
- Partition an input mesh into k different sets
- Lay out the partitions based on our metric
- Generalized layout method for unstructured meshes

[Figure: two steps, 1. Partition and 2. Lay out.]
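
The "lay out" step can be illustrated at a single level. This is a simplified sketch, not the authors' multi-level algorithm: it assumes the partitions are already given (for example by a graph partitioner), keeps the node order inside each partition fixed, and tries every ordering of the k partitions, keeping the one with the smallest geometric mean of arc lengths. All names are hypothetical.

```python
import itertools
import math

def geometric_mean_arc_length(arcs, position):
    """Geometric mean of arc lengths for distinct layout positions."""
    lengths = [abs(position[i] - position[j]) for i, j in arcs]
    return math.exp(sum(math.log(l) for l in lengths) / len(lengths))

def order_partitions(partitions, arcs):
    """Brute-force the ordering of k partitions (k! candidates)."""
    best_layout, best_cost = None, math.inf
    for ordering in itertools.permutations(partitions):
        layout = [node for part in ordering for node in part]
        position = {node: idx for idx, node in enumerate(layout)}
        cost = geometric_mean_arc_length(arcs, position)
        if cost < best_cost:
            best_layout, best_cost = layout, cost
    return best_layout, best_cost

# A path graph 0-1-2-3-4-5 split into three partitions, given out of order.
arcs = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
partitions = [[0, 1], [4, 5], [2, 3]]
print(order_partitions(partitions, arcs))
```

Trying all k! orderings is only feasible for small k; a multi-level method would apply a step like this recursively within each partition.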


Applications

- Isosurface extraction
- View-dependent rendering

Iso-Surface Extraction

- Uses contour tree [van Kreveld et al. 97]
- Runtime is dominated by the traversal of the iso-surface
- Layout graph
- Use an input tetrahedral mesh

Spx model (140K vertices)

High Correlation with Number of Cache Misses

Tested block size = 4KB

Tested with 8 different layouts:

our geometric CO, our cache-aware, breadth-first (and depth-first) layouts, spectral [Juvan and Mohar 92], cache-oblivious mesh [Yoon et al. 05], Z-curve [Sagan 94], X-axis sorted layouts

High Correlation with Runtime Performance

[Figure: two regimes are marked, one where memory access time is the major bottleneck and one where disk I/O time is the major bottleneck.]

Comparison with Other Layouts

The first iso-surface extraction time (sec)

8% - 77% improvement and very close to the cache-aware performance

View-Dependent Rendering

- Lay out vertices and triangles of progressive meshes
- Used in an efficient VDR system [Yoon et al. 04]
- Reduce misses in GPU vertex cache

Cache Miss Ratio on Bunny Model

[Figure: GPU vertex cache miss ratio versus vertex cache size. Curves: universal rendering sequence [Bogomjakov and Gotsman 02], Hoppe [Hoppe 99], geometric CO layout, and the theoretical lower bound [Bar-Yehuda and Gotsman 96].]

Cache Miss Ratio on Power Plant Model

[Figure: GPU vertex cache miss ratio versus vertex cache size. Curves: Z-curve, COML [Yoon et al. 05], Hoppe's rendering sequence [Hoppe 99], geometric CO layout, and the theoretical lower bound [Bar-Yehuda and Gotsman 96].]

Conclusion

- Novel cache-aware and cache-oblivious metrics to evaluate layouts
- Derived metrics based on two-level I/O model
- Improved the performance of applications without modifying codes

OpenCCL, open source library

http://gamma.cs.unc.edu/COL/OpenCCL

Ongoing and Future Work

- Derive a lower bound on our geometric cache-oblivious metric
- Employ mesh compression to further reduce disk I/O accesses
- Investigate efficient layout method for deforming models
- Apply to non-graphics applications
- e.g., shortest path or other graph computations

Cache-Efficient Layouts of Bounding Volume Hierarchies

- Yoon and Manocha, Eurographics 06

Collision detection

Ray tracing

Acknowledgements

- Ajith Mascarenhas
- Martin Isenburg
- Dinesh Manocha
- Fabio Bernardon, Joao Comba, and Claudio Silva
- For their unstructured tetrahedra rendering program
- Members of data analysis group in LLNL
- Anonymous reviewers

UCRL-PRES-225448

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48.

Cache-Coherence of Layouts

- Well known heuristics for cache-coherent layouts
- Space-filling curves [Sagan 94]
- How can we compute better layouts?
- Requires metrics measuring cache-coherence of layouts

Main Results

- Define cache-coherence of layout as:
- Expected number of cache misses during random walks of a graph given block-based caches
- Then, the expected number of cache misses is given by:
- Number of straddling arcs in the cache-aware case
- Geometric mean of arc lengths in the cache-oblivious case

Data Layout Optimization

- Rendering sequences
- Triangle strips
- [Deering 95, Hoppe 99, Bogomjakov and Gotsman 02]
- Processing sequences
- [Isenburg and Gumhold 03, Isenburg and Lindstrom 05]

Assume that the access pattern globally follows the layout order

Data Layout Optimization

- Space-filling curves
- [Sagan 94, Velho and Gomes 91, Pascucci and Frank 01, Lindstrom and Pascucci 01, Gopi and Eppstein 04]

Assume geometric regularity

Data Layout Optimization

- Graph and matrix layout
- A survey [Diaz et al. 02]
- Minimum linear arrangement (MLA)
- Bandwidth
- Profile
- Wavefront, etc.

Does not necessarily produce good layouts for block-based caches

Cache-Aware Metric

- Expected number of cache misses given a block size B

Correlation between Cache-Aware Metric and Observed Number of Cache Misses

[Figure: cache-aware metric versus observed number of cache misses for M = 5 and M = 25, with R² = 0.97 in both plots. Cache block size = 4KB.]

Evaluating Existing Layouts

- No known tight bound
- Compare against the best layout we can construct
- Employ an efficient sampling method

[Figure: decision flow for an existing layout, φ. Is it close to the optimal layout? The two outcomes are "Use it" and "Build a new one".]

Implementation

- Modify our open source layout computation codes, OpenCCL
- Based on METIS graph partitioning library [Karypis and Kumar 98]
- Processing speed
- 15k triangles / second

Comparison with Cache-Oblivious Mesh Layouts (COML) [Yoon et al. 05]

- Two major improvements
- Accuracy
- Usability

Pros and Cons

- Limitations
- Not directly applicable to dynamic models
- Pros
- Generality
- Can have benefits without modifying underlying codes

Specialization to Meshes

- Assume an equal likelihood of accessing adjacent nodes given a node

[Figure: an input mesh and the implicitly derived graph over nodes n_a, n_b, n_c, and n_d.]

Two Possible Block Size Progressions

- Arithmetic progression
- 1, 2, 3, 4, …
- Geometric progression
- 2^0, 2^1, 2^2, 2^3, …
- Reflects current caching architectures well
- E.g., L1: 32B, L2: 64B, Page: 4KB, etc.

[Figure: two plots of Pr(B) versus B, one for a uniform distribution and one for a geometric distribution.]

Cache Miss Ratio with Different Mesh Resolutions of Bunny Model

[Figure: GPU vertex cache miss ratio (cache size: 32) versus mesh resolution. Curves: Hoppe's rendering sequence [Hoppe 99] and COLg.]


Correlations with Observed Number of Cache Misses

[Figure: geometric CO metric and arithmetic CO metric versus cache misses (one block and multiple blocks). R² values shown: 0.98 and 0.81. Cache block size = 4KB.]

Comparison with Other Layouts

[Figure: comparison of the X-axis, BFL, DFL, spectral [Juvan and Mohar 92], COML [Yoon et al. 05], Z-curve, COLg, and cache-aware layouts. Correlations shown: 0.94, 0.98, 0.94, and 0.99. Annotations: up to 2X speedup; 9% lower than cache-aware layout.]

Comparison with Other Layouts

- Compute eight different layouts
- Our geometric cache-oblivious layout
- Our cache-aware layout
- Breadth-first layout
- Depth-first layout
- Spectral layout [Juvan and Mohar 92]
- Cache-oblivious mesh layout [Yoon et al. 05]
- Z-curve [Sagan 94]
- X-axis sorted layout
