Cache Design
• Cache parameters (organization and placement)
• Cache replacement policy
• Cache performance evaluation method
\course\cpeg324-05F\Topic7c
Cache Parameters
• Cache size: Scache (lines)
• Set number: N (sets)
• Line number per set: K (lines/set)
Scache = KN (lines) = KNL (bytes), where L is the line size in bytes.
A cache with K lines per set is K-way set-associative.
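As a sanity check on the formulas above, here is a minimal sketch; the parameter values (K = 2, N = 256, L = 32) are illustrative, not taken from the slides:

```python
# Cache-size arithmetic from the slide: Scache = K * N (lines) = K * N * L (bytes).

def cache_size_bytes(K, N, L):
    """K lines/set (associativity), N sets, L bytes per line."""
    return K * N * L

# Illustrative example: a 2-way set-associative cache, 256 sets, 32-byte lines.
K, N, L = 2, 256, 32
lines = K * N                      # Scache in lines
size = cache_size_bytes(K, N, L)   # Scache in bytes
print(lines, size)                 # 512 16384
```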
Trade-offs in Set-Associativity
Fully associative: higher hit ratio and concurrent search, but slow access when the associativity is large.
Direct-mapped: fast access (on a hit), trivial comparison, and a trivial replacement algorithm. However, if a program alternates between two blocks that map to the same cache line frame, "thrashing" may occur.
Note
• Main memory size: Smain (blocks); cache memory size: Scache (blocks).
• Let P = Smain / Scache. Since P >> 1, many memory blocks compete for each cache line, so you need to search, and the average search length is much greater than 1.
• Set-associativity provides a trade-off between:
  • concurrency in search
  • average search/access time per block
The set number N spans a spectrum: N = 1 is fully associative, 1 < N < Scache is set-associative, and N = Scache is direct-mapped.
Important Factors in Cache Design
• Address partitioning strategy (three dimensions of freedom)
• Total cache size / memory size
• Workload
Address Partitioning
With byte addressing and an M-bit address, the address divides into three fields: a tag, a set-number field of log2 N bits, and a within-line address of log2 L bits.
• Cache memory size (data part) = NKL (bytes)
• Directory (tag) size per entry = M - log2 N - log2 L bits
• Randomizing which set an access falls into reduces clustering.
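The field widths above can be sketched in code; the address and parameters below (M = 20, N = 256, L = 32) are hypothetical:

```python
# Split an M-bit byte address into tag / set-number / within-line fields:
# offset = log2(L) bits, set = log2(N) bits, tag = M - log2(N) - log2(L) bits.

def split_address(addr, M, N, L):
    assert addr < (1 << M)          # the address must fit in M bits
    off_bits = L.bit_length() - 1   # log2(L); L is a power of two
    set_bits = N.bit_length() - 1   # log2(N)
    offset = addr & (L - 1)
    set_no = (addr >> off_bits) & (N - 1)
    tag = addr >> (off_bits + set_bits)
    return tag, set_no, offset

tag, set_no, offset = split_address(0x12345, M=20, N=256, L=32)
print(tag, set_no, offset)          # 9 26 5
```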
[Figure: general curve describing cache behavior, miss ratio (y-axis) vs. cache size (x-axis). The miss ratio falls steeply for small caches and then flattens; note that there exists a knee in the curve (near a miss ratio of 0.34 in this sketch).]
"…the data are sketchy and highly dependent on the method of gathering…" The designer must make critical choices using a combination of "hunches, skills, and experience" as a supplement, a hunch being "a strong intuitive feeling concerning a future event or result."
Basic Principle
• Typical-workload study plus intelligent estimates for the rest
• Good engineering: a small degree of over-design
• The "30% rule" (A. Smith): each doubling of the cache size reduces misses by about 30%.
• It is a rough estimate only.
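The 30% rule turns into a back-of-the-envelope estimate; the starting miss ratio here is made up for illustration:

```python
# A. Smith's "30% rule": each doubling of the cache size cuts misses
# by roughly 30%, i.e. the miss ratio scales by about 0.7 per doubling.

def miss_ratio_after_doublings(m0, doublings):
    return m0 * (0.7 ** doublings)

# Growing a cache 8x (three doublings) from a 10% miss ratio:
m = miss_ratio_after_doublings(0.10, 3)   # roughly 0.034, i.e. about a third
```

A rough estimate only, as the slide says: real curves flatten out past the knee, so the rule overstates the gain for very large caches.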
Cache Design Process
• "Typical", not "standard"
• Sensitive to price/performance and technology:
  • main memory access time
  • cache access time
  • chip density
  • bus speed
  • on-chip cache
Cache Design Process
1. Choose the cache size: fix K and L, varying N. Start with a small K (pick K = 2, or more likely K = 1).
2. Choose the line size L: fix the cache size and K, varying L.
3. Choose the associativity K: fix the cache size and L, varying K.
If the new K differs from the old K, repeat from step 2 with the new K.
[Figure, Step 1: choose the cache size (fix K and L, varying N): relative number of misses vs. cache size (N).]
[Figure, Step 2: choose L (fix the cache size NKL and K, varying L): relative number of misses vs. cache line size (L).]
[Figure, Step 3: choose K (fix the cache size NKL and L, varying K): relative number of misses vs. cache associativity factor (K).]
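The three steps and the convergence test can be sketched as a loop; the `miss_ratio` function stands in for a trace-driven simulation and is purely hypothetical:

```python
# A sketch of the iterative design loop from the slides: choose the cache
# size first with a small K, then the line size L, then the associativity K;
# if K changed, redo the L step.

def design_cache(miss_ratio, sizes, line_sizes, assocs):
    K = 1                                             # start with small K
    L = line_sizes[0]
    # Step 1: choose the total size (fix K and L, vary N via the size).
    size = min(sizes, key=lambda s: miss_ratio(s, K, L))
    while True:
        # Step 2: choose L (fix size and K, vary L).
        L = min(line_sizes, key=lambda l: miss_ratio(size, K, l))
        # Step 3: choose K (fix size and L, vary K).
        new_K = min(assocs, key=lambda k: miss_ratio(size, k, L))
        if new_K == K:                                # converged
            return size, K, L
        K = new_K                                     # otherwise redo step 2

# Hypothetical cost model: prefers the largest size, L = 64, K = 4.
stub = lambda size, K, L: 1 / size + abs(L - 64) / 1000 + abs(K - 4) / 100
print(design_cache(stub, [1024, 4096], [16, 32, 64, 128], [1, 2, 4, 8]))
# (4096, 4, 64)
```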
N: Set Number
• Cache directory entries = NK
• Cache size = NKL
• Constraint in the selection of N: the page size.
K: Associativity
• Bigger K: better miss ratio.
• Smaller K is better in being faster, cheaper, and simpler.
• K = 4 ~ 8 gets close to the best miss ratio.
L: Line Size
• The atomic unit of transmission.
• Effects of a larger L:
  • smaller miss ratio (up to a point)
  • larger average delay
  • less traffic
  • larger average hardware cost for associative search
  • larger possibility of "line crossers" (references that span a line boundary)
• Workload dependent; typically 16 ~ 128 bytes.
Cache Replacement Policy
• FIFO (first-in, first-out)
• LRU (least-recently used)
• OPT (furthest future use): do not retain lines whose next occurrence lies furthest in the future.
Note: LRU performance is close to OPT for frequently encountered program structures.
Program Structure
for i = 1 to n
  for j = 1 to n
    …
  endfor
endfor
The last-in-first-out behavior of nested loops makes the recent past resemble the near future.
[Figure: an eight-way cache directory maintained with the OPT policy, with entries ordered from nearest to furthest future access: (a) initial state for the future reference string A Z B Z C A D E F G H; (b) after a cache hit on line A; (c) after a cache miss to line Z.]
Why Are LRU and OPT Close to Each Other?
LRU looks only at the past; OPT looks only at the future. But the recent past approximates the nearest future. Why? Consider nested loops.
Problem with LRU
• Not good at mimicking sequential/cyclic access patterns.
• Example, with a set size of 3: the repeating string A B C D E F, A B C …, A B C … causes LRU to evict each line just before it is referenced again.
[Figure: sequential access under OPT vs. LRU. For the repeating reference string A B C D E F G, A B C … G with three line frames, OPT keeps a stable subset such as A, B, C resident across iterations, while LRU cycles every line through the frames and retains only the most recently used lines.]
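LRU's cyclic worst case can be reproduced in a few lines. A minimal single-set simulation follows; the reference string is a shortened stand-in for the slide's cyclic pattern:

```python
from collections import OrderedDict

# LRU's worst case: with a set of size 3, the cyclic reference string
# A B C D A B C D ... misses on every access, because LRU always evicts
# exactly the line that is needed next.

def lru_misses(trace, set_size):
    cache = OrderedDict()           # oldest (least-recently used) entry first
    misses = 0
    for ref in trace:
        if ref in cache:
            cache.move_to_end(ref)  # hit: mark as most recently used
        else:
            misses += 1
            if len(cache) == set_size:
                cache.popitem(last=False)   # evict the LRU line
            cache[ref] = True
    return misses

trace = list("ABCD" * 3)            # cyclic pattern, 12 references
print(lru_misses(trace, 3))         # 12: every reference misses
```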
Empirical Data
OPT can gain about 10% ~ 30% improvement over LRU (in terms of miss reduction).
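To see where OPT's advantage comes from, here is a sketch of OPT on the same kind of cyclic trace; a single set of 3 lines is assumed, and the trace is illustrative:

```python
# OPT ("furthest future use") from the slides: on a miss, evict the
# resident line whose next reference lies furthest in the future
# (or that never occurs again). Single-set simulation, for illustration.

def opt_misses(trace, set_size):
    cache, misses = set(), 0
    for i, ref in enumerate(trace):
        if ref in cache:
            continue                # hit
        misses += 1
        if len(cache) == set_size:
            future = trace[i + 1:]
            # Victim: the line referenced furthest ahead (or never again).
            victim = max(cache, key=lambda x: future.index(x)
                         if x in future else len(future) + 1)
            cache.remove(victim)
        cache.add(ref)
    return misses

print(opt_misses(list("ABCD" * 3), 3))   # 6
```

On this 12-reference cyclic trace LRU misses every time, so OPT's 6 misses far exceed the 10% ~ 30% quoted above; the tiny adversarial trace exaggerates the gap.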
A Comparison
• OPT has two candidates for replacement; LRU has only one: the least-recently used line.
• LRU never replaces the most recently referenced line, even though the most recently referenced line may be the one referenced furthest in the future, i.e., a "dead" line under LRU.
Performance Evaluation Methods for Workloads
• Analytical modeling
• Simulation
• Measurement
Cache Analysis Methods
• Hardware monitoring
  • fast and accurate
  • but not fast enough for high-performance machines
  • cost
  • flexibility/repeatability
Cache Analysis Methods (cont'd)
• Address traces and machine simulation
  • slow
  • accuracy/fidelity
  • cost advantage
  • flexibility/repeatability
  • OS and other impacts: how to include them?
Trace-Driven Simulation for Caches
• Workload dependence
  • difficulty in characterizing the load
  • no generally accepted model
• Effectiveness
  • can simulate many parameter combinations
  • repeatability
Problems with Address Traces
• Hard to make representative of the actual workload
  • traces cover only milliseconds of real execution
  • diversity of user programs
• Initialization transient
  • use traces long enough to absorb its impact
• Inability to properly model multiprocessor effects
An Example
• Assume a two-way set-associative cache with 256 sets: Scache = 2 x 256 = 512 lines.
• Assume the difficulty of deciding whether or not to count initialization causes 512 more misses than actually required.
• A trace of length 100,000 with a hit rate of 0.99 generates only 1,000 misses, so those 512 make a big difference!
• To keep the 512 extra misses below 5% of the total, we need total misses = 512 / 5% = 10,240; with a hit rate of 0.99, the required trace length is > 1,024,000.
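The slide's arithmetic generalizes to a small helper; the 5% threshold and the 0.99 hit rate are the slide's own numbers:

```python
# Required trace length so that spurious startup misses stay below a given
# fraction of all misses: total misses must reach startup / fraction, and
# each miss corresponds to 1 / (1 - hit_rate) trace references.

def required_trace_length(startup_misses, max_fraction, hit_rate):
    total_misses = startup_misses / max_fraction   # e.g. 512 / 0.05 = 10,240
    return total_misses / (1 - hit_rate)

n = required_trace_length(512, 0.05, 0.99)
print(round(n))   # 1024000
```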
One may not know the cache parameters beforehand. What to do? Make the trace longer than the minimum acceptable length!
100,000 references? Too small.
• Are (10 ~ 100) x 10^6 references OK?
• Traces of 1000 x 10^6 or more are being used now.