
Improving Index Performance through Prefetching

Shimin Chen, Phillip B. Gibbons†, and Todd C. Mowry. School of Computer Science, Carnegie Mellon University; †Information Sciences Research Center, Bell Laboratories.

Presentation Transcript


  1. Improving Index Performance through Prefetching. Shimin Chen, Phillip B. Gibbons†, and Todd C. Mowry. School of Computer Science, Carnegie Mellon University; †Information Sciences Research Center, Bell Laboratories.

  2. Databases and the Memory Hierarchy. [Figure: the memory hierarchy (CPU, L1 cache, L2/L3 cache, main memory, disk); levels become larger, slower, and cheaper toward the bottom.] Traditional focus: • buffer pool management (DRAM as a cache for disk). Important focus today: • processor cache performance (SRAM as a cache for DRAM) • e.g., [Ailamaki et al., VLDB '99], etc.

  3. Index Structures. [Figure: a tree index with non-leaf nodes and leaf nodes.] • Used extensively in databases to accelerate performance • selections, joins, etc. Common implementation: B+-Trees.

  4. B+-Tree Indices: Common Access Patterns. Search: • locate a single tuple. Range scan: • locate a collection of tuples within a range.

  5. Cache Performance of B+-Tree Indices. [Figure: execution time broken into data cache stalls, other stalls, and busy time.] • A main-memory B+-Tree containing 10M keys: • Search: 100K random searches • Scan: 100 range scans of 1M keys, starting at random keys • Detailed simulations based on a Compaq ES40 system. Most of the execution time is wasted on data cache misses: 65% for searches, 84% for range scans.

  6. B+-Trees: Optimizing Search for Cache vs. Disk. • To minimize the number of data transfers (I/O or cache misses): optimal node width = natural data transfer size • for disk: disk page size (~8 Kbytes) • for cache: cache line size (~64 bytes). [Figure: a shallow, wide tree optimized for disk vs. a taller, narrow tree optimized for cache.] • Much narrower nodes and taller trees • Search performance is more sensitive to changes in branching factors.

  7. Previous Work: "Cache-Sensitive B+-Trees", Rao and Ross [SIGMOD 2000]. [Figure: node layouts of B+-Trees vs. CSB+-Trees, showing keys with and without per-child pointers.] Key insight: • nearly all child pointers can be eliminated by restricting the data layout • this doubles the branching factor of cache-line-sized non-leaf nodes.
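For intuition, here is a minimal C sketch of the layout idea behind CSB+-Trees. It is not the authors' exact node format: the field counts assume the 4-byte keys and 4-byte child references used in the talk's experiments, and child references are modeled as indices into a node array.

#include <stdint.h>

/* Regular B+-Tree non-leaf node: one child reference per key, so with
 * 4-byte keys and references a 64-byte cache line holds only ~7 keys. */
struct btree_node {
    int32_t  nkeys;
    int32_t  key[7];
    uint32_t child[8];      /* per-child references (4 bytes each) */
};

/* CSB+-Tree non-leaf node: all children of a node are stored
 * contiguously as one "node group", so a single reference to the group
 * replaces the per-child references and ~twice as many keys fit in the
 * same 64-byte line. */
struct csb_node {
    uint32_t first_child;   /* reference to the contiguous child group */
    int32_t  nkeys;
    int32_t  key[14];
};

/* Child i is located by arithmetic from the group base, not by a
 * stored per-child pointer. */
static inline uint32_t csb_child(const struct csb_node *n, uint32_t i) {
    return n->first_child + i;
}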

  8. Impact of CSB+-Trees on Search Performance. [Figure: execution time (data cache stalls, other stalls, busy time) for B+-Tree vs. CSB+-Tree.] • Search is 15% faster due to the reduction in tree height • However: • update performance is worse [Rao & Ross, SIGMOD '00] • range scan performance does not improve. There is still significant room for improvement.

  9. Latency Tolerance in Modern Memory Hierarchies. [Figure: CPU, L1 cache, L2/L3 cache, and main memory, with several prefetches (pref 0(r2), pref 4(r7), pref 0(r3), pref 8(r9)) in flight at once.] • Modern processors overlap multiple simultaneous cache misses • e.g., the Compaq ES40 supports 8 off-chip misses per processor • Prefetch instructions allow software to fully exploit this parallelism. What dictates performance: • NOT simply the number of cache misses • but rather the amount of exposed miss latency.
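In source code, such prefetch instructions are typically reached through a compiler builtin; a minimal sketch using GCC/Clang's __builtin_prefetch (the array names are illustrative, not from the talk) shows how several independent misses can be put in flight before the dependent work runs:

#include <stddef.h>

/* Issue prefetches for several independent addresses up front, then do
 * the dependent work: the cache misses overlap instead of serializing.
 * __builtin_prefetch compiles down to a non-binding prefetch
 * instruction like the "pref" instructions shown on the slide. */
static long sum_four(const long *a, const long *b,
                     const long *c, const long *d, size_t i)
{
    __builtin_prefetch(&a[i]);
    __builtin_prefetch(&b[i]);
    __builtin_prefetch(&c[i]);
    __builtin_prefetch(&d[i]);

    /* ... other work can proceed while the four lines are fetched ... */

    return a[i] + b[i] + c[i] + d[i];
}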

  10. Our Approach. New proposal: "Prefetching B+-Trees" (pB+-Trees) • use prefetching to reduce the amount of exposed miss latency. Key challenge: • data dependences caused by chasing pointers. Benefits: • significant performance gains for: • searches • range scans • updates (!) • complementary to CSB+-Trees.

  11. Overview. • Prefetching Searches • Prefetching Range Scans • Experimental Results • Conclusions.

  12. Example: Search where Node Width = 1 Line. [Figure: timeline (0 to 600 cycles) showing four full cache misses, one per level of the tree.] 1000 keys, 64B lines, 4B keys, pointers, and tupleIDs; 4 levels in the B+-Tree (cold cache). We suffer one full cache miss at each level of the tree.

  13. Same Example where Node Width = 2 Lines. [Figure: miss timelines for node widths of 1 and 2 lines; with 2-line nodes the tree has only 3 levels, yet the timeline extends to 900 cycles.] Additional misses per node dominate the reduction in the number of levels.

  14. How Things Change with Prefetching. [Figure: miss timelines for 1-line nodes, 2-line nodes without prefetching, and 2-line nodes with prefetching, where all lines of a node are fetched in parallel.] • What matters is no longer the number of misses but the exposed miss latency • fetch all lines within a node in parallel.

  15. pB+-Trees: Using Prefetching to Improve Search. Basic idea: • make nodes wider than the natural data transfer size • e.g., 8 cache lines wide • prefetch all lines of a node before searching in the node. Improved search performance: • larger branching factors, shallower trees • the cost to access every node increases only slightly. Reduced space overhead: • primarily due to fewer non-leaf nodes. Update performance: ???
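A minimal C sketch of this search idea (the node layout, field sizes, and helper names are assumptions for illustration, not the paper's code): every cache line of a wide node is prefetched before the node is searched, so the lines are fetched in parallel and roughly one full miss latency is exposed per level.

#include <stdint.h>

#define LINE_SIZE  64   /* cache line size assumed in the talk    */
#define NODE_LINES 8    /* pB+-Tree node width, e.g. 8 cache lines */

/* Hypothetical wide-node layout; the field sizes are illustrative. */
typedef struct pnode {
    int32_t nkeys;
    int32_t is_leaf;
    int32_t key[62];
    struct pnode *child[63];
} pnode;

/* Fetch every cache line of the node in parallel before touching it. */
static void prefetch_node(const pnode *n)
{
    const char *p = (const char *)n;
    for (int i = 0; i < NODE_LINES; i++)
        __builtin_prefetch(p + i * LINE_SIZE);
}

/* Descend the tree; at each level the node's lines are prefetched up
 * front, so only about one full miss latency is exposed per level even
 * though the node spans several lines. */
static const pnode *search(const pnode *root, int32_t k)
{
    const pnode *n = root;
    for (;;) {
        prefetch_node(n);
        int i = 0;
        while (i < n->nkeys && n->key[i] <= k)   /* simple linear search */
            i++;
        if (n->is_leaf)
            return n;                            /* caller scans the leaf */
        n = n->child[i];
    }
}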

  16. Overview. • Prefetching Searches • Prefetching Range Scans • Experimental Results • Conclusions.

  17. Range Scan Cache Behavior: Normal B+-Trees. [Figure: timeline (0 to 900+ cycles) with one full cache miss per leaf node.] • Steps in a range scan: • search for the starting leaf node • traverse the leaves until the end is found. We suffer a full cache miss for each leaf node!

  18. If Prefetching Wider Nodes (e.g., node width = 2 lines). [Figure: miss timelines without and with wider prefetched nodes; the misses within each leaf overlap.] • Exposed miss latency is reduced by up to a factor of the node width. • A definite improvement, but can we still do better?

  19. The Ideal Case. [Figure: miss timelines comparing the previous schemes with an ideal one in which the leaf-node misses are fully overlapped.] • Overlap misses until • all latency is hidden, or • we run out of bandwidth. How can we achieve this?

  20. The Pointer Chasing Problem. [Figure: leaf chain marking the node currently being visited and the node we want to prefetch; direct prefetching vs. the ideal case.] If prefetching by chasing pointers, we still experience the full latency at each node.

  21. Our Solution: Jump-Pointer Arrays. [Figure: leaf nodes with their addresses collected in an array, plus back pointers from the leaves into the array.] • Put the leaf addresses in an array • Directly prefetch by using the jump pointers • Back pointers are needed to initialize prefetching.

  22. Our Solution: Jump-Pointer Arrays (continued). [Figure: timeline in which the leaf-node misses are overlapped by prefetching through the jump-pointer array.]
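A minimal C sketch of how the jump-pointer array removes the pointer-chasing dependence during a range scan (the names, the prefetch distance, and the consume() callback are illustrative, not the paper's implementation): the address of the leaf a few positions ahead is read straight out of the array, so it can be prefetched without following the leaf chain.

#include <stddef.h>

#define LINE_SIZE  64
#define LEAF_LINES 8    /* pB+-Tree leaf width in cache lines (assumed)  */
#define PF_DIST    3    /* prefetch distance, roughly ceil(L / W) leaves */

typedef struct leaf { char bytes[LEAF_LINES * LINE_SIZE]; } leaf;

/* Prefetch all cache lines of one leaf. */
static void prefetch_leaf(const leaf *l)
{
    for (int i = 0; i < LEAF_LINES; i++)
        __builtin_prefetch((const char *)l + i * LINE_SIZE);
}

/* Scan the leaves jump[start .. end-1].  The address of the leaf
 * PF_DIST positions ahead comes straight from the jump-pointer array,
 * so its miss latency overlaps the work on the current leaves. */
static void range_scan(leaf *const *jump, size_t start, size_t end,
                       void (*consume)(const leaf *))
{
    for (size_t i = start; i < end; i++) {
        if (i + PF_DIST < end)
            prefetch_leaf(jump[i + PF_DIST]);
        consume(jump[i]);   /* per-leaf work of the range scan */
    }
}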

  23. External Jump-Pointer Arrays: Efficient Updates. [Figure: a chunked linked list of jump pointers, with hints from the leaves into the chunks.] • The impact of an insertion is limited to its chunk • Deletions leave empty slots • Actively interleave empty slots during bulkload and chunk splits • The back pointer to a position in the jump-pointer array is now a hint • it points to the correct chunk • but may require a local search within the chunk to initialize prefetching.
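A minimal C sketch of this chunked structure (CHUNK_SLOTS, chunk_insert, and the forward-only hole search are assumptions for illustration; the paper's chunk layout and hint maintenance differ in detail): an insertion shifts entries only as far as the nearest empty slot in the same chunk, so its cost is bounded by the chunk size.

#include <stddef.h>
#include <string.h>

#define CHUNK_SLOTS 32          /* chunk capacity (illustrative) */

struct leaf;                    /* B+-Tree leaf node, opaque here */

/* One chunk of the external jump-pointer array: a fixed-size slice of
 * leaf addresses (NULL marks an empty slot left by a deletion or
 * interleaved during bulkload), linked to the next chunk.  Each leaf
 * keeps a hint pointing at the chunk holding its entry; since entries
 * only move within their chunk, the hint stays valid. */
typedef struct chunk {
    struct chunk *next;
    struct leaf  *slot[CHUNK_SLOTS];
} chunk;

/* Insert leaf l just after slot pos of chunk c: shift entries only up
 * to the nearest empty slot.  Returns 0 when no slot is free; a real
 * implementation would then split the chunk (which is rare). */
static int chunk_insert(chunk *c, int pos, struct leaf *l)
{
    int hole = -1;
    for (int i = pos + 1; i < CHUNK_SLOTS; i++)
        if (c->slot[i] == NULL) { hole = i; break; }
    if (hole < 0)
        return 0;
    memmove(&c->slot[pos + 2], &c->slot[pos + 1],
            (size_t)(hole - pos - 1) * sizeof c->slot[0]);
    c->slot[pos + 1] = l;
    return 1;
}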

  24. Alternative Design: Internal Jump-Pointer Arrays. [Figure: the bottom non-leaf nodes (the parents of the leaf nodes) linked together.] • B+-Trees already contain structures that point to the leaf nodes: the parents of the leaf nodes (the "bottom non-leaf nodes") • By linking them together, we can use them as a jump-pointer array. Tradeoff: • no need for back pointers, and simpler to maintain • consumes less space, though the external array's overhead is <1% • but less flexible: the chunk size is fixed by the B+-Tree structure.

  25. Overview. • Prefetching Searches • Prefetching Range Scans • Experimental Results • search performance • range scan performance • update performance • Conclusions.

  26. Experimental Framework. • Results are for a main-memory database environment • (we are extending this work to disk-based environments). Executables: • we added prefetch instructions to the C source code by hand • we used gcc to generate optimized MIPS executables with prefetch instructions. Performance measurement: • detailed, cycle-by-cycle simulations. Machine model: • based on a Compaq ES40 system, with slightly updated parameters.

  27. Simulation Parameters. The simulator models all the gory details, including memory system contention.

  28. Index Search Performance. [Figure: execution time (M cycles) vs. number of tupleIDs in the tree (10^4 to 10^7) for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p8CSB+.] 100K random searches after bulkload; 100% full (except the root); warm caches. • pB+-Trees achieve a 27-47% speedup vs. B+-Trees and 14-34% vs. CSB+-Trees • the optimal node width is 8 cache lines • pB+-Trees and CSB+-Trees are complementary: p8CSB+-Trees perform best.

  29. Same Search Experiments with Cold Caches. [Figure: execution time (M cycles) vs. number of tupleIDs (10^4 to 10^7) for the same trees, with cold caches.] 100K random searches after bulkload; 100% full (except the root); cold caches (i.e., cleared after each search). • Large discrete steps within each curve. What is happening here?

  30. Analysis of Cold Cache Search Behavior. [Figure: the same cold-cache curves, annotated with the number of levels in each tree.] • The height of the tree dominates performance • the effect is blurred in the warm-cache case • For trees of the same height, the smaller the node size, the better.

  31. Overview. • Prefetching Searches • Prefetching Range Scans • Experimental Results • search performance • range scan performance • update performance • Conclusions.

  32. Index Range Scan Performance. [Figure (log scale): execution time (cycles) vs. number of tupleIDs scanned in a single call (10^1 to 10^6) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree.] 100 scans starting at random locations on an index bulkloaded with 3M keys (100% full). Scans of 1K-1M keys: 6.5-8.7x speedup over B+-Trees • a factor of 3.5-3.7 from prefetching wider nodes • an additional factor of ~2 from jump-pointer arrays.

  33. Index Range Scan Performance (continued). [Figure: the same log-scale plot as the previous slide.] 100 scans starting at random locations on an index bulkloaded with 3M keys (100% full). Small scans (<1K keys): the overshooting cost is noticeable • exploit jump pointers only if the scan is expected to be large (e.g., search for the end).

  34. Overview. • Prefetching Searches • Prefetching Range Scans • Experimental Results • search performance • range scan performance • update performance • Conclusions.

  35. Update Performance. [Figures: execution time (M cycles) vs. percentage of entries used in leaf nodes (50-100%) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree; one plot for insertions, one for deletions.] 100K random insertions/deletions on a 3M-key bulkloaded index; warm caches. • pB+-Trees achieve at least a 1.24x speedup in all cases. Why?

  36. Update Performance (continued). [Figures: the same insertion and deletion plots as the previous slide.] 100K random insertions/deletions on a 3M-key bulkloaded index; warm caches. Reason #1: faster search times. Reason #2: less frequent node splits with wider nodes.

  37. pB+-Trees: Other Results. Similar results for: • varying bulkload factors of the trees • large segmented range scans • mature trees • varying jump-pointer array parameters: • prefetch distance • chunk size. Optimal node width: • increases as memory bandwidth increases • (matches the width predicted by our model in the paper).

  38. Cache Performance Revisited. [Figure: execution time broken into data cache stalls, other stalls, and busy time, before and after prefetching.] Search: eliminated 45% of the original data cache stalls, a 1.47x speedup. Scan: eliminated 97% of the original data cache stalls, an 8-fold speedup.

  39. Conclusions. • Impact of Prefetching B+-Trees on performance: • Search: 1.27-1.55x speedup over B+-Trees • wider nodes reduce the height of the tree and the number of expensive misses • they outperform and are complementary to CSB+-Trees • Updates: 1.24-1.52x speedup over B+-Trees • faster search and less frequent node splits • in contrast with significant slowdowns for CSB+-Trees • Range scan: 6.5-8.7x speedup over B+-Trees • wider nodes: a factor of ~3.5 speedup • jump-pointer arrays: an additional factor of ~2 speedup • Prefetching B+-Trees also reduce space overhead • These benefits are likely to increase with future memory systems • Applicable to other levels of the memory hierarchy (e.g., disks).

  40. Backup Slides.

  41. Revisiting the Optimal Node Width for Searches. Total cache misses for a search = (misses per level) × (number of levels in the tree) = w × ⌈log_{m·w}(N)⌉, which is minimized when w = 1. Here w = number of cache lines per node, m = number of child pointers per one-cache-line-wide node, and N = number of tupleIDs in the index.
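Restated in LaTeX, using only the definitions on this slide (a sketch of the slide's cost model, not a derivation copied from the paper):

\[
  \text{total cache misses per search}
  \;=\;
  \underbrace{w}_{\text{misses per level}}
  \times
  \underbrace{\bigl\lceil \log_{m w} N \bigr\rceil}_{\text{levels in the tree}} .
\]
% Ignoring the ceiling, this is (w / ln(mw)) * ln N; for realistic
% branching factors (mw > e) the factor w / ln(mw) grows with w, so the
% total is minimized at w = 1: without prefetching, one-cache-line-wide
% nodes are best.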

  42. Scheduling Prefetches Early Enough. [Figure: linked-list nodes n_i, n_i+1, n_i+2, n_i+3, with L = time to load a node and W = time for work() on a node; the node currently being visited vs. the node we want to prefetch.] p = &n0; while (p) { work(p->data); p = p->next; } • Our goal: fully hide the latency • thus achieving the fastest possible computation rate of 1/W • e.g., if L = 3W, we must prefetch 3 nodes ahead to achieve this.

  43. Performance without Prefetching. [Figure: timeline in which each node's load L_i and work W_i are serialized; L_k = loading n_k, W_k = work(n_k).] while (p) { work(p->data); p = p->next; } Computation rate = 1/(L+W).

  44. Prefetching One Node Ahead. [Figure: timeline in which the prefetch pf(p->next) overlaps loading n_{i+1} with the work on n_i, limited by the data dependence between consecutive nodes.] while (p) { pf(p->next); work(p->data); p = p->next; } • Computation is overlapped with memory accesses • computation rate = 1/L.

  45. Prefetching Three Nodes Ahead. [Figure: timeline for pf(p->next->next->next); the chain of data dependences still serializes the loads.] while (p) { pf(p->next->next->next); work(p->data); p = p->next; } Computation rate does not improve (still = 1/L)! • Pointer-chasing problem [Luk & Mowry, ASPLOS '96]: • any scheme that follows the pointer chain is limited to a rate of 1/L.

  46. Our Goal: Fully Hide Latency. [Figure: timeline in which pf(&n_{i+3}) uses a known address, so the loads of n_{i+1}, n_{i+2}, and n_{i+3} proceed in parallel with the work on n_i.] while (p) { pf(&n_{i+3}); work(p->data); p = p->next; } This achieves the fastest possible computation rate of 1/W.
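The while-loops on these backup slides are pseudo-code; a compilable C sketch of this final variant (helper and array names are assumptions) keeps the node addresses in an array, which plays the role of the jump pointers, and prefetches a fixed distance ahead:

#include <stddef.h>

typedef struct node { struct node *next; int data; } node;

static void work(int data) { (void)data; }   /* stands in for the W cycles of work */

#define PF_DIST 3   /* prefetch distance >= ceil(L / W); 3 when L = 3W */

/* addr[i] holds &n_i, so the prefetch address for n_{i+PF_DIST} is
 * available without chasing PF_DIST pointers; the loads overlap the
 * work and the loop runs at a rate of about 1/W. */
static void traverse(node *const *addr, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (i + PF_DIST < count)
            __builtin_prefetch(addr[i + PF_DIST]);
        work(addr[i]->data);
    }
}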

  47. Challenges in Supporting Efficient Updates. [Figure: conceptual view of the jump-pointer array as one flat array, with back pointers from the leaves.] • What if we really implemented it this way? • Insertion: could incur significant overheads • copying data within the array to create a new hole • updating back pointers • Deletion: okay; just leave a hole.

  48. Summary: Why We Expect Updates to Perform Well. Insertions: • only a small number of jump pointers move • those between the insertion point and the nearest hole in the chunk • normally only the hint pointer for the inserted node is updated • which does not require any significant overhead • significant overheads occur only on chunk splits, which are rare. Deletions: • no data is moved (just leave an empty hole) • no need to update any hints. In general, the jump-pointer array requires little concurrency control.

  49. B+-Trees Modeled and their Notations. • B+-Trees: regular B+-Trees • CSB+-Trees: cache-sensitive B+-Trees [Rao & Ross, SIGMOD 2000] • pwB+-Trees: prefetching B+-Trees with node size = w cache lines and no jump-pointer arrays • we consider w = 2, 4, 8, and 16 • p8eB+-Trees: prefetching B+-Trees with node size = 8 cache lines and external jump-pointer arrays • p8iB+-Trees: prefetching B+-Trees with node size = 8 cache lines and internal jump-pointer arrays • p8CSB+-Trees: prefetching cache-sensitive B+-Trees with node size = 8 cache lines (and no jump-pointer arrays). (Gory implementation details are in the paper.)

  50. Searches with Varying Bulkload Factors. [Figures: execution time (M cycles) vs. percentage of entries used in leaf nodes (50-100%) for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p8CSB+; one plot for warm caches, one for cold caches.] • Similar trends with smaller bulkload factors as when 100% full • The performance of pB+-Trees is somewhat less sensitive to the bulkload factor.
