1 / 12

MonetDB : A column-oriented DBMS

MonetDB : A column-oriented DBMS. Ryan Johnson CSC2531. The memory wall has arrived. CPU performance +70%/year Memory performance latency: -50%/ decade bandwidth: +20%/year (est.) Why? DRAM focus on capacity (+70%/year) Physical limitations (pin counts, etc.)

cassia
Download Presentation

MonetDB : A column-oriented DBMS

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. MonetDB: A column-oriented DBMS Ryan Johnson CSC2531

  2. The memory wall has arrived • CPU performance +70%/year • Memory performance latency: -50%/decade bandwidth: +20%/year (est.) • Why? • DRAM focus on capacity (+70%/year) • Physical limitations (pin counts, etc.) • Assumption that caches "solve” latency problem DBMS spends 95% of time waiting for memory

  3. The problem: data layouts • Logical layout: 2-D relation => Unrealizable in linear address space! • N-ary storage layout, aka “slotted pages” • Easy row updates, strided access to columns => Low cache locality for read-intensive workloads . . . “NSM layouts considered harmful”

  4. Coping with The Wall • Innovation: decompose all data vertically • Columns stored separately, rejoined at runtime • Binary Association Table (BAT) replaces Relation • List of (recordID, columnValue) pairs • Compression and other tricks ==> 1 byte/entry BAT + clever algos=> cache locality => Winner!

  5. Exploring deeper • Performance study (motivation) • Physical data layouts • Cache-optimized algorithms • Evaluating MonetDB performance • Implications and lingering questions

  6. NSM: access latency over time Read one column(record size varies with x) Latency increases ~10x as accesses/cache line  1(slope changes at L1/L2 line size)

  7. Efficient physical BAT layout • Idea #1: “virtual OID” • Optimizes common case • Dense, monotonic OIDs • All BATs sorted by OID • Idea #2: compression • Exploits small domains • Boosts cache locality, effective mem BW Joining two BAT on OID has O(n) cost! Out-of-band values? Can’t we compress NSM also? How to handle gaps?

  8. Cache-friendly hash join • Hash partitioning: one pass but trashes L1/L2 • #clusters > #cache lines • Radix-partitioning: limit active #partitions by making more passes Recall:CPU is cheap compared to memory access

  9. Great, but how well does it work? • Three metrics of interest • L1/L2 misses (= suffer latency of memory access) • TLB misses (even more expensive than cache miss) • Query throughput (higher is better) • Should be able to explain throughput using other metrics • Given model makes very good predictions => Memory really is (and remains!) the bottleneck

  10. A few graphs Radix clustering behavior as cardinality varies Radix-clustered HJ vs. other algorithms Big win: stability as cardinalities vary

  11. Implications and discussion points • Cache-friendly really matters (even w/ I/O) • Traditional DBMS memory-bound • Vertically decomposed data: superior density • Data brought to cache only if actually needed • Compression gives further density boost • Questions to consider... • Queries accessing many columns? • What about inserts/updates (touch many BAT)? • What about deletes/inserts(bad for compression)?

  12. Implications and discussion points • Cache-friendly really matters (even w/ I/O) • Traditional DBMS memory-bound • Vertically decomposed data: superior density • Data brought to cache only if actually needed • Compression gives further density boost • Questions to consider... • Queries accessing many columns? • How to make a good query optimizer? • Performance of transactional workloads? • Update-intensive, concurrency control, ... • What about inserts (bad for compression)?

More Related