Time for Change: Why Not Transact In Memory?

  1. Time for Change: Why Not Transact In Memory? Sang K. Cha. Email: chask@db.stanford.edu. On leave from Seoul National University, Korea.

  2. Transact In Memory Project
  • My interest in main memory databases started from:
    • WS-IRIS at HP Labs (1991-1992.1): the in-memory personal object manager project of Tore Risch (before the SmallBase project, which became TimesTen)
  • Related previous projects at Seoul National Univ.:
    • C++ design & implementation of MMDBMS prototypes
      • Initial design of MMDBMS modules (1993-1994)
      • Mr.RT for ETRI (1995-1996)
        • 2+ local startups
      • Xmas: extension of Mr.RT (1997-1999)
    • Object-oriented middleware for spatial object databases (1994-1999)
  • Transact In Memory project (2000 - present)

  3. Transact In Memory Project
  Spirit:
  • Do what we think is important rather than what others (government bureaucrats or committees) think is important.
  • No more "build and perish": deal with end users directly for quick feedback on our work.
  Funding scheme:
  • Experimental
  • Presently self-managed through Transact In Memory, Inc. for research and development freedom
  • Venture capital financing in the future?

  4. Transact In Memory Project
  Our users:
  • Performance-demanding users with simple applications (with 3n-letter-company product bases)
    • Telco companies
    • Stock brokerage companies
    • ….
  Our people: 20+ and expanding
  • Graduate students
  • Consultant (patent attorney for IP protection)
  • Full-time field service engineering & marketing staff

  5. Outline
  • Introduction to Main Memory DBMS
    • MMDBMS vs DRDBMS
    • Memory-primary vs disk-primary architecture
  • More on the Transact In Memory Project
    • Some hypotheses
    • Parallel architecture with parallel logging and recovery
    • Some memory-primary design
    • P*TIME: a Highly Parallel Transact In Memory Engine
  • Summary

  6. Main Memory DBMS
  • Database resident in memory
  • Read transactions simply read the in-memory data.
  • Update transactions do in-memory updates and write update logs to the log disk.
  • Occasionally, checkpoint the dirty pages of the in-memory database to the disk-resident backup DB to shorten recovery time.
  [Diagram: MMDBMS with the primary DB in memory, logging to a log disk, and checkpointing to a backup DB]
  (A minimal sketch of this cycle follows below.)
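To make the read/update/log/checkpoint cycle above concrete, here is a minimal C++ sketch (an illustration only, not P*TIME code; the class and file names are hypothetical): reads touch only DRAM, each update modifies the in-memory table and appends a record to a sequential log file, and a checkpoint occasionally writes the in-memory image to a backup file.

```cpp
#include <cstdint>
#include <fstream>
#include <unordered_map>

struct LogRecord { uint64_t key; int64_t newValue; };

class MiniMMDB {
public:
    MiniMMDB() : log_("log.bin", std::ios::binary | std::ios::app) {}

    int64_t read(uint64_t key) { return table_[key]; }   // in-memory read only

    void update(uint64_t key, int64_t value) {
        table_[key] = value;                              // in-memory update
        LogRecord rec{key, value};
        log_.write(reinterpret_cast<const char*>(&rec), sizeof(rec));  // sequential log write
        log_.flush();                                     // force to the log disk at commit
    }

    void checkpoint() {
        // Occasional backup of the in-memory image (a real system would flush only dirty pages).
        std::ofstream backup("backup.bin", std::ios::binary | std::ios::trunc);
        for (const auto& [key, value] : table_) {
            LogRecord rec{key, value};
            backup.write(reinterpret_cast<const char*>(&rec), sizeof(rec));
        }
    }

private:
    std::unordered_map<uint64_t, int64_t> table_;  // the "primary DB" lives in DRAM
    std::ofstream log_;
};
```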

  7. MMDBMS Technical Issues
  • Logging and Recovery: critical for not losing updates
  • In-Memory Index Structures
    • Efficient L2 cache utilization?
  • Concurrency Control of Index and Data
  • DBMS Process & Application Binding Architecture
  • Query Processing
  • Memory Management
  • Availability for Mission-critical Applications

  8. Reminder on Hardware Advances
  • 1000 times over the last 15 years [Patterson]
    • CPU power, memory capacity, disk capacity
    • Not really for memory and disk speed
    • Network interface speed: 100 Mbps - Gbps (maybe 100 times)
  • VAX 11/750 in 1984 (Stanford KBMS personal workstation)
    • CPU (32-bit, cycle time: 320 ns, CPI: 10)
    • 8 MB memory (800 ns for write, 640 ns for read)
    • Hundreds of MB of disk space
    • 10 Mbps Ethernet connection
  How much speedup in DBMS performance over the same period? --- Not 1000 times!

  9. In the old days (1984), a few MB was said…
  "With the availability of very large, relatively inexpensive main memories, it is becoming possible to keep large databases resident in memory. …. At the present time, memory for VAX 11/780 costs approximately $1,500 a megabyte. In 1990, a gigabyte of memory should cost less than $200,000 (with the availability of 1Mb chips). If 4 Mbit chips are available, the price might be as low as $50,000."
  - DeWitt, Katz, Olken, Shapiro, Stonebraker, and Wood, "Implementation Techniques for Main Memory Database Systems," SIGMOD 1984

  10. And today, a few GB….
  We write the paper with essentially the same wording, only with real numbers:
  • "As the price of memory continues to drop below $1,000/GB, it is now feasible to place many of the database tables and indexes in main memory." - Kim, Cha, Kwon, SIGMOD 2001

  11. MMDBMS vs DRDBMS
  • A well-designed MMDBMS should outperform a DRDBMS significantly, not only for read-oriented applications but also for update-intensive applications!
    • With memory-optimized data structures and algorithms
    • Disk access reduced to the sequential form of log writing and occasional checkpointing

  12. Q: Is a disk database with a large buffer the same as a main memory database?
  [Diagram: disk-resident database and log, with index blocks and data blocks cached in a large buffer and records referenced by disk address]
  • No!
  • Complex mapping between disk and memory
    • E.g., traversing index blocks in the buffer requires bookkeeping of the mapping between disk and memory addresses (to swizzle or not)
  • Disk index block design is not optimized against hardware cache misses.

  13. Other Reasons to Say No!
  • Update performance advantage of main memory databases
  • Disk-resident DBMSs are limited in taking advantage of hardware advances
    • Huge code base
      • Many levels of abstraction for portability
      • Backward compatibility with old customers (long history of evolution)
    • Slow to change
    • Too many control knobs to optimize
    • Disk-primary design

  14. Two Different Paradigms: Memory-Primary vs. Disk-Primary

  15. Memory-Primary versus Disk-Primary
  [Diagram: memory hierarchy from CPU cache memory to main memory (DRAM) to disk; the memory-primary design focuses on the cache-to-DRAM boundary, the disk-primary design on the DRAM-to-disk boundary]

  16. Why Memory-Primary Design?
  • CPU speed: 60% increase per year; DRAM speed: 10% increase per year (from David Patterson et al., IEEE Micro, 1997)
  • Memory access is no longer uniform!
  • Cost of an L2 cache miss: hundreds of clock cycles, or a thousand-instruction opportunity

  17. Memory-Primary Design
  • Cache-conscious: the well-known disk access rules come back in a slightly different flavor
    • Random access is expensive:
      • Example: pointer chasing
    • Block access is more efficient:
      • Align data structures with L2 cache lines (e.g., 64 bytes). (See the sketch below.)
      • Pack the right data into the cache block.
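As an illustration of these two rules, here is a small C++ sketch (hypothetical, with an assumed 64-byte L2 cache line; the structure and field names are not from the talk): the structure is aligned to the cache line and the fields that are read together are packed so that one access costs at most one cache miss.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;   // assumed L2 cache line size

// Fields that are always accessed together share one aligned cache line.
struct alignas(kCacheLine) HotRecord {
    uint64_t key;
    uint64_t value;
    uint32_t version;
    uint32_t flags;
    uint8_t  pad[40];      // explicit padding keeps the record to exactly one line
};
static_assert(sizeof(HotRecord) == kCacheLine, "one record per cache line");
```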

  18. Cache Behavior of Commercial DBMSs (on a uniprocessor Pentium II Xeon)
  Anastassia Ailamaki et al., "DBMSs on a Modern Processor: Where Does Time Go?", VLDB 99
  • Memory-related delays: 40-80% of execution time.
  • Data accesses on caches: 19-86% of memory stalls.
  • Multiprocessor cache behavior?
    • Probably worse because of coherence cache misses

  19. Coherence Cache Miss Problem
  • Updates on shared data structures on shared-memory multiprocessor systems
    • Cache invalidation
    • Coherence cache misses
  • Problems with:
    • Frequently updated, shared internal control structures
      • Locks, transaction tables, etc.

  20. Transact In Memory Project at SNU
  Objective: build a highly parallel, highly scalable main-memory DBMS
  • Exploit the parallelism in logging, recovery, and CC
  • Exploit advances in HW computing power
  Specific technical issues published so far:
  • Parallel logging and recovery [ICDE01]
  • Cache-conscious index [SIGMOD01]
  • Cache-conscious concurrency control of index [VLDB01]

  21. Transact In Memory Project: Some Hypotheses
  • An MMDBMS should do more than just keep the database in memory
    • New architecture and new algorithms are needed; otherwise, DRDBMSs will catch up eventually.
    • Memory-primary design should boost performance significantly
  • MMDBMS is the best place to apply the self-tuning RISC DBMS concept!
    • Thanks to Surajit Chaudhuri and Gerhard Weikum for their VLDB 2000 vision paper!
  • Once we have a high-performance MMDBMS, today's multi-tier architecture may not be the best fit.
    • Database servers may become idle …
  • This is also a good place for inventing …
    • Because of a new cost model, a new premise

  22. Parallel Main Memory DBMS
  [Diagram: parallel logging to multiple log disks and parallel checkpointing to multiple backup partitions]

  23. Parallel Main Memory DBMS
  Fully parallel recovery: backup database loading and log processing, all in parallel

  24. Problems with Existing Logging Schemes
  • Constraint on log replay by the serialization order
  • Conventional wisdom:
    • Distribute log records to multiple disks
      • Pays the cost of merging and applying log records in serialization order during recovery
    • Partition the database and assign different log disks to different partitions => limited parallelism
      • Poor utilization of multiple log disks when updates are skewed toward certain partitions
      • Difficulty of handling transactions that update multiple segments

  25. Differential Logging: Definition
  Let the value of R change: p → q
  • Differential log Δ(p,q) = p ⊕ q
  • Redo: p ⊕ Δ(p,q) = q
  • Undo: q ⊕ Δ(p,q) = p
  where ⊕ denotes the bit-wise XOR [Lee, Kim, and Cha, ICDE 2001]
  (See the small sketch below.)
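A tiny C++ sketch of these definitions (an illustration under the assumption that a resource's value fits in a 64-bit word; not the paper's code): the differential log is the XOR of the before- and after-images, and the same XOR serves as both redo and undo.

```cpp
#include <cassert>
#include <cstdint>

using Word = uint64_t;   // assume the logged resource fits one 64-bit word

Word diffLog(Word p, Word q) { return p ^ q; }   // differential log of update p -> q
Word redo(Word p, Word d)    { return p ^ d; }   // p XOR d = q
Word undo(Word q, Word d)    { return q ^ d; }   // q XOR d = p

int main() {
    Word p = 0x0A, q = 0x3C;
    Word d = diffLog(p, q);
    assert(redo(p, d) == q);   // redo recovers the after-image
    assert(undo(q, d) == p);   // undo recovers the before-image
}
```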

  26. Differential Logging Example
  Value of a resource R: 0000 → 0101 → 1100
  • Log records (log sequence number, resource ID, XOR difference): L1 = 0000 ⊕ 0101 = 0101, L2 = 0101 ⊕ 1100 = 1001
  • Redo L1; Redo L2: 0000 ⊕ 0101 = 0101, then 0101 ⊕ 1001 = 1100
  • Redo L2; Redo L1: 0000 ⊕ 1001 = 1001, then 1001 ⊕ 0101 = 1100
  • Either replay order yields the same final value, 1100.

  27. Source of Parallelism
  • The XOR operation is commutative and associative
    • Serialization order is insignificant
      • Log records can be applied in arbitrary order
      • Log records can be distributed to multiple disks to improve logging and recovery performance.
    • Redo and undo can be intermixed
      • Single-pass recovery possible
    • Backup DB loading can be intermixed with log processing
      • Initialize the primary DB with 0's. Copying the backup DB into memory is then equivalent to applying XOR between the backup data and the corresponding zero-initialized memory.
  (See the replay sketch below.)
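The following C++ sketch (hypothetical, single-threaded for brevity) illustrates the parallelism argument: starting from a zero-initialized primary image, the backup copy and the differential log records can be XORed in in any order, including a shuffled one, and the recovered state is the same.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

struct DiffRecord { std::size_t slot; uint64_t diff; };   // resource ID + XOR difference

// Apply one record; the order of application does not matter because XOR is
// commutative and associative.
void apply(std::array<uint64_t, 4>& image, const DiffRecord& r) { image[r.slot] ^= r.diff; }

int main() {
    std::array<uint64_t, 4> backup{{5, 0, 7, 9}};
    std::vector<DiffRecord> logRecords = {{0, 5 ^ 6}, {2, 7 ^ 1}, {0, 6 ^ 2}, {3, 9 ^ 9}};

    // The backup itself becomes a set of XOR records against a zero-initialized image.
    std::vector<DiffRecord> everything;
    for (std::size_t i = 0; i < backup.size(); ++i) everything.push_back({i, backup[i]});
    everything.insert(everything.end(), logRecords.begin(), logRecords.end());

    std::array<uint64_t, 4> a{}, b{};                      // both primaries start as all zeros
    for (const auto& r : everything) apply(a, r);          // original order

    std::mt19937 rng(42);
    std::shuffle(everything.begin(), everything.end(), rng);
    for (const auto& r : everything) apply(b, r);          // arbitrary order

    assert(a == b);                                        // same recovered state
}
```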

  28. Cache-conscious Index Structures
  • Design focus: minimize L2 cache misses
    • Index node alignment with cache lines
    • Node size: a few multiples of the L2 cache block
  • Pointer elimination for increasing fanout → reduced height → reduced cache misses
    • CSB+-tree [Rao & Ross, SIGMOD 2000] (see the sketch below)
  • Key compression for increasing fanout
    • CR-tree [Kim et al., SIGMOD 2001], pkT/pkB-tree [Chen et al., SIGMOD 2001]
  [Diagram: a B+-tree node holding keys 23, 34 with per-key child pointers vs. a CSB+-tree node holding keys 23, 34, 47, 58 with a single reference to a contiguous child group]
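As a rough illustration of the pointer-elimination idea (a sketch in the spirit of the CSB+-tree, not the published implementation; the field names and sizes are assumptions), children are stored contiguously, so the node keeps a single child reference and spends the rest of the cache line on keys.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLine = 64;   // assumed L2 cache line size

struct alignas(kLine) CsbStyleNode {
    uint16_t numKeys;      // valid keys in this node
    uint16_t pad;
    uint32_t firstChild;   // index of the first child in a contiguous child group
    uint32_t keys[14];     // child i of this node lives at nodes[firstChild + i]
};
static_assert(sizeof(CsbStyleNode) == kLine, "one node per cache line");

// Descend one level: scan keys within a single cache line, then locate the child
// by arithmetic on the contiguous group -- no per-key child pointer needed.
inline uint32_t childFor(const CsbStyleNode& n, uint32_t key) {
    uint32_t i = 0;
    while (i < n.numKeys && key >= n.keys[i]) ++i;   // linear scan for brevity
    return n.firstChild + i;
}
```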

  29. Concurrency Control of Index
  • What if we combine these cache-conscious indexes with traditional latch-based index concurrency control schemes?
    • Lock coupling involves latching and unlatching index nodes while traversing down the tree.
    • Latching/unlatching an index node involves a memory write on the cache block containing the latch.
    • Memory writes on shared data structures on shared-memory multiprocessor systems lead to coherence cache misses. (A sketch of this write pattern follows below.)
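A hypothetical sketch of the latch-coupling write pattern described above (not real index code): every traversal acquires and releases a latch on each node it visits, and each acquire/release is a write to the node's cache line, which invalidates that line in the caches of other processors traversing the same upper-level nodes.

```cpp
#include <atomic>
#include <cstdint>

struct IndexNode {
    std::atomic<uint32_t> latch{0};   // shared: written by every traversal
    uint32_t numKeys = 0;
    uint32_t keys[6] = {};
    IndexNode* children[7] = {};      // all null in a leaf
};

inline void latchAcquire(IndexNode* n) {
    uint32_t expected = 0;
    while (!n->latch.compare_exchange_weak(expected, 1)) expected = 0;  // write: invalidates the line on other CPUs
}
inline void latchRelease(IndexNode* n) { n->latch.store(0); }           // another write

// Pick the child subtree covering `key`; nullptr when `n` is a leaf.
inline IndexNode* pickChild(const IndexNode* n, uint32_t key) {
    uint32_t i = 0;
    while (i < n->numKeys && key >= n->keys[i]) ++i;
    return n->children[i];
}

// Lock coupling: latch the child before releasing the parent, all the way down.
IndexNode* traverseWithLockCoupling(IndexNode* root, uint32_t key) {
    IndexNode* cur = root;
    latchAcquire(cur);
    for (IndexNode* next = pickChild(cur, key); next; next = pickChild(cur, key)) {
        latchAcquire(next);
        latchRelease(cur);
        cur = next;
    }
    return cur;   // leaf returned latched; caller releases when done
}
```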

  30. OLFIT Concurrency Control: Overview
  • Consistent node access protocol [Cha et al., VLDB 2001, Patent-Pending]:
  • ReadNode:
    • Just read the node optimistically, with no latch operations
    • At the end, detect a read-update conflict by checking the latch state and the version number updated by the UpdateNode operation
    • Retry on conflict
  • UpdateNode:
    • Latch and unlatch the index node before and after updating the node content
      • Prevents other node updates from interfering
    • Increment the version number of the index node before unlatching
  (A sketch of the read/update protocol follows below.)
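Below is a minimal C++ sketch of the ReadNode/UpdateNode idea, under the assumption that the latch bit and version number are packed into one atomic word (this packing and the field names are illustrative, not necessarily the published OLFIT layout). The key property is that readers never write the shared word, so read-only traversals add no coherence cache misses.

```cpp
#include <atomic>
#include <cstdint>

struct OlfitNode {
    std::atomic<uint64_t> control{0};  // bit 0 = latch, upper bits = version
    uint32_t numKeys = 0;
    uint32_t keys[12] = {};
};

// UpdateNode: latch, modify the content, bump the version, unlatch.
void updateNode(OlfitNode& n, uint32_t slot, uint32_t key) {
    uint64_t c;
    do {                                               // acquire the latch bit
        c = n.control.load();
    } while ((c & 1) || !n.control.compare_exchange_weak(c, c | 1));
    n.keys[slot] = key;                                // the actual update
    if (slot >= n.numKeys) n.numKeys = slot + 1;
    n.control.store(((c >> 1) + 1) << 1);              // version + 1, latch bit cleared
}

// ReadNode: optimistic read, validated by latch state and version; retry on conflict.
uint32_t readNode(const OlfitNode& n, uint32_t slot) {
    for (;;) {
        uint64_t before = n.control.load();
        if (before & 1) continue;                      // latched: an update is in progress
        uint32_t value = n.keys[slot];                 // optimistic read of the content
        uint64_t after = n.control.load();
        if (after == before) return value;             // no concurrent update observed
        // else: latch/version changed during the read -> retry
    }
}
// Note: a production version needs memory fences / atomic payload access;
// this sketch keeps the content reads plain for brevity.
```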

  31. P*TIME: Highly Parallel Transact In Memory Engine
  [Architecture diagram: multiple CPUs, each with its own cache, sharing the in-memory primary DB under cache-conscious index concurrency control; parallel logging to multiple log disks, parallel checkpointing to backup DB partitions, and parallel recovery]

  32. P*TIME: Highly Parallel Transact In Memory Engine
  • Lightweight, multithreaded architecture
    • Scalable performance on multiprocessor platforms
  • Implemented mostly in C++
    • Small footprint (<10 MB)
  • Industry-standard APIs: SQL, JDBC (with bindings to Java and C++), ODBC
    • Support of DB-embedded application architecture
  • Multi-level concurrency control
  • Data replication for high availability
  • Simple performance model
    • Ease of application tuning and optimization

  33. P*TIME-Embedded Telco Directory Server
  15M-customer database: 1 table (~1 GB), 3 hash indexes (~1.3 GB)
  • Backup DB loading time: 16 sec (65 MB/sec)
  • Log processing rate: 30-40 MB/sec
  • Parallel index building time: 19 sec
  • Initial DB loading from text file: 6 min

  34. Summary
  • The memory-primary architecture
    • Utilizes the hardware resources more efficiently
    • Still many things to do
  • P*TIME: a second-generation MMDBMS
    • Memory-primary architecture with parallelism exploited
    • Other optimizations in implementation
  • Numerous applications in telecom and the Internet
    • Directory servers
    • Update-intensive applications: batch → on-line
