CS411 Database Systems

CS411Database Systems 08: Storage & Representation

The Big Picture-- DBMS Architecture User/Web Forms/Applications/DBA query transaction DDL commands Query Parser Transaction Manager DDL Processor Query Rewriter Concurrency Control Logging & Recovery Query Optimizer Query Executor Records Indexes Lock Tables Buffer: data, indexes, log, etc Buffer Manager Main Memory Storage Manager Storage data, metadata, indexes, log, etc

Data Storage • Storing data: disks • Buffer manager • Representing data • External sorting

Lies, damn lies, and textbooks 2. Too old • Uses 2001 statistics that are now far out of date • Describes ancient technology (floppies, optical disks) 1. Describes many alternatives, without saying what people actually use • Written as though the DBMS were using disks attached to the computer where it runs • Written as though the DBMS were using raw disk and people get to pick cylinders Overall: interesting systems perspective, with many useful ideas, but not a depiction of what actually happens in practice today

Warning: all speed and size statistics in these slides are out of date. What really matters, though, is not the absolute numbers, but their relative differences. DisksBuffer Manager

The memory hierarchy

Everything is getting faster, but…

The relative gaps in performance are increasing. Each level is now thousands of times faster than the level below it. If and when memory becomes cheaper than disk, by then “memory will be the new disk” because it’ll be so much slower than the processor and L2 cache. 1 nsec/ access 3 msec/ seek 3 sec just to load a tape

From www.gear6.com “A widening gap has emerged between the exponentially increasing speed of servers and the inherent limitations of disk-based storage systems. This […] represents a growing bottleneck that severely limits overall data center performance. “As disk technology moves forward with greater density on the same conventional mechanical platforms, the problem only gets worse. As a result, IT administrators and managers typically experience the following troublesome scenarios: • “Inability to sustain operations through peak loads placing business at risk • “Valuable IT resources deployed simply to sustain acceptable performance levels • “Continuous troubleshooting and maintenance causing business interruptions • “Premature infrastructure upgrades leading to excessive capital spending”

The Memory Hierarchy Main Memory = Disk Cache Processor Cache: • access time 10 nanosec. • Volatile • 256M-1G • expensive • Access time: 10-100 nanoseconds Disk Tape • Persistent • 2-10 GB storage • speed: • Rate=5-10 MB/S • Access time= 10-15 msecs. • 1.5 MB/S transfer rate • 280 GB typical capacity • Only sequential access • Not for operational data

Processor speed trends, from Dell

From www.globalsecurity.org

CPU clock rates, from www.pcpitstop.com

Latency = Queuing Time + Controller time + Seek Time + Rotation Time + Size / Bandwidth { per access + per byte Disk Performance Trends (D. Patterson, 2000) • Capacity • + 60%/year (2X / 1.5 yrs) • Transfer rate (BW) • + 40%/year (2X / 2.0 yrs) • Rotation + Seek time • – 8%/ year (1/2 in 10 yrs) • MB/$ • > 60%/year (2X / <1.5 yrs) • Fewer chips + areal density source: Ed Grochowski, 1996, “IBM leadership in disk drive technology”; www.storage.ibm.com/storage/technolo/grochows/grocho01.htm,

Predictions from dba.oracle.com 1990s — Large shops have hundreds of small UNIX-based computers for their Oracle databases. 2000s — Monolithic servers reappear, and Oracle shops undertake a massive server consolidation. By 2008, servers with 256 processors run hundreds of Oracle instances. 2010s — Disk becomes obsolete, and all Oracle database are solid-state. Hardware costs fall so much that 70 percent of the IT budget is spent on programmers and DBAs. Energy is another 25%.

Typical IT budgets, from www.infoedge.com (left) and www.sterlinghoffman.com (right, 4Q04) Data center costs: 25% is for energy

Where do data centers go? See http://www.invest.is/Key-Sectors/Data-Centers-in-Iceland/ What about outsourcing? See /www.sourcingmag.com/content/c060424a.asp

From dba.oracle.com Apparent speedup (due to memory buffers in disk box) “Platters can only spin so fast without becoming aerodynamic, and the disk vendors were hard-pressed to keep their technology improving in speed. Their solution was to add a RAM front-end to their disk arrays and sophisticated, asynchronous read-write software to provide the illusion of faster hardware performance.”

Main Memory • Fastest, most expensive • Today: 256MB are common even on PCs • Many databases could fit in memory • Recent industry niche: Main Memory Databases (e.g., sold byTimesTen) • Main issue is volatility: lose the data on a crash • To help with that problem, big DBMS servers have some flash memory (nonvolatile)

Secondary Storage • Disks • Slower, cheaper than main memory • Persistent !!! • The unit of disk I/O = block • 1 block is typically 4KB • Used with a main memory buffer

Not something you’ll get to mess with Tracks Arm movement Arm assembly The Mechanics of Disks Mechanical characteristics: • Rotation speed (5400RPM) • Number of platters (1-30) • Number of tracks (<=10000) • Number of bytes/track(105) Cylinder Spindle Disk head Sector A high-performance disk today might have only 2-3 platters Platters

D. Patterson, 2000Chips / 3.5 inch Disk: 1993 v. 1994 15 vs. 12 chips; 2 chips (head, misc) in 200x?

If the picture below isn’t how it really is for big DBs, then how is it? CPU with a directly attached disk (or disks, if we get fancy)

DBMS server is separate from storage server, which is just as smart as the DBMS • Their own fast CPUs, enormous memories • A whole industry • Dominant players: EMC, IBM, Network Appliance • Separate from the DB industry (except IBM) From http://www.brocade.com Nice photos of little storage servers at http://www.unibrain.com/Products/Storage/FireNAS2U.htm SAN = Storage Area Network (that’s right, the storage gets its own network)

http://www.commandsoft.com

The DB server asks the storage server for the blocks it wants Server Storage Area Network Memory bus CPU (FC-AL) Internal I/O bus Memory RAID bus (PCI) Mem From David Patterson, 2000 External I/O bus Disk Array (SCSI) Storage server does its own prefetching, caching, buffering, … (15 disks/bus)

Important Disk Access Characteristics • Disk latency= time between when command is issued and when data is in memory • Disk latency = seek time + rotational latency + transfer time • Seek time = time for the head to reach cylinder • 10ms – 40ms • Rotational latency = time for the sector to rotate • Rotation time = 10ms • Average latency = 10ms/2 • Transfer time = typically 5-10MB/s • Disks read/write one block at a time (typically 4kB)

RAIDs • = “Redundant Array of Independent Disks” • Was “inexpensive” disks • Idea: use more disks, increase reliability • Recall: • Database recovery helps after a system crash, not after a disk crash • 6 ways to use RAIDs. More important: • Level 4: use N-1 data disks, plus one parity disk • Level 5: same, but alternate which disk is the parity • Level 6: use Hamming codes instead of parity

RAID = Redundant Array of Independent Disks (image from www.bjorn3d.com/read_pf.php?cID=777) The most common RAID levels are RAID 0 (striping), RAID 1 (mirroring) and RAID 5 (striping with parity - see above image).

DB Buffer Management in a DBMS • Data must be in memory for DBMS to operate on it! • Keep a table of <frame#, DBpageID> pairs, hashed on DBpageID, to quickly see if a page is in memory DB Page Requests from Higher Levels BUFFER POOL disk block(s) free frame MAIN MEMORY DISK choice of frame dictated by replacement policy

Buffer Manager • When a page is first requested, it is read into a free frame • The DBMS typically requests it to be pinned: • pin_count = how many processes requested it pinned • When the DBMS writes to it, the buffer manager marks it dirty • When the DBMS doesn’t need it any more, unpinned: • pin_count is decremented • Typical replacement policies (always choose among unpinnedframes): • LRU • Clock • MRU

Buffer Manager Why not use the operating system to manage memory? - DBMS may be able to anticipate access patterns - Hence, may also be able to perform prefetching - DBMS needs the ability to force pages to disk.

Representing Data

How to lay out a tuple (= record) CREATE TABLE Product ( pid INT PRIMARY KEY, name CHAR(20), wholesale BIT, description VARCHAR(200); pid 4 B name 21 B wholesale 1 bit description 200 B First guess

How to lay out a tuple (= record) CREATE TABLE Product ( pid INT PRIMARY KEY, name CHAR(20), wholesale BIT, description VARCHAR(200); because it is too slow to parse things that don’t align with word boundaries pid 4 B name 21 B wholesale 1 bit description 200 B empty space Second guess

How to lay out a tuple (= record) CREATE TABLE Product ( pid INT PRIMARY KEY, name CHAR(20), wholesale BIT, description VARCHAR(200); because it is too slow to parse things that don’t align with word boundaries pid 4 B name 21 B wholesale 1 bit description 200 B Second guess and some empty space here too possibly even some empty space here too

How to lay out a tuple (= record) CREATE TABLE Product ( pid INT PRIMARY KEY, name CHAR(20), wholesale BIT, description VARCHAR(200); The old way wasted too much space pid 4 B name 21 B wholesale 1 bit description 200 B actual length + 2 B Third guess Even this isn’t quite right. To see why, let’s look at page layouts.

How to lay out a DB page (= block) DB page/block = multiple of disk block size In practice, 8 KB or more page First attempt

How to lay out a DB page (= block) DB page/block = multiple of disk block size In practice, 8 KB or more page tuple/record tuple/record tuple/record tuple/record free space First attempt

How to lay out a DB page (= block) DB page/block = multiple of disk block size In practice, 8 KB or more page pid tuple/record name wholesale description tuple/record pid name wholesale description pid tuple/record name wholesale description tuple/record pid name wholesale description How to find where the 3rdtuple starts, without parsing the whole page?? free space First attempt (with detail)

How to lay out a DB page (= block) DB page/block = multiple of disk block size = 8 KB+ Need a tuple? Fetch its entire page into memory. page tuple/record What if, what if, what if? What if one tuple is so big it won’t fit on a single page? (no) free space First attempt (with detail) What if a tuple has multimedia, e.g., mp3?

The truth about how to lay out a DB page PostgreSQL page page header (20 B) one (offset, length) pair for each record on the page(4 B each) free space tuple/record tuple/record page header (20 B) LSN TimelineID Free Start Free End SpecialInfo Start Refer to a tuple as (page#, i) for its entire lifetime, even though the DBMS rearranges page contents

The truth about how to lay out a DB page PostgreSQL page page header (20 B) (offset1, length1) (offset2, length2) free space tuple/record tuple/record page header (20 B) LSN TimelineID Free Start Free End SpecialInfo Start Refer to a tuple as (page#, i) for its entire lifetime, even though the DBMS rearranges page contents

Why rearrange a DB page? PostgreSQL page page header (20 B) (offset1, length) (offset2, length) free space tuple/record tuple/record updatedtuple/record In most DBMSs, all the tuples on a page will be from the same relation.

Eventually the free space may be so fragmented that you’ll need to defragment PostgreSQL page page header (offset, length) pairs free space Tuple 2 on this page Tuple 6 on this page Tuple 3 Tuple 1 on this page Tuple 4 on this page In practice, that doesn’t happen very often, because most applications tend to get more and more data.

Announcements • Treats • Histograms of grades from midterm • Does attending lecture in person make a difference in your grade? Fill out our survey. • Fill out your departmental surveys too.

Educational storage jokes • How disks work: http://www.hitachigst.com/hdd/research/recording_head/pr/PerpendicularAnimation.html • Data recovery: http://www.symantec.com/backupexec/hal5/ • Mr. T and network virtualization: http://www.youtube.com/watch?v=tW1S2tsxVHg When a storage startup dummy CEO was asked his age, he replied, “Storage.”

What if a tuple no longer fits on the page? page page header (offset1, length1), (offset2, length2), (offset3, length3), (offset4, length4) tuple 4 free space tuple 3 tuple 2 tuple/record updatedtuple 1

CS411 Database Systems