
High Speed Sequential IO on Windows NT™ 4.0 (sp3)



  1. High Speed Sequential IO on Windows NT™ 4.0 (sp3) Erik Riedel (of CMU), Catharine van Ingen, Jim Gray http://Research.Microsoft.com/BARC/Sequential_IO/

  2. Outline • Intro/Overview • Disk background, technology trends • Measurements of Sequential IO • Single disk (temp, buffered, unbuffered, deep) • Multiple disks and busses • RAID • Pitfalls • Summary

  3. We Got a Lot of Help • Brad Waters, Wael Bahaa-El-Din, and Maurice Franklin: shared experience, results, tools, and a hardware lab; helped us understand NT; feedback on our preliminary measurements • Tom Barclay: iostress benchmark program • Barry Nolte & Mike Parkes: allocation issues • Doug Treuting, Steve Mattos + Adaptec: SCSI and Adaptec device drivers • Bill Courtright, Stan Skelton, Richard Vanderbilt, Mark Regester: loaned us a Symbios Logic array, host adapters, and their expertise • Will Dahli: helped us understand NT configuration and measurement • Joe Barrera, Don Slutz & Felipe Cabrera: valuable comments, feedback, and help in understanding NTFS internals • David Solomon: Inside Windows NT, 2nd edition draft

  4. The Actors • Measured & modeled sequential IO • Where are the bottlenecks? • How does it scale with SMP, RAID, new interconnects? • Goals: balanced bottlenecks, low overhead, scale to many processors (10s), scale to many disks (100s) [Diagram: app address space and file cache in memory, memory bus, PCI, host-bus adapter, SCSI, controller, disk]

  5. PAP (Peak Advertised Performance) vs RAP (Real Application Performance) • Goal: RAP = PAP / 2 (the half-power point) [Diagram: the data path and its peak rates: system bus 422 MBps, PCI 133 MBps, SCSI 40 MBps, disk media 10-15 MBps; yet the application sees 7.2 MB/s at every stage]

  6. Outline • Intro/Overview • Disk background, technology trends • Measurements of Sequential IO • Single disk (temp, buffered, unbuffered, deep) • Multiple disks and busses • RAID • Pitfalls • Summary

  7. Two Basic Shapes • Circle (disk) • storage frequently returns to same spot • so less total surface area • Line (tape) • Lots more area, • Longer time to get to the data. • Key idea: multiplex expensive read/write head over large storage area: trade $/GB for access/second

  8. Disk Terms • Disks are called platters • Data is recorded on tracks (circles) on the disk. • Tracks are formatted into fixed-sized sectors. • A pair of Read/Write heads for each platter • Mounted on a disk arm • Client addresses logical blocks (cylinder, head, sector) • Bad blocks are remapped to spare good blocks.

  9. Disk Access Time • Access time = SeekTime 6 ms + RotateTime 3 ms + ReadTime 1 ms • Rotate time: • 5,000 to 10,000 rpm • ~ 12 to 6 milliseconds per rotation • ~ 6 to 3 ms rotational latency • Improved 3x in 20 years
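Plugging the slide's own round numbers into the formula gives the budget for one random 8 KB request (assuming the ~8 MB/s media rate quoted later):

    Access time = 6 ms (seek) + 3 ms (rotate) + 1 ms (read 8 KB at ~8 MB/s) = 10 ms

i.e. roughly 100 random IOs per second per disk.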

  10. Disk Seek Time • Seek time is ~ Sqrt(distance) (distance = 1/2 × acceleration × time²) • Specs assume seek is 1/3 of the disk • Short seeks are common (over 50% are zero length) • Typical 1/3 seek time: 8 ms • 4x improvement in 20 years [Figure: head speed vs time during a seek: full acceleration, then deceleration to a full stop]
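The parenthetical is the whole derivation: under constant acceleration a, distance = 1/2 × a × time², so

    time = Sqrt(2 × distance / a)

Quadrupling the seek distance therefore only doubles the seek time, which is why short seeks are so cheap.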

  11. Read/Write Time: Density • Time = Size / BytesPerSecond • Bytes/Second = Speed × Density • 5 to 15 MBps today • MAD (Magnetic Areal Density): 3 Gbits/inch² today, 5 Gbpsi in the lab • Rising > 60%/year • Paramagnetic limit: 10 Gb/inch² • Linear density rises ~ Sqrt(10)x per decade [Chart: MAD (Mbpsi) on a log scale from 1 to 10,000, 1970-2000, tracking Hoagland's Law]

  12. Read/Write Time: Rotational Speed • Bytes/Second = Speed × Density • Linear speed is greater at the edge of the platter • Rotational speed: 3,600 -> 10,000 rpm • 5%/year improvement • Bit rate varies by ~1.5x across the surface today [Figure: area grows as r²: πr² = 1 at r = 1, πr² = 4 at r = 2 (relative units)]

  13. Read/Write Time: Zones • Disks are sectored • typical: 512 bytes/sector • Sector is the read/write unit • Failfast: can detect bad sectors • Disks are zoned • outer zones have more sectors (e.g. 14 sectors/track in an outer zone vs 8 in an inner zone) • Bytes/second is higher in outer zones

  14. Disk Access Time • Access time = SeekTime 6 ms (5%/y) + RotateTime 3 ms (5%/y) + ReadTime 1 ms (25%/y) • Other useful facts: • Power rises more than size³ (so small is indeed beautiful) • Small devices are more rugged • Small devices can use plastics (forces are much smaller), e.g. bugs fall without breaking anything

  15. The Access Time Myth • The Myth: seek or pick time dominates • The Reality: (1) queuing dominates, (2) transfer dominates for BLOBs, (3) disk seeks are often short • Implication: many cheap servers are better than one fast expensive server • shorter queues • parallel transfer • lower cost/access and cost/byte • This is now obvious for disk arrays • This will be obvious for tape arrays [Figure: two pie charts of access time: the myth (seek dominates) vs the reality (wait and transfer dominate)]

  16. Storage Ratios Changed • 10x better access time • 10x more bandwidth • 4,000x lower media price • DRAM/disk media price ratio changed: • 1970-1990: 100:1 • 1990-1995: 10:1 • 1995-1997: 50:1 • Today: ~ $0.2/MB disk, $10/MB DRAM

  17. Year 2002 Disks • Big disk (10 $/GB) • 3” • 100 GB • 150 kaps (k accesses per second) • 20 MBps sequential • Small disk (20 $/GB) • 3” • 4 GB • 100 kaps • 10 MBps sequential • Both running Windows NT™ 7.0? (see below for why)

  18. Tape & Optical: Beware of the Media Myth • Optical is cheap: 200 $/platter, 3 GB/platter => 70 $/GB (cheaper than disc) • Tape is cheap: 30 $/tape, 20 GB/tape => 1.5 $/GB (100x cheaper than disc)

  19. The Media Myth • Tape needs a robot (10 k$ ... 3 m$) • 10 ... 1000 tapes (at 20 GB each) => 10 $/GB ... 150 $/GB (1x ... 10x cheaper than disc) • Optical needs a robot (100 k$) • 100 platters = 200 GB (TODAY) => 400 $/GB (more expensive than mag disc) • Robots have poor access times • Not good for Library of Congress (25 TB) • Data motel: data checks in but it never checks out!

  20. Crazy Disk Ideas • Disk farm on a card: surface-mount disks • Disk (magnetic store) on a chip: micro machines in silicon • NT and BackOffice in the disk controller (a processor with 100 MB DRAM and an ASIC)

  21. The Disk Farm On a Card • The 100 GB disc card: a 14” array of discs • Can be used as: 100 discs, 1 striped disc, 10 fault-tolerant discs, ... etc. • LOTS of accesses/second and bandwidth • Life is cheap, it’s the accessories that cost ya • Processors are cheap, it’s the peripherals that cost ya (a 10 k$ disc card)

  22. Functionally Specialized Cards • Storage • Network • Display • Each card: a P mips processor + M MB DRAM + an ASIC • Today: P = 50 mips, M = 2 MB • In a few years: P = 200 mips, M = 64 MB

  23. It’s Already True of Printers: Peripheral = CyberBrick • You buy a printer • You get: • several network interfaces • a Postscript engine • cpu • memory • software • a spooler (soon) • and ... a print engine.

  24. All Device Controllers will be Cray 1’s • TODAY • Disk controller is a 10 mips risc engine with 2 MB DRAM • NIC is similar power • SOON • Will become 100 mips systems with 100 MB DRAM • They are nodes in a federation (can run Oracle on NT in the disk controller) • Advantages • Uniform programming model • Great tools • Security • Economics (cyberbricks) • Move computation to data (minimize traffic) [Diagram: central processor & memory linked to smart devices over a terabyte backplane]

  25. System On A Chip • Integrate processing with memory on one chip • chip is 75% memory now • 1 MB cache >> 1960 supercomputers • 256 Mb memory chip is 32 MB! • IRAM, CRAM, PIM, ... projects abound • Integrate networking with processing on one chip • system bus is a kind of network • ATM, FiberChannel, Ethernet, ... logic on chip • Direct IO (no intermediate bus) • Functionally specialized cards shrink to a chip

  26. Tera Byte Backplane • With terabyte interconnect and supercomputer adapters: • Processing is incidental to networking, storage, UI • Disk controller/NIC is faster than the device and close to the device • Can borrow the device package & power • So use idle capacity for computation • Run the app in the device

  27. Implications • Conventional: offload device handling to the NIC/HBA • higher-level protocols: I2O, NASD, VIA ... • SMP and cluster parallelism is important • Radical: move the app to the NIC/device controller • higher-higher-level protocols: CORBA / DCOM • cluster parallelism is VERY important [Diagram: central processor & memory and smart devices on a terabyte backplane]

  28. How Do They Talk to Each Other? • Each node has an OS • Each node has local resources: a federation • Each node does not completely trust the others • Nodes use RPC to talk to each other: CORBA? DCOM? IIOP? RMI? One or all of the above • Huge leverage in high-level interfaces • Same old distributed system story [Diagram: two application stacks exchanging RPC, streams, and datagrams over VIAL/VIPL and the wire(s)]

  29. Will He Ever Get to The Point? • I thought this was about NTFS sequential IO. • Why is he telling me all this other crap? • It is relevant background.

  30. Outline • Intro/Overview • Disk background, technology trends • Measurements of Sequential IO • Single disk (temp, buffered, unbuffered, deep) • Multiple disks and busses • RAID • Pitfalls • Summary

  31. The Actors • Processor-memory bus • Memory holds the file cache and app data • Application reads and writes memory • The disk: writes, stores, reads data • The disk controller: manages the drive (error handling), reads & writes the drive, converts SCSI commands to disk actions, may buffer or do RAID • The SCSI bus: carries bytes • The host-bus adapter: protocol converter to the system bus, may do RAID [Diagram: app address space and file cache in memory, memory bus, PCI, adapter, SCSI, controller, disk]

  32. Sequential vs Random IO • Random IO is typically small IO (8 KB) • seek + rotate + transfer is ~ 10 ms • 100 IOs per second • 800 KB per second • Sequential IO is typically large IO • almost no seek (one per cylinder read/written) • no rotational delay (reading the whole disk track) • runs at MEDIA speed: 8 MB per second • Sequential is 10x more bandwidth than random!
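The arithmetic behind the 10x claim, using the slide's own numbers:

    Random:     8 KB per 10 ms  =  100 IOs/s × 8 KB  =  800 KB/s
    Sequential: media rate ~ 8 MB/s  =  8,000 KB/s  =  10 × random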

  33. Basic File Concepts • Buffered: • file reads/writes go to the file cache • file system does pre-fetch, post-write, aggregation • data is written to disk at file close, or by LRU / lazy write • Unbuffered: bypasses the file cache • Overlapped: • requests are pipelined • completions via events or completion ports • a simpler alternative to multi-threaded IO • Temporary files: written to cache, not flushed on close • (A sketch of how these map onto CreateFile flags follows.)
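A minimal sketch of how these modes map onto CreateFile flags (the file names are illustrative; error handling is omitted):

    #include <windows.h>

    int main()
    {
        // Buffered, sequential: the default path; NTFS reads ahead / writes behind.
        HANDLE hBuffered = CreateFile("C:\\data.dat", GENERIC_READ,
            FILE_SHARE_READ, NULL, OPEN_EXISTING,
            FILE_FLAG_SEQUENTIAL_SCAN, NULL);

        // Un-buffered: bypasses the file cache; buffers, offsets, and
        // request sizes must be sector aligned.
        HANDLE hUnbuffered = CreateFile("C:\\data.dat", GENERIC_READ,
            FILE_SHARE_READ, NULL, OPEN_EXISTING,
            FILE_FLAG_NO_BUFFERING, NULL);

        // Overlapped: requests can be pipelined; completions arrive via
        // events or completion ports.
        HANDLE hOverlapped = CreateFile("C:\\data.dat", GENERIC_READ,
            FILE_SHARE_READ, NULL, OPEN_EXISTING,
            FILE_FLAG_OVERLAPPED, NULL);

        // Temporary: hints the file can live in the cache and need not be
        // flushed; DELETE_ON_CLOSE discards it entirely at close.
        HANDLE hTemp = CreateFile("C:\\scratch.tmp",
            GENERIC_READ | GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
            FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE, NULL);

        CloseHandle(hBuffered); CloseHandle(hUnbuffered);
        CloseHandle(hOverlapped); CloseHandle(hTemp);
        return 0;
    }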

  34. Experiment Background • Used Intel/Gateway 2000 G6 200 MHz Pentium Pro • 64 MB DRAM (4x interleave) • 32-bit PCI • Adaptec 2940 Fast-Wide (20 MBps) and Ultra-Wide (40 MBps) controllers • Seagate 4 GB SCSI disks (fast and ultra) • (7200 rpm, 7-15 MBps “internal”) • NT 4.0 SP3, NTFS • i.e.: modest 1997 technology • Not multi-processor, not DEC Alpha, some RAID

  35. Simplest Possible Code

    #include <stdio.h>
    #include <windows.h>

    int main()
    {
        const int iREQUEST_SIZE = 65536;
        char cRequest[iREQUEST_SIZE];
        unsigned long ibytes;
        HANDLE hFile = CreateFile("C:\\input.dat",      // name
                        GENERIC_READ,                   // desired access
                        0, NULL,                        // share & security
                        OPEN_EXISTING,                  // pre-existing file
                        FILE_ATTRIBUTE_TEMPORARY |
                        FILE_FLAG_SEQUENTIAL_SCAN,      // hint sequential access
                        NULL);                          // file template
        while (ReadFile(hFile, cRequest, iREQUEST_SIZE, &ibytes, NULL)) // do read
        {
            if (ibytes == 0) break;     // break on end of file
            /* do something with the data */
        }
        CloseHandle(hFile);
        return 0;
    }

• Error checking adds some more, but still, it’s easy

  36. The Best Case: Temp File, NO IO • Temp file reads/writes go to the file system cache • Program uses a small (in-cpu-cache) buffer • So write/read time is the bus move time (3x better than copy) • Paradox: the fastest way to move data is to write it, then read it • This hardware is limited to 150 MBps per processor

  37. Out of the Box Disk File Performance • One NTFS disk • Buffered read • NTFS does 64 KB read-ahead • if you ask for FILE_FLAG_SEQUENTIAL_SCAN • or if it thinks you are sequential • NTFS does 64 KB write-behind • under the same conditions • aggregates many small IOs into few big IOs

  38. Synchronous Buffered Read/Write • Read throughput is GREAT! • Write throughput is 40% of read • WCE is fast but dangerous • Net: default out-of-the-box performance is good • 20 ms/MB ~ 2 instructions/byte! CPU will saturate at 50 MBps

  39. Write Multiples of Cluster Size • For IOs less than 4 KB, if OVERWRITING data, the file system reads the 4 KB page, then overwrites the bytes, then writes the bytes • Cuts throughput by 2x - 3x • So, write in multiples of the cluster size • 2 KB writes are 5x slower than reads, and 2x or 3x slower than 4 KB writes • (A sketch of rounding requests to the cluster size follows.) [Chart: out-of-the-box throughput (MB/s, 0-10) vs request size (2-192 KB) for read, write, and write+WCE]
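One way to honor the rule; GetDiskFreeSpace reports the volume's cluster geometry, and the rounding helper here is our own illustration:

    #include <windows.h>

    // Round a request size up to a whole number of NTFS clusters,
    // so small overwrites avoid the read-modify-write penalty.
    DWORD RoundToCluster(const char *szRoot, DWORD cbRequest)
    {
        DWORD dwSectorsPerCluster, dwBytesPerSector, dwFree, dwTotal;
        if (!GetDiskFreeSpace(szRoot, &dwSectorsPerCluster, &dwBytesPerSector,
                              &dwFree, &dwTotal))
            return cbRequest;                       // fall back unchanged
        DWORD cbCluster = dwSectorsPerCluster * dwBytesPerSector;
        return ((cbRequest + cbCluster - 1) / cbCluster) * cbCluster;
    }

For example, RoundToCluster("C:\\", 2048) returns 4096 on a volume with 4 KB clusters, turning the 5x-slow 2 KB write into a full-cluster write.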

  40. What is WCE? • Write Cache Enable lets the disk controller respond “yes” before data is on the disk • Dangerous: if power fails, WCE can destroy data integrity • Most RAID controllers have non-volatile RAM, which makes WCE safe (invisible) if they do RESET right • About 50% of the disks we see have WCE on; you can turn it off with 3rd-party SCSI utilities • As seen later: 3-deep request buffering gets similar performance

  41. Synchronous Un-Buffered Read/Write • Reads do well above 2 KB • Writes are terrible • WCE helps writes • Ultra media is 1.5x faster • 1/2 power point: • Read: 4 KB • Write: 64 KB (no WCE), 4 KB (with WCE)

  42. Cost of Un-Buffered IO • Saves the buffer memory copy: was 20 ms/MB, now 2 ms/MB • Cost/request ~ 120 µs (wow) • Note: unbuffered IO must be sector aligned • Buffered: saturates CPU at 50 MB/s • Un-buffered: saturates CPU at 500 MB/s
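A sketch of sector-aligned un-buffered reading; VirtualAlloc returns page-aligned memory, which satisfies the sector-alignment rule (file name illustrative, error handling omitted):

    #include <windows.h>

    int main()
    {
        const DWORD cbREQUEST = 65536;   // a multiple of the sector size
        // VirtualAlloc returns page-aligned memory, so the buffer meets
        // FILE_FLAG_NO_BUFFERING's sector-alignment requirement.
        char *pBuffer = (char *)VirtualAlloc(NULL, cbREQUEST,
                                             MEM_COMMIT, PAGE_READWRITE);
        HANDLE hFile = CreateFile("C:\\input.dat", GENERIC_READ,
            FILE_SHARE_READ, NULL, OPEN_EXISTING,
            FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN, NULL);
        DWORD cbRead;
        while (ReadFile(hFile, pBuffer, cbREQUEST, &cbRead, NULL) && cbRead > 0)
        {
            /* data arrives without the file-cache memory copy */
        }
        CloseHandle(hFile);
        VirtualFree(pBuffer, 0, MEM_RELEASE);
        return 0;
    }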

  43. Summary • Out of the box: • Read RAP ~ PAP (thanks NTFS) • Write RAP ~ PAP/10 ... PAP/2 • Buffering small IO is great! • Buffering large IO is expensive • WCE is a dangerous way out, but frequently used • Parallelism tricks: • deep requests (async, overlap) • striping (raid0, raid5) • allocation and other tricks [Charts: out-of-the-box throughput, WCE out-of-box throughput, and out-of-the-box overhead; throughput (MB/s, 0-10) vs request size (2-192 KB) for buffered and un-buffered reads & writes]

  44. Bottleneck Analysis • Drawn to linear scale: • Theoretical bus bandwidth: 422 MBps = 66 MHz x 64 bits • Memory read/write: ~150 MBps • MemCopy: ~50 MBps • Disk R/W: ~9 MBps

  45. Outline • Intro/Overview • Disk background, technology trends • Measurements of Sequential IO • Single disk (temp, buffered, unbuffered, deep) • Multiple disks and busses • RAID • Pitfalls • Summary

  46. Kinds of Parallel Execution • Pipeline: any sequential step feeds the next sequential step • Partition: outputs split N ways, inputs merge M ways; each partition runs its own sequential steps [Diagram: a pipeline of sequential steps, and partitioned sequential steps with split outputs and merged inputs]

  47. Pipeline Requests to One Disk • Does not help reads much: they were already pipelined by the disk controller • Pipelined (async, overlap) IO is a BIG win (RAP ~ 85% PAP) • Helps writes a LOT • Above 16 KB, 3-deep matches WCE • (A sketch of 3-deep overlapped reads follows.)
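A minimal sketch of the 3-deep pipeline using overlapped reads (illustrative file name; error handling trimmed to the end-of-file check):

    #include <windows.h>

    #define DEPTH        3          // requests kept outstanding at the disk
    #define REQUEST_SIZE 65536

    int main()
    {
        HANDLE hFile = CreateFile("C:\\input.dat", GENERIC_READ,
            FILE_SHARE_READ, NULL, OPEN_EXISTING,
            FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);
        char       *apBuf[DEPTH];
        OVERLAPPED  aov[DEPTH] = {0};
        LONGLONG    ibOffset = 0;
        int         i;

        // Prime the pipeline: issue DEPTH reads before consuming any.
        for (i = 0; i < DEPTH; i++)
        {
            apBuf[i] = (char *)VirtualAlloc(NULL, REQUEST_SIZE,
                                            MEM_COMMIT, PAGE_READWRITE);
            aov[i].hEvent     = CreateEvent(NULL, TRUE, FALSE, NULL);
            aov[i].Offset     = (DWORD)ibOffset;
            aov[i].OffsetHigh = (DWORD)(ibOffset >> 32);
            ReadFile(hFile, apBuf[i], REQUEST_SIZE, NULL, &aov[i]);
            ibOffset += REQUEST_SIZE;
        }
        // Retire the oldest request, then immediately reissue it, so the
        // disk always has DEPTH requests queued.
        for (i = 0; ; i = (i + 1) % DEPTH)
        {
            DWORD cbRead;
            if (!GetOverlappedResult(hFile, &aov[i], &cbRead, TRUE) || cbRead == 0)
                break;                  // end of file (or error)
            /* consume apBuf[i] here */
            aov[i].Offset     = (DWORD)ibOffset;
            aov[i].OffsetHigh = (DWORD)(ibOffset >> 32);
            ReadFile(hFile, apBuf[i], REQUEST_SIZE, NULL, &aov[i]);
            ibOffset += REQUEST_SIZE;
        }
        CloseHandle(hFile);
        return 0;
    }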

  48. Parallel Access To Data? • At 10 MB/s it takes 1.2 days to scan 1 Terabyte • 1,000x parallel: a 100-second SCAN (10 GB/s of bandwidth) • Parallelism: divide a big problem into many smaller ones to be solved in parallel
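The scan arithmetic: 1 TB / 10 MB/s = 100,000 seconds ≈ 1.2 days; with 1,000 disks read in parallel the aggregate rate is 10 GB/s, so the same scan takes 1 TB / 10 GB/s = 100 seconds.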

  49. Pipeline Access: Stripe Across 4 Disks • Stripes NEED pipelining • 3-deep is good enough • Saturate at 15 MBps • 8-deep pipeline matches WCE

  50. 3 Stripes and You’re Out! • 3 disks can saturate the adapter • Similar story with UltraWide • CPU time goes down with request size • Ftdisk (striping) is cheap
