
Scaleability

This talk explores the concept of scalability in computing, discussing the growth potential of SMP (Symmetric Multi-Processing) systems and cluster computing. It covers the increasing number of clients and the need for millions of servers to support them. The talk also touches on the advancements in hardware and software that have contributed to the development of scalable systems.


Presentation Transcript


  1. Scaleability Jim Gray, Gray@Microsoft.com (with help from Gordon Bell, George Spix, and Catharine van Ingen) http://research.Microsoft.com/~Gray/Talks/

  2. Scaleability: Scale Up and Scale Out
  • Grow Up with SMP: 4xP6 is now standard
  • Grow Out with Cluster: clusters are built from inexpensive parts, clusters of PCs
  (Figure: SMP Super Server, Departmental Server, Personal System.)

  3. There'll be Billions, Trillions of Clients
  • Every device will be “intelligent”
  • Doors, rooms, cars…
  • Computing will be ubiquitous

  4. Billions, Trillions of Clients Need Millions of Servers
  • All clients networked to servers
  • May be nomadic or on-demand
  • Fast clients want faster servers
  • Servers provide: shared data, control, coordination, communication
  (Figure: mobile and fixed clients connected to servers and super servers.)

  5. Thesis: Many Little Beat Few Big
  • Smoking, hairy golf ball
  • How to connect the many little parts?
  • How to program the many little parts?
  • Fault tolerance & management?
  (Figure: the spectrum from mainframe ($1 million) through mini ($100K) and micro ($10K) down to nano and pico processors; ram latencies from 10 microseconds to 10 picoseconds, 10 millisecond disc, 10 second tape archive; capacities from 1 MB to 100 TB; disk form factors shrinking from 14" and 9" to 5.25", 3.5", 2.5", 1.8"; a 1 M SPECmark, 1 TFLOP part with 10^6 clocks to bulk ram, event horizon on chip, VM reincarnated, multi-program cache, on-chip SMP.)

  6. The Bricks of Cyberspace: 4 B PC’s (1 Bips, 0.1 GB dram, 10 GB disk, 1 Gbps net, B=G)
  • Cost 1,000 $
  • Come with: NT, DBMS, high-speed net, system management, GUI / OOUI, tools
  • Compatible with everyone else
  • CyberBricks

  7. Computers Shrink to a Point
  • Disks: 100x in 10 years; 2 TB 3.5” drive
  • Shrink to 1” is 200 GB
  • Disk is super computer!
  • This is already true of printers and “terminals”
  (Chart: scale prefixes Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta.)

  8. Systems 30 Years Ago
  • MegaBuck per Mega Instruction Per Second (mips)
  • MegaBuck per MegaByte
  • Sys Admin & Data Admin per MegaBuck

  9. Disks of 30 Years Ago • 10 MB • Failed every few weeks

  10. 1988: IBM DB2 + CICS Mainframe, 65 tps
  • IBM 4391
  • Simulated network of 800 clients
  • 2 M$ computer
  • Staff of 6 to do benchmark
  (Figure: refrigerator-sized CPU, 2 x 3725 network controllers, 16 GB disk farm (4 x 8 x 0.5 GB).)

  11. 1987: Tandem Mini @ 256 tps
  • 14 M$ computer (Tandem)
  • A dozen people (1.8 M$/y): admin expert, performance expert, hardware experts, network expert, auditor, manager, DB expert, OS expert
  • False floor, 2 rooms of machines
  (Figure: 32-node processor array, 40 GB disk array (80 drives), simulating 25,600 clients.)

  12. 1997: 9 Years Later, 1 Person and 1 Box = 1250 tps
  • 1 breadbox ~ 5x the 1987 machine room
  • 23 GB is hand-held
  • One person does all the work (hardware, OS, net, DB, and app expert)
  • Cost/tps is 100,000x less: 5 micro dollars per transaction
  (Figure: 4 x 200 MHz cpu, 1/2 GB DRAM, 12 x 4 GB disk, 3 x 7 x 4 GB disk arrays.)

  13. What Happened? Where did the 100,000x come from?
  • Moore’s law: 100X (at most)
  • Software improvements: 10X (at most)
  • Commodity pricing: 100X (at least)
  • Total: 100,000X
  • 100x from commodity:
  • DBMS was 100 K$ to start; now 1 K$ to start
  • IBM 390 MIPS is 7.5 K$ today; Intel MIPS is 10 $ today
  • Commodity disk is 50 $/GB vs 1,500 $/GB
  • ...
  (Chart: price vs time for mainframe, mini, micro.)
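The multiplicative breakdown on this slide can be checked with a few lines of Python. A minimal sketch; the three factors are the slide's own estimates, not independent measurements:

```python
# Where did the 100,000x cost/tps improvement come from?
# Each factor below is the slide's estimate; multiplying them
# shows why the total comes out to 100,000x.
moores_law = 100   # hardware speedup, "at most"
software = 10      # software improvements, "at most"
commodity = 100    # commodity pricing, "at least"

total = moores_law * software * commodity
print(total)  # 100000
```

Note that the three factors are roughly independent (speed, code, price), which is why they multiply rather than add.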

  14. Web & server farms, server consolidation / sqft
  Per sqft:    SGI O2K   UE10K   DELL 6350   Cray T3E   IBM SP2    PoPC
  cpus             2.1     4.7         7.0        4.7       5.0    13.3
  specint         29.0    60.5       132.7       79.3      72.3   253.3
  ram (GB)         4.1     4.7         7.0        0.6       5.0     6.8
  disks            1.3     0.5         5.2        0.0       2.5    13.3
  • http://www.exodus.com (charges by mbps times sqft)
  • Standard package, full height, fully populated, 3.5” disks
  • HP, DELL, Compaq are trading places wrt rack-mount lead
  • PoPC: Celeron NLX shoeboxes with on-chip at-speed L2; 1000 nodes in 48 (24x2) sq ft; $650K from Arrow (3-yr warranty!)

  15. Application Taxonomy
  • Technical:
  • General purpose, non-parallelizable codes: PCs have it!
  • Vectorizable, and vectorizable & //able (Supers & small DSMs)
  • Hand-tuned, one-of, MPP coarse grain, MPP embarrassingly // (clusters of PCs)
  • Commercial:
  • Database, Database/TP, Web host, stream audio/video
  • If central control & rich, then IBM or large SMPs; else PC clusters

  16. Peta Scale Computing: peta scale w/ traditional balance
                                2000                       2010
  1 PIPS processors (10^15 ips)  10^6 cpus @ 10^9 ips      10^4 cpus @ 10^11 ips
  10 PB of DRAM                  10^8 chips @ 10^7 bytes   10^6 chips @ 10^9 bytes
  10 PBps memory bandwidth
  1 PBps IO bandwidth            10^8 disks @ 10^7 Bps     10^7 disks @ 10^8 Bps
  100 PB of disk storage         10^5 disks @ 10^10 B      10^3 disks @ 10^12 B
  10 EB of tape storage          10^7 tapes @ 10^10 B      10^5 tapes @ 10^12 B
  • 10x every 5 years, 100x every 10 (1000x in 20 if SC)
  • Except: memory & IO bandwidth
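The processor row of this balance table can be sanity-checked directly: both the 2000 and 2010 configurations must multiply out to 10^15 instructions per second. A small sketch of that arithmetic:

```python
# Peta-scale processor budget: total instructions/sec must hit
# 10^15 (1 PIPS) in both projected configurations.
cpus_2000, ips_2000 = 10**6, 10**9    # a million 1-Gips cpus
cpus_2010, ips_2010 = 10**4, 10**11   # ten thousand 100-Gips cpus

assert cpus_2000 * ips_2000 == 10**15
assert cpus_2010 * ips_2010 == 10**15
print("both configurations reach 1 PIPS")
```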

  17. “I think there is a world market for maybe five computers.” (Thomas Watson Senior, Chairman of IBM, 1943)

  18. Microsoft.com: ~150x4 nodes: a crowd
  (Diagram: the microsoft.com server farm across the Internal WWW, European, and Japan data centers plus MOSWest: staging, FTP, download, SQL, and web servers for www.microsoft.com, home.microsoft.com, search.microsoft.com, premium.microsoft.com, register.microsoft.com, support.microsoft.com, msid.msn.com, activex.microsoft.com, and cdm.microsoft.com; typical configuration 4xP6, 512 MB RAM, 30 to 160 GB disk, at $25K to $83K per node; connected by switched Ethernet, FDDI rings, and Gigaswitches to the Internet over 2 primary OC3 links (100 Mb/sec each) and 13 secondary DS3 links (45 Mb/sec each).)

  19. HotMail: ~400 Computers Crowd

  20. DB Clusters (crowds)
  • 16-node cluster: 64 cpus, 2 TB of disk, decision support
  • 45-node cluster: 140 cpus, 14 GB DRAM, 4 TB RAID disk, OLTP (Debit Credit), 1 B tpd (14 k tps)

  21. The Microsoft TerraServer Hardware
  • Compaq AlphaServer 8400
  • 8 x 400 MHz Alpha cpus
  • 10 GB DRAM
  • 324 x 9.2 GB StorageWorks disks (3 TB raw, 2.4 TB of RAID5)
  • STK 9710 tape robot (4 TB)
  • Windows NT 4 EE, SQL Server 7.0

  22. TerraServer: Lots of Web Hits
  • A billion web hits!
  • 1 TB, largest SQL DB on the Web
  • 100 Qps average, 1,000 Qps peak
  • 877 M SQL queries so far
               Total     Average   Peak
  Hits         1,065 m   8.1 m     29 m
  Queries      877 m     6.7 m     18 m
  Images       742 m     5.6 m     15 m
  Page Views   170 m     1.3 m     6.6 m
  Users        6.4 m     48 k      76 k
  Sessions     10 m      77 k      125 k
  (Chart: daily hit count, page views, images, and DB queries, June to October 1998.)

  23. SQL 7 TerraServer Availability
  • Operating for 4 months: 3,133 hrs
  • Unscheduled outage: 36.5 minutes (99.9905% scheduled up)
  • Scheduled outage: 60 minutes
  • Availability: 99.96% overall up
  • No NT failures (ever)
  • One SQL7 Beta2 bug
  • No failures in July, Aug, Oct, Dec, Jan, Feb
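The availability figures follow from uptime divided by total operating time. A sketch of that calculation using the slide's raw numbers; the slide's quoted percentages may use a slightly different accounting base, so treat the outputs as illustrative:

```python
# Availability = 1 - downtime / total operating time,
# using the outage minutes quoted on the slide.
total_min = 3133 * 60   # 4 months of operation, in minutes
unscheduled = 36.5      # minutes of unscheduled outage
scheduled = 60.0        # minutes of scheduled outage

unsched_avail = 100 * (1 - unscheduled / total_min)
overall_avail = 100 * (1 - (unscheduled + scheduled) / total_min)
print(f"{unsched_avail:.4f}% up excluding scheduled outages")
print(f"{overall_avail:.2f}% up overall")
```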

  24. Backup / Restore

  25. Windows NT Versus UNIX: Best Results on an SMP
  • SemiLog plot shows 3x (2-year) lead by UNIX
  • Does not show Oracle/Alpha Cluster at 100,000 tpmC
  • All these numbers are off-scale huge (20,000 active users?)

  26. TPC-C Improvements (MS SQL)
  • 250%/year on price, 100%/year on performance
  • Bottleneck is 3 GB address space
  • 40% hardware, 100% software, 100% PC technology

  27. UNIX (dis) Economy Of Scale

  28. Oracle/NT
  • Compaq/NT/Oracle: 27,383 tpmC
  • 71.50 $/tpmC
  • 4 x 6 cpus
  • 384 disks = 2.7 TB

  29. Oracle: Soak the Rich: 36% software tax. Microsoft: 4% software tax.

  30. Storage Latency: How Far Away Is the Data?
  Registers            1 clock    My Head       1 min
  On-Chip Cache        2          This Room
  On-Board Cache       10         This Resort   10 min
  Memory               100        Los Angeles   1.5 hr
  Disk                 10^6       Pluto         2 years
  Tape / Optical Robot 10^9       Andromeda     2,000 years

  31. Thesis: Performance = Storage Accesses, not Instructions Executed
  • In the “old days” we counted instructions and IO’s
  • Now we count memory references
  • Processors wait most of the time
  (Chart: where the time goes, clock ticks used by AlphaSort components: Sort, Disc Wait, OS, Memory Wait; I-Cache miss, B-Cache data miss, D-Cache miss.)
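The thesis can be illustrated with a toy cost model in which run time is instruction cost plus memory-stall cost. All parameters here (CPI, miss penalty, miss rates) are made-up illustrative values, not figures from the talk:

```python
# Toy model of "performance = storage accesses": two programs with
# identical instruction counts differ only in cache miss rate.
def run_time(instructions, mem_refs, miss_rate,
             cpi=1.0, miss_penalty=100):
    """Clock cycles = base instruction cost + cache-miss stall cycles."""
    return instructions * cpi + mem_refs * miss_rate * miss_penalty

# Same instruction count, different locality (illustrative numbers):
good_locality = run_time(1_000_000, 300_000, miss_rate=0.01)
poor_locality = run_time(1_000_000, 300_000, miss_rate=0.50)
print(poor_locality / good_locality)  # ~12x slower for the same code
```

With a 100-cycle miss penalty, the miss term dwarfs the instruction term, which is exactly why counting instructions alone mispredicts performance.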

  32. Storage Hierarchy (10 levels)
  • Registers
  • Cache L1, L2
  • Main (1, 2, 3 if NUMA)
  • Disk (1 (cached), 2)
  • Tape (1 (mounted), 2)

  33. Bottleneck Analysis
  • Drawn to linear scale
  • Theoretical bus bandwidth: 422 MBps = 66 Mhz x 64 bits
  • Memory read/write: ~150 MBps
  • MemCopy: ~50 MBps
  • Disk R/W: ~9 MBps

  34. Bottleneck Analysis
  • NTFS read/write: 18 Ultra 3 SCSI on 4 strings (2x4 and 2x5), 3 PCI 64
  • ~155 MBps unbuffered read (175 raw)
  • ~95 MBps unbuffered write
  • Good, but 10x down from our UNIX brethren (SGI, SUN)
  (Figure: adapters at ~70 MBps on PCI buses at ~110 MBps; memory read/write ~250 MBps; 155 MBps aggregate.)

  35. PennySort
  • Hardware: 266 Mhz Intel PPro, 64 MB SDRAM (10 ns), dual Fujitsu DMA 3.2 GB EIDE disks
  • Software: NT Workstation 4.3, NT 5 sort
  • Performance: sort 15 M 100-byte records (~1.5 GB), disk to disk
  • Elapsed time 820 sec, cpu time = 404 sec
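The throughput implied by this run is easy to derive from the slide's figures; a quick sketch of the arithmetic:

```python
# Throughput implied by the PennySort run on the slide:
# 15 M records of 100 bytes, disk to disk, in 820 seconds.
records = 15_000_000
record_bytes = 100
elapsed_s = 820

rec_per_s = records / elapsed_s
mb_per_s = records * record_bytes / elapsed_s / 1e6
print(f"{rec_per_s:,.0f} records/s, {mb_per_s:.2f} MB/s")
```

Since the data passes through the disks twice (read and write), the sustained disk traffic is roughly double the end-to-end figure.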

  36. Sandia/Compaq/ServerNet/NT Sort
  • Sort 1.1 terabytes (13 billion records) in 47 minutes
  • 68 nodes (dual 450 Mhz processors), 543 disks, 1.5 M$
  • 1.2 GBps network rap (2.8 GBps pap)
  • 5.2 GBps of disk rap (same as pap)
  • (rap = real application performance, pap = peak advertised performance)
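The aggregate rates behind these numbers can be sketched from the slide's figures (illustrative arithmetic; the slide does not state these derived values itself):

```python
# Rates implied by the Sandia/Compaq/ServerNet/NT sort.
bytes_sorted = 1.1e12   # 1.1 terabytes
minutes = 47
disks = 543
disk_rap_bps = 5.2e9    # 5.2 GBps aggregate disk rap from the slide

sort_rate_gbps = bytes_sorted / (minutes * 60) / 1e9  # end-to-end GB/s
per_disk_mbps = disk_rap_bps / disks / 1e6            # each disk's share
print(f"{sort_rate_gbps:.2f} GB/s end to end, "
      f"{per_disk_mbps:.1f} MBps per disk")
```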

  37. SP sort • 2 – 4 GBps!

  38. Progress on Sorting: NT now leads both price and performance
  • Speedup comes from Moore’s law: 40%/year
  • Processor/disk/network arrays: 60%/year (this is a software speedup)

  39. Recent Results
  • NOW Sort: 9 GB on a cluster of 100 UltraSparcs in 1 minute
  • MilleniumSort: 16x Dell NT cluster: 100 MB in 1.18 sec (Datamation)
  • Tandem/Sandia Sort: 68-CPU ServerNet: 1 TB in 47 minutes
  • IBM SPsort: 408 nodes, 1952 cpus, 2168 disks: 17.6 minutes = 1057 sec (all for 1/3 of 94 M$; slice price is 64 K$ for 4 cpus, 2 GB ram, 6 x 9 GB disks + interconnect)

  40. Data Gravity: Processing Moves to Transducers
  • Move processing to data sources
  • Move to where the power (and sheet metal) is
  • Processor in: modem, display, microphones (speech recognition) & cameras (vision)
  • Storage: data storage and analysis
  • System is “distributed” (a cluster/mob)

  41. SAN: Standard Interconnect
  • LAN faster than memory bus?
  • 1 GBps links in lab
  • 100$ port cost soon
  • Port is computer
  • Winsock: 110 MBps (10% cpu utilization at each end)
  Gbps SAN: 110 MBps
  PCI: 70 MBps
  UW SCSI: 40 MBps
  FW SCSI: 20 MBps
  SCSI: 5 MBps
  (Figure margin: RIP FDDI, RIP ATM, RIP FC, RIP SCI, RIP SCSI, RIP ?)

  42. Disk = Node
  • has magnetic storage (100 GB?)
  • has processor & DRAM
  • has SAN attachment
  • has execution environment
  (Figure: software stack of Applications, Services, DBMS, RPC, ..., File System, SAN driver, Disk driver, OS Kernel.)

  43. end
