
Building PetaByte Servers

This presentation discusses the challenges of building massive data storage systems, using the EOS/DIS 15 PB system as a case study. It explores the architecture, hardware requirements, and storage hierarchy needed to handle data at that scale.


Presentation Transcript


  1. Building PetaByte Servers Jim Gray Microsoft Research Gray@Microsoft.com http://www.Research.Microsoft.com/~Gray/talks • Kilo 10^3 • Mega 10^6 • Giga 10^9 • Tera 10^12 (today, we are here) • Peta 10^15 • Exa 10^18

  2. Outline • The challenge: Building GIANT data stores • for example, the EOS/DIS 15 PB system • Conclusion 1 • Think about MOX and SCANS • Conclusion 2: • Think about Clusters

  3. The Challenge -- EOS/DIS • Antarctica is melting -- 77% of fresh water liberated • sea level rises 70 meters • Chico & Memphis are beach-front property • New York, Washington, SF, LA, London, Paris • Let’s study it! Mission to Planet Earth • EOS: Earth Observing System (17B$ => 10B$) • 50 instruments on 10 satellites 1997-2001 • Landsat (added later) • EOS DIS: Data Information System: • 3-5 MB/s raw, 30-50 MB/s processed. • 4 TB/day, • 15 PB by year 2007

  4. The Process Flow • Data arrives and is pre-processed. • instrument data is calibrated, gridded, averaged • Geophysical data is derived • Users ask for stored data OR to analyze and combine data. • Can make the pull-push split dynamically (diagram: Push Processing, Pull Processing, Other Data)

  5. Designing EOS/DIS • Expect that millions will use the system (online). Three user categories: • NASA 500 -- funded by NASA to do science • Global Change 10 k -- other dirt bags • Internet 20 m -- everyone else: grain speculators, Environmental Impact Reports • New applications => discovery & access must be automatic • Allow anyone to set up a peer-node (DAAC & SCF) • Design for Ad Hoc queries, Not Standard Data Products • If push is 90%, then 10% of data is read (on average). => A failure: no one uses the data; in DSS, push is 1% or less. => computation demand is enormous (pull:push is 100:1)

  6. The architecture • 2+N data center design • Scaleable OR-DBMS • Emphasize Pull vs Push processing • Storage hierarchy • Data Pump • Just in time acquisition

  7. Obvious Point: EOS/DIS will be a cluster of SMPs • It needs 16 PB storage • = 1 M disks in current technology • = 500K tapes in current technology • It needs 100 TeraOps of processing • = 100K processors (current technology) • and ~ 100 Terabytes of DRAM • 1997 requirements are 1000x smaller • smaller data rate • almost no re-processing work
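
A back-of-envelope check of the unit counts above, as a minimal sketch; the per-unit capacities are assumptions in the spirit of ~1997 technology, not figures from the slide:

```python
# Illustrative sizing of a 16 PB store (assumed per-unit capacities).
PB = 10**15
GB = 10**9

archive_bytes = 16 * PB
disk_capacity = 16 * GB      # assumption: ~16 GB per disk drive
tape_capacity = 32 * GB      # assumption: ~32 GB per tape cartridge

disks = archive_bytes / disk_capacity    # ~1 M disks, as the slide says
tapes = archive_bytes / tape_capacity    # ~500 K tapes, as the slide says

print(f"disks needed: {disks:,.0f}")
print(f"tapes needed: {tapes:,.0f}")
```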

  8. 2+N data center design • duplex the archive (for fault tolerance) • let anyone build an extract (the +N) • Partition data by time and by space (store 2 or 4 ways). • Each partition is a free-standing OR-DBMS (similar to Tandem, Teradata designs). • Clients and Partitions interact via standard protocols • OLE-DB, DCOM/CORBA, HTTP,…

  9. Hardware Architecture • 2 Huge Data Centers • Each has 50 to 1,000 nodes in a cluster • Each node has about 25…250 TB of storage • SMP: .5 Bips to 50 Bips -- 20K$ • DRAM: 50 GB to 1 TB -- 50K$ • 100 disks: 2.3 TB to 230 TB -- 200K$ • 10 tape robots: 25 TB to 250 TB -- 200K$ • 2 Interconnects: 1 GBps to 100 GBps -- 20K$ • Node costs 500K$ • Data Center costs 25M$ (capital cost)
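
A quick roll-up of the per-node prices above; a minimal sketch using the slide's component figures, with the 50-node case as the low end of the quoted 50-to-1,000 range:

```python
# Per-node and per-data-center cost roll-up (figures from the slide).
node_components_k = {
    "SMP (.5 to 50 Bips)":             20,
    "DRAM (50 GB to 1 TB)":            50,
    "100 disks (2.3 to 230 TB)":      200,
    "10 tape robots (25 to 250 TB)":  200,
    "2 interconnects (1 to 100 GBps)": 20,
}

node_cost_k = sum(node_components_k.values())   # ~490 K$, quoted as ~500 K$
data_center_m = 50 * node_cost_k / 1000         # 50 nodes -> ~24.5 M$, quoted as 25 M$

print(f"node cost:        ~{node_cost_k} K$")
print(f"data center cost: ~{data_center_m:.1f} M$ (capital)")
```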

  10. Scaleable OR-DBMS • Adopt cluster approach (Tandem, Teradata, VMScluster,..) • System must scale to many processors, disks, links • OR DBMS based on standard object model • CORBA or DCOM (not vendor specific) • Grow by adding components • System must be self-managing

  11. Storage Hierarchy • 10 TB of RAM (500 nodes) • 1 PB of disk (10,000 drives) • 15 PB of tape robot (4 x 1,000 robots) • Cache hot 10% (1.5 PB) on disk. • Keep cold 90% on near-line tape. • Remember recent results on speculation • (more on this later: MOX/GOX/SCANS)

  12. Data Pump • Some queries require reading ALL the data (for reprocessing) • Each Data Center scans the data every 2 weeks. • Data rate 10 PB/day = 10 TB/node/day = 120 MB/s • Compute small jobs on demand: • less than 1,000 tape mounts • less than 100 M disk accesses • less than 100 TeraOps • (less than 30 minute response time) • For BIG JOBS, scan the entire 15 PB database • Queries (and extracts) “snoop” this data pump.
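
A sanity check of the per-node arithmetic on this slide, as a minimal sketch: 10 TB per node per day converted to a sustained scan rate.

```python
# 10 TB/node/day expressed as MB/s of sustained scan bandwidth.
TB = 10**12
MB = 10**6
SECONDS_PER_DAY = 24 * 3600

per_node_per_day = 10 * TB
per_node_rate = per_node_per_day / SECONDS_PER_DAY / MB

print(f"per-node scan rate: ~{per_node_rate:.0f} MB/s")   # ~116 MB/s, quoted as ~120 MB/s
```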

  13. Just-in-time acquisition • Hardware prices decline 20%-40%/year • So buy at last moment • Buy best product that day: commodity • Depreciate over 3 years so that facility is fresh. • (after 3 years, cost is 23% of original). [chart: EOS DIS Disk Storage Size and Cost, 1994-2008, assuming a 40% price decline/year; axes are Data Need (TB) and Storage Cost (M$); annotation: with a 60% decline, cost peaks at 10M$]
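
A quick check of the depreciation arithmetic, assuming a flat 40% annual price decline; it lands close to the ~23% figure quoted above.

```python
# Replacement cost after 3 years of ~40%/year price declines.
annual_decline = 0.40
years = 3

remaining_fraction = (1 - annual_decline) ** years   # 0.6**3 ~= 0.216
print(f"cost after {years} years: ~{remaining_fraction:.0%} of the original price")
```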

  14. Problems • HSM • Design and Meta-data • Ingest • Data discovery, search, and analysis • reorg-reprocess • disaster recovery • cost

  15. What this system teaches us • Traditional storage metrics • KOX: KB objects accessed per second • $/GB: Storage cost • New metrics: • MOX: megabyte objects accessed per second • SCANS: Time to scan the archive

  16. Thesis: Performance = Storage Accesses, not Instructions Executed • In the “old days” we counted instructions and IO's • Now we count memory references • Processors wait most of the time

  17. The Pico Processor • 1 M SPECmarks • 10^6 clocks per fault to bulk RAM • Event-horizon on chip • VM reincarnated • Multi-program cache • Terror Bytes!

  18. Storage Latency: How Far Away is the Data? • Registers: 1 clock -- my head (1 min) • On-chip cache: 2 clocks -- this room • On-board cache: 10 clocks -- this campus (10 min) • Memory: 100 clocks -- Sacramento (1.5 hr) • Disk: 10^6 clocks -- Pluto (2 years) • Tape/optical robot: 10^9 clocks -- Andromeda (2,000 years)

  19. DataFlow Programming: Prefetch & Postwrite Hide Latency • Can't wait for the data to arrive (2,000 years!) • Need a memory that gets the data in advance (100 MB/s) • Solution: Pipeline data to/from the processor • Pipe data from source (tape, disc, ram...) to cpu cache
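
A minimal sketch of the prefetch idea described here (not the system's actual pipeline); `read_block` and `process` are hypothetical stand-ins for the real I/O and compute steps.

```python
# Double-buffered prefetch: a background thread streams blocks from the slow
# source into a bounded queue so the consumer rarely waits on I/O.
import threading
import queue

def prefetch_pipeline(read_block, process, n_blocks, depth=4):
    buf = queue.Queue(maxsize=depth)      # pipeline depth = blocks in flight

    def producer():
        for i in range(n_blocks):
            buf.put(read_block(i))        # runs ahead of the consumer
        buf.put(None)                     # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()

    while (block := buf.get()) is not None:
        process(block)                    # compute overlaps with the next read
```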

  20. MetaMessage: Technology Ratios Are Important • If everything gets faster&cheaper at the same rate THEN nothing really changes. • Things getting MUCH BETTER: • communication speed & cost 1,000x • processor speed & cost 100x • storage size & cost 100x • Things staying about the same • speed of light (more or less constant) • people (10x more expensive) • storage speed (only 10x better)

  21. Trends: Storage Got Cheaper • $/byte got 10^4 better • $/access got 10^3 better • capacity grew 10^3 • Latency improved 10x • Bandwidth improved 10x [chart: unit storage size by year, 1960-2000, for Tape (kB), Disk (kB), and RAM (b)]

  22. Trends: Access Times Improved Little [charts: access times for Tape, Disk, and RAM improved little from 1960 to 2000, while processor speedups (instructions/second) and WAN speeds (bits/second) improved by many orders of magnitude]

  23. Trends: Storage Bandwidth Improved Little [charts: transfer rates for RAM, Disk, Tape, and WANs improved little from 1960 to 2000 compared with processor speedups]

  24. Today’s Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs [charts: Size vs Speed and Price vs Speed for cache, main memory, secondary (disc), online tape, nearline tape, and offline tape; typical system size (bytes) and $/MB plotted against access times from ~10^-9 to 10^3 seconds]

  25. Trends: Application Storage Demand Grew • The New World: • Billions of objects • Big objects (1MB) • The Old World: • Millions of objects • 100-byte objects

  26. Trends: New Applications • Multimedia: text, voice, image, video, ... • The paperless office • Library of Congress online (on your campus) • All information comes electronically: entertainment, publishing, business • Information Network, Knowledge Navigator, Information at Your Fingertips

  27. What's a Terabyte? • 1 Terabyte = 1,000,000,000 business letters (150 miles of bookshelf) = 100,000,000 book pages (15 miles of bookshelf) = 50,000,000 FAX images (7 miles of bookshelf) = 10,000,000 TV pictures (mpeg) (10 days of video) = 4,000 LandSat images • Library of Congress (in ASCII) is 25 TB • 1980: 200 M$ of disc (10,000 discs), 5 M$ of tape silo (10,000 tapes) • 1997: 200 K$ of magnetic disc (120 discs), 300 K$ of optical disc robot (250 platters), 50 K$ of tape silo (50 tapes) • Terror Byte!! .1% of a PetaByte!

  28. The Cost of Storage & Access • File Cabinet: cabinet (4 drawer) 250$, paper (24,000 sheets) 250$, space (2x3 ft @ 10$/ft2) 180$; total 700$ => 3 ¢/sheet • Disk: disk (9 GB) = 2,000$ • ASCII: 5 m pages, 0.2 ¢/sheet (50x cheaper) • Image: 200 k pages, 1 ¢/sheet (similar to paper)

  29. Standard Storage Metrics • Capacity: • RAM: MB and $/MB: today at 10MB & 100$/MB • Disk: GB and $/GB: today at 5GB and 500$/GB • Tape: TB and $/TB: today at .1TB and 100k$/TB (nearline) • Access time (latency) • RAM: 100 ns • Disk: 10 ms • Tape: 30 second pick, 30 second position • Transfer rate • RAM: 1 GB/s • Disk: 5 MB/s - - - Arrays can go to 1GB/s • Tape: 3 MB/s - - - not clear that striping works

  30. New Storage Metrics: KOXs, MOXs, GOXs, SCANs? • KOX: How many kilobyte objects served per second • the file server, transaction processing metric • MOX: How many megabyte objects served per second • the Mosaic metric • GOX: How many gigabyte objects served per hour • the video & EOSDIS metric • SCANS: How many scans of all the data per day • the data mining and utility metric
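
A simplified, illustrative model of these four metrics for a single device, assuming each object costs one positioning operation plus a sequential transfer; the example parameter values are assumptions, not figures from the talk.

```python
# KOX / MOX / GOX / SCANS for one device under a simple positioning+transfer model.
def storage_metrics(bandwidth_MBps, accesses_per_sec, archive_TB):
    def objects_per_sec(object_MB):
        # limited by whichever is scarcer: positioning operations or raw bandwidth
        return min(accesses_per_sec, bandwidth_MBps / object_MB)

    kox = objects_per_sec(0.001)                          # 1 KB objects per second
    mox = objects_per_sec(1.0)                            # 1 MB objects per second
    gox = objects_per_sec(1000.0) * 3600                  # 1 GB objects per hour
    scans = bandwidth_MBps * 86_400 / (archive_TB * 1e6)  # full archive scans per day
    return kox, mox, gox, scans

# e.g. a ~1997 disk (5 MB/s, ~100 accesses/s) holding part of a 1 TB archive
print(storage_metrics(bandwidth_MBps=5, accesses_per_sec=100, archive_TB=1))
```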

  31. How To Get Lots of MOX, GOX, SCANS • parallelism: use many little devices in parallel • Beware of the media myth • Beware of the access time myth At 10 MB/s: 1.2 days to scan 1,000 x parallel: 15 minute SCAN. Parallelism: divide a big problem into many smaller ones to be solved in parallel.

  32. Tape & Optical: Beware of the Media Myth • Optical is cheap: 200 $/platter, 2 GB/platter => 100 $/GB (2x cheaper than disc) • Tape is cheap: 30 $/tape, 20 GB/tape => 1.5 $/GB (100x cheaper than disc).

  33. Tape & Optical Reality: Media is 10% of System Cost • Tape needs a robot (10 k$ ... 3 m$ ) • 10 ... 1000 tapes (at 20GB each) => 20$/GB ... 200$/GB • (1x…10x cheaper than disc) • Optical needs a robot (100 k$ ) • 100 platters = 200GB ( TODAY ) => 400 $/GB • ( more expensive than mag disc ) • Robots have poor access times • Not good for Library of Congress (25TB) • Data motel: data checks in but it never checks out!
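
A minimal sketch of the system-level arithmetic this slide makes: fold the robot price into $/GB and the "cheap media" advantage shrinks. Robot prices and tape counts follow the ranges quoted above; the exact outputs are illustrative.

```python
# System $/GB for a tape robot: robot cost plus media, divided by total capacity.
def system_cost_per_GB(robot_cost, n_tapes, tape_cost=30, tape_GB=20):
    total_cost = robot_cost + n_tapes * tape_cost
    total_GB = n_tapes * tape_GB
    return total_cost / total_GB

print(system_cost_per_GB(robot_cost=10_000,    n_tapes=10))    # ~52 $/GB, not 1.5 $/GB
print(system_cost_per_GB(robot_cost=3_000_000, n_tapes=1000))  # ~152 $/GB
```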

  34. The Access Time Myth The Myth: seek or pick time dominates The reality: (1) Queuing dominates (2) Transfer dominates BLOBs (3) Disk seeks often short Implication: many cheap servers better than one fast expensive server • shorter queues • parallel transfer • lower cost/access and cost/byte This is now obvious for disk arrays This will be obvious for tape arrays

  35. The Disk Farm On a Card • The 100 GB disc card: an array of discs (14" form factor) • Can be used as: 100 discs, 1 striped disc, 10 fault-tolerant discs, ...etc • LOTS of accesses/second and bandwidth • Life is cheap, it's the accessories that cost ya. • Processors are cheap, it's the peripherals that cost ya • (a 10k$ disc card).

  36. My Solution to Tertiary Storage: Tape Farms, Not Mainframe Silos • Many independent tape robots (like a disc farm) • Each robot: 10K$, 10 tapes, 500 GB, 6 MB/s, 20$/GB, 30 MOX, 15 GOX • 100 robots: 1M$, 50 TB, 50$/GB, 3K MOX, 1.5K GOX, 1 SCAN (scan in 24 hours)

  37. The Metrics: Disk and Tape Farms Win • Data Motel: Data checks in, but it never checks out [chart: GB/K$, KOX, MOX, GOX, and SCANS/day (log scale, 0.01 to 1,000,000) compared for a 1000x Disc Farm, a 100x DLT Tape Farm, and an STC Tape Robot (6,000 tapes, 8 readers)]

  38. Cost Per Access (3-year) [chart: KOX/$, MOX/$, GOX/$, and SCANS/k$ compared for a 1000x Disc Farm, a 100x DLT Tape Farm, and an STC Tape Robot (6,000 tapes, 16 readers)]

  39. Summary (of new ideas) • Storage accesses are the bottleneck • Accesses are getting larger (MOX, GOX, SCANS) • Capacity and cost are improving • BUT • Latencies and bandwidth are not improving much • SO • Use parallel access (disk and tape farms)

  40. MetaMessage: Technology Ratios Are Important • If everything gets faster&cheaper at the same rate nothing really changes. • Some things getting MUCH BETTER: • communication speed & cost 1,000x • processor speed & cost 100x • storage size & cost 100x • Some things staying about the same • speed of light (more or less constant) • people (10x worse) • storage speed (only 10x better)

  41. Ratios Changed • 10x better access time • 10x more bandwidth • 10,000x lower media price • DRAM/DISK ratio went 100:1 to 10:1 to 50:1

  42. The Five Minute Rule • Trade DRAM for Disk Accesses • Cost of an access (DriveCost / Accesses_per_second) • Cost of a DRAM page ($/MB / pages_per_MB) • Break even has two terms: • a Technology term and an Economic term • Grew page size to compensate for changing ratios. • Now at ~10 minutes for random, ~2 minutes for sequential
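
The break-even interval in its standard two-term form (a technology ratio times an economic ratio); a minimal sketch, with 1997-vintage parameter values that are assumptions for illustration rather than numbers from this talk.

```python
# Five-minute-rule break-even: keep a page in DRAM if it is re-referenced
# more often than this interval; otherwise leave it on disk.
def break_even_seconds(pages_per_MB, accesses_per_sec_per_disk,
                       price_per_disk, price_per_MB_dram):
    technology = pages_per_MB / accesses_per_sec_per_disk   # technology term
    economics = price_per_disk / price_per_MB_dram          # economic term
    return technology * economics

# assumptions: 8 KB pages (128/MB), ~64 random IO/s per drive, ~2,000$ drive, ~15$/MB DRAM
print(break_even_seconds(128, 64, 2000, 15) / 60, "minutes")   # ~4.4 minutes
```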

  43. [chart: break-even analysis showing the best index page size is ~16 KB]

  44. The Ideal Interconnect • [table: SCSI vs Comm rated (+/-) on the attributes below] • High bandwidth • Low latency • No software stack • Zero Copy • User mode access to device • Low HBA latency • Error Free (required if no software stack) • Flow Controlled • WE NEED A NEW PROTOCOL • best of SCSI and Comm • allow push & pull • industry is doing it: SAN + VIA

  45. Outline • The challenge: Building GIANT data stores • for example, the EOS/DIS 15 PB system • Conclusion 1 • Think about MOX and SCANS • Conclusion 2: • Think about Clusters • SMP report • Cluster report

  46. Scaleable Computers: BOTH SMP and Cluster • Grow Up with SMP: 4xP6 is now standard • Grow Out with Cluster: cluster has inexpensive parts [diagram: Personal System -> Departmental Server -> SMP Super Server, growing out to a Cluster of PCs]

  47. TPC-C Current Results • Best Performance is 30,390 tpmC @ $305/tpmC (Oracle/DEC) • Best Price/Perf. is 7,693 tpmC @ $43.5/tpmC (MS SQL/Dell) • Graphs show • UNIX high price • UNIX scaleup diseconomy
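
Since $/tpmC is roughly the total system price divided by throughput, the two quoted results imply the overall price tags below; a quick illustrative calculation of the price gap the graphs show.

```python
# Implied total system prices from the quoted TPC-C results.
best_perf_price = 30_390 * 305      # Oracle/DEC:  ~9.3 M$ for 30,390 tpmC
best_price_perf = 7_693 * 43.5      # MS SQL/Dell: ~0.33 M$ for 7,693 tpmC

print(f"best-performance system: ~{best_perf_price / 1e6:.1f} M$")
print(f"best-price/perf system:  ~{best_price_perf / 1e6:.2f} M$")
```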

  48. Compare SMP Performance

  49. Where the money goes

  50. TPC-C improved fast • 40% hardware, 100% software, 100% PC Technology
