Storage Bricks
Jim Gray, Microsoft Research
FAST 2002, Monterey, CA, 29 Jan 2002

Acknowledgements: Dave Patterson explained this to me long ago. Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments.



First Disk 1956

  • IBM 305 RAMAC

  • 4 MB

  • 50x24” disks

  • 1200 rpm

  • 100 ms access

  • 35k$/y rent

  • Included computer & accounting software (tubes not transistors)

10 years later

1.6 meters

Disk Evolution

  • Capacity: 100x in 10 years

    • 1 TB 3.5” drive in 2005

    • 20 GB 1” micro-drive

  • System on a chip

  • High-speed SAN

  • Disk replacing tape

  • Disk is a supercomputer!
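The "100x in 10 years" capacity claim can be restated as a yearly rate. The following is a minimal sketch of that arithmetic; the function name is illustrative and not from the talk.

```python
import math

def annual_growth(factor: float, years: float) -> float:
    """Compound annual growth rate implied by `factor`-fold growth over `years`."""
    return factor ** (1.0 / years) - 1.0

# Slide claim: capacity grows 100x in 10 years.
rate = annual_growth(100, 10)

# Time for capacity to double at that rate, in months.
doubling_months = 12 * math.log(2) / math.log(1 + rate)

print(f"growth: {rate:.0%} per year, doubling every {doubling_months:.0f} months")
```

Read this way, 100x per decade is roughly 58% per year, which corresponds to capacity doubling about every 18 months.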

Disks are becoming computers

  • Smart drives

  • Camera with micro-drive

  • Replay / Tivo / Ultimate TV

  • Phone with micro-drive

  • MP3 players

  • Tablet

  • Xbox

  • Many more…

[Figure labels: Applications (Web, DBMS, Files); Disk Ctlr + 1 GHz CPU; InfiniBand, Ethernet, radio…]

Data Gravity: Processing Moves to Transducers (smart displays, microphones, printers, NICs, disks)

[Figure labels: today P = 50 mips, M = 2 MB; in a few years P = 500 mips, M = 256 MB]

  • Processing decentralized

    • Moving to data sources

    • Moving to power sources

    • Moving to sheet metal

  • The end of computers?




It’s Already True of Printers: Peripheral = CyberBrick

  • You buy a printer

  • You get:

    • several network interfaces

    • A Postscript engine

      • cpu,

      • memory,

      • software,

      • a spooler (soon)

    • and… a print engine.

The Absurd Design?

  • Segregate processing from storage

  • Poor locality

  • Much useless data movement

  • Amdahl’s laws: bus: 10 B/ips, io: 1 b/ips

[Figure labels: ~1 Tips, 10 TBps, ~1 TB, 100 GBps, ~100 TB]
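The bandwidth figures on this slide appear to follow from the two Amdahl ratios applied to a ~1 Tips system. The sketch below checks that, assuming "B" means bytes and "b" means bits; the function name is illustrative.

```python
BITS_PER_BYTE = 8

def amdahl_balance(ips: float) -> tuple[float, float]:
    """Return (bus_bytes_per_s, io_bytes_per_s) for a system running `ips` instructions/s.

    Assumes the slide's ratios: 10 bytes of bus bandwidth and 1 bit of I/O
    per instruction per second.
    """
    bus = 10 * ips
    io = ips / BITS_PER_BYTE
    return bus, io

bus, io = amdahl_balance(1e12)  # ~1 Tips, as in the figure
print(f"bus: {bus / 1e12:.0f} TBps, io: {io / 1e9:.0f} GBps")
```

Under these assumptions the bus comes out at 10 TBps and the I/O at 125 GBps, the same order as the 100 GBps shown on the slide.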

The “Absurd” Disk

  • 2.5 hr scan time (poor sequential access)

  • 1 aps / 5 GB (VERY cold data)

  • It’s a tape!

  • Optimizations:

    • Reduce management costs

    • Caching

    • Sequential 100x faster than random


[Figure labels: 1 TB, 100 MB/s, 200 Kaps]
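The scan time and the "1 aps / 5 GB" ratio can be derived from the three numbers on the slide. This is a rough check, assuming "200 Kaps" means about 200 small (KB-sized) accesses per second; the constant names are illustrative.

```python
CAPACITY_BYTES = 1e12            # 1 TB drive
BANDWIDTH_BYTES_PER_S = 100e6    # 100 MB/s sequential transfer
ACCESSES_PER_S = 200             # ASSUMPTION: reading of "200 Kaps"

# Time to read the whole drive sequentially.
scan_hours = CAPACITY_BYTES / BANDWIDTH_BYTES_PER_S / 3600

# How many GB of capacity share each access per second.
gb_per_access_per_s = (CAPACITY_BYTES / 1e9) / ACCESSES_PER_S

print(f"full scan: {scan_hours:.1f} h; 1 access/s per {gb_per_access_per_s:.0f} GB")
```

The exact scan time is about 2.8 hours, which the slide rounds to 2.5 hr; the access density comes out at one access per second per 5 GB.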

Disk = Node

  • magnetic storage (1TB)

  • processor + RAM + LAN

  • Management interface (HTTP + SOAP)

  • Application execution environment

  • Application

    • File

    • DB2/Oracle/SQL

    • Notes/Exchange/TeamServer

    • SAP/Siebel/…

    • Quickbooks/Tivo/PC…




[Figure labels: RPC, ...; File System; LAN driver; Disk driver; OS Kernel; Processor & Memory]

  • Offload device handling to NIC/HBA

    • higher level protocols: I2O, NASD, VIA, IP, TCP…

    • SMP and Cluster parallelism is important.

  • Move app to NIC/device controller

    • higher-higher level protocols: SOAP/DCOM/RMI..

    • Cluster parallelism is VERY important.




Intermediate Step: Shared Logic


[Figure labels: ~1 TB 12x80 GB NAS; 8x70 GB NAS; ~2 TB 12x160 GB NAS]

  • Brick with 8-12 disk drives

  • 200 mips/arm (or more)

  • 2 x Gbps Ethernet

  • General purpose OS

  • 10k$/TB to 50k$/TB

  • Shared

    • Sheet metal

    • Power

    • Support/Config

    • Security

    • Network ports

  • These bricks could run applications (e.g. SQL or Mail or..)


  • Homogeneous machines lead to quick response through reallocation

  • HP desktop machines, 320MB RAM, 3u high, 4 x 100GB IDE drives

  • $4k/TB (street)

  • 2.5 processors/TB, 1GB RAM/TB

  • JIT storage & processing: 3 weeks from order to deploy

Slide courtesy of Brewster Kahle, @

What if Disk Replaces Tape? How does it work?

  • Backup/Restore

    • RAID (among the federation)

    • Snapshot copies (in most OSs)

    • remote replicas (standard in DBMS and FS)

  • Archive

    • Use “cold” 95% of disk space

  • Interchange

    • Send computers not disks.

It’s Hard to Archive a Petabyte: It takes a LONG time to restore it.

  • At 1GBps it takes 12 days!

  • Store it in two (or more) places online: a geo-plex

  • Scrub it continuously (look for errors)

  • On failure,

    • use other copy until failure repaired,

    • refresh lost copy from safe copy.

  • Can organize the two copies differently (e.g.: one by time, one by space)
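The "12 days" figure and the scrub-and-refresh idea can be sketched in a few lines. This is a toy illustration, not a real geoplex: two in-memory dictionaries stand in for the two sites, SHA-256 digests stand in for whatever error detection a real system would use, and all names are hypothetical.

```python
import hashlib

def restore_days(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Days needed to stream `size_bytes` at a sustained `rate_bytes_per_s`."""
    return size_bytes / rate_bytes_per_s / 86_400

def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def scrub(copy_a: dict, copy_b: dict, expected: dict) -> list:
    """Check every block in both copies against its expected digest.

    When exactly one copy is bad, refresh it from the good copy.
    Returns a list of (copy_name, block_id) pairs that were repaired.
    """
    repaired = []
    for block_id, digest in expected.items():
        ok_a = block_id in copy_a and checksum(copy_a[block_id]) == digest
        ok_b = block_id in copy_b and checksum(copy_b[block_id]) == digest
        if ok_a and not ok_b:
            copy_b[block_id] = copy_a[block_id]
            repaired.append(("b", block_id))
        elif ok_b and not ok_a:
            copy_a[block_id] = copy_b[block_id]
            repaired.append(("a", block_id))
    return repaired

# 1 PB at 1 GBps: about 11.6 days, which the slide rounds to 12.
days = restore_days(1e15, 1e9)

# Two sites holding the same blocks, then a simulated error at site B.
site_a = {0: b"alpha", 1: b"beta"}
site_b = dict(site_a)
expected = {block_id: checksum(data) for block_id, data in site_a.items()}
site_b[1] = b"bXta"
repaired = scrub(site_a, site_b, expected)
print(f"{days:.1f} days; repaired: {repaired}")
```

Because a full restore takes more than a week even at 1 GBps, the sketch repairs individual blocks from the surviving copy instead of restoring everything.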

Archive to Disk: 100TB for 0.5M$ + 1.5 “free” petabytes

  • If you have 100 TB active you need 10,000 mirrored disk arms (see tpcC)

  • So you have 1.6 PB of (mirrored) storage (160GB drives)

  • Use the “empty” 95% for archive storage.

  • No extra space or extra power cost.

  • Very fast access (milliseconds vs hours).

  • Snapshot is read-only (software enforced)

  • Makes Admin easy (saves people costs)
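The "1.5 free petabytes" in the title follows from the arm count and drive size. A quick check, following the slide's own accounting (it subtracts the 100 TB of active data from the raw total); the constant names are illustrative.

```python
ARMS = 10_000       # disk arms needed for 100 TB of active data (slide)
DRIVE_GB = 160      # capacity per drive (slide)
ACTIVE_TB = 100     # active data (slide)

total_tb = ARMS * DRIVE_GB / 1000     # raw capacity in TB
free_tb = total_tb - ACTIVE_TB        # space left over for archive
free_fraction = free_tb / total_tb

print(f"{total_tb / 1000:.1f} PB total, {free_tb / 1000:.1f} PB free")
```

The free fraction comes out at 93.75%, which the slide rounds to 95%.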

Disk as Tape Archive

Slide courtesy of Brewster Kahle, @

  • Tape is unreliable, specialized, slow, low density, not improving fast, and expensive

  • Using removable hard drives to replace tape’s function has been successful

  • When a “tape” is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used.

  • Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good.

Disk as Tape Interchange

  • Tape interchange is frustrating (often unreadable)

  • Beyond 1-10 GB send media not data

    • FTP takes too long (hour/GB)

    • Bandwidth still very expensive (1$/GB)

  • Writing DVD not much faster than Internet

  • New technology could change this

    • 100 GB DVD @ 10MBps would be competitive.

  • Write 1TB disk in 2.5 hrs (at 100MBps)

  • But, how does interchange work?
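The "send media not data" rule can be illustrated by comparing transfer times. The rates below come from the slide; the 24-hour courier time is an assumption added for illustration and is not from the talk, and the function names are hypothetical.

```python
FTP_HOURS_PER_GB = 1.0      # slide: FTP takes about an hour per GB
DISK_WRITE_GB_PER_S = 0.1   # slide: disk writes at 100 MBps
COURIER_HOURS = 24.0        # ASSUMPTION: overnight shipping, not from the slide

def network_hours(gb: float) -> float:
    """Hours to move `gb` gigabytes over the network at the slide's FTP rate."""
    return gb * FTP_HOURS_PER_GB

def ship_hours(gb: float) -> float:
    """Hours to write `gb` gigabytes to a disk and ship it by courier."""
    return gb / DISK_WRITE_GB_PER_S / 3600 + COURIER_HOURS

for gb in (1, 10, 100, 1000):
    print(f"{gb:>5} GB: network {network_hours(gb):7.1f} h, ship {ship_hours(gb):5.1f} h")
```

With these numbers the network wins for small transfers and shipping wins well before 100 GB. The exact crossover depends on the assumed courier time.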

Disk As Tape Interchange: What format?

  • Today I send 160GB NTFS/SQL disks.

  • But that is not a good format for Linux/DB2 users.

  • Solution: Ship NFS/CIFS/ODBC servers (not disks)

  • Plug “disk” into LAN.

    • DHCP then file or DB server via standard interface.

    • “pull” data from server.

Some Questions

  • What is the product?

  • How do I manage 10,000 nodes (disks)?

  • How do I program 10,000 nodes (disks)?

  • How does RAID work?

  • How do I backup a PB?

  • How do I restore a PB?

What is the Product?

  • Concept: Plug it in and it works!

  • Music/Video/Photo appliance (home)

  • Game appliance

  • “PC”

  • File server appliance

  • Data archive/interchange appliance

  • Web server appliance

  • DB server

  • eMail appliance

  • Application appliance



How Does Scale Out Work?

  • Files: well known designs:

    • rooted tree partitioned across nodes

    • Automatic cooling (migration)

    • Mirrors or Chained declustering

    • Snapshots for backup/archive

  • Databases: well known designs

    • Partitioning, remote replication similar to files

    • distributed query processing.

  • Applications: (hypothetical)

    • Must be designed as mobile objects

    • Middleware provides object migration system

      • Objects externalize methods to migrate ( == backup/restore/archive)

      • Web services seem to have key ideas (xml representation)

    • Example: eMail object is mailbox
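Chained declustering, mentioned above for files, places each fragment's primary copy on one node and its backup on the next node in a ring. The sketch below shows that placement rule and a read that falls back to the backup; the function names and the simple modulo partitioning are assumptions for illustration.

```python
def placement(fragment: int, n_nodes: int) -> tuple[int, int]:
    """Return (primary_node, backup_node) for a fragment.

    The primary goes on node `fragment mod n`; the backup goes on the
    next node in the chain, wrapping around at the end.
    """
    primary = fragment % n_nodes
    backup = (primary + 1) % n_nodes
    return primary, backup

def read_node(fragment: int, n_nodes: int, failed: set) -> int:
    """Pick a live node to serve a read, preferring the primary."""
    primary, backup = placement(fragment, n_nodes)
    if primary not in failed:
        return primary
    if backup not in failed:
        return backup
    raise RuntimeError(f"both copies of fragment {fragment} are unavailable")

# Four nodes: fragment 3 lives on node 3 with its backup on node 0.
print(placement(3, 4))
# With node 3 down, the read is served by node 0.
print(read_node(3, 4, failed={3}))
```

Data is lost only when two adjacent nodes in the chain fail at once, and a failed node's read load can be spread along the chain instead of landing on a single mirror partner.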

Auto Manage Storage

  • 1980 rule of thumb:

    • A DataAdmin per 10GB, SysAdmin per mips

  • 2000 rule of thumb

    • A DataAdmin per 5TB

    • SysAdmin per 100 clones (varies with app).

  • Problem:

    • 5TB is 50k$ today, 5k$ in a few years.

    • Admin cost >> storage cost !!!!

  • Challenge:

    • Automate ALL storage admin tasks

Admin: TB and “guessed” $/TB (does not include cost of application, overhead, not “substance”)

  • Google: 1 : 100TB, 5k$/TB/y

  • Yahoo!: 1 : 50TB, 20k$/TB/y

  • DB: 1 : 5TB, 60k$/TB/y

  • Wall St.: 1 : 1TB, 400k$/TB/y (reported)

  • Hardware is the dominant cost only at Google.

  • How can we waste hardware to save people cost?

How do I manage 10,000 nodes?

  • You can’t manage 10,000 x (for any x).

  • They manage themselves.

    • You manage exceptional exceptions.

  • Auto Manage

    • Plug & Play hardware

    • Automatic load balancing & placement of storage & processing

    • Simple parallel programming model

    • Fault masking

How do I program 10,000 nodes?

  • You can’t program 10,000 x (for any x).

  • They program themselves.

    • You write embarrassingly parallel programs

    • Examples: SQL, Web, Google, Inktomi, HotMail, …

    • PVM and MPI prove it must be automatic (unless you have a PhD)!

  • Auto Parallelism is ESSENTIAL
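An embarrassingly parallel program applies the same function to every partition independently and then combines the partial results. The sketch below uses a thread pool as a stand-in for storage nodes; the function names and sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def count_matches(partition: list, needle: str) -> int:
    """Work done locally on one node: count records containing `needle`."""
    return sum(1 for record in partition if needle in record)

def parallel_count(partitions: list, needle: str, max_workers: int = 8) -> int:
    """Run the same scan on every partition and add up the partial counts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial = pool.map(count_matches, partitions, [needle] * len(partitions))
        return sum(partial)

# Three partitions, standing in for the data held by three disk nodes.
partitions = [["disk brick", "tape"], ["smart disk"], ["printer", "disk", "nic"]]
total = parallel_count(partitions, "disk")
print(total)
```

The programmer writes only the per-partition function and the combining step; distribution across workers is handled by the runtime, which is the property the slide asks for at 10,000 nodes.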


  • Disks will become supercomputers so

    • Lots of computing to optimize the arm

    • Can put app close to the data (better modularity, locality)

    • Storage appliances (self-organizing)

  • The arm/capacity tradeoff: “waste” space to save access.

    • Compression (saves bandwidth)

    • Mirrors

    • Online backup/restore

    • Online archive (vault to other drives or geoplex if possible)

  • It is not disks that replace tapes: storage appliances replace tapes.

  • Self-organizing storage servers (file systems) (prototypes of this software exist)