Beyond the File System

Designing Large Scale File Storage and Serving

Cal Henderson

Big file systems?
  • Too vague!
  • What is a file system?
  • What constitutes big?
  • Some requirements would be nice
1. Scalable

Looking at storage and serving infrastructures

2. Reliable

Looking at redundancy, failure rates, and on-the-fly changes

3. Cheap

Looking at upfront costs, TCO, and lifetimes

Four buckets
  • Storage
  • Serving
  • BCP
  • Cost

The storage stack
  • File protocol: NFS, CIFS, SMB
  • File system: ext, ReiserFS, NTFS
  • Block protocol: SCSI, SATA, FC
  • RAID: mirrors, stripes
  • Hardware: disks and stuff
Hardware overview

[Diagram: the storage scale]

Internal storage
  • A disk in a computer
    • SCSI, IDE, SATA
  • 4 disks in 1U is common
  • 8 in half-depth boxes
DAS
  • Direct attached storage
  • Disk shelf, connected by SCSI/SATA
  • HP MSA30 – 14 disks in 3U

SAN
  • Storage Area Network
  • Dumb disk shelves
  • Clients connect via a ‘fabric’
  • Fibre Channel, iSCSI, InfiniBand
    • Low level protocols
NAS
  • Network Attached Storage
  • Intelligent disk shelf
  • Clients connect via a network
  • NFS, SMB, CIFS
    • High level protocols
Meet the LUN
  • Logical Unit Number
  • A slice of storage space
  • Originally for addressing a single drive:
    • c1t2d3
    • Controller, Target, Disk (Slice)
  • Now means a virtual partition/volume
    • LVM, Logical Volume Management
NAS vs SAN
  • With a SAN, a single host (initiator) owns a single LUN/volume
  • With NAS, multiple hosts own a single LUN/volume
  • NAS head – NAS access to a SAN

SAN Advantages

Virtualization within a SAN offers some nice features:

  • Real-time LUN replication
  • Transparent backup
  • SAN booting for host replacement
Some Practical Examples
  • There are a lot of vendors
  • Configurations vary
  • Prices vary wildly
  • Let’s look at a couple
    • Ones I happen to have experience with
    • Not an endorsement ;)
NetApp Filers
  • Heads and shelves, up to 500TB in 6 cabs
  • FC SAN with 1 or 2 NAS heads

Isilon IQ
  • 2U Nodes, 3-96 nodes/cluster, 6-600 TB
  • FC/InfiniBand SAN with NAS head on each node
Scaling

Vertical vs Horizontal

Vertical scaling
  • Get a bigger box
  • Bigger disk(s)
  • More disks
  • Limited by current tech – size of each disk and total number in appliance
Horizontal scaling
  • Buy more boxes
  • Add more servers/appliances
  • Scales forever*

*sort of

Storage scaling approaches
  • Four common models:
    • Huge FS
    • Physical nodes
    • Virtual nodes
    • Chunked space
Huge FS
  • Create one giant volume with growing space
    • Sun’s ZFS
    • Isilon IQ
  • Expandable on-the-fly?
  • Upper limits
    • Always limited somewhere
Huge FS
  • Pluses
    • Simple from the application side
    • Logically simple
    • Low administrative overhead
  • Minuses
    • All your eggs in one basket
    • Hard to expand
    • Has an upper limit
Physical nodes
  • Application handles distribution to multiple physical nodes
    • Disks, Boxes, Appliances, whatever
  • One ‘volume’ per node
  • Each node acts by itself
  • Expandable on-the-fly – add more nodes
  • Scales forever
Physical Nodes
  • Pluses
    • Limitless expansion
    • Easy to expand
    • Unlikely to all fail at once
  • Minuses
    • Many ‘mounts’ to manage
    • More administration
Virtual nodes
  • Application handles distribution to multiple virtual volumes, contained on multiple physical nodes
  • Multiple volumes per node
  • Flexible
  • Expandable on-the-fly – add more nodes
  • Scales forever
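
A minimal sketch of that mapping, assuming made-up names and a fixed volume count (neither comes from the talk): files hash to a stable virtual volume, and a separate, mutable table sends each volume to a physical node.

```python
import hashlib

VOLUME_COUNT = 1024    # virtual volumes, fixed up front (illustrative)
VOLUME_TO_NODE = {}    # volume -> physical node; lives in config or a DB

def volume_for(file_id: int) -> int:
    """Hash a file to a stable virtual volume. Rebalancing means editing
    VOLUME_TO_NODE (moving whole volumes between nodes); file ids and
    volume numbers never change."""
    digest = hashlib.md5(str(file_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % VOLUME_COUNT
```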
Virtual Nodes
  • Pluses
    • Limitless expansion
    • Easy to expand
    • Unlikely to all fail at once
    • Addressing is logical, not physical
    • Flexible volume sizing, consolidation
  • Minuses
    • Many ‘mounts’ to manage
    • More administration
Chunked space
  • Storage layer writes parts of files to different physical nodes
  • A higher-level RAID striping
  • High performance for large files
    • read multiple parts simultaneously
Chunked space
  • Pluses
    • High performance
    • Limitless size
  • Minuses
    • Conceptually complex
    • Can be hard to expand on the fly
    • Can’t manually poke it
Real Life

Case Studies

GFS – Google File System
  • Developed by … Google
  • Proprietary
  • Everything we know about it is based on talks they’ve given
  • Designed to store huge files for fast access
GFS – Google File System
  • Single ‘Master’ node holds metadata
    • SPF – Shadow master allows warm swap
  • Grid of ‘chunkservers’
    • 64-bit filenames
    • 64 MB file chunks
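
To make the chunk arithmetic concrete, here is a sketch of a GFS-style fixed 64 MB split (the helper is mine, not Google's):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in the talk

def chunk_layout(file_size: int):
    """Yield (chunk_index, offset, length) for a fixed-size split."""
    for index, offset in enumerate(range(0, file_size, CHUNK_SIZE)):
        yield index, offset, min(CHUNK_SIZE, file_size - offset)

# A 200 MB file maps to chunks 0-3; the final chunk holds only 8 MB.
```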
GFS – Google File System

[Diagram: the master mapping file chunks – 1(a), 1(b), 2(a) – onto a grid of chunkservers]
GFS – Google File System
  • Client reads metadata from master then file parts from multiple chunkservers
  • Designed for big files (>100MB)
  • Master server allocates access leases
  • Replication is automatic and self repairing
    • Synchronously for atomicity
GFS – Google File System
  • Reading is fast (parallelizable)
    • But requires a lease
  • Master server is required for all reads and writes
MogileFS – OMG Files
  • Developed by Danga / SixApart
  • Open source
  • Designed for scalable web app storage
MogileFS – OMG Files
  • Single metadata store (MySQL)
    • MySQL Cluster avoids SPF
  • Multiple ‘tracker’ nodes locate files
  • Multiple ‘storage’ nodes store files
MogileFS – OMG Files

[Diagram: multiple tracker nodes sharing a single MySQL metadata store]
MogileFS – OMG Files
  • Replication of file ‘classes’ happens transparently
  • Storage nodes are not mirrored – replication is piecemeal
  • Reading and writing go through trackers, but are performed directly upon storage nodes
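
A sketch of that read path. The HTTP tracker endpoint below is invented for illustration; real MogileFS trackers speak their own text protocol, but the shape is the same: ask a tracker where a key lives, then fetch directly from a storage node.

```python
import urllib.request

def read_file(tracker: str, key: str) -> bytes:
    """Locate a key via a tracker, then read it straight off storage."""
    # Hypothetical endpoint returning one storage-node URL per line.
    with urllib.request.urlopen(f"{tracker}/get_paths?key={key}") as resp:
        paths = resp.read().decode().splitlines()
    for path in paths:                    # try replicas in order
        try:
            with urllib.request.urlopen(path) as resp:
                return resp.read()        # the bytes never touch the tracker
        except OSError:
            continue                      # fall through to the next replica
    raise IOError(f"no readable replica for {key!r}")
```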
Flickr File System
  • Developed by Flickr
  • Proprietary
  • Designed for very large scalable web app storage
Flickr File System
  • No metadata store
    • Deal with it yourself
  • Multiple ‘StorageMaster’ nodes
  • Multiple storage nodes with virtual volumes
Flickr File System
  • Metadata stored by app
    • Just a virtual volume number
    • App chooses a path
  • Virtual nodes are mirrored
    • Locally and remotely
  • Reading is done directly from nodes
Flickr File System
  • StorageMaster nodes only used for write operations
  • Reading and writing can scale separately
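
A sketch of how little app-side metadata that needs, with hypothetical names: the database keeps only a volume number per photo, and a small ops-owned table resolves volumes to hosts for reads.

```python
VOLUME_TO_HOST = {7: "storage7.example.com"}  # illustrative mapping

def photo_url(photo_id: int, volume: int) -> str:
    """Build a read URL straight to a storage node; no StorageMaster
    is consulted on the read path."""
    return f"http://{VOLUME_TO_HOST[volume]}/vol{volume}/{photo_id}.jpg"

# photo_url(12345, 7) -> "http://storage7.example.com/vol7/12345.jpg"
```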
Amazon S3
  • A big disk in the sky
  • Multiple ‘buckets’
  • Files have user-defined keys
  • Data + metadata
Amazon S3

[Diagram: your servers storing files into Amazon S3]
Amazon S3

[Diagram: users fetching files directly from Amazon S3, not from your servers]

The cost
  • Fixed price, by the GB
  • Store: $0.15 per GB per month
  • Serve: $0.20 per GB
The cost

[Chart: S3 serving cost vs. regular bandwidth cost]
End costs
  • ~$2k to store 1TB for a year
  • ~$63 a month for 1Mb/s
  • ~$65k a month for 1Gb/s
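
A quick back-of-the-envelope check of those figures (decimal GB and a 30-day month assumed; rounding explains the small gaps to the slide's numbers):

```python
STORE_PER_GB_MONTH = 0.15   # $/GB/month stored
SERVE_PER_GB = 0.20         # $/GB served

print(1000 * STORE_PER_GB_MONTH * 12)      # 1 TB for a year: $1800, ~$2k
month_seconds = 30 * 24 * 3600
gb_per_month_at_1mbps = (1e6 / 8) * month_seconds / 1e9   # ~324 GB
print(gb_per_month_at_1mbps * SERVE_PER_GB)               # ~$65/month at 1 Mb/s
print(gb_per_month_at_1mbps * 1000 * SERVE_PER_GB)        # ~$65k/month at 1 Gb/s
```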
Serving files

Serving files is easy!

[Diagram: Apache serving files straight off a disk]
Serving files

Scaling is harder

[Diagram: many disk + Apache pairs in parallel]
Serving files
  • This doesn’t scale well
  • Primary storage is expensive
    • And takes a lot of space
  • In many systems, we only access a small number of files most of the time
Caching
  • Insert caches between the storage and serving nodes
  • Cache frequently accessed content to reduce reads on the storage nodes
  • Software (Squid, mod_cache)
  • Hardware (Netcache, Cacheflow)
Why it works
  • Keep a smaller working set
  • Use faster hardware
    • Lots of RAM
    • SCSI
    • Outer edge of disks (ZCAV)
  • Use more duplicates
    • Cheaper, since they’re smaller
Two models
  • Layer 4
    • ‘Simple’ balanced cache
    • Objects in multiple caches
    • Good for few objects requested many times
  • Layer 7
    • URL-balanced cache
    • Objects in a single cache
    • Good for many objects requested a few times
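
The Layer 7 half of that reduces to hashing the URL so each object has exactly one home cache. A toy sketch (pool names are made up):

```python
import hashlib

CACHES = ["cache1", "cache2", "cache3"]   # hypothetical cache pool

def cache_for(url: str) -> str:
    """Layer 7: hash the URL so each object lands in a single cache."""
    digest = hashlib.md5(url.encode()).digest()
    return CACHES[int.from_bytes(digest[:4], "big") % len(CACHES)]
```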
Replacement policies
  • LRU – Least recently used
  • GDSF – Greedy dual size frequency
  • LFUDA – Least frequently used with dynamic aging
  • All have advantages and disadvantages
  • Performance varies greatly with each
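
For reference, LRU, the simplest of the three, fits in a few lines (a toy sketch, not a production cache):

```python
from collections import OrderedDict

class LRUCache:
    """Evict whichever object was touched longest ago."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None                      # miss: fetch from origin instead
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # drop the least recently used
```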
Cache Churn
  • How long do objects typically stay in cache?
  • If it gets too short, we’re doing badly
    • But it depends on your traffic profile
  • Make the cached object store larger
Problems
  • Caching has some problems:
    • Invalidation is hard
    • Replacement is dumb (even LFUDA)
  • Avoiding caching makes your life (somewhat) easier
CDN – Content Delivery Network
  • Akamai, Savvis, Mirror Image Internet, etc.
  • Caches operated by other people
    • Already in-place
    • In lots of places
  • GSLB/DNS balancing
Edge networks

[Diagram: a single origin feeding many edge caches close to users]
CDN Models
  • Simple model
    • You push content to them, they serve it
  • Reverse proxy model
    • You publish content on an origin, they proxy and cache it
CDN Invalidation
  • You don’t control the caches
    • Just like those awful ISP ones
  • Once something is cached by a CDN, assume it can never change
    • Nothing can be deleted
    • Nothing can be modified
Versioning
  • When you start to cache things, you need to care about versioning
    • Invalidation & Expiry
    • Naming & Sync
Cache Invalidation
  • If you control the caches, invalidation is possible
  • But remember ISP and client caches
  • Remove deleted content explicitly
    • Avoid users finding old content
    • Save cache space
Cache versioning
  • Simple rule of thumb:
    • If an item is modified, change its name (URL)
  • This can be independent of the file system!
Virtual versioning
  • Database indicates version 3 of the file
  • Web app writes the version number into the URL: example.com/foo_3.jpg
  • Request comes through the cache and is cached with the versioned URL: foo_3.jpg
  • mod_rewrite converts the versioned URL to a path: foo_3.jpg -> foo.jpg
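
Both halves in one sketch, using the slide's foo_3.jpg example (the regex and function names are mine): the app bakes the version into the URL, and the storage side strips it again, so nothing is ever renamed on disk.

```python
import re

def versioned_url(name: str, version: int) -> str:
    """Web app side: put the current version into the public URL."""
    return f"example.com/{name}_{version}.jpg"

# Rough mod_rewrite equivalent: RewriteRule ^(.+)_\d+\.jpg$ $1.jpg
def url_to_path(url_path: str) -> str:
    """Storage side: strip the version back out, as mod_rewrite would."""
    return re.sub(r"_\d+\.jpg$", ".jpg", url_path)

assert versioned_url("foo", 3) == "example.com/foo_3.jpg"
assert url_to_path("foo_3.jpg") == "foo.jpg"
```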

Authentication
  • Authentication inline layer
    • Apache / perlbal
  • Authentication sideline
    • ICP (CARP/HTCP)
  • Authentication by URL
    • FlickrFS
Auth layer
  • Authenticator sits between client and storage
  • Typically built into the cache software

[Diagram: client → authenticator → cache → origin]
Auth sideline
  • Authenticator sits beside the cache
  • Lightweight protocol used for the authenticator

[Diagram: the cache consults the sideline authenticator, then fetches from the origin]
Auth by URL
  • Someone else performs authentication and gives URLs to the client (typically the web app)
  • URLs hold the ‘keys’ for accessing files

[Diagram: web server hands out URLs; client fetches through the cache from the origin]
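
One common shape for URL-held ‘keys’ is an expiring HMAC signature; a sketch (the talk doesn't specify FlickrFS's actual scheme, and the secret and parameter names are assumptions):

```python
import hashlib
import hmac
import time

SECRET = b"shared-between-app-and-storage"   # assumed shared secret

def signed_url(path: str, ttl: int = 300) -> str:
    """Web app side: embed expiry + signature in the URL itself."""
    expires = int(time.time()) + ttl
    sig = hmac.new(SECRET, f"{path}{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{path}?expires={expires}&sig={sig}"

def valid(path: str, expires: int, sig: str) -> bool:
    """Serving side: recompute and compare; no auth round-trip needed."""
    want = hmac.new(SECRET, f"{path}{expires}".encode(), hashlib.sha256).hexdigest()
    return time.time() < expires and hmac.compare_digest(want, sig)
```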
Business Continuity Planning
  • How can I deal with the unexpected?
    • The core of BCP
  • Redundancy
  • Replication
Reality
  • On a long enough timescale, anything that can fail, will fail
  • Of course, everything can fail
  • True reliability comes only through redundancy
Reality
  • Define your own SLAs
  • How long can you afford to be down?
  • How manual is the recovery process?
  • How far can you roll back?
  • How many $node boxes can fail at once?
Failure scenarios
  • Disk failure
  • Storage array failure
  • Storage head failure
  • Fabric failure
  • Metadata node failure
  • Power outage
  • Routing outage
Reliable by design
  • RAID avoids disk failures, but not head or fabric failures
  • Duplicated nodes avoid host and fabric failures, but not routing or power failures
  • Dual-colo avoids routing and power failures, but may need duplication too
Tend to all points in the stack
  • Going dual-colo: great
  • Taking a whole colo offline because of a single failed disk: bad
  • We need a combination of these
Recovery times
  • BCP is not just about continuing when things fail
  • How can we restore after they come back?
  • Host and colo level syncing
    • replication queuing
  • Host and colo level rebuilding
Reliable Reads & Writes
  • Reliable reads are easy
    • 2 or more copies of files
  • Reliable writes are harder
    • Write 2 copies at once
    • But what do we do when we can’t write to one?
Dual writes
  • Queue up data to be written
    • Where?
    • Needs itself to be reliable
  • Queue up journal of changes
    • And then read data from the disk whose write succeeded
  • Duplicate whole volume after failure
    • Slow!
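
A sketch of the journaled variant (the queue and node interfaces are assumptions): write both copies, and when one write fails, journal the change so it can be replayed from the surviving copy rather than duplicating the whole volume.

```python
from collections import deque

replay_queue = deque()   # stand-in for a durable queue; it must itself be reliable

def dual_write(nodes, key, data):
    """Write to both replicas; journal any copy that fails for later replay."""
    wrote_any = False
    for node in nodes:                         # expect exactly two nodes
        try:
            node.write(key, data)              # assumed storage-node interface
            wrote_any = True
        except OSError:
            # Journal of changes: on replay, read `key` from the replica
            # whose write succeeded and copy it onto this node.
            replay_queue.append((node, key))
    if not wrote_any:
        raise IOError(f"no replica accepted {key!r}")
```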
Judging cost
  • Per GB?
  • Per GB upfront and per year
  • Not as simple as you’d hope
    • How about an example?
Hardware costs

Single cost: cost of hardware ÷ usable GB

Power costs

Recurring cost: cost of power per year ÷ usable GB

Single cost: power installation cost ÷ usable GB

Space costs

Recurring cost: (cost per U × U’s needed, inc. network) ÷ usable GB

Network costs

Single cost: cost of network gear ÷ usable GB

Misc costs

Single & recurring costs: (support contracts + spare disks + bus adaptors + cables) ÷ usable GB

Human costs

Recurring cost: (admin cost per node × node count) ÷ usable GB
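
Those per-GB terms add up to a TCO figure along these lines (a sketch; every number below is a placeholder, not a real quote):

```python
def tco_per_gb(usable_gb, single_costs, recurring_costs_per_year, years):
    """Upfront plus ongoing costs, normalised to usable capacity."""
    upfront = sum(single_costs)                       # hardware, network, install
    ongoing = sum(recurring_costs_per_year) * years   # power, space, support, admin
    return (upfront + ongoing) / usable_gb

# Hypothetical 10 TB node over 3 years:
print(tco_per_gb(10_000,
                 single_costs=[20_000, 3_000],
                 recurring_costs_per_year=[1_200, 2_400, 500],
                 years=3))   # -> 3.53 $/GB
```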

TCO
  • Total cost of ownership in two parts
    • Upfront
    • Ongoing
  • Architecture plays a huge part in costing
    • Don’t get tied to hardware
    • Allow heterogeneity
    • Move with the market