
Improving File System Reliability with I/O Shepherding

Haryadi S. Gunawi, Vijayan Prabhakaran+, Swetha Krishnan,
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

University of Wisconsin - Madison

Storage Reality
  • Complex storage subsystem
    • Mechanical/electrical failures, buggy drivers
  • Complex failures
    • Intermittent faults, latent sector errors, corruption, lost writes, misdirected writes, etc.
  • FS reliability is important
    • Managing disk and individual block failures

[Figure: the storage stack beneath the file system: device driver, transport, firmware, media, mechanical, electrical]
File System Reality
  • Good news:
    • Rich literature
      • Checksum, parity, mirroring
      • Versioning, physical/logical identity
    • Important for single and multiple disks setting
  • Bad news:
    • File system reliability is broken [SOSP ’05]
      • Unlike other components (performance, consistency)
      • Reliability approaches are hard to understand and evolve
Broken FS Reliability
  • Lack of good reliability strategy
    • No remapping, checksumming, redundancy
    • Existing strategy is coarse-grained
      • Mount read-only, panic, retry
  • Inconsistent policies
    • Different techniques in similar failure scenarios
  • Bugs
    • Ignored write failures

Let’s fix them! With the current framework? Not so easy …

No Reliability Framework
  • Diffused
    • Handle each fault at each I/O location
    • Different developers might increase diffusion
  • Inflexible
    • Fixed policies, hard to change
    • But no single policy fits all diverse settings
      • Less reliable vs. more reliable drives
      • Desktop workloads vs. web-server apps
  • The need for a new framework
    • Reliability is a first-class file system concern

[Figure: reliability policy scattered throughout the file system, above the disk subsystem]

Localized
  • I/O Shepherd
    • Localized policies, …
      • More correct, fewer bugs, simpler reliability management

[Figure: the shepherd interposed between the file system and the disk subsystem]

Flexible
  • I/O Shepherd
    • Localized, flexible policies, …

[Figure: different policies (add mirror, checksum, more retries; more or less protection) for different environments: ATA vs. SCSI, archival and scientific data, networked storage, less reliable vs. more reliable drives; the shepherd sits between the file system and the disk subsystem]

Powerful
  • I/O Shepherd
    • Localized, flexible, and powerful policies

[Figure: composable policies (add mirror, checksum, more retries; more or less protection) in the shepherd, between the file system and diverse disk subsystems: ATA, SCSI, archival and scientific data, networked storage, less reliable, more reliable, and custom drives]

Outline
  • Introduction
  • I/O Shepherd Architecture
  • Implementation
  • Evaluation
  • Conclusion
Architecture
  • Building a reliability framework
    • How to specify reliability policies?
    • How to make powerful policies?
    • How to simplify reliability management?
  • I/O Shepherd layer
  • Four important components
    • Policy table
    • Policy code
    • Policy primitives
    • Policy metadata

[Figure: the I/O Shepherd sits between the file system and the disk subsystem, and holds the policy primitives (SanityCheck, Lookup, Location, OnlineFsck, Checksum, Write, Read), the policy metadata (Mirror-Map, Remap-Map, Checksum-Map), and policy code such as the DynMirrorWrite example below]

Policy Code
  DynMirrorWrite(DiskAddr D, MemAddr A)
    DiskAddr copyAddr;
    IOS_MapLookup(MMap, D, &copyAddr);
    if (copyAddr == NULL)
      PickMirrorLoc(MMap, D, &copyAddr);
      IOS_MapAllocate(MMap, D, copyAddr);
    return (IOS_Write(D, A, copyAddr, A));

Policy Table
  • How to specify reliability policies?
    • Different block types, different levels of importance
    • Different volumes, different reliability levels
    • Need fine-grained policy
  • Policy table
    • Different policies across different block types
    • Different policy tables across different volumes

[Figure: the shepherd applies a different policy table to each volume (/tmp, /boot, /lib, /archive), ranging from high-level reliability to no protection]
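To make the idea concrete, the sketch below shows one way a per-volume policy table could be represented: an array indexed by block type whose entries point at policy code. This is only an illustrative C sketch; the block types, status codes, and handler names are assumptions, not the shepherd's actual table format.

  #include <stdio.h>

  /* Hypothetical block types and I/O results (illustrative only). */
  typedef enum { BT_SUPER, BT_INODE, BT_DIR, BT_DATA, BT_NTYPES } BlockType;
  typedef enum { IO_OK, IO_FAIL } IOStatus;
  typedef IOStatus (*ReadPolicy)(long block);

  /* Two toy policies standing in for shepherd policy code. */
  static IOStatus RetryRead(long b)     { printf("retry read of block %ld\n", b);         return IO_OK; }
  static IOStatus PropagateRead(long b) { printf("propagate failure for block %ld\n", b); return IO_FAIL; }

  /* One policy table per volume: block type -> read-failure policy. */
  typedef struct { ReadPolicy on_read_fault[BT_NTYPES]; } PolicyTable;

  int main(void) {
      /* e.g. an archive volume protects everything, a scratch volume does not. */
      PolicyTable archive = { { RetryRead, RetryRead, RetryRead, RetryRead } };
      PolicyTable scratch = { { RetryRead, RetryRead, RetryRead, PropagateRead } };

      archive.on_read_fault[BT_DATA](1001);   /* dispatch by block type */
      scratch.on_read_fault[BT_DATA](2002);
      return 0;
  }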
Policy Metadata
  • What support is needed to make powerful policies?
    • Remapping: track bad-block remapping
    • Mirroring: allocate new blocks
    • Sanity check: needs an on-disk structure specification
  • Integration with the file system
    • Runtime allocation
    • Detailed knowledge of on-disk structures
  • I/O Shepherd Maps
    • Managed by the shepherd
    • Commonly used maps: mirror-map, checksum-map, remap-map

[Figure: example shepherd maps (Remap-Map, Mirror-Map, Csum-Map) kept alongside the file system, with entries for blocks 1001, 1002, and 1003]
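As a rough mental model, each shepherd map is a table from a block address to a companion address (a mirror location, a remapped location, or a checksum block). The C sketch below is only an in-memory illustration with hypothetical names; the real maps are persistent structures that the shepherd must keep crash-consistent (see chained transactions later).

  #include <stdio.h>

  #define MAP_SIZE 4096          /* illustrative: one slot per managed block */
  #define ADDR_NULL 0UL          /* stands in for the "null" map entries     */

  typedef unsigned long Addr;

  /* One shepherd map, e.g. the mirror-map: block address -> mirror address. */
  typedef struct { Addr target[MAP_SIZE]; } ShepherdMap;

  /* MapLookup: return the companion address recorded for block D, if any. */
  static Addr MapLookup(const ShepherdMap *m, Addr d) {
      return m->target[d % MAP_SIZE];
  }

  /* MapAllocate: record that block D is paired with address R. */
  static void MapAllocate(ShepherdMap *m, Addr d, Addr r) {
      m->target[d % MAP_SIZE] = r;
  }

  int main(void) {
      static ShepherdMap mirror_map;            /* zero-initialized: all null  */
      Addr d = 1001, r;

      r = MapLookup(&mirror_map, d);            /* first access: no mirror yet */
      if (r == ADDR_NULL) {
          r = 2001;                             /* pretend a mirror location was chosen */
          MapAllocate(&mirror_map, d, r);
      }
      printf("block %lu mirrored at %lu\n", d, MapLookup(&mirror_map, d));
      return 0;
  }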
Policy Primitives and Code
  • How to make reliability management simple?
  • I/O Shepherd Primitives
    • Rich set and reusable
    • Complexities are hidden
  • Policy writer simply composes primitives into Policy Code

Policy Primitives
  • Maps: Map Update, Map Lookup
  • Computation: Checksum, Parity
  • FS-Level: Layout, Sanity Check, Allocate Near, Allocate Far, Stop FS

Policy Code
  MirrorData(Addr D)
    Addr M;
    MapLookup(MMap, D, M);
    if (M == NULL)
      M = PickMirrorLoc(D);
      MapAllocate(MMap, D, M);
    Copy(D, M);
    Write(D, M);
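As a concrete (but hypothetical) rendering of that composition, the C sketch below stubs out the primitives so the MirrorData control flow can run end to end. The stub behavior (for example, PickMirrorLoc choosing D + 1000) is invented for illustration and is not the shepherd's real API.

  #include <stdio.h>

  typedef unsigned long Addr;
  #define ADDR_NULL 0UL

  /* Stub primitives: stand-ins for the shepherd's map, allocation, and I/O
   * primitives.  Real versions operate on persistent, on-disk state. */
  static Addr mirror_map[4096];
  static Addr MapLookup(Addr d)            { return mirror_map[d % 4096]; }
  static void MapAllocate(Addr d, Addr m)  { mirror_map[d % 4096] = m; }
  static Addr PickMirrorLoc(Addr d)        { return d + 1000; }   /* toy choice */
  static int  Write(Addr d, Addr m)        { printf("write %lu and %lu\n", d, m); return 0; }

  /* MirrorData policy: ensure block D has a mirror location M, then write both. */
  static int MirrorData(Addr d) {
      Addr m = MapLookup(d);
      if (m == ADDR_NULL) {                 /* first write: choose and record a mirror */
          m = PickMirrorLoc(d);
          MapAllocate(d, m);
      }
      return Write(d, m);                   /* write the block and its copy */
  }

  int main(void) { return MirrorData(1001); }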

[Figure: MirrorData in action: the file system issues a write to block D; the shepherd finds no entry for D in the mirror-map, picks a mirror location R, records D → R in the map, and then writes both D and R to the disk subsystem]

Policy Code
  MirrorData(Addr D)
    Addr R;
    R = MapLookup(MMap, D);
    if (R == NULL)
      R = PickMirrorLoc(D);
      MapAllocate(MMap, D, R);
    Copy(D, R);
    Write(D, R);
Summary
  • Interposition simplifies reliability management
    • Localized policies
    • Simple and extensible policies
  • Challenge: Keeping new data and metadata consistent
Outline
  • Introduction
  • I/O Shepherd Architecture
  • Implementation
    • Consistency Management
  • Evaluation
  • Conclusion
Implementation
  • CrookFS
    • (named for the hooked staff of a shepherd)
    • An ext3 variant with I/O shepherding capabilities
  • Implementation
    • Changes in Core OS
      • Semantic information, layout and allocation interface, allocation during recovery
      • Consistency management (data journaling mode)
      • ~900 LOC (non-intrusive)
    • Shepherd Infrastructure
      • Shepherd primitives, thread support, maps management, etc.
      • ~3500 LOC (reusable for other file systems)
  • Well-integrated with the file system
    • Small overhead
Data Journaling Mode

[Figure: ext3 data journaling: dirty in-memory blocks (Bm, I, D) are synced to the journal as a transaction (TB, D, I, TC), so the intent is logged; the blocks are then checkpointed to their fixed on-disk locations, so the intent is realized; finally the transaction is released]
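Since the failed-intentions argument on the next slides hinges on this ordering, here is a minimal C sketch of the sequence (log the intent, commit, checkpoint, release). Every function here is a made-up stand-in for illustration, not the ext3/jbd interface.

  #include <stdio.h>

  typedef struct { long addr; } Block;           /* hypothetical block handle */

  /* Stub journal operations (illustrative only). */
  static void journal_write_begin(void)    { printf("journal: TB\n"); }
  static void journal_write_block(Block b) { printf("journal: block %ld\n", b.addr); }
  static void journal_write_commit(void)   { printf("journal: TC (intent logged)\n"); }
  static void checkpoint_block(Block b)    { printf("checkpoint: block %ld to fixed location\n", b.addr); }
  static void release_transaction(void)    { printf("transaction released\n"); }

  /* Data journaling, simplified: log the intent, then realize it. */
  static void journal_transaction(Block *blocks, int n) {
      journal_write_begin();
      for (int i = 0; i < n; i++)
          journal_write_block(blocks[i]);        /* D, I, ... go to the journal */
      journal_write_commit();                    /* intent is now durable */

      for (int i = 0; i < n; i++)
          checkpoint_block(blocks[i]);           /* intent realized; a failure here
                                                    is the "failed intention" */
      release_transaction();                     /* only now is journal space reclaimed */
  }

  int main(void) {
      Block bs[] = { {10}, {11} };               /* e.g. data D and inode I */
      journal_transaction(bs, 2);
      return 0;
  }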

Reliability Policy + Journaling
  • When should policies run?
    • Policies (e.g. mirroring) are executed during checkpoint
  • Is the current journaling approach adequate to support reliability policies?
    • Could we run remapping/mirroring during checkpoint?
    • No: the problem of failed intentions
      • Cannot react to checkpoint failures
Failed Intentions
  • Example policy: remapping

[Figure: the transaction (TB, D, I, TC) commits to the journal and is released; during checkpoint the write of block D fails, so the policy remaps D's contents to a new block R and must also update the on-disk remap-map (RMD0 to RMDR); if a crash occurs before that update is durable, the already-released transaction cannot be replayed, leaving two inconsistencies: (1) the pointer to D is invalid, and (2) nothing references R]
Journaling Flaw
  • Journal: log the intent to the journal
    • If a journal write fails? Simply abort the transaction
  • Checkpoint: the intent is realized at the final location
    • If a checkpoint write fails? No solution!
      • ext3, IBM JFS: ignore it
      • ReiserFS: stop the FS (coarse-grained recovery)
  • Flaw in the current journaling approach
    • No consistency for any checkpoint recovery that changes state
      • Too late: the transaction has already been committed
      • A crash could occur at any time
    • Hopes that checkpoint writes always succeed (wrong!)
  • Consistent reliability + current journaling = impossible
Chained Transactions
  • Contains all recent changes (e.g. the shepherd’s modified metadata)
  • “Chained” with the previous transaction
  • Rule: only after the chained transaction commits can we release the previous transaction
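A minimal sketch of that rule, using a hypothetical journal interface rather than the CrookFS code: when a checkpoint write fails, the recovery's metadata changes (e.g. an updated remap-map) go into a new transaction chained to the old one, and the old transaction is released only after the chained transaction commits.

  #include <stdio.h>
  #include <stdbool.h>

  /* Hypothetical transaction handle and journal operations. */
  typedef struct { int id; } Tx;
  static bool checkpoint(Tx *t)            { printf("checkpoint tx %d\n", t->id); return false; /* pretend a write failed */ }
  static void log_recovery_metadata(Tx *t) { printf("log shepherd metadata (e.g. remap-map) in tx %d\n", t->id); }
  static void commit(Tx *t)                { printf("commit tx %d\n", t->id); }
  static void release(Tx *t)               { printf("release tx %d\n", t->id); }

  /* Rule: the previous transaction may be released only after the chained
   * transaction holding the recovery's metadata changes has committed. */
  static void checkpoint_with_chaining(Tx *prev) {
      if (checkpoint(prev)) {            /* all checkpoint writes succeeded */
          release(prev);
          return;
      }
      Tx chained = { prev->id + 1 };     /* failed intention: chain a new tx */
      log_recovery_metadata(&chained);
      commit(&chained);                  /* chained tx is durable ...        */
      release(prev);                     /* ... only now release the old one */
  }

  int main(void) { Tx t = {1}; checkpoint_with_chaining(&t); return 0; }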
Chained Transactions
  • Example policy: remapping

[Figure: the original transaction (TB, D, I, TC) commits; during checkpoint the write of D fails and is remapped to R, and the updated remap-map (RMDR) is logged in a new chained transaction (TB, TC); unlike the old behavior (release right after checkpoint), the original transaction is released only after the chained transaction commits, so the checkpoint completes consistently]
Summary
  • Chained Transactions
    • Handle failed intentions
    • Work for all policies
    • Minimal changes in the journaling layer
  • Repeatable across crashes
    • Idempotent policies
      • An important property for consistency across multiple crashes: replaying the policy during recovery yields the same end state
Outline
  • Introduction
  • I/O Shepherd Architecture
  • Implementation
  • Evaluation
  • Conclusion
Evaluation
  • Flexible
    • Change ext3 to all-stop or retry-more policies
  • Fine-Grained
    • Implement gracefully degrading RAID [TOS ’05]
  • Composable
    • Perform multiple lines of defense
  • Simple
    • Craft 8 policies in a simple manner
Flexibility
  • Modify ext3’s inconsistent read recovery policies

[Figure: ext3’s observed read-recovery policy per workload and per failed block type; the observed actions (retry, stop, propagate, ignore failure, no recovery, not applicable) vary inconsistently across the matrix. Example: failed block = indirect block, workload = path traversal (cd /mnt/fs2/test/a/b/); policy observed: detect the failure and propagate it to the application]
Flexibility
  • Modify ext3 policies to all-stop policies

[Figure: under the All-Stop policy the ext3 matrix (no recovery, retry, stop, propagate) becomes uniform: every read failure stops the file system]

AllStopRead(Block B)
  if (Read(B) == OK) return OK;
  else Stop();
Flexibility
  • Modify ext3 policies to retry-more policies

[Figure: under the Retry-More policy the ext3 matrix (no recovery, retry, stop, propagate) becomes uniform: every failed read is retried up to RETRY_MAX times]

RetryMoreRead(Block B)
  for (int i = 0; i < RETRY_MAX; i++)
    if (Read(B) == SUCCESS)
      return SUCCESS;
  return FAILURE;
Fine-Granularity
  • RAID problem
    • Extreme unavailability
      • Partially available data
      • Unavailable root directory
  • DGRAID [TOS ’05]
    • Degrade gracefully
      • Fault-isolate a file to a disk
      • Highly replicate metadata

[Figure: a file system directly on RAID-0 stripes file1.pdf and /root across all disks; with the shepherd running DGRAID on RAID-0, each file (f1.pdf, f2.pdf) is isolated to a single disk]
Fine-Granularity

[Figure: availability (A) versus number of disk failures (F); curves for 10-way, linear, and X = 1, 5, 10; sample points: F = 1, A = 90%; F = 2, A = 80%; F = 3, A ≈ 40%]
Composability
  • Multiple lines of defense
  • Assemble both low-level and high-level recovery mechanisms

ReadInode(Block B)
{
  C = Lookup(Ch-Map, B);
  Read(B, C);
  if (CompareChecksum(B, C) == OK)
    return OK;
  M = Lookup(M-Map, B);
  Read(M);
  if (CompareChecksum(M, C) == OK)
    B = M; return OK;
  if (SanityCheck(B) == OK)
    return OK;
  if (SanityCheck(M) == OK)
    B = M; return OK;
  RunOnlineFsck();
  return ReadInode(B);
}

[Figure: time (ms) spent as each successive line of defense is invoked]
Simplicity
  • Writing a reliability policy is simple
    • Implemented 8 policies
      • Using reusable primitives
    • The most complex one is < 80 LOC
Conclusion
  • Modern storage failures are complex
    • Not only fail-stop, but also exhibit individual block failures
  • FS reliability framework does not exist
    • Scattered policy code – can’t expect much reliability
    • Journaling + Block Failures → Failed intentions (Flaw)
  • I/O Shepherding
    • Powerful
      • Deploy disk-level, RAID-level, FS-level policies
    • Flexible
      • Reliability as a function of workload and environment
    • Consistent
      • Chained-transactions
ADvanced Systems Laboratory
www.cs.wisc.edu/adsl

Thanks to:

I/O Shepherd’s shepherd – Frans Kaashoek

Scholarship Sponsor:

Research Sponsor:

[Figure: RemapMirrorData in action: the mirror-map entry for D starts NULL; the shepherd picks mirror location R and records D → R; the write to R fails, so R is deallocated, a new location Q is picked, the map is updated to D → Q, and Q is written; the disk subsystem ends up holding D and Q]

Policy Code
  RemapMirrorData(Addr D)
    Addr R, Q;
    MapLookup(MMap, D, R);
    if (R == NULL)
      R = PickMirrorLoc(D);
      MapAllocate(MMap, D, R);
    Copy(D, R);
    Write(D, R);
    if (Fail(R))
      Deallocate(R);
      Q = PickMirrorLoc(D);
      MapAllocate(MMap, D, Q);
      Write(Q);
Chained Transactions (2)
  • Example policy: RemapMirrorData

[Figure: the original transaction (TB, D, I, TC) is committed in the journal; during checkpoint the mirror copies R1 and R2 are written and the shepherd metadata is updated (MD0, MDR1, MDR2), with the metadata changes logged in a chained transaction (TB, TC); the checkpoint completes, and only then is the original transaction released]
Existing Solutions Enough?
  • Is the machinery in high-end systems enough (e.g. disk scrubbing, redundancy, end-to-end checksums)?
    • Not pervasive in home environments (storing photos, tax returns)
    • New trend: commodity storage clusters (Google, EMC Centera)
  • Is RAID enough?
    • Requires more than one disk
    • Does not protect against faults above the disk system
    • Focuses on whole-disk failure
    • Does not enable fine-grained policies