Disk scrubbing in large archival storage systems


Disk Scrubbing in Large Archival Storage Systems

Thomas Schwarz, S.J.^1,2, Qin Xin^1,3, Ethan Miller^1, Darrell Long^1, Andy Hospodor^1,2, Spencer Ng^3

1 Storage Systems Resource Center, U. of California, Santa Cruz

2 Santa Clara University, Santa Clara, CA

3 Hitachi Global Storage Technologies, San Jose Research Center,



Introduction

  • Large archival storage systems:

    • Protect data more proactively

    • Keep disks powered off for long periods of time

    • Have low rate of data access

  • Protect data by storing it redundantly.



Introduction

  • Failures can happen

    • At the block level.

    • At the device level.

  • Failures may remain undetected for long periods of time.

  • A failure may unmask one or more additional failures.

    • Reconstruction procedure accesses data on other devices.

    • Those devices may have suffered previous failures.



Introduction

  • We investigate the efficacy of disk scrubbing.

  • Disk Scrubbing accesses a disk to see whether the data can still be read.

    • Reading a single block shows that the device still works.

    • Reading all blocks shows that we can read all the data on the disk.



Contents

  • Disk Failure Taxonomy

  • System Overview

  • Disk Scrubbing Modeling

  • Power Cycles and Reliability

  • Optimal Scrubbing Interval

  • Simulation Results



Disk Failure Taxonomy

  • Disk Blocks

    • 512B sector uses error control coding

    • A read to a block either

      • succeeds after correcting all errors, or retries and then:

        • flags the block as unreadable, or

        • misreads the block.

  • Disk Failure Rates

    • Depend highly on

      • Environment:

        • Temperature, Vibrations, Air quality

      • Age.

      • Vintage.



Disk Failure Taxonomy

  • Block Failure Rate estimate:

    • Since:

      • 1/3 of all field returns for server drives are due to hard errors.

      • Most RAID users (90%) do not return drives that have only hard errors.

      • 10% of all disks sold account for 1/3 of all errors.

    • Hence:

      • Mean time between block failures is 3/10 of the drive's rated MTBF.

      • Mean time to disk failure is 3/2 of the rated MTBF.

      • A 1-million-hour rated drive therefore has:

        • 3×10^5 hours mean time between block failures.

        • 1.5×10^6 hours mean time between disk failures.

This is a back-of-the-envelope calculation based on numbers from one anonymous disk manufacturer, but the results seem to be widely accepted.
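The slide's arithmetic is easy to check mechanically; a minimal sketch using only the numbers quoted above:

```python
# Back-of-the-envelope check of the failure-rate split above.
# From the slide: 1/3 of field returns are hard (block) errors, but 90% of
# drives with hard errors are never returned, so the true block failure rate
# is 10x the reported one; the remaining 2/3 of returns are device failures.

rated_mtbf = 1e6  # hours, for a "1 million hour" rated drive

return_rate = 1 / rated_mtbf                 # field-return rate per hour
block_rate = (1 / 3) * return_rate * 10      # hard errors, corrected for non-returns
device_rate = (2 / 3) * return_rate          # whole-device failures

print(1 / block_rate)    # mean time between block failures: 3e5 hours
print(1 / device_rate)   # mean time to disk failure: 1.5e6 hours
```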



System Overview

  • Disks are powered down when not in use.

  • Use m+k redundancy scheme:

    • Store data in large blocks.

    • m data blocks are grouped into a redundancy group (r-group).

    • k parity blocks are added to each r-group.

      • Small blocks lead to fast reconstruction and good reconstruction load distribution.

      • Large blocks have slightly better reliability.
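As a concrete illustration of an r-group, here is a minimal sketch for the simplest case k = 1, where the single parity block is the XOR of the m data blocks (the block contents and m here are toy values; k > 1 requires an erasure code such as Reed-Solomon):

```python
# Build an m+1 r-group with XOR parity and rebuild one lost data block.

m = 4
data_blocks = [bytes([i] * 8) for i in range(m)]  # toy 8-byte blocks

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

parity = xor_blocks(data_blocks)      # the k = 1 parity block
r_group = data_blocks + [parity]      # m+k blocks, stored on m+k disks

# Any single lost block can be rebuilt from the survivors:
lost = 2
survivors = [b for j, b in enumerate(r_group) if j != lost]
assert xor_blocks(survivors) == data_blocks[lost]
```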



System Overview

  • Disk Scrubbing

  • Scrub an S-block:

    • Can read one block ⇒ device has not failed.

    • Can read all blocks ⇒ can access all data.

    • Can read and verify all blocks ⇒ data can be read correctly.

      • Use “algebraic signatures” for that.

      • Can even verify that parity data accurately reflects client data.
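The property being exploited here is that an algebraic signature is linear over XOR, so the signature of a parity block equals the XOR of the data blocks' signatures, and a scrubber can check parity against data without moving the blocks. A small sketch over GF(2^8), as an illustrative construction rather than the paper's exact scheme:

```python
# Illustrative algebraic signature: sig(block) = XOR_i block[i] * alpha^i
# over GF(2^8). Because GF multiplication distributes over XOR, sig() is
# linear, and sig(parity) == XOR of the data blocks' signatures.

def gf_mul(a, b, poly=0x11B):
    """Multiply in GF(2^8) modulo the given irreducible polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

ALPHA = 0x02  # a fixed nonzero element of GF(2^8)

def signature(block):
    sig, power = 0, 1
    for byte in block:
        sig ^= gf_mul(byte, power)   # byte * alpha^i
        power = gf_mul(power, ALPHA)
    return sig

data = [bytes([7, 1, 4, 2]), bytes([3, 3, 9, 0]), bytes([5, 8, 8, 6])]
parity = bytes(a ^ b ^ c for a, b, c in zip(*data))

# Linearity lets us verify parity from signatures alone:
assert signature(parity) == signature(data[0]) ^ signature(data[1]) ^ signature(data[2])
```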



System Overview

  • If a bad block is detected, we usually can reconstruct its contents with parity / mirrored data.

  • Scrubbing finds the error before it can hurt you.



Modeling Scrubbing

Random Scrubbing: Scrub an S-block at random times (exponentially distributed intervals).

Deterministic Scrubbing: Scrub an S-block at regular intervals.



Modeling Scrubbing

Opportunistic Scrubbing:

Try to scrub when you access the disk anyway.

“Piggyback scrubs on disk accesses”

Efficiency depends on the frequency of accesses.

MTBA: Mean Time Between Accesses (10^3 hours).

Average scrub interval: 10^4 hours.

Block MTBF: 10^5 hours.
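A toy model of opportunistic scrubbing with the numbers above: accesses form a Poisson process with the stated MTBA, and once a scrub is due it simply piggybacks on the next access, paying no power-on penalty. This is an illustrative model, not the paper's simulator:

```python
# Estimate the realized scrub interval when scrubs piggyback on accesses.

import random

random.seed(1)
MTBA = 1e3        # mean time between accesses (hours)
TARGET = 1e4      # desired scrub interval (hours)

t, last_scrub, intervals = 0.0, 0.0, []
while len(intervals) < 20_000:
    t += random.expovariate(1 / MTBA)   # next access arrives
    if t - last_scrub >= TARGET:        # scrub is due: piggyback on this access
        intervals.append(t - last_scrub)
        last_scrub = t

avg = sum(intervals) / len(intervals)
print(f"average realized scrub interval ~ {avg:.0f} hours")
```

By memorylessness the realized interval averages TARGET + MTBA (about 1.1×10^4 hours here): opportunism stretches the scrub interval by roughly one MTBA in exchange for never spinning a disk up just to scrub it.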



Power Cycling and Reliability

  • Turning a disk on or off has a significant impact.

    • Even for disks that move the actuators away from the surface when idle (laptop disks).

  • No direct data to measure impact of Power On Hours (POH).

  • Extrapolate from Seagate data:

    • One on / off cycle is roughly equivalent to running a disk for eight hours.
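This extrapolation folds neatly into a reliability model as effective power-on hours. A small sketch (the 8-hour figure is the slide's estimate; the yearly scenario below is hypothetical):

```python
# Effective wear: running hours plus an 8-hour charge per on/off cycle.

CYCLE_PENALTY_H = 8.0  # equivalent wear per power cycle (slide's extrapolation)

def effective_poh(power_on_hours, on_off_cycles):
    """Equivalent power-on hours of running plus power cycling."""
    return power_on_hours + CYCLE_PENALTY_H * on_off_cycles

# An archival disk spun up once a day for a 1-hour scrub, over one year:
print(effective_poh(365 * 1.0, 365))  # 3285.0 -- cycling contributes 2920 of it
```

For mostly-idle archival disks the cycling term dominates, which is why extra scrub-induced power cycles can hurt reliability more than the scrub I/O itself.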



Determining Scrubbing Intervals

  • Interval too short:

    • Too much scrub traffic ⇒ disks busy ⇒ increased error rate ⇒ lower system MTBF.

  • Interval too long:

    • A failure is more likely to unmask other failures ⇒ more failures are catastrophic ⇒ lower system MTBF.
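The tradeoff on this slide can be caricatured with a two-term risk model: scrub-induced wear falls off as 1/T while the undetected-error window grows with T, so the optimal interval balances the two. All constants below are illustrative, not taken from the paper:

```python
# Toy risk model: risk(T) = wear-per-scrub / T + exposure-rate * T.

import math

WEAR = 8.0        # equivalent wear hours charged per scrub (power-on penalty)
EXPOSURE = 1e-4   # relative risk per hour of undetected-error window

def risk(T):
    """Per-hour risk proxy when scrubbing every T hours."""
    return WEAR / T + EXPOSURE * T

best = min(range(10, 10_000, 10), key=risk)
print(best, math.sqrt(WEAR / EXPOSURE))  # discrete optimum vs closed form
```

Minimizing WEAR/T + EXPOSURE·T gives T* = sqrt(WEAR/EXPOSURE), about 283 hours for these toy constants; the slides' "too short / too long" argument is exactly this interior optimum.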



Determining Scrubbing Intervals

N = 250 disks.

Device MTTF: 5×10^5 hours.

Block failure rate: 10^-5 per hour.

Time to read disk: 4 hours.

Deterministic: without considering power-up effects.

Deterministic with cycling: considering power-up effects.

Opportunistic does not pay power-on penalty, but runs disk longer.

Random does not pay power-on penalty. Random with cycling would be below the deterministic with cycling graph.

(Graph shown for a mirrored reliability block.)



Determining Scrubbing Intervals

Scrub frequently: You never know what you might find.

Mirrored disks using opportunistic scrubbing (no power-on penalty).

Assumes a high disk access rate.



Simulation Results

  • 1PB archival data store.

  • Disks have an MTBF of 10^5 hours.

  • 10,000 disk drives

  • 10GB reliability blocks.

  • ~1TB/day traffic
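These parameters imply a per-drive capacity of 100 GB, which is inferred rather than stated on the slide. A quick consistency check:

```python
# Sanity check of the simulation parameters: 1 PB across 10,000 drives means
# 100 GB per drive, i.e. ten 10 GB reliability blocks each, and ~1 TB/day of
# traffic touches about 0.1% of the store per day.

TOTAL_BYTES = 1e15        # 1 PB
DISKS = 10_000
BLOCK_BYTES = 10e9        # 10 GB reliability blocks
TRAFFIC_BYTES = 1e12      # ~1 TB/day

per_disk = TOTAL_BYTES / DISKS                 # 1e11 bytes = 100 GB
blocks_per_disk = per_disk / BLOCK_BYTES       # 10.0
daily_fraction = TRAFFIC_BYTES / TOTAL_BYTES   # 0.001
print(per_disk, blocks_per_disk, daily_fraction)
```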



Simulation Results

Two-way Mirroring



Simulation Results

RAID 5 redundancy scheme



Simulation Results

Mirroring.

Opportunistic scrubbing with ~three disk accesses per year.

Observe that additional scrubbing leads to more power-on cycles, which slightly increase the occurrence of data loss.



Conclusions

  • We have shown that disk scrubbing is a necessity for very large scale storage systems.

  • Our simulations show the impact of power-on / power-off on reliability.

  • We also note that the lack of published disk-drive reliability numbers hampers public research.

