
Parity Declustering for Continuous Operation in Redundant Disk Arrays


Presentation Transcript


  1. Parity Declustering for Continuous Operation in Redundant Disk Arrays • Mark Holland, Garth A. Gibson

  2. Purpose of Parity Declustering • Parity declustering is designed to balance cost against data reliability and performance during failure recovery. • It improves on standard parity organizations by reducing the additional load on surviving disks during the reconstruction of a failed disk's contents, yielding higher user throughput during recovery and/or a shorter recovery time.

  3. Declustered Parity Layout • RAID 5 is a special case of the declustered parity layout: in RAID 5, G = C (every parity stripe spans all C disks).

  4. Definition of some terms • A data unit is the minimum amount of contiguous user data allocated to one disk before any data is allocated to any other disk. • A parity unit is a block of parity information that is the size of a data stripe unit. • A parity stripe is the set of data units over which a parity unit is computed, plus the parity unit itself. • e.g. In the layout figure, each S is either a data unit or a parity unit; four S's together form one parity stripe.

  5. Example declustered layout • Di.j represents one of the four data units in parity stripe number i, and Pi represents the parity unit for parity stripe i. • The declustering ratio is defined as α = (G-1)/(C-1). It indicates the fraction of each surviving disk that must be read during the reconstruction of a failed disk. • For example, D1.0, D1.1, D1.2, P1 together form one parity stripe, so G = 4, C = 5, and α = 3/4 = 75%. In RAID 5, α = 100%. (A small numeric check follows below.)
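
As a quick check of the ratio defined above, here is a minimal Python sketch (illustrative, not part of the original slides) that computes α for the example layout and for RAID 5:

```python
def declustering_ratio(G: int, C: int) -> float:
    """Declustering ratio: the fraction of each surviving disk that must be
    read to reconstruct a failed disk, alpha = (G - 1) / (C - 1)."""
    return (G - 1) / (C - 1)

print(declustering_ratio(G=4, C=5))  # example layout above -> 0.75
print(declustering_ratio(G=5, C=5))  # RAID 5, where G = C  -> 1.0
```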

  6. Data layout strategy • How should data be laid out in a parity-declustered disk array? Our goals: • 1. Single failure correcting: no two stripe units in the same parity stripe may reside on the same physical disk. • 2. Distributed reconstruction: when any disk fails, its user workload should be evenly distributed across all other disks in the array. • 3. Distributed parity: parity information should be evenly distributed across the array. • 4. Efficient mapping: the function mapping a file system's logical block address to physical disk addresses should be efficiently computable. • 5. Large write optimization: a write that updates all data units of a parity stripe should not need the four-access read-modify-write cycle, since parity can be computed from the new data alone. • 6. Maximal parallelism: a read of contiguous user data should achieve maximal parallelism across disks. • (A quick check of goals 1 and 3 on a candidate layout is sketched below.)
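
The first and third goals can be verified mechanically for a candidate layout. A minimal Python sketch (my own illustration, with an assumed layout representation rather than the paper's data structures):

```python
def check_layout(stripes, parity_disk, num_disks):
    """Check goals 1 and 3 on a candidate layout.

    stripes     : list of parity stripes, each a list of the disk numbers
                  holding that stripe's stripe units (data + parity)
    parity_disk : list giving, for each stripe, the disk holding its parity unit
    num_disks   : C, the number of disks in the array
    """
    # Goal 1 (single failure correcting): no two stripe units of one
    # parity stripe may share a physical disk.
    single_failure_ok = all(len(set(s)) == len(s) for s in stripes)

    # Goal 3 (distributed parity): every disk holds the same number of parity units.
    parity_counts = [parity_disk.count(d) for d in range(num_disks)]
    distributed_parity_ok = len(set(parity_counts)) == 1

    return single_failure_ok, distributed_parity_ok
```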

  7. Layout strategy • The distributed reconstruction criterion requires that the same number of units be read from each surviving disk during the reconstruction of a failed disk. This is achieved if the number of times that a pair of disks contains stripe units from the same parity stripe is constant across all pairs of disks. Such a layout can be implemented using a balanced incomplete block design. • A block design is an arrangement of v distinct objects into b tuples, each containing k elements, such that each object appears in exactly r tuples and each pair of objects appears in exactly λp tuples. (A small parameter check is sketched below.)
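
The r and λp balance conditions can also be checked mechanically. The following Python sketch (illustrative only, not from the paper) computes the parameters of a candidate design, or returns None if the design is not balanced:

```python
from collections import Counter
from itertools import combinations

def block_design_parameters(tuples):
    """Return (v, b, k, r, lambda_p) if `tuples` is a balanced block design,
    otherwise None. `tuples` is a list of equal-sized tuples of objects."""
    objects = sorted({x for t in tuples for x in t})
    v, b, k = len(objects), len(tuples), len(tuples[0])
    if any(len(t) != k for t in tuples):
        return None

    # r: every object must appear in the same number of tuples.
    appearances = Counter(x for t in tuples for x in t)
    r_values = {appearances[x] for x in objects}

    # lambda_p: every pair of objects must appear together equally often.
    pair_counts = Counter(p for t in tuples for p in combinations(sorted(t), 2))
    lambda_values = {pair_counts[p] for p in combinations(objects, 2)}

    if len(r_values) == 1 and len(lambda_values) == 1:
        return v, b, k, r_values.pop(), lambda_values.pop()
    return None

# The complete design of slide 9: v = 5, b = 5, k = 4, r = 4, lambda_p = 3.
print(block_design_parameters([(0,1,2,3), (0,1,2,4), (0,1,3,4), (0,2,3,4), (1,2,3,4)]))
```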

  8. Complete block design • A complete block design is simpler than an incomplete one: it includes all combinations of exactly k distinct elements selected from the set of v objects. The number of these combinations is C(v, k) = v! / (k! (v-k)!). (A short generation sketch follows below.)
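
Since a complete design is just every k-subset of the v objects, it can be generated directly. A small Python sketch (not from the slides) that also reproduces the tuples shown two slides below:

```python
from itertools import combinations
from math import comb

def complete_block_design(v: int, k: int):
    """All k-element tuples drawn from v objects labelled 0..v-1."""
    return list(combinations(range(v), k))

tuples = complete_block_design(v=5, k=4)
print(len(tuples) == comb(5, 4))  # True: C(5, 4) = 5 tuples
print(tuples)
# [(0, 1, 2, 3), (0, 1, 2, 4), (0, 1, 3, 4), (0, 2, 3, 4), (1, 2, 3, 4)]
```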

  9. Example complete block design • In this example, we arrange 5 distinct objects (numbers) into 5 tuples, such that each object appears in exactly 4 tuples and each pair of objects appears in exactly 3 tuples. • e.g. the number 0 appears in 4 tuples; the pair (0,1) appears in tuples 0, 1, 2; the pair (1,4) appears in tuples 1, 2, 4. • The design is complete because it includes all combinations of exactly 4 distinct elements selected from the set of 5 objects.

  10. Layout with complete block design • Tuple 0: 0,1,2,3 • Tuple 1: 0,1,2,4 • Tuple 2: 0,1,3,4 • Tuple 3: 0,2,3,4 • Tuple 4: 1,2,3,4 • If we associate disks with objects (numbers) and parity stripes with tuples, we get the layout shown on the slide. • Although the design is complete, it violates design goal 3: parity is not distributed evenly, and the parity on disk 4 becomes the bottleneck for write operations.

  11. If we duplicate the previous layout G times, assigning parity to a different element of each tuple in each duplication, we obtain the full block design table shown on the slide. (A sketch of this construction is given below.)
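
A minimal Python sketch of this construction (my own illustration of the rotation idea, not the paper's actual layout code): duplicate the complete design G times, move the parity position in each duplication, and confirm that parity is now spread evenly (goal 3):

```python
from itertools import combinations

def full_block_design_layout(C: int, G: int):
    """Duplicate the complete block design G times, rotating which element of
    each tuple holds parity. Returns (data_disks, parity_disk) per parity stripe."""
    base = list(combinations(range(C), G))   # complete block design on C disks
    layout = []
    for dup in range(G):                     # one duplication per parity position
        for t in base:
            parity_disk = t[dup]             # rotate the parity position
            data_disks = [d for d in t if d != parity_disk]
            layout.append((data_disks, parity_disk))
    return layout

layout = full_block_design_layout(C=5, G=4)
parity_per_disk = [sum(1 for _, p in layout if p == d) for d in range(5)]
print(parity_per_disk)  # [4, 4, 4, 4, 4] -> parity is evenly distributed
```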

  12. Problem with full block design • The size of the block design table may be very large, so an efficient mapping is not guaranteed, even though our fourth criterion requires one. • Our fifth and sixth criteria depend on the data mapping function used by higher levels of software. • The large-write optimization is guaranteed. • But parallel reads cannot achieve maximal parallelism: not all sets of five adjacent data units from the mapping (D0.0, D0.1, D0.2, D1.0, D1.1, D1.2, D2.0, etc.) are allocated on five different disks. Reading five adjacent data units starting at data unit 0 uses disks 0 and 1 twice and disks 3 and 4 not at all.

  13. Problem with full block design • In addition, when the number of disks in the array (C) is large relative to the number of stripe units in a parity stripe (G), the full block design cannot be implemented. • e.g. a 41-disk array with 20% parity overhead (G = 5) allocated by a complete block design yields about 3,750,000 tuples. This cannot be implemented, because even large disks rarely have more than a few million sectors. (The arithmetic is sketched below.)
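
A quick check of the count quoted above (my arithmetic, using the full design's G-fold duplication of the complete design):

```python
from math import comb

C, G = 41, 5
complete_tuples = comb(C, G)               # tuples in the complete block design
full_design_stripes = complete_tuples * G  # duplicated G times to rotate parity

print(complete_tuples)      # 749398
print(full_design_stripes)  # 3746990 -> roughly 3,750,000, as quoted above
```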

  14. Balanced incomplete block design • Our goal is to find a small block design on C objects with a tuple size of G. Hall presents a list containing a large number of known block designs, and states that, within the bounds of this list, a solution is given in every case where one is known to exist. • When a balanced incomplete block design with the required parameters is not known, we resort to choosing the closest feasible design point, that is, the point that yields a value of α closest to the one desired.

  15. Balanced incomplete block design • We can choose the closest feasible design point from a subset of Hall's list of designs.

  16. Average response time • These two figures show that, except for writes with α = 0.1, fault-free performance is essentially independent of parity declustering. • Declustering may lead to slightly better average response time in degraded mode than in fault-free mode (a user write may induce only one write access).

  17. Reconstruction performance • Parity declustering gives higher user performance during recovery compared to RAID 5. • The simplest reconstruction involves a single sweep through the contents of the failed disk: for each stripe unit on the replacement disk, the reconstruction process reads all other stripe units in the corresponding parity stripe and computes their exclusive-or; the resulting unit is then written to the replacement disk. • The time needed to entirely repair a failed disk equals the time needed to replace it in the array plus the time needed to reconstruct its entire contents and store them on the replacement. • Continuous-operation systems require data availability during reconstruction. (A sketch of the single-sweep loop follows below.)
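
A minimal Python sketch of the single-sweep reconstruction described above (illustrative only; read_unit and write_unit are hypothetical I/O helpers, not the paper's interfaces):

```python
def reconstruct_failed_disk(layout, failed_disk, replacement_disk, read_unit, write_unit):
    """Single-sweep reconstruction: for every parity stripe with a unit on the
    failed disk, XOR the surviving units and write the result to the replacement.

    layout     : list of parity stripes; each stripe is a list of (disk, offset) pairs
    read_unit  : callable (disk, offset) -> bytes      (hypothetical helper)
    write_unit : callable (disk, offset, data) -> None (hypothetical helper)
    """
    for stripe in layout:
        lost = [(d, off) for d, off in stripe if d == failed_disk]
        if not lost:
            continue                      # this stripe lost no unit
        _, lost_offset = lost[0]          # single-failure correcting: at most one unit lost
        survivors = [(d, off) for d, off in stripe if d != failed_disk]

        data = None
        for d, off in survivors:
            unit = read_unit(d, off)
            data = unit if data is None else bytes(a ^ b for a, b in zip(data, unit))

        write_unit(replacement_disk, lost_offset, data)
```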

  18. Four reconstruction algorithms • Minimal-update algorithm: no extra work is sent to the replacement disk; whenever possible, user writes are folded into the parity unit, and neither reconstruction optimization is enabled. • User-writes algorithm: all user writes explicitly targeted at the replacement disk are sent directly to the replacement. • Redirection of reads: user accesses to data that has already been reconstructed are serviced by (redirected to) the replacement disk, rather than invoking on-the-fly reconstruction as they would if the data were not yet available. • Piggybacking of writes: user reads that cause on-the-fly reconstruction also cause the reconstructed data to be written to the replacement disk. This is targeted at speeding reconstruction. • (Redirection of reads and piggybacking of writes were proposed by Muntz and Lui.)

  19. Comparison of the four algorithms • The test results showed that Muntz and Lui's redirection of reads and redirect-plus-piggyback do not consistently decrease reconstruction time relative to the simpler algorithms. • The reason is that loading the replacement disk with random work penalizes the reconstruction writes to this disk more than off-loading benefits the surviving disks, unless the surviving disks are highly utilized. • Even a small amount of random load imposed on the replacement disk may greatly increase its average access time, because reconstruction writes are otherwise sequential and do not require long seeks.

  20. Conclusion • We demonstrated that: • Parity declustering, a strategy for allocating parity in a single-failure-correcting redundant disk array that trades increased parity overhead for reduced user-performance degradation during on-line failure recovery, can be effectively implemented in array-controlling software. • Using a block design to map parity stripes onto a disk array ensures that both the parity update load and the on-line reconstruction load are balanced over all disks in the array.

  21. Questions • 1. What is parity declustering? • 2. What are the data layout goals? • 3. What is the disadvantage of a complete block design?
