Fast, Inexpensive Content-Addressed Storage in Foundation

November 11, 2008 • Department of Computer Science • 72071834 • Jin-Sung Kim (김진성)

※ Glossary ◎ CAS (Content-Addressed Storage) • Content-addressable storage, also referred to as associative storage or abbreviated CAS, is a mechanism for storing information that can be retrieved based on its content, not its storage location. It is typically used for high-speed storage and retrieval of fixed content, such as documents stored for compliance with government regulations. Roughly speaking, content-addressable storage is the permanent-storage analogue to content-addressable memory. ◎ Venti • Venti is a network storage system that permanently stores data blocks. A 160-bit SHA-1 hash of the data (called a score in Venti) acts as the address of the data. This enforces a write-once policy since no other data block can be found with the same address. The addresses of multiple writes of the same data are identical, so duplicate data is easily identified and the data block is stored only once. Data blocks cannot be removed, making it ideal for permanent or backup storage. Venti is typically used with Fossil to provide a file system with permanent snapshots.

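※ State of Archive Storage ◎ Domestic and International Status • Storage vendors are showing considerable interest in fixed-content data archiving products. This partly reflects growth in the data archiving market itself, but it is also because regulations require archiving solutions • In Korea, archiving solutions are not required by regulation • However, because of a very large project, the Certified Electronic Document Authority (공인전자문서보관소), archive storage based on storage hardware is clearly needed • Viewed as a software solution, it looks much like one component of the existing ECM (Enterprise Content Management) field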

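※ State of Archive Storage ◎ Domestic and International Status • A new archive storage solution company: Caringo (http://www.caringo.com) • A company that released a CAS (content-addressed storage) product relatively recently • Its founder, Paul Carpentier, built a product called FilePool and sold it to EMC • EMC acquired FilePool in 2001 for about 50 million dollars, and FilePool became today's Centera product • Carpentier released a product named CAStor in the fall of last year • Originally CAStor could only be installed and run from a memory stick; as of version 2.0 it can still run from a memory stick, but it can now also be installed on multiple nodes and managed through a central management console, which reportedly makes it a more complete product • Its biggest drawback is that its compatibility matrix with supported content management solutions is very small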

※ State of Archive Storage ◎ Domestic and International Status • The CAS frontier: EMC Centera • At the end of July 2007, EMC announced a refreshed Centera • Focused on hardware, upgrading to nodes with more capacity and better power efficiency • EMC calls these Generation 4 Low-Power nodes • Uses MD5 as its hash algorithm; system performance has improved • HDS as a latecomer • Functionally, it uses a hash algorithm and stores only a single instance, so capacity optimization can be expected • This is said to differ from the approach of EMC's Centera and Caringo's CAStor • To retain fixed content, HCAP keeps a file's original name and assigns numbers to that file or its duplicates, whereas EMC and Caringo replace the original file name entirely. In this respect they take different management approaches to data

※ State of Archive Storage ◎ Domestic and International Status • Other players • Beyond EMC, HDS, and Caringo, other archive storage solution players include HP, IBM, and Nexsan • HP's product is called RISS (Reference Information Storage System) • IBM's product is the DR550 • Nexsan's product is Assureon • On price, Nexsan's product is quite inexpensive • HP and IBM do not seem to be doing particularly well (the opinion of the source's author) • Especially so in Korea. Source: http://koreaceladon.tistory.com/85

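※ Technologies That Will Affect Storage Architecture ◎ Technologies expected to affect storage architecture (Fortune 1000 companies)

※ Technologies That Will Affect Storage Architecture ◎ Technologies expected to affect storage architecture (mid-sized enterprises, MSE)

※ Technologies That Will Affect Storage Architecture ◎ Survey of promising storage companies (Fortune 1000 companies)

※ Technologies That Will Affect Storage Architecture ◎ Survey of promising storage companies (MSE)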

※ Major Web Archiving Cases

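※ Core Technologies for Digital Archiving ◎ Preservation techniques • Technology preservation • Preserves every technology needed to access the original (H/W, S/W, O/S, and the technology needed to run them) • Costly, with many difficult technical problems to solve • Emulation • Guarantees access to content by creating, in software, an environment that can render the encoded content even when the technical conditions that applied to the digital original have changed • The approach examined by IBM and the CAMiLEON project • Still has many open problems • Because it closely mimics the H/W and S/W, it can preserve the original as-is, without changing its format or encoding • The technique must be applied so that emulated reproduction of the original has a high success rate on the target computing platform, and it must remain cost-effective even when handling many file formats • Sufficient metadata for emulation must be maintained to support the use of emulation S/W and systems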

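※ Core Technologies for Digital Archiving ◎ Preservation techniques (cont.) • Virtual machine software • A variant of emulation that attempts to solve the problem by writing programs so that content interpretation can be carried out in the machine language of a future Universal Virtual Computer (UVC) • The UVC is a platform designed by IBM • In theory it can be regarded as the only strategy that addresses the long-term issues, but it is hard to implement • Analog backup • Converts digital objects to analog form, a medium with high durability • For example, printing them out, or converting digital image files to microform • A rather old-fashioned approach; it preserves the content of digital objects and can cope with technological obsolescence • Information about functionality or associated behavior may be lost • Migration on ingest • Migration on request • Encapsulation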

“Digital Dark Ages?” • Users increasingly store their most valuable data digitally • Wedding/baby photographs • Letters (now called email) • Diaries, scrapbooks, tax returns • Yet digital information remains especially vulnerable • Terry Kuny: “We are living in the midst of digital Dark Ages” • Hard drives crash • Removable media evolve (e.g., 5 ¼” floppies) • File formats become obsolete (e.g., WordStar, Lotus 1-2-3) • What will the world remember of the late 20th century?

As a community, we’re not bad at storing important data over the long term. We’ve only just begun to think about how we’ll interpret that data 30 years from now.

For Example… • Viewing an old PowerPoint presentation • Do we still have PowerPoint at all? And Windows? • Does the presentation use non-standard fonts/codecs? • Has some newer application overwritten a shared library with an incompatible version (“DLL Hell”)? • Not just a Microsoft problem: consider a web page • Even current IE/Safari/Firefox don’t agree on formatting • All kinds of plugins necessary: sound, video, Flash

The Foundation Idea • Make daily backups of entire software stack • Archives users’ applications, OS, and configuration state • Don’t worry about identifying dependencies • Just save it all: “Every byte, every night” • To recover an obscure file, boot the relevant stack in an emulator • View file with the application that created it

Foundation FAQ • Why preserve the entire disk? • Preserve software stack dependencies: preserve the data with the right application, libraries, and operating system as a single unit • Works for all applications, not just ones designed for preservation • Why daily images? • Want to preserve machine state as close as possible to last write of user’s data (i.e., preserve image before something changes) • Also allows recovery from user errors • Why emulate hardware? • Much better track record than emulating software • Software example: OpenOffice emulating Microsoft Word (yikes) • Hardware emulators available today for Amiga, PDP-11, Nintendo…

I would love to give a talk about why Foundation is a great solution to the digital preservation problem. Really, though, I think it’s just a pretty good start. Instead, I’m going to talk about a fun problem we had to solve to make it work.

Every Byte, Every Night? Indefinitely? Really? • Plan 9 did exactly that • Archive changed blocks every night to optical jukebox • Found that storage capacity grew faster than usage • Later with Content-Addressable Storage (Venti) • Automatically coalesces duplicate data to save space • Required multiple, high-speed disks for performance • Challenge for Foundation: provide similar storage efficiency on consumer hardware • “Time Machine model”: one external USB drive

Venti Review • Plan 9 file system was two-level • Spinning storage, mostly a normal file system • Archival storage, optical write-once jukebox • Venti replaced optical jukebox • Still write-once • Chunks of data named by their SHA-1 hashes: “Content-Addressable Storage (CAS)” • Automatically coalesces duplicate writes
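
The naming scheme on this slide can be sketched in a few lines of Python. This is a toy under stated assumptions, not Venti's actual implementation: `TinyCAS`, `put`, and `get` are hypothetical names, and the log and index live in memory instead of on disk.

```python
import hashlib


class TinyCAS:
    """Toy write-once store: each chunk is named by the SHA-1 of its contents."""

    def __init__(self):
        self.log = []      # append-only data log (hypothetical in-memory stand-in)
        self.index = {}    # SHA-1 digest -> position in the log

    def put(self, chunk: bytes) -> bytes:
        score = hashlib.sha1(chunk).digest()
        if score not in self.index:          # duplicate writes are coalesced
            self.index[score] = len(self.log)
            self.log.append(chunk)
        return score

    def get(self, score: bytes) -> bytes:
        return self.log[self.index[score]]


store = TinyCAS()
a = store.put(b"hello")
b = store.put(b"hello")   # same contents, so same name; stored only once
c = store.put(b"world")
```

Because the name is derived from the contents, writing the same chunk twice returns the same 20-byte name and leaves the log unchanged.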

Venti Review: Archival Process [Diagram: each block read from the user's hard drive is hashed and checked against the Hash → Offset index ("seen it before?"). A new block is appended to the data log and the index is updated; a duplicate block causes no log write. In both cases the block's hash is appended to the summary. Components shown: User's Hard Drive, RAM, External USB Drive.]

Venti Review: Restore Process [Diagram: after a crash, the hash of each block is looked up in the summary, mapped to a log offset through the Hash → Offset index, and the block is read from the data log and restored to the user's hard drive. Final step (not shown): archive summary in data log as well. Components shown: User's Hard Drive, RAM, External USB Drive.]

Notes on Venti • The Good News: • CAS stores each block with particular contents only once • Changing any one block and re-archiving uses only one more block in archive • Adding a duplicate file from a different source uses no additional storage • The Bad News: • Synchronous, random reads to on-disk index

Venti Review: Archival Process [Diagram: checking whether the 4th block has been seen before requires a seek to the right bucket of the Hash → Offset index. Components shown: User's Hard Drive, RAM, External USB Drive.]

Venti Review: Restore Process [Diagram: mapping the hash of the 1st block to a log offset requires a seek to the right bucket of the Hash → Offset index. Components shown: User's Hard Drive, RAM, External USB Drive.]

Notes on Venti • The Good News: • CAS stores each block with particular contents only once • Changing any one block and re-archiving uses only one more block in archive • Adding a duplicate file from a different source uses no additional storage • The Bad News: • Synchronous, random reads to on-disk index • Best case, one-disk performance for 512-byte blocks: one 5 ms seek per 512 bytes archived = 100 kB/s • That’s 12 days to archive a 100 GB disk! • Larger blocks give better throughput, less sharing
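
The arithmetic behind these figures can be checked with a short script. It is a back-of-the-envelope sketch: the 5 ms seek and 512-byte block come from the slide, while reading "100 GB" as binary gigabytes and the helper names are assumptions made here.

```python
SEEK_MS = 5                  # one 5 ms seek per block, as on the slide
DISK_BYTES = 100 * 2**30     # "100 GB" read as binary gigabytes (assumption)


def throughput_bytes_per_s(block_bytes, seek_ms=SEEK_MS):
    """Best-case archive rate when every block costs exactly one seek."""
    return block_bytes * 1000 / seek_ms


def days_to_archive(block_bytes, disk_bytes=DISK_BYTES):
    """Time to archive the whole disk at that rate, in days."""
    return disk_bytes / throughput_bytes_per_s(block_bytes) / 86400


for block in (512, 4096, 65536):
    rate_kb = throughput_bytes_per_s(block) / 1000
    print(f"{block:6d}-byte blocks: {rate_kb:8.1f} kB/s, "
          f"{days_to_archive(block):6.2f} days for 100 GB")
```

For 512-byte blocks this gives roughly 100 kB/s and about 12 days, matching the slide. The larger block sizes in the loop show the throughput side of the trade-off only; the loss of sharing is not modeled.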

Notes on Venti (cont.) • Venti’s solution: use 8 high-speed disks for index • Untenable in consumer space • Wears disks out pretty quickly, too • The “compare-by-hash” controversy: • Fear of hash collisions: two different blocks with the same hash break Venti • May be very unlikely, but cost (data corruption) is huge • Does CAS really require a cryptographically strong hash?

Making Inexpensive CAS Fast • The problem: disk seeks • Secure hash randomizes an otherwise sequential disk-to-disk transfer • To reduce seeks, must reduce hash table lookups • When do hash table lookups occur? • When writing data, to determine if we’ve seen it before • When writing data, to update the index • When reading data, to map hashes to disk locations

2. Updating the Index • After appending a block to the data log, must update the index • Pseudorandom hash causes a seek

Updating the Index: Archival Process [Diagram: after the 2nd block is appended to the data log, updating the Hash → Offset index requires a seek to the right bucket. Components shown: User's Hard Drive, RAM, External USB Drive.]

2. Updating the Index • After appending a block to the data log, must update the index • Pseudorandom hash causes a seek • Easy to fix: use a write-back index cache • Store index writes in memory • Flush to disk sequentially in large batches • On crash, reconstruct index from the data log
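
A minimal sketch of the write-back idea, assuming an in-memory dictionary stands in for the on-disk index. `WriteBackIndex`, `rebuild_index`, and the batch size are hypothetical names and values chosen for illustration, not Foundation's code.

```python
import hashlib


class WriteBackIndex:
    """Toy index that buffers updates in RAM and flushes them in sorted batches."""

    def __init__(self, batch_size=4):
        self.on_disk = {}        # stand-in for the on-disk hash table
        self.pending = {}        # index writes buffered in memory
        self.batch_size = batch_size
        self.flushes = 0

    def insert(self, digest, offset):
        self.pending[digest] = offset
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Visiting digests in sorted order models one sequential pass over
        # the index buckets instead of one seek per entry.
        for digest in sorted(self.pending):
            self.on_disk[digest] = self.pending[digest]
        self.pending.clear()
        self.flushes += 1

    def lookup(self, digest):
        if digest in self.pending:
            return self.pending[digest]
        return self.on_disk.get(digest)


def rebuild_index(data_log):
    """Crash recovery: rescan the data log and recompute hash -> offset."""
    index = {}
    for offset, block in enumerate(data_log):
        index.setdefault(hashlib.sha1(block).digest(), offset)
    return index


log = [b"block-a", b"block-b", b"block-c"]
idx = WriteBackIndex(batch_size=2)
for offset, block in enumerate(log):
    idx.insert(hashlib.sha1(block).digest(), offset)
```

Losing `pending` in a crash is safe in this model because every entry can be recomputed from the data log, which is what `rebuild_index` does.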

3. Mapping Hashes to Disk Locations During Reads • To restore disk • Start with the list of original blocks’ hashes • Look up each block in index • Read block from data log and restore to disk

Restore Process [Diagram: mapping the hash of the 1st block to a log offset requires a seek to the right bucket of the Hash → Offset index. Components shown: User's Hard Drive, RAM, External USB Drive.]

3. Mapping Hashes to Disk Locations During Reads • To restore disk • Start with the list of original blocks’ hashes • Look up each block in index • Read block from data log and restore to disk • Observation: data log is mostly ordered • Duplicate blocks often occur as part of duplicate files

Ordering in Data Log [Diagram: the Hash → Offset index, the summary, and the data log, illustrating that blocks listed consecutively in the summary mostly sit at consecutive offsets in the data log. Components shown: User's Hard Drive, RAM, External USB Drive.]

3. Mapping Hashes to Disk Locations During Reads • To restore disk • Start with the list of original blocks’ hashes • Look up each block in index • Read block from data log and restore to disk • Observation: data log is mostly ordered • Duplicate blocks often occur as part of duplicate files • Idea: add another index, ordered by log offset • Read-ahead in this index to eliminate future lookups in original index
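
The read-ahead idea can be illustrated with a toy model that counts index seeks. Everything here is an assumption for illustration: the class name `PrefetchingIndex`, the read-ahead window, the unbounded cache, and the rule that each miss costs one seek in each index.

```python
class PrefetchingIndex:
    """Toy model of a primary hash->offset index plus an offset->hash index."""

    def __init__(self, primary, by_offset, readahead=4):
        self.primary = primary        # hash -> log offset (one seek per lookup)
        self.by_offset = by_offset    # list indexed by log offset -> hash
        self.readahead = readahead
        self.cache = {}               # prefetched hash -> offset entries
        self.primary_seeks = 0
        self.secondary_seeks = 0

    def lookup(self, digest):
        if digest in self.cache:      # served from read-ahead, no seek
            return self.cache[digest]
        self.primary_seeks += 1
        offset = self.primary[digest]
        # One seek into the secondary index, then a sequential read of the
        # entries for the next few log offsets.
        self.secondary_seeks += 1
        end = min(offset + 1 + self.readahead, len(self.by_offset))
        for off in range(offset + 1, end):
            self.cache[self.by_offset[off]] = off
        return offset


hashes = [f"h{i}" for i in range(8)]              # hash stored at log offset i
primary = {h: off for off, h in enumerate(hashes)}
index = PrefetchingIndex(primary, hashes, readahead=3)
offsets = [index.lookup(h) for h in hashes]       # restore in original order
```

With a perfectly ordered log of eight blocks and a read-ahead of three, only the lookups for `h0` and `h4` touch the primary index; the other six are answered from prefetched entries.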

Index by Offset: Restore Process [Diagram: a new secondary index, sorted by offset, maps Offset → Hash. Restoring the 1st block after a crash: look up its hash in the summary, map the hash to a log offset in the Hash → Offset index (seek!), read the block from the log (seek!), then prefetch the hashes for the next few offsets from the secondary index (seek!). Restoring the 2nd block: find its log offset in the prefetched secondary-index entries (no seek!) and read the block from the log (no seek!). Components shown: User's Hard Drive, RAM, External USB Drive.]

1. Is a Block New, or Duplicate? • Optimization for reads also helps duplicate writes • Index misses on first duplicate block • Hits on subsequent blocks rewritten in same order • Doesn’t help for new data • Every lookup in primary index fails • Still suffer a seek for every new block

1. Is a Block New, or Duplicate? • Idea: use a Bloom filter to identify new blocks • Lossy representation of the primary index • Uses much less memory than index itself • For any given block, Bloom filter tells us: • It’s definitely new → append to log, update index • It might be duplicate → look up in index • If it really is a duplicate, we get the prefetch benefit • Otherwise, called a “false positive” • Using enough memory keeps false positives at ~1%
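
A minimal Bloom filter sketch shows the two possible answers. The bit-array size, the number of hash functions, and the way positions are derived from salted SHA-1 digests are illustrative assumptions, not Foundation's parameters.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: 'definitely new' or 'might be a duplicate'."""

    def __init__(self, num_bits=1 << 16, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive several bit positions by salting the item with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


seen = BloomFilter()
for i in range(1000):
    seen.add(f"block-{i}".encode())

# Blocks never added: any True answer here is a false positive.
false_positives = sum(seen.might_contain(f"other-{i}".encode())
                      for i in range(1000))
```

A `False` answer is always correct, so a new block can be appended without touching the on-disk index. A `True` answer still needs an index lookup, and the memory given to the filter controls how often that lookup turns out to be wasted.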

Results • Do these optimizations pay off? • Buffering index writes is an obvious win • Bloom filter is, too: removes 99% of seeks when writing new data • Both trade RAM for seeks • Benefit of secondary index less clear • If duplicate data comes in long sequences, it reduces index seeks to two per sequence • If duplicate data comes in little fragments, it doubles the number of index seeks • Need traces of real data to answer this question
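
The two cases for the secondary index can be made concrete with a deliberately simplified seek model. It assumes one primary-index seek per duplicate block without the secondary index, and two index seeks per run of consecutive duplicates with it (read-ahead is assumed to cover the whole run); the function name is hypothetical.

```python
def index_seeks(run_lengths, use_secondary):
    """Index seeks needed to process runs of consecutive duplicate blocks."""
    if use_secondary:
        # One primary lookup plus one secondary read-ahead per run.
        return 2 * len(run_lengths)
    # One primary lookup per block.
    return sum(run_lengths)


long_runs = [1000, 1000]      # two long sequences of duplicate data
fragments = [1] * 100         # one hundred isolated duplicate blocks

for name, runs in (("long runs", long_runs), ("fragments", fragments)):
    print(name, index_seeks(runs, False), index_seeks(runs, True))
```

In this model the long runs drop from 2000 index seeks to 4, while the fragments rise from 100 to 200, which is why real traces are needed to decide.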

Results (cont.) • Research group at MIT has been running Venti as its backup server for two years • We looked at 400 nightly snapshots • Simulated archiving and restoring these in both Venti and Foundation

Eliminating “Compare by Hash” • Some worried that same SHA-1 doesn’t imply same contents (i.e., hash collisions are possible) • Even if very rare, consequences (corruption) too great • Stepping back a bit, CAS as a black box: • Give it a data block, get back an opaque ID • Give it an opaque ID, get back the data block • Do we care that the ID is a SHA-1 hash? • What if the “opaque” ID was just the block’s location in the data log?

Using Locations As IDs • Pros + Reads require no index lookups at all + System can still find potential duplicates using hashing (with a weaker, faster hash function) • Cons • Need another mechanism to check integrity • Since hash untrusted, must compare suspected duplicates byte-by-byte • Others have claimed these byte-by-byte comparisons are a non-starter
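
A small sketch of location-based IDs, assuming CRC-32 as the weaker, faster hash (the slide does not name one) and an in-memory list as the data log. `LocationCAS` and its methods are hypothetical names.

```python
import zlib


class LocationCAS:
    """Toy store whose block IDs are positions in the data log."""

    def __init__(self, weak_hash=zlib.crc32):
        self.log = []             # data log; a block's ID is its index here
        self.candidates = {}      # weak hash -> log positions with that hash
        self.weak_hash = weak_hash

    def put(self, block: bytes) -> int:
        weak = self.weak_hash(block)
        for pos in self.candidates.get(weak, []):
            if self.log[pos] == block:     # byte-by-byte check; hash untrusted
                return pos
        self.log.append(block)
        pos = len(self.log) - 1
        self.candidates.setdefault(weak, []).append(pos)
        return pos

    def get(self, block_id: int) -> bytes:
        return self.log[block_id]          # no index lookup on reads


store = LocationCAS()
first = store.put(b"photo")
again = store.put(b"photo")        # duplicate found via weak hash + compare
other = store.put(b"letter")

# Worst case for the weak hash: every block collides, yet nothing is confused.
colliding = LocationCAS(weak_hash=lambda block: 0)
x = colliding.put(b"aaa")
y = colliding.put(b"bbb")
```

Because suspected duplicates are compared in full, a collision in the weak hash only costs an extra comparison; it cannot return the wrong data.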

2nd Disk Arm to the Rescue • Once we eliminate most index reads (via our previous optimizations), the backup disk is otherwise idle while backing up duplicate data • Can instead put it to work doing byte-by-byte comparisons of suspected duplicates

Related Work • Apple Time Machine • Duplicates coalesced at file level via hard links • NetApp WAFL, ZFS • Copy-on-write coalesces blocks at the FS level • Misses duplicates that come into system separately • Data Domain Deduplication FS • Very similar to Foundation, in enterprise context • Depends on collision-freeness of hash function • Lots of other Content-Addressed Storage work • LBFS, SUNDR, Peabody

Conclusions • Consumer-grade CAS works now • A single, external USB drive is enough • Just have to be crafty about avoiding seeks • Lots of uses other than preservation • E.g., inexpensive household backup server that automatically coalesces duplicate media collections • Doesn’t require a collision-free hash function

Thank You! www.dkucti.com

※ Paper Goals and Progress ◎ Paper content
