Optimizing ZFS for Block Storage
    Presentation Transcript
    1. Optimizing ZFS for Block Storage Will Andrews, Justin Gibbs Spectra Logic Corporation

    2. Talk Outline • Quick Overview of ZFS • Motivation for our Work • Three ZFS Optimizations • COW Fault Deferral and Avoidance • Asynchronous COW Fault Resolution • Asynchronous Read Completions • Validation of the Changes • Performance Results • Commentary • Further Work • Acknowledgements

    3. ZFS Feature Overview • File System/Object store + Volume Manager + RAID • Data Integrity via RAID, checksums stored independently of data, and metadata duplication • Changes are committed via transactions allowing fast recovery after an unclean shutdown • Snapshots • Deduplication • Encryption • Synchronous Write Journaling • Adaptive, tiered caching of hot data

    4. Simplified ZFS Block Diagram. [Diagram: a presentation layer (ZFS POSIX Layer, ZFS Volumes, Lustre, CAM Target Layer) provides file, block, or object access; beneath it, the Data Management Unit handles TX management, object coherency, objects, and caching (the Spectra optimizations live here); at the bottom, the Storage Pool Allocator implements layout policy: volumes, RAID, snapshots, and the I/O pipeline; zfs(8) and zpool(8) provide configuration and control.]

    5. ZFS Records or Blocks • ZFS’s unit of allocation and modification is the ZFS record. • Records range from 512B to 128KB. • The checksum for each record is verified when the record is read, ensuring data integrity. • The checksum for a record is stored in the parent record (indirect block or DMU node) that references it, which is itself checksummed.

    6. Copy-on-Write, Transactional Semantics • ZFS never overwrites a currently allocated block • A new version of the storage pool is built in free space • The pool is atomically transitioned to the new version • Free space from the old version is eventually reused • Atomicity of the version update is guaranteed by transactions, just as in databases.

    7. ZFS Transactions • Each write is assigned a transaction. • Transactions are written in batches called “transaction groups” (TXGs) that aggregate the I/O into sequential streams for optimum write bandwidth. • TXGs are pipelined to keep the I/O subsystem saturated: • Open TXG: current version of objects; most changes happen here. • Quiescing TXG: waiting for writers to finish changes to in-memory buffers. • Syncing TXG: buffers being committed to disk.

    8. Copy on Write In Action. [Diagram: a write to a data block allocates new copies of that block and of every indirect block and DMU node above it, up to a new überblock at the root of the storage pool; indirect linkage allows object expansion; untouched data blocks are shared with the old version.]

    9. Tracking Transaction Groups • DMU Buffer (DBUF): metadata for ZFS blocks being modified. • Dirty Record: syncer information for committing the data. [Diagram: one DMU buffer with three dirty records and their record data, one each in the open, quiescing, and syncing TXGs; the open TXG holds the current object version.]

    10. Performance Demo

    11. Performance Analysis: when we write an existing block, we must mark it dirty…

    void
    dbuf_will_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
    {
    	int rf = DB_RF_MUST_SUCCEED | DB_RF_NOPREFETCH;

    	ASSERT(tx->tx_txg != 0);
    	ASSERT(!refcount_is_zero(&db->db_holds));

    	DB_DNODE_ENTER(db);
    	if (RW_WRITE_HELD(&DB_DNODE(db)->dn_struct_rwlock))
    		rf |= DB_RF_HAVESTRUCT;
    	DB_DNODE_EXIT(db);
    	/* This read blocks the writer until any old data is in memory. */
    	(void) dbuf_read(db, NULL, rf);
    	(void) dbuf_dirty(db, tx);
    }

    12. Doctor, it hurts when I do this… • Why does ZFS read on writes? • ZFS records are never overwritten directly • Any missing old data must be read before the new version of the record can be written • This behavior is a COW fault • Observations: • Block consumers (databases, disk images, FC LUNs, etc.) are always overwriting existing data • Why read data in a sequential workload when you are destined to discard it? • Why force the writer to wait while data is read?

    13. Optimization #1: Deferred Copy-on-Write Faults. How hard can it be? (Famous last words.)

    14. DMU Buffer State Machine (Before). [Diagram: UNCACHED moves to READ when a read is issued; READ moves to CACHED when the read completes; UNCACHED moves to FILL on a full block write; FILL moves to CACHED when the copy completes; CACHED moves to EVICT on truncate or teardown.]

    15. DMU Buffer State Machine (After). [Diagram: the state machine extended with PARTIAL and PARTIAL|FILL states so that partially written, uncached buffers can defer COW fault resolution.]

    16. Tracking Transaction Groups. [Diagram: open TXG; the DMU buffer is UNCACHED and holds one dirty record with no record data yet.]

    17. Tracking Transaction Groups. [Diagram: a writer begins filling the record; the buffer enters PARTIAL|FILL and record data is attached to the dirty record.]

    18. Tracking Transaction Groups. [Diagram: the write completes without covering the whole record; the buffer remains PARTIAL.]

    19. Tracking Transaction Groups. [Diagram: the TXG moves to quiescing while a new open TXG accumulates a second dirty record; the buffer is still PARTIAL.]

    20. Tracking Transaction Groups. [Diagram: with dirty records in the open, quiescing, and syncing TXGs, the syncer begins processing the record; the buffer is still PARTIAL.]

    21. Tracking Transaction Groups. [Diagram: the syncer dispatches a synchronous read into a separate read buffer to resolve the COW fault; the buffer enters the READ state.]

    22. Tracking Transaction Groups. [Diagram: the synchronous read returns and its data is merged with the partially filled record data.]

    23. Tracking Transaction Groups. [Diagram: after the merge the buffer is CACHED and the syncer can write the record.]

    24. Optimization #2: Asynchronous Fault Resolution

    25. Issues with Implementation #1 • The syncer stalls due to the synchronous resolve behavior. • Resolving reads that are already known to be needed are delayed (for example, when a modified version of the record is created in a new TXG). • Writers should be able to cheaply start the resolve process without blocking. • The syncer should operate on multiple COW faults in parallel.

    26. Complications • Split Brain • A ZFS record can have multiple personalities: for example, a write, a truncate, and another write, all in flight at the same time as a resolving read. • The term reflects how dealing with this issue made us feel. • Chaining the syncer’s write to the resolving read • The read may have been started ahead of syncer processing because a writer noticed that resolution was necessary.

    27. Optimization #3: Asynchronous Reads

    28. ZFS Block Diagram. [Diagram: consumers above the Data Management Unit (ZFS POSIX Layer, ZFS Volumes, Lustre, CAM Target Layer) use thread-blocking semantics; the DMU talks to the Storage Pool Allocator using callback semantics; zfs(8) and zpool(8) provide configuration and control.]

    29. Asynchronous DMU I/O • Goal: get as much I/O in flight as possible • Uses Thread-Local Storage (TLS) to: • Avoid lock order reversals • Avoid modifying APIs just to pass down a queue • Avoid lock overhead, since the storage is per-thread • Refcounting while issuing I/Os ensures the callback is not called until the entire I/O completes

    30. Results

    31. Bugs, bugs, bugs… • Deadlocks • Data corruption • Missed events • Sleeping while holding non-sleepable locks • Wrong arguments to bcopy • Page faults • Bad comments • Unprotected critical sections • Incorrect refcounting • Invalid state machine transitions • Memory leaks • Split-brain conditions • Insufficient interlocking • Disclaimer: this is not a complete list.

    32. Validation • ZFS has many complex moving parts • Simply thrashing a ZFS pool is not a sufficient test • Many hidden parts use the DMU layer without being directly involved in data I/O, if at all • Extensive modifications to the DMU layer require thorough verification • Every object in ZFS uses the DMU layer to support its transactional nature

    33. Testing, testing, testing… • Many more asserts added • Solaris Test Framework ZFS test suite • Extensively modified to (mostly) pass on FreeBSD • Has ~300 tests, needs more • ztest: Unit (ish) test suite • Element of randomization requires multiple test runs • Some test frequencies increased to verify fixes • xdd: Performance tests • Finds bugs involving high workloads

    34. Cleanup & refactoring • DMU I/O APIs rewritten to allow issuing async IOs, minimize hold/release cycles, & unify API for all callers • DBUF dirty restructured • Now looks more like a checklist than an organically grown process • Broken apart to reduce complexity and ease understanding of its many nuances

    35. Performance Results (almost) • It goes 3-10X faster, without breaking anything! • Results that follow are for the following configuration: • RAIDZ2 of four 2TB SATA drives on a 6Gb LSI SAS HBA • Xen HVM DomU with 4GB RAM and 4 cores of a 2GHz Xeon • 10GB ZVOL, 128KB record size • Care taken to avoid cache effects

    36. Commentary • Commercial consumption of open source works best when it is well written and documented • Drastically improved comments, code readability • Community differences & development choices • Sun had a small ZFS team that stayed together • FreeBSD has a large group of people who will frequently work on one area and move on to another • Clear coding style, naming conventions, & test cases are required for long-term maintainability

    37. Further Work • Apply deferred COW fault optimization to indirect blocks • Uncached metadata still blocks writers and this can cut write performance in half • Required indirect blocks should be fetched asynchronously • Eliminate copies and allow larger I/O cluster sizes in the SPA clustered I/O implementation • Improve read prefetch performance for sequential read workloads • Hybrid RAIDZ and/or more standard RAID 5/6 transform • All the other things that have kept Kirk working on file systems for 30 years.

    38. Acknowledgments • Sun’s original ZFS team for developing ZFS • Pawel Dawidek for the FreeBSD port • HighCloud Security for the FreeBSD port of the STF ZFS test suite • Illumos for continuing open source ZFS development • Spectra Logic for funding our work

    39. Questions? Preliminary Patch Set: http://people.freebsd.org/~will/zfs/