
Conquest: Preparing for Life After Disks

An-I Andy Wang, Geoff Kuenning, Peter Reiher, Gerald Popek


Presentation Transcript


  1. Conquest: Preparing for Life After Disks • An-I Andy Wang, Geoff Kuenning, Peter Reiher, Gerald Popek

  2. Conquest Overview • File systems are optimized for disks • Performance problem • Complexity • Now we have tons of inexpensive RAM • What can we do with that RAM?

  3. Conquest Approach • Combine disk and persistent RAM (e.g., battery-backed RAM) in a novel way • Simplification • > 20% fewer semicolons than ext2, reiserfs, and SGI XFS • Performance (under popular benchmarks) • 24% to 1900% faster than LRU disk caching

  4. Outline of the Talk • Motivation • Conquest design (high level) • Conquest components • Performance evaluation • Conclusion Motivation– Conquest Design – Conquest Components – Performance Evaluation – Conclusion

  5. Motivation • Most file systems are built for disks • Problems with the disk assumption: • Performance • Complexity

  6. Hardware Evolution • [Figure: accesses per second (log scale, 1 KHz to 1 GHz) versus year, 1990 to 2000] • CPU (50% /yr) • Memory (50% /yr) • Disk (15% /yr) • Gap annotations: (1 sec : 6 days), (1 sec : 3 months)
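
The growth rates on this slide compound, so the relative gap between memory and disk widens every year. A minimal sketch of that arithmetic follows; the function name is hypothetical, and reading the two annotations as the gap at the start and end of the decade is an assumption, not something the transcript states.

```python
def relative_gap_growth(fast_rate: float, slow_rate: float, years: int) -> float:
    """Factor by which the fast/slow speed ratio grows over `years`.

    Rates are yearly improvements expressed as fractions (0.50 for 50% /yr).
    """
    return ((1.0 + fast_rate) / (1.0 + slow_rate)) ** years


if __name__ == "__main__":
    # Memory at 50% /yr versus disk at 15% /yr over the 10 years on the x-axis.
    print(relative_gap_growth(0.50, 0.15, 10))
```

Over ten years the ratio grows by a factor of about 14, which is close to the ratio between 3 months and 6 days (about 15).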

  7. Inside Pandora’s Box • [Figure: disk arm and disk platters] • Access time = seek time (disk arm) + rotational delay (disk platter) + transfer time
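
The access-time formula on this slide can be turned into a small calculator. This is only a sketch: the function name and all drive parameters are hypothetical, and it assumes the average rotational delay is half a revolution, which the slide does not state.

```python
def disk_access_time_ms(seek_ms: float, rpm: float,
                        transfer_bytes: int, bandwidth_mb_per_s: float) -> float:
    """Access time = seek time + rotational delay + transfer time, in ms."""
    # Assumption: on average the platter must turn half a revolution.
    rotational_ms = 0.5 * 60_000.0 / rpm
    # Transfer time for the requested bytes at the given sustained bandwidth.
    transfer_ms = transfer_bytes / (bandwidth_mb_per_s * 1_000_000) * 1000.0
    return seek_ms + rotational_ms + transfer_ms


if __name__ == "__main__":
    # Hypothetical drive: 8 ms seek, 7200 RPM, 20 MB/s, one 4 KB block.
    print(disk_access_time_ms(8.0, 7200, 4096, 20.0))
```

With these made-up numbers the mechanical terms (seek and rotation) dominate; the transfer of a single 4 KB block contributes only a fraction of a millisecond.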

  8. Disk Optimization Methods • Disk arm scheduling • Group information on disk • Disk readahead • Buffered writes • Disk caching • Data mirroring • Hardware parallelism

  9. Complexity Bytes • Predictive readahead • Synchronization • Cache replacement • Elevator algorithm • Data clustering • Data consistency • Asynchronous write

  10. Storage Media Alternatives • [Figure: $/MB (log scale) versus accesses/sec (log scale)] • Media plotted: magnetic RAM?, battery-backed DRAM, (write once) flash memory, disk, tape • Region labeled: persistent RAM [Caceres et al., 1993; Hillyer et al., 1996; Qualstar 1998; Tanisys 1999; Micron Semiconductor Products 2000; Quantum 2000]

  11. Price Trend of Persistent RAM • [Figure: $/MB (log scale) versus year, 1995 to 2005] • Curves: paper/film, persistent RAM, 1" HDD, 2.5" HDD, 3.5" HDD • Annotations: booming of digital photography; 4 to 10 GB of persistent RAM [Grochowski 2000]

  12. Old Order; New World • Disk will stay around • Cost, capacity, power, heat • RAM as a viable storage alternative • PDAs, digital cameras, MP3 players • More architectural changes due to RAM • A big assumption change from disk • Rethink data structures, interfaces, applications

  13. Getting a Fresh Start • What does it take to design and build a system that assumes ample persistent RAM as the primary storage medium?

  14. Conquest Design • Design and build a disk/persistent-RAM hybrid file system • Deliver all file system services from memory, with the exception of high-capacity storage • Two separate data paths to memory and disk • Benefits: • Simplicity • Performance

  15. Simplicity • Remove disk-related complexities for most files • Make things simpler for disk as well • Less complexity • Fewer bugs • Easier maintenance • Shorter data paths

  16. Performance • Overall • All management performed in memory • Memory data path • No disk-related overhead • Disk data path • Faster speed due to simpler access models

  17. Conquest Components • Media management • Metadata representation • Directory service • Allocation service • Persistence support • Resiliency support

  18. User Access Patterns • Small files • Take little space (10%) • Represent most accesses (90%) • Large files • Take most space • Mostly sequential accesses • Not characteristic of database applications [Iram 1993; Douceur et al., 1999; Roselli et al., 2000]

  19. Files Stored in Persistent RAM • Small files (< 1MB) • No seek time or rotational delays • Fast byte-level accesses • Contiguous allocation • Metadata • Fast synchronous update • No dual representations • Executables and shared libraries • In-place execution
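
The placement rule on this slide can be sketched as a single decision function. The names below are hypothetical, and the exact policy (for example, how Conquest treats a file that grows past the threshold) is not given in the transcript; only the 1 MB limit and the three categories come from the slide.

```python
SMALL_FILE_LIMIT = 1 << 20  # 1 MB threshold quoted on the slide


def choose_medium(size_bytes: int, is_metadata: bool = False,
                  is_executable: bool = False) -> str:
    """Pick the storage medium for an object under the slide's rule."""
    # Metadata, executables/shared libraries, and small files live in RAM.
    if is_metadata or is_executable or size_bytes < SMALL_FILE_LIMIT:
        return "persistent_ram"
    # Everything else is large-file content and goes to disk.
    return "disk"


if __name__ == "__main__":
    print(choose_medium(4096))              # small file
    print(choose_medium(700 * (1 << 20)))   # large media file
```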

  20. Memory Data Path of Conquest • [Figure, conventional file systems: storage requests → IO buffer management → IO buffer → persistence support → disk management → disk] • [Figure, Conquest memory data path: storage requests → persistence support → battery-backed RAM (small file and metadata storage)]

  21. Large-File-Only Disk Storage • Allocate in big chunks • Lower access overhead • Reduced management overhead • No fragmentation management • No tricks for small files • Storing data in metadata • No elaborate data structures • Wrapping a balanced tree onto disk cylinders [Devlinux.com 2000]

  22. Sequential-Access Large Files • Sequential disk accesses • Near-raw bandwidth • Well-defined readahead semantics • Read-mostly • Little synchronization overhead (between memory and disk)

  23. Disk Data Path of Conquest • [Figure, conventional file systems: storage requests → IO buffer management → IO buffer → persistence support → disk management → disk] • [Figure, Conquest disk data path: storage requests → IO buffer management → battery-backed RAM (IO buffer; small file and metadata storage) → disk management → disk (large-file-only file system)]

  24. Random-Access Large Files • Random access? • Common definition: nonsequential access • A typical movie has 150 scene changes • MP3 stores the title at the end of the files • Near sequential access? • Simplifies large-file metadata representation significantly

  25. Logical File Representation • File • Name(s) • i-node • File attributes • Data

  26. Physical File Representation • File • Name(s) • i-node • File attributes • Data locations • Data blocks

  27. Ext2 Data Representation • [Figure: an i-node holding 12 data block locations plus index block locations; index blocks point to further index blocks and to data block locations]

  28. Disadvantages with Ext2 Design • Designed for disk storage • Optimization for small files makes things complex • Random-access data structure for large files that are accessed mostly sequentially • Data access time dependent on the byte position in a file • Maximum file size is limited
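
Two of the points above (access time depends on byte position; maximum file size is limited) follow from the shape of the ext2 block map. The sketch below assumes 4 KB blocks and 4-byte block pointers as illustrative parameters, with 12 direct pointers followed by single, double, and triple indirect blocks. The function names are hypothetical, and the size limit computed here is only the one implied by the pointer tree; it ignores any other limits a real ext2 implementation imposes.

```python
def ext2_indirection_level(offset: int, block_size: int = 4096,
                           ptr_size: int = 4, direct: int = 12) -> int:
    """Return how many index blocks must be read to reach `offset`.

    0 = direct pointer, 1/2/3 = single/double/triple indirect.
    """
    ptrs_per_block = block_size // ptr_size
    block = offset // block_size
    if block < direct:
        return 0
    block -= direct
    for level in (1, 2, 3):
        span = ptrs_per_block ** level  # blocks reachable at this level
        if block < span:
            return level
        block -= span
    raise ValueError("offset beyond the maximum mappable file size")


def ext2_max_blocks(block_size: int = 4096, ptr_size: int = 4,
                    direct: int = 12) -> int:
    """Largest number of data blocks the pointer tree can address."""
    p = block_size // ptr_size
    return direct + p + p ** 2 + p ** 3


if __name__ == "__main__":
    for off in (0, 48 * 1024, 5 * 1024 * 1024, 5 * 1024 ** 3):
        print(off, ext2_indirection_level(off))
```

Bytes near the start of a file need no index-block reads, while bytes deep into a large file need up to three, which is the position dependence the slide criticizes.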

  29. Conquest Representation • Persistent RAM • Hash(file name) = location of data • Offset(location of data) • Disk storage • Per-file, doubly linked list of disk block segments (stored in persistent RAM)

  30. Advantages of Conquest Design • Direct data access for in-core files • Worst case: sequential memory search for random disk locations • Maximum file size limited by physical storage
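
The per-file, doubly linked list of disk block segments, and the worst-case sequential search through it, can be sketched as follows. All class, field, and method names are hypothetical, and the units (file blocks and disk block numbers) are chosen for illustration; the transcript does not describe the actual layout.

```python
class Segment:
    """One contiguous run of disk blocks belonging to a large file."""

    def __init__(self, disk_start: int, length: int):
        self.disk_start = disk_start  # first disk block of the run
        self.length = length          # number of blocks in the run
        self.prev = None
        self.next = None


class LargeFile:
    """Segment list kept in (persistent) memory; data itself is on disk."""

    def __init__(self):
        self.head = None
        self.tail = None

    def append(self, disk_start: int, length: int) -> None:
        seg = Segment(disk_start, length)
        seg.prev = self.tail
        if self.tail is None:
            self.head = seg
        else:
            self.tail.next = seg
        self.tail = seg

    def locate(self, file_block: int) -> int:
        """Map a block index within the file to a disk block number.

        Walks the list from the head: an in-memory sequential search.
        """
        remaining = file_block
        seg = self.head
        while seg is not None:
            if remaining < seg.length:
                return seg.disk_start + remaining
            remaining -= seg.length
            seg = seg.next
        raise IndexError("block index past end of file")


if __name__ == "__main__":
    f = LargeFile()
    f.append(1000, 8)
    f.append(5000, 4)
    print(f.locate(9))
```

Because disk allocation is in big chunks, a file has few segments, so even the worst-case walk touches only a short list in memory and no disk blocks.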

  31. Directory Service • Requirements • Fast sequential traversal (e.g., ls) • Fast random lookup (e.g., locate file x) • Hard links (apply multiple names to data)

  32. First Design • A doubly hashed table for each directory • Conserves space • Problems: • Dynamic resizing of directories • Need to handle the current file position • Important for rm -fr

  33. Second Design • A variant of extensible hash table for each directory • An old data structure fits nicely • [Figure: two hash tables; the first holds empty, empty, 0100 | file_1, 1001 | file_2; the second holds empty, empty, 0100 | file1, 1001 | file2, empty, 0011 | dir1, 1110 | file2_hardlink] [Fagin et al., 1979]

  34. Additional Engineering Details • Popular hash functions randomize lower bits • Dynamic file positioning • Need to handle collisions • Memory overhead and complexity tradeoffs
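
For readers unfamiliar with extensible (extendible) hashing, here is a minimal textbook-style sketch of a per-directory table indexed by the low bits of the name hash. It is not Conquest's variant: the class names, the bucket capacity, and the choice of FNV-1a as the hash are all assumptions, and it omits the collision handling and file-position tracking that the slides say the real design needs.

```python
class Bucket:
    def __init__(self, depth: int):
        self.depth = depth   # number of low hash bits this bucket owns
        self.items = {}      # name -> metadata ID


class ExtendibleDirectory:
    def __init__(self, bucket_capacity: int = 2):
        self.capacity = bucket_capacity
        self.global_depth = 1
        self.table = [Bucket(1), Bucket(1)]

    @staticmethod
    def _hash(name: str) -> int:
        # 32-bit FNV-1a, used here only as a deterministic example hash.
        h = 2166136261
        for byte in name.encode("utf-8"):
            h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
        return h

    def _index(self, name: str) -> int:
        # Index with the low bits, so doubling the table is a plain copy.
        return self._hash(name) & ((1 << self.global_depth) - 1)

    def lookup(self, name: str):
        return self.table[self._index(name)].items.get(name)

    def insert(self, name: str, metadata_id: int) -> None:
        while True:
            bucket = self.table[self._index(name)]
            if name in bucket.items or len(bucket.items) < self.capacity:
                bucket.items[name] = metadata_id
                return
            self._split(bucket)

    def _split(self, bucket: Bucket) -> None:
        if bucket.depth == self.global_depth:
            self.table = self.table + self.table  # double the directory
            self.global_depth += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        high_bit = 1 << (bucket.depth - 1)
        for i, b in enumerate(self.table):
            if b is bucket and i & high_bit:
                self.table[i] = sibling
        old_items, bucket.items = bucket.items, {}
        for key, value in old_items.items():
            self.table[self._index(key)].items[key] = value

    def names(self):
        """Sequential traversal (as for ls), visiting each bucket once."""
        seen = set()
        for b in self.table:
            if id(b) not in seen:
                seen.add(id(b))
                yield from b.items
```

A hard link is simply a second name inserted with the same metadata ID, for example `d.insert("file2", 7)` followed by `d.insert("file2_hardlink", 7)`.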

  35. Metadata Allocation • Requirements • Keep track of usage status of metadata entries • Avoid duplicate allocation with unique IDs • Fast retrieval of metadata with a given ID • [Figure: table of entries: ID 1 free, ID 2 in use, ID 3 free, ID 4 free, ID 5 in use, ID 6 free]

  36. Existing Memory Allocation • Services • Keep track of unallocated memory • No duplicate allocation of physical addresses • Hmm… • [Figure: table of addresses: 0xe000000 free, 0xe000038 in use, 0xe000070 free, 0xe0000A8 free, 0xe0000E0 free, 0xe000118 in use]

  37. Conquest Metadata Management • Metadata = memory allocated by memory manager • Metadata ID = physical address of metadata • [Figure: the ID table (ID 1 free, ID 2 in use, ID 3 free, ID 4 free, ID 5 in use, ID 6 free; labeled "Unique IDs and fast retrieval" and "Usage status") shown alongside the address table (0xe000000 free, 0xe000038 in use, 0xe000070 free, 0xe0000A8 free, 0xe0000E0 free, 0xe000118 in use)]
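
The idea that a metadata ID is just the address handed out by the memory manager can be sketched with a toy fixed-slot allocator. The class and method names are hypothetical; the base address and the 0x38 spacing are taken from the addresses shown in the figure, and Python integers merely stand in for physical addresses here.

```python
_FREE = object()  # sentinel marking an unallocated slot


class MetadataSlab:
    """Toy allocator whose returned 'address' doubles as the metadata ID."""

    def __init__(self, base_addr: int = 0xE000000, slot_size: int = 0x38,
                 slots: int = 6):
        self.base = base_addr
        self.slot_size = slot_size
        self.entries = [_FREE] * slots

    def allocate(self, metadata) -> int:
        for i, entry in enumerate(self.entries):
            if entry is _FREE:
                self.entries[i] = metadata
                return self.base + i * self.slot_size  # ID == address
        raise MemoryError("no free metadata slots")

    def _slot(self, metadata_id: int) -> int:
        index, rem = divmod(metadata_id - self.base, self.slot_size)
        if rem or not 0 <= index < len(self.entries):
            raise KeyError(hex(metadata_id))
        return index

    def get(self, metadata_id: int):
        """Constant-time retrieval: the ID says where the entry lives."""
        entry = self.entries[self._slot(metadata_id)]
        if entry is _FREE:
            raise KeyError(hex(metadata_id))
        return entry

    def free(self, metadata_id: int) -> None:
        self.entries[self._slot(metadata_id)] = _FREE
```

The allocator's own bookkeeping already provides the three requirements from the previous slides: usage status, uniqueness, and fast retrieval, so no separate ID table is needed.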

  38. Persistence Support • Restore file system states after a reboot • Data • Metadata • Memory manager • Keep track of metadata allocation

  39. Linux Memory Manager (1) • Page allocator maintains individual pages • [Figure: page allocator]

  40. Linux Memory Manager (2) • Zone allocator allocates memory in power-of-two sizes • [Figure: zone allocator layered on the page allocator]

  41. Linux Memory Manager (3) • Slab allocator groups allocations by sizes to reduce internal memory fragmentation • [Figure: slab allocator layered on the zone allocator and page allocator]
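
To see why power-of-two allocation motivates a size-grouping layer on top, the following sketch computes the space lost when a request is rounded up to the next power of two. It models only the rounding rule, not the Linux allocators themselves, and the function names are hypothetical.

```python
def round_up_pow2(n: int) -> int:
    """Smallest power of two that is >= n (n must be positive)."""
    if n < 1:
        raise ValueError("size must be positive")
    return 1 << (n - 1).bit_length()


def internal_waste(n: int) -> int:
    """Bytes lost to rounding when n bytes are served from a 2^k chunk."""
    return round_up_pow2(n) - n


if __name__ == "__main__":
    for size in (56, 64, 65, 4097):
        print(size, round_up_pow2(size), internal_waste(size))
```

A request just above a power of two wastes nearly half of its chunk, which is the internal fragmentation that grouping same-sized objects is meant to reduce.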

  42. Linux Memory Manager (4) • Difficult to restore the persistent states • Three layers of pointer-rich mappings • Mixing of persistent and temporary allocations • [Figure: slab allocator, zone allocator, page allocator]

  43. Conquest Persistence • Create memory zones with own instantiations of memory managers • [Figure: slab allocator, zone allocator, page allocator]

  44. Conquest Persistence • Encapsulate all pointers within each zone • Pointers can survive reboots • No serialization and deserialization • Swapping and paging • Disabled for Conquest memory zones • Enabled for non-Conquest zones

  45. Resiliency Support • Instantaneous metadata commit • No fsck (ad hoc metadata consistency check) • Built-in checkpointing • Pointer-switch commit semantics • [Figure: two pointers illustrating the switch]
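
Pointer-switch commit can be illustrated with a generic copy-then-swap pattern: changes are staged on a private copy, and a single reference assignment makes them visible. This is a sketch of the general technique only; the class name and the use of a deep copy are assumptions, and the transcript does not describe how Conquest stages its updates.

```python
import copy


class CheckpointedMetadata:
    def __init__(self, initial: dict):
        self._current = copy.deepcopy(initial)

    def read(self) -> dict:
        return self._current

    def commit(self, mutate) -> None:
        """Apply `mutate` to a staged copy, then switch the pointer."""
        staged = copy.deepcopy(self._current)
        mutate(staged)          # may raise; the live state is untouched
        self._current = staged  # the commit is this one assignment


if __name__ == "__main__":
    meta = CheckpointedMetadata({"size": 0})
    meta.commit(lambda m: m.update(size=10))
    print(meta.read())
```

If the update fails partway through, the old version is still the one being pointed to, so there is no half-written state for a consistency checker to repair afterwards.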

  46. Implementation Status • Kernel module under Linux 2.4.2 • Fully functional and POSIX compliant • Modified memory manager to support Conquest persistence • Need to overcome BIOS limitations for distribution • Looking for licensing opportunities

  47. Performance Evaluation • Architectural simplification • Feature count • Performance improvement • Memory-only workload • Memory and disk workload

  48. Conventional Data Path • Buffer allocation management • Buffer garbage collection • Data caching • Metadata caching • Predictive readahead • Write behind • Cache replacement • Metadata allocation • Metadata placement • Metadata translation • Disk layout • Fragmentation management • [Figure, conventional file systems: storage requests → IO buffer management → IO buffer → persistence support → disk management → disk]

  49. Memory Path of Conquest • Buffer allocation management • Buffer garbage collection • Data caching • Metadata caching • Predictive readahead • Write behind • Cache replacement • Metadata allocation • Metadata placement • Metadata translation • Disk layout • Fragmentation management • Memory manager encapsulation • [Figure, Conquest memory data path: storage requests → persistence support → battery-backed RAM (small file and metadata storage)]

  50. Disk Path of Conquest • Buffer allocation management • Buffer garbage collection • Data caching • Metadata caching • Predictive readahead • Write behind • Cache replacement • Metadata allocation • Metadata placement • Metadata translation • Disk layout • Fragmentation management • [Figure, Conquest disk data path: storage requests → IO buffer management → battery-backed RAM (IO buffer; small file and metadata storage) → disk management → disk (large-file-only file system)]
