Panasas Parallel File System. Brent Welch Director of Software Architecture, Panasas Inc October 16, 2008 HPC User Forum. Outline. Panasas background Storage cluster background, hardware and software Technical Topics High Availability Scalable RAID pNFS. Brent’s garden, California.
 
                
Panasas Parallel File System Brent Welch Director of Software Architecture, Panasas Inc October 16, 2008 HPC User Forum
Outline • Panasas background • Storage cluster background, hardware and software • Technical Topics • High Availability • Scalable RAID • pNFS Brent’s garden, California
The New ActiveStor Product Line Design, Modeling and Visualization Applications Simulation and Analysis Applications Tiered QOS Storage Backup/ Secondary • ActiveStor 4000 • 20 or 15TB shelves • 5 GB cache/shelf • Integrated 10GigE • 600 MB/sec/shelf • Tiered Parity • Options: • ActiveImage • ActiveGuard • ActiveMirror • ActiveStor 200 • 104 TBs (3DBs x 52SBs) • 5 shelves (20U, 35”) • Single 1GigE port/shelf • 350 MB/sec aggregate • 20 GB aggregate cache • Tiered Parity • ActiveStor 6000 • 20, 15 or 10TB shelves • 20 GB cache/shelf • Integrated 10GigE • 600 MB/sec/shelf • Tiered Parity • ActiveImage • ActiveGuard • ActiveMirror ActiveScale 3.2
Panasas Leadership Role in HPC • US DOE: Panasas Selected for Roadrunner – Top of the Top 500 • LANL $130M system will deliver 2x performance over current top BG/L • SciDAC: Panasas CTO selected to lead Petascale Data Storage Inst • CTO Gibson leads PDSI launched Sep 06, leveraging experience from PDSI members: LBNL/NERSC; LANL; ORNL; PNNL; Sandia NL; and UCSC • Aerospace: Airframes and engines, both commercial and defense • Boeing HPC file system; major engine mfg; top 3 U.S. defense contractors • Formula-1: HPC file system for Top 2 clusters – 3 teams in total • Top clusters at Renault F-1 and BMW Sauber, Ferrari also on Panasas • Intel: Certifies Panasas storage for broad range of HPC applications, now ICR • Intel uses Panasas storage for EDA design, and in HPC benchmark center • SC07: Six Panasas customers won awards at SC07 (Reno) conference • Validation: Extensive recognition and awards for HPC breakthroughs
Panasas Joint Alliance Investments with ISVs Panasas ISV Alliances Vertical Focus Energy Manufacturing; Government; Higher Ed and Res; Energy Semiconductor Financial Applications Seismic Processing; Interpretation; Reservoir Modeling Computational Fluid Dynamics (CFD); Comp Structural Mechanics (CSM) Electronic Design Automation (EDA) Trading; Derivatives Pricing; Risk Analysis
Intel Use of Panasas in the IC Design Flow Panasas critical to tape-out stage
Description of Darwin at U of Cambridge University of Cambridge HPC Service, Darwin Supercomputer Darwin Supercomputer Computational Units Nine repeating units, each consisting of 64 nodes (2 racks) providing 256 cores each, 2340 cores total All nodes within a CU connected to a full bisectional bandwidth Infiniband 900 MB/s, MPI latency of 2 µs Source: http://www.hpc.cam.ac.uk
Details of the FLUENT 111M Cell Model Unsteady external aero for 111 MM cell truck; 5 time steps with 100 iterations, and a single .dat file write Truck 111M Cells DARWIN 585 nodes; 2340 cores Panasas: 4 Shelves, 20 TB
Scalability of Solver + Data File Write FLUENT Comparison of PanFS vs. NFS on University of Cambridge Cluster Time of Solver + Data File Write Lower is better 1.7x Time (Seconds) of Solver + Data File Write Truck Aero 111M Cells 1.5x NOTE: Read times are not included in these results 1.9x 1.7x Number of Cores
Performance of Data File Write in MB/s FLUENT Comparison of PanFS vs. NFS on University of Cambridge Cluster Effective Rates of I/O for Data File Write Higher Is Better Effective Rates of I/O (MB/s) for Data Write Truck Aero 111M Cells 39x 31x NOTE: Data File Write Only 20x Number of Cores
Panasas Architecture • Cluster technology provides scalable capacity and performance: capacity scales symmetrically with processor, caching, and network bandwidth • Scalable performance with commodity parts provides excellent price/performance • Object-based storage provides additional scalability and security advantages over block-based SAN file systems • Automatic management of storage resources to balance load across the cluster • Shared file system (POSIX) with the advantages of NAS, with direct-to-storage performance advantages of DAS and SAN Disk CPU Memory Network
Panasas bladeserver building block Power Supplies Embedded Switch Battery 4U high Mid Plane Rails 11 slots for blades The Shelf DirectorBlade StorageBlade Garth Gibson, July 2008
Panasas Blade Hardware Integrated GE Switch Battery Module (2 Power units) Shelf Front 1 DB, 10 SB Shelf Rear StorageBlade DirectorBlade Midplane routes GE, power
Panasas Product Advantages Proven implementation with appliance-like ease of use/deployment Running mission-critical workloads at global F500 companies Scalable performance with Object-based RAID No degradation as the storage system scales in size Unmatched RAID rebuild rates – parallel reconstruction Unique data integrity features Vertical parity on drives to mitigate media errors and silent corruptions Per-file RAID provides scalable rebuild and per-file fault isolation Network verified parity for end-to-end data verification at the client Scalable system size with integrated cluster management Storage clusters scaling to 1000+ storage nodes, 100+ metadata managers Simultaneous access from over 12000 servers
Linear Performance Scaling Breakthrough data throughput AND random I/O Performance and scalability for all workloads
Proven Panasas Scalability • Storage Cluster Sizes Today (e.g.) • Boeing, 50 DirectorBlades, 500 StorageBlades in one system. (plus 25 DirectorBlades and 250 StorageBlades each in two other smaller systems.) • LANL RoadRunner. 100 DirectorBlades, 1000 StorageBlades in one system today, planning to increase to 144 shelves next year. • Intel has 5,000 active DF clients against 10-shelf systems, with even more clients mounting DirectorBlades via NFS. They have qualified a 12,000 client version of 2.3, and will deploy “lots” of compute nodes against 3.2 later this year. • BP uses 200 StorageBlade storage pools as their building block • LLNL, two realms, each 60 DirectorBlades (NFS) and 160 StorageBlades • Most customers run systems in the 100 to 200 blade size range
Emphasis on Data Integrity • Horizontal Parity • Per-file, Object-based RAID across OSD • Scalable on-line performance • Scalable parallel RAID rebuild • Vertical Parity • Detect and eliminate unreadable sectors and silent data corruption • RAID at the sector level within a drive / OSD • Network Parity • Client verifies per-file parity equation during reads • Provides only truly end-to-end data integrity solution available today • Many other reliability features… • Media scanning, metadata fail over, network multi-pathing, active hardware monitors, robust cluster management
High Availability • Quorum based cluster management • 3 or 5 cluster managers to avoid split brain • Replicated system state • Cluster manager controls the blades and all other services • High performance file system metadata fail over • Primary-backup relationship controlled by cluster manager • Low latency log replication to protect journals • Client-aware fail over for application-transparency • NFS level fail over via IP takeover • Virtual NFS servers migrate among DirectorBlade modules • Lock services (lockd/statd) fully integrated with fail over system
Technology Review • Turn-key deployment and automatic resource configuration • Scalable Object RAID • Very fast RAID rebuild • Vertical Parity to trap silent corruptions • Network parity for end-to-end data verification • Distributed system platform with quorum-based fault tolerance • Coarse grain metadata clustering • Metadata fail over • Automatic capacity load leveling • Storage Clusters scaling to ~1000 nodes today • Compute clusters scaling to 12,000 nodes today • Blade-based hardware with 1Gb/sec building block • Bigger building block going forward
The pNFS Standard The pNFS standard defines the NFSv4.1 protocol extensions between the server and client The I/O protocol between the client and storage is specified elsewhere, for example: SCSI Block Commands (SBC) over Fibre Channel (FC) SCSI Object-based Storage Device (OSD) over iSCSI Network File System (NFS) The control protocol between the server and storage devices is also specified elsewhere, for example: SCSI Object-based Storage Device (OSD) over iSCSI Client Storage NFS 4.1 Server
Key pNFS Participants • Panasas (Objects) • Network Appliance (Files over NFSv4) • IBM (Files, based on GPFS) • EMC (Blocks, HighRoad MPFSi) • Sun (Files over NFSv4) • U of Michigan/CITI (Files over PVFS2)
pNFS Status pNFS is part of the IETF NFSv4 minor version 1 standard draft Working group is passing draft up to IETF area directors, expect RFC later in ’08 Prototype interoperability continues San Jose Connect-a-thon March ’06, February ’07, May ’08 Ann Arbor NFS Bake-a-thon September ’06, October ’07 Dallas pNFS inter-op, June ’07, Austin February ’08, (Sept ’08) Availability TBD – gated behind NFSv4 adoption and working implementations of pNFS Patch sets to be submitted to Linux NFS maintainer starting “soon” Vendor announcements in 2008 Early adopters in 2009 Production ready in 2010
Questions? Thank you for your time!
Deep Dive: Reliability • High Availability • Cluster Management • Data Integrity
Vertical Parity • “RAID” within an individual drive • Seamless recovery from media errors by applying RAID schemes across disk sectors • Repairs media defects by writing through to spare sectors • Detects silent corruptions and prevents reading wrong data • Independent of horizontal array-based parity schemes Vertical Parity
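The idea of applying a RAID scheme across the sectors of a single drive can be sketched with plain XOR parity. This is only an illustration under stated assumptions: the 8-byte sector size, the group of four data sectors, and the function names are hypothetical, and the slides do not describe the actual on-disk format Panasas uses.

```python
from functools import reduce

SECTOR_SIZE = 8  # tiny illustrative sector size, not a real drive geometry


def xor_sectors(sectors):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*sectors))


def write_group(data_sectors):
    """Return the data sectors followed by one parity sector for the group."""
    return list(data_sectors) + [xor_sectors(data_sectors)]


def recover_sector(group, bad_index):
    """Rebuild the sector at bad_index from the remaining sectors in the group."""
    survivors = [s for i, s in enumerate(group) if i != bad_index]
    return xor_sectors(survivors)


# Four data sectors on one drive, protected by one parity sector.
data = [bytes([value] * SECTOR_SIZE) for value in (1, 2, 3, 4)]
group = write_group(data)

# Pretend sector 2 has become unreadable and rebuild it from the others.
rebuilt = recover_sector(group, bad_index=2)
```

Because the parity lives on the same drive as the data, this kind of repair needs no traffic to other storage nodes, which is why the slide calls it independent of the horizontal, array-based parity.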
Network Parity • Extends parity capability across the data path to the client or server node • Enables End-to-End data integrity validation • Protects from errors introduced by disks, firmware, server hardware, server software, network components and transmission • Client either receives valid data or an error notification Network Parity Vertical Parity Horizontal Parity
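A minimal sketch of the client-side check described above, assuming simple XOR parity over the data units of one stripe. The names `verified_read` and `IntegrityError` are hypothetical and do not come from the Panasas client; the point is only that the client either returns data that satisfies the parity equation or reports an error.

```python
class IntegrityError(Exception):
    """Raised when the data read does not satisfy the parity equation."""


def xor_units(units):
    """XOR equal-length stripe units together."""
    out = bytearray(len(units[0]))
    for unit in units:
        for i, byte in enumerate(unit):
            out[i] ^= byte
    return bytes(out)


def verified_read(data_units, parity_unit):
    """Return the stripe's data only if it matches the parity that was read."""
    if xor_units(data_units) != parity_unit:
        raise IntegrityError("stripe data does not match its parity")
    return b"".join(data_units)


# A stripe as written by a client: three data units plus their parity.
stripe = [b"abcd", b"efgh", b"ijkl"]
parity = xor_units(stripe)

good = verified_read(stripe, parity)

# The same stripe with one byte flipped somewhere along the data path.
corrupted = [b"abcd", b"efgX", b"ijkl"]
try:
    verified_read(corrupted, parity)
    detected = False
except IntegrityError:
    detected = True
```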
Panasas Scalability • Two Layer Architecture • Division between these two layers is an important separation of concerns • Platform maintains a robust system model and provides overall control • File system is an application layered over the distributed system platform • Automation in the distributed system platform helps the system adapt to failures without a lot of hand-holding by administrators • The file system uses protocols optimized for performance, and relies on the platform to provide robust protocols for failure handling Applications Parallel File System Distributed System Platform Hardware
Model-based Platform • The distributed system platform maintains a model of the system • what are the basic configuration settings (networking, etc.) • which storage and manager nodes are in the system • where services are running • what errors and faults are present in the system • what recovery actions are in progress • Model-based approach mandatory to reduce administrator overhead • Automatic discovery of resources • Automatic reaction to faults • Automatic capacity balancing • Proactive hardware monitoring Manual Automatic
Quorum-based Cluster Management • The system model is replicated on 3 or 5 (or more) nodes • Maintained via Lamport’s PTP (Paxos) quorum voting protocol • PTP handles quorum membership change, brings partitioned members back up to date, provides basic transaction model • Each member keeps the model in a local database that is updated within a PTP transaction • 7 or 14 msec update cost, which is dominated by 1 or 2 synchronous disk IOs • This robust mechanism is not on the critical path of any file system operation • Yes: change admin password (involves cluster manager) • Yes: configure quota tree • Yes: initiate service fail over • No: open close read write etc. (does not involve cluster manager)
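The choice of 3 or 5 replicas follows from majority voting, which the short sketch below works through. It shows only the arithmetic of majority quorums, not the PTP/Paxos protocol itself, and the function names are made up for illustration.

```python
def quorum_size(members):
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be down while a majority is still reachable."""
    return members - quorum_size(members)


def has_quorum(reachable, members):
    """True if this side of a partition may keep updating the system model."""
    return reachable >= quorum_size(members)


# With 5 cluster managers split 2 / 3 by a network partition,
# only the side with 3 members can commit, so there is no split brain.
minority_side = has_quorum(2, 5)
majority_side = has_quorum(3, 5)
```

An even member count adds cost without adding fault tolerance (4 members tolerate one failure, the same as 3), which is consistent with the odd sizes listed on the slide.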
File Metadata • PanFS metadata manager stores metadata in object attributes • All component objects have simple attributes like their capacity, length, and security tag • Two component objects store replica of file-level attributes, e.g. file-length, owner, ACL, parent pointer • Directories contain hints about where components of a file are stored • There is no database on the metadata manager • Just transaction logs that are replicated to a backup via a low-latency network protocol
File Metadata • File ownership divided along file subtree boundaries (“volumes”) • Multiple metadata managers, each owning one or more volumes • Match with the quota-tree abstraction used by the administrator • Creating volumes creates more units of metadata work • Primary-backup failover model for MDS • 90us remote log update over 1GE, vs. 2us local in-memory log update • Some customers introduce lots of volumes • One volume per DirectorBlade module is ideal • Some customers are stubborn and stick with just one • E.g., 75 TB and 144 StorageBlade modules and a single MDS
Scalability over time • Software baseline • Single software product, but with branches corresponding to major releases that come out every year (or so) • Today most customers are on 3.0.x, which had a two year lifetime • We are just introducing 3.2, which has been in QA and beta for 7 months • There is forward development on newer features • Our goal is a major release each year, with a small number of maintenance releases in between major releases • Compatibility is key • New versions upgrade cleanly over old versions • Old clients communicate with new servers, and vice versa • Old hardware is compatible with new software • Integrated data migration to newer hardware platforms
Deep Dive: Networking • Scalable networking infrastructure • Integration with compute cluster fabrics
LANL Systems • Turquoise Unclassified • Pink 10TF - TLC 1TF - Coyote 20TF (compute) • 1 Lane PaScalBB ~ 100 GB/sec (network) • 68 Panasas Shelves, 412 TB, 24 GBytes/sec (storage) • Unclassified Yellow • Flash/Gordon 11TF - Yellow Roadrunner Base 5TF – (4 RRp3 CU’s 300 TF) • 2 Lane PaScalBB ~ 200 GB/sec • 24 Panasas Shelves - 144 TB - 10 GBytes/sec • Classified Red • Lightning/Bolt 35TF - Roadrunner Base 70TF (Accelerated 1.37PF) • 6 Lane PaScalBB ~ 600 GB/sec • 144 Panasas Shelves - 936 TB - 50 GBytes/sec Top500 #1
IB and other network fabrics • Panasas is a TCP/IP, GE-based storage product • Universal deployment, Universal routability • Commodity price curve • Panasas customers use IB, Myrinet, Quadrics, … • Cluster interconnect du jour for performance, not necessarily cost • IO routers connect cluster fabric to GE backbone • Analogous to an “IO node”, but just does TCP/IP routing (no storage) • Robust connectivity through IP multipath routing • Scalable throughput at approx 650 MB/sec IO router (PCI-e class) • Working on a 1GB/sec IO router • IB-GE switching platforms • QLogic or Voltaire switch provides wire-speed bridging
Petascale Red Infrastructure Diagram with Roadrunner Accelerated FY08 Secure Core switches NFS and other network services, WAN Archive Roadrunner Phase 3 1.026 PF Nx10GE NxGE IB4X FatTree Nx10GE FTA’s Compute Unit fail over 10GE IONODES Site wide Shared Global Parallel File System Compute Unit Roadrunner Phase 1 70TF Myrinet 10GE IONODES 4 GE per 5-8 TB CU IO Unit CU Lightning/Bolt 35 TF Scalable to 600 GB/sec before adding Lanes Myrinet 1GE IONODES CU IO Unit CU
LANL Petascale (Red) FY08 NFS complex and other network services, WAN Road Runner Base, 70TF, 144 node units, 12 I/O nodes/unit, 4 socket dual core AMD nodes, 32 GB mem/node, full fat tree, 14 units, Acceleration to 1 PF sustained N gb/s N gb/s Archive FTA’s IB4X FatTree N gb/s N gb/s 156 I/O nodes, 1 – 10gbit link each, 195 GB/sec, planned for Accelerated Road Runner 1PF sustained Compute Unit fail over Site wide Shared Global Parallel File System 650-1500 TB 50-200 GB/s (spans lanes) Compute Unit Bolt, 20TF, 2 socket, single/dual core AMD, 256 node units, reduced fat tree, 1920 nodes Myrinet 96 I/O nodes, 2 – 1gbit links each, 24 GB/sec CU IO Unit 4 gb/s per 5-8 TB CU 20 gb/s per link Lane switches, 6 X 105 = 630 GB/s Lightning, 14 TF, 2 socket single core AMD, 256 node units, full fat tree, 1608 nodes Lane passthru switches, to connect legacy Lightning/Bolt 64 I/O nodes, 2 – 1gbit links each, 16 GB/sec Myrinet CU IO Unit If more bandwidth is needed we just need to add more lanes and add more storage, scales nearly linearly, not N² like a fat tree Storage Area Network. CU
Multi-Cluster sharing: scalable BW with fail over Archive KRB DNS1 NFS Cluster A To Site Network Panasas Storage Colors depict subnets Compute Nodes I/O Nodes Layer 2 switches Cluster B Cluster C
Deep Dive: Scalable RAID • Per-file RAID • Scalable RAID rebuild
Automatic per-file RAID System assigns RAID level based on file size <= 64 KB RAID 1 for efficient space allocation > 64 KB RAID 5 for optimum system performance > 1 GB two-level RAID-5 for scalable performance RAID-1 and RAID-10 for optimized small writes Automatic transition from RAID 1 to 5 without re-striping Programmatic control for application-specific layout optimizations Create with layout hint Inherit layout from parent directory Small File RAID 1 Mirroring Large File RAID 5 Striping Very Large File 2-level RAID 5 Striping Clients are responsible for writing data and its parity
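The size-based policy on this slide can be written as a small selection function. This is a sketch of the stated thresholds only: it assumes binary units (1 KB = 1024 bytes), uses a hypothetical function name, and ignores layout hints and inheritance from the parent directory.

```python
KB = 1024          # assumption: the slide's KB/GB are binary units
GB = 1024 ** 3


def raid_level_for(file_size_bytes):
    """Pick a RAID scheme from the file size, following the slide's thresholds."""
    if file_size_bytes <= 64 * KB:
        return "RAID 1"            # mirroring: space-efficient for small files
    if file_size_bytes <= 1 * GB:
        return "RAID 5"            # striping with parity
    return "two-level RAID 5"      # very large files


small = raid_level_for(4 * KB)
medium = raid_level_for(10 * 1024 * KB)
huge = raid_level_for(5 * GB)
```

A file that grows past 64 KB moves from the first branch to the second, which corresponds to the automatic RAID 1 to RAID 5 transition the slide mentions.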
Declustered RAID • Files are striped across component objects on different StorageBlades • Component objects include file data and file parity for reconstruction • File attributes are replicated with two component objects • Declustered, randomized placement distributes RAID workload H G k E Read about half of each surviving OSD Write a little to each OSD Scales linearly 2-shelf BladeSet Mirrored or 9-OSD Parity Stripes FAIL C F E
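To see why randomized placement spreads the rebuild workload, the sketch below places each file's stripe on a random subset of OSDs and counts which survivors must be read when one OSD fails. The numbers follow the slide's example (a 2-shelf BladeSet, taken here as 20 StorageBlades, with 9-OSD stripes); the file count, the seed, and the uniform random placement are assumptions for illustration and not the actual PanFS placement algorithm.

```python
import random
from collections import Counter


def place_files(num_files, num_osds, stripe_width, seed=0):
    """Assign each file a random set of distinct OSDs for its component objects."""
    rng = random.Random(seed)
    return [rng.sample(range(num_osds), stripe_width) for _ in range(num_files)]


def rebuild_read_load(placements, failed_osd):
    """Count, per surviving OSD, how many files it must help reconstruct."""
    load = Counter()
    for osds in placements:
        if failed_osd in osds:
            load.update(o for o in osds if o != failed_osd)
    return load


placements = place_files(num_files=10_000, num_osds=20, stripe_width=9)
load = rebuild_read_load(placements, failed_osd=0)

# Ratio of the busiest to the least busy survivor during the rebuild.
spread = max(load.values()) / min(load.values())
```

Every surviving OSD ends up contributing a similar share of the reads, so no single drive becomes the bottleneck. With 9-wide stripes over 20 OSDs, a file on a given survivor also touches the failed OSD with probability 8/19, which is in line with the slide's "read about half of each surviving OSD".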
Shorter repair time in larger storage pools Customers report 30 minute rebuilds for 800GB in 40+ shelf blade set Variability at 12 shelves due to uneven utilization of DirectorBlade modules Larger numbers of smaller files was better Reduced rebuild at 8 and 10 shelves because of wider parity stripe Rebuild bandwidth is the rate at which data is regenerated (writes) Overall system throughput is N times higher because of the necessary reads Use multiple “RAID engines” (DirectorBlades) to rebuild files in parallel Declustering spreads disk I/O over more disk arms (StorageBlades) Scalable RAID Rebuild Rebuild MB/sec
Having more drives increases risk, just as having more light bulbs increases the odds that one is burnt out at any given time • Larger storage pools must mitigate their risk by decreasing repair times • The math: if (e.g.) 100 drives are in 10 RAID sets of 10 drives each, and each RAID set has a rebuild time of N hours, the risk is the same as for a single RAID set of 100 drives with a rebuild time of N/10 • Block-based RAID scales in the wrong direction for this to work: bigger RAID sets repair more slowly because more data must be read • Only declustering provides scalable rebuild rates • Scalable rebuild is mandatory Total number of drives Drives per RAID set Repair time
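The 10-sets-of-10 versus one-set-of-100 comparison can be checked with a first-order model. The model is an assumption, not taken from the slides: drives fail independently at a constant rate, and data is lost when a second drive in the same RAID set fails before the first rebuild finishes. The failure rate and rebuild time below are hypothetical placeholders.

```python
def double_failure_rate(num_sets, drives_per_set, rebuild_hours, fail_rate_per_hour):
    """Approximate rate of a second failure landing inside a rebuild window.

    Per set: first failures arrive at n * rate, and each one opens a window of
    rebuild_hours during which any of the other n - 1 drives may also fail.
    """
    n = drives_per_set
    first_failure_rate = n * fail_rate_per_hour
    second_failure_prob = (n - 1) * fail_rate_per_hour * rebuild_hours
    return num_sets * first_failure_rate * second_failure_prob


RATE = 1e-6      # hypothetical per-drive failure rate per hour
N_HOURS = 10.0   # hypothetical rebuild time for a 10-drive RAID set

ten_small_sets = double_failure_rate(10, 10, N_HOURS, RATE)
one_large_set = double_failure_rate(1, 100, N_HOURS / 10, RATE)
ratio = one_large_set / ten_small_sets
```

Under this model the two configurations differ only by the factor 99/90 = 1.1, so the risk is roughly equal provided the large set really does rebuild ten times faster. If the large set instead rebuilt in the same N hours, its risk would be about eleven times higher, which is the case the slide warns about for block-based RAID.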
Deep Dive: pNFS • Standards-based parallel file systems: NFSv4.1
pNFS: Standard Storage Clusters pNFS is an extension to the Network File System v4 protocol standard Allows for parallel and direct access From Parallel Network File System clients To Storage Devices over multiple storage protocols Moves the Network File System server out of the data path data metadata Block (FC) / Object (OSD) / File (NFS) Storage pNFS Clients control NFSv4.1 Server
pNFS Layouts • Client gets a layout from the NFS Server • The layout maps the file onto storage devices and addresses • The client uses the layout to perform direct I/O to storage • At any time the server can recall the layout • Client commits changes and returns the layout when it’s done • pNFS is optional, the client can always use regular NFSv4 I/O layout Storage Clients NFSv4.1 Server
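The layout life cycle in the bullets above can be mimicked with a toy in-memory model. Everything here is hypothetical: the class and method names loosely echo the layout get, return, and recall steps but are not a real NFS client API, and for brevity the toy client reaches storage devices through the server object rather than over a separate storage protocol.

```python
class StorageDevice:
    """A toy storage device holding chunks keyed by (path, chunk index)."""

    def __init__(self):
        self.chunks = {}

    def write(self, key, data):
        self.chunks[key] = data


class Server:
    """Toy metadata server that grants layouts and can also proxy I/O."""

    def __init__(self, devices):
        self.devices = devices
        self.layouts = {}          # (client name, path) -> layout
        self.server_io_count = 0   # writes that passed through the server

    def layout_get(self, client_name, path):
        layout = {"path": path, "device_ids": list(range(len(self.devices)))}
        self.layouts[(client_name, path)] = layout
        return layout

    def layout_return(self, client_name, path):
        self.layouts.pop((client_name, path), None)

    def write_via_server(self, path, index, data):
        self.server_io_count += 1
        self.devices[index % len(self.devices)].write((path, index), data)


class Client:
    def __init__(self, name, server):
        self.name = name
        self.server = server
        self.layouts = {}

    def get_layout(self, path):
        self.layouts[path] = self.server.layout_get(self.name, path)

    def on_recall(self, path):
        """Server recalled the layout: drop it and hand it back."""
        self.layouts.pop(path, None)
        self.server.layout_return(self.name, path)

    def write(self, path, index, data):
        layout = self.layouts.get(path)
        if layout is None:
            # No layout held: fall back to regular I/O through the server.
            self.server.write_via_server(path, index, data)
            return
        ids = layout["device_ids"]
        # Direct I/O: the layout says which device holds this chunk.
        self.server.devices[ids[index % len(ids)]].write((path, index), data)


devices = [StorageDevice() for _ in range(3)]
server = Server(devices)
client = Client("c1", server)

client.get_layout("/data/f")
for i, chunk in enumerate([b"aa", b"bb", b"cc"]):
    client.write("/data/f", i, chunk)      # goes straight to the devices

client.on_recall("/data/f")
client.write("/data/f", 3, b"dd")          # falls back to the server path
```

In this run only the final write is counted by the server, which mirrors the slide's point that the layout moves the server out of the data path while regular NFSv4 I/O remains available as a fallback.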
pNFS Client • Common client for different storage back ends • Wider availability across operating systems • Fewer support issues for storage vendors Client Apps pNFS Client 1. SBC (blocks) 2. OSD (objects) 3. NFS (files) 4. PVFS2 (files) 5. Future backend… Layout Driver NFSv4.1 pNFS Server Layout metadata grant & revoke Cluster Filesystem