
The Lustre Storage Architecture
Linux Clusters for Super Computing, Linköping 2003

Peter J. Braam (braam@clusterfs.com), Tim Reddin (tim.reddin@hp.com)
http://www.clusterfs.com



  1. The Lustre Storage Architecture – Linux Clusters for Super Computing, Linköping 2003. Peter J. Braam (braam@clusterfs.com) and Tim Reddin (tim.reddin@hp.com), http://www.clusterfs.com

  2. Topics • History of the project • High level picture • Networking • Devices and fundamental APIs • File I/O • Metadata & recovery • Project status • Cluster File Systems, Inc.

  3. Lustre’s History

  4. Project history • 1999 CMU & Seagate • Worked with Seagate for one year • Storage management, clustering • Built prototypes, much design • Much survives today

  5. 2000-2002 File system challenge • First put forward Sep 1999, Santa Fe • New architecture for the National Labs • Characteristics: • 100s of GB/sec of I/O throughput • Trillions of files • 10,000s of nodes • Petabytes of capacity • Garth & Peter in the running from the start

  6. 2002 – 2003 fast lane • 3 year ASCI Path Forward contract • with HP and Intel • MCR & ALC, 2x 1000 node Linux Clusters • PNNL HP IA64, 1000 node Linux cluster • Red Storm, Sandia (8000 nodes, Cray) • Lustre Lite 1.0 • Many partnerships (HP, Dell, DDN, …)

  7. 2003 – Production, performance • Spring and summer • LLNL MCR went from no use, to partial, to full-time use • PNNL similar • Stability much improved • Performance • Summer 2003: I/O problems tackled • Metadata much faster • Dec/Jan • Lustre 1.0

  8. High level picture

  9. Lustre Systems – Major Components • Clients • Have access to file system • Typical role: compute server • OST • Object storage targets • Handle (stripes of, references to) file data • MDS • Metadata request transaction engine. • Also: LDAP, Kerberos, routers etc.

  10. [Diagram: Lustre object storage targets (OST). Up to 10,000s of Lustre clients (1,000 for Lustre Lite) connect over a QSW net or GigE to Linux OST servers with disk arrays and 3rd-party OST appliances (OST 1-7), and to MDS 1 (active) with MDS 2 as failover; the MDS pair and OSTs sit on a SAN.]

  11. [Diagram: Clients contact the LDAP server for configuration information, network connection details, & security management; the metadata server (MDS) for directory operations, metadata, & concurrency; and the object storage targets (OST) for file I/O & file locking. The MDS and OSTs cooperate on recovery, file status, & file creation.]

  12. Networking

  13. Lustre Networking • Currently runs over: • TCP • Quadrics Elan 3 & 4 • Lustre can route & can use heterogeneous nets • Beta • Myrinet, SCI • Under development • SAN (FC/iSCSI), I/B • Planned: • SCTP, some special NUMA and other nets

  14. Lustre Network Stack – Portals • Lustre request processing: 0-copy marshalling libraries, service framework, client request dispatch, connection & address naming, generic recovery infrastructure • NIO API & Portal library: move small & large buffers, handle remote DMA, generate events (Sandia's API, CFS-improved implementation) • Portal NALs: network abstraction layer for TCP, QSW, etc.; small & hard; includes a routing API • Device library (Elan, Myrinet, TCP, ...)
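
To make the NAL layer concrete, here is a minimal sketch, assuming a hypothetical ops-table interface (nal_ops, tcp_nal_ops, elan_nal_ops are illustrative names, not the real Portals/Lustre symbols), of how one small driver per network type could sit under an unchanged upper stack:

```c
/* Illustrative sketch only: a per-network ops table that a Portals-style
 * library could call through, so TCP, Elan, Myrinet, ... each supply one
 * small driver ("NAL") while the upper layers stay unchanged.
 * Names and signatures are hypothetical, not the real Lustre/Portals API. */
#include <stddef.h>
#include <stdint.h>

struct nal_ops {
    int  (*startup)(void *priv);                      /* bring the interface up */
    int  (*send)(void *priv, uint64_t peer_nid,
                 const void *buf, size_t len);        /* small message send     */
    int  (*rdma_put)(void *priv, uint64_t peer_nid,
                     const void *buf, size_t len);    /* bulk 0-copy transfer   */
    void (*shutdown)(void *priv);
};

/* A TCP NAL would fill the table with socket-based implementations; an
 * Elan NAL would map the same calls onto Quadrics RDMA primitives.       */
extern const struct nal_ops tcp_nal_ops;   /* hypothetical */
extern const struct nal_ops elan_nal_ops;  /* hypothetical */
```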

  15. Devices and API’s

  16. Lustre Devices & APIs • Lustre has numerous driver modules • One API - very different implementations • A driver binds to a named device • Stacking devices is key • Generalized "object devices" • Drivers currently export several APIs • Infrastructure - a mandatory API • Object Storage • Metadata Handling • Locking • Recovery
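
As a rough illustration of "one API, very different implementations" and of device stacking, the sketch below shows a generic object device with a shared ops table and a stacking driver that fans calls out to its children; all names and fields are assumptions for illustration, not Lustre's actual driver model:

```c
/* Illustrative sketch: every driver exports the same small ops table, and
 * a logical (stacking) driver simply holds child devices it forwards to.
 * Names are assumptions, not Lustre's real structures.                   */
struct obd_device;                                /* forward declaration */

struct obd_device_ops {
    int (*setup)(struct obd_device *dev);
    int (*connect)(struct obd_device *dev);
    int (*disconnect)(struct obd_device *dev);
};

struct obd_device {
    const char                  *name;        /* drivers bind to named devices */
    const struct obd_device_ops *ops;         /* same API for every driver     */
    struct obd_device          **children;    /* non-NULL for stacking drivers */
    int                          n_children;
};

/* A stacking driver's connect just fans out to the devices underneath it. */
static int stacked_connect(struct obd_device *dev)
{
    for (int i = 0; i < dev->n_children; i++) {
        int rc = dev->children[i]->ops->connect(dev->children[i]);
        if (rc != 0)
            return rc;
    }
    return 0;
}
```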

  17. Lustre Clients & APIs [Diagram: the Lustre file system (Linux) or Lustre library (Windows, Unix, microkernels) sits on a logical object volume (LOV driver), which handles data objects & locks through OSC 1 ... OSC n, and on a clustered MD driver, which handles metadata & locks through MDC drivers.]

  18. Object Storage API • Objects are (usually) unnamed files • Improves on the block device API • create, destroy, setattr, getattr, read, write • The OBD driver does block/extent allocation • Implementation: Linux drivers using a file system backend
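
A minimal sketch of such an operation table, assuming hypothetical names and signatures (this is not Lustre's real OBD interface), might look as follows:

```c
/* Minimal sketch of an object-storage operation table along the lines the
 * slide describes (create, destroy, setattr, getattr, read, write).
 * All names here are illustrative assumptions, not Lustre's actual API.  */
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

typedef uint64_t objid_t;   /* objects are (usually) unnamed: an id, not a path */

struct obj_attr {
    uint64_t size;
    uint32_t mode;
    uint32_t uid, gid;
};

struct obj_storage_ops {
    int     (*create)(void *dev, objid_t *out_id);
    int     (*destroy)(void *dev, objid_t id);
    int     (*setattr)(void *dev, objid_t id, const struct obj_attr *attr);
    int     (*getattr)(void *dev, objid_t id, struct obj_attr *attr);
    ssize_t (*read)(void *dev, objid_t id, void *buf, size_t len, off_t off);
    ssize_t (*write)(void *dev, objid_t id, const void *buf, size_t len, off_t off);
};

/* The driver behind this table, not the caller, decides block/extent
 * allocation -- e.g. by storing each object as a file in an ext3 backend. */
```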

  19. Bringing it all together [Diagram: the Lustre client file system (metadata WB cache, OSCs, MDC, lock client) runs over request processing, the NIO API, the Portal library, Portal NALs, and the networking device (Elan, TCP, ...). The OST side (object-based disk server, lock server, recovery) serves system & parallel file I/O and file locking; the MDS side (MDS server, lock server, lock client, recovery, load balancing) serves directory metadata & concurrency. Both servers run over networking and a local file system (ext3, Reiser, XFS, ...) on Fibre Channel storage, and together handle recovery, file status, and file creation.]

  20. File I/O

  21. File I/O – Write Operation • Open the file on the metadata server • Get information on all objects that are part of the file: • Object IDs • Which storage controllers (OSTs) • Which part of the file (offset) • Striping pattern • Create LOV, OSC drivers • Use the connection to the OST • Object writes go to the OST • No MDS involvement at all
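
The sequence above can be sketched in code. Everything below (the layout struct returned at open time, the ost_write() helper) is a hypothetical illustration of the striping logic, not Lustre's client implementation:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical sketch of the write path: the MDS is contacted once at open
 * to fetch the striping descriptor, after which data moves directly
 * between the client and the OSTs.                                        */

struct stripe { int ost_index; uint64_t object_id; };

struct layout {                      /* returned by the MDS at open time */
    int           stripe_count;
    size_t        stripe_size;       /* bytes per object before round-robining on */
    struct stripe stripes[8];
};

/* Assumed helper: write `len` bytes at `obj_off` into one object on one OST. */
ssize_t ost_write(const struct stripe *s, const void *buf, size_t len, off_t obj_off);

ssize_t striped_write(const struct layout *lo, const char *buf,
                      size_t len, off_t file_off)
{
    size_t done = 0;
    while (done < len) {
        off_t  off    = file_off + (off_t)done;
        size_t in_str = (size_t)(off % (off_t)lo->stripe_size); /* offset inside stripe */
        size_t chunk  = lo->stripe_size - in_str;
        if (chunk > len - done)
            chunk = len - done;

        uint64_t stripe_no = (uint64_t)off / lo->stripe_size;    /* global stripe number */
        int      idx       = (int)(stripe_no % lo->stripe_count);/* which OST/object     */
        off_t    obj_off   = (off_t)((stripe_no / lo->stripe_count)
                                     * lo->stripe_size + in_str);

        if (ost_write(&lo->stripes[idx], buf + done, chunk, obj_off) < 0)
            return -1;               /* no MDS involvement anywhere in this loop */
        done += chunk;
    }
    return (ssize_t)done;
}
```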

  22. [Diagram: the Lustre client file system sends a file open request through the MDC to the metadata server; the MDS returns the file metadata, e.g. inode A striped as {(OST 1, obj1), (OST 3, obj2)}; the LOV then issues write(obj1) and write(obj2) through OSC 1 and OSC 2 directly to OST 1 and OST 3.]

  23. I/O bandwidth • 100s of GB/sec => saturate many hundreds of OSTs • OSTs: • Do ext3 extent allocation, non-caching direct I/O • Lock management spread over the cluster • Achieve 90-95% of network throughput • Single client, single thread, Elan3: writes at 269 MB/sec • OSTs handle up to 260 MB/sec • Without the extent code, on a 2-way 2.4 GHz Xeon

  24. Metadata

  25. Intent locks & Write Back caching • Client-MDS protocol adaptation • Low concurrency - write back caching • Client makes in-memory updates • Delayed replay to the MDS • High concurrency (mostly merged in 2.6) • Single network request per transaction • No lock revocations to clients • Intent-based lock includes the complete request
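
One way to picture "intent-based lock includes the complete request" is a lock request that embeds the whole metadata operation, so the MDS can execute it and reply in a single round trip. The structures below are illustrative assumptions only, not Lustre's actual messages:

```c
/* Illustrative only: a lock request that embeds the "intent", i.e. the
 * complete metadata operation, so one network round trip both performs
 * the operation and returns the result.  Not Lustre's real structures.   */
#include <stdint.h>

enum intent_op { IT_LOOKUP, IT_MKDIR, IT_CREATE, IT_OPEN, IT_UNLINK };

struct md_intent {
    enum intent_op op;
    uint64_t       parent_fid;       /* directory being operated on       */
    char           name[256];        /* entry name, e.g. the new directory */
    uint32_t       mode;             /* permissions for create/mkdir      */
};

struct lock_request {
    uint64_t         resource;       /* what we would like locked          */
    uint32_t         lock_mode;
    struct md_intent intent;         /* high-concurrency path: the MDS
                                        executes this and replies with the
                                        result instead of handing the lock
                                        back to the client                 */
};
```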

  26. [Diagram: (a) Conventional mkdir: the client performs lookup after lookup over the network against the file server, then sends the mkdir, which the server executes as a directory create. (b) Lustre mkdir: the client's lookup carries an intent (mkdir); the lock module ships the intent to the metadata server, which exercises it (lustre_mkdir -> mds_mkdir) in a single request.]

  27. Lustre 1.0 • Only has the high concurrency model • Aggregate throughput (1,000 clients): • ~5,000 file creations (open/close) per second • ~7,800 stats/sec across 10 directories of 1M files each • Single client: • Around 1,500 creations or stats per second • Handling 10M-file directories is effortless • Many changes to ext3 (all merged in 2.6)

  28. Metadata Future • Lustre 2.0 – 2004 • Metadata clustering • Common operations will parallelize • 100% WB caching in memory or on disk • Like AFS

  29. Recovery

  30. Recovery approach • Keep it simple! • Based on failover circles: • Use existing failover software • Your left working neighbor is your failover node • At HP we use failover pairs • Simplify storage connectivity • On I/O failure: • A peer node serves the failed OST • Client retries are routed to the new OST node
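
As a toy illustration of the failover-circle idea (your left working neighbor takes over for you), the function below walks left around a ring of server indices to pick a failover target; it is a sketch under that assumption, not CFS's failover tooling:

```c
/* Toy illustration of a failover circle: servers sit on a ring, and when
 * one fails, its nearest working left-hand neighbor takes over its OST.
 * Purely illustrative; deployments described on the slide use existing
 * failover software and, at HP, simple failover pairs.                   */
#include <stdbool.h>

/* Returns the index of the node that should serve `failed`'s storage,
 * or -1 if every node is down.  `alive[i]` says whether node i is up.    */
int failover_target(const bool *alive, int n_nodes, int failed)
{
    for (int step = 1; step < n_nodes; step++) {
        int candidate = (failed - step + n_nodes) % n_nodes; /* walk left around the ring */
        if (alive[candidate])
            return candidate;
    }
    return -1;
}
```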

  31. OST server – redundant pair [Diagram: OST 1 and OST 2 are cross-connected through two FC switches to shared storage, each array reachable via controllers C1 and C2, so either server can take over the other's disks.]

  32. Configuration

  33. Lustre 1.0 • Good tools to build a configuration • Configuration is recorded on the MDS • Or on a dedicated management server • Configuration can be changed • 1.0 requires downtime for changes • Clients auto-configure • mount -t lustre -o ... mds://fileset/sub/dir /mnt/pt • SNMP support

  34. Futures

  35. Advanced Management • Snapshots • All features you might expect • Global namespace • Combine best of AFS & autofs4 • HSM, hot migration • Driven by customer demand (we plan XDSM) • Online 0-downtime re-configuration • Part of Lustre 2.0

  36. Security • Authentication • POSIX style authorization • NASD style OST authorization • Refinement: use OST ACLs and cookies • File encryption with a group key service • STK secure file system

  37. Project status

  38. Lustre Feature Roadmap

  39. Cluster File Systems, Inc.

  40. Cluster File Systems • Small service company: 20-30 people • Software development & services (95% Lustre) • Contract work for government labs • OSS, but defense contracts • Extremely specialized, with deep expertise • We only do file systems and storage • Investment not needed - profitable • Partners: HP, Dell, DDN, Cray

  41. Lustre – conclusions • Great vehicle for advanced storage software • Things are done differently • Protocols & design from Coda & InterMezzo • Stacking & DB recovery theory applied • Leverage existing components • Initial signs promising

  42. HP & Lustre • Two projects • ASCI PathForward – Hendrix • Lustre storage product • Field trial in Q1 of 2004

  43. Questions?
