
Presentation Transcript


  1. ADIC / CASPUR / CERN / DataDirect / ENEA / IBM / RZ Garching / SGI – New results from CASPUR Storage Lab. Andrei Maslennikov, CASPUR Consortium, May 2004

  2. Participants:
  • ADIC Software : E.Eastman
  • CASPUR : A.Maslennikov (*), M.Mililotti, G.Palumbo
  • CERN : C.Curran, J.Garcia Reyero, M.Gug, A.Horvath, J.Iven, P.Kelemen, G.Lee, I.Makhlyueva, B.Panzer-Steindel, R.Többicke, L.Vidak
  • DataDirect Networks : L.Thiers
  • ENEA : G.Bracco, S.Pecoraro
  • IBM : F.Conti, S.De Santis, S.Fini
  • RZ Garching : H.Reuter
  • SGI : L.Bagnaschi, P.Barbieri, A.Mattioli
  (*) Project Coordinator
  A.Maslennikov - May 2004 - SLAB update

  3. Sponsors for these test sessions:
  • ACAL Storage Networking : loaned a 16-port Brocade switch
  • ADIC Software : provided the StorNext file system product, actively participated in tests
  • DataDirect Networks : loaned an S2A 8000 disk system, actively participated in tests
  • E4 Computer Engineering : loaned 10 assembled biprocessor nodes
  • Emulex Corporation : loaned 16 Fibre Channel HBAs
  • IBM : loaned a FAStT900 disk system and the SANFS product complete with 2 MDS units, actively participated in tests
  • Infortrend-Europe : sold 4 EonStor disk systems at a discount price
  • INTEL : donated 10 motherboards and 20 CPUs
  • SGI : loaned the CXFS product
  • Storcase : loaned an InfoStation disk system
  A.Maslennikov - May 2004 - SLAB update

  4. Contents
  • Goals
  • Components under test
  • Measurements:
    - SATA/FC systems
    - SAN File Systems
    - AFS Speedup
    - Lustre (preliminary)
    - LTO2
  • Final remarks
  A.Maslennikov - May 2004 - SLAB update

  5. Goals for this test series
  • Performance of low-cost SATA/FC disk systems
  • Performance of SAN File Systems
  • AFS speedup options
  • Lustre
  • Performance of the LTO-2 tape drive
  A.Maslennikov - May 2004 - SLAB update

  6. Components – disk systems:
  • 4x Infortrend EonStor A16F-G1A2 16-bay SATA-to-FC arrays:
    - Maxtor Maxline Plus II 250 GB SATA disks (7200 rpm)
    - dual Fibre Channel outlet at 2 Gbit
    - cache: 1 GB
  • 2x IBM FAStT900 dual-controller arrays with SATA expansion units:
    - 4x EXP100 expansion units with 14 Maxtor SATA disks of the same type
    - dual Fibre Channel outlet at 2 Gbit
    - cache: 1 GB
  • 1x StorCase InfoStation 12-bay array:
    - same Maxtor SATA disks
    - dual Fibre Channel outlet at 2 Gbit
    - cache: 256 MB
  • 1x DataDirect S2A 8000 system:
    - 2 controllers with 74 FC disks of 146 GB
    - 8 Fibre Channel outlets at 2 Gbit
    - cache: 2.56 GB
  A.Maslennikov - May 2004 - SLAB update

  7. Infortrend EonStor A16F-G1A2
  - Two 2 Gbps Fibre Channel host channels
  - RAID levels supported: RAID 0, 1 (0+1), 3, 5, 10, 30, 50, NRAID and JBOD
  - Multiple arrays configurable with dedicated or global hot spares
  - Automatic background rebuild
  - Configurable stripe size and write policy per array
  - Up to 1024 LUNs supported
  - 3.5", 1" high, 1.5 Gbps SATA disk drives
  - Variable stripe size per logical drive
  - Up to 64 TB per logical drive
  - Up to 1 GB SDRAM

  8. FAStT900 Storage Server
  - 2 Gbps SFP host connections
  - Expansion units: EXP700 (FC) / EXP100 (SATA)
  - Four SAN (FC-SW) or eight direct (FC-AL) host connections
  - Four (redundant) 2 Gbps drive channels
  - Capacity: min 250 GB, max 56 TB (14 disks x EXP100 SATA); min 32 GB, max 32 TB (14 disks x EXP700 FC)
  - Dual-active controllers
  - Cache: 2 GB
  - RAID support: 0, 1, 3, 5, 10

  9. StorCase Fibre-to-SATA InfoStation
  - SATA and Ultra ATA/133 drive interface
  - 12 hot-swappable drives
  - Switched or FC-AL host connections
  - RAID levels: 0, 1, 0+1, 3, 5, 30, 50 and JBOD
  - Dual 2 Gbps Fibre Channel host ports
  - Supports up to 8 arrays and 128 LUNs
  - Up to 1 GB PC200 DDR cache memory

  10. DataDirect S²A8000
  - Single 2U S2A8000 with four 2 Gb/s ports, or dual 4U with eight 2 Gb/s ports
  - Up to 1120 disk drives; 8192 LUNs supported
  - 5 TB to 130 TB with FC disks, 20 TB to 250 TB with SATA disks
  - Sustained performance well over 1 GB/s (1.6 GB/s theoretical)
  - Full Fibre Channel duplex performance on every port
  - PowerLUN™: 1 GB/s+ individual LUNs without host-based striping
  - Up to 20 GB of cache, LUN-in-cache solid state disk functionality
  - Real-time any-to-any virtualization
  - Very fast rebuild rate

  11. Components
  • High-end Linux units for both servers and clients
    - biprocessor Pentium IV Xeon 2.4+ GHz, 1 GB RAM
    - Qlogic QLA2300 2 Gbit or Emulex LP9xxx Fibre Channel HBAs
  • Network
    - 2x Dell 5224 GigE switches
  • SAN
    - Brocade 3800 switch, 16 ports (test series 1)
    - Qlogic SANbox 5200, 32 ports (test series 2)
  • Tapes
    - 2x IBM Ultrium LTO2 (3580-TD2, Rev: 36U3)
  A.Maslennikov - May 2004 - SLAB update

  12. Qlogic SANbox 5200 Stackable Switch
  - 8, 12 or 16 auto-detecting 2 Gb/1 Gb device ports with 4-port incremental upgrades
  - Stacking of up to 4 units for 64 available user ports
  - Interoperable with all FC-SW-2 compliant Fibre Channel switches
  - Full-fabric, public-loop or switch-to-switch connectivity on 2 Gb or 1 Gb front ports
  - "No-wait" routing: guaranteed maximum performance independent of data traffic
  - Supports traffic between switches, servers and storage at up to 10 Gb/s
  - Low cost: the 5200/16p costs less than half as much as a Brocade 3800/16p
  - May be upgraded in 8-port steps

  13. IBM LTO Ultrium 2 Tape Drive Features
  - 200 GB native capacity (400 GB compressed)
  - 35 MB/s native transfer rate (70 MB/s compressed)
  - Reads and writes LTO 1 cartridges (backward compatible)
  - Native 2 Gb FC interface
  - 64 MB buffer (vs 32 MB in Ultrium 1)
  - Speed matching, channel calibration
  - 512 tracks vs 384 tracks in Ultrium 1
  - Faster load/unload, data access and rewind times

  14. SATA / FC Systems A.Maslennikov - May 2004 - SLAB update

  15. SATA / FC Systems – hw details
  • Typical array features:
    - single or dual (active-active) controller
    - up to 1 GB of RAID cache
    - battery to preserve the cache contents during power cuts
    - 8 to 16 drive slots
    - cost: 4-6 KUSD per 12/16-bay unit (Infortrend, Storcase)
  • Case and backplane directly affect the disks' lifetime:
    - protection against inrush currents
    - protection against rotational vibration
    - orientation (horizontal better than vertical – remark by A.Sansum)
  • Infortrend EonStor: well engineered (removable controller module, lower vibration, horizontal orientation)
  • Storcase: special protection against inrush currents ("soft-start" drive power circuitry), low vibration
  A.Maslennikov - May 2004 - SLAB update

  16. SATA / FC Systems – hw details
  • High-capacity ATA/SATA disk drives:
    - 250 GB (Maxtor, IBM), 400 GB (Hitachi)
    - RPM: 7200
    - improved quality: 3-year warranty, 5-year component design lifetime
  • CASPUR experience with Maxtor drives:
    - in 1.5 years we lost 5 drives out of ~100, 2 of them due to power cuts
    - factory quality of the recent Maxtor Maxline Plus II 250 GB disks: out of 66 disks purchased, 4 had to be replaced shortly after delivery; the others stand up to the stress very well
  • Learned during this meeting:
    - RAL's annual failure rate is 21 out of 920 Maxtor Maxline drives
  A.Maslennikov - May 2004 - SLAB update

  17. SATA / FC Systems – test setup
  [Setup diagram: 16 dual 2.4+ GHz nodes with Qlogic 2310F HBAs, Dell 5224 GigE switch, 2x Qlogic SANbox 5200 FC switches, 4x IFT A16F-G1A2, 4x IBM FAStT900, StorCase InfoStation]
  • Parameters to select / tune:
    - stripe size for RAID-5
    - SCSI queue depth on the controller and on the Qlogic HBAs
    - number of disks per logical drive
  • In the end we worked with RAID-5 LUNs composed of 8 HDs each
  • Stripe size: 128K (and 256K in some tests)
  A.Maslennikov - May 2004 - SLAB update

  18. SATA / FC tests – kernel and fs details
  • Kernel settings:
    - kernels: 2.4.20-30.9smp, 2.4.20-20.9.XFS1.3.1smp
    - vm.bdflush: "2 500 0 0 500 1000 20 10 0"
    - vm.max(min)-readahead: 256(127) for large streaming writes, 4(3) for random reads with small block sizes
  • File systems:
    - EXT3 (128k RAID-5 stripe size):
      fs options: "-m 0 -j -J size=128 -R stride=32 -T largefile4"
      mount options: "data=writeback"
    - XFS 1.3.1 (128k RAID-5 stripe size):
      fs options: "-i size=512 -d agsize=4g,su=128k,sw=7,unwritten=0 -l su=128k"
      mount options: "logbsize=262144,logbufs=8"
  A.Maslennikov - May 2004 - SLAB update
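
  For illustration only, the settings above translate into commands roughly like the following on a 2.4-era Red Hat system; the device name /dev/sdb1 and the mount point /fs are placeholders, while the option strings are the ones quoted on the slide:

    # Kernel VM tuning via the 2.4 sysctl interface (values as quoted above)
    sysctl -w vm.bdflush="2 500 0 0 500 1000 20 10 0"
    sysctl -w vm.max-readahead=256     # large streaming writes
    sysctl -w vm.min-readahead=127     # use 4 / 3 instead for random reads with small blocks

    # EXT3 on a RAID-5 LUN with a 128k stripe (device and mount point are placeholders)
    mke2fs -m 0 -j -J size=128 -R stride=32 -T largefile4 /dev/sdb1
    mount -t ext3 -o data=writeback /dev/sdb1 /fs

    # XFS 1.3.1 with matching stripe-unit / stripe-width geometry
    mkfs.xfs -i size=512 -d agsize=4g,su=128k,sw=7,unwritten=0 -l su=128k /dev/sdb1
    mount -t xfs -o logbsize=262144,logbufs=8 /dev/sdb1 /fs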

  19. SATA / FC tests – benchmarks used
  • Large serial writes and reads:
    - "lmdd" from the "lmbench" suite: http://sourceforge.net/projects/lmbench
    - typical invocation: lmdd of=/fs/file bs=1000k count=8000 fsync=1
  • Random reads:
    - Pileup benchmark (Rainer.Toebbicke@cern.ch), designed to emulate the disk activity of multiple data analysis jobs:
      1) a series of 2 GB files is created in the destination directory
      2) these files are then read in a random fashion, in many threads
  A.Maslennikov - May 2004 - SLAB update
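
  Pileup itself is a CERN-internal tool; purely to illustrate the access pattern it emulates (and not as a substitute for the real benchmark), a crude shell approximation could look like this, with the file count, file size and thread numbers chosen arbitrarily:

    # Step 1: create a series of ~2 GB files in the destination directory
    mkdir -p /fs/pileup
    for i in $(seq 1 8); do
        lmdd of=/fs/pileup/file$i bs=1000k count=2000 fsync=1
    done

    # Step 2: read the files back in several concurrent streams at random offsets
    for t in $(seq 1 16); do
        ( for j in $(seq 1 100); do
              f=$((RANDOM % 8 + 1))         # pick a file at random
              off=$((RANDOM % 1900))        # pick an offset (in MB) inside the file
              dd if=/fs/pileup/file$f of=/dev/null bs=1M skip=$off count=8 2>/dev/null
          done ) &
    done
    wait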

  20. SATA / FC results – EXT3: filling 1.7 TB with 8 GB files
  • IFT systems show anomalous behaviour with the EXT3 file system: performance varies along the file system. The effect visibly depends on the RAID-5 stripe size (plots for 32K, 128K and 256K stripes).
  • The problem was reproduced and understood by Infortrend; new firmware is due in July.
  A.Maslennikov - May 2004 - SLAB update

  21. SATA / FC results
  • IBM FAStT and Storcase behave in a more predictable manner with EXT3. Both systems may however lose up to 20% in performance along the file system.
  A.Maslennikov - May 2004 - SLAB update

  22. SATA / FC results – XFS: filling 1.7 TB with 8 GB files
  • The situation changes radically with this file system. The curves are now almost flat, and everything is much faster compared with EXT3 (plots for IBM, Storcase and Infortrend).
  • Infortrend and Storcase show comparable write speeds of about 135-140 MB/sec; IBM is much slower on writes (below 100 MB/sec).
  • Read speeds are visibly higher thanks to the read-ahead function of the controller (the IBM and IFT systems had 1 GB of RAID cache, Storcase only 256 MB).
  A.Maslennikov - May 2004 - SLAB update

  23. SATA / FC results – Pileup tests
  • These tests were done only on the IFT and Storcase systems. The results depend to a large extent on the number of threads that access the previously prepared files (beyond a certain number of threads performance may drop, since the test machine may have trouble handling many threads at a time).
  • The best result was obtained with the Infortrend array and the XFS file system.
  A.Maslennikov - May 2004 - SLAB update

  24. SATA / FC results – operation in degraded mode
  • We tried this on a single Infortrend LUN of 5 HDs with EXT3: one of the disks was removed, and the rebuild process was started.
  • The write speed went down from 105 to 91 MB/sec.
  • The read speed went down from 105 to 28 MB/sec, and at times even lower.
  A.Maslennikov - May 2004 - SLAB update

  25. SATA / FC results – conclusions
  • 1) The recent low-cost SATA-to-FC disk arrays (Infortrend, Storcase) operate very well and are able to deliver excellent I/O speeds, far exceeding that of Gigabit Ethernet. The cost of such systems may be as low as 2.5 USD per raw GB. Their quality is dominated by the quality of the SATA disks.
  • 2) The choice of local file system is fundamental: XFS easily outperforms EXT3. On one occasion we observed an XFS hang under a very heavy load; "xfs_repair" was run, and the error never reappeared. We are now planning to investigate this in depth. CASPUR AFS and NFS servers are all XFS-based, and there has been only one XFS-related problem since we put XFS in production 1.5 years ago. But perhaps we were simply lucky.
  A.Maslennikov - May 2004 - SLAB update
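
  For reference, a check of this kind is run against the unmounted block device; the device name below is a placeholder, and the mount options simply reuse the XFS settings from slide 18:

    # Unmount the affected XFS file system, repair it, then remount
    umount /fs
    xfs_repair /dev/sdb1
    mount -t xfs -o logbsize=262144,logbufs=8 /dev/sdb1 /fs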

  26. SAN File Systems A.Maslennikov - May 2004 - SLAB update

  27. SAN FS placement
  • These advanced distributed file systems allow clients to operate directly with block devices (block-level file access). Metadata traffic goes via GigE. A Storage Area Network is required.
  • Current cost of a single Fibre Channel connection is over 1000 USD:
    - switch port: min ~500 USD including GBIC
    - host bus adapter: min ~800 USD
  • Special discounts for massive purchases are not impossible, but it is very hard to imagine the cost of a connection dropping below 600-700 USD in the near future.
  • => A SAN FS with native Fibre Channel connection is therefore still not an option for large farms. A SAN FS with iSCSI connection may be re-evaluated in combination with the new iSCSI-SATA disk arrays.
  A.Maslennikov - May 2004 - SLAB update

  28. SAN File Systems
  • Where SAN file systems with FC connection may be used:
    1) High-performance computing: fast parallel I/O, faster sequential I/O
    2) Hybrid SAN / NAS systems: a relatively small number of SAN clients acting as (also redundant) NAS servers
    3) HA clusters with file locking: mail (shared pool), web, etc.
  A.Maslennikov - May 2004 - SLAB update

  29. SAN File Systems
  • So far, we have tried these products:
    0) Sistina GFS (see our 2002 and 2003 reports)
    1) ADIC StorNext File System
    2) IBM SANFS (StorTank) (preliminary; we continue looking into it)
    3) SGI CXFS (work in progress)
  A.Maslennikov - May 2004 - SLAB update

  30. SAN File Systems A.Maslennikov - May 2004 - SLAB update

  31. SAN File Systems – test setup
  [Setup diagram: 16 dual 2.4+ GHz nodes with Qlogic 2310F HBAs, 2x Qlogic SANbox 5200, Dell 5224 GigE switch, 4x IFT A16F-G1A2, 4x IBM FAStT900, an IA32 IBM StorTank MDS and an Origin 200 CXFS MDS]
  • What was measured (StorNext and StorTank):
    1) aggregate write and read speeds on 1, 7 and 14 clients
    2) aggregate Pileup speed on 1, 7 and 14 clients accessing: A) different sets of files; B) the same set of files
  • During these tests we used 4 LUNs of 13 HDs each, as recommended by IBM
  • For each SAN FS we tried both the IFT and FAStT disk systems
  A.Maslennikov - May 2004 - SLAB update

  32. SAN File Systems – large sequential files
  • StorNext and StorTank behave in a similar manner on writes; StorNext does better on reads. The IBM disk systems perform better than IFT on reads with multiple clients.
  [Tables: IBM StorTank and ADIC StorNext results, all numbers in MB/sec]
  A.Maslennikov - May 2004 - SLAB update

  33. SAN File Systems – Pileup tests
  • StorTank definitely outperforms StorNext in this type of benchmark. The results are very interesting, as it turns out that peak Pileup speeds with StorTank on a single client may reach GigE speed (case of the IFT disks).
  [Tables: IBM StorTank (unstable for IFT with more than 1 client) and ADIC StorNext results, all numbers in MB/sec]
  A.Maslennikov - May 2004 - SLAB update

  34. SAN File Systems – CXFS experience
  • MDS: on an SGI Origin 200 with 1 GB of RAM (IRIX 6.5.22), 4 IFT arrays
  • The first numbers were not bad, but with 4 clients or more the system becomes unstable (when all are used at the same time, one client will hang). That is what we have observed so far.
  • We are currently investigating the problem together with SGI.
  A.Maslennikov - May 2004 - SLAB update

  35. SAN File Systems – StorNext on the DataDirect system
  [Setup diagram: 2x S2A8000 with 8 FC outlets, 2x Brocade 3800 switches, 16 dual 2.4+ GHz nodes with Emulex LP9xxx HBAs, Dell 5224 GigE switch]
  • The S2A 8000 came with FC disks, although we had asked for SATA
  • Quite easy to configure, extremely flexible
  • Multiple levels of redundancy, small declared performance degradation on rebuilds
  • We ran only large serial write and read 8 GB lmdd tests, using all the available power
  A.Maslennikov - May 2004 - SLAB update

  36. SAN File Systems – some remarks
  • The performance of a SAN file system is quite close to that of the disk hardware it is built upon (case of native FC connection).
  • StorNext is the easiest to configure: it does not require a standalone MDS and works smoothly with all kinds of disk systems, FC switches etc. We were able to export it via NFS, but with the loss of 50% of the available bandwidth. iSCSI: ?
  • StorTank is probably the most solid implementation of a SAN FS, and it has a lot of useful options. It delivers the best numbers for random reads, and may be considered a good candidate for relatively small clusters with native FC connection destined for express data analysis. It may have issues with 3rd-party disks. Supports iSCSI.
  • CXFS uses the very performant XFS base and hence should have good potential, although the 2 TB file system size limit on Linux/32bit is a real limitation (the same is true for GFS). Some functions, like MDS fencing, require particular hardware. iSCSI: ?
  • MDS loads: small for StorNext and CXFS, quite high for StorTank.
  A.Maslennikov - May 2004 - SLAB update
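
  As a sketch of the NFS re-export mentioned above (and of the hybrid SAN / NAS idea from slide 28), a SAN client can be turned into a NAS server with the standard Linux NFS tools; the export path and network below are invented placeholders, not the actual test configuration:

    # /etc/exports entry on the SAN client acting as a NAS server
    #   /stornext   192.168.1.0/255.255.255.0(rw,sync,no_root_squash)

    exportfs -ra               # publish the entries from /etc/exports
    showmount -e localhost     # verify that the export is visible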

  37. AFS Speedup A.Maslennikov - May 2004 - SLAB update

  38. AFS speedup options
  • AFS performance for large files is quite poor (max 35-40 MB/sec even on very performant hardware). To a large extent this is due to the limitations of the Rx RPC protocol, and to a less than optimal implementation of the file server.
  • One possible workaround is to replace the Rx protocol with an alternative one in all cases where it is used for file serving. We evaluated two such experimental implementations:
    1) AFS with OSD support (Rainer Toebbicke). Rainer stores AFS data inside Object-based Storage Devices (OSDs), which need not reside inside the AFS file servers. The OSD performs basic space management and access control, and is implemented as a Linux daemon in user space on an EXT2 file system. The AFS file server acts only as an MDS.
    2) Reuter's fast AFS (Hartmut Reuter). In this approach, AFS partitions (/vicepXX) are made visible on the clients via a fast SAN or NAS mechanism. As in case 1), the AFS file server acts as an MDS and directs the clients to the right files inside /vicepXX for faster data access.
  A.Maslennikov - May 2004 - SLAB update

  39. AFS speedup options – both methods worked!
  • The AFS/OSD scheme was tested during the Fall 2003 test session; the tests were done with DataDirect's S2A 8000 system. In one particular test we were able to achieve a 425 MB/sec write speed for both the native EXT2 and the AFS/OSD configurations.
  • The Reuter AFS was evaluated during the Spring 2004 session. The StorNext SAN file system was used to distribute a /vicepX partition among several clients. As in the previous case, AFS/Reuter performance was practically equal to the native performance of StorNext for large files.
  • To learn more about the DataDirect system and the Fall 2003 session, please visit: http://afs.caspur.it/slab2003b
  A.Maslennikov - May 2004 - SLAB update

  40. Lustre! A.Maslennikov - May 2004 - SLAB update

  41. Lustre – preliminary results
  • Lustre 1.0.4
  • We used 4 Object Storage Targets on 4 Infortrend arrays, no striping
  • Very interesting numbers for sequential I/O (8 GB files, MB/sec)
  • These numbers may be directly compared with the SAN FS results obtained with the same disk arrays
  A.Maslennikov - May 2004 - SLAB update
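
  For orientation, this is how striping is controlled from the client side; the syntax below is the present-day lfs command line (the Lustre 1.0.x tools differed), and the mount point /mnt/lustre is a placeholder:

    # No striping: each new file in this directory lives entirely on one OST
    lfs setstripe -c 1 /mnt/lustre/nostripe

    # Stripe new files in this directory across all available OSTs
    lfs setstripe -c -1 /mnt/lustre/striped

    # Show which OST objects back a given file
    lfs getstripe /mnt/lustre/nostripe/file1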

  42. LTO-2 Tape Drive A.Maslennikov - May 2004 - SLAB update

  43. LTO-2 tape drive
  • The drive is a "factor 2" evolution of its predecessor, LTO-1. According to the specs, it should be able to deliver up to 35 MB/sec native I/O speed and 200 GB of native capacity.
  • We were mainly interested in checking the following (see next page):
    - write speed as a function of block size
    - time to write a tape mark
    - positioning times
  • The overall judgement: quite positive. The drive fits well for backup applications and is acceptable for staging systems. Its strong point is definitely the relatively low cost (10-11 KUSD), which makes it quite competitive (compare with ~30 KUSD for an STK 9940B).
  A.Maslennikov - May 2004 - SLAB update

  44. LTO-2
  • Write speed as a function of block size: > 31 MB/sec native for large blocks, very stable
  • Tape mark writing is rather slow: 1.4-1.5 sec per tape mark
  • Positioning: an fsf to the needed file may take up to 1.5 minutes (average: 1 minute)
  A.Maslennikov - May 2004 - SLAB update
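
  Measurements of this kind can be reproduced with the standard dd and mt tools; the tape device name /dev/nst0 (non-rewinding) and the sizes below are assumptions, and incompressible input data is needed so that drive compression does not inflate the native numbers:

    DEV=/dev/nst0                     # non-rewinding tape device (assumed name)

    # Prepare ~2 GB of incompressible test data
    dd if=/dev/urandom of=/tmp/tape-test.dat bs=1M count=2048

    # Write speed as a function of block size
    for bs in 32k 256k 1M; do
        mt -f $DEV rewind
        echo "block size $bs:"
        time dd if=/tmp/tape-test.dat of=$DEV bs=$bs
    done

    # Time needed to write a single tape mark
    time mt -f $DEV weof 1

    # Positioning: time a forward-space-file to the 10th file on the tape
    mt -f $DEV rewind
    time mt -f $DEV fsf 10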

  45. Final remarks
  • Our immediate plans include:
    - further investigation of StorTank, CXFS and yet another SAN file system (Veritas), including NFS export
    - evaluation of iSCSI-enabled SATA RAID arrays in combination with SAN file systems
    - further Lustre testing on IFT and IBM hardware (new version 1.2, striping, other benchmarks)
  • Feel free to join us at any moment!
  A.Maslennikov - May 2004 - SLAB update
