
Optimizing Performance of HPC Storage Systems


Presentation Transcript


  1. Optimizing Performance of HPC Storage Systems Torben Kling Petersen, PhD Principal Architect High Performance Computing

  2. Current Top10 …..

  3. Performance testing Lies Bigger lies Benchmarks

  4. Storage benchmarks
  • IOR
  • IOzone
  • Bonnie++
  • sgpdd-survey
  • obdfilter-survey
  • FIO
  • dd/xdd
  • Filebench
  • dbench
  • Iometer
  • MDstat
  • metarates …

  5. Lustre® Architecture – High Level
  [Architecture diagram] Lustre clients (1–100,000) reach the storage over multiple network types (IB, x-GigE), with routers and CIFS/NFS gateway clients at the edge. Object Storage Servers (OSS, 1–1,000s) serve Object Storage Targets (OSTs) on disk arrays over a SAN fabric; Metadata Servers (MDS) serve the Metadata Target (MDT).

  6. Dissecting benchmarking

  7. The chain and the weakest link …
  [Diagram of the client-to-disk data path] Client side: CPU, memory, PCI-E bus, FS client, MPI stack. Network: interconnect, non-blocking fabric?, TCP/IP overhead??, routing? OSS side: CPU, memory, PCI-E bus, OS, SAS controller/expander, SAS port over-subscription, cabling, RAID controller (SW/HW). Backend: disk drive performance, RAID sets, SAS or S-ATA, file system?? Only a balanced system will deliver performance.

  8. Server Side Benchmark
  • obdfilter-survey is a Lustre benchmark tool that measures OSS and backend OST performance; it does not exercise LNET or the clients.
  • This makes it a good benchmark for isolating the servers from the network and the clients.
  • Example of obdfilter-survey parameters:
  [root@oss1 ~]# nobjlo=1 nobjhi=1 thrlo=256 thrhi=256 size=65536 obdfilter-survey
  • Parameters defined:
  • size=65536 // file size per OST, in MB (2x controller memory is good practice)
  • nobjhi=1 nobjlo=1 // number of objects (files)
  • thrhi=256 thrlo=256 // number of worker threads when testing the OSS
  • If results are significantly lower than expected, rerun the test several times to check whether the low numbers persist.
  • The benchmark can also target individual OSTs: if an OSS node performs lower than expected, the cause may be a single OST slowed by a drive issue, a RAID array rebuild, etc.
  [root@oss1 ~]# targets="fsname-OST0000 fsname-OST0002" nobjlo=1 nobjhi=1 thrlo=256 thrhi=256 size=65536 obdfilter-survey

  9. Client Side Benchmark
  • IOR uses MPI-IO to run the benchmark across all nodes and mimics typical HPC applications running on the clients.
  • Within IOR, the benchmark can be configured for File-Per-Process or Single-Shared-File:
  • File-Per-Process: creates a unique file per task; the most common way to measure peak throughput of a Lustre parallel file system.
  • Single-Shared-File: creates a single file shared by all tasks running on all clients.
  • IOR has two primary modes:
  • Buffered: the default; takes advantage of the Linux page cache on the client.
  • DirectIO: bypasses the Linux page cache and writes directly to the file system.
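
  As a rough illustration (not taken from the slides), the two modes might be invoked as below; the task count, sizes, API choice and output paths are placeholder assumptions:

  # File-Per-Process, buffered I/O (the default): -F gives one file per task
  mpirun -np 3200 ior -a POSIX -F -w -r -e -C -t 1m -b 4g -o /mnt/lustre/ior_fpp/testfile
  # Single-Shared-File with DirectIO: drop -F, add -B to bypass the client page cache
  mpirun -np 3200 ior -a POSIX -B -w -r -e -C -t 1m -b 4g -o /mnt/lustre/ior_ssf/testfile

  Here -w/-r write then read back, -e fsyncs after the write phase, and -C shifts task ordering on the read phase so a client does not read back data it still holds in its own cache.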

  10. Typical Client Configuration
  • At customer sites, all clients typically have the same architecture, the same number of CPU cores, and the same amount of memory.
  • With a uniform client architecture, the IOR parameters are simpler to tune and optimize for benchmarking.
  • Example for 200 clients:
  • Number of cores per client: 16 (nproc)
  • Amount of memory per client: 32 GB (cat /proc/meminfo)
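
  A quick way to confirm that uniformity is a pdsh sweep; a minimal sketch, assuming a hypothetical client[001-200] host range and pdsh/dshbak installed:

  # Identical, collapsed output from dshbak -c confirms a uniform client configuration
  pdsh -w client[001-200] 'echo "$(nproc) cores"; grep MemTotal /proc/meminfo' | dshbak -c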

  11. IOR Rule of Thumb
  • Always transfer 2x the aggregate memory of all clients used, to avoid any client-side caching effect.
  • In our example:
  • (200 clients * 32 GB) * 2 = 12,800 GB
  • The total file size for the IOR benchmark will be 12.8 TB.
  • NOTE: This assumes all nodes are uniform.
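
  Turning that total into per-task IOR settings, as a comment-only sketch (the 16 tasks per client carry over from the example configuration above):

  # Aggregate client memory:       200 clients x 32 GB = 6,400 GB
  # Rule of thumb (2x memory):     6,400 GB x 2        = 12,800 GB to transfer
  # MPI tasks (16 per client):     200 x 16            = 3,200 tasks
  # IOR block size per task (-b):  12,800 GB / 3,200   = 4 GB, i.e. -b 4g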

  12. Lustre Configuration

  13. Lustre Server Caching Description
  • Lustre read_cache_enable
  • Controls whether data read from disk during a read request is kept in memory and made available for later read requests for the same data, without having to re-read it from disk. By default, the read cache is enabled (read_cache_enable = 1).
  • Lustre writethrough_cache_enable
  • Controls whether data sent to the OSS as a write request is kept in the read cache and made available for later reads, or discarded from the cache when the write completes. By default, the writethrough cache is enabled (writethrough_cache_enable = 1).
  • Lustre readcache_max_filesize
  • Controls the maximum size of a file that both the read cache and the writethrough cache will try to keep in memory. Files larger than readcache_max_filesize are not cached for either reads or writes. By default, files of all sizes are cached.
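
  These parameters live under the obdfilter tree on each OSS; a minimal lctl sketch, with the 1M limit mirroring the value used in the test setup later in this deck:

  [root@oss1 ~]# lctl get_param obdfilter.*.read_cache_enable obdfilter.*.writethrough_cache_enable
  [root@oss1 ~]# lctl set_param obdfilter.*.readcache_max_filesize=1M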

  14. Client Lustre Parameters
  • Network checksums
  • Default is on and costs performance; disabling checksums is the first thing we do when tuning for benchmarks.
  • LRU size
  • Parameter used to control the number of client-side locks in an LRU queue; typically we disable it.
  • Max RPCs in flight
  • RPC = remote procedure call. This tunable is the maximum number of concurrent RPCs in flight from a client. Default is 8; increase it to 32.
  • Max dirty MB
  • Defines how many MB of dirty data can be written and queued up on the client. Default is 32; a good rule of thumb is 4x the value of max_rpcs_in_flight.
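
  A sketch of applying these client-side settings with lctl; the max_dirty_mb value follows the 4x rule of thumb, and the LRU line is left as a comment because the deck does not give the exact value used:

  # Run on every Lustre client
  lctl set_param osc.*.checksums=0            # disable network checksums
  lctl set_param osc.*.max_rpcs_in_flight=32  # up from the default of 8
  lctl set_param osc.*.max_dirty_mb=128       # 4x max_rpcs_in_flight
  # LRU size is tuned via ldlm.namespaces.*.lru_size; the deck disables it but does not
  # state the value, so check your Lustre version's tuning guide before changing it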

  15. Lustre Striping
  • The default Lustre stripe size is 1M and the default stripe count is 1.
  • Each file is written to 1 OST with a stripe size of 1M.
  • When multiple files are created and written, the MDS makes a best effort to distribute the load across all available OSTs.
  • The defaults can be changed: the smallest stripe size is 64K (adjustable in 64K increments), and the stripe count can be increased to include all OSTs.
  • Setting the stripe count to all OSTs means each file is created across all OSTs. This is best when creating a single shared file from multiple Lustre clients.
  • One can create multiple directories with various stripe sizes and counts to optimize for performance, as sketched below.
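
  Striping is set per directory (or per file) with lfs; a minimal sketch, assuming a /mnt/lustre mount point and placeholder directory names:

  lfs setstripe -c 1  /mnt/lustre/ior_fpp   # file-per-process: keep the default single-OST layout
  lfs setstripe -c -1 /mnt/lustre/ior_ssf   # single shared file: stripe across all OSTs
  lfs getstripe /mnt/lustre/ior_ssf         # verify the layout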

  16. Experimental Setup & Results

  17. Equipment used
  • ClusterStor with 2x CS6000 SSUs
  • 2TB NL-SAS Hitachi drives
  • 4U CMU
  • Neo 1.2.1, HF applied
  • Clients
  • 64 clients, 12 cores, 24GB memory, QDR
  • Mellanox FDR core switch
  • Lustre client: 1.8.7
  • Lustre
  • Server version: 2.1.3
  • 4 OSSes
  • 16 OSTs (RAID 6)

  18. Subset of test parameters
  • Disk backend testing – obdfilter-survey
  • Client based testing – IOR
  • I/O mode
  • I/O slots per client
  • IOR transfer size
  • Number of client threads
  • Lustre tunings (server side)
  • writethrough cache enabled
  • read cache enabled
  • read cache max filesize = 1M
  • Client settings
  • LRU disabled
  • Checksums disabled
  • Max RPCs in flight = 32

  19. Lustre obdfilter-survey
  # pdsh -g oss "TERM=linux thrlo=256 thrhi=256 nobjlo=1 nobjhi=1 rsz=1024K size=32768 obdfilter-survey"
  cstor01n04: ost 4 sz 134217728K rsz 1024K obj 4 thr 1024 write 3032.89 [ 713.86, 926.89] rewrite 3064.15 [ 722.83, 848.93] read 3944.49 [ 912.83, 1112.82]
  cstor01n05: ost 4 sz 134217728K rsz 1024K obj 4 thr 1024 write 3022.43 [ 697.83, 819.86] rewrite 3019.55 [ 705.15, 827.87] read 3959.50 [ 945.20, 1125.76]
  This means that a single SSU has a
  • write performance of 6,055 MB/s (75.9 MB/s per disk)
  • read performance of 7,904 MB/s (98.8 MB/s per disk)
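
  How the headline numbers fall out of the survey output, as a comment-only sketch (each output line above is one OSS node, and the two nodes shown make up one SSU):

  # Aggregate write across the two OSS nodes: 3032.89 + 3022.43 ≈ 6,055 MB/s
  # Aggregate read  across the two OSS nodes: 3944.49 + 3959.50 ≈ 7,904 MB/s
  # The per-disk figures quoted on the slide (75.9 and 98.8 MB/s) imply roughly
  # 80 drives per SSU, i.e. 8 RAID 6 OSTs of 10 drives each.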

  20. Buffered I/O

  21. Direct I/O

  22. Summary

  23. Reflections on the results
  • Never trust marketing numbers …
  • Testing all stages of the data pipeline is essential
  • Optimal parameters and/or methodology for read and write are seldom the same
  • Real-life applications can often be configured accordingly
  • Balanced architectures will deliver performance
  • Client-based IOR performs within 5% of the backend
  • In excess of 750 MB/s per OST … -> 36 GB/s per rack …
  • A well designed solution will scale linearly using Lustre
  • cf. NCSA Blue Waters

  24. Optimizing Performance of HPC Storage Systems Torben_Kling_Petersen@xyratex.com Thank You
