Gfarm Grid File System

Phoenix Workshop on Global Computing Systems

PARIS Seminar

Dec 8, 2006, IRISA, Rennes

Gfarm Grid File System

Osamu Tatebe

University of Tsukuba



Petascale Data Intensive Computing

  • High Energy Physics

    • CERN LHC, KEK-B Belle

      • ~MB/collision, 100 collisions/sec

      • ~PB/year

      • 2000 physicists, 35 countries

[Photos: detector for the LHCb experiment; detector for the ALICE experiment]

  • Astronomical Data Analysis

    • data analysis of the entire data set

    • TB~PB/year/telescope

    • Subaru telescope

      • 10 GB/night, 3 TB/year



Petascale Data Intensive Computing: Requirements

  • Storage Capacity

    • Peta/Exabyte scale files, millions of millions of files

  • Computing Power

    • > 1TFLOPS, hopefully > 10TFLOPS

  • I/O Bandwidth

    • > 100GB/s, hopefully > 1TB/s within a system and between systems

  • Global Sharing

    • group-oriented authentication and access control



Goal and Features of Grid Datafarm

  • Project

    • Began in summer 2000, collaboration with KEK, Titech, and AIST

  • Goal

    • Dependable data sharing among multiple organizations

    • High-speed data access, High-performance data computing

  • Grid Datafarm

    • Gfarm Grid File System – Global dependable virtual file system

      • Federates local disks in PCs

    • Scalable parallel & distributed data computing

      • Associates Computational Grid with Data Grid



[Figure: example global namespace. /gfarm contains site directories such as ggf/jp and aist/gtrc, each holding files file1, file2, ...]

Gfarm Grid File System [CCGrid 2002]

  • Commodity-based distributed file system that federates storage of each site

  • It can be mounted from all cluster nodes and clients

  • It provides scalable I/O performance with respect to the number of parallel processes and users

  • It supports fault tolerance and avoids access concentration by automatic replica selection

[Figure: the global namespace is mapped onto file replicas stored across the Gfarm file system; file replicas can be created on multiple nodes]



Gfarm Grid File System (2)

  • Files can be shared among all nodes and clients

  • Physically, a file may be replicated and stored on any file system node

  • Applications can access a file regardless of its physical location

  • File system nodes can be distributed

[Figure: client PCs and notebook PCs mount /gfarm; replicas of File A, File B, and File C are stored on file system nodes distributed across Japan and the US, with their locations recorded in the Gfarm file system metadata]



Scalable I/O Performance

  • Decentralized disk access, giving priority to the local disk (see the sketch after this list)

    • When a new file is created,

      • Local disk is selected when there is enough space

      • Otherwise, a nearby and lightly loaded node is selected

    • When a file is accessed,

      • Local disk is selected if it has one of the file replicas

      • Otherwise, a nearby and lightly loaded node holding one of the file replicas is selected

  • File affinity scheduling

    • Schedule a process on a node having the specified file

      • Improve the opportunity to access local disk
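
The policy above can be made concrete with a small, self-contained sketch. Everything here (the node_t record, its fields, and the scoring by round-trip time and load) is an illustrative assumption used to express the stated rules; it is not the actual Gfarm scheduler code.

/* select_node.c - minimal sketch of the node-selection policy described
 * above. The node_t type and its fields are mock stand-ins, not Gfarm code;
 * they exist only to make the policy executable. */
#include <stdio.h>

typedef struct {
    const char *name;
    double free_gb;      /* free space on the node's local disk */
    double load;         /* lower = less busy */
    int    rtt_ms;       /* "nearness" to the client */
    int    has_replica;  /* does this node hold a replica of the file? */
    int    is_local;     /* is this the node the client runs on? */
} node_t;

/* Near and least busy: smallest (rtt, load) among qualifying candidates. */
static const node_t *nearest_least_busy(const node_t *nodes, int n,
                                        int need_replica, double min_free_gb)
{
    const node_t *best = NULL;
    for (int i = 0; i < n; i++) {
        if (need_replica && !nodes[i].has_replica)
            continue;
        if (nodes[i].free_gb < min_free_gb)
            continue;
        if (!best || nodes[i].rtt_ms < best->rtt_ms ||
            (nodes[i].rtt_ms == best->rtt_ms && nodes[i].load < best->load))
            best = &nodes[i];
    }
    return best;
}

/* File creation: prefer the local disk if it has enough space. */
static const node_t *select_for_create(const node_t *nodes, int n, double gb)
{
    for (int i = 0; i < n; i++)
        if (nodes[i].is_local && nodes[i].free_gb >= gb)
            return &nodes[i];
    return nearest_least_busy(nodes, n, 0, gb);
}

/* File access: prefer a local replica, else a near, lightly loaded one. */
static const node_t *select_for_read(const node_t *nodes, int n)
{
    for (int i = 0; i < n; i++)
        if (nodes[i].is_local && nodes[i].has_replica)
            return &nodes[i];
    return nearest_least_busy(nodes, n, 1, 0.0);
}

int main(void)
{
    node_t nodes[] = {
        { "local",   2.0, 0.9,   0, 1, 1 },  /* local disk, nearly full */
        { "rack-2", 40.0, 0.1,   1, 1, 0 },  /* nearby, idle, has replica */
        { "remote", 80.0, 0.0, 120, 1, 0 }   /* far away */
    };
    int n = sizeof(nodes) / sizeof(nodes[0]);

    const node_t *c = select_for_create(nodes, n, 10.0);
    const node_t *r = select_for_read(nodes, n);
    printf("create 10 GB on: %s\n", c ? c->name : "(none)");  /* rack-2 */
    printf("read from:       %s\n", r ? r->name : "(none)");  /* local  */
    return 0;
}

File-affinity scheduling (gfrun -G, shown in the Demonstration slide later) applies the same idea in the other direction: the process is moved to a node that already holds a replica, so the read case above degenerates to the local-disk branch.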



Scalable I/O performance in distributed environment

[Figure: user's view vs. physical execution view with file-affinity scheduling. User A submits Job A, which accesses File A, and Job A is executed on the cluster/Grid node that holds File A; likewise User B's Job B runs on the node holding File B. In the Gfarm file system the file system nodes are the compute nodes.]

Shared network file system

Do not separate storage and CPU (SAN not necessary)

Move and execute program instead of moving large-scale data

Exploiting local I/O is the key to scalable I/O performance



Example - Gaussian 03 in Gfarm

  • Ab initio quantum chemistry package

    • Install once and run everywhere

    • No modification required to access Gfarm

  • Test 415 (I/O-intensive test input)

    • 1h 54min 33sec (NFS)

    • 1h 0min 51sec (Gfarm)

  • Parallel analysis of all 666 test inputs using 47 nodes

    • Write error! (NFS)

      • Due to heavy I/O load

    • 16h 6m 58s (Gfarm)

      • Quite good scalability of I/O performance

      • Longest test input takes 15h 42m 10s (test 574)

[Figure: NFS vs. GfarmFS. The same set of compute nodes accesses either NFS or GfarmFS; Gfarm consists of the local disks of the compute nodes.]



Example 2 – Trans-Pacific testbed at SC2003

[Figure: network map of the Trans-Pacific testbed. Clusters at Titech (147 nodes, 16 TBytes, 4 GB/sec), Univ. of Tsukuba, KEK, AIST, Indiana Univ., and the SC2003 showfloor in Phoenix, with individual clusters ranging from 7 to 147 nodes, are interconnected via SuperSINET, Maffin, APAN/TransPAC Tokyo XP, and Abilene over links of 622 Mbps to 10 Gbps (measured rates of 500 Mbps to 2.34 Gbps on some paths). Totals: 236 nodes, 70 TBytes of storage capacity, 13 GB/sec of I/O performance.]



Example 2 – Performance of US-Japan file replication

3.79 Gbps!!

  • Efficient use around the peak rate in long fat networks

  • IFG-based precise pacing with GNET-1

    • -> packet loss free network

  • Disk I/O performance improvement

  • Parallel disk access using Gfarm

  • Trans-Pacific file replication: stable 3.79 Gbps

  • 11 node pairs

  • 97% of theoretical peak performance (MTU 6000 bytes)

  • 1.5TB data was transferred in an hour



    Software Stack

    [Figure: software stack. Gfarm applications, UNIX applications, Windows applications (Word, PowerPoint, Excel, etc.), FTP clients, SFTP/SCP, web browsers/clients, Gfarm commands, UNIX commands, NFS clients, and Windows clients reach the Gfarm Grid File System through servers such as nfsd, samba, ftpd, sshd, and httpd, through the syscall-hook library or GfarmFS-FUSE (which mounts the file system in user space), and through libgfarm, the Gfarm I/O library.]
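
    As an illustration of the top of this stack, an unmodified UNIX application can read a Gfarm file with ordinary POSIX calls once /gfarm is visible through the syscall-hook library or a GfarmFS-FUSE mount. The sketch below is generic POSIX code with no Gfarm-specific calls; the path /gfarm/ggf/jp/file1 is taken from the namespace example earlier and is only illustrative.

    /* cat_bytes.c - ordinary POSIX program reading a file under /gfarm.
     * It contains no Gfarm-specific calls; the syscall-hook library (via
     * LD_PRELOAD) or a GfarmFS-FUSE mount routes the I/O to libgfarm. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *fp = fopen("/gfarm/ggf/jp/file1", "r");   /* illustrative path */
        if (fp == NULL) {
            perror("fopen");
            return EXIT_FAILURE;
        }

        char buf[4096];
        size_t n, total = 0;
        while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
            total += n;                                 /* count bytes read */

        fclose(fp);
        printf("read %zu bytes\n", total);
        return EXIT_SUCCESS;
    }

    The same binary runs against a local file system, an NFS mount, or Gfarm, which is what makes the "install once and run everywhere" claim for applications such as Gaussian possible.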



    Gfarm™ File System Software

    Available at http://datafarm.apgrid.org/

    • libgfarm – Gfarm client library

      • Gfarm API

    • Metadata server, metadata cache servers

      • Namespace, replica catalog, host information, process information

    • gfsd – I/O server

      • file access

    [Figure: an application linked with the Gfarm client library obtains file information from the metadata server, possibly through one of several metadata cache servers, and then performs file access directly against the gfsd I/O servers running on the compute/file system nodes.]



    Demonstration

    • File manipulation

      • cd, ls, cp, mv, cat, . . .

      • grep

    • Gfarm command

      • File replica creation, node & process information

    • Remote (parallel) program execution

      • gfrun prog args . . .

      • gfrun -N #procs prog args . . .

      • gfrun -G filename prog args . . .



    I/O sequence example in Gfarm v1

    [Figure: I/O sequence. On open(), the Gfarm library linked into the application queries the metadata server, which returns the file system nodes holding the file (e.g. FSN1, FSN2); the library performs node selection and connects to one of them (FSN1). read, write, seek, etc. then go directly to that file system node. On close(), the library updates the metadata.]
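
    The same sequence can be traced with a small self-contained toy program. All functions below are mock stand-ins that only print the step they represent; none of them is the real libgfarm API, and the path is illustrative.

    /* io_sequence.c - toy walk-through of the Gfarm v1 I/O sequence above.
     * Every function is a mock that prints the protocol step it stands for. */
    #include <stdio.h>

    typedef struct { const char *path; const char *fsn; } gf_file;

    static const char *metadata_lookup(const char *path)
    {
        printf("open : metadata lookup for %s -> FSN1, FSN2\n", path);
        return "FSN1";                        /* node selection picks FSN1 */
    }

    static gf_file gfarm_open(const char *path)
    {
        gf_file f = { path, metadata_lookup(path) };
        printf("open : connect to gfsd on %s\n", f.fsn);
        return f;
    }

    static void gfarm_io(const gf_file *f)
    {
        printf("read/write/seek : directly on %s, no metadata traffic\n",
               f->fsn);
    }

    static void gfarm_close(const gf_file *f)
    {
        printf("close: update metadata for %s (size, replica state)\n",
               f->path);
    }

    int main(void)
    {
        gf_file f = gfarm_open("/gfarm/jp/file2");    /* illustrative path */
        gfarm_io(&f);
        gfarm_close(&f);
        return 0;
    }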



    Opening files in read-write mode (1)

    • Semantics in Gfarm version 1

      • Updated content is available only when the file is opened after a writing process opens the file

      • No file locking (will be introduced in version 2)

        • It supports exclusive file creation (O_EXCL)
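
    Exclusive creation is the one coordination primitive the slide mentions, and it is reachable from plain POSIX code once /gfarm is mounted or the syscall hook is preloaded. The sketch below uses only standard calls; the path is illustrative and not part of the original slides.

    /* excl_create.c - exclusive file creation (O_EXCL) used as a simple lock.
     * Plain POSIX code; with /gfarm accessible, the same calls apply to
     * Gfarm paths. The path is illustrative only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/gfarm/jp/lockfile";   /* illustrative path */

        /* Succeeds only for the first process to create the file;
         * every other process gets EEXIST. */
        int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open(O_EXCL)");
            return 1;
        }
        dprintf(fd, "held by pid %d\n", (int)getpid());
        close(fd);
        return 0;
    }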



    Opening files in read-write mode (2)

    [Figure: Process 1 opens /gfarm/jp/file2 in read-write mode with fopen("/gfarm/jp/file2", "rw") while Process 2 opens the same file read-only; the metadata server returns the replica locations FSN1 and FSN2 and each process accesses a replica. When Process 1 calls fclose(), the replica that was not updated is deleted from the metadata as invalid, but ongoing file access to it continues.]



    Access from legacy applications

    • Mounting Gfarm file system

      • GfarmFS-FUSE for Linux and FreeBSD

      • Need volunteers for other operating systems

    • libgfs_hook.so – system call hooking library

      • It hooks open(2), read(2), write(2), …

      • When a path under /gfarm is accessed, it calls the appropriate Gfarm I/O API

      • Otherwise, it calls the ordinary system call

      • No re-linking is necessary; the library is loaded via LD_PRELOAD

      • Higher portability than developing a kernel module (see the sketch after this list)
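
    The syscall-hook approach can be illustrated with a small LD_PRELOAD interposer. This is a generic sketch of the mechanism, not the actual libgfs_hook.so source: it only logs open(2) calls on paths under /gfarm and forwards them to the real system call, whereas the real hook library diverts them to the Gfarm I/O API.

    /* hook_open.c - toy LD_PRELOAD interposer showing the mechanism used by
     * libgfs_hook.so. It logs open(2) calls on /gfarm paths and forwards
     * them; the real hook redirects such calls to libgfarm instead.
     *
     *   cc -shared -fPIC -o hook_open.so hook_open.c -ldl
     *   LD_PRELOAD=./hook_open.so some_program ...
     */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    int open(const char *path, int flags, ...)
    {
        /* look up the "real" open(2) on first use */
        static int (*real_open)(const char *, int, ...);
        if (real_open == NULL)
            real_open = (int (*)(const char *, int, ...))
                        dlsym(RTLD_NEXT, "open");

        mode_t mode = 0;
        if (flags & O_CREAT) {            /* open(2) takes a mode here */
            va_list ap;
            va_start(ap, flags);
            mode = (mode_t)va_arg(ap, int);
            va_end(ap);
        }

        if (strncmp(path, "/gfarm/", 7) == 0 || strcmp(path, "/gfarm") == 0)
            fprintf(stderr, "[hook] open(%s): would call the Gfarm I/O API\n",
                    path);                /* the real hook calls libgfarm */

        return real_open(path, flags, mode);
    }

    A real implementation also hooks read(2), write(2), close(2), stat(2), and so on, which is exactly what the slide above describes for libgfs_hook.so.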



    More Features of the Gfarm Grid File System

    • Commodity PC based scalable architecture

      • Commodity PCs can be added during operation to grow storage capacity well beyond the petabyte scale

        • Even PCs at distant locations over the Internet

    • Adaptive replica creation and consistent management

      • Create multiple file replicas to increase performance and reliability

      • Create file replicas at distant locations for disaster recovery

    • Open Source Software

      • Linux binary packages, ports for *BSD, . . .

        • Included in NAREGI, the Knoppix HTC edition, and the Rocks distribution

      • Existing applications can access it without any modification

    [Logos: SC03 Bandwidth Challenge, SC05 StorCloud, SC06 HPC Storage Challenge Winner]



    SC06 HPC Storage Challenge

    High Performance Data Analysis for Particle Physics using the Gfarm File System

    Osamu Tatebe, Mitsuhisa Sato, Taisuke Boku, Akira Ukawa

    (Univ. of Tsukuba)

    Nobuhiko Katayama, Shohei Nishida, Ichiro Adachi

    (KEK)



    Belle Experiment

    • Elementary Particle Physics Experiment, in KEK, Japan

    • 13 countries, 400 physicists

    • KEKB Accelerator (3km circumference)

      • highest luminosity (performance) machine in the world

      • Collides 8 GeV electrons with 3.5 GeV positrons to produce a large amount of B mesons

    • Study of “CP violation” (the difference between the laws for matter and anti-matter)



    Accumulated Data

    • In the present computing system in Belle:

      • Data are stored under ~40 file servers (storage with 1PB disk + 3.5PB tape)

      • 1100 computing servers (CS) for analysis, simulations….

      • Data are transferred from the file servers to the CS using Belle's home-grown TCP/socket application

      • It takes 1 to 3 weeks for one physicist to read all the 30 TB of data

    The experiment started in 1999

    More than 1 billion B mesons are produced

    Total recorded (raw) data ~ 0.6 PB

    “mdst” data (data after process) ~ 30 TB

    Physicists analyze “mdst” data to obtain physics results

    [Plot: integrated luminosity (fb⁻¹) vs. year for KEKB/Belle, reaching 650 fb⁻¹, compared with PEP-II/BaBar]



    HPC Storage Challenge

    • Scalability of Gfarm File System up to 1000 nodes

      • Scalable capacity federating local disk of 1112 nodes

        • 24 GByte x 1112 nodes = 26 TByte

      • Scalable disk I/O bandwidth up to 1112 nodes

        • 48 MB/sec x 1112 nodes = 52 GB/sec

    • KEKB/Belle data analysis challenge

      • Read 25 TB of reconstructed real data taken by the KEKB B factory within 10 minutes

        • For now, it takes 1 ~ 3 weeks

      • Search for the direct CP asymmetry in b → sγ decays

    Goal: 1/1000 analysis time with Gfarm



    Hardware configuration

    • 1140 compute nodes

      • DELL PowerEdge 1855 blade server

      • 3.6GHz Dual Xeon

      • 1GB Memory

      • 72GB RAID-1

    • 80 login nodes

    • Gigabit Ethernet

      • 24 EdgeIron 48GS

      • 2 BigIron RX-16

      • Bisection 9.8 GB/s

    • Total: 45662 SPECint2000 Rate

    Computing Servers in New B Factory Computer System

    1 enclosure = 10 nodes / 7U space

    1 rack = 50 nodes



    Gfarm file system for the challenge

    • 1112 file system nodes

      • =1112 compute nodes

      • 23.9 GB x 1061 + 29.5 GB x 8 + 32.0 GB x 43 = 26.3 TB

    • 1 metadata server

    • 3 metadata cache servers

      • one per 400 clients

    • All hardware is commodity



    Disk Usage of each local disk

    [Chart: disk usage of each local disk. In total, 24.6 TB of reconstructed data is stored on the local disks of the 1112 compute nodes.]



    Read I/O Bandwidth

    [Plot: aggregate read I/O bandwidth vs. number of nodes, reaching 52.0 GB/sec (47.9 MB/sec per node) at 1112 nodes. We succeeded in reading all the reconstructed data within 10 minutes; the bandwidth improvement is completely scalable and 100% of the peak disk I/O bandwidth is obtained.]



    Breakdown of read benchmark

    [Plot: per-node breakdown of the read benchmark. Nodes storing 32 GB of data finish in roughly 700 sec and nodes storing 24 GB in roughly 500 sec; a few nodes store only a small amount of data.]



    I/O bandwidth breakdown

    [Plot: per-node I/O bandwidth breakdown; 47.9 MB/sec on average]



    Skimming the reconstructed data

    [Plot: skimming the reconstructed data (a real application). The aggregate data rate reaches 24.0 GB/sec (34 MB/sec per node) with 704 clients; scalable data-rate improvement is observed.]



    Breakdown of skimming time

    [Plot: breakdown of skimming time. It takes from 600 to 1100 sec per node, which needs more investigation; a few nodes store only a small amount of data.]



    Conclusion

    • Scalability of Commodity-based Gfarm File System up to 1000 nodes is shown

      • Scalable capacity

        • 24 GByte x 1112 nodes = 26 TByte

      • Scalable disk I/O bandwidth

        • 48 MB/sec x 1112 nodes = 52 GB/sec

      • Novel file system approach enables such scalability!!!

    • KEKB/Belle data analysis challenge

      • Read 24.6 TB of reconstructed real data taken at the KEKB B factory within 10 minutes (52.0 GB/sec)

      • 3,024 times speedup for disk I/O

        • 3 weeks becomes 10 minutes

      • Search for the direct CP asymmetry in b → sγ decays

        • Skim process achieves 24.0 GB/s of data rate using 704 clients

          • No reason to prevent scalability up to 1000 clients

        • It may provide clues about physics beyond the standard model

