
Phoenix Workshop on Global Computing Systems

PARIS Seminar

Dec 8, 2006, IRISA, Rennes

Gfarm Grid File System

Osamu Tatebe

University of Tsukuba

Petascale Data Intensive Computing
  • High Energy Physics
    • CERN LHC, KEK-B Belle
      • ~MB/collision, 100 collisions/sec
      • ~PB/year
      • 2000 physicists, 35 countries

(Photos: detectors for the LHCb and ALICE experiments)

  • Astronomical Data Analysis
    • analysis of the entire observation data set
    • TB~PB/year/telescope
    • Subaru telescope
      • 10 GB/night, 3 TB/year
Petascale Data Intensive Computing – Requirements
  • Storage Capacity
    • Peta/exabyte-scale data, millions of millions (trillions) of files
  • Computing Power
    • > 1TFLOPS, hopefully > 10TFLOPS
  • I/O Bandwidth
    • > 100GB/s, hopefully > 1TB/s within a system and between systems
  • Global Sharing
    • group-oriented authentication and access control
Goal and feature of Grid Datafarm
  • Project
    • Began in summer 2000, collaboration with KEK, Titech, and AIST
  • Goal
    • Dependable data sharing among multiple organizations
    • High-speed data access, High-performance data computing
  • Grid Datafarm
    • Gfarm Grid File System – Global dependable virtual file system
      • Federates local disks in PCs
    • Scalable parallel & distributed data computing
      • Associates Computational Grid with Data Grid
Gfarm Grid File System [CCGrid 2002]
  • Commodity-based distributed file system that federates storage of each site
  • It can be mounted from all cluster nodes and clients
  • It provides scalable I/O performance with respect to the number of parallel processes and users
  • It supports fault tolerance and avoids access concentration by automatic replica selection

(Figure: a global namespace under /gfarm, e.g. /gfarm/ggf, /gfarm/jp, /gfarm/aist, /gfarm/gtrc with files file1–file4, is mapped to physical file locations in the Gfarm file system; file replicas can be created on multiple nodes.)

Gfarm Grid File System (2)
  • Files can be shared among all nodes and clients
  • Physically, it may be replicated and stored on any file system node
  • Applications can access it regardless of its location
  • File system nodes can be distributed

(Figure: client PCs and a notebook PC mount /gfarm; Gfarm file system metadata maps Files A, B, and C to replicas stored on file system nodes distributed across the US and Japan.)

Scalable I/O Performance
  • Decentralized disk access giving priority to the local disk
    • When a new file is created,
      • The local disk is selected when it has enough space
      • Otherwise, a nearby, least-busy node is selected
    • When a file is accessed,
      • The local disk is selected if it holds one of the file replicas
      • Otherwise, a nearby, least-busy node holding one of the replicas is selected
  • File affinity scheduling
    • Schedule a process on a node that holds the specified file
      • Improves the opportunity to access the local disk (see the sketch below)
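The following is a minimal, self-contained sketch of this policy, not the actual Gfarm implementation: the node structure and helper functions are hypothetical and only illustrate the "local disk first, otherwise a nearby least-busy node" rule.

/* Illustrative sketch (hypothetical names, not Gfarm internals) of the
 * "local disk first" node selection policy described above.           */
#include <stdio.h>

struct node {
    const char *name;
    long free_space;     /* bytes available on the local disk          */
    int  load;           /* lower = less busy                          */
    int  has_replica;    /* does this node hold a replica of the file? */
};

/* Pick a node for a newly created file: the local disk if it has room,
 * otherwise the least busy of the remaining nodes.                    */
static struct node *select_for_create(struct node *local, struct node *nodes,
                                      int n, long size)
{
    if (local->free_space >= size)
        return local;                        /* write locally if possible */
    struct node *best = NULL;
    for (int i = 0; i < n; i++)
        if (&nodes[i] != local && (!best || nodes[i].load < best->load))
            best = &nodes[i];
    return best;
}

/* Pick a node for reading: a local replica if present, otherwise the
 * least busy node that holds a replica.                               */
static struct node *select_for_access(struct node *local, struct node *nodes,
                                      int n)
{
    if (local->has_replica)
        return local;                        /* read the local replica   */
    struct node *best = NULL;
    for (int i = 0; i < n; i++)
        if (nodes[i].has_replica && (!best || nodes[i].load < best->load))
            best = &nodes[i];
    return best;
}

int main(void)
{
    struct node nodes[] = {
        { "local", 1L << 20, 3, 0 },   /* 1 MB free, no replica */
        { "fsn1",  1L << 30, 5, 1 },
        { "fsn2",  1L << 30, 1, 1 },
    };
    printf("create -> %s\n", select_for_create(&nodes[0], nodes, 3, 1L << 25)->name);
    printf("access -> %s\n", select_for_access(&nodes[0], nodes, 3)->name);
    return 0;
}

In this toy run the local disk is too small for the new file and holds no replica, so both decisions fall through to the nearby, least-busy node (fsn2).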
Scalable I/O performance in distributed environment

(Figure: in the user's view, Job A accesses File A and Job B accesses File B on a shared network file system spanning the cluster or Grid; in the physical execution view, file-affinity scheduling runs each job on a compute node that holds a replica of its input file.)

  • File system nodes = compute nodes
  • Do not separate storage and CPU (a SAN is not necessary)
  • Move and execute the program instead of moving large-scale data
  • Exploiting local I/O is the key to scalable I/O performance

Example – Gaussian 03 in Gfarm
  • Ab initio quantum chemistry package
    • Install once and run everywhere
    • No modification required to access Gfarm
  • Test415 (I/O-intensive test input)
    • 1h 54min 33sec (NFS)
    • 1h 0min 51sec (Gfarm)
  • Parallel analysis of all 666 test inputs using 47 nodes
    • Write error! (NFS)
      • Due to heavy I/O load
    • 16h 6m 58s (Gfarm)
      • Quite good scalability of I/O performance
      • The longest test input takes 15h 42m 10s (test 574)

(Figure: compute nodes accessing NFS vs. GfarmFS; Gfarm consists of the local disks of the compute nodes.)

Example 2 – Trans-Pacific testbed at SC2003

(Figure: map of the trans-Pacific Gfarm testbed: 236 nodes in total, 70 TBytes of storage capacity, 13 GB/sec of aggregate I/O performance. Participating sites include Titech (147 nodes, 16 TBytes, 4 GB/sec), AIST, Univ. of Tsukuba, KEK, NII, Indiana Univ., and the SC2003 booth in Phoenix, connected over SuperSINET, Tsukuba WAN, Maffin, APAN/TransPAC (Tokyo XP to Los Angeles and Chicago), and Abilene with links ranging from 622 Mbps to 10 Gbps.)

Example 2 – Performance of US-Japan file replication

  • Trans-Pacific file replication: stable 3.79 Gbps
    • 11 node pairs
    • 97% of the theoretical peak performance (MTU 6000 B)
    • 1.5 TB of data transferred in an hour
  • Efficient use around the peak rate in long fat networks
    • IFG-based precise pacing with GNET-1 -> packet-loss-free network
  • Disk I/O performance improvement
    • Parallel disk access using Gfarm
Software Stack

(Figure: software stack. Gfarm applications, UNIX applications, Windows applications (Word, PPT, Excel, etc.), FTP clients, SFTP/SCP, and Web browsers/clients reach the Gfarm Grid File System via Gfarm commands, UNIX commands, NFS clients, and Windows clients; the server-side paths go through nfsd, samba, ftpd, sshd, and httpd, then through the syscall-hook / GfarmFS-FUSE layer (which mounts the file system in user space) and libgfarm, the Gfarm I/O library.)

Gfarm™ File System Software

Available at http://datafarm.apgrid.org/

  • libgfarm – Gfarm client library
    • Gfarm API
  • Metadata server, metadata cache servers
    • Namespace, replica catalog, host information, process information
  • gfsd – I/O server
    • file access

(Figure: an application calls the Gfarm client library, which obtains file information from the metadata server through metadata cache servers and performs file access directly against the gfsd I/O servers running on the compute and file system nodes.)

Demonstration
  • File manipulation
    • cd, ls, cp, mv, cat, . . .
    • grep
  • Gfarm command
    • File replica creation, node & process information
  • Remote (parallel) program execution
    • gfrun prog args . . .
    • gfrun -N #procs prog args . . .
    • gfrun -G filename prog args . . .
I/O sequence example in Gfarm v1

(Figure: on open, the application's Gfarm library queries the metadata server, which returns the file system nodes holding the file (FSN1, FSN2); after node selection, read, write, seek, etc. go directly to the chosen file system node (FSN1); on close, the metadata is updated.)
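As a rough, self-contained sketch of that sequence (assumptions only: metadata_lookup, fsn_read, and metadata_update are hypothetical stand-ins, not libgfarm functions), the client-side flow can be pictured like this:

/* Hypothetical sketch of the Gfarm v1 I/O sequence shown above:
 * open -> ask the metadata server which file system nodes hold the file,
 * select one node, perform I/O against it, update metadata on close.    */
#include <stdio.h>
#include <string.h>

/* stand-in for the metadata server: returns the nodes holding a replica */
static int metadata_lookup(const char *path, char nodes[][8], int max)
{
    (void)path; (void)max;
    strcpy(nodes[0], "FSN1");
    strcpy(nodes[1], "FSN2");
    return 2;
}

/* stand-ins for talking to an I/O server (gfsd) and for the close step */
static void fsn_read(const char *node, const char *path)
{
    printf("read %s from %s\n", path, node);
}
static void metadata_update(const char *path)
{
    printf("metadata updated for %s\n", path);
}

int main(void)
{
    const char *path = "/gfarm/jp/file2";
    char nodes[4][8];

    int n = metadata_lookup(path, nodes, 4); /* open: ask the metadata server */
    if (n == 0)
        return 1;
    const char *chosen = nodes[0];           /* node selection (e.g. local)   */
    fsn_read(chosen, path);                  /* read/write/seek go to the FSN */
    metadata_update(path);                   /* close: update the metadata    */
    return 0;
}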

Opening files in read-write mode (1)
  • Semantics in Gfarm version 1
    • Updated content becomes visible only to processes that open the file after the writing process closes it
    • No file locking (will be introduced in version 2)
      • Exclusive file creation (O_EXCL) is supported (see the example below)
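As an illustration, exclusive creation can be used through the ordinary POSIX interface; this is a minimal sketch assuming /gfarm is mounted (e.g. via GfarmFS-FUSE), and the path below is purely illustrative.

/* Minimal sketch: exclusive file creation with O_EXCL on a mounted Gfarm
 * path.  Only one of several concurrent processes will succeed here.    */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/gfarm/jp/lockfile";        /* illustrative path */

    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open(O_EXCL)");       /* fails if the file already exists */
        return 1;
    }
    /* this process won the exclusive creation; do its work, then clean up */
    write(fd, "owner\n", 6);
    close(fd);
    unlink(path);
    return 0;
}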
Opening files in read-write mode (2)

(Figure: Process 1 opens /gfarm/jp/file2 with fopen("/gfarm/jp/file2", "rw") and Process 2 with fopen("/gfarm/jp/file2", "r"); the metadata server resolves the path under /gfarm and returns the replica locations FSN1 and FSN2, and each process accesses a replica directly. When the writer calls fclose(), the replica invalidated by the update is deleted from the metadata, but the reader's ongoing access to it continues.)

Access from legacy applications
  • Mounting the Gfarm file system
    • GfarmFS-FUSE for Linux and FreeBSD
    • Volunteers are needed for other operating systems
  • libgfs_hook.so – system call hooking library
    • It hooks open(2), read(2), write(2), …
    • When a file under /gfarm is accessed, it calls the appropriate Gfarm I/O API
    • Otherwise, it calls the ordinary system call
    • Re-linking is not necessary; specify LD_PRELOAD (see the example below)
    • Higher portability than developing a kernel module
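A minimal sketch of such a legacy program is shown below; it uses only ordinary POSIX calls, and the /gfarm path is illustrative. Assuming /gfarm is mounted via GfarmFS-FUSE, or the program is started with the hook library preloaded (e.g. LD_PRELOAD=libgfs_hook.so), no source change or re-link is needed.

/* An unmodified POSIX program: with /gfarm mounted, or with the hook
 * library preloaded, these ordinary system calls reach Gfarm I/O.      */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    ssize_t n;

    int fd = open("/gfarm/jp/file1", O_RDONLY);   /* illustrative path */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    while ((n = read(fd, buf, sizeof buf)) > 0)   /* hooked read(2)    */
        write(STDOUT_FILENO, buf, (size_t)n);     /* copy to stdout    */
    close(fd);
    return 0;
}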
More Features of the Gfarm Grid File System
  • Commodity-PC-based scalable architecture
    • Add commodity PCs to increase storage capacity during operation, well beyond the petabyte scale
      • Even PCs at distant locations over the Internet
  • Adaptive replica creation and consistent management
    • Create multiple file replicas to increase performance and reliability
    • Create file replicas at distant locations for disaster recovery
  • Open source software
    • Linux binary packages, ports for *BSD, . . .
      • Included in Naregi, the Knoppix HTC edition, and the Rocks distribution
    • Existing applications can access it without any modification

Awards: SC03 Bandwidth Challenge, SC05 StorCloud, SC06 HPC Storage Challenge Winner

SC06 HPC Storage Challenge

High Performance Data Analysis for Particle Physics using the Gfarm file system

Osamu Tatebe, Mitsuhisa Sato, Taisuke Boku, Akira Ukawa (Univ. of Tsukuba)

Nobuhiko Katayama, Shohei Nishida, Ichiro Adachi (KEK)
Belle Experiment
  • Elementary Particle Physics Experiment, in KEK, Japan
  • 13 countries, 400 physicists
  • KEKB Accelerator (3km circumference)
    • highest luminosity (performance) machine in the world
    • Collides 8 GeV electrons and 3.5 GeV positrons to produce a large number of B mesons
  • Study of “CP violation” (the difference in the laws for matter and anti-matter)
Accumulated Data
  • In the present Belle computing system:
    • Data are stored on ~40 file servers (storage with 1 PB of disk + 3.5 PB of tape)
    • 1100 computing servers (CS) are used for analysis, simulations, etc.
    • Data are transferred from the file servers to the CS using a Belle home-grown TCP/socket application
    • It takes 1–3 weeks for one physicist to read all 30 TB of data

  • The experiment started in 1999
  • More than 1 billion B mesons have been produced
  • Total recorded (raw) data: ~0.6 PB
  • “mdst” data (data after processing): ~30 TB
  • Physicists analyze the “mdst” data to obtain physics results

(Plot: integrated luminosity (fb-1) vs. year for KEKB/Belle, reaching 650 fb-1, and PEP-II/BaBar.)

HPC Storage Challenge
  • Scalability of the Gfarm file system up to 1000 nodes
    • Scalable capacity by federating the local disks of 1112 nodes
      • 24 GByte x 1112 nodes = 26 TByte
    • Scalable disk I/O bandwidth up to 1112 nodes
      • 48 MB/sec x 1112 nodes = 52 GB/sec
  • KEKB/Belle data analysis challenge
    • Read 25 TB of reconstructed real data taken by the KEKB B factory within 10 minutes
      • Currently it takes 1–3 weeks
    • Search for the direct CP asymmetry in b -> s g decays

Goal: 1/1000 of the analysis time with Gfarm

Hardware configuration
  • 1140 compute nodes
    • DELL PowerEdge 1855 blade server
    • 3.6GHz Dual Xeon
    • 1GB Memory
    • 72GB RAID-1
  • 80 login nodes
  • Gigabit Ethernet
    • 24 EdgeIron 48GS
    • 2 BigIron RX-16
    • Bisection 9.8 GB/s
  • Total: 45662 SPECint2000 Rate

(Photo: computing servers in the new B factory computer system; 1 enclosure = 10 nodes in 7U of space, 1 rack = 50 nodes.)

Gfarm file system for the challenge
  • 1112 file system nodes
    • = 1112 compute nodes
    • 23.9 GB x 1061 + 29.5 GB x 8 + 32.0 GB x 43 = 26.3 TB
  • 1 metadata server
  • 3 metadata cache servers
    • one for every 400 clients
  • All hardware is commodity
Disk Usage of each local disk

(Plot: disk usage of each local disk node.)

24.6 TB of reconstructed data is stored on the local disks of the 1112 compute nodes

Read I/O Bandwidth

(Plot: aggregate read bandwidth scaling up to 1112 nodes, reaching 52.0 GB/sec, i.e. 47.9 MB/sec/node.)

We succeeded in reading all the reconstructed data within 10 minutes. A completely scalable bandwidth improvement and 100% of the peak disk I/O bandwidth were obtained.

Breakdown of read benchmark

(Plot: breakdown of the read benchmark per node; 32 GB of data is read in roughly 700 sec, 24 GB in roughly 500 sec, and some nodes store only a small amount of data.)

I/O bandwidth breakdown

(Plot: per-node I/O bandwidth; 47.9 MB/sec on average.)

Skimming the reconstructed data

(Plot: data rate of the skimming application, a real application, scaling up to 704 clients: 24.0 GB/sec in aggregate, 34 MB/sec per node.)

Scalable data rate improvement is observed

Breakdown of skimming time

(Plot: breakdown of the skimming time per node; it takes from 600 to 1100 sec, which needs more investigation. Some nodes store only a small amount of data.)

Conclusion
  • The scalability of the commodity-based Gfarm file system up to 1000 nodes has been shown
    • Scalable capacity
      • 24 GByte x 1112 nodes = 26 TByte
    • Scalable disk I/O bandwidth
      • 48 MB/sec x 1112 nodes = 52 GB/sec
    • The novel file system approach enables such scalability!!!
  • KEKB/Belle data analysis challenge
    • Read 24.6 TB of reconstructed real data taken at the KEKB B factory within 10 minutes (52.0 GB/sec)
    • 3,024 times speedup for disk I/O
      • 3 weeks becomes 10 minutes
    • Search for the direct CP asymmetry in b -> s g decays
      • The skim process achieves a 24.0 GB/s data rate using 704 clients
        • Nothing prevents scaling up to 1000 clients
      • It may provide clues about physics beyond the standard model