Gfarm Grid File System

Phoenix Workshop on Global Computing Systems

PARIS Seminar

Dec 8, 2006, IRISA, Rennes

Gfarm Grid File System

Osamu Tatebe

University of Tsukuba



Petascale Data Intensive Computing

  • High Energy Physics

    • CERN LHC, KEK-B Belle

      • ~MB/collision, 100 collisions/sec

      • ~PB/year

      • 2000 physicists, 35 countries

[Photos: detector for the LHCb experiment; detector for the ALICE experiment]

  • Astronomical Data Analysis

    • data analysis of the entire data set

    • TB~PB/year/telescope

    • Subaru telescope

      • 10 GB/night, 3 TB/year



Petascale Data Intensive Computing: Requirements

  • Storage Capacity

    • Peta/Exabyte scale files, millions of millions of files

  • Computing Power

    • > 1TFLOPS, hopefully > 10TFLOPS

  • I/O Bandwidth

    • > 100GB/s, hopefully > 1TB/s within a system and between systems

  • Global Sharing

    • group-oriented authentication and access control



Goal and Features of Grid Datafarm

  • Project

    • Began in summer 2000, collaboration with KEK, Titech, and AIST

  • Goal

    • Dependable data sharing among multiple organizations

    • High-speed data access, High-performance data computing

  • Grid Datafarm

    • Gfarm Grid File System – Global dependable virtual file system

      • Federates local disks in PCs

    • Scalable parallel & distributed data computing

      • Associates Computational Grid with Data Grid



[Figure: example global namespace. /gfarm contains site directories such as ggf/jp and aist/gtrc, each holding files file1, file2, ...]

Gfarm Grid File System [CCGrid 2002]

  • Commodity-based distributed file system that federates storage of each site

  • It can be mounted from all cluster nodes and clients

  • It provides scalable I/O performance with respect to the number of parallel processes and users

  • It supports fault tolerance and avoids access concentration by automatic replica selection

[Figure: the global namespace is mapped onto file replicas stored across the Gfarm file system; file replicas can be created on multiple nodes]



Gfarm Grid File System (2)

  • Files can be shared among all nodes and clients

  • Physically, a file may be replicated and stored on any file system node

  • Applications can access a file regardless of its physical location

  • File system nodes can be distributed

[Figure: client PCs and notebook PCs mount /gfarm; replicas of File A, File B, and File C are stored on file system nodes distributed across Japan and the US, with their locations recorded in the Gfarm file system metadata]



Scalable I/O Performance

  • Decentralized disk access, giving priority to the local disk (see the sketch after this list)

    • When a new file is created,

      • Local disk is selected when there is enough space

      • Otherwise, a nearby and lightly loaded node is selected

    • When a file is accessed,

      • Local disk is selected if it has one of the file replicas

      • Otherwise, a nearby and lightly loaded node holding one of the file replicas is selected

  • File affinity scheduling

    • Schedule a process on a node having the specified file

      • Improve the opportunity to access local disk
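
The policy above can be made concrete with a small, self-contained sketch. Everything here (the node_t record, its fields, and the scoring by round-trip time and load) is an illustrative assumption used to express the stated rules; it is not the actual Gfarm scheduler code.

/* select_node.c - minimal sketch of the node-selection policy described
 * above. The node_t type and its fields are mock stand-ins, not Gfarm code;
 * they exist only to make the policy executable. */
#include <stdio.h>

typedef struct {
    const char *name;
    double free_gb;      /* free space on the node's local disk */
    double load;         /* lower = less busy */
    int    rtt_ms;       /* "nearness" to the client */
    int    has_replica;  /* does this node hold a replica of the file? */
    int    is_local;     /* is this the node the client runs on? */
} node_t;

/* Near and least busy: smallest (rtt, load) among qualifying candidates. */
static const node_t *nearest_least_busy(const node_t *nodes, int n,
                                        int need_replica, double min_free_gb)
{
    const node_t *best = NULL;
    for (int i = 0; i < n; i++) {
        if (need_replica && !nodes[i].has_replica)
            continue;
        if (nodes[i].free_gb < min_free_gb)
            continue;
        if (!best || nodes[i].rtt_ms < best->rtt_ms ||
            (nodes[i].rtt_ms == best->rtt_ms && nodes[i].load < best->load))
            best = &nodes[i];
    }
    return best;
}

/* File creation: prefer the local disk if it has enough space. */
static const node_t *select_for_create(const node_t *nodes, int n, double gb)
{
    for (int i = 0; i < n; i++)
        if (nodes[i].is_local && nodes[i].free_gb >= gb)
            return &nodes[i];
    return nearest_least_busy(nodes, n, 0, gb);
}

/* File access: prefer a local replica, else a near, lightly loaded one. */
static const node_t *select_for_read(const node_t *nodes, int n)
{
    for (int i = 0; i < n; i++)
        if (nodes[i].is_local && nodes[i].has_replica)
            return &nodes[i];
    return nearest_least_busy(nodes, n, 1, 0.0);
}

int main(void)
{
    node_t nodes[] = {
        { "local",   2.0, 0.9,   0, 1, 1 },  /* local disk, nearly full */
        { "rack-2", 40.0, 0.1,   1, 1, 0 },  /* nearby, idle, has replica */
        { "remote", 80.0, 0.0, 120, 1, 0 }   /* far away */
    };
    int n = sizeof(nodes) / sizeof(nodes[0]);

    const node_t *c = select_for_create(nodes, n, 10.0);
    const node_t *r = select_for_read(nodes, n);
    printf("create 10 GB on: %s\n", c ? c->name : "(none)");  /* rack-2 */
    printf("read from:       %s\n", r ? r->name : "(none)");  /* local  */
    return 0;
}

File-affinity scheduling (gfrun -G, shown in the Demonstration slide later) applies the same idea in the other direction: the process is moved to a node that already holds a replica, so the read case above degenerates to the local-disk branch.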



Scalable I/O performance in distributed environment

[Figure: user's view vs. physical execution view with file-affinity scheduling. User A submits Job A, which accesses File A, and Job A is executed on the cluster/Grid node that holds File A; likewise User B's Job B runs on the node holding File B. In the Gfarm file system the file system nodes are the compute nodes.]

Shared network file system

Do not separate storage and CPU (SAN not necessary)

Move and execute program instead of moving large-scale data

Exploiting local I/O is the key to scalable I/O performance



Example - Gaussian 03 in Gfarm

  • Ab initio quantum chemistry package

    • Install once and run everywhere

    • No modification required to access Gfarm

  • Test 415 (I/O-intensive test input)

    • 1h 54min 33sec (NFS)

    • 1h 0min 51sec (Gfarm)

  • Parallel analysis of all 666 test inputs using 47 nodes

    • Write error! (NFS)

      • Due to heavy I/O load

    • 16h 6m 58s (Gfarm)

      • Quite good scalability of I/O performance

      • Longest test input takes 15h 42m 10s (test 574)

[Figure: NFS vs. GfarmFS. The same set of compute nodes accesses either NFS or GfarmFS; Gfarm consists of the local disks of the compute nodes.]



Example 2 – Trans-Pacific testbed at SC2003

[Figure: network map of the Trans-Pacific testbed. Clusters at Titech (147 nodes, 16 TBytes, 4 GB/sec), Univ. of Tsukuba, KEK, AIST, Indiana Univ., and the SC2003 showfloor in Phoenix, with individual clusters ranging from 7 to 147 nodes, are interconnected via SuperSINET, Maffin, APAN/TransPAC Tokyo XP, and Abilene over links of 622 Mbps to 10 Gbps (measured rates of 500 Mbps to 2.34 Gbps on some paths). Totals: 236 nodes, 70 TBytes of storage capacity, 13 GB/sec of I/O performance.]



Example 2 – Performance of US-Japan file replication

3.79 Gbps!!

  • Efficient use around the peak rate in long fat networks

  • IFG-based precise pacing with GNET-1

    • -> packet loss free network

  • Disk I/O performance improvement

  • Parallel disk access using Gfarm

  • Trans-Pacific file replication: stable 3.79 Gbps

  • 11 node pairs

  • 97% of theoretical peak performance (MTU 6000 bytes)

  • 1.5TB data was transferred in an hour



    Software Stack

    [Figure: software stack. Gfarm applications, UNIX applications, Windows applications (Word, PowerPoint, Excel, etc.), FTP clients, SFTP/SCP, web browsers/clients, Gfarm commands, UNIX commands, NFS clients, and Windows clients reach the Gfarm Grid File System through servers such as nfsd, samba, ftpd, sshd, and httpd, through the syscall-hook library or GfarmFS-FUSE (which mounts the file system in user space), and through libgfarm, the Gfarm I/O library.]
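
    As an illustration of the top of this stack, an unmodified UNIX application can read a Gfarm file with ordinary POSIX calls once /gfarm is visible through the syscall-hook library or a GfarmFS-FUSE mount. The sketch below is generic POSIX code with no Gfarm-specific calls; the path /gfarm/ggf/jp/file1 is taken from the namespace example earlier and is only illustrative.

    /* cat_bytes.c - ordinary POSIX program reading a file under /gfarm.
     * It contains no Gfarm-specific calls; the syscall-hook library (via
     * LD_PRELOAD) or a GfarmFS-FUSE mount routes the I/O to libgfarm. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *fp = fopen("/gfarm/ggf/jp/file1", "r");   /* illustrative path */
        if (fp == NULL) {
            perror("fopen");
            return EXIT_FAILURE;
        }

        char buf[4096];
        size_t n, total = 0;
        while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
            total += n;                                 /* count bytes read */

        fclose(fp);
        printf("read %zu bytes\n", total);
        return EXIT_SUCCESS;
    }

    The same binary runs against a local file system, an NFS mount, or Gfarm, which is what makes the "install once and run everywhere" claim for applications such as Gaussian possible.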



    Gfarm™ File System Software

    Available at http://datafarm.apgrid.org/

    • libgfarm – Gfarm client library

      • Gfarm API

    • Metadata server, metadata cache servers

      • Namespace, replica catalog, host information, process information

    • gfsd – I/O server

      • file access

    [Figure: an application linked with the Gfarm client library obtains file information from the metadata server, possibly through one of several metadata cache servers, and then performs file access directly against the gfsd I/O servers running on the compute/file system nodes.]



    Demonstration

    • File manipulation

      • cd, ls, cp, mv, cat, . . .

      • grep

    • Gfarm command

      • File replica creation, node & process information

    • Remote (parallel) program execution

      • gfrun prog args . . .

      • gfrun -N #procs prog args . . .

      • gfrun -G filename prog args . . .



    I/O sequence example in Gfarm v1

    [Figure: I/O sequence. On open(), the Gfarm library linked into the application queries the metadata server, which returns the file system nodes holding the file (e.g. FSN1, FSN2); the library performs node selection and connects to one of them (FSN1). read, write, seek, etc. then go directly to that file system node. On close(), the library updates the metadata.]
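
    The same sequence can be traced with a small self-contained toy program. All functions below are mock stand-ins that only print the step they represent; none of them is the real libgfarm API, and the path is illustrative.

    /* io_sequence.c - toy walk-through of the Gfarm v1 I/O sequence above.
     * Every function is a mock that prints the protocol step it stands for. */
    #include <stdio.h>

    typedef struct { const char *path; const char *fsn; } gf_file;

    static const char *metadata_lookup(const char *path)
    {
        printf("open : metadata lookup for %s -> FSN1, FSN2\n", path);
        return "FSN1";                        /* node selection picks FSN1 */
    }

    static gf_file gfarm_open(const char *path)
    {
        gf_file f = { path, metadata_lookup(path) };
        printf("open : connect to gfsd on %s\n", f.fsn);
        return f;
    }

    static void gfarm_io(const gf_file *f)
    {
        printf("read/write/seek : directly on %s, no metadata traffic\n",
               f->fsn);
    }

    static void gfarm_close(const gf_file *f)
    {
        printf("close: update metadata for %s (size, replica state)\n",
               f->path);
    }

    int main(void)
    {
        gf_file f = gfarm_open("/gfarm/jp/file2");    /* illustrative path */
        gfarm_io(&f);
        gfarm_close(&f);
        return 0;
    }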



    Opening files in read-write mode (1)

    • Semantics in Gfarm version 1

      • Updated content is available only when the file is opened after a writing process opens the file

      • No file locking (will be introduced in version 2)

        • It supports exclusive file creation (O_EXCL)
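
    Exclusive creation is the one coordination primitive the slide mentions, and it is reachable from plain POSIX code once /gfarm is mounted or the syscall hook is preloaded. The sketch below uses only standard calls; the path is illustrative and not part of the original slides.

    /* excl_create.c - exclusive file creation (O_EXCL) used as a simple lock.
     * Plain POSIX code; with /gfarm accessible, the same calls apply to
     * Gfarm paths. The path is illustrative only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/gfarm/jp/lockfile";   /* illustrative path */

        /* Succeeds only for the first process to create the file;
         * every other process gets EEXIST. */
        int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open(O_EXCL)");
            return 1;
        }
        dprintf(fd, "held by pid %d\n", (int)getpid());
        close(fd);
        return 0;
    }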



    Opening files in read-write mode (2)

    [Figure: Process 1 opens /gfarm/jp/file2 in read-write mode with fopen("/gfarm/jp/file2", "rw") while Process 2 opens the same file read-only; the metadata server returns the replica locations FSN1 and FSN2 and each process accesses a replica. When Process 1 calls fclose(), the replica that was not updated is deleted from the metadata as invalid, but ongoing file access to it continues.]



    Access from legacy applications

    • Mounting Gfarm file system

      • GfarmFS-FUSE for Linux and FreeBSD

      • Need volunteers for other operating systems

    • libgfs_hook.so – system call hooking library

      • It hooks open(2), read(2), write(2), …

      • When a path under /gfarm is accessed, it calls the appropriate Gfarm I/O API

      • Otherwise, it calls the ordinary system call

      • No re-linking is necessary; the library is loaded via LD_PRELOAD

      • Higher portability than developing a kernel module (see the sketch after this list)
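
    The syscall-hook approach can be illustrated with a small LD_PRELOAD interposer. This is a generic sketch of the mechanism, not the actual libgfs_hook.so source: it only logs open(2) calls on paths under /gfarm and forwards them to the real system call, whereas the real hook library diverts them to the Gfarm I/O API.

    /* hook_open.c - toy LD_PRELOAD interposer showing the mechanism used by
     * libgfs_hook.so. It logs open(2) calls on /gfarm paths and forwards
     * them; the real hook redirects such calls to libgfarm instead.
     *
     *   cc -shared -fPIC -o hook_open.so hook_open.c -ldl
     *   LD_PRELOAD=./hook_open.so some_program ...
     */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    int open(const char *path, int flags, ...)
    {
        /* look up the "real" open(2) on first use */
        static int (*real_open)(const char *, int, ...);
        if (real_open == NULL)
            real_open = (int (*)(const char *, int, ...))
                        dlsym(RTLD_NEXT, "open");

        mode_t mode = 0;
        if (flags & O_CREAT) {            /* open(2) takes a mode here */
            va_list ap;
            va_start(ap, flags);
            mode = (mode_t)va_arg(ap, int);
            va_end(ap);
        }

        if (strncmp(path, "/gfarm/", 7) == 0 || strcmp(path, "/gfarm") == 0)
            fprintf(stderr, "[hook] open(%s): would call the Gfarm I/O API\n",
                    path);                /* the real hook calls libgfarm */

        return real_open(path, flags, mode);
    }

    A real implementation also hooks read(2), write(2), close(2), stat(2), and so on, which is exactly what the slide above describes for libgfs_hook.so.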



    More Features of the Gfarm Grid File System

    • Commodity PC based scalable architecture

      • Commodity PCs can be added during operation to grow storage capacity well beyond the petabyte scale

        • Even PCs at distant locations over the Internet

    • Adaptive replica creation and consistent management

      • Create multiple file replicas to increase performance and reliability

      • Create file replicas at distant locations for disaster recovery

    • Open Source Software

      • Linux binary packages, ports for *BSD, . . .

        • Included in NAREGI, the Knoppix HTC edition, and the Rocks distribution

      • Existing applications can access it without any modification

    [Logos: SC03 Bandwidth Challenge, SC05 StorCloud, SC06 HPC Storage Challenge Winner]



    SC06 HPC Storage Challenge

    High Performance Data Analysis for Particle Physics using the Gfarm File System

    Osamu Tatebe, Mitsuhisa Sato, Taisuke Boku, Akira Ukawa

    (Univ. of Tsukuba)

    Nobuhiko Katayama, Shohei Nishida, Ichiro Adachi

    (KEK)



    Belle Experiment

    • Elementary Particle Physics Experiment, in KEK, Japan

    • 13 countries, 400 physicists

    • KEKB Accelerator (3km circumference)

      • highest luminosity (performance) machine in the world

      • Collides 8 GeV electrons with 3.5 GeV positrons to produce a large amount of B mesons

    • Study of “CP violation” (the difference between the laws for matter and anti-matter)



    Accumulated Data

    • In the present computing system in Belle:

      • Data are stored under ~40 file servers (storage with 1PB disk + 3.5PB tape)

      • 1100 computing servers (CS) for analysis, simulations….

      • Data are transferred from the file servers to the CS using Belle's home-grown TCP/socket application

      • It takes 1 to 3 weeks for one physicist to read all the 30 TB of data

    The experiment started in 1999

    More than 1 billion B mesons are produced

    Total recorded (raw) data ~ 0.6 PB

    “mdst” data (data after process) ~ 30 TB

    Physicists analyze “mdst” data to obtain physics results

    [Plot: integrated luminosity (fb⁻¹) vs. year for KEKB/Belle, reaching 650 fb⁻¹, compared with PEP-II/BaBar]



    HPC Storage Challenge

    • Scalability of Gfarm File System up to 1000 nodes

      • Scalable capacity federating local disk of 1112 nodes

        • 24 GByte x 1112 nodes = 26 TByte

      • Scalable disk I/O bandwidth up to 1112 nodes

        • 48 MB/sec x 1112 nodes = 52 GB/sec

    • KEKB/Belle data analysis challenge

      • Read 25 TB of reconstructed real data taken by the KEKB B factory within 10 minutes

        • For now, it takes 1 ~ 3 weeks

      • Search for the direct CP asymmetry in b → sγ decays

    Goal: 1/1000 analysis time with Gfarm



    Hardware configuration

    • 1140 compute nodes

      • DELL PowerEdge 1855 blade server

      • 3.6GHz Dual Xeon

      • 1GB Memory

      • 72GB RAID-1

    • 80 login nodes

    • Gigabit Ethernet

      • 24 EdgeIron 48GS

      • 2 BigIron RX-16

      • Bisection 9.8 GB/s

    • Total: 45662 SPECint2000 Rate

    Computing Servers in New B Factory Computer System

    1 enclosure = 10 nodes / 7U space

    1 rack = 50 nodes



    Gfarm file system for the challenge

    • 1112 file system nodes

      • =1112 compute nodes

      • 23.9 GB x 1061 + 29.5 GB x 8 + 32.0 GB x 43 = 26.3 TB

    • 1 metadata server

    • 3 metadata cache servers

      • one per 400 clients

    • All hardware is commodity



    Disk Usage of each local disk

    [Chart: disk usage of each local disk. In total, 24.6 TB of reconstructed data is stored on the local disks of the 1112 compute nodes.]



    Read I/O Bandwidth

    [Plot: aggregate read I/O bandwidth vs. number of nodes, reaching 52.0 GB/sec (47.9 MB/sec per node) at 1112 nodes. We succeeded in reading all the reconstructed data within 10 minutes; the bandwidth improvement is completely scalable and 100% of the peak disk I/O bandwidth is obtained.]



    Breakdown of read benchmark

    [Plot: per-node breakdown of the read benchmark. Nodes storing 32 GB of data finish in roughly 700 sec and nodes storing 24 GB in roughly 500 sec; a few nodes store only a small amount of data.]



    I/O bandwidth breakdown

    [Plot: per-node I/O bandwidth breakdown; 47.9 MB/sec on average]



    Skimming the reconstructed data

    [Plot: skimming the reconstructed data (a real application). The aggregate data rate reaches 24.0 GB/sec (34 MB/sec per node) with 704 clients; scalable data-rate improvement is observed.]



    Breakdown of skimming time

    [Plot: breakdown of skimming time. It takes from 600 to 1100 sec per node, which needs more investigation; a few nodes store only a small amount of data.]



    Conclusion

    • Scalability of Commodity-based Gfarm File System up to 1000 nodes is shown

      • Scalable capacity

        • 24 GByte x 1112 nodes = 26 TByte

      • Scalable disk I/O bandwidth

        • 48 MB/sec x 1112 nodes = 52 GB/sec

      • Novel file system approach enables such scalability!!!

    • KEKB/Belle data analysis challenge

      • Read 24.6 TB of reconstructed real data taken at the KEKB B factory within 10 minutes (52.0 GB/sec)

      • 3,024 times speedup for disk I/O

        • 3 weeks becomes 10 minutes

      • Search for the direct CP asymmetry in b → sγ decays

        • Skim process achieves 24.0 GB/s of data rate using 704 clients

          • No reason to prevent scalability up to 1000 clients

        • It may provide clues about physics beyond the standard model

