High Performance Data Analysis for Particle Physics using the Gfarm file system

Shohei Nishida, Nobuhiko Katayama, Ichiro Adachi (KEK)

Osamu Tatebe, Mitsuhisa Sato, Taisuke Boku, Akira Ukawa (Univ. of Tsukuba)

1. Belle Experiment

  • B Factory experiment at KEK (Japan)

  • 14 countries, 400 physicists

  • KEKB accelerator (3 km circumference)

    • World’s highest luminosity

    • Collides 8 GeV electrons with 3.5 GeV positrons

  • The experiment started in 1999

  • Total accumulated luminosity ~650 fb-1

    • (more than 1 billion B mesons have been produced!)

  • Total recorded (raw) data ~0.6 PB

  • “mdst” data (data after processing) ~30 TB

    • plus an additional ~100 TB of MC data

  • Users read the “mdst” data for analysis

[Figure: integrated luminosity (fb-1) vs. year for KEKB/Belle (Japan) and PEP-II/BaBar (US)]

2. Gfarm

  • Commodity-based distributed file system that federates the local disks of compute nodes

  • It can be shared among all cluster nodes and clients

    • Just mount it as if it were a high-performance NFS (see the sketch below)

  • It provides I/O performance that scales with the number of parallel processes and users

  • It supports fault tolerance and avoids access concentration by automatic replica selection

    • Files can be shared among all nodes and clients

    • Physically, a file may be replicated and stored on any file system node

    • Applications can access a file regardless of its location

    • File system nodes can be distributed

[Figure: client PCs mount /gfarm; files A, B, and C are replicated across the file system nodes, with a metadata server recording their locations]
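Because a mounted Gfarm volume behaves like any other POSIX file system, existing analysis programs need no special I/O layer. A minimal sketch, assuming the file system is mounted at /gfarm (the mount point and data path below are hypothetical):

```python
import os

# Hypothetical mount point and data path; adjust to the local installation.
MOUNT = "/gfarm"
mdst_path = os.path.join(MOUNT, "belle/mdst/exp45/run0001.mdst")

# Ordinary POSIX I/O works once the Gfarm volume is mounted; the file system
# fetches the bytes from whichever node holds a replica of the file.
with open(mdst_path, "rb") as f:
    header = f.read(4096)  # read the first 4 kB of the event data
    print(f"read {len(header)} bytes from {mdst_path}")
```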

    Scalable I/O Performance

  • In the present computing system in Belle:

    • Data are stored on ~40 file servers (FS): 1 PB of disk + 3.5 PB of tape

    • 1100 computing servers (CS) are used for analysis, simulation, etc.

    • Data are transferred from the FS to the CS with Belle’s home-grown TCP/socket application

    • It takes 1 to 3 weeks for one physicist to read all 30 TB of “mdst” data

[Figure: user’s view vs. physical execution view in Gfarm (file-affinity scheduling)]

  • Do not separate storage and CPU (a SAN is not necessary)

  • Move and execute the program instead of moving large-scale data

  • Exploiting local I/O is a key to scalable I/O performance

  • File-affinity scheduling: a job submitted by user A that accesses File A is executed on a node that holds File A, and a job submitted by user B that accesses File B is executed on a node that holds File B (see the sketch below)
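The file-affinity idea can be illustrated with a small scheduling sketch. This is a toy model of the concept, not Gfarm's actual scheduler; the replica catalog and job list are hypothetical.

```python
# Toy replica catalog: file -> nodes holding a copy (hypothetical data).
replicas = {
    "FileA": ["node017", "node342"],
    "FileB": ["node099", "node733"],
}

# Jobs submitted by users, each reading one input file (hypothetical data).
jobs = [("userA", "JobA", "FileA"), ("userB", "JobB", "FileB")]

def schedule(jobs, replicas):
    """Assign each job to a node that already stores its input file."""
    load = {}   # node -> number of jobs placed there so far
    plan = {}
    for user, job, infile in jobs:
        # Among the nodes holding a replica, pick the least-loaded one.
        node = min(replicas[infile], key=lambda n: load.get(n, 0))
        load[node] = load.get(node, 0) + 1
        plan[job] = node
    return plan

print(schedule(jobs, replicas))
# e.g. {'JobA': 'node017', 'JobB': 'node099'} -- each job reads from local disk
```

Because every job then reads mostly from the local disk of the node it runs on, the aggregate read bandwidth grows with the number of nodes.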

    Computing Servers in B Factory Computer System

  • 1140 compute nodes

    • DELL PowerEdge 1855 blade servers

    • 3.6 GHz dual Xeon

    • 1 GB memory, 72 GB RAID-1

    • 1 enclosure = 10 nodes in 7U of rack space; 1 rack = 50 nodes

  • 80 login nodes

  • Gigabit Ethernet

    • 24 EdgeIron 48GS switches

    • 2 BigIron RX-16 switches

    • Bisection bandwidth 9.8 GB/s

  • Total: 45662 SPECint2000 Rate

Gfarm file system

  • File system nodes = compute nodes, together forming a shared network file system

  • libgfarm – the Gfarm client library

    • Provides the Gfarm API used by applications

  • Metadata server and metadata cache servers

    • Hold the namespace, replica catalog, host information and process information

  • gfsd – the I/O server

    • Handles file access on each file system node (a conceptual read path is sketched below)

[Figure: an application calls the Gfarm client library, which obtains file information from the metadata server (via a metadata cache server) and performs file access against the gfsd daemons running on the compute nodes]
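The division of labour among these components can be summarized as a conceptual read path. The helper functions below are illustrative stand-ins, not the real libgfarm API.

```python
import socket

# Conceptual sketch only: these helpers stand in for libgfarm internals
# and are NOT the real Gfarm API.

def lookup_replicas(metadata_server, path):
    """Ask the metadata (cache) server which file system nodes hold `path`."""
    # In Gfarm this information lives in the replica catalog of the metadata server.
    return ["node017", "node342"]          # hypothetical answer

def choose_replica(nodes):
    """Automatic replica selection: prefer the local node, otherwise pick any."""
    local = socket.gethostname()
    return local if local in nodes else nodes[0]

def read_from_gfarm(path, metadata_server="metadata.example"):
    nodes = lookup_replicas(metadata_server, path)
    node = choose_replica(nodes)
    # The actual byte transfer would be served by the gfsd daemon on `node`;
    # here we only report where the read would be directed.
    return f"read {path} via gfsd on {node}"

print(read_from_gfarm("/gfarm/belle/mdst/exp45/run0001.mdst"))
```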

    3. Challenge

Use the Gfarm file system for Belle analysis (http://datafarm.apgrid.org/), with the compute nodes serving as the file system nodes.

  • Scalability of the Gfarm file system up to 1000 nodes

    • Scalable capacity, federating the local disks of 1112 nodes

      • 24 GB × 1112 nodes ≈ 26 TB

    • Scalable disk I/O bandwidth up to 1112 nodes

      • 48 MB/s × 1112 nodes ≈ 52 GB/s (see the arithmetic sketch below)

  • Speed-up of KEKB/Belle data analysis

    • Read the 25 TB of “mdst” (reconstructed) real data taken at the KEKB B factory within 10 minutes

      • At present this takes 1 to 3 weeks

    • Search for direct CP asymmetry in b → sγ decays

      • It may provide clues about physics beyond the Standard Model
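A quick arithmetic check of the targets above, using the per-node figures quoted in the challenge (decimal units):

```python
# Figures quoted above: per-node disk capacity and per-node read bandwidth, 1112 nodes.
nodes = 1112
disk_per_node_gb = 24           # GB of local disk contributed per node
bw_per_node_mb_s = 48           # MB/s of local disk read bandwidth per node

capacity_tb = nodes * disk_per_node_gb / 1000
aggregate_gb_s = nodes * bw_per_node_mb_s / 1000
print(f"capacity  ~ {capacity_tb:.0f} TB")       # ~27 TB (quoted as 26 TB)
print(f"bandwidth ~ {aggregate_gb_s:.0f} GB/s")  # ~53 GB/s (quoted as 52 GB/s)

# Time to read the 25 TB "mdst" sample at that aggregate rate,
# versus the present 1 to 3 weeks through the file servers.
mdst_tb = 25
minutes = mdst_tb * 1000 / aggregate_gb_s / 60
print(f"25 TB at ~{aggregate_gb_s:.0f} GB/s takes ~{minutes:.0f} minutes")  # ~8 minutes
```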

  • 1112 file system nodes (= 1112 compute nodes)

    • 23.9 GB × 1061 + 29.5 GB × 8 + 32.0 GB × 43 = 26.3 TB of local disk

  • 1 metadata server

  • 3 metadata cache servers

    • one for every ~400 clients

  • All hardware is commodity; all software is open source

  • 24.6 TB of reconstructed data is stored on the local disks of the 1112 compute nodes

Goal: 1/1000 of the current analysis time with Gfarm

    4. Measurement

Read I/O benchmark

  • Aggregate read bandwidth of 52.0 GB/s on 1112 nodes (47.9 MB/s per node)

  • Completely scalable bandwidth improvement; 100% of the peak disk I/O bandwidth is obtained

Running the Belle analysis software (real application)

  • Read the “mdst” data and skim (select) events useful for the b → sγ analysis; the output is an index file (lists of event numbers), as sketched at the end of this section

  • Aggregate data rate of 24.0 GB/s on 704 nodes (34 MB/s per node); a scalable data-rate improvement is observed

  • Some instability comes from shared use of the resources, and some nodes store only a small amount of data

  • We succeeded in reading all the reconstructed data within 10 minutes

Breakdown of the skimming time

  • The skimming takes from 600 to 1100 s; this spread needs more investigation

Our team is the winner of the Storage Challenge at the SC06 conference.
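As a rough illustration of the skimming step measured above (not the actual Belle analysis software), the sketch below reads events, applies a placeholder b → sγ selection, and writes only the surviving event numbers to an index file; the record fields and the cut are hypothetical.

```python
def passes_bsgamma_selection(event):
    """Placeholder for the real b -> s gamma selection cuts."""
    return event.get("egamma_gev", 0.0) > 1.8   # hypothetical photon-energy cut

def skim_to_index(events, index_path):
    """Write the (exp, run, event) numbers of selected events to an index file."""
    kept = 0
    with open(index_path, "w") as out:
        for event in events:                    # events read from an "mdst" file
            if passes_bsgamma_selection(event):
                out.write(f"{event['exp']} {event['run']} {event['evt']}\n")
                kept += 1
    return kept

# Tiny made-up sample standing in for an "mdst" event stream.
sample = [
    {"exp": 45, "run": 1, "evt": 10, "egamma_gev": 2.3},
    {"exp": 45, "run": 1, "evt": 11, "egamma_gev": 0.4},
]
print(skim_to_index(sample, "bsgamma.index"), "events kept")
```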

    5. Conclusion

  • Scalability of the commodity-based Gfarm file system up to 1000 nodes is shown

    • Capacity 26 TB, bandwidth 52.0 GB/s

    • The novel file system approach enables this scalability

  • Read 24.6 TB of “mdst” data within 10 minutes

    • 52.0 GB/s in the read benchmark, 24.0 GB/s with the real analysis application

    • 3,000 times speedup for disk I/O (from 1 to 3 weeks down to about 10 minutes)

