
High Performance GridFTP Transport of Earth System Grid (ESG) Data


Presentation Transcript


  1. High Performance GridFTP Transport of Earth System Grid (ESG) Data Center for Enabling Distributed Petascale Science

  2. Description • Transfer 10 TB of climate data to the SC09 show floor from three sites: the Argonne Leadership Computing Facility (ALCF), the National Energy Research Scientific Computing Center (NERSC), and Lawrence Livermore National Laboratory (LLNL). • As the data arrives at its destination in the University of Utah’s SC09 booth, it is stored on disks provided by DataDirect Networks. • The data is then processed using climate data analysis and visualization tools and publicly displayed, along with graphs depicting the characteristics of the transfer.

  3. End-to-End Flow

  4. Scientific Purpose • Climate data is moved in this challenge. • Climate science is highly collaborative, and its datasets are distributed across the globe. • An interesting feature of climate data is that individual files are not very large compared with those of other sciences; climate researchers, however, need to move hundreds or thousands of files in a single transfer, so the volume of data moved across the network is massive. • Multiple terabytes of data from the World Climate Research Programme’s Coupled Model Intercomparison Project, Phase 3 (CMIP3) are moved. • This data was used in the Intergovernmental Panel on Climate Change (IPCC) Fourth Assessment Report (AR4) and is used in anticipation of the approaching IPCC Fifth Assessment Report (AR5).

  5. How Computing and Networking Map into Climate Modeling Efforts • Each climate modeling task maps onto strategic computing and networking objectives. [Diagram omitted from transcript.]

  6. Network Challenges in ESG • Independent gateways federating metadata and users • Individual data nodes responsible for publishing services • Designed for model output data sets

  7. Technical Approach and Methods • Transfers initiated by the climate community can be between a client and a server, or between two remote servers initiated by the user from a third machine. • GridFTP and other data movement tools developed by the Center for Enabling Distributed Petascale Science (CEDPS) are ideal for these types of transfers. • GridFTP is optimized for high-bandwidth, wide-area networks. • The Globus implementation of GridFTP provides a software suite optimized for a broad range of data access applications, including bulk file transfer and data extraction from complex storage systems.

  8. GridFTP Advantages • Performance: orders-of-magnitude improvement over standard FTP • Uses parallel TCP streams and non-TCP protocols such as UDT • Coordinated transfer using multiple computers at source and destination • Secure: GridFTP supports the PKI/X.509-based Grid Security Infrastructure (GSI), with simple options to encrypt and integrity-check data • GridFTP also supports SSH security • Robust: restart markers allow interrupted transfers to restart with minimal delay overhead • Extensible: clear abstractions to interface with various transport protocols and with different storage systems • Completely shields the user from the complexities of underlying storage systems, including tape archives such as HPSS

  9. Key GridFTP Features Used in the Challenge • Concurrency and pipelining • Allows the client to simultaneously maintain multiple outstanding, unacknowledged transfer commands • Greatly improves performance of transfers involving lots of small files. [Diagram: in a traditional transfer, each file request waits for the previous file’s data and acknowledgment before being sent; with pipelining, file requests 1–3 are issued back to back, with data and acknowledgments flowing while later requests are already in flight.]
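The benefit of pipelining for lots-of-small-files transfers can be illustrated with a toy round-trip model. This is only an illustration of the idea, not GridFTP's actual protocol machinery: the two-round-trips-per-file cost and the fixed request window are simplifying assumptions.

```python
import math

def traditional_rtts(n_files: int) -> int:
    # Traditional mode: one file request, then wait for its data and
    # acknowledgment before issuing the next request (fully serialized).
    return 2 * n_files

def pipelined_rtts(n_files: int, window: int) -> int:
    # Pipelined mode: up to `window` outstanding, unacknowledged requests
    # at once, so the per-file round trips overlap within each window.
    return 2 * math.ceil(n_files / window)

if __name__ == "__main__":
    print(traditional_rtts(1000))    # 2000 round trips
    print(pipelined_rtts(1000, 10))  # 200 round trips
```

With 1000 small files and a window of 10 outstanding requests, the toy model's round-trip cost drops by a factor of 10, which is the effect the diagram above conveys.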

  10. GridFTP Clients and NetLogger • Three different GridFTP clients are used to move the 10 TB data set for the challenge • Globus.org – hosted data movement service • BDM – Bulk Data Mover • globus-url-copy • NetLogger – used to monitor transfers and troubleshoot problems • Distributed performance analysis and troubleshooting • Standard log format and best practices • Log collection tools • Log parser • Data analysis tools

  11. What is the Globus.org Data Movement Service (a.k.a. DataKoa)? • A new Globus data movement service • The same vision, but an updated implementation • Hosted • Domain-independent, multi-use • Enables scientists to focus on domain-specific work • Manages technology failures • Sends notifications of interesting events • Enables non-experts to easily and efficiently move data • No operations overhead • Minimal user-side software installation • User interfaces require no special expertise • Built-in data transport configuration expertise

  12. Globus.org Data Movement Service • The client (e.g., a laptop) connects to Globus.org and submits requests; it can then disappear from the network. • Globus.org orchestrates the transfer between GridFTP servers A and B.

  13. What is BDM? • BDM: Bulk Data Mover • Scalable data movement management tool • Invokes GridFTP for file transfers • Designed for the needs of the climate community (Earth System Grid) • Efficient and reliable transfer management from the user’s point of view • Simple for a novice user to install and maintain • Scalable to large data volumes • Scalable to large numbers of files • Efficient handling of extreme variance in file sizes • Scalable to future performance expectations • Network performance improvements – 100 Gbps and beyond • Storage performance improvements – distributed, parallel, SSD, etc. • Multiple transfer protocol support • Able to work with other applications with similar needs • Information • http://sdm.lbl.gov/bdm • Contact: Dean Williams williams13@llnl.gov
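BDM's own queue management is more sophisticated, but the "extreme variance in file sizes" point can be sketched with a simple greedy assignment: sort files largest first and always give the next file to the least-loaded of a fixed number of concurrent streams, so no single stream ends up holding all the large files. The function name and the size mix below are illustrative assumptions, not BDM's API.

```python
import heapq

def assign_files(file_sizes, n_streams):
    """Greedy longest-first assignment of files to concurrent streams.

    Returns one list of file sizes per stream; the per-stream byte totals
    stay roughly balanced even when file sizes vary by orders of magnitude.
    """
    # Heap of (bytes assigned so far, stream id, assigned files);
    # the unique stream id breaks ties so lists are never compared.
    streams = [(0, i, []) for i in range(n_streams)]
    heapq.heapify(streams)
    for size in sorted(file_sizes, reverse=True):
        load, i, files = heapq.heappop(streams)   # least-loaded stream
        files.append(size)
        heapq.heappush(streams, (load + size, i, files))
    return [files for _, _, files in sorted(streams, key=lambda s: s[1])]

# A few huge files mixed with many tiny ones (sizes in arbitrary units).
sizes = [10_000, 9_000, 50, 40, 30, 20, 10, 8_000]
plan = assign_files(sizes, 3)
print([sum(s) for s in plan])  # per-stream totals: [10000, 9000, 8150]
```

The small files all land on the stream that took the smallest large file, keeping the slowest stream's total close to the unavoidable minimum set by the single largest file.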

  14. Argonne National Laboratory globus-url-copy • Commonly used, scriptable command-line GridFTP client • Supports various transfer optimizations, including parallel TCP streams and concurrent file transfers • New features • Fault tolerance: transfer state is stored in a file, so a restarted globus-url-copy moves only the remaining data • Ability to associate multiple physical endpoints with a single logical endpoint and load-balance across all the physical endpoints
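As a hedged sketch, an invocation using these optimizations might be assembled as below. The endpoints and restart-state filename are placeholders, and while -vb, -p, -cc, and -dumpfile are options we believe globus-url-copy provides, check `globus-url-copy -help` on your installation before relying on them.

```python
import shlex

def build_guc_command(src, dst, parallel=4, concurrency=8, dumpfile=None):
    """Construct (but do not execute) a globus-url-copy invocation.

    -vb  : print performance information during the transfer
    -p   : number of parallel TCP streams per file
    -cc  : number of concurrent file transfers
    -dumpfile : file recording not-yet-transferred URLs, so a restarted
                run moves only the remaining data (assumed option name)
    """
    cmd = ["globus-url-copy", "-vb", "-p", str(parallel), "-cc", str(concurrency)]
    if dumpfile:
        cmd += ["-dumpfile", dumpfile]
    cmd += [src, dst]
    return cmd

cmd = build_guc_command(
    "gsiftp://dtn.example.org/esg/cmip3/",  # placeholder source endpoint
    "file:///scratch/cmip3/",               # placeholder destination
    dumpfile="cmip3.restart",
)
print(shlex.join(cmd))
```

Building the argument list this way keeps the tuning knobs (streams per file vs. concurrent files) explicit; it could be handed to `subprocess.run(cmd)` on a host where the Globus tools are installed.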

  15. NetLogger BWC Deployment • [Diagram: GridFTP servers at ALCF, LLNL, and NERSC send logs to a NetLogger database at LBNL; plots are published on the web for viewing on the SC09 show floor.]

  16. Data Direct Networks Silicon Storage Architecture (S2A)

  17. ESnet Science Data Network • A good network is as important as having the right tools and applications. • We needed a network that could move these datasets at high speed to the convention center. • ESnet was the perfect fit for pulling data from the national labs. • Science Data Network (SDN) and On-Demand Secure Circuits and Advance Reservation System (OSCARS) • OSCARS guarantees a dedicated circuit on the network for the duration of the challenge, so we don’t have to compete with anyone else for bandwidth.

  18. Data Analysis and Visualization • The data were analyzed using the Climate Data Analysis Tools (CDAT) developed by the Program for Climate Model Diagnosis and Intercomparison (PCMDI) • CDAT is a suite of interrelated diagnostic software tools • Flexible, portable, adaptable, efficient, easy to use, shareable, and free • Capable of operating in a distributed environment • 3D interface provided by the ViSUS plugin developed at the SCI Institute at the University of Utah and at LLNL • Streaming and progressive data flow • Integrated analysis and illustration tools

  19. Data Analysis and Visualization Full Video is available at http://www.sci.utah.edu/~pascucci/tmp/climate_video/

  20. Overarching Research Agenda • The climate community expects to generate petabytes of simulated data for analysis and future climate predictions. • In the next few years, climate researchers will be moving terabytes of data to collaborators across the globe for the IPCC Fifth Assessment Report (AR5), to be published in 2013. • Moving large amounts of data seamlessly, reliably, and quickly is required to make sense of the enormous AR5 climate data set • and to help scientists understand climatic imbalances and the potential impacts of future climate change scenarios.

  21. Overarching Research Agenda • This demonstration highlights the tools and services that will help climate researchers transport their data quickly and reliably. • We hope that the lessons learned in this experiment will help us do this better • and improve the transport and monitoring tools further, helping not only climate researchers but also other researchers get their science done faster than before.
