
File and Object Replication in Data Grids

File and Object Replication in Data Grids. Chin-Yi Tsai. Outline: Introduction; Background and Related Work; Globus Data Grid Tools; File Replication Tool: GDMP; Object Replication; Experimental Results with GridFTP; Conclusion.


Presentation Transcript


  1. File and Object Replication in Data Grids Chin-Yi Tsai

  2. Outline • Introduction • Background and Related Work • Globus Data Grid Tools • File Replication Tool : GDMP • Object Replication • Experimental Results with GridFTP • Conclusion

  3. Introduction • Data replication is a key issue in Data Grids • Files and objects • Distributed analysis of experimental data • High Energy Physics (HEP) community • CERN • ATLAS • CMS • The CMS experiment is a high energy physics experiment located at CERN that will start data taking in the year 2006 • Computing, storage, network • There is a natural mapping to a Grid environment • The GDMP architecture uses Globus Data Grid tools as middleware • [Figure: a file containing several objects]

  4. File Replication and Object Replication • [Figure: two diagrams, each showing Grid site 1 (source) and Grid site 2 (destination) with an application; labels include "object copier tool" and "Local storage"]

  5. Major focus of the European DataGrid Project on High Energy Physics • Object data stores are used for next-generation experiments • objects are important for data handling • Grid software mainly deals with file replication issues • a single file (about 1 or 2 GB in size; in total a few PB) • contains many objects • most objects are read-only

  6. Related Data Grid Projects • Earth Science Grid (ESG) • management of climate data • Particle Physics Data Grid (PPDG) • HEP applications • Grid Physics Network (GriPhyN) • Realizing the concept of Virtual Data

  7. Globus Data Grid Tools • The Globus Toolkit is an open source software toolkit used for building grids • middleware • Four main components of Globus • The Grid Security Infrastructure (GSI) • The Globus Resource Management • The Globus Information Management architecture • Data Management architecture, or Data Grid • GridFTP, Replica Management

  8. GridFTP: uniform client interface • [Figure labels: HPSS, SRB, DPSS, GridFTP, DFS]

  9. Features of GridFTP • GSI and Kerberos support • GSS API • Third-party control of data transfer • adds GSS API • Parallel data transfer • multiple TCP streams, single host • Striped data transfer • multiple TCP streams, multiple hosts/servers • Partial file transfer • Automatic negotiation of TCP buffer/window sizes • Support for reliable and restartable data transfer • [Figure: users at Site A and Site B, each behind a local security infrastructure, connected through GSI]
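
Parallel and partial transfers both depend on addressing a file by byte range. The following is a minimal sketch, in Python rather than the Globus C libraries, of how a file could be divided into one contiguous range per TCP stream. `split_ranges` is a hypothetical helper introduced for illustration; the slide does not describe how GridFTP actually assigns blocks to streams.

```python
from typing import List, Tuple


def split_ranges(file_size: int, streams: int) -> List[Tuple[int, int]]:
    """Split [0, file_size) into `streams` contiguous (offset, length) ranges.

    Hypothetical helper: the ranges differ in length by at most one byte.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    if streams < 1:
        raise ValueError("streams must be at least 1")

    base, extra = divmod(file_size, streams)
    ranges = []
    offset = 0
    for i in range(streams):
        # The first `extra` streams carry one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges
```

Each `(offset, length)` pair could then be requested as a partial file transfer on its own stream.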

  10. The GridFTP Protocol Implementation • The two main libraries • globus_ftp_control_library • globus_ftp_client_library

  11. Replica Catalog • Mapping between logical name for files or collections and one or more copies of the objects on physical storage systems • Three types of entries • logical collections • Location (physical) • logical files

  12. One Application Model • [Figure: a replica catalog containing the logical collections "Weather measurement 2003" and "Weather measurement 2002"; the 2003 collection lists filenames Jan 2003, Feb 2003, … Dec 2003 and has location entries cwb.gov.tw, ntu.edu.tw, and fcu.edu.tw; the cwb.gov.tw location entry shows Protocol: GridFTP, Hostname: cwb.gov.tw, Path: nfs/weather/; each location lists the filenames it holds; logical file entries such as "Jan 2003" are shown]

  13. An Example Replication Scenario • File sizes: File1: 100 MB, File2: 200 MB, File3: 300 MB, File4: 400 MB, File5: 500 MB • Collection (listCollectionNamesFile): File1, File2, File3, File4, File5 • Site A (listANamesFile): File1, File2, File3, File4 • Site B (listBNamesFile): File2, File3, File5 • namesToSearchFile: File4, File5 • Location entry corresponding to site A: uc : gridftp://Ahost.isi.edu:2222/nfs/path/on/A • Location entry corresponding to site B: uc : gridftp://Bhost.mcs.anl.gov:7777/nfs/path/on/B
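
The scenario above can be modelled in a few lines to show how a logical file name maps to its physical copies. This is a sketch in Python, not the Globus replica catalog API. The class and function names are hypothetical, 1 MB is taken as 1024 × 1024 bytes, and a physical URL is assumed to be the location's `uc` value joined to the file name with a slash.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

MB = 1024 * 1024  # assumption: sizes on the slide are in binary megabytes


@dataclass
class Location:
    uc: str                      # URL constructor for this site
    filenames: Set[str] = field(default_factory=set)


@dataclass
class Collection:
    filenames: Set[str]
    sizes: Dict[str, int] = field(default_factory=dict)
    locations: Dict[str, Location] = field(default_factory=dict)


def physical_urls(collection: Collection, name: str) -> List[str]:
    """Return the URLs of every physical copy of logical file `name`."""
    if name not in collection.filenames:
        raise KeyError(f"{name} is not registered in the collection")
    return sorted(
        f"{loc.uc}/{name}"
        for loc in collection.locations.values()
        if name in loc.filenames
    )


# The scenario from the slide.
catalog = Collection(
    filenames={"File1", "File2", "File3", "File4", "File5"},
    sizes={f"File{i}": i * 100 * MB for i in range(1, 6)},
)
catalog.locations["locationA"] = Location(
    "gridftp://Ahost.isi.edu:2222/nfs/path/on/A",
    {"File1", "File2", "File3", "File4"},
)
catalog.locations["locationB"] = Location(
    "gridftp://Bhost.mcs.anl.gov:7777/nfs/path/on/B",
    {"File2", "File3", "File5"},
)
```

With this data, `physical_urls(catalog, "File2")` lists a copy at both sites, while File4 resolves only to site A and File5 only to site B.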

  14. Implementing This Scenario with the Command Line Tool
      Registering the collection:
        globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -collection -create listCollectionNamesFile
      Registering location A:
        globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -location locationA -create gridftp://Ahost.isi.edu:2222/nfs/path/on/A listANamesFile
      Registering location B:
        globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -location locationB -create gridftp://Bhost.mcs.anl.gov:7777/nfs/path/on/B listBNamesFile

  15. Registering the logical files File1, File2, File3, File4, File5 (shown for File1; 104857600 bytes = 100 MB):
        globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -logicalfile File1 -create 104857600
      Searching for the uc (URL constructor) attribute of all locations that contain File4 and File5:
        globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -collection -find-locations namesToSearchFile -attributes uc
      Listing the value of the size attribute of File2:
        globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -logicalfile File2 -list-attributes size

  16. GDMP Architecture • The GDMP client-server software system is a generic file replication tool • [Figure components: Request Manager, Security Layer, Replica Catalog Service, Data Mover Service, Storage Manager Service]

  17. Replica Catalog Service • Maintains a global file name space of replicas • New file • logical file name • meta-information • physical location • Client sites query the Replica Catalog Service • Implementation • LDAP and Globus library (replica catalog) • High-level API • [Figure layers: Application, API, Replica Catalog Service, Globus Replica Catalog]

  18. Data Mover Service • Layered design • high-level API and low-level service • Data transfer • security, performance, robustness • GridFTP is used as GDMP’s underlying file transfer mechanism • Handles network failures and performs additional checks for corruption
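
The last bullet above, handling network failures and checking for corruption, can be illustrated with a retry loop around a transfer. This is a sketch under stated assumptions, not GDMP's implementation: `transfer_with_check` and its `fetch` callback are hypothetical, and CRC-32 stands in for whatever corruption check the Data Mover Service actually performs.

```python
import zlib
from typing import Callable, Optional


def transfer_with_check(
    fetch: Callable[[], bytes],
    expected_crc: int,
    max_attempts: int = 3,
) -> bytes:
    """Call `fetch` until it returns data whose CRC-32 matches `expected_crc`.

    `fetch` is a hypothetical stand-in for one file transfer attempt.
    Network errors (OSError and subclasses) and checksum mismatches both
    trigger a retry, up to `max_attempts` attempts in total.
    """
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            data = fetch()
        except OSError as exc:  # e.g. ConnectionError, TimeoutError
            last_error = exc
            continue
        if zlib.crc32(data) == expected_crc:
            return data
        last_error = ValueError("checksum mismatch after transfer")
    raise RuntimeError(f"transfer failed after {max_attempts} attempts") from last_error
```

A restartable protocol would resume from the last good offset instead of refetching the whole file; the sketch retries from the start for brevity.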

  19. Storage Management Service • Uses external tools for staging (different for each MSS) • Assumes that each site has a local disk pool, which serves as a data transfer cache • GDMP triggers file staging to the disk pool • If a file is not on the disk pool but is requested by a remote site, GDMP initiates a disk-to-disk file transfer • GDMP has a plug-in for the Hierarchical Resource Manager (HRM) APIs, which provide a common interface for accessing different Mass Storage Systems • The implementation is based on CORBA • [Figure: GDMP and a disk pool at each of two sites]
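
The disk pool acts as a cache in front of the mass storage system. A minimal sketch of that behaviour follows, assuming a file must be staged to the local disk pool before it can be sent disk-to-disk; `DiskPool`, `serve_remote_request`, and both callbacks are hypothetical names, not GDMP or HRM interfaces.

```python
from typing import Callable, Set


class DiskPool:
    """Hypothetical model of a site's local disk pool (data transfer cache)."""

    def __init__(self, stage_from_mss: Callable[[str], None]) -> None:
        self._on_disk: Set[str] = set()
        # Site-specific staging tool, different for each mass storage system.
        self._stage_from_mss = stage_from_mss

    def ensure_on_disk(self, filename: str) -> bool:
        """Stage `filename` if it is not cached; return True if staging ran."""
        if filename in self._on_disk:
            return False
        self._stage_from_mss(filename)
        self._on_disk.add(filename)
        return True


def serve_remote_request(
    pool: DiskPool,
    filename: str,
    disk_to_disk_transfer: Callable[[str], None],
) -> bool:
    """Serve a remote site's request; return True if staging was needed."""
    staged = pool.ensure_on_disk(filename)
    disk_to_disk_transfer(filename)
    return staged
```

Only the first request for a file pays the staging cost; later requests are served straight from the disk pool.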

  20. Object Replication Motivation • File replication works well for many kinds of applications • however, it is too inefficient for physics analysis: • only a few objects of a file are requested • physicists want to have replicas on specific sites with sufficient CPU power • they don’t want the entire file, only a few objects • file replication: overhead in terms of data to be transferred • use an object copier to copy objects to a file and then replicate the “new” file • one object per file is inefficient since object size is between 100 bytes and 1 MB (too many files) • [Figure: Grid site 1 (source) and Grid site 2 (destination), with an application and an object copier tool]

  21. Object Replication Architecture Choices • large, world-wide distributed databases are not considered very attractive in HEP • significant parts of GDMP and Globus are used • Object replication cycle: • objects are identified by application • objects not present at the location are identified • “missing” objects are copied into new files and then transferred to the application • Copy and file transfer are pipelined to achieve a better response time • Index files used for locating objects
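
The object replication cycle above can be sketched as three small steps: find the missing objects, pack them into new files, and transfer each file as soon as it is packed. All names and the size limit are hypothetical. A generator only approximates the pipelining described on the slide, since a real implementation would overlap copying and transfer in separate threads or processes.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Set


def missing_objects(requested: Iterable[str], local_index: Set[str]) -> List[str]:
    """Objects the application asked for that are absent at this site."""
    return [oid for oid in requested if oid not in local_index]


def pack_into_files(
    object_ids: Iterable[str],
    sizes: Dict[str, int],
    max_file_bytes: int,
) -> Iterator[List[str]]:
    """Yield batches of object ids; each batch becomes one new file."""
    batch: List[str] = []
    used = 0
    for oid in object_ids:
        size = sizes[oid]
        # Close the current file before it would exceed the size limit.
        if batch and used + size > max_file_bytes:
            yield batch
            batch, used = [], 0
        batch.append(oid)
        used += size
    if batch:
        yield batch


def replicate(
    requested: Iterable[str],
    local_index: Set[str],
    sizes: Dict[str, int],
    max_file_bytes: int,
    transfer: Callable[[List[str]], None],
) -> None:
    """Run one replication cycle and update the local object index."""
    todo = missing_objects(requested, local_index)
    for batch in pack_into_files(todo, sizes, max_file_bytes):
        transfer(batch)            # each file is sent as soon as it is packed
        local_index.update(batch)  # the objects are now present locally
```

Packing many objects into each file avoids the one-object-per-file case, which would create far too many small files.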

  22. Object Replication Prototyping Experience • Most current next-generation experiments do not do analysis yet: • object replication is still a prototype • file replication based on GDMP is in production use • the machine where the object copier runs needs to be powerful (CPU and I/O)

  23. Experimental Results with GridFTP • Main motivation • study the impact of TCP socket buffer size tuning on parallel data transfers • understand the throughput that can be achieved in realistic settings • To get maximal throughput • it is critical to use the optimal TCP send and receive socket buffer size (neither too small nor too large) • Test server • WU-ftpd server 0.4b6 • Test programs • extended_get • extended_put

  24. Experimental Results with GridFTP (cont’d)

  25. Experimental Results with GridFTP (cont’d) • Optimal TCP buffer size = RTT * (speed of bottleneck link) • RTT measured with the Unix ping tool • bottleneck link speed: pipechar (new tool from LBNL) • A simple method to determine the optimal number of parallel streams is not yet known • too many streams may overload the receiving host • usually, 4 to 8 parallel streams are optimal
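
The buffer size rule above is the bandwidth-delay product. A minimal sketch of the arithmetic follows. The function name is hypothetical, and the 100 ms round-trip time and 100 Mbit/s link in the usage note are illustrative values, not measurements from the talk.

```python
def optimal_tcp_buffer_bytes(rtt_ms: int, bottleneck_bits_per_s: int) -> int:
    """Return RTT * bottleneck link speed, in bytes, rounded up.

    rtt_ms * bits_per_s gives 1000 * (bits in flight); dividing by
    8000 converts that to bytes. Integer arithmetic avoids float rounding.
    """
    if rtt_ms < 0 or bottleneck_bits_per_s < 0:
        raise ValueError("inputs must be non-negative")
    scaled_bits = rtt_ms * bottleneck_bits_per_s
    return -(-scaled_bits // 8000)  # ceiling division
```

For an assumed 100 ms RTT over a 100 Mbit/s bottleneck, the function returns 1250000 bytes, or about 1.2 MB.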

  26. Conclusion • GDMP replication service has been enhanced with more advanced data management features • namespace • file catalog management • efficient file transfer (GridFTP) • Object-based replication • experimental analysis
