The Data Grid: Architecture for Distributed Management & Analysis of Large Scientific Datasets

This presentation introduces the concept of a Data Grid, which is a database architecture designed for storing and handling large scientific datasets. It covers the design principles, services, and higher-level components of a Data Grid, including storage systems, metadata access mechanisms, and replica management.

Presentation Transcript


  1. The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets • A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, S. Tuecke • Presented By: Kasturi Chatterjee • Agnostic: Selim Kalayci

  2. Agenda • Introduction • Data Grid Design • Data Grid Services • Higher-Level Data Grid Components • Conclusion

  3. Introduction • Grid: geographically distributed computing resources configured for coordinated use • Data Grid: a database architecture, supported by a Grid, for storing and handling huge amounts of data

  4. Introduction • Scientific disciplines are data intensive as well as computationally demanding • Terabytes and petabytes of data • Diverse Domains and Geographic Distribution of Users and Resources

  5. Data Grid • Integrate heterogeneous data archives into a distributed data management grid* • Identify services for high-performance, distributed, data-intensive computing* • APIs and components required to implement it efficiently • * From Globus project slides available at loci.cs.utk.edu/dsi/netstore99/docs/presentations/foster-d-slides.pdf

  6. Agenda • Introduction • Data Grid Design • Data Grid Services • Higher-Level Data Grid Components • Conclusion

  7. Data Grid Design • Design Principles • Mechanism Neutrality: independent of low-level mechanisms • Policy Neutrality: design decisions are exposed to users • Compatibility with Computational Grid: integration of storage and computation • Uniformity of Information Infrastructure: uniform access to information about resource structure and state

  8. Layered Architecture (from the paper)

  9. Core Services • Storage Systems: DPSS (Distributed Parallel Storage System), HPSS (High Performance Storage System) • Metadata Repository: LDAP (Lightweight Directory Access Protocol), MCAT (Metadata Catalogue)

  10. Agenda • Introduction • Data Grid Design • Data Grid Services • Higher-Level Data Grid Components • Conclusion

  11. Data Grid Services • Data Access Mechanisms for accessing, managing and initiating third-party transfers of data • Metadata Access Mechanisms for accessing and managing information about data

  12. Data Grid Services (from loci.cs.utk.edu/dsi/netstore99/docs/presentations/foster-d-slides.pdf)

  13. Data Grid Services • Storage Systems and Data Access • Storage systems provide functions for creating, destroying, writing and manipulating file instances • They associate a set of properties, such as name, size and access restrictions, with each file instance • E.g., a data grid implementation may use SRB (Storage Resource Broker) to access data

  14. Data Grid Services • Data Access • APIs are defined that describe the possible operations on storage systems and file instances • The API provides a standard interface to storage systems: create, delete, open, close, read, write and storage-to-storage transfer • Self-optimizing capability • Uniform access to heterogeneous systems
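
The uniform data access interface described on this slide can be illustrated with a minimal Python sketch. The class and function names (`StorageSystem`, `InMemoryStorage`, `transfer`) are hypothetical and do not come from the paper or from Globus; open/close are omitted for brevity, and the in-memory backend merely stands in for a real system such as HPSS or DPSS.

```python
from abc import ABC, abstractmethod


class StorageSystem(ABC):
    """Hypothetical uniform interface; operations follow the slide's list."""

    @abstractmethod
    def create(self, name: str) -> None: ...

    @abstractmethod
    def delete(self, name: str) -> None: ...

    @abstractmethod
    def read(self, name: str) -> bytes: ...

    @abstractmethod
    def write(self, name: str, data: bytes) -> None: ...


class InMemoryStorage(StorageSystem):
    """Toy backend (an assumption for illustration) that keeps files in a dict."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def create(self, name: str) -> None:
        if name in self._files:
            raise FileExistsError(name)
        self._files[name] = b""

    def delete(self, name: str) -> None:
        del self._files[name]

    def read(self, name: str) -> bytes:
        return self._files[name]

    def write(self, name: str, data: bytes) -> None:
        if name not in self._files:
            raise FileNotFoundError(name)
        self._files[name] = data


def transfer(src: StorageSystem, dst: StorageSystem, name: str) -> None:
    """Storage-to-storage transfer written only against the uniform interface."""
    dst.create(name)
    dst.write(name, src.read(name))
```

Because `transfer` depends only on the abstract interface, the same call would work between any two backends that implement it, which is the point of uniform access to heterogeneous systems.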

  15. Data Grid Services • Metadata Service • Covers Application Metadata, Replica Metadata and System Configuration Metadata • Single interface to access them (pros: uniformity; cons: complex implementation) • Structured as hierarchical and distributed (pros: scalable, no single point of failure, local control)

  16. Data Grid Services • Application Metadata: describes the information content represented by the file, the circumstances under which the data was obtained, and the information applications need to process it • Replica Metadata: data used to manage replication of data objects • System Configuration Metadata: describes the system, i.e. network connectivity, storage systems, usage policy, etc.
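
As a rough sketch of the "single interface" idea, the three metadata categories can sit behind one publish/query API. The names `MetadataKind` and `MetadataService` and the attribute keys are assumptions made for illustration; the slides mention LDAP and MCAT as actual repositories, which this in-memory toy does not model.

```python
from enum import Enum


class MetadataKind(Enum):
    """The three metadata categories named on the slide."""
    APPLICATION = "application"
    REPLICA = "replica"
    SYSTEM_CONFIGURATION = "system_configuration"


class MetadataService:
    """Hypothetical single entry point for all three kinds of metadata."""

    def __init__(self) -> None:
        # One attribute table per metadata kind, keyed by object name.
        self._store = {kind: {} for kind in MetadataKind}

    def publish(self, kind: MetadataKind, name: str, attributes: dict) -> None:
        """Add or update attributes for a named object."""
        self._store[kind].setdefault(name, {}).update(attributes)

    def query(self, kind: MetadataKind, name: str) -> dict:
        """Return a copy of the attributes, or an empty dict if unknown."""
        return dict(self._store[kind].get(name, {}))
```

A caller uses the same two methods whether it is recording how a dataset was produced (`APPLICATION`) or describing a storage site's usage policy (`SYSTEM_CONFIGURATION`).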

  17. Agenda • Introduction • Data Grid Design • Data Grid Services • Higher-Level Data Grid Components • Conclusion

  18. Higher-Level Data Grid Components • Replica Management (from I. Foster's slides) • Collections contain related files • Logical files describe replicated physical files • Services for managing replicated file instances: create/delete, schedule/manage data transfer, register in the replica catalog, metadata display

  19. Higher-Level Data Grid Components • How Does a Replica Manager Work? • Maintains a repository/catalogue • Entries correspond to logical files or file collections • Each logical file or collection is associated with one or more physical instances • The catalogue contains the mapping from logical files to physical instances
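
The catalogue described on this slide is essentially a mapping from logical names to physical instances. A minimal in-memory sketch follows; `ReplicaCatalog`, its method names, and the example URLs in the usage note are hypothetical, not the Globus replica catalog API.

```python
from collections import defaultdict


class ReplicaCatalog:
    """Toy catalogue: logical file -> physical instances, collection -> logical files."""

    def __init__(self) -> None:
        self._replicas = defaultdict(set)     # logical name -> physical locations
        self._collections = defaultdict(set)  # collection name -> logical names

    def register(self, logical_name, physical_location, collection=None):
        """Record that a physical copy of a logical file exists at a location."""
        self._replicas[logical_name].add(physical_location)
        if collection is not None:
            self._collections[collection].add(logical_name)

    def unregister(self, logical_name, physical_location):
        """Forget one physical copy; unknown entries are ignored."""
        self._replicas.get(logical_name, set()).discard(physical_location)

    def locate(self, logical_name):
        """Return all known physical instances of a logical file."""
        return sorted(self._replicas.get(logical_name, ()))

    def files_in(self, collection):
        """Return the logical files that belong to a collection."""
        return sorted(self._collections.get(collection, ()))
```

For example, after registering `"gsiftp://site-a/run1.dat"` and `"gsiftp://site-b/run1.dat"` under the logical name `"run1.dat"`, `locate("run1.dat")` returns both locations. Deciding which one to use is left to a separate selection step.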

  20. Higher-Level Data Grid Components • The Replica Manager does not determine: • when or where replicas are created • which replicas are to be used by an application • This keeps policy separate from the replica manager design, making it generic

  21. Higher-Level Data Grid Components • Replica Selection • Process of choosing the replica that will optimize a desired performance criterion • The selection process may initiate creation of a new replica • Intelligent scheduling to determine the appropriate replica, site for (re)computation, etc.
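
One way to picture replica selection is as minimizing a cost function over the candidate replicas. The sketch below assumes the criterion is estimated transfer time (latency plus size divided by bandwidth); the slides do not prescribe this criterion, and the `Replica` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """A candidate physical copy with assumed network characteristics."""
    location: str
    bandwidth_mb_s: float  # assumed sustained bandwidth to the client, MB/s
    latency_s: float       # assumed startup latency, seconds


def estimated_transfer_time(replica: Replica, size_mb: float) -> float:
    """Simple cost model: startup latency plus size over bandwidth."""
    return replica.latency_s + size_mb / replica.bandwidth_mb_s


def select_replica(replicas, size_mb: float):
    """Return the replica with the lowest estimated time, or None if there are none."""
    return min(
        replicas,
        key=lambda r: estimated_transfer_time(r, size_mb),
        default=None,
    )
```

Under this model a nearby but slow site can win for small files while a distant, high-bandwidth site wins for large ones. A `None` result is where a policy layer might decide to create a new replica, a decision the replica manager itself does not make.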

  22. Agenda • Introduction • Data Grid Design • Data Grid Services • Higher-Level Data Grid Components • Conclusion

  23. Conclusion • Implementation experience led to the adoption of collections of logical files • Implements a computation- and data-intensive Grid architecture • APIs provide a standard interface for various utilities • Replica Management and Metadata services are provided using LDAP

  24. Further Works • Chervenak et al.: 1. Secure, Efficient Data Transport and Replica Management for High-Performance Data-Intensive Computing (2001) 2. High-Performance Remote Access to Climate Simulation Data: A Challenge Problem for Data Grid Technologies (2001) 3. A Replica Location Grid Service Implementation (2004) 4. Applying Peer-to-Peer Techniques to Grid Replica Location Services (2006) • Leanne Guy et al.: Replica Management in Data Grids (2002), which addressed read/write replica techniques
