Data Management Azizol Abdullah FSKTM
What is Data Management? • It depends on ……. • Storage system • Data transport mechanism • Replication management • Metadata management • Publishing and curation of data
What is data management? (cont.) • Storage systems • Disk arrays • Network caches (e.g., DPSS) • Hierarchical storage systems (e.g., HPSS) • Efficient data transport mechanisms • Striped • Parallel • Secure • Reliable • Third-party transfers
What is data management? (cont.) • Replication management • Associate files into collections • Mechanisms for reliably copying collections, propagating updates to collections, selecting among replicas • Metadata management • Associate attributes that describe data • Select data based on attributes • Publishing and curation of data • “Official” versions of important collections • Digital libraries
Data-Intensive Applications: Physics • CERN Large Hadron Collider • Several terabytes of data per year • Starting in 2005 • Continuing 15 to 20 years • Replication scenario: • Copy of everything at CERN (Tier 0) • Subsets at national centers (Tier 1) • Smaller regional centers (Tier 2) • Individual researchers will have copies
GriPhyN Overview (www.griphyn.org) • 5-year, $12.5M NSF ITR proposal to realize the concept of virtual data, via: • Key research areas: • Virtual data technologies (information models, management of virtual data software, etc.) • Request planning and scheduling (including policy representation and enforcement) • Task execution (including agent computing, fault management, etc.) • Development of Virtual Data Toolkit (VDT) • Four Applications: ATLAS, CMS, LIGO, SDSS
GriPhyN Participants • Computer Science • U.Chicago, USC/ISI, UW-Madison, UCSD, UCB, Indiana, Northwestern, Florida • Toolkit Development • U.Chicago, USC/ISI, UW-Madison, Caltech • Applications • ATLAS (Indiana), CMS (Caltech), LIGO (UW-Milwaukee, UT-B, Caltech), SDSS (JHU) • Unfunded collaborators • UIC (STAR-TAP), ANL, LBNL, Harvard, U.Penn
The Petascale Virtual Data Grid (PVDG) Model • Data suppliers publish data to the Grid • Users request raw or derived data from Grid, without needing to know • Where data is located • Whether data is stored or computed • User can easily determine • What it will cost to obtain data • Quality of derived data • PVDG serves requests efficiently, subject to global and local policy constraints
PVDG Scenario • User requests may be satisfied via a combination of data access and computation at local, regional, and central sites
Other Application Scenarios • Climate community • Terabyte-scale climate model datasets: • Collecting measurements • Simulation results • Must support sharing, remote access to and analysis of datasets • Distance visualization • Remote navigation through large datasets, with local and/or remote computing
Data-intensive computing • The term data-intensive computing is used to describe applications that are I/O bound. • Such applications devote the largest fraction of execution time to movement of data. • They can be identified by evaluating “computational bandwidth”—the number of bytes of data processed per floating-point operation. • On vector supercomputers for applications that sustain high performance, usually 7 bytes of data are accessed from memory for every floating point operation
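The "computational bandwidth" measure above is a simple ratio, and a minimal sketch of the arithmetic follows. The function name and the sample counts are made up for illustration; they are not measurements from any real machine.

```python
def bytes_per_flop(bytes_accessed: float, flops: float) -> float:
    """Computational bandwidth: bytes of data processed per floating-point operation."""
    if flops <= 0:
        raise ValueError("flops must be positive")
    return bytes_accessed / flops


# Hypothetical run: 2.8e12 bytes accessed while executing 4.0e11 operations.
ratio = bytes_per_flop(2.8e12, 4.0e11)
print(ratio)  # 7.0, matching the vector-supercomputer figure quoted on the slide
```

A higher ratio means more data movement per unit of arithmetic, which is what makes an application I/O bound.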
Storage Systems: Disk Arrays • What is a disk array? • Collection of disks • Advantages: • Higher capacity • Many small, inexpensive disks • Higher throughput • Higher bandwidth (Mbytes/sec) on large transfers • Higher I/O rate (transactions/sec) on small transfers
Trends in Magnetic Disks • Capacity increases: 60% per year • Cost falling at similar rate ($/MB or $/GB) • Evolving to smaller physical sizes • 14 in → 5.25 in → 3.5 in → 2.5 in → 1.0 in → … ? • Put lots of small disks together • Problem: RELIABILITY • Reliability of N disks = Reliability of 1 disk divided by N
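The reliability rule above can be worked through with numbers. This is a rough sketch that reads "reliability" as mean time to failure (MTTF) and assumes disks fail independently at a constant rate; the 200,000-hour figure is illustrative only, not a vendor specification.

```python
def array_mttf_hours(disk_mttf_hours: float, n_disks: int) -> float:
    """MTTF of an array with no redundancy: any single disk failure loses data."""
    if n_disks < 1:
        raise ValueError("n_disks must be at least 1")
    return disk_mttf_hours / n_disks


single_disk = 200_000.0  # hypothetical MTTF of one disk, in hours
print(array_mttf_hours(single_disk, 1))    # 200000.0
print(array_mttf_hours(single_disk, 100))  # 2000.0, i.e. roughly 12 weeks
```

Under these assumptions a 100-disk array fails about once every three months, which is why disk arrays add redundancy.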
Key Concepts in Disk Arrays: Striping for High Performance • Interleave data from single file across multiple disks • Fine-grained interleaving: • every file spread across all disks • any access involves all disks • Coarse-grained interleaving: • interleave in large blocks • small accesses may be satisfied by a single disk
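The difference between fine-grained and coarse-grained interleaving can be sketched as a mapping from logical block numbers to disks. The function names and the `stripe_unit` parameter are hypothetical, chosen for this illustration; real controllers use their own layouts.

```python
from typing import Set, Tuple


def locate_block(logical_block: int, n_disks: int, stripe_unit: int = 1) -> Tuple[int, int]:
    """Map a logical block to (disk index, block offset on that disk).

    stripe_unit is the number of consecutive logical blocks placed on one
    disk before moving to the next disk (1 = fine-grained interleaving).
    """
    stripe_index, within = divmod(logical_block, stripe_unit)
    disk = stripe_index % n_disks
    row = stripe_index // n_disks
    return disk, row * stripe_unit + within


def disks_touched(first_block: int, n_blocks: int, n_disks: int, stripe_unit: int) -> Set[int]:
    """Set of disks involved in reading n_blocks consecutive logical blocks."""
    return {
        locate_block(b, n_disks, stripe_unit)[0]
        for b in range(first_block, first_block + n_blocks)
    }


# A 4-block read on a 4-disk array:
print(disks_touched(0, 4, n_disks=4, stripe_unit=1))  # {0, 1, 2, 3}: every disk is involved
print(disks_touched(0, 4, n_disks=4, stripe_unit=8))  # {0}: one disk satisfies the request
```

With a large stripe unit, the other three disks stay free to serve other small requests at the same time.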
Key Concepts in Disk Arrays: Redundancy • Maintain extra information in disk array • Duplication • Parity • Reed-Solomon error correction codes • Others • When a disk fails: use redundancy information to reconstruct data on failed disk
RAID “Levels” (Redundant Arrays of Inexpensive Disks) • Defined by combinations of striping and redundancy (6 levels of RAID) • RAID Level 1: Mirroring or Shadowing • Maintain a complete copy of each disk • Very reliable • High cost: twice the number of disks • Great performance: on a read, may go to disk with faster access time
RAID “Levels” (cont.) • RAID Level 2: Memory Style Error Detection and Correction • Not really implemented in practice • Based on DRAM-style Hamming codes • In disk systems, don’t need detection • Use less expensive correction schemes
RAID “Levels” (cont.) • RAID Level 3: Fine-grained Interleaving and Parity • Many commercial RAIDs • Calculate parity bit-wise across disks in the array (using exclusive-OR logic) • Maintain a separate parity disk; update on write operations • When a disk fails, use the other data disks and the parity disk to reconstruct data on the lost disk • Fine-grained interleaving: all disks involved in any access to the array
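Exclusive-OR parity and reconstruction after a disk failure can be sketched in a few lines. The three 4-byte "disks" here are toy data invented for the example, and the helper name `xor_blocks` is hypothetical.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data disks, each holding one 4-byte block (toy data).
data = [b"GRID", b"DATA", b"RAID"]
parity = xor_blocks(data)  # contents of the parity disk

# Disk 1 fails: rebuild it from the surviving data disks plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # b'DATA'
```

This works because XOR-ing a value twice cancels it out, so combining the survivors with the parity leaves only the missing block.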
RAID “Levels” (cont.) • RAID Level 4: Large Block Interleaving and Parity • Similar to level 3, but interleave on larger blocks • Small accesses may be satisfied by a single disk • Supports higher rate of small I/Os • Parity disk may become a bottleneck with multiple concurrent I/Os
RAID “Levels” (cont.) • RAID Level 5: Large Block Interleaving and Distributed Parity • Similar to level 4 • Distributes parity blocks throughout all disks in array
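The parity bottleneck of level 4 and the way level 5 avoids it can be seen by counting which disk holds each stripe's parity. The rotation rule below is one simple choice made for this sketch; real arrays use several different rotation schemes, and the function names are hypothetical.

```python
from typing import List


def parity_disk(stripe: int, n_disks: int) -> int:
    """Disk holding parity for a stripe, rotating backwards from the last disk."""
    return (n_disks - 1 - stripe) % n_disks


def parity_load(n_stripes: int, n_disks: int, distributed: bool) -> List[int]:
    """Count the parity blocks stored on each disk."""
    counts = [0] * n_disks
    for stripe in range(n_stripes):
        # Level 4 style: a dedicated parity disk (the last one).
        # Level 5 style: parity rotates across all disks.
        disk = parity_disk(stripe, n_disks) if distributed else n_disks - 1
        counts[disk] += 1
    return counts


print(parity_load(8, 4, distributed=False))  # [0, 0, 0, 8]
print(parity_load(8, 4, distributed=True))   # [2, 2, 2, 2]
```

Since every small write also updates parity, spreading the parity blocks spreads that write load across the array.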
RAID Levels (cont.) • RAID Level 6: Reed-Solomon Error Correction Codes • Protection against two disk failures
RAID Levels (cont.) • Disks getting so cheap: consider massive storage systems composed entirely of disks • No tape!!
DPSS: Distributed Parallel Storage System • Produced by Lawrence Berkeley National Laboratory • “Cache”: provides storage that is • Faster than typical local disk • Temporary • “Virtual disk”: appears to be single large, random-access, block-oriented I/O device • Isolates application from tertiary storage system: • Acts as large buffer between slow tertiary storage and high-performance network connections • “Impedance matching”
Features of DPSS • Components: • DPSS block servers • Typically low-cost workstations • Each with several disk controllers, several disks per controller • DPSS master process • Data requests sent from client to master process • Determines which DPSS block server stores the requested blocks • Forwards request to that block server • Note: servers can be anywhere on network (a distributed cache)
Features of DPSS (cont.) • Client API library • Supports variety of I/O semantics • dpssOpen(), dpssRead(), dpssWrite(), dpssLSeek(), dpssClose() • Application controls data layout in cache • For typical applications that read sequentially: stripe blocks of data across servers in round-robin fashion • DPSS client library is multi-threaded • Number of client threads is equal to number of DPSS servers: client speed scales with server speed
Features of DPSS (cont.) • Optimized for relatively small number of large files • Several thousand files • Greater than 50 MB • DPSS blocks are available as soon as they are placed in cache • Good for staging large files to/from tertiary storage • Don’t have to wait for large transfer to complete • Dynamically reconfigurable • Add or remove servers or disks on the fly
Features of DPSS (cont.) • Agent-based performance monitoring system • Client library automatically sets TCP buffer size to optimal value • Uses information published by monitoring system • Load balancing • Supports replication of files on multiple servers • DPSS master uses status information stored in LDAP directory to select a replica that will give fastest response
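Choosing the replica expected to respond fastest reduces to a minimum over published status values. This sketch is not the DPSS master's actual logic: a plain dictionary stands in for the LDAP directory, and the server names and response-time estimates are invented.

```python
from typing import Dict


def pick_replica(estimated_response_ms: Dict[str, float]) -> str:
    """Return the server whose published estimate promises the fastest response."""
    if not estimated_response_ms:
        raise ValueError("no replicas available")
    return min(estimated_response_ms, key=estimated_response_ms.get)


# Hypothetical status entries, as a monitoring agent might publish them.
status = {"server-a": 40.0, "server-b": 12.5, "server-c": 30.0}
print(pick_replica(status))  # server-b
```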
Hierarchical Storage System • Fast disk cache in front of larger, slower storage • Works on same principle as other hierarchies: • Level-1 and Level-2 caches: minimize off-chip memory accesses • Virtual memory systems: minimize page faults to disk • Goal: • Keep popular material in faster storage • Keep most of material on cheaper, slower storage • Locality: 10% of material gets 90% of accesses
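A disk cache in front of slower storage can be sketched as a small simulation. The slide does not name a replacement policy; least-recently-used (LRU) eviction is assumed here, and the class name and file names are hypothetical.

```python
from collections import OrderedDict


class DiskCache:
    """Fixed-capacity cache of file names with LRU eviction (an assumed policy)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.files: "OrderedDict[str, bool]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, name: str) -> None:
        if name in self.files:
            self.files.move_to_end(name)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1  # a real system would now fetch from slow storage
        self.files[name] = True
        if len(self.files) > self.capacity:
            self.files.popitem(last=False)  # evict the least recently used file


cache = DiskCache(capacity=2)
for name in ["a", "b", "a", "c", "a", "b"]:
    cache.read(name)
print(cache.hits, cache.misses)  # 2 4
print(list(cache.files))         # ['a', 'b']
```

The popular file "a" stays in the cache throughout, which is the locality effect the hierarchy relies on.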
Hierarchical Storage System (cont.) • Problem with tertiary storage (especially tape): • Very slow • Tape seek times can be a minute or more…
Data Management GridFTP
Motivation…. • The GridFTP protocol was born out of a realization that the Grid environment needed a fast, secure, efficient, and reliable transport mechanism. • Existing distributed data storage systems • DPSS, HPSS: focus on high-performance access, utilize parallel data transfer, striping • DFS: focus on high-volume usage, dataset replication, local caching • SRB: connects heterogeneous data collections, uniform client interface, metadata queries
Motivation…. (cont.) • Problems • Incompatible (and proprietary) protocols • Each requires a custom client • Partitions available data sets and storage devices • Each protocol has only a subset of the desired functionality
A Common, Secure, Efficient Data Access Protocol • Common, extensible transfer protocol • Common protocol means all can interoperate • Decouple low-level data transfer mechanisms from the storage service • Advantages: • New, specialized storage systems are automatically compatible with existing systems • Existing systems have richer data transfer functionality • Interface to many storage systems • HPSS, DPSS, file systems • Plan for SRB integration
Access/Transport Protocol Requirements • Suite of communication libraries and related tools that support • GSI, Kerberos security • Third-party transfers • Parameter set/negotiate • Partial file access • Reliability/restart • Large file support • Data channel reuse • All based on a standard, widely deployed protocol • Integrated instrumentation • Logging/audit trail • Parallel transfers • Striping (cf. DPSS) • Policy-based access control • Server-side computation • Proxies (firewall, load balancing)
And The Protocol Is … GridFTP • Why FTP? • Ubiquity enables interoperation with many commodity tools • Already supports many desired features, easily extended to support others • Well understood and supported • We use the term GridFTP to refer to • Transfer protocol which meets requirements • Family of tools which implement the protocol • Note GridFTP > FTP • Note that despite name, GridFTP is not restricted to file transfer!
GridFTP: Basic Approach • FTP protocol is defined by several IETF RFCs • Start with most commonly used subset • Standard FTP: get/put etc., 3rd-party transfer • Implement standard but often unused features • GSS binding, extended directory listing, simple restart • Extend in various ways, while preserving interoperability with existing servers • Striped/parallel data channels, partial file, automatic & manual TCP buffer setting, progress monitoring, extended restart
3rd Party Transfer • 1: Computer A sends transfer request to Computer B • 2: B initiates transfer, A disconnects • 3: Computer C receives file data
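In standard FTP, a third-party transfer is usually arranged by a client that holds control connections to both servers: one server is put into passive mode and its listening address is handed to the other server in a PORT command. The sketch below only builds that command sequence and sends nothing over the network. The server labels "B" (source) and "C" (destination), the helper names, and the sample reply are assumptions for illustration; PASV, PORT, STOR, and RETR are the standard FTP commands.

```python
import re
from typing import List, Tuple


def parse_pasv_reply(reply: str) -> str:
    """Extract the 'h1,h2,h3,h4,p1,p2' address from a 227 reply to PASV."""
    match = re.search(r"\((\d{1,3}(?:,\d{1,3}){5})\)", reply)
    if match is None:
        raise ValueError("unrecognized PASV reply: %r" % reply)
    return match.group(1)


def third_party_plan(pasv_reply_from_c: str, path: str) -> List[Tuple[str, str]]:
    """Commands the controlling client sends after C has answered PASV.

    Each entry is (server label, FTP command).
    """
    address = parse_pasv_reply(pasv_reply_from_c)
    return [
        ("B", "PORT " + address),  # tell the source where the destination listens
        ("C", "STOR " + path),     # destination waits for the incoming data
        ("B", "RETR " + path),     # source connects to C and sends the file
    ]


reply = "227 Entering Passive Mode (192,0,2,10,19,137)"  # made-up example reply
for server, command in third_party_plan(reply, "dataset.dat"):
    print(server, command)
```

The data never passes through the controlling client, so its own network link does not limit the transfer.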
The GridFTP Family of Tools • Provide the following features: • Grid Security Infrastructure (GSI) and Kerberos support: • Robust and flexible authentication, integrity, and confidentiality features are critical when transferring or accessing files. • GridFTP supports both GSI and Kerberos authentication, with user controlled setting of various levels of data integrity and/or confidentiality. • Third-party control of data transfer: • In order to manage large data sets for large distributed communities, it is necessary to provide third-party control of transfers between storage servers. • GridFTP provides this capability by adding GSSAPI security to the existing third-party transfer capability defined in the FTP standard.
The GridFTP Family of Tools (cont.) • Parallel data transfer: • On wide-area links, using multiple TCP streams can improve aggregate bandwidth over using a single TCP stream. • This is required both between a single client and a single server, and between two servers. • GridFTP supports parallel data transfer through FTP command extensions and data channel extensions. • Striped data transfer: • Partitioning data across multiple servers can further improve aggregate bandwidth. • GridFTP supports striped data transfers through extensions defined in the Grid Forum draft.
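One simple way to picture parallel transfer is to divide a file into byte ranges, one per TCP stream. This is a simplified sketch and does not describe how the GridFTP wire format actually assigns data to channels; the function name is hypothetical.

```python
from typing import List, Tuple


def split_ranges(file_size: int, n_streams: int) -> List[Tuple[int, int]]:
    """Split [0, file_size) into n contiguous (offset, length) ranges.

    Range lengths differ by at most one byte.
    """
    if n_streams < 1:
        raise ValueError("n_streams must be at least 1")
    base, extra = divmod(file_size, n_streams)
    ranges = []
    offset = 0
    for i in range(n_streams):
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


print(split_ranges(10, 4))  # [(0, 3), (3, 3), (6, 2), (8, 2)]
```

The same partitioning idea extends to striping, where the ranges are served by different hosts rather than by different streams to one host.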
The GridFTP Family of Tools (cont.) • Partial file transfer: • Many applications require the transfer of partial files. • However, standard FTP requires the application to transfer the entire file, or the remainder of a file starting at a particular offset. • GridFTP introduces new FTP commands to support transfers of regions of a file. • Support for reliable data transfer: • Reliable transfer is important for many applications that manage data. • Fault recovery methods for handling transient network failures, server outages, etc., are needed. • The FTP standard includes basic features for restarting failed transfers, but these are not widely implemented. • The GridFTP protocol exploits these features, and substantially extends them.
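Partial file transfer and restart fit together: if the receiver knows which byte ranges already arrived, only the gaps need to be requested again. The sketch below computes those gaps from a list of received ranges. The representation as half-open (start, end) pairs is an assumption for illustration, not the protocol's restart-marker format.

```python
from typing import List, Tuple


def missing_ranges(file_size: int, received: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return the half-open byte ranges of [0, file_size) not yet received."""
    gaps = []
    cursor = 0  # everything before cursor is known to be received
    for start, end in sorted(received):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < file_size:
        gaps.append((cursor, file_size))
    return gaps


# A 100-byte file where a failed transfer delivered two regions.
print(missing_ranges(100, [(0, 30), (50, 70)]))  # [(30, 50), (70, 100)]
```

With parallel streams, received data is not one contiguous prefix, which is why a single restart offset is not enough and region-based transfers are needed.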
The GridFTP Family of Tools (cont.) • Manual control of TCP buffer size: • This is a critical parameter for achieving maximum bandwidth with TCP/IP. • The protocol also has support for automatic buffer size tuning, but we have not yet implemented anything in our code. • We are talking with both NCSA and LANL to see if it makes sense to integrate work they are doing in this area into our code. • Integrated Instrumentation: • The protocol calls for restart and performance markers to be sent back. • It is not specified how often, and this is something we intend to address shortly.
Why Did We Need a New Transport Protocol? • The requirement was a transport protocol that met the following criteria: • Targeted at bulk data transport: We saw this as a protocol to move lots of data (100s of megabytes and above) • Based on industry standards: i.e., a clear, well defined, published, nonproprietary protocol. • Secure: Allowed for authentication, authorization, integrity, and privacy • Fast and Efficient: This meant employing multiple levels of parallelism and minimizing overhead.
Why Did We Need a New Transport Protocol? (cont.) • Robust: The protocol must be able to tolerate system failures gracefully. • Allowed 3rd party transfers: We believe much of the traffic will be generated by automated systems such as schedulers. • Integrated instrumentation: The protocol must provide feedback on operational status so that intelligent actions can be taken during transfers. • Easily Extensible: Both in terms of standards body approval and in terms of technology, architecture, and coding.
Data Management Replication Management
The Motivation… • Data-intensive, high-performance computing applications require efficient management and transfer of terabytes or petabytes of information in wide-area, distributed computing environments. • Examples of such applications include experimental analyses and simulations in scientific disciplines such as: • high-energy physics • climate modeling • earthquake engineering • astronomy.
The Motivation… (cont.) • In such applications, massive datasets must be shared by a community of hundreds or thousands of researchers distributed worldwide. • These researchers need to transfer large subsets of these datasets to local sites or other remote resources for processing. • They may create local copies or replicas to overcome long wide-area data transfer latencies.
The Motivation… (cont.) • Once multiple copies of files are distributed at multiple locations, researchers need a service: • to locate copies • to determine whether to access an existing copy or create a new one • to meet the performance needs of their applications.