Data Management

Data Management Azizol Abdullah FSKTM

What is Data Management? • It depends on ……. • Storage system • Data transport mechanism • Replication management • Metadata management • Publishing and curation of data

What is data management? (cont.) • Storage systems • Disk arrays • Network caches (e.g., DPSS) • Hierarchical storage systems (e.g., HPSS) • Efficient data transport mechanisms • Striped • Parallel • Secure • Reliable • Third-party transfers

What is data management? (cont.) • Replication management • Associate files into collections • Mechanisms for reliably copying collections, propagating updates to collections, selecting among replicas • Metadata management • Associate attributes that describe data • Select data based on attributes • Publishing and curation of data • “Official” versions of important collections • Digital libraries

Data-Intensive Applications: Physics • CERN Large Hadron Collider • Several petabytes of data per year • Starting in 2005 • Continuing 15 to 20 years • Replication scenario: • Copy of everything at CERN (Tier 0) • Subsets at national centers (Tier 1) • Smaller regional centers (Tier 2) • Individual researchers will have copies

The Large Hadron Collider (LHC) experiment

The CERN structure

GriPhyN Overview (www.griphyn.org) • 5-year, $12.5M NSF ITR proposal to realize the concept of virtual data, via: • Key research areas: • Virtual data technologies (information models, management of virtual data software, etc.) • Request planning and scheduling (including policy representation and enforcement) • Task execution (including agent computing, fault management, etc.) • Development of Virtual Data Toolkit (VDT) • Four applications: ATLAS, CMS, LIGO, SDSS

GriPhyN Participants • Computer Science • U.Chicago, USC/ISI, UW-Madison, UCSD, UCB, Indiana, Northwestern, Florida • Toolkit Development • U.Chicago, USC/ISI, UW-Madison, Caltech • Applications • ATLAS (Indiana), CMS (Caltech), LIGO (UW-Milwaukee, UT-B, Caltech), SDSS (JHU) • Unfunded collaborators • UIC (STAR-TAP), ANL, LBNL, Harvard, U.Penn

The Petascale Virtual Data Grid (PVDG) Model • Data suppliers publish data to the Grid • Users request raw or derived data from Grid, without needing to know • Where data is located • Whether data is stored or computed • User can easily determine • What it will cost to obtain data • Quality of derived data • PVDG serves requests efficiently, subject to global and local policy constraints

PVDG Scenario • User requests may be satisfied via a combination of data access and computation at local, regional, and central sites

Other Application Scenarios • Climate community • Terabyte-scale climate model datasets: • Collecting measurements • Simulation results • Must support sharing, remote access to and analysis of datasets • Distance visualization • Remote navigation through large datasets, with local and/or remote computing

Data-intensive computing • The term data-intensive computing is used to describe applications that are I/O bound. • Such applications devote the largest fraction of execution time to movement of data. • They can be identified by evaluating “computational bandwidth”—the number of bytes of data processed per floating-point operation. • On vector supercomputers for applications that sustain high performance, usually 7 bytes of data are accessed from memory for every floating point operation
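The "computational bandwidth" measure above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and using the 7 bytes per operation figure from the slide as a cut-off for "data-intensive" is an assumption made here, not a rule stated in the slides.

```python
def bytes_per_flop(bytes_moved: float, flops: float) -> float:
    """Computational bandwidth: bytes of data processed per floating-point operation."""
    if flops <= 0:
        raise ValueError("flops must be positive")
    return bytes_moved / flops


def looks_data_intensive(bytes_moved: float, flops: float, baseline: float = 7.0) -> bool:
    """Assumed rule of thumb: flag an application whose ratio exceeds the baseline."""
    return bytes_per_flop(bytes_moved, flops) > baseline
```

An application that moves 100 bytes for every 10 floating-point operations has a ratio of 10 and would be flagged under this assumed cut-off.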

Storage Systems: Disk Arrays • What is a disk array? • Collection of disks • Advantages: • Higher capacity • Many small, inexpensive disks • Higher throughput • Higher bandwidth (Mbytes/sec) on large transfers • Higher I/O rate (transactions/sec) on small transfers

Trends in Magnetic Disks • Capacity increases: 60% per year • Cost falling at similar rate ($/MB or $/GB) • Evolving to smaller physical sizes • 14in → 5.25in → 3.5in → 2.5in → 1.0in → … ? • Put lots of small disks together • Problem: RELIABILITY • Reliability (mean time to failure) of N disks = reliability of 1 disk divided by N
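A worked example of the reliability rule on this slide, read as mean time to failure (MTTF) and assuming independent disk failures. The function name and the sample numbers are illustrative assumptions, not figures from the slides.

```python
def array_mttf_hours(disk_mttf_hours: float, n_disks: int) -> float:
    """MTTF of an array with no redundancy: single-disk MTTF divided by N."""
    if n_disks < 1:
        raise ValueError("n_disks must be at least 1")
    return disk_mttf_hours / n_disks


# Hypothetical numbers: 100 disks, each rated at 500,000 hours.
print(array_mttf_hours(500_000, 100))  # 5000.0 hours, roughly 208 days
```

This is why the slides that follow introduce redundancy: without it, a large array fails far more often than any single disk.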

Key Concepts in Disk Arrays: Striping for High Performance • Interleave data from single file across multiple disks • Fine-grained interleaving: • every file spread across all disks • any access involves all disks • Coarse-grained interleaving: • interleave in large blocks • small accesses may be satisfied by a single disk
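The difference between fine-grained and coarse-grained interleaving can be sketched as an address mapping. This is a minimal sketch under assumed conventions: `stripe_unit` (blocks placed on one disk before moving to the next) and both function names are hypothetical.

```python
def locate_block(logical_block: int, n_disks: int, stripe_unit: int = 1):
    """Map a logical block number to (disk index, block offset on that disk)."""
    stripe_index, within = divmod(logical_block, stripe_unit)
    disk = stripe_index % n_disks      # round-robin across disks
    row = stripe_index // n_disks      # how many stripe units precede it on that disk
    return disk, row * stripe_unit + within


def disks_touched(start: int, count: int, n_disks: int, stripe_unit: int):
    """Set of disks involved in reading `count` blocks starting at `start`."""
    return {locate_block(b, n_disks, stripe_unit)[0] for b in range(start, start + count)}
```

With four disks, a four-block read touches every disk when `stripe_unit` is 1 (fine-grained) but only one disk when `stripe_unit` is 8 (coarse-grained).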

Key Concepts in Disk Arrays Redundancy • Maintain extra information in disk array • Duplication • Parity • Reed-Solomon error correction codes • Others • When a disk fails: use redundancy information to reconstruct data on failed disk

RAID “Levels” (Redundant Arrays of Inexpensive Disks) • Defined by combinations of striping & redundancy (6 levels of RAID) • RAID Level 1: Mirroring or Shadowing • Maintain a complete copy of each disk • Very reliable • High cost: twice the number of disks • Great performance: on a read, may go to disk with faster access time

RAID “Levels” (cont.) • RAID Level 2: Memory Style Error Detection and Correction • Not really implemented in practice • Based on DRAM-style Hamming codes • In disk systems, don’t need detection • Use less expensive correction schemes

RAID “Levels” (cont.) • RAID Level 3: Fine-grained Interleaving and Parity • Many commercial RAIDs • Calculate parity bit-wise across disks in the array (using exclusive-OR logic) • Maintain a separate parity disk; update on write operations • When a disk fails, use the other data disks and the parity disk to reconstruct data on the lost disk • Fine-grained interleaving: all disks involved in any access to the array
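The exclusive-OR parity scheme described above can be sketched with byte strings standing in for disks. The function names are hypothetical; the XOR identity itself is standard.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings (one per disk)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Rebuild the single lost block from the surviving data blocks and the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


data = [b"abcd", b"efgh", b"ijkl"]      # contents of three data disks
parity = xor_blocks(data)               # contents of the parity disk
# Disk 1 fails: XOR of everything that survived yields its contents.
assert reconstruct([data[0], data[2]], parity) == data[1]
```

The same computation explains the write cost: every update to a data block also requires the parity block to be recomputed.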

RAID “Levels” (cont.) • RAID Level 4: Large Block Interleaving and Parity • Similar to level 3, but interleave on larger blocks • Small accesses may be satisfied by a single disk • Supports higher rate of small I/Os • Parity disk may become a bottleneck with multiple concurrent I/Os

RAID “Levels” (cont.) • RAID Level 5: Large Block Interleaving and Distributed Parity • Similar to level 4 • Distributes parity blocks throughout all disks in array
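One way to picture distributed parity is a layout in which the parity block moves to a different disk on each stripe. The rotation rule below is an assumed, commonly used pattern chosen for illustration; real controllers may use a different layout, and the function names are hypothetical.

```python
def parity_disk(stripe: int, n_disks: int) -> int:
    """Disk holding the parity block for a given stripe (assumed rotating layout)."""
    return (n_disks - 1 - stripe) % n_disks


def stripe_layout(stripe: int, n_disks: int):
    """Return 'P' for the parity disk and 'D' for data disks in this stripe."""
    p = parity_disk(stripe, n_disks)
    return ["P" if d == p else "D" for d in range(n_disks)]


for s in range(4):
    print(s, stripe_layout(s, 4))
```

Because parity lands on a different disk for each stripe, concurrent small writes no longer queue on one dedicated parity disk as they do in Level 4.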

RAID Levels (cont.) • RAID Level 6: Reed-Solomon Error Correction Codes • Protection against two disk failures

RAID Levels (cont.) • Disks getting so cheap: consider massive storage systems composed entirely of disks • No tape!!

DPSS: Distributed Parallel Storage System • Produced by Lawrence Berkeley National Labs • “Cache”: provides storage that is • Faster than typical local disk • Temporary • “Virtual disk”: appears to be single large, random-access, block-oriented I/O device • Isolates application from tertiary storage system: • Acts as large buffer between slow tertiary storage and high-performance network connections • “Impedance matching”

Features of DPSS • Components: • DPSS block servers • Typically low-cost workstations • Each with several disk controllers, several disks per controller • DPSS master process • Data requests sent from client to master process • Determines which DPSS block server stores the requested blocks • Forwards request to that block server • Note: servers can be anywhere on network (a distributed cache)

Features of DPSS (cont.) • Client API library • Supports variety of I/O semantics • dpssOpen(), dpssRead(), dpssWrite(), dpssLSeek(), dpssClose() • Application controls data layout in cache • For typical applications that read sequentially: stripe blocks of data across servers in round-robin fashion • DPSS client library is multi-threaded • Number of client threads is equal to number of DPSS servers: client speed scales with server speed
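The round-robin layout and the one-thread-per-server client can be sketched with in-memory stand-ins. This does not use the real DPSS client library (the `dpssOpen()` family above); the servers here are plain dictionaries and every name in the block is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def stripe_round_robin(blocks, n_servers):
    """Place block i on server i % n_servers; each server is a dict {block_no: data}."""
    servers = [dict() for _ in range(n_servers)]
    for i, block in enumerate(blocks):
        servers[i % n_servers][i] = block
    return servers


def read_all(servers):
    """Read every block back using one worker thread per server, then reassemble in order."""
    def fetch(server):
        # Stand-in for the network reads a real client would issue to one block server.
        return list(server.items())

    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        fetched = [item for part in pool.map(fetch, servers) for item in part]
    return b"".join(data for _, data in sorted(fetched))


blocks = [b"aa", b"bb", b"cc", b"dd", b"ee"]
servers = stripe_round_robin(blocks, 3)
assert read_all(servers) == b"aabbccddee"
```

With one thread per server, a sequential read keeps every server busy at once, which is the sense in which client speed scales with the number of servers.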

Features of DPSS (cont.) • Optimized for relatively small number of large files • Several thousand files • Greater than 50 MB • DPSS blocks are available as soon as they are placed in cache • Good for staging large files to/from tertiary storage • Don’t have to wait for large transfer to complete • Dynamically reconfigurable • Add or remove servers or disks on the fly

Features of DPSS (cont.) • Agent-based performance monitoring system • Client library automatically sets TCP buffer size to optimal value • Uses information published by monitoring system • Load balancing • Supports replication of files on multiple servers • DPSS master uses status information stored in LDAP directory to select a replica that will give fastest response
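Replica selection from published status information might look like the sketch below. The record fields (`server`, `up`, `est_response_ms`) are hypothetical stand-ins for whatever the monitoring system stores in the directory; no real LDAP schema or DPSS interface is implied.

```python
def select_replica(replicas):
    """Pick the reachable replica with the lowest estimated response time."""
    candidates = [r for r in replicas if r["up"]]
    if not candidates:
        raise LookupError("no replica is currently available")
    return min(candidates, key=lambda r: r["est_response_ms"])


status = [
    {"server": "dpss-a", "up": True, "est_response_ms": 40},
    {"server": "dpss-b", "up": False, "est_response_ms": 5},
    {"server": "dpss-c", "up": True, "est_response_ms": 12},
]
print(select_replica(status)["server"])  # dpss-c
```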

Hierarchical Storage System • Fast disk cache in front of larger, slower storage • Works on same principle as other hierarchies: • Level-1 and Level-2 caches: minimize off-chip memory accesses • Virtual memory systems: minimize page faults to disk • Goal: • Keep popular material in faster storage • Keep most of material on cheaper, slower storage • Locality: 10% of material gets 90% of accesses
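A disk cache in front of slower storage can be sketched as follows. The slides do not name a replacement policy; least-recently-used (LRU) eviction is assumed here purely for illustration, and the class and attribute names are hypothetical.

```python
from collections import OrderedDict


class DiskCache:
    """Small, fast cache in front of a large, slow backing store (assumed LRU eviction)."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self._backing = backing_store      # e.g. a dict standing in for tape
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, name):
        if name in self._cache:
            self._cache.move_to_end(name)  # mark as most recently used
            self.hits += 1
            return self._cache[name]
        self.misses += 1
        data = self._backing[name]         # slow path: fetch from backing store
        self._cache[name] = data
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used entry
        return data
```

When accesses are skewed as the slide describes (a small fraction of material receiving most requests), a cache much smaller than the backing store still serves most reads.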

Hierarchical Storage System (cont.) • Problem with tertiary storage (especially tape): • Very slow • Tape seek times can be a minute or more…

Data Management GridFTP

Motivation…. • The GridFTP protocol was born out of the realization that the Grid environment needed a fast, secure, efficient, and reliable transport mechanism. • Existing distributed data storage systems • DPSS, HPSS: focus on high-performance access, utilize parallel data transfer, striping • DFS: focus on high-volume usage, dataset replication, local caching • SRB: connects heterogeneous data collections, uniform client interface, metadata queries

Motivation…. (cont.) • Problems • Incompatible (and proprietary) protocols • Each requires a custom client • Partitions available data sets and storage devices • Each protocol has only a subset of the desired functionality

A Common, Secure, Efficient Data Access Protocol • Common, extensible transfer protocol • Common protocol means all can interoperate • Decouple low-level data transfer mechanisms from the storage service • Advantages: • New, specialized storage systems are automatically compatible with existing systems • Existing systems have richer data transfer functionality • Interface to many storage systems • HPSS, DPSS, file systems • Plan for SRB integration

Access/Transport Protocol Requirements • Suite of communication libraries and related tools that support • GSI, Kerberos security • Third-party transfers • Parameter set/negotiate • Partial file access • Reliability/restart • Large file support • Data channel reuse • All based on a standard, widely deployed protocol • Integrated instrumentation • Logging/audit trail • Parallel transfers • Striping (cf. DPSS) • Policy-based access control • Server-side computation • Proxies (firewall, load balancing)

And The Protocol Is … GridFTP • Why FTP? • Ubiquity enables interoperation with many commodity tools • Already supports many desired features, easily extended to support others • Well understood and supported • We use the term GridFTP to refer to • Transfer protocol which meets requirements • Family of tools which implement the protocol • Note GridFTP > FTP • Note that despite name, GridFTP is not restricted to file transfer!

GridFTP: Basic Approach • FTP protocol is defined by several IETF RFCs • Start with most commonly used subset • Standard FTP: get/put etc., 3rd-party transfer • Implement standard but often unused features • GSS binding, extended directory listing, simple restart • Extend in various ways, while preserving interoperability with existing servers • Striped/parallel data channels, partial file, automatic & manual TCP buffer setting, progress monitoring, extended restart

[Figure: 3rd Party Transfer • 1: Computer A sends transfer request to Computer B • 2: B initiates transfer, A disconnects • 3: Computer C receives file data]

The GridFTP Family of Tools • Provide the following features: • Grid Security Infrastructure (GSI) and Kerberos support: • Robust and flexible authentication, integrity, and confidentiality features are critical when transferring or accessing files. • GridFTP supports both GSI and Kerberos authentication, with user controlled setting of various levels of data integrity and/or confidentiality. • Third-party control of data transfer: • In order to manage large data sets for large distributed communities, it is necessary to provide third-party control of transfers between storage servers. • GridFTP provides this capability by adding GSSAPI security to the existing third-party transfer capability defined in the FTP standard.

The GridFTP Family of Tools (cont.) • Parallel data transfer: • On wide-area links, using multiple TCP streams can improve aggregate bandwidth over using a single TCP stream. • This is required both between a single client and a single server, and between two servers. • GridFTP supports parallel data transfer through FTP command extensions and data channel extensions. • Striped data transfer: • Partitioning data across multiple servers can further improve aggregate bandwidth. • GridFTP supports striped data transfers through extensions defined in the Grid Forum draft.
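The idea behind parallel streams can be illustrated by dividing a file into one byte range per stream. This is a conceptual sketch only: it does not reproduce how GridFTP assigns data to its data channels, and the function name is hypothetical.

```python
def split_ranges(file_size: int, n_streams: int):
    """Split [0, file_size) into n_streams contiguous (offset, length) ranges."""
    if n_streams < 1:
        raise ValueError("n_streams must be at least 1")
    base, extra = divmod(file_size, n_streams)
    ranges, offset = [], 0
    for i in range(n_streams):
        length = base + (1 if i < extra else 0)   # spread the remainder over the first streams
        ranges.append((offset, length))
        offset += length
    return ranges


print(split_ranges(10, 3))  # [(0, 4), (4, 3), (7, 3)]
```

Each range could then be sent over its own TCP connection, with the receiver writing each piece at its offset.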

The GridFTP Family of Tools (cont.) • Partial file transfer: • Many applications require the transfer of partial files. • However, standard FTP requires the application to transfer the entire file, or the remainder of a file starting at a particular offset. • GridFTP introduces new FTP commands to support transfers of regions of a file. • Support for reliable data transfer: • Reliable transfer is important for many applications that manage data. • Fault recovery methods for handling transient network failures, server outages, etc., are needed. • The FTP standard includes basic features for restarting failed transfers, but they are not widely implemented. • The GridFTP protocol exploits these features, and substantially extends them.
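Region-based transfer and restart fit together naturally: if the receiver records which byte ranges have arrived, a restart only needs to request the gaps. The sketch below shows that bookkeeping under assumed conventions (half-open `(start, end)` ranges, a hypothetical function name); it is not the GridFTP restart-marker format.

```python
def missing_ranges(file_size: int, received):
    """Given received half-open (start, end) ranges, return the ranges still to fetch."""
    gaps, cursor = [], 0
    for start, end in sorted(received):
        if start > cursor:
            gaps.append((cursor, start))   # hole before this received range
        cursor = max(cursor, end)          # tolerate overlapping ranges
    if cursor < file_size:
        gaps.append((cursor, file_size))   # tail that never arrived
    return gaps


print(missing_ranges(100, [(0, 30), (50, 70)]))  # [(30, 50), (70, 100)]
```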

The GridFTP Family of Tools (cont.) • Manual control of TCP buffer size: • This is a critical parameter for achieving maximum bandwidth with TCP/IP. • The protocol also has support for automatic buffer size tuning, but we have not yet implemented anything in our code. • We are talking with both NCSA and LANL to see if it makes sense to integrate work they are doing in this area into our code. • Integrated Instrumentation: • The protocol calls for restart and performance markers to be sent back. • It is not specified how often, and this is something we intend to address shortly.
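A common starting point for choosing the TCP buffer size is the bandwidth-delay product: the amount of data that must be in flight to keep the link full. The helper below is a minimal sketch; the function name, the units, and the sample link figures are assumptions for illustration.

```python
def tcp_buffer_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes: bandwidth (bit/s) x round-trip time (ms)."""
    return bandwidth_bits_per_s * rtt_ms // 8000   # /1000 for ms, /8 for bits to bytes


# Hypothetical wide-area link: 1 Gbit/s with a 50 ms round-trip time.
print(tcp_buffer_bytes(1_000_000_000, 50))  # 6250000 bytes, about 6 MB
```

A buffer much smaller than this value caps throughput regardless of link capacity, which is why the slide calls the parameter critical.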

Why Did We Need a New Transport Protocol? • The requirement was a transport protocol that met the following criteria: • Targeted at bulk data transport: We saw this as a protocol to move lots of data (100s of megabytes and above). • Based on industry standards: i.e., a clear, well-defined, published, nonproprietary protocol. • Secure: Allowed for authentication, authorization, integrity, and privacy. • Fast and efficient: This meant employing multiple levels of parallelism and minimizing overhead.

Why Did We Need a New Transport Protocol? (cont.) • Robust: The protocol must be able to tolerate system failures gracefully. • Allowed 3rd party transfers: We believe much of the traffic will be generated by automated systems such as schedulers. • Integrated instrumentation: The protocol must provide feedback on operational status so that intelligent actions can be taken during transfers. • Easily Extensible: Both in terms of standards body approval and technically/architecturally/coding wise.

Data Management Replication Management

The Motivation… • Data-intensive, high-performance computing applications require efficient management and transfer of terabytes or petabytes of information in wide-area, distributed computing environments. • Examples of such applications include experimental analyses and simulations in scientific disciplines such as: • high-energy physics • climate modeling • earthquake engineering • astronomy.

The Motivation… (cont.) • In such applications, massive datasets must be shared by a community of hundreds or thousands of researchers distributed worldwide. • These researchers need to transfer large subsets of these datasets to local sites or other remote resources for processing. • They may create local copies or replicas to overcome long wide-area data transfer latencies.

The Motivation… (cont.) • Once multiple copies of files are distributed at multiple locations, researchers need a service: • to locate copies • to determine whether to access an existing copy or create a new one • to meet the performance needs of their applications.
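The "locate copies" service described above amounts to a mapping from a logical file name to the physical locations of its replicas. The class below is a toy sketch of that mapping; it is not the API of any real replica catalog, and the file names and URLs are made up for illustration.

```python
class ReplicaCatalog:
    """Toy catalog mapping a logical file name to the locations of its copies."""

    def __init__(self):
        self._locations = {}

    def register(self, logical_name: str, physical_url: str) -> None:
        """Record that a copy of logical_name exists at physical_url."""
        self._locations.setdefault(logical_name, set()).add(physical_url)

    def locate(self, logical_name: str):
        """Return all known copies, or an empty list if none are registered."""
        return sorted(self._locations.get(logical_name, ()))


catalog = ReplicaCatalog()
catalog.register("run42.dat", "gsiftp://tier0.example.org/data/run42.dat")
catalog.register("run42.dat", "gsiftp://tier1.example.org/cache/run42.dat")
print(catalog.locate("run42.dat"))
```

An empty result from `locate` is the cue for the second decision on the slide: whether to create a new copy.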
