A Service for Data-Intensive Computations on Virtual Clusters

Rainer Schmidt, Christian Sadilek, and Ross King
rainer.schmidt@arcs.ac.at
Executing Preservation Strategies at Scale
Intensive 2009, Valencia, Spain

Planets Project
• “Permanent Long-term Access through NETworked Services”
• Addresses the problem of digital preservation
• Driven by national libraries and archives
• Project instrument: FP6 Integrated Project
• 5th IST Call
• Consortium: 16 organisations from 7 countries
• Duration: 48 months, June 2006 – May 2010
• Budget: 14 million euro
• http://www.planets-project.eu/

Outline
• Digital Preservation
• Need for eScience Repositories
• The Planets Preservation Environment
• Distributed Data and Metadata Integration
• Data-Intensive Computations
• Grid Service for Job Execution on AWS
• Experimental Results
• Conclusions

Drivers for Digital Preservation
• Exponential growth of digital data across all sectors of society
• Production of large quantities of often irreplaceable information
• Produced by academic and other research
• Data becomes significantly more complex and diverse
• Digital information is fragile
• Ensure long-term access and interpretability
• Requires data curation and preservation
• Development and assessment of preservation strategies
• Need for integrated and largely automated environments

Repository Systems and eScience
• Digital repositories and libraries are designed for the creation, discovery, and publication of digital collections
• Systems are often heterogeneous, closed, and limited in scale
• Data grid middleware focuses on the controlled sharing and management of large data sets in a distributed environment
• Limited in metadata management and preservation functionality
• Integrate repositories into a large-scale environment
• Distributed data management capabilities
• Integration of policies, tools, and workflows
• Integration with computational facilities

The Planets Environment
• A collaborative research infrastructure for the systematic development and evaluation of preservation strategies
• A decision-support system for the development of preservation plans
• An integrated environment for dynamically providing, discovering, and accessing a wide range of preservation tools, services, and technical registries
• A workflow environment for data integration, process execution, and the maintenance of preservation information

Service Gateway Architecture
[Figure: layered architecture diagram, top to bottom]
• User Applications: Administration UI, Preservation Planning (Plato), Experimentation Testbed Application, Workflow Execution UI
• Portal Services: Workflow Execution and Monitoring, Experiment Data and Metadata Repository, Service and Tool Registry, Authentication and Authorization, Notification and Logging System
• Application Execution and Data Services: Application Services, Execution Services, Data Storage Services
• Physical Resources, Computers, Networks

Data and Metadata Integration
• Support distributed and remote storage facilities
• Data registry and metadata repository
• Integrate different data sources and encoding standards
• Currently a range of interfaces/protocols (OAI-PMH, SRU) and encoding standards (DC, METS) are supported
• Development of a common digital object model
• Recording of provenance information, preservation, integrity, and descriptive metadata
• Dynamic mapping of objects from different digital libraries (TEL, memory institutions, research collections)
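As a rough illustration of what a common digital object model with dynamic mapping could look like, the sketch below maps a flat Dublin Core style record onto a single object type. The class name `DigitalObject`, its fields, and the helper `from_dublin_core` are hypothetical and are not taken from the Planets code base.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DigitalObject:
    """Hypothetical common object model; field names are illustrative only."""
    identifier: str
    content_uri: str                       # where the payload is stored
    descriptive: Dict[str, str] = field(default_factory=dict)
    provenance: List[str] = field(default_factory=list)
    checksum: str = ""                     # integrity metadata, if known


def from_dublin_core(record: Dict[str, str], content_uri: str) -> DigitalObject:
    """Map a flat Dublin Core style record onto the common model."""
    obj = DigitalObject(
        # Fall back to the content URI when the record carries no identifier.
        identifier=record.get("identifier", content_uri),
        content_uri=content_uri,
        descriptive={k: v for k, v in record.items() if k != "identifier"},
    )
    # Record where the object came from as a provenance event.
    obj.provenance.append("imported from Dublin Core record")
    return obj
```

A METS or SRU source would get its own mapping function that returns the same object type, so downstream services only see one model.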

Data and Metadata Registry
[Figure: component diagram showing Portal Application, Content Repository (two instances), PUID, Data Catalogue, Application Service, Experiment Data, Metadata Repository, Temp Storage, and DigObj (two instances)]

Data Intensive Applications
• Development of Planets Job Submission Services
• Allow job submission to a PC cluster (e.g. Hadoop, Condor)
• Standard grid protocols/interfaces (SOAP, HPC-BP, JSDL)
• Demand for massive compute and storage resources
• Instantiate cluster on top of (leased) cloud resources based on AWS EC2 + S3 (or alternatively in-house in a PC lab)
• Allows computation to be moved to data or vice versa
• On-demand cluster based on virtual machine images
• Many preservation tools are, or rely on, third-party applications
• Software can be preinstalled on virtual cluster nodes
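To make the JSDL part of the job submission interface more concrete, here is a minimal sketch that builds a JSDL job description with Python's standard library. It uses the JSDL POSIX application extension for simplicity; the HPC Basic Profile defines its own application element, and the job descriptions accepted by the Planets service may look different. The job name, executable path, arguments, and S3 URI are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces of JSDL 1.0 and its POSIX application extension.
JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
POSIX = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"


def build_jsdl(job_name, executable, arguments, input_uri):
    """Return a JSDL job description as an XML string."""
    ET.register_namespace("jsdl", JSDL)
    ET.register_namespace("jsdl-posix", POSIX)

    root = ET.Element(f"{{{JSDL}}}JobDefinition")
    desc = ET.SubElement(root, f"{{{JSDL}}}JobDescription")

    ident = ET.SubElement(desc, f"{{{JSDL}}}JobIdentification")
    ET.SubElement(ident, f"{{{JSDL}}}JobName").text = job_name

    # The tool to run on each node, e.g. a preinstalled command line tool.
    app = ET.SubElement(desc, f"{{{JSDL}}}Application")
    posix = ET.SubElement(app, f"{{{POSIX}}}POSIXApplication")
    ET.SubElement(posix, f"{{{POSIX}}}Executable").text = executable
    for arg in arguments:
        ET.SubElement(posix, f"{{{POSIX}}}Argument").text = arg

    # Where the input data is staged from (hypothetical storage location).
    staging = ET.SubElement(desc, f"{{{JSDL}}}DataStaging")
    ET.SubElement(staging, f"{{{JSDL}}}FileName").text = "input"
    ET.SubElement(staging, f"{{{JSDL}}}CreationFlag").text = "overwrite"
    source = ET.SubElement(staging, f"{{{JSDL}}}Source")
    ET.SubElement(source, f"{{{JSDL}}}URI").text = input_uri

    return ET.tostring(root, encoding="unicode")
```

Calling `build_jsdl("migrate-batch", "/usr/bin/convert", ["in.tif", "out.png"], "s3://example-bucket/input/")` yields a document that a client could send to the service in a SOAP request.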

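Experimental Architecture
[Figure: architecture diagram. The Workflow Environment submits an HPCBP job (JSDL job description file) to the JSS, which runs on a Virtual Cluster and File System (Apache Hadoop) made of Virtual Nodes (Xen) on the Cloud Infrastructure (EC2). Digital Objects are held in the Storage Infrastructure (S3) and accessed through the Data Service.]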

A Light-Weight Job Execution Service
[Figure: component diagram of the WS-Container: TLS/WS Security, BES/HPCBP API, Account Manager, JSDL Parser, Session Handler, Data Resolver, Input Generator, Exec. Manager, Job Manager]

A Light-Weight Job Execution Service (continued)
[Figure: same component diagram with back ends added: Data Resolver with S3, Input Generator with HDFS, Job Manager with Hadoop]

The MapReduce Application
• MapReduce provides a framework and programming model for processing large documents (sorting, searching, indexing) on multiple nodes
• Automated decomposition (split)
• Mapping to intermediary pairs (map), optionally (combine)
• Merge output (reduce)
• Provides an implementation for data-parallel and I/O-intensive applications
• Migrating a digital object (e.g. web page, folder, archive)
• Decompose into atomic pieces (e.g. file, image, movie)
• On each node, process input splits
• Merge pieces and create new data collections
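The split, map, and reduce steps for migrating a digital object can be sketched in plain Python as follows. This is a single-process illustration of the programming model, not Hadoop code: the function names are hypothetical, and the "migration" only swaps a file extension where a real job would invoke a preinstalled command line tool on each node.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split(collection: List[str], n_splits: int) -> List[List[str]]:
    """Decompose a collection of file names into roughly equal input splits."""
    return [collection[i::n_splits] for i in range(n_splits)]


def map_migrate(file_name: str) -> Tuple[str, str]:
    """Map step: emit an intermediary (target format, migrated file) pair."""
    stem = file_name.rsplit(".", 1)[0]
    return ("png", stem + ".png")  # placeholder for a real conversion tool


def reduce_collect(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Reduce step: merge migrated pieces into one collection per format."""
    merged = defaultdict(list)
    for key, value in pairs:
        merged[key].append(value)
    return {key: sorted(values) for key, values in merged.items()}


def run(collection: List[str], n_splits: int = 2) -> Dict[str, List[str]]:
    intermediate = []
    # On a cluster, each input split would be processed on its own node.
    for input_split in split(collection, n_splits):
        intermediate.extend(map_migrate(name) for name in input_split)
    return reduce_collect(intermediate)
```

For example, `run(["a.tif", "b.tif", "c.tif"])` returns `{"png": ["a.png", "b.png", "c.png"]}`.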

Experimental Setup
• Amazon Elastic Compute Cloud (EC2)
• 1 – 150 cluster nodes
• Custom image based on Fedora 8 i386
• Amazon Simple Storage Service (S3)
• Max. 1 TB I/O per experiment
• All measurements based on a single core per virtual machine
• Apache Hadoop (v.0.18)
• MapReduce implementation
• Preinstalled command line tools
• Quantitative evaluation of VMs (AWS) based on execution time, number of tasks, number of nodes, physical size

Experimental Results 1 – Scaling Job Size
[Figure: plot with three annotated series; number of nodes = 5]
• x(1k) = t_seq / t_parallel, with tasks = 1000
• Series annotations: x(1k) = 3.6, x(1k) = 4.4, x(1k) = 3.5

Experimental Results 2 – Scaling #nodes
[Figure: plot with annotated data points]
• n=1 (local), t=26
• n=1, t=36, s1=0.72, e=72%
• n=5, t=4.5, s=3.3, e=66%
• n=10, t=4.5, s=5.5, e=55%
• n=50, t=1.68, s=16, e=31%
• n=100, t=1.03, s=26, e=26%
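The annotations above can be related to each other with the speedup definition from the previous slide (t_seq / t_parallel). The sketch below assumes that n is the number of nodes, t the execution time, s the speedup, and e the efficiency, that the local run (t=26) is the sequential baseline, and that efficiency is speedup divided by the number of nodes. The slides do not state these conventions or the time unit explicitly.

```python
def speedup(t_seq: float, t_parallel: float) -> float:
    """Speedup as defined on the slide: t_seq / t_parallel."""
    return t_seq / t_parallel


def efficiency(t_seq: float, t_parallel: float, nodes: int) -> float:
    """Assumed definition: speedup divided by the number of nodes."""
    return speedup(t_seq, t_parallel) / nodes


T_LOCAL = 26.0  # assumed sequential baseline: the local n=1 run

for nodes, t in [(1, 36.0), (50, 1.68)]:
    s = speedup(T_LOCAL, t)
    e = efficiency(T_LOCAL, t, nodes)
    print(f"n={nodes}: s={s:.2f}, e={e:.0%}")
```

Under these assumptions the single cloud node gives s of about 0.72 and e of about 72%, and 50 nodes give s of about 15.5 and e of about 31%, in line with the annotated values for those two points.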

Results and Observations
• AWS + Hadoop + JSS
• Robust and fault tolerant, scalable up to large numbers of nodes
• ~32.5 MB/s download / ~13.8 MB/s upload (cloud internally)
• Small clusters of virtual machines are also reasonable
• S = 4.4 for p = 5, with n = 1000 and s = 7.5 MB
• Ideally, size of cluster grows/shrinks on demand
• Overheads:
• SLE vs. Cloud (p = 1, n = 1000): 30% due to file system master
• On average, less than 10% due to S3
• Small overheads due to coordination, latencies, pre-processing

Conclusion
• Challenges in digital preservation are diversity and scale
• Focus: scientific data, arts and humanities
• Repositories and preservation systems need to be embedded in large-scale distributed environments
• Grids, clouds, eScience environments
• Cloud computing provides a powerful solution for getting on-demand access to an IT infrastructure
• Current integration issues: policies, legal aspects, SLAs
• We have presented current developments in the context of the EU project Planets on building an integrated research environment for the development and evaluation of preservation strategies

Fin
