Caitriana Nicholson
University of Glasgow

Dynamic Data Replication in LCG 2008


Outline

  • Introduction

  • Grid Replica Optimisation

  • The OptorSim grid simulator

  • OptorSim architecture

  • Experimental setup

  • Results

  • Conclusions


Introduction

  • Large Hadron Collider (LHC) at CERN will have raw data rate of ~15 PB/year

  • LHC Computing Grid (LCG) for data storage and computing infrastructure

  • 2008 will be first full year of LHC running

  • Actual analysis behaviour still unknown

    → use simulation to investigate behaviour

    → investigate dynamic data replication


Grid Replica Optimisation

  • Many variables determine overall grid performance

    • Impossible to reach one optimal solution!

  • Possible to optimise variables which are part of grid middleware

    • Job scheduling, data management, etc.

  • This talk considers data management only…

    …and dynamic replica optimisation in particular


Dynamic Replica Optimisation

= optimisation of the placement of file replicas on grid sites…

…in a dynamic, automated fashion


Design of a Replica Optimisation Service

  • Centralised, hierarchical or distributed?

  • Pull or push?

  • Choosing a replication trigger (see the sketch after this list)

    • On file request?

    • On file popularity?

  • Aim to achieve global optimisation as a result of local optimisation
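
As an illustration of the trigger choices above, here is a minimal Python sketch (not OptorSim code) of a pull-style, popularity-based trigger: each site's optimiser counts requests per file and replicates locally once a file becomes popular enough. The names (`ReplicaOptimiser`, `POPULARITY_THRESHOLD`) and the threshold value are hypothetical.

```python
from collections import Counter

POPULARITY_THRESHOLD = 3  # hypothetical tuning parameter

class ReplicaOptimiser:
    """Per-site optimiser: pull-style, popularity-triggered replication."""

    def __init__(self, site_name, local_files):
        self.site = site_name
        self.local_files = set(local_files)   # logical file names held locally
        self.request_counts = Counter()       # recent popularity per file

    def on_file_request(self, lfn):
        """Called for every local job's file request (the 'on file request' hook)."""
        self.request_counts[lfn] += 1
        if lfn in self.local_files:
            return "local"
        # 'On file popularity' trigger: replicate only once the file is hot enough.
        if self.request_counts[lfn] >= POPULARITY_THRESHOLD:
            self.replicate(lfn)
            return "replicated"
        return "remote-read"

    def replicate(self, lfn):
        # A real optimiser would ask the Replica Manager to copy the cheapest
        # remote replica; here we only record the decision.
        self.local_files.add(lfn)

# Local decisions only; the aim stated on the slide is that many such local
# optimisers approximate a globally good replica distribution.
opt = ReplicaOptimiser("Tier2-A", local_files=["file001"])
for lfn in ["file002", "file002", "file001", "file002"]:
    print(lfn, "->", opt.on_file_request(lfn))
```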


OptorSim

  • OptorSim is a grid simulator with a focus on data management

  • Developed as part of European DataGrid Work Package 2

  • Based on EDG architecture

  • Used to examine automated decisions about replica placement and deletion

    http://edg-wp2.web.cern.ch/edg-wp2/optimization/optorsim.html


Architecture

  • Sites with Computing Element (CE) and/or Storage Element (SE)

  • Replica Optimiser decides replications for its site

  • Resource Broker schedules jobs

  • Replica Catalogue maps logical to physical filenames (see the sketch after this list)

  • Replica Manager controls and registers replications
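
A minimal sketch of the Replica Catalogue role described above, assuming a simple in-memory mapping from logical file names (LFNs) to physical file names (PFNs). The class and method names are hypothetical, not the EDG/OptorSim API.

```python
class ReplicaCatalogue:
    """Maps each logical file name (LFN) to its physical copies (PFNs) on grid sites."""

    def __init__(self):
        self._replicas = {}  # lfn -> set of "site:/path" strings

    def register_replica(self, lfn, pfn):
        # Called by the Replica Manager after a successful copy.
        self._replicas.setdefault(lfn, set()).add(pfn)

    def unregister_replica(self, lfn, pfn):
        # Called when an optimiser decides to delete a replica.
        self._replicas.get(lfn, set()).discard(pfn)

    def lookup(self, lfn):
        """Return all known physical locations of a logical file."""
        return sorted(self._replicas.get(lfn, set()))

rc = ReplicaCatalogue()
rc.register_replica("aod_0001", "CERN:/store/aod_0001")
rc.register_replica("aod_0001", "Tier2-A:/store/aod_0001")
print(rc.lookup("aod_0001"))
```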


Algorithms

  • Job scheduling

    • Details not covered in this talk

    • “QueueAccessCost” scheduler used in these results

  • Data replication (strategies sketched after this list)

    • No replication

    • Simple replication: "always replicate, delete existing files if necessary"

      • Least Recently Used (LRU)

      • Least Frequently Used (LFU)

    • Economic model: “replicate only if profitable”

      • Sites “buy” and “sell” files using auction mechanism

      • Files deleted if less valuable than new file
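
A minimal sketch of the two simple strategies, assuming a Storage Element with a fixed number of 2 GB file slots; the class and `choose_victim` helper are hypothetical. The economic model is not reproduced here: it would replace the victim choice with a comparison of the new file's value against the predicted value of the existing files, settled through the auction mechanism.

```python
import time
from collections import defaultdict

class StorageElement:
    """'Always replicate, delete existing files if necessary' with LRU or LFU eviction."""

    def __init__(self, capacity_files, policy="LRU"):
        self.capacity = capacity_files
        self.policy = policy
        self.last_access = {}                  # lfn -> timestamp (for LRU)
        self.access_count = defaultdict(int)   # lfn -> hits (for LFU)

    def access(self, lfn):
        if lfn not in self.last_access and len(self.last_access) >= self.capacity:
            victim = self.choose_victim()
            del self.last_access[victim]       # delete an existing file to make room
            del self.access_count[victim]
        self.last_access[lfn] = time.monotonic()
        self.access_count[lfn] += 1

    def choose_victim(self):
        if self.policy == "LRU":
            # Least Recently Used: oldest access time goes first.
            return min(self.last_access, key=self.last_access.get)
        # Least Frequently Used: fewest accesses goes first.
        return min(self.access_count, key=self.access_count.get)

se = StorageElement(capacity_files=2, policy="LFU")
for lfn in ["a", "b", "a", "c"]:   # "c" forces eviction of the less-used "b"
    se.access(lfn)
print(sorted(se.last_access))      # ['a', 'c']
```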


Experimental Setup - Jobs & Files

  • Job types based on computing models

  • “Dataset” for each experiment

    ~1 year’s AOD (analysis data)

  • 2GB files

  • Placed at CERN and Tier-1s at start


Experimental Setup - Storage Resources

  • CERN & Tier 1 site capacities from LCG Technical Design Report

  • “Canonical” Tier 2 capacity of 197 TB each (18.8 PB / 95 sites)

  • Define storage metric D = (average SE size) / (total dataset size); a worked example appears after this list

  • Memory limitations → scale down Tier 2 SE sizes to 500 GB

    • Allows file deletion to start quickly

    • Disadvantage of small D
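
A quick worked example of the storage metric D defined above. The 500 GB figure is the scaled-down Tier 2 SE size from this slide; the 50 TB total dataset size is a made-up number for illustration only and is not taken from the talk.

```python
def storage_metric_D(average_se_size_gb, total_dataset_size_gb):
    """D = (average SE size) / (total dataset size), as defined on the slide above."""
    return average_se_size_gb / total_dataset_size_gb

# Illustration only: small D means SEs fill up quickly, so file deletion starts early.
print(storage_metric_D(500, 50_000))   # 0.01
```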


Experimental Setup - Computing & Network

  • Most (chaotic) analysis jobs run at Tier 2s

    • Tier 1s not given CE, except those running LHCb jobs

    • CERN Analysis Facility with CE of 7840 kSI2k

    • Tier 2s with averaged CE of 645 kSI2k each (61.3 MSI2k / 95 sites)

  • Network based on NREN topologies

    • Sites connected to closest router

    • Default of 155 Mbps if published value not available
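
This is not OptorSim's actual configuration format (which is not shown in the talk), but a small Python encoding of the resources described on these slides, purely to make the setup concrete; the structure and names are illustrative.

```python
# Illustrative encoding of the simulated resources described on these slides.
DEFAULT_BANDWIDTH_MBPS = 155          # used when no published NREN value is available

cern_af = {"name": "CERN-AF", "ce_kSI2k": 7840}     # CERN Analysis Facility CE
tier2_template = {"ce_kSI2k": 645,    # 61.3 MSI2k averaged over 95 Tier 2 sites
                  "se_gb": 500}       # scaled-down SE size used in the simulation

def link_bandwidth(published_mbps=None):
    """Fall back to the 155 Mbps default when a site's published value is unknown."""
    return published_mbps if published_mbps is not None else DEFAULT_BANDWIDTH_MBPS

print(link_bandwidth())        # 155
print(link_bandwidth(1000))    # 1000
```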



Parameters

  • Job scheduler “QueueAccessCost”

    • Combines data location and queue information (see the sketch after this list)

  • Sequential access pattern

  • 1000 jobs per simulation

  • Site policies set according to LCG Memorandum of Understanding
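
The exact form of OptorSim's "QueueAccessCost" scheduler is not given in the talk; the sketch below only illustrates the stated idea of combining data-location cost with queue information when ranking candidate sites. All weights and helper names are hypothetical.

```python
def access_cost(site, required_files, replica_locations, transfer_cost=1.0):
    """Count how many required files would have to be fetched over the network."""
    missing = [f for f in required_files if site not in replica_locations.get(f, set())]
    return transfer_cost * len(missing)

def queue_cost(site, queue_lengths):
    """Use the current queue length as a proxy for waiting time."""
    return queue_lengths.get(site, 0)

def schedule(job_files, sites, replica_locations, queue_lengths, w_queue=1.0):
    """Pick the site minimising combined data-access and queue cost (hypothetical weighting)."""
    def cost(site):
        return access_cost(site, job_files, replica_locations) + w_queue * queue_cost(site, queue_lengths)
    return min(sites, key=cost)

sites = ["Tier2-A", "Tier2-B"]
replicas = {"aod_0001": {"Tier2-A"}, "aod_0002": {"Tier2-B"}}
queues = {"Tier2-A": 5, "Tier2-B": 0}
print(schedule(["aod_0001", "aod_0002"], sites, replicas, queues))  # Tier2-B
```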


Evaluation Metrics

  • Different grid users will have different criteria of evaluation

  • Those used in these summary results are:

    • Mean job time

      • Average time taken for job to run, from scheduling to completion

    • Effective Network Usage (ENU)

      • ENU = (file requests which use network resources) / (total number of file requests)
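
A minimal sketch of the ENU calculation as defined above. Counting replications alongside remote reads as "requests which use network resources" is an assumption about the slide's wording; function and argument names are illustrative.

```python
def effective_network_usage(remote_reads, replications, local_reads):
    """ENU = (file requests which used network resources) / (total number of file requests)."""
    network_requests = remote_reads + replications
    total_requests = network_requests + local_reads
    return network_requests / total_requests

# Illustration: 300 network requests out of 1000 total gives ENU = 0.3.
print(effective_network_usage(remote_reads=200, replications=100, local_reads=700))
```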


Results: Data Replication

  • Performance of algorithms measured with varying D

  • D varied by reducing dataset size

  • 20-25% gain in mean job time as D approaches realistic value


Results: Data Replication

  • ENU shows similar gain

  • Allows clearer distinction between strategies


Results: Data Replication

  • Number of jobs increased to 4000

  • Mean job time increases linearly

  • Relative improvement as D increases will hold for higher numbers of jobs

  • Realistic number of jobs is >O(10000)


Results: Site Policies

  • Vary site policies:

    • All Job Types

      • Sites accept jobs from any VO

    • One Job Type

      • Sites accept jobs from one VO

    • Mixed

      • Default configuration

  • All Job Types is ~60% faster than One Job Type


Results: Site Policies

  • All Job Types also gives ~25% lower ENU than the other policies

  • Egalitarian approach benefits all grid users


Results: Access Patterns

  • Sequential access likely for many physics applications

  • Zipf-like access will also occur (see the sketch after this list)

    • Some files accessed frequently, many infrequently

  • Replication gives performance gain of ~75% when Zipf access pattern used
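
A minimal sketch of the two access patterns compared here: sequential access walks the file set in order, while Zipf-like access hits a few files very often and most files rarely. The Zipf exponent of 1.0 is an assumption; the talk does not state the value used.

```python
import random

def sequential_accesses(files, n_requests):
    """Walk the file list in order, wrapping around."""
    return [files[i % len(files)] for i in range(n_requests)]

def zipf_accesses(files, n_requests, exponent=1.0, seed=42):
    """Rank-based Zipf-like sampling: the file at rank r is drawn with weight 1 / r**exponent."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, len(files) + 1)]
    return rng.choices(files, weights=weights, k=n_requests)

files = [f"aod_{i:04d}" for i in range(100)]
zipf = zipf_accesses(files, n_requests=1000)
# The top-ranked files dominate, which is what makes replication pay off.
print(zipf.count("aod_0000"), zipf.count("aod_0099"))
```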


Results: Access Patterns

  • ENU also ~75% lower with Zipf access

  • Any Zipf-like element makes replication highly desirable

  • Size of efficiency gain depends on streaming model, etc


Conclusions

  • OptorSim used to simulate LCG in 2008

  • Dynamic data replication reduces running time of simulated grid jobs:

    • 20% reduction with sequential access

    • 75% reduction with Zipf-like access

    • Similar reductions in network usage

  • Little difference between replication strategies

    • Simpler LRU, LFU 20-30% faster than economic model

  • Site policy which allows all experiments to share resources gives most effective grid use




Replica Optimiser Architecture

  • Access Mediator (AM) - contacts replica optimisers to locate the cheapest copies of files and makes them available locally

  • Storage Broker (SB) - manages files stored in SE, trying to maximise profit for the finite amount of storage space available

  • P2P Mediator (P2PM) - establishes and maintains P2P communication between grid sites
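
A minimal sketch of how the three components above could fit together, with Python stand-ins for their interfaces. The method names are hypothetical, and the auction and profit logic of the Storage Broker is reduced to a fixed-price placeholder.

```python
class P2PMediator:
    """Establishes and maintains peer-to-peer links between site optimisers."""
    def __init__(self):
        self.peers = {}                       # site name -> that site's components

    def register(self, site, components):
        self.peers[site] = components

    def query_price(self, lfn, asking_site):
        # Broadcast a request and collect quotes from every other site's Storage Broker.
        return {site: comps["sb"].quote(lfn)
                for site, comps in self.peers.items()
                if site != asking_site and comps["sb"].has(lfn)}

class StorageBroker:
    """Manages the files in one SE, trying to maximise profit from its finite space."""
    def __init__(self, files, price=1.0):
        self.files, self.price = set(files), price

    def has(self, lfn): return lfn in self.files
    def quote(self, lfn): return self.price   # placeholder for the auction-based valuation

class AccessMediator:
    """Finds the cheapest remote copy of a file and makes it available locally."""
    def __init__(self, site, p2pm):
        self.site, self.p2pm = site, p2pm

    def get_file(self, lfn):
        quotes = self.p2pm.query_price(lfn, asking_site=self.site)
        cheapest_site = min(quotes, key=quotes.get)
        return f"copy {lfn} from {cheapest_site} to {self.site}"

p2pm = P2PMediator()
p2pm.register("CERN", {"sb": StorageBroker(["aod_0001"], price=2.0)})
p2pm.register("Tier2-A", {"sb": StorageBroker(["aod_0001"], price=1.0)})
p2pm.register("Tier2-B", {"sb": StorageBroker([])})
am = AccessMediator("Tier2-B", p2pm)
print(am.get_file("aod_0001"))   # copy aod_0001 from Tier2-A to Tier2-B
```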

