Caitriana Nicholson University of Glasgow


Dynamic Data Replication in LCG 2008

Outline
  • Introduction
  • Grid Replica Optimisation
  • The OptorSim grid simulator
  • OptorSim architecture
  • Experimental setup
  • Results
  • Conclusions
Introduction
  • Large Hadron Collider (LHC) at CERN will have raw data rate of ~15 PB/year
  • LHC Computing Grid (LCG) for data storage and computing infrastructure
  • 2008 will be first full year of LHC running
  • Actual analysis behaviour still unknown

→ use simulation to investigate behaviour

→ investigate dynamic data replication

Grid Replica Optimisation
  • Many variables determine overall grid performance
    • Impossible to reach one optimal solution!
  • Possible to optimise variables which are part of grid middleware
    • Job scheduling, data management etc
  • This talk considers data management only…

…and dynamic replica optimisation in particular

Dynamic Replica Optimisation

= optimisation of the placement of file replicas on grid sites…

…in a dynamic, automated fashion

Design of a Replica Optimisation Service
  • Centralised, hierarchical or distributed?
  • Pull or push?
  • Choosing a replication trigger
    • On file request?
    • On file popularity?
  • Aim to achieve global optimisation as a result of local optimisation
The OptorSim grid simulator
  • OptorSim is a grid simulator with a focus on data management
  • Developed as part of European DataGrid Work Package 2
  • Based on EDG architecture
  • Used to examine automated decisions about replica placement and deletion

OptorSim architecture
  • Sites with Computing Element (CE) and/or Storage Element (SE)
  • Replica Optimiser decides replications for its site
  • Resource Broker schedules jobs
  • Replica Catalogue maps logical to physical filenames
  • Replica Manager controls and registers replications
  • Job scheduling
    • Details not covered in this talk
    • “QueueAccessCost” scheduler used in these results
  • Data replication
    • No replication
    • Simple replication: “always replicate, delete existing files if necessary”
      • Least Recently Used (LRU)
      • Least Frequently Used (LFU)
    • Economic model: “replicate only if profitable”
      • Sites “buy” and “sell” files using auction mechanism
      • Files deleted if less valuable than new file
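
The deletion rules listed above can be illustrated with a minimal sketch. It is not OptorSim's code: the `StoredFile` record, the function names and the numeric file "values" are assumptions made for illustration, and the auction mechanism that would produce those values is not modelled.

```python
from dataclasses import dataclass


@dataclass
class StoredFile:
    """Hypothetical record of a file held on a Storage Element."""
    name: str
    last_access: float  # time of the most recent access
    access_count: int   # number of accesses seen so far


def lru_victim(files):
    """Least Recently Used: delete the file whose last access is oldest."""
    return min(files, key=lambda f: f.last_access)


def lfu_victim(files):
    """Least Frequently Used: delete the file with the fewest accesses."""
    return min(files, key=lambda f: f.access_count)


def economic_victim(file_values, new_file_value):
    """Economic model: give up the least valuable stored file only if the
    new file is worth more; otherwise return None (do not replicate).
    The values are assumed inputs, not derived from an auction here."""
    name, value = min(file_values.items(), key=lambda item: item[1])
    return name if value < new_file_value else None


files = [
    StoredFile("a", last_access=10.0, access_count=5),
    StoredFile("b", last_access=30.0, access_count=1),
    StoredFile("c", last_access=20.0, access_count=3),
]
print(lru_victim(files).name)  # a
print(lfu_victim(files).name)  # b
print(economic_victim({"a": 3.0, "b": 1.0}, new_file_value=2.0))  # b
```

The two simple strategies always find a victim, matching "always replicate", while the economic variant can decline to replicate at all.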
Experimental Setup - Jobs & Files
  • Job types based on computing models
  • “Dataset” for each experiment: ~1 year’s AOD (analysis data)
  • 2 GB files
  • Placed at CERN and Tier-1s at start
Experimental Setup - Storage Resources
  • CERN & Tier 1 site capacities from LCG Technical Design Report
  • “Canonical” Tier 2 capacity of 197 TB each (18.8 PB / 95 sites)
  • Define storage metric D = (average SE size) / (total dataset size)

  • Memory limitations -> scale down Tier 2 SE sizes to 500 GB
    • Allows file deletion to start quickly
    • Disadvantage of small D
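
As a rough worked example of the storage metric D, the sketch below computes the canonical Tier 2 capacity and a value of D. The 100 TB dataset size is a made-up figure for illustration only, and 1 PB is assumed to equal 1000 TB.

```python
def storage_metric(se_sizes_tb, dataset_size_tb):
    """D = (average SE size) / (total dataset size), both in TB."""
    average_se_size = sum(se_sizes_tb) / len(se_sizes_tb)
    return average_se_size / dataset_size_tb


# Canonical Tier 2 capacity: 18.8 PB shared over 95 sites,
# assuming 1 PB = 1000 TB (the slide quotes 197 TB per site).
canonical_tier2_tb = 18.8 * 1000 / 95
print(round(canonical_tier2_tb, 1))  # 197.9

# Scaled-down case: 95 Tier 2 SEs of 500 GB (0.5 TB) each and a
# hypothetical 100 TB dataset (this dataset size is NOT from the talk).
d_small = storage_metric([0.5] * 95, dataset_size_tb=100.0)
print(d_small)  # 0.005
```

Shrinking the dataset while keeping the SE sizes fixed raises D, which is one way to move towards a more realistic ratio without increasing simulated storage.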
Experimental Setup - Computing & Network
  • Most (chaotic) analysis jobs run at Tier 2s
    • Tier 1s not given CE, except those running LHCb jobs
    • CERN Analysis Facility with CE of 7840 kSI2k
    • Tier 2s with averaged CE of 645 kSI2k each (61.3 MSI2k / 95 sites)
  • Network based on NREN topologies
    • Sites connected to closest router
    • Default of 155 Mbps if published value not available
  • Job scheduler “QueueAccessCost”
    • Combines data location and queue information
  • Sequential access pattern
  • 1000 jobs per simulation
  • Site policies set according to LCG Memorandum of Understanding
Evaluation Metrics
  • Different grid users will have different criteria of evaluation
  • These summary results use:
    • Mean job time
      • Average time taken for job to run, from scheduling to completion
    • Effective Network Usage (ENU)
      • ENU = (file requests which use network resources) / (total number of file requests)
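
Both metrics are simple ratios, and a short sketch may make the definitions concrete. The function names and the sample numbers are assumptions for illustration; they are not taken from OptorSim or from the results in this talk.

```python
def mean_job_time(jobs):
    """Average time from scheduling to completion.

    jobs: list of (scheduled_at, completed_at) pairs in seconds.
    """
    return sum(done - start for start, done in jobs) / len(jobs)


def effective_network_usage(network_requests, total_requests):
    """ENU = file requests that use network resources / all file requests.

    A lower ENU means more requests were served without using the network.
    """
    return network_requests / total_requests


# Made-up sample data: three jobs lasting 100 s, 120 s and 80 s.
jobs = [(0, 100), (10, 130), (20, 100)]
print(mean_job_time(jobs))  # 100.0

# Made-up counts: 30 of 120 file requests needed the network.
print(effective_network_usage(30, 120))  # 0.25
```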

Results: Data Replication
  • Performance of algorithms measured with varying D
  • D varied by reducing dataset size
  • 20-25% gain in mean job time as D approaches realistic value

Results: Data Replication

  • ENU shows similar gain
  • Allows clearer distinction between strategies
Results: Data Replication
  • Number of jobs increased to 4000
  • Mean job time increases linearly
  • Relative improvement as D increases will hold for higher numbers of jobs
  • Realistic number of jobs is >O(10000)
Results: Site Policies
  • Vary site policies:
    • All Job Types
      • Sites accept jobs from any VO
    • One Job Type
      • Sites accept jobs from one VO
    • Mixed
      • default
  • All Job Types is ~60% faster than One Job Type

Results: Site Policies

  • All Job Types also gives ~25% lower ENU than other policies
  • Egalitarian approach benefits all grid users
Results: Access Patterns
  • Sequential access likely for many physics applications
  • Zipf-like access will also occur
    • Some files accessed frequently, many infrequently
  • Replication gives performance gain of ~75% when Zipf access pattern used
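
The difference between the two access patterns can be sketched as follows. This is an illustrative generator rather than OptorSim's implementation: the function names, the exponent `alpha`, the fixed seed and the wrap-around in the sequential case are all assumptions.

```python
import random
from collections import Counter


def sequential_access(file_ids, n_requests):
    """Request files in order, wrapping around at the end (assumed)."""
    return [file_ids[i % len(file_ids)] for i in range(n_requests)]


def zipf_access(file_ids, n_requests, alpha=1.0, seed=0):
    """Zipf-like requests: the file at rank r is chosen with probability
    proportional to 1 / r**alpha, so a few files dominate."""
    rng = random.Random(seed)
    weights = [1.0 / rank ** alpha for rank in range(1, len(file_ids) + 1)]
    return rng.choices(file_ids, weights=weights, k=n_requests)


file_ids = list(range(100))
seq_counts = Counter(sequential_access(file_ids, 10_000))
zipf_counts = Counter(zipf_access(file_ids, 10_000))

# Sequential access touches every file equally often; with Zipf-like
# access the top-ranked file is requested far more than the last one.
print(seq_counts[0], seq_counts[99])
print(zipf_counts[0], zipf_counts[99])
```

With a skewed pattern like this, a small set of popular files accounts for a large share of requests, which is the situation where keeping local replicas is most likely to pay off.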

Results: Access Patterns

  • ENU also ~75% lower with Zipf access
  • Any Zipf-like element makes replication highly desirable
  • Size of efficiency gain depends on streaming model, etc
Conclusions
  • OptorSim used to simulate LCG in 2008
  • Dynamic data replication reduces running time of simulated grid jobs:
    • 20% reduction with sequential access
    • 75% reduction with Zipf-like access
    • Similar reductions in network usage
  • Little difference between replication strategies
    • Simpler LRU, LFU 20-30% faster than economic model
  • Site policy which allows all experiments to share resources gives most effective grid use
Replica optimiser architecture
  • Access Mediator (AM) - contacts replica optimisers to locate the cheapest copies of files and makes them available locally
  • Storage Broker (SB) - manages files stored in SE, trying to maximise profit for the finite amount of storage space available
  • P2P Mediator (P2PM) - establishes and maintains P2P communication between grid sites
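
The Access Mediator's job of locating the cheapest copy of a file can be sketched as a lookup followed by a minimum over access costs. Everything here is hypothetical: the catalogue layout, the site labels, the cost figures and the function name are illustrative assumptions, not OptorSim interfaces.

```python
def cheapest_replica(logical_name, catalogue, cost_to_local):
    """Return the site holding the cheapest copy of a logical file,
    or None if the catalogue lists no replica for it."""
    sites = catalogue.get(logical_name, [])
    if not sites:
        return None
    return min(sites, key=lambda site: cost_to_local[site])


# Hypothetical catalogue: logical file name -> sites holding a replica.
catalogue = {"lfn:aod_001": ["CERN", "tier1_a", "tier1_b"]}

# Hypothetical cost of fetching a file from each site to the local site.
cost_to_local = {"CERN": 40.0, "tier1_a": 12.5, "tier1_b": 20.0}

print(cheapest_replica("lfn:aod_001", catalogue, cost_to_local))  # tier1_a
print(cheapest_replica("lfn:missing", catalogue, cost_to_local))  # None
```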