


DØ – SAMGrid

Where’ve we come from, and where are we going?

Evolution of a ‘long’ established plan

Gavin Davies

Imperial College London

GridPP18 Glasgow Mar 07


Introduction

Tevatron

Running experiments (Less data than LHC, but still PBs/experiment)

Growing - great physics & better still to come..

Have 2 fb⁻¹ of data and expect up to 6 fb⁻¹ more by end of 2009

2010 running is being discussed

Computing model: Datagrid (SAM) for all data handling & originally distributed computing with evolution to automated use of common tools/solutions on the grid (SAMGrid) for all tasks

Started with production tasks eg MC generation, data processing

Greatest need & easiest to ‘gridify’ - ahead of the wave and a running expt.

Based on SAMGrid, but with a programme of interoperability from very early on

Initially LCG and then OSG

Increased automation, user analysis considered last

SAM gives remote data analysis


Computing Model

[Diagram labels: Raw Data, RECO Data, RECO MC, User Data; Remote Farms, Central Farms, Central Storage, Data Handling Services, User Desktops, Central Analysis Systems, Remote Analysis Systems]


Components - Terminology

  • SAM (Sequential Access via Metadata)

    • Well developed metadata & distributed data replication system

    • Originally developed by DØ & FNAL-CD, now used by CDF & MINOS

  • JIM (Job and Information Management)

    • handles job submission and monitoring (all but data handling)

    • SAM + JIM → SAMGrid – computational grid

  • Runjob

    • handles job workflow management

  • UK Role

    • Project leadership

    • Key technology – runjob, integration of SAMGrid dev & production


SAM plots

http://d0db-prd.fnal.gov/sam_local/SamAtAGlance/

Over 10 PB (250B events) last year

Up to 1.2 PB moved per month (×5 increase over 2 years ago)

  • SAM TV - monitor SAM and SAM stations

    • Continued success: SAM shifters – often remote

http://www-clued0.fnal.gov/%7Esam/samTV/current/
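The SAM volume figures quoted on this slide imply a few derived numbers. The short calculation below is only a back-of-envelope sketch: it assumes decimal units (1 PB = 10^15 bytes) and a 30-day month, neither of which is stated on the slide, and it treats "over 10 PB" as exactly 10 PB, so the results are lower bounds.

```python
# Back-of-envelope figures derived from the numbers quoted on the slide.
# Assumptions (not from the slide): decimal units and a 30-day month.
PB = 1e15  # bytes

bytes_last_year = 10 * PB         # "Over 10 PB ... last yr" (lower bound)
events_last_year = 250e9          # "250B evts"
peak_bytes_per_month = 1.2 * PB   # "Up to 1.2 PB moved per month"

# Average data volume per event over the year.
avg_bytes_per_event = bytes_last_year / events_last_year

# Average monthly volume over the year, for comparison with the peak month.
avg_bytes_per_month = bytes_last_year / 12

# Sustained transfer rate needed to move the peak monthly volume.
seconds_per_month = 30 * 24 * 3600
peak_rate_bytes_per_s = peak_bytes_per_month / seconds_per_month

print(f"average volume per event: {avg_bytes_per_event / 1e3:.0f} kB")
print(f"average volume per month: {avg_bytes_per_month / PB:.2f} PB")
print(f"sustained rate in the peak month: {peak_rate_bytes_per_s / 1e6:.0f} MB/s")
```

On these assumptions the peak month corresponds to several hundred MB/s sustained around the clock, which gives a feel for the load on the SAM stations.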


SAMGrid plots

JIM: > 10 active execution sites

“Moving to forwarding nodes”

http://samgrid.fnal.gov:8080/

“No longer add red dots”

http://samgrid.fnal.gov:8080/list_of_schedulers.php

http://samgrid.fnal.gov:8080/list_of_resources.php


SAMGrid Interoperability

  • Long programme of interoperability – LCG 1st and then OSG

  • Step 1: Co-existence – use shared resources with SAM(Grid) headnode

    • Widely done for both MC and 2004/5 data reprocessing

      • Nikhef MC v. good example – GridPP10 talk

  • Step 2 – SAMGrid-LCG interface

    • SAM does data handling & JIM job submission

    • Basically forwarding mechanism

    • Data fixing in early 2006

    • MC since

  • OSG activity – learnt from LCG activity

    • P20 data reprocessing now

  • Replicate as needed
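The "forwarding mechanism" of Step 2 can be pictured with a toy model. The sketch below is not SAMGrid code: `Job`, `ForwardingNode`, `submit` and `route` are hypothetical names, and the round-robin choice of node is an assumption made purely for illustration. It only shows the idea that a forwarding node accepts a job and hands it on to another grid rather than running it itself.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    dataset: str  # dataset the job reads; data handling stays with SAM


@dataclass
class ForwardingNode:
    """Hypothetical node that accepts jobs and passes them on to another grid."""
    target_grid: str
    forwarded: list = field(default_factory=list)

    def submit(self, job: Job) -> str:
        # The node does not execute the job; it records it and hands it on.
        self.forwarded.append(job.job_id)
        return f"{job.job_id} -> {self.target_grid}"


def route(jobs, nodes):
    """Spread jobs round-robin over the forwarding nodes (illustrative policy)."""
    return [nodes[i % len(nodes)].submit(job) for i, job in enumerate(jobs)]


nodes = [ForwardingNode("LCG"), ForwardingNode("OSG")]
jobs = [Job(f"job{i}", "example-dataset") for i in range(3)]
for line in route(jobs, nodes):
    print(line)
```

"Replicate as needed" then corresponds to adding more entries to `nodes` when a single forwarding node stops scaling.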


Monte Carlo

  • Massive increase with spread of SAMGrid use & LCG (OSG later)

  • P17 – 455M events since 09/05

  • 30M events/month

  • 80% in Europe

    • Almost a constant of nature

  • UKRAC

    • Full details on web

    • http://www.hep.ph.ic.ac.uk/~villeneu/d0_uk_rac/d0_uk_rac.html

  • LCG gridwide submission reached scaling problem


Data – reprocessing & fixing

  • P14 Reprocessing: Winter 2003/04

    • 100M events remotely, 25M in UK

    • Distributed computing rather than Grid

  • P17 Reprocessing: Spring – Autumn 05

    • x 10 larger ie 1B events, 250TB, from raw

    • SAMGrid as default (using mc_runjob)

  • P17 Fixing: Spring 06

    • All RunIIa – 1.4B events in 6 weeks

    • SAMGrid-LCG ‘burnt-in’

  • Moving to primary processing and skimming
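The reprocessing and fixing numbers above can be turned into per-event and per-day figures. This is a rough sketch only: it assumes decimal units (1 TB = 10^12 bytes), takes the quoted 250 TB as the volume corresponding to the 1B events, and treats "6 weeks" as exactly 42 days.

```python
# Rough rates derived from the figures quoted on the slide.
# Assumptions (not from the slide): decimal units, 6 weeks = 42 days.
TB = 1e12  # bytes

# P14 reprocessing: share of the remote events done in the UK.
p14_remote_events = 100e6
p14_uk_events = 25e6
p14_uk_fraction = p14_uk_events / p14_remote_events

# P17 reprocessing: average volume per event.
p17_events = 1e9
p17_bytes = 250 * TB
p17_bytes_per_event = p17_bytes / p17_events

# P17 fixing: average throughput over the six weeks.
fix_events = 1.4e9
fix_days = 6 * 7
fix_events_per_day = fix_events / fix_days

print(f"P14: UK share of remote events: {p14_uk_fraction:.0%}")
print(f"P17 reprocessing: {p17_bytes_per_event / 1e3:.0f} kB per event")
print(f"P17 fixing: {fix_events_per_day / 1e6:.1f}M events per day")
```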

Site certification


A comment..if I may

  • Largest data challenges (I believe) in HEP using the grid

  • Learnt a lot about the technology, and especially how it scales

  • Learnt a lot about organisation / operation of such projects

  • Some of these can be abstracted and of benefit to others… (a different talk…)


A comment - graphically

  • P20 reprocessing

    • I know it’s OSG

    • (started with LCG)

    • SAMGrid-LCG

      • Will use to catch-up

IN2P3

A lot of green

A lot of red

OSG


(DØ –) Runjob

[Diagram labels: Runjob; CDFRunjob, CMSRunjob, DØRunjob]

  • Used in all production tasks – UK responsibility

  • In 04 we froze SAM at v5 & mc_runjob used by SAMGrid for MC and reprocessing from then till summer 06

  • DØrunjob - the rewrite

  • Joint (CDF,) CMS, DØ, FNAL-CD project

  • Base classes from common Runjob package

  • Things got messy – but triumph

    • Sustainable, long term product with SAM v7

  • For details see: http://projects.fnal.gov/runjob/
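The "base classes from common Runjob package" structure can be illustrated with a small class hierarchy. The sketch below is not the Runjob API: `RunjobBase`, `D0Runjob`, `CMSRunjob`, `run_stage` and the stage names are all hypothetical, chosen only to show how a shared base class can own the generic workflow while each experiment supplies its own stage handling.

```python
from abc import ABC, abstractmethod


class RunjobBase(ABC):
    """Hypothetical common base class that owns the generic workflow."""

    def __init__(self, stages):
        self.stages = list(stages)

    def run(self):
        # Generic workflow: handle each stage in order and collect the results.
        return [self.run_stage(stage) for stage in self.stages]

    @abstractmethod
    def run_stage(self, stage):
        """Experiment-specific handling of a single workflow stage."""


class D0Runjob(RunjobBase):
    """Hypothetical DØ-specific subclass."""

    def run_stage(self, stage):
        return f"D0:{stage}"


class CMSRunjob(RunjobBase):
    """Hypothetical CMS-specific subclass."""

    def run_stage(self, stage):
        return f"CMS:{stage}"


# Illustrative stage names, not taken from the talk.
workflow = ["generate", "simulate", "reconstruct"]
print(D0Runjob(workflow).run())
print(CMSRunjob(workflow).run())
```

The point of such a split is that workflow logic is maintained once in the common package, which fits the "sustainable, long term product" aim stated on the slide.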


Next steps / issues - I

  • Complete endgame development – ability to analyse larger datasets with decreasing manpower

    • Additional functionality – skimming, primary processing at multiple sites, MC prod at diff stages, diff output…

    • Additional resources - Completing the forwarding nodes

      • Full data /MC capability

      • Scaling issues to access the full LCG and OSG worlds

    • Data analysis – how gridified do we go? – an open issue

      • My feeling – need to be ‘interoperable’

        • – Fermigrid, certain large LCG sites

      • Will need development, deployment and operations effort

    • And operations..


Next steps / issues - II

  • “Steady” state – goal to reach by end of CY 07 (≥ 2yrs running)

    • Maintenance of existing functionality

    • Continued experimental requests

    • Continued evolution as grid standards evolve

    • Operations

      • You do still need manpower

        • and not just to make sure the hardware works

        • MC and data are not fire and forget

  • Manpower a real issue

    • (Especially with data analysis on the grid)


Summary / plans

  • DØ and Tevatron performing very well

    • Big physics results have come out, better yet on their way

    • Much more data to come → increasing needs, with reduced effort

  • SAM & SAMGrid critical to DØ

    • Without the grid DØ would not have worked

    • GridPP key part of effort (technical / leadership) – THANKS -

    • Users - demanding, hard to develop and maintain production level services

  • Baseline: Ensure (scaling for) production tasks

    • Move to SAMv7 and d0runjob

    • Accessing all LCG - establishing UKRAC – forwarding nodes

  • In parallel open question of data analysis – will need to go part way

  • Manpower for development, integration and operation is a real issue


Back-ups


SAMGrid Architecture
