
Grid Job, Information and Data Management for the Run II Experiments at FNAL

Igor Terekhov et al.

FNAL/CD/CCF, D0, CDF, Condor team

Plan of Attack

  • Brief History, D0 and CDF computing

  • Grid Jobs and Information Management

    • Architecture

    • Job management

    • Information management

    • JIM project status and plans

  • Globally Distributed data handling in SAM and beyond

  • Summary

Igor Terekhov, FNAL

History

  • Run II: CDF and D0, the two largest currently running collider experiments

  • Each experiment to accumulate ~1 PB of raw, reconstructed, and analyzed data by 2007. Aim to get the Higgs jointly.

  • Real data acquisition – 5/wk, 25 MB/s, 1 TB/day, plus MC


Globally Distributed Computing

  • D0 – 78 institutions, 18 countries. CDF – 60 institutions, 12 countries.

  • Many institutions have computing (including storage) resources – dozens for each of D0 and CDF

  • Some of these are actually shared, regionally or experiment-wide

    • Sharing is good

    • A way for an institution to contribute to the collaboration while keeping its resources local

    • Recent Grid trend (and its funding) encourages it


Goals of Globally Distributed Computing in Run II

  • To distribute data to processing centers – SAM is a way, see later slide

  • To benefit from the pool of distributed resources – maximize job turnaround, yet keep single interface

  • To facilitate and automate decision making on job/data placement.

    • Submit to the cyberspace, choose best resource

  • To provide an aggregate view of the system and its activities and keep track of what’s happening

  • To maintain security

  • Finally, to learn and prepare for the LHC computing


SAM Highlights

  • SAM is Sequential data Access via Meta-data.


  • Presented numerous times at previous CHEP conferences

  • Core features: meta-data cataloguing, global data replication and routing, co-allocation of compute and data resources

  • Global data distribution:

    • MC import from remote sites

    • Off-site analysis centers

    • Off-site reconstruction (D0)


Now that the Data’s Distributed: JIM

  • Grid Jobs and Information Management

  • Owes its existence to D0 Grid funding – PPDG (an FNAL team) and UK GridPP (Rod Walker, ICL)

  • Very young – started 2001

  • Actively explore, adopt, enhance, develop new Grid technologies

  • Collaborate with the Condor team from The University of Wisconsin on Job management

  • JIM together with SAM is also known as SAMGrid




Job Management Strategies

  • We distinguish grid-level (global) job scheduling (selection of a cluster to run) from local scheduling (distribution of the job within the cluster)

  • We distinguish structured jobs from unstructured.

    • Structured jobs have their details known to Grid middleware.

    • Unstructured jobs are mapped as a whole onto a cluster

  • In the first phase, we want reasonably intelligent scheduling and reliable execution of unstructured data-intensive jobs.
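The two-level split described above can be sketched in code. This is an illustrative sketch only, not JIM code; all names, attributes, and numbers are hypothetical.

```python
# Sketch of two-level scheduling: grid-level (global) brokering selects a
# cluster for the whole job; local scheduling then places it within the
# cluster. All names and numbers are hypothetical.

def grid_schedule(job, clusters, rank):
    """Grid-level scheduling: pick one cluster for the whole (unstructured) job."""
    return max(clusters, key=lambda c: rank(job, c))

def local_schedule(job, cluster):
    """Local scheduling: the cluster's own batch system picks a worker node."""
    return min(cluster["nodes"], key=lambda n: n["load"])

clusters = [
    {"name": "A", "cached_gb": 10, "nodes": [{"id": "a1", "load": 0.9}]},
    {"name": "B", "cached_gb": 50, "nodes": [{"id": "b1", "load": 0.2},
                                             {"id": "b2", "load": 0.7}]},
]
job = {"id": "unstructured-job"}  # unstructured: mapped as a whole onto one cluster

chosen = grid_schedule(job, clusters, lambda j, c: c["cached_gb"])
node = local_schedule(job, chosen)
print(chosen["name"], node["id"])
```

An unstructured job only ever sees the grid-level step; a structured job would expose its internal pieces to the middleware so that each piece could be scheduled separately.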


Job Management Highlights

  • We seek to provide automated resource selection (brokering) at the global level with final scheduling done locally (environments like CDF CAF, Frank’s talk)

  • Focus on data-intensive jobs:

    • Execution time is composed of:

      • Time to retrieve any missing input data

      • Time to process the data

      • Time to store output data

    • To leading order, we rank sites by the amount of data cached at the site (minimizing the missing input data)

    • Scheduler is interfaced with the data handling system
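The three-component execution-time model above can be sketched as a simple cost function. This is a hypothetical illustration, not the actual scheduler; the site attributes and numbers are invented.

```python
# Illustrative cost model for data-intensive jobs: execution time is the sum
# of the three components listed above. Attribute names and numbers are
# hypothetical.

def exec_time(site, job):
    # Time to retrieve any missing input data
    missing_gb = job["input_gb"] - min(job["input_gb"], site["cached_gb"])
    t_in = missing_gb / site["inbound_gb_per_s"]
    # Time to process the data
    t_cpu = job["cpu_s"]
    # Time to store output data
    t_out = job["output_gb"] / site["outbound_gb_per_s"]
    return t_in + t_cpu + t_out

job = {"input_gb": 100.0, "output_gb": 10.0, "cpu_s": 3600.0}
sites = {
    "near": {"cached_gb": 90.0, "inbound_gb_per_s": 0.1, "outbound_gb_per_s": 0.1},
    "far":  {"cached_gb": 0.0,  "inbound_gb_per_s": 0.1, "outbound_gb_per_s": 0.1},
}

best = min(sites, key=lambda s: exec_time(sites[s], job))
print(best)  # the site caching most of the input wins
```

With identical CPU and network speeds, the ranking reduces to the leading-order rule on the slide: prefer the site with the most input data already cached.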


Job Management – Distinct JIM Features

  • Decision making is based on both:

    • Information existing irrespective of jobs (resource description)

    • Functions of (job, resource)

  • Decision making is interfaced with the data handling middleware rather than with individual SEs or a replica catalog alone: this allows data handling considerations to be incorporated

  • Decision making is entirely within the Condor framework (no resource broker of our own) – strong promotion of standards and interoperability


Job Management – Architecture

[Architecture diagram: the User Interface feeds a Submission Client; the Job Management layer comprises a Match Making Service, a Queuing System, and an Information Collector; Execution Sites #1…#n each provide a Data Handling System, Computing Elements, Storage Elements, and Grid Sensors]

Condor Framework and Enhancements We Drove

  • Initial Condor-G:

    • Personal Grid agent helping user run a job on a cluster of his/her choice

  • JIM: True grid service for accepting and placing jobs from all users

    • Added MMS for Grid job brokering

  • JIM: from 2-tier to 3-tier architecture

    • Decouple the queuing/spooling/scheduling machine from the user machine

    • Security delegation, proper stdin/stdout/stderr spooling, etc.

  • Will move into standard Condor


Condor Framework and Enhancements We Drove

  • Classic Matchmaking service (MMS):

    • Clusters advertise their availability, jobs are matched with clusters

    • Cluster (Resource) description exists irrespective of jobs

  • JIM: Ranking expressions contain functions that are evaluated at run-time

    • Helps rank a match by a function of (job, resource)

    • Now: query participating sites for cached data. Future: estimate when data for the job can arrive, etc.

    • Feature now in standard Condor-G
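The difference between classic matchmaking and run-time-evaluated ranking can be sketched as follows. Real Condor expresses this in the ClassAd language, not Python; the site names, attributes, and cache numbers below are hypothetical.

```python
# Sketch of matchmaking with a rank expression evaluated at match time.
# In Condor this is done with ClassAds; all names here are hypothetical.

clusters = [
    {"name": "site-a", "free_cpus": 40},   # static resource description,
    {"name": "site-b", "free_cpus": 10},   # advertised irrespective of jobs
]

def data_cached(job, cluster):
    # In JIM this would query the site's data handling system at match time;
    # here a lookup table stands in for that query.
    table = {("job-1", "site-a"): 0.2, ("job-1", "site-b"): 0.9}
    return table.get((job["id"], cluster["name"]), 0.0)

def rank(job, cluster):
    # Classic MMS could only rank on static attributes; the JIM enhancement
    # lets the rank be a function of (job, resource) evaluated at run time.
    return data_cached(job, cluster)

job = {"id": "job-1"}
matches = [c for c in clusters if c["free_cpus"] > 0]   # requirements
best = max(matches, key=lambda c: rank(job, c))         # ranking
print(best["name"])
```

Note that the requirements step uses only advertised attributes, while the rank step may call out to external services, which is what makes data-aware brokering possible.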


Monitoring Highlights

  • Sites (resources) and jobs

  • Distributed knowledge about jobs etc

  • Incremental knowledge building

  • GMA for current state inquiries, Logging for recent history studies

  • All Web based


Information Management – Implementation and Technology Choices

  • XML for representation of site configuration and (almost) all other information

  • XQuery and XSLT for information processing

  • Xindice and other native XML databases for database semantics
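To illustrate the flavor of querying XML site configuration: JIM used XQuery/XSLT and native XML databases such as Xindice; the sketch below uses Python's standard-library ElementTree as a stand-in, and the configuration document is invented.

```python
# Illustrative only: querying an XML site/cluster configuration document.
# JIM used XQuery over native XML databases (e.g. Xindice); Python's
# stdlib ElementTree stands in here. The document below is hypothetical.
import xml.etree.ElementTree as ET

config = ET.fromstring("""
<site name="example-site">
  <cluster name="clusterA"><cpus>64</cpus></cluster>
  <cluster name="clusterB"><cpus>128</cpus></cluster>
</site>
""")

# Equivalent in spirit to an XQuery such as:
#   for $c in /site/cluster where xs:int($c/cpus) > 100 return $c/@name
big = [c.get("name") for c in config.findall("cluster")
       if int(c.findtext("cpus")) > 100]
print(big)  # ['clusterB']
```

Because the data is self-describing XML rather than a fixed relational schema, new site attributes can be added without schema migrations, which is the flexibility argument behind these technology choices.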


Meta-Schema

[Diagram: main site/cluster configuration schema]

JIM Monitoring

[Diagram: a Web Browser talks to Web Servers 1…N, each of which queries the Information Systems at Sites 1…N]

JIM Project Status

  • Delivered prototype for D0, Oct 10, 2002:

    • Remote job submission

    • Brokering based on data cached

    • Web-based monitoring

  • SC-2002 demo – 11 sites (D0, CDF), big success

  • April 2003 – production deployment of V1 (Grid analysis in production became a reality as of April 1)

  • Post V1 – OGSA, Web services, logging service


Grid Data Handling

  • We define GDH as a middleware service which:

    • Brokers storage requests

    • Maintains economic knowledge about the costs of access to different SEs

    • Replicates data as needed (not only as driven by admins)

  • Generalizes or replaces some of the services of the Data Management part of SAM
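A storage-request broker of the kind defined above might look like the following sketch. This is not SAM code; the storage element names, cost values, and replica table are all hypothetical.

```python
# Sketch of a grid data handling (GDH) broker: choose which storage element
# (SE) serves a request using per-SE access costs. All names and numbers
# are hypothetical; lower cost could reflect disk vs tape, load, bandwidth.

access_cost = {"se-disk-fnal": 1.0, "se-tape-fnal": 8.0, "se-disk-lyon": 3.0}

# Replica catalog: which SEs currently hold each file
replicas = {"file-42": ["se-tape-fnal", "se-disk-lyon"]}

def broker(filename):
    """Pick the cheapest SE holding a replica of the requested file."""
    holders = replicas.get(filename, [])
    if not holders:
        raise LookupError(f"no replica of {filename}")
    return min(holders, key=lambda se: access_cost[se])

print(broker("file-42"))
```

On top of such a cost model, need-driven replication would mean copying a hot file to a cheaper SE when repeated requests make the transfer worthwhile, rather than replicating only when an administrator decides to.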


Grid Data Handling, Initial Thoughts


The Necessary (Almost) Final Slide

  • Run II experiments’ computing is highly distributed, Grid trend is very relevant

  • The JIM (Jobs and Information Management) part of the SAMGrid addresses the needs for global and grid computing at Run II

  • We use Condor and Globus middleware to schedule jobs globally (based on data), and provide Web-based monitoring

  • Demo available – see me or Gabriele

  • SAM, the data handling system, is evolving toward the Grid, with modern storage element access enabled


P.S. – Related Talks

  • F. Wuerthwein, CAF (Cluster Analysis Facility) – job management on a cluster and interface to JIM/Grid

  • F. Ratnikov, Monitoring on CAF and interface to JIM/Grid

  • S. Stonjek, SAMgrid deployment experiences

  • L. Lueking, G. Garzoglio – SAM-related


Backup Slides

Information Management

  • In JIM’s view, this includes both:

    • Resource description for job brokering

    • Infrastructure for monitoring (core project area)

  • GT MDS is not sufficient:

    • Need (persistent) info representation that’s independent of LDIF or other such format

    • Need maximum flexibility in information structure – no fixed schema

    • Need configuration tools, push operation etc
