Distributed Services for Grid Enabled Data Analysis


Presentation Transcript


  1. Distributed Services for Grid Enabled Data Analysis

  2. Scenario
     • Liz and John are members of CMS
     • Liz is from Caltech and is an expert in event reconstruction
     • John is from Florida and is an expert in statistical fits
     • They wish to combine their expertise and collaborate on a CMS Data Analysis Project

  3. Demo Goals
     • Prototype a vertically integrated system
       • Transparent/seamless experience
     • Distribute grid services using a uniform web service
       • Clarens!
     • Understand system latencies and failure modes
     • Investigate request scheduling in a resource-limited and dynamic environment
       • Emphasize functionality over scalability
     • Investigate interactive vs. scheduled data analysis on a grid
       • Hybrid example
     • Understand where the difficult issues are
     [Diagram: the analysis clients (IGUANA, ROOT, web browser, PDA) reach the grid services through the Clarens web service]
     • Grid Monitoring Service: MonALISA
     • Grid Resource Service: VDT Server
     • Grid Execution Service: VDT Client
     • Grid Scheduling Service: Sphinx
     • Virtual Data Service: Chimera
     • Workflow Generation Service: ShahKar
     • Collaborative Environment Service: CAVE
     • Remote Data Service: Clarens

  4. Data Discovery
     Virtual data products are pre-registered with the Chimera Virtual Data Service. Using Clarens, Liz and John discover data products by remotely browsing the Chimera Virtual Data Service.
     [Diagram: the client sends "browse" and "request" calls to the Chimera Virtual Data System, which holds two derivation chains: x.cards → pythia → x.ntpl → h2root → x.root, and y.cards → pythia → y.ntpl → h2root → y.root. The browse view lists y.ntpl, y.root, x.ntpl, x.root.]

  5. Data Analysis
     Liz wants to analyse x.root using her analysis code a.C.

         // Analysis code: a.C
         #include <iostream.h>
         #include <math.h>
         #include "TFile.h"
         #include "TTree.h"
         #include "TBrowser.h"
         #include "TH1.h"
         #include "TH2.h"
         #include "TH3.h"
         #include "TRandom.h"
         #include "TCanvas.h"
         #include "TPolyLine3D.h"
         #include "TPolyMarker3D.h"
         #include "TString.h"

         void a( char treefile[], char newtreefile[] )
         {
           Int_t Nhep;
           Int_t Nevhep;
           Int_t Isthep[3000];
           Int_t Idhep[3000], Jmohep[3000][2], Jdahep[3000][2];
           Float_t Phep[3000][5], Vhep[3000][4];
           Int_t Irun, Ievt;
           Float_t Weight;
           Int_t Nparam;
           Float_t Param[200];
           TFile *file = new TFile( treefile );
           TTree *tree = (TTree*) file -> Get( "h10
           tree -> SetBranchAddress( "Nhep", &Nh
           [listing cut off at the edge of the slide]

     [Diagram: x.cards → pythia → x.ntpl → h2root → x.root in the Chimera Virtual Data System]

  6. Interactive Workflow Generation
     Liz browses the local directory for her analysis code and the Chimera Virtual Data Service for input LFNs (logical file names)…
     [Diagram: a client panel with the steps "Select input LFN", "Select CINT script", and "Define output LFN", connected by "browse" and "register" calls to the Chimera Virtual Data System, which holds x.cards → pythia → x.ntpl → h2root → x.root]

  7. Interactive Workflow Generation
     She selects and registers (to the Grid) her analysis code, the appropriate input LFN, and a newly defined output LFN.
     [Diagram: the client panel lists the CINT scripts a.C, b.C, c.C, d.C and the input LFNs y.ntpl, y.root, x.ntpl, x.root; the output LFN is defined as xa.root. "browse" and "register" calls go to the Chimera Virtual Data System, which holds x.cards → pythia → x.ntpl → h2root → x.root.]

  8. Interactive Workflow Generation
     A branch is automatically added in the Chimera Virtual Data Catalog, and a.C is uploaded into “gridspace” and registered with RLS.
     [Diagram: same client panel as on the previous slide, with a.C, x.root, and xa.root selected. In the Chimera Virtual Data System the chain x.cards → pythia → x.ntpl → h2root → x.root gains the new branch x.root + a.C → root → xa.root.]

  9. Interactive Workflow Generation
     Querying the Virtual Data Service, Liz sees that xa.root is now available to her as a new virtual data product.
     [Diagram: the browse view lists y.ntpl, y.root, x.ntpl, x.root, xa.root. The Chimera Virtual Data System holds x.cards → pythia → x.ntpl → h2root → x.root, followed by x.root + a.C → root → xa.root.]

  10. Request Submission
      She requests it…
      [Diagram: the client issues a "request" for xa.root against the same catalog view and derivation chain as on the previous slide]
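The request for xa.root can be pictured as a walk over the derivation chain registered in the catalog: a product that does not exist yet is produced by running its transformation on its inputs, which may themselves have to be derived first. The following is a minimal C++ sketch of that idea, not Chimera's actual interface. The `Derivation` struct, the `Catalog` alias, and the `plan` and `example_catalog` functions are hypothetical names introduced only for illustration; the file and transformation names are the ones shown on the slides.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical record: how one logical file name (LFN) is derived.
struct Derivation {
    std::string transformation;       // e.g. "pythia", "h2root", "root a.C"
    std::vector<std::string> inputs;  // input LFNs
};

// Hypothetical in-memory catalog: output LFN -> its derivation.
using Catalog = std::map<std::string, Derivation>;

// Appends to `steps` the LFNs that must be produced, in dependency order,
// to materialize `lfn`. LFNs already in `existing` are reused as they are.
void plan(const std::string& lfn, const Catalog& catalog,
          std::set<std::string>& existing, std::vector<std::string>& steps) {
    if (existing.count(lfn)) return;  // already materialized somewhere
    auto it = catalog.find(lfn);
    assert(it != catalog.end() && "LFN is neither materialized nor derivable");
    for (const auto& input : it->second.inputs)  // derive the inputs first
        plan(input, catalog, existing, steps);
    steps.push_back(lfn);
    existing.insert(lfn);  // do not plan the same product twice
}

// The chain from the slides: x.cards -> x.ntpl -> x.root -> xa.root.
Catalog example_catalog() {
    return {
        {"x.ntpl",  {"pythia",   {"x.cards"}}},
        {"x.root",  {"h2root",   {"x.ntpl"}}},
        {"xa.root", {"root a.C", {"x.root"}}},
    };
}
```

If x.root already exists, planning xa.root yields a single step; if only x.cards exists, the whole chain is planned in order. A real virtual data system would also consult a replica catalog to decide where each step runs, which this sketch leaves out.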

  11. Brief Interlude: The Grid is Busy and Resources are Limited!
      • Busy:
        • Production is taking place
        • Other physicists are using the system
        • Use MonALISA to avoid congestion in the grid
      • Limited:
        • As grid computing becomes standard fare, oversubscription to resources will be common!
        • CMS gives Liz a global high priority
        • Based upon local and global policies and the current Grid weather, a grid scheduler must schedule her requests for optimal resource use

  12. Sphinx Scheduling Server
      • Nerve Centre
        • Global view of system
      • Data Warehouse
        • Information driven
        • Repository of current state of the grid
      • Control Process
        • Finite State Machine
        • Different modules modify jobs, graphs, workflows, etc. and change their state
      • Flexible
      • Extensible
      [Diagram: the Sphinx Server consists of a Message Interface, a Control Process, and the modules Graph Reducer, Job Predictor, Graph Predictor, Job Admission Control, Graph Admission Control, Graph Data Planner, Job Execution Planner, Graph Tracker, Data Management, and Information Gatherer. The Data Warehouse stores policies, accounting info, grid weather, resource properties and status, request tracking, workflows, etc.]
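The "Finite State Machine" bullet can be pictured as every request carrying a state, with exactly one module responsible for moving it to the next state. Below is a minimal C++ sketch under stated assumptions: the state names and their order are invented for illustration, loosely following the module names in the diagram, and are not taken from the Sphinx implementation. `Job`, `JobState`, `responsible_module`, `finished`, and `advance` are likewise hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical job states; the slides do not list the real Sphinx state set.
enum class JobState { Unreduced, Unpredicted, Unadmitted, Unplanned, Submitted, Done };

struct Job {
    std::string output_lfn;
    JobState state = JobState::Unreduced;
};

// Module (named as in the slide's diagram) assumed to handle each state.
std::string responsible_module(JobState state) {
    switch (state) {
        case JobState::Unreduced:   return "Graph Reducer";
        case JobState::Unpredicted: return "Job Predictor";
        case JobState::Unadmitted:  return "Job Admission Control";
        case JobState::Unplanned:   return "Job Execution Planner";
        case JobState::Submitted:   return "Graph Tracker";
        case JobState::Done:        return "";
    }
    return "";
}

bool finished(const Job& job) { return job.state == JobState::Done; }

// One step of the control process: the responsible module is assumed to have
// completed its work, so the job moves on to the next state in the enum.
void advance(Job& job) {
    assert(!finished(job) && "a finished job has no next state");
    job.state = static_cast<JobState>(static_cast<int>(job.state) + 1);
}
```

Keeping the state in the job record, rather than in the modules, is what lets independent modules be added or replaced without touching each other, which matches the "Flexible" and "Extensible" bullets.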

  13. Distributed Services for Grid Enabled Data Analysis
      [Architecture diagram]
      • Data Analysis Client: ROOT
      • Scheduling Service: Sphinx
      • Virtual Data Service: Chimera
      • Replica Location Service: RLS
      • Monitoring Service: MonALISA
      • Execution Service: Sphinx/VDT
      • Sites: Fermilab, Caltech, Florida, Iowa, each running a File Service and a VDT Resource Service
      • Links between the components are labelled Clarens, Globus, GridFTP, and MonALISA

  14. Collaborative Analysis
      Meanwhile, John has been developing his statistical fits in b.C by analysing the data product x.root.
      [Diagram: the browse view lists y.ntpl, y.root, x.ntpl, x.root, xa.root, xb.root. The catalog holds x.cards → pythia → x.ntpl → h2root → x.root, with two branches: x.root + a.C → root → xa.root and x.root + b.C → root → xb.root.]

  15. Collaborative Analysis
      After Liz has finished optimising the event reconstruction, John uses his analysis code b.C on her data product xa.root to produce the final statistical fits and results!
      [Diagram: the browse view lists y.root, x.ntpl, x.root, xa.root, xb.root, xab.root. The catalog gains the branch xa.root + b.C → root → xab.root alongside the existing xa.root and xb.root branches.]

  16. Key Features
      Distributed Services Prototype in Data Analysis
      Services:
      • Remote Data Service
      • Replica Location Service
      • Virtual Data Service
      • Scheduling Service
      • Grid-Execution Service
      • Monitoring Service
      Features:
      • Smart Replication Strategies for “Hot Data”
      • Virtual Data w.r.t. Location
      • Execution Priority Management on a Resource Limited Grid
      • Policy Based Scheduling & QoS
      • Virtual Data w.r.t. Existence
      • Collaborative Environment
      • Sharing of Datasets
      • Use of Provenance

  17. Credits
      • California Institute of Technology: Julian Bunn, Iosif Legrand, Harvey Newman, Suresh Singh, Conrad Steenberg, Michael Thomas, Frank Van Lingen, Yang Xia
      • University of Florida: Paul Avery, Dimitri Bourilkov, Richard Cavanaugh, Laukik Chitnis, Jang-uk In, Mandar Kulkarni, Pradeep Padala, Craig Prescott, Sanjay Ranka
      • Fermi National Accelerator Laboratory: Anzar Afaq, Greg Graham

  18. DMC (Data Management Component)
      • Scheduling the data transfers to achieve optimal workflow execution
      • The problem: combining data and execution scheduling
      • Various kinds of data transfers
      • Smart replication
      • User initiated
      • Workflow based replication
      • Automatic replication
      • Hot data management
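One way to read "hot data management" is that files requested often enough become candidates for automatic replication. The sketch below counts accesses per logical file name and reports those at or above a threshold. The class name `AccessTracker`, the threshold, and the counting rule are illustrative assumptions; the slides do not say how the DMC actually identifies hot data.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: tracks how often each LFN has been requested.
class AccessTracker {
public:
    explicit AccessTracker(std::size_t hot_threshold) : threshold_(hot_threshold) {
        assert(hot_threshold > 0 && "a threshold of zero would mark every file hot");
    }

    // Record one request for the given logical file name.
    void record_access(const std::string& lfn) { ++counts_[lfn]; }

    // LFNs requested at least `hot_threshold` times, i.e. replication candidates.
    std::vector<std::string> hot_files() const {
        std::vector<std::string> hot;
        for (const auto& entry : counts_) {
            if (entry.second >= threshold_) hot.push_back(entry.first);
        }
        return hot;  // ordered by LFN, because std::map iterates in key order
    }

private:
    std::size_t threshold_;
    std::map<std::string, std::size_t> counts_;
};
```

A production policy would probably also age out old accesses so that a file that was popular last month does not stay "hot" forever; that refinement is omitted here.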

  19. Monitoring in SPHINX
      • The scheduler needs information to make decisions
        • The information needs to be as “current” as possible
        • That brings monitoring into the picture: load average, free memory, disk space
      • Virtual Organization (VO) Quota System
        • Different policies for resources
        • Needs monitoring and accounting/tracking of resource quotas
      • MonALISA
        • Dynamic discovery of sites
        • Configurable monitoring service and parameters
        • View generation using filters
        • Displays SPHINX job information
      • Future Directions
        • As the grid grows, the problem of latency becomes more pressing
        • Solution: data fusion/aggregation
        • In line with the hierarchical views of the grid (VO) and the hierarchical scheduler!
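A scheduler could combine some of the monitored quantities above with the VO quota roughly as follows. This is a sketch under stated assumptions, not the Sphinx algorithm: the `SiteStatus` fields, the rule "skip sites without remaining quota or disk space, then pick the lowest load average", and the `choose_site` function are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical snapshot of one site, as a monitoring service might report it.
struct SiteStatus {
    std::string name;
    double load_average;
    double free_disk_gb;
    int quota_remaining;  // jobs the user's VO may still run at this site
};

// Returns the index of the eligible site with the lowest load average,
// or -1 if no site has both remaining quota and enough free disk space.
int choose_site(const std::vector<SiteStatus>& sites, double needed_disk_gb) {
    assert(needed_disk_gb >= 0.0);
    int best = -1;
    for (std::size_t i = 0; i < sites.size(); ++i) {
        const SiteStatus& site = sites[i];
        if (site.quota_remaining <= 0) continue;            // VO quota used up
        if (site.free_disk_gb < needed_disk_gb) continue;   // output would not fit
        if (best < 0 || site.load_average < sites[best].load_average) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Because the decision is only as good as the snapshot it is based on, stale monitoring data can send a job to a site that has since become busy, which is why the slide stresses keeping the information as current as possible.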

  20. Distributed Services for Grid Enabled Data Analysis
      [Architecture diagram, repeated from slide 13]
      • Data Analysis Client: ROOT
      • Scheduling Service: Sphinx
      • Virtual Data Service: Chimera
      • Replica Location Service: RLS
      • Monitoring Service: MonALISA
      • Execution Service: Sphinx/VDT
      • Sites: Fermilab, Caltech, Florida, Iowa, each running a File Service and a VDT Resource Service
      • Links between the components are labelled Clarens, Globus, GridFTP, and MonALISA
