
Virtual Data Tools Status Update


Presentation Transcript


  1. Virtual Data Tools Status Update. ATLAS Grid Software Meeting, BNL, 6 May 2002. Mike Wilde, Argonne National Laboratory. An update on work by Jens Voeckler, Yong Zhao, Gaurang Mehta, and many others.

  2. The Virtual Data Model • Data suppliers publish data to the Grid • Users request raw or derived data from the Grid, without needing to know • Where the data is located • Whether the data is stored or computed on demand • Users and applications can easily determine • What it will cost to obtain the data • The quality of derived data • The Virtual Data Grid serves requests efficiently, subject to global and local policy constraints

  3. CMS Pipeline in VDL (pipeline stages: pythia_input, pythia.exe, cmsim_input, cmsim.exe, writeHits, writeDigis)
  begin v /usr/local/demo/scripts/cmkin_input.csh
    file i ntpl_file_path
    file i template_file
    file i num_events
    stdout cmkin_param_file
  end
  begin v /usr/local/demo/binaries/kine_make_ntpl_pyt_cms121.exe
    pre cms_env_var
    stdin cmkin_param_file
    stdout cmkin_log
    file o ntpl_file
  end
  begin v /usr/local/demo/scripts/cmsim_input.csh
    file i ntpl_file
    file i fz_file_path
    file i hbook_file_path
    file i num_trigs
    stdout cmsim_param_file
  end
  begin v /usr/local/demo/binaries/cms121.exe
    condor copy_to_spool=false
    condor getenv=true
    stdin cmsim_param_file
    stdout cmsim_log
    file o fz_file
    file o hbook_file
  end
  begin v /usr/local/demo/binaries/writeHits.sh
    condor getenv=true
    pre orca_hits
    file i fz_file
    file i detinput
    file i condor_writeHits_log
    file i oo_fd_boot
    file i datasetname
    stdout writeHits_log
    file o hits_db
  end
  begin v /usr/local/demo/binaries/writeDigis.sh
    pre orca_digis
    file i hits_db
    file i oo_fd_boot
    file i carf_input_dataset_name
    file i carf_output_dataset_name
    file i carf_input_owner
    file i carf_output_owner
    file i condor_writeDigis_log
    stdout writeDigis_log
    file o digis_db
  end

  4. Virtual Data for Real Science: A Prototype Virtual Data Catalog • Architecture of the system: Virtual Data Catalog (PostgreSQL), Virtual Data Language, VDL Interpreter (VDLI), Grid testbed • Production DAG of simulated CMS data: Simulate Physics, Simulate CMS Detector Response, Copy flat-file to OODBMS, Simulate Digitization of Electronic Signals • [Deployment diagram: job submission sites (ANL, SC, …) run a Condor-G agent and GridFTP client; job execution sites at U of Chicago, U of Florida, and U of Wisconsin each provide a Condor pool behind Globus GRAM with GSI, GridFTP clients/servers, and local file storage.]

  5. Cluster-finding Data Pipeline [Figure: pipeline from field and tsObj input files through brg, core, and cluster stages to the final catalog; the numbers in the figure label the files consumed and produced at each stage.]

  6. Virtual Data Tools • Virtual Data API • A Java class hierarchy to represent transformations and derivations • Virtual Data Language • Textual for illustrative examples • XML for machine-to-machine interfaces • Virtual Data Database • Makes the objects of a virtual data definition persistent • Virtual Data Service • Provides an OGSA interface to persistent objects
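
  As a rough illustration of the Java class hierarchy mentioned above, here is a minimal sketch of how transformations and derivations might be represented; the class and field names are assumptions for illustration, not the actual API.

    // Hypothetical shapes for the virtual data objects; names are assumptions.
    import java.util.ArrayList;
    import java.util.List;

    class Transformation {                 // a "TR": an executable plus formal arguments
        String name;                       // e.g. "t1" or "findrange"
        String application;                // e.g. "/usr/bin/app3"
        List<String> formalArgs = new ArrayList<>();   // declared inputs/outputs/parameters
    }

    class Derivation {                     // a "DV": a TR bound to actual logical files
        String transformation;             // name of the TR this derivation invokes
        List<String> actualArgs = new ArrayList<>();   // e.g. "a1=@{input:run1.exp15.T1932.raw}"
    }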

  7. Languages • VDLt – textual version • mainly for documentation for now • May eventually implement a translator • Can dump data structures in this representation • VDLx – XML version – app-to-VDC interchange • Useful for bulk data entry – catalog import/export • aDAGx – XML version of the abstract DAG • cDAG – actual DAGman DAG

  8. Components and Interfaces • Java API • Manage Catalog objects (tr,dv, args…) • Create / Locate / Update / Delete • Same API at client and within server • Can embed Java classes in an App for now • Virtual Data Catalog Server • Web (eventually OGSA) • SOAP interface mirrors Java API operations • XML processor • Database – managed by VDCS
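
  A hedged sketch of the Create / Locate / Update / Delete pattern listed above, written as a Java interface; the interface and method names are illustrative assumptions, not the real catalog API. The same operations are meant to be callable in-process (embedded in an application) or through the server's SOAP interface.

    // Hypothetical client-side interface mirroring the catalog operations above.
    interface VirtualDataCatalog {
        void create(Object definition);               // add a tr, dv, argument list, ...
        Object locate(String namespace, String name); // look up an existing definition
        void update(Object definition);               // modify a definition in place
        void delete(String namespace, String name);   // remove a definition
    }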

  9. System Architecture [Diagram: Client App, Virtual Data Catalog Service, Client API, Virtual Data Catalog Objects, Virtual Data Catalog Database]

  10. Initial Release Architecture [Diagram: Client App, Client API, Virtual Data Catalog Objects, Virtual Data Catalog Database]

  11. Application interfaces • Invoke the Java client API (to make OGSA calls) • Invoke the Java server API (for now, embed VDC processing directly in the application) • Make OGSA calls directly • Formulate XML (VDLx) to load the catalog or request derivations

  12. Example VDL-Text
  TR t1( output a2, input a1, none env="100000", none pa="500" ) {
    app = "/usr/bin/app3";
    argument parg = "-p "${none:pa};
    argument farg = "-f "${input:a1};
    argument xarg = "-x -y ";
    argument stdout = ${output:a2};
    profile env.MAXMEM = ${none:env};
  }

  13. Example Derivation
  DV t1 (
    a2=@{output:run1.exp15.T1932.summary},
    a1=@{input:run1.exp15.T1932.raw},
    env="20000",
    pa="600"
  );

  14. Derivations with dependencies
  TR trans1( output a2, input a1 ) {
    app = "/usr/bin/app1";
    argument stdin = ${input:a1};
    argument stdout = ${output:a2};
  }
  TR trans2( output a2, input a1 ) {
    app = "/usr/bin/app2";
    argument stdin = ${input:a1};
    argument stdout = ${output:a2};
  }
  DV trans1( a2=@{output:file2}, a1=@{input:file1} );
  DV trans2( a2=@{output:file3}, a1=@{output:file2} );

  15. Expressing Dependencies

  16. Define the transformations
  TR generate( output a ) {
    app = "generator.exe";
    argument stdout = ${output:a};
  }
  TR findrange( output b, input a, none p="0.0" ) {
    app = "ranger.exe";
    argument arg = "-i "${:p};
    argument stdin = ${input:a};
    argument stdout = ${output:b};
  }
  TR default.analyze( input a[], output c ) {
    pfnHint vanilla = "analyze.exe";
    argument files = ${:a};
    argument stdout = ${output:c};
  }

  17. Derivations forming a DAG
  DV generate( a=@{output:f.a} );
  DV findrange( b=@{output:f.b}, a=@{input:f.a}, p="0.5" );
  DV findrange( b=@{output:f.c}, a=@{input:f.a}, p="1.0" );
  DV analyze( a=[ @{input:f.b}, @{input:f.c} ], c=@{output:f.d} );
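
  The DAG implied by these derivations can be recovered by matching logical file names: an edge runs from the derivation that declares a file as output to the one that declares it as input. A minimal Java sketch of that matching, using illustrative data structures that are not part of the actual toolkit:

    // Hedged sketch: infer DAG edges (producer -> consumer) by matching LFNs.
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class DagBuilder {
        // outputs/inputs map each derivation name to its declared output/input LFNs.
        static List<String[]> edges(Map<String, List<String>> outputs,
                                    Map<String, List<String>> inputs) {
            Map<String, String> producer = new HashMap<>();   // LFN -> producing derivation
            outputs.forEach((dv, lfns) -> lfns.forEach(lfn -> producer.put(lfn, dv)));

            List<String[]> result = new ArrayList<>();
            inputs.forEach((dv, lfns) -> lfns.forEach(lfn -> {
                String p = producer.get(lfn);
                if (p != null) result.add(new String[]{p, dv});  // edge: producer -> consumer
            }));
            return result;
        }
    }

  Applied to the derivations above, this yields edges generate -> findrange (via f.a, twice) and findrange -> analyze (via f.b and f.c).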

  18. Virtual Data Class Diagram Diagram by Jens Voeckler

  19. Virtual Data Catalog Structure

  20. Virtual Data Language - XML

  21. VDL Searches • Locate the derivations that can produce a specific lfn • General queries for catalog maintenance • Locate transforms that can produce a specific file type (what does a type mean in this context?)
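
  A hedged sketch of what these searches might look like against the Transformation and Derivation classes sketched earlier; the method names are assumptions, not the actual catalog interface.

    // Hypothetical query methods for the searches above; names are assumptions.
    import java.util.List;

    interface VdlSearch {
        // derivations whose declared outputs include the given logical file name
        List<Derivation> derivationsProducing(String lfn);

        // transformations whose outputs are declared with a given file type
        List<Transformation> transformationsProducingType(String type);
    }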

  22. Virtual Data Issues • Param file support • Param structures • Sequences • Virtual datasets

  23. Execution Environment Profile • Condor / DAGman / GRAM / WP1 • Concept of an EE driver • Allows plug-in of DAG-generating code for: DAGman, Condor, GRAM, WP1 JM/RB • Execution profile levels: Global, User/Group, Transformation, Derivation, Invocation
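
  A hedged Java sketch of the EE-driver idea described above: one pluggable driver per execution environment turns an abstract DAG into concrete submissions. The interface and class names are illustrative assumptions.

    // Hypothetical plug-in interface for execution-environment drivers; names are assumptions.
    interface ExecutionEnvironmentDriver {
        String name();                            // e.g. "DAGman", "GRAM", "WP1 JM/RB"
        ConcreteDag generate(AbstractDag dag);    // emit an executable DAG for this environment
    }

    class AbstractDag { /* nodes = derivations, edges = logical-file dependencies */ }
    class ConcreteDag { /* e.g. a set of Condor DAGman submit files */ }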

  24. First Release – June 2002 • Java Catalog Classes • XML import – export • Textual VDL formatting • DAX – (abstract) DAG in XML • Simple planner for constrained Grid • Will generate Condor DAGs

  25. Next Releases - Features • RLS Integration • Compound Transformations • Database persistency • OGSA Service • Other needed clients: C, TCL, ? • Expanded execution profiles / planners • Support for WP1 scheduler / broker • Support for generic RSL-based schedulers

  26. Longer-term Feature Preview • Instance tracking • Virtual files and virtual transformations • Multi-modal data • Structured namespaces • Grid-wide distributed catalog service • Metadata database integration • Knowledge-base integration

  27. SDSS Extension: Dynamic Dependencies • Data is organized into spatial cells • Scope of search is not known until run time • In this case – the nearest 9 or 25 cells to a centroid • Need a dynamic algorithmic spec for the range of cells to process – a nested loop that generates the actual file names to examine • In complex cases, this might be a sequence of such centroid-based sequences
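
  A minimal Java sketch of such a nested loop, assuming a grid of cells indexed by row and column and a hypothetical cell-to-filename convention; both are assumptions for illustration.

    // Hedged sketch: enumerate the nearest block of spatial cells around a centroid
    // and generate one input file name per cell. The naming convention is an assumption.
    import java.util.ArrayList;
    import java.util.List;

    class CellRange {
        static List<String> cellFiles(int centroidRow, int centroidCol, int halfWidth,
                                      String fieldPrefix) {
            List<String> files = new ArrayList<>();
            for (int r = centroidRow - halfWidth; r <= centroidRow + halfWidth; r++) {
                for (int c = centroidCol - halfWidth; c <= centroidCol + halfWidth; c++) {
                    files.add(fieldPrefix + r + "_" + c);   // one input file per cell
                }
            }
            return files;   // 9 files for halfWidth=1, 25 files for halfWidth=2
        }
    }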

  28. LIGO Example • Consider 3 (fictitious) channels: c, p, t • Operations are extract and concatenate • ex -i a -s t0 -e tb >ta • ex -i e -s te -e t1 >te • cat ta b c d te | filter • exch p <a -s t0 -e t1 • filter -v p,t • Examine whether derived metadata handles this concept

  29. Distributed Virtual Data Service • Will parallel the service architecture of the RLS • …but probably can't use a soft-state approach – needs consistency; can accept latency • Need a global namespace for collaboration-wide information and knowledge sharing • May use distributed database technology under the covers • Will leverage a distributed, structured namespace • Preliminary – not yet designed

  30. Distributed Virtual Data Service [Diagram: applications at local sites and Regional Centers use a distributed virtual data service built from multiple VDCs at Tier 1 centers.]

  31. End of presentation

  32. Supplementary Material

  33. Knowledge Management Architecture • Knowledge-based requests are formulated in terms of science data • E.g., "Give me this transform of channels c, p, and t over time range t0-t1" • Finder finds the data files • Translates the range "t0-t1" into a set of files • Coder creates an execution plan and defines derivations from known transformations • Can deal with missing files (e.g., file c in the LIGO example) • K-B request is formulated in terms of virtual datasets • Coder translates it into logical files • Planner translates these into physical files
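
  A hedged sketch of the Finder step above, assuming fixed-length per-channel files and a hypothetical channel-plus-start-time naming convention; both are assumptions for illustration.

    // Hedged sketch: translate a time range t0-t1 into the set of per-interval files
    // that cover it. File span and naming convention are assumptions.
    import java.util.ArrayList;
    import java.util.List;

    class Finder {
        static List<String> filesForRange(String channel, long t0, long t1, long fileSpan) {
            List<String> files = new ArrayList<>();
            long start = (t0 / fileSpan) * fileSpan;    // first file boundary at or before t0
            for (long t = start; t < t1; t += fileSpan) {
                files.add(channel + "." + t);           // one file per fileSpan-long interval
            }
            return files;
        }
    }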

  34. User View of the Virtual Data Grid (work of Scott Koranda, Miron Livny, and others)
  [Diagram: a master Condor job at a Caltech workstation drives a secondary Condor job on the Wisconsin (WI) Condor pool and reconstruction jobs on the NCSA Linux cluster; NCSA UniTree is a GridFTP-enabled FTP server. Steps as labeled in the figure:]
  2) Launch secondary job on WI pool; input files via Globus GASS
  3) 100 Monte Carlo jobs on Wisconsin Condor pool
  4) 100 data files transferred via GridFTP, ~1 GB each
  5) Secondary reports complete to master
  6) Master starts reconstruction jobs via Globus jobmanager on cluster
  7) GridFTP fetches data from UniTree
  8) Processed Objectivity database stored to UniTree
  9) Reconstruction job reports complete to master

  35. Production Pipeline, GriPhyN-CMS Demo (1 run = 500 events; SC2001 demo version: 1 event)
  Stage:         pythia       cmsim       writeHits    writeDigis
  CPU (1 run):   2 min        8 hours     5 min        45 min
  Output file:   truth.ntpl   hits.fz     hits.DB      digis.DB
  Data (1 run):  0.5 MB       175 MB      275 MB       105 MB

  36. GriPhyN: Virtual Data – Tracking Complex Dependencies • Dependency graph is: • Files: 8 < (1,3,4,5,7), 7 < 6, (3,4,5,6) < 2 • Programs: 8 < psearch, 7 < summarize, (3,4,5) < reformat, 6 < conv, (1,2) < simulate • [Figure: DAG of programs simulate -t 10, reformat -f fz, conv -I esd -o aod, summarize -t 10, and psearch -t 10 producing files 1–8; file8 is the requested file]

  37. Re-creating Virtual Data • To re-create file8, step 1: • simulate > file1, file2 • [Same dependency DAG figure as the previous slide]

  38. Re-creating Virtual Data • To re-create file8, step 2: • files 3, 4, 5, 6 are derived from file2 • reformat > file3, file4, file5 • conv > file6 • [Same dependency DAG figure]

  39. Re-creating Virtual Data • To re-create file8, step 3: • file7 depends on file6 • summarize > file7 • [Same dependency DAG figure]

  40. Re-creating Virtual Data • To re-create file8, final step: • file8 depends on files 1, 3, 4, 5, 7 • psearch < file1, file3, file4, file5, file7 > file8 • [Same dependency DAG figure]

  41. SDSS Galaxy Cluster Finding

  42. Cluster-finding Grid • Work of Yong Zhao, James Annis, and others

  43. Cluster-finding pipeline execution

  44. Virtual Data in CMS • The long-term vision of Virtual Data in CMS: CMS Note 2001/047, GriPhyN 2001-16

  45. CMS Data Analysis – the dominant use of Virtual Data in the future [Figure: per-event data hierarchy from raw data (simulated or real, roughly 50–300 KB per event) and calibration data, through the reconstruction algorithm (~100 KB per event) and jet finders 1 and 2 (~5–7 KB per event), down to tags 1 and 2 (~100–200 bytes per event); reconstructed data is produced by physics analysis jobs, and the legend distinguishes uploaded data, virtual data, and algorithms.]

  46. Topics – Planner • Does the planner have a queue? What do the presence and absence of a queue imply? • How is responsibility partitioned between the planner and the executor (cluster scheduler)? • How does the planner estimate times if it only has partial responsibility for when/where things run? • How does a cluster scheduler assign CPUs – dedicated or shared? • See Miron's email on NeST for more questions • Use of an execution profiler in the planner architecture? • Characterize the resource requirements of an app over time • Parameterize the resource requirements of an app w.r.t. its (salient) parameters

  47. Planner Context • Map of grid resources • Status of grid resources • State (up/down) • Load • Dedication (commitment of resource to VO or group based on policy) • Policy • Request Queue (w/ lookahead, or process sequentially?)

  48. CAS and SAS • Site Authorization Service • How does a physical site control the policy by which its resources get used? • How do a SAS and a CAS interact? • Can a resource interpret restricted proxies from multiple CASs? (Yes, but not from arbitrary CASs) • Consider MPI and MPICH-G jobs – how would the latter be handled? • Consider: if P2 schedules a whole DAG up front, the schedule may be based on outdated information

  49. Planner Architecture

  50. Policy • Focuses on security and configuration (controlled resource sharing/allocation) • Allocation example: • "CMS should get 90% of the resources at Caltech" • Issues of fair-share scheduling • How to factor in time quanta: CPU-hours, GB-days • Relationship to accounting
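
  One possible way to combine the time quanta mentioned above into a single fair-share usage figure; the formula and weights are assumptions for illustration only.

    // Hedged sketch: fold CPU-hours and GB-days into one weighted usage number
    // that a fair-share scheduler could compare against an allocation (e.g. 90%).
    class UsageRecord {
        double cpuHours;   // CPU time consumed by a group or VO
        double gbDays;     // storage occupancy integrated over time

        double weightedUsage(double cpuWeight, double storageWeight) {
            return cpuWeight * cpuHours + storageWeight * gbDays;
        }
    }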
