
SAM Job Submission

SAM Job Submission. Rod Walker, 10th May, GridPP, Manchester. What is SAM? sam submit … Data Management. Details. Conclusions. What is SAM? SAM is Sequential data Access via Meta-data. Project started in 1997 to handle D0’s needs for Run II data system.



  1. SAM Job Submission. Rod Walker, 10th May, GridPP, Manchester. • What is SAM? • sam submit … • Data Management • Details • Conclusions

  2. What is SAM? • SAM is Sequential data Access via Meta-data • Project started in 1997 to handle D0’s needs for Run II data system. • Current SAM team includes: • Andrew Baranovski, Lauri Loebel-Carpenter, Gabriele Garzoglio, Chris Jozwiak, Lee Lueking*, Carmenita Moore, Igor Terekhov, Julie Trumbo, Sinisa Veseli, Matthew Vranicar, Stephen P. White, Victoria White*. (*project leaders) • http://d0db.fnal.gov/sam

  3. SAM is a Distributed System [diagram] • Shared globally: Name Server, Database Server(s) (Central Database), Global Resource Manager(s), Log server. • Local: Station 1 Servers, Station 2 Servers, Station 3 Servers, … Station n Servers. • Shared locally: Mass Storage System(s). • Arrows indicate control and data flow.

  4. Job Submission • Executable • Runtime environment • Executable and associated files (user specific). • Experiment environment. • Data • Dataset definition • Select by metadata. • Converted to LFNs at submit time, i.e. datasets change. • Build SQL query… then… execute query.
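The point that a dataset definition is a stored metadata selection, resolved to a list of LFNs only at submit time, can be sketched as follows. This is a minimal illustration, not SAM's implementation: the `files` table, its columns, the file names, and the `resolve_dataset` helper are all hypothetical, and SQLite stands in for the central database.

```python
import sqlite3

def resolve_dataset(conn, where_clause, params=()):
    """Build the SQL query for a dataset definition, then execute it.

    Returns the LFNs that match the metadata selection at call time.
    (Hypothetical helper; table and column names are assumptions.)
    """
    sql = "SELECT lfn FROM files WHERE " + where_clause + " ORDER BY lfn"
    return [row[0] for row in conn.execute(sql, params)]

# Toy metadata catalogue with an invented schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (lfn TEXT PRIMARY KEY, run INTEGER, tier TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [("raw_run100_001.dat", 100, "raw"), ("reco_run100_001.dat", 100, "reco")],
)

# The dataset definition is just the selection, not a file list.
definition = ("tier = ? AND run >= ?", ("reco", 100))

first_submit = resolve_dataset(conn, *definition)

# A new file is declared to the catalogue between two submissions.
conn.execute("INSERT INTO files VALUES (?, ?, ?)", ("reco_run101_001.dat", 101, "reco"))

second_submit = resolve_dataset(conn, *definition)
```

The same definition yields a longer file list on the second submission, which is the sense in which "datasets change".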

  5. Dataset

  6. Job Running & Job Control (Run this exe | on this data) [diagram] • Components: Client, Job Manager (Project Master), Local SM (Station Master), Process Manager (SAM wrapper script), Batch System, User Task. • Numbered arrows: 1. sam submit --defname=mydata --script=myexe; 2. submit to SM; 3. invoke; 4. submit to BS; 5. submission ok; 6. start job; 7. started; 8. invoke; 9. setJobCount/stop; 10. resubmit. • Also labelled: jobEnd.

  7. File delivery [diagram] • Components: Stager, Job control, Replica Catalogue, BS, four User exe instances (physics & wrapper). • Labels: LFN, PFN, Fetch PFN, getNextFile(), Wait, Release, Finished, steps 1 to 4. • Reply to getNextFile(): "Here's the path to a local file: /sam/cache1/boo/mydata1.dat"
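From the user executable's side, the getNextFile() exchange amounts to a simple loop: ask for the next local path, process it, release it, and stop when nothing is left. The sketch below illustrates that loop only. `FakeStager`, `get_next_file`, `release`, and `run_user_task` are hypothetical Python names, not the SAM client API, and the stand-in stager "delivers" files instantly instead of fetching them.

```python
from collections import deque

class FakeStager:
    """Stand-in for the station's file delivery (hypothetical, for illustration)."""

    def __init__(self, lfns, cache_dir):
        self._pending = deque(lfns)
        self._cache_dir = cache_dir
        self.released = []

    def get_next_file(self):
        """Return the local path of the next delivered file, or None when finished."""
        if not self._pending:
            return None
        lfn = self._pending.popleft()
        return f"{self._cache_dir}/{lfn}"

    def release(self, path):
        """Tell the stager the job no longer needs this cached file."""
        self.released.append(path)

def run_user_task(stager, process):
    """Consume files one at a time; the task never sees how they were transferred."""
    count = 0
    while True:
        path = stager.get_next_file()
        if path is None:
            break
        try:
            process(path)
        finally:
            stager.release(path)  # release even if processing fails
        count += 1
    return count

stager = FakeStager(["mydata1.dat", "mydata2.dat"], "/sam/cache1/boo")
seen = []
count = run_user_task(stager, seen.append)
```

The executable only ever handles local paths such as /sam/cache1/boo/mydata1.dat; location lookup and transfer stay on the stager's side of the call.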

  8. Data Management • Replica Catalogue • Replication • Cache Management

  9. Replica Catalogue • Combined with Metadata in an Oracle database, although logically distinct. • Query on metadata to create a dataset • list of LFNs • Experiment specific (D0/CDF). • Query on LFN to locate physical file. • Generic replica catalogue. • node:/path/to/cache/myfile.dat
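The second kind of query, from LFN to physical location, can be pictured as a lookup that returns zero or more `node:/path` entries. The sketch below is illustrative only: the in-memory dictionary, the `Replica` tuple, and the helper names are assumptions, and only the simple `node:/path` form shown on the slide is parsed.

```python
from typing import NamedTuple

class Replica(NamedTuple):
    node: str
    path: str

def parse_location(location):
    """Split a 'node:/path/to/file' string into its node and path parts."""
    node, sep, path = location.partition(":")
    if not sep or not node or not path.startswith("/"):
        raise ValueError(f"not a node:/path location: {location!r}")
    return Replica(node, path)

def locate(catalogue, lfn):
    """Return every known physical copy of an LFN (empty list if none)."""
    return [parse_location(loc) for loc in catalogue.get(lfn, [])]

# Hypothetical catalogue contents; node names are invented.
catalogue = {
    "myfile.dat": [
        "node1:/path/to/cache/myfile.dat",
        "node2:/other/cache/myfile.dat",
    ],
}

copies = locate(catalogue, "myfile.dat")
missing = locate(catalogue, "unknown.dat")
```

Nothing in this lookup depends on the experiment's metadata, which is why the replica part can be generic while dataset creation stays experiment specific.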

  10. Replica Catalogue • 600,000 files, increasing at 3000/day, 120TB. • 150,000 in cache. • 5000 files per day replicated, 5000 destroyed. • ½ million queries per day (90% SELECT).

  11. Cache Management • 13.6TB, in several hundred individually managed caches. • 1TB in and out per day (10k files). • Cache lifetime ~10 days. • Various prescriptions for cache replacement, e.g. first in, first out; last use. • 70% hit rate (~6000 files/day).
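One of the replacement prescriptions mentioned, eviction by last use, can be sketched as a size-bounded least-recently-used cache. This is a generic LRU illustration under assumed names (`LruFileCache`, `request`), not the policy code used by SAM stations.

```python
from collections import OrderedDict

class LruFileCache:
    """Size-bounded file cache that evicts the least recently used file first."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self._files = OrderedDict()  # lfn -> size, least recently used first
        self.hits = 0
        self.misses = 0

    def request(self, lfn, size):
        """Return True on a cache hit; on a miss, make room and cache the file."""
        if lfn in self._files:
            self._files.move_to_end(lfn)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if size > self.capacity:
            return False  # file can never fit, so it is not cached
        while self.used + size > self.capacity:
            _, evicted_size = self._files.popitem(last=False)
            self.used -= evicted_size
        self._files[lfn] = size
        self.used += size
        return False

    def cached(self):
        """LFNs currently cached, least recently used first."""
        return list(self._files)

cache = LruFileCache(capacity_bytes=300)
for name in ["a", "b", "c", "a", "d"]:
    cache.request(name, 100)
```

In this run the second request for "a" is a hit, so when "d" arrives the file evicted is "b", the one unused for longest. As a rough cross-check of the slide's figures, 13.6TB of cache turning over at about 1TB per day implies a residence time of roughly two weeks, the same order as the quoted ~10 days.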

  12. Replication • Easy – use your favourite ftp. • BUT… what could go wrong? • Cache space – cache management. • Network, dead node, corrupted file – retries. • Dead disk, uncached – fail-over. • Sluggish robot, slow delivery – hold job. • A stroll through my log file.

  13. 05/07/02 16:01:52 imperial-test.SM.imperial-test 11698: Delivery status:
      Simple Status:
        Code: delivery error (Category SAM Internal)
        Severity level: ERROR
        Generated on 07 May 16:01:51 by eworker
        In the context: executed process samcp cab:d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result:
        EXIT CODE:256
        STDOUT: Executing Kerberos rcp: /usr/krb5/bin/rcp d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 /sam/cache20/lancs/boo
        STDERR: kshd: Logins currently disabled.
        trying normal rcp (/usr/bsd/rcp)
        WARNING: NO ENCRYPTION!
        d0cs015.fnal.gov: Connection refused, method name: samcp
        Recommended action: Please contact sam-admin@fnal.gov
      05/07/02 16:01:52 imperial-test.SM.imperial-test 11698: Delivery failed, scheduling retry in 3 seconds
      Annotation: Retry

  14. 05/07/02 16:02:35 imperial-test.SM.imperial-test 11698: Delivery status:
      Simple Status:
        Code: delivery error (Category SAM Internal)
        Severity level: ERROR
        Generated on 07 May 16:02:35 by eworker
        In the context: executed process samcp cab:d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result:
        EXIT CODE:256
        STDOUT: Executing Kerberos rcp: /usr/krb5/bin/rcp d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 /sam/cache20/lancs/boo
        STDERR: kshd: Logins currently disabled.
        trying normal rcp (/usr/bsd/rcp)
        WARNING: NO ENCRYPTION!
        d0cs015.fnal.gov: Connection refused, method name: samcp
        Recommended action: Please contact sam-admin@fnal.gov
      05/07/02 16:02:35 imperial-test.SM.imperial-test 11698: Maximum number of retrials exceeded. Will not retry again from this source!
      05/07/02 16:02:35 imperial-test.SM.Repler 11698: Will avoid locations: (cab:d0cs015.fnal.gov:/sam/cache/boo)
      05/07/02 16:02:35 imperial-test.SM.Repler 11698: No loc is preferred, selecting enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all(prl733.24)
      Annotation: Give up on this source. Avoid this location. Get another location from RC, and retry.
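The behaviour visible in these two log extracts (retry the same source after a short delay, then give up on it, mark it as avoided, and move to the next location from the replica catalogue) can be sketched as a small loop. This is an illustration of the pattern rather than the station's code: `deliver`, the `copy` callback, and the default of one retry with a 3 second delay are assumptions read off the log.

```python
import time

def deliver(locations, copy, max_retries=1, retry_delay=3.0, sleep=time.sleep):
    """Try each location in turn and return (working_location, avoided_locations).

    copy(location) must return True on success and False on failure.
    Retry counts and delay are assumptions, not documented SAM settings.
    """
    avoided = []
    for location in locations:
        for attempt in range(1 + max_retries):
            if copy(location):
                return location, avoided
            if attempt < max_retries:
                sleep(retry_delay)  # "scheduling retry in 3 seconds"
        avoided.append(location)  # "Will not retry again from this source!"
    raise RuntimeError(f"delivery failed from all locations: {avoided}")

# Simulated transfer: the cached copy refuses connections, tape storage works.
attempts = []

def fake_copy(location):
    attempts.append(location)
    return location.startswith("enstore:")

source, avoided = deliver(
    [
        "cab:d0cs015.fnal.gov:/sam/cache/boo",
        "enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all",
    ],
    fake_copy,
    sleep=lambda seconds: None,  # skip real waiting in this example
)
```

Passing `sleep` as a parameter keeps the example instant to run; in real use the default `time.sleep` would apply the delay.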

  15. 05/07/02 16:10:53 imperial-test.SM.imperial-test 11698: Delivery status:
      Simple Status:
        Code: OK (Category Enstore)
        Severity level: SUCCESS
        Generated on 07 May 16:10:53 by eworker
        In the context: executed process samcp enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_021.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result:
        EXIT CODE:0
        STDOUT:
          INFILE=/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_021.raw_p10.15.01_000
          OUTFILE=/sam/cache20/lancs/boo
          FILESIZE=1369320147
          LABEL=PRL859
          LOCATION=0000_000000000_0000067
          DRIVE=d0enmvr9a:/dev/rmt/tps0d1n
          DRIVE_SN=4560020042
          TRANSFER_TIME=160.38
          SEEK_TIME=73.47
          MOUNT_TIME=25.36
          QWAIT_TIME=65.79
          TIME2NOW=329.78
          STATUS=ok
        STDERR: Completed transferring 1369320147 bytes in 1 files in 329.720216036 sec.
          Overall rate = 3.96 MB/sec.
          Drive rate = 8.14 MB/sec.
          Network rate = 8.13 MB/sec.
          Exit status
      Annotation: Got it
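The rates in this log entry can be checked from the byte count and timings it reports. The short calculation below is a worked check, not part of SAM or Enstore. It assumes that "MB" in the log means 2^20 bytes, which is the interpretation that reproduces the logged figures, and that the drive rate is the file size divided by TRANSFER_TIME.

```python
MIB = 1024 * 1024  # assumption: the log's "MB" is 2**20 bytes

def rate_mib_per_s(num_bytes, seconds):
    """Average throughput in MiB per second."""
    return num_bytes / seconds / MIB

# Values copied from the log entry.
filesize = 1369320147       # FILESIZE
elapsed = 329.720216036     # "Completed transferring ... in ... sec."
transfer_time = 160.38      # TRANSFER_TIME
seek_time = 73.47           # SEEK_TIME
mount_time = 25.36          # MOUNT_TIME
qwait_time = 65.79          # QWAIT_TIME

overall = rate_mib_per_s(filesize, elapsed)       # about 3.96
drive = rate_mib_per_s(filesize, transfer_time)   # about 8.14

# Time accounted for by the four reported stages, in seconds.
accounted = transfer_time + seek_time + mount_time + qwait_time  # about 325.0

print(f"overall {overall:.2f} MiB/s, drive {drive:.2f} MiB/s, staged {accounted:.2f} s")
```

The overall rate is less than half the drive rate because queue wait, tape mount, and seek together take about as long as the transfer itself. That gap is the cost of delivery from tape rather than from a disk cache.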

  16. 05/07/02 15:46:09 imperial-test.SM.PBS BS Adapter 11698: Remembering that job 1760.gw39.hep.ph.ic.ac.uk for project 61983_sam_ is held
      Annotation: Hold in queue until 1st file delivered.
      --------------------------
      05/07/02 16:00:56 imperial-test.SM.imperial-test 11698: Delivery status:
      Simple Status:
        Code: OK (Category Enstore)
        Severity level: SUCCESS
        Generated on 07 May 16:00:56 by eworker
        In the context: executed process samcp enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_012.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result:
        EXIT CODE:0
        STDOUT:
          INFILE=/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_012.raw_p10.15.01_000
          OUTFILE=/sam/cache20/lancs/boo
          FILESIZE=788805399
          LABEL=PRL829
          LOCATION=0000_000000000_0000025
          DRIVE=d0enmvr9a:/dev/rmt/tps0d1n
          DRIVE_SN=4560020042
          TRANSFER_TIME=90.08
          SEEK_TIME=45.05
          MOUNT_TIME=27.14
          QWAIT_TIME=225.50
          TIME2NOW=392.28
          STATUS=ok
        STDERR: Completed transferring 788805399 bytes in 1 files in 392.221878052 sec.
          Overall rate = 1.92 MB/sec.
          Drive rate = 8.35 MB/sec.
          Network rate = 8.35 MB/sec.
          Exit status = 0., method name: samcp
        Recommended action: Please contact sam-admin@fnal.gov
      Annotation: File arrives
      ---------------------------
      05/07/02 1
      05/07/02 16:00:57 imperial-test.SM.PBS BS Adapter 11698: Will execute: qrls 1760.gw39.hep.ph.ic.ac.uk
      Annotation: Release

  17. Conclusions • Executable is stupid: no knowledge of data transfer. Job manager does the clever stuff. • SAM has a fully featured, tried and tested data management system. • No GSI, GridFTP, or CondorG as yet… but you need more than Gs to make a grid!
