When and How to Use Large-Scale Computing: CHTC and HTCondor

Presentation Transcript



When and How to Use Large-Scale Computing: CHTC and HTCondor

Lauren Michael, Research Computing Facilitator

Center for High Throughput Computing

STAT 692, November 15, 2013



Topics We’ll Cover Today

  • Why to Access Large-Scale Computing resources

  • CHTC Services and Campus-Shared Computing

  • What is High-Throughput Computing (HTC)?

  • What is HTCondor and How Do You Use It?

  • Maximizing Computational Throughput

  • How to Run R on Campus-Shared Resources



When should you use outside computing resources?

  • your computing work won’t run at all on your computer(s) (lack sufficient RAM, disk, etc.)

  • your computing work will take too long on your own computer(s)

  • you would like to off-load certain processes in favor of running others on your computer(s)



CHTC Services

Center for High Throughput Computing, est. 2006

  • Large-scale, campus-shared computing systems

    • high-throughput computing (HTC) grid and high-performance computing (HPC) cluster

    • all standard services provided free-of-charge

    • automatic access to the national Open Science Grid (OSG)

    • hardware buy-in options for priority access

    • information about other computing resources

  • Support for using our systems

    • consultation services, training, and proposal assistance

    • solutions for numerous software (including Python, Matlab, R)



HTCondor: CHTC’s R&D Arm

  • R&D for HTCondor and other HTC software

  • Services provided to the campus community

    • HTC Software

      • HTCondor: manage your compute cluster

      • DAGMan: manage computing workflows

      • Bosco: submit locally, run globally

    • Software Engineering Expertise & Consulting

      • CHTC-operated Build-and-Test Lab (BaTLab)

    • Software Security Consulting

      Your Problems become Our Research!



http://chtc.cs.wisc.edu

Researchers who use the CHTC are located all over campus (red buildings)



CHTC Staff

Director, [email protected]

(also OSG Technical Director and WIDs CTO)

Campus Support: [email protected]

2+ Research Computing Facilitators

  • Lauren Michael (lead), [email protected]

  • 3 Systems Administrators

  • 4-8 Part-time Students

  • HTCondor Development Team

  • OSG Software Team



HTC versus HPC

  • high-throughput computing (HTC)

    • many independent processes that can run on 1 or few processors (“cores” or “threads”) on the same computer

    • mostly standard programming methods

    • best accelerated by: access to as many cores as possible

  • high-performance computing (HPC)

    • sharing the workload of interdependent processes over multiple cores to reduce overall compute time

    • OpenMP and MPI programming methods, or multi-threading

    • requires: access to many servers of cores within the same tightly-networked cluster; access to shared files



“parallel” is confusing

  • essentially means: spread computing work out over multiple processors

  • Use of the words “parallel” and “parallelize” can apply to HTC or HPC when referring to programs

  • It’s important to be clear!



Topics We’ll Cover Today

  • Why to Access Large-Scale Computing resources

  • CHTC Services and Campus-Shared Computing

  • What is High-Throughput Computing (HTC)?

  • What is HTCondor and How Do You Use It?

  • Maximizing Computational Throughput

  • How to Run R on Campus-Shared Resources



What is HTCondor?

  • match-maker of computing work and computers

  • “job scheduler”

    • matches are made based upon necessary RAM, CPUs, disk space, etc., as requested by the user

    • jobs re-run if interrupted

  • works beyond “clusters” to coordinate distributed computers for maximum throughput

  • coordinates data transfers between users and distributed computers

  • can coordinate servers, desktops, and laptops



How HTCondor Works

[Diagram: the Central Manager (of the pool) matches Job ClassAds against Machine ClassAds. Jobs are submitted on Submit Node(s), whose queue holds entries such as job1.1 user1, job1.2 user1, job2.1 user2; jobs run on Execute Node(s). Input files flow from submit nodes to execute nodes, and output flows back.]


Submit nodes available to YOU



Basic HTCondor Submission

  • Prepare programs and files

  • Write submit file(s)

  • Submit jobs to the queue

  • Monitor the jobs

  • (Remove bad jobs)



Preparing Programs and Files

  • Make programs portable

    • compile code to a simple binary

    • statically-link code dependencies

    • consider CHTC’s tools for packaging Matlab, Python, and R

  • Consider using a shell script (or other “wrapper”) to run multiple commands for you

    • create a local install of software

    • set environment variables

    • then, run your code

  • Stage all files on a submit node
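The wrapper idea above can be sketched as a short shell script. Everything here is hypothetical (the tool name, tarball, and output file are stand-ins); the stand-in tarball is fabricated at the top only so the sketch is self-contained, where a real job would receive it via transfer_input_files:

```shell
#!/bin/sh
set -e

# --- Stand-in setup: on a real submit node this tarball would already be
# staged and listed in transfer_input_files; fabricated here so the sketch
# runs on its own. ---
mkdir -p mytool/bin
printf '#!/bin/sh\necho "processed: $*"\n' > mytool/bin/mytool
chmod +x mytool/bin/mytool
tar czf mytool.tar.gz mytool
rm -rf mytool

# --- The wrapper pattern itself ---
# 1. Create a local install of the software from the transferred tarball
tar xzf mytool.tar.gz
# 2. Set environment variables so the job finds the local install first
export PATH="$PWD/mytool/bin:$PATH"
# 3. Then, run your code, forwarding the submit file's "arguments" line
mytool "$@" > results.out
```

A script like this would be named as the executable in the submit file, with the tarball listed in transfer_input_files.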



HTC Components

1. Cut up computing work into many independent pieces

(CHTC can consult)

2. Make programs portable, minimize dependencies

(CHTC can consult, or may have prepared solutions)

3. Learn how to submit jobs

(CHTC can help you a lot!)

4. Maximize your overall throughput on available computational resources

(CHTC can help you a lot!)



Basic HTCondor Submit File

Basic jobs use the “vanilla” universe

# This is a comment

universe = vanilla

output = process.out

error = process.err

log = process.log

executable = cosmos

arguments = cosmos.in 4

should_transfer_files = YES

transfer_input_files = cosmos.in

when_to_transfer_output = ON_EXIT

request_memory = 100

request_disk = 100000

request_cpus = 1

queue

output and error are where system output and error will go

log is where HTCondor stores info about how your job ran

executable is your single program or a shell script

The program will be run as:

./cosmos cosmos.in 4

memory in MB and disk in KB

queue with no number after it will submit only one job



Basic HTCondor Submit File

# This is a comment

universe = vanilla

output = process.out

error = process.err

log = process.log

executable = cosmos

arguments = cosmos.in 4

should_transfer_files = YES

transfer_input_files = cosmos.in

when_to_transfer_output = ON_EXIT

request_memory = 100

request_disk = 100000

request_cpus = 1

queue

Initial File Organization

In folder test/

cosmos

cosmos.in

submit.txt



HTCondor Multi-Job Submit File

# This is a comment

universe = vanilla

output = $(Process).out

error = $(Process).err

log = $(Cluster).log

executable = cosmos

arguments = cosmos_$(Process).in

should_transfer_files = YES

transfer_input_files = cosmos_$(Process).in

when_to_transfer_output = ON_EXIT

request_memory = 100

request_disk = 100000

request_cpus = 1

queue 3

test/

cosmos

cosmos_0.in

cosmos_1.in

cosmos_2.in

submit.txt



HTCondor Multi-Folder Submit File

# This is a comment

universe = vanilla

InitialDir = $(Process)

output = $(Process).out

error = $(Process).err

log = /home/user/test/$(Cluster).log

executable = /home/user/test/cosmos

arguments = cosmos.in

should_transfer_files = YES

transfer_input_files = cosmos.in

when_to_transfer_output = ON_EXIT

request_memory = 100

request_disk = 100000

request_cpus = 1

queue 3

test/

cosmos

cosmos.in

submit.txt

0/

cosmos.in

1/

cosmos.in

2/

cosmos.in
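The per-job folder layout above can be generated with a short shell loop; the input file's content here is a stand-in so the sketch runs anywhere:

```shell
#!/bin/sh
# Recreate the slide's multi-folder layout: one numbered directory per job,
# each holding its own copy of cosmos.in.
set -e
printf 'example input\n' > cosmos.in   # stand-in input file

for i in 0 1 2; do
    mkdir -p "$i"
    cp cosmos.in "$i/cosmos.in"
done
```

With InitialDir = $(Process) in the submit file, job 0 then runs inside 0/, job 1 inside 1/, and so on.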



Submitting Jobs

[[email protected] test]$ condor_submit submit.txt

Submitting job(s)...

3 job(s) submitted to cluster 29747.

[[email protected] test]$



Checking the Queue

[[email protected] test]$ condor_q lmichael

-- Submitter: simon.stat.wisc.edu : <144.92.142.159:9620?sock=3678_5c57_3> : simon.stat.wisc.edu

 ID       OWNER     SUBMITTED   RUN_TIME   ST PRI SIZE CMD
 29747.0  lmichael  2/15 09:06  0+00:01:34 R  0   9.8  cosmos cosmos.in
 29747.1  lmichael  2/15 09:06  0+00:00:00 I  0   9.8  cosmos cosmos.in
 29747.2  lmichael  2/15 09:06  0+00:00:00 I  0   9.8  cosmos cosmos.in

3 jobs; 0 completed, 0 removed, 2 idle, 1 running, 0 held, 0 suspended

[[email protected]]$

  • View all user jobs in the queue: condor_q



Log Files

000 (29747.001.000) 02/15 09:29:17 Job submitted from host: <144.92.142.159:9620?sock=3678_5c57_3>

...

001 (29747.001.000) 02/15 09:33:59 Job executing on host: <144.92.142.153:9618?sock=17172_f1f3_3>

...

005 (29747.001.000) 02/15 09:39:01 Job terminated.

(1) Normal termination (return value 0)

Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage

Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage

Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage

Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage

0 - Run Bytes Sent By Job

0 - Run Bytes Received By Job

0 - Total Bytes Sent By Job

0 - Total Bytes Received By Job

Partitionable Resources :  Usage   Request  Allocated
   Cpus                 :               1         1
   Disk (KB)            :  225624   100000    645674
   Memory (MB)          :      85     1000      1024



Removing Jobs

  • Remove a single job: condor_rm 29747.0

  • Remove all jobs of a cluster: condor_rm 29747

  • Remove all of your jobs: condor_rm lmichael



Topics We’ll Cover Today

  • Why to Access Large-Scale Computing resources

  • CHTC Services and Campus-Shared Computing

  • What is High-Throughput Computing (HTC)?

  • What is HTCondor and How Do You Use It?

  • Maximizing Computational Throughput

  • How to Run R on Campus-Shared Resources



Maximizing Throughput

  • The Philosophy of HTC

  • The Art of HTC

  • Other Best-Practices



The Philosophy of HTC

  • break up your work into many ‘smaller’ jobs

    • single CPU, short run times, small input/output data

  • run on as many processors as possible

    • single CPU and low RAM needs

    • take everything with you; make programs portable

    • use the “right” submit node for the right “resources”

  • automate as much as you can

  • (share your processors with others to increase everyone’s throughput)



Success Stories

  • Edgar Spalding: studies effect of gene on plant growth outcomes

  • GeoDeepDive Project: extracts and compiles “dark data” from PDFs of publications in the geosciences

    We want HTC to revolutionize your research!



The Art of HTC

carrying out the philosophy, well

  • Tuning job requests for memory and disk

  • Matching run times to the maximum number of available processors

  • Automation



Tuning Job Resource Requests

Problem: Don’t know what your job needs?

  • If you don’t ask for enough memory and disk:

    • Your jobs will be kicked off for going over, and will have to be retried (though, HTCondor will automatically request more for you)

  • If you ask for too much:

    • Your jobs won’t match to as many available “slots” as they could



Tuning Job Resource Requests

Solution: Testing is Key!!!

  • Run just a few jobs at first to determine memory and disk needs from log files

    • If your first request is not enough, HTCondor will retry the jobs and request more until they finish.

    • It’s okay to request a lot (1 GB each) for a few tests.

  • Change the “request” lines to a better value

  • Submit a large batch



Time-Matching (submit file additions)



Time-Tuning: Batching

  • Problem: Jobs shorter than 5 minutes are bad for overall throughput

    • more time spent on matching and data transfers than on your job’s processes

    • Ideal time is between 5 minutes and 2 hours (OSG)

  • Solution: Use a shell script (or other method) to run multiple processes within a single job

    • avoids transfer of intermediate files between sequential, related processes

    • debugging can be a bit trickier
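The batching solution above can be sketched as a wrapper that runs several short tasks inside one job, so matchmaking and file-transfer overhead is paid once rather than once per task. The run_task function is a stand-in for a real short-running program:

```shell
#!/bin/sh
# Sketch of batching: several short tasks inside a single HTCondor job.
set -e

run_task() {
    # A real short-running process would go here; this stand-in just
    # records that the task ran.
    echo "task $1 done" >> batch.out
}

# Intermediate files stay on the execute node; only new files left at the
# end (here, batch.out) are transferred back.
for task in 1 2 3 4 5; do
    run_task "$task"
done
```

A script like this becomes the submit file's executable, with the individual task inputs listed in transfer_input_files.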



Time-Tuning: Checkpointing

  • The best way to run longer jobs without losing progress to eviction.

    Two Ways:

  • Compile your code with condor_compile and use the “standard” universe within HTCondor

  • Implement self-checkpointing

    *Consult HTCondor’s online manual or contact the CHTC for help
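Self-checkpointing (the second way above) can be sketched as a wrapper that records progress to a file after each step; all file names here are hypothetical. If the job is evicted and re-run with the checkpoint file re-staged, it resumes from the last recorded step instead of starting over:

```shell
#!/bin/sh
# Sketch of self-checkpointing: resume a long loop from a saved counter.
set -e

# If a checkpoint exists from an earlier (interrupted) run, resume after it.
start=1
if [ -f checkpoint.dat ]; then
    start=$(( $(cat checkpoint.dat) + 1 ))
fi

i=$start
while [ "$i" -le 10 ]; do
    echo "step $i" >> progress.out   # stand-in for the real work
    echo "$i" > checkpoint.dat       # record progress after every step
    i=$((i + 1))
done
```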



Automate Tasks

  • Use $(Process)

  • Shell scripts to run multiple tasks within the same job

    • including environment preparation

  • Hardcode arguments, calculate them (random number generation), or use parameter files/tables

  • Use HTCondor’s DAGMan feature

    • “directed acyclic graph”

    • create complex workflows of dependent jobs, and submit them all at once

    • additional helpful features: success checks and more
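A DAGMan input file, following the description above, lists jobs and the dependencies between them; the submit-file names below are hypothetical:

```
# mydag.dag -- B and C run after A completes; D runs after both B and C
JOB A prepare.sub
JOB B analyze1.sub
JOB C analyze2.sub
JOB D combine.sub
PARENT A CHILD B C
PARENT B C CHILD D
```

The whole workflow is then submitted at once with condor_submit_dag mydag.dag.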



Non-Throughput Considerations

Remember that you are sharing with others

  • “Be Kind to Your Submit Node”

    • avoid transfers of large files through the submit node (large: >10GB per batch; ~10 MB/job x 1000+ jobs)

      • transfer files from another server as part of your job (wget and curl)

      • compress where appropriate; delete unnecessary files

      • remember: “new” files are copied back to submit nodes

    • avoid running multiple CPU-intensive executables

  • Test all new batches, and scale up gradually

    • 3 jobs, then 100s, then 1000s, and so on
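The transfer advice above, sketched as job-side shell commands. The URL is a placeholder (the fetch lines are left commented out so the sketch runs offline), and the output file is a stand-in:

```shell
#!/bin/sh
# Be kind to your submit node: fetch big inputs from a separate file
# server, compress outputs, and delete anything that should not go back.
set -e

# Fetch large input from another server instead of routing it through
# the submit node (placeholder URL):
# wget http://files.example.edu/big_input.tar.gz
# tar xzf big_input.tar.gz

printf 'result data\n' > result.dat   # stand-in for real output
tar czf result.tar.gz result.dat      # compress what must go back
rm -f result.dat                      # remember: "new" files left behind
                                      # are copied back to the submit node
```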



Topics We’ll Cover Today

  • Why to Access Large-Scale Computing resources

  • CHTC Services and Campus-Shared Computing

  • What is High-Throughput Computing (HTC)?

  • What is HTCondor and How Do You Use It?

  • Maximizing Computational Throughput

  • How to Run R on Campus-Shared Resources



Running R on HTC Resources: The Best Way

  • Problem: R programs don’t easily compile to a binary

  • Solution: Take R with your job!

    CHTC has tools just for R (and Python, and Matlab)

  • Installed on CS/Stat submit nodes, simon, and CHTC submit nodes



1. Build R Code with chtc_buildRlibs

  • Copy your R code and any R library tar.gz files to the submit node

  • Run the following command:

    chtc_buildRlibs --rversion=sl5-R-2.10.1 \

    --<library1>.tar.gz,<library2>.tar.gz

  • R versions supported: 2.10.1, 2.13.1, 2.15.1

    (use the closest version below yours)

  • Get back sl5-RLIBS.tar.gz and sl6-RLIBS.tar.gz

    (you’ll use these in the next step)



2. Download the “ChtcRun” Package

  • download ChtcRun.tar.gz, according to the guide (wget)

  • un-tar it: tar xzf ChtcRun.tar.gz

  • View ChtcRun contents:

    process.template (submit file template)

    mkdag (script that will ‘create’ jobs based upon your staged data)

    Rin/ (example data staging folder)



3. Prepare data and process.template

  • Stage data as such:

    ChtcRun/
      data/
        1/      input.in <specific_files>
        2/      input.in <specific_files>
        job3/   input.in <specific_files>
        test4/  input.in <specific_files>
        shared/ <RLIBS.tar.gz> <program>.R <shared_files>

  • Modify process.template with respect to:

    • request_memory and request_disk, if you know

    • +WantFlocking = true OR +WantGlidein = true



4. Run mkdag and submit jobs

  • In ChtcRun, execute the mkdag script

    • (Examples at the top of “./mkdag --help”)

      ./mkdag --data=Rin --outputdir=Rout \

      --cmdtorun=soartest.R --type=R \

      --version=R-2.10.1 --pattern=meanx

    • “pattern” indicates a portion of a filename that you expect to be created by successful completion of any single job

  • A successful mkdag run will instruct you to navigate to the ‘outputdir’, and submit the jobs as a single DAG:

condor_submit_dag mydag.dag



5. Monitor Job Completion

  • Check jobs in the queue as they’re gradually added and completed (condor_q)

  • Check other files in your ‘outputdir’:

    Rout/  mydag.dag.dagman.out (updated table of job stats)

    1/  process.log, process.out, process.err, ChtcWrapper1.out

    2/  process.log, process.out, process.err, ChtcWrapper2.out

    …/

    After testing a small number of jobs, submit many!

    (up to many 10,000s; # submitted is throttled for you)



What Next?

  • Use a Stat server to submit shorter jobs to the CS pool.

  • Obtain access to simon.stat.wisc.edu from Mike Camilleri ([email protected]), and submit longer jobs to the CHTC Pool.

  • Meet with the CHTC to submit jobs to the entire UW Grid and to the national Open Science Grid.

    • chtc.cs.wisc.edu, click “Get Started”

      User support for HTCondor users at UW:

      [email protected]

