
Overview of Wisconsin Campus Grid

Presentation Transcript


  1. Overview of Wisconsin Campus Grid Dan Bradley dan@hep.wisc.edu Center for High-Throughput Computing

  2. Technology

  3. HTCondor
    • Example submit file:
        executable = a.out
        RequestMemory = 1000
        output = stdout
        error = stderr
        queue 1000
    • Campus grid scale: pools: 5, submit nodes: 50, user groups: 106, execute nodes: 1,600, cores: 10,000.
    • Firewall: open one port and use shared_port on the submit machine.
    • If execute nodes are behind NAT but have outgoing network access, use CCB.
    • (Diagram: submit machines flocking between Condor pools, with CCB reaching execute nodes behind NAT.)
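
A minimal condor_config sketch of the networking pieces named on this slide: shared_port so only one port must be opened in the submit machine's firewall, CCB for execute nodes behind NAT, and flocking to another pool. The host name is a placeholder; the exact settings used on the Wisconsin grid are not given in the slides.

    # Submit machine: route all daemon traffic through a single TCP port,
    # so the firewall only needs that one port open.
    USE_SHARED_PORT = True
    SHARED_PORT_PORT = 9618

    # Submit machine: let idle jobs flock to another Condor pool's
    # central manager (placeholder host name).
    FLOCK_TO = cm.other-pool.example.edu

    # Execute node behind NAT with outbound connectivity: register with
    # the pool's collector via CCB so submit machines can reach it.
    CCB_ADDRESS = $(COLLECTOR_HOST)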

  4. Accessing Files
    • No campus-wide shared file system.
    • HTCondor file transfer for most cases (see the submit-file sketch after this slide):
      • Send software + input files to the job
      • Grind, grind, …
      • Send output files back to the submit node
    • Some other cases:
      • AFS: works on most of campus, but not across OSG
      • httpd + SQUID(s): when transfer from the submit node doesn’t scale
      • CVMFS: read-only HTTP file system (see talk tomorrow)
      • HDFS: big datasets on lots of disks
      • Xrootd: good for access from anywhere; used on top of HDFS and local file systems
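
A hedged submit-file sketch of the transfer pattern above: software and inputs go out with the job, outputs come back to the submit node. The file names are illustrative, and the http:// URL simply shows how a SQUID-cached web server could serve a large input instead of the submit node.

    # Ship the software tarball and inputs to the job's scratch directory,
    # and bring the named output back when the job exits.
    executable              = run_analysis.sh
    transfer_input_files    = analysis_code.tar.gz, input.dat, http://squid.example.wisc.edu/big_reference.db
    transfer_output_files   = results.tar.gz
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    output                  = stdout
    error                   = stderr
    queue 100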

  5. Managing Workflows
    • A simple submit file works for many users.
    • We provide an example job wrapper script to help download and set up common software packages: MATLAB, Python, R.
    • DAGMan is used by many others. Common pattern (sketched after this slide):
      • User drops files into a directory structure
      • Script generates a DAG from that
      • Rinse, lather, repeat
    • Some application portals are also used, e.g. the NEOS Online Optimization Service.
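
A minimal sketch of the kind of DAG such a generator script might write, assuming one submit file per input directory plus a final merge step; the node and file names are hypothetical.

    # workflow.dag: written by a (hypothetical) script that scans the
    # user's directory structure and emits one node per dataset.
    JOB  dataset01  dataset01/job.sub
    JOB  dataset02  dataset02/job.sub
    JOB  merge      merge/job.sub

    # The merge node runs only after all dataset nodes succeed.
    PARENT dataset01 dataset02 CHILD merge

    # Retry transient failures a couple of times before giving up.
    RETRY dataset01 2
    RETRY dataset02 2

The workflow is then submitted with condor_submit_dag workflow.dag; the rinse-lather-repeat step is just regenerating and resubmitting the DAG as new data arrives.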

  6. Overflowing to OSG
    • glideinWMS: we run a glideinWMS “frontend” that uses the OSG glidein factories.
    • Appears to users as just another pool to flock to.
    • But jobs must opt in: +WantGlidein = True (see the sketch after this slide).
    • We customize glideins to make them look more like other nodes on campus: publish OS version, glibc version, CVMFS availability.
    • (Chart: million hours used.)
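
A hedged sketch of the opt-in from this slide: +WantGlidein = True is quoted directly, but the Requirements expression and the HasCVMFS attribute name are only illustrative, since the slide does not list the exact ClassAd attributes the customized glideins publish.

    # Allow this job to overflow onto OSG via the glideinWMS frontend.
    +WantGlidein = True

    # Match on attributes the glideins advertise to look like campus nodes.
    # OpSysAndVer is a standard machine attribute; HasCVMFS stands in for
    # whatever the site actually publishes for CVMFS availability.
    Requirements = (OpSysAndVer == "SL6") && (HasCVMFS =?= True)

    executable = a.out
    output     = stdout
    error      = stderr
    queue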

  7. A Clinical Health Application
    • Tyler Churchill: modeling cochlear implants to improve signal processing.
    • Used OSG + campus resources to run simulations that include important acoustic temporal fine structure, which is typically ignored because it is difficult to model.
    • “We can't do much about sound resolution given hardware limitations, but we can improve the integrated software. OSG and distributed high-throughput computing are helping us rapidly produce results that directly benefit CI wearers.”

  8. Engaging Users

  9. Engaging Users
    • Meet with individuals (PI + techs)
    • Diagram the workflow:
      • How much input, output, memory, time?
      • Suitable for exporting to OSG?
      • Where will the output go?
      • What software is needed? Licenses?
    • Tech support as needed
    • Periodic reviews

  10. Training Users
    • Workshops on campus:
      • New users can learn about HTCondor, OSG, etc.
      • Existing groups can send new students
      • Show examples of what others have done
    • Classes:
      • Scripting for scientific users: Python, Perl, submitting batch jobs, DAGMan

  11. User Resources
    • Many bring only their (big) brains:
      • Use central or local department submit nodes
      • Use only modest scratch space
    • Some have their own submit node:
      • Can attach their own storage
      • Control user access
      • Install system software packages

  12. Submitting Big
    • Kick-started the work with a big run in EC2, now continuing on campus.
    • Building a database to quickly classify stem cells and identify important genes active in cell states useful for clinical applications.
    • Victor Ruotti, winner of Cycle Computing’s Big Science Challenge.

  13. Users with Clusters
    • Three flavors:
      • Condominium: user provides cash, we do the rest
      • Neighborhood association: user provides space, power, cooling, machines; configuration is standardized
      • Sister cities: independent pools that people want to share, e.g. student computer labs (see the sketch after this slide)
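
For the “sister cities” case, one plausible mechanism is plain HTCondor flocking between the independent pools, as on slide 3. The slides do not spell out the configuration, so the host names and settings below are only an illustrative sketch.

    # On a submit machine in the student-lab pool: let idle jobs
    # spill over to the campus pool's central manager.
    FLOCK_TO = cm.campus-pool.example.edu

    # On the campus pool's central manager: accept flocked jobs
    # from the lab pool's submit machine.
    FLOCK_FROM = submit.lab-pool.example.edu
    ALLOW_WRITE = $(ALLOW_WRITE), submit.lab-pool.example.edu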

  14. Laboratory for Molecular and Computational Genomics
    • Cluster integrated into the campus grid.
    • Combined resources can map data representing the equivalent of one human genome in 90 minutes.
    • Tackling challenging cases such as the important maize genome, which is difficult for traditional sequence assembly approaches.
    • Using a whole-genome, single-molecule optical mapping technique.

  15. Reaching Further
    • (Chart: Research Groups by Discipline.)
