
JLAB Computing Facilities Development



  1. JLAB Computing Facilities Development – Ian Bird, Jefferson Lab, 2 November 2001

  2. Jefferson Lab Mass Storage & Farms, August 2001
  • Reconstruction & Analysis Farm: 350 Linux CPUs, ~10 K SPECint95, with a 2 TB farm cache
  • Farm batch system: LSF + local Java layer + web interface
  • Tape storage system: 12,000-slot STK silos with 8 Redwood, 10 9940, and 10 9840 drives
  • 10 (Solaris, Linux) data movers (DM1 … DM10) with ~300 GB buffer each, on Gigabit Ethernet or Fibre Channel; software: JASMine
  • JASMine-managed mass storage sub-systems serve clients through cache pools: the 2 TB farm cache, 15 TB of experiment cache pools, a 0.5 TB LQCD cache pool, plus 10 TB of unmanaged disk pools (summarised as a data structure below)
  • Lattice QCD cluster(s): 40 Alpha Linux, 256 P4 Linux (~Mar 02) – 0.5 Tflop; batch system: PBS + web portal
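  For quick reference, the layout on this slide can be restated as a plain data structure. The figures are the slide's own; the sketch is only a summary, not part of the JASMine software.

    # Summary of the August 2001 layout; all capacities and counts are the
    # figures quoted on the slide above.
    LAYOUT = {
        "tape_store": {
            "silo_slots": 12_000,
            "drives": {"Redwood": 8, "9940": 10, "9840": 10},
            "data_movers": 10,          # DM1 ... DM10, ~300 GB buffer each
            "software": "JASMine",
        },
        "cache_pools_tb": {"farm": 2, "experiment": 15, "lqcd": 0.5},
        "unmanaged_disk_tb": 10,
        "farm": {"linux_cpus": 350, "specint95": 10_000,
                 "batch": "LSF + local Java layer + web interface"},
        "lqcd": {"alpha_linux": 40, "p4_linux_mar02": 256,
                 "batch": "PBS + web portal"},
    }

    print(sum(LAYOUT["cache_pools_tb"].values()))  # 17.5 TB of managed cache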

  3. Tape storage
  Current:
  • 2 STK silos (12,000 tape slots)
  • 28 drives: 8 Redwood, 10 9840, 10 9940
  • Redwoods to be replaced by 10 more 9940s in FY02
  • 9940 cartridges hold 60 GB @ 10 MB/s
  Outlook:
  • (Conservative?) tape roadmap has > 500 GB tapes by FY06 at speeds >= 60 MB/s (see the rough estimate below)
  • FNAL model (expensive ADIC robots + lots of commodity drives) does not work – they are moving to STK + 9940s
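  To put these numbers in perspective, a back-of-envelope estimate of silo capacity and cartridge fill time, using only the figures quoted above (12,000 slots, 60 GB 9940 cartridges at 10 MB/s, and the FY06 roadmap point). A rough sketch, not a procurement plan.

    SLOTS = 12_000  # tape slots across the two STK silos

    def silo_capacity_tb(cartridge_gb):
        """Total capacity in TB if every slot holds one cartridge."""
        return SLOTS * cartridge_gb / 1000

    def hours_per_cartridge(cartridge_gb, drive_mb_per_s):
        """Time to stream one full cartridge at the drive's native rate."""
        return cartridge_gb * 1000 / drive_mb_per_s / 3600

    print(silo_capacity_tb(60))             # ~720 TB with today's 9940 media
    print(silo_capacity_tb(500))            # ~6,000 TB on the FY06 roadmap
    print(hours_per_cartridge(60, 10))      # ~1.7 h to fill a 9940 cartridge
    print(hours_per_cartridge(500, 60))     # ~2.3 h to fill a roadmap cartridge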

  4. Disk storage
  Current:
  • ~30 TB of disk
  • Mix of SCSI and IDE disk on Linux servers
  • ~1 TB per dual-CPU server with a Gigabit interface – matches load, I/O, and network throughput
  • IDE costs $10K / TB, with performance as good as SCSI
  Outlook:
  • This model scales by a small factor (10? but not 100?)
  • Need a reliable global filesystem (not NFS)
  • Tape costs will remain ~ a factor of 5 cheaper than disk for some time: a fully populated silo with 10 drives today is ~$2K/TB, disk ~$10K/TB
  • Investigations in hand to consider large disk farms to replace tape; issues are power, heat, manageability, error rates
  Consider:
  • Compute more, store less
  • Store metadata, re-compute data as needed rather than storing and moving it; computing is (and will become more and more) cheaper than storage
  • Good for e.g. Monte Carlo – generate as needed on modest-sized (but very powerful) farms (see the illustrative cost sketch below)
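  An illustrative cost comparison behind the "compute more, store less" point, using the $/TB figures quoted above. The Monte Carlo dataset and metadata sizes are hypothetical placeholders, not measurements.

    TAPE_COST_PER_TB = 2_000    # fully populated silo with 10 drives (slide figure)
    DISK_COST_PER_TB = 10_000   # IDE disk on Linux servers (slide figure)

    def storage_cost(tb, cost_per_tb):
        return tb * cost_per_tb

    # Storing a hypothetical 100 TB Monte Carlo sample:
    print(storage_cost(100, TAPE_COST_PER_TB))   # $200,000 on tape
    print(storage_cost(100, DISK_COST_PER_TB))   # $1,000,000 on disk

    # Alternative: keep only the metadata (generator versions, seeds, run
    # cards) and regenerate events on the farm when they are needed.
    METADATA_TB = 0.01                           # hypothetical metadata volume
    print(storage_cost(METADATA_TB, DISK_COST_PER_TB))  # ~$100 on disk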

  5. Clusters
  Current:
  • Farm: 350 Linux CPUs
  • Latest: 2 dual 1 GHz systems in a 1U box (i.e. 4 CPUs)
  • Expect modest expansion over the next few years (up to 500 CPUs?)
  • LQCD: ~40 Alphas now, 256 P4s in FY02, growth to 500–1000 CPUs in 5 years (goal is 10 TFlop)
  • We know how to manage systems of this complexity with relatively few people
  Outlook:
  • Moore's law (still works) – expect raw CPU to remain cheap (rough projection below)
  • Issues will become power and cooling
  • Several "server blade" systems being developed using Transmeta (low-power) chips – a 3U rack backplane with 10 dual systems slotted in – prospect of even denser compute farms
  • MC farm on your desk? – generate on demand
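  A rough Moore's-law projection of what the LQCD goal implies. The 18-month doubling time and the node counts beyond those quoted above are assumptions for illustration only.

    def projected_tflops(start_tflops, years, doubling_years=1.5):
        """Capacity if per-node performance doubles every `doubling_years`."""
        return start_tflops * 2 ** (years / doubling_years)

    # The FY02 cluster is ~0.5 Tflop from 256 P4 nodes. Five years of
    # per-node performance growth alone, on the same footprint:
    print(projected_tflops(0.5, 5))               # ~5 Tflop

    # Growing the node count towards ~1000 closes the gap to the 10 Tflop goal:
    print(projected_tflops(0.5, 5) * 1000 / 256)  # ~20 Tflop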

  6. Intel Linux Farm (photos)
  • First purchases: 9 duals per 24" rack
  • Last summer: 16 duals (2U) + 500 GB cache (8U) per 19" rack
  • Recently: 5 TB IDE cache disk (5 x 8U) per 19" rack

  7. LQCD Clusters (photos)
  • 16 single Alpha 21264, 1999
  • 12 dual Alpha (Linux Networks), 2000

  8. Networks
  Current:
  • Machine room & campus backbone is all Gigabit Ethernet
  • 100 Mbit to desktops
  • Expect affordable 10 Gb in 1–2 years
  • WAN (ESnet) is OC-3 (155 Mb/s)
  Outlook:
  • Less clear – expect at least 10 Gb and probably another generation (100 Gb?) by Hall D
  • Expect ESnet to be >= OC-12 (622 Mb/s)
  • Would like WAN speeds comparable to LAN speeds for successful distributed (grid) computing models (see the transfer-time estimate below)
  • We are involved in an ESnet/Internet2 task force to ensure bandwidth is sufficient on LHC (= Hall D) timescales
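  A quick estimate of why WAN bandwidth matters for a distributed (grid) model: the time to move 1 TB over the link speeds quoted above, ignoring protocol overhead and contention.

    def hours_to_move(tb, link_mbit_per_s):
        """Hours to transfer `tb` terabytes over a link of the given rate."""
        bits = tb * 8e12
        return bits / (link_mbit_per_s * 1e6) / 3600

    print(hours_to_move(1, 155))     # OC-3   : ~14 h per TB
    print(hours_to_move(1, 622))     # OC-12  : ~3.6 h per TB
    print(hours_to_move(1, 1000))    # GigE   : ~2.2 h per TB
    print(hours_to_move(1, 10000))   # 10 GbE : ~0.2 h per TB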

  9. Facilities
  Current:
  • Computer Center is close to full – especially with the LQCD cluster
  New building:
  • Approved (CD-0) to start design in FY03
  • Expect construction FY04, occupation FY05?
  • Extension to CEBAF Center; will include a 10,000 ft² machine room (current is < 3,000 ft² and full)
  • Will leave the 2 silos in place, but move other equipment
  • Designed to be extensible if needed
  • Need this space to allow growth and sufficient cooling (there is now a factor 2–5 gap between computing power densities and cooling abilities…)
  • Building will also provide space for ~150–200 people

  10. Software
  Mass storage software:
  • JASMine – written at JLAB, designed with Hall D data rates in mind
  • Fully distributed & scalable – 100 MB/s today, limited only by the number and speed of drives
  • Will be part of JLAB Grid software – the cache manager component works remotely
  • Demo system JLAB–FSU under construction
  Batch software:
  • Farm: LSF with a Java layer
  • LQCD: PBS with a web portal
  • Merge these technologies to provide grid portal access to compute and storage resources, built on Condor-G, Globus, SRB, and JLAB web services as part of the PPDG collaboration (a hypothetical sketch of this layering follows below)
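  A minimal sketch of the layering described above: a single portal entry point that routes jobs to the farm (LSF) or LQCD (PBS) batch system after staging input from the JASMine-managed store. All class and method names are hypothetical illustrations; the real JASMine, Condor-G, Globus, and SRB interfaces are not shown.

    class CacheManager:
        """Stands in for the JASMine cache manager component."""
        def stage_in(self, tape_path, cache_pool):
            # Copy the file from the silo to the requested cache pool.
            print(f"staging {tape_path} -> {cache_pool}")
            return f"{cache_pool}/{tape_path.split('/')[-1]}"

    class GridPortal:
        """Stands in for the planned portal over both batch systems."""
        BATCH = {"farm": "LSF", "lqcd": "PBS"}

        def __init__(self, cache):
            self.cache = cache

        def submit(self, cluster, script, inputs):
            # Stage inputs first, then hand the job to the right scheduler.
            local = [self.cache.stage_in(f, f"/cache/{cluster}") for f in inputs]
            print(f"submitting {script} to {self.BATCH[cluster]} with {local}")

    portal = GridPortal(CacheManager())
    portal.submit("farm", "recon.sh", ["/mss/halld/run123/raw.evt"])  # hypothetical paths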

  11. Summary
  • Technology and facilities outlook is good
  • The Hall D computing goals will be readily achievable
  • Actual facilities design and ramp-up must be driven by a well-founded Hall D computing model
  • The computing model should be based on a distributed system
  • Make use of appropriate technologies
  • The design of the computing model needs to be started now!
