  Israel ATLAS Tier-2 Status, April 2011, Lorne Levinson

  Israel HEP community
  • ATLAS is the only LHC experiment in which we participate
    • also Phenix (Heavy Ion @BNL), ILC, ZEUS
  • Israel is "1.35% of ATLAS" (MoU pledge, authors, common fund)
  • 25-30 people doing physics analysis
  • 3 sites:
    • Tel Aviv University, Tel Aviv (1956): a university
    • The Technion, Israel Institute of Technology, Haifa (1924): a university
    • Weizmann Institute of Science, Rehovot (1934): a research institute (Biology, Chemistry, Physics, Math & CS) with a graduate school (no undergrads)
  • longest travel is between Weizmann and Technion: 2 hours office-to-office

  Organization
  • we are a distributed Tier2/Tier3
  • each site combines Tier2 and Tier3 resources in the same cluster
  • all resources shared flexibly between T2 and T3 (Lustre/Storm)
  • single management and budget, single purchasing
  • three sites as identical as possible
  • Steering Committee for overall policy
  • Management & Operations team for the three sites
  • stable funding approved until 2012

  Storage
  Continues to be the biggest reliability issue.
  • Our hardware is now stable:
    • replaced DDN 6620s with DDN 9900
    • Fully redundant, 300 disk slots, 8 x 8Gb/s FC ports → 5GB/s
    • two Lustre "OSS" servers
    • WI servers with 10Gb/s to cluster; TAU and Technion will install 10G in April
  • Gave up on Thumpers+Lustre and Thumpers+iSCSI+Lustre.
    • We NFS-mount Thumpers with Solaris+ZFS for extra "archive" storage, home directories or /opt/exp_soft
  • Lustre + Storm: the problem is that the Storm team does not test new Storm releases on Lustre
    • Storm-Lustre community must solve this
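
As a rough check of the port arithmetic on this slide, the sketch below converts the nominal Fibre Channel port rate into an aggregate byte rate and compares it with the quoted 5GB/s. It ignores line coding and protocol overhead, which the slides do not discuss, so the "raw" figure is an upper bound and not a measurement.

```python
# Nominal aggregate bandwidth of the FC ports versus the figure quoted on the slide.
FC_PORTS = 8               # number of FC ports, from the slide
GBIT_PER_PORT = 8.0        # nominal Gb/s per port, from the slide
QUOTED_GBYTE_PER_S = 5.0   # aggregate GB/s quoted on the slide


def raw_aggregate_gbyte_per_s(ports: int, gbit_per_port: float) -> float:
    """Nominal aggregate in GB/s, ignoring line coding and protocol overhead."""
    return ports * gbit_per_port / 8.0  # 8 bits per byte


raw = raw_aggregate_gbyte_per_s(FC_PORTS, GBIT_PER_PORT)
fraction = QUOTED_GBYTE_PER_S / raw
print(f"raw aggregate: {raw:.1f} GB/s, quoted: {QUOTED_GBYTE_PER_S:.1f} GB/s")
print(f"quoted figure as a fraction of raw: {fraction:.3f}")
```

The quoted rate sits well below the nominal port total, which is consistent with the ports not being the limiting component.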

  Storm/Lustre
  • Storm allows LCG SRM storage and our local global file name space to share the same physical storage.
    • No rigid boundary
    • Jobs in cluster can do Linux file I/O to read SRM files
  • Storm can run over Lustre (open source) or GPFS (IBM)
  • Lustre:
    • Object Storage Targets serve (stripes of) file data
    • Meta-Data Server holds directories
    • redundant failover of MDSs will soon be supported
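
To illustrate what "stripes of file data" means, here is a minimal sketch of round-robin (RAID-0 style) striping: a byte offset in a file is mapped to one of the file's storage targets and to an offset inside that target's object. The function name and the stripe size and count used in the example are illustrative assumptions, not the sites' actual settings.

```python
from typing import Tuple

MIB = 1024 * 1024


def locate_stripe(offset: int, stripe_size: int, stripe_count: int) -> Tuple[int, int]:
    """Map a file byte offset to (target index, offset inside that target's object).

    Assumes simple round-robin striping: stripe 0 goes to target 0, stripe 1 to
    target 1, and so on, wrapping around after `stripe_count` targets.
    """
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; stripe_size and stripe_count must be > 0")
    stripe_number = offset // stripe_size          # which stripe of the file
    target_index = stripe_number % stripe_count    # which target holds that stripe
    full_rounds = stripe_number // stripe_count    # stripes already stored on that target
    object_offset = full_rounds * stripe_size + offset % stripe_size
    return target_index, object_offset


# Example with hypothetical settings: 1 MiB stripes over 4 targets.
print(locate_stripe(5 * MIB + 10, MIB, 4))
```

Because consecutive stripes land on different targets, a large sequential read is served by several targets in parallel.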

  Storage – installed SRM + local capacity (net TB)

  Group disks
  • We are hosting four ATLASGROUPDISK areas:
    • Muon performance (Technion)
    • Top (Weizmann)
    • Heavy Ion (Weizmann)
    • Standard Model (TAU) (empty)

  CPU
  • Last purchase was dual Intel E5520 quad-core
  • May delivery purchase is dual Intel X5650 hex-core
    • again 4 motherboards per 2U box with redundant power supply
  • We benefit greatly from other groups placing some of their cores in our cluster:
    • Weizmann: ATLAS+Phenix/Heavy-Ion, HEP Theory, Condensed matter
    • Technion: HEP Theory and Bio-informatics
    • TAU: HEP Theory
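
The core density implied by the two purchases can be worked out directly from the figures on this slide. The sketch below counts physical cores only (Hyper-Threading is ignored) and assumes the earlier purchase used the same 4-motherboard 2U chassis, as the word "again" suggests; the class and function names are illustrative.

```python
from dataclasses import dataclass

MOTHERBOARDS_PER_2U = 4  # from the slide


@dataclass(frozen=True)
class NodeConfig:
    name: str
    sockets: int
    cores_per_socket: int

    @property
    def cores_per_node(self) -> int:
        return self.sockets * self.cores_per_socket


def cores_per_2u(cfg: NodeConfig, boards: int = MOTHERBOARDS_PER_2U) -> int:
    """Physical cores in one 2U box holding `boards` motherboards."""
    return cfg.cores_per_node * boards


last = NodeConfig("dual E5520 quad-core", sockets=2, cores_per_socket=4)
new = NodeConfig("dual X5650 hex-core", sockets=2, cores_per_socket=6)

for cfg in (last, new):
    print(f"{cfg.name}: {cfg.cores_per_node} cores/node, {cores_per_2u(cfg)} cores per 2U")
```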

  Services nodes
  Virtualize most services
  • Two 8-core servers, 48GB
  • Failover
  • Easier management
  • VM images
  • Roll-back
  • Image sharing
  • Easier testing: temp machines
  • May delivery of HW
  • Deciding among: VMware, Xen, Citrix, KVM
  • SE not included

  Networking
  Our networking is not good
  • Geant connection is 2 x 1.5G (subscribed on 2 x 2.5G infrastructure)
  • "Political" limits: TAU 500M, Technion 350M, WI 400M
    • Because a 1G line is shared with institute traffic and the shared router is not really able to do 1G duplex
  • We suspect that the gross mismatch with SARA/NIKHEF's 10G causes failed connections due to dropped packets.
    • Lowering the number of files and streams to avoid dropped packets leaves us with even lower net bandwidth
  • Expensive because it is an undersea fiber and one (Italian) company owns the fibers.
    • An Israeli competitor is installing another fiber now
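
To put the per-site limits in perspective, the sketch below estimates the best-case time to move a dataset at each site's cap and compares the sum of the caps with the Geant link. It assumes the "M" and "G" figures are megabits and gigabits per second and that a transfer fills the cap with no protocol overhead or packet loss, so real transfers would take longer. The 1 TB dataset size is an arbitrary example.

```python
# Per-site caps and international link capacity, as read from the slide (assumed Mb/s).
SITE_CAP_MBIT = {"TAU": 500, "Technion": 350, "WI": 400}
GEANT_MBIT = 2 * 1500


def transfer_hours(terabytes: float, mbit_per_s: float) -> float:
    """Best-case hours to move `terabytes` (decimal TB) at `mbit_per_s`."""
    bits = terabytes * 8e12
    seconds = bits / (mbit_per_s * 1e6)
    return seconds / 3600.0


for site, cap in SITE_CAP_MBIT.items():
    print(f"{site}: 1 TB takes at least {transfer_hours(1.0, cap):.1f} h at {cap} Mb/s")

total_caps = sum(SITE_CAP_MBIT.values())
print(f"sum of site caps: {total_caps} Mb/s of {GEANT_MBIT} Mb/s Geant capacity")
```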

  Networking

  GEANT

  Networking plans
  May 2011(?):
  • Increase international connection from 3Gb/s to 4Gb/s.
    • 5G might be possible later this year, but not budgeted.
  • Replace old routers at entrances to institutes with 10G-capable equipment.
    • This should increase our throughput and reliability and allow us to actually use a major share of the 1G bandwidth to the sites
  • Negotiating 10G academic backbone
  • Could have 10G to Geant in spring 2012

  SAM/NAGIOS
  • Our NGI did not take on the SAM/NAGIOS monitoring responsibility
  • After the new NAGIOS tests replaced the SAM tests, we received no alerts on failed tests.
    • This was a severe problem
  • Finally, in December, EGI, our NGI and we agreed that we would deploy a NAGIOS test service for Israel until our NGI is able to do so.
    • The only functioning grid sites in Israel are our 3 ATLAS sites
  • Our NAGIOS service was up and running in January.
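
For readers unfamiliar with how a Nagios check raises an alert, the sketch below follows the standard plugin convention: one status line on stdout plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). It is not the test service deployed for Israel; the function name, its inputs and the thresholds are hypothetical.

```python
from typing import Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(failed: int, total: int,
             warn_fraction: float = 0.0, crit_fraction: float = 0.5) -> Tuple[int, str]:
    """Turn a count of failed tests into a (return code, status line) pair.

    Thresholds are hypothetical: any failure warns, more than half failing is critical.
    """
    if total <= 0 or failed < 0 or failed > total:
        return UNKNOWN, "UNKNOWN - no usable test results"
    fraction = failed / total
    if fraction > crit_fraction:
        code = CRITICAL
    elif fraction > warn_fraction:
        code = WARNING
    else:
        code = OK
    return code, f"{LABELS[code]} - {failed}/{total} tests failed"


# Demonstration only; a real plugin would print the line and exit with the code.
for failed, total in [(0, 10), (1, 10), (6, 10), (0, 0)]:
    print(evaluate(failed, total))
```

A real plugin would end with `sys.exit(code)` so that the Nagios server can act on the result.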

  Upcoming work
  • Deploy Zenoss fabric and service monitor on all three clusters
    • currently in test at Weizmann
  • Deploy Puppet configuration system on all three clusters
    • We gave up on Quattor after finally getting it to run: it was clearly unsustainable
    • Currently used for worker nodes at Weizmann
    • Needs to include gLite nodes
  • Virtualization of services (excl. SE)
  • Address Storm "untested new version" problem

  End