The University of Texas at Arlington (UTA) contributes to the ATLAS project as a partner in SWT2, with a focus on High-Energy Physics (HEP) computing. This report covers UTA's collaborations, including Panda development, and the status of the Phase I and Phase II cluster implementations. UTA uses its computing and storage resources to support ATLAS MC production and analysis. The report also outlines ongoing software projects, network capacity, and planned expansions.
UTA Site Report • Jae Yu, Univ. of Texas, Arlington • 6th DOSAR Workshop, University of Mississippi • Apr. 17 – 18, 2008
Introduction • UTA is a partner of ATLAS SWT2 • Actively participating in ATLAS production • Kaushik De is co-leading Panda development • Phase I implementation at UTACC completed and running • Phase II implementation at Physics and Chemistry Research Building completed • MonALISA based OSG Panda monitoring back online • DDM monitoring project completed • Working on Achelois, the ATLAS Operations Support System • HEP group working with other disciplines on shared use of existing computing resources • Interacting with the campus HPC community • Working with HiPCAT, the Texas grid community • Co-Linux Condor cluster setup has begun but is progressing very slowly
UTA DPCC – The 2003 Solution • UTA HEP-CSE + UTSW Medical joint project through NSF MRI • Primary equipment for D0 re-reconstruction and MC production up to 2005 • Now primarily participating in ATLAS MC production and reprocessing as part of SWT2 resources • Other disciplines also use this facility but at a minimal level • Biology, Geology, UTSW medical, etc. • Hardware Capacity • PC based Linux system assisted by some 70TB of IDE disk storage • 3 IBM PS157 Series Shared Memory systems • Now being turned into a T3 cluster
UTA – DPCC • 84 P4 Xeon 2.4GHz CPU = 202 GHz • 5TB of FBC + 3.2TB IDE Internal • GFS File system • 100 P4 Xeon 2.6GHz CPU = 260 GHz • 64TB of IDE RAID + 4TB internal • NFS File system • Total CPU: 462 GHz • Total disk: 76.2TB • Total Memory: 168Gbyte • Network bandwidth: 68Gb/sec • HEP – CSE Joint Project • DØ+ATLAS • CSE Research
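The totals on this slide follow from the per-partition figures. A minimal Python sketch of that arithmetic is given below; the data layout and function names are illustrative only and are not part of any UTA tooling.

```python
# Per-partition figures copied from the slide; the layout and names are illustrative.
partitions = [
    {"label": "GFS partition", "cpus": 84, "ghz": 2.4, "disk_tb": [5.0, 3.2]},
    {"label": "NFS partition", "cpus": 100, "ghz": 2.6, "disk_tb": [64.0, 4.0]},
]

def total_ghz(parts):
    """Aggregate clock rate: number of CPUs times clock speed, summed over partitions."""
    return sum(p["cpus"] * p["ghz"] for p in parts)

def total_disk_tb(parts):
    """Sum of all disk pools (shared and internal) across partitions."""
    return sum(sum(p["disk_tb"]) for p in parts)

print(f"Total CPU:  {total_ghz(partitions):.1f} GHz")
print(f"Total disk: {total_disk_tb(partitions):.1f} TB")
```

The first partition contributes 84 × 2.4 = 201.6 GHz, which the slide rounds to 202 GHz, so the quoted 462 GHz total is the rounded sum.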
UTA Tier3 • Conversion of UTA_DPCC • Commissioned 2003 • NSF-MRI grant between CSE, HEP, UT-SWM • Shared resource • 160 cores (Dual Processor 32bit Xeon) • 1GB RAM / core • 45 TB NAS based storage • Rocks 4.3 with SLC 4.6 • Several head nodes
Tier 3 specific customizations • Exploring Scalla topology • Provide rootd storage to desktops • Provide a platform for experimenting with PROOF • Testbed for T2-T3 interactions • Still maintain production activities • Panda Development
SWT2 • Joint effort between UTA, OU, LU and UNM • 2000 ft² in the new building • Designed for 3000 1U nodes • Could go up to 24k cores • 1MW total power capacity • Cooling with 5 Liebert units
UTA SWT2 Clusters • UTA_SWT2 (Phase I) • Supporting MC Production only • Initial deployment of ATLAS project funds • SWT2_CPB (Phase II) • Supporting MC production and analysis • Main cluster for UTA • Project Funded • UTA_DPCC • Supporting MC production and local analysis • Tier3 prototype • Supporting production system software development • NSF MRI Grant
Installed SWT2 Phase I Equipment • 160 Node cluster (Dell SC1425) • 320 cores (3.2GHz Xeon EM64T) • 2GB RAM/core • 160GB SATA local disk drive • 8 Head Nodes (Dell 2850) • Dual 3.2 GHz Xeon EM64T • 8GB RAM • 2x 73GB (RAID1) SCSI Storage • 16TB Storage System • DataDirect Networks S2A3000 system • 80x250GB SATA drives • 6 I/O servers • IBRIX Fusion file system • Dedicated internal storage network (Gigabit Ethernet) • Has been operating and running Panda production for over a year
UTA_SWT2 (Phase I) Services • OSG 0.8 Compute Element • ATLAS specific Gatekeeper • Local Replica Catalog • DQ2 Site-services (supports all of SWT2) • Redundant GUMS server for UTA resources
SWT2 Phase II (CPB) Equipment • 50 node cluster (SC 1435) • 200 Cores (2.4GHz Dual Opteron 2216) • 8GB RAM (2GB/core) • 80 GB SATA disk • 2 Head nodes • Dual Opteron 2216 • 8 GB RAM • 2x73GB (RAID1) SAS Drives • 75 TB (raw) Storage System • 10xMD1000 Enclosures • 150x500GB SATA Disk Drives • 8 I/O Nodes • dCache will be used for aggregating storage • 10GB internal network capacity • In service • Future purchases will focus more on storage
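The cluster-wide figures for Phase II can be derived from the per-node numbers on this slide. The sketch below is a quick consistency check, not a site tool. It assumes decimal units (1 TB = 1000 GB) and four cores per node, which is inferred from 200 cores across 50 nodes with two dual-core Opteron 2216 processors each. All variable names are illustrative.

```python
# Per-node and per-drive figures from the slide.
nodes = 50
cores_per_node = 4      # assumption: two dual-core Opteron 2216 CPUs per node
ram_per_node_gb = 8
local_disk_gb = 80
drives = 150
drive_gb = 500

GB_PER_TB = 1000        # assumption: decimal units, as used for "raw" capacity

total_cores = nodes * cores_per_node
total_ram_gb = nodes * ram_per_node_gb
ram_per_core_gb = ram_per_node_gb / cores_per_node
local_disk_tb = nodes * local_disk_gb / GB_PER_TB
raw_storage_tb = drives * drive_gb / GB_PER_TB

print(f"cores:       {total_cores}")
print(f"RAM:         {total_ram_gb} GB ({ram_per_core_gb:g} GB/core)")
print(f"local disk:  {local_disk_tb:g} TB")
print(f"raw storage: {raw_storage_tb:g} TB")
```

These reproduce the slide's 200 cores, 2GB/core, and 75 TB raw storage; usable capacity after RAID and file system overhead would be lower.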
SWT2_CPB Services • OSG 0.8 Compute Element • Local Replica Catalog • GUMS server for all UTA Resources
ATLAS SWT2 Phase I and II
SWT2-PH1@UTACC • 320 Xeon 3.2 GHz cores = 1,024 GHz • 2GB RAM/core = 640GB • 160GB internal/unit = 25.6TB • 8 dual core server nodes • 16TB of storage assisted by 6 I/O servers • Dedicated Gbit internal connections
SWT2-PH2@UTACPB • 200 Opteron 2.4 GHz cores = 480 GHz • 2GB RAM/core = 400GB • 80GB SATA/unit = 4TB • 8 dual core server nodes • 75TB of storage in 10 Dell MD1000 RAID enclosures assisted by 8 I/O servers • Dedicated Gbit internal connections
SWT2 Growth Courtesy M. Ernst US ATLAS Transparent Distributed Facility Workshop 3/04/08
ATLAS SWT2 Capacity • Current Capacity for Project Funded Resources • Total CPU ~ 1,200K SI2K • Total Disk ~ 92TB (usable) • Additions to SWT2_CPB • Total CPU ~ 2,200K SI2K • Total Disk ~ 300TB (usable) • Future Purchases will be weighted more to storage growth
Network Capacity History at UTA • Had DS3 (44.7 Mbit/s) until late 2004 • Choked the heck out of the network for about a month downloading D0 data for re-reconstruction • Met with VP of Research at UTA and emphasized the importance of the network backbone for attracting external funds • Increased to OC3 (155 Mbit/s) in early 2005 • OC12 as of early 2006 • Connected to NLR (10 Gb/s) through LEARN (http://www.tx-learn.org/) via a 1 Gb/s connection to NTGP • $9.8M ($7.3M for optical fiber network) in State of Texas funds approved in Sept. 2004 • Most areas on LEARN are lit • The "Last Mile" connection problem still exists
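To put the link upgrades in perspective, the sketch below estimates how much data each link could move in a month at full line rate. The DS3 and OC3 rates are those quoted on the slide; the OC12 rate of 622 Mbit/s is the nominal figure and is not from the slide. The numbers are upper bounds that ignore protocol overhead and competing traffic, and the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400

def tb_in_days(rate_mbit_s, days):
    """Upper bound on data moved (decimal TB) at a sustained line rate."""
    bytes_per_second = rate_mbit_s * 1e6 / 8
    return bytes_per_second * SECONDS_PER_DAY * days / 1e12

# Link rates in Mbit/s; the OC12 value is the nominal rate (assumption, not on the slide).
links = {"DS3": 44.7, "OC3": 155.0, "OC12": 622.0}

for name, rate in links.items():
    print(f"{name:>4}: up to {tb_in_days(rate, 30):6.1f} TB in 30 days")
```

At DS3 speed a fully saturated link moves only about 14.5 TB in 30 days, which is consistent with a month-long download monopolizing the campus connection.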
LEARN Status http://www.tx-learn.org/networkmap.cfm
NLR – National LambdaRail • ONENET • LONI • LEARN • 10 Gb/s connections
Software Development Activities • MonALISA based ATLAS distributed analysis monitoring • A good, scalable system • Software development and implementation completed • ATLAS-OSG sites are on the LHC Dashboard • New server brought back into service for OSG at UTA • Completed a DDM monitoring project • Working on Achelois, the ATLAS Operations Support System
CSE Student Exchange Program • Joint effort between HEP and CSE • David Levine, Gergely Zaruba and Manfred Huber are primary contacts at CSE • A total of 10 CSE MS students have worked on the SAM-Grid team • Five generations of students • Many of them playing leading roles in the grid community • Abhishek Rana at UCSD • Parag Mhashilkar at FNAL • Sudhamsh Reddy worked for UTA and is back in the Ph.D. program • New program with BNL implemented • First student has completed the tenure and is in on-the-job training • Second set of two Ph.D. students are at BNL and completing their tenure • Participating in ATLAS Panda monitoring project • One student working on a pilot factory using Condor glide-in • Finalizing the integration into the regular service • Facing some budget cut issues here
The LAW Project • Co-Linux based Condor system, with the OU setup as the model • Had a meeting between the OU team and UTA IT management in Feb. during the HiPCAT meeting • We are very grateful to the OU team for being willing to come and help • A team of UTA IT personnel has been designated to work on this project, looking into various aspects • Head of Academic Computing Services is the lead • Had a meeting with her two weeks ago • A new 28-machine lab in Engineering is the trial case • IT wants to start with 5 machines or so first • IT will manage and support the base software, OS and Co-Linux • All other software related responsibilities fall on HEP
Conclusions • MonALISA based Panda monitoring providing info to LHC Dashboard • New server was brought up • Waiting for BNL to separate Panda monitor servers out • Completed a project in DDM monitoring • Completed EG2 (photon) CSC note • Plan to connect to 10 Gb/s NLR via a 1 Gb/s connection to UTD • Quite expensive to make the final connection • Involved in ATLAS Computing Operations Support • Working closely with HiPCAT on state-wide grid activities • Started working on setting up a Co-Linux Condor cluster