
Status of DØ Computing at UTA




1. Status of DØ Computing at UTA
DoE Site Visit, Nov. 13, 2003
Jae Yu, University of Texas at Arlington
Outline:
• Introduction: the UTA DØ Grid team
• DØ Monte Carlo production
• DØ Grid computing: DØRAC and DØSAR
• DØ Grid software development effort
• Impact on outreach and education
• Conclusions

2. Introduction
• UTA has been the US leader in producing DØ MC events
• UTA led the effort to:
  • Start remote computing at DØ
  • Define the remote computing architecture at DØ
  • Implement the remote computing design at DØ in the US
  • Leverage its experience as the ONLY active US DØ MC farm (no longer the only one)
• UTA is the leader of the US DØ Grid effort
• The UTA DØ Grid team has been playing a leadership role in monitoring software development

3. The UTA DØ Grid Team
• Faculty: Jae Yu, David Levine (CSE)
• Research Associate: HyunWoo Kim
  • SAM-Grid expert
  • Development of the McFarm SAM-Grid job manager
• Software Program Consultant: Drew Meyer
  • Development, improvement, and maintenance of McFarm
• CSE Master's student: Nirmal Ranganathan
  • Investigation of resource needs in Grid execution
• EE M.S. student: Prashant Bhamidipati
  • MC farm operation and McPerM development
• Physics undergraduate student: David Jenkins
  • Taking over MC farm operation and developing the monitoring database
• Graduated:
  • Three CSE M.S. students, all now in industry
  • One CSE undergraduate, now in the M.S. program at U. of Washington

4. UTA DØ MC Production
• Two independent farms:
  • Swift farm (HEP): 36 P3 866 MHz CPUs, 250 MB/CPU, 0.6 TB total disk space
  • CSE farm: 12 P3 866 MHz CPUs
• McFarm is our production control software
• Statistics (11/1/2002 – 11/12/2003):
  • Produced: ~10M events
  • Delivered: ~8M events

5. What do we want to do with the data?
• We want to analyze data no matter where we are!!!
• Location- and time-independent analysis

6. DØ Data Taking Summary: 30–40M events/month

7. What do we need for efficient data analysis in a HEP experiment?
• Total expected data size is ~4 PB (4 million GB = 100 km of 100 GB hard drives)!!!
• Detectors are complicated: many people are needed to construct them and make them work
• The collaboration is large and scattered all over the world
• Allow software development at remote institutions
• Optimized resource management, job scheduling, and monitoring tools
• Efficient and transparent data delivery and sharing

8. The DØ Collaboration
• 650 collaborators
• 78 institutions
• 18 countries

9. Old Deployment Models
• Started with the Fermilab-centric SAM infrastructure in place, …
• … then transitioned to a hierarchically distributed model

10. DØ Remote Analysis Model (DØRAM) [tier diagram]
• Central Analysis Center (CAC)
• Regional Analysis Centers (RACs)
• Institutional Analysis Centers (IACs)
• Desktop Analysis Stations (DASs)
• Normal and occasional interaction communication paths connect the tiers
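To make the tier structure concrete, here is a minimal sketch (not DØ software) that models CAC → RAC → IAC → DAS sites as a simple tree and walks the normal, tier-by-tier communication path; all class and site names are illustrative placeholders.

```python
# Minimal sketch of the DØRAM tiers (illustrative only, not DØ software).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Site:
    name: str
    tier: str                                   # "CAC", "RAC", "IAC", or "DAS"
    parent: Optional["Site"] = None
    children: List["Site"] = field(default_factory=list)

    def add(self, name: str, tier: str) -> "Site":
        """Attach a child site one tier below this one."""
        child = Site(name, tier, parent=self)
        self.children.append(child)
        return child

    def normal_path(self) -> List[str]:
        """Normal interaction path: each site talks to the site directly above it."""
        site, path = self, []
        while site is not None:
            path.append(f"{site.tier}:{site.name}")
            site = site.parent
        return path

# Hypothetical example hierarchy: CAC -> RAC -> IAC -> DAS
cac = Site("Fermilab", "CAC")
rac = cac.add("UTA", "RAC")
iac = rac.add("OU", "IAC")
das = iac.add("desktop-01", "DAS")

print(" -> ".join(das.normal_path()))
# DAS:desktop-01 -> IAC:OU -> RAC:UTA -> CAC:Fermilab
# An "occasional" interaction path would bypass intermediate tiers.
```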

11. What is a DØRAC?
• A large, concentrated computing resource hub
• An institute willing to provide storage and computing services to a few small institutes in the region
• An institute capable of providing increased infrastructure as the data from the experiment grows
• An institute willing to provide support personnel
• Complementary to the central facility

12. DØ Southern Analysis Region (DØSAR) [map: KSU, OU/LU, KU, UAZ, Ole Miss, UTA, LTU, Rice, Mexico/Brazil]
• The first US region, centered around the UTA RAC
• A regional virtual organization (RVO) within the greater DØ VO!!

13. SAR Institutions
• First-generation IACs:
  • Langston University
  • Louisiana Tech University
  • University of Oklahoma
  • UTA
• Second-generation IACs:
  • Cinvestav, Mexico
  • Universidade Estadual Paulista, Brazil
  • University of Kansas
  • Kansas State University
• Third-generation IACs:
  • Ole Miss, MS
  • Rice University, TX
  • University of Arizona, Tucson, AZ

14. Goals of the DØ Southern Analysis Region
• Prepare institutions within the region for grid-enabled analyses using the RAC at UTA
• Enable IACs to contribute to the experiment as much as they can, including MC production and data re-processing
• Provide grid-enabled software and computing resources to the DØ collaboration
• Provide regional technical support and help new IACs
• Perform physics data analyses within the region
• Discover and draw in more computing and human resources from external sources

15. SAR Workshops
• Semi-annual workshops to promote healthy regional collaboration and to share expertise
• Two workshops held so far:
  • April 18–19, 2003 at UTA: ~40 participants
  • Sept. 25–26, 2003 at OU: 32 participants
• Each workshop had different goals and outcomes:
  • Established SAR, RAC, and IAC web pages and e-mail lists
  • Identified institutional representatives
  • Enabled three additional IACs for MC production
  • Paired new institutions with existing ones

16. SAR Strategy
• Set up all IACs with the full DØ software suite (DØRACE Phases 0–IV)
• Install the Condor (or PBS) batch control system on desktop farms or clusters
• Install the McFarm MC production control software
• Produce MC events on IAC machines
• Install Globus for monitoring information transfer
• Install SAM-Grid and interface McFarm to it
• Submit jobs through SAM-Grid and monitor them (a minimal batch-submission sketch follows below)
• Perform analysis at the individual's desk
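As a rough illustration of the batch-submission step above (not McFarm's actual interface), the sketch below writes a minimal Condor submit description and hands it to condor_submit; the executable name and request ID are hypothetical, and a PBS site would build a qsub script instead.

```python
# Illustrative sketch of wrapping a Condor submission (not McFarm's real
# interface).  Submit-file keywords are standard Condor; the executable and
# request id below are hypothetical placeholders.
import subprocess
import tempfile

def submit_mc_job(executable: str, request_id: str) -> None:
    """Write a minimal Condor submit description and hand it to condor_submit."""
    submit_text = (
        "universe   = vanilla\n"
        f"executable = {executable}\n"
        f"arguments  = {request_id}\n"
        f"output     = mcjob_{request_id}.out\n"
        f"error      = mcjob_{request_id}.err\n"
        f"log        = mcjob_{request_id}.log\n"
        "queue\n"
    )
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(submit_text)
        submit_file = f.name
    # condor_submit is Condor's standard CLI; a PBS site would use qsub instead.
    subprocess.run(["condor_submit", submit_file], check=True)

# Hypothetical usage:
# submit_mc_job("./run_mcfarm_request.sh", "req-000123")
```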

17. SAR Software Status
• Up to date with DØ releases
• McFarm MC production control
• Condor or PBS as batch control
• Globus v2.x for grid-enabled communication
• Globus and DOE SG certificates obtained and installed
• SAM-Grid on two of the farms (the UTA IAC farms)

18. UTA Software for SAR
• McFarm job control
  • All DØSAR institutions use this product for automated MC production
• Ganglia resource monitoring (see the polling sketch after this slide)
  • Covers 7 clusters (332 CPUs), including the Tata Institute, India
• McFarmGraph: MC job status monitoring system using GridFTP
  • Provides detailed information for each MC request
• McPerM: MC farm performance monitoring
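Since Ganglia is central to this setup, here is a hedged sketch of the kind of resource polling it enables: gmond publishes cluster state as XML on TCP port 8649, so a short script can total the CPUs reported across clusters. The host name is a placeholder; this is not the actual DØSAR monitoring code.

```python
# Hedged sketch of Ganglia-style resource polling: gmond publishes cluster
# state as XML on TCP port 8649.  The host below is a placeholder; this is
# not the actual DØSAR monitoring code.
import socket
import xml.etree.ElementTree as ET

def poll_gmond(host: str, port: int = 8649) -> ET.Element:
    """Fetch the XML dump that a cluster's gmond daemon serves on connect."""
    chunks = []
    with socket.create_connection((host, port), timeout=10) as sock:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return ET.fromstring(b"".join(chunks))

def total_cpus(root: ET.Element) -> int:
    """Sum the standard 'cpu_num' metric over every host in every cluster."""
    return sum(
        int(float(m.get("VAL", "0")))
        for m in root.iter("METRIC")
        if m.get("NAME") == "cpu_num"
    )

# Hypothetical usage:
# root = poll_gmond("hep-farm.example.edu")
# print("CPUs reported:", total_cpus(root))
```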

19. Ganglia Grid Resource Monitoring [screenshot; annotation marks the 1st SAR workshop]

20. Job Status Monitoring: McFarmGraph

21. Farm Performance Monitor: McPerM [plot; annotation: "Increased Productivity"]

22. UTA RAC and Its Status
• NSF MRI-funded facility
  • Joint proposal of UTA HEP and CSE + UTSW Medical: 2 HEP, 10 CSE, and 2 UTSW Medical
• Core system (high-throughput research system):
  • CPU: 64 P4 Xeon 2.4 GHz (total ~154 GHz)
  • Memory & NIC: 1 GB/CPU and a 1 Gbit/sec port each (64 GB total)
  • Storage: 5 TB Fibre Channel served by 3 GFS servers (3 Gbit/sec throughput)
  • Network: Foundry switch with 52 Gbit/sec + 24 100 Mbit/sec ports
• Expansion system (high-CPU-cycle, large-storage grid system):
  • CPU: 100 P4 Xeon 2.6 GHz (total ~260 GHz)
  • Memory & NIC: 1 GB/CPU and a 1 Gbit/sec port each (100 GB total)
  • Storage: 60 TB IDE RAID served by 10 NFS servers
  • Network: 52 Gbit/sec
• The full facility went online on Oct. 31, 2003
• Software installation in progress
• Plan to participate in the SC2003 demo next week
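A quick arithmetic cross-check of the per-subsystem aggregates quoted above, using only the CPU counts, clock speeds, and per-CPU memory figures from this slide (a throwaway sketch, not part of the facility software):

```python
# Sanity check of the aggregate figures quoted on this slide.
core      = {"cpus": 64,  "ghz": 2.4, "mem_gb_per_cpu": 1, "disk_tb": 5}
expansion = {"cpus": 100, "ghz": 2.6, "mem_gb_per_cpu": 1, "disk_tb": 60}

for name, subsystem in (("core", core), ("expansion", expansion)):
    total_ghz = subsystem["cpus"] * subsystem["ghz"]
    total_mem = subsystem["cpus"] * subsystem["mem_gb_per_cpu"]
    print(f"{name}: ~{total_ghz:.0f} GHz, {total_mem} GB memory, "
          f"{subsystem['disk_tb']} TB disk")

# core: ~154 GHz, 64 GB memory, 5 TB disk
# expansion: ~260 GHz, 100 GB memory, 60 TB disk
```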

23. Just to Recall, Two Years Ago…
• IDE hard drives were ~$2.5/GB
• Each IDE RAID array set provides ~1.6 TB, hot swappable
• Can be configured for up to 10–16 TB in a rack
• A modest server can manage the entire system
• A Gbit network switch provides high-throughput transfer to the outside world
• Flexible and scalable system
• Needs an efficient monitoring and error-recovery system (see the simple capacity-check sketch below)
• Communication to resource management
[Diagram: multiple IDE-RAID arrays and a disk server attached to a Gbit switch]
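As a toy illustration of the monitoring need noted above, the sketch below reports capacity on a set of RAID mount points and flags any that are nearly full; the mount points are hypothetical placeholders and this is not the monitoring system actually deployed.

```python
# Toy capacity check for RAID-array mount points (hypothetical paths;
# not the monitoring system actually deployed on the farm).
import shutil

RAID_MOUNTS = ["/"]   # replace with the real array mounts, e.g. "/raid1", "/raid2"

def report_usage(mounts, warn_fraction=0.90):
    """Print used/total space per array and flag any that are nearly full."""
    for mount in mounts:
        usage = shutil.disk_usage(mount)
        used_frac = usage.used / usage.total
        flag = "  <-- nearly full" if used_frac > warn_fraction else ""
        print(f"{mount}: {usage.used / 1e12:.2f} / {usage.total / 1e12:.2f} TB "
              f"({used_frac:.0%}){flag}")

report_usage(RAID_MOUNTS)
```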

24. UTA DØRAC
• 84 P4 Xeon 2.4 GHz CPUs = 202 GHz
• 7.5 TB of disk space
• 100 P4 Xeon 2.6 GHz CPUs = 260 GHz
• 64 TB of disk space
• Total CPU: 462 GHz
• Total disk: 73 TB
• Total memory: 168 GB
• Network bandwidth: 54 Gbit/sec

25. SAR Accomplishments
• Held two workshops; a third is planned
• All first-generation institutions produce MC events using McFarm on desktop PC farms
• Generated MC events: OU: 300k, LU: 250k, LTU: 150k, UTA: ~1.3M
• Discovered additional resources
• Significant local expertise has been accumulated in running farms and producing MC events
• Produced several documents, including two DØ notes
• Hold regular bi-weekly meetings (VRVS) to keep up progress
• Working toward data re-processing

26. SAR Computing Resources

27. SAR Plans
• Four second-generation IACs have been paired with four first-generation institutions
• Success is defined as:
  • Regular production and delivery of MC events to SAM using McFarm
  • Installing SAM-Grid and performing a simple SAM job
  • Adding all the new IACs to Ganglia, McFarmGraph, and McPerM
• Discover and integrate more resources for DØ:
  • Integrate OU's OSCER cluster
  • Integrate other institutions' large, university-wide resources
• Move toward grid-enabled regional physics analyses
  • Collaborators need to be educated to use the system

28. Future Software Projects
• Preparation of UTA DØRAC equipment for:
  • MC production (DØ is suffering from a shortage of resources)
  • Re-reconstruction
  • SAM-Grid
• McFarm:
  • Integration of re-processing
  • Enhanced monitoring
  • Better error handling
• McFarm interface to SAM-Grid (job_manager):
  • Initial script successfully tested for the SC2003 demo
  • Work with the SAM-Grid team on the monitoring database and integration of McFarm technology
• Improvement and maintenance of McFarmGraph and McPerM
• Universal graphical user interface to the Grid (PHY PhD student)

29. SAR Physics Interests
• OU/LU:
  • EWSB/Higgs searches
  • Single top search
  • CPV / rare decays in heavy flavors
  • SUSY
• LTU:
  • Higgs search
  • b-tagging
• UTA:
  • SUSY
  • Higgs searches
  • Diffractive physics
• Diverse topics, but common samples can be defined

30. Funding at SAR
• Hardware support:
  • UTA RAC: NSF MRI
  • UTA IAC: DoE + local funds, totally independent of the RAC resources
  • More hardware is needed to adequately support desktop analyses utilizing the RAC resources
• Software support:
  • Mostly UTA local funding, which will run out this year!!!
  • Several other funding sources have been tried, but none has worked out
• We seriously need help to:
  • Maintain the leadership in DØ remote computing
  • Maintain the leadership in grid computing
  • Realize DØRAM and expeditious physics analyses

31. Tevatron Grid Framework: SAM-Grid
• DØ already has the data delivery part of a Grid system (SAM)
• The project started in 2001 as part of the PPDG collaboration to handle DØ's expanded needs
• The current SAM-Grid team includes: Andrew Baranovski, Gabriele Garzoglio, Lee Lueking, Dane Skow, Igor Terekhov, Rod Walker (Imperial College), Jae Yu (UTA), Drew Meyer (UTA), and HyunWoo Kim (UTA), in collaboration with the U. Wisconsin Condor team
• http://www-d0.fnal.gov/computing/grid
• UTA is developing an interface from McFarm to SAM-Grid
  • This brings all SAR institutions, plus any institution running McFarm, into the DØ Grid

32. Fermilab Grid Framework (SAM-Grid) [diagram; UTA highlighted]

33. UTA–FNAL CSE Master's Student Exchange Program
• Establishing usable Grid software on the DØ time scale requires highly skilled software developers
• FNAL cannot afford computing professionals
• The UTA CSE department has 450 M.S. students; many are highly trained but returned to school because of the economy
• Students participate in cutting-edge Grid computing topics in a real-life setting
• Students' Master's theses become well-documented records of the work, something many HEP computing projects lack
• The third generation of students is at FNAL working on improving SAM-Grid and its implementation, on a two-semester rotation
• The previous two generations made a significant impact on SAM-Grid:
  • One of the four previous-generation students is in the PhD program at CSE
  • One is on the Wisconsin Condor team, with a possibility of entering a PhD program
  • Two are in industry

34. Impact on Education and Outreach
• UTA DØ Grid program:
  • Trained: 12 students (10 M.S. + 1 undergraduate)
  • Graduated: 5 CSE Masters + 1 undergraduate
• CSE Grid course: many class projects on DØ
• QuarkNet:
  • UTA is one of the founding institutions of the QuarkNet program
  • Initiated the TECOS project
  • Other school rooftop cosmic-ray projects across the nation need storage and computing resources: a QuarkNet Grid
  • Will work with QuarkNet on data storage and the eventual use of computing resources by teachers and students
• UTA recently became a member of the Texas grid (HiPCAT):
  • HEP is leading this effort
  • Strongly supported by the university
  • Expect a significant increase in infrastructure, such as bandwidth

35. Conclusions
• The UTA DØ Grid team has accomplished a tremendous amount
• UTA played a leading role in DØ remote computing:
  • MC production
  • Design of the DØ Grid architecture
  • Implementation of DØRAM
• The DØ Southern Analysis Region is a great success:
  • Four new institutions (3 US) are now MC production sites
  • Enabled exploitation of available expertise and resources in an extremely distributed environment
  • Remote expertise is being accumulated

36. Conclusions (continued)
• The UTA DØRAC is up and running; software installation is in progress
  • It will soon add significant resources to SAR and to DØ
• The SAM-Grid interface to McFarm is working: one step closer to establishing a globalized grid
• The UTA–FNAL M.S. student exchange program is very successful
• The UTA DØ Grid computing program has a significant impact on outreach and education
• UTA is the ONLY US DØ institution that has been playing a leading role in the DØ grid, which makes UTA unique
• Local support runs out this year!! UTA needs support to maintain its leadership in, and support for, DØ remote computing
