
Virtual Organization Approach for Running HEP Applications in Grid Environment


Presentation Transcript


1. Virtual Organization Approach for Running HEP Applications in Grid Environment
Łukasz Skitał1, Łukasz Dutka1, Renata Słota2, Krzysztof Korcyl3, Maciej Janusz2, Jacek Kitowski1,2
1 ACC CYFRONET AGH, Cracow; 2 Institute of Computer Science AGH-UST, Cracow; 3 Institute of Nuclear Physics PAN, Cracow

2. Outline
• Introduction
• Application
  • Requirements
  • Architecture
• Virtual Organization
  • Requirements
  • Certification
  • Monitoring
  • Dynamic Processing Tasks Pool
• Summary/Conclusions

3. Introduction
• Part of the int.eu.grid project
• State of the art – current interactive applications in the grid:
  • Crossgrid
  • Reality Grid
  • GridPP
• Why the grid?
  • Gain more resources
  • Use the resources for different purposes when ATLAS is off
• Why a VO?
  • Provides a stable, but dynamic, environment for the application
  • The HEP application has to process all events without any loss of data

4. HEP Application in Brief
[diagram: remote Processing Farms (PF) in Copenhagen, Edmonton, Kraków and Manchester, connected over a packet-switched WAN (GEANT lightpath) and a back-end network to the CERN Computing Center, where the SFIs, the Dispatcher, the local Event Processing Farms, the SFOs and mass storage are located]

5. HEP Application in Brief
[diagram: data flow from the detector sensors through the Lvl1 filters (2.5 µs, 120 GB/s) into buffers (~10 ms), then the Lvl2 filters (~4 GB/s), then the LVL3 Event Filter: SFIs feed the Event Filter Processors (EFP) over the Event Filter Network (EFN), on the order of a second per event, with the SFOs writing out at ~300 MB/s]
• Three filtering levels:
  • hardware level (lvl 1)
  • small local farm (lvl 2)
  • complex event filtering (lvl 3)
• SFI – SubFarm Interface
• SFO – SubFarm Output
• PF – Processing Farm

6. HEP Requirements
• Real-time application
• High throughput (estimated):
  • 3500 events per second
  • 1.5 MB per event
  • on average, 1 second to compute one event on a typical processor
• Infrastructure monitoring (for load balancing)
• An efficient way to distribute events to worker nodes:
  • the grid job submission mechanism is not sufficient
  • a simple job submission takes minutes, while only seconds are available
• Failure recovery:
  • malfunctions of single nodes are acceptable, but have to be detected
  • application monitoring
  • infrastructure monitoring (for availability checks)
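Taken together, these estimates imply an aggregate input stream of roughly 3500 × 1.5 MB ≈ 5 GB/s and, at about one second of CPU time per event, on the order of 3500 processors busy at all times just to keep up. A minimal back-of-the-envelope sketch in Python, using only the figures quoted above:

```python
# Back-of-the-envelope sizing from the estimates on this slide (not measurements).
EVENT_RATE_HZ = 3500         # events per second (estimate)
EVENT_SIZE_MB = 1.5          # MB per event (estimate)
CPU_SECONDS_PER_EVENT = 1.0  # average on a "typical" processor (estimate)

input_bandwidth_gib_s = EVENT_RATE_HZ * EVENT_SIZE_MB / 1024   # ~5.1 GiB/s of input
processors_needed = EVENT_RATE_HZ * CPU_SECONDS_PER_EVENT      # ~3500 concurrent PTs

print(f"aggregate input bandwidth: {input_bandwidth_gib_s:.1f} GiB/s")
print(f"processors needed to keep up: {processors_needed:.0f}")
```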

7. HEP application integration with the grid
• Job submission for each event is too slow
• Job submission for a bunch of events is still too slow
• We need interactive communication
• Pilot job idea:
  • one job allocates a node and starts a PT (Processing Task)
  • a dedicated queue in the LRMS for HEP pilot jobs (HEP VO)
  • one PT processes many events
• Direct communication between the PT and the ATLAS experiment:
  • faster than job submission
  • the ATLAS experiment provides events (1.5 MB/event)
  • the PT responds with event analysis results (~1 kB/event)
  • asynchronous communication with event buffering
• Limited lifetime of a PT to allow dynamic resource allocation:
  • lifetime set by the queue or the PT configuration
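As a rough sketch of what one pilot-job Processing Task could look like: the pilot job lands on a worker node, connects back to the experiment side, pulls events, answers with short results, and exits when its lifetime runs out. The endpoint, the fixed-size message framing and the `analyse()` body below are hypothetical stand-ins, not the actual int.eu.grid/ATLAS interfaces:

```python
import socket
import time

DISPATCHER = ("hep-dispatcher.example.cern.ch", 9000)  # hypothetical endpoint
LIFETIME_S = 3600        # limited PT lifetime, e.g. the queue walltime
EVENT_SIZE = 1_500_000   # ~1.5 MB per event (fixed size only for this sketch)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("dispatcher closed the connection")
        buf += chunk
    return buf

def analyse(event: bytes) -> bytes:
    """Placeholder for the real event filtering code (~1 s of CPU per event)."""
    return b"accept"  # ~1 kB of analysis results in the real application

def main() -> None:
    deadline = time.time() + LIFETIME_S
    with socket.create_connection(DISPATCHER) as sock:
        while time.time() < deadline:             # limited lifetime -> pool refresh
            event = recv_exact(sock, EVENT_SIZE)  # experiment pushes the event
            sock.sendall(analyse(event))          # PT answers with short results

if __name__ == "__main__":
    main()
```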

8. Proposed HEP Architecture
[diagram: SFIs at CERN feed events through EFDs with local buffers to a local PT farm and, via a ProxyPT and the Dispatcher, to remote PTs running on worker nodes (WNs) behind grid CEs; pilot jobs are submitted through the grid UI and WMS; a HEP VO database, infrastructure monitoring and application monitoring support the whole setup]

9. Components
• EFD (Event Filter Dataflow):
  • takes events from the SFI and places them in a local buffer
  • events are distributed to PTs (local or remote)
  • depending on the PT's answer, the event is stored or flushed
• Processing Task (PT):
  • runs on worker nodes (WNs)
  • processes events and answers with short analysis data
• ProxyPT:
  • interface to remote PTs
• Dispatcher:
  • coordinates task distribution from the EFD to the PTs
• Infrastructure Monitoring:
  • network load/status, WN status
• Application Monitoring:
  • PT application state
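To make the EFD's role concrete, here is a minimal, synchronous sketch of the "take from SFI, buffer, hand to a free PT, store or flush" cycle. The `sfi`, `storage` and `pt` objects (and the `answer.accepted` attribute) are hypothetical placeholders, and the real EFD distributes events asynchronously rather than one at a time:

```python
import queue

class EFD:
    """Minimal sketch of the Event Filter Dataflow logic described above."""

    def __init__(self, sfi, storage):
        self.sfi = sfi                            # event source (SubFarm Interface)
        self.storage = storage                    # SFO / mass storage stand-in
        self.buffer = queue.Queue(maxsize=1000)   # local event buffer
        self.free_pts = queue.Queue()             # idle PTs, local or behind ProxyPT

    def register_pt(self, pt):
        """Called when a PT (or the ProxyPT on its behalf) is ready for work."""
        self.free_pts.put(pt)

    def run_once(self):
        event = self.sfi.next_event()             # take an event from the SFI...
        self.buffer.put(event)                    # ...and place it in the local buffer
        pt = self.free_pts.get()                  # pick an idle PT (local or remote)
        answer = pt.process(self.buffer.get())    # asynchronous in the real system;
                                                  # the buffer matters once it is
        if answer.accepted:                       # depending on the PT's answer...
            self.storage.store(event)             # ...the event is stored
        # otherwise the event is simply flushed from the buffer
        self.free_pts.put(pt)
```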

10. HEP Virtual Organization
• Purpose:
  • provides the runtime environment for the HEP application
  • fulfills the application's requirements
  • realizes the site certification process
• Architecture:
  • high level (static) – sites:
    • certification
    • agreements
    • configuration guidelines
    • functional tests
  • low level (dynamic) – resources:
    • runtime environment
    • dynamic resource allocation
    • monitoring and failover
    • load balancing

11. Site certification – Requirements
• Long-term ability to provide services and resources
• Legal issues/agreements
• LRMS configuration:
  • dedicated high-priority (but short) queues on computing elements for jobs from the HEP VO
• Ability to communicate safely between the site's WNs and the CERN HEP nodes:
  • a specified port opened on the CERN side
  • a specified port opened on the site side
  • a trusted proxy to set up two-way communication
  • channel encryption
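A certification-time check of the "opened port" requirements might look like the following sketch. The host names and port numbers are invented for illustration; in practice only the endpoints fixed in the site agreements would be tested:

```python
import socket

# Hypothetical endpoints agreed during certification; real hosts/ports differ.
CHECKS = [
    ("hep-proxy.example.cern.ch", 20000),   # CERN-side port, reachable from the WNs
    ("wn-proxy.example-site.org", 20001),   # site-side port, reachable from CERN
]

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host, port in CHECKS:
        status = "open" if port_open(host, port) else "CLOSED"
        print(f"{host}:{port} -> {status}")
```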

12. HEP VO site operation process
• Certification phase:
  • long-term tests for reliability, performance and updates (application, databases)
  • sites tested using artificial/calibration data
  • communication between the site's WNs and the CERN HEP nodes
• Operation phase with runtime environment monitoring:
  • operates on production data
  • checks during PT startup: proper environment, up-to-date application, databases, etc.
  • infrastructure and application monitoring
  • dynamic resource allocation
  • excluding nodes/sites which are frequently unavailable
  • temporarily excluded sites/nodes cannot process real data, but they can still receive test jobs
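The "checks during PT startup" item can be illustrated by a small validation script run before the PT starts processing real data. The environment variable names, the version file and the threshold below are hypothetical; the actual checks are defined by the VO:

```python
import os
import sys

# Hypothetical expectations; the real checks cover environment, application and databases.
REQUIRED_ENV = ["HEP_APP_HOME", "HEP_DB_URL"]
MINIMUM_APP_VERSION = (1, 4)

def app_version() -> tuple:
    """Read the installed application version (illustrative format: '1.4.2')."""
    with open(os.path.join(os.environ["HEP_APP_HOME"], "VERSION")) as f:
        return tuple(int(x) for x in f.read().strip().split(".")[:2])

def startup_checks() -> bool:
    missing = [v for v in REQUIRED_ENV if v not in os.environ]
    if missing:
        print(f"missing environment variables: {missing}")
        return False
    if app_version() < MINIMUM_APP_VERSION:
        print("application installation is out of date")
        return False
    return True

if __name__ == "__main__":
    sys.exit(0 if startup_checks() else 1)   # non-zero -> the node reports failure
```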

13. HEP Virtual Organization
[diagram: the high-level VO covers site certification (agreements, guidelines, functional tests, management issues) and site operation (communication, site availability statistics); the low-level VO covers dynamic resource allocation and monitoring]

14. Monitoring for the HEP VO
• The VO takes advantage of monitoring carried out with external tools
• Application monitoring (with a tool like J-OCM):
  • deployed on every worker node running a HEP PT
  • provides information about the current execution status
  • monitors computation time
• JIMS for infrastructure monitoring:
  • availability of worker nodes
  • load of worker nodes
  • free memory
  • network throughput between CERN and the remote computing farm
• Failover
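In this setup the data is gathered by J-OCM and JIMS. Purely to illustrate the kind of per-worker-node metrics listed above (and not using the JIMS or J-OCM APIs), a Linux-only sketch could be:

```python
import os

def worker_node_metrics() -> dict:
    """Collect the kind of per-WN metrics the slide lists (Linux only)."""
    load1, load5, load15 = os.getloadavg()          # load of the worker node
    meminfo = {}
    with open("/proc/meminfo") as f:                # free memory
        for line in f:
            key, value = line.split(":", 1)
            meminfo[key] = int(value.split()[0])    # values are reported in kB
    return {
        "load_1min": load1,
        "free_memory_mb": meminfo["MemFree"] // 1024,
        "available": True,                          # the node answered at all
    }

if __name__ == "__main__":
    print(worker_node_metrics())
```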

15. Dynamic resource allocation
• Dynamic Processing Tasks pool:
  • malfunctioning PTs are excluded from the runtime environment (low-level VO)
  • PT lifetime is limited by the queue length (walltime)
  • each 'normal' job has its own lifetime, specified before execution
  • 'interactive' type of job
  • the pool has to be refreshed periodically
• Fair sharing of resources
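The periodic refresh of the Processing Task pool can be sketched as follows; `submit_pilot_job`, the `pt.alive()`/`pt.site` interface and the pool size are hypothetical illustrations of the mechanism, not the project's actual implementation:

```python
import time

TARGET_POOL_SIZE = 100      # desired number of concurrently running PTs
REFRESH_PERIOD_S = 60       # how often the pool is inspected

def refresh_pool(pool, submit_pilot_job, blacklist):
    """One refresh cycle of the dynamic PT pool (illustrative only)."""
    # Drop PTs whose walltime expired and PTs on sites flagged as malfunctioning.
    pool[:] = [pt for pt in pool if pt.alive() and pt.site not in blacklist]
    # Top the pool back up with fresh pilot jobs on non-excluded sites.
    for _ in range(TARGET_POOL_SIZE - len(pool)):
        pool.append(submit_pilot_job(exclude=blacklist))

def run(pool, submit_pilot_job, blacklist):
    while True:                      # 'interactive' jobs -> periodic refresh
        refresh_pool(pool, submit_pilot_job, blacklist)
        time.sleep(REFRESH_PERIOD_S)
```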

16. Summary, conclusions
• High/low-level VO
• Site certification and software validation for the HEP application:
  • HEP-oriented site functionality tests
  • on-line validation of site configuration
  • statistical analysis of HEP processing
• Dynamic Processing Task Pool
