Belle Data Grid Deployment … Beyond the Hype
Lyle Winton, Experimental Particle Physics, University of Melbourne
eScience, December 2005
Belle Experiment
• Belle in KEK, Japan
• Investigates symmetries in nature
• CPU and data requirements explosion!
  • 4 billion events required simulation in 2004 to keep up with data production
• Belle MC Production effort
  • Australian HPC has contributed
• Belle is an ideal case:
  • has real research data
  • has a known application workflow
  • has a real need for distributed access and processing
Background
• The general idea…
• Investigation of Grid tools (Globus v1, v2, LCG)
• Deployment to a distributed testbed
• Utilisation of the APAC and partner facilities
• Deployment to the APAC National Grid
Australian Belle Testbed
• Rapid deployment at 5 sites in 9 days
  • U.Melb. Physics + CS, U.Syd., ANU/GrangeNet, U.Adelaide CS
  • IBM Australia donated dual Xeon 2.6 GHz nodes
• Belle MC generation of 1,000,000 events
  • simulation and analysis
• Demonstrated at PRAGMA4 and SC2003
• Globus 2.4 middleware
• Data management
  • Globus 2 replica catalogue
  • GSIFTP
• Job management
  • GQSched (U.Melb. Physics)
  • GridBus (U.Melb. CS)
Initial Production Deployment
• Custom-built central job dispatcher
  • initially used ssh and PBS commands
    • feared Grid was unreliable
    • at the time only 50% of facilities were Grid accessible
• SRB (Storage Resource Broker)
  • transfer of input data: KEK → ANUSF → Facility
  • transfer of output data: Facility → ANUSF → KEK
• Successfully participated in Belle's 4×10⁹ event MC production during 2004
• Now running on the APAC NG using LCG2/EGEE
Issues
• Deployment
  • time consuming for experts
  • even more time consuming for site admins with no experience
  • requires loosening security (network, unknown services, NFS on exposed boxes)
  • Grid services and clients generally require public IPs with open ports
• Middleware/Globus bugs, instabilities, failures
  • too many to list here
  • errors, logs, and manuals are frequently insufficient
• Distributed management
  • version problems between Globus releases (e.g. globus-url-copy can hang)
  • stable middleware is compiled from source, but OS upgrades can break it
  • once installed, how do we keep it configured, considering…
    • growing numbers of users and communities (VOs)
    • expanding interoperable Grids (more CAs)
• Applications
  • installed by hand at each site
  • many require access to a DB or remote data while processing
  • most clusters/facilities have private/off-internet compute nodes
Issues
• Staging workarounds
  • GridFTP is not a problem; SRB, however, is more difficult
  • remote queues for staging (APAC NF)
  • front-end node staging to a shared FS (via jobmanager-fork)
  • front-end node staging via SSH
• No national CA (for a while)
  • started with an explosion of toy CAs
• User access barriers
  • user has a cert. from a CA … then what?
  • access to facilities is more complicated (allocation/account/VO applications)
  • then all the above problems start!
• Is Grid worth the effort?
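The front-end node staging workaround above can be sketched as a wrapper that copies files through the cluster's front end, since compute nodes have no external access. This is a minimal illustration, not the dispatcher actually used; the hostname, paths, and job script name are hypothetical.

```python
# Sketch of the front-end SSH staging workaround: input files are pushed
# to the front-end node's shared filesystem, the batch job is submitted
# there, and outputs are pulled back afterwards. Hostnames and paths are
# hypothetical examples.

def build_staging_plan(frontend, inputs, outputs, workdir="/scratch/job"):
    """Return the shell commands to stage files through a front-end node."""
    plan = []
    for f in inputs:
        # stage-in: copy each input onto the shared filesystem via the front end
        plan.append(f"scp {f} {frontend}:{workdir}/")
    # submit the batch job from the front end (job.pbs is a placeholder name)
    plan.append(f"ssh {frontend} 'qsub {workdir}/job.pbs'")
    for f in outputs:
        # stage-out: copy results back after the job completes
        plan.append(f"scp {frontend}:{workdir}/{f} ./")
    return plan

if __name__ == "__main__":
    for cmd in build_staging_plan("frontend.example.edu.au",
                                  ["events.in"], ["events.out"]):
        print(cmd)
```

Keeping the plan as a list of commands (rather than executing directly) makes it easy to log, dry-run, and retry each step, which matters given the reliability issues listed above.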
Observations
• Middleware
  • Everything is fabric; lack of user tools!
  • Initially only Grid fabric (low level), e.g. Globus2
  • Application-level or 3rd-generation middleware, e.g. LCG/EGEE, VDT
    • overarching, joining, coordinating fabric
    • user tools for application deployment
  • Everybody must develop additional tools/portals for everyday (non-expert) user access
  • No out-of-the-box solutions
• Real Data Grids!
  • Many international big-science research collaborations are data focused
  • This is not simply a staging issue!
    • jobs need seamless access to data (at the start, middle, and end of a job)
    • many site compute nodes have no external access
    • middleware cannot stage/replicate databases
    • in some cases file access is determined at run time (ATLAS)
  • Current jobs must be modified/tailored for each site (not really Grid)
Observations
• Information Systems
  • Required for resource brokering and for debugging problems
  • MDS/GRIS/BDII are often unused (e.g. by Nimrod/G, GridBus)
    • not because of the technology
    • never given a certificate
    • never started
    • never configured for the site (PBS etc.)
    • never configured to publish (GIIS or top-level BDII)
    • never checked
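One concrete way to catch the "never configured, never checked" failures above is a sanity check over what a site actually publishes. The sketch below validates a published resource record against a minimum attribute set; the attribute names are illustrative stand-ins, not the real GLUE/MDS schema.

```python
# Minimal sketch of an information-system sanity check: verify that a
# published resource record carries a minimum set of attributes and that
# none are empty. Attribute names are illustrative, not the actual
# MDS/GLUE schema.

REQUIRED_ATTRS = {"hostname", "queue", "free_cpus", "max_wallclock"}

def check_record(record):
    """Return a list of problems found in one published record (dict)."""
    problems = []
    for attr in sorted(REQUIRED_ATTRS - record.keys()):
        problems.append(f"missing attribute: {attr}")
    for attr, value in sorted(record.items()):
        if value in (None, ""):
            problems.append(f"empty attribute: {attr}")
    return problems
```

Run against every record a site publishes, a check like this distinguishes "the technology failed" from "the site never filled in its configuration", which the observations above suggest is the more common case.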
Lessons/Recommendations
• NEED tools to determine what's going on (debugging)
  • jobs and scripts must have debug output/modes
  • middleware debugging MUST be well documented:
    • error codes and messages
    • troubleshooting
    • log files
• Application middleware must be coded for failure!
  • service death, intermittent connection failure, data removal, proxy timeout, and hangs are all to be expected
  • all actions must include external retry and timeout
• Information systems
  • e.g. queue is full, application not installed, not enough memory
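The "coded for failure" rule above can be sketched as a generic retry wrapper: every external action gets a bounded number of attempts with backoff between them. This is a simplified illustration; the `action` callable stands in for any middleware call (transfer, submit, query).

```python
import time

# Sketch of "all actions must include external retry": run an action,
# retrying on any exception with exponential backoff, and surface the
# last error only once all attempts are exhausted.

def with_retry(action, attempts=3, delay=1.0, backoff=2.0):
    """Run action(), retrying on any exception; return its result."""
    last_error = None
    for i in range(attempts):
        try:
            return action()
        except Exception as err:      # service death, hangs, timeouts...
            last_error = err
            if i < attempts - 1:
                time.sleep(delay)     # wait before the next try
                delay *= backoff      # back off progressively
    raise last_error                  # all retries exhausted
```

A real deployment would also wrap each attempt in a hard timeout (e.g. running the action in a subprocess and killing it), since hung calls such as a stuck `globus-url-copy` never raise on their own.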
Lessons/Recommendations
• Quality and availability are key issues
• Create service regression test scripts!
  • small config changes or updates can have big consequences
  • run from the local site (tests services)
  • run from a remote site (tests network)
• Site validation/quality checks
  • 1 – are all services up and accessible?
  • 2 – can we stage in, run, and stage out a baseline batch job?
  • 3 – do I.S. conform to minimum schema standards?
  • 4 – are I.S. populated, accurate, and up to date?
  • 5 – repeat 1-4 regularly
• Operational metrics are essential
  • help determine stability and usability
  • eventually provide justification for using Grid
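The validation steps above lend themselves to a small check runner that executes named probes and reports pass/fail, so the same run feeds both site validation and operational metrics. The probes here are trivial stand-ins; real ones would test services, a baseline staged job, and the information system.

```python
# Sketch of a site validation run: each check is a named callable
# returning True/False. A probe that crashes counts as a failure, and
# the per-check report can be logged over time as an operational metric.

def run_site_checks(checks):
    """Run {name: callable} checks; return (report dict, overall pass)."""
    report = {}
    for name, probe in checks.items():
        try:
            report[name] = bool(probe())
        except Exception:             # a crashing probe is a failed check
            report[name] = False
    return report, all(report.values())
```

Scheduling this regularly from both a local and a remote host covers step 5 above and separates service faults from network faults.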
Lessons/Recommendations
• Start talking to system/network admins early
  • education about Grid, GSI, and Globus
  • logging and accounting
  • public IPs with a shared home filesystem
• Have a dedicated node manager, for both OS and middleware
  • don't underestimate the time required
  • installation and testing ~ 2-4 days for an expert, 5-10 days for a novice (with instruction)
  • maintenance (testing, metrics, upgrades) ~ 1/10 days
• Have a middleware distribution bundle
  • too many steps to do at each site
  • APAC NG hoping to solve with Xen VM images
• Automate general management tasks
  • authentication lists (VO)
  • CA files, especially CRLs
  • host cert checks and imminent-expiry warnings
  • service up checks (auto restart?)
  • file clean-up (GRAM logs, GASS cache?, GT4 persisted)
• BADG Installer: single-step, guided GT2 installation (http://epp.ph.unimelb.edu.au/EPPGrid)
• GridMgr: manages VOs, certs, CRLs (http://epp.ph.unimelb.edu.au/EPPGrid)
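The imminent-expiry warning above is one of the easiest tasks to automate. The sketch below flags host certificates expiring within a warning window, given the certificate's notAfter timestamp (as could be extracted with e.g. `openssl x509 -enddate -noout`); the timestamp format used here is an assumption for illustration.

```python
from datetime import datetime, timedelta

# Sketch of an automated imminent-expiry warning for host certificates.
# The "%Y-%m-%d %H:%M:%S" timestamp format is an assumed normalised form,
# not what openssl prints verbatim.

def days_until_expiry(not_after, now):
    """Days from `now` until the notAfter timestamp (negative = expired)."""
    expiry = datetime.strptime(not_after, "%Y-%m-%d %H:%M:%S")
    return (expiry - now) / timedelta(days=1)

def expiry_warning(not_after, now, window_days=14):
    """Return a warning string if the cert is expired or expiring soon."""
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        return "EXPIRED"
    if remaining <= window_days:
        return f"expires in {remaining:.1f} days"
    return None                      # nothing to report yet
```

Run nightly over all host certs (and CRLs, which expire far more often), this turns silent authentication failures into advance warnings.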
International Interoperability
• HEP case study
  • application groups had to develop coordinated dispatchers and adapters
  • researchers jumping through hoops → in my opinion, a failure
  • limited manpower, limited influence over implementation
• If we are serious, we MUST allocate serious manpower and priority, with authority over Grid infrastructure
• Minimal services and the same middleware are not enough
• Test-case applications are essential
• Operational metrics are essential
Benefits
• Access to resources
• Funding to develop expertise and for manpower
• Central expertise and manpower (APAC NG)
• Other infrastructure (GrangeNet, APAC NG, TransPORT SX)
• Early adoption has been important
  • initially, access to more infrastructure
  • ability to provide experienced feedback
• Enabling large-scale collaboration, e.g. ATLAS
  • produces up to 10 PB/year of data
  • 1800 people, 150+ institutes, 34 countries
  • aim to provide low-latency access to data within 48 hrs of production