Belle Data Grid Deployment … Beyond the Hype
Lyle Winton, Experimental Particle Physics, University of Melbourne
eScience, December 2005
Belle Experiment
• Belle in KEK, Japan
• Investigates symmetries in nature
• CPU and data requirements explosion!
  • 4 billion events required simulation in 2004 to keep up with data production
• Belle MC Production effort
  • Australian HPC has contributed
• Belle is an ideal case:
  • has real research data
  • has a known application workflow
  • has a real need for distributed access and processing
Background
• The general idea…
• Investigation of Grid tools (Globus v1, v2, LCG)
• Deployment to a distributed testbed
• Utilisation of the APAC and partner facilities
• Deployment to the APAC National Grid
Australian Belle Testbed
• Rapid deployment at 5 sites in 9 days
  • U.Melb. Physics + CS, U.Syd., ANU/GrangeNet, U.Adelaide CS
  • IBM Australia donated dual Xeon 2.6 GHz nodes
• Belle MC generation of 1,000,000 events
  • simulation and analysis
• Demonstrated at PRAGMA4 and SC2003
• Globus 2.4 middleware
• Data management
  • Globus 2 replica catalogue
  • GSIFTP
• Job management
  • GQSched (U.Melb. Physics)
  • GridBus (U.Melb. CS)
Initial Production Deployment
• Custom-built central job dispatcher
  • initially used ssh and PBS commands
    • feared Grid was unreliable
    • at the time only 50% of facilities were Grid accessible
• SRB (Storage Resource Broker)
  • transfer of input data: KEK → ANUSF → Facility
  • transfer of output data: Facility → ANUSF → KEK
• Successfully participated in Belle's 4×10⁹ event MC production during 2004
• Now running on the APAC NG using LCG2/EGEE
Issues
• Deployment
  • time consuming for experts
  • even more time consuming for site admins with no experience
  • requires loosening security (network, unknown services, NFS on exposed boxes)
  • Grid services and clients generally require public IPs with open ports
• Middleware/Globus bugs, instabilities, failures
  • too many to list here
  • errors, logs, and manuals are frequently insufficient
• Distributed management
  • version problems between Globus releases (e.g. globus-url-copy can hang)
  • stable middleware is compiled from source, but OS upgrades can break it
  • once installed, how do we keep it configured, considering…
    • growing numbers of users and communities (VOs)
    • expanding interoperable Grids (more CAs)
• Applications
  • installed by hand at each site
  • many require access to a DB or remote data while processing
  • most clusters/facilities have private/off-internet compute nodes
Issues
• Staging workarounds
  • GridFTP is not a problem; SRB, however, is more difficult
  • remote queues for staging (APAC NF)
  • front-end node staging to a shared FS (via jobmanager-fork)
  • front-end node staging via SSH
• No national CA (for a while)
  • started with an explosion of toy CAs
• User access barriers
  • user has a cert. from a CA … then what?
  • access to facilities is more complicated (allocation/account/VO applications)
  • then all the above problems start!
• Is Grid worth the effort?
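The front-end node staging workaround above can be sketched as a wrapper that copies files through the cluster's front end, since compute nodes have no external access. This is a minimal illustration, not the dispatcher actually used; the hostname, paths, and job script name are hypothetical.

```python
# Sketch of the front-end SSH staging workaround: input files are pushed
# to the front-end node's shared filesystem, the batch job is submitted
# there, and outputs are pulled back afterwards. Hostnames and paths are
# hypothetical examples.

def build_staging_plan(frontend, inputs, outputs, workdir="/scratch/job"):
    """Return the shell commands to stage files through a front-end node."""
    plan = []
    for f in inputs:
        # stage-in: copy each input onto the shared filesystem via the front end
        plan.append(f"scp {f} {frontend}:{workdir}/")
    # submit the batch job from the front end (job.pbs is a placeholder name)
    plan.append(f"ssh {frontend} 'qsub {workdir}/job.pbs'")
    for f in outputs:
        # stage-out: copy results back after the job completes
        plan.append(f"scp {frontend}:{workdir}/{f} ./")
    return plan

if __name__ == "__main__":
    for cmd in build_staging_plan("frontend.example.edu.au",
                                  ["events.in"], ["events.out"]):
        print(cmd)
```

Keeping the plan as a list of commands (rather than executing directly) makes it easy to log, dry-run, and retry each step, which matters given the reliability issues listed above.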
Observations
• Middleware
  • Everything is fabric; lack of user tools!
  • Initially only Grid fabric (low level), e.g. Globus2
  • Application-level or 3rd-generation middleware, e.g. LCG/EGEE, VDT
    • overarching, joining, coordinating fabric
    • user tools for application deployment
  • Everybody must develop additional tools/portals for everyday (non-expert) user access
  • No out-of-the-box solutions
• Real Data Grids!
  • Many international big-science research collaborations are data focused
  • This is not simply a staging issue!
    • jobs need seamless access to data (at the start, middle, and end of a job)
    • many site compute nodes have no external access
    • middleware cannot stage/replicate databases
    • in some cases file access is determined at run time (ATLAS)
  • Current jobs must be modified/tailored for each site (not really Grid)
Observations
• Information Systems
  • Required for resource brokering and for debugging problems
  • MDS/GRIS/BDII are often unused (e.g. by Nimrod/G, GridBus)
    • not because of the technology
    • never given a certificate
    • never started
    • never configured for the site (PBS etc.)
    • never configured to publish (GIIS or top-level BDII)
    • never checked
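One concrete way to catch the "never configured, never checked" failures above is a sanity check over what a site actually publishes. The sketch below validates a published resource record against a minimum attribute set; the attribute names are illustrative stand-ins, not the real GLUE/MDS schema.

```python
# Minimal sketch of an information-system sanity check: verify that a
# published resource record carries a minimum set of attributes and that
# none are empty. Attribute names are illustrative, not the actual
# MDS/GLUE schema.

REQUIRED_ATTRS = {"hostname", "queue", "free_cpus", "max_wallclock"}

def check_record(record):
    """Return a list of problems found in one published record (dict)."""
    problems = []
    for attr in sorted(REQUIRED_ATTRS - record.keys()):
        problems.append(f"missing attribute: {attr}")
    for attr, value in sorted(record.items()):
        if value in (None, ""):
            problems.append(f"empty attribute: {attr}")
    return problems
```

Run against every record a site publishes, a check like this distinguishes "the technology failed" from "the site never filled in its configuration", which the observations above suggest is the more common case.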
Lessons/Recommendations
• NEED tools to determine what's going on (debugging)
  • jobs and scripts must have debug output/modes
  • middleware debugging MUST be well documented:
    • error codes and messages
    • troubleshooting
    • log files
• Application middleware must be coded for failure!
  • service death, intermittent connection failure, data removal, proxy timeout, and hangs are all to be expected
  • all actions must include external retry and timeout
• Information systems
  • e.g. queue is full, application not installed, not enough memory
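The "coded for failure" rule above can be sketched as a generic retry wrapper: every external action gets a bounded number of attempts with backoff between them. This is a simplified illustration; the `action` callable stands in for any middleware call (transfer, submit, query).

```python
import time

# Sketch of "all actions must include external retry": run an action,
# retrying on any exception with exponential backoff, and surface the
# last error only once all attempts are exhausted.

def with_retry(action, attempts=3, delay=1.0, backoff=2.0):
    """Run action(), retrying on any exception; return its result."""
    last_error = None
    for i in range(attempts):
        try:
            return action()
        except Exception as err:      # service death, hangs, timeouts...
            last_error = err
            if i < attempts - 1:
                time.sleep(delay)     # wait before the next try
                delay *= backoff      # back off progressively
    raise last_error                  # all retries exhausted
```

A real deployment would also wrap each attempt in a hard timeout (e.g. running the action in a subprocess and killing it), since hung calls such as a stuck `globus-url-copy` never raise on their own.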
Lessons/Recommendations
• Quality and availability are key issues
• Create service regression test scripts!
  • small config changes or updates can have big consequences
  • run from the local site (tests services)
  • run from a remote site (tests network)
• Site validation/quality checks
  • 1 – are all services up and accessible?
  • 2 – can we stage in, run, and stage out a baseline batch job?
  • 3 – do I.S. conform to minimum schema standards?
  • 4 – are I.S. populated, accurate, and up to date?
  • 5 – repeat 1-4 regularly
• Operational metrics are essential
  • help determine stability and usability
  • eventually provide justification for using Grid
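The validation steps above lend themselves to a small check runner that executes named probes and reports pass/fail, so the same run feeds both site validation and operational metrics. The probes here are trivial stand-ins; real ones would test services, a baseline staged job, and the information system.

```python
# Sketch of a site validation run: each check is a named callable
# returning True/False. A probe that crashes counts as a failure, and
# the per-check report can be logged over time as an operational metric.

def run_site_checks(checks):
    """Run {name: callable} checks; return (report dict, overall pass)."""
    report = {}
    for name, probe in checks.items():
        try:
            report[name] = bool(probe())
        except Exception:             # a crashing probe is a failed check
            report[name] = False
    return report, all(report.values())
```

Scheduling this regularly from both a local and a remote host covers step 5 above and separates service faults from network faults.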
Lessons/Recommendations
• Start talking to system/network admins early
  • education about Grid, GSI, and Globus
  • logging and accounting
  • public IPs with a shared home filesystem
• Have a dedicated node manager, for both OS and middleware
  • don't underestimate the time required
  • installation and testing ~ 2-4 days for an expert, 5-10 days for a novice (with instruction)
  • maintenance (testing, metrics, upgrades) ~ 1/10 days
• Have a middleware distribution bundle
  • too many steps to do at each site
  • APAC NG hoping to solve with Xen VM images
• Automate general management tasks
  • authentication lists (VO)
  • CA files, especially CRLs
  • host cert checks and imminent-expiry warnings
  • service up checks (auto restart?)
  • file clean-up (GRAM logs, GASS cache?, GT4 persisted)
• BADG Installer: single-step, guided GT2 installation (http://epp.ph.unimelb.edu.au/EPPGrid)
• GridMgr: manages VOs, certs, CRLs (http://epp.ph.unimelb.edu.au/EPPGrid)
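The imminent-expiry warning above is one of the easiest tasks to automate. The sketch below flags host certificates expiring within a warning window, given the certificate's notAfter timestamp (as could be extracted with e.g. `openssl x509 -enddate -noout`); the timestamp format used here is an assumption for illustration.

```python
from datetime import datetime, timedelta

# Sketch of an automated imminent-expiry warning for host certificates.
# The "%Y-%m-%d %H:%M:%S" timestamp format is an assumed normalised form,
# not what openssl prints verbatim.

def days_until_expiry(not_after, now):
    """Days from `now` until the notAfter timestamp (negative = expired)."""
    expiry = datetime.strptime(not_after, "%Y-%m-%d %H:%M:%S")
    return (expiry - now) / timedelta(days=1)

def expiry_warning(not_after, now, window_days=14):
    """Return a warning string if the cert is expired or expiring soon."""
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        return "EXPIRED"
    if remaining <= window_days:
        return f"expires in {remaining:.1f} days"
    return None                      # nothing to report yet
```

Run nightly over all host certs (and CRLs, which expire far more often), this turns silent authentication failures into advance warnings.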
International Interoperability
• HEP case study
  • application groups had to develop coordinated dispatchers and adapters
  • researchers jumping through hoops → in my opinion, a failure
  • limited manpower, limited influence over implementation
• If we are serious, we MUST allocate serious manpower and priority, with authority over Grid infrastructure
• Minimal services and the same middleware are not enough
• Test-case applications are essential
• Operational metrics are essential
Benefits
• Access to resources
• Funding to develop expertise and for manpower
• Central expertise and manpower (APAC NG)
• Other infrastructure (GrangeNet, APAC NG, TransPORT SX)
• Early adoption has been important
  • initially, access to more infrastructure
  • ability to provide experienced feedback
• Enabling large-scale collaboration, e.g. ATLAS
  • produces up to 10 PB/year of data
  • 1800 people, 150+ institutes, 34 countries
  • aim to provide low-latency access to data within 48 hrs of production