
Contributions to the modelisation and optimisation of large scale distributed computing


  1. Contributions to the modelisation and optimisation of large scale distributed computing Habilitation à diriger des recherches Cécile Germain-Renaud LRI and LAL http://www.lri.fr/~cecile/RAPH

  2. Summary • Introduction • A protocol and a model for Global Computing • Grid result-checking • Fault tolerant message-passing • Grid-enabling medical image analysis • Perspectives HDR 09/07/2005

  3. Old ideas « Programmers at computer A have a blurred photo which they want to put into focus. Their program transmits the photo to computer B, which specializes in computer graphics (…). If B requires specialized computer assistance, it may call on computer C for help »

  4. High performance computing systems • Massively parallel (Tera computer, Blue Gene/L, clusters): homogeneous hard/software; internal network; static management • Massively distributed (TeraGrid, DEISA, desktop grids such as SETI@home, EGEE, OSG): heterogeneous hard/software; the Internet; autonomic management

  5. High performance computing systems: new issues • Massively distributed: faults are normal events; data-centric; EP or moldable applications; performance is throughput; time-shared; middleware scheduling; very complex • Massively parallel: fault-free; CPU-centric; monolithic applications; performance is speedup; exclusive access; application-level scheduling; « simple » models (n1/2, LogP)

  6. Summary • Introduction • A protocol and a model for Global Computing • Grid result-checking • Fault tolerant message-passing • Grid-enabling medical image analysis • Perspectives

  7. XtremWeb architecture • Explores the Global Computing modality of grids • A large scale distributed system • Computing resources (collaborators) are: volatile, i.e. they come and go unexpectedly; « at the edge of the Internet », i.e. low-level and anonymous • Dedicated to high throughput (like Condor), vs applet systems • Consequence: a RISgC, Reduced Infrastructure Software for Grid Computing • Not master-slave, but a pull model: the collaborators decide when and what • A soft-state dispatcher/collaborator protocol (Request / Reply / Keepalive): the collaborator state expires if not refreshed • Deployed at Paris-Sud University for Auger
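The soft-state idea can be sketched in a few lines: the dispatcher never tracks departures explicitly, it simply forgets any collaborator whose lease is not renewed by a keepalive. A minimal sketch, with illustrative names rather than the actual XtremWeb API:

```python
import time

class Dispatcher:
    """Soft-state dispatcher sketch: a collaborator's registration
    expires unless refreshed by a keepalive message."""

    def __init__(self, lease=30.0):
        self.lease = lease        # seconds before soft state expires
        self.collaborators = {}   # collaborator id -> last refresh time

    def keepalive(self, cid, now=None):
        # Register or refresh; in the pull model the collaborator,
        # not the dispatcher, decides when this happens.
        self.collaborators[cid] = time.time() if now is None else now

    def expire(self, now=None):
        # Forget collaborators that missed their lease; no explicit
        # "goodbye" message is ever required from volatile resources.
        now = time.time() if now is None else now
        dead = [c for c, t in self.collaborators.items() if now - t > self.lease]
        for c in dead:
            del self.collaborators[c]
        return dead

    def alive(self):
        return sorted(self.collaborators)
```

A collaborator that crashes is indistinguishable from one that is merely slow to refresh, which is exactly why the lease period trades protocol overhead against detection latency.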

  8. A performance model • Applications are not so much moldable: tasks of unit duration • Soft-state introduces overhead: keepalive period α • The fault process on one site is largely unpredictable [Dinda 99] • But for each task, the fault process is memoryless if the successive execution sites are uncorrelated [Libel et al 02]: a Poisson process, where λ is the fault rate • The system cannot be tuned even for infinite resources
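Under the memoryless (Poisson) assumption, the basic quantities of such a model have closed forms. A sketch assuming a task of duration d, fault rate lam, immediate fault detection, and restart from scratch; these are the standard textbook formulas, not necessarily the exact expressions of the cited model:

```python
import math

def success_probability(d, lam):
    """P(no fault during a task of duration d) under a Poisson
    fault process of rate lam: exp(-lam * d)."""
    return math.exp(-lam * d)

def expected_attempts(d, lam):
    """Expected number of executions until one completes fault-free:
    geometric with success probability exp(-lam * d)."""
    return math.exp(lam * d)

def expected_completion(d, lam):
    """Expected time to completion when every fault forces a restart
    from scratch: (exp(lam * d) - 1) / lam."""
    return (math.exp(lam * d) - 1.0) / lam
```

The exponential growth of the expected completion time in lam * d is the formal face of the last bullet: once tasks are long relative to the mean time between faults, adding resources does not help.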

  9. Detection of ultra-high energy cosmic rays The Auger Observatory

  10. Detection of ultra-high energy cosmic rays • UHECR create particle showers • The Auger Observatory

  11. Detection of ultra-high energy cosmic rays • UHECR create particle showers • Indirect observation: ground detectors, fluorescence telescopes • In silico experiments: shower simulations at CCIN2P3 and on XtremWeb • The Auger Observatory: 200 physicists, 55 institutions, 15 countries, 3000 km2, 1600 tanks, 30 years lifetime…

  12. Publications • F. Cappello, S. Djilali, G. Fedak, C. Germain, O. Lodygensky and V. Neri. Calcul réparti à grande échelle, chapter XtremWeb : une plateforme de recherche sur le calcul global et pair à pair, pages 153-186, Lavoisier, 2002. • G. Fedak, C. Germain, V. Neri and F. Cappello. XtremWeb: A generic global computing platform. In IEEE/ACM CCGRID'2001, pages 582-587, IEEE Press, 2001. • C. Germain, G. Fedak, V. Neri and F. Cappello. Global Computing Systems. In 3rd Int. Conf. on Large Scale Scientific Computations, LNCS 2179, pages 218-227, Sozopol, 2001. Springer-Verlag. • C. Germain, V. Neri, G. Fedak and F. Cappello. XtremWeb: Building an experimental platform for global computing. In Procs. 1st IEEE/ACM Intl. Workshop Grid 2000. Springer, 2000.

  13. Summary • Introduction • A protocol and a model for Global Computing • Grid result-checking • Fault tolerant message-passing • Grid-enabling medical imaging • Perspectives - Towards a grid observatory

  14. Why do we need to check grid computations more carefully? • Attacks are likely on Global Computing systems: low control over collaborators • A real-world issue: errors happened in all deployed GC systems, even with a unique binary application (SETI: wrong FFTs; Decrypthon I: 5% errors); all double- or triple-checked their results • Might happen also in grid systems: submission tools and grid workflow management are at an early stage; code version management

  15. Contexts & related work • En masse checking [Sarmenta FGCS 2002] • Prevention: hardware support (TCPA/Palladium); code encryption • Detection: check that a property holds for an object, e.g. a program output (sorted array) or graph properties (random graph sample); result-checking [Blum]; property testing [Goldreich]

  16. Result checking and grid applications • Typical grid use cases: no independent property to check • Monte-Carlo simulations: a range of parameter values x internal randomization; local interactions, so the specification is the program; unknown output shape -> non parametric methods; fault tolerance through robust statistics • Search for rare events (SETI): not fault tolerance
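The robust-statistics idea can be illustrated with a median/MAD outlier test: because the median tolerates up to half the sample being corrupted, cheating collaborators do not contaminate the statistic they are tested against. A hypothetical sketch (the actual pre-qualification of Auger showers is more elaborate):

```python
import statistics

def flag_outliers(results, k=3.0):
    """Flag results lying more than k robust standard deviations from
    the median, using the median absolute deviation (MAD) as a
    robust scale estimate. Returns one boolean per result."""
    med = statistics.median(results)
    mad = statistics.median(abs(x - med) for x in results)
    scale = 1.4826 * mad  # consistency factor for a normal distribution
    if scale == 0:
        # Degenerate batch: everything equal to the median passes.
        return [x != med for x in results]
    return [abs(x - med) > k * scale for x in results]
```

With a mean/standard-deviation test instead, a single forged value of 10.0 would inflate the scale estimate and could mask itself; the median/MAD version flags it regardless.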

  17. Example: shower simulations • Pipeline: shower simulation -> detector simulation -> reconstruction • Only the input parameters can be falsified, e.g. (Fe, 1E20, θ, φ) replaced by (Pr, 1.5E20, θ', φ') • But extracting the input parameters from the data is just the problem!

  18. En masse result-checking • Goals • Minimize the checking overhead through adaptive tests; the most likely situations are: normal (the majority of collaborators are OK) and massive attack (the majority of collaborators cheat, or err) • Robustness to denial of service attacks, where the system is unable to assess the quality of its production • Efficiency for anonymous execution: private networks, IP spoofing • Results • A generic 2-phase test based on Wald's sequential test • An improvement for the Auger showers: pre-qualification of showers through empirical detection of outliers
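The generic test can be sketched with Wald's sequential probability ratio test, deciding between a nominal error rate and a massive-attack error rate from a stream of audited (re-executed) results; all parameter values below are illustrative:

```python
import math

def sprt(observations, p0=0.01, p1=0.5, alpha=0.01, beta=0.01):
    """Wald's sequential probability ratio test, sketched for
    result-checking. Each observation is 1 if a re-executed task
    disagrees with the collaborator's result, 0 otherwise. Decide
    'H0' (error rate around p0: system healthy) or 'H1' (error rate
    around p1: massive attack), stopping as early as the evidence
    allows. Thresholds follow Wald's approximations for error
    probabilities alpha and beta."""
    lo = math.log(beta / (1 - alpha))   # accept H0 at or below this
    hi = math.log((1 - beta) / alpha)   # accept H1 at or above this
    llr = 0.0                           # running log-likelihood ratio
    for n, err in enumerate(observations, 1):
        if err:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr <= lo:
            return "H0", n
        if llr >= hi:
            return "H1", n
    return "undecided", len(observations)
```

The sequential form is what makes the overhead adaptive: in the two most likely situations (almost no errors, or almost all errors) the test stops after only a handful of audited results.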

  19. Overview (diagram: batch segmentation, sample selection, sample qualification, re-execution, oracle)

  20. Publications • C. Germain and D. Monnier-Ragaigne. Grid Result Checking. In Procs. 2nd Computing Frontiers, Ischia, May 2005. ACM Press. • C. Germain and N. Playez. Result-Checking in Global Computing Systems. In Procs. 17th ACM Int. Conf. on Supercomputing, pages 226-233, San Francisco, June 2003. ACM Press.

  21. Summary • Introduction • A protocol and a model for Global Computing • Grid result-checking • Fault tolerant message-passing • Grid-enabling medical imaging • Perspectives - Towards a grid observatory

  22. Grids create new contexts for message passing • Message passing environments, and especially MPI, are the standard for parallel computing, but have been designed with MPPs in mind: fault-free execution + internal network • New use cases for message passing • Global Computing: loosely coupled computations, very frequent faults • Institutional grids: coupled computations, moderately frequent faults • Very large clusters: tightly coupled computations, infrequent faults; or frequent ones under time-sharing, cf the Connection Machine • The design space trades off fault tolerance, tunneling and fault-free performance (FT-MPI, MPI-FT, FT/MPI, …)

  23. Contributions to MPICH-V • Software stack: user application; process virtualization and dispatch; communication virtualization; communication library over TCP sockets; checkpoint/restart library based on Condor libckpt.a

  24. Pessimistic message logging on Channel Memories • Decoupled communication: the sending MPI process puts each message to a Channel Memory (CM), and the receiving MPI process gets it from there • Dedicated reliable nodes support the Channel Memory servers • CMs log all messages in FIFO order • Conceptually transactional put/get • A restarted process transparently replays all its communications • Consistent execution based on partial restarts • Adaptive to heterogeneous fault behaviour: independent scheduling of process checkpoints • Tunneling as a byproduct
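The put/get discipline can be sketched as follows for a single channel; the class and method names are illustrative, not the actual MPICH-CM interface:

```python
class ChannelMemory:
    """Sketch of pessimistic message logging on a Channel Memory:
    a reliable intermediary logs every message in FIFO order, so a
    restarted receiver can replay its whole reception history
    without the senders being rolled back."""

    def __init__(self):
        self.log = []        # every message ever put, in FIFO order
        self.cursor = {}     # receiver id -> next log index to deliver

    def put(self, msg):
        # Logging happens before any delivery: pessimistic logging.
        self.log.append(msg)

    def get(self, receiver):
        # Normal reception and post-restart replay use the same code
        # path: delivery order is dictated by the log, which makes
        # re-execution deterministic.
        i = self.cursor.get(receiver, 0)
        if i >= len(self.log):
            return None  # would block in the real protocol
        self.cursor[receiver] = i + 1
        return self.log[i]

    def restart(self, receiver):
        # A recovered process simply rewinds its cursor and replays;
        # this is what enables partial restarts.
        self.cursor[receiver] = 0
```

Because the log, not the senders, drives replay, only the failed process restarts: the partial-restart consistency claimed on the slide.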

  25. The MPICH-CM library • Layering: MPI_Send -> Channel Interface (MPID_SendControl, MPID_SendChannel) -> Chameleon Interface (PIbsend) -> CM device interface (_cmbsend) • ADI primitives of the CM device: _cmbsend, blocking send; _cmbrecv, blocking receive; _cmfrom, get the src of the last message; _cmprobe, check for any message available; _cmInit, initialize the client; _cmFinalize, finalize the client • Implemented over blocking TCP read/write, with control + data messages

  26. Communication overhead • Every message traverses a Channel Memory (a put by the sender, then a get by the receiver): a x2 communication overhead • Bounded by the CM node bandwidth

  27. Towards a hybrid approach • Coupled applications have a hierarchical structure and differentiated requirements: fault-tolerant message passing within each cluster, self-stabilizing coordination across the WAN • Asynchronous iterations [Bertsekas] also apply to faults • A reliable tunneling infrastructure is required anyway • But this requires re-coding even the innermost loop for non-trivial applications, e.g. multi-grid

  28. Publications • A. Selikhov and C. Germain. A channel memory based environment for MPI applications. Future Generation Computer Systems, 21(5):709-715, 2004. • A. Selikhov, G. Bosilca, C. Germain, G. Fedak and F. Cappello. MPICH-CM: a Communication Library Design for a P2P MPI Implementation. In 9th Euro PVM/MPI Conf., LNCS 2474, pages 323-330, Vienna, Oct. 2002. Springer-Verlag. • G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Hérault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri and A. Selikhov. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes. In IEEE/ACM Int. Conf. for High Performance Computing and Communications 2002 (SC'02 - SuperComputing'02), Baltimore, 2002.

  29. Summary • Introduction • A protocol and a model for Global Computing • Grid result-checking • Fault tolerant message-passing • Grid-enabling medical image analysis • Perspectives

  30. Medical image analysis exemplifies the need for… • Seamless integration of grid resources with local tools: analysis, graphics, … • Unplanned access to high-end computing power and data • Interactivity • But convergence with many other areas
  (Illustration: a batch queue listing)
  Id              Owner  Submitted   ST  PRI  Class     Running On
  f01n01.10873.0  qzha   5/19 07:34  R   50   fewcpu    f11n07
  f01n03.6292.0   agma   5/22 14:50  R   50   standard  f12n02
  f01n03.6293.0   publ   5/22 16:16  R   50   standard  f03n09
  f01n03.6304.0   agma   5/22 22:46  R   50   standard  f11n05
  f01n03.6309.0   agma   5/23 12:41  R   50   standard  f01n11
  f01n01.10914.0  ying   5/23 14:17  R   50   fewcpu    f06n03
  f01n02.4596.0   dpan   5/23 15:33  I   50   standard
  f01n03.6310.0   divi   5/23 16:03  I   50   standard

  31. gPTM3D: grid-enabling the PTM3D software • PTM3D (poste de travail médical 3D): A. Osorio & team, LIMSI; in clinical use, « cum laude » at RSNA 2004 • Complex interface: optimized graphics and medically-oriented interactions • Expert interaction is required at, and inside, all steps of the workflow (Acquire, Explore, Analyse, Interpret, Render) • But 3D medical data may be very large (1GB), and computations too

  32. gPTM3D: grid-enabling the PTM3D software on a production grid • PTM3D (poste de travail médical 3D): A. Osorio & team, LIMSI; in clinical use, « cum laude » at RSNA 2004 • Complex interface: optimized graphics and medically-oriented interactions • Expert interaction is required at, and inside, all steps of the workflow (Acquire, Explore, Analyse, Interpret, Render) • But 3D medical data may be very large (1GB), and computations too

  33. EGEE computing resources, April 2004 • In EGEE-0 (LCG-2): > 130 sites, > 14,000 CPUs, > 5 PB storage • 70 leading institutions in 27 countries, federated in regional Grids • ~32 M euros of EU funding for the first 2 years • (Map legend: countries providing resources; countries anticipating joining EGEE/LCG) From the project status slides, 1st EGEE review

  34. Interactive volume reconstruction • gPTM3D first results • Optimal response time for volume reconstruction on EGEE • With an unmodified interaction scheme • Demonstrated at the first EGEE review

  35. Figures for volume reconstruction
                         Small body           Medium body           Large body            Lungs
  Dataset                87MB                 210MB                 346MB                 87MB
  Input data             3MB (18KB/slice)     9.6MB (25KB/slice)    15MB (22KB/slice)     410KB (4KB/slice)
  Output data            6MB (106KB/slice)    57MB (151KB/slice)    86MB (131KB/slice)    2.3MB (24KB/slice)
  Tasks                  169                  378                   676                   95
  Standalone execution   5mn15s / 1mn54s      33mn / 11mn5s         18mn / 36s            .
  EGEE                   37s / 18s            2mn30s / 1mn15s       2mn03 / 24s           .

  36. Interactive jobs on a grid: a scheduling problem • Short Deadline Jobs (SDJ) • A moldable application: individual tasks are very fine-grained • Soft deadline • No reservation: should be executed immediately or rejected • Sharing contract • Bounded slowdown for regular jobs • Do not degrade resource utilization • No strong preemption • Fair share across SDJ • Contexts • (Multi)processor soft real-time scheduling • Network routing: Differentiated Services
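The sharing contract can be illustrated with a toy node-level scheduler: SDJ run on a permanently reserved virtual processor, which regular batch jobs borrow whenever no SDJ is present. An illustrative model, not the actual EGEE middleware:

```python
class VirtualProcessorScheduler:
    """Toy scheduler for the SDJ sharing contract: one virtual
    processor is permanently reserved for short-deadline jobs, but
    batch jobs may use it while no SDJ is waiting, so the
    reservation is transparent when unused."""

    def __init__(self):
        self.sdj_queue = []
        self.batch_queue = []

    def submit(self, job, short_deadline=False):
        (self.sdj_queue if short_deadline else self.batch_queue).append(job)

    def next_job(self):
        # An SDJ is never queued behind batch work: it is either
        # started immediately or, in the full scheme, rejected.
        if self.sdj_queue:
            return self.sdj_queue.pop(0)
        # Otherwise the reserved virtual processor is transparently
        # lent to regular jobs, so utilisation is not degraded.
        if self.batch_queue:
            return self.batch_queue.pop(0)
        return None
```

Running batch jobs are never killed when an SDJ arrives; the SDJ simply takes the next dispatch slot, which is the "no strong preemption" clause of the contract.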

  37. Job submission • Components: User Interfaces, a proxy, Brokers (matchmaking, JSS), Interaction Bridges (tunneling), CE, cluster scheduler, nodes • SDJ scheduling: permanent reservation on virtual processors • Transparent when unused

  38. Task prioritization • Same components, with a task prioritization (TP) stage for SDJ scheduling at the node • Permanent reservation on virtual processors • Transparent when unused

  39. Scheduling tasks • Coping with the submission penalty • N tasks, each with a small latency T • Potential completion bandwidth 1/T • Impaired by the submission protocol • A case for application-level scheduling: a scheduling agent behind the Interaction Bridge dispatches tasks to worker agents on the nodes
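The submission-penalty argument can be made concrete with a back-of-the-envelope model comparing per-task grid submission with an application-level scheme that pays the submission cost once per worker agent; the formulas and parameters are illustrative, not measurements:

```python
import math

def completion_time(n_tasks, t_task, t_submit, workers=1, pilot=False):
    """Rough completion-time model for n_tasks fine-grained tasks.
    Without a pilot, every task pays the submission overhead
    t_submit; with a pilot (application-level scheduling), t_submit
    is paid once per worker agent, which then pulls tasks at the
    full completion bandwidth 1/t_task."""
    per_worker = math.ceil(n_tasks / workers)
    if pilot:
        return t_submit + per_worker * t_task
    return per_worker * (t_submit + t_task)
```

With, say, 100 one-second tasks and a 10-second submission protocol, per-task submission costs 1100 s on one worker, while a single pulled-task agent finishes in 110 s: the submission penalty, not the computation, dominates.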

  40. Publications • C. Germain, R. Texier and A. Osorio. Interactive Volume Reconstruction and Measurement on the Grid. Methods of Information in Medicine, 44(2):227-232, 2005. • C. Germain, R. Texier and A. Osorio. Interactive Exploration of Medical Images on the Grid. In Procs. 2nd European HealthGrid Conference, Clermont-Ferrand, Jan. 2004. • C. Germain, A. Osorio and R. Texier. A Case Study in Medical Imaging and the Grid. In S. Norager, editor, Procs. 1st European HealthGrid Conference, pages 110-118, Lyon, Jan. 2003. EC-IST. • D. Berry, C. Germain-Renaud, D. Hill, S. Pieper and J. Saltz. Report on the Workshop IMAGE'03: Images, medical analysis and grid environments. TR UKeS-2004-02, UK National e-Science Centre, Feb. 2004.

  41. AGIR: Analyse Globalisée des Données d’Imagerie Radiologique • A multidisciplinary research network funded by ACI Masses de Données • Advances in medical imaging algorithms and their use • Image processing: raw computing/data power • Sharing data and algorithms: evaluation is a major issue • From algorithmic research to clinical practice • Identify and explore new services and mechanisms required by medical imaging

  42. Partners • 14 computer scientists, 6 physicians, 6 PhD students and 4 engineers, from CNRS-STIC, CNRS-IN2P3, INRIA, INSERM and hospitals • AlGorille; CRAN (compression); LRI, coll. LAL; LIMSI, with St Anne and Tenon (FMP, interaction & grids); LPC (EGEE Biomed VO); CHRU Clermont (collaborative medicine); CREATIS (4D segmentation); Rainbow (software components); Epidaure (medical imaging); Centre Antoine Lacassagne • Collaborations: EGEE, Grid5000

  43. A cross-section of AGIR • Grid-enabled workflow: registration algorithms, automatic nodule CAD, and their evaluation against gold, consensus and bronze standards • gPTM3D: volume reconstruction, PTM3D calibration • Compression: QVAZM3D, SPIHT, ADOC • Partially reliable transport protocol, network emulation

  44. More in • C. Germain, V. Breton, P. Clarysse, Y. Gaudeau, T. Glatard, E. Jeannot, Y. Legré, C. Loomis, J. Montagnat, J-M Moureaux, A. Osorio, X. Pennec and R. Texier. Grid-enabling medical image analysis. In Procs. 3rd BioGrid'05, Cardiff, May 2005. IEEE Press. • http://www.aci-agir.org

  45. Summary • Introduction • A protocol and a model for Global Computing • Grid result-checking • Fault tolerant message-passing • Grid-enabling medical image analysis • Conclusions & Perspectives

  46. Conclusion • Compatibility of soft-state protocols with high throughput • Revisited result-checking: Monte-Carlo computations • Message-passing: fault-tolerance, performance and technology constraints point in the same direction • An architecture for differentiated services • A grid testbed for algorithmic and clinical research in 3D medical imaging

  47. Perspectives • Data of interest: interactive grid access requires intelligent prefetch mechanisms to capture and anticipate the way data are explored and analyzed • Automatic selection • A model for describing the resulting requirements and propagating them to the data source • Optimised access schemes in relation with the structure of the raw data • Progressivity

  48. Perspectives • Towards a grid observatory: optimizing grid middleware and applications requires traces and models for • An intrinsic characterization of « grid traffic », e.g. the data locality parameters at a computing element • The reaction of the middleware components to these requirements, e.g. hits and misses • The spatio-temporal correlation of users and VOs • To what extent the latter explains the former ones • MAGIE and DEMAIN projects

  49. Questions
