Distributed Systems Laboratory cs.technion.ac.il/Labs/dsl

Presentation Transcript


  1. Distributed Systems Laboratory www.cs.technion.ac.il/Labs/dsl MSR HPC visit

  2. Lab People - Faculty
     • Prof. Ran El-Yaniv (Learning, Data Mining)
     • Prof. Roy Friedman (Distributed Systems, Ad hoc Networks)
     • Prof. Erez Petrank (Memory Management)
     • Dr. Avi Mendelson (Computer Architecture)
     • Prof. Assaf Schuster, Head (Large-Scale Data Processing, Distributed Systems)

  3. Lab People
     • Engineers: Eran Issler, Max Kovgan, David Carmeli, Valentin Kravtchov, Artiom Sharov
     • About 40 graduate research students (best of breed!)
     • Dozens of undergraduate and graduate students working on projects each semester
     • Hundreds of undergraduate students in systems courses

  4. Sponsors and Partners

  5. Scope
     • Condor – Grid Computing – Research, development, deployment
     • Software Distributed Shared Memory
     • Grid/P2P/Sensor Data Mining
     • Large Scale Distributed Data Mining
     • Genetic Linkage Analysis
     • Distributed Scalable Model Checking
     • Anonymous and Private Distributed Data Mining
     • Machine Learning
     • Sensor Networks
     • Internet Mining
     • Light-weight group communication
     • Services for Ad-hoc networks
     • Fast interconnects for HPC and data processing
     • Data Privacy in Distributed Databases
     • Locality in large-scale computations
     • Scalable Data Race Detection
     • Highly Available Distributed Java
     • Multilevel caching in storage systems
     • Computer Architecture: Fine Grain Parallelization
     (Diagram layer labels: System, Applications, Middleware, Virtualization, Hardware)

  6. The Resource Hierarchy (diagram labels: GLOW - UW Madison; Boinc @HOME)

  7. EGEE

  8. DSL users
     • Dr. Avi Mendelson – Trace cache
     • Prof. Ran El-Yaniv – Machine Learning
     • Prof. Roy Friedman – Group Communication
     • Prof. Assaf Schuster – Large scale and grid
     • Prof. Eli Biham – Cryptography
     • Prof. Dan Geiger – Genetic Linkage Analysis
     • Prof. Orna Grumberg – Scalable Model Checking
     • Prof. Uri Weiser – Computer Architecture
     • Prof. Ron Pinter – Caching Architectures
     • Prof. Ronny Kimmel – 3D Image processing
     • Prof. Reuven Cohen – Communication Networks
     • Prof. Danny Raz – Active Distributed Services
     • Prof. Idit Keidar – Distributed Systems
     • Prof. Mooly Sagiv – Compiler Analysis
     • Prof. Shaul Markovitch – Machine Learning
     • Prof. Yoram Rosen – High Energy Physics
     • …

  9. Contents - Tools
     • Multiview – Distributed Shared Memory
     • Data race detection
     • Model checking-based DRD
     • Grid Monitoring System
     • Decorative HA for grids

  10. Contents – Large-Scale Distributed Systems
     • Peer-to-Peer Data Mining
     • DataMiningGrid project
     • QosCosGrid project
     • Distributed runtime for multithreaded Java
     • Distributed Model Checking

  11. Multiview – Technologies for Distributed Shared Memory [OSDI’99]

  12. See Multiview in a separate presentation

  13. Data Race Detection for C++ Programs [PPOPP’03]

  14. See MultiRace in a separate presentation

  15. Model Checking-Based Data Race Detection [PPOPP’05]

  16. Difficulties in model checking data races
     • Infinite state space
     • Huge number of interleavings
     • Huge transition systems
     • Size problem

  17. Basic idea

  18. Hybrid solution
     • Combine Lockset & Model Checking
     • Provide witnesses for data races
     • Rare data races
     • Data races in large programs
     • Model checking: provides witnesses for rare data races + Lockset: scales to large programs

  19. Idea and Prototype (diagram labels: Multi-threaded program; Lockset; List of Warnings; Violations of locking principle; Access suspicious of racing; Find; Extend; Wolf model checker; snapshot; witness)

  20. Benchmark programs

  21. Experimental results
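
The Lockset half of the hybrid approach in slides 18 and 19 can be illustrated with a minimal sketch. It assumes an Eraser-style candidate-lockset refinement: each shared variable keeps the set of locks held on every access so far, and a warning is raised once that set becomes empty after more than one thread has touched the variable. The class and method names are hypothetical and are not taken from the lab's tools; the model-checking half (producing a witness for each warning) is not shown.

```python
from collections import defaultdict


class LocksetChecker:
    """Simplified lockset checker (illustrative; names are assumptions)."""

    def __init__(self):
        self.held = defaultdict(set)      # thread -> locks currently held
        self.candidates = {}              # variable -> candidate lockset
        self.threads = defaultdict(set)   # variable -> threads that accessed it
        self.warnings = []                # variables that violate the locking principle

    def acquire(self, thread, lock):
        self.held[thread].add(lock)

    def release(self, thread, lock):
        self.held[thread].discard(lock)

    def access(self, thread, var):
        held = set(self.held[thread])
        if var not in self.candidates:
            # First access: every lock held now is a candidate protector.
            self.candidates[var] = held
        else:
            # Later accesses: keep only locks held on every access so far.
            self.candidates[var] &= held
        self.threads[var].add(thread)
        shared = len(self.threads[var]) > 1
        if shared and not self.candidates[var] and var not in self.warnings:
            self.warnings.append(var)
```

A warning from such a checker only says that no single lock consistently protected the variable; it may be a false alarm, which is why the slides pair it with a model checker that tries to produce a witness.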

  22. Mining for Misconfigured Machines in a Grid System [KDD’06] Tested successfully in a production environment.

  23. Grid Batch Systems (diagram labels: Execution; Submission; Resource broker)
     • Many organizations or administration sites
     • 10,000s of machines
     • Heterogeneous machines
     • Non-dedicated
     • Different installation and configuration
     • Many potential causes of failures and misbehaviors
     • Software bugs, hardware, network, configuration
     • Current solutions
     • Manual diagnosis
     • Rule-based expert system
     • Data mining
     • Limited, if any, prior knowledge

  24. Data Acquisition (diagram labels: Data collector; Data miner)
     • Data collector
     • Non-intrusive
     • Distributed Database
     • Preprocessing
     • Data miner
     • Distributed

  25. Distributed Outlier Detection

  26. Distributed Outlier Detection

  27. Distributed Outlier Detection

  28. Distributed Outlier Detection

  29. Distributed Implementation (diagram labels: P1, P2, P3; S1, S2, S3; SG2, SG3; SG)

  30. Distributed Implementation (diagram labels: P1, P2, P3; S1, S2, S3; SG2, SG3; SG)

  31. Distributed Implementation (diagram labels: P1, P2, P3; S1, S2, S3; SG1, SG2, SG3; SG)

  32. Evaluation on DSL Hardware
     • 3 of the top 4 suspected machines are actually misconfigured.
     • bh10: unknown reason.
     • i4: loaded by network service.
     • bh13: active HyperThreading.
     • i3: root file system was nearly full.

  33. Future Work
     • Fault identification, analysis, classification, prediction
     • Better resource allocation; better system utilization
     • Feedback to users on submitted job descriptions
     • Optimizing transparent operation
     • Collaboration with the Intel NetBatch team
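
As an illustration of the outlier-detection idea in slides 22 to 32, here is a minimal, centralized sketch. The transcript does not say which outlier measure the system uses, so this assumes a common one: a machine's score is its distance to its k-th nearest neighbor in feature space, and the highest-scoring machines are the top suspects. The function names, machine names, and feature values are hypothetical, and the distributed computation shown in slides 29 to 31 is not reproduced.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each machine by the distance to its k-th nearest neighbor.

    `points` maps a machine name to a numeric feature vector.
    Requires more than k machines.
    """
    scores = {}
    for name, p in points.items():
        dists = sorted(
            math.dist(p, q) for other, q in points.items() if other != name
        )
        scores[name] = dists[k - 1]
    return scores


def top_suspects(points, k=2, n=1):
    """Return the n machines with the largest outlier scores."""
    scores = knn_outlier_scores(points, k)
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical, made-up feature vectors (e.g. normalized load and disk usage).
machines = {
    "m1": (1.0, 1.0),
    "m2": (1.1, 0.9),
    "m3": (0.9, 1.1),
    "m4": (5.0, 5.0),
}
suspects = top_suspects(machines, k=2, n=1)
```

With these made-up values, `m4` sits far from the cluster formed by the other three machines, so it ranks first.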

  34. HA for large scale grids [HPDC’06] Production System – Condor distribution

  35. The Challenges
     • WAN backups
     • Failure detection is not perfect - no bounded delay
     • Network anomalies - links are asymmetric, not transitive
     • IP fail-over techniques inapplicable
     • Lightweight protocols
     • Traditional Group Communication algorithms do not scale well
     • Autonomous partitions
     • Transient failures
     • Legacy applications without HA
     • Grid developers do not want to deal with HA

  36. The Goal
     • The goal is to turn HA into a commodity
     • “HA out of the box”
     • No need to change or adapt your existing service
     • HA is provided as a Grid service itself
     Solution:
     • Decoration
     • Transparent addition of HA to already existing and deployed services
     • No changes to the decorated service

  37. Application: HA for Condor (diagram labels: Central Manager (×2); Negotiator; Collector; Execution machine (×3); Job queue machine (×4))

  38. Solution Architecture

  39. Solution Highlights
     • HAInvocator - High Availability for Negotiator
     • Leader election
     • Automatic failure detection
     • Transparent failover to backup
     • “Split brain” reconciliation after network partitions
     • HAReplicator - Persistency of Negotiator state
     • State replication between active and backups
     • Proxy for multicasting client’s messages to Collector
     • Loose coupling between replication and HA

  40. Status
     • Passed testing in 2005
     • Not a single code line of Condor changed
     • Except for several bug fixes
     • Included in the Condor distribution as of version 6.8
     • Some important clients
     • Some success stories
     • On-going collaboration with the Condor team
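
The leader election and failure detection listed in slide 39 can be sketched as follows. This is not the protocol used in Condor; it is a minimal illustration that assumes each replica records peer heartbeats, treats a peer as failed once no heartbeat has arrived within a timeout, and takes the live replica with the smallest id as the leader. The class name, method names, and timeout value are hypothetical.

```python
class Replica:
    """Heartbeat-based failure detector with lowest-id leader election (illustrative)."""

    def __init__(self, rid, timeout=3.0):
        self.rid = rid            # this replica's id
        self.timeout = timeout    # seconds without a heartbeat before a peer is suspected
        self.last_seen = {}       # peer id -> time of last heartbeat

    def on_heartbeat(self, peer, now):
        self.last_seen[peer] = now

    def alive(self, now):
        # A replica always considers itself alive.
        live = {p for p, t in self.last_seen.items() if now - t <= self.timeout}
        live.add(self.rid)
        return live

    def leader(self, now):
        return min(self.alive(now))

    def is_active(self, now):
        # True when this replica should act as the active service instance.
        return self.leader(now) == self.rid
```

Because failure detection is imperfect (slide 35), two replicas on opposite sides of a network partition can both conclude they are the leader under a scheme like this one, which is the "split brain" situation that slide 39 says must be reconciled once the partition heals.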