
Grid and High-Performance Computing in Israel - An overview



Presentation Transcript


  1. Grid and High-Performance Computing in Israel - An overview. Guy Tel-Zur, NRCN, tel-zur@computer.org. HPC2006, Cetraro, Italy, July 5, 2006

  2. Outline • Academia (IUCC, IAG) • Emphasis on the BGU activity • Industry & Trade (IGT) • Future plans

  3. An overview of Grid and HPC in Israel: IUCC – The Inter-University Computation Center

  4. Network Infrastructure - I

  5. Network Infrastructure - II

  6. Network Infrastructure - III: IUCC/ILAN, IIX and Med-1 total traffic. [Chart: http://noc.ilan.net.il/stats/ILAN-GP0/linktogeant-petach-tikva-gp.html]

  7. The Israel Academic Grid (IAG) • http://iag.iucc.ac.il/ • Funded by the MOST (Ministry of Science and Technology) • Steering & Technical Committees • Coordinates the Israeli activity in EGEE • IUCC is the Certificate Authority (CA) for the IAG

  8. EGEE • 4 GOCs (certified EGEE sites): Technion, TAU, WIS, OpenU • LCG-2, now migrating to gLite • BGU is next • gLite 3.0 is being installed these days (a job-description sketch follows)
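
For context, a job on these LCG/gLite sites is described in JDL (the Job Description Language). A minimal sketch of such a file, assuming a trivial test job; the file and sandbox names are illustrative, not from the talk:

      // hello.jdl -- a minimal test job (names illustrative)
      Executable    = "/bin/hostname";
      StdOutput     = "std.out";
      StdError      = "std.err";
      OutputSandbox = { "std.out", "std.err" };

The file is handed to the gLite workload-management tools for matchmaking to one of the certified sites, and the output sandbox is retrieved when the job completes.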

  9. Grid Computing at the Technion – Israel Institute of Technology. Distributed Systems Laboratory, Prof. Assaf Schuster – Head. Projects: GMS, SuperLink Online, The Dependable Grid, EGEE …and more

  10. GMS – Grid Monitoring System [Noam Palatin, Assaf Schuster, Ran Wolff. Forthcoming SIGKDD 2006, August, Chicago] • Distributively store all logs of a large batch system in local databases • Apply distributed data mining to the logs • Implementation using Condor • Taken up by Intel's NetBatch team, which started a $3M project

  11. SuperLink Online [Danny Geiger, Miron Livny, Assaf Schuster, Mark Zilberstein. American Journal of Human Genetics, May 2006. HPDC, June 2006. Science NetWatch, May 2006.] • http://bioinfo.cs.technion.ac.il/superlink-online/ – a production portal for geneticists working at hospitals • Submitted tasks contain gene-mapping results from lab experiments • The portal user sees a single computer (!) • Implemented using a hierarchy of Condor pools • Highest/smallest pool at the Technion (DSL) • Lowest/largest in Madison (GLOW) • In progress: linkage@home and EGEE BioMed implementations • Many success stories: hunted down genes causing various syndromes: http://bioinfo.cs.technion.ac.il/superlink-online/tips/papers.shtml

  12. The Dependable Grid [Gabi Kliot, Miron Livny, Assaf Schuster, Artyom Sharov, Mark Zilberstein. HPDC hot topics, June 2006.] • Provide a High Availability (HA) library as a service for any Grid component • Decorative approach – no need to change the component (!) • Production-quality implementation following Condor's strict development standards • HA for the Condor matchmaker with zero LOC changes (!!!) (a configuration sketch follows) • Part of the Condor 6.8 distribution • Deployed in many large Condor production pools • Plans to develop and support an open-source distribution
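
As a rough illustration of what matchmaker HA looks like in the Condor 6.8 line: two machines are listed as candidate central managers and the condor_had daemon elects the active one. A configuration sketch only; the host names and port are placeholders, and a real deployment needs further replication settings:

      # condor_config sketch -- highly available central manager
      # (host names and port are placeholders, not from the talk)
      CENTRAL_MANAGER1 = cm1.example.ac.il
      CENTRAL_MANAGER2 = cm2.example.ac.il
      HAD_LIST         = $(CENTRAL_MANAGER1):51450, $(CENTRAL_MANAGER2):51450
      HAD_USE_PRIMARY  = TRUE
      # run the HA daemon alongside the usual central-manager daemons
      DAEMON_LIST      = MASTER, COLLECTOR, NEGOTIATOR, HAD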

  13. An EGEE Certified Node [Dedi Carmeli, Max Kovgan, Assaf Schuster] • A 200-CPU Condor pool exposed to EGEE as a single resource • Resources are non-dedicated • Configuration of local priorities: local jobs preempt EGEE jobs (see the policy sketch below) • Works behind several firewalls, in a dedicated zone (isolation, security, privacy)
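
One way such a local-first policy can be expressed (a sketch, not the Technion's actual configuration) is through the startd's RANK expression, since Condor preempts a running job when a higher-ranked one arrives. Here "egeegrid" is a hypothetical local account to which incoming EGEE jobs are mapped:

      # condor_config sketch for each execute node (account name hypothetical):
      # rank local jobs above grid jobs, so an arriving local job
      # preempts a running EGEE job on that slot
      RANK = (Owner =!= "egeegrid") * 10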

  14. The Hebrew University: MOSIX Prof. Amnon Barak MOSIX is a cluster and organizational-grid management system Targeted for: • x86-based Linux clusters • Organizations that wish to link several such clusters • Service centers – for dynamic partitioning of a cluster among users The grid model: a federation of Linux clusters, servers and workstations whose owners trust each other and wish to cooperate from time to time Main goal: automatic management Geared for: HPC

  15. Main features Process migration and supervising algorithms for: • Automatic resource management and resource discovery • Grid-wide resource sharing • Adaptive workload distribution and load-balancing • Flexible (dynamic) partitioning of nodes among users • Preserving running processes in disruptive configurations • Other services: batch jobs, checkpoint & recovery, live queuing and an on-line monitor of the grid and each cluster Outcome: the grid and each cluster perform like a single computer with multiple processors An organizational grid: due to trust

  16. The core technology • Preemptive process migration, e.g. for load-balancing or to evacuate guest processes from a disconnecting cluster • The user sees a Single-System Image (SSI) • No need to change applications, copy files to remote nodes or link applications with any library (see the sketch below) • Provides a secure run-time environment (sandbox) • Guest processes can't modify resources in hosting nodes
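
In practice this means an unmodified binary is simply launched through the MOSIX run command. A minimal sketch, assuming the mosrun launcher and its best-node flag as documented in the MOSIX manual (verify the flag against your installation):

      # run an unmodified program as a migratable MOSIX process;
      # -b (assumed: "start on the best available node") lets MOSIX
      # place and later migrate it -- no relinking, no copied files
      mosrun -b ./simulate input.dat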

  17. The Hebrew University Organizational Grid • 12 MOSIX clusters, ~300 nodes, 2 campuses • In life sciences, the medical school, chemistry and computer science • Target: 2000 nodes • Applications: nano-technology, molecular dynamics, protein folding, genomics (BLAT, SW), meteorological weather forecasting (MM5), Navier-Stokes equations and turbulence (CFD), CPU simulation of new hardware designs (SimpleScalar) • Some users are interested in "pay-per-use" instead of cluster ownership More information at http://www.MOSIX.org

  18. Tel-Aviv University – School of Computer Science Grid projects & local clusters • Condor pool (average 150 nodes, peak 300 nodes). Opportunistic computer cluster. Used for bioinformatics, network simulations and classical HPC (MPI, neural networks, MC, fluid dynamics, etc.) • PlanetLab – mainly infrastructure research led by Princeton. Very small compute power (4-20 nodes at each site) • EGEE-II – 20 nodes (soon ~100), used for physics, bioinformatics and general research

  19. Ben-Gurion University of the Negev • Inter-campus Condor pool • Grid Computing

  20. The BGU Condor Pool • Started in 2000 • Today: 150+ nodes • Linux & Windows (2000, XP) • Campus-wide project • Non-dedicated resources (Next slide)

  21. Campus-Wide Condor Pool • ECE Dept. • IE&M Dept. • Nucl. Eng. Dept. • Public labs • Soon to be connected: CS Dept. and Physics Dept.

  22. Currently there are 4 science projects

  23. Condor at the BGU • Nucl. Eng. Dept., Itzhak Orion: MCNP simulations • CS Dept., Chen Keasar: protein structure prediction • Physics Dept., Yigal Meir: solid state • IE&M, O. Levi, J. Miao & G. Zaslavsky: 3D image reconstruction

  24. I. Orion: MCNP and Condor • 48 hours for a single job of 2×10⁹ histories on a single CPU • Imaging one layer at the desired resolution requires 50 jobs → 100 days on a single CPU!!! • OS: Windows • Status: initial tests completed (a submit-file sketch follows)
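
A Condor submit description turns this layer scan into a single submission; a minimal sketch, assuming one MCNP input deck per layer (all file names are hypothetical, not from the talk):

      # mcnp.sub -- queue the 50 layer jobs at once (names hypothetical)
      universe                = vanilla
      executable              = mcnp.exe
      arguments               = i=layer$(Process).inp
      transfer_input_files    = layer$(Process).inp
      should_transfer_files   = YES
      when_to_transfer_output = ON_EXIT
      output                  = layer$(Process).out
      error                   = layer$(Process).err
      log                     = mcnp.log
      queue 50

$(Process) runs from 0 to 49, so the 100-day serial workload fans out across the pool in one submission.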

  25. C. Keasar: Protein Folding MESHI is a software package for protein modeling. It is written in Java. Ref: Kalisman N., Levi A., Maximova T., Reshef D., Zafriri-Lynn S., Gleyzer Y. and Keasar C. (2005) MESHI: a new library of Java classes for molecular modeling. Bioinformatics 21:3931-3932

  26. Y. Meir: The Kondo effect • “Investigate the dependence of the conductance and of the current through the quantum dot on temperature and, in particular, on its relation to the Kondo temperature” • “…The plan is to run the program for many sets of parameters of the quantum dot, giving rise to different Kondo temperatures and for different temperatures. This will allow us to determine whether the physical properties of the system depend only on the ratio of these temperatures, and how.”

  27. 3D image reconstruction from equally sloped projections by parallel computing Ofer Levi (1), John Miao (2) and Genady Zaslavsky (1); (1) Department of Industrial Engineering and Management, Ben-Gurion University, Israel; (2) Department of Physics and Astronomy and California NanoSystems Institute, University of California, USA • Equally sloped tomography of a 3D object from a number of 2D projections is an efficient analysis method for accurate determination of intramolecular structures (Miao, Foerster and Levi 2005) • However, analysis of full-size data by this method is highly time-consuming and cannot be done on a single workstation

  28. Parallel Implementation • A new parallel-friendly method was developed and implemented • The parallel computations were managed by the Condor environment • The use of Condor and parallel computing enabled successful reconstruction of complex real data in a reasonable time, a task that was impossible before

  29. Time Analysis • A typical computation for a medium-size 3D image (256³ voxels) takes approximately one month on a single 3 GHz machine • With Condor's help we reduced the total runtime to 4 days, with an average of 30 machines in use during the computation • The goal is to process as many 3D images as possible on a regular basis

  30. Submitting Jobs
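
The slide showed a submission screenshot; the command-line equivalent is short. A sketch, where recon.sub stands for a submit description along the lines of the MCNP example above:

      $ condor_submit recon.sub   # hand the job cluster to the local scheduler
      $ condor_q                  # watch idle/running jobs in the queue
      $ condor_status             # see which pool machines are claimed or idle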

  31. Grid Computing at the BGU / Grid Computing at NRCN • Currently: 6 nodes (12 processors) • gLite 3.0 is being installed these days • We expect to get more new computers from a tender to be held within the next few days • A small Condor pool • Plan to operate a small Grid site (~40 processors) – IAG, IGT, EGEE-II…

  32. Ganglia monitoring: http://grid4.bgu.ac.il/ganglia

  33. The Israeli Association of Grid Technologies (IGT) • A non-profit association • Supported in part by the Ministry of Industry, Trade & Labor (MIT&L)

  34. IGT Achievements • Founded May 1st, 2004 with 5 members • Today: 30 members • 16 conferences and more than 20 overseas speakers • Grid in the Financial Sector event: more than 130 attendees • Annual conference/expo with 200 attendees • Work groups: Grid-SOA, Grid-HPC, Virtualization, RDMA • Enterprise workshops • Knowledge Center • Virtual community web site • IGT Grid Lab • International corporations: Europe, USA • Grid Award Contest • www.Grid.org.il

  35. Members

  36. Israeli Grid R&D • Fast Networks • Mellanox - Infiniband • Voltaire - Infiniband • Software Infrastructure • GigaSpaces – Grid application server • Xeround - Networked Database • Exanet – Distributed file system virtualization • BMC – Data Center Virtualization management • IBM Haifa – File/Storage Virtualization • Grid based solutions • Elbit – Management & Control systems • Rafael – Management & Control systems • Xoreax – Grid based software build • Sungard – Financial Broker • Other • Shunra - Grid WAN simulations • Symantec (Precise) – Performance management • IAI/Elta – Internal Grid systems for Engineering simulations • Intel Israel - NetBatch

  37. IGT Grid Lab • Virtual Grid Lab – secured resource sharing over an Internet-based VPN • [Diagram: CPU pools at IGT member sites connected through the VPN to the Grid Lab management]

  38. HPC in the Industry (June 2006)

  39. (My) Near-Future Plans at BGU • Campus-wide Condor pool → more scientific projects • A web portal for submission of Condor jobs • An operational EGEE-II site • A joint Singapore-Israel Grid project (flocking between our Condor pools; see the configuration sketch below) • We are thinking about opening a new and unique course of study in "Grid Computing"!
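
Flocking is plain Condor configuration: the submitting side lists the remote pool's central manager, and the remote side whitelists the submitter. A sketch with placeholder host names (the partner pool's address is not given in the talk):

      # on a BGU submit machine: overflow to the remote pool when local
      # machines are busy
      FLOCK_TO   = cm.sg-partner.example.org
      # on the remote pool's central manager: accept jobs flocked from BGU
      FLOCK_FROM = submit.bgu.ac.il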

  40. IsraGrid • Similar to other national grid initiatives • We want to build a 1 Gb/sec network for the IGT member organizations • RFI • Pending approval as a National Project

  41. Next Events in Israel • http://gccb2006.ulster.ac.uk/-Introduction-.html • The 2nd IGT annual event, December 2006

  42. Summary • From the infrastructure point of view we are only at the beginning of the road • We have O(zero) government funding ☹ • But there is a lot of technological capability • Many new projects in industry and in academia

  43. Thank You! References: • Condor at the BGU: http://www.ee.bgu.ac.il/~tel-zur/condor/ • "An Introduction to Parallel Processing" course at the BGU: http://www.ee.bgu.ac.il/~tel-zur/2006A/Welcome.html • Grid Computing at the BGU: http://www.ee.bgu.ac.il/~tel-zur/grid.html • IGT: http://www.grid.org.il

  44. The Technion

  45. GMS – Grid Monitoring System “Detection and prediction of critical situations and errors or faulty results” • Collects, organizes and stores system status data • Translates data to semantically meaningful terms • Analyzes the resulting distributed dataset using suitable algorithms

  46. GMS – Grid Monitoring System • Tested on a 100-CPU Condor pool • Employed a novel distributed outlier-detection algorithm • Piggybacks on Condor job-execution mechanisms • Detected three misconfigured machines • System and results reported in: "Mining for Misconfigured Machines in Grid Systems", accepted to ACM-SIGKDD'06 • System is being packaged, to be made available open-source • System to be deployed on GLOW, Madison (x,000 machines)

  47. SuperLink Online [Danny Geiger, Miron Livny, Assaf Schuster, Mark Zilberstein. American Journal of Human Genetics, May 2006. HPDC, June 2006. Science NetWatch, May 2006.] • http://bioinfo.cs.technion.ac.il/superlink-online/ – a production portal for geneticists working at hospitals • Submitted tasks contain gene-mapping results from lab experiments • The system performs linkage analysis using powerful Bayesian-network manipulation • Automatic and adaptive parallelization – the execution hierarchy • Turnaround times for a mixed workload (exponential distribution) • short jobs – seconds on a single machine • large jobs – days on thousands of machines • The portal user sees a single computer (!) • Implemented using a hierarchy of Condor pools • Highest/smallest pool at the Technion (DSL) • Lowest/largest in Madison (GLOW) • In progress: linkage@home and EGEE BioMed implementations • Many success stories: hunted down genes causing various syndromes: http://bioinfo.cs.technion.ac.il/superlink-online/tips/papers.shtml

  48. The Dependable Grid [Gabi Kliot, Miron Livny, Assaf Schuster, Artyom Sharov, Mark Zilberstein. HPDC hot topics, June 2006.] • Provide a High Availability (HA) library as a service for any Grid component • Decorative approach – no need to change the component (!) • Decouple election and replication of state • A general approach for arbitrary consistency guarantees of state replication (in progress) • Production-quality implementation following Condor's strict development standards • HA for the Condor matchmaker with zero LOC changes (!!!) • Part of the Condor 6.8 distribution • Deployed in many large Condor production pools • Plans to develop and support an open-source distribution
