Scalable Scientific Computing at Compaq

Presentation Transcript


  1. Scalable Scientific Computing at Compaq CAS 2001 Annecy, France October 29 – November 1, 2001 Dr. Martin Walker Compaq Computer EMEA martin.walker@compaq.com

  2. Agenda of the entertainment • From EV4 to EV7: four implementations of the Alpha microprocessor over ten years • Performance on a few applications, including numerical weather forecasting • The Terascale Computing System at the Pittsburgh Supercomputing Center • Marvel: the next (and last) AlphaServer • Grid Computing

  3. Scientific basis for vector processor choice for Earth Simulator project • Comparison of Cray T3D and Cray Y-MP/C90: J.J. Hack et al., “Computational design of the NCAR community climate model”, Parallel Computing 21 (1995) 1545–1569 • Fraction of peak performance achieved • 1-7% on Cray T3D • 30% on Cray Y-MP/C90 • Cray T3D used the Alpha EV4 processor from 1992

  4. Key ratios that determine sustained application performance (U.S. DoD/DoE)

  5. Alpha EV6 Architecture (pipeline diagram) • Pipeline stages 0–6: Fetch, Map, Queue, Reg, Exec, Dcache • 4 instructions fetched per cycle; 80 in-flight instructions plus 32 loads and 32 stores • L1 instruction cache 64KB 2-set with next-line address prediction; L1 data cache 64KB 2-set; victim buffer; miss address • Integer side: register map, 20-entry issue queue, register files (80), address and execute units • Floating-point side: register map, 15-entry issue queue, register file (72), FP ADD with div/sqrt, FP MUL • Branch predictors

  6. Weather Forecasting Benchmark • LM = local model, German Weather Service (DWD); current version is RAPS 2.0 • Grid size is 325 × 325 × 35; predefined INPUT set dwd used for all benchmarks • First forecast hour timed (contains more I/O than subsequent forecast hours) • Machines • Cray T3E/1200 (EV5/600 MHz) in Jülich, Germany • AlphaServer SC40 (EV67/667 MHz) in Marlboro, MA • Study performed by Pallas GmbH (www.pallas.com)

  7. Total time (AS SC40 vs. Cray T3E)

  8. Performance comparisons • Alpha EV67/667 MHz in AS SC40 delivers about 3 times the performance of EV5/600 MHz in Cray T3E to the LM application • EV5 is running at about 6.7% of peak • EV67 is running at about 18.5% of peak
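
A back-of-the-envelope check of these percentages, assuming (this is not stated on the slide) that both the EV5 and the EV67 can complete two floating-point results per cycle, reproduces the roughly 3x ratio:

```c
/* Back-of-the-envelope check of the "% of peak" figures quoted above.
 * Assumption: both EV5 and EV67 complete 2 flops per cycle
 * (one FP add + one FP multiply), so peak = 2 * clock rate.          */
#include <stdio.h>

int main(void)
{
    double peak_ev5  = 2.0 * 600e6;   /* EV5  @ 600 MHz -> 1.200 GFLOPS peak */
    double peak_ev67 = 2.0 * 667e6;   /* EV67 @ 667 MHz -> 1.334 GFLOPS peak */

    double sustained_ev5  = 0.067 * peak_ev5;   /* ~6.7%  of peak */
    double sustained_ev67 = 0.185 * peak_ev67;  /* ~18.5% of peak */

    printf("EV5  sustained: %.0f MFLOPS\n", sustained_ev5  / 1e6);    /* ~80  */
    printf("EV67 sustained: %.0f MFLOPS\n", sustained_ev67 / 1e6);    /* ~247 */
    printf("ratio EV67/EV5: %.1f\n", sustained_ev67 / sustained_ev5); /* ~3.1 */
    return 0;
}
```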

  9. Compilation Times • Cray T3E: flags -O3 -O aggress,unroll2,split1,pipeline2; compilation time 41 min 37 sec • Compaq EV6/500 MHz (EV67 is faster): flags -fast -O4; compilation time 5 min 15 sec • IBM SP3: flags -O4 -qmaxmem=-1; compilation time 40 min 19 sec (note: numeric_utilities.f90 had to be compiled with -O3 in order to avoid crashes)

  10. SWEEP3D • 3D discrete ordinates (Sn) neutron transport • Implicit wavefront algorithm • Convergence to stable solution • Target System - multitasked PVP / MPP • Vector style code • High ratio of (load,stores) to flops • memory bandwidth and latency sensitive • performance is sensitive to grid size
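
A minimal serial sketch of the wavefront recurrence this slide describes is given below; the array names, the diamond-difference-style update, and the single octant are illustrative assumptions, not the actual SWEEP3D source:

```c
/* Minimal serial sketch of the wavefront recurrence at the heart of an
 * Sn sweep: each cell (i,j,k) depends on its upstream neighbours in the
 * current sweep direction, so cells on a diagonal plane i+j+k = const
 * are independent and form the "wavefront".  Names and the single
 * octant shown here are illustrative, not the actual SWEEP3D code.    */
#define NX 32
#define NY 32
#define NZ 32

static double flux[NX][NY][NZ];   /* angular flux for one ordinate      */
static double src [NX][NY][NZ];   /* scattering + external source term  */
static double sigt[NX][NY][NZ];   /* total cross section per cell       */

void sweep_one_octant(double mu, double eta, double xi)
{
    /* Sweep in the (+x,+y,+z) octant; other octants reverse loop order. */
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NY; j++)
            for (int k = 0; k < NZ; k++) {
                double in_x = (i > 0) ? flux[i-1][j][k] : 0.0;  /* upstream faces */
                double in_y = (j > 0) ? flux[i][j-1][k] : 0.0;
                double in_z = (k > 0) ? flux[i][j][k-1] : 0.0;

                /* Many loads and stores, few flops per cell: this is
                 * why the kernel is memory bandwidth/latency bound.   */
                flux[i][j][k] = (src[i][j][k]
                                 + mu * in_x + eta * in_y + xi * in_z)
                              / (sigt[i][j][k] + mu + eta + xi);
            }
}
```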

  11. SWEEP3D “as is” Performance

  12. Optimizations to SWEEP3D • Fuse inner loops • demote temporary vectors to scalars • reduce load/store count • Separate loops with explicit values for “i2” = -1,1 • allows prefetch code to be generated • Fixup code moved “outside” loop • loop unrolling, pipelining
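
A schematic before/after of the first two bullets (loop fusion and demoting a temporary vector to a scalar); the loop bodies are placeholders, not the real SWEEP3D kernels:

```c
/* Schematic illustration of "fuse inner loops" and "demote temporary
 * vectors to scalars".  Placeholder loop bodies, not SWEEP3D itself.  */

/* Before: two loops communicating through a temporary vector tmp[].
 * (Assumes n <= 1024 for this sketch.)                                */
void before(int n, const double *a, const double *b, double *out)
{
    double tmp[1024];
    for (int i = 0; i < n; i++)
        tmp[i] = a[i] * b[i];               /* loop 1: stores to tmp   */
    for (int i = 0; i < n; i++)
        out[i] = tmp[i] + a[i];             /* loop 2: reloads tmp     */
}

/* After: one fused loop; tmp becomes a scalar held in a register,
 * removing one store and one load per iteration.                     */
void after(int n, const double *a, const double *b, double *out)
{
    for (int i = 0; i < n; i++) {
        double t = a[i] * b[i];             /* demoted to a scalar     */
        out[i] = t + a[i];
    }
}
```

The fused form issues fewer memory operations per iteration, which is the quantity the instruction counts on the next slide measure.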

  13. Instruction counts/iteration (+ measured cycles on EV6)

  14. Optimized SWEEP3D Performance

  15. AlphaServer ES45 (EV68/1.001 GHz) system block diagram • 4 Alpha 21264 CPUs, each with its own L2 cache • Crossbar switch (Typhoon chipset), quad controller, 8 L2 cache data slices • CPU paths each 128b at 8.0 GB/s; crossbar paths 256b at 4.2 GB/s • SDRAM memory, 133 MHz, 128 MB to 32 GB, banks 0–3, each at 64b (4.2 GB/s) • I/O: 4x AGP plus PCI buses (32b @ 133 MHz, 512 MB/s; 64b @ 66 MHz, 512 MB/s; 64b @ 66 MHz, 256 MB/s; 64b, 266 MB/s)

  16. Pittsburgh Supercomputing Center (PSC) • Cooperative effort of • Carnegie Mellon University • University of Pittsburgh • Westinghouse Electric • Offices in Mellon Institute • On CMU campus • Adjacent to UofP campus

  17. Westinghouse Electric • Energy Center, Monroeville, PA • Major computing systems • High-speed network connections

  18. Terascale Computing System at Pittsburgh Supercomputing Center • Sponsored by the U.S. National Science Foundation • Integrated into the PACI program (Partnerships for Advanced Computational Infrastructure) • Serving the “very high end” for academic computational science and engineering • The largest open facility in the world • PSC in collaboration with Compaq and with • Application scientists and engineers • Applied mathematicians • Computer scientists • Facilities staff • Compaq AlphaServer SC technology

  19. System Block Diagram (control nodes, disks, servers, switch, compute nodes) • 3040 CPUs • Tru64 UNIX • 3 TB memory • 41 TB disk • 152 CPU cabinets • 20 switch cabinets

  20. ES45 nodes • 5 per cabinet • 3 local disks

  21. Row upon row…

  22. Quadrics Switches • Rail 1 and Rail 0

  23. Middle Aisle, Switches in Center

  24. QSW switch chassis • Fully wired switch chassis • 1 of 42

  25. Control nodes and concentrators

  26. The Front Row

  27. Installation: from 0 to 3.465 TFLOPS in 29 days (Latest: 4.059 TFLOPS on 3024 CPUs) • Deliveries & continual integration: • 44 nodes arrived at PSC on Saturday, 9-1-2001 • 50 nodes arrived on Friday, 9-7-2001 • 30 nodes arrived on Saturday, 9-8-2001 • 50 nodes arrived on Monday, 9-10-2001 • 180 nodes arrived on Wednesday, 9-12-2001 • 130 nodes arrived on Sunday, 9-16-2001 • 180 nodes arrived on Thursday, 9-20-2001 • To have shipped 12 September! • Federated switch cabled/operational by 9-23-01 • 760 nodes clustered by 9-24-01 • 3.465 TFLOPS Linpack by 9-29-01 • 4.059 TFLOPS in Dongarra’s list dated Mon Oct 22 (67% of peak performance)
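
A quick sanity check of the 67%-of-peak figure, assuming each EV68 at 1.001 GHz completes two floating-point results per cycle:

```c
/* Sanity check of the "67% of peak" Linpack figure, assuming each
 * EV68 at 1.001 GHz retires 2 flops per cycle.                       */
#include <stdio.h>

int main(void)
{
    double clock_hz        = 1.001e9;
    double flops_per_cycle = 2.0;
    int    cpus            = 3024;

    double peak = cpus * flops_per_cycle * clock_hz;       /* ~6.05 TFLOPS */
    double rmax = 4.059e12;                                /* measured     */

    printf("peak      : %.3f TFLOPS\n", peak / 1e12);
    printf("efficiency: %.1f %%\n", 100.0 * rmax / peak);  /* ~67%         */
    return 0;
}
```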

  28. http://www.mmm.ucar.edu/mm5/mpp/helpdesk/20011023.html MM5

  29. http://www.mmm.ucar.edu/mm5/mpp/helpdesk/20011023.html

  30. Alpha Microprocessor Summary • EV6: 0.35 µm, 600 MHz, 4-wide superscalar, out-of-order execution, high memory BW • EV67: 0.25 µm, up to 750 MHz • EV68: 0.18 µm, 1000 MHz • EV7: 0.18 µm, 1250 MHz, L2 cache on-chip, memory control on-chip, I/O control on-chip, cache-coherent inter-processor communication on-chip • EV79: 0.13 µm, ~1600 MHz

  31. EV7 – The System is the Silicon…. SMP CPU interconnect used to be external logic… Now it’s on the chip • EV68 core with enhancements • Integrated L2 cache • 1.75 MB (ECC) • 20 GB/s cache bandwidth • Integrated memory controllers • Direct RAMbus (ECC) • 12.8 GB/s memory bandwidth • Optional RAID in memory • Integrated network interface • Direct processor-processor interconnects • 4 links - 25.6 GB/s aggregate bandwidth • ECC (single error correct, double error detect) • 3.2 GB/s I/O interface per processor

  32. Alpha EV7

  33. EV7 – The System is the Silicon…. The electronics for cache-coherent communication is placed within the EV7 chip

  34. Alpha EV7 Core (pipeline diagram) • Pipeline stages 0–6: Fetch, Map, Queue, Reg, Exec, Dcache • 4 instructions fetched per cycle; 80 in-flight instructions plus 32 loads and 32 stores • L1 instruction cache 64KB 2-set with next-line address prediction; L1 data cache 64KB 2-set; L2 cache 1.75MB 7-set; victim buffer; miss address • Integer side: register map, 20-entry issue queue, register files (80), address and execute units • Floating-point side: register map, 15-entry issue queue, register file (72), FP ADD with div/sqrt, FP MUL • Branch predictors

  35. Virtual Page Size • Current virtual page size • 8K • 64K • 512K • 4M • New virtual page size (boot time selection) • 64K • 2M • 64M • 512M
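
The motivation for the larger pages is TLB reach. A small illustration follows; the 128-entry data TLB assumed here is for the sake of the arithmetic only, not a figure from the slide:

```c
/* Illustration of why larger virtual pages help: the amount of memory
 * the TLB can map without misses grows linearly with the page size.
 * The 128-entry data TLB assumed here is illustrative only.           */
#include <stdio.h>

int main(void)
{
    long tlb_entries  = 128;
    long page_sizes[] = { 8L<<10, 64L<<10, 2L<<20, 512L<<20 }; /* 8K..512M */

    for (int i = 0; i < 4; i++) {
        double coverage = (double)tlb_entries * page_sizes[i];
        printf("page %8ld KB -> TLB reach %10.0f MB\n",
               page_sizes[i] >> 10, coverage / (1 << 20));
    }
    return 0;
}
```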

  36. Performance • SPEC95 • SPECint95 75 • SPECfp95 160 • SPEC2000 • CINT2000 800 • CFP2000 1200 • 59% higher than EV68/1GHz
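
Assuming the 59% figure refers to CFP2000, the implied EV68/1 GHz baseline can be recovered from the two numbers on this slide; the roughly 755 result below is derived, not quoted from the source:

```c
/* Working backwards from this slide: if CFP2000 = 1200 is 59% higher
 * than EV68/1 GHz, the implied EV68 score is 1200 / 1.59.             */
#include <stdio.h>

int main(void)
{
    double ev7_cfp2000 = 1200.0;
    double improvement = 0.59;                 /* "59% higher"          */
    double implied_ev68 = ev7_cfp2000 / (1.0 + improvement);
    printf("implied EV68/1GHz CFP2000: %.0f\n", implied_ev68);  /* ~755 */
    return 0;
}
```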

  37. Building Block Approach to System Design • Key Components: • EV7 Processor • IO7 I/O Interface • Dual Processor Module • Systems Grow by Adding: • Processors • Memory • I/O

  38. Two complementary views of the Grid • The hierarchy of understanding (Tony Hey, Director, UK eScience Core Program): data are uninterpreted signals; information is data equipped with meaning; knowledge is information applied in practice to accomplish a task; the Internet is about information, the Grid is about knowledge • Main technologies developed by man (Rick Stevens, ANL): writing captures knowledge; mathematics enables rigorous understanding and prediction; computing enables prediction of complex phenomena; the Grid enables intentional design of complex systems

  39. What is the Grid? “A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computing capabilities.” • Ian Foster and Carl Kesselman, editors, “The GRID: Blueprint for a New Computing Infrastructure” (Morgan-Kaufmann Publishers, SF, 1999) 677 pp. ISBN 1-55860-8 • The Grid is an infrastructure to enable virtual communities to share distributed resources to pursue common goals • The Grid infrastructure consists of protocols, application programming interfaces, and software development kits to provide authentication, authorization, and resource location and access • Foster, Kesselman, Tuecke: “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”, http://www.globus.org/research/papers.html

  40. Compaq and The Grid • Sponsor of the Global Grid Forum (www.globalgridforum.org) • Founding member of the New Productivity Initiative for Distributed Resource Management (www.newproductivity.org) • Industrial member of the GridLab consortium (www.gridlab.org) • 20 leading European and US institutions • Infrastructure, applications, testbed • Cactus “worm” demo at SC2001 (www.cactuscode.org) • Intra-Grid within Compaq firewall • Nodes in Annecy, Galway, Nashua, Marlboro, Tokyo • Globus, Cactus, GridLab infrastructure and applications • iPAQ Pocket PC (www.ipaqlinux.com)

  41. Potential dangers for the Grid • Solution in search of a problem • Shell game for cheap (free) computing • Plethora of unsupported, incompatible, non-standard tools and interfaces

  42. “Big Science” • As with the Internet, scientific computing will be the first to benefit from the Grid. Examples: • GriPhyN (US Grid Physics Network for Data-intensive Science) • Elementary particle physics, gravitational wave astronomy, optical astronomy (digital sky survey) • www.griphyn.org • DataGrid (led by CERN) • Analysis of data from scientific exploration • www.eu-datagrid.org • There are also compute-intensive applications that can benefit from the Grid

  43. Final Thoughts: all this will not be easy • How good have we been as a community at making parallel computing easy and transparent? • There are still some things we can’t do • predict the El Niño phenomenon correctly • plate tectonics and Earth mantle convection • failure mechanisms in new materials • Validation and verification of numerical simulation are crying needs

  44. Thank You! Please visit our HPTC Web Site http://www.compaq.com/hpc

  45. Stability & Continuity for AlphaServer customers Commitment to continue implementing the Alpha Roadmap according to the current plan-of-record • EV68, EV7 & EV79 • Marvel systems • Tru64 UNIX support • AlphaServer systems, running Tru64 UNIX, will be sold as long as customers demand, at least several years after EV79 systems arrive in 2004, with support continuing for a minimum of 5 years beyond that

  46. Microprocessor and System Roadmaps, 2001–2005 (roadmap chart) • Alpha processors: EV68 → EV7 → EV79 • AlphaServers: next-generation EV68 product family (DS 1–2P, ES 1–4P, GS 1–32P), followed by the EV7 family and then EV79 systems, each offered as 2–8P (2P building block) and 8–64P (8P building block) configurations • Itanium Processor Family: Itanium → McKinley → Madison; next-generation server family spanning 1–4P (Itanium, McKinley) up to 8–64P, blades, 2P, 4P, 8P (Madison) • ProLiant servers: 1–8P
