
Research (and Fun) in High Performance Computing



  1. Research (and Fun) in High Performance Computing Celso L. Mendes Department of Computer Science University of Illinois at Urbana-Champaign http://charm.cs.uiuc.edu/people/cmendes August 26, 2004

  2. Topics
  • Early adventures
  • PhD thesis research @ UIUC
  • Research activities in Brazil
  • UIUC research, Pablo group
    • Performance tools for parallel systems
    • Grids & performance contracts
    • Monitoring large systems
  Celso L. Mendes – CS / UIUC

  3. Early Adventures
  • First contact with parallel computing
    • Study of previous/existing SIMD systems: Illiac-IV, NASA's MPP, Connection Machine
  • System design based on the GAPP chip
    • Bit-serial processors, multiple PEs per chip
    • Single-Instruction/Multiple-Data architecture
    • Applications in image processing
    • Paper at the 1st Brazilian HPC symposium
  • Master's thesis at ITA/Brazil, 1988
    • System simulation, application assessment

  4. Early Adventures (cont.)
  • Second contact with parallel computing
    • CS-433, Spring 1991
    • Machine problems: NCSA's Connection Machine 2
    • Class project: N-body problem using… Chare Kernel!
      • Naive O(N²) implementation on Intel iPSC/860
  • More contacts with parallel computing
    • Initial research at the Pablo group
    • Trace stability assessment
    • Cross-machine performance prediction, via trace transformation

  5. Topics
  • Early adventures
  • PhD thesis research @ UIUC
  • Research activities in Brazil
  • UIUC research, Pablo group
    • Performance tools for parallel systems
    • Grids & performance contracts
    • Monitoring large systems

  6. PhD Thesis Research
  • Performance Scalability Prediction on Multicomputers
  • Major goals
    • Predict scalability of data-parallel codes, as a function of N, P
    • Track movement of bottlenecks in the code
  • Approach
    • Build, at compile time, a symbolic model of execution time as a function of (N, P)
    • Evaluate the expression for specific (Ni, Pi)
    • Look for trends, rather than precise values

  7. PhD Thesis Research (cont.)
  • Simple data-parallel code (in HPF):
        real a(N)
  !HPF$ PROCESSORS proc(P)
  !HPF$ TEMPLATE t(N)
  !HPF$ ALIGN a(i) WITH t(i)
  !HPF$ DISTRIBUTE t(BLOCK) ONTO proc
        DO i = 1, N
          a(i) = a(i) + 1.0
        ENDDO
  • Equivalent SPMD code created by the HPF compiler:
        real a(N/P)
        DO i = 1, N/P
          a(i) = a(i) + 1.0
        ENDDO
  • Example data distribution for 'a', assuming N=16, P=4: elements 1…16 split into contiguous blocks of four across P1, P2, P3, P4
  • Symbolic model of execution time: T = Kop · N/P, where Kop is a machine-dependent constant

  8. PhD Thesis Research (cont.)
  • Another example — modified data-parallel code (in HPF):
        real a(N)
  !HPF$ PROCESSORS proc(P)
  !HPF$ TEMPLATE t(N)
  !HPF$ ALIGN a(i) WITH t(i)
  !HPF$ DISTRIBUTE t(BLOCK) ONTO proc
        DO i = 1, N/2
          a(i) = a(N/2 + i) + 1.0
        ENDDO
  • Equivalent SPMD code created by the HPF compiler:
        real a(2*N/P)
        if (MyNodeID > P/2) then
          send(a(1), N/P items)
        endif
        if (MyNodeID < P/2 + 1) then
          recv(a(N/P + 1), N/P items)
          DO i = 1, N/P
            a(i) = a(N/P + i) + 1.0
          ENDDO
        endif
  • Symbolic model of execution time:
      T ≈ Tsend(N/P) + Trecv(N/P) + Kop · N/P
        = KS1 + KS2 · N/P + KR1 + KR2 · N/P + Kop · N/P
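The symbolic models for the two loops above can be evaluated numerically to look for trends, as the thesis approach prescribes. A minimal sketch, assuming purely illustrative machine constants (the K values below are placeholders, not measured ones):

```python
# Illustrative machine constants -- placeholders, not measured values.
K_OP = 1e-8               # cost per array-element update
K_S1, K_S2 = 1e-4, 1e-9   # send: startup cost + per-item cost
K_R1, K_R2 = 1e-4, 1e-9   # receive: startup cost + per-item cost

def t_compute(n, p):
    """Slide 7 model: T = Kop * N/P (pure computation)."""
    return K_OP * n / p

def t_shifted(n, p):
    """Slide 8 model: T = KS1 + KS2*N/P + KR1 + KR2*N/P + Kop*N/P."""
    return K_S1 + K_S2 * n / p + K_R1 + K_R2 * n / p + K_OP * n / p

# Evaluate for specific (Ni, Pi) and look for trends: as P grows, the
# fixed startup terms KS1 + KR1 come to dominate t_shifted.
for p in (1, 4, 16, 64):
    print(p, t_compute(1 << 20, p), t_shifted(1 << 20, p))
```

The crossover where the constant communication-startup terms overtake the shrinking N/P terms is exactly the kind of bottleneck movement the thesis tracks.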

  9. PhD Thesis Research (cont.)
  • Implementation:
    • Extension of Rice's Fortran D95 compiler

  10. PhD Thesis Research (cont.)
  • Loop example 1:
        DO i = 2, N
          a(i) = a(1) + b(i)
        ENDDO
    ⇒ Broadcast a(1), then compute
  • Expression evaluation:
    • Lower/upper bounds for constants, communication cost
  • Tests on Intel Paragon (prediction results shown in the slide)

  11. PhD Thesis Research (cont.)
  • Loop example 2:
        DO i = 2, N
          a(i) = a(i-1) + s + b(i)
        ENDDO
    ⇒ Dependence across iterations!
  (Prediction results shown in the slide)

  12. PhD Thesis Research (cont.)
  • Thesis experiments and results
    • Loop and full-application predictions
    • Cross-machine tests: Paragon vs. IBM SP2
    • Bottleneck movement tracking
  • Limitations
    • Assumes "one" problem size N
    • Symbolic manipulation can be slow
  • Publication
    • Paper at PACT'98, in Paris

  13. Topics
  • Early adventures
  • PhD thesis research @ UIUC
  • Research activities in Brazil
  • UIUC research, Pablo group
    • Performance tools for parallel systems
    • Grids & performance contracts
    • Monitoring large systems

  14. Research Activities in Brazil
  • Parallelization of physics and image-processing codes, via MPI
  • Creation of tools for performance assessment of MPI-based codes
  • Paper at SBAC'99: RAMS code analysis
    • Regional-level weather forecasting
    • Assessment of communication efficiency
    • Assessment of load imbalance
    • Quantitative guidance for optimizations
  • Teaching (~ CS-333, ~ CS-320/433)

  15. Topics
  • Early adventures
  • PhD thesis research @ UIUC
  • Research activities in Brazil
  • UIUC research, Pablo group
    • Performance tools for parallel systems
    • Grids & performance contracts
    • Monitoring large systems

  16. UIUC: Pablo Group
  • Mix of research and development of performance tools
  • Ongoing activities (until recently)
    • SvPablo infrastructure development
    • Application assessments via SvPablo
    • Grid monitoring and Autopilot utilization
    • Effective monitoring

  17. SvPablo Performance Browser
  • Graphical performance analysis environment
    • Source code instrumentation
    • Performance data capture, browsing & analysis
    • F77/F90 and C language support
  • Performance capture features
    • Software/hardware performance data
    • Loop and procedure counts/durations
    • Hardware performance counter data, via PAPI
    • Statistical summaries for long-running codes (no traces!)
  • Supported platforms
    • Sun Solaris, IBM SP, SGI Origin, Compaq Alpha, NEC SX-6
    • Linux (Intel IA-32 and IA-64, PlayStation 2)

  18. SvPablo Components
  (Component diagram: the GUI, with target-configuration files, drives source-code instrumentation; the compiler turns the instrumented source code into instrumented object code, which the linker combines with the SvPablo data-capture library, PAPI library, and Autopilot library into an instrumented executable; execution on the parallel architecture produces per-process performance files, merged by SvPabloCombine into a performance file for performance-data visualization, the Virtue time-tunnel display, and the AP sensor data collector.)

  19. SvPablo GUI
  (Screenshot: target configurations, files selected for instrumentation, instrumentable lines, instrumented lines)

  20. SvPablo Performance Browsing

  21. Detailed Line Performance Data
  (Screenshot: source code fragment, performance data for the fragment, per-task data)

  22. SvPablo - Personal Contributions
  • Capture of MPI communication performance data
  • Redesign of instrumentation-library initialization and finalization
  • Extension to additional platforms
  • Application assessments, user support

  23. Grid Monitoring
  • Scope: GrADS project (NSF ITR)
    • Rice, UH, UTK, UCSD, USC, U.Chicago, IU
  • Goal: make it "easy" to run apps on the Grid
  • UIUC part: performance monitoring
  • Approach
    • Leverage the Autopilot infrastructure
    • Create performance contracts
    • Develop and deploy a contract monitor
    • Support user interaction and dynamic steering

  24. Autopilot Framework
  • Developed by the Pablo group in the late '90s
  • Built on top of the Globus toolkit
  (Diagram: sensors on the system and on instrumented Grid application(s) feed inputs to a fuzzy-logic decision process — fuzzifier, fuzzy-logic rule base drawing on a knowledge repository, defuzzifier — whose outputs drive the actuators.)

  25. Contract Structure
  • "Boilerplate" for specifying expected behavior
    • Enumerates what all parties are committed to provide
  • Contract body
    • resource-list → resources
    • metric-list → capabilities, measurable performance
  • Resource-list
    • key:value pairs defining resources
    • e.g. {Domain:cs.uiuc.edu; Machine:opus; Speed:450}
  • Metric-list
    • Metric names (capabilities)
    • Acceptable range of values for each metric
    • Provided by "some" performance model

  26. Contract Details
  • Metric-range components {Mi: C, L, U}
    • Nominal value C
    • Acceptable range [L, U] for "no contract violation"
    • Outside [L, U]: "partial contract violation"
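The contract body and metric-range components described above could be represented as plain records. A minimal sketch, assuming this layout: only the resource-list values and the {Mi: C, L, U} split come from the slides; the class design, the metric name `fp_op_rate`, and its range are made-up illustrations.

```python
from dataclasses import dataclass, field

@dataclass
class MetricRange:
    """{Mi: C, L, U} from the slide: nominal value C and bounds L, U."""
    nominal: float   # C
    lower: float     # L
    upper: float     # U

    def partially_violated(self, measured):
        # Inside [L, U]: no contract violation; outside: partial violation.
        return not (self.lower <= measured <= self.upper)

@dataclass
class Contract:
    resources: dict = field(default_factory=dict)  # key:value resource list
    metrics: dict = field(default_factory=dict)    # name -> MetricRange

# Resource-list values taken from the slide; the metric is hypothetical.
contract = Contract(
    resources={"Domain": "cs.uiuc.edu", "Machine": "opus", "Speed": 450},
    metrics={"fp_op_rate": MetricRange(nominal=100.0, lower=80.0, upper=120.0)},
)
```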

  27. Contract Evaluation
  • Fuzzy functions and rule base
  (Plots: "short" and "long" membership functions of (X,C), each ranging between 1 and 0 across [L, U]; the resulting contract output ranges from 1 (OK) to 0 (VIOLATED).)
      if ( (X,C) == short ) { contract = OK; }
      if ( (X,C) == long )  { contract = VIOLATED; }
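The fuzzy evaluation above can be sketched in a few lines. This is a sketch under assumptions: the slide only shows "short"/"long" functions crossing between L and U and the two rules, so the piecewise-linear membership shape and the weighted-average defuzzification below are illustrative choices, with x standing for the (X,C) deviation of measurement X from nominal value C.

```python
def mu_short(x, L, U):
    """Membership in 'short' (deviation is small): 1 at or below L,
    falling linearly to 0 at U."""
    if x <= L:
        return 1.0
    if x >= U:
        return 0.0
    return (U - x) / (U - L)

def mu_long(x, L, U):
    """Membership in 'long': complement of 'short'."""
    return 1.0 - mu_short(x, L, U)

def contract_output(x, L, U):
    """Defuzzified contract value in [0, 1]: 1 = OK, 0 = VIOLATED.
    Weighted average of the rule outputs (OK -> 1, VIOLATED -> 0)."""
    return mu_short(x, L, U) * 1.0 + mu_long(x, L, U) * 0.0
```

Between L and U the output degrades gradually, which is the point of using fuzzy rules instead of a hard threshold.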

  28. Contract Monitor in GrADS
  (Diagram: application processes on the Grid hosts opus10, mystere, torc2, and cmajor, each with a metric sensor; on the local host, an Autopilot manager connects the sensors to the contract monitor, which evaluates the contract, makes decisions, feeds an adapter, and updates a GUI.)

  29. GrADS Example (SC'02/03 demos)
  • Cactus code – Wavetoy (from SC'02 demos)
  • Executions across three Grid sites
    • UIUC, UC-San Diego, UT-Knoxville
  • Performance data captured by sensors
    • Computation data (via PAPI): instruction counts, FP-op counts
    • Communication data (via the MPI profiling interface): message count, message volume
  • Performance model: "historic"
    • Previous test runs, clustering of observed data
  • SC'03 demo: adaptation via rescheduling

  30. GrADS Example (cont.)
  • Initial contract outputs (plots for UCSD and UIUC)

  31. GrADS Example (cont.)
  • Contract outputs under external load

  32. GrADS Example (cont.)
  • More detailed displays

  33. GrADS - Summary
  • Personal contributions
    • Contract definition and development
    • Computation & communication data capture
    • GUI specification, design, testing
    • Tests with various applications
  • Publications
    • Grid'2001 paper (with Vraalsen, Reed, Aydt)
    • Chapter 26 (with Reed, Lu)

  34. Monitoring Large Systems
  • Motivation
    • Systems have an increasingly large number of components that must be monitored
    • Many algorithms depend on "global" state
    • Comprehensive data collection may be prohibitive in many cases
  • Problems
    • How to monitor a large system efficiently
    • How to derive reliable "global" information quickly enough to be useful

  35. Statistical Sampling
  • Approach
    • Select a statistically valid subset of the population/system
    • Analyze this subset in detail
    • Estimate properties of the entire system based on analysis of the subset
  • Simple random sampling
    • Population: N elements {Y1, Y2, …, YN}
    • Random sample: k components {x1, x2, …, xk}
    • Each element Yi must have the same probability of being chosen
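Simple random sampling as defined above maps directly onto `random.sample`, which draws without replacement and gives every element the same probability of being chosen. A minimal sketch; the `cnXYZ`-style node names are hypothetical, merely sized like the 480-node NCSA cluster discussed later:

```python
import random

def draw_sample(population, k, seed=None):
    """Draw a simple random sample of k elements: sampling without
    replacement, each element equally likely to be chosen."""
    return random.Random(seed).sample(population, k)

# Hypothetical node names, sized like the 480-node cluster used later on.
nodes = [f"cn{i:03d}" for i in range(480)]
sample = draw_sample(nodes, 87, seed=1)
```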

  36. Statistical Sampling Factors
  • Key sampling factors
    • Sample size: represents the cost of the analysis
    • Sampling accuracy (precision): interval where estimates are expected to fall
    • Sampling confidence: indicates how often that precision is achieved
  • Tradeoff
    • Lower sampling cost requires smaller samples
    • Improved accuracy and confidence require larger samples

  37. Estimation of Mean Values
  • Problem definition
    • Given the population values {Y1, Y2, …, YN} with variance S², determine the mean value MY
    • Requirements: accuracy = d, confidence = α
  • Solution
    • Get a sample {x1, x2, …, xk}, compute its mean mx
    • Impose: Pr(|MY − mx| > d) ≤ 1 − α
    • Achieved if k ≥ (S·z/d)², where z is the normal quantile for confidence α

  38. Estimation of Mean Values
  • Analysis
    • For a sufficiently large N, kmin = (S·z/d)²
    • For N increasing and S constant, the relative sampling cost k/N decreases
  • Complication
    • S is the (unknown) standard deviation of the population values
    • Some estimate of S is needed
    • Alternatives: previous experiments, two-phase scheme
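The kmin = (S·z/d)² rule can be computed with the standard library. A minimal sketch, assuming z is the two-sided standard-normal quantile for confidence level α (consistent with the slide, which leaves z undefined):

```python
from statistics import NormalDist
import math

def kmin_mean(S, d, alpha):
    """Minimum sample size k_min = (S * z / d)^2 for estimating a mean
    with accuracy d and confidence alpha."""
    z = NormalDist().inv_cdf(0.5 + alpha / 2)  # two-sided normal quantile
    return math.ceil((S * z / d) ** 2)

# k_min does not depend on N, so for a growing population the relative
# sampling cost k_min / N shrinks, as the slide notes.
print(kmin_mean(1.0, 0.1, 0.90))
```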

  39. Estimation of Proportions
  • Problem definition
    • Given the population values {Y1, Y2, …, YN}, determine the proportion P of elements that have a given property
    • Requirements: accuracy = d, confidence = α
  • Solution
    • Get a sample {x1, x2, …, xk}, compute the fraction p
    • Impose: Pr(|P − p| > d) ≤ 1 − α
    • Achieved if k ≥ P(1 − P)·(z/d)²

  40. Estimation of Proportions
  • Analysis
    • For a sufficiently large N, kmin = P(1 − P)·(z/d)²
  • Complication
    • P depends on the population values
  • But…
    • Because 0 ≤ P ≤ 1, P(1 − P) reaches a maximum value of 0.25 when P = 0.5
    • Worst case (largest kmin): P = 0.5
    • Best cases (smallest kmin): P ≈ 0 or P ≈ 1

  41. System Monitoring via Sampling
  • Goal
    • Derive global system properties based on samples
    • Major metric: proportion of available resources
  • Machines considered
    • Linux clusters
    • Large distributed-memory machines
    • Shared-memory arrays
  • Problem
    • Access to the systems, or to the data from their logs

  42. Linux Cluster Monitoring
  • System
    • Linux IA-32 cluster at NCSA (platinum)
    • 480 compute nodes under a PBS job queue
  • Target metric
    • Fraction of nodes with status "Available"
  • Experimental setting
    • Periodic collection of cluster-monitor data
    • Data collection every 3 minutes, during 10 days in February 2002 (cluster not yet in full production mode)
    • Captured data processed offline

  43. NCSA-Cluster Public Monitor
  (Screenshot: monitor page updated every minute, showing compute nodes cnXYZ and the available nodes)

  44. NCSA-Cluster Estimation
  • Observed vs. estimated availability (plots)
  • N = 480 nodes, d = 0.08, α = 90%: sample size = 87 (for P = 0.5)
  • Sampling results: success rate = 94%
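The quoted sample size of 87 follows from the worst-case kmin = P(1−P)·(z/d)² with P = 0.5 (note the factor is P(1−P); the earlier slide's "P(P−1)" must be a typo, since the text itself gives the 0.25 maximum at P = 0.5). A sketch under one assumption: the slides do not show a finite-population correction, but applying the standard one to the raw value of ~106 is what reproduces 87 for N = 480.

```python
from statistics import NormalDist
import math

def kmin_proportion(P, d, alpha, N=None):
    """k_min = P(1-P) * (z/d)^2; if N is given, also apply the standard
    finite-population correction (an assumption -- the slides do not show
    this step, but it reproduces the quoted sample size)."""
    z = NormalDist().inv_cdf(0.5 + alpha / 2)  # two-sided normal quantile
    k = P * (1 - P) * (z / d) ** 2
    if N is not None:
        k = k / (1 + k / N)                    # finite-population correction
    return math.ceil(k)

print(kmin_proportion(0.5, 0.08, 0.90, N=480))   # prints 87
```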

  45. NCSA-Cluster Estimation
  (Plots: observed availability and estimation error, against the 8% accuracy bound; P ≈ 0 gives the best cases!)

  46. New NCSA-Cluster Sampling
  • Room for improvement…
    • Sample sizes could be smaller when P ≈ 0
  • New sampling scheme
    • Use a variable sample size
    • At each moment, pick kmin according to the current P
    • Implementation: at each moment, use the p observed at the previous moment
    • Assumption: availability changes are smooth
  • New sampling results
    • Success rate = 90.2%
    • Average sample size: 66 nodes (14% of N)
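The variable-sample-size scheme above can be simulated end to end. A sketch under stated assumptions: the slides give only "size each sample from the previously observed p"; the clamp that keeps p(1−p) away from zero, the finite-population correction, the worst-case P = 0.5 start, and the synthetic availability data are all illustrative choices.

```python
from statistics import NormalDist
import math, random

def kmin_adaptive(p, d, alpha, N):
    """Sample size for an assumed proportion p (clamp and finite-population
    correction are assumptions; the slides give only k_min = P(1-P)(z/d)^2)."""
    z = NormalDist().inv_cdf(0.5 + alpha / 2)
    p = min(max(p, 0.02), 0.98)        # keep k_min > 0 near p = 0 or 1
    k = p * (1 - p) * (z / d) ** 2
    return math.ceil(k / (1 + k / N))  # finite-population correction

def monitor(status, steps, d=0.08, alpha=0.90, seed=0):
    """status[t] holds the 0/1 availability flag of each node at step t.
    Each step sizes its sample from the previous step's observed p."""
    rng = random.Random(seed)
    N = len(status[0])
    p_prev = 0.5                       # worst-case initial sample size
    estimates = []
    for t in range(steps):
        k = kmin_adaptive(p_prev, d, alpha, N)
        p_prev = sum(rng.sample(status[t], k)) / k
        estimates.append((k, p_prev))
    return estimates
```

With high availability the observed p quickly pulls the sample size well below the worst-case 87 nodes, which is the effect behind the 66-node average reported on the slide.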

  47. New NCSA-Cluster Estimation
  (Plots: observed availability and the new estimation error)

  48. NERSC System Estimation
  • Observed vs. estimated availability (plots)
  • System: IBM SP (seaborg), 3008 processors
  • d = 0.08, α = 90%: average sample size = 46 (1.5%)

  49. Origin2000 Array Estimation
  • Observed vs. estimated availability (plots)
  • System: NCSA's Origin 2000 array, 1464 processors
  • d = 0.08, α = 90%: fixed sample size = 229 (15.6%)

  50. Network Connectivity Assessment
  • Motivation
    • It is generally difficult to measure global state in wide-area networks
    • Current network sizes make exhaustive measurements prohibitive
  • Metrics assessed
    • Fraction of destinations reachable at a given moment
    • Mean network latency from a given point
  • Method
    • Access tests from UIUC, with ping
    • Destination considered unreachable if no response within 3 seconds
